This repository provides a recipe for audio captioning with sequence-to-sequence models, covering data preprocessing, training, evaluation and inference.
Clone this repository and install the required packages:
$ git clone https://github.com/wsntxxn/AudioCaption
$ cd AudioCaption
$ pip install -r requirements.txt
Install the repository as a package:
$ pip install -e .
Clotho and AudioCaps are currently supported. See data/README.md for details.
The training configuration is written in a YAML file and passed to the training script. Examples are in eg_configs.
We use contrastive learning for audio-text pre-training. The code is in another repo. We also provide the pre-trained audio-text retrieval model used for audio captioning training.
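The exact pre-training objective lives in the other repository and is not described here. As a general illustration of what contrastive audio-text learning means, the sketch below computes a symmetric cross-entropy loss over a similarity matrix, assuming matching audio-caption pairs sit on the diagonal. The function names and the temperature value are assumptions made for illustration, not the authors' implementation.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(sim, temperature=0.07):
    """Symmetric contrastive loss (illustrative, not the repo's code).

    sim[i][j] is the similarity between audio clip i and caption j;
    the matching pair for clip i is assumed to be caption i.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Audio-to-text: each row should put its mass on the diagonal entry.
    a2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-audio: same idea over the columns.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)
```

The loss is small when each clip is most similar to its own caption and grows when mismatched pairs score higher.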
For example, to train a Cnn14_Rnn-Transformer model on Clotho:
$ python captioning/pytorch_runners/run.py train eg_configs/clotho_v2/waveform/cnn14rnn_trm.yaml
Assume the experiment directory is $EXP_PATH. To evaluate under the configuration in eg_configs/clotho_v2/waveform/test.yaml:
$ python captioning/pytorch_runners/run.py evaluate $EXP_PATH eg_configs/clotho_v2/waveform/test.yaml
Use the trained model (checkpoint at $CKPT) to run inference on new audio files:
$ python captioning/pytorch_runners/inference_waveform.py test.wav test.json $CKPT
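To caption a whole directory of recordings, the inference command above can be repeated once per file. The sketch below only builds the command lines; it reuses the script path and the argument order (input wav, output json, checkpoint) from that command, while `build_inference_commands` and its parameters are hypothetical names introduced here.

```python
from pathlib import Path

# Script path and argument order are taken from the inference command above;
# the helper itself is a hypothetical convenience, not part of the repository.
INFERENCE_SCRIPT = "captioning/pytorch_runners/inference_waveform.py"

def build_inference_commands(audio_dir, output_dir, checkpoint):
    """Return one argv list per .wav file in audio_dir, sorted by file name."""
    commands = []
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        # Write each result to <output_dir>/<file stem>.json
        out_json = Path(output_dir) / (wav.stem + ".json")
        commands.append(
            ["python", INFERENCE_SCRIPT, str(wav), str(out_json), str(checkpoint)]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. The output directory is assumed to exist already.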
Several models can be ensembled for inference, which is especially useful in challenges. A sample configuration is provided in eg_configs/dcase2022/ensemble/config.yaml:
$ python captioning/pytorch_runners/ensemble.py evaluate eg_configs/dcase2022/ensemble/config.yaml
We release models trained on Clotho and AudioCaps for easy use. They rely on a contrastively pre-trained feature extractor, so download it first:
$ mkdir pretrained_feature_extractors
$ wget https://github.com/wsntxxn/AudioCaption/releases/download/v0.0.2/contrastive_pretrain_cnn14_bertm.pth -O pretrained_feature_extractors/contrastive_pretrain_cnn14_bertm.pth
AudioCaps:
$ wget https://github.com/wsntxxn/AudioCaption/releases/download/v0.0.2/audiocaps_cntrstv_cnn14rnn_trm.zip
$ unzip audiocaps_cntrstv_cnn14rnn_trm.zip
$ python captioning/pytorch_runners/inference_waveform.py test.wav test.json audiocaps_cntrstv_cnn14rnn_trm/swa.pth
Clotho:
$ wget https://github.com/wsntxxn/AudioCaption/releases/download/v0.0.2/clotho_cntrstv_cnn14rnn_trm.zip
$ unzip clotho_cntrstv_cnn14rnn_trm.zip
$ python captioning/pytorch_runners/inference_waveform.py test.wav test.json clotho_cntrstv_cnn14rnn_trm/swa.pth
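When scripting the steps above, the two releases differ only in the archive name. The following sketch maps a dataset label to the download URL and the checkpoint path listed in this README; the dictionary keys and the `released_model` helper are hypothetical names, and the sketch does not download anything.

```python
RELEASE_BASE = "https://github.com/wsntxxn/AudioCaption/releases/download/v0.0.2"

# Archive names as listed in this README; the keys are hypothetical labels.
RELEASED_MODELS = {
    "audiocaps": "audiocaps_cntrstv_cnn14rnn_trm",
    "clotho": "clotho_cntrstv_cnn14rnn_trm",
}

def released_model(dataset):
    """Return (zip URL, checkpoint path after unzipping) for a dataset label."""
    name = RELEASED_MODELS[dataset.lower()]
    return f"{RELEASE_BASE}/{name}.zip", f"{name}/swa.pth"
```

The second element of the returned tuple is the path to pass as the checkpoint argument of the inference script.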
If you find the models useful, please cite our technical report:
@techreport{xu2022sjtu,
author={Xu, Xuenan and Xie, Zeyu and Wu, Mengyue and Yu, Kai},
title={The SJTU System for DCASE2022 Challenge Task 6: Audio Captioning with Audio-Text Retrieval Pre-training},
institution={DCASE2022 Challenge},
year={2022}
}
The following papers are related to this repository: