Audio Captioning Recipe

This repository provides a recipe for audio captioning with sequence-to-sequence models: data preprocessing, training, evaluation and inference.

Install

Check out this repository and install the required packages:

$ git clone https://github.com/wsntxxn/AudioCaption
$ cd AudioCaption
$ pip install -r requirements.txt

Install the repository as a package:

$ pip install -e .

Data preprocessing

We now support Clotho and AudioCaps. See details in data/README.md.

Training

Configuration

The training configuration is written in a YAML file and passed to the training script. Examples are in eg_configs.
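
For a quick sanity check, you can inspect a configuration with PyYAML. This is a minimal sketch, not part of the repository; it only assumes the example file path used in the training command below, and the exact schema is defined by the repository itself:

import yaml

# Peek at an example training configuration before launching a run.
with open("eg_configs/clotho_v2/waveform/cnn14rnn_trm.yaml") as f:
    config = yaml.safe_load(f)

# Only list the top-level sections; their contents are defined by
# the repository's own config schema.
for key in config:
    print(key)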

Contrastive Audio-text Pre-training

We use contrastive learning for audio-text pre-training. The pre-training code is maintained in a separate repository. We also provide the pre-trained audio-text retrieval model used for audio captioning training (see "Using off-the-shelf models" below).

Start training

For example, to train a Cnn14_Rnn-Transformer model on Clotho:

$ python captioning/pytorch_runners/run.py train eg_configs/clotho_v2/waveform/cnn14rnn_trm.yaml

Evaluation

Assume the experiment directory is $EXP_PATH. To evaluate with the configuration in eg_configs/clotho_v2/waveform/test.yaml:

$ python captioning/pytorch_runners/run.py evaluate $EXP_PATH eg_configs/clotho_v2/waveform/test.yaml

Inference

Use the trained model (checkpoint in $CKPT) to run inference on new audio files:

$ python captioning/pytorch_runners/inference_waveform.py test.wav test.json $CKPT
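
The script writes its predictions to the given JSON file (test.json here). A minimal sketch for inspecting the output, assuming nothing about the exact schema beyond it being valid JSON:

import json

# Pretty-print the captions written by inference_waveform.py.
# The exact JSON structure is defined by the script, so we simply
# dump whatever it produced.
with open("test.json") as f:
    print(json.dumps(json.load(f), indent=2))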

Ensemble

Several models can be ensembled for inference, which is especially useful in challenges. We provide a sample configuration in eg_configs/dcase2022/ensemble/config.yaml:

$ python captioning/pytorch_runners/ensemble.py evaluate eg_configs/dcase2022/ensemble/config.yaml

Using off-the-shelf models

We release models trained on Clotho and AudioCaps for easy use. They rely on the contrastive pre-trained feature extractor, so download it first:

$ mkdir pretrained_feature_extractors
$ wget https://github.com/wsntxxn/AudioCaption/releases/download/v0.0.2/contrastive_pretrain_cnn14_bertm.pth -O pretrained_feature_extractors/contrastive_pretrain_cnn14_bertm.pth

AudioCaps:

$ wget https://github.com/wsntxxn/AudioCaption/releases/download/v0.0.2/audiocaps_cntrstv_cnn14rnn_trm.zip
$ unzip audiocaps_cntrstv_cnn14rnn_trm.zip
$ python captioning/pytorch_runners/inference_waveform.py test.wav test.json audiocaps_cntrstv_cnn14rnn_trm/swa.pth

Clotho:

$ wget https://github.com/wsntxxn/AudioCaption/releases/download/v0.0.2/clotho_cntrstv_cnn14rnn_trm.zip
$ unzip clotho_cntrstv_cnn14rnn_trm.zip
$ python captioning/pytorch_runners/inference_waveform.py test.wav test.json clotho_cntrstv_cnn14rnn_trm/swa.pth
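
To caption a whole directory of audio files, you can wrap the inference script in a small loop. This is a sketch, not part of the repository; audio_dir is a placeholder for your own folder of .wav files, and the checkpoint path follows the Clotho example above:

import subprocess
from pathlib import Path

# Caption every .wav file in a directory by calling the inference
# script once per file. "audio_dir" is a placeholder directory.
CKPT = "clotho_cntrstv_cnn14rnn_trm/swa.pth"

for wav in sorted(Path("audio_dir").glob("*.wav")):
    out_json = wav.with_suffix(".json")
    subprocess.run(
        [
            "python",
            "captioning/pytorch_runners/inference_waveform.py",
            str(wav),
            str(out_json),
            CKPT,
        ],
        check=True,
    )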

If you find the models useful, please cite our technical report:

@techreport{xu2022sjtu,
    author={Xu, Xuenan and Xie, Zeyu and Wu, Mengyue and Yu, Kai},
    title={The SJTU System for DCASE2022 Challenge Task 6: Audio Captioning with Audio-Text Retrieval Pre-training},
    institution={DCASE2022 Challenge},
    year={2022}
}

Related Papers

The following papers are related to this repository: