Image Captioning based on Bottom-Up and Top-Down Attention model

Our overall approach centers on the Bottom-Up and Top-Down Attention model designed by Anderson et al. We used this framework as a starting point for further experimentation and, in addition to tuning various hyperparameters, implemented two additional model architectures. First, we reduced the complexity of the Bottom-Up and Top-Down model by considering only a simple LSTM architecture. Then, taking inspiration from the Transformer architecture, we implemented a non-recurrent model that does not need to keep track of an internal state across time. Our results are comparable to the authors' implementation of the Bottom-Up and Top-Down Attention model. Our code serves as a baseline for future experiments using the PyTorch framework.
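The attention step at the core of the Bottom-Up and Top-Down model of Anderson et al. can be illustrated as a softmax-weighted average of image region features. In the sketch below the relevance scores are passed in directly, whereas in the model they are produced by a learned network from the region features and the LSTM state. The function names are illustrative and are not taken from this repository.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, scores):
    """Return (weights, attended_feature) for one decoding step.

    region_features: list of equal-length feature vectors, one per image region.
    scores: one relevance score per region (assumed to be given here).
    """
    weights = softmax(scores)
    dim = len(region_features[0])
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


# Two regions with equal scores contribute equally to the attended feature.
weights, attended = attend([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With a much higher score for one region, the attended feature moves close to that region's feature vector.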

Results

Model	BLEU-1	BLEU-4	METEOR	ROUGE-L	CIDEr
Author's implementation	77.2	36.2	27.0	56.4	113.5
Our implementation	73.8	32.9	26.0	53.7	103.8

Our implementation could likely be improved with further hyperparameter tuning and with techniques such as gradient clipping and replacing tanh with ReLU. We hope our model serves as a baseline for future experiments.

We use the Adam optimizer with a learning rate of 0.0001 and a batch size of 100, and apply teacher forcing during training. Training was completed on NVIDIA P40 GPUs in approximately 8 GPU hours, which is significantly less than the authors' 18 GPU hours on Titan X GPUs. During testing, a beam width of 5 was found to be the most effective.
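The beam search used at test time can be illustrated with a minimal sketch. The `step_fn` callback and the toy vocabulary below are hypothetical stand-ins for the trained decoder and are not the repository's implementation; only the beam width of 5 comes from the settings above.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_width=5, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) is a hypothetical callback returning a dict that maps
    each candidate next token to its log-probability given the prefix.
    """
    beams = [([start_token], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in step_fn(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # Keep only the beam_width best partial captions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished captions if needed
    return max(finished, key=lambda c: c[1])[0]


def toy_step(prefix):
    """Toy next-token distribution that depends only on the last token."""
    table = {
        "<s>": {"a": math.log(0.6), "the": math.log(0.4)},
        "a": {"dog": math.log(0.5), "cat": math.log(0.5)},
        "the": {"dog": math.log(0.9), "cat": math.log(0.1)},
        "dog": {"</s>": 0.0},
        "cat": {"</s>": 0.0},
    }
    return table[prefix[-1]]


greedy = beam_search(toy_step, "<s>", "</s>", beam_width=1)
wide = beam_search(toy_step, "<s>", "</s>", beam_width=5)
```

In this toy example a beam width of 1 commits to "a" after the first step, while a width of 5 keeps "the" alive long enough to find the more probable caption "the dog". This sketch compares raw log-probabilities and applies no length normalization.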

Ablation study:

Model	BLEU-1	BLEU-4	METEOR	ROUGE-L	CIDEr
Bottom-Up Top-Down	73.8	32.9	26.0	53.7	103.8
w/o attention (Simple LSTM)	67	24.9	21.9	49.4	77.6
w/o Beam search	74.2	31.3	25.9	54.0	102.4
Teacher forcing (p=0.5)	74	31.4	25.1	53.5	100.0
Transformer-inspired	57.4	9.7	15.2	41.2	42.5
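The "Teacher forcing (p=0.5)" row can be read as mixing ground-truth and model-predicted tokens as decoder inputs during training. The following is a minimal sketch, assuming p is the probability of feeding the ground-truth token at each step; `choose_decoder_inputs` and `predict_fn` are hypothetical names and are not taken from the repository.

```python
import random


def choose_decoder_inputs(ground_truth, predict_fn, teacher_forcing_p=0.5, rng=None):
    """Pick the token fed to the decoder at every step of one caption.

    ground_truth: caption tokens, beginning with the start token.
    predict_fn(prefix): hypothetical callback returning the model's
        predicted next token given the inputs chosen so far.
    teacher_forcing_p: assumed probability of using the ground-truth token.
    """
    rng = rng or random.Random()
    inputs = [ground_truth[0]]  # the start token is always given
    for t in range(1, len(ground_truth)):
        predicted = predict_fn(inputs)
        if rng.random() < teacher_forcing_p:
            inputs.append(ground_truth[t])  # teacher forcing
        else:
            inputs.append(predicted)  # feed the model's own output
    return inputs


caption = ["<s>", "a", "dog", "runs"]
mixed = choose_decoder_inputs(caption, lambda prefix: "<unk>", 0.5, random.Random(0))
```

With `teacher_forcing_p=1.0` the decoder always sees the ground truth, which corresponds to plain teacher forcing; with `0.0` it only ever sees its own predictions.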

Getting Started

Machine configuration used for testing: NVIDIA P40 GPU with 24 GB of memory (a machine with less memory should also work).

We use the Karpathy splits as described in "Deep visual-semantic alignments for generating image descriptions". The Bottom-Up image features are used directly from here. Please refer to this repo for clarifications. The annotations are downloaded from the COCO website (2014 train/val annotations). All the models have been trained from scratch.

The code takes around 8 hours to train on the Karpathy train split.

Prerequisites

The following software is needed to run the code.

Software used:

PyTorch 0.4.1
Python 2.7

Dependencies: Create a conda environment using the captioning_env.yml file. Use: conda env create -f captioning_env.yml

If you are not using conda as a package manager, refer to the yml file and install the libraries manually.

Running the code

Data

We use image features generated by a ResNet-101 Faster R-CNN network directly. These are available here as mentioned previously. Convert the tsv file to hdf5 files based on [this](https://github.com/hengyuan-hu/bottom-up-attention-vqa/blob/master/tools/detection_features_converter.py) code. Now generate the Karpathy splits using [this](https://github.com/njchoma/transformer_image_caption/blob/master/src/data_helpers/create_ks_splits_h5.py) code.
For the annotations, download from the COCO website with the link mentioned earlier and run [this](https://github.com/njchoma/transformer_image_caption/blob/master/src/data_helpers/create_ks_splits1.py) script to generate the Karpathy splits.
The [data_loader_ks.py](https://github.com/njchoma/transformer_image_caption/blob/master/src/data_helpers/data_loader_ks.py) script handles the data loading.
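To give a sense of what the tsv conversion step involves, here is a minimal sketch of decoding one row of the feature file. It assumes the layout used by the released Bottom-Up features (tab-separated fields `image_id`, `image_w`, `image_h`, `num_boxes`, `boxes`, `features`, with the last two stored as base64-encoded float32 arrays). The function names are hypothetical, and the sketch uses only the standard library rather than the numpy and h5py approach of the linked converter.

```python
import base64
import csv
import struct

# Assumed column order of the feature tsv file.
FIELDNAMES = ["image_id", "image_w", "image_h", "num_boxes", "boxes", "features"]

# Feature rows are very long, so raise the csv module's field size limit.
csv.field_size_limit(2**31 - 1)


def decode_floats(b64_text):
    """Decode a base64 string into a flat list of little-endian float32 values."""
    raw = base64.b64decode(b64_text)
    return list(struct.unpack("<%df" % (len(raw) // 4), raw))


def read_features(tsv_file, feature_dim):
    """Yield (image_id, features) pairs from an open tsv file object.

    features is a list of num_boxes vectors, each of length feature_dim.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", fieldnames=FIELDNAMES)
    for row in reader:
        num_boxes = int(row["num_boxes"])
        flat = decode_floats(row["features"])
        if len(flat) != num_boxes * feature_dim:
            raise ValueError("unexpected feature size for image %s" % row["image_id"])
        features = [
            flat[i * feature_dim:(i + 1) * feature_dim] for i in range(num_boxes)
        ]
        yield int(row["image_id"]), features
```

In a real conversion each decoded array would be written into an hdf5 dataset instead of being kept as Python lists.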

Training

Edit the main.sh file by changing the path variables and activating the appropriate conda environment, then run the script with the appropriate arguments. The arguments are listed in the src/utils_experiment.py file.
After the model has been trained, run src/evaluate_test.py.

Evaluation

To calculate metrics such as BLEU, run [this](https://github.com/njchoma/transformer_image_caption/blob/master/src/eval/coco_caption/eval.py) script with the appropriate test annotations (pkl file) and the json file generated by the model.
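BLEU-n scores are built from clipped n-gram precisions between a generated caption and its reference captions. The sketch below computes only that precision for a single caption. It omits the brevity penalty and the geometric mean over n-gram orders that full BLEU applies, and it is an illustration rather than the coco-caption implementation. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against its references."""
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return float(clipped) / total


candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
precision = modified_precision(candidate, [reference], 1)
```

Clipping is what keeps the degenerate candidate above from scoring perfectly: "the" appears only twice in the reference, so only two of its seven occurrences are credited.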

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributors

Nicholas Choma (New York University)
Omkar Damle (New York University)

Note: Equal contribution from both contributors.

This code was produced as a part of my course project at New York University. I would like to thank Prof. Fergus for his guidance and providing access to the GPUs.

References:

Code for metrics evaluation was borrowed from https://github.com/tylin/coco-caption
Code for converting the image features from tsv to hdf5 is based on https://github.com/hengyuan-hu/bottom-up-attention-vqa
