Disclaimer: I am not working on this anymore. I will be happy to answer questions and review & merge PRs though.

Image Captioning with Spatial Attention in Keras

This is a Keras & TensorFlow implementation of an image captioning model. In particular, it uses the attention models described in this paper, which are depicted below:

[Figure: (a) the spatial attention model; (b) the adaptive attention model with the visual sentinel]

where V are the K local features from the last convolutional layer of a ConvNet (e.g. ResNet-50), and x_t is the input at time t, composed of the embedding of the previous word and the average image feature. h_t is the hidden state of the LSTM at time t, which is used to compute the attention weights applied to V in order to obtain the context vector c_t. c_t and h_t are then combined to predict the current word y_t. In (b), an additional gate is incorporated into the LSTM to produce the extra output s_t, which is combined with V to compute the attention weights; s_t serves as an alternative feature to attend to instead of the image features in V.
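The spatial attention step in (a) can be written out with a minimal NumPy sketch. This is only an illustration of the computation described above, not code from this repository; the sizes and the projections W_v, W_g and w_h are assumptions.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative sizes: K image locations, feature size d, LSTM size n, attention size k
K, d, n, k = 49, 2048, 512, 512
V = np.random.randn(K, d)      # V: the K local features from the last conv layer
h_t = np.random.randn(n)       # h_t: LSTM hidden state at time t

# Hypothetical learned projections (random here, learned in the real model)
W_v = np.random.randn(d, k)
W_g = np.random.randn(n, k)
w_h = np.random.randn(k)

# Attention weights over the K locations and the resulting context vector c_t
z_t = np.dot(np.tanh(np.dot(V, W_v) + np.dot(h_t, W_g)), w_h)   # shape (K,)
alpha_t = softmax(z_t)
c_t = np.dot(alpha_t, V)       # c_t is combined with h_t to predict the word y_t

Roughly speaking, the adaptive variant in (b) scores the sentinel s_t in the same way and appends it as one extra candidate before the softmax, so the model can choose to attend to s_t instead of the image.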

Installation

  • Clone this repository
# Make sure to clone with --recursive
git clone --recursive https://github.com/amaiasalvador/sat_keras.git
  • Install Python 2.7.
  • Install TensorFlow 0.12.
  • pip install -r requirements.txt
  • (Optional) Install this Keras PR with support for layer-wise learning rate multipliers:
git clone https://github.com/amaiasalvador/keras.git
cd keras
git checkout lr_mult
python setup.py install

This option is disabled by default, so you can use "regular" Keras 1.2.2 if you don't want to set a different learning rate for the base model.

  • Set TensorFlow as the Keras backend in ~/.keras/keras.json:
{
    "image_dim_ordering": "tf", 
    "epsilon": 1e-07, 
    "floatx": "float32", 
    "backend": "tensorflow"
}
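To double-check that Keras actually picks up this configuration, an optional one-liner (not part of the original instructions) is:

python -c "from keras import backend as K; print(K.backend() + ' ' + K.image_dim_ordering())"

which should print tensorflow tf.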

Data & Pretrained model

  • Download the MS COCO images and caption annotations and arrange them with the following structure:
$coco/                                    # dataset dir
$coco/annotations/                        # annotations directory
$coco/annotations/captions_train2014.json # caption anns for training set
$coco/annotations/captions_val2014.json   # ...
$coco/images/                             # image dir
$coco/images/train2014                    # train image dir
$coco/images/val2014                      # ...
  • Navigate to imcap/utils and run:
python prepro_coco.py --output_json path_to_json --output_h5 path_to_h5 --images_root path_to_coco_images
This will create the vocabulary and an HDF5 file with the preprocessed data (a quick way to inspect the output is sketched after this list).
  • [Coming soon] Download pretrained model here.
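As an optional sanity check on the preprocessing output, a short snippet like the one below lists what was written; the exact dataset and key names are defined by prepro_coco.py and are not guaranteed here.

import h5py, json

h5 = h5py.File('path_to_h5', 'r')          # the file passed as --output_h5
print(list(h5.keys()))                      # datasets written by prepro_coco.py
info = json.load(open('path_to_json'))      # the file passed as --output_json (vocabulary etc.)
print(sorted(info.keys()) if isinstance(info, dict) else len(info))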

Usage

Unless stated otherwise, run all commands from ./imcap:

Demo

Run sample_captions.ipynb to test the trained network on some images and visualize attention maps.
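If you want to overlay an attention map on an image outside the notebook, a generic matplotlib sketch with random stand-in data (the 7x7 attention grid and 224x224 image size are assumptions) looks like this:

import numpy as np
import matplotlib.pyplot as plt

# Stand-ins for what the model would give you: an image and 7x7 = 49 attention weights
img = np.random.rand(224, 224, 3)
alpha = np.random.rand(49)
alpha /= alpha.sum()

# Nearest-neighbour upsampling of the 7x7 map to the image size
amap = np.kron(alpha.reshape(7, 7), np.ones((32, 32)))

plt.imshow(img)
plt.imshow(amap, cmap='jet', alpha=0.5)    # semi-transparent attention overlay
plt.axis('off')
plt.show()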

Training

Run python train.py. Run python args.py --help for a list of the available arguments.

Testing

  • Run python test.py to forward all validation images through a trained network and create a JSON file with the results (its expected format is sketched after this list). Use the --cnntrain flag if evaluating a model with a fine-tuned ConvNet.
  • Navigate to ./imcap/coco_caption/.
  • From there run:
    python eval_caps.py -results_file results.json -ann_file gt_file.json
    
    to get METEOR, BLEU, ROUGE_L & CIDEr scores for the generated captions in the previous JSON file.
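For reference, the COCO caption evaluation code expects the results file to be a list of image_id/caption entries; a quick way to inspect what test.py wrote (file name as in the command above) is:

import json

res = json.load(open('results.json'))
print(len(res))
print(res[0])   # typically of the form {"image_id": <int>, "caption": "<generated sentence>"}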

Note on the train/val/test splits

For the sake of comparison, the data processing script follows the same splits used in NeuralTalk2 and AdaptiveAttention.

References

  • Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning. Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher. CVPR 2017.

Contact

For questions and suggestions either use the issues section or send an e-mail to amaia.salvador@upc.edu.
