
Image Captioning in Keras

(Note: You can read an in-depth tutorial about the implementation in this blog post.)

This is an implementation of an image captioning model based on Vinyals et al., with a few differences:

  • For the CNN we use Inception v3 instead of Inception v1.

  • For the RNN we use a multi-layered LSTM instead of a single-layered one.

  • We don't use a special start-of-sentence word, so we feed the first word at t = 1 instead of t = 2.

  • We use different values for some hyperparameters (see the architecture sketch after this list):

    Hyperparameter     Value
    Learning rate      0.00051
    Batch size         32
    Epochs             33
    Dropout rate       0.22
    Embedding size     300
    LSTM output size   300
    LSTM layers        3
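
The sketch below shows one way such an encoder-decoder could be wired in Keras. It is a hypothetical illustration, not the repository's actual code: VOCAB_SIZE and MAX_LEN are placeholder values, and only the embedding size, LSTM size, layer count, dropout rate, and learning rate are taken from the table above.

# Hypothetical sketch (not this repository's code) of an Inception v3 +
# multi-layer LSTM captioning model using the hyperparameters listed above.
from keras.applications.inception_v3 import InceptionV3
from keras.layers import (Dense, Dropout, Embedding, Input, LSTM,
                          RepeatVector, TimeDistributed, concatenate)
from keras.models import Model
from keras.optimizers import Adam

VOCAB_SIZE = 10000   # placeholder vocabulary size
MAX_LEN = 20         # placeholder maximum caption length
EMBED_SIZE = 300     # embedding size from the table above
LSTM_SIZE = 300      # LSTM output size from the table above

# Image encoder: Inception v3 average-pooled features, projected down to
# the word-embedding size and repeated once per timestep.
cnn = InceptionV3(weights='imagenet', include_top=False, pooling='avg')
image_input = Input(shape=(299, 299, 3))
image_features = Dense(EMBED_SIZE)(cnn(image_input))
image_seq = RepeatVector(MAX_LEN)(image_features)

# Caption decoder: word embeddings concatenated with the image features,
# then a 3-layer LSTM stack with dropout.
caption_input = Input(shape=(MAX_LEN,))
x = Embedding(VOCAB_SIZE, EMBED_SIZE)(caption_input)
x = concatenate([image_seq, x])
for _ in range(3):
    x = LSTM(LSTM_SIZE, return_sequences=True)(x)
    x = Dropout(0.22)(x)
next_word = TimeDistributed(Dense(VOCAB_SIZE, activation='softmax'))(x)

model = Model(inputs=[image_input, caption_input], outputs=next_word)
model.compile(optimizer=Adam(lr=0.00051), loss='categorical_crossentropy')

Feeding the image features at every timestep is only one of several possible wirings; Vinyals et al. feed the image to the LSTM only once, at the first step.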

Examples of Captions Generated by the Proposed Model

[Image: result examples without errors]

Evaluation Metrics

Quantitatively, the proposed model's performance is on par with Vinyals' model on the Flickr8k dataset:

Metric   Proposed Model   Vinyals' Model
BLEU-1   61.8             63
BLEU-2   40.8             41
BLEU-3   27.8             27
BLEU-4   19.0             N/A
METEOR   21.5             N/A
CIDEr    41.5             N/A
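
As a point of reference for the BLEU rows above, BLEU-n scores can be computed from tokenized captions as shown below. This is a minimal, hypothetical illustration using NLTK with toy captions; the repository itself evaluates with pycocoevalcap.

# Toy BLEU-1..4 illustration with NLTK (the repository uses pycocoevalcap).
from nltk.translate.bleu_score import corpus_bleu

# One hypothesis caption with two reference captions (toy data).
references = [[['a', 'dog', 'runs', 'on', 'the', 'grass'],
               ['a', 'dog', 'is', 'running', 'outside']]]
hypotheses = [['a', 'dog', 'runs', 'outside']]

for n in range(1, 5):
    # BLEU-n uses uniform weights over 1-grams through n-grams.
    weights = tuple([1.0 / n] * n)
    score = corpus_bleu(references, hypotheses, weights=weights)
    print('BLEU-%d: %.3f' % (n, score))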

Environment Setup

  1. Download the required dataset.

    ./scripts/download_dataset.sh
  2. Download pretrained word vectors.

    ./scripts/download_pretrained_word_vectors.sh
  3. Download pycocoevalcap data.

    ./scripts/download_pycocoevalcap_data.sh
  4. Install the dependencies.

    Note: This project has only been tested on Python 2.7. It may need minor code changes to work on Python 3.

    # Optional: Create and activate your virtualenv / Conda environment
    
    pip install -r requirements.txt
  5. Set up PYTHONPATH.

    source ./scripts/setup_pythonpath.sh

Run Training

To reproduce the final model, execute:

python -m keras_image_captioning.training \
  --training-label repro-final-model \
  --from-training-dir results/flickr8k/final-model

Many other arguments are available; see training.py for the full list.

Run Inference and Evaluate It

python -m keras_image_captioning.inference \
  --dataset-type test \
  --method beam_search \
  --beam-size 3 \
  --training-dir var/flickr8k/training-results/repro-final-model

Note:

  • --dataset-type can be either 'validation' or 'test'.
  • You can view the generated captions at var/flickr8k/training-results/repro-final-model/test-predictions-3-20.yaml and compare them with my results at results/flickr8k/final-model/test-predictions-3-20.yaml.
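
For intuition about the beam_search method used above, the following is a minimal, illustrative beam search decoder. It is a sketch rather than the repository's implementation; log_prob_next and end_id are hypothetical stand-ins.

# Illustrative beam search (a sketch, not this repository's implementation).
# `log_prob_next(seq)` is a hypothetical callable returning a dict of
# {word_id: log_probability} for the next word given a partial caption;
# `end_id` is a hypothetical end-of-sentence token id.
import heapq

def beam_search(log_prob_next, end_id, beam_size=3, max_len=20):
    beams = [(0.0, [])]  # (cumulative log-probability, partial caption)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for word_id, logp in log_prob_next(seq).items():
                candidates.append((score + logp, seq + [word_id]))
        # Keep only the `beam_size` highest-scoring partial captions.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        # Move captions that just emitted the end token out of the beam.
        completed += [b for b in beams if b[1][-1] == end_id]
        beams = [b for b in beams if b[1][-1] != end_id]
        if not beams:
            break
    # Return the highest-scoring caption found.
    return max(completed + beams, key=lambda c: c[0])[1]

The 3-20 suffix in the prediction filenames above presumably encodes the beam size and the maximum caption length.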

License

MIT License. See LICENSE file for details.
