
Image Captioning

Introduction

Generate captions from images using a deep learning model: given an image, the model describes in English what the image shows. To achieve this, the model consists of an encoder, which is a CNN, and a decoder, which is an RNN. The CNN encoder, pre-trained on an image classification task, maps each image to a feature vector, which is fed into the RNN decoder to produce an English sentence.
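The following is a minimal sketch of such an encoder-decoder pair in PyTorch. The class names, layer sizes, and the choice to keep the backbone frozen are illustrative and not a verbatim copy of the code in this repository:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class EncoderCNN(nn.Module):
        """Pre-trained CNN that maps an image to a fixed-size feature vector."""
        def __init__(self, embed_size):
            super().__init__()
            backbone = models.resnext101_32x8d(pretrained=True)
            self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
            self.embed = nn.Linear(backbone.fc.in_features, embed_size)

        def forward(self, images):
            with torch.no_grad():              # keep the pre-trained backbone frozen
                features = self.backbone(images).flatten(1)
            return self.embed(features)

    class DecoderRNN(nn.Module):
        """RNN that predicts the next word from the image embedding and the previous words."""
        def __init__(self, embed_size, hidden_size, vocab_size, num_layers=2):
            super().__init__()
            self.word_embed = nn.Embedding(vocab_size, embed_size)
            self.rnn = nn.GRU(embed_size, hidden_size, num_layers, batch_first=True)
            self.fc = nn.Linear(hidden_size, vocab_size)

        def forward(self, features, captions):
            # Prepend the image embedding as the first input of the sequence.
            inputs = torch.cat([features.unsqueeze(1), self.word_embed(captions)], dim=1)
            outputs, _ = self.rnn(inputs)
            return self.fc(outputs)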

The model and the tuning of its hyperparameters are based on ideas presented in the papers Show and Tell: A Neural Image Caption Generator and Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.

We use the Microsoft Common Objects in COntext (MS COCO) dataset for this project. It is a large-scale dataset for scene understanding. The dataset is commonly used to train and benchmark object detection, segmentation, and captioning algorithms. For instructions on downloading the data, see the Data section below.

Changes from the original project by Trang Nguyen

This project is forked from a git repository created by Trang Nguyen. The following points have been changed:

  • The encoder used is a pre-trained instance of ResNeXt101_32x8d.
  • The decoder used is a two-layer GRU RNN instead of a single-layer LSTM RNN.
  • Training is done in training.py. Instead of sampling the training captions into batches of equal-length captions, they are padded and packed with torch.nn.utils.rnn.pack_padded_sequence (see the sketch after this list). This improves training speed on my machine and ensures that every training sample is used in each epoch.
  • Evaluation of the model is done with the 'official' MS COCO evaluation code, and the CIDEr score is used to decide whether the model has improved.
  • Besides MS COCO, I used free captioned images from pexels.com for training.
  • I included a simple REST service, rest_service.py, which can be used to call the model from a web frontend. See this online demo on my home page.
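As referenced in the training bullet above, the captions in a batch are padded to a common length and then packed, so that padded positions do not contribute to the loss. A rough sketch of such a training step follows; the function and variable names are illustrative, not taken verbatim from training.py:

    import torch.nn as nn
    from torch.nn.utils.rnn import pack_padded_sequence

    criterion = nn.CrossEntropyLoss()

    def train_step(encoder, decoder, optimizer, images, captions, lengths):
        # captions: (B, T) padded word indices; lengths: true caption lengths,
        # sorted in descending order (or pass enforce_sorted=False below).
        features = encoder(images)                     # (B, embed_size)
        outputs = decoder(features, captions[:, :-1])  # (B, T, vocab_size)

        # Packing drops the padded positions, so only real tokens enter the loss.
        targets = pack_padded_sequence(captions, lengths, batch_first=True).data
        outputs = pack_padded_sequence(outputs, lengths, batch_first=True).data

        loss = criterion(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()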

Code

TBD

Setup

  1. Install pycocoevalcap and pycocotools (a usage sketch follows this section) by running:

    pip install git+https://github.com/salaniz/pycocoevalcap
    
  2. Install PyTorch and torchvision:

    pip install torch torchvision
    
  3. Other dependencies:

  • Python 3
  • nltk
  • numpy
  • scikit-image
  • matplotlib
  • tqdm
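For reference, here is roughly how the pycocoevalcap package installed in step 1 can be used to compute the CIDEr score (among other metrics) for a file of generated captions; the result-file path is only an example:

    from pycocotools.coco import COCO
    from pycocoevalcap.eval import COCOEvalCap

    # Ground-truth annotations and a JSON file of generated captions in the
    # standard COCO result format: [{"image_id": 42, "caption": "a dog ..."}, ...]
    coco = COCO('coco/annotations/captions_val2014.json')
    coco_res = coco.loadRes('results/captions_val2014_results.json')

    coco_eval = COCOEvalCap(coco, coco_res)
    coco_eval.params['image_id'] = coco_res.getImgIds()  # score only images with results
    coco_eval.evaluate()

    print('CIDEr:', coco_eval.eval['CIDEr'])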

Data

Download the following data from the COCO website and place it, as described below, in a coco subdirectory inside this project's directory (the resulting layout is shown after the list):

  • under Annotations, download:
    • 2014 Train/Val annotations [241MB] (extract captions_train2014.json, captions_val2014.json, instances_train2014.json and instances_val2014.json, and place them in the subdirectory coco/annotations/)
    • 2014 Testing Image info [1MB] (extract image_info_test2014.json and place it in the subdirectory coco/annotations/)
  • under Images, download:
    • 2014 Train images [83K/13GB] (extract the train2014 folder and place it in the subdirectory coco/images/)
    • 2014 Val images [41K/6GB] (extract the val2014 folder and place it in the subdirectory coco/images/)
    • 2014 Test images [41K/6GB] (extract the test2014 folder and place it in the subdirectory coco/images/)
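After extracting everything, the coco directory should look roughly like this:

    coco/
    ├── annotations/
    │   ├── captions_train2014.json
    │   ├── captions_val2014.json
    │   ├── instances_train2014.json
    │   ├── instances_val2014.json
    │   └── image_info_test2014.json
    └── images/
        ├── train2014/
        ├── val2014/
        └── test2014/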

Run

To train the model, run:

python training.py

To run one of the Jupyter notebooks, use:

jupyter notebook <notebook_name.ipynb>
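For caption generation at inference time, a greedy decoding loop along the following lines can be used. The vocab object, the '<end>' token, and the attribute names refer to the decoder sketched in the introduction; they are illustrative and not the exact code in this repository's notebooks or REST service:

    import torch

    def generate_caption(encoder, decoder, image_tensor, vocab, max_len=20):
        """Greedily decode a caption for one preprocessed image tensor (3, 224, 224)."""
        encoder.eval()
        decoder.eval()
        with torch.no_grad():
            features = encoder(image_tensor.unsqueeze(0))    # (1, embed_size)
            inputs = features.unsqueeze(1)                   # (1, 1, embed_size)
            hidden = None
            words = []
            for _ in range(max_len):
                output, hidden = decoder.rnn(inputs, hidden) # one GRU step
                logits = decoder.fc(output.squeeze(1))       # (1, vocab_size)
                predicted = logits.argmax(dim=1)             # greedy word choice
                word = vocab.idx2word[predicted.item()]      # hypothetical vocabulary object
                if word == '<end>':
                    break
                words.append(word)
                inputs = decoder.word_embed(predicted).unsqueeze(1)
        return ' '.join(words)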
