svaisakh/captioner

Show and Tell: A Neural Image Caption Generator


An implementation inspired by the paper by Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan.

Results

The Idea

Model

We approach the problem in two stages:

  1. Vision:

    A pre-trained CNN is used to extract the image features.

    In our case, we take a ResNet trained on ImageNet classification and detach its head (the final fully-connected layer).

    The penultimate layer gives us the features.

  2. Language:

    A pre-trained word embedding is used to process and tokenize the captions.

    In our case, we use the 'en_core_web_lg' model from spaCy.

    The embedded captions are then fed, with teacher forcing, to an RNN that predicts the next word at each timestep.

Combining the two

The extracted features are treated as the initial hidden state of the RNN. In order to match the dimensionality, they are first sent through a Linear layer and reshaped.

On the basis of this conditioning, the model generates its hidden states, which are then sent through a Linear layer with output dimension vocab_size.

Thus, at each timestep, we have a score for each possible word.

We treat this like a classification problem and use the categorical cross-entropy loss to match it to the desired label at each timestep.
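The combination described above can be sketched as follows. The sizes here are illustrative assumptions (2048-d features, a GRU, a 10,000-word vocabulary), and nn.Embedding stands in for the pre-trained spaCy vectors:

```python
import torch
from torch import nn
import torch.nn.functional as F

feature_dim, hidden_dim, embed_dim, vocab_size = 2048, 512, 300, 10000

project = nn.Linear(feature_dim, hidden_dim)  # match feature -> hidden dims
embed = nn.Embedding(vocab_size, embed_dim)   # stand-in for spaCy vectors
rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)  # a score per word per timestep

features = torch.randn(4, feature_dim)            # batch of image features
captions = torch.randint(0, vocab_size, (4, 12))  # tokenized captions

# Image features become the initial hidden state of the RNN.
h0 = project(features).unsqueeze(0)               # (1, batch, hidden)
# Teacher forcing: feed the ground-truth caption, shifted by one.
out, _ = rnn(embed(captions[:, :-1]), h0)         # (batch, T-1, hidden)
logits = to_vocab(out)                            # (batch, T-1, vocab_size)

# Categorical cross-entropy against the next word at each timestep.
loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                       captions[:, 1:].reshape(-1))
```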

Sampling

We use beam search to sample the most likely caption at evaluation time.
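A minimal, framework-agnostic beam search sketch: at each timestep it keeps the beam_width highest log-probability partial captions. The `step` function here is a hypothetical stand-in for one decoder step; it returns next-word logits and an updated decoder state.

```python
import torch
import torch.nn.functional as F

def beam_search(step, h0, start_token, end_token, beam_width=3, max_len=20):
    """Return the most likely token sequence under a simple beam search."""
    beams = [([start_token], 0.0, h0)]       # (tokens, log-prob, state)
    completed = []
    for _ in range(max_len):
        candidates = []
        for tokens, score, h in beams:
            if tokens[-1] == end_token:      # this beam is finished
                completed.append((tokens, score))
                continue
            logits, h_new = step(tokens[-1], h)
            log_probs = F.log_softmax(logits, dim=-1)
            top_lp, top_ix = log_probs.topk(beam_width)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((tokens + [ix], score + lp, h_new))
        if not candidates:                   # every beam has finished
            break
        # Keep only the beam_width best partial captions.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    completed += [(t, s) for t, s, _ in beams]
    return max(completed, key=lambda c: c[1])[0]
```

With a real model, `step` would embed the token, run one RNN step from state `h`, and apply the vocabulary Linear layer.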

Using this repo

There are three ways you might want to use this project.

  1. Learn: To follow along with the Jupyter Notebooks, go to the notebooks folder.
  2. Apply: To execute on the command line, go to the captioner folder.
  3. Serve: To serve the model as a Flask microservice, go to the server folder.

Prerequisites

  1. Follow the general steps in this tutorial to set up the environment and the startup file.

    Make sure you have MagNet installed.

  2. Install pycocotools in the conda environment by running

    pip install "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"

  3. Install the 'en_core_web_lg' model from spaCy by running

    python -m spacy download en_core_web_lg

Pre-trained Model

A pre-trained model is available here.

The hyperparameters are the defaults in the repo.

Place it in the checkpoints directory and you're good to go.


About

Simple Image Captioning CNN-RNN Model written in PyTorch
