An implementation inspired by the paper *Show and Tell: A Neural Image Caption Generator* by Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan.
We approach the problem in two stages:
- A pre-trained CNN is used to extract the image features. In our case, we take a ResNet trained on ImageNet classification and detach its head; the penultimate layer gives us the features.
- A pre-trained word embedding is used to process and tokenize the captions. In our case, we use the `en_core_web_lg` model from spaCy. The embedded captions are then teacher-forced to an RNN, which predicts the next word.
The extracted features are treated as the initial hidden state of the RNN. To match the dimensionality, they are first sent through a Linear layer and reshaped.
Conditioned on this initial state, the model generates its hidden states, which are further sent through a Linear layer of dimension `vocab_size`.
Thus, at each timestep, we have a score for each possible word.
We treat this as a classification problem and use the categorical cross-entropy loss to match the scores to the desired label at each timestep.
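Put together, the decoder and its loss can be sketched as below. All sizes are illustrative assumptions, a GRU stands in for the RNN, and random tensors stand in for the CNN features, the embedded captions, and the target word indices.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
feature_dim, embed_dim, hidden_dim, vocab_size = 512, 300, 256, 1000

project = nn.Linear(feature_dim, hidden_dim)   # match feature dim to RNN hidden dim
rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)   # scores over the vocabulary

features = torch.randn(1, feature_dim)          # from the CNN
h0 = project(features).unsqueeze(0)             # reshaped to (num_layers, batch, hidden)
captions = torch.randn(1, 7, embed_dim)         # teacher-forced word embeddings
targets = torch.randint(0, vocab_size, (1, 7))  # next-word indices

hidden_states, _ = rnn(captions, h0)            # (batch, seq, hidden)
scores = to_vocab(hidden_states)                # (batch, seq, vocab_size)

# Categorical cross-entropy across all timesteps at once.
loss = nn.functional.cross_entropy(scores.view(-1, vocab_size), targets.view(-1))
```

Flattening the batch and time dimensions lets a single `cross_entropy` call score every timestep against its desired label.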
We use beam search to sample the most likely caption at evaluation time.
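Beam search itself is independent of the model. A minimal pure-Python sketch, where a hypothetical `step_fn` stands in for the RNN's next-word distribution:

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Keep the beam_width highest-scoring partial captions at each step.

    step_fn(sequence) returns a {token: probability} dict for the next
    word; here it stands in for the RNN's softmax output.
    """
    beams = [([start_token], 0.0)]  # (sequence, log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:          # finished caption: set it aside
                completed.append((seq, score))
            else:
                for token, p in step_fn(seq).items():
                    candidates.append((seq + [token], score + math.log(p)))
        if not candidates:                    # every beam has finished
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best = max(completed or beams, key=lambda c: c[1])
    return best[0]

# A toy next-word distribution: greedy decoding would commit to "a"
# (p = 0.6), but beam search recovers the higher-probability caption
# "the dog" (0.4 * 0.9 > 0.6 * 0.5).
toy_model = {
    ("<s>",): {"a": 0.6, "the": 0.4},
    ("<s>", "a"): {"cat": 0.5, "</s>": 0.5},
    ("<s>", "the"): {"dog": 0.9, "</s>": 0.1},
    ("<s>", "a", "cat"): {"</s>": 1.0},
    ("<s>", "the", "dog"): {"</s>": 1.0},
}
caption = beam_search(lambda seq: toy_model[tuple(seq)], "<s>", "</s>")
print(caption)  # ['<s>', 'the', 'dog', '</s>']
```

Summing log-probabilities instead of multiplying raw probabilities keeps the scores numerically stable for long captions.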
There are three ways you might want to use this project.
- Learn: To follow along with the Jupyter Notebooks, go to the `notebooks` folder.
- Apply: To execute on the command line, go to the `captioner` folder.
- Serve: To serve the model as a Flask microservice, go to the `server` folder.
- Follow the general steps in this tutorial to set up the environment and the startup file. Make sure you have MagNet installed.
- Install pycocotools in the conda environment by running
  `pip install "git+https://github.com/philferriere/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"`
- Install the `en_core_web_lg` model from spaCy by running
  `python -m spacy download en_core_web_lg`
A pre-trained model is available here.
The hyperparameters are the defaults in the repo.
Place it in the `checkpoints` directory and you're good to go.