Image Captioning Transformer

Intro

This project implements a Transformer-based image captioning model in PyTorch.
ResNet152 is used to extract image features. You can check the pre-trained models here.
The model is trained on the COCO 2017 dataset (Train/Val/Test images and annotations).
See config.py for the available settings. Training on multiple GPUs is also supported.

Getting Started

Prerequisites

Clone this repo

git clone https://github.com/minjoong507/Image-Captioning-Transformer.git
cd Image-Captioning-Transformer

Download COCO dataset

After downloading the images and annotations, place them in the data directory.

mkdir data

data
├── annotations
├── ls
├── train2017
├── val2017
└── test2017
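
Before running the later steps, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of this repo: the function name `missing_data_dirs` is hypothetical, and the directory names are copied verbatim from the tree above.

```python
from pathlib import Path

# Sub-directories listed in the tree above, copied verbatim (including "ls").
EXPECTED_DIRS = ["annotations", "ls", "train2017", "val2017", "test2017"]


def missing_data_dirs(root="data", expected=EXPECTED_DIRS):
    """Return the expected sub-directories that are absent under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing under data/:", ", ".join(missing))
    else:
        print("data/ layout looks complete.")
```

Running it from the project root prints the directories that still need to be created or downloaded.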

Install packages:

Python 3.8.5
PyTorch 1.7.0+cu110
nltk
tqdm
pycocotools

Training

Add project root to PYTHONPATH

source setup.sh

Build Vocabulary

python vocab/make_vocab.py
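
For readers unfamiliar with this step, here is a rough sketch of what building a caption vocabulary usually involves. It is an illustration under stated assumptions, not the code in make_vocab.py: the names `build_vocab`, `encode`, and `min_count` are hypothetical, `[PAD]` and `[BOS]` are assumed special tokens (only `[UNK]` and `[EOS]` appear in this README), and plain whitespace splitting stands in for an nltk tokenizer to keep the example dependency-free.

```python
from collections import Counter

# [UNK] and [EOS] appear in the sample outputs of this README;
# [PAD] and [BOS] are assumptions added for illustration.
SPECIAL_TOKENS = ["[PAD]", "[BOS]", "[EOS]", "[UNK]"]


def build_vocab(captions, min_count=1):
    """Map each token seen at least `min_count` times to an integer id."""
    counts = Counter(tok for cap in captions for tok in cap.lower().split())
    word2idx = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    # Sort tokens so that the ids are deterministic across runs.
    for tok, n in sorted(counts.items()):
        if n >= min_count and tok not in word2idx:
            word2idx[tok] = len(word2idx)
    return word2idx


def encode(caption, word2idx):
    """Convert a caption to ids, mapping unseen words to [UNK] and appending [EOS]."""
    unk = word2idx["[UNK]"]
    ids = [word2idx.get(tok, unk) for tok in caption.lower().split()]
    return ids + [word2idx["[EOS]"]]
```

Words below the frequency threshold fall back to `[UNK]`, which is why that token can show up in both predicted and target captions.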

Extract Image features

python feature_extraction/extraction.py

Training Model

python train.py

With the default config, training stops at epoch 100. I trained on either a single RTX 3090 GPU or six of them. Outputs are written to the result directory: each run gets a subdirectory named after its start time (2021-*), which contains the saved model and train-log.txt.

Example

result
└──2021-04-16-12-00
    ├── model.ckpt
    └── train-log.txt

Testing

python Inference.py --test_path MODEL_DIR_NAME

MODEL_DIR_NAME is the name of the dir containing the saved model, e.g., result/2021-*.
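
When several runs exist, picking the most recent one by hand gets tedious. The sketch below is a hypothetical helper (`latest_run_dir` is not part of this repo) that assumes run directories follow the `YYYY-MM-DD-HH-MM` naming shown in the example tree and contain a `model.ckpt` file.

```python
import re
from pathlib import Path

# Matches names such as "2021-04-16-12-00", the pattern shown in the example tree.
RUN_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}-\d{2}-\d{2}$")


def latest_run_dir(result_root="result"):
    """Return the newest timestamped run directory containing model.ckpt, or None."""
    root = Path(result_root)
    if not root.is_dir():
        return None
    runs = [
        d for d in root.iterdir()
        if d.is_dir() and RUN_NAME.match(d.name) and (d / "model.ckpt").is_file()
    ]
    # Zero-padded timestamps sort chronologically as plain strings.
    return max(runs, key=lambda d: d.name, default=None)
```

The returned path can then be passed as MODEL_DIR_NAME.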

Evaluation

Training loss and accuracy (100 epochs)

Single GPU :

Accuracy : 98.6766 %
Result

predict : a [UNK] with people is near a pier on clear water . [EOS]
target : a [UNK] with people is near a pier on clear water . [EOS]

Six GPUs :

Accuracy : 99.4794 %
Result

predict : a picture of a giraffe standing in a zoo exhibit . [EOS]
target : a picture of a giraffe standing in a zoo exhibit . [EOS]

predict : people and buses on a city street under cloudy skies . [EOS]
target : people and buses on a city street under cloudy skies . [EOS]

predict : a man at an office desk drinking a glass of wine . [EOS]
target : a man at an office desk drinking a glass of wine . [EOS]

predict : two zebras are standing next to a log . [EOS]
target : two zebras are standing next to a log . [EOS]
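
This README does not state how the accuracy figures are computed. One common choice is token-level accuracy, sketched below as an assumption rather than as the metric used in train.py; the function name `token_accuracy` is hypothetical.

```python
def token_accuracy(predictions, targets):
    """Percentage of target tokens matched by the prediction, position by position."""
    correct = 0
    total = 0
    for pred, tgt in zip(predictions, targets):
        pred_toks, tgt_toks = pred.split(), tgt.split()
        total += len(tgt_toks)
        # zip stops at the shorter sequence, so missing tokens count as wrong.
        correct += sum(p == t for p, t in zip(pred_toks, tgt_toks))
    return 100.0 * correct / total if total else 0.0
```

Under this definition, a prediction identical to its target, as in the samples above, scores 100 for that caption.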

Inference
- Predictions are written to pred.jsonl, formatted as:
```
{
    "image_id": 00004983,
    "caption": ""
}
```
- pred.jsonl is saved in the eval directory.
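
To work with the predictions afterwards, a small reader is usually enough. The sketch below assumes the JSON Lines convention of one object per line and an integer `image_id` (a number with leading zeros, as in the sample above, would not parse as JSON); the function names are hypothetical and the actual file written by Inference.py may differ.

```python
import json


def write_predictions(path, records):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def read_predictions(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `read_predictions("eval/pred.jsonl")` would return a list of dicts with `image_id` and `caption` keys, assuming the file follows this layout.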

TODO List

Description of the model and other details
Code Refactoring
Add Inference.py

File

Image-Captioning-Transformer
├── model
│   ├── data_loader.py
│   ├── layers.py
│   ├── model.py
│   └── optimization.py
├── data
│   ├── output_feature.pickle # after python extraction.py
│   ├── annotations
│   ├── ls
│   └── val2017
├── feature_extraction
│   ├── data_loader.py
│   ├── extraction.py
│   └── resnet.py
├── vocab
│   ├── vocab.pickle # after python make_vocab.py
│   ├── coco_idx.npy # after python extraction.py
│   └── make_vocab.py
├── LICENSE
├── .gitignore
└── README.md

License

MIT License

Reference

[1] TVCaption
[2] huggingface/transformer
