raarielgrace/copynet

CopyNet

This is an implementation of CopyNet, which extends encoder-decoder models so that they can generate output sequences containing "out of vocabulary" tokens that appear in the input sequence.
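
To make the idea of copying "out of vocabulary" tokens concrete, here is a minimal sketch of one common way to index them: each unknown input token gets a temporary id just past the end of the fixed vocabulary, so the decoder has something to point at. This is an illustration only and not necessarily how this repository does it. The names `build_extended_ids` and `decode_ids` are hypothetical, and the sketch assumes the vocabulary ids are contiguous from 0 to len(vocab) - 1.

```python
def build_extended_ids(input_tokens, vocab):
    """Map input tokens to ids; OOV tokens get temporary ids past the vocab.

    Assumes (hypothetically) that vocab maps token -> id with contiguous
    ids 0 .. len(vocab) - 1.
    """
    oov_ids = {}  # OOV token -> temporary id, valid for this input only
    ids = []
    for tok in input_tokens:
        if tok in vocab:
            ids.append(vocab[tok])
        else:
            if tok not in oov_ids:
                oov_ids[tok] = len(vocab) + len(oov_ids)
            ids.append(oov_ids[tok])
    return ids, oov_ids


def decode_ids(ids, vocab, oov_ids):
    """Turn ids back into tokens, resolving temporary OOV ids as well."""
    id_to_tok = {i: t for t, i in vocab.items()}
    id_to_tok.update({i: t for t, i in oov_ids.items()})
    return [id_to_tok[i] for i in ids]


vocab = {"<unk>": 0, "my": 1, "name": 2, "is": 3}
ids, oov_ids = build_extended_ids("my name is Zorblax".split(" "), vocab)
copied = decode_ids([ids[-1]], vocab, oov_ids)
```

Here "Zorblax" is not in the vocabulary, yet it can still be produced in the output because its temporary id resolves back to the token seen in the input.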

Dependencies: pytorch, numpy, tensorboardX (for logging), tqdm (for logging), spacy (for tokenization)

The model is trained on sequence pairs. Create a directory to hold the training files. Each file should contain two lines of text: the first is the input sequence and the second is the target output sequence. The tokens in each sequence should be separated by spaces. I used spacy to tokenize the training data, so the SequencePairDataset class and the evaluation methods assume that spacy will be used. If you want to use a different tokenizer, be sure to update those files accordingly.
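
The file layout described above can be sketched with a pair of small helpers. This is a hedged illustration using only the standard library. The function names `write_pair_file` and `read_pair_file` and the `train_data/0.txt` path are hypothetical and are not part of this repository. The token lists are assumed to be tokenized already (for example with spacy). The sketch writes no trailing newline so that the file holds exactly two lines; whether the repository's loader cares about that is not confirmed here.

```python
from pathlib import Path


def write_pair_file(path, input_tokens, target_tokens):
    """Write one training file: line 1 is the input, line 2 is the target.

    Tokens are joined with single spaces, so no token may contain a space
    or a newline.
    """
    for tok in list(input_tokens) + list(target_tokens):
        if " " in tok or "\n" in tok:
            raise ValueError(f"token contains whitespace: {tok!r}")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    text = " ".join(input_tokens) + "\n" + " ".join(target_tokens)
    path.write_text(text, encoding="utf-8")


def read_pair_file(path):
    """Read a training file back into (input_tokens, target_tokens)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if len(lines) != 2:
        raise ValueError(f"expected 2 lines, found {len(lines)}")
    return lines[0].split(" "), lines[1].split(" ")
```

For example, `write_pair_file("train_data/0.txt", ["hello", ",", "Zorblax", "!"], ["hi", "Zorblax"])` creates one sequence pair in a hypothetical `train_data` directory.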

Train the model using the train.py script. Most hyperparameters can be set with command line arguments, which are documented in the training script.
