
Transformers

In this repository, I implement the Transformer and apply it to various seq2seq language tasks: spelling and grammar error correction, paraphrase generation, and text summarization. The goal is to explore how the following modifications influence performance on each task:

  • Static Positional Encoding vs Learned Positional Encoding
  • Label Smoothing (see the sketch after this list)
  • Word-level Encoding vs Character-level Encoding vs Byte-Pair Encoding
  • Greedy Decoding vs Beam Search
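For the label smoothing experiments, the idea is to replace the one-hot target distribution with a slightly softened one. A minimal sketch using PyTorch's built-in support (the smoothing value 0.1 and the padding index 0 are illustrative assumptions, not necessarily the settings used here):

import torch
import torch.nn as nn

# Cross-entropy with label smoothing: 0.9 probability mass goes to the true
# token and the remaining 0.1 is spread uniformly over the rest of the vocabulary.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)  # 0 = padding id (assumed)

logits = torch.randn(32, 1000)            # (batch, vocab_size)
targets = torch.randint(1, 1000, (32,))   # gold token ids
loss = criterion(logits, targets)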

Spelling and Grammar Error Corrector

We begin with the spelling and grammar error corrector. The goal of such a system is to correct typos in a given text if they are present. For example, "I training a spell corecotr" needs to be corrected to "I am training a spell corrector", fixing both the spelling error ("corecotr" -> "corrector") and the grammatical error ("I training" -> "I am training").

Dataset used: GitHub Typo Corpus

Positional encoding is designed to incorporate a notion of token order into a sentence. Learned positional encoding is implemented by introducing an embedding layer of size (maximum sequence length, embedding size) whose weights are trained with the model. Static positional encoding is implemented with fixed sinusoidal functions of different frequencies. For more information about positional encodings, check this paper.
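A minimal sketch of both variants (module names and the max_len default are illustrative, not the exact ones in this repository):

import math
import torch
import torch.nn as nn

class StaticPositionalEncoding(nn.Module):
    # Fixed sinusoids of different frequencies, as in the original Transformer.
    def __init__(self, d_model, max_len=512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)  # fixed, not a learned parameter

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return x + self.pe[:x.size(1)]

class LearnedPositionalEncoding(nn.Module):
    # An embedding of shape (max_len, d_model), trained jointly with the model.
    def __init__(self, d_model, max_len=512):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pe(positions)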

We train the corrector with character-level encoding and byte-pair encoding (BPE), since word-level encoding is not applicable to this task. With character-level encoding, the validation loss did not decrease for a very large number of iterations. To create BPE encodings, we use BertTokenizer from the transformers library; with BPE we observed a more rapid decrease in validation loss.
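Tokenizing with BertTokenizer looks roughly like this (the checkpoint name is an assumption; strictly speaking BERT's tokenizer uses WordPiece, a close relative of BPE):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # checkpoint assumed

# A misspelled word is split into several subword pieces, so the model can
# still represent typos it has never seen as whole words.
tokens = tokenizer.tokenize('I training a spell corecotr')
ids = tokenizer.convert_tokens_to_ids(tokens)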

Data Augmentation

To improve the corrector, more diverse examples of misspellings are required. One powerful method is to train a reverse "corruptor" model that introduces human-like errors into clean examples. To do this, we reverse source and target during training. The trained "corruptor" model can then be used to create an artificial typo corpus of (in theory) infinite size. For more information about this idea, refer to this blog.
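The only change needed for training the corruptor is to swap the two sides of every training pair; a minimal sketch (variable names are illustrative):

def make_corruptor_pairs(pairs):
    # (noisy, clean) pairs for the corrector become (clean, noisy) pairs
    # for the corruptor, which learns to map clean text to human-like typos.
    return [(clean, noisy) for noisy, clean in pairs]

corrector_pairs = [('I training a spell corecotr', 'I am training a spell corrector')]
corruptor_pairs = make_corruptor_pairs(corrector_pairs)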

How To Use

Create a conda environment and install required packages

conda env create -f env.yml
conda activate transformer_env

Read the GitHub data and split it into train, validation, and test sets

python3 process_data.py

Start training the typo corrector model, setting the encoding and embedding types accordingly

python3 train.py -encoding_type char -embedding_type learned 
