PyTorch implementation of GoEmotions, Conditional BERT contextual augmentation, and BERT with Self-Supervised Attention, built on Hugging Face Transformers.
The GoEmotions dataset labels 58,000 Reddit comments with 28 emotion categories (27 emotions plus neutral):

admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise + neutral

The dataset is provided in three different taxonomies, placed in the `data` directory.
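Each taxonomy directory pairs a `labels.txt` (one label name per line) with TSV splits. A minimal sketch of loading them, assuming the GoEmotions TSV layout of text, comma-separated label ids, and a comment id per row; the helper names are illustrative, not from the repo:

```python
import csv

def load_labels(path):
    """Read one label name per line, e.g. data/original/labels.txt."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def load_examples(path, labels):
    """Parse a GoEmotions-style TSV row: text <TAB> comma-separated
    label ids <TAB> comment id. Returns (text, [label, ...]) pairs."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            text, ids = row[0], row[1]
            examples.append((text, [labels[int(i)] for i in ids.split(",")]))
    return examples
```

Because examples can carry several label ids, the parsed labels are a list, which matches the multi-label nature of GoEmotions.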
- Clone this repo
- Python 3.6
- Install PyTorch==1.4.0
- Install the rest of the requirements:

```shell
pip install -r requirements.txt
```
```shell
python analyze_dataset.py [--aug]
```

Hyperparameters can be changed in the JSON files in the `config` directory. By default, the script runs dataset analysis on `data/original/train.tsv`, with labels defined in `data/original/labels.txt`. If run with the `--aug` flag, the analysis is instead performed on the augmented training dataset, stored at `data/original/train_augmented_*.tsv` by default, without reading the labels file (the augmented training dataset is generated using CBERT with a label-distribution threshold of the user's choice).
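The core of such an analysis is a per-label frequency count. A small illustrative helper (not from the repo) that computes counts and relative frequencies for multi-label examples parsed as `(text, [label, ...])` pairs:

```python
from collections import Counter

def label_distribution(examples):
    """Count how often each label appears across multi-label examples.

    `examples` is a list of (text, [label, ...]) pairs, the shape a
    GoEmotions-style TSV parses into (an assumed interface). Returns
    {label: (count, fraction_of_all_label_occurrences)}, most common first.
    """
    counts = Counter(label for _, labels in examples for label in labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}
```

A distribution like this is also what a CBERT label-distribution threshold would be compared against when deciding which labels need augmentation.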
```shell
python run_goemotions.py --taxonomy original
```
First, fine-tune the conditional BERT model on the original training dataset:

```shell
python cbert_finetune.py
```
Second, use the model saved in the previous step to generate new examples. The original examples, their masked versions, and the predicted versions are stored in separate files under `data/original` by default:

```shell
python cbert_augdata.py
```
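Conceptually, this step masks tokens in an original example and asks the label-conditioned model to fill them in. A toy sketch of just the masking half, with illustrative names (the real script performs the label-conditioned prediction with BERT):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=None):
    """Randomly replace a fraction of tokens with [MASK].

    Returns the masked sequence and the positions that were hidden;
    a conditional BERT model would then predict label-compatible
    replacements at those positions.
    """
    rng = random.Random(seed)
    masked, positions = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            masked[i] = mask_token
            positions.append(i)
    if not positions and tokens:
        # Guarantee at least one mask so every example yields a prediction.
        i = rng.randrange(len(tokens))
        masked[i] = mask_token
        positions.append(i)
    return masked, positions
```

Keeping the masked positions alongside the masked sequence is what lets the original, masked, and predicted versions be written out as parallel files.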
Third, remove duplicates, sanitize the generated examples, and merge them into the original training corpus:

```shell
python cbert_merge.py
```
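A minimal sketch of the dedup-and-merge idea, assuming examples are handled as `(text, labels)` pairs; the helper name and the exact sanitization rules here are illustrative, not taken from `cbert_merge.py`:

```python
def merge_augmented(original, augmented):
    """Merge augmented (text, labels) pairs into the original corpus,
    dropping exact-text duplicates (case/whitespace-insensitive) and
    whitespace-only generations."""
    seen = {text.strip().lower() for text, _ in original}
    merged = list(original)
    for text, labels in augmented:
        key = text.strip().lower()
        if key and key not in seen:
            seen.add(key)
            merged.append((text.strip(), labels))
    return merged
```

Deduplicating against the original corpus matters because a masked-and-refilled sentence can reproduce its source verbatim, which would otherwise skew the label distribution.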
```shell
cd ssa_BERT
python run_ssa.py
```
To run BERT with Self-Supervised Attention on the augmented GoEmotions dataset, simply change the default value of `train_data_file` defined in `ssa_BERT/run_ssa.py`.