Translate and Classify

Translation

Data Preparation

To transliterate the datasets, use the scripts in translation/preprocessing/. They use the fork of csnli available here. Pre-transliterated versions of the datasets are also provided in translation/preprocessing/preprocessed_data/.

Download and extract the IIT Bombay English-Hindi Corpus v3.0 into a directory, then copy the transliterated datasets into the same directory. The final directory should look like this:

.
├── iitb_corpus
│   ├── dev_test
│   │   ├── dev.en
│   │   ├── dev.hi
│   │   ├── test.en
│   │   └── test.hi
│   └── parallel
│       ├── IITB.en-hi.en
│       └── IITB.en-hi.hi
├── mrinal_dhar.jsonl
└── phinc.jsonl
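
Before training, it can help to confirm that the dataset directory matches the layout above. The following is a minimal sketch and not part of the repository: the function name missing_dataset_files is hypothetical, and only the file names are taken from the tree shown above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_FILES = [
    "iitb_corpus/dev_test/dev.en",
    "iitb_corpus/dev_test/dev.hi",
    "iitb_corpus/dev_test/test.en",
    "iitb_corpus/dev_test/test.hi",
    "iitb_corpus/parallel/IITB.en-hi.en",
    "iitb_corpus/parallel/IITB.en-hi.hi",
    "mrinal_dhar.jsonl",
    "phinc.jsonl",
]


def missing_dataset_files(dataset_dir):
    """Return the expected relative paths that are not regular files under dataset_dir.

    Hypothetical helper for illustration; the training scripts do not call it.
    """
    root = Path(dataset_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling missing_dataset_files on the dataset directory returns an empty list when every expected file is present; any path it returns still needs to be downloaded or copied into place.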

Training

Install the dependencies using:

conda env create --file environment.yml
conda activate cmtranslation2

Download the pre-trained mBART checkpoint:

wget -c https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz

Finally, train the model:

bash train.sh <initial checkpoint> <temporary directory> <path to dataset directory>

Here <initial checkpoint> is the path to mbart.cc25.v2.tar.gz, or the mBART-hien temporary directory when training mBART-hien-cm. The <temporary directory> is created by the script, and the checkpoints are stored in <temporary directory>/checkpoint.

Evaluation

bash eval.sh <temporary directory> <path to best checkpoint>
bash eval_phinc.sh <temporary directory> <path to best checkpoint>

Classification

We report our performance on the GLUECoS benchmark. Our dataset parsing and training code is also based on the GLUECoS codebase.

We provide the preprocessed data, which can be used directly. To download and translate the datasets yourself, run the following in order:

1. datasets/download_data.sh
2. datasets/Data/Preprocess_Scripts/preprocess_{nli/sent}_en_hi_2.py
3. datasets/Data/Preprocess_Scripts/to_translate_{nli/sent}.py
4. datasets/Data/Preprocess_Scripts/translate_mbart.sh
5. datasets/Data/Preprocess_Scripts/after_translate_{nli/sent}.py
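
The {nli/sent} placeholder in the script names above selects the task. As a small sketch of how the names resolve for one task, the helper below lists the scripts in the stated order. The function name preprocessing_steps is hypothetical, and the sketch does not execute anything, since the scripts' command-line arguments and expected working directory are not documented here.

```python
# Ordered templates copied from the text above; "{task}" stands in for the
# "{nli/sent}" placeholder.
STEP_TEMPLATES = [
    "datasets/download_data.sh",
    "datasets/Data/Preprocess_Scripts/preprocess_{task}_en_hi_2.py",
    "datasets/Data/Preprocess_Scripts/to_translate_{task}.py",
    "datasets/Data/Preprocess_Scripts/translate_mbart.sh",
    "datasets/Data/Preprocess_Scripts/after_translate_{task}.py",
]


def preprocessing_steps(task):
    """Return the script paths for one task ("nli" or "sent") in run order.

    Hypothetical helper for illustration; it only resolves names.
    """
    if task not in ("nli", "sent"):
        raise ValueError(f"task must be 'nli' or 'sent', got {task!r}")
    return [template.format(task=task) for template in STEP_TEMPLATES]


# Print the resolved order for the NLI task.
for step in preprocessing_steps("nli"):
    print(step)
```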

Install the dependencies for the classification models, update the paths in the code, and then run:

conda env create --file environment.yml
conda activate cm_nli
python3 code_mixed_nli.py -n <experiment name> | tee training_log # to train and evaluate NLI
python3 code_mixed_sa.py -n <experiment name> | tee training_log # to train and evaluate Sentiment Analysis
