Natural Language Processing course project
- Download the dataset from the DaDoEval competition website
- Put the training set, test set, and gold file folder into the `data` folder in this repository.
- Install the conda environment with `conda env create -f environment.yml`. If you don't have conda, install it from here.
- Run the preprocessing scripts in `src/utils` to preprocess the data. Example (from the `src` folder): `python -m utils.preprocess_train_data`
- Train the model with `python train.py`. The default strategy uses UmBERTo, truncates each input to its first 512 tokens, and builds the embedding by summing the [CLS] token embeddings from the last four layers.
- For now, parameters have to be changed manually in the code. I plan to provide a way to train a given model without editing the code directly.
- A GPU is not required but recommended.
- On an NVIDIA Titan X GPU, training took about 10 minutes.
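
The default strategy mentioned in the training step (truncation to the first 512 tokens, then summing the [CLS] embeddings of the last four layers) can be illustrated with a minimal pure-Python sketch. This is not the code in `train.py`: the names `truncate` and `sum_cls_last_layers` are hypothetical, the hidden states are toy nested lists rather than real model tensors, and the [CLS] token is assumed to sit at position 0 of each sequence.

```python
from typing import List, Sequence

MAX_TOKENS = 512  # truncation length stated in this README


def truncate(token_ids: Sequence[int], max_tokens: int = MAX_TOKENS) -> List[int]:
    """Keep only the first `max_tokens` token ids (hypothetical helper)."""
    return list(token_ids[:max_tokens])


def sum_cls_last_layers(
    hidden_states: Sequence[Sequence[Sequence[float]]],
    n_layers: int = 4,
) -> List[float]:
    """Sum the [CLS] vector over the last `n_layers` layers.

    hidden_states[layer][token] is the embedding vector of one token in one
    layer. Assumption: the [CLS] token is at position 0.
    """
    if len(hidden_states) < n_layers:
        raise ValueError("fewer layers available than requested")
    # One [CLS] vector per selected layer.
    cls_vectors = [layer[0] for layer in hidden_states[-n_layers:]]
    # Element-wise sum across the selected layers.
    return [sum(values) for values in zip(*cls_vectors)]


# Toy input: 5 layers, 2 tokens, hidden size 3.
# In layer k the [CLS] vector is [k, k, k]; the other token is all zeros.
toy_hidden_states = [
    [[float(k)] * 3, [0.0] * 3]
    for k in range(5)
]

print(len(truncate(list(range(600)))))         # prints 512
print(sum_cls_last_layers(toy_hidden_states))  # prints [10.0, 10.0, 10.0]
```

With a real model the per-layer hidden states would be tensors returned by the model rather than nested lists, but the selection and summation steps follow the same pattern.
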
If you want to know more details about the project and the tests, have a look at the report here.