
Neural Transfer Learning for Natural Language Processing

UPDATE: Work in progress to improve the code quality. Thank you for your understanding.

This Master's thesis implements Progressive Neural Networks (PNNs) for transfer learning between Named Entity Recognition (NER) and Text Classification (Sentiment Analysis). The PNNs are compared with the standard pre-training/fine-tuning (PTFT) approach to transfer learning, in which a pre-trained network is fine-tuned on a target task/dataset.

This work was accepted at the LREC 2020 conference. The paper is available here: https://www.aclweb.org/anthology/2020.lrec-1.172/


Comprehensive information about this work can be found in the thesis defense slides in the Documents/ folder.

Background

More information about transfer learning, PNNs, NER and text classification can be found in the following (a minimal PNN sketch follows the list):

  1. PNN: https://arxiv.org/abs/1606.04671
  2. NER: https://arxiv.org/abs/1603.01354
  3. TC: https://www.aclweb.org/anthology/D14-1181
  4. Transfer Learning: http://ruder.io/thesis/
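
For readers who want the core mechanics before opening the papers, below is a minimal PyTorch sketch of the PNN idea: a frozen source column feeds a target column through an adapter (a lateral connection). Class names and dimensions are illustrative assumptions; this is not the repository's implementation (see src/booster/progNN for that).

```python
import torch
import torch.nn as nn

class Column(nn.Module):
    """One PNN column: a small stack of hidden layers."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hidden_dim)
        self.l2 = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        h1 = torch.relu(self.l1(x))
        h2 = torch.relu(self.l2(h1))
        return h1, h2

class ProgressiveTarget(nn.Module):
    """Target column with a lateral connection from a frozen source column."""
    def __init__(self, in_dim, hidden_dim, out_dim, source: Column):
        super().__init__()
        self.source = source
        for p in self.source.parameters():  # the source column stays frozen
            p.requires_grad = False
        self.l1 = nn.Linear(in_dim, hidden_dim)
        self.l2 = nn.Linear(hidden_dim, hidden_dim)
        self.adapter = nn.Linear(hidden_dim, hidden_dim)  # adapts source features for the target
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        with torch.no_grad():
            s1, _ = self.source(x)          # source activations, no gradient
        t1 = torch.relu(self.l1(x))
        # lateral connection: layer 2 combines the target path with adapted source features
        t2 = torch.relu(self.l2(t1) + self.adapter(s1))
        return self.head(t2)

source = Column(50, 64)                      # assume weights pre-trained on the source task
model = ProgressiveTarget(50, 64, 5, source)
logits = model(torch.randn(8, 50))           # -> shape (8, 5)
```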

Overview

Below is an overview of the repository's files:

  1. src: This directory contains the source code
  2. src/ner: contains the code for Named Entity Recognition
  3. src/ner/data/: the processed dataset, produced by running the SL notebooks on the raw datasets to convert them into 'Sentence → Label' (SL) format; a reading sketch follows this list. A dummy dataset has been uploaded. The dataset contains three folders, Train, Val and Test, one per phase. The folder is enriched with additional information after build_vocab.py is run: JSON files that map word and character indices to their respective embedding matrices.
  4. src/ner/embeddings/: contains the pre-trained word embedding files (raw .txt or .vec).
  5. src/ner/experiments/: one folder per experiment. An experiment folder contains the following files and sub-folders:
  6. src/ner/experiments/<experiment_name>/plots: graphs and per-epoch metrics stored in PKL, JSON and PNG formats
  7. src/ner/experiments/<experiment_name>/best.pth: the PyTorch model with the best validation performance so far
  8. src/ner/experiments/<experiment_name>/data_encoder.pkl: the pickled data encoder
  9. src/ner/experiments/<experiment_name>/label_encoder.pkl: the pickled label encoder
  10. src/ner/experiments/<experiment_name>/last.pth: the PyTorch model from the last evaluation on the validation set
  11. src/ner/experiments/<experiment_name>/params.json: hyperparameters for the network
  12. src/ner/experiments/<experiment_name>/train.log: logs from the training loop
  13. src/ner/experiments/<experiment_name>/train_snapshot.json: snapshot of train_new.py at training time, to facilitate reproducibility
  14. src/ner/data.py: iterators for various data formats. Currently supports reading from the CoNLL03 format and from raw strings.
  15. src/ner/encoder.py: encodes the data into indices and numeric values
  16. src/ner/evaluate.py: evaluates on the validation and test datasets and generates the metrics
  17. src/ner/evaluation.py: function definitions of the various metrics
  18. src/ner/train_new.py: the training loop
  19. src/ner/utils.py: utilities such as pickling and saving/reading text files
  20. src/tc: contains the code for Text Classification. The directory structure is similar to NER.
  21. src/booster: contains the code for transfer learning using PNNs and PTFT
  22. src/booster/algorithms/: transfer learning algorithms. Currently only fine_tune.ipynb, which performs the standard PTFT algorithm.
  23. src/booster/future/: code for future work
  24. src/booster/progNN: Progressive Neural Networks
  25. src/booster/progNN/adapter.py: the adapter module; see the PNN paper for details
  26. src/booster/progNN/column_ner.py: fits a neural network to a 'column' (see the PNN paper for information about columns). This column is specific to NER.
  27. src/booster/progNN/column_tc.py: as above, but specific to TC
  28. src/booster/progNN/decoder.py: Conditional Random Field (CRF) module for sequence decoding
  29. src/booster/progNN/net.py: neural network with modifications for PNN, for NER
  30. src/booster/progNN/prognet.py: the general PNN framework
  31. src/booster/progressive_build_vocab.py and progressive_data_loader.py: same purpose as their counterparts in the NER and TC folders, but with modifications for the PNN framework
  32. src/booster/progressive_ner.py: PNN for NER
  33. src/booster/progressive_ner_3col.py: PNN with NER as the target task and 2 source columns. The source columns can be NER for same-task transfer, or TC for cross-task transfer.
  34. src/booster/progressive_tc.py: TC counterpart of progressive_ner.py
  35. src/booster/progressive_tc_3col.py: TC counterpart of progressive_ner_3col.py. The source columns can be either NER or TC.
  36. src/booster/utils.py: same as the utils in src/ner/
  37. src/notebooks: contains the Jupyter notebooks
  38. src/notebooks/data_exploration: statistics about the NER and Sentiment Analysis (SA) datasets
  39. src/notebooks/data_preparation: prepares the data for processing
  40. src/notebooks/data_preparation/SL: converts data into the 'Sentence → Label' format. Dummy files in this format are available under the src/ner/data/dummy folder.
  41. src/notebooks/data_preparation/split: splits the data into 10 portions of varying sizes, from 10% to 100% of the complete training dataset, to mimic the varying availability of training data in 10% increments
  42. src/notebooks/graphs: notebooks to create graphs from the experiments
  43. src/notebooks/named_entity_recognition: Jupyter notebooks to run the complete pipeline
  44. src/notebooks/named_entity_recognition/evaluate.ipynb: evaluation on the validation and test sets
  45. src/notebooks/named_entity_recognition/feat.ipynb: creates features from the training datasets
  46. src/notebooks/named_entity_recognition/fine_tune.ipynb: PTFT for NER
  47. src/notebooks/named_entity_recognition/inference.ipynb: obtains predictions for input sentences using a pre-trained model
  48. src/notebooks/named_entity_recognition/progressive_2col.ipynb: PNN with 1 source column and 1 target column
  49. src/notebooks/named_entity_recognition/progressive_3col.ipynb: PNN with 2 source columns and 1 target column
  50. src/notebooks/named_entity_recognition/train.ipynb: the training loop
  51. src/notebooks/text_classification: Jupyter notebooks to run the complete pipeline for Sentiment Analysis. The sub-notebooks are similar to NER.
  52. src/resources: contains the raw datasets for NER and Sentiment Analysis
  53. Documents/: contains the thesis defense slides
  54. Resources/: contains the raw datasets used in this work
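
To make the 'Sentence → Label' idea concrete, here is a hedged sketch of reading CoNLL03-style token-per-line data into (sentence, labels) pairs, roughly the view the SL notebooks produce. The exact on-disk layout under src/ner/data/ may differ, and the function name is an assumption.

```python
def read_conll(path):
    """Yield (tokens, tags) per sentence from a CoNLL03 file (one 'token ... tag' per line)."""
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # blank lines separate sentences; -DOCSTART- lines are document headers
            if not line or line.startswith("-DOCSTART-"):
                if tokens:
                    yield tokens, tags
                    tokens, tags = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])   # the word is the first column
            tags.append(parts[-1])    # the NER tag is the last column
        if tokens:
            yield tokens, tags

# e.g. (["EU", "rejects", "German", "call"], ["B-ORG", "O", "B-MISC", "O"])
```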

Reproducibility

Below are the instructions for running the experiments. They are kept general, without exact commands, to allow for flexibility: they specify the overall pipeline to follow to reproduce an experiment. The reader is encouraged to run the provided notebooks to get a feel for the pipeline.

  1. Clone the repository: git clone https://github.com/sarthakTUM/progressive-neural-networks-for-nlp.git
  2. Install the requirements: pip install -r requirements.txt
  3. Follow the steps below for the required functionality

Named Entity Recognition Single-Task:

  1. Download the raw dataset with train, validation and test splits
  2. Run the 'sentence → label' converter notebooks in the src/notebooks/data_preparation folder. There are Jupyter notebooks for various datasets; they convert the CoNLL03 format into the SL format. The resulting datasets are saved in the src/ner/data folder.
  3. Download the GloVe 6B-token English embeddings from http://nlp.stanford.edu/data/glove.6B.zip and place the .txt file in the src/ner/embeddings/ folder. The dimensionality depends on the use case.
  4. Run progressive_build_vocab.py in the src/booster folder (a sketch of this embedding step follows this list). In the script, the following parameters should be changed:
    --data_folder: the path to the SL format datasets
    --embeddings_folder: the path to the embeddings in the src/ner/embeddings/ directory.
    --embeddings_dim: the dimensionality of the embeddings
    --embeddings_type: type of embeddings. Supported: GloVe, Word2vec and Fasttext
    The features are saved in the data folder.
  5. Run train_new.py in the src/ner/ directory. This trains the neural network and saves the model. The following parameters can be changed:
    --data_dir: directory of the SL format data, enriched by progressive_build_vocab.py
    --model_dir: directory to save the model.
    --restore_file: file to restore model from
  6. Evaluation can be done using src/ner/evaluate.py.
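
As referenced in step 4, a hedged sketch of what the vocabulary/embedding step amounts to: parsing a raw GloVe .txt file and assembling an embedding matrix aligned to a word index, with random vectors for out-of-vocabulary words. The file path and the vocab source are assumptions; the repository's script additionally writes the JSON index files described in the Overview.

```python
import numpy as np

def load_glove(path, dim):
    """Parse a raw GloVe .txt file into {word: vector}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) == dim + 1:   # skip malformed lines
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding_matrix(vocab, glove, dim):
    """Row i holds the vector for vocab[i]; OOV words get small random vectors."""
    matrix = np.random.uniform(-0.25, 0.25, (len(vocab), dim)).astype(np.float32)
    for i, word in enumerate(vocab):
        if word in glove:
            matrix[i] = glove[word]
    return matrix

# glove = load_glove("src/ner/embeddings/glove.6B.100d.txt", dim=100)
# matrix = build_embedding_matrix(["the", "eu", "<unk>"], glove, dim=100)
```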

Text Classification Single-Task

  1. Download the raw dataset with train, validation and test splits
  2. Run the 'sentence → label' converter notebooks in the src/notebooks/data_preparation/SL/text_classification folder. There are Jupyter notebooks for various datasets; they convert the '<label> <text>' (whitespace-separated) format into the SL format.
  3. Download the GloVe 6B-token English embeddings from http://nlp.stanford.edu/data/glove.6B.zip and place the .txt file in the src/tc/embeddings/ folder. The dimensionality depends on the use case.
  4. Run build_vocab.py in the src/tc folder. In the script, the following parameters should be changed:
    --data_folder: the path to the SL format datasets
    --embeddings_folder: the path to the embeddings in the src/tc/embeddings/ directory.
    --embeddings_dim: the dimensionality of the embeddings (100 for Text Classification)
    --embeddings_type: type of embeddings. Supported: GloVe, Word2vec and Fasttext ('glove' for Text Classification)
    The features are saved in the data folder.
  5. Run train.py in the src/tc/ directory. This trains the neural network and saves the model. The following parameters can be changed:
    --data_dir: directory of the SL format data, enriched by build_vocab.py
    --model_dir: directory to save the model.
    --restore_file: file to restore model from
  6. Evaluation can be done using src/tc/evaluate.py (a checkpoint-restore sketch follows this list).
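
For the --restore_file and evaluation steps above, a small hedged sketch of loading a saved checkpoint such as best.pth before evaluating. Whether this repository stores a bare state_dict or a wrapper dict is an assumption; the unwrapping branch covers a common convention.

```python
import torch

def restore(model, checkpoint_path, device="cpu"):
    """Load weights from a checkpoint file and switch the model to eval mode."""
    state = torch.load(checkpoint_path, map_location=device)
    # some training loops save {"state_dict": ..., "optim_dict": ...}; unwrap if so
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    model.load_state_dict(state)
    model.eval()   # disable dropout etc. before evaluation
    return model
```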

Pre-training/Fine-Tuning

In PTFT, a pre-trained network is fine-tuned on a target dataset. The target task must be identical to the source task.
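
A hedged sketch of the choice the --all_layer flag controls: after copying the pre-trained weights, either every parameter stays trainable, or everything is frozen except the final classifier layer. `model.head` is a hypothetical attribute name used for illustration; because PTFT here requires identical source and target tasks, the layer shapes are assumed to match.

```python
import torch

def prepare_ptft(model, pretrained_path, all_layer):
    """Copy pre-trained weights, then fine-tune all layers or only the task head."""
    model.load_state_dict(torch.load(pretrained_path, map_location="cpu"))
    if not all_layer:
        for p in model.parameters():       # freeze everything ...
            p.requires_grad = False
        for p in model.head.parameters():  # ... except the final layer ('head' is hypothetical)
            p.requires_grad = True
    return model
```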

Named Entity Recognition:

  1. Follow the steps 1-4 for the Named Entity Recognition Single-Task setting.
  2. Run src/booster/algorithms/fine_tune.py with the following parameters:
    --data_dir: directory of the SL format data, enriched by progressive_build_vocab.py
    --model_dir: directory to save the model.
    --pretrained_model_dir: The pre-trained model (experiment directory to be specified) from which the knowledge will be transferred.
    --all_layer: True or False. Whether to fine-tune all the layers or only the last layer.
  3. Evaluation can be done the same way as for single-task NER.

Text Classification

  1. Follow the steps 1-4 for the Text Classification Single-Task setting
  2. Run src/tc/train_ptft.py with the following parameters:
    --data_dir: directory of the SL format data, enriched by build_vocab.py
    --model_dir: directory to save the model.
    --pretrained_model_dir: The pre-trained model (experiment directory to be specified) from which the knowledge will be transferred.
    --all_layer: True or False. Whether to fine-tune all the layers or only the last layer.
  3. Evaluation can be done the same way as for single-task TC.

Progressive Transfer:

Same-task and cross-task transfer using Progressive Neural Networks.

NER → NER or TC → NER

  1. Follow steps 1-4 of the NER single-task setting
  2. Run src/booster/progressive_ner.py with the following parameters:
    --data_dir: directory of the SL format data, enriched by progressive_build_vocab.py
    --model_dir: directory to save the model.
    --freeze_prev: Boolean. Whether to freeze the source column or not
    --best_prev: Boolean. Whether to use the optimized source column or a random source column
    --linear_adapter: Boolean. True for a linear adapter, False for a non-linear one (see the sketch after this list)
    --best_target: Boolean. Whether to load the optimized target column model before the progressive transfer.
    --pretrained_model_dir: The pre-trained model (experiment directory to be specified) from which the knowledge will be transferred.
  3. The option to use NER or TC as a source column is explained in the script.
  4. The evaluation on the test set is done during training and logged in the 'plots' directory of the experiment folder.
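
As referenced in the parameter list above, a hedged sketch of what the --linear_adapter and --freeze_prev flags plausibly toggle, in the spirit of the PNN paper; the function names and the non-linear adapter's exact shape are assumptions, not the repository's code.

```python
import torch.nn as nn

def make_adapter(dim, linear):
    """--linear_adapter=True -> plain linear map; False -> a small non-linear MLP."""
    if linear:
        return nn.Linear(dim, dim)
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

def maybe_freeze(column, freeze_prev):
    """--freeze_prev=True -> the source column's weights stay fixed during transfer."""
    if freeze_prev:
        for p in column.parameters():
            p.requires_grad = False
    return column
```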

TC → TC or NER → TC

  1. Follow steps 1-4 of the TC single-task setting
  2. Run src/booster/progressive_tc.py with the following parameters:
    --data_dir: directory of the SL format data, enriched by progressive_build_vocab.py
    --model_dir: directory to save the model.
    --freeze_prev: Boolean. Whether to freeze the source column or not
    --best_prev: Boolean. Whether to use the optimized source column or a random source column
    --linear_adapter: Boolean. True for a linear adapter, False for a non-linear one
    --best_target: Boolean. Whether to load the optimized target column model before the progressive transfer.
    --pretrained_model_dir: The pre-trained model (experiment directory to be specified) from which the knowledge will be transferred.
  3. The option to use NER or TC as a source column is explained in the script.
  4. The evaluation on the test set is done during training and logged in the 'plots' directory of the experiment folder.

[TC, TC] → NER or [NER, NER] → NER

  1. Follow steps 1-4 of the NER single-task setting
  2. Run src/booster/progressive_ner_3col.py with the following parameters:
    --data_dir: directory of the SL format data, enriched by progressive_build_vocab.py
    --model_dir: directory to save the model.
    --freeze_prev: Boolean. Whether to freeze the source column or not
    --best_prev: Boolean. Whether to use the optimized source column or a random source column
    --linear_adapter: Boolean. True for a linear adapter, False for a non-linear one
    --best_target: Boolean. Whether to load the optimized target column model before the progressive transfer.
    --c1_model_dir: The pre-trained model (experiment directory to be specified) from which the knowledge will be transferred, for the 1st source column.
    --c2_model_dir: The pre-trained model (experiment directory to be specified) from which the knowledge will be transferred, for the 2nd source column.
  3. The option to use NER or TC as a source column is explained in the script.
  4. The evaluation on the test set is done during training and logged in the 'plots' directory of the experiment folder.

[TC, TC] → TC or [NER, NER] → TC

  1. Follow steps 1-4 of the TC single-task setting
  2. Run src/booster/progressive_tc_3col.py with the following parameters:
    --data_dir: directory of the SL format data, enriched by progressive_build_vocab.py
    --model_dir: directory to save the model.
    --freeze_prev: Boolean. Whether to freeze the source column or not
    --best_prev: Boolean. Whether to use the optimized source column or a random source column
    --linear_adapter: Boolean. True for a linear adapter, False for a non-linear one
    --best_target: Boolean. Whether to load the optimized target column model before the progressive transfer.
    --c1_model_dir: The pre-trained model (experiment directory to be specified) from which the knowledge will be transferred, for the 1st source column.
    --c2_model_dir: The pre-trained model (experiment directory to be specified) from which the knowledge will be transferred, for the 2nd source column.
  3. The option to use NER or TC as a source column is explained in the script.
  4. The evaluation on the test set is done during training and logged in the 'plots' directory of the experiment folder.