
Knowledge Base Enrichment in Conversational Domain

This repository contains the implementations for my MSc dissertation. We adapted several state-of-the-art document-level relation extraction (RE) models and conducted thorough evaluations on DocRED and DialogRE.

Datasets

DocRED

Please download it here, provided by DocRED: A Large-Scale Document-Level Relation Extraction Dataset.

DialogRE

Please download it here, provided by Dialogue-Based Relation Extraction.

Pre-processing

DialogRE needs to be converted to the same format as DocRED (a sketch of the target schema follows the steps below).

  • Enter the directory:

    cd dialogre/data_processing

  • Run the shell script:

    source process_docred.sh

    Three files will be generated under ../data/processed:

    train_annotated.json, dev.json, test.json

    Note: the file names match DocRED's for convenience.
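For reference, each converted document follows the DocRED JSON schema: sents holds tokenized sentences, vertexSet groups the mentions of each entity, and labels lists relation triples by entity index. A purely illustrative sketch (the title, tokens, and relation name below are made up, not taken from the data):

{
  "title": "dialogue_0",
  "sents": [["Speaker", "1", ":", "Hey", "Pheebs", "!"]],
  "vertexSet": [
    [{"name": "Speaker 1", "sent_id": 0, "pos": [0, 2], "type": "PER"}],
    [{"name": "Pheebs", "sent_id": 0, "pos": [4, 5], "type": "PER"}]
  ],
  "labels": [{"h": 0, "t": 1, "r": "per:friends", "evidence": [0]}]
}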

BiLSTM

Main directory:

cd docred

Adapted from:

https://github.com/thunlp/DocRED/tree/master

Reference paper:

DocRED: A Large-Scale Document-Level Relation Extraction Dataset

Requirements and Installation

python3

pytorch>=1.0

pip3 install -r requirements.txt

Data Preprocessing

DocRED

Download the metadata files from TsinghuaCloud or GoogleDrive for the baseline method and put them into the prepro_data folder.

DialogRE

Replace the rel2id.json under prepro_data with dialogre/data_processing/rel2id.json
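For orientation, rel2id.json maps each relation type to the integer index used by the code; the DialogRE version maps DialogRE relation types rather than Wikidata property IDs. A purely illustrative sketch (the keys and indices are examples, not the actual file contents):

{"Na": 0, "per:friends": 1, "per:title": 2, "per:alternate_names": 3}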

  • Run the script:
$ cd code
$ python3 gen_data.py --in_path ../data --out_path prepro_data

Train

$ cd code
$ CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name BiLSTM --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_dev

Note: for DialogRE, change self.relation_num to 37 (see the sketch below).
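A minimal sketch of that change, assuming the attribute lives in config/Config.py as in the upstream thunlp/DocRED code:

class Config(object):
    def __init__(self):
        # DocRED: 97 = 96 relation types + the null "Na" class
        # DialogRE: 37 = 36 relation types + the null class
        self.relation_num = 37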

Test

$ cd code
$ CUDA_VISIBLE_DEVICES=0 python3 test.py --model_name BiLSTM --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_dev --input_theta 0.3601

Note: --input_theta is the classification threshold; 0.3601 is an example, use the best threshold reported on the dev set during training.

BERT-Embed

Main directory:

cd DocRed-BERT

Adapted from:

https://github.com/hongwang600/DocRed/tree/master

Reference paper:

Fine-tune Bert for DocRED with Two-step Process

Note: please refer to the BiLSTM section for data preprocessing, training, and testing.

Sent-Model

Main directory:

cd DocRed-sent_level_enc

Adapted from:

https://github.com/hongwang600/DocRed/tree/sent_level_enc

Reference paper:

Fine-tune Bert for DocRED with Two-step Process

Note: please refer to the BiLSTM section for data preprocessing, training, and testing.

Graph-LSR

Main directory: cd LSR

Adapted from https://github.com/nanguoshun/LSR/tree/master

Reference paper: Reasoning with Latent Structure Refinement for Document-Level Relation Extraction

Requirements

python==3.6.7
torch==1.3.1 + CUDA 9.2, or torch==1.5.1 + CUDA 10.1
tqdm==4.29.1
numpy==1.15.4
spacy==2.1.3
networkx==2.4

Data Preprocessing

DocRED

Download the metadata files from TsinghuaCloud or GoogleDrive for the baseline method and put them into the prepro_data folder.

DialogRE

Replace the rel2id.json under prepro_data with dialogre/data_processing/rel2id.json

  • Run the script:
$ cd code
$ python3 gen_data.py 

Training

In order to train the model, run:

$ cd code
$ python3 train.py

Note: for DialogRE, change self.relation_num to 37, as in the BiLSTM section above.

Test

After training, test the model with:

python3 test.py

BERT-LSR

Main directory: cd LSR_BERT

Adapted from https://github.com/nanguoshun/LSR/tree/master

Reference paper: Reasoning with Latent Structure Refinement for Document-Level Relation Extraction

Requirements

python==3.6.7
torch==1.3.1 + CUDA 9.2, or torch==1.5.1 + CUDA 10.1
tqdm==4.29.1
numpy==1.15.4
spacy==2.1.3
networkx==2.4
pytorch-transformers==1.2.0

Data Preprocessing

DocRED

Download the metadata files from TsinghuaCloud or GoogleDrive for the baseline method and put them into the prepro_data folder.

  • Run the script
$ cd code
$ python3 gen_data.py 

DialogRE

Replace the rel2id.json under prepro_data with dialogre/data_processing/rel2id.json

  • Run the script
$ cd code
$ python3 gen_data_bert.py 

Training

In order to train the model, run:

$ cd code
$ python3 train.py

Test

After training, test the model with:

python3 test.py

Graph-EOG

Main directory:

cd edge-oriented-graph

Adapted from:

https://github.com/fenchri/edge-oriented-graph/tree/master

Reference paper:

Connecting the Dots: Document-level Relation Extraction with Edge-oriented Graphs

Environment

$ pip3 install -r requirements.txt

Datasets & Pre-processing

Download the two datasets first.

$ mkdir data && cd data
$ mkdir DocRED && mkdir Dialogue
$ # put dev_train.json dev_dev.json dev_test.json of the two datasets in each directory
$ cd ..

Both datasets must first be transformed into the PubTator format (a sketch of the format follows).
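PubTator is a plain-text format: a pipe-delimited text line per document, followed by one tab-separated line per entity mention (document ID, character offsets, mention text, type, entity ID) and per relation. A rough illustrative sketch (IDs, offsets, and the relation-line layout are examples; the processing scripts define the exact fields, and tabs are shown here as spaces):

1001|t|Speaker 1 : Hey Pheebs !
1001    0    9    Speaker 1    PER    T1
1001    16   22   Pheebs       PER    T2
1001    relation    per:friends    T1    T2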

Run the processing scripts as follows:

$ sh process_docred.sh #DocRED
$ sh process_dialogue.sh #DialogRE

To get the data statistics, run:

  • DocRED
python3 statistics.py --data ../data/DocRED/processed/dev_train.data
python3 statistics.py --data ../data/DocRED/processed/dev_dev.data
python3 statistics.py --data ../data/DocRED/processed/dev_test.data
  • DialogRE
python3 statistics.py --data ../data/Dialogue/processed/dev_train.data
python3 statistics.py --data ../data/Dialogue/processed/dev_dev.data
python3 statistics.py --data ../data/Dialogue/processed/dev_test.data

This will additionally generate the gold-annotation file in the same folder with suffix .gold.

Pre-trained Word Embeddings

The original model used pre-trained PubMed embeddings; we use GloVe embeddings instead.

Please download the GloVe embeddings and put them under ./embeds (a download sketch follows).
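One possible way to fetch them (which GloVe file the configs expect is an assumption here; the 840B.300d vectors are shown as an example):

$ mkdir -p embeds && cd embeds
$ wget http://nlp.stanford.edu/data/glove.840B.300d.zip
$ unzip glove.840B.300d.zip
$ cd ..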

Train

  • DocRED
$ cd src/
$ python3 eog.py --config ../configs/parameters_docred.yaml --train --gpu 0  
  • DialogRE
$ cd src/ 
$ python3 eog.py --config ../configs/parameters_dialogue.yaml --train --gpu 0 

Test

$ python3 eog.py --config ../configs/parameters_docred.yaml --test --gpu 0

Post-processing

To evaluate the results, the prediction file test.preds needs to be converted to the same format as DocRED:

  • DocRED
$ mkdir ../data/DocRED 
$ # put the test.preds and rel2id.json under the directory
$ python3 convert2DocREDFormat.py --data DocRED
  • DialogRE
$ mkdir ../data/Dialogue 
$ # put the test.preds and rel2id.json under the directory
$ python3 convert2DocREDFormat.py --data Dialogue
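After conversion, the predictions follow the DocRED result format: a JSON list of relation triples indexed into each document's vertexSet. An illustrative sketch (titles, indices, and relation names are made up):

[
  {"title": "dialogue_0", "h_idx": 0, "t_idx": 1, "r": "per:friends"},
  {"title": "dialogue_1", "h_idx": 2, "t_idx": 0, "r": "per:title"}
]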

DialogRE

Main directory:

cd dialogre

Adapted from:

https://github.com/nlpdata/dialogre/tree/master

Reference Paper:

Dialogue-Based Relation Extraction

Environment

Python 3.6 and PyTorch 1.0.

Preparation

  • kb/Fandom_triples: relational triples from Fandom.
  • kb/matching_table.txt: mapping from Fandom relational types to DialogRE relation types.
  • bert folder: a re-implementation of BERT and BERTS baselines.
    1. Download and unzip BERT from here, and set up the environment variable for BERT by export BERT_BASE_DIR=/PATH/TO/BERT/DIR.
    2. Copy the dataset folder data to bert/.
    3. In bert, execute python convert_tf_checkpoint_to_pytorch.py --tf_checkpoint_path=$BERT_BASE_DIR/bert_model.ckpt --bert_config_file=$BERT_BASE_DIR/bert_config.json --pytorch_dump_path=$BERT_BASE_DIR/pytorch_model.bin.
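The preparation steps above, collected as shell commands (the BERT path is a placeholder):

$ export BERT_BASE_DIR=/PATH/TO/BERT/DIR
$ cp -r data bert/
$ cd bert
$ python convert_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path=$BERT_BASE_DIR/bert_model.ckpt \
    --bert_config_file=$BERT_BASE_DIR/bert_config.json \
    --pytorch_dump_path=$BERT_BASE_DIR/pytorch_model.bin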

Train

To run the BERTS baseline, execute the following commands in bert:

$ cd bert

$ python run_classifier.py \
    --task_name berts \
    --do_train --do_eval \
    --data_dir . \
    --vocab_file $BERT_BASE_DIR/vocab.txt \
    --bert_config_file $BERT_BASE_DIR/bert_config.json \
    --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin \
    --max_seq_length 512 \
    --train_batch_size 24 \
    --learning_rate 3e-5 \
    --num_train_epochs 20.0 \
    --output_dir berts_f1 \
    --gradient_accumulation_steps 2

$ rm berts_f1/model_best.pt && cp -r berts_f1 berts_f1c
$ python run_classifier.py \
    --task_name bertsf1c \
    --do_eval \
    --data_dir . \
    --vocab_file $BERT_BASE_DIR/vocab.txt \
    --bert_config_file $BERT_BASE_DIR/bert_config.json \
    --init_checkpoint $BERT_BASE_DIR/pytorch_model.bin \
    --max_seq_length 512 \
    --train_batch_size 24 \
    --learning_rate 3e-5 \
    --num_train_epochs 20.0 \
    --output_dir berts_f1c \
    --gradient_accumulation_steps 2

Test

To evaluate the BERTS baseline, execute the following commands in bert:

$ cd bert
$ python evaluate.py --f1dev berts_f1/logits_dev.txt --f1test berts_f1/logits_test.txt --f1cdev berts_f1c/logits_dev.txt --f1ctest berts_f1c/logits_test.txt

Evaluations

Main directory:

cd Evaluation

Put train_annotated.json, dev.json, test.json, and the prediction results dev_test_index.json under code/DocRED/re_data or code/Dialogue/re_data.

  • F1-score versus relation types
$ cd code
$ python3 eval_re_type.py --data DocRED|Dialogue 

Specifically, to evaluate BERTS:

$ cd ../dialogre/bert
$ python3 evaluate_rel_type.py 
  • F1-score of intra- vs. inter-sentential relations
$ cd code
$ python3 eval_re_intra_inter.py --data DocRED|Dialogue 
  • F1-score versus relation distances
$ cd code
$ python3 eval_re_dist.py --data DocRED|Dialogue
  • Distributions of relation types
$ cd code
$ python3 get_re_type_distri.py --data DocRED|Dialogue 
  • Distributions of intra- vs. inter-sentential relations
$ cd code
$ python3 get_re_intra_inter_distri.py --data DocRED|Dialogue
  • Distributions of relation distances
$ cd code
$ python3 get_re_dist_distri.py --data DocRED|Dialogue
  • Distributions of relation distances for date_of_birth and part_of
$ cd code
$ python3 get_dist_distri_given_re_type.py --inputfile train_annotated.json|dev.json|test.json --type date_of_birth|part_of
  • F1-score of intra- versus inter-sentential relations for date_of_birth and part_of
$ cd code
$ python3 eval_intra_inter_given_re_type.py --data DocRED|Dialogue --type date_of_birth|part_of

Known Issues

  1. A bug in Graph-LSR, reported in the authors' repository: nanguoshun/LSR#9

    Our current workaround:

    • Graph-LSR: change the batch size from 20 to 10.
    • BERT-LSR: change the batch size from 20 to 10, and make the number of documents an integer multiple of the batch size (see the sketch below).
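    A minimal sketch of the second workaround (the function and variable names are ours, not the repo's):

    def truncate_to_batch_multiple(documents, batch_size=10):
        # Drop trailing documents so the count divides evenly into batches.
        n = (len(documents) // batch_size) * batch_size
        return documents[:n]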

Acknowledgement

We acknowledge that the original models and source code belong to the authors of the officially published papers and released code listed below.

We also drew on the descriptions in these open-source repositories when writing this README.

References

[1] DocRED: A Large-Scale Document-Level Relation Extraction Dataset

[2] Fine-tune Bert for DocRED with Two-step Process

[3] Reasoning with Latent Structure Refinement for Document-Level Relation Extraction

[4] Connecting the Dots: Document-level Relation Extraction with Edge-oriented Graphs

[5] Dialogue-Based Relation Extraction

Open Source Repositories

[1] https://github.com/thunlp/DocRED/tree/master

[2] https://github.com/hongwang600/DocRed/tree/master

[3] https://github.com/nanguoshun/LSR/tree/master

[4] https://github.com/fenchri/edge-oriented-graph/tree/master

[5] https://github.com/nlpdata/dialogre/tree/master
