Based on OpenNRE

The distant supervision model is based on the OpenNRE framework.

Key Information:

Pre-processing
- Please use this pre-processed data: Pre-processed NYT Data
TensorFlow version: tensorflow==1.14 (starting from the TACRED environment and then adding TensorFlow 1.14 seems to work)
- Installing from the official requirements.txt leads to many strange bugs
- Working Conda environment: base_OpenNRE_25_10_2019.yml
Memory:
- At least 32 GB (a laptop with 20 GB ran into out-of-memory and memory errors)
- Running time: about 7 hours on a server with 64 GB of memory
Test Results - NYT10 Dataset

PCNN + Attention: max F1: 0.4078 (Recall: 0.3697 Precision: 0.4546)
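
As a quick sanity check, the F1 value above can be recomputed from the reported precision and recall using the standard harmonic-mean formula. This is a minimal sketch; the function name `f1_score` is chosen here for illustration and is not part of OpenNRE.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported above for PCNN + Attention on NYT10.
f1 = f1_score(precision=0.4546, recall=0.3697)
print(round(f1, 4))  # 0.4078
```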
AUC Results:

Model	Attention	Maximum	Average
PCNN	0.3408	0.3247	0.3190
CNN	0.3277	0.3151	0.3044
RNN	0.3418	0.3473	0.3405
BiRNN	0.3352	0.3575	0.3244

Overview

OpenNRE is a TensorFlow-based framework for easily building relation extraction (RE) models. We divide the relation extraction pipeline into four parts: embedding, encoder, selector (for distant supervision) and classifier. For each part we have implemented several methods.

Embedding
- Word embedding
- Position embedding
Encoder
- PCNN
- CNN
- RNN
- Bidirectional RNN
Selector
- Attention
- Maximum
- Average
Classifier
- Softmax Loss Function
- Output

All of these methods can be combined freely.
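
To make the role of the selector more concrete, here is a minimal sketch of how the three selector types could combine the sentence vectors of one bag into a single bag vector. It is not the nrekit implementation: the function `select_bag`, its arguments, and the idea of passing in precomputed per-sentence scores are assumptions made for illustration. Only the mode names `att`, `max` and `ave` are taken from this README.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_bag(reprs: List[List[float]], scores: List[float], mode: str) -> List[float]:
    """Combine the sentence vectors of one bag into a single vector.

    reprs:  one encoder output vector per sentence in the bag
    scores: one relevance score per sentence (hypothetical input; how the
            scores are produced is outside the scope of this sketch)
    mode:   "att" (attention), "max" (maximum) or "ave" (average)
    """
    n = len(reprs)
    if mode == "ave":
        weights = [1.0 / n] * n               # every sentence counts equally
    elif mode == "max":
        best = max(range(n), key=lambda i: scores[i])
        weights = [1.0 if i == best else 0.0 for i in range(n)]  # best sentence only
    elif mode == "att":
        weights = softmax(scores)             # soft weighting by score
    else:
        raise ValueError(f"unknown selector: {mode}")
    dim = len(reprs[0])
    return [sum(w * r[d] for w, r in zip(weights, reprs)) for d in range(dim)]


bag = [[1.0, 0.0], [0.0, 1.0]]
print(select_bag(bag, scores=[0.1, 2.0], mode="max"))  # [0.0, 1.0]
```

Because all three modes reduce to a weighted sum, any encoder that produces one vector per sentence can be paired with any selector in this sketch.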

We also provide training and testing frameworks for sentence-level RE and bag-level RE. A plotting tool is also included in the package.

This project is under the MIT license.

Requirements

pip install -r requirements.txt

Python (>=2.7)
Numpy (>=1.13.3)
TensorFlow (>=1.4.1)
- CUDA (>=8.0) if you are using a GPU
Matplotlib (>=2.0.0)
scikit-learn (>=0.18)

Data Format

For training and testing, you should provide four JSON files including training data, testing data, word embedding data and relation-ID mapping data.

Training Data & Testing Data

The training and testing data files contain sentences with their corresponding entity pairs and relations, and should be in the following format:

[
    {
        "sentence": "Bill Gates is the founder of Microsoft .",
        "head": {"word": "Bill Gates", "id": "m.03_3d", ...(other information)},
        "tail": {"word": "Microsoft", "id": "m.07dfk", ...(other information)},
        "relation": "founder"
    },
    ...
]

IMPORTANT: In the sentence field, words and punctuation marks should be separated by spaces.
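
If your raw text is not yet tokenized this way, a simple regular expression can insert the spaces. The helper below is a rough sketch, not part of OpenNRE; `space_out_punctuation` is a name made up for this example. Note that it also splits abbreviations and contractions (for example "U.S." or "don't"), so a proper tokenizer may be a better choice for real data.

```python
import re


def space_out_punctuation(text: str) -> str:
    """Surround every non-word, non-space character with spaces."""
    spaced = re.sub(r"([^\w\s])", r" \1 ", text)
    # Collapse the runs of whitespace introduced by the substitution.
    return " ".join(spaced.split())


print(space_out_punctuation("Bill Gates is the founder of Microsoft."))
# Bill Gates is the founder of Microsoft .
```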

Word Embedding Data

Word embedding data is used to initialize the word embeddings in the networks, and should be in the following format:

[
    {"word": "the", "vec": [0.418, 0.24968, ...]},
    {"word": ",", "vec": [0.013441, 0.23682, ...]},
    ...
]
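
As an illustration of how such a file can be consumed, the sketch below turns the list of entries into a word-to-index mapping and a matrix whose rows follow that index. It is an assumption about one reasonable way to load the data, not the loader used by nrekit; `build_vocab` is a hypothetical name, and the sample vectors are cut down to two dimensions.

```python
import json
from typing import Dict, List, Tuple


def build_vocab(entries: List[dict]) -> Tuple[Dict[str, int], List[List[float]]]:
    """Return (word -> row index, embedding matrix as a list of rows)."""
    word2id: Dict[str, int] = {}
    matrix: List[List[float]] = []
    for entry in entries:
        word = entry["word"]
        if word in word2id:
            continue  # keep the first vector if a word is listed twice
        word2id[word] = len(matrix)
        matrix.append(entry["vec"])
    # All vectors must share one dimensionality.
    if len({len(vec) for vec in matrix}) > 1:
        raise ValueError("word vectors have inconsistent dimensions")
    return word2id, matrix


# Two-dimensional sample in the format described above (real vectors are longer).
sample = json.loads(
    '[{"word": "the", "vec": [0.418, 0.24968]},'
    ' {"word": ",", "vec": [0.013441, 0.23682]}]'
)
word2id, matrix = build_vocab(sample)
print(word2id)  # {'the': 0, ',': 1}
```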

Relation-ID Mapping Data

This file maps each relation to an ID, so that the same ID always refers to the same relation across training and testing runs. Its format is as follows:

{
    "NA": 0,
    "relation_1": 1,
    "relation_2": 2,
    ...
}

IMPORTANT: Make sure the ID of NA is always 0.
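
If you need to create this mapping for your own dataset, one possible approach is to collect the relation names from the training data and number them after reserving 0 for NA. The sketch below assumes instances in the training data format described earlier; `build_rel2id` is a hypothetical helper, and the sorted ordering is just one way to keep the IDs stable between runs.

```python
from typing import Dict, List


def build_rel2id(instances: List[dict]) -> Dict[str, int]:
    """Assign stable IDs to relations, always keeping 'NA' at ID 0."""
    rel2id = {"NA": 0}
    # Sorting makes the assignment deterministic across runs.
    for relation in sorted({inst["relation"] for inst in instances}):
        if relation not in rel2id:
            rel2id[relation] = len(rel2id)
    return rel2id


train = [
    {"relation": "founder"},
    {"relation": "NA"},
    {"relation": "place_of_birth"},
    {"relation": "founder"},
]
print(build_rel2id(train))  # {'NA': 0, 'founder': 1, 'place_of_birth': 2}
```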

Provided Data

NYT10 Dataset

NYT10 is a distantly supervised dataset originally released with the paper "Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions without labeled text.". Here is the download link for the original data.

We provide a toolkit to convert the original NYT10 data into the JSON format that OpenNRE uses. You can download the original data and the toolkit from Google Drive or Tsinghua Cloud. Further instructions are included in the toolkit.

Installation and Quick Start

Install all the required packages.
Clone the OpenNRE repository:

git clone https://github.com/thunlp/OpenNRE.git

Because the project has a long commit history and the .git folder is large, you can use the following command to download only the latest commit:

git clone https://github.com/thunlp/OpenNRE.git --depth 1

Create the data folder with the following structure:

OpenNRE
|-- ... 
|-- data
    |
    |-- {DATASET_NAME_1}
    |   |
    |   |-- train.json
    |   |-- test.json
    |   |-- word_vec.json
    |   |-- rel2id.json
    |
    |-- {DATASET_NAME_2}
    |   |
    |   |-- ...
    |
    |-- ...

You can use your own data or download the datasets provided above.
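
Before starting a long training run, it can help to confirm that a dataset folder contains the four expected files. The following is a small sketch of such a check; `missing_files` is a hypothetical helper, and the folder name `nyt` in the usage line is assumed from the training example in this README.

```python
from pathlib import Path
from typing import List

# File names taken from the folder structure shown above.
REQUIRED_FILES = ("train.json", "test.json", "word_vec.json", "rel2id.json")


def missing_files(data_root: str, dataset_name: str) -> List[str]:
    """Return the required files that are absent from data_root/dataset_name."""
    dataset_dir = Path(data_root) / dataset_name
    return [name for name in REQUIRED_FILES if not (dataset_dir / name).is_file()]


# Run from the repository root; an empty list means the folder is complete.
print(missing_files("data", "nyt"))
```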

Run train_demo.py {DATASET_NAME} {ENCODER_NAME} {SELECTOR_NAME}. For example, to train a model with PCNN as the encoder and attention as the selector on the nyt dataset, run the following command:

python train_demo.py nyt pcnn att

Currently {ENCODER_NAME} can be pcnn, cnn, rnn or birnn, and {SELECTOR_NAME} can be att (for attention), max (for maximum) or ave (for average). The model is automatically named {DATASET_NAME}_{ENCODER_NAME}_{SELECTOR_NAME}.
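
The naming rule and the set of valid combinations can be written down in a few lines. This is only an illustrative sketch of the convention described above, not code from train_demo.py; the function `model_name` is made up for the example.

```python
from itertools import product

ENCODERS = ("pcnn", "cnn", "rnn", "birnn")
SELECTORS = ("att", "max", "ave")


def model_name(dataset: str, encoder: str, selector: str) -> str:
    """Build the {DATASET_NAME}_{ENCODER_NAME}_{SELECTOR_NAME} model name."""
    if encoder not in ENCODERS:
        raise ValueError(f"unknown encoder: {encoder}")
    if selector not in SELECTORS:
        raise ValueError(f"unknown selector: {selector}")
    return f"{dataset}_{encoder}_{selector}"


print(model_name("nyt", "pcnn", "att"))  # nyt_pcnn_att

# All encoder/selector pairs for one dataset (4 encoders x 3 selectors).
all_models = [model_name("nyt", enc, sel) for enc, sel in product(ENCODERS, SELECTORS)]
print(len(all_models))  # 12
```

A list like `all_models` can be passed to draw_plot.py to compare every combination on one plot, provided each model has been trained.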

The checkpoint of the best epoch (each epoch will be validated while training) will be saved in ./checkpoint and results for plotting precision-recall curve will be saved in ./test_result by default.

Use draw_plot.py to check the AUC, average precision, F1 score and precision-recall curve with the following command:

python draw_plot.py {MODEL_NAME_1} {MODEL_NAME_2} ...

The results for all the listed models will be printed, and a plot with the precision-recall curves of all of them will be saved in ./test_result/pr_curve.png.
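
For readers who want to see how such metrics relate to ranked predictions, here is a minimal, dependency-free sketch. It does not reproduce draw_plot.py: the input format (a list of `(score, is_correct)` pairs plus the number of true facts) and all function names are assumptions for illustration, and the area is computed with a plain trapezoidal rule over the curve points.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (recall, precision)


def pr_curve(scored: List[Tuple[float, bool]], total_positives: int) -> List[Point]:
    """Precision-recall points obtained by walking down the ranked predictions."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    points: List[Point] = []
    correct = 0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        points.append((correct / total_positives, correct / rank))
    return points


def area_under_curve(points: List[Point]) -> float:
    """Trapezoidal area under the curve, with recall on the x axis."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


def max_f1(points: List[Point]) -> float:
    """Best F1 score over all points of the curve."""
    return max((2 * p * r / (p + r)) if (p + r) else 0.0 for r, p in points)


# Toy predictions: (confidence score, whether the predicted fact is correct).
toy = [(0.9, True), (0.8, False), (0.7, True)]
curve = pr_curve(toy, total_positives=2)
print(round(area_under_curve(curve), 4))  # 0.2917
print(round(max_f1(curve), 4))            # 0.8
```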

If you have the checkpoint of the model and want to evaluate it, run test_demo.py {DATASET_NAME} {ENCODER_NAME} {SELECTOR_NAME}. For example:

python test_demo.py nyt pcnn att

The prediction results will be stored in test_result/nyt_pcnn_att_pred.json.

Additional Modules

Reinforcement Learning (Feng et al. 2018)

We have implemented a reinforcement learning module following (Feng et al. 2018). There might be some slight differences in implementation details. The RL code is in nrekit/rl.py, and it can be added to any model by running:

python train_demo.py {DATASET_NAME} {ENCODER_NAME} {SELECTOR_NAME} rl

For example, by running python train_demo.py nyt pcnn att rl, you will get a pcnn_att model trained by RL.

For how the RL module helps alleviate the false-positive problem in distantly supervised (DS) data, please refer to the paper.

Reference

Neural Relation Extraction with Selective Attention over Instances. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, Maosong Sun. ACL 2016.
Adversarial Training for Relation Extraction. Yi Wu, David Bamman, Stuart Russell. EMNLP 2017.
A Soft-label Method for Noise-tolerant Distantly Supervised Relation Extraction. Tianyu Liu, Kexiang Wang, Baobao Chang, Zhifang Sui. EMNLP 2017.
Reinforcement Learning for Relation Classification from Noisy Data. Jun Feng, Minlie Huang, Li Zhao, Yang Yang, Xiaoyan Zhu. AAAI 2018.
