information-extraction

2019 Language and Intelligence Challenge: Information Extraction

Prerequisites

Install the required packages:

pip install -r requirements.txt

Data

Download the data: initialize and update the information-extraction-data git submodule with the commands below, then unzip the data files.

git submodule init
git submodule update

Sample schema:

{"object_type": "地点", "predicate": "祖籍", "subject_type": "人物"}

Sample data. The postag and text fields are the input, and spo_list is the expected output:

{
    "postag": [
        {"word": "一直", "pos": "d"}, 
        {"word": "陪", "pos": "v"}, 
        {"word": "我", "pos": "r"}, 
        {"word": "到", "pos": "p"}, 
        {"word": "现在", "pos": "t"}, 
        {"word": "是", "pos": "v"}, 
        {"word": "歌手", "pos": "n"}, 
        {"word": "马健涛", "pos": "nr"}, 
        {"word": "原创", "pos": "v"}, 
        {"word": "的", "pos": "u"}, 
        {"word": "歌曲", "pos": "n"}
    ], 
    "text": "一直陪我到现在是歌手马健涛原创的歌曲", 
    "spo_list": [
        {"predicate": "歌手", "object_type": "人物", "subject_type": "歌曲", "object": "马健涛", "subject": "一直陪我到现在"}
    ]
}
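As a minimal sketch of how such a record might be read, the snippet below parses one JSON object and separates the input fields from the target triples. It assumes each record is available as a single JSON string; the function name parse_record is hypothetical, and the postag list is shortened for brevity.

```python
import json

# One record in the format shown above (postag list shortened for brevity).
line = (
    '{"postag": [{"word": "歌手", "pos": "n"}, {"word": "马健涛", "pos": "nr"}], '
    '"text": "一直陪我到现在是歌手马健涛原创的歌曲", '
    '"spo_list": [{"predicate": "歌手", "object_type": "人物", "subject_type": "歌曲", '
    '"object": "马健涛", "subject": "一直陪我到现在"}]}'
)


def parse_record(raw):
    """Split one JSON record into inputs (text, words, POS tags) and target triples."""
    record = json.loads(raw)
    words = [token["word"] for token in record["postag"]]
    pos_tags = [token["pos"] for token in record["postag"]]
    # spo_list may be absent on unlabeled data, so default to an empty list.
    triples = [
        (spo["subject"], spo["predicate"], spo["object"])
        for spo in record.get("spo_list", [])
    ]
    return record["text"], words, pos_tags, triples


text, words, pos_tags, triples = parse_record(line)
print(text)
print(triples)
```

Reading spo_list with a default keeps the same function usable on records that carry only postag and text.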

Baseline

baidu/information-extraction

Idea

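1. Train a multi-label classification model to predict which predicates appear in a text.
2. Train a sequence labeling model that takes the text and a predicate as input and outputs a label for each position in the text.
3. Extract SPO (subject, predicate, object) triples from the sequence labeling result.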
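The last step above can be sketched as follows. This is an illustration only: it assumes character-level tags in a BIO scheme with the hypothetical label names B-SUB, I-SUB, B-OBJ, I-OBJ and O, which may differ from the tag set the labeling model in this repository actually uses.

```python
def extract_spans(text, tags, role):
    """Return the substrings of text covered by B-<role> followed by I-<role> tags."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == f"B-{role}":
            # A new span begins; close any span that is still open.
            if start is not None:
                spans.append(text[start:i])
            start = i
        elif tag == f"I-{role}" and start is not None:
            continue  # still inside the current span
        else:
            if start is not None:
                spans.append(text[start:i])
            start = None
    if start is not None:
        spans.append(text[start:])
    return spans


def extract_spo(text, predicate, tags):
    """Pair every subject span with every object span for the given predicate."""
    subjects = extract_spans(text, tags, "SUB")
    objects = extract_spans(text, tags, "OBJ")
    return [
        {"subject": s, "predicate": predicate, "object": o}
        for s in subjects
        for o in objects
    ]


text = "一直陪我到现在是歌手马健涛原创的歌曲"
# Hypothetical labeling result for the predicate "歌手", one tag per character.
tags = ["B-SUB"] + ["I-SUB"] * 6 + ["O"] * 3 + ["B-OBJ", "I-OBJ", "I-OBJ"] + ["O"] * 5
spo_list = extract_spo(text, "歌手", tags)
print(spo_list)
```

Pairing every subject with every object is the simplest choice; a text with several subjects and objects for one predicate would need a more careful matching rule.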

Implementation

Check report/PRML-final-project-doc-2019.pdf for details.

Multi-label Classification

CNN, BiRNN, BiLSTM, BiLSTM with max pooling and RCNN
BERT
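Whichever encoder is used, multi-label classification typically means one independent yes/no decision per predicate. The sketch below shows the target encoding and the thresholded prediction in plain Python. The three-item predicate vocabulary, the 0.5 threshold and the example logits are assumptions made for illustration, not values taken from this repository.

```python
import math

# Hypothetical predicate vocabulary; the real schema defines many more predicates.
PREDICATES = ["祖籍", "歌手", "作者"]


def multi_hot(spo_list):
    """Encode the predicates in spo_list as a multi-hot target vector."""
    present = {spo["predicate"] for spo in spo_list}
    return [1.0 if p in present else 0.0 for p in PREDICATES]


def predict_predicates(logits, threshold=0.5):
    """Apply a sigmoid to each logit and keep predicates at or above the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [p for p, prob in zip(PREDICATES, probs) if prob >= threshold]


target = multi_hot([{"predicate": "歌手"}])
predicted = predict_predicates([-2.0, 3.0, -0.5])  # made-up model outputs
print(target, predicted)
```

Because each predicate is scored independently, a text can yield zero, one or several predicates, and each predicted predicate is then passed to the sequence labeling model.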

Sequence Labeling

Encoder: BiLSTM and Transformer
Decoder: CRF
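At inference time a linear-chain CRF decoder usually picks the highest-scoring tag sequence with Viterbi decoding. The sketch below shows that algorithm in plain Python under simplifying assumptions: emission scores come from the encoder, transition scores are a fixed matrix, and start and end transitions are omitted. The scores and the three-tag set are made up, and the repository's own CRF implementation may differ.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]: score of tag j at position t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(num_tags):
            # Best previous tag for landing on tag j at this position.
            best_prev = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_prev] + transitions[best_prev][j] + emission[j])
            step_back.append(best_prev)
        scores = step_scores
        backpointers.append(step_back)
    # Follow the backpointers from the best final tag to recover the path.
    path = [max(range(num_tags), key=lambda j: scores[j])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path


TAGS = ["O", "B", "I"]
# Made-up scores: the transition O -> I is heavily penalized as invalid.
transitions = [[0.0, 0.0, -1e4], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
emissions = [[1.0, 0.8, 0.0], [0.0, 0.0, 2.0], [1.5, 0.0, 0.0]]
best_path = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
print(best_path)
```

In this example a greedy per-position choice would produce O, I, O, which contains the penalized O to I transition, while Viterbi decoding returns B, I, O.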

Result

Multi-label Classification

Sequence Labeling

fitlog usage

Initialize fitlog in the classification folder:

cd classification/
fitlog init
fitlog log logs

Initialize fitlog in the labeling folder:

cd labeling/
fitlog init
fitlog log logs

Author

Zhongyu Chen
