BERT for Character Identification

A character identification model based on BERT for Coreference Resolution: Baselines and Analysis.

Task Definition

Datasets

Format

All datasets follow the CoNLL 2012 Shared Task data format. Documents are delimited by the comments in the following format:

#begin document (<Document ID>)[; part ###]
...
#end document
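
As a minimal sketch of how these delimiters can be used, the Python function below groups the lines of a file by document. The function name `split_documents` and the `<Document ID>#<part>` key format are illustrative choices, not part of this repository.

```python
import re
from typing import Dict, List

# Matches "#begin document (<Document ID>)" with an optional "; part ###" suffix.
BEGIN_RE = re.compile(
    r"^#begin document \((?P<doc_id>.*?)\)(?:; part (?P<part>\d+))?\s*$"
)


def split_documents(text: str) -> Dict[str, List[str]]:
    """Group the lines between #begin/#end comments by document key."""
    documents: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        match = BEGIN_RE.match(line)
        if match:
            # Hypothetical key format: "<doc_id>" or "<doc_id>#<part>".
            doc_id, part = match.group("doc_id"), match.group("part")
            key = doc_id if part is None else f"{doc_id}#{part}"
            current = documents.setdefault(key, [])
        elif line.startswith("#end document"):
            current = None
        elif current is not None:
            # Rows (and blank lines) inside a document are kept verbatim.
            current.append(line)
    return documents
```

Lines outside any `#begin document` / `#end document` pair are ignored by this sketch.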

Each sentence is delimited by a new line ("\n") and each column indicates the following:

Document ID: /<name of the show>-<season ID><episode ID> (e.g., /friends-s01e01).
Scene ID: the ID of the scene within the episode.
Token ID: the ID of the token within the sentence.
Word form: the tokenized word.
Part-of-speech tag: the part-of-speech tag of the word (auto generated).
Constituency tag: the Penn Treebank style constituency tag (auto generated).
Lemma: the lemma of the word (auto generated).
Frameset ID: not provided (always -).
Word sense: not provided (always -).
Speaker: the speaker of this sentence.
Named entity tag: the named entity tag of the word (auto generated).
Start time: the start time of the sentence in the video, in milliseconds.
End time: the end time of the sentence in the video, in milliseconds.
Video file: the file name of a pickle object holding the pre-processed sequence of image files from the video that corresponds to the sentence (the pickle objects will be released on 08/01).
Entity ID: the entity ID of the mention, which is consistent across all documents.
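
The column layout above can be read into a typed record. The following is a sketch only: the `Token` class, its field names, and `parse_row` are hypothetical, and it assumes that no column value contains whitespace.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One row of the dataset; field names are illustrative."""
    doc_id: str
    scene_id: int
    token_id: int
    word: str
    pos: str
    constituency: str
    lemma: str
    frameset_id: str
    word_sense: str
    speaker: str
    named_entity: str
    start_ms: int
    end_ms: int
    video_file: str
    entity: str


def parse_row(line: str) -> Token:
    """Split one whitespace-separated row into the 15 columns listed above."""
    cols = line.split()
    if len(cols) != 15:
        raise ValueError(f"expected 15 columns, got {len(cols)}")
    return Token(
        cols[0], int(cols[1]), int(cols[2]),
        *cols[3:11],  # word form through named entity tag, kept as strings
        int(cols[11]), int(cols[12]), cols[13], cols[14],
    )
```

The entity column is left as a raw string here because a mention may span several rows.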

Here is a sample from the training dataset:

/friends-s01e01  0  0  He     PRP   (TOP(S(NP*)    he     -  -  Monica_Geller   *  55422 59256 00005.pickle (284)
/friends-s01e01  0  1  's     VBZ          (VP*    be     -  -  Monica_Geller   *  55422 59256 00005.pickle -
/friends-s01e01  0  2  just   RB        (ADVP*)    just   -  -  Monica_Geller   *  55422 59256 00005.pickle -
/friends-s01e01  0  3  some   DT        (NP(NP*    some   -  -  Monica_Geller   *  55422 59256 00005.pickle -
/friends-s01e01  0  4  guy    NN             *)    guy    -  -  Monica_Geller   *  55422 59256 00005.pickle (284)
/friends-s01e01  0  5  I      PRP  (SBAR(S(NP*)    I      -  -  Monica_Geller   *  55422 59256 00005.pickle (248)
/friends-s01e01  0  6  work   VBP          (VP*    work   -  -  Monica_Geller   *  55422 59256 00005.pickle -
/friends-s01e01  0  7  with   IN     (PP*))))))    with   -  -  Monica_Geller   *  55422 59256 00005.pickle -
/friends-s01e01  0  8  !      .             *))    !      -  -  Monica_Geller   *  55422 59256 00005.pickle -
/friends-s01e01  0  0  C'mon  VB   (TOP(S(S(VP*))  c'mon  -  -  Joey_Tribbiani  *  59459 61586 00006.pickle -
/friends-s01e01  0  1  ,      ,                 *  ,      -  -  Joey_Tribbiani  *  59459 61586 00006.pickle -
/friends-s01e01  0  2  you    PRP           (NP*)  you    -  -  Joey_Tribbiani  *  59459 61586 00006.pickle (248)
/friends-s01e01  0  3  're    VBP            (VP*  be     -  -  Joey_Tribbiani  *  59459 61586 00006.pickle -
/friends-s01e01  0  4  going  VBG            (VP*  go     -  -  Joey_Tribbiani  *  59459 61586 00006.pickle -
/friends-s01e01  0  5  out    RP           (PRT*)  out    -  -  Joey_Tribbiani  *  59459 61586 00006.pickle -
/friends-s01e01  0  6  with   IN             (PP*  with   -  -  Joey_Tribbiani  *  59459 61586 00006.pickle -
/friends-s01e01  0  7  the    DT             (NP*  the    -  -  Joey_Tribbiani  *  59459 61586 00006.pickle -
/friends-s01e01  0  8  guy    NN            *))))  guy    -  -  Joey_Tribbiani  *  59459 61586 00006.pickle (284)
/friends-s01e01  0  9  !      .               *))  !      -  -  Joey_Tribbiani  *  59459 61586 00006.pickle -

A mention may include more than one word:

/friends-s01e02  0  0  Ugly         JJ   (TOP(S(NP(ADJP*  ugly         -  -  Chandler_Bing  *  332158 334460 00038.pickle (380
/friends-s01e02  0  1  Naked        JJ                *)  naked        -  -  Chandler_Bing  *  332158 334460 00038.pickle -
/friends-s01e02  0  2  Guy          NNP               *)  Guy          -  -  Chandler_Bing  *  332158 334460 00038.pickle 380)
/friends-s01e02  0  3  got          VBD             (VP*  get          -  -  Chandler_Bing  *  332158 334460 00038.pickle -
/friends-s01e02  0  4  a            DT              (NP*  a            -  -  Chandler_Bing  *  332158 334460 00038.pickle -
/friends-s01e02  0  5  Thighmaster  NN               *))  thighmaster  -  -  Chandler_Bing  *  332158 334460 00038.pickle -
/friends-s01e02  0  6  !            .                *))  !            -  -  Chandler_Bing  *  332158 334460 00038.pickle -
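
One way to recover mention spans from the entity column, sketched in Python under stated assumptions: an opening annotation such as `(380` starts a span, `380)` closes it, and `(284)` is a single-token mention. The handling of several annotations joined with `|` follows the CoNLL 2012 coreference column and is an assumption about this dataset; `extract_mentions` is a hypothetical helper.

```python
import re
from typing import Dict, List, Tuple

Mention = Tuple[int, int, int]  # (first token index, last token index, entity ID)


def extract_mentions(entity_column: List[str]) -> List[Mention]:
    """Turn one sentence's entity column into (start, end, entity_id) spans."""
    open_spans: Dict[int, List[int]] = {}  # entity ID -> stack of start indices
    mentions: List[Mention] = []
    for index, cell in enumerate(entity_column):
        if cell == "-":
            continue
        # Assumption: multiple annotations on one token are joined with "|".
        for part in cell.split("|"):
            match = re.fullmatch(r"(\()?(\d+)(\))?", part)
            if match is None:
                raise ValueError(f"unrecognized entity annotation: {part!r}")
            entity_id = int(match.group(2))
            if match.group(1):  # "(" opens a span at this token
                open_spans.setdefault(entity_id, []).append(index)
            if match.group(3):  # ")" closes the most recent open span
                start = open_spans[entity_id].pop()
                mentions.append((start, index, entity_id))
    return mentions
```

For the "Ugly Naked Guy" sentence above, the entity column yields a single span covering tokens 0 through 2 with entity ID 380.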

The mapping between entity IDs and the actual characters can be found in friends_entity_map.txt.
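
The layout of friends_entity_map.txt is not described here. The sketch below assumes, purely for illustration, one entry per line with the numeric entity ID first, followed by whitespace and the character name; `load_entity_map` is a hypothetical helper and should be adjusted if the real file differs.

```python
from typing import Dict, Iterable


def load_entity_map(lines: Iterable[str]) -> Dict[int, str]:
    """Build {entity ID: character name} from "<id> <name>" lines.

    Assumed layout (not confirmed by this README): the ID comes first and
    everything after the first run of whitespace is the name.
    """
    entity_map: Dict[int, str] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entity_id, name = line.split(None, 1)
        entity_map[int(entity_id)] = name
    return entity_map


# Usage, assuming the file sits in the data folder:
#   with open("data/friends_entity_map.txt", encoding="utf-8") as handle:
#       entity_map = load_entity_map(handle)
```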

Setup

Install the Python 3 requirements: pip install -r requirements.txt
Run ./setup_all.sh to build the custom kernels.

Train

Save the train, dev, and test datasets in the data folder as friendstrain.english.v4_gold_conll, friendstrain.english.v4_gold_conll, and friendstrain.english.v4_gold_conll, respectively.
Check that friends_entity_map.txt is in the data folder.
Download the BERT model into the models folder.
./setup_training.sh: verify that the vocab_file path inside the script is correct, then run it.
Set the experiment configurations in experiments.conf.
Training: GPU=0 python train.py <experiment>
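
Before starting a training run, it can help to confirm that the inputs named in the steps above are in place. The following Python sketch does that; `missing_inputs` is a hypothetical helper, and the assumption that all paths are relative to the repository root is not confirmed by this README.

```python
from pathlib import Path
from typing import List

# File names taken from the steps above; their location relative to the
# repository root is an assumption.
REQUIRED_FILES = [
    "data/friendstrain.english.v4_gold_conll",
    "data/friends_entity_map.txt",
    "experiments.conf",
]


def missing_inputs(root: str = ".") -> List[str]:
    """Return the required inputs that are absent under `root`."""
    base = Path(root)
    missing = [name for name in REQUIRED_FILES if not (base / name).is_file()]
    # The steps do not name the BERT checkpoint, so only check that the
    # models folder exists and contains something.
    models = base / "models"
    if not models.is_dir() or not any(models.iterdir()):
        missing.append("models/ (empty or absent)")
    return missing
```

An empty list from `missing_inputs()` means every check passed; it does not validate the contents of the files.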

Evaluation

Copy the model to be evaluated.
In experiments.conf, set up an experiment under the name of the copied model.
Run GPU=0 python evaluate.py <experiment>
Check the results in the generated evaluate_result.txt.
Run python link_character_friendsnew.py and then python character_evaluate_friendsnew.py to check character identification performance.
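
The three commands in the steps above can be chained from Python. This is a sketch rather than a script shipped with the repository: `evaluation_commands` and `run_evaluation` are hypothetical names, and it uses the current interpreter (`sys.executable`) in place of the bare `python` command.

```python
import os
import subprocess
import sys
from typing import List


def evaluation_commands(experiment: str) -> List[List[str]]:
    """Return the evaluation commands from the steps above, in order."""
    python = sys.executable
    return [
        [python, "evaluate.py", experiment],
        [python, "link_character_friendsnew.py"],
        [python, "character_evaluate_friendsnew.py"],
    ]


def run_evaluation(experiment: str, gpu: str = "0") -> None:
    """Run each step from the repository root, stopping at the first failure."""
    env = dict(os.environ, GPU=gpu)  # mirrors the GPU=0 prefix used above
    for command in evaluation_commands(experiment):
        subprocess.run(command, check=True, env=env)
```

Calling `run_evaluation("<experiment>")` with the experiment name from experiments.conf runs the steps in sequence; `check=True` raises `subprocess.CalledProcessError` if any step exits with a non-zero status.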
