zangsir/coref


1. Convert OntoNotes gold file to CoNLL format

Parse OntoNotes coreference gold files and convert them into the CoNLL-2011 shared task file format. The goal is to be able to use the official scorer script to evaluate the coreference resolver.

Multiple gold files from the ./11docs/ directory will be read. The output is a single file containing all documents under that directory; each document begins with a comment line #begin document ...

Note that due to the restrictive licenses of OntoNotes and the Penn Treebank, the actual documents are not included here.

Usage:

To run it on a directory 11docs/ (containing all OntoNotes gold .coref files) and write everything to one output file:
python genConllGold.py -w 11docs/

Alternatively, the print option -p takes a single .coref file as input and writes the converted format to stdout, for example:
python genConllGold.py -p wsj_0120.coref

(You can also use -w in this case to write the output to one file.)
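For reference, the heart of the conversion is the bracketed coreference column that the CoNLL-2011 format expects in the last field: "(12" opens chain 12, "12)" closes it, and "(12)" marks a single-token mention. A minimal sketch of that step (the function name and mention representation are illustrative, not taken from genConllGold.py):

```python
# Sketch: build the last (coreference) column of CoNLL lines from
# mention spans. Names are illustrative, not from genConllGold.py.

def coref_column(num_tokens, mentions):
    """mentions: list of (chain_id, start, end) with inclusive token indices."""
    cells = [[] for _ in range(num_tokens)]
    for chain_id, start, end in mentions:
        if start == end:
            cells[start].append(f"({chain_id})")   # single-token mention
        else:
            cells[start].append(f"({chain_id}")    # mention opens here
            cells[end].append(f"{chain_id})")      # mention closes here
    # Tokens outside any mention get "-"; overlaps are joined with "|".
    return ["|".join(c) if c else "-" for c in cells]
```

Multiple brackets on one token are joined with "|", as the scorer expects.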

There is also a simple script, formatXML.py, that outputs a well-formatted (indented) gold file as XML, for instance:

$ python formatXML.py wsj_2321.coref >> wsj21.xml

2. Output an OntoNotes gold file to HTML

Note: the HTML file must be placed in the same directory as the accompanying JavaScript and CSS files for the coreference visualization to render correctly.

Usage (takes one input file; redirect stdout to a file):

$ python OntoToHtml.py 11docsonly/wsj_0120.coref >>html_ex/wsj_0120_gold.html

If no redirection is used, the HTML is printed to stdout. To batch-process all coref gold files in a directory, use the provided shell script after modifying it to suit your directory name (usage on UNIX bash shown below):

$ exec ./OntoConll.sh

3. Extract all book titles from the constituent parse tree files

This outputs only the unique titles occurring in the text.

Usage:

(1) In "file" mode (-f), extract all book titles from one .parse file and print them to stdout; redirect the output to a text file using:

$ python extractTTL.py -f const_parses/wsj_0037.parse >> 0037_titles.txt

(2) In "directory" mode (-d), extract all book titles from all .parse files under the specified directory and print them to stdout; redirect the output to a text file using:

$ python extractTTL.py -d const_parses/ >> all_titles.txt
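A minimal sketch of how such extraction can work, assuming titles are marked with the Penn Treebank -TTL function tag (extractTTL.py's actual logic may differ):

```python
import re

# Sketch: pull the text under -TTL (title) nodes out of Penn
# Treebank-style bracketed parses. The -TTL function tag is an
# assumption about how titles are marked in these files.

LEAF = re.compile(r"\(\S+ ([^()\s]+)\)")  # matches "(POS word)" leaves

def extract_titles(parse_text):
    titles = set()
    for m in re.finditer(r"\(\S*-TTL", parse_text):
        # Walk forward from the -TTL node, matching balanced brackets
        # to find the extent of its subtree.
        depth, i = 0, m.start()
        while i < len(parse_text):
            if parse_text[i] == "(":
                depth += 1
            elif parse_text[i] == ")":
                depth -= 1
                if depth == 0:
                    break
            i += 1
        span = parse_text[m.start():i + 1]
        titles.add(" ".join(LEAF.findall(span)))
    return sorted(titles)  # unique titles only, as described above
```

Collecting into a set gives the deduplication the section describes.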

4. More preprocessing for OntoNotes gold files: nested markable removal and singleton removal (ref. WSJ section 23)

A nested markable is a markable inside another markable with the same ID. After the inner, redundant markable is removed, we check whether the outer markable has become a singleton, in which case we remove the outer markable as well. This is in accordance with the OntoNotes coreference guidelines.
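The two-step cleanup can be sketched as follows; the mention representation and function name are illustrative, not taken from rmNestKilSg.py:

```python
from collections import Counter

# Sketch of the two-step cleanup: drop markables nested inside another
# markable with the same chain ID, then drop chains reduced to a
# single mention. Mentions are (chain_id, start, end) tuples with
# inclusive token indices.

def remove_nested_and_singletons(mentions):
    # 1. Drop any mention contained in another mention of the same chain.
    kept = [m for m in mentions
            if not any(o != m and o[0] == m[0]
                       and o[1] <= m[1] and m[2] <= o[2]
                       for o in mentions)]
    # 2. Drop chains left with only one mention (singletons).
    counts = Counter(cid for cid, _, _ in kept)
    return [m for m in kept if counts[m[0]] > 1]
```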

Example Usage:

$ python rmNestKilSg.py wsj_2320.coref >> wsj_2320_new.coref

To batch-process, use the shell script (Bash on macOS) after modifying the input directory where the original .coref gold files are located:

$ exec ./fixNest.sh

5. Adding coref information to Stanford CoreNLP CoNLL output

Currently there is no easy way to output coref chains from dcoref using Stanford CoreNLP. You can output CoNLL files with CoreNLP, but they contain no coref chain column even when coref is included in the list of annotators. Alternatively, CoreNLP can output XML files by default, and these do contain coref chain info when the annotator is specified. In this task we take both output formats (.conll and .xml) and use a simple Python script to add the coref chain info from the XML back into the last column of the CoNLL files. Note that the input is plain text from OntoNotes files, one sentence per line, with the OntoNotes tokenization.

Usage:

The pipeline here is to first obtain the CoNLL and XML output from Stanford CoreNLP using dcoref, and then put them together. To get the CoNLL and XML output, you can modify and use the provided shell script (be sure to set the directory that stores all the input plain text files, and set the outputFormat parameter to either 'conll' or 'xml'):

$ exec ./batch_coreNLP.sh

Then, once both the CoNLL and XML files are in place (the same document should have the same file name except for the extension), you can put them together:

$ python coreNLP_conll.py path/to/conll/file path/to/xml/file

This writes the complete CoNLL file to stdout.
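The XML side of the merge boils down to collecting each mention's (sentence, start, end) span and its chain ID from CoreNLP's XML output. A sketch, assuming the usual CoreNLP XML layout (1-based token indices, exclusive end); verify the details against your CoreNLP version:

```python
import xml.etree.ElementTree as ET

# Sketch: read coref chains from CoreNLP XML output. Assumes each
# chain is a <coreference> element containing <mention> elements with
# <sentence>, <start>, and <end> children (1-based, end exclusive).

def read_chains(xml_file):
    """Map (sentence, start, end) mention spans to a chain id.

    xml_file may be a path or a file-like object.
    """
    root = ET.parse(xml_file).getroot()
    spans = {}
    for cid, chain in enumerate(root.iter("coreference")):
        # CoreNLP wraps all chains in an outer element also named
        # <coreference>; it has no direct <mention> children, so it
        # contributes no spans.
        for mention in chain.findall("mention"):
            key = (int(mention.find("sentence").text),
                   int(mention.find("start").text),
                   int(mention.find("end").text))
            spans[key] = cid
    return spans
```

From that mapping, the spans can be rewritten into the bracketed coref column of the matching CoNLL lines.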

If you wish to do batch processing, i.e., read a whole directory of OntoNotes plain text files, you can use the provided Bash shell script (after making it executable), specifying the name of the output file on the command line:

$ ./coreNLPConll.sh newOutput.conll

This writes the complete CoNLL output for all documents into one output file, appropriate for evaluation using the CoNLL-2011/2012 shared task scorer script. In the combined output file, each document is delimited by begin and end comments.

6. Extracting gold NER from GUM XML files into CoNLL format

This is part of the training data that the Berkeley joint coref system needs. Use gum_ner.py for single files, or modify a shell script to batch-process. It expects the coref file on the command line:

$ python gum_ner.py onto-gum/GUM_interview_ants.coref

This outputs a CoNLL file with id, token, and NER columns. Before merging it with the coref CoNLL file from the GUM repo or doing other processing, you can use check_conll.py to verify that both files have the same number of tokens.
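The output step can be sketched as below, assuming the NER column uses OntoNotes-style bracketing ("(PERSON*" opens a span, "*)" closes it, "*" elsewhere); gum_ner.py's actual parsing of the GUM XML is not shown:

```python
# Sketch: emit id / token / NER rows, assuming OntoNotes-style
# bracketed NER tags. The entity-span representation is illustrative,
# not taken from gum_ner.py.

def ner_rows(tokens, entities):
    """entities: list of (label, start, end) with inclusive token indices."""
    ner = ["*"] * len(tokens)
    for label, start, end in entities:
        if start == end:
            ner[start] = f"({label})"   # single-token entity
        else:
            ner[start] = f"({label}*"   # entity opens here
            ner[end] = "*)"             # entity closes here
    return [(i + 1, tok, tag) for i, (tok, tag) in enumerate(zip(tokens, ner))]
```

Keeping the token count identical to the coref CoNLL file is what check_conll.py then verifies.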

About

Tools related to coreference resolution (Xrenner).
