forked from zangsir/coref

tools related to coreference resolution

amir-zeldes/coref


1. Convert OntoNotes gold file to CoNLL format

Parses OntoNotes coreference gold files and converts them into the CoNLL 2011 shared task file format. The goal is to evaluate the coreferencer with the official scorer script.

Multiple gold files in the ./11docs/ directory are read. The output is a single file containing all documents under that directory. Each document begins with a comment line, #begin document ...
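As an illustration of the single-file layout described above, here is a minimal sketch, not the code in genConllGold.py. The function name `wrap_documents`, the dictionary of pre-converted lines, and the exact header text after `#begin document` (modeled on the CoNLL shared task convention, with a closing `#end document` line) are all assumptions made for the example.

```python
def wrap_documents(docs):
    """Concatenate per-document CoNLL lines into one output string.

    `docs` maps a document name to its already-converted token lines.
    This intermediate structure is hypothetical; the real converter
    may organize its data differently.
    """
    out = []
    for name in sorted(docs):
        # Assumed header shape, following the CoNLL shared task convention.
        out.append("#begin document (%s); part 000" % name)
        out.extend(docs[name])
        out.append("#end document")
    return "\n".join(out) + "\n"
```

Sorting the document names keeps the output order stable across runs, which makes it easier to diff two gold files.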

Note that because of the restrictive licenses on OntoNotes and the Penn Treebank, the actual documents are not included here.

Usage:

To run it on a directory 11docs/ (containing all the OntoNotes gold .coref files) and write to one output file:
python genConllGold.py -w 11docs/

Alternatively, the print option -p takes one .coref file as input and prints the converted output to stdout, for example:
python genConllGold.py -p wsj_0120.coref

(Use -w instead of -p in this case to write the output to a file.)

There is also a simple script, formatXML.py, that outputs the gold file as well-formatted (indented) XML, for instance:

$ python formatXML.py wsj_2321.coref >> wsj21.xml
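For readers who want to see what indenting a gold file might involve, the following is a rough sketch using the standard library's `xml.dom.minidom`. It is not the code in formatXML.py. The sample markup is an assumed, simplified shape of a .coref file, and the sketch only works when the input is well-formed XML, which real gold files may not always be.

```python
from xml.dom import minidom


def pretty_xml(text):
    """Return an indented rendering of a well-formed XML string."""
    return minidom.parseString(text).toprettyxml(indent="  ")


# Assumed, simplified example of coreference markup (not a real gold file).
sample = (
    '<DOC DOCNO="demo"><TEXT PARTNO="000">'
    '<COREF ID="1" TYPE="IDENT">He</COREF> left.'
    '</TEXT></DOC>'
)

if __name__ == "__main__":
    print(pretty_xml(sample))
```

Because the text mixes elements and running words, indentation adds whitespace around the words, so the indented file is meant for reading rather than as input to the other scripts.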

2. Output OntoNotes gold file to HTML

Note: the HTML file must be placed in the correct directory, together with the JavaScript and CSS files, for the coreference to be visualized correctly.

Usage (takes one input file; redirect stdout to a file):

$ python OntoToHtml.py 11docsonly/wsj_0120.coref >>html_ex/wsj_0120_gold.html

If no redirection is used, the HTML is currently printed to stdout. To batch process all coref gold files in a directory, use the provided shell script after modifying it to match your directory name (the example below shows usage in Bash on UNIX):

$ exec ./OntoConll.sh

3. Extract all book titles from the constituent parse tree files

This will output only unique titles existing in the text.

Usage:

(1) In "file" -f mode, extract and output (to stdout) all book titles from one .parse file, and redirect the output to a text file using:

$ python extractTTL.py -f const_parses/wsj_0037.parse >> 0037_titles.txt

(2) In "directory" -d mode, extract and output (to stdout) all book titles from all .parse files under the specified directory, and redirect the output to a text file using:

$ python extractTTL.py -d const_parses/ >> all_titles.txt
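The idea of collecting unique titles from bracketed parse trees can be sketched as follows. This is not the code in extractTTL.py. It assumes that titles are the constituents whose label carries the Penn Treebank function tag -TTL, and the function name `extract_titles` and the sample sentence are made up for illustration.

```python
import re


def extract_titles(parse_text, tag="-TTL"):
    """Return unique word sequences under constituents labeled with `tag`."""
    tokens = re.findall(r"\(|\)|[^\s()]+", parse_text)
    stack = []   # one [label, words] entry per open constituent
    titles = []  # unique titles, in order of first appearance
    for tok in tokens:
        if tok == "(":
            stack.append([None, []])
        elif tok == ")":
            if not stack:
                continue  # tolerate a stray closing bracket
            label, words = stack.pop()
            if label == "-NONE-":
                words = []  # drop traces and other empty elements
            if label and tag in label and words:
                title = " ".join(words)
                if title not in titles:
                    titles.append(title)
            if stack:
                stack[-1][1].extend(words)  # pass words up to the parent
        elif stack and stack[-1][0] is None:
            stack[-1][0] = tok  # first token after "(" is the label
        elif stack:
            stack[-1][1].append(tok)  # any other token is a word
    return titles


sample = ("(TOP (S (NP-SBJ (PRP He)) (VP (VBD read) "
          "(NP-TTL (NNP War) (CC and) (NNP Peace))) (. .)))")
```

Calling `extract_titles(sample)` on the made-up sentence collects the words under the NP-TTL node. A list is used instead of a set so that titles stay in the order they first appear.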

4. More preprocessing for OntoNotes gold files: nested markable removal and singleton removal (ref. sec. 23 of WSJ)

A nested markable is a markable inside another markable with the same ID. After the redundant inner markable is removed, we check whether the outer markable has become a singleton, in which case we also remove the outer markable. This is in accordance with the OntoNotes coref guidelines.
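The two steps can be sketched on a simplified representation. This is not the code in rmNestKilSg.py, which works on the .coref markup itself. Here each markable is assumed to be a tuple (entity_id, start, end) of token offsets with an inclusive end, and the function name is hypothetical. For simplicity the sketch drops every entity left with a single markable, not only those that lost a nested one.

```python
from collections import Counter


def remove_nested_and_singletons(markables):
    """Drop same-ID nested markables, then entities left with one markable.

    `markables` is a list of (entity_id, start, end) token spans with an
    inclusive end; this representation is an assumption for the sketch.
    """
    def is_nested(inner, outer):
        same_id = inner[0] == outer[0]
        contained = outer[1] <= inner[1] and inner[2] <= outer[2]
        different_span = (inner[1], inner[2]) != (outer[1], outer[2])
        return same_id and contained and different_span

    # Step 1: remove any markable contained in another with the same ID.
    kept = [m for m in markables
            if not any(is_nested(m, other) for other in markables)]

    # Step 2: remove entities that now have only one markable.
    counts = Counter(m[0] for m in kept)
    return [m for m in kept if counts[m[0]] > 1]
```

For example, with `[(1, 0, 5), (1, 2, 3), (2, 7, 7), (2, 9, 9)]` the inner span of entity 1 is removed first, which leaves entity 1 as a singleton, so its outer span is removed as well and only the two markables of entity 2 remain.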

Example Usage:

$ python rmNestKilSg.py wsj_2320.coref >> wsj_2320_new.coref

To batch process, use the shell script (Bash on Mac) after modifying the input directory to point to where all the original .coref gold files are located:

$ exec ./fixNest.sh
