Multiple gold files are read from the ./11docs/ directory. The output is a single file containing all documents under that directory. Each document begins with a comment #begin document ...
Note that because of licensing restrictions on OntoNotes and the Penn Treebank, the actual documents are not included here.
Usage:
To run it on the directory 11docs/ (which contains all the OntoNotes gold .coref files) and write everything to one output file:
$ python genConllGold.py -w 11docs/
Alternatively, the print option (-p) takes one .coref file as input and prints the converted format to stdout, for example:
$ python genConllGold.py -p wsj_0120.coref
(or use -w in this case to write the output to a file)
There is also a simple script, formatXML.py, that outputs a gold file as well-formatted (indented) XML, for instance:
$ python formatXML.py wsj_2321.coref >> wsj21.xml
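The kind of indentation described above can be sketched with the standard library. This is only an illustration and not the code of formatXML.py: the helper name pretty_print_xml and the sample string are hypothetical, and it assumes the .coref content is well-formed XML (unescaped characters such as & would make the parser fail).

```python
from xml.dom import minidom


def pretty_print_xml(text: str, indent: str = "  ") -> str:
    """Return an indented rendering of a well-formed XML string."""
    # parseString raises an ExpatError if the input is not well-formed.
    return minidom.parseString(text).toprettyxml(indent=indent)


if __name__ == "__main__":
    # Hypothetical miniature document, not taken from OntoNotes.
    sample = (
        '<DOC DOCNO="example"><TEXT>'
        '<COREF ID="1" TYPE="IDENT">John</COREF> said '
        '<COREF ID="1" TYPE="IDENT">he</COREF> left'
        "</TEXT></DOC>"
    )
    print(pretty_print_xml(sample))
```

Indenting mixed content adds whitespace around the text between tags, so output produced this way is meant for reading rather than for further processing.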
Note: the HTML file must be placed in the correct directory, together with the JavaScript and CSS files, for the coreference visualization to display correctly.
Usage (takes one input file; redirect stdout to a file):
$ python OntoToHtml.py 11docsonly/wsj_0120.coref >>html_ex/wsj_0120_gold.html
Without redirection, the script currently prints the HTML to stdout. To batch process all coref gold files in a directory, use the provided shell script after modifying it to match your directory name (the usage below is for Bash on UNIX):
$ exec ./OntoConll.sh
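For readers without Bash, a batch run can also be sketched in Python. This is not the content of OntoConll.sh: the function name batch_convert is hypothetical, and the sketch assumes only what is stated above, namely that OntoToHtml.py takes one .coref file and prints HTML to stdout. The _gold.html output suffix follows the single-file example.

```python
import subprocess
import sys
from pathlib import Path


def batch_convert(in_dir, out_dir, script="OntoToHtml.py"):
    """Run `script` on every .coref file in in_dir, one HTML file per input."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for coref in sorted(Path(in_dir).glob("*.coref")):
        target = out_path / (coref.stem + "_gold.html")
        # Capture the script's stdout in the target file (overwrites, unlike >>).
        with open(target, "w", encoding="utf-8") as fh:
            subprocess.run([sys.executable, script, str(coref)],
                           stdout=fh, check=True)
        written.append(target)
    return written


if __name__ == "__main__":
    # Directory names taken from the single-file usage example.
    for path in batch_convert("11docsonly", "html_ex"):
        print(path)
```

With check=True the run stops at the first file the script fails on, which makes a bad input easy to spot.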
Only the unique titles found in the text are output.
Usage:
(1) In "file" -f mode, extract and output (to stdout) all book titles from one .parse file, and redirect the output to a text file using:
$ python extractTTL.py -f const_parses/wsj_0037.parse >> 0037_titles.txt
(2) In "directory" -d mode, extract and output (to stdout) all book titles from all .parse files under the specified directory, and redirect the output to a text file using:
$ python extractTTL.py -d const_parses/ >> all_titles.txt
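As a rough illustration of this kind of extraction, the sketch below walks a bracketed parse and collects the words under title constituents. It is not the code of extractTTL.py. It assumes that titles are the constituents carrying the Penn Treebank -TTL function tag (suggested by the script name, but not confirmed here), and the function name extract_titles and the sample tree are hypothetical.

```python
import re
from typing import List


def extract_titles(parse_text: str) -> List[str]:
    """Collect unique word sequences under -TTL constituents, in first-seen order."""
    tokens = re.findall(r"\(|\)|[^\s()]+", parse_text)
    stack = []   # one [label, words] entry per open bracket
    titles = []  # unique titles, in order of first appearance
    seen = set()
    for tok in tokens:
        if tok == "(":
            stack.append([None, []])
        elif tok == ")":
            if not stack:
                continue  # ignore an unbalanced closing bracket
            label, words = stack.pop()
            if label == "-NONE-":
                words = []  # traces are not part of the surface text
            elif label and "TTL" in re.split(r"[-=]", label):
                title = " ".join(words)
                if title and title not in seen:
                    seen.add(title)
                    titles.append(title)
            if stack:
                stack[-1][1].extend(words)  # pass the words up to the parent
        elif stack and stack[-1][0] is None:
            stack[-1][0] = tok  # first token after "(" is the node label
        elif stack:
            stack[-1][1].append(tok)  # any other token is a word
    return titles


if __name__ == "__main__":
    # Hypothetical tree, not taken from the corpus.
    tree = ("(TOP (S (NP-SBJ (PRP He)) (VP (VBD read) "
            "(NP-TTL (NNP War) (CC and) (NNP Peace))) (. .)))")
    print(extract_titles(tree))
```

Keeping a set of titles already seen is what restricts the output to unique titles, even when the same title occurs in several sentences or files.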
4. More preprocessing for OntoNotes gold files: nested markable removal and singleton removal (ref. sec. 23 of WSJ)
A nested markable is a markable inside another markable with the same ID. After the redundant inner markable is removed, we check whether the outer markable has become a singleton, in which case we remove the outer markable as well. This is in accordance with the OntoNotes coref guidelines.
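The two steps can be sketched on a simplified representation. This is not the code of rmNestKilSg.py: it assumes each markable has already been reduced to a chain ID plus a token span, and the names Markable and remove_nested_and_singletons are hypothetical.

```python
from collections import Counter
from typing import List, NamedTuple


class Markable(NamedTuple):
    chain_id: str  # coreference chain ID
    start: int     # index of the first token (inclusive)
    end: int       # index of the last token (inclusive)


def is_nested(inner: Markable, outer: Markable) -> bool:
    """True if inner lies inside outer, shares its ID, and has a different span."""
    return (inner.chain_id == outer.chain_id
            and outer.start <= inner.start and inner.end <= outer.end
            and (inner.start, inner.end) != (outer.start, outer.end))


def remove_nested_and_singletons(markables: List[Markable]) -> List[Markable]:
    # Step 1: drop inner markables and remember which chains were touched.
    affected = set()
    kept = []
    for m in markables:
        if any(is_nested(m, other) for other in markables):
            affected.add(m.chain_id)
        else:
            kept.append(m)
    # Step 2: a touched chain left with a single markable is a singleton.
    counts = Counter(m.chain_id for m in kept)
    return [m for m in kept
            if not (m.chain_id in affected and counts[m.chain_id] == 1)]


if __name__ == "__main__":
    # Hypothetical spans: chain "5" has a nested pair, chain "7" is untouched.
    example = [Markable("5", 0, 4), Markable("5", 2, 3),
               Markable("7", 6, 6), Markable("7", 9, 10)]
    print(remove_nested_and_singletons(example))
```

In the example, removing the inner markable of chain "5" leaves that chain with a single mention, so the outer markable is dropped too and only chain "7" remains.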
Example Usage:
$ python rmNestKilSg.py wsj_2320.coref >> wsj_2320_new.coref
To batch process, use the shell script (Bash on Mac) after setting its input directory to the location of the original .coref gold files:
$ exec ./fixNest.sh