david-bressler/e2e-coref

David Bressler's guide to getting the coref code up and running:

  • git clone this repo into an e2e-coref folder
  • cd e2e-coref
  • # Set up and activate a virtualenv for Python 2
  • virtualenv --python=/usr/bin/python2.7 env_e2e_2p7
  • source env_e2e_2p7/bin/activate
  • # Install the requirements
  • pip install -r requirements.txt
  • # Download the pretrained models
  • gdown https://drive.google.com/uc?id=1fkifqZzdzsOEo0DXMzCFjiNXqsKG_cHi
  • # Move the downloaded file to the root of the repo and extract it
  • tar -xzvf e2e-coref.tgz
  • # Download GloVe embeddings and build custom kernels by running setup_all.sh
  • ./setup_all.sh
  • # The input should be a .json file with a field called 'results', which contains a list of dictionaries; each article you want processed goes in that dictionary's 'body' field
  • # E.g. if you have two articles to process, then once the json is loaded as 'the_dic', the article text should be at the_dic['results'][0]['body'] and the_dic['results'][1]['body']
  • # Then edit run_bulk_coref.py with the following:
  • runtypea=0
  • namea='your_entity'  # e.g. if your json file is 'Tesla.json', this line should be namea='Tesla'
  • # The output is a json file (e.g. 'Tesla_coref.json') with fields 'e2e_body', 'predicted_clusters', and 'clusters_words'
  • # 'e2e_body' is the tokenized version of your input text
  • # 'predicted_clusters' contains the indices into e2e_body for each of the detected coref chains
  • # 'clusters_words' contains the words in each of the detected coref chains
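The input format above can be sketched as follows; the file name, entity name, and article text are illustrative placeholders, not values from the repo:

```python
import json

# Minimal input for run_bulk_coref.py: a 'results' list whose entries
# carry the article text in a 'body' field. All text here is made up.
the_dic = {
    "results": [
        {"body": "Tesla was founded in 2003. It builds electric cars."},
        {"body": "The company is led by its CEO. He joined early on."},
    ],
}

# If namea='Tesla' in run_bulk_coref.py, the file should be 'Tesla.json'.
with open("Tesla.json", "w") as f:
    json.dump(the_dic, f)

# Sanity check: the article text is reachable where the script expects it.
with open("Tesla.json") as f:
    loaded = json.load(f)
print(loaded["results"][0]["body"])
```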

Higher-order Coreference Resolution with Coarse-to-fine Inference

Introduction

This repository contains the code for replicating results from Higher-order Coreference Resolution with Coarse-to-fine Inference (Lee et al., NAACL 2018).

Getting Started

  • Install the Python (either 2 or 3) requirements: pip install -r requirements.txt
  • Download pretrained models at https://drive.google.com/file/d/1fkifqZzdzsOEo0DXMzCFjiNXqsKG_cHi
    • Move the downloaded file to the root of the repo and extract: tar -xzvf e2e-coref.tgz
  • Download GloVe embeddings and build custom kernels by running setup_all.sh.
    • There are 3 platform-dependent ways to build custom TensorFlow kernels. Please comment/uncomment the appropriate lines in the script.
  • To train your own models, run setup_training.sh
    • This assumes access to OntoNotes 5.0. Please edit the ontonotes_path variable.

Training Instructions

  • Experiment configurations are found in experiments.conf
  • Choose an experiment that you would like to run, e.g. best
  • Training: python train.py <experiment>
  • Results are stored in the logs directory and can be viewed via TensorBoard.
  • Evaluation: python evaluate.py <experiment>

Demo Instructions

  • Command-line demo: python demo.py final
  • To run the demo with other experiments, replace final with your configuration name.

Batched Prediction Instructions

  • Create a file where each line is in the following json format (make sure to strip the newlines so each line is well-formed json):
{
  "clusters": [],
  "doc_key": "nw",
  "sentences": [["This", "is", "the", "first", "sentence", "."], ["This", "is", "the", "second", "."]],
  "speakers": [["spk1", "spk1", "spk1", "spk1", "spk1", "spk1"], ["spk2", "spk2", "spk2", "spk2", "spk2"]]
}
  • clusters should be left empty and is only used for evaluation purposes.
  • doc_key indicates the genre, which can be one of the following: "bc", "bn", "mz", "nw", "pt", "tc", "wb"
  • speakers indicates the speaker of each word. These can be all empty strings if there is only one known speaker.
  • Run python predict.py <experiment> <input_file> <output_file>, which outputs the input jsonlines with predicted clusters.
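One way to produce a well-formed input file is to serialize each document with json.dumps, which emits a single line with no embedded newlines; a minimal sketch using the example document above (the output file name is arbitrary):

```python
import json

# One document in the jsonlines format expected by predict.py.
doc = {
    "clusters": [],   # left empty; only used for evaluation
    "doc_key": "nw",  # genre key: bc, bn, mz, nw, pt, tc, or wb
    "sentences": [["This", "is", "the", "first", "sentence", "."],
                  ["This", "is", "the", "second", "."]],
    "speakers": [["spk1"] * 6, ["spk2"] * 5],
}

# json.dumps produces a single line, so each line is well-formed json.
with open("input.jsonlines", "w") as f:
    f.write(json.dumps(doc) + "\n")
```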

Other Quirks

  • It does not use GPUs by default. Instead, it looks for the GPU environment variable, which the code treats as shorthand for CUDA_VISIBLE_DEVICES.
  • The training runs indefinitely and needs to be terminated manually. The model generally converges at about 400k steps.
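Since the code reads the GPU environment variable rather than CUDA_VISIBLE_DEVICES directly, device selection happens outside the scripts; a sketch of the two equivalent ways to set it (the device id 0 is an arbitrary example):

```python
import os

# Select GPU 0 before launching training; the code treats GPU as
# shorthand for CUDA_VISIBLE_DEVICES.
os.environ["GPU"] = "0"

# Equivalent one-liner on the shell:
#   GPU=0 python train.py <experiment>
print(os.environ["GPU"])
```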
