traces-through-time

These are mostly auxiliary scripts for converting source data (from some sort of TEI representation) to text, and for summarising the entity extraction results.

Installation

  1. Fetch this repository and nimrodel

     git clone https://github.com/kowey/traces-through-time.git
     git clone https://github.com/kowey/nimrodel.git
    
  2. Set up nimrodel (see nimrodel/README.md)

  3. Set up a Python virtual environment:

     virtualenv $HOME/.virtualenvs/ttt
     source $HOME/.virtualenvs/ttt/bin/activate
     pip install -r requirements.txt
    

Note that when you want to run one of these scripts, you will need to activate your virtual environment first:

source $HOME/.virtualenvs/ttt/bin/activate

University of Brighton users: Follow on at devel/README.md

Installed Scripts

The following scripts are installed in your virtual environment when you run the above.

Before nimrodel (convert raw data to text)

There are a number of scripts in the converters directory that will be installed by the setup procedure above, for example:

  • state-papers-to-text.py
  • fine-rolls-to-text.py
  • petitions-to-text.py

These all operate on the same principle: they read some input directory of files in various formats, and output a similarly structured directory with mostly plain text files that can be processed by nimrodel.
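The shared pattern can be sketched as a small shell loop. This is a toy illustration of the principle only, not this repository's code: the real converters parse TEI properly, and the `demo/` paths and naive `sed` tag-stripping here are placeholders.

```shell
# toy demo of the converters' pattern: mirror an input tree as plain text
mkdir -p demo/in/sub
printf '<p>Hello, world</p>\n' > demo/in/sub/doc.xml

find demo/in -type f | while read -r f; do
    out="demo/out/${f#demo/in/}"        # same relative path under demo/out
    out="${out%.*}.txt"                 # swap the extension for .txt
    mkdir -p "$(dirname "$out")"
    sed 's/<[^>]*>//g' "$f" > "$out"    # crude tag stripping, for illustration
done
```

The point is only that output files mirror the input directory layout, so downstream tools can rely on the same structure.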

Note that in data distributions, you may see the names 'kleanthi' and 'calendar' floating around. Files with these names should have been renamed to 'state-papers' and 'fine-rolls' respectively.

Nimrodel

Nimrodel provides a shell (or Windows batch) command that you can run. The command provides several modes: for example, nimrodel string takes a string on the command line for quick one-off tests, while nimrodel dir reads from an input directory and writes a directory of json files to an output directory. See the nimrodel docs for more details.
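For example (illustrative invocations only; the exact arguments a given nimrodel version accepts may differ, and the string and paths here are placeholders):

```shell
# quick one-off test on a string
nimrodel string "some text to tag"

# batch mode: the json output directory mirrors the input directory
nimrodel dir path/to/text path/to/json
```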

After nimrodel (study json output)

When you have run nimrodel you should have a directory containing json output files (with potentially some arbitrary division into subdirectories, etc).

We provide a couple of tools to help examine the results.

  • print-entities.py - just dump out occurrences from json dir

  • mk-report.py - create an HTML report listing some basic statistics about the output, and displaying all the outputs in a variety of tables with the aim of helping you see everything at a glance.

    Note that you can also use this script to compare different annotations of the same data, for example, human annotations vs nimrodel, or one version of nimrodel vs another. To do this, pass the reference directory (or the one generated by the older version of nimrodel) with the flag --before.

One-off scripts

Scripts in this directory were used for various one-off tasks that we don't think are particularly repeatable.

  • annotations-to-json.py - convert manual annotation to json (can also convert GATE output)
  • fix-json.py - should not be needed anymore
  • filter-names.py - narrow down a list of candidate names to those that look relatively likely to actually be names

Annotation campaign notes

These notes are for a short-lived annotation campaign around 2014-11. We were marking up a small sample of texts for interesting spans.

  • convert annotations to something comparable with nimrodel (takes text with angle brackets, spits out json):

    python annotation/annotations-to-json.py ANNOTATED-DIR HUMAN-JSON-DIR
    
  • see just the refs (takes json, spits out just entities)

    print-entities.py SOME-JSON-DIR SOME-TEXT-DIR
    
  • run nimrodel, save the json output

  • generate report/scoring:

    mk-report.py --before HUMAN-JSON-DIR NIMRODEL-JSON-DIR REPORT-DIR
    

Tips

If you have access to a beefy multicore server, you can use the parallel-nimrodel-on-dir script to run nimrodel on several inputs at the same time.

You should group the files into buckets of roughly equal size (choosing the right number of buckets can be tricky; smaller is probably better, but not so small that you're wasting time starting nimrodel up repeatedly).
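One simple way to get roughly equal buckets is to deal the files out round-robin. This is a sketch only: it balances by file count rather than byte size, and the `bucket-demo` paths and dummy files are placeholders standing in for your real data.

```shell
# deal input files round-robin into N bucket directories
N=3
mkdir -p bucket-demo/text
for n in 1 2 3 4 5 6 7 8 9; do
    printf 'file %s\n' "$n" > "bucket-demo/text/doc$n.txt"   # dummy inputs
done

i=0
for f in bucket-demo/text/*; do
    b="bucket-demo/bucket-$((i % N))"    # cycle through bucket-0 .. bucket-2
    mkdir -p "$b"
    cp "$f" "$b/"
    i=$((i + 1))
done
```

If your files vary a lot in size, sorting them by size first and dealing the largest out first gives a somewhat better balance.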

It may help to have a script like the below that you can run mindlessly.

#!/bin/bash

DATASET=traces-through-time/data/snippet-2014-11-14

rm -f nohup.out
mkdir -p "${DATASET}"/json
nohup nimrodel/bin/nimrodel parallel-dir 8 \
        "${DATASET}"/text \
        "${DATASET}"/json &

In the above script, we also use the nohup command so that you can log out of the terminal session and check on progress later by logging back in and looking at the nohup.out file from time to time. You can get a sense of the progress by counting the occurrences of "^walking" in that file, i.e. the number of buckets processed.
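Counting those lines is a one-liner with grep -c, which counts matching lines. The demo-nohup.out file below is a stand-in for the real nohup.out, and the sample log lines are made up for illustration; on a real run you would just grep nohup.out directly.

```shell
# simulate a nohup.out in which three buckets have been processed
printf 'walking dir1\nsome other output\nwalking dir2\nwalking dir3\n' > demo-nohup.out

# count completed buckets: one line starting with "walking" per bucket
grep -c '^walking' demo-nohup.out    # prints 3
```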

About

helper scripts for work in the Traces Through Time project
