traces-through-time

These are mostly auxiliary scripts for converting source data (from some sort of TEI representation) to text, and for summarising the entity extraction results.

Installation

  1. Fetch this repository and nimrodel

     git clone https://github.com/kowey/traces-through-time.git
     git clone https://github.com/kowey/nimrodel.git
    
  2. Set up nimrodel (see nimrodel/README.md)

  3. Set up a Python virtual environment:

     virtualenv $HOME/.virtualenvs/ttt
     source $HOME/.virtualenvs/ttt/bin/activate
     pip install -r requirements.txt
    

Note that when you want to run one of these scripts, you will need to activate your virtual environment first:

source $HOME/.virtualenvs/ttt/bin/activate

University of Brighton users: Follow on at devel/README.md

Installed Scripts

The following scripts are installed in your virtual environment when you run the above.

Before nimrodel (convert raw data to text)

There are a number of scripts in the converters directory that will be installed by the setup procedure above, for example:

  • state-papers-to-text.py
  • fine-rolls-to-text.py
  • petitions-to-text.py

These all operate on the same principle: they read some input directory of files in various formats, and output a similarly structured directory with mostly plain text files that can be processed by nimrodel.
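The shared pattern can be sketched as a small shell loop. This is a toy illustration of the principle only, not this repository's code: the real converters parse TEI properly, and the `demo/` paths and naive `sed` tag-stripping here are placeholders.

```shell
# toy demo of the converters' pattern: mirror an input tree as plain text
mkdir -p demo/in/sub
printf '<p>Hello, world</p>\n' > demo/in/sub/doc.xml

find demo/in -type f | while read -r f; do
    out="demo/out/${f#demo/in/}"        # same relative path under demo/out
    out="${out%.*}.txt"                 # swap the extension for .txt
    mkdir -p "$(dirname "$out")"
    sed 's/<[^>]*>//g' "$f" > "$out"    # crude tag stripping, for illustration
done
```

The point is only that output files mirror the input directory layout, so downstream tools can rely on the same structure.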

Note that in data distributions, you may see the names 'kleanthi' and 'calendar' floating around. Files with these names should have been renamed to 'state-papers' and 'fine-rolls' respectively.

Nimrodel

Nimrodel provides a shell (or Windows batch) command that you can run. The command provides several modes: for example, nimrodel string takes a string on the command line for quick one-off tests, while nimrodel dir reads from an input directory and writes a directory of json files to an output directory. See the nimrodel docs for more details.
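For example (illustrative invocations only; the exact arguments a given nimrodel version accepts may differ, and the string and paths here are placeholders):

```shell
# quick one-off test on a string
nimrodel string "some text to tag"

# batch mode: the json output directory mirrors the input directory
nimrodel dir path/to/text path/to/json
```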

After nimrodel (study json output)

When you have run nimrodel you should have a directory containing json output files (with potentially some arbitrary division into subdirectories, etc).

We provide a couple of tools to help examine the results.

  • print-entities.py - just dump out occurrences from json dir

  • mk-report.py - create an HTML report listing some basic statistics about the output, and displaying all the outputs in a variety of tables with the aim of helping you see everything at a glance.

    Note that you can also use this script to compare different annotations of the same data, for example, human annotations vs nimrodel, or one version of nimrodel vs another. To do this, pass the reference directory (or the one generated by the older version of nimrodel) with the flag --before.

One-off scripts

Scripts in this directory were used for various one-off tasks that we don't think are particularly repeatable.

  • annotations-to-json.py - convert manual annotation to json (can also convert GATE output)
  • fix-json.py - should not be needed anymore
  • filter-names.py - narrow down a list of candidate names to those that look relatively likely to actually be names

Annotation campaign notes

These notes are for a short-lived annotation campaign around 2014-11. We were marking up a small sample of texts for interesting spans.

  • convert annotations to something comparable with nimrodel (takes text with angle brackets, spits out json):

    python annotation/annotations-to-json.py ANNOTATED-DIR HUMAN-JSON-DIR
    
  • see just the refs (takes json, spits out just entities)

    print-entities.py SOME-JSON-DIR SOME-TEXT-DIR
    
  • run nimrodel, save the json output

  • generate report/scoring:

    mk-report.py --before HUMAN-JSON-DIR NIMRODEL-JSON-DIR REPORT-DIR
    

Tips

If you have access to a beefy multicore server, you can use the parallel-nimrodel-on-dir script to run nimrodel on several inputs at the same time.

You should group the files into buckets of roughly equal size (choosing the right number of buckets can be tricky; smaller is probably better, but not so small that you're wasting time starting nimrodel up repeatedly).
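One simple way to get roughly equal buckets is to deal the files out round-robin. This is a sketch only: it balances by file count rather than byte size, and the `bucket-demo` paths and dummy files are placeholders standing in for your real data.

```shell
# deal input files round-robin into N bucket directories
N=3
mkdir -p bucket-demo/text
for n in 1 2 3 4 5 6 7 8 9; do
    printf 'file %s\n' "$n" > "bucket-demo/text/doc$n.txt"   # dummy inputs
done

i=0
for f in bucket-demo/text/*; do
    b="bucket-demo/bucket-$((i % N))"    # cycle through bucket-0 .. bucket-2
    mkdir -p "$b"
    cp "$f" "$b/"
    i=$((i + 1))
done
```

If your files vary a lot in size, sorting them by size first and dealing the largest out first gives a somewhat better balance.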

It may help to have a script like the below that you can run mindlessly.

#!/bin/bash

DATASET=traces-through-time/data/snippet-2014-11-14

rm -f nohup.out
mkdir -p "${DATASET}"/json
nohup nimrodel/bin/nimrodel parallel-dir 8 \
        "${DATASET}"/text \
        "${DATASET}"/json &

In the above script, we also use the nohup command so that you can log out of the terminal session and check on progress later by logging back in and looking at the nohup.out file from time to time. You can get a sense of the progress by counting the occurrences of "^walking" in that file, i.e. the number of buckets processed.
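Counting those lines is a one-liner with grep -c, which counts matching lines. The demo-nohup.out file below is a stand-in for the real nohup.out, and the sample log lines are made up for illustration; on a real run you would just grep nohup.out directly.

```shell
# simulate a nohup.out in which three buckets have been processed
printf 'walking dir1\nsome other output\nwalking dir2\nwalking dir3\n' > demo-nohup.out

# count completed buckets: one line starting with "walking" per bucket
grep -c '^walking' demo-nohup.out    # prints 3
```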

About

helper scripts for work in the Traces Through Time project
