daandouwe/chart-parser

Chart parser

A simple chart parser in Python, with CKY implemented in Cython for speed.

Inspired by the recent success of benepar and the minimal-span parser, I wanted to revisit chart parsing with CKY on binarized trees. No neural networks here, however: just rule probabilities estimated by maximum likelihood.
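As a rough illustration of what "estimated by maximum likelihood" means here, each rule probability is the rule's count divided by the total count of its left-hand side. The sketch below is not the code in this repository: the nested-tuple tree format and the names count_rules and estimate_probs are assumptions made for the example.

```python
from collections import Counter


def count_rules(tree, counts):
    """Recursively count the rules used in one tree.

    Assumed tree format (hypothetical, for this sketch only):
    (label, word) for a preterminal, (label, child, ...) otherwise.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        # Lexical rule: tag -> word
        counts[(label, (children[0],))] += 1
        return
    rhs = tuple(child[0] for child in children)
    counts[(label, rhs)] += 1
    for child in children:
        count_rules(child, counts)


def estimate_probs(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    counts = Counter()
    for tree in trees:
        count_rules(tree, counts)
    lhs_totals = Counter()
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}


toy_trees = [
    ("S", ("NP", "dogs"), ("VP", "bark")),
    ("S", ("NP", "cats"), ("VP", "sleep")),
    ("S", ("NP", "dogs"), ("VP", "sleep")),
]
probs = estimate_probs(toy_trees)
print(probs[("NP", ("dogs",))])  # 2 of the 3 NP rules rewrite to "dogs"
```

The probabilities for each left-hand side sum to one by construction, which is what makes the grammar a proper PCFG.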

Setup

To obtain the data and grammar, use:

cd grammar
./get-grammar.sh

To compile cky, use:

cd cky
python setup.py build_ext --inplace

Usage

To run a quick test, use:

python main.py

To parse a sentence, use:

python main.py --sent "The horse raced past the barn fell."

To parse the dev set and compute the F-score, use:

python main.py --infile grammar/dev/dev.tokens --outfile grammar/dev/dev.pred.trees --goldfile grammar/data/dev.trees

This can be done in parallel by adding --parallel.

To parse 5 sentences from the dev set, show the predicted and gold parses, and compute their individual F-scores, use:

python main.py --treefile grammar/data/dev.trees -n 5

The default grammar is the vanilla CNF grammar. To use the Markovized (v1h1) grammar instead, use:

python main.py --grammar grammar/train/train.markov.grammar
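For readers unfamiliar with the notation: assuming v1h1 follows the convention of Klein and Manning (2003), v1 means no parent annotation and h1 means each intermediate symbol created during binarization remembers only one previously generated sibling. The sketch below shows the horizontal part of that idea. The binarize function and the "NP|<DT>" label format are hypothetical and may differ from what get-grammar.sh produces.

```python
def binarize(label, children, h=1):
    """Right-binarize the rule label -> children.

    Each intermediate label keeps at most `h` of the siblings
    that were already generated to its left.
    """
    rules = []
    parent = label
    for i in range(len(children) - 2):
        # The last `h` children generated so far, including children[i]
        seen = children[max(0, i + 1 - h): i + 1]
        new = f"{label}|<{'-'.join(seen)}>"
        rules.append((parent, (children[i], new)))
        parent = new
    rules.append((parent, tuple(children[-2:])))
    return rules


for rule in binarize("NP", ["DT", "JJ", "JJ", "NN"], h=1):
    print(rule)
```

With h=1 the intermediate symbol after the second child is "NP|<JJ>" rather than "NP|<DT-JJ>", so rules that differ only in older siblings share statistics. That sharing is the point of Markovization.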

Speed

To speed up CKY parsing, we use a (simple) Cython version that stays very close to the numpy implementation. A pure numpy CKY is also provided; to use it, add the flag --use-numpy. The speed difference is substantial: the Cython CKY parses a 20-word sentence in ~1 second, while the numpy CKY takes ~90 seconds.

Parsing the entire development set in parallel with 8 processes (on my quad-core machine) takes around 15 minutes.
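To see where the time goes, here is a minimal Viterbi CKY sketch in plain Python. It is not the Cython or numpy code in this repository: it uses dictionaries instead of arrays, and the function name cky, the grammar format, and the toy grammar are all assumptions made for illustration. The triple loop over span length, start position, and split point, with an inner loop over rules, is what the compiled version accelerates.

```python
import math
from collections import defaultdict


def cky(words, lexicon, binary_rules, root="S"):
    """Viterbi CKY over a CNF grammar with log probabilities.

    lexicon:      {(tag, word): logprob}
    binary_rules: {(parent, left, right): logprob}
    Returns (logprob, tree) for the best parse, or None if there is none.
    """
    n = len(words)
    # chart[i, j] maps a label to (score, backpointer) for the span words[i:j]
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        for (tag, w), lp in lexicon.items():
            if w == word:
                chart[i, i + 1][tag] = (lp, word)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = chart[i, j]
            for k in range(i + 1, j):
                left_cell, right_cell = chart[i, k], chart[k, j]
                for (parent, left, right), lp in binary_rules.items():
                    if left in left_cell and right in right_cell:
                        score = lp + left_cell[left][0] + right_cell[right][0]
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (k, left, right))
    if root not in chart[0, n]:
        return None

    def build(i, j, label):
        # Follow backpointers to recover the best tree as nested tuples
        _, back = chart[i, j][label]
        if isinstance(back, str):
            return (label, back)
        k, left, right = back
        return (label, build(i, k, left), build(k, j, right))

    return chart[0, n][root][0], build(0, n, root)


lexicon = {
    ("NP", "dogs"): math.log(0.5),
    ("NP", "cats"): math.log(0.5),
    ("V", "chase"): 0.0,
}
binary_rules = {("S", "NP", "VP"): 0.0, ("VP", "V", "NP"): 0.0}
score, tree = cky(["dogs", "chase", "cats"], lexicon, binary_rules)
print(tree)
```

The running time is cubic in sentence length and linear in the number of binary rules, which is why a treebank-sized grammar makes the interpreted version slow.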

Accuracy

The Markovized CNF gives these results on the test set:

=== Summary ===

-- All --
Number of sentence        =   2416
Number of Error sentence  =      0
Number of Skip  sentence  =      0
Number of Valid sentence  =   2416
Bracketing Recall         =  78.20
Bracketing Precision      =  76.53
Bracketing FMeasure       =  77.36
Complete match            =  14.82
Average crossing          =   2.44
No crossing               =  41.68
2 or less crossing        =  65.02
Tagging accuracy          =  95.49

This is what we should expect based on the numbers that Klein and Manning (2003) report for the unrefined and Markovized grammars.
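As a sanity check on the summary above, the bracketing F-measure is the harmonic mean of bracketing precision and recall. The helper name f_measure is just for this example. EVALB computes the figure from raw bracket counts, so agreement to two decimals with the rounded inputs is expected but not guaranteed in general.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


f1 = f_measure(76.53, 78.20)
print(round(f1, 2))  # 77.36, matching the reported Bracketing FMeasure
```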

Requirements

python>=3.6.0
numpy
cython
nltk
tqdm
flake8
PYEVALB

Contributing

Working to make collaboration easier.

Run tests

Under construction

Run linters

Run flake8 from the project directory for style guide enforcement. See the documentation for more info on flake8.

TODO

  • More elaborate unking scheme (e.g. UNK-DASH-ity)
  • Write a setup.py to make collaboration easier. See this example.
  • Add tests. See this example.
