RST Parser

Basic Description

An RST parser for document-level discourse parsing. The parsing algorithm is shift-reduce parsing, and the parsing model is an offline-trained multi-class classifier.

To improve performance, you can:

- Add more features to the feature generator (in feature.py)
- Tune the parameters of the parsing model (in model.py). For now, I simply use LinearSVC with the default parameter settings.
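
As one possible starting point for the tuning step, the sketch below runs a small cross-validated grid search over LinearSVC's regularization strength C. It assumes a recent scikit-learn (the `sklearn.model_selection` module), and the synthetic data, grid values, and fold count are placeholders rather than settings taken from model.py; in practice the matrix and labels would come from the data module.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the training matrix and the parsing-action labels.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Candidate values for C are illustrative; widen or refine the grid as needed.
search = GridSearchCV(
    LinearSVC(max_iter=10000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
)
search.fit(X, y)

best_c = search.best_params_["C"]
print("best C:", best_c, "mean CV accuracy:", round(search.best_score_, 3))
```

The selected value could then be passed to the LinearSVC constructor wherever the model is created.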

Demo

Start from main.py for a demo.

Modules

tree: all operations on an RST tree are in this module. For example:
- Build general/binary RST tree from annotated file
- Binarize a general RST tree into binary form (the original RST trees in the RST treebank may not be in binary form)
- Generate bracketing sequence for evaluation
- Write an RST tree to a file (not implemented yet)
- Generate shift-reduce parsing action examples
- Get all EDUs from the RST tree
parser: an implementation of the shift-reduce parsing algorithm, including the following functions:
- Initialize parsing status given a sequence of texts
- Change the status according to a specific parsing action
- Get the status of stack/queue
- Check whether parsing should stop
model: a parsing model module, where a trained parsing model can predict parsing actions. This module includes:
- Batch training on the data generated by the data module
- Predict parsing actions for a given feature set
- Save/load parsing model
feature: a feature generator, which generates features from the current stack/queue status.
data: generates training data for offline training.
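
The stack/queue behavior described for the parser module can be sketched roughly as below. This is a toy illustration, not the code in parser.py: the class name `ToyShiftReduce`, the `done()` method, and the tuple used to represent a subtree are all hypothetical. It also assumes the second and third slots of an action tuple carry a nuclearity form and a relation label; the labels "NS", "NN", "elaboration", and "list" are made-up examples.

```python
from collections import deque


class ToyShiftReduce:
    """Toy stack/queue state (illustrative names, not the repo's API)."""

    def __init__(self, texts):
        self.stack = []
        self.queue = deque(texts)  # each text is treated as one EDU

    def operate(self, action_tuple):
        """Apply one (action, form, relation) tuple to the stack/queue."""
        action, form, relation = action_tuple
        if action == "Shift":
            if not self.queue:
                raise ValueError("cannot shift from an empty queue")
            # Move the head of the queue to the top of the stack.
            self.stack.append(self.queue.popleft())
        elif action == "Reduce":
            if len(self.stack) < 2:
                raise ValueError("reduce needs two elements on the stack")
            # Merge the top two stack elements into one subtree.
            right = self.stack.pop()
            left = self.stack.pop()
            self.stack.append((form, relation, left, right))
        else:
            raise ValueError("unknown action: %r" % (action,))

    def done(self):
        """Parsing stops when the queue is empty and one tree remains."""
        return not self.queue and len(self.stack) == 1


parser = ToyShiftReduce(["EDU 1", "EDU 2", "EDU 3"])
actions = [
    ("Shift", None, None),
    ("Shift", None, None),
    ("Reduce", "NS", "elaboration"),
    ("Shift", None, None),
    ("Reduce", "NN", "list"),
]
for action in actions:
    parser.operate(action)

tree = parser.stack[0]
print(parser.done(), tree)
```

For n EDUs, a complete parse always consists of n shifts and n - 1 reduces, which is why the stop check only needs to look at the queue and the stack size.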

Main Classes

(For all the following functions, please refer to the code for more explanation)

RSTTree (in tree module):
- build(): Build a binary RST tree from an annotated discourse file
- generate_sample(): Generate a sequence of parsing actions and the corresponding training examples, which can be used for offline training of the parsing model
- getedutext(): Get a sequence of EDU texts from the given RST tree
- bracketing(): Generate a bracketing sequence for evaluation
SRParser (in parser module):
- init(texts): Initialize the queue status from the given text sequence. Each element in this sequence will be treated as an EDU
- operate(action_tuple): Change the queue/stack according to the action tuple. For example, the operation (Shift, None, None) moves one element from the head of the queue to the top of the stack
- getparsetree(): Return the entire RST tree
FeatureGenerator (in feature module):
- features(): The main generator, which extracts all the necessary features from the current queue/stack. You can extend this generator by calling other sub-functions in it.
ParsingModel (in model module):
- train(trnM, trnL): Train the parsing model (i.e., a multi-class classifier) offline on the given training data trnM and the corresponding labels trnL
- predict(features): Predict a parsing action from the given features
- sr_parse(texts): Perform shift-reduce RST parsing on the given text sequence. Each element in this sequence will be treated as an EDU
Data (in data module):
- buildvocab(thresh): Build the feature vocab by removing low-frequency features. The same vocab is also used for parsing at test time.
- buildmatrix(): Build data matrix for offline training
- savematrix(fname): Save data matrix and corresponding labels into fname
- getvocab(): Get feature vocab
- savevocab(fname): Save feature vocab and relation mapping (from relations to indices) into fname
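
The vocab and matrix steps of the Data class can be illustrated with a minimal sketch. The helper names `build_vocab` and `vectorize`, the feature strings, and the choice to keep features seen at least `thresh` times are assumptions made for this example; data.py may define the cutoff and the matrix format differently.

```python
from collections import Counter


def build_vocab(samples, thresh):
    """Map each feature seen at least `thresh` times to a column index."""
    counts = Counter(feat for feats in samples for feat in feats)
    kept = sorted(feat for feat, count in counts.items() if count >= thresh)
    return {feat: idx for idx, feat in enumerate(kept)}


def vectorize(feats, vocab):
    """Turn one feature list into a count vector; unknown features are dropped."""
    row = [0] * len(vocab)
    for feat in feats:
        idx = vocab.get(feat)
        if idx is not None:
            row[idx] += 1
    return row


# Hypothetical feature lists, one per training example.
samples = [
    ["stack-top-len=2", "queue-first-word=but"],
    ["stack-top-len=2", "queue-first-word=and"],
    ["stack-top-len=2", "queue-first-word=but"],
]

vocab = build_vocab(samples, thresh=2)
matrix = [vectorize(feats, vocab) for feats in samples]
print(vocab)
print(matrix)
```

Reusing the same `vocab` at test time keeps the column layout identical between training and parsing, which is the reason the vocab is saved alongside the model.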

Reference

Yangfeng Ji, Jacob Eisenstein, Representation Learning for Text-level Discourse Parsing, Proceedings of ACL, 2014

jerryyeezus/nlp-summarization
