GitHub - tungpun/nlp-assignment-parser

Introduction

This is our assignment for an NLP course.

Supported parsers:

Lexicalized Parser
Shift-Reduce Parser
Neural Network Parser
Vietnamese PCFG

How to use

Clone this repository
GitHub does not allow large files to be pushed to its service, so you need to download these files manually: stanford-srparser-2014-10-23-models.jar, stanford-corenlp-2015-04-20-models.jar
Install Java 1.8+
Run python doit.py (requires Python 2.7)
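
Before running the wrapper, it can help to confirm that the manually downloaded model files are in place. The following is a minimal sketch, not part of doit.py: the function name `missing_jars` is hypothetical, and it assumes the jars are expected in the repository root next to doit.py, which this README does not state.

```python
import os

# File names come from the download step above. Their expected location
# (the repository root) is an assumption, not something doit.py documents.
REQUIRED_JARS = [
    "stanford-srparser-2014-10-23-models.jar",
    "stanford-corenlp-2015-04-20-models.jar",
]


def missing_jars(directory="."):
    """Return the required model jars that are not present in `directory`."""
    return [name for name in REQUIRED_JARS
            if not os.path.isfile(os.path.join(directory, name))]


if __name__ == "__main__":
    missing = missing_jars()
    if missing:
        print("Missing model files: " + ", ".join(missing))
    else:
        print("All model files found; run: python doit.py")
```

The check only looks at file names, so it cannot tell whether a download is complete or corrupted.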

Sample output

~ python doit.py



         .d8888b.                                      .d8888b.                      .d8888b.         d8888
        d88P  Y88b                                    d88P  Y88b                    d88P  Y88b       d88888
        888    888                                         .d88P                    888    888      d88P888
        888        888d888  .d88b.  888  888 88888b.      8888"                     888            d88P 888
        888  88888 888P"   d88""88b 888  888 888 "88b      "Y8b.                    888           d88P  888
        888    888 888     888  888 888  888 888  888 888    888       888888       888    888   d88P   888
        Y88b  d88P 888     Y88..88P Y88b 888 888 d88P Y88b  d88P                    Y88b  d88P  d8888888888
         "Y8888P88 888      "Y88P"   "Y88888 88888P"   "Y8888P"                      "Y8888P"  d88P     888
                                             888
                                             888
                                             888
[+] Input Data: "By default, output files are written to the current directory.
[+] Outfile cleaned: input.txt.xml

    Select from the menu:
        [1] Lexiclized Parser
        [2] Shift Reduce Parser
        [3] Neural Network Parser
        [4] Vietnamese PCFG
        [0] Quit Parser Wrapper

    Parser Wrapper> 2
[+] Implementing Shift Reduce Parser (included Dependency and Context-Free-Grammar representation)...

['java', '-cp', '"*"', '-Xmx2000m', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit,pos,parse', '-parse.model', 'edu/stanford/nlp/models/srparser/englishSR.ser.gz', '-file', 'input.txt']
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.6 sec].
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ... done [52.3 sec].

Ready to process: 1 files, skipped 0, total 1
Processing file /root/nlp/parser/stanford-corenlp/input.txt ... writing to /root/nlp/parser/stanford-corenlp/input.txt.xml {
  Annotating file /root/nlp/parser/stanford-corenlp/input.txt [50.447 seconds]
} [51.557 seconds]
Processed 1 documents
Skipped 0 documents, error annotating 0 documents
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.5 sec.
ParserAnnotator: 49.8 sec.
TOTAL: 50.4 sec. for 804 tokens at 15.9 tokens/sec.
Pipeline setup: 0.2 sec.
Total time for StanfordCoreNLP pipeline: 51.8 sec.
[+] Parsing completed

[+] Output report is saved to input.txt.xml . You can open with MS Excel for more detail or view a brief as below.

[+] Parse tree:
   [0] (ROOT (S (PP (IN By) (NP (NN default))) (, ,) (NP (NN output) (NNS files)) (VP (VBP are) (VP (VBN written) (PP (TO to) (NP (DT the) (JJ current) (NN directory))))) (. .)))
(ROOT(S(PP(IN By)(NP(NN default)))(, ,)(NP(NN output)(NNS files))(VP(VBP are)(VP(VBN written)(PP(TO to)(NP(DT the)(JJ current)(NN directory)))))(. .)))

|-- ROOT
|   |-- S
|   |   |-- PP
|   |   |   |-- IN --- By
|   |   |   |-- NP
|   |   |   |   |-- NN --- default
|   |   |-- , --- ,
|   |   |-- NP
|   |   |   |-- NN --- output
|   |   |   |-- NNS --- files
|   |   |-- VP
|   |   |   |-- VBP --- are
|   |   |   |-- VP
|   |   |   |   |-- VBN --- written
|   |   |   |   |-- PP
|   |   |   |   |   |-- TO --- to
|   |   |   |   |   |-- NP
|   |   |   |   |   |   |-- DT --- the
|   |   |   |   |   |   |-- JJ --- current
|   |   |   |   |   |   |-- NN --- directory
|   |   |-- . --- .
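
The indented tree above can be derived from the bracketed parse string. The sketch below shows one way to do that; it is an illustration, not the code doit.py uses, and the names `parse_brackets` and `render` are hypothetical. It assumes well-formed Penn Treebank style input in which every word sits alone under its part-of-speech tag.

```python
import re


def parse_brackets(text):
    """Parse a bracketed tree into nested (label, children) tuples.

    Words are kept as plain strings inside the children list.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [[]]  # stack[-1] collects the items of the currently open node
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            items = stack.pop()
            # The first item is the node label, the rest are its children.
            stack[-1].append((items[0], items[1:]))
        else:
            stack[-1].append(tok)
    return stack[0][0]


def render(node, depth=0):
    """Return the tree as a list of lines in the '|-- ' layout shown above."""
    label, children = node
    prefix = "|   " * depth + "|-- "
    if len(children) == 1 and not isinstance(children[0], tuple):
        # Pre-terminal node: print the tag and its word on one line.
        return [prefix + label + " --- " + children[0]]
    lines = [prefix + label]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    sample = "(ROOT (S (NP (NN output) (NNS files)) (. .)))"
    print("\n".join(render(parse_brackets(sample))))
```

A stack is used instead of recursion for parsing so the same code handles both the spaced and the compact form of the bracket string.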

XML Report Output
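
The parse trees can also be read back from the XML report without opening it in a spreadsheet. This is a minimal sketch, assuming the report stores each sentence's bracketed tree in a `<parse>` element, as Stanford CoreNLP's XML output normally does. The helper name `parse_trees_from_xml` is hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def parse_trees_from_xml(xml_data):
    """Return the text of every <parse> element in a CoreNLP XML report.

    `xml_data` may be a str or bytes object holding the whole document.
    """
    root = ET.fromstring(xml_data)
    return [el.text.strip() for el in root.iter("parse") if el.text]


if __name__ == "__main__":
    report = "input.txt.xml"  # output file name taken from the sample run
    if os.path.exists(report):
        with open(report, "rb") as handle:
            for index, tree in enumerate(parse_trees_from_xml(handle.read())):
                print("[%d] %s" % (index, tree))
```

Reading the file as bytes lets the XML parser apply the encoding declared in the report itself.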

Find me on Twitter
