GitHub - HeavenlyBerserker/NLP_Project

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
Good proj.pys		Good proj.pys
developset		developset
output		output
scoring program		scoring program
testset1		testset1
venv		venv
.DS_Store		.DS_Store
155.rtf		155.rtf
ALLPATTERN123.rtf		ALLPATTERN123.rtf
final-predictions.txt		final-predictions.txt
final-predictions.txt.trace		final-predictions.txt.trace
final-predictions_1.txt		final-predictions_1.txt
final-predictions_2.txt		final-predictions_2.txt
final-predictions_4.txt		final-predictions_4.txt
final-predictions_5.txt		final-predictions_5.txt
infoextract.sh		infoextract.sh
infoextract.sh~		infoextract.sh~
last.rtf		last.rtf
new.rtf		new.rtf
nltk_download.py		nltk_download.py
nltk_download.py~		nltk_download.py~
pattern_1237.rtf		pattern_1237.rtf
pattern_new_1139.rtf		pattern_new_1139.rtf
predict.py		predict.py
project.py		project.py
project.pyc		project.pyc
project.py~		project.py~
readme		readme
sample-answers.txt		sample-answers.txt
sample-textfile.txt		sample-textfile.txt
test.sh		test.sh
testdev.sh		testdev.sh
testtest.sh		testtest.sh

Repository files navigation

NLP Project 2017
Project by Hong Xu and Jie Zhang
Written on python 2.7.12 and tested on lab2-12 cade machine
#!/bin/bash
For any questions, please contact Hong Xu u0867999@utah.edu
Input: file to test as argument to ./infoextract.sh
Output: final-predictions.txt in main directory

Created October 28 2017
Sources for nltk: http://www.nltk.org/book_1ed/
Credit to : Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.

#######################################################
Instructions:

1. To install python repositories and test the system on a file, run test.sh with the filename to test and answers as a single argument:

./infoextract.sh <textfile>

For instance, to run it on tst1, run

./infoextract.sh testset1/testset1-input.txt

The final output file will be in the main directory (NLP_Project_2017) named final-predictions.txt.

a ######################################################
External sources:
nltk
spacy

b ######################################################
Time Estimate:
For file prediction
~10s per file

For training:
<300s total

Total for testset1 samples:
<30 min

c ######################################################
Team member contributions.

Hong Xu: Was responsible for writing the code to parse the files into usable text format.
Created the system to create automatic patterns and find triggers. Implemented a way to classify incidents by frequency of certain words.
Responsible for the testing and experimentation on the test files.
Was responsible of finding incidents, first by unigrams and bigrams, then by testing rules.
Also responsible for finding weapons, and wrote part of the code for perp_org.
Tested on cade machines and made sure all the libraries installed correctly, the venv was set up
correctly and the code ran correctly and in a format that was acceptable.

Zhang Jie: Read AutoSlog paper and understand basic steps/concepts.
Tried to implement a NER system that will learn words in the categories and tried to classify new words seen. So far does not work well.
Implemented a method that extracts relatively long sentence chunks as patterns.
Used NER-taggers to create patterns to test on the data.
Handwrote patterns to test on the data.
Did some preprocessing to sharpen answers.

d ######################################################
Results and final thoughts:

The project was overall very interesting and we have learned a lot from it regardless of our performance.

At first, we tried to created a fully automated system reminiscent of autoslog, but after multiple
failed trials decided on using more systems along the lines of ngrams, at least in terms of incidents
and weapons, again to no avail. Finally, we decided to go old-school and hand-pick valid patterns
from the created patterns to input into our final system. Turns out this technique yielded the best results
and although they only hovered around 35%, we were quite pleased with the results. In truth, if we had
only considered incidents, weapons, and perp org, our system would reach 40%, but we decided it was
better to be mediocre at the well rounded task than to ignore entries.

After this project, we have definetely gained respect for working automated systems and hope
that this will improve further in the future. However, we do believe that deeper language understanding
will be required to improve on the state of the art performances using primarily methods reminiscent of
statistical studies on datasets.

Summing up, after testing various automatic methods, we decided on using hand written rules to
improve our system, aware of the intricacies inherent to the domain-restricted event extraction task
which elude automatic systems. This has, in turn, spiked our curiosity as to what the state of the art
in this task really looks like, including the subtle advances that seem to be steadily improving our
the performace on the task.

Change log:
2017-10-29 Added readme file and did some modifications to the code.
2017-10-29 Parsed the file further. Now each file has been put with
its answers, parsed answers, raw text, and parsed sentences.
2017-10-29 Added finding important words
2017-11-04 Figured out parser and created some patterns. Work in progress.
2017-11-04 Wrote a custom parser using the nltk tagger. Results are okay. Higher correct answer instances in answers.
2017-11-05 Wrote predict.py to start evaluating answers.
2017-11-05 Tried using verbs and verb phrases as trigger words. Experiments failed, so we are using any word with high P log F score.
2017-11-06 Struggled with finding better parsers, expecially in terms of separating subjects, DOs and PPs.
2017-11-06 Tried different configurations for detecting incidents by exploring the ratio between words that appear in articles vs other articles, capped at around 50% recall.
2017-11-06 With prunning of some answers, obtained a recall of 20%, but still low precision of 1%.
2017-11-06 Improved recall to ~25%.
2017-11-07 Trying to improve system to no avail.
2017-11-15 to 2017-11-28 Final push: tried to implement multiple systems, did extensive testing and converted everything into turn-in format.