GitHub

README FOR NLP PRE-PROCESSING MODULE

This module consists of the basic tools to pre-process a query. The module runs the following tools in respective order:

Enchant Spell Checker -> Stanford CoreNLP TrueCase* (truecase) -> Stanford CoreNLP Tokenizer (tokenize) -> Stanford CoreNLP Sentence Split (ssplit) -> Stanford CoreNLP POS Tagger (pos), Stanford CoreNLP Lemma (lemma) -> Stanford CoreNLP CRF NER (ner) -> Stanford CoreNLP REGEX NER (regexner) -> Stanford CoreNLP Parse (parse).

INPUT: "I want to buk a flight tikket from new york to bangalore between 21st june and 23rd june."

SPELL and TRUECASE OUTPUT: I want to book a flight ticket from New YORK TO BANGALORE between 21st June and 23rd June.

NER OUTPUT: I want to book a flight ticket from New (LOCATION: YORK) TO (LOCATION: BANGALORE) between (DATE: 21st June) and (DATE: 23rd June. )

Truecase requires tokenize -> ssplit -> pos -> lemma tools as to predict the truecase of a word.

The module uses following packages. The instructions for installing these packages are described here:

cd path/to/genie directory mkdir /Tools

The packages will be installed inside the 'Tools' directory. a. - Download Enchant1.6.0 (enchant-1.6.0.tar.gz) from http://www.abisource.com/downloads/enchant/1.6.0/enchant-1.6.0.tar.gz
- Install PyEnchant using 'sudo pip install pyenchant'
b. Download Stanford CoreNLP 3.5.2 (stanford-corenlp-full-2015-04-20.zip) from http://nlp.stanford.edu/software/corenlp.shtml

Upgrade Java to 1.8+ This package is already build.

Use this command to run the Stanford CoreNLP java -cp '*' -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,truecase -file

where annotators are the "tools" to be run.

Use this page for further instructions: http://nlp.stanford.edu/software/corenlp.shtml
The script textToDemand.py runs the preprocessing module.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.idea		.idea
.metadata		.metadata
database		database
experiments		experiments
mysql		mysql
preprocessing_tools		preprocessing_tools
temp		temp
testing_temp/testcases		testing_temp/testcases
utilities		utilities
.DS_Store		.DS_Store
.gitignore		.gitignore
FILE_CONSTANTS.py		FILE_CONSTANTS.py
README		README
README.md		README.md
__init__.py		__init__.py
combine_test.py		combine_test.py
config.ini		config.ini
data_ner_city		data_ner_city
data_nlp_preprocessing_engine		data_nlp_preprocessing_engine
food_testing.py		food_testing.py
newrelic.ini		newrelic.ini
nlp_engine.py		nlp_engine.py
nlp_engine_test.py		nlp_engine_test.py
nlp_logging.py		nlp_logging.py
nlp_preprocessing_engine.py		nlp_preprocessing_engine.py
preprocessing_module_testrun		preprocessing_module_testrun
preprocessor.log		preprocessor.log
search.py		search.py
temp.py		temp.py
test		test
test_ner_city.py		test_ner_city.py
test_nlp_preprocessing_engine.py		test_nlp_preprocessing_engine.py
testing_food.py		testing_food.py
train		train
travel.sql		travel.sql
warehouse.sql		warehouse.sql

bopopescu/Personal_Assistant

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages