This repository contains NLP experiment code using NuPIC. The current goal is to build a POS (part-of-speech) predictor.
Some Python libraries are required. Install them with pip.
Use Python 2.x, not 3.x, because NuPIC does not currently support Python 3.x.
$ pip install -r requirements.txt --user
We use the Brown Corpus via NLTK. Install the corpora using nltk.download_shell().
In [1]: import nltk
In [2]: nltk.download_shell()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> all
Downloading collection u'all'
|
| Downloading package abc
...
|
Done downloading collection all
Next, build a NuPIC model from the Brown corpus, with POS tags assigned by NLTK.
$ python src/pos_learning.py
This command takes a couple of hours. If the model directory already exists, the script fails to run, so run rm -rf model before running the script.
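If you prefer to clear the old model from Python rather than the shell, the cleanup step can be sketched as below. This is a minimal sketch, not part of the repository: the helper name remove_model_dir is hypothetical, and it assumes the model directory is named model and sits in the current working directory.

```python
import os
import shutil


def remove_model_dir(path="model"):
    """Delete a previously saved model directory, if it exists.

    Hypothetical helper (not part of this repository).
    Returns True if a directory was removed, False if there was nothing to remove.
    """
    if os.path.isdir(path):
        # Equivalent to `rm -rf model` for an existing directory.
        shutil.rmtree(path)
        return True
    return False
```

Calling remove_model_dir() before python src/pos_learning.py has the same effect as rm -rf model, but does nothing when no model directory is present.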
Once the model has been created, we can predict the POS of each word in a sentence.
In [1]: from src.pos_prediction import predictPOS
In [3]: predictPOS("Numenta has developed a number of applications to demonstrate the applicability of its technology.")
Out[3]:
[('Numenta', 'NNP', 0.17242003298048558),
('has', 'VBZ', 0.0),
('developed', 'VBN', 0.20699545397928326),
('a', 'DT', 0.035622404184603919),
('number', 'NN', 0.45799918713405025),
('of', 'IN', 0.27102120514372302),
('applications', 'NNS', 0.049628877304811303),
('to', 'TO', 0.0),
('demonstrate', 'VB', 0.45884920292902631),
('the', 'DT', 0.32169826979365984),
('applicability', 'NN', 0.52366338091042608),
('of', 'IN', 0.25839685244089977),
('its', 'PRP$', 0.060240797605142671),
('technology', 'NN', 0.44383397040713313),
('.', '.', 0.14901950388899751)]
predictPOS is the main prediction function. It takes a string as input and returns a list of 3-item tuples. Each tuple contains
(input word, POS tagged by NLTK, accuracy)
It assumes that NLTK's tagging is always correct. This assumption is not perfect.
The accuracy lies in the range [0, 1] and indicates how accurately NuPIC predicted the correct POS.
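Because the result is a plain list of tuples, it is easy to post-process. The following is a minimal sketch of one way to summarize it. The function name summarize is hypothetical and not part of this repository, and the sample list is just a few tuples copied from the output shown above rather than a fresh call to predictPOS.

```python
def summarize(predictions):
    """Return (mean accuracy, words whose accuracy is exactly 0.0).

    `predictions` is a list of (word, tag, accuracy) tuples in the
    format returned by predictPOS. Hypothetical helper for illustration.
    """
    scores = [score for _word, _tag, score in predictions]
    # float() keeps the division correct on Python 2 as well.
    mean = sum(scores) / float(len(scores)) if scores else 0.0
    missed = [word for word, _tag, score in predictions if score == 0.0]
    return mean, missed


# A few tuples copied from the sample output in this README.
sample = [
    ("Numenta", "NNP", 0.17242003298048558),
    ("has", "VBZ", 0.0),
    ("developed", "VBN", 0.20699545397928326),
    ("to", "TO", 0.0),
]

mean, missed = summarize(sample)
print("mean accuracy: %.3f" % mean)
print("zero-accuracy words: %s" % ", ".join(missed))
```

To run it on real predictions, pass the return value of predictPOS directly to summarize.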