whatfone-api

API of Whatfone

Getting Started

Install Python:

Go to http://python.org/download/ and download the Python 2.7.5 (32-bit) installer
Run the installer
(Windows) Add “c:\python27” and “c:\python27\scripts” to PATH (the latter holds the command-line scripts, such as pip, that your Python packages install)

Windows 7 and below:

Right-click “My Computer” and choose Properties
Click “Advanced System Settings”, then click “Environment Variables”
Find “PATH” and append “c:\python27;c:\python27\scripts”. (IMPORTANT: every directory in PATH must be separated from the next by a semicolon, including the existing last entry)

Windows 8:

Press “Windows Key + W” and type “System”
Click “Advanced System Settings”, then click “Environment Variables”
Find “PATH” and append “c:\python27;c:\python27\scripts”. (IMPORTANT: every directory in PATH must be separated from the next by a semicolon, including the existing last entry)
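
To confirm that the PATH edit took effect, a small check like the one below may help. It is only a sketch: the helper name `missing_from_path` is made up for illustration, and it assumes the two directories named above.

```python
import os


def missing_from_path(required_dirs, path_value, sep=";"):
    """Return the required directories that are not listed in path_value.

    Hypothetical helper for illustration. The comparison ignores case,
    surrounding whitespace and trailing slashes.
    """
    def normalize(entry):
        return entry.strip().rstrip("\\/").lower()

    present = set(normalize(e) for e in path_value.split(sep) if e.strip())
    return [d for d in required_dirs if normalize(d) not in present]


REQUIRED = ["c:\\python27", "c:\\python27\\scripts"]

if __name__ == "__main__":
    # os.pathsep is ";" on Windows, so this reads the live PATH there.
    missing = missing_from_path(REQUIRED, os.environ.get("PATH", ""), os.pathsep)
    print("Missing from PATH: {0}".format(missing))
```

An empty list means both directories are present. Open a new terminal before running it, since already-open terminals keep the old PATH.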

Install Python Packages:
For Windows:
1) Setuptools (http://www.lfd.uci.edu/~gohlke/pythonlibs/#setuptools)
2) Pip (http://www.lfd.uci.edu/~gohlke/pythonlibs/#pip)
3) After installing pip, open a terminal and run `$ pip install nltk`
For Mac:
1) Setuptools (https://pypi.python.org/pypi/setuptools)
2) After installing setuptools, open a terminal and run `$ sudo easy_install pip` to install pip
3) After installing pip, run `$ sudo pip install nltk`

NLTK Treebank Download
1) Open a terminal and run `$ python` to start the interactive Python interpreter.
2) At the `>>>` prompt, type `import nltk`
3) Type `nltk.download()`; a new window will appear.
4) Under "All Packages", download "maxent_treebank_pos_tagger".
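
If the download window is inconvenient, for example on a machine without a display, the same package can usually be fetched by passing its identifier to `nltk.download`. This is a minimal sketch, assuming NLTK is already installed and the machine has network access; the wrapper name `download_tagger` is hypothetical.

```python
PACKAGE_ID = "maxent_treebank_pos_tagger"


def download_tagger(package_id=PACKAGE_ID):
    """Download one NLTK data package by identifier, without the GUI.

    Hypothetical wrapper. nltk is imported lazily so that defining this
    function does not itself require NLTK to be installed.
    """
    import nltk
    # nltk.download reports success or failure as a boolean.
    return nltk.download(package_id)


# Usage (needs network access):
#     download_tagger()
```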

Corpora FAQ

All data is stored in the PROJECT_ROOT/data folder.

Subfolder	Description	Docs	Tokens	Status
0	Original copy	4,999	200,003	N/A
1	Tag Stage 1	25	1057	DONE
2	Tag Stage 2	26	939	DONE
3	Tag Stage 3	9	1043	DONE
4	Tag Stage 4	47	969	DONE
5	Tag Stage 5	29	1028	DONE
6	Tag Stage 6	20	1009	Trained, not corrected yet
7	Tag Stage 7	28	972
8	Tag Stage 8	32	1097
9	Tag Stage 9	32	1121
10	Tag Stage 10	16	1517

API Commands

Count Stats for XML (counter.py)
`$ python counter.py <FILENAME>.xml`
Example: `$ python counter.py data/0/reviews.xml`
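
The kind of counting behind the Docs and Tokens columns can be sketched as follows. This is not the code of `counter.py`: the XML layout (one `<review>` element per document, containing plain text) and the function name `count_stats` are assumptions made for illustration, and the real schema may differ.

```python
import xml.etree.ElementTree as ET


def count_stats(xml_text, doc_tag="review"):
    """Return (number of documents, number of whitespace-separated tokens).

    Assumes a hypothetical layout where every document is a <review>
    element containing plain text.
    """
    root = ET.fromstring(xml_text)
    docs = list(root.iter(doc_tag))
    tokens = sum(len("".join(doc.itertext()).split()) for doc in docs)
    return len(docs), tokens


# Tiny made-up sample, not taken from the corpus.
SAMPLE = (
    "<reviews>"
    "<review>great phone lah</review>"
    "<review>plz fix battery</review>"
    "</reviews>"
)
print(count_stats(SAMPLE))
```
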

Train using Default Tagger Model (default-tag-trainer.py)
`$ python default-tag-trainer.py data/x/<RAW>.xml data/x/trained.xml`
Example: `$ python default-tag-trainer.py data/1/test1.xml data/1/trained1.xml`

Train using previously trained XML (trained-tag-trainer.py)
`$ python trained-tag-trainer.py <INT_NUM_OF_TRAINED_FILES> <TRAINED_FILE_1> ... <TRAINED_FILE_N> <TEST_FILE> <TRAINED_FILE>`
The first argument is the number of trained and corrected files that follow it. The last two arguments are the test file to tag and the trained file to write.
Example for tagging stage 2: pass the trained and corrected stage 1 file to train a stage 2 test file:
`$ python trained-tag-trainer.py 1 corrected1.xml test2.xml trained2.xml`
Example for tagging stage 4: pass the trained and corrected stage 1, 2 and 3 files to train a stage 4 test file:
`$ python trained-tag-trainer.py 3 corrected1.xml corrected2.xml corrected3.xml test4.xml trained4.xml`
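
The argument layout above (a count, then that many trained files, then the test file and the output file) can be unpacked as in this sketch. The function name `parse_trainer_args` is hypothetical, and the real script may parse its arguments differently.

```python
def parse_trainer_args(argv):
    """Split trained-tag-trainer.py style arguments into their parts.

    argv excludes the script name. Returns (trained_files, test_file,
    output_file). Hypothetical helper for illustration.
    """
    if not argv:
        raise ValueError("expected <INT_NUM_OF_TRAINED_FILES> as first argument")
    count = int(argv[0])
    expected = 1 + count + 2  # the count, N trained files, test file, output file
    if count < 1 or len(argv) != expected:
        raise ValueError(
            "expected {0} arguments, got {1}".format(expected, len(argv))
        )
    trained_files = argv[1:1 + count]
    test_file, output_file = argv[1 + count], argv[2 + count]
    return trained_files, test_file, output_file


print(parse_trainer_args(
    ["3", "corrected1.xml", "corrected2.xml", "corrected3.xml",
     "test4.xml", "trained4.xml"]
))
```
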

Analyze Tags for Precision, Recall, F1 (analyze-tags.py)
`$ python analyze-tags.py trained.xml corrected.xml`
Precision = total_correct_tags (before correction) / total_tokens
Recall = total_original_tagged_tokens / total_tokens
F1 = 2 * ((precision * recall) / (precision + recall))
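
The three formulas above translate directly into code. The sketch below follows the definitions exactly as listed here; the function name `tagging_scores` and the counts in the example are made up for illustration and are not results from this project.

```python
def tagging_scores(total_correct_tags, total_original_tagged_tokens, total_tokens):
    """Compute precision, recall and F1 using the definitions listed above."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    precision = float(total_correct_tags) / total_tokens
    recall = float(total_original_tagged_tokens) / total_tokens
    if precision + recall == 0:
        # Avoid dividing by zero when nothing was tagged at all.
        return precision, recall, 0.0
    f1 = 2 * ((precision * recall) / (precision + recall))
    return precision, recall, f1


# Made-up counts, for illustration only: 800 correct tags and
# 900 originally tagged tokens out of 1000 tokens.
precision, recall, f1 = tagging_scores(800, 900, 1000)
print("precision={0:.3f} recall={1:.3f} f1={2:.3f}".format(precision, recall, f1))
```
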

Check for missing tags (check-missing-tags.py)
`$ python check-missing-tags.py <XML_FILE>`
Any missing tag found is displayed on the terminal. If the untagged token is a non-word, it is fine. Otherwise, go and tag that word.
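
The distinction between untagged words and untagged non-words can be sketched like this. It assumes whitespace-separated tokens written as `word/TAG` (the form used in `please/RB`) and treats a token as a word if it contains a letter or digit. Both the function name `find_untagged` and that word test are assumptions, not the behaviour of `check-missing-tags.py`.

```python
import re


def find_untagged(tagged_text):
    """Return (words, non_words) among tokens that lack a /TAG suffix.

    Assumes whitespace-separated tokens written as word/TAG.
    """
    words, non_words = [], []
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("/")
        if sep and word and tag:
            continue  # token already carries a tag
        bucket = words if re.search(r"[A-Za-z0-9]", token) else non_words
        bucket.append(token)
    return words, non_words


# "fix" is an untagged word that needs attention; "---" is a non-word.
print(find_untagged("plz/RB fix battery/NN --- lah/UH"))
```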

POS-Tagging FAQ

Currently, we use NLTK for POS-tagging.

Code covering POS-tagging:

Library File: libraries\tags.py
API Wrapper Files: default-tag-trainer.py (Default Tagging Model) and trained-tag-trainer.py (Custom Model via trained tags)

Handy POS Tag List

http://www.monlp.com/2011/11/08/part-of-speech-tags/

What do we need to do?

Tagging consists of 10 stages. Each stage has a series of steps to complete:

Stage	Steps
1	1) `$ python default-tag-trainer.py data/1/test1.xml data/1/trained1.xml` 2) Manually correct `trained1.xml` and save it as `corrected1.xml`
2	1) `$ python trained-tag-trainer.py 1 data/1/corrected1.xml data/2/test2.xml data/2/trained2.xml` 2) Manually correct `trained2.xml` and save it as `corrected2.xml`
3	1) `$ python trained-tag-trainer.py 2 data/1/corrected1.xml data/2/corrected2.xml data/3/test3.xml data/3/trained3.xml` 2) Manually correct `trained3.xml` and save it as `corrected3.xml`
4 to 10	Repeat the same pattern: 1) pass the corrected files from all previous stages to `trained-tag-trainer.py` 2) Manually correct the output and save it as the corrected file for that stage
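
Because every stage after the first follows the same pattern, the command for any stage can be generated. This sketch assumes the later stages keep the `data/N/testN.xml`, `trainedN.xml` and `correctedN.xml` naming shown for stages 1 to 3; the function name `stage_command` is hypothetical.

```python
def stage_command(stage):
    """Build the tagging command for a stage, following the table's pattern."""
    if stage < 1:
        raise ValueError("stage numbers start at 1")
    test_file = "data/{0}/test{0}.xml".format(stage)
    out_file = "data/{0}/trained{0}.xml".format(stage)
    if stage == 1:
        # Stage 1 has no corrected files yet, so it uses the default model.
        return "python default-tag-trainer.py {0} {1}".format(test_file, out_file)
    # Later stages train on the corrected files of all previous stages.
    corrected = ["data/{0}/corrected{0}.xml".format(s) for s in range(1, stage)]
    return "python trained-tag-trainer.py {0} {1} {2} {3}".format(
        len(corrected), " ".join(corrected), test_file, out_file
    )


for stage in range(1, 5):
    print(stage_command(stage))
```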

* What if the token is not really formal English?
  - As mentioned by Prof. Kim, if it is "plz" instead of "please", tag it with the same tag as the formal word. In this case that is Adverb, so `plz/RB` gets the same tag as `please/RB`.
  - If it is Singlish, like "lah", "leh", "lor", use the Interjection (`UH`) tag. An interjection is an "exclamation" word. Examples of interjections are: "Uhhuh", "Oh", "Damn".
  - If it is an email address or something else that is neither a word nor punctuation, tag it as "Foreign Word" (`FW`).
* What if there is a spelling error in the word?
  - Leave it as it is, and tag it as the closest word you think it represents.
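
The rules for informal tokens can be captured as a small lookup to use as a reminder while correcting tags. This is only a sketch: the names `suggest_tag`, `INFORMAL_TAGS` and `SINGLISH_PARTICLES` are made up, the tables contain just the examples given above, and the email pattern is deliberately loose.

```python
import re

# Informal spellings mapped to the tag of their formal equivalent.
# Only "plz" comes from the guidance above; extend the table as needed.
INFORMAL_TAGS = {"plz": "RB"}
SINGLISH_PARTICLES = {"lah", "leh", "lor"}
# Loose email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def suggest_tag(token):
    """Suggest a tag for an informal token, or None if no rule applies."""
    lowered = token.lower()
    if lowered in INFORMAL_TAGS:
        return INFORMAL_TAGS[lowered]
    if lowered in SINGLISH_PARTICLES:
        return "UH"  # interjection
    if EMAIL_RE.match(token):
        return "FW"  # neither a word nor punctuation
    return None


for token in ["plz", "lah", "someone@example.com", "battery"]:
    print("{0} -> {1}".format(token, suggest_tag(token)))
```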


jarrodtoh/whatfone-api
