API of Whatfone
Install Python:
- Go to http://python.org/download/ and download the Python 2.7.5 (32-bit) installer
- Install Python
- (Windows) Add “c:\python27” and “c:\python27\scripts” to PATH (the latter is for your Python packages)
  - Windows 7 and below:
    - Right-click “My Computer” and choose “Properties”
    - Click “Advanced System Settings”, then click “Environment Variables”
    - Find “PATH” and append “c:\python27;c:\python27\scripts”. (IMPORTANT: each directory must be separated from the next by a semicolon)
  - Windows 8:
    - Press “Windows Key + W” and type “System”
    - Click “Advanced System Settings”, then click “Environment Variables”
    - Find “PATH” and append “c:\python27;c:\python27\scripts”. (IMPORTANT: each directory must be separated from the next by a semicolon)
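To confirm the PATH change took effect, a small check like the one below can help. It is only a sketch: `missing_path_entries` and `REQUIRED_DIRS` are hypothetical names, and it assumes the default install location used in the steps above.

```python
import os

# Directories the install steps above ask you to add (assumed default location).
REQUIRED_DIRS = ["c:\\python27", "c:\\python27\\scripts"]


def missing_path_entries(path_value, required=REQUIRED_DIRS, sep=";"):
    """Return the required directories that do not appear in a PATH string."""
    entries = []
    for entry in path_value.split(sep):
        # Ignore surrounding spaces, a trailing backslash, and letter case.
        cleaned = entry.strip().rstrip("\\").lower()
        if cleaned:
            entries.append(cleaned)
    return [d for d in required if d.lower() not in entries]


# Usage on Windows, after opening a new terminal:
#   print(missing_path_entries(os.environ.get("PATH", "")))
# An empty list means both directories are present.
```

Open a new terminal before running the check, since terminals that were already open keep the old PATH value.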
Install Python Packages:
For Windows:
1) Setuptools (http://www.lfd.uci.edu/~gohlke/pythonlibs/#setuptools)
2) Pip (http://www.lfd.uci.edu/~gohlke/pythonlibs/#pip)
3) After installing pip, open a terminal and run `$ pip install nltk`
For Mac:
1) Setuptools (https://pypi.python.org/pypi/setuptools)
2) After installing setuptools, open a terminal and run `$ sudo easy_install pip` to install Pip
3) After installing pip, run `$ sudo pip install nltk`
NLTK Treebank Download
1) Open a terminal and run `$ python` to start the interactive Python interpreter.
2) Type `>>> import nltk`
3) Type `>>> nltk.download()`; a new window will appear.
4) Open the "All Packages" tab and download "maxent_treebank_pos_tagger".
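The download can also be scripted instead of using the download window. The sketch below is one possible way to do it: `fetch_tagger_data` is a hypothetical helper name, and it assumes the installed NLTK version still lists `maxent_treebank_pos_tagger` in its download index. `nltk.download` accepts a package identifier and fetches that package directly when one is given.

```python
def fetch_tagger_data(download=None, package="maxent_treebank_pos_tagger"):
    """Download the POS tagger data and return the downloader's result."""
    if download is None:
        # Import lazily so this helper can be loaded without NLTK installed.
        import nltk
        download = nltk.download
    return download(package)


# Usage (requires NLTK and a network connection):
#   fetch_tagger_data()
```

The `download` parameter exists only so the helper can be tried with a stand-in function; normal use passes no arguments.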
All data is stored in the ~PROJECT_ROOT/data folder.
Subfolder | Description | Docs | Tokens | Status |
0 | Original copy | 4,999 | 200,003 | N/A |
1 | Tag Stage 1 | 25 | 1057 | DONE |
2 | Tag Stage 2 | 26 | 939 | DONE |
3 | Tag Stage 3 | 9 | 1043 | DONE |
4 | Tag Stage 4 | 47 | 969 | DONE |
5 | Tag Stage 5 | 29 | 1028 | DONE |
6 | Tag Stage 6 | 20 | 1009 | Trained, not corrected yet |
7 | Tag Stage 7 | 28 | 972 | |
8 | Tag Stage 8 | 32 | 1097 | |
9 | Tag Stage 9 | 32 | 1121 | |
10 | Tag Stage 10 | 16 | 1517 | |
- Count Stats for XML (counter.py)
$ python counter.py <FILENAME>.xml
Example: $ python counter.py data/0/reviews.xml
- Train using Default Tagger Model (default-tag-trainer.py)
$ python default-tag-trainer.py data/x/<RAW>.xml data/x/trained.xml
Example: $ python default-tag-trainer.py data/1/test1.xml data/1/trained1.xml
- Train using previously trained XML (trained-tag-trainer.py)
$ python trained-tag-trainer.py <INT_NUM_OF_TRAINED_FILES> <TRAINED_FILE_1> ... <TRAINED_FILE_N> <TEST_FILE> <TRAINED_FILE>
Example for tagging stage 2: pass the trained and corrected stage 1 file to train the stage 2 test file:
$ python trained-tag-trainer.py 1 corrected1.xml test2.xml trained2.xml
Example for tagging stage 4: pass the trained and corrected stage 1, 2 and 3 files to train the stage 4 test file:
$ python trained-tag-trainer.py 3 corrected1.xml corrected2.xml corrected3.xml test4.xml trained4.xml
- Analyze Tags for Precision, Recall, F1 (analyze-tags.py)
$ python analyze-tags.py trained.xml corrected.xml
Precision = total_correct_tags (before correction) / total_tokens
Recall = total_original_tagged_tokens / total_tokens
F1 = 2 * ((precision * recall) / (precision + recall))
- Check for missing tags (check-missing-tags.py)
$ python check-missing-tags.py <XML_FILE>
If a missing tag is found, it is displayed in the terminal. If the untagged token is a non-word, that is fine; otherwise, tag that word.
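The Precision, Recall and F1 definitions listed for analyze-tags.py can be worked through with a short sketch. This is not the code in analyze-tags.py; `tag_scores` and its parameter names are hypothetical, and the counts in the usage comment are made-up numbers chosen only to show the arithmetic.

```python
def tag_scores(correct_tags, tagged_tokens, total_tokens):
    """Compute (precision, recall, f1) using the definitions in this README.

    correct_tags:  tags that were already correct before manual correction
    tagged_tokens: tokens that received a tag from the tagger
    total_tokens:  all tokens in the file
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    # float() keeps the division exact on Python 2 as well as Python 3.
    precision = float(correct_tags) / total_tokens
    recall = float(tagged_tokens) / total_tokens
    if precision + recall == 0:
        f1 = 0.0  # avoid dividing by zero when nothing was tagged
    else:
        f1 = 2 * ((precision * recall) / (precision + recall))
    return precision, recall, f1


# Usage with illustrative counts: 80 correct tags, 100 tagged tokens, 100 tokens.
#   precision, recall, f1 = tag_scores(80, 100, 100)
```

Note that both ratios here share the same denominator, as the README defines them, which differs from the more common precision and recall definitions.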
Currently, we use NLTK for POS tagging.
- Code covering POS tagging:
  - Library file: libraries\tags.py
  - API wrapper files: default-tag-trainer.py (Default Tagging Model) and trained-tag-trainer.py (Custom Model via trained tags)
- Handy POS Tag List
- What do we need to do?
- Tagging consists of 10 stages. Each stage has a series of steps to complete:
Stage | Steps |
1 | 1) `$ python default-tag-trainer.py data/1/test1.xml data/1/trained1.xml` 2) Manually correct `trained1.xml` and save it as `corrected1.xml` |
2 | 1) `$ python trained-tag-trainer.py 1 data/1/corrected1.xml data/2/test2.xml data/2/trained2.xml` 2) Manually correct `trained2.xml` and save it as `corrected2.xml` |
3 | 1) `$ python trained-tag-trainer.py 2 data/1/corrected1.xml data/2/corrected2.xml data/3/test3.xml data/3/trained3.xml` 2) Manually correct `trained3.xml` and save it as `corrected3.xml` |
... | Later stages follow the same pattern, passing the corrected files from all earlier stages |
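The command for any stage follows from the pattern in the table above, so it can be generated instead of typed by hand. The sketch below is a hypothetical helper, not part of the repository. It assumes every stage keeps the `data/N/testN.xml`, `data/N/trainedN.xml` and `data/N/correctedN.xml` naming shown for stages 1 to 3.

```python
def build_stage_command(stage, data_dir="data"):
    """Return the trainer command for a tagging stage as a list of arguments."""
    if stage < 1:
        raise ValueError("stage must be 1 or greater")
    test_file = "{0}/{1}/test{1}.xml".format(data_dir, stage)
    out_file = "{0}/{1}/trained{1}.xml".format(data_dir, stage)
    if stage == 1:
        # Stage 1 has no corrected files yet, so it uses the default model.
        return ["python", "default-tag-trainer.py", test_file, out_file]
    # Later stages pass the corrected file of every earlier stage.
    corrected = [
        "{0}/{1}/corrected{1}.xml".format(data_dir, n) for n in range(1, stage)
    ]
    return (
        ["python", "trained-tag-trainer.py", str(stage - 1)]
        + corrected
        + [test_file, out_file]
    )


# Usage: print the command for stage 4.
#   print(" ".join(build_stage_command(4)))
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.call` without extra quoting.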