API of Whatfone

Getting Started

Install Python:

  1. Download the Python 2.7.5 (32-bit) installer
  2. Install Python
  3. (Windows) Add “c:\python27” and “c:\python27\scripts” to PATH (the latter is for your Python packages)
  • Windows 7 and below:
    • Right-click “My Computer” and choose “Properties”
    • Click “Advanced System Settings”, then click “Environment Variables”
    • Find “PATH” and append “c:\python27;c:\python27\scripts”. (IMPORTANT: directories in PATH must be separated by semicolons)
  • Windows 8:
    • Press “Windows Key + W” and type “System”
    • Click “Advanced System Settings”, then click “Environment Variables”
    • Find “PATH” and append “c:\python27;c:\python27\scripts”. (IMPORTANT: directories in PATH must be separated by semicolons)
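
To confirm the PATH change took effect, a small check like the one below can help. It is only a sketch: the function name `missing_path_entries` is made up for this example, and it assumes the two directories named in step 3.

```python
import os

# Directories named in the PATH step above.
REQUIRED_DIRS = [r"c:\python27", r"c:\python27\scripts"]


def missing_path_entries(path_value, required=REQUIRED_DIRS, sep=";"):
    """Return the required directories that are absent from a PATH string."""
    # Windows paths are case-insensitive, so compare in lower case and
    # ignore stray spaces and trailing backslashes around each entry.
    present = set(
        entry.strip().rstrip("\\").lower()
        for entry in path_value.split(sep)
        if entry.strip()
    )
    return [d for d in required if d.lower() not in present]


if __name__ == "__main__":
    missing = missing_path_entries(os.environ.get("PATH", ""))
    print("Missing from PATH: %s" % (", ".join(missing) or "nothing"))
```

Run it from a new terminal window, since an already open terminal keeps the old PATH value.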

Install Python Packages:
For Windows:
1) Install Setuptools
2) Install Pip
3) After installing Pip, open a terminal and run `$ pip install nltk`
For Mac:
1) Install Setuptools
2) After installing Setuptools, open a terminal and run `$ sudo easy_install pip` to install Pip
3) After installing Pip, open a terminal and run `$ sudo pip install nltk`

NLTK Treebank Download
1) Open a terminal and run `$ python` to enter the Python interactive prompt.
2) Type `$ import nltk`
3) Type `$`, and a new window will appear.
4) Go to "All Packages" and download "maxent_treebank_pos_tagger"
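
The same package can probably be fetched without the download window. The sketch below assumes the step above refers to NLTK's downloader, `nltk.download`, which accepts a package identifier. The helper name `fetch_tagger` is hypothetical.

```python
# Package named in step 4 above.
PACKAGE = "maxent_treebank_pos_tagger"


def fetch_tagger(download=None):
    """Download the tagger model and return the downloader's result.

    `download` defaults to nltk.download (assumed to be the intended
    command); pass another callable to try the helper without network access.
    """
    if download is None:
        import nltk  # imported lazily so this file loads without NLTK
        download = nltk.download
    return download(PACKAGE)
```

Calling `fetch_tagger()` with no arguments needs NLTK installed and a network connection.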

Corpora FAQ

All data is stored in the ~PROJECT_ROOT/data folder.

Subfolder  Description    Docs   Tokens   Status
0          Original copy  4,999  200,003  N/A
1          Tag Stage 1    25     1,057    DONE
2          Tag Stage 2    26     939      DONE
3          Tag Stage 3    9      1,043    DONE
4          Tag Stage 4    47     969      DONE
5          Tag Stage 5    29     1,028    DONE
6          Tag Stage 6    20     1,009    Trained, not corrected yet
7          Tag Stage 7    28     972
8          Tag Stage 8    32     1,097
9          Tag Stage 9    32     1,121
10         Tag Stage 10   16     1,517

API Commands

  • Count Stats for XML
    $ python <FILENAME>.xml
    Example: $ python data/0/reviews.xml
  • Train using Default Tagger Model
    $ python data/x/<RAW>.xml data/x/trained.xml
    Example: $ python data/1/test1.xml data/1/trained1.xml
  • Train using previously trained XML
    The first argument is the number of corrected files that follow.
    Example for tagging stage 2: pass the trained and corrected stage 1 file to train a stage 2 test file:
    $ python 1 corrected1.xml test2.xml trained2.xml
    Example for tagging stage 4: pass the trained and corrected stage 1, 2 and 3 files to train a stage 4 test file:
    $ python 3 corrected1.xml corrected2.xml corrected3.xml test4.xml trained4.xml
  • Analyze Tags for Precision, Recall, F1
    $ python trained.xml corrected.xml
    Precision = total_correct_tags (before correction) / total_tokens
    Recall = total_original_tagged_tokens / total_tokens
    F1 = 2 * ((precision * recall) / (precision + recall))
  • Check for missing tags
    $ python <XML_FILE>
    If a missing tag is found, it is displayed in the terminal. If the untagged token is a non-word, that is fine. Otherwise, tag that word.
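
The three formulas in the list above can be written out as a short function. This is a sketch that follows the definitions exactly as given here, not the analysis script itself. The function and argument names are made up for the example.

```python
def tag_scores(correct_tags, tagged_tokens, total_tokens):
    """Return (precision, recall, f1) using the definitions listed above.

    correct_tags  -- tags that were already correct before manual correction
    tagged_tokens -- tokens that carried a tag in the trained file
    total_tokens  -- all tokens in the file
    """
    if total_tokens == 0:
        return 0.0, 0.0, 0.0
    # float() keeps the division exact on Python 2.7 as well as Python 3.
    precision = float(correct_tags) / total_tokens
    recall = float(tagged_tokens) / total_tokens
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1
```

For a file with 100 tokens, 90 of them tagged and 80 tagged correctly, `tag_scores(80, 90, 100)` gives a precision of 0.8 and a recall of 0.9.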

POS-Tagging FAQ

Currently, we use NLTK for POS tagging.

  • Code covering POS tagging:
    • Library File: libraries\
    • API Wrapper Files: (Default Tagging Model) and (Custom Model via trained tags)
  • Handy POS Tag List
  • What do we need to do?
    • Tagging consists of 10 stages. Each stage has a series of steps to complete. Here are the steps:
Stage  Steps
1      1) `$ python data/1/test1.xml data/1/trained1.xml`
       2) Manually correct `trained1.xml` and save it as `corrected1.xml`
2      1) `$ python 1 data/1/corrected1.xml data/2/test2.xml data/2/trained2.xml`
       2) Manually correct `trained2.xml` and save it as `corrected2.xml`
3      1) `$ python 2 data/1/corrected1.xml data/2/corrected2.xml data/3/test3.xml data/3/trained3.xml`
       2) Manually correct `trained3.xml` and save it as `corrected3.xml`
Stages 4 to 10 follow the same pattern, passing the corrected files from all earlier stages.
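
The argument pattern in the stage table can be generated for any stage. The sketch below only builds the arguments that come after the script name, following the file naming used in the table. The function name `stage_arguments` is hypothetical.

```python
def stage_arguments(stage, data_dir="data"):
    """Build the arguments that follow the script name for one tagging stage.

    Stage 1 takes only the test file and the output file. Later stages start
    with the number of corrected files, then list one corrected file for
    each earlier stage, then the test file and the output file.
    """
    if not 1 <= stage <= 10:
        raise ValueError("stage must be between 1 and 10")
    test_file = "%s/%d/test%d.xml" % (data_dir, stage, stage)
    trained_file = "%s/%d/trained%d.xml" % (data_dir, stage, stage)
    if stage == 1:
        return [test_file, trained_file]
    corrected = [
        "%s/%d/corrected%d.xml" % (data_dir, n, n) for n in range(1, stage)
    ]
    return [str(stage - 1)] + corrected + [test_file, trained_file]
```

For example, `" ".join(stage_arguments(3))` reproduces the arguments shown for stage 3.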
  • What if the token is not really formal English?
    • As mentioned by Prof. Kim, an informal spelling takes the same tag as the word it stands for. For example, "plz" stands for "please", an adverb, so `plz/RB` is tagged the same as `please/RB`.
    • If it is Singlish, such as "lah", "leh" or "lor", use the Interjection (`UH`) tag. An interjection is an exclamation word; examples are "Uhhuh", "Oh" and "Damn".
    • If it is an email address or anything else that is neither a word nor punctuation, tag it as "Foreign Word" (`FW`).
  • What if there is a spelling error in the word?
    • Leave the word as it is and tag it as the closest word you think it represents.
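
The rules for informal tokens could be collected in a small lookup to keep manual corrections consistent. This is an illustrative sketch only: the word lists hold just the examples mentioned above, the email pattern is a rough assumption, and `suggest_tag` is a made-up name.

```python
import re

# Example entries only; extend them as new cases come up during correction.
INFORMAL_SPELLINGS = {"plz": "RB"}  # tagged like "please"
SINGLISH_PARTICLES = {"lah", "leh", "lor"}
# Rough pattern for email addresses, not a full validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def suggest_tag(token):
    """Return a suggested tag for an informal token, or None if no rule applies."""
    lowered = token.lower()
    if lowered in INFORMAL_SPELLINGS:
        return INFORMAL_SPELLINGS[lowered]
    if lowered in SINGLISH_PARTICLES:
        return "UH"
    if EMAIL_PATTERN.match(token):
        return "FW"
    return None
```

A return value of `None` means the token still needs a human decision, for example a misspelled word that should take the tag of the closest real word.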

