Skip to content

jarrodtoh/whatfone-api

Repository files navigation

whatfone-api

API of Whatfone

Getting Started

Install Python:

  1. Goto http://python.org/download/ and download the latest Python 2.7.5 (32bit) installer
  2. Install Python
  3. (Windows) Add “c:\python27” and “c:\python27\scripts” into PATH (the latter is for your python packages)
  • Windows 7 and below:
  • Right-click “My Computer” and choose Properties
  • Click “Advanced System Settings and click “Environment Variables”
  • Find “PATH” and append “c:\python27; c:\python27\scripts”. (IMPORTANT: Must have semi-colon after each directory)
  • Windows 8:
  • Press “Windows Key + W” and type “System”
  • Click “Advanced System Settings and click “Environment Variables”
  • Find “PATH” and append “c:\python27; c:\python27\scripts”. (IMPORTANT: Must have semi-colon after each directory)

Install Python Packages:
For Windows:
1) Setuptools (http://www.lfd.uci.edu/~gohlke/pythonlibs/#setuptools)
2) Pip (http://www.lfd.uci.edu/~gohlke/pythonlibs/#pip)
3) After installing pip, goto terminal and `$ pip install nltk`
For Mac
1) Setuptools (https://pypi.python.org/pypi/setuptools)
2) After installing setuptools, goto terminal and `$ sudo easy_install pip` to install Pip
3) After installing pip, goto terminal and `$ sudo pip install nltk`

NLTK Treebank Download
1) Goto terminal and `$ python`, it will go into python terminal mode.
2) Type `$ import nltk`
3) Type `$ nltk.download()`, and a new window will appear.
4) Go under "All Packages", and download "maxent_treebank_pos_tagger"

Corpora FAQ

All data stored in ~PROJECT_ROOT/data folder

Subfolder Description Docs Tokens Status
0 Original copy 4,999 200,003 N/A
1 Tag Stage 1 25 1057 DONE
2 Tag Stage 2 26 939 DONE
3 Tag Stage 3 9 1043 DONE
4 Tag Stage 4 47 969 DONE
5 Tag Stage 5 29 1028 DONE
6 Tag Stage 6 20 1009 Trained, not corrected yet
7 Tag Stage 7 28 972
8 Tag Stage 8 32 1097
9 Tag Stage 9 32 1121
10 Tag Stage 10 16 1517

API Commands

  • Count Stats for XML (counter.py)
    $ python counter.py <FILENAME>.xml
    Example: $ python counter.py data/0/reviews.xml
  • Train using Default Tagger Model (default-tag-trainer.py)
    $ python default-tag-trainer.py data/x/<RAW>.xml data/x/trained.xml
    Example: $ python default-tag-trainer.py data/1/test1.xml data/1/trained1.xml
  • Train using previously trained XML (trainer-tag-trainer.py)
    $ python trained-tag-trainer.py <INT_NUM_OF_TRAINED_FILES> <TRAINED_FILE_X> * <TEST_FILE> <TRAINED_FILE>
    Example for tagging stage 2, need to pass trained+corrected stage 1 file to train a stage 2 test file:
    $ python trained-tag-trainer.py 1 corrected1.xml test2.xml trained2.xml
    Example for tagging stage 4, need to pass trained+corrected stage 1,2,3 files to train a stage 4 test file:
    $ python trained-tag-trainer.py 3 corrected1.xml corrected2.xml corrected3.xml test4.xml trained4.xml
  • Analyze Tags for Precision, Recall, F1 (analyze-tags.py)
    $ python analyze-tags.py trained.xml corrected.xml
    Precision = total_correct_tags (before correction) / total tokens
    Recall = total_original_tagged_tokens / total_tokens
    F1 = 2 * ((precision * recall) / (precision + recall))
  • Check for missing tags (check-missing-tags.py)
    $ python check-missing-tags.py <XML_FILE>
    If missing tag found, it will display on terminal. If the missing tag is non-word, it is fine. Else, go tag that word.

POS-Tagging FAQ

Currently, we uses NLTK for POS-tagging.

  • Codes covering POS-Tagging:
  • Library File: libraries\tags.py
  • API Wrapper Files: default-tag-trainer.py (Default Tagging Model) and trained-tag-trainer.py (Custom Model via trained tags)
  • Handy POS Tag List
  • What do we need to do?
  • Tagging consists of 10 stages. For each stage, there are a series of steps to be completed. Here's the steps:
Stage Steps
1 1) `$ python default-tag-trainer.py data/1/test1.xml data/1/trained1.xml`
2) Manually correct `trained1.xml` and saved as `corrected1.xml`
2 1) `$ python trained-tag-trainer.py 1 data/1/corrected1.xml data/2/test2.xml data/2/trained2.xml`
2) Manually correct `trained2.xml` and saved as `corrected2.xml`
3 1) `$ python trained-tag-trainer.py 2 data/1/corrected1.xml data/2/corrected2.xml data/3/test3.xml data/3/trained3.xml`
2) Manually correct `trained2.xml` and saved as `corrected2.xml`
Blah Blah Blah...
* What if the token is not really formal English? - As mentioned by Prof. Kim, if it is "plz" instead of "please", tag it with the same tag. For this case, Adverb. So, `please/RB` equals `plz/RB` - If it is Singlish, like "lah", "leh", "lor", use Interjection (`UH`) tag. Interjection means "exclamation" word. Examples of Interjection are: "Uhhuh", "Oh", "Damn". - If it is email address or something non-word and non-punctuation, tag it as "Foreign Word" (`FW`). * What if there's spelling error on the word? - Leave it as it is, tag it to the closest word you think it represents.

About

API of Whatfone

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages