Installation and Usage Instructions
This tool requires Python 2.7 or higher.
Prerequisites: Tools
- Make sure "ark-tweet-nlp-0.3.2.jar" is in folder "NERD_Tweets". This is needed for Tokenization and POS tagging.
- Make sure the in memory DB "bcluster.pdl" is in folder "NERD_Tweets". This contains Brown word clusters for Normalization.
- Go to the web site https://code.google.com/p/word2vec/ . From the section "Pre-trained entity vectors with Freebase naming" download the entity vector model "freebase-vectors-skipgram1000-en.bin.gz" (2.5 GB) and place it in the folder "NERD_Tweets".
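To double-check the setup before running anything, here is a minimal sketch that verifies all three prerequisite files are present in the current folder (the file names come from the steps above):

```python
import os

# Files the tool expects inside the NERD_Tweets folder (see the steps above).
REQUIRED_FILES = [
    "ark-tweet-nlp-0.3.2.jar",                  # Tokenization / POS tagging
    "bcluster.pdl",                             # Brown word clusters
    "freebase-vectors-skipgram1000-en.bin.gz",  # Freebase entity vectors
]

for name in REQUIRED_FILES:
    status = "OK" if os.path.exists(name) else "MISSING"
    print("%-45s %s" % (name, status))
```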
Prerequisites: API Keys
Freebase API Key
This key is needed for the NER component. A key can be obtained from the Google Developers website: https://developers.google.com/freebase/v1/getting-started#api-keys . See Screenshots/API_Keys.png for where to add your API key in the "nerd_tweets.py" file (line 23).
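For background, here is a hedged sketch of the kind of Freebase Search API request such a key authorizes. The endpoint is Google's documented Freebase v1 search URL; the query term is illustrative only, and this is not necessarily how "nerd_tweets.py" itself calls the API:

```python
import json
import urllib
import urllib2

FREEBASE_API_KEY = "YOUR_API_KEY"  # the key you add at line 23 of nerd_tweets.py

# Google's documented Freebase v1 Search API endpoint.
params = urllib.urlencode({"query": "Obama", "key": FREEBASE_API_KEY})
url = "https://www.googleapis.com/freebase/v1/search?" + params

response = json.load(urllib2.urlopen(url))
for match in response.get("result", []):
    # Each match carries a human-readable name and a Freebase machine id (mid).
    print("%s\t%s" % (match.get("name"), match.get("mid")))
```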
Microsoft Web N-gram User Token
This token is needed for the Language model and for dropping false positives. A token can be obtained by sending a mail to "webngram@microsoft.com" with the subject "Token Request". More information and a tutorial: http://weblm.research.microsoft.com/info/rest.html and http://weblm.research.microsoft.com/info/index.html . See Screenshots/API_Keys.png for where to add your N-gram token in the "nerd_tweets.py" file (line 24).
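For background, a sketch of a joint-probability lookup against the Web N-gram REST service. The URL pattern and the "bing-body/apr10/5" catalog/version/order names follow the REST tutorial linked above, but treat them as assumptions; "nerd_tweets.py" may call the service differently:

```python
import urllib
import urllib2

NGRAM_USER_TOKEN = "YOUR_USER_TOKEN"  # the token you add at line 24 of nerd_tweets.py

# Joint-probability ("jp") lookup; "bing-body/apr10/5" names a catalog,
# version, and n-gram order from the service's REST tutorial (assumed here).
phrase = "new york city"
url = ("http://weblm.research.microsoft.com/rest.svc/bing-body/apr10/5/jp?"
       + urllib.urlencode({"u": NGRAM_USER_TOKEN, "p": phrase}))

# The service returns the base-10 log probability of the phrase as plain text.
log10_prob = float(urllib2.urlopen(url).read())
print(log10_prob)
```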
Prerequisites: Python Modules
gensim
This module is needed for the word2vec entity vector model. Install it from the terminal with the command "sudo pip install --upgrade gensim". Detailed information and installation instructions: http://radimrehurek.com/2013/09/deep-learning-with-word2vec-and-gensim/ and http://radimrehurek.com/gensim/install.html
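Once gensim is installed, the entity vector model downloaded in the tools section can be loaded as sketched below. This assumes a gensim version that still provides Word2Vec.load_word2vec_format (newer releases moved this to KeyedVectors) and that the model keys entities by "/en/..." identifiers:

```python
from gensim.models import Word2Vec

# Load the pre-trained Freebase entity vectors; gensim reads the gzipped
# binary directly. This is slow and memory-hungry (the file is 2.5 GB).
model = Word2Vec.load_word2vec_format(
    "freebase-vectors-skipgram1000-en.bin.gz", binary=True)

# Entities in this model are assumed to be keyed by /en/... identifiers.
print(model.most_similar("/en/barack_obama", topn=5))
```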
jellyfish
This module is needed for Normalization to compute character edit distance (Levenshtein) and phonetic edit distance (Metaphone). To install it, download the zip from https://github.com/sunlightlabs/jellyfish and unzip it. Then navigate to the unzipped folder in the terminal and run the command "python setup.py install". Tutorial (optional reading; not needed for using this tool): https://pypi.python.org/pypi/jellyfish
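To confirm the installation, a small sketch of the two distance primitives mentioned above (the example tokens are illustrative; jellyfish expects unicode strings on Python 2):

```python
# -*- coding: utf-8 -*-
import jellyfish

# Character edit distance between a tweet token and a candidate normalization.
print(jellyfish.levenshtein_distance(u"tmrw", u"tomorrow"))  # 4 insertions

# Phonetic keys: tokens that sound alike share the same Metaphone code.
print(jellyfish.metaphone(u"tomorrow"))
print(jellyfish.metaphone(u"tomoro"))
```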
PyDbLite
This is a pure-Python in-memory database engine and is needed to load and query the word clusters. Installation instructions: http://www.pydblite.net/en/index.html
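A minimal sketch of opening a ".pdl" database with PyDbLite. The import path differs between PyDbLite versions, and the "word"/"cluster" field names are hypothetical, not necessarily the actual schema of "bcluster.pdl":

```python
from pydblite.pydblite import Base  # on PyDbLite 2.x: from PyDbLite import Base

# Open the existing cluster database shipped with the tool.
db = Base("bcluster.pdl")
db.open()

# Look up the Brown cluster for a token; "word" and "cluster" are
# hypothetical field names -- the real schema of bcluster.pdl may differ.
for record in db(word="tmrw"):
    print(record["cluster"])
```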
pyenchant
This module is required for Out-of-vocabulary word checking and spelling suggestions. Installation command: "sudo pip install pyenchant". More information (optional reading; not needed for using this tool): https://pythonhosted.org/pyenchant/api/enchant.html
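A quick sketch of the checking and suggestion calls pyenchant provides (the "en_US" dictionary and the example tokens are illustrative):

```python
import enchant

# Dictionary for out-of-vocabulary checks; "en_US" is a commonly
# available Enchant dictionary.
d = enchant.Dict("en_US")

print(d.check("tomorrow"))  # True: in-vocabulary
print(d.check("tmrw"))      # False: out-of-vocabulary
print(d.suggest("tmrw"))    # candidate spelling corrections
```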
Other Modules
All other modules used come with the Python 2.7 standard library.
Usage Instructions
- Download the project as a zip file from GitHub.
- Unzip it and navigate to the unzipped "NERD_Tweets" folder in the terminal.
- Make sure all the prerequisite tools are present inside the "NERD_Tweets" folder and that the required Python modules are installed. Also, add the Microsoft user token and the Freebase API key in the appropriate places in the file "nerd_tweets.py", as shown in the screenshots.
- Run the command "python nerd_tweets.py sample_input.txt sample_output.txt". Here, sample_input.txt contains sample tweets, one per line, and sample_output.txt is where the tool writes the extracted named entities with their disambiguations (see Screenshots/terminal_output.png). A scripting sketch follows this list.
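To run the tool over your own data, a minimal sketch that writes a line-per-tweet input file and invokes "nerd_tweets.py" with the same calling convention as the sample command above (the two tweets are illustrative only):

```python
import subprocess

# One tweet per line, matching the sample_input.txt format described above.
tweets = [
    "just landed in new york city!",
    "watching the obama speech tonight",
]
with open("my_input.txt", "w") as f:
    f.write("\n".join(tweets))

# Same calling convention as the sample command above.
subprocess.call(["python", "nerd_tweets.py", "my_input.txt", "my_output.txt"])
```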
Queries
For any queries, please contact me by mail at "srpatra88@gmail.com". For reference, please see https://github.com/srpatra88/Soumya_Thesis/tree/master/Reports