Description

This code performs the following top-level tasks:

Builds a MongoDB database.
Reads a series of word-to-slang, word-to-phonetic and word-to-emoticon lists and stores them as mongo documents.
Builds a web service for querying and updating the documents.

Purpose

The aim of this work is to make it easy to retrieve different representations of standard English words through a web service. It is hoped that some level of machine learning could help detect new slang words and emoticons as they appear in social media, so that the store can be improved and kept up to date.

This code was written to support the mining of social media content using text processing software. To extract the maximum useful content from social media, the content is often transformed from slang/abbreviation terms into 'real' words. Several projects have attempted to document the mapping between words and slang terms, and between words and emoticons. This code uses those initial datasets as a starting point for a web service, hopefully useful to anyone mining social media for content.

Going one stage further for those wishing to identify words based on their phonetic representation, this code also includes the phonetic representation of words.

Mongo documents represent words, known tokens in the English language. Each word document has a field storing the word's phonetic representation and a field storing an array of slang terms that are associated with the word. There is not a one-to-one mapping between a slang term and a word; a slang term may apply to more than one word (incidentally, this is one of the main reasons for returning a list/array of json objects as the default).
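As a rough illustration of the document shape described above, here is a minimal in-memory sketch. The field names 'word', 'pho' and 'slang' are assumed from the web service's check options, and the phonetic strings are made-up placeholders, not values from the real datasets.

```python
# Hypothetical word documents. Field names are assumed from the 'check'
# options of the web service; the 'pho' values are placeholders only.
WORDS = [
    {"word": "smile", "pho": "placeholder-pho-smile", "slang": ["smiiiillle", ":)"]},
    {"word": "happy", "pho": "placeholder-pho-happy", "slang": [":)"]},
]


def find_by_slang(term, docs=WORDS):
    """Return every word document whose slang array contains the term."""
    return [doc for doc in docs if term in doc["slang"]]


# One slang term can map to several words, hence a list is returned.
matches = find_by_slang(":)")
print([doc["word"] for doc in matches])
```

Because ':)' appears in the slang array of both documents, the lookup returns two results, which is why a list is the natural default return type.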

To keep the service permanently available, and as a learning experience, the code was hosted on dotcloud as a sandbox.

Web Service

A very simple web service is provided by Python bottle and wsgi, running behind nginx when deployed to dotcloud. Unit/functional tests are provided for the web parts to ensure the URL parameters work, so check those for good examples.

There are 3 functional URL parameters. ALL ARE REQUIRED.

token: Is the term/word/character set that you wish to search for in the db. Eg. /wordslang?token=smiiiillle --> search term is 'smiiiillle'
check: Is the field you want to search for that token in. The options are 'word','slang','pho' and 'all'. The latter checks each of the first 3. Eg. /wordslang?token=smiiiillle&check=word --> checks for 'smiiiillle' in the word field.
output: Is the field you want returned. The options are 'word', 'slang', 'pho', 'all' and 'exists'. Where 'all' is specified, all fields will be returned. Where 'exists' is specified, {"exists":"true"} or {"exists":"false"} will be returned. Eg. /wordslang?token=smiiiillle&check=word&output=pho --> checks the word field for the term 'smiiiillle' and outputs the phonetic representation of the matching word, 'smile'.
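The three parameters above can be assembled on the client side roughly as follows. This is only a sketch: the helper name build_query_url and the base URL are hypothetical, and the option sets are copied from the parameter descriptions rather than from the service code.

```python
from urllib.parse import urlencode

# Allowed values, taken from the parameter descriptions above.
CHECK_OPTIONS = {"word", "slang", "pho", "all"}
OUTPUT_OPTIONS = {"word", "slang", "pho", "all", "exists"}


def build_query_url(base_url, token, check, output):
    """Build a /wordslang query URL, rejecting unknown option values."""
    if not token:
        raise ValueError("token is required")
    if check not in CHECK_OPTIONS:
        raise ValueError("unknown check option: %r" % check)
    if output not in OUTPUT_OPTIONS:
        raise ValueError("unknown output option: %r" % output)
    # urlencode also escapes characters such as ':' and ')' in emoticons.
    query = urlencode({"token": token, "check": check, "output": output})
    return "%s/wordslang?%s" % (base_url.rstrip("/"), query)


# The base URL is a placeholder for wherever the service is running.
print(build_query_url("http://localhost:8080", "smiiiillle", "word", "pho"))
```

Validating the options before sending the request gives a clearer error than whatever the service returns for an unknown value.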

There are some POST/PUT hooks in for machine-based population of new word-slang combinations too, but they aren't tested.

Dotcloud Deployment

The service deploys to dotcloud automatically, but the population of the database must be done manually over ssh on the server:

$ dotcloud ssh wordslang

It was hoped that even the database inserts could be automated, but at the time *nix 'at' was not supported on dotcloud, and building the inserts into the postinstall made it fail due to the postinstall timeout limit (5 mins?). I looked into a kludge involving creating a cron job and then deleting it, but that seemed OTT.

setupapp.py handles the setup of the mongo db (trivial), the building of the right indexes and some username and password configuration. Specifically, configureDatabase.py extracts the admin password from the environment.json file built on the dotcloud server, uses that to authenticate against mongo, creates a new user/password and then writes that back to the main runtime config file for the other processes to use.
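The credential flow described above might look something like the sketch below. It is not the code in configureDatabase.py: the function name, the JSON key, the config section name and the create_user callback are all assumptions made for illustration, and the step that talks to mongo is left to the caller.

```python
import configparser
import json


def configure_database(env_path, config_path, new_user, new_password,
                       create_user,
                       password_key="EXAMPLE_MONGO_ADMIN_PASSWORD"):
    """Read the admin password, create a user, then save its credentials."""
    with open(env_path) as handle:
        environment = json.load(handle)
    # password_key is a hypothetical key name; check the real
    # environment.json on the server for the actual one.
    admin_password = environment[password_key]

    # create_user is supplied by the caller. It is expected to authenticate
    # against mongo as admin and add the new user.
    create_user(admin_password, new_user, new_password)

    # interpolation=None lets passwords contain '%' characters.
    config = configparser.ConfigParser(interpolation=None)
    config.read(config_path)  # a missing file is silently ignored
    if not config.has_section("mongo"):
        config.add_section("mongo")
    config.set("mongo", "user", new_user)
    config.set("mongo", "password", new_password)
    with open(config_path, "w") as handle:
        config.write(handle)
```

Passing the mongo step in as a function keeps the file handling testable without a running database.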

Debugging bottle/wsgi was pretty tedious on dotcloud, possibly because I didn't fully understand where it would be writing its errors to. I ended up using a combination of try/except and error logging to a local file, which worked, but was clunky.
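One way to tidy up the try/except-plus-local-file approach is a small decorator built on the standard logging module. This is a sketch rather than what the repository does; the logger name and both helper names are made up for the example.

```python
import functools
import logging


def make_file_logger(path):
    """Return a logger that appends errors to a local file."""
    logger = logging.getLogger("wordslang.errors")  # hypothetical name
    logger.setLevel(logging.ERROR)
    if not logger.handlers:  # avoid adding duplicate handlers on reuse
        handler = logging.FileHandler(path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def log_errors(logger):
    """Decorator: log any exception with its traceback, then re-raise it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                logger.exception("unhandled error in %s", func.__name__)
                raise
        return wrapper
    return decorator
```

A request handler can then be wrapped with @log_errors(make_file_logger("errors.log")), so the traceback lands in a known file regardless of where the server sends its own error output.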

Improvements

Here are some improvements I can/should make when time permits and based on whether it gets used.

Error webpages and templates
API explanation pages
PUT/POST to a quarantine collection?
Deployment docs for Apache webserver
Deployment docs for nginx on dotcloud
Other datasets for inclusion?
Working with academic groups to build upon this list.
Including non-space separated versions of the emoticons - DONE.
Inclusion of a fuzzy/regex matching query for terms - with the speed costs that will have. Possibly just partial matching - wild card on the ends of words?
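For the partial-matching idea, a trailing wildcard can be expressed as an anchored regular expression in a MongoDB filter document. The sketch below only builds the filter and checks it locally with Python's re module; the helper names are hypothetical, nothing here is implemented in the service, and it assumes the escaped pattern is interpreted the same way by MongoDB's regex engine.

```python
import re


def prefix_query(field, prefix):
    """Build a MongoDB-style filter for values that start with prefix."""
    # re.escape stops characters such as ')' in emoticons from being
    # treated as regex syntax.
    return {field: {"$regex": "^" + re.escape(prefix)}}


def matches(filter_doc, document):
    """Apply the filter locally, to show which documents it would select."""
    (field, condition), = filter_doc.items()
    value = document.get(field, "")
    return re.search(condition["$regex"], value) is not None


query = prefix_query("word", "smi")
print(matches(query, {"word": "smile"}))    # starts with 'smi'
print(matches(query, {"word": "dismiss"}))  # contains 'smi' but not at start
```

Anchoring the pattern at the start of the word is the cheaper of the two options mentioned above, since a case-sensitive prefix expression can make use of an index on the field, whereas a general fuzzy match cannot.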

Sources:

http://www.cool-smileys.com/text-emoticons-part1 & http://www.cool-smileys.com/text-emoticons-part2 for the emoticon lookup
