Stemming Algorithm

The stemming algorithm in StemFinder can stem words ending in -s, -es, -ed, -ing, -ness, and -ly, all the irregular cases in irregular.txt, and more. It also takes into account some of the stranger rules, like picnicking -> picnic and larvae -> larva. The algorithm is not perfect, but it does a pretty good job and handles all of the cases in test.txt, some of which are pretty tricky.

It seems like one of the most important things for a stemming algorithm to avoid is creating stems that are not real words. To make sure this doesn't happen, there is an all_words.txt file containing all, or at least the overwhelming majority, of English words. The stemming algorithm checks each of its results against this list before returning it. That's the purpose of the function 'checkStem'. This is effective in many cases, but proper nouns and invented words need handling too.

It wouldn't make sense to exclude the word 'Jedi' from a Star Wars script, and few people reading Greek literature want the name Menelaus stemmed to Menelau. That's why, when the stemming algorithm encounters a word that's not in the English dictionary, it leaves it alone.
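To make the dictionary check concrete, here is a minimal sketch of the idea, not the actual StemFinder code. The names `stem`, `check_stem`, `SUFFIXES`, and the `known_words` and `irregular` arguments are assumptions made for illustration (the real function is called 'checkStem', and its signature may differ), and the special rules like picnicking -> picnic are left out.

```python
# Illustrative sketch only; names and structure are assumptions,
# not the real StemFinder implementation.

# Longer suffixes come first so "-es" is tried before "-s".
SUFFIXES = ("ness", "ing", "ly", "ed", "es", "s")


def check_stem(candidate, known_words):
    """Return True if the candidate stem is a known word."""
    return candidate in known_words


def stem(word, known_words, irregular=None):
    """Strip a suffix only when the result is a known word."""
    irregular = irregular or {}
    # Irregular forms (e.g. larvae -> larva) are looked up directly.
    if word in irregular:
        return irregular[word]
    # Words missing from the dictionary (names, invented words) are left alone.
    if word not in known_words:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[:-len(suffix)]
            if check_stem(candidate, known_words):
                return candidate
    # No suffix produced a real word, so keep the original.
    return word
```

With a small word set standing in for all_words.txt, `stem("walked", words)` would give "walk", while a name like "menelaus" that is absent from the set comes back unchanged.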

Word Counter

The WordCounter module does the dirty work. It reads in a .txt file, removes the punctuation and unnecessary whitespace, converts everything to lowercase, and combines all the repeated words. Combining the words before stemming is helpful because, while the stemming function isn't slow (something like 1*10^-5 seconds last I checked), it is slower than the built-in Python functions for handling text input. This way, for a big file like Moby Dick, I only have to call the stemming algorithm 20,000 times rather than 206,052 times.
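The count-first, stem-second order can be sketched roughly as follows. This is an assumption-laden illustration rather than the WordCounter source: `count_words`, `count_stems`, and the `stem_func` parameter are hypothetical names, and the punctuation handling is a simple regular expression that may not match what the module really does.

```python
# Illustrative sketch only; function names are hypothetical.
import re
from collections import Counter


def count_words(text):
    """Lowercase the text, drop punctuation, and count each distinct word."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    # split() with no arguments also collapses runs of whitespace.
    return Counter(cleaned.split())


def count_stems(text, stem_func):
    """Call stem_func once per distinct word, then merge the counts."""
    stem_counts = Counter()
    for word, occurrences in count_words(text).items():
        stem_counts[stem_func(word)] += occurrences
    return stem_counts
```

Because `stem_func` runs once per distinct word instead of once per occurrence, the number of stemming calls depends on the vocabulary size rather than the length of the file.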

Website

The website at http://top25-informant.rhcloud.com/ is hosted on OpenShift and uses Python 2.7.6 with Django 1.6.5. The website is pretty simple, but it does do some helpful things. It maintains an archive of all previously analyzed files, and if a user tries to pass a file that is not .txt, it raises an error. The website also stores the results of WordAnalyzer/StemFinder as text in a MySQL database, so that when a user clicks on a file in the archive, the load time is significantly reduced. This difference is unnoticeable for most .txt files, but essential for files like mobydick.txt, which can take upwards of 9 seconds to load. The website also keeps unique copies of all uploaded files (using the upload time) so that two files uploaded with the same name don't cause problems in the archive. As an added bonus, when you click on a word on the display page, the site looks it up on dictionary.com (it just passes the word through that website's URL search parameter).
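One way the .txt check and the time-based unique names could work is sketched below. This is a guess at the approach, not the site's Django code: `validate_txt`, `unique_name`, and the timestamp format are all assumptions made for illustration.

```python
# Illustrative sketch only; helper names and timestamp format are assumptions.
import os
from datetime import datetime


def validate_txt(filename):
    """Raise an error for anything that is not a .txt file."""
    if not filename.lower().endswith(".txt"):
        raise ValueError("only .txt files are accepted: " + filename)


def unique_name(filename, uploaded_at):
    """Append the upload time so same-named uploads get distinct names."""
    validate_txt(filename)
    base, ext = os.path.splitext(filename)
    stamp = uploaded_at.strftime("%Y%m%d%H%M%S%f")
    return "{0}_{1}{2}".format(base, stamp, ext)
```

For example, calling `unique_name("mobydick.txt", datetime.now())` twice at different moments yields two different stored names, so the second upload does not overwrite the first.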

cnwalker/top25

Folders and files

Latest commit

History

Repository files navigation

Stemming Algorithm

Word Counter

Website

About

Resources

Stars

Watchers

Forks

Languages