GitHub - avh2ag/Spelling-Suggester: Python project to suggest correctons for misspelled words

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
__pycache__		__pycache__
data_processing		data_processing
edit_distance		edit_distance
.gitattributes		.gitattributes
__init__.py		__init__.py
assignment.md		assignment.md
mispelled_queries.csv		mispelled_queries.csv
ouput.txt		ouput.txt
output.dat		output.dat
readme.txt		readme.txt
run.py		run.py
suggestions.txt		suggestions.txt
word_frequency.csv		word_frequency.csv

Repository files navigation

Acknowledgements:
http://en.wikipedia.org/wiki/Levenshtein_distance
Denotes the general algorithm for determining the number of edits between two strings.

Runs on python version 2.7.6
Unit testing relies on pytest.py
to install, pip install pytest

To run all system tests, type py.test to see all unit testing at the root level of the package.

This assumes that both 'word_frequency.csv' and 'mispelled_queries.csv' are in the same directory as run.py.

To run the Spelling Suggester, use 'python run.py' in the primary Spelling-Suggester directory.

1. How would the code perform if the size of the dictionary were 1 million words?
The code would slow down tremendously, and would require many more optimizations to be usable. Even now, when I don't check the input word against any word 3 letters longer or shorter, the brute force attempt is making hundreds of calls to the edit distance algorithms. The optimization I currently use, given the size of the dictionary (~11000 words), is to organize all the words of the same length into their own maps. That way, I don't have search all the words each time, as the minimum edit distance of two words of different lengths is the respective difference in length. To handle 1 million words, an approach similar to Peter Norvig's (http://norvig.com/spell-correct.html) is necessary. However, his source code is well discussed and publicly available in python. As a result, I didn't feel comfortable taking his source code line for line.

2. How would the code perform with an edit distance of 3?
The code currently takes about 2 seconds per input to find suggestions. Increasing the edit distance that produces an acceptable word does not slow down the code, as it currently is just tossing out those results as unacceptable. The main overhead in the code is the number of words that are being searched for edit-distance.

3. How does the code perform on long queries versus short queries and why?
Short queries take less time. The Levinshtein algorithm used for finding the edit distance runs in proportion to the strings being compared. The shorter the words being queried, the fewer comparisons needed to compute the edit distance.

About

Python project to suggest correctons for misspelled words

Readme

Activity

0 stars

2 watching

0 forks

Report repository

Releases

No releases published

Packages

No packages published

Languages

Python 100.0%

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pycache

pycache

data_processing

data_processing

edit_distance

edit_distance

.gitattributes

.gitattributes

init.py

init.py

assignment.md

assignment.md

mispelled_queries.csv

mispelled_queries.csv

ouput.txt

ouput.txt

output.dat

output.dat

readme.txt

readme.txt

run.py

run.py

suggestions.txt

suggestions.txt

word_frequency.csv

word_frequency.csv

Repository files navigation

About

Releases

Packages

Languages

avh2ag/Spelling-Suggester

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages