
Spelling Correction in Python

Note: The following is based on Peter Norvig's online blog post. However, please do not simply grab his code: the main point of the exercise is to get used to text processing in Python. I have intentionally not linked to the blog, but will do so later in the term.

If you are new to Python, use this as an opportunity to learn slowly. It will be a challenge at first, but we are sure you will pick things up very quickly.

Introduction

Spelling correction is a common task in applications which deal with text. For example, most search engines will correct queries containing misspelt words. The best spelling correction algorithms use sophisticated models of common spelling errors. However, we can achieve a lot using nothing more than a large amount of data.

Using lots of data for spelling correction

Suppose we have a misspelt word. It's likely that the word is only wrong by one, or possibly two, typos. Even so, there are potentially many words it could be. For instance, if I type "lates" then I might mean "late", "latest", "lattes", etc. In such cases we try to find the correction c, out of all the possible corrections, that maximizes the probability that c is the intended word, given the original word w:

argmax_{c ∈ candidates} P(c | w)

By Bayes' Theorem this is equivalent to:

argmax_{c ∈ candidates} P(c) P(w | c) / P(w)

But P(w) is the same for every candidate, so we can ignore it. We are left with choosing the candidate c that maximizes the product of:

  1. The probability P(c) of the correction occurring on its own (without any consideration of context). This is the language model.

  2. The probability P(w | c) that w was typed when the correction c was intended. This is the error model.

Error models can be quite sophisticated. For instance, it's far more common to duplicate letters or to mistake "i" for "e". Below we'll ignore this and simply assume that errors due to a single mistake are far more likely than errors due to two, but far less likely than words we have actually seen before.

Data

Under Data on this Canvas module, you'll find a file "bigtxt.zip", a zip archive containing lots of freely available (and copyright-free) text. We'll use this to create a language model for calculating the probability of any word.

Specific Python Knowledge Required

Basic String Manipulation

Regular Expressions

Sets (not strictly needed, but life is much easier if you use them)

plus basic Python control structures such as loops & conditionals. Ask us for help!

The Counter data structure.

Counter is a data structure contained in the "collections" module, which provides a number of specialised, high-performance container datatypes. To use a Counter we first need to import it:

>>> from collections import Counter

A Counter is a dictionary subclass for counting hashable objects: it maps each "thing" to the number of times it occurs. For instance:

>>> import re
>>> def words(text): return re.findall(r'\w+', text.lower())
>>> WORDS = Counter(words(open('/.../big.txt').read()))

will create a dictionary lookup for the number of times every word occurs in big.txt. The function "words" uses a regular expression to extract every word from big.txt, converted to lowercase.

Counters have a .values() method which returns the counts:

>>> WORDS.values()

These numbers are the number of times each word occurs in big.txt.

>>> WORDS['a']
21124
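
Since a Counter is a dictionary, you can also total the counts with sum(), and its most_common() method returns (word, count) pairs in descending order of frequency - handy for a quick sanity check (the exact numbers will depend on your copy of big.txt):

>>> sum(WORDS.values())     # total number of word tokens in big.txt
>>> WORDS.most_common(3)    # the three most frequent words with their counts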

Of course, big.txt also contains a lot of noise - including misspellings. However with enough data we can ignore such imperfections.

Procedure

  1. Write a function which calculates the probability of any word occurring in any context.
>>> P('the')
0.07154004401278254
>>> P('computer')
1.0756688194982902e-05
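
A minimal sketch of such a function, assuming the WORDS counter built above (the total count N is computed once, as a default argument, rather than on every call):

>>> def P(word, N=sum(WORDS.values())): return WORDS[word] / N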
  2. Write a function which generates all one-character edits of a given word.
>>> edits1('cat')
{'cyat', 'cato', 'fat', 'cati', 'cate', 'czat', 'cgt', 'catv', 'cnt', 'cact', 'catc', 'caa', 'car', 'cax', 'cap', 'cmt', ... 'caot'}

An edit can be defined as inserting, removing, or changing one character, or swapping two adjacent characters (e.g. "cmoputer" for "computer"). You will note above that my function returns a set. This is a convenient data structure here, but you could also use a list (though it's less efficient, since it will contain duplicates).
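
One possible sketch (not the only way to decompose the problem): split the word at every position, then apply each edit type to the splits.

>>> letters = 'abcdefghijklmnopqrstuvwxyz'
>>> def edits1(word):
...     splits   = [(word[:i], word[i:]) for i in range(len(word) + 1)]
...     deletes  = [L + R[1:] for L, R in splits if R]
...     swaps    = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
...     replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
...     inserts  = [L + c + R for L, R in splits for c in letters]
...     return set(deletes + swaps + replaces + inserts)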

  3. Write a function which generates all two-character edits of a given word.

Hint: if you've done step 2, then all you need to do is find all one-character edits of each word in the set that edits1 returns.
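
Following the hint, this can be a one-liner over the result of edits1:

>>> def edits2(word): return {e2 for e1 in edits1(word) for e2 in edits1(e1)}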

  4. Write a function which checks which words are already known.

Given a string of length n there will be n deletions, n-1 swaps, 26n alterations, and 26(n+1) insertions, or 54n+25 possible candidates in total. This number becomes huge if we consider two edits. However, we can shrink the set by only considering candidates that are words we've already seen.
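
A sketch of such a filter, again assuming the WORDS counter from earlier (membership tests on a dictionary are cheap, so this prunes the candidate set quickly):

>>> def known(words): return {w for w in words if w in WORDS}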

  5. Write a function which suggests candidate corrections for any misspelling.

Any candidate must be 1) a word that occurs in big.txt, and 2) at most 2 edits away from the misspelt word.

>>> candidates('th')
{'to', 'th', 'tu', 'oh', 'tt', 'ty', 'ti', 'ta', 'the', 'te', 'tm', 'ah', 'h', 't', 'tr', 'tz', 'thy', 'wh', 'eh', 'ch'}

Note the above example - there's lots of noise. However, probability theory will come to the rescue!
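
One sketch consistent with the example output above, taking the union of known words that are 0, 1, or 2 edits away (always computing the two-edit set is slow, so preferring closer edits first, in line with the error model assumption from the introduction, is a worthwhile refinement):

>>> def candidates(word):
...     return known([word]) | known(edits1(word)) | known(edits2(word))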

  6. Write a function which pulls everything together: given a misspelt word, it generates all possible corrections and chooses the most likely.
>>> correction('cimputer')
'computer'
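
Assuming P and candidates as sketched above, the final step can be as simple as picking the candidate with the highest probability, falling back to the word itself if no candidates are found:

>>> def correction(word): return max(candidates(word) or {word}, key=P)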

More Advanced Work

The spelling corrector actually works fairly well - Norvig claims 75% accuracy on a corpus of spelling errors. More importantly, it's relatively fast. However, it's not without problems. Try:

>>> correction('reciet')
'recite'

>>> correction('adres')
'acres'

Contrary to what our code suggests, "reciet" is a common misspelling of "receipt", and "adres" of "address". How can you improve the performance of the spelling corrector?
