
Some of the scripts included in this set are drafts that I set aside
based on discussions we've had over e-mail.  The CORE section below covers
the OCR evaluator, and the DRAFTS section covers the scripts I set aside.

CORE SCRIPTS:

Dictionary.py: reads all files in a directory and compiles the first word
from each line into a set.
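For reference, the behavior described here amounts to something like the
following sketch (build_dictionary and the other names in these sketches
are illustrative, not necessarily the names used in the modules):

    import os

    def build_dictionary(folder):
        # Compile the first word of every line in every file into a set.
        words = set()
        for name in os.listdir(folder):
            with open(os.path.join(folder, name), encoding='utf-8') as f:
                for line in f:
                    parts = line.split()
                    if parts:
                        words.add(parts[0])
        return words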

FSBuilder.py: accepts a set of words (ideally generated by Dictionary.py)
and produces a set containing every word that has non-final s's, with f's
substituted for all the non-final s's.
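In other words, roughly (a minimal sketch; build_fs_set is a hypothetical
name):

    def build_fs_set(words):
        # For each word containing a non-final 's', add the variant in
        # which every non-final 's' becomes 'f' -- the long-s confusion
        # OCR commonly produces.
        variants = set()
        for word in words:
            if 's' in word[:-1]:
                variants.add(word[:-1].replace('s', 'f') + word[-1])
        return variants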

TokenGen.py: accepts a list of lines and cleans them for processing by
breaking them into potential tokens, stripping them of punctuation,
and discarding invalid tokens.  Discarding happens twice: once before
punctuation is removed, to catch tags, and again after punctuation is
removed, to catch numbers.  A second, similar function can be called from
this module to try to fuse words at the end of a line that were broken by
a hyphen (not used by default; see below).  I mostly included it here with
an eye towards having multiple functions in TokenGen to be used for
different levels of scrutiny or correction later in the project.
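The two-pass discard logic looks roughly like this (a sketch; the exact
discard tests, such as treating anything starting with '<' as a tag, are
my assumption):

    import string

    def clean_lines(lines):
        tokens = []
        for line in lines:
            for raw in line.split():
                if raw.startswith('<'):            # first pass: drop tags
                    continue
                word = raw.strip(string.punctuation)
                if not word or word.isdigit():     # second pass: drop numbers
                    continue
                tokens.append(word)
        return tokens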

AccEval.py: accepts at minimum a cleaned list of words and a dictionary.
Optionally, it will accept a list of substitution rules and a command to
use the hyphen-correction version of the token generator (only used if
you specifically pass in a boolean for it).  It checks all tokens against
the dictionary and substitution rules (if passed in) and returns a six-
value tuple with total capitalized tokens, total capitalized dictionary
matches, total capitalized matches by substitution, total lower-case
tokens, total lower-case dictionary matches, and total lower-case matches
by substitution.
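Schematically, the accounting works like this (a sketch that already
incorporates the lowercase-matching rules described in the May 27th notes
below):

    def acc_eval(tokens, dictionary, subs=None):
        # Returns (cap_total, cap_dict, cap_sub,
        #          low_total, low_dict, low_sub).
        counts = [0, 0, 0, 0, 0, 0]
        for word in tokens:
            base = 3 if word[0].islower() else 0  # lowercase vs. capitalized
            counts[base] += 1
            if word.lower() in dictionary:
                counts[base + 1] += 1
            elif subs and word.lower() in subs:
                counts[base + 2] += 1
        return tuple(counts)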

These are the four core modules.  Dictionary and FSBuilder are designed
to be run once per batch process to load all potential matches into memory.
AccEval doesn't run them; it only accepts the sets they generate.

To see how they were designed to interact with one another, look at and/or
run either SingleTest.py or BatchTest.py.  These are just quick scripts I
put together to help me test my functions after I started separating them
into modules.  In both, Dictionary/FSBuilder are run once at the beginning,
and their results are passed into AccEval along with a list of lines.
AccEval calls TokenGen to process the list of lines and then compares the
list of processed tokens it returns to the Dictionary/FSBuilder sets that
were passed into it.  SingleTest doesn't do anything more than dump
the tuple into the shell, but BatchTest produces a short report.  I mainly
did this just so I could see how tweaks to my evaluation algorithms were
behaving across the samples.
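The overall flow in those test scripts is essentially the following
(paths and names are placeholders; note that in the real modules AccEval
calls TokenGen itself, whereas this sketch shows the cleaning step
explicitly):

    main_dict = build_dictionary('dictionaries/')
    fs_set = build_fs_set(main_dict)

    with open('sample.txt', encoding='utf-8') as f:
        lines = f.readlines()

    results = acc_eval(clean_lines(lines), main_dict, subs=fs_set)
    print(results)   # SingleTest just dumps the tuple; BatchTest reports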

DRAFTS/:

I originally wrote my token generator to set all characters to lower case,
mostly because I saw you doing that in a lot of your old scripts.  If you
want to run those earlier versions, they'll need to be moved out of the
drafts directory, because they depend on Dictionary and FSBuilder.

The two hyphenread scripts are two methods of fusing that I played around
with.  The first checks for a match before fusing; the other fuses
automatically.  I included a version of the first as a separate function
in TokenGen (which, again, AccEval won't use unless specifically told to).

The other scripts in that directory are probably only useful to someone
who later wants to learn how to do this sort of thing.  As examples,
they're really simple and modular.

*******
Changes as of May 27th (TU):
	First, I changed AccEval so that it checked for matches with word.lower() instead of word. This is necessary because my dictionaries only have lowercase words. So while we do want to separate out capitalized and uncapitalized forms, we always have to check for matches using the lowercase form.
	Also, instead of checking whether word.islower() to identify “capitalized” words, I now check whether word[0].islower(), because there are a fair number of errors like “wiU” (for “will”) in the texts. They aren’t lowercase, but they aren’t capitalized either. I want to count them as lowercase because the underlying rationale for ignoring capitalized words is that proper nouns might mistakenly be counted as “errors.” But a word like “wiU” really is an error, and should count in the lowercase column.
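	Concretely, the two checks changed roughly as follows (a before/after sketch, not the literal diff):

    # before:
    if word.islower():              # treated "wiU" as capitalized
        ...
    if word in dictionary:          # missed matches for capitalized forms
        ...

    # after:
    if word[0].islower():           # "wiU" now counts as lowercase
        ...
    if word.lower() in dictionary:  # dictionaries hold lowercase forms only
        ...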
	I also expanded TokenGen, because there are some subtleties to the tokenizing process that I hadn’t fully thought through in my initial survey of that problem -- especially where hyphens are concerned. I left the functions “Basic” and “Hyphen” in TokenGen, but wrote two slightly altered functions based on “Hyphen,” that are the ones I expect to use. These functions changed two things: the handling of fusion across end-of-line boundaries, and the handling of hyphenated forms.
	In general, I want to try fusing across eol boundaries whether or not there’s a final hyphen. That final hyphen often gets missed by OCR, or some other character (like a quotation mark) may technically be the final character in the line. So while the eol hyphen can be a useful supplementary clue at a later stage (when we actually parse individual texts), it’s not something we’ll rely on heavily at this stage. The functions I wrote have slightly different rules to decide whether to fuse, but both of them always try fusing, whether or not there’s an eol hyphen.
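	A minimal sketch of that always-try-fusing policy (try_fuse is a hypothetical name; the two real functions apply slightly different acceptance rules):

    def try_fuse(last, first, dictionary):
        # Join the last token of one line with the first token of the
        # next, dropping a trailing hyphen if present; accept the fusion
        # only when the joined form is a dictionary word.
        fused = last.rstrip('-') + first
        return fused if fused.lower() in dictionary else None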
	For the purposes of OCReval, I think we’re also going to want to break hyphenated words into parts and check the parts. The core reason for this is that it’s impossible to have an exhaustive hyphen dictionary. Hyphenation gets used in a number of different ways in English; some of them are ad-hoc-stylistic-rather-than-semantic-ways. When I get to the final stages of processing, it does become necessary to make some hard choices between “to-day” and “today,” “data base” and “database,” etc. But at the OCReval stage, it’s better just to break everything up and check the parts. This implies that we don’t need to supplement MainDictionary with HyphenDictionary, but we will need to supplement it with a list of particles like “re” and “un” that commonly precede a hyphen and should be counted as correct for the purpose of OCR evaluation. I’ll generate this list and put it in the /dictionaries folder instead of HyphenDictionary.
	I also want to break the text at certain kinds of punctuation, especially commas, because you quite often see situations where the space is missing after punctuation — e.g. “London,where.” Alas, I can’t do this with periods, because they’re used in abbreviations, e.g. “a.m.”
	To take care of breaking the text at hyphens and certain other kinds of punctuation, I create a translation map in the new functions and use the .translate string method to turn hyphens, etc. into spaces which will then get broken by the .split method.
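	For example (the exact set of break characters here is my assumption; the principle is the .translate/.split combination just described):

    BREAKS = str.maketrans({c: ' ' for c in '-,;:'})

    def break_tokens(line):
        # "London,where" -> ['London', 'where']; "to-day" -> ['to', 'day']
        return line.translate(BREAKS).split()

	The resulting parts then get checked against MainDictionary plus the particle list mentioned above.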
	The function keep_hyphens in TokenGen is not used by the OCReval modules; it’s something I wrote to prepare for TypeIndexer.
