GitHub - porcpine1967/ocr-proofreader: Tool for turning scanned books into proofread documents

test
.gitignore
README
__init__.py
comparison_manager.py
controller.py
document_builder.py
dpgui.py
gui.py
gui2.py
gui3.py
gui4.py
gui5.py
line_manager.py
models.py
ocr.py
process_pdf.py
regex_helper.py
spell_checker.py
test.py


This module supports the conversion of scanned documents into accurate text.

If the scans are bundled in a PDF, you will first need to run a tool such as pdfimages to extract them as image files.
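
The extraction step can be scripted. The following is a minimal sketch, assuming Python 3 and the pdfimages tool from poppler-utils on the PATH; the helper names (`build_pdfimages_command`, `extract_images`) and the `page` output prefix are hypothetical and not part of this module.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, output_prefix):
    """Return the argument list that extracts a PDF's embedded images as PNG files."""
    # -png asks pdfimages to write PNG files instead of its default PPM/PBM output.
    return ["pdfimages", "-png", str(pdf_path), str(output_prefix)]


def extract_images(pdf_path, output_dir):
    """Run pdfimages and return the extracted image paths in sorted order."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages was not found on PATH")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # pdfimages names its output <prefix>-000.png, <prefix>-001.png, ...
    command = build_pdfimages_command(pdf_path, output_dir / "page")
    subprocess.run(command, check=True)
    return sorted(output_dir.glob("page-*.png"))
```

Calling `extract_images("book.pdf", "scans")` would leave one image per embedded scan in `scans/`, ready for the steps listed below.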

The module consists of utilities that accomplish the following:
 * Turn the images into text
  * Split each image of two pages into two images of a single page each
  * Run each page through Tesseract to extract the text into appropriately named documents (one document per page)
 * Clean up common OCR errors
  * Utility to remove headers (batch)
  * Utility to join words split by a hyphen at a line break
  * Utility to correct common recognition mistakes (e.g. rn -> m)
  * Utility to join words across lines
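
To make the list above concrete, here is a rough sketch of the Tesseract call and three of the cleanup utilities. It is not the module's actual code: every function name is hypothetical, Python 3 is assumed, the header pattern is supplied by the caller, "rn -> m" is read as "replace rn with m", and the word list stands in for whatever spell checker is used.

```python
import re
from pathlib import Path


def build_tesseract_command(image_path):
    """Return the argument list that writes <image name without extension>.txt."""
    image_path = Path(image_path)
    # tesseract appends ".txt" to the output base name itself.
    return ["tesseract", str(image_path), str(image_path.with_suffix(""))]


def remove_header(page_lines, header_pattern):
    """Drop the first non-blank line of a page if it matches header_pattern."""
    lines = list(page_lines)
    for i, line in enumerate(lines):
        if line.strip():
            if re.fullmatch(header_pattern, line.strip()):
                del lines[i]
            break  # only the first non-blank line is considered
    return lines


def join_hyphenated(lines):
    """Move the tail of a word split by a line-end hyphen up to the previous line."""
    result = []
    for line in lines:
        previous_ends_with_hyphen = bool(result) and re.search(r"[A-Za-z]-$", result[-1])
        if previous_ends_with_hyphen and re.match(r"[a-z]", line.lstrip()):
            head, _, tail = line.lstrip().partition(" ")
            result[-1] = result[-1][:-1] + head  # drop the hyphen, attach the tail
            if not tail:
                continue  # the whole line was consumed
            line = tail
        result.append(line.rstrip())
    return result


def fix_rn_for_m(word, known_words):
    """Replace 'rn' with 'm' only when that turns an unknown word into a known one."""
    if "rn" not in word or word.lower() in known_words:
        return word
    candidate = word.replace("rn", "m")
    return candidate if candidate.lower() in known_words else word
```

Note that `join_hyphenated` also merges genuine compounds broken at the hyphen (for example "well-" followed by "known"), so a dictionary check similar to the one in `fix_rn_for_m` would likely be needed before trusting the result.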

# To implement
  * orthography fix
  * add ability to add a note to FIX ME
  * generate shell file for FIX ME
  * English check for single-character words other than I, a, or A
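
The single-character check is not implemented yet; one possible shape for it is sketched below, assuming Python 3. The function name and the decision to report column positions are illustrative guesses, not a design taken from this module.

```python
import re

ALLOWED_SINGLE_LETTERS = {"I", "a", "A"}

# A lone letter: not attached to other word characters, apostrophes, or hyphens.
LONE_LETTER = re.compile(r"(?<![\w'-])[A-Za-z](?![\w'-])")


def suspicious_single_letters(line):
    """Return (column, letter) pairs for lone letters other than I, a, or A."""
    return [
        (match.start(), match.group())
        for match in LONE_LETTER.finditer(line)
        if match.group() not in ALLOWED_SINGLE_LETTERS
    ]
```

A check like this would also flag legitimate initials such as the "J" in "J. Smith", so its output is better treated as a list of candidates to review than as a list of errors.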


# To investigate
  * Use images to find paragraphs
  * Page class to manage page metadata
    * has header?
    * line objects
      * x,y coordinates (or box?)
      * length
      * density
      * index(?)
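
One way the Page idea above might look, as a sketch only and assuming Python 3.7+ for dataclasses. The class and field names follow the notes, but their meanings are guesses: `length` is taken to be a character count, `index` the line's position on the page, and `density` is left as an optional number because the notes do not define it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Line:
    text: str
    index: int                       # position of the line on the page (assumed)
    x: int = 0                       # top-left corner of the line in the scan
    y: int = 0
    density: Optional[float] = None  # undefined in the notes; left open here

    @property
    def length(self):
        """Number of characters in the line (assumed meaning of 'length')."""
        return len(self.text)


@dataclass
class Page:
    number: int
    has_header: bool = False
    lines: List[Line] = field(default_factory=list)

    def add_line(self, text, x=0, y=0):
        """Append a line and give it the next index on this page."""
        line = Line(text=text, index=len(self.lines), x=x, y=y)
        self.lines.append(line)
        return line

    def body_lines(self):
        """Return the lines without the header, assuming the header is line 0."""
        return self.lines[1:] if self.has_header else list(self.lines)
```

For example, a page built with `Page(number=1, has_header=True)` and two calls to `add_line` would return only the second line from `body_lines()`.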