Skip to content

porcpine1967/ocr-proofreader

Repository files navigation

This module supports the conversion of scanned documents into accurate text.

If the scans are bound up in a pdf, you will need to run something like pdfimages to extract them into readable images.

The module consists of utilities that accomplish the following:
 * Turn the images into text
  * Splitting of images of two pages into two images of a single page each.
  * Running each page through tesseract to extract the text into appropriately-named documents (one document per page)
 * Clean up common ocr errors
  * Utility to remove headers (batch)
  * Utility to join words separated by hyphen-line breaks
  * Utility to correct common recognition mistakes (e.g. rn -> m)
  * Utility to join words across lines

# To implement
  * orthography fix
  * add ability to add a note to FIX ME
  * generate shell file for FIX ME
  * English check for single character not I, a, or A


# To investigate
  * Use images to find paragraphs
  * Page class to manage page metadata
    * has header?
    * line objects
      * x,y coordinates (or box?)
      * length
      * density
      * index(?)

About

Tool for turning scanned books into proofread documents

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages