porcpine1967/ocr-proofreader
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This module supports the conversion of scanned documents into accurate text. If the scans are bound up in a pdf, you will need to run something like pdfimages to extract them into readable images. The module consists of utilities that accomplish the following: * Turn the images into text * Splitting of images of two pages into two images of a single page each. * Running each page through tesseract to extract the text into appropriately-named documents (one document per page) * Clean up common ocr errors * Utility to remove headers (batch) * Utility to join words separated by hyphen-line breaks * Utility to correct common recognition mistakes (e.g. rn -> m) * Utility to join words across lines # To implement * orthography fix * add ability to add a note to FIX ME * generate shell file for FIX ME * English check for single character not I, a, or A # To investigate * Use images to find paragraphs * Page class to manage page metadata * has header? * line objects * x,y coordinates (or box?) * length * density * index(?)
About
Tool for turning scanned books into proofread documents
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published