
Parse OCR result files for pagenos, tables of contents, etc.


internetarchive/analyze_ocr

 
 


Some code for analyzing OCR'd documents.  It's currently fairly
specific to Internet Archive OCR'd books, but it may be generalizable.

Entry point: analyze_ocr.py - run this against an Internet Archive scanned book.

Functionality: find headers/footers, page numbers, tables of contents.
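To give a feel for what page-number detection can involve, here is a minimal sketch in Python. It is not taken from analyze_ocr.py; the function name `guess_page_offset`, the input format (one plain-text string per scanned leaf), and the voting heuristic are all assumptions made for illustration. The idea is that printed page numbers usually sit in the first or last line of a page and differ from the scan's leaf index by a constant, so the most common difference is a reasonable guess.

```python
import re
from collections import Counter


def guess_page_offset(pages):
    """Guess the constant offset between leaf index and printed page number.

    `pages` is a hypothetical input: a list of OCR text strings, one per
    scanned leaf, in scan order. Returns `leaf_index - printed_number` for
    the best-supported offset, or None if fewer than two leaves agree.
    """
    votes = Counter()
    for leaf, text in enumerate(pages):
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            continue
        # Assumption: page numbers appear in the header or footer line only.
        for line in {lines[0], lines[-1]}:
            for token in re.findall(r"\b\d{1,4}\b", line):
                votes[leaf - int(token)] += 1
    if not votes:
        return None
    offset, count = votes.most_common(1)[0]
    # Require agreement from at least two candidates to avoid stray digits.
    return offset if count >= 2 else None


if __name__ == "__main__":
    sample = [
        "Published 1850",
        "Contents",
        "CHAPTER I\nSome text\n1",
        "2\nMore text",
        "Yet more text\n3",
    ]
    print(guess_page_offset(sample))
```

With an offset in hand, the printed number for any leaf is `leaf_index - offset`, which also makes it possible to fill in pages where the OCR missed the number.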

Languages

  • Python 99.3%
  • PHP 0.7%