GitHub - internetarchive/analyze_ocr: Parse OCR result files for pagenos, tables of contents, etc.

forked from mikemccabe/analyze_ocr

Files:

fonts
.gitignore
README
analyze_ocr.php
analyze_ocr.py
color.py
diff_match_patch.py
extract_sorted.py
find_header_footer.py
find_pagenos.py
font.py
iabook.py
interval.py
make_toc.py
rnums.py
toc_to_xml.py
tuples.py
visualize.py
windowed_iterator.py

Some code for analyzing OCR'd documents.  It's currently pretty
specific to Internet Archive OCR'd books, but it may be generalizable.

Entry point: analyze_ocr.py.  Run this against an Internet Archive scanned book.

Functionality: find headers/footers, page numbers, tables of contents.
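
To give a feel for the page-number part, here is a minimal sketch of one
common heuristic.  It is not taken from find_pagenos.py and may differ from
what this code actually does.  It assumes each page is available as a plain
text string, indexed by leaf (scan) number, and that printed page numbers
sit on the first or last non-empty line.  The name guess_page_offset is
hypothetical.

```python
import re
from collections import Counter


def guess_page_offset(pages):
    """Guess the offset between printed page numbers and leaf indexes.

    pages: list of page text strings; the list index is the leaf number.
    Returns the most frequently seen (printed_number - leaf_index), or
    None if no offset is supported by at least two pages.
    Hypothetical helper, not part of this repository.
    """
    votes = Counter()
    for leaf, text in enumerate(pages):
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            continue
        # Assumption: page numbers appear in the header or footer line.
        for line in {lines[0], lines[-1]}:
            for token in re.findall(r"\b\d{1,4}\b", line):
                votes[int(token) - leaf] += 1
    if not votes:
        return None
    offset, count = votes.most_common(1)[0]
    # Require agreement from at least two candidates to filter OCR noise.
    return offset if count >= 2 else None
```

Each candidate number votes for an offset, so a stray OCR misread on one
page is outvoted by the pages that agree with each other.  A real book
often has several numbering runs (roman-numeral front matter, then arabic
body pages), which this sketch does not handle.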