OCRPDF

OCRPDF is a Python wrapper that helps you quickly OCR multi-page PDF documents.

Requirements

You must have already installed:

ImageMagick (http://www.imagemagick.org/script/binary-releases.php)
GhostScript (http://www.ghostscript.com)
The Tesseract OCR engine (https://code.google.com/p/tesseract-ocr/)
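
Before running OCRPDF, it can help to confirm that the three external tools are reachable. The following is a minimal sketch, not part of OCRPDF itself. It assumes Python 3.3 or later (for shutil.which) and assumes the usual Unix executable names convert, gs, and tesseract; on Windows the names may differ (GhostScript, for example, often installs as gswin32c or gswin64c), so adjust the table to match your system.

```python
import shutil

# Assumed executable names for each requirement; these are common defaults,
# not something OCRPDF defines. Adjust them for your platform.
REQUIRED_TOOLS = {
    "ImageMagick": "convert",
    "GhostScript": "gs",
    "Tesseract": "tesseract",
}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools whose executable is not found on PATH."""
    return sorted(name for name, exe in tools.items() if shutil.which(exe) is None)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All external tools found.")
```

An empty result only shows that executables with those names exist on PATH; it does not verify versions or that they work with PDF input.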

Dependencies

You must also have installed the following Python modules:

PIL (or Pillow)
Pytesser (you may need to edit pytesser.py, changing "import Image" to "from PythonMagick import Image")
PythonMagick (helpful guidance can be found at http://stackoverflow.com/questions/13984357/pythonmagick-cant-find-my-pdf-files)
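
A similar check can be made for the Python modules. This sketch assumes Python 3.4 or later and assumes the import names PIL, pytesser, and PythonMagick; it only tests whether each module can be located, and it is not part of OCRPDF.

```python
import importlib.util

# Assumed top-level import names for the modules listed above.
REQUIRED_MODULES = ["PIL", "pytesser", "PythonMagick"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
    else:
        print("All Python modules found.")
```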

Basic Usage

To create a new instance of OCRPDF and OCR a file:

from OCRPDF import OCRPDF

ocrTool = OCRPDF()
result = ocrTool.OCRPDF('YourFileNameHere')

This returns an object with the following attributes:

	t         : raw text of the whole document
	t_clean   : cleaned text of the whole document
	pages     : number of pages
	p         : list of page data objects, each with:
	            pagenum : page number
	            t       : raw text from this page
	            t_clean : cleaned text from this page

So to view the raw text from page 3 of your document:

print(result.p[2].t)

(It's p[2] because lists are 0-based.)
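
To walk through every page rather than a single one, you can loop over result.p. The sketch below uses hypothetical stand-in types (Page and Result) that only mirror the attribute names described above, so that it runs without OCRPDF installed; with a real result from ocrTool.OCRPDF(...) you would pass that object to page_summaries instead. The helper name page_summaries and the 1-based pagenum values in the demo data are assumptions for illustration.

```python
from collections import namedtuple

# Hypothetical stand-ins that mirror the documented attribute names.
# They are not OCRPDF's real classes.
Page = namedtuple("Page", ["pagenum", "t", "t_clean"])
Result = namedtuple("Result", ["t", "t_clean", "pages", "p"])


def page_summaries(result, width=40):
    """Return one 'page N: preview' line per page, using the cleaned text."""
    lines = []
    for page in result.p:
        preview = page.t_clean[:width]  # truncate long pages for display
        lines.append("page {0}: {1}".format(page.pagenum, preview))
    return lines


# Demo data only; pagenum starting at 1 is an assumption.
demo = Result(
    t="Hello\nWorld\n",
    t_clean="Hello World",
    pages=2,
    p=[Page(1, "Hello\n", "Hello"), Page(2, "World\n", "World")],
)

for line in page_summaries(demo):
    print(line)
```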
