pombredanne/pdftables

pdftables - a library for extracting tables from PDF files

pdftables uses pdfminer to get information on the locations of text elements in a PDF document.

First we get a file handle to a PDF:

import os
filepath = os.path.join(PDF_TEST_FILES, SelectedPDF)
fh = open(filepath, 'rb')  # PDFs are binary files, so open in binary mode

Then we use our getPDFPage function to select a single page from the document and pass it to pageToTables, which extracts the table:

pdfPage = getPDFPage(fh, pagenumber)
table, diagnosticData = pageToTables(pdfPage, extend_y=False, hints=hints, atomise=False)

Setting the optional extend_y parameter to True extends the grid used to extract the table to the full height of the page. The optional hints parameter is a two-element array of strings: the first element should contain unique text from the top of the table, and the second should contain unique text from the bottom row of the table. Setting the optional atomise parameter to True splits all the text into individual characters; this is slower, but will sometimes separate closely spaced columns.
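As a rough illustration of the shape these options take, the sketch below builds a hints list and collects the three optional parameters into a dictionary. The marker strings and the names top_marker, bottom_marker and options are hypothetical and only for illustration; the pageToTables call itself is left commented out because it needs pdftables and a real page.

```python
# Hypothetical marker strings; replace them with text from your own PDF.
top_marker = "Table 1: Annual results"  # unique text at the top of the table
bottom_marker = "Total"                 # unique text in the bottom row

# hints is a two-element list: [top text, bottom text].
hints = [top_marker, bottom_marker]

# Keyword arguments matching the pageToTables call shown above.
options = {"extend_y": False, "hints": hints, "atomise": False}

# With pdftables available and pdfPage obtained from getPDFPage:
# table, diagnosticData = pageToTables(pdfPage, **options)
```

If two columns come out merged, one thing to try is switching atomise to True in the same dictionary, at the cost of slower extraction.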

table is a list of lists of strings. diagnosticData is an object containing diagnostic information which can be displayed using the plotpage function:

fig, ax1 = plotpage(diagnosticData)
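Because table is a plain list of lists of strings, it can be handed to standard tools without conversion. The following is a minimal sketch of writing it out as CSV with Python's csv module; the helper name table_to_csv and the sample rows are made up for illustration and are not part of pdftables.

```python
import csv
import io


def table_to_csv(table):
    """Serialise a list of lists of strings to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Made-up stand-in for the `table` value returned by pageToTables.
sample_table = [
    ["Year", "Revenue"],
    ["2011", "1,200"],
    ["2012", "1,350"],
]

csv_text = table_to_csv(sample_table)
print(csv_text)
```

Cells that contain a comma are quoted automatically by the csv writer, so values such as "1,200" survive a round trip.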
