A Benchmark & Evaluation for Text Extraction from PDF

This project benchmarks and evaluates existing PDF extraction tools on their semantic ability to extract the body text from PDF documents, especially scientific articles. It provides (1) a benchmark generator, (2) a ready-to-use benchmark and (3) an extensive evaluation with meaningful evaluation criteria.

The Benchmark Generator

constructs high-quality benchmarks from TeX source files.
identifies the following 16 logical text blocks: title, author(s), affiliation(s), date, abstract, headings, paragraphs of the body text, formulas, figures, tables, captions, listing-items, footnotes, acknowledgements, references, appendices.
serializes desired logical text blocks to plain text, XML or JSON format.

For more details and usage, see benchmark-generator/.
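
As a rough illustration of the serialization step, the sketch below writes a selection of logical text blocks to plain text, JSON or XML. The `Block` type, the role names and the output layout are assumptions made for this example and are not the generator's actual data model or file format; benchmark-generator/ documents the real ones.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, asdict


@dataclass
class Block:
    # Hypothetical representation of one logical text block.
    role: str  # e.g. "title", "heading", "paragraph" (assumed labels)
    text: str


def serialize(blocks, fmt="txt", roles=None):
    """Serialize the blocks whose role is in `roles` (all blocks if None)."""
    selected = [b for b in blocks if roles is None or b.role in roles]
    if fmt == "txt":
        # One block per paragraph, separated by a blank line.
        return "\n\n".join(b.text for b in selected)
    if fmt == "json":
        return json.dumps([asdict(b) for b in selected], indent=2)
    if fmt == "xml":
        root = ET.Element("document")
        for b in selected:
            # The role is used as the element name (an assumed layout).
            ET.SubElement(root, b.role).text = b.text
        return ET.tostring(root, encoding="unicode")
    raise ValueError(f"unknown format: {fmt}")


blocks = [
    Block("title", "A Sample Article"),
    Block("author", "Jane Doe"),
    Block("heading", "Introduction"),
    Block("paragraph", "PDF is a layout-based format."),
]

# Keep only the blocks that the ready-to-use benchmark also contains.
print(serialize(blocks, "txt", roles={"title", "heading", "paragraph"}))
```

Filtering by role before serializing mirrors the idea of choosing the "desired" blocks: the same parsed document can yield a title-only file, a body-text file, or a full dump.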

The Benchmark

consists of 12,099 ground truth files and 12,099 PDF files of scientific articles, randomly selected from arXiv.org. Each ground truth file contains the title, the headings and the body text paragraphs of a particular scientific article.
was generated using the benchmark generator described above.

For more details, see benchmark/.

The Evaluation

assesses the following 13 PDF extraction tools: pdftotext, pdftohtml, pdf2xml (Xerox), pdf2xml (Tiedemann), PdfBox, ParsCit, LA-PdfText, PdfMiner, pdfXtk, pdf-extract, PDFExtract, Grobid, Icecite.
provides meaningful evaluation criteria to assess the semantic ability of a tool to identify (1) words, (2) the reading order, (3) paragraph boundaries and (4) the semantic roles of text elements in a PDF.

For more details, see evaluation/.
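
To give a feel for what a word-level criterion could look like, the sketch below aligns the words of a ground truth text with the words of a tool's output and reports which words are missing and which are spurious. This is only an illustrative assumption built on Python's `difflib`, not the evaluation's actual method or metric; the function name `word_diff` is hypothetical, and evaluation/ describes the real criteria.

```python
from difflib import SequenceMatcher


def word_diff(ground_truth, extracted):
    """Return (missing, spurious) words of `extracted` relative to `ground_truth`."""
    gt = ground_truth.split()
    ex = extracted.split()
    missing, spurious = [], []
    # Align the two word sequences; autojunk is disabled so that frequent
    # words in long texts are not silently ignored by the matcher.
    matcher = SequenceMatcher(a=gt, b=ex, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(gt[i1:i2])   # in the ground truth, not in the output
        if tag in ("insert", "replace"):
            spurious.extend(ex[j1:j2])  # in the output, not in the ground truth
    return missing, spurious


missing, spurious = word_diff(
    "The quick brown fox jumps",
    "The quick fox 12 jumps",
)
print("missing:", missing)
print("spurious:", spurious)
```

Because the alignment is order-sensitive, a tool that extracts every word but in the wrong reading order would also show up as a mix of missing and spurious words, which is one reason separate criteria for words and reading order are useful.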
