Solution to extract fields and line items from native PDF invoices

This project is based on invoice2data which extracts data from invoice pdf files using user written regular expressions. It is observed that quite some invoices cannot be extracted just using the regular expression approach since pdf text is organized not as lines of text rather as boxes with co-ordinates. Field values are placed at locations like to the right, bottom of a field so on. Line items can also placed in tables. This solution tries to use other techniques along with regular expressions to solve the problem.

The solution is written in python and provides the following capabilities

Extract Field values (Based on regex or location based - right, bottom ...)
Table based line item extraction (regex, vertical or horizontal table lines ...)
Check if the extracted line item Total values matches the Sum Total.

Installation

Requires python 3
pip/conda install the following libraries json, re, deepcopy, configparser, logging, pdfminer, argparse, sortedcontainers, sets, pluginbase

Templates

Templates which are in json format are used to tell the application how to pick field values (regex, top, bottom of field name) and also how line items are present in the invoice (With or Without Horizontal Vertical Lines, Line item columns…).

Look at the data folder which contains jpg images of a few invoices (original native pdfs). Corresponding templates can be found at the template folder. Reading the jpg and its corresponding template json side-by-side would help easily understand the template contents.

Detailed Documentation

Refer doc/invoice-extractor-checker.pdf for detailed documentation.

Running the application

From the src folder, python Main.py --dump --template --file --output

E.g:-, python Main.py --dump dump.txt --template ../template/ --file "../data/Amazon-Storeji.pdf" --output "../output/Amazon-Storeji.json"

Future work

Simple User Interface to edit template files.
Docker image.
Multi page/Multi header occurrences/Multi header pdfs.
Improve Table Line items extractions.
Run on more invoice samples/new formats.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data		data
doc		doc
output		output
schema		schema
src		src
template		template
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

doc

doc

output

output

schema

schema

src

src

template

template

LICENSE

LICENSE

README.md

README.md

Repository files navigation

Solution to extract fields and line items from native PDF invoices

Installation

Templates

Detailed Documentation

Running the application

Future work

About

Releases

Packages

Languages

License

doo-bluh-vay/invoice-extractor-checker

Folders and files

Latest commit

History

Repository files navigation

Solution to extract fields and line items from native PDF invoices

Installation

Templates

Detailed Documentation

Running the application

Future work

About

Topics

Resources

License

Stars

Watchers

Forks

Languages