GitHub - hack-yg-team-pdf/pdf-parser: script for taking a PDF and making a best guess at the form fields

Repo for "Team PDF" participation in HackYG 2018

PDF to JSON Parser

The Purpose of this repo is to take a collection of PDF forms from the Yukon Government and attempt to output a series of JSON files in the JSON Forms format to make a series of web-accessible forms.

Requirements

Python 3.6
Python Image Library
PdfMiner.six (The Python 3 fork of PDFMiner)
pdf2image

Inputs

The Script will load all PDF files loaded into the raw_pdfs/ directory.

Outputs

mturk_images/* - Cropped images of a specific form field for uploading to mechanical turk for field identification, filenames formatted as {pdfid}_{form_object_id}.png
mturk.csv - Manifest of all the Mechanical turk images along with text nearby to make copy/paste easier. The label text is not guaranteed to be in this blob, so mturk operators would need to be presented with the corresponding image and the text blob, but will still manually need to input the label.
output_json/* - Json representation of corresponding PDF form fields with a "best guess" as to field labels for JSON forms format. filenames formatted as {pdfid}.json

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
json_forms		json_forms
mturk_images		mturk_images
raw_pdfs		raw_pdfs
.gitignore		.gitignore
README.MD		README.MD
miner.py		miner.py
mturk.csv		mturk.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

json_forms

json_forms

mturk_images

mturk_images

raw_pdfs

raw_pdfs

.gitignore

.gitignore

README.MD

README.MD

miner.py

miner.py

mturk.csv

mturk.csv

Repository files navigation

PDF to JSON Parser

Requirements

Inputs

Outputs

About

Releases

Packages

Languages

hack-yg-team-pdf/pdf-parser

Folders and files

Latest commit

History

Repository files navigation

PDF to JSON Parser

Requirements

Inputs

Outputs

About

Resources

Stars

Watchers

Forks

Languages