PDFrankenstein

A Python tool for bulk feature extraction from malicious PDFs.

Dependencies

PyV8 (and V8) (optional: only needed for JS deobfuscation. Note: run JS deobfuscation in a safe environment, as you would when handling any malware.)
lxml
scandir (optional: module included in lib folder)
postgresql and psycopg2 (optional: if you intend to use postgresql backing storage)
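
Before a bulk run it can help to confirm which of the modules above are importable. The following is a minimal sketch, not part of PDFrankenstein: the helper names `check_modules` and `report` are hypothetical, and the import names are assumed to match the package names listed above. Run it with the same interpreter you use for pdfrankenstein.py.

```python
import importlib

# Module names taken from the dependency list above. lxml is the only entry
# listed without an "optional" note, so it is treated as required here.
DEPENDENCIES = {
    "lxml": "required",
    "PyV8": "optional, JS deobfuscation",
    "scandir": "optional, a copy ships in the lib folder",
    "psycopg2": "optional, postgresql backing storage",
}


def check_modules(names):
    """Return a dict mapping each module name to True if it imports cleanly."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


def report(dependencies=DEPENDENCIES):
    """Print one line per dependency with its import status."""
    status = check_modules(sorted(dependencies))
    for name in sorted(dependencies):
        state = "ok" if status[name] else "missing"
        print("%-10s %-8s (%s)" % (name, state, dependencies[name]))
```

Calling `report()` prints one status line per module. A missing optional module only matters if you plan to use the feature it backs.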

$ pdfrankenstein.py --help

Output delimited plain text to a file after parsing ALL files in pdf-dir/

$ pdfrankenstein.py -o file -n fileoutput.txt ~/pdf-dir

Output to an sqlite database

$ pdfrankenstein.py -o sqlite3 -n pdf-db ~/pdf-dir
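
The schema of the resulting database is not documented here, so a first step after a run is to look at what tables and columns it contains. The sketch below uses only Python's standard `sqlite3` module and makes no assumption about table names; `describe_sqlite_db` is a hypothetical helper, and the exact file name the tool writes for `-n pdf-db` (with or without an extension) is something to check on disk.

```python
import sqlite3


def describe_sqlite_db(db_path):
    """Return {table_name: [column names]} for every table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        schema = {}
        for table in tables:
            # Quote the identifier so unusual table names do not break the PRAGMA.
            quoted = '"' + table.replace('"', '""') + '"'
            # PRAGMA table_info yields one row per column; index 1 is the name.
            cols = conn.execute("PRAGMA table_info(%s)" % quoted).fetchall()
            schema[table] = [col[1] for col in cols]
        return schema
    finally:
        conn.close()
```

For example, `describe_sqlite_db("pdf-db")` returns a dictionary you can print to see which columns are available before writing queries. Note that `sqlite3.connect` creates an empty file if the path does not exist, so check the path first.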

Output to stdout after parsing all files listed inside file-with-pdfs

$ pdfrankenstein.py -o stdout ~/file-with-pdfs
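
A file such as file-with-pdfs can be generated rather than written by hand. The sketch below assumes the list format is one path per line, which the text above does not confirm, and the helpers `looks_like_pdf` and `write_pdf_list` are hypothetical. It only reads the first bytes of each file and never parses or executes anything, but the samples should still be handled in a safe environment.

```python
import os


def looks_like_pdf(path, window=1024):
    """True if the '%PDF' marker appears in the first `window` bytes."""
    try:
        with open(path, "rb") as handle:
            return b"%PDF" in handle.read(window)
    except (IOError, OSError):
        # Unreadable files are skipped rather than aborting the whole walk.
        return False


def write_pdf_list(root, list_path):
    """Write absolute paths of PDF-like files under root, one per line."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            candidate = os.path.abspath(os.path.join(dirpath, filename))
            if looks_like_pdf(candidate):
                found.append(candidate)
    found.sort()
    with open(list_path, "w") as out:
        for path in found:
            out.write(path + "\n")
    return found
```

The marker is searched for in a small window instead of only at offset 0 because malformed or deliberately obfuscated samples do not always start with the header. Write the list outside the scanned directory so it is not picked up on a later run.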

pdf_in	PDF input for analysis. Can be a single PDF file or a directory of files.
-d, --debug	Print debugging messages.
-o, --out	Analysis output filename or type. Defaults to an 'unnamed-out.*' file in the CWD. Options: 'sqlite3' | 'postgres' | 'stdout' | [filename]
-n, --name	Name for output database.
--hasher	Specify which type of hasher to use: PeePDF | PDFMiner (default). The PDFMiner option provides better parsing capabilities.
-v, --verbose	Verbose terminal output (TODO).
