Skip to content

Intergration pipeline for extracting PDFs data

Notifications You must be signed in to change notification settings

seanBr97/curation-pipeline

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 

Repository files navigation

README
--------------------------------------------------------------------------
About

Classes
- PDFigCapX: Extracts the images and captions from input PDF files.
- FigSplit:

Dependencies
- PDFigCapX
  - Binaries to chromedriver
  - Binaries to ImageMagick convert
  - Binaries to xpdf pdftohtml
  - OpenCV2 version 2.4.X, the module does not run in OpenCV2 version 3.X
  To instantiate the PDFigCapX class, pass all the paths in the constructor.

  To set up an environment in Windows or Mac, build an environment with Anaconda:
  conda create -n py27 python=2.7
  conda install -c menpo opencv
  conda install -c anaconda numpy
  conda install -c conda-forge selenium
  conda install -c anaconda scipy
  conda install -c anaconda lxml
  conda install -c anaconda pil

  Mac further needs the installation of Ghostscript
  https://pages.uoregon.edu/koch/

- FigSplit
  conda install -c anaconda requests

Run on tinman
source ~/virtualenv2/bin/activate
cd /usa/jtrell2/Curation-Pipeline/src
Xvfb :99 & export DISPLAY=:99
python pipeline_runner.py /usa/jtrell2/Curation-Pipeline/src/config.json /eecis/shatkay/UICCollab/PDFs/Pheno_rel /usa/jtrell2/Curation-UI/dist/images 10

Contact
Juan Trelles Trabucco jtrell2@uic.edu
Pengyuan Li pengyuan@udel.edu

About

Intergration pipeline for extracting PDFs data

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 100.0%