GitHub - seanBr97/curation-pipeline: Integration pipeline for extracting data from PDFs

forked from uic-evl/curation-pipeline

--------------------------------------------------------------------------
About

Classes
- PDFigCapX: Extracts the images and captions from input PDF files.
- FigSplit:

Dependencies
- PDFigCapX
  - Path to the chromedriver binary
  - Path to the ImageMagick convert binary
  - Path to the xpdf pdftohtml binary
  - OpenCV 2.4.x (the module does not run on OpenCV 3.x)
  To instantiate the PDFigCapX class, pass the binary paths listed above to the constructor.
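
Since the constructor takes paths to external binaries, it can help to confirm that each path points to an executable file before instantiating the class. The sketch below is not part of this repository: the function name `missing_binaries`, the dictionary labels, and the placeholder paths are all assumptions made for illustration, and it does not call PDFigCapX itself because the constructor signature is not documented here.

```python
import os


def missing_binaries(binary_paths):
    """Return the labels whose path is not an executable file.

    binary_paths maps a label (hypothetical, chosen for this sketch)
    to the filesystem path you intend to pass to PDFigCapX.
    """
    missing = []
    for name, path in sorted(binary_paths.items()):
        # A usable binary must exist as a regular file and be executable.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Placeholder paths; replace them with the locations on your machine.
    paths = {
        "chromedriver": "/path/to/chromedriver",
        "convert": "/path/to/convert",
        "pdftohtml": "/path/to/pdftohtml",
    }
    print("Missing or non-executable: %s" % ", ".join(missing_binaries(paths)))
```

The check uses only the standard library and runs on both Python 2.7 and Python 3, so it can be executed inside the py27 environment described in this README.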

  To set up the environment on Windows or Mac, create a Python 2.7 environment with Anaconda, activate it, and install the packages:
  conda create -n py27 python=2.7
  conda activate py27
  conda install -c menpo opencv
  conda install -c anaconda numpy
  conda install -c conda-forge selenium
  conda install -c anaconda scipy
  conda install -c anaconda lxml
  conda install -c anaconda pil
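
Because PDFigCapX is documented above as running only on OpenCV 2.4.x, a quick check of the installed version after running the conda commands can save debugging time. This is a minimal sketch under stated assumptions: the helper names `opencv_is_supported` and `installed_opencv_version` are hypothetical and do not exist in this repository; the only library attribute used is `cv2.__version__`.

```python
def opencv_is_supported(version):
    """Return True for OpenCV 2.4.x version strings such as '2.4.13'."""
    parts = version.split(".")
    return len(parts) >= 2 and parts[0] == "2" and parts[1] == "4"


def installed_opencv_version():
    """Return cv2.__version__, or None when OpenCV cannot be imported."""
    try:
        import cv2
    except ImportError:
        return None
    return cv2.__version__


if __name__ == "__main__":
    version = installed_opencv_version()
    if version is None:
        print("OpenCV (cv2) is not installed in this environment")
    elif opencv_is_supported(version):
        print("OpenCV %s is in the supported 2.4.x range" % version)
    else:
        print("OpenCV %s is not 2.4.x; PDFigCapX may not run" % version)
```

Run it with the py27 environment active so that it inspects the same interpreter the pipeline will use.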

  On Mac, Ghostscript must also be installed:
  https://pages.uoregon.edu/koch/

- FigSplit
  conda install -c anaconda requests

Run on tinman
source ~/virtualenv2/bin/activate
cd /usa/jtrell2/Curation-Pipeline/src
Xvfb :99 & export DISPLAY=:99
python pipeline_runner.py /usa/jtrell2/Curation-Pipeline/src/config.json /eecis/shatkay/UICCollab/PDFs/Pheno_rel /usa/jtrell2/Curation-UI/dist/images 10

Contact
Juan Trelles Trabucco jtrell2@uic.edu
Pengyuan Li pengyuan@udel.edu