The FASoC Datasheet Scrubber is a utility that scrubs through large sets of PDF datasheets/documents in order to extract key circuit information. The information gathered is used to build a database of commercial off-the-shelf (COTS) IP that can be used to build larger SoCs in the FASoC design flow. More information here
You can perform datasheet scrubbing by running Datasheet_Scrubbing.py, which takes a datasheet as input (from one of the ADC, CDC, DCDC, PLL, LDO, SRAM, Temperature Sensor, BDRT, Counters, DAC, Delay_Line, Digital Potentiometers, DSP, IO, or Opamp categories) and reports the extracted specs and pins. The setup and usage steps are as follows:
Python 3.7 or Anaconda 3 is required; older versions of Python will not work. You will also need the following Python libraries, which can be installed via pip (in PowerShell or a terminal).
- Install and upgrade pip:

  ```
  python -m pip install --upgrade pip
  ```
- Install the Python dependencies. You may replace `pip` with `conda` if you are working with Anaconda:

  ```
  pip install pandas
  pip install -U scipy
  pip install matplotlib
  pip install pdfminer.six
  pip install pypdf2
  pip install requests
  pip install lxml
  pip install tabula-py
  pip install scikit-learn
  pip install regex
  pip install keras
  pip install tensorflow
  pip install pdf2image
  pip install pillow
  pip install pytesseract
  pip install -U numpy
  pip install opencv-python
  pip install gensim
  pip install nltk
  ```
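After installing, you can sanity-check that the dependencies resolve before running anything. This is a minimal sketch (not part of the repository); note that some import names differ from the pip package names, e.g. `opencv-python` imports as `cv2` and `pillow` as `PIL`:

```python
import importlib.util

def missing_modules(module_names):
    """Return the subset of module_names that cannot be imported."""
    return [m for m in module_names if importlib.util.find_spec(m) is None]

# Import names corresponding to the pip packages listed above.
required = ["pandas", "scipy", "matplotlib", "pdfminer", "PyPDF2",
            "requests", "lxml", "tabula", "sklearn", "regex",
            "keras", "tensorflow", "pdf2image", "PIL", "pytesseract",
            "numpy", "cv2", "gensim", "nltk"]

for name in missing_modules(required):
    print(f"missing: {name}")
```

If the script prints nothing, all dependencies are importable.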
Here we propose two different approaches:
- Categorization using bag-of-words, text extraction using regular expressions, and table extraction using tabula (please see here for more information)
- Categorization, text extraction, and table extraction using a convolutional neural network (CNN) (please see here for more information)
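To illustrate the first approach, the sketch below pulls a supply-voltage spec out of raw datasheet text with a regular expression. The pattern and function name are hypothetical examples, not the ones used by Datasheet_Scrubbing.py:

```python
import re

# Hypothetical pattern: a label ("Supply Voltage" or "VDD"), an optional
# separator, a numeric value, and a voltage unit.
SUPPLY_RE = re.compile(
    r"(?:supply\s+voltage|vdd)\s*[:=]?\s*([\d.]+)\s*(m?V)",
    re.IGNORECASE,
)

def extract_supply_voltage(text):
    """Return (value, unit) for the first supply-voltage match, else None."""
    m = SUPPLY_RE.search(text)
    if m is None:
        return None
    return float(m.group(1)), m.group(2)

print(extract_supply_voltage("Supply Voltage: 3.3 V, Resolution: 12 bits"))
# → (3.3, 'V')
```

The real tool uses one such pattern per spec field; the table-extraction side is handled separately by tabula.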
- If you want to test the CNN part, you will need additional software:
- Poppler
- Tesseract
- Visual Studio 2017
These are the steps for running the code:
- Clone the datasheet-scrubber repository:

  ```
  git clone https://github.com/idea-fasoc/datasheet-scrubber.git
  ```
- Go here and download All_pdf.zip.
- Run `make init`, which runs Initializer.py. It will ask you to type the All_pdf.zip directory, your work directory, and the code directory (for datasheet scrubbing) that you have just cloned. After running Initializer.py you should see something like this. Also, when this window pops up, press Download.
- All_pdf, All_text, cropped_pdf, and cropped_text are training directories. To add more files to the training set, put your labeled PDF files in the All_pdf directory (that is, put ADC datasheets in the ADC folder inside All_pdf, CDC datasheets in the CDC folder inside All_pdf, and so on).
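This folder layout doubles as the labeling scheme: the category of each PDF is the name of the folder it sits in. A minimal sketch of reading it back into (path, label) pairs, assuming only the directory structure described above (the function name is hypothetical):

```python
from pathlib import Path

def labeled_pdfs(root):
    """Yield (pdf_path, category) pairs from a labeled directory tree.

    The category is the folder name directly under root, e.g.
    All_pdf/ADC/foo.pdf yields (Path(".../foo.pdf"), "ADC").
    """
    root = Path(root)
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for pdf in sorted(category_dir.glob("*.pdf")):
            yield pdf, category_dir.name
```

Iterating `labeled_pdfs("All_pdf")` would then enumerate the whole training set with its labels.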
- Run `make categorizer`, which runs test_confusion_matrix.py and shows the confusion matrix of the categorizer on the whole dataset.
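For reference, a confusion matrix of the kind this step prints simply counts (true label, predicted label) pairs, with one row per true category and one column per predicted category. A self-contained sketch with a hypothetical subset of the categories (not the repository's implementation):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

labels = ["ADC", "DAC", "PLL"]          # hypothetical subset of categories
y_true = ["ADC", "ADC", "DAC", "PLL"]   # ground-truth folder labels
y_pred = ["ADC", "DAC", "DAC", "PLL"]   # categorizer output
for row in confusion_matrix(y_true, y_pred, labels):
    print(row)
# → [1, 1, 0]
#   [0, 1, 0]
#   [0, 0, 1]
```

A perfect categorizer would put all counts on the diagonal; off-diagonal entries show which categories get confused with which.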
- Put the PDF files of the datasheets that you want to test in the Test_pdf folder, and please email them to fayazi@umich.edu so that we can build a better repository.
- Run `make extraction`, which runs Datasheet_Scrubbing.py; input a datasheet (from one of the ADC, BDRT, CDC, Counters, DAC, DCDC, Delay_Line, Digital_Potentiometers, DSP, IO, LDO, Opamp, PLL, SRAM, Temperature Sensor categories) and observe the extracted specs and pins.
Extracted datasheets can be emailed to fayazi@umich.edu in order to build a bigger repository.