A Python library for extracting information from academic papers.
Paper Miner (PapMinPy) is a library for extracting information from academic paper PDFs. It currently supports three tasks: converting a paper into structured XML, extracting references with their individual fields (e.g., author and title), and extracting the citations that appear inside the paper.
pip install git+https://github.com/KKGanguly/PapMinPy
PapMinPy relies on the CERMINE Java library, so a JDK must be installed.
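Because the CERMINE dependency needs Java, it can help to confirm that a java executable is reachable before running an extraction. The sketch below is not part of PapMinPy: find_java and java_version_banner are hypothetical helper names, and it assumes the JDK is expected to be on the PATH.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the first line printed by `java -version`, or None on failure."""
    try:
        result = subprocess.run(
            [java_path, "-version"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The executable is missing, not runnable, or did not answer in time.
        return None
    # `java -version` traditionally writes to stderr, so fall back to stdout.
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else None


java = find_java()
if java is None:
    print("No java executable found on PATH; install a JDK first.")
else:
    print("Found", java, "->", java_version_banner(java))
```

If the check reports that no executable was found, installing a JDK and reopening the terminal is usually enough for it to appear on the PATH.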
First, create a CitationExtractor object for the PDF you want to process.
from PapMinPy import citationextractor
citationExtractor = citationextractor.CitationExtractor("pdfFileName.pdf")
The references can then be extracted as JSON with the following code.
citationExtractor.getReferences(True)
To get the output as objects rather than JSON, pass False instead:
citationExtractor.getReferences(False)
The citation snippets (paragraphs containing a specific citation) can be retrieved with:
citationExtractor.getCitationSnippets(json=True)
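The calls above can be bundled into one small helper, sketched below. collect_outputs and its extractor_cls parameter are hypothetical and not part of PapMinPy; the helper uses only the constructor and the two methods shown in this README. StubExtractor is a stand-in that lets the sketch run without PapMinPy or a JDK, and its return values are placeholders, not real library output.

```python
def collect_outputs(pdf_path, extractor_cls, as_json=True):
    """Run both extraction calls from this README on one PDF.

    extractor_cls is assumed to behave like PapMinPy's CitationExtractor:
    built from a PDF path, exposing getReferences and getCitationSnippets.
    """
    extractor = extractor_cls(pdf_path)
    return {
        "pdf": pdf_path,
        "references": extractor.getReferences(as_json),
        "snippets": extractor.getCitationSnippets(json=as_json),
    }


class StubExtractor:
    """Stand-in used only to dry-run the helper; returns placeholder strings."""

    def __init__(self, pdf_path):
        self.pdf_path = pdf_path

    def getReferences(self, json):
        return "references(json=%s)" % json

    def getCitationSnippets(self, json=True):
        return "snippets(json=%s)" % json


# Dry run with the stub. With PapMinPy installed, pass
# citationextractor.CitationExtractor in place of StubExtractor.
result = collect_outputs("pdfFileName.pdf", StubExtractor)
print(result["references"])
print(result["snippets"])
```

Passing the extractor class as an argument keeps the helper importable on machines where PapMinPy or Java is not yet set up.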
- 0.1a
    - CHANGE: Added initial release (alpha)
- 0.1b
    - Work in progress
Kishan Kumar Ganguly – kkganguly.iit.du@gmail.com
Distributed under the MIT license. See LICENSE for more information.
- Fork it (https://github.com/KKGanguly/PapMinPy/fork)
- Create your feature branch (git checkout -b feature/fooBar)
- Commit your changes (git commit -am 'Add some fooBar')
- Push to the branch (git push origin feature/fooBar)
- Create a new Pull Request