This is a project designed for researchers to conveniently access papers they need.
A command line tool, paper-downloader.py, is included to automatically search for and download a paper from the Internet, given its title.
The downloaded paper will thus have a readable file name.
It mainly supports searching papers in computer science.
This project also comes with a naive server that provides an integrated search/read/download experience.
To run the command line tool, you'll need the following installed:
- requests
- BeautifulSoup4
- termcolor
- poppler-utils (optional)
Usage:
./paper-downloader.py --help
./paper-downloader.py "Distinctive image features from scale-invariant keypoints"
./paper-downloader.py "http://arxiv.org/abs/1506.03184"
NOTE: If you are not on a campus network, you may need to set a proxy through the http_proxy and https_proxy environment variables to be able to download from certain sites (such as 'dl.acm.org').
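To illustrate the note above, here is a minimal sketch that reports which proxies the current environment defines, using only the Python standard library. The helper name configured_proxies is hypothetical and is not part of paper-downloader.py. The requests library reads the same environment variables by default.

```python
import urllib.request


def configured_proxies():
    """Return the HTTP/HTTPS proxies found in the environment (hypothetical helper)."""
    # getproxies() reads variables such as http_proxy and https_proxy and
    # returns a mapping from URL scheme to proxy URL.
    proxies = urllib.request.getproxies()
    return {scheme: url for scheme, url in proxies.items()
            if scheme in ("http", "https")}


if __name__ == "__main__":
    found = configured_proxies()
    if found:
        for scheme, url in sorted(found.items()):
            print(f"{scheme} traffic goes through {url}")
    else:
        print("Neither http_proxy nor https_proxy is set.")
```

Running this before the downloader is a quick way to check that the variables were exported in the current shell.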
The searcher module will fuzzy search and analyse results in:
- Google Scholar
The fetcher module will further analyse the results and download papers from the following sources:
- direct pdf link
- dl.acm.org
- ieeexplore.ieee.org
- arxiv.org
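The source list above can be pictured as a lookup on the URL's host. The sketch below is an assumption about how such a lookup might work, not the fetcher module's actual code. The function name classify_source and the labels it returns are hypothetical.

```python
from urllib.parse import urlparse

# Hosts named in the list above, mapped to hypothetical source labels.
KNOWN_HOSTS = {
    "dl.acm.org": "acm",
    "ieeexplore.ieee.org": "ieee",
    "arxiv.org": "arxiv",
}


def classify_source(url):
    """Return a label for the source of `url`, or None if it is unsupported."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in KNOWN_HOSTS:
        return KNOWN_HOSTS[host]
    # Any other URL whose path ends in .pdf is treated as a direct pdf link.
    if parsed.path.lower().endswith(".pdf"):
        return "direct_pdf"
    return None
```

Checking known hosts before the ".pdf" suffix means that a PDF link hosted on one of the listed sites is still handled by the logic specific to that site.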
Searcher and Fetcher are extensible to support more resources.
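One common way to get this kind of extensibility is a registry of handlers keyed by URL pattern. The sketch below illustrates that general pattern and does not describe the project's real extension mechanism. The names register_fetcher, FETCHERS and dispatch are hypothetical.

```python
import re

FETCHERS = []  # (compiled URL pattern, handler) pairs, in registration order


def register_fetcher(pattern):
    """Decorator that registers a handler for URLs matching `pattern`."""
    def decorator(func):
        FETCHERS.append((re.compile(pattern), func))
        return func
    return decorator


@register_fetcher(r"^https?://(www\.)?arxiv\.org/abs/")
def fetch_arxiv(url):
    # A real handler would download the paper; this one only reports a plan.
    return "arxiv handler would fetch " + url


@register_fetcher(r"\.pdf$")
def fetch_direct_pdf(url):
    return "direct handler would fetch " + url


def dispatch(url):
    """Run the first registered handler whose pattern matches, else return None."""
    for pattern, handler in FETCHERS:
        if pattern.search(url):
            return handler(url)
    return None
```

With this layout, supporting a new site amounts to writing one more decorated function, and no existing handler has to change.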
The command line tool will directly download the paper with a clean filename.
All downloaded papers will be compressed using ps2pdf from poppler-utils, if available.
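The two steps above, naming the file and optionally compressing it, might look roughly like the sketch below. Both function names are hypothetical. The set of characters stripped from the title and the choice to keep the original file unless the compressed copy is smaller are assumptions, not the tool's documented behaviour.

```python
import os
import re
import shutil
import subprocess


def clean_filename(title, ext=".pdf"):
    """Turn a paper title into a readable file name (hypothetical helper)."""
    # Replace characters that are unsafe in file names on common systems.
    name = re.sub(r'[\\/:*?"<>|]+', " ", title)
    # Collapse runs of whitespace left behind by the replacement.
    name = re.sub(r"\s+", " ", name).strip()
    return name + ext


def compress_pdf(path, tool="ps2pdf"):
    """Try to shrink `path` in place; return True only if a smaller file replaced it."""
    if shutil.which(tool) is None:
        return False  # tool not installed: keep the original file
    tmp = path + ".tmp"
    result = subprocess.run([tool, path, tmp])
    ok = result.returncode == 0 and os.path.exists(tmp)
    if ok and os.path.getsize(tmp) < os.path.getsize(path):
        os.replace(tmp, path)
        return True
    if os.path.exists(tmp):
        os.remove(tmp)  # discard a failed or larger result
    return False
```

Because compress_pdf returns False when the tool is missing, the download still succeeds on machines without ps2pdf.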
The server provides:
- RESTful APIs on papers
- Interactive paper reading UI supported by pdf2htmlEX
The command line tool is sufficient on its own. If you'd like to play with the server, you'll need:
- Python2 with virtualenv. Python headers are needed (python-dev on debian/ubuntu).
- ghostscript
- libcurl (libcurl4-{openssl,nss,gnutls}-dev on debian/ubuntu)
- xapian (libxapian-dev & python2-xapian on debian/ubuntu)
- pdf2htmlEX (see its download guide)
- poppler-utils, which provides the 'pdftotext' command line utility
Note: if you need to run the server on debian/ubuntu, make sure you do not have the 'python2-bson' package installed.
TODO:
- Fetcher dedup: when both the arxiv abs and pdf links appear in the search results, the page is downloaded twice (maybe add a cache for requests)
- Don't trust arxiv link from google scholar
- Is title correctly updated for dlacm?
- Extract title from bibtex -- more accurate?
- Fetcher for other sites
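As one possible approach to the fetcher dedup item, the sketch below maps arxiv abs and pdf URLs for the same paper to a single key before anything is downloaded. It is a suggestion under stated assumptions, not existing project code. The names dedup_key and dedup are hypothetical, only new-style arXiv identifiers (such as 1506.03184) are recognised, and version suffixes are ignored.

```python
import re

# Matches abs/pdf URLs with a new-style arXiv identifier, e.g. 1506.03184,
# optionally followed by a version suffix (v2) and a .pdf extension.
ARXIV_URL = re.compile(
    r"^https?://(?:www\.)?arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$"
)


def dedup_key(url):
    """Return a key shared by the abs and pdf URLs of the same arXiv paper."""
    match = ARXIV_URL.match(url)
    return "arxiv:" + match.group(1) if match else url


def dedup(urls):
    """Keep the first URL for each key, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        key = dedup_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

The same key could also index a request cache, so that a page fetched once is reused instead of downloaded again.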