GitHub - Mukosame/SoPaper: Automatically Search and Download Papers

SoPaper, So Easy

This is a project designed for researchers to conveniently access papers they need.

A command line tool, paper-downloader.py, is included. Given the name of a paper, it automatically searches for the paper on the Internet and downloads it, saving it under a readable file name. It mainly supports searching for papers in computer science.

This project also comes with a naive server that provides an integrated search/read/download experience.

How to Use

To run the command line tool, you'll need the following installed:

requests
BeautifulSoup4
termcolor
poppler-utils (optional)
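
A quick way to see which of these are present is a small check script. The following is only a sketch, written for Python 3 and not part of the project: `check_dependencies` is a hypothetical helper, the import name `bs4` is the one BeautifulSoup4 installs under, and `pdftotext` stands in for poppler-utils because it is one of the commands that package provides.

```python
import importlib.util
import shutil

# Import names of the required Python packages.
# BeautifulSoup4 is imported as "bs4".
REQUIRED_MODULES = ["requests", "bs4", "termcolor"]

# poppler-utils is optional; "pdftotext" is one of the commands it provides.
OPTIONAL_COMMANDS = ["pdftotext"]


def check_dependencies():
    """Return (missing_modules, missing_commands) without importing anything."""
    missing_modules = [
        name for name in REQUIRED_MODULES
        if importlib.util.find_spec(name) is None
    ]
    missing_commands = [
        cmd for cmd in OPTIONAL_COMMANDS
        if shutil.which(cmd) is None
    ]
    return missing_modules, missing_commands


modules, commands = check_dependencies()
print("missing Python packages:", modules or "none")
print("missing optional commands:", commands or "none")
```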

Usage:

./paper-downloader.py --help
./paper-downloader.py "Distinctive image features from scale-invariant keypoints"
./paper-downloader.py "http://arxiv.org/abs/1506.03184"
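
As the two examples show, the argument can be either a paper title or a URL. How paper-downloader.py tells them apart is not documented here; the sketch below shows one plausible way to do it, with `classify_query` as a hypothetical helper that treats anything with an http or https scheme and a host as a URL.

```python
from urllib.parse import urlparse


def classify_query(query):
    """Return "url" if the argument looks like an http(s) URL, else "title"."""
    parsed = urlparse(query.strip())
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "title"


print(classify_query("http://arxiv.org/abs/1506.03184"))
print(classify_query("Distinctive image features from scale-invariant keypoints"))
```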

NOTE: If you are not on a school network, you may need to configure a proxy through the environment variables http_proxy and https_proxy to be able to download from certain sites (such as 'dl.acm.org').
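
If you would rather not export the proxy variables globally, they can be set for a single run only. This is a minimal sketch under assumptions: `proxy_env` and `download_via_proxy` are hypothetical helpers, and the proxy address used in the example is a placeholder, not a real server.

```python
import os
import subprocess


def proxy_env(proxy_url, base_env=None):
    """Return a copy of the environment with http_proxy and https_proxy set."""
    env = dict(os.environ if base_env is None else base_env)
    env["http_proxy"] = proxy_url
    env["https_proxy"] = proxy_url
    return env


def download_via_proxy(query, proxy_url):
    """Run the command line tool with the proxy set for that process only."""
    return subprocess.run(
        ["./paper-downloader.py", query],
        env=proxy_env(proxy_url),
        check=False,
    )


# Placeholder proxy address; replace it with the one your institution provides.
example = proxy_env("http://proxy.example.edu:3128", {"PATH": "/usr/bin"})
print(example["http_proxy"], example["https_proxy"])
```

Calling `download_via_proxy("http://arxiv.org/abs/1506.03184", "http://proxy.example.edu:3128")` would then run the tool once through that proxy and leave your shell environment untouched.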

Features

The searcher module performs a fuzzy search and analyses the results from

Google Scholar
Google

and the fetcher module further analyses those results and downloads papers from the sources it supports, such as arXiv and dl.acm.org.

Searcher and Fetcher are extensible to support more resources.
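
To illustrate what this kind of extensibility can look like, here is a sketch of a decorator-based registry that maps URL patterns to fetcher functions. It is not the project's actual interface: `register_fetcher`, `find_fetcher`, the two example fetchers and their return values are all hypothetical.

```python
import re

# Hypothetical registry of (compiled URL pattern, fetcher function) pairs.
_FETCHERS = []


def register_fetcher(url_pattern):
    """Decorator that registers a fetcher for URLs matching url_pattern."""
    def decorator(func):
        _FETCHERS.append((re.compile(url_pattern), func))
        return func
    return decorator


def find_fetcher(url):
    """Return the first registered fetcher whose pattern matches, or None."""
    for pattern, func in _FETCHERS:
        if pattern.search(url):
            return func
    return None


@register_fetcher(r"arxiv\.org/abs/")
def fetch_arxiv(url):
    # Placeholder: a real fetcher would download and parse the page.
    return {"source": "arxiv", "url": url}


@register_fetcher(r"dl\.acm\.org/")
def fetch_dlacm(url):
    # Placeholder: a real fetcher would download and parse the page.
    return {"source": "dlacm", "url": url}


fetcher = find_fetcher("http://arxiv.org/abs/1506.03184")
print(fetcher("http://arxiv.org/abs/1506.03184"))
```

With a layout like this, supporting a new site means adding one decorated function; the lookup code stays unchanged.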

The command line tool downloads the paper directly and saves it under a clean filename. Every downloaded paper is compressed with ps2pdf (shipped with Ghostscript), if it is available.
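
The two steps described above, deriving a clean filename and compressing only when ps2pdf is present, can be sketched as follows. This is an illustration under assumptions, not the tool's own code: `clean_filename` and `compress_pdf` are hypothetical helpers, and the set of characters treated as unsafe is a guess.

```python
import re
import shutil
import subprocess


def clean_filename(title, extension=".pdf"):
    """Turn a paper title into a filesystem-friendly file name."""
    # Replace characters that are unsafe in file names with a space.
    name = re.sub(r'[\\/:*?"<>|]', " ", title)
    # Collapse runs of whitespace and trim both ends.
    name = re.sub(r"\s+", " ", name).strip()
    return name + extension


def compress_pdf(src, dst):
    """Compress src into dst with ps2pdf; return False if it is missing or fails."""
    if shutil.which("ps2pdf") is None:
        return False
    result = subprocess.run(["ps2pdf", src, dst], check=False)
    return result.returncode == 0


print(clean_filename("Distinctive image features from scale-invariant keypoints"))
```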

The server provides:

RESTful APIs on papers
Interactive paper reading UI supported by pdf2htmlEX

The command line tool is sufficient on its own. If you'd like to play with the server, you'll need:

Python2 with virtualenv. Python headers are needed (python-dev on debian/ubuntu).
ghostscript
libcurl (libcurl4-{openssl,nss,gnutls}-dev on debian/ubuntu)
xapian (libxapian-dev & python2-xapian on debian/ubuntu)
pdf2htmlEX, installed following its download guide
poppler-utils, which provides the 'pdftotext' command line utility

Note: if you need to run the server on debian/ubuntu, make sure you do not have the 'python2-bson' package installed.

TODO

Fetcher dedup: when both the arxiv abs and pdf links appear in search results, the page would be downloaded twice (maybe add a cache for requests)
Don't trust arxiv link from google scholar
Is title correctly updated for dlacm?
Extract title from bibtex -- more accurate?
Fetcher for other sites
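
For the first item, one possible shape for such a cache is sketched below. It assumes that the abs and pdf URLs of the same arXiv paper should share one fetch result; `cache_key` and `CachedFetcher` are hypothetical names, and the fetch function is passed in so that the sketch does not depend on any particular HTTP library.

```python
import re

# Matches both arxiv.org/abs/<id> and arxiv.org/pdf/<id>[.pdf].
_ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(.+?)(?:\.pdf)?$")


def cache_key(url):
    """Map arXiv abs/pdf URLs of one paper to the same key; keep other URLs as-is."""
    match = _ARXIV_RE.search(url)
    if match:
        return "arxiv:" + match.group(1)
    return url


class CachedFetcher:
    """Wrap a fetch function so that each distinct key is fetched only once."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}

    def get(self, url):
        key = cache_key(url)
        if key not in self._cache:
            self._cache[key] = self._fetch(url)
        return self._cache[key]


print(cache_key("http://arxiv.org/abs/1506.03184"))
print(cache_key("http://arxiv.org/pdf/1506.03184.pdf"))
```

Both calls print the same key, so a second search result pointing at the same paper would be served from the cache instead of triggering another download.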

