Mukosame/SoPaper

 
 

SoPaper, So Easy

This is a project designed for researchers to conveniently access papers they need.

A command line tool, paper-downloader.py, is included. Given the name of a paper, it automatically searches for the paper on the Internet and downloads it, saving it under a readable file name. It mainly supports searching for papers in computer science.

This project also comes with a naive server that provides an integrated search/read/download experience.

How to Use

To run the command line tool, you'll need the following installed:

Usage:

./paper-downloader.py --help
./paper-downloader.py "Distinctive image features from scale-invariant keypoints"
./paper-downloader.py "http://arxiv.org/abs/1506.03184"

NOTE: If you are not on a school network, you may need to set a proxy through the environment variables http_proxy and https_proxy to be able to download from certain sites (such as 'dl.acm.org').
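
As a minimal sketch of the note above, the proxy variables can be set for a single run of the tool from Python. The helper names `proxy_env` and `download_via_proxy` and the proxy address in the usage line are placeholders for illustration and are not part of SoPaper.

```python
import os
import subprocess


def proxy_env(proxy_url, base_env=None):
    """Return a copy of the environment with both proxy variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["http_proxy"] = proxy_url
    env["https_proxy"] = proxy_url
    return env


def download_via_proxy(title, proxy_url, tool="./paper-downloader.py"):
    """Run the command line tool with the proxy variables applied.

    Returns the exit code of the tool.
    """
    return subprocess.call([tool, title], env=proxy_env(proxy_url))
```

For example, `download_via_proxy("Distinctive image features from scale-invariant keypoints", "http://127.0.0.1:8080")` would run the downloader through a proxy assumed to be listening on that address.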

Features

The searcher module will fuzzy search and analyse results in

  • Google Scholar
  • Google

and the fetcher module will further analyse the results and download papers from the following sources:

Searcher and Fetcher are extensible to support more resources.
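
One common way to make such modules extensible is a registry keyed by URL pattern. The sketch below only illustrates that idea; `FETCHERS`, `register_fetcher`, `fetch_arxiv`, and `pick_fetcher` are hypothetical names and do not describe SoPaper's actual code.

```python
import re

# Hypothetical registry: a list of (compiled URL pattern, handler) pairs.
FETCHERS = []


def register_fetcher(url_pattern):
    """Decorator that registers a handler for URLs matching url_pattern."""
    def decorator(func):
        FETCHERS.append((re.compile(url_pattern), func))
        return func
    return decorator


@register_fetcher(r"arxiv\.org/(abs|pdf)/")
def fetch_arxiv(url):
    # Placeholder body: a real fetcher would download and return the paper.
    return "arxiv handler for " + url


def pick_fetcher(url):
    """Return the first registered handler whose pattern matches url, or None."""
    for pattern, handler in FETCHERS:
        if pattern.search(url):
            return handler
    return None
```

With this layout, supporting a new site means adding one decorated function; the dispatch code does not change.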

The command line tool will directly download the paper with a clean filename. All downloaded papers will be compressed using ps2pdf (part of Ghostscript), if available.

The server provides:
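
The two steps described above can be sketched as follows, assuming Python 3 (for `shutil.which`). `clean_filename` and `compress_pdf` are illustrative names, the character filter is one possible choice, and keeping the result only when it is smaller is an assumption rather than the tool's documented behavior.

```python
import os
import re
import shutil
import subprocess


def clean_filename(title, ext=".pdf"):
    """Turn a paper title into a filesystem-friendly file name."""
    # Replace characters that are unsafe in file names on common systems.
    name = re.sub(r'[\\/:*?"<>|]', " ", title)
    # Collapse runs of whitespace left behind by the replacement.
    name = re.sub(r"\s+", " ", name).strip()
    return name + ext


def compress_pdf(path):
    """Rewrite path through ps2pdf if it is installed.

    Returns True only when a smaller file replaced the original.
    """
    if shutil.which("ps2pdf") is None:
        return False  # tool not available: keep the file untouched
    tmp = path + ".tmp"
    status = subprocess.call(["ps2pdf", path, tmp])
    if status != 0 or not os.path.exists(tmp):
        if os.path.exists(tmp):
            os.remove(tmp)
        return False
    if os.path.getsize(tmp) < os.path.getsize(path):
        os.replace(tmp, path)
        return True
    os.remove(tmp)
    return False
```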

  • RESTful APIs on papers
  • Interactive paper reading UI supported by pdf2htmlEX

The command line tool is sufficient on its own. If you'd like to play with the server, you'll need:

  • Python2 with virtualenv. Python headers are needed (python-dev on debian/ubuntu).
  • ghostscript
  • libcurl (libcurl4-{openssl,nss,gnutls}-dev on debian/ubuntu)
  • xapian (libxapian-dev & python2-xapian on debian/ubuntu)
  • pdf2htmlEX. See its download guide
  • poppler-utils, which provides the 'pdftotext' command line utility

Note: if you need to run the server on debian/ubuntu, make sure you do not have the 'python2-bson' package installed.

TODO

  • Fetcher dedup: when both the arxiv abs and pdf links appear in search results, the page would be downloaded twice (maybe add a cache for requests)
  • Don't trust arxiv link from google scholar
  • Is title correctly updated for dlacm?
  • Extract title from bibtex -- more accurate?
  • Fetcher for other sites
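
As a rough illustration of the dedup item above, arxiv abs and pdf links can be mapped to a shared key before fetching. This is only a sketch of one possible approach; `ARXIV_RE`, `dedup_key`, and `dedup_urls` are hypothetical names, and versioned identifiers (such as a trailing `v2`) are treated as distinct papers here.

```python
import re

# Matches arxiv abs/pdf links and captures the paper identifier,
# ignoring an optional ".pdf" suffix and any query string or fragment.
ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/([^?#]+?)(?:\.pdf)?(?:[?#].*)?$")


def dedup_key(url):
    """Map abs/pdf variants of one arxiv paper to the same key."""
    match = ARXIV_RE.search(url)
    if match:
        return "arxiv:" + match.group(1)
    return url


def dedup_urls(urls):
    """Keep the first URL for each key, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        key = dedup_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```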
