pfdamasceno/shakespeare

shakespeare

Identify relevant scientific papers with simple machine learning techniques

Installation

Copy shakespeare.py, data, and content_sources to a directory on your PYTHONPATH.

To install an example knowledge set, copy the contents of examples to $HOME/.shakespeare

Depends on the bibtexparser, feedparser, and scikit-learn packages, which can be installed via pip:

pip install bibtexparser scikit-learn feedparser
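Note that scikit-learn is imported as `sklearn`, so the pip name and the module name differ. One way to confirm that the three dependencies are visible to your interpreter is a small check like the one below. The helper name `missing_packages` is made up for this sketch and is not part of shakespeare.

```python
import importlib.util

# Import names of the dependencies; scikit-learn installs the module "sklearn".
REQUIRED_MODULES = ["bibtexparser", "feedparser", "sklearn"]


def missing_packages(modules=REQUIRED_MODULES):
    """Return the module names that the import system cannot find."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies found.")
```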

Features

  • Fetch functions for the following journals

    • Phys Rev A-X
    • PRL
    • PNAS
    • Nature and its sub-journals (Nature Materials, Nature Physics, Nature Nanotechnology)
    • Science
    • Small
    • ACS Nano, Nano Letters
    • Soft Matter
    • Langmuir
    • Angewandte Chemie
    • JCP, JCP B
  • Fetch functions for arXiv

  • Support for BibTeX files

  • Naive Bayes training and classification

Usage

The very first thing to do is to tell the code where the relevant papers (the 'good stuff') are

./shakespeare.py -g good.bib -k examples/ --overwrite-knowledge --train

Train the naive Bayes classifier on relevant (good) and irrelevant (bad) papers

./shakespeare.py -g thegoodstuff.bib -b thebadstuff.bib -k examples --train
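To make the idea behind the training step concrete, here is a minimal naive Bayes sketch in plain Python. It is not shakespeare's code: the tool depends on scikit-learn, while this stand-in uses only the standard library, and the names `tokenize`, `train`, and `classify` are assumptions made for illustration. It counts words in good and bad texts, then scores new text with Laplace-smoothed log-probabilities.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(good_texts, bad_texts):
    """Count words per class and return the model as a plain dict."""
    if not good_texts or not bad_texts:
        raise ValueError("both classes need at least one training text")
    counts = {"good": Counter(), "bad": Counter()}
    docs = {"good": len(good_texts), "bad": len(bad_texts)}
    for label, texts in (("good", good_texts), ("bad", bad_texts)):
        for text in texts:
            counts[label].update(tokenize(text))
    vocab = set(counts["good"]) | set(counts["bad"])
    return {"counts": counts, "docs": docs, "vocab": vocab}


def classify(model, text):
    """Return the label with the higher log-posterior (Laplace smoothing)."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, counter in model["counts"].items():
        # Log prior taken from the class frequency in the training data.
        score = math.log(model["docs"][label] / total_docs)
        n_words = sum(counter.values())
        for word in tokenize(text):
            if word not in model["vocab"]:
                continue  # ignore words never seen during training
            score += math.log((counter[word] + 1) / (n_words + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(
    ["Self-assembly of colloidal nanoparticles",
     "Colloidal crystals from anisotropic particles"],
    ["Higgs boson decay measurements",
     "Dark matter search at a collider"],
)
print(classify(model, "Assembly of colloidal particles"))
```

With training sets this small the result is only a toy; in practice the good and bad sets would come from the BibTeX files passed with -g and -b.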

Find papers from Nature Nanotechnology and PNAS, writing the results to cool_papers.md

./shakespeare.py -j natnano pnas -o cool_papers.md
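The -o option writes the selected papers to a markdown file. The layout shakespeare actually produces is not documented here, so the following is only a sketch of what such an export could look like; the function name `papers_to_markdown` and the dict keys `title`, `authors`, and `link` are assumptions.

```python
def papers_to_markdown(papers):
    """Render paper records as a markdown bullet list, one paper per item."""
    lines = ["# Selected papers", ""]
    for paper in papers:
        title = paper.get("title", "Untitled")
        authors = paper.get("authors", "")
        link = paper.get("link", "")
        # Link the title only when a URL is available.
        heading = f"[{title}]({link})" if link else title
        lines.append(f"- **{heading}**" + (f", {authors}" if authors else ""))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = [{"title": "An example paper", "authors": "A. Author"}]
    print(papers_to_markdown(example), end="")
```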

Find papers from the arXiv categories cond-mat.soft and math, then review the algorithm's selection

./shakespeare.py -a cond-mat.soft math --feedback

Help printout

usage: shakespeare.py [-h] [-o OUTPUT] [-b [BIBFILES [BIBFILES ...]]]
                      [-j [JOURNALS [JOURNALS ...]]] [-a [ARXIV [ARXIV ...]]]
                      [--all_sources] [--all_good_sources] [--train]
                      [-g GOOD_SOURCE] [-m METHOD] [-k KNOWLEDGE]
                      [--overwrite-knowledge] [--feedback] [--review_all]

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        output file name. only supports markdown right now.
  -b [BIBFILES [BIBFILES ...]], --bibtex [BIBFILES [BIBFILES ...]]
                        bibtex files to fetch
  -j [JOURNALS [JOURNALS ...]], --journals [JOURNALS [JOURNALS ...]]
                        journals to fetch. Currently supports physreve
                        physrevd jchemphysb physreva physrevc pnas nature
                        jchemphys science natmat physrevb acsnano jphyschem
                        nanoletters natphys prl small angewantechemie langmuir
                        physrevx natnano.
  -a [ARXIV [ARXIV ...]], --arXiv [ARXIV [ARXIV ...]]
                        arXiv categories to fetch
  --all_sources         flag to search from all sources.
  --all_good_sources    flag to search from good sources. Specfied in your
                        config file.
  --train               flag to train. All sources beside "--train-input-good"
                        are treated as bad/irrelevant papers
  -g GOOD_SOURCE, --train_input_good GOOD_SOURCE
                        bibtex file containing relevant articles.
  -m METHOD, --method METHOD
                        Methods to try to find relevent papers. Right now,
                        only all, title, author, and abstract are valid fields
  -k KNOWLEDGE, --knowledge KNOWLEDGE
                        path to database containing information about good and
                        bad keywords. If you are training, you must specifiy
                        this, as it will be where your output is written
  --overwrite-knowledge
                        flag to overwrite knowledge,if training
  --feedback            flag to give feedback after sorting content
  --review_all          review all the new selections. Otherwise, you will
                        only review the good selections
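The -m option restricts which fields are used to judge a paper. As a rough illustration, the sketch below builds the text to classify from a BibTeX-style entry held as a dict. The function name `entry_text` is hypothetical, and treating "all" as the concatenation of title, author, and abstract is an assumption rather than documented behavior.

```python
VALID_METHODS = ("all", "title", "author", "abstract")


def entry_text(entry, method="all"):
    """Build the text to classify from one BibTeX-style entry (a dict)."""
    if method not in VALID_METHODS:
        raise ValueError(f"unknown method: {method!r}")
    # Assumption: "all" means every supported field, joined in a fixed order.
    fields = ("title", "author", "abstract") if method == "all" else (method,)
    # Missing or empty fields are skipped rather than treated as errors.
    return " ".join(entry[f] for f in fields if entry.get(f))


if __name__ == "__main__":
    entry = {"title": "Colloidal crystals", "author": "A. Author"}
    print(entry_text(entry, "title"))
```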

TODO

  • Train a bunch and see if this is worth any more time
  • Make a nice installer
  • Add support for a config file for setting defaults (which journals to search, etc.)
