lucasguillermo/opp-tools (forked from wo/paperscraper)
the tools I use for the 'online papers in philosophy' RSS feed
opp-tools is a collection of tools to track academic papers linked on
certain web pages.

---------------------------------------------------------------------
INSTALLATION AND SETUP
---------------------------------------------------------------------

Some additional software packages are required to handle documents of
various formats. (You may omit packages for formats you don't need.)

    pdftohtml, the Poppler version   (all formats)
    pdftk                            (all formats)
    cuneiform >= 0.9                 (PDF/OCR, Postscript)
    convert, from ImageMagick        (PDF/OCR, Postscript)
    ps2pdf                           (Postscript)
    openoffice.org                   (Word, RTF)
    unoconv                          (Word, RTF)
    wkhtmltopdf >= 0.9.9.2           (HTML)
    aspell-en                        (all formats)
    libaspell-dev                    (all formats)

On recent Ubuntu versions, the following installs them all:

    sudo apt-get install poppler-utils pdftk imagemagick ghostscript \
         openoffice.org unoconv aspell-en libaspell-dev cuneiform wkhtmltopdf

Next, make sure you have the following CPAN Perl modules. (You may
need to install gcc and openssl first.)

    sudo apt-get install gcc openssl libssl-dev
    sudo cpan -i HTML::LinkExtractor
    sudo cpan -i HTML::Encoding
    sudo cpan -i HTML::Strip
    sudo cpan -i HTML::TreeBuilder
    sudo cpan -i XML::Writer
    sudo cpan -i XML::XPath
    sudo cpan -i XML::RSS
    sudo cpan -i Text::Capitalize
    sudo cpan -i Text::Names
    sudo cpan -i Text::Aspell
    sudo cpan -i Text::Unidecode
    sudo cpan -i String::Approx
    sudo cpan -i Algorithm::NaiveBayes
    sudo cpan -i AI::Categorizer
    sudo cpan -i Statistics::Lite
    sudo cpan -i Biblio::Citation::Compare
    sudo cpan -i Lingua::Stem::Snowball

Now set up a MySQL database to store the tracked pages and papers. To
create the tables, call

    mysql -u dbuser -p dbname < setup.sql

Then rename config-example.pl to config.pl and adjust the values for
the database connection.

At this point there aren't any pages in the database yet. To test the
setup, you might run

    ./update_sources add "http://consc.net/papers.html"
    ./process_pages.pl
    ./process_links.pl

---------------------------------------------------------------------
BASIC USAGE
---------------------------------------------------------------------

At the heart of opp-tools lie two Perl modules, Converter and
Extractor. The Converter module converts documents from PDF, Word,
HTML and other formats to XML. The Extractor module extracts metadata
from such XML documents. These two modules can be used on their own,
without any database or internet connection:

    #! /usr/bin/perl
    use Converter;
    use Extractor;
    Converter::convert2xml("test.pdf", "test.xml");
    my $ex = Extractor->new("test.xml");
    $ex->extract('authors', 'title');
    print "authors: ", join(', ', @{$ex->{authors}}), "\n";
    print "title: ", $ex->{title}, "\n";

Possible arguments for extract() are 'authors', 'title', 'abstract'
and 'bibliography'.

Most of the rest of the package makes use of these modules to keep
track of documents linked on certain web pages. The tracked pages are
stored in the "sources" table of the database. Instead of manually
editing the MySQL table, you can use update_sources, as in

    ./update_sources add "http://consc.net/papers.html"

See ./update_sources -h for more usage information.

The two main scripts for retrieving author, title and abstract
information from the source pages are process_pages.pl and
process_links.pl. The first checks for links on the source pages and
stores them in the "locations" table in the database. The second
fetches the linked documents, extracts metadata and stores the result
in the "documents" table.

The little bash script run.sh calls these scripts in an infinite
loop, occasionally resting when all pages and documents have recently
been processed.

It is up to you to do anything further with the contents of the
"documents" table. Included is a script called "rss" that generates
an RSS feed of the most recently found documents.
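For example, to see what has been extracted, you can read the
"documents" table back with ordinary DBI code. The following is a
minimal sketch: the connection values are placeholders for those in
config.pl, and the column names (authors, title, url) are guesses for
illustration; check setup.sql for the actual schema.

    #! /usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Placeholder connection values; use the ones from config.pl.
    my $dbh = DBI->connect("DBI:mysql:dbname", "dbuser", "dbpass",
                           { RaiseError => 1 });

    # Column names here are assumptions; see setup.sql for the real schema.
    my $sth = $dbh->prepare(
        "SELECT authors, title, url FROM documents LIMIT 10");
    $sth->execute();
    while (my $row = $sth->fetchrow_hashref()) {
        print "$row->{authors}: $row->{title}\n    $row->{url}\n";
    }
    $dbh->disconnect();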
---------------------------------------------------------------------
SPAM FILTER
---------------------------------------------------------------------

Often a source page contains links not only to papers but also to
irrelevant stuff like a department home page or a CV. To filter these
out, process_links.pl assigns a "spamminess" score between 0 and 1 to
each processed URL; the score is stored in the "locations" table. It
is calculated by a combination of ad hoc heuristics and a Bayesian
classifier.

The classifier comes pre-trained with a small "ham" and "spam"
corpus, which is included in the spamcorpus directory. To adjust the
filter, you can add or remove documents in the ham and spam
directories and then run ./train_filter in the spamcorpus directory.

The documents in the ham and spam directories should be in plain text
format. To add e.g. a PDF or HTML document, you can run

    ./add_document -s http://example.com/department/index.html

This adds the text content of the given URL to the spam corpus.
Without the -s flag, the content is added to the ham corpus.
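As a rough illustration of how the Bayesian part of the score works
(this is not the actual filter code), the Algorithm::NaiveBayes
module from the dependency list can be trained on word counts and
then returns a score between 0 and 1 per label. The training data
below is a toy example, not the shipped corpus:

    #! /usr/bin/perl
    use strict;
    use warnings;
    use Algorithm::NaiveBayes;

    my $nb = Algorithm::NaiveBayes->new;

    # Toy training data; opp-tools trains on the spamcorpus documents.
    $nb->add_instance(attributes => { curriculum => 2, vitae => 1 },
                      label => 'spam');
    $nb->add_instance(attributes => { abstract => 1, argues => 2 },
                      label => 'ham');
    $nb->train;

    # predict() returns a hashref mapping labels to scores in [0, 1].
    my $scores = $nb->predict(attributes => { vitae => 1, teaching => 1 });
    print "spamminess: $scores->{spam}\n";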
---------------------------------------------------------------------
CUSTOMISATION
---------------------------------------------------------------------

opp-tools is fine-tuned for handling papers in English-speaking
academic philosophy retrieved from personal pages. This is reflected
in the spam filter corpus, but also in the heuristics used to extract
metadata from documents. These can be found in the /rules directory,
and should be relatively easy to adjust. A high-level explanation of
how these rules are applied can be found at the end of Extractor.pm.
The file rules/Keywords.pm also contains regular expressions for URLs
from which papers shouldn't be retrieved.

---------------------------------------------------------------------
SMALL PRINT
---------------------------------------------------------------------

The development of this software was supported by the University of
London and the UK Joint Information Systems Committee as part of the
PhilPapers 2.0 project (Information Environment programme).

Copyright (c) 2003-2011 Wolfgang Schwarz, wo@umsu.de

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version. See
http://www.gnu.org/licenses/gpl.html.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.