lucasguillermo/opp-tools (forked from wo/paperscraper)
the tools I use for the 'online papers in philosophy' RSS feed
opp-tools is a collection of tools to track academic papers linked on
certain web pages.

---------------------------------------------------------------------
INSTALLATION AND SETUP
---------------------------------------------------------------------

Some additional software packages are required to handle documents of
various formats. (You may omit packages for formats you don't need.)

    pdftohtml, the Poppler version   (all formats)
    pdftk                            (all formats)
    cuneiform >= 0.9                 (PDF/OCR, Postscript)
    convert, from ImageMagick        (PDF/OCR, Postscript)
    ps2pdf                           (Postscript)
    openoffice.org                   (Word, RTF)
    unoconv                          (Word, RTF)
    wkhtmltopdf >= 0.9.9.2           (HTML)
    aspell-en                        (all formats)
    libaspell-dev                    (all formats)

On recent Ubuntu versions, the following installs them all:

    sudo apt-get install poppler-utils pdftk imagemagick ghostscript \
         openoffice.org unoconv aspell-en libaspell-dev cuneiform wkhtmltopdf

Next, make sure you have the following CPAN Perl modules. (You may
need to install gcc and openssl first.)

    sudo apt-get install gcc openssl libssl-dev
    sudo cpan -i HTML::LinkExtractor
    sudo cpan -i HTML::Encoding
    sudo cpan -i HTML::Strip
    sudo cpan -i HTML::TreeBuilder
    sudo cpan -i XML::Writer
    sudo cpan -i XML::XPath
    sudo cpan -i XML::RSS
    sudo cpan -i Text::Capitalize
    sudo cpan -i Text::Names
    sudo cpan -i Text::Aspell
    sudo cpan -i Text::Unidecode
    sudo cpan -i String::Approx
    sudo cpan -i Algorithm::NaiveBayes
    sudo cpan -i AI::Categorizer
    sudo cpan -i Statistics::Lite
    sudo cpan -i Biblio::Citation::Compare
    sudo cpan -i Lingua::Stem::Snowball

Now set up a MySQL database to store the tracked pages and papers. To
create the tables, call

    mysql -u dbuser -p dbname < setup.sql

Then rename config-example.pl to config.pl and adjust the values for
the database connection.

At this point there aren't any pages in the database yet. To test the
setup, you might run

    ./update_sources add "http://consc.net/papers.html"
    ./process_pages.pl
    ./process_links.pl

---------------------------------------------------------------------
BASIC USAGE
---------------------------------------------------------------------

At the heart of opp-tools lie two Perl modules, Converter and
Extractor. The Converter module converts documents from PDF, Word,
HTML and other formats to XML. The Extractor module extracts metadata
from such XML documents. These two modules can be used on their own,
without any database or internet connection:

    #! /usr/bin/perl
    use Converter;
    use Extractor;
    Converter::convert2xml("test.pdf", "test.xml");
    my $ex = Extractor->new("test.xml");
    $ex->extract('authors', 'title');
    print "authors: ", join(', ', @{$ex->{authors}}), "\n";
    print "title: ", $ex->{title}, "\n";

Possible arguments for extract() are 'authors', 'title', 'abstract'
and 'bibliography'.

Most of the rest of the package makes use of these modules to keep
track of documents linked on certain web pages. The tracked pages are
stored in the "sources" table of the database. Instead of manually
editing the MySQL table, you can use update_sources, as in

    ./update_sources add "http://consc.net/papers.html"

See ./update_sources -h for more usage information.

The two main scripts for retrieving author, title and abstract
information from the source pages are process_pages.pl and
process_links.pl. The first checks for links on the source pages and
stores them in the "locations" table in the database. The second
fetches the linked documents, extracts metadata and stores the result
in the "documents" table.

The little bash script run.sh calls these scripts in an infinite
loop, occasionally resting when all pages and documents have recently
been processed.

It is up to you to do anything further with the contents of the
"documents" table. Included is a script called "rss" that generates
an RSS feed of the most recently found documents.
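For example, to see what has been extracted, you can read the
"documents" table back with ordinary DBI code. The following is a
minimal sketch: the connection values are placeholders for those in
config.pl, and the column names (authors, title, url) are guesses for
illustration; check setup.sql for the actual schema.

    #! /usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Placeholder connection values; use the ones from config.pl.
    my $dbh = DBI->connect("DBI:mysql:dbname", "dbuser", "dbpass",
                           { RaiseError => 1 });

    # Column names here are assumptions; see setup.sql for the real schema.
    my $sth = $dbh->prepare(
        "SELECT authors, title, url FROM documents LIMIT 10");
    $sth->execute();
    while (my $row = $sth->fetchrow_hashref()) {
        print "$row->{authors}: $row->{title}\n    $row->{url}\n";
    }
    $dbh->disconnect();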
---------------------------------------------------------------------
SPAM FILTER
---------------------------------------------------------------------

Often a source page contains links not only to papers but also to
irrelevant stuff like a department home page or a CV. To filter these
out, process_links.pl assigns a "spamminess" score between 0 and 1 to
each processed URL; the score is stored in the "locations" table. It
is calculated by a combination of ad hoc heuristics and a Bayesian
classifier.

The classifier comes pre-trained with a small "ham" and "spam"
corpus, which is included in the spamcorpus directory. To adjust the
filter, you can add or remove documents in the ham and spam
directories and then run ./train_filter in the spamcorpus directory.

The documents in the ham and spam directories should be in plain text
format. To add e.g. a PDF or HTML document, you can run

    ./add_document -s http://example.com/department/index.html

This adds the text content of the given URL to the spam corpus.
Without the -s flag, the content is added to the ham corpus.
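As a rough illustration of how the Bayesian part of the score works
(this is not the actual filter code), the Algorithm::NaiveBayes
module from the dependency list can be trained on word counts and
then returns a score between 0 and 1 per label. The training data
below is a toy example, not the shipped corpus:

    #! /usr/bin/perl
    use strict;
    use warnings;
    use Algorithm::NaiveBayes;

    my $nb = Algorithm::NaiveBayes->new;

    # Toy training data; opp-tools trains on the spamcorpus documents.
    $nb->add_instance(attributes => { curriculum => 2, vitae => 1 },
                      label => 'spam');
    $nb->add_instance(attributes => { abstract => 1, argues => 2 },
                      label => 'ham');
    $nb->train;

    # predict() returns a hashref mapping labels to scores in [0, 1].
    my $scores = $nb->predict(attributes => { vitae => 1, teaching => 1 });
    print "spamminess: $scores->{spam}\n";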
---------------------------------------------------------------------
CUSTOMISATION
---------------------------------------------------------------------

opp-tools is fine-tuned for handling papers in English-speaking
academic philosophy retrieved from personal pages. This is reflected
in the spam filter corpus, but also in the heuristics used to extract
metadata from documents. These can be found in the /rules directory,
and should be relatively easy to adjust. A high-level explanation of
how these rules are applied can be found at the end of Extractor.pm.
The file rules/Keywords.pm also contains regular expressions for URLs
from which papers shouldn't be retrieved.

---------------------------------------------------------------------
SMALL PRINT
---------------------------------------------------------------------

The development of this software was supported by the University of
London and the UK Joint Information Systems Committee as part of the
PhilPapers 2.0 project (Information Environment programme).

Copyright (c) 2003-2011 Wolfgang Schwarz, wo@umsu.de

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version. See
http://www.gnu.org/licenses/gpl.html.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.