Skip to content

emsrc/idiscape

Repository files navigation

IDISCAPE

Erwin Marsi
emars@idi.ntnu.no
January 2014


Crawls IDI webpages for names of scientific staff. Then searches
CiteSeerX for their publications, downloads PDF files, preprocesses
them (part-of-speech tagging and lemmatization, filtering), and turns
them into vectors using a bag-of-ngrams approach. Author vectors are
created by summing the vectors of their publications. Finally word
clouds for every author are constructed from the 200 most frequent
ngrams in their publications.


==============================================================================
REQUIREMENTS
==============================================================================

A recent version of Python plus a number of additional packages:
scrapy lxml pip ipython twisted lxml six requests scikit-learn pil
cython unidecode. See files 'versions' and 'setup_virtualenv.py' to
setup a virtual environment using the Anaconda Python distro.

Text extraction from PDF files requires 'pdfinfo' and 'pdftotext'
command line tools from Poppler (or Xpdf). See
http://poppler.freedesktop.org. Can be installed with Macports.

Annotation requires installation of Stanford CoreNLP tools in the
'annot' dir. See http://nlp.stanford.edu/software/corenlp.shtml. For
this, you need a recent version of the Java SE Runtime Environment.

For word clouds, Andreas Mueller's "word_cloud" package. See
http://peekaboo-vision.blogspot.no/2012/11/a-wordcloud-in-python.html
and the source code at https://github.com/amueller/word_cloud.  Unpack
in cloud/word_cloud, build and copy/link lib file with 'ln -s
word_cloud/query_integral_image.so'.

XMl formatting requires xmllint.

Tested on Mac OS X (10.9.1). Will probably run on any *nix platform
with a bit of tweaking.

Free diskspace required: <5GB


==============================================================================
PROCEDURE
==============================================================================

Quick start:

      $ cd idiscape 
      $ make

There is no proper documentation, but the subsequent steps in the
processing pipeline are (somewhat) decscribed in the Makefile.  Most
of the Pyhon scripts also have a doc string.

About

Word clouds for (most) members of the Intelligent Systems group at IDI

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published