emsrc/idiscape
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
IDISCAPE Erwin Marsi emars@idi.ntnu.no January 2014 Crawls IDI webpages for names of scientific staff. Then searches CiteSeerX for their publications, downloads PDF files, preprocesses them (part-of-speech tagging and lemmatization, filtering), and turns them into vectors using a bag-of-ngrams approach. Author vectors are created by summing the vectors of their publications. Finally word clouds for every author are constructed from the 200 most frequent ngrams in their publications. ============================================================================== REQUIREMENTS ============================================================================== A recent version of Python plus a number of additional packages: scrapy lxml pip ipython twisted lxml six requests scikit-learn pil cython unidecode. See files 'versions' and 'setup_virtualenv.py' to setup a virtual environment using the Anaconda Python distro. Text extraction from PDF files requires 'pdfinfo' and 'pdftotext' command line tools from Poppler (or Xpdf). See http://poppler.freedesktop.org. Can be installed with Macports. Annotation requires installation of Stanford CoreNLP tools in the 'annot' dir. See http://nlp.stanford.edu/software/corenlp.shtml. For this, you need a recent version of the Java SE Runtime Environment. For word clouds, Andreas Mueller's "word_cloud" package. See http://peekaboo-vision.blogspot.no/2012/11/a-wordcloud-in-python.html and the source code at https://github.com/amueller/word_cloud. Unpack in cloud/word_cloud, build and copy/link lib file with 'ln -s word_cloud/query_integral_image.so'. XMl formatting requires xmllint. Tested on Mac OS X (10.9.1). Will probably run on any *nix platform with a bit of tweaking. Free diskspace required: <5GB ============================================================================== PROCEDURE ============================================================================== Quick start: $ cd idiscape $ make There is no proper documentation, but the subsequent steps in the processing pipeline are (somewhat) decscribed in the Makefile. Most of the Pyhon scripts also have a doc string.
About
Word clouds for (most) members of the Intelligent Systems group at IDI
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published