GitHub - emsrc/idiscape: Word clouds for (most) members of the Intelligent Systems group at IDI

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
annot		annot
cloud		cloud
crawl_citeseer		crawl_citeseer
crawl_idi		crawl_idi
download		download
extract		extract
vector		vector
webpage		webpage
.gitignore		.gitignore
Makefile		Makefile
README		README
setup_virtualenv.sh		setup_virtualenv.sh
versions		versions

Repository files navigation

IDISCAPE

Erwin Marsi
emars@idi.ntnu.no
January 2014


Crawls IDI webpages for names of scientific staff. Then searches
CiteSeerX for their publications, downloads PDF files, preprocesses
them (part-of-speech tagging and lemmatization, filtering), and turns
them into vectors using a bag-of-ngrams approach. Author vectors are
created by summing the vectors of their publications. Finally word
clouds for every author are constructed from the 200 most frequent
ngrams in their publications.


==============================================================================
REQUIREMENTS
==============================================================================

A recent version of Python plus a number of additional packages:
scrapy lxml pip ipython twisted lxml six requests scikit-learn pil
cython unidecode. See files 'versions' and 'setup_virtualenv.py' to
setup a virtual environment using the Anaconda Python distro.

Text extraction from PDF files requires 'pdfinfo' and 'pdftotext'
command line tools from Poppler (or Xpdf). See
http://poppler.freedesktop.org. Can be installed with Macports.

Annotation requires installation of Stanford CoreNLP tools in the
'annot' dir. See http://nlp.stanford.edu/software/corenlp.shtml. For
this, you need a recent version of the Java SE Runtime Environment.

For word clouds, Andreas Mueller's "word_cloud" package. See
http://peekaboo-vision.blogspot.no/2012/11/a-wordcloud-in-python.html
and the source code at https://github.com/amueller/word_cloud.  Unpack
in cloud/word_cloud, build and copy/link lib file with 'ln -s
word_cloud/query_integral_image.so'.

XMl formatting requires xmllint.

Tested on Mac OS X (10.9.1). Will probably run on any *nix platform
with a bit of tweaking.

Free diskspace required: <5GB


==============================================================================
PROCEDURE
==============================================================================

Quick start:

      $ cd idiscape 
      $ make

There is no proper documentation, but the subsequent steps in the
processing pipeline are (somewhat) decscribed in the Makefile.  Most
of the Pyhon scripts also have a doc string.

About

Word clouds for (most) members of the Intelligent Systems group at IDI

Readme

Activity

0 stars

1 watching

0 forks

Report repository

Releases

No releases published

Packages

No packages published

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

annot

annot

cloud

cloud

crawl_citeseer

crawl_citeseer

crawl_idi

crawl_idi

download

download

extract

extract

vector

vector

webpage

webpage

.gitignore

.gitignore

Makefile

Makefile

README

README

setup_virtualenv.sh

setup_virtualenv.sh

versions

versions

Repository files navigation

About

Releases

Packages

Languages

emsrc/idiscape

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages