Topic modelling from EuroPython's list of abstracts

This is the code to produce a list of topics from abstracts downloaded from the conference website.

The different steps and corresponding modules are:

Web scraping to retrieve the abstracts, based on beautifulsoup4 and urllib2.

joblib is also useful for caching, to avoid multiple crawls of the websites and downloads.

I could have asked the organizers for access to a dump of the database, but it was more fun to crawl.
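
The scraping and caching step could look roughly like the sketch below. It is not the code from topics_extraction.py: it assumes Python 3 (so urllib.request stands in for urllib2), and the cache directory name, the helper names fetch_html and extract_abstract, and the "div.abstract" selector are all hypothetical.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup
from joblib import Memory

# Hypothetical cache directory: downloaded pages are stored on disk here.
memory = Memory("cache", verbose=0)


@memory.cache
def fetch_html(url):
    """Download a page; repeated calls with the same URL reuse the cached copy."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_abstract(html, selector="div.abstract"):
    """Return the text of the first element matching the (assumed) selector."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    return node.get_text(" ", strip=True) if node is not None else ""
```

Because only the download is wrapped in memory.cache, the parsing logic can be changed and re-run without hitting the website again.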
Stemming: trying to convert plural words to singular, using NLTK.

Note that stemming is in general more sophisticated, and will convert words to their roots, such as 'organization' -> 'organ'. To have understandable word clouds, we want to keep more differentiation. Hence we add a custom layer to reduce the power of the stemmer.
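
One way to weaken the stemmer is sketched below, as an illustration rather than the actual custom layer used here: the stem is accepted only when it amounts to dropping a plural "s" or "es", and the word is kept as-is otherwise. The function name light_stem and this acceptance rule are assumptions.

```python
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()


def light_stem(word):
    """Strip plurals only; leave words alone when the stemmer cuts deeper."""
    word = word.lower()
    stem = _stemmer.stem(word)
    # Accept the stem only if the word is the stem plus a plural suffix.
    if word in (stem + "s", stem + "es"):
        return stem
    return word
```

With this rule, a word such as 'organization' is left untouched, because its stem differs from it by more than a plural suffix.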
Topic modelling with scikit-learn.

It's a two-step process: first we convert the text data to a numerical representation ("vectorizing"); second we use a Non-negative Matrix Factorization to extract "topics" from that representation.
Word-cloud figures with the wordcloud module.
Create a web page with tempita.

This application beautifully combines multiple facets of the Python ecosystem, from web tools to PyData.
