andreas-koukorinis/my_topics

 
 

Topic modelling from EuroPython's list of abstracts

This is the code to produce a list of topics from abstracts downloaded from the conference website.

The different steps and corresponding modules are:

  • Web scraping to retrieve the abstracts, based on beautifulsoup4 and urllib2.

    joblib is also useful for caching, to avoid repeated crawls and downloads of the website.

    I could have asked the organizers for a dump of the database, but it was more fun to crawl.

  • Stemming: trying to convert plural words to singular, using NLTK.

    Note that stemming is in general more aggressive than this, and will reduce words to their roots, such as 'organization' -> 'organ'. To get understandable word clouds, we want to keep more differentiation between words. Hence we add a custom layer to reduce the power of the stemmer.

  • Topic modelling with scikit-learn.

    It's a two-step process: first we convert the text data to a numerical representation ("vectorizing"); second we use Non-negative Matrix Factorization to extract "topics" from that representation.

  • Word-cloud figures with the wordcloud module.
  • Create a web page with tempita.
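
The "custom layer" around the stemmer mentioned above could look roughly like the sketch below. It is not the repository's code: limit_stemmer and toy_stem are hypothetical names, and toy_stem is only a stand-in for a real stemmer so that the example runs without extra dependencies. The idea is to accept a stem only when it differs from the word by a plural suffix, and to keep the original word otherwise.

```python
def limit_stemmer(stem, plural_suffixes=("es", "s")):
    """Wrap `stem` so that only plural-to-singular reductions are kept.

    `stem` is any callable mapping a word to its stem. The name and the
    suffix list are assumptions made for this sketch.
    """
    def limited(word):
        root = stem(word)
        # Accept the stem only if the word is exactly root + plural suffix.
        for suffix in plural_suffixes:
            if word == root + suffix:
                return root
        # Otherwise the stemmer went too far: keep the original word.
        return word
    return limited


def toy_stem(word):
    """Deliberately aggressive stand-in stemmer (hypothetical, demo only)."""
    for suffix in ("ization", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


singularize = limit_stemmer(toy_stem)
words = ["topics", "organization", "classes", "python"]
result = [singularize(w) for w in words]
print(result)  # ['topic', 'organization', 'class', 'python']
```

With NLTK installed, the stem method of an nltk.stem.PorterStemmer instance can be passed in place of toy_stem; 'organization' then stays intact instead of collapsing to 'organ'.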
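
The two-step topic modelling described above can be sketched with scikit-learn as follows. This is a minimal illustration under assumptions, not the repository's implementation: the four abstracts are made up, TF-IDF is only one possible choice of vectorizer, and the number of topics and of words per topic are arbitrary.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up abstracts, standing in for the scraped ones.
abstracts = [
    "Scraping web pages with Python and parsing HTML",
    "Parsing HTML pages from the web with a scraping library",
    "Machine learning models for text data with scikit-learn",
    "Training machine learning models on numerical data",
]

# Step 1: vectorize the text into a (documents x words) matrix.
# TF-IDF is an assumption; the README only says "vectorizing".
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

# Step 2: factorize X into non-negative document-topic and
# topic-word matrices.
n_topics = 2
nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0)
doc_topics = nmf.fit_transform(X)

# The accessor for the vocabulary was renamed across scikit-learn versions.
if hasattr(vectorizer, "get_feature_names_out"):
    vocabulary = vectorizer.get_feature_names_out()
else:
    vocabulary = vectorizer.get_feature_names()

# Each row of components_ weights every word for one topic;
# the heaviest words describe the topic.
top_words = []
for topic_index, weights in enumerate(nmf.components_):
    best = weights.argsort()[::-1][:5]
    top_words.append([vocabulary[i] for i in best])
    print(topic_index, top_words[-1])
```

The per-topic word weights in nmf.components_ are the kind of input a word cloud needs: one weight per word, one cloud per topic.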

This application beautifully combines multiple facets of the Python ecosystem, from web tools to PyData.
