#Mapping Wikileaks' Cablegate thematics using Python, MongoDB and Gephi
Talk proposal for FOSDEM's data dev room, Brussels, Feb 5 2011.
##Speakers
We are two software engineers at the Centre National de la Recherche Scientifique (France), working on the TINA project.
- Julian Bilcke : contributor to the Gephi project. Follow me at @flngr.
- Elias Showk : my key areas of work are text mining with Python and building data application engines with non-relational databases and customized HTTP servers. I also code JavaScript/jQuery/HTML5 web interfaces and, less recently, Perl/Moose/Catalyst modules. Follow me at @elishowk.
##Audience
- intermediate or beginner
##Abstract
We propose to present a complete workflow for textual data analysis, from data acquisition to the visual exploration of a complex network. By presenting cablegate-semnet, a simple program developed specifically for this talk, we will cover a set of productive and widely used text analysis tools and libraries. We will then introduce some features of Gephi, an open-source network visualization and analysis application, using the data collected and transformed with cablegate-semnet.
###Data and methodology
The presentation will focus on Wikileaks' Cablegate data, and specifically on the full text of all diplomatic cables published so far. The goal is to produce a weighted network containing two categories of nodes :
- thematic nodes, automatically extracted from the full text and linked by co-occurrences
- leaked cable nodes, linked by a custom similarity index (an adaptation of the Jaccard similarity index).
Nodes of the two categories will be linked to each other by occurrences.
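
As a rough illustration of the two kinds of edge weights described above, the following sketch counts co-occurrences between thematics and computes a similarity between cables. The toy `cables` data and the function names are hypothetical. The plain Jaccard index is used here as a stand-in, not the adapted index mentioned above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: cable id -> set of thematics extracted from it
cables = {
    "cable_a": {"nuclear program", "sanctions", "oil"},
    "cable_b": {"sanctions", "oil"},
    "cable_c": {"human rights"},
}


def jaccard(a, b):
    """Plain Jaccard index: |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cooccurrences(docs):
    """Count in how many cables each unordered pair of thematics appears together."""
    counts = Counter()
    for terms in docs.values():
        for pair in combinations(sorted(terms), 2):
            counts[pair] += 1
    return counts


# Cable-to-cable edges weighted by similarity
cable_edges = {
    (c1, c2): jaccard(cables[c1], cables[c2])
    for c1, c2 in combinations(sorted(cables), 2)
}

# Thematic-to-thematic edges weighted by co-occurrence count
thematic_edges = cooccurrences(cables)
```

Sorting the terms before pairing them keeps each undirected edge under a single key, whatever order the thematics were extracted in.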
###1st Part : Information extraction, internals of a simple Python program
- speaker : Elias
This program illustrates common text-mining methods, taking advantage of Python's built-in functions as well as some well-known external libraries (NLTK, BeautifulSoup). It also demonstrates the simplicity and power of MongoDB for tasks like document indexing and information extraction.
The talk will focus on the following topics :
- Parsing cables with NLTK's HTML cleaner, BeautifulSoup's HTML parser and Python's regular expressions
- Inserting cables into MongoDB using its JSON-like document format (BSON), with a presentation of the Python driver for MongoDB
- Extracting relevant keyphrases with NLTK (part-of-speech tagging, stemming, grammar-based keyphrase selection) and MongoDB's atomic modifiers
- Pre-processing the network with MongoDB's map/reduce capabilities to compute the weights of the edges between nodes
- Exporting the network in a Gephi-compatible format (GEXF) using the Tenjin template engine
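
The last two steps of this list can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the cablegate-semnet code. The map and reduce steps are emulated in plain Python instead of running inside MongoDB. `xml.etree.ElementTree` replaces the Tenjin template. The `documents` sample and all function names are hypothetical, and the GEXF 1.2draft namespace is assumed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import combinations

# Hypothetical output of the keyphrase extraction step
documents = [
    {"_id": "cable_a", "keyphrases": ["oil", "sanctions"]},
    {"_id": "cable_b", "keyphrases": ["oil", "sanctions", "embassy"]},
]


def map_pairs(doc):
    """Map step: emit (pair, 1) for every pair of keyphrases found in one cable."""
    for pair in combinations(sorted(set(doc["keyphrases"])), 2):
        yield pair, 1


def reduce_counts(emitted):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in emitted:
        totals[key] += value
    return dict(totals)


def to_gexf(edge_weights):
    """Serialize weighted undirected edges as a minimal GEXF document."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="undirected")
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")
    # One node per distinct keyphrase appearing in at least one edge
    for label in sorted({label for pair in edge_weights for label in pair}):
        ET.SubElement(nodes, "node", id=label, label=label)
    for i, ((source, target), weight) in enumerate(sorted(edge_weights.items())):
        ET.SubElement(edges, "edge", id=str(i), source=source,
                      target=target, weight=str(weight))
    return ET.tostring(gexf, encoding="unicode")


weights = reduce_counts(pair for doc in documents for pair in map_pairs(doc))
gexf_xml = to_gexf(weights)
```

Writing `gexf_xml` to a file with a `.gexf` extension should give a graph that Gephi can open, with the co-occurrence counts as edge weights.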
###2nd Part : Network visualization : Gephi demonstration
- speaker : Julian
Cablegate-semnet selects text thematics with a fairly naive automatic method, so it produces a network of thousands of nodes that contains some noise. In addition, the presence of two types of nodes implies three types of edges (thematic to thematic, cable to cable, and thematic to cable), so we can expect a dense graph. The resulting weighted network is rich in information, and the aim of this second part is to demonstrate Gephi's features for network post-processing, with a focus on :
- How to import a network data file
- Overview of basic visualization features
- How to remove meaningless content using the data table, sorting and filtering
- How to highlight meaningful elements using cluster detection, ranking and coloring
- How to customize the graph appearance, and export the map to PDF and the web