heuer/cablegate_semnet

 
 

#Mapping Wikileaks' Cablegate thematics using Python, MongoDB and Gephi

Talk proposal for FOSDEM's data dev room, Brussels, Feb 5 2011.

##Speakers

We are two software engineers at the Centre National de la Recherche Scientifique (France), working on the TINA project.

  • Julian Bilcke : contributor to the Gephi project. Follow me at @flngr.
  • Elias Showk : works mainly on text-mining with Python, and on building data application engines with non-relational databases and customized HTTP servers. Also codes JavaScript/jQuery/HTML5 web interfaces and, less recently, Perl/Moose/Catalyst modules. Follow me at @elishowk.

##Audience

  • intermediate or beginner

##Abstract

We propose to present a complete workflow of textual data analysis, from acquisition to visual exploration of a complex network. Through the presentation of cablegate-semnet, a simple program developed specifically for this talk, we will cover a set of productive and widely used text-analysis tools and libraries. We will then introduce some features of Gephi, an open-source network visualization and analysis application, using the data collected and transformed with cablegate-semnet.

###Data and methodology

The presentation will focus on Wikileaks' Cablegate data, and specifically on the full text of all diplomatic cables published so far. The goal is to produce a weighted network containing two categories of nodes :

  • thematic nodes, automatically extracted from the full text and linked to each other by co-occurrences
  • leaked cable nodes, linked to each other by a custom similarity index (an adaptation of the Jaccard similarity index).

Nodes of the two categories will be linked to each other by occurrences, that is, a theme appearing in the text of a cable.
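As a rough illustration of the three kinds of edges described above, here is a minimal sketch. It assumes the themes of each cable have already been extracted into sets; the toy data and the function names are hypothetical, and it uses the plain Jaccard index rather than the custom adaptation used by cablegate-semnet.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: cable id -> set of themes found in its text.
cable_themes = {
    "cable_a": {"nuclear program", "sanctions", "security council"},
    "cable_b": {"sanctions", "security council", "oil exports"},
    "cable_c": {"oil exports", "trade agreement"},
}


def theme_cooccurrences(cable_themes):
    """Theme-theme edges: number of cables in which both themes appear."""
    counts = Counter()
    for themes in cable_themes.values():
        # Sort so that each unordered pair has a single canonical key.
        for pair in combinations(sorted(themes), 2):
            counts[pair] += 1
    return counts


def jaccard(a, b):
    """Plain Jaccard index |A & B| / |A | B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cable_similarities(cable_themes):
    """Cable-cable edges: Jaccard index of the theme sets, zero weights dropped."""
    sims = {}
    for id_a, id_b in combinations(sorted(cable_themes), 2):
        score = jaccard(cable_themes[id_a], cable_themes[id_b])
        if score > 0:
            sims[(id_a, id_b)] = score
    return sims


def occurrence_edges(cable_themes):
    """Cable-theme edges: one edge per theme occurring in a cable."""
    return [
        (cable_id, theme)
        for cable_id in sorted(cable_themes)
        for theme in sorted(cable_themes[cable_id])
    ]
```

Comparing every pair of cables is quadratic in the number of cables, which is fine for a toy example but would need pruning or indexing on the full corpus.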

###1st part : Information extraction, internals of a simple Python program

  • speaker : Elias

This program illustrates common text-mining methods, taking advantage of Python built-in functions as well as some well-known external libraries (NLTK, BeautifulSoup). It also demonstrates the simplicity and power of MongoDB in tasks like document indexing and information extraction.
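To give an idea of what such an extraction step can look like, here is a dependency-free sketch. It is not the cablegate-semnet code: it replaces BeautifulSoup with the standard library `html.parser` and NLTK with a regular-expression tokenizer, and the stopword list, function names and n-gram rule are all assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "on", "for", "is", "with"}


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_themes(tokens, max_len=2):
    """Yield n-grams (n <= max_len) that neither start nor end with a stopword."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            yield " ".join(gram)


def extract_themes(html, max_len=2):
    """Count candidate themes in one HTML document."""
    tokens = tokenize(html_to_text(html))
    return Counter(candidate_themes(tokens, max_len))
```

A short usage example, with a made-up HTML fragment:

```python
page = "<p>The <b>security council</b> discussed sanctions. The security council agreed.</p>"
themes = extract_themes(page)
print(themes.most_common(3))
```

This naive rule ignores sentence boundaries and part-of-speech information, so it keeps noisy n-grams that a tagger-based filter would discard.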

The talk will focus on the following topics :

###2nd part : Network visualization : Gephi demonstration

  • speaker : Julian

Cablegate-semnet uses a fairly naive automatic selection of text themes and produces a network of thousands of nodes that contains some noise. In addition, the presence of two types of nodes implies three types of edges, so we can expect a dense graph. The result is a weighted network quite rich in information, and the aim of this second part is to demonstrate Gephi's features for network post-processing, with a focus on :

  • How to import a network data file
  • Overview of basic visualization features
  • How to remove meaningless content using the data table, sorting and filtering
  • How to highlight meaningful elements using cluster detection, ranking and coloring
  • How to customize the graph appearance, and export the map to PDF and the web
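For the import step, one simple option is a CSV edge list, which Gephi's spreadsheet importer can read when the columns are named `Source`, `Target` and `Weight`. The sketch below is only an assumption about how such a file could be produced: the sample edges and the function name are hypothetical, and this is not necessarily the export format used by cablegate-semnet.

```python
import csv
import io

# Hypothetical weighted edges: (source node, target node, weight).
sample_edges = [
    ("cable_a", "cable_b", 0.5),
    ("sanctions", "security council", 2),
    ("cable_a", "sanctions", 1),
]


def write_edge_list(edges, handle):
    """Write edges as CSV with the header row Source,Target,Weight."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for source, target, weight in edges:
        writer.writerow([source, target, weight])


# Write to an in-memory buffer here; pass an open file object to produce a file.
buffer = io.StringIO()
write_edge_list(sample_edges, buffer)
print(buffer.getvalue())
```

A plain edge list carries no node attributes, so distinguishing cable nodes from thematic nodes in Gephi would require a separate node table or a richer format.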
