marcomorucci/Clustering-Constitutions

This is the code for an exploratory analysis of language similarities between constitutions. The analysis is performed in Python using nltk and scikit-learn; a full list of dependencies is included below.

Results from the sample run shown in the paper are in the sample_run folder. They include datasets and tables created by the script.

Data

All datasets used and created are included in the repo. The constitutions of 192 countries in plaintext format are also included. These were downloaded from Constitute and can be downloaded again using the download.py script.

Other datasets used are:

Freedom House Country Ratings and Status, 1973-2014

State Fragility Index and Matrix

Latent Judicial Independence Around the Globe, 1948-2010

Please use the versions included here to run the analysis, as they were slightly modified so that the script can read them. If the analysis is run on freshly downloaded versions of these datasets, the script will not read the scores for all countries, and the program will most likely hang and throw errors.

Instructions

To begin, change into the src folder and run:

python main.py

If you prefer to run the module from a Python shell, open the shell in the src directory and run:

>>> import main
>>> data = main.run_analysis()

The results of the analysis will be stored in the data object returned by the run_analysis function.

Output

If the script is run with all the default settings, it should output four CSV tables:

  • clusters.csv: A dataset of all the countries with associated clusters and dependent variables
  • desc_stat.csv: A table of descriptive statistics for the clusters created.
  • tf_idf.csv: A dataset with the tf-idf frequencies for all words and all constitutions.
  • top_words.csv: A dataset containing the 20 most used words in each cluster.
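For readers unfamiliar with tf-idf, the scores stored in tf_idf.csv weight a word by how often it appears in one constitution and how rare it is across all of them. The sketch below is only an illustration in pure Python: the function name tf_idf and the toy sentences are hypothetical, and it uses the textbook formula tf * log(N / df). This README does not say which weighting variant the script uses, and scikit-learn's default adds smoothing and normalisation, so the numbers in tf_idf.csv will not match this sketch.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Textbook weighting: (term count / document length) * log(N / df),
    where df is the number of documents that contain the term.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = float(len(tokens))
        scores.append({
            term: (count / total) * math.log(n_docs / float(doc_freq[term]))
            for term, count in counts.items()
        })
    return scores


# Toy stand-ins for constitution texts (not taken from the real corpus).
docs = [
    "the president shall appoint the judges",
    "the parliament shall elect the president",
    "the king shall appoint the ministers",
]
for row in tf_idf(docs):
    # Show the two highest-scoring terms of each document.
    print(sorted(row.items(), key=lambda kv: -kv[1])[:2])
```

Words that occur in every document, such as "the" and "shall" here, score zero, which is why tf-idf is useful for finding the vocabulary that distinguishes one group of texts from another.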

The script also outputs one txt file called regression_results.txt, which contains the results generated by the OLS regressions.

The default settings should also output five graphs:

  • FH.png: Boxplot of clusters vs. freedom house scores.
  • LJI.png: Boxplot of clusters vs. latent judicial independence.
  • SFI.png: Boxplot of clusters vs. state fragility.
  • cluster_map.png: Map with each country colored according to the cluster it's in.
  • top_words_hist.png: Histogram of the frequency of the most used words in each cluster.
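After a default run, a quick way to confirm that everything was written is to check for the file names listed above. This is a minimal sketch, not part of the repository: the helper missing_outputs is hypothetical, and it assumes all outputs land in a single directory, which this README does not state.

```python
import os

# File names as listed in this README.
EXPECTED_OUTPUTS = [
    "clusters.csv", "desc_stat.csv", "tf_idf.csv", "top_words.csv",
    "regression_results.txt",
    "FH.png", "LJI.png", "SFI.png", "cluster_map.png", "top_words_hist.png",
]


def missing_outputs(output_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are not present in output_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(output_dir, name))]


# Assumption: the run wrote its files to the current directory.
missing = missing_outputs(".")
print("All outputs found." if not missing else "Missing: " + ", ".join(missing))
```

Pass a different directory to missing_outputs if the files are written elsewhere.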

Dataset Object

If the script is run in a python shell, the run_analysis function will return a dataset object through which all the tables used for the analysis will be accessible. If the analysis is run like so:

>>> data = main.run_analysis()  # pass any arguments here

then the tables will be accessible from the data object.

>>> # Raw frequency table for each word in the corpus
>>> data.df 
>>> # Table containing clusters and dependent variables for each country
>>> data.cdb 
>>> # Table of descriptive statistics for each cluster and each dependent variable
>>> data.descStat
>>> # Tf-idf scores for each term in the corpus 
>>> data.tf_idf
>>> # Most used words in each cluster
>>> data.topWords
>>> # Reprint regression results
>>> data.regressionResults()
>>> # Show plots
>>> data.showPlots()

Note:

Running the analysis takes time, especially building the top words table. If you'd like to avoid that step, you can perform the analysis manually by calling each function in the order listed in run_analysis().

Dependencies:

All these packages have to be installed for the main analysis code to work.

Numpy

Scipy

Nltk

Scikit-learn

Pandas

Statsmodels

Xlrd

Optional:

Progressbar

BeautifulSoup

Requests
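Before running the analysis, it can help to check which of these packages are importable. The sketch below is an illustration only: the helper missing_modules is hypothetical, and the import names are assumptions (scikit-learn imports as sklearn, and BeautifulSoup is assumed to be the bs4 package, although an older release may import under a different name).

```python
import importlib

# Import names are assumptions: scikit-learn imports as "sklearn", and
# BeautifulSoup is assumed to be the bs4 package.
REQUIRED = ["numpy", "scipy", "nltk", "sklearn", "pandas", "statsmodels", "xlrd"]
OPTIONAL = ["progressbar", "bs4", "requests"]


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


print("Missing required packages:", missing_modules(REQUIRED))
print("Missing optional packages:", missing_modules(OPTIONAL))
```

An empty list for the required packages means the main analysis code should at least be able to import everything it needs.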

TODO

  • Compare cluster membership with the characteristics of constitutions recorded in the Comparative Constitutions Project database, to understand whether the clustering has captured any of those characteristics and, if so, which ones.
  • Code should be cleaned up and properly documented.

Version

The sample run was performed on Python 2.7.5.

About

An analysis of similarities between the world's constitutions done through k-means clustering and NLP.
