README

This is the code for an exploratory analysis of language similarities between constitutions. The analysis is performed in Python using nltk and scikit-learn; a full list of dependencies is included below.

Results from the sample run shown in the paper are in the sample_run folder. They include datasets and tables created by the script.

Data

All datasets used and created are included in the repo, along with the constitutions of 192 countries in plaintext format. The constitutions were downloaded from Constitute and can be downloaded again using the download.py script.

Other datasets used are:

Freedom House Country Ratings and Status, 1973-2014

State Fragility Index and Matrix

Latent Judicial Independence Around the Globe, 1948-2010

Please use the versions included here to run the analysis, as they were slightly modified so that the script can read them. If the analysis is run on freshly downloaded versions of these datasets, the script will fail to read the scores for some countries and will most likely hang or raise errors.
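To see why unmodified datasets cause trouble, the sketch below checks which countries with a constitution have no matching row in a scores table. It is only an illustration: the function name `find_unmatched`, the `country` column name, and the toy rows are assumptions made for this example and are not taken from the repository's code or data.

```python
def find_unmatched(constitution_countries, score_rows, name_key="country"):
    """Return countries that have a constitution but no row in the scores table.

    Names are compared case-insensitively after stripping whitespace.
    """
    scored = {row[name_key].strip().lower() for row in score_rows}
    return sorted(
        c for c in constitution_countries if c.strip().lower() not in scored
    )


# Toy data, made up for illustration only (not real scores)
constitutions = ["Italy", "France", "Cote d'Ivoire"]
scores = [
    {"country": "Italy", "score": "1"},
    {"country": "France", "score": "1"},
    {"country": "Ivory Coast", "score": "5"},
]

unmatched = find_unmatched(constitutions, scores)
print(unmatched)
```

A country whose name is spelled differently in the two sources shows up in the returned list, which is the kind of mismatch the modified datasets are meant to avoid.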

Instructions

To begin, change into the src folder and run:

python main.py

If you prefer to run the module from a Python shell, start the shell in the src directory and run:

>>> import main
>>> data = main.run_analysis()

The results of the analysis will be stored in the data object returned by the run_analysis function.

Output

If the script is run with all the default settings, it should output 4 csv tables:

clusters.csv: A dataset of all the countries with associated clusters and dependent variables
desc_stat.csv: A table of descriptive statistics for the clusters created.
tf_idf.csv: A dataset with the tf-idf frequencies for all words and all constitutions.
top_words.csv: A dataset containing the 20 most used words in each cluster.
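For readers unfamiliar with tf-idf, the sketch below computes one common variant (term frequency times the log of inverse document frequency) using only the standard library. It is not the repository's implementation: the exact weighting used to build tf_idf.csv is not stated here, scikit-learn's version applies smoothing and normalization by default, and the function name `tf_idf` and the two toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute tf-idf weights.

    documents: dict mapping a document name to its list of tokens.
    Returns a dict mapping each name to a {term: weight} dict, where
    weight = (count / document length) * log(n_docs / document frequency).
    """
    n_docs = len(documents)

    # Document frequency: number of documents that contain each term
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    weights = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = float(len(tokens))
        weights[name] = {
            term: (count / total) * math.log(n_docs / float(doc_freq[term]))
            for term, count in counts.items()
        }
    return weights


# Toy corpus, made up for illustration only
docs = {
    "A": "the people shall elect the president".split(),
    "B": "the king shall appoint the ministers".split(),
}
scores = tf_idf(docs)
```

In this variant, a term that appears in every document (such as "the" in the toy corpus) gets a weight of zero, while terms specific to one document get positive weights.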

The script also outputs one txt file called regression_results.txt, which contains the results generated by the OLS regressions.

The default settings should also output 5 graphs:

FH.png: Boxplot of clusters vs. Freedom House scores.
LJI.png: Boxplot of clusters vs. latent judicial independence.
SFI.png: Boxplot of clusters vs. state fragility.
cluster_map.png: Map with each country colored according to the cluster it's in.
top_words_hist.png: Histogram of the frequency of the most used words in each cluster.
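Each boxplot above groups one dependent variable by cluster. As a rough sketch of that grouping step, the code below collects values per cluster and computes a few summary numbers similar to what a boxplot displays. The function name `summarize_by_cluster`, the column names `cluster` and `FH`, and the toy rows are assumptions for illustration; they are not taken from the repository's code or from clusters.csv.

```python
from collections import defaultdict


def summarize_by_cluster(rows, value_key, cluster_key="cluster"):
    """Group numeric values by cluster and summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_key]].append(float(row[value_key]))

    summary = {}
    for cluster, values in groups.items():
        values.sort()
        n = len(values)
        mid = n // 2
        # Median: middle value, or mean of the two middle values
        if n % 2:
            median = values[mid]
        else:
            median = (values[mid - 1] + values[mid]) / 2.0
        summary[cluster] = {
            "n": n,
            "min": values[0],
            "median": median,
            "max": values[-1],
            "mean": sum(values) / n,
        }
    return summary


# Toy rows, made up for illustration only (not real scores)
rows = [
    {"cluster": 0, "FH": 1},
    {"cluster": 0, "FH": 3},
    {"cluster": 0, "FH": 2},
    {"cluster": 1, "FH": 4},
    {"cluster": 1, "FH": 6},
]
summary = summarize_by_cluster(rows, "FH")
```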

Dataset Object

If the script is run in a Python shell, the run_analysis function returns a dataset object through which all the tables used for the analysis are accessible. If the analysis is run like so:

>>> data = main.run_analysis()  # pass any optional arguments here

then the tables will be accessible from the data object.

>>> # Raw frequency table for each word in the corpus
>>> data.df 
>>> # Table containing clusters and dependent variables for each country
>>> data.cdb 
>>> # Table of descriptive statistics for each cluster and each dependent variable
>>> data.descStat
>>> # Tf-idf scores for each term in the corpus 
>>> data.tf_idf
>>> # Most used words in each cluster
>>> data.topWords
>>> # Reprint regression results
>>> data.regressionResults()
>>> # Show plots
>>> data.showPlots()

Note:

Running the analysis takes time, especially building the top words table. If you'd like to avoid that step, you can perform the analysis manually by calling each function in the order listed in run_analysis().

Dependencies:

All these packages have to be installed for the main analysis code to work.

Optional:

Progressbar

BeautifulSoup

Requests

TODO

Compare cluster membership with the characteristics of constitutions collected in the Comparative Constitutions Project database, to understand whether the clustering has captured any of those characteristics, and which ones.
Clean up and properly document the code.

Version

The sample run was performed on Python 2.7.5.

marcomorucci/Clustering-Constitutions
