Analysis-StackExchange

This is all the Python code you need to mine the complete StackExchange data dump. The data covers all of StackExchange up to September 2011 and is released under a Creative Commons license. I used it for a course project that is not even remotely commercial.

You can download the data over HTTP, which is what I did, or you can use torrents. The data is quite large, so I am uploading the smallest XML files to get you started while your download runs ;)

StackExchange Story

The data comes as large XML files. Some questions you can answer:

how many users there are
how badges are distributed
how many producers and consumers there are
which tags are popular

and so on.

There was a slightly higher purpose behind this project: I wanted to find out what kinds of patterns are visible in contributions. Those details can be found in the blog posts here: http://www.rohitdholakia.com/blog/categories/stackoverflow/

Getting Started

I have added the data for the Bicycles StackExchange site in the Bicycles folder. Let's find out how many users it has. Earlier, I had a separate script for each question. Now, to get an overview, you can run Summary.py:

	Rohits-MacBook-Pro:Analysis-StackExchange rohitdholakia$ python Scripts/Summary.py Bicycles/
	 yo !   num users and epic users are  2002 	0
	num questions and answers and accepted answers are  1208 	4396 	862
	 and finally, num famous in this are  0

The script is very simple, but it uses a couple of techniques that recur throughout this code. Note that we are doing this work on a 4 GB MacBook Pro with a modest processor, nothing fancy. Hence, with XML files this gigantic, keeping them entirely in memory is not an option. That is where iterparse and clearElem() come in: we parse the file incrementally and, once we are done with an element, we remove it and all its parents. The parents part becomes important if you are dealing with highly nested XML files.
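To make the memory point concrete, here is a minimal sketch of the iterparse-and-clear pattern. It is not the code from Summary.py, and it does not reproduce clearElem(): it uses the standard library's `xml.etree.ElementTree` and clears the root element after each row instead. The helper name `iter_rows`, the `row` tag, and the `Id` and `Reputation` attributes in the sample are assumptions made for illustration. The sketch is written for Python 3.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, tag="row"):
    """Yield the attributes of each <tag> element, one at a time.

    Hypothetical helper: the tag name "row" is an assumption about
    how the dump files are laid out.
    """
    # Request "start" events as well so the root element can be captured.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Copy the attributes before the element is discarded.
            yield dict(elem.attrib)
            # Drop everything parsed so far; memory use stays flat.
            root.clear()


# A tiny in-memory stand-in for a users file (made-up values).
sample = io.BytesIO(
    b"<users>"
    b'<row Id="1" Reputation="101" />'
    b'<row Id="2" Reputation="1" />'
    b'<row Id="3" Reputation="2500" />'
    b"</users>"
)

num_users = sum(1 for _ in iter_rows(sample))
print("num users:", num_users)
```

Passing a file path instead of the in-memory sample works the same way, since `ET.iterparse` accepts either a filename or a file object. Clearing the root is enough for flat files; for deeply nested XML, each ancestor of the finished element would need the same treatment.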
