## Steps to generate the data, analytical plots, and perform classification tasks
Skip steps 1-4 if you do not intend to generate and clean the data.
Warning: without tuning the tables for performance, generating the data can take several days.
- Download the 2008-2012 data from http://blog.stackoverflow.com/category/cc-wiki-dump/
- Convert all the XML files into SQL using the instructions at http://terokarvinen.com/2012/reading-stackoverflow-xml-dump-to-mysql-database
- Install MySQL, create a database, and load the converted data into it
- Go to churn/sql and execute the SQL files in this order: a) churn_create.sql, b) procedures.sql, c) all remaining files except churn_time_update_features.sql, d) churn_time_update_features.sql
- Execute get_classifier_data.sh
- To generate plots, go to the churn/plot_draw directory and run the corresponding Python script. Each script is named after the feature it plots (feat_feature_name.py)
- To get classification results based on the number of posts (k), run python churn_classify_driver_K.py
- To get classification results based on the number of observation days (T), run python churn_classify_driver_T.py
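The XML-to-MySQL conversion in step 2 can be sketched as below. The `<row .../>` element layout follows the standard Stack Exchange dump format, but the file contents, table name, and attribute subset here are illustrative assumptions, not taken from the linked instructions; a real loader should also use parameterized queries rather than string formatting.

```python
import xml.etree.ElementTree as ET

# Assumed shape of a dump file: one <row .../> element per record,
# attributes named after the target table's columns. The real dump
# (e.g. Posts.xml) carries many more attributes than shown here.
SAMPLE_DUMP = """<posts>
  <row Id="1" PostTypeId="1" CreationDate="2008-07-31T21:42:52" Score="10"/>
  <row Id="2" PostTypeId="2" CreationDate="2008-07-31T22:17:57" Score="5"/>
</posts>"""

def dump_rows(xml_text):
    """Yield one dict per <row> element, keyed by attribute name."""
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        yield dict(row.attrib)

def to_insert(table, row):
    """Render a row as a naive INSERT statement (sketch only; use
    parameterized queries in real code to avoid SQL injection)."""
    cols = ", ".join(row)
    vals = ", ".join("'%s'" % v for v in row.values())
    return "INSERT INTO %s (%s) VALUES (%s);" % (table, cols, vals)

statements = [to_insert("posts", r) for r in dump_rows(SAMPLE_DUMP)]
for s in statements:
    print(s)
```

For the full dump, `ET.iterparse` with element clearing is preferable to `fromstring`, since the 2008-2012 XML files are far too large to hold in memory at once.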
Note: Classification tasks use 10-fold nested cross-validation, with grid search to find optimal hyper-parameters. The driver scripts iterate over four classifiers: linear SVM, SVM with RBF kernel, logistic regression, and a decision tree classifier.
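The nested cross-validation setup described in the note can be sketched with scikit-learn as follows. The synthetic data, the hyper-parameter grids, and the reduced fold counts (3 inner / 5 outer, for speed) are illustrative assumptions; the actual driver scripts use their own features and 10 folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in feature matrix; the real scripts load features exported
# by get_classifier_data.sh.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# One (estimator, hyper-parameter grid) pair per classifier named above.
models = {
    "linear_svm": (SVC(kernel="linear"), {"C": [0.1, 1]}),
    "rbf_svm": (SVC(kernel="rbf"), {"C": [0.1, 1], "gamma": [0.01, 0.1]}),
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1]}),
    "dtree": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5]}),
}

scores = {}
for name, (est, grid) in models.items():
    # Inner loop: grid search selects hyper-parameters.
    inner = GridSearchCV(est, grid, cv=3)
    # Outer loop: unbiased accuracy estimate of the tuned model.
    scores[name] = cross_val_score(inner, X, y, cv=5).mean()

for name, acc in sorted(scores.items()):
    print("%s: %.3f" % (name, acc))
```

Wrapping `GridSearchCV` inside `cross_val_score` is what makes the procedure nested: hyper-parameters are chosen only on inner-fold training data, so the outer-fold score never sees tuning leakage.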