## Steps to generate the data, analytical plots, and perform classification tasks
Skip steps 1-4 if you do not intend to generate and clean the data.
Warning: without tuning the tables for performance, generating the data can take several days.
- Download the 2008-2012 data from http://blog.stackoverflow.com/category/cc-wiki-dump/
- Convert all the XML files into SQL using the instructions at http://terokarvinen.com/2012/reading-stackoverflow-xml-dump-to-mysql-database
- Install MySQL, create a database, and load the converted data into it
- Go to churn/sql and execute the SQL files in this order: a) churn_create.sql, b) procedures.sql, c) all remaining files except churn_time_update_features.sql, d) churn_time_update_features.sql
- Execute get_classifier_data.sh
- To generate plots, go to the churn/plot_draw directory and run the corresponding Python script. Each script is named after the feature it plots (feat_feature_name.py)
- To get classification results based on the number of posts (k), run python churn_classify_driver_K.py
- To get classification results based on the number of observation days (T), run python churn_classify_driver_T.py
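The XML-to-MySQL conversion in step 2 can be sketched as below. The `<row .../>` element layout follows the standard Stack Exchange dump format, but the file contents, table name, and attribute subset here are illustrative assumptions, not taken from the linked instructions; a real loader should also use parameterized queries rather than string formatting.

```python
import xml.etree.ElementTree as ET

# Assumed shape of a dump file: one <row .../> element per record,
# attributes named after the target table's columns. The real dump
# (e.g. Posts.xml) carries many more attributes than shown here.
SAMPLE_DUMP = """<posts>
  <row Id="1" PostTypeId="1" CreationDate="2008-07-31T21:42:52" Score="10"/>
  <row Id="2" PostTypeId="2" CreationDate="2008-07-31T22:17:57" Score="5"/>
</posts>"""

def dump_rows(xml_text):
    """Yield one dict per <row> element, keyed by attribute name."""
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        yield dict(row.attrib)

def to_insert(table, row):
    """Render a row as a naive INSERT statement (sketch only; use
    parameterized queries in real code to avoid SQL injection)."""
    cols = ", ".join(row)
    vals = ", ".join("'%s'" % v for v in row.values())
    return "INSERT INTO %s (%s) VALUES (%s);" % (table, cols, vals)

statements = [to_insert("posts", r) for r in dump_rows(SAMPLE_DUMP)]
for s in statements:
    print(s)
```

For the full dump, `ET.iterparse` with element clearing is preferable to `fromstring`, since the 2008-2012 XML files are far too large to hold in memory at once.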
Note: Classification tasks use 10-fold nested cross-validation, with grid search to find optimal hyper-parameters. The driver scripts iterate over four classifiers: linear SVM, SVM with RBF kernel, logistic regression, and a decision tree classifier.
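The nested cross-validation setup described in the note can be sketched with scikit-learn as follows. The synthetic data, the hyper-parameter grids, and the reduced fold counts (3 inner / 5 outer, for speed) are illustrative assumptions; the actual driver scripts use their own features and 10 folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in feature matrix; the real scripts load features exported
# by get_classifier_data.sh.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# One (estimator, hyper-parameter grid) pair per classifier named above.
models = {
    "linear_svm": (SVC(kernel="linear"), {"C": [0.1, 1]}),
    "rbf_svm": (SVC(kernel="rbf"), {"C": [0.1, 1], "gamma": [0.01, 0.1]}),
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1]}),
    "dtree": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5]}),
}

scores = {}
for name, (est, grid) in models.items():
    # Inner loop: grid search selects hyper-parameters.
    inner = GridSearchCV(est, grid, cv=3)
    # Outer loop: unbiased accuracy estimate of the tuned model.
    scores[name] = cross_val_score(inner, X, y, cv=5).mean()

for name, acc in sorted(scores.items()):
    print("%s: %.3f" % (name, acc))
```

Wrapping `GridSearchCV` inside `cross_val_score` is what makes the procedure nested: hyper-parameters are chosen only on inner-fold training data, so the outer-fold score never sees tuning leakage.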