My Capstone project for Galvanize's Data Science Immersive course.
Objective: identify the leading factors of conversion and churn among HaikuDeck users.
For this project, I had access to HaikuDeck's Stripe (payment processor) and Mixpanel (user analytics) accounts, which provided payment/subscription events and in-app events.
I fetched data from their respective APIs and stored it in DynamoDB. From there, I normalized the data and loaded it into a Postgres database hosted on RDS.
In total, I fetched and stored data on approximately 16,000 customers from Stripe, and roughly 7,500,000 events across 300,000 distinct profiles from Mixpanel.
Note: Data pipeline code can be found in /src/pipeline.py
Note: API keys are stored in environment variables and are not committed to this repo
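A minimal sketch of the normalization step between DynamoDB and Postgres, using Mixpanel's raw export record shape (`{"event": ..., "properties": {...}}` with a Unix `time` field); the target column names here are illustrative, not the actual pipeline schema:

```python
from datetime import datetime, timezone

def normalize_event(raw):
    """Flatten a raw Mixpanel export record into a row dict for Postgres.

    Mixpanel's export API nests the user id and timestamp under
    "properties"; the output keys below are illustrative column names.
    """
    props = raw.get("properties", {})
    return {
        "distinct_id": props.get("distinct_id"),
        "event_name": raw.get("event"),
        "occurred_at": datetime.fromtimestamp(props["time"], tz=timezone.utc),
    }

raw = {"event": "Deck Created",
       "properties": {"distinct_id": "u-123", "time": 1451606400}}
row = normalize_event(raw)  # flat dict, ready for an INSERT
```

The real pipeline in /src/pipeline.py additionally handles batching and the API pagination; this only shows the shape of the transform.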
For this project, I created two Gradient Boosting Classifiers, one for conversion and one for churn. Training datasets were assembled with SQL queries over the collected data, aggregating counts of the different event types per user. To tune the models, I ran a parallelized grid search over key hyperparameters, using 3-fold cross-validation, on a compute-optimized EC2 instance.
Note: Models and data queries are found in the /src/modeling folder
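The fit-and-tune step can be sketched as follows; the feature matrix and hyperparameter grid below are toy stand-ins, not the actual features or values used in the project:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the per-user event-count features pulled via SQL
rng = np.random.RandomState(0)
X = rng.poisson(3, size=(200, 5))                      # counts of 5 event types
y = (X[:, 0] + rng.normal(size=200) > 3).astype(int)   # e.g. converted or not

# Illustrative grid; n_jobs=-1 parallelizes across the instance's cores
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                      cv=3, n_jobs=-1, scoring="roc_auc")
search.fit(X, y)
```

`search.best_estimator_` then holds the tuned classifier; `n_jobs=-1` is what makes the grid search worth running on a compute-optimized instance.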
Evaluation can be found in the Scoring Notebook, where I scored both models, plotted ROC curves, extracted feature importances, and made partial dependence plots for each model.
From the feature importances, I identified the features that contributed most to conversion and churn.
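A sketch of the kind of scoring done in the notebook, on toy data; the real notebook runs against the SQL-derived features and both fitted models:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Toy data: the label depends only on feature 1, so its importance dominates
rng = np.random.RandomState(0)
X = rng.poisson(3, size=(300, 4))
y = (X[:, 1] > 2).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier().fit(X_tr, y_tr)

# ROC curve points and AUC from held-out probability scores
scores = model.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)

# Rank features by importance (descending)
order = np.argsort(model.feature_importances_)[::-1]
```

`fpr`/`tpr` are what get plotted as the ROC curve, and `order` gives the features that matter most, which is the basis for the conversion/churn conclusions.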
For deployment, I created a dashboard that displays predictions for current users generated by the current model.
I built it with Flask and React to query predictions that were previously generated and stored in a separate table in my SQL database.
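A minimal sketch of how such a Flask endpoint could look; the route, field names, and in-memory table below are illustrative stand-ins for the real app's query against the stored-predictions table:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the predictions table; the real app runs a SQL query
# against the separate predictions table in Postgres instead.
PREDICTIONS = {"u-123": {"churn_probability": 0.12},
               "u-456": {"churn_probability": 0.87}}

@app.route("/api/predictions/<user_id>")
def get_prediction(user_id):
    """Return the stored prediction for one user as JSON."""
    pred = PREDICTIONS.get(user_id)
    if pred is None:
        return jsonify(error="unknown user"), 404
    return jsonify(pred)
```

The React front end then only needs to fetch JSON from endpoints like this; no model code runs at request time, since predictions are precomputed.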
To run:
- Build the front-end assets:
  `webpack`
- Start the Flask server:
  `python app/__init__.py`

Then visit http://127.0.0.1:5000/ to view the dashboard locally.