Supervised Machine Learning for Click Fraud Detection

Shan Dou
MLND capstone project
July 2018

Link to proposal review: https://review.udacity.com/#!/reviews/1314525

1. Data source:

Kaggle competion "TalkingData AdTracking Fraud Detection Challenge".

2. Create conda environment with all dependencies

conda env create -f environment.yml

This operation will create a conda environment named mlnd_clean. If you wish to use a different name, please open the requirement file environment.yml and change the first line name: mlnd_clean into your preferred name.

3. Activate and deactivate conda environment

Once all the dependencies are installed, please run the following command in your shell terminal to activate the environment

source activate mlnd_clean

To deactivate, type

source deactivate

4. Key modules and availabilities

The following modules are installed with conda

1. numpy
2. pandas 
3. seaborn
4. sklearn
5. xgboost
6. lightgbm
7. imblearn
8. notebook

Module for stack ensemble is installed with pip:

9. mlens

For more information about mlens, please visit its webiste.

5. Key components of the project

Jupyter notebooks:
- MLNDcapstone_shandou_main.ipynb: Main workbook
- MLNDcapstone_shandou_robustness.ipynb: Companion workbook for models' robustness texts
Python models in ./customlib/:
- ./customlib/preprocessing.py: data processing
- ./customlib/modeling.py: modeling
- ./customlib/utils.py: miscellaneous tasks such as visualization and generating result summary tables
Dataset:
The raw data train.csv can be directly download from Kaggle. Out of file size concerns, only downsized training data and the original testing data are included in this repo.
- train_sample.csv: 0.1% of the raw click records
- train_sample_2.csv: 0.2% of the raw click records
- test.csv: First 10 lines of the orignal test data downloaded from Kaggle. NOTE that test.csv provided by Kaggle is only used for checking data fields. In the actual implementation, testing data is instead a portion of train_sample.csv or train_sample_2.csv
Proposals and reports:
- proposal.pdf: Proposal of the capstone project
- proposal_review.pdf: Comments from proposal review
- report.pdf: Report of the capstone project
Others:
- folder images/: contains all the images used in the report
- matplotlib style sheet stylelib/custom.mplstyle: dataviz styler used throughout this project

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
.ipynb_checkpoints		.ipynb_checkpoints
AI		AI
__pycache__		__pycache__
customlib		customlib
images		images
papers		papers
proposal		proposal
stylelib		stylelib
submission		submission
.DS_Store		.DS_Store
.gitignore		.gitignore
MLNDcapstone_eda.ipynb		MLNDcapstone_eda.ipynb
MLNDcapstone_report_draft.pdf		MLNDcapstone_report_draft.pdf
MLNDcapstone_report_final.pdf		MLNDcapstone_report_final.pdf
MLNDcapstone_shandou_main.ipynb		MLNDcapstone_shandou_main.ipynb
MLNDcapstone_shandou_robustness.ipynb		MLNDcapstone_shandou_robustness.ipynb
README.md		README.md
capstone_report_rubrics.pdf		capstone_report_rubrics.pdf
environment.yml		environment.yml
proposal_review.pdf		proposal_review.pdf
report.pdf		report.pdf
submission.zip		submission.zip

shandou/MLND_capstone

Folders and files

Latest commit

History

Repository files navigation

Supervised Machine Learning for Click Fraud Detection

1. Data source:

2. Create conda environment with all dependencies

3. Activate and deactivate conda environment

4. Key modules and availabilities

5. Key components of the project

About

Topics

Resources

Stars

Watchers

Forks

Languages