Final project for CS360

Data- Use this website to download the dataset

https://www.kaggle.com/mlg-ulb/creditcardfraud#creditcard.csv

Roadmap

To test adaboost with different thershold, run python3 adaboost.py -d creditcard.csv -n 10 -t 0.5
To test adaboost with different upsample rate, run python3 adaboost.py -d creditcard.csv -s 1 -n 10
To test fc_nn with different upsample rate, run python3 fc.py -d creditcard.csv -s 1 -n 10
To test svm with different upsample rate, run python3 svm.py -d creditcard.csv -s 1 -n 10
To test any model with just an upsample ratio but not a range just use "-n". For example, python3 svm.py -d creditcard.csv -n 10
The upsample range will be [1, n]
To automatically generate curve on three models, run python3 run_pipeline.py -d creditcard.csv -s 1 -n 10

Lab Notebook

Development Note:

2019/12/04 (Jiaping)

Add data parsing, tested on temp.csv (a snapshot of original file)
Initialization of fully connected network with two dense layers

2019/12/05 (Jiaping)

All data points are detected as no-fraud, not acceptable
Normalize data and delete one feature
NN reaches a true positive of 65%
Will start working on using SVM, Adaboost

2019/12/07 (Jiaping)

Initialization of Adaboost, havent used different threshold yet
Rather than using only 100,000 data, models are trained on the whole dataset this time
Both Adaboost and nn reach a true positive rate ranging from 70 to 75%
No idea why we can reach this result, need further investigation

2019/12/08 (Andy)

Note that at this point dataset has not been upsampled yet - (meaning repopulate fraud cases to reach 4k cases). In other words, we still use the original data here
Create SVM
Figuring out the params to be used for SVM

2019/12/09 (Andy)

Figure out & clean up correct params for SVM
Decided to run SVM twice - one with gammaa=1 and one without to see the difference in accuracy & runtime (Concluded that we need gamma=1)
Investigated the built-in Roc curve functions for our use

2019/12/09 (Jiaping)

Add upsample function to manually increase weight on examples with true label
Performance gets worse on adaboost with a ratio of 0.3
Add roc curve

2019/12/10 (Andy)

Get rid of kfolds for SVM & still run single SVM multiple times to finish collecting results without upsampling

2019/12/10 (Jiaping)

Add another type of upsample, this time we only increase the number of true class by n times
Fix the issue with upsample
Adaboost reaches a recall rate of 0.85 when upsampling true class by 10 times
NN reaches a recall rate of 0.798 when upsampling true class by 10 times
Plan to enable adjusting threshold for Adaboost
Plan to make roc curve on models with different upsample ratio

2019/12/12 (Andy)

Clean up the util.py to make some functions more understandable & Added comments to each method to make it more readable for future readers
Update SVM to accommodate upsamples
Will run SVM with 10X upsample to see the difference in performance
will create a results section in README that contains the running results

2019/12/13 (Jiaping)

Revise comment style
Modify parser and right now we can run adaboost with a threshold range
Users can now run models with a range of upsample rate
Start to produce final result and plan to delete intermediate code

2019/12/14 (Andy)

change the type for '--upsamplen' from 'float' to 'int'
Modified SVM to accomodate new parsing features
Added plot_recall_upsample_curve funciton
Get all results from running SVM
Finish ploting all models on the recall_upsample curve
create generate_curves.py for creating recall_upsample curve
all results updated. Ready to start working on presentation slides tomorrow

2019/12/16 (Andy)

Add roc curve & precision recall curves to SVM

2019/12/16 (Jiaping)

Get rid of hard code style file (generate_cruve.py)
fcnn and adaboost are now able to automatically generate graph

2019/12/16 (Andy)

Run all files to get output graphs
Mix last bit of stuff on all files

Model Output Results:

SVM:

#running SVM with 1-10 upsample

#upsample=1, recall=0.78 [[85265 11] [ 36 131]]

#upsample=2, recall=0.82 [[85268 11] [ 56 255]]

#upsample=3, recall=0.81 [[85306 7] [ 81 344]]

#upsample=4, recall=0.84 [[85292 12] [ 95 486]]

#upsample=5, recall=0.82 [[85290 12] [ 134 597]]

#upsample=6, recall=0.81 [[85288 13] [ 166 714]]

#upsample=7, recall=0.82 [[85300 11] [ 186 831]]

#upsample=8, recall=0.85 [[85271 24] [ 174 1007]]

#upsample=9, recall=0.84 [[85329 23] [ 206 1065]]

#upsample=10, recall=0.84 - 1.53% [[85279 31] [ 233 1228]]

#=== #upsample=20, recall=0.87 - 3.17% [[85292 56] [ 383 2516]]

#positive case - 4.77% #upsample=30, recall=0.88 [[85293 58] [ 541 3831]]

#10% #upsample=60, recall=0.9 [[85224 103] [ 888 7936]]

adaboost:

Upsample range from 1 time to times: [[85293 13] [ 36 101]] [[85290 18] [ 56 226]] [[85287 33] [ 76 342]] [[85310 40] [ 106 429]] [[85321 36] [ 123 553]] [[85299 49] [ 118 715]] [[85294 44] [ 145 845]] [[85264 50] [ 193 969]] [[85326 41] [ 208 1048]] [[85259 63] [ 214 1235]]

Unsample param = 10: 0.4 [[19782 65531] [ 0 1458]] 0.405 [[19982 65331] [ 0 1458]] 0.410 [[20020 65293] [ 0 1458]] 0.415 [[20026 65287] [ 0 1458]] 0.420 [[20026 65287] [ 0 1458]] 0.425 [[20026 65287] [ 0 1458]] 0.430 [[20028 65285] [ 0 1458]] 0.435 [[20033 65280] [ 0 1458]] 0.440 [[20064 65249] [ 0 1458]] 0.445 [[20226 65087] [ 0 1458]] 0.450 [[20796 64517] [ 0 1458]] 0.455 [[22545 62768] [ 0 1458]] 0.460 [[26687 58626] [ 0 1458]] 0.465 [[35198 50115] [ 0 1458]] 0.470 [[48167 37146] [ 4 1454]] 0.475 [[63503 21810] [ 4 1454]] 0.480 [[76080 9233] [ 4 1454]] 0.485 [[82029 3284] [ 4 1454]] 0.490 [[84409 904] [ 30 1428]] 0.495 [[85091 222] [ 134 1324]] 0.500 [[85247 66] [ 225 1233]] 0.505 [[85286 27] [ 317 1141]]

fcnn:

Upsample range from 1 time to times: [array([[85273, 11], [ 55, 104]]), array([[85269, 23], [ 63, 235]]), array([[85279, 34], [ 68, 357]]), array([[85300, 24], [ 108, 453]]), array([[85300, 29], [ 123, 581]]), array([[85289, 26], [ 137, 729]]), array([[85292, 40], [ 136, 860]]), array([[85253, 36], [ 223, 964]]), array([[85353, 21], [ 238, 1011]]), array([[85267, 42], [ 209, 1253]])]

Reference

Andy

https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c

https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/

sklearn.metrics Documentation

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
README.md		README.md
SVM_PR_noUp.png		SVM_PR_noUp.png
adaboost.py		adaboost.py
adaboost_roc.jpg		adaboost_roc.jpg
comparison.jpg		comparison.jpg
comparison2.png		comparison2.png
fc.jpg		fc.jpg
fc.py		fc.py
generate_curves.py		generate_curves.py
run_pipeline.py		run_pipeline.py
svm.py		svm.py
svm_PR_10UP.png		svm_PR_10UP.png
util.py		util.py

haverford-cs/project-jwang4-yhong1

Folders and files

Latest commit

History

Repository files navigation

Final project for CS360

Data- Use this website to download the dataset

Roadmap

Lab Notebook

Development Note:

2019/12/04 (Jiaping)

2019/12/05 (Jiaping)

2019/12/07 (Jiaping)

2019/12/08 (Andy)

2019/12/09 (Andy)

2019/12/09 (Jiaping)

2019/12/10 (Andy)

2019/12/10 (Jiaping)

2019/12/12 (Andy)

2019/12/13 (Jiaping)

2019/12/14 (Andy)

2019/12/16 (Andy)

2019/12/16 (Jiaping)

2019/12/16 (Andy)

Model Output Results:

SVM:

adaboost:

fcnn:

Reference

Andy

sklearn.metrics Documentation

About

Resources

Stars

Watchers

Forks

Languages