
This repo compares the performance of various anomaly detection algorithms on the 1999 KDDCUP dataset.

Thanhphan1147/kddcup_benchmark

Progress history

W1-2 : 19.04 - 03.05 : Initialize the document, training the algorithms with default parameters

W3-4 : 03.05 - 17.05 : Parameter tuning on different configurations

W5-6 : 17.05 - 31.05 : Increasing the number of anomalies

W7 : Finalizing

W3-4 : Parameter tuning and cross-validation search

Methodology (a minimal sketch follows this list):

  1. Train/test split the data with a test size of 0.25
  2. Run 3-fold cross-validation on the training data, so that each fold's held-out split (one third of the 75% training set) matches the 25% final test size
  3. Apply the best parameters to the test data
  4. Compare against the default parameters
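
A minimal sketch of these four steps, assuming a feature matrix X and labels y (+1 = normal, -1 = anomaly) are already loaded; the f1 scoring metric is an assumption, since the repo does not say which metric GridSearchCV optimized.

```python
# Minimal sketch of the methodology above (X, y assumed loaded).
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import OneClassSVM

# 1. Train/test split, test size = 0.25.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# 2. 3-fold CV on the training data: each fold holds out 1/3 of the
#    75% training split, i.e. 25% of the full data, matching the test size.
search = GridSearchCV(OneClassSVM(), {"nu": [0.045, 0.18]},
                      cv=3, scoring="f1")  # scoring is an assumption
search.fit(X_train, y_train)

# 3. Apply the best parameters to the test data.
tuned_f1 = search.score(X_test, y_test)

# 4. Compare with the default parameters.
default_pred = OneClassSVM().fit(X_train).predict(X_test)
print(tuned_f1, f1_score(y_test, default_pred))
```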

Training configuration per algorithm:

OneClass SVM (see the sketch after this list)

  Run 1:
    • Dataset : SF 10%
    • Anomaly rate : 4.5%
    • random_state : 1
    • nu : tuned using GridSearchCV [0.045, 0.18]
    • kernel : default
    • gamma : default

  Run 2:
    • Dataset : SF 10%
    • Anomaly rate : 4.5%
    • random_state : 2
    • nu : tuned using GridSearchCV [0.045, 0.18]
    • kernel : poly
    • gamma : default
    • degree : default

  Run 3:
    • Dataset : SF 10%
    • Anomaly rate : 4.5%
    • random_state : 3
    • nu : tuned using GridSearchCV [0.045, 0.18]
    • kernel : default
    • gamma : default
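
A hedged sketch of how these three runs could be driven in one loop. load_sf is a hypothetical loader standing in for however the repo materializes the SF 10% subset; the random_state is applied to the train/test split, since OneClassSVM itself takes no random_state parameter.

```python
# Hedged sketch of the three OneClass SVM runs listed above.
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import OneClassSVM

runs = [
    {"random_state": 1, "kernel": "rbf"},   # run 1: default kernel
    {"random_state": 2, "kernel": "poly"},  # run 2: poly, default degree
    {"random_state": 3, "kernel": "rbf"},   # run 3: default kernel
]

X, y = load_sf(size="10%")  # hypothetical loader, 4.5% anomaly rate
for run in runs:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=run["random_state"])
    search = GridSearchCV(
        OneClassSVM(kernel=run["kernel"]),  # gamma left at its default
        {"nu": [0.045, 0.18]}, cv=3, scoring="f1")
    search.fit(X_tr, y_tr)
    print(run, search.best_params_, search.score(X_te, y_te))
```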

Isolation Forest (see the sketch after this list)

  Run 1:
    • Dataset : SF 100%
    • Anomaly rate : 0.5%
    • random_state : 1
    • contamination : tuned using GridSearchCV [0.005, 0.2]
    • n_estimators : default
    • max_samples : default

  Run 2:
    • Dataset : SF 100%
    • Anomaly rate : 0.5%
    • random_state : 2
    • contamination : tuned using GridSearchCV [0.005, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 3:
    • Dataset : SF 100%
    • Anomaly rate : 0.5%
    • random_state : 3
    • contamination : tuned using GridSearchCV [0.005, 0.02]
    • n_estimators : default
    • max_samples : default
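
The same pattern covers the three Isolation Forest searches; unlike OneClassSVM, IsolationForest does accept a random_state, so it is set on the estimator as well. The loader and scoring metric are again assumptions.

```python
# Hedged sketch of the Isolation Forest grid searches listed above.
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_sf(size="100%")  # hypothetical loader, 0.5% anomaly rate
for seed, grid in [(1, [0.005, 0.2]), (2, [0.005, 0.02]), (3, [0.005, 0.02])]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    search = GridSearchCV(
        IsolationForest(random_state=seed),  # n_estimators, max_samples default
        {"contamination": grid}, cv=3, scoring="f1")
    search.fit(X_tr, y_tr)
    print(seed, search.best_params_, search.score(X_te, y_te))
```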

Influence of dataset size on the algorithm's performance:

Isolation Forest (see the sketch after this list)

  Run 1:
    • Dataset : SF 10%
    • Anomaly rate : 4.5%
    • random_state : 1
    • contamination : tuned manually [0.045, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 2:
    • Dataset : SF 20%
    • Anomaly rate : 0.5%
    • random_state : 2
    • contamination : tuned manually [0.005, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 3:
    • Dataset : SF 50%
    • Anomaly rate : 0.5%
    • random_state : 2
    • contamination : tuned manually [0.005, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 4:
    • Dataset : SF 70%
    • Anomaly rate : 0.5%
    • random_state : 3
    • contamination : tuned manually [0.005, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 5:
    • Dataset : SF 100%
    • Anomaly rate : 0.5%
    • random_state : 4
    • contamination : tuned manually [0.005, 0.02]
    • n_estimators : default
    • max_samples : default
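
sklearn ships a KDDCUP'99 loader, so one plausible way to build the SF subsets of different sizes is to fetch the full SF data and randomly subsample it; the subsampling step is an assumption, as the repo does not say how the 10-70% fractions were drawn.

```python
# Hedged sketch of building SF subsets of different sizes.
import numpy as np
from sklearn.datasets import fetch_kddcup99

X, y = fetch_kddcup99(subset="SF", percent10=False, return_X_y=True)

def subsample(X, y, frac, seed):
    """Draw a random fraction of the full SF data (assumed procedure)."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
    return X[idx], y[idx]

for frac, seed in [(0.1, 1), (0.2, 2), (0.5, 2), (0.7, 3), (1.0, 4)]:
    X_sub, y_sub = subsample(X, y, frac, seed) if frac < 1.0 else (X, y)
    # ... manual contamination tuning on each subset goes here ...
```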

Influence of the anomaly rate on the algorithm's performance:

Isolation Forest (see the sketch after this list)

  Run 1:
    • Dataset : SF 100%
    • Anomaly rate : 0.5%
    • random_state : 1
    • contamination : tuned using GridSearchCV [0.005, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 2:
    • Dataset : SF 20%
    • Anomaly rate : 4.5%
    • random_state : 2
    • contamination : tuned using GridSearchCV [0.045, 0.18]
    • n_estimators : default
    • max_samples : default

  Run 3:
    • Dataset : SF 50%
    • Anomaly rate : 10%
    • random_state : 3
    • contamination : tuned using GridSearchCV [0.01, 0.04]
    • n_estimators : default
    • max_samples : default

  Run 4:
    • Dataset : SF 100%
    • Anomaly rate : 20%
    • random_state : 4
    • contamination : tuned using GridSearchCV [0.2, 0.8]
    • n_estimators : default
    • max_samples : default
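
A hedged sketch of how the anomaly rate could be varied: keep every normal record and subsample the attacks until they make up the target fraction of the result. The b'normal.' label value matches sklearn's KDDCUP'99 loader; the procedure itself is an assumption.

```python
# Hedged sketch: fix the anomaly rate by subsampling attack records.
import numpy as np

def set_anomaly_rate(X, y, rate, seed):
    """Keep all normal rows; draw attacks until they are `rate` of the total."""
    rng = np.random.RandomState(seed)
    normal = y == b"normal."                 # label used by sklearn's loader
    X_norm, X_anom = X[normal], X[~normal]
    n_anom = int(rate / (1.0 - rate) * len(X_norm))
    pick = rng.choice(len(X_anom), size=min(n_anom, len(X_anom)), replace=False)
    X_out = np.concatenate([X_norm, X_anom[pick]])
    y_out = np.concatenate([y[normal], y[~normal][pick]])
    return X_out, y_out

# e.g. the 4.5% rate of Run 2 above:
# X_45, y_45 = set_anomaly_rate(X, y, 0.045, seed=2)
```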

Work log :

  • Tuned the algorithms manually starting from Week 7, since GridSearchCV did not yield good enough results
  • Split the original SA dataset into normal and abnormal data
  • Frac

Increasing the number of normal data

  • Script: notebook/Increasing the number of normal data.ipynb. We start with the 1% SA dataset; the data is then split into normal and abnormal parts (see the sketch below).
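
A minimal sketch of that first step, reading "1%" as SA's built-in ~1% anomaly proportion in sklearn's loader; if it instead means a 1% sample of SA, a subsampling step would precede the split.

```python
# Minimal sketch: load the SA subset and split it by label.
from sklearn.datasets import fetch_kddcup99

X, y = fetch_kddcup99(subset="SA", percent10=True, return_X_y=True)

normal = y == b"normal."
X_normal, y_normal = X[normal], y[normal]        # normal traffic
X_abnormal, y_abnormal = X[~normal], y[~normal]  # attacks
```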

Work log

Week 8 : Worked on

Author

Phan Trung Thành
