
This repo compares the performance of various anomaly detection algorithms on the 1999 KDDCUP dataset.

Thanhphan1147/kddcup_benchmark

Progress history

W1-2 : 19.04 - 03.05 : Initialize the document, training the algorithms with default parameters

W3-4 : 03.05 - 17.05 : Parameter tuning on different configurations

W5-6 : 17.05 - 31.05 : Increasing the number of anomalies

W7 : Finalizing

W3-4 : Parameter tuning and cross-validation search

Methodology (a minimal sketch follows this list):

  1. Train/test split the data with a test size of 0.25
  2. Run 3-fold cross-validation on the training data, so that each fold's held-out split (one third of the 75% training set) matches the 25% final test size
  3. Apply the best parameters to the test data
  4. Compare against the default parameters
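
A minimal sketch of these four steps, assuming a feature matrix X and labels y (+1 = normal, -1 = anomaly) are already loaded; the f1 scoring metric is an assumption, since the repo does not say which metric GridSearchCV optimized.

```python
# Minimal sketch of the methodology above (X, y assumed loaded).
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import OneClassSVM

# 1. Train/test split, test size = 0.25.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# 2. 3-fold CV on the training data: each fold holds out 1/3 of the
#    75% training split, i.e. 25% of the full data, matching the test size.
search = GridSearchCV(OneClassSVM(), {"nu": [0.045, 0.18]},
                      cv=3, scoring="f1")  # scoring is an assumption
search.fit(X_train, y_train)

# 3. Apply the best parameters to the test data.
tuned_f1 = search.score(X_test, y_test)

# 4. Compare with the default parameters.
default_pred = OneClassSVM().fit(X_train).predict(X_test)
print(tuned_f1, f1_score(y_test, default_pred))
```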

Training configuration per algorithm:

OneClass SVM (see the sketch after this list)

  Run 1:
    • Dataset : SF 10%
    • Anomaly rate : 4.5%
    • random_state : 1
    • nu : tuned using GridSearchCV [0.045, 0.18]
    • kernel : default
    • gamma : default

  Run 2:
    • Dataset : SF 10%
    • Anomaly rate : 4.5%
    • random_state : 2
    • nu : tuned using GridSearchCV [0.045, 0.18]
    • kernel : poly
    • gamma : default
    • degree : default

  Run 3:
    • Dataset : SF 10%
    • Anomaly rate : 4.5%
    • random_state : 3
    • nu : tuned using GridSearchCV [0.045, 0.18]
    • kernel : default
    • gamma : default
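
A hedged sketch of how these three runs could be driven in one loop. load_sf is a hypothetical loader standing in for however the repo materializes the SF 10% subset; the random_state is applied to the train/test split, since OneClassSVM itself takes no random_state parameter.

```python
# Hedged sketch of the three OneClass SVM runs listed above.
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import OneClassSVM

runs = [
    {"random_state": 1, "kernel": "rbf"},   # run 1: default kernel
    {"random_state": 2, "kernel": "poly"},  # run 2: poly, default degree
    {"random_state": 3, "kernel": "rbf"},   # run 3: default kernel
]

X, y = load_sf(size="10%")  # hypothetical loader, 4.5% anomaly rate
for run in runs:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=run["random_state"])
    search = GridSearchCV(
        OneClassSVM(kernel=run["kernel"]),  # gamma left at its default
        {"nu": [0.045, 0.18]}, cv=3, scoring="f1")
    search.fit(X_tr, y_tr)
    print(run, search.best_params_, search.score(X_te, y_te))
```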

Isolation Forest (see the sketch after this list)

  Run 1:
    • Dataset : SF 100%
    • Anomaly rate : 0.5%
    • random_state : 1
    • contamination : tuned using GridSearchCV [0.005, 0.2]
    • n_estimators : default
    • max_samples : default

  Run 2:
    • Dataset : SF 100%
    • Anomaly rate : 0.5%
    • random_state : 2
    • contamination : tuned using GridSearchCV [0.005, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 3:
    • Dataset : SF 100%
    • Anomaly rate : 0.5%
    • random_state : 3
    • contamination : tuned using GridSearchCV [0.005, 0.02]
    • n_estimators : default
    • max_samples : default
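
The same pattern covers the three Isolation Forest searches; unlike OneClassSVM, IsolationForest does accept a random_state, so it is set on the estimator as well. The loader and scoring metric are again assumptions.

```python
# Hedged sketch of the Isolation Forest grid searches listed above.
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_sf(size="100%")  # hypothetical loader, 0.5% anomaly rate
for seed, grid in [(1, [0.005, 0.2]), (2, [0.005, 0.02]), (3, [0.005, 0.02])]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    search = GridSearchCV(
        IsolationForest(random_state=seed),  # n_estimators, max_samples default
        {"contamination": grid}, cv=3, scoring="f1")
    search.fit(X_tr, y_tr)
    print(seed, search.best_params_, search.score(X_te, y_te))
```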

Influence of dataset size on the algorithm's performance:

Isolation Forest (see the sketch after this list)

  Run 1:
    • Dataset : SF 10%
    • Anomaly rate : 4.5%
    • random_state : 1
    • contamination : tuned manually [0.045, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 2:
    • Dataset : SF 20%
    • Anomaly rate : 0.5%
    • random_state : 2
    • contamination : tuned manually [0.005, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 3:
    • Dataset : SF 50%
    • Anomaly rate : 0.5%
    • random_state : 2
    • contamination : tuned manually [0.005, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 4:
    • Dataset : SF 70%
    • Anomaly rate : 0.5%
    • random_state : 3
    • contamination : tuned manually [0.005, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 5:
    • Dataset : SF 100%
    • Anomaly rate : 0.5%
    • random_state : 4
    • contamination : tuned manually [0.005, 0.02]
    • n_estimators : default
    • max_samples : default
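
sklearn ships a KDDCUP'99 loader, so one plausible way to build the SF subsets of different sizes is to fetch the full SF data and randomly subsample it; the subsampling step is an assumption, as the repo does not say how the 10-70% fractions were drawn.

```python
# Hedged sketch of building SF subsets of different sizes.
import numpy as np
from sklearn.datasets import fetch_kddcup99

X, y = fetch_kddcup99(subset="SF", percent10=False, return_X_y=True)

def subsample(X, y, frac, seed):
    """Draw a random fraction of the full SF data (assumed procedure)."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
    return X[idx], y[idx]

for frac, seed in [(0.1, 1), (0.2, 2), (0.5, 2), (0.7, 3), (1.0, 4)]:
    X_sub, y_sub = subsample(X, y, frac, seed) if frac < 1.0 else (X, y)
    # ... manual contamination tuning on each subset goes here ...
```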

Influence of the anomaly rate on the algorithm's performance:

Isolation Forest (see the sketch after this list)

  Run 1:
    • Dataset : SF 100%
    • Anomaly rate : 0.5%
    • random_state : 1
    • contamination : tuned using GridSearchCV [0.005, 0.02]
    • n_estimators : default
    • max_samples : default

  Run 2:
    • Dataset : SF 20%
    • Anomaly rate : 4.5%
    • random_state : 2
    • contamination : tuned using GridSearchCV [0.045, 0.18]
    • n_estimators : default
    • max_samples : default

  Run 3:
    • Dataset : SF 50%
    • Anomaly rate : 10%
    • random_state : 3
    • contamination : tuned using GridSearchCV [0.01, 0.04]
    • n_estimators : default
    • max_samples : default

  Run 4:
    • Dataset : SF 100%
    • Anomaly rate : 20%
    • random_state : 4
    • contamination : tuned using GridSearchCV [0.2, 0.8]
    • n_estimators : default
    • max_samples : default
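
A hedged sketch of how the anomaly rate could be varied: keep every normal record and subsample the attacks until they make up the target fraction of the result. The b'normal.' label value matches sklearn's KDDCUP'99 loader; the procedure itself is an assumption.

```python
# Hedged sketch: fix the anomaly rate by subsampling attack records.
import numpy as np

def set_anomaly_rate(X, y, rate, seed):
    """Keep all normal rows; draw attacks until they are `rate` of the total."""
    rng = np.random.RandomState(seed)
    normal = y == b"normal."                 # label used by sklearn's loader
    X_norm, X_anom = X[normal], X[~normal]
    n_anom = int(rate / (1.0 - rate) * len(X_norm))
    pick = rng.choice(len(X_anom), size=min(n_anom, len(X_anom)), replace=False)
    X_out = np.concatenate([X_norm, X_anom[pick]])
    y_out = np.concatenate([y[normal], y[~normal][pick]])
    return X_out, y_out

# e.g. the 4.5% rate of Run 2 above:
# X_45, y_45 = set_anomaly_rate(X, y, 0.045, seed=2)
```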

Work log :

  • Tuned the algorithms manually starting from Week 7, since GridSearchCV did not yield good enough results
  • Split the original SA dataset into normal and abnormal data
  • Frac

Increasing the number of normal data

  • Script: notebook/Increasing the number of normal data.ipynb. We start with the 1% SA dataset; the data is then split into normal and abnormal parts (see the sketch below).
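
A minimal sketch of that first step, reading "1%" as SA's built-in ~1% anomaly proportion in sklearn's loader; if it instead means a 1% sample of SA, a subsampling step would precede the split.

```python
# Minimal sketch: load the SA subset and split it by label.
from sklearn.datasets import fetch_kddcup99

X, y = fetch_kddcup99(subset="SA", percent10=True, return_X_y=True)

normal = y == b"normal."
X_normal, y_normal = X[normal], y[normal]        # normal traffic
X_abnormal, y_abnormal = X[~normal], y[~normal]  # attacks
```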

Work log

Week 8 : Worked on

Author

Phan Trung Thành
