This is our implementation of the third assignment in Data Mining

aditya-srikanth/Data-Mining-Assignment-3

Fraud Detection using Local Outlier Factor

This implementation of Local Outlier Factor (LOF) attempts to detect frauds in a given database of credit card transactions.

The database can be found here: Credit Card Fraud Detection (Kaggle)

The project is divided into four major parts:

  1. Preprocessing
  2. LOF Calculation and Fraud Detection
  3. DBSCAN cluster generation and Fraud (Noise) Detection
  4. Visualization

Preprocessing mainly involves reading the data, provided as a CSV file, and normalizing it so that the attributes are on comparable scales. This improves the results of the distance-based calculations that follow.
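As an illustration of the normalization step, here is a minimal sketch that assumes min-max scaling to the range [0, 1]; the README does not say which normalization the project uses, and the function name `min_max_normalize` and the sample rows are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every column of `rows` (a list of equal-length numeric lists) to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        scaled = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            scaled.append((value - low) / span if span else 0.0)
        normalized.append(scaled)
    return normalized


# Hypothetical rows standing in for parsed CSV records.
sample = [[1.0, 200.0], [2.0, 400.0], [3.0, 300.0]]
scaled_sample = min_max_normalize(sample)
```

After scaling, no single attribute dominates the Euclidean distances used by LOF and DBSCAN merely because of its units.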

DBSCAN-based evaluation involves calculating the distance from each point to every other point and collecting all the points that lie within a specified radius (Eps) of it. If this neighbourhood contains enough points, the point is labeled a core point. Each remaining point is then checked in the same way: if it lies within the Eps neighbourhood of a core point, it is labeled a border point. The points left over are labeled noise points.

The main steps involved are:

  1. Find the points within the Eps neighbourhood of each point.
  2. Count the points within this neighbourhood.
  3. Label the points whose Eps neighbourhood contains more than MinPts points as core points.
  4. Label the points that are not core points but lie within the Eps radius of a core point as border points.
  5. Label the remaining points as noise points; these are the outliers.
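The steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the project's code: the function name `label_points` and the sample points are hypothetical, and it assumes Euclidean distance and a neighbourhood that excludes the point itself.

```python
import math


def label_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise' (DBSCAN-style)."""
    n = len(points)
    # Steps 1-2: indices of all other points within eps of each point.
    neighbours = [
        [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Step 3: core points have more than min_pts neighbours.
    core = {i for i in range(n) if len(neighbours[i]) > min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbours[i]):
            # Step 4: not core, but inside the eps radius of a core point.
            labels.append("border")
        else:
            # Step 5: everything else is noise, i.e. an outlier.
            labels.append("noise")
    return labels


# Hypothetical data: a small dense group, one point on its edge, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = label_points(points, eps=1.1, min_pts=1)
```

The all-pairs distance computation is quadratic in the number of points, which matches the "distance of each point from each point" description but becomes slow on large samples.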

The MinPts and Eps values are evaluated from 1 to 50 in increments of 5.

LOF calculation involves computing the LOF score of each point, comparing it with a custom threshold value and concluding whether the point is an outlier (in this case, a fraud). This involves the following main steps:

  1. Finding the K nearest neighbors of a point.
  2. Finding the Kth neighbor of the point and its distance.
  3. Calculating the Reach Distance (RD) of the point with respect to its K neighbors.
  4. Calculating the Local Reachability Density (LRD) of the point.
  5. Calculating the Local Outlier Factor (LOF) of the point using RD and LRD.
  6. Comparing the LOF with a threshold value. If LOF > threshold, the point is most probably an outlier/fraud.

The default threshold value is set to 1.5 for the credit data, and to 1 for a sample dataset of four 2-D points. The threshold value can be changed using THRESH, and the sample data can be toggled using DATA_FLAG. The default value for K is set to 2, which can be changed using K.
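The LOF steps listed above can be sketched as follows. This is a minimal illustration using the standard LOF definitions, not the project's code: `lof_scores` and the sample points are hypothetical, while `K` and `THRESH` reuse the names and default values mentioned above. It assumes Euclidean distance, takes exactly K neighbours (ties are not expanded) and requires distinct points.

```python
import math

K = 2         # number of neighbours (default mentioned above)
THRESH = 1.5  # LOF threshold (default for the credit data)


def lof_scores(points, k):
    """Return the Local Outlier Factor of every point."""
    n = len(points)
    neighbours, k_dist = [], []
    for i in range(n):
        # Steps 1-2: the k nearest neighbours and the distance to the k-th one.
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        neighbours.append([j for _, j in nearest])
        k_dist.append(nearest[-1][0])
    lrd = []
    for i in range(n):
        # Step 3: reach distance of i from neighbour j is max(k-distance(j), d(i, j)).
        reach = [max(k_dist[j], math.dist(points[i], points[j])) for j in neighbours[i]]
        # Step 4: LRD is the inverse of the mean reach distance (fails on duplicates).
        lrd.append(len(reach) / sum(reach))
    # Step 5: LOF is the mean LRD of the neighbours divided by the point's own LRD.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical data: four points on a unit square and one far from them.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, K)
# Step 6: indices whose LOF exceeds the threshold are flagged as outliers.
flagged = [i for i, score in enumerate(scores) if score > THRESH]
```

Points inside a uniformly dense region score close to 1, so a threshold somewhat above 1 separates them from points whose neighbourhood is much sparser than that of their neighbours.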

Visualization involves applying dimensionality reduction to the points to reduce the number of attributes to 2. This allows the points to be plotted on a 2-D graph, divided into two classes. The reduction is done using a modulo-i operation, where i is the index of the point in the dataset.

Results

The current implementation of LOF using a threshold of 1.5 and K = 5 consistently gives an accuracy of above 85% over various permutations of the dataset, with an average accuracy of 93%.

A sample run on the first 500 samples of the dataset gives:
Accuracy: 96%
Run Time: 15 seconds

Scatter plot after dimensional reduction:

[Figure: Scatter Plot (1000 samples)]

All the predicted outliers are well separated from the normal points, with the outliers marked in red and the normal points in blue.

Authors

Naren Surampudi (https://github.com/nsurampu/)
V Aditya Srikanth (https://github.com/aditya-srikanth/)
Prateek Dasgupta (https://github.com/patrik171298/)
