Udacity - Intro to Machine Learning

Using Machine Learning to Investigate the Enron fraud

Author : bmatthewtaylor@gmail.com github : github.com/aspiringguru/udacityIntroMachineLearning

Introduction

The Enron scandal, publically revealed in October 2001, led to the bankruptcy of the Enron Corporation, the largest corporate bankruptcy in America at the time. The failure of auditor Arthur Anderson to discharge their professional duties led to the defacto dissolution of one of the largest audit and accountancy partnerships in the world.

Many Enron executives were charged and sentenced to prison for their acts in the case.

In 2005 the documentry film "Enron: The Smartest Guys in the Room" was made about Enron case.

https://en.wikipedia.org/wiki/Enron_scandal https://en.wikipedia.org/wiki/Enron:_The_Smartest_Guys_in_the_Room

Purpose

The purpose of this machine learning analyse is to demonstrate the ability and suitability of machine learning techniques to detect fraud and persons of interest for audit investigations.

Data

The data available for analysis includes emails and financial information. The financial information includes data on each employees remuneration package, and non direct remuneration. Some of the field names provided are listed below.

salary
to_messages
deferral_payments
total_payments
exercised_stock_options
bonus
restricted_stock
shared_receipt_with_poi
restricted_stock_deferred
total_stock_value
expenses
loan_advances
from_messages
other
from_this_person_to_poi
poi
director_fees
deferred_income
long_term_incentive
email_address
from_poi_to_this_person

Other data considered is a list of known Persons of Interest (POI). This list was collated from court records and media reports of the fraud investigation.

Outliers

The numerical data was sorted and plotted for quick visual interpretation. A common outlier across several data columns was the row 'TOTAL' and excluded from further analysis.

Plotting and comparing columns identified a few persons with significantly higher values than the 'normal' distribution, specifically a short list of names appeared in the majority of shortlisted top 10.

'salary' : 'FREVERT MARK A, LAY KENNETH L, SKILLING JEFFREY K
'deferral_payments' : FREVERT MARK A
'total_payments': 'LAY KENNETH L'.
'exercised_stock_options' : 'HIRKO JOSEPH' and 'LAY KENNETH L'.
'bonus' : BELDEN TIMOTHY N, SKILLING JEFFREY K, LAY KENNETH L, LAVORATO JOHN J
'restricted_stock' : WHITE JR THOMAS E, LAY KENNETH L
'restricted_stock_deferred' : BHATNAGAR SANJAY
'total_stock_value' : RICE KENNETH D, PAI LOU L, SKILLING JEFFREY K, HIRKO JOSEPH, LAY KENNETH L
'expenses' : URQUHART JOHN A, MCCLELLAN GEORGE
'loan_advances' : PICKERING MARK R, FREVERT MARK A, LAY KENNETH L (only three non-zero values in this column)
'from_messages' : KAMINSKI WINCENTY J
'from_this_person_to_poi' : BECK SALLY W, KEAN STEVEN J, LAVORATO JOHN J, DELAINEY DAVID W
'director_fees' : significent banding observed around 1M and at 0.4M. No obvious outliers.

Several of these plots showed a very significent disparity between the shortlisted names and general employees. Current senior executive remuneration trends commonly result in remuneration packages including various forms of stock incentives. While the disparities in forms of remuneration and total remuneration are not indicative of the person being a Person Of Interest (POI), the recurrence of names known to be POI provides good reason to examine this data in more detail using statistical methods. Also of interest is the correlation between publically reported remuneration packages due to regulatory requirements and remuneration less visible to scrutiny via public financial reports.

Algorythmic Analysis

Various algorythmic methods were compared as a revision/warm up exercise.

Naive Bayes

The Naive Bayes method was tested using various keys to identify which provided the highest accuracy. (poi_id_Naive_Bayes1.py)

keys	clf.score	accuracy_score
salary, total_payments, exercised_stock_options	0.845360824742	0.833333333333
total_payments, exercised_stock_options	0.845360824742	0.854166666667
total_payments	0.876288659794	0.833333333333
exercised_stock_options	0.886597938144	0.875
from_poi_to_this_person	0.886597938144	0.854166666667
salary	0.886597938144	0.833333333333
from_this_person_to_poi	0.886597938144	0.854166666667
nb: clf.score [clf.score(a_train, b_train)] compares the accuracy of predictions on training data.
nbb: accuracy_score [accuracy_score(b_test, b_pred)] compares predicted values with known values on a test set.

Interestingly, while several keys and combinations of keys provided accuracy in the range 83%-88%, none of these models demonstrated a significently higher accuracy than others. The key with the highest accuracy was salary.

An interesting observation can be drawn from the accuracy wrt emails sent to a poi and emails received from a poi, persons receiving emails from a poi were marginally more likely to be a poi themselve than persons sending emails to a poi. Given the small % difference this factor alone needs further analysis to confirm accuracy as a predictor.

Support Vector Machines

The Support Vector Machines method required careful choice of kernel method to achieve a useful result. Some limited experimentaiton with C values was conducted to optimise results.

keys	kernel method	clf.score	accuracy_score	Fitting Time	Predicting Time
salary, total_payments, exercised_stock_options	linear	no result	no result	no result	no result
total_payments, exercised_stock_options	linear	no result	no result	no result	no result
total_payments	linear	no result	no result	no result	no result
exercised_stock_options	linear	no result	no result	no result	no result
from_poi_to_this_person	linear	0.886597938144	0.854166666667	0.0	0.0
salary	linear	0.876288659794	0.854166666667	0.0	0.0
from_this_person_to_poi	linear	0.886597938144	0.854166666667	0.0	0.0

C=1.0 (default value) | salary, total_payments, exercised_stock_options | rbf | 1.0 | 0.854166666667 | 0.001 | 0.0 | | total_payments, exercised_stock_options | rbf | 1.0 | 0.854166666667 | 0.0 | 0.0 | | total_payments | rbf | 1.0 | 0.854166666667 | 0.0 | 0.001 | | exercised_stock_options | rbf | 0.958762886598 | 0.854166666667 | 0.0 | 0.0 | | from_poi_to_this_person | rbf | 0.917525773196 | 0.854166666667 | 0.001 | 0.0 | | salary | rbf | 1.0 | 0.854166666667 | 0.001 | 0.0 | | from_this_person_to_poi | rbf | 0.907216494845 | 0.854166666667 | 0.001 | 0.0 | C=10

| salary, total_payments, exercised_stock_options | rbf | 1.0 | 0.854166666667 | 0.0 | 0.0 | | total_payments, exercised_stock_options | rbf | 1.0 | 0.854166666667 | 0.0 | 0.0 | | total_payments | rbf | 1.0 | 0.854166666667 | 0.0 | 0.0 | | exercised_stock_options | rbf | 0.958762886598 | 0.854166666667 | 0.0 | 0.0 | | from_poi_to_this_person | rbf | 0.948453608247 | 0.8125 | 0.0 | 0.0 | | salary | rbf | 1.0 | 0.854166666667 | 0.0 | 0.0 | | from_this_person_to_poi | rbf | 0.948453608247 | 0.791666666667 | 0.0 | 0.0 |

NB: 'no result' = no solution found in a reasonable time given the dataset size. (cutoffs varied, typically min 3 minutes.)

Decision Trees

Configuring the DecisionTreeClassifier has the default option criterion='gini', the 'entropy' option was also evaluated. As expected the 'gini' option demonstrated tighter fitting when compared to 'entropy'.

With DecisionTreeClassifier(criterion='gini')

keys	clf.score	accuracy_score
salary, total_payments, exercised_stock_options	1.0	0.854166666667
total_payments, exercised_stock_options	1.0	0.854166666667
total_payments	1.0	0.854166666667
exercised_stock_options	0.958762886598	0.854166666667
from_poi_to_this_person	0.948453608247	0.8125
salary	1.0	0.854166666667
from_this_person_to_poi	0.948453608247	0.791666666667

With DecisionTreeClassifier(criterion='entropy')

keys	clf.score	accuracy_score
salary, total_payments, exercised_stock_options	1.0	0.8125
total_payments, exercised_stock_options	1.0	0.729166666667
total_payments	1.0	0.75
exercised_stock_options	0.958762886598	0.8125
from_poi_to_this_person	0.948453608247	0.8125
salary	1.0	0.8125
from_this_person_to_poi	0.948453608247	0.791666666667

With DecisionTreeClassifier(criterion='entropy', min_samples_split=8) NB: 'min_samples_split' is the minimum number of samples required to split an internal node.

keys	clf.score	accuracy_score
salary, total_payments, exercised_stock_options	0.969072164948	0.729166666667
total_payments, exercised_stock_options	0.969072164948	0.645833333333
total_payments	0.927835051546	0.833333333333
exercised_stock_options	0.948453608247	0.8125
from_poi_to_this_person	0.917525773196	0.8125
salary	0.917525773196	0.854166666667
from_this_person_to_poi	0.938144329897	0.791666666667

New algorythmic methods

Three methods of analysis were considered for the project.

K nearest neighbours
Adaboost
Random Forest

K nearest neighbours

With KNeighborsClassifier(n_neighbors=3)

keys	clf.score	accuracy_score
salary, total_payments, exercised_stock_options	0.907216494845	0.875
total_payments, exercised_stock_options	0.917525773196	0.875
total_payments	0.886597938144	0.854166666667
exercised_stock_options	0.907216494845	0.875
from_poi_to_this_person	0.886597938144	0.833333333333
salary	0.886597938144	0.875
from_this_person_to_poi	0.876288659794	0.791666666667

With KNeighborsClassifier(n_neighbors=6)

keys	clf.score	accuracy_score
salary, total_payments, exercised_stock_options	0.886597938144	0.854166666667
total_payments, exercised_stock_options	0.886597938144	0.854166666667
total_payments	0.886597938144	0.854166666667
exercised_stock_options	0.886597938144	0.854166666667
from_poi_to_this_person	0.886597938144	0.854166666667
salary	0.886597938144	0.854166666667
from_this_person_to_poi	0.886597938144	0.854166666667

NB: clf.score & accuracy_score values are same for all parameters. This indicates the classifier uses an excessive value for n_neighbors.

With KNeighborsClassifier(n_neighbors=4)

keys	clf.score	accuracy_score
salary, total_payments, exercised_stock_options	0.886597938144	0.854166666667
total_payments, exercised_stock_options	0.886597938144	0.854166666667
total_payments	0.886597938144	0.854166666667
exercised_stock_options	0.886597938144	0.854166666667
from_poi_to_this_person	0.886597938144	0.854166666667
salary	0.886597938144	0.854166666667
from_this_person_to_poi	0.886597938144	0.854166666667

NB: again, clf.score & accuracy_score values are same for all parameters. This indicates the classifier uses an excessive value for n_neighbors.

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
.idea		.idea
.ipynb_checkpoints		.ipynb_checkpoints
choose_your_own		choose_your_own
datasets_questions		datasets_questions
decision_tree		decision_tree
evaluation		evaluation
feature_selection		feature_selection
final_project		final_project
k_means		k_means
lessons		lessons
naive_bayes		naive_bayes
outliers		outliers
pca		pca
regression		regression
svm		svm
text_learning		text_learning
tools		tools
validation		validation
.gitignore		.gitignore
README.md		README.md
Untitled.ipynb		Untitled.ipynb
Untitled1.ipynb		Untitled1.ipynb
Untitled2.ipynb		Untitled2.ipynb
bmt1_final_project.ipynb		bmt1_final_project.ipynb
enron_data_working_notes.txt		enron_data_working_notes.txt
untitled.txt		untitled.txt

aspiringguru/udacityIntroMachineLearning

Folders and files

Latest commit

History

Repository files navigation

Udacity - Intro to Machine Learning

Using Machine Learning to Investigate the Enron fraud

Author : bmatthewtaylor@gmail.com github : github.com/aspiringguru/udacityIntroMachineLearning

Introduction

Purpose

Data

Outliers

Algorythmic Analysis

Naive Bayes

Support Vector Machines

Decision Trees

New algorythmic methods

K nearest neighbours

About

Resources

Stars

Watchers

Forks

Languages