# -*- coding: utf-8 -*-
"""Build Week 2 - London Crimes - Nisha Arya
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1z6aphy51MJe2yn47nWOL-SIv3Ou4O00V
"""
url = 'https://www.kaggle.com/jboysen/london-crime/download'
import pandas as pd
df = pd.read_csv('london_crime_by_lsoa.csv')
df.head()
"""The dataset I have chosen is based on the freequency and the type of crimes that have occured between January 2008 to December 2016. It looks at the different boroughs in London and if the crime committed it considered 'major' or minor. It gives more information on the year and month that the crime occured, with the column 'value' telling us how many times it occured within that specific month. Looking at the data, I can start to brainstorm and explore if there are seasonal or time-of-week/day changes in crime occurrences? Or if there are particular crimes that mainly occur in a particular brorough and if these increase or decrease in a particular month."""
df.shape
#the number of rows and columns
df.tail()
"""Overall, my dataset is quite large, containing 13 million rows. It is a good dataset as I have a wide range of data to use, compare and analyse but I know it will have an impact on the accuracy of my data and it will be time consuming. With that, I have then decided to use a specific year, which is 2011. I chose this year due to the fact that I knew that the 2011 London Riots occured and I wanted to see if this had an effect on the frequency of crimes that occured during the year."""
df2 = df[df['year'] == 2011].copy()  # .copy() so later column assignments don't raise SettingWithCopyWarning
df2.head()
df.isnull().sum()
# Unique values - the number of times crimes happen per month
import numpy as np
np.unique(df2["value"])
df2.describe()
# The mean of 'value' gives a baseline prediction for the monthly count
# This baseline is the minimum any model should beat
# On average, a crime occurs 0.47 times per row in a given month
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
def crime_flag(item):
    # Collapse the monthly count to binary: 1 if any crime occurred, else 0
    if item == 0:
        return 0
    return 1
df2['crime_occured'] = df2['value'].apply(crime_flag)
"""Baseline - create predictions for a dataset
A baseline is a method that uses simple summary statistics which is creates predictions for a dataset. Baselines are used to measure accuracy and used as an indication to compare back to once you further analyse your data and do more accuracy testing.
"""
df3 = df2.drop(columns=['value'])
df3.head()
Total = df2['value'].sum()
print(Total)
"""The below interactive bar plot is showing me the different crimes that occured in the year 2011. The bar plot allows me to see which borough in London had the most crimes in the year, giving me the interaction of seeing which crimes occured (major or minor). Looking at this, I would like to analyse the column 'month' and see how it differs in the type of crime that occured."""
import plotly.express as px
fig = px.bar(df2, x='borough', y='value',
             hover_data=['major_category', 'minor_category'], color='crime_occured',
             labels={'value': 'Sum of Crimes Occurred'}, height=400)
fig.update_layout(title='The frequency and type of crime in different London boroughs, 2011')
fig.show()
fig = px.bar(df3, x='borough', y='crime_occured',
             hover_data=['major_category', 'minor_category'], color='crime_occured',
             labels={'crime_occured': 'Crime Occurred (0/1)'}, height=400)
fig.update_layout(title='Whether any crime occurred, by London borough, 2011')
fig.show()
# Scatter plot for minor_category
fig = px.scatter_3d(df2, x='month', y='value', z='minor_category',
color='borough')
fig.show()
# Scatter plot for major_category
fig = px.scatter_3d(df2, x='month', y='value', z='major_category',
color='borough')
fig.show()
df3['crime_occured'].value_counts()
"""I found my baseline by using value_counts and it stands at 74%. My target is 'crime_occured' which is a type of binary data which represents if a crime occured (1) or did not occur (0) in that given month. So the baseline is telling me that 74% of my data has no correlation between the time of the month and the frequency of crime committed."""
df3['crime_occured'].value_counts(normalize=True)
df3.crime_occured.value_counts().plot.bar()
df3.head()
# 2.Choose what data to hold out for your test set
#The training set contains a known output and the model learns on this data
#test data is used to evaluate its accuracy
from sklearn.model_selection import train_test_split
train, test = train_test_split(df3, train_size=0.80, test_size=0.20, random_state=2)
train.shape
test.shape
train, val = train_test_split(train, train_size=0.80, test_size=0.20, random_state=42)
train.shape
val.shape
"""I am now going to focus on my target and features which will allow me to chose an evaluation metric and compare the different accuracy scores that I get."""
target = 'crime_occured'
features = ['lsoa_code' , 'borough' , 'major_category' , 'minor_category' , 'year' , 'month']
X_train = train[features]
X_val = val[features]
X_test = test[features]
y_train = train[target]
y_val = val[target]
y_test = test[target]
X_train.shape
X_val.shape
X_test.shape
"""# Objective 2: Define a regression or classification problem, choose an appropriate evaluation metric and begin with baselines"""
!pip install category_encoders
import category_encoders as ce
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
"""Test Accuracy"""
X_test
X_train.head()
y_train.head()
# Classification problem
# Evaluation metric - accuracy score
pipeline = make_pipeline(
ce.OrdinalEncoder(),
SimpleImputer(strategy='median'),
RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)
# Fit on train
pipeline.fit(X_train, y_train)
print('Test Accuracy', pipeline.score(X_test, y_test))
# This test accuracy is same as my baseline
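One way to double-check that comparison is scikit-learn's DummyClassifier, which reproduces the majority-class baseline as a fitted model. A sketch on toy data; in this notebook the real call would use X_train, y_train and X_test, y_test.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: 6 of 8 labels are 1, so 'most_frequent' scores 0.75
X_toy = np.zeros((8, 1))
y_toy = np.array([1, 1, 1, 0, 1, 1, 0, 1])

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_toy, y_toy)
baseline_acc = dummy.score(X_toy, y_toy)  # 0.75
```

Any model whose test accuracy falls below this number is doing worse than always guessing the majority class.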
"""# Objective 3: Student fits and evaluates any linear model for regression or classification"""
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
preprocessing = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    StandardScaler()
)
X_train_transform = preprocessing.fit_transform(X_train)
X_test_transform = preprocessing.transform(X_test)
X_val_transform = preprocessing.transform(X_val)  # transform only: fit the encoder and scaler on train alone
model = RandomForestClassifier(n_estimators=106, max_depth=100,
                               min_samples_leaf=10, min_samples_split=2,
                               criterion='gini', n_jobs=-1,
                               random_state=7)
model.fit(X_train_transform, y_train)
!pip install eli5
import eli5
from eli5.sklearn import PermutationImportance
permuter = PermutationImportance(
model,
scoring='accuracy',
n_iter=2,
random_state=42
)
permuter.fit(X_val_transform, y_val)
new_variable = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, new_variable).sort_values(ascending=False)
eli5.show_weights(
permuter,
top=None, # show permutation importances for all features
feature_names=new_variable)
# minor_category holds the most weight in its influence on my predictions
plt.figure(figsize=(8,8))
rf = pipeline.named_steps['randomforestclassifier']
importances = pd.Series(rf.feature_importances_, X_train.columns)
importances.sort_values().plot.barh(color='grey');
"""1) Train/Test/Val Accuracy"""
from sklearn.metrics import accuracy_score
# Fit on train set
model.fit(X_train_transform, y_train)
# Get train accuracy
y_pred = model.predict(X_train_transform)
print('Train Accuracy', accuracy_score(y_train, y_pred))
# Get test accuracy
y_pred = model.predict(X_test_transform)
print('Test Accuracy', accuracy_score(y_test, y_pred))
# Get validation accuracy
y_pred = model.predict(X_val_transform)
print('Validation Accuracy', accuracy_score(y_val, y_pred))
"""2) Train/Test/Val - Using Logistic Regression"""
# Logistic regression - binary target
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(solver='lbfgs')
# Fit once on the training set, then score each split against it
log_reg.fit(X_train_transform, y_train)
print('Train Accuracy', log_reg.score(X_train_transform, y_train))
print('Test Accuracy', log_reg.score(X_test_transform, y_test))
print('Validation Accuracy', log_reg.score(X_val_transform, y_val))
# solver is a hyperparameter that chooses the optimisation algorithm for the coefficients
# We were accurate 74.536% of the time, whilst our baseline was 74.583%
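cross_val_score was imported earlier but never used; since these scores sit so close to the baseline, k-fold cross-validation would give a steadier estimate than a single split. A sketch on synthetic data; the notebook version would pass X_train_transform and y_train instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary problem: the label depends only on the first feature
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 3))
y_syn = (X_syn[:, 0] > 0).astype(int)

# 5-fold cross-validated accuracy of the same LogisticRegression setup
scores = cross_val_score(LogisticRegression(solver='lbfgs'), X_syn, y_syn, cv=5)
mean_acc = scores.mean()
```

Reporting the mean and spread of the five fold scores shows whether an accuracy difference of a fraction of a percent is real or just split noise.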
"""# Objective 4: Student fits and evaluates a decision tree, random forest, or gradient boosting model for regression or classification
3) Train/Test/Val - Using Gradient Boosting Model
"""
from xgboost import XGBClassifier
pipeline = make_pipeline(
ce.OrdinalEncoder(),
XGBClassifier(n_estimators=118, random_state=42, n_jobs=-1, max_depth = 5)
)
pipeline.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
y_pred = pipeline.predict(X_train)
print('Train Accuracy', accuracy_score(y_train, y_pred))
y_pred = pipeline.predict(X_test)
print('Test Accuracy', accuracy_score(y_test, y_pred))
y_pred = pipeline.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))
"""Feature Importance"""
# Just an example to test out
from sklearn.impute import SimpleImputer
#drop-column year
column = 'year'
# Fit without column
pipeline = make_pipeline(
ce.OrdinalEncoder(),
SimpleImputer(strategy='median'),
RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train.drop(columns=column), y_train)
score_without = pipeline.score(X_test.drop(columns=column), y_test)
print(f'Test Accuracy without {column}: {score_without}')
# Fit with column
pipeline = make_pipeline(
ce.OrdinalEncoder(),
SimpleImputer(strategy='median'),
RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train, y_train)
score_with = pipeline.score(X_test, y_test)
print(f'Test Accuracy with {column}: {score_with}')
# Compare the error with & without column
print(f'Drop-Column Importance for {column}: {score_with - score_without}')
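The same with/without comparison can be looped over every feature to get a drop-column importance per column. A self-contained sketch; the helper name and toy frame are my own, and the notebook version would rebuild the pipeline above inside make_model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def drop_column_importances(make_model, X_tr, y_tr, X_te, y_te):
    # Score with all features, then re-fit once per dropped column;
    # a large positive gap means the model relied on that column.
    full_score = make_model().fit(X_tr, y_tr).score(X_te, y_te)
    importances = {}
    for col in X_tr.columns:
        m = make_model().fit(X_tr.drop(columns=col), y_tr)
        importances[col] = full_score - m.score(X_te.drop(columns=col), y_te)
    return importances

# Toy frame: 'a' equals the label exactly, 'b' carries no signal
X = pd.DataFrame({'a': [0, 1] * 20, 'b': [0, 0, 1, 1] * 10})
y = pd.Series([0, 1] * 20)
imps = drop_column_importances(lambda: LogisticRegression(), X, y, X, y)
```

Dropping the informative column 'a' costs the model heavily, while dropping the noise column 'b' costs nothing, which is exactly the signal drop-column importance is meant to surface.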
"""# Objective 6: Student writes 300+ words (not including code).
# Student reports baseline score, validation scores from 2+ models, and test score from 1 selected mode
The dataset I have chosen is based on the freequency and the type of crimes that have occured between January 2008 to December 2016. It looks at the different boroughs in London and if the crime committed it considered 'major' or minor. It gives more information on the year and month that the crime occured, with the column 'value' telling us how many times it occured within that specific month. Looking at the data, I can start to brainstorm and explore if there are seasonal or time-of-week/day changes in crime occurrences? Or if there are particular crimes that mainly occur in a particular brorough and if these increase or decrease in a particular month.
Overall, my dataset is quite large, containing 13 million rows. It is a good dataset as I have a wide range of data to use, compare and analyse but I know it will have an impact on the accuracy of my data and it will be time consuming. With that, I have then decided to use a specific year, which is 2011. I chose this year due to the fact that I knew that the 2011 London Riots occured and I wanted to see if this had an effect on the frequency of crimes that occured during the year.
A baseline is a method that uses simple summary statistics which is creates predictions for a dataset. Baselines are used to measure accuracy and used as an indication to compare back to once you further analyse your data and do more accuracy testing. I kicked off with getting my baseline, which is the starting point to creating predictions for my data set. I got my baseline by using value_counts (number of occurrence of an element in a list) and it stands at 74%. My target is 'crime_occured' which is a type of binary data which represents if a crime occurred (1) or did not occur (0) in that given month. Binary data only has 2 outcomes, yes or no, truth or false. Looking at the value of my baseline, it is telling me that 74% of my data has no correlation between the time of the month and the frequency of crime committed.
I initially started off with my Test Accuracy, using classification accuracy score. Accuracy score is a type of evaluation metric which looks at the number of correct predictions over the total number of predictions. My test accuracy is 0.73%, which is 1% lower than my baseline. This indicates to me that I may need to use other machine learning algorithms to try to beat my baseline of 74%.
I then moved onto the Random Forest Classifier which is considered as a highly accurate and robust method because of the number of decision trees (predictions)it outputs. It takes the average of all the predictions, cancelling out the biases, whilst handling missing values and being able to get the feature importance, which helps in selecting the most contributing features. I then used Eli5 which is a package used in Data Science which helps to debug machine learning classifiers and explain their predictions. This tells me that lsoa_code ( Lower Super Output Area code), month and borough hold the most weight on its influence on my predictions. This supports my research question that the time of the year and borough does have an affect on the frequency of crimes that occur. I also looked into the feature importance, which allowed me to explore which features had any significance with my research question.
# Objective 7:Student makes 2+ visualizations to explain their model
"""
!pip install shap
import shap
shap.initjs()
enc = ce.OrdinalEncoder()
enc.fit(X_train)
processed_X_train = enc.transform(X_train)
X_train_clean = processed_X_train.ffill()  # forward-fill any missing values
model = RandomForestClassifier(n_estimators = 200, random_state = 6)
model.fit(X_train_clean, y_train)
row = X_train_clean.iloc[[0]]
explainerModel = shap.TreeExplainer(model)
shap_values_Model = explainerModel.shap_values(row)
shap.force_plot(base_value=explainerModel.expected_value[0],
                shap_values=shap_values_Model[0],
                features=row,
                link='logit')
# 0.71 is our accuracy