# -*- coding: utf-8 -*-
"""Build Week 2 - London Crimes - Nisha Arya
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1z6aphy51MJe2yn47nWOL-SIv3Ou4O00V
"""
url = 'https://www.kaggle.com/jboysen/london-crime/download'
import pandas as pd
df = pd.read_csv('london_crime_by_lsoa.csv')
df.head()
"""The dataset I have chosen is based on the freequency and the type of crimes that have occured between January 2008 to December 2016. It looks at the different boroughs in London and if the crime committed it considered 'major' or minor. It gives more information on the year and month that the crime occured, with the column 'value' telling us how many times it occured within that specific month. Looking at the data, I can start to brainstorm and explore if there are seasonal or time-of-week/day changes in crime occurrences? Or if there are particular crimes that mainly occur in a particular brorough and if these increase or decrease in a particular month."""
df.shape
#the number of rows and columns
df.tail()
"""Overall, my dataset is quite large, containing 13 million rows. It is a good dataset as I have a wide range of data to use, compare and analyse but I know it will have an impact on the accuracy of my data and it will be time consuming. With that, I have then decided to use a specific year, which is 2011. I chose this year due to the fact that I knew that the 2011 London Riots occured and I wanted to see if this had an effect on the frequency of crimes that occured during the year."""
df2 = df[df['year'] == 2011].copy()  # .copy() so later column assignments don't raise SettingWithCopyWarning
df2.head()
df.isnull().sum()
# Unique values - the number of times crimes happen per month
import numpy as np
np.unique(df2["value"])
df2.describe()
# The mean of 'value' gives a baseline prediction for the monthly count
# This baseline is the minimum any model should beat
# On average, a crime occurs 0.47 times per row in a given month
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
def crime_flag(item):
    # Collapse the monthly count to binary: 1 if any crime occurred, else 0
    if item == 0:
        return 0
    return 1
df2['crime_occured'] = df2['value'].apply(crime_flag)
"""Baseline - create predictions for a dataset
A baseline is a method that uses simple summary statistics which is creates predictions for a dataset. Baselines are used to measure accuracy and used as an indication to compare back to once you further analyse your data and do more accuracy testing.
"""
df3 = df2.drop(columns=['value'])
df3.head()
Total = df2['value'].sum()
print(Total)
"""The below interactive bar plot is showing me the different crimes that occured in the year 2011. The bar plot allows me to see which borough in London had the most crimes in the year, giving me the interaction of seeing which crimes occured (major or minor). Looking at this, I would like to analyse the column 'month' and see how it differs in the type of crime that occured."""
import plotly.express as px
fig = px.bar(df2, x='borough', y='value',
             hover_data=['major_category', 'minor_category'], color='crime_occured',
             labels={'value': 'Sum of Crimes Occurred'}, height=400)
fig.update_layout(title='The frequency and type of crime in different London boroughs, 2011')
fig.show()
fig = px.bar(df3, x='borough', y='crime_occured',
             hover_data=['major_category', 'minor_category'], color='crime_occured',
             labels={'crime_occured': 'Crime Occurred (0/1)'}, height=400)
fig.update_layout(title='Whether any crime occurred, by London borough, 2011')
fig.show()
# Scatter plot for minor_category
fig = px.scatter_3d(df2, x='month', y='value', z='minor_category',
color='borough')
fig.show()
# Scatter plot for major_category
fig = px.scatter_3d(df2, x='month', y='value', z='major_category',
color='borough')
fig.show()
df3['crime_occured'].value_counts()
"""I found my baseline by using value_counts and it stands at 74%. My target is 'crime_occured' which is a type of binary data which represents if a crime occured (1) or did not occur (0) in that given month. So the baseline is telling me that 74% of my data has no correlation between the time of the month and the frequency of crime committed."""
df3['crime_occured'].value_counts(normalize=True)
df3.crime_occured.value_counts().plot.bar()
df3.head()
# 2.Choose what data to hold out for your test set
#The training set contains a known output and the model learns on this data
#test data is used to evaluate its accuracy
from sklearn.model_selection import train_test_split
train, test = train_test_split(df3, train_size=0.80, test_size=0.20, random_state=2)
train.shape
test.shape
train, val = train_test_split(train, train_size=0.80, test_size=0.20, random_state=42)
train.shape
val.shape
"""I am now going to focus on my target and features which will allow me to chose an evaluation metric and compare the different accuracy scores that I get."""
target = 'crime_occured'
features = ['lsoa_code' , 'borough' , 'major_category' , 'minor_category' , 'year' , 'month']
X_train = train[features]
X_val = val[features]
X_test = test[features]
y_train = train[target]
y_val = val[target]
y_test = test[target]
X_train.shape
X_val.shape
X_test.shape
"""# Objective 2: Define a regression or classification problem, choose an appropriate evaluation metric and begin with baselines"""
!pip install category_encoders
import category_encoders as ce
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
"""Test Accuracy"""
X_test
X_train.head()
y_train.head()
# Classification problem
# Evaluation metric - accuracy score
pipeline = make_pipeline(
ce.OrdinalEncoder(),
SimpleImputer(strategy='median'),
RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)
# Fit on train
pipeline.fit(X_train, y_train)
print('Test Accuracy', pipeline.score(X_test, y_test))
# This test accuracy is same as my baseline
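One way to double-check that comparison is scikit-learn's DummyClassifier, which reproduces the majority-class baseline as a fitted model. A sketch on toy data; in this notebook the real call would use X_train, y_train and X_test, y_test.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: 6 of 8 labels are 1, so 'most_frequent' scores 0.75
X_toy = np.zeros((8, 1))
y_toy = np.array([1, 1, 1, 0, 1, 1, 0, 1])

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_toy, y_toy)
baseline_acc = dummy.score(X_toy, y_toy)  # 0.75
```

Any model whose test accuracy falls below this number is doing worse than always guessing the majority class.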
"""# Objective 3: Student fits and evaluates any linear model for regression or classification"""
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
preprocessing = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    StandardScaler()
)
X_train_transform = preprocessing.fit_transform(X_train)
X_test_transform = preprocessing.transform(X_test)
X_val_transform = preprocessing.transform(X_val)  # transform only: fit the encoder and scaler on train alone
model = RandomForestClassifier(n_estimators=106, max_depth=100,
                               min_samples_leaf=10, min_samples_split=2,
                               criterion='gini', n_jobs=-1,
                               random_state=7)
model.fit(X_train_transform, y_train)
!pip install eli5
import eli5
from eli5.sklearn import PermutationImportance
permuter = PermutationImportance(
model,
scoring='accuracy',
n_iter=2,
random_state=42
)
permuter.fit(X_val_transform, y_val)
new_variable = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, new_variable).sort_values(ascending=False)
eli5.show_weights(
permuter,
top=None, # show permutation importances for all features
feature_names=new_variable)
# minor_category holds the most weight in its influence on my predictions
plt.figure(figsize=(8,8))
rf = pipeline.named_steps['randomforestclassifier']
importances = pd.Series(rf.feature_importances_, X_train.columns)
importances.sort_values().plot.barh(color='grey');
"""1) Train/Test/Val Accuracy"""
from sklearn.metrics import accuracy_score
# Fit on train set
model.fit(X_train_transform, y_train)
# Get train accuracy
y_pred = model.predict(X_train_transform)
print('Train Accuracy', accuracy_score(y_train, y_pred))
# Get test accuracy
y_pred = model.predict(X_test_transform)
print('Test Accuracy', accuracy_score(y_test, y_pred))
# Get validation accuracy
y_pred = model.predict(X_val_transform)
print('Validation Accuracy', accuracy_score(y_val, y_pred))
"""2) Train/Test/Val - Using Logistic Regression"""
# Logistic regression - binary target
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(solver='lbfgs')
# Fit once on the training set, then score each split against it
log_reg.fit(X_train_transform, y_train)
print('Train Accuracy', log_reg.score(X_train_transform, y_train))
print('Test Accuracy', log_reg.score(X_test_transform, y_test))
print('Validation Accuracy', log_reg.score(X_val_transform, y_val))
# solver is a hyperparameter that chooses the optimisation algorithm for the coefficients
# We were accurate 74.536% of the time, whilst our baseline was 74.583%
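cross_val_score was imported earlier but never used; since these scores sit so close to the baseline, k-fold cross-validation would give a steadier estimate than a single split. A sketch on synthetic data; the notebook version would pass X_train_transform and y_train instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary problem: the label depends only on the first feature
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 3))
y_syn = (X_syn[:, 0] > 0).astype(int)

# 5-fold cross-validated accuracy of the same LogisticRegression setup
scores = cross_val_score(LogisticRegression(solver='lbfgs'), X_syn, y_syn, cv=5)
mean_acc = scores.mean()
```

Reporting the mean and spread of the five fold scores shows whether an accuracy difference of a fraction of a percent is real or just split noise.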
"""# Objective 4: Student fits and evaluates a decision tree, random forest, or gradient boosting model for regression or classification
3) Train/Test/Val - Using Gradient Boosting Model
"""
from xgboost import XGBClassifier
pipeline = make_pipeline(
ce.OrdinalEncoder(),
XGBClassifier(n_estimators=118, random_state=42, n_jobs=-1, max_depth = 5)
)
pipeline.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
y_pred = pipeline.predict(X_train)
print('Train Accuracy', accuracy_score(y_train, y_pred))
y_pred = pipeline.predict(X_test)
print('Test Accuracy', accuracy_score(y_test, y_pred))
y_pred = pipeline.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))
"""Feature Importance"""
# Just an example to test out
from sklearn.impute import SimpleImputer
#drop-column year
column = 'year'
# Fit without column
pipeline = make_pipeline(
ce.OrdinalEncoder(),
SimpleImputer(strategy='median'),
RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train.drop(columns=column), y_train)
score_without = pipeline.score(X_test.drop(columns=column), y_test)
print(f'Test Accuracy without {column}: {score_without}')
# Fit with column
pipeline = make_pipeline(
ce.OrdinalEncoder(),
SimpleImputer(strategy='median'),
RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train, y_train)
score_with = pipeline.score(X_test, y_test)
print(f'Test Accuracy with {column}: {score_with}')
# Compare the error with & without column
print(f'Drop-Column Importance for {column}: {score_with - score_without}')
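The same with/without comparison can be looped over every feature to get a drop-column importance per column. A self-contained sketch; the helper name and toy frame are my own, and the notebook version would rebuild the pipeline above inside make_model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def drop_column_importances(make_model, X_tr, y_tr, X_te, y_te):
    # Score with all features, then re-fit once per dropped column;
    # a large positive gap means the model relied on that column.
    full_score = make_model().fit(X_tr, y_tr).score(X_te, y_te)
    importances = {}
    for col in X_tr.columns:
        m = make_model().fit(X_tr.drop(columns=col), y_tr)
        importances[col] = full_score - m.score(X_te.drop(columns=col), y_te)
    return importances

# Toy frame: 'a' equals the label exactly, 'b' carries no signal
X = pd.DataFrame({'a': [0, 1] * 20, 'b': [0, 0, 1, 1] * 10})
y = pd.Series([0, 1] * 20)
imps = drop_column_importances(lambda: LogisticRegression(), X, y, X, y)
```

Dropping the informative column 'a' costs the model heavily, while dropping the noise column 'b' costs nothing, which is exactly the signal drop-column importance is meant to surface.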
"""# Objective 6: Student writes 300+ words (not including code).
# Student reports baseline score, validation scores from 2+ models, and test score from 1 selected mode
The dataset I have chosen is based on the freequency and the type of crimes that have occured between January 2008 to December 2016. It looks at the different boroughs in London and if the crime committed it considered 'major' or minor. It gives more information on the year and month that the crime occured, with the column 'value' telling us how many times it occured within that specific month. Looking at the data, I can start to brainstorm and explore if there are seasonal or time-of-week/day changes in crime occurrences? Or if there are particular crimes that mainly occur in a particular brorough and if these increase or decrease in a particular month.
Overall, my dataset is quite large, containing 13 million rows. It is a good dataset as I have a wide range of data to use, compare and analyse but I know it will have an impact on the accuracy of my data and it will be time consuming. With that, I have then decided to use a specific year, which is 2011. I chose this year due to the fact that I knew that the 2011 London Riots occured and I wanted to see if this had an effect on the frequency of crimes that occured during the year.
A baseline is a method that uses simple summary statistics which is creates predictions for a dataset. Baselines are used to measure accuracy and used as an indication to compare back to once you further analyse your data and do more accuracy testing. I kicked off with getting my baseline, which is the starting point to creating predictions for my data set. I got my baseline by using value_counts (number of occurrence of an element in a list) and it stands at 74%. My target is 'crime_occured' which is a type of binary data which represents if a crime occurred (1) or did not occur (0) in that given month. Binary data only has 2 outcomes, yes or no, truth or false. Looking at the value of my baseline, it is telling me that 74% of my data has no correlation between the time of the month and the frequency of crime committed.
I initially started off with my Test Accuracy, using classification accuracy score. Accuracy score is a type of evaluation metric which looks at the number of correct predictions over the total number of predictions. My test accuracy is 0.73%, which is 1% lower than my baseline. This indicates to me that I may need to use other machine learning algorithms to try to beat my baseline of 74%.
I then moved onto the Random Forest Classifier which is considered as a highly accurate and robust method because of the number of decision trees (predictions)it outputs. It takes the average of all the predictions, cancelling out the biases, whilst handling missing values and being able to get the feature importance, which helps in selecting the most contributing features. I then used Eli5 which is a package used in Data Science which helps to debug machine learning classifiers and explain their predictions. This tells me that lsoa_code ( Lower Super Output Area code), month and borough hold the most weight on its influence on my predictions. This supports my research question that the time of the year and borough does have an affect on the frequency of crimes that occur. I also looked into the feature importance, which allowed me to explore which features had any significance with my research question.
# Objective 7:Student makes 2+ visualizations to explain their model
"""
!pip install shap
import shap
shap.initjs()
enc = ce.OrdinalEncoder()
enc.fit(X_train)
processed_X_train = enc.transform(X_train)
X_train_clean = processed_X_train.ffill()  # forward-fill any missing values
model = RandomForestClassifier(n_estimators = 200, random_state = 6)
model.fit(X_train_clean, y_train)
row = X_train_clean.iloc[[0]]
explainerModel = shap.TreeExplainer(model)
shap_values_Model = explainerModel.shap_values(row)
shap.force_plot(base_value=explainerModel.expected_value[0],
                shap_values=shap_values_Model[0],
                features=row,
                link='logit')
# 0.71 is our accuracy