- Let's prove that by dropping the `lat` feature from our data and retraining a model.

X_train3 = X_train2.drop(columns=['lat']).copy()
X_test3 = X_test2.drop(columns=['lat']).copy()
X_train3.head()

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train3, y_train)
print(f'Train error = {1 - knn.score(X_train3, y_train):.2f}')
print(f'Test error = {1 - knn.score(X_test3, y_test):.2f}')

- We get the exact same error rate even though we just threw away one of our features!
- What is happening is that any variability in `lat` is "drowned out" by the variability in `lon` (after we've multiplied by 1000)
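- To make this concrete, here's a minimal sketch (the coordinate values below are made up for illustration): after multiplying `lon` by 1000, the Euclidean distance between two cities is almost entirely determined by `lon`

import numpy as np

a = np.array([49.0, -123000.0])  # [lat, lon * 1000] for one city (hypothetical values)
b = np.array([25.0, -80000.0])   # [lat, lon * 1000] for another
print(np.linalg.norm(a - b))          # distance using both features
print(np.linalg.norm(a[1:] - b[1:]))  # distance using lon alone -- nearly identical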

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train2, y_train)
plot_model(X_train2, y_train, knn).interactive()

## Normalization/standardization
- Here, we artificially changed the scale of our features
- But in practice, scaling issues are extremely common
- Two approaches to handle scale issues are:
    1. normalization
    2. standardization

| Approach | What it does | sklearn implementation |
|----------|--------------|-------------------------|
| normalization | sets range to $[0,1]$ | [`MinMaxScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) |
| standardization | sets mean to $0$, s.d. to $1$ | [`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) |

There are all sorts of articles on this; see, e.g. [here](http://www.dataminingblog.com/standardization-vs-normalization/) and [here](https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc).

- Let's use normalization to correct our (artificially created) scaling issue in the cities data
- Scaling functions in sklearn are similar to models, except instead of `.fit()` and `.predict()`, they have `.fit()` and `.transform()`
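- For example, here's a minimal sketch of this pattern with `MinMaxScaler` (note we `.fit()` on the training data only, then `.transform()` both sets):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                      # instantiate, just like a model
scaler.fit(X_train2)                         # learn each feature's min and max from the training data
X_train_scaled = scaler.transform(X_train2)  # rescale the training data to [0, 1]
X_test_scaled = scaler.transform(X_test2)    # rescale the test data using the *training* min and max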
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)  # kNN regression: predict the average target of the 3 nearest neighbours
knn.predict(np.atleast_2d([0, 0]))                  # predict for a new point at the origin

# 5. kNN on a real dataset (10 mins) <a id=5></a>

- Let's reload the cities dataset to see how kNN performs on a real dataset

df = pd.read_csv('data/cities_USA.csv', index_col=0)
X = df.drop(columns=['vote'])
y = df[['vote']]

- Let's use our code to plot some different kNN models (with different `k` values)
- First we will do `k=1`

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
plot_model(X, y, knn)

1 - knn.score(X, y)

- How about a larger `k`?

knn = KNeighborsClassifier(n_neighbors=20).fit(X, y)
plot_model(X, y, knn)
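- As before, we can check this model's training error

1 - knn.score(X, y)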

#### How does kNN relate to decision trees?
- Large *k* gives a "simple"-looking decision boundary, like a decision stump
- But it's not computationally simple: every prediction requires comparing against a large number of training observations! (see the timing sketch below)
- Small *k* gives a complex boundary, like a deep tree
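- Here's a minimal sketch of that cost difference (exact timings will vary by machine): for kNN, `.fit()` mostly just stores the training data, while `.predict()` does all the distance computations

import time

knn = KNeighborsClassifier(n_neighbors=20)

start = time.perf_counter()
knn.fit(X, y)
print(f"fit:     {time.perf_counter() - start:.4f} s")  # fast: essentially stores the data

start = time.perf_counter()
knn.predict(X)
print(f"predict: {time.perf_counter() - start:.4f} s")  # slower: distances to all stored points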

# -------- Break (10 mins) -------- <a id="break"></a>
## 2.1 Training Error

### max_depth=1
- Let's first create a decision tree classifier with `max_depth=1`
- Recall that to create a model:
    - we first make an instance of the model class
    - we then `.fit()` the model with some data


model = DecisionTreeClassifier(max_depth=1)
model.fit(X, y)

- We can plot our model to visualise how it is classifying our data

plot_model(X, y, model)

- For the case of `max_depth=1` we see that we have smoothed over much of the scatter in our data
- We have a very simple model!
- Recall that we can calculate the error of our model using the `.score()` method
- Previously we talked about accuracy, but typically we work in terms of error (1 - accuracy)

print(f"Error rate: {1 - model.score(X, y):.2f}")

- We get an error of 0.25
- Can we do better?

### max_depth=None
- Let's specify an unlimited `max_depth`
- I'll also now start chaining `.fit()` on a single line for succinctness

model = DecisionTreeClassifier(max_depth=None).fit(X, y)
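- As before, we can check the training error (with unlimited depth, a decision tree can typically fit the training data perfectly)

print(f"Error rate: {1 - model.score(X, y):.2f}")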

## Lecture 1 - Introduction to Machine Learning, the decision tree algorithm

### Question

df = pd.read_csv('data/cities_USA.csv', index_col=0)

Your tasks:

1. How many features are in this dataset?
2. How many observations are in this dataset?
3. Using sklearn, create 3 different decision tree classifiers using 3 different `max_depth` values based on this data
4. What is the accuracy of each classifier on the training data?
5. Visualise each classifier using the `plot_model()` code (or some other method)
    1. Which `max_depth` value would you choose to predict this data?
    2. Would you choose the same `max_depth` value to predict new data?
6. Do you think most of the computational effort for a decision tree takes place in the `.fit()` stage or `.predict()` stage?

### Solution

# 1
print(f"There are {df.shape[1]-1} features and 1 target.")
# 2
print(f"There are {df.shape[0]} observations.")
# 3/4/5
X = df.drop(columns='vote')
y = df[['vote']]
for max_depth in [1, 5, 10]:
    model = DecisionTreeClassifier(max_depth=max_depth).fit(X, y)
    print(f"max_depth={max_depth}: training accuracy = {model.score(X, y):.2f}")  # 4
    plot_model(X, y, model)  # 5
# 6
# Most of the computational effort is in .fit(), which builds the tree;
# .predict() just traverses the tree from root to leaf.
from sklearn.svm import SVC

- I am going to define a basic SVC: an SVM with a linear kernel and some `C`, which I've chosen to be 1
- **Note that in sklearn the margin width is inversely proportional to C** (opposite to how we conceptually introduced SVC earlier)
    - Large `C` reduces the margin
    - Small `C` increases the margin

model = SVC(C=1, kernel='linear')

X = pd.DataFrame({'x1': [13, 13, 13, 11, 8, 10, 10, 11, 7, 13, 6, 4, 6, 4, 6, 4, 5, 9, 3, 10],
                   'x2': [11, 7, 11, 9, 4, 8, 11, 11, 9, 12, 5, 7, 7, 9, 5, 3, 6, 6, 7, 3]})
y = pd.DataFrame({'class': ['blue']*10 + ['orange']*10})
model.fit(X, y)
plot_model(X, y, model)

- It's interesting to see our margins and how changing the value of `C` changes our classifier

plot_svc_grid(x1, y1, x2, y2, X, y, C=[0.005, 0.01, 0.1, 1])

- Choosing the best `C` is usually done with cross-validation (see the sketch after this list)
- If `C` is too large (the margin is very narrow) we might overfit our data
- If `C` is too small (the margin is very large) we might underfit our data
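- As a minimal sketch, cross-validating over `C` with `GridSearchCV` might look like this (the candidate values below are just examples):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.005, 0.01, 0.1, 1, 10]}  # candidate values (illustrative)
grid = GridSearchCV(SVC(kernel='linear'), param_grid, cv=5)
grid.fit(X, np.squeeze(y.to_numpy()))          # flatten y to a 1d array for sklearn
print(grid.best_params_)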

- We can actually visualise the decision boundary of a non-linear SVM too, but it's hard to interpret
- What happens is that the line we draw in higher dimensional space is "projected back" to the original "lower-dimensional" space

import matplotlib.pyplot as plt

model = SVC(C=1, kernel='rbf').fit(X, y)
plt.figure(figsize=(6, 6))
plot_svc(x1, y1, x2, y2, model)
from sklearn.model_selection import train_test_split

# We will always split the data as our first step from now on!
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=123)

- We won't be needing a validation set here or any cross-validation functions because we are not "tuning" any hyperparameters
- There are actually a few hyperparameters we can tune in the LogisticRegression classifier, but they usually don't impact the model much and so often aren't tuned

from sklearn.linear_model import LogisticRegression
model = LogisticRegression().fit(X_train, y_train)

- Let's plot our model to see how it's behaving

plot_model(X_train, y_train, model)

- Let's find the error rate of the model on the test data
- It's not too bad at all for such a simple, quick model!

print(f"Error rate = {1 - model.score(X_test, y_test):.2f}")

- Remember that logistic regression predicts probabilities
- So we can also get a nice map of predicted probabilities from our model
- This map looks just like our logistic function:
    - Probabilities are around 0.5 at the decision boundary
    - They increase/decrease rapidly away from the boundary

plot_model(X_train, y_train, model, predict_proba=True)
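- We can also inspect these probabilities directly with `.predict_proba()`, which returns one column per class; for example, for the first few test observations:

model.predict_proba(X_test)[:5]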

- With our cities dataset we've created a logistic regression model with two features
- So we have 3 learned parameters: a coefficient for each of the two features, plus an intercept
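- We can pull these straight out of the fitted model

print(model.coef_)       # one coefficient per feature
print(model.intercept_)  # the intercept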

- We can better visualize what's going on by actually plotting our data and the model "decision boundaries"
- The code below does just this
- It's using some code I have made myself located in the code folder on Canvas
- I'm also using the plotting library `altair` to make this plot (which you may not have seen). It makes very nice plots but requires some wrangling to get data into a suitable format for use with the package. **You do not need to learn to use Altair in this course**; all your plotting for this course may be done in `matplotlib`

import altair as alt # altair is a plotting library
import sys
sys.path.append('code/')
from model_plotting import plot_model, plot_regression_model, plot_tree_grid # these are some custom plotting scripts I made

plot_model(X, y, model)

- In this plot the shaded regions show what our model predicts for different feature values
- The scatter points are our actual 6 observations
- From the above plot, we can see that our model is misclassifying one blue observation
- But there's an easier way to find out how our model is doing
- We can predict the data using the `.predict()` method of our model
- Let's see what the model predicts for our training data `X`

model.predict(X)

- Let's compare to the actual labels (seeing as we know them)
- Note that `.to_numpy()` simply changes a dataframe to a numpy array, and `np.squeeze()` squeezes the result to a 1d array. The only reason I'm using these commands is so we can easily compare the output to the output of `.predict()` above.

np.squeeze(y.to_numpy())

- We can see that our model correctly predicts 5 out of 6 points (the first one is misclassified)
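- We could also check this programmatically by comparing the two arrays element-wise

model.predict(X) == np.squeeze(y.to_numpy())  # True where the prediction matches the label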