- Let's prove that by dropping the `lat` feature from our data and retraining a model.

```python
X_train3 = X_train2.drop(columns=['lat']).copy()
X_test3 = X_test2.drop(columns=['lat']).copy()
X_train3.head()
```

```python
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train3, y_train)
print(f'Train error = {1 - knn.score(X_train3, y_train):.2f}')
print(f'Test error = {1 - knn.score(X_test3, y_test):.2f}')
```

- We get the exact same error rate even though we just threw away one of our features!
- What is happening is that any variability in `lat` is "drowned out" by the variability in `lon` (after we've multiplied by 1000)

```python
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train2, y_train)
plot_model(X_train2, y_train, knn).interactive()
```

## Normalization/standardization

- Here, we artificially changed the scale of our features
- But in practice, scaling issues are extremely common
- Two approaches to handle scale issues are:
    1. normalization
    2. standardization

| Approach | What it does | sklearn implementation |
|----------|--------------|------------------------|
| normalization | sets range to $[0,1]$ | [`MinMaxScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) |
| standardization | sets mean to $0$, s.d. to $1$ | [`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) |

There are all sorts of articles on this; see, e.g., [here](http://www.dataminingblog.com/standardization-vs-normalization/) and [here](https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc).

- Let's use normalization to correct our (artificially created) scaling issue in the cities data
- Scaling functions in sklearn are similar to models, except instead of `.fit()` and `.predict()`, they have `.fit()` and `.transform()` (see the sketch below)
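Here is a minimal sketch of that workflow, assuming the artificially rescaled `X_train2`/`X_test2` splits from above; the key point is that the scaler is fit on the training data only and then applied to both splits:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                      # rescales each feature to [0, 1]
scaler.fit(X_train2)                         # learn each feature's min/max from the training data only
X_train_scaled = scaler.transform(X_train2)  # apply the same transformation to both splits
X_test_scaled = scaler.transform(X_test2)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train_scaled, y_train)
print(f'Test error = {1 - knn.score(X_test_scaled, y_test):.2f}')
```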
```python
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
knn.predict(np.atleast_2d([0, 0]))
```

# 5. kNN on a real dataset (10 mins) <a id=5></a>

- Let's reload the cities dataset to see how kNN performs on a real dataset

```python
df = pd.read_csv('data/cities_USA.csv', index_col=0)
X = df.drop(columns=['vote'])
y = df[['vote']]
```

- Let's use our code to plot up some different kNN models (with different `k` values)
- First we will do `k=1`

```python
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
plot_model(X, y, knn)
```

```python
1 - knn.score(X, y)
```

- How about a larger `k`?

```python
knn = KNeighborsClassifier(n_neighbors=20).fit(X, y)
plot_model(X, y, knn)
```

#### How does kNN relate to decision trees?

- Large *k* is "simple" like a decision stump
    - It's not actually computationally simple, though, because we still have to compare against a large number of observations!
- Small *k* is like a deep tree (see the short sketch after the break marker below)

# -------- Break (10 mins) -------- <a id="break"></a>
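To make the *k*-vs-complexity analogy concrete, here is a quick sketch (not part of the original notebook) that computes the training error on the cities data for a few values of `k`; much like growing a deeper tree, shrinking `k` typically drives the training error down:

```python
# training error for a few choices of k (smaller k = more complex model)
for k in [1, 5, 20, 50]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"k={k}: training error = {1 - knn.score(X, y):.2f}")
```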
## 2.1 Training Error

### max_depth=1

- Let's first create a decision tree classifier with `max_depth=1`
- Recall that to create a model:
    - we first make an instance of the model class
    - we then `.fit()` the model with some data

```python
model = DecisionTreeClassifier(max_depth=1)
model.fit(X, y)
```

- We can plot our model to visualise how it is classifying our data

```python
plot_model(X, y, model)
```

- For the case of `max_depth=1` we see that we have smoothed over much of the scatter in our data
- We have a very simple model!
- Recall that we can calculate the accuracy of our model using the `.score()` method
- Previously we talked about accuracy, but typically we work in terms of error (1 - accuracy)

```python
print(f"Error rate: {1 - model.score(X, y):.2f}")
```

- We get an error of 0.25
- Can we do better?

### max_depth=None

- Let's specify an unlimited `max_depth`
- I'll also now start chaining `.fit()` on a single line for succinctness

```python
model = DecisionTreeClassifier(max_depth=None).fit(X, y)
```
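As before, we can plot this deeper tree and check its training error; a fully grown tree typically fits the training data perfectly, so we'd expect an error at or near 0:

```python
plot_model(X, y, model)
print(f"Error rate: {1 - model.score(X, y):.2f}")  # a fully grown tree usually gets ~0 training error
```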
```python
import sys
sys.path.append('code/')
from model_plotting import plot_model, plot_regression_model, plot_tree_grid  # these are some custom plotting scripts I made
```

## Lecture 1 - Introduction to Machine Learning, the decision tree algorithm

### Question

```python
df = pd.read_csv('data/cities_USA.csv', index_col=0)
```

Your tasks:

1. How many features are in this dataset?
2. How many observations are in this dataset?
3. Using sklearn, create 3 different decision tree classifiers using 3 different `max_depth` values based on this data
4. What is the accuracy of each classifier on the training data?
5. Visualise each classifier using the `plot_model()` code (or some other method)
    1. Which `max_depth` value would you choose to predict this data?
    2. Would you choose the same `max_depth` value to predict new data?
6. Do you think most of the computational effort for a decision tree takes place in the `.fit()` stage or the `.predict()` stage?

### Solution

```python
# 1
print(f"There are {df.shape[1]-1} features and 1 target.")
```

```python
# 2
print(f"There are {df.shape[0]} observations.")
```

```python
# 3/4/5
X = df.drop(columns='vote')
y = df[['vote']]
for max_depth in [1, 5, 10]:
    model = DecisionTreeClassifier(max_depth=max_depth).fit(X, y)
```
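The loop above only fits each tree; a minimal sketch of how its body might continue to cover tasks 4 and 5 (this completion is illustrative, not necessarily the original solution):

```python
for max_depth in [1, 5, 10]:
    model = DecisionTreeClassifier(max_depth=max_depth).fit(X, y)
    # 4: training accuracy for this depth
    print(f"max_depth={max_depth}: training accuracy = {model.score(X, y):.2f}")
    # 5: visualise the fitted classifier's decision boundaries
    plot_model(X, y, model)
```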
```python
from sklearn.svm import SVC
```

- I am going to define a basic SVC: an SVM with a linear kernel and some `C`, which I've chosen to be 1
- **Note that in sklearn the margin width is inversely proportional to `C`** (opposite to how we conceptually introduced SVC earlier)
    - Large `C` reduces the margin
    - Small `C` increases the margin

```python
model = SVC(C=1, kernel='linear')
X = pd.DataFrame({'x1': [13, 13, 13, 11, 8, 10, 10, 11, 7, 13, 6, 4, 6, 4, 6, 4, 5, 9, 3, 10],
                  'x2': [11, 7, 11, 9, 4, 8, 11, 11, 9, 12, 5, 7, 7, 9, 5, 3, 6, 6, 7, 3]})
y = pd.DataFrame({'class': ['blue']*10 + ['orange']*10})
model.fit(X, y)
plot_model(X, y, model)
```

- It's interesting to see our margins and how changing the value of `C` changes our classifier

```python
plot_svc_grid(x1, y1, x2, y2, X, y, C=[0.005, 0.01, 0.1, 1])
```

- Choosing the best `C` is usually done with cross-validation (see the sketch at the end of this section)
    - If `C` is too large (the margin is very narrow) we might overfit our data
    - If `C` is too small (the margin is very large) we might underfit our data
- We can actually visualise the decision boundary of a non-linear SVM too, but it's hard to interpret
- What happens is that the line we draw in higher-dimensional space is "projected back" to the original "lower-dimensional" space

```python
model = SVC(C=1, kernel='rbf').fit(X, y)
plt.figure(figsize=(6, 6))
plot_svc(x1, y1, x2, y2, model)
```
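To show what that cross-validation over `C` might look like, here is a minimal sketch using `GridSearchCV`; the candidate `C` values and the 5-fold setting are illustrative choices, not from the original lecture:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# try a few candidate values of C and keep the one with the best cross-validated accuracy
param_grid = {'C': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVC(kernel='linear'), param_grid, cv=5)
grid.fit(X, y.to_numpy().ravel())  # ravel() flattens the single-column DataFrame to a 1d label array
print(grid.best_params_)
```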
```python
# We will always split the data as our first step from now on!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
```

- We won't be needing a validation set here or any cross-validation functions because we are not "tuning" any hyperparameters
- There are actually a few hyperparameters we can tune in the LogisticRegression classifier, but they usually don't impact the model much and so often aren't tuned

```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression().fit(X_train, y_train)
```

- Let's plot our model to see how it's behaving

```python
plot_model(X_train, y_train, model)
```

- Let's find the error rate of the model on the test data
- It's not too bad at all for such a simple, quick model!

```python
print(f"Error rate = {1 - model.score(X_test, y_test):.2f}")
```

- Remember that logistic regression predicts probabilities
- So we can also get a nice map of predicted probabilities from our model
- This map looks just like our logistic function:
    - Probabilities are around 0.5 at the decision boundary
    - They increase/decrease rapidly away from the boundary

```python
plot_model(X_train, y_train, model, predict_proba=True)
```

- With our cities dataset we've created a logistic regression model with two features
- So we have 3 coefficients:
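A minimal sketch of how to inspect those 3 coefficients (one weight per feature plus the intercept) from the fitted sklearn model:

```python
# one weight per feature (2 here), plus the intercept, gives 3 coefficients
print(model.coef_)       # feature weights
print(model.intercept_)  # intercept
```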
```python
# (this fragment continues a tree-visualization call from the previous cell)
feature_names=X.columns, class_names=["blue", "red"], impurity=False))
```

- We can better visualize what's going on by actually plotting our data and the model's "decision boundaries"
- The code below does just this
- It's using some code I have made myself, located in the code folder on Canvas
- I'm also using the plotting library `altair` to make this plot (which you may not have seen). It makes very nice plots but requires some wrangling to get data into a suitable format for use with the package. **You do not need to learn to use Altair in this course**; all your plotting for this course may be done in `matplotlib`

```python
import altair as alt  # altair is a plotting library
import sys
sys.path.append('code/')
from model_plotting import plot_model, plot_regression_model, plot_tree_grid  # these are some custom plotting scripts I made
plot_model(X, y, model)
```

- In this plot the shaded regions show what our model predicts for different feature values
- The scatter points are our actual 6 observations
- From the above plot, we can see that our model is misclassifying one blue observation
- But there's an easier way to find out how our model is doing
- We can predict the data using the `.predict()` method of our model
- Let's see what the model predicts for our training data `X`

```python
model.predict(X)
```

- Let's compare to the actual labels (seeing as we know them)
- Note that `.to_numpy()` simply changes a dataframe to a numpy array, and `np.squeeze()` squeezes the result to a 1d array. The only reason I'm using these commands is so we can easily compare the output to the output of `.predict()` above.

```python
np.squeeze(y.to_numpy())
```

- We can see that our model correctly predicts 5 out of 6 points (the first one is misclassified)
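Rather than comparing the two arrays by eye, here is a quick sketch of turning that comparison into a single number (the fraction of correct predictions):

```python
# fraction of training points the model gets right (5 out of 6 here)
(model.predict(X) == np.squeeze(y.to_numpy())).mean()
```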