an amalgamation of all my machine learning training and OO Python to demonstrate functionality


Object Oriented Machine Learning in Python

Data Preprocessing:

Importing dataset:

  • choose all cols but outputs for X, rest is y

Missing Data:

  • sklearn.impute.SimpleImputer
  • avoid string cols
  • missing_values = np.nan
  • strategy = 'mean'

String Data:

  • 1 hot encode so no relation between strs
  • do not scale these features afterwards

Column Transformer: for X


  • transform type = 'encoder'
  • encoding class = OneHotEncoder()
  • col index = [i]
  • remainder = 'passthrough'
  • transformers = [(transform type, encoding class, col index)]

Label encoder: for y


  • no/yes -> 0/1

Split into Train and Test sets


  • test_size ∈ (0, 1), usually ~0.2
  • random_state = constant -> same split each time
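
A minimal pure-Python sketch of what a seeded split does, assuming a simple shuffle-then-cut approach. The helper name `split_train_test` is hypothetical and is not the sklearn function; it only shows why a constant random_state reproduces the same split.

```python
import random

def split_train_test(rows, test_size=0.2, random_state=0):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(rows)))
    random.Random(random_state).shuffle(indices)  # same seed -> same order
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))
train_a, test_a = split_train_test(rows, test_size=0.2, random_state=42)
train_b, test_b = split_train_test(rows, test_size=0.2, random_state=42)
```

Calling it twice with the same random_state gives identical train and test sets, which is what makes results comparable between runs.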

Feature Scaling

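  • after splitting, else info from the test set leaks into the scaler and biases the evaluation
  • fit: get mean and SD
  • transform: apply the stored mean and SD (the standardization formula)
  • fit_transform: fit, then transform, in one call (use on X_train)
  • not on dummy/1hot cols, else the 0/1 encoding is lost
  • X_test: don't fit -> treat as fresh data -> use the training set scaler on the test set (transform only)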



Standardization:

  • (x - mean(x)) / SD(x)
  • mostly ∈ [-3, 3], since almost all values lie within 3 SD of the mean
  • works all the time, best choice


Normalization:

  • (x - min(x)) / (max(x) - min(x))
  • ∈ [0, 1], good for normal distributions
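
A small sketch of both scaling formulas in plain Python, fitted on the training column and then reused on the test column. The helper names are hypothetical, and the population SD (`pstdev`) is an assumption chosen to mirror what a typical standard scaler computes.

```python
from statistics import mean, pstdev

def fit_standardizer(column):
    """fit: learn mean and SD from the training column only."""
    return mean(column), pstdev(column)

def standardize(column, mu, sd):
    """transform: apply (x - mean) / SD with the stored values."""
    return [(x - mu) / sd for x in column]

def fit_min_max(column):
    return min(column), max(column)

def min_max(column, lo, hi):
    """transform: apply (x - min) / (max - min) with the stored values."""
    return [(x - lo) / (hi - lo) for x in column]

x_train = [10.0, 20.0, 30.0, 40.0]
x_test = [25.0, 50.0]

mu, sd = fit_standardizer(x_train)
x_train_std = standardize(x_train, mu, sd)
x_test_std = standardize(x_test, mu, sd)   # no re-fit on the test set

lo, hi = fit_min_max(x_train)
x_train_mm = min_max(x_train, lo, hi)
x_test_mm = min_max(x_test, lo, hi)        # may fall outside [0, 1]
```

Because the test set is scaled with training statistics, a test value larger than the training max lands above 1 after min-max scaling; that is expected, not a bug.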


Simple Linear

Pros: works on any size of dataset, gives information about relevance of features
Cons: relies on the Linear Regression Assumptions

Assumptions of Linear Regression:

  • Linearity
  • Homoscedasticity (same variance for all values)
  • Multivariate Normality (normally distributed across all independent vars)
  • Independence of errors
  • Lack of Multi-Collinearity (no relation between "independent" vars)
    • when we have a set of string inputs, like states, we can assign them
      Dummy Vars (one hot columns), but then D1 = 1 - D2 - D3 - ..., so if we
      have n different entries in the column, only use n-1 dummy vars,
      since the dropped one is absorbed by the bias term anyway (Dummy Var Trap)


P-value:

  • probability of getting a sample like ours, or more extreme than ours, IF the null hypothesis is true
    •   Assume the null hypothesis is true, then determine how "strange" our sample
        really is.
        If not that strange (a large p-value), then the assumption is ok.
        As the p-value gets smaller (below the significance level), we may reject the null hypothesis.

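LinearRegression still fits when every dummy column is kept (its least-squares solver tolerates the redundant column), so the Dummy Var Trap does not break predictions. It does not compute P-values or select features by them; that needs a separate tool such as statsmodels.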

if we wanted to manually remove dummy var, doing the whole model in pure python, we would consider X = X[:,1:]

Multiple Linear

5 methods to build a model (middle 3 are step-wise regression):

All-in:

  • throw in all verifiable valid vars
  • requires prior knowledge

Backward Elimination:

  • select significance level (ie SL = 0.05)
  • fit the full model with all possible predictors
  • while (vars in model):
    • consider the predictor with highest P
    • if (P > SL): Remove the predictor, Fit model without this variable
    • else: model is ready, break

Forward Selection:

  • select significance level (ie SL = 0.05)
  • fit all basic models with all predictors (y~xn), choose the one with minimum P
  • while (vars NOT in model):
    • keep this var and find next smallest P when vars added to current model
    • if (P < SL): Add the predictor, Fit model with this variable
    • else: model is ready, break

Bidirectional Elimination:

  • Choose SL-Enter and SL-Stay
  • while (vars can be added to model):
    • ForwardSelect(SL=SL-Enter)
    • BackwardSelect(SL=SL-Stay)

Brute Force:

  • select criterion of goodness of fit
  • construct 2^(numVars)-1 possible regression models, choose the best one

Polynomial Linear

Pros: works on any size of dataset, works very well on non linear problems
Cons: need to choose the right polynomial degree for a good bias/variance trade-off

linear refers to the model being linear in its coefficients, even though it is polynomial in x


transforms the feature vector x into a matrix of its powers (1, x, x^2, ..., x^n) to get poly regression; used here only on single feature sets
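
A minimal sketch of that transformation for a single feature, in plain Python. The function name `polynomial_features` is hypothetical; it only illustrates the shape of the resulting matrix, on which an ordinary linear regression can then be fitted.

```python
def polynomial_features(x, degree):
    """Turn each value into the row [x^0, x^1, ..., x^degree]."""
    return [[value ** power for power in range(degree + 1)] for value in x]

rows = polynomial_features([1.0, 2.0, 3.0], degree=3)
# each row holds the bias column followed by x, x^2 and x^3
```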

Support Vector (SVR)

Pros: easily adaptable, works very well on non linear problems, not biased by outliers
Cons: compulsory to apply feature scaling, not well known, more difficult to understand

The trend-line has an epsilon-insensitive tube, ie a width E within which any error is ignored; for points outside, the error is measured from the edge of the tube around the trend-line, and the sum of these errors is minimized to get the best trend-line.
The points outside the tube are the SVs, defining the shape of the E-I tube. The errors of the ones above are Ei*, and of the ones below are Ei.
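
As a sketch of the tube idea, the error of a single point can be written as the distance beyond the tube edge, and zero inside it. The function name is hypothetical and this is only the per-point error, not a full SVR fit.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tube, distance to the tube edge outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

inside = epsilon_insensitive_loss(10.0, 10.05, epsilon=0.1)   # within the tube
outside = epsilon_insensitive_loss(10.0, 10.5, epsilon=0.1)   # 0.4 past the edge
```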

kernel = 'rbf', # Radial Basis Function Kernel

  • (function measuring euclidean distances radially around some center, as a basis)
  • used for SVR typically (does not catch outliers well, but is better for polynomial trends than the linear kernel)
  • a point's weight decays exponentially with its squared euclidean distance from the center, scaled by the variance: K(x, c) = exp(-||x - c||^2 / (2 sigma^2))
    • thus we can get a circumference as a cutoff hyperplane (the variance can be changed here)
  • we can also add different RBF equations for more complex mapping
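
A short sketch of the Gaussian RBF kernel, assuming the common form exp(-||x - c||^2 / (2 sigma^2)). The function and parameter names are illustrative; in sklearn the `gamma` parameter plays the role of 1 / (2 sigma^2).

```python
import math

def rbf_kernel(x, center, sigma):
    """Similarity in (0, 1]: 1 at the center, decaying with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-squared_distance / (2 * sigma ** 2))

at_center = rbf_kernel([0.0, 0.0], [0.0, 0.0], sigma=1.0)
far_away = rbf_kernel([3.0, 4.0], [0.0, 0.0], sigma=5.0)   # distance 5
```

A larger sigma widens the circle of points that still count as "close" to the center.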

Non Linear SVM = using the RBF kernel to lift the nD data to a higher dimensionality ((n+1)D), where we can cast a hyperplane onto the shape, with a top and bottom hyperplane at distance epsilon away, between which we ignore error.

Decision Tree

Pros: interpretability, no need for feature scaling, works on both linear / nonlinear problems
Cons: poor results on small datasets, over-fitting can easily occur

no feature scaling since it relies on splits of the data, not an equation with a scale of data (thus no mean nor SD)

We split the n-D feature space into segments (leaves), each split testing one feature against a cutoff value.
Each node in the decision tree applies a different check.
We assign each segment the avg value of its training points, and any new point that lands in the segment is given that value

Random Forest

Pros: powerful and accurate, good performance on many problems, including non linear
Cons: no interpretability, over-fitting can easily occur, need to choose the number of trees

When we want the output of a new input, assign it the avg output across all Decision Trees, thus getting a very stable result not prone to extraneous outliers (higher likelihood of correctness)

for i in range (number of N-Trees, ie 500):
  Pick at random K data points from Training Set
  Build a Decision Tree on these K data points

sum of squares of residuals SSres: total squared error off the trend-line,

    we want SSres -> 0

sum of squares of total SStot: total squared error off the avg

R^2: goodness of fit, ∈ [0, 1] when the trend-line is at least as good as the avg line, ideally 1.

R^2 = 1 - (SSres/SStot)

Adjusted R-Squared: adding new vars to the model never decreases R^2, since any new var has at least a slight random correlation with the output (non zero factor) and so can only lower SSres; R^2 alone therefore cannot tell us whether a new var actually helps

n = sample size, p = numRegressors
Adj R^2 = 1 - (1 - R^2)(n-1)/(n-p-1) thus inhibiting adding too many regressors
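
The two formulas can be checked on a tiny made-up example. The data below is invented for illustration and the function names are hypothetical.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SSres / SStot."""
    avg = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - avg) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalize R^2 for the number of regressors p, given sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y_true, y_pred)                # SSres = 0.10, SStot = 10
adj_one = adjusted_r_squared(r2, n=5, p=1)
adj_two = adjusted_r_squared(r2, n=5, p=2)    # same R^2, one more regressor
```

With R^2 held fixed, the adjusted value drops as p grows, which is the penalty for adding regressors that do not improve the fit.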

To adjust for over-fitting:

add a penalty on the model coefficients, weighted by a tunable hyperparameter lambda, to the error being minimized

  • Ridge Regression: adding the factor lambda * Sum(model_coefficients^2)
  • Lasso: adding the factor lambda * Sum(abs(model_coefficients))
  • Elastic Net: use both the Ridge and Lasso factors with different lambdas, reducing over-fitting


Logistic Regression

classifying into groups, ie yes and no, or cat vs dog, as a probability


we can thus apply a sigmoid function to the linear output to get a probability, p = 1 / (1 + e^-(b0 + b1*x)),
and inverting the sigmoid gives back the linear part: ln(p / (1 - p)) = b0 + b1*x
then we project y^ to the closest classification (if y^ < 0.5 then y^ -> 0, else y^ -> 1)

classification boundary is linear because the log-odds are linear in the features
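
A small sketch of the sigmoid, its inverse (the log-odds), and the 0.5 cutoff. The coefficients b0 and b1 below are made-up values for illustration, not fitted ones.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: probability back to log-odds."""
    return math.log(p / (1 - p))

def predict_class(b0, b1, x, threshold=0.5):
    """Project the probability to the closest class."""
    return 1 if sigmoid(b0 + b1 * x) >= threshold else 0

# hypothetical coefficients: the boundary sits where b0 + b1*x = 0, ie x = 4
low = predict_class(b0=-4.0, b1=1.0, x=3.0)
high = predict_class(b0=-4.0, b1=1.0, x=5.0)
```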

  • C: Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization, counters over-fitting
    • can be as high as 1e5 if not worried about overfitting

K nearest Neighbors

  • nonlinear
  • metric = 'minkowski' for distance metric to use for the tree.
    • with p=2 is equivalent to the standard Euclidean metric
  • n_neighbors = 5 neighbors by default


Choose the number K of neighbors (ie 5),
take the K nearest neighbors of the new data point, by Euclidean distance.
Count the classifications of these neighbors,
assign the data point to the group with the most neighbors
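
The steps above can be sketched in a few lines of plain Python. The function name `knn_predict` and the toy points are invented for illustration; `math.dist` needs Python 3.8 or newer.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    """Majority vote among the k training points closest to new_point."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], new_point),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(points, labels, (2, 2), k=3)
near_b = knn_predict(points, labels, (6.5, 6.5), k=3)
```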

Support Vector Machine (SVM)

  • assume the data is linearly separable

Intuition:

    finding the best decision boundary: the one with the Maximum margin, ie the max distance to the closest points on either side (like the tube around the trend-line); these closest points are the support vectors

Maximum Margin Hyperplane: maximum margin classifier line in nD, sits midway between the positive and negative hyperplanes

Kernel SVM

how to choose the right classification method

  • sometimes linearly inseparable sets in nD are linearly separable in higher dimensions
    • (as lines, planes, or hyperplanes)
    • ie x->(x-5)^2

Types of Kernel Functions: linear, rbf, poly, sigmoid

Non Linear SVM: using RBF kernel

  • shifts the nD plane to higher dimensionality ((n+1)D)
  • then we can cast a hyperplane onto the shape,
    • with a top and bottom hyperplane at distance epsilon away, between which we ignore error.

Naive Bayes


Posterior Probability: probability of y given X, P(y|X) = P(X|y) * P(y) / P(X)
P(y) = prior probability = number in classification y / total classified
P(X) = marginal likelihood = # similar Observations / Total observations, is the same for y and !y, can be factored out
P(X|y) = Likelihood = # similar observations among classification y / Total classified as y
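
A worked example of the counting formulas above, using invented numbers: 30 observations, 10 of class y and 20 of class !y, with 4 observations similar to the new point (3 from y, 1 from !y). `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

def posterior(similar_in_class, class_count, similar_total, total):
    """P(y|X) = P(X|y) * P(y) / P(X), all estimated from counts."""
    likelihood = Fraction(similar_in_class, class_count)   # P(X|y)
    prior = Fraction(class_count, total)                   # P(y)
    marginal = Fraction(similar_total, total)              # P(X)
    return likelihood * prior / marginal

p_y = posterior(similar_in_class=3, class_count=10, similar_total=4, total=30)
p_not_y = posterior(similar_in_class=1, class_count=20, similar_total=4, total=30)
predicted = "y" if p_y > p_not_y else "!y"
```

Since P(X) is the same in both calls, comparing only likelihood * prior would give the same winner.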

Naive since it assumes the features are independent of each other; if they are not, correlated features get counted more than once and skew the probabilities

  • I.e. feature1 = 1 - all other features

  • priors: None so that there is no probability of one class over another predefined

  • var_smoothing: Portion of the largest variance of all features that is added to variances for calculation stability

Decision Tree


CART = Classification And Regression Trees
The rest is the same as Regression Decision Tree
  • Criterion: The function to measure the quality of a split. Supported criteria are
    • "gini" for the Gini impurity
    • "entropy" for the information gain.

Random Forest


The basic intuition is the same as Regression Random Forest

Checking Correctness

False Positives and Negatives: predictions whose sigmoid output is near the center (~0.5) are the easiest to get wrong

Confusion Matrix: counting the occurrence of each kind of right and wrong prediction:

[[#trueNeg, #falsePos], [#falseNeg, #truePos]]

Accuracy rate: correct / total, Error rate = 1 - accuracy rate
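
A minimal sketch of building the 2x2 matrix and the accuracy rate by hand for 0/1 labels, with rows as actual class and columns as predicted class. The labels below are made up for illustration.

```python
def confusion_matrix(y_true, y_pred):
    """Rows = actual class, columns = predicted class (0 = neg, 1 = pos)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

def accuracy(matrix):
    """Correct predictions (the diagonal) over all predictions."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(cm)
error_rate = 1 - acc
```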

Accuracy Paradox: if y occurs much more often than !y, then always choosing y can score a higher accuracy than the model, which defeats the purpose of having a model, so accuracy alone can mislead

CAP (Cumulative Accuracy Profile) Curve: plots the share of positive outcomes (y) captured against the share of the sample used, when the sample is ordered by the model (ie targeting specific demographics first to boost performance). There is a peak operating point where the slope of the model is maximized (total performance/sample size). Peak performance would be to include only those who choose y and nothing otherwise.

CAP Analysis: Random Model Line (R), Perfect Model Line (P), and Good model line (G) in between.

Accuracy ratio = (area between G and R) / (area between P and R); as a quick check, read the y value of G at 50% sample size,

  • if output is less than 60%, then bad model, after which it improves diminishingly.
  • If near 90-100%, then likely over-fitting

ROC (Receiver Operating Characteristic) is not the same as CAP


K-means Clustering

Intuition: how to identify clusters of features

Choose the number of clusters, K
Select any K random points (not necessarily from dataset) as the centroids
Assign each data point to the closest centroid (forms K clusters)
while (any data point was reassigned):
    Compute and place the new centroid of each cluster
    Reassign each data point to the new closest centroid
  • Caveat: Do we use Euclidean Distance or something else
  • Random Initialization Trap: if we choose centroids poorly initially, the final clusters can be objectively incorrect
    • Solution: K-means++
  • wcss: Within Cluster Sum of Squares, converges to 0 as numClusters -> numPoints, so we choose the optimal numClusters at the pivot point of the exponential relationship (elbow)
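
A compact sketch of the loop above in plain Python, assuming Euclidean distance and caller-supplied starting centroids (so no K-means++ here). The function names and toy points are illustrative only.

```python
import math

def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid update until nothing changes."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:   # no reassignment -> converged
            break
        assignments = new_assignments
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                      # keep old centroid if cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments

def wcss(points, centroids, assignments):
    """Within Cluster Sum of Squares for a finished clustering."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, assignments = k_means(points, [(0, 0), (10, 10)])
score = wcss(points, centroids, assignments)
```

Running this for several values of K and plotting `score` against K gives the elbow chart described above.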

Hierarchical Clustering


assign each point as a cluster
while numClusters > 1:
    combine closest clusters (Euclidean distance or otherwise)
     (between centroids, closest or farthest points, avg dist)

Dendrogram: stores grouping memory, plotting the points as the values on the x-axis, connecting various points at the computed dissimilarity between them (the height of the bar)

  • A threshold dissimilarity can be set to limit the minimum numClusters, best split is at largest single height in chart (greatest dissimilarity)

  • ward: min variance method

Association Rule Learning

people who bought x also bought y


  • I sort these by lifts


support(x) = #transactions containing x / #transactions
confidence(x -> x_2) = #transactions containing x and x_2 / #transactions containing x
lift(x -> x_2) = confidence(x -> x_2) / support(x_2), the improvement of the prediction over picking x_2 at random
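
A small sketch of the three measures on an invented basket list. Note that lift here divides by the support of the consequent x_2, which is the standard definition; the function names are hypothetical.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, x, x_2):
    """Of the transactions containing x, the share that also contain x_2."""
    return support(transactions, {x, x_2}) / support(transactions, {x})

def lift(transactions, x, x_2):
    """How much knowing x improves on the baseline rate of x_2."""
    return confidence(transactions, x, x_2) / support(transactions, {x_2})

transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

s = support(transactions, {"bread"})
c = confidence(transactions, "bread", "butter")
l = lift(transactions, "bread", "butter")
```

A lift above 1 means buyers of x pick x_2 more often than the average customer does.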


  1. set a minimum support and confidence
  2. take all subsets of transactions that have valid support
    1. and all the rules of these subsets with valid confidence
  3. sort by decreasing lift


simple apriori, determining sets of relations

  • only considering supports, sorted decreasingly (no confidences or lifts)
  • I sort these by supports

Reinforcement Learning

Upper Confidence Bound (UCB)


we have d devices to monitor; at each round n we pick a device i, which gives reward r_i(n) ∈ {0, 1},
with r_i(n) = 1 if the user took action, and r_i(n) = 0 if not. We want to maximize
 the total reward over many rounds
  • Deterministic, Requires update at every round
  • we have N rounds of data, but we want to see how few rounds we need to validly identify the best device

N_i(n): # times i was selected up to round n
R_i(n): sum of rewards of i up to round n
r*_i(n): avg reward of i up to round n

    r*_i(n) = R_i(n) / N_i(n)

delta_i(n): half-width of the confidence interval at round n

    delta_i(n) = sqrt(3 log(n) / (2 N_i(n)))

confidence interval

[r*_i(n) - delta_i(n), r*_i(n) + delta_i(n)] (aka [LCB, UCB]), 
we select i with max UCB
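
A short simulation sketch of this selection rule. The reward rates, the seed, and the choice to try every device once before using the bound are assumptions made for the example, not part of the notes.

```python
import math
import random

def ucb_select(counts, sums, round_number):
    """Pick the device with the highest upper confidence bound."""
    for i, count in enumerate(counts):
        if count == 0:          # assumption: try each device once first
            return i

    def upper_bound(i):
        average = sums[i] / counts[i]
        delta = math.sqrt(3 * math.log(round_number) / (2 * counts[i]))
        return average + delta

    return max(range(len(counts)), key=upper_bound)

def run_ucb(true_rates, rounds, seed=0):
    """Simulate Bernoulli rewards and return how often each device was picked."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)   # N_i(n)
    sums = [0] * len(true_rates)     # R_i(n)
    for n in range(1, rounds + 1):
        i = ucb_select(counts, sums, n)
        reward = 1 if rng.random() < true_rates[i] else 0
        counts[i] += 1
        sums[i] += reward
    return counts

counts = run_ucb([0.1, 0.3, 0.7], rounds=2000)
best = counts.index(max(counts))
```

The weaker devices are still sampled now and then because their bound widens as n grows, but the large majority of rounds go to the best one.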

Thompson Sampling

Same Intuition as UCB

We are creating distribution of where the expected values lie

  • probabilistic, not deterministic like UCB
  • Can accommodate delayed feedback

Bayesian Inference: derives the posterior probability as a consequence of two antecedents:

  • a prior probability
  • a "likelihood function" derived from a statistical model for the observed data.

Bayesian inference computes the posterior probability according to Bayes' theorem.

Ad i gets reward y from a Bernoulli distribution

    p(y|theta_i) ~ Bernoulli(theta_i)

theta_i is unknown, but we set its uncertainty by assuming a uniform prior distribution

    p(theta_i) ~ U([0,1])

Bayes Rule: approach theta_i by the posterior distribution: p(theta_i|y),

    p(theta_i|y) = ( p(y|theta_i) * p(theta_i) ) / ( integral(p(y|theta_i) * p(theta_i)) d theta_i )
    p(theta_i|y) ∝ ( p(y|theta_i) * p(theta_i) ), aka (likelihood function * prior distribution)
    p(theta_i|y) ~ Beta(numSuccess + 1, numFail + 1),
    at each round n we draw a random theta_i(n) from p(theta_i|y), for each ad i
    At each round n, select i with highest theta_i(n).
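
A simulation sketch of these rounds using the standard library's Beta sampler. The click rates, seed, and function name are invented for the example.

```python
import random

def run_thompson(true_rates, rounds, seed=0):
    """Sample theta_i from Beta(successes + 1, failures + 1) and play the max."""
    rng = random.Random(seed)
    d = len(true_rates)
    successes = [0] * d
    failures = [0] * d
    for _ in range(rounds):
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(d)]
        i = max(range(d), key=lambda j: samples[j])
        if rng.random() < true_rates[i]:   # simulated Bernoulli reward
            successes[i] += 1
        else:
            failures[i] += 1
    return successes, failures

successes, failures = run_thompson([0.1, 0.3, 0.7], rounds=2000)
selections = [s + f for s, f in zip(successes, failures)]
best = selections.index(max(selections))
```

Because the posterior only needs the success and failure counts, the update can be applied in batches, which is why delayed feedback is tolerable here.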

Natural Language Processing

Bag of Words Model: tokenization

  • preprocess text before classification
  • involving vocab of known words
  • measure of the presence of known words

stopwords: words that don't add meaning, like the, I, etc.

quoting = 3: ignore double quotes when reading the file (csv.QUOTE_NONE)

PorterStemmer: only consider the stem of the word, replace punctuation with spaces

max_features: how many words to include, thus letting us remove unnecessary words like names

Deep Learning

Artificial Neural Networks (ANN)

activation functions: sigmoid (used in output layer), rectifier (used in hidden layers), tanh, threshold.

cost function: measuring error between y (real) and y^ (predicted) =

batch gradient descent: deterministic, updates the weights after all data runs
stochastic gradient descent: random, updates the weights dynamically (after each row)
stochastic gradient descent is less likely to get stuck in local minimums, so it is more likely to find the global minimum, and is faster
back-propagation: adjusting all weights compared to the error matrix simultaneously for max speed

Example code with comments:

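# imports assumed from the tf.keras API used in the CNN example; adjust to your Keras install
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

classifier = Sequential()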

# Adding the input layer and the first hidden layer
classifier.add(Dense(6, activation='relu', input_dim=11, kernel_initializer='uniform'))  # rectifier activation func, input dim seems unnecessary

# Adding the second hidden layer
classifier.add(Dense(6, activation='relu', kernel_initializer='uniform'))  # uniform weight dist around 0

# Adding the output layer
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='uniform'))  # sigmoid for output probability
# classifier.add(Dense(n, activation='softmax', kernel_initializer='uniform'))  # softmax for nD dependant output probability

# Compiling the ANN
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# adam is a good stochastic gradient descent function, less prone to local min
# loss function is log loss func to account for sigmoid output
# metrics: accuracy is reported during training (it does not drive the weight updates)

# Part 3 - Training the ANN
# Training the ANN on the Training set
classifier.fit(X_train, y_train,
               batch_size=32,  # rows processed before each weight update
               epochs=100)  # number of full passes over the training set

Convolutional Neural Networks (CNN)


Convolution -> Max Pooling -> Flattening -> Full Connection

Convolution: signal processing func we learned in Math 256, the integral of the product of two functions (one of them shifted), (f*g)(t), which also makes Laplace transforms easier.

  • a Feature detector (3x3) slides over the input image (nxn), here with a standard stride of 2 pixels
  • each position records the number of points of that image section that match the feature detector, giving a feature map (size (n-3+1)^2 at a stride of 1, smaller at larger strides) and reducing noise
  • many feature maps, one per feature detector, go into the convolution layer

ReLU Layer: the rectifier linear operation sets negative values to zero and keeps the positive ones, which adds non-linearity to the feature maps and reduces noise.

Max Pooling: AKA Down Sampling. How to recognize features facing different directions, or existing in different parts of the image (tilts, offsets, etc). We slide a window with a stride of two over the feature map and keep only the max value of each window in a Pooled Feature Map, which tolerates small distortions since we pull only the max features (ok for the window to go over the image size limits). This reduces the input info, and over-fitting along with it.

Sub-sampling: mean pooling instead of max pooling

Flattening: Converting pooled feature map into a flat column, acts as input layer of ANN

Full Connection: layers of the ANN which are fully connected (O(n^2) connections). Multiple output neurons are possible, so features get matched to each output at the output layer (via the output weights); redundant and irrelevant features are down-weighted automatically by back-propagation of errors

SoftMax: normalized exponential function: kD vector -> values ∈ [0,1], one for each output type, all probs add to 1 overall,

 f_j(z) = e^(z_j) / (sum_k(e^z_k))
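
A minimal sketch of this function in plain Python. Subtracting the max before exponentiating is an added numerical-stability step (it cancels out in the ratio) and is not part of the formula above.

```python
import math

def softmax(z):
    """f_j(z) = e^(z_j) / sum_k(e^(z_k)), computed stably."""
    shift = max(z)                               # avoids overflow for large z
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
even = softmax([0.0, 0.0])
```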

Cross-Entropy: similar to reducing error (like mean-squared) but is better with small initial gradient descent errors due to log term

  • Loss Func:

      L_i = -log( e^(f_yi) / sum_j(e^(f_j)) ) (minimize these)
  • Cost Func:

      H(p,q) = - Sum_x( (p(x))*log(q(x)) ) (p=actual val, q=predicted val)
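
A short sketch of H(p, q) for one sample, where p is the one-hot actual label and q the predicted probabilities. The numbers are invented for illustration.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log(q(x)); terms with p(x) = 0 contribute nothing."""
    return -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q) if p_x > 0)

actual = [1, 0, 0]
confident = cross_entropy(actual, [0.7, 0.2, 0.1])
unsure = cross_entropy(actual, [0.4, 0.3, 0.3])
```

With a one-hot p this reduces to -log of the probability given to the true class, so a less confident correct prediction costs more.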

Example Code with comments:

import tensorflow as tf

# Initialising the CNN
cnn = tf.keras.models.Sequential()

# Step 1 - Convolution
cnn.add(tf.keras.layers.Conv2D(filters=32,  # the dimensionality of the output space
                               kernel_size=3,  # height and width of the 2D convolution window.
                               padding='same',  # for convolution layer
                               input_shape=[64, 64, 3]))  # (rows, cols, channels) of one image; the batch dimension is left out
# Step 2 - Pooling
cnn.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2),  # max value over a 2x2 pooling window
                                  strides=2,  # 2 pixel stride
                                  padding='valid'))  # for pooling
# Adding a second convolutional layer
cnn.add(tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding='same', activation='relu'))
cnn.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=2, padding='valid'))

# Step 3 - Flattening
cnn.add(tf.keras.layers.Flatten())

# Step 4 - Full Connection
cnn.add(tf.keras.layers.Dense(units=128, activation='relu', kernel_initializer='uniform'))

# Step 5 - Output Layer
cnn.add(tf.keras.layers.Dense(units=1, activation='sigmoid', kernel_initializer='uniform'))

# Part 3 - Training the CNN

# Compiling the CNN
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Training the CNN on the Training set and evaluating it on the Test set
# (training_set is an assumed name for the training image generator)
cnn.fit(x=training_set, validation_data=test_set, epochs=25)

Recurrent Neural Networks (RNN)

Dimensionality Reduction

Principal Component Analysis (PCA)

dimensionality reduction algorithm by detecting correlation between vars.

(sort the eigenvalue of covariance matrix in descending order, choose top k)
create projection matrix W from k eigenvectors, transfer X via W to get
k-dimensional feature subspace Y
  • highly affected by data outliers
  • unsupervised algorithm
  • if 2 final components, then graphable
  • do not fit X-test in order to avoid info leakage

Linear Discriminant Analysis (LDA)

dimensionality reduction algorithm by detecting correlation between vars.

LDA differs from PCA by trying to maximize the separation between multiple classes (adjusting component axis)

  • LDA is supervised due to its relation to the dependent variable
    • (uses scatter matrices -- between-class and within-class)
    • sorts eigen-vectors and similar steps to PCA

Kernel PCA

Similar to PCA, but uses the kernels defined previously to assess correlations between features

Model Selection: Boosting

K-fold Cross Validation

creates multiple train/test folds so that no single split (ie one with outlying data) outweighs the results

  • cv: if an integer, then specify the number of folds in a (Stratified)KFold,
    • CV splitter, - An iterable yielding (train, test) splits as arrays of indices

Grid Search

test and find the best of many params all at once

  • different C values, kernels, gammas

  • own regularization parameters can be entered as sets inside array with arrays of desired values to test as seen here

     [{'C': [0.25, 0.5, 0.75, 1],
      'kernel': ['rbf'],
      'gamma': [.1, .2, .3, .4, .5, .6, .7, .8, .9]}, ...]
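
To see how many models such a grid implies, the list of dictionaries can be expanded by hand. This is a stdlib-only sketch with a hypothetical helper `expand_grid`; the 'linear' entry is an invented stand-in for the elided dictionaries, and the real search object also handles fitting and scoring.

```python
from itertools import product

def expand_grid(param_grid):
    """Every dictionary contributes the cross product of its value lists."""
    combos = []
    for grid in param_grid:
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            combos.append(dict(zip(names, values)))
    return combos

param_grid = [
    {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['linear']},   # assumed extra entry
    {'C': [0.25, 0.5, 0.75, 1],
     'kernel': ['rbf'],
     'gamma': [.1, .2, .3, .4, .5, .6, .7, .8, .9]},
]

combos = expand_grid(param_grid)   # 4 linear + 4 * 9 rbf candidates
```

Each candidate is then trained once per cross-validation fold, so the total number of fits is the number of combinations times the number of folds.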


an estimator object (regression or classification) which can run many tests in parallel and return optimal model

  • may be difficult to import library, thus the code is currently commented out
  • otherwise acts like any other regressor or classifier from a high level perspective


