an amalgamation of all my machine learning training and OO Python to demonstrate functionality
- choose all cols but outputs for X, rest is y
- sklearn.impute.SimpleImputer
- avoid string cols
- missing_values = np.nan
- strategy = 'mean'
- 1 hot encode so no relation between strs
- do not scale these features afterwards
sklearn.compose.ColumnTransformer
- transform type = 'encoder'
- encoding class = OneHotEncoder()
- col index = [i]
- remainder = 'passthrough'
- transformers = [(transform type, encoding class, col index)]
sklearn.preprocessing.LabelEncoder
- no/yes -> 0/1
sklearn.model_selection.train_test_split
- test_size = E[0, 1], usually ~0.2
- random_state = constant -> same split each time
- after splitting, else test set gets biased, info leaks
- fit: get mean and SD
- fit_transform: standardization formula
- transform: apply the mean and SD
- not on dummy/1hot cols, else may remove 1hot
- X_test: Don't fit-> fresh data-> use training set scalar on test set
sklearn.preprocessing.StandardScaler
- (x - mean(x) )/ SD(x)
- E [-3,3] since percentiles around mean
- works all the time, best choice
- (x - min(x) )/ (max(x) - min(x))
- E [0,1], good for normal distributions
Pros | Cons |
---|---|
Works on any size of dataset, gives information about relevance of features | Linear Regression Assumptions |
Assumptions of Linear Regression:
- Linearity
- Homoscedasticity (same variance for all values)
- Multivariate Normality (normally distributed across all independent vars)
- Independence of errors
- Lack of Multi-Collinearity (no relation between "independent" vars)
-
when we have a set of string inputs, like states, then we can assign them Dummy Vars (one hot columns), but then D1 = 1 - D2 - D3..., so if we have n different entries in the column, only use n-1 dummy vars, since the last term will be part of the bias term anyways)
-
P-value:
- probability of getting a sample like ours, or more extreme than ours,
IF the null hypothesis is true
-
Assume the null hypothesis == true, determine how “strange” sample really is. If not that strange (a large p-value), then assumption ok. As the p-value gets smaller, we may reject the null hypothesis.
-
LinearRegression auto avoids Dummy Var Trap and auto selects best P-value
if we wanted to manually remove dummy var, doing the whole model in pure python, we would consider X = X[:,1:]
5 methods to build a model (middle 3 are step-wise regression):
All-in | Backward Elimination | Forward Selection | Bidirectional Elimination | Brute Force |
---|---|---|---|---|
throw in all verifiable valid vars | select significance level (ie SL = 0.05) | select significance level (ie SL = 0.05) | Choose SL-Enter and SL-Stay | select criterion of goodness of fit |
requires prior knowledge | fit the full model with all possible predictors | fit all basic models with all predictors (y~xn), choose the one with minimum P | construct 2^(numVars)-1 possible regression models, choose the best one | |
while (vars in model): | while (vars NOT in model): | while (vars can be added to model): | ||
consider prediction with highest P | keep this var and find next smallest P when vars added to current model | ForwardSelect(SL=SL-Enter) BackwardSelect(SL=SL-Stay) | ||
if (P > SL): Remove the predictor, Fit model without this variable | if (P < SL): Add the predictor, Fit model with this variable | |||
else: model is ready, break | else: model is ready, break. |
Pros | Cons |
---|---|
Works on any size of dataset, works very well on non linear problems | Need to choose the right polynomial degree for a good bias/variance trade-off |
linear refers to the linear independence of the variable matrix
sklearn.preprocessing.PolynomialFeatures
transforms vector into degree n matrix to get poly regression, use only on single feature sets
Pros | Cons |
---|---|
Easily adaptable, works very well on non linear problems, not biased by outliers | Compulsory to apply feature scaling, not well known, more difficult to understand |
Trend-line has epsilon-insensitive tube, ie a value of E where error less than this is negated, and otherwise the errors
are taken from the edge of the tube around the trend-line. ( sum of minimum squares on this tube to get best trend-line.
The points outside the tube are the SVs, defining the shape of the E-I tube. The ones above are Ei*, and the ones below are Ei.
kernel = 'rbf', # Radial Basis Function Kernel
- (function measuring euclidian distances radially around some center, as a basis)
- used for SVR typically, (does not catch outliers well, but is better for polynomial trends than linear kernel)
- points are considered proportional to the inverse exponential euclidean distance from the center, scaled by the inverse exponential variance
- thus we can get a circumference as a cutoff hyperplane. (variance can be changed here)
- we can also add different RBF equations for more complex mapping
Non Linear SVM = using RBF kernel to shift the nD plane to higher dimensionality ((n+1)D), we can cast a hyperplane onto the shape, with a top and bottom hyperplane at distance epsilon away in which we negate error.
Pros | Cons |
---|---|
Interpretability, no need for feature scaling, works on both linear / nonlinear problems | Poor results on small dataset, over-fitting can easily occur |
no feature scaling since it results on splits of the data, not an equation with a scale of data (thus no mean nor SD)
We make n-D trees splitting leaves around some percent above and below a cutoff percentile.
Different check for each point in decision tree.
We assign each segment an avg'd value and assign any new points that meet the cutoff as such
Pros | Cons |
---|---|
Powerful and accurate, good performance on many problems, including non linear | No interpretability, over-fitting can easily occur, need to choose the number of trees |
When we want the output of a random input, assign it the avg output across all Decision Trees, thus getting a very stable result not prone to extraneous outliers (higher likelihood of correctness)
for i in range (number of N-Trees, ie 500):
Pick at random K data points from Training Set
DecisionTree(K)
sum of squares of residuals SSres: Total squared error off the trend-line,
we want SSres=0
sum of squares of total SStot: Total squared error off the avg
R^2: value estimating total error, E[0,1] if good trendline, ideally 1.
R^2 = 1 - (SSres/SStot)
Adjusted R-Squared: adding new vars to model will never decrease R^2, potentially minimizing SSres, since we are adding a var with at least a slight random correlation to the model (non zero factor)
n = sample size, p = numRegressors
Adj R^2 = 1 - (1 - R^2)(n-1)/(n-p-1) thus inhibiting adding too many regressors
use hyper-tuning parameters to add factors to minimized errors
- Ridge Regression: adding the factor lambda*(SSmodel_coefficients)
- Lasso: adding the factor lambda*(Sum(abs(model_coefficients))
- Elastic Net: use both Ridge and Lasso with different lambdas, reducing over-fitting <<<<<<< HEAD
classifying into groups, ie yes and no, or cat vs dog, as a probability
Intuition:
we can thus apply a sigmoid function to the output,
and the n invert to get the input sigmoid (probability)
then we project y^ to the closest classification (if y<0.5 y ->0)
classification boundary is linear due to logistic regression being linear
- C: Inverse of regularization strength; must be a positive float.
Like in support vector machines, smaller values specify stronger
regularization, counters over-fitting
- can be as high as 1e5 if not worried about overfitting
- nonlinear
- metric = 'minkowski' for distance metric to use for the tree.
- with p=2 is equivalent to the standard Euclidean metric
- n_neighbors = 5 neighbors by default
Intuition:
Choose number of K neighbors (ie 5),
take the K nearest neighbors of the new data point, by Euclidean distance.
Count the classifications of these neighbors,
assign data point to the closest group
-
assume kernel is separable Intuition:
finding best decision boundary, which has a Maximum margin between the closest points to the boundary (like the tube around the trendline), we want the max distance between these closest points
Maximum Margin Hyperplane: maximum margin classifier line in nD, splits positive and negative hyperplane
how to choose the right classification method
- sometimes linearly inseparable sets in nD are linearly separable in higher dimensions
- (as lines, planes, or hyperplanes)
- ie x->(x-5)^2
Types of Kernel Functions: linear, rbf, poly, sigmoid
Non Linear SVM: using RBF kernel
- shifts the nD plane to higher dimensionality ((n+1)D)
- then we can cast a hyperplane onto the shape,
- with a top and bottom hyperplane at distance epsilon away in which we negate error.
Intuition:
Posterior Probability: probability of y given X, P(y|X) = P(X|y) * P(y) / P(X)
P(y) = prior probability = number in classification y / total classified
P(X) = marginal likelihood = # similar Observations / Total observations, is the same for y and !y, can be factored out
P(X|y) = Likelihood = # similar observations among classification y / Total classified as y
Naive since it has independence assumptions of features, otherwise it will over-correlate to much
-
I.e. feature1 = 1 - all other features
-
priors: None so that there is no probability of one class over another predefined
-
var_smoothing: Portion of the largest variance of all features that is added to variances for calculation stability
Intuition:
CART = Classification And Regression Trees
The rest is the same as Regression Decision Tree
- Criterion: The function to measure the quality of a split. Supported criteria are
- "gini" for the Gini impurity
- "entropy" for the information gain.
Intuition:
The basic intuition is the same as Regression Random Forest
False Positives and Negatives: easy with sigmoid around center
Confusion Matrix: determining the occurance of falses:
[[#neg , #falsePos],[#falseneg, #Pos]]
Accuracy rate: correct / total, Error rate = 1 - accuracy rate
Accuracy Paradox: if y occurs much more than !y, then we can get a better model by always choosing y, which defeats the purpose of having a model
CAP (Cumulative Accuracy Profile) Curve: using specific demographics as the initial set of features to boost performance, there is a peak operating point where the slope of the model is maximized (total performance/sample size) Peak performance would be to include only those who choose y and nothing otherwise.
CAP Analysis: Random Model Line (R), Perfect Model Line (P), and Good model line (G) in between.
Accuracy ratio = check ratio of R/P at 50% sample size,
- if output is less than 60%, then bad model, after which it improves diminishingly.
- If near 90-100%, then likely over-fitting
ROC (Receiver Operating Characteristic) is not the same as CAP
Intuition: how to identify clusters of features
Choose the number of clusters, K
Select any K random points (not necessarily from dataset) as the centroids
Assign each data point to the closest centroid (forms K clusters)
while(reassignment of cluster data):
Compute and place the new centroid of each cluster
Reassign data points if possible
- Caveat: Do we use Euclidean Distance or something else
- Random Initialization Trap: if we choose centroids poorly initially, the final clusters can be objectively incorrect
- Solution: K-means++
- wcss: Within Cluster Sum of Squares, converges to 0 as numClusters -> numPoints, so we choose the optimal numClusters at the pivot point of the exponential relationship (elbow)
Intuition:
assign each point as a cluster
while numClusters > 1:
combine closest clusters (Euclidean distance or otherwise)
(between centroids, closest or farthest points, avg dist)
Dendrogram: stores grouping memory, plotting the points as the values on the x-axis, connecting various points at the computed dissimilarity between them (the height of the bar)
-
A threshold dissimilarity can be set to limit the minimum numClusters, best split is at largest single height in chart (greatest dissimilarity)
-
ward: min variance method
people who bought x also bought y
- I sort these by lifts
Intuition:
support(x) = #transactions containing x / #transactions
confidence(x -> x_2) = #transactions containing x and x_2 / #transactions containing x
lift(x->x_2) = confidence(x->x_2) / support(x), the improvement of the prediction
Steps:
- set a minimum support and confidence
- take all subsets of transactions that have valid support
- and all the rules of these subsets with valid confidence
- sort by decreasing lift
simple apriori, determining sets of relations
- only considering supports, sorted decreasingly (no confidences or lifts)
- I sort these by supports
Intuition:
we have d devices to monitor, at each round n, where i gives reward r_i(n)E[0,1],
with r_i(n)=1 if user took action, and r_i(n) = 0 if not. We want to maximize
reward over many rounds
- Deterministic, Requires update at every round
- we have N rounds of data, but we want to see how few rounds we need to validly identify the best device
N_i(n) : # times i was selected up to round n R_i(n) : sum of rewards of i up to round n _r*i(n) : avg rewards of i up to round n
= R_i(n) / N_i(n)
delta_i(n)
= sqrt(3 log(n) / (2 N_i(n))
confidence interval
[r*_i(n) - delta_i(n), r*_i(n) + delta_i(n)] (aka [LCB, UCB]),
we select i with max UCB
Same Intuition as UCB
We are creating distribution of where the expected values lie
- probabilistic, not deterministic like UCB
- Can accommodate delayed feedback
Bayesian Inference: derives the posterior probability as a consequence of two antecedents:
- a prior probability
- a "likelihood function" derived from a statistical model for the observed data.
Bayesian inference computes the posterior probability according to Bayes' theorem Ad i get reward y from Bernoulli distribution
p(y|theta_i) ~ B(theta_i)
theta_i is unknown, but we set its uncertainty assuming uniform distribution
p(y|theta_i) ~ U([0,1])
Bayes Rule: approach theta_i by the posterior distribution: p(theta_i|y),
p(theta_i|y) = ( p(y|theta_i) * p(theta_i) ) / ( integral(p(y|theta_i) * p(theta_i)) d theta_i )
p(theta_i|y) ~= ( p(y|theta_i) * p(theta_i) ), aka (likelihood function * prior distribution)
p(theta_i|y) ~ B(numSuccess + 1, numFail + 1),
at each round we take random theta_i(n) from p(theta_i|y), for each ad i
At each round n, select i with highest theta_i(n).
Bag of Words Model: tokenization
- preprocess text before classification
- involving vocab of know words
- measure of the presence of known words
stopwords: words that dont add meaning, like the, I, etc.
quoting: =3 to ignore quotes
PorterStemmer: only consider the stem of the word, replace punctuation with spaces
max_features: how many words to include, thus letting us remove unnecessary words like names
activation functions: sigmoid (used in output layer), rectifer(used in hidden layers), tanh, threshold.
cost function: measuring error between y (real) and y^ (predicted) =
.5(y^-y)^2
batch gradient descent | stochastic gradient descent |
---|---|
deterministic | random |
looks at weights after all data runs | updates the weights dynamically |
stochastic gradient descent avoids local minimums and finds global minimums, is faster | |
back-propagation: adjusting all weights compared to the error matrix simultaneously for max speed |
Example code with comments:
classifier = Sequential()
# Adding the input layer and the first hidden layer
classifier.add(Dense(6, activation='relu', input_dim=11, kernel_initializer='uniform')) # rectifier activation func, input dim seems unnecessary
# Adding the second hidden layer
classifier.add(Dense(6, activation='relu', kernel_initializer='uniform')) # uniform weight dist around 0
# Adding the output layer
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='uniform')) # sigmoid for output probability
# classifier.add(Dense(n, activation='softmax', kernel_initializer='uniform')) # softmax for nD dependant output probability
# Compiling the ANN
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# adam is a good stochastic gradient descent function, avoiding local min
# loss function is log loss func to account for sigmoid loss
# metrics uses accuracy criterion to improve model performance
# Part 3 - Training the ANN
# Training the ANN on the Training set
classifier.fit(X_train, y_train, batch_size=32, # batch_size operations before updating weights
epochs=100) # epoch num of these
Intuition:
Convolution -> Max Pooling -> Flattening -> Full Connection
Convolution: signal processing func we learned in Math 256, integral of two functions to get laplace easier. (f*g)(t) We use a standard stride of 2 pixels to analyze image Feature detector (3x3) maps input image(nxn) to a feature map ((n-3+1)^2) and reduces noise, feature map lists number of matching points on image section compared to feature detector. Many maps for detecting all features -- goes into convolution layer
ReLU Layer: rectifier linear operation allows the isolation of all 1s from zeros, hence focusing only on the nonlinear, positive portions of the image, reducing noise.
Max Pooling: AKA Down Sampling, How to understand features facing different directions, or existing in different parts of image (tilts, offsets, etc). we use a stride of two to get the max values from the feature map into a Pooled Feature Map, accounting for any distortions since we pull only the max features. (ok to go over image size limits) This reduces input info and over-fitting because of it sub-sampling: mean pooling instead of max pooling
Flattening: Converting pooled feature map into a flat column, acts as input layer of ANN
Full Connection: layers of the ANN which are fully connected (O(n^2) connections) multiple output neurons possible, thus we have NN of features matched to output at output layer (due to output weights), redundant and irrelevant auto-removed by NN backtracking of errors
SoftMax: normalized exponential function: kD vector -> E[0,1] for each output type, all probs add to 1 overall,
f_j(z) = e^(z_j) / (sum_k(e^z_k))
Cross-Entropy: similar to reducing error (like mean-squared) but is better with small initial gradient descent errors due to log term
-
Loss Func:
L_i = -log(e^(f_yi) / (sum_j(e^f_j)))) (minimize these)
-
Cost Func:
H(p,q) = - Sum_x( (p(x))*log(q(x)) ) (p=actual val, q=predicted val)
Example Code with comments:
# Initialising the CNN
cnn = tf.keras.models.Sequential()
# Step 1 - Convolution
cnn.add(tf.keras.layers.Conv2D(filters=32, # the dimensionality of the output space
kernel_size=3, # height and width of the 2D convolution window.
activation='relu',
padding='same', # for convolution layer
input_shape=[64, 64, 3])) # (batch_size, channels, rows, cols) 4d tensor
# Step 2 - Pooling
cnn.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2), # max value over a 2x2 pooling window
strides=2, # 2 pixel stride
padding='valid')) # for pooling
# Adding a second convolutional layer
cnn.add(tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding='same', activation='relu'))
cnn.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=2, padding='valid'))
# Step 3 - Flattening
cnn.add(tf.keras.layers.Flatten())
# Step 4 - Full Connection
cnn.add(tf.keras.layers.Dense(units=128, activation='relu', kernel_initializer='uniform'))
# Step 5 - Output Layer
cnn.add(tf.keras.layers.Dense(units=1, activation='sigmoid', kernel_initializer='uniform'))
# Part 3 - Training the CNN
# Compiling the CNN
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Training the CNN on the Training set and evaluating it on the Test set
cnn.fit(x=training_set, validation_data=test_set, epochs=25)
dimensionality reduction algorithm by detecting correlation between vars.
(sort the eigenvalue of covariance matrix in descending order, choose top k)
create projection matrix W from k eigenvectors, transfer X via W to get
k-dimensional feature subspace Y
- highly affected by data outliers
- unsupervised algorithm
- if 2 final components, then graphable
- do not fit X-test in order to avoid info leakage
https://setosa.io/ev/principal-component-analysis/
dimensionality reduction algorithm by detecting correlation between vars.
LDA differs from PCA by trying to maximize the separation between multiple classes (adjusting component axis)
- LDA is supervise due to relation to dependant variable
- (uses scatter matrices -- in-between-class and within-class)
- sorts eigen-vectors and similar steps to PCA
Similar to PCA, but uses the kernels defined previously to assess correlations between features
creates multiple test folds to avoid any outlying data weighing results
- cv: if an integer, then specify the number of folds in a
(Stratified)KFold
,CV splitter
, - An iterable yielding (train, test) splits as arrays of indices
test and find the best of many params all at once
-
different C values, kernels, gammas
-
own regularization parameters can be entered as sets inside array with arrays of desired values to test as seen here
'[{'C': [0.25, 0.5, 0.75, 1], 'kernel': ['rbf'], 'gamma': [.1, .2, .3, .4, .5, .6, .7, .8, .9]}, ...]
an estimator object (regression or classification) which can run many tests in parallel and return optimal model
- may be difficult to import library, thus the code is currently commented out
- otherwise acts like any other regressor or classifier from a high level perspective