# Forked from jbrittain72/DataMiningNotebooks
#%%
# Enable HTML/CSS
from IPython.display import HTML
HTML("<link href='https://fonts.googleapis.com/css?family=Passion+One' rel='stylesheet' type='text/css'><style>div.attn { font-family: 'Helvetica Neue'; font-size: 30px; line-height: 40px; color: #FFFFFF; text-align: center; margin: 30px 0; border-width: 10px 0; border-style: solid; border-color: #5AAAAA; padding: 30px 0; background-color: #DDDDFF; }hr { border: 0; background-color: #ffffff; border-top: 1px solid black; }hr.major { border-top: 10px solid #5AAA5A; }hr.minor { border: none; background-color: #ffffff; border-top: 5px dotted #CC3333; }div.bubble { width: 65%; padding: 20px; background: #DDDDDD; border-radius: 15px; margin: 0 auto; font-style: italic; color: #f00; }em { color: #AAA; }div.c1{visibility:hidden;margin:0;height:0;}div.note{color:red;}</style>")
#%% [markdown]
# ___
# Enter Team Member Names here:
#
# - Name 1: Carson Drake
# - Name 2: Che Cobb
# - Name 3: David Josephs
# - Name 4: Andy Heroy
#
#
# ________
#
# # In Class Assignment Three
# In the following assignment you will be asked to fill in python code and
# derivations for a number of different problems. Please read all instructions
# carefully and turn in the rendered notebook (or HTML of the rendered notebook)
# before the end of class.
#
# <a id="top"></a>
# ## Contents
# * <a href="#Loading">Loading the Data</a>
# * <a href="#distance">Measuring Distances</a>
# * <a href="#KNN">K-Nearest Neighbors</a>
# * <a href="#naive">Naive Bayes</a>
#
# ________________________________________________________________________________________________________
# <a id="Loading"></a> <a href="#top">Back to Top</a>
# ## Downloading the Document Data
# Please run the following code to read in the "20 newsgroups" dataset from
# sklearn's data loading module.
#%%
from sklearn.datasets import fetch_20newsgroups_vectorized
import numpy as np
# this takes about 30 seconds to compute, read the next section while this downloads
ds = fetch_20newsgroups_vectorized(subset='train')
# this holds the continuous feature data (which is tfidf)
print('features shape:', ds.data.shape) # there are ~11000 instances and ~130k features per instance
print('target shape:', ds.target.shape)
print('range of target:', np.min(ds.target),np.max(ds.target))
print('Data type is', type(ds.data), float(ds.data.nnz)/(ds.data.shape[0]*ds.data.shape[1])*100, '% of the data is non-zero')
print('Dataset keys:', list(ds.keys()))
print(ds['DESCR'])
#%% [markdown]
# ## Understanding the Dataset
# Look at the description for the 20 newsgroups dataset at
# http://qwone.com/~jason/20Newsgroups/. You have just downloaded the
# "vectorized" version of the dataset, which means all the words inside the
# articles have gone through a transformation that binned them into 130 thousand
# features related to the words in them.
#
# **Question Set 1**:
# - How many instances are in the dataset?
# - What does each instance represent?
# - How many classes are in the dataset and what does each class represent?
# - Would you expect a classifier trained on this data to generalize to
# documents written in the past week? Why or why not?
# - Is the data represented as a sparse or dense matrix?
#%% [markdown]
# ___
# Enter your answer here:
#
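# 1. There are 11314 instances in this dataset.
# 2. Each instance represents one newsgroup post (document), described by 130,107 tf-idf features.
# 3. There are 20 classes in the dataset, and each represents a different newsgroup (topic category).
# 4. Probably not very well. The posts were collected in the 1990s, so the vocabulary and topics of documents written in the past week have likely drifted from what the classifier saw in training.
# 5. It is represented as a sparse matrix. Only about 0.12% of the values are non-zero.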
#
#%% [markdown]
# ___
# <a id="distance"></a> <a href="#top">Back to Top</a>
# ## Measures of Distance
# In the following block of code, we isolate three instances from the dataset.
# The instance "`a`" is from the group *computer graphics*, "`b`" is from
# the group *recreation autos*, and "`c`" is from the group *recreation motorcycles*.
#
#
# **Exercise for part 2**:
#
# Calculate the:
# - (1) Euclidean distance
# - (2) Cosine distance
# - (3) Jaccard similarity
#
#
# between each pair of instances using the imported functions below. Remember
# that the Jaccard similarity is only for binary valued vectors, so convert
# vectors to binary using a threshold.
#
#
# **Question for part 2**: Which distance seems more appropriate to use for this
# data? **Why**?
#%%
from scipy.spatial.distance import cosine
from scipy.spatial.distance import euclidean
from scipy.spatial.distance import jaccard
import numpy as np
# get first instance (comp) as a flat 1-D array, which scipy's distance functions expect
idx = 550
a = ds.data[idx].toarray().ravel()
a_class = ds.target_names[ds.target[idx]]
print('Instance A is from class', a_class)
# get second instance (autos)
idx = 4000
b = ds.data[idx].toarray().ravel()
b_class = ds.target_names[ds.target[idx]]
print('Instance B is from class', b_class)
# get third instance (motorcycle)
idx = 7000
c = ds.data[idx].toarray().ravel()
c_class = ds.target_names[ds.target[idx]]
print('Instance C is from class', c_class)
# Euclidean Distance
e_ab = euclidean(a, b)
e_ac = euclidean(a, c)
e_bc = euclidean(b, c)
# Cosine Distance
c_ab = cosine(a, b)
c_ac = cosine(a, c)
c_bc = cosine(b, c)
# converting to boolean
a = a > 0
b = b > 0
c = c > 0
# Jaccard Distance
j_ab = jaccard(a, b)
j_ac = jaccard(a, c)
j_bc = jaccard(b, c)
# Enter distance comparison below for each pair of vectors:
print('\n\nEuclidean Distance\n ab:', e_ab, 'ac:', e_ac, 'bc:',e_bc)
print('Cosine Distance\n ab:', c_ab, 'ac:', c_ac, 'bc:', c_bc)
print('Jaccard Dissimilarity (vectors should be boolean values)\n ab:', j_ab, 'ac:', j_ac, 'bc:', j_bc)
print('\n\nThe most appropriate distance is...Cosine Distance')
print('\nThe Cosine distance is best in this scenario because it compares \nthe angle between the two tf-idf vectors and ignores their magnitude, \nso documents that use the same words in similar proportions come out \nas similar regardless of document length.')
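#%% [markdown]
# A minimal sketch of the three measures on hand-made vectors, added purely as
# an illustration. The `toy_*` names and values are assumptions and are not
# part of the assignment data. `toy_u` and `toy_v` use the same words in the
# same proportions but differ in length, so cosine distance treats them as
# identical while Euclidean distance does not.
#%%
import numpy as np
from scipy.spatial.distance import cosine, euclidean, jaccard

toy_u = np.array([1.0, 2.0, 0.0, 0.0])  # short document
toy_v = np.array([2.0, 4.0, 0.0, 0.0])  # same word mix, twice as long
toy_w = np.array([0.0, 0.0, 3.0, 1.0])  # shares no words with toy_u

toy_euclid_uv = euclidean(toy_u, toy_v)  # non-zero: sensitive to length
toy_cosine_uv = cosine(toy_u, toy_v)     # about 0: same direction
toy_cosine_uw = cosine(toy_u, toy_w)     # 1: orthogonal vectors
# jaccard() returns a dissimilarity, so similarity is 1 minus the result
toy_jaccard_sim_uv = 1 - jaccard(toy_u > 0, toy_v > 0)
toy_jaccard_sim_uw = 1 - jaccard(toy_u > 0, toy_w > 0)

print('Euclidean u-v:', toy_euclid_uv)
print('Cosine u-v:', toy_cosine_uv, 'u-w:', toy_cosine_uw)
print('Jaccard similarity u-v:', toy_jaccard_sim_uv, 'u-w:', toy_jaccard_sim_uw)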
#%% [markdown]
# ___
# # Start of Live Session Assignment
# ___
# <a id="KNN"></a> <a href="#top">Back to Top</a>
# ## Using scikit-learn with KNN
# Now let's use stratified cross validation with a holdout set to train a KNN
# model in `scikit-learn`. Use the example below to train a KNN classifier. The
# documentation for `KNeighborsClassifier` is here:
# http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
#
#
# **Exercise for part 3**: Use the code below to test what value of
# `n_neighbors` works best for the given data. *Note: do NOT change the metric
# to be anything other than `'euclidean'`. Other distance functions are not
# optimized for the amount of data we are working with.*
#
# **Question for part 3**: What is the accuracy of the best classifier you can
# create for this data (by changing only the `n_neighbors` parameter)?
#%%
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, train_size=0.8)
cv = sss.split(X=ds.data, y=ds.target)
# fill in the training and testing data and save as separate variables
for trainidx, testidx in cv:
    # note that these are sparse matrices
    X_train = ds.data[trainidx]
    X_test = ds.data[testidx]
    y_train = ds.target[trainidx]
    y_test = ds.target[testidx]
# fill in your code here to train and test
# calculate the accuracy and print it for various values of K
clf = KNeighborsClassifier(weights='uniform', metric='euclidean')
accuracies = []
for k in range(1, 10):
    # refit the same classifier with a different neighborhood size
    clf.set_params(n_neighbors=k)
    clf.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    accuracies.append(acc)
    print('Accuracy of classifier with %d neighbors is: %.3f' % (k, acc))
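#%% [markdown]
# A small sketch of what the stratified split buys us, shown on made-up labels
# rather than the newsgroup data. The `toy_*` names, the 80/20 class balance,
# and the fixed `random_state` are assumptions chosen for illustration. The
# point is that both the training and the holdout indices keep the original
# class proportions.
#%%
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

toy_split_y = np.array([0] * 80 + [1] * 20)  # imbalanced labels: 80% / 20%
toy_split_X = np.arange(100).reshape(-1, 1)  # placeholder features
toy_sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
toy_train_idx, toy_test_idx = next(toy_sss.split(toy_split_X, toy_split_y))

# count how many instances of each class landed on each side of the split
toy_train_counts = np.bincount(toy_split_y[toy_train_idx])
toy_test_counts = np.bincount(toy_split_y[toy_test_idx])
print('train class counts:', toy_train_counts)
print('test class counts:', toy_test_counts)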
#%% [markdown]
#=====================================
#
# The best accuracy is 68.9% with k=1 neighbors. With k=1, each prediction
# depends on a single nearest training point, so the bias is fairly low and
# the model fits the training data perfectly. Unfortunately, this also means
# the variance is probably higher within our model.
#
#
#%% [markdown]
# **Question for part 3**:
# With sparse data, does the use of a KDTree representation make sense? Why or
# why not?
#
#%% [markdown]
# Enter your answer below:
#
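# A KD-tree won't work well for this dataset because of the multitude of
# features. It splits the data along one feature axis at a time, and with about
# 130,000 dimensions the search can rule out almost no branches, so a query
# ends up about as slow as brute force. scikit-learn also falls back to
# brute-force search when it is given sparse input, so the matrix would have to
# be densified first. That's a lot to compute and store, which is why we would
# refrain from using a KD-tree with this dataset.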
#
#_____
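#%% [markdown]
# A rough illustration of why tree-based search loses its advantage in high
# dimensions, using uniformly random points instead of the newsgroup data. The
# helper `toy_distance_contrast` and its defaults are hypothetical and exist
# only for this sketch. As the number of dimensions grows, the nearest and
# farthest points end up almost equally far from a query, which leaves a
# KD-tree with very few branches it can safely skip.
#%%
import numpy as np


def toy_distance_contrast(n_dims, n_points=200, seed=0):
    """Return (nearest distance / farthest distance) from one random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()


for toy_dims in (2, 100, 10000):
    # a ratio near 1 means "nearest" and "farthest" are barely distinguishable
    print('dims=%5d  nearest/farthest ratio: %.3f'
          % (toy_dims, toy_distance_contrast(toy_dims)))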
#%% [markdown]
#_____
# ## KNN Extensions - Centroids
#
# Now let's look at a classifier closely related to KNN, called nearest
# centroid. In this classifier (which is more appropriate for big data scenarios
# and sparse data), the training step calculates and saves the centroid of each
# class. At prediction time, an unknown instance only needs its distance to
# each saved centroid calculated, drastically decreasing the time required for
# a prediction.
#
# **Exercise for part 4**: Use the template code below to create a nearest
# centroid classifier. Test which metric has the best cross validated
# performance: Euclidean, Cosine, or Manhattan. In `scikit-learn` you can see the
# documentation for NearestCentroid here:
# - http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html#sklearn.neighbors.NearestCentroid
#
# and for supported distance metrics here:
# - http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics
#%%
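from sklearn.neighbors import NearestCentroid
# the parameters for the nearest centroid metric to test are:
# l1, l2, and cosine ('manhattan' and 'euclidean' are aliases for l1 and l2)
# NOTE: recent scikit-learn releases may only accept 'euclidean' and
# 'manhattan' here; trim the list if the other metrics raise an error
# X_train, X_test, y_train, and y_test come from the split in the KNN cell
results = {}
for d in ['l1', 'l2', 'cosine', 'euclidean', 'manhattan']:
    clf = NearestCentroid(metric=d)
    clf.fit(X_train, y_train)
    yhat = clf.predict(X_test)
    acc = accuracy_score(y_test, yhat)
    results[d] = acc
    print(d, acc)
# report the metric with the highest holdout accuracy (cosine in our run)
p = max(results, key=results.get)
print('The best distance metric is: ', p)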
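#%% [markdown]
# To make the "train by saving centroids, predict by nearest centroid" idea
# concrete, here is a hand-rolled sketch for the Euclidean case on four
# made-up points. The `toy_*` names and data are assumptions for illustration,
# and this is not how `NearestCentroid` is implemented internally.
#%%
import numpy as np

toy_nc_X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_nc_y = np.array([0, 0, 1, 1])

# "training": store one mean vector (centroid) per class
toy_nc_classes = np.unique(toy_nc_y)
toy_nc_centroids = np.vstack(
    [toy_nc_X[toy_nc_y == k].mean(axis=0) for k in toy_nc_classes])


def toy_predict_centroid(X_new):
    """Assign each row of X_new to the class of its closest centroid."""
    # pairwise Euclidean distances, shape (n_new, n_classes)
    diffs = X_new[:, None, :] - toy_nc_centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return toy_nc_classes[dists.argmin(axis=1)]


toy_nc_pred = toy_predict_centroid(np.array([[0.8, 0.3], [0.2, 0.7]]))
print('centroids:\n', toy_nc_centroids)
print('predictions:', toy_nc_pred)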
#%% [markdown]
# ___
# <a id="naive"></a> <a href="#top">Back to Top</a>
# ## Naive Bayes Classification
# Now let's look at the use of the Naive Bayes classifier. The 20 newsgroups
# dataset has 20 classes and about 130,000 features per instance. Recall that
# the Naive Bayes classifier calculates a posterior distribution for each
# possible class. Each posterior distribution is a multiplication of many
# conditional distributions:
#
# $${\arg \max}_{j} \left(p(class=j)\prod_{i} p(attribute=i|class=j) \right)$$
#
# where $p(class=j)$ is the prior and $p(attribute=i|class=j)$ is the
# conditional probability.
#
# **Question for part 5**: With this many classes and features, how many
# different conditional probabilities need to be parameterized? How many priors
# need to be parameterized?
# %% [markdown]
# Enter your answer here:
#
# There are about 2.6 million conditional probabilities that need to be parameterized (130,107 features x 20 classes = 2,602,140). There are 20 priors that need to be parameterized, one per class.
#%% [markdown]
#
# Use this space for any calculations you might want to do.
#
# The above number was found by multiplying the roughly 130k features by the 20
# classes, since the product inside the argmax needs one conditional
# probability per (attribute, class) pair.
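#%% [markdown]
# The count can be checked with a few lines of arithmetic. The feature count
# is the 130,107 reported for this dataset earlier in the notebook, and the
# `toy_*` variable names are ours, used only for this check.
#%%
toy_n_features = 130107  # number of tf-idf features per instance
toy_n_classes = 20       # number of newsgroups

# one conditional p(attribute=i | class=j) per (feature, class) pair
toy_n_conditionals = toy_n_features * toy_n_classes
# one prior p(class=j) per class
toy_n_priors = toy_n_classes

print('conditional probabilities:', toy_n_conditionals)
print('priors:', toy_n_priors)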
#%% [markdown]
# ___
# ## Naive Bayes in Scikit-learn
# Scikit has several implementations of the Naive Bayes classifier:
# `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. Look at the documentation
# here: http://scikit-learn.org/stable/modules/naive_bayes.html Take a look at
# each implementation and then answer this question:
#
# **Questions for part 6**:
# - If the instances contain mostly continuous attributes, would it be better to
# use Gaussian Naive Bayes, Multinomial Naive Bayes, or Bernoulli? And Why?
# - What if the data is sparse, does this change your answer? Why or Why not?
#
#%% [markdown]
# Enter your answer here:
#
#
#
# Gaussian Naive Bayes is the variant designed for continuous features. It
# assumes that, within each class, every feature follows a Gaussian (normal)
# distribution. If the data is sparse, our answer likely changes: a feature
# that is zero in almost every instance is poorly described by a Gaussian, and
# scikit-learn's `GaussianNB` needs a dense array. For sparse counts or tf-idf
# values, Multinomial (or Bernoulli on binarized features) Naive Bayes is
# usually the better fit.
# ___
#%% [markdown]
# ## Naive Bayes Comparison
# For the final section of this notebook let's compare the performance of Naive
# Bayes for document classification. Look at the parameters for `MultinomialNB`,
# and `BernoulliNB` (especially `alpha` and `binarize`).
#
# **Exercise for part 7**:
#
# Using the example code below, change the parameters for each classifier and
# see how accurate you can make the classifiers on the test set.
#
# **Question for part 7**:
#
# Why are these implementations so fast to train? What does the `'alpha'` value
# control in these models (*i.e.*, how does it change the parameterizations)?
#
#
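# 1. Both are fast to train because fitting only requires one pass over the
#    data to accumulate per-class feature totals: MultinomialNB sums the
#    feature values (counts or tf-idf weights) under a multinomial model,
#    while BernoulliNB counts binarized feature presence under a Bernoulli
#    model. There is no iterative optimization.
# 2. The `alpha` value controls additive (Laplace/Lidstone) smoothing: it is
#    added to every feature total so that a feature never seen in a class
#    does not get a probability of zero.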
#%%
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
for a in [0.0, 0.001, 0.01, 0.1, 1]:
    # alpha is the additive smoothing strength
    clf_mnb = MultinomialNB(alpha=a)
    clf_mnb.fit(X_train, y_train)
    yhat = clf_mnb.predict(X_test)
    acc = accuracy_score(y_test, yhat)
    print("MultinomialNB (alpha=%f): %f" % (a, acc))
    for b in [0.0, 0.002, 0.02, 0.04, 0.06, 0.08, 0.2]:
        # binarize is the threshold above which a feature counts as present
        clf_bnb = BernoulliNB(alpha=a, binarize=b)
        clf_bnb.fit(X_train, y_train)
        yhat = clf_bnb.predict(X_test)
        acc = accuracy_score(y_test, yhat)
        print("BernoulliNB (alpha=%f, binarize=%f): %f" % (a, b, acc))
print('These classifiers are so fast because training only tallies per-class\n feature totals (MultinomialNB) or per-class feature presence (BernoulliNB)\n in a single pass over the data, with no iterative optimization.\n\n')
print('The alpha value controls additive (Laplace/Lidstone) smoothing of those totals.')
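#%% [markdown]
# A minimal sketch of what `alpha` does to the parameters, on a made-up count
# matrix rather than the newsgroup data. The `toy_*` names and values are
# assumptions for illustration. The smoothed estimate computed by hand,
# (N_ci + alpha) / (N_c + alpha * n_features), should agree with the
# `feature_log_prob_` attribute that `MultinomialNB` learns.
#%%
import numpy as np
from sklearn.naive_bayes import MultinomialNB

toy_nb_counts = np.array([[2, 1, 0], [3, 0, 0], [0, 1, 4], [0, 2, 3]])
toy_nb_labels = np.array([0, 0, 1, 1])
toy_nb_alpha = 1.0

toy_nb = MultinomialNB(alpha=toy_nb_alpha).fit(toy_nb_counts, toy_nb_labels)

# per-class feature totals N_ci: training is just this one summation pass
toy_nb_totals = np.vstack(
    [toy_nb_counts[toy_nb_labels == k].sum(axis=0) for k in (0, 1)])
# add alpha to every total, then renormalize each class row
toy_nb_smoothed = (toy_nb_totals + toy_nb_alpha) / (
    toy_nb_totals.sum(axis=1, keepdims=True)
    + toy_nb_alpha * toy_nb_counts.shape[1])

toy_nb_matches = np.allclose(np.log(toy_nb_smoothed), toy_nb.feature_log_prob_)
print('hand-computed probabilities:\n', toy_nb_smoothed)
print('matches MultinomialNB:', toy_nb_matches)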
#%% [markdown]
# ________________________________________________________________________________________________________
#
# That's all! Please **upload your rendered notebook to blackboard** and please include **team member names** in the notebook submission.
#%%