Example #1
# ![](Figures/gaussian_fit.png)
# 
# From your plot, you can see that most of the examples are in the region with the highest probability, while
# the anomalous examples are in the regions with lower probabilities.
# 
# To visualize the Gaussian fit, we first estimate the parameters of the assumed Gaussian distribution, then compute the probability density at each point, and finally plot the fitted distribution together with the data to show where each point falls.

# In[18]:


#  Estimate mu and sigma2
mu, sigma2 = estimateGaussian(X)

#  Returns the density of the multivariate normal at each data point (row) 
#  of X
p = utils.multivariateGaussian(X, mu, sigma2)

#  Visualize the fit
utils.visualizeFit(X,  mu, sigma2)
pyplot.xlabel('Latency (ms)')
pyplot.ylabel('Throughput (mb/s)')
pyplot.tight_layout()
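
# The density step above can be sketched as follows. This is a minimal
# illustration that assumes independent features (a diagonal covariance built
# from the per-feature variances in sigma2). The helper name
# `diagonal_gaussian_density` is hypothetical and is not part of `utils`;
# `utils.multivariateGaussian` may be implemented differently.
import numpy as np


def diagonal_gaussian_density(X, mu, sigma2):
    """Return p(x) for each row of X as a product of 1-D Gaussian densities."""
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    # Normalizing constant of each univariate Gaussian
    norm = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    # Exponential term, evaluated per example and per feature
    expo = np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    # Multiply across features (assumption: features are independent)
    return np.prod(norm * expo, axis=1)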


# <a id="section2"></a>
# ### 1.3 Selecting the threshold, $\varepsilon$
# 
# Now that you have estimated the Gaussian parameters, you can investigate which examples have a very high probability given this distribution and which examples have a very low probability. The low probability examples are more likely to be the anomalies in our dataset. One way to determine which examples are anomalies is to select a threshold based on a cross validation set. In this part of the exercise, you will implement an algorithm to select the threshold $\varepsilon$ using the $F_1$ score on a cross validation set.
# 
# 
# You should now complete the code for the function `selectThreshold`. For this, we will use a cross validation set $\{ (x_{cv}^{(1)}, y_{cv}^{(1)}), \dots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})\}$, where the label $y = 1$ corresponds to an anomalous example, and $y = 0$ corresponds to a normal example. For each cross validation example, we will compute $p\left( x_{cv}^{(i)}\right)$. The vector of all of these probabilities $p\left( x_{cv}^{(1)}\right), \dots, p\left( x_{cv}^{(m_{cv})}\right)$ is passed to `selectThreshold` in the vector `pval`. The corresponding labels $y_{cv}^{(1)} , \dots , y_{cv}^{(m_{cv})}$ are passed to the same function in the vector `yval`.
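
# As a hedged illustration of the scoring described above, the sketch below
# computes the F1 score for a single candidate threshold. It assumes an
# example is flagged as anomalous when p(x) < epsilon. The helper name
# `f1_for_threshold` is hypothetical and not part of the exercise code;
# `selectThreshold` would evaluate a score like this for many candidate
# values of epsilon and keep the best one.
import numpy as np


def f1_for_threshold(yval, pval, epsilon):
    """Return the F1 score obtained by flagging pval < epsilon as anomalies."""
    yval = np.asarray(yval).astype(bool)
    predictions = np.asarray(pval) < epsilon
    tp = np.sum(predictions & yval)    # anomalies correctly flagged
    fp = np.sum(predictions & ~yval)   # normal examples flagged by mistake
    fn = np.sum(~predictions & yval)   # anomalies that were missed
    if tp == 0:
        # Precision or recall is zero (or undefined), so report F1 as 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

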
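def estimateGaussian(X):
    """Estimate the per-feature mean and variance of X (shape m x n)."""
    m = X.shape[0]
    # Mean of each feature (column)
    mu = X.sum(axis=0) / m
    # Variance of each feature, normalized by m (not m - 1)
    sigma2 = ((X - mu) ** 2).sum(axis=0) / m
    return mu, sigma2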


#  Estimate mu and sigma2
mu, sigma2 = estimateGaussian(X)
# print(mu) print(sigma2)
#  Returns the density of the multivariate normal at each data point (row)
#  of X
p = utils.multivariateGaussian(X, mu, sigma2)

# #  Visualize the fit
# utils.visualizeFit(X,  mu, sigma2)
# pyplot.xlabel('Latency (ms)')
# pyplot.ylabel('Throughput (mb/s)')
# pyplot.tight_layout()
# pyplot.show()


def selectThreshold(yval, pval):
    bestEpsilon = 0
    bestF1 = 0
    F1 = 0

    # Loop over 1,000 evenly spaced candidate epsilon values generated with np.linspace