sjnarmstrong/basic-regression-methods

A few basic regression methods, including linear and Bayesian regression.


We have seen how Bayesian methods can be useful in determining the probability of certain events occurring. We now turn to the problem of linear regression. This involves determining the process used to generate a set of target values from a set of input variables. Formally, given a set of input variables $\mathbf{x}$ and a set of target variables $\mathbf{t}$, we seek to find the function $f(x)$ that was used to generate the target variables from the given input. This is a complex task, as there is often an element of noise added to the function before the target variable is produced. To keep things simple, we will focus on data containing a 1D input variable $x$ and a 1D output target $t$. Furthermore, we use a polynomial function to approximate our underlying function $f(x)$, since most functions can be accurately approximated by a few terms of their Taylor expansion. The parametric function used can therefore be written as follows:

$$y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j = \mathbf{w}^T \boldsymbol{\phi}(x)$$

where $M$ is the order of the polynomial function and controls the maximum complexity of the function. We have also defined $\boldsymbol{\phi}(x)$ as $\boldsymbol{\phi}(x) = [1, x, x^2, \dots, x^M]^T$. Further, for convenience when handling multiple data-points, we define the design matrix $\boldsymbol{\Phi}$ as $\boldsymbol{\Phi} = [\boldsymbol{\phi}(x_1), \boldsymbol{\phi}(x_2), \dots, \boldsymbol{\phi}(x_N)]^T$. Here $N$ denotes the number of data-points.

We can now see that in order to approximate the given function $f(x)$, we must determine suitable values for the parameters $\mathbf{w}$. In this exercise we explore the use of Bayesian and classical methods to achieve this.
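As a concrete illustration of this setup, the sketch below builds the design matrix $\boldsymbol{\Phi}$ for a polynomial basis with NumPy. It is a minimal example; the function and variable names are illustrative and not taken from the repository's code.

```python
import numpy as np

def polynomial_design_matrix(x, order):
    """Return the N x (order+1) design matrix whose n-th row is [1, x_n, ..., x_n^order]."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, order + 1, increasing=True)

# Example: 10 evenly spaced inputs and an order-4 polynomial basis.
x = np.linspace(0.0, 1.0, 10)
Phi = polynomial_design_matrix(x, order=4)
print(Phi.shape)  # (10, 5)
```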

Least Squares Approach

The least squares approach tries to minimise the squared error between the target variables and the parametrised function. This error function is defined as follows:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^T \boldsymbol{\phi}(x_n) \right\}^2$$

This is then minimised in closed form by taking the derivative with respect to $\mathbf{w}$ and setting it to zero. The result is given by [@christopher2016pattern] in equation 3.15 as:

$$\mathbf{w}_{\mathrm{ML}} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{t}$$
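A minimal sketch of this closed-form solution is given below, using a pseudo-inverse for numerical stability. The sine-plus-noise data generation is only an assumed stand-in for the dataset used in the report.

```python
import numpy as np

def fit_least_squares(Phi, t):
    """Closed-form least squares weights w = (Phi^T Phi)^{-1} Phi^T t."""
    # The Moore-Penrose pseudo-inverse is numerically safer than explicitly
    # inverting Phi^T Phi.
    return np.linalg.pinv(Phi) @ t

# Example: noisy samples of a sine curve fitted with an order-4 polynomial.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, 5, increasing=True)
w_ls = fit_least_squares(Phi, t)
```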

Maximum Likelihood Approach

We now consider the likelihood of obtaining the target data from the parametric function. For this we assume that the data has a Gaussian distribution around the given function at any given input $x$. This is therefore written as follows:

$$p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left( t_n \mid \mathbf{w}^T \boldsymbol{\phi}(x_n),\, \beta^{-1} \right)$$

where $\beta$ is the precision of the Gaussian distribution. Taking the natural logarithm of this function we get:

$$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta \cdot \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^T \boldsymbol{\phi}(x_n) \right\}^2$$

Maximising this log likelihood is equivalent to minimising the sum-of-squares error, as the only term dependent on $\mathbf{w}$ is a scalar multiple of the least squares error function. Due to this, $\mathbf{w}_{\mathrm{ML}}$ can be determined with the closed-form solution given above.
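For illustration, the log likelihood above can be evaluated directly for a given weight vector and noise precision. This is a small sketch with assumed variable names, not code from the repository; maximising it over $\mathbf{w}$ recovers the same weights as the least squares fit.

```python
import numpy as np

def log_likelihood(w, beta, Phi, t):
    """Gaussian log likelihood ln p(t | w, beta) for the polynomial model."""
    N = len(t)
    sum_sq_error = 0.5 * np.sum((t - Phi @ w) ** 2)
    return 0.5 * N * np.log(beta) - 0.5 * N * np.log(2.0 * np.pi) - beta * sum_sq_error
```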

Bayesian Approach

The Bayesian approach attempts to determine the probability of the parameters $\mathbf{w}$ given the target variables $\mathbf{t}$. Assuming this takes a Gaussian form, we can model this probability as follows:

$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}\!\left( \mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N \right)$$

where $\mathbf{m}_N$ represents the mean of the weights and $\mathbf{S}_N$ represents the covariance. These can be determined in a Bayesian approach by assuming an initial prior mean $\mathbf{m}_0$ and covariance $\mathbf{S}_0$. Equations 3.50 and 3.51 from [@christopher2016pattern] can then be used to update these parameters. This update step is given as follows:

$$\mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^T \mathbf{t} \right)$$

$$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}$$

It is common practice to assume a zero mean for $\mathbf{m}_0$ and a large variance for $\mathbf{S}_0$, corresponding to $\mathbf{S}_0 = \alpha^{-1} \mathbf{I}$ with a small value of $\alpha$. Here $\mathbf{I}$ is the identity matrix.
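The update step can be written as a short function. The sketch below is a minimal implementation of these two equations; the names and the example prior settings are illustrative only.

```python
import numpy as np

def posterior(Phi, t, beta, m0, S0):
    """Posterior N(w | m_N, S_N) from the prior N(w | m0, S0) and noise precision beta."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi            # S_N^{-1} = S_0^{-1} + beta * Phi^T Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)      # m_N = S_N (S_0^{-1} m_0 + beta * Phi^T t)
    return mN, SN

# Zero prior mean and a broad isotropic prior S0 = (1/alpha) I (illustrative values only).
order, alpha, beta = 4, 1e-3, 25.0
m0 = np.zeros(order + 1)
S0 = np.eye(order + 1) / alpha
# mN, SN = posterior(Phi, t, beta, m0, S0)
```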

Results and Discussion of the Above Methods

We now run the above algorithms on a dataset containing 10 points with corresponding $x$ and $t$ values. We will assume that the data is generated in such a manner that the noise precision $\beta$ and the prior precision $\alpha$ are known. Furthermore, we will assume a zero prior mean on the weights for the Bayesian linear regression. We start by considering an order 4 polynomial function. The results are given in figure [fig:E3:Or4:LSQ].

Figure [fig:E3:Or4:LSQ]: Plot of the results of least squares curve fitting (left) and maximum likelihood (right), with an order 4 polynomial function.

One can see that these produce identical results, as they are mathematically equivalent. With the Bayesian approach, we are also able to quantify our certainty in a predicted point. This is shown in figure [fig:E3:Or4:Bays] by plotting the standard deviation around the mean, indicated by the dashed line. The Bayesian approach is also useful because it is generative: we are able to produce new data-points following a similar distribution to the observed data-points, and we can also draw a set of candidate functions that are likely to have generated the data. This is done in the right plot of figure [fig:E3:Or4:Bays].
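Both the plotted standard deviation and the sampled candidate functions follow directly from the posterior. The sketch below uses the standard predictive-variance expression for Bayesian linear regression and assumed variable names; it is an illustration rather than the repository's implementation.

```python
import numpy as np

def predictive(phi_x, mN, SN, beta):
    """Predictive mean and standard deviation at a single basis vector phi(x)."""
    mean = phi_x @ mN
    var = 1.0 / beta + phi_x @ SN @ phi_x   # noise variance + parameter uncertainty
    return mean, np.sqrt(var)

def sample_candidate_weights(mN, SN, n_samples, seed=0):
    """Draw plausible weight vectors from the posterior to plot candidate curves."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mN, SN, size=n_samples)
```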

Figure [fig:E3:Or4:Bays]: Plot of the results of Bayesian curve fitting, with an order 4 polynomial function.

We now fit the same data with an order 9 polynomial function. The results of the least squares and maximum likelihood curve fitting are shown in figure [fig:E3:Or9:LSQ]. These graphs show a phenomenon known as over-fitting. The data has 10 degrees of freedom, all of which can be accounted for by the parametric equation. Due to this, the best fit for the data is one that goes through all the points. This has a very low error but often does not generalise well to new data. Assuming that the test data was generated from a sine function, one can see that these new functions provide a poor approximation.

Figure [fig:E3:Or9:LSQ]: Plot of the results of least squares curve fitting (left) and maximum likelihood (right), with an order 9 polynomial function.

The results of Bayesian regression are far less affected by the change in order, and one can hardly identify the difference between order 4 and order 9. This is due to an inherent feature of Bayesian regression, whereby one can identify over-fitting with the training data alone. This mechanism can be intuitively understood by referring to equation 3.55 from [@christopher2016pattern], the log of the posterior distribution over the weights. This states:

$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^T \boldsymbol{\phi}(x_n) \right\}^2 - \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{const}$$

From the term $-\frac{\alpha}{2} \mathbf{w}^T \mathbf{w}$, it is possible to see that the posterior probability is negatively influenced by adding more (or larger) parameters. Due to this, the Bayesian regression function will limit its effective complexity to keep the effect of this term low.

Figure [fig:E3:Or9:Bays]: Plot of the results of Bayesian curve fitting, with an order 9 polynomial function.

It is also interesting to see how the standard deviation of the fitted curve changes as the number of available training points is reduced. This is shown in figure [fig:E3:Or9:Bays:RandRem]. Here, 5 points have been removed from near the start of the data. Due to the lack of information, the standard deviation of the function around that region is increased. This result is very useful for real-life applications, where the certainty of the predictions is required to make an informed decision.

Figure [fig:E3:Or9:Bays:RandRem]: Plot of the results of Bayesian curve fitting on partial data, with an order 9 polynomial function.

Bayesian Model Comparison

We now use Bayesian methods to determine the best model $\mathcal{M}_i$ out of a set of models to explain the underlying data $\mathcal{D}$. For this we need to evaluate $p(\mathcal{M}_i \mid \mathcal{D})$, for which we can use Bayes' rule:

$$p(\mathcal{M}_i \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{M}_i)\, p(\mathcal{M}_i)}{p(\mathcal{D})}$$

If we assume that the prior probability $p(\mathcal{M}_i)$ is constant over all models, then we can simplify this to:

$$p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathcal{M}_i)$$

Therefore, it is equivalent to work out $p(\mathcal{D} \mid \mathcal{M}_i)$ and normalise over all the models. When comparing a list of polynomial functions, this means we can use $p(\mathbf{t} \mid \alpha, \beta)$, evaluated for each order $M$, to determine the best model for the data. This is known as the evidence function. The formula required to calculate this is given by [@christopher2016pattern] in equation 3.78. This states that:

$$p(\mathbf{t} \mid \alpha, \beta) = \left( \frac{\beta}{2\pi} \right)^{N/2} \left( \frac{\alpha}{2\pi} \right)^{M/2} \int \exp\{-E(\mathbf{w})\}\, \mathrm{d}\mathbf{w}$$

Where we can use equation 3.85 from [@christopher2016pattern], which states:

$$\int \exp\{-E(\mathbf{w})\}\, \mathrm{d}\mathbf{w} = \exp\{-E(\mathbf{m}_N)\}\, (2\pi)^{M/2}\, |\mathbf{A}|^{-1/2}$$

In order to compute this, we also require the following:

$$\mathbf{A} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}, \qquad \mathbf{m}_N = \beta \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{t}, \qquad E(\mathbf{m}_N) = \frac{\beta}{2} \left\| \mathbf{t} - \boldsymbol{\Phi} \mathbf{m}_N \right\|^2 + \frac{\alpha}{2} \mathbf{m}_N^T \mathbf{m}_N$$
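Taking the logarithm of the expression above gives a numerically stable way to evaluate the evidence. The sketch below combines these pieces into a single log-evidence function; variable names are illustrative and a zero-mean isotropic prior is assumed, as earlier. Exponentiating and normalising these values over the candidate orders then gives the model posterior under a flat model prior.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood ln p(t | alpha, beta) for Bayesian linear regression."""
    N, num_basis = Phi.shape                      # num_basis = polynomial order + 1
    A = alpha * np.eye(num_basis) + beta * Phi.T @ Phi
    mN = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5 * beta * np.sum((t - Phi @ mN) ** 2) + 0.5 * alpha * mN @ mN
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * num_basis * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2.0 * np.pi))
```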

Using a new dataset containing 80 samples, produced in a similar fashion to the dataset used in the Results and Discussion section above, we can evaluate the evidence as given above. This is done for a range of polynomial orders $M$. The results of this are shown in figure [fig:E3:Evi].

Figure [fig:E3:Evi]: Plot of the model evidence for various values of M.

We can see from this that the best fit to the data corresponds to $M = 3$. To justify this result, we can turn to the Taylor expansion of a sine function. This is given by:

$$\sin(x) = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!}\, x^{2n+1} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$$

This is an odd function, and hence even powers of $x$ do not contribute to the final form of the function. Furthermore, the factorial in the denominator of each term means that the contribution of each term diminishes quickly. These observations can be seen in figure [fig:E3:Evi], as $M = 3$ is a clear maximum followed by a sudden and sharp drop. The plot corresponding to the most likely model is given in figure [fig:E3:m3:DTA2].

Figure [fig:E3:m3:DTA2]: Plot of the fitted curve for M=3.

Bayesian Model Averaging

We now average all the models tested in the Bayesian Model Comparison section above. For this, we can take a weighted sum over the model space. This is given by equation 3.67 from [@christopher2016pattern]:

$$p(t \mid \mathbf{x}, \mathcal{D}) = \sum_{i=1}^{L} p(t \mid \mathbf{x}, \mathcal{M}_i, \mathcal{D})\, p(\mathcal{M}_i \mid \mathcal{D})$$

We can determine $p(\mathcal{M}_i \mid \mathcal{D})$ by normalising over the evidence function, since the model posterior is proportional to the evidence under a flat model prior. We can then use the mixture mean and variance relations provided by [@trailovic2002variance] to determine the mean and standard deviation of the averaged prediction. Note that these equations only estimate the mean and variance of the distribution. This is because a mixture distribution will most likely be multi-modal and contain more than one local maximum.
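The sketch below computes the moments of the averaged prediction using the standard mean and variance of a mixture distribution (law of total variance); it is an illustration under that assumption rather than a transcription of the equations in [@trailovic2002variance].

```python
import numpy as np

def mixture_mean_std(model_probs, means, stds):
    """Mean and standard deviation of a weighted mixture of Gaussian predictions.

    model_probs: posterior model probabilities (summing to 1); means/stds: the
    per-model predictive means and standard deviations at the same input point.
    """
    model_probs = np.asarray(model_probs, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(stds, dtype=float) ** 2
    mix_mean = np.sum(model_probs * means)
    # Law of total variance: expected variance + variance of the means.
    mix_var = np.sum(model_probs * (variances + means ** 2)) - mix_mean ** 2
    return mix_mean, np.sqrt(mix_var)
```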

The results of the mixture distribution are given in figure [fig:E3:mix].

Figure [fig:E3:mix]: Plot of the fitted curve for a weighted mixture distribution.

When comparing figures [fig:E3:m3:DTA2] and [fig:E3:mix], one can see that the two are very similar. Hence it is valid to use the most likely model as an approximation to the mixture distribution, which saves a significant amount of computation at little cost in accuracy.

Determining the Hyperparameters

In order to determine $\alpha$ and $\beta$, we first need to assume an initial $\alpha_0$ and $\beta_0$. We then compute $\mathbf{m}_N$ using the posterior mean equation given earlier with this initial guess of $\alpha$ and $\beta$. We then need to compute the following two values:

$$E_W(\mathbf{m}_N) = \frac{1}{2} \mathbf{m}_N^T \mathbf{m}_N, \qquad E_D(\mathbf{m}_N) = \frac{1}{2} \sum_{n=1}^{N} \left\{ t_n - \mathbf{m}_N^T \boldsymbol{\phi}(x_n) \right\}^2$$

These are then used to compute the new parameters $\alpha$ and $\beta$ using equations 3.98 and 3.99 from [@christopher2016pattern]:

$$\alpha = \frac{M}{2 E_W(\mathbf{m}_N)}, \qquad \beta = \frac{N}{2 E_D(\mathbf{m}_N)}$$

This is then repeated until convergence or until a maximum number of iterations is reached. It is important to note that this method is only valid when the number of data points is much larger than the order of the polynomial function. If this is not the case, one must employ a more complicated procedure.
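A minimal sketch of this iterative re-estimation procedure is given below, assuming a zero-mean isotropic prior and the simplified updates described above; names and default values are illustrative.

```python
import numpy as np

def estimate_hyperparameters(Phi, t, alpha0=1.0, beta0=1.0, max_iter=100, tol=1e-6):
    """Iteratively re-estimate alpha and beta, assuming N >> number of basis functions."""
    N, num_basis = Phi.shape
    alpha, beta = alpha0, beta0
    for _ in range(max_iter):
        # Posterior mean for the current alpha and beta (zero prior mean assumed).
        A = alpha * np.eye(num_basis) + beta * Phi.T @ Phi
        mN = beta * np.linalg.solve(A, Phi.T @ t)
        E_W = 0.5 * mN @ mN
        E_D = 0.5 * np.sum((t - Phi @ mN) ** 2)
        alpha_new = num_basis / (2.0 * E_W)
        beta_new = N / (2.0 * E_D)
        converged = abs(alpha_new - alpha) < tol and abs(beta_new - beta) < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    return alpha, beta
```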
