PREDICTING PRICE OF HOUSES BY ANALYSING VARIOUS FACTORS
The project applies regression algorithms to the dataset to predict the selling price of a new home. The objective is to apply data visualization techniques to gain insight into the data, and to evaluate the performance and predictive power of models trained and tested on the dataset.
The dataset contains 21,613 entries and 21 variables:
price
sqft_living
grade
sqft_above
sqft_living15
bathrooms
view
sqft_basement
bedrooms
lat
waterfront
floors
yr_renovated
sqft_lot
sqft_lot15
yr_built
condition
long
year
month
I have implemented this project in three parts. The first part covers data analysis by visualizing the dataset with various plots, as explained in Datavisual.py. The second trains and tests machine learning models and identifies the best algorithm, as explained in Housepricesprediction.py. The third tests our model on our own data, as explained in houseprice.py. Here, I have trained the following algorithms.
Linear Regression refers to a model that assumes a linear relationship between the input variables and the target variable; it attempts to model that relationship by fitting a linear equation to the observed data. Multiple Linear Regression is an extension of Simple Linear Regression: it explains the relationship between multiple independent (predictor) variables and one dependent (criterion) variable.
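As a minimal sketch of fitting a multiple linear regression with scikit-learn (the file name and feature subset below are illustrative assumptions, not the project's exact setup):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Illustrative file name; substitute the actual dataset path.
data = pd.read_csv("kc_house_data.csv")

# A few of the predictors listed above; price is the target.
features = ["sqft_living", "grade", "bathrooms", "bedrooms", "floors"]
X, y = data[features], data["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))
```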
As the number of features increases, the regression equation can become higher order (a polynomial equation) and overfit, meaning the model does not generalize to real-world values; overfitting can be reduced by shrinking the size of the coefficients. Ridge Regression is a popular type of regularized linear regression that includes an L2 penalty: it extends linear regression by adding a regularization penalty to the loss function during training. The L2 penalty penalizes the model based on the sum of the squared coefficient values, scaled by a strength parameter (alpha).
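A minimal sketch, reusing the train/test split from the previous example (the alpha value is illustrative):

```python
from sklearn.linear_model import Ridge

# L2 penalty: alpha * sum(coef_i ** 2) is added to the squared-error loss.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("test R^2:", ridge.score(X_test, y_test))
```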
Lasso regression is a type of linear regression that uses shrinkage, where data values are shrunk towards a central point, like the mean. Lasso Regression uses the L1 regularization technique and, like ridge, extends linear regression by adding a regularization penalty to the loss function during training. The L1 penalty penalizes the model based on the sum of the absolute coefficient values, scaled by a strength parameter (alpha). The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable.
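A minimal sketch along the same lines (alpha again illustrative); because the L1 penalty can drive coefficients exactly to zero, lasso effectively selects a subset of the predictors:

```python
from sklearn.linear_model import Lasso

# L1 penalty: alpha * sum(|coef_i|); some coefficients shrink to exactly 0.
lasso = Lasso(alpha=1.0, max_iter=10_000).fit(X_train, y_train)
print("test R^2:", lasso.score(X_test, y_test))
print("zeroed coefficients:", (lasso.coef_ == 0).sum())
```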
By combining lasso and ridge regression we get Elastic-Net Regression, which combines the regularization of both. The benefit is that elastic net allows a balance of both penalties, which can result in better performance than a model with either one or the other penalty on some problems.
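A minimal sketch (values illustrative); in scikit-learn the balance between the two penalties is set by l1_ratio:

```python
from sklearn.linear_model import ElasticNet

# l1_ratio=0.5 gives an even mix of the L1 and L2 penalties;
# l1_ratio=1.0 is pure lasso, l1_ratio=0.0 is pure ridge.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10_000).fit(X_train, y_train)
print("test R^2:", enet.score(X_test, y_test))
```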
Gradient boosting is one of the most powerful techniques for building predictive models; it produces a prediction model in the form of an ensemble of weak prediction models. Gradient Boosting Regression calculates the difference between the current prediction and the known correct target value; this difference is called the residual. It then trains a weak model that maps the features to that residual. The residual predicted by the weak model is added to the existing model output, and this process nudges the model towards the correct target. Repeating this step improves the overall model prediction.
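As a minimal sketch of this residual-fitting loop (shallow decision trees as the weak models; the depth, learning rate, and number of rounds are illustrative assumptions, and the train/test split is reused from the first example):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

def boost(X, y, n_rounds=100, learning_rate=0.1):
    """Hand-rolled residual-fitting loop (illustrative, squared-error loss)."""
    prediction = np.full(len(y), y.mean())        # start from a constant model
    trees = []
    for _ in range(n_rounds):
        residual = y - prediction                 # gap to the correct target
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        prediction += learning_rate * tree.predict(X)   # nudge toward target
        trees.append(tree)
    return y.mean(), trees                        # base value + weak models

# The scikit-learn equivalent runs the same loop internally
# (hyperparameters illustrative).
gbr = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3
).fit(X_train, y_train)
print("train R^2:", gbr.score(X_train, y_train))
print("test  R^2:", gbr.score(X_test, y_test))
```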
Gradient boosting regression is the model that performed best among the several models (MULTIPLE LINEAR REGRESSION, RIDGE REGRESSION, LASSO REGRESSION, ELASTIC NET REGRESSION).
Prediction accuracy was tested by train score and test score:
Linear Regression Model: 69.315% and 70.401%
Ridge Regression Model: 69.311% and 70.398%
Lasso Regression Model: 69.315% and 70.395%
ElasticNet Regression Model: 69.314% and 70.398%
GradientBoosting Regression Model: 97.664% and 89.122%
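Assuming the percentages above are R^2 scores from scikit-learn's .score method (the default metric for its regressors), the comparison can be reproduced with a loop like this (model settings illustrative):

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor

# Reuses the X_train/X_test, y_train/y_test split from the first example.
models = {
    "Linear Regression Model": LinearRegression(),
    "Ridge Regression Model": Ridge(alpha=1.0),
    "Lasso Regression Model": Lasso(alpha=1.0, max_iter=10_000),
    "ElasticNet Regression Model": ElasticNet(alpha=1.0, l1_ratio=0.5,
                                              max_iter=10_000),
    "GradientBoosting Regression Model": GradientBoostingRegressor(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    train, test = model.score(X_train, y_train), model.score(X_test, y_test)
    print(f"{name}: {train:.3%} and {test:.3%}")
```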
It was concluded that the GradientBoosting Regression Model gives the best result.