Example #1
# `hpp` and `vs` are local helper modules from the project (data loading,
# model fitting, and visualization); train_test_split comes from scikit-learn.
from sklearn.model_selection import train_test_split


def main():
    # load data from csv file
    data, features, labels = hpp.DataLoad()
    # Explore dataset using Statistics and graphs
    hpp.ExploreData(data, features, labels)
    # split data into training and testing sets
    features_train, features_test, labels_train, labels_test = train_test_split(
        features, labels, test_size=0.2, random_state=0)
    # Produce learning curves for varying training set sizes and maximum depths
    vs.ModelLearning(features, labels)
    vs.ModelComplexity(features_train, labels_train)
    # Fit the training data to the model using grid search
    reg = hpp.fitModel(features_train, labels_train)
    print("Parameter 'max-depth' is {} for the optimal model".format(
        reg.get_params()['max_depth']))
    print("Enter number of houses for which you want to predict their prices:")
    num_of_client = int(input())
    client_data = []
    for i in range(num_of_client):
        print(
            "Enter the feature values of client {}'s house in the sequence: number of rooms, lower class status and student-teacher ratio:"
            .format(i + 1))
        client_data.append(list(map(float, input().rstrip().split())))
    for i, price in enumerate(reg.predict(client_data)):
        print("Predicted selling price for Client {}'s house: ${:,.2f}".format(
            i + 1, price))
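
# Standard entry-point guard so the script runs main() when executed directly:
if __name__ == "__main__":
    main()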
Example #2
# In conclusion, in supervised learning we should split our data into training and testing sets and estimate performance on the held-out set to see whether the model is underfitting or overfitting, or whether it does a good job of prediction.
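# A minimal sketch of that diagnosis on synthetic data (a hypothetical example, not from this project): a large gap between train and test R^2 signals overfitting, while two low scores signal underfitting.

# In[ ]:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
tree = DecisionTreeRegressor().fit(X_tr, y_tr)  # unpruned tree memorizes the training set
print("train R^2:", tree.score(X_tr, y_tr))  # 1.0 on training data
print("test  R^2:", tree.score(X_te, y_te))  # noticeably lower: overfitting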

# ----
#
# ## Analyzing Model Performance
# In this third section of the project, you'll take a look at several models' learning and testing performances on various subsets of training data. Additionally, you'll investigate one particular algorithm with an increasing `'max_depth'` parameter on the full training set to observe how model complexity affects performance. Graphing your model's performance based on varying criteria can be beneficial in the analysis process, such as visualizing behavior that may not have been apparent from the results alone.

# ### Learning Curves
# The following code cell produces four graphs for a decision tree model with different maximum depths. Each graph visualizes the learning curves of the model for both training and testing as the size of the training set is increased. Note that the shaded region of a learning curve denotes the uncertainty of that curve (measured as the standard deviation). The model is scored on both the training and testing sets using R<sup>2</sup>, the coefficient of determination.
#
# Run the code cell below and use these graphs to answer the following question.

# In[9]:

# Produce learning curves for varying training set sizes and maximum depths
vs.ModelLearning(features, prices)
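
# For reference, a minimal sketch of how similar curves can be computed directly with scikit-learn's `learning_curve` (the `vs.ModelLearning` helper wraps comparable logic; `features` and `prices` come from the cells above).

# In[ ]:

import numpy as np
from sklearn.model_selection import learning_curve, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeRegressor(max_depth=3), features, prices,
    cv=cv, train_sizes=np.linspace(0.1, 1.0, 9), scoring='r2')
print(sizes)                      # absolute training-set sizes used
print(train_scores.mean(axis=1))  # mean training R^2 at each size
print(test_scores.mean(axis=1))   # mean testing R^2 at each size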

# ### Question 4 - Learning the Data
# *Choose one of the graphs above and state the maximum depth for the model. What happens to the score of the training curve as more training points are added? What about the testing curve? Would having more training points benefit the model?*
# **Hint:** Are the learning curves converging to particular scores?

# **Answer:**
#
# Selecting graph number 2: the maximum depth for this model is 3.
# Of the four graphs, this one shows the visually closest convergence between the training-score and testing-score curves.
# The testing score starts low at around 25 training points and rises quickly by about 50 points. The training score starts at 1.0, drops to about 0.9, and then settles very close to 0.8 at around 100 points; it stays nearly steady as more points are added and never falls below 0.8. By about 350 training points the testing score converges toward the training score, and between 350 and 400 points both curves nudge slightly downward. This indicates that adding data points beyond roughly 350 would not benefit the model.

# ### Complexity Curves
# The following code cell produces a graph for a decision tree model that has been trained and validated on the training data using different maximum depths. The graph produces two complexity curves — one for training and one for validation. Similar to the **learning curves**, the shaded regions of both the complexity curves denote the uncertainty in those curves, and the model is scored on both the training and validation sets using the `performance_metric` function.
#
# Run the code cell below and use this graph to answer the following two questions.
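# The complexity-curve cell is truncated in this excerpt; below is a minimal sketch of a comparable curve using scikit-learn's `validation_curve`, assuming `X_train` and `y_train` from an earlier train/test split.

# In[ ]:

import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

train_scores, valid_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X_train, y_train,
    param_name='max_depth', param_range=np.arange(1, 11), cv=10, scoring='r2')
print(train_scores.mean(axis=1))  # training R^2 keeps rising with depth
print(valid_scores.mean(axis=1))  # validation R^2 peaks, then drops (overfitting)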
Example #3
# No library that computes the coefficient of determination may be imported


def performance_metric2(y_true, y_predict):
    """Calculate and return the R^2 score of the predictions relative to the true values."""
    # R^2 = 1 - SS_res / SS_tot, computed by hand (no scoring library allowed)
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_predict))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Calculate the coefficient of determination for this model's predictions
score = performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
print(
    "Model has a coefficient of determination, R^2, of {:.3f}.".format(score))

# Produce learning curves for varying training set sizes and maximum depths
vs.ModelLearning(X_train, y_train)

# Produce a complexity curve for varying maximum-depth parameters
vs.ModelComplexity(X_train, y_train)

# TODO 4

# Hint: import 'KFold', 'DecisionTreeRegressor', 'make_scorer', 'GridSearchCV'
from sklearn.model_selection import KFold
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV


# #### Q.3 Training and Testing
#  What is the benefit of splitting a dataset into some ratio of training and testing subsets for a
# learning algorithm?
# Answer:
# - If we build a model and check its performance on the same data, the model
# will overfit and perform poorly on unseen data. That is why we use the training data
# to train the model and then use the testing data to check its performance.
# It is important that the two sets are independent of each other, or the result will be biased.

# In[8]:


# Analyzing model performance
vs.ModelLearning(feature, price)


# #### Q.4 Learning the data
# - Choose one of the graphs above and state the maximum depth for the model.
# - What happens to the score of the training curve as more training points are added? What about the
# testing curve?
# - Would having more training points benefit the model?
# 
# Answer:
# 
# 1. max_depth=1 (high-bias scenario)
#     - We can see how the testing score increases with the number of observations.
#     - However, the testing score only increases to approximately 0.4, a low score.
#     This indicates that the model does not generalize well to new, unseen data.
#     - Moreover, the training score (red line) decreases with the number of observations.
score = performance_metric(X[:20], y[:20])
print(
    "Model has a coefficient of determination, R^2, of {:.3f}.".format(score))
# TODO: Import 'train_test_split'
from sklearn.model_selection import train_test_split
# TODO: Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    random_state=4)
print(X_train)
# Success
print("Training and testing split was successful.")
import matplotlib.pyplot as plt  # needed for plt.show() below
import visuals as vs
vs.ModelLearning(X, y)
plt.show()

vs.ModelComplexity(X_train, y_train)
plt.show()

from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit


def fit_model(X, y):
    """ Performs grid search over the 'max_depth' parameter for a
        decision tree regressor trained on the input data [X, y]. """
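    # The original snippet is truncated here; a minimal completion sketch,
    # assuming the grid-search setup used in the other examples and the
    # project's `performance_metric` function:
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
    regressor = DecisionTreeRegressor(random_state=0)
    params = {'max_depth': list(range(1, 11))}
    scoring_fnc = make_scorer(performance_metric)
    grid = GridSearchCV(regressor, params, scoring=scoring_fnc, cv=cv_sets)
    grid = grid.fit(X, y)
    return grid.best_estimator_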
Example #6
# Imports assumed by this snippet (data_df and performance_metric are defined elsewhere)
import numpy as np
import matplotlib.pyplot as plt
import visuals as vs
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer


def visualize_learning_curves(features, prices):
    vs.ModelLearning(features, prices)
# TODO: 6.1 Split data_df into features and target variable
labels = data_df['SalePrice']  # TODO: extract SalePrice as the labels
features = data_df.drop(['SalePrice'],
                        axis=1)  # TODO: keep every column except SalePrice as the features

# TODO: Shuffle and split the training and testing sets
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(features,
                                                    labels,
                                                    test_size=0.2,
                                                    random_state=42)
print("Training and testing split was successful.")

# Produce learning curves for varying training set sizes and maximum depths
vs.ModelLearning(features, labels)
plt.show()

vs.ModelComplexity(X_train, y_train)
plt.show()


def fit_model(X, y):
    # shuffle=True is required when a random_state is passed to KFold
    cross_validator = KFold(n_splits=10, shuffle=True, random_state=42)
    regressor = DecisionTreeRegressor(random_state=42)
    params = {'max_depth': list(range(1, 11))}
    scoring_fnc = make_scorer(performance_metric)
    grid = GridSearchCV(regressor, params, scoring_fnc, cv=cross_validator)
    grid.fit(X, y)

    return grid.best_estimator_
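
# A short usage sketch (with the X_train/y_train split from above):
optimal_reg = fit_model(X_train, y_train)
print("Optimal max_depth: {}".format(optimal_reg.get_params()['max_depth']))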
Example #8
                        scoring=scoring_func,
                        cv=cross_validator)
    return grid


## Fit the model with the grid search
def fit_model(x, y):

    # Build the grid search via GridSearchCV
    grid = build_grid()
    # Use the grid search to fit the data
    grid = grid.fit(x, y)
    return grid.best_estimator_


### Model evaluation
vs.ModelLearning(features_train, prices_train)
vs.ModelComplexity(features_train, prices_train)

optimal_reg1 = fit_model(features_train, prices_train)

optimal_reg1.get_params()['max_depth']  #9

predicted_value = optimal_reg1.predict(features_test)
r2 = performance_metric(prices_test, predicted_value)
r2
# 0.575754796918848
# underfitting

list(df.columns.values)
plt.title("costly location by zipcode?")
plt.show()

from sklearn.linear_model import LinearRegression
_linreg = LinearRegression()
_linlabel = _housedata['price']
_conv_dates = [1 if values == 2014 else 0 for values in _housedata.date]
_housedata['date'] = _conv_dates
_lintrain = _housedata.drop(['id', 'price'], axis=1)

#from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(_lintrain, _linlabel, test_size=0.10, random_state=2)

import visuals as vs
vs.ModelLearning(_mainfeatures, _cost)

vs.ModelComplexity(x_train, y_train)

_linreg.fit(x_train, y_train)
_modelres = _linreg.score(x_test, y_test)  # score() returns R^2 for regressors
print("R^2 score (%) by Linear Regression:", _modelres * 100)

from sklearn import ensemble
clf = ensemble.GradientBoostingRegressor(n_estimators=400, max_depth=5, min_samples_split=2,
                                         learning_rate=0.1, loss='ls')
clf.fit(x_train, y_train)
_modelaccu = clf.score(x_test, y_test)
print("R^2 score (%) by Gradient Boosting Regressor:", _modelaccu * 100)

Example #10
#
# Because R^2 is 0.923, which is fairly close to 1, the prediction result describes the variation of the target variable quite well.

# ---
# ## Step 4. Analyzing Model Performance
# In the fourth step of the project, we examine the model's performance on the training and validation sets under different parameters. Here we focus on one particular algorithm (a decision tree with pruning, though that is not the focus of this project) and one of its parameters, `'max_depth'`. We train on the full training set with different `'max_depth'` values and observe how varying this parameter affects the model's performance. Plotting the model's performance is very helpful for the analysis, as it lets us see behavior that is not apparent from the results alone.

# ### Learning Curves
# The code cell below outputs four graphs showing how a decision tree model performs at different maximum depths. Each curve shows how the model's learning-curve scores on the training and validation sets change as the amount of training data increases; scores use the coefficient of determination, R<sup>2</sup>. The shaded region around each curve denotes its uncertainty (measured as the standard deviation).
#
# Run the code in the cell below and use the output graphs to answer the question that follows.

# In[73]:

# Produce learning curves for varying training set sizes and maximum depths
vs.ModelLearning(X_train, y_train)

# ### Question 4 - Learning Curves
# *Choose one of the graphs above and state its maximum depth. As the amount of training data increases, how does the training curve's score change? What about the validation curve? Would more training data effectively improve the model's performance?*
#
# **Hint:** Do the learning-curve scores eventually converge to particular values?

# ### Question 4 - Answer:
#
# 1) With max_depth = 3, as the number of training points increases, the training curve's score keeps falling while the validation curve keeps rising, and both stabilize at around 0.8. Even with more training data, the model's performance is unlikely to improve further.
#
# 2) In theory, the learning-curve scores should converge to a particular value, around 0.8.

# ### Complexity Curves
# The code cell below outputs a graph showing the performance, at different maximum depths, of a decision tree model that has been trained and validated on the training data. The graph contains two curves, one for the training set and one for the validation set. As with the **learning curves**, the shaded regions denote the uncertainty in those curves, and both the training and validation portions of the model are scored using the `performance_metric` function.
#