Example #1
# hpp (the project's helper module) and vs (visuals.py) are imported elsewhere
# in the project; train_test_split comes from sklearn.
from sklearn.model_selection import train_test_split

def main():
    # Load the data from the CSV file
    data, features, labels = hpp.DataLoad()
    # Explore the dataset using statistics and graphs
    hpp.ExploreData(data, features, labels)
    # Split the data into training and testing sets
    features_train, features_test, labels_train, labels_test = train_test_split(
        features, labels, test_size=0.2, random_state=0)
    # Produce learning curves for varying training set sizes and maximum depths
    vs.ModelLearning(features, labels)
    vs.ModelComplexity(features_train, labels_train)
    # Fit the training data to the model using grid search
    reg = hpp.fitModel(features_train, labels_train)
    print("Parameter 'max-depth' is {} for the optimal model".format(
        reg.get_params()['max_depth']))
    print("Enter number of houses for which you want to predict their prices:")
    num_of_client = int(input())
    client_data = []
    for i in range(num_of_client):
        print(
            "Enter the fature values of client {}'s house in the sequence: number of rooms, lower class status and student teacher ratio:"
            .format(i + 1))
        client_data.append(i)
        client_data[i] = list(map(float, input().rstrip().split()))
    for i, price in enumerate(reg.predict(client_data)):
        print("Predicted selling price for Client {}'s house: ${:,.2f}".format(
            i + 1, price))
# **Hint:** Are the learning curves converging to particular scores?

# **Answer:**
#
# Selecting graph number 2, where max_depth = 3.
# Of the four graphs, it shows the training and testing score curves that come visually closest together.
# The testing score starts low at around 25 training points and rises quickly up to about 50 points. The training score starts at 1.0, drops to about 0.9, and settles very close to 0.8 by about 100 points; it then holds roughly steady above the 0.8 level even as more points are added. Around 350 points the testing score converges toward the training score, and between 350 and 400 points both curves nudge slightly downward. This indicates that adding more data points beyond roughly 350 is unlikely to help.
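# To make the convergence concrete, here is a minimal sketch of the kind of
# computation behind these learning curves, using sklearn's learning_curve.
# The depth-3 regressor and R^2 scoring are assumptions based on the graph
# discussed above, not the project's visuals.py code.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import learning_curve

def sketch_learning_curve(features, labels):
    """Print mean train/validation R^2 at increasing training-set sizes."""
    sizes, train_scores, valid_scores = learning_curve(
        DecisionTreeRegressor(max_depth=3), features, labels,
        train_sizes=np.linspace(0.1, 1.0, 9), cv=10, scoring='r2')
    for n, tr, va in zip(sizes, train_scores.mean(axis=1),
                         valid_scores.mean(axis=1)):
        print("{:>4} points: train R^2 = {:.2f}, validation R^2 = {:.2f}"
              .format(n, tr, va))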

# ### Complexity Curves
# The following code cell produces a graph for a decision tree model that has been trained and validated on the training data using different maximum depths. The graph produces two complexity curves — one for training and one for validation. Similar to the **learning curves**, the shaded regions of both the complexity curves denote the uncertainty in those curves, and the model is scored on both the training and validation sets using the `performance_metric` function.
#
# Run the code cell below and use this graph to answer the following two questions.
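# For reference, a minimal sketch of what such a complexity curve computes,
# using sklearn's validation_curve. The depth range 1-10 and R^2 scoring are
# assumptions mirroring the graph below, not the visuals.py implementation.
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import validation_curve

def sketch_complexity_curve(X, y):
    """Print mean train/validation R^2 for max_depth = 1..10."""
    depths = range(1, 11)
    train_scores, valid_scores = validation_curve(
        DecisionTreeRegressor(), X, y,
        param_name='max_depth', param_range=depths, cv=10, scoring='r2')
    for d, tr, va in zip(depths, train_scores.mean(axis=1),
                         valid_scores.mean(axis=1)):
        print("max_depth={:>2}: train={:.2f}, validation={:.2f}".format(d, tr, va))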

# In[10]:

vs.ModelComplexity(X_train, y_train)

# ### Question 5 - Bias-Variance Tradeoff
# *When the model is trained with a maximum depth of 1, does the model suffer from high bias or from high variance? How about when the model is trained with a maximum depth of 10? What visual cues in the graph justify your conclusions?*
# **Hint:** How do you know when a model is suffering from high bias or high variance?

# **Answer:**
#
# At max_depth = 1, the model suffers from high bias. Both the training and validation scores are low compared with the rest of the graph, which means the model cannot capture the structure in the data.
#
# At max_depth = 10, the model clearly suffers from high variance. The training score is much higher than the validation score in comparison to max_depth = 1, and across the chart at max_depth = 10 the gap between the training and validation scores never closes.
#
# In a nutshell, the rules for identifying high-bias and high-variance charts are:
# - High-variance models show a persistent gap between the training and validation scores; we see this at max_depth = 10.
# - High-bias models show low scores with a small or no gap between the training and validation scores; we see this at max_depth = 1.
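# A quick numerical check of these rules (a sketch under the assumption that
# cross-validated R^2 approximates the validation score in the graph):
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def diagnose_depth(X, y, depth):
    """Compare train R^2 against cross-validated R^2 for one max_depth."""
    model = DecisionTreeRegressor(max_depth=depth).fit(X, y)
    train_r2 = model.score(X, y)
    valid_r2 = cross_val_score(DecisionTreeRegressor(max_depth=depth),
                               X, y, cv=10, scoring='r2').mean()
    print("depth {:>2}: train={:.2f}, validation={:.2f}, gap={:.2f}"
          .format(depth, train_r2, valid_r2, train_r2 - valid_r2))

# Expected pattern: depth 1 gives two low scores with a small gap (high bias);
# depth 10 gives a high train score with a large gap (high variance).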

# Testing whether performance_metric works on a test set ####################
# score = performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
# print("Model has a coefficient of determination, R^2, of {:.3f}.".format(score))
##############################################################################
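# performance_metric is defined elsewhere in the project; a minimal sketch
# consistent with the commented-out test above, using sklearn's r2_score:
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """Coefficient of determination, R^2, of the predictions."""
    return r2_score(y_true, y_predict)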
# Shuffle and split the data into training and testing subsets.
# Note: sklearn.cross_validation is deprecated; newer versions use sklearn.model_selection.
X_train, X_test, y_train, y_test = train_test_split(features['RM'],
                                                    prices,
                                                    test_size=0.20,
                                                    random_state=1)

# From visuals.py: produce learning curves for varying training-set sizes and maximum depths
vs.ModelLearning(features, prices)
# From visuals.py: produce a complexity plot of maximum depth vs. score for the training and testing curves
vs.ModelComplexity(features, prices)


#==============================================================================
# Implement code to fit the model; in this case, the decision tree algorithm.
# (ShuffleSplit, GridSearchCV, make_scorer, and DT -- an alias for
#  sklearn.tree.DecisionTreeRegressor -- are assumed to be imported above.)
def fit_model(features, prices):
    """ Performs grid search over the 'max_depth' parameter for a
        decision tree regressor trained on the input data [X, y]. """

    # Create cross-validation sets from the training data: 10 shuffled
    # splits, each holding out 20% of the data for validation
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.2, random_state=1)

    # Create a decision tree regressor object
    regressor = DT()

    # The snippet is truncated here; a plausible completion, assuming a
    # grid search over max_depth scored with the project's performance_metric:
    params = {'max_depth': range(1, 11)}
    scoring_fnc = make_scorer(performance_metric)
    grid = GridSearchCV(regressor, params, scoring=scoring_fnc, cv=cv_sets)
    grid = grid.fit(features, prices)
    return grid.best_estimator_
Example #4
def visualize_model_complexity_curves(X_train, y_train):
    vs.ModelComplexity(X_train, y_train)
Example #5
# (The snippet begins mid-function; a plausible reconstruction of build_grid,
#  assuming a decision tree regressor grid-searched over max_depth and scored
#  with the project's performance_metric.)
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.metrics import make_scorer

def build_grid():
    regressor = DecisionTreeRegressor()
    params = {'max_depth': range(1, 11)}
    scoring_func = make_scorer(performance_metric)
    cross_validator = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    grid = GridSearchCV(regressor, params,
                        scoring=scoring_func,
                        cv=cross_validator)
    return grid


## Fit the model using the grid
def fit_model(x, y):

    # Build the grid search via GridSearchCV
    grid = build_grid()
    # Use the grid search to fit the data
    grid = grid.fit(x, y)
    return grid.best_estimator_


### Model evaluation
vs.ModelLearning(features_train, prices_train)
vs.ModelComplexity(features_train, prices_train)

optimal_reg1 = fit_model(features_train, prices_train)

optimal_reg1.get_params()['max_depth']  #9

predicted_value = optimal_reg1.predict(features_test)
r2 = performance_metric(prices_test, predicted_value)
r2
#0.575754796918848
# An R^2 of about 0.58 suggests underfitting

list(df.columns.values)
Example #6
# ### Question 4 - Answer:
#
# 1) At max_depth = 3, as more training points are added, the training-curve score keeps dropping while the validation curve keeps rising, stabilizing around 0.8. With more training data, the model's performance is unlikely to improve further.
#
# 2) The learning-curve scores should theoretically converge to a particular value of about 0.8.

# ### Complexity Curves
# The following code cell produces a graph for a decision tree model that has been trained and validated on the training data using different maximum depths. The graph contains two curves: one for the training set and one for the validation set. As with the **learning curves**, the shaded regions denote the uncertainty in those curves, and the model is scored on both the training and validation sets using the `performance_metric` function.
#
# Run the code cell below and use the resulting graph to answer the following two questions.

# In[74]:

# Produce complexity curves for varying maximum-depth parameters
vs.ModelComplexity(X_train, y_train)

# ### Question 5 - Bias-Variance Tradeoff
# *When the model is trained with a maximum depth of 1, does the model suffer from high bias or from high variance? How about when the model is trained with a maximum depth of 10? What visual cues in the graph justify your conclusions?*
#
# **Hint:** How do you know when a model is suffering from high bias or high variance?

# ### Question 5 - Answer:
#
# 1) When trained with a maximum depth of 1, the model shows high bias: the R^2 values in the graph are too low, around 0.4-0.5, which indicates underfitting.
#
# 2) When trained with a maximum depth of 10, the model shows high variance, i.e. overfitting: there is a large gap between the training curve and the validation curve in the graph.

# ### Question 6 - Best-Guess Optimal Model
# *From the graph in Question 5, which maximum depth do you think gives a model that best predicts unseen data? What is the basis for your answer?*