Example #1
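    # Assumed imports for this fragment (not shown in the excerpt):
    #   import numpy as np
    #   from sklearn.model_selection import train_test_split
    utils.r2_performance_metric_example()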

    X_train, X_test, y_train, y_test = train_test_split(features,
                                                        prices,
                                                        test_size=0.2,
                                                        random_state=10)
    print("Training and testing split was successful.")

    utils.visulaize_learning_curves(features, prices)
    utils.visulaize_model_complexity_curves(X_train, y_train)

    # Fit the model to the training data using grid search
    decision_tree_reg = utils.fit_model(np.asarray(X_train),
                                        np.asarray(y_train))

    # Report the value of 'max_depth' selected for the optimal model
    print("Parameter 'max_depth' is {} for the optimal model.".format(
        decision_tree_reg.get_params()['max_depth']))

    # Produce a matrix for client data
    features_set_to_predict = [
        [5, 17, 15],  # Client 1
        [4, 32, 22],  # Client 2
        [8, 3, 12],   # Client 3
    ]
    utils.predict_house_price(decision_tree_reg, features_set_to_predict)

    vs.PredictTrials(features, prices, utils.fit_model,
                     features_set_to_predict)
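
# The 80/20 split at the top of this example relies on scikit-learn's `train_test_split`. As a rough illustration of what such a split does, here is a minimal pure-Python sketch. The name `train_test_split_sketch` is hypothetical, and this is not scikit-learn's implementation: the same seed will not reproduce the same partition as `random_state=10` above.

```python
import random


def train_test_split_sketch(features, targets, test_size=0.2, random_state=None):
    """Shuffle row indices with a seeded RNG and hold out a `test_size` fraction."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    indices = list(range(len(features)))
    random.Random(random_state).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    split = len(indices) - n_test
    train_idx, test_idx = indices[:split], indices[split:]
    X_train = [features[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Made-up rows purely for illustration (not the housing data).
demo_features = [[i, i + 1, i + 2] for i in range(10)]
demo_targets = [100 * i for i in range(10)]
X_tr, X_te, y_tr, y_te = train_test_split_sketch(
    demo_features, demo_targets, test_size=0.2, random_state=10)
print("train rows: {}, test rows: {}".format(len(X_tr), len(X_te)))
```

# Seeding the shuffle is what makes the split repeatable from run to run; changing the seed gives a different partition of the same rows.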
Example #2
#
# For the feature LSTAT (neighborhood poverty level), the highest percentage is Client 2's at 32%, and this is the main reason the price of the 4-room dwelling is brought down to $231,639.13. Even for Client 1, with a 5-room house, the LSTAT of 17% has brought down the price of the property. For Client 3, by contrast, LSTAT is low at 3%, which suggests an affluent locality, so the predicted price is significantly higher at $893,100.00.
#
# PTRATIO is the pupil-to-teacher ratio, a factor parents weigh when moving to a new locality because of its schools. Client 2 has the highest PTRATIO at 22:1, together with the highest LSTAT (more low-income residents in the neighborhood), and this has brought the property price for Client 2 down significantly. Client 1's PTRATIO is better than Client 2's but still higher than Client 3's, so that property does not command a premium relative to Client 3's. Client 3 has the lowest PTRATIO, which has pushed up the price of the property.
#
# In conclusion, I would say the most influential features are RM and LSTAT, which sway prices in either direction depending on their values. PTRATIO also has some impact, but I would guess it follows the same pattern as LSTAT, at least in this client sample.
#

# ### Sensitivity
# An optimal model is not necessarily a robust model. Sometimes, a model is either too complex or too simple to sufficiently generalize to new data. Sometimes, a model could use a learning algorithm that is not appropriate for the structure of the data given. Other times, the data itself could be too noisy or contain too few samples to allow a model to adequately capture the target variable — i.e., the model is underfitted. Run the code cell below to run the `fit_model` function ten times with different training and testing sets to see how the prediction for a specific client changes with the data it's trained on.

# In[43]:

vs.PredictTrials(features, prices, fit_model, client_data)
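
# The repeated-trials idea described above can be sketched without the project's helper module. The version below is only an illustration under stated assumptions: `fit_line` is a one-feature least-squares stand-in for `fit_model`, `predict_trials_sketch` is a hypothetical name, and the data is synthetic. It is not the implementation of `vs.PredictTrials`.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (stand-in for fit_model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_trials_sketch(xs, ys, client_x, n_trials=10, test_size=0.2):
    """Refit on a different random training subset per trial; collect the predictions."""
    predictions = []
    n_train = len(xs) - int(round(len(xs) * test_size))
    for seed in range(n_trials):
        indices = list(range(len(xs)))
        random.Random(seed).shuffle(indices)  # a different split for each trial
        train = indices[:n_train]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        predictions.append(intercept + slope * client_x)
    return predictions


# Synthetic, made-up data: price grows with room count, plus noise.
rng = random.Random(0)
rooms = [rng.uniform(3, 9) for _ in range(200)]
prices_demo = [50000 * r + rng.gauss(0, 40000) for r in rooms]

trial_predictions = predict_trials_sketch(rooms, prices_demo, client_x=5)
for k, value in enumerate(trial_predictions, start=1):
    print("Trial {}: ${:,.2f}".format(k, value))
print("Range in prices: ${:,.2f}".format(
    max(trial_predictions) - min(trial_predictions)))
```

# A small range across trials suggests the prediction does not depend much on which rows happened to land in the training set; a large range points to a model that is sensitive to the sample.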

# ### Question 11 - Applicability
# *In a few sentences, discuss whether the constructed model should or should not be used in a real-world setting.*
# **Hint:** Some questions to answer:
# - *How relevant today is data that was collected from 1978?*
# - *Are the features present in the data sufficient to describe a home?*
# - *Is the model robust enough to make consistent predictions?*
# - *Would data collected in an urban city like Boston be applicable in a rural city?*

# **Answer:**
#
# Although the data was collected in 1978, it was scaled to current market prices. However, to deploy this model in a real-world setting we would need current data; the last two or three years should help gauge how the market has changed since then.
#
# Since 1978 a lot has changed: people's preferences, buying patterns, and so on. New mortgage instruments and other financial products may also complicate what used to be simple property-buying decisions. From my own experience, different people make different trade-offs between locality, available funds, the previous owner's goodwill value (a property once owned by someone rich or famous, such as an author, will command a higher price even in a relatively poor locality), and the location of the property (a property near the main road is likely to fetch a higher price than one in the interior). A property very near a school is bound to command a premium over one farther away if children's education is the main buying criterion. Some people want privacy, like to live in a quiet neighborhood, and are willing to commute to school or the office. This model does not account for any of these buyer preferences.
#
Example #3
# The optimal model's R^2 score is 0.80, which is fairly close to 1, so the optimal model should be trustworthy.
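
# For reference, an R^2 score like the one quoted above can be computed by hand. The sketch below is a minimal pure-Python version, assuming the usual definition 1 - SS_res / SS_tot. The function name `r2_score_sketch` and the sample values are illustrative and are not taken from the project.

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Illustrative values only (not the housing data).
demo_true = [3.0, -0.5, 2.0, 7.0, 4.2]
demo_pred = [2.5, 0.0, 2.1, 7.8, 5.3]
print("R^2 = {:.3f}".format(r2_score_sketch(demo_true, demo_pred)))
```

# A score of 1 means the predictions match the targets exactly, while a score of 0 means the model does no better than always predicting the mean of the targets.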

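# ### Model robustness
#
# An optimal model is not necessarily a robust model. Sometimes a model is too complex or too simple to generalize to new data; sometimes the learning algorithm it uses is not suited to the structure of the data; and sometimes the data itself is too noisy or contains too few samples for the model to predict the target variable accurately. In these cases we say the model is underfitted.
#
# ### Question 12 - Model robustness
#
# Is the model robust enough to make consistent predictions?
#
# **Hint**: Run the code in the cell below to execute the `fit_model` function 10 times with different training and testing sets. Note how the prediction for a specific client changes with the training data.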

# In[80]:

# First comment out all print statements inside the fit_model function
vs.PredictTrials(features, prices, fit_model, client_data)

# ### Question 12 - Answer:
#
# The results of the 10 trials are all around 4 million, with little variation (a range of roughly 320,000), so the model is robust.
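
# To put the figures in the answer above in proportion, the range can be expressed as a fraction of the typical prediction. This small sketch uses only the two rounded numbers quoted in the answer; the variable names are illustrative.

```python
typical_prediction = 4000000.0  # "around 4 million", as stated in the answer
prediction_range = 320000.0     # "roughly 320,000", as stated in the answer

# Range as a share of the typical prediction.
relative_spread = prediction_range / typical_prediction
print("Relative spread: {:.1%}".format(relative_spread))
```

# Whether a spread of this size counts as robust depends on how much error the intended use can tolerate.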

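# ### Question 13 - Applicability
# *Briefly discuss whether the model you built could be used in the real world.*
#
# Hint: answer the following questions and give reasons for your conclusions:
# - *Is data collected in 1978 still applicable today, given that inflation has been accounted for?*
# - *Are the features present in the data sufficient to describe a home?*
# - *Can data collected in a large city like Boston be applied to other towns and rural areas?*
# - *Do you think it is reasonable to judge a home's value solely by the environment of its neighborhood?*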

# ### Question 13 - Answer: