# MAGIC Reference: [Decision Trees](https://en.wikipedia.org/wiki/Decision_tree_learning)

# COMMAND ----------

# MAGIC %md
# MAGIC ### Decision Tree Models

# COMMAND ----------

from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.tuning import ParamGridBuilder

dt = DecisionTreeRegressor()
dt.setLabelCol("PE")
dt.setPredictionCol("Predicted_PE")
dt.setFeaturesCol("features")

# Wrap the vectorizer and the decision tree in a new pipeline.
dtPipeline = Pipeline()
dtPipeline.setStages([vectorizer, dt])

# Let's just reuse our CrossValidator with the new dtPipeline.
crossval.setEstimator(dtPipeline)

# Grid-search over tree depths from 2 through 7.
paramGrid = ParamGridBuilder()\
  .addGrid(dt.maxDepth, range(2, 8))\
  .build()
crossval.setEstimatorParamMaps(paramGrid)

dtModel = crossval.fit(trainingSet)
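
# COMMAND ----------

# MAGIC %md
# MAGIC Once the cross-validation finishes, we can peek at the winning tree and score it on held-out data. This is a quick sketch: it assumes the held-out `testSet` DataFrame and the RMSE-configured `regEval` RegressionEvaluator from earlier cells of this notebook.

# COMMAND ----------

# The best PipelineModel found by the grid search; its last stage is the fitted tree.
bestDTPipeline = dtModel.bestModel
bestDT = bestDTPipeline.stages[-1]
print("Best maxDepth:", bestDT.getOrDefault("maxDepth"))

# Score the tuned pipeline on the held-out test set.
dtPredictionsDF = bestDTPipeline.transform(testSet)
print("Test RMSE:", regEval.evaluate(dtPredictionsDF))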

# COMMAND ----------

# MAGIC %md
# MAGIC In Spark, data is partitioned by row. So when the algorithm needs to make a split, each worker has to compute summary statistics for every feature at each candidate split point, and those statistics then have to be aggregated (via a tree reduce) before a split can be chosen.
# MAGIC
# MAGIC Think about it: if worker 1 has the value `32` but none of the other workers do, how would they communicate how good a split at that value would be? To avoid this, Spark's maxBins parameter discretizes continuous variables into a fixed set of buckets (32 by default), so every worker evaluates the same candidate split points. The same limit applies to categorical features: maxBins must be at least as large as the number of categories in any categorical feature.
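
# COMMAND ----------

# MAGIC %md
# MAGIC To see the categorical constraint in action, here is a self-contained toy sketch (it relies only on the notebook's `spark` session; the data and column names are made up for illustration): the fit fails when a categorical feature has more distinct values than maxBins, and succeeds once maxBins is raised.

# COMMAND ----------

from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor

# A toy DataFrame whose single categorical column has 50 distinct values.
toyDF = spark.createDataFrame([(str(i % 50), float(i)) for i in range(500)],
                              ["category", "label"])
indexedDF = StringIndexer(inputCol="category", outputCol="categoryIdx").fit(toyDF).transform(toyDF)
assembledDF = VectorAssembler(inputCols=["categoryIdx"], outputCol="features").transform(indexedDF)

try:
  # Default maxBins is 32, but the categorical feature has 50 values, so this fails.
  DecisionTreeRegressor(labelCol="label").fit(assembledDF)
except Exception as e:
  print(e)

# Raising maxBins to at least the number of categories fixes it.
treeModel = DecisionTreeRegressor(labelCol="label", maxBins=50).fit(assembledDF)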

# COMMAND ----------

# MAGIC %md
# MAGIC Let's go ahead and increase maxBins from the default of `32` to `40`.

# COMMAND ----------

dt.setMaxBins(40)

# COMMAND ----------

# MAGIC %md
# MAGIC Take two: rerun the cross-validated fit now that maxBins is larger.

# COMMAND ----------

dtModel = crossval.fit(trainingSet)
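
# COMMAND ----------

# MAGIC %md
# MAGIC A quick look at the shape of the refit tree before we visualize it. This sketch uses only the `dtModel` created in the cell above.

# COMMAND ----------

# Pull the fitted tree out of the best pipeline and summarize its shape.
bestTree = dtModel.bestModel.stages[-1]
print("depth = %d, nodes = %d" % (bestTree.depth, bestTree.numNodes))
print(bestTree.featureImportances)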

# COMMAND ----------

# MAGIC %md
# MAGIC ## Visualize the Decision Tree