
Spark Ensemble Classifier (PySpark)

Ensemble Learning using Spark Framework (PySpark)

Problem Definition.


Step 1: Learn first-level classifiers based on the original training data set. We have several choices for learning base classifiers (a PySpark sketch of the first option follows this list):

  1. We can apply the bootstrap sampling technique to learn independent classifiers;
  2. we can adopt the strategy used in boosting, i.e., adaptively learn base classifiers on data with a weight distribution;
  3. we can tune the parameters of a learning algorithm to generate diverse base classifiers (homogeneous classifiers);
  4. we can apply different classification methods and/or sampling methods to generate base classifiers (heterogeneous classifiers).
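A minimal sketch of option 1 using pyspark.ml. The input `train_df` with `features` and `label` columns, the choice of estimators, and their hyperparameters are all assumptions made for illustration:

```python
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

def learn_base_classifiers(train_df, seed=42):
    """Fit one base classifier per bootstrap sample of the training data."""
    # Two heterogeneous estimators, chosen arbitrarily for this sketch.
    estimators = [LogisticRegression(maxIter=50),
                  RandomForestClassifier(numTrees=20)]
    models = []
    for i, est in enumerate(estimators):
        # Bootstrap sample: draw with replacement, same expected size as the input.
        boot = train_df.sample(withReplacement=True, fraction=1.0, seed=seed + i)
        models.append(est.fit(boot))
    return models
```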

Step 2: Construct a new data set based on the output of the base classifiers. The predicted labels output by the first-level classifiers are treated as new features, while the original class labels are kept as the labels of the new data set. Instead of predicted labels, we could use the probability estimates of the first-level classifiers. We could also pass these outputs through activation functions such as ReLU, the logistic function, or tanh to create new features.
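A hedged sketch of this construction, continuing the Step 1 sketch. The `id` join key is introduced here purely for illustration, and the column names `features`, `label`, and `prediction` assume pyspark.ml defaults:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import functions as F

def build_second_level_data(models, df):
    """Return a data set whose features are the base models' predicted labels."""
    df = df.withColumn("id", F.monotonically_increasing_id())
    stacked = df.select("id", "label")
    pred_cols = []
    for i, model in enumerate(models):
        col = f"pred_{i}"
        # Each base model's predicted label becomes one new feature column.
        preds = model.transform(df).select("id", F.col("prediction").alias(col))
        stacked = stacked.join(preds, on="id")
        pred_cols.append(col)
    # Assemble the predicted labels into the new feature vector x_new.
    assembler = VectorAssembler(inputCols=pred_cols, outputCol="features")
    return assembler.transform(stacked).select("features", "label")
```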

Step 3: Learn a second-level classifier based on the newly constructed data set. Any learning method can be applied to learn the second-level classifier.
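Putting the three sketches together; the choice of logistic regression as the second-level classifier is arbitrary, since any learning method works here:

```python
from pyspark.ml.classification import LogisticRegression

models = learn_base_classifiers(train_df)                   # Step 1 sketch
new_data = build_second_level_data(models, train_df)        # Step 2 sketch
meta_model = LogisticRegression(maxIter=100).fit(new_data)  # second-level classifier
```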

Stacking is a general framework: we can plug in different classifiers and learning approaches to create the first-level features, transforming the data into another feature space.

Example. We illustrate the basic procedure of stacking using the data below, with two first-level classifiers. Table 1 shows the points in the data set.

[Table 1: points in the example data set]

Table 2 shows the new data generated from the training data. After applying the two classifiers, we construct a new data set from their outputs. Since there are two base classifiers, the new feature vector x_new has two dimensions: the first is the predicted label from the first classifier, and the second is the predicted label from the second classifier. This new data set is used to train the second-level classifier.

[Table 2: the new data set constructed from the base classifiers' predicted labels]

Datasets Used.

● Ionosphere Dataset

● Breast Cancer Dataset

● MNIST Handwritten Digit Database

● CIFAR-10 Dataset

● Synthetic Dataset

