
OrderedOVRClassifier

API for performing Ordered One-Vs-Rest Classification with scikit-learn

📖 Documentation

  • Tutorial: How to use OrderedOVRClassifier
  • API Reference: The detailed reference for OrderedOVRClassifier's API

Introduction

OrderedOVRClassifier is a custom scikit-learn-compatible module for tackling multi-classification problems with an Ordered One-Vs-Rest modeling approach. Ordered One-Vs-Rest Classification performs a series of One-Vs-Rest classifications in which the negatives from each step are passed on to subsequent training, with the already-classified classes filtered out.

Most multi-classification machine learning algorithms use a One-Vs-The-Rest strategy to decompose a multiclass problem into n_class binary classification problems. For a dataset with n_class = 5, this means the algorithm trains models as follows:

Model 1:      Class [1] vs Classes [2, 3, 4, 5]
Model 2:      Class [2] vs Classes [1, 3, 4, 5]
Model 3:      Class [3] vs Classes [1, 2, 4, 5]
Model 4:      Class [4] vs Classes [1, 2, 3, 5]
Model 5:      Class [5] vs Classes [1, 2, 3, 4]
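
For reference, scikit-learn exposes this standard strategy directly through OneVsRestClassifier, which fits one binary estimator per class. A minimal sketch on a synthetic dataset (the dataset and base estimator below are chosen only for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# build a toy 5-class problem and fit the standard One-Vs-Rest decomposition
X, y = make_classification(n_samples=500, n_informative=6, n_classes=5)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # 5 binary models, one per class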

OrderedOVRClassifier provides an alternate paradigm for approaching multi-classification. For a dataset with n_class = 5, the training steps could look like:

Model 1:      Class [1] vs Classes [2, 3, 4, 5]
Model 2:      Class [2] vs Classes [3, 4, 5]
Model 3/4/5:  Class [3] vs Classes [4, 5]
              Class [4] vs Classes [3, 5]
              Class [5] vs Classes [3, 4]
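
Conceptually, each ordered step fits a binary classifier for one class against everything else, then removes that class from the data before the next step. A hand-rolled sketch of the idea, with LogisticRegression standing in for any binary classifier (illustrative only, not the library's actual implementation):

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ordered_ovr(X, y, ovr_classes):
    # fit one binary "class vs rest" model per ordered step,
    # dropping each handled class from the remaining training data
    models, remaining = [], np.ones(len(y), dtype=bool)
    for cls in ovr_classes:                                  # e.g. [1, 2]
        binary = LogisticRegression(max_iter=1000)
        binary.fit(X[remaining], y[remaining] == cls)        # one class vs all the rest
        models.append((cls, binary))
        remaining &= (y != cls)                              # remove the class just handled
    # a final multi-class model covers the remaining classes (e.g. 3, 4, 5)
    final = LogisticRegression(max_iter=1000).fit(X[remaining], y[remaining])
    return models, final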

Why would you want to model a multi-classification problem with an Ordered One-Vs-Rest approach? There are several use cases:

  • Perhaps some classes can be predicted with high accuracy and others cannot. A binary model that screens out a highly predictable class up front speeds up training for the remaining classes, because the training steps that require heavy optimization then involve fewer classification models.

  • Maybe we are willing to sacrifice precision or recall for one class to improve precision or recall for another. Adjusting the binary classification threshold allows us to do this.

  • It could be that different algorithms perform better for specific classes. Ordered One-Vs-Rest classification does not require the same machine learning algorithm to be used for all classes, giving us the flexibility to mix different algorithms for classifying different classes.

With Ordered One-Vs-Rest classification, positive predictions from earlier modeling steps always take precedence in the final predictions.
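
Continuing the sketch above, prediction walks through the binary models in order and only falls back to the final model for rows that no earlier step claimed (again, an illustration of the idea rather than the library's code):

def predict_ordered_ovr(X, models, final):
    # integer class labels are assumed; -1 marks rows no ordered step has claimed yet
    pred = np.full(len(X), -1)
    for cls, binary in models:
        mask = (pred == -1) & binary.predict(X)              # earlier positive predictions win
        pred[mask] = cls
    unresolved = pred == -1
    pred[unresolved] = final.predict(X[unresolved])          # final model handles the rest
    return pred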

Features

The API for OrderedOVRClassifier is designed to be user-friendly with pandas, numpy, and scikit-learn. There is also built-in functionality for easily handling early stopping with the scikit-learn wrappers for XGBoost and LightGBM.

OrderedOVRClassifier can also be used to train multi-classification models without an Ordered One-Vs-Rest strategy by setting train_final_only=True, allowing the user to take advantage of the general and convenience features listed below for their own modeling purposes.
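
A minimal sketch of that usage, assuming train_final_only is accepted by the constructor and that a model_dict containing only the 'final' key is sufficient (check the API Reference for the exact signature):

# hypothetical: skip the ordered binary steps and train only the final multi-class model
oovr_plain = OrderedOVRClassifier(target='output',
                                  model_dict={'final': XGBClassifier()},
                                  train_final_only=True)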

  • Ordered One-Vs-Rest Classification Features
    • Reduce training time by training models on fewer classes.
    • Tradeoff accuracy/precision/recall between different classes.
    • Mix classification algorithms for different classes.
  • General Features
    • Model-agnostic calculation of feature importances.
    • Model-agnostic calculation of partial dependence.
    • Instantly evaluate the precision/recall/f1 scores for each class when making predictions.
  • Convenience Features
    • Train and evaluate results from pandas DataFrames without specifying y input.
    • Simple interface for passing in evaluation datasets for early stopping in LightGBM and XGBoost.
    • Attach models stepwise instead of training the full model in the fit step.

Quickstart

To use OrderedOVRClassifier for Ordered One-Vs-Rest Classification, specify the ordered classification steps with the ovr_vals parameter and the model(s) used to train the binary classifiers with the model_dict parameter. The model used to train the remaining classes should be specified under the 'final' key of model_dict. Refer to the tutorial for more detailed usage examples.

from OrderedOVRClassifier import OrderedOVRClassifier

ovr_vals = ['1st class', '2nd class']

model_dict = {'1st class': RandomForestClassifier(),
              '2nd class': RandomForestClassifier(),
              'final': XGBClassifier()}

oovr = OrderedOVRClassifier(target='output', ovr_vals=ovr_vals, model_dict=model_dict)

Fitting Models to Our Data

If working with pandas DataFrames, this is as simple as passing the training DataFrame into the X parameter. An optional test dataset can similarly be passed into the eval_set parameter. When working with numpy arrays, X and y must be passed in as usual.

oovr.fit(train_df, eval_set=test_df)
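
When working with numpy arrays instead, the equivalent call passes X and y explicitly (a sketch; X_train and y_train below are assumed to be arrays):

oovr.fit(X_train, y_train)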

After fitting, we have lots of nice methods and properties attached to the fitted object.

Visualization and Model Evaluation

OrderedOVRClassifier has a simple interface for plotting feature importances and partial dependencies. Precision and recall can also easily be evaluated and plotted against the threshold for binary classification of another class. Refer to the tutorial to see example visualization outputs.

# plot model-agnostic feature importances
oovr.plot_feature_importance(train_df)

# plot model-agnostic partial dependence with respect to one column
oovr.plot_partial_dependence(train_df, 'some_column')

# generate multi-classification precision/recall/f1/accuracy report
oovr.multiclassification_report(test_df)

# plot threshold dependent accuracy/precision/recall/f1
oovr.plot_threshold_dependence('some_class', test_df)

Using OrderedOVRClassifier Modularly

OrderedOVRClassifier can be fit without training the full model pipeline. We can skip the fit step altogether, or fit an incomplete pipeline, and instead use the fit_test or fit_test_ovr methods to train candidate models for later attachment.

best_lgb = LGBMClassifier(n_estimators=100, num_leaves=250, min_child_samples=5,
                          colsample_bytree=1.0, subsample=0.8)

final_model = oovr.fit_test(best_lgb, train_df, eval_set=test_df)

The objects returned from the fit_test or fit_test_ovr methods can be attached to the OrderedOVRClassifier object with the attach_model method.

oovr.attach_model(final_model)  

Refer to the API Reference and the tutorial for more details.

Dependencies

OrderedOVRClassifier is tested on Python 2.7.13 and depends on numpy (≥1.13.3), pandas (≥0.21.1), scikit-learn (≥0.19.1), matplotlib (≥2.1.1), and skater (≥1.0.3). I have not tested the codebase against earlier versions of these packages.

💬 Feedback / Questions

  • Feature Requests / Issues: https://github.com/alvinthai/OrderedOVRClassifier/issues
  • Email: alvinthai@gmail.com
