feature-selection

Feature Selection for Machine Learning Models.


Overview

If you've ever worked with a dataset that has a large number of features, you know how difficult it can be to use all of them. Some features may be irrelevant or redundant, and they can even introduce optimization bias. One approach to this problem is to select a subset of the features for your model. Done wisely, feature selection reduces model complexity, shortens training time, and can improve your model's accuracy. This is not a trivial task, however, and to that end we have created the feature-selection package in Python.

If you are interested, a similar feature selection package is also available for R.

Features

In this package, four functions are included for feature selection:

  • forward_selection - Forward Selection for greedy feature selection. This iterative algorithm starts by considering each feature on its own to find the one that yields the most accurate model. The process is then repeated, adding one more feature at a time and again keeping the single feature that gives the best improvement in accuracy. The procedure stops when it is no longer possible to improve the model (a sketch of this greedy loop follows the list).

  • recursive_feature_elimination - Recursive Feature Elimination (RFE) for greedy feature selection. The model initially considers all features, identifies the worst-performing one, and removes it from the dataset. This process is repeated until the desired number of features is reached (also sketched after this list).

  • simulated_annealing - Performs simulated annealing to select features. A set of features is chosen at random and model performance is measured; the chosen set is then slightly perturbed at random and the model is re-evaluated. If performance improves, the new feature set is kept; if not, it may still be kept, with an acceptance probability that decreases as iterations progress and as the performance gap grows. The process is repeated for a set number of iterations (the acceptance step is sketched after this list).

  • variance_thresholding - Selects features based on their variances. A threshold, typically a low one, is set, and any feature whose variance falls below it is filtered out. Because this algorithm looks only at the features and not at the outputs, it can also be used for feature selection in unsupervised learning (see the one-liner after this list).
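
Below is a minimal sketch of the greedy loop behind forward_selection. The helper name greedy_forward_selection, the lower-is-better scorer convention, and the exact stopping rule are illustrative assumptions, not the package's implementation.

    import numpy as np

    def greedy_forward_selection(scorer, X, y, max_features):
        # Illustrative only: add the single feature that most improves the
        # (lower-is-better) score, until no candidate improves it.
        selected, best_score = [], np.inf
        while len(selected) < max_features:
            scores = {j: scorer(X[:, selected + [j]], y)
                      for j in range(X.shape[1]) if j not in selected}
            best_j = min(scores, key=scores.get)
            if scores[best_j] >= best_score:
                break  # no remaining feature improves the model
            selected.append(best_j)
            best_score = scores[best_j]
        return selected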
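
In the same spirit, a sketch of the elimination loop behind recursive_feature_elimination. That the scorer returns the position of the weakest feature within the current subset is an assumption made for illustration, not the package's documented contract.

    def recursive_elimination(scorer, X, y, n_features_to_select):
        # Illustrative only: repeatedly drop the feature the scorer flags
        # as weakest until n_features_to_select features remain.
        remaining = list(range(X.shape[1]))
        while len(remaining) > n_features_to_select:
            worst = scorer(X[:, remaining], y)  # position within `remaining`
            remaining.pop(worst)
        return remaining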
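
The accept-or-reject step at the heart of simulated_annealing can be sketched as a Metropolis-style test. The cooling schedule below is an illustrative assumption; rng is a NumPy Generator such as np.random.default_rng(0).

    import numpy as np

    def accept_new_set(current_score, new_score, iteration, rng):
        # Always keep an improvement; keep a worse feature set with
        # probability exp(-delta / T), where T shrinks as iterations go on.
        if new_score <= current_score:
            return True
        temperature = 1.0 / (iteration + 1)  # illustrative cooling schedule
        delta = new_score - current_score
        return rng.random() < np.exp(-delta / temperature)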
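
Variance thresholding itself reduces to a single comparison over the feature columns; the zero threshold here is just an example value.

    import numpy as np

    X = np.array([[1, 6, 0, 5],
                  [1, 2, 4, 5],
                  [1, 7, 8, 5]])
    np.where(np.var(X, axis=0) > 0)[0]  # array([1, 2]): columns 0 and 3 are constant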

Existing Ecosystems

Some of the above features already exist within the Python ecosystem. For example, scikit-learn provides RFE and VarianceThreshold in sklearn.feature_selection, and forward selection is available through mlxtend's SequentialFeatureSelector.

Installation

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple feature-selection

Dependencies

Usage

The Friedman dataset is used to generate data for some of the examples. The generated dataset contains several features produced by a white-noise process; these are expected to be eliminated during feature selection.

Load libraries and dataset

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_friedman1
X, y = make_friedman1(n_samples=200, n_features=15, random_state=0)

Use of feature selection functions

  • forward_selection

    from feature_selection import forward_selection
    
    #
    # User-defined 'scorer'
    # that fits a LR model and returns the model error.
    #
    def scorer(X, y):
        lm = LinearRegression().fit(X, y)
        return 1 - lm.score(X, y)
    
    
    # the final two arguments (3, 6) are taken here to be the minimum and
    # maximum number of features to select
    forward_selection(scorer, X, y, 3, 6)

    Output:

    [3, 1, 0, 4]
  • recursive_feature_elimination

    from feature_selection import recursive_feature_elimination
    import numpy as np
    
    
    #
    # User-defined 'scorer'
    # that fits a LR model and returns the index of the
    # feature with the smallest absolute coefficient.
    #
    def scorer(X, y):
        model = LinearRegression()
        model.fit(X, y)
        return np.abs(model.coef_).argmin()
    
    
    recursive_feature_elimination(scorer, X, y, n_features_to_select=5)

    Output:

    [0, 1, 2, 10, 14]
  • simulated_annealing

    from feature_selection import simulated_annealing
    
    
    #
    # User-defined 'scorer'
    # that fits a LR model and returns the model error.
    #
    def scorer(X, y):
        lm = LinearRegression().fit(X, y)
        return 1 - lm.score(X, y)
    
    
    simulated_annealing(scorer, X, y)

    Output:

    array([1,  2,  3,  6,  7,  9, 10, 13])
  • variance_thresholding

    from feature_selection import variance_thresholding
    
    # Example data: columns 0 and 3 are constant (zero variance)
    X = [[1, 6, 0, 5],
         [1, 2, 4, 5],
         [1, 7, 8, 5]]
    
    variance_thresholding(X)

    Output:

    array([1, 2])

Documentation

The official documentation is hosted on Read the Docs: https://feature-selection-python-mds.readthedocs.io/en/latest/

Credits

This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.
