rpmcruz/machine-learning

Data Mining algorithms

Some implementations of data mining algorithms in Python.

These are not "inventions", but merely implementations of algorithms from the literature that either were not available in Python, or for which I needed my own implementation so that I could build upon them.

The repository is currently organized into families:

  • ensemble
  • neuralnet
  • svm
  • quantile

There are also utilities for:

  • preprocessing
  • timeseries
  • scoring metrics

Note that the structure follows algorithm families, not tasks such as classification or ranking. Inside neuralnet, for example, you can find implementations for several tasks. The exception is quantile regression, which has its own directory.

A good number of them were developed while writing:

Preprocessing

  • smote: SMOTE is a well-known oversampling technique that generates new synthetic samples when you have too few observations of one class; both SMOTE and the MSMOTE variant are implemented
  • metacost: a clever method by Pedro Domingos that adds cost support to any classifier by relabeling the training classes
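To illustrate the core SMOTE idea described above, here is a minimal NumPy sketch (this is not the repository's implementation; the function name and parameters are mine): each synthetic sample is a random interpolation between a minority-class sample and one of its k nearest neighbors.

```python
import numpy as np

def smote(X, n_new, k=5, rng=None):
    """Generate n_new synthetic samples by interpolating between each
    sample and one of its k nearest neighbors (the basic SMOTE idea).
    X should contain only the minority-class observations."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # pairwise distances within the minority class
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]     # k nearest neighbors per sample
    base = rng.integers(0, len(X), n_new)        # random base sample for each new point
    nn = neighbors[base, rng.integers(0, k, n_new)]  # random neighbor of each base
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1]
    return X[base] + gap * (X[nn] - X[base])
```

Because every synthetic point is a convex combination of two real points, the new samples always lie inside the convex hull of the minority class.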

Classification

I work mostly on classification, but most of these could be adapted for regression problems as well.

Here I have:

  • bagging: a random forest implementation only
  • boosting: an AdaBoost and a gradient boosting implementation (the latter with a couple of different loss functions)
  • extreme-learning: an extreme learning machine model
  • multiclass: one-vs-all and multiordinal ensembles, which turn binary classifiers into multiclass models
  • neuralnet: a simple neural network implemented in pure Python and in C++ with Python bindings, with both batch and online iteration
  • svm: dual and primal implementations of SVM.
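As a taste of what a boosting implementation involves, here is a minimal AdaBoost sketch with decision stumps (again, a simplified illustration in plain NumPy, not the repository's code): each round fits the best weighted stump, then reweights the samples so that the next round focuses on the current mistakes.

```python
import numpy as np

def train_adaboost(X, y, n_rounds=20):
    """AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)           # sample weights
    ensemble = []                     # list of (alpha, feature, threshold, sign)
    for _ in range(n_rounds):
        best = None
        for j in range(d):            # exhaustive search for the best stump
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] <= t, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = max(err, 1e-12)                    # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this stump
        pred = s * np.where(X[:, j] <= t, 1, -1)
        w *= np.exp(-alpha * y * pred)           # upweight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict_adaboost(ensemble, X):
    score = sum(alpha * s * np.where(X[:, j] <= t, 1, -1)
                for alpha, j, t, s in ensemble)
    return np.sign(score)
```

Gradient boosting follows the same "add weak learners sequentially" pattern, but fits each new learner to the gradient of a chosen loss instead of reweighting samples.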

Ranking

Ranking models are used to produce a ranked list, for instance in search results.

The models I have implemented are called "pairwise scoring rankers": they are trained on pairs of observations, but produce a ranking score for each individual observation. This score is only meaningful when compared to the score of another observation.

  • GBRank: adaptation of gradient boosting for ranking
  • RankBoost: adaptation of AdaBoost for ranking
  • RankNet: adaptation of a neural network for ranking (I also have a C++ implementation in the classification folder)
  • RankSVM: adaptation of SVM with a linear kernel for ranking
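The pairwise idea above can be sketched in a few lines (a RankSVM-style illustration, not the repository's implementation; the function name and parameters are mine): a linear scoring function is trained by SGD on the hinge loss of score differences, so that preferred items end up with higher scores.

```python
import numpy as np

def train_pairwise_ranker(X, pairs, epochs=100, lr=0.1, lam=0.01, rng=None):
    """Learn w such that score(x) = w @ x ranks preferred items higher.
    `pairs` is a list of (i, j) meaning X[i] should outrank X[j].
    Trained by SGD on the pairwise hinge loss (RankSVM-style)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i, j in rng.permutation(pairs):   # shuffle pairs each epoch
            diff = X[i] - X[j]
            if w @ diff < 1:                  # margin violated: push scores apart
                w += lr * diff
            w -= lr * lam * w                 # L2 regularization
    return w
```

At prediction time each observation gets an individual score `w @ x`, which, as noted above, is only meaningful relative to the scores of other observations.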

Quantile

These are models which, instead of predicting the average expected value, predict the expected value for a given quantile: for instance, the median, or the lowest value you can expect 10% of the time, et cetera.

I have here classification and regression models:

  • QBag: a simple bagging adaptation for quantiles
  • QBC and QBR: gradient boosting adaptations for quantiles

And that's it!

Timeseries

These are some simple, but cumbersome, methods for time series that are sorely missing from Python packages and are always a pain to implement.

  • GrowingWindow and SlidingWindow: time series cross-validation methods
  • delay: a function that adds a delay to a time-indexed variable, for use in an autoregressive model
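To make the two utilities above concrete, here is a minimal sketch of a growing-window splitter and a delay function (illustrative only; the repository's actual signatures may differ): the splitter always tests on the block immediately after the training window, so the model never sees the future, and delay builds lagged features for autoregression.

```python
import numpy as np

def growing_window(n, n_splits):
    """Yield (train_idx, test_idx) pairs where the training window grows
    and the test fold is always the block immediately following it."""
    fold = n // (n_splits + 1)
    for k in range(1, n_splits + 1):
        yield np.arange(0, k * fold), np.arange(k * fold, (k + 1) * fold)

def delay(x, lag):
    """Shift a series forward by `lag` steps (NaN-padded at the start),
    e.g. to align the feature x[t-lag] with the target x[t]."""
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    out[lag:] = x[:len(x) - lag]
    return out
```

A SlidingWindow variant would keep the training window at a fixed length, dropping the oldest observations as it advances.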

Scores

Scoring functions missing from sklearn:

  • pinball: MAE can be used for the median; the pinball loss generalizes it to other quantiles
  • MMAE and AMAE: scoring functions for imbalanced ordinal contexts (the maximum and average MAE across classes, independent of class frequency)
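The pinball loss mentioned above fits in one line (a generic sketch, not necessarily the repository's exact function): it penalizes under- and over-prediction asymmetrically, and at q = 0.5 it is proportional to the MAE, which is why MAE works for the median.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for quantile q in (0, 1).
    Under-predictions cost q per unit, over-predictions cost 1 - q,
    so the loss is minimized by the q-th quantile of the target."""
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.maximum(q * e, (q - 1) * e))
```

For example, at q = 0.9 an under-prediction is nine times more costly than an over-prediction of the same size, pushing the model toward high predictions.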

I meant to add some test files to unit-test the various algorithms, but I will probably never find the time to get around to it. :) Please let me know if you use any of these, and whether you ran into problems.

(C) 2016 Ricardo Cruz under the GPLv3
