Semi-supervised learning frameworks for Python

This project contains Python implementations of semi-supervised learning methods, made compatible with scikit-learn, including:

Contrastive Pessimistic Likelihood Estimation (CPLE) (based on, but not equivalent to, Loog, 2015), a "safe" framework applicable to all classifiers that can yield prediction probabilities ("safe" here means that the model trained on both labelled and unlabelled data should not be worse than models trained only on the labelled data)
Self learning (self training), a naive semi-supervised learning framework applicable to any classifier
Semi-Supervised Support Vector Machine (S3VM), a simple scikit-learn compatible wrapper for the QN-S3VM code developed by Fabian Gieseke, Antti Airola, Tapio Pahikkala and Oliver Kramer (see http://www.fabiangieseke.de/index.php/code/qns3vm ). This method was included for comparison.

The first method is a novel extension of Loog, 2015 for any discriminative classifier (the differences to the original CPLE are explained below). The last two methods are only included for comparison.
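
To make the self learning baseline concrete, here is a minimal sketch of naive self-training. It is an illustration under assumptions, not this project's implementation: the function name `self_train`, the confidence `threshold` and the `max_iter` cap are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def self_train(model, X, y, threshold=0.9, max_iter=10):
    """Naive self-training sketch; y uses -1 for unlabelled points."""
    y = np.array(y, copy=True)
    for _ in range(max_iter):
        labelled = y != -1
        if labelled.all():
            break
        model.fit(X[labelled], y[labelled])
        # Predict class probabilities for the points that are still unlabelled.
        proba = model.predict_proba(X[~labelled])
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing is confident enough to pseudo-label
        # Adopt the confident predictions as pseudo-labels.
        idx = np.flatnonzero(~labelled)[confident]
        y[idx] = model.classes_[proba[confident].argmax(axis=1)]
    # Final fit on the original labels plus all pseudo-labels.
    labelled = y != -1
    model.fit(X[labelled], y[labelled])
    return model, y


# Toy data: two Gaussian blobs, two labelled points per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y_true = np.array([0] * 30 + [1] * 30)
y = np.full(60, -1)
y[[0, 1, 30, 31]] = y_true[[0, 1, 30, 31]]

model, y_out = self_train(LogisticRegression(), X, y)
print("points labelled after self-training:", int((y_out != -1).sum()))
```

Because the model's own predictions are fed back as labels, early mistakes can reinforce themselves, which is why the method is described as naive.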

The advantages of the CPLE framework compared to other semi-supervised learning approaches include

it is a generally applicable framework (works with scikit-learn classifiers which allow per-sample weights)
it needs low memory (as opposed to e.g. Label Spreading which needs O(n^2)), and
it makes no additional assumptions except for the ones made by the choice of classifier

The main disadvantage is high computational complexity.
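
To give a feel for the O(n^2) memory figure mentioned above, the following back-of-the-envelope sketch computes the size of a dense n x n matrix. The assumption of 64-bit floats and a fully dense matrix is made for the example; actual implementations may differ.

```python
def dense_matrix_gib(n_samples, bytes_per_entry=8):
    """Memory in GiB of a dense n x n matrix (8 bytes per float64 entry)."""
    return n_samples ** 2 * bytes_per_entry / 2 ** 30


for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7,}: {dense_matrix_gib(n):10.3f} GiB")
```

At 100,000 samples such a matrix already needs roughly 74.5 GiB, whereas a method that only stores the data and per-sample weights grows linearly with n.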

Usage

The project requires scikit-learn, matplotlib and NLopt to run.

Usage example:

import random

import numpy as np
from sklearn.datasets import fetch_mldata  # removed in scikit-learn 0.22; needs an older release
from sklearn.linear_model import SGDClassifier

# module path assumed from the repository's `frameworks` folder
from frameworks.CPLELearningModel import CPLELearningModel

# load the `heart' dataset from mldata.org
heart = fetch_mldata("heart")
X = heart.data
ytrue = np.copy(heart.target)
ytrue[ytrue == -1] = 0

# label a few points
labeled_N = 2
ys = np.array([-1] * len(ytrue))  # -1 denotes unlabeled point
random_labeled_points = (
    random.sample(list(np.where(ytrue == 0)[0]), labeled_N // 2)
    + random.sample(list(np.where(ytrue == 1)[0]), labeled_N // 2)
)
ys[random_labeled_points] = ytrue[random_labeled_points]

# supervised score
basemodel = SGDClassifier(loss='log', penalty='l1')  # scikit logistic regression
basemodel.fit(X[random_labeled_points, :], ys[random_labeled_points])
print("supervised score", basemodel.score(X, ytrue))

# semi-supervised score (base model has to be able to take weighted samples)
ssmodel = CPLELearningModel(basemodel)
ssmodel.fit(X, ys)
print("semi-supervised score", ssmodel.score(X, ytrue))

# output (varies with the randomly chosen labelled points):
# supervised score 0.418518518519
# semi-supervised score 0.555555555556
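
The usage example notes that the base model has to accept weighted samples. One way sample weights can encode soft labels for a binary problem is sketched below: each unlabelled point is added twice, once per class, weighted by its soft label. This is an illustrative assumption, not necessarily how `CPLELearningModel` works internally, and `fit_with_soft_labels` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_with_soft_labels(model, X_lab, y_lab, X_unl, q):
    """Fit a binary classifier where q[i] is the soft label P(y=1) of X_unl[i]."""
    n_unl = len(X_unl)
    # Each unlabelled point appears once as class 0 and once as class 1.
    X_all = np.vstack([X_lab, X_unl, X_unl])
    y_all = np.concatenate(
        [y_lab, np.zeros(n_unl, dtype=int), np.ones(n_unl, dtype=int)]
    )
    # Labelled points get weight 1; the two copies get weights 1 - q and q.
    w_all = np.concatenate([np.ones(len(X_lab)), 1.0 - q, q])
    model.fit(X_all, y_all, sample_weight=w_all)
    return model


X_lab = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_lab = np.array([0, 1])
X_unl = np.array([[-1.5, 0.3], [1.8, -0.2], [0.1, 0.0]])
q = np.array([0.1, 0.9, 0.5])  # hypothetical soft labels

model = fit_with_soft_labels(LogisticRegression(), X_lab, y_lab, X_unl, q)
print(model.predict_proba(X_unl)[:, 1])
```

Any estimator whose `fit` method takes a `sample_weight` argument can be used in place of `LogisticRegression` here.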

Examples

Two-class classification examples with 56 unlabelled (small circles in the plot) and 4 labelled (large circles in the plot) data points. Plot titles show classification accuracies (percentage of data points correctly classified by the model).

In the second example, the state-of-the-art S3VM performs worse than the purely supervised SVM, while the CPLE SVM (by means of the pessimistic assumption) provides increased accuracy.

Quadratic Discriminant Analysis (from left to right: supervised QDA, Self learning QDA, pessimistic CPLE QDA)

Support Vector Machine (from left to right: supervised SVM, S3VM (Gieseke et al., 2012), pessimistic CPLE SVM)

Motivation

Current semi-supervised learning approaches require strong assumptions, and perform badly if those assumptions are violated (e.g. low density assumption, clustering assumption). In some cases, they can perform worse than a supervised classifier trained only on the labelled examples. Furthermore, the vast majority require O(N^2) memory.

Loog (2015) suggested an elegant framework, Contrastive Pessimistic Likelihood Estimation (CPLE), which only uses assumptions intrinsic to the chosen classifier. It therefore allows choosing likelihood-based classifiers that fit the domain or data distribution at hand, and can work even if some of the assumptions mentioned above are violated. The idea is to pessimistically assign soft labels to the unlabelled data, such that the improvement over the supervised version is minimal (i.e. assume the worst case for the unknown labels), while at the same time maximizing the log likelihood over the labelled data.

The parameters in CPLE can be estimated according to:
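
As a sketch in notation assumed here (following Loog, 2015, rather than this project's own figure): let X be the labelled data, U the M unlabelled points, q their soft labels over K classes, and L the log likelihood of parameters θ on X together with U weighted by q. The semi-supervised estimate then maximizes the worst-case gain over the supervised estimate:

```latex
\hat{\theta}_{\mathrm{semi}}
  = \arg\max_{\theta} \; \min_{q \in \Delta_{K-1}^{M}}
    \Bigl[ L(\theta \mid X, U, q)
         - L(\hat{\theta}_{\mathrm{sup}} \mid X, U, q) \Bigr]
```

The inner minimization is the pessimistic choice of soft labels, and the outer maximization picks the parameters.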

The original CPLE framework is only applicable to likelihood-based classifiers, and (Loog, 2015) only provides solutions for Linear Discriminant Analysis and the Nearest Mean Classifier.

The CPLE implementation in this project

Building on this idea, this project contains a general semi-supervised learning framework into which any classifier can be plugged, provided that it 1) allows instance weighting and 2) can generate probability estimates. For classifiers that do not support probability estimates, these can be provided by Platt scaling. An experimental feature is also included to make the approach work with classifiers that do not support instance weighting.
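
The two requirements can be checked with standard scikit-learn utilities, and Platt scaling is available through `CalibratedClassifierCV` with `method="sigmoid"`. The sketch below is one way to do this; `prepare_base_model` is a hypothetical helper and not part of this project.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.utils.validation import has_fit_parameter


def prepare_base_model(model):
    """Return (model, supports_weights); wrap models that lack predict_proba."""
    # Requirement 1: fit() must accept a sample_weight argument.
    supports_weights = has_fit_parameter(model, "sample_weight")
    # Requirement 2: probability estimates, added via Platt scaling if missing.
    if not hasattr(model, "predict_proba"):
        model = CalibratedClassifierCV(model, method="sigmoid", cv=3)
    return model, supports_weights


svm_model, svm_weights = prepare_base_model(LinearSVC())
lr_model, lr_weights = prepare_base_model(LogisticRegression())
print(type(svm_model).__name__, svm_weights)
print(type(lr_model).__name__, lr_weights)
```

`LinearSVC` has no `predict_proba`, so it is wrapped; `LogisticRegression` already provides probabilities and is returned unchanged.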

In order to make the approach work with any classifier, the discriminative likelihood (DL) is used instead of the generative likelihood, which is the first major difference to (Loog, 2015). The second difference is that only the unlabelled data is included in the first term of the minimization objective below, which leads to pessimistic minimization of the DL over the unlabelled data, but maximization of the DL over the labelled data.
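
For a binary classifier, the discriminative likelihood under soft labels can be written as a weighted sum of log probabilities. The sketch below only illustrates that quantity; the function name and the clipping constant `eps` are assumptions, and the project's exact objective is not reproduced here.

```python
import numpy as np


def discriminative_log_likelihood(proba_pos, q, eps=1e-15):
    """Sum of log P(label | x) under soft labels q for a binary model.

    proba_pos[i] is the model's P(y=1 | x_i); q[i] in [0, 1] is the soft
    label of x_i. Hard labels are the special case q[i] in {0, 1}.
    """
    p = np.clip(proba_pos, eps, 1 - eps)  # avoid log(0)
    return float(np.sum(q * np.log(p) + (1 - q) * np.log(1 - p)))


proba_pos = np.array([0.9, 0.2])
# Soft labels that agree with the model's predictions ...
agreeing = discriminative_log_likelihood(proba_pos, np.array([1.0, 0.0]))
# ... and soft labels that contradict them (the pessimistic direction).
opposing = discriminative_log_likelihood(proba_pos, np.array([0.0, 1.0]))
print(agreeing, opposing)
```

Labels that contradict the model give a lower value, which is the direction a pessimistic choice of soft labels pushes towards on the unlabelled data.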

The resulting semi-supervised learning framework is highly computationally expensive, but has the advantages of being generally applicable, needing little memory, and making no additional assumptions beyond those made by the choice of classifier.
