
# Scikit-learn integration package for Apache Spark

This package contains tools to integrate the Apache Spark computing framework with the popular scikit-learn machine learning library. Among other things, it lets you:

  • train and evaluate multiple scikit-learn models in parallel, as a distributed analog of the multicore implementation included by default in scikit-learn;
  • convert Spark DataFrames seamlessly into numpy ndarrays or sparse matrices;
  • (experimental) distribute SciPy's sparse matrices as a dataset of sparse vectors.
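As a rough illustration of the sparse-vector representation (a toy sketch in plain Python, not this package's actual API), each row of a matrix can be stored as a (size, indices, values) triple, the layout used by Spark MLlib's SparseVector:

```python
# Toy sketch: a "dataset of sparse vectors" keeps, for each row, only the
# nonzero entries as (size, nonzero_indices, nonzero_values).
# This mirrors Spark MLlib's SparseVector layout; it is illustrative only.
def to_sparse_rows(matrix):
    rows = []
    for row in matrix:
        idx = [i for i, v in enumerate(row) if v != 0.0]
        rows.append((len(row), idx, [row[i] for i in idx]))
    return rows

dense = [[0.0, 3.0, 0.0],
         [1.0, 0.0, 2.0]]
sparse_rows = to_sparse_rows(dense)
# sparse_rows == [(3, [1], [3.0]), (3, [0, 2], [1.0, 2.0])]
```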

It focuses on problems that have a small amount of data per model but many models to train, so the work can be run in parallel:

  • for small datasets, it distributes the search for estimator hyperparameters (GridSearchCV in scikit-learn) using Spark;

  • for datasets that do not fit in memory, we recommend using the distributed implementations in Spark MLlib instead.

    NOTE: This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).
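The core idea being distributed, expanding a parameter grid into independent fit-and-score tasks, can be sketched in plain Python with no Spark at all (`expand_grid` is an illustrative helper, not part of this package):

```python
# Sketch of what a distributed grid search parallelizes: each parameter
# combination is an independent task (one model fit + score per combination).
from itertools import product

def expand_grid(param_grid):
    """Expand a dict of parameter lists into all combinations."""
    keys = sorted(param_grid)
    return [dict(zip(keys, values))
            for values in product(*(param_grid[k] for k in keys))]

combos = expand_grid({'kernel': ('linear', 'rbf'), 'C': [1, 10]})
# 4 combinations; spark-sklearn evaluates such combinations as Spark tasks
```

Because each combination is evaluated independently, the tasks are trivially parallel, which is why only the search itself, not the individual learning algorithm, needs to be distributed.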

Difference with the sparkit-learn project

The sparkit-learn project aims at a comprehensive integration between Spark and scikit-learn. In particular, it adds primitives to distribute numerical data using Spark, and it reimplements some of the most common algorithms found in scikit-learn.

License

This package is released under the Apache 2.0 license. See the LICENSE file.

Installation

This package is available on PyPI:

pip install spark-sklearn

This project is also available as a Spark package.

The developer version has the following requirements:

  • a recent release of scikit-learn; release 0.17 has been tested (older versions may work too).
  • Spark >= 2.0. Spark can be downloaded from the Spark official website. In order to use this package, you need to use the pyspark interpreter or another Spark-compliant Python interpreter. See the Spark guide for more details. NOTICE: currently, this package uses the nightly 2.0.0 snapshot, available here (TODO: remove reference after the 2.0.0 release).
  • nose (testing dependency only).
  • Pandas, if using the Pandas integration or testing; Pandas 0.18 has been tested.

If you want to use a developer version, you just need to make sure the python/ subdirectory is on the PYTHONPATH when launching the pyspark interpreter:

PYTHONPATH=$PYTHONPATH:./python $SPARK_HOME/bin/pyspark

Running tests

You can run the tests directly:

cd python && ./run-tests.sh

This requires the environment variable SPARK_HOME to point to your local copy of Spark.

Example

Here is a simple example that runs a grid search with Spark. See the Installation section on how to install the package.

from sklearn import svm, datasets
from spark_sklearn import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = svm.SVC()
# sc is the SparkContext provided by the pyspark shell
clf = GridSearchCV(sc, svr, parameters)
clf.fit(iris.data, iris.target)

This classifier can be used as a drop-in replacement for any scikit-learn classifier, with the same API.

Documentation

More extensive documentation (generated with Sphinx) is available in the python/doc_gen/index.html file.

Changelog

  • 0.1 (2015-12-10): first public release

  • 0.1.1 (2016-01-10): package fix release

  • 0.1.2: Python 3 support
