daal4py - A Convenient Python API to the Intel(R) oneAPI Data Analytics Library


A simplified API to the Intel(R) oneAPI Data Analytics Library that allows fast usage of the framework, suited for data scientists and machine learning users. It provides an abstraction over Intel(R) oneAPI Data Analytics Library for direct use or integration into your own framework, and extends this further with drop-in patching for scikit-learn.

Running the full scikit-learn test suite with daal4py's optimization patches:

  • CircleCI when applied to scikit-learn from PyPI
  • CircleCI when applied to a build from the master branch

Installation

daal4py can be installed from conda-forge (recommended):

conda install daal4py -c conda-forge

or from the Intel channel:

conda install daal4py -c intel

You can also build daal4py from source.

Getting Started

The core functionality of daal4py is in-place Scikit-learn patching: same code, same behavior, but faster execution.

Intel CPU optimizations patching

from daal4py.sklearn import patch_sklearn
patch_sklearn()

from sklearn.svm import SVC
from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target
clf = SVC().fit(X, y)
res = clf.predict(X)

Intel CPU/GPU optimizations patching

from daal4py.sklearn import patch_sklearn
from daal4py.oneapi import sycl_context
patch_sklearn()

from sklearn.svm import SVC
from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target
with sycl_context("gpu"):
    clf = SVC().fit(X, y)
    res = clf.predict(X)

The daal4py API allows you to use a wider set of Intel(R) oneAPI Data Analytics Library algorithms in just one line:

import daal4py as d4p
init = d4p.kmeans_init(10, method="plusPlusDense")
result = init.compute(data)

You can even run this on a cluster by making simple code changes and launching the script with an MPI runner such as mpirun:

import daal4py as d4p
d4p.daalinit()  # initialize the distributed (MPI-based) engine
# each process works on its own chunk of the data,
# e.g. one selected with d4p.my_procid()
init = d4p.kmeans_init(10, method="plusPlusDense", distributed=True)
result = init.compute(data)
d4p.daalfini()  # shut down the distributed engine

Scikit-learn patching

Speedups of daal4py-powered Scikit-learn over the original Scikit-learn
Technical details: float type: float64; HW: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz, 2 sockets, 28 cores per socket; SW: scikit-learn 0.23.1, Intel(R) oneDAL (2021.1 Beta 10)

daal4py patching affects the performance of the specific Scikit-learn functionality listed below. When unsupported parameters are used, daal4py falls back to stock Scikit-learn. These limitations are described below. If the patching does not cover your scenario, submit an issue on GitHub.

Scenarios that are already available in 2020.3 release:

Task | Functionality | Parameters support | Data support
--- | --- | --- | ---
Classification | SVC | All parameters except kernel = 'poly' and 'sigmoid'. | No limitations.
Classification | RandomForestClassifier | All parameters except warm_start = True, ccp_alpha != 0, criterion != 'gini'. | Multi-output and sparse data is not supported.
Classification | KNeighborsClassifier | All parameters except metric != 'euclidean' or 'minkowski' with p = 2. | Multi-output and sparse data is not supported.
Classification | LogisticRegression / LogisticRegressionCV | All parameters except solver != 'lbfgs' or 'newton-cg', class_weight != None, sample_weight != None. | Only dense data is supported.
Regression | RandomForestRegressor | All parameters except warm_start = True, ccp_alpha != 0, criterion != 'mse'. | Multi-output and sparse data is not supported.
Regression | LinearRegression | All parameters except normalize != False and sample_weight != None. | Only dense data is supported, #observations should be >= #features.
Regression | Ridge | All parameters except normalize != False, solver != 'auto' and sample_weight != None. | Only dense data is supported, #observations should be >= #features.
Regression | ElasticNet | All parameters except sample_weight != None. | Multi-output and sparse data is not supported, #observations should be >= #features.
Regression | Lasso | All parameters except sample_weight != None. | Multi-output and sparse data is not supported, #observations should be >= #features.
Clustering | KMeans | All parameters except precompute_distances and sample_weight != None. | No limitations.
Clustering | DBSCAN | All parameters except metric != 'euclidean' or 'minkowski' with p = 2. | Only dense data is supported.
Dimensionality reduction | PCA | All parameters except svd_solver != 'full'. | No limitations.
Other | train_test_split | All parameters are supported. | Only dense data is supported.
Other | assert_all_finite | All parameters are supported. | Only dense data is supported.
Other | pairwise_distance | With metric = 'cosine' and 'correlation'. | Only dense data is supported.
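To illustrate the fallback rule above, here is a small sketch that encodes the SVC row of the table. The helper and its name are hypothetical, written only for this README; daal4py performs the equivalent check internally and transparently.

```python
def svc_uses_oneapi_solver(kernel="rbf"):
    """Return True if the patched SVC would use the oneDAL solver,
    False if it would fall back to stock Scikit-learn.

    Per the table above: all SVC parameters are supported except
    kernel = 'poly' and kernel = 'sigmoid'.
    """
    return kernel not in ("poly", "sigmoid")

print(svc_uses_oneapi_solver("rbf"))    # supported kernel
print(svc_uses_oneapi_solver("poly"))   # would fall back to stock Scikit-learn
```

Either way, the call succeeds; the fallback only changes which solver does the work, never the result's API.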

Scenarios that are only available in the master branch (not released yet):

Task | Functionality | Parameters support | Data support
--- | --- | --- | ---
Regression | KNeighborsRegressor | All parameters except metric != 'euclidean' or 'minkowski' with p = 2. | Sparse data is not supported.
Unsupervised | NearestNeighbors | All parameters except metric != 'euclidean' or 'minkowski' with p = 2. | Sparse data is not supported.
Dimensionality reduction | TSNE | All parameters except metric != 'euclidean' or 'minkowski' with p = 2. | Sparse data is not supported.
Other | roc_auc_score | Parameters average, sample_weight, max_fpr and multi_class are not supported. | No limitations.

scikit-learn verbose

To find out which implementation of the algorithm is currently used (daal4py or stock Scikit-learn), set the environment variable:

  • On Linux and Mac OS: export IDP_SKLEARN_VERBOSE=INFO
  • On Windows: set IDP_SKLEARN_VERBOSE=INFO
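The variable can also be set from inside a Python script, as long as it happens before the (patched) scikit-learn module is imported; a minimal sketch using only the standard library:

```python
import os

# Enable verbose mode before importing scikit-learn, so each patched
# algorithm reports which solver it dispatched to.
os.environ["IDP_SKLEARN_VERBOSE"] = "INFO"

print(os.environ["IDP_SKLEARN_VERBOSE"])
```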

For example, for DBSCAN you get one of these print statements depending on which implementation is used:

  • INFO: sklearn.cluster.DBSCAN.fit: uses Intel(R) oneAPI Data Analytics Library solver
  • INFO: sklearn.cluster.DBSCAN.fit: uses original Scikit-learn solver
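If you capture this output, for example in CI logs, the two message shapes above can be told apart mechanically. The helper below is a hypothetical sketch based only on the two example lines, not part of daal4py:

```python
def used_onedal_solver(log_line):
    """Classify a verbose log line: True for the oneDAL solver,
    False for the stock Scikit-learn solver, None if unrelated."""
    if "uses Intel(R) oneAPI Data Analytics Library solver" in log_line:
        return True
    if "uses original Scikit-learn solver" in log_line:
        return False
    return None

line = "INFO: sklearn.cluster.DBSCAN.fit: uses original Scikit-learn solver"
print(used_onedal_solver(line))  # False
```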

Read more in the documentation.

Building from Source

See Building from Sources for details.
