clustersdata/discomll

discomll

Disco Machine Learning Library (discomll) is a Python package for machine learning with the MapReduce paradigm. It works with the Disco framework for distributed computing. discomll is suited to the analysis of large datasets and offers classification, regression and clustering algorithms.
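To illustrate the MapReduce paradigm mentioned above, here is a minimal pure-Python sketch that counts feature values per class, the kind of statistic a naive Bayes model needs. It does not use the Disco or discomll API; the function names `map_counts` and `reduce_counts` and the toy partitions are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_counts(partition):
    """Map step: emit ((label, feature_index, value), 1) for every feature of every row."""
    for features, label in partition:
        for index, value in enumerate(features):
            yield (label, index, value), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Two hypothetical partitions, as if they were stored on different worker nodes.
partitions = [
    [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")],
    [(("sunny", "mild"), "yes"), (("sunny", "hot"), "no")],
]

# Each partition is mapped independently; the reducer merges all emitted pairs.
counts = reduce_counts(chain.from_iterable(map_counts(p) for p in partitions))
print(counts[("no", 0, "sunny")])
```

Because the map step touches each partition independently, it can run on separate workers, and only the small count dictionaries need to be combined.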

Algorithms

Classification algorithms

  • naive Bayes - discrete and continuous features,
  • linear SVM - continuous features, binary target,
  • logistic regression - continuous features, binary target,
  • forest of distributed decision trees - discrete and continuous features,
  • distributed random forest - discrete and continuous features,
  • distributed weighted forest (experimental) - discrete and continuous features,
  • distributed weighted forest rand (experimental) - discrete and continuous features.

Clustering algorithms

  • k-means - continuous features.

Regression algorithms

  • linear regression - continuous features, continuous target,
  • locally weighted linear regression - continuous features, continuous target.

Utilities

  • evaluation of the accuracy,
  • distribution views,
  • model views.

Features of discomll

discomll works with the following data sources:

  • datasets on the Disco Distributed File System,
  • text or gzipped datasets accessible via a file server.

discomll supports several settings for a dataset:

  • multiple data sources,
  • feature selection,
  • feature type specification,
  • parsing of data,
  • handling of missing values.
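The dataset settings listed above can be pictured with a small pure-Python sketch. It is not the discomll API: the function `parse_line`, its parameters and the sample rows are hypothetical, and skipping rows with missing values is only one possible strategy, assumed here for simplicity.

```python
def parse_line(line, feature_indices, target_index, delimiter=",",
               missing_values=("", "?", "NA")):
    """Split one text line and keep only the selected columns.

    Returns (features, target), or None when a selected column holds a
    missing-value marker, so the caller can skip the row.
    """
    columns = [c.strip() for c in line.rstrip("\n").split(delimiter)]
    selected = [columns[i] for i in feature_indices]
    target = columns[target_index]
    if target in missing_values or any(v in missing_values for v in selected):
        return None
    # Feature type specification: here all selected features are continuous.
    return [float(v) for v in selected], target


# Hypothetical rows; the first one has a missing value in column 2.
rows = ["5.1,3.5,?,0.2,setosa\n", "6.2,2.9,4.3,1.3,versicolor\n"]
parsed = [parse_line(r, feature_indices=[0, 2], target_index=4) for r in rows]
print(parsed)
```

Here feature selection is expressed through `feature_indices`, and parsing and missing-value handling happen per line, which fits a map step that reads one record at a time.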

Installing

Prerequisites

  • Disco 0.5.4,
  • numpy should be installed on all worker nodes,
  • orange and scikit-learn are used in unit tests.

Install discomll with pip:

pip install discomll

Performance analysis

In the performance analysis, we compare the speed and accuracy of discomll algorithms with scikit-learn and KNIME. We measure the speedup of discomll algorithms with 1, 3, 6 and 9 Disco workers.
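For readers unfamiliar with the metric, speedup is usually defined as the run time with one worker divided by the run time with n workers. The sketch below shows that calculation under this common definition; the timings are made-up placeholders, not measurements from the discomll performance analysis, and the function names are hypothetical.

```python
def speedup(baseline_seconds, parallel_seconds):
    """Speedup: run time with one worker divided by run time with n workers."""
    return baseline_seconds / parallel_seconds


def efficiency(baseline_seconds, parallel_seconds, workers):
    """Parallel efficiency: speedup divided by the number of workers."""
    return speedup(baseline_seconds, parallel_seconds) / workers


# Placeholder timings in seconds, keyed by number of workers (NOT real results).
timings = {1: 120.0, 3: 48.0, 6: 30.0, 9: 24.0}

for workers, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = efficiency(timings[1], seconds, workers)
    print(f"{workers} workers: speedup {s:.2f}, efficiency {e:.2f}")
```

An efficiency close to 1 means the algorithm scales almost linearly with the number of workers; lower values point to communication or scheduling overhead.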

Performance analysis 2

In the second performance analysis, we compare the accuracy of distributed ensemble algorithms with scikit-learn algorithms. We train the model on the whole dataset with the distributed algorithms and on a subset with the single-core algorithms. We show that distributed ensembles achieve accuracy similar to that of the single-core algorithms.

Try it now

You can try discomll algorithms on the ClowdFlows platform. ClowdFlows is an open-source, cloud-based platform for the composition, execution and sharing of interactive machine learning and data mining workflows. For instructions, see the User Guide.

Public workflows:

Release notes

version 0.1.4.2 (Released 18/oct/2015)

  • model view bug fixes for ensembles,
  • ensembles missing values support.

version 0.1.4.1 (Released 17/oct/2015)

  • model view fixed for ensembles,
  • bug fixes in examples and tests.

version 0.1.4 (Released 11/oct/2015)

  • distributed weighted forest Rand was added; the algorithm is similar to distributed weighted forest, but it uses randomly selected medoids,
  • improvements of algorithms, especially ensembles,
  • performance analysis 2.
