Skip to content

learningOrchestra is a distributed Machine Learning processing tool that facilitates and streamlines iterative processes in a Data Science project.

License

Notifications You must be signed in to change notification settings

adnanjt/learningOrchestra

Β 
Β 

Repository files navigation

Wondering how to combine your various library and infrastructure needs for your latest data mining project? Just pick the bricks to build your pipeline and learningOrchestra will take care of the rest.

build-passing tag last-commit All Contributors

learningOrchestra

Introduction

Context

Nowadays, data science relies on a wide range of computer science skills, from data management to algorithm design, from code optimization to cloud infrastructures. Data scientists are expected to have expertise in these diverse fields, especially when working in small teams or for academia.

This situation can constitute a barrier to the actual extraction of new knowledge from collected data, which is why the last two decades have seen more efforts to facilitate and streamline the development of data mining workflows. The tools created can be sorted into two categories: high-level tools facilitate the building of automatic data processing pipelines (e.g. Weka) while low-level ones support the setup of appropriate physical and virtual infrastructure (e.g. Spark).

However, this landscape is still missing a tool that encompasses all steps and needs of a typical data science project. This is where learningOrchestra comes in.

The learningOrchestra system

learningOrchestra aims to facilitate the development of complex data mining workflows by seamlessly interfacing different data science tools and services. From a single interoperable Application Programming Interface (API), users can design their analytical pipelines and deploy them in an environment with the appropriate capabilities.

learningOrchestra is designed for data scientists from both engineering and academia backgrounds, so that they can focus on the discovery of new knowledge in their data rather than library or maintenance issues.

Quick-start

learningOrchestra provides two options to access its features: a REST API and a Python package.

REST API: We recommand using a GUI REST API caller like Postman or Insomnia.

Python package:

How do I install learningOrchestra?

πŸ”” This documentation assumes that the users are familiar with a number of advanced computer science concepts. We have tried to link to learning resources to support beginners, as well as introduce some of the concepts in the FAQ. But if something is still not clear, don't hesitate to ask for help.

We prrovide a documentation explaining how deploy this software, you can read more in installation docs

Interrupt learningOrchestra

Run docker stack rm microservice.

How do I use learningOrchestra?

learningOrchestra is organised into interoperable microservices. They offer access to third-party libraries, frameworks and software to gather data, clean data, train machine learning models, tune machine learning models, evaluate machine learning models and visualize data and results.

The current version of learningOrchestra offers 11 services:

  • Dataset - Responsible to obtain a dataset. External datasets are stored on MongoDB or on volumes using an Uniform Resource Locator (URL). There is also an alternative to load TensorFlow existing datasets.
  • Model - Responsible to load supervised or unsupervised models from existing repositories. It is useful to be used to configure a TensorFlow or Scikit-learn object with a tuned and pre-trained neural network using Google or Facebook best practicesand large instances, for example. On the other hand, it is also useful to load acustomized/optimized neural network developed from scratch by a team of data scientists.
  • Transform - Responsible for a catalog of tasks, including embedding, normalization, text enrichment, bucketization, data projection and so forth. Learning Orchestra has its own implementations for some services and implement other transform services from TensorFlow and Scikit-learn.
  • Explore - The data scientist must see the pipes steps results of an analytical pipeline, so Learning Orchestra supports data exploration using the catalog of explore capabilities of TensorFlow and Scikit-learn tools, including histogram, clustering, t-SNE,PCA and others. All outputs of this step are plottable.
  • Tune - Performs the search for an optimal set of parameters for a given model. It can be made through strategies like grid-search, random search, or Bayesian optimization
  • Train - Probably it is the most computational expensive service of an ML pipeline, because the models will be trained for best learn the subjacents patterns on data. Adiversity of algorithms can be executed, like Support Vector Machine (SVM), Random Forest, Bayesian inference, K-Nearest Neighbors (KNN), Deep Neural Networks(DNN), and many others.
  • Evaluate - After training a model, it is necessary to evaluate it’s power to generalize tonew unseen data. For that, the model needs to perform inferences or classification on a test dataset to obtain metrics that more accurately describe the capabilities of themodel. Some common metrics are precision, recall, f1-score, accuracy, mean squarederror (MSE), and cross-entropy. This service is useful to describe the generalization power and to detect the need for model calibrations
  • Predict - The model can run indefinitely. Sometimes feedbacks are mandatory toreinforce the train step, so the Evaluate services are called multiple times. This is the main reason for a production pipe and, consequently, a service of such a type
  • Builder - Responsible to execute Spark-ML or TensorFlow entire pipelines in Python, offering an alternative way to use the Learning Orchestra system just as a deployment alternative and not an environment for building ML workflows composed of pipelines.
  • Observe - Represents a catalog of collections of Learning Orchestra and a publish/subscribe mechanism. Applications can subscribe to these collections to receive notifications via observers.
  • Function - Responsible to wrap a Python function, representing a wildcard for the data scientist when there is no Learning Orchestra support for a specific ML service. It is different from Builder service, since it does not run the entire pipeline. Instead, it runs just a Python function of Scikit-learn or TensorFlow models on a cluster container. It is part of future plans the support of functions written in R language.

The REST API can be called on from any computer, including one that is not part of the cluster learningOrchestra is deployed on. learningOrchestra provides two options to access its features: a REST API and a Python package.

Using the REST API

We recommand using a GUI REST API caller like Postman or Insomnia. Of course, regular curl commands from the terminal remain a possibility.

The details for REST API are available in the open api documentation.

Using the Python package

learning-orchestra-client is a Python 3 package available through the Python Package Index. Install it with pip install learning-orchestra-client.

All your scripts must import the package and create a link to the cluster by providing the IP address to an instance of your cluster. Preface your scripts with the following code:

from learning_orchestra_client import *
cluster_ip = "xx.xx.xxx.xxx"
Context(cluster_ip)

Check the package documentation for a list of available features and an example use case.

Check cluster status

To check the deployed microservices and machines of your cluster, run CLUSTER_IP:9000 where CLUSTER_IP is replaced by the external IP of a machine in your cluster.

The same can be done to check Spark cluster state with CLUSTER_IP:8080.

About learningOrchestra

Research background

The first monograph

The second monograph(under construction)

Contributors ✨

Thanks goes to these wonderful people (emoji key):


Gabriel Ribeiro

πŸ’» 🚧 πŸ’¬ πŸ‘€

Navendu Pottekkat

πŸ“– 🎨 πŸ€”

hiperbolt

πŸ’» πŸ€” πŸš‡

Joubert de Castro Lima

πŸ€” πŸ“†

Lauro Moraes

πŸ€” πŸ“†

LaChapeliere

πŸ“–

Sudipto Ghosh

πŸ’»

This project follows the all-contributors specification. Contributions of any kind welcome!

Frequently Asked Questions

Where can I find the documentation?

Find the user documentation here.

What is the website linked to the repo?

The repo is linked to the user documentation.

Who is doing this?

See the contributors list.

Do you get money from learningOrchestra? How do you fund the project?

We use collaborative resources to develop this software.

On using learningOrchestra

I have a question/a feature request/some feedback, how do I contact you?

Please use the Issues page of this repo.

Can I copy your code for my project?

This project is distributed under the open source GPL-3 license.

You can copy, modify and distribute the code in the repository as long as you understand the license limitations (no liability, no warranty) and respect the license conditions (license and copyright notice, state changes, disclose source, same license.)

How do I cite learningOrchestra in my paper?

In discussion.

Where can I find data?

Kaggle is a good data source for beginners.

My computer runs on Windows/OSX, can I still use learningOrchestra?

You can use the microservices that run on a cluster where learningOrchestra is deployed, but not deploy learningOrchestra.

To use the microservices, through the REST APIs and a request client or through the Python client package, refer to the usage instructions above.

I have a single computer, can I still use learningOrchestra?

Theoretically, you can, if your machine has 12 Gb of RAM, a quad-core processor and 100 Gb of disk. However, your single machine won't be able to cope with the computing demanding for a real-life sized dataset.

What happens if learningOrchestra is killed while using a microservice?

If your cluster fails while a microservice is processing data, the task may be lost. Some fails might corrupt the database systems.

If no processing was in progress when your cluster fails, the learningOrchestra will automatically re-deploy and reboot the affected microservices.

What happens if my instances loose the connection to each other?

If the connection between cluster instances is shutdown, learningOrchestra will try to re-deploy the microservices from the lost instances on the remaining active instances of the cluster.

How do I interrupt learningOrchestra?

Run docker stack rm microservice in manager instance of docker swarm cluster.

On the languages and frameworks used by learningOrchestra

What is a container?

Containers are a software that package code and everything needed to run this code together, so that the code can be run simply in any environment. They also isolate the code from the rest of the machine. They are often compared to shipping containers.

What is a cluster?

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. (From Wikipedia)

What are microservices?

Microservices - also known as the microservice architecture - is an architectural style that structures an application as a collection of services that are: highly maintainable and testable, loosely coupled, independently deployable, organized around business capabilities, owned by small team.

An overview of microservice architecture

Method X is very useful and should be included, why is it not there?

learningOrchestra is still in development. We try to prioritize the most handy methods/process, but we have a limited team.

You can suggest new features by creating an issue in Issues page. We also welcome new contributors.

On contributing to learningOrchestra

I want to contribute, where do I start?

The contributing guide is a good place to start.

If you are new to open source, consider giving the resources of FirstTimersOnly a look.

I'm not a developer, can I contribute?

Yes. Currently, we need help improving the documentation and spreading the word about the learningOrchestra project. Check our Issues page for open tasks.

About

learningOrchestra is a distributed Machine Learning processing tool that facilitates and streamlines iterative processes in a Data Science project.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 94.0%
  • Shell 3.0%
  • Dockerfile 3.0%