Wondering how to combine your various library and infrastructure needs for your latest data mining project? Just pick the bricks to build your pipeline and learningOrchestra will take care of the rest.
Nowadays, data science relies on a wide range of computer science skills, from data management to algorithm design, from code optimization to cloud infrastructures. Data scientists are expected to have expertise in these diverse fields, especially when working in small teams or for academia.
This situation can constitute a barrier to the actual extraction of new knowledge from collected data, which is why the last two decades have seen more efforts to facilitate and streamline the development of data mining workflows. The tools created can be sorted into two categories: high-level tools facilitate the building of automatic data processing pipelines (e.g. Weka) while low-level ones support the setup of appropriate physical and virtual infrastructure (e.g. Spark).
However, this landscape is still missing a tool that encompasses all steps and needs of a typical data science project. This is where learningOrchestra comes in.
learningOrchestra aims to facilitate the development of complex data mining workflows by seamlessly interfacing different data science tools and services. From a single interoperable Application Programming Interface (API), users can design their analytical pipelines and deploy them in an environment with the appropriate capabilities.
learningOrchestra is designed for data scientists from both engineering and academia backgrounds, so that they can focus on the discovery of new knowledge in their data rather than library or maintenance issues.
- Introduction
- Quick-start
- How do I install learningOrchestra?
- How do I use learningOrchestra?
- About learningOrchestra
- Frequently Asked Questions
- Requirements
- Deployment
- Cluster State
- REST API
learningOrchestra provides two options to access its features: a REST API and a Python package.
REST API: We recommend using a GUI REST API caller like Postman or Insomnia.
Python package:
- Check the package documentation for more details.
Note: This documentation assumes that users are familiar with a number of advanced computer science concepts. We have tried to link to learning resources to support beginners, and we introduce some of the concepts in the FAQ. If something is still not clear, don't hesitate to ask for help.
We provide documentation explaining how to deploy this software; you can read more in the installation docs.
Run `docker stack rm microservice`.
learningOrchestra is organised into interoperable microservices. They offer access to third-party libraries, frameworks and software to gather data, clean data, train machine learning models, tune machine learning models, evaluate machine learning models and visualize data and results.
The current version of learningOrchestra offers 11 services:
- Dataset - Responsible for obtaining a dataset. External datasets are stored in MongoDB or on volumes using a Uniform Resource Locator (URL). There is also an option to load existing TensorFlow datasets.
- Model - Responsible for loading supervised or unsupervised models from existing repositories. It is useful for configuring a TensorFlow or Scikit-learn object with a tuned and pre-trained neural network built with Google or Facebook best practices and large instances, for example. It is also useful for loading a customized/optimized neural network developed from scratch by a team of data scientists.
- Transform - Responsible for a catalog of tasks, including embedding, normalization, text enrichment, bucketization, data projection and so forth. Learning Orchestra has its own implementations for some services and wraps other transform services from TensorFlow and Scikit-learn.
- Explore - The data scientist must see the results of each step of an analytical pipeline, so Learning Orchestra supports data exploration using the catalog of explore capabilities of TensorFlow and Scikit-learn, including histograms, clustering, t-SNE, PCA and others. All outputs of this step are plottable.
- Tune - Performs the search for an optimal set of hyperparameters for a given model. The search can use strategies like grid search, random search, or Bayesian optimization.
- Train - Probably the most computationally expensive service of an ML pipeline, since models are trained to best learn the underlying patterns in the data. A diversity of algorithms can be executed, like Support Vector Machine (SVM), Random Forest, Bayesian inference, K-Nearest Neighbors (KNN), Deep Neural Networks (DNN), and many others.
- Evaluate - After training a model, it is necessary to evaluate its power to generalize to new, unseen data. For that, the model performs inferences or classifications on a test dataset to obtain metrics that more accurately describe its capabilities. Some common metrics are precision, recall, F1-score, accuracy, mean squared error (MSE), and cross-entropy. This service is useful for describing the generalization power and for detecting the need for model calibrations.
- Predict - The model can run indefinitely. Sometimes feedback is needed to reinforce the training step, so the Evaluate service is called multiple times. This is the main reason for a production pipeline and, consequently, for a service of this type.
- Builder - Responsible for executing entire Spark-ML or TensorFlow pipelines written in Python, offering a way to use Learning Orchestra just as a deployment alternative rather than as an environment for building ML workflows composed of pipelines.
- Observe - Represents a catalog of the Learning Orchestra collections and a publish/subscribe mechanism. Applications can subscribe to these collections to receive notifications via observers.
- Function - Responsible for wrapping a Python function, acting as a wildcard for the data scientist when Learning Orchestra has no support for a specific ML service. It differs from the Builder service in that it does not run an entire pipeline; instead, it runs just a Python function over Scikit-learn or TensorFlow models on a cluster container. Support for functions written in the R language is part of future plans.
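As a rough illustration of the steps these services orchestrate — dataset, transform, train, predict, evaluate — here is a plain Scikit-learn sketch run locally. This is not learningOrchestra code (which would run these steps on a cluster); it only maps the service names onto the familiar library calls they wrap.

```python
# Illustration only: the dataset -> transform -> train -> predict -> evaluate
# steps that the learningOrchestra services orchestrate, sketched with plain
# Scikit-learn on a single machine.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                       # Dataset step
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

scaler = StandardScaler().fit(X_train)                  # Transform step
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

model = KNeighborsClassifier(n_neighbors=5)             # Model step
model.fit(X_train, y_train)                             # Train step

predictions = model.predict(X_test)                     # Predict step
print(accuracy_score(y_test, predictions))              # Evaluate step
```

In learningOrchestra each of these steps becomes a microservice call, so heavy stages like Train run on the cluster rather than on the data scientist's machine.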
The REST API can be called from any computer, including one that is not part of the cluster learningOrchestra is deployed on.
We recommend using a GUI REST API caller like Postman or Insomnia. Of course, regular curl commands from the terminal remain a possibility.
The details of the REST API are available in the OpenAPI documentation.
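For a feel of what a call looks like, here is a hedged sketch that composes (but does not send) a request with the `requests` library. The endpoint path and port below are hypothetical placeholders — the real routes are listed in the OpenAPI documentation.

```python
# Sketch: composing a request to the learningOrchestra REST API.
# The port and path "/api/learningOrchestra/v1/dataset" are HYPOTHETICAL
# placeholders -- check the OpenAPI documentation for the real routes.
import requests

cluster_ip = "10.0.0.1"  # placeholder: the IP of an instance of your cluster
url = f"http://{cluster_ip}:5000/api/learningOrchestra/v1/dataset"

payload = {"datasetName": "titanic", "url": "https://example.com/titanic.csv"}

# Build the request without sending it, so it can be inspected first.
prepared = requests.Request("POST", url, json=payload).prepare()
print(prepared.method, prepared.url)

# To actually send it against a live cluster:
# response = requests.Session().send(prepared)
```

Any HTTP client works the same way; Postman and Insomnia simply provide a GUI over the same composition step.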
learning-orchestra-client is a Python 3 package available through the Python Package Index. Install it with `pip install learning-orchestra-client`.
All your scripts must import the package and create a link to the cluster by providing the IP address of an instance of your cluster. Preface your scripts with the following code:

```python
from learning_orchestra_client import *

cluster_ip = "xx.xx.xxx.xxx"
Context(cluster_ip)
```
Check the package documentation for a list of available features and an example use case.
To check the deployed microservices and machines of your cluster, open `CLUSTER_IP:9000` in a browser, where CLUSTER_IP is replaced by the external IP of a machine in your cluster. The same can be done to check the Spark cluster state at `CLUSTER_IP:8080`.
The first monograph
The second monograph (under construction)
Thanks goes to these wonderful people:
- Gabriel Ribeiro
- Navendu Pottekkat
- hiperbolt
- Joubert de Castro Lima
- Lauro Moraes
- LaChapeliere
- Sudipto Ghosh
This project follows the all-contributors specification. Contributions of any kind welcome!
Find the user documentation here.
The repo is linked to the user documentation.
See the contributors list.
We use collaborative resources to develop this software.
Please use the Issues page of this repo.
This project is distributed under the open source GPL-3 license.
You can copy, modify and distribute the code in the repository as long as you understand the license limitations (no liability, no warranty) and respect the license conditions (license and copyright notice, state changes, disclose source, same license).
In discussion.
Kaggle is a good data source for beginners.
You can use the microservices that run on a cluster where learningOrchestra is deployed, but not deploy learningOrchestra.
To use the microservices, through the REST APIs and a request client or through the Python client package, refer to the usage instructions above.
Theoretically, you can, if your machine has 12 GB of RAM, a quad-core processor and 100 GB of disk. However, a single machine won't be able to cope with the computing demands of a real-life sized dataset.
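A quick way to compare a machine against those figures is a short self-check script. This is just a convenience sketch (the RAM probe uses `os.sysconf`, which is POSIX-only — adapt it for other platforms); the thresholds mirror the numbers above.

```python
# Self-check against the single-machine figures quoted above:
# 12 GB RAM, a quad-core processor, 100 GB of free disk.
# os.sysconf is POSIX-only; the disk check looks at the root filesystem.
import os
import shutil

cores = os.cpu_count() or 0
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
disk_gb = shutil.disk_usage("/").free / 1024**3

print(f"cores={cores}, ram={ram_gb:.1f} GB, free disk={disk_gb:.1f} GB")
print("meets single-machine figures:",
      cores >= 4 and ram_gb >= 12 and disk_gb >= 100)
```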
If your cluster fails while a microservice is processing data, the task may be lost. Some failures might corrupt the database systems.
If no processing was in progress when your cluster failed, learningOrchestra will automatically re-deploy and reboot the affected microservices.
If the connection between cluster instances is shut down, learningOrchestra will try to re-deploy the microservices from the lost instances on the remaining active instances of the cluster.
Run `docker stack rm microservice` on the manager instance of the Docker swarm cluster.
A container is a unit of software that packages code together with everything needed to run it, so that the code can run simply in any environment. Containers also isolate the code from the rest of the machine. They are often compared to shipping containers.
A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. (From Wikipedia)
Microservices - also known as the microservice architecture - is an architectural style that structures an application as a collection of services that are: highly maintainable and testable, loosely coupled, independently deployable, organized around business capabilities, and owned by a small team.
An overview of microservice architecture
learningOrchestra is still in development. We try to prioritize the most useful methods and processes, but we have a limited team.
You can suggest new features by creating an issue on the Issues page. We also welcome new contributors.
The contributing guide is a good place to start.
If you are new to open source, consider giving the resources of FirstTimersOnly a look.
Yes. Currently, we need help improving the documentation and spreading the word about the learningOrchestra project. Check our Issues page for open tasks.