sudiptog81/learningOrchestra

 
 

learningOrchestra

learningOrchestra facilitates and streamlines the iterative processes of a Data Science project pipeline, such as:

  • Data Gathering
  • Data Cleaning
  • Model Building
  • Validating the Model
  • Presenting the Results

With learningOrchestra, you can:

  • load a dataset from a URL (in CSV format).
  • accomplish several pre-processing tasks with datasets.
  • create highly customised model predictions against a specific dataset by providing your own pre-processing code.
  • build prediction models with different classifiers simultaneously, using a Spark cluster transparently.

And so much more! Check the Usage section for details.

Installation

Requirements

Ensure that nothing in your cluster environment, such as firewall rules on your network or on your hosts, blocks traffic.

If you have firewalls or other traffic blockers, add learningOrchestra as an exception.

For example, on Google Cloud Platform, each of the VMs must allow both HTTP and HTTPS traffic.
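One way to confirm that traffic is not being blocked is to probe a port from another machine. The snippet below is only a minimal sketch and is not part of learningOrchestra: the function name `is_port_open` is hypothetical, and which host and port you probe (for instance the ports listed under Cluster State) depends on your own deployment.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; not provided by learningOrchestra.
    """
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (refused, timed out, unreachable, unresolved host) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `is_port_open("CLUSTER_IP", 80)` returning `False` after deployment would suggest that a firewall rule, or something else on the path, is still blocking that port.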

Deployment

On the manager machine of your Docker swarm, clone the repo using:

git clone https://github.com/riibeirogabriel/learningOrchestra.git

Navigate into the learningOrchestra directory and run:

cd learningOrchestra
sudo ./run.sh

That's it! learningOrchestra has been deployed in your swarm cluster!

Cluster State

CLUSTER_IP:80 - To visualize the cluster state (deployed microservices and the cluster's machines).

CLUSTER_IP:8080 - To visualize the Spark cluster state.

* CLUSTER_IP is the external IP of a machine in your cluster.
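The two addresses above can be assembled programmatically, for example to open them from a script. This is a small sketch under assumptions: the helper name `cluster_state_urls` is hypothetical, and the `http` scheme is assumed because the text above gives only the host and port.

```python
def cluster_state_urls(cluster_ip: str) -> dict:
    """Build the two dashboard URLs for a given cluster IP.

    Hypothetical helper; the http scheme is an assumption.
    """
    return {
        # Port 80: deployed microservices and cluster machines
        "cluster": f"http://{cluster_ip}:80",
        # Port 8080: Spark cluster state
        "spark": f"http://{cluster_ip}:8080",
    }
```

Calling `cluster_state_urls("203.0.113.10")` (a documentation-only example address) yields one URL per dashboard, which can then be opened in a browser.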

Usage

learningOrchestra can be used with the Microservices REST API or with the learning-orchestra-client Python package.

Microservices REST APIs

Database API - Download and handle datasets in a database.

Projection API - Make projections of stored datasets using the Spark cluster.

Data type API - Change the type of dataset fields between number and text.

Histogram API - Make histograms of stored datasets.

t-SNE API - Make a t-SNE image plot of stored datasets.

PCA API - Make a PCA image plot of stored datasets.

Model builder API - Create a prediction model from pre-processed datasets using the Spark cluster.

Spark Microservices

The Projection, t-SNE, PCA and Model builder microservices use the Spark microservice to work.

By default, this microservice has only one instance. If your data processing requires more computing power, you can scale this microservice.

To do this, with learningOrchestra already deployed, run the following on the manager machine of your Docker swarm cluster:

docker service scale microservice_sparkworker=NUMBER_OF_INSTANCES

* NUMBER_OF_INSTANCES is the number of Spark microservice instances which you require. Choose it according to your cluster resources and your resource requirements.
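If you prefer to drive the scaling step from a script, the command above can be wrapped as follows. This is a sketch, not something shipped with learningOrchestra: the function names `build_scale_command` and `scale_spark_workers` are hypothetical, and the service name is copied from the command shown above.

```python
import subprocess


def build_scale_command(instances: int) -> list:
    """Return the argv list for scaling the Spark worker service."""
    # Reject non-integers, booleans, and counts below one before calling Docker.
    if isinstance(instances, bool) or not isinstance(instances, int) or instances < 1:
        raise ValueError("instances must be a positive integer")
    # Service name taken from the command in this section.
    return ["docker", "service", "scale", f"microservice_sparkworker={instances}"]


def scale_spark_workers(instances: int) -> None:
    """Run the scale command; intended for the swarm manager machine."""
    # check=True raises CalledProcessError if docker exits with a non-zero status.
    subprocess.run(build_scale_command(instances), check=True)
```

Keeping the command construction separate from its execution makes the validation easy to check without a running swarm; `scale_spark_workers(3)` would then request three instances.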

Database GUI

NoSQLBooster - A MongoDB GUI that performs several database tasks such as file visualization, queries, projections and file extraction to CSV and JSON formats. It can be useful for accomplishing some of these tasks with your processed dataset or for getting your prediction results.

Read the Database API docs for more info on configuring this tool.

See the full docs for detailed usage instructions.

Contributors ✨

Thanks go to these wonderful people (emoji key):


Gabriel Ribeiro

💻 🚇 📆 🚧

Navendu Pottekkat

📖 🎨 🤔

This project follows the all-contributors specification. Contributions of any kind welcome!
