learningOrchestra facilitates and streamlines the iterative stages of a Data Science project pipeline, such as:
- Data Gathering
- Data Cleaning
- Model Building
- Validating the Model
- Presenting the Results
With learningOrchestra, you can:
- load a dataset from a URL (in CSV format).
- perform several pre-processing tasks on datasets.
- create highly customised model predictions against a specific dataset by providing your own pre-processing code.
- build prediction models with different classifiers simultaneously, transparently using a Spark cluster.
And so much more! Check the usage section for more.
- Linux hosts
- Docker Engine must be installed in all instances of your cluster
- Cluster configured in swarm mode, check creating a swarm
- Docker Compose must be installed in the manager instance of your cluster
Ensure that your cluster environment does not block any traffic, such as firewall rules in your network or on your hosts.
If you have firewalls or other traffic blockers, add learningOrchestra as an exception.
For example, on Google Cloud Platform each of the VMs must allow both HTTP and HTTPS traffic.
In the manager Docker swarm machine, clone the repo using:
git clone https://github.com/riibeirogabriel/learningOrchestra.git
Navigate into the learningOrchestra directory and run:
cd learningOrchestra
sudo ./run.sh
That's it! learningOrchestra has been deployed in your swarm cluster!
- CLUSTER_IP:80 to visualize the cluster state (deployed microservices and the cluster's machines).
- CLUSTER_IP:8080 to visualize the Spark cluster state.

* CLUSTER_IP is the external IP of a machine in your cluster.
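The two dashboard addresses above follow directly from the port layout described here. As a small illustration (this helper is not part of learningOrchestra itself), they can be derived from the cluster IP:

```python
def dashboard_urls(cluster_ip: str) -> dict:
    """Build the dashboard URLs exposed by a learningOrchestra deployment.

    Port 80 serves the cluster-state visualizer and port 8080 the
    Spark cluster UI, as described above.
    """
    return {
        "cluster_state": f"http://{cluster_ip}:80",
        "spark_ui": f"http://{cluster_ip}:8080",
    }

print(dashboard_urls("203.0.113.10"))
```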
learningOrchestra can be used with the Microservices REST API or with the learning-orchestra-client
Python package.
Database API - Download and handle datasets in a database.
Projection API - Create projections of stored datasets using the Spark cluster.
Data type API - Change dataset field types between number and text.
Histogram API - Create histograms of stored datasets.
t-SNE API - Create a t-SNE image plot of stored datasets.
PCA API - Create a PCA image plot of stored datasets.
Model builder API - Create a prediction model from pre-processed datasets using the Spark cluster.
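To give a feel for how these microservices are driven over HTTP, here is a hedged sketch of submitting a CSV dataset URL to the Database API. The endpoint path ("/files"), the port, and the JSON field names below are assumptions for illustration only; consult the Database API docs (or the learning-orchestra-client package) for the actual contract:

```python
import json
import urllib.request


def build_dataset_request(cluster_ip: str, filename: str,
                          csv_url: str) -> urllib.request.Request:
    """Build (but do not send) a hypothetical Database API request.

    The "/files" path, the port 5000, and the payload keys are
    illustrative assumptions, not the documented API.
    """
    payload = json.dumps({"filename": filename, "url": csv_url}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{cluster_ip}:5000/files",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_dataset_request(
    "203.0.113.10", "titanic", "https://example.com/titanic.csv")
# The request is only constructed here; sending it requires a running
# cluster, e.g.:  urllib.request.urlopen(req)
```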
The Projection, t-SNE, PCA, and Model builder microservices use the Spark microservice to work.
By default, this microservice has only one instance. If your data processing requires more computing power, you can scale this microservice.
To do this, with learningOrchestra already deployed, run the following in the manager machine of your Docker swarm cluster:
docker service scale microservice_sparkworker=NUMBER_OF_INSTANCES
* NUMBER_OF_INSTANCES is the number of Spark microservice instances you require. Choose it according to your cluster resources and your resource requirements.
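Choosing NUMBER_OF_INSTANCES is left to you. One rough heuristic (an illustrative assumption of this guide, not an official learningOrchestra recommendation) is to derive it from the total cores available on your worker machines:

```python
def suggest_spark_instances(total_worker_cores: int,
                            cores_per_instance: int = 2) -> int:
    """Rough heuristic: one Spark microservice instance per chunk of
    worker cores, never fewer than one. Tune cores_per_instance to
    your workloads; this is illustrative only."""
    if total_worker_cores < 1 or cores_per_instance < 1:
        raise ValueError("core counts must be positive")
    return max(1, total_worker_cores // cores_per_instance)


# e.g. an 8-core worker pool with 2 cores per instance suggests 4 instances:
print(suggest_spark_instances(8))  # 4
```

The suggested number would then be passed to the docker service scale command shown above.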
NoSQLBooster - a MongoDB GUI that performs several database tasks, such as file visualization, queries, projections, and file extraction to CSV and JSON formats. It can be useful for accomplishing some of these tasks with your processed dataset or for retrieving your prediction results.
Read the Database API docs for more info on configuring this tool.
See the full docs for detailed usage instructions.
Thanks goes to these wonderful people (emoji key):
Gabriel Ribeiro 💻 🚇 📆 🚧 |
Navendu Pottekkat 📖 🎨 🤔 |
This project follows the all-contributors specification. Contributions of any kind welcome!