Ploomber is the simplest way to build reliable data pipelines for Data Science and Machine Learning. Provide your source code in a standard form, and Ploomber automatically constructs the pipeline for you. Tasks can be Python functions, Jupyter notebooks, Python/R/shell scripts, or SQL scripts.
When you're ready, deploy to Airflow or Kubernetes (using Argo) without code changes.
Here's what pipeline tasks look like:
Function:

```python
import pandas as pd


def clean_users(product, upstream):
    # run 'get_users' before this function.
    # upstream['get_users'] returns the output
    # of that task, used as input here
    df = pd.read_csv(upstream['get_users'])

    # your code here...

    # save output using the provided
    # product variable
    df.to_csv(product)
```

Jupyter notebook or Python script:

```python
# + tags=["parameters"]
# run 'clean_users' and 'clean_activity'
# before this script/notebook
upstream = ['clean_users', 'clean_activity']
# -

# a new cell is injected here with
# the product variable,
# e.g., product = '/path/output.csv'
# and a new upstream variable with input locations,
# e.g., upstream = {'clean_users': '/path/...',
#                   'clean_activity': '/another/...'}

from pathlib import Path
import pickle

# your code here...

# save output using the provided product variable
Path(product).write_bytes(pickle.dumps(model))
```

SQL script:

```sql
-- {{product}} is replaced by the table name
CREATE TABLE {{product}} AS

/*
run 'raw_data' before this task.
{{upstream['raw_data']}} is replaced by that
task's table name at runtime
*/
SELECT * FROM {{upstream['raw_data']}}
```

Pipeline declaration (pipeline.yaml):

```yaml
tasks:
  # function
  - source: functions.clean_users
    product: output/users-clean.csv

  # python script (or notebook)
  - source: notebooks/model-template.py
    product:
      model: output/model.pickle
      nb: output/model-evaluation.html

  # sql script
  - source: scripts/some_script.sql
    product: [schema, name, table]
    client: db.get_client
```
- Documentation
- Sample projects (Machine Learning pipeline, ETL, among others)
- Watch JupyterCon 2020 talk
```sh
pip install ploomber
```
Compatible with Python 3.6 and higher.
You can choose from one of the hosted options:
Or run locally:
```sh
# ML pipeline example
ploomber examples --name ml-basic
cd ml-basic

# if using pip
pip install -r requirements.txt

# if using conda
conda env create --file environment.yml
conda activate ml-basic

# run pipeline
ploomber build
```
Pipeline output is saved in the `output/` folder. Check out the pipeline definition in the `pipeline.yaml` file.

To get a list of examples, run `ploomber examples`.
- Jupyter integration. When you open your notebooks, Ploomber automatically injects a new cell with the location of your input files, as inferred from your `upstream` variable. If you open a Python or R script, it's converted to a notebook on the fly.
- Incremental builds. Speed up execution by skipping tasks whose source code hasn't changed.
- Parallelization. Run tasks in parallel to speed up computations.
- Pipeline testing. Run tests upon task execution to verify that the output data has the right properties (e.g., values within an expected range). See the sketch after this list.
- Pipeline inspection. Start an interactive session with `ploomber interact` to debug your pipeline. Call `dag['task_name'].debug()` to start a debugging session.
- Deployment to Kubernetes and Airflow. You can develop and execute locally. Once you are ready to deploy, export to Kubernetes or Airflow.
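As one illustration of pipeline testing, here is a minimal sketch of a standalone check function. It assumes you register it as a task's `on_finish` hook in `pipeline.yaml` (e.g., `on_finish: hooks.check_output`), in which case Ploomber runs it after the task executes and passes the task's product; the file format, the `age` column, and the expected range are hypothetical placeholders.

```python
# hooks.py -- a sketch of a data quality check run after a task finishes
# (assumed registration in pipeline.yaml: on_finish: hooks.check_output)
import pandas as pd


def check_output(product):
    # load the task's output (assumes the product is a CSV file)
    df = pd.read_csv(str(product))

    # no missing values in the column we rely on downstream
    assert not df['age'].isna().any(), 'found missing values in "age"'

    # values within the expected range; a failed assertion fails the build
    assert df['age'].between(0, 120).all(), '"age" values out of range'
```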
Ploomber has two goals:
- Provide an excellent development experience for Data Science/Machine Learning projects, which require a lot of experimentation and iteration: incremental builds and Jupyter integration are a fundamental part of this.
- Integrate with deployment tools (Airflow and Argo) to streamline deployment.
For a complete comparison, read our survey on workflow management tools.