Lean Data Science workflows: develop and test locally. Deploy to Kubernetes, Airflow, or any web framework.

Ploomber is the simplest way to build reliable data pipelines for Data Science and Machine Learning. Provide your source code in a standard form, and Ploomber automatically constructs the pipeline for you. Tasks can be Python functions, Jupyter notebooks, Python/R/shell scripts, or SQL scripts.

When you're ready, deploy to Airflow or Kubernetes (using Argo) without code changes.

Here's what pipeline tasks look like:

Function:

import pandas as pd

def clean_users(product, upstream):
    # run 'get_users' before this function.
    # upstream['get_users'] returns the output
    # of such task, used as input here
    df = pd.read_csv(upstream['get_users'])

    # your code here...

    # save output using the provided
    # product variable
    df.to_csv(product)
Jupyter notebook or Python script:

from pathlib import Path
import pickle

# + tags=["parameters"]
# run 'clean_users' and 'clean_activity'
# before this script/notebook
upstream = ['clean_users', 'clean_activity']
# -

# a new cell is injected here with
# the product variable
# e.g., product = '/path/output.csv'
# and a new upstream variable:
# e.g., upstream = {'clean_users': '/path/...',
#                   'clean_activity': '/another/...'}

# your code here...

# save output using the provided product variable
Path(product['model']).write_bytes(pickle.dumps(model))
SQL script:

-- {{product}} is replaced by the table name
CREATE TABLE {{product}} AS
/*
run 'raw_data' before this task. Replace
{{upstream['raw_data']}} with table name
at runtime
*/
SELECT * FROM {{upstream['raw_data']}}
Pipeline declaration (pipeline.yaml):

tasks:
  # function
  - source: functions.clean_users
    product: output/users-clean.csv

  # python script (or notebook)
  - source: notebooks/model-template.py
    product:
      model: output/model.pickle
      nb: output/model-evaluation.html
  
  # sql script
  - source: scripts/some_script.sql
    product: [schema, name, table]
    client: db.get_client
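
If you prefer Python over the command line, the same declaration can be loaded and executed programmatically. A minimal sketch, assuming the declaration above is saved as pipeline.yaml and the referenced source files exist:

from ploomber.spec import DAGSpec

# parse pipeline.yaml and construct the DAG
dag = DAGSpec('pipeline.yaml').to_dag()

# execute the pipeline (incremental: tasks whose source
# hasn't changed are skipped)
dag.build()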

Installation

pip install ploomber

Compatible with Python 3.6 and higher.

Try it out!

You can choose one of the hosted options (Binder or Deepnote).

Or run locally:

# ML pipeline example
ploomber examples --name ml-basic
cd ml-basic

# if using pip
pip install -r requirements.txt

# if using conda
conda env create --file environment.yml
conda activate ml-basic

# run pipeline
ploomber build

Pipeline output is saved in the output/ folder. Check out the pipeline definition in the pipeline.yaml file.

To get a list of examples, run ploomber examples.

Main features

  1. Jupyter integration. When you open your notebooks, Ploomber will automatically inject a new cell with the location of your input files, as inferred from your upstream variable. If you open a Python or R script, it's converted to a notebook on the fly.
  2. Incremental builds. Speed up execution by skipping tasks whose source code hasn't changed.
  3. Parallelization. Run tasks in parallel to speed up computations.
  4. Pipeline testing. Run tests upon task execution to verify that the output data has the right properties (e.g., values within expected range); see the hook sketch after this list.
  5. Pipeline inspection. Start an interactive session with ploomber interact to debug your pipeline. Call dag['task_name'].debug() to start a debugging session.
  6. Deployment to Kubernetes and Airflow. You can develop and execute locally. Once you are ready to deploy, export to Kubernetes or Airflow.
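
To make pipeline testing (feature 4) concrete: a task in pipeline.yaml can point to a hook (e.g., on_finish: hooks.test_clean_users) that runs right after the task executes. Below is a minimal sketch of such a hook; the module name, hook name, and column names are illustrative assumptions, not part of the example pipeline above:

# hooks.py (hypothetical module, referenced from pipeline.yaml via
# "on_finish: hooks.test_clean_users" on the clean_users task)
import pandas as pd

def test_clean_users(product):
    # Ploomber passes the task's product; an assertion failure here
    # aborts the pipeline before downstream tasks consume bad data
    df = pd.read_csv(str(product))
    assert df['age'].between(0, 120).all(), 'age out of range'
    assert not df['user_id'].duplicated().any(), 'duplicate user ids'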

How does Ploomber compare to X?

Ploomber has two goals:

  1. Provide an excellent development experience for Data Science/Machine Learning projects, which require a lot of experimentation and iteration: incremental builds and Jupyter integration are fundamental parts of this.
  2. Integrate with deployment tools (Airflow and Argo) to streamline deployment.

For a complete comparison, read our survey on workflow management tools.
