Ploomber is the simplest way to build reliable data pipelines for Data Science and Machine Learning. Provide your source code in a standard form, and Ploomber automatically constructs the pipeline for you. Tasks can be Python functions, Jupyter notebooks, Python/R/shell scripts, or SQL scripts.
When you're ready, deploy to Airflow or Kubernetes (using Argo) without code changes.
Here's what pipeline tasks look like:
Function:

```python
import pandas as pd


def clean_users(product, upstream):
    # run 'get_users' before this function.
    # upstream['get_users'] returns the output
    # of that task, used as input here
    df = pd.read_csv(upstream['get_users'])

    # your code here...

    # save output using the provided
    # product variable
    df.to_csv(product)
```

Jupyter notebook or Python script:

```python
# + tags=["parameters"]
# run 'clean_users' and 'clean_activity'
# before this script/notebook
upstream = ['clean_users', 'clean_activity']
# -

# a new cell is injected here with
# the product variable,
# e.g., product = '/path/output.csv'
# and a new upstream variable with input locations,
# e.g., upstream = {'clean_users': '/path/...',
#                   'clean_activity': '/another/...'}

from pathlib import Path
import pickle

# your code here...

# save output using the provided product variable
Path(product).write_bytes(pickle.dumps(model))
```

SQL script:

```sql
-- {{product}} is replaced by the table name
CREATE TABLE {{product}} AS

/*
run 'raw_data' before this task.
{{upstream['raw_data']}} is replaced by that
task's table name at runtime
*/
SELECT * FROM {{upstream['raw_data']}}
```

Pipeline declaration (pipeline.yaml):

```yaml
tasks:
  # function
  - source: functions.clean_users
    product: output/users-clean.csv

  # python script (or notebook)
  - source: notebooks/model-template.py
    product:
      model: output/model.pickle
      nb: output/model-evaluation.html

  # sql script
  - source: scripts/some_script.sql
    product: [schema, name, table]
    client: db.get_client
```
- Documentation
- Sample projects (Machine Learning pipeline, ETL, among others)
- Watch JupyterCon 2020 talk
```sh
pip install ploomber
```
Compatible with Python 3.6 and higher.
You can choose from one of the hosted options:
Or run locally:
```sh
# ML pipeline example
ploomber examples --name ml-basic
cd ml-basic

# if using pip
pip install -r requirements.txt

# if using conda
conda env create --file environment.yml
conda activate ml-basic

# run pipeline
ploomber build
```
Pipeline output is saved in the `output/` folder. Check out the pipeline definition in the `pipeline.yaml` file.

To get a list of examples, run `ploomber examples`.
- Jupyter integration. When you open your notebooks, Ploomber automatically injects a new cell with the location of your input files, as inferred from your `upstream` variable. If you open a Python or R script, it's converted to a notebook on the fly.
- Incremental builds. Speed up execution by skipping tasks whose source code hasn't changed.
- Parallelization. Run tasks in parallel to speed up computations.
- Pipeline testing. Run tests upon task execution to verify that the output data has the right properties (e.g., values within an expected range). See the sketch after this list.
- Pipeline inspection. Start an interactive session with `ploomber interact` to debug your pipeline. Call `dag['task_name'].debug()` to start a debugging session.
- Deployment to Kubernetes and Airflow. You can develop and execute locally. Once you are ready to deploy, export to Kubernetes or Airflow.
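As one illustration of pipeline testing, here is a minimal sketch of a standalone check function. It assumes you register it as a task's `on_finish` hook in `pipeline.yaml` (e.g., `on_finish: hooks.check_output`), in which case Ploomber runs it after the task executes and passes the task's product; the file format, the `age` column, and the expected range are hypothetical placeholders.

```python
# hooks.py -- a sketch of a data quality check run after a task finishes
# (assumed registration in pipeline.yaml: on_finish: hooks.check_output)
import pandas as pd


def check_output(product):
    # load the task's output (assumes the product is a CSV file)
    df = pd.read_csv(str(product))

    # no missing values in the column we rely on downstream
    assert not df['age'].isna().any(), 'found missing values in "age"'

    # values within the expected range; a failed assertion fails the build
    assert df['age'].between(0, 120).all(), '"age" values out of range'
```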
Ploomber has two goals:
- Provide an excellent development experience for Data Science/Machine Learning projects, which require a lot of experimentation and iteration: incremental builds and Jupyter integration are a fundamental part of this.
- Integrate with deployment tools (Airflow and Argo) to streamline deployment.
For a complete comparison, read our survey on workflow management tools.