Dagster is a system for building modern data applications.
- Elegant programming model: Dagster is a set of abstractions for building self-describing, testable, and reliable data applications. It embraces the principles of functional data programming; gradual, optional typing; and testability as a first-class value.
- Flexible & incremental: Dagster integrates with your existing tools and infrastructure, and can invoke any computation, whether it is Spark, Python, a Jupyter notebook, or SQL. It is also designed to deploy to any workflow engine, such as Airflow.
- Beautiful tools: Dagster's development environment, dagit, is designed for data engineers, machine learning engineers, and data scientists, and enables astoundingly productive local development.
```bash
pip install dagster dagit
```
This installs two modules:
- Dagster: the core programming model and abstraction stack; stateless, single-node, single-process and multi-process execution engines; and a CLI tool for driving those engines.
- Dagit: a UI and rich development environment for Dagster, including a DAG browser, a type-aware config editor, and a streaming execution interface.
`hello_dagster.py`

```python
from dagster import execute_pipeline, pipeline, solid


@solid
def get_name(_):
    return 'dagster'


@solid
def hello(context, name: str):
    context.log.info('Hello, {name}!'.format(name=name))


@pipeline
def hello_pipeline():
    hello(get_name())
```
Save the code above in a file named `hello_dagster.py`. You can execute the pipeline using any one of the following methods:
(1) Dagster Python API

```python
if __name__ == "__main__":
    execute_pipeline(hello_pipeline)  # Hello, dagster!
```
(2) Dagster CLI

```bash
$ dagster pipeline execute -f hello_dagster.py -n hello_pipeline
```
(3) Dagit web UI

```bash
$ dagit -f hello_dagster.py -n hello_pipeline
```
Next, jump right into our tutorial, or read our complete documentation. If you're actively using Dagster or have questions on getting started, we'd love to hear from you.
For details on contributing or running the project for development, check out our contributing guide.
Dagster works with the tools and systems that you're already using with your data, including:
| Integration | Dagster Library |
| ----------- | --------------- |
| Apache Airflow | `dagster-airflow` <br> Allows Dagster pipelines to be scheduled and executed, either containerized or uncontainerized, as Apache Airflow DAGs. |
| Apache Spark | `dagster-spark` · `dagster-pyspark` <br> Libraries for interacting with Apache Spark and PySpark. |
| Dask | `dagster-dask` <br> Provides a Dagster integration with Dask / Dask.Distributed. |
| Datadog | `dagster-datadog` <br> Provides a Dagster resource for publishing metrics to Datadog. |
| Jupyter / Papermill | `dagstermill` <br> Built on the papermill library, dagstermill is meant for integrating productionized Jupyter notebooks into dagster pipelines. |
| PagerDuty | `dagster-pagerduty` <br> A library for creating PagerDuty alerts from Dagster workflows. |
| Snowflake | `dagster-snowflake` <br> A library for interacting with the Snowflake Data Warehouse. |
| **Cloud Providers** | |
| AWS | `dagster-aws` <br> A library for interacting with Amazon Web Services. Provides integrations with S3, EMR, and (coming soon!) Redshift. |
| GCP | `dagster-gcp` <br> A library for interacting with Google Cloud Platform. Provides integrations with BigQuery and Cloud Dataproc. |
This list is growing as we are actively building more integrations, and we welcome contributions!
Several example projects are provided under the examples folder demonstrating how to use Dagster, including:
- `examples/airline-demo`: A substantial demo project illustrating how these tools can be used together to manage a realistic data pipeline.
- `examples/event-pipeline-demo`: An example illustrating a typical web event processing pipeline with S3, Scala Spark, and Snowflake.