Dagster

Keywords: dagster,data assets,orchestration

Dagster is an open-source, asset-centric data orchestration platform that models data pipelines as software-defined assets rather than imperative tasks. Data engineering teams define what data products should exist (tables, models, reports), and Dagster manages how and when they are produced, with first-class support for data quality testing, type-safe pipelines, and integrated observability.

What Is Dagster?

- Definition: A data orchestration platform, first developed in 2018, that introduces the Software-Defined Asset (SDA) paradigm: instead of defining "run Task A then Task B," teams define "Asset X depends on Asset Y," and Dagster manages materialization scheduling, dependency tracking, and freshness guarantees.
- Asset-Centric Philosophy: Dagster shifts orchestration from task-centric ("what computations should run?") to asset-centric ("what data products should exist, and are they fresh?") — modeling pipelines as a graph of data assets (database tables, ML models, reports) with defined dependencies between them.
- Software-Defined Assets: An SDA is a Python function decorated with @asset that produces a data artifact — Dagster tracks its lineage, freshness, test results, and materialization history, creating an observable catalog of all data products in the platform.
- Type Safety: Dagster uses Python type annotations throughout — inputs and outputs of assets have defined types that Dagster validates at runtime, catching schema mismatches before they corrupt downstream data.
- Testability: Dagster separates business logic (compute) from I/O (reading from S3, writing to database) via Resources — this separation makes unit testing data pipelines straightforward without mocking database connections.
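
Because an asset body is a plain Python function, its compute logic can be unit-tested directly against an in-memory DataFrame, with no Dagster machinery or database mocks. A minimal sketch (column names and the 99th-percentile cutoff are illustrative, mirroring the cleaning example later in this page):

```python
import pandas as pd

def clean_transactions(raw: pd.DataFrame) -> pd.DataFrame:
    # Same body an @asset-decorated function would have:
    # drop nulls, then trim the top 1% of amounts as outliers.
    df = raw.dropna()
    return df[df["amount"] < df["amount"].quantile(0.99)]

# Unit test: call the function directly with a small in-memory frame.
raw = pd.DataFrame({"amount": [10.0, 20.0, None, 1_000_000.0]})
cleaned = clean_transactions(raw)
```

The `@asset` decorator wraps but does not hide the function, so the same call works whether or not Dagster is orchestrating it.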

Why Dagster Matters for AI and ML

- ML Model as Asset: An ML model is itself a data asset — Dagster tracks which training data version, which code version, and which hyperparameters produced each model version. The model's lineage is automatic, not manually documented.
- Data Quality Gates: Define asset checks that must pass before downstream assets are materialized — a model training asset only runs if the training data asset passes null-rate and distribution checks.
- Partitioned Assets: Handle time-partitioned data naturally — define that a feature table has daily partitions and Dagster tracks which partitions are materialized, missing, or stale without custom bookkeeping logic.
- Observable Data Catalog: Dagster's Asset Catalog shows all data products, their freshness, test results, and lineage in a unified UI — data engineers and ML teams see the same view of data dependencies.
- Sensor-Driven Materialization: Trigger asset materialization based on external events — when a new dataset arrives in S3, automatically trigger the downstream feature engineering and model training assets.
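
The polling loop a sensor runs can be sketched without the framework: Dagster persists a cursor between evaluations, and each evaluation emits one run request per new object. A framework-free sketch of that cursor logic (the bucket-listing callable and key names are hypothetical stand-ins for an S3 list call):

```python
from typing import Optional

def poll_new_objects(list_keys, cursor: Optional[str]):
    """Return (run_requests, new_cursor) for keys newer than the cursor.

    list_keys: callable returning the bucket's object keys (stands in
    for an S3 list call). Dagster stores the cursor and re-invokes the
    sensor on an interval; each new key would become a RunRequest that
    materializes the downstream assets.
    """
    keys = sorted(list_keys())
    new = [k for k in keys if cursor is None or k > cursor]
    run_requests = [{"partition_key": k} for k in new]
    new_cursor = keys[-1] if keys else cursor
    return run_requests, new_cursor

requests, cur = poll_new_objects(
    lambda: ["2024-01-01.csv", "2024-01-02.csv"], None
)
```

A second poll with the saved cursor yields no new run requests, which is exactly the idempotence the cursor provides.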

Dagster Core Concepts

Software-Defined Assets:
from dagster import asset, AssetIn, MetadataValue
import pandas as pd

@asset(
    description="Raw customer transaction data from warehouse",
    group_name="raw_data"
)
def raw_transactions() -> pd.DataFrame:
    return fetch_from_warehouse("SELECT * FROM transactions WHERE date > CURRENT_DATE - 30")

@asset(
    ins={"raw_transactions": AssetIn()},
    description="Cleaned transactions with outliers removed",
    group_name="features"
)
def clean_transactions(raw_transactions: pd.DataFrame) -> pd.DataFrame:
    df = raw_transactions.dropna()
    df = df[df["amount"] < df["amount"].quantile(0.99)]
    return df

@asset(
    ins={"clean_transactions": AssetIn()},
    description="Customer lifetime value features for ML training",
    group_name="features",
    metadata={"feature_count": MetadataValue.int(5)}
)
def customer_features(clean_transactions: pd.DataFrame) -> pd.DataFrame:
    return clean_transactions.groupby("customer_id").agg(
        transaction_count=("amount", "count"),
        total_spend=("amount", "sum"),
        avg_spend=("amount", "mean"),
        last_transaction=("date", "max")
    ).reset_index()

Resources (I/O Abstraction):
import pandas as pd
from dagster import ConfigurableResource, Definitions
from sqlalchemy import create_engine

class WarehouseResource(ConfigurableResource):
    connection_string: str

    def query(self, sql: str) -> pd.DataFrame:
        engine = create_engine(self.connection_string)
        return pd.read_sql(sql, engine)

# Resources injected into assets — swap prod/dev without code changes
defs = Definitions(
    assets=[raw_transactions, clean_transactions, customer_features],
    resources={"warehouse": WarehouseResource(connection_string="...")}
)
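
Because assets depend on the resource's interface rather than a concrete connection, a stub exposing the same `query` method can be injected for tests. A sketch (plain classes stand in for `ConfigurableResource` so it runs without a Dagster install; the canned rows are illustrative):

```python
import pandas as pd

class FakeWarehouse:
    """Test double with the same interface as a warehouse resource."""
    def query(self, sql: str) -> pd.DataFrame:
        # Return canned rows instead of hitting a real database.
        return pd.DataFrame({"amount": [10.0, 20.0]})

def raw_transactions(warehouse) -> pd.DataFrame:
    # Asset body: all I/O goes through the injected resource.
    return warehouse.query("SELECT * FROM transactions")

df = raw_transactions(FakeWarehouse())
```

Swapping `FakeWarehouse` for the production resource requires no change to the asset body, which is the point of the separation.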

Asset Checks (Data Quality):
from dagster import asset_check, AssetCheckResult, MetadataValue

@asset_check(asset=customer_features)
def check_no_nulls(customer_features: pd.DataFrame) -> AssetCheckResult:
    null_count = customer_features.isnull().sum().sum()
    return AssetCheckResult(
        passed=null_count == 0,
        metadata={"null_count": MetadataValue.int(int(null_count))}
    )
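
The check body is itself plain Python, so its pass/fail logic can be exercised directly in a unit test. A framework-free sketch that returns a `(passed, null_count)` tuple in place of `AssetCheckResult`:

```python
import pandas as pd

def check_no_nulls(df: pd.DataFrame):
    # Same logic as the asset check above, minus the Dagster result type.
    null_count = int(df.isnull().sum().sum())
    return null_count == 0, null_count

ok, n = check_no_nulls(pd.DataFrame({"total_spend": [1.0, None]}))
```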

Partitioned Assets:
from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset

daily_partitions = DailyPartitionsDefinition(start_date="2024-01-01")

@asset(partitions_def=daily_partitions)
def daily_features(context: AssetExecutionContext) -> pd.DataFrame:
    date = context.partition_key
    return fetch_features_for_date(date)
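
The bookkeeping Dagster automates here, which daily partitions should exist versus which have been materialized, can be sketched with the standard library (ISO date strings as partition keys, a set as the stand-in materialization record):

```python
from datetime import date, timedelta

def partition_keys(start: date, end: date):
    """All daily partition keys from start up to (not including) end."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]

def missing_partitions(start: date, end: date, materialized: set):
    # Dagster tracks this per asset; here it is a simple set difference.
    return [k for k in partition_keys(start, end) if k not in materialized]

missing = missing_partitions(date(2024, 1, 1), date(2024, 1, 5),
                             {"2024-01-01", "2024-01-03"})
```

Backfills and stale-partition detection reduce to the same difference computed against the asset's recorded materializations.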

Dagster vs Alternatives

| Aspect | Dagster | Airflow | Prefect |
|--------|---------|---------|---------|
| Primary Model | Data assets | Tasks/DAGs | Tasks/flows |
| Type Safety | Strong | None | Partial |
| Testability | Excellent | Difficult | Good |
| Data Catalog | Built-in | External | External |
| ML Lineage | Automatic | Manual | Manual |
| Learning Curve | Medium | High | Low |

Dagster is a data orchestration platform that treats data products as first-class citizens rather than side effects of task execution. By modeling pipelines as graphs of observable, testable data assets with automatic lineage tracking and data quality gates, it gives ML and data engineering teams the visibility and reliability guarantees needed to build trustworthy data products at production scale.
