Most data teams in 2026 are under pressure to ship insights faster without adding heavy infrastructure. A practical pattern is to combine Polars for fast dataframe transforms with DuckDB for local analytical SQL. In this guide, you will build a production-friendly mini pipeline that ingests CSV data, validates the schema, performs feature transforms, and publishes reusable analytics outputs with clean Python code.
Why this stack works now
Traditional pandas workflows are still useful, but many pipelines now hit memory or speed limits as datasets grow. Polars uses a query engine optimized for columnar operations and parallel execution. DuckDB gives you SQL over local files and in-memory dataframes with minimal setup. Together, they are simple enough for laptops and powerful enough for scheduled jobs.
- Polars handles fast transforms and lazy execution.
- DuckDB handles joins, aggregations, and SQL-friendly reporting.
- Python orchestrates both with clear, testable modules.
Project structure
Use a small structure you can scale later:
analytics-pipeline/
  data/
    raw/
    curated/
  src/
    ingest.py
    transform.py
    report.py
  main.py
  pyproject.toml
Install dependencies:
pip install polars duckdb pydantic rich

Step 1: Ingest and validate raw data
Let us assume incoming event data looks like this:
event_id,user_id,event_type,event_ts,amount
1,101,signup,2026-04-01T10:00:00,0
2,101,purchase,2026-04-02T09:10:00,49.99
3,202,signup,2026-04-03T11:20:00,0
Create src/ingest.py:
from pydantic import BaseModel, ValidationError
import polars as pl

class EventSchema(BaseModel):
    event_id: int
    user_id: int
    event_type: str
    event_ts: str
    amount: float

def load_raw(path: str) -> pl.DataFrame:
    df = pl.read_csv(path)
    # Basic row-level validation (sampled for speed in larger files)
    for row in df.head(500).to_dicts():
        try:
            EventSchema(**row)
        except ValidationError as e:
            raise ValueError(f"Schema validation failed: {e}")
    return df
This keeps ingestion strict enough to catch obvious bad input before downstream steps run.
Step 2: Transform with Polars lazy queries
Now build reusable transforms in src/transform.py:
import polars as pl

def transform_events(df: pl.DataFrame) -> pl.DataFrame:
    out = (
        df.lazy()
        .with_columns([
            pl.col("event_ts").str.to_datetime().alias("event_dt"),
            pl.when(pl.col("event_type") == "purchase")
            .then(pl.lit(1))
            .otherwise(pl.lit(0))
            .alias("is_purchase"),
        ])
        # group_by_dynamic requires the frame to be sorted by the index column
        .sort("event_dt")
        .group_by_dynamic("event_dt", every="1d")
        .agg([
            pl.len().alias("events"),
            pl.col("user_id").n_unique().alias("active_users"),
            pl.col("amount").sum().alias("revenue"),
            pl.col("is_purchase").sum().alias("purchases"),
        ])
        .collect()
    )
    return out
Key point: lazy execution lets Polars optimize query plans before computing results, which often reduces runtime for larger inputs.
Step 3: Run analytical SQL in DuckDB
After generating daily aggregates, use DuckDB for reporting queries in src/report.py:
import duckdb
import polars as pl

def run_report(daily_df: pl.DataFrame) -> pl.DataFrame:
    con = duckdb.connect(database=":memory:")
    # DuckDB can scan a Polars DataFrame directly; no pandas round trip needed
    con.register("daily_metrics", daily_df)
    query = """
        SELECT
            event_dt::DATE AS day,
            events,
            active_users,
            revenue,
            CASE WHEN active_users = 0 THEN 0
                 ELSE revenue / active_users
            END AS revenue_per_active_user
        FROM daily_metrics
        ORDER BY day
    """
    return con.execute(query).pl()
This is useful when your analysts prefer SQL for business logic but engineering wants Python modules and tests.
Step 4: Wire everything in main.py
from pathlib import Path

from src.ingest import load_raw
from src.transform import transform_events
from src.report import run_report

RAW_PATH = "data/raw/events.csv"
CURATED_PATH = "data/curated/daily_metrics.parquet"
REPORT_PATH = "data/curated/business_report.parquet"

def main():
    Path("data/curated").mkdir(parents=True, exist_ok=True)
    raw_df = load_raw(RAW_PATH)
    daily_df = transform_events(raw_df)
    report_df = run_report(daily_df)
    daily_df.write_parquet(CURATED_PATH)
    report_df.write_parquet(REPORT_PATH)
    print("Pipeline completed successfully")

if __name__ == "__main__":
    main()
Run with:
python main.py

Production tips for real teams
1) Make schemas explicit
Do not rely only on inferred types. Explicit schema mapping prevents subtle breakage when source systems change column order or null behavior.
2) Add data quality checks
At minimum, enforce uniqueness for IDs and expected ranges for key numeric fields. Write failing rows to a quarantine file for quick triage.
3) Prefer Parquet for curated outputs
Parquet is compact and efficient for analytics reads. It also works naturally with DuckDB, Spark, and cloud lakehouse tools.
4) Keep transforms deterministic
Avoid hidden time dependencies. Pass run dates explicitly so reruns produce consistent results.
5) Log run metadata
Store row counts, input file hashes, and runtime per stage. This makes debugging and audit trails much easier.
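A minimal metadata capture might look like this; the dictionary keys are illustrative, not a fixed schema, and the byte string stands in for reading the input file:

```python
import hashlib
import time

def file_sha256(data: bytes) -> str:
    # Content hash of the input, so reruns can detect changed source files
    return hashlib.sha256(data).hexdigest()

start = time.perf_counter()
raw_bytes = b"event_id,user_id\n1,101\n"              # stands in for the raw CSV
row_count = raw_bytes.decode().strip().count("\n")    # data rows, excluding header
elapsed = time.perf_counter() - start

run_meta = {
    "input_sha256": file_sha256(raw_bytes),
    "rows": row_count,
    "duration_s": elapsed,
}
```

Persisting one such record per stage per run gives you an audit trail almost for free.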
Where this pattern fits in 2026
This approach is ideal for internal analytics tools, product dashboards, and feature engineering pre-steps where teams want speed without heavy orchestration. You can schedule this with a simple job runner first, then move to Airflow or Dagster only if complexity grows.
If your workload expands, you can still keep the same logic and shift storage to object stores, add partitioning, and plug in a metrics collector for pipeline SLAs.
Final takeaway
Polars plus DuckDB is one of the highest-leverage combinations for practical data science workflows in 2026. You get fast compute, readable code, and SQL-friendly outputs in a lightweight setup. Start with a clean local pipeline like this, then scale only the parts that truly need scaling.
In a follow-up post, I can share a tested template that adds incremental loads, idempotent reruns, and automated data quality scorecards.