At 11:47 p.m., our checkout API looked healthy, CPU was boring, and everyone in the release channel had already switched to “weekend mode.” Then one migration landed: ALTER TABLE orders ALTER COLUMN status TYPE text;. It seemed harmless. The table had “only” a few million rows. Within minutes, p95 latency spiked, workers piled up behind blocked transactions, and support started tagging us in payment-failed threads.
What hurt was not the SQL itself; it was the lock behavior we ignored. We treated a schema change like a code deploy. PostgreSQL did not.
This guide is the playbook I wish we had that night. It is for teams that need zero-downtime PostgreSQL migrations on live systems where write traffic never really sleeps.
Why “simple” migrations become incidents
Most migration incidents are not caused by bad intent. They come from three assumptions:
- Assumption 1: “This statement is fast in staging, so it will be fast in prod.” Large table size and concurrent write patterns change everything.
- Assumption 2: “If it blocks, it will block briefly.” Some DDL paths take stronger locks than expected and can queue application writes.
- Assumption 3: “Rollback is easy.” Data migrations and dual-version app behavior make rollback a product decision, not a single SQL command.
PostgreSQL docs are explicit: many ALTER TABLE operations acquire strong locks unless otherwise noted, and lock behavior depends on subcommand choice. Also, regular CREATE INDEX can block writes, while CREATE INDEX CONCURRENTLY is designed to avoid that specific write lock pattern, with caveats you still need to plan for.
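A minimal defensive pattern, assuming your migration tool lets you wrap DDL in a transaction: set a session-local lock timeout so the DDL fails fast and can be retried, instead of queueing every application writer behind it while it waits for an ACCESS EXCLUSIVE lock.

```sql
-- Sketch: fail fast instead of silently queueing writers.
-- If the lock is not acquired within 5 seconds, the statement errors
-- out and can be retried later; writers are never stuck for long.
BEGIN;
SET LOCAL lock_timeout = '5s';
SET LOCAL statement_timeout = '30s';
ALTER TABLE orders ADD COLUMN status_v2 text;  -- metadata-only, fast once the lock is held
COMMIT;
```

SET LOCAL scopes both timeouts to this transaction, so normal sessions keep their defaults.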
The practical rollout model: expand, migrate, contract
The safest pattern for database schema changes in production is still expand and contract:
- Expand: Add new schema elements in a backward-compatible way.
- Migrate: Backfill and dual-read/dual-write in the application.
- Contract: Remove old columns/paths only after live verification.
This pattern is less glamorous than one-shot migrations, but it gives you operational escape hatches.
Phase 1, expand without breaking reads/writes
Suppose you are moving from a text status to an enum-like constrained column. Instead of mutating in place, add a new nullable column first, index it safely, and deploy code that can read both.
-- 001_expand_orders_status.sql
BEGIN;
ALTER TABLE orders
ADD COLUMN status_v2 text;
-- Add a NOT VALID check first to avoid long upfront validation work.
ALTER TABLE orders
ADD CONSTRAINT orders_status_v2_check
CHECK (status_v2 IN ('pending','paid','failed','refunded')) NOT VALID;
COMMIT;
-- Build index without blocking normal writes.
-- CONCURRENTLY cannot run inside a transaction block, so this goes after COMMIT.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_status_v2
ON orders (status_v2);
Tradeoff: CREATE INDEX CONCURRENTLY reduces write-blocking risk, but it takes longer, cannot run inside a transaction block, and a failed build leaves behind an invalid index that you must drop and retry. That is usually still the right trade in production.
Phase 2, migrate data in chunks and ship dual-write
Do not backfill millions of rows in one transaction unless your outage budget is generous. Batch updates with deterministic ordering and short transactions.
# 002_backfill_status_v2.py
import psycopg

BATCH_SIZE = 5000

SQL = """
WITH cte AS (
    SELECT id
    FROM orders
    WHERE status_v2 IS NULL
    ORDER BY id
    LIMIT %s
)
UPDATE orders o
SET status_v2 = CASE o.status
    WHEN 'created'  THEN 'pending'
    WHEN 'captured' THEN 'paid'
    WHEN 'declined' THEN 'failed'
    WHEN 'returned' THEN 'refunded'
    ELSE 'failed'
END
FROM cte
WHERE o.id = cte.id
RETURNING o.id;
"""

with psycopg.connect("postgresql://app:***@db/prod") as conn:
    conn.autocommit = False
    while True:
        with conn.cursor() as cur:
            cur.execute(SQL, (BATCH_SIZE,))
            rows = cur.fetchall()
        conn.commit()  # each batch is its own short transaction
        if not rows:
            break
        print(f"migrated {len(rows)} rows")
Application deploy during this phase:
- Write both status and status_v2.
- Read COALESCE(status_v2, map(status)) for temporary compatibility.
- Add dashboards for migration lag and error rate before contract changes.
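The temporary read path can be sketched as a small application-side helper. The mapping mirrors the CASE expression in the backfill script; the helper name effective_status is illustrative, not from the original codebase.

```python
# Sketch of the dual-read helper used while both columns exist.
# The mapping must stay in sync with the backfill SQL's CASE expression;
# `effective_status` is a hypothetical name for illustration.

LEGACY_TO_V2 = {
    "created": "pending",
    "captured": "paid",
    "declined": "failed",
    "returned": "refunded",
}

def effective_status(row: dict) -> str:
    """Prefer status_v2; fall back to mapping the legacy status."""
    if row.get("status_v2") is not None:
        return row["status_v2"]
    # Unknown legacy values fall back to 'failed', matching the backfill SQL.
    return LEGACY_TO_V2.get(row["status"], "failed")

# A row not yet backfilled still resolves to the new vocabulary.
print(effective_status({"status": "captured", "status_v2": None}))   # paid
print(effective_status({"status": "created", "status_v2": "paid"}))  # paid
```

Once backfill and dual-write reach full parity, this helper collapses to a plain column read and can be deleted in the contract phase.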
Phase 3, validate then contract
Once backfill is complete and dual-write has run cleanly for at least one full traffic cycle, validate constraints and remove old paths.
-- 003_contract_orders_status.sql
BEGIN;
-- Validate after data is in place.
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_v2_check;
-- Optionally tighten nullability once proven safe. SET NOT NULL takes an
-- ACCESS EXCLUSIVE lock and scans the table, unless an already validated
-- CHECK (status_v2 IS NOT NULL) constraint lets PostgreSQL 12+ skip the scan.
ALTER TABLE orders
ALTER COLUMN status_v2 SET NOT NULL;
COMMIT;
-- Later release: switch app reads fully to status_v2,
-- then drop legacy column in a separate maintenance step.
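That later cleanup step can be sketched like this, assuming all readers and writers have moved to status_v2. DROP COLUMN is metadata-fast, but it still needs a brief ACCESS EXCLUSIVE lock, so the same fail-fast timeout applies.

```sql
-- 004_drop_legacy_status.sql (illustrative follow-up; run only after
-- every read and write path has switched to status_v2)
BEGIN;
SET LOCAL lock_timeout = '5s';
ALTER TABLE orders DROP COLUMN status;
COMMIT;
```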
Tradeoff: Keeping old and new columns briefly increases app complexity, but it dramatically lowers rollback risk. One-step replacement is cleaner on paper, riskier in production.
Operational guardrails that prevent 2 a.m. surprises
- Set lock timeouts deliberately: Failing fast is better than silently queueing every writer behind a migration lock.
- Use migration windows by write intensity, not by wall-clock tradition: midnight is sometimes your busiest async batch period.
- Observe progress views: PostgreSQL exposes progress views for commands like CREATE INDEX so teams can distinguish slow from stuck.
- Pair schema rollout with delivery controls: tie migration flags to your deployment gates, similar to how you already gate risky app changes in CI/CD.
If your team is already improving deploy safety, this complements the controls from our DevOps automation playbook and the reliability mindset from Backend reliability in 2026.
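For the slow-versus-stuck check, a query like this against the built-in progress view works on PostgreSQL 12 and later:

```sql
-- Distinguish a slow index build from a stuck one.
-- A stuck build typically shows lockers_done < lockers_total for a long time.
SELECT pid, phase,
       lockers_done, lockers_total,
       blocks_done, blocks_total
FROM pg_stat_progress_create_index;
```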
A quick decision matrix before you touch production
Before approving a schema rollout, answer these four questions explicitly in your change request:
- Table size and write rate: Is this a low-write table where a short lock is acceptable, or a hot path like orders/payments where even 30 seconds hurts revenue?
- Compatibility window: Can the app run with both old and new schema for a few days, or are you tightly coupled to one release?
- Rollback path: If contract phase fails halfway, do you have a tested way to keep serving reads/writes safely?
- Operational ownership: Who watches the migration in real time, and who has authority to abort when thresholds are breached?
If you cannot answer these clearly, postpone and harden the plan. A delayed migration is cheaper than a live checkout freeze. In practical terms, most teams should optimize for reversibility first, speed second. That often means two or three controlled deploys instead of one “heroic” migration. It feels slower, but in incident terms it is usually the fastest route to stable change.
Troubleshooting: when a migration is already hurting production
Symptom: API timeouts spike right after DDL starts
Likely cause: waiting transactions behind a stronger-than-expected table lock.
What to do now: pause deploy traffic, inspect blockers via pg_locks + pg_stat_activity, and decide quickly whether to cancel the migration session or drain writers first.
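One quick way to surface the blocking chain, using pg_blocking_pids (available since PostgreSQL 9.6):

```sql
-- Show sessions that are currently blocked and which PIDs block them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```

If the migration session's PID shows up in blocked_by for many writers, cancelling it is usually cheaper than waiting.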
Symptom: CREATE INDEX CONCURRENTLY runs forever
Likely cause: long-running transactions preventing phase transitions.
What to do now: identify stale transactions from app workers, cron jobs, or forgotten admin sessions; terminate safely if approved; rerun during a cleaner window.
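A starting point for finding those stale transactions:

```sql
-- Oldest open transactions first; anything idle in transaction for
-- longer than the index build has run is a prime suspect.
SELECT pid,
       now() - xact_start AS xact_age,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```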
Symptom: backfill finishes, but app still serves mixed states
Likely cause: partial dual-write rollout or missed write path.
What to do now: compare write-path coverage, replay a sampled event set, and keep legacy reads enabled until parity checks are green.
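A parity check can be as simple as counting disagreements between the two columns under the same mapping the backfill used:

```sql
-- Rows where the new column disagrees with the mapped legacy value.
-- The CASE expression must stay in sync with the backfill script.
SELECT count(*) AS mismatches
FROM orders
WHERE status_v2 IS DISTINCT FROM
      CASE status
        WHEN 'created'  THEN 'pending'
        WHEN 'captured' THEN 'paid'
        WHEN 'declined' THEN 'failed'
        WHEN 'returned' THEN 'refunded'
        ELSE 'failed'
      END;
```

A nonzero count means a write path is still bypassing dual-write; keep legacy reads enabled until this returns zero across a full traffic cycle.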
For teams running analytics and operational data together, the reproducibility discipline in our DuckDB + Iceberg + dbt article helps prevent “it looked fine locally” surprises during migration verification. And if your app plus schema changes live in a monorepo, this Git monorepo performance guide is useful when migration CI starts dragging large teams.
FAQ
1) Is zero-downtime always possible for every schema change?
No. Some transformations are inherently disruptive at scale, depending on table size, hardware, and workload shape. The goal is to minimize blast radius with staged design, not pretend every change is free.
2) Should I always use CREATE INDEX CONCURRENTLY?
For production tables with active writes, usually yes. In controlled maintenance windows with no writes, non-concurrent builds can be faster and simpler. Pick based on traffic profile, not habit.
3) Can I skip dual-write if backfill is fast?
You can, but it increases rollback risk. Dual-write gives you a safer compatibility bridge while data catches up and while you verify edge write paths under real traffic.
Actionable takeaways
- Adopt the expand-and-contract migration pattern as your default for high-traffic schema changes.
- Prefer PostgreSQL CREATE INDEX CONCURRENTLY for hot tables, and budget extra runtime.
- Treat migration success as an app-plus-data rollout with observability, not as a standalone SQL event.
- Define explicit lock timeout and rollback criteria before running any production DDL.
- Keep one migration runbook per service so on-call engineers can execute without guesswork.
