A quick story from a “successful” migration that almost failed in production
A team inherited a Python service that calculated invoice adjustments. It was old, under-documented, and full of hand-rolled date logic. They used coding assistants to modernize it, added type hints, replaced some utility modules, and raised test coverage from 18% to 71% in two weeks. CI was green. Everyone felt great.
Then a staging replay found a 0.7% mismatch in invoices for one country. Not a crash, not an exception, just subtle money drift. The new code was cleaner, but one “safe” refactor changed rounding order in edge cases around tax and currency conversion. They caught it before release, but only because they had replay tests with historical data.
That is Python engineering in 2026. AI tools can revive old systems fast, but speed without behavior controls can create polished regressions.
Why this matters now
Many companies are reviving old Python projects instead of rewriting them. It makes sense: faster delivery, lower migration risk, and less business disruption. But there is a trap. Modernized code can pass unit tests while violating invisible business contracts buried in legacy behavior.
Three patterns make this risk worse:
- AI-assisted edits often touch more code than the requested change.
- Legacy systems encode policy in quirks, not explicit docs.
- Teams optimize for “clean code” before locking down “correct outcomes.”
The practical goal is not to avoid refactoring. It is to modernize with behavior fidelity.
The 2026 approach: stabilize behavior first, then improve structure
When reviving a legacy Python service, use this order:
- Capture behavior: create executable contracts from real production inputs.
- Constrain change: enforce small, intent-scoped refactors.
- Modernize safely: introduce typing, linting, and dependency updates with replay gates.
- Promote gradually: canary and compare output deltas before full cutover.
The most expensive mistakes happen when teams flip this order.
Step 1: Build a behavior baseline from historical data
Before touching business logic, snapshot real inputs and outputs from the current system. This becomes your contract dataset. Unit tests are not enough here, because they often reflect what developers thought mattered, not what production actually does.
```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, List

@dataclass
class InvoiceCase:
    case_id: str
    country: str
    subtotal: Decimal
    tax_rate: Decimal
    discount: Decimal
    expected_total: Decimal  # from legacy prod output

def compare_totals(
    old_cases: List[InvoiceCase],
    new_fn: Callable[..., Decimal],
    tolerance: Decimal = Decimal("0.01"),
) -> List[dict]:
    """Replay historical cases through the new implementation and collect drift."""
    mismatches = []
    for c in old_cases:
        actual = new_fn(
            country=c.country,
            subtotal=c.subtotal,
            tax_rate=c.tax_rate,
            discount=c.discount,
        )
        if abs(actual - c.expected_total) > tolerance:
            mismatches.append({
                "case_id": c.case_id,
                "expected": str(c.expected_total),
                "actual": str(actual),
                "delta": str(actual - c.expected_total),
            })
    return mismatches
```
This gives you hard evidence when refactors change outcomes.
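Wiring that comparison into CI can be as simple as a pytest that loads the snapshot and fails on any drift. A minimal sketch, assuming the snapshot lives in a JSONL file and the refactored entry point is called `calculate_total_v2` (both names illustrative):
```python
import json
from decimal import Decimal

# InvoiceCase and compare_totals come from the module sketched above.
from replay_contract import InvoiceCase, compare_totals
from invoicing_v2 import calculate_total_v2  # refactored implementation (name illustrative)

def load_cases(path="tests/data/invoice_replay.jsonl"):
    # One JSON object per line, exported from legacy production output.
    with open(path) as f:
        return [
            InvoiceCase(
                case_id=r["case_id"],
                country=r["country"],
                subtotal=Decimal(r["subtotal"]),
                tax_rate=Decimal(r["tax_rate"]),
                discount=Decimal(r["discount"]),
                expected_total=Decimal(r["expected_total"]),
            )
            for r in map(json.loads, f)
        ]

def test_replay_has_no_drift():
    mismatches = compare_totals(load_cases(), calculate_total_v2)
    assert not mismatches, f"{len(mismatches)} replay mismatches, first: {mismatches[:3]}"
```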
Step 2: Add invariants and property-based tests for domain rules
Replay datasets catch known behavior. Property tests catch unknown edge cases. For finance, pricing, and policy-heavy services, add invariant checks like:
- Total should never be negative after valid discount rules.
- Increasing tax rate should not reduce final amount.
- Re-applying an idempotent adjustment should not change the result.
These tests prevent elegant but incorrect refactors.
```python
from decimal import Decimal

from hypothesis import given, strategies as st

from invoicing import calculate_total  # function under test (module name illustrative)

@given(
    subtotal=st.decimals(min_value="0", max_value="100000", places=2),
    tax_rate=st.decimals(min_value="0", max_value="0.50", places=4),
    discount=st.decimals(min_value="0", max_value="1.00", places=4),
)
def test_invoice_invariants(subtotal, tax_rate, discount):
    total = calculate_total(subtotal=subtotal, tax_rate=tax_rate, discount=discount)

    # Invariant: a valid discount can never drive the total negative.
    assert total >= Decimal("0.00")

    # Invariant: raising the tax rate must not reduce the final amount.
    higher_tax_total = calculate_total(
        subtotal=subtotal,
        tax_rate=min(Decimal("0.50"), tax_rate + Decimal("0.01")),
        discount=discount,
    )
    assert higher_tax_total >= total
```
Property tests are especially useful when legacy code had many hidden branch combinations.
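The third invariant in the list above, idempotency, deserves its own property test. A sketch, reusing the imports from the previous block and assuming a hypothetical `apply_adjustment(total, adjustment_id)` entry point:
```python
from invoicing import apply_adjustment  # adjustment entry point (name and signature assumed)

@given(
    total=st.decimals(min_value="0", max_value="100000", places=2),
    adjustment_id=st.uuids(),
)
def test_adjustment_is_idempotent(total, adjustment_id):
    once = apply_adjustment(total, adjustment_id)
    twice = apply_adjustment(once, adjustment_id)
    # Applying the same adjustment a second time must be a no-op.
    assert once == twice
```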
Step 3: Use AI assistants with strict edit boundaries
Coding assistants are great for boilerplate modernization, but in legacy systems you should enforce constraints:
- One business concern per PR.
- No unrelated file edits in critical modules.
- Required “behavior impact” note in PR description.
- Mandatory replay diff artifact attached to every merge request.
This keeps momentum without turning reviews into archaeology.
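These rules can be partially automated. One option is a small pre-merge check that fails when protected paths change without a behavior-impact note in the PR description; the paths and the way the PR body reaches the script are illustrative:
```python
import re
import subprocess
import sys

PROTECTED = ("billing/", "tax/")  # critical modules (paths illustrative)

def changed_files(base="origin/main"):
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

def main():
    touched = [f for f in changed_files() if f.startswith(PROTECTED)]
    pr_body = open("pr_body.txt").read()  # exported by the CI job (mechanism illustrative)
    if touched and not re.search(r"(?im)^behavior impact:", pr_body):
        print("Protected modules changed without a 'Behavior impact:' note:", touched)
        sys.exit(1)

if __name__ == "__main__":
    main()
```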
Step 4: Modernize runtime and dependencies incrementally
Teams often try to upgrade the Python version, the framework, and every library at once. That makes root-cause analysis painful. A safer sequence:
- Upgrade runtime first with behavior baseline unchanged.
- Update one dependency family at a time.
- Run contract replay after each dependency wave.
- Promote only when drift is understood and accepted.
Small controlled upgrades beat one heroic migration every time.
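One way to keep the baseline fixed while the runtime moves is a session runner such as nox, running the same replay and invariant suites on both the current and the target interpreter:
```python
import nox

@nox.session(python=["3.9", "3.13"])  # current and target runtimes (versions illustrative)
def replay(session):
    # Identical dependencies and identical behavior suites on every interpreter,
    # so any drift is attributable to the runtime alone.
    session.install("-r", "requirements.txt")
    session.install("pytest", "hypothesis")
    session.run("pytest", "tests/test_replay.py", "tests/test_invariants.py")
```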
Step 5: Release behind dual-run comparison
For high-impact services, run the old and new implementations in parallel for a period, compare outputs, and only switch write authority when the mismatch rate is below an agreed threshold.
This dual-run pattern has become standard in 2026 for reviving business-critical Python systems with minimal risk.
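In code, the pattern can be as small as a wrapper that serves the legacy result, runs the new implementation in shadow, and logs any drift. A minimal sketch (function names illustrative):
```python
import logging
from decimal import Decimal

logger = logging.getLogger("dual_run")

def calculate_total_dual(legacy_fn, new_fn, *, tolerance=Decimal("0.01"), **inputs):
    """Serve the legacy answer; shadow-run the new code and record mismatches."""
    legacy_total = legacy_fn(**inputs)
    try:
        new_total = new_fn(**inputs)
    except Exception:
        # The shadow path must never take down the live path.
        logger.exception("shadow implementation raised for %s", inputs)
        return legacy_total
    if abs(new_total - legacy_total) > tolerance:
        logger.warning("dual-run mismatch inputs=%s legacy=%s new=%s",
                       inputs, legacy_total, new_total)
    return legacy_total  # legacy keeps write authority until the mismatch rate clears the bar
```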
Troubleshooting when modernized Python code passes tests but fails trust
Symptom: tiny but consistent output drift
Check decimal context, rounding order, timezone assumptions, and locale/currency formatting behavior. These are common silent regression sources.
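Rounding order alone can produce exactly this kind of drift. A self-contained example:
```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
lines = [Decimal("0.005")] * 3

# Legacy pattern: round each line item, then sum.
per_line = sum(x.quantize(CENT, rounding=ROUND_HALF_UP) for x in lines)

# "Clean" refactor: sum everything, round once at the end.
once = sum(lines).quantize(CENT, rounding=ROUND_HALF_UP)

print(per_line, once)  # 0.03 vs 0.02: same inputs, different invoice totals
```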
Symptom: only one region or tenant is wrong
Inspect country-specific rules, tax tables, and fallback defaults. Legacy code often encoded region logic in unexpected branches.
Symptom: replay tests flaky in CI
Look for nondeterministic dependencies: unordered dict iteration assumptions, current-time reads, random seeds, or external API mocks that vary.
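Pinning the usual suspects in one autouse fixture often stabilizes the suite; freezegun is one option for current-time reads (the date and seed are arbitrary):
```python
import random

import pytest
from freezegun import freeze_time  # one option for pinning "now"

@pytest.fixture(autouse=True)
def deterministic_environment():
    random.seed(1337)                # pin random-dependent code paths
    with freeze_time("2026-01-15"):  # every test sees the same clock
        yield
```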
Symptom: AI-generated refactor keeps breaking edge cases
Tighten prompt scope and enforce protected files in code owners. Ask for minimal diff patches and reject stylistic rewrites in critical paths.
Symptom: migration slowed by too many mismatch cases
Cluster mismatches by root cause and decide explicitly: preserve legacy behavior, or intentionally change it with documented business signoff.
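A coarse signature over the `compare_totals` mismatch records is often enough to surface the clusters, for example sign plus magnitude bucket:
```python
from collections import defaultdict
from decimal import Decimal

def cluster_mismatches(mismatches):
    # Group drift records by (sign, magnitude bucket); cases in one cluster
    # tend to share one root cause, such as a single rounding change.
    clusters = defaultdict(list)
    for m in mismatches:
        delta = Decimal(m["delta"])
        sign = "+" if delta >= 0 else "-"
        bucket = "small" if abs(delta) < Decimal("0.10") else "large"
        clusters[(sign, bucket)].append(m["case_id"])
    return {key: len(case_ids) for key, case_ids in clusters.items()}
```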
FAQ
Should we preserve legacy behavior even if it looks wrong?
Not always. But treat behavior changes as product decisions, not incidental refactor side effects. Document and approve them explicitly.
How big should a replay dataset be?
Enough to cover the real input distribution and its edge cases. Many teams start with 10k to 100k historical records for critical workflows, then sample continuously.
Can we rely only on unit tests if coverage is high?
No. High coverage can still miss business semantics. You need replay and invariant testing for confidence.
What is the best first modernization step?
Introduce behavior baseline tests before major refactors. That single move reduces migration risk more than any linter upgrade.
How long should dual-run last?
Until mismatch rates are below agreed thresholds across normal and peak traffic windows, usually days to a few weeks depending on domain risk.
Actionable takeaways for your next sprint
- Create a historical replay dataset and fail CI when output drift exceeds explicit tolerance.
- Add domain invariants with property-based tests for high-risk business logic.
- Enforce AI refactor boundaries with small PRs and mandatory behavior-impact notes.
- Use dual-run comparison in staging or shadow production before full cutover.