A release that “worked” until users touched edge cases. A subscription platform launched a new account lifecycle flow: trial, upgrade, pause, resume, cancel, grace period. The rollout looked healthy. API error rates were low, latency stayed in budget, and deploy…
Author: Ankur Sharma
-

The SSH Key That Outlived the Contractor: A 2026 Playbook for OpenSSH User Certificates on AWS
Replace long-lived SSH keys with short-lived OpenSSH user certificates on AWS. Learn server config, issuance flow, and safe rollout with troubleshooting tips.
-
The Agent State Meltdown: A 2026 AI/ML Production Playbook with Statecharts, Provider Fallbacks, and Policy-Safe Execution
A Friday incident that started with one “simple” fallback. A team launched a customer-support AI assistant that used one primary model provider and one fallback. In staging, it was smooth. In production, a short provider slowdown triggered fallback logic. Then…
-

The Preflight Tax Nobody Budgeted: Fast, Safe CORS for Production APIs in 2026
Practical CORS preflight optimization for production APIs: reduce latency, set Access-Control-Allow-Origin correctly, protect credentials, and debug fast.
-
The Ghost Setting in Production: A WordPress Engineering Playbook for Deterministic Config, Safer Auth Flows, and Reproducible Releases
A small settings change that cost a full weekend. A WordPress team launched a membership feature on Friday evening. Checkout worked, login looked fine, and monitoring stayed mostly green. By Saturday morning, paid users in one region were getting logged…
-

The Agent That Opened the Wrong Door: A 2026 Playbook for Safe AI Agent Tool Calling
A practical 2026 guide to AI agent tool calling: strict schemas, approval gates, and prompt-injection defenses that prevent costly real-world mistakes.
-
The Dataset Was Correct, the Trust Was Missing: A 2026 SQL Playbook for Verifiable Data Lineage and Tamper-Evident Analytics
A quick story from a board prep that went sideways. A fintech analytics team had done everything “right” before a quarterly review. Fresh models, passing tests, green pipelines, no failed jobs. But 40 minutes before the meeting, legal asked a…
-

The Model Was Fine, Our Time Travel Was Wrong: A 2026 Playbook for Point-in-Time Joins and Leakage-Proof Features
Point-in-time joins done right: stop data leakage in machine learning with dbt snapshots, feature freshness SLOs, and reproducible training data pipelines.
-
The Green Dashboard, Broken Journey: A 2026 Node.js Systems Playbook for Engineering Real Reliability
A quick story from a release that looked perfect. A subscription platform shipped a major billing refactor on a Tuesday night. The team had done everything “right” on paper: tests passed, CPU stayed low, error rates looked normal, and all…
-

The 90-Second Java Pod Restart: A 2026 Runbook for CDS Archives, Startup Telemetry, and Safer JVM Flags
Java startup performance in Kubernetes: a practical runbook for CDS archives, startup telemetry, and safer JVM flags to reduce restart-to-ready delays.
-
The Refactor That Passed Tests but Broke Trust: Python Engineering for Durable Systems in 2026
A release that looked clean and still hurt users. A payments team I worked with had a proud Friday moment. They cleaned up an old Python service, added type hints, swapped in modern libraries, and cut 1,200 lines of legacy…
-

The Vulnerability Report You Never Received: security.txt for WordPress That Actually Works
Learn how to implement RFC 9116 security.txt for WordPress with Nginx, a clear disclosure policy, expiry monitoring, and practical triage workflows for teams.