At 2:07 AM, our checkout API looked healthy on the dashboard, but payment confirmations were arriving twice, sometimes three times. Nothing was technically “down.” The real failure was more subtle: a retry policy that behaved perfectly in isolation and badly in combination. A short upstream slowdown triggered retries, retries amplified traffic, traffic increased latency, and latency triggered more retries. Classic feedback loop.
That incident is why I now treat a .NET 9 HttpClient resilience pipeline as an architecture decision, not a code snippet. If you are building production APIs that call other services, this post is a practical runbook for composing retries, timeouts, and idempotency without creating your own retry storm.
Your first design choice is not “how many retries?”
Before tuning numbers, decide what failure mode you are willing to absorb:
- Transient blips (short network or dependency hiccups): retries can help.
- Overload (queues growing, p95 climbing): retries often make it worse.
- Unknown completion (timeout but downstream may have succeeded): retries require idempotency.
Microsoft’s latest resilience guidance for Microsoft.Extensions.Http.Resilience emphasizes composable handlers and warns against blindly stacking multiple handlers on the same client. That aligns with what we see in real outages: chaos usually comes from policy interaction, not a missing single policy.
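If you want the documented composition rather than hand-rolled stacking, the standard handler is the place to start. A minimal sketch; the "catalog" client name and the tuned values are illustrative:

```csharp
// Program.cs (ASP.NET Core web SDK). Requires the
// Microsoft.Extensions.Http.Resilience NuGet package.
var builder = WebApplication.CreateBuilder(args);

// One named client, one composed pipeline: rate limiter, total timeout,
// retry, circuit breaker, and attempt timeout in the documented order.
builder.Services
    .AddHttpClient("catalog", c => c.BaseAddress = new Uri("https://catalog.internal"))
    .AddStandardResilienceHandler(options =>
    {
        // Tune the defaults instead of stacking extra handlers on the same client.
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(10);
        options.Retry.MaxRetryAttempts = 2;
    });
```

Customize beyond this only when measurements show the defaults do not fit the dependency.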
A sane baseline for outbound calls in .NET 9
For most service-to-service calls, start with one typed client and one composed resilience configuration. Keep it explicit, measured, and boring.
```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly; // DelayBackoffType
using System.Net;

var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddHttpClient<InventoryClient>(client =>
    {
        client.BaseAddress = new Uri(builder.Configuration["Inventory:BaseUrl"]!);
        client.Timeout = Timeout.InfiniteTimeSpan; // let the resilience pipeline own the timeout
    })
    .AddResilienceHandler("inventory-pipeline", static pipeline =>
    {
        // Added first, so this is the total budget across all attempts.
        pipeline.AddTimeout(TimeSpan.FromSeconds(2));
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 2,
            Delay = TimeSpan.FromMilliseconds(200),
            UseJitter = true,
            BackoffType = DelayBackoffType.Exponential,
            // Retry only known-transient outcomes, not every failure.
            ShouldHandle = args => ValueTask.FromResult(
                args.Outcome.Exception is HttpRequestException ||
                args.Outcome.Result?.StatusCode is HttpStatusCode.RequestTimeout or
                    HttpStatusCode.TooManyRequests or
                    HttpStatusCode.BadGateway or
                    HttpStatusCode.ServiceUnavailable or
                    HttpStatusCode.GatewayTimeout)
        });
        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            SamplingDuration = TimeSpan.FromSeconds(30),
            FailureRatio = 0.2,
            MinimumThroughput = 20,
            BreakDuration = TimeSpan.FromSeconds(15)
        });
    });

var app = builder.Build();
app.Run();

public sealed class InventoryClient(HttpClient http)
{
    public Task<HttpResponseMessage> ReserveAsync(HttpContent content, CancellationToken ct) =>
        http.PostAsync("/v1/reservations", content, ct);
}
```
Three details here matter more than most teams think:
- Client timeout is disabled so you avoid competing timeout layers.
- Retries are selective for known transient conditions, not all failures.
- Jitter is enabled to avoid synchronized retry spikes.
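To make the jitter point concrete, here is a standalone sketch of exponential backoff with full jitter; the formula is illustrative and not Polly's exact internal implementation:

```csharp
using System;

// Full jitter: each delay is drawn uniformly from [0, baseDelay * 2^attempt],
// so clients that failed at the same instant retry at different times
// instead of hammering the dependency in synchronized waves.
static TimeSpan BackoffWithJitter(int attempt, TimeSpan baseDelay, Random rng)
{
    var capMs = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
    return TimeSpan.FromMilliseconds(rng.NextDouble() * capMs);
}

var rng = new Random();
for (var attempt = 0; attempt < 3; attempt++)
{
    var delay = BackoffWithJitter(attempt, TimeSpan.FromMilliseconds(200), rng);
    Console.WriteLine($"attempt {attempt}: wait {delay.TotalMilliseconds:F0} ms");
}
```

Without the jitter term, every client that saw the same failure retries on the same schedule, which recreates the spike you were trying to absorb.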
If you want deeper background on load behavior, this complements our runbook on partitioned rate limiting and backpressure in ASP.NET Core 9.
The non-negotiable pair: retries + idempotency
Retries on non-idempotent operations are a business risk, not just a technical risk. If a POST can charge a card, create an order, or issue a license, you need an idempotency contract.
The current IETF draft for Idempotency-Key formalizes a pattern many payment APIs already use: client sends a unique key, server stores request fingerprint + result, and repeated requests with the same key replay the first result instead of re-executing side effects.
Below is a practical minimal API pattern that has worked well for payment-like flows:
```csharp
// Requires: using Microsoft.EntityFrameworkCore; using System.Security.Cryptography;
// using System.Text; using System.Text.Json;
app.MapPost("/api/checkout/confirm", async (
    HttpContext ctx,
    CheckoutRequest request,
    AppDb db,
    PaymentOrchestrator orchestrator,
    CancellationToken ct) =>
{
    var idemKey = ctx.Request.Headers["Idempotency-Key"].ToString();
    if (string.IsNullOrWhiteSpace(idemKey) || idemKey.Length > 120)
        return Results.BadRequest(new { error = "Missing or invalid Idempotency-Key" });

    var routeScope = $"checkout:confirm:{request.CartId}";

    // Fingerprint protects against key reuse with a different payload
    var payloadHash = Convert.ToHexString(
        SHA256.HashData(Encoding.UTF8.GetBytes(JsonSerializer.Serialize(request))));

    var existing = await db.IdempotencyRecords
        .Where(x => x.Scope == routeScope && x.Key == idemKey)
        .SingleOrDefaultAsync(ct);

    if (existing is not null)
    {
        // Same key, different payload: reject instead of replaying the wrong result
        if (existing.PayloadHash != payloadHash)
            return Results.Conflict(new { error = "Idempotency-Key reused with a different payload" });

        // Replay the stored response verbatim; it is already serialized JSON
        return Results.Content(existing.ResponseJson, "application/json",
            statusCode: existing.StatusCode);
    }

    var result = await orchestrator.ConfirmPaymentAsync(request, ct);

    // A unique index on (Scope, Key) is what makes concurrent duplicates safe
    db.IdempotencyRecords.Add(new IdempotencyRecord
    {
        Scope = routeScope,
        Key = idemKey,
        PayloadHash = payloadHash,
        StatusCode = result.StatusCode,
        ResponseJson = JsonSerializer.Serialize(result.Body),
        ExpiresAtUtc = DateTime.UtcNow.AddHours(24)
    });
    await db.SaveChangesAsync(ct);

    return Results.Json(result.Body, statusCode: result.StatusCode);
});
```
This is the practical center of an idempotency-key implementation for a .NET API: bounded key length, request fingerprinting, scoped uniqueness, stored response replay, and TTL cleanup.
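The ExpiresAtUtc column only helps if something actually deletes expired rows. A minimal cleanup sketch, assuming the AppDb and IdempotencyRecord types from the handler above; the one-hour sweep interval is an arbitrary choice:

```csharp
// Requires: using Microsoft.EntityFrameworkCore; using Microsoft.Extensions.Hosting;
// Periodically purges expired idempotency records so the table stays bounded.
public sealed class IdempotencyCleanupService(IServiceScopeFactory scopeFactory) : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(TimeSpan.FromHours(1));
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            using var scope = scopeFactory.CreateScope();
            var db = scope.ServiceProvider.GetRequiredService<AppDb>();

            // ExecuteDeleteAsync issues a single DELETE without loading entities (EF Core 7+)
            await db.IdempotencyRecords
                .Where(x => x.ExpiresAtUtc < DateTime.UtcNow)
                .ExecuteDeleteAsync(stoppingToken);
        }
    }
}

// Registration: builder.Services.AddHostedService<IdempotencyCleanupService>();
```

A scheduled database job works just as well; the point is that retention is owned explicitly, not left to grow forever.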
Tradeoffs teams usually discover late
- Long timeouts reduce error count, but increase contention. You can “improve success rate” while quietly exhausting worker threads and DB pools.
- Aggressive retries improve p50 but hurt p99. Fast users get faster, slow users get slower, and your dependency gets hammered.
- Hedging reduces tail latency but raises backend cost. Use only where duplicate in-flight work is acceptable.
- Idempotency storage adds write load. You are buying correctness with extra persistence; size TTL and indexes early.
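For the hedging tradeoff above, a sketch in the same pipeline style as the baseline, for a read-only dependency where duplicate in-flight work is acceptable; the "search" client name and the 500 ms delay are illustrative:

```csharp
builder.Services
    .AddHttpClient("search", c => c.BaseAddress = new Uri("https://search.internal"))
    .AddResilienceHandler("search-hedging", static pipeline =>
    {
        pipeline.AddHedging(new HttpHedgingStrategyOptions
        {
            // Fire one extra attempt if the first has not answered within 500 ms.
            // Only safe because duplicate in-flight reads are acceptable here.
            MaxHedgedAttempts = 1,
            Delay = TimeSpan.FromMilliseconds(500)
        });
    });
```

Every hedged attempt is real backend load, so budget for it the same way you budget for retries.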
Our broader post on deadline propagation and safe degradation pairs well with this when you need end-to-end timeout budgeting, not just HttpClient policy tuning.
A rollout sequence that avoids surprise regressions
Most resilience incidents come from policy rollout, not policy theory. A safe rollout sequence in .NET 9 looks like this:
- Baseline first: ship one week of latency/error telemetry without retries enabled. Capture p50/p95/p99 and failure codes by dependency.
- Enable timeout only: start with a clear timeout budget and no retries. Validate that cancellation flows through your call chain.
- Add low retry count: one or two attempts max, with jitter. Watch for RPS amplification and queue growth.
- Add circuit breaker: keep break windows short initially. Confirm your fallback path returns useful, bounded failure responses.
- Gate idempotent writes: require an Idempotency-Key for write endpoints before allowing retry on those paths.
Two metrics tell you quickly whether your rollout is healthy:
- Retry amplification ratio: total outbound requests / original inbound requests. If this climbs fast during a dependency incident, your retry policy is too eager.
- Meaningful success ratio: successful business outcomes, not just HTTP 2xx. This catches “successful duplicates” that still hurt users.
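The first metric reduces to two counters. A minimal sketch; wire the inputs to your real request telemetry, the numbers below are made up:

```csharp
using System;

// Retry amplification: outbound requests generated per inbound request.
// Near 1.0 is healthy; a fast climb during a dependency incident means the
// retry policy is multiplying load on a service that is already struggling.
static double RetryAmplification(long outboundRequests, long inboundRequests) =>
    inboundRequests == 0 ? 0 : (double)outboundRequests / inboundRequests;

// 1,000 inbound requests fanned out into 2,600 outbound calls: 1.6 extra
// attempts per request on average, a sign the policy is too eager.
Console.WriteLine(RetryAmplification(2_600, 1_000)); // 2.6
```

Alert on the trend, not the absolute value: the dangerous signal is this ratio climbing while the dependency's error rate is also climbing.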
This sequence sounds conservative, but conservative is exactly what keeps outages small. Fast rollout is useful only if you can also fast rollback with confidence.
Troubleshooting: when your resilience setup behaves badly
1) Symptom: CPU and outbound RPS spike during a dependency slowdown
Likely cause: retries are too broad (for example retrying all 5xx, including persistent failures).
Fix: narrow retry conditions, reduce attempts, enable jitter, and confirm circuit breaker opens quickly enough.
2) Symptom: duplicate orders despite idempotency key
Likely cause: key is checked after side effects, or key scope is too broad/narrow.
Fix: perform idempotency lookup before side effects, include route/business scope, enforce payload hash mismatch detection.
3) Symptom: random timeout exceptions even when dependency is healthy
Likely cause: layered timeouts (HttpClient timeout + resilience timeout + reverse proxy timeout) racing each other.
Fix: define one owner per hop, document timeout budget per dependency, and pass cancellation tokens consistently.
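A sketch of the "one owner per hop" rule: derive this hop's deadline from the caller's token rather than adding another independent timeout layer. CallWithBudgetAsync and the budget parameter are illustrative names:

```csharp
// One timeout owner for this hop: a linked token source that honors both the
// caller's cancellation and this dependency's remaining time budget.
static async Task<HttpResponseMessage> CallWithBudgetAsync(
    HttpClient http, HttpRequestMessage request, TimeSpan budget, CancellationToken callerToken)
{
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
    cts.CancelAfter(budget); // the only timeout this hop introduces

    // Everything downstream observes one token, so there is a single deadline
    // per hop instead of several timeouts racing each other.
    return await http.SendAsync(request, cts.Token);
}
```

Whichever fires first, caller cancellation or local budget, produces one consistent cancellation signal instead of a race between layers.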
4) Symptom: no useful signals in logs during incidents
Likely cause: retries and circuit events are happening, but not tagged with dependency name, attempt number, and idempotency key hash.
Fix: add structured fields to logs and traces so failure paths are queryable in minutes, not hours.
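Polly v8 retry options expose an OnRetry callback where those fields can be attached. A sketch assuming an ILogger named logger is in scope; a hash of the idempotency key, never the raw key, can be added the same way:

```csharp
pipeline.AddRetry(new HttpRetryStrategyOptions
{
    MaxRetryAttempts = 2,
    OnRetry = args =>
    {
        // Structured fields make failure paths queryable by dependency,
        // attempt number, and outcome during an incident.
        logger.LogWarning(
            "Retry {Attempt} for {Dependency} after {StatusCode}",
            args.AttemptNumber,
            "inventory",
            args.Outcome.Result?.StatusCode);
        return ValueTask.CompletedTask;
    }
});
```

Circuit breaker state changes deserve the same treatment via their OnOpened and OnClosed callbacks.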
For operations that are scheduled and replayed by design, see the related reliability pattern in this idempotent scheduler runbook. And if your background workers are the choke point, this .NET worker backpressure guide is a useful companion.
FAQ
Should I always use AddStandardResilienceHandler?
Not always. It is a strong default, but high-volume or business-critical dependencies often need custom handling rules, timeout budgets, and breaker thresholds. Start standard, then customize with measurements.
How long should I keep idempotency records?
Tie TTL to your real retry window and business risk. Many payment-like flows use 24 hours, but shorter windows can work for low-risk operations. The key is consistency between client retry behavior and server retention.
Can I retry POST safely if I use idempotency keys?
Usually yes, if your server enforces key uniqueness within a clear scope and replays the first response for duplicates. Without that server contract, retries on POST remain unsafe.
Actionable takeaways you can apply this week
- Adopt one named or typed client per dependency and define one explicit resilience pipeline per client.
- Implement and document IHttpClientFactory best practices: scoped configuration, single timeout owner, and dependency-specific policy values.
- Audit every non-idempotent POST path and add an Idempotency-Key contract before increasing retry aggressiveness.
- Use conservative defaults for timeouts, retries, backoff, and jitter, then tune with real latency/error telemetry, not intuition.
If your APIs are already “mostly reliable,” this is the upgrade path that keeps them reliable under stress, not just during calm traffic. Reliability is not one retry policy. It is composition, boundaries, and evidence.
