When Users Time Out but Services Keep Working: A .NET 9 gRPC Deadline Propagation Runbook

At 9:07 PM on a Thursday deploy, our order service dashboard looked almost healthy. CPU was fine, pods were green, and p95 latency had only nudged up. But support tickets were climbing every minute, all with the same line: “Payment spinner never finishes.”

The root cause was not a database lock or a dead node. It was time budget drift across gRPC calls. The edge API had a 2.5-second SLA, but downstream calls quietly used default no-deadline behavior. One request fan-out became five calls with retries, and each branch believed it had unlimited time. Users timed out first, then our services kept working on requests nobody cared about anymore.

This guide is the runbook I wish we had that night: practical deadline propagation, cancellation wiring, and retry guardrails in ASP.NET Core gRPC, without hand-wavy “just add retries” advice.

The reliability gap most .NET gRPC teams underestimate

In .NET gRPC, there is no default deadline. If you do nothing, calls can run as long as the network and server allow. That is useful for a few streaming cases, but dangerous for request-response APIs with user-facing latency budgets.
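
Because nothing sets a deadline for you, each call site has to opt in. The following is a minimal sketch of one way to do that with `CallOptions`; the helper name `CallWithBudgetAsync` and the fallback behavior are illustrative assumptions, not part of any library.

```csharp
using System;
using System.Threading.Tasks;
using Grpc.Core;

public static class DeadlineCalls
{
    // Hypothetical helper: runs a unary call with an absolute UTC deadline
    // and returns a fallback value if that deadline expires.
    public static async Task<T> CallWithBudgetAsync<T>(
        Func<CallOptions, AsyncUnaryCall<T>> call, TimeSpan budget, T fallback)
    {
        var options = new CallOptions(deadline: DateTime.UtcNow.Add(budget));
        using var pending = call(options);
        try
        {
            return await pending;
        }
        catch (RpcException ex) when (ex.StatusCode == StatusCode.DeadlineExceeded)
        {
            // Fail soft only where the product allows a degraded answer.
            return fallback;
        }
    }
}
```

A call site might look like `CallWithBudgetAsync(o => inventory.CheckStockAsync(request, o), TimeSpan.FromMilliseconds(400), fallbackReply)`, where `inventory` is a generated client and `fallbackReply` is whatever degraded response your endpoint can tolerate.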

In fan-out paths, missing deadlines cause three compounding failures:

User timeout mismatch: frontend gives up before backend work stops.
Retry amplification: retries spend extra time on requests that are already irrelevant.
Resource leakage: server keeps executing until your code observes cancellation.

If this sounds familiar, you might also like our reliability patterns in composable .NET HTTP resilience and the backpressure lessons from .NET worker services.

Step 1: Treat deadline as a product contract, not a transport detail

Pick one end-to-end budget per endpoint class, then allocate sub-budgets only when needed. Example policy:

Interactive checkout read path: 1500 ms total
Fraud scoring side call: max 500 ms within that total
Tax/shipping lookup: max 400 ms each, fail soft when allowed

Tradeoff: tighter deadlines reduce tail latency and protect capacity, but can increase partial results during dependency blips. That tradeoff is usually worth it for user-facing flows, as long as your fallback behavior is explicit.
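
One way to express a sub-budget such as "max 500 ms within that total" is to take the earlier of the parent deadline and a local cap. This is a sketch under that assumption; `DeadlineBudget.Cap` is a hypothetical helper name.

```csharp
using System;

public static class DeadlineBudget
{
    // Returns the earlier of the parent's absolute deadline and (now + cap).
    // A parent of DateTime.MaxValue (no deadline) yields now + cap.
    public static DateTime Cap(DateTime parentDeadlineUtc, TimeSpan cap, DateTime nowUtc)
    {
        var capped = nowUtc.Add(cap);
        return capped < parentDeadlineUtc ? capped : parentDeadlineUtc;
    }
}
```

Inside a handler, the fraud scoring call from the policy above could then pass `deadline: DeadlineBudget.Cap(context.Deadline, TimeSpan.FromMilliseconds(500), DateTime.UtcNow)`, so the side call never gets more than 500 ms and never more than what remains of the end-to-end budget.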

Step 2: Implement propagation once with client factory

Manually passing deadline: context.Deadline in every child call works, but it eventually gets missed. In ASP.NET Core, use gRPC client factory with EnableCallContextPropagation so deadlines and cancellation travel automatically.

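using Grpc.Net.Client.Configuration;
// EnableCallContextPropagation ships in the Grpc.AspNetCore.Server.ClientFactory package.
// The extension method lives in Microsoft.Extensions.DependencyInjection, which
// ASP.NET Core projects import implicitly, so no extra using is needed here.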

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddGrpc();

builder.Services
    .AddGrpcClient<Inventory.InventoryClient>(o =>
    {
        o.Address = new Uri(builder.Configuration["Grpc:InventoryUrl"]!);
    })
    .EnableCallContextPropagation()
    .ConfigureChannel(o =>
    {
        o.ServiceConfig = new ServiceConfig
        {
            MethodConfigs =
            {
                new MethodConfig
                {
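                    // Applies to every method on this client. Scope it to named
                    // methods if some calls are not safe to repeat (see Step 4).
                    Names = { MethodName.Default },
                    RetryPolicy = new RetryPolicy
                    {
                        MaxAttempts = 4, // includes the original attempt
                        InitialBackoff = TimeSpan.FromMilliseconds(100),
                        MaxBackoff = TimeSpan.FromMilliseconds(800),
                        BackoffMultiplier = 2,
                        // Retry only transient transport failures.
                        RetryableStatusCodes = { Grpc.Core.StatusCode.Unavailable }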
                    }
                }
            }
        };
    });

Important nuance: retries are still bounded by the original deadline. That is good. It prevents “resilience” from becoming slow failure.
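
To see how much of a deadline the policy above can consume, it helps to estimate the worst-case time spent waiting between attempts. The sketch below assumes each delay is at most the capped exponential ceiling (`InitialBackoff * BackoffMultiplier^(n-1)`, limited by `MaxBackoff`); jitter details vary by implementation, so treat the result as a rough bound. `RetryBudget` is a hypothetical helper.

```csharp
using System;

public static class RetryBudget
{
    // Sums the backoff ceilings for every retry after the original attempt.
    public static TimeSpan MaxTotalBackoff(
        int maxAttempts, TimeSpan initial, TimeSpan max, double multiplier)
    {
        var total = TimeSpan.Zero;
        var ceiling = initial;
        for (var retry = 1; retry < maxAttempts; retry++)
        {
            total += ceiling < max ? ceiling : max;
            ceiling = TimeSpan.FromMilliseconds(ceiling.TotalMilliseconds * multiplier);
        }
        return total;
    }
}
```

For the policy shown (4 attempts, 100 ms initial, 800 ms max, multiplier 2) the ceilings are 100 + 200 + 400 = 700 ms. Against a 1500 ms endpoint budget, that leaves roughly 800 ms for the four attempts themselves, or about 200 ms each, which is a useful sanity check before shipping a retry policy.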

Step 3: Make cancellation real on the server side

A canceled call does not magically stop your code. You need to flow the gRPC cancellation token into every async dependency that supports it.

public sealed class CheckoutAggregatorService : CheckoutAggregator.CheckoutAggregatorBase
{
    private readonly Inventory.InventoryClient _inventory;
    private readonly Pricing.PricingClient _pricing;

    public CheckoutAggregatorService(
        Inventory.InventoryClient inventory,
        Pricing.PricingClient pricing)
    {
        _inventory = inventory;
        _pricing = pricing;
    }

    public override async Task<QuoteReply> GetQuote(QuoteRequest request, ServerCallContext context)
    {
        // Child calls inherit deadline/cancellation via EnableCallContextPropagation.
        var inventoryTask = _inventory.CheckStockAsync(new StockRequest { Sku = request.Sku })
                                      .ResponseAsync;

        var pricingTask = _pricing.CalculateAsync(new PriceRequest
        {
            Sku = request.Sku,
            Quantity = request.Quantity
        }).ResponseAsync;

        await Task.WhenAll(inventoryTask, pricingTask).WaitAsync(context.CancellationToken);

        return new QuoteReply
        {
            InStock = inventoryTask.Result.InStock,
            UnitPrice = pricingTask.Result.UnitPrice
        };
    }
}

When the deadline expires or the caller cancels, context.CancellationToken is signaled, so any work that observes it can stop promptly instead of burning CPU on abandoned requests.
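
The same principle applies to non-gRPC dependencies. The sketch below uses `Task.Delay` as a stand-in for a slow database or HTTP call, and a `CancellationTokenSource` with a timeout as a stand-in for `context.CancellationToken`; both helper names are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellableWork
{
    // Stand-in for a slow dependency. Passing the token lets the wait end
    // as soon as cancellation is requested instead of running to completion.
    public static async Task<string> LoadAsync(TimeSpan duration, CancellationToken ct)
    {
        await Task.Delay(duration, ct);
        return "done";
    }

    // Simulates a handler whose token is cancelled after `budget`.
    public static async Task<string> RunWithBudgetAsync(TimeSpan work, TimeSpan budget)
    {
        using var cts = new CancellationTokenSource(budget);
        try
        {
            return await LoadAsync(work, cts.Token);
        }
        catch (OperationCanceledException)
        {
            return "cancelled";
        }
    }
}
```

If `LoadAsync` ignored the token, the five-second wait in a call like `RunWithBudgetAsync(TimeSpan.FromSeconds(5), TimeSpan.FromMilliseconds(50))` would run in full even though nobody is waiting for the result.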

Step 4: Separate retriable from non-idempotent operations

Automatic retries in gRPC are excellent for transient Unavailable scenarios, but only when repeated attempts are safe. Do not hedge or aggressively retry operations that can create side effects (for example, charging a card) unless you also enforce idempotency keys server-side.

This is the same discipline we discussed in our replay-proof webhook verification runbook: retries must pair with deduplication semantics, not hope.
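
As an illustration of server-side idempotency keys, here is a minimal in-memory sketch. It is an assumption-laden toy: a real service would persist keys in a durable store, expire them, and decide how to treat failed attempts (this version caches a failed result too). `IdempotentExecutor` is a hypothetical type.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class IdempotentExecutor<TResult>
{
    // Lazy ensures the operation body runs once per key even if two
    // retries with the same key arrive concurrently.
    private readonly ConcurrentDictionary<string, Lazy<Task<TResult>>> _results = new();

    public Task<TResult> RunOnceAsync(string idempotencyKey, Func<Task<TResult>> operation) =>
        _results.GetOrAdd(idempotencyKey, _ => new Lazy<Task<TResult>>(operation)).Value;
}
```

A retried "charge card" request that carries the same key gets the first attempt's result back instead of charging twice.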

Observability signals that catch deadline drift early

Three signals have high diagnostic value:

DEADLINE_EXCEEDED rate by method (not just total errors).
grpc-previous-rpc-attempts metadata distribution to see retry amplification.
In-flight request age histogram to find requests that outlive client patience.
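
The request-age signal can be approximated with a server interceptor. This is a sketch, assuming unary calls only and using `System.Diagnostics.Metrics`; the meter and instrument names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;
using System.Threading.Tasks;
using Grpc.Core;
using Grpc.Core.Interceptors;

public sealed class DeadlineMetricsInterceptor : Interceptor
{
    // Hypothetical meter and instrument names.
    private static readonly Meter Meter = new("Checkout.Grpc");
    private static readonly Histogram<double> RequestAgeMs =
        Meter.CreateHistogram<double>("grpc.server.request_age", unit: "ms");
    private static readonly Counter<long> OutlivedDeadline =
        Meter.CreateCounter<long>("grpc.server.outlived_deadline");

    public override async Task<TResponse> UnaryServerHandler<TRequest, TResponse>(
        TRequest request,
        ServerCallContext context,
        UnaryServerMethod<TRequest, TResponse> continuation)
    {
        var started = Stopwatch.GetTimestamp();
        try
        {
            return await continuation(request, context);
        }
        finally
        {
            var method = new KeyValuePair<string, object?>("grpc.method", context.Method);
            RequestAgeMs.Record(Stopwatch.GetElapsedTime(started).TotalMilliseconds, method);

            // context.Deadline is DateTime.MaxValue when the caller set none,
            // so this only counts handlers that ran past a real deadline.
            if (DateTime.UtcNow > context.Deadline)
            {
                OutlivedDeadline.Add(1, method);
            }
        }
    }
}
```

It can be registered with `builder.Services.AddGrpc(o => o.Interceptors.Add<DeadlineMetricsInterceptor>());`. A non-zero "outlived deadline" count for a method is a strong hint that some dependency in that handler is not observing cancellation.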

If your platform is mixing HTTP and gRPC edges, compare this with your existing API throttling strategy from partitioned rate limiting in ASP.NET Core. Rate limits protect entry, deadlines protect journey length.

A 15-minute pre-release validation drill

Before each release, run one synthetic test where a downstream service intentionally sleeps beyond the caller budget. You are verifying behavior, not just uptime:

caller returns a controlled timeout, not a generic 500
child call receives cancellation and exits early
retry count stays within policy and does not outlive the deadline
metrics show DEADLINE_EXCEEDED on the right method name

# Example grpcurl probe with a tight deadline budget
# (adjust method and payload for your service)
grpcurl -d '{"sku":"SKU-123","quantity":1}' \
  -H 'x-debug-sleep-ms: 1200' \
  -max-time 1.0 \
  checkout.internal:443 checkout.CheckoutAggregator/GetQuote
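
The probe above relies on the downstream service honoring the `x-debug-sleep-ms` header. One possible shape for that hook is sketched below; it assumes you wire it in yourself and enable it only in staging, and `DebugSleep` is a hypothetical helper.

```csharp
using System;
using System.Threading.Tasks;
using Grpc.Core;

public static class DebugSleep
{
    public const string HeaderName = "x-debug-sleep-ms";

    // Missing, malformed, or non-positive values mean "no delay".
    public static TimeSpan ParseDelay(string? raw) =>
        int.TryParse(raw, out var ms) && ms > 0
            ? TimeSpan.FromMilliseconds(ms)
            : TimeSpan.Zero;

    // Call at the top of a handler in non-production environments only.
    // The delay observes the call's token, so a propagated deadline cuts it short.
    public static Task ApplyAsync(ServerCallContext context) =>
        Task.Delay(
            ParseDelay(context.RequestHeaders.GetValue(HeaderName)),
            context.CancellationToken);
}
```

Because the delay is cancellable, the drill also checks the second item in the list above: if the sleep runs its full 1200 ms after a 1-second caller budget, cancellation is not reaching the child.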

If this drill fails in staging, it will fail harder under real traffic. Fixing it before launch is usually cheaper than one evening of incident triage.

Troubleshooting: when deadline propagation still fails in production

1) Symptom: child services run long after caller timeout

Likely cause: cancellation token not passed to DB or HTTP calls.

Fix: wire context.CancellationToken through EF Core, Dapper, HttpClient, and any Task.Delay or WaitAsync calls.

2) Symptom: retries happen, but user latency explodes

Likely cause: broad retry policy with too many attempts for tight SLA.

Fix: lower MaxAttempts, shrink backoff window, and cap by realistic endpoint budget.

3) Symptom: propagation throws context-not-found errors in background jobs

Likely cause: client configured with EnableCallContextPropagation but used outside gRPC request context.

Fix: either use a separate client registration for non-request workloads, or enable SuppressContextNotFoundErrors intentionally and attach explicit deadlines there.
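
A sketch of the second option, registering a named client that tolerates a missing call context. The extension method name and the client name "background" are illustrative assumptions.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

public static class BackgroundGrpcClients
{
    // Registers a named gRPC client intended for hosted services and jobs.
    // With SuppressContextNotFoundErrors, calls made outside a gRPC request
    // proceed without a propagated deadline, so each call must set its own.
    public static IServiceCollection AddBackgroundGrpcClient<TClient>(
        this IServiceCollection services, Uri address)
        where TClient : class
    {
        services
            .AddGrpcClient<TClient>("background", o => o.Address = address)
            .EnableCallContextPropagation(o => o.SuppressContextNotFoundErrors = true);
        return services;
    }
}
```

A background job would then resolve the client through `GrpcClientFactory.CreateClient<TClient>("background")` and pass an explicit `deadline:` on every call, since there is no caller budget to inherit.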

4) Symptom: server streaming call never retries after first message

Likely cause: expected behavior. After response stream starts, call is considered committed.

Fix: implement stream re-establishment logic at application layer.
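
What re-establishment might look like at the application layer, as a sketch: it assumes the server can resume from an offset (here, the count of messages already received) and that the caller decides which exceptions are transient. `ResumableStream` and the `openFrom` delegate are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class ResumableStream
{
    public static async IAsyncEnumerable<T> ReadWithResumeAsync<T>(
        Func<long, IAsyncEnumerable<T>> openFrom,   // opens the stream at an offset
        Func<Exception, bool> isTransient,          // e.g. RpcException with Unavailable
        int maxReconnects,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        long received = 0;
        var reconnects = 0;
        while (true)
        {
            var enumerator = openFrom(received).GetAsyncEnumerator(ct);
            try
            {
                while (true)
                {
                    bool hasNext;
                    try
                    {
                        hasNext = await enumerator.MoveNextAsync();
                    }
                    catch (Exception ex) when (isTransient(ex) && reconnects < maxReconnects)
                    {
                        reconnects++;
                        break; // reopen from the last delivered offset
                    }
                    if (!hasNext) yield break;
                    received++;
                    yield return enumerator.Current;
                }
            }
            finally
            {
                await enumerator.DisposeAsync();
            }
        }
    }
}
```

With a generated client, `openFrom` could wrap a server-streaming call and its `ResponseStream.ReadAllAsync()`; the reconnect cap keeps this from turning into an unbounded retry loop.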

FAQ

Should every gRPC method have the same deadline?

No. Group methods by user impact and dependency depth. Read paths often need tighter budgets than admin or batch paths.

Is hedging better than retries for tail latency?

Sometimes, but hedging increases server load because it can execute the same RPC multiple times. Use it only for idempotent calls with strong capacity headroom.

Can I rely on retries alone for reliability?

No. Reliability here is a trio: bounded deadlines, correct cancellation propagation, and idempotent behavior for retried operations.

Actionable takeaways

Set explicit deadlines for all user-facing gRPC calls and never rely on defaults.
Use EnableCallContextPropagation to eliminate manual propagation gaps.
Pass cancellation tokens through every async dependency to stop abandoned work fast.
Limit retries to transient codes and keep them inside endpoint time budgets.
Track DEADLINE_EXCEEDED and retry-attempt metadata per method for early drift detection.

If your team already has “good uptime” but still gets “spinner never ends” complaints, this is usually the missing layer. Deadline propagation in .NET gRPC is less about theory and more about aligning system effort with user patience.