At 1:12 a.m., our order reconciliation worker looked healthy on the dashboard, but support had a different story. Orders were still marked “processing” 20 minutes after payment capture. CPU was low, memory was stable, and logs looked normal, until we rolled a deploy and saw it: the queue never fully drained before shutdown, then retries replayed stale work after restart.
That night forced us to stop treating background workers like “small web apps without controllers.” If your service ingests events, talks to flaky downstream APIs, and gets recycled by orchestrators, lifecycle behavior is your reliability model.
This guide is a practical playbook for .NET worker service graceful shutdown using bounded channels, explicit retry boundaries, and observability that actually helps during incidents. It is optimized for teams shipping on Kubernetes, ECS, or systemd, where restarts are normal and graceful handling is non-negotiable.
The real failure mode is usually coordination, not compute
In most postmortems, we find one of these patterns:
- Producers keep enqueueing while the host is stopping.
- Consumers are blocked on a slow dependency and never drain.
- Retries happen at multiple layers (SDK + app), multiplying latency.
- No trace context exists across dequeue, HTTP call, and final ack.
Microsoft’s hosted service guidance is clear that StopAsync has a bounded grace window and your service must react promptly to cancellation. If shutdown takes longer than your configured timeout, you are in forced-stop territory, and in-flight work is at risk.
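That grace window is configurable through HostOptions. A minimal sketch (the 25-second value is an assumption; pick something below your orchestrator's kill timeout, such as Kubernetes terminationGracePeriodSeconds, so the host, not the kernel, decides how the process ends):

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

// How long the host waits for hosted services to stop before giving up.
// Keep this below the orchestrator's own kill timeout (assumed 30s here).
builder.Services.Configure<HostOptions>(o =>
    o.ShutdownTimeout = TimeSpan.FromSeconds(25));
```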
A safer shape: bounded intake, single owner, explicit shutdown path
For most business workers, this architecture keeps behavior predictable:
- Bounded channel as the intake buffer (System.Threading.Channels backpressure).
- One producer boundary (HTTP endpoint, queue poller, or timer).
- One consumer loop that owns processing and retries.
- Host cancellation wired all the way to dependency calls.
The key tradeoff is straightforward: bounded channels can reject writes under pressure, but that is often better than silently growing memory and dying later. If work is truly must-not-drop, pair backpressure with durable upstream storage.
Code pattern 1: bounded channel + host-aware admission
using System.Threading.Channels;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
var builder = Host.CreateApplicationBuilder(args);

builder.Services.AddSingleton(Channel.CreateBounded<WorkItem>(
    new BoundedChannelOptions(capacity: 500)
    {
        SingleReader = true,
        SingleWriter = false,
        FullMode = BoundedChannelFullMode.Wait // backpressure, don't drop silently
    }));

// Register the HTTP client factory; the consumer resolves the named "payments" client.
builder.Services.AddHttpClient("payments");

builder.Services.AddHostedService<ReconcileWorker>();
builder.Services.AddHostedService<InboundPoller>();

await builder.Build().RunAsync();
public sealed record WorkItem(string OrderId, DateTimeOffset CapturedAt);
public sealed class InboundPoller(Channel<WorkItem> channel, ILogger<InboundPoller> log)
    : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            while (!stoppingToken.IsCancellationRequested)
            {
                // Replace with real upstream intake
                var item = new WorkItem(Guid.NewGuid().ToString("N"), DateTimeOffset.UtcNow);

                // With FullMode.Wait, WriteAsync applies backpressure and honors cancellation.
                await channel.Writer.WriteAsync(item, stoppingToken);

                await Task.Delay(TimeSpan.FromMilliseconds(150), stoppingToken);
            }
        }
        catch (OperationCanceledException)
        {
            // Expected during shutdown; fall through to complete the writer.
        }
        finally
        {
            // Signal no more items when the host is stopping. Without this,
            // the consumer's ReadAllAsync would wait forever for work that
            // will never arrive.
            channel.Writer.TryComplete();
            log.LogInformation("Inbound poller completed writer.");
        }
    }
}
What matters here:
- BoundedChannelFullMode.Wait turns pressure into controlled slowdown.
- The producer exits on cancellation and calls TryComplete().
- You avoid orphaned writers that keep the consumer loop hanging forever.
Code pattern 2: consumer with Polly retry boundaries and trace context
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Channels;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using Polly;
using Polly.Retry;

public sealed class ReconcileWorker(
    Channel<WorkItem> channel,
    IHttpClientFactory httpClientFactory,
    ILogger<ReconcileWorker> log) : BackgroundService
{
    private static readonly ActivitySource Activity = new("7tech.worker.reconcile");

    private static readonly ResiliencePipeline HttpPipeline =
        new ResiliencePipelineBuilder()
            .AddRetry(new RetryStrategyOptions
            {
                MaxRetryAttempts = 3,
                Delay = TimeSpan.FromMilliseconds(300),
                BackoffType = DelayBackoffType.Exponential,
                UseJitter = true,
                ShouldHandle = new PredicateBuilder().Handle<HttpRequestException>()
            })
            .Build();
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await foreach (var item in channel.Reader.ReadAllAsync(stoppingToken))
        {
            using var span = Activity.StartActivity("reconcile.order", ActivityKind.Internal);
            span?.SetTag("order.id", item.OrderId);

            try
            {
                // The token Polly hands to the callback flows from stoppingToken,
                // so each attempt and its backoff delay are cancellation-aware.
                await HttpPipeline.ExecuteAsync(async token =>
                {
                    var client = httpClientFactory.CreateClient("payments");
                    using var req = new HttpRequestMessage(HttpMethod.Post, $"/internal/reconcile/{item.OrderId}");
                    using var res = await client.SendAsync(req, token);
                    res.EnsureSuccessStatusCode();
                }, stoppingToken);
            }
            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
            {
                log.LogWarning("Stopping while processing order {OrderId}", item.OrderId);
                break;
            }
            catch (Exception ex)
            {
                log.LogError(ex, "Reconciliation failed for order {OrderId}", item.OrderId);
                // Send to DLQ / outbox for replay with an idempotency key.
            }
        }

        log.LogInformation("Consumer loop drained and stopped cleanly.");
    }
}
The practical tradeoff with retries: they raise success rate for transient failures, but they also consume your shutdown budget. Keep attempts small, jitter enabled, and move long recovery workflows to a dead-letter or outbox path.
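One way to make that shutdown budget explicit is an outer timeout around the retry strategy, sketched here with Polly v8 (the 10-second cap is an assumption; in Polly v8, strategies added earlier wrap those added later, so the timeout bounds all attempts plus backoff combined):

```csharp
using System.Net.Http;
using Polly;
using Polly.Retry;
using Polly.Timeout;

ResiliencePipeline pipeline = new ResiliencePipelineBuilder()
    // Outer timeout: caps retries + backoff + request time per work item.
    .AddTimeout(TimeSpan.FromSeconds(10))
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromMilliseconds(300),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        ShouldHandle = new PredicateBuilder().Handle<HttpRequestException>()
    })
    .Build();

// Worst case before the cap: roughly 0.3 + 0.6 + 1.2 seconds of backoff plus
// four request attempts; TimeoutRejectedException surfaces if the cap is hit.
```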
Observability that helps at 2 a.m.
OpenTelemetry in .NET is now straightforward to wire, but the biggest win is choosing stable dimensions:
- queue.depth (gauge),
- workitem.age_ms at processing start,
- retry.count, and
- shutdown drain duration.
If you cannot answer “how many items were left unprocessed at stop,” you still have an operational blind spot.
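A minimal sketch of the queue.depth gauge, assuming the bounded channel from pattern 1 (bounded channel readers expose Count when CanCount is true; reusing the ActivitySource name for the Meter is a convention choice, not a requirement):

```csharp
using System.Diagnostics.Metrics;
using System.Threading.Channels;

// From pattern 1 in this article.
public sealed record WorkItem(string OrderId, DateTimeOffset CapturedAt);

public sealed class QueueMetrics : IDisposable
{
    private readonly Meter _meter = new("7tech.worker.reconcile");

    public QueueMetrics(Channel<WorkItem> channel)
    {
        // Sampled by the metrics SDK on each collection interval; bounded
        // channels support Reader.Count, unbounded ones may not.
        _meter.CreateObservableGauge(
            "queue.depth",
            () => channel.Reader.CanCount ? channel.Reader.Count : 0,
            description: "Items buffered in the in-process channel");
    }

    public void Dispose() => _meter.Dispose();
}
```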
Common rollout mistakes (and the tradeoffs behind them)
- Unbounded channels by default. Easy startup, risky under spikes.
- Nested retries. SDK retries + Polly retries can create long tail latency.
- No idempotency key. Restart + replay can duplicate side effects.
- Ignoring orchestrator timing. Your host timeout, preStop hook, and probe settings must align.
A useful rule: design for “stop at any time, resume without surprise.” It is less glamorous than throughput tuning, but it prevents expensive incident weekends.
Troubleshooting: when the worker still behaves badly
Symptom: Deploys hang for 30s then force-kill
- Check if producer stops writing when cancellation is requested.
- Confirm the consumer loop uses ReadAllAsync(stoppingToken), not infinite blocking reads.
- Reduce the retry budget or make retries cancellation-aware.
Symptom: Memory rises during traffic spikes
- Confirm channel is bounded and full mode is deliberate.
- Inspect intake rate versus downstream throughput.
- Add upstream throttling or partitioned processing before increasing capacity.
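If the full-mode choice is part of the problem, the behavioral difference is easy to demonstrate (a sketch; the capacities are arbitrary, and DropWrite is only acceptable when upstream storage is durable and items are replayable):

```csharp
using System.Threading.Channels;

// Wait: producers slow down under pressure; nothing is lost in-process.
// This is the article's default choice.
var waiting = Channel.CreateBounded<int>(new BoundedChannelOptions(500)
{
    FullMode = BoundedChannelFullMode.Wait
});

// DropWrite: the item being written is discarded when the channel is full,
// so intake never blocks, but memory stays flat at the cost of data loss.
var dropping = Channel.CreateBounded<int>(new BoundedChannelOptions(500)
{
    FullMode = BoundedChannelFullMode.DropWrite
});
```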
Symptom: Duplicate side effects after restart
- Add idempotency keys per work item and enforce them downstream.
- Persist completion markers before acknowledging upstream messages.
- Move poison items to DLQ with replay metadata, not infinite retries.
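A sketch of the idempotency gate, with a hypothetical IProcessedStore interface standing in for your database (a production implementation should make the completion marker and the side effect atomic, e.g. one transaction or an outbox row, to close the replay window this sketch still has):

```csharp
public interface IProcessedStore
{
    Task<bool> IsProcessedAsync(string key, CancellationToken ct);
    Task MarkProcessedAsync(string key, CancellationToken ct);
}

public sealed class IdempotentHandler(IProcessedStore store)
{
    // Returns true if the side effect ran, false if this was a replay.
    public async Task<bool> HandleAsync(
        string idempotencyKey, Func<CancellationToken, Task> sideEffect, CancellationToken ct)
    {
        // Use a stable upstream key (message id, order id), never a fresh Guid.
        if (await store.IsProcessedAsync(idempotencyKey, ct))
        {
            return false; // duplicate delivery after restart: skip quietly
        }

        await sideEffect(ct);

        // Persist the marker before acking upstream. A crash between the
        // side effect and this line is exactly why atomicity matters.
        await store.MarkProcessedAsync(idempotencyKey, ct);
        return true;
    }
}
```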
FAQ
1) Should I use channels or an external broker?
Use channels for in-process coordination and short-lived buffering. If durability across pod restarts is mandatory, use an external broker and keep channels as a local execution buffer.
2) How many retries are “safe”?
There is no universal number, but in request-coupled workers, 2-3 attempts with jitter is often enough. Beyond that, you usually get slower failure, not real recovery.
3) Do I need tracing if logs already include order IDs?
Yes, because traces show timing and causal path across service boundaries. Logs are still essential, but traces make latency and retry behavior visible during incidents.
Recommended related reads on 7tech
- DevOps Automation in 2026: Building a Change-Intelligent Delivery Pipeline
- Cloud Architecture in 2026: Designing for Control, Portability, and Human Trust
- systemd Service Hardening for Linux Teams in 2026
- GitHub Actions Pipeline Hardening with OIDC and Pinned SHAs
Actionable takeaways for this week
- Switch worker intake to a bounded channel and document your full-mode choice.
- Make producer, consumer, and retries all cancellation-aware with one host token path.
- Set a strict retry boundary, then route hard failures to a replayable DLQ/outbox flow.
- Add OpenTelemetry spans and queue-depth metrics before your next deploy window.
- Run one controlled restart test in staging and measure actual drain time.
If your team treats graceful stop as a first-class requirement instead of an afterthought, worker incidents get shorter, cheaper, and much less dramatic.
