If your 2026 Node.js app still runs heavy work directly inside request handlers, you are paying for it in timeouts, noisy retries, and poor user experience. A better pattern is to accept work fast, enqueue it, process asynchronously, and expose clear job status APIs. In this guide, you will build a production-ready background jobs system with Node.js, BullMQ, Redis Streams, and OpenTelemetry, including idempotency, retries, and observability.
Why this pattern matters in 2026
Modern apps trigger expensive tasks constantly: report generation, AI enrichment, media processing, webhook fan-out, and sync pipelines. Users expect instant responses, and teams expect traceability. A queue-backed architecture gives you:
Fast API responses by offloading slow work
Retry and backoff policies for transient failures
Controlled concurrency to protect downstream systems
Operational visibility with distributed tracing
Safer deploys because workers can be scaled independently
Architecture overview
We will build three pieces:
API service (Express): accepts a job request and returns 202 Accepted.
Worker service (BullMQ): processes jobs with retry/backoff and emits progress.
Status endpoint: lets clients poll job state, result, and errors.
Redis is the shared backbone. BullMQ uses Redis data structures; Redis Streams help with event-style progress logs and replay when needed.
Project setup
mkdir node-jobs-2026 && cd node-jobs-2026
npm init -y
# The examples use ES module syntax, so mark the package as ESM
npm pkg set type=module
npm i express bullmq ioredis zod pino pino-http @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
npm i -D typescript tsx @types/express
Shared queue config
// src/queue.js
import { Queue } from "bullmq";
import IORedis from "ioredis";

export const redis = new IORedis(process.env.REDIS_URL ?? "redis://127.0.0.1:6379", {
  maxRetriesPerRequest: null, // required by BullMQ's blocking connections
  enableReadyCheck: false,
});

export const jobsQueue = new Queue("report-jobs", {
  connection: redis,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: "exponential", delay: 1000 },
    removeOnComplete: { age: 3600, count: 5000 }, // keep completed jobs up to 1h
    removeOnFail: { age: 86400, count: 10000 },   // keep failures up to 24h
  },
});
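The backoff option above controls how retries are spaced. As a sketch of the schedule BullMQ's built-in exponential strategy produces (the formula here mirrors the documented behavior; verify against your BullMQ version):

```javascript
// Approximate wait before retry N with { type: "exponential", delay: 1000 }:
// delay * 2^(attemptsMade - 1). The first attempt runs immediately; failed
// runs then wait roughly 1s, 2s, 4s, and 8s before attempts 2 through 5.
export const backoffDelay = (attemptsMade, baseDelay = 1000) =>
  baseDelay * 2 ** (attemptsMade - 1);
```

With attempts: 5, a consistently failing job gives up after roughly 15 seconds of cumulative backoff, which is a reasonable ceiling for transient upstream hiccups.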
Create a resilient enqueue endpoint
We validate payloads, enforce idempotency with a client-provided key, and return quickly.
// src/api.js
import express from "express";
import crypto from "node:crypto";
import { z } from "zod";
import { jobsQueue, redis } from "./queue.js";

const app = express();
app.use(express.json());

const jobSchema = z.object({
  userId: z.string().min(1),
  reportType: z.enum(["weekly", "monthly", "custom"]),
  filters: z.record(z.any()).optional(),
});

app.post("/v1/reports", async (req, res) => {
  const parsed = jobSchema.safeParse(req.body);
  if (!parsed.success) return res.status(400).json({ error: parsed.error.flatten() });

  // Dedupe client retries: the same Idempotency-Key always maps to the same job
  const idemKey = req.header("Idempotency-Key") ?? crypto.randomUUID();
  const redisKey = `idem:report:${idemKey}`;
  const existingJobId = await redis.get(redisKey);
  if (existingJobId) {
    return res.status(202).json({ jobId: existingJobId, statusUrl: `/v1/jobs/${existingJobId}` });
  }

  // Deriving the jobId from the key lets BullMQ itself dedupe concurrent retries
  const job = await jobsQueue.add("generate-report", parsed.data, {
    jobId: crypto.createHash("sha256").update(idemKey).digest("hex"),
  });
  await redis.set(redisKey, job.id, "EX", 3600);
  return res.status(202).json({ jobId: job.id, statusUrl: `/v1/jobs/${job.id}` });
});

app.get("/v1/jobs/:id", async (req, res) => {
  const job = await jobsQueue.getJob(req.params.id);
  if (!job) return res.status(404).json({ error: "Job not found" });
  const state = await job.getState();
  return res.json({
    id: job.id,
    state,
    progress: job.progress,
    result: job.returnvalue ?? null,
    failedReason: job.failedReason ?? null,
  });
});

app.listen(3000, () => console.log("API on :3000"));
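One detail the listing skips: in production you also want a clean shutdown path so in-flight requests and Redis writes are not cut off mid-flight. A minimal sketch (the shutdown helper is an illustration, not part of Express or BullMQ); wire it to SIGTERM in your entrypoint:

```javascript
// Close the HTTP server first so no new jobs are accepted, then the
// queue handle, then the shared Redis connection.
export async function shutdown(server, queue, redisConn) {
  await new Promise((resolve) => server.close(resolve));
  await queue.close();
  await redisConn.quit();
}

// Example wiring (assumes `const server = app.listen(3000)`):
//   process.on("SIGTERM", () =>
//     shutdown(server, jobsQueue, redis).then(() => process.exit(0)));
```

Ordering matters here: closing the server before the Redis connection guarantees no request handler is left awaiting a connection that has already gone away.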
Worker with retries, progress, and stream events
The worker emits progress updates and writes an event trail to Redis Streams so operations teams can replay the lifecycle of any job.
// src/worker.js
import { Worker } from "bullmq";
import { redis } from "./queue.js";

const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

const worker = new Worker(
  "report-jobs",
  async (job) => {
    await job.updateProgress(5);
    await redis.xadd(`stream:job:${job.id}`, "*", "event", "started", "ts", Date.now().toString());

    // Simulated phases
    await sleep(500);
    await job.updateProgress(40);
    await redis.xadd(`stream:job:${job.id}`, "*", "event", "fetch-data", "ok", "true");

    await sleep(700);
    await job.updateProgress(80);
    await redis.xadd(`stream:job:${job.id}`, "*", "event", "render", "ok", "true");

    // Throw occasionally to exercise the retry/backoff policy
    if (Math.random() < 0.15) throw new Error("Transient upstream timeout");

    await job.updateProgress(100);
    const url = `https://cdn.example.com/reports/${job.id}.pdf`;
    await redis.xadd(`stream:job:${job.id}`, "*", "event", "completed", "url", url);
    return { downloadUrl: url };
  },
  { connection: redis, concurrency: 10 }
);

worker.on("failed", async (job, err) => {
  if (!job) return; // job is undefined when the failure is not tied to a specific job
  await redis.xadd(`stream:job:${job.id}`, "*", "event", "failed", "reason", err.message);
});
Add OpenTelemetry for debugging and SLOs
In 2026, queue systems without tracing are hard to operate. Instrument both API and worker so a single trace can connect enqueue requests to processing latency.
// src/otel.js
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: process.env.OTEL_SERVICE_NAME ?? "node-jobs-service",
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
Load this file before any other module, for example with node --import ./src/otel.js src/api.js (Node 20.6+). Run the API with OTEL_SERVICE_NAME=jobs-api and the worker with OTEL_SERVICE_NAME=jobs-worker, then export traces to your preferred backend (Tempo, Jaeger, Honeycomb, Datadog, or an OpenTelemetry Collector).
Production checklist
Idempotency: Always dedupe client retries with a stable key.
Backpressure: Tune worker concurrency and queue limits per downstream dependency.
Dead-letter strategy: Move hard failures to a review queue after max attempts.
Timeouts and cancellation: Propagate cancellation tokens for long-running tasks.
Data retention: Keep stream/job metadata long enough for debugging but enforce TTL.
SLOs: Track p95 enqueue-to-complete latency and failure rate by job type.
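The dead-letter item on that checklist can start as a simple sweep over BullMQ's failed set. A minimal sketch, assuming a second review queue (the report-jobs-dlq name is an invention here); the predicate deciding what counts as "dead" is pure and easy to test:

```javascript
// A job is "dead" once it has used up every configured attempt.
export const isDead = (job) => job.attemptsMade >= (job.opts?.attempts ?? 1);

// Hypothetical sweep: copy dead jobs into a review queue for human or
// slow-path inspection, then remove them from the main queue's failed set.
// Pass in BullMQ Queue instances, e.g. jobsQueue and a Queue("report-jobs-dlq").
export async function sweepToReviewQueue(jobsQueue, reviewQueue) {
  for (const job of await jobsQueue.getFailed()) {
    if (isDead(job)) {
      await reviewQueue.add("review", { original: job.data, reason: job.failedReason });
      await job.remove();
    }
  }
}
```

Run the sweep on a schedule (cron, a repeatable job, or a CI task) rather than inline in the worker, so a flood of failures cannot slow down healthy job processing.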
Quick test flow
# Create job
curl -X POST http://localhost:3000/v1/reports -H 'Content-Type: application/json' -H 'Idempotency-Key: report-123' -d '{"userId":"u_42","reportType":"weekly"}'
# Poll status
curl http://localhost:3000/v1/jobs/<JOB_ID>
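If you want the same flow in code instead of curl, a hypothetical polling client is a few lines (this relies on Node 18+'s global fetch; waitForJob and isTerminal are illustrations, not part of any library):

```javascript
// States reported by /v1/jobs/:id that should end the poll loop.
export const isTerminal = (state) => state === "completed" || state === "failed";

// Poll the status endpoint until the job reaches a terminal state,
// then return the final status body (including result or failedReason).
export async function waitForJob(baseUrl, jobId, intervalMs = 1000) {
  for (;;) {
    const res = await fetch(`${baseUrl}/v1/jobs/${jobId}`);
    const body = await res.json();
    if (isTerminal(body.state)) return body;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```

For browser clients, the same loop works, though server-sent events or a WebSocket fed from the Redis Stream would avoid polling entirely.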
Final thoughts
Background jobs are no longer an optional scaling trick; they are a core reliability primitive. With Node.js, BullMQ, Redis Streams, and OpenTelemetry, you get a system that handles retries gracefully, exposes real-time progress, and gives your team the visibility needed to debug production incidents quickly. Start with one high-latency endpoint this week, move it behind this pattern, and you will likely see immediate gains in API responsiveness and operational confidence.
