At 2:13 AM, the alert said what alerts always say when they are least helpful: CPU > 90% for 2 minutes. By the time the on-call engineer opened dashboards, the storm had already passed. The service recovered, p99 dropped, and everyone was left with the same question: did we just survive a one-off spike, or miss a recurring failure pattern?
That is the trap with production Java incidents in 2026. The expensive part is rarely “collecting a profile.” The expensive part is collecting the right evidence before the signal disappears.
This runbook is how I now handle that class of incident: keep Java Flight Recorder in production as a low-overhead baseline, then escalate to deeper sampling only when the evidence justifies it. If you run Java 17/21 services in containers or VMs, this approach gives you faster root cause with less operational drama.
Why one-off profiling keeps failing under pager pressure
Ad-hoc profiling assumes the incident is still happening when you attach tooling. In real systems, especially bursty API workloads, that assumption is usually wrong.
- CPU storms can last 20 to 90 seconds and disappear before manual triage starts.
- Heap growth patterns can look normal at 1-minute scrape resolution but still trigger allocator pressure.
- Node-level contention (I/O wait, scheduler pressure, noisy neighbors) can amplify JVM symptoms.
The fix is not “more dashboards.” The fix is continuous JFR capture with bounded retention, then extracting the relevant window when the alert fires.
The practical model: always-on ring buffer + targeted deep dive
Think in two layers:
- Layer 1 (always on): JFR recording with bounded disk usage (maxage/maxsize) for immediate post-incident evidence.
- Layer 2 (on demand): async-profiler or short higher-detail JFR only during active anomalies.
Tradeoff: Layer 1 keeps overhead low and evidence continuous, but sampling granularity is intentionally moderate. Layer 2 gives sharper hotspots, but costs more and should be time-boxed.
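If you would rather declare Layer 1 at JVM startup than attach later, the same recording can be expressed as a startup flag. A minimal sketch, assuming the same paths and limits as the jcmd commands below:
# Equivalent always-on recording declared at startup (paths are placeholders)
java -XX:StartFlightRecording=name=alwayson,disk=true,maxage=30m,maxsize=512m,dumponexit=true,filename=/var/log/jfr/alwayson.jfr \
  -jar my-service.jar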
Runbook commands with jcmd JFR.start and dump windows
# 1) Start (or ensure) an always-on JFR recording
PID=$(pgrep -f 'my-service.jar' | head -n1)
jcmd "$PID" JFR.start \
name=alwayson \
settings=default \
disk=true \
maxage=30m \
maxsize=512m \
dumponexit=true \
filename=/var/log/jfr/%p-%t.jfr
# 2) During or right after an alert, dump the most recent 5 minutes
TS=$(date +%Y%m%d-%H%M%S)
jcmd "$PID" JFR.dump name=alwayson begin=-5m filename=/var/log/jfr/incident-${TS}.jfr
# 3) Verify recording health
jcmd "$PID" JFR.check
Why this works well operationally:
- begin=-5m lets you grab pre-failure context, not only the failure tail.
- maxage and maxsize cap disk growth, so this is safe for long-lived services.
- You can automate the dump step from alert hooks without restarting the JVM (see the sketch below).
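A minimal alert-hook sketch, assuming your alerting system can run a local script on the affected node; the service match and shipping destination are placeholders:
#!/usr/bin/env bash
# dump-on-alert.sh - hypothetical hook fired by the alerting system
set -euo pipefail
PID=$(pgrep -f 'my-service.jar' | head -n1)
TS=$(date +%Y%m%d-%H%M%S)
# Pull the last 10 minutes from the always-on ring buffer
jcmd "$PID" JFR.dump name=alwayson begin=-10m filename="/var/log/jfr/incident-${TS}.jfr"
# Ship the artifact wherever responders actually look (placeholder destination)
# aws s3 cp "/var/log/jfr/incident-${TS}.jfr" "s3://incident-artifacts/${HOSTNAME}/"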
When to add RecordingStream, and when not to
JFR Event Streaming (JEP 349) is useful when you want near-real-time JVM signals without waiting for file analysis. I use it for lightweight local heuristics, such as “sustained machine CPU + lock contention for N intervals, then trigger enriched logs.”
I avoid using it for complex autonomous remediation. Streaming is great for detection and context, not for magical self-healing logic that is hard to audit at 3 AM.
import jdk.jfr.consumer.RecordingStream;
import java.time.Duration;

public class JfrCpuGuard {
    public static void main(String[] args) {
        try (var rs = new RecordingStream()) {
            // Sample whole-machine CPU once per second
            rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
            // Only emit monitor-enter events that block for more than 20 ms
            rs.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(20));
            rs.onEvent("jdk.CPULoad", event -> {
                float total = event.getFloat("machineTotal");
                if (total > 0.85f) {
                    System.out.println("WARN high machine CPU: " + total);
                }
            });
            rs.onEvent("jdk.JavaMonitorEnter", event -> {
                // getClass(String) reads the event's monitorClass field, not Object.getClass()
                var monitorClass = event.getClass("monitorClass");
                String monitor = monitorClass == null ? "unknown" : monitorClass.getName();
                System.out.println("LOCK contention over 20ms on " + monitor);
            });
            // start() blocks this thread; run it on a dedicated thread in a real service
            rs.start();
        }
    }
}
Tradeoff to remember: this is event-driven visibility, not a replacement for full trace-level causality. Keep handlers cheap, and forward summarized signals, not raw event floods.
Escalation path with async-profiler
If JFR points to CPU saturation but hotspots remain unclear, escalate briefly with async-profiler. It is sampling-based and designed to avoid classic safepoint bias problems. Keep the session short (for example, 30 to 60 seconds) and aligned with live anomaly windows.
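A time-boxed session sketch, assuming async-profiler is already installed on the host; the launcher is profiler.sh in 2.x and asprof in 3.x, so adjust to your version:
# 30-second CPU sample of the live JVM, rendered as a flame graph
./asprof -e cpu -d 30 -f "/var/log/jfr/hotspots-$(date +%s).html" "$PID"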
In containerized environments, check host and runtime permissions before incident day. Profiling that requires emergency privilege changes during an outage is exactly the kind of process debt that turns a minor event into a major one.
How this fits our broader reliability practice
This runbook pairs well with a few patterns we have already used on 7tech:
- Java startup tuning with CDS and safer JVM flags for boot-time efficiency.
- Linux cgroup v2 I/O guardrails when JVM symptoms are amplified by host-level pressure.
- Human-verified SQL reliability checks so telemetry stories match business impact.
- Workflow integrity over green pipelines when automating incident capture and postmortems.
A rollout checklist you can apply this week
If you are introducing this pattern to an existing service fleet, do it in three passes. First, enable always-on recording only in one non-critical environment and validate that retention limits behave exactly as expected. Second, run a game-day drill where you trigger a synthetic CPU alert and confirm the JFR dump artifact reaches the same location your incident responders actually use. Third, document one “golden path” command set per runtime (VM, container, Kubernetes) so responders are not translating tribal knowledge during a live page.
This staged rollout sounds slower, but in practice it is faster than rolling out globally and then debugging path, permission, and ownership mismatches under pressure.
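As one example of a golden-path command set, here is a hedged Kubernetes variant; the pod name and paths are placeholders, and it assumes jcmd ships in the container image:
# The JVM is usually PID 1 in single-process containers
kubectl exec "$POD" -- jcmd 1 JFR.dump name=alwayson begin=-5m filename=/var/log/jfr/incident.jfr
# Copy the artifact off the pod before it is rescheduled
kubectl cp "$POD:/var/log/jfr/incident.jfr" "./incident-$(date +%Y%m%d-%H%M%S).jfr"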
Troubleshooting: what usually goes wrong first
1) “JFR.start succeeded, but files are missing”
Usually a path or permission mismatch. Confirm the JVM user can write to the configured directory. If running in Kubernetes, verify volume mounts and the security context (readOnlyRootFilesystem is a common silent blocker).
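Two quick checks from outside the pod, as a sketch (POD is a placeholder):
# Can the JVM user actually write the JFR directory?
kubectl exec "$POD" -- sh -c 'touch /var/log/jfr/.write-test && echo writable'
# Look for readOnlyRootFilesystem in the container securityContext
kubectl get pod "$POD" -o jsonpath='{.spec.containers[0].securityContext}'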
2) “Recording exists, but incident window is empty”
The dump time window is wrong. Use JFR.check to inspect active recording timing and ensure begin=-5m (or a suitable range) actually overlaps the alert period. Time sync drift between alerting and node clocks can also mislead responders.
3) “Profiling caused noisy performance impact”
Sampling and event settings are too aggressive for the workload. Reduce recording detail, shorten deep-dive windows, and avoid stacking multiple heavyweight diagnostics simultaneously.
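If you still need more detail, prefer a self-terminating higher-detail recording over cranking up the always-on one. A sketch:
# 60-second deep dive with the high-detail preset; the always-on recording stays untouched
jcmd "$PID" JFR.start name=deepdive settings=profile duration=60s filename="/var/log/jfr/deepdive-$(date +%s).jfr"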
4) “async-profiler cannot attach in production”
This is often kernel/perf permission policy. Validate compatibility and permissions during daylight hours, document the exact escalation path, and avoid improvising security exceptions mid-incident.
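A daylight-hours validation sketch for the usual kernel blockers, assuming async-profiler's perf_events engine:
# Values above 2 generally block unprivileged perf profiling;
# kernel stacks usually need perf_event_paranoid=1 and kptr_restrict=0
sysctl kernel.perf_event_paranoid kernel.kptr_restrict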
FAQ
Should I keep JFR always on in production?
For most Java services, yes, with bounded retention and sane settings. The overhead is typically low enough to justify the incident-response gain, but test on your workload before fleet-wide rollout.
Is RecordingStream better than file-based JFR dumps?
They solve different problems. RecordingStream is excellent for live signals and lightweight heuristics. File dumps remain better for structured post-incident analysis and tool interoperability.
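For the file side, the JDK's bundled jfr tool is usually enough to start with. A sketch, reusing the ${TS} naming from the dump commands above:
# Quick inventory of what the incident file contains
jfr summary "/var/log/jfr/incident-${TS}.jfr"
# Print only the events relevant to the anomaly
jfr print --events jdk.CPULoad,jdk.JavaMonitorEnter "/var/log/jfr/incident-${TS}.jfr"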
Do I still need logs and metrics if I have JFR?
Absolutely. JFR explains JVM behavior, while logs and metrics explain request paths, dependencies, and user impact. You need all three for reliable root-cause analysis.
Actionable takeaways
- Adopt a default always-on JFR recording with maxage and maxsize caps.
- Automate alert-triggered JFR dumps for the last 5 to 10 minutes.
- Use RecordingStream for lightweight detection signals, not heavy remediation logic.
- Reserve async-profiler for short, high-signal deep dives during active anomalies.
- Pre-validate permissions and storage paths so incident commands work the first time.
If your team keeps saying “we were too late to capture it,” this runbook is usually the fastest way to break that cycle.
