The 90-Second Java Pod Restart: A 2026 Runbook for CDS Archives, Startup Telemetry, and Safer JVM Flags


At 8:57 AM on a Monday, one of our Java pods restarted after a routine node drain. Nothing dramatic, no pager storm, just one quiet deploy. But checkout latency doubled for nine minutes because the replacement pod took too long to become ready. Locally the service felt snappy. In Kubernetes, under real memory limits and real startup probes, it became fragile. That was the day we stopped treating startup as an afterthought and started treating Java startup performance like an SLO.

This guide is the runbook we now use for Spring Boot style services on JDK 21+, with a focus on class data sharing, reproducible startup telemetry, and conservative JVM tuning that does not guess. If your service is “fast enough” on a laptop but slow after pod restarts, this is for you.

Why startup fails in production even when benchmarks look fine

Most teams test throughput and p95 latency, then assume startup is solved. In containerized systems, startup is a separate reliability problem. Kubernetes schedules on requests and enforces limits, and it treats CPU and memory pressure differently: exceeding a CPU limit means throttling, while exceeding a memory limit means the container is killed. Your startup path is competing with those cgroup realities, not with your dev machine.

Two mistakes show up repeatedly:

  • No startup observability: teams can tell you steady-state latency but not class-loading time, JIT ramp behavior, or readiness gate delay.
  • One-shot JVM tuning: flags are added in panic mode and never re-validated after dependency updates.

If this sounds familiar, you may also recognize the same pattern from other incident classes, like query-plan surprises in databases (our PostgreSQL plan drift runbook) where missing observability causes expensive guessing.

The three-pass runbook for Java startup performance

Pass 1: Measure startup as a first-class path

Before tuning, instrument startup with repeatable markers; a minimal in-app sketch follows the list:

  • Container start timestamp
  • JVM process started
  • App context ready
  • Readiness probe success
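
A minimal in-app sketch for the "app context ready" marker, assuming a Spring Boot service; the class name and log format here are illustrative, and the JVM start timestamp comes from the OS process info. Container start and readiness-probe success are visible from the kubelet side (pod status), not from inside the app.

import java.time.Duration;
import java.time.Instant;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.ApplicationListener;
import org.springframework.stereotype.Component;

// Illustrative marker: logs how long the JVM took from process start to
// "application context ready", so the value can be scraped per build version.
@Component
public class StartupMarkers implements ApplicationListener<ApplicationReadyEvent> {

    @Override
    public void onApplicationEvent(ApplicationReadyEvent event) {
        // JVM process start as reported by the OS (Java 9+); falls back to "now"
        // if the platform does not expose it.
        Instant jvmStart = ProcessHandle.current().info().startInstant()
                .orElse(Instant.now());
        long jvmToReadyMs = Duration.between(jvmStart, Instant.now()).toMillis();
        System.out.printf("startup-marker jvm_to_ready_ms=%d%n", jvmToReadyMs);
    }
}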

Use Unified JVM Logging during test runs so class loading and CDS behavior are visible. Keep this verbose mode for staging and controlled canaries, not always-on production.

# Example startup diagnostic launch (staging only)
java \
  -Xlog:class+load=info,cds=info \
  -Xms512m -Xmx512m \
  -jar app.jar

Notice what we are not doing yet: no claim that one flag gives universal gains. We measure first, then narrow changes to bottlenecks we can see.

Pass 2: Add CDS with a reproducible build path

Class Data Sharing (CDS) reduces startup overhead by memory-mapping class metadata from a shared archive instead of parsing and verifying every class at runtime. Recent JDKs ship with a default archive covering the core libraries, but application-specific archives often help more when your service loads a substantial application classpath.

Key tradeoff: CDS archives are tied to classpath and build shape. If your dependencies change, regenerate the archive. Treat CDS as a build artifact, not a one-time tweak.

#!/usr/bin/env bash
set -euo pipefail

JAR=build/libs/orders-service.jar
ARCHIVE=build/cds/orders-service.jsa
CLASSLIST=build/cds/classes.lst

mkdir -p build/cds

# 1) Training run to record loaded classes
java \
  -Xshare:off \
  -XX:DumpLoadedClassList=${CLASSLIST} \
  -jar ${JAR} --spring.main.web-application-type=none || true

# 2) Build archive from observed class list (dump-only run, the app does not start)
java \
  -Xshare:dump \
  -XX:SharedClassListFile=${CLASSLIST} \
  -XX:SharedArchiveFile=${ARCHIVE} \
  -cp ${JAR}

# 3) Validate the archive loads (exits after context startup, like the training run)
java \
  -Xshare:on \
  -XX:SharedArchiveFile=${ARCHIVE} \
  -Xlog:cds=info \
  -jar ${JAR} --spring.main.web-application-type=none

Production tip: test with -Xshare:on in pre-prod to catch mismatches early, because it fails hard instead of silently falling back to no archive. For runtime resilience, many teams then move to -Xshare:auto (the JVM default) once validation is complete.
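
If the class-list flow feels heavier than you want, JDK 13+ also offers dynamic archiving: the archive is written when the training run exits, and it can include classes loaded by user-defined class loaders. A sketch reusing the variables from the script above, as an alternative to steps 1 and 2:

# Alternative (JDK 13+): dynamic archive written at JVM exit during a training run
java \
  -XX:ArchiveClassesAtExit=${ARCHIVE} \
  -jar ${JAR} --spring.main.web-application-type=none || true

# Runtime usage is the same as with the static archive
java \
  -XX:SharedArchiveFile=${ARCHIVE} \
  -Xlog:cds=info \
  -jar ${JAR}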

Pass 3: Align JVM settings with Kubernetes behavior

In Kubernetes, limits and requests are not just accounting metadata. CPU limits can throttle, and memory limit enforcement is reactive, so overcommit can look fine until pressure arrives. Startup is where this mismatch hurts first.

Use explicit resource values and keep JVM heap sizing deliberate. A practical baseline is to pin initial and max heap for predictability during startup, then revisit after measurements.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: app
          image: ghcr.io/acme/orders-api:2026.04.26
          resources:
            requests:
              cpu: "500m"
              memory: "768Mi"
            limits:
              cpu: "1"
              memory: "768Mi"
          env:
            - name: JAVA_TOOL_OPTIONS
              value: >-
                -Xms512m -Xmx512m
                -Xshare:auto
                -XX:SharedArchiveFile=/opt/app/cds/orders-service.jsa
          startupProbe:
            httpGet:
              path: /actuator/health/liveness  # default liveness group; swap in a custom startup group if you define one
              port: 8080
            failureThreshold: 30
            periodSeconds: 2
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            periodSeconds: 5

This is intentionally boring. Reliability improves when startup behavior is predictable, not when flags are exotic.

What to watch after rollout

  • Time to readiness (median and p95), per build version (a quick spot-check sketch follows this list).
  • Cold restart error rate in the first two minutes after pod start.
  • Archive hit confidence from startup logs during canaries.
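
Before dashboards exist, a rough spot-check of time to readiness is possible from pod status alone. A sketch, assuming GNU date and that the Ready condition has flipped only once since the pod started:

#!/usr/bin/env bash
set -euo pipefail

# Rough time-to-readiness for one pod: pod start vs. last Ready transition.
POD="$1"
START=$(kubectl get pod "${POD}" -o jsonpath='{.status.startTime}')
READY=$(kubectl get pod "${POD}" -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}')
echo "${POD}: $(( $(date -d "${READY}" +%s) - $(date -d "${START}" +%s) ))s from pod start to Ready"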

Pair this with your existing Java hygiene. If startup is good but memory grows later, use a dedicated leak process like this JFR + async-profiler workflow. If request concurrency is the bottleneck, virtual-thread migration may matter more than startup tuning (reference runbook).

Tradeoffs you should decide explicitly

Startup optimization is useful, but not free. A small team can lose weeks chasing a 400 ms win that never changes user outcomes. I like to make tradeoffs explicit before rollout:

  • Build complexity: CDS introduces artifact management and validation steps. If your release pipeline is unstable, fix that first.
  • Operational clarity: every new JVM option increases cognitive load during incidents. Prefer options you can explain in one sentence.
  • Environment coupling: archives and resource assumptions can become environment-specific. Treat reproducibility as a hard requirement.

A practical checkpoint is this: “Would this change reduce real customer pain during restarts or autoscaling events?” If yes, keep it. If no, spend effort where the bottleneck actually lives. We learned this the hard way after over-optimizing startup for a service whose real issue was downstream rate limiting. The startup graph looked beautiful, but users still saw spikes until the dependency policy was fixed (similar reliability pattern here).

Troubleshooting

1) “CDS archive ignored” in logs

Likely cause: classpath mismatch between dump time and runtime, or the archive missing from the expected path in the container image.
Fix: regenerate the archive from the exact artifact you ship and verify the path baked into the image. In pre-prod, run once with -Xshare:on so a bad archive fails fast instead of being silently ignored.

2) Faster startup, but first requests still spike

Likely cause: startup completed before critical lazy paths warmed.
Fix: add lightweight warmup calls behind startup probes, and ensure readiness reflects real dependency checks (DB, cache, message broker); a minimal config sketch follows.
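
In Spring Boot, one way to make readiness reflect real dependencies is to add the relevant health indicators to the readiness group. A sketch in application.properties; "db" here stands in for whatever indicators your service actually exposes:

# application.properties (sketch): include real dependency checks in readiness
management.endpoint.health.probes.enabled=true
management.endpoint.health.group.readiness.include=readinessState,db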

3) Pod restarts under load despite “safe” memory settings

Likely cause: the heap fits, but non-heap and native memory do not, so container memory pressure triggers the OOM killer.
Fix: profile native and metaspace headroom (a sketch follows), reduce burst allocations at startup, and avoid setting the heap too close to the container limit.
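
For the profiling step, Native Memory Tracking usually gives enough signal during a canary, at the cost of a small runtime overhead. A sketch, assuming Java runs as PID 1 in the container:

# Canary only: enable Native Memory Tracking (adds a small runtime overhead)
java \
  -XX:NativeMemoryTracking=summary \
  -Xms512m -Xmx512m \
  -jar app.jar

# From a shell in the same container, compare committed native + metaspace
# against the pod memory limit
jcmd 1 VM.native_memory summary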

4) Startup improves on one node pool, regresses on another

Likely cause: inconsistent CPU entitlement or noisy-neighbor effects.
Fix: compare requests/limits and node class. Keep rollout cohorts isolated when testing JVM changes, similar to staged operational rollouts used in other systems (example playbook).

FAQ

Do I need AppCDS for every Java service?

No. Start with services where restart frequency and startup latency materially affect user experience or autoscaling stability. For tiny services, default CDS may already be enough.

Should I tune startup probes first or JVM flags first?

Do both, in that order. Set probes to reflect how long startup actually takes today, then improve JVM and application startup so you can tighten them. Treat probes as guardrails, not a mask for slow initialization.

Can CDS replace proper performance profiling?

Not at all. CDS helps one phase, startup class loading. It does not solve runtime lock contention, query inefficiency, or memory leaks.

Actionable takeaways

  • Track startup-to-readiness as an explicit reliability metric, not an anecdote.
  • Adopt class data sharing as a build artifact flow, with regeneration on dependency changes.
  • Keep JVM heap choices explicit and compatible with Kubernetes limits to avoid startup surprises.
  • Validate with canary logs, then simplify to stable defaults rather than accumulating ad-hoc flags.
  • Use adjacent runbooks for non-startup issues, so startup tuning does not become a dumping ground for every Java problem.

When teams treat startup like a real production path, failovers feel boring again, and boring is exactly what you want during a restart.
