The OOM Kill That Wasn’t Random: Linux Memory Pressure Monitoring with PSI, cgroup v2, and Kubernetes MemoryQoS


At 2:17 AM last month, a checkout API that had looked healthy all evening started timing out in bursts. CPU was fine. Disk looked fine. The team increased the memory limit, the errors dropped for twenty minutes, and then came back harder. What finally fixed it was not “more RAM”. It was visibility into memory pressure and a better boundary strategy: Linux PSI signals, cgroup v2 throttling, and Kubernetes QoS tuned to workload reality.

If you run mixed workloads, this pattern is common. OOM kills feel random because the warning signs are often there, just not in the dashboards you are watching. This guide focuses on Linux memory pressure monitoring and how to turn it into practical controls before users see failures.

The problem with memory graphs that look “normal”

Most teams monitor memory usage as a percentage and assume that “not at 100%” means “safe.” But memory failure modes usually begin earlier:

  • Direct reclaim steals latency from request paths.
  • Page cache churn causes sudden I/O amplification.
  • One noisy service pushes neighbors into stalls long before global OOM.

This is why Linux Pressure Stall Information (PSI) matters. The kernel exposes how much time tasks spend stalled on resources, including memory, via /proc/pressure/memory. Each file reports two lines: "some" (at least one task was stalled) and "full" (all non-idle tasks were stalled at once), each with rolling averages over 10-, 60-, and 300-second windows (avg10, avg60, avg300) plus cumulative stall time in microseconds. That is far more operationally useful than a single "used memory" line.
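The raw format is simple enough to parse directly. A minimal sketch, using an illustrative sample line (real values come from /proc/pressure/memory):

```shell
# Illustrative PSI "some" line; on a real host read it with:
#   grep '^some ' /proc/pressure/memory
line='some avg10=0.00 avg60=1.25 avg300=0.80 total=123456'

# Pull out the 10-second rolling average
avg10="$(echo "$line" | sed -E 's/.*avg10=([0-9.]+).*/\1/')"
echo "$avg10"   # prints 0.00
```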

A pressure-first operating model (instead of an OOM-first model)

Based on the kernel and Kubernetes documentation, the most reliable pattern is:

  1. Detect rising pressure early (PSI and memory.events).
  2. Throttle and protect with cgroup v2 controls (memory.high, memory.max, memory.low / memory.min where appropriate).
  3. Align Kubernetes requests and limits so eviction behavior matches business priority.

The tradeoff is intentional: small throttling now versus unstable tail latency and hard kills later.

Step 1, make memory pressure visible in 10 minutes

Use this lightweight script on Linux hosts (or node debug pods) to sample PSI and cgroup events together. This gives you early warning when the kernel is reclaiming aggressively.

#!/usr/bin/env bash
set -euo pipefail

# Pass a service or pod cgroup as the first argument; the root cgroup
# does not expose memory.events or memory.current.
CGROUP_PATH="${1:-/sys/fs/cgroup}"
INTERVAL="${2:-5}"   # seconds between samples

echo "ts,psi_some_avg10,psi_full_avg10,memory_events_high,memory_events_oom,memory_current_bytes"
while true; do
  ts="$(date -u +%FT%TZ)"

  # PSI: "some" = at least one task stalled; "full" = all non-idle tasks stalled
  psi_some="$(grep '^some ' /proc/pressure/memory)"
  psi_full="$(grep '^full ' /proc/pressure/memory)"
  some_avg10="$(echo "$psi_some" | sed -E 's/.*avg10=([0-9.]+).*/\1/')"
  full_avg10="$(echo "$psi_full" | sed -E 's/.*avg10=([0-9.]+).*/\1/')"

  # cgroup v2 counters; fall back to 0 when the file or field is absent
  high="$(awk '/^high /{v=$2} END{print v+0}' "$CGROUP_PATH/memory.events" 2>/dev/null || echo 0)"
  oom="$(awk '/^oom /{v=$2} END{print v+0}' "$CGROUP_PATH/memory.events" 2>/dev/null || echo 0)"
  current="$(cat "$CGROUP_PATH/memory.current" 2>/dev/null || echo 0)"

  echo "$ts,$some_avg10,$full_avg10,$high,$oom,$current"
  sleep "$INTERVAL"
done

How to read it: if PSI “some” spikes repeatedly while memory.events:high increases, your workload is being throttled near its soft boundary. That is often a healthy warning state. If oom starts moving, you are already in a failure path.
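Because memory.events:high is a monotonically increasing counter, the rate of change between samples matters more than the raw number. A minimal sketch with illustrative counter values:

```shell
# Two samples of memory.events "high", taken INTERVAL seconds apart
# (values are illustrative; real ones come from the sampler above)
prev=1200
curr=1275
interval=5

# Throttle events per second over the sampling window
rate=$(( (curr - prev) / interval ))
echo "high events/sec: $rate"   # prints: high events/sec: 15
```

A sustained nonzero rate is the "healthy warning state" described above; an accelerating rate is your cue to act before oom moves.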

Step 2, use cgroup v2 boundaries that reflect workload behavior

In cgroup v2, memory.max is your hard stop, while memory.high is a throttling point that activates earlier. The practical effect is smoother degradation under pressure instead of abrupt process death.

For systemd-managed services, this is the fastest safe baseline:

# /etc/systemd/system/payment-api.service.d/limits.conf
[Service]
# Hard cap; crossing this risks OOM kills inside the unit
MemoryMax=1200M

# Soft throttle before hard cap to reduce reclaim shock
MemoryHigh=900M

# Optional protection for critical daemons (use sparingly)
MemoryLow=300M

# Let systemd-oomd consider this unit for pressure-based kills
# (requires systemd-oomd to be enabled on the host)
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=70%
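After a systemctl daemon-reload and a service restart, it is worth confirming what the kernel actually sees. A minimal inspection sketch, assuming the cgroup v2 unified hierarchy under /sys/fs/cgroup/system.slice and the unit name from the example above (it prints "n/a" wherever a file is missing, for example on hosts without this unit):

```shell
# Read the live memory boundaries for a systemd unit's cgroup
show_cgroup_memory() {
  local cg="/sys/fs/cgroup/system.slice/$1"
  local f
  for f in memory.high memory.max memory.low memory.current; do
    # "max" means unlimited; "n/a" means the file was not found
    printf '%s: %s\n' "$f" "$(cat "$cg/$f" 2>/dev/null || echo n/a)"
  done
}

show_cgroup_memory payment-api.service
```

If the printed values do not match the drop-in, check that the drop-in file name and unit name agree and that daemon-reload actually ran.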

Tradeoffs to discuss with your team:

  • Lower MemoryHigh protects neighbors but can increase p95 latency for bursty services.
  • Higher MemoryHigh improves burst throughput but raises node-level reclaim risk.
  • MemoryLow / MemoryMin can protect critical services, but overusing protection can starve everything else.

Step 3, map Linux behavior to Kubernetes QoS on purpose

Kubernetes eviction and QoS rules are predictable if requests and limits are set intentionally. During node-pressure eviction, the kubelet ranks pods by usage relative to requests and by priority; in practice that means BestEffort pods and Burstable pods exceeding their requests tend to go first, while Guaranteed pods are evicted last. Also, MemoryQoS (cgroup v2-based throttling behavior) is version- and feature-gate-dependent, so verify your cluster state before assuming it is active.

apiVersion: v1
kind: Pod
metadata:
  name: checkout-api
spec:
  containers:
    - name: app
      image: ghcr.io/example/checkout:2026.04.1
      resources:
        requests:
          cpu: "500m"
          memory: "768Mi"
        limits:
          cpu: "1500m"
          memory: "1024Mi"

This profile makes the pod Burstable, which is often correct for APIs with controlled headroom. If this service is truly business-critical and you need stronger eviction resistance, use Guaranteed semantics (requests = limits), but expect less bin-packing efficiency.
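To sanity-check a manifest before deploying it, the documented QoS rules can be sketched as a tiny classifier for a single-container pod. qos_class is a hypothetical helper for illustration, not a kubectl feature; on a running pod the real assignment is visible at .status.qosClass. The sketch assumes requests are set explicitly (the real kubelet also defaults requests to limits when only limits are given):

```shell
# QoS class per the documented rules, single-container case:
#   Guaranteed: cpu/memory requests set and equal to limits
#   BestEffort: no requests and no limits at all
#   Burstable:  everything else
qos_class() {  # args: cpu_request cpu_limit mem_request mem_limit ("" if unset)
  if [ -z "$1$2$3$4" ]; then
    echo BestEffort
  elif [ -n "$1" ] && [ -n "$3" ] && [ "$1" = "$2" ] && [ "$3" = "$4" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 500m 1500m 768Mi 1024Mi   # prints Burstable (the manifest above)
qos_class 1 1 1Gi 1Gi               # prints Guaranteed
qos_class "" "" "" ""               # prints BestEffort
```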

Troubleshooting, when pressure signals and outcomes disagree

1) PSI stays low, but pods still restart

Check for container-level limit breaches first. Container memory limits are enforced per cgroup, so a single container can be OOM-killed at its own memory.max even while node-wide PSI looks tame.

2) memory.events:high climbs continuously and latency drifts upward

You are likely over-throttling. Raise memory.high gradually (for example 5 to 10 percent per change), then retest tail latency. Avoid jumping straight to removing limits.
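The incremental step can be sketched in a few lines (byte values here are illustrative, not a recommendation):

```shell
# Raise memory.high by ~5% per change, then re-measure tail latency
current=$((900 * 1024 * 1024))       # current memory.high in bytes (illustrative)
next=$(( current + current / 20 ))   # +5%
echo "next memory.high: $next bytes"
```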

3) Evictions hit “important” pods before expected

Audit requests and limits. Teams often discover critical pods running as Burstable with tiny requests, making them eviction candidates during node pressure.

4) Repeated OOMs after scaling out

Scaling can multiply page-cache pressure and allocator fragmentation. Re-check per-pod limits, JVM/GC or runtime memory caps, and node allocatable reservations.


FAQ

Should I always enable Kubernetes MemoryQoS?

Not blindly. Treat it as a feature to validate per cluster version and kernel baseline. It can improve stability, but you should stage-test with representative load first.

Is memory.high just another hard limit?

No. memory.max is the hard boundary. memory.high is a throttling threshold designed to trigger reclaim pressure earlier, so you can avoid cliff-edge failure behavior.

What is a practical first alert for PSI?

Start simple: alert when memory PSI some avg10 is persistently elevated over your baseline and paired with rising memory.events:high. Tune thresholds from observed service behavior, not generic numbers.
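That compound condition can be sketched as a simple predicate (the threshold values are placeholders to tune from observed behavior, not recommendations):

```shell
# Alert only when PSI "some" is above baseline AND memory.events "high"
# grew during the window (both sample values below are illustrative)
some_avg10="12.5"   # current PSI some avg10
high_delta=40       # memory.events "high" increase this window

if awk -v p="$some_avg10" -v h="$high_delta" 'BEGIN { exit !(p > 10.0 && h > 0) }'; then
  status=ALERT
else
  status=ok
fi
echo "$status"   # prints ALERT with these sample values
```

Requiring both signals avoids paging on brief reclaim blips that never translate into throttling.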

Actionable takeaways

  • Track PSI alongside memory usage, not instead of it.
  • Set both soft (memory.high) and hard (memory.max) boundaries for critical services.
  • Align Kubernetes requests and limits with true business priority, not copy-paste defaults.
  • Run a monthly pressure drill: induce controlled load, validate alerting, and check eviction order.
  • Document one rollback-safe tuning path for each tier (API, worker, batch) before incidents happen.

When teams stop treating OOM as a random event and start treating memory pressure as a measurable signal, incident timelines get shorter and capacity decisions get cheaper. That shift is usually worth more than the next node-size upgrade.
