Linux performance debugging in 2026 is less about guessing and more about collecting the right low-level evidence quickly. Teams run mixed workloads across containers, virtual machines, and edge devices, so a single dashboard graph is rarely enough to explain a spike in latency. The practical answer is eBPF. It lets you observe kernel and application behavior with low overhead, without recompiling your app or rebooting a server. In this guide, you will build a practical Linux observability workflow with eBPF that helps you spot CPU pressure, file system delay, and network retransmits before users feel the impact.
Why eBPF is now a core Linux skill
Traditional metrics like CPU percent, memory usage, and request count are still useful, but they often hide the root cause of an incident. eBPF gives you event-level visibility from the kernel and lets you answer real questions fast: which process is causing wakeup storms, where file reads are slowing down, and whether packet retransmits are rising during peak traffic. For developers and DevOps engineers, this closes the gap between “something is wrong” and “this specific thing is wrong”.
What we will build
- A repeatable eBPF toolkit for Linux hosts.
- Three practical probes for CPU, storage, and network behavior.
- A lightweight parser that converts noisy logs into a daily summary.
- A small alerting baseline your team can extend over time.
Step 1: Install the tools
On modern Ubuntu and Debian-based systems, start with these packages:
sudo apt update
sudo apt install -y bpftrace linux-tools-common linux-tools-generic
Quick validation:
sudo bpftrace -e 'BEGIN { printf("bpftrace ready\n"); exit(); }'
If this prints "bpftrace ready", your baseline setup is good.
Step 2: Capture CPU wakeup pressure
High CPU usage does not always point to the same underlying problem. Wakeup pressure shows which processes are constantly scheduling work and increasing contention.
sudo bpftrace -e 'tracepoint:sched:sched_wakeup { @wakeups[comm] = count(); } interval:s:10 { print(@wakeups); clear(@wakeups); }'
Use this output to answer three practical questions:
- Is one service dominating wakeups over multiple intervals?
- Did wakeups jump right after a deploy?
- Do wakeup spikes match user-facing latency spikes?
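To answer the first question programmatically, the per-interval dumps can be parsed into `{comm: count}` dicts and fed to a small checker. This is a minimal sketch; the `dominating_process` helper, its thresholds, and the sample data are all hypothetical, not part of bpftrace itself:

```python
def dominating_process(intervals, share=0.5, min_intervals=3):
    """Return the process name that holds more than `share` of all wakeups
    in at least `min_intervals` consecutive intervals, else None.
    `intervals` is a list of {comm: wakeup_count} dicts, one per interval."""
    streak = 0
    leader = None
    for counts in intervals:
        total = sum(counts.values())
        top, top_count = max(counts.items(), key=lambda kv: kv[1])
        if total and top_count / total > share:
            if top == leader:
                streak += 1
            else:
                leader, streak = top, 1
            if streak >= min_intervals:
                return leader
        else:
            leader, streak = None, 0
    return None

# Example: three 10-second intervals where one service keeps the majority.
samples = [
    {"api-server": 900, "cron": 50, "sshd": 20},
    {"api-server": 1100, "cron": 40, "sshd": 30},
    {"api-server": 950, "cron": 60, "sshd": 10},
]
print(dominating_process(samples))  # api-server
```

A single noisy interval will not trip this check; only a sustained streak does, which keeps the signal aligned with the "over multiple intervals" question above.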
Step 3: Inspect file read latency
Storage stalls often appear as random timeout errors in APIs. A histogram of read latency is a fast way to spot this pattern.
sudo bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; } kretprobe:vfs_read /@start[tid]/ { @read_ms[comm] = hist((nsecs - @start[tid]) / 1000000); delete(@start[tid]); } interval:s:15 { print(@read_ms); clear(@read_ms); }'
If you see more events moving into high-latency buckets, check volume saturation, backup schedules, and shared disk contention.
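"Events moving into high-latency buckets" can be checked mechanically once each interval's histogram is parsed into a dict of bucket lower bound (in ms) to event count. A minimal sketch, assuming that parsed form; the helper names and thresholds here are illustrative, not a bpftrace feature:

```python
def high_latency_share(hist, threshold_ms=64):
    """Fraction of reads at or above `threshold_ms`, given a histogram
    mapping bucket lower bound (ms) to event count."""
    total = sum(hist.values())
    slow = sum(c for ms, c in hist.items() if ms >= threshold_ms)
    return slow / total if total else 0.0

def shifted_upward(histories, threshold_ms=64, factor=2.0):
    """True when the latest interval's slow-read share is at least
    `factor` times the first interval's share."""
    shares = [high_latency_share(h, threshold_ms) for h in histories]
    return shares[0] > 0 and shares[-1] >= factor * shares[0]

baseline = {0: 800, 1: 150, 8: 40, 64: 10}    # mostly fast reads
degraded = {0: 500, 1: 200, 8: 150, 64: 150}  # mass moving to slow buckets
print(shifted_upward([baseline, degraded]))   # True
```

Comparing shares rather than absolute counts keeps the check meaningful when overall read volume changes between intervals.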
Step 4: Monitor TCP retransmits
Retransmits are one of the best early indicators of network pain. They often rise before your API error rate looks dramatic.
sudo bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { @retransmits[comm] = count(); } interval:s:10 { print(@retransmits); clear(@retransmits); }'
When retransmits rise in a narrow time window, investigate packet loss, overloaded gateways, and noisy east-west traffic paths first.
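"Rise in a narrow time window" is easiest to judge against a short rolling baseline. One way to sketch that, assuming the per-interval totals have been parsed into a list (the `retransmit_spike` helper and its parameters are hypothetical):

```python
from statistics import mean, pstdev

def retransmit_spike(counts, window=6, z=3.0):
    """Flag the latest interval's retransmit count if it exceeds the mean
    of the preceding `window` samples by `z` standard deviations, with a
    floor of 1.0 so a perfectly flat baseline still needs real growth."""
    baseline, latest = counts[-window - 1:-1], counts[-1]
    mu, sigma = mean(baseline), pstdev(baseline)
    return latest > mu + z * max(sigma, 1.0)

history = [3, 4, 2, 5, 3, 4, 41]  # per-interval retransmit totals
print(retransmit_spike(history))  # True
```

The floor on the standard deviation matters on quiet links: without it, a baseline of identical values makes any tiny increase look like a spike.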
Step 5: Summarize traces for daily operations
Raw trace output is useful for specialists but noisy for most teams. A tiny parser creates a fast daily summary you can post to Slack or your ops channel.
Save the script below as summarize.py and run it against your combined trace log:
python3 summarize.py traces.log
from collections import Counter
import sys

counter = Counter()
with open(sys.argv[1], 'r', errors='ignore') as f:
    for line in f:
        line = line.strip().lower()
        if 'retransmit' in line:
            counter['retransmit_events'] += 1
        if 'read_ms' in line:
            counter['storage_latency_events'] += 1
        if 'wakeups' in line:
            counter['cpu_pressure_events'] += 1

print('Daily eBPF summary')
for key, value in counter.items():
    print(key, value)
This simple summary is enough to spot drift day over day and route incidents to the right owner faster.
Step 6: Create a useful alert baseline
Start with actionable rules, not dozens of noisy thresholds:
- Warning alert when retransmit counts stay elevated for 5 minutes.
- Warning alert when read latency histogram shifts upward for 3 intervals.
- Critical alert when one process dominates wakeups and request latency is rising.
These three alerts are easy to explain and tightly connected to user impact.
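The three rules above can be encoded as one small evaluation pass over the parsed signals. A minimal sketch, where the function names, thresholds, and input shapes are all assumptions to adapt to your own pipeline:

```python
def sustained(samples, threshold, required):
    """True when the last `required` samples all meet or exceed `threshold`."""
    recent = samples[-required:]
    return len(recent) == required and all(s >= threshold for s in recent)

def evaluate_alerts(retransmits, slow_read_shares, wakeup_dominated, latency_rising):
    """Apply the three baseline rules and return (severity, message) pairs."""
    alerts = []
    # 5 minutes of elevated retransmits at 10-second intervals = 30 samples.
    if sustained(retransmits, threshold=20, required=30):
        alerts.append(("warning", "retransmits elevated for 5 minutes"))
    # 3 consecutive intervals with a high slow-read share.
    if sustained(slow_read_shares, threshold=0.10, required=3):
        alerts.append(("warning", "read latency shifted upward for 3 intervals"))
    if wakeup_dominated and latency_rising:
        alerts.append(("critical", "one process dominates wakeups while latency rises"))
    return alerts

print(evaluate_alerts([25] * 30, [0.12, 0.15, 0.2], True, True))
```

Keeping severity and message together in each rule makes it easy to route warnings to a channel and criticals to a pager without a separate mapping layer.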
Production guardrails
- Run tracing scripts with explicit time limits.
- Version control every probe your team uses in production.
- Tag captured logs with host, region, and service name.
- Test probes on canary nodes before fleet rollout.
- Document expected overhead for each script.
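The first guardrail, explicit time limits, can be enforced by a small wrapper instead of relying on operators to remember Ctrl-C. A minimal sketch using the standard library's subprocess timeout; the `run_with_limit` name and defaults are hypothetical:

```python
import subprocess

def run_with_limit(cmd, seconds=60):
    """Run a command with a hard wall-clock limit, killing it if it
    overruns. Returns captured stdout, or "" when the limit is hit."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=seconds,
        )
        return result.stdout
    except subprocess.TimeoutExpired as exc:
        # The probe exceeded its budget; keep any partial output.
        out = exc.stdout or ""
        return out.decode() if isinstance(out, bytes) else out

# Harmless demo; in practice wrap a probe, for example:
# run_with_limit(["sudo", "bpftrace", "-e", probe_source], seconds=300)
print(run_with_limit(["echo", "probe wrapper ready"], seconds=5))
```

Putting the limit in the wrapper, not in each probe, means every version-controlled script inherits the same budget and a forgotten tracer cannot run for days.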
Common mistakes to avoid
- Collecting too many probes at once and overwhelming incident responders.
- Relying only on average latency instead of percentiles and histograms.
- Starting observability work only during outages.
- Ignoring deployment context while reading kernel-level signals.
A practical rollout plan for teams
Week 1: deploy the three probes above on one staging host and one production canary host. Week 2: automate daily summaries and share them with developers and SREs. Week 3: convert recurring signal patterns into alert rules and ownership mappings. Week 4: run a game-day exercise where one team member introduces controlled CPU or network stress and another diagnoses it using only your new signals. This staged rollout builds confidence and avoids the usual observability overload.
Final thoughts
Linux troubleshooting in 2026 rewards teams that can move from symptom to evidence quickly. eBPF is no longer just for kernel experts; it is a practical tool for any engineering team that wants fewer long incidents and faster root-cause analysis. If you begin with CPU wakeups, file read latency, and TCP retransmits, you will solve a surprising number of production problems with clarity instead of guesswork. Keep the first version simple, automate summaries, and improve alert quality each sprint.
