
OOMKill Is the Next Lie: Why Kubernetes Memory Limits Are Hiding Your Latency Spikes

Bableen Kaur, Engineer · Zop.Dev · 9 min read

CPU throttling has a visibility problem that the Kubernetes community partially fixed. container_cpu_cfs_throttled_seconds_total exposes the throttle. Grafana dashboards flag it. Engineers know to look for it.

Memory limits have no equivalent. There is no container_memory_approaching_limit_seconds_total. The container either runs fine or it dies. The Linux kernel OOM killer fires, the process is terminated, and the kubelet restarts the container. The event shows up in kubectl get events for a limited time (one hour under the API server's default --event-ttl), then disappears.

Meanwhile, your p99 latency dashboard looks normal until the pod count drops, then spikes 300% for 90 seconds, then recovers. No CPU throttle metric fires. No memory warning fires. Your on-call engineer spends 20 minutes trying to correlate the latency spike to a deployment, a config change, or a dependency failure before finding the OOMKill event in the pod history.

This is a pattern we see consistently. CPU throttling is a metric problem that degrades gradually. OOMKill is an availability problem that hits hard and disappears fast.

OOMKill vs CPU Throttling: Two Different Failure Modes

CPU throttling happens when a container tries to use more CPU than its limit allows. The kernel throttles the process: it still runs, just slower. The degradation is gradual and proportional. A container that demands 2x its CPU limit runs at roughly half speed. As traffic increases, latency increases gradually and the throttle metric climbs. There is time to detect and respond.

OOMKill is different. When a container's RSS (resident set size) hits the cgroup memory limit, the kernel OOM killer terminates the process with SIGKILL. Not throttled. Terminated. No grace period, no drain, no SIGTERM sequence. The container's termination reason is recorded as OOMKilled. The kubelet restarts the container according to the pod's restartPolicy, with a growing back-off delay if the kills repeat. The replacement pulls its image if it is not already cached on the node, initializes its runtime, warms its connection pool, and starts serving traffic. That process takes 30-90 seconds depending on the application. During those 30-90 seconds, the pod is not serving requests.

| Dimension | CPU Throttling | OOMKill |
| --- | --- | --- |
| Failure type | Gradual degradation | Hard stop |
| Application impact | Increased latency | Requests dropped during restart |
| Detection metric | container_cpu_cfs_throttled_seconds_total | container_oom_events_total |
| Visibility | High: shows in dashboards | Low: appears in events only |
| Recovery time | Immediate when load drops | 30-90 seconds for pod restart |
| Warning before failure | Yes: throttle metric climbs | No: kill fires at limit hit |

The practical consequence: CPU throttling gives you time to respond. OOMKill gives you a post-mortem.

The OOMKill Latency Signature

OOMKill produces a specific pattern in latency and availability graphs that is easy to identify once you know what to look for.

Phase 1: normal operation. Memory increases gradually as the application processes requests, the JVM heap fills, or a cache warms up. This phase can last hours. Nothing looks wrong.

Phase 2: limit hit. RSS hits the cgroup limit. The OOM killer fires. The container is terminated. The count of ready pods drops by 1 (or more, if multiple replicas are hitting the same limit simultaneously).

Phase 3: latency spike. The remaining pods absorb 100% of traffic. If the service was running at 60% capacity on 5 pods, the 4 remaining pods now serve at 75% capacity (60% × 5 / 4). If the killed pod was handling a disproportionate fraction of requests (sticky sessions, sharding), the remaining pods may spike significantly higher. p99 latency increases 200-400%.

Phase 4: recovery. The replacement comes up. Image pull, if the image is not already cached on the node: 10-30 seconds. Runtime initialization: 5-20 seconds. Connection pool warmup: 10-30 seconds. Health check passing: 5-10 seconds. Total: 30-90 seconds from kill to serving traffic when every stage applies.
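The arithmetic behind phases 3 and 4 can be sketched in a few lines. This is a minimal illustration, assuming load is spread evenly across pods and that the restart stages run one after another; the function and stage names are made up for the example.

```python
def utilization_after_kill(utilization: float, pods: int, killed: int = 1) -> float:
    """Per-pod utilization once `killed` pods stop serving (assumes even load balancing)."""
    if not 0 < killed < pods:
        raise ValueError("killed must be between 1 and pods - 1")
    return utilization * pods / (pods - killed)


# Restart stage durations in seconds as (low, high), using the ranges quoted above.
RESTART_STAGES = {
    "image_pull": (10, 30),
    "runtime_init": (5, 20),
    "pool_warmup": (10, 30),
    "health_check": (5, 10),
}


def recovery_window(stages=RESTART_STAGES):
    """Best-case and worst-case seconds from kill to serving, assuming sequential stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


print(f"per-pod utilization after one kill: {utilization_after_kill(0.60, 5):.0%}")
print("recovery window (s):", recovery_window())
```

Losing two of five pods at the same starting point pushes the survivors to 100% utilization, which is why simultaneous kills across replicas hurt far more than a single one.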

Architecture diagram

The cycle repeats if the root cause is not fixed. An application with a gradual memory leak will OOMKill on a predictable schedule: every N hours as the RSS climbs from baseline to limit. Each kill looks like a random latency event until you plot OOMKill events against the latency timeline and see the correlation.
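One way to check for that correlation is to pair each latency spike with any OOMKill that happened shortly before it, and to look at the gaps between kills. The sketch below assumes you have already exported both sets of timestamps; the data, the 120-second window, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def match_spikes_to_kills(spike_starts, kill_times, window=timedelta(seconds=120)):
    """Map each latency spike start to the OOMKills seen in the `window` before it."""
    matches = {}
    for spike in spike_starts:
        matches[spike] = [
            kill for kill in kill_times
            if timedelta(0) <= spike - kill <= window
        ]
    return matches


def kill_intervals(kill_times):
    """Gaps between consecutive kills; near-constant gaps suggest a steady leak."""
    ordered = sorted(kill_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical timestamps, for illustration only.
kills = [
    datetime(2024, 1, 1, 2, 0, 0),
    datetime(2024, 1, 1, 8, 0, 5),
    datetime(2024, 1, 1, 14, 0, 2),
]
spikes = [datetime(2024, 1, 1, 2, 0, 20), datetime(2024, 1, 1, 11, 30, 0)]

for spike, matched in match_spikes_to_kills(spikes, kills).items():
    print(spike, "->", matched or "no OOMKill nearby")
print("gaps between kills:", kill_intervals(kills))
```

A spike with no matching kill points back at deployments or dependencies; a run of spikes that each match a kill, spaced at a near-constant interval, points at a leak.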

Language-Specific Memory Behavior Under cgroup Limits

Different runtimes fail in different ways when memory limits are set incorrectly.

JVM (Java, Kotlin, Scala): Before Java 10 (and before the 8u191 backport on JDK 8), the JVM does not read cgroup memory limits. It sizes the maximum heap from the host machine's total RAM: 25% of host memory by default. On a 64 GB node, the JVM lets the heap grow toward 16 GB in a container with a 2 GB memory limit. The heap grows. The container hits 2 GB RSS. The OOM killer fires. The JVM had no idea the limit existed. From Java 10 onward, -XX:+UseContainerSupport is enabled by default and the JVM reads the cgroup limit. But legacy images built on older JDK 8 builds, which predate this flag, still exist in production.

Go: By default the Go garbage collector paces itself on heap growth, not on the container's limit. With the default GOGC=100, a cycle starts when the heap has roughly doubled since the previous collection, and the runtime does not read the cgroup limit, so the heap can grow past the limit between cycles. The fix is GOMEMLIMIT, introduced in Go 1.19, which gives the GC a soft memory ceiling to target below the cgroup limit. As the heap approaches that ceiling, GC triggers more frequently to stay under it. Each GC cycle creates a CPU burst. The container looks like it is CPU throttling (GC CPU spike) when the actual driver is memory pressure.

Node.js: The V8 heap has a default max of roughly 1.5 GB on 64-bit systems in older Node.js releases (the default varies by version). In a container whose memory limit is below that default, the V8 heap can grow past the container limit before V8 sees any reason to collect. The fix: set the --max-old-space-size flag (a value in MiB) to 75% of the container's memory limit.

| Runtime | Failure mode | Detection signal | Fix |
| --- | --- | --- | --- |
| JVM (pre-Java 10) | Heap ignores cgroup limit | OOMKill + large heap config | Move to a build with -XX:+UseContainerSupport (8u191+ or Java 10+), or set -Xmx explicitly |
| JVM (Java 10+) | GC thrashing near limit | High GC pause time + CPU spikes | Set -Xmx to 75% of limit |
| Go | GC pressure masking as CPU throttle | CPU burst without high RSS | Set GOMEMLIMIT to 90% of limit |
| Node.js | V8 heap exceeds container limit | OOMKill on memory-light containers | Set --max-old-space-size to 75% of limit |
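The percentages in the table translate into concrete settings once you know the container's limit. Here is a small sketch that derives them; the helper name is hypothetical, and it assumes the limit is passed in as bytes and that rounding down to whole MiB is acceptable.

```python
MIB = 1024 * 1024


def runtime_memory_settings(limit_bytes: int) -> dict:
    """Derive per-runtime memory settings from a container memory limit in bytes."""
    limit_mib = limit_bytes // MIB
    return {
        # JVM max heap at 75% of the limit, in megabytes.
        "jvm": f"-Xmx{limit_mib * 75 // 100}m",
        # Go soft memory limit at 90% of the limit (GOMEMLIMIT accepts a MiB suffix).
        "go": f"GOMEMLIMIT={limit_mib * 90 // 100}MiB",
        # V8 old-space cap at 75% of the limit; the flag takes a value in MiB.
        "node": f"--max-old-space-size={limit_mib * 75 // 100}",
    }


for runtime, setting in runtime_memory_settings(2 * 1024 * MIB).items():
    print(f"{runtime}: {setting}")
```

Deriving these values from the limit at deploy time, rather than hardcoding them in the image, keeps the runtime ceiling from drifting when someone changes the pod spec.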

The Memory Limit Formula: Request, Limit, and Headroom

The most common misconfiguration is setting memory limit equal to memory request. This comes from Kubernetes QoS documentation: Guaranteed QoS class (the highest priority, least likely to be evicted) requires that limits equal requests, for both CPU and memory, for every container.

The QoS benefit is real. But the cost is that any transient memory spike above the request value causes an OOMKill. A JVM running GC, a Go runtime sweeping a large heap, or a Node.js V8 compiling a hot function will spike RSS above steady-state by 20-40% for seconds at a time. If limit equals request at steady-state RSS, those GC spikes kill the pod.

The correct formula: requests set at P90 steady-state RSS measured over a 7-day window. Limit set at 1.5x to 2x the request, providing headroom for GC burst, traffic spikes, and initialization overhead.
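As a rough sketch of that formula, the function below takes RSS samples (in MiB) collected over the measurement window and returns a request and a limit. It assumes a nearest-rank percentile is good enough for sizing; the function names and sample values are illustrative only.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def size_memory(rss_samples_mib, headroom=1.5):
    """Return (request, limit) in MiB: request at P90 RSS, limit at headroom x request."""
    if not 1.5 <= headroom <= 2.0:
        raise ValueError("headroom is expected to be between 1.5x and 2x")
    request = percentile_nearest_rank(rss_samples_mib, 90)
    limit = math.ceil(request * headroom)
    return request, limit


# Hypothetical RSS samples in MiB, for illustration only.
samples = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(size_memory(samples))
print(size_memory(samples, headroom=2.0))
```

Using P90 rather than the maximum keeps one-off outliers from inflating the request; the headroom multiplier is what absorbs them.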

Architecture diagram

In production, teams that set limit equal to request see 3-8x higher OOMKill rates than teams with 1.5x headroom. Setting limit to 1.5x request reduces OOMKills by 70-80% in the applications we have measured.

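The trade-off: 1.5x headroom means each pod can use 50% more memory than its request before being killed. It also moves the pod from Guaranteed to Burstable QoS, so it ranks behind Guaranteed pods when the node itself runs short of memory. The scheduler sees requests, not limits, when placing pods. A node can technically over-commit memory by the aggregate headroom delta across all its pods. Monitor node_memory_MemAvailable_bytes (relative to node_memory_MemTotal_bytes) to ensure actual node memory stays above 20% free.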
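To make the over-commit concrete, the sketch below adds up requests and limits for the pods on one node and checks the 20% free floor. The field names, the pod list, and the node sizes are hypothetical; in practice the inputs would come from your pod specs and node metrics.

```python
def overcommit_report(pods, allocatable_mib, mem_available_mib, mem_total_mib, min_free=0.20):
    """Summarize memory over-commit on one node from per-pod requests and limits (MiB)."""
    requested = sum(p["request_mib"] for p in pods)
    limited = sum(p["limit_mib"] for p in pods)
    free_fraction = mem_available_mib / mem_total_mib
    return {
        "requests_fit": requested <= allocatable_mib,         # what the scheduler checks
        "headroom_delta_mib": limited - requested,            # aggregate burst allowance
        "overcommit_mib": max(limited - allocatable_mib, 0),  # limits beyond what the node has
        "free_fraction": free_fraction,
        "below_free_floor": free_fraction < min_free,
    }


# Hypothetical node: ten pods sized at request 1000 MiB, limit 1500 MiB.
pods = [{"request_mib": 1000, "limit_mib": 1500} for _ in range(10)]
report = overcommit_report(pods, allocatable_mib=12000, mem_available_mib=2400, mem_total_mib=16000)
for key, value in report.items():
    print(f"{key}: {value}")
```

In this example the requests fit, so the scheduler is satisfied, yet the limits exceed what the node can supply and free memory is already under the floor.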

Detection: The 4 Metrics to Catch OOMKills Before Users Do

OOMKills are detectable before they cause user-visible incidents if you have the right metrics in place.

| Metric | Source | What it measures | Alert threshold |
| --- | --- | --- | --- |
| container_oom_events_total | cAdvisor | Cumulative OOMKill count per container | Any increase in a 5-min window |
| container_memory_working_set_bytes / limit | cAdvisor | Memory utilization ratio | Above 85% for 5 minutes |
| kube_pod_container_status_last_terminated_reason | kube-state-metrics | Last termination reason (OOMKilled) | reason="OOMKilled" series at 1 in last 15 min |
| kube_node_status_condition{condition="MemoryPressure"} | kube-state-metrics | Node-level memory pressure | status="true" series at 1 on any node |

container_memory_working_set_bytes above 85% of the limit is the early warning. At 85%, the container has only 15% of its limit left as headroom. A GC cycle that adds 20% RSS will hit the limit. Alert at 85% and investigate before the kill fires.
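The "above 85% for 5 minutes" condition is a sustained-threshold check, which an alerting system would normally evaluate for you. The sketch below shows the same logic on raw samples, assuming (timestamp in seconds, working set in bytes) pairs sorted by time; the function name and sample data are hypothetical.

```python
def sustained_above(samples, limit_bytes, threshold=0.85, duration_s=300):
    """True if working set / limit stays above `threshold` for at least `duration_s` seconds."""
    breach_start = None
    for ts, working_set in samples:
        if working_set / limit_bytes > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the clock
    return False


LIMIT = 1000
# Hypothetical one-minute samples: a steady breach, and one interrupted by a dip.
steady = [(t, 900) for t in range(0, 361, 60)]
dipped = [(t, 700 if t == 180 else 900) for t in range(0, 361, 60)]

print(sustained_above(steady, LIMIT))
print(sustained_above(dipped, LIMIT))
```

Requiring the breach to be sustained keeps a single GC burst from paging anyone, while a container that sits near its limit still trips the alert.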

container_oom_events_total is the kill confirmation. Alert on any increase, not on a threshold. One OOMKill on a production pod is one too many. Investigate immediately: check if it is a memory leak (RSS growing monotonically) or a sizing issue (RSS plateauing near the limit).
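A quick first pass at the leak-versus-sizing question is to compare recent RSS against earlier RSS. The heuristic below is only a sketch: it assumes evenly spaced samples, and the 5% growth band and 85% "near the limit" cutoff are illustrative choices, not measured thresholds.

```python
def classify_rss_trend(rss, limit, growth_band=0.05, near_limit=0.85):
    """Label an RSS series 'leak', 'sizing', or 'ok' by comparing its middle and last thirds."""
    if len(rss) < 6:
        raise ValueError("need at least 6 samples")
    third = len(rss) // 3
    mid_avg = sum(rss[third:2 * third]) / third
    late_avg = sum(rss[-third:]) / third
    # Growth between the middle and last third, as a fraction of the limit.
    if (late_avg - mid_avg) / limit > growth_band:
        return "leak"    # still climbing late in the window
    if late_avg / limit > near_limit:
        return "sizing"  # flat, but parked close to the limit
    return "ok"


# Hypothetical RSS series in MiB against a 1000 MiB limit.
print(classify_rss_trend([100, 200, 300, 400, 500, 600, 700, 800, 900], 1000))
print(classify_rss_trend([500, 800, 900, 900, 910, 900, 905, 900, 910], 1000))
print(classify_rss_trend([300] * 9, 1000))
```

Skipping the first third of the window keeps normal startup and cache warmup from being read as a leak.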

Tooling that surfaces live pod state, including crashloop and OOMKill events, covers this detection surface directly. The pattern is the same: make the event visible in real time instead of discovering it in a post-mortem. Memory limits lie by being invisible. Make them visible and OOMKill becomes a configuration problem, not an incident.

Bableen works on the Kubernetes side of Zop.Dev, focused on cluster ops, autoscaling, and the long tail of pod-level reliability work. She writes about MTTR, OOMKill diagnosis, and what runbooks actually need to do.
