How Tech - Systems Programming

How Tech - Systems Programming

Linux OOM Killer: How the Kernel Decides Which Process Dies

Systems Programming Deep Dive | Memory Management Series

Apr 10, 2026
∙ Paid

Every production engineer eventually stares at a dead process and a dmesg line ending with:

Killed process 14392 (postgres) total-vm:8388608kB, anon-rss:4194304kB

Understanding exactly why Postgres died instead of the Python analytics job next to it means understanding oom_badness() — the kernel function that produces the death warrant.


Part I — How It Works


The Overcommit Contract

Linux lies to your processes. When you call malloc(1GB) on a machine with 256MB free, the kernel says yes, maps virtual address space, and hands back a pointer. No physical memory is committed. Pages only get faulted in on first write — demand paging. This is vm.overcommit_memory = 0 (heuristic mode), where the kernel estimates likelihood of use and accepts most requests.

The consequence: your system can carry 40GB committed virtual memory on 16GB RAM plus 8GB swap. /proc/meminfo tracks this via CommitLimit (the ceiling) and Committed_AS (what’s actually promised). When real demand exceeds physical capacity, kswapd activates — reclaims clean page cache, writes dirty anonymous pages to swap. When kswapd falls behind and direct reclaim inside __alloc_pages_slowpath() fails, the allocating CPU thread is about to block indefinitely. That’s when out_of_memory() gets called.


The Scoring Algorithm

oom_badness() in mm/oom_kill.c reduces each candidate process to a single integer. The formula:

points = rss_pages + swap_pages + page_table_pages
points += (oom_score_adj / 1000.0) × total_ram_pages

RSS here counts anonymous pages (heap, stack, anonymous mmap) plus file-backed mapped pages. Swap usage is added directly — a process that’s been heavily paged out still carries its full working set in the score. Page table overhead is included too; processes with fragmented mappings pay extra.

The /proc/[pid]/oom_score_adj knob ranges from -1000 to +1000. At -1000, oom_badness() returns LONG_MIN — the process is unconditionally immune. At +1000, the adjustment adds the equivalent of all system RAM to the raw score, making selection near-certain. The /proc/[pid]/oom_score value you read is this normalized to 0–1000 relative to total RAM.

select_bad_process() iterates every task on the system, calls oom_badness() for each, and returns the highest scorer. Thread groups are evaluated as a unit — if one thread scores highest, the entire process dies.


User's avatar

Continue reading this post for free, courtesy of Systems.

Or purchase a paid subscription.
© 2026 Sumedh S · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture