How Tech - Systems Programming

Deploying XDP for High-Performance Load Balancing

May 18, 2026

The Linux kernel’s networking stack was not designed for 40 Gbps line rate. Every inbound packet triggers an allocation for an sk_buff, passes through GRO and netfilter, climbs the protocol stack, and eventually reaches your application, accumulating allocator pressure and cache misses along the way. On a modern NIC at 10 million packets per second, you have roughly 100 nanoseconds per packet. The kernel spends a significant portion of that just allocating the structure that describes the packet.
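The 100-nanosecond figure is simple arithmetic, and the same arithmetic shows how much tighter the budget gets at line rate. The sketch below is illustrative only: the function names are made up for this example, and it assumes standard Ethernet framing, where each frame costs an extra 20 bytes on the wire (preamble, start-of-frame delimiter, inter-frame gap).

```c
#include <stdint.h>

/* Bytes each Ethernet frame occupies on the wire beyond the frame itself:
 * 7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Time available per packet, in nanoseconds, at a given packet rate. */
double ns_per_packet(double packets_per_sec)
{
    return 1e9 / packets_per_sec;
}

/* Maximum packets per second a link can carry for a given frame size.
 * frame_bytes includes the 4-byte FCS; 64 is the Ethernet minimum. */
double line_rate_pps(double link_bits_per_sec, uint32_t frame_bytes)
{
    double wire_bits = (double)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8.0;
    return link_bits_per_sec / wire_bits;
}
```

ns_per_packet(10e6) gives the 100 ns quoted above. For minimum-size 64-byte frames on a 40 Gbps link, line_rate_pps(40e9, 64) comes to about 59.5 million packets per second, which leaves roughly 16.8 ns per packet if a single core had to absorb all of it.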

XDP inserts a BPF hook before sk_buff allocation: in the NIC driver’s NAPI poll loop, operating directly on the raw DMA-mapped page. The first thing an XDP program sees is the raw Ethernet frame, cache-warm from the DMA completion interrupt. No allocations. No refcounting. The difference shows up at around two to three million packets per second, where the traditional stack hits a softirq CPU ceiling and XDP has barely started.
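Working on the raw frame means the program receives little more than two pointers, the start and end of the packet data, and must bounds-check before touching any byte. The sketch below shows that pattern as plain userspace C so it compiles without kernel headers. Everything prefixed sketch_ or SKETCH_ is a local stand-in, not a kernel name: in a real XDP program the pointers come from struct xdp_md, the verdicts are XDP_DROP and XDP_PASS from <linux/bpf.h>, and the header type is struct ethhdr. The pass-IPv4-only policy is a hypothetical chosen purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring kernel values (XDP_DROP = 1, XDP_PASS = 2). */
enum sketch_xdp_action { SKETCH_XDP_DROP = 1, SKETCH_XDP_PASS = 2 };
#define SKETCH_ETH_P_IP 0x0800u

/* 14-byte Ethernet header; byte arrays avoid alignment assumptions. */
struct sketch_ethhdr {
    uint8_t dst[6];
    uint8_t src[6];
    uint8_t proto_be[2];   /* EtherType, network byte order */
};

/* Hypothetical policy: pass IPv4 frames, drop everything else. */
int classify_frame(const void *data, const void *data_end)
{
    const struct sketch_ethhdr *eth = data;

    /* Bounds check before any field access. The BPF verifier rejects
     * programs that read packet bytes without a check like this. */
    if ((const uint8_t *)data + sizeof(*eth) > (const uint8_t *)data_end)
        return SKETCH_XDP_DROP;

    /* Assemble the big-endian EtherType by hand. */
    uint16_t ethertype =
        (uint16_t)((eth->proto_be[0] << 8) | eth->proto_be[1]);

    return ethertype == SKETCH_ETH_P_IP ? SKETCH_XDP_PASS : SKETCH_XDP_DROP;
}
```

Nothing in this path allocates memory or takes a reference. The verdict is computed from bytes already sitting in the receive buffer, which is the property the paragraph above describes.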

Hook Modes and What They Actually Cost

Three XDP attachment modes exist, and the difference between them is not just performance: it determines where in the packet lifecycle your program runs.

Driver mode (XDP_FLAGS_DRV_MODE) requires explicit driver support: mlx5, i40e, ixgbe, bnxt, virtio_net, and a handful of others. The BPF program runs inside the driver’s napi_poll callback, before any kernel structure is allocated. This is the path Facebook’s Katran and Cloudflare’s Unimog operate on, and it is the only mode worth benchmarking for production load balancing.

Generic mode (XDP_FLAGS_SKB_MODE) runs after sk_buff allocation, giving XDP semantics to any interface: loopback, veth, bridge. The kernel allocates the sk_buff, derives the packet data pointers from it, calls your program, then frees the sk_buff if you return XDP_DROP. You get the API without the performance. Useful for development; useless for replacing IPVS at line rate.

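Offload mode (XDP_FLAGS_HW_MODE) runs the BPF program on the NIC itself, on the card’s own network processor rather than an FPGA. Decisions are made on the card with no host CPU involvement. Only Netronome Agilio SmartNICs support it, and the restrictions on which programs can be offloaded are significant.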
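A loader usually picks among the three modes by trying them in order of preference. The following is a minimal sketch of that fallback logic, not any particular loader’s implementation. The flag values mirror XDP_FLAGS_SKB_MODE, XDP_FLAGS_DRV_MODE, and XDP_FLAGS_HW_MODE from <linux/if_link.h> but are redefined under SKETCH_ names so the code builds without kernel headers. The attach_fn callback type and both functions are hypothetical; a real callback would wrap an actual attach call such as libbpf’s bpf_xdp_attach.

```c
#include <stddef.h>
#include <stdint.h>

/* Values mirror <linux/if_link.h>; local names keep the sketch standalone. */
#define SKETCH_XDP_FLAGS_SKB_MODE (1u << 1)  /* generic */
#define SKETCH_XDP_FLAGS_DRV_MODE (1u << 2)  /* driver (native) */
#define SKETCH_XDP_FLAGS_HW_MODE  (1u << 3)  /* offload */

/* Hypothetical callback: returns 0 on success, negative on failure. */
typedef int (*attach_fn)(int ifindex, int prog_fd, uint32_t flags);

/* Try offload, then driver, then generic mode.
 * Returns the flag that attached, or 0 if every attempt failed. */
uint32_t attach_with_fallback(attach_fn attach, int ifindex, int prog_fd)
{
    static const uint32_t order[] = {
        SKETCH_XDP_FLAGS_HW_MODE,
        SKETCH_XDP_FLAGS_DRV_MODE,
        SKETCH_XDP_FLAGS_SKB_MODE,
    };

    for (size_t i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
        if (attach(ifindex, prog_fd, order[i]) == 0)
            return order[i];
    }
    return 0;
}

/* Stand-in for an interface with no native or offload support:
 * only a generic-mode attach succeeds. */
int attach_generic_only(int ifindex, int prog_fd, uint32_t flags)
{
    (void)ifindex;
    (void)prog_fd;
    return flags == SKETCH_XDP_FLAGS_SKB_MODE ? 0 : -1;
}
```

With the stand-in callback, attach_with_fallback returns the generic-mode flag. Given how little generic mode offers at line rate, a production loader may prefer to stop after the driver-mode attempt and report an error rather than fall back silently.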

© 2026 Sumedh S