Claude | Cloudflare | May 12, 2026, 1:00 PM

When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug

Summary

This is an English summary of "When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug" published on 2026-05-12.

Key Points

  • Point 1: CUBIC, standardized in RFC 9438, is the default congestion controller in Linux, and as a result governs how most TCP and QUIC connections on the public Internet probe for available bandwidth, back off when they detect loss, and recover afterward.
  • Point 2: At Cloudflare, our open-source implementation of QUIC, quiche, uses CUBIC as its default congestion controller, meaning this code is in the critical path for a significant share of the traffic we serve.
  • Point 3: In this post, we'll tell the story of a bug in which CUBIC's congestion window (cwnd) gets permanently pinned at its minimum and never recovers from a congestion collapse event.

When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug

2026-05-12
Esteban Carisimo, Antonio Vicente
10 min read

CUBIC, standardized in RFC 9438, is the default congestion controller in Linux, and as a result governs how most TCP and QUIC connections on the public Internet probe for available bandwidth, back off when they detect loss, and recover afterward. At Cloudflare, our open-source implementation of QUIC, quiche, uses CUBIC as its default congestion controller, meaning this code is in the critical path for a significant share of the traffic we serve.

In this post, we'll tell the story of a bug in which CUBIC's congestion window (cwnd) gets permanently pinned at its minimum and never recovers from a congestion collapse event. The story starts with a Linux kernel change aimed at bringing CUBIC into line with the app-limited exclusion described in RFC 9438 §4.2-12: a fix to a real problem in TCP that, when ported to our QUIC implementation, surfaced unexpected behaviors in quiche. It has a happy ending: an elegant (near-)one-line fix that broke the cycle.

CUBIC's logic in a nutshell

Before we dive into the core problem, a quick refresher on congestion control algorithms (CCAs) may help to set the stage. The central knob a CCA turns is the congestion window (cwnd): the sender-side cap on how many bytes can be in flight (sent but not yet acknowledged) at any moment. A larger cwnd lets the sender push more data per round trip; a smaller cwnd throttles it. Every loss-based CCA, CUBIC included, is ultimately a policy for how to grow cwnd when the network looks healthy and how to shrink it when it doesn't.

In essence, CCAs aim to maximize data transfer by inferring the "available bandwidth" of the network. After all, no one wants to pay for a 1 Gbps subscription and only use a fraction of it. The family of loss-based algorithms, to which CUBIC belongs, operates on a fundamental premise: (1) if there is no packet loss, increase the sending rate (i.e. increase the bandwidth utilization); (2) if there is loss, assume that the network's capacity has been exceeded and back off (i.e. decrease the bandwidth utilization). This logic is built on several assumptions that have been revisited over the years. However, we'll save that discussion for another time.

The symptom: a test that fails 61% of the time

Our investigation started with a report of unexpected failures in our ingress proxy integration test pipeline. This erratic behavior appeared in tests where CUBIC was evaluated in a scenario of heavy loss in the early part of the connection.

Recovery after congestion collapse is an uncommon regime, but it is exactly the regime a congestion controller exists to handle. Most congestion control tests exercise the steady-state and growth phases of an algorithm; far fewer probe what happens at minimum cwnd, after the connection has been beaten down. Bugs in this corner of the state space are invisible in throughput dashboards, undetectable by static review, and only surface when you deliberately drive a CCA into it and watch whether it can climb back out, which is exactly what this test did.

The simulated test setup includes the following details:

  • quiche HTTP/3 client and server running locally (localhost)
  • RTT = 10 ms (set in the configuration)
  • A 10 MB file download over HTTP/3
  • CUBIC congestion control
  • 30% random packet loss injected during the first two seconds
  • After two seconds, loss stops entirely
  • A generous 10-second timeout to complete the download, which is expected to take four or five seconds

The expected behavior is straightforward: CUBIC should take some hits during the loss phase, reduce its congestion window, and once loss stops, steadily ramp up and finish the download well within the timeout. Instead, we observed over multiple sets of 100 runs that around 60% of our tests were not able to complete the download within the ge