Claude · Cloudflare · 2026/06/01 16:53

How we reduced core unit boot time from hours to minutes

This section is condensed and reorganized so the key points can be read first.

Quick Digest

Summary

Summary of "How we reduced core unit boot time from hours to minutes"

Key Points

  • Point 1: Cloudflare's core is the set of centralized data centers that run its control plane, billing, and analytics, distinct from the globally distributed edge that handles user traffic.
  • Point 2: Core servers are bare metal, and when issues happen during reboot, the consequences can cascade fast.
  • Point 3: Their boot sequence is orchestrated by UEFI, the modern firmware standard that initializes hardware and hands off control to the operating system.

Summary

This is a concise summary of "How we reduced core unit boot time from hours to minutes", published on 2026-06-01.

Full Translation

Translation

This section presents the translation while preserving the flow of the original.

How we reduced core unit boot time from hours to minutes (original title)

Overview

Published: 2026-06-01. Translation generation failed, so the original text is preserved as-is.

Original text

How we reduced core unit boot time from hours to minutes

2026-06-01

Giovanni Pereira Zantedeschi, Nnamdi Ajah, Omar Sheik-Omar

Cloudflare's core is the centralized data centers that run our control plane, billing, and analytics -- distinct from the globally distributed edge that handles user traffic. Core servers are bare metal, and when issues happen during reboot, the consequences can cascade fast. Their boot sequence is orchestrated by UEFI, the modern firmware standard that initializes hardware and hands off control to the operating system. Small quirks in that handoff can have outsized consequences.

After a routine firmware update, some of our core servers were taking four hours to come back online, rather than just minutes as they did before. What should have been a one-day fleet-wide rollout was stretching into multi-day slogs. New nodes faced the full timeout gauntlet on their very first boot. Maintenance windows ballooned. Engineering teams had to babysit upgrades that should have run unattended.

This issue affected the entire Gen12 fleet -- nearly 2,000 units. Every unexpected failure mid-upgrade meant restarting the entire cycle, and new capacity sat idle waiting for the timeout gauntlet to clear.

This is the story of how we tracked the cause to a firmware quirk and an over-eager linear search through every available network boot interface, and how we cut total boot and upgrade time from hours back down to minutes. Along the way, we'll share what we learned about UEFI internals, vendor-specific quirks, and the automation strategies that ultimately solved the problem.

The network boot interface

A network boot interface allows a server to boot its operating system over the network instead of from local storage. This is critical for centralized, automated, and scalable control over how machines start up, especially across a globally distributed fleet serving different workloads.
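
To make the cost of the linear search mentioned in the introduction concrete, here is a minimal Python sketch. It is a toy model, not Cloudflare's firmware logic or tooling: it assumes the firmware tries each network boot entry in order and waits out a full timeout on every entry that gets no answer. The names `BootEntry` and `time_to_boot`, the entry labels, the 300-second timeout, and the count of eight unreachable entries are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BootEntry:
    """One candidate in a network boot order (hypothetical model)."""
    name: str
    reachable: bool  # True if a boot server answers on this entry


def time_to_boot(entries, timeout_s, success_s):
    """Walk the boot order front to back, as a linear search would.

    Every unreachable entry costs a full timeout before the next one is
    tried; the first reachable entry costs `success_s` and ends the search.
    Returns (seconds_elapsed, name_of_entry_that_booted_or_None).
    """
    elapsed = 0
    for entry in entries:
        if entry.reachable:
            return elapsed + success_s, entry.name
        elapsed += timeout_s
    return elapsed, None


# Illustrative numbers only: eight dead entries ahead of the one that works.
boot_order = [BootEntry(f"nic{i}-https-ipv4", False) for i in range(8)]
boot_order.append(BootEntry("nic8-ipxe-ipv4", True))

# Full boot order versus, for comparison, trying only the working entry.
slow_s, booted_from = time_to_boot(boot_order, timeout_s=300, success_s=60)
fast_s, _ = time_to_boot(boot_order[-1:], timeout_s=300, success_s=60)

print(f"full boot order: {slow_s / 60:.0f} min via {booted_from}")
print(f"working entry only: {fast_s / 60:.0f} min")
```

In this model the waiting time grows with the number of entries tried before the working one, multiplied by the per-entry timeout, so a long boot order with generous timeouts can dwarf the time the actual boot takes.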
Since our servers are located in different environments and serve different purposes, they differ in which network boot interface they require. The two primary interfaces are the Preboot Execution Environment (PXE) and Unified Extensible Firmware Interface (UEFI) HTTPS boot. As part of our reboot process, our servers usually go through PXE for various automation reasons.

At Cloudflare, we use iPXE, an open-source network boot firmware that supports modern protocols like HTTP and HTTPS. This allows computers to boot operating systems directly from web servers, the cloud, or enterprise storage networks with significantly faster speeds and greater reliability. For organizations, iPXE turns the boot process into a programmable workflow. It offers advanced scripting capabilities that allow IT teams to automate complex deployments, such as provisioning servers based on specific hardware configurations or managing secure, diskless workstations.

Some of our hardware supports HTTPS-based UEFI network boot, which enables the computer's motherboard firmware to natively download operating system files securely.

The linear search

Our tale begins with that fateful firmware update. Following the update, the first reports came through our internal channels: servers weren't coming back online. Monitoring dashboards showed machines stuck in a pre-OS state for far longer than expected.

Our initial suspicion was a firmware regression: perhaps the update itself had introduced a bug that was hanging the boot process. To rule that out, we pulled up the serial console on an affected machine and watched a boot cycle in real time. The firmware Power On Self Test (POST) completed normally and hardware initialization looked healthy. But then, instead of quickly reaching the network boot stage and pulling down an OS image, the server sat waiting. And waiting.
The console output told the story: the system was attempting an IPv4 HTTPS network boot, timing out after several minutes, then trying IPv4 iPXE, timing out a