Code Orange: Fail Small is complete — a stronger Cloudflare network
Key Points
- Snapstone enables health‑mediated config rollouts
- Codex enforces safe patterns via AI reviews
- Fail-stale and explicit fail-open behavior limit blast radius
Summary
Cloudflare has completed "Code Orange: Fail Small", a program that retrofits the network to prevent a repeat of the global outages of November 18 and December 5, 2025. The effort focused on safer configuration changes, limiting blast radius when failures occur, strengthening emergency access and incident workflows, and codifying learnings into an enforceable internal Codex. These changes are now deployed and aim to make regressions visible earlier, reduce customer impact, and ensure consistent engineering practices going forward.
Key Points
Safer configuration rollouts
- Introduced Snapstone: bundles configuration changes for health-mediated, progressive rollouts with real-time monitoring and automated rollback.
- Teams can place any unit of configuration under Snapstone (data files, global flags), so risky change patterns get safe deployment by default.
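The rollout behavior described above can be sketched in a few lines. This is a minimal illustration, not Snapstone's real interface: the `Rollout` class, its `apply` and `healthy` callbacks, and the stage names are all hypothetical. It pushes a change one stage at a time, checks health after each push, and reverts every touched stage on the first unhealthy signal.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Rollout:
    """Hypothetical health-mediated rollout; not Snapstone's actual API."""
    stages: List[str]                    # ordered from least to most critical
    apply: Callable[[str, str], None]    # (stage, version) -> push the config
    healthy: Callable[[str], bool]       # stage -> health verdict after the push
    log: List[str] = field(default_factory=list)

    def run(self, new: str, old: str) -> bool:
        touched: List[str] = []
        for stage in self.stages:
            self.apply(stage, new)
            touched.append(stage)
            self.log.append(f"applied {new} to {stage}")
            if not self.healthy(stage):
                # Automated rollback: revert every touched stage, newest first.
                for name in reversed(touched):
                    self.apply(name, old)
                    self.log.append(f"rolled back {name} to {old}")
                return False
        return True


# Usage: the "canary" stage reports unhealthy, so the change never reaches "global".
state: Dict[str, str] = {"free": "v1", "canary": "v1", "global": "v1"}
rollout = Rollout(
    stages=["free", "canary", "global"],
    apply=lambda stage, version: state.__setitem__(stage, version),
    healthy=lambda stage: stage != "canary",
)
ok = rollout.run(new="v2", old="v1")
```

In a real system the health check would presumably watch metrics over a soak period rather than return immediately; the point here is only that each later stage is gated on the health of the earlier ones.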
Reduced blast radius and safer failure modes
- Adopted "fail stale" (keep serving the last-known-good configuration when a new one is invalid) and an explicit fail-open or fail-closed choice per product.
- Increased service segmentation by customer cohort so a failure affects only a small percentage of traffic (for example, the Workers runtime is split into cohorts, with the free tier receiving changes first).
- Rollouts begin with faster, smaller waves on less-critical segments so regressions are detected early.
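The failure modes above can be made concrete with a small sketch. Everything here is an assumption for illustration: the `ConfigHolder` class, the `rules` payload shape, and the `max_rules` limit are hypothetical, not Cloudflare's implementation. A bad payload is rejected and the last-known-good config keeps serving (fail stale); when no config has ever validated, the explicit per-product mode decides the outcome.

```python
import json
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    FAIL_OPEN = "open"      # allow traffic when no valid config exists
    FAIL_CLOSED = "closed"  # block traffic when no valid config exists


class ConfigHolder:
    """Keeps the last-known-good config and falls back to it on bad input."""

    def __init__(self, mode: FailureMode, max_rules: int = 100):
        self.mode = mode
        self.max_rules = max_rules          # hypothetical validation limit
        self.last_good: Optional[dict] = None

    def update(self, raw: str) -> bool:
        """Validate a new payload; keep serving the old one if it is bad."""
        try:
            candidate = json.loads(raw)
            rules = candidate["rules"]
            if not isinstance(rules, list) or len(rules) > self.max_rules:
                raise ValueError("rules malformed or over the limit")
        except (ValueError, KeyError, TypeError):
            return False                    # fail stale: last_good is untouched
        self.last_good = candidate
        return True

    def allows(self, path: str) -> bool:
        if self.last_good is None:
            # Nothing has ever validated: apply the explicit product choice.
            return self.mode is FailureMode.FAIL_OPEN
        return path not in self.last_good["rules"]


# Usage: an oversized payload is rejected and the earlier config stays active.
holder = ConfigHolder(FailureMode.FAIL_CLOSED, max_rules=2)
before_any_config = holder.allows("/admin")
first = holder.update('{"rules": ["/admin"]}')
second = holder.update('{"rules": ["/a", "/b", "/c"]}')
```

The design choice worth noting is that validation happens before the swap, so a malformed input can never replace a working configuration.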
Better incident access and communications
- Audited tooling and built backup authorization pathways for 18 key services, plus emergency scripts and proxies.
- Regular drills scaled to 200+ engineers; established a dedicated communications team to produce clearer customer updates.
Codified rules and automated enforcement
- Created an internal Codex that turns RFCs into concrete rules (e.g., avoid .unwrap() outside tests) and integrates AI-powered reviews to reject unsafe changes pre-merge.
- The Codex enforces standards across the lifecycle: design review → merge review → deployment → incident analysis.
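As a rough illustration of how a written rule can become a pre-merge check, the sketch below flags `.unwrap()` in Rust source outside test code. It is a deliberately simple text heuristic and an assumption throughout: the Codex's real checks are described as AI-powered reviews, and the function names, path convention, and `#[cfg(test)]` shortcut here are hypothetical.

```python
import re
from typing import Dict, List, Tuple

UNWRAP = re.compile(r"\.unwrap\(\)")


def is_test_path(path: str) -> bool:
    # Assumed convention: integration tests live under a tests/ directory.
    return path.startswith("tests/") or "/tests/" in path


def find_violations(files: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (path, line_number) for each .unwrap() found outside test code."""
    violations: List[Tuple[str, int]] = []
    for path, source in sorted(files.items()):
        if is_test_path(path):
            continue
        for number, line in enumerate(source.splitlines(), start=1):
            if line.strip().startswith("#[cfg(test)]"):
                break  # heuristic: treat the rest of the file as a test module
            code = line.split("//", 1)[0]  # crude: drop trailing line comments
            if UNWRAP.search(code):
                violations.append((path, number))
    return violations


# Usage: only the non-test unwrap in src/config.rs is reported.
change = {
    "src/config.rs": (
        "let n = raw.parse::<u32>().unwrap();\n"
        "#[cfg(test)]\n"
        "mod t { fn f() { Some(1).unwrap(); } }"
    ),
    "tests/smoke.rs": "let v = load().unwrap();",
}
violations = find_violations(change)
```

A production check would need a real parser, since this version misreads `//` inside string literals and assumes test modules sit at the end of a file.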
Operational hygiene
- Added SLOs for services, a global changelog, and consolidated maintenance coordination to reduce drift and regressions.
Practical impact for engineers
- Use Snapstone for any high-risk configuration change; expect progressive rollout and health checks by default.
- Follow Codex rules and submit RFCs for new patterns; anticipate AI review feedback on merge requests.
- Design services to fail stale, or to fail open or closed as appropriate for the product, and validate upstream inputs explicitly.
- Plan deployments against cohorts and leverage segment-specific testing; monitor health metrics closely during rollouts.
- Train on break-glass procedures and support incident communication cadence during drills.
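Planning a deployment against cohorts might look like the following sketch. The cohort naming, the shard count, and both function names are hypothetical; the only ideas taken from the text are that customers are segmented into cohorts and that less-critical segments, such as the free tier, receive a change first.

```python
import hashlib
from typing import Dict, List


def cohort_for(customer_id: str, plan: str, shards: int = 4) -> str:
    """Deterministically map a customer to a cohort such as 'free-2'."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shards
    return f"{plan}-{shard}"


def plan_waves(
    customers: Dict[str, str], plan_order: List[str], shards: int = 4
) -> List[List[str]]:
    """Group customers into rollout waves, ordered by plan and then by shard."""
    buckets: Dict[str, List[str]] = {}
    for customer_id, plan in customers.items():
        if plan not in plan_order:
            raise ValueError(f"plan {plan!r} has no position in the rollout order")
        buckets.setdefault(cohort_for(customer_id, plan, shards), []).append(customer_id)
    waves: List[List[str]] = []
    for plan in plan_order:
        for shard in range(shards):
            members = buckets.get(f"{plan}-{shard}")
            if members:
                waves.append(sorted(members))
    return waves


# Usage: free-tier cohorts are scheduled before pro, and pro before enterprise.
customers = {"acme": "enterprise", "bob": "free", "carol": "free", "dana": "pro"}
waves = plan_waves(customers, plan_order=["free", "pro", "enterprise"])
```

Hashing the customer ID keeps cohort membership stable between deployments, so a given customer always sees a change at the same point in the sequence.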
Net takeaway
The program shifts failure detection left, constrains blast radius, and makes safe behavior the default, reducing the chance that a single configuration or code error causes a global outage.