Overview
Published: 2026-05-14
Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

2026-05-14
James Morrison, Christian Endres

At Cloudflare, we are heavy users of ClickHouse, an open-source analytical database management system. We redesigned one of our largest ClickHouse tables to add a column to the partitioning key. The change enabled per-tenant retention on a table that serves hundreds of internal teams. The design went through several rounds of revision and review with engineers across multiple teams before we landed on the final approach.

But a few weeks after rollout, the jobs that produce most of Cloudflare's bills were running up against their hard daily deadline. All the usual suspects looked clean: I/O, memory, rows scanned, parts read. Everything we would normally check when a ClickHouse query is slow appeared to be normal. The problem turned out to be lock contention in query planning, something we'd never had reason to look for before.

This is the story of how this migration exposed a hidden bottleneck in ClickHouse's internals, and the patches we wrote to fix it.

The setup: a petabyte-scale analytics platform

We use ClickHouse to store over a hundred petabytes of data across a few dozen clusters. To simplify onboarding for our many internal teams, we built a system called "Ready-Analytics" in early 2022. The premise is simple: instead of designing new tables, teams can stream data into a single, massive table. Datasets are disambiguated by a namespace, and each record uses a standard schema (e.g., 20 float fields, 20 string fields, a timestamp, and an indexID).

In ClickHouse, the way data is sorted is crucial to query performance. This is where the indexID comes into play. It’s a string field, which forms part of the primary key, meaning that every individual namespace can have its data sorted in a way that is optimal for the queries the owners of that namespace expect to be running.
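To make the role of the sort order concrete, here is a minimal Python sketch, not ClickHouse code. It assumes a sort key that leads with the namespace, followed by indexID and timestamp; the row values and the `lookup` helper are hypothetical. Once rows are kept in that order, everything for one (namespace, indexID) pair sits in a single contiguous range that a binary search can locate without scanning other namespaces.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows: (namespace, indexID, timestamp, value).
rows = [
    ("dns", "zone-b", 3, 0.7),
    ("waf", "rule-1", 1, 0.2),
    ("dns", "zone-a", 2, 0.4),
    ("dns", "zone-a", 1, 0.9),
    ("waf", "rule-1", 2, 0.5),
]

# Keep the rows ordered by (namespace, indexID, timestamp).
rows.sort(key=lambda r: r[:3])

# The (namespace, indexID) prefix of every sorted row, used as the search key.
keys = [r[:2] for r in rows]


def lookup(namespace, index_id):
    """Return all rows for one (namespace, indexID) as a contiguous slice."""
    lo = bisect_left(keys, (namespace, index_id))
    hi = bisect_right(keys, (namespace, index_id))
    return rows[lo:hi]


result = lookup("dns", "zone-a")
```

Because each namespace owner chooses what goes into indexID, each one effectively decides how its own slice of the table is ordered.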
Altogether, we end up with a primary key that looks like this: (namespace, indexID, timestamp).

This system is popular, with hundreds of applications using it. It had already grown to more than 2 PiB of data by December 2024, and an ingestion rate of millions of rows per second. But it had one critical flaw: its retention policy.

The problem: one retention policy to rule them all

Cloudflare has been using ClickHouse for many years, since before it had native Time-to-Live (TTL) features. Consequently, we built our own retention system based on partitioning. The Ready-Analytics table was partitioned by day, and our retention job simply dropped partitions older than 31 days.

This "one-size-fits-all" 31-day retention was a major limitation. Some teams needed to store data for years due to legal or contractual obligations, while others needed only a few days. This restriction meant these use cases couldn't use Ready-Analytics and had to opt for a conventional setup, which has a far more complex onboarding process. We needed a new system that allowed per-namespace retention.

The solution: a new partitioning scheme

We considered two main approaches:

A Table-per-Namespace: This would naturally solve the retention problem but would require significant new automation to manage thousands of tables on demand.

A New Partitioning Key: We could change the partitioning key from just (day) to (namespace, day).

We chose the second option. This would allow our existing retention system to continue managing partitions, but now with per-namespace granularity. We knew this would increase the total number of data parts in the table, but we made a key assumption: since every query is filtered by a specific namespace, the number of parts read by any single query shouldn't change. We believed this meant performance would be unaffected.
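The reasoning behind this choice can be sketched in a few lines of Python. This is an illustrative model only, not the actual retention job: the namespace names, the retention values, and the `expired_partitions` helper are all hypothetical, and it counts partitions rather than the data parts inside them. It shows two things: a (namespace, day) key lets a retention job drop data per namespace, and a namespace-filtered query touches the same number of partitions under either scheme even though the table as a whole holds more of them.

```python
from datetime import date, timedelta

# Hypothetical per-namespace retention in days; 31 mirrors the old table-wide policy.
RETENTION_DAYS = {"audit": 365, "debug": 3}
DEFAULT_RETENTION_DAYS = 31


def expired_partitions(partitions, today):
    """Return the (namespace, day) partitions that are past their retention."""
    expired = []
    for namespace, day in partitions:
        keep = RETENTION_DAYS.get(namespace, DEFAULT_RETENTION_DAYS)
        if day < today - timedelta(days=keep):
            expired.append((namespace, day))
    return expired


today = date(2025, 1, 31)
days = [today - timedelta(days=n) for n in range(40)]
namespaces = ["audit", "debug", "metrics"]

# Old scheme: one partition per day. New scheme: one per (namespace, day).
old_partitions = set(days)
new_partitions = {(ns, d) for ns in namespaces for d in days}

# Per-namespace retention is only expressible with the new key.
to_drop = expired_partitions(new_partitions, today)
dropped_per_namespace = {
    ns: sum(1 for n, _ in to_drop if n == ns) for ns in namespaces
}

# Partitions touched by a query on one namespace over the last 7 days.
last_week = set(days[:7])
touched_old = [d for d in old_partitions if d in last_week]
touched_new = [
    p for p in new_partitions if p[0] == "metrics" and p[1] in last_week
]
```

In this toy model the table goes from 40 partitions to 120, while the query still touches 7 partitions in both cases. The sketch only counts partitions; it does not model any other cost of having more of them in the table.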
Figure: This shows how we changed the partitioning, allowing us to cheaply drop data for a single namespace.

This new system also allowed us to build a sophisticated sto