Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse
Key Points
- Lock contention in query planning (mutex protecting parts) caused the slowdown
- Three fixes: shared lock, shared parts cache (PR #85535), binary-search pruning by namespace
- Result: a large latency drop from the locking and caching fixes, then a further ~50% reduction after the binary-search deploy
Summary
A partitioning change from (day) to (namespace, day) to enable per-tenant retention greatly increased the total number of parts in a large ClickHouse table. Although individual queries still read the same data, end-to-end SELECT latencies rose because query planning hit severe lock contention and expensive copies of the parts vector. Investigation using ClickHouse trace_log (switching from CPU to Real traces) showed planners waiting on a mutex protecting the parts list and spending significant time copying and scanning that list.
Key Points
Root cause
- The migration added many parts; query planning acquired an exclusive mutex (MergeTreeData) and copied the full parts vector for every planner, creating high contention and CPU/wait overhead.
How it was diagnosed
- Used ClickHouse trace_log and flame graphs; Real traces revealed blocking on mutexes (not just CPU hot spots).
Fixes implemented
- Use a shared lock (std::shared_lock) for planners so multiple planners can read the parts list concurrently (large immediate latency drop).
- Avoid full-vector copies by maintaining a shared read-only cache of the parts list; only regenerate on mutations and copy only filtered results (further improvement; upstream PR #85535, available since ClickHouse 25.11).
- Use binary search over the sorted partition key (namespace first) to narrow the candidate range before linear filtering; deployed March 2026, gave ~50% latency reduction and broke correlation with total part count.
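The first two fixes can be illustrated with a minimal C++17 sketch. It is not ClickHouse's actual code: Part, PartsRegistry, addPart, snapshot and selectParts are hypothetical names chosen for illustration, and the real change in PR #85535 may be structured quite differently. The sketch only shows the general pattern of readers taking a std::shared_lock and sharing one immutable copy of the parts list that is rebuilt on mutation.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a data part; not ClickHouse's real type.
struct Part {
    std::string ns;  // leading partition-key column (namespace)
    int day;         // second partition-key column
};

using PartList = std::vector<Part>;

class PartsRegistry {
public:
    // Mutation path: takes the exclusive lock and publishes a fresh
    // immutable snapshot, so the full copy happens once per change
    // instead of once per planned query.
    void addPart(Part part) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        auto next = std::make_shared<PartList>(*snapshot_);
        next->push_back(std::move(part));
        snapshot_ = std::move(next);
    }

    // Planner path: a shared lock lets many planners proceed at once,
    // and only a reference-counted pointer is copied under the lock.
    std::shared_ptr<const PartList> snapshot() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return snapshot_;
    }

private:
    mutable std::shared_mutex mutex_;
    std::shared_ptr<const PartList> snapshot_ = std::make_shared<PartList>();
};

// Planner-side filtering: scans the shared snapshot without holding the
// lock and copies only the parts that match the query's namespace.
PartList selectParts(const PartsRegistry& registry, const std::string& ns) {
    const std::shared_ptr<const PartList> parts = registry.snapshot();
    PartList selected;
    for (const Part& p : *parts) {
        if (p.ns == ns) selected.push_back(p);
    }
    return selected;
}
```

A planner that grabbed a snapshot before a mutation keeps reading its old, consistent list, while later planners see the new one. The trade-off in this sketch is that each mutation pays for one full copy, which is reasonable only when reads greatly outnumber mutations.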
Practical takeaways for engineers
- When queries slow after schema/partitioning changes, correlate latencies with global part counts, not just per-query IO or rows scanned.
- Use Real (wall-clock) traces, not only CPU samples, to reveal blocking/lock contention.
- Avoid exclusive locks for read-only planning paths; prefer shared locks and cached read-only structures.
- Exploit sort order of partition keys (binary search) to prune parts cheaply when most queries filter on a leading key.
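The last takeaway can be sketched in C++17 as follows. This is an illustration under stated assumptions, not the deployed implementation: PartKey, namespaceRange and pruneParts are hypothetical names, and the parts list is assumed to be kept sorted by (namespace, day). Binary search finds the contiguous run of parts for one namespace in logarithmic time, so the linear day filter touches only that tenant's parts rather than the whole table's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical partition key of one part: (namespace, day).
struct PartKey {
    std::string ns;
    int day;
};

// Returns the half-open index range [first, last) of parts whose
// namespace equals `ns`. Assumes `parts` is sorted by (ns, day).
std::pair<std::size_t, std::size_t> namespaceRange(
        const std::vector<PartKey>& parts, const std::string& ns) {
    // Debug-only check of the sortedness assumption (compiled out with NDEBUG).
    assert(std::is_sorted(parts.begin(), parts.end(),
        [](const PartKey& a, const PartKey& b) {
            return std::tie(a.ns, a.day) < std::tie(b.ns, b.day);
        }));
    auto first = std::lower_bound(parts.begin(), parts.end(), ns,
        [](const PartKey& p, const std::string& key) { return p.ns < key; });
    auto last = std::upper_bound(first, parts.end(), ns,
        [](const std::string& key, const PartKey& p) { return key < p.ns; });
    return {static_cast<std::size_t>(first - parts.begin()),
            static_cast<std::size_t>(last - parts.begin())};
}

// Narrows to the namespace by binary search, then filters the remaining
// candidates linearly on the inclusive day range [minDay, maxDay].
std::vector<PartKey> pruneParts(const std::vector<PartKey>& parts,
                                const std::string& ns,
                                int minDay, int maxDay) {
    const auto [begin, end] = namespaceRange(parts, ns);
    std::vector<PartKey> selected;
    for (std::size_t i = begin; i < end; ++i) {
        if (parts[i].day >= minDay && parts[i].day <= maxDay) {
            selected.push_back(parts[i]);
        }
    }
    return selected;
}
```

With this shape, planning cost depends on the logarithm of the total part count plus the number of parts in the queried namespace, which is consistent with latency no longer tracking the total part count.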
Status and impact
- Two optimizations merged upstream (PR #85535 in ClickHouse 25.11); binary-search optimization rolled out internally in March 2026. Overall result: large, sustained latency reductions and decoupling of query latency from total part count.