Building the foundation for running extra-large language models
Key Points
- Prefill–decode disaggregation cut tail latencies and improved intertoken throughput
- x-session-affinity prompt caching raised cache hit ratio from ~60% to ~80%
- Infire: multi-GPU support, lower memory overhead, and <20s cold starts
Summary
Cloudflare describes the architecture and engineering changes it made to run extra-large open-source LLMs (for example, Kimi K2.5) on Workers AI. The main improvements are prefill–decode (PD) disaggregation with token-aware load balancing, prompt caching with session affinity (the x-session-affinity header), cross-GPU KV-cache sharing using Mooncake with an NVMe cache tier, speculative decoding with a smaller draft model (EAGLE-3), and Infire, a Rust inference engine with multi-GPU support, low memory overhead, and fast cold starts.
These changes target agentic workloads, which are input-heavy and depend on fast tool calls and long contexts. The reported results are a 3x improvement in intertoken throughput, lower tail latencies, higher cache hit ratios, and better hardware utilization.
Key Points
Prefill–decode disaggregation
- Run prefill (compute-bound) and decode (memory-bound) on separate servers so each stage can be tuned and scaled independently.
- Requires a load balancer that rewrites and merges the streamed decode responses and performs token-aware load balancing across the endpoint pools.
- At scale, observed p90 time per token dropped from ~100 ms to ~20–30 ms.
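Token-aware load balancing can be illustrated with a minimal sketch. It assumes the balancer keeps an estimate of in-flight tokens per endpoint and sends each request to the least-loaded endpoint in the prefill and decode pools. The class names, endpoint names, and the `route` helper are hypothetical; the summary does not describe Cloudflare's actual algorithm.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Endpoint:
    name: str
    inflight_tokens: int = 0  # estimated tokens currently being processed


@dataclass
class TokenAwarePool:
    """Hypothetical pool that balances on outstanding tokens, not request count."""
    endpoints: List[Endpoint] = field(default_factory=list)

    def acquire(self, tokens: int) -> Endpoint:
        # Choose the endpoint with the fewest in-flight tokens (first wins on ties).
        target = min(self.endpoints, key=lambda e: e.inflight_tokens)
        target.inflight_tokens += tokens
        return target

    def release(self, endpoint: Endpoint, tokens: int) -> None:
        # Called when the request finishes; never go below zero.
        endpoint.inflight_tokens = max(0, endpoint.inflight_tokens - tokens)


prefill_pool = TokenAwarePool([Endpoint("prefill-a"), Endpoint("prefill-b")])
decode_pool = TokenAwarePool([Endpoint("decode-a"), Endpoint("decode-b")])


def route(prompt_tokens: int, max_new_tokens: int) -> Tuple[str, str]:
    """Pick one prefill and one decode endpoint for a request."""
    prefill = prefill_pool.acquire(prompt_tokens)
    decode = decode_pool.acquire(max_new_tokens)
    return prefill.name, decode.name
```

Counting tokens rather than requests matters for agent traffic, where one request with a very long prompt can cost far more prefill work than many short ones.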
Prompt caching and session affinity
- Send the x-session-affinity header so requests are routed to the replica that already holds the cached input tensors; Cloudflare encourages this by discounting cached tokens.
- Worked best with agent harnesses (e.g., OpenCode); the cache hit ratio rose from ~60% to ~80% during peak periods.
- Recommendation: clients should send the session-affinity header to reduce recomputation and the number of GPUs needed.
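The idea behind session affinity can be sketched as follows. The header name comes from the text; the UUID value format, the replica names, and the hash-based `pick_replica` router are assumptions made for illustration and are not Workers AI's documented behavior.

```python
import hashlib
import uuid
from typing import Dict, List


def session_headers(session_id: str) -> Dict[str, str]:
    # Header name is from the text; using a UUID as the value is an assumption.
    return {"x-session-affinity": session_id}


def pick_replica(session_id: str, replicas: List[str]) -> str:
    """Hypothetical router: a stable hash maps one session to one replica."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


# Generate the ID once per agent session and reuse it on every request.
session_id = str(uuid.uuid4())
headers = session_headers(session_id)

replicas = ["gpu-node-0", "gpu-node-1", "gpu-node-2"]
first = pick_replica(headers["x-session-affinity"], replicas)
second = pick_replica(headers["x-session-affinity"], replicas)
```

Because both requests carry the same value, they land on the same replica, which is where the prompt prefix from the earlier turn is already cached.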
KV‑cache sharing and storage tiering
- Use the Mooncake Transfer Engine (RDMA transports such as NVLink and NVMe over Fabric) together with Mooncake Store to move KV caches between nodes and extend the cache onto NVMe storage.
- Integration with LMCache/HiCache lets nodes in a cluster reuse prefilled caches, so strict session-aware routing is not required inside the cluster.
- Result: sessions stay cached longer and load is balanced more evenly.
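Storage tiering can be pictured with a toy two-tier LRU cache: a small fast tier standing in for GPU memory spills its oldest entries into a larger tier standing in for NVMe. This is only a sketch of the concept; `TieredKVCache` and its methods are hypothetical and do not reflect the Mooncake, LMCache, or HiCache APIs.

```python
from collections import OrderedDict
from typing import Optional


class TieredKVCache:
    """Toy two-tier LRU: a 'GPU' tier that spills into an 'NVMe' tier."""

    def __init__(self, gpu_slots: int, nvme_slots: int) -> None:
        self.gpu_slots = gpu_slots
        self.nvme_slots = nvme_slots
        self.gpu = OrderedDict()   # prefix key -> KV blob, oldest first
        self.nvme = OrderedDict()  # prefix key -> KV blob, oldest first

    def put(self, prefix_key: str, kv_blob: bytes) -> None:
        self.nvme.pop(prefix_key, None)  # keep each entry in one tier only
        self.gpu[prefix_key] = kv_blob
        self.gpu.move_to_end(prefix_key)
        # Spill least recently used GPU entries to NVMe instead of dropping them.
        while len(self.gpu) > self.gpu_slots:
            old_key, old_blob = self.gpu.popitem(last=False)
            self.nvme[old_key] = old_blob
            while len(self.nvme) > self.nvme_slots:
                self.nvme.popitem(last=False)  # finally evicted

    def get(self, prefix_key: str) -> Optional[bytes]:
        if prefix_key in self.gpu:
            self.gpu.move_to_end(prefix_key)  # refresh recency
            return self.gpu[prefix_key]
        if prefix_key in self.nvme:
            blob = self.nvme.pop(prefix_key)
            self.put(prefix_key, blob)  # promote back to the fast tier
            return blob
        return None  # miss: the prefix must be prefilled again
```

The second tier is what lengthens session retention: an idle session's cache survives eviction from GPU memory and can be promoted back without recomputing the prefill.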
Speculative decoding
- A smaller draft model (EAGLE-3) proposes several future tokens; the target model then verifies them in a single forward pass, so several tokens can be accepted per target-model step and compute cost drops.
- Tuning knob: the number of future tokens the draft model proposes. A longer lookahead can raise throughput but also raises the rejection rate.
- Especially effective for predictable, structured outputs and frequent tool calls.
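The propose-then-verify loop can be sketched with toy models. This is a simplified greedy variant with exact-match verification, not the EAGLE-3 method itself; `speculative_step`, `target_model`, and `draft_model` are hypothetical names, and a real engine would score all proposed positions in one batched forward pass rather than looping.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], lookahead: int) -> List[int]:
    """Return the tokens accepted in one speculative decoding step."""
    # 1. The draft model proposes `lookahead` tokens autoregressively.
    proposed: List[int] = []
    draft_ctx = list(context)
    for _ in range(lookahead):
        token = draft(draft_ctx)
        proposed.append(token)
        draft_ctx.append(token)

    # 2. The target model checks each proposal in order.
    accepted: List[int] = []
    ctx = list(context)
    for token in proposed:
        expected = target(ctx)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Everything matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def target_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # toy "large" model: counts upward


def draft_model(ctx: List[int]) -> int:
    # Toy "small" model: agrees with the target except after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

In this greedy variant the output is identical to what the target model would produce on its own; the draft only changes how many tokens are confirmed per step.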
Infire inference engine
- Rust-based engine extended for multi-GPU operation (pipeline, tensor, and expert parallelism) and optimized for low activation and memory overhead.
- Enables running large models (e.g., Kimi K2.5 on 8x H100) with memory left over for the KV cache; for example, Llama 4 Scout runs on 2x H200 with more than 56 GiB free for cache.
- Cold starts take under 20 s, and throughput is up to ~20% higher (tokens/sec) on unconstrained systems.
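How a "memory left for KV cache" figure arises can be shown with back-of-the-envelope arithmetic. The helper below is a hypothetical sketch, and the example inputs (141 GB of HBM per GPU, a 109B-parameter model stored at 2 bytes per parameter, zero runtime overhead) are illustrative assumptions, not figures taken from the text.

```python
GIB = 2 ** 30  # bytes per GiB


def kv_cache_budget_gib(num_gpus: int, hbm_bytes_per_gpu: float,
                        weight_bytes: float, overhead_bytes: float = 0.0) -> float:
    """Memory left for KV cache after weights and runtime overhead, in GiB."""
    total = num_gpus * hbm_bytes_per_gpu
    return (total - weight_bytes - overhead_bytes) / GIB


# Illustrative assumptions only: 2 GPUs with 141 GB each, 109e9 parameters
# at 2 bytes per parameter, and no activation or runtime overhead.
example = kv_cache_budget_gib(
    num_gpus=2,
    hbm_bytes_per_gpu=141e9,
    weight_bytes=109e9 * 2,
)
```

Whatever the engine spends on activations and runtime buffers comes straight out of this budget, which is why lower overhead translates directly into more room for cached sessions.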
Practical guidance for engineers
- For input-heavy agent workloads, prioritize PD disaggregation on the serving side and prompt caching driven by clients.
- Add x-session-affinity to harnesses to reduce GPU requirements and lower latency.
- Use KV sharing (RDMA/Mooncake) and NVMe spillover to lengthen cache lifetime and balance load within clusters.
- Tune speculative decoding token lookahead per model and workload to maximize throughput without undermining quality.
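The lookahead tradeoff can be estimated with the standard speculative decoding analysis, under the simplifying assumption that each draft token is accepted independently with the same probability. The function names and the `draft_cost` parameter (cost of one draft token relative to one target forward pass) are hypothetical; real acceptance rates must be measured per model and workload.

```python
def expected_tokens_per_step(acceptance_rate: float, lookahead: int) -> float:
    """Expected tokens produced per target pass: (1 - a^(k+1)) / (1 - a)."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return float(lookahead + 1)  # every draft token plus the bonus token
    return (1.0 - acceptance_rate ** (lookahead + 1)) / (1.0 - acceptance_rate)


def best_lookahead(acceptance_rate: float, draft_cost: float,
                   max_lookahead: int = 8) -> int:
    """Lookahead that maximizes estimated speedup over plain decoding."""
    def speedup(k: int) -> float:
        # One target pass plus k draft passes per step (assumed cost model).
        step_cost = 1.0 + k * draft_cost
        return expected_tokens_per_step(acceptance_rate, k) / step_cost

    return max(range(1, max_lookahead + 1), key=speedup)
```

Under this model a predictable workload with a high acceptance rate favors a long lookahead, while a low acceptance rate makes extra draft tokens wasted work, which matches the guidance to tune per model and workload.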
This foundation is designed to let teams run extra‑large models more efficiently across heterogeneous hardware while keeping tail latencies and cost under control.