load-balancing Articles | DocsDigest

Matched posts: 1

Building the foundation for running extra-large language models

Cloudflare / Apr 16, 2026

Prefill–decode disaggregation cut tail latencies and improved inter-token throughput
x-session-affinity prompt caching raised cache hit ratio from ~60% to ~80%
Infire: multi-GPU support, lower memory overhead, and <20s cold starts

inference kv-cache multi-gpu speculative-decoding load-balancing prompt-caching
