Unweight: up to 22% lossless LLM weight compression
Key Points
- Lossless BF16 exponent Huffman coding
- Reconstructive matmul decompresses in shared memory
- 15–22% model size reduction (≈3 GB VRAM saved)
Summary
Unweight is a lossless compression and runtime system that reduces LLM weight memory traffic on NVIDIA H100 GPUs by compressing the exponent byte of each BF16 weight and decompressing it directly in fast on‑chip shared memory. By fusing decompression with a reconstructive matrix‑multiply kernel and autotuning among four execution pipelines, Unweight achieves a 15–22% overall model size reduction (≈3 GB VRAM saved on Llama‑3.1‑8B MLPs) while keeping outputs bit‑exact and requiring no special hardware.
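To make the exponent-coding idea concrete, here is a minimal sketch, not the Unweight implementation. It splits each BF16 word into its 8-bit exponent and its sign+mantissa byte, then estimates how well a Huffman code would shrink the exponent stream. The helper names (`to_bf16_bits`, `split_bf16`, `huffman_code_lengths`) are hypothetical, the weights are synthetic Gaussian samples rather than real model weights, and BF16 conversion is done by truncating float32 (real converters usually round), so the printed ratios are illustrative only.

```python
import heapq
import random
import struct
from collections import Counter


def to_bf16_bits(x: float) -> int:
    """16-bit BF16 pattern of x, obtained by truncating its float32 encoding."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def split_bf16(bits: int) -> tuple[int, int]:
    """Split a BF16 word into (exponent byte, sign+mantissa byte)."""
    exponent = (bits >> 7) & 0xFF
    sign_mantissa = ((bits >> 15) << 7) | (bits & 0x7F)
    return exponent, sign_mantissa


def huffman_code_lengths(freqs: Counter) -> dict[int, int]:
    """Code length in bits per symbol for a Huffman code built from freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # every merge adds one bit to each member symbol
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


# Synthetic stand-in for a weight matrix (assumption: zero-mean Gaussian).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(20000)]
exponents = [split_bf16(to_bf16_bits(w))[0] for w in weights]

freqs = Counter(exponents)
lengths = huffman_code_lengths(freqs)
coded_bits = sum(freqs[s] * lengths[s] for s in freqs)
raw_bits = 8 * len(exponents)
top16 = sum(c for _, c in freqs.most_common(16)) / len(exponents)

print(f"coded exponent stream is {coded_bits / raw_bits:.1%} of its raw size")
print(f"top-16 exponents cover {top16:.1%} of the synthetic weights")
```

The sign+mantissa byte is left out of the coder entirely, which mirrors the design described above: only the highly skewed exponent distribution is worth entropy coding.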
Key Points
- Core idea: leave sign+mantissa untouched and Huffman‑encode the exponent byte using a small palette (top‑16 exponents cover ≈99% of weights), yielding ~30% exponent‑stream compression.
- Scope: applied selectively to MLP weight matrices (gate, up, down); attention, embeddings, and norms are left uncompressed.
- Handling edge cases: if any value in a 64‑wide row has a rare exponent, the entire row is stored verbatim to avoid per‑element branching in the hot path.
- Four runtime pipelines: full Huffman decode + cuBLAS (lowest kernel overhead), exponent‑only decode, runtime palette transcode, and direct palette (no preprocessing; weights are reconstructed on the fly). An autotuner picks the best pipeline per matrix and batch size.
- Reconstructive matmul: producer thread groups use TMA to stage compressed streams into shared memory; consumer groups reconstruct BF16 values in shared memory and feed them directly to WGMMA tensor‑core instructions so reconstructed weights never touch HBM.
- Performance tradeoffs: less preprocessing reduces HBM traffic but increases per‑element kernel work. Small batch sizes often favor full decode+cuBLAS; large batches favor palette/exponent paths. Different projection matrices can prefer different pipelines.
- Integration & reproducibility: Unweight targets Hopper H100 GPUs, integrates with a Rust inference engine, and includes autotuning; the GPU kernels and technical paper are published as open source.
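The palette and row-fallback points above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the published kernels: `build_palette`, `encode_rows`, and `decode_rows` are hypothetical names, palette indices are stored one per byte for readability (a real layout would presumably pack two 4-bit indices per byte), and the demo uses a palette of 4 entries instead of 16 so the verbatim fallback is easy to trigger.

```python
from collections import Counter

ROW = 64  # row width taken from the text


def build_palette(exponents: list[int], size: int = 16) -> list[int]:
    """Most frequent exponent values, most common first."""
    return [e for e, _ in Counter(exponents).most_common(size)]


def encode_rows(words: list[int], palette: list[int]) -> list[tuple]:
    """Encode 16-bit BF16 words row by row.

    A row whose exponents all sit in the palette is stored as palette
    indices plus untouched sign+mantissa bytes; any other row is stored
    verbatim, so decoding never branches per element.
    """
    index_of = {e: i for i, e in enumerate(palette)}
    rows = []
    for start in range(0, len(words), ROW):
        row = words[start:start + ROW]
        exps = [(w >> 7) & 0xFF for w in row]
        if all(e in index_of for e in exps):
            idx = bytes(index_of[e] for e in exps)
            sign_mant = bytes(((w >> 15) << 7) | (w & 0x7F) for w in row)
            rows.append(("palette", idx, sign_mant))
        else:
            rows.append(("verbatim", list(row)))
    return rows


def decode_rows(rows: list[tuple], palette: list[int]) -> list[int]:
    """Rebuild the original 16-bit words exactly."""
    out = []
    for row in rows:
        if row[0] == "verbatim":
            out.extend(row[1])
        else:
            _, idx, sign_mant = row
            for i, b in zip(idx, sign_mant):
                out.append(((b >> 7) << 15) | (palette[i] << 7) | (b & 0x7F))
    return out


# Demo: two rows of 64 words using four common exponents, plus one rare one.
words = []
for i in range(128):
    sign = 0x8000 if i % 3 == 0 else 0
    words.append(sign | ((120 + i % 4) << 7) | (i & 0x7F))
words[70] = (5 << 7) | 1  # rare exponent lands in the second row

palette = build_palette([(w >> 7) & 0xFF for w in words], size=4)
rows = encode_rows(words, palette)
kinds = [row[0] for row in rows]
restored = decode_rows(rows, palette)

print(kinds)              # ['palette', 'verbatim']
print(restored == words)  # True
```

Deciding per row rather than per element costs some compression when a row contains a single outlier, but it keeps the decode loop uniform, which is the tradeoff the edge-case bullet describes.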
Practical guidance for engineers
- Target MLP weight matrices first; expect the biggest bandwidth wins there.
- Run autotuning on the target hardware with representative batch sizes; the best pipeline depends on both workload and matrix shape.
- Monitor token latency against throughput; small‑batch, latency‑sensitive workloads may favor the simpler decode + cuBLAS path, while high‑throughput workloads can benefit from the reconstructive kernels.
- Validate bit‑exactness in CI/QA: the method itself is lossless, but integration mistakes can still break exactness or determinism.
- Leverage the open‑source kernels and paper for implementation details and performance tests on H100/Hopper-class hardware.
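The autotuning and bit-exactness advice above can be combined into one harness. The following is a minimal sketch under assumptions, not Unweight's autotuner: pipelines are modeled as plain Python callables from `(weights, batch)` to output bytes, the names `autotune`, `wall_clock`, and `digest` are hypothetical, and the demo replaces real timing with a made-up cost model so the result is reproducible.

```python
import hashlib
import time
from typing import Callable, Dict, Iterable, Optional

Pipeline = Callable[[bytes, int], bytes]  # (weights, batch size) -> output bytes


def digest(data: bytes) -> str:
    """Hash of the raw output bytes; equal digests mean bit-exact outputs."""
    return hashlib.sha256(data).hexdigest()


def wall_clock(run: Pipeline, weights: bytes, batch: int, repeats: int = 3) -> float:
    """Best-of-N wall-clock time of one pipeline call, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        run(weights, batch)
        best = min(best, time.perf_counter() - t0)
    return best


def autotune(
    pipelines: Dict[str, Pipeline],
    reference: Pipeline,
    weights: bytes,
    batch_sizes: Iterable[int],
    measure: Callable[[Pipeline, bytes, int], float] = wall_clock,
) -> Dict[int, Optional[str]]:
    """For each batch size, pick the cheapest pipeline that is bit-exact."""
    choice: Dict[int, Optional[str]] = {}
    for batch in batch_sizes:
        expected = digest(reference(weights, batch))
        best_name, best_cost = None, float("inf")
        for name, run in pipelines.items():
            if digest(run(weights, batch)) != expected:
                continue  # reject any pipeline whose output is not bit-exact
            cost = measure(run, weights, batch)
            if cost < best_cost:
                best_name, best_cost = name, cost
        choice[batch] = best_name
    return choice


# Toy stand-ins for real kernels (assumption: outputs are just byte strings).
def reference(weights: bytes, batch: int) -> bytes:
    return weights * batch

def full_decode(weights: bytes, batch: int) -> bytes:
    return weights * batch

def direct_palette(weights: bytes, batch: int) -> bytes:
    return weights * batch

def broken(weights: bytes, batch: int) -> bytes:
    return (weights * batch)[:-1] + b"\x00"  # corrupts the final byte

# Made-up cost model: low fixed cost vs. low per-item cost.
toy_cost = {
    "full_decode": lambda b: 2.0 + 2.0 * b,
    "direct_palette": lambda b: 10.0 + 1.0 * b,
    "broken": lambda b: 0.0,
}

def toy_measure(run: Pipeline, weights: bytes, batch: int) -> float:
    return toy_cost[run.__name__](batch)

pipelines = {"full_decode": full_decode, "direct_palette": direct_palette, "broken": broken}
choice = autotune(pipelines, reference, b"\x01\x02\x03", [1, 64], measure=toy_measure)
print(choice)  # {1: 'full_decode', 64: 'direct_palette'}
```

Checking exactness before timing means a fast but incorrect pipeline can never win, and the same digest comparison can be reused as a CI assertion. On real hardware, `measure` would be left at its default or replaced with GPU event timing.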
Impact
Unweight reduces memory bandwidth pressure (fewer bytes cross the HBM→SMEM bus), enabling more models per GPU, lower inference cost, and faster token generation, all without changing model outputs or requiring special hardware.