Every time an AI model generates a word, it stores a “memory” of every previous word in the conversation. This KV cache is the single biggest memory consumer during inference. Google Research found a way to compress it 6× with zero quality loss — using a single elegant math trick that the industry is already racing to adopt.
When you quantize numbers — storing them in fewer bits to save memory — you need to keep track of how you quantized them. Every block of quantized data needs a zero point and a scale factor stored in full precision. Depending on block size, this overhead adds 1–2 extra bits per number, partially negating the whole point of compressing in the first place.
Imagine you’re packing a suitcase. Normal compression is like vacuum-sealing your clothes — they take less space, but you need to label each bag with its original size. TurboQuant is like discovering that if you shake the suitcase first, everything settles into a predictable shape, and you don’t need labels at all. The “shake” is a random rotation, and the predictable shape is a known mathematical distribution.
The key discovery is deceptively simple. If you multiply a vector by a random rotation matrix before quantizing it, the resulting coordinates follow a known mathematical distribution — regardless of what the original data looked like. Since you know the distribution in advance, you can precompute the optimal quantization codebook once and reuse it forever. No per-block statistics. No zero points. No scales. Zero overhead.
Before rotation, each coordinate’s distribution depends on the specific data — some channels might have huge outliers, others might be tightly clustered. You’d need to measure and store statistics for each group. After rotation, every coordinate follows the same Beta distribution (converging to Gaussian in high dimensions). You know the shape before seeing any data. One codebook fits all.
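You can see the data-independence claim in a few lines of numpy. This is an illustrative sketch, not code from the paper: it rotates two deliberately extreme unit vectors (one dominated by outlier channels, one perfectly flat) by the same random orthogonal matrix and checks that both come out looking like draws from N(0, 1/d).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Random orthogonal matrix: orthogonalize a Gaussian matrix via QR.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Two very different unit vectors: one dominated by a few outlier
# channels, one perfectly flat.
spiky = np.zeros(d)
spiky[:4] = [10.0, -7.0, 5.0, 3.0]
spiky /= np.linalg.norm(spiky)
flat = np.full(d, 1.0 / np.sqrt(d))

assert np.abs(spiky).max() > 0.7          # huge outlier channel before rotation
for v in (spiky, flat):
    r = Q @ v
    # After rotation every coordinate behaves like N(0, 1/d): the outliers
    # are smeared out, and both inputs now look statistically identical.
    assert np.abs(r).max() < 0.2
print("both vectors look Gaussian after rotation")
```

The outlier channel that was 70%+ of the vector's mass before rotation is spread evenly across all 1024 coordinates afterward, which is exactly why one precomputed codebook can serve any input.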
TurboQuant is the culmination of a three-paper research arc from the same team at Google Research. Each paper builds on the last, and together they form a complete theoretical and practical framework.
- **QJL:** Apply a random Gaussian projection to key embeddings, then keep only the sign bit: 1-bit quantization with zero overhead. The asymmetric trick is to quantize the key but leave the query in full precision, yielding an unbiased inner-product estimator.
- **PolarQuant:** Random rotation, then convert to polar coordinates and quantize the angles. After rotation, the angle distributions are analytically known (concentrated around π/4 at higher recursion levels), allowing flexible per-level bit allocation.
- **TurboQuant:** Combines optimal Cartesian scalar quantization (not polar) after rotation with QJL residual correction for unbiased inner products. Provably within 2.7× of the information-theoretic optimum.
Google’s own blog post describes TurboQuant’s first stage as “PolarQuant.” This is inaccurate. TurboQuant uses Cartesian scalar quantization after rotation — it never converts to polar coordinates. The TurboQuant paper treats PolarQuant as a competing baseline it outperforms, not a component it incorporates. Both share the random-rotation insight, but the quantization mechanics are fundamentally different.
These aren’t engineering heuristics — they’re backed by information-theoretic proofs showing exactly how close to optimal TurboQuant is, and exactly how much room is left to improve.
After multiplying by a random orthogonal matrix, each coordinate of a unit vector follows a Beta((d−1)/2, (d−1)/2) distribution. In high dimensions, this converges to N(0, 1/d). The distribution is data-independent — no calibration needed.
Since the distribution is known, the optimal scalar quantizer (Lloyd-Max) can be precomputed once. For b > 4 bits, the Panter-Dite formula gives distortion ≈ (1/12) · (∫ f(x)^(1/3) dx)^3 · 4^(−b). The 1/3 power arises from balancing the quadratic distortion penalty against the linear bit budget.
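Here is the precompute-once idea as a sketch, assuming a textbook Lloyd-Max iteration on N(0, 1) samples (not the paper's exact procedure). The codebook is built offline with no knowledge of future data, and the resulting distortion lands between the 1/4^b rate-distortion lower bound and the Panter-Dite prediction:

```python
import numpy as np

rng = np.random.default_rng(1)

def lloyd_max_gaussian(bits, n=400_000, iters=100):
    """Precompute the optimal scalar codebook for N(0, 1). Done once, offline."""
    x = rng.standard_normal(n)
    levels = 2 ** bits
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # percentile init
    for _ in range(iters):
        edges = (c[:-1] + c[1:]) / 2                # nearest-centroid cell edges
        idx = np.searchsorted(edges, x)
        counts = np.bincount(idx, minlength=levels)
        sums = np.bincount(idx, weights=x, minlength=levels)
        c = np.where(counts > 0, sums / np.maximum(counts, 1), c)
    return c

for b in (2, 3, 4):
    c = lloyd_max_gaussian(b)
    y = rng.standard_normal(100_000)                # fresh "rotated" coordinates
    q = c[np.abs(y[:, None] - c[None, :]).argmin(axis=1)]
    mse = np.mean((y - q) ** 2)
    lower = 4.0 ** (-b)                             # rate-distortion bound
    panter_dite = (np.sqrt(3) * np.pi / 2) * 4.0 ** (-b)
    assert lower < mse < panter_dite                # between bound and formula
    print(b, "bits -> MSE", round(mse, 4))
```

Note that no statistic of `y` is ever measured: the codebook from the offline step is applied blind, which is precisely the "no zero points, no scales" property.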
The MSE-optimal quantizer has multiplicative bias (2/π at 1 bit). Applying a 1-bit Johnson-Lindenstrauss transform to the residual cancels this bias exactly. Quantizing the key to sign bits while leaving the query unquantized gives an asymmetric unbiased estimator.
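The asymmetric sign-bit estimator can be checked numerically. A minimal sketch, assuming the standard identity E[sign(⟨g,k⟩)·⟨g,q⟩] = √(2/π)·⟨k,q⟩/‖k‖ for Gaussian g (the key's norm is kept as a single full-precision scalar; the variable names are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 16, 200_000

k = rng.standard_normal(d)                 # "key": will be stored as sign bits
q = rng.standard_normal(d)                 # "query": kept in full precision

G = rng.standard_normal((m, d))            # shared random Gaussian projections
sk = np.sign(G @ k)                        # stored key: one sign bit per projection
gq = G @ q                                 # query side stays unquantized

# E[sign(<g,k>) * <g,q>] = sqrt(2/pi) * <k,q> / ||k||, so rescaling by
# ||k|| * sqrt(pi/2) cancels the multiplicative bias exactly.
est = np.linalg.norm(k) * np.sqrt(np.pi / 2) * np.mean(sk * gq)
true = k @ q
assert abs(est - true) < 0.3               # unbiased; concentrates as m grows
print("estimate", round(est, 2), "vs true", round(true, 2))
```

Quantizing both sides would square the bias and break this cancellation, which is why the estimator is deliberately asymmetric.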
Via Shannon’s rate-distortion theory and Yao’s minimax principle: no quantizer can achieve MSE < 1/4^b. TurboQuant achieves (√3·π/2)/4^b ≈ 2.7/4^b. The gap is a 2.7× factor in the worst case, and just 1.45× at 1 bit.
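Both constants can be verified with a few lines of arithmetic, using the standard fact that the optimal 1-bit quantizer for a Gaussian (centroids at ±√(2/π)) has MSE = 1 − 2/π:

```python
import math

# Worst-case gap: Panter-Dite constant over the rate-distortion bound 1/4^b.
gap = math.sqrt(3) * math.pi / 2
assert abs(gap - 2.72) < 0.01

# At 1 bit: MSE = 1 - 2/pi for the optimal Gaussian quantizer,
# against the 1/4 lower bound.
gap_1bit = (1 - 2 / math.pi) / 0.25
assert abs(gap_1bit - 1.45) < 0.01
print(round(gap, 2), round(gap_1bit, 2))   # → 2.72 1.45
```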
Evaluated on Llama-3.1-8B-Instruct across the LongBench suite. TurboQuant at 3.5 bits matches full 16-bit precision on the LongBench average (50.06 vs. 50.06). Even at 2.5 bits, over 6× compression, the quality drop is negligible.
| Method | KV bits | SingleQA | MultiQA | Summ. | Few-shot | Code | Avg |
|---|---|---|---|---|---|---|---|
| Full Cache | 16 | 45.29 | 45.16 | 26.55 | 68.38 | 46.28 | 50.06 |
| KIVI | 3 | 43.38 | 37.99 | 27.16 | 68.38 | 44.68 | 48.50 |
| KIVI | 5 | 45.04 | 45.70 | 26.47 | 68.57 | 46.41 | 50.16 |
| PolarQuant | 3.9 | 45.18 | 44.48 | 26.23 | 68.25 | 45.24 | 49.78 |
| TurboQuant | 2.5 | 44.16 | 44.96 | 24.80 | 68.01 | 45.76 | 49.44 |
| TurboQuant | 3.5 | 45.01 | 45.31 | 26.00 | 68.63 | 46.17 | 50.06 |
The Google Research blog post went up March 24, 2026. Within four days, the two most important open-source LLM serving frameworks had working implementations. This is one of the fastest research-to-production transitions I’ve seen in ML infrastructure.
- **Paper:** TurboQuant published at ICLR 2026; blog post goes live. Full paper, proofs, and algorithms available on arXiv.
- **vLLM:** Issue #38171 opened, drawing 70 thumbs-up and 53 rocket reactions. A working PoC on the vllm-omni fork shows 7.5× cache reduction on Qwen2.5-7B (H200). PR #38280 follows with full engine integration; its author reports a 21% throughput improvement at batch size 16 with zero latency overhead, and Intel validates zero quality loss on XPU hardware. Phase 2 (bit-packed storage) is in progress.
- **SGLang:** Issue #21618 and draft PR #21617 land core quantization logic, Triton kernels, and full integration across the model_runner, FlashInfer, and Triton backends.
- **Elsewhere:** Feature requests are open in llama.cpp (#20977), Ollama (#15051), Apple MLX, and LM Studio. The entire ecosystem is moving.
The algorithm is dead simple to implement: multiply by a random matrix, look up centroids in a precomputed table, store the indices. No training, no calibration data, no model changes. It’s a drop-in replacement: set `--kv-cache-dtype turboquant` and you’re done.
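Those three steps fit in a screenful of numpy. A minimal sketch under stated assumptions: the function names are mine, and a uniform grid over ±4σ stands in for the precomputed Lloyd-Max codebook, so the error here is looser than the real thing.

```python
import numpy as np

rng = np.random.default_rng(3)
d, b = 64, 4

# Offline, once: a random rotation and a codebook for N(0, 1/d) coordinates.
# (Uniform grid as a stand-in for the precomputed Lloyd-Max codebook.)
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
codebook = np.linspace(-4 / np.sqrt(d), 4 / np.sqrt(d), 2 ** b)

def quantize(v):
    r = Q @ v                                     # rotate: coords now ~ N(0, 1/d)
    return np.abs(r[:, None] - codebook).argmin(axis=1).astype(np.uint8)

def dequantize(idx):
    return Q.T @ codebook[idx]                    # table lookup, rotate back

v = rng.standard_normal(d)
v /= np.linalg.norm(v)
idx = quantize(v)                                 # b bits per coordinate; no
err = np.linalg.norm(v - dequantize(idx))         # scales or zero points stored
assert err < 0.3
print("reconstruction error:", round(err, 3))
```

Everything stored per vector is the `uint8` index array; the rotation and codebook are shared globals, which is where the zero per-block overhead comes from.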
This isn’t incremental. A 6× reduction in the single biggest memory consumer during LLM inference changes the economics of AI deployment. Here’s how.
If KV cache was your bottleneck at 32K tokens, you can now handle 192K on the same hardware. Long-context applications — document analysis, multi-turn agents, RAG over large corpora — become practical where they weren’t before.
For serving providers, KV cache memory determines how many concurrent users a GPU can handle. Shrinking it 6× means serving 6× more requests simultaneously — or using fewer GPUs for the same load.
TurboQuant isn’t just for LLMs. For vector databases, it achieves better recall than Product Quantization while building indices orders of magnitude faster: 0.002 s vs. 494 s at d=3072, roughly a 250,000× speedup. No dataset-specific tuning required.
Complete papers with proofs, algorithms, and pseudocode are on arXiv. QJL has open-source CUDA kernels. The community implemented working integrations in under a week. The math is a public good.