◆ ICLR 2026  ·  AISTATS 2026  ·  AAAI 2025 — Google Research

How Google Made AI Memory
6× Smaller

Every time an AI model generates a word, it stores a “memory” of every previous word in the conversation. This KV cache is the single biggest memory consumer during inference. Google Research found a way to compress it with zero quality loss — using a single elegant math trick that the industry is already racing to adopt.

6×
Memory Reduction
KV cache at 2.5 bits vs. 16-bit baseline
0
Quality Loss
Matches full precision on all benchmarks
2.7×
From Optimal
Provably near information-theoretic limit
4 days
To Adoption
vLLM & SGLang PRs within a week

Every Bit Costs Extra Bits

When you quantize numbers — storing them in fewer bits to save memory — you need to keep track of how you quantized them. Every block of quantized data needs a zero point and a scale factor stored in full precision. Depending on block size, this overhead adds 1–2 extra bits per number, partially negating the whole point of compressing in the first place.

Traditional Quantization
Useful data
3+2 bits
actual data   overhead (zeros & scales)
Up to 40% wasted on bookkeeping
TurboQuant
Useful data
3 bits
actual data   (no overhead)
Zero overhead — every bit is data
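The overhead arithmetic is worth making concrete. A minimal sketch (the function name and the fp16 scale/zero-point sizes are illustrative assumptions, not from the paper):

```python
# Hypothetical back-of-envelope: per-number cost of block-wise
# quantization that stores an fp16 scale and fp16 zero point per block.
def bits_per_number(data_bits: int, block_size: int,
                    scale_bits: int = 16, zero_bits: int = 16) -> float:
    """Effective bits stored per quantized number, overhead amortized."""
    overhead = scale_bits + zero_bits          # per-block bookkeeping
    return data_bits + overhead / block_size   # spread across the block

# Smaller blocks track outliers better but pay more bookkeeping.
print(bits_per_number(3, block_size=32))   # 3 + 32/32 = 4.0  (+1 bit each)
print(bits_per_number(3, block_size=16))   # 3 + 32/16 = 5.0  (+2 bits each)
```

At block size 16, two of every five stored bits are bookkeeping, which is where the "up to 40% wasted" figure comes from.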
Think of it like this

Imagine you’re packing a suitcase. Normal compression is like vacuum-sealing your clothes — they take less space, but you need to label each bag with its original size. TurboQuant is like discovering that if you shake the suitcase first, everything settles into a predictable shape, and you don’t need labels at all. The “shake” is a random rotation, and the predictable shape is a known mathematical distribution.

Random Rotation Makes Everything Predictable

The key discovery is deceptively simple. If you multiply a vector by a random rotation matrix before quantizing it, the resulting coordinates follow a known mathematical distribution — regardless of what the original data looked like. Since you know the distribution in advance, you can precompute the optimal quantization codebook once and reuse it forever. No per-block statistics. No zero points. No scales. Zero overhead.

The Two-Stage Algorithm

Stage 1: q = Quantize(Π · x) — b−1 bits, MSE-optimal
Stage 2: correction = sign(S · residual) — 1 bit, unbiased
Stage 1 applies a random rotation Π, then quantizes each coordinate using a precomputed codebook. This is MSE-optimal but biased for inner products. Stage 2 applies a 1-bit Johnson-Lindenstrauss transform to the residual error, yielding an unbiased inner product estimator. The combination achieves near-optimal distortion within a factor of 2.7× of the information-theoretic lower bound.
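The two stages can be sketched in a few lines. This is a simplified stand-in, not the paper's implementation: the codebook here is a uniform grid over the known N(0, 1/d) range rather than the true Lloyd-Max codebook, only the stage-1 decode is shown, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 64, 4                                  # dimension, total bit budget

# Shared, data-independent machinery, built once and reused forever.
rotation, _ = np.linalg.qr(rng.standard_normal((d, d)))  # stand-in for Π
S = rng.standard_normal((d, d))                          # JL matrix for stage 2

# Stage-1 codebook: (b-1)-bit uniform grid over ±3 std of N(0, 1/d)
# (a crude stand-in for the precomputed Lloyd-Max codebook).
levels = 2 ** (b - 1)
edges = np.linspace(-3, 3, levels + 1) / np.sqrt(d)
centroids = (edges[:-1] + edges[1:]) / 2

def encode(x):
    r = rotation @ x                          # stage 1: rotate
    idx = np.digitize(r, edges[1:-1])         # codebook lookup, 0..levels-1
    residual = r - centroids[idx]
    signs = np.sign(S @ residual)             # stage 2: 1-bit JL of residual
    return idx, signs                         # all that is stored: no scales

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
idx, signs = encode(x)

x_hat = rotation.T @ centroids[idx]           # decode stage 1 only
err = np.linalg.norm(x - x_hat)
print(err)
```

Note what is absent: no per-block zero points or scales are emitted, only codebook indices and sign bits, because the codebook was fixed before any data arrived.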
Why the rotation works

Before rotation, each coordinate’s distribution depends on the specific data — some channels might have huge outliers, others might be tightly clustered. You’d need to measure and store statistics for each group. After rotation, every coordinate follows the same Beta distribution (converging to Gaussian in high dimensions). You know the shape before seeing any data. One codebook fits all.
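This behavior is easy to see numerically. A small sketch, assuming a QR-based random orthogonal matrix as a stand-in for the rotation (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 1024

# A worst-case input for naive quantizers: all mass in one coordinate.
x = np.zeros(d)
x[0] = 1.0

# Random rotation: orthonormal Q from the QR of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
y = Q @ x                          # still a unit vector; rotations keep norm

# Every coordinate now behaves like a N(0, 1/d) sample: the extreme
# outlier is gone, and the spread is known before seeing any data.
print(np.abs(y).max())             # a few multiples of 1/sqrt(d), not 1.0
print(y.std(), 1 / np.sqrt(d))     # empirical vs. predicted spread
```

The input had one coordinate at 1.0 and the rest at zero; after rotation no coordinate stands out, which is exactly why one precomputed codebook fits all.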

Three Papers, One Family

TurboQuant is the culmination of a three-paper research arc from the same team at Google Research. Each paper builds on the last, and together they form a complete theoretical and practical framework.

Binary

QJL

AAAI 2025 · The Foundation

Apply a random Gaussian projection to key embeddings, then keep only the sign bit. 1-bit quantization with zero overhead. The asymmetric trick: quantize the key but leave the query full-precision, yielding an unbiased inner product estimator.

>5× memory reduction at 3 bits, no accuracy drop. Open-source CUDA kernels on GitHub.
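The asymmetric trick can be sketched as a Monte Carlo demo. This is an illustration, not the released CUDA kernels: the projection dimension m is deliberately oversized so the estimate is visibly close, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
d, m = 128, 50000                  # embedding dim, projection dim (demo-sized)

q = rng.standard_normal(d)         # query stays full precision
k = rng.standard_normal(d)         # key gets the 1-bit treatment

S = rng.standard_normal((m, d))    # shared Gaussian JL matrix
key_bits = np.sign(S @ k)          # all that is stored: m sign bits...
key_norm = np.linalg.norm(k)       # ...plus one fp scalar for the norm

# Asymmetric estimator: project the *unquantized* query, correlate with
# the stored sign bits, rescale. In expectation this equals <q, k>.
estimate = np.sqrt(np.pi / 2) / m * key_norm * ((S @ q) @ key_bits)
print(estimate, q @ k)
```

The √(π/2) factor undoes the shrinkage introduced by keeping only signs; because the query side is never quantized, no bias term survives.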
Polar

PolarQuant

AISTATS 2026 · Multi-Bit Precision

Random rotation, then convert to polar coordinates and quantize the angles. After rotation, angle distributions are analytically known — concentrated around π/4 at higher recursion levels. Flexible per-level bit allocation.

4.2× compression. Best quality scores among all methods on long-context benchmarks.
Turbo

TurboQuant

ICLR 2026 · The Unified Framework

Combines optimal Cartesian scalar quantization (not polar) after rotation with QJL residual correction for unbiased inner products. Provably within 2.7× of the information-theoretic optimum.

At 3.5 bits: identical to full precision. At 2.5 bits: 0.62 points below. Vector search indexing roughly 250,000× faster than PQ.
A common misconception

Google’s own blog post describes TurboQuant’s first stage as “PolarQuant.” This is inaccurate. TurboQuant uses Cartesian scalar quantization after rotation — it never converts to polar coordinates. The TurboQuant paper treats PolarQuant as a competing baseline it outperforms, not a component it incorporates. Both share the random-rotation insight, but the quantization mechanics are fundamentally different.

The Math That Makes It Provably Optimal

These aren’t engineering heuristics — they’re backed by information-theoretic proofs showing exactly how close to optimal TurboQuant is, and exactly how much room is left to improve.

Rotation

Random Rotation → Known Distribution

After multiplying by a random orthogonal matrix, each coordinate of a unit vector follows a Beta((d−1)/2, (d−1)/2) distribution. In high dimensions, this converges to N(0, 1/d). The distribution is data-independent — no calibration needed.

LEMMA 1 — TURBOQUANT
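The lemma's moments can be checked directly by sampling unit vectors (a quick numerical sanity check, not the paper's code; the Beta((d−1)/2, (d−1)/2) law applies to the coordinate rescaled from [−1, 1] to [0, 1]):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 64, 50000

# First coordinate of uniformly random unit vectors in R^d.
g = rng.standard_normal((n, d))
coords = g[:, 0] / np.linalg.norm(g, axis=1)

# Predictions from the lemma: variance exactly 1/d, and a fourth
# moment of 3/(d(d+2)), i.e. already nearly Gaussian at d = 64.
print(coords.var(), 1 / d)
print((coords ** 4).mean(), 3 / (d * (d + 2)))
```

Both empirical moments land on the closed-form values with no reference to how the vectors were generated, which is the data-independence the codebook relies on.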
Optimal

Optimal Codebook via Panter-Dite

Since the distribution is known, the optimal scalar quantizer (Lloyd-Max) can be precomputed once. For b>4 bits, the Panter-Dite formula gives distortion = (1/12)·(∫ f(x)^(1/3) dx)^3 · 4^(−b). The 1/3 power arises from balancing quadratic distortion penalty against linear bit budget.

THEOREM 1 — TURBOQUANT
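The precomputation step can be reproduced with plain Lloyd iterations on the known Gaussian (a minimal sketch of the classical Lloyd-Max procedure, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
b = 3                                    # bits -> 2**b codebook levels
samples = rng.standard_normal(200_000)   # the known post-rotation law N(0,1)

# Lloyd's algorithm: alternate nearest-centroid assignment and centroid
# update. Runs once, offline; the result is reused for all data forever.
centroids = np.linspace(-2, 2, 2 ** b)
for _ in range(60):
    idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
    centroids = np.array([samples[idx == j].mean() for j in range(2 ** b)])

mse = ((samples - centroids[idx]) ** 2).mean()
print(centroids.round(3))   # symmetric codebook, denser near zero
print(mse)                  # classical 8-level Gaussian optimum is ~0.0345
```

For comparison, the high-rate Panter-Dite approximation predicts (√3·π/2)/4^3 ≈ 0.0425 here; it overshoots slightly at b = 3 because it is an asymptotic formula.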
Unbiased

QJL Makes It Unbiased

The MSE-optimal quantizer has multiplicative bias (2/π at 1 bit). Applying a 1-bit Johnson-Lindenstrauss transform to the residual cancels this bias exactly. Quantizing the key to sign bits while leaving the query unquantized gives an asymmetric unbiased estimator.

THEOREM 2 — TURBOQUANT
Lower bound

Provable Lower Bound

Via Shannon’s rate-distortion theory and Yao’s minimax principle: no quantizer can achieve MSE < 1/4^b. TurboQuant achieves (√3·π/2)/4^b ≈ 2.7/4^b. The gap factor is 2.7× worst-case, and just 1.45× at 1 bit.

THEOREM 3 — TURBOQUANT
D_mse ≤ 2.7 / 4^b
TurboQuant’s distortion is within a constant factor of the information-theoretic optimum. No future method — no matter how clever — can do better than 1/4^b. The remaining 2.7× gap is the total room left for improvement in all of quantization theory.
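Both constants in the theorem can be verified with one-line arithmetic (a quick numerical check; the 1-bit optimal quantizer places its levels at ±√(2/π), giving distortion 1 − 2/π):

```python
import numpy as np

# Worst-case gap: TurboQuant's distortion constant over the lower bound.
turbo_const = np.sqrt(3) * np.pi / 2
print(turbo_const)                  # ≈ 2.72, the headline "2.7x" gap

# At b = 1: optimal 1-bit distortion 1 - 2/pi vs. lower bound 1/4.
gap_1bit = (1 - 2 / np.pi) / (1 / 4)
print(gap_1bit)                     # ≈ 1.45, the stated 1-bit gap
```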

Smaller Memory, Same Quality

Evaluated on Llama-3.1-8B-Instruct across the LongBench suite. TurboQuant at 3.5 bits matches full 16-bit precision on the benchmark average. Even at 2.5 bits — over 6× compression — the quality drop is negligible.

Method      | KV bits | SingleQA | MultiQA | Summ. | Few-shot | Code  | Avg
Full Cache  | 16      | 45.29    | 45.16   | 26.55 | 68.38    | 46.28 | 50.06
KIVI        | 3       | 43.38    | 37.99   | 27.16 | 68.38    | 44.68 | 48.50
KIVI        | 5       | 45.04    | 45.70   | 26.47 | 68.57    | 46.41 | 50.16
PolarQuant  | 3.9     | 45.18    | 44.48   | 26.23 | 68.25    | 45.24 | 49.78
TurboQuant  | 2.5     | 44.16    | 44.96   | 24.80 | 68.01    | 45.76 | 49.44
TurboQuant  | 3.5     | 45.01    | 45.31   | 26.00 | 68.63    | 46.17 | 50.06
TurboQuant matches full precision   |   Llama-3.1-8B-Instruct on LongBench-V1
0.997
Needle-in-Haystack
Identical to full precision (also 0.997)
Compute Speedup
4-bit TurboQuant vs. 32-bit on H100
0.002s
Index Time
vs. 494s for Product Quantization (d=3072)

Four Days from Paper to Working Code

The Google Research blog post went up March 24, 2026. Within four days, the two most important open-source LLM serving frameworks had working implementations. This is one of the fastest research-to-production transitions I’ve seen in ML infrastructure.

March 24 — Day 0

Google Research Blog Post

TurboQuant paper published at ICLR 2026, blog post goes live. Full paper, proofs, and algorithms available on arXiv.

March 26 — Day 2

vLLM Feature Request & PR

Issue #38171 opened with 70 thumbs-up, 53 rocket reactions. Working PoC on vllm-omni fork shows 7.5× cache reduction on Qwen2.5-7B (H200). PR #38280 follows with full engine integration.

March 28 — Day 4

vLLM Benchmarks: 21% Throughput Gain

PR author reports 21% throughput improvement at batch size 16, zero latency overhead. Intel validates zero quality loss on XPU hardware. Phase 2 (bit-packed storage) in progress.

March 29 — Day 5

SGLang Draft PR with 42 Tests

Issue #21618 and draft PR #21617 with core quantization logic, Triton kernels, and full SGLang integration across model_runner, FlashInfer, and Triton backends.

Also In Progress

llama.cpp, Ollama, MLX

Feature requests open in llama.cpp (#20977), Ollama (#15051), Apple MLX, and LM Studio. The entire ecosystem is moving.

Why so fast?

The algorithm is dead simple to implement: multiply by a random matrix, look up centroids in a precomputed table, store the indices. No training, no calibration data, no model changes. It’s a drop-in replacement — --kv-cache-dtype turboquant and you’re done.

Why This Matters

This isn’t incremental. A 6× reduction in the single biggest memory consumer during LLM inference changes the economics of AI deployment. Here’s how.

Context

6× Longer Context, Same GPU

If KV cache was your bottleneck at 32K tokens, you can now handle 192K on the same hardware. Long-context applications — document analysis, multi-turn agents, RAG over large corpora — become practical where they weren’t before.

CONTEXT LENGTH
Batch

6× More Users Per GPU

For serving providers, KV cache memory determines how many concurrent users a GPU can handle. Shrinking it 6× means serving 6× more requests simultaneously — or using fewer GPUs for the same load.

COST REDUCTION
Vector search

Vector Search Revolution

TurboQuant isn’t just for LLMs. For vector databases, it achieves better recall than Product Quantization while building indices roughly 250,000× faster (0.002s vs. 494s at d=3072). No dataset-specific tuning required.

VECTOR SEARCH
Open

Fully Open Research

Complete papers with proofs, algorithms, and pseudocode are on arXiv. QJL has open-source CUDA kernels. The community implemented working integrations in under a week. The math is a public good.

OPEN ACCESS
Rotate · Quantize · Correct
Random rotation makes data predictable. Scalar quantization compresses it optimally. A 1-bit residual correction makes the estimate unbiased. Three steps, zero overhead, provably near-optimal. The elegance is in what’s not stored — no zero points, no scales, no per-block statistics. Just a shared rotation matrix and a precomputed codebook.