Every time an AI model generates a word, it stores a “memory” of every previous word in the conversation. This KV cache is the single biggest memory consumer during inference. Google Research found a way to compress it 6× with zero quality loss — using a single elegant math trick that the industry is already racing to adopt.
When you quantize numbers — storing them in fewer bits to save memory — you need to keep track of how you quantized them. Every block of quantized data needs a zero point and a scale factor stored in full precision. Depending on block size, this overhead adds 1–2 extra bits per number, partially negating the whole point of compressing in the first place.
Imagine you’re packing a suitcase. Normal compression is like vacuum-sealing your clothes — they take less space, but you need to label each bag with its original size. TurboQuant is like discovering that if you shake the suitcase first, everything settles into a predictable shape, and you don’t need labels at all. The “shake” is a random rotation, and the predictable shape is a known mathematical distribution.
The key discovery is deceptively simple. If you multiply a vector by a random rotation matrix before quantizing it, the resulting coordinates follow a known mathematical distribution — regardless of what the original data looked like. Since you know the distribution in advance, you can precompute the optimal quantization codebook once and reuse it forever. No per-block statistics. No zero points. No scales. Zero overhead.
Before rotation, each coordinate’s distribution depends on the specific data — some channels might have huge outliers, others might be tightly clustered. You’d need to measure and store statistics for each group. After rotation, every coordinate follows the same Beta distribution (converging to Gaussian in high dimensions). You know the shape before seeing any data. One codebook fits all.
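You can see the data-independence claim in a few lines of numpy. This is an illustrative sketch, not code from the paper: it rotates two deliberately extreme unit vectors (one dominated by outlier channels, one perfectly flat) by the same random orthogonal matrix and checks that both come out looking like draws from N(0, 1/d).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Random orthogonal matrix: orthogonalize a Gaussian matrix via QR.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Two very different unit vectors: one dominated by a few outlier
# channels, one perfectly flat.
spiky = np.zeros(d)
spiky[:4] = [10.0, -7.0, 5.0, 3.0]
spiky /= np.linalg.norm(spiky)
flat = np.full(d, 1.0 / np.sqrt(d))

assert np.abs(spiky).max() > 0.7          # huge outlier channel before rotation
for v in (spiky, flat):
    r = Q @ v
    # After rotation every coordinate behaves like N(0, 1/d): the outliers
    # are smeared out, and both inputs now look statistically identical.
    assert np.abs(r).max() < 0.2
print("both vectors look Gaussian after rotation")
```

The outlier channel that was 70%+ of the vector's mass before rotation is spread evenly across all 1024 coordinates afterward, which is exactly why one precomputed codebook can serve any input.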
TurboQuant is the culmination of a three-paper research arc from the same team at Google Research. Each paper builds on the last, and together they form a complete theoretical and practical framework.
- **QJL:** Apply a random Gaussian projection to key embeddings, then keep only the sign bit: 1-bit quantization with zero overhead. The asymmetric trick is to quantize the key but leave the query in full precision, yielding an unbiased inner-product estimator.
- **PolarQuant:** Random rotation, then convert to polar coordinates and quantize the angles. After rotation, the angle distributions are analytically known (concentrated around π/4 at higher recursion levels), allowing flexible per-level bit allocation.
- **TurboQuant:** Combines optimal Cartesian scalar quantization (not polar) after rotation with QJL residual correction for unbiased inner products. Provably within 2.7× of the information-theoretic optimum.
Google’s own blog post describes TurboQuant’s first stage as “PolarQuant.” This is inaccurate. TurboQuant uses Cartesian scalar quantization after rotation — it never converts to polar coordinates. The TurboQuant paper treats PolarQuant as a competing baseline it outperforms, not a component it incorporates. Both share the random-rotation insight, but the quantization mechanics are fundamentally different.
These aren’t engineering heuristics — they’re backed by information-theoretic proofs showing exactly how close to optimal TurboQuant is, and exactly how much room is left to improve.
After multiplying by a random orthogonal matrix, each coordinate of a unit vector follows a Beta((d−1)/2, (d−1)/2) distribution. In high dimensions, this converges to N(0, 1/d). The distribution is data-independent — no calibration needed.
Since the distribution is known, the optimal scalar quantizer (Lloyd-Max) can be precomputed once. For b > 4 bits, the Panter-Dite formula gives distortion ≈ (1/12) · (∫ f(x)^(1/3) dx)^3 · 4^(−b). The 1/3 power arises from balancing the quadratic distortion penalty against the linear bit budget.
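Here is the precompute-once idea as a sketch, assuming a textbook Lloyd-Max iteration on N(0, 1) samples (not the paper's exact procedure). The codebook is built offline with no knowledge of future data, and the resulting distortion lands between the 1/4^b rate-distortion lower bound and the Panter-Dite prediction:

```python
import numpy as np

rng = np.random.default_rng(1)

def lloyd_max_gaussian(bits, n=400_000, iters=100):
    """Precompute the optimal scalar codebook for N(0, 1). Done once, offline."""
    x = rng.standard_normal(n)
    levels = 2 ** bits
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # percentile init
    for _ in range(iters):
        edges = (c[:-1] + c[1:]) / 2                # nearest-centroid cell edges
        idx = np.searchsorted(edges, x)
        counts = np.bincount(idx, minlength=levels)
        sums = np.bincount(idx, weights=x, minlength=levels)
        c = np.where(counts > 0, sums / np.maximum(counts, 1), c)
    return c

for b in (2, 3, 4):
    c = lloyd_max_gaussian(b)
    y = rng.standard_normal(100_000)                # fresh "rotated" coordinates
    q = c[np.abs(y[:, None] - c[None, :]).argmin(axis=1)]
    mse = np.mean((y - q) ** 2)
    lower = 4.0 ** (-b)                             # rate-distortion bound
    panter_dite = (np.sqrt(3) * np.pi / 2) * 4.0 ** (-b)
    assert lower < mse < panter_dite                # between bound and formula
    print(b, "bits -> MSE", round(mse, 4))
```

Note that no statistic of `y` is ever measured: the codebook from the offline step is applied blind, which is precisely the "no zero points, no scales" property.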
The MSE-optimal quantizer has multiplicative bias (2/π at 1 bit). Applying a 1-bit Johnson-Lindenstrauss transform to the residual cancels this bias exactly. Quantizing the key to sign bits while leaving the query unquantized gives an asymmetric unbiased estimator.
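The asymmetric sign-bit estimator can be checked numerically. A minimal sketch, assuming the standard identity E[sign(⟨g,k⟩)·⟨g,q⟩] = √(2/π)·⟨k,q⟩/‖k‖ for Gaussian g (the key's norm is kept as a single full-precision scalar; the variable names are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 16, 200_000

k = rng.standard_normal(d)                 # "key": will be stored as sign bits
q = rng.standard_normal(d)                 # "query": kept in full precision

G = rng.standard_normal((m, d))            # shared random Gaussian projections
sk = np.sign(G @ k)                        # stored key: one sign bit per projection
gq = G @ q                                 # query side stays unquantized

# E[sign(<g,k>) * <g,q>] = sqrt(2/pi) * <k,q> / ||k||, so rescaling by
# ||k|| * sqrt(pi/2) cancels the multiplicative bias exactly.
est = np.linalg.norm(k) * np.sqrt(np.pi / 2) * np.mean(sk * gq)
true = k @ q
assert abs(est - true) < 0.3               # unbiased; concentrates as m grows
print("estimate", round(est, 2), "vs true", round(true, 2))
```

Quantizing both sides would square the bias and break this cancellation, which is why the estimator is deliberately asymmetric.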
Via Shannon’s rate-distortion theory and Yao’s minimax principle: no quantizer can achieve MSE < 1/4^b. TurboQuant achieves (√3·π/2)/4^b ≈ 2.7/4^b. The gap is a 2.7× factor in the worst case, and just 1.45× at 1 bit.
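Both constants can be verified with a few lines of arithmetic, using the standard fact that the optimal 1-bit quantizer for a Gaussian (centroids at ±√(2/π)) has MSE = 1 − 2/π:

```python
import math

# Worst-case gap: Panter-Dite constant over the rate-distortion bound 1/4^b.
gap = math.sqrt(3) * math.pi / 2
assert abs(gap - 2.72) < 0.01

# At 1 bit: MSE = 1 - 2/pi for the optimal Gaussian quantizer,
# against the 1/4 lower bound.
gap_1bit = (1 - 2 / math.pi) / 0.25
assert abs(gap_1bit - 1.45) < 0.01
print(round(gap, 2), round(gap_1bit, 2))   # → 2.72 1.45
```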
Evaluated on Llama-3.1-8B-Instruct across the LongBench suite. TurboQuant at 3.5 bits matches full 16-bit precision on the LongBench average (50.06 vs. 50.06). Even at 2.5 bits, over 6× compression, the quality drop is negligible.
| Method | KV bits | SingleQA | MultiQA | Summ. | Few-shot | Code | Avg |
|---|---|---|---|---|---|---|---|
| Full Cache | 16 | 45.29 | 45.16 | 26.55 | 68.38 | 46.28 | 50.06 |
| KIVI | 3 | 43.38 | 37.99 | 27.16 | 68.38 | 44.68 | 48.50 |
| KIVI | 5 | 45.04 | 45.70 | 26.47 | 68.57 | 46.41 | 50.16 |
| PolarQuant | 3.9 | 45.18 | 44.48 | 26.23 | 68.25 | 45.24 | 49.78 |
| TurboQuant | 2.5 | 44.16 | 44.96 | 24.80 | 68.01 | 45.76 | 49.44 |
| TurboQuant | 3.5 | 45.01 | 45.31 | 26.00 | 68.63 | 46.17 | 50.06 |
The Google Research blog post went up March 24, 2026. Within four days, the two most important open-source LLM serving frameworks had working implementations. This is one of the fastest research-to-production transitions I’ve seen in ML infrastructure.
- **Paper:** TurboQuant published at ICLR 2026; blog post goes live. Full paper, proofs, and algorithms available on arXiv.
- **vLLM:** Issue #38171 opened, drawing 70 thumbs-up and 53 rocket reactions. A working PoC on the vllm-omni fork shows 7.5× cache reduction on Qwen2.5-7B (H200). PR #38280 follows with full engine integration; its author reports a 21% throughput improvement at batch size 16 with zero latency overhead, and Intel validates zero quality loss on XPU hardware. Phase 2 (bit-packed storage) is in progress.
- **SGLang:** Issue #21618 and draft PR #21617 land core quantization logic, Triton kernels, and full integration across the model_runner, FlashInfer, and Triton backends.
- **Elsewhere:** Feature requests are open in llama.cpp (#20977), Ollama (#15051), Apple MLX, and LM Studio. The entire ecosystem is moving.
The algorithm is dead simple to implement: multiply by a random matrix, look up centroids in a precomputed table, store the indices. No training, no calibration data, no model changes. It’s a drop-in replacement: set `--kv-cache-dtype turboquant` and you’re done.
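Those three steps fit in a screenful of numpy. A minimal sketch under stated assumptions: the function names are mine, and a uniform grid over ±4σ stands in for the precomputed Lloyd-Max codebook, so the error here is looser than the real thing.

```python
import numpy as np

rng = np.random.default_rng(3)
d, b = 64, 4

# Offline, once: a random rotation and a codebook for N(0, 1/d) coordinates.
# (Uniform grid as a stand-in for the precomputed Lloyd-Max codebook.)
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
codebook = np.linspace(-4 / np.sqrt(d), 4 / np.sqrt(d), 2 ** b)

def quantize(v):
    r = Q @ v                                     # rotate: coords now ~ N(0, 1/d)
    return np.abs(r[:, None] - codebook).argmin(axis=1).astype(np.uint8)

def dequantize(idx):
    return Q.T @ codebook[idx]                    # table lookup, rotate back

v = rng.standard_normal(d)
v /= np.linalg.norm(v)
idx = quantize(v)                                 # b bits per coordinate; no
err = np.linalg.norm(v - dequantize(idx))         # scales or zero points stored
assert err < 0.3
print("reconstruction error:", round(err, 3))
```

Everything stored per vector is the `uint8` index array; the rotation and codebook are shared globals, which is where the zero per-block overhead comes from.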
This isn’t incremental. A 6× reduction in the single biggest memory consumer during LLM inference changes the economics of AI deployment. Here’s how.
If KV cache was your bottleneck at 32K tokens, you can now handle 192K on the same hardware. Long-context applications — document analysis, multi-turn agents, RAG over large corpora — become practical where they weren’t before.
For serving providers, KV cache memory determines how many concurrent users a GPU can handle. Shrinking it 6× means serving 6× more requests simultaneously — or using fewer GPUs for the same load.
TurboQuant isn’t just for LLMs. For vector databases, it achieves better recall than Product Quantization while building indices orders of magnitude faster: 0.002 s vs. 494 s at d=3072, roughly a 250,000× speedup. No dataset-specific tuning required.
Complete papers with proofs, algorithms, and pseudocode are on arXiv. QJL has open-source CUDA kernels. The community implemented working integrations in under a week. The math is a public good.