Making AI Models
Think Faster

Every time you send a long message to ChatGPT, Claude, or any AI chatbot, there’s a hidden cost: the model has to re-read your entire conversation to generate each word of its reply. DeepSeek’s new technique — called Native Sparse Attention (NSA) — skips the parts that don’t matter, making long conversations up to 11× faster without losing quality.

11.6×
Faster Processing
On long conversations (64K tokens)
6.0×
Faster Replies
Generating each word of output
= or ↑
Same Quality
Matches or beats the original approach
27B
Parameter Model
Trained from scratch to prove it works

First, What Is Attention?

Attention is how AI language models understand context. When you ask ChatGPT a question, it doesn’t just look at your last sentence — it looks at everything you’ve said in the conversation so far. For every word it’s about to write, the model asks: “Which parts of the conversation are relevant to what I should say next?”

That process of looking back and deciding what’s relevant is called attention. It’s the core mechanism behind every modern AI model — GPT, Claude, Gemini, DeepSeek, all of them. And it has a fundamental scaling problem.

Think of it like this

Imagine you’re writing a reply to a long email thread. Before writing each sentence, you re-read every single email in the thread to make sure your reply is relevant. For a 5-email thread, that’s fine. For a 500-email thread, you’d spend all your time re-reading and almost none writing. That’s exactly the problem AI models face with long conversations.
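The "which parts are relevant" question is literally a weighted lookup. Here is a minimal NumPy sketch of standard (dense) attention, using the textbook scaled dot-product formula with made-up toy sizes, not DeepSeek's actual code:

```python
import numpy as np

def dense_attention(Q, K, V):
    """Standard scaled dot-product attention: every query is compared
    against every key. This all-pairs comparison is the O(n^2) cost."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n): one score per word pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: relevance weights
    return weights @ V                               # weighted mix of the context

n, d = 8, 16                                         # 8 tokens, 16-dim vectors (toy sizes)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = dense_attention(Q, K, V)
print(out.shape)                                     # (8, 16): one context vector per token
```

The `(n, n)` score matrix is the crux: its size, and the memory traffic to build it, grows quadratically with the conversation length.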

Why Long Conversations Are So Expensive

Here’s the math problem. With standard attention, the model compares every word to every other word in the conversation. Double the conversation length and you quadruple the work. This is called O(n²) scaling — meaning the cost grows with the square of the input size. At 64,000 tokens (roughly a 50-page document), that’s over 4 billion comparisons per layer.
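The quadratic blow-up is easy to verify with back-of-the-envelope arithmetic:

```python
# Comparisons per layer for all-pairs attention: n tokens -> n * n scores.
for n in (1_000, 2_000, 64_000):
    print(f"{n:>6} tokens -> {n * n:>13,} comparisons")
# Doubling 1,000 -> 2,000 tokens quadruples the work (1M -> 4M);
# at 64,000 tokens that is 4,096,000,000 comparisons per layer.
```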

Researchers have tried to fix this before by skipping some comparisons — so-called “sparse attention.” But previous attempts had two problems: they were either bolted on after training (meaning the model wasn’t designed for them), or they used random access patterns that made GPUs slow despite doing less work. NSA solves both.

Standard: Compare Everything
Every word checked against every other word
Quadratic — 2× input = 4× cost
NSA: Only Check What Matters
summaries   important parts   recent
Linear — 2× input = 2× cost
Back to the email analogy

Instead of re-reading every email before writing each sentence, NSA does three smarter things: it keeps a summary of the whole thread (the gist of each topic), it pulls up specific emails that are actually relevant to the current point, and it always re-reads the last few messages for immediate context. Three strategies, combined. That’s the whole paper.

Three Ways to Read, One Answer

NSA splits the attention mechanism into three parallel pathways. Each one handles a different kind of “looking back at the conversation.” The model learns to combine them automatically — for some questions it leans on the summary, for others it focuses on specific earlier passages.

Compress

The big picture

Groups the conversation into chunks and compresses each chunk into a short summary. The model reads these summaries instead of every individual word — like reading chapter headings instead of the whole book.

Good for: “What have we been talking about?”
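In code, the compression step can be as simple as pooling each block of tokens into one summary vector. A hedged sketch: the paper uses a small learned network to compress each block, and mean-pooling here is just a stand-in for it, with illustrative sizes:

```python
import numpy as np

def compress_blocks(keys, block=64):
    """Collapse every `block` consecutive key vectors into one summary.
    Stand-in for NSA's learned compression: here we simply mean-pool."""
    n, d = keys.shape
    n_blocks = n // block
    return keys[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)

keys = np.random.default_rng(1).standard_normal((4096, 64))
summaries = compress_blocks(keys)
print(summaries.shape)   # (64, 64): 4096 tokens shrink to 64 summaries
```

Attending over 64 summaries instead of 4,096 tokens is where the "chapter headings" saving comes from.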
Select

The important parts

Uses the compressed summaries as a quick index to figure out which chunks are actually relevant, then goes back and reads those specific chunks in full detail. Only the parts that matter get the expensive, thorough read.

Good for: “What did they say about pricing specifically?”
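The selection step scores the block summaries against the current query and keeps only the top-scoring blocks for a full read. A sketch under the same assumptions as above (mean-pooled summaries and a fixed `top_k` stand in for the paper's learned versions):

```python
import numpy as np

def select_blocks(query, keys, block=64, top_k=4):
    """Score block summaries against the query, keep the best blocks,
    and return their tokens for full-detail attention."""
    n, d = keys.shape
    n_blocks = n // block
    blocks = keys[: n_blocks * block].reshape(n_blocks, block, d)
    summaries = blocks.mean(axis=1)          # cheap index over the blocks
    scores = summaries @ query               # one relevance score per block
    best = np.argsort(scores)[-top_k:]       # indices of the top-k blocks
    return blocks[best].reshape(-1, d)       # only these get the thorough read

keys = np.random.default_rng(2).standard_normal((4096, 64))
query = np.random.default_rng(3).standard_normal(64)
picked = select_blocks(query, keys)
print(picked.shape)    # (256, 64): 4 blocks x 64 tokens instead of all 4096
```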
Slide

What just happened

Always reads the most recent part of the conversation in full detail. No shortcuts here — the last few hundred words are always fully processed. This keeps the model grounded in the current moment.

Good for: “They just asked me to clarify something.”
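The sliding-window branch is the simplest of the three: slice off the most recent tokens and keep them in full. A toy sketch, where `window=512` is an illustrative size rather than the paper's exact setting:

```python
import numpy as np

def sliding_window(keys, window=512):
    """The 'slide' branch: the last `window` tokens are always kept
    in full detail, with no compression or selection applied."""
    return keys[-window:] if len(keys) > window else keys

keys = np.random.default_rng(6).standard_normal((4096, 64))
recent = sliding_window(keys)
print(recent.shape)   # (512, 64): only the most recent context
```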

The Model Decides the Mix

output = g_compress · summary + g_select · details + g_slide · recent
The weights g are not fixed — the model learns them during training. For a question about something mentioned earlier, it might weight “select” heavily. For a follow-up question, “slide” dominates. The model figures out the right balance on its own, for every single word it generates.
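The mix is a learned weighted sum of the three branch outputs. A sketch with hard-coded gate values standing in for the learned ones (in NSA the gates come from a small network conditioned on the input; the numbers below are purely illustrative):

```python
import numpy as np

def nsa_combine(out_compress, out_select, out_slide, gates):
    """Weighted sum of the three branch outputs. In NSA the gate
    values are produced by a learned network per token; here they
    are fixed constants for illustration only."""
    g_c, g_s, g_w = gates
    return g_c * out_compress + g_s * out_select + g_w * out_slide

d = 64
rng = np.random.default_rng(4)
branches = [rng.standard_normal(d) for _ in range(3)]
# A follow-up question might weight the sliding window heavily:
out = nsa_combine(*branches, gates=(0.1, 0.2, 0.7))
print(out.shape)   # (64,): one combined context vector
```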

Designed for How GPUs Actually Work

Here’s why previous “sparse attention” methods didn’t deliver real speedups. A GPU is like a factory with a small, blazing-fast workbench (on-chip memory) and a giant but slow warehouse (main memory). The bottleneck isn’t the math — it’s moving data between the warehouse and the workbench.

Previous sparse methods skipped some math, but still needed to grab random pieces of data from the warehouse — scattered reads that are painfully slow on GPUs. NSA avoids this by organizing everything into neat, contiguous blocks that the GPU can grab in bulk.


Organized in Blocks

All three branches work on tidy blocks of 64 consecutive words. No random jumping around in memory. The GPU can load an entire block in one efficient read instead of fetching individual words one at a time.

FAST MEMORY ACCESS
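The difference between scattered and block-wise reads is easy to see in array terms. A toy NumPy illustration (the real NSA kernels do this on the GPU in Triton; the block indices below are arbitrary examples):

```python
import numpy as np

keys = np.arange(4096 * 4).reshape(4096, 4)    # 4096 token vectors

# Scattered: fetch 256 individual tokens from random positions.
# On a GPU, each one is a separate, slow memory transaction.
scattered_idx = np.random.default_rng(5).choice(4096, 256, replace=False)
scattered = keys[scattered_idx]

# Blocked: fetch 4 contiguous blocks of 64 tokens. Same amount of
# data, but each block is one long sequential read.
blocks = keys.reshape(64, 64, 4)               # 64 blocks of 64 tokens
blocked = blocks[[3, 17, 40, 61]].reshape(256, 4)

print(scattered.shape, blocked.shape)          # both (256, 4)
```

Both paths move the same number of bytes; the blocked layout is what lets the hardware move them quickly.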

Scoring Is Free

The “select” branch needs to figure out which blocks are important. Cleverly, it reuses the compressed summaries that the “compress” branch already computed — so the importance scoring costs almost nothing extra.

NO EXTRA WORK

Everything Fused

Score the blocks, pick the winners, and do the attention — all in one GPU operation instead of three separate ones. Each separate operation would mean another slow round trip to main memory. Fusing them avoids that entirely.

ONE OPERATION

Trainable From Scratch

The efficiency tricks aren’t bolted on after training. The model is born with sparse attention — it learns which words to skip as part of its normal training process. This means the shortcuts are optimized for the actual data.

NATIVE, NOT RETROFITTED

The GPU Bottleneck, Visualized

Main Memory
80 GB
Lots of space, but slow
On-Chip Cache
20 MB
Tiny, but 100× faster
Compute Cores
Do the math
Fast, but need data fed to them
Standard attention forces the GPU to load the entire conversation from main memory for every word generated. NSA keeps compressed summaries and block indices in the fast on-chip cache, and only loads the selected blocks from main memory. Less data moved = faster results.
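A rough data-movement estimate shows why this wins. All sizes here are illustrative assumptions (64K context, 128-dim heads in fp16, 64-token blocks, 16 selected blocks, a 512-token window), not the paper's exact configuration:

```python
# Bytes of key/value cache touched per generated token, per attention head.
d, bytes_per = 128, 2                  # head dim, fp16 -- assumed sizes
n = 64_000                             # context length

full = n * 2 * d * bytes_per           # standard: every key AND value

block, top_k, window = 64, 16, 512
summaries = (n // block) * 2 * d * bytes_per   # compressed index
selected  = top_k * block * 2 * d * bytes_per  # chosen blocks, read in full
recent    = window * 2 * d * bytes_per         # sliding window
sparse = summaries + selected + recent

print(f"standard: {full / 1e6:.1f} MB, NSA-style: {sparse / 1e6:.2f} MB "
      f"({full / sparse:.0f}x less data moved)")
```

Under these assumptions the sparse path moves roughly 25× less data per token, which is exactly the kind of gap that turns into wall-clock speedups on memory-bound decoding.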

Faster and Smarter

The obvious question: if you’re skipping parts of the conversation, don’t you lose quality? DeepSeek trained two identical 27-billion-parameter models — one with standard attention, one with NSA — on the exact same data. The NSA model matches or beats standard attention on every major benchmark. On tasks requiring long-context understanding, it actually does better, because learning to focus on what matters turns out to be a feature, not a compromise.

Benchmark                                  Standard   NSA (Sparse)
General Knowledge
  MMLU (5-shot)                              78.0       78.7
  MMLU-Pro                                   52.3       52.8
  MMLU-Redux                                 77.5       78.2
  C-Eval                                     79.6       80.3
Math & Reasoning
  MATH-500                                   72.4       75.4
  GPQA (Diamond)                             36.4       38.9
  BBH                                        79.3       81.1
  ARC-Challenge                              91.6       91.4
Coding
  HumanEval                                  69.5       72.0
  MBPP (3-shot)                              74.0       75.5
  CRUXEval-O                                 56.0       57.0
Long Document Understanding (64K tokens)
  RULER                                      90.3       93.0
NSA leads on 11 of 12 benchmarks   |   Both models: 27B params, 260B training tokens, identical data
93.0 vs. 90.3
On RULER, a benchmark that tests whether models can find and use information buried in long documents, the sparse model outperforms the standard model. Learning to skip irrelevant parts doesn’t just save compute — it actually helps the model focus better. Less noise, cleaner signal.

How Much Faster?

These are real measurements on real GPUs (NVIDIA A100s), comparing NSA against the best existing implementation of standard attention. The speedup is most dramatic on long inputs — exactly the scenario where you need it most.

Processing long inputs:  Standard 1.0×  →  NSA 11.6×
Generating replies:      Standard 1.0×  →  NSA 6.0×
                         Standard 1.0×  →  NSA 6.2×
Why processing is faster than generating

When the model processes your entire input at once (the 11.6× number), it can build and reuse its compressed summaries across every word simultaneously. When generating a reply word by word (the 6.0× number), it still has to check those summaries at each step to figure out which parts of the conversation to focus on. That is still much faster than reading everything, just not quite as parallelizable.

Why This Matters

This paper came from DeepSeek, the same lab behind DeepSeek-V2 and V3 — models that are actually in production serving millions of users. These aren’t theoretical ideas. They solve the single biggest bottleneck in making AI models useful: the cost of handling long conversations and large documents.


Cheaper AI

6× faster replies means 6× less GPU time, which means lower costs for API providers like OpenAI and Anthropic. That translates to cheaper pricing for users and developers. For companies running their own models, it means fewer expensive GPUs needed.

COST REDUCTION

Learned, Not Hacked

The key insight is that the model learns what to skip during training, rather than using a fixed rule imposed by researchers. This means the shortcuts are optimized for the actual data — the model discovers its own best strategy for what’s worth reading carefully.

SMARTER SHORTCUTS

Longer Context, Finally

Today’s models advertise 128K or even 1M token context windows, but in practice they’re slow and expensive at those lengths. Sparse attention makes long context actually practical — fast enough to use routinely, not just as a demo feature.

REAL LONG CONTEXT

Open Research

The full paper is public on arXiv with complete architecture details, training recipes, and code. DeepSeek continues to be one of the few frontier AI labs that publishes its core innovations openly, letting the entire research community build on this work.

ARXIV 2502.11089
Summarize · Select · Remember
Read the summaries for the big picture. Pull up the important parts for detail. Always keep track of what just happened. Three strategies that humans use intuitively when reading long documents — now built into how AI models process language. Faster, cheaper, and if anything, a little smarter.