Every time you send a long message to ChatGPT, Claude, or any AI chatbot, there’s a hidden cost: the model has to re-read your entire conversation to generate each word of its reply. DeepSeek’s new technique — called Native Sparse Attention (NSA) — skips the parts that don’t matter, making long conversations up to 11× faster without losing quality.
Attention is how AI language models understand context. When you ask ChatGPT a question, it doesn’t just look at your last sentence — it looks at everything you’ve said in the conversation so far. For every word it’s about to write, the model asks: “Which parts of the conversation are relevant to what I should say next?”
That process of looking back and deciding what’s relevant is called attention. It’s the core mechanism behind every modern AI model — GPT, Claude, Gemini, DeepSeek, all of them. And it has a fundamental scaling problem.
Imagine you’re writing a reply to a long email thread. Before writing each sentence, you re-read every single email in the thread to make sure your reply is relevant. For a 5-email thread, that’s fine. For a 500-email thread, you’d spend all your time re-reading and almost none writing. That’s exactly the problem AI models face with long conversations.
Here’s the math problem. With standard attention, the model compares every word to every other word in the conversation. Double the conversation length and you quadruple the work. This is called O(n²) scaling — the cost grows with the square of the input size. At 64,000 tokens (roughly 48,000 words, a short book’s worth of text), that’s over 4 billion comparisons per layer.
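A quick sketch of that arithmetic (treating every token as comparing against every other token, and ignoring the constant factors a real implementation adds):

```python
# Illustrative arithmetic: pairwise attention compares every token
# with every other token, so cost grows as n * n.
def attention_comparisons(n_tokens: int) -> int:
    return n_tokens * n_tokens

print(f"{attention_comparisons(64_000):,}")  # 4,096,000,000 -- over 4 billion
# Doubling the input quadruples the work:
print(attention_comparisons(128_000) // attention_comparisons(64_000))  # 4
```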
Researchers have tried to fix this before by skipping some comparisons — so-called “sparse attention.” But previous attempts had two problems: they were either bolted on after training (meaning the model wasn’t designed for them), or they used random access patterns that made GPUs slow despite doing less work. NSA solves both.
Instead of re-reading every email before writing each sentence, NSA does three smarter things: it keeps a summary of the whole thread (the gist of each topic), it pulls up specific emails that are actually relevant to the current point, and it always re-reads the last few messages for immediate context. Three strategies, combined. That’s the whole paper.
NSA splits the attention mechanism into three parallel pathways. Each one handles a different kind of “looking back at the conversation.” The model learns to combine them automatically — for some questions it leans on the summary, for others it focuses on specific earlier passages.
**Compress.** Groups the conversation into chunks and compresses each chunk into a short summary. The model reads these summaries instead of every individual word — like reading chapter headings instead of the whole book.

**Select.** Uses the compressed summaries as a quick index to figure out which chunks are actually relevant, then goes back and reads those specific chunks in full detail. Only the parts that matter get the expensive, thorough read.

**Sliding window.** Always reads the most recent part of the conversation in full detail. No shortcuts here — the last few hundred words are always fully processed. This keeps the model grounded in the current moment.
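The three branches can be sketched in a few lines of numpy for a single query token. This is only a conceptual sketch: the block size, window size, and gate values below are made-up illustration parameters, and the compression is a simple mean-pool where the paper uses a small learned network.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """Standard scaled dot-product attention for one query vector."""
    return softmax(K @ q / np.sqrt(q.shape[0])) @ V

rng = np.random.default_rng(0)
n, d, block, top_k, window = 512, 64, 64, 2, 128
K = rng.standard_normal((n, d))   # keys for the whole conversation
V = rng.standard_normal((n, d))   # values for the whole conversation
q = rng.standard_normal(d)        # the current query token

# Branch 1 -- compress: mean-pool each block into one summary vector.
K_blocks = K.reshape(n // block, block, d)
V_blocks = V.reshape(n // block, block, d)
K_sum, V_sum = K_blocks.mean(axis=1), V_blocks.mean(axis=1)
out_compress = attend(q, K_sum, V_sum)

# Branch 2 -- select: score blocks by their summaries, read top-k in full.
best = np.argsort(K_sum @ q)[-top_k:]
out_select = attend(q, K_blocks[best].reshape(-1, d),
                       V_blocks[best].reshape(-1, d))

# Branch 3 -- sliding window: always read the most recent tokens in full.
out_window = attend(q, K[-window:], V[-window:])

# Learned gates (fixed here for illustration) mix the three branches.
g = np.array([0.3, 0.4, 0.3])
out = g[0] * out_compress + g[1] * out_select + g[2] * out_window
print(out.shape)  # (64,)
```

In the real model the three gate weights are produced per query by a small learned function, which is how it shifts between relying on summaries and relying on specific passages.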
Here’s why previous “sparse attention” methods didn’t deliver real speedups. A GPU is like a factory with a small, blazing-fast workbench (on-chip memory) and a giant but slow warehouse (main memory). The bottleneck isn’t the math — it’s moving data between the warehouse and the workbench.
Previous sparse methods skipped some math, but still needed to grab random pieces of data from the warehouse — scattered reads that are painfully slow on GPUs. NSA avoids this by organizing everything into neat, contiguous blocks that the GPU can grab in bulk.
**Contiguous blocks.** All three branches work on tidy blocks of 64 consecutive words. No random jumping around in memory: the GPU can load an entire block in one efficient read instead of fetching individual words one at a time.
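The contiguous-versus-scattered distinction shows up even in numpy, as a loose analogy. A basic slice of consecutive rows is a zero-copy view, while gathering the same number of rows from scattered positions forces a copy; on a GPU the analogous distinction is coalesced versus scattered reads from main memory.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, block = 4096, 64, 64
K = rng.standard_normal((n, d))

# Scattered access: 64 tokens pulled from random positions.
scattered_idx = rng.choice(n, size=block, replace=False)
scattered = K[scattered_idx]         # fancy indexing: materializes a copy

# Blocked access: 64 consecutive tokens in one contiguous slice.
start = 7 * block
contiguous = K[start:start + block]  # basic slice: a view, one bulk read

print(contiguous.base is K)  # True  -- a view into K, no copy
print(scattered.base is K)   # False -- a freshly gathered copy
```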
**Reused summaries.** The “select” branch needs to figure out which blocks are important. Cleverly, it reuses the compressed summaries that the “compress” branch already computed, so the importance scoring costs almost nothing extra.
**Kernel fusion.** NSA scores the blocks, picks the winners, and computes the attention in one GPU operation instead of three separate ones. Each separate operation would mean another slow round trip to main memory; fusing them avoids that entirely.
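Conceptually, the fused path looks like one function call that scores, selects, and attends. The sketch below (plain numpy, with assumed block and top-k sizes) shows the logic only; in numpy the "fusion" is just organizational, whereas the paper's Triton kernel actually keeps the intermediate results on-chip to avoid the memory round trips.

```python
import numpy as np

def fused_select_attend(q, K, V, block=64, top_k=2):
    """Score blocks via their summaries, pick top-k, attend -- one call.

    Sketch of the fused select branch; parameters are illustrative.
    """
    d = q.shape[0]
    Kb = K.reshape(-1, block, d)
    Vb = V.reshape(-1, block, d)
    # Reuse mean-pooled block summaries as the cheap importance index.
    scores = Kb.mean(axis=1) @ q
    best = np.argsort(scores)[-top_k:]
    K_sel = Kb[best].reshape(-1, d)
    V_sel = Vb[best].reshape(-1, d)
    s = K_sel @ q / np.sqrt(d)
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V_sel

rng = np.random.default_rng(2)
K = rng.standard_normal((512, 64))
V = rng.standard_normal((512, 64))
q = rng.standard_normal(64)
print(fused_select_attend(q, K, V).shape)  # (64,)
```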
The efficiency tricks aren’t bolted on after training. The model is born with sparse attention — it learns which words to skip as part of its normal training process. This means the shortcuts are optimized for the actual data.
The obvious question: if you’re skipping parts of the conversation, don’t you lose quality? DeepSeek trained two otherwise-identical 27-billion-parameter models — one with standard attention, one with NSA — on the same data. The NSA model matches or beats standard attention on nearly every major benchmark (the lone exception below, ARC-Challenge, is a 0.2-point gap). On tasks requiring long-context understanding it actually does better: learning to focus on what matters turns out to be a feature, not a compromise.
| Benchmark | Standard | NSA (Sparse) |
|---|---|---|
| **General Knowledge** | | |
| MMLU (5-shot) | 78.0 | 78.7 |
| MMLU-Pro | 52.3 | 52.8 |
| MMLU-Redux | 77.5 | 78.2 |
| C-Eval | 79.6 | 80.3 |
| **Math & Reasoning** | | |
| MATH-500 | 72.4 | 75.4 |
| GPQA (Diamond) | 36.4 | 38.9 |
| BBH | 79.3 | 81.1 |
| ARC-Challenge | 91.6 | 91.4 |
| **Coding** | | |
| HumanEval | 69.5 | 72.0 |
| MBPP (3-shot) | 74.0 | 75.5 |
| CRUXEval-O | 56.0 | 57.0 |
| **Long Document Understanding (64K tokens)** | | |
| RULER | 90.3 | 93.0 |
These are real measurements on real GPUs (NVIDIA A100s), comparing NSA against the best existing implementation of standard attention. The speedup is most dramatic on long inputs — exactly the scenario where you need it most.
The gains are biggest when generating a reply word by word (the 11.6× number): for every new word, the model must fetch the conversation’s stored context from main memory, and that phase is bottlenecked almost entirely by memory traffic, which is exactly what NSA cuts. Processing your entire input at once is limited by raw computation as well as memory, so the speedup there is smaller (the 6.0× figure), though still substantial.
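A back-of-the-envelope count of memory reads per generated word shows where the decode-time gain comes from. The block size, number of selected blocks, and window size below are assumed round numbers, not the paper's exact configuration, and raw read counts overstate the wall-clock speedup because of fixed overheads.

```python
# Tokens actually read per generated word at 64K context (assumed config).
n = 64_000      # conversation length in tokens
block = 64      # compression block size
top_k = 16      # blocks re-read in full detail   (assumed value)
window = 512    # recent tokens always read       (assumed value)

full = n                                    # standard attention reads everything
nsa = n // block + top_k * block + window   # summaries + selected + recent
print(full, nsa)           # 64000 2536
print(round(full / nsa))   # 25 -- raw read ratio; measured speedup is lower
```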
This paper came from DeepSeek, the same lab behind DeepSeek-V2 and V3 — models that are actually in production serving millions of users. These aren’t theoretical ideas. They solve the single biggest bottleneck in making AI models useful: the cost of handling long conversations and large documents.
Attention is only part of a model’s total compute, but at long context it dominates — so several-fold faster attention translates into meaningfully lower GPU costs for API providers like OpenAI and Anthropic, if they adopt similar techniques. That means cheaper pricing for users and developers; for companies running their own models, it means fewer expensive GPUs.
The key insight is that the model learns what to skip during training, rather than using a fixed rule imposed by researchers. This means the shortcuts are optimized for the actual data — the model discovers its own best strategy for what’s worth reading carefully.
Today’s models advertise 128K or even 1M token context windows, but in practice they’re slow and expensive at those lengths. Sparse attention makes long context actually practical — fast enough to use routinely, not just as a demo feature.
The full paper is public on arXiv with complete architecture details and training recipes. DeepSeek continues to be one of the few frontier AI labs that publishes its core innovations openly, letting the entire research community build on this work.