◆ arXiv 2602.15763v2 — Feb 2026

The Open-Source Model
That Fooled Everyone

Zhipu AI released a 744B parameter model anonymously on OpenRouter as “Pony Alpha.” The community thought it was Claude Sonnet 5, DeepSeek, or Grok. It was a Chinese open-source model all along.

744B
Total Parameters
Mixture-of-Experts architecture
40B
Active Per Token
256 experts, 8 routed per token
1M+
Context Window
Via Dynamic Sparse Attention
14.5T
Training Tokens
Multi-stage training pipeline
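The MoE stats above (256 experts, 8 routed per token) can be pictured as a top-k router. This is a minimal illustration, not the paper's router: the gate weights here are random stand-ins, and production routers add shared experts and load-balancing losses that this sketch omits.

```python
import numpy as np

def moe_route(hidden, gate_W, top_k=8):
    """Score all experts for one token, keep the top-k."""
    logits = hidden @ gate_W                 # (n_experts,) one score per expert
    chosen = np.argsort(logits)[-top_k:]     # indices of the 8 routed experts
    w = np.exp(logits[chosen] - logits[chosen].max())
    return chosen, w / w.sum()               # expert ids + mixing weights

rng = np.random.default_rng(0)
hidden_dim, n_experts = 1024, 256            # made-up hidden size; 256 matches the stats
ids, weights = moe_route(rng.normal(size=hidden_dim),
                         rng.normal(size=(hidden_dim, n_experts)))
print(len(ids), round(float(weights.sum()), 6))  # 8 1.0
```

Only the 8 chosen experts run their feed-forward pass for that token, which is how 744B total parameters collapse to ~40B active per token (the remainder of the active budget comes from shared, always-on parameters).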

From Vibe Coding to Agentic Engineering

GLM-5’s core thesis: most models can generate code snippets. That’s “vibe coding.” GLM-5 is trained to be a full engineer — it navigates codebases, uses tools, plans multi-step work, runs tests, and iterates on failures. Think of it as the difference between someone who can write a SQL query and someone who can design and operate your entire data pipeline.

2023–2024
Code Completion
Model autocompletes the next line. Useful, but you’re the driver.
> def sort_list(arr):↵
  return sorted(arr) ✓
2025
Vibe Coding
Model generates entire functions/files from descriptions. You review and paste.
> "write a FastAPI endpoint that…"
  → 40 lines of working code
2026 — GLM-5
Agentic Engineering
Model autonomously navigates repos, reads tests, edits files, runs builds, and iterates until the task is done. You review the PR.
> "fix the flaky test in auth module"
  → reads 12 files, edits 3, runs CI ✓

Dynamic Sparse Attention

Standard transformers attend to every token in the context — that’s O(n²) compute. At 1M tokens, that’s a trillion attention operations. GLM-5 adds an indexer that learns which tokens matter for each query and only attends to those.

Standard: Full Attention
Every token attends to every other token
O(n²) — 1M tokens = disaster
GLM-5: Sparse Attention (DSA)
Indexer selects only the tokens that matter
O(n · k) — 1M tokens = fine
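The gap is easy to put numbers on. A quick sketch (k = 2,048 is an illustrative TopK budget, not a figure quoted here):

```python
n = 1_000_000  # context length in tokens
k = 2_048      # tokens each query actually attends to (illustrative)

full_pairs = n * n      # full attention: every query scores every key
sparse_pairs = n * k    # DSA: every query scores only k selected keys

print(f"full:      {full_pairs:.1e} query-key pairs")    # 1.0e+12, "a trillion"
print(f"sparse:    {sparse_pairs:.1e} query-key pairs")  # 2.0e+09
print(f"reduction: {full_pairs // sparse_pairs}x")       # 488x
```

The indexer still has to score all n positions, but it does so in a much lower dimension than the real attention heads, so that pass is cheap by comparison.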

The Database Analogy

Standard attention is like running SELECT * FROM tokens on every query — full table scan, every time. DSA adds a B-tree index: 32 extra attention heads that score which KV cache entries are worth reading. At inference time, you look up the index first, then only fetch the relevant rows.

A data-engineering analogy
It’s like Apache Iceberg’s metadata pruning, but for attention. Just like Iceberg skips data files that don’t match partition predicates, DSA skips token positions that the indexer scores as irrelevant. Same idea — metadata-driven pruning to avoid scanning everything.

How the Indexer Works

Each attention layer gets 32 lightweight “indexer heads” that produce a relevance score for every KV cache position. TopK selection picks the highest-scoring positions. Only those tokens participate in the actual attention computation.

Key detail
The indexer heads are trained end-to-end with the model. They aren’t heuristic — they learn what to attend to from the data itself. This is why it works better than fixed sparse patterns (like local-window or strided attention).
Stage 1 — Full Attention

Learn the Index

Pre-train normally with full attention on 128K context. The indexer heads learn what to pay attention to by observing full attention patterns. Like building a query optimizer by first running full table scans.

Stage 2 — Warmup

Gradual Sparsification

Progressively reduce the selection ratio — from attending to 100% of tokens down to the target TopK. The model adapts to operating with less information, like gradually lowering a cache hit rate target.
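The warmup stage amounts to an annealing schedule on the selection ratio. A linear ramp is shown here purely for illustration; the warmup length and target ratio are made-up values, not numbers from the paper.

```python
def selection_ratio(step, warmup_steps=10_000, target=0.02):
    """Fraction of the context each query may attend to at a given step:
    1.0 (full attention) at step 0, annealed linearly down to `target`."""
    frac = min(step / warmup_steps, 1.0)
    return target + (1.0 - frac) * (1.0 - target)

print(selection_ratio(0))        # 1.0  - start: attend to everything
print(selection_ratio(5_000))    # 0.51 - halfway through warmup
print(selection_ratio(10_000))   # 0.02 - target sparsity reached
```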

Stage 3 — Sparse

Full Sparse Mode

Train with only TopK tokens per query at 1M+ context. The indexer is now fully trained. Context extends from 128K to 1M with minimal quality loss because it learned what actually matters.

128K → 1M+
Context window extension with minimal quality degradation. On the RULER benchmark (synthetic long-context evaluation), DSA maintains near-full-attention quality at 1M tokens where standard approaches collapse. The 32 indexer heads add only ~1.5% overhead to model parameters.

The Four-Stage Pipeline

GLM-5’s training isn’t just “throw data at a transformer.” It’s a carefully orchestrated pipeline: each phase builds capabilities that the next phase depends on.

Phase 1

Pre-Training

14.5T tokens

Massive data ingestion. Web text, code, books, scientific papers. Builds the foundation — world knowledge, language patterns, basic reasoning. The extract phase, in ETL terms.

Phase 2

Mid-Training

Long-context + Reasoning

Extends context window, trains DSA indexer, adds chain-of-thought reasoning. This is where the model learns to think in steps rather than just predict next tokens.

Phase 3

Post-Training SFT

Curated instruction data

Supervised fine-tuning on high-quality instruction-response pairs. Teaches the model to follow directions, format outputs correctly, and use tools. The transformation phase.

Phase 4

Reinforcement Learning

The secret sauce ↓

Three distinct RL stages that take the model from “knows things” to “can actually do things.” This is where GLM-5 diverges from most other models.

Phase 4 Deep Dive: Three Types of RL

This is where the training gets interesting. Each RL type targets a different capability.


Standard RL

Classic reasoning RL: give the model math/logic problems, reward correct answers. Uses a verifier model to check work. Improves step-by-step reasoning quality.

reward = correctness_score

Agentic RL

The model gets real SWE-bench coding tasks. It operates in a sandbox — reads files, edits code, runs tests. Reward is binary: did the tests pass? This trains genuine engineering behavior, not just code generation.

reward = tests_pass ? 1.0 : 0.0

Visual RL

Model generates HTML/CSS/SVG to render UI. A vision model compares the rendering to a reference image. Reward is visual fidelity. Trains precise visual output — not just code that “looks right” as text.

reward = visual_similarity(render, ref)
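The three reward signals above can be sketched side by side. The helper bodies below are hypothetical stand-ins for the paper's actual reward infrastructure (a verifier model, a CI sandbox, a vision-model scorer):

```python
def reasoning_reward(answer: str, reference: str) -> float:
    # Standard RL: a verifier checks the final answer (string match as a proxy).
    return 1.0 if answer.strip() == reference.strip() else 0.0

def agentic_reward(tests_exit_code: int) -> float:
    # Agentic RL: binary signal, did the sandboxed test suite pass?
    return 1.0 if tests_exit_code == 0 else 0.0

def visual_reward(render: bytes, reference: bytes) -> float:
    # Visual RL: similarity in [0, 1]; the real system uses a vision model,
    # not this byte-level proxy.
    if not reference:
        return 0.0
    matches = sum(a == b for a, b in zip(render, reference))
    return matches / max(len(render), len(reference))

print(reasoning_reward("42", " 42 "))  # 1.0
print(agentic_reward(0))               # 1.0
print(agentic_reward(1))               # 0.0
```

The common thread is that every reward is grounded in an external check (verifier, test run, rendered pixels) rather than in the model's own next-token likelihood.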

Benchmark Comparison

GLM-5 is the best open-source model on most benchmarks and competitive with the top proprietary models. Numbers are from the paper, spot-checked against public leaderboards.

Benchmark GLM-5 Opus 4.5 GPT-5.2 DeepSeek V3.2 Gemini 3 Pro
Reasoning
AIME 2025 (math) 80.3 65.0 78.0 73.0 84.0
HMMT Feb 2025 (math) 97.9 81.2
HLE (hard reasoning) 24.2 14.8 22.2 14.0 21.6
GPQA Diamond (science) 74.7 68.7 73.7 68.4 71.0
Coding
SWE-bench Verified 77.8 80.9 81.6 72.6 63.8
Terminal-Bench 44.3 38.4 38.9 30.9 39.7
LiveCodeBench (v6) 73.5 54.4 76.5 65.6 65.2
Agentic
BrowseComp 62.0 60.6 58.8 14.2 40.6
MCP-Atlas (tool use) 79.2 73.3 70.3 67.8 72.2
Tau-bench (airline) 66.0 48.5 52.5 47.2 32.0
Best score in each row is highlighted   •   Dashes = not reported in paper   •   All scores from GLM-5 technical report, Tables 1–4

The Pony Alpha Story

Before publishing the paper, Zhipu did something unusual: they released GLM-5 anonymously on OpenRouter under the name “Pony Alpha.” No branding, no paper, no marketing. Just a model endpoint. What happened next proved their point.

Timeline

Feb 21
“Pony Alpha” appears on OpenRouter with no documentation or attribution. Just an API endpoint.
Feb 22
AI Twitter notices. Early users report it’s surprisingly good at coding and reasoning. Speculation begins immediately.
Feb 23
Threads go viral. “This is definitely Anthropic testing Claude 5.” “It’s DeepSeek V4.” Community polls emerge.
Feb 24
100K+ API calls. Users running benchmarks. Consistent top-tier performance across reasoning, coding, tool use.
Feb 25
Zhipu reveals: it was GLM-5 all along. Paper drops on arXiv. Open-source community erupts.

Who Did They Think It Was?

Community speculation from X/Twitter polls and forum threads:

Claude Sonnet 5: ~25%
DeepSeek V4: ~20%
Grok 4: ~10%
GLM-5 (correct): ~15%
Other / Unknown: ~30%
“When you remove the brand name, all that’s left is the quality of the output. Pony Alpha proved that open-source Chinese models can compete at the absolute frontier — the community just didn’t believe it until they couldn’t see the label.”
— from the GLM-5 technical report, Section 6

Why This Matters

GLM-5 isn’t just another model announcement. It represents a few structural shifts in how the AI landscape is evolving.

Open Weights at the Frontier

You can download the model, run it, fine-tune it, and deploy it on your own infrastructure. The gap between open-source and proprietary was 2+ years in 2023. It’s now months. GLM-5 matches or beats Opus 4.5 on most benchmarks.

Open-Weight

Hardware Independence

Zhipu adapted GLM-5 to run on 7 different Chinese chip platforms (Huawei Ascend, Cambricon, etc.) — not just NVIDIA. If US export controls tighten further, this model can still be trained and served domestically.

7 Chip Platforms

Agentic RL as the Future

The “agentic RL” training paradigm — giving models real engineering tasks in sandboxes and rewarding passing tests — may become the standard for how all coding models are trained. It’s more aligned with actual engineering than predicting next tokens.

New Training Paradigm

Brand Doesn’t Equal Quality

The Pony Alpha experiment is a clean demonstration that evaluation bias is real. When users didn’t know the model’s origin, they rated it as frontier-tier. The name on the label was the only thing that changed.

Evaluation Bias