Zhipu AI released a 744B-parameter model anonymously on OpenRouter as “Pony Alpha.” The community thought it was Claude Sonnet 5, DeepSeek, or Grok. It was a Chinese open-source model all along.
GLM-5’s core thesis: most models can generate code snippets. That’s “vibe coding.” GLM-5 is trained to be a full engineer — it navigates codebases, uses tools, plans multi-step work, runs tests, and iterates on failures. Think of it as the difference between someone who can write a SQL query vs. someone who can design and operate your entire data pipeline.
Standard transformers attend to every token in the context — that’s O(n²) compute. At 1M tokens, that’s a trillion attention operations. GLM-5 adds an indexer that learns which tokens matter for each query and only attends to those.
Standard attention is like running SELECT * FROM tokens on every query: a full table scan, every time. DSA, GLM-5’s sparse attention mechanism, adds the equivalent of a B-tree index: 32 extra attention heads that score which KV cache entries are worth reading. At inference time, you consult the index first, then fetch only the relevant rows.
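The arithmetic behind the quadratic claim is easy to check. A minimal sketch of the cost gap (the per-query TopK budget of 2,048 is an illustrative number, not from the paper):

```python
n = 1_000_000            # context length in tokens
dense_ops = n * n        # full attention: every query attends to every key
top_k = 2_048            # hypothetical per-query TopK budget
sparse_ops = n * top_k   # sparse: each query reads only TopK selected keys

print(f"{dense_ops:,} vs {sparse_ops:,} ({dense_ops // sparse_ops}x fewer)")
```

At n = 1M, dense attention really is a trillion query-key pairs per layer; the TopK variant shrinks that by whatever factor the selection budget allows.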
Each attention layer gets 32 lightweight “indexer heads” that produce a relevance score for every KV cache position. TopK selection picks the highest-scoring positions. Only those tokens participate in the actual attention computation.
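The selection step can be sketched in a few lines of NumPy. This is a toy single-query, single-head illustration under assumed shapes, not the paper’s implementation; the indexer scores are stubbed with random values where the real model would run its lightweight indexer heads:

```python
import numpy as np

def sparse_attend(q, keys, values, indexer_scores, top_k):
    """Attend only to the top_k highest-scoring KV cache positions."""
    # The indexer assigns a relevance score to every cached position;
    # keep only the top_k best.
    idx = np.argsort(indexer_scores)[-top_k:]
    k_sel, v_sel = keys[idx], values[idx]

    # Standard scaled dot-product attention, but over 32 rows
    # instead of 1,000.
    logits = k_sel @ q / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v_sel

# Toy cache: 1,000 positions, 64-dim heads, keep 32 positions.
rng = np.random.default_rng(0)
d = 64
q = rng.normal(size=d)
keys = rng.normal(size=(1000, d))
values = rng.normal(size=(1000, d))
scores = rng.normal(size=1000)      # stand-in for indexer-head output
out = sparse_attend(q, keys, values, scores, top_k=32)
print(out.shape)
```

The expensive softmax-attention arithmetic only ever touches the selected rows; the full cache is read solely by the cheap scoring pass.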
Pre-train normally with full attention on 128K context. The indexer heads learn what to pay attention to by observing full attention patterns. Like building a query optimizer by first running full table scans.
Progressively reduce the selection ratio — from attending to 100% of tokens down to the target TopK. The model adapts to operating with less information, like gradually lowering a cache hit rate target.
Train with only TopK tokens per query at 1M+ context. The indexer is now fully trained. Context extends from 128K to 1M with minimal quality loss because it learned what actually matters.
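The three phases above amount to an annealing schedule on the selection ratio. A sketch, with made-up stage boundaries and a made-up final ratio (the paper’s actual schedule and TopK budget are not reproduced here):

```python
def selection_ratio(step, total_steps, final_ratio=0.03):
    """Fraction of KV positions each query may attend to at a given step.

    Phase 1 (first third): full attention while the indexer learns
    from dense attention patterns.
    Phase 2 (middle third): linearly shrink toward the target ratio.
    Phase 3 (final third): hold at the sparse TopK budget.
    """
    stage = step / total_steps
    if stage < 1 / 3:
        return 1.0
    if stage < 2 / 3:
        t = (stage - 1 / 3) / (1 / 3)   # 0 -> 1 across the middle phase
        return 1.0 + t * (final_ratio - 1.0)
    return final_ratio

ratios = [round(selection_ratio(s, 900), 3) for s in (0, 300, 450, 600, 899)]
print(ratios)
```

The point of the middle phase is that the model never sees a sudden cliff: it adapts to progressively less information, the “gradually lowering a cache hit rate target” analogy made literal.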
GLM-5’s training isn’t just “throw data at a transformer.” It’s a carefully orchestrated pipeline in which each stage builds the capabilities the next one depends on.
Massive data ingestion. Web text, code, books, scientific papers. Builds the foundation — world knowledge, language patterns, basic reasoning. The raw ETL phase.
Extends context window, trains DSA indexer, adds chain-of-thought reasoning. This is where the model learns to think in steps rather than just predict next tokens.
Supervised fine-tuning on high-quality instruction-response pairs. Teaches the model to follow directions, format outputs correctly, and use tools. The transformation phase.
Three distinct RL stages that take the model from “knows things” to “can actually do things.” This is where GLM-5 diverges from most other models.
This is where the training gets interesting. Each RL type targets a different capability.
Classic reasoning RL: give the model math/logic problems, reward correct answers. Uses a verifier model to check work. Improves step-by-step reasoning quality.
The model gets real SWE-bench coding tasks. It operates in a sandbox — reads files, edits code, runs tests. Reward is binary: did the tests pass? This trains genuine engineering behavior, not just code generation.
Model generates HTML/CSS/SVG to render UI. A vision model compares the rendering to a reference image. Reward is visual fidelity. Trains precise visual output — not just code that “looks right” as text.
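The three RL stages differ mainly in their reward signal. A sketch of the three shapes, with all the hard parts (the verifier model, the sandbox test runner, the vision model) stubbed behind hypothetical inputs:

```python
def reasoning_reward(answer: str, verified_answer: str) -> float:
    """Reasoning RL: 1.0 if the (stubbed) verifier confirms the answer."""
    return 1.0 if answer.strip() == verified_answer.strip() else 0.0

def agentic_reward(tests_passed: bool) -> float:
    """Agentic coding RL: binary -- did the sandbox test suite pass?"""
    return 1.0 if tests_passed else 0.0

def visual_reward(similarity: float) -> float:
    """Vision RL: similarity score in [0, 1] between the rendered UI
    and the reference image, as judged by a vision model (stubbed)."""
    return max(0.0, min(1.0, similarity))

print(reasoning_reward("42", "42"), agentic_reward(False), visual_reward(1.3))
```

The common thread: every reward is grounded in an external check (a verifier, a test suite, a renderer plus a vision model) rather than in the model’s own token probabilities.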
GLM-5 is the best open-source model on most benchmarks and competitive with the top proprietary models. Numbers are from the paper, spot-checked against public leaderboards.
| Benchmark | GLM-5 | Opus 4.5 | GPT-5.2 | DeepSeek V3.2 | Gemini 3 Pro |
|---|---|---|---|---|---|
| **Reasoning** | | | | | |
| AIME 2025 (math) | 80.3 | 65.0 | 78.0 | 73.0 | 84.0 |
| HMMT Feb 2025 (math) | 97.9 | — | — | — | 81.2 |
| HLE (hard reasoning) | 24.2 | 14.8 | 22.2 | 14.0 | 21.6 |
| GPQA Diamond (science) | 74.7 | 68.7 | 73.7 | 68.4 | 71.0 |
| **Coding** | | | | | |
| SWE-bench Verified | 77.8 | 80.9 | 81.6 | 72.6 | 63.8 |
| Terminal-Bench | 44.3 | 38.4 | 38.9 | 30.9 | 39.7 |
| LiveCodeBench (v6) | 73.5 | 54.4 | 76.5 | 65.6 | 65.2 |
| **Agentic** | | | | | |
| BrowseComp | 62.0 | 60.6 | 58.8 | 14.2 | 40.6 |
| MCP-Atlas (tool use) | 79.2 | 73.3 | 70.3 | 67.8 | 72.2 |
| Tau-bench (airline) | 66.0 | 48.5 | 52.5 | 47.2 | 32.0 |
Before publishing the paper, Zhipu did something unusual: they released GLM-5 anonymously on OpenRouter under the name “Pony Alpha.” No branding, no paper, no marketing. Just a model endpoint. What happened next proved their point.
Community speculation on X/Twitter polls and forum threads leaned toward Claude Sonnet 5, DeepSeek, or Grok; almost nobody guessed a Chinese open-source lab.
GLM-5 isn’t just another model announcement. It represents a few structural shifts in how the AI landscape is evolving.
You can download the model, run it, fine-tune it, and deploy it on your own infrastructure. The gap between open-source and proprietary was 2+ years in 2023. It’s now months. GLM-5 matches or beats Opus 4.5 on most benchmarks.
Zhipu adapted GLM-5 to run on 7 different Chinese chip platforms (Huawei Ascend, Cambricon, etc.) — not just NVIDIA. If US export controls tighten further, this model can still be trained and served domestically.
The “agentic RL” training paradigm — giving models real engineering tasks in sandboxes and rewarding passing tests — may become the standard for how all coding models are trained. It’s more aligned with actual engineering than predicting next tokens.
The Pony Alpha experiment is a clean demonstration that evaluation bias is real. When users didn’t know the model’s origin, they rated it as frontier-tier. The name on the label was the only thing that changed.