When AI Invents Better Algorithms

Game theory researchers have spent decades hand-designing algorithms for strategic decision-making — the math behind poker bots, autonomous negotiation, and multi-agent AI. Google DeepMind pointed AlphaEvolve, an AI that writes and evolves code, at the problem. It discovered two new algorithms that outperform the hand-designed ones.

2 new algorithms: discovered by AI, not humans
10/11 games won (small): VAD-CFR beats the hand-designed state of the art
8/11 games won (large): SHOR-PSRO beats the hand-designed state of the art
200+ generations of evolutionary code refinement

What Are Game Theory Algorithms?

Game theory algorithms figure out the best strategy when multiple players are competing. If you’ve played poker, you know the problem intuitively: your best move depends on what your opponent does, and their best move depends on what you do. There’s no single “right answer” — the optimal play is a balance where no player can improve by changing strategy alone. Mathematicians call this a Nash equilibrium.

Finding that balance is hard. For simple games, algorithms can compute it exactly. For complex games (full poker, military simulations, multi-robot coordination), they need clever approximations. Researchers have been refining these approximations for over 20 years. This paper asks: what if we let AI do the refining instead?
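To make "no player can improve alone" concrete, here is a minimal check of the textbook matching-pennies equilibrium. The game and its 50/50 solution are standard examples, not taken from the paper:

```python
import numpy as np

# Matching pennies: the row player wins (+1) when the coins match, loses (-1)
# otherwise. The unique Nash equilibrium is for both players to mix 50/50.
payoff = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])  # row player's payoffs

col_mix = np.array([0.5, 0.5])    # column player at equilibrium

# Expected payoff of each pure row action against the 50/50 column player:
pure_values = payoff @ col_mix    # both come out to 0.0

# Neither pure deviation beats the equilibrium value of 0, so the row player
# cannot improve by changing strategy alone -- the definition of Nash.
print(pure_values)  # [0. 0.]
```

Against any other column mix, one of the two pure actions would strictly profit; 50/50 is exactly the point where that leverage disappears.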


Small Games — Solve Exactly

Games small enough to fit in memory (simplified poker, card games). Algorithms like CFR (Counterfactual Regret Minimization) compute the exact optimal strategy by iterating through every possible situation millions of times. Each iteration gets closer to perfect play.

CFR FAMILY
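The core update behind the CFR family, regret matching, fits in a few lines. This is a hedged sketch run in self-play on rock-paper-scissors; real CFR applies the same update at every decision point of a game tree, which this toy omits:

```python
import numpy as np

# Regret matching on rock-paper-scissors (row player's payoffs).
payoff = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]], dtype=float)

def mix(regrets):
    """Play actions in proportion to their positive cumulative regret."""
    pos = np.maximum(regrets, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(3, 1.0 / 3.0)

regrets = [np.array([1.0, 0.0, 0.0]), np.zeros(3)]  # tiny asymmetry to start
avg = np.zeros(3)
for _ in range(20_000):
    s0, s1 = mix(regrets[0]), mix(regrets[1])
    u0 = payoff @ s1             # value of each pure action vs opponent's mix
    u1 = -payoff.T @ s0          # zero-sum: column player's payoffs negated
    regrets[0] += u0 - s0 @ u0   # "how much better would action a have been?"
    regrets[1] += u1 - s1 @ u1
    avg += s0                    # the *average* strategy converges to Nash

avg_strategy = avg / avg.sum()   # approaches the uniform equilibrium [1/3, 1/3, 1/3]
```

Each iteration nudges the average strategy closer to perfect play, which is the sense in which "each iteration gets closer" above.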

Large Games — Approximate

Games too big for exact methods (full poker, real-world strategy). Algorithms like PSRO (Policy Space Response Oracles) build a population of strategies and evolve them against each other — like a round-robin tournament where each round produces smarter players.

PSRO FAMILY
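The PSRO loop can be sketched on a toy game. This is illustrative only: real PSRO trains reinforcement-learning policies as best responses, whereas here the "oracle" just picks the best pure action and the meta-solver is plain regret matching over the roster:

```python
import numpy as np

# Toy PSRO on rock-paper-scissors (row player's payoffs).
payoff = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]], dtype=float)

def mix(regrets, n):
    pos = np.maximum(regrets, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(n, 1.0 / n)

def restricted_nash(sub, iters=2000):
    """Approximate a Nash mix of the roster-vs-roster game by regret matching."""
    n = sub.shape[0]
    reg_r, reg_c, avg_r = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        s_r, s_c = mix(reg_r, n), mix(reg_c, n)
        u_r, u_c = sub @ s_c, -sub.T @ s_r
        reg_r += u_r - s_r @ u_r
        reg_c += u_c - s_c @ u_c
        avg_r += s_r
    return avg_r / iters

roster = [0]  # start the population with one pure strategy: rock
for _ in range(4):
    sub = payoff[np.ix_(roster, roster)]     # roster-vs-roster payoffs
    meta = restricted_nash(sub)              # best mix over the current roster
    full_mix = np.zeros(3)
    for strat, weight in zip(roster, meta):
        full_mix[strat] += weight
    beater = int(np.argmax(payoff @ full_mix))  # oracle: best response to the mix
    if beater not in roster:
        roster.append(beater)                # each round adds a smarter player
```

The roster discovers paper (beats rock), then scissors (beats paper), mirroring the round-robin dynamic described above.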
Why this matters beyond poker

These algorithms aren’t just for card games. Any situation where multiple agents interact strategically uses game theory: autonomous vehicles negotiating intersections, AI assistants competing for resources, cybersecurity (attacker vs. defender), auction design, and training multi-agent AI systems. Better algorithms here make all of those applications smarter.

AlphaEvolve: AI That Writes Code

AlphaEvolve is a coding agent built by Google DeepMind. It doesn’t just write code — it evolves it. Give it a starting algorithm and a way to measure quality, and it will iteratively mutate, recombine, and improve the code over hundreds of generations. It’s powered by Gemini (Google’s frontier AI model) and uses evolutionary principles: the best-performing variants survive and breed the next generation.

Step 1

Start with a Known Algorithm

Feed AlphaEvolve a working implementation of an existing algorithm (like CFR or PSRO) as the starting point. This is the “seed” for evolution.

Step 2

Mutate the Code

Gemini proposes modifications — adding new logic, changing formulas, tweaking parameters. It understands the code semantically, not just randomly shuffling characters.

Step 3

Test on Training Games

Run each variant on a set of training games and measure how close it gets to optimal play (the “Nash gap”). Lower gap = better algorithm.
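For a two-player zero-sum matrix game, the Nash gap in Step 3 can be measured as the sum of both players' exploitabilities: how much each could gain by switching to a best response. A minimal sketch of this assumed formulation:

```python
import numpy as np

def nash_gap(payoff, row_mix, col_mix):
    """Sum of both players' exploitabilities; 0 exactly at a Nash equilibrium."""
    value = row_mix @ payoff @ col_mix       # row player's current expected payoff
    row_br = np.max(payoff @ col_mix)        # row's best-response value
    col_br = np.max(-payoff.T @ row_mix)     # column's best-response value
    return (row_br - value) + (col_br + value)

payoff = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies
uniform = np.array([0.5, 0.5])
print(nash_gap(payoff, uniform, uniform))       # 0.0 at the equilibrium
```

A strategy pair that always plays heads against the uniform opponent scores a positive gap, since the opponent could profitably switch to a best response.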

Step 4

Survive and Repeat

The best variants survive. The worst are discarded. New mutations are applied to the survivors. Repeat for 200+ generations until the algorithm stabilizes.
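The four steps above reduce to a generic evolve-evaluate-select loop. Everything in this sketch is a hypothetical stand-in: `mutate` nudges a parameter vector instead of asking Gemini for a code edit, and `fitness` is a toy score instead of a Nash gap measured on training games:

```python
import random

def fitness(candidate):
    # Lower is better, like the Nash gap; the optimum is every coordinate at 0.5.
    return sum((x - 0.5) ** 2 for x in candidate)

def mutate(candidate, rng):
    # Stand-in for an LLM-proposed modification: perturb one "line" of the program.
    child = list(candidate)
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0, 0.1)
    return child

rng = random.Random(0)
population = [[rng.random() for _ in range(4)] for _ in range(16)]
for generation in range(200):
    population.sort(key=fitness)   # Step 3: evaluate; best (lowest score) first
    survivors = population[:4]     # Step 4: the best variants survive
    population = survivors + [mutate(rng.choice(survivors), rng)
                              for _ in range(12)]  # Step 2: mutate the survivors
best = min(population, key=fitness)
```

After a couple hundred generations the surviving candidates sit close to the optimum, which is the "stabilizes" endpoint Step 4 describes.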

The key insight

Human researchers refine algorithms by intuition and mathematical analysis — they propose a tweak, prove it works theoretically, then test it. AlphaEvolve skips the intuition step. It proposes hundreds of tweaks, keeps what works, and discards what doesn’t. The result: algorithms with non-obvious tricks that no human would have thought to try.

VAD-CFR: For Smaller Games

Evolved from: Discounted CFR

VAD-CFR

Volatility-Adaptive Discounted Counterfactual Regret Minimization

Standard CFR converges to optimal play by tracking “regret” — how much each decision cost you in hindsight. Over thousands of iterations, the strategy converges to a Nash equilibrium. The best human-designed variant (DCFR+) uses fixed discount factors to weight recent experience more heavily. VAD-CFR makes those discounts adaptive — it watches how volatile the regrets are and adjusts on the fly.

Volatility Tracking

Monitors how much regret values are changing using an exponential moving average. When regrets are bouncing around (high volatility), it discounts older data more aggressively. When they stabilize, it trusts history more.

ADAPTIVE DISCOUNTING

Asymmetric Regret Boosting

Multiplies positive regrets by 1.1× — a subtle bias that makes the algorithm more eager to explore promising actions. No human researcher proposed this specific trick; the evolution found it empirically.

NON-OBVIOUS

Hard Warm-Start

Ignores the first 500 iterations entirely when computing the final strategy. The early iterations are noisy and unreliable — by throwing them away, the average strategy is cleaner. Humans typically soft-discount; hard-cutting is counterintuitive.

COUNTERINTUITIVE

Regret-Weighted Averaging

After the warm-start, weights each iteration’s contribution to the final strategy by the magnitude of cumulative regret. Iterations where the algorithm was more “certain” (higher regret magnitude) count more.

EVOLVED HEURISTIC
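The four mechanisms above can be bolted onto plain regret matching to show how they fit together. This is an illustrative composite on rock-paper-scissors: the ingredients (volatility EMA, adaptive discounting, 1.1× positive-regret boost, hard 500-iteration warm-start, regret-weighted averaging) follow the prose descriptions, but every constant and formula detail here is an assumption, not the paper's evolved code:

```python
import numpy as np

payoff = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]], dtype=float)
WARM_START, BOOST = 500, 1.1

def mix(reg):
    pos = np.maximum(reg, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(3, 1.0 / 3.0)

regrets = [np.array([1.0, 0.0, 0.0]), np.zeros(3)]  # small kick to start
volatility = [0.0, 0.0]
avg = np.zeros(3)
for t in range(1, 5001):
    s = [mix(regrets[0]), mix(regrets[1])]
    u = [payoff @ s[1], -payoff.T @ s[0]]
    for p in range(2):
        inst = u[p] - s[p] @ u[p]                      # instantaneous regret
        inst = np.where(inst > 0, BOOST * inst, inst)  # asymmetric positive boost
        # Volatility tracking: EMA of how much the regrets are moving.
        volatility[p] = 0.9 * volatility[p] + 0.1 * float(np.abs(inst).sum())
        # Adaptive discounting: higher volatility -> forget history faster.
        discount = 1.0 / (1.0 + 0.01 * volatility[p])
        regrets[p] = discount * regrets[p] + inst
    if t > WARM_START:                                 # hard warm-start cutoff
        # Regret-weighted averaging: "more certain" iterations count more.
        avg += (1.0 + float(np.abs(regrets[0]).sum())) * s[0]

avg_strategy = avg / avg.sum()  # should hover near the uniform RPS equilibrium
```

Note the structural choices: the boost is applied to instantaneous regrets so the bias doesn't compound unboundedly, and the first 500 iterations update regrets but contribute nothing to the final average.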

SHOR-PSRO: For Larger Games

Evolved from: Pipeline PSRO

SHOR-PSRO

Smoothed Hybrid Optimistic Regret Policy Space Response Oracles

PSRO works by building a roster of strategies. Each round, it trains a new strategy that beats the current roster, adds it, and recomputes the optimal mix. The human-designed state of the art uses regret matching to decide how to mix strategies. SHOR-PSRO replaces that with a hybrid approach that blends two different mixing methods and uses different settings for training vs. evaluation.

Hybrid Blending

Mixes two strategy-selection methods: Optimistic Regret Matching (principled, stable) and Boltzmann softmax over best pure strategies (aggressive, exploitative). The blend ratio decays from 30% to 5% over time.

DUAL STRATEGY

Dynamic Annealing

The blend ratio and a diversity bonus both decay on a specific schedule. Early on, the algorithm explores broadly. Later, it narrows to exploit what it’s learned. The exact decay curve was discovered by evolution, not derived from theory.

EVOLVED SCHEDULE

Training vs. Evaluation Split

Uses different solver configurations for growing the strategy roster (training) vs. computing the final strategy mix (evaluation). Humans typically use the same solver for both — the asymmetry was AlphaEvolve’s idea.

NON-OBVIOUS

Diversity Bonus

Adds a small bonus for strategies that are different from existing ones in the roster. Prevents the population from collapsing to a single approach. The bonus starts at 5% and decays to 0.1%.

ANTI-COLLAPSE
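The hybrid meta-strategy mix described above can be sketched as a single function. The inputs are hypothetical (regrets over roster members, their values against the current mix, and a diversity prior); the blend of regret matching with a Boltzmann softmax and the decaying schedules (30% to 5% blend, 5% to 0.1% diversity) follow the prose, but the linear decay law and all constants are assumptions, not the paper's evolved schedule:

```python
import numpy as np

def hybrid_mix(regrets, values, diversity, round_idx, total_rounds=50):
    """Blend two strategy-selection methods with decaying weights."""
    frac = round_idx / max(total_rounds - 1, 1)
    blend = 0.30 + (0.05 - 0.30) * frac    # softmax share decays 30% -> 5%
    div_w = 0.05 + (0.001 - 0.05) * frac   # diversity bonus decays 5% -> 0.1%

    # Component 1: regret matching over the roster (principled, stable).
    pos = np.maximum(regrets, 0.0)
    rm = pos / pos.sum() if pos.sum() > 0 else np.full(len(regrets), 1.0 / len(regrets))

    # Component 2: Boltzmann softmax over strategy values (aggressive, exploitative).
    z = np.exp(values - values.max())
    boltz = z / z.sum()

    mixed = (1 - blend) * rm + blend * boltz
    mixed = (1 - div_w) * mixed + div_w * diversity   # anti-collapse bonus
    return mixed / mixed.sum()

# Early round: heavy softmax and diversity. Late round: mostly regret matching.
early = hybrid_mix(np.array([2.0, 0.0, 1.0]), np.array([0.3, -0.1, 0.2]),
                   np.full(3, 1.0 / 3.0), round_idx=0)
late = hybrid_mix(np.array([2.0, 0.0, 1.0]), np.array([0.3, -0.1, 0.2]),
                  np.full(3, 1.0 / 3.0), round_idx=49)
```

Early rounds spread probability broadly (explore); late rounds concentrate on the regret-matching component (exploit), which is the annealing behavior described above.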

Evolved vs. Hand-Designed

Both algorithms were trained on just 4 games, then tested on 11 — including 7 they’d never seen. The key question: do tricks discovered on simple training games generalize to harder, unseen games? The answer is yes.

Game | Best human-designed | AlphaEvolve | Winner

VAD-CFR vs. DCFR+ / PCFR+ (small games)
Kuhn Poker (3P) | DCFR+ | VAD-CFR | Evolved
Leduc Poker (2P) | PCFR+ | VAD-CFR | Evolved
Goofspiel (4-card) | DCFR+ | VAD-CFR | Evolved
Liar’s Dice (5-sided) | DCFR+ | VAD-CFR | Evolved
Kuhn Poker (4P) * | PCFR+ | VAD-CFR | Evolved
Leduc Poker (3P) * | DCFR+ | VAD-CFR | Evolved
Goofspiel (5-card) * | DCFR+ | VAD-CFR | Evolved

SHOR-PSRO vs. Pipeline PSRO / NeuPL-JPSRO (large games)
Kuhn Poker (3P) | P-PSRO | SHOR-PSRO | Evolved
Leduc Poker (2P) | NeuPL | SHOR-PSRO | Evolved
Kuhn Poker (4P) * | P-PSRO | SHOR-PSRO | Evolved
Leduc Poker (3P) * | NeuPL | SHOR-PSRO | Evolved

* = unseen during training   |   Winner determined by Nash gap (lower = closer to optimal play)
Trained on 4, Tested on 11
The algorithms weren’t just memorizing tricks for specific games. The mechanisms AlphaEvolve discovered — volatility tracking, adaptive annealing, asymmetric boosting — are general principles that transfer to unseen games. That’s the difference between overfitting and genuine algorithmic insight.

Why This Matters

The specific algorithms are interesting. But the bigger story is the method: an AI system that can discover algorithms humans haven’t found in 20+ years of trying. This isn’t AI using existing tools — it’s AI inventing new ones.

Discovery

Non-Obvious Discoveries

The evolved algorithms contain tricks no human researcher proposed — like throwing away the first 500 iterations, or using different solvers for training vs. evaluation. These aren’t things you’d derive from theory. They emerged from empirical search over code space.

BEYOND HUMAN INTUITION
Generalize

Generalizable Method

AlphaEvolve doesn’t know anything about game theory. It just evolves code against a fitness function. The same approach could discover algorithms in optimization, scheduling, routing, or any domain where you can measure “better” programmatically.

DOMAIN-AGNOSTIC
Research

New Research Paradigm

Instead of “researcher proposes algorithm, proves it works,” the loop becomes “AI proposes thousands of variants, researcher analyzes why the winner works.” The human role shifts from inventor to interpreter — understanding why the AI’s discoveries work, not finding them.

ROLE SHIFT
Open

Open Research

The full paper includes the complete evolved code for both algorithms, the training setup, and detailed ablation studies explaining which components matter most. The algorithms can be used by anyone working on multi-agent systems.

ARXIV 2602.16928
AI Inventing AI’s Tools
Twenty years of hand-designed algorithms. Hundreds of papers. Careful mathematical proofs. An AI with no game theory knowledge, armed with nothing but a code editor and a fitness function, found better solutions in a few hundred generations. The tools are getting good enough to improve themselves. That’s a new kind of feedback loop.