Game theory researchers have spent decades hand-designing algorithms for strategic decision-making — the math behind poker bots, autonomous negotiation, and multi-agent AI. Google DeepMind pointed AlphaEvolve, an AI that writes and evolves code, at the problem. It discovered two new algorithms that outperform the hand-designed ones.
Game theory algorithms figure out the best strategy when multiple players are competing. If you’ve played poker, you know the problem intuitively: your best move depends on what your opponent does, and their best move depends on what you do. There’s no single “right answer” — the optimal play is a balance where no player can improve by changing strategy alone. Mathematicians call this a Nash equilibrium.
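To make the definition concrete, here's a tiny check (not from the paper): in rock-paper-scissors, the uniform mix is a Nash equilibrium, because every pure action earns the same expected payoff against it, so no deviation helps.

```python
# Verify that uniform play is a Nash equilibrium of rock-paper-scissors.
# Payoff matrix for player 1 (rows) vs. player 2 (columns): R, P, S.
PAYOFF = [
    [0, -1, 1],   # Rock: ties rock, loses to paper, beats scissors
    [1, 0, -1],   # Paper
    [-1, 1, 0],   # Scissors
]

def expected_payoffs(opponent_mix):
    """Expected payoff of each pure action against an opponent's mixed strategy."""
    return [sum(p * PAYOFF[a][b] for b, p in enumerate(opponent_mix))
            for a in range(3)]

uniform = [1 / 3, 1 / 3, 1 / 3]
# Every pure action earns 0 against uniform play, so no unilateral
# deviation improves on it: uniform vs. uniform is a Nash equilibrium.
print(expected_payoffs(uniform))  # → [0.0, 0.0, 0.0]
```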
Finding that balance is hard. For simple games, algorithms can compute it exactly. For complex games (full poker, military simulations, multi-robot coordination), they need clever approximations. Researchers have been refining these approximations for over 20 years. This paper asks: what if we let AI do the refining instead?
For games small enough to fit in memory (simplified poker, card games), algorithms like CFR (Counterfactual Regret Minimization) converge on the optimal strategy by iterating through every possible situation millions of times. Each iteration gets closer to perfect play.
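The engine inside CFR is regret matching: play each action in proportion to how much you regret not having played it so far. A minimal sketch on matching pennies (a one-shot game rather than poker's game tree; this is textbook regret matching, not the paper's code):

```python
# Expected-update regret matching on matching pennies: the AVERAGE
# strategies of both players converge to the uniform Nash equilibrium.
PAYOFF = [[1, -1], [-1, 1]]  # row player's payoff; column player gets the negative

def regret_matching(cum_regret):
    """Play actions in proportion to their positive cumulative regret."""
    pos = [max(r, 0.0) for r in cum_regret]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [0.5, 0.5]

cum_regret = [[1e-6, 0.0], [0.0, 0.0]]  # tiny asymmetry to leave the fixed point
avg = [[0.0, 0.0], [0.0, 0.0]]
T = 5000
for _ in range(T):
    row = regret_matching(cum_regret[0])
    col = regret_matching(cum_regret[1])
    # Expected payoff of each pure action against the opponent's current mix.
    ev_row = [sum(PAYOFF[a][b] * col[b] for b in range(2)) for a in range(2)]
    ev_col = [sum(-PAYOFF[a][b] * row[a] for a in range(2)) for b in range(2)]
    got_row = sum(row[a] * ev_row[a] for a in range(2))
    got_col = sum(col[b] * ev_col[b] for b in range(2))
    for a in range(2):
        cum_regret[0][a] += ev_row[a] - got_row  # regret: what I could have had
        cum_regret[1][a] += ev_col[a] - got_col
        avg[0][a] += row[a]
        avg[1][a] += col[a]

avg_row = [s / T for s in avg[0]]
print(avg_row)  # approaches the Nash equilibrium [0.5, 0.5]
```

Note that the per-iteration strategies cycle; it is the time-average that converges, which is why the evolved tricks below all target how iterations are averaged.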
For games too big for exact methods (full poker, real-world strategy), algorithms like PSRO (Policy Space Response Oracles) build a population of strategies and evolve them against each other, like a round-robin tournament where each round produces smarter players.
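A toy version of the PSRO loop, using rock-paper-scissors with pure actions standing in for "policies" and regret matching as the meta-game solver (real PSRO trains reinforcement-learning best responses; everything here is an illustrative stand-in):

```python
# Toy PSRO: repeatedly solve the "meta-game" among roster members,
# then add a best response to the resulting mixture.
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # rock-paper-scissors, row payoff

def solve_meta_game(matrix, iters=1000):
    """Approximate a symmetric Nash mix over the roster via regret matching."""
    n = len(matrix)
    cum_regret, avg, mix = [0.0] * n, [0.0] * n, [1.0 / n] * n
    for _ in range(iters):
        ev = [sum(matrix[a][b] * mix[b] for b in range(n)) for a in range(n)]
        got = sum(mix[a] * ev[a] for a in range(n))
        for a in range(n):
            cum_regret[a] += ev[a] - got
            avg[a] += mix[a]
        pos = [max(r, 0.0) for r in cum_regret]
        total = sum(pos)
        mix = [p / total for p in pos] if total > 0 else [1.0 / n] * n
    return [a / iters for a in avg]

def best_response(roster, meta_mix):
    """Pure action maximizing expected payoff against the meta mixture."""
    values = [sum(meta_mix[i] * PAYOFF[a][p] for i, p in enumerate(roster))
              for a in range(3)]
    return max(range(3), key=lambda a: values[a])

roster = [0]  # start with Rock only
while True:
    matrix = [[PAYOFF[p][q] for q in roster] for p in roster]
    meta_mix = solve_meta_game(matrix)
    br = best_response(roster, meta_mix)
    if br in roster:
        break  # nothing new beats the current mix: the roster has converged
    roster.append(br)

print(roster)    # → [0, 1, 2]: Rock, then Paper, then Scissors join the roster
print(meta_mix)  # roughly uniform over the final roster
```

The loop discovers Paper (beats Rock), then Scissors (beats the Paper-heavy mix), and stops once no pure action improves on the equilibrium mixture.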
These algorithms aren’t just for card games. Any situation where multiple agents interact strategically uses game theory: autonomous vehicles negotiating intersections, AI assistants competing for resources, cybersecurity (attacker vs. defender), auction design, and training multi-agent AI systems. Better algorithms here make all of those applications smarter.
AlphaEvolve is a coding agent built by Google DeepMind. It doesn’t just write code — it evolves it. Give it a starting algorithm and a way to measure quality, and it will iteratively mutate, recombine, and improve the code over hundreds of generations. It’s powered by Gemini (Google’s frontier AI model) and uses evolutionary principles: the best-performing variants survive and breed the next generation.
The process starts by feeding AlphaEvolve a working implementation of an existing algorithm (like CFR or PSRO). This is the “seed” for evolution.
Gemini then proposes modifications: adding new logic, changing formulas, tweaking parameters. It understands the code semantically rather than shuffling characters at random.
Each variant runs on a set of training games, scored by how close it gets to optimal play (the “Nash gap”). A lower gap means a better algorithm.
The best variants survive. The worst are discarded. New mutations are applied to the survivors. Repeat for 200+ generations until the algorithm stabilizes.
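The steps above amount to a generic evolutionary loop. In the real system, Gemini mutates code and fitness is the Nash gap on training games; in this sketch, `mutate` merely perturbs numeric parameters and `nash_gap` is a hypothetical stand-in fitness function:

```python
import random

random.seed(0)  # reproducible demo

def nash_gap(params):
    """Hypothetical fitness: distance from an optimum unknown to the search.
    Lower is better, like the real Nash gap."""
    target = [0.9, 0.1, 0.5]
    return sum((p - t) ** 2 for p, t in zip(params, target))

def mutate(params):
    """Stand-in for Gemini's code edits: small random perturbations."""
    return [p + random.gauss(0, 0.05) for p in params]

# Seed population of candidate "algorithms" (here: parameter vectors).
population = [[random.random() for _ in range(3)] for _ in range(20)]
for generation in range(200):
    scored = sorted(population, key=nash_gap)  # evaluate on "training games"
    survivors = scored[:5]                     # best variants survive
    population = survivors + [mutate(random.choice(survivors))
                              for _ in range(15)]  # mutate the survivors

best = min(population, key=nash_gap)
print(nash_gap(best))  # small: the gap shrinks over generations
```

Because survivors carry over unchanged, fitness never regresses; the search only has to occasionally find an improving mutation.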
Human researchers refine algorithms by intuition and mathematical analysis — they propose a tweak, prove it works theoretically, then test it. AlphaEvolve skips the intuition step. It proposes hundreds of tweaks, keeps what works, and discards what doesn’t. The result: algorithms with non-obvious tricks that no human would have thought to try.
The first evolved algorithm, VAD-CFR, monitors how much the regret values are changing using an exponential moving average. When regrets are bouncing around (high volatility), it discounts older data more aggressively. When they stabilize, it trusts history more.
It also multiplies positive regrets by 1.1×, a subtle bias that makes the algorithm more eager to explore promising actions. No human researcher had proposed this specific trick; the evolution found it empirically.
When computing the final strategy, it ignores the first 500 iterations entirely. The early iterations are noisy and unreliable; throwing them away leaves a cleaner average strategy. Humans typically soft-discount early iterations; a hard cutoff is counterintuitive.
After that warm-start cutoff, it weights each iteration’s contribution to the final strategy by the magnitude of cumulative regret: iterations where the algorithm was more “certain” (higher regret magnitude) count more.
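Taken together, the four tricks might look something like this in code. The constants 1.1 and 500 come from the description above; the EMA coefficient, the volatility-to-discount mapping, and the overall structure are my assumptions, not the paper's actual VAD-CFR.

```python
class VADStyleUpdater:
    """Sketch of a VAD-CFR-style regret/strategy update (illustrative only)."""

    WARM_START = 500   # iterations discarded from the average strategy
    OPTIMISM = 1.1     # bias applied to positive regrets
    EMA_DECAY = 0.9    # assumed smoothing factor for volatility tracking

    def __init__(self, num_actions):
        self.cum_regret = [0.0] * num_actions
        self.avg_strategy = [0.0] * num_actions
        self.volatility = 0.0  # EMA of per-iteration regret magnitude
        self.t = 0

    def update(self, instant_regret, strategy):
        self.t += 1
        # 1. Track regret volatility with an exponential moving average.
        change = sum(abs(r) for r in instant_regret)
        self.volatility = (self.EMA_DECAY * self.volatility
                           + (1 - self.EMA_DECAY) * change)
        # Discount history more aggressively when volatility is high
        # (illustrative mapping; the evolved formula is in the paper).
        discount = 1.0 / (1.0 + self.volatility)
        for a, r in enumerate(instant_regret):
            self.cum_regret[a] = self.cum_regret[a] * discount + r
            # 2. Optimism: amplify positive regrets toward promising actions.
            if self.cum_regret[a] > 0:
                self.cum_regret[a] *= self.OPTIMISM
        # 3. Hard warm-start cutoff: skip the noisy early iterations, then
        # 4. weight contributions by cumulative-regret magnitude ("certainty").
        if self.t > self.WARM_START:
            weight = sum(abs(r) for r in self.cum_regret)
            for a in range(len(strategy)):
                self.avg_strategy[a] += weight * strategy[a]
```

The point of the sketch is the shape of the pipeline, not the exact formulas: volatility-adjusted discounting, an optimism multiplier, a hard cutoff, and certainty-weighted averaging all compose in a single update.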
The second evolved algorithm, SHOR-PSRO, mixes two strategy-selection methods: Optimistic Regret Matching (principled, stable) and a Boltzmann softmax over the best pure strategies (aggressive, exploitative). The blend ratio decays from 30% to 5% over time.
The blend ratio and a diversity bonus both decay on a specific schedule. Early on, the algorithm explores broadly. Later, it narrows to exploit what it’s learned. The exact decay curve was discovered by evolution, not derived from theory.
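A sketch of the blended selection rule under stated assumptions: the 30% → 5% endpoints come from the description above, while the exponential decay shape and the two component implementations are simplified stand-ins for the evolved ones.

```python
import math

def regret_matching_mix(cum_regret):
    """Stable component: play in proportion to positive cumulative regret."""
    pos = [max(r, 0.0) for r in cum_regret]
    total = sum(pos)
    n = len(cum_regret)
    return [p / total for p in pos] if total > 0 else [1.0 / n] * n

def boltzmann_mix(values, temperature=1.0):
    """Aggressive component: softmax over estimated action values."""
    exps = [math.exp(v / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def blend_ratio(t, horizon, start=0.30, end=0.05):
    """Decays from `start` to `end` over the run (assumed exponential shape)."""
    return end + (start - end) * math.exp(-5.0 * t / horizon)

def select_strategy(cum_regret, values, t, horizon):
    stable = regret_matching_mix(cum_regret)  # principled, stable
    aggressive = boltzmann_mix(values)        # exploitative
    beta = blend_ratio(t, horizon)
    return [(1 - beta) * s + beta * a for s, a in zip(stable, aggressive)]

# Early in the run the aggressive component gets ~30% of the weight;
# late in the run it shrinks toward ~5%.
mix = select_strategy([2.0, 0.0, 1.0], [0.3, 0.1, 0.2], t=100, horizon=1000)
print(mix)  # a valid probability distribution over the three strategies
```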
It uses different solver configurations for growing the strategy roster (training) than for computing the final strategy mix (evaluation). Humans typically use the same solver for both; the asymmetry was AlphaEvolve’s idea.
It also adds a small bonus for strategies that differ from the existing ones in the roster, which prevents the population from collapsing to a single approach. The bonus starts at 5% and decays to 0.1%.
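The diversity bonus can be sketched as a decaying novelty term added to each candidate's score. The 5% → 0.1% endpoints come from the description above; the distance measure and the decay shape are assumptions.

```python
import math

def diversity_bonus_scale(t, horizon, start=0.05, end=0.001):
    """Bonus weight decaying from 5% to 0.1% (assumed exponential shape)."""
    return end + (start - end) * math.exp(-5.0 * t / horizon)

def novelty(candidate, roster):
    """Distance from the nearest roster member (L1 distance over strategy
    vectors; the paper's actual measure may differ)."""
    if not roster:
        return 0.0  # no roster yet, nothing to be novel against
    return min(sum(abs(c - r) for c, r in zip(candidate, member))
               for member in roster)

def score(candidate, base_payoff, roster, t, horizon):
    # Payoff plus a small, decaying reward for being unlike existing
    # strategies, which keeps the population from collapsing early.
    return base_payoff + diversity_bonus_scale(t, horizon) * novelty(candidate, roster)
```

Early in the run, a novel candidate can outscore a slightly higher-payoff clone of an existing strategy; late in the run, payoff dominates almost entirely.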
Both algorithms were trained on just 4 games, then tested on 11 — including 7 they’d never seen. The key question: do tricks discovered on simple training games generalize to harder, unseen games? The answer is yes.
| Game | Best human-designed baseline | Evolved algorithm | Winner |
|---|---|---|---|
| **VAD-CFR vs. DCFR+ / PCFR+ (Small Games)** | | | |
| Kuhn Poker (3P) | DCFR+ | VAD-CFR | Evolved |
| Leduc Poker (2P) | PCFR+ | VAD-CFR | Evolved |
| Goofspiel (4-card) | DCFR+ | VAD-CFR | Evolved |
| Liar’s Dice (5-sided) | DCFR+ | VAD-CFR | Evolved |
| Kuhn Poker (4P) \* | PCFR+ | VAD-CFR | Evolved |
| Leduc Poker (3P) \* | DCFR+ | VAD-CFR | Evolved |
| Goofspiel (5-card) \* | DCFR+ | VAD-CFR | Evolved |
| **SHOR-PSRO vs. Pipeline PSRO / NeuPL-JPSRO (Large Games)** | | | |
| Kuhn Poker (3P) | P-PSRO | SHOR-PSRO | Evolved |
| Leduc Poker (2P) | NeuPL | SHOR-PSRO | Evolved |
| Kuhn Poker (4P) \* | P-PSRO | SHOR-PSRO | Evolved |
| Leduc Poker (3P) \* | NeuPL | SHOR-PSRO | Evolved |

\* Not seen during training.
The specific algorithms are interesting. But the bigger story is the method: an AI system that can discover algorithms humans haven’t found in 20+ years of trying. This isn’t AI using existing tools — it’s AI inventing new ones.
The evolved algorithms contain tricks no human researcher proposed — like throwing away the first 500 iterations, or using different solvers for training vs. evaluation. These aren’t things you’d derive from theory. They emerged from empirical search over code space.
AlphaEvolve doesn’t know anything about game theory. It just evolves code against a fitness function. The same approach could discover algorithms in optimization, scheduling, routing, or any domain where you can measure “better” programmatically.
Instead of “researcher proposes algorithm, proves it works,” the loop becomes “AI proposes thousands of variants, researcher analyzes why the winner works.” The human role shifts from inventor to interpreter — understanding why the AI’s discoveries work, not finding them.
The full paper includes the complete evolved code for both algorithms, the training setup, and detailed ablation studies explaining which components matter most. Anyone working on multi-agent systems can pick the algorithms up and use them.