I built a Wingspan simulator with a hand-tuned AI. Every decision threshold is a number I guessed. Then I read a paper where Google DeepMind let an AI rewrite game-theory algorithms — and it found better ones than humans designed in 20 years. Here’s what that would look like applied to my project.
Wingspan is a competitive board game where 1–5 players collect birds, lay eggs, cache food, and chain bird powers together to score points across four rounds. It’s strategic, it has hidden information (your hand of bird cards), and every decision ripples through the rest of the game.
I built a Python simulator that plays the full game programmatically — all 446 birds across
four expansions, complete power chains, round goals, bonus cards, and a gym-style API
that returns (observation, reward, done, info)
after every action. It runs about 4.5 complete games per second with no UI.
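That four-tuple is the standard gym contract. A toy stand-in shows the shape of the loop — the class name, internals, and numbers here are illustrative, not the real engine:

```python
import random

class ToyWingspanEnv:
    """Minimal stand-in for the simulator's gym-style API.
    The real simulator plays full Wingspan games; this toy just
    demonstrates the (observation, reward, done, info) protocol."""

    def __init__(self, num_turns=8):
        self.num_turns = num_turns
        self.turn = 0
        self.score = 0

    def reset(self):
        self.turn = 0
        self.score = 0
        return self._observe()

    def _observe(self):
        # The real StateEncoder returns a feature vector; a toy tuple here.
        return (self.turn, self.score)

    def step(self, action):
        self.turn += 1
        reward = random.randint(0, 3)  # stand-in for points gained this turn
        self.score += reward
        done = self.turn >= self.num_turns
        return self._observe(), reward, done, {"score": self.score}

env = ToyWingspanEnv()
obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step("play_bird")  # any action id
```

Anything that speaks this protocol — a hand-tuned heuristic, a PPO policy, an evolved scoring function — can drive the same loop.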
The goal was always to train an AI on it. The AI I have so far… well.
The simulator’s AI is a class called SmartAIPlayer.
“Smart” is generous. It works — it plays legal moves, finishes games, scores reasonable points.
But every decision it makes comes down to a hardcoded number that I picked because it
felt about right.
| Threshold | What It Controls | Why This Number? |
|---|---|---|
| birds < 3 | “Early game” — prioritize playing birds | Felt right |
| birds < 8 | “Mid game” — balance all actions | Felt right |
| food > 10 | Stop hoarding food | Seemed like enough |
| hand > 10 | Stop drawing cards | Seemed like enough |
| points × 2 | Base bird attractiveness score | Doubles felt impactful |
| cost × 0.5 | Food cost penalty | Half-penalty felt balanced |
| +3 / +6 | Bonus for birds with powers | Powers seem worth 3–6 points |
| 50% | Chance to prioritize migration birds | Not too often, not too rare |
Here’s the actual bird-scoring function. Every line is a judgment call:
```python
def _score_bird_play(self, bird, action, player, game):
    score = 0.0

    # Base score from bird points
    score += bird.points * 2  # Why 2? Felt impactful.

    # Bonus for birds with powers
    if bird.power:
        score += 3  # Why 3? Seemed fair.
        if bird.power.abstracted_power:
            score += 3  # Why 3 more? Gut feeling.

    # Migration bird bonus (50% chance)
    if bird.name in MIGRATION_BIRDS and random() < 0.50:
        score += 15  # Why 15? Big number = gets played.

    # Prefer cheaper birds
    score -= food_cost * 0.5  # Why 0.5? Half-penalty felt right.

    return score
```
And the egg-laying decision? The most strategically important action in the late game?
```python
def _select_best_egg_laying(self, actions, player, game):
    # For now, just pick randomly from valid egg actions
    # Could be enhanced to prefer birds with more egg capacity
    return random.choice(actions)  # Literally random.
```
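The comment already names the obvious upgrade. A sketch of it — the attribute names (`action.bird`, `egg_capacity`, `eggs`) are guesses about the simulator's data model, not its real API:

```python
import random
from types import SimpleNamespace

def select_best_egg_laying(actions):
    """Prefer laying eggs on the bird with the most free egg capacity.
    Attribute names here are assumptions, not the simulator's real API."""
    def free_capacity(action):
        return action.bird.egg_capacity - action.bird.eggs

    best = max(free_capacity(a) for a in actions)
    # Tie-break randomly among equally good options
    candidates = [a for a in actions if free_capacity(a) == best]
    return random.choice(candidates)

# Tiny demo with stand-in action objects:
a1 = SimpleNamespace(bird=SimpleNamespace(egg_capacity=4, eggs=3))  # 1 free slot
a2 = SimpleNamespace(bird=SimpleNamespace(egg_capacity=6, eggs=1))  # 5 free slots
chosen = select_best_egg_laying([a1, a2])
```

Even this is just a different guess — which is exactly why it belongs in the set of functions handed to the evolution loop rather than hand-tuned further.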
There are actually two heuristic AIs in the codebase — SmartAIPlayer
in the AI module and a separate smart_action_selector
in the simulator script. They evolved independently and disagree on the thresholds.
One says early game ends at 3 birds. The other uses round number instead.
One caps food at 10. The other at 4.
Two hand-tuned heuristics for the same game, written by the same person, couldn’t even converge on the same numbers.
I wrote a full explainer of this paper, but here’s the short version. Google DeepMind built AlphaEvolve, a system that takes a working algorithm, uses an LLM (Gemini) to propose code mutations, evaluates each variant by running it on actual games, keeps the winners, and repeats for hundreds of generations.
They pointed it at two families of game-theory algorithms that researchers had been refining for over 20 years. AlphaEvolve discovered two new variants that beat the hand-designed ones on 10 out of 11 benchmark games. The evolved algorithms contained tricks no human proposed — like throwing away the first 500 iterations of data (counterintuitive, but it works) and multiplying certain values by exactly 1.1 (no theoretical justification, but empirically superior).
The paper’s algorithms had the same structure as my Wingspan AI: hardcoded thresholds,
fixed orderings, manually chosen constants. The only difference is that theirs were designed by
PhD researchers and mine were designed by me guessing. AlphaEvolve beat both kinds.
If it can improve algorithms refined by experts over two decades,
it can definitely improve score += bird.points * 2.
The AlphaEvolve approach needs three things: a seed algorithm to start from, a fitness function to measure quality, and code to mutate. The Wingspan simulator already has all three.
**The seed algorithm:** SmartAIPlayer — it already plays legal, complete games. It’s not optimal, but it’s a working starting point. AlphaEvolve doesn’t need a good algorithm; it needs a functional one to evolve from.
**The fitness function:** Average final score across N games. The simulator already returns final_scores at game end. Run 100 games, take the mean — a higher average score means a better algorithm. Simple, fast, and already built.
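As code, the whole fitness function is a few lines; `run_game` is a placeholder for whatever entry point the simulator actually exposes:

```python
import statistics
from itertools import cycle

def fitness(ai_player, run_game, n_games=100):
    """Mean final score over n_games: the single number evolution optimizes.
    `run_game` is a stand-in for the simulator's game loop; it should
    return the AI's final score for one complete game."""
    scores = [run_game(ai_player) for _ in range(n_games)]
    return statistics.mean(scores)

# Demo with a fake game that scores 70, 80, 90 in rotation:
fake_scores = cycle([70, 80, 90])
result = fitness(None, lambda ai: next(fake_scores), n_games=6)
```

At ~4.5 games per second, a 100-game evaluation takes well under a minute — fast enough to score hundreds of candidate algorithms per day.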
**The code to mutate:** Three functions: _get_action_priorities() (what to do),
_score_bird_play() (which bird to play),
_select_best_food_to_gain() (which food to take).
These contain all the guessed numbers.
**The mutation engine:** AlphaEvolve itself used Gemini and isn’t publicly available, but the approach works with any code-capable LLM. Claude, GPT-4, or even a local model — it just needs to read a Python function and propose a modified version.
Here’s what the evolution might produce. The current scoring function is a flat linear formula. An evolved version might discover non-linear interactions, conditional logic, or phase-dependent weights — things that work but that no human would think to try:
The current version:

```python
score = bird.points * 2
score += 3 if bird.power else 0
score -= food_cost * 0.5
# 3 constants, all guesses
```
A hypothetical evolved version:

```python
score = bird.points * 2.3
score += 4.7 if bird.power else 0
score -= food_cost * 0.8
if round > 2 and eggs < 4:
    score += egg_capacity * 1.4  # non-obvious conditional the AI found
```
AlphaEvolve isn’t publicly available, but the idea is reproducible. There are three realistic ways to apply it to the Wingspan simulator, each at a different level of complexity.
**Path 1: Genetic algorithm.** Extract all 50+ thresholds into a parameter vector. Use a standard genetic algorithm to mutate values, run tournaments, keep winners. No LLM needed — just numerical optimization. The simplest path and the quickest to validate.
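A minimal sketch of that loop — a (1+λ) evolution strategy over the threshold vector. The toy fitness and starting values below are illustrative only; in practice `fitness` would run 100 simulated games:

```python
import random

def mutate(params, sigma=0.2):
    """Perturb each threshold by Gaussian noise scaled to its magnitude."""
    return [p + random.gauss(0, sigma * max(abs(p), 1)) for p in params]

def evolve(seed_params, fitness, generations=50, population=20):
    """Keep the best mutant each generation; never discard the champion."""
    best, best_fit = seed_params, fitness(seed_params)
    for _ in range(generations):
        for _ in range(population):
            candidate = mutate(best)
            f = fitness(candidate)
            if f > best_fit:
                best, best_fit = candidate, f
    return best, best_fit

# Demo on a toy fitness: score peaks when params approach (3, 10, 2.0)
target = [3, 10, 2.0]

def toy_fitness(p):
    return -sum((a - b) ** 2 for a, b in zip(p, target))

params, score = evolve([3, 10, 0.5], toy_fitness)
```

Swap `toy_fitness` for the 100-game mean and the same loop tunes the real thresholds.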
**Path 2: Reinforcement learning.** The gym-style API is already built. Hook up PPO or a similar algorithm via
stable-baselines3.
The StateEncoder already produces
observation vectors. Train against copies of itself.
**Path 3: DIY AlphaEvolve.** Send _score_bird_play() to the Claude API
with “propose a better version.” Run 100 games with each variant.
Keep the highest-scoring one. Repeat. This is literally what the paper does.
Path 1 (genetic algorithm) is the fastest to build and would already produce better thresholds than my guesses. Path 3 (DIY AlphaEvolve) is the most exciting because it can discover structural changes to the code — new conditionals, new factors, entirely new decision logic — not just better numbers. The paper’s most impressive discoveries were structural, not numerical.
This isn’t really about Wingspan. It’s about a pattern that shows up everywhere: a developer writes a heuristic with guessed constants, it works well enough, and nobody goes back to optimize it. Scheduling algorithms, recommendation scores, resource allocation weights, retry backoff timers — they’re all full of numbers someone picked because they felt right.
The AlphaEvolve paper shows that LLMs can do more than just write code to spec. They can explore the space of possible code and find solutions humans wouldn’t think to try. The Wingspan simulator is a perfect sandbox for that experiment — a complete game engine, a measurable fitness function, and an AI full of numbers that are begging to be replaced by better ones.
Every number in SmartAIPlayer is a guess.
The AlphaEvolve paper showed that AI can replace human guesses with empirically better values —
and discover entirely new decision logic in the process.
The simulator is ready. The fitness function is built. The only missing piece is the evolution loop.