◆ Mar 2026 · Personal Project

What If AI Designed My Board Game AI?

I built a Wingspan simulator with a hand-tuned AI. Every decision threshold is a number I guessed. Then I read a paper where Google DeepMind let an AI rewrite game-theory algorithms — and it found better ones than humans had designed in 20 years. Here’s what that approach would look like applied to my project.

446 birds implemented: all 4 expansions (Core, Europe, Asia, Oceania)
4.5 games/sec simulated: full headless games, no UI
50+ hardcoded thresholds: every one is a guess
0 learned parameters: no ML, no training, pure heuristics

A Complete Wingspan Simulator

Wingspan is a competitive board game where 1–5 players collect birds, lay eggs, cache food, and chain bird powers together to score points across four rounds. It’s strategic, it has hidden information (your hand of bird cards), and every decision ripples through the rest of the game.

I built a Python simulator that plays the full game programmatically — all 446 birds across four expansions, complete power chains, round goals, bonus cards, and a gym-style API that returns (observation, reward, done, info) after every action. It runs about 4.5 complete games per second with no UI. The goal was always to train an AI on it. The AI I have so far… well.
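
To make that concrete, here is roughly what a headless game loop looks like against the gym-style API. The names here (WingspanEnv, sample_legal_action, the import path) are illustrative stand-ins, not the simulator’s exact identifiers:

from wingspan_sim import WingspanEnv   # hypothetical import path

env = WingspanEnv(num_players=2)
obs = env.reset()
done = False
while not done:
    action = env.sample_legal_action()          # stand-in for a real policy
    obs, reward, done, info = env.step(action)  # the (observation, reward, done, info) tuple
print(info["final_scores"])                     # reported at game end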

Every Number Is a Guess

The simulator’s AI is a class called SmartAIPlayer. “Smart” is generous. It works — it plays legal moves, finishes games, scores reasonable points. But every decision it makes comes down to a hardcoded number that I picked because it felt about right.

| Threshold  | What It Controls                         | Why This Number?             |
|------------|------------------------------------------|------------------------------|
| birds < 3  | “Early game” — prioritize playing birds  | Felt right                   |
| birds < 8  | “Mid game” — balance all actions         | Felt right                   |
| food > 10  | Stop hoarding food                       | Seemed like enough           |
| hand > 10  | Stop drawing cards                       | Seemed like enough           |
| points × 2 | Base bird attractiveness score           | Doubles felt impactful       |
| cost × 0.5 | Food cost penalty                        | Half-penalty felt balanced   |
| +3 / +6    | Bonus for birds with powers              | Powers seem worth 3–6 points |
| 50%        | Chance to prioritize migration birds     | Not too often, not too rare  |

Here’s the actual bird-scoring function. Every line is a judgment call:

import random

def _score_bird_play(self, bird, action, player, game):
    score = 0.0

    # Base score from bird points
    score += bird.points * 2             # Why 2? Felt impactful.

    # Bonus for birds with powers
    if bird.power:
        score += 3                       # Why 3? Seemed fair.
        if bird.power.abstracted_power:
            score += 3                   # Why 3 more? Gut feeling.

    # Migration bird bonus (50% chance)
    if bird.name in MIGRATION_BIRDS and random.random() < 0.50:
        score += 15                      # Why 15? Big number = gets played.

    # Prefer cheaper birds (food_cost = food tokens needed to play it)
    score -= bird.food_cost * 0.5        # Why 0.5? Half-penalty felt right.

    return score

And the egg-laying decision? The most strategically important action in the late game?

def _select_best_egg_laying(self, actions, player, game):
    # For now, just pick randomly from valid egg actions
    # Could be enhanced to prefer birds with more egg capacity
    return random.choice(actions)      # Literally random.

It gets worse

There are actually two heuristic AIs in the codebase — SmartAIPlayer in the AI module and a separate smart_action_selector in the simulator script. They evolved independently and disagree on the thresholds. One says the early game ends at 3 birds; the other uses the round number instead. One caps food at 10; the other at 4. Two hand-tunings of the same game couldn’t even converge on the same numbers.

AI That Rewrites Its Own Algorithms

I wrote a full explainer of this paper, but here’s the short version. Google DeepMind built AlphaEvolve, a system that takes a working algorithm, uses an LLM (Gemini) to propose code mutations, evaluates each variant by running it on actual games, keeps the winners, and repeats for hundreds of generations.

They pointed it at two families of game-theory algorithms that researchers had been refining for over 20 years. AlphaEvolve discovered two new variants that beat the hand-designed ones on 10 out of 11 benchmark games. The evolved algorithms contained tricks no human proposed — like throwing away the first 500 iterations of data (counterintuitive, but it works) and multiplying certain values by exactly 1.1 (no theoretical justification, but empirically superior).

The parallel

The paper’s algorithms had the same structure as my Wingspan AI: hardcoded thresholds, fixed orderings, manually chosen constants. The only difference is that theirs were designed by PhD researchers and mine were designed by me guessing. AlphaEvolve beat both kinds. If it can improve algorithms refined by experts over two decades, it can definitely improve score += bird.points * 2.

Mapping AlphaEvolve to Wingspan

The AlphaEvolve approach needs three things: a seed algorithm to start from, a fitness function to measure quality, and code to mutate. Add an LLM to propose the mutations and the loop is complete. The Wingspan simulator already has the first three.

Seed Algorithm (exists today)

SmartAIPlayer — it already plays legal, complete games. It’s not optimal, but it’s a working starting point. AlphaEvolve doesn’t need a good algorithm; it needs a functional one to evolve from.

Fitness Function (exists today)

Average final score across N games. The simulator already returns final_scores at game end. Run 100 games, take the mean — higher average score = better algorithm. Simple, fast, and already built.
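
A minimal sketch of that fitness function. run_game() is an assumed helper that plays one full headless game with the given AI and returns its final score; the real entry point may differ:

def evaluate(ai_player, n_games=100):
    """Mean final score over n_games full games; higher is fitter."""
    scores = [run_game(ai_player) for _ in range(n_games)]
    return sum(scores) / len(scores)

At 4.5 games per second, evaluating one candidate over 100 games takes roughly 22 seconds.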

Code to Mutate (exists today)

Three functions: _get_action_priorities() (what to do), _score_bird_play() (which bird to play), and _select_best_food_to_gain() (which food to take). These contain all the guessed numbers.

The LLM (available via API)

AlphaEvolve used Gemini, and the system itself isn’t public. But the approach works with any code-capable LLM — Claude, GPT-4, or even a local model. It just needs to read a Python function and propose a modified version.

Here’s what the evolution might produce. The current scoring function is a flat linear formula. An evolved version might discover non-linear interactions, conditional logic, or phase-dependent weights — things that work but that no human would think to try:

Current: Hand-Tuned
score  = bird.points * 2
score += 3 if bird.power else 0
score -= food_cost * 0.5
# 3 constants, all guesses

Hypothetical: Evolved
score  = bird.points * 2.3
score += 4.7 if bird.power else 0
score -= food_cost * 0.8
if round > 2 and eggs < 4:
    score += egg_capacity * 1.4
# non-obvious conditional the AI found

Three Paths Forward

AlphaEvolve isn’t publicly available, but the idea is reproducible. There are three realistic ways to apply it to the Wingspan simulator, each at a different level of complexity.

Genetic Algorithm: evolve the numbers (easiest)

Extract all 50+ thresholds into a parameter vector. Use a standard genetic algorithm to mutate values, run tournaments, keep winners. No LLM needed — just numerical optimization. The simplest path and the quickest to validate (sketched below).
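
A minimal sketch of that loop, assuming the thresholds have been pulled into a flat list of floats and reusing the evaluate() fitness function from above (here taking a parameter vector instead of a player):

import random

def mutate(params, sigma=0.1):
    # Jitter each threshold by ~10% Gaussian noise
    return [p * (1 + random.gauss(0, sigma)) for p in params]

def evolve(seed_params, generations=50, pop_size=20):
    population = [mutate(seed_params) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        survivors = ranked[: pop_size // 4]        # keep the top quarter
        population = survivors + [
            mutate(random.choice(survivors))
            for _ in range(pop_size - len(survivors))
        ]
    return max(population, key=evaluate)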

Reinforcement Learning: learn from self-play (medium)

The gym-style API is already built, and the StateEncoder already produces observation vectors. Hook up PPO or similar via stable-baselines3 and train it against copies of itself.
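
Roughly what the stable-baselines3 hookup could look like. This assumes WingspanEnv is wrapped as a standard Gymnasium env; since the legal-move set changes every turn, a real version would likely want action masking (e.g. MaskablePPO from sb3-contrib) rather than vanilla PPO:

from stable_baselines3 import PPO

env = WingspanEnv(num_players=2)        # hypothetical gym wrapper from the earlier sketch
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)  # training budget is a guess
model.save("wingspan_ppo")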

DIY AlphaEvolve: the LLM evolves the code (most interesting)

Send _score_bird_play() to the Claude API with “propose a better version.” Run 100 games with each variant. Keep the highest-scoring one. Repeat. This is literally what the paper does (sketched below).
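
A minimal sketch of that loop. ask_llm() and evaluate_source() are assumed helpers: the first calls any code-capable model and returns Python source, the second compiles a candidate scoring function, injects it into SmartAIPlayer, and returns its average score over 100 games:

def evolve_function(seed_source, generations=100):
    best_source = seed_source
    best_score = evaluate_source(best_source)
    for _ in range(generations):
        candidate = ask_llm(
            "Here is the bird-scoring function of a Wingspan AI:\n\n"
            f"{best_source}\n\n"
            "Propose a modified version that should score higher on "
            "average. Return only valid Python."
        )
        try:
            score = evaluate_source(candidate)
        except Exception:
            continue                    # mutants that crash are discarded
        if score > best_score:          # greedy hill-climbing: keep the winner
            best_source, best_score = candidate, score
    return best_source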

The practical choice

Path 1 (genetic algorithm) is the fastest to build and would already produce better thresholds than my guesses. Path 3 (DIY AlphaEvolve) is the most exciting because it can discover structural changes to the code — new conditionals, new factors, entirely new decision logic — not just better numbers. The paper’s most impressive discoveries were structural, not numerical.

Why This Matters

This isn’t really about Wingspan. It’s about a pattern that shows up everywhere: a developer writes a heuristic with guessed constants, it works well enough, and nobody goes back to optimize it. Scheduling algorithms, recommendation scores, resource allocation weights, retry backoff timers — they’re all full of numbers someone picked because they felt right.

The AlphaEvolve paper shows that LLMs can do more than just write code to spec. They can explore the space of possible code and find solutions humans wouldn’t think to try. The Wingspan simulator is a perfect sandbox for that experiment — a complete game engine, a measurable fitness function, and an AI full of numbers that are begging to be replaced by better ones.

50+ Guesses → ?
Every threshold in SmartAIPlayer is a guess. The AlphaEvolve paper proved that AI can replace human guesses with empirically optimal values — and discover entirely new decision logic in the process. The simulator is ready. The fitness function is built. The only missing piece is the evolution loop.