Agent Demonstrations
Watch our trained agent speedrun through Pokémon Emerald. The agent operates purely on raw pixels without any VLM at inference time.
ROUTE101_TO_OLDALE — RL agent discovers efficient routing and autonomously selects RUN to skip wild battles (2x speed)
EXIT_BIRCH_LAB — Agent learns to skip the Pokémon nickname input UI through emergent behavior (2x speed)
The Challenge
The NeurIPS 2025 PokéAgent Challenge asks: Can we build an AI agent that speedruns Pokémon Emerald?
This is a uniquely difficult problem for two reasons:
- Long-horizon task — Reaching the first gym takes ~40 minutes of optimal play, requiring thousands of sequential decisions
- Requires near-optimal actions — Speedrunning demands efficiency; suboptimal actions compound into significant time loss
Why Existing Approaches Fall Short
LLM/VLM Agents
- ✅ Can generate high-level plans and understand game context
- ❌ Often suboptimal at low-level control
- ❌ High inference latency makes real-time play impractical
RL Agents
- ❌ Struggle with long-horizon tasks and sparse rewards
- ✅ Can achieve near-optimal performance with sufficient training
- ✅ Fast inference enables real-time play
Can we combine the strengths of both?
Our Approach: VLM Code Expert + Expert-Guided RL
We propose a pipeline that uses VLMs as teachers rather than players:
- Milestone Decomposition — Break the long-horizon game into manageable subgoals (e.g., "exit house", "reach Route 101", "defeat Roxanne")
- VLM Code Expert Generation — For each milestone, prompt a VLM (GPT-4o, Gemini 2.5) to generate Python code that solves the task using game state information
- Expert-Guided Policy Learning — Use the code experts to guide RL training:
  - DAgger — Collect expert demonstrations on the learner's state distribution
  - Double DQN — Reinforce successful behaviors and improve beyond the expert baseline
  - Pure Pixel Policy — The final agent is a CNN that maps raw pixels → actions, with no VLM at inference time
This approach combines the VLM's game knowledge with RL's ability to optimize for speed — without the latency cost of running a VLM during gameplay. The sketches below illustrate each stage.
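For the first two stages, a hypothetical code expert for the LITTLEROOT_TO_ROUTE101 milestone might look like the following. The state keys (`map_id`, `player_y`, `in_battle`), the map-id constants, the abstract action names, and the `MILESTONES` registry are illustrative assumptions, not the actual emulator schema or the VLM's generated code.

```python
# Hypothetical shape of a VLM-generated code expert for one milestone.
LITTLEROOT_MAP_ID = 9   # placeholder map id, not the real value
ROUTE101_MAP_ID = 16    # placeholder map id, not the real value

def littleroot_to_route101_expert(state: dict) -> str:
    """Return the next abstract action for the LITTLEROOT_TO_ROUTE101 milestone."""
    if state["in_battle"]:
        return "RUN"                        # flee wild encounters immediately
    if state["map_id"] == LITTLEROOT_MAP_ID:
        # Head north out of Littleroot Town toward Route 101.
        return "UP" if state["player_y"] > 0 else "A"
    if state["map_id"] == ROUTE101_MAP_ID:
        return "UP"                         # keep moving north along the route
    return "A"                              # default: advance dialogue

# One expert per milestone from the decomposition step.
MILESTONES = {
    "LITTLEROOT_TO_ROUTE101": littleroot_to_route101_expert,
    # "ROUTE101_TO_OLDALE": ..., "EXIT_BIRCH_LAB": ..., etc.
}
```

Each expert only has to be good enough to reach its milestone; the RL stage is what pushes the step count below the expert's.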
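The DAgger step can be sketched as below, assuming a hypothetical environment wrapper that returns both the raw frame and a RAM-derived state dict, and a learner exposing `predict`/`fit`; the rollout mixing schedule (e.g., warm-starting from expert rollouts) and other details of the actual training setup are omitted.

```python
import numpy as np

def dagger(env, learner, expert, n_iters=10, horizon=1000):
    """Minimal DAgger loop: roll out the learner, relabel the visited states
    with the code expert, and retrain on the aggregated dataset."""
    frames, labels = [], []
    for _ in range(n_iters):
        obs, state = env.reset()           # obs: raw pixels, state: game-state dict (assumed API)
        for _ in range(horizon):
            frames.append(obs)
            labels.append(expert(state))   # expert action on the learner's state distribution
            action = learner.predict(obs)  # but the learner decides what actually happens
            obs, state, done = env.step(action)
            if done:
                break
        learner.fit(np.array(frames), np.array(labels))  # supervised update on the aggregate
    return learner
```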
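For the RL stage, the Double DQN target decouples action selection from evaluation: the online network picks the next action and the target network scores it, which damps the value overestimation of vanilla DQN. A PyTorch sketch, with shapes and hyperparameters as placeholders:

```python
import torch

def double_dqn_targets(q_net, target_net, rewards, next_obs, dones, gamma=0.99):
    """Compute Double DQN bootstrap targets for a batch of transitions."""
    with torch.no_grad():
        next_actions = q_net(next_obs).argmax(dim=1, keepdim=True)         # select with the online net
        next_q = target_net(next_obs).gather(1, next_actions).squeeze(1)   # evaluate with the target net
        return rewards + gamma * (1.0 - dones) * next_q
```

Training then regresses the online network's Q-values toward these targets with a standard TD loss.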
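The deployed policy itself is a convolutional network over frames. A minimal PyTorch sketch, assuming 240×160 RGB input and an eight-button action set; the actual architecture may differ:

```python
import torch
import torch.nn as nn

N_ACTIONS = 8  # illustrative: UP, DOWN, LEFT, RIGHT, A, B, START, SELECT

class PixelPolicy(nn.Module):
    """Small CNN mapping a raw GBA frame (3 x 160 x 240) to per-action values."""
    def __init__(self, n_actions: int = N_ACTIONS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 3, 160, 240)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n_flat, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(frame.float() / 255.0))
```

At inference the agent takes the argmax of this output for every frame, which is what keeps per-step latency negligible compared to querying a VLM.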
Results
NeurIPS 2025 PokéAgent Challenge Leaderboard
| Rank | Team | Method | Time to First Gym (hh:mm:ss) |
|---|---|---|---|
| 🥇 1st | Ours | VLM Code Expert + Expert-Guided RL | 00:40:13 |
| 🥈 2nd | Hamburg PokeRunners | PPO with recurrent network | 01:14:43 |
| 🥉 3rd | anthonys | Tool-Calling VLM Policy | 01:29:17 |
Quantitative Analysis
Expert-guided RL significantly outperforms both naive RL and expert-only baselines on the number of steps needed to complete each milestone (lower is better):
| Milestone | Naive RL | Expert-only | Expert-guided RL |
|---|---|---|---|
| LITTLEROOT_TO_ROUTE101 | timeout | 90.15 steps (±33.7) | 55.75 steps (±12.9) |
| EXIT_BIRCH_LAB | timeout | 64.40 steps (±1.0) | 56.35 steps (±1.1) |
Emergent Behaviors
The RL agent discovered strategies not explicitly encoded in the expert code:
- Efficient route selection — avoiding unnecessary detours
- Skipping wild battles — using RUN to quickly exit encounters
- UI optimization — skipping nickname input via START+A combo
Summary
We present a knowledge-based, expert-guided reinforcement learning approach for playing Pokémon Emerald. Our method externalizes the game knowledge of a Vision-Language Model (VLM) into Python code expert policies, which serve as teachers for training pixel-based neural network agents.
The trained agent ranked 1st on the NeurIPS 2025 PokéAgent Challenge Speedrun track, defeating the first gym leader (Roxanne) in 40 minutes and 13 seconds — without complex reward engineering or large-scale human demonstrations.