Agent Demonstrations
Watch our trained agent speedrun through Pokémon Emerald. The agent operates purely on raw pixels without any VLM at inference time.
ROUTE101_TO_OLDALE — RL agent discovers efficient routing and autonomously selects RUN to skip wild battles (2x speed)
EXIT_BIRCH_LAB — Agent learns to skip the Pokémon nickname input UI through emergent behavior (2x speed)
The Challenge
The NeurIPS 2025 PokéAgent Challenge asks: Can we build an AI agent that speedruns Pokémon Emerald?
This is a uniquely difficult problem for two reasons:
- Long-horizon task — Reaching the first gym takes ~40 minutes of optimal play, requiring thousands of sequential decisions
- Requires near-optimal actions — Speedrunning demands efficiency; suboptimal actions compound into significant time loss
Why Existing Approaches Fall Short
LLM/VLM Agents
- Can generate high-level plans and understand game context
- Often suboptimal at low-level control
- High inference latency makes real-time play impractical
RL Agents
- Struggle with long-horizon tasks and sparse rewards
- Can achieve near-optimal performance with sufficient training
- Fast inference enables real-time play
Can we combine the strengths of both?
Our Approach: VLM Code Expert + Expert-Guided RL
(1) Subgoal Generation: Given a long-horizon task specification, an LLM decomposes the task into sequential subgoals, each paired with an executable success-condition function `success_cond(state)` that determines task completion.

(2) Scripted Policy Generation: For each subgoal, the LLM generates a scripted policy that maps states to actions. The policy can invoke a VLM tool (`extract_feature`) to parse visual information not available in the structured state, and uses logging statements (`print(log)`) to record execution traces for later analysis. The policy interacts with the environment until `success_cond` returns true or a timeout occurs. On failure, the LLM analyzes the logged traces and revises either the policy code or the subgoal specification (a sketch of this loop appears below).

(3) Script-Guided RL: Once all scripted policies succeed reliably, we distill them into neural network policies via supervised learning on expert trajectories, followed by reinforcement learning with expert action guidance. The resulting neural policy exhibits more efficient behavior than policies trained without distillation.
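To make steps (1) and (2) concrete, here is a minimal sketch of what an LLM-generated subgoal and its scripted policy might look like. The `SubgoalSpec` container, the state fields (`map_id`, `in_dialogue`, `pos`, `frame`), the button strings, and the `env` interface are illustrative assumptions rather than the actual challenge API; `extract_feature` stands in for the VLM tool described above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # structured emulator state (illustrative)

@dataclass
class SubgoalSpec:
    """One LLM-generated subgoal: a name plus an executable success condition."""
    name: str
    success_cond: Callable[[State], bool]

# Example subgoal: walk from Littleroot Town onto Route 101.
LITTLEROOT_TO_ROUTE101 = SubgoalSpec(
    name="LITTLEROOT_TO_ROUTE101",
    success_cond=lambda s: s["map_id"] == "ROUTE_101",  # assumed state field
)

def extract_feature(frame: Any, query: str) -> bool:
    """Placeholder for the VLM tool call; the real signature is an assumption."""
    raise NotImplementedError("dispatch the frame and query to a VLM")

def scripted_policy(state: State) -> str:
    """LLM-written expert policy for this subgoal: map a state to a button press."""
    if state["in_dialogue"]:
        return "A"  # advance any dialogue box
    # Ask the VLM for information missing from the structured state.
    blocked = extract_feature(state["frame"], "is the path north blocked?")
    print(f"[log] pos={state['pos']} blocked={blocked}")  # trace for later LLM analysis
    return "LEFT" if blocked else "UP"

def run_subgoal(env: Any, subgoal: SubgoalSpec, policy: Callable[[State], str],
                max_steps: int = 500) -> bool:
    """Roll out the scripted policy until success_cond fires or the step budget runs out."""
    state = env.get_state()  # assumed emulator wrapper
    for _ in range(max_steps):
        state = env.step(policy(state))
        if subgoal.success_cond(state):
            return True
    return False  # on failure, the LLM reads the printed trace and revises the script
```

On a failed rollout, the printed trace and final state would be fed back to the LLM, which edits either `scripted_policy` or the subgoal's `success_cond` before retrying.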
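For step (3), one common way to realize "expert action guidance" is to add a behavior-cloning term on the scripted expert's actions to an otherwise standard actor-critic update. The sketch below assumes a PyTorch policy network over pixel observations and a PPO-style clipped surrogate; the exact objective used in our system may differ.

```python
import torch
import torch.nn.functional as F

def expert_guided_loss(policy_net, batch, clip_eps: float = 0.2, bc_coef: float = 0.5):
    """Clipped PPO surrogate plus a cross-entropy term toward the scripted
    expert's actions (one assumed form of expert guidance)."""
    logits, values = policy_net(batch["pixels"])            # pixels-only observations
    log_probs = F.log_softmax(logits, dim=-1)

    # Policy-gradient term on the agent's own sampled actions.
    new_logp = log_probs.gather(1, batch["actions"].unsqueeze(1)).squeeze(1)
    ratio = torch.exp(new_logp - batch["old_log_probs"])
    adv = batch["advantages"]
    pg_loss = -torch.min(
        ratio * adv,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv,
    ).mean()

    # Expert guidance: push the policy toward the action the scripted policy
    # would have taken in the same state (distillation signal).
    bc_loss = F.cross_entropy(logits, batch["expert_actions"])

    value_loss = F.mse_loss(values.squeeze(-1), batch["returns"])
    return pg_loss + 0.5 * value_loss + bc_coef * bc_loss
```

In this framing, the supervised pretraining stage minimizes only `bc_loss` on trajectories collected by the scripted policies, and `bc_coef` can be annealed as the RL objective takes over.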
Results
NeurIPS 2025 PokéAgent Challenge Leaderboard
| Rank | Team | Method | Time to First Gym (hh:mm:ss) |
|---|---|---|---|
| 🥇 1st | Ours | VLM Code Expert + Expert-Guided RL | 00:40:13 |
| 🥈 2nd | Hamburg PokeRunners | PPO with recurrent network | 01:14:43 |
| 🥉 3rd | anthonys | Tool-Calling VLM Policy | 01:29:17 |
Quantitative Analysis
Expert-guided RL significantly outperforms both the naive RL and expert-only baselines (mean steps to milestone completion ± std; lower is better):
| Milestone | Naive RL | Expert-only | Expert-guided RL |
|---|---|---|---|
| LITTLEROOT_TO_ROUTE101 | timeout | 90.15 ± 33.7 | 55.75 ± 12.9 |
| EXIT_BIRCH_LAB | timeout | 64.40 ± 1.0 | 56.35 ± 1.1 |
Emergent Behaviors
The RL agent discovered strategies not explicitly encoded in the expert code:
- Efficient route selection — avoiding unnecessary detours
- Skipping wild battles — using RUN to quickly exit encounters
- UI optimization — skipping the nickname input with a START+A combo
Summary
We present a knowledge-based, expert-guided reinforcement learning approach for playing Pokémon Emerald. Our method externalizes the game knowledge of a Vision-Language Model (VLM) into Python code expert policies, which serve as teachers for training pixel-based neural network agents.
The trained agent ranked 1st place on the NeurIPS 2025 PokéAgent Challenge Speedrun track, completing the first gym (Roxanne) in 40 minutes and 13 seconds, without complex reward engineering or large-scale human demonstrations.