Agent Demonstrations
Watch our trained agent speedrun through Pokémon Emerald. The agent operates purely on raw pixels without any VLM at inference time.
ROUTE101_TO_OLDALE — RL agent discovers efficient routing and autonomously selects RUN to skip wild battles (2x speed)
EXIT_BIRCH_LAB — Agent learns to skip the Pokémon nickname input UI through emergent behavior (2x speed)
The Challenge
The NeurIPS 2025 PokéAgent Challenge asks: Can we build an AI agent that speedruns Pokémon Emerald?
This is a uniquely difficult problem for two reasons:
- Long-horizon task — Reaching the first gym takes ~40 minutes of optimal play, requiring thousands of sequential decisions
- Requires near-optimal actions — Speedrunning demands efficiency; suboptimal actions compound into significant time loss
Why Existing Approaches Fall Short
LLM/VLM Agents
- Can generate high-level plans and understand game context
- Often suboptimal at low-level control
- High inference latency makes real-time play impractical
RL Agents
- Struggle with long-horizon tasks and sparse rewards
- Can achieve near-optimal performance with sufficient training
- Fast inference enables real-time play
Can we combine the strengths of both?
Our Approach: VLM Code Expert + Expert-Guided RL
(1) Subgoal Generation: Given a long-horizon task specification, an LLM decomposes the task into sequential subgoals, each paired with an executable success-condition function `success_cond(state)` that determines task completion.

(2) Scripted Policy Generation: For each subgoal, the LLM generates a scripted policy that maps states to actions. The policy can invoke a VLM tool (`extract_feature`) to parse visual information not available in the structured state, and uses logging statements (`print(log)`) to record execution traces for later analysis. The policy interacts with the environment until `success_cond` returns true or a timeout occurs. On failure, the LLM analyzes the logged traces and revises either the policy code or the subgoal specification (a sketch of this loop appears below).

(3) Script-Guided RL: Once all scripted policies succeed reliably, we distill them into neural network policies via supervised learning on expert trajectories, followed by reinforcement learning with expert action guidance. The resulting neural policy exhibits more efficient behavior than policies trained without distillation.
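To make steps (1) and (2) concrete, here is a minimal sketch of what an LLM-generated subgoal and its scripted policy might look like. The `SubgoalSpec` container, the state fields (`map_id`, `in_dialogue`, `pos`, `frame`), the button strings, and the `env` interface are illustrative assumptions rather than the actual challenge API; `extract_feature` stands in for the VLM tool described above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # structured emulator state (illustrative)

@dataclass
class SubgoalSpec:
    """One LLM-generated subgoal: a name plus an executable success condition."""
    name: str
    success_cond: Callable[[State], bool]

# Example subgoal: walk from Littleroot Town onto Route 101.
LITTLEROOT_TO_ROUTE101 = SubgoalSpec(
    name="LITTLEROOT_TO_ROUTE101",
    success_cond=lambda s: s["map_id"] == "ROUTE_101",  # assumed state field
)

def extract_feature(frame: Any, query: str) -> bool:
    """Placeholder for the VLM tool call; the real signature is an assumption."""
    raise NotImplementedError("dispatch the frame and query to a VLM")

def scripted_policy(state: State) -> str:
    """LLM-written expert policy for this subgoal: map a state to a button press."""
    if state["in_dialogue"]:
        return "A"  # advance any dialogue box
    # Ask the VLM for information missing from the structured state.
    blocked = extract_feature(state["frame"], "is the path north blocked?")
    print(f"[log] pos={state['pos']} blocked={blocked}")  # trace for later LLM analysis
    return "LEFT" if blocked else "UP"

def run_subgoal(env: Any, subgoal: SubgoalSpec, policy: Callable[[State], str],
                max_steps: int = 500) -> bool:
    """Roll out the scripted policy until success_cond fires or the step budget runs out."""
    state = env.get_state()  # assumed emulator wrapper
    for _ in range(max_steps):
        state = env.step(policy(state))
        if subgoal.success_cond(state):
            return True
    return False  # on failure, the LLM reads the printed trace and revises the script
```

On a failed rollout, the printed trace and final state would be fed back to the LLM, which edits either `scripted_policy` or the subgoal's `success_cond` before retrying.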
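For step (3), one common way to realize "expert action guidance" is to add a behavior-cloning term on the scripted expert's actions to an otherwise standard actor-critic update. The sketch below assumes a PyTorch policy network over pixel observations and a PPO-style clipped surrogate; the exact objective used in our system may differ.

```python
import torch
import torch.nn.functional as F

def expert_guided_loss(policy_net, batch, clip_eps: float = 0.2, bc_coef: float = 0.5):
    """Clipped PPO surrogate plus a cross-entropy term toward the scripted
    expert's actions (one assumed form of expert guidance)."""
    logits, values = policy_net(batch["pixels"])            # pixels-only observations
    log_probs = F.log_softmax(logits, dim=-1)

    # Policy-gradient term on the agent's own sampled actions.
    new_logp = log_probs.gather(1, batch["actions"].unsqueeze(1)).squeeze(1)
    ratio = torch.exp(new_logp - batch["old_log_probs"])
    adv = batch["advantages"]
    pg_loss = -torch.min(
        ratio * adv,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv,
    ).mean()

    # Expert guidance: push the policy toward the action the scripted policy
    # would have taken in the same state (distillation signal).
    bc_loss = F.cross_entropy(logits, batch["expert_actions"])

    value_loss = F.mse_loss(values.squeeze(-1), batch["returns"])
    return pg_loss + 0.5 * value_loss + bc_coef * bc_loss
```

In this framing, the supervised pretraining stage minimizes only `bc_loss` on trajectories collected by the scripted policies, and `bc_coef` can be annealed as the RL objective takes over.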
Results
NeurIPS 2025 PokéAgent Challenge Leaderboard
| Rank | Team | Method | Time to First Gym (hh:mm:ss) |
|---|---|---|---|
| 🥇 1st | Ours | VLM Code Expert + Expert-Guided RL | 00:40:13 |
| 🥈 2nd | Hamburg PokeRunners | PPO with recurrent network | 01:14:43 |
| 🥉 3rd | anthonys | Tool-Calling VLM Policy | 01:29:17 |
Quantitative Analysis
Expert-guided RL significantly outperforms both the naive RL and expert-only baselines (mean steps to milestone completion ± std; lower is better):
| Milestone | Naive RL | Expert-only | Expert-guided RL |
|---|---|---|---|
| LITTLEROOT_TO_ROUTE101 | timeout | 90.15 ± 33.7 | 55.75 ± 12.9 |
| EXIT_BIRCH_LAB | timeout | 64.40 ± 1.0 | 56.35 ± 1.1 |
Emergent Behaviors
The RL agent discovered strategies not explicitly encoded in the expert code:
- Efficient route selection — avoiding unnecessary detours
- Skipping wild battles — using RUN to quickly exit encounters
- UI optimization — skipping the nickname input with a START+A combo
Summary
We present a knowledge-based, expert-guided reinforcement learning approach for playing Pokémon Emerald. Our method externalizes the game knowledge of a Vision-Language Model (VLM) into Python code expert policies, which serve as teachers for training pixel-based neural network agents.
The trained agent ranked 1st place on the NeurIPS 2025 PokéAgent Challenge Speedrun track, completing the first gym (Roxanne) in 40 minutes and 13 seconds, without complex reward engineering or large-scale human demonstrations.