We train a single PPO agent using unsupervised environment design on procedurally generated MiniGrid mazes, where TRACED adaptively prioritizes training tasks. Generalization is evaluated via zero-shot transfer to the mazes listed below, which vary in layout and difficulty.
Simple Crossing
Small Corridor
Large Corridor
Four Rooms
Sixteen Rooms
Sixteen Rooms (Fewer Doors)
Maze
Maze 2
Maze 3
Labyrinth
Labyrinth 2
Perfect Maze (Medium)
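To make the zero-shot protocol concrete, the following is a minimal sketch of how per-maze solve rates could be aggregated. It is not the evaluation code behind the results. The helper `zero_shot_eval`, the callable `run_episode`, and the stub policy are hypothetical names introduced only for illustration. The sketch assumes an episode counts as solved when the agent reaches the goal, and that no training updates happen during evaluation.

```python
from statistics import mean

# Evaluation maze names, as listed above.
EVAL_MAZES = [
    "Simple Crossing", "Small Corridor", "Large Corridor", "Four Rooms",
    "Sixteen Rooms", "Sixteen Rooms (Fewer Doors)", "Maze", "Maze 2",
    "Maze 3", "Labyrinth", "Labyrinth 2", "Perfect Maze (Medium)",
]

def zero_shot_eval(run_episode, level_names, episodes_per_level=10):
    """Return the fraction of solved episodes for each evaluation level.

    `run_episode(name)` is a hypothetical callable that rolls out the
    frozen policy once on the named level and returns True if it solved it.
    """
    results = {}
    for name in level_names:
        outcomes = [1.0 if run_episode(name) else 0.0
                    for _ in range(episodes_per_level)]
        results[name] = mean(outcomes)
    return results

def stub_run_episode(level_name):
    # Placeholder for a real rollout. It carries no information about
    # actual agent performance and exists only to make the sketch runnable.
    return "Maze" not in level_name

solve_rates = zero_shot_eval(stub_run_episode, EVAL_MAZES, episodes_per_level=5)
mean_solve_rate = mean(solve_rates.values())
```

In practice, `run_episode` would wrap an environment reset on the named maze and a rollout of the trained policy with learning disabled.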
We train a single PPO agent using unsupervised environment design on procedurally generated BipedalWalker terrains, where each task is defined by continuous parameters that control terrain features such as gaps, stairs, stumps, and surface roughness. TRACED adaptively prioritizes training terrains. Generalization is evaluated via zero-shot transfer to held-out terrains with distinct obstacle configurations and difficulty levels.
Basic
Hardcore
Pit Gap
Stairs
Stumps
Roughness
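As an illustration of what "a task defined by continuous terrain parameters" can look like in code, the sketch below represents a terrain as a small parameter record and samples one uniformly at random. The field names, the numeric ranges, and the helper `sample_terrain` are assumptions made for this example. They are not the parameterization or the bounds used in the experiments.

```python
import random
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TerrainParams:
    """One BipedalWalker task as a point in a continuous parameter space.

    Field names are hypothetical; they mirror the obstacle types named above.
    """
    pit_gap: float
    stair_height: float
    stump_height: float
    roughness: float

# Hypothetical (low, high) bounds for each parameter, for illustration only.
PARAM_RANGES = {
    "pit_gap": (0.0, 10.0),
    "stair_height": (0.0, 5.0),
    "stump_height": (0.0, 5.0),
    "roughness": (0.0, 10.0),
}

def sample_terrain(rng):
    """Draw a terrain uniformly at random within PARAM_RANGES."""
    values = {name: rng.uniform(low, high)
              for name, (low, high) in PARAM_RANGES.items()}
    return TerrainParams(**values)

rng = random.Random(0)  # fixed seed so the example is reproducible
terrain = sample_terrain(rng)
```

A curriculum method then operates over such records, for example by scoring each sampled terrain and choosing which ones to replay.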
To understand how TRACED constructs curricula over time, we visualize the evolution of training levels generated by the teacher during learning. Rather than focusing solely on final performance, this analysis reveals how task difficulty and structure adapt in response to the agent’s learning progress.
Across both MiniGrid and BipedalWalker, TRACED progressively increases environment complexity, transitioning from simple, easily solvable tasks to more challenging and diverse ones. This behavior reflects the combined effect of transition-aware regret, which drives difficulty ramp-up, and co-learnability, which favors tasks that induce transferable learning.
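One way two such signals could be turned into a task-sampling distribution is sketched below. This is an assumption-laden illustration, not TRACED's implementation: it assumes the priority of a task is a weighted sum of a regret estimate and a co-learnability estimate, and it converts priorities into probabilities with rank-based prioritization in the style of prioritized level replay. The function `task_sampling_probs`, the weight `alpha`, the temperature `beta`, and the input scores are all hypothetical.

```python
def task_sampling_probs(regret, co_learnability, alpha=1.0, beta=0.3):
    """Map per-task scores to sampling probabilities.

    Assumptions (not taken from the source):
      priority_i = regret_i + alpha * co_learnability_i
      weight_i   = (1 / rank_i) ** (1 / beta), with rank 1 = highest priority
    """
    priority = [r + alpha * c for r, c in zip(regret, co_learnability)]
    # Indices of tasks sorted from highest to lowest priority.
    order = sorted(range(len(priority)), key=lambda i: -priority[i])
    ranks = [0] * len(priority)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    weights = [(1.0 / r) ** (1.0 / beta) for r in ranks]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up scores for three tasks, used only to exercise the function.
probs = task_sampling_probs(regret=[0.1, 0.6, 0.3],
                            co_learnability=[0.2, 0.0, 0.4])
```

With these made-up inputs, the third task has the highest combined priority and therefore receives the largest sampling probability, even though the second task has the highest regret on its own. A smaller `beta` concentrates sampling more sharply on the top-ranked tasks.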
MiniGrid Curriculum Evolution
Mazes evolve from simple layouts to longer paths and denser obstacles as training progresses.
BipedalWalker Curriculum Evolution
Terrains gradually introduce larger gaps, higher stairs, and rougher surfaces, matching the agent’s improving locomotion capability.