This procedure resolves gradient conflicts between the exploration and diversity objectives. If a negative inner product is detected between their gradients, one is projected onto the orthogonal complement of the other. This ensures that the two objectives do not interfere destructively during optimization.
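As a concrete illustration, the sketch below implements this projection for two flattened gradient vectors. The function and variable names are ours, and the small epsilon added for numerical stability is an assumption rather than part of the method's specification.

```python
import torch

def project_if_conflicting(g_a: torch.Tensor, g_b: torch.Tensor) -> torch.Tensor:
    """If g_a and g_b conflict (negative inner product), remove from g_a its
    component along g_b, i.e. project g_a onto the orthogonal complement of g_b."""
    dot = torch.dot(g_a, g_b)
    if dot < 0:
        g_a = g_a - (dot / (g_b.norm() ** 2 + 1e-12)) * g_b
    return g_a

# Illustrative usage with flattened per-objective gradients.
g_explore = torch.randn(4096)
g_diversity = torch.randn(4096)
g_diversity = project_if_conflicting(g_diversity, g_explore)
combined = g_explore + g_diversity  # conflict-free combined update
```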
The agent collects diverse experiences by sampling skills and optimizing both exploration (via entropy and RND) and skill diversity (via contrastive learning). Intrinsic rewards are computed and their gradients are combined using gradient surgery, enabling stable pretraining without manual loss balancing.
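A minimal sketch of one such pretraining update is shown below, assuming a shared policy network and two scalar loss functions (one for the exploration objective, one for the contrastive diversity objective). The helper names and the simple summation of the two gradients are illustrative assumptions, not the exact training loop.

```python
import torch

def pretrain_step(policy, optimizer, batch, exploration_loss_fn, diversity_loss_fn):
    """One pretraining update combining two objectives via gradient surgery (sketch).

    exploration_loss_fn : entropy/RND-style exploration objective (assumed callable)
    diversity_loss_fn   : contrastive skill-diversity objective (assumed callable)
    """
    params = [p for p in policy.parameters() if p.requires_grad]

    # Per-objective gradients, flattened into single vectors.
    g_exp = torch.cat([g.flatten() for g in
                       torch.autograd.grad(exploration_loss_fn(policy, batch), params)])
    g_div = torch.cat([g.flatten() for g in
                       torch.autograd.grad(diversity_loss_fn(policy, batch), params)])

    # Gradient surgery: on a conflict, project the diversity gradient onto the
    # orthogonal complement of the exploration gradient.
    dot = torch.dot(g_div, g_exp)
    if dot < 0:
        g_div = g_div - (dot / (g_exp.norm() ** 2 + 1e-12)) * g_exp

    # Write the combined gradient back to the parameters and take a step.
    combined, offset = g_exp + g_div, 0
    optimizer.zero_grad()
    for p in params:
        n = p.numel()
        p.grad = combined[offset:offset + n].view_as(p)
        offset += n
    optimizer.step()
```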
In the downstream phase, a learned skill selector picks the best skill for the task based on the current state. The agent executes actions using the selected skill, and both the policy and selector are updated using extrinsic rewards. This adaptive reuse enables efficient skill transfer.
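The sketch below illustrates one way such a state-conditioned skill selector could look. The network architecture, the categorical skill distribution, and the simple policy-gradient update on the extrinsic return are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SkillSelector(nn.Module):
    """Maps the current state to a distribution over discrete skills (illustrative)."""

    def __init__(self, state_dim: int, num_skills: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_skills),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(state))

# Illustrative downstream loop: sample a skill for the current state, roll out the
# skill-conditioned policy, then reinforce the selector with the extrinsic return.
selector = SkillSelector(state_dim=24, num_skills=16)
opt = torch.optim.Adam(selector.parameters(), lr=1e-4)

state = torch.randn(1, 24)
dist = selector(state)
skill = dist.sample()                   # skill index passed to the policy
extrinsic_return = torch.tensor(1.0)    # placeholder for the task return
loss = -(dist.log_prob(skill) * extrinsic_return).mean()
opt.zero_grad(); loss.backward(); opt.step()
```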
We evaluate AMPED on the Unsupervised Reinforcement Learning Benchmark (URLB), which spans diverse tasks involving locomotion and manipulation. These experiments demonstrate that AMPED learns diverse and reusable skills in high-dimensional control settings.
Qualitative examples of the discovered skill behaviors include stepping forward, backward somersaults, getting up from the ground, clockwise rotation, upside-down recovery, left and right reach-and-grasp, and upward lifting.
We evaluate AMPED in Tree Maze and 2D Maze environments to assess the spatial coverage and discriminability of the learned skills through visual analysis.
We evaluate AMPED and prior methods on the Unsupervised Reinforcement Learning Benchmark (URLB) using four aggregate metrics computed over expert-normalized scores: Median, Interquartile Mean (IQM), Mean, and Optimality Gap.
Notably, AMPED ranks first on all four metrics, highlighting both its consistency and its effectiveness across a wide range of tasks. It substantially outperforms prior methods such as BeCL, CeSD, ComSD, and DIAYN, particularly on IQM and optimality gap.
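For reference, the sketch below shows one plain-NumPy way to compute these four aggregates from a matrix of expert-normalized scores. The exact trimming convention for the IQM and the optimality-gap threshold of 1.0 are our assumptions; the benchmark's official tooling additionally reports bootstrap confidence intervals.

```python
import numpy as np

def aggregate_metrics(scores: np.ndarray) -> dict:
    """Aggregate expert-normalized scores (runs x tasks) into the four URLB metrics.

    IQM is the mean of the middle 50% of scores; the optimality gap measures how
    far scores fall short of the expert level of 1.0 (threshold assumed here).
    """
    flat = np.sort(scores.ravel())
    n = flat.size
    iqm = flat[n // 4: n - n // 4].mean()  # interquartile mean
    return {
        "median": float(np.median(flat)),
        "iqm": float(iqm),
        "mean": float(flat.mean()),
        "optimality_gap": float(np.maximum(1.0 - flat, 0.0).mean()),
    }

# Example: 10 seeds x 12 downstream tasks of normalized scores.
print(aggregate_metrics(np.random.uniform(0.3, 1.2, size=(10, 12))))
```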
We conduct ablation experiments by systematically removing each core component of AMPED: RND-based exploration, AnInfoNCE-based diversity, gradient surgery, and the skill selector. The performance is evaluated across three domains (Walker, Quadruped, Jaco) with four tasks each.
Removing RND or gradient surgery leads to the most significant performance degradation, particularly in locomotion-heavy domains such as Walker and Quadruped. The AnInfoNCE diversity objective and the skill selector also contribute meaningfully, though to a lesser degree. These results underscore the importance of each component to AMPED's overall performance.
We compare the wall-clock training time of AMPED with various baselines across the Walker, Quadruped, and Jaco domains. Despite achieving superior downstream performance, AMPED introduces only a modest runtime overhead compared to competitive methods such as CeSD and BeCL. Overall, these results indicate that AMPED achieves a favorable trade-off between computational cost and empirical performance.
We conduct an ablation study to investigate the effect of the gradient projection ratio, that is, the probability of projecting the diversity gradient onto the orthogonal complement of the exploration gradient when the two gradients conflict (negative dot product).
We find that a moderate projection ratio offers the best trade-off between diversity and exploration. A ratio that is too low disables effective gradient alignment, leading to degraded performance, while a ratio that is too high may overly constrain exploration and limit skill diversity.
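A minimal sketch of this ratio-controlled projection, under the same assumptions as the earlier gradient-surgery sketch, might look as follows; the uniform coin flip per update is an illustrative choice.

```python
import torch

def maybe_project(g_diversity: torch.Tensor, g_explore: torch.Tensor, ratio: float) -> torch.Tensor:
    """Resolve a gradient conflict with probability `ratio` (illustrative sketch).

    ratio = 0.0 never projects conflicting gradients; ratio = 1.0 always does;
    intermediate values trade off conflict resolution against unconstrained updates.
    """
    dot = torch.dot(g_diversity, g_explore)
    if dot < 0 and torch.rand(1).item() < ratio:
        g_diversity = g_diversity - (dot / (g_explore.norm() ** 2 + 1e-12)) * g_explore
    return g_diversity
```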