Prioritized Level Replay (PLR) Curriculum#

Prioritized Level Replay is a simple yet effective curriculum learning method introduced in https://arxiv.org/pdf/2010.03934.pdf. See the paper for additional information on the method. The implementation in this codebase is based on the original implementation at https://github.com/facebookresearch/level-replay/tree/main

PLR has been successfully used to train agents in https://arxiv.org/pdf/2301.07608.pdf with a custom fitness function.

Prioritized Level Replay samples the next training level by prioritizing levels with higher estimated learning potential. The paper proposes multiple metrics for measuring learning potential, but suggests the L1 value loss, or equivalently the magnitude of the Generalized Advantage Estimate (GAE), as the most effective one. PLR also uses a staleness metric to ensure that every task’s score is occasionally refreshed based on the current policy’s capabilities.
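
For intuition, here is a minimal sketch of how a value_l1 score could be computed from one trajectory’s rollout data. The function and its arguments are illustrative, not the library’s internals, and it assumes dones[t] marks a terminal transition at step t:

import numpy as np

def value_l1_score(values, next_value, rewards, dones, gamma=0.999, gae_lambda=0.95):
    # Mean absolute GAE over one trajectory: an illustrative stand-in for the
    # value_l1 scoring strategy, not the library's internal implementation.
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_val = next_value if t == T - 1 else values[t + 1]
        non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_val * non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * non_terminal * last_gae
        advantages[t] = last_gae
    return np.abs(advantages).mean()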

In practice, PLR updates its sampling distribution after each batch and samples the single task with the highest learning potential more than 90% of the time. Task diversity can be increased by raising the temperature argument, which flattens the sampling distribution.
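
The effect of the temperature argument can be seen in a simplified version of the default rank score transform and staleness mixing. This is a sketch of the math under the default settings, not the exact library code:

import numpy as np

def sample_weights(scores, staleness, temperature=0.1, staleness_coef=0.1):
    # Rank transform: weight = (1 / rank) ** (1 / temperature), so a low
    # temperature concentrates nearly all probability on the top-ranked task.
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty(len(scores))
    ranks[np.argsort(-scores)] = np.arange(1, len(scores) + 1)
    weights = (1.0 / ranks) ** (1.0 / temperature)
    weights /= weights.sum()
    # Mix in staleness so rarely sampled tasks are eventually re-scored.
    staleness = np.asarray(staleness, dtype=float)
    staleness_weights = staleness / max(staleness.sum(), 1e-8)
    return (1 - staleness_coef) * weights + staleness_coef * staleness_weights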

The default hyperparameters are tuned for Procgen. When applying PLR to a new environment, you may want to tune staleness_coef, the replay threshold rho, or the number of training seeds. You can change the number of training tasks by modifying your task space.
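
These options can be passed through task_sampler_kwargs_dict. The values below are illustrative starting points for a sweep, not recommendations:

task_sampler_kwargs_dict = {
    "strategy": "value_l1",   # scoring strategy (the default)
    "staleness_coef": 0.3,    # weight on staleness vs. score; worth sweeping
    "rho": 0.5,               # fraction of tasks seen before replay sampling is allowed
    "temperature": 0.3,       # higher values flatten the sampling distribution
}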

Usage#

PLR expects the environment to be deterministic with respect to the task, which is typically the seed. You may not see good results if your environment is not deterministic for a given task.
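
One common way to satisfy this is to reseed the environment with the current task on every reset. The sketch below assumes a Gymnasium-style environment and a hypothetical task attribute; it is not part of Syllabus:

import gymnasium as gym

class SeedOnResetWrapper(gym.Wrapper):
    # Illustrative wrapper: reseed the environment with its current task so
    # that a given task always produces the same level/layout.
    def __init__(self, env):
        super().__init__(env)
        self.task = 0  # hypothetical attribute holding the current task/seed

    def reset(self, **kwargs):
        kwargs["seed"] = int(self.task)  # make episodes deterministic per task
        return self.env.reset(**kwargs)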

To initialize the curriculum, you will also need to provide num_processes, the number of parallel environments. We also recommend passing the same num_steps, gamma, and gae_lambda values that you use in your training process. You can set any PLR algorithmic options in task_sampler_kwargs_dict. Please see the TaskSampler below for a full list of options.
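
A minimal initialization sketch is shown below. The TaskSpace import path and task space definition are assumptions; adapt them to your setup:

from syllabus.curricula.plr.plr_wrapper import PrioritizedLevelReplay
from syllabus.task_space import TaskSpace  # import path is an assumption

task_space = TaskSpace(200)  # e.g. 200 training seeds; construction is an assumption

curriculum = PrioritizedLevelReplay(
    task_space,
    num_processes=64,    # number of parallel environments
    num_steps=128,       # match your training rollout length
    gamma=0.999,         # match your training discount factor
    gae_lambda=0.95,     # match your training GAE lambda
    task_sampler_kwargs_dict={"strategy": "value_l1"},
)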

PLR requires value estimates from the training process to compute its sampling distribution, so you need to add code to your training process to send these values to the curriculum. Below you can find examples of how to do this for some popular RL frameworks.

Insert the following code at the end of the step loop. For example, at line 216 in ppo.py.

for step in range(0, args.num_steps):
    ...  # existing rollout code: action selection, environment step, storage

    # Estimate the value of the next observation so the curriculum can bootstrap returns.
    with torch.no_grad():
        next_value = agent.get_value(next_obs)
    # Read the current task from each parallel environment.
    tasks = envs.get_attr("task")

    # Send values, rewards, dones, and tasks to the curriculum so it can
    # compute value-loss scores for each task.
    update = {
        "update_type": "on_demand",
        "metrics": {
            "value": value,
            "next_value": next_value,
            "rew": reward,
            "dones": done,
            "tasks": tasks,
        },
    }
    curriculum.update_curriculum(update)

Prioritized Level Replay#

class syllabus.curricula.plr.plr_wrapper.PrioritizedLevelReplay(task_space: TaskSpace, *curriculum_args, task_sampler_kwargs_dict: dict = {}, action_space: Space | None = None, device: str = 'cpu', num_steps: int = 256, num_processes: int = 64, gamma: float = 0.999, gae_lambda: float = 0.95, suppress_usage_warnings=False, **curriculum_kwargs)#

Bases: Curriculum

Prioritized Level Replay (PLR) Curriculum.

Parameters:
  • task_space (TaskSpace) – The task space to use for the curriculum.

  • *curriculum_args – Positional arguments to pass to the curriculum.

  • task_sampler_kwargs_dict (dict) – Keyword arguments to pass to the task sampler. See TaskSampler for details.

  • action_space (gym.Space) – The action space to use for the curriculum. Required for some strategies.

  • device (str) – The device to use to store curriculum data, either “cpu” or “cuda”.

  • num_steps (int) – The number of steps to store in the rollouts.

  • num_processes (int) – The number of parallel environments.

  • gamma (float) – The discount factor used to compute returns.

  • gae_lambda (float) – The GAE lambda value.

  • suppress_usage_warnings (bool) – Whether to suppress warnings about improper usage.

  • **curriculum_kwargs – Keyword arguments to pass to the curriculum.

REQUIRES_CENTRAL_UPDATES = True#
REQUIRES_STEP_UPDATES = False#
log_metrics(writer, step=None)#

Log the task distribution to the provided tensorboard writer.
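
A minimal usage sketch with a TensorBoard SummaryWriter; the writer setup and step counter are assumptions:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/plr_example")  # hypothetical log directory
global_step = 10000                         # your training step counter
curriculum.log_metrics(writer, step=global_step)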

sample(k: int = 1) → List | Any#

Sample k tasks from the curriculum.

Parameters:

k – Number of tasks to sample, defaults to 1

Returns:

Either returns a single task if k=1, or a list of k tasks
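
For example (usage sketch):

next_task = curriculum.sample()      # single task
next_tasks = curriculum.sample(k=8)  # list of 8 tasks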

update_on_demand(metrics: Dict)#

Update the curriculum with arbitrary inputs.

update_on_episode(episode_return: float, trajectory: List | None = None) → None#

Update the curriculum with episode results from the environment.

update_on_step(obs, rew, term, trunc, info) → None#

Update the curriculum with the current step results from the environment.

update_on_step_batch(step_results: List[Tuple[int, int, int, int]]) → None#

Update the curriculum with a batch of step results from the environment.

update_task_progress(task: Any, success_prob: float) → None#

Update the curriculum with a task and its success probability upon success or failure.

class syllabus.curricula.plr.plr_wrapper.RolloutStorage(num_steps: int, num_processes: int, requires_value_buffers: bool, action_space: Space | None = None)#

Bases: object

after_update()#
compute_returns(next_value, gamma, gae_lambda)#
insert(masks, action_log_dist=None, value_preds=None, rewards=None, tasks=None)#
to(device)#

TaskSampler#

class syllabus.curricula.plr.task_sampler.TaskSampler(tasks: list, action_space: Space | None = None, num_actors: int = 1, strategy: str = 'value_l1', replay_schedule: str = 'proportionate', score_transform: str = 'rank', temperature: float = 0.1, eps: float = 0.05, rho: float = 1.0, nu: float = 0.5, alpha: float = 1.0, staleness_coef: float = 0.1, staleness_transform: str = 'power', staleness_temperature: float = 1.0)#

Bases: object

Task sampler for Prioritized Level Replay (PLR)

Parameters:
  • tasks (list) – List of tasks to sample from

  • action_space (gym.spaces.Space) – Action space of the environment

  • num_actors (int) – Number of actors/processes

  • strategy (str) – Strategy for sampling tasks. One of “value_l1”, “gae”, “policy_entropy”, “least_confidence”, “min_margin”, “one_step_td_error”.

  • replay_schedule (str) – Schedule for sampling replay levels. One of “fixed” or “proportionate”.

  • score_transform (str) – Transform to apply to task scores. One of “constant”, “max”, “eps_greedy”, “rank”, “power”, “softmax”.

  • temperature (float) – Temperature for score transform. Increasing temperature makes the sampling distribution more uniform.

  • eps (float) – Epsilon for eps-greedy score transform.

  • rho (float) – Proportion of seen tasks before replay sampling is allowed.

  • nu (float) – Probability of sampling a replay level if using a fixed replay_schedule.

  • alpha (float) – Linear interpolation weight for score updates. 0.0 means only use old scores, 1.0 means only use new scores.

  • staleness_coef (float) – Linear interpolation weight for task staleness vs. task score. 0.0 means only use task score, 1.0 means only use staleness.

  • staleness_transform (str) – Transform to apply to task staleness. One of “constant”, “max”, “eps_greedy”, “rank”, “power”, “softmax”.

  • staleness_temperature (float) – Temperature for staleness transform. Increasing temperature makes the sampling distribution more uniform.

after_update()#
metrics()#
property requires_value_buffers#
sample(strategy=None)#
sample_weights()#
update_task_score(actor_index, task_idx, score, num_steps)#
update_with_rollouts(rollouts)#