Prioritized Level Replay#
A curriculum learning method that estimates an agent’s regret on particular environment seeds and uses a prioritized replay buffer to replay levels for which the agent has high regret. This implementation is based on the original open-source implementation at https://github.com/facebookresearch/level-replay, but has been modified to support Syllabus task spaces instead of just environment seeds. PLR has been used in multiple prominent RL works, such as Human-Timescale Adaptation in an Open-Ended Task Space. For more information, you can read the original paper, Prioritized Level Replay (Jiang et al. 2021).
Prioritized Level Replay samples the next training level by prioritizing levels with higher estimated learning potential. The paper proposes multiple metrics for measuring learning potential, but suggests the L1 value loss, or equivalently the magnitude of the Generalized Advantage Estimate (GAE), as the most effective metric. PLR also utilizes a staleness metric to ensure that every task’s learning potential is occasionally re-estimated based on the current policy’s capabilities.
In practice, Prioritized Level Replay updates its sampling distribution after each batch and samples the single highest learning potential task with very high probability. The sampling temperature and task diversity can be increased by raising the temperature argument.
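To build intuition for how the sampling distribution behaves, here is a simplified standalone sketch of the rank-based score transform and staleness mixing described in the paper. The scores and staleness counts are made up, and Syllabus’s TaskSampler implements more options than shown here.

import numpy as np

scores = np.array([0.05, 0.40, 0.10, 0.90])   # hypothetical per-task learning potential (e.g. mean |GAE|)
staleness = np.array([3.0, 0.0, 10.0, 1.0])   # how long since each task was last sampled
temperature = 0.1
staleness_coef = 0.1

# Rank transform: weight each task by 1 / rank, sharpened by 1 / temperature
ranks = np.empty_like(scores)
ranks[np.argsort(-scores)] = np.arange(1, len(scores) + 1)
score_weights = (1.0 / ranks) ** (1.0 / temperature)
score_probs = score_weights / score_weights.sum()

# Staleness distribution ensures rarely sampled tasks are eventually revisited
staleness_probs = staleness / staleness.sum()

# Final sampling distribution is a linear interpolation of the two
probs = (1.0 - staleness_coef) * score_probs + staleness_coef * staleness_probs
# With temperature=0.1, nearly all of the score mass lands on the highest-scoring task;
# raising the temperature flattens score_probs and increases task diversity.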
The default hyperparameters are tuned for Procgen. When applying PLR to a new environment, you may want to tune the staleness_coef, the replay threshold rho (the proportion of tasks that must be seen before replay sampling is allowed), or the number of training seeds. You can change the number of training tasks by modifying your task space.
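For example, a hypothetical tuning for a new environment might look like the sketch below. The values are arbitrary and should be adjusted for your environment.

# Passed to the curriculum as task_sampler_kwargs_dict=... (see Usage below)
task_sampler_kwargs = {
    "staleness_coef": 0.3,   # revisit stale tasks more aggressively than the default 0.1
    "rho": 0.5,              # allow replay once half of the training tasks have been seen
    "temperature": 0.3,      # flatten the sampling distribution for more task diversity
}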
Usage#
PLR expects the environment to be deterministic with respect to the task, which is typically the seed. You may not see good results if your environment is not deterministic for a given task. You can check whether your environment is deterministic by modifying the determinism_tests script at https://github.com/RyanNavillus/Syllabus/blob/main/tests/determinism_tests.py to use your environment.
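If you prefer a quick inline check, the sketch below resets the environment to the same task twice, replays an identical action sequence, and compares the observations. It assumes a Gymnasium-style environment whose task is the seed passed to reset; adapt it to however your environment consumes tasks.

import numpy as np

def is_deterministic(env, task, n_steps=100):
    trajectories = []
    for _ in range(2):
        obs, _ = env.reset(seed=task)      # assumes the task is the environment seed
        env.action_space.seed(0)           # identical action sequence on both runs
        traj = [obs]
        for _ in range(n_steps):
            obs, _, terminated, truncated, _ = env.step(env.action_space.sample())
            traj.append(obs)
            if terminated or truncated:
                break
        trajectories.append(traj)
    return all(np.array_equal(a, b) for a, b in zip(*trajectories))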
To initialize the curriculum, you will also need to provide num_processes, the number of parallel environments. If you are using Generalized Advantage Estimation, you need to pass the same num_steps, gamma, and gae_lambda arguments that you use in your training process. You can set any PLR algorithmic options in the task_sampler_kwargs_dict. Please see the TaskSampler for a full list of options.
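A typical initialization therefore mirrors your training configuration, as in this sketch. Here args is a placeholder for a CleanRL-style argument namespace and env is your task-wrapped environment, as in the setup example further below; neither name is part of Syllabus.

curriculum = PrioritizedLevelReplay(
    env.task_space,
    env.observation_space,
    num_processes=args.num_envs,    # number of parallel environments
    num_steps=args.num_steps,       # must match your rollout length
    gamma=args.gamma,               # must match the GAE settings used in training
    gae_lambda=args.gae_lambda,
    task_sampler_kwargs_dict={"staleness_coef": 0.1},   # any TaskSampler options
)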
PLR requires L1 value estimates from the training process to compute its sampling distribution, and Syllabus provides several different ways to supply them, each with its own pros and cons. In short:
PrioritizedLevelReplay - This is the simplest way to add PLR to a project. It receives step updates from the environments and uses an evaluator to recompute the values for each step. This allows you to use it without modifying the training code in any way, but also means it is duplicating a lot of computation.
CentralPrioritizedLevelReplay - This version directly receives value predictions and other data from the training process, and uses them to compute scores.
DirectPrioritizedLevelReplay - This method allows the user to directly provide the scores used in the sampling distribution. It provides the most control over the curriculum, but also has the highest potential for implementation errors.
We recommend using PrioritizedLevelReplay for initial experiments and tests, then transitioning to CentralPrioritizedLevelReplay or DirectPrioritizedLevelReplay for better performance. Since these have a higher potential for implementation errors, you can compare their performance against a PrioritizedLevelReplay baseline to check for discrepancies. Below we go into more detail about how each method operates and how to configure it for your project.
Note: we plan to merge these methods into a single class in the future.
Note: the current implementations of PrioritizedLevelReplay and CentralPrioritizedLevelReplay only support GAE returns. If you want to use a different return method, you can subclass these classes or use DirectPrioritizedLevelReplay.
Prioritized Level Replay#
This asynchronous implementation of PLR runs automatically with no direct changes to the training code. Once it is configured and the synchronization wrappers are applied, it will begin sending high-priority tasks to the training environments. PrioritizedLevelReplay requires an Evaluator to get the value predictions used to calculate prioritization scores. This introduces some duplicate computation and can slow down training in some cases, especially in systems where agent inference is the bottleneck. If you need to train agents above 10,000 steps per second, we suggest looking at CentralPrioritizedLevelReplay or DirectPrioritizedLevelReplay.
The buffer_size argument to PLR defines how many multiples of num_steps should be allocated for PLR’s buffer. For instance, if num_steps is 64 and buffer_size is 4, then PLR’s buffers will hold 256 total steps. PLR needs to hold extra data because, in order to efficiently batch value predictions, it needs to evaluate values for all environments at once. However, due to the asynchronous updates, some environments may send multiple batches before other environments send any. This means that PLR may need to hold more than num_steps steps before it is able to collect values and update the TaskSampler. If one environment is running significantly slower than the others, this may lead to an overflow error. If you encounter this issue, you can increase the buffer_size to hold more steps, or decrease the batch_size of your environment synchronization wrapper to increase the frequency of updates. Note that the batch_size argument should never exceed the total buffer size, or the update will fail on the first insert. There are also several warnings and error messages in the code to help you diagnose these issues.
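As a concrete illustration of the sizing constraint (the numbers are arbitrary, and batch_size here refers to the environment synchronization wrapper's argument, not a training hyperparameter):

num_steps = 64     # rollout length shared with the trainer
buffer_size = 4    # PLR's buffers hold num_steps * buffer_size = 256 total steps

# The sync wrapper's batch_size must not exceed that total, or the first insert will fail
env = GymnasiumSyncWrapper(env, curriculum.components, batch_size=128)   # 128 <= 256, OK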
Below is an example of how you can set up PrioritizedLevelReplay in your project.
from syllabus.curricula import PrioritizedLevelReplay
from syllabus.evaluators import CleanRLEvaluator
from syllabus.core import GymnasiumSyncWrapper, make_multiprocessing_curriculum

# Initialize the environment
env = Env()

# Create the Evaluator from your agent
evaluator = CleanRLEvaluator(agent)

# Initialize the Curriculum and pass it the evaluator
curriculum = PrioritizedLevelReplay(env.task_space, env.observation_space, evaluator=evaluator)
curriculum = make_multiprocessing_curriculum(curriculum)

# Wrap the environment
env = GymnasiumSyncWrapper(env, curriculum.components)
For a complete example using PrioritizedLevelReplay with CleanRL’s PPO, see https://github.com/RyanNavillus/Syllabus/blob/main/syllabus/examples/training_scripts/cleanrl_procgen.py.
- class syllabus.curricula.plr.plr_wrapper.PrioritizedLevelReplay(task_space: DiscreteTaskSpace | MultiDiscreteTaskSpace, observation_space: Space, *curriculum_args, task_sampler_kwargs_dict: dict | None = None, action_space: Space | None = None, lstm_size: int | None = None, device: str = 'cpu', num_steps: int = 256, num_processes: int = 64, num_minibatches: int = 1, buffer_size: int = 4, gamma: float = 0.999, gae_lambda: float = 0.95, suppress_usage_warnings=False, evaluator: Evaluator | None = None, **curriculum_kwargs)[source]#
Bases:
Curriculum
Prioritized Level Replay (PLR) Curriculum.
- Parameters:
task_space (TaskSpace) – The task space to use for the curriculum.
*curriculum_args – Positional arguments to pass to the curriculum.
task_sampler_kwargs_dict (dict) – Keyword arguments to pass to the task sampler. See TaskSampler for details.
action_space (gym.Space) – The action space to use for the curriculum. Required for some strategies.
device (str) – The device to use to store curriculum data, either “cpu” or “cuda”.
num_steps (int) – The number of steps to store in the rollouts.
num_processes (int) – The number of parallel environments.
gamma (float) – The discount factor used to compute returns.
gae_lambda (float) – The GAE lambda value.
suppress_usage_warnings (bool) – Whether to suppress warnings about improper usage.
**curriculum_kwargs – Keyword arguments to pass to the curriculum.
- log_metrics(writer, logs, step=None, log_n_tasks=1)[source]#
Log the task distribution to the provided tensorboard writer.
- requires_step_updates() → bool [source]#
Returns whether the curriculum requires step updates from the environment.
- Returns:
True if the curriculum requires step updates, False otherwise
- sample(k: int = 1) → List | Any [source]#
Sample k tasks from the curriculum.
- Parameters:
k – Number of tasks to sample, defaults to 1
- Returns:
Either returns a single task if k=1, or a list of k tasks
- class syllabus.curricula.plr.plr_wrapper.RolloutStorage(num_steps: int, num_processes: int, requires_value_buffers: bool, observation_space: Space, num_minibatches: int = 1, buffer_size: int = 2, action_space: Space | None = None, gamma: float = 0.999, gae_lambda: float = 0.95, lstm_size: int | None = None, evaluator: Evaluator | None = None, device: str = 'cpu')[source]#
Bases:
object
- property using_lstm#
Central Prioritized Level Replay#
This version of PLR does not require an evaluator, but it does require additional code to send data from the training loop to the curriculum. Below you can find examples of how to do this for some popular RL frameworks.
For CleanRL, insert the following code at the end of the step loop (for example, at line 216 in ppo.py).
for step in range(0, args.num_steps):
    ...
    with torch.no_grad():
        next_value = agent.get_value(next_obs)
    tasks = [i["task"] for i in infos]
    update = {
        "value": value,
        "next_value": next_value,
        "rew": reward,
        "dones": done,
        "tasks": tasks,
    }
    curriculum.update(update)
For Stable Baselines 3, you can use a callback to send the values to the curriculum. The callback should be passed to the learn method.
import torch
from stable_baselines3.common.callbacks import BaseCallback

class PLRCallback(BaseCallback):
    def __init__(self, curriculum, verbose=0):
        super().__init__(verbose)
        self.curriculum = curriculum

    def _on_step(self) -> bool:
        # Collect the current tasks and compute values for the new observations
        tasks = [i["task"] for i in self.locals["infos"]]
        obs = self.locals["new_obs"]
        obs_tensor = torch.tensor(obs, dtype=torch.float32).to(self.model.device)
        with torch.no_grad():
            new_value = self.model.policy.predict_values(obs_tensor)
        update = {
            "value": self.locals["values"],
            "next_value": new_value,
            "rew": self.locals["rewards"],
            "dones": self.locals["dones"],
            "tasks": tasks,
        }
        self.curriculum.update(update)
        return True

curriculum = CentralPrioritizedLevelReplay(task_space)
model.learn(10000, callback=PLRCallback(curriculum))
For RLlib, the exact code will depend on your version, but you can use callbacks similar to Stable Baselines 3 to update the curriculum after each step. See https://docs.ray.io/en/latest/rllib/rllib-advanced-api.html#rllib-advanced-api-doc for details.
- class syllabus.curricula.plr.central_plr_wrapper.CentralPrioritizedLevelReplay(task_space: DiscreteTaskSpace | MultiDiscreteTaskSpace, *curriculum_args, task_sampler_kwargs_dict: dict | None = None, action_space: Space | None = None, device: str = 'cpu', num_steps: int = 256, num_processes: int = 64, gamma: float = 0.999, gae_lambda: float = 0.95, suppress_usage_warnings=False, **curriculum_kwargs)[source]#
Bases:
Curriculum
Prioritized Level Replay (PLR) Curriculum.
- Parameters:
task_space (TaskSpace) – The task space to use for the curriculum.
*curriculum_args – Positional arguments to pass to the curriculum.
task_sampler_kwargs_dict (dict) – Keyword arguments to pass to the task sampler. See TaskSampler for details.
action_space (gym.Space) – The action space to use for the curriculum. Required for some strategies.
device (str) – The device to use to store curriculum data, either “cpu” or “cuda”.
num_steps (int) – The number of steps to store in the rollouts.
num_processes (int) – The number of parallel environments.
gamma (float) – The discount factor used to compute returns.
gae_lambda (float) – The GAE lambda value.
suppress_usage_warnings (bool) – Whether to suppress warnings about improper usage.
**curriculum_kwargs – Keyword arguments to pass to the curriculum.
- log_metrics(writer, logs, step=None, log_n_tasks=1)[source]#
Log the task distribution to the provided tensorboard writer.
- class syllabus.curricula.plr.central_plr_wrapper.RolloutStorage(num_steps: int, num_processes: int, requires_value_buffers: bool, action_space: Space | None = None)[source]#
Bases:
object
Direct Prioritized Level Replay#
This implementation of PLR allows you to directly compute the scores used to prioritize tasks. This gives you the most control over the curriculum, but it can be tricky to implement a good scoring function. Below is an example of how to implement the Value L1 score in CleanRL’s PPO. The full script can be found at https://github.com/RyanNavillus/Syllabus/blob/main/syllabus/examples/training_scripts/cleanrl_procgen.py.
# Expand the returns and values by one row so the bootstrap value prediction fits
a, b = returns.shape
new_returns = torch.zeros((a + 1, b))
new_returns[:-1, :] = returns
new_values = torch.zeros((a + 1, b))
new_values[:-1, :] = values
new_values[-1, :] = next_value

# Value L1 score: absolute difference between returns and value predictions
scores = (new_returns - new_values).abs()
curriculum.update(tasks, scores, dones)
The tasks and dones arrays have the shape (num_steps, num_envs) and the scores array has the shape (num_steps + 1, num_envs). We need to expand the size of the value tensor to include the next value prediction, and the returns tensor to match. In some versions of PLR, the next values are also added to the final index of the returns tensor. This effectively removes the next values from the Value L1 score calculation, but allows them to still be used for GAE.
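A sketch of that alternative, continuing from the tensors built above: the bootstrap value is also written into the last row of the returns tensor, so the final row cancels out of the Value L1 score while the next values remain available for GAE.

new_returns[-1, :] = next_value
scores = (new_returns - new_values).abs()   # the final row is now zero
curriculum.update(tasks, scores, dones)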
- class syllabus.curricula.plr.direct_plr_wrapper.DirectPrioritizedLevelReplay(task_space: DiscreteTaskSpace | MultiDiscreteTaskSpace, *curriculum_args, task_sampler_kwargs_dict: dict | None = None, action_space: Space | None = None, device: str = 'cpu', num_steps: int = 256, num_processes: int = 64, suppress_usage_warnings=False, **curriculum_kwargs)[source]#
Bases:
Curriculum
Prioritized Level Replay (PLR) Curriculum.
- Parameters:
task_space (TaskSpace) – The task space to use for the curriculum.
*curriculum_args – Positional arguments to pass to the curriculum.
task_sampler_kwargs_dict (dict) – Keyword arguments to pass to the task sampler. See TaskSampler for details.
action_space (gym.Space) – The action space to use for the curriculum. Required for some strategies.
device (str) – The device to use to store curriculum data, either “cpu” or “cuda”.
num_steps (int) – The number of steps to store in the rollouts.
num_processes (int) – The number of parallel environments.
suppress_usage_warnings (bool) – Whether to suppress warnings about improper usage.
**curriculum_kwargs – Keyword arguments to pass to the curriculum.
- log_metrics(writer, logs, step=None, log_n_tasks=1)[source]#
Log the task distribution to the provided tensorboard writer.
Task Sampler#
The task sampler is shared between the different PLR implementations. It is responsible for calculating and tracking scores, and for sampling tasks. It has many options for sampling strategies, which can be configured by passing the task_sampler_kwargs_dict dictionary to PLR’s initializer.
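For example, a sketch of a fuller configuration (hypothetical values; see the parameter descriptions below for the full set of options):

curriculum = CentralPrioritizedLevelReplay(
    task_space,
    task_sampler_kwargs_dict={
        "strategy": "value_l1",             # score tasks by L1 value loss
        "replay_schedule": "proportionate",
        "score_transform": "rank",
        "temperature": 0.1,
        "staleness_coef": 0.1,
        "staleness_transform": "power",
    },
)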
- class syllabus.curricula.plr.task_sampler.TaskSampler(tasks: list, num_steps: int, action_space: Space | None = None, num_actors: int = 1, strategy: str = 'value_l1', replay_schedule: str = 'proportionate', score_transform: str = 'rank', temperature: float = 0.1, eps: float = 0.05, rho: float = 1.0, nu: float = 0.5, alpha: float = 1.0, staleness_coef: float = 0.1, staleness_transform: str = 'power', staleness_temperature: float = 1.0)[source]#
Bases:
object
Task sampler for Prioritized Level Replay (PLR)
- Parameters:
tasks (list) – List of tasks to sample from
action_space (gym.spaces.Space) – Action space of the environment
num_actors (int) – Number of actors/processes
strategy (str) – Strategy for sampling tasks. One of “value_l1”, “gae”, “policy_entropy”, “least_confidence”, “min_margin”, “one_step_td_error”.
replay_schedule (str) – Schedule for sampling replay levels. One of “fixed” or “proportionate”.
score_transform (str) – Transform to apply to task scores. One of “constant”, “max”, “eps_greedy”, “rank”, “power”, “softmax”.
temperature (float) – Temperature for score transform. Increasing temperature makes the sampling distribution more uniform.
eps (float) – Epsilon for eps-greedy score transform.
rho (float) – Proportion of seen tasks before replay sampling is allowed.
nu (float) – Probability of sampling a replay level if using a fixed replay_schedule.
alpha (float) – Linear interpolation weight for score updates. 0.0 means only use old scores, 1.0 means only use new scores.
staleness_coef (float) – Linear interpolation weight for task staleness vs. task score. 0.0 means only use task score, 1.0 means only use staleness.
staleness_transform (str) – Transform to apply to task staleness. One of “constant”, “max”, “eps_greedy”, “rank”, “power”, “softmax”.
staleness_temperature (float) – Temperature for staleness transform. Increasing temperature makes the sampling distribution more uniform.
- property requires_value_buffers#