Self Play Curricula#

These curricula are designed to be used for 2-player competitive games. They save a history of previous agents and sample one to play against the current agent at the start of each episode. For more information on the co-player interface, see Co-player Curricula. For all of these methods, the user is required to have training code which supports self play already. These curricula provide the additional logic to store and sample from a history of agents.

Self Play#

A simple method for 2-player competitive games where the protagonist plays against a copy of itself. This produces an implicit curriculum of increasingly challenging opponents as the agent becomes more proficient at the game. However, because the opponent is always equally skilled at the game, it does not always produce the most useful reward signal. In addition, in non-transitive games where it is not possible to strictly improve over a given strategy, Self Play can lead to oscillations in performance as the agent learns cyclical strategies to exploit its current behavior. The classic example of this is Rock Paper Scissors, where the agent will rotate between choosing rock, paper, and scissors over the course of training.

Note that this curriculum always returns the current agent’s identifier 0, so it does not add anything to existing self play code. It is included for completeness and to allow comparisons between the other self play algorithms with only a single change to the training code.

class syllabus.curricula.selfplay.SelfPlay(task_space: TaskSpace, agent: Agent, device: str)[source]#

Bases: Curriculum

Self play curriculum for training agents against themselves.

add_agent(agent: Agent) → int[source]#

Add an agent to the curriculum.

Parameters:: agent – Agent to add to the curriculum
Return agent_id:: Identifier of the added agent

get_agent(agent_id: int) → Agent[source]#

Load an agent from the buffer of saved agents.

Parameters:: agent_id – Identifier of the agent to load
Returns:: Loaded agent

log_metrics(writer, logs, step=None, log_n_tasks=1)[source]#: Log metrics for the curriculum.

sample(k=1)[source]#

Sample k tasks from the curriculum.

Parameters:: k – Number of tasks to sample, defaults to 1
Returns:: Either returns a single task if k=1, or a list of k tasks

update_winrate(agent_id: int, reward: int) → None[source]#

Uses an incremental mean to update an agent’s winrate. This assumes that reward is positive for a win and negative for a loss. Not used for sampling.

Parameters:

agent_id – Identifier of the agent
reward – Reward received by the agent

Fictitious Self Play#

An extension of Self Play that samples the opponent from previous iterations of the protagonist agent. The deep learning version is also sometimes called Neural Ficitious Self Play. This allows the agent to play against a variety of strategies that it has previously learned, and can be used to prevent oscillations in performance. This allows the agent to converge to a policy that is robust against all strategies it has previously learned. However, it can be less sample-efficient than Self Play because the agent must spend a disproportionate amount of time playing against older strategies that it has already learned to beat. Despite this, Fictitious Self Play has been used in several high profile successes in reinforcement learning including AlphaGo and OpenAI Five, and agent trained to play Dota 2. The method was originally introduced in “Iterative Solutions of Games by Fictitious Play” In Activity Analysis of Production and Allocation by Brown, G. W. in 1951.

This curriculum stores a history of previous agents to the disk, and maintains a cache of recently used agents in memory. You can control the size of the history with the max_agents argument and the size of the cache with the max_loaded_agents argument. You can also control the storage path for the agent history with storage_path, and the device that they will be loaded onto with device.

class syllabus.curricula.selfplay.FictitiousSelfPlay(task_space: TaskSpace, agent: Agent, device: str, storage_path: str, max_agents: int, seed: int = 0, max_loaded_agents: int = 1)[source]#

Bases: Curriculum

add_agent(agent)[source]#: Saves the current agent instance to a pickle file. When the max_agents limit is met, older agent checkpoints are overwritten.

get_agent(agent_id: int) → Agent[source]#: Loads an agent from the buffer of saved agents.

log_metrics(writer, logs, step=None, log_n_tasks=1)[source]#: Log metrics for the curriculum.

sample(k=1)[source]#

Sample k tasks from the curriculum.

Parameters:: k – Number of tasks to sample, defaults to 1
Returns:: Either returns a single task if k=1, or a list of k tasks

update_winrate(opponent_id: int, opponent_reward: int) → None[source]#: Uses an incremental mean to update the opponent’s winrate i.e. priority. This implies that sampling according to the winrates returns the most challenging opponents.

Prioritized Fictitious Self Play#

This method addresses some of the limitations of Fictitious Self Play by prioritizing agents which have a high winrate against the current agent. That way, the protagonist agent is trained against a variety of strategies but does not spend a disproportionate amount of time playing against weak opponents. This method in combination with many other curricula, was used to train AlphaStar, the agent which learned to play Starcraft 2 at a high professional level.

class syllabus.curricula.selfplay.PrioritizedFictitiousSelfPlay(task_space: TaskSpace, agent: Agent, device: str, storage_path: str, max_agents: int, seed: int = 0, max_loaded_agents: int = 1)[source]#

Bases: Curriculum

add_agent(agent) → None[source]#: Saves the current agent instance to a pickle file and update its priority.

get_agent(agent_id: int) → Agent[source]#: Samples an agent id from the softmax distribution induced by winrates then loads the selected agent from the buffer of saved agents.

log_metrics(writer, logs, step=None, log_n_tasks=1)[source]#: Log metrics for the curriculum.

sample(k=1)[source]#: Samples k agents from the buffer of saved agents, prioritizing opponents with higher winrates.

update_winrate(opponent_id: int, opponent_reward: int) → None[source]#: Uses an incremental mean to update the opponent’s winrate i.e. priority. This implies that sampling according to the winrates returns the most challenging opponents.