Evaluation

Evaluating RL agents trained with curriculum learning requires special consideration. Typically, training tasks are assumed to be drawn from the same distribution as the test tasks, but curriculum learning methods deliberately modify the training task distribution to improve test performance. As a result, training returns are not a good measure of performance. Instead, agents should be periodically evaluated during training on uniformly sampled tasks, ideally from a held-out test set. You can see an example of this approach in our procgen script.
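The sketch below illustrates this kind of periodic held-out evaluation. It assumes a Gymnasium-style environment API; `make_eval_env`, `agent.act`, and `test_task_ids` are hypothetical placeholders, not the exact setup used in our procgen script.

```python
import numpy as np


def evaluate(agent, make_eval_env, test_task_ids, episodes_per_task=1, seed=0):
    """Return the mean episodic return over uniformly sampled held-out tasks."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(episodes_per_task * len(test_task_ids)):
        task_id = int(rng.choice(test_task_ids))  # uniform over the held-out test set
        env = make_eval_env(task_id)              # built with the same wrappers as training
        obs, _ = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = agent.act(obs)               # hypothetical eval-mode action API
            obs, reward, terminated, truncated, _ = env.step(action)
            episode_return += reward
            done = terminated or truncated
        returns.append(episode_return)
        env.close()
    return float(np.mean(returns))
```

Calling this function at a fixed interval of training steps gives a learning curve on the test distribution rather than on the curriculum's (shifting) training distribution.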

Correctly implementing evaluation code can be surprisingly challenging. Many papers use biased evaluations, so pay attention to these details when comparing your results to the existing literature. In practice, as long as you use the same evaluation procedure for every method you compare, you can still make fair comparisons. Keep the following points in mind (a vectorized evaluation sketch addressing them follows the list):

  • Avoid biasing evaluation results toward shorter episodes. This is easy to do by accident if you parallelize evaluations. For example, if you run a vectorized environment and save the first 10 episodes that finish, your test returns will be biased toward shorter episodes, which may earn higher or lower returns depending on the reward function.

  • Reset the environments before each evaluation. This may seem obvious, but since some vectorized environments don’t allow you to reset the environments directly, it can be tempting to skip this step. If you don’t reset the environments, your evaluations will start from an arbitrary state in a trajectory begun by a previous iteration of the agent. This will still measure improvement, but the measured returns will not exactly reflect the average returns that the current agent can achieve.

  • Use the same environment wrappers for the evaluation environment. This is important because some wrappers, such as the TimeLimit wrapper, change the dynamics of the environment. If you use different wrappers, you may get different results, or your agent may exhibit a completely different policy than it does during training. If you see a gap between evaluation and training performance when training with Domain Randomization, it may mean that you have misconfigured the evaluation environment.
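Here is a minimal sketch of a vectorized evaluation that follows these points, assuming a Gymnasium-style `SyncVectorEnv` with autoreset and a hypothetical batched `agent.act` API. Each sub-environment contributes exactly `episodes_per_env` episodes, so the results are not skewed toward whichever episodes happen to finish first, the environments are freshly constructed and reset before evaluating, and the environment factories are expected to apply the same wrappers used during training.

```python
import gymnasium as gym
import numpy as np


def evaluate_vectorized(agent, make_env_fns, episodes_per_env=5):
    """Mean episodic return, with a fixed episode quota per sub-environment."""
    # make_env_fns should build eval environments with the same wrappers as training.
    envs = gym.vector.SyncVectorEnv(make_env_fns)
    num_envs = envs.num_envs
    obs, _ = envs.reset()                       # always reset before evaluating
    running_returns = np.zeros(num_envs)
    completed = [[] for _ in range(num_envs)]   # finished episode returns per env

    while any(len(c) < episodes_per_env for c in completed):
        actions = agent.act(obs)                # hypothetical batched action API
        obs, rewards, terminated, truncated, _ = envs.step(actions)
        running_returns += rewards
        for i, done in enumerate(np.logical_or(terminated, truncated)):
            if done:
                # Record only up to the quota so fast environments don't dominate the mean.
                if len(completed[i]) < episodes_per_env:
                    completed[i].append(running_returns[i])
                running_returns[i] = 0.0

    envs.close()
    return float(np.mean([r for per_env in completed for r in per_env]))
```

Because every sub-environment is held to the same episode quota, the average is taken over a fixed, unbiased set of episodes rather than over the first episodes to terminate.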