Multi-task Curriculum Learning in a Complex, Visual, Hard-exploration Domain: Minecraft


An important challenge in reinforcement learning is training agents that can solve a wide variety of tasks. If tasks depend on each other (e.g. needing to learn to walk before learning to run), curriculum learning can speed up learning by focusing on the next best task to learn. We explore curriculum learning in a complex, visual domain with many hard exploration challenges: Minecraft. We find that learning progress (defined as a change in success probability of a task) is a reliable measure of learnability for automatically constructing an effective curriculum. We introduce a learning-progress based curriculum and test it on a complex reinforcement learning problem (called "Simon Says") where an agent is instructed to obtain a desired goal item. Many of the required skills depend on each other. Experiments demonstrate that: (1) a within-episode exploration bonus for obtaining new items improves performance, (2) dynamically adjusting this bonus across training such that it only applies to items the agent cannot reliably obtain yet further increases performance, (3) the learning-progress based curriculum elegantly follows the learning curve of the agent, and (4) when the learning-progress based curriculum is combined with the dynamic exploration bonus it learns much more efficiently and obtains far higher performance than uniform baselines. These results suggest that combining intra-episode and across-training exploration bonuses with learning progress creates a promising method for automated curriculum generation, which may substantially increase our ability to train more capable, generally intelligent agents.[1]

[1] This is an experiment in sharing "partial work", meaning research that sheds light on a subject but is not as complete as we would make it were we planning on publishing it as "complete work" in a peer-reviewed venue. Due to other priorities, we do not plan to do all that would be required for that level of scientific rigor. We thus faced a choice: either share it "as is", or not share it at all. We chose the former. We acknowledge much is missing, such as a more thorough literature review, experimental comparisons to other methods, ablations, etc. Nevertheless we believe that our results provide meaningful insights to the machine learning community. Our motivation is to share what we discovered, while minimizing the overhead in time and compute resources required to share the work.



An important challenge in reinforcement learning (RL) is to train agents that can solve a wide variety of tasks. Tasks often vary in difficulty and depend on each other, such that learning easier tasks first may help with learning more difficult tasks later. Just as human children crawl before they can walk, and walk before they can run, interesting use cases for RL agents have difficulty hierarchies we would like to exploit: Learning first how to write syntactically correct code might help an agent to later learn how to fix semantic errors in a program. Learning first how to solve high school physics problems likely helps learn college-level physics. Such dependencies particularly exist in RL, where the problem of exploration may prevent a policy from experiencing any useful gradient on a hard task before it has mastered a prerequisite task. If the goal is to pass a set of unit tests in a programming domain, an agent will not get any useful feedback if it always fails all unit tests because it is unable to write syntactically correct code. Side-stepping exploration by collecting human demonstrations can be an option, but in many domains of interest it might be difficult or impossible to collect enough high-quality data to reach expert-human-level performance. Furthermore, if the model relies purely on imitating humans its maximum performance will be limited by the best demonstrations in our data set, and even the combination of all the best demonstrations that humanity has to offer probably will not move the model far beyond human performance.



In the following we assume that we're given a collection of RL tasks, each of which consists of a reward function (that defines the goal of the task) and an environment distribution. A naive solution would be to attempt to learn all tasks simultaneously by sampling them uniformly, irrespective of the dependencies between them. This solution ensures that the agent experiences a useful gradient as long as some of the tasks are learnable given the agent's current skill level. However, if the fraction of learnable tasks at any moment is very small, this approach is computationally very inefficient, as the agent spends most of its time on tasks where it does not actually learn anything.



A more efficient alternative is to build a curriculum that narrows the distribution of tasks being trained on to those that are currently learnable. Which tasks are learnable at a given point in time, or in what order tasks are most easily learnable, is typically not known in advance. As such, the present paper explores methods for inferring learnability on the fly and building an automatic curriculum over tasks. In particular, our goal is, given a set of tasks of varying difficulty, to learn as many tasks as possible, as quickly and compute-efficiently as possible.



This project focused specifically on tasks with identical environment distributions, but different reward functions. Having tasks with different goals but the same environment distribution is a natural setting for the powerful models we wish to create, as the real world presents a diverse but fixed environment distribution in which we would want the model to perform many different tasks. The same is true for our current generation of models, which generally have to learn many different tasks (e.g. write different programs, summarize different books, answer different questions, generate different images) in a universal environment such as vision or language.



A curriculum only helps if mastering one task makes another task easier to learn. However, in a goal-conditioned setup, even when tasks are learned in the "correct" order, meaning that easier tasks are learned before the harder tasks that depend on them, it can be difficult to learn the harder tasks. One problem is that the agent does not know the relationship between different tasks, meaning that if the agent is given the goal for a task that it hasn’t learned yet, it does not know which of its previously learned behaviors might help it achieve that goal. Another problem is that, even if the agent did know exactly which of the previously learned tasks is related to the task it is currently trying to solve, the behavior on that previously learned task may not generalize to the current task (meaning that executing the behavior learned on the previous task does not result in any zero-shot performance on the current task), because the tasks are too different. For example, even if the agent has learned to write syntactically correct code and is executing that behavior frequently, it may never write a program that passes a particular unit-test when that unit-test is selected as the current task and thus be unable to learn to write a program for it. We find that adding a goal-independent exploration bonus that rewards all tasks the agent has not yet learned helps the agent learn new tasks.



We have developed and evaluated these curriculum learning methods in a Minecraft environment based on the MineRL platform [guss2019minerl]. Minecraft is a well-thought-out visual world with rudimentary physics, which has the potential to allow our agents to learn many useful skills such as visual processing, spatial awareness, inferring causality and conducting experiments. Most relevant to curriculum learning in particular, Minecraft features a large tech-tree with many dependencies, making it relatively straightforward to define tasks of varying difficulty that depend on each other to varying degrees. In our experiment, in each task the agent is asked to obtain one out of 107 Minecraft items on command ("Simon Says"), some of which are deep into the tech tree (Figure 4).



Our key results are:



• With uniform sampling the agent learns few tasks, and learning flatlines at a low level (Figure 3, red).



• An exploration bonus (as an auxiliary reward) for performing tasks unrelated to the current goal (a.k.a. curiosity search [stanton2016curiosity; stanton2018deep]) substantially improves performance (Figure 3, green).



• Adding an additional across-training diversity pressure (similar to novelty search [lehman2011abandoning] and intrinsic motivation [strehl2008analysis]) by dynamically removing the exploration bonus for items the agent can already reliably obtain further improves performance (Figure 3, yellow).



• Adding a learning progress curriculum increases performance even more (Figure 3, dark dotted blue). A video of a successful agent obtaining challenging items high up in the tech tree can be viewed at https://youtu.be/MFDudOvn3oc.



• With the learning progress curriculum, the sampling of tasks elegantly follows the learning curve of the agent, focusing learning on the frontier of the agent’s skill set as that skill set expands (Figure 3, bidirectional learning-progress curriculum).



• Looking for any learning change (including performance drops) (Figure 3, dark dotted blue) prevents the catastrophic forgetting of previously learned tasks that otherwise occurs when only learning improvements are measured (Figure 3, light solid blue), which hurts overall performance.



2.1 Learning progress curriculum



First, we identify the conditions under which we expect curriculum learning to work much better than uniform sampling: assume we are given a long list of mutually dependent and progressively more difficult RL tasks $T_1, \ldots, T_N$. We further assume that a policy that has mastered $T_i$ can learn $T_{i+1}$, but not $T_{i+2}$. An agent can therefore learn all tasks if they are presented in this order. However, learning all tasks under uniform sampling is much harder, because only every $1/N$-th rollout is sampled from a learnable task. At a minimum, if it takes at least $H$ samples per batch from task $T$ to learn it, we will need to collect $N$ times more samples (i.e. train with an $N$-times larger batch size) when performing uniform sampling just to ensure that we get $H$ samples from task $T$ per batch. In addition, the non-learnable tasks may add noise to each batch without providing any signal, meaning an even larger than $N$-times larger batch size may be needed. The optimal curriculum is therefore much more compute efficient than uniform sampling.



A key requirement for designing an automatic curriculum is to infer, a priori, which new tasks may be learnable given the agent's current skills. A common approach is to use success probability as a proxy for learnability. Typically, one defines a lower success probability threshold below which tasks are considered too hard and an upper success probability threshold above which tasks are considered too easy, and one predominantly samples tasks between these two thresholds [wang2019paired; wang2020enhanced; openAI2019solving]. However, this approach has several shortcomings: First, if the initial state of the environment is randomized, the agent may have an intermediate success probability on a task solely because the task is too easy from some initial states and too hard (or even impossible) from others, thus preventing the agent from improving any further. In Minecraft, for example, it is only possible to collect jungle planks when starting in a jungle biome. If the agent only starts in the jungle 20% of the time, its success probability for jungle planks would be capped at 20%, even if it perfectly learned the task. Second, success probability thresholds that correlate well with learnability might in general depend on the task, the initial state of the environment and the learning stage of the agent. Finally, stochastic environments (i.e. environments with a stochastic transition function) can have a similar disconnect between success probability and learnability as environments with random initial states: an easy-to-learn task may be capped at an intermediate success probability because of an uncontrollable stochastic event that prevents the agent from succeeding reliably. Selecting tasks with intermediate success probability might therefore select tasks where the agent cannot learn anything new.



Instead, we infer learnability of tasks by measuring learning progress, i.e. the recent change in success probability for each task. Given such a measure, the curriculum predominantly samples tasks with large learning progress. We explore sampling based on a bidirectional learning progress measure (that tracks both increases and decreases in success probability) and a unidirectional measure (that only tracks increases in success probability). The advantage of sampling based on the bidirectional measure is that it not only samples novel tasks when they start showing learning progress, but also samples tasks that are being forgotten.



2.2 Learning progress inference



When designing a process to infer learning progress from empirical measurements of successes and failures on a given task it is important to consider the parameters that influence the accuracy of a learning progress estimator. In particular, as learning progress is measured in terms of how the success probability changes over time, it is important to pick the appropriate time scale $\Delta t$ over which the before/after change in success probability is measured. Consider the following example of a task whose true success probability (black) increases over time (Figure 5, left).



We would like to infer the true learning progress at $t=0$ (vertical dotted black line), which we define as the slope of the success probability curve (solid red line), from the success probabilities measured at time snapshots $t$ and $t-\Delta t$. However, we only have access to a history (up to $t$) of noisy measurements of success probability (jagged grey line; the measurements here are sampled from a binomial distribution with the success probability $p$ equal to the value of the true probability (the black line) at time $t$ and $n$, the number of samples, set to 200): If we choose $\Delta t$ too small, we end up measuring the slope of the noise (dashed yellow line) instead of the slope of the underlying trend (with different noise the slope of yellow could have actually been sharply negative!); the variance of our estimator is too high. If we choose $\Delta t$ too large we do not fully capture recent changes in the slope (light blue); our estimator is biased (in the statistical sense: the error is non-zero even with infinite data). An intermediate $\Delta t$ gives us a good trade-off between bias and variance (dashed dark blue line). Indeed, resampling from the binomial many times (each sample being noisy owing to the small sample size) shows that the expected squared error is minimal for intermediate values of $\Delta t$ (Figure 5, middle, blue), and we can show analytically that this is in general true for any curved success probability curve (Figure 5, middle, orange and Appendix A.1).
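To make this bias-variance trade-off concrete, the following Python sketch repeats the kind of resampling experiment described above on an assumed, illustrative success-probability curve (a task whose success probability only recently started rising); the curve, the $\Delta t$ values and all constants are our own choices, not the exact setup behind Figure 5.

import numpy as np

rng = np.random.default_rng(0)
n_samples = 200  # rollouts per success-probability measurement, as in the example above

def true_p(t):
    # Assumed ground truth: the task only became learnable around t = -60.
    return float(np.clip(0.002 * (t + 60.0), 0.0, 1.0))

true_slope = 0.002  # true learning progress (slope of true_p) at t = 0

def expected_squared_error(delta_t, n_trials=20000):
    # Finite-difference learning-progress estimate from two noisy snapshots, delta_t apart.
    p_now = rng.binomial(n_samples, true_p(0), size=n_trials) / n_samples
    p_then = rng.binomial(n_samples, true_p(-delta_t), size=n_trials) / n_samples
    estimates = (p_now - p_then) / delta_t
    return float(np.mean((estimates - true_slope) ** 2))

for delta_t in (2, 10, 50, 200):
    print(f"delta_t = {delta_t:3d}  expected squared error = {expected_squared_error(delta_t):.2e}")
# Small delta_t: variance dominates; large delta_t: bias dominates; an intermediate delta_t wins.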



Besides picking the right time scale, we can improve the reliability of our learning progress estimator by calculating it based on the success probability measurements of all previous time steps, rather than just two snapshots in time. We implement this by replacing the snapshots with two exponential moving average (EMA) processes with different time scales. In particular, the first EMA represents a "fast success probability" estimate $p_\text{fast}$ for each task (Figure 5, right, red). We then smooth $p_\text{fast}$ with a second, identical EMA to obtain a "slow success probability" estimate $p_\text{slow}$ (green). From there, we define learning progress as the difference between the fast and slow success probability estimates (Figure 5, right, yellow). This learning progress estimate increases after the success probability curves upward (meaning the agent is going from no learning progress to increasing learning progress), flattens after the success probability reaches a linear upward trend (meaning learning progress is happening at a steady pace), and finally goes down to zero again after the measured success probability converges (meaning no further learning progress is happening because the agent has learned everything there is to learn on this task). Note that learning progress does not perfectly follow the derivative of the success probability, but is delayed, because it is calculated from two EMA processes which are themselves delayed estimates of the success probability. In practice, we are able to correct for this temporal delay, as explained in detail below. Based on this definition of learning progress, we further define bidirectional and unidirectional learning progress as the absolute and rectified difference, respectively, between fast and slow success probability. If $p_\text{fast}$ is larger than $p_\text{slow}$, as in Figure 5, right, the two measures are identical.
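As a concrete illustration, here is a minimal Python sketch of such a fast/slow estimator as we understand it from the description above; the smoothing coefficient and the per-rollout update scheme are assumptions, not the values used in our experiments.

import numpy as np

class LearningProgressEstimator:
    """Tracks p_fast and p_slow per task and derives learning progress from their difference."""

    def __init__(self, num_tasks, alpha=0.05):
        self.alpha = alpha                   # assumed EMA coefficient (same for both EMAs)
        self.p_fast = np.zeros(num_tasks)    # fast EMA of measured task success
        self.p_slow = np.zeros(num_tasks)    # the same EMA applied to p_fast, hence slower

    def update(self, task_id, success):
        # One success/failure observation for the given task.
        self.p_fast[task_id] += self.alpha * (float(success) - self.p_fast[task_id])
        self.p_slow[task_id] += self.alpha * (self.p_fast[task_id] - self.p_slow[task_id])

    def bidirectional(self):
        # Absolute difference: responds to both increases and decreases in success probability.
        return np.abs(self.p_fast - self.p_slow)

    def unidirectional(self):
        # Rectified difference: responds only to increases in success probability.
        return np.maximum(self.p_fast - self.p_slow, 0.0)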



We can improve the curriculum further by putting additional focus on tasks that have not only high learning progress, but also a low success probability. The rationale is as follows: Our goal is to learn as many tasks as possible, as opposed to learning a few tasks very well. We thus care much more about an improvement from 0.01% to 5% in a new task than we do about moving a task we already perform well on from 95% to 99.99% reliability (this is the same reason why we chose a 5% threshold for the purpose of evaluation). Since low success probability tasks are more likely to be novel, putting particular focus on them helps to expand the set of learned tasks. In addition, low success probability tasks are also harder to learn because the agent observes fewer successes. Sampling hard tasks more often increases the number of successes and therefore helps to mitigate the exploration problem. While our focus on low-probability tasks may seem reminiscent of the fixed success-probability-threshold methods discussed above, it avoids the false positives when selecting tasks solely based on success thresholds because it excludes tasks without learning progress.



We implement this focus by applying a reweighting function to $p_\text{fast}$ and $p_\text{slow}$ before computing learning progress. The reweighting function we use magnifies learning progress in tasks with low success probability at the expense of learning progress in tasks with high success probability. However, tasks without learning progress are still mapped to zero learning progress, no matter the success probability (see Appendix A.2 for details). In the fictional example above, reweighted learning progress is given by the blue curve in Figure 5, right. The figure illustrates that this reweighting can also compensate for the temporal delay that we observed with the regular learning progress estimate, because success probabilities will be lowest right when the agent starts to learn a particular task. Reweighting is applied in the same way for bidirectional and unidirectional learning progress.
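The exact reweighting function is given in Appendix A.2 (not reproduced here); the sketch below uses a simple stand-in warp with the two stated properties, namely that equal fast and slow success probabilities still map to zero learning progress, and that a given change counts for more when success probability is low.

import numpy as np

def warp(p, c=0.3):
    # Hypothetical monotone warp (steep near p = 0, flat near p = 1); c is an assumed constant.
    return p / (p + c)

def reweighted_learning_progress(p_fast, p_slow, bidirectional=True, c=0.3):
    # Apply the warp to both estimates before taking their difference, so that
    # p_fast == p_slow (no learning progress) still yields exactly zero.
    diff = warp(p_fast, c) - warp(p_slow, c)
    return np.abs(diff) if bidirectional else np.maximum(diff, 0.0)

Because the warp is steeper at low success probabilities, the same raw difference between $p_\text{fast}$ and $p_\text{slow}$ yields a larger reweighted value for a task the agent is just beginning to learn.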



As a last step, we use a sampling function that focuses 90% of the sampling weight on roughly the 20% of tasks with the largest reweighted learning progress. This sampling function was designed to strike a balance between focusing only on the tasks with the highest learning progress, which can cause catastrophic forgetting of all other tasks, and uniformly sampling all tasks, which would waste a lot of computation on tasks where no learning progress is observed (see Appendix A.3).
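A step-function version of such a sampling rule, written in Python, is shown below; our actual sampling function (Appendix A.3) is smoother, so this is only a sketch of the stated 90%/20% behavior.

import numpy as np

def task_sampling_probs(reweighted_lp, top_frac=0.2, top_mass=0.9):
    # Put ~90% of the probability mass on the ~20% of tasks with the largest
    # reweighted learning progress; spread the remaining 10% over all other tasks.
    n = len(reweighted_lp)
    k = max(1, int(round(top_frac * n)))
    probs = np.full(n, (1.0 - top_mass) / max(n - k, 1))
    top = np.argsort(reweighted_lp)[-k:]
    probs[top] = top_mass / k
    return probs / probs.sum()

# Example: sample the next Simon Says goal from the curriculum distribution.
# goal = np.random.choice(len(lp), p=task_sampling_probs(lp))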



In conclusion, by taking the difference between a fast and a slow estimate of the measured success probability we obtain an accurate, but delayed, estimate of learning progress (Figure 5, yellow). By reweighting this estimate towards tasks with low success probability we compensate for the delay and put the majority of focus on exactly those tasks that are currently learnable, but have not been learned yet (Figure 5, blue). Taking 90% of our samples from the tasks that score within the top 20% of this reweighted learning-progress metric thus ensures that our curriculum always pushes on the frontier of the currently learnable tasks that have not been learned (well) yet.



2.3 Curricula over goal-conditioned tasks



A key requirement of curriculum learning (at least using current RL methods) is that the agent achieves a small success probability (within available/reasonable compute) on a new task after mastering a previous task. However, if tasks differ by goals that are given as observations to the agent, the agent may not know how to interpret the new goal observation and may just display random behavior instead of the behavior of a prerequisite task. In addition, obtaining useful feedback in a goal-conditioned setup is much harder than learning unconditional tasks because the agent only experiences a positive reward for completing the new task in episodes where the goal matches the task. The success probability of the new task is therefore suppressed by the background probability of sampling the corresponding goal, which makes the task difficult to learn.



We can encourage the discovery of new goals by combining the goal-conditioned main reward with a goal-independent exploration bonus for all of the curriculum tasks, even if they are unrelated to the current goal. This exploration bonus helps the agent to explore the environment when given an unknown goal, thus increasing the chances of success. In our Minecraft experiment, where a new goal corresponds to collecting a new item, for each item in the set of potential goal items we provide a reward the first few times the agent collects that item in an episode, regardless of the currently selected goal. Specifically, the agent receives a reward of $0.5^N$ for the $N$-th collection of the same item, i.e. the reward decays with a factor of 0.5 for each subsequent collection. In addition, the agent only receives a reward for a subsequent collection of item $X$ if $N$ is larger than the number of items $X$ in its inventory at any previous time during the episode (this constraint prevents the agent from simply racking up reward by dropping and picking up items). The exploration bonus therefore incentivizes the agent to constantly collect or craft new items that it hasn't held previously during the episode. This idea of encouraging an agent to do as many new, different things as possible within an episode is similar to previous work on Curiosity Search [stanton2016curiosity; stanton2018deep].
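A minimal Python sketch of this within-episode bonus is given below (the per-step bookkeeping and variable names are our assumptions; only the $0.5^N$ decay and the "exceeds the previous within-episode maximum" rule come from the description above).

class EpisodeExplorationBonus:
    """Goal-independent bonus for collecting items the agent has not held before this episode."""

    def __init__(self, exploration_set):
        self.exploration_set = set(exploration_set)  # items currently eligible for the bonus
        self.max_held = {}                           # item -> max count held so far this episode

    def reset(self):
        self.max_held = {}

    def step(self, inventory):
        # inventory: dict mapping item name -> current count in the agent's inventory.
        bonus = 0.0
        for item in self.exploration_set:
            count = inventory.get(item, 0)
            prev_max = self.max_held.get(item, 0)
            # The N-th collection is worth 0.5**N, and only counts if it pushes the item
            # count above the previous within-episode maximum (no drop/pick-up farming).
            for n in range(prev_max + 1, count + 1):
                bonus += 0.5 ** n
            if count > prev_max:
                self.max_held[item] = count
        # Scaled by the exploration-bonus coefficient before being added to the main reward.
        return bonus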



The exploration bonus is combined with the main reward by simply summing the two rewards, meaning they have to be carefully balanced: the agent may favor the exploration bonus over the main reward if the exploration bonus is too high or not gain any benefits from the exploration bonus when it is too low. In addition, this balance will have to change as the curriculum advances; at the start of training it is fine for the agent to get an exploration bonus for all of the easy tasks, but as the curriculum moves towards harder tasks it becomes increasingly likely that the agent will spend most of the limited time it has per episode collecting reward by obtaining the exploration bonus for all easy items rather than attempting to complete the hard task selected by the curriculum. In our experiments, this balance is maintained across training by a second curriculum that successively removes already-learned goals from the exploration bonus. We call this curriculum-based exploration bonus the "dynamic exploration bonus".



In our Minecraft experiment, we implement the dynamic exploration bonus in the following way: Using an exponential moving average (EMA), we keep track of the individual success probabilities in the main, goal-conditioned task. Only items for which the agent has a success probability smaller than 0.1 are included in the set of items rewarded by the exploration bonus (called the "exploration set"). Thus, as the agent improves its performance over training, items are successively removed from the exploration set, but are added back if the performance on the corresponding task drops below 0.1 again. The dynamic exploration bonus can be seen as an implementation of across-training diversity pressure similar to previous work on Novelty Search [lehman2011abandoning] and Intrinsic Motivation [strehl2008analysis].
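A sketch of this bookkeeping in Python is shown below; the 0.1 threshold comes from the text, while the EMA coefficient and the interface are assumptions.

class DynamicExplorationSet:
    """Maintains the set of items still rewarded by the exploration bonus."""

    def __init__(self, all_items, alpha=0.01, threshold=0.1):
        self.alpha = alpha                                    # assumed EMA coefficient
        self.threshold = threshold                            # items above this success rate get no bonus
        self.success_ema = {item: 0.0 for item in all_items}  # goal-conditioned success estimates

    def update(self, goal_item, success):
        # Track success on the main, goal-conditioned Simon Says task.
        ema = self.success_ema[goal_item]
        self.success_ema[goal_item] = ema + self.alpha * (float(success) - ema)

    def exploration_set(self):
        # Items are removed once learned and re-added if performance drops below the threshold again.
        return {item for item, p in self.success_ema.items() if p < self.threshold}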



2.4 "Simon Says" task



We study curriculum learning on a set of goal-conditioned Minecraft tasks, in which the agent is tasked to collect one out of a set of 107 items from the Minecraft tech tree (Figure 4).[4] Some items (such as "dirt") can be obtained very easily, while other items (such as "diamond") are rare and also require first obtaining many other resources and crafting the required tools. The agent observes the target item as a one-hot encoding. It has 5 minutes (1500 time steps) to complete the task and obtains a reward of +1 upon success. After each success or failure a new task is selected without resetting the world or respawning the agent. Tasks may be sampled uniformly or from a distribution that is determined by a curriculum.

[4] In addition to the 107 items displayed in Figure 4, the target set also contains 6 additional items that we later realized were impossible for the agent to obtain.



The maximum episode length is 30 minutes (9000 time steps), but the episode is terminated prematurely if the agent has failed two consecutive tasks or dies of other causes, such as drowning, falling off a cliff or falling in lava. After an episode ends, a new episode is started in a new environment, but the agent has a 50% chance to retain the inventory from the end of the previous episode. In preliminary experiments, we found that this "inventory inheritance" leads to slightly faster learning, as the agent does not always have to gather all the necessary prerequisite items from scratch when trying to learn how to obtain difficult items deep in the tech tree. Because each run was computationally expensive (21d on 32 GPUs) we only plot one run per treatment, but we found inter-run variation to be low.
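To summarize the task and episode mechanics described in this section, the following pseudocode-style Python sketch lays out one episode; the env and agent interfaces (reset, step, goal_obtained, etc.) are placeholders we assume, while the step limits, the two-consecutive-failure rule and the 50% inventory inheritance follow the text.

import random

TASK_STEPS = 1500      # 5 minutes per Simon Says task
EPISODE_STEPS = 9000   # 30 minutes per episode

def run_episode(env, agent, sample_goal, inherited_inventory=None):
    obs = env.reset(inventory=inherited_inventory)
    steps, consecutive_failures = 0, 0
    while steps < EPISODE_STEPS and consecutive_failures < 2 and not env.agent_is_dead():
        goal = sample_goal()               # uniform or drawn from the curriculum distribution
        success = False
        for _ in range(TASK_STEPS):
            obs, done = env.step(agent.act(obs, goal))   # goal is observed as a one-hot encoding
            steps += 1
            if env.goal_obtained(goal):
                success = True             # main reward of +1; next task starts without a world reset
                break
            if done or steps >= EPISODE_STEPS:
                break
        consecutive_failures = 0 if success else consecutive_failures + 1
    # The next episode starts in a fresh world; half the time the inventory carries over.
    return env.inventory() if random.random() < 0.5 else None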



3 Results: Evaluation on Minecraft "Simon Says"



3.1 Uniform sampling without exploration bonus



In the standard Simon Says task, the agent only experiences a positive reward for obtaining an item if the item corresponds to the goal of the task. This makes all but the easiest tasks difficult to learn, because the success probability of the task is suppressed by the probability of sampling the corresponding goal and because there is no other exploration bonus. As expected, the results with this method are poor: the agent learns only 17 out of 107 tasks (Figure 3, red line). Note that we say that the agent has "learned" a task if it achieves a success probability of at least 5%. Worse, the plot shows that learning plateaus early and does not improve over time. The agent only discovers a subset of items that can be obtained on the surface, such as "dirt", "sapling" and a number of wooden tools (Figure 2, 1st from left).



3.2 Uniform sampling with fixed exploration bonus



To support exploration of harder tasks, we add the exploration bonus over all items in the Simon Says 107 set. The exploration bonus is added to the main Simon Says reward with a coefficient that was tuned to each condition. For uniform sampling without curriculum we found in preliminary experiments that a coefficient of 0.05 performs best.



Adding the exploration bonus increases the number of learned items at convergence from 17 to 43 (see Figure 3, dotted green line). The agent discovers almost all surface items, learns to mine underground, learns to craft stone tools, and even discovers how to create a few iron items such as "iron ingot", "heavy weighted pressure plate", "tripwire hook" and "iron shovel" (Figure 2, 2nd from left).



3.3 Uniform sampling with dynamic exploration bonus



However, while the exploration bonus helps the agent solve a larger number of tasks, it can actually make it harder to learn to collect items that are deeper in the tech tree. The reason is that the exploration bonus equally rewards items that the agent already knows how to obtain (but has not yet obtained this episode) and items that the agent still needs to learn how to get. When given a hard-to-obtain goal that the agent has not learned yet, it may be more rewarding to collect easy-to-obtain, previously learned items in order to collect their exploration bonus, therefore "distracting" the agent from the task it is currently supposed to complete. One straightforward solution to this problem is to provide an exploration bonus only for those items that the agent does not yet reliably know how to obtain. This allows us to include the exploration bonus only when it is useful, namely for learning how to obtain new items, without distracting the agent.



The "dynamic exploration bonus" implements exactly this idea by removing items whose goal-conditioned success probability in the main Simon-Says task is larger than 0.1.



As we only give an exploration bonus for hard items that the agent rarely gets, we can increase the coefficient of the exploration bonus without increasing the extent to which the agent gets distracted by easy items. In preliminary experiments we found that a coefficient of 0.5 performed best.



The dynamic exploration bonus further increases the number of learned items at convergence to about 70 (see Figure 3, dashed yellow line). From within the target set, the agent discovers all surface items, all stone items, most iron items and most redstone items (Figure 2, 3rd from left).



Conceptually, the dynamic exploration bonus and the conditional Simon Says reward interleave in the following way: At first, the exploration bonus incentivizes the agent to learn how to obtain an item unconditionally, i.e. irrespective of the current Simon Says task. As the unconditional success probability increases, so does the conditional success probability, that is, the success probability when the new item corresponds to the goal. Once the conditional success probability surpasses 0.1, the item is removed from the exploration set and is only rewarded through the main Simon Says reward (i.e. only when given the task where the item corresponds to the goal). The agent can then solidify the association between the observed Simon Says task label and the task.



3.4 Learning progress curriculum



In this treatment we sample Simon Says tasks using the learning progress curriculum instead of uniform sampling. As in the previous section, we remove easy items from the exploration set using the dynamic exploration bonus and we again set the overall coefficient of the exploration bonus to 0.5. The learning progress curriculum improves task learning by two mechanisms: First, by focusing on Simon Says tasks with the largest learning progress it supports learning the conditional association between task label and task. Second, the curriculum interacts with the exploration bonus to facilitate the unconditional discovery of novel items. If learning progress is made on an item such that the curriculum frequently samples it, the agent is likely to obtain it more often and, if that item is a prerequisite for any dependent items, the agent is now more frequently in a position where it actually has this prerequisite item and is thus able to collect the dependent items.



We tested both the bidirectional and unidirectional versions of the learning progress curriculum (described above). With the bidirectional learning progress curriculum, the agent discovers 82 items (see Figure 3, dotted blue line). The agent discovers all surface items, all stone items, most iron items, most redstone items, half of the golden items and a few diamond items (Figure 2, 2nd from right). With the unidirectional learning progress curriculum, the agent discovers 79 items (see Figure 3, solid light blue line and Figure 2, 1st from right), which is almost as many as the bidirectional learning progress treatment, but training in the unidirectional treatment is unstable because of many cycles of forgetting and rediscovery (see next section). Both curricula accurately sample items for which success probability changes or increases the most (compare Figure 2, 1st and 2nd from right with Figure 3, 1st and 2nd from right).



With both the bidirectional and unidirectional learning-progress curricula, the interaction between the dynamic exploration bonus and the conditional Simon Says reward is similar to the interaction between the dynamic exploration bonus and uniform sampling. However, with the learning progress curricula the learning of the conditional association (performing the task when asked) is more focused and accelerated, because the learning progress curriculum detects the increase in conditional success probability and consequently focuses on that task, which means there will be many more rollouts where the task is also the current goal and thus many more positive examples from which the agent can learn the association.



3.5 Mitigating catastrophic forgetting by tracking bidirectional learning progress



A curious phenomenon of the unidirectional learning progress curriculum is that it goes through cycles in which the number of discovered items drops sharply before it recovers again. These cycles are caused by catastrophic forgetting owing to correlations in learning progress between groups of items. As an example, let us consider a case where, after having discovered underground items, the agent improves its success probability for a surface item such as "sapling". The unidirectional learning progress curriculum samples "sapling" more often, which causes the agent to spend more time at the surface, which in turn causes the agent to also improve its success probability for other surface items (they are easier because the agent is already on the surface), creating a positive feedback loop. Meanwhile, success probabilities for underground items decrease because the agent spends less time underground and thus forgets (catastrophically) how to perform such tasks. The bidirectional learning progress curriculum would immediately increase its sampling of underground items in order to prevent those tasks from being forgotten, which would prevent catastrophic forgetting and thus cycles from appearing. In contrast, the unidirectional learning progress curriculum does not increase the sampling of underground items when their success probabilities are decreasing. As a consequence, it enters a period where it exclusively focuses on surface items, and generally these periods last long enough for the agent to almost completely forget how to obtain underground items when asked. Since only 24 out of the 107 potential goals are surface items, this causes a large drop in the number of discovered items. However, after about 1000-2000 optimizer steps, the curriculum notices positive learning progress on the underground items again, allowing it to rediscover underground items, and the number of discovered items recovers (Figure 3, solid light blue line). Interestingly, much of the ability to perform these skills is still latent in the network, as its performance recovery is swift.



Neural networks in general and RL agents in particular are subject to catastrophic forgetting if the data distribution changes during learning. The easiest method to mitigate catastrophic forgetting is to sample the data i.i.d. (i.e. uniform sampling of discrete tasks), whereas a curriculum might cause the agent to forget early tasks.



The success of the bidirectional learning progress curriculum shows that monitoring previous tasks and resampling them if their performance drops can be a powerful method to mitigate catastrophic forgetting. As shown in Figure 3, 2nd from right, only sporadic resampling of old tasks is sufficient, which is much more compute efficient and scales better than i.i.d. sampling of all previously learned tasks.



4 Related work



Automated curriculum learning. There is an extensive literature on training neural networks with a curriculum [elman1993learning]; see [bengio2009curriculum] for an overview. More recently, automated curricula have been studied extensively in the context of RL [sukhbaatar2018learning; florensa2018automatic; matiisen2019teacher; portelas2020teacher; wang2019paired; wang2020enhanced; openAI2019solving; zhang2020automatic; campero2020learning; dennis2020emergent; gur2021adversarial; openai2021asymmetric]. Generally, tasks are selected based on success probability or reward thresholds [wang2019paired; wang2020enhanced; openAI2019solving; campero2020learning] or regret [dennis2020emergent; gur2021adversarial]. Static-threshold-based methods present an intuitive starting point for a curriculum, but have the downside that they are difficult or even impossible to tune, as discussed previously (Sec. 2.1). Regret-based methods calculate per-task regret by subtracting the average return over several rollouts on that task from the maximum (known) return on that task, and then preferentially select tasks where regret is high, the theory being that there is still a lot left to learn on such tasks [dennis2020emergent; gur2021adversarial]. In the presence of environmental stochasticity, this scheme may select more stochastic, less learnable tasks at the expense of less stochastic, more learnable tasks, because the maximum return may have been obtained under particularly lucky conditions, such that the average return will always be much lower despite there being nothing left to learn. Learning-progress-based curricula, like the method in this paper, have the potential to address these issues, as discussed next.



Learning progress was first proposed as a curiosity signal that incentivizes agents to explore novel states in stochastic environments [schmidhuber1991curious; oudeyer2007intrinsic; kim2020active]. Later, learning progress was used as a measure to select tasks in a curriculum [graves2017automated; matiisen2019teacher; portelas2020teacher]. The novel contributions of our work are to systematically study how learning progress can be measured reliably, to apply learning progress curricula to hard problems that require RL at scale, and to show how learning progress curricula over goals can be combined with a dynamic, goal-independent exploration bonus.



Curiosity Search. The static exploration bonus we examined incentivizes the agent to obtain items that it has not obtained in the current episode, and is thus a method for encouraging within-episode exploration. Within-episode exploration has previously been explored in the Curiosity Search work of [stanton2016curiosity; stanton2018deep], who demonstrated that it effectively encourages agents to explore their environment, though they also demonstrated the downside of having to explore the entire environment in every episode when trying to perform deep exploration.



Intrinsic motivation. The dynamic exploration bonus, on the other hand, changes over training and encourages the agent to obtain different items, not within a single episode, but across episodes. Exploration across episodes has been extensively studied in the form of count-based exploration (e.g. [strehl2008analysis]), where the algorithm tracks the number of times each state has been visited over training and provides a bonus reward for visiting each state that is inversely proportional to the number of times that state has been visited. [bellemare2016unifying] adapted count-based exploration to work in state spaces that are too large to keep individual counts, and demonstrated some success in deeply exploring sparse-reward environments. However, later work [ecoffet2019go; ecoffet2021first] hypothesized that unconditional count-based exploration methods can suffer from detachment, where the agent consumes all the intrinsic reward towards one or more states and forgets how to return to those states, and derailment, where the exploratory mechanism of the algorithm can prevent the agent from returning to a previously visited state. Our bidirectional learning-progress curriculum avoids detachment by immediately increasing the sampling rate of any goal whose success probability is declining, thus reducing the probability that the agent forgets how to visit a previous "state", as well as by reintroducing items into the exploration bonus reward set if their success probability ever drops too low, thus ensuring that the agent can always recover. The learning-progress curriculum does not address derailment as explicitly, but the underlying dynamic exploration reward does in effect reduce derailment by removing learned items from the exploration bonus: while the agent is in a state where it does not have the necessary prerequisites to obtain any of the items still in the exploration bonus reward set, it is incentivized to first obtain those prerequisites (without taking exploratory actions), before focusing on exploratory actions again.



5 Discussion, Conclusion, and Future Work



We have introduced a learning-progress curriculum with a dynamic exploration bonus that adapts to the current skill level of the agent. Experiments were conducted on the Minecraft "Simon Says" tasks. We first showed that uniform sampling with no exploration bonus performs poorly, learning only 17 tasks and plateauing at that level, with additional training producing no new learning. We then showed that combining the main goal-dependent reward with a static, goal-independent exploration bonus increases the number of learned tasks from 17 to 43. Dynamically adjusting the exploration bonus to only include tasks with low success probability further increases the number of learned tasks to 70. Sampling tasks using the bidirectional learning progress curriculum instead of uniform sampling further increases the number of solved tasks to 82. Moreover, the sampling of tasks elegantly follows the learning curve of the agent, focusing learning on the frontier of the agent's skill set as that skill set expands. In addition, the bidirectional learning progress curriculum, which tracks tasks with both improvements and deteriorations in performance, effectively mitigates the catastrophic forgetting that we see in the unidirectional learning-progress curriculum by resampling tasks that are at risk of being forgotten.



There are various ways in which the learning progress curriculum could be expanded in future work. First, while the current method was designed under the assumption that explicit task labels are available, it could be made more general by developing methods that dynamically group tasks into clusters over which learning progress is averaged. Second, if the number of tasks becomes too large, it becomes computationally expensive to faithfully estimate learning progress for each task. A promising future direction would be to estimate learning progress over large task spaces using function approximation or, relatedly, to generate environments that are expected to feature high learning progress via a neural network environment generator.



Acknowledgements. We thank Ilge Akkaya, Bob McGrew, Reiichiro Nakano, Matthias Plappert and John Schulman for discussions, support and feedback on this manuscript.