Optionally, if you're the sort of person who enjoys mathematical theory, study up on the math of monotonic improvement theory (which forms the basis for advanced policy gradient algorithms), or classical RL algorithms (which, despite being superseded by deep RL algorithms, contain valuable insights that sometimes drive new research). Projects in this frame have a broad scope and can go on for a while (several months to a year-plus). Don't overfit to existing implementations either. There are a wide range of topics you might find interesting: sample efficiency, exploration, transfer learning, hierarchy, memory, model-based RL, meta learning, and multi-agent, to name a few. Broken RL code almost always fails silently: the code appears to run fine, except that the agent never learns how to solve the task. To debug your implementations, try them with simple environments where learning should happen quickly, like CartPole-v0, InvertedPendulum-v0, FrozenLake-v0, and HalfCheetah-v2 (with a short time horizon—only 100 or 250 steps instead of the full 1000) from the OpenAI Gym. But what does that mean, and how well does it have to work to be important? We've designed Spinning Up to help people learn to use these technologies and to develop intuitions about them. We favor clarity over modularity—code reuse between implementations is strictly limited to logging and parallelization utilities. Bad hyperparameters can significantly degrade RL performance, but if you're using hyperparameters similar to the ones in papers and standard implementations, those will probably not be the issue. Get comfortable with the main concepts and terminology in RL. Let's start with the setup of “spinning up” on FloydHub, and then we'll try to use it for our task. Deep RL refers to the combination of RL with deep learning. 
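Since broken RL code fails silently, it helps to sanity-check an implementation on a problem even simpler than the Gym environments above before scaling up. Below is a minimal sketch of that habit: tabular Q-learning on a hand-rolled five-state corridor (a hypothetical stand-in for something like FrozenLake-v0, not code from Spinning Up); if the value estimates don't converge here within a couple hundred episodes, something is wrong.

```python
import random

random.seed(0)

# A tiny hand-rolled corridor (hypothetical stand-in for a Gym task):
# states 0..4, actions 0 (left) and 1 (right);
# reaching state 4 ends the episode with reward 1.
N_STATES, GOAL = 5, 4

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    done = (s2 == GOAL)
    return s2, (1.0 if done else 0.0), done

def greedy(qs):
    best = max(qs)
    return random.choice([a for a, q in enumerate(qs) if q == best])  # random tie-break

Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.5, 0.9, 0.2    # illustrative hyperparameters

for _ in range(200):                 # learning should happen quickly here
    s, done = 0, False
    while not done:
        a = random.randrange(2) if random.random() < eps else greedy(Q[s])
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2
```

After training, `Q[s][1]` (move right) should dominate in every state, with `Q[3][1]` near 1.0; if it doesn't, the bug hunt can start on thirty lines of code instead of a full pipeline.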
Approaches to idea-generation: There are many different ways to start thinking about ideas for projects, and the frame you choose influences how the project might evolve and what risks it will face. I personally like to look at the mean/std/min/max for cumulative rewards, episode lengths, and value function estimates, along with the losses for the objectives, and the details of any exploration parameters (like mean entropy for stochastic policy optimization, or current epsilon for epsilon-greedy as in DQN). Know about standard architectures (MLP, vanilla RNN, LSTM (also see this blog), GRU, conv layers, resnets, attention mechanisms), common regularizers (weight decay, dropout), normalization (batch norm, layer norm, weight norm), and optimizers (SGD, momentum SGD, Adam, others). If you implement your baseline from scratch—as opposed to comparing against another paper's numbers directly—it's important to spend as much time tuning your baseline as you spend tuning your own algorithm. Find a paper that you enjoy on one of these subjects—something that inspires you—and read it thoroughly. For example, if you're investigating architecture variants, keep the number of model parameters approximately equal between your model and the baseline. Also, watch videos of your agent's performance every now and then; this will give you some insights you wouldn't get otherwise. What to look for in papers: When implementing an algorithm based on a paper, scour that paper, especially the ablation analyses and supplementary material (where available). For example: achieving perfect generalization from training levels to test levels in the Sonic domain or Gym Retro. And know that if you're starting from scratch, the learning curve is incredibly steep. But don't overfit to paper details. The goal of this column is to help you get past the initial hurdle, and give you a clear sense of how to spin up as a deep RL researcher. 
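Tracking mean/std/min/max for diagnostics like cumulative rewards and episode lengths can be as simple as a small helper run once per epoch. A minimal sketch in plain Python (the helper name and output format are made up for illustration, not Spinning Up's actual logger):

```python
import statistics

def log_stats(name, xs):
    """Print mean/std/min/max for one diagnostic (e.g. episode returns or lengths)."""
    print(f"{name:>8} | mean {statistics.mean(xs):8.2f}"
          f" | std {statistics.pstdev(xs):8.2f}"
          f" | min {min(xs):8.2f} | max {max(xs):8.2f}")

# e.g. after collecting a batch of episodes:
ep_returns = [12.0, 30.5, 18.0, 25.0]
ep_lengths = [200, 174, 210, 188]
log_stats("EpRet", ep_returns)
log_stats("EpLen", ep_lengths)
```

The same helper would apply to value-function estimates, objective losses, and exploration parameters such as policy entropy or the current epsilon.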
Instead of thinking about existing methods or current grand challenges, think of an entirely different conceptual problem that hasn't been studied yet. You know that it's hard and it doesn't always work. For example, the original DDPG paper suggests a complex neural network architecture and initialization scheme, as well as batch normalization. Any method you propose is likely to have several key design decisions—like architecture choices or regularization techniques, for instance—each of which could separately impact performance. As another example, the original A3C paper uses asynchronous updates from the various actor-learners, but it turns out that synchronous updates work about as well. Iterate fast in simple environments. To get there, you'll need an idea for a project. For our first partnership, we're working with the Center for Human-Compatible AI (CHAI) at the University of California at Berkeley to run a workshop on deep RL in early 2019, similar to the planned Spinning Up workshop at OpenAI. You don't need to know how to do everything, but you should feel pretty confident in implementing a simple program to do supervised learning. Spinning Up should also work on Windows, though. By systematically evaluating what would happen if you were to swap them out with alternate design choices, or remove them entirely, you can figure out how to correctly attribute credit for the benefits your method confers. Usually the problem is that something is being calculated with the wrong equation, or on the wrong distribution, or data is being piped into the wrong place. At OpenAI, we believe that deep learning generally—and deep reinforcement learning specifically—will play central roles in the development of powerful AI technology. 
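Systematically evaluating design decisions amounts to running an ablation grid: enumerate every combination of the choices and score each one. A sketch of that bookkeeping, where the choice names are hypothetical and `evaluate` is a cheap stand-in for what would really be a full training run:

```python
from itertools import product

# Hypothetical design decisions to ablate (names are illustrative):
choices = {
    "batch_norm": [True, False],
    "target_net": [True, False],
    "n_step":     [1, 3],
}

def evaluate(cfg):
    # Stand-in score for a full training run; in practice,
    # train with this config and average over several random seeds.
    return cfg["n_step"] + (1.0 if cfg["target_net"] else 0.0)

results = []
for values in product(*choices.values()):
    cfg = dict(zip(choices, values))
    results.append((cfg, evaluate(cfg)))

results.sort(key=lambda x: -x[1])  # best configuration first
```

Comparing the sorted scores against the full method's score is what lets you attribute credit: a choice whose removal barely moves the number probably isn't where the benefit comes from.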
We’re also going to work with other organizations to help us educate people using these materials. Because projects like these are tied to existing methods, they are by nature narrowly scoped and can wrap up quickly (a few months), which may be desirable (especially when starting out as a researcher). The simplest versions of all of these can be written in just a few hundred lines of code (ballpark 250-300), and some of them even less (for example, a no-frills version of VPG can be written in about 80 lines). You should organize your efforts so that you implement the simplest algorithms first, and only gradually introduce complexity. This is the incrementalist angle, where you try to get performance gains in an established problem setting by tweaking an existing algorithm. We've had so many people ask for guidance in learning RL from scratch that we've decided to formalize the informal advice we've been giving. We were inspired to build Spinning Up through our work with the OpenAI Scholars and Fellows initiatives, where we observed that it's possible for people with little-to-no experience in machine learning to rapidly ramp up as practitioners, if the right guidance and resources are available to them. Keep these habits! This is one of the hardest parts of research in deep RL. Measure everything. It is possible for a novice to approach this kind of problem, but there will be a steeper learning curve. On November 8th, OpenAI released their educational package to let anyone get started in Deep RL. For projects along these lines, a standard benchmark probably doesn't exist yet, and you will have to design one. It can be pretty disheartening to get halfway through a project, and only then discover that there's already a paper about your idea.
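To give a flavor of how little code a no-frills policy gradient needs, here is an even smaller toy than the ~80-line VPG mentioned above: the REINFORCE update for a softmax policy on a two-armed bandit, in plain Python. Everything here (arm rewards, learning rate, step count) is illustrative, not Spinning Up's implementation.

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [0.0, 0.0]   # one logit per arm; the policy is softmax(logits)
lr = 0.1

for _ in range(500):
    probs = softmax(logits)
    a = random.choices([0, 1], weights=probs)[0]
    r = 1.0 if a == 1 else -1.0          # arm 1 is the better arm (by construction)
    # REINFORCE: move each logit along r * d/d_logit log pi(a),
    # where d/d_logit_i log pi(a) = 1[i == a] - probs[i] for a softmax policy.
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * r * grad

probs = softmax(logits)
```

After training, the policy should put nearly all its probability on arm 1. Simplest-first in this spirit—bandit, then VPG, then more complex algorithms—keeps each new layer of complexity debuggable.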