Backward View and Reward Redistribution for Delayed Rewards
Language of the talk title:
English
Original conference title:
Credit assignment in Deep Learning and Deep Reinforcement Learning Workshop at ICML 2018
Language of the conference title:
English
Original abstract:
Most reinforcement learning approaches rely on a forward view to predict the expected return. Examples are Monte Carlo and temporal difference methods like SARSA or Q-learning, which estimate the action-value or state-value function, policy gradients using value or advantage functions, and Monte Carlo Tree Search. However, the forward view faces problems with probabilistic environments and with high branching factors of the state transitions, since it has to average over all possible futures. These problems become more severe for delayed rewards: the number of paths to the reward grows exponentially with the number of delay steps; the reward information must be propagated further back; averaging becomes more difficult; and the variance of many state-action value estimates increases. We suggest avoiding these problems by a backward view, in which episodes that have already been observed are analyzed. We avoid probabilities and guessing about possible futures, while identifying the key events and important states that led to a reward. The backward view allows for a reward redistribution that largely reduces the delays of the rewards while leaving the expected return of a policy unchanged. The optimal reward redistribution via a return decomposition gives immediate feedback to the agent about each executed action: if the expectation of the return increases, a positive reward is given, and if the expectation of the return decreases, a negative reward is given. We introduce RUDDER, a return decomposition method, which creates a new MDP with the same optimal policies as the original MDP but with redistributed rewards that have largely reduced delays. If the return decomposition is optimal, then the new MDP does not have delayed rewards and TD estimates are unbiased. In this case, the rewards track Q-values, so that the expected future reward is always zero. On artificial tasks with different lengths of reward delays, we show that RUDDER is exponentially faster than TD, MC, and Monte Carlo Tree Search (MCTS).
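The redistribution idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes a hypothetical return predictor whose estimate of the final return after each step is already available (`return_predictions`), and it assigns each step the difference of consecutive predictions, so credit lands on the step where the predicted return changes while the total return stays the same:

```python
def redistribute_reward(return_predictions):
    """Redistribute a delayed reward along one observed episode.

    return_predictions[t] is a (hypothetical) model's estimate of the
    episode's final return after seeing steps 0..t, with the last entry
    equal to the actual return. The redistributed reward at step t is the
    difference of consecutive predictions, so an action receives credit
    exactly when it changes the predicted return. The differences
    telescope, so the sum of redistributed rewards equals the return.
    """
    preds = [0.0] + list(return_predictions)  # baseline prediction of 0
    return [b - a for a, b in zip(preds, preds[1:])]

# Toy episode: the reward arrives only at the final step (delayed reward).
rewards = [0.0, 0.0, 0.0, 1.0]
# Suppose the predictor recognizes at step 1 that the key action
# guaranteeing the final reward has been taken.
predictions = [0.0, 1.0, 1.0, 1.0]

new_rewards = redistribute_reward(predictions)
print(new_rewards)                        # credit is moved to step 1
print(sum(new_rewards) == sum(rewards))   # expected return is unchanged
```

After redistribution, the expected future reward from step 1 onward is zero in this toy case, mirroring the abstract's claim that under an optimal decomposition the rewards track Q-values.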