Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution
Proceedings of the 39th International Conference on Machine Learning
Authors: Vihang P. Patil, Markus Hofmarcher, Marius-Constantin Dinu, Matthias Dorfer, Patrick M. Blies, Johannes Brandstetter, Jose A. Arjona-Medina, Sepp Hochreiter
Affiliations: ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; Institute of Advanced Research in Artificial Intelligence (IARAI); enliteAI, Vienna, Austria

ABSTRACT
Reinforcement Learning algorithms require a large number of samples to solve complex tasks with sparse and delayed rewards. Complex tasks can often be hierarchically decomposed into sub-tasks. A step in the Q-function can be associated with solving a sub-task, where the expectation of the return increases. RUDDER has been introduced to identify these steps and then redistribute reward to them, thus immediately giving reward if sub-tasks are solved. Since the problem of delayed rewards is mitigated, learning is considerably sped up. However, for complex tasks, current exploration strategies as deployed in RUDDER struggle with discovering episodes with high rewards. Therefore, we assume that episodes with high rewards are given as demonstrations and do not have to be discovered by exploration. Typically the number of demonstrations is small and RUDDER's LSTM model as a deep learning method does not learn well. Hence, we introduce Align-RUDDER, which is RUDDER with two major modifications. First, Align-RUDDER assumes that episodes with high rewards are given as demonstrations, replacing RUDDER's safe exploration and lessons replay buffer. Second, we replace RUDDER's LSTM model by a profile model that is obtained from multiple sequence alignment of demonstrations. Profile models can be constructed from as few as two demonstrations as known from bioinformatics. Align-RUDDER inherits the concept of reward redistribution, which considerably reduces the delay of rewards, thus speeding up learning. Align-RUDDER outperforms competitors on complex artificial tasks with delayed reward and few demonstrations. On the MineCraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently.
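
The idea of redistributing a delayed return with a profile model can be illustrated with a small sketch. This is not the authors' implementation. It assumes the demonstrations are already aligned, have equal length, and contain no gaps, so the multiple sequence alignment step is skipped entirely. The function names `build_profile` and `redistribute`, the pseudocount, and the uniform background distribution are all hypothetical choices made for illustration. Each step is rewarded in proportion to how much it increases the episode's score against the profile, and the rewards are scaled to sum to the original return.

```python
import math
from collections import Counter


def build_profile(demos, alphabet, pseudocount=1.0):
    """Build position-specific log-odds scores from demonstrations.

    Assumption: demos are equal-length event sequences that are already
    aligned (no gaps), which replaces a real multiple sequence alignment.
    """
    length = len(demos[0])
    assert all(len(d) == length for d in demos), "demos must be pre-aligned"
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    total = len(demos) + pseudocount * len(alphabet)
    profile = []
    for pos in range(length):
        counts = Counter(d[pos] for d in demos)
        profile.append({
            event: math.log(((counts[event] + pseudocount) / total) / background)
            for event in alphabet
        })
    return profile


def redistribute(episode, profile, episode_return):
    """Spread episode_return over steps by differences of prefix scores."""
    prefix_scores = [0.0]
    for t, event in enumerate(episode):
        # Steps beyond the profile length contribute no score in this sketch.
        gain = profile[t][event] if t < len(profile) else 0.0
        prefix_scores.append(prefix_scores[-1] + gain)
    diffs = [b - a for a, b in zip(prefix_scores, prefix_scores[1:])]
    total = sum(diffs)
    if abs(total) < 1e-12:
        # Degenerate case: fall back to a uniform split of the return.
        return [episode_return / len(episode)] * len(episode)
    # Scale so that the redistributed rewards sum to the original return.
    return [episode_return * d / total for d in diffs]


# Hypothetical event sequences; each letter stands for one sub-task event.
demos = ["ABCD", "ABCD", "ABDD"]
profile = build_profile(demos, alphabet="ABCD")
rewards = redistribute("ABCD", profile, episode_return=1.0)
```

In this toy setting, events that all demonstrations agree on receive a larger share of the return than events at positions where the demonstrations disagree. If the total score of an episode is negative, the scaling flips the sign of every step reward, so a practical version would need to handle that case explicitly.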