Experience Stack Reinforcement Learning for Off-Policy Control

Reynolds, Stuart
Cognitive Science Technical Report CSRP-02-1, School of Computer Science, The University of Birmingham, Birmingham, B15 2TT, UK. January 2002.

Abstract: This paper introduces a novel method that allows backwards replay to be applied as an online learning algorithm. The general technique can be adapted to provide analogues of most existing eligibility trace algorithms. The new method is as computationally cheap as current techniques but, because it directly employs lambda-return estimates in its value updates, is significantly simpler. The paper concentrates on multi-step off-policy control methods (such as Watkins' Q(lambda)) as a theoretically and practically important class of algorithms that is underutilised in practice. Experimental results show improvements over existing eligibility trace methods across a wide range of parameter settings, and also highlight the importance of the initial Q-function to the performance of several reinforcement learning algorithms.
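For context, below is a minimal sketch of tabular Watkins' Q(lambda), the off-policy eligibility trace method the abstract names as the main point of comparison. This is not the report's experience stack algorithm; the environment interface (reset() returning a state index, step(a) returning (next_state, reward, done)) and all parameter values are illustrative assumptions. Watkins' variant cuts the traces to zero whenever an exploratory (non-greedy) action is taken, because subsequent rewards then no longer estimate the greedy policy's lambda-return, R_t^lambda = (1 - lambda) * sum_{n>=1} lambda^(n-1) * R_t^(n).

    import numpy as np

    def watkins_q_lambda(env, n_states, n_actions, episodes=500,
                         alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1,
                         q_init=0.0, seed=0):
        # Tabular Watkins' Q(lambda) with accumulating eligibility traces.
        # q_init is exposed because, as the abstract notes, the initial
        # Q-function can strongly affect performance.
        rng = np.random.default_rng(seed)
        Q = np.full((n_states, n_actions), q_init, dtype=float)

        def epsilon_greedy(s):
            if rng.random() < epsilon:
                return int(rng.integers(n_actions))
            return int(np.argmax(Q[s]))

        for _ in range(episodes):
            e = np.zeros_like(Q)              # eligibility traces
            s = env.reset()
            a = epsilon_greedy(s)
            done = False
            while not done:
                s2, r, done = env.step(a)
                a2 = epsilon_greedy(s2)       # action actually taken next
                greedy_value = 0.0 if done else float(Q[s2].max())
                # One-step TD error toward the greedy successor value.
                delta = r + gamma * greedy_value - Q[s, a]
                e[s, a] += 1.0                # accumulating trace
                Q += alpha * delta * e
                # Watkins' rule: decay traces only while behaviour stays
                # greedy; an exploratory action cuts all traces to zero.
                if not done and Q[s2, a2] == Q[s2].max():
                    e *= gamma * lam
                else:
                    e[:] = 0.0
                s, a = s2, a2
        return Q

The report's contribution is to obtain the effect of such multi-step updates by backwards replay instead, using lambda-return estimates directly in the value updates while remaining comparable in computational cost.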