Reinforcement Learning with Exploration
PhD Thesis, School of Computer Science, The University of Birmingham, B15 2TT, UK ( gzipped Postscript - 1.1MB )
Abstract: Reinforcement Learning (RL) techniques may be used to find optimal controllers for multistep decision problems where the task is to maximise some reward signal. Successful applications include backgammon, network routing and scheduling problems. In many situations it is useful to have methods that learn about one behaviour while actually following another (i.e. `off-policy' methods). For example, the learner may follow an exploring behaviour, while its goal is to learn about the optimal behaviour, and not the exploring one. Existing methods for learning in this way (namely, Q-learning and Watkins' Q(lambda)) are notoriously inefficient with their use of experience. More efficient methods exist but are either unsound (i.e. provably non-convergent to optimal solutions), or cannot be applied online. Online learning is important to let the effects of actions be quickly associated with those actions, in turn allowing later decisions to be informed of those effects.
A new algorithm is introduced to overcome these problems. It works online, without `eligibility traces', and has a naturally efficient implementation. Experiments and analysis characterise when it is likely to outperform existing related methods. New insights into the use of optimism for encouraging exploration are also discovered. It is found that standard practices can have strongly negative effects on a large class of RL methods for control optimisation.
Also examined are large and non-discrete state-space problems where `function approximation' is needed, but where many RL methods are known to be unstable. Particularly, these are control optimisation methods and when experience is gathered in `off-policy' distributions (e.g. while exploring). By a new choice of error measure to minimise, the well studied linear gradient descent methods are shown to be `stable' when used with any `discounted return' estimating RL method. The notion of stability is weak (very large, but finite error bounds are shown), but the result is significant insofar as it covers new cases such as off-policy and multi-step methods for control optimisation.
New ways of viewing the goal of function approximation in RL are also examined. Rather than a process of error minimisation between the learned and observed reward signal, the objective is viewed to be that of finding representations that make it possible to identify the best action for given states. A new `decision boundary partitioning' algorithm is presented with this goal in mind. The method recursively refines the value-function representation, increasing it in areas where it is expected that this will result in better decision policies.