Iteratively extending time horizon reinforcement learning
Ernst, Damien , Pierre Geurts and Louis WehenkelIteratively extending time horizon reinforcement learning
Proceedings of ECML 2003
(Postscript - 6 KB )
Abstract: Reinforcement learning aims to determine an (infinite time horizon)
optimal control policy from interaction with a system. It can be
solved by approximating the so-called $Q$-function from a sample of
four-tuples $(x_t, u_t , r_t, x_{t+1})$ where $x_t$ denotes the system
state at time $t$, $u_t$ the control action taken, $r_t$ the
instantaneous reward obtained and $x_{t+1}$ the successor state of the
system, and by determining the optimal control from the $Q$-function.
Classical reinforcement learning algorithms use an ad hoc version of
stochastic approximation which iterates over the $Q$-function
approximations on a four-tuple by four-tuple basis. In this paper, we
reformulate this problem as a sequence of batch mode supervised
learning problems which in the limit converges to (an approximation
of) the $Q$-function. Each step of this algorithm uses the full
sample of four-tuples gathered from interaction with the system and
extends by one step the horizon of the optimality criterion. An
advantage of this approach is to allow the use of standard batch mode
supervised learning algorithms, instead of the incremental versions
used up to now. In addition to a theoretical justification the paper
provides empirical tests in the context of the ``Car on the Hill''
control problem based on the use of ensembles of regression trees.
The resulting algorithm is in principle able to handle efficiently
large scale reinforcement learning problems.