Optimistic Initial Q-values and the max Operator

Reynolds, Stuart
Optimistic Initial Q-values and the max Operator
UKCI'01

Abstract: This paper provides a surprising new insight into the role of the max operator used by reinforcement learning algorithms to estimate the maximum future return available to an agent. It is shown that optimistic Q-value estimates prevent learning updates from quickly minimising the error in the predicted maximum future return. Experimental results show that, once the effect of optimism on the agent's exploration strategy is accounted for, learning generally proceeds more quickly when non-optimistic initial Q-values are used. In existing work, optimistic Q-values are frequently used to manage the tradeoff between exploration and exploitation. This paper presents a simple way to avoid the learning problems this practice can cause.
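
Below is a minimal sketch of the mechanism the abstract describes, assuming a toy setup that is not taken from the paper: tabular Q-learning on a five-state deterministic chain, comparing optimistic against zero initial Q-values. The one-step target uses the max over successor Q-values, so optimistic entries keep the target inflated until every action's estimate has been driven down. All parameters (chain length, ALPHA, GAMMA, epsilon, episode count) are illustrative assumptions.

import random

# Illustrative toy problem, not the paper's experimental setup:
# a deterministic chain of five states; state 4 is terminal and
# reaching it yields reward 1, all other transitions yield 0.
N_STATES = 5
ACTIONS = (0, 1)   # 0 = left, 1 = right
GAMMA = 0.9        # discount factor
ALPHA = 0.1        # step size

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

def run(q_init, episodes=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: q_init for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy with random tie-breaking, so both
            # initialisations receive some exploration.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[s].values())
                a = rng.choice([b for b in ACTIONS if Q[s][b] == best])
            s2, r, done = step(s, a)
            # The max operator at work: the one-step target takes the
            # largest successor Q-value, so one optimistic entry is
            # enough to keep the target (and hence the update) inflated.
            target = r if done else r + GAMMA * max(Q[s2].values())
            Q[s][a] += ALPHA * (target - Q[s][a])
            s = s2
    return Q

for q0 in (0.0, 10.0):
    Q = run(q0)
    print(f"q_init={q0:4.1f}: learned Q(0, right) = {Q[0][1]:.3f}")

Under these assumptions the true value of taking "right" in state 0 is GAMMA**3 = 0.729 (four steps to the goal, with the reward discounted three times). The zero-initialised run should finish near that value, while the optimistically initialised run typically remains well above it after the same number of episodes, since every update chases a target propped up by the max over still-optimistic successors.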