## Individual Q-learning in normal form games

**Leslie, David**, E. J. Collins. Individual Q-learning in normal form games

*unpublished*

**Abstract**: The single-agent multi-armed bandit problem can be solved by an
agent that learns the value of each action using reinforcement
learning \cite{SuttonBarto98}. However, the multi-agent version of
the problem, the iterated normal form game, presents a more complex
challenge, since the rewards available to each agent depend on the
strategies of the others. We consider the behaviour of value-based
learning agents in this situation, and show that such agents cannot
generally play at a Nash equilibrium, although if smooth best
responses are used, a Nash distribution can be reached. We introduce
a particular value-based learning algorithm, individual
$Q$-learning, and use stochastic approximation to study the
asymptotic behaviour, showing that strategies converge to a Nash
distribution almost surely in 2-player zero-sum games and 2-player
partnership games. Player-dependent learning rates are then
considered, and it is shown that this extension converges in some
games for which many algorithms, including the basic algorithm
initially considered, fail to converge.
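
As a rough illustration of the kind of value-based learner the abstract describes, the following Python sketch has two players who each keep Q-values for their own actions only and play Boltzmann (softmax) smooth best responses to those values. The abstract does not state the update rule, so the details here are assumptions rather than the authors' definition: the function names `boltzmann` and `individual_q_learning`, the temperature, the step-size schedule, and the division of each update by the probability of the action just played.

```python
import math
import random


def boltzmann(q_values, temperature):
    """Smooth best response: softmax of the Q-values at a given temperature."""
    top = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def individual_q_learning(payoffs, steps=20000, temperature=1.0, seed=0):
    """Hypothetical two-player sketch; payoffs[i][a0][a1] is player i's reward.

    Each player sees only its own action and reward, never the opponent's.
    """
    rng = random.Random(seed)
    n_actions = [len(payoffs[0]), len(payoffs[0][0])]
    q = [[0.0] * n for n in n_actions]
    for n in range(1, steps + 1):
        rate = 1.0 / (n + 20)  # assumed decreasing learning rate
        policies = [boltzmann(q[i], temperature) for i in range(2)]
        actions = [
            rng.choices(range(n_actions[i]), weights=policies[i])[0]
            for i in range(2)
        ]
        for i in range(2):
            a = actions[i]
            reward = payoffs[i][actions[0]][actions[1]]
            # Only the played action is updated. Dividing by its selection
            # probability (an assumption here) compensates for rarely played
            # actions being updated less often.
            q[i][a] += rate * (reward - q[i][a]) / policies[i][a]
    return q, [boltzmann(q[i], temperature) for i in range(2)]


# Matching pennies, a 2-player zero-sum game: player 0 wins on a match.
matching_pennies = [
    [[1, -1], [-1, 1]],
    [[-1, 1], [1, -1]],
]
q_values, strategies = individual_q_learning(matching_pennies)
```

The player-dependent learning rates mentioned at the end of the abstract could be explored in this sketch by giving each player its own `rate` schedule, although the specific schedules the authors analyse are not given here.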