







Department of Computer Science
University of Massachusetts Amherst
Research
Reinforcement Learning
Most of the research at ALL is centered on reinforcement learning
(RL). Broadly speaking, RL refers to a class of algorithms which
compute approximate solutions to stochastic optimal control problems.
An RL agent refers to any entity which uses RL algorithms to solve
this type of problem. RL algorithms rely on a scalar reward signal
being provided to the RL agent at discrete time intervals. In most
applications, the stochastic optimal control problem is modelled as a
Markov decision process (MDP), and RL algorithms apply stochastic
dynamic programming to estimate a solution to the MDP.
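As a rough illustration of the ideas above, the sketch below runs tabular Q-learning, one well-known RL algorithm, on a small MDP. The five-state chain, its slip probability, the reward placement, and all parameter values are assumptions made up for this example; they do not describe any particular ALL project.

```python
import random

# Hypothetical 5-state chain MDP: states 0..4, actions 0 (left) and 1 (right).
N_STATES, ACTIONS, GOAL = 5, (0, 1), 4

def step(state, action, rng):
    """Stochastic transition: the intended move succeeds with probability 0.9."""
    move = 1 if action == 1 else -1
    if rng.random() >= 0.9:
        move = -move  # the move slips in the opposite direction
    next_state = min(max(state + move, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0  # scalar reward signal
    return next_state, reward, next_state == GOAL

def q_learning(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy choice, picking at random when the values tie.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            next_state, reward, done = step(state, action, rng)
            # Sampled dynamic-programming backup toward reward plus
            # the discounted value of the best next action.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
```

Here `greedy_policy` lists the preferred action in each non-goal state. The agent never sees the transition probabilities; it estimates the solution purely from sampled transitions and rewards.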
RL algorithms offer several advantages over alternative approximation
techniques. Most RL algorithms are on-line, allowing an agent to
improve its behavior in real-time. Moreover, RL algorithms center the
computational effort around (possibly imaginary) trajectories followed
by an agent, which enables the agent to focus on relevant parts of the
control problem. Finally, RL algorithms can be combined with function
approximation techniques on problems with large state spaces.
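To make the on-line and function-approximation points concrete, here is a minimal sketch of a semi-gradient TD(0) update with a linear value estimate. The function name `td0_update`, the feature vectors, and the step-size and discount values are illustrative assumptions, not a specific method used at ALL.

```python
def td0_update(weights, features, reward, next_features, terminal,
               alpha=0.1, gamma=0.9):
    """One on-line semi-gradient TD(0) step for a linear value estimate."""
    value = sum(w * x for w, x in zip(weights, features))
    next_value = 0.0 if terminal else sum(
        w * x for w, x in zip(weights, next_features))
    td_error = reward + gamma * next_value - value
    # For a linear approximator, the gradient of the value with respect
    # to the weights is simply the feature vector.
    return [w + alpha * td_error * x for w, x in zip(weights, features)]

# Hypothetical trajectory of (features, reward, next_features, terminal).
trajectory = [
    ([1.0, 0.0], 0.0, [0.5, 0.5], False),
    ([0.5, 0.5], 0.0, [0.0, 1.0], False),
    ([0.0, 1.0], 1.0, [0.0, 0.0], True),
]

weights = [0.0, 0.0]
for features, reward, next_features, terminal in trajectory:
    # The estimate is improved after every single step of the trajectory.
    weights = td0_update(weights, features, reward, next_features, terminal)
```

Because the weights are shared across states through the features, the estimate needs far fewer parameters than one value per state, which is what makes this style of update usable on large state spaces.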
In many applications, the MDP model is unrealistic, since the
environment is not fully observable to an RL agent. Instead, the
stochastic optimal control problem can be modelled as a partially
observable MDP (POMDP). Estimating a solution to a POMDP usually
involves the use of memory, since remembering past situations can help
an RL agent resolve ambiguities about its current state.
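One common way to give an agent such memory is a belief state: a probability distribution over hidden states that is updated after each observation. The sketch below assumes a hypothetical two-state problem with a fixed action and a sensor that is correct 85% of the time; the function name and all numbers are illustrative only.

```python
def belief_update(belief, transition, observation_prob, observation):
    """Bayes-filter update of a belief over hidden states.

    transition[s][s2]       : P(next state s2 | state s) for the action taken
    observation_prob[s2][o] : P(observation o | next state s2)
    """
    n = len(belief)
    # Prediction step: push the old belief through the transition model.
    predicted = [sum(transition[s][s2] * belief[s] for s in range(n))
                 for s2 in range(n)]
    # Correction step: weight each state by how well it explains the observation.
    unnormalized = [observation_prob[s2][observation] * predicted[s2]
                    for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]

# Hypothetical model: the hidden state never changes, the sensor is 85% accurate.
transition = [[1.0, 0.0], [0.0, 1.0]]
observation_prob = [[0.85, 0.15], [0.15, 0.85]]

belief = [0.5, 0.5]
after_one = belief_update(belief, transition, observation_prob, 0)
after_two = belief_update(after_one, transition, observation_prob, 0)
```

Each repeated observation of `0` shifts more probability onto state 0, which is how remembering past observations gradually resolves the ambiguity about the current state.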
Many applications are so complex that an RL agent acting at random
will never receive a positive reward signal, and so is unable to
improve its behavior. In order to remedy this problem, an agent can
employ an exploration strategy which directs the behavior of the
agent. One exploration technique which was proposed at ALL is to
restrict an RL agent to actions which guarantee that the agent always
inches closer towards positive reward. Another area of study at ALL
is combining supervision with RL so that an agent's behavior is
directed by a teacher during the initial stages of learning. Once an
RL agent has accumulated enough experience of positive reward, it can
drop the exploration strategy.
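The idea of a teacher that guides early behavior and is later dropped can be sketched as follows. This is only one possible scheme, written under assumptions: the linear hand-off schedule, the `handoff` parameter, and the toy teacher are all hypothetical and are not taken from the ALL work described above.

```python
import random

def choose_action(state, q_values, teacher, positive_rewards_seen, rng,
                  handoff=20):
    """Follow the teacher early on, then shift to the agent's own estimates.

    The probability of deferring to the teacher falls linearly to zero
    once `handoff` positive rewards have been experienced.
    """
    p_teacher = max(0.0, 1.0 - positive_rewards_seen / handoff)
    if rng.random() < p_teacher:
        return teacher(state)
    # Otherwise act greedily with respect to the learned action values.
    actions = range(len(q_values[state]))
    return max(actions, key=lambda a: q_values[state][a])

rng = random.Random(0)
q_values = {"start": [0.2, 0.7, 0.1]}   # hypothetical learned values
teacher = lambda state: 2               # hypothetical teacher: always action 2

early = choose_action("start", q_values, teacher, 0, rng)   # teacher decides
late = choose_action("start", q_values, teacher, 20, rng)   # agent decides
```

With no positive reward experienced yet, the teacher's action is always used; after `handoff` positive rewards the agent relies entirely on its own value estimates.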
Standard RL algorithms maintain estimates of values, that is, the expected
sum of future rewards, for all states or all state-action pairs.
Frequently, we need to approximate these values, in which case RL
algorithms no longer guarantee an improvement in the agent's behavior.
Policy search methods instead
improve the policy in the direction of the policy gradient, without
explicitly computing or storing value estimates. Thus far, policy
search methods have proven slow in practice, but improving and
extending these methods is part of ALL's research agenda.
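A minimal sketch of a policy search method is given below: a REINFORCE-style update on a two-armed bandit with a softmax policy. The bandit, its deterministic rewards, and the step size are assumptions chosen to keep the example small. Note that the code adjusts the policy parameters directly and stores no value estimates.

```python
import math
import random

def softmax(preferences):
    """Turn action preferences into a probability distribution."""
    peak = max(preferences)  # subtract the maximum for numerical stability
    exps = [math.exp(p - peak) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_bandit(rewards=(0.0, 1.0), steps=1000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # policy parameters, one per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        for a in range(len(theta)):
            # Gradient of log pi(action) with respect to theta[a]
            # for a softmax policy.
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += alpha * reward * grad_log
    return softmax(theta)

final_probs = policy_gradient_bandit()
```

After training, `final_probs` places most of its probability on the rewarding arm. Even on this tiny problem the update needs many samples to become confident, which hints at why such methods can be slow in practice.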