ALL topUMass
Home
People
Research
Pubs
Contact
Links
Restricted
RL Repository

Department of Computer Science
University of Massachusetts Amherst

Research

Reinforcement Learning

Most of the research in ALL is centered around reinforcement learning (RL). Broadly speaking, RL refers to a class of algorithms which compute approximate solutions to stochastic optimal control problems. An RL agent refers to any entity which uses RL algorithms to solve this type of problem. RL algorithms rely on a scalar reward signal being provided to the RL agent at discrete time intervals. In most applications, the stochastic optimal control problem is modelled as a Markov decision process (MDP), and RL algorithms apply stochastic dynamic programming to estimate a solution to the MDP.

RL algorithms offer several advantages over alternative approximation techniques. Most RL algorithms are on-line, allowing an agent to improve its behavior in real-time. Moreover, RL algorithms center the computational effort around (possibly imaginary) trajectories followed by an agent, which enables the agent to focus on relevant parts of the control problem. Finally, RL algorithms can be combined with function approximation techniques on problems with large state spaces.

In many applications, the MDP model is unrealistic, since the environment is not fully observable to an RL agent. Instead, the stochastic optimal control problem can be modelled as a partially observable MDP (POMDP). Estimating a solution to a POMDP usually involves the use of memory, since remembering past situations can help an RL agent resolve ambiguities about its current state.

Many applications are so complex that an RL agent acting at random will never receive a positive reward signal, and so is unable to improve its behavior. In order to remedy this problem, an agent can employ an exploration strategy which directs the behavior of the agent. One exploration technique which was proposed at ALL is to restrict an RL agent to actions which guarantee that the agent always inches closer towards positive reward. Another area of study at ALL is combining supervision with RL so that an agent's behavior is directed by a teacher during the initial stages of learning. Once an RL agent has accumulated enough experience of positive reward, it can drop the exploration strategy.

Standard RL algorithms maintain estimates of values, or the expected sum of future reward, for all states or all state-action pairs. Frequently, we need to approximate these values, in which case RL algorithms no longer guarantee an improvement in the agent's behavior. Policy search methods improve the policy in the direction of the policy gradient, without explicitly computing or storing gradient estimates. Thus far, policy search methods have proven slow in practice, but improving and extending these methods is part of ALLs research agenda.

[ Top of page ]   [ ALL Home ]   [ Department of Computer Science ]   [ University of Massachusetts Amherst ]