# Research

## Reinforcement Learning

Most of the research in the ALL is centered around reinforcement learning (RL). Broadly speaking, RL refers to a class of algorithms which compute approximate solutions to stochastic optimal control problems. An RL agent refers to any entity which uses RL algorithms to solve this type of problem. RL algorithms rely on a scalar reward signal being provided to the RL agent at discrete time intervals. In most applications, the stochastic optimal control problem is modelled as a Markov decision process (MDP), and RL algorithms apply stochastic dynamic programming to estimate a solution to the MDP.

RL algorithms offer several advantages over alternative approximation techniques. Most RL algorithms are on-line, allowing an agent to improve its behavior in real-time. Moreover, RL algorithms center the computational effort around (possibly imaginary) trajectories followed by an agent, which enables the agent to focus on relevant parts of the control problem. Finally, RL algorithms can be combined with function approximation techniques on problems with large state spaces.

In many applications, the MDP model is unrealistic, since the environment is not fully observable to an RL agent. Instead, the stochastic optimal control problem can be modelled as a partially observable MDP (POMDP). Estimating a solution to a POMDP usually involves the use of memory, since remembering past situations can help an RL agent resolve ambiguities about its current state.

Many applications are so complex that an RL agent acting at random will never receive a positive reward signal, and so is unable to improve its behavior. In order to remedy this problem, an agent can employ an exploration strategy which directs the behavior of the agent. One exploration technique which was proposed in the lab is to restrict an RL agent to actions which guarantee that the agent always inches closer towards positive reward. Another area of study at the ALL is combining supervision with RL so that an agent's behavior is directed by a teacher during the initial stages of learning. Once an RL agent has accumulated enough experience of positive reward, it can drop the exploration strategy.

Standard RL algorithms maintain estimates of values, or the expected sum of future reward, for all states or all state-action pairs. Frequently, we need to approximate these values, in which case RL algorithms no longer guarantee an improvement in the agent's behavior. Policy search methods improve the policy in the direction of the policy gradient, without explicitly computing or storing gradient estimates. Thus far, policy search methods have proven slow in practice, but improving and extending these methods is part of the ALL's research agenda.

### Proto-Value Functions

Proto-value functions (PVFs) are a unified framework for learning representation and behavior. This framework addresses a longstanding puzzle in AI: how can agents transform their temporal experience into multi-scale task-independent representations that effectively guide long-term task-specific behavior?

Proto-value functions are learned from the topology of the underlying state space and reflect large-scale geometric invariants, such as bottlenecks and symmetries. Fourier proto-value functions are learned by computing the eigenvectors of the graph Laplacian; wavelet proto-value functions are learned by dilating the unit basis functions using the powers of a random walk diffusion operator on the graph.

For more information on PVFs, please visit the Proto-value functions group webpage.

A three-room "grid-world" (left) and a corresponding Fourier PVF (right). Note how the PVF captures the structure inherent to the state space. Such structure can be useful for learning future tasks in this environment. |

### Mirror Descent

Mirror descent generalizes classical first-order gradient descent to non-Euclidean geometries by using a distance-generating function specific to a particular geometry. It is widely used as a powerful tool for acceleration and regularization for first-order optimization methods. We proposed the first sparse Q-learning algorithm, as well as a novel approach that we call Regularized Off-policy TD learning.