actor-critic - Refers to a class of agent architectures, where the actor plays out a particular policy, while the critic learns to evaluate actor's policy. Both the actor and critic are simultaneously improving by bootstrapping on each other.
agent - A system that is embedded in an environment, and takes actions to change the state of the environment. Examples include mobile robots, software agents, or industrial controllers.
average-reward methods - A framework where the agent's goal is to maximize the expected payoff per step. Average-reward methods are appropriate in problems where the goal is maximize the long-term performance. They are usually much more difficult to analyze than discounted algorithms.
discount factor - A scalar value between 0 and 1 which determines the present value of future rewards. If the discount factor is 0, the agent is concerned with maximizing immediate rewards. As the discount factor approaches 1, the agent takes more future rewards into account. Algorithms which discount future rewards include Q-learning and TD(lambda).
dynamic programming (DP) is a class of solution methods for solving sequential decision problems with a compositional cost structure. Richard Bellman was one of the principal founders of this approach.
environment - The external system that an agent is "embedded" in, and can perceive and act on.
function approximator refers to the problem of inducing a function from training examples. Standard approximators include decision trees, neural networks, and nearest-neighbor methods.
Markov decision process (MDP) - A probabilistic model of a sequential decision problem, where states can be perceived exactly, and the current state and action selected determine a probability distribution on future states. Essentially, the outcome of applying an action to a state depends only on the current action and state (and not on preceding actions or states).
model-based algorithms - These compute value functions using a model of the system dynamics. Adaptive Real-time DP (ARTDP) is a well-known example of a model-based algorithm.
model-free algorithms - these directly learn a value function without requiring knowledge of the consequences of doing actions. Q-learning is the best known example of a model-free algorithm.
model - The agent's view of the environment, which maps state-action pairs to probability distributions over states. Note that not every reinforcement learning agent uses a model of its environment.
Monte Carlo methods - A class of methods for learning of value functions, which estimates the value of a state by running many trials starting at that state, then averages the total rewards received on those trials.
policy - The decision-making function (control strategy) of the agent, which represents a mapping from situations to actions.
reward - A scalar value which represents the degree to which a state or action is desirable. Reward functions can be used to specify a wide range of planning goals (e.g. by penalizing every non-goal state, an agent can be guided towards learning the fastest route to the final state).
sensor - Agents perceive the state of their environment using sensors, which can refer to physical transducers, such as ultrasound, or simulated feature-detectors.
state - this can be viewed as a summary of the past history of the system, that determines its future evolution.
stochastic approximation - Most RL algorithms can be viewed as stochastic approximations of exact DP algorithms, where instead of complete sweeps over the state space, only selected states are backed up (which are sampled according to the underlying probabilistic model). A rich theory of stochastic approximation (e.g Robbins Munro) can be brought to bear to understand the theoretical convergence of RL methods.
TD (temporal difference) algorithms - A class of learning methods, based on the idea of comparing temporally successive predictions. Possibly the single most fundamental idea in all of reinforcement learning.
unsupervised learning - The area of machine learning in which an agent learns from interaction with its environment, rather than from a knowledgeable teacher that specifies the action the agent should take in any given state.
value function is a mapping from states to real numbers, where the value of a state represents the long-term reward achieved starting from that state, and executing a particular policy. The key distinguishing feature of RL methods is that they learn policies indirectly, by instead learning value functions. RL methods can be constrasted with direct optimization methods, such as genetic algorithms (GA), which attempt to search the policy space directly.