**actor-critic** - Refers to a class of agent architectures
in which the actor carries out a particular policy while the critic
learns to evaluate the actor's policy. The actor and the critic
improve simultaneously by bootstrapping on each other.

**agent** - A system that is embedded in an environment and
takes *actions* to change the state of the environment.
Examples include mobile robots, software agents, or industrial
controllers.

**average-reward methods** - A framework where the agent's goal
is to maximize the expected payoff per step. Average-reward methods
are appropriate in problems where the goal is to maximize long-term
performance. They are usually much more difficult to analyze than
discounted algorithms.

**discount factor** - A scalar value between 0 and 1 which
determines the present value of future rewards. If the discount
factor is 0, the agent is concerned with maximizing immediate rewards.
As the discount factor approaches 1, the agent takes more future
rewards into account. Algorithms which discount future rewards include
Q-learning and TD(lambda).
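
As a rough illustration of how the discount factor weights future rewards, the sketch below computes a discounted sum for a short reward sequence. The function name `discounted_return` and the reward values are assumptions chosen for this example only.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards weighted by gamma**t, where t is the step index."""
    total = 0.0
    # Work backwards through the sequence: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]  # illustrative reward sequence
print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(discounted_return(rewards, 0.5))  # later rewards count for less
print(discounted_return(rewards, 1.0))  # all rewards count equally
```

With a discount factor of 0 only the first reward contributes, while a value of 1 reduces the computation to a plain sum.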

**dynamic programming (DP)** - A class of solution methods for
solving sequential decision problems with a compositional cost
structure. Richard Bellman was one of the principal founders of this
approach.
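
One common DP method is value iteration, which repeatedly applies a one-step backup to every state until the values stop changing. The sketch below assumes a hypothetical three-state chain in which every move costs -1 and the rightmost state is terminal; the names `next_state` and `value_iteration` are illustrative.

```python
# Hypothetical 3-state chain: states 0, 1, 2, with state 2 terminal.
# Each action moves one step left (-1) or right (+1) and costs -1.
N_STATES = 3
GOAL = 2
ACTIONS = (-1, +1)


def next_state(s, a):
    """Deterministic dynamics, clipped to the ends of the chain."""
    return min(max(s + a, 0), N_STATES - 1)


def value_iteration(gamma=1.0, tol=1e-9):
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            if s == GOAL:
                continue  # the terminal state keeps value 0
            # Backup: one-step cost plus the value of the best successor.
            best = max(-1.0 + gamma * V[next_state(s, a)] for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


V = value_iteration()
print(V)
```

The value of each state is composed from the step cost and the value of its successor, which is the compositional structure that DP exploits.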

**environment** - The external system that an agent is
"embedded" in, and can perceive and act on.

**function approximator** - A method for inducing a
function from training examples. Standard approximators include
decision trees, neural networks, and nearest-neighbor methods.

**Markov decision process (MDP)** - A probabilistic model of a
sequential decision problem, where states can be perceived exactly,
and the current state and action selected determine a probability
distribution on future states. Essentially, the outcome of applying an
action to a state depends only on the current action and state (and
not on preceding actions or states).
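
A small MDP can be written down as a table of transition probabilities. The sketch below uses a made-up two-state example; the state names, action names, and probabilities are assumptions for illustration only.

```python
import random

# Hypothetical MDP: P[state][action] is a list of (probability, next_state).
P = {
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(0.8, "s0"), (0.2, "s1")]},
}


def step(state, action, rng):
    """Sample the next state from the current state and action alone."""
    outcomes = P[state][action]
    probs = [p for p, _ in outcomes]
    states = [s for _, s in outcomes]
    return rng.choices(states, weights=probs, k=1)[0]


rng = random.Random(0)
state = "s0"
trajectory = [state]
for _ in range(5):
    state = step(state, "go", rng)
    trajectory.append(state)
print(trajectory)
```

Note that `step` takes no history argument: the distribution over next states is fully determined by the current state and action.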

**model-based algorithms** - These compute value functions using
a model of the system dynamics. Adaptive Real-time DP (ARTDP) is a
well-known example of a model-based algorithm.

**model-free algorithms** - These directly learn a value
function without requiring knowledge of the consequences of taking
actions. Q-learning is the best-known example of a model-free
algorithm.
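
A minimal tabular Q-learning sketch is given below, assuming a hypothetical three-state chain with a cost of -1 per step and a terminal state on the right. The learner never inspects the dynamics in `env_step`; it only uses sampled transitions. The step size, discount factor, and purely random behaviour policy are illustrative choices.

```python
import random

N_STATES, GOAL = 3, 2
LEFT, RIGHT = 0, 1


def env_step(s, a):
    """Environment dynamics; the learner sees only the sampled outcome."""
    s2 = min(max(s + (1 if a == RIGHT else -1), 0), N_STATES - 1)
    return s2, -1.0, s2 == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice((LEFT, RIGHT))  # purely exploratory behaviour
            s2, r, done = env_step(s, a)
            # Q[GOAL] is never updated, so terminal states bootstrap to 0.
            target = r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q


Q = q_learning()
# Greedy action in each non-terminal state.
greedy = [max((LEFT, RIGHT), key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy)
```

Because Q-learning is off-policy, the greedy policy read from `Q` can differ from the random policy that generated the experience.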

**model** - The agent's view of the environment, which maps
state-action pairs to probability distributions over states. Note that
not every reinforcement learning agent uses a model of its environment.

**Monte Carlo methods** - A class of methods for learning
value functions, which estimate the value of a state by running many
trials starting at that state and then averaging the total rewards
received on those trials.
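
The averaging step can be sketched as follows, assuming a hypothetical episode that pays a reward of 1 per step and stops with probability 0.5 after each step, so that the expected total reward from the start state is 2. The names `run_trial` and `mc_value` are illustrative.

```python
import random


def run_trial(rng):
    """One episode: reward 1 per step, stop with probability 0.5 each step."""
    total = 0.0
    while True:
        total += 1.0
        if rng.random() < 0.5:
            return total


def mc_value(n_trials=20000, seed=0):
    """Estimate the start state's value as the mean total reward over trials."""
    rng = random.Random(seed)
    return sum(run_trial(rng) for _ in range(n_trials)) / n_trials


estimate = mc_value()
print(estimate)  # should be close to the expected total reward of 2
```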

**policy** - The decision-making function (control strategy) of the
agent, which represents a mapping from situations to actions.

**reward** - A scalar value which represents the degree to which
a state or action is desirable. Reward functions can be used to
specify a wide range of planning goals (e.g. by penalizing every
non-goal state, an agent can be guided towards learning the fastest
route to the final state).

**sensor** - Agents perceive the state of their environment
using sensors, which can refer to physical transducers, such as
ultrasound, or simulated feature-detectors.

**state** - This can be viewed as a summary of the past history
of the system that determines its future evolution.

**stochastic approximation** - Most RL algorithms can be viewed
as stochastic approximations of exact DP algorithms, where instead of
complete sweeps over the state space, only selected states are backed
up (which are sampled according to the underlying probabilistic
model). A rich theory of stochastic approximation (e.g., Robbins-Monro)
can be brought to bear to understand the theoretical convergence of RL
methods.
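
As a simple illustration of a Robbins-Monro style iteration, the sketch below estimates the mean of a noisy signal by moving the current estimate a small step toward each new sample. The Gaussian noise model, the target mean of 3.0, and the function name are assumptions made for this example.

```python
import random


def robbins_monro_mean(n_samples=10000, true_mean=3.0, seed=0):
    rng = random.Random(seed)
    x = 0.0
    for n in range(1, n_samples + 1):
        sample = rng.gauss(true_mean, 1.0)  # noisy observation
        # Step sizes 1/n: their sum diverges, the sum of their squares converges.
        x += (1.0 / n) * (sample - x)
    return x


estimate = robbins_monro_mean()
print(estimate)  # should be close to 3.0
```

The update has the same form as the incremental updates used in RL algorithms: old estimate plus step size times (noisy target minus old estimate).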

**TD (temporal difference) algorithms** - A class of learning
methods, based on the idea of comparing temporally successive
predictions. Possibly the single most fundamental idea in all of
reinforcement learning.
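
The idea of comparing successive predictions can be sketched with the TD(0) update, shown here evaluating a fixed "always move right" policy on a hypothetical three-state chain with a cost of -1 per step. The chain, step size, and function name are assumptions for illustration.

```python
def td0_chain(episodes=200, alpha=0.5, gamma=1.0):
    """TD(0) evaluation of the 'always move right' policy on states 0, 1, 2."""
    V = [0.0, 0.0, 0.0]  # state 2 is terminal, so its value stays 0
    for _ in range(episodes):
        s = 0
        while s != 2:
            s2, r = s + 1, -1.0
            # TD error: difference between successive predictions.
            td_error = r + gamma * V[s2] - V[s]
            V[s] += alpha * td_error
            s = s2
    return V


V = td0_chain()
print(V)  # values approach -2 for state 0 and -1 for state 1
```

Each update uses the prediction at the next state as part of the target, so learning can proceed before the episode ends.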

**unsupervised learning** - The area of machine learning in
which an agent learns from interaction with its environment, rather
than from a knowledgeable teacher that specifies the action the agent
should take in any given state.

**value function** - A mapping from states to real numbers,
where the value of a state represents the long-term reward achieved
starting from that state, and executing a particular policy. The key
distinguishing feature of RL methods is that they learn policies
indirectly, by instead learning value functions. RL methods can be
contrasted with direct optimization methods, such as genetic
algorithms (GA), which attempt to search the policy space directly.