Dynamic programming (DP) refers to a class of solution methods for
finding optimal solutions to problems with a compositional cost
structure. Richard Bellman, Ronald Howard, and David Blackwell laid
the foundations for most of the early research on this problem. In
particular, they originated the most popular algorithms (value
iteration and policy iteration) and made significant contributions
to the mathematical study of Markov decision processes (MDPs). Much
of current research in RL is based on the framework of DP and MDPs.
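As a concrete illustration of value iteration, here is a minimal sketch on a small, made-up two-state MDP (the states, actions, transition probabilities, rewards, and discount factor below are assumptions for illustration, not taken from the text). It repeatedly applies the Bellman optimality backup until the value estimates stop changing:

```python
# Value iteration on a tiny illustrative MDP (hypothetical example).
# P[s][a] is a list of (next_state, probability) pairs; R[s][a] is the
# immediate reward for taking action a in state s.

GAMMA = 0.9  # discount factor (assumed)

P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
R = {
    0: {0: 0.0, 1: 1.0},
    1: {0: 0.0, 1: 2.0},
}

def value_iteration(tol=1e-8):
    """Iterate the Bellman optimality backup until values converge."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Best one-step lookahead value over all actions in state s.
            v_new = max(
                R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    return V

print(value_iteration())  # converges to V(0) ≈ 19.0, V(1) ≈ 20.0
```

Here the optimal policy always takes action 1, so V(1) = 2/(1 - 0.9) = 20 and V(0) = 1 + 0.9 * 20 = 19, which the iteration recovers. Policy iteration would instead alternate full policy evaluation with greedy policy improvement.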
References: An extensive discussion of MDPs may be found in:
Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.