Error Bounds for Approximate Policy Iteration

Munos, Remi
Error Bounds for Approximate Policy Iteration
Icml 2003 ( gzipped Postscript - 80 KB )

Abstract: In Dynamic Programming, convergence of algorithms such as Value Iteration or Policy Iteration results --in discounted problems-- from a contraction property of the back-up operator, guaranteeing convergence to its fixed-point. When approximation is considered, known results in Approximate Policy Iteration provide bounds on the closeness to optimality of the approximate value function obtained by successive policy improvement steps as a function of the maximum norm of value determination errors during policy evaluation steps. Unfortunately, such results have limited practical range since most function approximators (such as linear regression) select the best fit in a given class of parameterized functions by minimizing some (weighted) quadratic norm. In this paper, we provide error bounds for Approximate Policy Iteration using quadratic norms, and illustrate those results in the case of feature-based linear function approximation.