[Mike Rosenstein]


Supervised Actor-Critic Reinforcement Learning
Simulated Robotic Arm Example
Reinforcement learning (RL) methods are attractive for solving optimal control problems when accurate models are unavailable. For many such problems, however, RL alone is impractical, and the associated learning problem must be structured to take advantage of domain knowledge. This project deals with domain knowledge in the form of a supervisor that generates control inputs in parallel with a reinforcement learning system. The basic challenge is to combine two sources of feedback: error information derived from the supervisor and evaluative reinforcement supplied by the environment. The approach taken in this work is to combine supervised learning with an actor-critic architecture and to use a stable controller as the supervisor. The controller not only informs the RL system about favorable control actions but also protects the learning system from risky behavior. Demonstrations of the approach include a standard problem from the optimal control literature, a simulated robotic arm, and the seven-DOF manipulator shown below.
WAM frames
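
This page does not spell out how the two sources of feedback are combined, so the following is only a minimal sketch of one plausible scheme, assuming a scalar control action and a single autonomy parameter k in [0, 1] (k = 0 is full supervision, k = 1 is full autonomy). The function names, the Gaussian exploration noise, and the exact form of the update are illustrative assumptions, not the authors' implementation.

```python
import random


def composite_action(actor_action, supervisor_action, autonomy,
                     noise_std=0.0, rng=None):
    """Blend the actor's (optionally noisy) action with the supervisor's.

    Hypothetical helper: `autonomy` in [0, 1] weights the learner's
    exploratory action against the supervisor's action.
    Returns (blended_action, exploratory_action).
    """
    if not 0.0 <= autonomy <= 1.0:
        raise ValueError("autonomy must lie in [0, 1]")
    rng = rng or random
    exploratory = actor_action
    if noise_std > 0.0:
        # Assumed exploration scheme: additive zero-mean Gaussian noise.
        exploratory = actor_action + rng.gauss(0.0, noise_std)
    blended = autonomy * exploratory + (1.0 - autonomy) * supervisor_action
    return blended, exploratory


def actor_target_shift(actor_action, exploratory_action, supervisor_action,
                       td_error, autonomy):
    """Direction in which to move the actor's output (assumed form).

    Reinforcement term: the critic's TD error scales the exploratory
    deviation. Supervised term: the error relative to the supervisor.
    The same autonomy parameter weights the two terms.
    """
    rl_term = td_error * (exploratory_action - actor_action)
    supervised_term = supervisor_action - actor_action
    return autonomy * rl_term + (1.0 - autonomy) * supervised_term


if __name__ == "__main__":
    blended, explored = composite_action(1.0, 3.0, autonomy=0.25)
    shift = actor_target_shift(1.0, 1.5, 3.0, td_error=2.0, autonomy=0.5)
    print(blended, explored, shift)
```

With autonomy near 0 the supervisor's action dominates both the executed control and the learning signal, which is one way a stable controller could shield the learner from risky behavior early on; as autonomy grows, the evaluative (TD-error) term takes over.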


Related Publications
Supervised actor-critic reinforcement learning
M.T. Rosenstein and A.G. Barto. In J. Si, A. Barto, W. Powell, and D. Wunsch, eds., Learning and Approximate Dynamic Programming: Scaling Up to the Real World. John Wiley & Sons, Inc., New York, 359-380, 2004. [pdf] [ps.gz]
Reinforcement learning with supervision by a stable controller
M.T. Rosenstein and A.G. Barto. In Proceedings of the 2004 American Control Conference, 4517-4522, 2004.
Combining reinforcement learning with a local control algorithm
J. Randlov, A.G. Barto and M.T. Rosenstein. In Proceedings of the Seventeenth International Conference on Machine Learning, 775-782, 2000. [pdf] [ps.gz]


Movies
File: ship
Description: Learning to solve a ship steering task adapted from the optimal control literature. The supervisor for this example is a stable heuristic controller that always steers toward the goal. In the animation, the green curve shows the optimal path and the grayscale region indicates the level of autonomy on the part of the learning system. (Black is full autonomy and white is full supervision.)
Download: 5.6 MB

File: SACRL_wam
Description: A perhaps subtle demonstration with the Barrett whole-arm manipulator (WAM). For this example, the supervisor is a stable controller that tracks a straight-line trajectory to the goal in configuration space. The movie shows the supervisor's solution superimposed with the one learned using a minimum-effort criterion. After 120 trials, effort was reduced by approximately 20% (statistically significant) despite a small but statistically significant increase in movement time. This was due to a favorable reconfiguration of the arm near the beginning of movement.
Download: 2.5 MB
 
Note: To view the movies, use a QuickTime player (Mac/Win), MPlayer (Mac/Linux), or XAnim version 2.80 or later (Unix).

updated 27-Aug-2004
mtr@cs.umass.edu