Information Loss Bounded Policy Optimization


Proximal and trust-region policy optimization methods belong to the standard reinforcement learning toolbox. Notably, PPO can be viewed as transforming the constrained TRPO problem into an unconstrained one, either by turning the constraint into a penalty or by clipping the objective. In my master's thesis, I studied an alternative reformulation in which the information loss is bounded using a novel transformation of the Kullback-Leibler (KL) divergence constraint. In contrast to PPO, this method does not require tuning a regularization coefficient, which is known to be hard because of its sensitivity to reward scaling. The resulting algorithm, termed information-loss-bounded policy optimization, enjoys the benefits of first-order methods, being straightforward to implement with automatic differentiation, while retaining the advantages of quasi-second-order methods. It performs competitively in simulated OpenAI MuJoCo environments and achieves robust performance on a real robotic task: swing-up and stabilization of a Furuta pendulum.
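
To make the contrast concrete, here is a minimal PyTorch sketch of the three kinds of surrogate losses discussed above: PPO's clipped objective, the KL-penalized objective with its reward-scale-sensitive coefficient, and a KL-bounded variant in which a per-sample KL estimate is passed through a saturating transformation instead of a tuned penalty. The function names and the specific bounding transformation are illustrative assumptions, not the formulation from the thesis.

```python
import torch

def ppo_clip_loss(ratio, adv, eps=0.2):
    # Standard PPO clipped surrogate (negated, so it can be minimized).
    return -torch.min(ratio * adv,
                      torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv).mean()

def kl_penalty_loss(ratio, adv, kl, beta=1.0):
    # KL-penalized surrogate; beta is the regularization coefficient whose
    # tuning is sensitive to reward scaling.
    return -(ratio * adv).mean() + beta * kl.mean()

def kl_bounded_loss(ratio, adv, kl, delta=0.01):
    # Hypothetical information-loss-bounded surrogate: the KL term is mapped
    # through a transformation that stays flat inside the trust region and
    # grows steeply once the estimated KL exceeds the budget delta, so no
    # beta has to be tuned. The squared-hinge form below is a placeholder,
    # not the transformation proposed in the thesis.
    barrier = torch.relu(kl - delta) ** 2 / delta
    return -(ratio * adv).mean() + barrier.mean()

# Illustrative usage on a dummy batch:
logp_old = torch.randn(64)
logp_new = (logp_old + 0.1 * torch.randn(64)).requires_grad_()
adv = torch.randn(64)
ratio = torch.exp(logp_new - logp_old)
kl = logp_old - logp_new  # simple per-sample KL estimate
loss = kl_bounded_loss(ratio, adv, kl)
loss.backward()  # gradients via automatic differentiation, as in any first-order method
```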

