12/3/20 | 4:15pm | Online only
Abstract: Policy gradient methods apply to complex, poorly understood control problems by performing stochastic gradient ascent over a parameterized class of policies. Unfortunately, due to the multi-period nature of the objective, they face non-convex optimization problems. The first part of the talk aims to clarify when policy gradient methods can be expected to converge to a global optimum despite this non-convexity. We show they can get stuck in suboptimal stationary points in extremely simple problems but identify structural properties – shared by several canonical control problems – that guarantee the policy gradient objective function has no suboptimal stationary points despite being non-convex. In the second part of the talk, I’ll zoom in on the special case of state-aggregated policies, which induce unavoidable approximation errors because they cannot represent an optimal policy. In this case, I'll show policy gradient provably converges to better policies than popular approximate policy iteration and value iteration methods.
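To make the setup concrete, here is a minimal sketch (not from the talk) of the basic policy gradient idea the abstract describes: stochastic gradient steps on the parameters of a softmax policy, using the REINFORCE estimator. The toy problem, learning rate, and iteration count are illustrative choices, not details of the speaker's work.

```python
import math
import random

def softmax(theta):
    """Softmax policy: action probabilities from parameters theta."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

# Toy one-step problem: action 0 yields reward 1.0, action 1 yields 0.0.
random.seed(0)
rewards = [1.0, 0.0]
theta = [0.0, 0.0]   # policy parameters, one per action
lr = 0.1             # step size (illustrative)

for _ in range(2000):
    p = softmax(theta)
    a = random.choices([0, 1], weights=p)[0]  # sample an action
    r = rewards[a]
    # REINFORCE: grad of log pi(a|theta) wrt theta[k] is 1[k == a] - p[k];
    # take a stochastic gradient step scaled by the observed reward.
    for k in range(2):
        theta[k] += lr * r * ((1.0 if k == a else 0.0) - p[k])

p_final = softmax(theta)
```

Even in this trivially simple problem the objective is non-convex in theta, yet gradient steps still reach the optimal policy; the talk's first part characterizes when such good behavior extends to genuine multi-period control problems.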
Bio: Professor Russo joined the Decision, Risk, and Operations division of the Columbia Business School as an assistant professor in Summer 2017. Prior to joining Columbia, he spent one great year as an assistant professor in the MEDS department at Northwestern's Kellogg School of Management and one year at Microsoft Research in New England as a Postdoctoral Researcher. He received his PhD from Stanford University in 2015, where he was advised by Benjamin Van Roy. In 2011 he received his BS in Mathematics and Economics from the University of Michigan.
Professor Russo's research lies at the intersection of statistical machine learning and sequential decision-making, and contributes to the fields of online optimization, reinforcement learning, and sequential design of experiments. He is interested in the design and analysis of algorithms that learn over time to make increasingly effective decisions through interacting with a poorly understood environment.