
Actor-critic algorithms and their extensions have achieved great success in real-world decision-making problems. In contrast to this empirical success, the theoretical understanding of actor-critic methods remains unsatisfactory. Most existing results establish only asymptotic convergence, derived mainly by approximating the dynamics of the actor and critic with ordinary differential equations. The finite-time convergence analysis of actor-critic algorithms, however, remains largely unexplored. The main challenges lie in the nonconvexity of parameterized policies, the coupling of the actor and critic updates, and the dependency of the data sampled in online settings. In this paper, we provide a finite-time convergence analysis for an online actor-critic algorithm in the infinite-horizon average-reward setting. For the critic step, we give a theoretical analysis of the TD(0) algorithm for the average reward with dependent data in online settings. Moreover, we show that the sequence of actor iterates converges at a sublinear rate to a stationary point, up to an irremovable bias induced by approximating the value function with linear functions. To the best of our knowledge, our work appears to provide the first finite-time convergence analysis for an online actor-critic algorithm with TD learning.
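The following is a minimal illustrative sketch, not the authors' code, of the kind of online actor-critic loop the abstract describes: a linear critic trained by average-reward TD(0) on the data stream, coupled with a policy-gradient actor update driven by the TD error. The tabular MDP, feature map, softmax policy, and step sizes below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, d = 5, 3, 4           # small assumed MDP; d-dimensional critic features
phi = rng.standard_normal((n_states, d))   # assumed state features for the linear critic
theta = np.zeros((n_states, n_actions))    # actor parameters (softmax policy)
w = np.zeros(d)                            # critic parameters (linear value function)
eta = 0.0                                  # running estimate of the average reward

# Assumed random transition kernel and reward table, purely for illustration.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

def policy(s):
    """Softmax policy over actions in state s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

alpha, beta, kappa = 0.01, 0.05, 0.05      # actor, critic, and average-reward step sizes
s = rng.integers(n_states)

for t in range(10_000):
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])

    # Critic step: average-reward TD(0) with linear function approximation.
    delta = r - eta + phi[s_next] @ w - phi[s] @ w
    w += beta * delta * phi[s]
    eta += kappa * (r - eta)

    # Actor step: policy-gradient update using the TD error as the advantage signal.
    grad_log = -p
    grad_log[a] += 1.0                     # gradient of log softmax w.r.t. theta[s]
    theta[s] += alpha * delta * grad_log

    s = s_next                             # online: reuse the same Markovian trajectory

print("estimated average reward:", eta)
```

Both updates are performed on the same single trajectory, so consecutive samples are dependent; handling this Markovian data dependence, together with the coupling between the actor and critic iterates, is the difficulty the finite-time analysis addresses.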

Shuang Qiu
Zhuoran Yang
Jieping Ye
Zhaoran Wang