RL Policy Gradient Methods (13): Actor-Critic using Kronecker-factored Trust Region (ACKTR)

This column summarizes the algorithms in the order used by https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html.

Contents

Theory
Algorithm implementation
- Overall workflow
- Code implementation

Actor-Critic using Kronecker-factored Trust Region

$\color{red}{ACKTR}$: [ paper | code ]

Theory

(For a more detailed explanation, see [https://blog.csdn.net/bbbeoy/article/details/106984109](https://blog.csdn.net/bbbeoy/article/details/106984109).)
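Actor-Critic using Kronecker-factored Trust Region (ACKTR; Yuhuai Wu, et al., 2017) uses Kronecker-factored approximate curvature (K-FAC) to compute the gradient update for both the actor and the critic. K-FAC improves the computation of the natural gradient, which differs considerably from the standard gradient. An intuitive explanation of the natural gradient is among the readings listed further down.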

A one-sentence summary would be:

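“We first consider all combinations of parameters that result in a new network a constant KL divergence away from the old network. This constant value can be viewed as the step size or learning rate. Out of all these possible combinations, we choose the one that minimizes our loss function.”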
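To make the quoted idea concrete, here is a minimal NumPy sketch of a single natural-gradient step on a three-action softmax policy. It is an illustration under stated assumptions, not the ACKTR implementation: the names `natural_gradient_step`, `kl_budget`, `damping`, and the toy `rewards` vector are all hypothetical, and the exact Fisher matrix is formed explicitly, which is only feasible for tiny models. The step is rescaled so that the second-order approximation of the KL divergence, $\frac{1}{2}\Delta\theta^\top F \Delta\theta$, equals the chosen budget.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for two categorical distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def fisher_categorical(logits):
    """Exact Fisher matrix of a softmax distribution w.r.t. its logits."""
    p = softmax(logits)
    return np.diag(p) - np.outer(p, p)

def expected_reward_loss_grad(logits, rewards):
    """Gradient of loss = -E_p[reward] w.r.t. the logits."""
    p = softmax(logits)
    return -p * (rewards - p @ rewards)

def natural_gradient_step(logits, grad, kl_budget, damping=1e-3):
    # Damping (an assumption of this sketch) keeps the singular Fisher invertible.
    F = fisher_categorical(logits) + damping * np.eye(len(logits))
    nat_grad = np.linalg.solve(F, grad)            # F^{-1} grad
    # Second-order approximation: KL ~= 0.5 * step^T F step.
    # Rescale the direction so this quantity equals kl_budget.
    quad = float(nat_grad @ F @ nat_grad)
    scale = np.sqrt(2.0 * kl_budget / quad)
    return logits - scale * nat_grad

# Toy usage: uniform policy, hypothetical per-action rewards.
logits = np.zeros(3)
rewards = np.array([1.0, 0.0, 0.0])
grad = expected_reward_loss_grad(logits, rewards)
new_logits = natural_gradient_step(logits, grad, kl_budget=0.01)
print("KL(old || new):", kl(softmax(logits), softmax(new_logits)))
```

The KL budget plays the role of the step size here: changing `kl_budget` changes how far the new policy may move from the old one, regardless of how the logits are parameterized.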

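I list ACKTR here mainly for the completeness of this post, but I will not go into the details, because it involves a lot of theory on natural gradients and optimization methods. If you are interested, check the following papers and posts before reading the ACKTR paper: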

- Amari. Natural Gradient Works Efficiently in Learning. 1998.
- Kakade. A Natural Policy Gradient. 2002.
- A intuitive explanation of natural gradient descent
- Wiki: Kronecker product
- Martens & Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. 2015.

Here is a high-level summary quoted from the K-FAC paper:

“This approximation is built in two stages. In the first, the rows and columns of the Fisher are divided into groups, each of which corresponds to all the weights in a given layer, and this gives rise to a block-partitioning of the matrix. These blocks are then approximated as Kronecker products between much smaller matrices, which we show is equivalent to making certain approximating assumptions regarding the statistics of the network’s gradients.

In the second stage, this matrix is further approximated as having an inverse which is either block-diagonal or block-tridiagonal. We justify this approximation through a careful examination of the relationships between inverse covariances, tree-structured graphical models, and linear regression. Notably, this justification doesn’t apply to the Fisher itself, and our experiments confirm that while the inverse Fisher does indeed possess this structure (approximately), the Fisher itself does not.”
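The practical payoff of the first stage can be sketched in a few lines of NumPy. For one fully connected layer with weight matrix of shape `(out, in)`, the Fisher block is approximated as a Kronecker product of two small matrices: `A`, the second moment of the layer inputs, and `G`, the second moment of the gradients with respect to the layer outputs. Inverting that product then needs only the two small factors. This is a minimal sketch for a single layer in the block-diagonal case; the function names are hypothetical, and adding `damping` to each factor separately is a simplifying assumption of this example rather than the damping scheme of the K-FAC paper.

```python
import numpy as np

def kfac_precondition(grad_W, A, G, damping=1e-2):
    """Apply the inverse of kron(G, A) to a layer gradient of shape (out, in)
    without ever forming the (out*in) x (out*in) Kronecker product.

    Uses the identity (row-major flattening, A symmetric):
        kron(G, A)^{-1} vec(V) = vec(G^{-1} V A^{-1})
    """
    A_d = A + damping * np.eye(A.shape[0])   # (in, in) input second moment
    G_d = G + damping * np.eye(G.shape[0])   # (out, out) output-gradient second moment
    left = np.linalg.solve(G_d, grad_W)      # G^{-1} V
    return np.linalg.solve(A_d, left.T).T    # (G^{-1} V) A^{-1}, using symmetry of A_d

def explicit_precondition(grad_W, A, G, damping=1e-2):
    """Reference version that builds the full Kronecker product (small layers only)."""
    A_d = A + damping * np.eye(A.shape[0])
    G_d = G + damping * np.eye(G.shape[0])
    F = np.kron(G_d, A_d)                    # matches row-major flattening of (out, in)
    return np.linalg.solve(F, grad_W.reshape(-1)).reshape(grad_W.shape)

# Toy usage with random stand-ins for a batch of layer inputs and output gradients.
rng = np.random.default_rng(0)
a = rng.normal(size=(64, 4))                 # layer inputs, batch of 64
g = rng.normal(size=(64, 3))                 # gradients w.r.t. layer outputs
A = a.T @ a / len(a)
G = g.T @ g / len(g)
grad_W = rng.normal(size=(3, 4))
print(np.allclose(kfac_precondition(grad_W, A, G), explicit_precondition(grad_W, A, G)))
```

For a layer with `in` inputs and `out` outputs, the factored version solves one `out x out` and one `in x in` system instead of a single system of size `out*in`, which is what makes the approximation affordable for larger networks.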
