Trust Region Policy Optimization (ICML, 2015)
1 Introduction
- policy optimization categories
- policy iteration (GPI)
- policy gradient (PG) methods (e.g. TRPO)
- derivative-free optimization methods
2 Preliminaries
- Consider an infinite-horizon discounted MDP
- instead of an average-reward one
- objective $\eta(\tilde{\pi})$ of a new policy $\tilde{\pi}$, written in terms of the old policy $\pi$ (Kakade & Langford, 2002, TODO)
$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_{0}, a_{0}, \cdots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}\left(s_{t}, a_{t}\right)\right] = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$
- improving the expectation on the right-hand side leads to policy improvement
- but $\rho_{\tilde{\pi}}(s)$ depends on the new policy $\tilde{\pi}$ and is hard to estimate
- so use a local approximation instead
- local approximation $L_{\pi}(\tilde{\pi})$ of the objective (Kakade & Langford, 2002, TODO)
$$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$
- matches $\eta(\tilde{\pi})$ to first order (the first-order conditions are written out after this list)
- lower bound on the improvement (conservative policy iteration, Kakade & Langford, 2002)
- but it only applies to mixture policies, i.e. a limited form of policy improvement
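The first-order match mentioned above means that, for a parameterized policy $\pi_\theta$, the local approximation and the true objective agree at the current parameters and have equal gradients there:

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \left.\nabla_{\theta} L_{\pi_{\theta_0}}(\pi_{\theta})\right|_{\theta=\theta_0} = \left.\nabla_{\theta} \eta(\pi_{\theta})\right|_{\theta=\theta_0}$$

so a sufficiently small step away from $\theta_0$ that improves $L_{\pi_{\theta_0}}$ also improves $\eta$, but this alone does not say how big that step may be.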
3 Monotonic Improvement Guarantee for General Stochastic Policies
- Theorem 1: lower bound for general stochastic policies
$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}}, \; \epsilon = \max_{s, a}\left|A_{\pi}(s, a)\right|$$
- the KL term can be seen as a penalty
- $C$ is a large constant, which leads to a very small step size if used as a penalty coefficient
- Algorithm 1: monotonic policy iteration (the improvement argument is sketched after this list)
- the advantage values need to be accurate (the guarantee assumes exact advantages)
- can be extended to continuous state and action spaces
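The guarantee behind Algorithm 1 follows from Theorem 1: define the penalized surrogate $M_i(\pi) = L_{\pi_i}(\pi) - C D_{\mathrm{KL}}^{\max}(\pi_i, \pi)$ and let $\pi_{i+1}$ maximize it. Then

$$\eta(\pi_{i+1}) \geq M_i(\pi_{i+1}) \geq M_i(\pi_i) = \eta(\pi_i)$$

so the true objective $\eta$ is non-decreasing at every iteration (a minorization-maximization argument).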
4 Optimization of Parameterized Policies
- Approximations
- trust region constraint
- the step size in Algorithm 1 could be very small
- so change the KL penalty into a constraint
- average KL
- the $\max_s$ operation in $D_{\mathrm{KL}}^{\max}$ would require traversing the whole state space
- so use the average KL instead (a sample-based sketch follows this list)
$$\bar{D}_{\mathrm{KL}}^{\rho_{\mathrm{old}}}\left(\theta_{1}, \theta_{2}\right) := \mathbb{E}_{\textcolor{red}{s \sim \rho_{\mathrm{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{1}}(\cdot \mid s) \,\|\, \pi_{\theta_{2}}(\cdot \mid s)\right)\right]$$
- trust region constraint
- the optimization problem up to now
$$\begin{aligned} &\underset{\theta}{\operatorname{maximize}} \; \sum_{s} \rho_{\theta_{\text{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s) A_{\theta_{\text{old}}}(s, a) \\ &\text{subject to } \; \bar{D}_{\mathrm{KL}}^{\rho_{\text{old}}}\left(\theta_{\text{old}}, \theta\right) \leq \delta \end{aligned}$$
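A minimal numerical sketch of the average-KL constraint above, assuming tabular categorical policies stored as `[num_states, num_actions]` probability arrays and a batch of states sampled while running the old policy (all names below are illustrative, not the paper's code):

```python
import numpy as np

def average_kl(pi_old, pi_new, sampled_states):
    """Monte Carlo estimate of E_{s ~ rho_old}[ KL(pi_old(.|s) || pi_new(.|s)) ]."""
    p = pi_old[sampled_states]   # [batch, num_actions] action probabilities under the old policy
    q = pi_new[sampled_states]   # [batch, num_actions] action probabilities under the new policy
    kl_per_state = np.sum(p * (np.log(p) - np.log(q)), axis=1)
    return kl_per_state.mean()
```

In the algorithm this estimate plays the role of $\bar{D}_{\mathrm{KL}}^{\rho_{\text{old}}}(\theta_{\text{old}}, \theta)$ in the constraint $\leq \delta$.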
5 Sample-Based Estimation of the Objective and Constraint
$$\begin{aligned} &\underset{\theta}{\operatorname{maximize}} \; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim q}\left[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{\mathrm{old}}}(s, a)\right] \\ &\text{subject to } \; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta \end{aligned}$$
- $s \sim \rho_{\theta_{\mathrm{old}}}$
- by the definition of expectation
- $a \sim q$
- by importance sampling
- $Q_{\theta_{\mathrm{old}}}(s, a)$ replaces $A_{\theta_{\mathrm{old}}}(s, a)$
- by the fact that
- $A(s, a) = Q(s, a) - V(s)$
- the $V_{\theta_{\mathrm{old}}}(s)$ term does not depend on $\theta$, so it only shifts the objective by a constant
- so that Monte Carlo rollouts can be used to estimate Q (the advantage itself may have no proper Monte Carlo estimator); a single-path sketch follows this list
- estimation methods
- single path
- Vine
- gives better (lower-variance) estimates
- but requires more calls to the simulator
- limited to systems where the state can be restored to arbitrary points (typically simulation)
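A minimal single-path sketch of the sample-based objective above, where the sampling distribution is $q = \pi_{\theta_{\mathrm{old}}}$ and Q is estimated by discounted Monte Carlo returns along each trajectory (array names are illustrative):

```python
import numpy as np

def mc_q_estimates(rewards, gamma=0.99):
    """Discounted return from every time step, used as a Monte Carlo estimate of Q(s_t, a_t)."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def surrogate_objective(logp_new, logp_old, q_values):
    """Single-path estimate of E[ pi_theta(a|s) / pi_old(a|s) * Q_old(s,a) ], with q = pi_old."""
    ratio = np.exp(logp_new - logp_old)   # per-sample importance weights
    return np.mean(ratio * q_values)
```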
6 Practical Algorithm
- collect trajectories (single path or vine) and estimate Q-values from them
- calculate the sample-based objective and its constraint
- approximately solve the constrained problem with the conjugate gradient algorithm, followed by a line search (TODO; a CG sketch follows this list)
- references
- paper Appendix C
- post with the theorem and an implementation: https://www.telesens.co/2018/06/09/efficiently-computing-the-fisher-vector-product-in-trpo/
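A minimal sketch of the conjugate gradient step referenced above: it approximately solves $F x = g$ for the step direction given only a Fisher-vector-product function (how that product is computed efficiently is the topic of the telesens post; the function name `fvp` and the defaults here are illustrative):

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, where fvp(v) returns the Fisher-vector product F v."""
    x = np.zeros_like(g)
    r = g.copy()              # residual r = g - F x (x starts at 0)
    p = r.copy()              # search direction
    r_dot_r = r.dot(r)
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot_r / (p.dot(Fp) + 1e-12)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot_r = r.dot(r)
        if new_r_dot_r < tol:
            break
        p = r + (new_r_dot_r / r_dot_r) * p
        r_dot_r = new_r_dot_r
    return x
```

The resulting direction is then scaled so the quadratic KL estimate meets $\delta$, and a backtracking line search on the surrogate completes the update (paper Appendix C).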
8 Experiments
- TRPO is related to prior methods (e.g. the natural policy gradient) but makes several changes, most notably by using a fixed KL divergence constraint rather than a fixed penalty coefficient
- these results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress, compared to using a fixed penalty
- can obtain high-quality locomotion controllers from scratch, which is considered to be a hard problem.
- the proposed method is scalable and has strong theoretical foundations