
[RL 9] Trust Region Policy Optimization (ICML, 2015)



1 Introduction

  1. policy optimization categories
    1. policy iteration (GPI)
    2. policy gradient methods (e.g. TRPO)
    3. derivative-free optimization methods

2 Preliminaries

  1. Consider an infinite-horizon discounted MDP
    1. instead of an average-reward one
  2. the performance \eta(\tilde{\pi}) of a new policy \tilde{\pi}, expressed in terms of the old policy \pi (Kakade & Langford, 2002)
    \eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_{0}, a_{0}, \cdots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_{t}, a_{t})\right] = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)
    1. a nonnegative expected advantage at every state guarantees policy improvement, \eta(\tilde{\pi}) \geq \eta(\pi)
    2. but the new policy's visitation frequencies \rho_{\tilde{\pi}}(s) are hard to estimate, since they depend on \tilde{\pi} itself
    3. so a local approximation is used instead
  3. local approximation L_{\pi}(\tilde{\pi}) to \eta(\tilde{\pi}) (Kakade & Langford, 2002); see the tabular sketch after this list
    L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)
    1. matches \eta(\tilde{\pi}) to first order in the policy parameters
    2. is the basis of the lower bound derived in Section 3
    3. conservative policy iteration uses it, but only guarantees improvement for a limited class of mixture-policy updates

3 Monotonic Improvement Guarantee for General Stochastic Policies

  1. Theorem 1: lower bound for general stochastic policies
    \eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}}, \ \epsilon = \max_{s, a}\left|A_{\pi}(s, a)\right|
    • the KL term acts as a penalty on large policy changes
    • C is typically a very large constant, so maximizing the penalized objective directly leads to very small step sizes
  2. Algorithm 1: monotonic policy iteration (a tabular sketch of the penalized objective follows this list)
    • the advantage estimates must be accurate for the guarantee to hold
    • can be extended to continuous state and action spaces
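
To make the local approximation concrete, here is a minimal tabular sketch (the helper name and array layout are my own assumptions, not the paper's code): the exact identity for \eta(\tilde{\pi}) and the approximation L_{\pi}(\tilde{\pi}) differ only in which state distribution weights the advantages, \rho_{\tilde{\pi}} versus \rho_{\pi}.

```python
import numpy as np

# Tabular sketch of L_pi(pi_new) = eta(pi) + sum_s rho_pi(s) sum_a pi_new(a|s) A_pi(s, a).
# Compared with the exact identity, only the state weighting changes: rho_pi
# (old policy) is used in place of rho_{pi_new} (new policy).

def local_approximation(eta_pi, rho_pi, pi_new, A_pi):
    """eta_pi : float, performance of the old policy
    rho_pi : (S,) discounted state visitation frequencies under the old policy
    pi_new : (S, A) new policy, pi_new[s, a] = pi_new(a|s)
    A_pi   : (S, A) advantage function of the old policy
    """
    return eta_pi + np.sum(rho_pi[:, None] * pi_new * A_pi)
```

The penalized objective that Algorithm 1 maximizes at each iteration, M(\tilde{\pi}) = L_{\pi}(\tilde{\pi}) - C D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), is by Theorem 1 a lower bound on \eta(\tilde{\pi}), which is what yields the monotonic improvement guarantee. A hedged tabular sketch continuing the setup above (function names are mine):

```python
import numpy as np

def kl_max(pi_old, pi_new, tiny=1e-12):
    """max over states of KL(pi_old(.|s) || pi_new(.|s)) for (S, A) policy tables."""
    kl_per_state = np.sum(pi_old * (np.log(pi_old + tiny) - np.log(pi_new + tiny)), axis=1)
    return kl_per_state.max()

def penalized_surrogate(eta_pi, rho_pi, pi_old, pi_new, A_pi, gamma):
    eps_a = np.abs(A_pi).max()                    # epsilon = max_{s,a} |A_pi(s, a)|
    C = 4 * eps_a * gamma / (1 - gamma) ** 2      # penalty coefficient from Theorem 1
    L = eta_pi + np.sum(rho_pi[:, None] * pi_new * A_pi)
    return L - C * kl_max(pi_old, pi_new)         # lower bound on eta(pi_new)
```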

4 Optimization of Parameterized Policies

  1. Approximations
    1. trust region constraint
      • the step size implied by Algorithm 1's penalty can be very small
      • so the KL penalty is replaced by a hard constraint (a trust region)
    2. average KL
      • the max over states is impractical: it would require traversing the whole state space
      • use average KL instead
        \bar{D}_{\mathrm{KL}}^{\rho_{\mathrm{old}}}(\theta_{1}, \theta_{2}) := \mathbb{E}_{s \sim \rho_{\mathrm{old}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{1}}(\cdot \mid s) \,\|\, \pi_{\theta_{2}}(\cdot \mid s)\right)\right]
  2. optimization problem up to now (see the sketch below)
    \underset{\theta}{\operatorname{maximize}}\ \sum_{s} \rho_{\theta_{\mathrm{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s) A_{\theta_{\mathrm{old}}}(s, a) \quad \text{subject to } \bar{D}_{\mathrm{KL}}^{\rho_{\mathrm{old}}}(\theta_{\mathrm{old}}, \theta) \leq \delta
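
A minimal sketch of the two ingredients of this constrained problem, reusing the tabular (S, A) arrays from the earlier snippets (the function names are my own; in practice both quantities are estimated from samples, as the next section describes):

```python
import numpy as np

def average_kl(rho_old, pi_old, pi_new, tiny=1e-12):
    """Average KL: E_{s ~ rho_old}[ KL(pi_old(.|s) || pi_new(.|s)) ]."""
    kl_per_state = np.sum(pi_old * (np.log(pi_old + tiny) - np.log(pi_new + tiny)), axis=1)
    return float(np.sum(rho_old * kl_per_state))

def surrogate(rho_old, pi_new, A_old):
    """sum_s rho_old(s) sum_a pi_new(a|s) A_old(s, a), the term being maximized."""
    return float(np.sum(rho_old[:, None] * pi_new * A_old))

# The update maximizes surrogate(...) subject to average_kl(...) <= delta.
```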

5 Sample-Based Estimation of the Objective and Constraint

\underset{\theta}{\operatorname{maximize}}\ \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim q}\left[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{\mathrm{old}}}(s, a)\right] \quad \text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta

  1. s \sim \rho_{\theta_{\mathrm{old}}}
    • the sum over states weighted by \rho_{\theta_{\mathrm{old}}} becomes an expectation (by definition of expectation)
  2. a \sim q
    • the sum over actions is rewritten via importance sampling, with q as the sampling distribution
  3. Q_{\theta_{\mathrm{old}}}(s, a) in place of A_{\theta_{\mathrm{old}}}(s, a)
    • by the fact that
      1. A(s, a) = Q(s, a) - V(s)
      2. the resulting \sum_{s} \rho_{\theta_{\mathrm{old}}}(s) V_{\theta_{\mathrm{old}}}(s) term is a constant that does not depend on \theta
    • so a Monte Carlo return can be used to estimate Q (there is no equally direct Monte Carlo estimate of the advantage)
    • estimation methods (see the single-path sketch after this list)
      1. single path: roll out whole trajectories and use the empirical discounted return as the Q estimate
      2. vine: branch several short rollouts from states along a trajectory
        • lower-variance estimates
        • but requires many more calls to the simulator
        • limited to systems whose state can be saved and restored in order to generate the branches
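
For the single-path scheme the sampling distribution q is simply \pi_{\theta_{\mathrm{old}}}, so the importance weight is \pi_{\theta}(a \mid s) / \pi_{\theta_{\mathrm{old}}}(a \mid s), and Q_{\theta_{\mathrm{old}}}(s_t, a_t) is estimated by the discounted return from (s_t, a_t) onward. A hedged numpy sketch (array names such as log_p_new are my own assumptions):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Monte Carlo estimate of Q_theta_old(s_t, a_t) along one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def sampled_surrogate(log_p_new, log_p_old, q_est):
    """Single-path estimate: mean over samples of the importance weight times the Q estimate.

    log_p_new, log_p_old : per-sample log pi_theta(a|s) and log pi_theta_old(a|s)
    q_est                : per-sample Monte Carlo returns (e.g. from discounted_returns)
    """
    ratio = np.exp(log_p_new - log_p_old)
    return float(np.mean(ratio * q_est))
```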

6 Practical Algorithm

  1. Estimate Q from the collected trajectories (single path or vine)
  2. Calculate the sample-based surrogate objective and its KL constraint
  3. Approximately solve the constrained problem with the conjugate gradient algorithm followed by a line search (see the sketch after the references below)
    • references
      1. paper Appendix C
      2. a post with the theory and an implementation: https://www.telesens.co/2018/06/09/efficiently-computing-the-fisher-vector-product-in-trpo/
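
A hedged sketch of the conjugate gradient routine that this step relies on: it solves F x = g for the step direction using only Fisher-vector products, never forming F explicitly. The `fvp` callable (which should return F·v, e.g. via the Pearlmutter trick described in the post above) and the `damping`-free signature are my own assumptions, not code from the paper.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, where fvp(v) returns the matrix-vector product F @ v.

    TRPO then rescales the resulting direction x so that the quadratic
    approximation of the KL divergence equals delta, and line-searches along it.
    """
    x = np.zeros_like(g)
    r = g.copy()              # residual g - F x (x is zero initially)
    p = r.copy()              # current search direction
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```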

8 Experiments

  1. TRPO is related to prior methods (e.g. the natural policy gradient) but makes several changes, most notably by using a fixed KL divergence rather than a fixed penalty coefficient.
    • These results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress, compared to using a fixed penalty.
  2. can obtain high-quality locomotion controllers from scratch, which is considered to be a hard problem.
  3. the proposed method is scalable and has strong theoretical foundations.