Trust Region Policy Optimization (ICML, 2015)
1 Introduction
- policy optimization categories
- policy iteration (GPI)
- policy gradient (PG) methods (e.g. TRPO)
- derivative-free optimization methods
2 Preliminaries
- Consider an infinite-horizon discounted MDP
- instead of an average-reward one
- objective $\eta(\tilde{\pi})$ of a new policy $\tilde{\pi}$, written in terms of the old policy $\pi$ (Kakade & Langford, 2002, TODO)
$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_{0}, a_{0}, \cdots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}\left(s_{t}, a_{t}\right)\right] = \eta(\pi) + \sum_{s} \rho_{\tilde{\pi}}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$
- improving the expectation on the right-hand side leads to policy improvement
- but $\rho_{\tilde{\pi}}(s)$ depends on the new policy $\tilde{\pi}$ and is hard to estimate
- so use a local approximation instead
- local approximation $L_{\pi}(\tilde{\pi})$ of the objective (Kakade & Langford, 2002, TODO)
$$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$$
- matches $\eta(\tilde{\pi})$ to first order (the first-order conditions are written out after this list)
- lower bound on the improvement (conservative policy iteration, Kakade & Langford, 2002)
- but it only applies to mixture policies, i.e. a limited form of policy improvement
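The first-order match mentioned above means that, for a parameterized policy $\pi_\theta$, the local approximation and the true objective agree at the current parameters and have equal gradients there:

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \left.\nabla_{\theta} L_{\pi_{\theta_0}}(\pi_{\theta})\right|_{\theta=\theta_0} = \left.\nabla_{\theta} \eta(\pi_{\theta})\right|_{\theta=\theta_0}$$

so a sufficiently small step away from $\theta_0$ that improves $L_{\pi_{\theta_0}}$ also improves $\eta$, but this alone does not say how big that step may be.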
3 Monotonic Improvement Guarantee for General Stochastic Policies
- Theorem 1: lower bound for general stochastic policies
$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - C D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4 \epsilon \gamma}{(1-\gamma)^{2}}, \; \epsilon = \max_{s, a}\left|A_{\pi}(s, a)\right|$$
- the KL term can be seen as a penalty
- $C$ is a large constant, which leads to a very small step size if used as a penalty coefficient
- Algorithm 1: monotonic policy iteration (the improvement argument is sketched after this list)
- the advantage values need to be accurate (the guarantee assumes exact advantages)
- can be extended to continuous state and action spaces
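The guarantee behind Algorithm 1 follows from Theorem 1: define the penalized surrogate $M_i(\pi) = L_{\pi_i}(\pi) - C D_{\mathrm{KL}}^{\max}(\pi_i, \pi)$ and let $\pi_{i+1}$ maximize it. Then

$$\eta(\pi_{i+1}) \geq M_i(\pi_{i+1}) \geq M_i(\pi_i) = \eta(\pi_i)$$

so the true objective $\eta$ is non-decreasing at every iteration (a minorization-maximization argument).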
4 Optimization of Parameterized Policies
- Approximations
- trust region constraint
- the step size in Algorithm 1 could be very small
- so change the KL penalty into a constraint
- average KL
- the $\max_s$ operation in $D_{\mathrm{KL}}^{\max}$ would require traversing the whole state space
- so use the average KL instead (a sample-based sketch follows this list)
$$\bar{D}_{\mathrm{KL}}^{\rho_{\mathrm{old}}}\left(\theta_{1}, \theta_{2}\right) := \mathbb{E}_{\textcolor{red}{s \sim \rho_{\mathrm{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{1}}(\cdot \mid s) \,\|\, \pi_{\theta_{2}}(\cdot \mid s)\right)\right]$$
- trust region constraint
- the optimization problem up to now
$$\begin{aligned} &\underset{\theta}{\operatorname{maximize}} \; \sum_{s} \rho_{\theta_{\text{old}}}(s) \sum_{a} \pi_{\theta}(a \mid s) A_{\theta_{\text{old}}}(s, a) \\ &\text{subject to } \; \bar{D}_{\mathrm{KL}}^{\rho_{\text{old}}}\left(\theta_{\text{old}}, \theta\right) \leq \delta \end{aligned}$$
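A minimal numerical sketch of the average-KL constraint above, assuming tabular categorical policies stored as `[num_states, num_actions]` probability arrays and a batch of states sampled while running the old policy (all names below are illustrative, not the paper's code):

```python
import numpy as np

def average_kl(pi_old, pi_new, sampled_states):
    """Monte Carlo estimate of E_{s ~ rho_old}[ KL(pi_old(.|s) || pi_new(.|s)) ]."""
    p = pi_old[sampled_states]   # [batch, num_actions] action probabilities under the old policy
    q = pi_new[sampled_states]   # [batch, num_actions] action probabilities under the new policy
    kl_per_state = np.sum(p * (np.log(p) - np.log(q)), axis=1)
    return kl_per_state.mean()
```

In the algorithm this estimate plays the role of $\bar{D}_{\mathrm{KL}}^{\rho_{\text{old}}}(\theta_{\text{old}}, \theta)$ in the constraint $\leq \delta$.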
5 Sample-Based Estimation of the Objective and Constraint
$$\begin{aligned} &\underset{\theta}{\operatorname{maximize}} \; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim q}\left[\frac{\pi_{\theta}(a \mid s)}{q(a \mid s)} Q_{\theta_{\mathrm{old}}}(s, a)\right] \\ &\text{subject to } \; \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right] \leq \delta \end{aligned}$$
- $s \sim \rho_{\theta_{\mathrm{old}}}$
- by the definition of expectation
- $a \sim q$
- by importance sampling
- $Q_{\theta_{\mathrm{old}}}(s, a)$ replaces $A_{\theta_{\mathrm{old}}}(s, a)$
- by the fact that
- $A(s, a) = Q(s, a) - V(s)$
- the $V_{\theta_{\mathrm{old}}}(s)$ term does not depend on $\theta$, so it only shifts the objective by a constant
- so that Monte Carlo rollouts can be used to estimate Q (the advantage itself may have no proper Monte Carlo estimator); a single-path sketch follows this list
- estimation methods
- single path
- Vine
- gives better (lower-variance) estimates
- but requires more calls to the simulator
- limited to systems where the state can be restored to arbitrary points (typically simulation)
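A minimal single-path sketch of the sample-based objective above, where the sampling distribution is $q = \pi_{\theta_{\mathrm{old}}}$ and Q is estimated by discounted Monte Carlo returns along each trajectory (array names are illustrative):

```python
import numpy as np

def mc_q_estimates(rewards, gamma=0.99):
    """Discounted return from every time step, used as a Monte Carlo estimate of Q(s_t, a_t)."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def surrogate_objective(logp_new, logp_old, q_values):
    """Single-path estimate of E[ pi_theta(a|s) / pi_old(a|s) * Q_old(s,a) ], with q = pi_old."""
    ratio = np.exp(logp_new - logp_old)   # per-sample importance weights
    return np.mean(ratio * q_values)
```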
6 Practical Algorithm
- collect trajectories (single path or vine) and estimate Q-values from them
- calculate the sample-based objective and its constraint
- approximately solve the constrained problem with the conjugate gradient algorithm, followed by a line search (TODO; a CG sketch follows this list)
- references
- paper Appendix C
- post with the theorem and an implementation: https://www.telesens.co/2018/06/09/efficiently-computing-the-fisher-vector-product-in-trpo/
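A minimal sketch of the conjugate gradient step referenced above: it approximately solves $F x = g$ for the step direction given only a Fisher-vector-product function (how that product is computed efficiently is the topic of the telesens post; the function name `fvp` and the defaults here are illustrative):

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, where fvp(v) returns the Fisher-vector product F v."""
    x = np.zeros_like(g)
    r = g.copy()              # residual r = g - F x (x starts at 0)
    p = r.copy()              # search direction
    r_dot_r = r.dot(r)
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot_r / (p.dot(Fp) + 1e-12)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot_r = r.dot(r)
        if new_r_dot_r < tol:
            break
        p = r + (new_r_dot_r / r_dot_r) * p
        r_dot_r = new_r_dot_r
    return x
```

The resulting direction is then scaled so the quadratic KL estimate meets $\delta$, and a backtracking line search on the surrogate completes the update (paper Appendix C).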
8 Experiments
- TRPO is related to prior methods (e.g. the natural policy gradient) but makes several changes, most notably by using a fixed KL divergence constraint rather than a fixed penalty coefficient
- these results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress, compared to using a fixed penalty
- can obtain high-quality locomotion controllers from scratch, which is considered to be a hard problem.
- the proposed method is scalable and has strong theoretical foundations