[RL 6] Deterministic Policy Gradient Algorithms (ICML, 2014)
Stochastic Policy Gradient Theorem (SPGT)
Theorem
$$\begin{aligned} \nabla_{\theta} J\left(\pi_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_{\theta} \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\, \mathrm{d} a\, \mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\right] \end{aligned}$$
Proof see: https://web.stanford.edu/class/cme241/lecture_slides/PolicyGradient.pdf
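As a quick sanity check of the expectation form above, the sketch below compares the Monte-Carlo score-function estimate against the analytic gradient for a one-step (bandit) case with a Gaussian policy. The reward $r(a) = -(a-2)^2$ and all names are my own illustrative assumptions, not from the paper.

```python
import numpy as np

# Illustrative one-step check of the score-function (likelihood-ratio) form of the PGT.
# Policy: a ~ N(theta, sigma^2); reward r(a) = -(a - 2)^2, so Q(s, a) reduces to r(a).
rng = np.random.default_rng(0)
theta, sigma, n = 0.5, 1.0, 200_000

a = rng.normal(theta, sigma, size=n)      # sample actions from pi_theta
r = -(a - 2.0) ** 2                       # one-step return
score = (a - theta) / sigma**2            # grad_theta log pi_theta(a)
grad_mc = np.mean(score * r)              # Monte-Carlo estimate of E[grad log pi * Q]

grad_exact = -2.0 * (theta - 2.0)         # analytic gradient of J(theta) = -((theta-2)^2 + sigma^2)
print(grad_mc, grad_exact)                # the two values should be close
```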
PGT-derived algorithms
on-policy AC (a minimal sketch follows after this list)
actor update: PGT
critic update: any TD learning
off-policy AC
actor update: off-policy PGT (TODO: proof, see Degris et al. 2012)
critic update: any TD learning (or, TODO, the more general GAE)
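Below is the minimal on-policy actor-critic sketch referenced above: a tabular softmax actor updated with the stochastic policy gradient and a TD(0) critic, on a toy 5-state chain. The environment, hyperparameters, and the use of the TD error in place of $Q(s,a)$ (a common advantage-style variant) are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Minimal on-policy actor-critic sketch on a toy 5-state chain (illustrative only):
# actor follows the stochastic PGT with the TD error standing in for Q(s, a);
# critic is tabular TD(0).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.99
theta = np.zeros((n_states, n_actions))   # softmax policy parameters
v = np.zeros(n_states)                    # critic: state values
alpha_actor, alpha_critic = 0.1, 0.2

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

for episode in range(2000):
    s = 0
    while True:
        probs = policy(s)
        a = rng.choice(n_actions, p=probs)                           # a ~ pi_theta(.|s)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        done = s_next == n_states - 1
        r = 1.0 if done else 0.0
        # critic: TD(0) update
        td_error = r + (0.0 if done else gamma * v[s_next]) - v[s]
        v[s] += alpha_critic * td_error
        # actor: grad_theta log pi(a|s) = one_hot(a) - probs for a softmax policy
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_actor * td_error * grad_log_pi
        if done:
            break
        s = s_next

print(policy(0))   # should put most probability on the "right" action
```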
Intuition: DPGT
Greedy policy improvement in GPI
argmax Q
is not suitable for a continuous action space
DPGT
move the policy in the direction of the gradient of Q, rather than globally maximising Q.
The idea is the same as argmax: change the policy so that it selects actions with larger Q values.
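Roughly, this intuition corresponds to the following policy-improvement update, which the chain rule decomposes into the DPG form (restated here as an approximation of the paper's policy-improvement step):

$$\theta^{k+1}=\theta^{k}+\alpha\, \mathbb{E}_{s \sim \rho^{\mu^{k}}}\!\left[\nabla_{\theta} Q^{\mu^{k}}\!\left(s, \mu_{\theta}(s)\right)\right] = \theta^{k}+\alpha\, \mathbb{E}_{s \sim \rho^{\mu^{k}}}\!\left[\left.\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu^{k}}(s, a)\right|_{a=\mu_{\theta}(s)}\right]$$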
Formal DPGT
Settings
episodic setting
with discount factor $\gamma$
for a continuing task, set $\gamma=1$ and use the state distribution $\mu_{\theta}(S)$ as in RL chapter 9
on-policy
Objective
$$J\left(\mu_{\theta}\right)= \int_{\mathcal{S}} p_{1}(s)\, V^{\mu_{\theta}}(s)\, \mathrm{d} s$$
Theorem
on-policy DPG
$$\begin{aligned} \nabla_{\theta} J\left(\mu_{\theta}\right) &=\left.\int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\right|_{a=\mu_{\theta}(s)} \mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\mu}}\left[\left.\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\right|_{a=\mu_{\theta}(s)}\right] \end{aligned}$$
where $\rho^{\mu}(s^{\prime}) = \int_{\mathcal{S}} \sum_{t=1}^{\infty} \gamma^{t-1} p_{1}(s)\, p\left(s \rightarrow s^{\prime}, t, \mu\right) \mathrm{d} s$
discounted state distribution $\rho^{\mu}(s)$:
Definition: the state distribution; it can be understood as the (discounted) probability of encountering this state while following policy $\mu_{\theta}$
Computation: sum, over all time steps $t$, the probability of reaching $s$ at step $t$, weighted by $\gamma^{t-1}$
Sampling from this distribution: simply interact with the environment using policy $\mu_{\theta}$; since the policy aims to maximize the accumulated (discounted) reward, later rewards (and the corresponding states) carry smaller weight
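A quick numerical check of the chain-rule form in the theorem above, for a single sampled state: a linear deterministic policy and a toy quadratic critic (the function forms and every name below are illustrative assumptions, not from the paper) give a DPG actor gradient that matches a finite-difference computation of $\nabla_\theta Q(s, \mu_\theta(s))$.

```python
import numpy as np

# Illustrative check of the chain-rule form of the DPG actor gradient,
# grad_theta mu_theta(s) * grad_a Q(s, a) |_{a = mu_theta(s)},
# for a linear deterministic policy mu_theta(s) = theta . s and a toy quadratic critic.
rng = np.random.default_rng(0)
d = 3
theta = rng.normal(size=d)                   # policy parameters
s = rng.normal(size=d)                       # a sampled state (s ~ rho^mu in the theorem)
w1, w2, w3 = rng.normal(size=d), 0.7, -0.3   # critic Q(s, a) = w1.s + w2*a + w3*a^2

def mu(theta, s):
    return theta @ s                          # deterministic action

def Q(s, a):
    return w1 @ s + w2 * a + w3 * a**2

a = mu(theta, s)
grad_a_Q = w2 + 2.0 * w3 * a                  # dQ/da at a = mu_theta(s)
grad_theta_mu = s                             # d mu / d theta for the linear policy
dpg_grad = grad_theta_mu * grad_a_Q           # chain rule: DPG actor gradient for this state

# Finite-difference check that grad_theta Q(s, mu_theta(s)) agrees with the chain rule.
eps = 1e-6
fd = np.array([(Q(s, mu(theta + eps * np.eye(d)[i], s)) -
                Q(s, mu(theta - eps * np.eye(d)[i], s))) / (2 * eps) for i in range(d)])
print(np.allclose(dpg_grad, fd, atol=1e-5))   # True
```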
Regularity Conditions
$$\begin{aligned} &\text{Regularity conditions A.1: } p\left(s^{\prime} \mid s, a\right), \nabla_{a} p\left(s^{\prime} \mid s, a\right), \mu_{\theta}(s), \nabla_{\theta} \mu_{\theta}(s), r(s, a), \nabla_{a} r(s, a), p_{1}(s) \text{ are continuous in all}\\ &\text{parameters and variables } s, a, s^{\prime} \text{ and } x.\\ &\text{Regularity conditions A.2: there exist a } b \text{ and an } L \text{ such that } \sup _{s} p_{1}(s)<b,\ \sup _{a, s, s^{\prime}} p\left(s^{\prime} \mid s, a\right)<b,\ \sup _{a, s} r(s, a)<b,\\ &\sup _{a, s, s^{\prime}}\left\|\nabla_{a} p\left(s^{\prime} \mid s, a\right)\right\|<L, \text{ and } \sup _{a, s}\left\|\nabla_{a} r(s, a)\right\|<L \end{aligned}$$
A.1 guarantees that $V$ is differentiable with respect to $\theta$, and allows the derivation to use
the Leibniz integral rule, to exchange differentiation and integration: $\nabla \int \rightarrow \int \nabla$
Fubini's theorem, to exchange the order of integration
A.2 guarantees that the gradients are bounded
Part of Proof
$$\begin{aligned} \nabla_{\theta} V^{\mu_{\theta}}(s)=& \nabla_{\theta} Q^{\mu_{\theta}}\left(s, \mu_{\theta}(s)\right) \\ =& \nabla_{\theta}\left(r\left(s, \mu_{\theta}(s)\right)+\int_{\mathcal{S}} \gamma p\left(s^{\prime} \mid s, \mu_{\theta}(s)\right) V^{\mu_{\theta}}\left(s^{\prime}\right) \mathrm{d} s^{\prime}\right) \\ =&\left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} r(s, a)\right|_{a=\mu_{\theta}(s)}+\nabla_{\theta} \int_{\mathcal{S}} \gamma p\left(s^{\prime} \mid s, \mu_{\theta}(s)\right) V^{\mu_{\theta}}\left(s^{\prime}\right) \mathrm{d} s^{\prime} \\ =&\left.\nabla_{\theta} \textcolor{red}{\mu_{\theta}}(s) \nabla_{a} \textcolor{red}{r}(s, a)\right|_{a=\mu_{\theta}(s)} \\ &+\int_{\mathcal{S}} \gamma\left(p\left(s^{\prime} \mid s, \mu_{\theta}(s)\right) \nabla_{\theta} V^{\mu_{\theta}}\left(s^{\prime}\right)+\left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} \textcolor{red}{p}\left(s^{\prime} \mid s, a\right)\right|_{a=\mu_{\theta}(s)} V^{\mu_{\theta}}\left(s^{\prime}\right)\right) \mathrm{d} s^{\prime} \\ =&\left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a}\left(r(s, a)+\int_{\mathcal{S}} \gamma p\left(s^{\prime} \mid s, a\right) V^{\mu_{\theta}}\left(s^{\prime}\right) \mathrm{d} s^{\prime}\right)\right|_{a=\mu_{\theta}(s)} \\ &+\int_{\mathcal{S}} \gamma p\left(s^{\prime} \mid s, \mu_{\theta}(s)\right) \nabla_{\theta} V^{\mu_{\theta}}\left(s^{\prime}\right) \mathrm{d} s^{\prime} \\ =&\left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu_{\theta}}(s, a)\right|_{a=\mu_{\theta}(s)}+\int_{\mathcal{S}} \gamma p\left(s \rightarrow s^{\prime}, 1, \mu_{\theta}\right) \nabla_{\theta} V^{\mu_{\theta}}\left(s^{\prime}\right) \mathrm{d} s^{\prime} \end{aligned}$$
From the parts highlighted in red, we can see:
The action space must be continuous: the function $\mu_{\theta}$ (a mapping $\mathcal{S} \rightarrow \mathcal{A}$) must be continuous, which requires the action space $\mathcal{A}$ to be continuous. At the same time, since no function here outputs a state, the state space of the modeled MDP does not have to be continuous; it suffices that the functions taking a state as input are defined on the state space.
The reward function must be continuous
The transition (probability) function must be continuous
TODO: try to prove the PGT and DPG theorem for a discrete state space with a continuous action space
off-policy DPG
Objective
$$\begin{aligned} J_{\beta}\left(\mu_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\beta}(s)\, V^{\mu}(s)\, \mathrm{d} s \\ &=\int_{\mathcal{S}} \rho^{\beta}(s)\, Q^{\mu}\left(s, \mu_{\theta}(s)\right) \mathrm{d} s \end{aligned}$$
Theorem
$$\begin{aligned} \nabla_{\theta} J_{\beta}\left(\mu_{\theta}\right) & \approx \int_{\mathcal{S}} \rho^{\beta}(s)\, \nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\big|_{a=\mu_{\theta}(s)}\, \mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\beta}}\left[\left.\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\right|_{a=\mu_{\theta}(s)}\right] \end{aligned}$$
Compared with the SPG expectation, the DPG expression contains no importance sampling ratio. This is because DPG does not integrate over the action space, so there is no need to rewrite an integral over actions as an expectation in order to estimate the gradient by sampling actions.
This formula gives the off-policy DPG actor update; to achieve off-policy control, the critic must also be able to learn from off-policy data.
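For contrast, the off-policy stochastic actor-critic gradient of Degris et al. (2012), referenced for the proof below, is usually approximated with an importance-sampling ratio, roughly of the form:

$$\nabla_{\theta} J_{\beta}\left(\pi_{\theta}\right) \approx \mathbb{E}_{s \sim \rho^{\beta},\, a \sim \beta}\left[\frac{\pi_{\theta}(a \mid s)}{\beta(a \mid s)}\, \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\right]$$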
Proof
may be partly supported by Degris et al. (2012) (TODO)
DPGT-Derived AC Algorithms
on-policy AC
actor update: DPGT
critic update: SARSA TD learning
off-policy AC (a minimal sketch follows after this list)
actor update: off-policy DPGT
critic update: Q-learning TD learning
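The sketch below is a minimal off-policy deterministic actor-critic in this spirit on a 1-D toy problem: the actor follows the deterministic policy gradient, and the critic is a linear Q function updated with a Q-learning-style target that bootstraps on the target policy's action. The environment, features, and hyperparameters are all illustrative assumptions; this is a didactic sketch rather than the paper's compatible-function-approximation algorithms.

```python
import numpy as np

# Minimal off-policy deterministic actor-critic sketch (illustrative 1-D toy problem):
#   actor  : theta += alpha * grad_theta mu_theta(s) * grad_a Q_w(s, a)|_{a = mu_theta(s)}
#   critic : linear Q_w with a Q-learning-style target r + gamma * Q_w(s', mu_theta(s'))
# Dynamics s' = s + a, reward -(s')^2; the optimal linear gain is a = -s (theta = -1).
rng = np.random.default_rng(0)
gamma, alpha_critic, alpha_actor = 0.9, 0.01, 0.001
theta = 0.0                                     # deterministic policy mu_theta(s) = theta * s
w = np.zeros(6)                                 # critic weights

def features(s, a):                             # quadratic features: Q_w(s, a) = w . phi(s, a)
    return np.array([s * s, s * a, a * a, s, a, 1.0])

for episode in range(3000):
    s = rng.uniform(-1.0, 1.0)
    for t in range(10):
        a = np.clip(theta * s + rng.normal(0.0, 0.3), -2.0, 2.0)   # behaviour policy beta: mu + noise
        s_next = np.clip(s + a, -2.0, 2.0)
        r = -s_next ** 2
        # critic: Q-learning-style TD update, bootstrapping with the *target* policy action
        td_target = r + gamma * w @ features(s_next, theta * s_next)
        td_error = td_target - w @ features(s, a)
        w += alpha_critic * td_error * features(s, a)
        # actor: deterministic policy gradient at this sampled state
        a_mu = theta * s
        dQ_da = w[1] * s + 2.0 * w[2] * a_mu + w[4]   # grad_a Q_w(s, a) at a = mu_theta(s)
        theta += alpha_actor * s * dQ_da              # grad_theta mu_theta(s) = s for this policy
        s = s_next

print(theta)   # should have drifted toward -1 in this toy setup
```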