
[RL 6] Deterministic Policy Gradient Algorithms (ICML, 2014)



Stochastic PGT (SPGT)

  1. Theorem
    $$\begin{aligned} \nabla_{\theta} J\left(\pi_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_{\theta} \pi_{\theta}(a \mid s) Q^{\pi}(s, a) \,\mathrm{d} a \,\mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s) Q^{\pi}(s, a)\right] \end{aligned}$$
    • Proof: see https://web.stanford.edu/class/cme241/lecture_slides/PolicyGradient.pdf
  2. PGT derived algorithms (a minimal sketch of the on-policy update follows this list)
    1. on-policy AC
      1. actor update: PGT
      2. critic update: any TD learning
    2. off-policy AC
      1. actor update: off-policy PGT (TODO: proof, see Degris 2012)
      2. critic update: any TD learning (or, TODO, the more general GAE)
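
A minimal PyTorch-style sketch of the on-policy stochastic actor-critic step described above. All names (`policy_net`, `value_net`, `actor_critic_step`), network sizes, and the use of the TD error as an advantage-style stand-in for $Q^\pi(s,a)$ are my own illustrative assumptions, not the paper's algorithm.

```python
import torch
import torch.nn as nn

# Gaussian policy pi_theta(a|s) and a state-value critic V_w(s) (illustrative sketch).
obs_dim, act_dim, gamma = 4, 2, 0.99

policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = torch.zeros(act_dim, requires_grad=True)          # state-independent log std
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

actor_opt = torch.optim.Adam(list(policy_net.parameters()) + [log_std], lr=1e-3)
critic_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s_next, done):
    """One on-policy update from transitions (s, a, r, s') generated by pi_theta itself."""
    # Critic: one-step TD(0) update towards r + gamma * V(s').
    v = value_net(s)
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * value_net(s_next)
        td_error = td_target - v        # advantage-style estimate of Q(s, a) minus a baseline
    critic_loss = (td_target - v).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: score-function (SPGT) gradient  grad_theta log pi_theta(a|s) * Q-estimate.
    dist = torch.distributions.Normal(policy_net(s), log_std.exp())
    log_prob = dist.log_prob(a).sum(-1)
    actor_loss = -(log_prob * td_error.squeeze(-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example call on a dummy batch of one transition:
s, s_next = torch.randn(1, obs_dim), torch.randn(1, obs_dim)
a, r, done = torch.randn(1, act_dim), torch.tensor([[1.0]]), torch.tensor([[0.0]])
actor_critic_step(s, a, r, s_next, done)
```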

Intuition: DPGT

  1. Greedy policy improvement in GPI
    • argmax over Q is not suitable for a continuous action space
  2. DPGT
    • move the policy in the direction of the gradient of Q, rather than globally maximising Q (see the sketch after this list)
    • same idea as the argmax: adjust the policy toward actions with larger Q values
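
To make "move the policy along the gradient of Q instead of taking an argmax" concrete, here is a tiny self-contained NumPy sketch (my own toy construction, not from the paper): a linear deterministic policy $a = \mu_\theta(s) = \theta^\top s$ and a known quadratic critic $Q(s,a) = -(a - 2 s_0)^2$. The chain-rule update $\theta \leftarrow \theta + \alpha \, \nabla_\theta \mu_\theta(s) \, \nabla_a Q(s,a)|_{a=\mu_\theta(s)}$ pushes $\mu_\theta(s)$ towards the maximiser of Q without ever solving $\arg\max_a Q(s,a)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy critic with a known maximiser: Q(s, a) = -(a - 2*s[0])**2, so argmax_a Q = 2*s[0].
def dQ_da(s, a):
    return -2.0 * (a - 2.0 * s[0])

theta = np.zeros(3)      # linear deterministic policy mu_theta(s) = theta @ s
alpha = 0.1

for step in range(2000):
    s = rng.normal(size=3)
    a = theta @ s                              # a = mu_theta(s)
    # DPG-style update: for a linear policy, grad_theta mu_theta(s) = s.
    theta += alpha * s * dQ_da(s, a)

print(theta)   # approaches [2, 0, 0], i.e. mu_theta(s) -> 2*s[0], the argmax of Q
```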

Formal DPGT

  1. Settings
    • episodic setting
    • with a discount factor
    • for continuing tasks, set $\gamma=1$ and use the state distribution $\mu_\theta(s)$ as in RL Chapter 9
  2. on-policy
    1. Objective
      $$J\left(\mu_{\theta}\right)= \int_{\mathcal{S}} p_{1}(s) V^{\mu_{\theta}}(s) \,\mathrm{d} s$$
    2. Theorem
      1. on-policy DPG
        $$\begin{aligned} \nabla_{\theta} J\left(\mu_{\theta}\right) &=\left.\int_{\mathcal{S}} \rho^{\mu}(s) \nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu}(s, a)\right|_{a=\mu_{\theta}(s)} \mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\mu}}\left[\left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu}(s, a)\right|_{a=\mu_{\theta}(s)}\right] \end{aligned}$$
        where the discounted state distribution is $\rho^\mu(s') = \int_{\mathcal{S}} \sum_{t=1}^{\infty} \gamma^{t-1} p_{1}(s)\, p\left(s \rightarrow s^{\prime}, t, \mu_{\theta}\right) \mathrm{d} s$.
        1. discounted state distribution $\rho^\mu(s)$:
          • Definition: a state distribution; intuitively, the (discounted) probability of encountering state $s$ when following policy $\mu_\theta$.
          • Computation: sum, over all time steps $t$, the probability of reaching $s$ at step $t$, weighted by $\gamma^{t-1}$.
          • Sampling from it: simply interact with the environment using policy $\mu_\theta$; since the policy maximises the accumulated (discounted) reward, later rewards (and the corresponding states) carry smaller weight.
    3. Regularity Conditions
      Regularity conditions A.1: $p(s' \mid s, a)$, $\nabla_{a} p(s' \mid s, a)$, $\mu_{\theta}(s)$, $\nabla_{\theta} \mu_{\theta}(s)$, $r(s, a)$, $\nabla_{a} r(s, a)$, $p_{1}(s)$ are continuous in all parameters and variables $s$, $a$, $s'$ and $x$.
      Regularity conditions A.2: there exist a $b$ and an $L$ such that $\sup_{s} p_{1}(s)<b$, $\sup_{a, s, s'} p(s' \mid s, a)<b$, $\sup_{a, s} r(s, a)<b$, $\sup_{a, s, s'}\left\|\nabla_{a} p(s' \mid s, a)\right\|<L$, and $\sup_{a, s}\left\|\nabla_{a} r(s, a)\right\|<L$.
      • A.1 guarantees that $V^{\mu_\theta}$ is differentiable with respect to $\theta$, and allows the derivation to use
        1. the Leibniz integral rule, to swap differentiation and integration: $\nabla \int \rightarrow \int \nabla$
        2. Fubini's theorem, to swap the order of integration
      • A.2 guarantees that the gradient is bounded
    4. Part of Proof
      $$\begin{aligned} \nabla_{\theta} V^{\mu_{\theta}}(s)=& \nabla_{\theta} Q^{\mu_{\theta}}\left(s, \mu_{\theta}(s)\right) \\ =& \nabla_{\theta}\left(r\left(s, \mu_{\theta}(s)\right)+\int_{\mathcal{S}} \gamma p\left(s^{\prime} \mid s, \mu_{\theta}(s)\right) V^{\mu_{\theta}}\left(s^{\prime}\right) \mathrm{d} s^{\prime}\right) \\ =&\left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} r(s, a)\right|_{a=\mu_{\theta}(s)}+\nabla_{\theta} \int_{\mathcal{S}} \gamma p\left(s^{\prime} \mid s, \mu_{\theta}(s)\right) V^{\mu_{\theta}}\left(s^{\prime}\right) \mathrm{d} s^{\prime} \\ =&\left.\nabla_{\theta} \textcolor{red}{\mu_{\theta}}(s) \nabla_{a} \textcolor{red}{r}(s, a)\right|_{a=\mu_{\theta}(s)} \\ &+\int_{\mathcal{S}} \gamma\left(p\left(s^{\prime} \mid s, \mu_{\theta}(s)\right) \nabla_{\theta} V^{\mu_{\theta}}\left(s^{\prime}\right)+\left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} \textcolor{red}{p}\left(s^{\prime} \mid s, a\right)\right|_{a=\mu_{\theta}(s)} V^{\mu_{\theta}}\left(s^{\prime}\right)\right) \mathrm{d} s^{\prime} \\ =&\left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a}\left(r(s, a)+\int_{\mathcal{S}} \gamma p\left(s^{\prime} \mid s, a\right) V^{\mu_{\theta}}\left(s^{\prime}\right) \mathrm{d} s^{\prime}\right)\right|_{a=\mu_{\theta}(s)} \\ &+\int_{\mathcal{S}} \gamma p\left(s^{\prime} \mid s, \mu_{\theta}(s)\right) \nabla_{\theta} V^{\mu_{\theta}}\left(s^{\prime}\right) \mathrm{d} s^{\prime} \\ =&\left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu_{\theta}}(s, a)\right|_{a=\mu_{\theta}(s)}+\int_{\mathcal{S}} \gamma p\left(s \rightarrow s^{\prime}, 1, \mu_{\theta}\right) \nabla_{\theta} V^{\mu_{\theta}}\left(s^{\prime}\right) \mathrm{d} s^{\prime} \end{aligned}$$
      From the parts highlighted in red, we can see (the remaining unrolling step is sketched after this section):
      1. The action space must be continuous: the theorem requires the function $\mu_{\theta}$ (a mapping $S \rightarrow A$) to be continuous, which requires the action space $A$ to be continuous. Meanwhile, since no function here outputs a state, the state space of the modeled MDP does not need to be continuous; it suffices that every function taking a state as input is defined continuously at each state.
      2. The reward function must be continuous (and differentiable in $a$).
      3. The transition function (a probability density) must be continuous (and differentiable in $a$).
      4. TODO: try to prove the PGT & DPG for a discrete state space & continuous action space.
  3. off-policy DPG
    1. Objective
      $$\begin{aligned} J_{\beta}\left(\mu_{\theta}\right) &=\int_{\mathcal{S}} \rho^{\beta}(s) V^{\mu}(s) \mathrm{d} s \\ &=\int_{\mathcal{S}} \rho^{\beta}(s) Q^{\mu}\left(s, \mu_{\theta}(s)\right) \mathrm{d} s \end{aligned}$$
    2. Theorem
      $$\begin{aligned} \nabla_{\theta} J_{\beta}\left(\mu_{\theta}\right) & \approx \int_{\mathcal{S}} \rho^{\beta}(s) \left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu}(s, a)\right|_{a=\mu_{\theta}(s)} \mathrm{d} s \\ &=\mathbb{E}_{s \sim \rho^{\beta}}\left[\left.\nabla_{\theta} \mu_{\theta}(s) \nabla_{a} Q^{\mu}(s, a)\right|_{a=\mu_{\theta}(s)}\right] \end{aligned}$$
      (The $\approx$ comes from dropping the term involving $\nabla_{\theta} Q^{\mu}(s, a)$, i.e. the dependence of the action value on the policy parameters, following Degris 2012.)
      1. Compared with the off-policy SPG expectation, the DPG version has no importance sampling ratio. This is because DPG does not integrate over the action space, so there is no need to rewrite such an integral as an expectation (with an importance weight) in order to estimate the gradient by sampling actions.
      2. This equation gives the actor update for off-policy DPG; to obtain full off-policy control, the critic must also be able to learn from off-policy data.
    3. Proof
      1. TODO: may be partly supported by Degris 2012
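
To connect the "Part of Proof" recursion above with the stated on-policy theorem, the following is my reconstruction of the remaining unrolling step (written with the convention $p(s \rightarrow s', 0, \mu) = \delta(s' - s)$, so that $t$ counts transitions; the theorem's $\rho^\mu$ is the same object up to this time-indexing convention). Conditions A.1/A.2 justify exchanging $\nabla$, $\sum$ and $\int$.

```latex
% Abbreviate g(s) := \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)}.
% Iterating  \nabla_\theta V^{\mu_\theta}(s) = g(s)
%            + \gamma \int_S p(s \to s', 1, \mu_\theta) \nabla_\theta V^{\mu_\theta}(s')\, ds'   gives
\begin{aligned}
\nabla_{\theta} V^{\mu_{\theta}}(s)
  &= \sum_{t=0}^{\infty} \gamma^{t} \int_{\mathcal{S}} p\left(s \rightarrow s', t, \mu_{\theta}\right) g(s')\, \mathrm{d} s' .
\end{aligned}
% Average over the start-state distribution p_1 and swap the order of integration (Fubini):
\begin{aligned}
\nabla_{\theta} J(\mu_{\theta})
  &= \int_{\mathcal{S}} p_{1}(s)\, \nabla_{\theta} V^{\mu_{\theta}}(s)\, \mathrm{d} s \\
  &= \int_{\mathcal{S}} \underbrace{\left( \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^{t}\, p_{1}(s)\, p\left(s \rightarrow s', t, \mu_{\theta}\right) \mathrm{d} s \right)}_{\rho^{\mu}(s')} g(s')\, \mathrm{d} s' \\
  &= \mathbb{E}_{s \sim \rho^{\mu}}\!\left[\left.\nabla_{\theta} \mu_{\theta}(s)\, \nabla_{a} Q^{\mu}(s, a)\right|_{a=\mu_{\theta}(s)}\right].
\end{aligned}
```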

DPGT Derived AC Algorithms

  1. on-policy AC
    1. actor update: DPGT
    2. critic update: SARSA TD learning
  2. off-policy AC
    1. actor update: off-DPGT
    2. critic update: Q-learning (off-policy TD learning)
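
As a concrete illustration of the off-policy deterministic actor-critic above, here is a minimal PyTorch-style sketch: the critic is trained with a Q-learning-style TD target that bootstraps from $Q(s', \mu_\theta(s'))$, and the actor follows the deterministic policy gradient. All names (`mu_net`, `q_net`, `dpg_step`) and hyperparameters are my own; the paper's COPDAC-Q additionally uses a compatible function approximator for the critic, which this sketch replaces with a generic Q network.

```python
import torch
import torch.nn as nn

# Deterministic actor mu_theta(s) and action-value critic Q_w(s, a) (illustrative sketch).
obs_dim, act_dim, gamma = 4, 2, 0.99

mu_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))

actor_opt = torch.optim.Adam(mu_net.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q(s, a):
    return q_net(torch.cat([s, a], dim=-1))

def dpg_step(s, a, r, s_next, done):
    """One off-policy update from transitions collected by an arbitrary behaviour policy beta."""
    # Critic: Q-learning-style TD target, bootstrapping with the target policy's action mu(s').
    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * q(s_next, mu_net(s_next))
    critic_loss = (td_target - q(s, a)).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient. Maximising Q(s, mu(s)) by backprop computes
    # grad_theta mu_theta(s) * grad_a Q(s, a)|_{a=mu_theta(s)} via the chain rule.
    # Note: no importance sampling ratio appears; only states sampled from beta are needed.
    actor_loss = -q(s, mu_net(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example call on a dummy batch of 32 off-policy transitions:
B = 32
s, s_next = torch.randn(B, obs_dim), torch.randn(B, obs_dim)
a = torch.randn(B, act_dim)                    # actions from some behaviour policy beta
r, done = torch.randn(B, 1), torch.zeros(B, 1)
dpg_step(s, a, r, s_next, done)
```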