
[RL 7] Deep Deterministic Policy Gradient (DDPG) (ICLR, 2016)


Deep Deterministic Policy Gradient (ICLR, 2016)

0.Abstract

  1. “end-to-end” learning: directly from raw pixel inputs

1.Introduction

  1. DQN is not naturally suitable for continuous action spaces

2.Background

  1. Bellman equation
    1. Stochastic Policy
      Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[Q^{\pi}(s_{t+1}, a_{t+1})\right]\right]
    2. Deterministic Policy
      Q^{\mu}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[r(s_t, a_t) + \gamma\, Q^{\mu}\left(s_{t+1}, \mu(s_{t+1})\right)\right]
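  2. From the deterministic Bellman equation to DDPG's losses (written roughly in the paper's notation; the expectation over the off-policy state distribution is left implicit): the critic Q(s, a | θ^Q) is regressed onto the Bellman target, and the actor is updated with the deterministic policy gradient
      y_t = r(s_t, a_t) + \gamma\, Q\big(s_{t+1}, \mu(s_{t+1} \mid \theta^{\mu}) \mid \theta^{Q}\big)
      L(\theta^{Q}) = \mathbb{E}\big[\big(Q(s_t, a_t \mid \theta^{Q}) - y_t\big)^{2}\big]
      \nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{a=\mu(s)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big]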

3.Algorithm

  1. Method
    Approximate, with deep networks, the two components of DPG's off-policy actor-critic (see the sketch after the techniques list below):
    1. actor μ(s)
    2. critic Q(s,a)
  2. techniques
    1. non-linear approximation
      no convergence guarantee, but essential in order to learn and generalize on large state spaces
    2. experience replay
    3. target (soft update)
      1. required to have stable targets y in order to consistently train the critic without divergence
    4. batch normalization
      1. in the low-dimensional case: applied to the state input and the inner layers
      2. allows learning different environments with the same algorithm settings
    5. exploration
      1. Ornstein-Uhlenbeck (OU) process noise added to the actor's output
    6. reparametrization trick
      • μ(s) outputs the mean action
      • a = μ(s) + N (additive exploration noise)
      • avoids sampling from a distribution
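  3. sketch
    A minimal PyTorch sketch of how the pieces above fit together in a single update step: the actor/critic approximators, OU exploration noise, the replay-batch critic target, and the soft target update (batch normalization is omitted for brevity). Network sizes, noise parameters, tensor shapes, and variable names are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Deterministic policy mu(s): maps a state to an action in [-1, 1]."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded actions
        )

    def forward(self, s):
        return self.net(s)


class Critic(nn.Module):
    """Action-value function Q(s, a)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, action_dim, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(action_dim)

    def sample(self):
        dx = -self.theta * self.x * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x


def soft_update(target, source, tau=0.005):
    """theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)


def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One gradient step on a replay batch (s, a, r, s2, done); r and done
    are assumed to be float tensors of shape (batch_size, 1)."""
    s, a, r, s2, done = batch

    # Critic: regress Q(s, a) onto the deterministic Bellman target,
    # computed with the slowly moving target networks for stability.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_t(s2, actor_t(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend Q(s, mu(s)), i.e. the deterministic policy gradient.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft-update the target networks so the targets y change slowly.
    soft_update(critic_t, critic, tau)
    soft_update(actor_t, actor, tau)
```

    A typical setup creates the target networks as exact copies before training (e.g. with copy.deepcopy), and at interaction time takes a = actor(s) + noise.sample(), clipped to the valid action range, before storing the transition in the replay buffer.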

4.Results

  1. target network is necessary for good performance
  2. Value estimation
    • It can be challenging to learn accurate value estimates in harder tasks, but DDPG still learns good policies
  3. BN brings an improvement in most of the experiments
  4. learning speed
    requires fewer steps of experience than DQN needed to find solutions in the Atari domain

5.Related Work

  1. TRPO
    1. does not require learning an action-value function, and (perhaps as a result) appears to be significantly less data efficient; learning a Q function can be viewed as a way to reuse data

Supplementary

  1. EXPERIMENT DETAILS
    1. actor and critic use different learning rates
    2. L2 weight decay TODO
    3. tanh output layer for the actor (to bound actions)
    4. layer initialization
    5. position at which the action is fed into the critic (see the sketch at the end of this section)
  2. MUJOCO ENVIRONMENTS introduction
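  3. network sketch
    A minimal PyTorch sketch of the low-dimensional actor/critic layout the experiment details describe: two hidden layers, a tanh output for the actor, the action entering the critic at the second hidden layer, fan-in initialization for the hidden layers, a small uniform init for the final layers, separate Adam learning rates, and L2 weight decay on the critic. The concrete numbers (400/300 units, 1e-4 / 1e-3 learning rates, 3e-3 init bound, 1e-2 weight decay) are quoted from memory of the paper's appendix, so verify them against the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fan_in_init(layer):
    """Init weights and biases uniformly in [-1/sqrt(fan_in), 1/sqrt(fan_in)]."""
    bound = 1.0 / layer.in_features ** 0.5
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)


class LowDimActor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)
        fan_in_init(self.l1); fan_in_init(self.l2)
        nn.init.uniform_(self.l3.weight, -3e-3, 3e-3)  # small final-layer init
        nn.init.uniform_(self.l3.bias, -3e-3, 3e-3)

    def forward(self, s):
        x = F.relu(self.l1(s))
        x = F.relu(self.l2(x))
        return torch.tanh(self.l3(x))  # tanh bounds the action


class LowDimCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400 + action_dim, 300)  # action enters at 2nd layer
        self.l3 = nn.Linear(300, 1)
        fan_in_init(self.l1); fan_in_init(self.l2)
        nn.init.uniform_(self.l3.weight, -3e-3, 3e-3)
        nn.init.uniform_(self.l3.bias, -3e-3, 3e-3)

    def forward(self, s, a):
        x = F.relu(self.l1(s))
        x = F.relu(self.l2(torch.cat([x, a], dim=-1)))
        return self.l3(x)


# Separate optimizer settings, with L2 weight decay only on the critic:
# actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
# critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3, weight_decay=1e-2)
```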