
[RL 7] Deep Deterministic Policy Gradient (DDPG) (ICLR, 2016)


Deep Deterministic Policy Gradient (ICLR, 2016)

0.Abstract

  1. “end-to-end” learning: directly from raw pixel inputs

1.Introduction

  1. DQN is not naturally suitable for continuous action spaces

2.Background

  1. Bellman equation
    1. Stochastic Policy
      Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[Q^{\pi}(s_{t+1}, a_{t+1})\right]\right]
    2. Deterministic Policy
      Q^{\mu}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[r(s_t, a_t) + \gamma\, Q^{\mu}\left(s_{t+1}, \mu(s_{t+1})\right)\right]
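  2. From the deterministic Bellman equation to DDPG's losses (written roughly in the paper's notation; the expectation over the off-policy state distribution is left implicit): the critic Q(s, a | θ^Q) is regressed onto the Bellman target, and the actor is updated with the deterministic policy gradient
      y_t = r(s_t, a_t) + \gamma\, Q\big(s_{t+1}, \mu(s_{t+1} \mid \theta^{\mu}) \mid \theta^{Q}\big)
      L(\theta^{Q}) = \mathbb{E}\big[\big(Q(s_t, a_t \mid \theta^{Q}) - y_t\big)^{2}\big]
      \nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[\nabla_{a} Q(s, a \mid \theta^{Q})\big|_{a=\mu(s)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big]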

3.Algorithm

  1. Method
    Approximate, with deep networks, the two components of DPG's off-policy actor-critic (see the sketch after the techniques list below):
    1. actor μ(s)
    2. critic Q(s,a)
  2. techniques
    1. non-linear approximation
      no convergence guarantee, but essential in order to learn and generalize on large state spaces
    2. experience replay
    3. target (soft update)
      1. required to have stable targets y in order to consistently train the critic without divergence
    4. batch normalization
      1. in the low-dimensional case: applied to the state input and the inner layers
      2. allows learning different environments with the same algorithm settings
    5. exploration
      1. Ornstein-Uhlenbeck (OU) process noise added to the actor's output
    6. reparametrization trick
      • μ(s) outputs the mean action
      • a = μ(s) + N (additive exploration noise)
      • avoids sampling from a distribution
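  3. sketch
    A minimal PyTorch sketch of how the pieces above fit together in a single update step: the actor/critic approximators, OU exploration noise, the replay-batch critic target, and the soft target update (batch normalization is omitted for brevity). Network sizes, noise parameters, tensor shapes, and variable names are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Deterministic policy mu(s): maps a state to an action in [-1, 1]."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded actions
        )

    def forward(self, s):
        return self.net(s)


class Critic(nn.Module):
    """Action-value function Q(s, a)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, action_dim, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(action_dim)

    def sample(self):
        dx = -self.theta * self.x * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x


def soft_update(target, source, tau=0.005):
    """theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)


def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One gradient step on a replay batch (s, a, r, s2, done); r and done
    are assumed to be float tensors of shape (batch_size, 1)."""
    s, a, r, s2, done = batch

    # Critic: regress Q(s, a) onto the deterministic Bellman target,
    # computed with the slowly moving target networks for stability.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_t(s2, actor_t(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend Q(s, mu(s)), i.e. the deterministic policy gradient.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft-update the target networks so the targets y change slowly.
    soft_update(critic_t, critic, tau)
    soft_update(actor_t, actor, tau)
```

    A typical setup creates the target networks as exact copies before training (e.g. with copy.deepcopy), and at interaction time takes a = actor(s) + noise.sample(), clipped to the valid action range, before storing the transition in the replay buffer.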

4.Results

  1. target network is necessary for good performance
  2. Value estimation
    • It can be challenging to learn accurate value estimates in harder tasks, but DDPG still learns good policies
  3. BN brings an improvement in most of the experiments
  4. learning speed
    requires fewer steps of experience than DQN needed to find solutions in the Atari domain

5.Related Work

  1. TRPO
    1. does not require learning an action-value function, and (perhaps as a result) appears to be significantly less data efficient; learning a Q function can be viewed as a way to reuse data

Supplementary

  1. EXPERIMENT DETAILS
    1. actor and critic use different learning rates
    2. L2 weight decay TODO
    3. tanh output layer for the actor (to bound actions)
    4. layer initialization
    5. position at which the action is fed into the critic (see the sketch at the end of this section)
  2. MUJOCO ENVIRONMENTS introduction
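  3. network sketch
    A minimal PyTorch sketch of the low-dimensional actor/critic layout the experiment details describe: two hidden layers, a tanh output for the actor, the action entering the critic at the second hidden layer, fan-in initialization for the hidden layers, a small uniform init for the final layers, separate Adam learning rates, and L2 weight decay on the critic. The concrete numbers (400/300 units, 1e-4 / 1e-3 learning rates, 3e-3 init bound, 1e-2 weight decay) are quoted from memory of the paper's appendix, so verify them against the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fan_in_init(layer):
    """Init weights and biases uniformly in [-1/sqrt(fan_in), 1/sqrt(fan_in)]."""
    bound = 1.0 / layer.in_features ** 0.5
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)


class LowDimActor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)
        fan_in_init(self.l1); fan_in_init(self.l2)
        nn.init.uniform_(self.l3.weight, -3e-3, 3e-3)  # small final-layer init
        nn.init.uniform_(self.l3.bias, -3e-3, 3e-3)

    def forward(self, s):
        x = F.relu(self.l1(s))
        x = F.relu(self.l2(x))
        return torch.tanh(self.l3(x))  # tanh bounds the action


class LowDimCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400 + action_dim, 300)  # action enters at 2nd layer
        self.l3 = nn.Linear(300, 1)
        fan_in_init(self.l1); fan_in_init(self.l2)
        nn.init.uniform_(self.l3.weight, -3e-3, 3e-3)
        nn.init.uniform_(self.l3.bias, -3e-3, 3e-3)

    def forward(self, s, a):
        x = F.relu(self.l1(s))
        x = F.relu(self.l2(torch.cat([x, a], dim=-1)))
        return self.l3(x)


# Separate optimizer settings, with L2 weight decay only on the critic:
# actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
# critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3, weight_decay=1e-2)
```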