[RL 7] Deep Deterministic Policy Gradient (DDPG) (ICLR, 2016)
0.Abstract
- “end-to-end” learning: directly from raw pixel inputs
1.Introduction
- DQN is not naturally suited to continuous action spaces
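As a quick illustration of this point, here is a minimal sketch (with hypothetical network sizes of my own choosing): DQN's greedy step is an argmax over a finite set of Q-values, which has no direct analogue for a continuous action space, whereas a deterministic actor $\mu(s)$ outputs the action directly.

```python
import torch
import torch.nn as nn

state_dim, n_discrete_actions, action_dim = 4, 3, 2  # hypothetical sizes

# DQN-style critic: one Q-value per discrete action; the greedy action is an argmax.
q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, n_discrete_actions))

# DDPG-style actor: maps the state directly to a continuous action, no argmax needed.
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim), nn.Tanh())

s = torch.randn(1, state_dim)
greedy_discrete_action = q_net(s).argmax(dim=1)  # feasible only for a finite action set
continuous_action = actor(s)                     # direct output in [-1, 1]^action_dim
```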
2.Background
- Bellman equation
- Stochastic Policy
$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[r(s_t, a_t) + \gamma\,\textcolor{red}{\mathbb{E}}_{a_{t+1} \sim \pi}\left[Q^{\pi}(s_{t+1}, a_{t+1})\right]\right]$
- Deterministic Policy
$Q^{\mu}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[r(s_t, a_t) + \gamma\, Q^{\mu}\left(s_{t+1}, \mu(s_{t+1})\right)\right]$
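A minimal sketch of how this deterministic Bellman equation becomes the critic's regression target in practice; `actor_target`, `critic_target`, and the default `gamma` are placeholder names and values of my own, not taken from the paper.

```python
import torch

def td_target(reward, next_state, done, actor_target, critic_target, gamma=0.99):
    """Compute y = r + gamma * Q'(s', mu'(s')) for a batch of transitions."""
    with torch.no_grad():
        next_action = actor_target(next_state)           # mu'(s_{t+1})
        next_q = critic_target(next_state, next_action)  # Q'(s_{t+1}, mu'(s_{t+1}))
        # Mask out the bootstrap term at terminal states.
        y = reward + gamma * (1.0 - done) * next_q
    return y
```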
3.Algorithm
- Method
Use neural networks to approximate, within DPG's off-policy actor-critic framework:
- actor $\mu(s)$
- critic $Q(s,a)$
- techniques
- non-linear approximation
no convergence guarantee, but essential in order to learn and generalize over large state spaces
- experience replay
- target networks (soft update; see the sketch after this list)
- stable targets $y$ are required in order to consistently train the critic without divergence
- batch normalization
- in the low-dimensional case: applied to the state input and the inner layers
- allows learning across environments with differently scaled state values using the same algorithm settings
- exploration
- OU (Ornstein-Uhlenbeck) process noise
- reparametrization trick
- $\mu(s)$ outputs the mean
- $a = \mu(s) + \mathcal{N}$
- avoids sampling from a distribution
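The exploration and target-update techniques above can be sketched in a few lines. The `OUNoise` class, the tiny placeholder networks, and the hyperparameter values (theta, sigma, tau) below are illustrative assumptions in the spirit of the paper, not its exact implementation; experience replay and the critic update are omitted here (the TD target was sketched above).

```python
import copy
import numpy as np
import torch
import torch.nn as nn

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise N."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.ones(action_dim) * mu

    def reset(self):
        self.state[:] = self.mu

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state.copy()

def soft_update(target_net, online_net, tau=0.001):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for t_param, param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * param.data)

# --- illustrative usage with tiny placeholder networks ---
state_dim, action_dim = 3, 1
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
actor_target = copy.deepcopy(actor)
noise = OUNoise(action_dim)

s = torch.randn(1, state_dim)
# Exploration: add OU noise to the deterministic actor output, a = mu(s) + N.
a = actor(s).detach().numpy()[0] + noise.sample()

soft_update(actor_target, actor)  # target network slowly tracks the online actor
```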
4.Results
- target network is necessary for good performance
- Value estimation
- It can be challenging to learn accurate value estimates on harder tasks, but DDPG still learns good policies
- batch normalization brings an improvement in most experiments
- learning speed: DDPG finds solutions using fewer steps of experience than was used by DQN learning to find solutions in the Atari domain
5.Related Work
- TRPO
- does not require learning an action-value function, and (perhaps as a result) appears to be significantly less data efficient; learning Q values can be viewed as a way to reuse data
Supplementary
- EXPERIMENT DETAILS
- actor and critic use different learning rates
- L2 weight decay TODO: used for the critic $Q$ rather than for the actor
- layer initialization (final layers initialized from a small uniform range)
- action input position (the action enters the critic only at a later hidden layer); both are illustrated in the sketch at the end
- MuJoCo environments introduction
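Tying the experiment details above together, here is a minimal sketch of a low-dimensional critic in which the action is concatenated at the second hidden layer and the final layer is initialized from a small uniform range; the layer sizes, learning rates, and weight-decay value follow my recollection of the paper's appendix and should be treated as assumptions.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Low-dimensional critic: the action is concatenated at the second hidden layer."""
    def __init__(self, state_dim, action_dim, h1=400, h2=300, final_init=3e-3):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, h1)
        self.fc2 = nn.Linear(h1 + action_dim, h2)  # action enters here, not at the input
        self.out = nn.Linear(h2, 1)
        # Final layer initialized from a small uniform range so initial Q estimates are near zero.
        nn.init.uniform_(self.out.weight, -final_init, final_init)
        nn.init.uniform_(self.out.bias, -final_init, final_init)

    def forward(self, state, action):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)

critic = Critic(state_dim=3, action_dim=1)
q = critic(torch.randn(2, 3), torch.randn(2, 1))  # batch of 2 transitions

# Different learning rates for actor and critic; L2 weight decay on the critic only.
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3, weight_decay=1e-2)
# actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)  # actor uses a smaller lr
```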