【文献阅读】ES as a Scalable Alternative to RL（OpenAI 17）_综合

Evolution Strategies as a Scalable Alternative to Reinforcement Learning

brief

文章链接
该文章是 Open AI 17年发布的，目前有300+的引用量。

Abstract

【开篇明意】We explore the use of Evolution Strategies (ES), a class of black box optimization algorithms, as an alternative to popular MDP-based RL techniques such as Qlearning and Policy Gradients. 我们探索使用进化策略（ES），一类黑盒优化算法，作为流行的基于MDP的RL技术（如Qlearning和Policy Gradients）的替代品。【卓越性能】Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using a novel communication strategy based on common random numbers, our ES implementation only needs to communicate scalars, making it possible to scale to over a thousand parallel workers. This allows us to solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training.在MuJoCo和Atari上的实验表明，ES是一种可行的解决策略，随着可用CPU数量的增加，它的扩展性极好。通过使用基于公共随机数的新型通信策略，我们的ES实现只需要通信标量，使得它有可能扩展到超过1000个并行workers。这使得我们可以在10分钟内解决3D人形行走的问题，并在训练一小时后获得大多数Atari游戏的竞争结果。【更多优点】 In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation. 此外，我们还强调了ES作为一种黑盒优化技术的几个优点：它对动作频率和延迟奖励不变，对极长的horizons有容忍度，并且不需要时空折现或价值函数逼近。

1 Introduction

第一段

【点明问题】开发能够在复杂、不确定的环境中完成具有挑战性任务的智能体 agents 是人工智能的一个关键目标。【最流行的方法RL】近年来，分析这类问题最流行的范式是基于马尔可夫决策过程（MDP）形式和价值函数概念的一类强化学习（RL）算法。这种方法的成功包括从像素点学习使用Atari的系统learn to play Atari from pixels【Mnih等人，2015年】、执行直升机特技飞行等perform helicopter aerobatics。[2006]，或玩专家级围棋 play expert-level Go [Silver等人，2016]。

第二段

【替代方法：黑盒优化】一种解决RL问题的方法是使用黑盒优化black-box optimization。这种方法被称为直接策略搜索 direct policy search [Schmidhuber and Zhao，1998]，或神经进化 neuro-evolution [Risi和Togelius，2015]，当应用于神经网络时。【本文】在本文中，我们研究进化策略（ES）[Rechenberg and Eigen，1973]，这类算法中的一组特殊的优化算法。我们表明，ES可以reliably 可靠地训练神经网络策略，in a fashion 其方式非常适合扩展 scaled to 到现代分布式计算机系统，用于控制MuJoCo物理模拟器中的机器人[Todorov等人，2012]和使用像素输入玩Atari游戏[Mnih等人，2015]。

我们的主要发现如下：

我们发现，使用虚拟批处理规范化 virtual batch normalization [Salimans等人，2016]和神经网络策略的其他重新参数化 reparameterizations of the nn policy（第2.2节）极大地提高了进化策略的可靠性reliability。在我们的实验中，没有这些方法ES被证明是脆弱的brittle，但是通过这些重新参数化，我们在各种各样的环境中取得了很好的结果。
我们发现进化策略方法具有高度的可并行性 highly parallelizable：通过引入一种新的基于公共随机数common random numbers 的通信策略 communication strategy，即使使用一千多个workers，我们也能够在运行时实现线性加速 achieve linear speedups in run time。特别是in particular，使用1440名工人，我们已经能够在不到10分钟内解决MuJoCo 3D人形任务。
进化策略的数据效率出人意料地好 the data efficiency of ES was surprisingly good：我们能够在大多数Atari环境下匹配 match A3C的最终性能[Mnih等人，2016]，同时使用了3到10倍的数据。由于不执行反向传播和没有值函数，所需计算量减少了大约3倍，这部分抵消了数据效率的轻微下降。The slight decrease in data efficiency is partly offset by a reduction in required computation of roughly 3x due to not performing backpropagation and not having a value function.我们1小时的ES结果所需的计算量与A3C发布的1天结果大致相同，而在测试的23个游戏中表现更好，在28个游戏中表现更差。在MuJoCo任务中，我们能够使用不超过10倍的数据来匹配 match 信任区域策略优化的学习策略性能[TRPO；Schulman et al.，2015]。
我们发现ES比TRPO等策略梯度方法表现出更好的探索行为：在MuJoCo类人任务中，ES能够学习非常广泛的步态 gaits（例如侧身行走或向后行走）。这些不寻常的步态从来没有用TRPO观测到，这表明了一种性质不同的勘探行为 whcih suggests a qualitatively different exploration behavior。
我们发现进化策略方法是稳健的 robust：我们在所有的Atari环境中使用了固定的超参数 fixed hyperparameters，并且在所有的MuJoCo环境中使用了一组不同的固定超参数（除了一个二进制超参数，在不同的MuJoCo环境中，这一点并不是恒定不变的）。

第三段

黑盒优化方法有几个非常吸引人的特性：

对报酬的分布（稀疏或密集）无所谓，indifference to the distribution of rewards (sparse or dense)
不需要反向传播梯度， no need for backpropagating gradients
以及对可能任意长时间范围的容忍。tolerance of potentially arbitrarily long time horizons.

然而，与Q学习和策略梯度等技术相比，它们在解决棘手的RL问题方面的效率较低。这项工作的贡献，我们希望这将重新引起人们对这类方法的兴趣，并带来新的有用的应用，它证明了进化策略可以在当今最困难的环境中与竞争的RL算法竞争，而且这种方法可以扩展到更多的并行工作者。The contribution of this work, which we hope will renew interest in this class of methods and lead to new useful applications, is a demonstration that evolution strategies can be competitive with competing RL algorithms on the hardest environments studied by the deep RL community today, and that this approach can scale to many more parallel workers.

2 Evolution Strategies

进化策略（Evolution Strategies，ES）是一类受自然进化启发的启发式搜索算法：在每一次迭代（“生成”）中，一组参数向量（“基因型”）都会受到扰动（“变异”）并评估其目标函数值（“适应度”）。At every iteration (“generation”), a population of parameter vectors (“genotypes”) is perturbed (“mutated”) and their objective function value (“fitness”) is evaluated. 然后对得分最高的参数向量进行重组，以形成下一代的总体，然后迭代该过程，直到目标完全优化。这类算法的不同之处在于它们如何表示种群以及如何执行变异和重组 mutation and recombination 。ES类中最广为人知的成员是协方差矩阵自适应进化策略covariance matrix adaptation evolution strategy[CMA-ES；Hansen and Ostermeier，2001]，它用全协方差多元高斯函数full-covariance multivariate Gaussian表示总体。CMA-ES在解决中低维的优化问题方面已经非常成功。
我们在这项工作中使用的ES版本属于自然进化策略的一类，与Sehnke等人的工作密切相关。[2010年]。设F表示作用于参数 $θ\theta$ 的目标函数。NES算法用参数 $pψ(θ)p_{\psi}(\theta)$ 上的分布来表示种群–它本身的参数为 $ψ\psi$ –并通过随机梯度上升搜索 $ψ\psi$ 来继续最大化种群上的平均目标值 $Eθ?pψF(θ)\mathbb E_{\theta\sim p_\psi}F(\theta)$ 。具体地说specifically，以与REINFORCE相似的方式使用 $▽ψEθ?pψF(θ)\bigtriangledown_\psi\mathbb E_{\theta\sim p_\psi}F(\theta)$ 的得分函数估计器[Williams，1992]，NES算法对以下估计器采取梯度步骤：

For the special case where $pψp_{\psi}$ is factored Gaussian 对于 $pψp_{\psi}$ is factored Gaussian 的特殊情况（如本文所述），得到的梯度估计量也被称为同步扰动随机逼近simultaneous perturbation stochastic approximation[Spall，1992]、参数化策略梯度parameterexploring
policy gradients [Sehnke等人，2010]或零阶梯度估计zero-order gradient estimation [Nesterov and Spokoiny，2011]。
在这项工作中，我们关注RL问题，因此 $F (.)$ 将是环境提供的随机回报， $θ\theta$ 将是一个确定性或随机策略的参数 $πθ\pi_\theta$ ，该策略描述了一个在该环境中活动的agent，由离散或连续的行为控制。RL算法的许多创新都集中在解决环境或政策的衍生工具的缺乏或存在的问题上。这种非光滑性可以用ES解决，如下所示。我们将总体分布 $pψp_{\psi}$ 实例化为具有均值 mean $ψ\psi$ 和固定协方差 $σ2I\sigma^2I$ 的各向同性多元高斯分布 isotropic multivariate Gaussian，允许我们直接根据均值 mean 参数向量 $θ\theta$ 写出 $Eθ?pψF(θ)\mathbb E_{\theta\sim p_\psi}F(\theta)$ ：we set
通过这种设置with this setup，我们的随机目标stochastic objective 可以看作是原始目标F的高斯模糊版本 Gaussian-blurred version，free of non-smoothness introduced by the environment or potentially discrete actions taken by the policy不受环境或策略所采取的潜在离散行动影响的非平稳性。关于ES和策略梯度方法如何处理非光滑性的进一步讨论可以在第3节中找到。
在我们的目标定义为 $θ\theta$ 的情况下，我们直接使用随机梯度上升和分数函数估计来优化 $θ\theta$ ，
梯度计算
which can be approximated with samples 可以用样本来近似。得到的算法（1）重复执行两个阶段：1）随机扰动策略的参数并通过在环境中运行一个 episode 来评估得到的参数；2）将这些 episodes 的结果合并，计算随机梯度估计，并更新参数。The resulting algorithm (1) repeatedly executes two phases: 1) Stochastically perturbing the parameters of the policy and evaluating the resulting parameters by running an episode in the environment, and 2) Combining the results of these episodes, calculating a stochastic gradient estimate, and updating the parameters.
algorithm1

2.1 Scaling and parallelizing ES

ES很适合扩展到许多并行工人parallel workers。1）它对完整的 episodes 进行操作，thereby 因此只需要workers 之间进行不频繁infrequent communication 的通信。2）每个worker获得的唯一信息是一个episode 的标量返回 scalar return ：如果我们在优化前同步synchronize worker之间的随机种子random seeds，每个worker都知道其他worker使用了什么扰动，因此每个worker只需要与其他worker之间通信一个标量就可以达成参数更新。因此，ES需要极低的带宽 extremely low bandwidth，与策略梯度方法形成鲜明对比 in sharp contrast to，后者需要工友沟通整个梯度 which require workers to communicate entire gradients。3）它不需要价值函数逼近。带有价值函数估计的RL本质上是连续的 inherently sequential 。为了改进给定的策略，通常需要对价值函数进行多次更新，以获得足够的信号。每当策略发生重大变化时，价值函数估计需要多次迭代才能赶上。
算法2中给出了ES的一个简单的并行版本。这里的主要创新之处在于，该算法使用了共享的随机种子 shared random seeds，这大大降低了工人之间通信所需的带宽 which drastically reduces the bandwidth required for communication between the workers。
algorithm 2
在实践中，我们通过让每个工人在训练开始时实例化 instantiate 一大块高斯噪声来实现采样，然后通过在每次迭代中添加这些噪声变量的随机索引子集来扰动其参数。尽管这意味着扰动在迭代中并不是严格独立的，但我们在实践中并没有发现这是一个问题。使用这种策略，我们发现算法2的第二部分（第9-12行）只占我们所有实验总时间的一小部分，即使使用多达1440个并行工作器。当使用更多的workers ，或者使用非常大的神经网络时，我们可以通过让 workers 只扰动参数 $θ\theta$ 的一个子集，而不是所有的参数，来减少这部分算法所需的计算量。在这种情况下，扰动分布 $pψp_\psi$ 对应于高斯的混合物，对其更新方程保持不变。在最极端的情况下 at the very extreme，每个worker 只扰动参数向量的一个坐标，这意味着我们将使用纯有限差分 which means that we could be using pure finite differences.
为了减少方差，我们使用反抽样 antithetic sampling Geweke[1988]，也就是ES文献中的镜像抽样Brockhoff等[2010]：即我们总是对高斯噪声向量 $?\epsilon$ 评估一对扰动 $?\epsilon$ , $??-\epsilon$ 。我们还发现，在计算每个参数更新之前，通过对返回值进行秩位变换 rank transformation，执行fitness shaping 适应度值调整 Wierstra等人[2014]是有用的。这样做可以消除每个种群中离群个体 outlier individuals 的影响，并降低 ES 在训练早期陷入局部最优值的倾向。此外，我们对政策网络的参数应用了权重衰减 weight decay：与扰动相比，这可以防止参数变得非常大。this prevents the parameters from growing very large compared to the perturbations.。
不像Wierstra等人。[2014]我们没有看到在训练期间适应 $σ\sigma$ 的好处，因此我们将其视为一个固定的超参数 fixed hyperparameter。我们直接在参数空间中进行优化；探索间接编码 indirect encodings Stanley等人。[2009年]，van Steenkiste等人。[2016]将留待未来工作。
如上所述，进化策略适用于 full-length episodes 。在一些罕见的情况下，这可能导致低CPU利用率，因为某些 episodes 比其他事件运行更多的步骤。基于这个原因，我们将所有工人的事件长度限制在一个恒定的m步数上，随着培训的进行，我们会动态调整。例如，通过将m设置为每集平均步骤数的两倍，我们可以保证在最坏的情况下CPU利用率保持在50%以上。

2.2 The impact of network parameterization

第一段：参数扰动——better return

然而，像 Q-learning 和策略梯度这样的RL算法是通过随机策略的抽样行为来探索的，而进化策略则是从策略参数的抽样实例化中获得学习信号。因此，在ES中的探索是由参数摄动驱动的。为了使ES改进参数 $θ\theta$ ，种群中的某些成员必须获得比其他成员更好的回报：即高斯扰动向量 $?\epsilon$ 偶尔会导致新的个体 $θ+σ?\theta+\sigma\epsilon$ 具有更好的回报。

第二段：虚拟批处理规范化 virtual batch normalization

对于Atari环境，我们发现 DeepMind 卷积架构上convolutional architectures的高斯参数扰动 Gaussian parameter perturbations【Mnih等人，2015年】并不总是导致充分的探索：对于某些环境，随机扰动参数倾向于编码策略，这些策略总是采取一种特定的行动，而不管作为输入的状态是什么。然而，我们发现，通过在策略规范中使用虚拟批处理规范化[Salimans等人，2016]，我们可以匹配大多数游戏的策略梯度方法的性能。然而，我们发现，通过在策略规范中使用虚拟批处理规范化[Salimans等人，2016]，我们可以匹配大多数游戏的策略梯度方法的性能。虚拟批次标准化与批次标准化完全等价[Ioffe and Szegedy，2015]，其中用于计算规范化统计数据的小批量在培训开始时选择并固定。参数化的这种变化使得策略在训练的早期阶段对输入图像中的非常小的变化更加敏感，而此时策略的权重是随机的，从而确保策略采取足够多的动作来获得偶尔的奖励 gather occasional rewards。对于大多数应用，虚拟批处理规范化的一个缺点是它使培训成本更高 a downside of virtual batch normalization is that it makes training more expensive。然而，对于我们的应用，用于计算规范化统计信息的 minibatch 比在一个典型 episode 中采取的步骤数要小得多，这意味着开销可以忽略不计 meaning that the overhead is negligible。

第三段：动作离散化——more exploration

对于MuJoCo任务，通过标准多层感知器映射到连续动作，我们在几乎所有的环境中都取得了良好的性能。然而，我们观察到，对于某些环境，我们可以通过将操作离散化来鼓励更多的探索。这就迫使动作在输入观察值和参数扰动方面变得不平滑 non-smooth，从而鼓励在 rollouts 过程中执行各种各样的行为。

3 Smoothing in parameter space versus smoothing in action space 参数空间中的平滑与动作空间中的平滑

如第2节所述，RL中的一个很大的困难源于缺乏策略性能的信息梯度：由于环境或策略的非平稳性 non-smoothness，这种梯度可能不存在，或者可能仅作为高方差估计值 high-variance estimates 可用，因为环境通常只能通过采样获得because the environment usually can only be accessed via sampling。显式地 Explicitly，假设我们希望解决一般的决策问题，在我们采取一系列动作actions $a= \{a_1, ...,a_T\}$ 后给出一个回报 $R (a)$ ，其中 actions 由一个确定的或随机的策略函数 $at=π(s,θ)a_t=\pi(s, \theta)$ 决定。我们要优化的目标是
$F(θ)=R(a(θ))F(\theta)=R(a(\theta))$
由于允许动作 actions 是离散的，策略是确定性的，所以 $F(θ)F(\theta)$ 在 $θ\theta$ 上可以是非光滑的 non-smooth。更重要的是，由于我们没有明确地访问决策问题的底层状态转换函数，所以不能用类似反向传播的算法来计算梯度。这意味着我们不能直接使用标准的基于梯度的优化方法来为 $θ\theta$ 找到一个好的解决方案。

在动作空间加噪声

为了使问题变得平滑 smooth，并有一种估计其梯度的方法，我们需要添加噪声 add noise。策略梯度法在行动空间中 action space 加入了噪声，它是通过从一个适当的分布中采样行动来完成的 which is done by sampling the actions from an appropriate distribution。例如，如果动作是离散的， $π(s;θ)\pi(s;\theta)$ 在选择最佳动作之前计算每个动作的得分，那么我们将从每个时间段动作的类别分布中采样sample 动作 $a(?,θ)a(\epsilon, \theta)$ （这里 $?\epsilon$ 是噪声源），对每个动作的得分应用softmax applying a softmax to the scores of each action。Doing so yields the objective $FPG(θ)=E?R(a(?,θ))F_{PG}(\theta)=\mathbb{E_\epsilon}R(a(\epsilon,\theta))$ , with gradients
tidu

在参数空间加噪声

另一方面，进化策略在参数空间parameter space 中增加了噪声。That is, they perturb the parameters as $θ~=θ+ξ\tilde{\theta}=\theta+\xi$ ，with $ξ\xi$ from a multivariate Gaussian distribution 其中 $ζ\zeta$ 来自多维高斯分布，and then pick actions as $at=a(ξ,θ)=π(s;θ~)a_t=a(\xi, \theta)=\pi(s; \tilde{\theta})$ 。它可以解释为在原始目标上添加高斯模糊，从而导致平滑的、可微的代价It can be interpreted as adding a Gaussian blur to the original objective, which results in a smooth, differentiable cost $FES(θ)=EξR(a(ξ,θ))F_{ES}(\theta)=\mathbb{E_\xi}R(a(\xi,\theta))$ , this time with gradients
tidu2
The two methods for smoothing the decision problem are thus quite similar, and can be made even more so by adding noise to both the parameters and the actions. 因此，这两种平滑决策问题的方法颇为相似，可以通过在参数和动作中加入噪声使之更加平滑。

3.1 When is ES better than policy gradients?

Given these two methods of smoothing the decision problem, which should we use? The answer depends strongly on the structure of the decision problem and on which type of Monte Carlo estimator is used to estimate the gradients
考虑到这两种平滑决策问题的方法，我们应该使用哪种方法？答案很大程度上取决于决策问题的结构以及使用哪种蒙特卡罗估计量来估计梯度 $▽θFPG(θ)\bigtriangledown_{\theta}F_{PG}(\theta)$ 和 $▽θFES(θ)\bigtriangledown_{\theta}F_{ES}(\theta)$ 。假设收益率和个体行为之间的相关性很低correlation between the return and the individual actions is low（对于任何困难的RL问题都是如此）。假设我们使用简单的蒙特卡罗simple Monte Carlo（REINFORCE）近似这些梯度，并且在返回时有一个良好的基线with a good baseline on the return，我们有

如果两种方法执行的探索量相似，则两个表达式的 $V a r [R (a)]$ 将相似。因此，the difference will thus be in the second term。Here we have that $▽θlogp(a;θ)=∑t=1T▽θlogp(at;θ)\bigtriangledown_{\theta}logp(a;\theta)=\sum_{t=1}^T\bigtriangledown_{\theta}logp(a_t;\theta)$ is a sum of T uncorrelated terms 是T个不相关项的总和, so that the variance of the policy gradient estimator will grow nearly linearly with T.因此，策略梯度估计的方差将与T近似线性增长。The corresponding term for evolution strategies $▽θlogp(θ~;θ)\bigtriangledown_{\theta}logp(\tilde{\theta};\theta)$ , is independent of T. 进化策略的对应项 $▽θlogp(θ~;θ)\bigtriangledown_{\theta}logp(\tilde{\theta};\theta)$ 与T无关。因此，与具有非常多时间步的长事件long episode 的策略梯度相比，进化策略将具有优势。【方法一】在实际应用中，策略梯度法通常通过对奖励进行折现来减少有效步骤T。In practice, the effective number of steps T is often reduced in policy gradient methods by discounting rewards.如果动作的效果是短暂的，这就允许我们显著地减少梯度估计中的方差，这对Atari游戏等应用的成功至关重要。If the effects of actions are short-lasting, this allows us to dramatically reduce the variance in our gradient estimate, and this has been critical to the success of applications such as Atari games。然而，如果行动有长期的影响，这种折扣会使我们的梯度估计产生偏差。【方法二】另一种降低有效T值的策略是使用值函数逼近value function approximation。这也是有效的，但再次冒着使我们的梯度估计产生偏差的风险runs the risk of biasing our gradient estimates。【结论】因此，如果有效的时间步数T很长，行为具有长期的影响，并且没有好的价值函数估计，进化策略是一个有吸引力的选择。

3.2 Problem dimensionality

ES的梯度估计可以解释为高维空间中随机有限差分的一种方法。事实上，利用以下事实 $E??N(0,I){F(θ)?/σ}=0\mathbb{E}_{\epsilon\sim N(0,I)}\{F(\theta)\epsilon/\sigma\}=0$ , we get 我们可以得到
tidu3
现在很明显，ES 可以看作是在随机选择的方向上计算有限差分导数估计，特别是当 $σ\sigma$ 变小时。ES与有限差分法的相似性表明，该方法在参数 $θ\theta$ 的维数下伸缩性较差。The resemblance of ES to finite differences suggests the method will scale poorly with the dimension of the parameters $θ\theta$ . 理论分析表明，对于一般的非光滑优化问题，所需的优化步数与维数成线性关系 Theoretical analysis indeed shows that for general non-smooth optimization problems, the required number of optimization steps scales linearly with the dimension [Nesterov and Spokoiny, 2011]. 然而，需要注意的是，这并不意味着在使用ES进行优化时，较大的神经网络会比较小的网络表现更差：重要的是优化问题的难度，或者说内在维度。为了看到我们模型的维度可以与优化问题的有效维度完全分开，考虑一个回归问题，我们用线性模型 $y^=x?w\hat{y}=x\cdot{w}$ 来逼近一个单变量y：如果我们通过将x与自身连在一起（即使用特征 $x^{'} = (x, x)$ ，将这个模型中的特征和参数数量增加一倍，问题并不会变得更加困难。The ES algorithm will do exactly the same thing when applied to this higher dimensional problem, as long as we divide the standard deviation of the noise by two, as well as the learning rate.ES算法在应用于这种高维问题时，只要我们将噪声的标准差除以二，以及学习率，就能做到完全一样。
In practice, we observe slightly better results when using larger networks with ES. For example, we tried both the larger network and smaller network used in A3C [Mnih et al., 2016] for learning Atari 2600 games, and on average obtained better results using the larger network. We hypothesize that this is due to the same effect that makes standard gradient-based optimization of large neural networks easier than for small ones: large networks have fewer local minima [Kawaguchi, 2016].在实践中，我们观察到当使用更大的网络和ES时，效果稍好一些。例如，我们尝试了A3C[Mnih et al.，2016]中使用的较大网络和较小网络来学习Atari 2600游戏，平均而言，使用较大的网络取得了更好的效果。我们假设这是由于同样的影响，使得基于梯度的标准优化大型神经网络比小型神经网络更容易：大型网络具有更少的局部极小值【Kawaguchi，2016年】。

3.3 Advantages of not calculating gradients

第一段

除了易于并行化，以及在具有长动作序列和延迟奖励的情况下具有优势外，像ES这样的黑盒优化算法比计算梯度的RL技术还有其他优势。在分布式环境中实现ES的通信开销比策略梯度和Q-learning等竞争RL方法要低，因为需要在各进程间通信的唯一信息是标量返回和用于生成扰动 $?\epsilon$ 的随机种子，而不是完整的梯度。另外，ES可以处理最大限度的稀疏和延迟奖励；只使用一个情节的总回报，而其他方法则使用单个奖励及其确切的时间。
In addition to being easy to parallelize, and to having an advantage in cases with long action sequences and delayed rewards, black box optimization algorithms like ES have other advantages over RL techniques that calculate gradients. The communication overhead of implementing ES in a distributed setting is lower than for competing RL methods such as policy gradients and Q-learning, as the only information that needs to be communicated across processes are the scalar return and the random seed that was used to generate the perturbations $?\epsilon$ , rather than a full gradient. Also, ES can deal with maximally sparse and delayed rewards; only the total return of an episode is used, whereas other methods use individual rewards and their exact timing.

第二段

由于不需要反向传播，黑盒优化器将每集的计算量减少了大约三分之二，内存则可能减少更多。此外，not explicitly 不显式地计算分析梯度analytical gradient可以防止使用递归神经网络recurrent neural networks时常见的爆炸梯度exploding gradients问题。通过在参数空间中平滑成本函数，我们减少了导致这些问题的病态曲率：足够平滑的有界成本函数不可能有爆炸性的梯度。By smoothing the cost function in parameter space, we reduce the pathological curvature that causes these problems: bounded cost functions that are smooth enough can’t have exploding gradients. 在极端情况下At the extreme，ES允许我们将不可微分的元素融入到我们的架构中，比如使用硬关注的模块。 ES allows us to incorporate non-differentiable elements into our architecture, such as modules that use hard attention [Xu et al., 2015].

第三段

黑盒优化方法独特地适用于深度学习的低精度硬件。低精度的算术Low precision arithmetic，如二进制神经网络中，可以比高精度的算术便宜得多。在优化这种低精度架构时，使用基于梯度的方法时，有偏差的低精度梯度估计可能是一个问题。同样，在使用ES进行优化时，可以直接使用神经网络推理的专用硬件，如TPU[Jouppi等人，2017]，而其有限的内存通常使得backpropagation 反向传播不可能实现。

第四段

通过在参数空间而不是行动空间中进行扰动，黑盒优化器对我们的代理在环境中的行动频率自然是不变的 black box optimizers are naturally invariant to the frequency at which our agent acts in the environment。另一方面，对于基于MDP的强化学习算法来说，众所周知，frameskip是优化成功的关键参数，要搞好这个参数[Braylan等，2005]。While this is usually a solvable problem for games that only require short-term planning and action, it is a problem for learning longer term strategic behavior.虽然这对于只需要短期计划和行动的游戏来说，通常是一个可以解决的问题，但对于学习长期的战略行为来说，却是一个问题。对于这些问题，RL需要层次结构 hierarchy才能成功[Parr和Russell，1998]，而在使用黑盒优化时，就没有那么必要了。

4 Experiments

4.1 MuJoCo

第一段：平台，要对比的算法

我们在OpenAI Gym的连续机器人控制问题的基准上对ES进行了评估，并与高度调整的信任区域策略优化（Trust Region Policy Optimization）的实现进行了对比，信任区域策略优化是一种旨在高效优化神经网络策略的策略梯度算法。我们在经典问题上进行了测试，比如平衡一个倒立的钟摆，以及最近文献中发现的更困难的问题，比如学习二维跳动和行走步态。这些环境都是由MuJoCo模拟的。We evaluated ES on a benchmark of continuous robotic control problems in the OpenAI Gym against a highly tuned implementation of Trust Region Policy Optimization, a policy gradient algorithm designed to efficiently optimize neural network policies. We tested on both classic problems, like balancing an inverted pendulum, and more difficult ones found in recent literature, like learning 2D hopping and walking gaits. The environments were simulated by MuJoCo .

第二段：网络结构和任务

We used both ES and TRPO to train policies with identical architectures: multilayer perceptrons with two 64-unit hidden layers separated by tanh nonlinearities. We found that ES occasionally benefited from discrete actions, since continuous actions could be too smooth with respect to parameter perturbation and could hamper exploration (see section 2.2). For the hopping and swimming tasks, we discretized the actions for ES into 10 bins for each action component.我们使用ES和TRPO来训练具有相同架构的策略：具有两个64个单元隐藏层的多层感知器，由tanh非线性分隔。我们发现，ES偶尔会从离散动作中受益，因为连续动作对于参数扰动可能过于平滑，可能会阻碍探索（见2.2节）。对于跳跃hopping 和游泳swimming 任务，我们将ES的动作分解为10个动作单元。
我们发现，ES在经过500万次的环境交互后，能够解决这些任务，达到TRPO的最终性能。为了得到这个结果，我们在6个随机种子上运行ES，并将平均学习曲线与TRPO类似计算的曲线进行比较。To obtain this result, we ran ES over 6 random seeds and compared the mean learning curves to similarly computed curves for TRPO. 学习过程中具体的样本复杂度权衡exact sample complexity tradeoffs 见表1，详细结果见附录表3。一般来说，在硬环境（Hopper和Walker2d）上，与TRPO相比，我们能够在样本复杂度上以小于10倍的惩罚来解决环境。在简单环境上，我们取得了比TRPO高3倍的样本复杂度。
Table1：MuJoCo tasks: Ratio of ES timesteps to TRPO timesteps needed to reach various percentages of TRPO’s learning progress at 5 million timesteps. MuJoCo任务：为达到500万次TRPO学习进度的不同百分比所需的ES次数与TRPO次数之比。

4.2 Atari

我们在OpenAI Gym中提供的51个Atari 2600游戏上运行了算法2中描述的进化策略的并行实现。我们使用了Mnih等[2016]使用的相同的预处理和前馈CNN架构。所有的游戏都进行了10亿帧的训练，这与A3C公布的1天结果使用3.2亿帧的神经网络计算量差不多。不同的是由于ES不进行反向传播，不使用值函数。通过在亚马逊EC2上的720个CPU上并行评估扰动参数，我们可以将每场比赛的训练过程所需的时间降低到1小时左右。训练结束后，我们将最终性能与已公布的A3C结果进行了对比，发现ES在测试的23款游戏中表现较好，而在28款游戏中表现较差。完整的结果见补充材料中的表2。

4.3 Parallelization

由于ES对通信带宽的要求较低，因此它特别适合于并行化（2.1节）。我们实现了算法2的分布式版本，以研究ES如何随着工人数量的增加而伸缩。我们的分布式实现不依赖于特殊的网络设置，并在公共云计算服务Amazon EC2上进行了测试。ES is particularly amenable to parallelization because of its low communication bandwidth requirement (Section 2.1). We implemented a distributed version of Algorithm 2 to investigate how ES scales with the number of workers. Our distributed implementation did not rely on special networking setup and was tested on public cloud computing service Amazon EC2.
我们从OpenAI Gym [Brockman等人，2016]中选取了3D Humanoid行走任务作为我们缩放实验 scaling experiment 的测试问题，因为它是用最先进的RL技术 state-of-the-art RL techniques 解决的最具挑战性的连续控制问题之一，which require about a day to learn on modern hardware在现代硬件上需要大约一天的时间来学习[Schulman等人，2015，Duan等人，2016a]。在一台18核机上用ES解决3D Humanoid需要约11小时，与RL相当。Solving 3D Humanoid with ES on one 18- core machine takes about 11 hours, which is on par with RL. 然而，当分布在80台机器和1440个CPU核时，ES仅需10分钟就能解决3D Humanoid，将实验周转时间缩短了两个数量级。图1显示，对于这个任务，ES能够实现CPU核数的线性加速。However, when distributed across 80 machines and 1, 440 CPU cores, ES can solve 3D Humanoid in just 10 minutes, reducing experiment turnaround time by two orders of magnitude. Figure 1 shows that, for this task, ES is able to achieve linear speedup in the number of CPU cores.
图1和图2
Figure1：在不同CPU核数的情况下，在3D Humanoid上达到6000分的时间。实验重复7次，报告中位时间。
Figure2：使用不同的跳帧参数，Pong的学习曲线。虽然性能是随机的，但每个设置都会导致大约同样快速的学习，每次运行在大约100个权重更新中收敛。

4.4 Invariance to temporal resolution

在RL中，通常的做法是让代理以比运行环境的模拟器更低的频率来决定其动作。这个动作频率，或称跳帧 This action frequency, or frame-skip, is a crucial parameter in many RL algorithms，是许多RL算法中的一个关键参数。如果帧跳转设置得过高，代理就无法在足够细的时间框架内做出决策，无法在环境中很好地发挥作用。另一方面，如果帧跳 frameskip 转设置得过低，则会使episode 的有效时间长度增加太多，从而恶化优化性能deteriorates optimization performance，这在3.1节中已经分析过。ES的一个优点是它的梯度估计对 episode 的长度是不变的 invariant to the length of the episode,，这使得它对动作频率更加稳健 which makes it much more robust to the action frequency。我们通过在{1，2，3，4}中使用跳帧参数运行Atari游戏Pong来证明这一点。从图2中可以看出，每种设置的学习曲线确实看起来非常相似。

5 Related work

已经有很多人尝试应用ES的相关方法来训练神经网络Risi和Togelius[2015]。对于Atari，Hausknecht等人[2014]获得了令人印象深刻的结果。Sehnke等人[2010]提出了一种与我们工作中研究的方法密切相关的方法。Koutník等人[2013，2010]和Srivastava等人[2012]也同样将一种ES方法应用于具有视觉输入的RL问题，但其中策略被以多种不同的方式压缩。自然演化策略已经成功地应用于黑盒优化Wierstra等[2008，2014]，以及循环神经网络中循环权重的训练Schmidhuber等[2007]。Stulp和Sigaud[2012]探讨了类似的黑盒优化方法。最近Usunier等人[2016]探索了一种有趣的黑盒优化和策略梯度方法的混合体。Hyper-Neat Stanley等人[2009]是一种同时演化神经网络的权重和参数的替代方法。Duchi等[2015]、Nesterov[2012]也分析了在convex setting 凸性环境下的无导数 derivative free optimization methods 优化方法。
Natural evolution strategies has been successfully applied to black box optimization Wierstra et al. [2008, 2014], as well as for the training of the recurrent weights in recurrent neural networks Schmidhuber et al. [2007]. Stulp and Sigaud [2012] explored similar approaches to black box optimization.
An interesting hybrid of black-box optimization and policy gradient methods was recently explored by Usunier et al. [2016].
Hyper-Neat Stanley et al. [2009] is an alternative approach to evolving both the weights of the neural networks and their parameters. Derivative free optimization methods have also been analyzed in the convex setting Duchi et al. [2015], Nesterov [2012].
我们工作中的主要贡献在于表明这一类算法在分布式硬件上的使用具有极高的可扩展性和效率。我们已经证明，ES在精心实现后，在当今最难解决的问题上，在性能上与竞争的RL算法具有竞争力，在数据效率上也令人惊讶地接近，同时需要更少的挂钟时间wallclock time来训练。

6 Conclusion

【和摘要一样】我们已经探索了进化策略，这是一类黑盒优化算法，作为流行的基于MDP的RL技术（如Q-learning和政策梯度）的替代方案。在Atari和MuJoCo上的实验表明，它是一个可行的选择，具有一些吸引人的特点：它对行动频率和延迟奖励不变，而且它不需要时间折扣或价值函数逼近。最重要的是，ES是高度可并行的，这使得我们可以通过扩展到更多的并行工作者来弥补数据效率的降低。
【未来的研究方向展望】在未来的工作中，我们计划将进化策略应用于那些基于MDP的强化学习不太适合的问题：时间跨度长、奖励结构复杂的问题long time horizons and complicated reward structure。我们对元学习（meta-learning）或学习到学习（learning-to-learn）特别感兴趣。Duan等人[2016b]给出了RL环境下元学习的概念证明。使用黑盒优化，我们希望能够扩展这些结果。我们还计划研究将ES与快速的低精度神经网络实现相结合，以充分利用ES的无梯度特性。We also plan to examine combining ES with fast low precision neural network implementations to fully make use of the gradient-free nature of ES.

词汇

in sharp contrast to 与……形成鲜明对比
In practice, we implement sampling by having each worker instantiate a large block of Gaussian noise at the start of training, and then perturbing its parameters by adding a randomly indexed subset of these noise variables at each iteration. 在实践中，我们通过让每个worker在训练开始时实例化一大块高斯噪声来实现采样，然后在每次迭代时通过添加这些噪声变量的随机索引子集来扰动其参数。
antithetic sampling 对偶抽样，also known as mirrored sampling 镜像采样
a downside of virtual batch normalization is that it makes training more expensive 虚拟批处理规范化的一个缺点是它使培训成本更高
More importantly, because we do not have explicit access to the underlying state transition function of our decision problems, the gradients cannot be computed with a backpropagation-like algorithm. This means we cannot directly use standard gradient-based optimization methods to find a good solution for $θ\theta$ . 更重要的是，由于我们没有明确地访问决策问题的底层状态转换函数，所以不能用类似反向传播的算法来计算梯度。这意味着我们不能直接使用标准的基于梯度的优化方法来为 $θ\theta$ 找到一个好的解决方案。