[Paper Notes] The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training

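This is a fairly early paper. Around 2009, unsupervised pre-training was a hot topic because it made training deep networks look feasible. The paper examines why deep networks are hard to train and uses experiments to analyze the advantages that unsupervised pre-training brings to training them.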

Experimental Results

The paper is primarily an experimental analysis and addresses the following questions:

Why is it more difficult to train deep architectures than shallow architectures?
How does the depth of the architectures affect the difficulty of training?
What does the cost function landscape of deep architectures look like?
Is the advantage of unsupervised pre-training related to optimization, or perhaps some form of regularization?
What is the effect of random initialization on the learning trajectories?

Effect of Depth, Pre-training and Robustness to Random Initialization

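The first set of experiments compares random parameter initialization (no pre-training, weights drawn from a uniform distribution) with pre-training. 400 models were trained, and the figure below shows box plots of their test error. With pre-training, the error stays at roughly the same level as the number of hidden layers goes from 1 to 4, and even tends to decrease slightly. Without pre-training, the error starts to rise once the network reaches 3 hidden layers, and from 2 hidden layers onward the number of outliers in the error also grows sharply. The results do not include 5-hidden-layer models, because in these experiments 5-layer networks without pre-training were very hard to train at all.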

[Figure: box plots of test error versus number of hidden layers, with and without pre-training]
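The figure below shows the test error distributions for two depths: without pre-training, the test errors of 1-layer networks fall in the range [1.6, 2], while those of 4-layer networks fall in [1.8, 3]. Taken together, these results suggest a conclusion: adding layers increases the probability of ending up in a poor local minimum. This answers one of the questions above, namely how depth affects the difficulty of training.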

[Figure: test error distributions for 1-layer and 4-layer networks]

The Pre-Training Advantage: Better Optimization or Better Generalization?

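The only difference between pre-training and random initialization is the starting point in parameter space. Deep networks are much harder to train than shallow ones mainly because they stack many nonlinear layers, which makes the loss function non-convex, and non-convex optimization is hard because there are many apparent local minima. The advantage of pre-training is that its initial parameters lie near a relatively good local minimum, so optimization of the deep network goes better than it does from a random initialization.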
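
One simple way to see why such a loss cannot in general be convex is the permutation symmetry of hidden units. The sketch below is only an illustration, not something taken from the paper: the network sizes, the tanh activation, and the name `mlp_forward` are all assumptions. It shows that relabeling the hidden units gives a different parameter vector that computes exactly the same function, while the midpoint between the two parameter vectors computes a different one. If the loss were convex, every point between two equally good minima would have to be a minimum too.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer network with tanh units and a linear output (hypothetical toy model)."""
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 5, 3
W1 = rng.normal(size=(n_in, n_hidden))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_hidden, n_out))
b2 = rng.normal(size=n_out)
x = rng.normal(size=(8, n_in))

# Relabel the hidden units: permute the columns of W1 (and b1) and the rows of W2.
perm = np.roll(np.arange(n_hidden), 1)
W1_p, b1_p, W2_p = W1[:, perm], b1[perm], W2[perm, :]

out_original = mlp_forward(x, W1, b1, W2, b2)
out_permuted = mlp_forward(x, W1_p, b1_p, W2_p, b2)
same_function = np.allclose(out_original, out_permuted)

# The midpoint between the two equivalent parameter settings is a different function.
out_mid = mlp_forward(x, (W1 + W1_p) / 2, (b1 + b1_p) / 2, (W2 + W2_p) / 2, b2)
midpoint_differs = not np.allclose(out_original, out_mid)

print(same_function, midpoint_differs)
```

The parameters here are random rather than trained, so this only demonstrates the symmetry itself; it says nothing about how good or bad any particular minimum is.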

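There is another possibility, of course: the region of parameter space that pre-training starts from may be no better than a random one in terms of optimization (similar training error), yet still lead to a model that generalizes better (lower test error). The experiment below supports this conjecture: the advantage of pre-training lies not only in better optimization but also in better generalization.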

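Why would generalization improve? One possible reason is that pre-training here is done with a stacked denoising auto-encoder, which constrains the model's parameter space, since the initial parameters must also let the denoising auto-encoder reconstruct the original input well. Pre-training therefore also acts as a regularizer (although some later papers seem to argue that pre-training does not really act as a regularizer).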
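
To make the reconstruction constraint concrete, here is a minimal sketch of a single denoising auto-encoder layer. It is an illustration under stated assumptions, not the paper's implementation: tied weights, sigmoid units, masking noise, cross-entropy loss, the toy data, and all names (`dae_step`, `reconstruct`, `cross_entropy`) and hyperparameters are chosen here for the example. The layer sees a corrupted input but is trained to reproduce the clean one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruct(x_in, W, b_hid, b_vis):
    """Encode with W, decode with the transpose of W (tied weights, an assumption)."""
    h = sigmoid(x_in @ W + b_hid)
    return h, sigmoid(h @ W.T + b_vis)

def cross_entropy(x, x_hat):
    x_hat = np.clip(x_hat, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat), axis=1))

def dae_step(x, W, b_hid, b_vis, lr, corruption_level, rng):
    """One gradient step; updates W, b_hid and b_vis in place."""
    keep = rng.random(x.shape) >= corruption_level  # masking noise
    x_tilde = x * keep                              # corrupted input
    h, x_hat = reconstruct(x_tilde, W, b_hid, b_vis)
    d_vis = (x_hat - x) / x.shape[0]                # target is the clean input
    d_hid = (d_vis @ W) * h * (1 - h)
    W -= lr * (x_tilde.T @ d_hid + d_vis.T @ h)     # encoder + decoder contributions
    b_hid -= lr * d_hid.sum(axis=0)
    b_vis -= lr * d_vis.sum(axis=0)

rng = np.random.default_rng(0)
# Toy binary data: 64 samples drawn from 4 random prototype patterns.
prototypes = (rng.random((4, 8)) < 0.5).astype(float)
x = prototypes[rng.integers(0, 4, size=64)]

W = rng.uniform(-0.1, 0.1, size=(8, 6))
b_hid, b_vis = np.zeros(6), np.zeros(8)

initial_loss = cross_entropy(x, reconstruct(x, W, b_hid, b_vis)[1])
for _ in range(300):
    dae_step(x, W, b_hid, b_vis, lr=0.5, corruption_level=0.25, rng=rng)
final_loss = cross_entropy(x, reconstruct(x, W, b_hid, b_vis)[1])
print(initial_loss, final_loss)
```

In a stacked setup, the learned `W` and `b_hid` would serve as the initial parameters of one hidden layer, and the hidden activations would become the input to the next auto-encoder.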

[Figure: results of the optimization versus generalization experiment]

A Better Random Initialization?

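The authors raise an interesting question: is there a better distribution for random initialization that could approach the effect of pre-training? They ran several sets of experiments, and all of them fell somewhat short of pre-training. That said, some of today's initialization schemes already work quite well.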
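
For reference, here is a rough sketch of what such initialization schemes look like. The notes do not say which distributions the authors tried or which modern schemes are meant, so the three below are assumptions picked for illustration: a fan-in-scaled uniform heuristic, Glorot (Xavier) uniform, and He normal. The function names and layer sizes are hypothetical.

```python
import numpy as np

def fan_in_uniform(fan_in, fan_out, rng):
    """Uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in)], a common older heuristic."""
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform: bound depends on both fan-in and fan-out."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """He normal: zero-mean Gaussian with variance 2 / fan_in, usually paired with ReLU."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 500  # hypothetical layer sizes
W_fan_in = fan_in_uniform(fan_in, fan_out, rng)
W_glorot = glorot_uniform(fan_in, fan_out, rng)
W_he = he_normal(fan_in, fan_out, rng)

for name, W in [("fan-in uniform", W_fan_in), ("Glorot uniform", W_glorot), ("He normal", W_he)]:
    print(name, W.std())
```

All three only change the scale and shape of the random starting distribution; unlike pre-training, none of them uses the training data.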

Evaluating the Importance of Pre-Training on Different Layers

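Another set of experiments measures the gain from pre-training different layers. As the figure below shows, pre-training only the first layer comes close to pre-training all layers, whereas pre-training only the second layer does not. One explanation given is that "training the lower layers is more difficult because gradient information becomes less informative as it is backpropagated through more layers", so pre-training matters more for the lower layers.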

[Figure: effect of pre-training different layers]
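
One facet of the quoted explanation can be sketched numerically: in a randomly initialized sigmoid network, the magnitude of the backpropagated gradient shrinks as it moves toward the lower layers. The sketch below is an assumption-laden toy (sigmoid units, equal layer widths, fan-in uniform weights, an arbitrary error signal at the top), and it only measures gradient size, which is narrower than "less informative" in the quote.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
width, depth = 50, 5  # hypothetical sizes
bound = 1.0 / np.sqrt(width)
weights = [rng.uniform(-bound, bound, size=(width, width)) for _ in range(depth)]

# Forward pass, keeping every layer's activations.
activations = [rng.random(width)]  # input in [0, 1)
for W in weights:
    activations.append(sigmoid(activations[-1] @ W))

# Backward pass from an arbitrary error signal at the top layer.
top = activations[-1]
delta = rng.normal(size=width) * top * (1 - top)
grad_norms = []
for layer in reversed(range(depth)):
    grad_W = np.outer(activations[layer], delta)  # gradient w.r.t. weights[layer]
    grad_norms.append(np.linalg.norm(grad_W))
    if layer > 0:
        a = activations[layer]
        delta = (weights[layer] @ delta) * a * (1 - a)  # pass the error one layer down
grad_norms.reverse()  # index 0 is now the lowest layer

for layer, norm in enumerate(grad_norms):
    print(f"layer {layer}: gradient norm {norm:.2e}")
```

With these settings the lowest layer receives a much smaller gradient than the top one, which is consistent with the observation that pre-training the first layer matters most.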

Summary

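The paper discusses why deep networks are harder to train than shallow ones and shows experimentally what pre-training contributes. Although the pre-training in the paper (based on stacked denoising auto-encoders) differs from today's pre-training methods, the paper identifies the reasons behind its benefit for deep networks (better optimization and better generalization), which still offers some guidance.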