seq2seq implementation details in TensorFlow (4): attention (scaled Luong and normed Bahdanau) and the optimizer in TensorFlow nmt

1.attention

The TensorFlow nmt tutorial has this to say:

Attention: Bahdanau-style attention often requires bidirectionality on the encoder side to work well; whereas Luong-style attention tends to work well for different settings. For this tutorial code, we recommend using the two improved variants of Luong & Bahdanau-style attentions: scaled_luong & normed bahdanau.

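In the TensorFlow code, scaled_luong is built the same way as luong except for one argument: scale=True is the only difference between the two.

Likewise, normed_bahdanau is bahdanau with normalize=True set.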
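To make the two flags concrete, here is a minimal pure-Python sketch of what scale=True and normalize=True change in the score computation. It is an illustration, not the tf.contrib.seq2seq code: the function names, the ATTENTION_OPTIONS table, and the plain-list inputs are made up for this example, and query and key are assumed to be already projected to the same size. The sketch assumes that scale multiplies the Luong dot-product score by a learned scalar g, and that normalize applies weight normalization to the Bahdanau vector v (v becomes g * v / ||v||) and adds a bias b inside the tanh.

```python
import math

# Hypothetical lookup: attention option name -> (score family, extra flag on?)
ATTENTION_OPTIONS = {
    "luong": ("luong", False),
    "scaled_luong": ("luong", True),
    "bahdanau": ("bahdanau", False),
    "normed_bahdanau": ("bahdanau", True),
}


def luong_score(query, key, scale=False, g=1.0):
    """Multiplicative score: dot(query, key), times scalar g when scale=True."""
    score = sum(q * k for q, k in zip(query, key))
    return g * score if scale else score


def bahdanau_score(query, key, v, normalize=False, g=1.0, b=None):
    """Additive score: dot(v, tanh(key + query [+ b])).

    With normalize=True, v is replaced by g * v / ||v|| and the bias b
    (zeros if not given) is added inside the tanh.
    """
    bias = [0.0] * len(v)
    if normalize:
        norm = math.sqrt(sum(x * x for x in v))
        v = [g * x / norm for x in v]
        if b is not None:
            bias = b
    return sum(
        vi * math.tanh(k + q + bi)
        for vi, k, q, bi in zip(v, key, query, bias)
    )


def attention_score(option, query, key, v=None, g=1.0):
    """Dispatch on the option name; v is only needed for the Bahdanau family."""
    family, flag = ATTENTION_OPTIONS[option]
    if family == "luong":
        return luong_score(query, key, scale=flag, g=g)
    return bahdanau_score(query, key, v, normalize=flag, g=g)
```

In this sketch, attention_score("luong", ...) and attention_score("scaled_luong", ...) return the same value whenever g is 1.0, so the scaled variant only starts to differ once g has been trained away from that value.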

2.optimizer

The nmt tutorial says:

Optimizer: while Adam can lead to reasonable results for "unfamiliar" architectures, SGD with scheduling will generally lead to better performance if you can train with SGD.

At first I was not sure what "SGD with scheduling" meant.

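I had previously seen someone on Zhihu say that Adam uses more GPU memory. That seems to be right: Adam keeps two extra moment estimates for every parameter, and with sgd I can use a somewhat larger batch_size (forgive me, I can only afford the GTX 960M that came with my laptop).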
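A rough back-of-the-envelope sketch of why Adam costs more memory: plain SGD keeps no per-parameter optimizer state, while Adam keeps two extra values (first and second moment estimates) per parameter. The helper name, the slot table, and the 50-million-parameter figure below are made up for illustration, float32 storage is assumed, and the few scalar accumulators an implementation may also keep are ignored.

```python
BYTES_PER_FLOAT32 = 4

# Extra per-parameter state kept by the optimizer (illustrative table).
SLOTS_PER_PARAM = {"sgd": 0, "adam": 2}


def optimizer_state_bytes(num_params, optimizer):
    """Bytes of optimizer state on top of the parameters themselves."""
    return num_params * SLOTS_PER_PARAM[optimizer] * BYTES_PER_FLOAT32


# Hypothetical model size, chosen only to make the arithmetic easy to follow.
num_params = 50_000_000
extra_mb = optimizer_state_bytes(num_params, "adam") / 1e6
print("extra optimizer state with adam: %.0f MB" % extra_mb)
```

On a small laptop GPU that extra state comes out of the same memory budget as the activations, which is consistent with being able to fit a slightly larger batch under sgd.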

Reading further into the nmt source, I found that it decays the learning rate for sgd.

Several decay schemes are provided:

luong234: after 2/3 of the training steps have run, the learning rate is halved 4 times, at equal intervals over the remaining steps.
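The luong234 scheme can be sketched in plain Python as follows. This is a reading of the decay logic rather than the nmt code itself: the function name and the 12000-step example are hypothetical, the halving is assumed to be a staircase (the rate changes only at interval boundaries), and the max(1, ...) guard is an addition to avoid dividing by zero on very short runs.

```python
def luong234_learning_rate(step, num_train_steps, base_lr=1.0):
    """Learning rate at a given global step under a luong234-style schedule."""
    # Decay starts once 2/3 of the training steps are done.
    start_decay_step = int(num_train_steps * 2 / 3)
    decay_times = 4
    remain_steps = num_train_steps - start_decay_step
    # The remaining steps are split into 4 equal intervals.
    decay_steps = max(1, int(remain_steps / decay_times))

    if step < start_decay_step:
        return base_lr
    # Staircase decay: one halving per completed interval.
    num_halvings = (step - start_decay_step) // decay_steps
    return base_lr * 0.5 ** num_halvings


# Hypothetical run of 12000 steps: decay starts at step 8000,
# and the rate is halved every 1000 steps after that.
for step in (0, 7999, 8000, 9000, 10000, 11000):
    print(step, luong234_learning_rate(step, 12000))
```

With an initial learning rate of 1.0, this gives a constant rate for the first two thirds of training, followed by a short, steep staircase at the end.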

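Looking at the standard hyperparameters that nmt provides:

The large-dataset configurations all use normed bahdanau attention.

The optimizer is always sgd, that is, tf.train.GradientDescentOptimizer.

The initial learning rate is 1.0.

There are of course many other parameters as well.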