
Layer/Batch/Instance Normalization


Overview

In the figure, N is the batch dimension and C is the channel dimension in CV (the sequence length / time-step dimension in NLP); for images, [H, W] are the height and width of the 2-D pixel map in each channel, while in NLP each position carries only a 1-D feature vector. Batch Norm depends on the batch: it normalizes over the three dimensions [N, H, W], separately for each channel. Layer Norm does not depend on the batch: it normalizes over the three dimensions [C, H, W], separately for each sample. Instance Norm depends on neither the batch nor the other channels: it normalizes over the two dimensions [H, W] only, separately for each sample and channel.

The three normalizations share the same formula; they differ only in which elements of $x$ the statistics are computed over:

$$y=\frac{x-\mathrm{E}[x]}{\sqrt{\operatorname{Var}[x]+\epsilon}} \cdot \gamma+\beta$$
where $x$ is defined differently for each method, $\mathrm{E}[x]$ is the expectation (mean), $\operatorname{Var}[x]$ is the variance, and $\gamma$ and $\beta$ are learnable parameters whose size matches the normalized dimensions.
[Figure: comparison of the dimensions normalized by Batch Norm, Layer Norm, and Instance Norm over an (N, C, H, W) tensor]
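To make the shared structure concrete, here is a minimal PyTorch sketch (the tensor shapes and the helper name normalize are illustrative, not from the original post): the same mean/variance normalization is applied in all three cases, and only the reduction dimensions change.

import torch

def normalize(x, dims, eps=1e-5):
    # Shared formula: (x - E[x]) / sqrt(Var[x] + eps); gamma and beta omitted here.
    mean = x.mean(dim=dims, keepdim=True)
    var = x.var(dim=dims, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(4, 3, 8, 8)               # (N, C, H, W), shapes chosen arbitrarily
bn_like = normalize(x, dims=(0, 2, 3))    # Batch Norm: statistics over (N, H, W), per channel
ln_like = normalize(x, dims=(1, 2, 3))    # Layer Norm: statistics over (C, H, W), per sample
in_like = normalize(x, dims=(2, 3))       # Instance Norm: statistics over (H, W), per (sample, channel)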

Batch Norm

$$y_{tilm}=\frac{x_{tilm}-\mu_{i}}{\sqrt{\sigma_{i}^{2}+\epsilon}} \cdot \gamma+\beta, \quad \mu_{i}=\frac{1}{HWT} \sum_{t=1}^{T} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm}, \quad \sigma_{i}^{2}=\frac{1}{HWT} \sum_{t=1}^{T} \sum_{l=1}^{W} \sum_{m=1}^{H}\left(x_{tilm}-\mu_{i}\right)^{2}$$
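As a quick sanity check (the shapes below are arbitrary and this sketch is not part of the original post), nn.BatchNorm2d in training mode should match a manual computation of the statistics over (N, H, W) for each channel:

import torch
from torch import nn

x = torch.randn(4, 3, 8, 8)                        # (T, C, H, W)
bn = nn.BatchNorm2d(num_features=3, affine=False)  # gamma=1, beta=0 for comparison
out = bn(x)                                        # training mode -> batch statistics

mu = x.mean(dim=(0, 2, 3), keepdim=True)                  # mu_i over (N, H, W)
var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)  # sigma_i^2 (biased)
manual = (x - mu) / torch.sqrt(var + bn.eps)
print(torch.allclose(out, manual, atol=1e-5))      # expected: True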

Layer Norm
  • For a 4-D input (a verification sketch for this case is given after the 3-D example below):
    $$y_{tilm}=\frac{x_{tilm}-\mu_{t}}{\sqrt{\sigma_{t}^{2}+\epsilon}} \cdot \gamma+\beta, \quad \mu_{t}=\frac{1}{HWC} \sum_{i=1}^{C} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm}, \quad \sigma_{t}^{2}=\frac{1}{HWC} \sum_{i=1}^{C} \sum_{l=1}^{W} \sum_{m=1}^{H}\left(x_{tilm}-\mu_{t}\right)^{2}$$

  • For a 3-D input:
    $$y_{til}=\frac{x_{til}-\mu_{t}}{\sqrt{\sigma_{t}^{2}+\epsilon}} \cdot \gamma+\beta, \quad \mu_{t}=\frac{1}{NS} \sum_{i=1}^{S} \sum_{l=1}^{N} x_{til}, \quad \sigma_{t}^{2}=\frac{1}{NS} \sum_{i=1}^{S} \sum_{l=1}^{N}\left(x_{til}-\mu_{t}\right)^{2}$$

  • A code test example for the 3-D input:

import torch
from torch import nn

class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-5):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))   # gamma
        self.b_2 = nn.Parameter(torch.zeros(features))  # beta
        self.eps = eps

    def forward(self, x):
        # Statistics over everything except the batch dimension
        # (hard-coded below for an input of shape (2, 3, 3)).
        sizes = x.size()
        mean = x.view(sizes[0], -1).mean(-1, keepdim=True)
        std = x.view(sizes[0], -1).std(-1, keepdim=True)
        mean = torch.repeat_interleave(mean, repeats=9, dim=1).view(2, 3, 3)
        std = torch.repeat_interleave(std, repeats=9, dim=1).view(2, 3, 3)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

norm_2 = LayerNorm(features=[2, 3, 3])
norm = nn.LayerNorm(normalized_shape=(3, 3), elementwise_affine=True)
a = torch.tensor([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  [[4, 5, 6], [7, 8, 9], [10, 11, 12]]], dtype=torch.float32)
print(norm(a))
print(norm_2(a))
  • Output:
tensor([[[-1.5492, -1.1619, -0.7746],
         [-0.3873,  0.0000,  0.3873],
         [ 0.7746,  1.1619,  1.5492]],

        [[-1.5492, -1.1619, -0.7746],
         [-0.3873,  0.0000,  0.3873],
         [ 0.7746,  1.1619,  1.5492]]], grad_fn=<NativeLayerNormBackward>)
tensor([[[-1.4606, -1.0954, -0.7303],
         [-0.3651,  0.0000,  0.3651],
         [ 0.7303,  1.0954,  1.4606]],

        [[-1.4606, -1.0954, -0.7303],
         [-0.3651,  0.0000,  0.3651],
         [ 0.7303,  1.0954,  1.4606]]], grad_fn=<AddBackward0>)

Analysis: the two implementations produce slightly different values but follow the same computation. The gap comes from two implementation details: the hand-written module uses the unbiased standard deviation (torch.Tensor.std divides by n-1) and adds eps to the std, whereas nn.LayerNorm normalizes with the biased variance and adds eps under the square root.
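For the 4-D case listed above, a similar sketch (with assumed shapes, not from the original post) compares nn.LayerNorm with normalized_shape=(C, H, W) against statistics computed manually over (C, H, W) for each sample:

import torch
from torch import nn

x = torch.randn(2, 3, 4, 4)                                 # (T, C, H, W)
ln = nn.LayerNorm(normalized_shape=(3, 4, 4), elementwise_affine=False)
out = ln(x)

mu = x.mean(dim=(1, 2, 3), keepdim=True)                    # mu_t over (C, H, W)
var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)    # sigma_t^2 (biased)
manual = (x - mu) / torch.sqrt(var + ln.eps)
print(torch.allclose(out, manual, atol=1e-5))               # expected: True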

The Layer Norm paper also gives an example of applying it to RNNs, which corresponds exactly to the 3-D input case [batch_size, seq_len, num_features]. Let the input at the current time step be $x^{t}$ and the hidden state from the previous step be $h^{t-1}$; then $a^{t}=W_{hh}h^{t-1}+W_{xh}x^{t}$. Note that $a^{t}$ is simply $h^{t}$ before the activation function. Layer Norm is then applied directly to $a^{t}$, as follows:
$$\mathbf{h}^{t}=f\left[\frac{\mathbf{g}}{\sigma^{t}} \odot\left(\mathbf{a}^{t}-\mu^{t}\right)+\mathbf{b}\right], \quad \mu^{t}=\frac{1}{H} \sum_{i=1}^{H} a_{i}^{t}, \quad \sigma^{t}=\sqrt{\frac{1}{H} \sum_{i=1}^{H}\left(a_{i}^{t}-\mu^{t}\right)^{2}}$$
Here $\mathbf{g}$ and $\mathbf{b}$ play the roles of $\gamma$ and $\beta$, and $f$ is the activation function (e.g. $\tanh$). Note that this normalizes the pre-activation $h^{t}$ of each sample at every individual time step, which is different from the 3-D formula given earlier (where the statistics are pooled over the whole sequence).

  • Harvardnlp gives an implementation of Layer Norm applied at each time step (per position):
class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))   # gamma
        self.b_2 = nn.Parameter(torch.zeros(features))  # beta
        self.eps = eps

    def forward(self, x):
        # Statistics over the last (feature) dimension only, i.e. computed
        # independently for every sample and every time step / position.
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
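A brief usage note (the shapes and hidden size below are assumptions for illustration, and this relies on the class and imports above): applied to a [batch, seq_len, features] tensor, the module takes mean and std over the last dimension only, so $\mu^{t}$ and $\sigma^{t}$ are obtained independently for every time step of every sample, matching the RNN formula above.

ln = LayerNorm(features=512)      # 512 is an illustrative hidden size
x = torch.randn(8, 20, 512)       # (batch, seq_len, features), shapes assumed
y = ln(x)                         # same shape; each (sample, time step) row is normalized separately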
Instance Norm

$$y_{tilm}=\frac{x_{tilm}-\mu_{ti}}{\sqrt{\sigma_{ti}^{2}+\epsilon}} \cdot \gamma+\beta, \quad \mu_{ti}=\frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm}, \quad \sigma_{ti}^{2}=\frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H}\left(x_{tilm}-\mu_{ti}\right)^{2}$$
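A quick check (shapes assumed, not from the original post): nn.InstanceNorm2d should match a manual normalization of each (sample, channel) slice with statistics over (H, W) only:

import torch
from torch import nn

x = torch.randn(2, 3, 4, 4)                               # (T, C, H, W)
inorm = nn.InstanceNorm2d(num_features=3, affine=False)
out = inorm(x)

mu = x.mean(dim=(2, 3), keepdim=True)                     # mu_{ti} over (H, W)
var = x.var(dim=(2, 3), keepdim=True, unbiased=False)     # sigma_{ti}^2 (biased)
manual = (x - mu) / torch.sqrt(var + inorm.eps)
print(torch.allclose(out, manual, atol=1e-5))             # expected: True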
If anything above is incorrect, corrections are welcome!

References

[1] Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015).
[2] Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. “Layer normalization.” arXiv preprint arXiv:1607.06450 (2016).
[3] Ulyanov, Dmitry, Andrea Vedaldi, and Victor Lempitsky. “Instance normalization: The missing ingredient for fast stylization.” arXiv preprint arXiv:1607.08022 (2016).
[4] http://nlp.seas.harvard.edu/2018/04/03/attention.html
