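Batch Normalization and Residual Networks
- Batch Normalization
- 1. Batch normalization for fully connected layers
- 2. Batch normalization for convolutional layers
- 3. Batch normalization at prediction time
- Implementation from scratch
- Application to LeNet
- Concise implementation
- Residual Networks (ResNet)
- Residual block
- The ResNet model
- Densely Connected Networks (DenseNet)
- Dense block
- Transition layer
- The DenseNet model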
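Batch Normalization
Batch normalization (BN) is one of the techniques aimed at deep CNNs; it makes effective models easier to train. At its core it is a standardization of the data.
- Standardizing the input (shallow models)
After processing, every feature has mean 0 and standard deviation 1 across all samples in the dataset.
Standardizing the input makes the distributions of the features similar, which makes it easier to train an effective model. For deep models, however, standardizing only the input is not enough: when the network is very deep, the values near the output layer can still change drastically.
- Batch normalization (deep models)
Using the mean and standard deviation of each mini-batch, BN continually adjusts the intermediate outputs of the network, so that the intermediate outputs of every layer are numerically more stable.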
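The input standardization described above can be sketched in a few lines of plain Python. This is only an illustration: the data is made up and the helper name `standardize_features` is hypothetical. Each feature column is shifted and scaled so that its mean is 0 and its standard deviation is 1 over all samples.

```python
import math

def standardize_features(rows):
    """Standardize each feature (column) to mean 0 and standard deviation 1."""
    n = len(rows)
    num_features = len(rows[0])
    # Per-feature mean and (population) standard deviation over all samples
    means = [sum(r[j] for r in rows) / n for j in range(num_features)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(num_features)]
    return [[(r[j] - means[j]) / stds[j] for j in range(num_features)] for r in rows]

# Made-up data: 4 samples, 2 features on very different scales
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]]
standardized = standardize_features(data)
for row in standardized:
    print([round(v, 3) for v in row])
```

After standardization the two columns are directly comparable even though their raw scales differed by a factor of about 100.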
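1. Batch normalization for fully connected layers
Position: between the affine transformation and the activation function of the fully connected layer.
A question about where BN belongs: should batch normalization be placed before or after the nonlinear activation layer?
Fully connected layer:
The input $u$ has size batch_size × (number of input neurons). Batch normalization is placed between the two formulas below, i.e. it turns $x$ into a quantity with mean 0 and standard deviation 1 over all samples in the batch:
$$x = Wu + b$$
$$\text{output} = \phi(x)$$
Batch normalization:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)} - \mu_B\right)^2$$
$$\hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
Here $\epsilon > 0$ is a very small constant that keeps the denominator greater than 0; $x^{(i)}$ is the $i$-th sample in the batch (for example, with d output neurons, $x^{(i)}$ is a vector of length d).
Learnable parameters are then introduced, a scale parameter $\gamma$ and a shift parameter $\beta$:
$$y^{(i)} = \gamma \odot \hat{x}^{(i)} + \beta$$
They keep open the possibility of BN having no effect: if $\gamma = \sqrt{\sigma_B^2 + \epsilon}$ and $\beta = \mu_B$, then $y^{(i)} = x^{(i)}$ and batch normalization does nothing. If BN does not help in a model, the network can learn these values and thereby disable it.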
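The claim that a particular choice of scale and shift undoes the normalization can be checked numerically. The following is a minimal sketch with made-up data; the tensor names are illustrative, and the statistics are computed per feature over the batch dimension, as for a fully connected layer.

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 3) * 4.0 + 2.0   # made-up batch: m=8 samples, d=3 features
eps = 1e-5

mu = X.mean(dim=0)                          # per-feature batch mean, shape (3,)
var = ((X - mu) ** 2).mean(dim=0)           # per-feature (biased) batch variance
X_hat = (X - mu) / torch.sqrt(var + eps)    # standardized values

# Special choice: scale = sqrt(var + eps), shift = mu undoes the normalization
gamma = torch.sqrt(var + eps)
beta = mu
Y_identity = gamma * X_hat + beta

print(torch.allclose(Y_identity, X, atol=1e-5))
```

With the usual initialization (scale 1, shift 0) the layer outputs `X_hat` itself, so the learnable parameters let training move anywhere between full normalization and no normalization.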
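Question: what exactly do the scale parameter γ and the shift parameter β learn?
An answer quoted from liwei in the course comment section:
These two parameters make it possible to restore the normalized information to what it was before. For example, if the information is very important, normalization may lose much of it, and these two parameters let us recover it. You can also think of the gating mechanisms in recurrent neural networks.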
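2. Batch normalization for convolutional layers
Position: after the convolution and before the activation function.
If the convolution outputs multiple channels, batch normalization is applied to the output of each channel separately, and each channel has its own independent scale and shift parameters.
Computation: for a single channel with batch size m and a convolution output of size p × q,
all m × p × q elements of that channel are normalized together, using the same mean and variance.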
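As a rough check of the per-channel rule above, the sketch below normalizes a made-up 4-D tensor by hand, with one mean and one variance per channel over all m × p × q elements, and compares the result with PyTorch's built-in `nn.BatchNorm2d` in training mode. The sizes are arbitrary and chosen only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
m, c, p, q = 4, 3, 5, 5             # batch size, channels, height, width (made up)
X = torch.randn(m, c, p, q)
eps = 1e-5

# One mean and one variance per channel, taken over all m * p * q elements
mean = X.mean(dim=(0, 2, 3), keepdim=True)                  # shape (1, c, 1, 1)
var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
X_hat_manual = (X - mean) / torch.sqrt(var + eps)

# Built-in layer in training mode; its scale is 1 and its shift is 0 at initialization
bn = nn.BatchNorm2d(c, eps=eps)
bn.train()
X_hat_builtin = bn(X)

print(mean.shape)
print(torch.allclose(X_hat_manual, X_hat_builtin, atol=1e-5))
```

Keeping the statistics in shape (1, c, 1, 1) lets them broadcast against the (m, c, p, q) input.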
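3. Batch normalization at prediction time
The BN described in 1. and 2. operates on one batch at a time.
Training: the batch is the unit of computation; the mean and variance are computed for every batch (1. and 2. belong to this case).
Prediction: the sample mean and variance of the whole training set are estimated with moving averages. (Unlike training, prediction has no batch statistics to rely on, so the statistics of the whole training set have to be estimated.)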
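The moving-average estimate can be illustrated with plain Python. This is a sketch under assumptions: the data is synthetic, and the `momentum` value of 0.9 is chosen for illustration, where `momentum` is the weight kept on the old estimate.

```python
import random

random.seed(0)
true_mean = 5.0         # mean of the synthetic "training set" distribution
momentum = 0.9          # weight of the old estimate (assumed value)
moving_mean = 0.0       # the running estimate starts at 0

for step in range(200):
    # One mini-batch of 32 made-up values
    batch = [random.gauss(true_mean, 2.0) for _ in range(32)]
    batch_mean = sum(batch) / len(batch)
    # Blend the old estimate with the statistic of the current batch
    moving_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean

print(round(moving_mean, 3))
```

After enough batches the running value settles close to the mean of the underlying distribution, which is why it can stand in for batch statistics at prediction time.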
Implementation from scratch
To keep the comments short, d denotes the number of output neurons of a fully connected layer.
import time
import torch
from torch import nn, optim
import torch.nn.functional as F
import torchvision
import sys
sys.path.append("/home/kesci/input/")
import d2lzh1981 as d2l
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
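# X may be the output of a fully connected layer or of a convolutional layer.
# Either way, we want to turn X into X_hat with mean 0 and variance 1.
def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Check whether we are in training mode or prediction mode
    if not is_training:
        # In prediction mode, directly use the moving-average mean and variance passed in
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:  # X is the output of a fully connected layer, batch_size * d
            # Fully connected case: mean and variance of each feature, taken over the batch
            mean = X.mean(dim=0)  # mean is a vector of length d
            var = ((X - mean) ** 2).mean(dim=0)
        else:  # X is the output of a convolutional layer, 4-D
            # 2-D convolution case: mean and variance per channel (axis=1).
            # Keep the shape of X so that broadcasting works afterwards
            mean = X.mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
            var = ((X - mean) ** 2).mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
        # In training mode, standardize with the current mean and variance
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # For every batch, update the moving averages of the mean and variance.
        # moving_mean is the mean accumulated so far, mean is the current batch mean; same for the variance.
        # momentum is a hyperparameter set in advance. The larger it is, the less the mean and
        # variance of the current batch contribute to the moving averages.
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean, moving_var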
The BatchNorm class mainly maintains the learnable parameters and a few hyperparameters.
class BatchNorm(nn.Module):
    # Fully connected: num_dims=2, num_features=d
    # Convolution: num_dims=4, num_features=channels
    def __init__(self, num_features, num_dims):
        super(BatchNorm, self).__init__()
        if num_dims == 2:
            shape = (1, num_features)  # output neurons of the fully connected layer
        else:
            shape = (1, num_features, 1, 1)  # number of channels
        # Scale and shift parameters that take part in gradients and updates,
        # initialized to 1 and 0 respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Variables that do not take part in gradients and updates, initialized to 0 in main memory
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.zeros(shape)

    def forward(self, X):
        # If X is not in main memory, copy moving_mean and moving_var to the device where X lives
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var. The training attribute of a Module
        # instance defaults to True and is set to False by calling .eval()
        Y, self.moving_mean, self.moving_var = batch_norm(
            self.training, X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
Application to LeNet
net = nn.Sequential(
    nn.Conv2d(1, 6, 5),  # in_channels, out_channels, kernel_size
    BatchNorm(6, num_dims=4),  # the previous layer is a conv, so num_dims=4; 6 is the conv's out_channels
    nn.Sigmoid(),
    nn.MaxPool2d(2, 2),  # kernel_size, stride
    nn.Conv2d(6, 16, 5),
    BatchNorm(16, num_dims=4),
    nn.Sigmoid(),
    nn.MaxPool2d(2, 2),
    d2l.FlattenLayer(),
    nn.Linear(16*4*4, 120),  # number of output neurons d=120
    BatchNorm(120, num_dims=2),  # the previous layer is fully connected, so num_dims=2
    nn.Sigmoid(),
    nn.Linear(120, 84),
    BatchNorm(84, num_dims=2),
    nn.Sigmoid(),
    nn.Linear(84, 10)
)
print(net)
Sequential(
(0): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
(1): BatchNorm()
(2): Sigmoid()
(3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(4): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
(5): BatchNorm()
(6): Sigmoid()
(7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(8): FlattenLayer()
(9): Linear(in_features=256, out_features=120, bias=True)
(10): BatchNorm()
(11): Sigmoid()
(12): Linear(in_features=120, out_features=84, bias=True)
(13): BatchNorm()
(14): Sigmoid()
(15): Linear(in_features=84, out_features=10, bias=True)
)
# batch_size = 256
# Use a smaller batch size when training on the CPU
batch_size = 16

def load_data_fashion_mnist(batch_size, resize=None, root='/home/kesci/input/FashionMNIST2065'):
    """Download the fashion mnist dataset and then load into memory."""
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())
    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=2)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=2)
    return train_iter, test_iter

train_iter, test_iter = load_data_fashion_mnist(batch_size)
# Training and displaying the results
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
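Concise implementation
To use batch normalization in your own models you do not need to write the class and function yourself; nn provides built-in versions.
nn.BatchNorm2d(6): the 2d variant goes after a convolutional layer; 6 is the number of output channels of the preceding conv.
nn.BatchNorm1d(120): the 1d variant goes after a fully connected layer; 120 is the number of output neurons of that layer (that is, the number of input features of the BatchNorm layer).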
net = nn.Sequential(
    nn.Conv2d(1, 6, 5),  # in_channels, out_channels, kernel_size
    nn.BatchNorm2d(6),
    nn.Sigmoid(),
    nn.MaxPool2d(2, 2),  # kernel_size, stride
    nn.Conv2d(6, 16, 5),
    nn.BatchNorm2d(16),
    nn.Sigmoid(),
    nn.MaxPool2d(2, 2),
    d2l.FlattenLayer(),
    nn.Linear(16*4*4, 120),
    nn.BatchNorm1d(120),
    nn.Sigmoid(),
    nn.Linear(120, 84),
    nn.BatchNorm1d(84),
    nn.Sigmoid(),
    nn.Linear(84, 10)
)
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
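One detail worth checking when moving from the from-scratch version to the built-in layers is the meaning of `momentum`. In PyTorch's `nn.BatchNorm1d` and `nn.BatchNorm2d`, `momentum` is the weight given to the new batch statistic (default 0.1). The from-scratch `batch_norm` above uses the opposite convention, so its `momentum=0.9` corresponds to the built-in default. The sketch below, with made-up data, checks the update rule and the switch between training and prediction mode.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)            # default momentum is 0.1
X = torch.randn(16, 4) + 3.0      # made-up batch: 16 samples, 4 features

bn.train()
_ = bn(X)                         # one training step updates the running statistics
# running_mean starts at 0; the new batch mean gets weight `momentum`
expected = (1 - bn.momentum) * torch.zeros(4) + bn.momentum * X.mean(dim=0)
print(torch.allclose(bn.running_mean, expected, atol=1e-6))

bn.eval()
Y_eval = bn(X)                    # prediction mode: running statistics, not batch statistics
manual = (X - bn.running_mean) / torch.sqrt(bn.running_var + bn.eps)
print(torch.allclose(Y_eval, manual, atol=1e-5))
```

Calling `net.eval()` before evaluation is what switches every built-in BatchNorm layer in the network to the running statistics.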
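Residual Networks (ResNet)
A problem in deep learning: once a deep CNN reaches a certain depth, blindly adding more layers does not improve classification performance any further. Instead, the network converges more slowly and accuracy gets worse.
In theory, the mappings a deep network can fit must include those a shallower network can fit. In practice, with too many layers the training error keeps rising, and the problem persists even though BN makes the values more stable. ResNet was proposed to address it.
Residual block
Identity mapping:
Left (plain block): f(x) = x
Right (residual block): f(x) - x = 0 (small deviations from the identity mapping are easier to capture)
In a residual block, the input can propagate forward faster through the cross-layer data path.
The difference between left and right: the left has to fit the ideal mapping f(x) of x directly, whereas the right has to fit f(x) - x, the residual of the ideal mapping, and then adds the original input x back to form the ideal mapping f(x).
E.g., if the ideal mapping is f(x) = x, the left has to fit x, while the right only has to fit 0.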
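The point that a residual block only has to fit 0 to represent the identity can be seen in a small sketch. `TinyResidual` is a hypothetical, stripped-down block written only for this illustration (one convolution, no batch normalization); it is not the block used in the rest of this text.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyResidual(nn.Module):
    """Hypothetical minimal residual block: output = relu(g(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, X):
        return F.relu(self.conv(X) + X)

blk = TinyResidual(3)
# Make the residual branch g(x) output exactly 0
nn.init.zeros_(blk.conv.weight)
nn.init.zeros_(blk.conv.bias)

X = torch.rand(2, 3, 6, 6)     # non-negative input, so the final ReLU changes nothing
Y = blk(X)
print(torch.allclose(Y, X))
```

With all weights of the residual branch at zero, the block passes its (non-negative) input through unchanged; a plain block would need to learn a nontrivial set of weights to do the same.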
class Residual(nn.Module):  # This class is saved in the d2lzh_pytorch package for later use
    # The number of output channels, whether to use an extra 1x1 convolution to change
    # the number of channels, and the stride of the convolution can all be set.
    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super(Residual, self).__init__()
        # The layers are defined here (which layers are used does not affect the residual
        # structure; the residual structure shows up mainly in forward)
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:  # i.e. use_1x1conv = True
            X = self.conv3(X)  # change the channels of X to match Y so that the two can be added
        return F.relu(Y + X)
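One question here: class Residual() uses a 1x1 convolution so that X and Y have the same number of channels and can be added. One might think of setting use_1x1conv = in_channels != out_channels. Would that not work? Why specify False or True by hand?
If the only goal is to match in_c and out_c, it can indeed be written that way. But that is not the only case in which the 1x1 convolution is needed: for example, when stride is not 1 the shortcut also has to shrink the height and width, even if the channel counts already match. For generality it is better to keep the explicit flag.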
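The stride case can be worked out with the standard convolution output-size formula, floor((n + 2p - k) / s) + 1. The sketch below is plain Python; the helper `conv_out_size` is hypothetical and only mirrors the kernel sizes, paddings and strides used in `Residual`. It compares the spatial size of the main branch with that of the shortcut, with and without the 1x1 convolution.

```python
def conv_out_size(n, kernel_size, padding, stride):
    """Output height/width of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel_size) // stride + 1

n = 6  # input height/width
results = {}

for stride in (1, 2):
    # Main branch: conv1 (3x3, padding 1, given stride), then conv2 (3x3, padding 1, stride 1)
    main = conv_out_size(conv_out_size(n, 3, 1, stride), 3, 1, 1)
    # Shortcut without conv3 passes X through unchanged
    identity = n
    # Shortcut with conv3 (1x1, no padding, same stride as conv1)
    projected = conv_out_size(n, 1, 0, stride)
    results[stride] = (main, identity, projected)
    print(stride, main, identity, projected)
```

With stride 2 the main branch produces 3x3 feature maps while an untouched shortcut would still be 6x6, so the addition would fail regardless of the channel counts; the strided 1x1 convolution brings the shortcut down to 3x3 as well.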
Example 1: in_channels = out_channels = 3, so no 1x1 conv is needed to change the number of channels:
blk = Residual(3, 3)
X = torch.rand((4, 3, 6, 6))
blk(X).shape # torch.Size([4, 3, 6, 6])
torch.Size([4, 3, 6, 6])
Example 2: in_channels != out_channels, i.e. Y (out) and X (in) have different numbers of channels, so the number of channels of X has to be changed:
blk = Residual(3, 6, use_1x1conv=True, stride=2)
blk(X).shape # torch.Size([4, 6, 3, 3])
torch.Size([4, 6, 3, 3])
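The ResNet model
Convolution (64 output channels, 7x7 kernel, padding 3)
Batch normalization
Max pooling (3x3, stride 2)
Residual modules x 4 (a residual block with stride 2 reduces the height and width between modules)
Global average pooling
Fully connected layer
ResNet model code: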
net = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
# num_residuals: how many Residual blocks (defined earlier) the module contains
def resnet_block(in_channels, out_channels, num_residuals, first_block=False):
    if first_block:
        assert in_channels == out_channels  # the first module keeps the number of channels of its input
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(in_channels, out_channels, use_1x1conv=True, stride=2))
        else:
            blk.append(Residual(out_channels, out_channels))
    return nn.Sequential(*blk)

net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
net.add_module("resnet_block2", resnet_block(64, 128, 2))
net.add_module("resnet_block3", resnet_block(128, 256, 2))
net.add_module("resnet_block4", resnet_block(256, 512, 2))
net.add_module("global_avg_pool", d2l.GlobalAvgPool2d())  # output of GlobalAvgPool2d: (Batch, 512, 1, 1)
net.add_module("fc", nn.Sequential(d2l.FlattenLayer(), nn.Linear(512, 10)))
Example: the input X is a 224x224 grayscale image. Look at the output to see how the shape of X changes after each layer.
X = torch.rand((1, 1, 224, 224))
for name, layer in net.named_children():
    X = layer(X)
    print(name, ' output shape:\t', X.shape)
0 output shape: torch.Size([1, 64, 112, 112])
1 output shape: torch.Size([1, 64, 112, 112])
2 output shape: torch.Size([1, 64, 112, 112])
3 output shape: torch.Size([1, 64, 56, 56])
resnet_block1 output shape: torch.Size([1, 64, 56, 56])
resnet_block2 output shape: torch.Size([1, 128, 28, 28])
resnet_block3 output shape: torch.Size([1, 256, 14, 14])
resnet_block4 output shape: torch.Size([1, 512, 7, 7])
global_avg_pool output shape: torch.Size([1, 512, 1, 1])
fc output shape: torch.Size([1, 10])
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
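Densely Connected Networks (DenseNet)
Telling left from right, e.g.: on the left (ResNet), A and B both output 10 channels, and their sum still has 10 channels; on the right (DenseNet), A outputs 10 channels and B outputs 20 channels, and concatenating them gives 30 channels.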
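The difference between adding and concatenating can be sketched with two made-up feature maps; the tensor names below are illustrative only.

```python
import torch

A = torch.rand(1, 10, 8, 8)         # 10 channels
B_same = torch.rand(1, 10, 8, 8)    # 10 channels, same shape as A
B_wide = torch.rand(1, 20, 8, 8)    # 20 channels

added = A + B_same                              # ResNet style: element-wise sum
concatenated = torch.cat((A, B_wide), dim=1)    # DenseNet style: join along the channel dimension

print(added.shape)
print(concatenated.shape)
```

Addition requires identical shapes and keeps the channel count, while concatenation only requires matching batch size, height and width, and its channel count is the sum of the two inputs.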
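The main building blocks of DenseNet:
Dense block: defines how the input and the output are concatenated.
Transition layer: controls the number of channels so that it does not grow too large. (If the outputs of A and B kept being concatenated, the number of channels would become very large.)
Dense block
At the start, A receives in_channels as input and B outputs out_c channels; the two are concatenated into in + out channels, and one iteration ends. Then in + out becomes the input of the next conv_block inside the DenseBlock, back at A, and a new iteration begins; after the second conv_block the number of channels is in + 2*out, and so on.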
def conv_block(in_channels, out_channels):
    blk = nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
    return blk  # bundled up for convenient reuse

class DenseBlock(nn.Module):
    # num_convs: how many conv_blocks it contains
    # in_channels: input channels of the whole DenseBlock
    # out_channels: not the output of the whole DenseBlock, but the output of each convolution
    def __init__(self, num_convs, in_channels, out_channels):
        super(DenseBlock, self).__init__()
        net = []
        for i in range(num_convs):
            in_c = in_channels + i * out_channels
            net.append(conv_block(in_c, out_channels))
        self.net = nn.ModuleList(net)
        self.out_channels = in_channels + num_convs * out_channels  # number of output channels

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            X = torch.cat((X, Y), dim=1)  # concatenate input and output along the channel dimension
        return X
blk = DenseBlock(2, 3, 10)
X = torch.rand(4, 3, 8, 8) # in_channels=3, hxw=8x8
Y = blk(X)
Y.shape # torch.Size([4, 23, 8, 8])
torch.Size([4, 23, 8, 8]) # 23= 3 + 2*10;
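Transition layer
1x1 convolution: reduces the number of channels.
Average pooling with stride 2: halves the height and width.
This keeps the model from growing more and more complex.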
def transition_block(in_channels, out_channels):
    blk = nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(),
        nn.Conv2d(in_channels, out_channels, kernel_size=1),  # 1x1 convolution
        nn.AvgPool2d(kernel_size=2, stride=2))
    return blk

blk = transition_block(23, 10)  # rescale the channels from 23 to 10
blk(Y).shape # torch.Size([4, 10, 4, 4])
torch.Size([4, 10, 4, 4])
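The DenseNet model
Dense block + transition layer + dense block + transition layer ..., followed by a BN layer, ReLU, average pooling and a fully connected layer, make up a complete neural network. The figure below was drawn from the code; it shows part of the network structure, with the channel counts of the middle part annotated: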
net = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),  # halves the width and height
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1))  # halves the width and height
num_channels, growth_rate = 64, 32  # num_channels is the current number of channels
# growth_rate: the number of channels added by each convolution
num_convs_in_dense_blocks = [4, 4, 4, 4]
for i, num_convs in enumerate(num_convs_in_dense_blocks):
    DB = DenseBlock(num_convs, num_channels, growth_rate)
    net.add_module("DenseBlosk_%d" % i, DB)
    # number of output channels of the previous dense block
    num_channels = DB.out_channels
    # add a transition layer that halves the number of channels between dense blocks
    if i != len(num_convs_in_dense_blocks) - 1:
        net.add_module("transition_block_%d" % i, transition_block(num_channels, num_channels // 2))
        num_channels = num_channels // 2
net.add_module("BN", nn.BatchNorm2d(num_channels))
net.add_module("relu", nn.ReLU())
net.add_module("global_avg_pool", d2l.GlobalAvgPool2d())  # output of GlobalAvgPool2d: (Batch, num_channels, 1, 1)
net.add_module("fc", nn.Sequential(d2l.FlattenLayer(), nn.Linear(num_channels, 10)))

X = torch.rand((1, 1, 96, 96))
for name, layer in net.named_children():
    X = layer(X)
    print(name, ' output shape:\t', X.shape)
0 output shape: torch.Size([1, 64, 48, 48])    # nn.Conv2d() halves the width and height
1 output shape: torch.Size([1, 64, 48, 48])    # nn.BatchNorm2d() leaves the shape unchanged
2 output shape: torch.Size([1, 64, 48, 48])    # nn.ReLU
3 output shape: torch.Size([1, 64, 24, 24])    # nn.MaxPool2d halves them again
DenseBlosk_0 output shape: torch.Size([1, 192, 24, 24])
transition_block_0 output shape: torch.Size([1, 96, 12, 12])
DenseBlosk_1 output shape: torch.Size([1, 224, 12, 12])
transition_block_1 output shape: torch.Size([1, 112, 6, 6])
DenseBlosk_2 output shape: torch.Size([1, 240, 6, 6])
transition_block_2 output shape: torch.Size([1, 120, 3, 3])
DenseBlosk_3 output shape: torch.Size([1, 248, 3, 3])
BN output shape: torch.Size([1, 248, 3, 3])
relu output shape: torch.Size([1, 248, 3, 3])
global_avg_pool output shape: torch.Size([1, 248, 1, 1])
fc output shape: torch.Size([1, 10])
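The channel counts in the printout above can be reproduced by hand. The sketch below is plain Python; the function name `densenet_channels` and the stage labels are hypothetical. It follows the same bookkeeping as the model-building loop: each dense block adds num_convs × growth_rate channels, and each transition layer halves the count.

```python
def densenet_channels(num_channels, growth_rate, num_convs_in_dense_blocks):
    """Return (stage name, channels) pairs for the dense blocks and transition layers."""
    trace = []
    last = len(num_convs_in_dense_blocks) - 1
    for i, num_convs in enumerate(num_convs_in_dense_blocks):
        # A dense block concatenates growth_rate new channels per conv_block
        num_channels += num_convs * growth_rate
        trace.append(("dense_%d" % i, num_channels))
        # A transition layer halves the channels between dense blocks
        if i != last:
            num_channels //= 2
            trace.append(("transition_%d" % i, num_channels))
    return trace

trace = densenet_channels(64, 32, [4, 4, 4, 4])
for name, channels in trace:
    print(name, channels)
```

Starting from 64 channels with a growth rate of 32, this gives 192, 96, 224, 112, 240, 120 and finally 248 channels, matching the shapes printed by the network.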
Training example:
# batch_size = 256  # training on the GPU
batch_size = 16  # training on the CPU; do not make it too large
# If an "out of memory" error appears, reduce batch_size or resize
train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=96)
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)