Attention Mechanism: A Code Walkthrough of the Transformer Model (Machine Translation)

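1 Transformer architecture walkthrough (from the paper)
- 1.1 The model diagram in the paper
- 1.2 Module-by-module explanation
  - 1.2.1 Word and positional embedding module
  - 1.2.2 Encoder module
  - 1.2.3 Decoder module
  - 1.2.4 Output fully connected layer
2 Transformer code analysis
- 2.1 Model code analysis
  - 2.1.1 Embedding (word and positional embedding)
  - 2.1.2 Multi-Head-Attention
  - 2.1.3 Feed-Forward (fully connected layers)
  - 2.1.4 Layer-Normalization
  - 2.1.5 Full model (combining the code above)
- 2.2 Training code analysis
- 2.3 Prediction code analysis
- 2.4 Hyperparameter code analysis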

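1 Transformer architecture walkthrough (from the paper)

1.1 The model diagram in the paper

[Figure: the Transformer architecture diagram from the paper]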

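1.2 Module-by-module explanation

1.2.1 Word and positional embedding module

Input Embedding and Output Embedding

This part embeds the words fed to the encoder and the decoder, using torch.nn.Embedding. Every language has a very large vocabulary, so a traditional one-hot encoding would make each word vector as long as the vocabulary itself. That inflates the computation, and almost every element of the resulting matrix is 0.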
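The relationship between the two encodings can be checked directly. The sketch below uses toy sizes chosen only for illustration and shows that an embedding lookup gives the same result as multiplying a one-hot matrix by the embedding weights, without ever building the mostly-zero matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim = 10, 4          # toy sizes, not the article's hyperparameters
embed = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[1, 5, 9]])  # (batch=1, seq_len=3)
dense = embed(token_ids)               # direct lookup, shape (1, 3, 4)

# Equivalent but wasteful: build one-hot rows, then multiply by the weight matrix.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()  # (1, 3, 10)
via_one_hot = one_hot @ embed.weight                             # (1, 3, 4)

print(dense.shape)
print(torch.allclose(dense, via_one_hot))
```

With a real vocabulary of tens of thousands of words, the one-hot matrix would have that many columns per token, while the lookup only reads one row of the weight table.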

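Positional Encoding

A sentence is a sequence, and the order in which its words arrive matters. To model the relationships between words better, each position is encoded as well and the result is added to the word embedding. The paper uses fixed sinusoidal functions for this; the code in section 2.1.5 instead learns the positions with a second torch.nn.Embedding table.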
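For reference, the paper's fixed encoding is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The following is a minimal sketch of that formula, not the approach taken by the article's code; the function name `sinusoidal_encoding` is made up for this example, and an even `d_model` is assumed.

```python
import math
import torch

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of fixed positional encodings."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # 1 / 10000^(2i / d_model) for every even index 2i, computed in log space
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even columns
    pe[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return pe

pe = sinusoidal_encoding(max_len=10, d_model=512)
print(pe.shape)
```

Because the table is fixed, it adds no trainable parameters, whereas the learned table used by the article's code adds maxlen × hidden_units of them.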

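Overall

The word embedding and the positional embedding are combined. My guess is that other ways of combining them would also work, for example a weighted sum; the code in section 2 uses a plain element-wise sum. The summed matrix is then passed three times, as the query, key, and value inputs, to the Multi-Head Attention module described next.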
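The element-wise sum can be sketched in a few lines. The sizes and variable names here are illustrative assumptions; the pattern of building a position index per token and adding the two lookups mirrors what the model code in section 2.1.5 does.

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 100, 10, 16      # toy sizes for illustration
word_embed = nn.Embedding(vocab_size, hidden)  # one row per word index
pos_embed = nn.Embedding(max_len, hidden)      # one row per position

x = torch.randint(0, vocab_size, (2, 7))       # (batch, seq_len) of word indices
# Position indices 0..seq_len-1, repeated for every sentence in the batch
positions = torch.arange(x.size(1)).unsqueeze(0).repeat(x.size(0), 1)

h = word_embed(x) + pos_embed(positions)       # (2, 7, 16)
print(h.shape)
```

The same tensor `h` would then be passed as all three arguments of the first attention layer.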

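1.2.2 Encoder module

Multi-Head Attention

This part builds the multi-head attention mechanism. The output of the previous module is used three times: it is fed into three fully connected layers (equivalent to multiplying by three weight matrices) to produce the query, key, and value. The dot product of query and key is scaled by the square root of the key dimension and passed through softmax, which gives, for each word, attention weights over every word's value. The output for each word is the sum of the values weighted by these weights.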
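As a minimal single-head sketch of the computation just described, softmax(Q K^T / sqrt(d_k)) V can be written as follows. The function name is chosen for this example, and the projections that produce q, k and v are left out to keep the focus on the attention step.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q: (N, T_q, d), k and v: (N, T_k, d). Returns the output and the weights."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (N, T_q, T_k)
    weights = F.softmax(scores, dim=-1)                        # each row sums to 1
    return weights @ v, weights                                # (N, T_q, d)

torch.manual_seed(0)
x = torch.randn(2, 5, 8)   # self-attention: the same tensor serves as q, k and v
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape, weights.shape)
```

Multi-head attention runs this same computation on several slices of the feature dimension in parallel and concatenates the results.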

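Feed Forward

This part passes the normalized output through two fully connected layers.

Add & Norm

This part is the residual connection plus normalization. The input of a multi-head attention or feed-forward sublayer (the two modules above) is added to that sublayer's output, and the sum goes through a layer normalization layer.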
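A small sketch of Add & Norm wrapped around a feed-forward sublayer is shown below. It uses PyTorch's built-in nn.LayerNorm rather than the article's own layer_normalization class, and the `AddNorm` name and the sizes are assumptions made for this example.

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization."""
    def __init__(self, hidden):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x, sublayer_out):
        return self.norm(x + sublayer_out)

torch.manual_seed(0)
hidden = 16
# Two fully connected layers with a ReLU in between, as in the Feed Forward module
ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.ReLU(), nn.Linear(4 * hidden, hidden))
add_norm = AddNorm(hidden)

x = torch.randn(2, 5, hidden)
y = add_norm(x, ffn(x))    # same shape as x, normalized over the last dimension
print(y.shape)
```

The residual path requires the sublayer to return the same shape as its input, which is why the second linear layer maps back to `hidden`.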

Overall

Note the (N×) on the left of the diagram: the Encoder block is repeated N times in the order Multi-Head Attention -> Add & Norm -> Feed Forward -> Add & Norm. The original paper uses N = 6.

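1.2.3 Decoder module

Multi-Head Attention

This part builds multi-head attention in the same way as the one in the Encoder module, except that its key and value inputs come from the output of the Encoder, while the query comes from the decoder itself.

Masked Multi-Head Attention

This is multi-head attention with a mask added. During training the decoder receives the whole ground-truth target sentence, but a position must not attend to the words that come after it, so the mask blocks attention to later positions of the target sequence.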
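The effect of the mask is easiest to see on a tiny example. The sketch below fills the blocked positions with negative infinity through masked_fill; the article's own code reaches the same result by adding a very large negative number instead. The all-zero scores are an arbitrary choice that makes the weights easy to read.

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.zeros(1, T, T)                            # (N, T_q, T_k), uniform scores
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))  # True where attention is allowed

# Positions above the diagonal (the future) get -inf, so softmax gives them weight 0.
masked = scores.masked_fill(~causal, float("-inf"))
weights = F.softmax(masked, dim=-1)
print(weights[0])
```

Row i of `weights` spreads its attention evenly over positions 0 to i and puts nothing on later positions, so the first word attends only to itself.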

Add & Norm

Same as in the Encoder module.

Feed Forward

Same as in the Encoder module.

1.2.4 Output fully connected layer

This part outputs, for each position, a probability for every word in the target vocabulary.
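A brief sketch of this final step, with toy sizes assumed for illustration: a linear layer maps each decoder vector to one score per vocabulary word, softmax turns the scores into probabilities, and the most probable word is the prediction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden, vocab = 16, 50                 # toy sizes, not the article's hyperparameters
proj = nn.Linear(hidden, vocab)        # output fully connected layer

dec_out = torch.randn(2, 5, hidden)    # (batch, seq_len, hidden) from the decoder
logits = proj(dec_out)                 # (2, 5, 50), one score per vocabulary word
probs = F.softmax(logits, dim=-1)      # probabilities over the vocabulary
preds = probs.argmax(dim=-1)           # (2, 5), predicted word index per position
print(probs.shape, preds.shape)
```

Since softmax preserves ordering, taking the argmax of the logits directly gives the same predictions.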

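2 Transformer code analysis

2.1 Model code analysis

2.1.1 Embedding (word and positional embedding)

import torch
import torch.nn as nn
import torch.nn.functional as F


## Embedding class
class embedding(nn.Module):
    def __init__(self, input_dim, output_dim, padding_is=True):
        """
        :param input_dim: input dimension (number of distinct indices)
        :param output_dim: output dimension (size of each embedding vector)
        :param padding_is: stored on the module but not used by it
        """
        super(embedding, self).__init__()
        self.padding_is = padding_is
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.embed = torch.nn.Embedding(self.input_dim, self.output_dim)

    def forward(self, x):
        ### Map each input index to a vector of size output_dim
        output = self.embed(x)
        return output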

2.1.2 Multi-Head-Attention

### Multi-head attention
class multiple_head_attention(nn.Module):
    def __init__(self, num_units, num_heads=8, dropout_rate=0, masked=False):
        """
        :param num_units: hidden size C of the input and the output
        :param num_heads: number of attention heads
        :param dropout_rate: dropout probability
        :param masked: whether this is the masked attention in the decoder
        """
        super(multiple_head_attention, self).__init__()
        self.num_units = num_units
        self.num_heads = num_heads
        self.dropout_rate = dropout_rate
        self.masked = masked
        ## Q, K and V layers: three fully connected layers, num_units in and num_units out
        self.Q = nn.Sequential(nn.Linear(self.num_units, self.num_units), nn.ReLU())
        self.K = nn.Sequential(nn.Linear(self.num_units, self.num_units), nn.ReLU())
        self.V = nn.Sequential(nn.Linear(self.num_units, self.num_units), nn.ReLU())
        self.output_dropout = nn.Dropout(p=self.dropout_rate)  # dropout layer
        self.normalization = layer_normalization(self.num_units)  # layer normalization (section 2.1.4)

    def forward(self, queries, keys, values):
        ## Produce q, k, v with the three fully connected layers
        q = self.Q(queries)
        k = self.K(keys)
        v = self.V(values)
        ## ---------------------------------------------- multi-head part starts here
        ## Split q, k, v into num_heads chunks along dim=2, then concatenate the chunks along dim=0
        ## [512, 10, 512] -> [512, 10, 64] * 8 -> [4096, 10, 64]
        q_ = torch.cat(torch.chunk(q, self.num_heads, dim=2), dim=0)
        k_ = torch.cat(torch.chunk(k, self.num_heads, dim=2), dim=0)
        v_ = torch.cat(torch.chunk(v, self.num_heads, dim=2), dim=0)
        ## Transpose dims (1, 2) of k_ and multiply with q_ to get the attention scores
        outputs = torch.bmm(q_, k_.permute(0, 2, 1))
        ## Scale the scores, as in the paper
        outputs = outputs / (k_.size()[-1] ** 0.5)
        ## Decoder self-attention: hide the positions that come later
        if self.masked:
            diag_vals = torch.ones_like(outputs[0, :, :])  ## 2-D matrix with the size of the last two dims of outputs
            tril = torch.tril(diag_vals, diagonal=0)  ## lower-triangular matrix of ones
            masks = tril.unsqueeze(0).repeat(outputs.size(0), 1, 1)  ## add the first dim of outputs
            padding = torch.ones_like(masks) * (-2 ** 32 + 1)  ## same shape as masks, a very large negative number
            condition = masks.eq(0.).float()  ## 1 in the strict upper triangle, 0 elsewhere
            outputs = padding * condition + outputs * (1. - condition)  ## apply the mask
        # Activation
        outputs = F.softmax(outputs, dim=-1)  # (h*N, T_q, T_k), softmax over the last dimension
        # Dropout
        outputs = self.output_dropout(outputs)  # (h*N, T_q, T_k)
        ## Weight the values by the attention
        outputs = torch.bmm(outputs, v_)
        ## Restore the original shape
        ## [4096, 10, 64] -> [512, 10, 64] * 8 -> [512, 10, 512]
        ## ---------------------------------------------- multi-head part ends here
        outputs = torch.cat(torch.chunk(outputs, self.num_heads, dim=0), dim=2)
        ## Residual connection
        outputs = outputs + queries
        # Normalize
        outputs = self.normalization(outputs)  # (N, T_q, C)
        return outputs
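The chunk-and-concatenate trick used to form the heads can be verified on a small tensor. This sketch uses toy sizes in place of [512, 10, 512] and confirms that the split stacks the heads along the batch dimension and that the reverse operation restores the original tensor exactly.

```python
import torch

N, T, C, h = 2, 3, 8, 4   # batch, sequence length, hidden size, number of heads (toy values)
x = torch.arange(N * T * C, dtype=torch.float32).view(N, T, C)

# Split the feature dimension into h chunks and stack them along the batch dimension.
split = torch.cat(torch.chunk(x, h, dim=2), dim=0)       # (h*N, T, C/h)
# Reverse: split the batch dimension back into h chunks and rejoin the features.
merged = torch.cat(torch.chunk(split, h, dim=0), dim=2)  # (N, T, C)

print(split.shape, merged.shape)
print(torch.equal(merged, x))
```

Row `i * N + n` of `split` holds head `i` of sample `n`, so a single torch.bmm call computes the attention of every head of every sample at once.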

2.1.3 Feed-Forward (fully connected layers)

### Feed-forward layers
class feedforward(nn.Module):
    def __init__(self, inputs_channels, num_units=(2048, 512)):
        """
        :param inputs_channels: size of the last dimension of the input
        :param num_units: output sizes of the first and the second linear layer
        """
        super(feedforward, self).__init__()
        self.inputs_channels = inputs_channels
        self.num_units = num_units
        ## A position-wise fully connected layer can be written as a convolution or as a linear layer
        # A linear layer is used here
        self.layer1 = nn.Sequential(nn.Linear(self.inputs_channels, self.num_units[0]), nn.ReLU())
        self.layer2 = nn.Linear(self.num_units[0], self.num_units[1])
        self.normalization = layer_normalization(self.inputs_channels)  ## layer normalization

    def forward(self, inputs):
        outputs = self.layer1(inputs)
        outputs = self.layer2(outputs)
        outputs = outputs + inputs  ## residual connection; needs num_units[1] == inputs_channels
        outputs = self.normalization(outputs)
        return outputs

2.1.4 Layer-Normalization

### Layer normalization
class layer_normalization(nn.Module):
    def __init__(self, features, epsilon=1e-8):
        super(layer_normalization, self).__init__()
        self.epsilon = epsilon
        self.gamma = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.epsilon) + self.beta
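This hand-written layer is close to, but not identical with, PyTorch's built-in nn.LayerNorm: x.std uses the unbiased estimate (dividing by n - 1) and adds epsilon outside the square root, while nn.LayerNorm uses the biased variance with epsilon inside the square root. The sketch below compares the two at their initial parameters (gamma = 1, beta = 0) on random data; the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = 512
x = torch.randn(4, 10, features)

# The article's formula with gamma = 1 and beta = 0
custom = (x - x.mean(-1, keepdim=True)) / (x.std(-1, keepdim=True) + 1e-8)
builtin = nn.LayerNorm(features)(x)

# The unbiased std is larger than the biased one by sqrt(n / (n - 1)),
# so the custom output is smaller by the inverse factor.
scale = ((features - 1) / features) ** 0.5
print(torch.allclose(custom, builtin * scale, atol=1e-4))
```

For 512 features the factor is about 0.999, so the two layers behave almost the same, but weights trained with one are not exactly interchangeable with the other.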

2.1.5 Full model (combining the code above)

## Full model
## hp is the Hyperparams class from section 2.4; label_smoothing is defined elsewhere in the project
class scl_models(nn.Module):
    def __init__(self, enc_voc, dec_voc):
        super(scl_models, self).__init__()
        self.enc_voc = enc_voc
        self.dec_voc = dec_voc
        ### Encoder
        self.enc_embedding = embedding(enc_voc, hp.hidden_units)  ## word embedding
        self.enc_pos_embedding = embedding(hp.maxlen, hp.hidden_units)  ## positional embedding
        self.enc_dropout = nn.Dropout(hp.dropout_rate)  ## dropout
        for i in range(hp.num_blocks):  ## one multi-head attention and one feed-forward layer per block
            setattr(self, 'enc_attention_%d' % i,
                    multiple_head_attention(num_units=hp.hidden_units, num_heads=hp.num_heads,
                                            dropout_rate=hp.dropout_rate, masked=False))
            setattr(self, 'enc_feed_forward_%d' % i,
                    feedforward(hp.hidden_units, [4 * hp.hidden_units, hp.hidden_units]))
        ### Decoder
        self.dec_embedding = embedding(dec_voc, hp.hidden_units)  ## word embedding over the target vocabulary
        self.dec_pos_embedding = embedding(hp.maxlen, hp.hidden_units)  ## positional embedding
        self.dec_dropout = nn.Dropout(hp.dropout_rate)  ## dropout
        for i in range(hp.num_blocks):  ## masked self-attention, encoder-decoder attention, feed-forward
            setattr(self, 'dec_attention_%d' % i,
                    multiple_head_attention(num_units=hp.hidden_units, num_heads=hp.num_heads,
                                            dropout_rate=hp.dropout_rate, masked=True))
            setattr(self, 'dec_attention2_%d' % i,
                    multiple_head_attention(num_units=hp.hidden_units, num_heads=hp.num_heads,
                                            dropout_rate=hp.dropout_rate, masked=False))
            setattr(self, 'dec_feed_forward_%d' % i,
                    feedforward(hp.hidden_units, [4 * hp.hidden_units, hp.hidden_units]))
        ## Output linear layer; self.dec_voc is the number of target words
        self.logits_layer = nn.Linear(hp.hidden_units, self.dec_voc)
        self.label_smoothing = label_smoothing()

    def forward(self, x, y):
        ## Prepend the start symbol (index 2) to every target sentence and drop its last token
        self.decoder_inputs = torch.cat([torch.full_like(y[:, :1], 2), y[:, :-1]], dim=-1)
        ## Embed the source words
        self.enc = self.enc_embedding(x)
        ## Embed the positions and add them to the word embeddings
        enc_positions = torch.arange(x.size(1), device=x.device).unsqueeze(0).repeat(x.size(0), 1)
        self.enc = self.enc + self.enc_pos_embedding(enc_positions)
        ### Encoder blocks (multi-head attention and feed-forward)
        for i in range(hp.num_blocks):
            self.enc = getattr(self, 'enc_attention_%d' % i)(self.enc, self.enc, self.enc)
            self.enc = getattr(self, 'enc_feed_forward_%d' % i)(self.enc)
        ## Decoder
        self.dec = self.dec_embedding(self.decoder_inputs)
        dec_positions = torch.arange(self.decoder_inputs.size(1), device=y.device)
        dec_positions = dec_positions.unsqueeze(0).repeat(self.decoder_inputs.size(0), 1)
        self.dec = self.dec + self.dec_pos_embedding(dec_positions)
        self.dec = self.dec_dropout(self.dec)
        for i in range(hp.num_blocks):
            self.dec = getattr(self, 'dec_attention_%d' % i)(self.dec, self.dec, self.dec)
            self.dec = getattr(self, 'dec_attention2_%d' % i)(self.dec, self.enc, self.enc)
            self.dec = getattr(self, 'dec_feed_forward_%d' % i)(self.dec)
        self.logits = self.logits_layer(self.dec)
        ## Softmax, then flatten to 2-D: one row of word probabilities per target position
        self.probably = F.softmax(self.logits, dim=-1).view(-1, self.dec_voc)
        _, self.preds = torch.max(self.logits, -1)  ## index of the highest-scoring word
        ## 1 where the target position holds a real word, 0 where it is padding; flattened to 1-D
        self.istarget = (1. - y.eq(0.).float()).view(-1)
        ## Accuracy over the non-padding positions
        self.acc = torch.sum(self.preds.eq(y).float().view(-1) * self.istarget) / torch.sum(self.istarget)
        # Loss
        self.y_onehot = torch.zeros(self.logits.size(0) * self.logits.size(1), self.dec_voc, device=y.device)
        self.y_onehot = self.y_onehot.scatter_(1, y.view(-1, 1), 1)
        self.y_smoothed = self.label_smoothing(self.y_onehot)
        self.loss = - torch.sum(self.y_smoothed * torch.log(self.probably), dim=-1)
        self.mean_loss = torch.sum(self.loss * self.istarget) / torch.sum(self.istarget)
        return self.mean_loss, self.preds, self.acc
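The model calls label_smoothing(), whose definition does not appear in the article. The following is a hypothetical sketch of what such a module could look like, assuming the usual formulation (1 - epsilon) * one_hot + epsilon / K with K classes; the default epsilon of 0.1 is the value reported in the paper, not something taken from the article's code.

```python
import torch
import torch.nn as nn

class label_smoothing(nn.Module):
    """Hypothetical stand-in for the article's undefined label_smoothing module."""
    def __init__(self, epsilon=0.1):
        super().__init__()
        self.epsilon = epsilon

    def forward(self, inputs):
        num_classes = inputs.size(-1)  # K, the size of the vocabulary
        # Move epsilon of the probability mass from the true class to a uniform distribution.
        return (1 - self.epsilon) * inputs + self.epsilon / num_classes

one_hot = torch.tensor([[0.0, 0.0, 1.0, 0.0]])
smoothed = label_smoothing(0.1)(one_hot)
print(smoothed)   # the true class gets 0.925, every other class 0.025
```

Each smoothed row still sums to 1, so it remains a valid target distribution for the cross-entropy computed in forward.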

2.2 Training code analysis

import os
import pickle
import time

import torch
import torch.optim as optim
from tensorboardX import SummaryWriter  # assumed source: export_scalars_to_json below is a tensorboardX method

## load_de_vocab, load_en_vocab, load_train_data and get_batch_indices come from the project's data-loading code
de2idx, idx2de = load_de_vocab()  ## Ctrl + left-click the function in an IDE to see its comments
en2idx, idx2en = load_en_vocab()
enc_voc = len(de2idx)  ### number of source words mapped to an index
dec_voc = len(en2idx)  ### number of target words mapped to an index
writer = SummaryWriter()
X, Y = load_train_data()  ## preprocessed 2-D data, every word already replaced by its index
num_batch = len(X) // hp.batch_size  ### batch_size is the size of one batch, num_batch the number of batches
model = scl_models(enc_voc, dec_voc)  ### create the model
model.train()  ## put the model in training mode
model.cuda()  ## move the model to the GPU
## Create the model directory if it does not exist yet
if not os.path.exists(hp.model_dir):
    os.makedirs(hp.model_dir)
## Check whether a model has been trained before
if hp.preload is not None and os.path.exists(hp.model_dir + '/history.pkl'):
    with open(hp.model_dir + '/history.pkl', 'rb') as history_file:
        history = pickle.load(history_file)
else:
    history = {'current_batches': 0}
current_batches = history['current_batches']  ## number of batches trained so far
optimizer = optim.Adam(model.parameters(), lr=hp.lr, betas=(0.9, 0.99), eps=1e-8)  ## optimizer
### If a saved state exists, load the optimizer and the model parameters
model_pth_path = os.path.join(hp.model_dir, 'optimizer.pth')
if hp.preload is not None and os.path.exists(model_pth_path):
    optimizer.load_state_dict(torch.load(model_pth_path))
if hp.preload is not None and os.path.exists(hp.model_dir + '/model_epoch_%02d.pth' % hp.preload):
    model.load_state_dict(torch.load(hp.model_dir + '/model_epoch_%02d.pth' % hp.preload))
startepoch = int(hp.preload) if hp.preload is not None else 1
for epoch in range(startepoch, hp.num_epochs + 1):
    current_batch = 0
    for index, current_index in get_batch_indices(len(X), hp.batch_size):
        tic = time.time()
        x_batch = torch.LongTensor(X[index]).cuda()
        y_batch = torch.LongTensor(Y[index]).cuda()
        toc = time.time()
        tic_r = time.time()
        torch.cuda.synchronize()
        optimizer.zero_grad()
        loss, _, acc = model(x_batch, y_batch)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
        toc_r = time.time()
        current_batches += 1
        current_batch += 1
        if current_batches % 10 == 0:
            writer.add_scalar('./loss', loss.item(), current_batches)
            writer.add_scalar('./acc', acc.item(), current_batches)
        if current_batches % 5 == 0:
            print('epoch %d, batch %d/%d, loss %f, acc %f' % (epoch, current_batch, num_batch, loss.item(), acc.item()))
            print('batch loading used time %f, model forward used time %f' % (toc - tic, toc_r - tic_r))
        if current_batches % 100 == 0:
            writer.export_scalars_to_json(hp.model_dir + '/all_scalars.json')
    ## End of the epoch: save the batch counter, the model and the optimizer state
    history['current_batches'] = current_batches
    with open(hp.model_dir + '/history.pkl', 'wb') as out_file:
        pickle.dump(history, out_file)
    checkpoint_path = hp.model_dir + '/model_epoch_%02d' % epoch + '.pth'
    torch.save(model.state_dict(), checkpoint_path)
    torch.save(optimizer.state_dict(), hp.model_dir + '/optimizer.pth')
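The training loop relies on get_batch_indices, which the article does not show. The sketch below is a hypothetical version that is compatible with how the loop uses it: it yields a list of shuffled sample indices together with the batch's starting offset, and it drops the last incomplete batch so that the number of batches equals len(X) // batch_size. The `seed` parameter is an addition for reproducibility, not part of the original call.

```python
import random

def get_batch_indices(total_length, batch_size, seed=None):
    """Hypothetical helper: yield (indices, start) for each full shuffled batch."""
    indices = list(range(total_length))
    random.Random(seed).shuffle(indices)   # a new sample order on every call
    # Stop before a final batch that would hold fewer than batch_size samples.
    for start in range(0, total_length - batch_size + 1, batch_size):
        yield indices[start:start + batch_size], start

batches = list(get_batch_indices(total_length=10, batch_size=4, seed=0))
for index, start in batches:
    print(start, index)
```

Indexing with a list, as in X[index], works when X is a NumPy array, which is what this sketch assumes load_train_data returns.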

2.3 Prediction code analysis

import codecs
import os

import torch
from nltk.translate.bleu_score import corpus_bleu  # assumed source of corpus_bleu

X, Sources, Targets = load_test_data()
de2idx, idx2de = load_de_vocab()
en2idx, idx2en = load_en_vocab()
enc_voc = len(de2idx)
dec_voc = len(en2idx)
# Load the model
model = scl_models(enc_voc, dec_voc)
model.load_state_dict(torch.load(hp.model_dir + '/model_epoch_%02d' % hp.eval_epoch + '.pth'))
print('Model Loaded.')
model.eval()
model.cuda()
# Inference
if not os.path.exists('results'):
    os.mkdir('results')
with codecs.open('results/model%d.txt' % hp.eval_epoch, 'w', 'utf-8') as fout:
    list_of_refs, hypotheses = [], []
    for i in range(len(X) // hp.batch_size2):
        # Get mini-batches
        x = X[i * hp.batch_size2: (i + 1) * hp.batch_size2]
        sources = Sources[i * hp.batch_size2: (i + 1) * hp.batch_size2]
        targets = Targets[i * hp.batch_size2: (i + 1) * hp.batch_size2]
        # Autoregressive inference: fill in one more target position on every pass
        x_ = torch.LongTensor(x).cuda()
        preds = torch.zeros(hp.batch_size2, hp.maxlen, dtype=torch.long).cuda()
        with torch.no_grad():
            for j in range(hp.maxlen):
                _, _preds, _ = model(x_, preds)
                preds[:, j] = _preds[:, j]
        preds = preds.cpu().numpy()
        # Write to file
        for source, target, pred in zip(sources, targets, preds):  # sentence-wise
            got = " ".join(idx2en[idx] for idx in pred).split("</S>")[0].strip()
            fout.write("- source: " + source + "\n")
            fout.write("- expected: " + target + "\n")
            fout.write("- got: " + got + "\n\n")
            fout.flush()
            # BLEU score: collect only sentence pairs longer than 3 words
            ref = target.split()
            hypothesis = got.split()
            if len(ref) > 3 and len(hypothesis) > 3:
                list_of_refs.append([ref])
                hypotheses.append(hypothesis)
    # Calculate the BLEU score
    score = corpus_bleu(list_of_refs, hypotheses)
    fout.write("Bleu Score = " + str(100 * score))
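The inner loop over j is greedy autoregressive decoding: every pass runs the whole model on the tokens predicted so far and keeps only the prediction at position j. The sketch below isolates that loop with a toy stand-in for the trained network; `toy_model`, `VOCAB` and `START` are made up for this example. The toy model shifts its input right by one, as scl_models does, and predicts "previous token + 1".

```python
import torch

VOCAB, START = 10, 2   # toy vocabulary size and start-symbol index (assumed values)

def toy_model(y):
    """Stand-in for the network: position j only sees target tokens before j."""
    start = torch.full_like(y[:, :1], START)
    decoder_inputs = torch.cat([start, y[:, :-1]], dim=-1)  # shift right by one
    return (decoder_inputs + 1) % VOCAB                      # (batch, max_len) predictions

def greedy_decode(model, batch_size, max_len):
    preds = torch.zeros(batch_size, max_len, dtype=torch.long)  # all padding at first
    for j in range(max_len):
        step_preds = model(preds)        # re-run on everything decoded so far
        preds[:, j] = step_preds[:, j]   # keep only the newly decided position
    return preds

decoded = greedy_decode(toy_model, batch_size=2, max_len=5)
print(decoded)   # every row is 3, 4, 5, 6, 7
```

Re-running the full model for every position costs max_len forward passes per batch, which is simple but slow; this is the same trade-off the article's prediction code makes.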

2.4 Hyperparameter code analysis

class Hyperparams:
    source_train = 'data/train.tags.de-en.de'  ### path of the training inputs (German)
    target_train = 'data/train.tags.de-en.en'  ### path of the training targets (English)
    source_test = 'data/IWSLT16.TED.tst2014.de-en.de.xml'
    target_test = 'data/IWSLT16.TED.tst2014.de-en.en.xml'
    # training
    batch_size = 512  # alias = N
    batch_size2 = 32  # batch size used by the prediction code
    lr = 0.0001  # learning rate. In paper, learning rate is adjusted to the global step.
    logdir = 'logdir'  # log directory
    model_dir = './models/'  # directory where the model is saved
    # model
    maxlen = 10  # Maximum number of words in a sentence. alias = T.
    # Feel free to increase this if you are ambitious.
    min_cnt = 20  # words that occur fewer than min_cnt times are encoded as <UNK>.
    hidden_units = 512  # alias = C
    num_blocks = 6  # number of encoder/decoder blocks
    num_epochs = 200  # number of training epochs
    num_heads = 8  # number of attention heads
    dropout_rate = 0.1
    eval_epoch = 135  # epoch of the model used for evaluation
    preload = 20  # epoch of the preloaded model when resuming training; None starts from scratch
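The comment on lr notes that the paper adjusts the learning rate with the global step, whereas this code keeps it fixed at 0.0001. The paper's schedule is lr = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)), with 4000 warm-up steps. Here is a minimal sketch of that formula; the function name `noam_lr` is chosen for this example and is not part of the article's code.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate of the paper's schedule at a given global step."""
    step = max(step, 1)   # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Linear warm-up until warmup_steps, then decay with the inverse square root of the step.
for step in (1, 1000, 4000, 16000):
    print(step, noam_lr(step))
```

One way to plug this into the training loop, offered as a suggestion rather than something the article does, would be to set each param_group's 'lr' from `noam_lr(current_batches)` before calling optimizer.step().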