Computational graphs
Convolutional network
Neural Turing Machine
Why do we need backpropagation?
Problem statement
The core problem studied in this section is as follows: we are given some function f(x), where x is a vector of inputs, and we are interested in computing the gradient of f at x (i.e. $\nabla f(x)$).
Motivation
In the specific case of neural networks, f(x) corresponds to the loss function (L), and the inputs x consist of the training data and the neural network weights. The training data is given and fixed, while the parameters (e.g. W and b) are the variables we control. We need to compute the gradient with respect to all the parameters so that we can use it to perform a parameter update.
Backpropagation: a simple example
- Q1: What is an add gate?
- Q2: What is a max gate?
- Q3: What is a mul gate?
- A1: add gate: gradient distributor
- A2: max gate: gradient router
- A3: mul gate: gradient switcher (a combined example of all three rules is sketched below)
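The sketch below uses hypothetical input values and the function f(x, y, z) = (x + y) * max(y, z) to show each gate rule in action:

x, y, z = -2.0, 5.0, -4.0

# forward pass
a = x + y          # add gate:  a = 3
m = max(y, z)      # max gate:  m = 5
f = a * m          # mul gate:  f = 15

# backward pass, starting from df/df = 1
df = 1.0
da = m * df                          # mul gate switches: a receives the other input m
dm = a * df                          #                    m receives the other input a
dx = 1.0 * da                        # add gate distributes da unchanged to both inputs
dy = 1.0 * da                        # y gets da from the add gate ...
dy += (1.0 if y > z else 0.0) * dm   # ... plus dm, routed to y because y won the max
dz = (1.0 if z > y else 0.0) * dm    # z lost the max, so it receives zero gradient
print(dx, dy, dz)                    # 5.0, 8.0, 0.0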
Another example:
What about autograd?
- Deep learning frameworks can automatically perform backprop!
- Problems related to the underlying gradients might still surface when debugging your models.
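For instance, here is a minimal sketch assuming PyTorch is available (other frameworks expose similar autograd facilities), differentiating the same f(x, y) that is worked through by hand in the next section:

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-4.0, requires_grad=True)

f = (x + torch.sigmoid(y)) / (torch.sigmoid(x) + (x + y) ** 2)
f.backward()            # autograd records the graph during the forward pass and runs backprop
print(x.grad, y.grad)   # df/dx and df/dy, no hand-derived gradients required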
Modularized implementation: forward / backward API
For example:
$$f(x,y)=\frac{x+\sigma(y)}{\sigma(x)+(x+y)^2},\quad\text{where }\sigma(x)=\frac{1}{1+e^{-x}}$$
Code pattern for building the forward pass:
import math

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in the numerator   #(1)
num = x + sigy # numerator                                   #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in the denominator #(3)
xpy = x + y                                                  #(4)
xpysqr = xpy**2                                              #(5)
den = sigx + xpysqr # denominator                            #(6)
invden = 1.0 / den                                           #(7)
f = num * invden # done!                                     #(8)
Code pattern for building the backward pass:
# backprop f = num * invden
dnum = invden # gradient of the numerator                        #(8)
dinvden = num                                                    #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden                               #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                               #(6)
dxpysqr = (1) * dden                                             #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                                       #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                                  #(4)
dy = (1) * dxpy                                                  #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! Gradients add up where a variable branches #(3)
# backprop num = x + sigy
dx += (1) * dnum                                                 #(2)
dsigy = (1) * dnum                                               #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                                #(1)
# done!
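As a quick sanity check, the analytic dx and dy above can be compared against centered numerical differences (a common gradient-checking sketch):

import math

def f_xy(x, y):
    sig = lambda t: 1.0 / (1 + math.exp(-t))
    return (x + sig(y)) / (sig(x) + (x + y)**2)

h = 1e-5
dx_num = (f_xy(3 + h, -4) - f_xy(3 - h, -4)) / (2 * h)
dy_num = (f_xy(3, -4 + h) - f_xy(3, -4 - h)) / (2 * h)
print(dx_num, dy_num)  # should closely match the dx and dy computed by backprop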
- Forward pass: compute the outputs from the inputs (shown in green).
- Backward pass: starting from the end, recursively compute the gradients (shown in red) with the chain rule, all the way back to the inputs of the network.
Computing $\frac{\partial C}{\partial w}$ can be decomposed into two steps:
1. $\frac{\partial z}{\partial w}$: the forward pass (this is actually the final factor of the derivative chain, and it can be obtained directly, going from front to back).
2. $\frac{\partial C}{\partial z}$: the backward pass (this part cannot be read off directly; it has to be computed recursively, going from back to front).
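As a minimal worked instance, assuming the usual single-neuron setup where $z = wx + b$ and the neuron output is $a = \sigma(z)$, the two factors are:

$$\frac{\partial C}{\partial w}=\frac{\partial z}{\partial w}\,\frac{\partial C}{\partial z},\qquad
\frac{\partial z}{\partial w}=x\ \text{(forward pass)},\qquad
\frac{\partial C}{\partial z}=\sigma'(z)\,\frac{\partial C}{\partial a}\ \text{(backward pass, recursing into }\tfrac{\partial C}{\partial a}\text{)}.$$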
Summary so far…
- neural nets will be very large: impractical to write down gradient formula by hand for all parameters
- backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
- implementations maintain a graph structure, where the nodes implement the forward() / backward() API (see the sketch after this list)
- forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
- backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
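A minimal sketch of such a node (the class and method names are illustrative, not taken from any real framework):

class MultiplyGate:
    def forward(self, x, y):
        # compute the result and cache the inputs needed for the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # chain rule: local gradients (y and x) times the upstream gradient dz
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)    # forward: z = -12.0
dx, dy = gate.backward(1.0)    # backward: dx = -4.0, dy = 3.0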