Computational graphs
Convolutional network
Neural Turing Machine
Why do we need backpropagation?
Problem statement
The core problem studied in this section is as follows: we are given some function f(x), where x is a vector of inputs, and we are interested in computing the gradient of f at x (i.e. $\nabla f(x)$).
Motivation
In the specific case of neural networks, f(x) corresponds to the loss function (L), and the inputs x consist of the training data and the neural network weights. The training data is given and fixed, while the parameters (e.g. W and b) are the variables we control. We need to compute the gradient with respect to all the parameters so that we can use it to perform a parameter update.
Backpropagation: a simple example
- Q1: What is an add gate?
- Q2: What is a max gate?
- Q3: What is a mul gate?
- A1: add gate: gradient distributor
- A2: max gate: gradient router
- A3: mul gate: gradient switcher (a combined example of all three rules is sketched below)
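The sketch below uses hypothetical input values and the function f(x, y, z) = (x + y) * max(y, z) to show each gate rule in action:

x, y, z = -2.0, 5.0, -4.0

# forward pass
a = x + y          # add gate:  a = 3
m = max(y, z)      # max gate:  m = 5
f = a * m          # mul gate:  f = 15

# backward pass, starting from df/df = 1
df = 1.0
da = m * df                          # mul gate switches: a receives the other input m
dm = a * df                          #                    m receives the other input a
dx = 1.0 * da                        # add gate distributes da unchanged to both inputs
dy = 1.0 * da                        # y gets da from the add gate ...
dy += (1.0 if y > z else 0.0) * dm   # ... plus dm, routed to y because y won the max
dz = (1.0 if z > y else 0.0) * dm    # z lost the max, so it receives zero gradient
print(dx, dy, dz)                    # 5.0, 8.0, 0.0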
Another example:
What about autograd?
- Deep learning frameworks can automatically perform backprop!
- Problems related to the underlying gradients might still surface when debugging your models.
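For instance, here is a minimal sketch assuming PyTorch is available (other frameworks expose similar autograd facilities), differentiating the same f(x, y) that is worked through by hand in the next section:

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-4.0, requires_grad=True)

f = (x + torch.sigmoid(y)) / (torch.sigmoid(x) + (x + y) ** 2)
f.backward()            # autograd records the graph during the forward pass and runs backprop
print(x.grad, y.grad)   # df/dx and df/dy, no hand-derived gradients required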
Modularized implementation: forward / backward API
For example:
$$f(x,y)=\frac{x+\sigma(y)}{\sigma(x)+(x+y)^2},\quad\text{where }\sigma(x)=\frac{1}{1+e^{-x}}$$
Code pattern for building the forward pass:
import math

x = 3 # example values
y = -4

# forward pass
sigy = 1.0 / (1 + math.exp(-y)) # sigmoid in the numerator   #(1)
num = x + sigy # numerator                                   #(2)
sigx = 1.0 / (1 + math.exp(-x)) # sigmoid in the denominator #(3)
xpy = x + y                                                  #(4)
xpysqr = xpy**2                                              #(5)
den = sigx + xpysqr # denominator                            #(6)
invden = 1.0 / den                                           #(7)
f = num * invden # done!                                     #(8)
Code pattern for building the backward pass:
# backprop f = num * invden
dnum = invden # gradient of the numerator                        #(8)
dinvden = num                                                    #(8)
# backprop invden = 1.0 / den
dden = (-1.0 / (den**2)) * dinvden                               #(7)
# backprop den = sigx + xpysqr
dsigx = (1) * dden                                               #(6)
dxpysqr = (1) * dden                                             #(6)
# backprop xpysqr = xpy**2
dxpy = (2 * xpy) * dxpysqr                                       #(5)
# backprop xpy = x + y
dx = (1) * dxpy                                                  #(4)
dy = (1) * dxpy                                                  #(4)
# backprop sigx = 1.0 / (1 + math.exp(-x))
dx += ((1 - sigx) * sigx) * dsigx # Notice += !! Gradients add up where a variable branches #(3)
# backprop num = x + sigy
dx += (1) * dnum                                                 #(2)
dsigy = (1) * dnum                                               #(2)
# backprop sigy = 1.0 / (1 + math.exp(-y))
dy += ((1 - sigy) * sigy) * dsigy                                #(1)
# done!
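As a quick sanity check, the analytic dx and dy above can be compared against centered numerical differences (a common gradient-checking sketch):

import math

def f_xy(x, y):
    sig = lambda t: 1.0 / (1 + math.exp(-t))
    return (x + sig(y)) / (sig(x) + (x + y)**2)

h = 1e-5
dx_num = (f_xy(3 + h, -4) - f_xy(3 - h, -4)) / (2 * h)
dy_num = (f_xy(3, -4 + h) - f_xy(3, -4 - h)) / (2 * h)
print(dx_num, dy_num)  # should closely match the dx and dy computed by backprop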
- Forward pass: compute the outputs from the inputs (shown in green).
- Backward pass: starting from the end, recursively compute the gradients (shown in red) with the chain rule, all the way back to the inputs of the network.
Computing $\frac{\partial C}{\partial w}$ can be decomposed into two steps:
1. $\frac{\partial z}{\partial w}$: the forward pass (this is actually the final factor of the derivative chain, and it can be obtained directly, going from front to back).
2. $\frac{\partial C}{\partial z}$: the backward pass (this part cannot be read off directly; it has to be computed recursively, going from back to front).
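As a minimal worked instance, assuming the usual single-neuron setup where $z = wx + b$ and the neuron output is $a = \sigma(z)$, the two factors are:

$$\frac{\partial C}{\partial w}=\frac{\partial z}{\partial w}\,\frac{\partial C}{\partial z},\qquad
\frac{\partial z}{\partial w}=x\ \text{(forward pass)},\qquad
\frac{\partial C}{\partial z}=\sigma'(z)\,\frac{\partial C}{\partial a}\ \text{(backward pass, recursing into }\tfrac{\partial C}{\partial a}\text{)}.$$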
Summary so far…
- neural nets will be very large: impractical to write down gradient formula by hand for all parameters
- backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
- implementations maintain a graph structure, where the nodes implement the forward() / backward() API (see the sketch after this list)
- forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
- backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
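A minimal sketch of such a node (the class and method names are illustrative, not taken from any real framework):

class MultiplyGate:
    def forward(self, x, y):
        # compute the result and cache the inputs needed for the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # chain rule: local gradients (y and x) times the upstream gradient dz
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)    # forward: z = -12.0
dx, dy = gate.backward(1.0)    # backward: dx = -4.0, dy = 3.0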