摘要:介绍梯度下降算法,以及在f(x)f(x)f(x)的梯度▽f(x)\bigtriangledown f(x)▽f(x)满足L-Lipschitz条件下的梯度下降算法的意义,并由此展开的非光滑约束下的近端梯度下降算法,求解min?xfs(x)+fn(x)\min_x f^s(x)+f^n(x)minx?fs(x)+fn(x)问题.
目录
- 梯度下降算法
- 二阶近似下的梯度下降算法
- 引入非光滑约束后的近端梯度下降
- 三个近端梯度下降计算非光滑约束优化的例子
1. 梯度下降
考虑min?xf(x)\min_x f(x)minx?f(x),其中f(x)f(x)f(x)为可微凸函数,且其梯度▽f(x)\bigtriangledown f(x)▽f(x)满足L-Lipschitz条件.
最简单的优化方法为梯度下降法(Gradient descent)
x(k+1)=x(k)?η▽f(x(k))x^{(k+1)}=x^{(k)}-\eta \bigtriangledown f(x^{(k)})x(k+1)=x(k)?η▽f(x(k))
将f(x)f(x)f(x)在x=x(k+1)x=x^{(k+1)}x=x(k+1)的值,在x(k)x^{(k)}x(k)处做Taylor展开,得到
f(x(k+1))=f(x(k))+▽f(x(k))(x(k+1)?x(k))=f(x(k))?η(▽f(x(k)))2≤f(x(k))\begin{aligned} f(x^{(k+1)})&= f(x^{(k)})+\bigtriangledown f(x^{(k)})(x^{(k+1)}-x^{(k)})\\ &=f(x^{(k)})-\eta \left(\bigtriangledown f(x^{(k)})\right)^2\\ &\leq f(x^{(k)}) \end{aligned}f(x(k+1))?=f(x(k))+▽f(x(k))(x(k+1)?x(k))=f(x(k))?η(▽f(x(k)))2≤f(x(k))?
步长参数0<η<10<\eta<10<η<1,则每一次迭代总能保证f(x(k+1))≤f(x(k))f(x^{(k+1)})\leq f(x^{(k)})f(x(k+1))≤f(x(k)).
2. 梯度▽f(x)\bigtriangledown f(x)▽f(x)满足L-Lipschitz条件下的梯度下降
首先给出L-Lipschitz定义:
设函数f(x)f(x)f(x)在有限区间[a,b][a,b][a,b]上满足如下条件:
- 当x∈[a,b]x\in[a,b]x∈[a,b]时,f(x)∈[a,b]f(x)\in[a,b]f(x)∈[a,b],即a≤f(x)≤ba\leq f(x)\leq ba≤f(x)≤b;
- 对任意的x1,x2∈[a,b]x_1,x_2\in[a,b]x1?,x2?∈[a,b],∣f(x1)?f(x2)∣≤L∣x1?x2∣|f(x_1)-f(x_2)|\leq L|x_1-x_2|∣f(x1?)?f(x2?)∣≤L∣x1??x2?∣恒成立;
则称f(x)f(x)f(x)在[a,b][a,b][a,b]上满足L-Lipschitz条件,LLL称为Lipschitz常数.
可以发现,L-Lipschitz连续比一致连续更强,要求函数值在有限区间的变化幅度受到限制.
进一步的,如果函数f(x)f(x)f(x)的梯度▽f(x)\bigtriangledown f(x)▽f(x)满足L-Lipschitz连续,则其在给定点x(k)x^{(k)}x(k)可以展开成如下二阶近似形式
f^(x;x(k))?f(x(k))+<▽f(x(k),x?x(k))>+L2∣∣x?x(k)∣∣2\hat{f}(x;x^{(k)})\doteq f(x^{(k)})+<\bigtriangledown f(x^{(k)},x-x^{(k)})>+\frac{L}{2}||x-x^{(k)}||^2f^?(x;x(k))?f(x(k))+<▽f(x(k),x?x(k))>+2L?∣∣x?x(k)∣∣2
展开,并将与xxx无关的项记为?(x(k))\phi(x^{(k)})?(x(k)),则可以进一步化简为
f^(x;x(k))=L2∣∣x?(x(k)?1L▽f(x(k)))∣∣2+?(x(k))\hat{f}(x;x^{(k)})=\frac{L}{2}\bigg\lvert\bigg\lvert x-\left(x^{(k)}-\frac{1}{L}\bigtriangledown f(x^{(k)})\right)\bigg\rvert\bigg\rvert^2+\phi(x^{(k)})f^?(x;x(k))=2L?∣∣∣∣?∣∣∣∣?x?(x(k)?L1?▽f(x(k)))∣∣∣∣?∣∣∣∣?2+?(x(k))
由图可知
f^(x;x(k))≥f(x)\hat{f}(x;x^{(k)})\geq f(x)f^?(x;x(k))≥f(x)
当且仅当x=x(k)x=x^{(k)}x=x(k)时,取等号. f^(x;x(k))\hat{f}(x;x^{(k)})f^?(x;x(k))实际上为原目标函数的二次上界.
令x(k+1)=arg?min?xf^(x;x(k))x^{(k+1)}=\arg\min_x \hat{f}(x;x^{(k)})x(k+1)=argminx?f^?(x;x(k)),则可以得到
x(k+1)=x(k)?1L▽f(x(k))x^{(k+1)}=x^{(k)}-\frac{1}{L}\bigtriangledown f(x^{(k)})x(k+1)=x(k)?L1?▽f(x(k))
因此,在二阶近似的条件下,梯度下降可以理解为:
- 每一次迭代都在最小化目标函数在上一次迭代点处的二次上界.
收敛速度为O(1k)O(\frac{1}{k})O(k1?).
3. 引入非光滑约束后的近端梯度下降算法
考虑min?xfs(x)+fn(x)\min_x f^s(x)+f^n(x)minx?fs(x)+fn(x),其中fs(x)f^s(x)fs(x)为可微凸函数,且其梯度▽fs(x)\bigtriangledown f^s(x)▽fs(x)满足L-Lipschitz条件,fn(x)f^n(x)fn(x)为非光滑函数.
对光滑部分做如上二阶近似,得到
f^(x;x(k))=L2∣∣x?(x(k)?1L▽fs(x(k)))∣∣2+?(x(k))+fn(x)\hat{f}(x;x^{(k)})=\frac{L}{2}\bigg\lvert\bigg\lvert x-\left(x^{(k)}-\frac{1}{L}\bigtriangledown f^s(x^{(k)})\right)\bigg\rvert\bigg\rvert^2+\phi(x^{(k)})+f^n(x)f^?(x;x(k))=2L?∣∣∣∣?∣∣∣∣?x?(x(k)?L1?▽fs(x(k)))∣∣∣∣?∣∣∣∣?2+?(x(k))+fn(x)
令x(k+1)=arg?min?xf^(x;x(k))x^{(k+1)}=\arg\min_x \hat{f}(x;x^{(k)})x(k+1)=argminx?f^?(x;x(k)),则可以得到近端梯度下降的更新公式
x(k+1)=arg?min?xL2∣∣x?(x(k)?1L▽fs(x(k)))∣∣2+fn(x)x^{(k+1)}=\arg\min_x \frac{L}{2}\bigg\lvert\bigg\lvert x-\left(x^{(k)}-\frac{1}{L}\bigtriangledown f^s(x^{(k)})\right)\bigg\rvert\bigg\rvert^2+f^n(x)x(k+1)=argxmin?2L?∣∣∣∣?∣∣∣∣?x?(x(k)?L1?▽fs(x(k)))∣∣∣∣?∣∣∣∣?2+fn(x)
而该更新公式可以通过如下近端问题高效求解:
proxμfn(x)(z)=arg?min?x12∣∣x?z∣∣2+μfn(x)prox_{\mu f^n(x)}(z)=\arg\min_x \frac{1}{2} ||x-z||^2+\mu f^n(x)proxμfn(x)?(z)=argxmin?21?∣∣x?z∣∣2+μfn(x)
即最小化μfn(x)\mu f^n(x)μfn(x)加上一个独立的二次问题. 此时的收敛速率仍为O(1k)O(\frac{1}{k})O(k1?).
4. 三个近端梯度下降计算非光滑约束优化的例子
例1:
凸稀疏罚函数fn(x)=∣∣x∣∣1f^n(x)=||x||_1fn(x)=∣∣x∣∣1?,此时得到的近端优化问题为
arg?min?x12∣∣x?z∣∣2+μ∣∣x∣∣1\arg\min_x\frac{1}{2} ||x-z||^2+\mu ||x||_1argxmin?21?∣∣x?z∣∣2+μ∣∣x∣∣1?
求解得到zzz的软阈值函数
proxμfn(x)(z)=Sμ(z)=sign(z)max?{∣z∣?μ,0}prox_{\mu f^n(x)}(z)=S_\mu(z)=sign(z)\max\left\{|z|-\mu,0\right\}proxμfn(x)?(z)=Sμ?(z)=sign(z)max{ ∣z∣?μ,0}
此时的该操作符能够将zzz的所有元素向000压缩,而且计算仅需线性时间.
例2:
取fn(x)=∣∣x∣∣0f^n(x)=||x||_0fn(x)=∣∣x∣∣0?,则得到zzz的硬阈值函数
proxμfn(x)(z)=Hμ(z)={z∣z∣≥μ0otherwiseprox_{\mu f^n(x)}(z)=H_\mu(z)=\begin{cases} z \quad|z|\geq\mu\\ 0 \quad otherwise \end{cases}proxμfn(x)?(z)=Hμ?(z)={ z∣z∣≥μ0otherwise?
例3:
取fn(x)=∑iI∞[xi≤0]f^n(x)=\sum_i I_{\infty}[x_i \leq 0]fn(x)=∑i?I∞?[xi?≤0],则得到ReLU网的非线性变换
proxμfn(x)(z)=Rec(z)=max?{z,0}prox_{\mu f^n(x)}(z)=Rec(z)=\max\left\{z,0\right\}proxμfn(x)?(z)=Rec(z)=max{ z,0}