当前位置: 代码迷 >> 综合 >> 近端梯度下降算法(Proximal Gradient Algorithm)
  详细解决方案

近端梯度下降算法(Proximal Gradient Algorithm)

热度:44   发布时间:2023-12-15 05:08:53.0

摘要:介绍梯度下降算法,以及在f(x)f(x)f(x)的梯度▽f(x)\bigtriangledown f(x)f(x)满足L-Lipschitz条件下的梯度下降算法的意义,并由此展开的非光滑约束下的近端梯度下降算法,求解min?xfs(x)+fn(x)\min_x f^s(x)+f^n(x)minx?fs(x)+fn(x)问题.

目录

  1. 梯度下降算法
  2. 二阶近似下的梯度下降算法
  3. 引入非光滑约束后的近端梯度下降
  4. 三个近端梯度下降计算非光滑约束优化的例子

1. 梯度下降

考虑min?xf(x)\min_x f(x)minx?f(x),其中f(x)f(x)f(x)为可微凸函数,且其梯度▽f(x)\bigtriangledown f(x)f(x)满足L-Lipschitz条件.
最简单的优化方法为梯度下降法(Gradient descent)

x(k+1)=x(k)?η▽f(x(k))x^{(k+1)}=x^{(k)}-\eta \bigtriangledown f(x^{(k)})x(k+1)=x(k)?ηf(x(k))

f(x)f(x)f(x)x=x(k+1)x=x^{(k+1)}x=x(k+1)的值,在x(k)x^{(k)}x(k)处做Taylor展开,得到

f(x(k+1))=f(x(k))+▽f(x(k))(x(k+1)?x(k))=f(x(k))?η(▽f(x(k)))2≤f(x(k))\begin{aligned} f(x^{(k+1)})&= f(x^{(k)})+\bigtriangledown f(x^{(k)})(x^{(k+1)}-x^{(k)})\\ &=f(x^{(k)})-\eta \left(\bigtriangledown f(x^{(k)})\right)^2\\ &\leq f(x^{(k)}) \end{aligned}f(x(k+1))?=f(x(k))+f(x(k))(x(k+1)?x(k))=f(x(k))?η(f(x(k)))2f(x(k))?

步长参数0<η<10<\eta<10<η<1,则每一次迭代总能保证f(x(k+1))≤f(x(k))f(x^{(k+1)})\leq f(x^{(k)})f(x(k+1))f(x(k)).


2. 梯度▽f(x)\bigtriangledown f(x)f(x)满足L-Lipschitz条件下的梯度下降

首先给出L-Lipschitz定义:

设函数f(x)f(x)f(x)在有限区间[a,b][a,b][a,b]上满足如下条件:

  • x∈[a,b]x\in[a,b]x[a,b]时,f(x)∈[a,b]f(x)\in[a,b]f(x)[a,b],即a≤f(x)≤ba\leq f(x)\leq baf(x)b
  • 对任意的x1,x2∈[a,b]x_1,x_2\in[a,b]x1?x2?[a,b]∣f(x1)?f(x2)∣≤L∣x1?x2∣|f(x_1)-f(x_2)|\leq L|x_1-x_2|f(x1?)?f(x2?)Lx1??x2?恒成立;

则称f(x)f(x)f(x)[a,b][a,b][a,b]上满足L-Lipschitz条件,LLL称为Lipschitz常数.

可以发现,L-Lipschitz连续比一致连续更强,要求函数值在有限区间的变化幅度受到限制.

进一步的,如果函数f(x)f(x)f(x)梯度▽f(x)\bigtriangledown f(x)f(x)满足L-Lipschitz连续,则其在给定点x(k)x^{(k)}x(k)可以展开成如下二阶近似形式

f^(x;x(k))?f(x(k))+<▽f(x(k),x?x(k))>+L2∣∣x?x(k)∣∣2\hat{f}(x;x^{(k)})\doteq f(x^{(k)})+<\bigtriangledown f(x^{(k)},x-x^{(k)})>+\frac{L}{2}||x-x^{(k)}||^2f^?(x;x(k))?f(x(k))+<f(x(k),x?x(k))>+2L?x?x(k)2

展开,并将与xxx无关的项记为?(x(k))\phi(x^{(k)})?(x(k)),则可以进一步化简为

f^(x;x(k))=L2∣∣x?(x(k)?1L▽f(x(k)))∣∣2+?(x(k))\hat{f}(x;x^{(k)})=\frac{L}{2}\bigg\lvert\bigg\lvert x-\left(x^{(k)}-\frac{1}{L}\bigtriangledown f(x^{(k)})\right)\bigg\rvert\bigg\rvert^2+\phi(x^{(k)})f^?(x;x(k))=2L???x?(x(k)?L1?f(x(k)))??2+?(x(k))

由图可知

f^(x;x(k))≥f(x)\hat{f}(x;x^{(k)})\geq f(x)f^?(x;x(k))f(x)

当且仅当x=x(k)x=x^{(k)}x=x(k)时,取等号. f^(x;x(k))\hat{f}(x;x^{(k)})f^?(x;x(k))实际上为原目标函数的二次上界.

x(k+1)=arg?min?xf^(x;x(k))x^{(k+1)}=\arg\min_x \hat{f}(x;x^{(k)})x(k+1)=argminx?f^?(x;x(k)),则可以得到

x(k+1)=x(k)?1L▽f(x(k))x^{(k+1)}=x^{(k)}-\frac{1}{L}\bigtriangledown f(x^{(k)})x(k+1)=x(k)?L1?f(x(k))

因此,在二阶近似的条件下,梯度下降可以理解为:

  • 每一次迭代都在最小化目标函数在上一次迭代点处的二次上界.

收敛速度为O(1k)O(\frac{1}{k})O(k1?).


3. 引入非光滑约束后的近端梯度下降算法

考虑min?xfs(x)+fn(x)\min_x f^s(x)+f^n(x)minx?fs(x)+fn(x),其中fs(x)f^s(x)fs(x)为可微凸函数,且其梯度▽fs(x)\bigtriangledown f^s(x)fs(x)满足L-Lipschitz条件,fn(x)f^n(x)fn(x)为非光滑函数.
对光滑部分做如上二阶近似,得到

f^(x;x(k))=L2∣∣x?(x(k)?1L▽fs(x(k)))∣∣2+?(x(k))+fn(x)\hat{f}(x;x^{(k)})=\frac{L}{2}\bigg\lvert\bigg\lvert x-\left(x^{(k)}-\frac{1}{L}\bigtriangledown f^s(x^{(k)})\right)\bigg\rvert\bigg\rvert^2+\phi(x^{(k)})+f^n(x)f^?(x;x(k))=2L???x?(x(k)?L1?fs(x(k)))??2+?(x(k))+fn(x)

x(k+1)=arg?min?xf^(x;x(k))x^{(k+1)}=\arg\min_x \hat{f}(x;x^{(k)})x(k+1)=argminx?f^?(x;x(k)),则可以得到近端梯度下降的更新公式

x(k+1)=arg?min?xL2∣∣x?(x(k)?1L▽fs(x(k)))∣∣2+fn(x)x^{(k+1)}=\arg\min_x \frac{L}{2}\bigg\lvert\bigg\lvert x-\left(x^{(k)}-\frac{1}{L}\bigtriangledown f^s(x^{(k)})\right)\bigg\rvert\bigg\rvert^2+f^n(x)x(k+1)=argxmin?2L???x?(x(k)?L1?fs(x(k)))??2+fn(x)

而该更新公式可以通过如下近端问题高效求解:

proxμfn(x)(z)=arg?min?x12∣∣x?z∣∣2+μfn(x)prox_{\mu f^n(x)}(z)=\arg\min_x \frac{1}{2} ||x-z||^2+\mu f^n(x)proxμfn(x)?(z)=argxmin?21?x?z2+μfn(x)

最小化μfn(x)\mu f^n(x)μfn(x)加上一个独立的二次问题. 此时的收敛速率仍为O(1k)O(\frac{1}{k})O(k1?).


4. 三个近端梯度下降计算非光滑约束优化的例子

例1:

凸稀疏罚函数fn(x)=∣∣x∣∣1f^n(x)=||x||_1fn(x)=x1?,此时得到的近端优化问题为
arg?min?x12∣∣x?z∣∣2+μ∣∣x∣∣1\arg\min_x\frac{1}{2} ||x-z||^2+\mu ||x||_1argxmin?21?x?z2+μx1?
求解得到zzz的软阈值函数
proxμfn(x)(z)=Sμ(z)=sign(z)max?{∣z∣?μ,0}prox_{\mu f^n(x)}(z)=S_\mu(z)=sign(z)\max\left\{|z|-\mu,0\right\}proxμfn(x)?(z)=Sμ?(z)=sign(z)max{ z?μ,0}
此时的该操作符能够将zzz的所有元素向000压缩,而且计算仅需线性时间.

例2:

fn(x)=∣∣x∣∣0f^n(x)=||x||_0fn(x)=x0?,则得到zzz的硬阈值函数
proxμfn(x)(z)=Hμ(z)={z∣z∣≥μ0otherwiseprox_{\mu f^n(x)}(z)=H_\mu(z)=\begin{cases} z \quad|z|\geq\mu\\ 0 \quad otherwise \end{cases}proxμfn(x)?(z)=Hμ?(z)={ zzμ0otherwise?

例3:

fn(x)=∑iI∞[xi≤0]f^n(x)=\sum_i I_{\infty}[x_i \leq 0]fn(x)=i?I?[xi?0],则得到ReLU网的非线性变换
proxμfn(x)(z)=Rec(z)=max?{z,0}prox_{\mu f^n(x)}(z)=Rec(z)=\max\left\{z,0\right\}proxμfn(x)?(z)=Rec(z)=max{ z,0}

  相关解决方案