XGBoost 3 - XGBoost原理及调用_综合

XGBoost原理

主要内容包括：

Boosting （提升算法）
- $\text{AdaBoost}$ （经典的提升算法）
$\text{Gradient Boosting}$ （梯度提升）
$\text{XGBoost}$ （梯度提升的优化实现）

1 - Boosting

$\text{Boosting}$ : 将弱学习器组合成强分类

构造一个性能很高的强学习器是一件很困难的事情
但构造一个性能一般的弱学习器并不难
弱学习器：性能比随机猜测好（层数不深的 $\text{CART}$ 是一个好选择）

G (x) = \sum t = 1 T α t ? t (x)

$G(x) = \sum_{t = 1}^{T}\alpha_t\phi_t(x)$

2 - AdaBoost

通过改变样本的权重得到不同的弱学习器 $f^{(t)}$ ，初始时每个样本的权重为 $\frac1N$ ，在得到一个弱分类器之后，计算该弱分类器的分类误差率 $\epsilon_t$ ，由此为依据更新每个样本的权重 $w^{(t)}_n$ ，使得被误分类的样本点权重增大，正确分类的样本的权重减少，这样使得在下一轮的学习当中，误分样本能得到更多的关注，并由分类误差率 $\epsilon_t$ 计算得到当前弱分类器在最终分类模型中的权重 $\alpha_t$ 。

2.1 - AdaBoost M1算法

给定训练集合 $\{x_n, y_n\}$ , $y_n \in \{-1, 1\}$

初始训练集样本权重： $w_{1}^{(t)} = \frac1N$
t=1,2,?,T
- 训练弱分类器： $\phi_t(x)$
- 计算分类误差：
  $? t = \sum N n = 1 w t n 1 ( y n \neq ? t ( x n ) ) \sum N n = 1 w t n$ $\epsilon_t = \frac{\sum_{n=1}^{N}w_{n}^{t}1(y_n \ne \phi_t(x_n))}{\sum_{n=1}^{N}w_{n}^{t}}$
- 计算弱分类器的权重：
  $α t = l o g (1 ? ? t ? t ? ? ? ? ? \sqrt)$ $\alpha_t = log(\sqrt{\frac{1-\epsilon_t}{\epsilon_t}})$
- 更新样本的权重：
  $w (t + 1) n = w (t) n e x p (? α t y n ? t (x n))$ $w_{n}^{(t+1)} = w_{n}^{(t)}exp(-\alpha_t y_n \phi_t(x_n))$

最终得到强分类器:

G (x) = s i g n (\sum t = 1 T α t ? t (x))

$G(x) = sign(\sum_{t=1}^{T}\alpha_t\phi_t(x))$

3 - Adaboost算法分析

$\text{Adaboost}$ 可以看成当前向分步算法的损失函数为指数损失函数时的加法模型

模型最终的目的是要找到一堆弱学习器 $g_t(x)$ 的组合去逼近最优的 $f$ ，采用的方法是每次找一个弱分类器添加到已经集成的模型当中。与梯度下降算法的思路相似，无法直接得到最优解，而是一步一步的去逼近。

3.1 - 前向分步加法算法

损失函数：
- 目标函数：
  $m i n f 1 N \sum i = 1 N L (f (x i), y i)$ $min_{f}\frac1N\sum_{i=1}^{N}L(f(x_i), y_i)$
- 前向逐步递增：
  - 初始化：
    $f 0 (x) = a r g m i n f 1 N \sum i = 1 N L (f (x i), y i)$ $f_0(x) = argmin_{f}\frac1N\sum_{i=1}^{N}L(f(x_i), y_i)$
  - 递增：
    $(β t, ? t) = a r g m i n β, ? 1 N \sum i = 1 N L (f t ? 1 (x i) + β t ? t (x i), y i)$ $(\beta_t, \phi_t) = argmin_{\beta, \phi}\frac1N\sum_{i=1}^{N}L\big(f_{t-1}(x_i)+\beta_t \phi_t(x_i), y_i\big)$
    $f t (x) = f t ? 1 (x) + β t ? t (x)$ $f_t(x) = f_{t-1}(x) + \beta_t \phi_t(x)$
- 3.2 - 前向分步算法与Adaboost
  
  $\text{Adaboost}$ 可以看成当前向分步算法的损失函数为指数损失函数时的加法模型。
  
  证明如下：
  
  L(f(x),y)=exp(?yf(x))
  在第 t 步：
  $\begin{aligned} (1) & (β_{t}, ϕ_{t} (x)) & = a r g m i n_{β, ϕ} \sum_{i = 1}^{N} e x p (- y_{i} (f_{t - 1} (x_{i}) + β ϕ (x_{i}))) \\ (2) & = a r g m i n_{β, ϕ} \sum_{i = 1}^{N} e x p (- y_{i} (f_{t - 1} (x_{i})) e x p (- y_{i} β ϕ (x_{i})) \\ (3) & = a r g m i n_{β, ϕ} \sum_{i = 1}^{N} {\bar{w}}_{i}^{(t)} e x p (- y_{i} β ϕ (x_{i})) \\ (4) & = a r g m i n_{β, ϕ} \sum_{ϕ (x_{i}) = y_{i}} {\bar{w}}_{i}^{(t)} e^{- β} + \sum_{ϕ (x_{i}) \neq y_{i}} {\bar{w}}_{i}^{(t)} e^{β} \\ (5) & = a r g m i n_{β, ϕ} \sum_{ϕ (x_{i}) = y_{i}} {\bar{w}}_{i}^{(t)} e^{- β} + \sum_{ϕ (x_{i}) \neq y_{i}} {\bar{w}}_{i}^{(t)} e^{β} + \sum_{ϕ (x_{i}) \neq y_{i}} {\bar{w}}_{i}^{(t)} e^{- β} - \sum_{ϕ (x_{i}) \neq y_{i}} {\bar{w}}_{i}^{(t)} e^{- β} \\ (6) & = a r g m i n_{β, ϕ} e^{- β} \sum_{i = 1}^{N} {\bar{w}}_{i}^{(t)} + (e^{β} - e^{- β}) \sum_{ϕ (x_{i}) \neq y_{i}} {\bar{w}}_{i}^{(t)} \\ (1) & = a r g m i n_{β, ϕ} e^{- β} \sum_{i = 1}^{N} {\bar{w}}_{i}^{(t)} + (e^{β} - e^{- β}) \sum_{i = 1}^{N} {\bar{w}}_{i}^{(t)} I (y_{i} \neq ϕ (x_{i})) \end{aligned}$
  其中令：
  w?(t)i=exp(?yi(ft?1(xi))
  因为 w?(t)i 既不依赖于 β 也不依赖于 ? ，所以与最优化无关。所以能够使得 (1) 最小化的 ? 如下：
  ?t(x)=argmin?∑i=1Nw?(t)iI(yi≠?(xi))(*)
  接下来求解最佳的 β 为了求解能够使得 (1) 最小的 β ，我们首先对 β 求偏导：
  ?L?β=?e?β∑i=1Nw?(t)i+eβ∑i=1Nw?(t)iI(yi≠?(xi))+e?β∑i=1Nw?(t)iI(yi≠?(xi))(7)
  令 ?L?β=0 得到：
  eβ∑i=1Nw?(t)iI(yi≠?(xi))=e?β(∑i=1Nw?(t)i?∑i=1Nw?(t)iI(yi≠?(xi)))
  
  log(eβ)+log(∑i=1Nw?(t)iI(yi≠?(xi)))=log(e?β)+log((∑i=1Nw?(t)i?∑i=1Nw?(t)iI(yi≠?(xi))))
  两边取 log
  logeβ=12log∑Ni=1w?(t)i?∑Ni=1w?(t)iI(yi≠?(xi))log∑Ni=1w?(t)iI(yi≠?(xi))
  
  βt=12log∑Ni=1w?(t)i?∑Ni=1w?(t)iI(yi≠?(xi))log∑Ni=1w?(t)iI(yi≠?(xi))
  因为：
  ?t=∑Ni=1w?(t)iI(yi≠?(xi))∑Ni=1w?(t)i
  得到
  βt=12log1??t?t(*)
  现在已经求出了最佳的 βt 和 ?t(x) ，所以：
  ft(x)=ft?1(x)+βt?t(x)
  以及
  w?(t)i=exp(?yi(ft?1(xi))
  
  w?(t+1)i=exp(?yi(ft?1(x)+βt?t(x))=w?(t)iexp(?yiβt?t(x))
  
  得到了 $\text{Adaboost}$ 算法中的样本权重的更新规则。所以现在可以证得， $\text{Adaboost}$ 算法为前向分步加法算法损失函数为指数损失时候的特例。
  
  因为指数损失对outliers比较敏感 ，因此另一种选择的损失函数为 负log似然损失（交叉熵损失/logistic损失） ，这时会得到logitBoost。另外对于回归问题，损失函数可以取为L2损失（平方误差损失） ，此时得到的是L2Boosting。
  
  3.3 - 提升树用于回归
  
  当损失函数为 $L2$ 损失的时候：
  
  L(f(x),y)=(f(x)?y)2
  - 初始化为： $f_0(x) = \bar{y}$
  - 在第 $t$ 步的时候：
  $\begin{aligned} (18) & a r g m i n_{β, ϕ} L (f_{t} (x), y) & = a r g m i n_{β, ϕ} L (f_{t - 1} (x) + β ϕ (x), y) \\ (19) & = a r g m i n_{β, ϕ} (f_{t - 1} (x) + β ϕ (x) - y)^{2} \\ (20) & = a r g m i n_{β, ϕ} ((y - f_{t - 1} (x)) - β ϕ (x))^{2} \\ (21) & = a r g m i n_{β, ϕ} (r - β ϕ (x))^{2} \end{aligned}$
  
  因此每一轮是使用弱分类器来拟合残差 $y - f_{t-1}(x)$ 。
  
  提升树利用加法模型与前向分步计算法实现了学习的优化过程，当损失函数为指数损失和平方误差损失的时候，每一步的优化是简单的。但是对于一般的损失函数来说，往往比较复杂，所以出现了梯度提升算法。其关键是利用损失函数的负梯度在当前模型的值作为残差的近似值。
  
  4 - GBDT
  
  $\text{Boosting Tree}$ (提升树)适合于损失函数为平方损失（拟合残差）或者指数损失（ $\text{Adaboost}$ 的弱学习器限制为受限的决策树桩）。而 $\text{Gradient Boosting}$ 适合各类损失函数。
  
  对于 $\text{GBDT}$ ，其学习流程与 $\text{boosting tree}$ 相似，只是不再使用残差作为新的训练数据而是使用损失函数的梯度作为新的训练数据的 $y$ 值，具体的来说就是使用损失函数 $L(f(x), y)$ 对 $f(x)$ 求梯度然后带入 $f_{m-1}(x)$ 作为新的 $y$ 值，即利用损失函数的负梯度在当前模型的值作为回归问题提升树中残差的近似值来拟合一棵回归树：
  
  $- [\frac{\partial L (y, f (x_{i}))}{\partial f (x_{i})}]_{f (x) = f_{t - 1} (x)}$
  
  提升树模型每一次的提升都是靠上次的预测结果与训练数据的 $y$ 值差值作为新的训练数据进行重新训练， $\text{GDBT}$ 则是将残差计算替换成了损失函数的梯度方向，将上一次的预测结果带入梯度中求出本轮的训练数据，这两种模型就是在生成新的训练数据时采用了不同的方法。
  
  5 - XGBoost
  
  是 $\text{gradient boost}$ 的优化实现
  
  可自定义损失函数
  
  正则项：叶子节点的数目，叶子节点的分数( $\text{GBDT}$ 中没有正则项)
  
  建树和剪枝
  
  支持分裂点近似搜索
  
  系数特征处理
  
  缺失值处理
  
  特征重要性和特征选择
  
  并行，缓存
  
  $\text{gradient boost}$ 算法虽然对常见损失函数适用，但除了 $L2$ 损失函数，其他损失函数推导还是比较复杂。
  
  5.1 - xgboost中的损失函数
  
  $\text{XGBoost}$ ：对损失函数用二阶Taylor展开来近似。
  $f (x + Δ x) = f (x) + f' (x) Δ x + 1 2 f'' (x) Δ x 2$ $f(x+\Delta x) = f(x) + f'(x)\Delta x + \frac12f''(x)\Delta x^2$ 在第 $t$ 步，令 $g_{t, i} = [\frac{? L (f (x_{i}), y_{i})}{? f (x_{i})}]_{f = f_{t ? 1}}$ $g_{t,i} = \big[\frac{\partial L(f(x_i), y_i)}{\partial f(x_i)}\big]_{f=f_{t-1}}$ $h t, i = [? 2 L ( f ( x i ) , y i ) ? 2 f ( x i )] f = f t ? 1$ $h_{t,i} = \big[\frac{\partial^2 L(f(x_i), y_i)}{\partial^2 f(x_i)}\big]_{f=f_{t-1}}$ 这样可以得到： $L (f t ? 1 (x i) + ? (x i), y i) = L (f t ? 1 (x i), y i) + g t, i ? (x i) + 1 2 h t, i ? (x i) 2$ $L\big(f_{t-1}(x_i)+\phi(x_i), y_i\big) = L(f_{t-1}(x_i), y_i) + g_{t,i}\phi(x_i) + \frac12h_{t,i}\phi(x_i)^2$ 去掉与 $\phi(x_i)$ 无关的量
  $L (f t ? 1 (x i) + ? (x i), y i) = g t, i ? (x i) + 1 2 h t, i ? (x i) 2$ $L\big(f_{t-1}(x_i)+\phi(x_i), y_i\big) = g_{t,i}\phi(x_i) + \frac12h_{t,i}\phi(x_i)^2$
  
  实例：对于L2损失：
  $L (f (x), y) = 1 2 (f (x) ? y) 2$ $L(f(x), y) = \frac12(f(x) - y)^2$ $? L ( f ( x ) , y ) ? f ( x ) = f (x) ? y$ $\frac{\partial L(f(x), y)}{\partial f(x)} = f(x) - y$ $? 2 L ( f ( x ) , y ) ? 2 f ( x ) = 1$ $\frac{\partial^2 L(f(x), y)}{\partial^2 f(x)} = 1$ 所以可以得到： $g t, i = f t ? 1 (x i) ? y i$ $g_{t, i} = f_{t-1}(x_i) - y_i$ $h t, i = 1$ $h_{t,i} = 1$
  5.2 - xgboost中的树的定义
  
  重新对树进行了定义，将树拆成了结构部分 $q$ 和叶子分数部分 $w$ 。
  
  $? (x) = w q (x)$ $\phi(x) = w_{q(x)}$
  
  结构函数 $q$ ：把输入映射到叶子的索引号
  
  分数函数 $w$ ：把叶子索引号映射到叶子分数
  
  5.3 - 树的复杂度的定义
  
  模型的复杂度作为正则项:(不是唯一的定义方式，可视具体的问题而定)
  
  $Ω (? (x)) = γ T + 1 2 λ \sum t = 1 T w 2 t$ $\Omega(\phi(x)) = \gamma T + \frac12\lambda\sum_{t=1}^{T}w_t^2$
  
  其中第一项叶子结点的数目 $T$ 相当于 $L1$ 正则，第二项叶子节点的分数相当于 $L2$ 正则。
  
  5.4 - 目标函数
  
  这里对以下要用到的符号进行一下说明：
  
  $g_{m,i}, h_{m,i}$ 为损失函数第 $m$ 轮在第 $i$ 个样本上的一阶导数和二阶导数， $m = 1,\cdots,M$ , $M$ 为弱分类器的个数， $i = 1,\cdots, N$ ， $N$ 为样本的总数
  
  $T$ 为叶子节点的总数
  
  令每一个叶子 $t$ 上的样本集合为 $I_t = \{i|q(x_i) = t\}$
  
  $J (θ) = \sum i = 1 N L (f m (x i), y i) + Ω (θ) = \sum i = 1 N g m, i ? (x i) + 1 2 h m, i ? (x i) 2 + γ T + 1 2 λ \sum t = 1 T w 2 t = \sum i = 1 N g m, i w q (x i) + 1 2 h m, i w 2 q (x i) + γ T + 1 2 λ \sum t = 1 T w 2 t = \sum t = 1 T [\sum i \in I t g m, i w t + 1 2 \sum i \in I t h m, i w 2 t + 1 2 λ w 2 t] + γ T = \sum t = 1 T [\sum i \in I t g m, i w t + 1 2 (\sum i \in I t h m, i + λ) w 2 t] + γ T = \sum t = 1 T [G t w t + 1 2 (H t + λ) w 2 t] + γ T (12) (13) (14) (15) (16) (17)$ $\begin{align} J(\theta) &= \sum_{i=1}^{N}L(f_m(x_i), y_i) + \Omega(\theta) \\ &=\sum_{i=1}^{N}g_{m,i}\phi(x_i) + \frac12h_{m,i}\phi(x_i)^2 + \gamma T + \frac12\lambda\sum_{t=1}^{T}w_t^2\\ &=\sum_{i=1}^{N}g_{m,i}w_{q(x_i)} + \frac12h_{m,i}w_{q(x_i)}^2 + \gamma T + \frac12\lambda\sum_{t=1}^{T}w_t^2\\ &=\sum_{t=1}^{T} \big[ \sum_{i\in I_t}g_{m,i}w_t + \frac12\sum_{i\in I_t}h_{m,i}w_t^2 + \frac12\lambda w_t^2 \big] + \gamma T\\ &=\sum_{t=1}^{T} \big[ \sum_{i\in I_t}g_{m,i}w_t + \frac12\big(\sum_{i\in I_t}h_{m,i} + \lambda \big)w_t^2 \big] + \gamma T\\ &=\sum_{t=1}^{T} \big[ G_t w_t + \frac12\big(H_t + \lambda \big)w_t^2 \big] + \gamma T\\ \end{align}$ $J (θ) = \sum t = 1 T [G t w t + 1 2 (H t + λ) w 2 t] + γ T (1)$ $J(\theta)=\sum_{t=1}^{T} \big[ G_t w_t + \frac12\big(H_t + \lambda \big)w_t^2 \big] + \gamma T\tag1$ 如果树的结构 $q$ 已经知道的话，可以通过对 $(1)$ 求偏导得出最佳的 $w$ ，也就是叶子节点的分数。
  $\frac{? J (θ)}{? w_{t}} = G_{t} + (H_{t} + λ) w_{t} = 0$ $\frac{\partial J(\theta)}{\partial w_t} = G_t + (H_t + \lambda)w_t = 0$
  
  得到 每一个叶子最佳的分数 $w$ 为：
  $w : w_{t} = ? \frac{G_{t}}{H_{t} + λ}$ $w: w_t = -\frac{G_t}{H_t + \lambda}$
  
  将求得的 $w$ 带入目标函数 $J(\theta)$ ，可视为树的分数:
  $J (θ) = \sum t = 1 T [? G t G t H t + λ + 1 2 (H t + λ) (? G t H t + λ) 2] + γ T$ $J(\theta)=\sum_{t=1}^{T} \big[ -G_t\frac{G_t}{H_t + \lambda} + \frac12\big(H_t + \lambda \big)(-\frac{G_t}{H_t + \lambda})^2 \big] + \gamma T$
  $J (θ) = ? 1 2 \sum t = 1 T [G 2 t H t + λ] + γ T (2)$ $J(\theta)=-\frac12 \sum_{t=1}^{T} \big[ \frac{G_t^2}{H_t + \lambda} \big] + \gamma T\tag2$
  
  现在我们同样想要最小化 $J(\theta)$ 。
  
  现在的问题是，什么样的树的结构可以使得 $J(\theta)$ 最小呢？
  
  5.6 - 建树
  
  理论上，需要枚举所有可能的树的结构，根据 $(2)$ 计算结构的分数，选择分数最小的结构。
  
  实践中，我们贪婪的增加树的叶子结点的数目：
  
  从深度为0的树开始(所有的样本都在一个节点上)
  
  对于树的每一个叶子节点，尝试增加一个分裂点：
  
  令 $I_L, I_R$ 分别表示加入分裂点之后左右叶子结点的样本集合
  
  $G L = \sum i \in I L g m, i$ $G_L = \sum_{i\in I_L}g_{m,i}$
  
  $G R = \sum i \in I R g m, i$ $G_R = \sum_{i\in I_R}g_{m,i}$
  
  $H L = \sum i \in I L h m, i$ $H_L = \sum_{i\in I_L}h_{m,i}$
  
  $H R = \sum i \in I R h m, i$ $H_R = \sum_{i\in I_R}h_{m,i}$
  
  则增加分裂点之后目标函数的变化为
  
  $Gain = (? G 2 L + G 2 R H L + H R + λ + T m ? 1 γ) ? (? (G 2 L H L + λ + G 2 R H R + λ) + (T m ? 1 + 1) γ)$ $\text{Gain} = \bigg(-\frac{G_L^2+G_R^2}{H_L+H_R+\lambda} + T_{m-1}\gamma\bigg) - \bigg(-(\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda}) + (T_{m-1} + 1)\gamma\bigg)$
  
  $Gain = G 2 L H L + λ + G 2 R H R + λ ? G 2 L + G 2 R H L + H R + λ ? γ$ $\text{Gain} = \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda}-\frac{G_L^2+G_R^2}{H_L+H_R+\lambda}-\gamma$
  
  如何找最佳的分裂点？
  
  5.6.1 - 建树，精确搜索算法
  
  对每一个节点，穷举所有的特征(没有特征消耗)，对每一个特征，穷举所有的分裂点
  
  对每个特征，通过特征的取值将实例进行排序
  
  寻求该特征的最优分裂点
  
  对所有的特征采用如下的操作
  
  深度为k的树的时间复杂度：对特征所有取值的排序为 $NlogN$ ，N为样本点数目，若有D维特征，则 $kDNlogN$
  
  5.6.2 - 建树，近似搜索算法
  
  当数据太多不能装载到内存时，不能进行精确搜索分裂，只能近似搜索
  
  根据特征分布的百分位数，提出特征的一些候选分裂点
  
  将连续特征值映射到桶里（候选点对应的分裂），然后根据桶里样本的统计量，从这些候选中选择最佳分裂点
  
  根据候选提出的时间，分为
  
  全局近似：在构造树的初始阶段提出所有的候选分裂点，然后对各个层次采用相同的候选
  
  提出候选的次数少，但每次的候选数目多（因为候选不更新）
  
  局部近似：在每次分裂都重新提出候选，
  
  对层次较深的树更适合，候选分裂点的数目不需要太多
  
  5.6.3 - 建树，稀疏特征和缺失值
  
  在实际任务中，极有可能遇到稀疏特征
  
  缺失数据
  
  人工设计的特征：如one-hot编码
  
  $\text{XGBoost}$ ：在树的每个结点设置一个缺省方向
  
  5.7 - 剪枝
  
  分裂的增益
  
  $Gain = G 2 L H L + λ + G 2 R H R + λ ? G 2 L + G 2 R H L + H R + λ ? γ$ $\text{Gain} = \frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda}-\frac{G_L^2+G_R^2}{H_L+H_R+\lambda}-\gamma$
  
  提前终止
  
  如果出现负值，提前停止（ $\text{scikit-learn}$ 中采用的策略）
  
  但被提前终止掉的分裂可能其后续的分裂会带来好处
  
  但是在 $\text{XGBoost}$ 中采用的是过后剪枝策略：将树分裂到最大深度，然后再基于上述增益计算剪枝。在实现时还有学习率/收缩因子，给后续轮留机会，进一步防止。
  
  6 - sklearn参数设置
  
  xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100,silent=True, objective=‘binary:logistic’, nthread=-1, gamma=0,min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1,colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,base_score=0.5, random_state=0, seed=None, missing=None，**kwargs)
  
  $\text{max_depth}$ ：树的最大深度。越大通常模型复杂，更容易过拟合，这里作为弱学习器一般设置为1-10
  
  $\text{learning_rate}$ ：学习率或收缩因子。学习率和迭代次数／弱分类器数目n_estimators相关。缺省：0.1 （与直接调用xgboost的eta参数含义相同）
  
  $\text{n_estimators}$ ：弱分类器数目. 缺省:100
  
  $\text{objective}$ ：待优化的目标函数，常用值有：
  
  $\text{binary:logistic}$ 二分类的逻辑回归，返回预测的概率
  
  $\text{multi:softmax}$ 使用softmax的多分类器，返回预测的类别(不是概率)。
  
  $\text{multi:softprob}$ 和multi:softmax参数一样，但是返回的是每个数据属于各个类别的概率。
  
  支持用户自定义目标函数
  
  $\text{booster}$ ：选择每次迭代的模型，有两种选择：
  
  $\text{gbtree}$ ：基于树的模型，为缺省值。
  
  $\text{gbliner}$ ：线性模型
  
  $\text{gamma}$ ：节点分裂所需的最小损失函数下降值，缺省0
  
  $\text{min_child_weight}$ ：叶子结点需要的最小样本权重（ $\text{hessian}$ ）和
  
  $\text{subsample}$ ：构造每棵树的所用样本比例（样本采样比例）
  
  $\text{colsample_bytree}$ ：构造每棵树的所用特征比例
  
  $\text{colsample_bylevel}$ ：树在每层每个分裂的所用特征比例
  
  $\text{reg_alpha}$ ： $L1$ 正则的惩罚系数
  
  $\text{reg_lambda}$ ： $L2$ 正则的惩罚系数
  
  $\text{scale_pos_weight}$ ：正负样本的平衡，通常用于不均衡数据
  
  $\text{slient}$ ：参数值为1时，静默模式开启，不输出任何信息
  
  $\text{nthread}$ ：用来进行多线程控制。如果你希望使用CPU全部的核，那就用缺省值-1，算法会自动检测它。
  
  $\text{max_delta_step}$ ：允许的树的最大权重
  
  7 -xgboost的其他功能
  
  7.1 - 特征重要性
  
  采用树集成学习（如 $\text{Gradient Boosting}$ ）的好处之一是可以从训练好的预测模型得到特征重要性
  
  单棵树中的特征重要性：每个分裂点对性能的提升量，并用每个结点的样本数加权。
  
  性能测量可以是用于选择分裂点的指标（纯度）或其他更特别的误差函数
  
  整个模型中的特征重要性：模型中每棵树的平均
  
  在 $\text{XGBoost}$ 中已经自动算好，存放在 $\text{feature_importances}$ 中
  
  from matplotlib import pyplot pyplot.bar(range(len(model_XGB.feature_importances_)), model_XGB.feature_importances_) pyplot.show()
  
  还可以使用 $\text{XGBoost}$ 内嵌的函数，按特征重要性排序
  
  from xgboost import plot_importance plot_importance(model_XGB) pyplot.show()
  
  7.2 - 特征选择
  
  以根据特征重要性进行特征选择
  
  from sklearn.feature_selection import SelectFromModel
  
  输入一个（在全部数据集上）训练好的模型和给定阈值，重要性大于阈值的特征被选中
  
  selection = SelectFromModel(model, threshold=thresh, prefit=True) select_X_train = selection.transform(X_train)
  
  8 - 实战
  
  加载必要的包和数据
  
  from xgboost import XGBClassifier import pandas as pd from sklearn.preprocessing import LabelEncoder from sklearn.cross_validation import train_test_split import numpy as np from sklearn.metrics import accuracy_score import matplotlib.pyplot as plt
  
  data = pd.read_csv("./data/mushrooms.csv")
  
  查看数据的形状和属性名。
  
  print(data.shape) print(data.columns)
  
  (8124, 23) Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor','gill-attachment', 'gill-spacing', 'gill-size', 'gill-color','stalk-shape', 'stalk-root', 'stalk-surface-above-ring','stalk-surface-below-ring', 'stalk-color-above-ring','stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number','ring-type', 'spore-print-color', 'population', 'habitat'],dtype='object')
  
  查看是否有空值
  
  print(data.isnull().sum())
  
  class 0 cap-shape 0 cap-surface 0 cap-color 0 bruises 0 odor 0 gill-attachment 0 gill-spacing 0 gill-size 0 gill-color 0 stalk-shape 0 stalk-root 0 stalk-surface-above-ring 0 stalk-surface-below-ring 0 stalk-color-above-ring 0 stalk-color-below-ring 0 veil-type 0 veil-color 0 ring-number 0 ring-type 0 spore-print-color 0 population 0 habitat 0 dtype: int64
  
  查看 $\text{target value}$ 的取值
  
  print(data["class"].unique())
  
  ['p' 'e']
  
  查看特征的数据类型
  
  print(data.dtypes)
  
  class object cap-shape object cap-surface object cap-color object bruises object odor object gill-attachment object gill-spacing object gill-size object gill-color object stalk-shape object stalk-root object stalk-surface-above-ring object stalk-surface-below-ring object stalk-color-above-ring object stalk-color-below-ring object veil-type object veil-color object ring-number object ring-type object spore-print-color object population object habitat object dtype: object
  
  进行编码
  
  labelencoder = LabelEncoder() for col in data.columns:data[col] = labelencoder.fit_transform(data[col]) data.head(5)
  
  得到 $X$ 和 $y$
  
  X = data.iloc[:,1:23] y = data.iloc[:,0] print(X.shape) print(y.shape)
  
  (8124, 22) (8124,)
  
  划分训练集和测试集
  
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) print(X_train.shape) print(y_train.shape) print(X_test.shape) print(y_test.shape)
  
  (6499, 22) (6499,) (1625, 22) (1625,)
  
  使用默认配置的 $\text{XGBClassifier}$ 并训练
  
  xgb = XGBClassifier() xgb.fit(X_train, y_train)
  
  XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,max_depth=3, min_child_weight=1, missing=None, n_estimators=100,n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,silent=True, subsample=1)
  
  在测试集上的表现
  
  pred_test = xgb.predict(X_test) pred_test = np.where(pred_test > 0.5, 1, 0) print("Test Accuracy: " + str(accuracy_score(y_test, pred_test)))
  
  Test Accuracy: 1.0
  
  查看特征重要性
  
  print(xgb.feature_importances_)
  
  [0.00638298 0.00425532 0.00638298 0. 0.35531914 0.0.05106383 0.04255319 0.05319149 0.02340426 0.04680851 0.036170210.04680851 0.00851064 0.01702128 0. 0. 0.046808510. 0.15531915 0.07659575 0.02340426]
  
  plt.bar(range(len(xgb.feature_importances_)), xgb.feature_importances_) plt.show()
  
  from xgboost import plot_importance plot_importance(xgb) plt.show()
  
  进行特征选择
  
  thresholds = sorted(xgb.feature_importances_) for thresh in thresholds:selection = SelectFromModel(xgb, threshold=thresh, prefit=True)select_X_train = selection.transform(X_train)xgb_select = XGBClassifier()xgb_select.fit(select_X_train, y_train)select_X_test = selection.transform(X_test)pred = xgb_select.predict(select_X_test)pred = np.where(pred>0.5, 1, 0)print("thresh:%f, features size:%d, accuracy:%.3f" % (thresh, select_X_train.shape[1], accuracy_score(pred, y_test)))
  
  thresh:0.000000, features size:22, accuracy:1.000 thresh:0.000000, features size:22, accuracy:1.000 thresh:0.000000, features size:22, accuracy:1.000 thresh:0.000000, features size:22, accuracy:1.000 thresh:0.000000, features size:22, accuracy:1.000 thresh:0.004255, features size:17, accuracy:1.000 thresh:0.006383, features size:16, accuracy:1.000 thresh:0.006383, features size:16, accuracy:1.000 thresh:0.008511, features size:14, accuracy:1.000 thresh:0.017021, features size:13, accuracy:1.000 thresh:0.023404, features size:12, accuracy:1.000 thresh:0.023404, features size:12, accuracy:1.000 thresh:0.036170, features size:10, accuracy:1.000 thresh:0.042553, features size:9, accuracy:1.000 thresh:0.046809, features size:8, accuracy:1.000 thresh:0.046809, features size:8, accuracy:1.000 thresh:0.046809, features size:8, accuracy:1.000 thresh:0.051064, features size:5, accuracy:0.993 thresh:0.053191, features size:4, accuracy:0.993 thresh:0.076596, features size:3, accuracy:0.993 thresh:0.155319, features size:2, accuracy:0.993 thresh:0.355319, features size:1, accuracy:0.985
  
  参考资料
  
  ID3、C4.5、CART、随机森林、bagging、boosting、Adaboost、GBDT、xgboost算法总结257赞
  
  RF、GBDT、XGBoost面试级整理