一. Gradient Boosting Decision Tree (GBDT)
The core idea of GBDT (Gradient Boosting Decision Tree) is that each new tree fits the residual (the negative gradient of the loss) left over from the summed predictions of all previous trees; this residual is exactly the amount that, when added to the current prediction, recovers the true value.
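To see why "residual" and "negative gradient" coincide here, consider the squared-error loss used in the regression case below:

```latex
L\bigl(y, F(x)\bigr) = \tfrac{1}{2}\bigl(y - F(x)\bigr)^2
\qquad\Longrightarrow\qquad
-\frac{\partial L}{\partial F(x)} = y - F(x)
```

The negative gradient with respect to the current prediction F(x) is precisely the residual y - F(x), so fitting the next tree to the negative gradient is, for squared loss, the same as fitting it to the residual.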
For classification, GBDT uses the same log-likelihood (log-loss) objective as logistic regression.
For regression, GBDT minimizes the squared error (least squares, "ls"). Each leaf node outputs a prediction equal to the mean of the labels of the samples falling into that leaf. When splitting, the tree exhaustively tries every threshold of every feature and keeps the split that minimizes the squared error.
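The residual-fitting idea can be sketched by hand-chaining two shallow regression trees: the second tree is trained on the residuals left by the first, and their sum lowers the training error. This is a minimal illustration on synthetic data, not the library's internal implementation.

```python
# Sketch of GBDT's residual fitting for regression with squared loss:
# tree 2 learns the residual (negative gradient) left by tree 1.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Tree 1 fits y directly; each leaf predicts the mean label of its samples.
t1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
pred1 = t1.predict(X)

# Tree 2 fits the residual y - pred1, i.e. the negative gradient of 1/2 (y-F)^2.
t2 = DecisionTreeRegressor(max_depth=2).fit(X, y - pred1)
pred2 = pred1 + t2.predict(X)

mse1 = np.mean((y - pred1) ** 2)
mse2 = np.mean((y - pred2) ** 2)
print(mse1, mse2)  # adding the residual tree reduces the training error
```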
import gzip
import pickle as pkl
from sklearn.model_selection import train_test_split

def load_data(path):
    f = gzip.open(path, 'rb')
    try:
        train_set, valid_set, test_set = pkl.load(f, encoding='latin1')  # Python 3
    except:
        train_set, valid_set, test_set = pkl.load(f)  # Python 2
    f.close()
    return train_set, valid_set, test_set
path = 'mnist.pkl.gz'
train_set, valid_set, test_set = load_data(path)
Xtrain, _, ytrain, _ = train_test_split(train_set[0], train_set[1], test_size=0.9)
Xtest,_,ytest,_ = train_test_split(test_set[0], test_set[1], test_size=0.9)
print(Xtrain.shape, ytrain.shape, Xtest.shape, ytest.shape)
(5000, 784) (5000,) (1000, 784) (1000,)
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
import time
gbc = GradientBoostingClassifier(n_estimators=10, learning_rate=0.1, max_depth=3)
start_time = time.time()
gbc.fit(Xtrain, ytrain)
end_time = time.time()
print('The training time = {}'.format(end_time - start_time))
gbc_pred = gbc.predict(Xtest)
gbc_accuracy = np.mean(gbc_pred == ytest)
print('The accuracy = {}'.format(gbc_accuracy))
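Since each boosting stage adds one tree fitted to the remaining gradient, it is instructive to watch the ensemble improve as trees accumulate. A sketch using scikit-learn's `staged_predict` (a real API of `GradientBoostingClassifier`), on a synthetic dataset so the snippet is self-contained rather than depending on the MNIST file above:

```python
# Track accuracy after each boosting stage with staged_predict.
# make_classification stands in for MNIST to keep the sketch self-contained.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

gbc = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=3)
gbc.fit(Xtr, ytr)

# staged_predict yields the ensemble's prediction after each added tree.
acc = [np.mean(pred == yte) for pred in gbc.staged_predict(Xte)]
print(acc[0], acc[-1])  # accuracy generally rises as trees accumulate
```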