Hands-On: Bayesian Optimization of a GBDT (LightGBM) Classification Task, Compared with Random Search

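- 1. Data Preprocessing
  - 1.1 Load, clean, and split the data
  - 1.2 Label distribution
- 2. Building a Baseline Model
  - 2.1 LightGBM model
  - 2.2 Performance with default parameters
- 3. Defining the Parameter Space
  - 3.* Sampling from the parameter space
- 4. Random Search
  - 4.1 Cross-validating LightGBM
  - 4.2 Objective Function
  - 4.3 Running the random search
  - 4.4 Random Search results
- 5. Bayesian Optimization
  - 5.1 Objective Function
  - 5.2 Domain Space
    - 5.2.1 Learning rate distribution
    - 5.2.2 Number-of-leaves distribution
    - 5.2.3 boosting_type
    - 5.2.4 Summary of parameter distributions
      - 5.2.4.* A look at sampled parameters
  - 5.3 Setting up Bayesian optimization
  - 5.4 Bayesian optimization results
    - 5.4.1 Saving the results
    - 5.4.2 Performance on the test set
- 6. Random Search vs. Bayesian Optimization
  - 6.1 Visualizing the tuning process
  - 6.2 Learning rate comparison
  - 6.3 Boosting Type comparison
  - 6.4 Comparison of numeric parameters
- 7. How Parameters Evolve under Bayesian Optimization
  - 7.1 Boosting Type over time
  - 7.2 Learning rate, number of leaves, ... over time
  - 7.3 reg_alpha and reg_lambda over time
  - 7.4 Loss over time: random search vs. Bayesian optimization
  - 7.5 Saving the results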

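This walkthrough uses an insurance dataset (the Caravan Insurance Challenge data) for a GBDT classification task and compares Bayesian optimization with random search for hyperparameter tuning.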

1. Data Preprocessing

1.1 Load, clean, and split the data

import pandas as pd
import numpy as np

# Load the full dataset (train and test rows are distinguished by the ORIGIN column)
data = pd.read_csv('caravan-insurance-challenge.csv')
data.head()


# Split into train and test sets using the ORIGIN column
train = data[data['ORIGIN'] == 'train']
test = data[data['ORIGIN'] == 'test']

# Extract the labels (CARAVAN) as flat integer arrays
train_labels = np.array(train['CARAVAN'].astype(np.int32)).reshape((-1,))
test_labels = np.array(test['CARAVAN'].astype(np.int32)).reshape((-1,))

# Drop the split indicator and the label from the feature frames
train = train.drop(['ORIGIN', 'CARAVAN'], axis=1)
test = test.drop(['ORIGIN', 'CARAVAN'], axis=1)

features = np.array(train)
test_features = np.array(test)
labels = train_labels[:]

print('Train shape:', train.shape)
print('Test shape:', test.shape)
train.head()


1.2 Label distribution

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Histogram of the training labels
plt.hist(labels, edgecolor='k')
plt.xlabel('Label'); plt.ylabel('Count'); plt.title('Count of Labels')

The classes are imbalanced, so we evaluate with the ROC curve; from here on, the goal is to make the AUC as large as possible.
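To make the choice of metric concrete, here is a minimal sketch of why accuracy can mislead on imbalanced labels while AUC does not. The 94/6 class split and the helper names `roc_auc` and `accuracy` are illustrative assumptions, not values or functions from the caravan data or from any library. AUC is computed from its ranking definition: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half.

```python
def roc_auc(y_true, y_score):
    """AUC via pairwise ranking: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (assumed for illustration): 94 negatives, 6 positives
y_true = [0] * 94 + [1] * 6

# A "model" that always predicts the majority class with the same score
constant_acc = accuracy(y_true, [0] * 100)
constant_auc = roc_auc(y_true, [0.0] * 100)

# A model that ranks every positive above every negative
perfect_auc = roc_auc(y_true, [float(y) for y in y_true])

print('constant model accuracy:', constant_acc)
print('constant model AUC:', constant_auc)
print('perfect ranker AUC:', perfect_auc)
```

The constant model looks strong on accuracy only because negatives dominate; its AUC stays at chance level, which is why AUC is the more informative target here.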

2. Building a Baseline Model

2.1 LightGBM model

import lightgbm as lgb
model = lgb.LGBMClassifier()
model

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0, importance_type='split', learning_rate=0.1, max_depth=-1, min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0, n_estimators=100, n_jobs=-1, num_leaves=31, objective=None, random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

2.2 Performance with default parameters

This is the baseline model; the goal from here is to push its AUC as high as possible.

from sklearn.metrics import roc_auc_score
from timeit import default_timer as timer

# Time the fit of the default model
start = timer()
model.fit(features, labels)
train_time = timer() - start

# Score the positive-class probabilities on the test set
predictions = model.predict_proba(test_features)[:, 1]
auc = roc_auc_score(test_labels, predictions)

print('The baseline score on the test set is {:.4f}.'.format(auc))
print('The baseline training time is {:.4f} seconds.'.format(train_time))

The baseline score on the test set is 0.7092.
The baseline training time is 0.3402 seconds.

3. Defining the Parameter Space

RandomizedSearchCV does not provide early stopping, so we write the random search ourselves.
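As a rough illustration of the early-stopping rule that a hand-written search needs, the sketch below scans a sequence of per-round validation scores and stops once a fixed number of rounds pass without improvement. The function name `best_round_with_early_stopping`, the `patience` parameter, and the AUC values are hypothetical; this is not the author's implementation and it does not call LightGBM.

```python
def best_round_with_early_stopping(val_scores, patience):
    """Return (best_round, best_score) for a higher-is-better metric.

    Rounds are 1-indexed. Scanning stops as soon as `patience` consecutive
    rounds have passed without beating the best score seen so far.
    """
    best_score = float('-inf')
    best_round = 0
    for round_idx, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round, best_score


# Made-up validation AUC per boosting round
auc_per_round = [0.70, 0.73, 0.75, 0.76, 0.755, 0.758, 0.759, 0.80]

stopped = best_round_with_early_stopping(auc_per_round, patience=3)
patient = best_round_with_early_stopping(auc_per_round, patience=10)
print('patience=3  ->', stopped)
print('patience=10 ->', patient)
```

With a small patience the scan stops before it ever reaches the late improvement in the last round, which is the usual trade-off: less training time in exchange for possibly missing a later gain.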

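Some parameters, such as the learning rate, are given a log-scale distribution: their effect is multiplicative, so by convention they are searched on a log scale. Other parameters can only be chosen conditionally on the value of another parameter; for example, subsample is fixed at 1.0 when boosting_type is 'goss'.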
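The effect of a log-scale distribution can be sketched with the standard library alone. The helper `sample_log_uniform` is a hypothetical name; the bounds 0.005 and 0.2 are the learning-rate bounds used in this section. Drawing uniformly in log space and exponentiating puts roughly half of the samples below the geometric mean of the bounds, whereas a plain uniform draw would put far fewer samples there.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
low, high = 0.005, 0.2

samples = [sample_log_uniform(low, high, rng) for _ in range(10000)]

# Geometric mean of the bounds: the midpoint of the range on a log scale
geometric_mean = math.sqrt(low * high)

# Share of log-uniform samples below the geometric mean (close to one half)
share_below = sum(s < geometric_mean for s in samples) / len(samples)

# Share a plain uniform distribution on [low, high] would place below it
uniform_share_below = (geometric_mean - low) / (high - low)

print('geometric mean of bounds:', round(geometric_mean, 4))
print('log-uniform share below it:', round(share_below, 3))
print('uniform share below it:', round(uniform_share_below, 3))
```

This is why small learning rates get a fair share of the trials under a log-scale search instead of being crowded out by the upper end of the range.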

import random

# Candidate values for each hyperparameter
param_grid = {
    'class_weight': [None, 'balanced'],
    'boosting_type': ['gbdt', 'goss', 'dart'],
    'num_leaves': list(range(30, 150)),
    # Evenly spaced in log space between 0.005 and 0.2
    'learning_rate': list(np.logspace(np.log(0.005), np.log(0.2), base=np.exp(1), num=800)),
    'subsample_for_bin': list(range(20000, 300000, 20000)),
    'min_child_samples': list(range(20, 500, 5)),
    'reg_alpha': list(np.linspace(0, 1)),
    'reg_lambda': list(np.linspace(0, 1)),
    'colsample_bytree': list(np.linspace(0.6, 1, 10))
}

# subsample is kept separate because it depends on boosting_type
subsample_dist = list(np.linspace(0.5, 1, 100))

# Distribution of the learning rate
plt.hist(param_grid['learning_rate'], color='r', edgecolor='k')
plt.xlabel('Learning Rate'); plt.ylabel('Count'); plt.title('Learning Rate Distribution', size=18)


# Distribution of the number of leaves
plt.hist(param_grid['num_leaves'], color='m', edgecolor='k')
plt.xlabel('Number of Leaves'); plt.ylabel('Count'); plt.title('Number of Leaves Distribution')


3.* Sampling from the parameter space

# Draw two candidate values per parameter to inspect the grid
{key: random.sample(value, 2) for key, value in param_grid.items()}


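# Draw one value per parameter to form a single configuration
params = {key: random.sample(value, 1)[0] for key, value in param_grid.items()}

# subsample is only sampled when boosting_type is not 'goss'; otherwise it is fixed at 1.0
params['subsample'] = random.sample(subsample_dist, 1)[0] if params['boosting_type'] != 'goss' else 1.0
params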

{'class_weight': 'balanced', 'boosting_type': 'gbdt',
'num_leaves': 149, 'learning_rate': 0.024474734290096542,
'subsample_for_bin': 200000, 'min_child_samples': 110,
'r
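The conditional draw above can be wrapped in a small helper so that every sampled configuration respects the constraint. This is a sketch under assumptions: `sample_params` is a hypothetical name, the grid is cut down to two parameters, and `toy_subsample` is a pure-Python stand-in for `subsample_dist` so the block runs without NumPy.

```python
import random


def sample_params(param_grid, subsample_dist, rng):
    """Draw one configuration; subsample is fixed at 1.0 when boosting_type is 'goss'."""
    params = {key: rng.choice(values) for key, values in param_grid.items()}
    if params['boosting_type'] == 'goss':
        params['subsample'] = 1.0
    else:
        params['subsample'] = rng.choice(subsample_dist)
    return params


# Reduced grid for illustration only
toy_grid = {
    'boosting_type': ['gbdt', 'goss', 'dart'],
    'num_leaves': list(range(30, 150)),
}

# 100 evenly spaced values from 0.5 to 1.0
toy_subsample = [0.5 + 0.5 * i / 99 for i in range(100)]

rng = random.Random(42)
draws = [sample_params(toy_grid, toy_subsample, rng) for _ in range(300)]

goss_draws = [d for d in draws if d['boosting_type'] == 'goss']
print('number of goss draws:', len(goss_draws))
print('all goss draws use subsample 1.0:', all(d['subsample'] == 1.0 for d in goss_draws))
```

Keeping the dependent parameter out of the main grid and filling it in after the draw avoids generating configurations that violate the constraint.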
