Financial Risk Control: Loan Default Prediction Study Notes (Part 5: Model Ensembling)
- 1. Introduction
- 2. Code Examples
  - 2.1 Simple Averaging
1. Introduction
Fuse the models obtained from the earlier modeling and tuning steps, try several fusion schemes, and submit the fused results. (Model ensembling is generally used near the end of the A-leaderboard phase of a competition and throughout the B-leaderboard phase.)
Model ensembling is an important way to gain score in the later stages of a competition, especially when learning in a team: fusing different teammates' models can bring unexpectedly good results. As a rule, the more the models differ from each other, provided each of them already performs well, the larger the improvement after fusion. The common fusion approaches are listed below.
- Averaging:
  Simple averaging: fuse the results directly by taking the mean of the predictions; if pre1-pren are the predictions of n models, the fused result is their average.
  Weighted averaging: weight the models according to their previous validation accuracy, giving more accurate models larger weights (a weighted/rank-averaging sketch follows the simple-average code in section 2.1).
- Voting:
  Simple voting: split into hard voting (Hard Voting, the final result follows the majority of the models' predicted labels) and soft voting (Soft Voting, the class probabilities predicted by all models are averaged, and the class with the highest mean probability is taken as the final prediction).
  Weighted voting: assign different weights to the base models' votes (see the voting sketch at the end of this section).
- Other combinations:
  Rank averaging: average the ranks of each model's predictions rather than the raw scores (see the averaging sketch in section 2.1).
  Log fusion: average the predictions in log (log-odds) space before converting back to probabilities.
- stacking:
  Build multi-level models: the base models' out-of-fold predictions are used as inputs to fit a higher-level model that produces the final prediction (see the stacking sketch at the end of this section).
- blending:
  Hold out part of the training data: train the base models on the rest, use their predictions on the hold-out part as new features, and fit a second-level model on the hold-out data (see the blending sketch at the end of this section).
- boosting/bagging:
  The differences between Bagging and Boosting can be summarized as follows (a minimal sketch appears at the end of this section):
  Sample selection: Bagging draws each round's training set from the original data with replacement, so the training sets of different rounds are independent of each other; Boosting keeps the training set fixed in every round and only changes the weight of each sample, with the weights adjusted according to the previous round's classification results.
  Sample weights: Bagging uses uniform sampling, so every sample has equal weight; Boosting keeps adjusting the sample weights according to the error rate, and samples with larger errors receive larger weights.
  Prediction functions: in Bagging all predictors have equal weight; in Boosting each weak learner has its own weight, and learners with smaller classification error receive larger weights.
  Parallel computation: Bagging's predictors can be generated in parallel; Boosting's predictors can only be generated sequentially, because each model's parameters depend on the results of the previous round.
Reference: https://blog.csdn.net/wuzhongqiang/article/details/105012739
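As a minimal, illustrative sketch of hard and soft voting: the estimators and the make_classification data below are placeholders, not part of this competition's pipeline.
```python
# Illustrative hard vs. soft voting with scikit-learn (placeholder data and models).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=2020)    # placeholder data
base = [('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=2020))]

hard = VotingClassifier(estimators=base, voting='hard')           # majority vote on predicted labels
soft = VotingClassifier(estimators=base, voting='soft',
                        weights=[1, 2])                           # weighted average of class probabilities
print('hard voting acc:', cross_val_score(hard, X, y, cv=5).mean())
print('soft voting acc:', cross_val_score(soft, X, y, cv=5).mean())
```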
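A minimal stacking sketch on the same kind of placeholder data, assuming scikit-learn base models and out-of-fold probabilities as the level-2 features.
```python
# Illustrative stacking: out-of-fold base-model probabilities feed a level-2 (meta) model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, random_state=2020)     # placeholder data
bases = [RandomForestClassifier(n_estimators=100, random_state=2020),
         GradientBoostingClassifier(random_state=2020)]

# Level-1 features: one column of out-of-fold probabilities per base model.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method='predict_proba')[:, 1] for m in bases
])
meta = LogisticRegression().fit(oof, y)                            # level-2 model fitted on OOF predictions
print('meta-model coefficients:', meta.coef_)
```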
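A minimal blending sketch; the 30% hold-out split and the base models are illustrative choices, not the notebook's final setup.
```python
# Illustrative blending: base models are trained on one part of the data,
# and their predictions on a hold-out part become features for a second-level model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=2020)      # placeholder data
X_base, X_hold, y_base, y_hold = train_test_split(X, y, test_size=0.3, random_state=2020)

bases = [RandomForestClassifier(n_estimators=100, random_state=2020),
         GradientBoostingClassifier(random_state=2020)]
for m in bases:
    m.fit(X_base, y_base)                                           # base models see only the non-hold-out part

# Hold-out predictions become the second-level features.
hold_feats = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in bases])
blender = LogisticRegression().fit(hold_feats, y_hold)              # second-level model trained on the hold-out
print('blender coefficients:', blender.coef_)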
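A minimal sketch contrasting bagging and boosting with scikit-learn's BaggingClassifier and AdaBoostClassifier, again on placeholder data.
```python
# Illustrative contrast: bagging fits base learners independently on bootstrap samples
# (parallelizable, n_jobs=-1); boosting fits them sequentially with sample re-weighting.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=2020)      # placeholder data
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        n_jobs=-1, random_state=2020)
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50,
                           random_state=2020)
print('bagging AUC :', cross_val_score(bag, X, y, cv=5, scoring='roc_auc').mean())
print('boosting AUC:', cross_val_score(boost, X, y, cv=5, scoring='roc_auc').mean())
```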
2. Code Examples
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
def reduce_mem_usage(df):
    # Memory usage before optimization (memory_usage() returns bytes).
    start_mem = df.memory_usage().sum()
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_dtype = df[col].dtype
        if col_dtype != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_dtype)[:3] == 'int':
                # Downcast integer columns to the smallest dtype that can hold their range.
                for dtype in [np.int8, np.int16, np.int32, np.int64]:
                    if c_min > np.iinfo(dtype).min and c_max < np.iinfo(dtype).max:
                        df[col] = df[col].astype(dtype)
                        break
            else:
                # Downcast float columns in the same way.
                for dtype in [np.float16, np.float32, np.float64]:
                    if c_min > np.finfo(dtype).min and c_max < np.finfo(dtype).max:
                        df[col] = df[col].astype(dtype)
                        break
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum()
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
x_train = pd.read_csv('Dataset/data_for_model.csv')
x_train = reduce_mem_usage(x_train)
y_train = pd.read_csv('Dataset/label_for_model.csv', names=['isDefault'])['isDefault'].astype(np.int8)
x_test = pd.read_csv('Dataset/testA_With_FeatureEngineering.csv')
x_test = reduce_mem_usage(x_test)
2.1 Simple Averaging
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2020
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])   # out-of-fold predictions on the training set
    test = np.zeros(test_x.shape[0])     # averaged predictions on the test set
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('*' * 30, str(i + 1), '*' * 30)
        trn_x, trn_y, val_x, val_y = (train_x.iloc[train_index], train_y[train_index],
                                      train_x.iloc[valid_index], train_y[valid_index])
        if clf_name == 'lgb':
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': seed,
                'nthread': 28,
                'n_jobs': 24,
                'silent': True,
                'verbose': -1
            }
            model = clf.train(params, train_matrix, 500000, valid_sets=[train_matrix, valid_matrix],
                              verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        if clf_name == 'xgb':
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)
            params = {
                'booster': 'gbtree',
                'objective': 'binary:logistic',
                'eval_metric': 'auc',
                'gamma': 1,
                'min_child_weight': 1.5,
                'max_depth': 5,
                'lambda': 10,
                'subsample': 0.7,
                'colsample_bytree': 0.7,
                'colsample_bylevel': 0.7,
                'eta': 0.04,
                'tree_method': 'exact',
                'seed': seed,
                'nthread': 36,
                'silent': True,
                # 'tree_method': 'gpu_hist'
            }
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist,
                              verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
        if clf_name == 'cat':
            params = {
                'learning_rate': 0.05,
                'depth': 5,
                'l2_leaf_reg': 10,
                'bootstrap_type': 'Bernoulli',
                'od_type': 'Iter',
                'od_wait': 50,
                'random_seed': 11,
                'allow_writing_files': False
            }
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=500)
            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits   # accumulate the fold-averaged test predictions
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)
    print('%s_score_list: ' % clf_name, cv_scores)
    print('%s_score_mean: ' % clf_name, np.mean(cv_scores))
    print('%s_score_std: ' % clf_name, np.std(cv_scores))
    return train, test
def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, 'lgb')
    return lgb_train, lgb_test

def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, 'xgb')
    return xgb_train, xgb_test

def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, 'cat')
    return cat_train, cat_test
lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)
cat_train, cat_test = cat_model(x_train, y_train, x_test)
pred = (lgb_test + xgb_test + cat_test) / 3
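Beyond the simple mean, a sketch of weighted averaging and rank averaging over the same three test predictions; the weights below are illustrative and would normally be set from each model's CV AUC.
```python
# Weighted average: the weights here are illustrative, not tuned values.
weights = [0.4, 0.3, 0.3]
pred_weighted = weights[0] * lgb_test + weights[1] * xgb_test + weights[2] * cat_test

# Rank averaging: average the ranks of the predictions rather than the raw scores,
# which makes the fusion insensitive to differences in the models' score scales.
from scipy.stats import rankdata
pred_rank = (rankdata(lgb_test) + rankdata(xgb_test) + rankdata(cat_test)) / (3 * len(lgb_test))
```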
The remaining parts are not yet complete and still need to be organized.