
Tianchi Project Notes: Financial Risk Control, Loan Default Prediction, Task3

Popularity: 45   Published: 2024-02-20 05:23:53.0

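Task03: Feature Engineering

So far this covers only data preprocessing and initial insights (basic preprocessing); further ideas will be added later. A scorecard model could also be considered for analyzing this problem.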

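1. Handling time formats

1.1 Converting the earliesCreditLine feature to a date type

Inspecting the raw data shows that 'earliesCreditLine' is stored as strings, so its dates are kept in an unstructured form. This step restructures the field as a date so that it is easier to use in models and in later feature engineering, and stores the result as 'earliesCreditLine_date'.

For example, 'Aug-2001' means August 2001. The code below converts such values to dates in '%Y-%m-%d' form, with the day always set to the 1st of the month.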

# Convert earliesCreditLine to a date string
dic_month = {
    'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06',
    'Jul': '07', 'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

def get_month_year(s):
    # 'Aug-2001' -> '2001-08-01'
    month = dic_month[s[:3]]   # first three characters: month abbreviation
    year = s[-4:]              # last four characters: year
    return year + '-' + month + '-' + '01'

train_data['earliesCreditLine_date'] = train_data['earliesCreditLine'].apply(get_month_year)
test_data['earliesCreditLine_date'] = test_data['earliesCreditLine'].apply(get_month_year)
train_data = train_data.drop(columns='earliesCreditLine')
test_data = test_data.drop(columns='earliesCreditLine')
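
As a possible alternative to the manual month dictionary, the same conversion can be sketched with pandas' own date parser. This is only an illustration on a few toy values (the sample Series `raw` is hypothetical, not the competition data), and it assumes an English locale so that `%b` matches abbreviations such as 'Aug'.

```python
import pandas as pd

# Toy values in the same 'Mon-YYYY' layout as earliesCreditLine (hypothetical sample)
raw = pd.Series(['Aug-2001', 'May-2002', 'Dec-1977'])

# %b = abbreviated month name, %Y = four-digit year; the day defaults to the 1st
parsed = pd.to_datetime(raw, format='%b-%Y')

# Format back to the '%Y-%m-%d' strings that get_month_year returns
as_text = parsed.dt.strftime('%Y-%m-%d')
print(as_text.tolist())
```

Because `parsed` is already a datetime Series, the later `pd.to_datetime` call on 'earliesCreditLine_date' would become unnecessary if this route were taken.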
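1.2 Building time features

The data contains two time features: 'issueDate', the date the loan was issued, and 'earliesCreditLine', the month in which the borrower's earliest reported credit line was opened. A single point in time carries little meaning on its own, so we use time intervals instead: after converting all time data to datetime, we subtract a fixed date well in the past, producing the new features 'issueDateDT' and 'earliesCreditLine_dateDT'. These time features can be explored further later; for the baseline we stop here.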

import datetime
import pandas as pd
import matplotlib.pyplot as plt

# Convert issueDate to a numeric time difference (days since 2007-06-01)
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
train_data['issueDate'] = pd.to_datetime(train_data['issueDate'], format='%Y-%m-%d')
train_data['issueDateDT'] = (train_data['issueDate'] - startdate).dt.days
test_data['issueDate'] = pd.to_datetime(test_data['issueDate'], format='%Y-%m-%d')
test_data['issueDateDT'] = (test_data['issueDate'] - startdate).dt.days

plt.figure()
plt.hist(train_data['issueDateDT'], label='train')
plt.hist(test_data['issueDateDT'], label='test')
plt.legend()
plt.title('Distribution of issueDateDT dates')
plt.show()

# Convert earliesCreditLine_date to a numeric time difference (days since 1950-01-01)
startdate = datetime.datetime.strptime('1950-01-01', '%Y-%m-%d')
train_data['earliesCreditLine_date'] = pd.to_datetime(train_data['earliesCreditLine_date'], format='%Y-%m-%d')
train_data['earliesCreditLine_dateDT'] = (train_data['earliesCreditLine_date'] - startdate).dt.days
test_data['earliesCreditLine_date'] = pd.to_datetime(test_data['earliesCreditLine_date'], format='%Y-%m-%d')
test_data['earliesCreditLine_dateDT'] = (test_data['earliesCreditLine_date'] - startdate).dt.days

plt.figure()
plt.hist(train_data['earliesCreditLine_dateDT'], label='train')
plt.hist(test_data['earliesCreditLine_dateDT'], label='test')
plt.legend()
plt.title('Distribution of earliesCreditLine_dateDT dates')
plt.show()

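[Figure: histogram of issueDateDT, train vs. test]
[Figure: histogram of earliesCreditLine_dateDT, train vs. test]
The figures above show the distributions of the two new time features in the training and test sets; the two sets are broadly consistent.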
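
Besides comparing the histograms by eye, the train/test agreement could also be checked numerically. The sketch below is only one possible way to do it: it compares a few quantiles side by side, and it uses synthetic stand-in values (`train_days`, `test_days` are hypothetical) rather than the real 'issueDateDT' columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for train_data['issueDateDT'] and test_data['issueDateDT']
rng = np.random.default_rng(0)
train_days = pd.Series(rng.integers(0, 4000, size=1000))
test_days = pd.Series(rng.integers(0, 4000, size=400))

# Compare the same quantiles in both sets; small gaps suggest similar distributions
quantiles = [0.05, 0.25, 0.5, 0.75, 0.95]
summary = pd.DataFrame({
    'train': train_days.quantile(quantiles),
    'test': test_days.quantile(quantiles),
})
summary['abs_diff'] = (summary['train'] - summary['test']).abs()
print(summary)
```

With the real data, the two Series would simply be replaced by the corresponding columns of train_data and test_data.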

2. Feature classification

Building on the earlier work on data distributions, we now split the features more carefully into categorical and numerical features.

# employmentTitle can be used as a numerical feature, or binned and used as a categorical feature; the former is used for now
# issueDate: date-type information
# earliesCreditLine: can be converted to date-type information
# policyCode, n11: almost a single value each, so drop them
feature_columns = ['loanAmnt', 'term', 'interestRate', 'installment', 'grade', 'subGrade', 'employmentTitle',
                   'employmentLength', 'homeOwnership', 'annualIncome', 'verificationStatus', 'purpose', 'postCode',
                   'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
                   'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType',
                   'title', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n12', 'n13', 'n14',
                   'issueDateDT', 'earliesCreditLine_dateDT']
numerical_fea = ['loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'postCode',
                 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
                 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType',
                 'title', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n12', 'n13', 'n14',
                 'issueDateDT', 'earliesCreditLine_dateDT']
categorical_fea = ['grade', 'subGrade', 'employmentLength', 'homeOwnership', 'verificationStatus', 'purpose']
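
Since the three lists are maintained by hand, a small sanity check can help confirm that the numerical and categorical lists together cover every feature exactly once. The following is a minimal sketch; `check_partition` is a hypothetical helper written for illustration, and the lists are shortened toy versions of the ones above.

```python
# Shortened, hypothetical versions of the hand-written lists
feature_columns = ['loanAmnt', 'term', 'grade', 'subGrade', 'issueDateDT']
numerical_fea = ['loanAmnt', 'term', 'issueDateDT']
categorical_fea = ['grade', 'subGrade']

def check_partition(all_cols, numeric, categorical):
    """Return (missing, overlapping, unknown) sets of column names."""
    all_set, num_set, cat_set = set(all_cols), set(numeric), set(categorical)
    missing = all_set - num_set - cat_set        # features assigned to neither list
    overlapping = num_set & cat_set              # features assigned to both lists
    unknown = (num_set | cat_set) - all_set      # names that are not in feature_columns
    return missing, overlapping, unknown

missing, overlapping, unknown = check_partition(feature_columns, numerical_fea, categorical_fea)
print(missing, overlapping, unknown)
```

Three empty sets mean the two lists form a clean partition of feature_columns; any non-empty set points at a name to fix.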

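3. Categorical feature encoding (label encoding)

Because tree models will be tried first, we avoid making the data too sparse and start with label encoding for the categorical features. Which encoding works best, and which features should be treated as numerical, still needs to be tested. This step can be adjusted when switching to a different model.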

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# employmentLength contains missing values; cast to str so NaN becomes its own 'nan' category
train_data['employmentLength'] = train_data['employmentLength'].apply(lambda x: str(x))
test_data['employmentLength'] = test_data['employmentLength'].apply(lambda x: str(x))

for col in ['grade', 'subGrade', 'employmentLength']:
    le = LabelEncoder()
    # Fit on train and test together so both sets share one label-to-integer mapping
    le.fit(pd.concat([train_data[col], test_data[col]]))
    train_data[col] = le.transform(train_data[col])
    test_data[col] = le.transform(test_data[col])
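
To make the trade-off concrete, the sketch below contrasts label encoding with one-hot encoding on toy data, and shows why the integer mapping should be shared between train and test. It uses plain pandas rather than LabelEncoder, and the 'grade' values are hypothetical, not taken from the competition files.

```python
import pandas as pd

# Toy data (hypothetical values)
train = pd.DataFrame({'grade': ['A', 'C', 'B', 'E']})
test = pd.DataFrame({'grade': ['B', 'E', 'E']})

# Label encoding with one shared, sorted category list: a single integer column
categories = sorted(set(train['grade']) | set(test['grade']))
train_codes = pd.Categorical(train['grade'], categories=categories).codes
test_codes = pd.Categorical(test['grade'], categories=categories).codes

# Encoding the test set on its own gives different integers for the same labels
test_codes_separate, _ = pd.factorize(test['grade'], sort=True)

# One-hot encoding: one column per category, so the width grows with cardinality
train_onehot = pd.get_dummies(train['grade'])

print(list(train_codes), list(test_codes), list(test_codes_separate))
print(train_onehot.shape)
```

Here 'B' maps to 1 under the shared mapping but to 0 when the test set is encoded separately, which is the kind of mismatch a tree model cannot recover from.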

Finally, here is the processed data that is fed to the baseline model:

train_data.head()
id loanAmnt term interestRate installment grade subGrade employmentTitle employmentLength homeOwnership annualIncome verificationStatus issueDate isDefault purpose postCode regionCode dti delinquency_2years ficoRangeLow ficoRangeHigh openAcc pubRec pubRecBankruptcies revolBal revolUtil totalAcc initialListStatus applicationType title policyCode n0 n1 n2 n2.1 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 issueDateDT earliesCreditLine_date earliesCreditLine_dateDT
0 0 35000.0 5 19.52 917.97 4 21 320.0 2 2 110000.0 2 2014-07-01 1 1 137.0 32 17.05 0.0 730.0 734.0 7.0 0.0 0.0 24178.0 48.9 27.0 0 0 1.0 1.0 0.0 2.0 2.0 2.0 4.0 9.0 8.0 4.0 12.0 2.0 7.0 0.0 0.0 0.0 2.0 2587 2001-08-01 18840
1 1 18000.0 5 18.49 461.90 3 16 219843.0 5 0 46000.0 2 2012-08-01 0 0 156.0 18 27.83 0.0 700.0 704.0 13.0 0.0 0.0 15096.0 38.9 18.0 1 0 1723.0 1.0 NaN NaN NaN NaN 10.0 NaN NaN NaN NaN NaN 13.0 NaN NaN NaN NaN 1888 2002-05-01 19113
2 2 12000.0 5 16.99 298.17 3 17 31698.0 8 0 74000.0 2 2015-10-01 0 0 337.0 14 22.77 0.0 675.0 679.0 11.0 0.0 0.0 4606.0 51.8 27.0 0 0 0.0 1.0 0.0 0.0 3.0 3.0 0.0 0.0 21.0 4.0 5.0 3.0 11.0 0.0 0.0 0.0 4.0 3044 2006-05-01 20574
3 3 11000.0 3 7.26 340.96 0 3 46854.0 1 1 118000.0 1 2015-08-01 0 4 148.0 11 17.21 0.0 685.0 689.0 9.0 0.0 0.0 9948.0 52.6 28.0 1 0 4.0 1.0 6.0 4.0 6.0 6.0 4.0 16.0 4.0 7.0 21.0 6.0 9.0 0.0 0.0 0.0 1.0 2983 1999-05-01 18017
4 4 3000.0 3 12.99 101.07 2 11 54.0 11 1 29000.0 2 2016-03-01 0 10 301.0 21 32.16 0.0 690.0 694.0 12.0 0.0 0.0 2942.0 32.0 27.0 0 0 11.0 1.0 1.0 2.0 7.0 7.0 2.0 4.0 9.0 10.0 15.0 7.0 12.0 0.0 0.0 0.0 4.0 3196 1977-08-01 10074