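Task03: Feature Engineering
So far this covers only data preprocessing and some initial insights (basic preprocessing); further ideas will be added later. A scorecard model is one option worth considering for this problem.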
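1. Time format handling
1.1 Converting the earliesCreditLine feature to a date type
Inspecting the raw data shows that 'earliesCreditLine' is stored as a string, so the date is held in unstructured form. This step converts it into a structured date, which is easier for the model to use and for later feature engineering, and stores the result as 'earliesCreditLine_date'.
For example, 'Aug-2001' means August 2001. The code below converts each value to a date in '%Y-%m-%d' format, setting the day to the 1st of the month.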
# Convert earliesCreditLine to a date in '%Y-%m-%d' format
dic_month = {
    'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06',
    'Jul': '07', 'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}

def get_month_year(s):
    # 'Aug-2001' -> '2001-08-01': the first three characters are the month, the last four the year
    month = dic_month[s[:3]]
    year = s[-4:]
    return year + '-' + month + '-01'

train_data['earliesCreditLine_date'] = train_data['earliesCreditLine'].apply(get_month_year)
test_data['earliesCreditLine_date'] = test_data['earliesCreditLine'].apply(get_month_year)
train_data = train_data.drop(columns='earliesCreditLine')
test_data = test_data.drop(columns='earliesCreditLine')
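The manual month lookup above can also be expressed with the standard library's date parsing. The following is a minimal sketch rather than the method used in this write-up: the helper name `parse_month_year` is hypothetical, and it assumes every value follows the 'Mon-YYYY' pattern with English month abbreviations (the `%b` directive is locale-dependent).

```python
from datetime import datetime

def parse_month_year(value):
    """Parse a 'Mon-YYYY' string such as 'Aug-2001' into 'YYYY-MM-01'."""
    # strptime defaults the day to 1 when the format has no day directive
    parsed = datetime.strptime(value, '%b-%Y')
    return parsed.strftime('%Y-%m-%d')

print(parse_month_year('Aug-2001'))  # 2001-08-01
```

For a whole column, `pd.to_datetime(series, format='%b-%Y')` performs the same parse in one call and returns datetime values directly.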
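1.2 Building time features
The data has two time features: 'issueDate', the date the loan was issued, and 'earliesCreditLine', the month in which the borrower's earliest reported credit line was opened. An absolute point in time carries little meaning on its own, so we use time intervals instead: after converting all time data to datetime, we subtract a fixed date far enough in the past and build the new features 'issueDateDT' and 'earliesCreditLine_dateDT', measured in days. These dates can be explored further later; for the baseline we stop here.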
import datetime
import pandas as pd
import matplotlib.pyplot as plt

# Convert issueDate to a numeric day offset from a fixed start date
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
train_data['issueDate'] = pd.to_datetime(train_data['issueDate'], format='%Y-%m-%d')
train_data['issueDateDT'] = (train_data['issueDate'] - startdate).dt.days
test_data['issueDate'] = pd.to_datetime(test_data['issueDate'], format='%Y-%m-%d')
test_data['issueDateDT'] = (test_data['issueDate'] - startdate).dt.days

# Compare the train and test distributions of the new feature
plt.hist(train_data['issueDateDT'], label='train');
plt.hist(test_data['issueDateDT'], label='test');
plt.legend();
plt.title('Distribution of issueDateDT dates');
# Convert earliesCreditLine_date to a numeric day offset from a fixed start date
startdate = datetime.datetime.strptime('1950-01-01', '%Y-%m-%d')
train_data['earliesCreditLine_date'] = pd.to_datetime(train_data['earliesCreditLine_date'], format='%Y-%m-%d')
train_data['earliesCreditLine_dateDT'] = (train_data['earliesCreditLine_date'] - startdate).dt.days
test_data['earliesCreditLine_date'] = pd.to_datetime(test_data['earliesCreditLine_date'], format='%Y-%m-%d')
test_data['earliesCreditLine_dateDT'] = (test_data['earliesCreditLine_date'] - startdate).dt.days

# Start a new figure so this histogram is not drawn on top of the previous one
plt.figure();
plt.hist(train_data['earliesCreditLine_dateDT'], label='train');
plt.hist(test_data['earliesCreditLine_dateDT'], label='test');
plt.legend();
plt.title('Distribution of earliesCreditLine_dateDT dates');
The figures above plot the distributions of the two new time features in the training and test sets; they are broadly consistent.
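As a quick sanity check on the day-offset features, the arithmetic can be reproduced by hand with the standard library. This is only an illustrative sketch: the helper name `days_since` is hypothetical, and the two input dates are taken from the first row of the processed table shown at the end of this section.

```python
from datetime import date

def days_since(day, start):
    """Number of days from `start` to `day`."""
    return (day - start).days

# issueDate 2014-07-01 measured from the 2007-06-01 start date
print(days_since(date(2014, 7, 1), date(2007, 6, 1)))   # 2587
# earliesCreditLine_date 2001-08-01 measured from the 1950-01-01 start date
print(days_since(date(2001, 8, 1), date(1950, 1, 1)))   # 18840
```

Both values match the 'issueDateDT' and 'earliesCreditLine_dateDT' entries of that row.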
2. Feature classification
Building on the earlier work on data distributions, we now split the features more carefully into categorical and numerical features.
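# employmentTitle can be treated as a numerical feature, or binned and used as a categorical feature; for now we use the former
# issueDate is date-type information
# earliesCreditLine can be converted to date-type information
# policyCode and n11 take almost a single value, so drop them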
feature_columns = ['loanAmnt', 'term', 'interestRate', 'installment', 'grade','subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership','annualIncome', 'verificationStatus', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years','ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec','pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc','initialListStatus', 'applicationType', 'title','n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8','n9', 'n10', 'n12', 'n13', 'n14','issueDateDT','earliesCreditLine_dateDT']
numerical_fea = ['loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'postCode', 'regionCode', 'dti', 'delinquency_2years','ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec','pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc','initialListStatus', 'applicationType', 'title','n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8','n9', 'n10', 'n12', 'n13', 'n14', 'issueDateDT', 'earliesCreditLine_dateDT']
categorical_fea = ['grade','subGrade','employmentLength','homeOwnership','verificationStatus','purpose']
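When column lists like these are maintained by hand, it can help to confirm that the numerical and categorical lists do not overlap and together cover every selected feature. The sketch below is one possible check, not part of the original workflow: `check_partition` is a hypothetical helper, and the short lists are a toy subset of the columns above.

```python
def check_partition(all_columns, numerical, categorical):
    """Return (overlap, missing, extra) for a numerical/categorical split."""
    num, cat, all_cols = set(numerical), set(categorical), set(all_columns)
    overlap = num & cat                # columns listed in both groups
    missing = all_cols - (num | cat)   # selected columns assigned to neither group
    extra = (num | cat) - all_cols     # grouped columns that were never selected
    return overlap, missing, extra

# Toy subset of the real column lists, for illustration only
toy_features = ['loanAmnt', 'term', 'grade', 'subGrade', 'issueDateDT']
toy_numerical = ['loanAmnt', 'term', 'issueDateDT']
toy_categorical = ['grade', 'subGrade']

overlap, missing, extra = check_partition(toy_features, toy_numerical, toy_categorical)
print(overlap, missing, extra)  # set() set() set()
```

Three empty sets mean the split is a clean partition; any non-empty set names the columns that need attention.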
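3. Categorical feature encoding (label encoding)
Since we will start with tree models, we decided not to make the data too sparse for now, so the categorical features are first preprocessed with label encoding. Which encoding works best, and which columns should be treated as numerical, remains to be tested. This step can be adjusted when switching to a different model.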
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Fit each encoder on the train and test values together, so that a given category
# maps to the same integer in both sets (fitting them separately can assign different codes)
for col in ['grade', 'subGrade', 'employmentLength']:
    # Cast to str so that any missing values (NaN) become their own category
    train_data[col] = train_data[col].astype(str)
    test_data[col] = test_data[col].astype(str)
    le = LabelEncoder()
    le.fit(pd.concat([train_data[col], test_data[col]]))
    train_data[col] = le.transform(train_data[col])
    test_data[col] = le.transform(test_data[col])
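One point worth checking with label encoding is that the same category receives the same integer in both data sets. The sketch below uses a hypothetical pure-Python helper, `fit_codes`, as a stand-in for `LabelEncoder` (which likewise numbers the sorted unique values), with made-up toy values, to show how the codes can diverge when the training and test sets are fitted separately and one of them lacks a category.

```python
def fit_codes(values):
    """Map each distinct value to an integer, in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Toy data for illustration only
train_grades = ['A', 'B', 'C', 'B']
test_grades = ['B', 'C', 'C']   # 'A' does not occur here

# Fitted separately: 'B' gets a different code in each set
separate_train = fit_codes(train_grades)
separate_test = fit_codes(test_grades)
print(separate_train['B'], separate_test['B'])  # 1 0

# Fitted on both sets together: one shared mapping
shared = fit_codes(train_grades + test_grades)
print([shared[v] for v in train_grades])  # [0, 1, 2, 1]
print([shared[v] for v in test_grades])   # [1, 2, 2]
```

With a shared mapping, a model trained on the encoded training set sees the same integer for each category at prediction time.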
Finally, let's look at the processed data that is fed to the baseline model:
train_data.head()
| | id | loanAmnt | term | interestRate | installment | grade | subGrade | employmentTitle | employmentLength | homeOwnership | annualIncome | verificationStatus | issueDate | isDefault | purpose | postCode | regionCode | dti | delinquency_2years | ficoRangeLow | ficoRangeHigh | openAcc | pubRec | pubRecBankruptcies | revolBal | revolUtil | totalAcc | initialListStatus | applicationType | title | policyCode | n0 | n1 | n2 | n2.1 | n4 | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | issueDateDT | earliesCreditLine_date | earliesCreditLine_dateDT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 35000.0 | 5 | 19.52 | 917.97 | 4 | 21 | 320.0 | 2 | 2 | 110000.0 | 2 | 2014-07-01 | 1 | 1 | 137.0 | 32 | 17.05 | 0.0 | 730.0 | 734.0 | 7.0 | 0.0 | 0.0 | 24178.0 | 48.9 | 27.0 | 0 | 0 | 1.0 | 1.0 | 0.0 | 2.0 | 2.0 | 2.0 | 4.0 | 9.0 | 8.0 | 4.0 | 12.0 | 2.0 | 7.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2587 | 2001-08-01 | 18840 |
1 | 1 | 18000.0 | 5 | 18.49 | 461.90 | 3 | 16 | 219843.0 | 5 | 0 | 46000.0 | 2 | 2012-08-01 | 0 | 0 | 156.0 | 18 | 27.83 | 0.0 | 700.0 | 704.0 | 13.0 | 0.0 | 0.0 | 15096.0 | 38.9 | 18.0 | 1 | 0 | 1723.0 | 1.0 | NaN | NaN | NaN | NaN | 10.0 | NaN | NaN | NaN | NaN | NaN | 13.0 | NaN | NaN | NaN | NaN | 1888 | 2002-05-01 | 19113 |
2 | 2 | 12000.0 | 5 | 16.99 | 298.17 | 3 | 17 | 31698.0 | 8 | 0 | 74000.0 | 2 | 2015-10-01 | 0 | 0 | 337.0 | 14 | 22.77 | 0.0 | 675.0 | 679.0 | 11.0 | 0.0 | 0.0 | 4606.0 | 51.8 | 27.0 | 0 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 21.0 | 4.0 | 5.0 | 3.0 | 11.0 | 0.0 | 0.0 | 0.0 | 4.0 | 3044 | 2006-05-01 | 20574 |
3 | 3 | 11000.0 | 3 | 7.26 | 340.96 | 0 | 3 | 46854.0 | 1 | 1 | 118000.0 | 1 | 2015-08-01 | 0 | 4 | 148.0 | 11 | 17.21 | 0.0 | 685.0 | 689.0 | 9.0 | 0.0 | 0.0 | 9948.0 | 52.6 | 28.0 | 1 | 0 | 4.0 | 1.0 | 6.0 | 4.0 | 6.0 | 6.0 | 4.0 | 16.0 | 4.0 | 7.0 | 21.0 | 6.0 | 9.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2983 | 1999-05-01 | 18017 |
4 | 4 | 3000.0 | 3 | 12.99 | 101.07 | 2 | 11 | 54.0 | 11 | 1 | 29000.0 | 2 | 2016-03-01 | 0 | 10 | 301.0 | 21 | 32.16 | 0.0 | 690.0 | 694.0 | 12.0 | 0.0 | 0.0 | 2942.0 | 32.0 | 27.0 | 0 | 0 | 11.0 | 1.0 | 1.0 | 2.0 | 7.0 | 7.0 | 2.0 | 4.0 | 9.0 | 10.0 | 15.0 | 7.0 | 12.0 | 0.0 | 0.0 | 0.0 | 4.0 | 3196 | 1977-08-01 | 10074 |