当前位置: 代码迷 >> 综合 >> 天池项目笔记-金融风控-贷款违约预测 Task2
  详细解决方案

天池项目笔记-金融风控-贷款违约预测 Task2

热度:7   发布时间:2024-02-20 05:14:03.0

Task02_EDA

1.导入数据 load the data

train_data = pd.read_csv('./train.csv', sep = ',')
test_data = pd.read_csv('./testA.csv', sep = ',')

2.数据浏览 data overview

? 在开始使用数据前先对其进行大概的浏览,包括数据的行列、缺失值、重复值与统计信息等。

train_data.head()
id loanAmnt term interestRate installment grade subGrade employmentTitle employmentLength homeOwnership annualIncome verificationStatus issueDate isDefault purpose postCode regionCode dti delinquency_2years ficoRangeLow ficoRangeHigh openAcc pubRec pubRecBankruptcies revolBal revolUtil totalAcc initialListStatus applicationType earliesCreditLine title policyCode n0 n1 n2 n2.1 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14
0 0 35000.0 5 19.52 917.97 E E2 320.0 2 years 2 110000.0 2 2014-07-01 1 1 137.0 32 17.05 0.0 730.0 734.0 7.0 0.0 0.0 24178.0 48.9 27.0 0 0 Aug-2001 1.0 1.0 0.0 2.0 2.0 2.0 4.0 9.0 8.0 4.0 12.0 2.0 7.0 0.0 0.0 0.0 2.0
1 1 18000.0 5 18.49 461.90 D D2 219843.0 5 years 0 46000.0 2 2012-08-01 0 0 156.0 18 27.83 0.0 700.0 704.0 13.0 0.0 0.0 15096.0 38.9 18.0 1 0 May-2002 1723.0 1.0 NaN NaN NaN NaN 10.0 NaN NaN NaN NaN NaN 13.0 NaN NaN NaN NaN
2 2 12000.0 5 16.99 298.17 D D3 31698.0 8 years 0 74000.0 2 2015-10-01 0 0 337.0 14 22.77 0.0 675.0 679.0 11.0 0.0 0.0 4606.0 51.8 27.0 0 0 May-2006 0.0 1.0 0.0 0.0 3.0 3.0 0.0 0.0 21.0 4.0 5.0 3.0 11.0 0.0 0.0 0.0 4.0
3 3 11000.0 3 7.26 340.96 A A4 46854.0 10+ years 1 118000.0 1 2015-08-01 0 4 148.0 11 17.21 0.0 685.0 689.0 9.0 0.0 0.0 9948.0 52.6 28.0 1 0 May-1999 4.0 1.0 6.0 4.0 6.0 6.0 4.0 16.0 4.0 7.0 21.0 6.0 9.0 0.0 0.0 0.0 1.0
4 4 3000.0 3 12.99 101.07 C C2 54.0 NaN 1 29000.0 2 2016-03-01 0 10 301.0 21 32.16 0.0 690.0 694.0 12.0 0.0 0.0 2942.0 32.0 27.0 0 0 Aug-1977 11.0 1.0 1.0 2.0 7.0 7.0 2.0 4.0 9.0 10.0 15.0 7.0 12.0 0.0 0.0 0.0 4.0
train_data.shape
test_data.shape

(200000, 48)

train_data.columns

Index([‘id’, ‘loanAmnt’, ‘term’, ‘interestRate’, ‘installment’, ‘grade’,
‘subGrade’, ‘employmentTitle’, ‘employmentLength’, ‘homeOwnership’,
‘annualIncome’, ‘verificationStatus’, ‘issueDate’, ‘isDefault’,
‘purpose’, ‘postCode’, ‘regionCode’, ‘dti’, ‘delinquency_2years’,
‘ficoRangeLow’, ‘ficoRangeHigh’, ‘openAcc’, ‘pubRec’,
‘pubRecBankruptcies’, ‘revolBal’, ‘revolUtil’, ‘totalAcc’,
‘initialListStatus’, ‘applicationType’, ‘earliesCreditLine’, ‘title’,
‘policyCode’, ‘n0’, ‘n1’, ‘n2’, ‘n2.1’, ‘n4’, ‘n5’, ‘n6’, ‘n7’, ‘n8’,
‘n9’, ‘n10’, ‘n11’, ‘n12’, ‘n13’, ‘n14’],
dtype=‘object’)

train_data.info()

<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):

‘#’ Column Non-Null Count Dtype


0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n2.1 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

train_data['employmentTitle'].value_counts()

54.0 51149
38.0 12644
32.0 11543
184.0 6112
151.0 5193

311225.0 1
13438.0 1
311224.0 1
311223.0 1
269398.0 1
Name: employmentTitle, Length: 248683, dtype: int64

? 对该特征来说,职业头衔被脱敏使用某种表征方法通过计数来保存了,也许使用了NLP的技术,如果是这样的话,可以考虑使用数据分箱的方法将其转换为类别型数据。

4.数据分布 data distribution

#首先初步划分数值型和类别变量
feature_columns = ['loanAmnt', 'term', 'interestRate', 'installment', 'grade','subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership','annualIncome', 'verificationStatus', 'issueDate', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years','ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec','pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc','initialListStatus', 'applicationType', 'earliesCreditLine', 'title','policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8','n9', 'n10', 'n11', 'n12', 'n13', 'n14']      
numerical_fea = ['loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle',  'annualIncome',  'issueDate', 'postCode', 'regionCode', 'dti', 'delinquency_2years','ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec','pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc','initialListStatus', 'applicationType', 'earliesCreditLine', 'title','policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8','n9', 'n10', 'n11', 'n12', 'n13', 'n14']
categorical_fea = ['grade','subGrade','employmentLength','homeOwnership','verificationStatus','purpose']
sample = train_data.sample(1000)
# 查看Numerical features的分布情况,离散还是连续
for feas in num_fea:plt.figure(figsize = (6,4))plt.title(feas)plt.scatter(sample['id'], sample[feas], s = 8)

? 这里取1000个数据样本,观察初步划分的数值型变量中的数据分布,考虑是否进一步对特征种类进行划分,在此处展示部分特征数据的分布图。
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述

#过滤数值型类别特征
def get_numerical_serial_fea(data,feas):numerical_serial_fea = []numerical_noserial_fea = []for fea in feas:temp = data[fea].nunique()if temp <= 10:numerical_noserial_fea.append(fea)continuenumerical_serial_fea.append(fea)return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(train_data,num_fea)#每个数字特征得分布可视化
f = pd.melt(train_data, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

? 此处根据每个数值类型特征的分布,将类别数大于10类的特征保留为数值类型,类别数小于10的保留为类别特征。
在这里插入图片描述

# 变量分布可视化
plt.figure(figsize=(8, 8))
sns.barplot(train_data["employmentLength"].value_counts(dropna=False)[:20],train_data["employmentLength"].value_counts(dropna=False).keys()[:20])
plt.show()

在这里插入图片描述

# 正负样本比例
total = len(train_data)
total_amt = train_data.groupby(['isDefault'])['loanAmnt'].sum().sum()
plt.figure(figsize=(6,5))
plot_tr = sns.countplot(x='isDefault',data=train_data)#train data‘isDefault’这个特征每种类别的数量**
plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
plot_tr.set_xlabel("Is fraud by count", fontsize=16)
plot_tr.set_ylabel('Count', fontsize=16)
for p in plot_tr.patches:height = p.get_height()plot_tr.text(p.get_x()+p.get_width()/2.,height + 3,'{:1.2f}%'.format(height/total*100),ha="center", fontsize=15) 

在这里插入图片描述
? 由于金融欺诈数据中,实际欺诈行为即正样本的数量应该会很少,存在样本不平衡的情况,这是应该在后续工作中克服的。

5.缺失值 missing values

? 统计缺失值:

train_data.isnull().sum()

id 0
loanAmnt 0
term 0
interestRate 0
installment 0
grade 0
subGrade 0
employmentTitle 1
employmentLength 46799
homeOwnership 0
annualIncome 0
verificationStatus 0
issueDate 0
isDefault 0
purpose 0
postCode 1
regionCode 0
dti 239
delinquency_2years 0
ficoRangeLow 0
ficoRangeHigh 0
openAcc 0
pubRec 0
pubRecBankruptcies 405
revolBal 0
revolUtil 531
totalAcc 0
initialListStatus 0
applicationType 0
earliesCreditLine 0
title 1
policyCode 0
n0 40270
n1 40270
n2 40270
n2.1 40270
n4 33239
n5 40270
n6 40270
n7 40270
n8 40271
n9 40270
n10 33239
n11 69752
n12 40270
n13 40270
n14 40270
dtype: int64

? 对缺失值进行可视化:

sample = train_data.sample(1000)
msno.matrix(sample)
plt.show()
msno.bar(sample)
plt.show()
msno.heatmap(sample)
plt.show()

在这里插入图片描述
在这里插入图片描述
在这里插入图片描述

6.特征热力图

? 通过热力图可以观察特征之间的相关关系,确定哪些是冗余数据。

plt.figure(figsize = (30,10))
sns.heatmap(train_data.corr(), annot = True, cmap = 'coolwarm')

在这里插入图片描述

? 在完成对数据的探索性分析之后,下一步应该考虑如何预处理非数值类型的数据并构造新的特征工程。下一步会注意在日期类型数据、类别数据的encoding、数据分箱和特征聚合上的工作。根据新的特征确定新的特征列。