天池项目笔记-金融风控-贷款违约预测 Task2_综合

Task02_EDA

1.导入数据 load the data

train_data = pd.read_csv('./train.csv', sep = ',')
test_data = pd.read_csv('./testA.csv', sep = ',')

2.数据浏览 data overview

? 在开始使用数据前先对其进行大概的浏览，包括数据的行列、缺失值、重复值与统计信息等。

train_data.head()

	id	loanAmnt	term	interestRate	installment	grade	subGrade	employmentTitle	employmentLength	homeOwnership	annualIncome	verificationStatus	issueDate	isDefault	purpose	postCode	regionCode	dti	ficoRangeLow	ficoRangeHigh	openAcc	revolBal	revolUtil	totalAcc	initialListStatus	earliesCreditLine	title	policyCode	n0	n1	n2	n2.1	n4	n5	n6	n7	n8	n9	n10	n11	n12	n13	n14
0	0	35000.0	5	19.52	917.97	E	E2	320.0	2 years	2	110000.0	2	2014-07-01	1	1	137.0	32	17.05	730.0	734.0	7.0	24178.0	48.9	27.0	0	Aug-2001	1.0	1.0	0.0	2.0	2.0	2.0	4.0	9.0	8.0	4.0	12.0	2.0	7.0	0.0	0.0	0.0	2.0
1	1	18000.0	5	18.49	461.90	D	D2	219843.0	5 years	0	46000.0	2	2012-08-01	0	0	156.0	18	27.83	700.0	704.0	13.0	15096.0	38.9	18.0	1	May-2002	1723.0	1.0	NaN	NaN	NaN	NaN	10.0	NaN	NaN	NaN	NaN	NaN	13.0	NaN	NaN	NaN	NaN
2	2	12000.0	5	16.99	298.17	D	D3	31698.0	8 years	0	74000.0	2	2015-10-01	0	0	337.0	14	22.77	675.0	679.0	11.0	4606.0	51.8	27.0	0	May-2006	0.0	1.0	0.0	0.0	3.0	3.0	0.0	0.0	21.0	4.0	5.0	3.0	11.0	0.0	0.0	0.0	4.0
3	3	11000.0	3	7.26	340.96	A	A4	46854.0	10+ years	1	118000.0	1	2015-08-01	0	4	148.0	11	17.21	685.0	689.0	9.0	9948.0	52.6	28.0	1	May-1999	4.0	1.0	6.0	4.0	6.0	6.0	4.0	16.0	4.0	7.0	21.0	6.0	9.0	0.0	0.0	0.0	1.0
4	4	3000.0	3	12.99	101.07	C	C2	54.0	NaN	1	29000.0	2	2016-03-01	0	10	301.0	21	32.16	690.0	694.0	12.0	2942.0	32.0	27.0	0	Aug-1977	11.0	1.0	1.0	2.0	7.0	7.0	2.0	4.0	9.0	10.0	15.0	7.0	12.0	0.0	0.0	0.0	4.0

train_data.shape
test_data.shape

(200000, 48)

train_data.columns

Index([‘id’, ‘loanAmnt’, ‘term’, ‘interestRate’, ‘installment’, ‘grade’,
‘subGrade’, ‘employmentTitle’, ‘employmentLength’, ‘homeOwnership’,
‘annualIncome’, ‘verificationStatus’, ‘issueDate’, ‘isDefault’,
‘purpose’, ‘postCode’, ‘regionCode’, ‘dti’, ‘delinquency_2years’,
‘ficoRangeLow’, ‘ficoRangeHigh’, ‘openAcc’, ‘pubRec’,
‘pubRecBankruptcies’, ‘revolBal’, ‘revolUtil’, ‘totalAcc’,
‘initialListStatus’, ‘applicationType’, ‘earliesCreditLine’, ‘title’,
‘policyCode’, ‘n0’, ‘n1’, ‘n2’, ‘n2.1’, ‘n4’, ‘n5’, ‘n6’, ‘n7’, ‘n8’,
‘n9’, ‘n10’, ‘n11’, ‘n12’, ‘n13’, ‘n14’],
dtype=‘object’)

train_data.info()

<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):

‘#’ Column Non-Null Count Dtype

0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n2.1 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

train_data['employmentTitle'].value_counts()

54.0 51149
38.0 12644
32.0 11543
184.0 6112
151.0 5193
…
311225.0 1
13438.0 1
311224.0 1
311223.0 1
269398.0 1
Name: employmentTitle, Length: 248683, dtype: int64

? 对该特征来说，职业头衔被脱敏使用某种表征方法通过计数来保存了，也许使用了NLP的技术，如果是这样的话，可以考虑使用数据分箱的方法将其转换为类别型数据。

4.数据分布 data distribution

#首先初步划分数值型和类别变量
feature_columns = ['loanAmnt', 'term', 'interestRate', 'installment', 'grade','subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership','annualIncome', 'verificationStatus', 'issueDate', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years','ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec','pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc','initialListStatus', 'applicationType', 'earliesCreditLine', 'title','policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8','n9', 'n10', 'n11', 'n12', 'n13', 'n14']      
numerical_fea = ['loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle',  'annualIncome',  'issueDate', 'postCode', 'regionCode', 'dti', 'delinquency_2years','ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec','pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc','initialListStatus', 'applicationType', 'earliesCreditLine', 'title','policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8','n9', 'n10', 'n11', 'n12', 'n13', 'n14']
categorical_fea = ['grade','subGrade','employmentLength','homeOwnership','verificationStatus','purpose']

sample = train_data.sample(1000)
# 查看Numerical features的分布情况，离散还是连续
for feas in num_fea:plt.figure(figsize = (6,4))plt.title(feas)plt.scatter(sample['id'], sample[feas], s = 8)

? 这里取1000个数据样本，观察初步划分的数值型变量中的数据分布，考虑是否进一步对特征种类进行划分，在此处展示部分特征数据的分布图。
在这里插入图片描述

#过滤数值型类别特征
def get_numerical_serial_fea(data,feas):numerical_serial_fea = []numerical_noserial_fea = []for fea in feas:temp = data[fea].nunique()if temp <= 10:numerical_noserial_fea.append(fea)continuenumerical_serial_fea.append(fea)return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(train_data,num_fea)#每个数字特征得分布可视化
f = pd.melt(train_data, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

? 此处根据每个数值类型特征的分布，将类别数大于10类的特征保留为数值类型，类别数小于10的保留为类别特征。
在这里插入图片描述

# 变量分布可视化
plt.figure(figsize=(8, 8))
sns.barplot(train_data["employmentLength"].value_counts(dropna=False)[:20],train_data["employmentLength"].value_counts(dropna=False).keys()[:20])
plt.show()

在这里插入图片描述

# 正负样本比例
total = len(train_data)
total_amt = train_data.groupby(['isDefault'])['loanAmnt'].sum().sum()
plt.figure(figsize=(6,5))
plot_tr = sns.countplot(x='isDefault',data=train_data)#train data‘isDefault’这个特征每种类别的数量**
plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
plot_tr.set_xlabel("Is fraud by count", fontsize=16)
plot_tr.set_ylabel('Count', fontsize=16)
for p in plot_tr.patches:height = p.get_height()plot_tr.text(p.get_x()+p.get_width()/2.,height + 3,'{:1.2f}%'.format(height/total*100),ha="center", fontsize=15)

在这里插入图片描述
? 由于金融欺诈数据中，实际欺诈行为即正样本的数量应该会很少，存在样本不平衡的情况，这是应该在后续工作中克服的。

5.缺失值 missing values

? 统计缺失值：

train_data.isnull().sum()

id 0
loanAmnt 0
term 0
interestRate 0
installment 0
grade 0
subGrade 0
employmentTitle 1
employmentLength 46799
homeOwnership 0
annualIncome 0
verificationStatus 0
issueDate 0
isDefault 0
purpose 0
postCode 1
regionCode 0
dti 239
delinquency_2years 0
ficoRangeLow 0
ficoRangeHigh 0
openAcc 0
pubRec 0
pubRecBankruptcies 405
revolBal 0
revolUtil 531
totalAcc 0
initialListStatus 0
applicationType 0
earliesCreditLine 0
title 1
policyCode 0
n0 40270
n1 40270
n2 40270
n2.1 40270
n4 33239
n5 40270
n6 40270
n7 40270
n8 40271
n9 40270
n10 33239
n11 69752
n12 40270
n13 40270
n14 40270
dtype: int64

? 对缺失值进行可视化：

sample = train_data.sample(1000)
msno.matrix(sample)
plt.show()
msno.bar(sample)
plt.show()
msno.heatmap(sample)
plt.show()

在这里插入图片描述

6.特征热力图

? 通过热力图可以观察特征之间的相关关系，确定哪些是冗余数据。

plt.figure(figsize = (30,10))
sns.heatmap(train_data.corr(), annot = True, cmap = 'coolwarm')

在这里插入图片描述

? 在完成对数据的探索性分析之后，下一步应该考虑如何预处理非数值类型的数据并构造新的特征工程。下一步会注意在日期类型数据、类别数据的encoding、数据分箱和特征聚合上的工作。根据新的特征确定新的特征列。