Task02_EDA
1.导入数据 load the data
train_data = pd.read_csv('./train.csv', sep = ',')
test_data = pd.read_csv('./testA.csv', sep = ',')
2.数据浏览 data overview
? 在开始使用数据前先对其进行大概的浏览,包括数据的行列、缺失值、重复值与统计信息等。
train_data.head()
id | loanAmnt | term | interestRate | installment | grade | subGrade | employmentTitle | employmentLength | homeOwnership | annualIncome | verificationStatus | issueDate | isDefault | purpose | postCode | regionCode | dti | delinquency_2years | ficoRangeLow | ficoRangeHigh | openAcc | pubRec | pubRecBankruptcies | revolBal | revolUtil | totalAcc | initialListStatus | applicationType | earliesCreditLine | title | policyCode | n0 | n1 | n2 | n2.1 | n4 | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 35000.0 | 5 | 19.52 | 917.97 | E | E2 | 320.0 | 2 years | 2 | 110000.0 | 2 | 2014-07-01 | 1 | 1 | 137.0 | 32 | 17.05 | 0.0 | 730.0 | 734.0 | 7.0 | 0.0 | 0.0 | 24178.0 | 48.9 | 27.0 | 0 | 0 | Aug-2001 | 1.0 | 1.0 | 0.0 | 2.0 | 2.0 | 2.0 | 4.0 | 9.0 | 8.0 | 4.0 | 12.0 | 2.0 | 7.0 | 0.0 | 0.0 | 0.0 | 2.0 |
1 | 1 | 18000.0 | 5 | 18.49 | 461.90 | D | D2 | 219843.0 | 5 years | 0 | 46000.0 | 2 | 2012-08-01 | 0 | 0 | 156.0 | 18 | 27.83 | 0.0 | 700.0 | 704.0 | 13.0 | 0.0 | 0.0 | 15096.0 | 38.9 | 18.0 | 1 | 0 | May-2002 | 1723.0 | 1.0 | NaN | NaN | NaN | NaN | 10.0 | NaN | NaN | NaN | NaN | NaN | 13.0 | NaN | NaN | NaN | NaN |
2 | 2 | 12000.0 | 5 | 16.99 | 298.17 | D | D3 | 31698.0 | 8 years | 0 | 74000.0 | 2 | 2015-10-01 | 0 | 0 | 337.0 | 14 | 22.77 | 0.0 | 675.0 | 679.0 | 11.0 | 0.0 | 0.0 | 4606.0 | 51.8 | 27.0 | 0 | 0 | May-2006 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 21.0 | 4.0 | 5.0 | 3.0 | 11.0 | 0.0 | 0.0 | 0.0 | 4.0 |
3 | 3 | 11000.0 | 3 | 7.26 | 340.96 | A | A4 | 46854.0 | 10+ years | 1 | 118000.0 | 1 | 2015-08-01 | 0 | 4 | 148.0 | 11 | 17.21 | 0.0 | 685.0 | 689.0 | 9.0 | 0.0 | 0.0 | 9948.0 | 52.6 | 28.0 | 1 | 0 | May-1999 | 4.0 | 1.0 | 6.0 | 4.0 | 6.0 | 6.0 | 4.0 | 16.0 | 4.0 | 7.0 | 21.0 | 6.0 | 9.0 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | 4 | 3000.0 | 3 | 12.99 | 101.07 | C | C2 | 54.0 | NaN | 1 | 29000.0 | 2 | 2016-03-01 | 0 | 10 | 301.0 | 21 | 32.16 | 0.0 | 690.0 | 694.0 | 12.0 | 0.0 | 0.0 | 2942.0 | 32.0 | 27.0 | 0 | 0 | Aug-1977 | 11.0 | 1.0 | 1.0 | 2.0 | 7.0 | 7.0 | 2.0 | 4.0 | 9.0 | 10.0 | 15.0 | 7.0 | 12.0 | 0.0 | 0.0 | 0.0 | 4.0 |
train_data.shape
test_data.shape
(200000, 48)
train_data.columns
Index([‘id’, ‘loanAmnt’, ‘term’, ‘interestRate’, ‘installment’, ‘grade’,
‘subGrade’, ‘employmentTitle’, ‘employmentLength’, ‘homeOwnership’,
‘annualIncome’, ‘verificationStatus’, ‘issueDate’, ‘isDefault’,
‘purpose’, ‘postCode’, ‘regionCode’, ‘dti’, ‘delinquency_2years’,
‘ficoRangeLow’, ‘ficoRangeHigh’, ‘openAcc’, ‘pubRec’,
‘pubRecBankruptcies’, ‘revolBal’, ‘revolUtil’, ‘totalAcc’,
‘initialListStatus’, ‘applicationType’, ‘earliesCreditLine’, ‘title’,
‘policyCode’, ‘n0’, ‘n1’, ‘n2’, ‘n2.1’, ‘n4’, ‘n5’, ‘n6’, ‘n7’, ‘n8’,
‘n9’, ‘n10’, ‘n11’, ‘n12’, ‘n13’, ‘n14’],
dtype=‘object’)
train_data.info()
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):‘#’ Column Non-Null Count Dtype
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n2.1 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
train_data['employmentTitle'].value_counts()
54.0 51149
38.0 12644
32.0 11543
184.0 6112
151.0 5193
…
311225.0 1
13438.0 1
311224.0 1
311223.0 1
269398.0 1
Name: employmentTitle, Length: 248683, dtype: int64
? 对该特征来说,职业头衔被脱敏使用某种表征方法通过计数来保存了,也许使用了NLP的技术,如果是这样的话,可以考虑使用数据分箱的方法将其转换为类别型数据。
4.数据分布 data distribution
#首先初步划分数值型和类别变量
feature_columns = ['loanAmnt', 'term', 'interestRate', 'installment', 'grade','subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership','annualIncome', 'verificationStatus', 'issueDate', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years','ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec','pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc','initialListStatus', 'applicationType', 'earliesCreditLine', 'title','policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8','n9', 'n10', 'n11', 'n12', 'n13', 'n14']
numerical_fea = ['loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'issueDate', 'postCode', 'regionCode', 'dti', 'delinquency_2years','ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec','pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc','initialListStatus', 'applicationType', 'earliesCreditLine', 'title','policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8','n9', 'n10', 'n11', 'n12', 'n13', 'n14']
categorical_fea = ['grade','subGrade','employmentLength','homeOwnership','verificationStatus','purpose']
sample = train_data.sample(1000)
# 查看Numerical features的分布情况,离散还是连续
for feas in num_fea:plt.figure(figsize = (6,4))plt.title(feas)plt.scatter(sample['id'], sample[feas], s = 8)
? 这里取1000个数据样本,观察初步划分的数值型变量中的数据分布,考虑是否进一步对特征种类进行划分,在此处展示部分特征数据的分布图。
#过滤数值型类别特征
def get_numerical_serial_fea(data,feas):numerical_serial_fea = []numerical_noserial_fea = []for fea in feas:temp = data[fea].nunique()if temp <= 10:numerical_noserial_fea.append(fea)continuenumerical_serial_fea.append(fea)return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(train_data,num_fea)#每个数字特征得分布可视化
f = pd.melt(train_data, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
? 此处根据每个数值类型特征的分布,将类别数大于10类的特征保留为数值类型,类别数小于10的保留为类别特征。
# 变量分布可视化
plt.figure(figsize=(8, 8))
sns.barplot(train_data["employmentLength"].value_counts(dropna=False)[:20],train_data["employmentLength"].value_counts(dropna=False).keys()[:20])
plt.show()
# 正负样本比例
total = len(train_data)
total_amt = train_data.groupby(['isDefault'])['loanAmnt'].sum().sum()
plt.figure(figsize=(6,5))
plot_tr = sns.countplot(x='isDefault',data=train_data)#train data‘isDefault’这个特征每种类别的数量**
plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
plot_tr.set_xlabel("Is fraud by count", fontsize=16)
plot_tr.set_ylabel('Count', fontsize=16)
for p in plot_tr.patches:height = p.get_height()plot_tr.text(p.get_x()+p.get_width()/2.,height + 3,'{:1.2f}%'.format(height/total*100),ha="center", fontsize=15)
? 由于金融欺诈数据中,实际欺诈行为即正样本的数量应该会很少,存在样本不平衡的情况,这是应该在后续工作中克服的。
5.缺失值 missing values
? 统计缺失值:
train_data.isnull().sum()
id 0
loanAmnt 0
term 0
interestRate 0
installment 0
grade 0
subGrade 0
employmentTitle 1
employmentLength 46799
homeOwnership 0
annualIncome 0
verificationStatus 0
issueDate 0
isDefault 0
purpose 0
postCode 1
regionCode 0
dti 239
delinquency_2years 0
ficoRangeLow 0
ficoRangeHigh 0
openAcc 0
pubRec 0
pubRecBankruptcies 405
revolBal 0
revolUtil 531
totalAcc 0
initialListStatus 0
applicationType 0
earliesCreditLine 0
title 1
policyCode 0
n0 40270
n1 40270
n2 40270
n2.1 40270
n4 33239
n5 40270
n6 40270
n7 40270
n8 40271
n9 40270
n10 33239
n11 69752
n12 40270
n13 40270
n14 40270
dtype: int64
? 对缺失值进行可视化:
sample = train_data.sample(1000)
msno.matrix(sample)
plt.show()
msno.bar(sample)
plt.show()
msno.heatmap(sample)
plt.show()
6.特征热力图
? 通过热力图可以观察特征之间的相关关系,确定哪些是冗余数据。
plt.figure(figsize = (30,10))
sns.heatmap(train_data.corr(), annot = True, cmap = 'coolwarm')
? 在完成对数据的探索性分析之后,下一步应该考虑如何预处理非数值类型的数据并构造新的特征工程。下一步会注意在日期类型数据、类别数据的encoding、数据分箱和特征聚合上的工作。根据新的特征确定新的特征列。