金融风控学习赛
https://tianchi.aliyun.com/competition/entrance/531830/information
一、赛题数据
赛题以预测用户贷款是否违约为任务,数据集报名后可见并可下载,该数据来自某信贷平台的贷款记录,总数据量超过120w,包含47列变量信息,其中15列为匿名变量。为了保证比赛的公平性,将会从中抽取80万条作为训练集,20万条作为测试集A,20万条作为测试集B,同时会对employmentTitle、purpose、postCode和title等信息进行脱敏。
导入数据分析相关库
# 导入标准库
import io, os, sys, types, time, datetime, math, random, requests, subprocess,io, tempfile, math# 导入第三方库
# 数据处理
import numpy as np
import pandas as pd# 数据可视化
import matplotlib.pyplot as plt
import missingno
import seaborn as sns
# from pandas.tools.plotting import scatter_matrix # No module named 'pandas.tools'
from mpl_toolkits.mplot3d import Axes3D
# plt.style.use('seaborn') # 改变图像风格
plt.rcParams['font.family'] = ['Arial Unicode MS', 'Microsoft Yahei', 'SimHei', 'sans-serif'] # 解决中文乱码
plt.rcParams['axes.unicode_minus'] = False # simhei黑体字 负号乱码 解决# 特征选择和编码
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVR
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize # Imputer
# from fancyimpute import BiScaler, KNN, NuclearNormMinimization, SoftImpute# 机器学习
import sklearn.ensemble as ske
from sklearn import datasets, model_selection, tree, preprocessing, metrics, linear_model
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, SGDClassifier
from sklearn.tree import DecisionTreeClassifier# 网格搜索、随机搜索
import scipy.stats as st
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split# 模型度量(分类)
from sklearn.metrics import precision_recall_fscore_support, roc_curve, auc# 警告处理
import warnings
warnings.filterwarnings('ignore')# 在Jupyter上画图
%matplotlib inline# 数据预处理
import numpy as np
import scipy as sc
import scipy.fftpack
import sklearn as sk
import matplotlib.pyplot as plt
import ewtpy
import scipy.fftpack# 绘图工具包
import seaborn as sns
import pyecharts.options as opts
from pyecharts.charts import Line, Grid
数据集导入
- train
- test
# 数据集路径train_path = 'train.csv'
test_path = 'testA.csv'
dataset_path = './'
data_train_path = dataset_path + train_path
data_test_path = dataset_path + test_path# 2.数据集csv读入
train = pd.read_csv(data_train_path)
test = pd.read_csv(data_test_path)
二、数据分析
- 数据总体了解:
- 读取数据集并了解数据集大小,原始特征维度;
- 通过info熟悉数据类型;
- 粗略查看数据集中各特征基本统计量;
- 缺失值和唯一值:
- 查看数据缺失值情况
- 查看唯一值特征情况
- 深入数据-查看数据类型
- 类别型数据
- 数值型数据
- 离散数值型数据
- 连续数值型数据
- 数据间相关关系
- 特征和特征之间关系
- 特征和目标变量之间关系
- 用pandas_profiling生成数据报告
数据集大小和类型
- train: (800000, 47)
- test : (200000, 46)
我们可以看到数据量还是比较大的,对内层空间有一定要求
print(train.shape,test.shape)
print('--'*30)
train.info()
print('--'*30)
test.info()
(800000, 47) (200000, 46)
------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
id 800000 non-null int64
loanAmnt 800000 non-null float64
term 800000 non-null int64
interestRate 800000 non-null float64
installment 800000 non-null float64
grade 800000 non-null object
subGrade 800000 non-null object
employmentTitle 799999 non-null float64
employmentLength 753201 non-null object
homeOwnership 800000 non-null int64
annualIncome 800000 non-null float64
verificationStatus 800000 non-null int64
issueDate 800000 non-null object
isDefault 800000 non-null int64
purpose 800000 non-null int64
postCode 799999 non-null float64
regionCode 800000 non-null int64
dti 799761 non-null float64
delinquency_2years 800000 non-null float64
ficoRangeLow 800000 non-null float64
ficoRangeHigh 800000 non-null float64
openAcc 800000 non-null float64
pubRec 800000 non-null float64
pubRecBankruptcies 799595 non-null float64
revolBal 800000 non-null float64
revolUtil 799469 non-null float64
totalAcc 800000 non-null float64
initialListStatus 800000 non-null int64
applicationType 800000 non-null int64
earliesCreditLine 800000 non-null object
title 799999 non-null float64
policyCode 800000 non-null float64
n0 759730 non-null float64
n1 759730 non-null float64
n2 759730 non-null float64
n3 759730 non-null float64
n4 766761 non-null float64
n5 759730 non-null float64
n6 759730 non-null float64
n7 759730 non-null float64
n8 759729 non-null float64
n9 759730 non-null float64
n10 766761 non-null float64
n11 730248 non-null float64
n12 759730 non-null float64
n13 759730 non-null float64
n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 46 columns):
id 200000 non-null int64
loanAmnt 200000 non-null float64
term 200000 non-null int64
interestRate 200000 non-null float64
installment 200000 non-null float64
grade 200000 non-null object
subGrade 200000 non-null object
employmentTitle 200000 non-null float64
employmentLength 188258 non-null object
homeOwnership 200000 non-null int64
annualIncome 200000 non-null float64
verificationStatus 200000 non-null int64
issueDate 200000 non-null object
purpose 200000 non-null int64
postCode 200000 non-null float64
regionCode 200000 non-null int64
dti 199939 non-null float64
delinquency_2years 200000 non-null float64
ficoRangeLow 200000 non-null float64
ficoRangeHigh 200000 non-null float64
openAcc 200000 non-null float64
pubRec 200000 non-null float64
pubRecBankruptcies 199884 non-null float64
revolBal 200000 non-null float64
revolUtil 199873 non-null float64
totalAcc 200000 non-null float64
initialListStatus 200000 non-null int64
applicationType 200000 non-null int64
earliesCreditLine 200000 non-null object
title 200000 non-null float64
policyCode 200000 non-null float64
n0 189889 non-null float64
n1 189889 non-null float64
n2 189889 non-null float64
n3 189889 non-null float64
n4 191606 non-null float64
n5 189889 non-null float64
n6 189889 non-null float64
n7 189889 non-null float64
n8 189889 non-null float64
n9 189889 non-null float64
n10 191606 non-null float64
n11 182425 non-null float64
n12 189889 non-null float64
n13 189889 non-null float64
n14 189889 non-null float64
dtypes: float64(33), int64(8), object(5)
memory usage: 70.2+ MB
# 字段特征
train.describe()
id | loanAmnt | term | interestRate | installment | employmentTitle | homeOwnership | annualIncome | verificationStatus | isDefault | purpose | postCode | regionCode | dti | delinquency_2years | ficoRangeLow | ficoRangeHigh | openAcc | pubRec | pubRecBankruptcies | revolBal | revolUtil | totalAcc | initialListStatus | applicationType | title | policyCode | n0 | n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.000000 | 8.000000e+05 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.000000 | 799761.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799595.000000 | 8.000000e+05 | 799469.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.0 | 759730.000000 | 759730.000000 | 759730.000000 | 759730.000000 | 766761.000000 | 759730.000000 | 759730.000000 | 759730.000000 | 759729.000000 | 759730.000000 | 766761.000000 | 730248.000000 | 759730.000000 | 759730.000000 | 759730.000000 |
mean | 399999.500000 | 14416.818875 | 3.482745 | 13.238391 | 437.947723 | 72005.351714 | 0.614213 | 7.613391e+04 | 1.009683 | 0.199513 | 1.745982 | 258.535648 | 16.385758 | 18.284557 | 0.318239 | 696.204081 | 700.204226 | 11.598020 | 0.214915 | 0.134163 | 1.622871e+04 | 51.790734 | 24.998861 | 0.416953 | 0.019267 | 1754.113589 | 1.0 | 0.511932 | 3.642330 | 5.642648 | 5.642648 | 4.735641 | 8.107937 | 8.575994 | 8.282953 | 14.622488 | 5.592345 | 11.643896 | 0.000815 | 0.003384 | 0.089366 | 2.178606 |
std | 230940.252013 | 8716.086178 | 0.855832 | 4.765757 | 261.460393 | 106585.640204 | 0.675749 | 6.894751e+04 | 0.782716 | 0.399634 | 2.367453 | 200.037446 | 11.036679 | 11.150155 | 0.880325 | 31.865995 | 31.866674 | 5.475286 | 0.606467 | 0.377471 | 2.245802e+04 | 24.516126 | 11.999201 | 0.493055 | 0.137464 | 7941.474040 | 0.0 | 1.333266 | 2.246825 | 3.302810 | 3.302810 | 2.949969 | 4.799210 | 7.400536 | 4.561689 | 8.124610 | 3.216184 | 5.484104 | 0.030075 | 0.062041 | 0.509069 | 1.844377 |
min | 0.000000 | 500.000000 | 3.000000 | 5.310000 | 15.690000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 0.000000 | 630.000000 | 634.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 199999.750000 | 8000.000000 | 3.000000 | 9.750000 | 248.450000 | 427.000000 | 0.000000 | 4.560000e+04 | 0.000000 | 0.000000 | 0.000000 | 103.000000 | 8.000000 | 11.790000 | 0.000000 | 670.000000 | 674.000000 | 8.000000 | 0.000000 | 0.000000 | 5.944000e+03 | 33.400000 | 16.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 2.000000 | 3.000000 | 3.000000 | 3.000000 | 5.000000 | 4.000000 | 5.000000 | 9.000000 | 3.000000 | 8.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
50% | 399999.500000 | 12000.000000 | 3.000000 | 12.740000 | 375.135000 | 7755.000000 | 1.000000 | 6.500000e+04 | 1.000000 | 0.000000 | 0.000000 | 203.000000 | 14.000000 | 17.610000 | 0.000000 | 690.000000 | 694.000000 | 11.000000 | 0.000000 | 0.000000 | 1.113200e+04 | 52.100000 | 23.000000 | 0.000000 | 0.000000 | 1.000000 | 1.0 | 0.000000 | 3.000000 | 5.000000 | 5.000000 | 4.000000 | 7.000000 | 7.000000 | 7.000000 | 13.000000 | 5.000000 | 11.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
75% | 599999.250000 | 20000.000000 | 3.000000 | 15.990000 | 580.710000 | 117663.500000 | 1.000000 | 9.000000e+04 | 2.000000 | 0.000000 | 4.000000 | 395.000000 | 22.000000 | 24.060000 | 0.000000 | 710.000000 | 714.000000 | 14.000000 | 0.000000 | 0.000000 | 1.973400e+04 | 70.700000 | 32.000000 | 1.000000 | 0.000000 | 5.000000 | 1.0 | 0.000000 | 5.000000 | 7.000000 | 7.000000 | 6.000000 | 11.000000 | 11.000000 | 10.000000 | 19.000000 | 7.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
max | 799999.000000 | 40000.000000 | 5.000000 | 30.990000 | 1715.420000 | 378351.000000 | 5.000000 | 1.099920e+07 | 2.000000 | 1.000000 | 13.000000 | 940.000000 | 50.000000 | 999.000000 | 39.000000 | 845.000000 | 850.000000 | 86.000000 | 86.000000 | 12.000000 | 2.904836e+06 | 892.300000 | 162.000000 | 1.000000 | 1.000000 | 61680.000000 | 1.0 | 51.000000 | 33.000000 | 63.000000 | 63.000000 | 49.000000 | 70.000000 | 132.000000 | 79.000000 | 128.000000 | 45.000000 | 82.000000 | 4.000000 | 4.000000 | 39.000000 | 30.000000 |
test.describe()
id | loanAmnt | term | interestRate | installment | employmentTitle | homeOwnership | annualIncome | verificationStatus | purpose | postCode | regionCode | dti | delinquency_2years | ficoRangeLow | ficoRangeHigh | openAcc | pubRec | pubRecBankruptcies | revolBal | revolUtil | totalAcc | initialListStatus | applicationType | title | policyCode | n0 | n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 2.000000e+05 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 199939.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.00000 | 199884.000000 | 2.000000e+05 | 199873.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.000000 | 200000.0 | 189889.000000 | 189889.000000 | 189889.000000 | 189889.000000 | 191606.000000 | 189889.000000 | 189889.000000 | 189889.000000 | 189889.000000 | 189889.000000 | 191606.000000 | 182425.000000 | 189889.000000 | 189889.000000 | 189889.000000 |
mean | 899999.500000 | 14436.954125 | 3.481690 | 13.244800 | 438.737804 | 72435.750740 | 0.614100 | 7.645184e+04 | 1.010430 | 1.744410 | 258.259535 | 16.430830 | 18.291282 | 0.315895 | 696.166400 | 700.166530 | 11.580340 | 0.21553 | 0.135293 | 1.625583e+04 | 51.875121 | 24.921810 | 0.416650 | 0.019465 | 1778.180570 | 1.0 | 0.507865 | 3.648110 | 5.647141 | 5.647141 | 4.732545 | 8.093976 | 8.527334 | 8.274840 | 14.592551 | 5.596296 | 11.626891 | 0.000833 | 0.003618 | 0.088341 | 2.180316 |
std | 57735.171256 | 8737.430326 | 0.855195 | 4.766528 | 262.246698 | 106892.374933 | 0.675465 | 7.766237e+04 | 0.781732 | 2.367497 | 199.752214 | 11.061279 | 11.493806 | 0.876316 | 31.852619 | 31.853228 | 5.455525 | 0.60653 | 0.380665 | 2.243082e+04 | 24.555849 | 11.943628 | 0.493005 | 0.138153 | 7983.247915 | 0.0 | 1.315019 | 2.257779 | 3.308588 | 3.308588 | 2.959386 | 4.803759 | 7.303106 | 4.550902 | 8.109357 | 3.220978 | 5.464619 | 0.030516 | 0.064276 | 0.505161 | 1.841987 |
min | 800000.000000 | 500.000000 | 3.000000 | 5.310000 | 14.010000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 625.000000 | 629.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000e+00 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 849999.750000 | 8000.000000 | 3.000000 | 9.750000 | 248.890000 | 420.000000 | 0.000000 | 4.600000e+04 | 0.000000 | 0.000000 | 103.000000 | 8.000000 | 11.830000 | 0.000000 | 670.000000 | 674.000000 | 8.000000 | 0.00000 | 0.000000 | 5.940000e+03 | 33.500000 | 16.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.000000 | 2.000000 | 3.000000 | 3.000000 | 3.000000 | 5.000000 | 4.000000 | 5.000000 | 9.000000 | 3.000000 | 8.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
50% | 899999.500000 | 12000.000000 | 3.000000 | 12.740000 | 375.430000 | 7836.000000 | 1.000000 | 6.500000e+04 | 1.000000 | 0.000000 | 203.000000 | 14.000000 | 17.620000 | 0.000000 | 690.000000 | 694.000000 | 11.000000 | 0.00000 | 0.000000 | 1.114000e+04 | 52.300000 | 23.000000 | 0.000000 | 0.000000 | 2.000000 | 1.0 | 0.000000 | 3.000000 | 5.000000 | 5.000000 | 4.000000 | 7.000000 | 7.000000 | 7.000000 | 13.000000 | 5.000000 | 11.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
75% | 949999.250000 | 20000.000000 | 3.000000 | 15.990000 | 580.942500 | 119739.250000 | 1.000000 | 9.000000e+04 | 2.000000 | 4.000000 | 392.000000 | 22.000000 | 24.080000 | 0.000000 | 710.000000 | 714.000000 | 14.000000 | 0.00000 | 0.000000 | 1.977925e+04 | 70.800000 | 32.000000 | 1.000000 | 0.000000 | 5.000000 | 1.0 | 0.000000 | 5.000000 | 7.000000 | 7.000000 | 6.000000 | 11.000000 | 11.000000 | 10.000000 | 19.000000 | 7.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
max | 999999.000000 | 40000.000000 | 5.000000 | 30.990000 | 1715.420000 | 378338.000000 | 5.000000 | 9.500000e+06 | 2.000000 | 13.000000 | 931.000000 | 50.000000 | 999.000000 | 28.000000 | 845.000000 | 850.000000 | 90.000000 | 61.00000 | 11.000000 | 1.743266e+06 | 366.600000 | 127.000000 | 1.000000 | 1.000000 | 61676.000000 | 1.0 | 32.000000 | 32.000000 | 51.000000 | 51.000000 | 63.000000 | 70.000000 | 99.000000 | 83.000000 | 112.000000 | 41.000000 | 90.000000 | 3.000000 | 3.000000 | 25.000000 | 28.000000 |
字段表
- id 为贷款清单分配的唯一信用证标识
- loanAmnt 贷款金额
- term 贷款期限(year)
- interestRate 贷款利率
- installment 分期付款金额
- grade 贷款等级
- subGrade 贷款等级之子级
- employmentTitle 就业职称
- employmentLength 就业年限(年)
- homeOwnership 借款人在登记时提供的房屋所有权状况
- annualIncome 年收入
- verificationStatus 验证状态
- issueDate 贷款发放的月份
- purpose 借款人在贷款申请时的贷款用途类别
- postCode 借款人在贷款申请中提供的邮政编码的前3位数字
- regionCode 地区编码
- dti 债务收入比
- delinquency_2years 借款人过去2年信用档案中逾期30天以上的违约事件数
- ficoRangeLow 借款人在贷款发放时的fico所属的下限范围
- ficoRangeHigh 借款人在贷款发放时的fico所属的上限范围
- openAcc 借款人信用档案中未结信用额度的数量
- pubRec 贬损公共记录的数量
- pubRecBankruptcies 公开记录清除的数量
- revolBal 信贷周转余额合计
- revolUtil 循环额度利用率,或借款人使用的相对于所有可用循环信贷的信贷金额
- totalAcc 借款人信用档案中当前的信用额度总数
- initialListStatus 贷款的初始列表状态
- applicationType 表明贷款是个人申请还是与两个共同借款人的联合申请
- earliesCreditLine 借款人最早报告的信用额度开立的月份
- title 借款人提供的贷款名称
- policyCode 公开可用的策略代码=1新产品不公开可用的策略代码=2
- n系列匿名特征 匿名特征n0-n14,为一些贷款人行为计数特征的处理
观察各个字段含义和实际数值
# train.head(20).iloc[:,:13]
train.head(20).iloc[:,13:26]
isDefault | purpose | postCode | regionCode | dti | delinquency_2years | ficoRangeLow | ficoRangeHigh | openAcc | pubRec | pubRecBankruptcies | revolBal | revolUtil | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 137.0 | 32 | 17.05 | 0.0 | 730.0 | 734.0 | 7.0 | 0.0 | 0.0 | 24178.0 | 48.9 |
1 | 0 | 0 | 156.0 | 18 | 27.83 | 0.0 | 700.0 | 704.0 | 13.0 | 0.0 | 0.0 | 15096.0 | 38.9 |
2 | 0 | 0 | 337.0 | 14 | 22.77 | 0.0 | 675.0 | 679.0 | 11.0 | 0.0 | 0.0 | 4606.0 | 51.8 |
3 | 0 | 4 | 148.0 | 11 | 17.21 | 0.0 | 685.0 | 689.0 | 9.0 | 0.0 | 0.0 | 9948.0 | 52.6 |
4 | 0 | 10 | 301.0 | 21 | 32.16 | 0.0 | 690.0 | 694.0 | 12.0 | 0.0 | 0.0 | 2942.0 | 32.0 |
5 | 0 | 9 | 512.0 | 21 | 17.14 | 0.0 | 730.0 | 734.0 | 19.0 | 0.0 | 0.0 | 4047.0 | 31.1 |
6 | 0 | 0 | 517.0 | 14 | 17.49 | 0.0 | 755.0 | 759.0 | 12.0 | 0.0 | 0.0 | 3111.0 | 8.5 |
7 | 0 | 0 | 100.0 | 4 | 32.60 | 0.0 | 665.0 | 669.0 | 8.0 | 1.0 | 1.0 | 14021.0 | 59.7 |
8 | 1 | 0 | 792.0 | 13 | 19.22 | 0.0 | 690.0 | 694.0 | 15.0 | 0.0 | 0.0 | 27176.0 | 46.0 |
9 | 0 | 0 | 59.0 | 11 | 24.39 | 0.0 | 725.0 | 729.0 | 7.0 | 0.0 | 0.0 | 2936.0 | 30.6 |
10 | 0 | 4 | 134.0 | 8 | 14.21 | 0.0 | 665.0 | 669.0 | 13.0 | 0.0 | 0.0 | 8653.0 | 47.5 |
11 | 0 | 0 | 893.0 | 49 | 34.63 | 0.0 | 710.0 | 714.0 | 10.0 | 0.0 | 0.0 | 16343.0 | 80.9 |
12 | 0 | 0 | 195.0 | 38 | 7.58 | 0.0 | 680.0 | 684.0 | 12.0 | 0.0 | 0.0 | 18866.0 | 35.7 |
13 | 0 | 2 | 134.0 | 8 | 5.68 | 0.0 | 690.0 | 694.0 | 7.0 | 0.0 | 0.0 | 4334.0 | 68.8 |
14 | 0 | 4 | 167.0 | 8 | 38.95 | 0.0 | 710.0 | 714.0 | 9.0 | 0.0 | 0.0 | 19023.0 | 60.8 |
15 | 0 | 2 | 194.0 | 38 | 17.27 | 0.0 | 660.0 | 664.0 | 16.0 | 1.0 | 1.0 | 220.0 | 3.6 |
16 | 0 | 2 | 492.0 | 36 | 21.02 | 0.0 | 705.0 | 709.0 | 16.0 | 0.0 | 0.0 | 36609.0 | 61.1 |
17 | 1 | 4 | 56.0 | 8 | 17.14 | 0.0 | 695.0 | 699.0 | 5.0 | 0.0 | 0.0 | 5463.0 | 76.9 |
18 | 1 | 3 | 140.0 | 8 | 28.95 | 3.0 | 660.0 | 664.0 | 6.0 | 0.0 | 0.0 | 6804.0 | 84.0 |
19 | 0 | 0 | 305.0 | 15 | 15.55 | 0.0 | 700.0 | 704.0 | 10.0 | 0.0 | 0.0 | 22859.0 | 57.0 |
查看缺失值
- 可以看到employmentLength这一字段缺失情况严重
- 其他非衍生字段缺失值,做一个缺值处理
统计多少字段有缺失值
# 统计多少字段有缺失值
print(f'There are {train.isnull().any().sum()} columns in train dataset with missing values.')
There are 22 columns in train dataset with missing values.
统计超过50%的缺失字段
# 统计是否有超过50%的缺失字段
have_null_fea_dict = (train.isnull().sum()/len(train)).to_dict()
fea_null_moreThanHalf = {
}
for key,value in have_null_fea_dict.items():if value > 0.5:fea_null_moreThanHalf[key] = value
have_null_fea_dict
fea_null_moreThanHalf
{}
缺失值可视化
# 绘图查看缺失值
missingno.bar(train)
<matplotlib.axes._subplots.AxesSubplot at 0x1bd01f2d780>
缺失率可视化
# nan可视化
missing = train.isnull().sum()/len(train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1bd26416e48>
显然employmentLength字段是非衍生特征里面缺失值非常明显的字段
- 纵向了解哪些列存在 “nan”, 并可以把nan的个数打印,主要的目的在于查看某一列nan存在的个数是否真的很大,如果nan存在的过多,说明这一列对label的影响几乎不起作用了,可以考虑删掉。如果缺失值很小一般可以选择填充。
- 另外可以横向比较,如果在数据集中,某些样本数据的大部分列都是缺失的且样本足够的情况下可以考虑删除。
train['employmentLength'].value_counts()
10+ years 262753
2 years 72358
< 1 year 64237
3 years 64152
1 year 52489
5 years 50102
4 years 47985
6 years 37254
8 years 36192
7 years 35407
9 years 30272
Name: employmentLength, dtype: int64
特征类型查看
- 特征一般都是由类别型特征和数值型特征组成,而数值型特征又分为连续型和离散型。
- 类别型特征有时具有非数值关系,有时也具有数值关系。比如‘grade’中的等级A,B,C等,是否只是单纯的分类,还是A优于其他要结合业务判断。
- 数值型特征本是可以直接入模的,但往往风控人员要对其做分箱,转化为WOE编码进而做标准评分卡等操作。从模型效果上来看,特征分箱主要是为了降低变量的复杂性,减少变量噪音对模型的影响,提高自变量和因变量的相关度。从而使模型更加稳定。
numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(train.columns)))
类别型特征:category_fea
category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
连续型特征:numerical_fea
numerical_fea
['id','loanAmnt','term','interestRate','installment','employmentTitle','homeOwnership','annualIncome','verificationStatus','isDefault','purpose','postCode','regionCode','dti','delinquency_2years','ficoRangeLow','ficoRangeHigh','openAcc','pubRec','pubRecBankruptcies','revolBal','revolUtil','totalAcc','initialListStatus','applicationType','title','policyCode','n0','n1','n2','n3','n4','n5','n6','n7','n8','n9','n10','n11','n12','n13','n14']
Data Type | |
---|---|
id | ID Column |
loanAmnt | Numeric |
term | Categorical |
interestRate | Numeric |
installment | Numeric |
grade | Categorical |
subGrade | Categorical |
employmentTitle | Numeric |
employmentLength | Categorical |
homeOwnership | Categorical |
annualIncome | Numeric |
verificationStatus | Categorical |
issueDate | Date |
isDefault | Label |
purpose | Categorical |
postCode | Numeric |
regionCode | Numeric |
dti | Numeric |
delinquency_2years | Numeric |
ficoRangeLow | Numeric |
ficoRangeHigh | Numeric |
openAcc | Numeric |
pubRec | Numeric |
pubRecBankruptcies | Categorical |
revolBal | Numeric |
revolUtil | Numeric |
totalAcc | Numeric |
initialListStatus | Categorical |
applicationType | Categorical |
earliesCreditLine | Date |
title | Numeric |
policyCode | Numeric |
n0 | Numeric |
n1 | Numeric |
n2 | Numeric |
n3 | Numeric |
n4 | Numeric |
n5 | Numeric |
n6 | Numeric |
n7 | Numeric |
n8 | Numeric |
n9 | Numeric |
n10 | Numeric |
n11 | Categorical |
n12 | Categorical |
n13 | Numeric |
n14 | Numeric |
分离数值型特征
数值型特征包含连续性和离散型
# 过滤数值型类别特征
def get_numerical_serial_fea(data,feas):numerical_serial_fea = []numerical_noserial_fea = []for fea in feas:temp = data[fea].nunique()if temp <= 10:numerical_noserial_fea.append(fea)continuenumerical_serial_fea.append(fea)return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(train,numerical_fea)
# 连续型
numerical_serial_fea
['id','loanAmnt','interestRate','installment','employmentTitle','annualIncome','purpose','postCode','regionCode','dti','delinquency_2years','ficoRangeLow','ficoRangeHigh','openAcc','pubRec','pubRecBankruptcies','revolBal','revolUtil','totalAcc','title','n0','n1','n2','n3','n4','n5','n6','n7','n8','n9','n10','n13','n14']
# 离散型
numerical_noserial_fea
['term','homeOwnership','verificationStatus','isDefault','initialListStatus','applicationType','policyCode','n11','n12']
数值连续型变量分析
#每个数字特征得分布可视化
f = pd.melt(train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
- 查看某一个数值型变量的分布,查看变量是否符合正态分布,如果不符合正太分布的变量可以log化后再观察下是否符合正态分布。
- 如果想统一处理一批数据变标准化 必须把这些之前已经正态化的数据提出
- 正态化的原因:一些情况下正态化非正态特征可以让模型更快的收敛,一些模型要求数据正态(eg. GMM、KNN),保证数据不要过偏态即可,过于偏态可能会影响模型预测结果。
#Ploting Transaction Amount Values Distribution
plt.figure(figsize=(16,12))
plt.suptitle('Transaction Values Distribution', fontsize=22)
plt.subplot(221)
sub_plot_1 = sns.distplot(train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribuition", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribuition", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)
Text(0, 0.5, 'Probability')
总结
- 数据理解和数据处理是数据挖掘极其重要的一环,我们需要了解数据集各个字段的特点并加以处理
三、评测标准
提交结果为每个测试样本是1的概率,也就是y为1的概率。评价方法为AUC评估模型效果(越大越好)。
分类常用使用的评估指标是:
- Accuracy(精确度),AUC,Recall(召回率),Precision(准确度),F1,Kappa
本次是学习赛使用的评估指标是AUC
- AUC也就是ROC曲线下与坐标轴围成的面积
- ROC空间将假正例率(FPR)定义为 X 轴,真正例率(TPR)定义为 Y 轴。
- TPR:在所有实际为正例的样本中,被正确地判断为正例之比率。
- FPR:在所有实际为负例的样本中,被错误地判断为正例之比率。
- AUC的取值范围子是0.5和1之间,面积越大,精准度越高,因此AUC越接近1.0,模型精准率预告,AUC为1时精准率为100%,
三、结果提交
提交前请确保预测结果的格式与sample_submit.csv中的格式一致,以及提交文件后缀名为csv。