pd.notnull(df)
pd.isnull(df)
df.dropna(axis=0,how='any',inplace=False)   # these are the default arguments
df.dropna(axis=1,how='all',inplace=True)    # modifies df itself
df.sort_values(by='Z',ascending=False)
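The calls listed above can be sketched as follows. This is a minimal example, assuming a small float DataFrame with a single NaN; the names `cleaned`, `df_copy`, `result` and `by_z_desc` are made up for illustration. It shows that `dropna` returns a new object by default, while `inplace=True` changes the object it is called on and returns `None`.

```python
import numpy as np
import pandas as pd

# Float data from the start, so assigning NaN does not change the dtype.
df = pd.DataFrame(np.arange(12, dtype=float).reshape((3, 4)),
                  index=list('abc'), columns=list('WXYZ'))
df.loc['a', 'W'] = np.nan

# Default call: returns a new DataFrame, df keeps its NaN row.
cleaned = df.dropna(axis=0, how='any')

# inplace=True: returns None and changes the object it is called on.
# A copy is used here so df stays available for comparison.
df_copy = df.copy()
result = df_copy.dropna(axis=0, how='any', inplace=True)

# sort_values also returns a new, sorted DataFrame.
by_z_desc = df.sort_values(by='Z', ascending=False)

print(cleaned.shape, df.shape, result, df_copy.shape)
print(list(by_z_desc.index))
```

Because `inplace=True` returns `None`, writing `df = df.dropna(inplace=True)` would throw the data away; either assign the result of the default call or use `inplace=True` without assignment.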
In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: df = pd.DataFrame(np.arange(12).reshape((3,4)),index=list('abc'),columns=list('WXYZ') )

In [5]: df
Out[5]:
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [6]: df[df==0] = np.nan

In [7]: df
Out[7]:
     W  X   Y   Z
a  NaN  1   2   3
b  4.0  5   6   7
c  8.0  9  10  11

In [8]: df.dropna()
Out[8]:
     W  X   Y   Z
b  4.0  5   6   7
c  8.0  9  10  11

In [9]: df.dropna(axis=1)
Out[9]:
   X   Y   Z
a  1   2   3
b  5   6   7
c  9  10  11

In [10]: df.dropna(axis=0,how='all')
Out[10]:
     W  X   Y   Z
a  NaN  1   2   3
b  4.0  5   6   7
c  8.0  9  10  11

In [11]: df.dropna(axis=1,how='all')
Out[11]:
     W  X   Y   Z
a  NaN  1   2   3
b  4.0  5   6   7
c  8.0  9  10  11
In [12]: pd.isnull(df)
Out[12]:
       W      X      Y      Z
a   True  False  False  False
b  False  False  False  False
c  False  False  False  False

In [13]: pd.isnull(df['W'])
Out[13]:
a     True
b    False
c    False
Name: W, dtype: bool

In [14]: df[ pd.isnull(df['W']) ]
Out[14]:
     W  X  Y  Z
a  NaN  1  2  3
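The boolean mask used in the session above works the same way with `pd.notnull`, which keeps the rows that do have a value. A small sketch, assuming the same 3x4 frame with one NaN in column W; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12, dtype=float).reshape((3, 4)),
                  index=list('abc'), columns=list('WXYZ'))
df.loc['a', 'W'] = np.nan

missing_w = df[pd.isnull(df['W'])]    # rows where W is NaN
present_w = df[pd.notnull(df['W'])]   # rows where W has a value

# True counts as 1 when summed, so this gives the NaN count per column.
missing_per_column = pd.isnull(df).sum()

print(list(missing_w.index), list(present_w.index))
print(missing_per_column)
```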
In [22]: import numpy as np

In [23]: import pandas as pd

In [24]: df = pd.DataFrame(np.arange(12).reshape((3,4)),index=list('abc'),columns=list('WXYZ') )

In [25]: df
Out[25]:
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [26]: df[df==0] = np.nan

In [27]: df
Out[27]:
     W  X   Y   Z
a  NaN  1   2   3
b  4.0  5   6   7
c  8.0  9  10  11

In [28]: df.fillna(0)
Out[28]:
     W  X   Y   Z
a  0.0  1   2   3
b  4.0  5   6   7
c  8.0  9  10  11

In [29]: df.fillna(100)
Out[29]:
       W  X   Y   Z
a  100.0  1   2   3
b    4.0  5   6   7
c    8.0  9  10  11
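Missing values are usually not filled with an arbitrary constant; the column mean or median is the more common choice.
1. For some columns, filling has little practical meaning.
2. For other columns, filling is meaningful.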
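The two points above can be sketched by filling only the columns where a statistic makes sense. The data and column names (`user_id`, `age`, `income`) are hypothetical and not from the original notes; passing a dict to `fillna` leaves every column that is not listed untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'user_id' is an identifier, so an "average id" means nothing;
# 'age' and 'income' are measurements where a typical value is a fair stand-in.
df = pd.DataFrame({
    'user_id': [101.0, np.nan, 103.0, 104.0],
    'age': [20.0, 30.0, np.nan, 40.0],
    'income': [1000.0, np.nan, 3000.0, 50000.0],
})

# Fill only the columns where a statistic makes sense.
fill_values = {
    'age': df['age'].mean(),          # (20 + 30 + 40) / 3 = 30.0
    'income': df['income'].median(),  # median of 1000, 3000, 50000 = 3000.0
}
filled = df.fillna(value=fill_values)

print(filled)
```

The median is used for `income` in this sketch because a single large value (50000) pulls the mean far away from the typical row; which statistic fits is a judgment call per column.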
In [18]: df
Out[18]:
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [19]: df.mean()
Out[19]:
W    4.0
X    5.0
Y    6.0
Z    7.0
dtype: float64

In [25]: df.median()
Out[25]:
W    4.0
X    5.0
Y    6.0
Z    7.0
dtype: float64

In [26]: df.loc[0,:]=np.nan

In [27]: df
Out[27]:
     W    X     Y     Z
a  0.0  1.0   2.0   3.0
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0
0  NaN  NaN   NaN   NaN

In [28]: df.dropna(inplace=True)

In [29]: df
Out[29]:
     W    X     Y     Z
a  0.0  1.0   2.0   3.0
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [30]: df.loc['a',:] = np.nan

In [31]: df
Out[31]:
     W    X     Y     Z
a  NaN  NaN   NaN   NaN
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [32]: df.fillna( df.mean() )
Out[32]:
     W    X     Y     Z
a  6.0  7.0   8.0   9.0
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [33]: df
Out[33]:
     W    X     Y     Z
a  NaN  NaN   NaN   NaN
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [34]: df.fillna(df.median())
Out[34]:
     W    X     Y     Z
a  6.0  7.0   8.0   9.0
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [35]: df
Out[35]:
     W    X     Y     Z
a  NaN  NaN   NaN   NaN
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [36]: df['W'].fillna(df['W'].mean())
Out[36]:
a    6.0
b    4.0
c    8.0
Name: W, dtype: float64

In [37]: df
Out[37]:
     W    X     Y     Z
a  NaN  NaN   NaN   NaN
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [38]: df['W'].fillna( df['W'].median() )
Out[38]:
a    6.0
b    4.0
c    8.0
Name: W, dtype: float64

In [39]: df
Out[39]:
     W    X     Y     Z
a  NaN  NaN   NaN   NaN
b  4.0  5.0   6.0   7.0
c  8.0  9.0  10.0  11.0

In [40]: df['W'] = 1

In [41]: df
Out[41]:
   W    X     Y     Z
a  1  NaN   NaN   NaN
b  1  5.0   6.0   7.0
c  1  9.0  10.0  11.0

In [42]: df['W'] = df['W'].fillna( df['W'].mean() )

In [43]: df
Out[43]:
   W    X     Y     Z
a  1  NaN   NaN   NaN
b  1  5.0   6.0   7.0
c  1  9.0  10.0  11.0
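The session above shows `df` unchanged after each `fillna` call, because `fillna` returns a new object unless the result is assigned back. A minimal sketch of that behaviour, assuming the same frame with row `a` set to NaN; `preview` and `still_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12, dtype=float).reshape((3, 4)),
                  index=list('abc'), columns=list('WXYZ'))
df.loc['a', :] = np.nan

# fillna returns a new DataFrame; df itself still has NaN in row 'a'.
preview = df.fillna(df.mean())
still_missing = int(df.isnull().sum().sum())

# Assign the result back to keep it, here for column W only.
df['W'] = df['W'].fillna(df['W'].mean())

print(preview)
print(still_missing)
print(df)
```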
The difference between DataFrame loc and iloc: iloc selects by integer position, loc selects by label.
In [15]: df.iloc[[0,2],:]
Out[15]:
   W  X   Y   Z
a  0  1   2   3
c  8  9  10  11

In [16]: df.loc[:,'W']
Out[16]:
a    0
b    4
c    8
Name: W, dtype: int32

In [17]: df.loc[:,['W','Z']]
Out[17]:
   W   Z
a  0   3
b  4   7
c  8  11
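To put the two indexers side by side, the sketch below selects the same rows once by position and once by label, then compares slicing. It assumes the original 3x4 integer frame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=list('abc'), columns=list('WXYZ'))

by_position = df.iloc[[0, 2], :]     # rows at positions 0 and 2
by_label = df.loc[['a', 'c'], :]     # the same rows, selected by label

# Slices differ: iloc excludes the end position, loc includes the end label.
iloc_slice = df.iloc[0:2, :]         # positions 0 and 1 -> rows 'a', 'b'
loc_slice = df.loc['a':'b', :]       # labels 'a' through 'b', with 'b' included

print(by_position.equals(by_label))
print(list(iloc_slice.index), list(loc_slice.index))
```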