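Problem description
My question is similar to two earlier questions, but because the cells I want to merge contain lists, none of the answers given there work.
Simplified, my df looks like this: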
players players1 players2 players3
1 ['1','2']
2 ['2','4']
3 ['1','4']
4 ['1','5']
5 ['3','5']
6
7 ['3','4']
(So I know that a row will never have values in two columns.)
Now I want to add a new column that combines the lists:
players players1 players2 players3 players_combine
1 ['1','2'] ['1','2']
2 ['2','4'] ['2','4']
3 ['1','4'] ['1','4']
4 ['1','5'] ['1','5']
5 ['3','5'] ['3','5']
6
7 ['3','4'] ['3','4']
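I tried many things, mostly variations of the linked answers. My last idea was a cascade of np.where calls, but it did not work: only the values from "players" show up in the combined column.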
df['players_combine'] = np.where(df.players.notnull(),df.players.values,np.where(df.players1.notnull(),df.players1.values,np.where(df.players2.notnull(),df.players2.values,np.where(df.players3.notnull(),df.players3.values,np.nan))))
Edit: as requested in the comments, df.head(5).to_dict():
{'players': {'5b41800eaffb061b88c4beac': ['57005', '124021', '132037', '78523', '111742', '133892', '76431', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '296770'], '5b41800eaffb061b88c4bead': ['18929', '110183', '28401', '302853', '296768', '94912', '93671', '52060', '43282', '132364', '140646', '77861', '19787', '133790', '312666', '76336', '317219', '137849'], '5b41800daffb061b88c4bc7f': 'nan', '5b41800eaffb061b88c4bd62': 'nan', '5b41800eaffb061b88c4bd65': 'nan'}, 'players1': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '76431'], '5b41800eaffb061b88c4bd62': '', '5b41800eaffb061b88c4bd65': ''}, 'players2': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': nan, '5b41800eaffb061b88c4bd62': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '78066', '127464', '76431'], '5b41800eaffb061b88c4bd65': ''}, 'players3': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': nan, '5b41800eaffb061b88c4bd62': nan, '5b41800eaffb061b88c4bd65': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '247798', '69756', '133745', '278877', '77857', '78066', '127464', '76431']}}
Answer 1
If any of the empty cells is an empty string (''), first set it to NaN:
df[df==''] = np.nan
Then, in each row, select the maximum of all non-NaN values:
df.apply(lambda x: x[x.notnull()].max(), axis=1)
#1 [1, 2]
#2 [2, 4]
#3 [1, 4]
#4 [1, 5]
#5 [3, 5]
#6 NaN
#7 [3, 4]
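Another interesting (and faster) solution is to drop the rows that are entirely NaN and then find the first valid value in each remaining row: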
df.loc[df.notnull().any(axis=1)]\
.apply(lambda x: x[x.first_valid_index()], axis=1)
#1 [1, 2]
#2 [2, 4]
#3 [1, 4]
#4 [1, 5]
#5 [3, 5]
#7 [3, 4]
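As a minimal sketch of this second approach, assuming a small made-up frame shaped like the one in the question (the data and index values below are hypothetical), the result can be assigned back as a new column. The assignment aligns on the index, so the row that was dropped simply ends up as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: at most one list per row, row 6 is entirely empty.
df = pd.DataFrame(
    {
        "players": [["1", "2"], np.nan, np.nan],
        "players1": [np.nan, ["2", "4"], np.nan],
        "players2": [np.nan, np.nan, np.nan],
    },
    index=[1, 2, 6],
)

# Keep only rows with at least one non-null cell,
# then pick the first valid value in each of those rows.
non_empty = df.loc[df.notnull().any(axis=1)]
combined = non_empty.apply(lambda row: row[row.first_valid_index()], axis=1)

# Column assignment aligns on the index, so the dropped row 6 becomes NaN.
df["players_combine"] = combined
print(df["players_combine"])
```

Dropping the empty rows first matters here: first_valid_index() returns None for a row with no valid value, and indexing the row with None would not give a usable result.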
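Answer 2
Since you know that each row holds a value in at most one column, you can replace the unwanted values with NaN and then use .stack.
In this case you seem to have both 'nan' and '' strings, which should be replaced with np.nan.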
import numpy as np
df['players_combine'] = df.replace({'': np.nan, 'nan': np.nan}, regex=True).stack().reset_index(level=1, drop=True)
Sample data
import numpy as np
import pandas as pd
df = pd.DataFrame({'players': [['1','2'], '', '', np.nan, ''],
'players1': ['', ['2','4'], '', np.nan, ''],
'players2': ['', '', ['1','5'], np.nan, ''],
'players3': ['', '', np.nan, ['3', '5'], '']})
Output:
players players1 players2 players3 players_combine
0 [1, 2] [1, 2]
1 [2, 4] [2, 4]
2 [1, 5] NaN [1, 5]
3 NaN NaN NaN [3, 5] [3, 5]
4 NaN
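Answer 3
Looking at your df.to_dict(), it seems that some cells contain the string 'nan', others contain an actual np.nan, and some others even contain the empty string ''.
So first clean up the dataset by making the null values uniform: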
df = df.replace({'nan':np.nan, '':np.nan})
Then you can aggregate along axis=1, taking the first non-null value in each row:
df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)
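To see why the cascaded np.where from the question only ever returned values from "players", it may help to check how notnull() treats the three kinds of empty markers. The following is a small illustrative sketch (the variable names are made up); it shows that the strings 'nan' and '' count as non-null until they are replaced:

```python
import numpy as np
import pandas as pd

# The three kinds of "empty" cells seen in the to_dict() output.
markers = pd.Series(["nan", "", np.nan], dtype=object)

# Strings count as non-null, so only the real NaN is flagged as missing.
mask_before = markers.notnull().tolist()

# After unifying the markers, all three cells are treated as missing.
cleaned = markers.replace({"nan": np.nan, "": np.nan})
mask_after = cleaned.notnull().tolist()

print(mask_before)  # [True, True, False]
print(mask_after)   # [False, False, False]
```

Because the "players" column in the question held the string 'nan' in its empty cells, df.players.notnull() was True for every row, and the first branch of the cascade always won.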
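Answer 4
Since the problem turned out to be some malformed data in my code rather than anything in the question as asked, I would like to add some information about the performance of the different options. I used np.where because it is vectorized, whereas DataFrame.apply iterates over the rows, so there is a huge performance difference.
Setting up a test df with 4 * 2000 = 8000 rows: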
import pandas as pd
import numpy as np
l = [[['1','2'],np.nan,np.nan,np.nan],
[np.nan,['2','3'],np.nan,np.nan],
[np.nan,np.nan,['3','4'],np.nan],
[np.nan,np.nan,np.nan,['4','5']]]
l=l*2000
df = pd.DataFrame(l)
df.columns = ['players','players1','players2','players3']
Best option:
%timeit df['players_combine'] = np.where(df.players.notnull(),df.players.values,np.where(df.players1.notnull(),df.players1.values,np.where(df.players2.notnull(),df.players2.values,np.where(df.players3.notnull(),df.players3.values,np.nan))))
100 loops, best of 3: 2.18 ms per loop
A good option:
%timeit df.loc[df.notnull().any(axis=1)]\
.apply(lambda x: x[x.first_valid_index()], axis=1)
100 loops, best of 3: 413 ms per loop
Other option 1:
%timeit df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)
1 loop, best of 3: 4.71 s per loop
And 2:
%timeit df['players_combine'] = df.apply(lambda x: x[x.notnull()].max(), axis=1)
1 loop, best of 3: 4.86 s per loop
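The nested np.where call gets hard to read as the number of columns grows. One possible way to write the same cascade as a loop is sketched below. The helper name coalesce_columns is hypothetical, the sample frame is made up, and this variant was not part of the timings above:

```python
import numpy as np
import pandas as pd

def coalesce_columns(frame, columns):
    """Return, per row, the first non-null value among `columns` (hypothetical helper)."""
    # Start with NaN everywhere; this is the fallback for rows with no value at all.
    result = np.full(len(frame), np.nan, dtype=object)
    # Walk the columns from last to first so that earlier columns take priority,
    # which mirrors the nesting order of the cascaded np.where.
    for col in reversed(columns):
        series = frame[col]
        result = np.where(series.notnull().to_numpy(),
                          series.to_numpy(dtype=object),
                          result)
    return result

# Made-up sample in the same shape as the test df.
df = pd.DataFrame({
    "players": [["1", "2"], np.nan, np.nan],
    "players1": [np.nan, ["2", "3"], np.nan],
})
df["players_combine"] = coalesce_columns(df, ["players", "players1"])
print(df["players_combine"])
```

As with the original cascade, this only gives the intended result once 'nan' and '' strings have been replaced by real NaN values.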