
Merging pandas columns that contain list objects

Popularity: 89   Published: 2023-06-19 09:09:50.0

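My question is similar to “ ” and “ ”, but because the cells I want to merge contain lists, none of the answers given there work.

Simplified, my df looks like this: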

        players    players1    players2    players3
1       ['1','2']  
2       ['2','4']
3                  ['1','4']
4                              ['1','5']
5                                          ['3','5']
6
7                  ['3','4']

(So I know that a row will never have values in two columns.)

Now I want to add a new column that combines the lists:

        players    players1    players2    players3     players_combine
1       ['1','2']                                       ['1','2']
2       ['2','4']                                       ['2','4']
3                  ['1','4']                            ['1','4']
4                              ['1','5']                ['1','5']
5                                          ['3','5']    ['3','5']
6
7                  ['3','4']                            ['3','4']

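I tried many things, mostly variations of the linked answers; my last idea was a cascaded np.where. But that did not work: only the values from 'players' show up in the combined column.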

df['players_combine'] = np.where(df.players.notnull(),df.players.values,np.where(df.players1.notnull(),df.players1.values,np.where(df.players2.notnull(),df.players2.values,np.where(df.players3.notnull(),df.players3.values,np.nan))))

Edit: as requested in the comments, here is df.head(5).to_dict():

{'players': {'5b41800eaffb061b88c4beac': ['57005', '124021', '132037', '78523', '111742', '133892', '76431', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '296770'], '5b41800eaffb061b88c4bead': ['18929', '110183', '28401', '302853', '296768', '94912', '93671', '52060', '43282', '132364', '140646', '77861', '19787', '133790', '312666', '76336', '317219', '137849'], '5b41800daffb061b88c4bc7f': 'nan', '5b41800eaffb061b88c4bd62': 'nan', '5b41800eaffb061b88c4bd65': 'nan'}, 'players1': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '78066', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '108106', '127464', '76431'], '5b41800eaffb061b88c4bd62': '', '5b41800eaffb061b88c4bd65': ''}, 'players2': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': nan, '5b41800eaffb061b88c4bd62': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '77857', '69756', '133745', '278877', '247798', '78066', '127464', '76431'], '5b41800eaffb061b88c4bd65': ''}, 'players3': {'5b41800eaffb061b88c4beac': nan, '5b41800eaffb061b88c4bead': nan, '5b41800daffb061b88c4bc7f': nan, '5b41800eaffb061b88c4bd62': nan, '5b41800eaffb061b88c4bd65': ['57005', '124021', '132037', '78523', '111742', '133892', '296770', '108106', '138749', '132358', '247798', '69756', '133745', '278877', '77857', '78066', '127464', '76431']}}

If any of the empty cells is an empty string (''), first turn it into NaN:

df[df==''] = np.nan

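Then select, in each row, the maximum of all non-NaN values (with at most one value per row, that is simply the value itself):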

df.apply(lambda x: x[x.notnull()].max(), axis=1)
#1    [1, 2]
#2    [2, 4]
#3    [1, 4]
#4    [1, 5]
#5    [3, 5]
#6       NaN
#7    [3, 4]

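Another interesting (and faster) solution is to drop the rows that are entirely NaN and then find the first valid value in each remaining row: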

df.loc[df.notnull().any(axis=1)]\
  .apply(lambda x: x[x.first_valid_index()], axis=1)
#1    [1, 2]
#2    [2, 4]
#3    [1, 4]
#4    [1, 5]
#5    [3, 5]
#7    [3, 4]
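
For reference, the second variant can be tried end to end on a small frame. This is only a sketch: the three-row frame and the names `non_empty` and `combined` are made up for illustration, and it assumes the empty cells are already real NaN values.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the question): row 2 is empty in every column.
df = pd.DataFrame({
    "players": [["1", "2"], np.nan, np.nan],
    "players1": [np.nan, ["1", "4"], np.nan],
    "players2": [np.nan, np.nan, np.nan],
})

# Keep only rows that have at least one non-NaN cell.
non_empty = df.loc[df.notnull().any(axis=1)]

# For each remaining row, look up the value under the first non-NaN column label.
combined = non_empty.apply(lambda row: row[row.first_valid_index()], axis=1)
print(combined)
```

Because the all-NaN row is filtered out first, `first_valid_index()` never returns `None` inside the lambda; the result simply has no entry for that row.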

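Since you know that each row has a value in at most one column, you can replace the unwanted values with NaN and then use .stack

In this case it looks like you have both 'nan' and '' strings, which should be replaced with np.nan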

import numpy as np

df['players_combine'] = df.replace({'': np.nan, 'nan': np.nan}, regex=True).stack().reset_index(level=1, drop=True)

Sample data

import numpy as np
import pandas as pd
df = pd.DataFrame({'players': [['1','2'], '', '', np.nan, ''],
                   'players1': ['', ['2','4'], '', np.nan, ''],
                   'players2': ['', '', ['1','5'], np.nan, ''],
                   'players3': ['', '', np.nan, ['3', '5'], '']})

Output:

  players players1 players2 players3 players_combine
0  [1, 2]                                     [1, 2]
1           [2, 4]                            [2, 4]
2                    [1, 5]      NaN          [1, 5]
3     NaN      NaN      NaN   [3, 5]          [3, 5]
4                                                NaN
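
Why this works: `stack` moves the column labels into a second index level, and in the pandas versions this answer was written against it drops NaN cells by default, so each row contributes at most one entry. The sketch below adds an explicit `dropna()` so that it does not depend on that default, since newer pandas releases appear to keep NaN cells when stacking. The small frame is illustrative and assumes the placeholder strings have already been replaced by NaN.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 2 has no value in any column.
df = pd.DataFrame({
    "players": [["1", "2"], np.nan, np.nan],
    "players1": [np.nan, ["2", "4"], np.nan],
})

# stack() yields a Series indexed by (row label, column label).
# dropna() removes NaN cells explicitly instead of relying on stack's default.
stacked = df.stack().dropna()

# Drop the column-label level so the result aligns with df's row index.
df["players_combine"] = stacked.reset_index(level=1, drop=True)
print(df)
```

Rows without any value have no entry in `stacked`, so index alignment during the assignment leaves them as NaN in the new column.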

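Looking at your df.to_dict(), it seems that some cells hold the string 'nan', others hold an actual np.nan, and still others hold an empty string ''.

So first clean up the dataset by making the empty values uniform: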

df = df.replace({'nan':np.nan, '':np.nan})
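
The cleaning step matters because `notnull()` only recognises real missing values, so the placeholder strings 'nan' and '' count as data. That would explain why the cascaded np.where in the question always took the `players` branch. The sketch below demonstrates this and shows one alternative way to normalise the cells by type; `keep_lists` is a hypothetical helper, and it assumes every genuine value is a Python list.

```python
import numpy as np
import pandas as pd

col = pd.Series([["1", "2"], "nan", "", np.nan])

# Only the real NaN is reported as missing; both strings count as values.
print(col.notnull().tolist())  # [True, True, True, False]

def keep_lists(value):
    """Hypothetical helper: keep list cells, turn anything else into NaN."""
    return value if isinstance(value, list) else np.nan

cleaned = col.map(keep_lists)
print(cleaned.notnull().tolist())  # [True, False, False, False]
```

For a whole frame, the same helper could be applied column by column, for example with `df.apply(lambda c: c.map(keep_lists))`.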

Then you can aggregate along axis=1:

df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)
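
Note that indexing with `[0]` assumes every row has at least one non-null cell; a fully empty row (like row 6 in the question's simplified table) would have nothing to select. A minimal sketch of a variant that returns NaN for such rows, written as a plain Python loop over the rows. The helper name `first_list` and the sample frame are illustrative, and it assumes real values are always lists.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 2 has no value in any column.
df = pd.DataFrame({
    "players": [["1", "2"], np.nan, np.nan],
    "players1": [np.nan, ["1", "4"], np.nan],
})

def first_list(row):
    """Return the first list found in the row, or NaN if there is none."""
    return next((cell for cell in row if isinstance(cell, list)), np.nan)

# itertuples(index=False, name=None) yields each row as a plain tuple.
values = [first_list(row) for row in df.itertuples(index=False, name=None)]
df["players_combine"] = pd.Series(values, index=df.index, dtype=object)
print(df)
```

Because the check is `isinstance(cell, list)`, leftover 'nan' or '' strings are skipped as well, so this form also tolerates data that has not been cleaned.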

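Since the actual problem in my code was some malformed data rather than the question as asked, I want to add some information about the performance of the different options. I used np.where because it is vectorized, whereas DataFrame.apply iterates over the rows, so there is a huge performance difference.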


Setting up a test df with 4 * 2000 = 8000 rows:

import pandas as pd
import numpy as np

l = [[['1','2'],np.nan,np.nan,np.nan],
     [np.nan,['2','3'],np.nan,np.nan],
     [np.nan,np.nan,['3','4'],np.nan],
     [np.nan,np.nan,np.nan,['4','5']]]

l=l*2000

df = pd.DataFrame(l)

df.columns = ['players','players1','players2','players3']

Best option:

%timeit df['players_combine'] = np.where(df.players.notnull(),df.players.values,np.where(df.players1.notnull(),df.players1.values,np.where(df.players2.notnull(),df.players2.values,np.where(df.players3.notnull(),df.players3.values,np.nan))))

100 loops, best of 3: 2.18 ms per loop
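
The nested call is hard to read and has to be edited by hand whenever a column is added. A sketch of the same cascade built in a loop, assuming column names like those in the question and that empty cells are real NaN. The variable names are illustrative, and no timing is claimed for this form.

```python
import numpy as np
import pandas as pd

# Illustrative data: the last row has no value in any column.
df = pd.DataFrame({
    "players": [["1", "2"], np.nan, np.nan, np.nan],
    "players1": [np.nan, ["2", "3"], np.nan, np.nan],
    "players2": [np.nan, np.nan, ["3", "4"], np.nan],
})
cols = ["players", "players1", "players2"]

# Start from an all-NaN object array and apply the columns from last to first,
# so that earlier columns take precedence, as in the nested np.where.
combined = np.full(len(df), np.nan, dtype=object)
for col in reversed(cols):
    has_value = df[col].notnull().to_numpy()
    combined = np.where(has_value, df[col].to_numpy(dtype=object), combined)

df["players_combine"] = pd.Series(combined, index=df.index)
print(df)
```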

Good option:

%timeit df.loc[df.notnull().any(axis=1)]\
  .apply(lambda x: x[x.first_valid_index()], axis=1)

100 loops, best of 3: 413 ms per loop

Other option 1:

%timeit df['players_combine'] = df.agg(lambda s: s[~s.isnull()][0], axis=1)

1 loop, best of 3: 4.71 s per loop

And option 2:

%timeit df['players_combine'] = df.apply(lambda x: x[x.notnull()].max(), axis=1)

1 loop, best of 3: 4.86 s per loop
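
`%timeit` is an IPython magic, so the measurements above cannot be run as a plain script. A rough sketch of how two of the options could be compared with the standard `timeit` module: the wrapper functions are illustrative, the frame is kept smaller than in the benchmark above, and the printed numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller test frame than above: 4 * 200 = 800 rows.
rows = [[["1", "2"], np.nan, np.nan, np.nan],
        [np.nan, ["2", "3"], np.nan, np.nan],
        [np.nan, np.nan, ["3", "4"], np.nan],
        [np.nan, np.nan, np.nan, ["4", "5"]]] * 200
df = pd.DataFrame(rows, columns=["players", "players1", "players2", "players3"])

def with_where(frame):
    # Cascaded np.where, as in the question.
    return np.where(frame.players.notnull(), frame.players.values,
           np.where(frame.players1.notnull(), frame.players1.values,
           np.where(frame.players2.notnull(), frame.players2.values,
           np.where(frame.players3.notnull(), frame.players3.values, np.nan))))

def with_first_valid(frame):
    # Row-wise lookup of the first non-null cell.
    return frame.loc[frame.notnull().any(axis=1)].apply(
        lambda x: x[x.first_valid_index()], axis=1)

for func in (with_where, with_first_valid):
    seconds = timeit.timeit(lambda: func(df), number=3)
    print(f"{func.__name__}: {seconds / 3:.4f} s per call")
```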