Problem description
I have the following pandas DataFrame:
Code Sum Quantity
0 -12 0
1 23 0
2 -10 0
3 -12 0
4 100 0
5 102 201
6 34 0
7 -34 0
8 -23 0
9 100 0
10 100 0
11 102 300
12 -23 0
13 -25 0
14 100 123
15 167 167
The DataFrame I want is:
Code Sum Quantity new_sum
0 -12 0 -12
1 23 0 23
2 -10 0 -10
3 -12 0 -12
4 100 0 0
5 102 201 202
6 34 0 34
7 -34 0 -34
8 -23 0 -23
9 100 0 0
10 100 0 0
11 102 300 302
12 -23 0 -23
13 -25 0 -25
14 100 123 100
15 167 167 167
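The logic is:
First, I check for non-zero values in the Quantity column. In the sample data above, the first non-zero Quantity occurs at index 5, namely 201. Then I want to add up the values in the Sum column, going back through the preceding rows, until I hit a negative value.
I wrote code that uses nested if statements. However, because of the multiple ifs and the row-wise comparisons, it takes a long time to run.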
current_stock = 0
for i in range(len(test)):
    if(test['Quantity'][i] != 0):
        current_stock = test['Sum'][i]
        if(test['Sum'][i-1] > 0):
            current_stock = current_stock + test['Sum'][i-1]
            test['new_sum'][i-1] = 0
            if(test['Sum'][i-2] > 0):
                current_stock = current_stock + test['Sum'][i-2]
                test['new_sum'][i-2] = 0
                if(test['Sum'][i-3] > 0):
                    current_stock = current_stock + test['Sum'][i-3]
                    test['new_sum'][i-3] = 0
                else:
                    test['new_sum'][i] = current_stock
            else:
                test['new_sum'][i] = current_stock
        else:
            test['new_sum'][i] = current_stock
    else:
        test['new_sum'][i] = test['Sum'][i]
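Is there a better way to do this?
Answer #1
Let's look at three solutions and compare their performance at the end.
One approach that tries to stay close to pandas is the following: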
def f1(df):
    # Group together the elements of df.Sum that might have to be added
    pos_groups = (df.Sum <= 0).cumsum()
    pos_groups[df.Sum <= 0] = -1
    # Create the new column and populate it with what is in df.Sum
    df['new_sum'] = df.Sum
    # Find the indices of the new column that need to be calculated as a sum
    indices = df[df.Quantity > 0].index
    for i in indices:
        # Find the relevant group of positive integers to be summed, ensuring
        # that we only consider those that come /before/ the one to be calculated
        group = pos_groups[:i+1] == pos_groups[i]
        # Zero out all the elements that will be part of the sum
        # (.loc avoids chained assignment, which may write to a temporary copy)
        df.loc[group[group].index, 'new_sum'] = 0
        # Calculate the actual sum and store that
        df.loc[i, 'new_sum'] = df.Sum[:i+1][group].sum()

f1(df)
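To make the grouping step in f1 more concrete, here is a minimal sketch that builds the group labels for the first eight Sum values from the question. The shortened sample and the standalone Series named sums are assumptions made for illustration only.

```python
import pandas as pd

# First eight Sum values from the question (indices 0 to 7).
sums = pd.Series([-12, 23, -10, -12, 100, 102, 34, -34])

# The running count of non-positive values stays constant within each run of
# consecutive positive values, so it works as a label for that run.
pos_groups = (sums <= 0).cumsum()
# Non-positive rows are never part of a sum, so they are marked with -1.
pos_groups[sums <= 0] = -1

print(pos_groups.tolist())
# [-1, 1, -1, -1, 3, 3, 3, -1]
```

Rows 4, 5 and 6 share the label 3, which is why f1 slices with [:i+1] before comparing: for i = 5, only rows 4 and 5 should take part in the sum, not row 6.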
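One place with possible room for improvement is pos_groups[:i+1] == pos_groups[i], which checks all i+1 elements even though, depending on what your data looks like, it could get away with checking only a fraction of them. Even so, this may well be more efficient in practice. If it is not, you may want to iterate manually to find the groups: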
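import numpy as np

def f2(sums, quantities):
    new_sums = np.copy(sums)
    # Positions of the rows with a positive Quantity
    indices = np.where(quantities > 0)[0]
    for i in indices:
        a = i
        # Walk backwards while the values are positive; the a >= 0 check
        # stops the scan at the first row instead of wrapping around
        while a >= 0 and sums[a] > 0:
            s = new_sums[a]
            new_sums[a] = 0
            new_sums[i] += s
            a -= 1
    return new_sums

df['new_sum'] = f2(df.Sum.values, df.Quantity.values)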
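As a quick check of the manual iteration, the following sketch runs a standalone copy of the backward scan on rows 0 to 13 of the question's data. The function name backward_sum is hypothetical, and the scan is written with an explicit a >= 0 bound so that it cannot run past the first row.

```python
import numpy as np

def backward_sum(sums, quantities):
    """Backward scan over positive values, bounded at the first row."""
    new_sums = np.copy(sums)
    for i in np.where(quantities > 0)[0]:
        a = i
        # If a reached -1 unchecked, sums[a] would index the last element.
        while a >= 0 and sums[a] > 0:
            s = new_sums[a]
            new_sums[a] = 0
            new_sums[i] += s
            a -= 1
    return new_sums

# Rows 0 to 13 of the question's data.
sums = np.array([-12, 23, -10, -12, 100, 102, 34, -34, -23, 100, 100, 102, -23, -25])
quantities = np.array([0, 0, 0, 0, 0, 201, 0, 0, 0, 0, 0, 300, 0, 0])

result = backward_sum(sums, quantities)
print(result.tolist())
# [-12, 23, -10, -12, 0, 202, 34, -34, -23, 0, 0, 302, -23, -25]

# Edge case: the data starts with positive values.
print(backward_sum(np.array([5, 7]), np.array([0, 1])).tolist())
# [0, 12]
```

For these rows the output agrees with the new_sum column requested in the question: 202 = 102 + 100 at index 5 and 302 = 102 + 100 + 100 at index 11.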
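Finally, again depending on what your data looks like, there is a good chance that the latter approach can be improved by using Numba: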
from numba import jit
f3 = jit(f2)
df['new_sum'] = f3(df.Sum.values, df.Quantity.values)
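One thing to keep in mind when trying this out: a Numba-compiled function is compiled on its first call, so that call is slower than the following ones. The sketch below illustrates this with a standalone copy of the loop. The name scan, the random test data, and the fallback to plain Python when Numba is not installed are all assumptions of this sketch.

```python
import time

import numpy as np

try:
    from numba import njit  # shorthand for jit(nopython=True)
except ImportError:
    # Assumption for this sketch: without Numba, run the plain Python version.
    def njit(func):
        return func


def scan(sums, quantities):
    new_sums = np.copy(sums)
    for i in np.where(quantities > 0)[0]:
        a = i
        while a >= 0 and sums[a] > 0:
            s = new_sums[a]
            new_sums[a] = 0
            new_sums[i] += s
            a -= 1
    return new_sums


fast_scan = njit(scan)

# Made-up test data: mostly positive sums, about 10% non-zero quantities.
rng = np.random.default_rng(0)
sums = rng.integers(-50, 200, size=100_000)
quantities = (rng.random(100_000) < 0.1).astype(np.int64)

start = time.perf_counter()
first = fast_scan(sums, quantities)   # includes compilation when Numba is used
middle = time.perf_counter()
second = fast_scan(sums, quantities)  # reuses the compiled code
end = time.perf_counter()

print(f"first call:  {middle - start:.4f} s")
print(f"second call: {end - middle:.4f} s")
```

%timeit runs the statement many times, so the one-off compilation cost is largely averaged out in measurements like the ones reported here.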
For the data provided in the question (which may be too small to give a proper picture), the performance tests look like this:
In [13]: %timeit f1(df)
5.32 ms ± 77.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [14]: %timeit df['new_sum'] = f2(df.Sum.values, df.Quantity.values)
190 µs ± 5.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [18]: %timeit df['new_sum'] = f3(df.Sum.values, df.Quantity.values)
178 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Here, most of the time is spent updating the DataFrame. If the data is 1000 times larger, the Numba solution ends up being the clear winner:
In [28]: df_large = pd.concat([df]*1000).reset_index()
In [29]: %timeit f1(df_large)
5.82 s ± 63.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [30]: %timeit df_large['new_sum'] = f2(df_large.Sum.values, df_large.Quantity.values)
6.27 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [31]: %timeit df_large['new_sum'] = f3(df_large.Sum.values, df_large.Quantity.values)
215 µs ± 5.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
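The remark that most of the time goes into updating the DataFrame can be checked by timing the column assignment on its own. The sketch below does that outside IPython with the timeit module. Using ignore_index=True instead of reset_index(), and using the Sum column as a stand-in for a precomputed result, are assumptions of this sketch.

```python
import timeit

import pandas as pd

# The 16 rows from the question.
df = pd.DataFrame({
    "Sum": [-12, 23, -10, -12, 100, 102, 34, -34,
            -23, 100, 100, 102, -23, -25, 100, 167],
    "Quantity": [0, 0, 0, 0, 0, 201, 0, 0,
                 0, 0, 0, 300, 0, 0, 123, 167],
})

# ignore_index=True renumbers the rows 0..n-1 without keeping the old index
# as an extra column (a plain reset_index() would add an 'index' column).
df_large = pd.concat([df] * 1000, ignore_index=True)

# Stand-in for a precomputed result; only the assignment is being timed.
values = df_large["Sum"].to_numpy()

def assign():
    df_large["new_sum"] = values

runs = 1000
per_call = timeit.timeit(assign, number=runs) / runs
print(f"column assignment alone: {per_call * 1e6:.1f} µs per call")
```

Comparing this number with the f2 and f3 timings on the same machine gives a rough idea of how much of each measurement is assignment overhead rather than computation.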