Up-sampling and down-sampling a PySpark DataFrame

Published: 2023-12-19 02:56:42.0

Method 1 (pandas DataFrame):

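import pandas as pd

df_class_0 = df_train[df_train['label'] == 0]
df_class_1 = df_train[df_train['label'] == 1]
count_class_0 = len(df_class_0)  # size of the majority class (label 0)
# Over-sample class 1 with replacement up to the majority-class size.
df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)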
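
The title also covers down-sampling, which the pandas snippet above does not show. A minimal sketch of that direction follows, assuming the same `label` column and that class 0 is the majority class. The toy `df_train` and the names `df_class_0_under` and `df_train_under` are made up for illustration.

```python
import pandas as pd

# Toy data, invented for illustration: 8 rows of class 0, 2 rows of class 1.
df_train = pd.DataFrame({"x": range(10), "label": [0] * 8 + [1] * 2})

df_class_0 = df_train[df_train["label"] == 0]
df_class_1 = df_train[df_train["label"] == 1]

# Down-sample: draw as many majority rows as there are minority rows,
# without replacement, so no majority row is duplicated.
df_class_0_under = df_class_0.sample(n=len(df_class_1), replace=False, random_state=2018)
df_train_under = pd.concat([df_class_0_under, df_class_1], axis=0)

print(df_train_under["label"].value_counts())
```

Down-sampling discards majority rows, so it is usually only a reasonable choice when the majority class is large enough that the lost rows do not matter.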

Method 2 (PySpark DataFrame):

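from pyspark.sql.functions import col

# Step 1. Over-sample the label = 1 rows with replacement (fraction 10.0, roughly 10x).
train_1 = train_initial.where(col('label') == 1).sample(True, 10.0, seed=2018)
# Step 2. Merge this data with the label = 0 data.
train_0 = train_initial.where(col('label') == 0)
train_final = train_0.union(train_1)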
