Counting identical column elements between two rows in Python

I am trying to write a basic script that will help me find how many columns are similar between rows. The data is very simple, something like:

array = np.array([[0, 1, 0, 0, 1, 0, 0], [0, 0, 1, 0, 1, 1, 0]])

I will have to run this over every pairing of rows in the list, so compare row 1 with row 2, row 1 with row 3, and so on.

Any help would be greatly appreciated.

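You can solve the question in your title with basic numpy techniques. Suppose you have a two-dimensional numpy array a and want to compare rows m and n: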

# a[m, :] selects row index m and all column indices, i.e. row m
row_m = a[m, :]
row_n = a[n, :]
# element-by-element comparison; the boolean result has the same shape as row_m and row_n
shared = row_m == row_n
# False is encoded as 0 and True as 1, so the sum is the number of shared elements
overlap = shared.sum()
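As a quick check, here is a minimal runnable sketch of the recipe above applied to the two rows from the question. The helper name count_matching_columns is made up for illustration and is not part of numpy.

```python
import numpy as np

# The two rows from the question, written as a proper 2-D array.
a = np.array([[0, 1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1, 0]])


def count_matching_columns(a, m, n):
    """Return how many columns hold the same value in rows m and n of a.

    Hypothetical helper, named here only for illustration.
    """
    return int((a[m, :] == a[n, :]).sum())


overlap_01 = count_matching_columns(a, 0, 1)
print(overlap_01)  # columns 0, 3, 4 and 6 agree, so this prints 4
```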

The easiest way to apply this recipe to all pairs of rows is broadcasting:

first = a[:, None, :]   # None creates a new dimension to make space for a second row axis
second = a[None, :, :]  # same, but the new dimension is the first axis
# observe that axes 0 and 1 in these two arrays are arranged as for a distance map
# a binary operation between arrays laid out like this triggers broadcasting, i.e. numpy computes all possible pairs in the appropriate positions
full_overlap_map = first == second  # has shape nrow x nrow x ncol
similarity_table = full_overlap_map.sum(axis=-1)  # shape nrow x nrow
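To make the broadcasting version concrete, here is a small sketch on a three-row array. The third row is invented for the example, and the np.triu_indices step for pulling out each unordered pair once is an addition, not part of the answer above. Note that the intermediate boolean array has nrow x nrow x ncol elements, so for a large number of rows this may use a lot of memory.

```python
import numpy as np

# First two rows are from the question; the third is made up for illustration.
a = np.array([[0, 1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1, 0],
              [0, 1, 1, 0, 1, 0, 0]])

# Broadcast every row against every other row, then count matches per pair.
similarity_table = (a[:, None, :] == a[None, :, :]).sum(axis=-1)
print(similarity_table)
# [[7 4 6]
#  [4 7 5]
#  [6 5 7]]

# The table is symmetric and its diagonal is just ncol, so the strictly
# upper triangle holds each unordered pair (i, j) with i < j exactly once.
i_idx, j_idx = np.triu_indices(a.shape[0], k=1)
pair_counts = similarity_table[i_idx, j_idx]
print(list(zip(i_idx.tolist(), j_idx.tolist(), pair_counts.tolist())))
# [(0, 1, 4), (0, 2, 6), (1, 2, 5)]
```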

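If you can rely on all rows being binary-valued, and a column counts as "similar" only when both rows hold a 1 in it, then the count of "similar columns" is simply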

def count_sim_cols(row0, row1):
    return np.sum(row0*row1)

If there may be a wider range of values, you just need to replace the product with an equality comparison:

def count_sim_cols(row0, row1):
    return np.sum(row0 == row1)
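The two definitions do not return the same number on 0/1 data: the product is 1 only where both rows are 1, while the equality test also counts columns where both rows are 0. A short sketch on the rows from the question shows the difference; the variable names are chosen here for illustration.

```python
import numpy as np

row0 = np.array([0, 1, 0, 0, 1, 0, 0])
row1 = np.array([0, 0, 1, 0, 1, 1, 0])

# Product version: counts columns where both rows are 1.
both_one = int(np.sum(row0 * row1))

# Equality version: counts columns where the rows agree, including shared 0s.
equal = int(np.sum(row0 == row1))

print(both_one, equal)  # prints: 1 4
```

Which one is appropriate depends on whether a shared 0 should count as a similarity.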

If you want some tolerance in the "similarity", say tol, some small value, then it is simply

def count_sim_cols(row0, row1):
    return np.sum(np.abs(row0 - row1) < tol)
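In the function above, tol has to exist as a global variable. A minimal variant, assuming you would rather pass the tolerance explicitly, might look like the following; the name count_sim_cols_tol, the default value and the sample rows are all made up for this sketch.

```python
import numpy as np


def count_sim_cols_tol(row0, row1, tol=0.1):
    """Count columns whose values differ by less than tol.

    Hypothetical variant of count_sim_cols with tol as a parameter.
    """
    return int(np.sum(np.abs(row0 - row1) < tol))


# Made-up float rows: the differences are 0.05, about 0.2 and 0.0.
row0 = np.array([0.0, 1.0, 0.5])
row1 = np.array([0.05, 1.2, 0.5])

n_close = count_sim_cols_tol(row0, row1, tol=0.1)
print(n_close)  # the first and third columns are within 0.1, so this prints 2
```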

You can then use a doubly nested loop to get the counts. Assuming X is a numpy array with n rows:

sim_counts = {}
for i in range(n):  # range replaces the Python 2-only xrange
    for j in range(i + 1, n):  # j > i, so each pair is visited once
        sim_counts[(i, j)] = count_sim_cols(X[i], X[j])
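The same pairwise loop can also be written with itertools.combinations from the standard library, which yields each pair (i, j) with i < j once. The sketch below is self-contained and uses the equality version of count_sim_cols; the third row of X is invented for the example.

```python
from itertools import combinations

import numpy as np


def count_sim_cols(row0, row1):
    """Count columns where the two rows hold the same value."""
    return int(np.sum(row0 == row1))


# First two rows are from the question; the third is made up for illustration.
X = np.array([[0, 1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1, 0],
              [0, 1, 1, 0, 1, 0, 0]])

n = X.shape[0]
sim_counts = {(i, j): count_sim_cols(X[i], X[j])
              for i, j in combinations(range(n), 2)}
print(sim_counts)  # {(0, 1): 4, (0, 2): 6, (1, 2): 5}
```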
