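Problem description
I'm trying to write a basic script that will help me find how many columns are similar between rows. The data is very simple, something like:
array = np.array([[0, 1, 0, 0, 1, 0, 0], [0, 0, 1, 0, 1, 1, 0]])
I will have to run this script over every pair of rows in the list, so row 1 is compared with row 2, row 1 with row 3, and so on.
Any help would be greatly appreciated.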
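Answer 1
You can solve the problem in your title with basic numpy techniques.
Suppose you have a two-dimensional numpy array a and want to compare rows m and n: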
row_m = a[m, :] # this selects row index m and all column indices, thus: row m
row_n = a[n, :]
shared = row_m == row_n # this compares row_m and row_n element-by-element storing each individual result (True or False) in a separate cell, the result thus has the same shape as row_m and row_n
overlap = shared.sum() # this sums over all elements in shared, since False is encoded as 0 and True as 1 this returns the number of shared elements.
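As a quick check, the recipe above can be run on the two rows from the question. This is only a minimal sketch: the row indices m = 0 and n = 1 are chosen for illustration, and "similar" is taken to mean "equal in that column".

```python
import numpy as np

# Sample data taken from the question: two binary rows of seven columns.
a = np.array([[0, 1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1, 0]])

m, n = 0, 1                   # rows to compare (illustrative choice)
shared = a[m, :] == a[n, :]   # boolean array, True where the columns agree
overlap = int(shared.sum())   # True counts as 1, False as 0

print(shared)   # [ True False False  True  True False  True]
print(overlap)  # 4
```

Columns 0, 3, 4 and 6 agree, so the two rows share four columns.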
The simplest way to apply this recipe to all pairs of rows is broadcasting:
first = a[:, None, :] # None creates a new dimension to make space for a second row axis
second = a[None, :, :] # Same but new dim in first axis
# observe that axes 0 and 1 in these two arrays are arranged as for a distance map
# a binary operation between arrays laid out this way will trigger broadcasting, i.e. numpy will compute all possible pairs in the appropriate positions
full_overlap_map = first == second # has shape nrow x nrow x ncol
similarity_table = full_overlap_map.sum(axis=-1) # shape nrow x nrow
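The broadcasting approach can be tried end to end on a small array. In this sketch the first two rows come from the question, while the third row is made up so that there is more than one pair to compare. The use of np.triu_indices to pull out each unordered pair once is an addition, not part of the answer above.

```python
import numpy as np

# Rows 0 and 1 come from the question; row 2 is made up for illustration.
a = np.array([[0, 1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1, 0, 0]])

first = a[:, None, :]    # shape (nrow, 1, ncol)
second = a[None, :, :]   # shape (1, nrow, ncol)

# Broadcasting expands both operands to (nrow, nrow, ncol) and compares
# every row with every other row, column by column.
full_overlap_map = first == second
similarity_table = full_overlap_map.sum(axis=-1)   # shape (nrow, nrow)
print(similarity_table)

# The table is symmetric and its diagonal is just ncol, so the strict
# upper triangle holds each unordered pair exactly once.
i_idx, j_idx = np.triu_indices(a.shape[0], k=1)
pair_counts = {(int(i), int(j)): int(similarity_table[i, j])
               for i, j in zip(i_idx, j_idx)}
print(pair_counts)   # {(0, 1): 4, (0, 2): 6, (1, 2): 3}
```

Note that the intermediate boolean array has nrow x nrow x ncol elements, so for a large number of rows it may be worth comparing rows in chunks or looping over pairs instead.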
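Answer 2
If you can rely on all rows being binary valued, then the count of "similar columns", taken to mean columns where both rows hold a 1, is simply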
def count_sim_cols(row0, row1):
    return np.sum(row0*row1)
If there could be a wider range of values, simply replace the product with an equality comparison:
def count_sim_cols(row0, row1):
    return np.sum(row0 == row1)
If you want some tolerance on "similarity", say tol, where tol is some small value, it is just
def count_sim_cols(row0, row1):
    # tol is assumed to be defined in the enclosing scope
    return np.sum(np.abs(row0 - row1) < tol)
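The three definitions give different answers on the same data, which may matter when choosing one. The sketch below runs them side by side. The function names count_shared_ones, count_equal and count_close, the explicit tol parameter, and the float rows are hypothetical and only serve the illustration.

```python
import numpy as np

def count_shared_ones(row0, row1):
    """Binary rows only: count columns where both rows hold a 1."""
    return int(np.sum(row0 * row1))

def count_equal(row0, row1):
    """Count columns where the two rows hold exactly the same value."""
    return int(np.sum(row0 == row1))

def count_close(row0, row1, tol):
    """Count columns whose values differ by less than tol."""
    return int(np.sum(np.abs(row0 - row1) < tol))

# The two binary rows from the question.
r0 = np.array([0, 1, 0, 0, 1, 0, 0])
r1 = np.array([0, 0, 1, 0, 1, 1, 0])

print(count_shared_ones(r0, r1))  # 1: only column 4 has a 1 in both rows
print(count_equal(r0, r1))        # 4: shared zeros are counted as well

# Made-up float rows to show the tolerance variant.
f0 = np.array([0.0, 1.0, 0.5])
f1 = np.array([0.0, 1.0000001, 0.7])
print(count_close(f0, f1, tol=1e-3))  # 2: the last column differs by 0.2
```

The product version ignores columns where both rows are 0, so it only matches the equality version when shared zeros should not count as similar.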
Then you can use a doubly nested loop to get the counts.
Suppose X is a numpy array with n rows:
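sim_counts = {}
n = X.shape[0]  # number of rows in X
for i in range(n):  # range replaces the Python 2 xrange
    for j in range(i + 1, n):  # j > i, so each unordered pair is visited once
        sim_counts[(i, j)] = count_sim_cols(X[i], X[j])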
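The same pairwise loop can also be written with itertools.combinations from the standard library, which produces each unordered pair of row indices once. This is an alternative sketch rather than the answer's own code. It uses the equality version of count_sim_cols, and the third row of X is made up for illustration.

```python
from itertools import combinations

import numpy as np

def count_sim_cols(row0, row1):
    """Count the columns in which two rows hold equal values."""
    return int(np.sum(row0 == row1))

# Rows 0 and 1 come from the question; row 2 is made up for illustration.
X = np.array([[0, 1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1, 0, 0]])

# combinations(range(n), 2) yields (0, 1), (0, 2), (1, 2): each pair once.
sim_counts = {
    (i, j): count_sim_cols(X[i], X[j])
    for i, j in combinations(range(X.shape[0]), 2)
}

print(sim_counts)  # {(0, 1): 4, (0, 2): 6, (1, 2): 3}
```

Unlike the broadcasting approach, this keeps only one row comparison in memory at a time, at the cost of a Python-level loop over the pairs.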