A Demonstration of the Weak-AND (WAND) Algorithm
Category: Algorithms
2013-01-14 11:38
Tags: wand, information retrieval, weak-and
This post introduces the weak-and (WAND) algorithm used in information retrieval; it has seen mature production use in advertising systems.
In short, when computing text relevance we normally query through an inverted index. The inverted index already saves a great deal of time compared with scanning every document, but it can still be slow. The reason is that we usually only want the top n results, yet documents that are clearly poor matches still go through the expensive relevance computation. WAND estimates an upper bound on each document's relevance from the per-term contribution upper bounds, and uses a threshold to prune postings from the inverted lists, which is where the speedup comes from.
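The pruning idea can be sketched in a few lines. This is a hypothetical helper, not part of the demo code below: the function name and the example upper bounds are illustrative. Sum the precomputed score upper bounds of the terms a document can match, and skip the document when that sum cannot reach the current threshold.

```python
# Illustrative sketch of the WAND pruning test (names are hypothetical).
# Each query term has a precomputed upper bound (UB) on the score it can
# contribute; a document can only enter the top-n results if the sum of
# the UBs of its matching terms reaches the current threshold.

def can_enter_top_n(matching_terms, upper_bounds, threshold):
    """Return True if the document could still make the top-n results."""
    ub_sum = sum(upper_bounds[t] for t in matching_terms)
    return ub_sum >= threshold

upper_bounds = {"t1": 1.0, "t2": 1.0, "t3": 1.0}
print(can_enter_top_n(["t1"], upper_bounds, 2.0))        # only t1 matches -> False
print(can_enter_top_n(["t1", "t2"], upper_bounds, 2.0))  # t1 and t2 match -> True
```

Any document whose upper-bound sum fails this test is skipped without ever running the full relevance computation.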
From my own tests, the gain on short texts is less pronounced than on long texts, but on movie/video data it still cut latency by 50% (top 100). The algorithm can also trade precision for further speed, which is quite nice.
The code below is a demonstration of the algorithm's principle: it implements the core logic so the algorithm's effectiveness can be verified. This implementation streamlines some of the original algorithm's logic to avoid unnecessary looping.
```python
#!/usr/bin/python
# wangben updated 20130108

class WAND:
    """Implement the WAND (weak-and) algorithm."""

    def __init__(self, InvertIndex, last_docid):
        # InvertIndex: term -> docid1, docid2, docid3 ...
        # (each list is sorted and terminated by a large sentinel docid)
        self.invert_index = InvertIndex
        self.current_doc = 0
        self.current_invert_index = {}  # term -> [current docid, position]
        self.query_terms = []
        self.threshold = 2              # fixed demo threshold (no heap)
        self.sort_terms = []
        self.LastID = 2000000000        # sentinel: larger than any real docid
        self.debug_count = 0
        self.last_docid = last_docid

    def __InitQuery(self, query_terms):
        """Reset per-query state; assumes len(query_terms) > 0."""
        self.current_doc = -1
        self.current_invert_index.clear()
        self.query_terms = query_terms
        self.sort_terms[:] = []
        self.debug_count = 0
        for term in query_terms:
            # start each cursor at the first posting of the term's inverted list
            self.current_invert_index[term] = [self.invert_index[term][0], 0]  # [docid, index]

    def __SortTerms(self):
        # Build the (docid, term) list once, then re-sort it on every
        # iteration because __AdvanceTerm updates the docids in place.
        if len(self.sort_terms) == 0:
            for term in self.query_terms:
                if term in self.current_invert_index:
                    doc_id = self.current_invert_index[term][0]
                    self.sort_terms.append([int(doc_id), term])
        self.sort_terms.sort()

    def __PickTerm(self, pivot_index):
        # Simplified: always advance the term with the smallest docid.
        # A production implementation would pick the term whose advance
        # skips the most postings.
        return 0

    def __FindPivotTerm(self):
        # Every term contributes an upper bound of 1 in this demo, so the
        # pivot is simply the threshold-th term in docid order.
        score = 0
        for i in range(len(self.sort_terms)):
            score += 1
            if score >= self.threshold:
                return [self.sort_terms[i][1], i]
        return [None, len(self.sort_terms)]

    def __IteratorInvertIndex(self, change_term, docid, pos):
        """Advance within change_term's posting list to the first entry >= docid."""
        doc_list = self.invert_index[change_term]
        for i in range(pos, len(doc_list)):
            if doc_list[i] >= docid:
                pos = i
                docid = doc_list[i]
                break
        return [docid, pos]

    def __AdvanceTerm(self, change_index, docid):
        change_term = self.sort_terms[change_index][1]
        pos = self.current_invert_index[change_term][1]
        (new_doc, new_pos) = \
            self.__IteratorInvertIndex(change_term, docid, pos)
        self.current_invert_index[change_term] = [new_doc, new_pos]
        self.sort_terms[change_index][0] = new_doc

    def __Next(self):
        if self.last_docid == self.current_doc:
            return None
        while True:
            self.debug_count += 1
            # sort terms by current docid
            self.__SortTerms()
            # find the pivot term: the first whose cumulative upper bound
            # reaches the threshold
            (pivot_term, pivot_index) = self.__FindPivotTerm()
            if pivot_term is None:
                # no more candidates
                return None
            # debug info: current docid of every term up to the pivot
            for i in range(pivot_index + 1):
                print(self.sort_terms[i][0], self.sort_terms[i][1], "|", end=" ")
            print()
            pivot_doc_id = self.current_invert_index[pivot_term][0]
            if pivot_doc_id == self.LastID:
                # the pivot's posting list is exhausted
                return None
            if pivot_doc_id <= self.current_doc:
                change_index = self.__PickTerm(pivot_index)
                self.__AdvanceTerm(change_index, self.current_doc + 1)
            else:
                first_docid = self.sort_terms[0][0]
                if pivot_doc_id == first_docid:
                    self.current_doc = pivot_doc_id
                    return self.current_doc
                else:
                    # advance all terms preceding the pivot to the pivot docid
                    for i in range(pivot_index):
                        self.__AdvanceTerm(i, pivot_doc_id)

    def DoQuery(self, query_terms):
        self.__InitQuery(query_terms)
        while True:
            candidate_docid = self.__Next()
            if candidate_docid is None:
                break
            print("candidate_docid:", candidate_docid)
            # a real implementation would insert candidate_docid into a
            # top-n heap here and update self.threshold
        print("debug_count:", self.debug_count)


if __name__ == "__main__":
    testIndex = {}
    testIndex["t1"] = [0, 1, 2, 3, 6, 2000000000]
    testIndex["t2"] = [3, 4, 5, 6, 2000000000]
    testIndex["t3"] = [2, 5, 2000000000]
    testIndex["t4"] = [4, 6, 2000000000]
    w = WAND(testIndex, 6)
    w.DoQuery(["t1", "t2", "t3", "t4"])
```
The output shows the number of loop iterations inside __Next, as well as the docids finally selected as candidates.
The heap-building step is omitted here: a fixed threshold of 2 is used to filter documents, and the upper bound (UB) between a candidate doc and the query is simply the count of shared terms. This is only a demonstration of the algorithm; in practice you need to adapt it to your own relevance formula. (Note on the upper bound: in a preprocessing step, compute for each term the maximum relevance score it can contribute under any co-occurrence, and use that value as the term's UB.)
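The omitted heap step can be sketched as follows. This is a minimal, hypothetical helper (the name `update_topn` and the scores are illustrative, not from the demo): keep a min-heap of the n best results, and once the heap is full, use its smallest score as the new pruning threshold.

```python
import heapq

def update_topn(heap, n, docid, score):
    """Maintain a min-heap of the n best (score, docid) pairs and return
    the new threshold: the smallest score still inside the top-n, or 0
    while the heap is not yet full (so nothing is pruned too early)."""
    if len(heap) < n:
        heapq.heappush(heap, (score, docid))
    elif score > heap[0][0]:
        # candidate beats the current worst top-n entry: swap it in
        heapq.heapreplace(heap, (score, docid))
    return heap[0][0] if len(heap) == n else 0

heap = []
threshold = 0
for docid, score in [(2, 2.0), (3, 2.0), (4, 2.0), (5, 3.0), (6, 4.0)]:
    threshold = update_topn(heap, 3, docid, score)
print(sorted(heap, reverse=True))  # prints [(4.0, 6), (3.0, 5), (2.0, 4)]
print(threshold)                   # prints 2.0
```

As the heap fills and better candidates arrive, the threshold rises, so later iterations of the WAND loop can skip ever-larger stretches of the posting lists.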
The output below shows how the docids evolve, along with the final loop count of 14:
```
0 t1 | 2 t3 |
2 t1 | 2 t3 |
candidate_docid: 2
2 t1 | 2 t3 |
2 t3 | 3 t1 |
3 t1 | 3 t2 |
candidate_docid: 3
3 t1 | 3 t2 |
3 t2 | 4 t4 |
4 t2 | 4 t4 |
candidate_docid: 4
4 t2 | 4 t4 |
4 t4 | 5 t2 |
5 t2 | 5 t3 |
candidate_docid: 5
5 t2 | 5 t3 |
5 t3 | 6 t1 |
6 t1 | 6 t2 |
candidate_docid: 6
debug_count: 14
```