Keyword Extraction Based on the TextRank Algorithm: Summary

2021SC@SDUSC

Source code:

    def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False):
        """
        Extract keywords from sentence using TextRank algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
                        if the POS of w is not in this list, it will be filtered.
            - withFlag: if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """
        self.pos_filt = frozenset(allowPOS)
        g = UndirectWeightedGraph()
        cm = defaultdict(int)
        words = tuple(self.tokenizer.cut(sentence))
        #===========================================================================
        for i, wp in enumerate(words):
            if self.pairfilter(wp):
                for j in xrange(i + 1, i + self.span):
                    if j >= len(words):
                        break
                    if not self.pairfilter(words[j]):
                        continue
                    if allowPOS and withFlag:
                        cm[(wp, words[j])] += 1
                    else:
                        cm[(wp.word, words[j].word)] += 1

        for terms, w in cm.items():
            g.addEdge(terms[0], terms[1], w)
        nodes_rank = g.rank()
        #==========================================================================
        if withWeight:
            tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)

        if topK:
            return tags[:topK]
        else:
            return tags

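textrank() takes the same five parameters as the TF-IDF-based extract_tags(), and they are used in essentially the same way. The only difference is the default value of allowPOS: in textrank() it is ('ns', 'n', 'vn', 'v'), so by default only words with these four part-of-speech tags (place name, noun, verbal noun, verb) are kept.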

The #============= comment lines divide the source into three parts.

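Part one:
pos_filt stores allowPOS as a frozenset. g is a newly created undirected weighted graph. cm is a defaultdict(int), a dictionary whose missing keys default to 0, so incrementing a key that has not been seen yet does not raise a KeyError.
words holds the result of segmenting sentence with posseg.cut(sentence), collected into a tuple.
self.tokenizer is set as follows: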

    self.tokenizer = self.postokenizer = jieba.posseg.dt
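The two standard-library helpers used in part one can be tried on their own. The following is a minimal sketch, not jieba code: the sample word pairs are made up for illustration.

```python
from collections import defaultdict

# frozenset gives fast, immutable membership tests for the allowed POS tags.
pos_filt = frozenset(('ns', 'n', 'vn', 'v'))

# defaultdict(int) returns 0 for a missing key, so "+= 1" never raises KeyError.
cm = defaultdict(int)
cm[('graph', 'rank')] += 1   # first time this pair is seen: 0 + 1
cm[('graph', 'rank')] += 1   # second time: 1 + 1
cm[('rank', 'word')] += 1

print('n' in pos_filt, 'a' in pos_filt)  # True False
print(dict(cm))
```

With a plain dict the first += 1 would fail, which is the error the defaultdict avoids.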

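Part two:
enumerate() wraps an iterable (such as a list, tuple, or string) and yields (index, item) pairs, so the loop receives each word together with its position.
Source of pairfilter(wp):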

    def pairfilter(self, wp):
        return (wp.flag in self.pos_filt and len(wp.word.strip()) >= 2
                and wp.word.lower() not in self.stop_words)

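pairfilter() returns True if the word's part of speech is in pos_filt, the word is at least two characters long after stripping whitespace, and its lowercase form is not in stop_words; otherwise it returns False.
It decides whether a segmented word qualifies to be added to the undirected graph.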
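The filter can be exercised outside jieba with a small stand-in. This is a sketch under stated assumptions: Pair is a hypothetical namedtuple that mimics the word/flag attributes of the objects produced by posseg, and the stop-word set is invented for the example.

```python
from collections import namedtuple

# Hypothetical stand-in for jieba.posseg's pair objects (word + POS flag).
Pair = namedtuple('Pair', ['word', 'flag'])

pos_filt = frozenset(('ns', 'n', 'vn', 'v'))
stop_words = {'the', 'of'}  # illustrative stop-word set, not jieba's default

def pairfilter(wp):
    """Keep a token only if its POS is allowed, it has at least 2 characters
    after stripping whitespace, and its lowercase form is not a stop word."""
    return (wp.flag in pos_filt
            and len(wp.word.strip()) >= 2
            and wp.word.lower() not in stop_words)

print(pairfilter(Pair('graph', 'n')))    # True
print(pairfilter(Pair('a', 'n')))        # False: shorter than 2 characters
print(pairfilter(Pair('quickly', 'd')))  # False: POS 'd' is not allowed
print(pairfilter(Pair('The', 'n')))      # False: stop word after lower()
```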

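For each word at index i that passes pairfilter(), the inner for loop examines the next self.span - 1 words (span defaults to 5, so up to four words). The loop stops as soon as index j runs past the end of words, and it skips any words[j] that fails pairfilter(). For every remaining pair, the co-occurrence count in cm is incremented by 1 (cm is a defaultdict(int), so each count starts at 0). When allowPOS and withFlag are both truthy, the key is the pair of word/flag objects; otherwise it is the pair of plain word strings.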
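The windowed counting can be sketched as a standalone function. The names cooccurrence and keep are hypothetical: keep stands in for pairfilter(), and plain strings stand in for the segmented words.

```python
from collections import defaultdict

def cooccurrence(words, keep, span=5):
    """Count pairs (words[i], words[j]) with i < j < i + span, both kept."""
    cm = defaultdict(int)
    for i, w in enumerate(words):
        if not keep(w):
            continue
        for j in range(i + 1, i + span):
            if j >= len(words):      # ran past the end of the text
                break
            if not keep(words[j]):   # filtered word: skip it, keep scanning
                continue
            cm[(w, words[j])] += 1
    return cm

def keep(w):
    return len(w) >= 2               # toy filter: drop one-character tokens

# 'x' is filtered out; 'aa' reaches at most index 4, so it never pairs with 'ee'.
cm = cooccurrence(['aa', 'bb', 'x', 'cc', 'dd', 'ee', 'ff'], keep)
print(('aa', 'dd') in cm, ('aa', 'ee') in cm)  # True False

# With span=2 only adjacent words are paired; repeated pairs add up.
cm2 = cooccurrence(['aa', 'bb', 'aa', 'bb'], keep, span=2)
print(cm2[('aa', 'bb')], cm2[('bb', 'aa')])    # 2 1
```

Note that the keys are ordered pairs, so ('aa', 'bb') and ('bb', 'aa') are counted separately; both end up as edges between the same two nodes once they are added to an undirected graph.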

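Each pair in cm is then added to the undirected graph as an edge whose weight is its count, and g.rank() computes a score for every node.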
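To give a feel for what a ranking step of this kind does, here is a minimal weighted PageRank-style sketch on an undirected graph. It is not jieba's UndirectWeightedGraph: the function name rank, the damping factor 0.85, and the 10 iterations are assumptions made for the example, and jieba's own rank() may differ in details such as score normalization.

```python
from collections import defaultdict

def rank(edges, d=0.85, iterations=10):
    """Weighted PageRank-style scoring on an undirected graph.

    edges maps (a, b) -> weight; d and iterations are assumed values.
    """
    graph = defaultdict(list)
    for (a, b), w in edges.items():
        graph[a].append((b, w))   # undirected: store the edge both ways
        graph[b].append((a, w))

    # Total edge weight leaving each node, used to share out its score.
    out_sum = {n: sum(w for _, w in nbrs) for n, nbrs in graph.items()}
    score = {n: 1.0 / len(graph) for n in graph}

    for _ in range(iterations):
        for n in sorted(graph):   # fixed order keeps the result deterministic
            s = sum(w / out_sum[m] * score[m] for m, w in graph[n])
            score[n] = (1 - d) + d * s
    return score

# 'bb' is connected to every other node, so it should score highest.
edges = {('aa', 'bb'): 3, ('bb', 'cc'): 1, ('bb', 'dd'): 1}
scores = rank(edges)
best = max(scores, key=scores.get)
print(best)  # bb
```

A node gains score from each neighbour in proportion to the edge weight, which is why 'aa' (weight 3) ends up above 'cc' and 'dd' (weight 1 each).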

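Part three:

The result is post-processed according to the values of withWeight and topK.

If withWeight is True, nodes_rank.items() is sorted by score in descending order, giving a list of (word, weight) tuples.

Otherwise only the keys of nodes_rank are sorted by score in descending order, giving a list of words.

The first topK keywords are returned; if topK is None (or 0), all results are returned.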
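The post-processing in part three can be reproduced with a plain dictionary. In this sketch top_tags is a hypothetical helper and the scores are made up; only the two sorted() calls and the topK slice mirror the source above.

```python
from operator import itemgetter

# Made-up scores standing in for the dictionary returned by g.rank().
nodes_rank = {'graph': 1.0, 'rank': 0.62, 'word': 0.87, 'edge': 0.35}

def top_tags(nodes_rank, topK=20, withWeight=False):
    if withWeight:
        # Sort (word, weight) tuples by the weight at index 1.
        tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
    else:
        # Sort the keys only, looking up each key's weight for comparison.
        tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)
    return tags[:topK] if topK else tags

print(top_tags(nodes_rank, topK=2))                   # ['graph', 'word']
print(top_tags(nodes_rank, topK=2, withWeight=True))  # [('graph', 1.0), ('word', 0.87)]
print(top_tags(nodes_rank, topK=None))                # ['graph', 'word', 'rank', 'edge']
```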

Test example:

Test code:

    import jieba.analyse as analyse
    import jieba

    jieba.initialize()
    boundary = "=" * 40

    # Read the text as bytes; jieba decodes it internally.
    with open('lyric.txt', 'rb') as f:
        content = f.read()

    print(boundary)
    print(1, ", topK=5")
    tags = analyse.textrank(content, topK=5)
    print(tags)

    print(boundary)
    print(2, ", topK=15")
    tags = analyse.textrank(content, topK=15)
    print(tags)

    print(boundary)
    print(3, ", topK=5, withWeight=True")
    tags = analyse.textrank(content, topK=5, withWeight=True)
    print(tags)

    print(boundary)
    print(4, ", topK=5, withWeight=True, allowPOS=('n','v')")
    tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'))
    print(tags)

    print(boundary)
    print(5, ", topK=5, withWeight=True, allowPOS=('n','v'), withFlag=True")
    tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'), withFlag=True)
    print(tags)

    print(boundary)
    print(6, ", topK=5, withWeight=True, allowPOS=('n','v'), withFlag=True, set_idf_path()")
    # set_idf_path() configures the TF-IDF extractor; TextRank does not use
    # IDF weights, so this result is expected to match test 5.
    analyse.set_idf_path("../extra_dict/idf.txt.big")
    tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'), withFlag=True)
    print(tags)

    print(boundary)
    print(7, ", topK=5, withWeight=True, allowPOS=('n','v'), withFlag=True, set_stop_words()")
    # set_stop_words() changes the stop-word list that pairfilter() checks.
    analyse.set_stop_words("../extra_dict/stop_words.txt")
    tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'), withFlag=True)
    print(tags)

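Note: because allowPOS defaults to a non-empty tuple in textrank(), withFlag takes effect on its own. In extract_tags(), by contrast, withFlag only has an effect when allowPOS is also supplied.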