
Keyword Extraction Based on the TextRank Algorithm

Published: 2023-11-21 17:33:48

2021SC@SDUSC


Source code:

    def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False):
        """
        Extract keywords from sentence using TextRank algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
                        if the POS of w is not in this list, it will be filtered.
            - withFlag: if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """
        self.pos_filt = frozenset(allowPOS)
        g = UndirectWeightedGraph()
        cm = defaultdict(int)
        words = tuple(self.tokenizer.cut(sentence))
        #===========================================================================
        for i, wp in enumerate(words):
            if self.pairfilter(wp):
                for j in xrange(i + 1, i + self.span):
                    if j >= len(words):
                        break
                    if not self.pairfilter(words[j]):
                        continue
                    if allowPOS and withFlag:
                        cm[(wp, words[j])] += 1
                    else:
                        cm[(wp.word, words[j].word)] += 1

        for terms, w in cm.items():
            g.addEdge(terms[0], terms[1], w)
        nodes_rank = g.rank()
        #==========================================================================
        if withWeight:
            tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)

        if topK:
            return tags[:topK]
        else:
            return tags

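The five parameters of textrank() are used in much the same way as those of the TF-IDF based extract_tags(). The only difference is the default value of allowPOS: in textrank() it is ('ns', 'n', 'vn', 'v'), so by default only words with these four POS tags are considered.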

The #============= comment lines split the source into three parts.

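Part 1:
pos_filt stores allowPOS as a frozenset. g is a newly created undirected weighted graph. cm is a defaultdict(int), a dictionary whose missing keys start at 0, so incrementing a pair that has not been seen before does not raise a KeyError.
words is the tuple of results returned by self.tokenizer.cut(sentence), that is, the sentence segmented with POS tags in the manner of posseg.cut(sentence).
self.tokenizer is assigned in the constructor of the TextRank class:

self.tokenizer = self.postokenizer = jieba.posseg.dt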
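The two containers set up in Part 1 can be tried in isolation. The following is a minimal sketch using only the standard library; the word pairs are made-up sample data, not jieba output.

```python
from collections import defaultdict

# frozenset gives immutable, fast membership tests for the allowed POS tags.
pos_filt = frozenset(('ns', 'n', 'vn', 'v'))

# defaultdict(int) yields 0 for a missing key, so "+= 1" works on first use.
cm = defaultdict(int)
cm[("keyword", "extraction")] += 1
cm[("keyword", "extraction")] += 1
cm[("graph", "rank")] += 1

print('n' in pos_filt, 'a' in pos_filt)
print(dict(cm))
```

With a plain dict, the first `+= 1` on an unseen pair would raise a KeyError, which is what the defaultdict avoids.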

Part 2:
enumerate() wraps an iterable (such as a list, tuple, or string) and yields each element together with its index.
Source of pairfilter(wp):

    def pairfilter(self, wp):
        return (wp.flag in self.pos_filt and len(wp.word.strip()) >= 2
                and wp.word.lower() not in self.stop_words)

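pairfilter returns True when the word's POS tag is in pos_filt, the word is at least two characters long after stripping whitespace, and its lowercase form is not in stop_words; otherwise it returns False.
It decides whether a segmented word qualifies to be added to the undirected graph.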
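The filter can be exercised without jieba. The sketch below assumes a hypothetical Pair namedtuple as a stand-in for the pair objects produced by posseg, and a tiny made-up stop-word set.

```python
from collections import namedtuple

# Hypothetical stand-in for jieba.posseg pair objects (word plus POS flag).
Pair = namedtuple("Pair", ["word", "flag"])

def pair_filter(wp, pos_filt, stop_words):
    """Return True when wp may become a node of the graph."""
    return (wp.flag in pos_filt
            and len(wp.word.strip()) >= 2
            and wp.word.lower() not in stop_words)

pos_filt = frozenset(('ns', 'n', 'vn', 'v'))
stop_words = {"the", "of"}  # made-up sample stop words

candidates = [
    Pair("algorithm", "n"),  # passes all three checks
    Pair("a", "n"),          # too short
    Pair("quickly", "d"),    # POS tag not allowed
    Pair("The", "n"),        # stop word after lowercasing
]
kept = [wp.word for wp in candidates if pair_filter(wp, pos_filt, stop_words)]
print(kept)
```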

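For the word at index i, the inner for loop examines the following self.span - 1 words (self.span defaults to 5), that is, indices i + 1 through i + self.span - 1. The loop stops as soon as j reaches the end of words. For every words[j] that also passes pairfilter, the count stored in cm for the pair (words[i], words[j]) is incremented by 1, starting from the defaultdict's initial value of 0. This count becomes the weight of the undirected edge between the two words.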
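The windowed counting can be sketched on plain strings. This is an illustration only: the function name cooccurrence and the keep predicate are assumptions made for the example and replace pairfilter.

```python
from collections import defaultdict

def cooccurrence(words, span=5, keep=lambda w: True):
    """Count ordered pairs of kept words that lie within the span window."""
    cm = defaultdict(int)
    for i, w in enumerate(words):
        if not keep(w):
            continue
        for j in range(i + 1, i + span):
            if j >= len(words):      # ran past the end of the text
                break
            if not keep(words[j]):   # skip filtered-out neighbours
                continue
            cm[(w, words[j])] += 1
    return cm

tokens = ["graph", "of", "words", "graph", "of", "words"]
cm = cooccurrence(tokens, span=3, keep=lambda w: len(w) >= 3)
print(dict(cm))
```

The keys are ordered pairs, so ("graph", "words") and ("words", "graph") are counted separately at this stage.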

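Each pair in cm is then added to the undirected graph with g.addEdge(), using its count as the edge weight, and g.rank() computes the TextRank score of every node.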
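The ranking step can be sketched as a PageRank-style iteration over an undirected weighted graph. This is a simplified illustration, not jieba's UndirectWeightedGraph: the damping factor 0.85 and the 10 iterations are assumptions chosen for the example, and no final normalization is applied.

```python
from collections import defaultdict

def rank(edges, d=0.85, iterations=10):
    """Score nodes of an undirected weighted graph (simplified sketch)."""
    graph = defaultdict(list)
    for (a, b), w in edges.items():
        graph[a].append((b, w))  # store the edge in both directions
        graph[b].append((a, w))

    scores = {n: 1.0 / len(graph) for n in graph}
    out_sum = {n: sum(w for _, w in nbrs) for n, nbrs in graph.items()}

    for _ in range(iterations):
        for n in sorted(graph):
            # Each neighbour m passes on its score in proportion to the edge weight.
            s = sum(w / out_sum[m] * scores[m] for m, w in graph[n])
            scores[n] = (1 - d) + d * s
    return scores

# Made-up co-occurrence counts.
edges = {("graph", "words"): 2, ("words", "rank"): 1, ("graph", "rank"): 1}
scores = rank(edges)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Nodes that are connected by heavier edges to other well-connected nodes end up with higher scores, which is why "rank" scores lowest in this toy graph.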

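Part 3:

The result is post-processed according to the values of withWeight and topK.

If withWeight is True, nodes_rank.items() is sorted by weight in descending order, giving a list of (word, weight) pairs.

Otherwise only the keys of nodes_rank are sorted, again by descending weight, giving a list of words.

The first topK keywords are returned; if topK is None (or otherwise falsy), all results are returned.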
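The sorting and truncation logic can be tried on a hand-written score dictionary. In this sketch the helper name top_tags and the scores are made up for illustration; the sorting calls mirror the ones in the source above.

```python
from operator import itemgetter

def top_tags(nodes_rank, topK=20, withWeight=False):
    """Sort scored words in descending order and keep the first topK."""
    if withWeight:
        tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
    else:
        tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)
    return tags[:topK] if topK else tags

nodes_rank = {"graph": 0.9, "rank": 0.4, "words": 1.0}  # made-up scores

print(top_tags(nodes_rank, topK=2))                   # ['words', 'graph']
print(top_tags(nodes_rank, topK=2, withWeight=True))  # [('words', 1.0), ('graph', 0.9)]
print(top_tags(nodes_rank, topK=None))                # ['words', 'graph', 'rank']
```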

Test example:

 Test code:

import jieba
import jieba.analyse as analyse

jieba.initialize()  # load the dictionary up front instead of on first use
boundary = "=" * 40

# Read the sample text as bytes; jieba decodes bytes input itself.
with open('lyric.txt', 'rb') as f:
    content = f.read()

print(boundary)
print(1, ",topK=5")
tags = analyse.textrank(content, topK=5)
print(tags)

print(boundary)
print(2, ",topK=15")
tags = analyse.textrank(content, topK=15)
print(tags)

print(boundary)
print(3, ",topK=5,withWeight = True")
tags = analyse.textrank(content, topK=5, withWeight=True)
print(tags)

print(boundary)
print(4, ",topK=5,withWeight = True,allowPOS = ('n','v')")
tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'))
print(tags)

print(boundary)
print(5, ",topK=5,withWeight = True,allowPOS = ('n','v'),withFlag=True")
tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'), withFlag=True)
print(tags)

print(boundary)
print(6, ",topK=5,withWeight = True,allowPOS = ('n','v'),withFlag=True,set_idf_path()")
# set_idf_path() configures the TF-IDF extractor, so it is not expected
# to change the TextRank output compared with case 5.
analyse.set_idf_path("../extra_dict/idf.txt.big")
tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'), withFlag=True)
print(tags)

print(boundary)
print(7, ",topK=5,withWeight = True,allowPOS = ('n','v'),withFlag=True,set_stop_words()")
# set_stop_words() changes the stop-word list checked by pairfilter.
analyse.set_stop_words("../extra_dict/stop_words.txt")
tags = analyse.textrank(content, topK=5, withWeight=True, allowPOS=('n', 'v'), withFlag=True)
print(tags)

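Note: because allowPOS has a non-empty default value, withFlag takes effect in textrank() without allowPOS being passed explicitly. In extract_tags(), by contrast, withFlag only takes effect when allowPOS is also supplied.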