jieba library: the Tokenizer() class explained (3): adding and deleting dictionary words

Published: 2023-11-21 17:32:54

2021SC@SDUSC


Source code:

    def add_word(self, word, freq=None, tag=None):
        """
        Add a word to dictionary.

        freq and tag can be omitted, freq defaults to be a calculated value
        that ensures the word can be cut out.
        """
        # Make sure the dictionary has been loaded
        self.check_initialized()
        # Decode the word to a text string
        word = strdecode(word)
        # If freq is None, use the return value of suggest_freq();
        # otherwise use freq itself
        freq = int(freq) if freq is not None else self.suggest_freq(word, False)
        # Store the word in the frequency dictionary
        self.FREQ[word] = freq
        self.total += freq
        # Record the part-of-speech tag
        if tag:
            self.user_word_tag_tab[word] = tag
        # Add every prefix of word that is not yet in the dictionary,
        # with frequency 0
        for ch in xrange(len(word)):
            wfrag = word[:ch + 1]
            if wfrag not in self.FREQ:
                self.FREQ[wfrag] = 0
        # Used when deleting a word
        if freq == 0:
            finalseg.add_force_split(word)

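As in the earlier methods, the first step checks whether jieba has been initialized, because the dictionary is only loaded during initialization.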

Source code of strdecode(sentence):

    def strdecode(sentence):
        if not isinstance(sentence, text_type):
            try:
                sentence = sentence.decode('utf-8')
            except UnicodeDecodeError:
                sentence = sentence.decode('gbk', 'ignore')
        return sentence

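If sentence is not already a text string, it is decoded as 'utf-8'. If that raises UnicodeDecodeError, it is decoded as 'gbk' instead, with undecodable bytes ignored.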
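The fallback can be tried outside jieba. The following is a minimal sketch for Python 3, assuming that `text_type` is simply `str` there. The sample byte strings are made up for illustration.

```python
def strdecode(sentence):
    """Return text unchanged; decode bytes as UTF-8, falling back to GBK."""
    # Assumption: under Python 3, text_type is str.
    if not isinstance(sentence, str):
        try:
            sentence = sentence.decode('utf-8')
        except UnicodeDecodeError:
            # GBK-encoded input is usually not valid UTF-8, so it ends up here.
            sentence = sentence.decode('gbk', 'ignore')
    return sentence


utf8_bytes = '结巴'.encode('utf-8')
gbk_bytes = '结巴'.encode('gbk')

from_utf8 = strdecode(utf8_bytes)   # decoded on the first attempt
from_gbk = strdecode(gbk_bytes)     # decoded by the GBK fallback
from_text = strdecode('结巴')        # already text, returned as-is
```

All three calls return the same text string, which is why add_word can accept either bytes or text.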

If freq is None, add_word calls suggest_freq(word, False) to obtain a frequency at which the word can be recognized. That value is then stored as the word's frequency in the FREQ dictionary.
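Note that the test is `freq is not None`, not a truthiness check, so an explicit 0 is kept rather than replaced. A small sketch of that expression on its own follows; `resolve_freq` and the `fallback` callable are hypothetical names standing in for the call to suggest_freq, not part of jieba.

```python
def resolve_freq(freq, fallback):
    """Mirror the expression used in add_word (hypothetical helper)."""
    # fallback stands in for self.suggest_freq(word, False).
    return int(freq) if freq is not None else fallback()


def suggested():
    # Made-up value; the real one is computed by suggest_freq.
    return 7


omitted = resolve_freq(None, suggested)     # falls back to the suggestion
explicit_zero = resolve_freq(0, suggested)  # 0 is not None, so it is kept
from_string = resolve_freq("12", suggested) # int() accepts numeric strings
```

Keeping an explicit 0 matters because a frequency of 0 is how a word gets deleted.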

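If tag is None (or any other falsy value, such as an empty string), the word's part-of-speech tag is not added to the self.user_word_tag_tab dictionary.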

In other words, if you want to add a word and have it recognized, you can omit its frequency in the custom dictionary entirely.
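The prefix loop near the end of add_word is easier to follow on a plain dict. The sketch below isolates that step; `register_prefixes`, `freq_table` and the sample entries are hypothetical and only illustrate the logic.

```python
def register_prefixes(freq_table, word):
    """Add each prefix of word that is missing from freq_table, with frequency 0."""
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix not in freq_table:
            freq_table[prefix] = 0


# Made-up starting state: the full word and its first character already exist.
freq_table = {"石墨烯": 5, "石": 100}
register_prefixes(freq_table, "石墨烯")
```

Existing entries keep their frequencies. Only the missing prefix "石墨" is added, with frequency 0, so it is present as a key without counting as a word in its own right.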


Deleting a word:

Source code:

    def del_word(self, word):
        """
        Convenient function for deleting a word.
        """
        # Set the frequency to 0, which makes add_word call
        # finalseg.add_force_split(word)
        self.add_word(word, 0)
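Deletion therefore never removes the key. It only overwrites the frequency with 0 and registers the word for forced splitting. The toy class below is a sketch of that bookkeeping under simplifying assumptions: `ToyDictionary` is hypothetical, and a plain set stands in for finalseg.add_force_split.

```python
class ToyDictionary:
    """Hypothetical stand-in for the parts of Tokenizer that del_word touches."""

    def __init__(self):
        self.FREQ = {}
        self.total = 0
        # Assumption: a set replaces finalseg.add_force_split for illustration.
        self.force_split_words = set()

    def add_word(self, word, freq):
        self.FREQ[word] = freq
        self.total += freq
        if freq == 0:
            self.force_split_words.add(word)

    def del_word(self, word):
        # Same trick as the source: deleting is adding with frequency 0.
        self.add_word(word, 0)


d = ToyDictionary()
d.add_word("自定义词", 3)
d.del_word("自定义词")
```

After the call, "自定义词" is still a key in FREQ with value 0, and it is in the force-split set. In this sketch, as in the source shown above, total only has 0 added to it, so the frequency of 3 counted earlier is not subtracted.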
