I. Theory
The theory behind GloVe has already been written up well by others; here is a link for easy reference:
https://blog.csdn.net/coderTC/article/details/73864097
Comparing GloVe with skip-gram and CBOW
CBOW and Skip-Gram are local-context-window methods. When trained with negative sampling (NS), for example, they miss the global word-to-word relationships, and because negative examples are drawn by sampling, part of the relational information between words is lost.
In addition, training Skip-Gram-style algorithms directly tends to give overly high weight to high-frequency words.
GloVe (Global Vectors) combines the global statistics of matrix-factorization approaches such as Latent Semantic Analysis (LSA) with the advantages of a local context window. Incorporating global prior statistics both speeds up training and makes it possible to control the relative weight of words.
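For reference, the weighted least-squares objective from the original GloVe paper (Pennington et al., 2014), which makes that last point concrete. Here X_{ij} is the global co-occurrence count of words i and j, w_i and \tilde{w}_j are the word and context vectors, and b_i, \tilde{b}_j are bias terms:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha}, & x < x_{\max} \\ 1, & \text{otherwise} \end{cases}

The weighting function f (the paper uses x_max = 100 and alpha = 0.75) caps the contribution of very frequent co-occurrences, which is exactly how GloVe keeps high-frequency words from dominating.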
II. GloVe in Practice
1. Preparing the development environment
I am using Ubuntu 16.04 with Python 2.7 installed.
First, open a terminal and install gensim:
sudo easy_install --upgrade gensim
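To confirm the installation worked, a quick sanity check (a minimal sketch; any gensim version should print its version string):

import gensim
print(gensim.__version__)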
2、下载官方的代码
官方的代码的GitHub在此 : https://github.com/stanfordnlp/GloVe
该代码为c的版本,并且跑在linux下
3. Generating the word vectors
Open a terminal in the GloVe folder and compile by running:
make
This generates a build folder.
Then simply run demo.sh:
sh demo.sh
Inside demo.sh you can set the path to the training corpus (by default the script downloads a corpus from the web; delete that part and point it at your own corpus instead), as well as the number of iterations, the vector dimensionality, and so on. Feel free to experiment; a corpus-preparation sketch follows below.
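GloVe's tools expect a plain-text corpus of whitespace-separated tokens (the demo's default corpus, text8, is already in this form). If you want to train on your own raw English text, something like the following minimal sketch can produce such a file. The file names raw_corpus.txt and corpus.txt are placeholders, not from the original post, and the tokenization is deliberately naive (lowercase, keep only alphanumeric runs):

# -*- coding: utf-8 -*-
# Naive corpus preparation: lowercase the text, keep only alphanumeric tokens,
# and write them out whitespace-separated, one line of output per input line.
import re

with open('raw_corpus.txt', 'r') as fin, open('corpus.txt', 'w') as fout:
    for line in fin:
        tokens = re.findall(r"[a-z0-9]+", line.lower())
        if tokens:
            fout.write(' '.join(tokens) + '\n')

Then point the corpus path in demo.sh at corpus.txt.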
4. Building a model from the word vectors and loading it for testing
First, this assumes you already have Python and gensim installed and know how to use word2vec. The conversion and loading script is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import shutil
import gensim


def getFileLineNums(filename):
    # Count the number of word vectors (one per line) in the GloVe output file.
    f = open(filename, 'r')
    count = 0
    for line in f:
        count += 1
    return count


def prepend_line(infile, outfile, line):
    """
    Function used to prepend lines using bash utilities in Linux.
    (source: http://stackoverflow.com/a/10850588/610569)
    """
    with open(infile, 'r') as old:
        with open(outfile, 'w') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)


def prepend_slow(infile, outfile, line):
    """
    Slower way to prepend the line by re-creating the input file.
    """
    with open(infile, 'r') as fin:
        with open(outfile, 'w') as fout:
            fout.write(line + "\n")
            for line in fin:
                fout.write(line)


def load(filename):
    # Input: GloVe model file.
    # More models can be downloaded from http://nlp.stanford.edu/projects/glove/
    # glove_file = "glove.840B.300d.txt"
    glove_file = filename
    num_lines = getFileLineNums(filename)
    dims = 50  # must match the vector size used in demo.sh
    print num_lines

    # Output: gensim model in word2vec text format.
    gensim_file = 'glove_model.txt'
    gensim_first_line = "{} {}".format(num_lines, dims)

    # Prepend the "count dimensions" header line required by the word2vec text format.
    # if platform == "linux" or platform == "linux2":
    prepend_line(glove_file, gensim_file, gensim_first_line)
    # else:
    #     prepend_slow(glove_file, gensim_file, gensim_first_line)

    # Demo: load the newly created glove_model.txt with the gensim API.
    model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file, binary=False)  # GloVe model
    model_name = gensim_file[6:-4]  # -> 'model'
    model.save('/home/qf/GloVe-master/' + model_name)
    return model


if __name__ == '__main__':
    myfile = '/home/qf/GloVe-master/vectors.txt'
    load(myfile)

    ####################################
    model_name = 'model'
    model = gensim.models.KeyedVectors.load('/home/qf/GloVe-master/' + model_name)
    print len(model.vocab)
    word_list = [u'to', u'one']
    for word in word_list:
        print word, '--'
        for i in model.most_similar(word, topn=10):
            print i[0], i[1]
        print ''
Run this script from the terminal. It produces a new vector .txt file whose first line contains two numbers: the first is the total number of vectors, and the second is the dimensionality of each vector. It also saves a model file named model.
For testing, that model can then be loaded directly with gensim's word2vec load function, as in the __main__ block above.
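As an aside, newer gensim versions ship a helper that performs the same header-prepending conversion, so the script above could be shortened. A minimal sketch, assuming a gensim version that includes gensim.scripts.glove2word2vec (I have not checked against the exact gensim version used here):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Convert GloVe's plain vector file to word2vec text format by prepending
# the "vocab_size vector_size" header, then load it as usual.
glove2word2vec('vectors.txt', 'glove_model.txt')
model = KeyedVectors.load_word2vec_format('glove_model.txt', binary=False)
print(model.most_similar(u'one', topn=5))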
That completes the training and testing of the model.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Postscript:
Notes: 1. The corpus downloaded by the demo script is already preprocessed; if you use your own corpus, it must be processed separately (see the corpus-preparation sketch in section 3 above).
2. Some problems I ran into along the way:
(1) The Python script fails with a syntax error: IndentationError: unindent does not match any outer indentation level.
The most common causes are misaligned indentation, or mixing tabs and spaces within the same block; see the tabnanny sketch after these notes for one way to track this down.
(2) Installing gensim with pip install gensim reported an error; using sudo easy_install --upgrade gensim worked instead.
(3) Python reports SyntaxError: Non-ASCII character '\xe5' in file.
Solution: this is an encoding problem; add the following two lines at the top of the Python script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
(4) In the script above, initially loading the model with gensim.models.Word2Vec.load('/home/qf/GloVe-master/'+model_name)
raised an error; it worked after changing it to:
gensim.models.KeyedVectors.load('/home/qf/GloVe-master/'+model_name)
One guess was that word2vec was missing and that installing it in the terminal with
pip install word2vec
would make the first form work, but I have not verified this. Note in any case that the file saved earlier came from calling save() on the object returned by KeyedVectors.load_word2vec_format, i.e. a KeyedVectors object, so KeyedVectors.load is the matching call.
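For note (1) above, Python's standard library includes tabnanny, which reports ambiguous tab/space indentation. A minimal sketch (the file name your_script.py is a placeholder):

import tabnanny

# Prints the file name and line number of any ambiguous tab/space indentation.
tabnanny.check('your_script.py')

The same check can be run from the command line with python -m tabnanny your_script.py.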
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reference: https://blog.csdn.net/sscssz/article/details/53333225