Automated Phrase Mining from Massive Text Corpora
ABSTRACT
As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus.
Phrase mining is important in various tasks such as information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus are likely to have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaptation. Recently, a few data-driven methods have been developed successfully for extracting phrases from massive domain-specific text. However, none of the state-of-the-art models is fully automated, because they require human experts to design rules or label phrases.
Since one can easily obtain many quality phrases from public knowledge bases to a scale that is much larger than that produced by human experts, in this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which leverages this large amount of high-quality phrases in an effective way and achieves better performance compared to using limited human-labeled phrases. In addition, we develop a POS-guided phrasal segmentation model, which incorporates the shallow syntactic information in part-of-speech (POS) tags to further enhance the performance, when a POS tagger is available. Note that AutoPhrase can support any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, the new method has shown significant improvements in effectiveness on five real-world datasets across different domains and languages.
1. INTRODUCTION
Phrase mining refers to the process of automatic extraction of high-quality phrases (e.g., scientific terms and general entity names) in a given corpus (e.g., research papers and news). Representing the text with quality phrases instead of n-grams can improve computational models for applications such as information extraction/retrieval, taxonomy construction, and topic modeling.
Almost all the state-of-the-art methods, however, require human experts at certain levels. Most existing methods [9, 20, 25] rely on complex, trained linguistic analyzers (e.g., dependency parsers) to locate phrase mentions, and thus may have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaptation.
Our latest domain-independent method SegPhrase [13] outperforms many other approaches [9, 20, 25, 5, 19, 6], but still needs domain experts to first carefully select hundreds of varying-quality phrases from millions of candidates, and then annotate them with binary labels.
Such reliance on manual efforts by domain and linguistic experts becomes an impediment for timely analysis of massive, emerging text corpora in specific domains. An ideal automated phrase mining method is supposed to be domain-independent, with minimal human effort or reliance on linguistic analyzers. Bearing this in mind, we propose a novel automated phrase mining framework AutoPhrase in this paper, going beyond SegPhrase to further eliminate additional manual labeling effort and enhance the performance, mainly using the following two new techniques.
- Robust Positive-Only Distant Training. In fact, many high-quality phrases are freely available in general knowledge bases, and they can be easily obtained to a scale that is much larger than that produced by human experts. Domain-specific corpora usually contain some quality phrases also encoded in general knowledge bases, even when there may be no other domain-specific knowledge bases. Therefore, for distant training, we leverage the existing high-quality phrases, as available from general knowledge bases, such as Wikipedia and Freebase, to get rid of additional manual labeling effort. We independently build samples of positive labels from general knowledge bases and negative labels from the given domain corpora, and train a number of base classifiers. We then aggregate the predictions from these classifiers, whose independence helps reduce the noise from negative labels.
- POS-Guided Phrasal Segmentation. There is a trade-off between performance and domain-independence when incorporating linguistic processors in the phrase mining method. On the domain-independence side, the accuracy might be limited without linguistic knowledge, and it is difficult to support multiple languages if the method is completely language-blind. On the accuracy side, relying on complex, trained linguistic analyzers may hurt the domain-independence of the phrase mining method. For example, it is expensive to adapt dependency parsers to special domains like clinical reports. As a compromise, we propose to incorporate a pre-trained part-of-speech (POS) tagger to further enhance the performance, when one is available for the language of the document collection. The POS-guided phrasal segmentation leverages the shallow syntactic information in POS tags to guide the phrasal segmentation model in locating the boundaries of phrases more accurately.
In principle, AutoPhrase can support any language as long as a general knowledge base in that language is available. In fact, at least 58 languages have more than 100,000 articles in Wikipedia as of Feb. 2017. Moreover, since pre-trained part-of-speech (POS) taggers are widely available in many languages (e.g., more than 20 languages in TreeTagger [22]), the POS-guided phrasal segmentation can be applied in many scenarios. It is worth mentioning that for domain-specific knowledge bases (e.g., MeSH terms in the biomedical domain) and trained POS taggers, the same paradigm applies. In this study, without loss of generality, we only assume the availability of a general knowledge base together with a pre-trained POS tagger.
As demonstrated in our experiments, AutoPhrase not only works effectively in multiple domains like scientific papers, business reviews, and Wikipedia articles, but also supports multiple languages, such as English, Spanish, and Chinese.
Our main contributions are highlighted as follows:
- We study an important problem, automated phrase mining, and analyze its major challenges as above.
- We propose a robust positive-only distant training method for phrase quality estimation to minimize the human effort.
- We develop a novel phrasal segmentation model to leverage POS tags to achieve further improvement, when a POS tagger is available.
- We demonstrate the robustness and accuracy of our method and show improvements over prior methods, with results of experiments conducted on five real-world datasets in different domains (scientific papers, business reviews, and Wikipedia articles) and different languages (English, Spanish, and Chinese).
The rest of the paper is organized as follows. Section 2 positions our work relative to existing works. Section 3 defines basic concepts including four requirements of phrases.
The details of our method are covered in Section 4. Extensive experiments and case studies are presented in Section 5. We conclude the study in Section 7.
2. RELATED WORK
Identifying quality phrases efficiently has become ever more central and critical for the effective handling of massive, ever-growing text datasets. In contrast to keyphrase extraction [17, 23, 14], this task goes beyond the scope of single documents and provides useful cross-document signals. The natural language processing (NLP) community has conducted extensive studies, typically referred to as automatic term recognition [9, 20, 25], on the computational task of extracting terms (such as technical phrases). This topic also attracts attention in the information retrieval (IR) community [7, 19], since selecting appropriate indexing terms is critical to the improvement of search engines, where the ideal indexing units represent the main concepts in a corpus, not just a literal bag-of-words.
Text indexing algorithms typically filter out stop words and restrict candidate terms to noun phrases. With pre-defined part-of-speech (POS) rules, one can identify noun phrases as term candidates in POS-tagged documents. Supervised noun phrase chunking techniques exploit such tagged documents to automatically learn rules for identifying noun phrase boundaries. Other methods may utilize more sophisticated NLP technologies such as dependency parsing to further enhance the precision. With candidate terms collected, the next step is to leverage certain statistical measures derived from the corpus to estimate phrase quality.
Some methods rely on other reference corpora for the calibration of “termhood”. The dependency on these various kinds of linguistic analyzers, domain-dependent language rules, and expensive human labeling, makes it challenging to extend these approaches to emerging, big, and unrestricted corpora, which may include many different domains, topics, and languages.
To overcome this limitation, data-driven approaches opt instead to make use of frequency statistics in the corpus to address both candidate generation and quality estimation. They do not rely on complex linguistic feature generation, domain-specific rules or extensive labeling efforts.
Instead, they rely on large corpora containing hundreds of thousands of documents to help deliver superior performance. Several indicators, including frequency and comparison to super/sub-sequences, have been proposed to extract n-grams that reliably indicate frequent, concise concepts. Deane proposed a heuristic metric over frequency distribution based on Zipfian ranks, to measure lexical association for phrase candidates. As a preprocessing step towards topical phrase extraction, El-Kishky et al. proposed to mine significant phrases based on frequency as well as document context in a bottom-up fashion, which considers only a subset of the quality phrase criteria, namely popularity and concordance. Our previous work, SegPhrase [13], succeeded at integrating phrase quality estimation with phrasal segmentation to further rectify the initial set of statistical features, based on local occurrence context. Unlike previous methods, which are purely unsupervised, SegPhrase required a small set of phrase labels to train its phrase quality estimator. It is worth mentioning that all these approaches still depend on human effort (e.g., setting domain-sensitive thresholds). Therefore, extending them to work automatically is challenging.
3. PRELIMINARIES
The goal of this paper is to develop an automated phrase mining method to extract quality phrases from a large collection of documents without human labeling effort, and with only limited, shallow linguistic analysis. The main input to the automated phrase mining task is a corpus and a knowledge base. The input corpus is a textual word sequence in a particular language and a specific domain, of arbitrary length. The output is a ranked list of phrases with decreasing quality.
The AutoPhrase framework is shown in Figure 1. The workflow is completely different from our previous domain-independent phrase mining method requiring human effort, although the phrase candidates and the features used during phrase quality (re-)estimation are the same. In this paper, we propose a robust positive-only distant training method to minimize the human effort and develop a POS-guided phrasal segmentation model to improve the model performance. In this section, we briefly introduce basic concepts and components as preliminaries.
A phrase is defined as a sequence of words that appear consecutively in the text, forming a complete semantic unit in certain contexts of the given documents. The phrase quality is defined to be the probability of a word sequence being a complete semantic unit, meeting the following criteria:
- Popularity: Quality phrases should occur with sufficient frequency in the given document collection.
- Concordance: The collocation of tokens in quality phrases occurs with significantly higher probability than expected due to chance.
- Informativeness: A phrase is informative if it is indicative of a specific topic or concept.
- Completeness: Long frequent phrases and their subsequences within those phrases may both satisfy the 3 criteria above. A phrase is deemed complete when it can be interpreted as a complete semantic unit in some given document context. Note that a phrase and a subphrase contained within it may both be deemed complete, depending on the context in which they appear. For example, “relational database system”, “relational database” and “database system” can all be valid in certain context.
Only the phrases satisfying all of the above requirements are recognized as quality phrases.
AutoPhrase will estimate the phrase quality based on the positive and negative pools twice, once before and once after the POS-guided phrasal segmentation. That is, the POS-guided phrasal segmentation requires an initial set of phrase quality scores; we estimate the scores based on raw frequencies beforehand; and then, once the feature values have been rectified, we re-estimate the scores.
Example 1. “strong tea” is a quality phrase while “heavy tea” fails to be one due to concordance. “this paper” is a popular and concordant phrase, but is not informative in a research publication corpus. “NP-complete in the strong sense” is a quality phrase while “NP-complete in the strong” fails to be one due to completeness.
To automatically mine these quality phrases, the first phase of AutoPhrase (see leftmost box in Figure 1) establishes the set of phrase candidates that contains all n-grams over the minimum support threshold τ (e.g., 30) in the corpus.
Here, this threshold refers to the raw frequency of the n-grams calculated by string matching. In practice, one can also set a phrase length threshold (e.g., $n \le 6$) to restrict the number of words in any phrase. Given a phrase candidate $w_1 w_2 \ldots w_n$, its phrase quality is:

$$Q(w_1 w_2 \ldots w_n) = p([w_1 w_2 \ldots w_n] \mid w_1 w_2 \ldots w_n) \in [0, 1]$$
where $[w_1 w_2 \ldots w_n]$ refers to the event that these words constitute a phrase. $Q(\cdot)$, also known as the phrase quality estimator, is initially learned from data based on statistical features, such as point-wise mutual information, point-wise KL divergence, and inverse document frequency, designed to model the concordance and informativeness mentioned above.
Note that the phrase quality estimator is computed independently of POS tags. For unigrams, we simply set their phrase quality as 1.
Example 2. A good quality estimator will return Q(this paper) ≈ 0 and Q(relational database system) ≈ 1.
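Looking back at the candidate-generation step (the leftmost box in Figure 1), a minimal Python sketch might look as follows. This is an illustration rather than the actual implementation: the corpus is assumed to be pre-tokenized into sentences, and the names mine_candidates, tau, and max_len are ours.

```python
from collections import Counter

def mine_candidates(sentences, tau=30, max_len=6):
    """Collect all n-grams (n <= max_len) whose raw frequency,
    counted by exact string matching, reaches the minimum
    support threshold tau."""
    counts = Counter()
    for tokens in sentences:            # one token list per sentence
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= tau}
```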
Then, to address the completeness criterion, the phrasal segmentation finds the best segmentation for each sentence.
During the phrase quality re-estimation, related statistical features will be re-computed based on the rectified frequency of phrases, that is, the number of times a phrase occurs as a complete semantic unit in the identified segmentation. After incorporating the rectified frequency, the phrase quality estimator Q(·) also models the completeness in addition to concordance and informativeness.
Example 4. Continuing the previous example, the raw frequency of the phrase “great firewall” is 2, but its rectified frequency is 1. Both the raw frequency and the rectified frequency of the phrase “firewall software” are 1. The raw frequency of the phrase “classifier SVM” is 1, but its rectified frequency is 0.
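To make the rectified counting concrete, here is a minimal sketch (illustrative names; the segmentation output is assumed to be a list of segment tuples per sentence) that counts a phrase only when it appears as a complete segment, matching the counts in Example 4:

```python
from collections import Counter

def rectified_frequency(segmented_sentences):
    """Count each phrase only when it occurs as a complete segment
    (i.e., a complete semantic unit) in the identified segmentation,
    instead of by raw string matching."""
    freq = Counter()
    for segments in segmented_sentences:
        for seg in segments:            # seg is a tuple of words
            freq[seg] += 1
    return freq
```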
4. METHODOLOGY
In this section, we focus on introducing our two new techniques.
Figure 2: The illustration of each base classifier. In each base classifier, we first randomly sample K positive and K negative labels from the pools, respectively. There might be δ quality phrases among the K negative labels. An unpruned decision tree is trained based on this perturbed training set.
4.1 Robust Positive-Only Distant Training
To estimate the phrase quality score for each phrase candidate, our previous work required domain experts to first carefully select hundreds of varying-quality phrases from millions of candidates, and then annotate them with binary labels. For example, for computer science papers, our domain experts provided hundreds of positive labels (e.g., “spanning tree” and “computer science”) and negative labels (e.g., “paper focuses” and “important form of”). However, creating such a label set is expensive, especially in specialized domains like clinical reports and business reviews, because this approach provides no clues for how to identify the phrase candidates to be labeled. In this paper, we introduce a method that only utilizes existing general knowledge bases without any other human effort.
4.1.1 Label Pools
Public knowledge bases (e.g., Wikipedia) usually encode a considerable number of high-quality phrases in the titles, keywords, and internal links of pages. For example, by analyzing the internal links and synonyms in English Wikipedia, more than a hundred thousand high-quality phrases were discovered. As a result, we place these phrases in a positive pool.
Knowledge bases, however, rarely, if ever, identify phrases that fail to meet our criteria, which we call inferior phrases.
An important observation is that the number of phrase candidates, based on n-grams (recall the leftmost box of Figure 1), is huge, and the majority of them are actually of inferior quality (e.g., “Francisco opera and”). In practice, based on our experiments, among millions of phrase candidates, usually only about 10% are of good quality. Therefore, phrase candidates that are derived from the given corpus but fail to match any high-quality phrase derived from the given knowledge base are used to populate a large but noisy negative pool.
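A minimal sketch of this pool construction, assuming we already have the candidate set from Figure 1 and a set of high-quality phrases extracted from the knowledge base (both argument names below are illustrative):

```python
def build_label_pools(candidates, kb_phrases):
    """Positive pool: candidates matching a high-quality phrase in the
    general knowledge base. Negative pool: all remaining candidates,
    forming a large but noisy set of negative labels."""
    positive_pool = [c for c in candidates if c in kb_phrases]
    negative_pool = [c for c in candidates if c not in kb_phrases]
    return positive_pool, negative_pool
```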
4.1.2 Noise Reduction
Directly training a classifier based on the noisy label pools is not a wise choice: some phrases of high quality from the given corpus may have been missed (i.e., inaccurately binned into the negative pool) simply because they were not present in the knowledge base. Instead, we propose to utilize an ensemble classifier that averages the results of T independently trained base classifiers. As shown in Figure 2, for each base classifier, we randomly draw K phrase candidates with replacement from the positive pool and the negative pool respectively (considering a canonical balanced classification scenario). This size-2K subset of the full set of all phrase candidates is called a perturbed training set [2], because the labels of some (δ in the figure) quality phrases are switched from positive to negative.

In order for the ensemble classifier to alleviate the effect of such noise, we need to use base classifiers with the lowest possible training errors. We grow an unpruned decision tree to the point of separating all phrases to meet this requirement. In fact, such a decision tree will always reach 100% training accuracy when no two positive and negative phrases share identical feature values in the perturbed training set. In this case, its ideal error is δ/(2K), which approximately equals the proportion of switched labels among all phrase candidates (i.e., ≈ 10%). Therefore, the accuracy of the unpruned decision tree is not sensitive to the value of K, which is fixed as 100 in our implementation. Assuming the extracted features are distinguishable between quality and inferior phrases, the empirical error evaluated on all phrase candidates, p, should be relatively small as well.
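A sketch of one base classifier, using scikit-learn's DecisionTreeClassifier (unpruned by default) as an assumed implementation; `features` is a hypothetical mapping from each phrase candidate to its statistical feature vector:

```python
import random
from sklearn.tree import DecisionTreeClassifier

def train_base_classifier(positive_pool, negative_pool, features, K=100):
    """Draw K positive and K negative labels with replacement to form
    a perturbed training set, then grow an unpruned decision tree that
    separates all sampled phrases (~100% training accuracy)."""
    sample = ([(p, 1) for p in random.choices(positive_pool, k=K)] +
              [(n, 0) for n in random.choices(negative_pool, k=K)])
    X = [features[phrase] for phrase, label in sample]
    y = [label for phrase, label in sample]
    return DecisionTreeClassifier().fit(X, y)  # no pruning, no depth limit
```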
An interesting property of this sampling procedure is that the random selection of phrase candidates for building perturbed training sets creates classifiers that have statistically independent errors and similar erring probability. Therefore, we naturally adopt a random forest, which has been verified in the statistics literature to be robust and efficient.
The phrase quality score of a particular phrase is computed as the proportion of all decision trees that predict that phrase is a quality phrase. Suppose there are T trees in the random forest; the ensemble error can then be estimated as the probability of having more than half of the classifiers misclassify a given phrase candidate:

$$\sum_{t=\lceil (1+T)/2 \rceil}^{T} \binom{T}{t} \, p^t (1-p)^{T-t}$$
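This binomial tail can be evaluated numerically with a short sketch; for instance, with the empirical error p ≈ 0.1 suggested above and T = 100 trees, the ensemble error is effectively zero:

```python
from math import ceil, comb

def ensemble_error(p, T):
    """Probability that more than half of T independent base
    classifiers, each erring with probability p, misclassify a
    given phrase candidate (the binomial tail above)."""
    return sum(comb(T, t) * p**t * (1 - p)**(T - t)
               for t in range(ceil((1 + T) / 2), T + 1))

print(ensemble_error(0.1, 100))  # vanishingly small
```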
4.2 POS-Guided Phrasal Segmentation
Phrasal segmentation addresses the challenge of measuring completeness (our fourth criterion) by locating all phrase mentions in the corpus and rectifying their frequencies obtained originally via string matching.
The corpus is processed as a length-$n$ POS-tagged word sequence $\Omega = \Omega_1 \Omega_2 \ldots \Omega_n$, where $\Omega_i$ refers to a pair $\langle w_i, t_i \rangle$ consisting of a word $w_i$ and its POS tag $t_i$. A POS-guided phrasal segmentation is a partition of this sequence into $m$ segments induced by a boundary index sequence $B = \{b_1, b_2, \ldots, b_{m+1}\}$ satisfying $1 = b_1 < b_2 < \ldots < b_{m+1} = n + 1$. The $i$-th segment refers to $\Omega_{b_i} \Omega_{b_i+1} \ldots \Omega_{b_{i+1}-1}$.
Compared to the phrasal segmentation in our previous work, the POS-guided phrasal segmentation addresses the completeness requirement in a context-aware way, instead of equivalently penalizing phrase candidates of the same length. In addition, POS tags provide shallow, language specific knowledge, which may help boost phrase detection accuracy, especially at syntactic constituent boundaries for that language.
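While the full POS-guided model is beyond this excerpt, the following is a deliberately simplified sketch of the underlying idea: a Viterbi-style dynamic program that chooses the segmentation maximizing the product of segment quality scores Q (with unigrams scored as 1, as noted in Section 3). The POS-tag scoring component and the learned segmentation parameters are omitted, so this should be read as an assumption-laden illustration, not the actual model:

```python
def segment_sentence(tokens, Q, max_len=6):
    """Dynamic program over segment boundaries. best[j] holds the
    score of the best segmentation of tokens[:j]; back[j] remembers
    where that segmentation's last segment starts."""
    n = len(tokens)
    best = [0.0] * (n + 1)
    best[0] = 1.0
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            cand = tuple(tokens[i:j])
            # unseen multi-word candidates get a negligible score
            q = Q.get(cand, 1.0 if j - i == 1 else 1e-9)
            if best[i] * q > best[j]:
                best[j], back[j] = best[i] * q, i
    segments, j = [], n
    while j > 0:
        segments.append(tuple(tokens[back[j]:j]))
        j = back[j]
    return segments[::-1]
```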