
【P2】On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study

Views: 54 · Published: 2024-02-25 00:55:57

Continued from Part I.


Paper (original text): HTML
Citations: 313 (as of 2020/10/02)


Contents

  • 4 Datasets
    • 4.1 Dataset preparation
    • 4.2 Datasets used in the literature
    • 4.3 Semantically meaningful outlier datasets
  • 5 Experimental results
    • 5.1 Evaluation measures
    • 5.2 Characterization of the outlier methods
    • 5.3 Characterization of the datasets
  • 6 Conclusions


4 Datasets

In this section, we give a systematic presentation of the datasets considered in this study on issues in outlier evaluation. The datasets have been organized into two groups: the first, presented in Sect. 4.2, consists of sets that have previously appeared in the research literature in the evaluation of outlier detection algorithms; the second group, presented in Sect. 4.3, consists of sets originally intended for the evaluation of classification methods, where one or more classes have a natural semantic interpretation as outliers. In total, we define, collect, and make publicly available a repository of approximately 1000 datasets (including variants), together with full details of their preprocessing. We begin in Sect. 4.1 with a discussion of the issues surrounding the compilation, preparation, and description of datasets for the evaluation of outlier detection algorithms.


4.1 Dataset preparation

The UCI repository (Bache and Lichman 2013) is a valuable source of datasets for the evaluation of data mining algorithms. While most of them have been proposed for the evaluation of classification methods, they have also been widely used for unsupervised algorithms such as clustering. However, since the semantics of data clusters are often quite different from those of ground-truth classes, the appropriateness of such datasets for the evaluation of unsupervised learning methods is debatable (Färber et al. 2010). For the evaluation of outlier detection, the semantic mismatch is even more problematic, since outliers are assumed to be both rare and diverse.


In the following, we outline the main issues in converting classification datasets to outlier evaluation datasets, and discuss how these issues were handled in our study.


Downsampling A common approach in outlier detection research is to randomly downsample a particular class to produce outliers, while retaining all instances of the remaining classes to form the inlier set. Random downsampling often leads to great variation in the nature of the outliers produced. Therefore, to mitigate the impact of randomization when downsampling, we repeat the procedure for each dataset 10 times, resulting in 10 different variants for these datasets.

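The downsampling step can be sketched as follows; the function name, array shapes, seeds, and label convention are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def downsample_outlier_class(X, y, outlier_label, n_outliers, rng):
    """Keep all inliers; randomly sample n_outliers instances of the outlier class."""
    inlier_idx = np.flatnonzero(y != outlier_label)
    outlier_idx = np.flatnonzero(y == outlier_label)
    chosen = rng.choice(outlier_idx, size=n_outliers, replace=False)
    keep = np.concatenate([inlier_idx, chosen])
    return X[keep], y[keep]

# Repeat 10 times with different seeds to obtain 10 dataset variants,
# mitigating the randomness of any single downsampling run.
X = np.random.rand(100, 3)
y = np.array([0] * 90 + [1] * 10)  # 1 marks the class used as outliers
variants = [downsample_outlier_class(X, y, 1, 3, np.random.default_rng(seed))
            for seed in range(10)]
```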

Duplicates The handling of duplicate instances in the dataset has received scant attention in the literature. However, the presence of duplicates is problematic for several methods. For example, for LOF and many of its variants, duplicate instances can lead to distance values of zero, which introduces numerical instability into the computation of local density estimates. For datasets containing duplicates, we generate two variants, one with the original duplicates, and one without duplicates. It should be noted, though, that removing duplicate records can drastically lower local density estimates in certain cases.

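Generating the duplicate-free variant is straightforward with NumPy; note that `np.unique` does not preserve row order (that caveat is our own, not the paper's):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 2.0],   # exact duplicate of the first row
              [3.0, 4.0]])
X_dedup = np.unique(X, axis=0)  # keeps one copy of each duplicated row
print(X_dedup.shape)  # (2, 2)
```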

Categorical attributes Transformation of categorical attributes into numerical ones is another source of dataset variation. We employ two techniques:


  • 1-of-n encoding, where a categorical attribute with n possible values is mapped into n binary attributes for which a value of 1 (or 0) represents the presence (or absence) of the corresponding categorical value;
  • IDF, where a categorical attribute is encoded as the inverse document frequency IDF(t) = ln(N/f_t), where N is the total number of instances and f_t is the frequency (number of occurrences) of the attribute value t.

For datasets with categorical attributes we thus have three variants: one where categorical attributes are removed, and two resulting from the transformations described above.

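A minimal pure-Python sketch of the two encodings (the helper names are ours):

```python
import math
from collections import Counter

def one_of_n(values):
    """1-of-n encoding: one binary attribute per distinct categorical value."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def idf_encode(values):
    """IDF encoding: replace value t by ln(N / f_t)."""
    n = len(values)
    freq = Counter(values)
    return [math.log(n / freq[v]) for v in values]

colors = ["red", "blue", "red", "green"]
print(one_of_n(colors))    # columns ordered blue, green, red
print(idf_encode(colors))  # rarer values receive larger IDF weights
```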

Normalization The normalization of datasets is expected to have considerable impact on the results, but is rarely discussed in the literature. A full exploration of this issue is beyond the scope of this study. However, for each dataset that does not already have normalized attributes, we include two variants: unnormalized, and attribute-wise linear normalization to the range [0,1].

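The attribute-wise linear normalization to [0, 1] can be sketched as follows (NumPy-based; the handling of constant attributes is our own assumption):

```python
import numpy as np

def minmax_normalize(X):
    """Attribute-wise linear normalization of each column to the range [0, 1]."""
    mins = X.min(axis=0)
    ranges = X.max(axis=0) - mins
    ranges[ranges == 0] = 1.0  # constant attributes simply map to 0
    return (X - mins) / ranges

X = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [2.0, 20.0]])
print(minmax_normalize(X))  # each column spans exactly [0, 1]
```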

Missing values Standard outlier detection techniques as considered here cannot handle data with missing values. We determine for each dataset and each attribute the number of missing values. If an attribute has fewer than 10 % of instances with missing values, those instances are removed. Otherwise, the attribute itself is removed.

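The missing-value rule might be implemented as in the sketch below, assuming NaN encodes a missing value; the strict "fewer than 10%" cutoff is our reading of the text:

```python
import numpy as np

def handle_missing(X, threshold=0.10):
    """Drop attributes with >= threshold missing values; for the remaining
    attributes, drop the instances that still contain missing values."""
    frac_missing = np.isnan(X).mean(axis=0)
    X = X[:, frac_missing < threshold]       # attribute removed otherwise
    return X[~np.isnan(X).any(axis=1)]       # remove affected instances

X = np.ones((20, 3))
X[:5, 0] = np.nan  # 25% missing -> the attribute itself is removed
X[0, 1] = np.nan   # 5% missing  -> only the affected instance is removed
print(handle_missing(X).shape)  # (19, 2)
```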

4.2 Datasets used in the literature

Table 1 lists those datasets of our study that are known to have appeared in the outlier detection literature. For each dataset, variants have been produced according to the guidelines set out in Sect. 4.1. The details shown in the table include the number of instances, outliers, and attributes after the handling of missing values and downsampling, but before the removal of duplicates. Full documentation of the datasets is available on our repository website.


Some of the datasets in our collection have been the basis of benchmarking in several publications. Unless the processed datasets were made publicly available [as is the case for the datasets used by Keller et al. (2012)], some ambiguity may remain as to their construction, due to a lack of information regarding the issues listed in Sect. 4.1, or due to the use of downsampling. As an additional complication, some publications do not give clear references to the datasets they use, or refer to datasets by ambiguous names. Often, it is not specified whether only the training set, only the test set, or both partitions of a classification dataset were used. In contrast, processing datasets according to the guidelines set out earlier, and making them publicly available together with their full documentation, promotes the reproducibility and ease of comparison of future experimentation with outlier methods.


[Table 1: number of instances, outliers, and attributes for each literature dataset; full details on the repository website]

4.3 Semantically meaningful outlier datasets

Semantically meaningful datasets for outlier evaluation are those in which certain classes can be reasonably assumed to be associated with real-world instances that are both rare and deviating—for example, ‘sick’ patients within a population dominated by ‘healthy’ individuals. However, it is sometimes the case that outliers within the real-world population may be overrepresented within a given classification dataset. For such datasets, we create several variants by downsampling the outlier class at several different rates: 20, 10, 5, and 2 % of outliers.


As semantically meaningful datasets, we selected and processed the following UCI repository datasets (Bache and Lichman 2013), together with an additional classification dataset, Stamps (Micenková et al. 2012), combining training and test sets whenever both exist (summarized in Table 2).


  • Annthyroid Medical data on hypothyroidism. Three classes relate to the conditions ‘normal’, ‘hyperfunction’, and ‘subnormal functioning’. Classes other than ‘normal’ were defined as outliers here.


  • Arrhythmia Patients classified as normal or as exhibiting some type of cardiac arrhythmia. In total, there are 14 types of arrhythmia and 1 type that brings together all the other different types. However, 3 types of arrhythmia have no data. Again, we treat healthy people as inliers and patients suffering from arrhythmia as outliers.


  • Pima Medical data on diabetes. Patients suffering from diabetes were considered outliers.


  • Cardiotocography Related to heart diseases, describing 3 classes: normal, suspect, or pathological. Normal patients are treated as inliers and the remaining as outliers.


  • Heart Disease Medical data on heart problems, with affected patients considered outliers and healthy people considered inliers.


  • Hepatitis Prediction of whether a patient suffering from hepatitis will die (outliers) or survive (inliers).


  • Internet Ads Images from web pages, classified as ads or not, for the purpose of learning to remove ads automatically from web pages while retaining regular images. Ads are considered outliers.


  • Page Blocks Types of blocks in document pages, relating to an essential step in document analysis, namely to separate text from pictures or graphics. If the block content is text, it was labeled here as inlier, otherwise it was labeled as outlier.


  • Parkinson Medical data distinguishing healthy people from those suffering from Parkinson’s disease. The latter were labeled as outliers.


  • Spam Base Emails classified as spam (outliers) or non-spam.


  • Stamps Learning to identify forged (photocopied and printed) stamps from genuine (ink) stamps based on color and printer properties. The former are outliers.


  • Wilt Differentiating diseased trees from other land covers. The former are considered outliers here.


5 Experimental results

Our study centers around the following three questions: Are the evaluation measures capable of revealing the performance characteristics of outlier methods? How do outlier methods perform over a broad range of datasets and parameter settings? How can one better understand and properly characterize datasets in light of the outlier detection evaluation task?


In our experimentation, each of the 12 methods was executed over multiple variants of the 23 datasets, determined according to whether the data were normalized, whether duplicates were removed, the treatment of categorical attributes, and which of 4 downsampling rates was employed. Each experiment was performed for each meaningful choice of the parameter value k between 1 and 100 (or the number of data instances, if less than 100). In total, 1,300,758 experimental runs were performed. The complete results, including all plots for all datasets, are available on our repository website.


5.1 Evaluation measures

In order to assess the quality of outlier methods, we first investigate the behavior of the three base evaluation measures identified in Sect. 3: precision at n [P@n, where we set n = |O| (Craswell 2009a, b)], average precision (AP), and ROC AUC. For each evaluation measure, roughly 1000 plots were produced across the range of values of k; examples are shown in Fig. 1 for two datasets, Annthyroid (7.4% outliers) and HeartDisease (44.4% outliers). On these two examples, we can already observe significant variations in performance trends across differing combinations of outlier algorithms, datasets, parameter choices, and evaluation measures (please see the web repository for the complete results).

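For reference, all three base measures can be computed directly from an outlier-score ranking. The following is a plain-NumPy sketch (tie handling in P@n and AP is simplified), not the authors' code:

```python
import numpy as np

def precision_at_n(scores, labels, n):
    """P@n: fraction of true outliers among the n top-scored instances."""
    top = np.argsort(-scores)[:n]
    return labels[top].sum() / n

def average_precision(scores, labels):
    """AP: mean of the precision values taken at the rank of each true outlier."""
    ranked = labels[np.argsort(-scores)]
    hits = np.flatnonzero(ranked) + 1                # 1-based ranks of outliers
    return np.mean(np.cumsum(ranked)[hits - 1] / hits)

def roc_auc(scores, labels):
    """ROC AUC: probability that a random outlier is ranked above a random inlier."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

scores = np.array([0.9, 0.8, 0.3, 0.1])  # higher score = more outlying
labels = np.array([1, 0, 1, 0])          # 1 marks a true outlier
print(precision_at_n(scores, labels, n=2))           # 0.5
print(round(average_precision(scores, labels), 4))   # 0.8333
print(roc_auc(scores, labels))                       # 0.75
```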

As noted in Sect. 3, ROC AUC is expected to be less sensitive to variation in the number of true outliers than the other evaluation measures. This tendency is confirmed by our experimental results, as demonstrated by the examples shown in Fig. 1. The ROC AUC scores achieved by the methods were consistently high across the datasets, while still being able to discriminate among the different outlier methods. In contrast, the P@n scores were considerably lower for those datasets with smaller proportions of outliers (as seen here for Annthyroid). Although the behavior of AP more closely resembles that of ROC AUC, in that it assesses the ranks of all outliers, AP scores also tend to be low when, as is typically the case in outlier detection scenarios, the numbers of inliers and outliers are not balanced.

Fig. 1 Results for two datasets (examples; without duplicates, normalized, no downsampling), comparing precision at n (P@n), average precision (AP), and ROC AUC (complete results for all datasets are available in our web repository). a Annthyroid, b HeartDisease

P@n takes into account only the number of true outliers among the n top-ranked items, and thus its behavior is quite different from that of ROC AUC. As observed by Davis and Goadrich (2006), the P@n measure can therefore be helpful in discriminating between methods that perform more or less equally well in terms of ROC AUC. Following their lead, we will rely mainly on ROC AUC scores in judging the effectiveness of the outlier methods, while turning occasionally to P@n and AP for further insights.


When P@n and AP are adjusted for chance, we obtain curves analogous in shape to those of their unadjusted counterparts, but with an upward shift in the quality scores. As detailed in Sect. 3, although adjustment for chance is not necessary when comparing the performance of different methods on a single dataset, it is beneficial (if not indispensable) when comparing across datasets with different proportions of outliers.

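Adjustment for chance follows the generic scheme (score − expectation) / (maximum − expectation); for P@n with n = |O|, and likewise for AP, a random ranking has an expected score of |O|/N and the maximum is 1. A sketch:

```python
def adjust_for_chance(value, n_outliers, n_total):
    """Adjusted score = (score - E[random]) / (max - E[random]),
    where a random ranking has expected score |O|/N and the maximum is 1."""
    expected = n_outliers / n_total
    return (value - expected) / (1.0 - expected)

# e.g. an unadjusted P@n of 0.5 on a dataset with 10% outliers:
print(adjust_for_chance(0.5, 10, 100))  # roughly 0.444
```

A random ranking thus maps to an adjusted score of 0, which makes scores comparable across datasets with different proportions of outliers.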

5.2 Characterization of the outlier methods

In this section, we characterize the performance of the selected outlier detection methods using datasets introduced in Sect. 4. These datasets cover a wide range of proportions of points considered outliers in the ground truth of the datasets—as high as 75 % for the Parkinson dataset (without downsampling). One can argue that ground-truth datasets with a large proportion of outliers are not appropriate for the evaluation of outlier detection methods, since outliers are by definition assumed to be exceptions in the data, and thus much less common than inliers. Accordingly, most outlier detection models either implicitly or explicitly assume that outliers are relatively rare — a characteristic which is confirmed by our experimental results, in the sense that all methods perform worse as more and more objects are included as outliers in a dataset. The conclusions about relative performance of the methods are however largely unaffected by the proportion of outliers in the dataset.


To compare the methods according to their quality scores, we will consider initially (1) the average performance over the range of given values of k (representing the expected performance if the user has no prior knowledge about k), and (2) the best-case performance, selecting the k for which the performance of a method on a dataset is maximal (representing the optimistic case where the optimal value of k for a method is known in advance). To show how the number of outliers in a dataset affects the performance of the methods, we show here results that are aggregated over all datasets, as well as results that are aggregated over only the datasets that contain up to 5% outliers.


In Fig. 2a, b, we show for each method the mean and standard error of the best-case ROC AUC values (aggregated over all datasets), for normalized and unnormalized datasets, respectively; Fig. 2c, d shows the same statistics aggregated over datasets that contain only up to 5 % outliers.


When comparing these figures, we can make two general observations: (1) normalization, on average, leads to better performance for all methods; and (2) having only a small percentage of objects labeled as outliers in the ground truth leads to better performance for all methods, and while some methods are more affected than others, their relative performance does not change dramatically. Because of observation (1), we will focus in the following on results for normalized datasets. Because of observation (2), we will focus on datasets with a small percentage of outliers in most cases while occasionally comparing results over different percentages of outliers. However, results for unnormalized data, as well as aggregated values for different maximum percentages of outliers (up to 10 % and up to 20 %) are available online; the main conclusions are the same across all these results.


Regarding the relative performance of the methods under their best parameter setting, we can see that KDEOS (along with FastABOD on unnormalized data) scores below the others in terms of ROC AUC performance. Although no one method consistently and significantly outperformed the others in all experiments, Fig. 2b does show that a group consisting of the methods kNN, kNNW, and LOF as well as the LOF-variant LDF does stand out to some extent when averaging best performance over all datasets. As shown in Fig. 2d, when averaging over the datasets containing only up to 5 % of outliers, in addition to LOF, LDF, kNN, and kNNW, two more close variants of LOF also stand out: SimplifiedLOF, and LoOP.

Fig. 2 Comparison of the ROC AUC values obtained by each method over different collections of datasets without duplicates. a Mean of the maximum ROC AUC per method (best k chosen for each method), unnormalized data, aggregated over all datasets. b Mean of the maximum ROC AUC per method (best k chosen for each method), normalized data, aggregated over all datasets. c Mean of the maximum ROC AUC per method (best k chosen for each method), unnormalized data, aggregated over datasets with up to 5% outliers. d Mean of the maximum ROC AUC per method (best k chosen for each method), normalized data, aggregated over datasets with up to 5% outliers. e Mean ROC AUC within a window of size 10 around the best k, normalized data, aggregated over all datasets. f Mean ROC AUC over all k, normalized data, aggregated over all datasets. g Mean ROC AUC within a window of size 10 around the best k, normalized data, aggregated over datasets with up to 5% outliers. h Mean ROC AUC over all k, normalized data, aggregated over datasets with up to 5% outliers

To study the stability of performance with respect to the choice of k, we consider the average performance within a range of ±5 around the optimal value: the less stable the method, the greater the expected degradation in performance when aggregating over this larger window. If necessary, the window may be shifted so that it fits entirely within the allowable range of k for the dataset in question. The results aggregated over all datasets are shown in Fig. 2e, and the results aggregated over datasets with up to 5% outliers are shown in Fig. 2g. We observe that for all methods, the overall performance is degraded to some extent; however, this degradation is greater for some methods than for others (indicating that a method is less stable). The method LDF is shown to be the least stable, FastABOD shows a more stable behavior than the other methods, whereas the stabilities of the remaining methods are more or less comparable. When averaging best performance over all datasets, the methods kNN, kNNW, and LOF stand out to some extent, and when averaging over the datasets containing only up to 5% of outliers, the best-performing group also contains SimplifiedLOF and LoOP.

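The windowed stability analysis can be sketched as follows, for an array of ROC AUC values indexed by k (window of size 11 = ±5 around the best k; the shifting logic for windows near the boundary is our reading of the text):

```python
import numpy as np

def windowed_mean(auc_by_k, width=5):
    """Mean ROC AUC in a window of +/-width around the best k,
    shifted when necessary so the window stays inside the valid range."""
    auc = np.asarray(auc_by_k, dtype=float)
    best = int(np.argmax(auc))
    lo = max(0, min(best - width, len(auc) - (2 * width + 1)))
    hi = lo + 2 * width + 1
    return auc[lo:hi].mean()
```

A sharp peak at one k (an unstable method) drags the windowed mean well below the best-case score, whereas a flat performance curve keeps the two close.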

If we widen the window to include all values of k in the range 1 to 100, the performances of all methods degrade even further, as one would expect (Fig. 2f, h). Here, the top group of methods least affected by the variation of k are the distance-based methods kNN and kNNW, as well as FastABOD, followed by LOF when averaging over all datasets, and also by SimplifiedLOF and LoOP when averaging over the datasets containing only up to 5% of outliers.


We applied the Friedman test (Friedman 1937) to examine whether there is a significant difference between the results of the algorithms on the collected datasets. The null hypothesis for this test assumes that there is no significant difference between the algorithms. If the calculated probability is low (p-value less than the selected significance level), the null hypothesis is rejected, which indicates that at least two algorithms are significantly different from each other. The Nemenyi post-hoc test (Nemenyi 1963) can be applied in this scenario to reveal which pairs of algorithms exhibit such differences (if any). In the usage of both tests, we follow Demšar (2006).


The Friedman test was applied to the collection of datasets normalized and without duplicates, with the exception of ALOI and KDDCup99, since not all algorithms have provided results for these two datasets. We base the test on the best achieved quality in terms of ROC AUC (i.e., we chose for each method the best-performing parameter setting (k) for each dataset independently). Due to the assumption of independence between variables of the test, the results for those datasets with multiple subsampled variants were grouped by averaging the best results over all variants for each method.

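The testing procedure can be sketched with SciPy as below. The AUC matrix here is synthetic, for illustration only; the Nemenyi post-hoc step would use an additional package such as scikit-posthocs:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Rows: datasets (blocks); columns: algorithms (best ROC AUC per method
# and dataset). Synthetic numbers -- not results from the paper.
rng = np.random.default_rng(0)
auc = rng.uniform(0.6, 1.0, size=(21, 12))
auc[:, 0] -= 0.3  # make one column clearly worse, like KDEOS in the study

# friedmanchisquare expects one sample per treatment (algorithm).
stat, p = friedmanchisquare(*auc.T)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3g}")
if p < 0.05:
    print("H0 rejected: at least two methods differ significantly; "
          "a Nemenyi post-hoc test can identify which pairs.")
```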

The Friedman test returned a p-value of approximately 2.891E−10, which suggests that the null hypothesis is extremely unlikely. The results for the Nemenyi post-hoc test are shown in Table 3. The symbols '+' or '++' indicate that the column method is better than the row method with 90% ('+') or 95% ('++') confidence; the symbols '−' or '−−' indicate significantly worse performance. Two main observations can be made: (i) KDEOS is statistically worse than kNN, kNNW, LOF, SimplifiedLOF, LoOP, COF, LDF, and INFLO at the 95% confidence level; and (ii) LOF is statistically better than three competitors (ODIN, KDEOS, and FastABOD) at the 95% confidence level, better than LDOF at the 90% confidence level, and is the "winner" from this particular perspective.


These findings also shed some light on how the evaluation of new methods could be performed. It is always possible to find cases (specific parameter settings for specific datasets) where one particular method outperforms its competitors. As was demonstrated here, best practice dictates that the behavior of outlier detection methods be studied across a range of parameter settings, as the results for different parameter values can vary widely. Even if the methods to be compared share a seemingly analogous parameter (such as a neighborhood size k), setting it to the same value for all methods may still not allow for a direct comparison. As indicated in Fig. 1, the methods may depend on the parameter in different ways, and reach their peak performances for different choices of a seemingly identical parameter such as neighborhood size. Surveying the research literature would suggest that best practice is not always followed. There are many publications in which methods are compared and conclusions are drawn based on only a single, arbitrary choice of an important parameter (see the work of Müller et al. (2011, 2012), Liu et al. (2012), Keller et al. (2012), and Ting et al. (2013) for some recent examples published at high-quality venues).


To summarize these findings, we may conclude that after about 15 years of research, in general, the seminal methods kNN, kNNW, and LOF remain the state of the art, especially in datasets with possibly larger amounts of outliers. The methods SimplifiedLOF and LoOP that are closely related to LOF perform similarly well in datasets with few outliers, but not better than LOF, on average. LDF shows a relatively good peak performance but is rather unstable w.r.t. the choice of k. The peak performance of FastABOD is a bit below average but FastABOD is very stable w.r.t. k. It seems that none of the more recent developments offer comprehensive improvement over the classic methods, on average. It would indeed seem that there is no free lunch for unsupervised outlier detection.


5.3 Characterization of the datasets

We now characterize the properties of the studied datasets, and discuss their suitability for the evaluation of outlier detection methods. For this analysis, we examine the collective performance of our representative set of outlier detection methods on these datasets, based on a notion of 'difficulty' of outlier detection, and a notion of 'diversity' of the results.


To obtain a better understanding of the relative difficulty of the datasets, we first consider performance scores for each dataset, aggregated over all methods. For this purpose, for a given dataset, we determine for each method the best performance score obtained over a range of parameter settings, and then average these scores over all methods. For each dataset to which random downsampling is applied over multiple runs, we also show the standard error as an interval about the mean score. Figure 3 shows the results for the datasets from Tables 1 and 2, for each of the performance measures considered.

[Fig. 3: per-dataset performance scores, aggregated over all methods, for each evaluation measure]

The first (top) row in Fig. 3 shows the results using precision at n (P@n). One can clearly observe a wide variation in P@n across the different datasets. For example, on the Parkinson dataset variants, the methods achieve an average P@n of approximately 0.75, whereas on the Annthyroid variants the scores fall below 0.2. Overall, we note a trend towards higher P@n scores as the proportion of outliers in a dataset increases (for these proportions, see Tables 1 and 2; for the datasets in Table 2, we also show results for variants with smaller downsampling rates). As discussed earlier, this trend is due at least in part to the fact that a random ranking leads to an expected P@n score of |O|/N, independently of the value of n. Increasing the proportion of outliers therefore increases the expected P@n score for random rankings.


In order to control for differences in the proportions of outliers, we introduced the adjusted precision at n, Adjusted P@n, which is shown in the second row of Fig. 3. Here, the trend regarding increasing numbers of outliers is attenuated (or even reversed in some cases), in particular when comparing variants of the same dataset across different downsampling rates. Indeed, the results for Adjusted P@n indicate that an increasing number of outliers can lead to a lower quality score (as in the case of the Cardiotocography set), which suggests that the outlier class may begin to form a cluster as the sample size increases. For datasets where the scores are already very low, such as Annthyroid, Pima, and SpamBase, increasing the number of outliers does not have a significant effect. As discussed earlier, both P@n and Adjusted P@n consider only the first n positions of a ranking. If n is low, the resulting scores may be very low, or highly variable, and thus inconclusive.

The third row in Fig. 3 shows the Average Precision, AP, for each dataset. As discussed in Sect. 3, AP attempts to overcome the deficiencies of P@n by computing scores over multiple choices of n. However, the figure shows that as with P@n, AP scores tend to be higher for datasets with larger proportions of outliers. To a large extent, this effect can be explained by the increases in the expected P@n values of which AP is the average.
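AP can be sketched as the mean of P@rank(o) taken at the ranks where the ground-truth outliers occur (again assuming a binary label list ordered from most to least outlying):

```python
def average_precision(ranking):
    """AP: average of P@rank(o) at the rank of each ground-truth outlier."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranking, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

average_precision([1, 0, 1, 0])  # (1/1 + 2/3) / 2 ≈ 0.833
```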

The fourth row in Fig. 3 shows Adjusted AP. Again, the adjusted AP scores tend to be more stable than the unadjusted AP scores when the proportion of outliers is increased.

The fifth row in Fig. 3 shows the results obtained with the most commonly-used performance measure for outlier detection methods, ROC AUC. Like the Average Precision (both adjusted and unadjusted), ROC AUC takes the entire outlier ranking into account. The ROC AUC scores show a clear decreasing trend as the proportion of outliers is increased. This trend is clear even for those datasets (such as Spambase) where Adjusted P@n was less discriminative. However, it is also clear that a relatively high ROC AUC score indicates only that, in the overall ranking, outliers are more likely to be ranked ahead of inliers; it does not necessarily mean that the top rankings are dominated by outliers. We would therefore argue that one cannot rely solely on ROC AUC scores in judging the quality of an outlier method—rather, ROC AUC and Adjusted P@n complement each other, as they reveal different aspects of an outlier ranking, both of which are relevant in practice.
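ROC AUC can be computed directly from its probabilistic interpretation: the probability that a randomly chosen outlier receives a higher outlier score than a randomly chosen inlier, with ties counted as 1/2. A minimal sketch (quadratic pair counting; equivalent to the Mann-Whitney U statistic):

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of outlier vs. inlier scores.

    scores: higher = more outlying; labels: 1 = outlier, 0 = inlier.
    """
    outliers = [s for s, y in zip(scores, labels) if y == 1]
    inliers = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if o > i else 0.5 if o == i else 0.0
               for o in outliers for i in inliers)
    return wins / (len(outliers) * len(inliers))

roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])  # 0.75: 3 of 4 pairs ordered correctly
```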

To eliminate the percentage of outliers as a factor that can influence the relative performance of methods, we will restrict the following analysis to datasets with a comparable proportion of outliers, selecting a variant from those datasets with between 3 and 5 % of outliers. Since every dataset from Table 2, and the datasets ALOI, Glass, Lymphography (for which we select Lymphography_idf), Waveform, WBC, and WDBC from Table 1, all have a variant where the proportion of outliers falls in this range, this will still include a variant of the majority of the different types of data.

Since it is often presumed that the dimensionality of a dataset can be a contributing factor in how well a method performs on the dataset (Houle et al. 2010; Zimek et al. 2012), we first address the question of whether this is the case for the given datasets and methods. In Fig. 4, we show the ROC AUC scores for each method (using the best k parameter for each method) on each of the datasets with between 3 and 5 % of outliers, averaged over the different variants where available. The datasets are arranged on the x-axis of the plot from left to right in order of increasing dimensionality (see Tables 1 and 2 for the dimensionalities of the datasets). Values that are not available (FastABOD on ALOI) are omitted, i.e. the function of FastABOD is discontinuous. One can see that there is no clear pattern that would indicate an effect of the dimensionality for these datasets. In fact, the lowest-dimensional dataset Wilt (5 attributes) is among the most 'difficult', and Internet Ads, even though very high-dimensional (1555 attributes), is of intermediate difficulty for most of the methods. Furthermore, all methods show a similar behavior for most datasets, suggesting that the difficulty of the datasets is independent of their dimensionality.

[Fig. 4: ROC AUC scores of each method (best k) on the datasets with 3–5 % outliers, arranged in order of increasing dimensionality]

To further analyze the suitability of the datasets for outlier detection, we consider now first one variant for each of the datasets: for ALOI, Glass, Lymphography, and Wilt these are just the given ‘base’ datasets; for the other datasets, one of the 10 downsampled versions is chosen at random. As always, the full details for all datasets can be found in the web repository.

In Fig. 5, for each of the selected datasets, a heat map is shown that represents for each outlier detection method (x-axis) the (binned) rank it gives to the outliers in the ground truth (y-axis)—using the best overall solution (the best value for k) according to ROC AUC. In other words, the (x, y) position in a plot for a dataset represents the (binned) rank that the method x has given to ground truth outlier y. The outliers for each dataset are ordered along the y-axis of the corresponding plots according to the average of ranks achieved over all methods, with the top-ranked outliers appearing at the bottom of the heat map. For the heat maps, the ranks are binned and color coded in the following way. Given n = |O| outliers in the ground truth of a dataset, and a ranking of all N points in that dataset by an outlier method, the first n rank positions are assigned to Bin 1 (thus outliers whose rank falls into this bin would have contributed to the P@n score for that method), the second n positions are assigned Bin 2 (outliers falling in this bin would have contributed to the P@2n for that method but not to the P@n score), and so on, up to Bin number 9, and all ranks higher than 9 · n are assigned to Bin 10. Bin 1 is assigned the color dark blue, Bin 10 is assigned the color dark red, and the other bin colors are assigned in consecutive order within the color spectrum, as indicated in the legend of Fig. 5.
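The binning rule described above amounts to a ceiling division of the 1-based rank by n, capped at Bin 10; a small sketch:

```python
import math

def rank_bin(rank, n, num_bins=10):
    """Ranks 1..n -> Bin 1, n+1..2n -> Bin 2, ..., anything beyond 9n -> Bin 10."""
    return min(math.ceil(rank / n), num_bins)

# With n = |O| = 5 ground-truth outliers:
[rank_bin(r, 5) for r in (1, 5, 6, 45, 46, 1000)]  # [1, 1, 2, 9, 10, 10]
```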

[Fig. 5: Diversity of outlier ranks in the 17 datasets (without duplicates, normalized, 3–5 % outliers). The (binned) ranks refer to the best solution with respect to k obtained by each method, according to ROC AUC. Note that FastABOD is missing for ALOI (due to the excessive space requirements on this large set).]

Since for each method we take account only of the best result achieved, these plots tell us less about the overall performance by a method on a given dataset, but more about how difficult the individual outliers are to detect (with blue outliers being easiest to detect), and the differences in the level of (best-case) difficulty across the various methods. For several of these datasets, most of the outliers are relatively easy for the majority of methods to detect. The Parkinson dataset variant shown is an extreme example in which all of the methods tested place all n outliers within the top 2n ranks. In the WDBC variant, most outliers are placed in the top ranks by most methods, with only two outliers that all methods essentially failed to identify. In the Waveform variant, roughly one third of the outliers are easily identified by most methods, while roughly half are not identified in almost all cases. Overall, we see a wide spectrum of difficulty of detection, where some outliers are easily identified by all methods, some are ranked highly by some methods but not others, and some outliers are not detected at all among the top-9n by any of the methods.

To characterize the properties of the datasets across different downsampling rates, and to facilitate the discussion of their suitability for evaluating outlier detection methods, we formalize notions of ‘difficulty’ and ‘diversity’ of a specific dataset variant over a given set of representative outlier detection methods.

Difficulty of a dataset is simply defined as the average of the (binned) ranks of all outliers in the dataset reported by the given set of outlier methods (for each variant shown in Fig. 5, this is the average bin number depicted in the corresponding plot). Datasets with low difficulty score contain outliers that are relatively easy to detect by the majority of methods. A high difficulty score indicates that most or all methods have difficulty in finding the outliers.

Diversity characterizes the agreement of a given collection of outlier methods with respect to the scores they give to the outliers. For each individual outlier o, the diversity score of o is defined as the standard deviation of the (binned) ranks reported for this outlier by the different methods. The diversity score for a dataset is then computed as the Root Mean Square (RMS) of the diversity scores of all outliers in the dataset. If a dataset has a low diversity score, the methods largely agree on the difficulty of identifying the outliers of the dataset. A high diversity score indicates a large disagreement on the ranking of the outliers.
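The two dataset-level scores can be sketched as follows, given a matrix `bins[m][o]` holding the binned rank that method m assigns to ground-truth outlier o (whether the paper uses the population or the sample standard deviation is not restated here; the population version is assumed):

```python
from statistics import mean, pstdev

def difficulty(bins):
    """Mean binned rank over all outliers and all methods."""
    return mean(b for per_method in bins for b in per_method)

def diversity(bins):
    """RMS over outliers of the std. deviation of their bins across methods."""
    num_outliers = len(bins[0])
    sds = [pstdev([per_method[o] for per_method in bins])
           for o in range(num_outliers)]
    return mean(sd ** 2 for sd in sds) ** 0.5

# Two methods in perfect agreement: diversity 0, difficulty = mean bin:
bins = [[1, 2, 10], [1, 2, 10]]
difficulty(bins), diversity(bins)  # (4.333..., 0.0)
```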

Figure 6 shows the position of each dataset with between 3 and 5 % outliers in the space of diversity vs. difficulty. For each dataset with downsampled variants, the 95 % confidence ellipse is also shown to indicate the extent to which the difficulty and diversity can vary with downsampling. The datasets with only one variant are shown as filled dots. The plot also includes artificial points that indicate certain boundary cases: (1) in the lower left corner, a point representing a 'perfect' result, with a diversity score of 0 (all methods agree on the binned ranks of all outliers) and a difficulty score of 1 (all methods identify all n outliers within the top n ranks); (2) in the upper right corner, points representing the difficulty and diversity scores that would be obtained if each of the 12 methods returned a uniform random ranking of all dataset objects; (3) in the lower right corner, points representing the results for a set of 12 identical random rankers, resulting always in a diversity score of 0, but varying in difficulty by chance.

[Fig. 6: Diversity vs. difficulty for the datasets with 3–5 % outliers (without duplicates, normalized). For each dataset with different downsampled variants, the 95 % confidence ellipse is shown along with the individual points for each variant. The mean of each ellipse, which does not correspond to an actual data instance, is marked by an '×' labeled with the corresponding base dataset. The scores of the random rankers are the result of 100 simulations with datasets of 200 points, 10 of which are outliers.]

Note that diversity and difficulty are not completely independent of each other. On datasets of very low difficulty, it follows from the definition that outlier methods must then also tend to agree on the rankings of the outliers. It is impossible to have at the same time high diversity in outlier rankings and a low difficulty score, and thus one would not expect to find datasets in the upper left area of the plot. Also, for the model of identical random rankers, if by chance a larger number of outliers is found in top positions of the ranking, which the majority of the rankers agree on, we can also expect to see a lower difficulty score. While individual random rankers may occasionally, by chance, obtain a high ranking, it is much less likely for many independent rankers to achieve it simultaneously. As one can observe in the figure, this leads to a much lower variance in difficulty scores for the 12 independent random rankers than for the 12 identical rankers.
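This variance difference between the two random models can be reproduced with a small simulation, written as a sketch under the setting of Fig. 6 (200 points, 10 outliers; the seed and the number of simulations are arbitrary choices here):

```python
import random
from statistics import mean, pvariance

def random_difficulty(num_rankers, identical, rng, N=200, num_outliers=10):
    """Difficulty (mean capped bin) of num_rankers random rankings."""
    labels = [1] * num_outliers + [0] * (N - num_outliers)
    shared = rng.sample(labels, N)  # one shared permutation for identical rankers
    bins = []
    for _ in range(num_rankers):
        ranking = shared if identical else rng.sample(labels, N)
        for rank, y in enumerate(ranking, start=1):
            if y == 1:
                bins.append(min(-(-rank // num_outliers), 10))  # ceil(rank/n), cap 10
    return mean(bins)

rng = random.Random(0)
independent = [random_difficulty(12, False, rng) for _ in range(100)]
identical = [random_difficulty(12, True, rng) for _ in range(100)]
# Identical rankers show a much larger spread in difficulty than independent ones:
pvariance(identical) > pvariance(independent)
```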

From Fig. 6 we can also observe that most of the datasets considered in this study are of moderate difficulty and diversity. Such datasets are of greatest interest in outlier evaluation, as they have significant variety in their performance characteristics, and allow for the possibility of a new method to demonstrate its general superiority over existing methods by detecting the outliers in these datasets more consistently.

The observed scores also indicate that the outliers in the given datasets exhibit at least some of the properties that the methods attempt to model. The 12 studied methods lead, in general, to difficulty scores that are much lower than that of the random rankers, indicating that a significant proportion of the objects labeled as outliers can indeed be considered as outliers by the outlier models corresponding to the different methods.

[Fig. 7: On the left, for each dataset, the distribution of the observed difficulty scores. To the right of each, the distribution of the difficulty scores obtained after randomly permuting the ground truth labels. For datasets with only one variant having 3–5 % outliers, only a single value is plotted.]

Datasets that are centrally-located in Fig. 6 can potentially offer insights into the strengths and weaknesses of different outlier methods. However, in most cases we observe a large variance in both diversity and difficulty scores for different downsamplings of the same base dataset (indicated by the error ellipses). This variance can be extreme (as in the case of Parkinson), or more moderate (as in Annthyroid). The large observed variances indicate that, in general, the use of a single downsampled dataset is not appropriate when evaluating the performance of outlier detection methods.

To investigate whether the observed results are significantly different from those of a set of random rankers with the same dependency between them, we also performed experiments in which we computed the difficulty score for each dataset based on the ranking by the 12 methods, but after a random permutation of the ground truth labels. The results are shown in Fig. 7. The figure shows for each dataset the distribution of observed difficulty scores as a boxplot (in light blue), together with a boxplot of the distribution of difficulty scores obtained after randomizing the ground truth labels (in red), combining 1000 randomizations per dataset variant. The results clearly demonstrate that (at least some of) the objects labeled as outliers agree with the outlier model. For each base dataset with 10 different downsamplings, not a single random result in 10,000 is even close to the observed difficulty values; for the datasets with only a single variant none of the 1,000 random results is close to the observed value.
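The randomization step can be sketched as follows (hypothetical helper names, not the paper's code; `rankings[m]` is assumed to list object ids from most to least outlying for method m, and `labels` maps object ids to 1 for ground-truth outliers, 0 otherwise):

```python
import random

def difficulty_for_labels(rankings, labels, num_bins=10):
    """Mean binned rank of the labeled outliers over all method rankings."""
    n = sum(labels.values())
    bins = []
    for ranking in rankings:
        for rank, obj in enumerate(ranking, start=1):
            if labels[obj] == 1:
                bins.append(min(-(-rank // n), num_bins))  # ceil(rank/n), capped
    return sum(bins) / len(bins)

def null_difficulties(rankings, labels, num_permutations=1000, seed=0):
    """Null distribution: difficulty scores after shuffling the labels."""
    rng = random.Random(seed)
    objs, vals = list(labels), list(labels.values())
    scores = []
    for _ in range(num_permutations):
        rng.shuffle(vals)
        scores.append(difficulty_for_labels(rankings, dict(zip(objs, vals))))
    return scores
```

Comparing the observed difficulty against the empirical distribution returned by `null_difficulties` yields boxplot pairs of the kind shown in Fig. 7.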


6 Conclusions

In this experimental study, we addressed a constant challenge in unsupervised outlier detection: the evaluation of algorithms in terms of effectiveness. We have discussed the notorious lack of commonly accepted, well-understood benchmark datasets with annotated ground truth. We also elaborated on commonly used evaluation measures, their strengths and weaknesses, and how several measures can be used in combination to provide insights into the relative performance of outlier methods. For precision at n and for average precision, we proposed an adjustment for chance, which allows meaningful comparisons of the performances of methods on different datasets.

Using the study of evaluation measures as a foundation, we performed an extensive experimental analysis of a representative set of both classical and recent unsupervised outlier detection methods, on a large collection of datasets.

Papers proposing a novel method often justify its performance based on a specific evaluation measure, on few datasets, and for few parameter settings. By using a diverse collection of datasets, several evaluation measures, and a broad range of parameter settings, we argue here that it is typically pointless and unjustified to state the superior behavior of any method for the general case. For optimization problems in machine learning, this fact is captured in the 'no free lunch' theorem (Wolpert 1996). For unsupervised learning approaches, it has been conjectured that the quest for a truly general and superior method is futile [at least for clustering, this has been discussed by Estivill-Castro (2002), Kriegel et al. (2009b), and von Luxburg et al. (2012)], but there is no common understanding of the implications of this conjecture. We therefore show here that in the evaluation of new algorithms for outlier detection, the goal should be to analyze where their strengths and (perhaps more importantly) weaknesses lie, when confronted with datasets of different characteristics.

The gist of our findings is that,when considering the totality of results produced in a systematic way across different parameter settings and a diverse collection of datasets (rather than specific parameter settings for specific datasets), the seminal methods kNN, kNNW, and LOF still remain the state of the art—none of the more recent methods tested offer any comprehensive improvement over those classics, while two methods in particular (LDF and KDEOS) have been found to be noticeably less robust to parameter choices. However, by picking appropriate parameter values, one may cast any of the methods tested in a favorable light, which emphasizes the importance of systematic testing across a range of parameter values.

These findings should be taken with a grain of salt, as our selection of methods included—among general, rather abstract methods such as kNN and LOF—also rather specialized methods such as FastABOD (high-dimensional data) or KDEOS (kernel-based, where the choice of a kernel allows for adaptation to very specific application problems). Recall (Sect. 2) that both FastABOD and KDEOS (and also LDF) require other parameters in addition to the neighborhood size—parameters that have not been studied here but that have reasonable default settings. Furthermore, KDEOS was designed as an ensemble method and was tested here as an individual method, putting it at a disadvantage. The evaluation of ensemble methods for outlier detection (Zimek et al. 2013a) poses different questions and challenges for future work.

Thus our findings do not allow us to conclude that these two methods perform worse than the others in general. If a method does not compete well on arbitrary datasets with classic, generic solutions like kNN or LOF, this does not at all mean that the same method cannot excel on particular domains. Our findings do suggest, however, that novel methods should not be proposed without also indicating those domains or application scenarios where this method is particularly well suited. Merely demonstrating that the method excels on a few datasets for a few parameter settings does not suffice. Most importantly, broad ranges of parameter choices should be tested for the competitors.

Another source of arbitrariness in the outlier evaluation research literature is the very common practice of producing datasets with outlier ground truth by means of class downsampling (along with other preprocessing steps). Our study shows that observations based on downsampling can vary considerably from sample to sample, and thus experimentation on only a small number of downsampled sets may not produce meaningful outcomes. Since for some sets there may be significant variance even when many downsampled variants have been considered, for the sake of reproducibility, the dataset samples that are adopted in an evaluation should be made publicly available, as we have done in our web repository.

On the positive side, our experimental study has provided a better understanding of the characteristics of datasets in current use, according to their suitability for evaluating outlier detection methods. Our characterizations can in principle be extended to other datasets, and thus our methodology could eventually serve to establish a commonly accepted collection of benchmark datasets for the evaluation of outlier detection methods. The extensive collection of results in our web repository can serve as a basis of comparison between established outlier methods and any new methods that may be proposed in the future, over a variety of datasets and a broad range of parameter settings, while avoiding the need to run new experiments on the established methods.

For a typical scenario where this study can be useful for future research in this domain, let us consider the situation in which researchers have developed a new outlier detection method, and have available to them for the evaluation of the method some dataset with annotated ground truth. Such researchers can make twofold use of our results.

They can test their method on the datasets provided in our repository and directly compare its performance with the results of the 12 standard methods we used. In addition to summaries and statistics, we provide also all raw results on our webpage. If the new method is competitive on our datasets, and if the authors can identify a scenario where their new method is particularly well suited (be it more generally applicable across many types of data—such as categorical, graph, or sparse vector data—or be it adapted to more specific purposes—such as high-dimensional data, or a particular domain), much more evidence can be generated for the advantages and disadvantages of the novel method than can be found in many publications today.

They can run the 12 standard methods from our collection on their new dataset, and perform the analysis as presented in Sect. 5.3. In this way, the new dataset can be situated within the space of diversity vs. difficulty (cf. Fig. 6), and the ground truth can be compared with random labelings as in our analysis in Fig. 7. Ideally, a new dataset adds value to the portfolio by providing different but equally reasonable challenges.

In this paper, we do not claim to have delivered the ultimate benchmark dataset collection for outlier detection. Furthermore, the selection of outlier detection methods used in our study is not exhaustive. The online repository could be extended to accommodate both novel methods and additional datasets. We provide online all scripts and implementations required to repeat our experiments. The same scripts and implementations, using ELKI 0.7 (Schubert et al. 2015a), can be easily used to extend the experiments, including more methods and more datasets. We plan to extend the repository in both directions and offer to include also methods and datasets as suggested or provided by users.

This study focused on representative unsupervised outlier detection models based on neighborhoods in Euclidean space. Future extensions of our study could include approximation methods [assessing speed-up as well as approximation quality, extending the work of Orair et al. (2010)], special methods for high-dimensional data (Zimek et al. 2012), or recently-developed ensemble techniques (Zimek et al. 2013a).

Acknowledgments This project was partially funded by FAPESP (Brazil—Grant #2013/18698-4), CNPq (Brazil—Grants #304137/2013-8 and #400772/2014-0), NSERC (Canada), and the Danish Council for Independent Research—Technology and Production Sciences (FTP) (Denmark—Grant 10-081972).
