So let’s talk about how to evaluate a classifier. Following the example,
we have our features, with each observation represented by a set of numbers,
and each observation is labelled. Then the machine learning algorithm comes along and gives a number to each observation, which is essentially what it thinks is going on:
the magnitude of that number says how far the observation is from the decision boundary,
and the sign of this function f is the predicted label.
Now let’s put those in another column, so this is y hat,
the sign of f; it just tells you which side of the decision boundary that point is on.
And if the classifier is right, then y hat agrees with y.
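Just to make that concrete, here’s a minimal sketch in Python, assuming a simple linear score f(x) = w·x + b with made-up weights; the predicted label y hat is just the sign of that score.

```python
# Hypothetical linear score f(x) = w . x + b; the weights and bias are made up.
w = [0.8, -0.5]
b = 0.1

def f(x):
    """The classifier's score: roughly how far x is from the decision boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    """The predicted label y_hat is the sign of the score."""
    return 1 if f(x) >= 0 else -1

x = [1.2, 0.4]
print(f(x), predict(x))   # the score and the predicted label (+1 or -1)
```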
Okay so let’s just look at these two columns for a few minutes.
If the classifier is really good, these predicted labels often agree with the true labels.
And let’s just put a few more examples in there just for fun.
Now this is called a true positive, where the true label is plus one, and the predicted label is plus one.
In a true negative, they’re both minus one.
And then a false positive, or type I error, is when you think it’s positive,
but it’s actually not. And a false negative, or type II error, is when you think it’s negative,
but it’s actually not.
And then below that is just another true negative, and below that another false negative, and so on.
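As a quick illustration, here’s a small sketch that tallies those four outcomes from a pair of label vectors; the labels below are made up just for the example.

```python
# Made-up true labels y and predicted labels y_hat (each is +1 or -1).
y     = [ 1, -1,  1, -1, -1,  1, -1]
y_hat = [ 1, -1, -1,  1, -1,  1, -1]

tp = sum(1 for yi, pi in zip(y, y_hat) if yi == 1 and pi == 1)    # true positives
tn = sum(1 for yi, pi in zip(y, y_hat) if yi == -1 and pi == -1)  # true negatives
fp = sum(1 for yi, pi in zip(y, y_hat) if yi == -1 and pi == 1)   # false positives (type I)
fn = sum(1 for yi, pi in zip(y, y_hat) if yi == 1 and pi == -1)   # false negatives (type II)

print(tp, tn, fp, fn)   # 2 3 1 1 for these made-up labels
```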
Okay so the errors come in these two flavours. Now, how do we judge the quality of a classifier?
And we construct a confusion matrix,
and the confusion matrix has the true positives over on the upper left,
and then the true negatives down on the lower right,
and then the false positives up here and the false negatives down here so you can see that this classifier’s pretty good because most of the points are either true positives or true negatives,
which is good and there’s not too many errors.
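Arranged as a 2-by-2 table, that might look like the following sketch, reusing the counts from the made-up example above; the layout convention here (true label by row, predicted label by column, positives first) is just an assumption for illustration.

```python
# The four counts from the made-up example above.
tp, tn, fp, fn = 2, 3, 1, 1

# 2x2 confusion matrix, assuming rows are the true label (+1 then -1)
# and columns are the predicted label (+1 then -1).
confusion = [
    [tp, fn],   # truly positive: predicted +1 (TP) or -1 (FN)
    [fp, tn],   # truly negative: predicted +1 (FP) or -1 (TN)
]
for row in confusion:
    print(row)
```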
Now, if we don’t care about whether we have false positives or false negatives,
if they’re both equally bad, then we can just look at the classification error.
This is the fraction of points that are misclassified;
it’s also called the misclassification rate.
And I can write it this way:
it’s the fraction of points for which the predicted label is not equal to the true label,
so it just counts up the false positives and false negatives and divides by n.
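In symbols, that’s the fraction of points with y hat not equal to y, which works out to (FP + FN) / n; here’s a minimal sketch using the same made-up labels as before.

```python
# The same made-up labels as before.
y     = [ 1, -1,  1, -1, -1,  1, -1]
y_hat = [ 1, -1, -1,  1, -1,  1, -1]

n = len(y)
# Misclassified points are exactly the false positives plus the false negatives.
misclassification_rate = sum(1 for yi, pi in zip(y, y_hat) if pi != yi) / n
accuracy = 1 - misclassification_rate
print(misclassification_rate, accuracy)   # 2/7 and 5/7
```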
The true positive rate is simply the number of true positives divided by the total number of positives.
So it’s the true positives divided by the sum of the true positives and the false negatives;
in other words, it’s the number of points whose true label is one
and whose predicted label is also one, divided by the number of points whose true label is one. It’s also called the sensitivity or the recall,
and the true negative rate, or the specificity, is defined this way:
it’s just the number of true negatives divided by the total number of negatives,
so it’s the number of points that are truly negative
and are predicted to be negative, divided by the total number of negatives.
So again, it’s the true negatives divided by the sum of the true negatives and the false positives.
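As a small sketch with the hypothetical counts from before:

```python
# Hypothetical counts from the confusion matrix above.
tp, tn, fp, fn = 2, 3, 1, 1

tpr = tp / (tp + fn)   # true positive rate = sensitivity = recall
tnr = tn / (tn + fp)   # true negative rate = specificity
print(tpr, tnr)        # 2/3 and 3/4
```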
And then the false positive rate looks like this:
so it’s the number of false positives divided by the total number of negatives.
And then there are a few more metrics.
There’s the precision, which is the true positives divided by the total number of predicted positives,
so in other words it’s the true positives divided by the sum of the true positives and the false positives.
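And the same kind of sketch for the false positive rate and the precision, again with hypothetical counts:

```python
# Hypothetical counts from the confusion matrix above.
tp, tn, fp, fn = 2, 3, 1, 1

fpr       = fp / (fp + tn)   # false positive rate: FP over all true negatives
precision = tp / (tp + fp)   # precision: TP over all predicted positives
print(fpr, precision)        # 1/4 and 2/3
```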
The reason I’m going through all of these is that they’re provided by pretty much any piece of software you’ll work with,
and they’re quantities of interest that you hear about fairly often. And here’s the F1 score; the F1 score is kind of neat. It’s a balance between precision and recall.
So it’s two times precision times recall divided by precision plus recall.
So if either the precision or the recall are bad, then the F1 score is bad.
And precision, again, uses the true positives and the false positives, and recall uses the true positives and the false negatives.
But if you get a good F1 score, that generally means that your model is good.
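Here’s a minimal sketch of that: the F1 score is the harmonic mean of precision and recall, so a bad value for either one drags it down. The precision and recall values below are made up to show the effect.

```python
def f1(precision, recall):
    """F1 = 2 * precision * recall / (precision + recall)."""
    return 2 * precision * recall / (precision + recall)

# Balanced case: both precision and recall are decent.
print(f1(0.75, 0.60))   # ~0.667

# Lopsided case: great precision but terrible recall drags F1 way down.
print(f1(0.90, 0.10))   # 0.18
```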
F1 score, precision, and recall are all terms that are used very often in information retrieval, for things like evaluating search engines. So here’s just some more detail about that.
The precision at n for a search query is defined like this:
so of the top n pages retrieved by the search engine, how many were actually relevant to the query?
We can write that as the number of true positives among those top n results, divided by n,
the number of pages retrieved. And then the recall at n for a search query is the following:
of all the relevant webpages out there, that is, the total number of positives,
what fraction of them did we get in our top n results?
So that’s the number of true positives in the top n divided by the total number of positives.
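A minimal sketch, assuming we have a hypothetical ranked result list and a set of pages judged relevant; the page names are made up.

```python
# Made-up ranked result list and the set of truly relevant pages.
ranked_results = ["page_a", "page_b", "page_c", "page_d", "page_e"]
relevant       = {"page_a", "page_c", "page_f", "page_g"}

def precision_at_n(results, relevant, n):
    """Of the top n retrieved pages, what fraction are relevant?"""
    top_n = results[:n]
    return sum(1 for page in top_n if page in relevant) / n

def recall_at_n(results, relevant, n):
    """Of all relevant pages, what fraction show up in the top n?"""
    top_n = results[:n]
    return sum(1 for page in top_n if page in relevant) / len(relevant)

print(precision_at_n(ranked_results, relevant, 3))  # 2/3
print(recall_at_n(ranked_results, relevant, 3))     # 2/4
```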
Which measure should you use?
Now, machine learners often use plain accuracy,
or the misclassification error, because it’s a single number that you can directly compare across algorithms.
You need a single measure of quality to compare algorithms:
once you have two measures of quality, you can’t directly make a comparison, because what if one algorithm is better according to one quality measure but worse according to the other?
Then you can’t compare them.
But this only works when errors on the positive class count the same as errors on the negative class,
and it doesn’t work when the data are imbalanced; but anyway, that’s what people do.
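To see why imbalance is a problem, here’s a small sketch with made-up, heavily imbalanced labels, where a classifier that always predicts negative still gets high accuracy but a true positive rate of zero.

```python
# Made-up imbalanced data: 98 negatives, 2 positives.
y     = [-1] * 98 + [1] * 2
y_hat = [-1] * 100          # a useless classifier that always says "negative"

accuracy = sum(1 for yi, pi in zip(y, y_hat) if yi == pi) / len(y)
tpr = sum(1 for yi, pi in zip(y, y_hat) if yi == 1 and pi == 1) / y.count(1)

print(accuracy)  # 0.98, looks great
print(tpr)       # 0.0, catches none of the positives
```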
So doctors, they often want to know how many of the positives they got right,
and how many of the negatives they got right,
so it makes sense that they want to look at both the true positive rate and the true negative rate.
And if you’re in information retrieval, then you probably want to use precision and recall and F1 score, which is a combination of the two.
So let’s say you’re judging the quality of a search engine like Bing, for instance.
You might care about precision; again, precision asks, of the webpages that the engine returned, how many were relevant?
That’s precision, and then recall is the fraction of the relevant webpages that the search engine returned.
Then you use the F1 score, so you have a single measure,
and you can compare the quality of the different search engines in an easy way.
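Putting that together, here’s a small sketch that compares two hypothetical engines on the same query by their F1 over the top n results; the result lists and relevance judgments are made up.

```python
# Made-up relevance judgments and top-5 results from two hypothetical engines.
relevant = {"p1", "p2", "p3", "p4"}
engine_a = ["p1", "p9", "p2", "p8", "p3"]
engine_b = ["p9", "p8", "p1", "p7", "p6"]

def f1_at_n(results, relevant, n):
    """F1 computed from precision at n and recall at n."""
    hits = sum(1 for page in results[:n] if page in relevant)
    precision = hits / n
    recall = hits / len(relevant)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_at_n(engine_a, relevant, 5))  # 3 hits -> higher F1 (~0.667)
print(f1_at_n(engine_b, relevant, 5))  # 1 hit  -> lower F1 (~0.222)
```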