So now that you understand the basics of classification, let’s talk about loss functions,
because they determine a major difference between a lot of machine learning methods.
Okay so how do we measure classification error?
Well, one very simple way to do it is to use the fraction of times our predictions are wrong.
So just the fraction of times the sign of f(x) is not equal to the truth y, and I can write it like that, okay? The issue with this particular way of measuring classification error is that if you try to minimize it directly, you can run into a lot of trouble, because it’s computationally hard to minimize this thing.
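Just to pin that down, here is a minimal sketch of that misclassification error in Python; I’m assuming labels y in {-1, +1} and real-valued scores f(x), and the function name is just for illustration.

```python
import numpy as np

def misclassification_error(y, f):
    """Fraction of points where the sign of the score f disagrees with the label y.

    Assumes y is an array of +/-1 labels and f an array of real-valued scores.
    """
    return np.mean(np.sign(f) != y)

# Tiny example: two of the four predictions have the wrong sign.
y = np.array([+1, -1, +1, -1])
f = np.array([0.7, 0.2, -1.3, -0.5])
print(misclassification_error(y, f))  # 0.5
```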
So let me give you the geometric picture here: the decision boundary is this line right here,
and f being positive is here and f being negative is here, and the red points are all misclassified.
And now what I’m going to do is something you might not be expecting, which is that I’m going to move all the correctly classified points to one side of the decision boundary,
and all the misclassified points to the other side.
There they go: the misclassified points move across the decision boundary, and then so do the correctly classified ones.
Okay, so I’m glad I did that; now I have the ones we got correct over here and the ones we got wrong over there. And the labels on this plot are wrong now because we moved everything, so it’s actually something like that.
Okay so over here, on the right, either f is positive and y is also positive, so y times f is positive,
or they’re both negative so the product is positive again.
And then over here on the left, we have cases where the sign of f is different from y,
so the product is negative. And then the ones I suffer a penalty for are all these guys over there.
Okay so let’s keep that image in mind over here, and then I’ll put the labels up.
Now this function tells us what kind of penalty we’re going to issue for being wrong.
Okay so right now, if y times f is positive, it means we got it right and we lose 0 points.
And if we get it wrong, the point is on the wrong side and so we lose one point.
Now I’m just going to write this function another way, which is like this:
we lose one point if y times f is less than 0, and otherwise we lose no points. And this is the classic 0-1 loss.
It just tells you whether your classifier is right or wrong.
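Written out (this is my reconstruction of the slide’s formula, in standard notation), the 0-1 loss as a function of the label y and the score f(x) is

$$\ell_{0\text{-}1}\big(y, f(x)\big) \;=\; \mathbf{1}\big[\,y\,f(x) < 0\,\big] \;=\; \begin{cases} 0 & \text{if } y\,f(x) \ge 0,\\ 1 & \text{if } y\,f(x) < 0. \end{cases}$$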
And then this thing is called a loss function, and there are other loss functions too,
and this one’s nice because it directly counts our mistakes, but it’s problematic because it’s not smooth,
and we have issues with things that are not smooth in machine learning.
So let’s try some more loss functions.
So while we’re doing this,
just keep in mind that these points over here are the ones that are very wrong,
because they’re on the wrong side of the decision boundary,
but they’re really far away from it too. And these points are wrong, but they’re not as bad;
they’re on the wrong side of the decision boundary but they’re pretty close to it.
And then we’ll say these points are sort of correct, and we’ll say these points are very correct.
And what we’d really like to have are loss functions that don’t penalize the very correct ones,
but where the penalty gets worse and worse as you go to the left.
But maybe we can use some other loss function,
something that – you know, maybe we get a small penalty for being sort of correct and then a bigger penalty for being sort of wrong and then a huge penalty for being very wrong.
Something that looks like this would be ideal.
So again,
the horizontal axis is y times f, and this red one is the 0-1 loss, which is 1 whenever y disagrees with the sign of f, and the other curves are different loss functions, and they actually correspond to different machine learning algorithms.
And again, just keep in mind that on the right
are points that are on the correct side of the decision boundary,
so they don’t suffer much penalty, and on the left
are points that are incorrectly classified, and they suffer more penalty.
This is the loss function that AdaBoost uses.
AdaBoost is one of the machine learning methods that we’ll cover in the course.
And this is the loss function that support vector machines use;
it’s a sloping line, and then a flat line at zero. And this is the loss function for logistic regression, and we’re going to cover all three of these.
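As a concrete sketch, here are the standard textbook forms of those three losses, written as functions of the margin m = y·f(x); I’m assuming these are the same curves shown on the slide.

```python
import numpy as np

def zero_one_loss(m):
    """The 0-1 loss: 1 if the margin is negative, else 0."""
    return (m < 0).astype(float)

def exponential_loss(m):
    """AdaBoost's loss: exp(-m)."""
    return np.exp(-m)

def hinge_loss(m):
    """The SVM loss: a sloping line for m < 1, then flat at zero."""
    return np.maximum(0.0, 1.0 - m)

def logistic_loss(m):
    """The logistic regression loss: log(1 + exp(-m))."""
    return np.log1p(np.exp(-m))

# Evaluate at a few margins, from "very wrong" to "very correct".
m = np.array([-2.0, -0.5, 0.5, 2.0])
for name, loss in [("0-1", zero_one_loss), ("exp", exponential_loss),
                   ("hinge", hinge_loss), ("logistic", logistic_loss)]:
    print(name, np.round(loss(m), 3))
```

Note how all of them keep shrinking as the margin grows, and how the exponential loss punishes the very wrong points hardest, which matches the picture we just described.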
Now I’m going to write this idea about the loss functions in notation on the next slide.
Okay so start here:
the misclassification error is the fraction of times that the sign of f is not equal to the truth y; that’s this one.
I can rewrite it this way, okay: the fraction of times y times f is less than 0.
And then we’ll upper-bound this by these loss functions.
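In symbols (again, my reconstruction of the slide), the chain looks like this: if the loss ℓ sits on or above the 0-1 indicator, then

$$\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\big[\,y_i\,f(x_i) < 0\,\big] \;\le\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i\,f(x_i)\big),$$

because each term on the right is at least as large as the matching term on the left. (The exponential and hinge losses satisfy this as written; the logistic loss needs a rescaling by 1/log 2 to sit above the indicator near zero.)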
Okay, so then what is a good way to try to reduce the misclassification error, which is that guy?
Well, you could just try to minimize the average loss.
So if you had a choice of functions f, you could try to choose f to minimize this thing,
which hopefully would also reduce the misclassification error, but in a computationally easier way.
So here’s your first try for a machine learning algorithm.
Just choose the function f to minimize the average loss. And this seems like a good idea, right?
Well, it is, and that’s what most machine learning methods are based on.
And how to do this minimization over models to get the best one,
that involves some optimization techniques, which go on behind the scenes.
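To make that concrete, here’s a minimal sketch of “choose f to minimize the average loss” for a linear model f(x) = w·x with the logistic loss, fit by plain gradient descent; the toy data, step size, and iteration count are all illustrative assumptions, not anything from the lecture.

```python
import numpy as np

def average_logistic_loss(w, X, y):
    """Average logistic loss: (1/n) * sum of log(1 + exp(-y_i * w.x_i))."""
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

def fit_linear_classifier(X, y, lr=0.1, steps=1000):
    """Choose w to (approximately) minimize the average loss by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        # Gradient of the average logistic loss with respect to w.
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
        w -= lr * grad
    return w

# Toy data: two noisy clusters labeled +1 and -1 (illustrative only).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+1.0, scale=0.7, size=(50, 2))
X_neg = rng.normal(loc=-1.0, scale=0.7, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(50), -np.ones(50)])

w = fit_linear_classifier(X, y)
print("average loss:", average_logistic_loss(w, X, y))
print("training error:", np.mean(np.sign(X @ w) != y))
```

The last print reports the training error, and as we’re about to see, driving that number down is not the whole story.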
But there’s one more thing I didn’t quite tell you,
which is that we want to do more than have a low training error.
We want to predict well on data that we haven’t seen before.
We want to, you know, generalize to new points, and that’s why we need statistical learning theory, because this algorithm that I showed you, that’s not quite right, and you’ll see why.
It’s pretty good,
but it’s missing a key element, one that encourages the model to stay simple and not overfit.
So I’ll talk about statistical learning theory shortly.