Chapter 1: Classification -- 05. Maximum Likelihood Perspective (Translation)

Let’s talk more in depth about logistic regression. Putting that in the corner for now,

I wanted to give you another perspective on logistic regression,

which is the maximum likelihood perspective.

And you can skip this section if you’re not interested and nothing bad’s going to happen,

but it might be useful to some of you. Look at this function here;

it looks like – what does this function look like?

It looks like something is growing and then saturating.

Now this function is called the logistic function,

and it was one of the very early population models invented by Adolphe Quetelet and his pupil,

Pierre Francois Verhulst, somewhere in the mid-19th century,

and they were modelling growth of populations, and they were thinking that when a country gets full,

the population won’t grow as much and then the population will saturate, which is why it looks like that.

And it sounds kind of funny, but that’s what they were doing.

So see this is when the country is just growing and then here’s where it’s full and the population won’t grow anymore. Anyways, so how does this relate to logistic regression? Well, it does.

So what do you know about probabilities, right? They don’t go below 0, and they don’t

go above 1. You can take any number you like and send it through this function, and it’ll

give you a number between 0 and 1, it’ll give you a probability.

So this is the basic formula for that function, so when t is really big, then e to the t is much bigger than 1, 

and so this one basically gets ignored down here and you get 1.

And if t is really negative, then the top goes to 0 and the bottom goes to 1, and you get 0.
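
For reference, here is that formula written out (the symbol σ for the logistic function is my notation, not necessarily the slide's):

```latex
\sigma(t) = \frac{e^{t}}{1 + e^{t}},
\qquad \lim_{t \to +\infty} \sigma(t) = 1,
\qquad \lim_{t \to -\infty} \sigma(t) = 0
```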

Okay, so again, where does logistic regression come in?

You know, here is where it enters logistic regression.

Let’s model the probability that the outcome is –

the outcome y is 1 for a specific x and beta, just like this, okay?

So why would we do this? It looks like a complicated function;

where did I get this? So here’s the trick: the thing on the left is a probability,

so the thing on the right had better be a probability. And guess what, we know it is.

It’s just a logistic function, and a logistic function only produces probabilities.

Okay so now this model makes sense, that’s why I want to model a probability like this.

And now I’m just putting it in matrix notation,

just to make my life a little bit easier instead of having to write all these sums all over the place,

I can just write this matrix x times the vector beta.

Now I also can compute the probability that the label is minus 1,

given xi and beta using the model. So it’s just one minus the probability that it’s 1,

so it’s just 1 minus that guy, and you can simplify that and make it look like this.
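
Written out side by side (a sketch in the lecture's spirit, writing x_i β for the score of observation i):

```latex
P(y_i = 1 \mid x_i, \beta) = \frac{e^{x_i \beta}}{1 + e^{x_i \beta}},
\qquad
P(y_i = -1 \mid x_i, \beta) = 1 - \frac{e^{x_i \beta}}{1 + e^{x_i \beta}} = \frac{1}{1 + e^{x_i \beta}}
```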

Now I’m going to need to calculate the likelihood of each of the observations,

which is the probability of observing the label y that I actually observed, given its x and the model beta.

And I am actually almost there, because I’ve already done all of that.

So this is it, right, if y is minus 1, then you use this one. If y is plus 1,

then you use this one, and that’s – that’s this probability right here.

And then I can simplify this a little bit more, because remember y is minus 1,

so I can always put a minus y here because minus y is just 1,

and I can do this same thing with the other term here.

So first thing I want to just divide top and bottom by e to the x beta,

and I end up with something that looks like that. And then I can always multiply by 1 in disguise,

because remember y is positive 1,

so I can just write this as minus y x beta, because I just multiplied by 1 here, which is just the y.

Now the interesting thing is these two expressions should look rather similar to you; in fact,

they should look exactly the same because they are.

That’s very nice, because it means that the probability for y to equal whatever it does is written either this way or that way and they are the same. Okay, so I can just put it right there.
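
Spelling that out, both cases collapse to a single expression because y_i is ±1:

```latex
y_i = -1:\;\; \frac{1}{1 + e^{x_i \beta}} = \frac{1}{1 + e^{-y_i x_i \beta}},
\qquad
y_i = +1:\;\; \frac{e^{x_i \beta}}{1 + e^{x_i \beta}} = \frac{1}{1 + e^{-x_i \beta}} = \frac{1}{1 + e^{-y_i x_i \beta}}
```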

Alright, just adding a little space there, and then I compute the likelihood for all the data,

I have to multiply all these probabilities together. So what I end up with is this,

so this is the full likelihood for the dataset, and it looks just like that.

Okay, and so this guy equals this, which equals this, which equals that.

So I can summarize there and start with a fresh page; there it is.
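
In symbols, that full likelihood (assuming the observations are independent) is:

```latex
L(\beta) = \prod_{i=1}^{n} P(y_i \mid x_i, \beta) = \prod_{i=1}^{n} \frac{1}{1 + e^{-y_i x_i \beta}}
```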

And now, I can take negative log of both sides – that’s completely legal.

And then, when you take the log of a product, it becomes the sum of the logs, so there we are.

And then this fraction becomes – this is the log of this to the negative 1 power,

so the negative 1 comes out front and cancels with this minus sign, and I get this expression.
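
Written as one line, the negative log of the product becomes the sum of the logs:

```latex
-\log L(\beta) = -\sum_{i=1}^{n} \log \frac{1}{1 + e^{-y_i x_i \beta}} = \sum_{i=1}^{n} \log\left(1 + e^{-y_i x_i \beta}\right)
```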

Now hopefully you have a good memory,

because this expression is exactly the same as the one I have up in the corner, okay?

So that’s cool: minimizing the negative log likelihood is the same as minimizing the logistic loss.

Now minimizing negative log likelihood is like finding the coefficients that are the most likely to generate your data,

if you use the logistic model. So I can derive logistic regression a different way, but why do I care?

Why do I need this other derivation when I have the first one? And the answer is really neat:

it’s because now you have this. Remember this? This is the logistic function,

but now it provides a probabilistic interpretation of the model.

Whatever score that the model gives the observation, now you get the probability that y equals 1,

given x. You don’t just get a classification, so maybe I can show it geometrically another way.

Okay, so back to this picture over here. Now, this is the logistic function,

and over here you get a higher probability estimate and over here it’s very low.

And that interpretation is not something that you have with the loss function interpretation.
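
As a small illustration of that probabilistic reading (a sketch with made-up numbers; the array values and the helper name logistic are mine, not from the lecture):

```python
import numpy as np

def logistic(t):
    """Logistic function: maps any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical fitted coefficients and a few observations (illustrative only).
beta = np.array([0.8, -1.5])
X = np.array([[2.0, 0.5],
              [0.1, 1.0],
              [-1.0, 2.0]])

scores = X @ beta         # f(x) = x . beta, the model's raw score for each row
probs = logistic(scores)  # estimated P(y = 1 | x): large positive scores give values near 1
print(probs)
```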

Okay, so just a summary here: for a logistic regression,

we split data randomly into training and test sets,

we estimate the coefficients and train the model by minimizing the objective,

and then we score the model, and evaluate the model.

And if we want to, now we have the probabilistic interpretation;

we can send – we can get that through the function f.

We can plug f into the logistic function to get an estimate of the probability that y equals 1 given x.

And again, this is just the basic version in Azure ML;

this is all the programming – I just, you know, literally moved the modules over, put the connectors on them, hit run, and that’s it.
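
If you would rather see those steps as code than as modules, here is a rough sketch of the same pipeline using scikit-learn on a synthetic dataset (an illustration of the steps, not the lecturer's Azure ML setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the dataset used in the lecture.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Split the data randomly into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Estimate the coefficients by minimizing the (regularized) logistic-loss objective.
model = LogisticRegression()
model.fit(X_train, y_train)

# Score and evaluate the model, and read off the probability that y = 1 given x.
accuracy = model.score(X_test, y_test)
prob_y_is_1 = model.predict_proba(X_test)[:, 1]
print(accuracy, prob_y_is_1[:5])
```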

Now this is just a preview of what happens when we put regularization on there;

we can actually improve performance by asking the logistic model to be simple,

and we can do that by adjusting this lovely lovely constant c,

and that determines how much regularization we’ll put into the model to keep it simple.

And again, we can work with linear models.

So for regularization, we’ll choose the sum of the squares of these coefficients,

and this can be written in a nice, neat way; it’s called the L2 norm,

and that’s what we’re going to use to measure simplicity of models,

and that constant c is going to determine how much we care about the simplicity of the model versus its accuracy.

And here’s another kind of regularization, where we take the sum of the absolute values of those coefficients; written this way, it’s called the L1 norm.
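
In symbols, the two regularized objectives look roughly like this, using the lecture's convention that a larger c means more regularization (note that some libraries, scikit-learn included, use the opposite convention, where their C is the inverse of the regularization strength):

```latex
\min_{\beta} \; \sum_{i=1}^{n} \log\left(1 + e^{-y_i x_i \beta}\right) + c\,\|\beta\|_2^2,
\qquad \|\beta\|_2^2 = \sum_j \beta_j^2 \quad \text{(L2 norm, squared)}
```

```latex
\min_{\beta} \; \sum_{i=1}^{n} \log\left(1 + e^{-y_i x_i \beta}\right) + c\,\|\beta\|_1,
\qquad \|\beta\|_1 = \sum_j |\beta_j| \quad \text{(L1 norm)}
```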

Now in practice, these two kinds of regularization –

they have very different meanings and they change the coefficients in different ways,

but they are both very helpful for purposes of generalization.
