introduction to ML strategy
1 - why ML strategy?
This part is about how to structure a machine learning project, that is, machine learning strategy.
What is machine learning strategy? Let's say we are working on a cat classifier. We have gotten the system to 90% accuracy, but this is not good enough for the application. We have a lot of ideas to improve the system,
- collect more training data
- collect more diverse training set
- train the algorithm longer with gradient descent
- try Adam instead of gradient descent
- try bigger/smaller network
- try dropout
- add L2 regularization
- try a fancier network architecture
- hidden units
- activation functions
We often have a lot of ideas we could try, and the problem is that, if we choose poorly, it's entirely possible that we end up spending months in some direction that doesn't do any good.
We will learn strategies for analyzing a machine learning problem that will point us in the direction of the most promising things to try.
2 - Orthogonalization
Orthogonalization refers to the idea that, like an old TV whose designers made each knob adjust only one thing, we want each control of the system to affect only one aspect of its performance.
For a supervised learning system to do well, we usually need to tune the knobs of the system so that four things hold true:
- Fit the training set well on the cost function
- knobs: train a bigger neural network, or switch to a better optimization algorithm.
- Fit the dev set well on the cost function
- knobs: regularization, or getting a bigger training set.
- Fit the test set well on the cost function
- knobs: getting a bigger dev set, because if the system does well on the dev set but not on the test set, it probably means we have overfit the dev set, so we need to go back and get a bigger dev set.
- Perform well in the real world
- knobs: change either the dev set or the cost function, because if doing well on the test set according to some cost function doesn't correspond to what we need in the real world, it means either the dev/test set distribution is not set correctly, or the cost function is not measuring the right thing.
The goal is to diagnose what exactly is the bottleneck in the system's performance, and to identify the specific set of knobs to tune.
3 - single number evaluation metric
When training a machine learning system, the process will be much faster if we have a single real number evaluation metric.
Applied machine learning is a very empirical process: we often have an idea, code it up, run the experiment to see how it did, and use the outcome of the experiment to refine the idea.
- precision: of the examples the classifier recognizes as cats, what percentage actually are cats? If the classifier has 95% precision, then when it says something is a cat, there is a 95% chance it really is a cat.
- recall: of all the images that really are cats, what percentage were correctly recognized by the classifier? If the classifier has 90% recall, then of all the real cat images, it accurately picked out 90% of them.
The problem with using precision and recall as the evaluation is that if classifier A does better on recall and classifier B does better on precision, we are not sure which classifier is better, and it's difficult to know how to pick one out of a dozen. So we need a new evaluation metric that combines precision and recall: the F1 score, the harmonic mean of the two, F1 = 2 / (1/P + 1/R) = 2PR / (P + R).
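As a minimal sketch, precision, recall, and F1 can all be computed from the raw counts of true positives, false positives, and false negatives:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp)  # of predicted cats, how many really are cats
    recall = tp / (tp + fn)     # of real cats, how many were found
    return 2 * precision * recall / (precision + recall)
```

For example, a classifier with 90 true positives, 10 false positives, and 10 false negatives has 90% precision, 90% recall, and F1 = 0.9.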
Having a well-defined dev set, on which we measure precision and recall, plus a single number evaluation metric, allows us to quickly choose the better classifier, which speeds up iterating. So having a single real number evaluation metric can really improve the efficiency of making those decisions.
4 - satisficing and optimizing metric
It's not always easy to combine all the things we care about into a single real number evaluation metric; in that case, it is sometimes useful to set up satisficing and optimizing metrics. Say accuracy and running time are the evaluation metrics: we want to choose the classifier that maximizes accuracy, subject to the running time being less than or equal to 100 milliseconds. In this case, we would say accuracy is an optimizing metric and running time is a satisficing metric. Here classifier B is better than A because, of all the classifiers with a running time under 100 milliseconds, B has the best accuracy.
So more generally, if we have m metrics that we care about, it is reasonable to pick one of them to be the optimizing metric, which we want to do as well as possible on, and the other m-1 to be satisficing metrics: as long as they reach some threshold, we do not care how much better they are beyond it.
To summarize, if there are multiple things we care about, set one as the optimizing metric, which we want to do as well as possible on, and one or more as satisficing metrics, which only need to meet their thresholds. Now we have an almost automatic way of quickly looking at multiple classifiers and picking the best one.
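The selection rule above can be sketched in a few lines; the classifier names and numbers here are hypothetical, and the 100 ms threshold is the one from the example:

```python
# Hypothetical classifiers with an optimizing metric (accuracy)
# and a satisficing metric (runtime must be <= 100 ms).
classifiers = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]

feasible = [c for c in classifiers if c["runtime_ms"] <= 100]  # satisfice first
best = max(feasible, key=lambda c: c["accuracy"])              # then optimize
```

Here C has the best raw accuracy but fails the satisficing threshold, so B is chosen.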
5 - train/dev/test set distribution
The way we set up the training, dev, and test sets can have a huge impact on how rapidly we can make progress on building a machine learning application.
In this part, we focus on how to set up the dev set and test set.
The workflow in machine learning is that we try a lot of ideas, training different models on the training set, then use the dev set to evaluate the different ideas and pick one; we keep iterating to improve dev set performance until finally we have a model we are happy with, which we then evaluate on the test set.
Make the dev and test sets come from the same distribution.
What we need to keep in mind is that once we establish the dev set and a single real number evaluation metric, we can try different ideas, run experiments, and very quickly use the dev set and the metric to evaluate classifiers and pick the best one. We can almost always train to do well on the metric on the dev set. So having the test set and dev set come from different distributions is like setting a target and spending months trying to get closer and closer to the bull's eye, only to realize after months of work that the bull's eye is in a different location somewhere else. To avoid this, we need to make sure the dev and test sets come from the same distribution.
Guideline:
Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on; in particular, the dev set and test set should come from the same distribution.
We have not yet talked about how to set up the training set, but the important takeaway from this section is that setting up the dev set, as well as the evaluation metric, really defines what target we want to aim at; by setting the dev set and test set to the same distribution, we are really aiming at whatever target we hope the machine learning algorithm will hit.
The way we choose the training set will affect how well we can actually hit that target.
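One simple way to guarantee the dev and test sets share a distribution is to pool all the candidate examples and shuffle before splitting. A hypothetical sketch, assuming data collected from two regions (the region names and counts are made up for illustration):

```python
import random

# Hypothetical pool of examples from two regions.
examples = [("US", i) for i in range(500)] + [("India", i) for i in range(500)]

random.seed(0)
random.shuffle(examples)  # mix regions so both splits reflect one distribution

dev = examples[:500]      # dev and test are now drawn from the same shuffled pool
test = examples[500:]
```

Splitting without the shuffle (e.g., dev = US, test = India) would put the dev bull's eye in a different place than the test one.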
6 - size of dev set and test set
In the earlier era of machine learning, the old rule of thumb of a 70/30 train/test or 60/20/20 train/dev/test split was pretty reasonable, especially back when data set sizes were smaller, such as 100, 1,000, or 10,000 examples. But in the modern machine learning era, with data sets of a million examples or more, it might be quite reasonable to use something like 98% train, 1% dev, and 1% test.
Remember the purpose of the test set is that, after we finish developing a system, the test set helps us evaluate how good the final system is. So the guideline is to set the test set big enough to give high confidence in the overall performance of the system. For some applications, if we do not need high confidence in the performance of the final system, maybe all we need is a train and dev set. In fact, what sometimes happens is that people talk about using a train/test split, but what they are actually doing is iterating on the "test" set; so rather than a test set, what they really have is a train/dev split and no test set.
I am not recommending skipping the test set when building a system, but if you have a very large dev set, so that you think you won't overfit it too badly, maybe it's not totally unreasonable to have just a train and dev set.
To summarize, the trend has been to use more data for training and less for dev and test, especially with large data sets. The rule of thumb is to make the dev set big enough for its purpose, which is to evaluate different ideas and decide whether A or B is better; and the purpose of the test set is to evaluate the final system, so setting the test set big enough for that purpose is okay.
7 - when to change dev/test sets and metrics
We have seen how setting up the dev set and evaluation metric is like placing a target somewhere for the system to aim at. But sometimes, we might realize we put the target in the wrong place. In that case, we should move the target.
7.1 - change metric
Let's say we are building a cat classifier app that tries to find a large number of cat images to show to cat-loving users.
Algorithm A seems to be doing better than B since it has only a 3% error; however, for some reason, A is letting through a lot of pornographic images. Algorithm B has 5% error, so it classifies and sends users fewer cat images, but it does not let through pornographic images. From the company's point of view, as well as from the user acceptance point of view, B is actually the better algorithm. The evaluation metric, classification error, fails to correctly rank the preference between algorithms, so the evaluation metric, or the dev set or test set, should be changed.
The misclassification error metric can be written as a function as follows:

Error = (1 / m_dev) * Σ_{i=1..m_dev} I{ŷ(i) ≠ y(i)}

The problem with this metric is that it treats pornographic images and non-pornographic images equally. One way to change this is to add a weight term w(i), where w(i) = 1 if x(i) is non-pornographic and w(i) = 10 if x(i) is pornographic. The function becomes:

Error = (1 / Σ_i w(i)) * Σ_{i=1..m_dev} w(i) * I{ŷ(i) ≠ y(i)}

So the error term goes up much more if the algorithm makes the mistake of classifying a pornographic image as a cat image.
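A minimal sketch of this weighted error, assuming weights of 1 for normal images and 10 for pornographic ones as in the example:

```python
def weighted_error(y_true, y_pred, is_porn, porn_weight=10):
    """Weighted misclassification error: a mistake on a pornographic
    image counts porn_weight times as much as an ordinary mistake."""
    weights = [porn_weight if p else 1 for p in is_porn]
    mistakes = sum(w for w, t, yp in zip(weights, y_true, y_pred) if t != yp)
    return mistakes / sum(weights)
```

With two images, one pornographic image misclassified as a cat and one non-pornographic image classified correctly, the error is 10/11 rather than the unweighted 1/2, so the metric now strongly penalizes the kind of mistake we actually care about.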
The point is that, if we are not satisfied with the old error metric, we should try to define a new one that better captures our preference in terms of what is actually a better algorithm. Think of:
- placing the target as one step
- shooting the target as a distinct step we can do separately
7.2 - change dev/test set
Another example: algorithms A and B have 3% and 5% error, respectively, on the dev set or test set, which consists of high-quality, well-framed images downloaded off the internet. But when we deploy the product, we find that B performs better, because users are uploading all sorts of images: less well framed, much blurrier. So when we test the algorithms in deployment, we find B is actually doing better.
So the guideline is: if doing well on the metric + dev/test set does not correspond to doing well on your application, change the metric and/or the dev/test set, so that the data better reflects the type of data you actually need to do well on.
8 - why human-level performance?
There is a lot of talk about comparing machine learning systems to human-level performance, for two main reasons:
- first, because of advances in deep learning, machine learning algorithms are suddenly working much better and have become feasible in many application areas, to the point of being competitive with human-level performance.
- second, it turns out that the workflow of designing and building a machine learning system is much more efficient when the task is something humans can also do.
In these settings, it becomes natural to talk about comparing to human-level performance.
It turns out that progress is often quite fast until we surpass human-level performance, and slows down after that. There are two reasons why:
- for many tasks, human-level performance is not very far from the Bayes optimal error.
- as long as performance is worse than human level, there are tools we can use to improve it that are harder to use once we have surpassed it:
- get labeled data from humans, so that we have more data to feed into the learning algorithm
- do error analysis: as long as humans still perform better than the algorithm, we can ask people to look at the examples the algorithm gets wrong and try to gain insight into why a person got it right but the algorithm got it wrong
- get a better analysis of bias and variance.
Once the algorithm is doing better than humans, the three tactics above are harder to apply. This is maybe another reason why comparing to human-level performance is helpful, especially on tasks that humans do well, and why machine learning algorithms tend to be really good at replicating tasks people can do, catching up to or slightly surpassing human-level performance.
9 - Avoidable bias
We talked about wanting the algorithm to do well on the training set, but sometimes we don't actually want to do too well. Knowing what human-level performance is can tell us exactly how well we should want the algorithm to do on the training set.
In this case, we use human-level error as a proxy for Bayes error, since humans are good at identifying images. Knowing this Bayes error makes it easier to decide whether bias-reduction or variance-reduction tactics will improve the performance of the model.
- Scenario A: human-level error 1%, training error 8%, dev error 10%.
There is a 7% gap between the training error and the human-level error, which means the algorithm is not fitting the training set well. To solve this, we use bias-reduction techniques such as training a bigger neural network or running the training longer.
Now let's look at the same training error and dev error on a different data set, where the human-level error is actually 7.5%, maybe because the images in the data set are so blurry that even humans cannot tell whether there is a cat.
- Scenario B: human-level error 7.5%, training error 8%, dev error 10%.
The training error is doing fine, since it is only 0.5% above the human-level error. The focus here is to reduce variance, since the difference between the training error and the dev error is 2%. To resolve this, we use variance-reduction techniques such as adding regularization or getting a bigger training set.
The difference between the training error and the human-level error is called the avoidable bias. What we want is to improve the training error until it gets down to Bayes error, or a proxy for Bayes error such as human-level error in the case of computer vision, but not below it.
Having defined the notion of avoidable bias, rather than saying the training error is 8%, we say the avoidable bias is maybe 0.5% while the variance is 2%, so there is much more room for improvement in reducing the 2% than the 0.5%.
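The two gaps can be sketched directly from the scenario figures above (human-level error 1% in scenario A and 7.5% in scenario B, with training error 8% and dev error 10% in both):

```python
def diagnose(human_err, train_err, dev_err):
    """Split total error into avoidable bias and variance,
    using human-level error as the Bayes proxy."""
    avoidable_bias = train_err - human_err  # gap to the Bayes proxy
    variance = dev_err - train_err          # generalization gap
    return avoidable_bias, variance

# Scenario A: big avoidable bias (~7%) -> focus on bias reduction
bias_a, var_a = diagnose(0.01, 0.08, 0.10)
# Scenario B: tiny avoidable bias (~0.5%) -> focus on variance reduction (~2%)
bias_b, var_b = diagnose(0.075, 0.08, 0.10)
```

The same training and dev errors lead to opposite tactics depending on where the Bayes proxy sits.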
So an understanding of human-level error, or an estimate of Bayes error, lets us focus on different tactics in different scenarios. Next we will get a deeper understanding of what human-level performance really means.
10 - understanding human-level performance
Human-level performance is useful for helping us drive progress in a machine learning project.
Human-level error gives us a way of estimating Bayes error: the best possible error any function could achieve.
Consider an example of medical image classification, in which the input is a radiology image and the output is a diagnosis classification decision. Suppose a typical human achieves 3% error, a typical doctor 1%, an experienced doctor 0.7%, and a team of experienced doctors 0.5%.
Since Bayes error can be no higher than the best of these, we take 0.5% as our estimate of Bayes error, and so define human-level performance as 0.5%.
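As a small sketch (the error figures here are the hypothetical ones assumed for this radiology example), the tightest Bayes proxy is simply the minimum of the available human-level errors:

```python
# Hypothetical human-level error rates for the radiology task
human_errors = {
    "typical human": 0.03,
    "typical doctor": 0.01,
    "experienced doctor": 0.007,
    "team of experienced doctors": 0.005,
}

# Bayes error can be no worse than the best humans achieve,
# so use the minimum as the proxy.
bayes_proxy = min(human_errors.values())  # 0.005
```

Which definition of "human-level" to use depends on the purpose: for estimating Bayes error, the best achievable (the team) is the right choice.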
To see why this matters, let's look at some error analysis examples:
Scenario A:
In this case (training error 5%, dev error 6%), the choice of human-level performance doesn't have much impact: the avoidable bias is between 4% and 4.5%, and the variance is 1%. Therefore, the focus should be on bias-reduction techniques.
Scenario B:
In this case (training error 1%, dev error 5%), the choice of human-level performance doesn't have an impact either: the avoidable bias is between 0% and 0.5%, and the variance is 4%. Thus, the focus should be on variance-reduction techniques.
So where does the definition of human-level performance really matter?
Scenario C:
In this case (training error 0.7%, dev error 0.8%), it really matters that we use 0.5% as the estimate of Bayes error. The avoidable bias is 0.2%, which is twice as big as the variance, which is just 0.1%. This suggests that both bias and variance are problems, with avoidable bias perhaps the bigger one. Notice that 0.5% is the best estimate of Bayes error here: if we used 0.7% as the proxy, the avoidable bias would come out as roughly 0%, and we might miss that we should actually try to do better on the training set. This also shows why making progress on a machine learning problem gets harder as we approach human-level performance. Once the training error reaches 0.7%, unless we are very careful about estimating Bayes error, we might not know how far away from it we are, so how much we should try to reduce the training error is unknown. And if all we knew was that a single typical doctor achieves 1% error, it would be very difficult to know whether we should try to fit the training set even better. So this illustrates why, as we approach human level, progress is harder to push: it becomes harder to tease out the bias and variance effects.
So to recap, having an estimate of human-level performance gives us an estimate of Bayes error, which allows us to more quickly decide whether to focus on reducing bias or reducing variance. These techniques tend to work well until we surpass human-level performance, at which point we no longer have a good estimate of Bayes error to help us make this decision clearly.
11 - surpassing human-level performance
Scenario A
In this case, the Bayes error is 0.5%, the training error 0.6%, and the dev error 0.8%, so the avoidable bias is 0.1% and the variance is 0.2%; there is maybe more to gain from reducing variance than from reducing avoidable bias.
Scenario B
Now suppose the team of humans achieves 0.5% error, while the training error is 0.3% and the dev error 0.4%. What is the avoidable bias? It is actually much harder to answer: does a 0.3% training error mean we are overfitting by 0.2%, or is Bayes error actually 0.1%, 0.2%, or 0.3%? We don't really know. Based on the information given in this example, we don't have enough information to tell whether we should focus on reducing bias or reducing variance, which slows down the efficiency with which we make progress. It doesn't mean we cannot make progress, but some of the tools that point us in a clear direction just don't work as well.
There are many problems where machine learning significantly surpasses human-level performance:
- Online advertising
- Product recommendations
- Logistics
- Loan approval
12 - improving your model performance
We have learned about orthogonalization, how to set up dev and test sets, human-level performance as a proxy for Bayes error, and how to estimate the avoidable bias and variance.
Let's put it all together into a set of guidelines for how to improve the performance of an algorithm.
Getting a supervised learning algorithm to work well fundamentally assumes that we can do two things:
- fit the training set pretty well, roughly meaning we can achieve low avoidable bias
- make the training set performance generalize well to the dev/test set, which is sort of saying that the variance is not too bad
To summarize, if we want to improve the performance of a machine learning system:
- look at the difference between the training error and the proxy for Bayes error to get a sense of the avoidable bias, in other words, how much better we should try to do on the training set.
- look at the difference between the dev error and the training error to estimate how much of a variance problem we have, in other words, how much harder we should work to make the performance generalize from the training set to the dev set.
To reduce the avoidable bias:
- train a bigger model
- train longer
- better optimization algorithm, such as Momentum, RMSProp or Adam
- better neural network architecture
- # layers
- # units
- CNN
- RNN
- hyperparameter search
To reduce the variance:
- more data
- regularization
- L2
- dropout
- data augmentation
- better neural network architecture
- hyperparameter search
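The checklist above can be sketched as a small helper that, given the three error figures, points at whichever family of tactics addresses the larger gap. The tactic lists mirror the bullets above; the "whichever gap is larger" rule is a simplification of the diagnosis:

```python
BIAS_TACTICS = [
    "train a bigger model",
    "train longer / better optimizer (Momentum, RMSProp, Adam)",
    "better architecture / hyperparameter search",
]
VARIANCE_TACTICS = [
    "more data",
    "regularization (L2, dropout, data augmentation)",
    "better architecture / hyperparameter search",
]

def suggest(bayes_proxy, train_err, dev_err):
    """Recommend bias or variance tactics, whichever gap is larger."""
    avoidable_bias = train_err - bayes_proxy  # gap to the Bayes proxy
    variance = dev_err - train_err            # generalization gap
    return BIAS_TACTICS if avoidable_bias > variance else VARIANCE_TACTICS

# e.g. human-level 1%, training 8%, dev 10%: the 7% bias gap dominates
tactics = suggest(0.01, 0.08, 0.10)
```

In practice the two gaps are a guide, not a rule; as discussed above, once the system surpasses human-level performance the Bayes proxy itself becomes unreliable and this kind of diagnosis gets harder.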