Stanford ML 5.1: Learning Theory and the Bias/Variance Trade-off

Thu, 31 May 2012 05:38:21 GMT

Data analysis is part science, part art; part algorithm and part heuristic. Of the various approaches to data analysis, machine learning falls more on the algorithmic side, but even here we face many decisions that don't have well-defined answers (e.g. which learning algorithm to use, or how to divide the data into training/test/validation sets). Learning theory provides some guidance for building a model that generalizes and can be used for prediction, which is the primary goal of machine learning.

The next set of lectures in Stanford CS229a (ml-class.org) covers regularization, a technique that is employed to avoid overfitting. This is tied to the concepts of parsimony, model selection, degrees of freedom, and the bias/variance trade-off. I consider this one of the most fundamental concepts in machine learning, so I want to spend a little time covering it before specifically looking at regularization techniques.

This material is covered throughout the machine learning textbooks, but especially in Chapter 7 of ESL and in sections 3.1.4 and 3.2 of PRML.

The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model. (ESL 7.1)

Underfitting/Overfitting

In some of the earlier lectures, we saw how a simple linear model could be used to fit potentially complex data.

For this section, I will be reproducing the analysis in PRML 1.1, which is very similar to the material covered by Professor Ng. Suppose that we have a process which generates data in the form of a sine wave + some noise $\gamma$:

$f(x) = \sin(2 \pi x) + \gamma$

We want to fit a linear model to the data, but don't know what the underlying function is (in other words, we have 10 data points, but don't know that they were generated by a sine function). We might start with a simple linear model:

$f(x) = \theta_0 + \theta_1 x$

And progressively add more polynomial terms:

$f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_n x^n$

These additional terms will improve the fit to the training data, but in the process they reduce the generalization of the model.
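This setup can be sketched in a few lines of NumPy (the noise level and the particular degrees are my own illustrative choices, following PRML 1.1):

```python
import numpy as np

# A sketch of the PRML 1.1 setup (noise level and degrees are my own
# illustrative choices): 10 points from sin(2*pi*x) plus Gaussian noise.
rng = np.random.default_rng(0)
N = 10
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Least-squares polynomial fits of increasing degree.
fits = {deg: np.polyfit(x, y, deg) for deg in (1, 3, 9)}

# Training RMS error shrinks as the degree grows; the degree-9 fit
# passes (numerically) through every training point.
train_rms = {}
for deg, coeffs in fits.items():
    residuals = np.polyval(coeffs, x) - y
    train_rms[deg] = np.sqrt(np.mean(residuals ** 2))
    print(deg, train_rms[deg])
```

Because the polynomial bases are nested, the training error can only go down as terms are added, which is exactly why training error alone can't detect overfitting.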

The plots show the true function alongside each fitted model. We can see that adding more polynomial terms improves the fit to the training points: the 9th-order polynomial passes directly through every data point. But it looks nothing like the underlying function, so we can tell immediately that it has been overfit to the data and won't generalize to other datasets drawn from the same distribution.

How can we tell which parameters $\theta$ to leave in the model (known as "model selection")? How can we avoid overfitting?

There are several ways to solve this problem:

1. Get more data (typically impossible)
2. Choose the model which best fits the data without overfitting (very difficult)
3. Reduce the opportunity for overfitting through regularization/shrinkage
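Regularization is the topic of the upcoming lectures, but as a quick preview of option 3, here is a minimal ridge-regression sketch (the penalty weight and the data setup are my own illustrative assumptions, not from the lecture):

```python
import numpy as np

# Minimal sketch of option 3: ridge regression adds a penalty
# lambda * ||theta||^2 to the least-squares cost, shrinking the
# coefficients. The data setup and lambda are illustrative assumptions.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)

X = np.vander(x, 10, increasing=True)  # degree-9 polynomial features
lam = 1e-3

# Ordinary least squares vs. the ridge normal equations.
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

# The penalty pulls the parameter vector toward zero.
print(np.linalg.norm(theta_ols), np.linalg.norm(theta_ridge))
```

The shrunken coefficients trade a little training-set fit for a much smoother, better-generalizing curve.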

Let's first look at how getting more data would help. In the case of the 9th-order polynomial, more data constrains the fit and pulls it closer to the underlying function.

We can see that adding more data reduces the extreme values in the prediction, and the high-order polynomial starts to look more and more like the underlying sine function. This is an important lesson: the size of the dataset is a critical ingredient, especially for a model with many parameters.
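We can sketch this effect by refitting the degree-9 polynomial as the sample size grows and measuring how far the fitted curve sits from the noise-free sine function (the sample sizes and noise level are illustrative assumptions):

```python
import numpy as np

# Sketch: refit the degree-9 polynomial as the sample size grows.
# Sample sizes and noise level are illustrative assumptions.
rng = np.random.default_rng(2)

x_grid = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_grid)

errs = {}
for n in (10, 100, 1000):
    x = np.linspace(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    coeffs = np.polyfit(x, y, 9)
    # RMS distance between the fitted curve and the noise-free function.
    errs[n] = np.sqrt(np.mean((np.polyval(coeffs, x_grid) - y_true) ** 2))
    print(n, errs[n])
```

With only 10 points the 10-parameter model chases the noise; with 1000 points the noise averages out and the same model tracks the sine closely.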

The bias/variance trade-off is one of the most important concepts to understand in statistical learning theory. It is covered explicitly in CS229 notes 4. Bias measures how far the model's average prediction is from the true function; variance measures how much the prediction varies around that average. In our sine wave example above, the linear model has high bias (it fits poorly no matter which dataset we draw) and low variance (its predictions are consistent, regardless of the specific dataset). On the other hand, the 9th-order polynomial has low bias on the training data (it fits the training data extremely well) and high variance (its predictions vary widely from dataset to dataset and won't fit other data well).

However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions $\hat f(x_0)$ will have large variance...In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. (ESL 2.9)

We can decompose the mean-squared error (MSE) into bias and variance terms:

$MSE(\hat{\theta}) = Var(\hat{\theta}) + Bias(\hat{\theta})^2$
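This decomposition is an exact algebraic identity, which we can verify numerically for a simple estimator, the sample mean (the true mean, sample size, and trial count below are illustrative assumptions):

```python
import numpy as np

# Numerical check of MSE = Var + Bias^2 for a simple estimator:
# the sample mean of n draws, estimating a true mean theta = 1.0.
# All constants are illustrative assumptions.
rng = np.random.default_rng(3)
theta, n, trials = 1.0, 5, 100_000

# One estimate per trial: the mean of n noisy observations.
estimates = rng.normal(loc=theta, scale=1.0, size=(trials, n)).mean(axis=1)

mse = np.mean((estimates - theta) ** 2)
var = np.var(estimates)            # variance of the estimator
bias = np.mean(estimates) - theta  # bias of the estimator

print(mse, var + bias ** 2)  # equal up to floating-point rounding
```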

There are many different ways to characterize the performance of the model on in-sample (training) and out-of-sample (test and validation) datasets.

PRML 1.1 makes use of the root-mean-square (RMS) error function (updated for our loss function convention):

$E_{RMS} = \sqrt{\frac{2 J(\theta)}{N}}$
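Under the half-sum-of-squares convention $J(\theta) = \frac{1}{2}\sum_i (h_\theta(x_i) - y_i)^2$ implied by this formula, the RMS error is a one-liner; a small sketch:

```python
import numpy as np

# PRML's RMS error, written through the half-sum-of-squares cost
# J(theta) = (1/2) * sum((h(x_i) - y_i)^2) assumed by the formula above.
def cost_J(predictions, targets):
    return 0.5 * np.sum((predictions - targets) ** 2)

def e_rms(predictions, targets):
    N = len(targets)
    return np.sqrt(2 * cost_J(predictions, targets) / N)
```

The factor of 2 cancels the 1/2 in the cost, so $E_{RMS}$ is just the ordinary root-mean-square of the residuals, measured on the same scale as the targets.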

To see how this trade-off operates, I divide the data into two sets: training and test. Using our original polynomial model, I progressively increase the model complexity by adding more parameters and watch how the error behaves on each set.

What we see is that the lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently. On the other hand, the higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance. In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.
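The complexity sweep described above can be sketched as follows (the split sizes, noise level, and range of degrees are illustrative assumptions):

```python
import numpy as np

# Sketch of the complexity sweep: split sizes, noise level, and the
# range of degrees are illustrative assumptions.
rng = np.random.default_rng(4)

def noisy_sine(x):
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=len(x))

x_train = np.linspace(0, 1, 10)
y_train = noisy_sine(x_train)
x_test = np.linspace(0, 1, 100)
y_test = noisy_sine(x_test)

def rms(coeffs, x, y):
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, test_err = [], []
for deg in range(10):
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err.append(rms(coeffs, x_train, y_train))
    test_err.append(rms(coeffs, x_test, y_test))

# Training error can only decrease with degree; test error need not.
print(train_err)
print(test_err)
```

Plotting the two error curves against degree gives the familiar picture: training error falls monotonically, while test error bottoms out at an intermediate complexity and then climbs as the model starts fitting noise.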

We will conclude this topic as part of the Stanford Machine Learning series in the next post by looking at dimension reduction techniques and the effective degrees of freedom.