
## Stanford ML 5.2: Regularization

Thu, 31 May 2012 05:38:24 GMT

We considered the problem of overfitting as model complexity increases in the prior post. Now we look at one way to control for this problem: regularization. The basic idea is to penalize large parameter values, essentially saying that we don't entirely believe the fit that falls out of our optimization. Since we are fitting to a sample of the data, an overfit model doesn't generalize well: it won't fit new datasets, since they are unlikely to match the training data exactly.

[This is just a short post on regularization to show how it can help improve the generalization of a model.]

### Regularization and Ridge Regression

Continuing with the polynomial regression example from PRML 1.1, we now look at adding a penalty term to the error function. This will discourage the parameters from reaching large values during the optimization. Our old loss function for linear regression and logistic regression was: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2$

Now adding the penalty term, it becomes: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$
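As a concrete check on the formula, here is a minimal sketch in Python (the function name and example data are my own, for illustration only) of the regularized cost for linear regression. Note that the penalty sum starts at $j=1$, so the intercept $\theta_0$ is not penalized:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized squared-error cost J(theta) for linear regression.

    X is an m x (n+1) design matrix whose first column is all ones;
    theta[0] is the intercept and is excluded from the penalty term.
    """
    m = len(y)
    residuals = X @ theta - y
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return (1 / (2 * m)) * np.sum(residuals ** 2) + penalty
```

With $\lambda = 0$ this reduces to the unregularized cost above; as $\lambda$ grows, large values of $\theta_1, \dots, \theta_n$ become increasingly expensive.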

Notice that the penalty term takes the same form for linear regression and logistic regression; what differs is the hypothesis function $h_{\theta}$ (and with it the unregularized part of the cost). [Note: If you are following along with PRML, you will notice that Bishop refers to this as the error function and labels the parameters $w$ instead of $\theta$.]

This particular form of regularization, using a quadratic penalty term, is known as ridge regression.

We can minimize the loss function as before using gradient descent, or with an explicit closed-form solution from linear algebra. I have implemented these solutions but have not posted them for the time being, because the performance of the gradient descent solution is appalling. The closed-form solution is already implemented in R in the MASS package, in the lm.ridge function. That function does not provide a prediction method, so I have implemented one here.
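For readers not working in R, the closed form is easy to sketch directly. This Python fragment (the names are mine, not taken from lm.ridge) solves $(X^TX + \lambda L)\theta = X^Ty$, where $L$ is the identity matrix with a zero in the intercept position so that $\theta_0$ is not shrunk:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: theta = (X'X + lam*L)^(-1) X'y.

    X is an m x (n+1) design matrix with a leading column of ones.
    The intercept is not penalized, so the corresponding diagonal
    entry of the penalty matrix L is zero.
    """
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0  # leave the intercept unpenalized
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def ridge_predict(theta, X):
    """Predictions from a fitted ridge model."""
    return X @ theta
```

At $\lambda = 0$ this recovers ordinary least squares; increasing $\lambda$ pulls the non-intercept coefficients toward zero.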

In the last post, we saw how increasing the model complexity resulted in a poor fit on the out-of-sample data: the more complex model is overfit to the training dataset. Here we can see the same diagram using ridge regression. At high model complexity, the fit now remains roughly constant, because the additional terms are penalized. I won't expand on regularization at this stage, although I will commit the gradient descent solution to the github project. I will expand further on these topics (looking at other regularization models such as the Lasso) in later posts when I continue with ESL. For now, we will start moving on to neural networks in the next post.
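To make the penalty's effect tangible, here is a small self-contained Python experiment (the data and polynomial degree are invented for illustration) that fits a degree-9 polynomial using the same closed form. As $\lambda$ grows, the largest non-intercept coefficient shrinks, which is what keeps the high-complexity fit stable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Degree-9 polynomial design matrix: columns 1, x, x^2, ..., x^9.
X = np.vander(x, 10, increasing=True)

def coef_size(lam):
    """Largest non-intercept coefficient magnitude for a given lambda."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # leave the intercept unpenalized
    theta = np.linalg.solve(X.T @ X + lam * L, X.T @ y)
    return np.max(np.abs(theta[1:]))

for lam in (1e-6, 1e-3, 1.0):
    print(f"lambda={lam:g}  max |theta_j| = {coef_size(lam):.3g}")
```

The near-unregularized fit produces very large coefficients (the wild oscillations of an overfit polynomial), while even a modest $\lambda$ tames them.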