Remember me

## Stanford ML 3: Multivariate Regression, Gradient Descent, and the Normal Equation

Thu, 31 May 2012 05:38:14 GMT

The next set of lectures in CS229 covers "Linear Regression with Multiple Variables", also known as Multivariate Regression. This builds on the univariate linear regression material and results in a more general procedure.

As part of this, Professor Ng also provides more guidance on how to use Gradient Descent, and introduces the most widely used analytic solution to linear regression: the normal equation.

[Note: I have now committed all this code to github as an R package, which I'm currently calling stanford.ml. Currently the code is mostly contained in demo files, so you can load the package and then call the particular demo (for instance, this post could be run with demo("multivariate.regression")). My plan for the package is to build generic functions into the package, have demo files to walk through everything step-by-step, and then have a vignette to give a full description of everything. I may post this to CRAN once it's sufficiently well developed (at this stage it fails R CMD check because of lack of documentation, etc.). More details to follow. As always, feel free to fork the project and contribute!]

### Multivariate Regression

It is a simple extension from univariate linear regression to multivariate regression. We can simply add more variables:

$h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ...$

Or more concisely, if we set $x_0 = 1$ then we can write this in matrix notation as:

$h_{\theta}(x) = \theta^T x$

For these examples, I will continue to use the housing dataset from the UCI Machine Learning Repository. I will just use four of the available variables -- CRIM: per capita crime rate by town, RM: average number of rooms per dwelling, PTRATIO: pupil-teacher ratio by town, and LSTAT: % lower status of the population -- to predict MEDV: Median value of owner-occupied homes in \$1000's:

$y_{medv} = \theta_0 + \theta_1 x_{crim} + \theta_2 x_{rm} + \theta_3 x_{ptratio} + \theta_4 x_{lstat}$

Before looking at the data, we would expect all of the variables to have an influence. CRIM, PTRATIO, and LSTAT should have a negative coefficient (higher values would result in a lower property value) while RM should have a positive coefficient (more rooms would result in a higher property value). This is our null hypothesis.

If we plot these variables in R as a scatterplot matrix, we can see some clear relationships, in line with our expectations.

We can fit a linear model in R and look at the resulting statistics. All variables are significant (in terms of t-stats) and have values in line with what we might expect.

### Optimizing with Gradient Descent

In the last post, we introduced Gradient Descent as an optimization method to find the minimum of the loss function. The loss function and gradient descent now have multiple variables, but all the other details remain the same.

Here I show the optimization path for the raw dataset (unscaled) given different values for the learning rate ($\alpha$). We can see that the algorithm converges on the right answer with very small values of alpha. When it doesn't converge, the values blow out to infinity. This happens because the steps taken along the loss function gradient are too large, and the optimization keeps missing the minimum value by larger and larger amounts.

Scaling the features before running gradient descent makes it easier to find the appropriate learning rate because the features are on the same scale.

### The Normal Equation

Linear regression can actually be solved analytically using a little linear algebra. This is not true for most other machine learning models. This follows from the Gauss-Newton Theorem, which is itself a modification of Newton's method. One important result is the Gauss-Markov Theorem (covered in ESL 3.3.2), which finds that the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares estimator.

There are many ways to derive the normal equations (wikipedia has a nice article on the subject), so I won't go through the derivation here. The normal equation is usually written as:

$\hat{\theta} = (X^T X)^{-1} X^T y$

The $\hat{\theta}$ hat notation means that this is an estimate. Using the normal equation is typically much faster than gradient descent, although it can be slower on very large data sets where taking the inverse matrix can be difficult.