
## Stanford ML 4: Logistic Regression and Classification

Thu, 31 May 2012 05:38:17 GMT

The initial lectures in Stanford CS229a were concerned with regression problems, where the predicted value is a continuous number. Another class of problems is classification, where values are divided into discrete groups (e.g. on or off; red, green, or blue). This material builds on the previous linear regression lectures.

The first classification model introduced is known as logistic regression (even though it is not technically a regression model, since it is used for classification). It is a generalized linear model (GLM) used for binomial regression (two possible values, such as TRUE/FALSE or YES/NO). Logistic regression is covered in ESL 4.4 and PRML 4.3.2. It's also covered in Chapter 5 of my favorite regression book, "Data Analysis Using Regression and Multilevel/Hierarchical Models".

### Logistic Regression

Logistic regression is covered in CS229 notes 1, although those go into far more detail (especially on GLMs) than CS229a. For classification, we need our hypothesis to be constrained to a set of discrete values. When there are two groups (e.g. true/false, on/off), we constrain the hypothesis to lie between two values: $0 \le h_{\theta}(x) \le 1$

This is expressed through the sigmoid (or logistic) function. $h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}}$

This looks like an "S" shape, moving between 0 and 1. Here we are expressing our belief in the hypothesis as a probability, and we choose a threshold for classification (e.g. predict $y = 1$ if $h_{\theta}(x) > 0.5$).
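As a minimal sketch of the idea (in Python rather than R, and with function names of my own choosing, not from the course), the sigmoid hypothesis and a 0.5 decision threshold look like this:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x); theta and x are plain lists."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

def predict(theta, x, threshold=0.5):
    """Classify as 1 when the estimated probability exceeds the threshold."""
    return 1 if hypothesis(theta, x) > threshold else 0
```

Note that sigmoid(0) = 0.5, so the 0.5 threshold corresponds to the sign of $\theta^T x$: the decision boundary is linear in the features.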

I'm going to use the South Africa Heart Data from ESL. The SA Heart data is used in several places in ESL:

A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of CHD. Many of the CHD positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in Rousseauw et al, 1983, South African Medical Journal.

As discussed in the past, assuming your dataset isn't too large, a scatterplot matrix is a really useful way to quickly look at data; the plot reproduces Figure 4.12 from ESL.

### Cost function and Gradient Descent

Gradient descent works in much the same way for logistic regression as for linear regression. First, we define the cost function in the same form as before, except that the hypothesis is now based on the sigmoid function: $J(\theta) = \frac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_{\theta}(x^{(i)}), y^{(i)}) = -\frac{1}{m} \left[\sum_{i=1}^m y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)}))\right]$
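Continuing the illustrative Python sketch (again, names are my own), the cost $J(\theta)$ can be computed directly from that formula:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta) for logistic regression.

    X is a list of feature vectors (each including the intercept term 1),
    y a list of 0/1 labels.
    """
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m
```

With $\theta = 0$ every prediction is 0.5, so the cost is $\log 2 \approx 0.693$ regardless of the labels, which is a handy sanity check.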

As before, gradient descent converges considerably faster if the features are scaled before applying it.
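A hedged sketch of both steps, standardizing each feature and taking one batch gradient descent update (the update rule $\theta_j := \theta_j - \alpha \frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ is standard; the helper names here are hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def standardize(column):
    """Scale one feature column to zero mean and unit variance."""
    m = len(column)
    mean = sum(column) / m
    var = sum((v - mean) ** 2 for v in column) / m
    sd = math.sqrt(var) or 1.0  # avoid dividing by zero on constant columns
    return [(v - mean) / sd for v in column]

def gradient_step(theta, X, y, alpha):
    """One batch update: theta_j -= alpha * (1/m) * sum((h - y) * x_j)."""
    m = len(y)
    errors = [sigmoid(sum(t * xj for t, xj in zip(theta, xi))) - yi
              for xi, yi in zip(X, y)]
    return [t - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j, t in enumerate(theta)]
```

Iterating `gradient_step` until the cost stops decreasing gives the fitted parameters; the intercept column of ones is left unscaled.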

### Multiple classes

Classification can also be applied in the case of multiple classes (or groups). One extension of logistic regression is known as multinomial logistic regression. The most famous dataset for this kind of analysis is Fisher's iris dataset (which is already in R's base datasets package), from his "The use of multiple measurements in taxonomic problems" (1936). From R's help file on the data (help(iris)):

This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. Here I show how to apply Linear Discriminant Analysis and Multinomial Logistic Regression to this three-class problem.
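One way multinomial logistic regression generalizes the two-class case is via the softmax function, which turns one score per class into a set of probabilities. A minimal Python sketch (illustrative only, not the LDA or `multinom` fits used with the iris data):

```python
import math

def softmax(scores):
    """Turn K raw class scores into probabilities that sum to 1."""
    mx = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(thetas, x):
    """Multinomial prediction: one parameter vector theta_k per class;
    pick the class with the highest softmax probability."""
    scores = [sum(t * xi for t, xi in zip(theta_k, x)) for theta_k in thetas]
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)
```

For the three iris species, `thetas` would hold three parameter vectors, and each flower is assigned to the species with the highest probability.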

Typically we would assess the performance of these models by dividing the data into training and test samples, and possibly choosing the parameters through cross-validation. I expect to touch on these issues in later posts as I continue this series on Stanford's open machine learning class.