#003A Logistic Regression – Cost Function Optimization

To train the parameters $$w$$ and $$b$$ of a logistic regression model, we first need to define a cost function.

Given a training set of $$m$$ training examples, we want to find parameters $$w$$ and $$b$$ such that the prediction $$\hat{y}$$ is as close as possible to the ground truth $$y$$.

Here, we use a superscript $$(i)$$ to index the training examples, so $$\hat{y}^{(i)}$$ and $$y^{(i)}$$ are the prediction and the ground truth for the $$i$$-th example.

We will use a loss (error) function $$\mathcal{L}$$ to measure how well our algorithm is doing. The loss function is applied to a single training example, and a commonly used loss function is the squared error:

$$\mathcal{L}(\hat{y},y) = \frac{1}{2}(\hat{y} - y)^{2}$$

In logistic regression, the squared error loss is not a good choice. It results in an optimization problem that is not convex, so the gradient descent algorithm may not work well (it can get stuck in a local minimum instead of reaching the global one). This is a good moment to look at the difference between convex and non-convex problems.

Assume you are standing at some point inside a closed set (like a field surrounded by a fence). If no matter where you stand inside that closed set, you can see the entire boundary just by taking a 360 degrees turn around yourself, that set is convex. If there is some part of the boundary that you can’t possibly see from where you are and you have to move to another point to be able to see it, then the set is non-convex.

In terms of a surface, the surface is convex if, loosely speaking, it looks like a ‘cup’ (like a parabola). If you have a ball and let it roll along the surface, that surface is convex if that ball is guaranteed to always end up at the same point in the end. However, if the surface has ‘bumps’, then, depending on where you drop the ball from, it might get stuck somewhere else. That surface is then non-convex.

To be sure that we will get to the global optimum, we will use the following loss function:

$$\mathcal{L}(\hat{y},y)=-(y\log\hat{y}+(1-y)\log(1-\hat{y}))$$

It gives us a convex optimization problem, which is much easier to optimize.
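The convexity claim can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: a single training example with $$x = 1$$, $$y = 0$$, a scalar $$w$$ and $$b = 0$$. The helper names (`squared_error`, `log_loss`, `second_difference`) are made up for this example. It estimates the curvature of each loss as a function of $$w$$ with a central finite difference; a convex function never has negative curvature.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed toy setup: one example (x = 1, y = 0), scalar w, b fixed at 0.
def squared_error(w, x=1.0, y=0.0):
    return 0.5 * (sigmoid(w * x) - y) ** 2

def log_loss(w, x=1.0, y=0.0):
    y_hat = sigmoid(w * x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def second_difference(f, w, h=1e-3):
    # Central finite-difference estimate of the second derivative of f at w.
    return (f(w + h) - 2 * f(w) + f(w - h)) / h ** 2

# Curvature of the squared error at two points: the sign changes.
curv_sq = [second_difference(squared_error, w) for w in (-2.0, 2.0)]
# Curvature of the log loss at three points: the sign stays positive.
curv_log = [second_difference(log_loss, w) for w in (-2.0, 0.0, 2.0)]

print("squared error curvature:", curv_sq)
print("log loss curvature:", curv_log)
```

In this toy case the squared error curves upward at $$w = -2$$ and downward at $$w = 2$$, so it is not convex in $$w$$, while the log loss curves upward at every point tested.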

To understand why this is a good choice, let’s see these two cases:

• If $$y = 1$$:
• $$\mathcal{L}(\hat{y}, y) = -\log \hat{y}$$  $$\Rightarrow$$  $$\log \hat{y}$$ should be large, so we want $$\hat{y}$$ large (as close as possible to 1)
• If $$y = 0$$:
• $$\mathcal{L}(\hat{y}, y) = -\log (1 - \hat{y})$$ $$\Rightarrow$$  $$\log (1 - \hat{y})$$ should be large, so we want $$\hat{y}$$ small (as close as possible to 0)

$$\hat{y}$$ is the output of a sigmoid function, so it cannot be bigger than 1 or less than 0.
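The two cases can also be tried out with a few concrete numbers. The following is a minimal sketch of the loss for a single example; the function name `loss` and the sample predictions (0.9, 0.5, 0.1) are chosen only for illustration.

```python
import math

def loss(y_hat, y):
    # Loss for one example; y_hat must lie strictly between 0 and 1.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Case y = 1: the loss shrinks as y_hat approaches 1.
for y_hat in (0.1, 0.5, 0.9):
    print("y = 1, y_hat =", y_hat, "loss =", loss(y_hat, 1))

# Case y = 0: the loss shrinks as y_hat approaches 0.
for y_hat in (0.1, 0.5, 0.9):
    print("y = 0, y_hat =", y_hat, "loss =", loss(y_hat, 0))
```

A confident wrong prediction (for example $$\hat{y} = 0.1$$ when $$y = 1$$) is penalized much more heavily than a confident correct one.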

A cost function measures how well our parameters $$w$$ and $$b$$ are doing on the entire training set :

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$$

• The cost function $$J$$ is defined as the average of the loss function ( $$\mathcal{L}$$ ) over all $$m$$ training examples.
• The cost function is a function of the parameters $$w$$ and $$b$$.
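As a rough sketch of how the cost could be computed, the code below averages the per-example loss over a small dataset. The data in `X` and `Y`, the helper names `predict` and `cost`, and the two parameter settings are all made up for this illustration. It assumes the usual logistic regression prediction $$\hat{y} = \sigma(w^{T}x + b)$$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # y_hat = sigmoid(w . x + b) for one example x (a list of features).
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cost(w, b, X, Y):
    # Average of the per-example loss over all m training examples.
    m = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        y_hat = predict(w, b, x)
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return total / m

# Hypothetical toy dataset: 4 examples with 2 features each.
X = [[0.5, 1.5], [1.0, -1.0], [-1.5, 0.5], [2.0, 1.0]]
Y = [1, 0, 0, 1]

cost_zero = cost([0.0, 0.0], 0.0, X, Y)    # every prediction is 0.5
cost_better = cost([1.0, 1.0], 0.0, X, Y)  # parameters that fit this data better

print(cost_zero, cost_better)
```

With all parameters at zero every prediction is 0.5, so the cost equals $$\log 2$$. Parameters that fit the data better give a lower cost.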

In a cost function diagram, the horizontal axes represent our parameters, $$w$$ and $$b$$. In practice, $$w$$ can be of a much higher dimension, but for the purposes of plotting, we will illustrate $$w$$ and $$b$$ as scalars.

The cost function $$J(w,b)$$ is then some surface above these horizontal axes $$w$$ and $$b$$. So, the height of the surface represents the value of $$J(w,b)$$ at a certain point. Our goal will be to find the parameters $$w$$ and $$b$$ that minimize $$J$$. In the next post we will learn how gradient descent works.
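To get a feel for the surface without plotting, one can evaluate $$J(w,b)$$ on a grid of scalar $$w$$ and $$b$$ values and look for the lowest point. The sketch below does this by brute force. This is not gradient descent, only an illustration. The one-feature dataset and the grid range are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, X, Y):
    # J(w, b) for scalar w and b on a one-feature dataset.
    total = 0.0
    for x, y in zip(X, Y):
        y_hat = sigmoid(w * x + b)
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return total / len(X)

# Hypothetical one-feature dataset (not perfectly separable).
X = [-2.0, -1.0, 0.0, 1.0, 2.0, 0.5]
Y = [0, 0, 1, 0, 1, 1]

# Grid from -3 to 3 in steps of 0.5 along both axes.
grid = [i * 0.5 for i in range(-6, 7)]

# Height of the surface at every grid point, keeping the lowest one.
best_cost, best_w, best_b = min(
    (cost(w, b, X, Y), w, b) for w in grid for b in grid
)

print("lowest grid cost:", best_cost, "at w =", best_w, "b =", best_b)
```

A grid scan like this becomes impractical as soon as $$w$$ has more than a few dimensions, which is one reason an iterative method such as gradient descent is used instead.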
