#003A Logistic Regression – Cost Function Optimization
To train the parameters \(w\) and \(b\) of a logistic regression model, we first need to define a cost function.
Given a training set of \(m\) training examples, we want to find parameters \(w\) and \(b\) so that \(\hat{y}\) is as close as possible to \(y\) (the ground truth).
Here, we will use the superscript \((i)\) to index different training examples.
One loss function we could consider is the squared error:
$$ \mathcal{L}(\hat{y},y) = \frac{1}{2}(\hat{y} - y)^{2} $$
In logistic regression, however, the squared error is usually not used, because it makes the resulting optimization problem non-convex, so gradient descent is not guaranteed to find the global optimum.
Assume you are standing at some point inside a closed set (like a field surrounded by a fence). The set is convex if, from wherever you stand, you can reach every other point of the set by walking along a straight line without leaving it.
In terms of a surface, the surface is convex if, loosely speaking, it looks like a ‘cup’ (like a parabola). If you place a ball anywhere on the surface and let it roll, the surface is convex if the ball is guaranteed to always end up at the same lowest point, no matter where you dropped it. However, if the surface has ‘bumps’, then, depending on where you drop the ball from, it might get stuck somewhere else. Such a surface is non-convex.
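To make this ball-rolling picture concrete, here is a small, illustrative sketch in plain Python (the helper `gradient_descent` and the two example functions are my own choices for this note, not part of logistic regression itself): on a convex ‘cup’ every starting point rolls to the same minimum, while on a bumpy function the end point depends on where we start.

```python
def gradient_descent(grad, x0, lr=0.01, steps=500):
    """Plain gradient descent on a 1-D function, given its derivative."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def convex_grad(x):
    """Derivative of the convex 'cup' f(x) = x^2."""
    return 2.0 * x

def bumpy_grad(x):
    """Derivative of the non-convex f(x) = x^4 - 3x^2 + x."""
    return 4.0 * x ** 3 - 6.0 * x + 1.0

# Convex case: both starting points roll down to the same minimum (x = 0).
print(gradient_descent(convex_grad, -3.0), gradient_descent(convex_grad, 4.0))

# Non-convex case: the two starting points get stuck in different local minima
# (roughly -1.30 and 1.13), so the end point depends on where we started.
print(gradient_descent(bumpy_grad, -3.0), gradient_descent(bumpy_grad, 3.0))
```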
To be sure that we reach the global optimum, we will use the following loss function:
$$ \mathcal{L}(\hat{y},y) = -\left(y\log\hat{y} + (1-y)\log(1-\hat{y})\right) $$
It gives us a convex optimization problem, which is much easier to optimize.
To understand why this is a good choice, let’s look at the two possible cases:
- If \(y = 1\):
  - \(\mathcal{L}(\hat{y}, y) = -\log\hat{y}\) \(\Rightarrow\) to make the loss small, \(\log\hat{y}\) should be large, so we want \(\hat{y}\) to be large (as close as possible to 1)
- If \(y = 0\):
  - \(\mathcal{L}(\hat{y}, y) = -\log(1-\hat{y})\) \(\Rightarrow\) to make the loss small, \(\log(1-\hat{y})\) should be large, so we want \(\hat{y}\) to be small (as close as possible to 0)
Recall that \(\hat{y}\) is the output of a sigmoid, so it cannot be greater than 1 or less than 0.
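As a quick numerical check of the two cases above, here is a minimal NumPy sketch (the helper names `sigmoid` and `logistic_loss` are just illustrative): confident correct predictions give a small loss, confident wrong ones a large loss.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation; its output y_hat always lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y_hat, y):
    """L(y_hat, y) = -(y * log(y_hat) + (1 - y) * log(1 - y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1: a confident correct prediction (y_hat close to 1) gives a small loss,
# a confident wrong one gives a large loss.
print(logistic_loss(sigmoid(5.0), 1))    # ~0.007
print(logistic_loss(sigmoid(-5.0), 1))   # ~5.0

# y = 0: the situation is mirrored.
print(logistic_loss(sigmoid(-5.0), 0))   # ~0.007
print(logistic_loss(sigmoid(5.0), 0))    # ~5.0
```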
The cost function measures how well our parameters \(w\) and \(b\) are doing on the entire training set:
$$ J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)}) \right] $$
- The cost function \(J\) is defined as the average of the losses \(\mathcal{L}\) over all \(m\) training examples.
- The cost function is a function of the parameters \(w\) and \(b\).
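Below is a vectorized sketch of this cost, assuming the input matrix \(X\) stores one training example per column and \(\hat{y}^{(i)} = \sigma(w^{T}x^{(i)} + b)\); the tiny data set is made up purely for illustration.

```python
import numpy as np

def cost(w, b, X, Y):
    """Average logistic loss J(w, b) over all m training examples.

    X has shape (n_x, m) -- one column per example; Y has shape (1, m).
    """
    m = X.shape[1]
    Y_hat = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))          # predictions, shape (1, m)
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))  # per-example loss
    return np.sum(losses) / m                                    # average over the training set

# Tiny made-up example: 2 features, 4 training examples.
X = np.array([[0.5, -1.2, 3.0, 0.1],
              [1.0,  0.3, -0.5, 2.2]])
Y = np.array([[1, 0, 1, 0]])
w = np.zeros((2, 1))
b = 0.0
print(cost(w, b, X, Y))   # log(2) ~ 0.693 when every prediction is 0.5
```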
In a cost function diagram, the horizontal axes represent our spatial parameters, \(w\) and \(b\). In practice, \(w\) can be of a much higher dimension, but for the purposes of plotting, we will illustrate \(w\) and \(b\) as scalars.
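To reproduce such a diagram for a single scalar feature, we can sweep \(w\) and \(b\) over a grid, evaluate \(J\) at every grid point, and draw the resulting bowl-shaped surface. The matplotlib sketch below is only illustrative; the data points and grid ranges are made up.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, needed on older matplotlib versions

# One scalar feature per example, so both w and b are scalars and J(w, b) can be drawn.
x = np.array([0.5, 1.5, -1.0, -2.0, 2.5])
y = np.array([1.0, 1.0, 0.0, 0.0, 1.0])

def cost(w, b):
    """Average logistic loss over the small made-up data set above."""
    y_hat = 1.0 / (1.0 + np.exp(-(w * x + b)))
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Evaluate J on a grid of (w, b) values and draw the bowl-shaped (convex) surface.
W, B = np.meshgrid(np.linspace(-4, 6, 100), np.linspace(-4, 4, 100))
J = np.array([[cost(wi, bi) for wi, bi in zip(w_row, b_row)]
              for w_row, b_row in zip(W, B)])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(W, B, J, cmap="viridis")
ax.set_xlabel("w"); ax.set_ylabel("b"); ax.set_zlabel("J(w, b)")
plt.show()
```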
More resources on the topic:
For more resources about deep learning, check these other sites.
- Introduction to Logistic Regression, Towards Data Science.
- The Cost Function in Logistic Regression.
- The Computation Graph, Logistic Regression.
- How Gradient Descent Works.