#004A Logistic Regression – The Computation Graph
Why do we need a computation graph? To answer this question, we have to look at how the computation in a neural network is organized. There are two important steps in neural network computation:
- Forward pass or forward propagation step
- Backward pass or backpropagation step
During the forward propagation step we compute the output of our neural network. In the binary classification case, this output is a single value \(\hat{y}\) which can take any value from the \([0,1]\) interval.
In order to actually train our neural network (that is, to find parameters \(w\) and \(b\) at which our cost function reaches a local optimum) we have to conduct a backpropagation step. This is how we compute the gradients, that is, the derivatives of the loss with respect to the parameters. With this information, we are able to implement gradient descent and update \(w\) and \(b\).
A computation graph is a systematic and convenient way to represent our neural network, and it helps us organize the computation of the network output as well as of its derivatives. The computation graph is also the fundamental concept behind the TensorFlow library.
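As a quick aside (not part of the original derivation), here is a minimal TensorFlow 2 sketch of this idea: the operations of a forward pass are recorded as a computation graph on a "tape", and the derivatives are obtained by walking that graph backwards. The toy loss and the variable values below are our own example.

```python
import tensorflow as tf

# A toy computation graph: loss = (w * x - y)^2
w = tf.Variable(2.0)                 # trainable parameter
x, y = tf.constant(3.0), tf.constant(5.0)

with tf.GradientTape() as tape:      # the forward pass is recorded as a graph
    loss = (w * x - y) ** 2

dw = tape.gradient(loss, w)          # the backward pass walks the graph
print(dw.numpy())                    # d(loss)/dw = 2*(w*x - y)*x = 6.0
```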
Before we start with the logistic regression derivatives, here is a quick recap:
$$ z = w^{T}x + b $$
$$ \hat{y} = a = \sigma(z) $$
$$ \mathcal{L}(a,y) = -\big(y\log a + (1-y)\log(1-a)\big) $$
Where:
\(z\) – the linear part of the logistic regression model
\(\hat{y}\) – prediction
\(\mathcal{L}(a,y) \) – Loss function, where \(a \) is the output of logistic regression and \(y \) is the ground truth label
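Translated into code, the sigmoid and the loss from the recap might look like the following minimal NumPy sketch (the function names `sigmoid` and `cross_entropy` are our own choice):

```python
import numpy as np

def sigmoid(z):
    """sigma(z): squashes any real number into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(a, y):
    """L(a, y) = -(y*log(a) + (1-y)*log(1-a)) for a single sample."""
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))
```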
Logistic regression derivatives
The computation graph of logistic regression is built up directly from the three equations above.
In this example, we only have two features, \(x_{1}\) and \(x_{2}\). In order to compute \(z\), we will need to input \(w_{1}\), \(w_{2}\) and \(b\) in addition to the feature values \(x_{1}\) and \(x_{2}\):
$$ z = w_{1}x_{1} + w_{2}x_{2} + b $$
After that, we can compute our prediction \( \hat{y} \), which equals the sigmoid of \(z\) and which we denote by \(a\):
$$ \hat{y} = a = \sigma(z) $$
Finally, we are able to compute our loss function \(\mathcal{L}(a,y)\).
To reduce our loss function (remember, right now we are talking about only one data sample) we have to update our \(w\) and \(b\) parameters. So, first we have to compute the loss using the forward propagation step. After this, we go in the opposite direction (the backward propagation step) to compute the derivatives.
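Putting this forward pass into code for a single sample might look like the sketch below; the concrete numbers chosen for \(x_{1}\), \(x_{2}\), \(w_{1}\), \(w_{2}\), \(b\) and \(y\) are made up purely for illustration.

```python
import math

# made-up values for one training sample and the current parameters
x1, x2 = 1.0, 2.0
w1, w2, b = 0.5, -0.3, 0.1
y = 1  # ground truth label

# forward propagation step
z = w1 * x1 + w2 * x2 + b                            # z = w1*x1 + w2*x2 + b
a = 1.0 / (1.0 + math.exp(-z))                       # a = sigma(z) = y_hat
L = -(y * math.log(a) + (1 - y) * math.log(1 - a))   # loss L(a, y)
print(f"z={z:.4f}, a={a:.4f}, loss={L:.4f}")
```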
First, we compute the derivative of the loss with respect to \(a\) (we will denote this as \(da\) in our code):
$$ da = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} a} = -\frac{y}{a} + \frac{1-y}{1-a} $$
Having computed \(da\), we can go backwards and compute \(dz\):
$$ dz = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} z} = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} a}\, \frac{\mathrm{d} a }{\mathrm{d} z} $$
Since the derivative of the sigmoid is \( \frac{\mathrm{d} a}{\mathrm{d} z} = a(1-a) \) and we have already computed \( \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} a} = -\frac{y}{a} + \frac{1-y}{1-a} \), multiplying the two gives:
$$ dz = a - y $$
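In code, and reusing the made-up values from the forward pass sketch above, these two derivatives could be computed like this:

```python
# values carried over from the forward pass sketch above
a, y = 0.5, 1                     # activation and ground truth label

da = -y / a + (1 - y) / (1 - a)   # dL/da
dz = a - y                        # dL/dz
print(da, dz)                     # -2.0, -0.5

# sanity check of the chain rule: dz == da * da/dz = da * a * (1 - a)
assert abs(da * a * (1 - a) - dz) < 1e-12
```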
The final step in backpropagation is to go back and compute how much our parameters \(w\) and \(b\) need to change:
$$ dw_{1} = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} w_{1}} = x_{1}\, dz $$
$$ dw_{2} = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} w_{2}} = x_{2}\, dz $$
$$ db = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} b} = dz $$
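Continuing the same sketch, the parameter derivatives are just one line each (again, the numeric values of \(x_{1}\), \(x_{2}\) and \(dz\) are the assumed ones from above):

```python
# carried over from the sketches above
x1, x2 = 1.0, 2.0
dz = -0.5

dw1 = x1 * dz        # dL/dw1
dw2 = x2 * dz        # dL/dw2
db  = dz             # dL/db
print(dw1, dw2, db)  # -0.5, -1.0, -0.5
```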
To conclude, if we want to do gradient descent with respect to just this one data sample, we would do the following updates (for some arbitrary number of iterations):
$$ w_{1} = w_{1} - \alpha\, dw_{1} $$
$$ w_{2} = w_{2} - \alpha\, dw_{2} $$
$$ b = b - \alpha\, db $$
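Putting everything together, a self-contained sketch of gradient descent on this single (made-up) sample could look as follows; the learning rate \(\alpha\) and the number of iterations are arbitrary choices.

```python
import math

x1, x2, y = 1.0, 2.0, 1          # one made-up training sample
w1, w2, b = 0.0, 0.0, 0.0        # initial parameters
alpha = 0.1                      # learning rate

for _ in range(100):             # arbitrary number of iterations
    # forward pass
    z = w1 * x1 + w2 * x2 + b
    a = 1.0 / (1.0 + math.exp(-z))
    # backward pass
    dz = a - y
    dw1, dw2, db = x1 * dz, x2 * dz, dz
    # gradient descent updates
    w1 -= alpha * dw1
    w2 -= alpha * dw2
    b  -= alpha * db

print(w1, w2, b)                 # parameters after fitting this one sample
```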
In the next post, we will work through an example of the computation graph of a function.