#004A Logistic Regression – The Computation Graph

Why do we need a computation graph? To answer this question, we have to look at how the computation in a neural network is organized. There are two important steps in neural network computation:

  • Forward pass or forward propagation step
  • Backward pass or backpropagation step

During the forward propagation step we compute the output of our neural network. In the binary classification case, this output is a single value \(\hat{y}\) that can take any value from the \([0,1] \) interval.

In order to actually train our neural network (that is, to find the parameters \( w \) and \( b \) that minimize our cost function), we have to conduct a backpropagation step. In this step we compute the gradients, that is, the derivatives of the cost with respect to the parameters. With this information, we are able to implement the gradient descent algorithm for finding optimal values of \( w \) and \( b \). That way we can train our neural network and expect that it will do well on a classification task.

A computation graph is a systematic and convenient way to represent our neural network, and it helps us organize how we compute the network output and its derivatives. The computation graph is also a fundamental concept behind the TensorFlow library.

Before we start with the logistic regression derivatives, here is a quick recap:

$$ z = w^{T}x + b $$

$$ \hat{y} = a = \sigma(z) $$

$$ \mathcal{L}(a,y) = -\big(y\log a + (1-y)\log(1-a)\big) $$

Where:

\(z\) – the output of the linear part of logistic regression

\(\hat{y}\) – prediction

\(\mathcal{L}(a,y) \) – Loss function, where \(a \) is the output of logistic regression and \(y \) is the ground truth label
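For readers who like to see the formulas in code, here is a minimal NumPy sketch of this recap for a single training sample; the feature values, weights, bias and label below are arbitrary example numbers, not values from the post:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: maps any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example values for one training sample
x = np.array([1.5, -0.8])   # input features
w = np.array([0.3, 0.5])    # weights
b = 0.1                     # bias
y = 1                       # ground truth label

z = np.dot(w, x) + b                                  # z = w^T x + b
a = sigmoid(z)                                        # y_hat = a = sigma(z)
loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))     # L(a, y)

print(f"z = {z:.4f}, a = {a:.4f}, loss = {loss:.4f}")
```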


Logistic regression derivatives

The computation graph of a logistic regression looks like the following:

Figure: forward propagation through the logistic regression computation graph

In this example, we only have two features, \(x_{1}\) and \(x_{2}\). In order to compute \(z\), we need the parameters \(w_{1}\), \(w_{2}\) and \(b \) in addition to the feature values \(x_{1}\) and \(x_{2}\):

\(z = w_{1}x_{1} + w_{2} x_{2} + b \)

After that, we can compute our prediction \( \hat{y} \), which we also denote by \(a\):

\(\hat{y} = a = \sigma(z) \)

Finally, we are able to compute our loss function \(\mathcal{L}(a,y) \).
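Tracing the same forward pass node by node, with plain Python scalars this time (the numbers are again arbitrary and chosen only for illustration):

```python
import math

# Arbitrary example values for the two features, the weights and the bias
x1, x2 = 1.5, -0.8
w1, w2 = 0.3, 0.5
b = 0.1
y = 1                                        # ground truth label

# Forward pass through the computation graph
z = w1 * x1 + w2 * x2 + b                    # first node:  z = w1*x1 + w2*x2 + b
a = 1.0 / (1.0 + math.exp(-z))               # second node: a = sigma(z)
loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))   # final node: L(a, y)

print(f"z = {z:.4f}, a = {a:.4f}, loss = {loss:.4f}")
```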

To reduce our loss function (remember, right now we are talking about only one data sample) we have to update our \(w\) and \(b\) parameters. So, first we compute the loss using the forward propagation step. After this, we go in the opposite direction (the backward propagation step) to compute the derivatives.

First, we compute the derivative of the loss with respect to \(a\) (we will denote this as \(da \) in our code):

$$ da = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} a} = -\frac{y}{a} + \frac{1-y}{1-a} $$

Having computed \(da \), we can go backwards and compute \(dz \) using the chain rule:

$$ dz = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} z} = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} a} \frac{\mathrm{d} a }{\mathrm{d} z} $$

Since

$$ \frac{\mathrm{d} a}{\mathrm{d} z} = a(1-a) \quad \textrm{and} \quad \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} a} = -\frac{y}{a} + \frac{1-y}{1-a} $$

their product simplifies to:

$$ dz = a - y $$
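These two backward steps look like this in code; the values of \(a\) and \(y\) are arbitrary example numbers, and the assert simply checks that the chain-rule product really equals \(a - y\):

```python
import math

# Arbitrary example values: the sigmoid output and the label for one sample
a = 0.73
y = 1

# da = dL/da
da = -y / a + (1 - y) / (1 - a)

# dz = dL/dz = (dL/da) * (da/dz), with da/dz = a * (1 - a)
dz = da * a * (1 - a)

# The chain-rule product simplifies to a - y
assert math.isclose(dz, a - y)
print(f"da = {da:.4f}, dz = {dz:.4f}")
```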

The final step in backpropagation is to go back and compute how much our parameters \(w\) and \(b \) should change, that is, the derivatives of the loss with respect to them:

$$ dw_{1} = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} w_{1}} = x_{1} \, dz $$

$$ dw_{2} = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} w_{2}} = x_{2} \, dz $$

$$ db = \frac{\mathrm{d} \mathcal{L}(a,y) }{\mathrm{d} b} = dz $$
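In code, these three gradients follow directly from \(dz \) and the feature values (continuing with made-up example numbers):

```python
# Arbitrary example values: the two features and dz from the previous step
x1, x2 = 1.5, -0.8
dz = -0.27

# Gradients of the loss with respect to the parameters
dw1 = x1 * dz    # dL/dw1
dw2 = x2 * dz    # dL/dw2
db = dz          # dL/db

print(f"dw1 = {dw1:.4f}, dw2 = {dw2:.4f}, db = {db:.4f}")
```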


To conclude, if we want to do gradient descent with respect to just this one data sample, we would do the following updates (for some arbitrary number of iterations):

$$ w_{1} = w_{1} - \alpha \, dw_{1} $$

$$ w_{2} = w_{2} - \alpha \, dw_{2} $$

$$ b = b - \alpha \, db $$
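Putting the forward and backward steps together, a minimal single-sample gradient descent loop might look like the sketch below; the learning rate, the number of iterations and the data values are arbitrary choices made only for this illustration:

```python
import math

# Arbitrary example data for one sample and initial parameter values
x1, x2, y = 1.5, -0.8, 1
w1, w2, b = 0.0, 0.0, 0.0
alpha = 0.1                  # learning rate
num_iterations = 100

for _ in range(num_iterations):
    # Forward pass
    z = w1 * x1 + w2 * x2 + b
    a = 1.0 / (1.0 + math.exp(-z))

    # Backward pass
    dz = a - y
    dw1, dw2, db = x1 * dz, x2 * dz, dz

    # Gradient descent updates
    w1 = w1 - alpha * dw1
    w2 = w2 - alpha * dw2
    b = b - alpha * db

# Loss for this single sample after training
z = w1 * x1 + w2 * x2 + b
a = 1.0 / (1.0 + math.exp(-z))
loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))
print(f"w1 = {w1:.3f}, w2 = {w2:.3f}, b = {b:.3f}, loss = {loss:.4f}")
```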

In the next post we will work through an example of the computation graph of a function.

