#006A Fast Logistic Regression

When we program Logistic Regression or Neural Networks we should avoid explicit \(for \) loops. It is not always possible, but whenever we can, we should use built-in functions or find some other way to perform the computation. Vectorizing the implementation of Logistic Regression makes the code highly efficient. In this post we will see how we can use this technique to compute gradient descent without using a single \(for \) loop.

Now we will examine the forward propagation step of logistic regression. If we have \(m\) training examples, then to make a prediction on the first example we need to compute \(z \) and the activation \(a\) as follows:
\(z^{(1)}= \omega^T x^{(1)} + b \)
\(a^{(1)} = \sigma(z^{(1)}) \)
To make a prediction on the second training example we need to compute:
\(z^{(2)}= \omega^T x^{(2)} + b \)
\(a^{(2)} = \sigma(z^{(2)}) \)
The same holds for the prediction on the third training example:
\(z^{(3)}= \omega^T x^{(3)} + b \)
\(a^{(3)} = \sigma(z^{(3)}) \)
So, if we have \(m\) training examples, we need to do these calculations \(m\) times. However, there is a way to carry out the forward propagation step, that is, to compute these predictions for all \(m\) training examples, without an explicit \(for \) loop.
We will stack all training examples horizontally in a matrix \(\textbf{X}\), so that every column in matrix \(\textbf{X} \) represents one training example:
$$ \textbf{X} = \begin{bmatrix} \vert & \vert & \dots  & \vert \\ x^{(1)} & x^{(2)} & \dots & x^{(m)} \\ \vert & \vert & \dots  & \vert \end{bmatrix} $$
Notice that \(\omega \) is an \(n_{x} \times 1\) matrix (a column vector), so its transpose \(\omega^T \) is a \(1 \times n_{x}\) matrix (a row vector). Multiplying \(\omega^T \) with \(\textbf{X} \) therefore gives a \(1 \times m\) matrix, and adding \(b \) (broadcast to a \(1 \times m\) row vector) gives \(\textbf{Z}\).
We will define matrix \(\textbf{Z} \) by placing all \(z^{(i)} \) values in a row vector:
\(\textbf{Z}=  \begin{bmatrix} z^{(1)} & z^{(2)} & \dots & z^{(m)} \end{bmatrix}  =  \omega^T \textbf{X} +  b = \begin{bmatrix} \omega^T x^{(1)} +b &  \omega^T x^{(2)} + b & \dots & \omega^T x^{(m)} + b \end{bmatrix}  \)
In Python, we can easily implement the calculation of the matrix \(\textbf{Z} \):
$$ \textbf{Z} = np.dot(\omega^T, \textbf{X}) + b $$
As we can see, \(b \) is defined as a scalar (a real number). When we add a vector to this real number, Python automatically takes \(b \) and expands it out to a \(1 \times m\) row vector. This operation is called broadcasting, and we will say more about it at the end of this post.
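As an aside, here is a minimal runnable NumPy sketch of this vectorized forward step; the sizes (\(n_x = 2\) features and \(m = 4\) examples) and the random data are assumptions chosen only for illustration:

```python
import numpy as np

n_x, m = 2, 4                  # assumed sizes, for illustration only
X = np.random.randn(n_x, m)    # each column is one training example
w = np.random.randn(n_x, 1)    # weight column vector
b = 0.5                        # bias is a scalar

# One matrix product replaces the loop over examples;
# the scalar b is broadcast to a (1, m) row vector when added.
Z = np.dot(w.T, X) + b
print(Z.shape)                 # (1, 4)
```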
Matrix \(\textbf{A} \) is a \(1 \times m\) matrix, which we also get by stacking the values \(a^{(i)}\) horizontally, as we did with matrix \(\textbf{Z} \):
\(\textbf{A} =  \begin{bmatrix} a^{(1)} & a^{(2)} & \dots & a^{(m)} \end{bmatrix}  = \sigma (\textbf{Z}) \)
In Python, we can also calculate matrix \(\textbf{A} \) with one line of code (if we have a sigmoid function defined):
\(\textbf{A} = sigmoid(\textbf{Z}) \)
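The post does not spell out that helper, but a possible definition, applied element-wise to the \(\textbf{Z}\) from the earlier snippet, could look like this:

```python
def sigmoid(z):
    """Element-wise sigmoid; works on scalars and NumPy arrays."""
    return 1.0 / (1.0 + np.exp(-z))

# A has the same (1, m) shape as Z; each entry is the prediction a^(i).
A = sigmoid(Z)
```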
Vectorization of Logistic Regression
In the previous post we saw that for the gradient computation we had to compute the derivative \(dz \) for every training example:
\(dz^{(1)} = a^{(1)} - y^{(1)} \)
\(dz^{(2)} = a^{(2)} - y^{(2)} \)
\(\vdots \)
\(dz^{(m)} = a^{(m)} - y^{(m)} \)
In the same way as we defined the previous variables, we now define matrix \(\textbf{dZ} \) by stacking all the \(dz^{(i)} \) values horizontally. The dimension of \(\textbf{dZ} \) is \(1\times m\), that is, it is an \(m \)-dimensional row vector.
\(\textbf{dZ} =  \begin{bmatrix} dz^{(1)} & dz^{(2)} & \dots & dz^{(m)} \end{bmatrix}  \)
As we know, matrices \(\textbf{A} \) and \(\textbf{Y} \) are defined as follows:
\(\textbf{A} =  \begin{bmatrix} a^{(1)} & a^{(2)} & \dots & a^{(m)} \end{bmatrix}  \)
\(\textbf{Y} =  \begin{bmatrix} y^{(1)} & y^{(2)} & \dots & y^{(m)} \end{bmatrix}  \)
We can see that \(\textbf{dZ} \) is:
$$ \textbf{dZ} = \textbf{A} - \textbf{Y} =  \begin{bmatrix} a^{(1)} - y^{(1)} & a^{(2)} - y^{(2)} & \dots & a^{(m)} - y^{(m)} \end{bmatrix} $$
and all values in \(\textbf{dZ} \) can be computed at the same time.
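In NumPy this is again a single element-wise operation; with the illustrative \(\textbf{A}\) from above and some hypothetical labels \(\textbf{Y}\):

```python
Y = np.array([[0, 1, 0, 1]])   # hypothetical labels, shape (1, m)

# Element-wise subtraction computes every dz^(i) at the same time.
dZ = A - Y
```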
To implement Logistic Regression in code, so far we did this:

\(for \enspace i \enspace in \enspace range(m): \)

\(\quad z^{(i)}=w^{T}x^{(i)}+b\)

\(\quad a^{(i)}=\sigma (z^{(i)})\)

\(\quad J \enspace += -[\, y^{(i)} \log a^{(i)}+(1-y^{(i)}) \log (1-a^{(i)}) \,]\)

\(\quad dz^{(i)}=a^{(i)}-y^{(i)}\)

\(\quad dw_{1} \enspace += x_{1}^{(i)}dz^{(i)} \enspace (*)\)

\(\quad dw_{2} \enspace += x_{2}^{(i)}dz^{(i)} \enspace (**)\)

\(\quad db \enspace += dz^{(i)}\)

After leaving the \(for \) loop, we divide \(J\), \(\mathrm{d} w_{1}\), \(\mathrm{d} w_{2}\) and \(\mathrm{d} b\) by \(m\), because we are computing their averages over the training set:

 \( J/=m; \enspace dw_{1}/=m; \enspace dw_{2}/=m; \enspace db/=m;  \)
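For reference, here is a minimal runnable Python version of that non-vectorized loop, reusing the toy arrays (two features, as in the text) from the snippets above:

```python
# Non-vectorized gradient computation for two features.
J = 0.0
dw1, dw2, db = 0.0, 0.0, 0.0

for i in range(m):
    z_i = np.dot(w[:, 0], X[:, i]) + b        # scalar z^(i)
    a_i = 1.0 / (1.0 + np.exp(-z_i))          # a^(i) = sigmoid(z^(i))
    J += -(Y[0, i] * np.log(a_i) + (1 - Y[0, i]) * np.log(1 - a_i))
    dz_i = a_i - Y[0, i]
    dw1 += X[0, i] * dz_i                     # (*)
    dw2 += X[1, i] * dz_i                     # (**)
    db += dz_i

J /= m; dw1 /= m; dw2 /= m; db /= m
```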

This code is non-vectorized and highly inefficient, so we need to transform it. First, using vectorization, we can merge equations \((*) \) and \((**) \) into one equation:

\(dw += x^{(i)}dz^{(i)} \)

Remember that in this case we have two features, \( x_1 \) and \(x_2 \). If we had more features, say \(n \) of them, we would have needed another \(for \) loop to calculate \( dw_{1}, \dots, dw_{n} \).

The cost function is: $$ J = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)}) \right] $$

The derivatives are:

$$ \frac{\partial J}{\partial w} = dw = \frac{1}{m}\textbf{X}(\textbf{A}-\textbf{Y})^T $$

$$ \frac{\partial J}{\partial b} = db = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)}) $$

To calculate \(w \) and \(b \) we will still need the following \(for \) loop.

\( for \enspace i \enspace in \enspace range(num\_of\_iterations): \)

\(\quad \textbf{Z}=w^{T}\textbf{X} + b \)

\(\quad \textbf{A}=\sigma (\textbf{Z}) \)

\(\quad \textbf{dZ}=\textbf{A} - \textbf{Y} \)

\(\quad dw = \frac{1}{m}np.dot(\textbf{X}, \textbf{dZ}^T) \)

\(\quad db = \frac{1}{m}np.sum(\textbf{dZ}) \)

\(\quad w \enspace += -\alpha \, dw \)

\(\quad b \enspace += -\alpha \, db \)

We no longer need to loop through the entire training set, but we still need to loop over the number of gradient descent iterations, and that is a \(for \) loop we cannot get rid of.
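Putting it all together, here is a minimal self-contained sketch of this vectorized gradient descent; the function name, learning rate, iteration count and toy data are assumptions made only so the example runs end to end:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, Y, alpha=0.01, num_of_iterations=1000):
    """Vectorized gradient descent for logistic regression.

    X has shape (n_x, m): one training example per column.
    Y has shape (1, m): the corresponding labels (0 or 1).
    """
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0

    for i in range(num_of_iterations):
        Z = np.dot(w.T, X) + b          # forward step, shape (1, m)
        A = sigmoid(Z)
        dZ = A - Y                      # all dz^(i) at once
        dw = np.dot(X, dZ.T) / m        # shape (n_x, 1)
        db = np.sum(dZ) / m
        w -= alpha * dw                 # gradient descent update
        b -= alpha * db
    return w, b

# Toy data, just for illustration.
X = np.array([[0.5, 1.5, 1.0, 3.0],
              [1.0, 2.0, 0.5, 2.5]])
Y = np.array([[0, 1, 0, 1]])
w, b = train_logistic_regression(X, Y)
```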

This post completes our treatment of Logistic Regression, which can be seen as a one-neuron neural network. Next, let's see the why, what and how of neural networks!

In the next post we will learn what vectorization is.

