datahacker.rs@gmail.com

#005 PyTorch – Logistic Regression in PyTorch

Highlights: In this post, we are going to talk about logistic regression. We will first cover the basic theory behind logistic regression and then we will see how we can apply this knowledge in PyTorch.

Tutorial Overview:

1. What is a binary prediction?

In the previous post, we learned how to build a linear model using the labeled training data set in order to make predictions. We have analyzed the dependence between the study time and grades. We learned that these types of data have a linear relationship which means that we can predict the grade based on the study time (if study time increases, grades should go up, and if study time decreases grades should go down). This type of prediction is called Linear regression. The goal of Linear regression is to develop a model that could predict any real value. However, in practice, we often come across a situation when we need to predict a specific output that has only two distinct values as is shown in the following examples.

• Are you going to pass or fail your exam?
• If you drive at 100 miles per hour, will you get from point A to point B on time or not?

As you may notice, in all of these examples there are only two possible answers: yes or no. Basically, we try to distinguish between two classes of outcomes. This type of prediction is called Binary Classification. The goal of Binary Classification is to classify elements of a given set of data into two groups.

An example of a binary classification problem:

In the following image we can see an example of a binary classification problem.

• $$x$$ – an input image
• $$y$$ – the output. It is a label by which we can recognize the image.
• $$y$$ can only have two values – 1 or 0. If $$y=1$$ – there is a cat in an image. On the other hand, if $$y=0$$ – there is no cat in the image

So, the task of Binary Classification is to learn a classifier that can take an image represented by its feature vector $$x$$ and predict whether the corresponding label is 1 – a cat is in an image, or 0 – no cat in the image. In other words, we need an algorithm to output the prediction $$\hat{y}$$ which is an estimate of $$y$$.

2. Logistic regression – introduction

One of the most common algorithms that are used to solve binary classification problems is called Logistic regression. It is a supervised learning algorithm that we can use when labels are either 0 or 1.

$$\hat{y}= P\left ( y=1|x \right) \\x\in \mathbb{R}^{n_x}$$

Here, $$\hat{y}$$ is the probability that $$y=1$$, given the input features $$x$$, where $$x$$ is an $$n_{x}$$-dimensional vector. The parameters of logistic regression are $$w$$, which is also an $$n_{x}$$-dimensional vector, and $$b$$, which is a real number.

$$\mathrm{Parameters}: w \in \mathbb{R}^{n_x}, b \in \mathbb{R}$$

Now, given an input $$x$$ and the parameters $$w$$ and $$b$$, how do we generate the output $$\hat{y}$$? One thing we could try is to apply the linear function

$$\hat{y}=w^{T}x+b$$

However, this is not a very good algorithm for binary classification, because we want $$\hat{y}$$ to be the chance that $$y =1$$. Therefore, $$\hat{y}$$ should be in the range between values 0 and 1. It is difficult to enforce this because $$\hat{y}$$ can be much bigger than 1 and also, it can be a negative number. So, we can conclude that we need a function that will transform our linear function to be in a range between 0 and 1. To do that we are going to apply the sigmoid function.

The Sigmoid function

In the following image, we can see a set of images. Some of them have a cat and some are non-cat images.

So, instead of fitting a line to the data (linear regression), logistic regression fits an “S” shaped logistic function called the sigmoid function.

A sigmoid function is a type of activation function that restricts the output to a range between 0 and 1. If you need a more detailed explanation of the sigmoid function you can click on this link.

To visualize the logistic regression model let’s take a look at the following image.

As you can see, we first calculate the output of the linear function, $$z$$. This output $$z$$ is then the input to the sigmoid function, which produces the prediction $$\hat{y}$$. If $$z$$ is a large positive value, $$\hat{y}$$ will be close to 1. On the other hand, if $$z$$ is a large negative value, $$\hat{y}$$ will be close to 0. Therefore, $$\hat{y}$$ will always be in the range between 0 and 1.

One simple way to classify the prediction $$\hat{y}$$ is to use a threshold value of 0.5. So, if our prediction is greater than 0.5 we assume that $$y$$ is 1. Otherwise, we will assume that $$y$$ is 0. As the $$\hat{y}$$ is getting closer to 1, the probability that there is a cat in the image will be higher. On the other hand, as the $$\hat{y}$$ is getting closer to 0, the probability that there is a cat in the image will be lower.

To calculate $$\hat{y}$$ we will use the following equations. It is a very simple calculation where we just plug the linear model into the sigmoid function.

Linear model:

$$z=w^{T}x+b$$

Sigmoid function:

$$\sigma(z)=\frac{1}{1+e^{-z}}$$

Logistic regression model:

$$\hat{y}=\sigma(w^{T}x+b)$$

If $$z$$ is a large positive number, then $$e^{-z}\approx 0$$ and:

$$\sigma (z)\approx\frac{1}{1+0}= 1$$

If $$z$$ is a large negative number, then $$e^{-z}$$ is very large and:

$$\sigma (z)=\frac{1}{1+e^{-z}}\approx 0$$

So, when we implement logistic regression, our job is to try to compute parameters $$w$$ and $$b$$, so that $$\hat{y}$$ becomes a good estimate of the chance of $$y=1$$.
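To make these formulas concrete, here is a minimal sketch of the model in plain Python. The function names sigmoid, predict_proba and predict_label, as well as the values of w, b and the inputs, are our own choices for illustration and do not come from any library.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Logistic regression model: y_hat = sigmoid(w^T x + b)."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

def predict_label(x, w, b, threshold=0.5):
    """Turn the probability into a class label using a threshold."""
    return 1 if predict_proba(x, w, b) > threshold else 0

# Hypothetical parameters and inputs, chosen only for illustration.
w = [2.0, -1.0]
b = 0.5
for x in ([3.0, 1.0], [-3.0, 1.0], [0.0, 0.5]):
    print(x, round(predict_proba(x, w, b), 4), predict_label(x, w, b))
```

Note that math.exp() can overflow when z is a very large negative number, so for real work a library routine such as torch.sigmoid() is the safer choice.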

3. A cost function optimization

So, to train the parameters $$w$$ and $$b$$ of a logistic regression model, we need to define a cost function. First, let’s remind ourselves how we compute a cost.

For each data point $$x$$ we start computing a series of operations to produce a predicted output. Then, we compare that predicted output to the actual output to calculate a prediction error. That error is what we minimize during the learning process using an optimization strategy. The way we’re computing that error value is by using a loss function. Our ultimate goal is to minimize the loss function in order to identify the values of $$w$$ and $$b$$. For that, we use the algorithm called Gradient descent.

In logistic regression, we will also use a loss (error) function $$\mathcal{L}$$ to measure how well our algorithm is doing. Remember that the loss function is applied to a single training sample, and a commonly used loss function is the squared error:

$$\mathcal{L}(\hat{y},y) = \frac{1}{2}(\hat{y} - y)^{2}$$

However, in logistic regression, the squared error loss function is not an optimal choice. It results in an optimization problem that is not convex, and the gradient descent algorithm may not work well (it may not converge optimally). Now, it is a good moment to see what is the difference between convex and non-convex problems.

Let’s assume you are standing at some point inside a closed set (like a field surrounded by a fence). Suppose that you can see the entire boundary just by taking a $$360^{\circ}$$ turn around yourself. In that case, the set is convex. On the other hand, if there is some part of the boundary that you can’t possibly see from where you stand and you have to move to another point to be able to see it, then the set is non-convex.

In terms of a surface, the surface is convex if, loosely speaking, it looks like a ‘cup’ (like a parabola). If you have a ball and let it roll along the surface, that surface is convex if that ball is guaranteed to always end up at the same point in the end. However, if the surface has ‘bumps’, then, depending on where you drop the ball from, it might get stuck somewhere else. In that case, the surface is non-convex.

Cross–entropy loss function

To be sure that we will get to the global optimum, we will use the Cross–entropy loss. It measures the performance of a classification model whose output is a probability value between 0 and 1.

In the graph above we can see that the Cross–entropy loss increases as the predicted probability diverges from the actual label. As the predicted probability approaches 1 (for a true label of 1), the log loss slowly decreases. Here, a perfect model would have a log loss of 0. In binary classification, when we have only two classes, cross-entropy can be calculated using the following equation:

$$\mathcal{L}(\hat{y},y)=-(y\log\hat{y}+(1-y)\log(1-\hat{y}))$$

It will give us a convex optimization and therefore, it will be much easier to optimize our parameters.

To understand why this is a good choice, let’s see these two cases:

• If $$y = 1$$:
• $$\mathcal{L}(\hat{y}, y) = -\log \hat{y}$$ $$\Rightarrow$$ $$\log \hat{y}$$ should be large, so we want $$\hat{y}$$ large (as close as possible to 1)
• If $$y = 0$$:
• $$\mathcal{L}(\hat{y}, y) = -\log (1 - \hat{y})$$ $$\Rightarrow$$ $$\log (1 - \hat{y})$$ should be large, so we want $$\hat{y}$$ small (as close as possible to 0)

Remember, $$\hat{y}$$ is the output of a sigmoid function, so it cannot be less than 0 or greater than 1.
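The two cases can be checked numerically with a few lines of plain Python. This is only a sketch: the function name bce_loss and the probabilities we try are our own choices for illustration.

```python
import math

def bce_loss(y_hat, y):
    """Cross-entropy loss for one sample; y is 0 or 1, y_hat lies strictly in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Hypothetical predicted probabilities, from "confidently 0" to "confidently 1".
for y_hat in (0.1, 0.5, 0.9, 0.99):
    print(f"y_hat={y_hat:<4}  loss if y=1: {bce_loss(y_hat, 1):.4f}  "
          f"loss if y=0: {bce_loss(y_hat, 0):.4f}")
```

When the label is 1, the loss shrinks as the prediction moves towards 1, and when the label is 0, it shrinks as the prediction moves towards 0.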

Now, we can define our cost function which measures how well our parameters $$w$$ and $$b$$ are doing on the entire training set. Here, we will use $$(i)$$ superscript to index different training examples.

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$$

• Cost function $$J$$ is defined as the average of the loss functions ( $$\mathcal{L}$$ ) over all $$m$$ training examples.
• Cost function is a function of parameters $$w$$ and $$b$$.

In the following cost function diagram, the horizontal axes represent our parameters, $$w$$ and $$b$$.

Note that in practice, $$w$$ can be of a much higher dimension, but for the purposes of plotting, we have illustrated $$w$$ and $$b$$ as scalars. The cost function $$J(w,b)$$ is then some surface above these horizontal axes $$w$$ and $$b$$. So, the height of the surface represents the value of $$J(w,b)$$ at a certain point. Our goal will be to find the parameters $$w$$ and $$b$$ that minimize the function $$J$$.
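To get a feeling for this surface, we can evaluate $$J(w,b)$$ on a coarse grid of scalar values of $$w$$ and $$b$$. The sketch below uses a tiny made-up one-dimensional data set. The names cost, xs, ys and grid are our own assumptions, and a grid search is used here only to look at the surface, not as a training method.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, xs, ys):
    """J(w, b): average cross-entropy loss over all m training examples."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = sigmoid(w * x + b)
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return total / len(xs)

# Hypothetical 1-D data with overlapping classes, so the minimum is finite.
xs = [-2.0, -1.0, 0.5, -0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

# Evaluate J on a coarse grid of (w, b) values and report the lowest point found.
grid = [i * 0.25 for i in range(-20, 21)]  # -5.0, -4.75, ..., 5.0
best = min((cost(w, b, xs, ys), w, b) for w in grid for b in grid)
print("lowest J on grid: %.4f at w=%.2f, b=%.2f" % best)
```

With $$w=0$$ and $$b=0$$ every prediction is 0.5, so the cost there is $$\log 2$$. Any useful setting of the parameters should give a lower value.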

4. Calculating Logistic regression derivatives

Before we calculate the parameters $$w$$ and $$b$$, let’s take a look at the following computation graph of a logistic regression.

In this example, we only have two features $$x_{1}$$ and $$x_{2}$$. In order to compute $$z$$, we will need to input $$w_{1}$$, $$w_{2}$$ and $$b$$ in addition to the feature values $$x_{1}$$ and $$x_{2}$$:

$$z = w_{1}x_{1} + w_{2} x_{2} + b$$

Now, we can apply the forward propagation step and compute the prediction $$\hat{y}$$ and loss function:

$$\hat{y} = \sigma(z)$$

$$\mathcal{L}(\hat{y},y)$$

To reduce our loss function (remember right now we are talking only about one data sample) we have to update our $$w$$ and $$b$$ parameters. To do that we will apply the backward propagation step to compute the derivatives.

First, we need to calculate the derivative of loss with respect to $$\hat{y}$$:

$$\mathcal{L}(\hat{y},y)=-(y\log\hat{y}+(1-y)\log(1-\hat{y}))$$

$$d\hat{y} = \frac{\mathrm{d} \mathcal{L}(\hat{y},y)}{\mathrm{d} \hat{y}}$$

$$d\hat{y} = - \frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$$

The next step is to compute the derivative of the loss with respect to $$z$$, using the chain rule.

$$dz = \frac{\mathrm{d} \mathcal{L}(\hat{y},y)}{\mathrm{d}z}$$

$$dz = \frac{\mathrm{d} \mathcal{L}(\hat{y},y)}{\mathrm{d} \hat{y}} \frac{\mathrm{d} \hat{y}}{\mathrm{d} z}$$

$$\frac{\mathrm{d} \mathcal{L}(\hat{y},y)}{\mathrm{d} \hat{y}}= - \frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$$

$$\frac{\mathrm{d} \hat{y}}{\mathrm{d} z} = \hat{y}(1 - \hat{y})$$

$$dz = \left(- \frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y}) = -y(1-\hat{y}) + (1-y)\hat{y} = \hat{y} - y$$

The final step in back propagation is to compute how much our parameters $$w$$ and $$b$$ need to change:

$${dw_{1}} = \frac{\mathrm{d} \mathcal{L}(\hat{y},y)}{\mathrm{d} w_{1}} = x_{1} {dz}$$

$${dw_{2}} = \frac{\mathrm{d} \mathcal{L}(\hat{y},y)}{\mathrm{d} w_{2}} = x_{2} {dz}$$

$${db} = \frac{\mathrm{d} \mathcal{L}(\hat{y},y)}{\mathrm{d} b} = {dz}$$
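A quick way to convince ourselves that these derivatives are right is to compare them with numerical estimates. The following sketch nudges each parameter by a tiny amount and measures how the loss changes. The sample values, the parameter values and the step size eps are arbitrary assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    """Cross-entropy loss of one sample for the two-feature model."""
    y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Hypothetical sample and parameter values, chosen only for illustration.
x1, x2, y = 1.5, -0.5, 1
w1, w2, b = 0.2, -0.4, 0.1

# Analytic derivatives from the formulas above.
y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
dz = y_hat - y
dw1, dw2, db = x1 * dz, x2 * dz, dz

# Numerical derivatives: central differences, one parameter at a time.
eps = 1e-6
num_dw1 = (loss(w1 + eps, w2, b, x1, x2, y) - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)
num_dw2 = (loss(w1, w2 + eps, b, x1, x2, y) - loss(w1, w2 - eps, b, x1, x2, y)) / (2 * eps)
num_db = (loss(w1, w2, b + eps, x1, x2, y) - loss(w1, w2, b - eps, x1, x2, y)) / (2 * eps)

print("dw1:", dw1, num_dw1)
print("dw2:", dw2, num_dw2)
print("db: ", db, num_db)
```

The analytic and the numerical values should agree to several decimal places.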

To conclude, if we want to do gradient descent with respect to just this one data sample, we would do the following updates (for some arbitrary number of iterations):

$$w_{1} = w_{1} - \alpha{dw_{1}}$$

$$w_{2} = w_{2} - \alpha{dw_{2}}$$

$$b = b - \alpha{db}$$
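These update rules can be sketched as a short loop in plain Python. The sample, the starting values, the learning rate alpha and the number of iterations are all assumptions picked for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One hypothetical training sample with two features and label 1.
x1, x2, y = 1.5, -0.5, 1

# Hypothetical starting point and learning rate.
w1, w2, b = 0.0, 0.0, 0.0
alpha = 0.1

losses = []
for step in range(100):
    # Forward propagation: prediction and loss for the current parameters.
    y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
    losses.append(-(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)))

    # Backward propagation: dz = y_hat - y, then dw1, dw2 and db.
    dz = y_hat - y
    w1 -= alpha * x1 * dz
    w2 -= alpha * x2 * dz
    b -= alpha * dz

final_y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
print("first loss:", losses[0], "last loss:", losses[-1])
print("final prediction:", final_y_hat)
```

Since the label of this sample is 1, the prediction should move towards 1 and the loss should get smaller with every step.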

Now, let’s see how we can create a logistic regression model in Python using PyTorch.

5. Logistic regression in Python with PyTorch

The code for logistic regression is similar to the code for linear regression. Just instead of predicting some continuous value, we are predicting whether something is true or false.

Simple example

First, we will import necessary libraries.

import numpy as np
import matplotlib.pyplot as plt
import torch

For this example, we will define an array $$X$$ that consists of six vectors. Then, if we scatter these vectors, we will get the following plot.

X = np.array([[-1,-1],[-2,-1],[-3,-2],[1, 1],[2,1],[3,2]])
plt.scatter(X[:,0],X[:,1])

By looking at the plot above we can visually separate these points into two groups or “classes”. We can say the data points above 0 will belong to class one, whereas the points below 0 will belong to the other class.

So how can we draw this line? Well, we can use the function np.linspace(), which returns an array of evenly spaced numbers over a specified interval. So, we will create an array x1 of 50 numbers between -3 and 3. Next, we will create another array x2 that will be equal to -1 * x1 + 1. These two arrays will be the coordinates of the points on the line. Then, we can plot these 50 points.

x1 = np.linspace(-3, 3, 50)
x2 = -1.0 * x1 + 1
plt.scatter(X[:,0], X[:,1])
plt.plot(x1, x2,'r')

Output:

By plotting this line we can see that we have separated our points into these two classes. We can all agree that all points that are below this line should belong to the zero class and all the points that are above this line should belong to the first class. However, this line was arbitrarily chosen, and it is not an optimal one.
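Instead of judging by eye, we can also let NumPy tell us on which side of the line each point falls. This is a small sketch for this particular hand-picked line. The names side and labels are our own.

```python
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# The line x2 = -x1 + 1 can be rewritten as x1 + x2 - 1 = 0.
# Points with x1 + x2 - 1 > 0 lie above the line, the others lie on or below it.
side = X[:, 0] + X[:, 1] - 1
labels = (side > 0).astype(int)

print(labels)  # prints [0 0 0 1 1 1]
```

The first three points end up in the zero class and the last three in the first class, which matches the plot.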

Logistic Regression experiment

Now let’s see how we can apply logistic regression in PyTorch to separate a set of points into two classes.

We will start by importing the function make_blobs() from the sklearn library. This function will help us to randomly generate two blobs that we’ll use for the classification.

from sklearn.datasets import make_blobs

The next step is to generate the data set using this function. We will define the variables x and y and pass the parameter n_samples, which tells the function to create two blobs with 200 data points in each of them. We will set n_features to 2, so that every point has two coordinates, and the cluster_std parameter to 1.4, so that the data is not completely separated. Finally, we will set the parameter random_state to 2, which makes the generated data reproducible.

After we have created our x and y data sets, we need to convert them to tensors. To do that we will create the variables x_torch and y_torch by calling torch.from_numpy() and casting the result to torch.FloatTensor. Also, we will reshape the y_torch variable to $$(-1,1)$$, so that it becomes a column with one label per row.

x, y = make_blobs(n_samples=[200, 200], random_state=2, n_features=2, cluster_std=1.4)

x_torch = torch.from_numpy(x).type(torch.FloatTensor)
y_torch = torch.from_numpy(y).type(torch.FloatTensor).reshape(-1, 1)

Now that we have transformed our data into tensors, let’s print the variable x_torch to get a better understanding of the data.

x_torch

Output:

tensor([[  0.9811,  -8.8856],
        [ -0.8678,  -2.3374],
        [  1.2544,  -1.5154],
        [ -1.9269,  -8.9902],
        [ -1.8671,  -9.6667],
        [ -0.8098,  -6.8153],
        [ -2.0800,  -9.0721],
        [ -1.5141, -10.2887],
        [ -0.2425,  -0.0549],
        [  0.2318,  -1.0355],
        [ -0.8735,  -1.9513],
        [ -0.2646,  -9.9847],
        [ -1.8603,  -9.3872],....

As you can see, it is a tensor consisting of vectors, where in each vector, we have two numbers.

On the other hand, if we print the y_torch variable, we can see that it consists of numbers that are either 0 or 1.

y_torch

Output:

tensor([[0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],...

Now that we have a better understanding of the data, we can create a scatter plot.

plt.scatter(x[:,0], x[:,1], c=y, edgecolors='w')

Output:

So we can see that there are two classes. The first class is painted in yellow and the second class is painted in violet.

Now comes the part where we define our logistic model. As mentioned earlier, we will create the model with the same structure as the linear regression model. The only change is that instead of creating a linear layer that accepts only one number, we will set our model to accept a vector of two numbers and return one number as the output.

The first step is to create a class called LogisticRegression that inherits from torch.nn.Module. In the init function, or the constructor, which takes the parameter self, we define a linear layer that is the same as in the linear regression, by calling torch.nn.Linear(). This function takes two input parameters. The first one is the size of each input sample, which in this case is equal to 2. The second one is the size of each output sample, which is equal to 1. Next, we create the forward() function, which takes self and x as inputs. Inside it, we create a variable y_hat where we store the output of self.linear applied to x. Now, remember that our predictions need to be in the range from 0 to 1. For that purpose, we use the sigmoid activation function: we call torch.sigmoid() and pass y_hat as its argument.

class LogisticRegression(torch.nn.Module):
    def __init__(self):
        super(LogisticRegression, self).__init__()
        self.linear = torch.nn.Linear(2, 1)

    def forward(self, x):
        y_hat = self.linear(x)
        return torch.sigmoid(y_hat)
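Before training, it can be useful to check that the model accepts a batch of two-dimensional points and returns one probability per point. Here is a small sketch of such a check. The random batch and the seed are assumptions made only for this illustration, and the class is repeated so that the snippet runs on its own.

```python
import torch

class LogisticRegression(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(2, 1)  # 2 input features, 1 output

    def forward(self, x):
        return torch.sigmoid(self.linear(x))

torch.manual_seed(0)            # hypothetical seed, only for repeatability
model = LogisticRegression()
batch = torch.randn(4, 2)       # 4 made-up points with 2 features each

with torch.no_grad():           # no gradients needed for a quick check
    probs = model(batch)

print(probs.shape)              # torch.Size([4, 1])
print(probs)
```

Even with untrained, randomly initialized weights, every output lies between 0 and 1 because of the sigmoid.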

The next step is to create an instance of our logistic regression model. After that, we will create a variable named optimizer and call the function torch.optim.SGD(), which will update the parameters using the computed gradients. As parameters to this function, we will pass model.parameters(), and we will set the learning rate to 0.01. Then, we will define the loss as the Binary Cross Entropy loss function torch.nn.BCELoss().

model = LogisticRegression()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.BCELoss()
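To connect this criterion with the cross-entropy formula from the theory part, we can compare torch.nn.BCELoss() with the formula written out by hand. The predictions and labels below are made-up values used only for this check.

```python
import torch

# Hypothetical predicted probabilities and true labels.
y_hat = torch.tensor([[0.9], [0.2], [0.6]])
y = torch.tensor([[1.0], [0.0], [0.0]])

criterion = torch.nn.BCELoss()  # by default it averages over all samples
builtin = criterion(y_hat, y)

# The same cost computed directly from the cross-entropy formula.
manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()

print(builtin.item(), manual.item())
```

Both numbers should agree up to floating point precision.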

Now we can start training our model. We will create a for loop that will iterate in the range from 0 to 5000.

for epoch in range(5000):
    y_hat = model(x_torch)

    loss = criterion(y_hat, y_torch)
    loss.backward()

    optimizer.step()
    optimizer.zero_grad()

The first step in the training loop is to compute the prediction. So, we will create the variable y_hat and call the model on the x_torch variable. Then, we need to calculate the loss by applying the criterion to the predicted value y_hat and the original value y_torch.

Now we can apply the backward propagation. Calling loss.backward() calculates the gradients, and the optimizer.step() function then uses them to update the parameters. Remember that PyTorch accumulates gradients, so we need to reset them to 0 after each epoch. To do that, we’ll just call the optimizer.zero_grad() function.
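The following sketch shows what happens to the gradients when we skip the reset. It uses a single made-up parameter w and a toy loss w * x instead of our model, so the numbers are easy to follow.

```python
import torch

# A single hypothetical parameter w and input x; loss = w * x, so d(loss)/dw = x = 2.
w = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([2.0])
optimizer = torch.optim.SGD([w], lr=0.01)

(w * x).sum().backward()
first = w.grad.item()            # gradient after the first backward pass

(w * x).sum().backward()         # second backward pass without a reset
accumulated = w.grad.item()      # the new gradient was added to the old one

optimizer.zero_grad()            # reset before the next backward pass
(w * x).sum().backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)  # prints 2.0 4.0 2.0
```

Without the reset, the second gradient is added to the first one, which would distort the next parameter update.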

Once our model is trained, we can create a for loop that prints the final values of the parameters $$w$$ and $$b$$.

for name, parameter in model.named_parameters():
    print(name, parameter)

Output:

linear.weight Parameter containing:
linear.bias Parameter containing:
tensor([3.3394], requires_grad=True)

As you can see, the weight is actually a vector of two values, whereas the bias is a vector of only one number.

Now comes the big question. How can we use these values of the weight and the bias to make some predictions? Well, we will create a variable y_pred, and we will simply use the model.forward() function to make some predictions on x_torch. Then, we will simply print the first 10 numbers in the variable y_pred.

y_pred = model.forward(x_torch)
print(y_pred[:10])

Output:

tensor([[0.0853],
        [0.7504],
        [0.9559],
        [0.0128],
        [0.0082],
        [0.1127],
        [0.0109],
        [0.0066],
        [0.9586],
        [0.9399]], grad_fn=<SliceBackward>)

If we take a look at the values of y_pred, we can see that they are ranging from 0 to 1. However, our goal is to separate these numbers into two classes – class 0 or class 1. The easiest way to achieve this is by using the function np.where(). As a parameter of this function, we will pass y_pred and we will set the following condition: if an element in y_pred is bigger than 0.5, it will become 1 and if it is lower than 0.5, it will become 0.

test = np.where(y_pred.detach().numpy() < 0.5, 0, 1)
print(test[:10])

Output:

[[0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]]

Here you can see the first 10 numbers of the variable test. The predictions in y_pred that were below 0.5 are now set to 0, and the predictions that were bigger than 0.5 are now set to 1.
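The same thresholding can also be done without leaving PyTorch, and once we have hard labels we can count how many of them match the true ones. The sketch below is an optional extra, not a step of the experiment above. It copies the first six predictions and labels printed earlier into small tensors, and the names labels and accuracy are our own.

```python
import torch

# The first six predictions and true labels from the outputs printed above.
y_pred = torch.tensor([[0.0853], [0.7504], [0.9559], [0.0128], [0.0082], [0.1127]])
y_true = torch.tensor([[0.], [1.], [1.], [0.], [0.], [0.]])

# Threshold at 0.5: True becomes 1.0 and False becomes 0.0.
labels = (y_pred >= 0.5).float()

# Fraction of predicted labels that match the true labels.
accuracy = (labels == y_true).float().mean().item()

print(labels.flatten().tolist())
print(accuracy)
```

For these six points every predicted label matches the true one. On the full data set we would not necessarily expect a perfect score, because the two blobs overlap.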

One way to visualize the model’s performance is by scattering these points.

plt.scatter(x_torch[:, 0], x_torch[:, 1])
plt.scatter(x_torch[:, 0], x_torch[:, 1], c=test)

Output:

In the first image, we can see the original points. To label them, we just pass the color to be equal either to 0 or 1 from our test variable.

This is a nice way to visualize how our model classifies the data. Another way to do this is to visualize the model performance over time. For that, we can plot the loss function.

To plot the loss function we need to create the list all_loss before the training loop. Then, inside the training loop, we will append loss.item() to that list. After we run this piece of code we will just plot our loss function.

all_loss = []
for epoch in range(5000):
    y_hat = model(x_torch)

    loss = criterion(y_hat, y_torch)
    all_loss.append(loss.item())
    loss.backward()

    optimizer.step()
    optimizer.zero_grad()
plt.plot(all_loss)

Output:

Summary

In this post, we have learned how to apply logistic regression to train our data model. We explained the theory behind this algorithm in order to show you why this is one of the most popular methods that we can use to solve the binary classification problem. In the next post, we will see how we can tackle the so-called “XOR” problem.
