#008 Machine Learning – Multiclass classification and softmax function

Highlights: Hello and welcome. So far, we have learned what binary classification is and looked at one of the most common binary classification algorithms, logistic regression. In today’s post, we are going to cover another type of classification problem called multiclass classification. It refers to classification problems with more than just two possible output labels, not just 0 or 1. Let’s begin with our post and dive deeper into multiclass classification.

Tutorial overview:

  1. Multiclass classification introduction
  2. The softmax function
  3. The cost function for softmax regression
  4. Softmax function in Python

1. Multiclass classification introduction

First, let’s see what multiclass classification means. For example, suppose we want to classify handwritten digits. In binary classification, we would only distinguish between the digits 0 and 1. However, if you are trying to read postal codes or zip codes on an envelope, you will find that there are actually 10 possible digits you might want to recognize.

So, multiclass classification is a classification problem where \(y \) can take on only a small number of discrete values. Note that \(y \) is still not just any number, but it can now take on more than just two possible values.

To better visualize a multiclass classification problem, let’s take a look at the following image.

As you can see, on the left side we have binary classification. In that case, we apply the logistic regression algorithm to estimate the probability of \(y \) belonging to one of the two classes, given the features \(x \).

On the right side, we can see a multiclass classification problem. Here, instead of a dataset with just 2 classes, we have four classes: the x symbol represents one class, the circle represents another, the triangle represents the third, and the square represents the fourth. Instead of just estimating the chance of \(y \) being equal to 1, we now want to estimate the chance that \(y \) is equal to 1, 2, 3, or 4. The classification algorithm can then learn a decision boundary that divides the space into four categories rather than just two.

So that’s the definition of the multiclass classification problem. Now, we are going to learn about the softmax regression algorithm which is often used for multiclass classification problems. 

2. The softmax function

Recall that logistic regression produces a decimal between 0 and 1. For example, a logistic regression output of 0.8 from a dog/cat classifier suggests an 80% chance that the image shows a dog and a 20% chance that it shows a cat.

The softmax function extends this idea by assigning decimal probabilities to each class in a multiclass problem. It is a generalization of logistic regression, a binary classification algorithm, to the multiclass classification setting.

Let’s take a look at how it works. Recall that logistic regression applies when \(y \) can take on two possible output values, either zero or one. We would first calculate the following equation for \(z \):

$$ z=\overrightarrow{\mathrm{w}} \cdot \overrightarrow{\mathrm{x}}+b $$

Then we would compute \(a = g(z) \), where \(g \) is the sigmoid function applied to \(z \):

$$ a=g(z)=\frac{1}{1+e^{-z}}=P(y=1 \mid \overrightarrow{\mathrm{x}}) $$

We interpreted this as the logistic regression estimate of the probability of \(y \) being equal to 1, given the input features \(x \). Note that if the probability that \(y \) equals 1 is 0.71, then the probability that \(y \) equals 0 must be 0.29. That is because the chance of \(y \) being equal to 1 and the chance of \(y \) being equal to 0 must add up to one. So, if there is a 71% chance of \(y \) being 1, there has to be a 29% chance of \(y \) being equal to 0.
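To make this concrete, here is a minimal sketch of that computation in Python. The weights, bias, and input features below are made-up placeholder values, not a trained model:

import numpy as np

# Hypothetical parameters and input features (placeholder values, just for illustration)
w = np.array([1.5, -0.5])
b = -1.0
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b           # z = w . x + b
a = 1 / (1 + np.exp(-z))       # sigmoid of z, interpreted as P(y=1 | x)

print('P(y=1 | x) =', a)       # roughly 0.82 for these values
print('P(y=0 | x) =', 1 - a)   # the two probabilities add up to 1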

Let’s now generalize this to softmax regression. We are going to do this with a concrete example in which \(y \) can take on four possible values: 1, 2, 3, or 4. Here is what softmax regression will do.

It will compute the values \(z_{1} \), \(z_{2} \), \(z_{3} \), and \(z_{4} \) from its parameters as follows:

$$ z_{1}=\overrightarrow{\mathrm{w}}_{1} \cdot \overrightarrow{\mathrm{x}}+b_{1} $$

$$ z_{2}=\overrightarrow{\mathrm{w}}_{2} \cdot \overrightarrow{\mathrm{x}}+b_{2} $$

$$ z_{3}=\overrightarrow{\mathrm{w}}_{3} \cdot \overrightarrow{\mathrm{x}}+b_{3} $$

$$ z_{4}=\overrightarrow{\mathrm{w}}_{4} \cdot \overrightarrow{\mathrm{x}}+b_{4} $$

Next, let’s take a look at the formula for softmax regression. We will compute \(a_{1} \), \(a_{2} \), \(a_{3} \), and \(a_{4} \), which are interpreted as the algorithm’s estimates of the chance of \(y \) being equal to 1, 2, 3, or 4, given the input features \(x \):

$$ a_{1}=\frac{e^{z_{1}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}+e^{z_{4}}} =P(y=1 \mid \overrightarrow{\mathrm{x}}) $$

$$ a_{2}=\frac{e^{z_{2}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}+e^{z_{4}}} =P(y=2 \mid \overrightarrow{\mathrm{x}}) $$

$$ a_{3}=\frac{e^{z_{3}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}+e^{z_{4}}} =P(y=3 \mid \overrightarrow{\mathrm{x}}) $$

$$ a_{4}=\frac{e^{z_{4}}}{e^{z_{1}}+e^{z_{2}}+e^{z_{3}}+e^{z_{4}}} =P(y=4 \mid \overrightarrow{\mathrm{x}}) $$

These equations are our specification of the softmax regression model. The model has parameters \(\overrightarrow{\mathrm{w}}_{1} \) through \(\overrightarrow{\mathrm{w}}_{4} \), and \(b_{1} \) through \(b_{4} \). If we can learn appropriate choices for all these parameters, this gives us a way of predicting the chance of \(y \) being 1, 2, 3, or 4, given a set of input features \(x \).
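As a quick illustration, here is a minimal sketch of these four equations in Python. The weight vectors, biases, and input features below are arbitrary placeholder values, not learned parameters:

import numpy as np

# Hypothetical parameters for the four classes (placeholder values)
W = np.array([[ 0.5, -0.2],
              [ 0.1,  0.4],
              [-0.3,  0.8],
              [ 0.7,  0.1]])          # one row per class: w_1, ..., w_4
b = np.array([0.1, -0.5, 0.2, 0.0])   # b_1, ..., b_4
x = np.array([1.0, 2.0])              # input features

z = W @ x + b                         # z_j = w_j . x + b_j for j = 1, ..., 4
a = np.exp(z) / np.sum(np.exp(z))     # a_j = e^{z_j} / (e^{z_1} + ... + e^{z_4})

print(a)                              # estimated P(y=j | x) for j = 1, ..., 4
print(np.sum(a))                      # the four estimates add up to 1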

You might have realized that because these are the chances of \(y \) taking on the values 1, 2, 3, or 4, they have to add up to one, just as in logistic regression.

Let’s now write down the formula for the general case of softmax regression. In the general case, \(y \) can take on \(N \) possible values, so \(y \) can be 1, 2, 3, and so on up to \(N \). In that case, the softmax regression formula looks like this:

$$ z_{j}=\overrightarrow{\mathbf{w}}_{j} \cdot \overrightarrow{\mathrm{x}}+b_{j} $$

$$ j=1, \ldots, N $$

$$ a_{j}=\frac{e^{z_{j}}}{\sum_{k=1}^{N} e^{z_{k}}}=\mathrm{P}(\mathrm{y}=j \mid \overrightarrow{\mathrm{x}}) $$

Here we are using another variable \(k \) to index the summation because \(j \) refers to a specific fixed number, such as \(j=1 \). The value of \(a_{j} \) is interpreted as the model’s estimate that \(y \) is equal to \(j \), given the input features \(x \). Notice that, by the construction of this formula, if you add up \(a_{1} \), \(a_{2} \), all the way through \(a_{N} \), these numbers will always add up to 1.

$$ a_{1}+a_{2}+\ldots+a_{N}=1 $$
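A minimal sketch of the general case, written as a small Python function, could look like this. It assumes W is a matrix with one row \(\overrightarrow{\mathrm{w}}_{j} \) per class and b is a vector of the \(N \) biases; subtracting the maximum of \(z \) before exponentiating is a common numerical trick that leaves the result unchanged:

import numpy as np

def softmax_regression(x, W, b):
    # Return a_1, ..., a_N, the estimates of P(y=j | x) for j = 1, ..., N
    z = W @ x + b                  # z_j = w_j . x + b_j
    exp_z = np.exp(z - np.max(z))  # shift by max(z) to avoid overflow; the ratios are unchanged
    return exp_z / np.sum(exp_z)   # a_j = e^{z_j} / sum_k e^{z_k}

# Example with N = 3 classes and two input features (placeholder parameters)
W = np.array([[ 0.2, -0.1],
              [ 0.5,  0.3],
              [-0.4,  0.6]])
b = np.array([0.0, 0.1, -0.2])
x = np.array([1.0, 2.0])

a = softmax_regression(x, W, b)
print(a, np.sum(a))                # a_1 + a_2 + ... + a_N is always 1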

Another interesting thing to remember is that if we apply softmax regression with only two possible output classes, \(N=2 \), softmax regression ends up computing basically the same thing as logistic regression. The parameters end up being a little bit different, but it reduces to a logistic regression model. That’s why the softmax regression model is a generalization of logistic regression, as the short check below illustrates.
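Here is a minimal sketch (with made-up scores \(z_{1} \) and \(z_{2} \)) showing that the two-class softmax assigns the same probability as a sigmoid applied to the difference of the two scores:

import numpy as np

# Two-class softmax with hypothetical scores z_1 and z_2
z = np.array([1.2, -0.4])
a = np.exp(z) / np.sum(np.exp(z))

# Logistic regression applied to the difference z_1 - z_2 gives the same P(y=1 | x)
sigmoid = 1 / (1 + np.exp(-(z[0] - z[1])))

print(a[0], sigmoid)   # both print the same probability, roughly 0.832

Having defined how softmax regression computes its outputs, let’s now take a look at how to specify the cost function for softmax regression.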

3. The cost function for softmax regression

Recall that for logistic regression, we had the following formulas.

First, let’s write the formula for \(z \) again:

$$ z=\overrightarrow{\mathrm{w}} \cdot \overrightarrow{\mathrm{x}}+b $$

Then we said that \(a_{1} \) is equal to \(g(z) \), which was interpreted as the probability that \(y \) is equal to 1. Similarly, \(a_{2} \) is the probability that \(y \) is equal to 0:

$$ a_{1} =g(z)=\frac{1}{1+e^{-z}}=P(y=1 \mid \overrightarrow{\mathrm{x}}) $$

$$ a_{2}=1-a_{1} \quad=P(y=0 \mid \overrightarrow{\mathrm{x}}) $$

Now let’s write a formula for the loss of logistic regression:

$$ loss=-y \log a_{1}-(1-y) \log \left(1-a_{1}\right) $$

Note that the term \(\left(1-a_{1}\right) \) here is also equal to \(a_{2} \). Therefore, we can rewrite or simplify the loss for logistic regression in the following way: if \(y \) is equal to 1, the loss is \(-\log a_{1} \), and if \(y \) is equal to 0, the loss is \(-\log a_{2} \). Then, the cost function for all the parameters in the model is the average loss, averaged over the entire training set.

$$ J(\overrightarrow{\mathbf{w}}, b)=\text { average loss } $$
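As a small sketch (with made-up predictions and labels, not real data), the logistic loss and the resulting cost could be computed like this:

import numpy as np

# Hypothetical predictions a_1 = P(y=1 | x) and ground-truth labels for four training examples
a1 = np.array([0.9, 0.2, 0.7, 0.4])
y  = np.array([1,   0,   1,   0])

# loss = -y*log(a_1) - (1-y)*log(1 - a_1), computed for each example
loss = -y * np.log(a1) - (1 - y) * np.log(1 - a1)

# The cost J(w, b) is the average loss over the training set
J = np.mean(loss)
print(loss)
print('cost J:', J)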

So, that was the cost function for logistic regression. Now, let’s write down the cost function that is conventionally used for softmax regression. Recall that this is the equation we use for softmax regression:

$$ a_{j}=\frac{e^{z_{j}}}{\sum_{k=1}^{N} e^{z_{k}}} =P(y=j \mid \overrightarrow{\mathrm{x}}) $$

The formula for the loss we’re going to use for softmax regression looks like this:

$$ loss\left(a_{1}, \ldots, a_{N}, y\right)= \begin{cases}-\log a_{1} & \text { if } y=1 \\ -\log a_{2} & \text { if } y=2 \\ & \vdots \\ -\log a_{N} & \text { if } y=N\end{cases} $$

The loss, when the algorithm outputs \(a_{1} \) through \(a_{N} \) and the ground-truth label is \(y \), equals \(-\log a_{1} \) if \(y=1 \); that is, the negative log of the probability the model assigned to \(y=1 \). Similarly, the loss is \(-\log a_{2} \) if \(y=2 \), and so on, all the way down to the last case: if \(y=N \), the loss is \(-\log a_{N} \).
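In code, this loss simply picks out the negative log of the probability the model assigned to the true class. Here is a minimal sketch with assumed softmax outputs and an assumed label:

import numpy as np

# Hypothetical softmax outputs a_1, ..., a_4 and the ground-truth label y
a = np.array([0.10, 0.65, 0.20, 0.05])
y = 2                        # classes are numbered 1, ..., N

loss = -np.log(a[y - 1])     # -log a_y (index y-1 because Python arrays start at 0)
print(loss)                  # -log(0.65), roughly 0.43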

So, that’s what this loss function looks like. The negative log of \(a_{j} \) is a decreasing curve.

Here we can see that if \(a_{j} \) is very close to 1, we are on the right part of the curve and the loss is very small. But if the algorithm assigned, say, only a 50% chance to the correct class, then the loss gets a little bit bigger. The smaller \(a_{j} \) is, the bigger the loss. This incentivizes the algorithm to make \(a_{j} \) as large as possible, as close to 1 as possible, because whatever the actual value of \(y \) was, we want the algorithm to say that the chance of \(y \) being that value was pretty large.
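The shape of this curve is easy to reproduce. The short sketch below simply plots \(-\log a_{j} \) for values of \(a_{j} \) between 0 and 1:

import numpy as np
import matplotlib.pyplot as plt

# -log(a_j) for a_j between 0 and 1: very large loss near 0, loss of 0 at a_j = 1
a_j = np.linspace(0.01, 1.0, 100)
plt.plot(a_j, -np.log(a_j))
plt.xlabel('a_j')
plt.ylabel('-log(a_j)')
plt.show()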

So to summarize, in the case of multiclass classification, what we want to see is a prediction of which class the network “thinks” the input represents. This distribution returned by the softmax activation function represents confidence scores for each class and will always add up to 1.

Now, let’s see how we can calculate these confidence scores using the softmax function in Python.

4. Softmax function in Python

First, let’s import the necessary libraries.

import numpy as np
import math
import matplotlib.pyplot as plt

The next step is to “exponentiate” the outputs. We do this with Euler’s number, \(e \), which is roughly equal to 2.71828182846 and is often referred to as the base of “exponential growth”. We then need to calculate these exponentials before we can continue.

First, let’s define our output values, which are stored in the vector layer_outputs. Then we will calculate the exponentials of these values and store them in the vector exp_values, before normalizing them into norm_values.

layer_outputs = [4.8, 1.21, 2.385, 3.25, 0.88]
# For each value in the vector, calculate the exponential value
exp_values = np.exp(layer_outputs)
print('exponentiated values:')
print(exp_values)
# Now normalize the values so that they sum to 1
norm_values = exp_values / np.sum(exp_values)
# Sort the probabilities from smallest to largest (only to make the printout and plot easier to read)
norm_values.sort()
print('normalized exponentiated values:')
print(norm_values)
print('sum of normalized values:', np.sum(norm_values))

Output:

exponentiated values:
[121.51041752   3.35348465  10.85906266  25.79033992   2.41089971]
normalized exponentiated values:
[0.01470741 0.02045753 0.06624441 0.15733088 0.74125977]
sum of normalized values: 1.0

As you can see, after sorting, the last value in the vector has the largest probability, roughly 74%. It corresponds to the largest original output, 4.8.

Note that to calculate the probabilities, we need non-negative values, and the exponential of any number is always non-negative. Now, let’s plot our results.

plt.plot(norm_values)

Next, let’s calculate the loss.

# Calculate the loss -log(a_j) for each of the normalized values
loss = [-math.log(p) for p in norm_values]
print(loss)

Output:

[4.219404153063724, 3.8894041530637247, 2.714404153063725, 1.8494041530637244, 0.2994041530637247]

plt.plot(loss)

As you can see, the smallest loss again corresponds to the last value in the vector, the one with the largest probability.

Summary

So, in this post, we learned about multiclass classification models. Moreover, we covered one of the most popular activation functions for multiclass classification, the softmax function. We learned about the form of the model as well as the cost function for softmax regression.

So, that is it for this post. See you soon!