dH #014: Understanding SVM Loss Functions: From Theory to Practice
*A deep dive into multiclass SVM loss with practical examples and mathematical insights*
—
## The Foundation: Multiclass SVM Loss
The multiclass SVM loss function encourages the score of the correct class to be higher than the scores of all other classes by at least a fixed margin. This principle drives how we evaluate and optimize classification models, and gives us a clear criterion for distinguishing between multiple categories.
The loss function is visualized with a graph showing the relationship between the highest score among other classes and the score for the correct class. This visualization helps us understand the mathematical behavior that makes SVM classifiers so effective.
Graph showing loss relationship between correct class score and highest other score
## Understanding the Hinge Loss Behavior
Reading the graph from left to right, the loss decreases linearly as the score of the correct class rises, until that score exceeds the highest incorrect class score by the margin, at which point the loss becomes zero. Moving to the left on this graph, you can see that as the score for the correct class gets close to, or even lower than, the score of the highest incorrect class, the loss assigned to that example increases linearly.
This type of loss function that has a general shape of a linear region and then a zero region comes up frequently in different contexts in machine learning. This characteristic shape is often called a **hinge loss** because it looks kind of like a door hinge that can open and close.
## Mathematical Formulation
We can write down the same intuition mathematically. Given a single data example with image $x_i$ and label $y_i$, let $s = f(x_i, W)$ be the vector of class scores, so that $s_j$ is the score of class $j$ and $s_{y_i}$ is the score of the correct class.
The sum goes over all category labels, but excludes the correct class. For each incorrect category $j$, we take the max of zero and the score of class $j$ minus the correct class score plus one (the margin):
$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$$
This formula corresponds to two important cases:
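– If the correct class score exceeds the incorrect class score by at least one (the margin), we achieve a loss of zero for that class
– Otherwise, we accumulate a loss equal to the incorrect class score minus the correct class score, plus the margin of one, so the loss grows linearly as the incorrect class score catches up with or overtakes the correct class score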
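The formula and its two cases can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `svm_loss_single` and the example scores are made up for this sketch, and the margin of 1 follows the formula above.

```python
import numpy as np

def svm_loss_single(scores, correct_class, margin=1.0):
    """Multiclass SVM (hinge) loss for one example."""
    scores = np.asarray(scores, dtype=float)
    # Margin term for every class: s_j - s_{y_i} + margin, clipped at zero
    margins = np.maximum(0.0, scores - scores[correct_class] + margin)
    # The sum excludes the correct class itself
    margins[correct_class] = 0.0
    return margins.sum()

# Hypothetical scores for a 3-class problem; class 0 is the correct one.
# Class 1 is beaten by more than the margin (term 0), class 2 is not (term 1.5).
print(svm_loss_single([2.0, 0.5, 2.5], correct_class=0))  # 1.5
```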
## Practical Example: Cat, Car, and Frog Classification
To make this concrete, let’s examine a dataset of three images: a gray cat, a red Audi car, and a colorful tree frog. The classification matrix shows scores for these three distinct categories.
Matrix showing classification scores with three example images
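Given our weight matrix W, the classifier produces the scores below. Each row holds one class’s scores across the three images (cat image, car image, frog image, in that order), so the scores for a single image are read down a column: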
– **Cat row**: 3.2, 1.3, and 2.2
– **Car row**: 5.1, 4.9, and 2.5
– **Frog row**: -1.7, 2.0, and -3.1
Using these scores, we can compute the SVM loss according to the formula $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$.
### Computing Loss for the Cat Example
To compute the loss for the cat example, we need to loop over all incorrect classes. We skip the cat category since it’s the correct class.
For the car category: max(0, 5.1 – 3.2 + 1) = max(0, 2.9) = **2.9**
For the frog category: max(0, -1.7 – 3.2 + 1) = max(0, -3.9) = **0**
The overall loss for this cat image is **2.9**.
### Computing Loss for the Car Example
For the car image, because the correct category score is 4.9, and this is more than one greater than all of the scores assigned to the incorrect categories, we achieve a loss of **0** for this example.
### Computing Loss for the Frog Example
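Here, we get a lot of loss because we’ve assigned a very low score (-3.1) to the frog category, while the incorrect categories have much higher scores (2.2 for cat and 2.5 for car): max(0, 2.2 + 3.1 + 1) + max(0, 2.5 + 3.1 + 1) = 6.3 + 6.6 = **12.9**.
To compute the loss over the full dataset, we just take an average over the loss of all examples: (2.9 + 0 + 12.9) / 3 ≈ **5.27**.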
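The per-image computations above can be reproduced for the whole score matrix at once. The sketch below assumes the layout described earlier (rows are classes, columns are images); the helper name `svm_losses` is hypothetical.

```python
import numpy as np

# Rows are classes (cat, car, frog); columns are the cat, car, and frog images.
scores = np.array([
    [ 3.2, 1.3,  2.2],
    [ 5.1, 4.9,  2.5],
    [-1.7, 2.0, -3.1],
])
labels = np.array([0, 1, 2])  # index of the correct class for each column

def svm_losses(scores, labels, margin=1.0):
    """Per-example multiclass SVM loss for a (classes x examples) score matrix."""
    cols = np.arange(scores.shape[1])
    correct = scores[labels, cols]                        # correct-class score per image
    margins = np.maximum(0.0, scores - correct + margin)  # broadcasts across rows
    margins[labels, cols] = 0.0                           # skip the correct class
    return margins.sum(axis=0)

per_example = svm_losses(scores, labels)
print(np.round(per_example, 2))      # per-image losses: 2.9, 0.0, 12.9
print(round(per_example.mean(), 2))  # dataset loss: 5.27
```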
## Key Properties of SVM Loss
### Loss Range and Behavior
**Minimum loss**: The minimum possible loss is **0**, achieved when the correct category’s score exceeds the score of every incorrect category by at least the margin.
**Maximum loss**: The maximum loss is **infinite**, which happens when the correct category has a very low score that’s much smaller than all the other predicted scores.
### Robustness to Small Changes
One interesting property of the multiclass SVM loss is that once the correct class score beats every other score by more than the margin, changing the predicted scores of that example just a little bit doesn’t affect the loss anymore: it stays at zero. This makes the loss function insensitive to small perturbations in well-classified examples.
## Understanding Initialization Behavior
When we have a linear classifier that’s randomly initialized with small random weight values, all of the predicted scores for each category would also be small random values. In this scenario, what loss would we expect to see from the SVM classifier?
If all predicted scores are small random values, the expected loss would be approximately **C-1** (where C is the number of classes), not zero. With every score close to zero, each of the C-1 incorrect classes contributes about max(0, 0 – 0 + 1) = 1 to the sum. This gives us a baseline loss that should decrease as the model learns to separate the classes, and it doubles as a quick sanity check: a very different loss at initialization is a hint that something in the implementation may be wrong.
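A quick numerical check of this baseline is sketched below. The sizes (10 classes, 1000 examples), the score scale of 1e-3, and the random labels are all arbitrary assumptions chosen for illustration; here each row holds the scores for one example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, num_examples = 10, 1000   # hypothetical sizes

# Small random scores, standing in for a classifier with tiny random weights.
scores = 1e-3 * rng.standard_normal((num_examples, num_classes))
labels = rng.integers(0, num_classes, size=num_examples)

rows = np.arange(num_examples)
correct = scores[rows, labels][:, None]            # correct-class score per row
margins = np.maximum(0.0, scores - correct + 1.0)  # margin of 1
margins[rows, labels] = 0.0                        # skip the correct class
mean_loss = margins.sum(axis=1).mean()

print(mean_loss)  # close to num_classes - 1 = 9
```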
## The Uniqueness Problem with Zero Loss
The question arises: if we find a weight matrix W that results in a multiclass SVM loss (L) equal to zero, is this solution unique? It is not.
Multiclass SVM loss equation showing f(x,W) = Wx and loss function L
For example, if we multiply our weight matrix W by 2, the overall loss still remains zero. Because the scores are linear in W, every score doubles, so every gap between the correct class score and an incorrect class score doubles as well, and gaps that already met the margin continue to do so.
### Demonstrating Non-Uniqueness with Our Example
When examining the loss values in the multiclass SVM example, a loss of zero for the car category (4.9 score) indicates that its score exceeds those of incorrect categories (cat: 1.3, frog: 2.0) by more than the margin of 1.
Grid showing classification scores for cat, car, frog categories
When multiplying the weight matrix by 2 (2W), all predicted scores scale proportionally due to linearity: the car score 4.9 becomes 9.8, while the cat and frog scores 1.3 and 2.0 become 2.6 and 4.0. The margin terms are then 2.6 – 9.8 + 1 = -6.2 and 4.0 – 9.8 + 1 = -4.8, so the margin requirement continues to be satisfied and the loss is still max(0, -6.2) + max(0, -4.8) = 0.
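The same check can be run for several scale factors. This sketch scales the car image’s scores directly, which is equivalent to scaling W because f(x, W) = Wx has no bias term; the function name and the extra factor of 10 are illustrative choices.

```python
import numpy as np

def svm_loss_single(scores, correct_class, margin=1.0):
    scores = np.asarray(scores, dtype=float)
    margins = np.maximum(0.0, scores - scores[correct_class] + margin)
    margins[correct_class] = 0.0
    return margins.sum()

car_scores = np.array([1.3, 4.9, 2.0])  # cat, car, frog scores for the car image

for scale in (1, 2, 10):
    scaled = scale * car_scores  # scores produced by the weight matrix scale * W
    print(scale, svm_loss_single(scaled, correct_class=1))  # loss is 0.0 each time
```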
## Regularization: Beyond Training Error
Regularization is a concept that goes beyond simple training error optimization. So far, the loss function L(W) has been defined as the average loss across all N training examples, where each example’s loss Li depends on the model’s prediction f(xi,W) and the true label yi.
This component is specifically called the **data loss**, which measures how well the model predictions match the training data. An additional regularization term is commonly added to the overall loss function to serve purposes beyond data fitting.
### The Complete Loss Function with Regularization
The complete loss L(W) combines two components: the data loss and a regularization term. The second term, λR(W), is the regularization term:
$$L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i) + \lambda R(W)$$
The regularization term R(W) does not involve the training data xi or yi. Its purpose is to prevent the model from doing too well on the training data; in other words, it gives the model something else to do other than just try to fit the training data.
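A minimal sketch of the combined objective follows. It assumes the multiclass SVM loss as the data loss and the sum of squared weights as R(W); the equation itself leaves R(W) generic, and the function name `full_loss` and the toy data are hypothetical.

```python
import numpy as np

def full_loss(W, X, y, lam, margin=1.0):
    """Average multiclass SVM data loss plus lam * R(W), with R(W) = sum of squares."""
    scores = X @ W.T                     # (N, C): f(x, W) = W x for each row of X
    rows = np.arange(X.shape[0])
    correct = scores[rows, y][:, None]
    margins = np.maximum(0.0, scores - correct + margin)
    margins[rows, y] = 0.0
    data_loss = margins.sum(axis=1).mean()
    reg_loss = lam * np.sum(W ** 2)      # depends on W only, not on X or y
    return data_loss + reg_loss

# Hypothetical toy problem: 5 examples, 4 features, 3 classes.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
X = rng.standard_normal((5, 4))
y = np.array([0, 1, 2, 1, 0])

print(full_loss(W, X, y, lam=0.0))  # data loss only
print(full_loss(W, X, y, lam=0.1))  # data loss plus the regularization penalty
```

Raising `lam` increases the total loss by exactly `lam * np.sum(W ** 2)`, which is how λ trades data fit against the regularizer.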
## Types of Regularization
Regularizers typically come with a hyperparameter, usually called lambda (λ), which controls the trade-off between how well the model fits the training data and how well it satisfies the regularization term.
### Common Linear Model Regularizers
A couple of very common examples of regularization that are typically used for linear models are:
– **L2 regularization**: The sum of the squares of all the elements in the weight matrix W (the squared Euclidean norm of W)
– **L1 regularization**: The sum of the absolute values of all the elements in the weight matrix W
– **Elastic net**: A combination of the L1 and L2 regularizer, sometimes seen in statistics literature
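The three regularizers above are each a one-liner in NumPy. In this sketch the function names are made up, and the elastic net mixing weight `beta` and its placement on the L2 term are assumptions; other parameterizations of elastic net exist.

```python
import numpy as np

def l2_penalty(W):
    return np.sum(W ** 2)      # sum of squared elements

def l1_penalty(W):
    return np.sum(np.abs(W))   # sum of absolute values

def elastic_net_penalty(W, beta=0.5):
    # beta is a hypothetical mixing weight applied to the L2 term
    return beta * l2_penalty(W) + l1_penalty(W)

# Hypothetical 2x2 weight matrix
W = np.array([[1.0, -2.0],
              [0.0,  0.5]])

print(l2_penalty(W))           # 5.25
print(l1_penalty(W))           # 3.5
print(elastic_net_penalty(W))  # 6.125
```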
All of these types of regularizers will also be used in neural networks. But as we move to neural network models, we’ll also see other types of regularizers, such as dropout, batch normalization, and more recent techniques like cutout, mixup, and stochastic depth.
## Why Use Regularization?
The basic idea of why we might want to use regularizers is threefold:
### 1. Expressing Model Preferences
One is that adding some additional term to the loss beyond the data loss allows us to express our preferences over different types of models when those different types of models are not distinguished by their training accuracy. This can be a way that we can inject some of our own human prior knowledge into the types of classifiers that we would like to learn.
For example, compare two weight vectors $w_1 = [1, 0, 0, 0]$ and $w_2 = [0.25, 0.25, 0.25, 0.25]$. Both satisfy $w^T x = 1$ for $x = [1, 1, 1, 1]$, so the data loss cannot tell them apart. With L2 regularization defined as $R(W) = \sum_k \sum_l W_{k,l}^2$, we get $R(w_1) = 1$ and $R(w_2) = 0.25$, so L2 regularization prefers $w_2$, which spreads the weight across all input features.
L2 Regularization formula and example vectors
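The comparison between the two weight vectors can be checked directly. This sketch only reproduces the arithmetic of the example above.

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

score_1, score_2 = w1 @ x, w2 @ x            # identical scores
r_1, r_2 = np.sum(w1 ** 2), np.sum(w2 ** 2)  # different L2 penalties

print(score_1, score_2)  # 1.0 1.0
print(r_1, r_2)          # 1.0 0.25
```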
### 2. Avoiding Overfitting
A second is to avoid what we call **overfitting**. Overfitting happens when you build a model that works very well on your training data but performs poorly on unseen data.
This is a point where machine learning is quite distinct from something like optimization. In optimization, we typically have an objective function, and our whole goal is just to find the bottom of the objective function. But in machine learning, we often don’t really want to do that at all, because at the end of the day, we want to build a system that performs well on unseen data.
So finding a model that gets the best possible performance on the training data might be working against us in some ways and might result in models that do not work well on unseen data.
### 3. Improving Optimization
There’s another technical aspect: if we’re using gradient-based optimizers, then adding an extra regularization term can add extra curvature to the overall objective landscape, which can sometimes help the optimization process.
## Regularization: Prefer Simpler Models
The second key aspect of regularization is preferring simpler models to avoid overfitting. The graph shows a model that takes a scalar input x and predicts a scalar output y, illustrated on a standard coordinate plane.
X-Y coordinate plane with scattered data points
The training data consists of five blue circular points scattered across the coordinate plane. We could imagine fitting two different models, f₁ and f₂, to these points, which illustrates the trade-off between model complexity and generalization.
### Complex vs Simple Models
The model f₁ (shown in blue) fits the training data points perfectly but with high complexity, while f₂ (shown in green) is a simpler linear model that has some training error.
Comparison of complex (f₁) and simple (f₂) models fitting data points
The linear model f₂, despite having higher training error, represents a more intuitive solution that might better capture the underlying pattern. While f₁ perfectly fits the given training points (shown as blue circles), its complex nature suggests it would perform poorly on new data points, whereas the simpler linear model f₂ would likely generalize better.
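The training-error side of this comparison can be sketched with polynomial fits. The five data points below are invented for illustration and are not the points in the figure; a degree-4 polynomial stands in for the complex model f₁ and a straight line for f₂.

```python
import numpy as np

# Hypothetical training set: five noisy points scattered around the line y = x.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.3, 0.8, 2.4, 2.7, 4.2])

# Degree 4 has five coefficients, enough to pass through all five points.
f1 = np.polynomial.Polynomial.fit(x_train, y_train, deg=4)
# Degree 1 is a straight line fitted by least squares.
f2 = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)

def mse(model, x, y):
    return np.mean((model(x) - y) ** 2)

train_err_f1 = mse(f1, x_train, y_train)
train_err_f2 = mse(f2, x_train, y_train)

print(train_err_f1)  # essentially zero
print(train_err_f2)  # small but clearly non-zero
```

This only shows that the more flexible model wins on training error. Which model does better on new points depends on how the data was actually generated.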
This is a conceptual illustration rather than a real experiment, but it captures the principle: regularization lets us express a preference for the simpler model f₂ over the more complex f₁.
### Key Takeaways on Regularization
The takeaway here is that **regularization is really important when you’re building machine learning systems**. You should basically always incorporate some form of regularization into whatever machine learning system you’re trying to build.
## Moving Beyond SVM: Cross-Entropy Loss
So far we’ve seen this idea of a linear classifier, the notion of a loss function, and a concrete example of the loss function being the multi-class SVM loss. We’ve also talked about regularization as a way to prefer one type of classifier over another.
Another way that you can give the model your preferences about the types of functions you’d like it to learn is by using different types of loss functions to train the model. We’ve seen the multi-class SVM loss, but another very commonly used loss, perhaps the most commonly used loss when training neural networks, is the so-called **cross entropy loss** or **multinomial logistic regression**.
### Cross-Entropy Loss: Multiple Names, Same Concept
Cross-Entropy Loss, also known as Multinomial Logistic Regression, comes in many names but represents the same concept. The key motivation is to interpret raw classifier scores as probabilities.
Cross-Entropy Loss (Multinomial Logistic Regression)
Looking at the example scores shown, we have raw numerical outputs like 3.2 for cat, 5.1 for car, and -1.7 for frog, which currently lack a probabilistic interpretation. So far, we only said that an input x and a weight matrix W produce some collection of scores.
The multi-class SVM loss did not give any interpretation to those scores, other than requiring that the score of the correct class be higher than the scores of all the other classes.
### Converting Scores to Probabilities
Now as we move to the cross entropy loss, we’re motivated by wanting to give some interpretation to the scores that the model is predicting. With the cross entropy loss, we want to transform these raw numerical scores, like our example showing 3.2, 5.1, and -1.7, into proper probability distributions over all categories.
We’d like to find a way to take this arbitrary vector of scores and interpret it as a probability distribution over all of the categories that we’re trying to recognize.
### The Softmax Function
The way we do that is with a function called **softmax**, which maps a score vector $s$ to probabilities via $P(Y = k \mid X = x_i) = e^{s_k} / \sum_j e^{s_j}$. The raw classifier scores, also called **logits**, are interpreted as unnormalized log-probabilities, with values like 3.2 for cat, 5.1 for car, and -1.7 for frog.
Table showing logit scores for cat, car, frog
These terms, logits and unnormalized log-probabilities, are commonly used for classifier outputs. We take these raw scores and run them through an exponential function: the numerator $e^{s_k}$ of the softmax formula is the exponential of the score for class $k$.
The exponential is applied element-wise to each score in the vector, and the denominator $\sum_j e^{s_j}$ sums the results over all classes. Because the exponential of any real number is positive, the outputs satisfy the non-negativity required of probability values. These intermediate results are called **unnormalized probabilities**.
### From Logits to Probabilities: A Complete Example
The process involves converting unnormalized log-probabilities (3.2, 5.1, -1.7) to unnormalized probabilities (24.5, 164.0, 0.18) through exponentiation, and then normalizing to get final probabilities.
Three-step probability conversion process
The example shows a classification task with three categories (cat, car, frog), where unnormalized logits are transformed into proper probabilities that sum to 1 (0.13, 0.87, 0.00).
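The three steps can be followed numerically with the example logits. This is a direct sketch of the arithmetic; it omits the max-subtraction trick that practical implementations usually add for numerical stability.

```python
import numpy as np

logits = np.array([3.2, 5.1, -1.7])        # cat, car, frog

unnormalized = np.exp(logits)              # step 1: exponentiate
probs = unnormalized / unnormalized.sum()  # step 2: normalize so the values sum to 1

print(np.round(unnormalized, 2))  # approximately 24.53, 164.02, 0.18
print(np.round(probs, 2))         # approximately 0.13, 0.87, 0.00
```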
There exists some ground truth or correct probability distribution that we would have liked it to predict. The correct probability distribution would have assigned all the probability mass onto the correct class. So the target probability distribution in this case would have had a 1 in the first slot, 0 in all the others.
### Information Theory Foundation
We want to have some function that compares probability distributions. If you have studied information theory, you will know there are good mathematical reasons why the Kullback-Leibler divergence is often used to measure the difference between probability distributions. When the target distribution is one-hot, as it is here, that comparison reduces to the negative log of the probability assigned to the correct class.
The cross entropy loss for an example is therefore $L_i = -\log P(Y = y_i \mid X = x_i)$, where the probability comes from applying the softmax function to the raw classifier scores. Minimizing this loss maximizes the predicted probability of the correct class.
Mathematical formulas showing loss function and softmax
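For the running example, where the cat is the correct class, the loss can be computed as sketched below. The function name is hypothetical, and subtracting the maximum score before exponentiating is a common stability trick that is not part of the formula above; it leaves the softmax output unchanged.

```python
import numpy as np

def cross_entropy_loss(logits, correct_class):
    """L_i = -log softmax(logits)[correct_class], using the natural log."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # shifting all scores does not change softmax
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[correct_class]

loss = cross_entropy_loss([3.2, 5.1, -1.7], correct_class=0)  # cat is correct
print(round(loss, 2))  # 2.04
```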
### Cross-Entropy vs SVM Loss: Key Differences
Just like we did for the multi-class SVM loss, we can ask questions about this loss as well. What’s the minimum and maximum possible loss for an example when we’re using the cross entropy loss?
The **minimum loss would be 0**, and the **maximum loss would be infinity**.
But what’s interesting here is that with the SVM loss, it was actually possible to achieve the minimum. With the SVM loss, we could achieve a loss of 0 by just having the correct class be a lot higher than all the other classes.
But with the cross entropy loss, the only possible way to achieve a loss of 0 would be if our predicted probability distribution was actually one-hot. The softmax function $P(Y = k \mid X = x_i) = e^{s_k} / \sum_j e^{s_j}$ assigns a strictly positive probability to every class for any finite scores, so it can approach a one-hot distribution but never reach it exactly. The loss can get arbitrarily close to 0 without ever attaining it.
In the example, the normalized probabilities (0.13, 0.87, 0.00) are far from the correct probabilities (1.00, 0.00, 0.00), so the cross-entropy loss is well above 0; it would reach 0 only if the two distributions were identical.
Probability comparison table showing normalized vs correct probabilities
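The contrast between the two losses shows up when the correct class pulls further and further ahead. The scores in this sketch are hypothetical, and both helper functions are illustrative restatements of the formulas in this article.

```python
import numpy as np

def svm_loss(scores, correct_class, margin=1.0):
    scores = np.asarray(scores, dtype=float)
    margins = np.maximum(0.0, scores - scores[correct_class] + margin)
    margins[correct_class] = 0.0
    return margins.sum()

def cross_entropy_loss(scores, correct_class):
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()  # for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[correct_class]

# The correct class (index 0) leads the other two classes by a growing gap.
gaps = (2.0, 5.0, 10.0, 20.0)
svm_values = [svm_loss([g, 0.0, 0.0], 0) for g in gaps]
ce_values = [cross_entropy_loss([g, 0.0, 0.0], 0) for g in gaps]

for g, s, c in zip(gaps, svm_values, ce_values):
    print(g, s, c)  # SVM loss is exactly 0; cross-entropy shrinks but stays positive
```

For extremely large gaps, floating-point rounding will eventually report a cross-entropy of 0.0, even though the exact value is still positive.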
—
## Conclusion
Understanding SVM loss functions and cross-entropy loss provides the foundation for building effective classification systems. Key insights include:
**SVM Loss Characteristics:**
– Hinge-like behavior with linear regions and zero regions
– Non-unique solutions when achieving zero loss
– Margin-based approach focusing on decision boundaries
**Regularization Benefits:**
– Prevents overfitting by preferring simpler models
– Allows injection of human prior knowledge
– Improves optimization landscape through added curvature
**Cross-Entropy Advantages:**
– Provides probabilistic interpretation of classifier outputs
– Based on solid information theory foundations
– More commonly used in modern neural networks
– Reaches zero loss only when the predicted distribution exactly matches the one-hot target, which finite scores can approach but never attain
The mathematical elegance of both approaches, combined with their practical effectiveness, makes them essential tools for any machine learning practitioner. Whether choosing SVM loss for its margin-based behavior or cross-entropy for its probabilistic interpretation, understanding these fundamentals enables better model design and debugging capabilities.