dH 015# Optimization in Neural Networks: A Journey Through Loss Landscapes

dH 015# Optimization in Neural Networks: A Journey Through Loss Landscapes

*Understanding loss functions and the mathematical framework that drives neural network training*

## The Foundation of Optimization

The slide introduces the topic of Optimization, setting up our framework for understanding loss functions and weight matrices. But for the purpose of today, we’re mostly going to extract all those away and just think about the loss function as an abstract function. It inputs the weight matrix and outputs this scalar value of the loss.

During optimization, we want to find the optimal weight matrix w* that minimizes the loss function L(w), which is formally written as w* = arg min_w L(w). And this is why this topic of general optimization is indeed very general.

So instead, we’re going to focus on the bits that are most salient for training that will become more salient for training large neural network models as we move forward in this semester.

This is the formal mathematical notation we use to express an optimization problem: w* = arg min_w L(w). It’s a very broad topic across all of mathematics and applied computational mathematics and things like that. So there’s no way we can cover all possible hard pieces of optimization in one lecture.

## Visualizing the Optimization Landscape

But we often think about it intuitively as traversing a very large high-dimensional landscape. So we often think about this process optimization as trying to walk towards the bottom of some beautiful mountain valley landscape, like this stunning Alpine vista with its rolling green meadows descending between forested slopes.

Dramatic mountain landscape with green valleys, snow-capped peaks, and forest

Where here the intuition is that each point, each x-y point in this mountainous terrain represents a different state in our optimization space.

The image shows a stunning mountain landscape with snow-capped peaks in the distance and lush green valleys below, under a vibrant blue sky with white clouds – a metaphorical representation of our optimization landscape. During the process of optimization, we’re like someone trying to find the lowest point in this landscape while blindfolded.

Mountain valley showing clear elevation changes

Looking at this landscape, we can see multiple valleys and paths, but determining the absolute lowest point is challenging without a global view. However, in general, finding the global minimum requires iterative methods that progressively explore the landscape, similar to carefully descending through this mountain terrain step by step.

So then the question is, how do we actually get to the bottom?

## Idea #1: Random Search (Bad Idea!)

So for certain types of optimization problems, like might arise in linear classification, linear regression, we might be able to just write down an explicit equation for the bottom of this objective landscape. But in general, that’s not going to work.

So we need to find more, so we’re going to use iterative methods instead that try to iteratively improve our solutions and go towards the bottom of this objective landscape.

So the first iterative algorithm that you might imagine to optimize an objective landscape, which by the way is a terrible idea, but it’s instructive to think about, is random search.

So here, what we might try to do is implement random search by generating 1000 random weight matrices (W) of size 10×3073 scaled by 0.0001, evaluating their loss values on the training set, and keeping track of the lowest loss value found.

Python code showing random search implementation with loss tracking

With our loss function abstracted into a single function L, this random search procedure can be implemented in just a few lines of Python code, using numpy’s random.randn for parameter generation and a simple loss comparison loop. The code demonstrates random search training on a dataset with 58,000 examples and 3073 features, showing how the loss value improves across multiple attempts, starting from 9.401632 and gradually decreasing.

Using random search, we achieved 15.5% accuracy on the test set with dimensions X test [3073 x 10000] and Y test [10000 x 1], which demonstrates that even this basic optimization algorithm can make non-trivial progress. But of course, on this dataset, state of the art is something like 95%, so we’ve gotten a little bit of gap to close on the rest of this lecture.

Random search, as indicated in the title, is a bad idea and represents a naive algorithm that you probably don’t want to use in practice, though it performs better than one might expect. So that kind of motivates the use of slightly smarter algorithms than random search.

## Idea #2: Follow the Slope

Idea number two is to follow the slope, as shown in the slide title. We’re going to follow the slope of this mountainous landscape downhill, illustrated by the cartoon figure on the green hillside.

Alpine mountain landscape with green valleys, snow-capped peaks, and cartoon figure with directional arrow

Even though this figure traversing the objective landscape doesn’t have complete information about their surroundings, they can still follow the local slope.

The ground is sloping downward in a mountainous landscape, as illustrated by the cartoon figure positioned on a green alpine slope. Given only the local information about how the objective landscape is changing in our current position, as shown by the red arrow indicating the downward direction, our strategy is to step in the direction of greatest decrease.

We can repeat this process iteratively, following the natural contours of the mountain slope, leading us towards lower points in the objective landscape. So this is a fairly simple straightforward algorithm but it actually works quite well.

## The Mathematics of Gradient Descent

So here to make this a little bit more formal, we need to talk about derivatives, right? Because if you recall for a single value function that inputs a single scalar and outputs a single scalar, we can define the derivative which tells us the slope at any point in this on the domain of this function where the slope tells us for any point in the domain, for any point, any point, how if we change x by a little bit, then how much is y going to change a little bit in correspondents?

So that’s the same, that’s the familiar gradient, that’s the familiar derivative that we have from single variable calculus. And of course this extends very naturally to multiple dimensions as well.

So if you’ll recall for a vector valued function that inputs a vector and outputs a scalar, which is what we’re talking about in this optimization setup, then we can define the gradient where the gradient at a point is now a vector telling us the direction of greatest increase in the function.

So now the gradient tells us the direction of greatest increase and if you’ll recall your vector calculus, you’ll know that the magnitude of the gradient tells us the slope in the direction of greatest increase.

So and it’s also kind of a nice fact that the direction, we actually care about the direction of greatest decrease because we want to walk downhill this objective landscape and it turns out the direction of greatest decrease is opposite the direction of greatest increase, which is maybe not obvious, but it happens to be true.

And then so we can simply step in the direction of the negative gradient and this will lead us downhill down our objective landscape.

## Computing Gradients in Practice

So then the question is how might we actually go about computing these gradients for arbitrary vector input functions that input a vector and output a scalar? Well, we can actually implement this limit definition of the gradient directly in software.

So you could imagine that we have some value of the some current value of the weight matrix W here on the left shown as a big vector. And now on the right we want to compute this gradient and the gradient of the loss with respect to the weight matrix. And recall that the loss is a single scalar value.

The gradient of the loss with respect to the weight matrix W will be a vector with the same shape as the current weight matrix [0.34, -1.11, 0.78, …], where each element in the gradient dL/dW will indicate the slope for that corresponding weight parameter, with the current loss being 1.25347.

## Numerical Gradient Computation

So we could just implement this limit definition of the gradient numerically, as shown by the current weight matrix W and its corresponding gradient dL/dW. Given our function that computes the loss, which is currently 1.25347, what we could do is compute the loss at our current w, then we could perturb the value of w by increasing the first slot of the weight matrix by a small step each.

Current weight matrix W with numerical values and gradient dL/dW placeholder

And now run that perturb value of w through our loss function to compute a new loss. And then apply the limit definition of the derivative to compute some approximation to the gradient in the first slot of the gradient, which will replace one of these question marks in our dL/dW matrix.

And here remember that it’s maybe rise over run. So we look at the difference in loss value divided by the difference in amount in which we change that slot of the input weight matrix w, which currently starts with values like 0.34, -1.11, 0.78.

So that gives us some approximation to the slope of that function along the coordinate axis of the first slot of the weight matrix. And of course we can repeat this procedure for the second dimension, perturb the second dimension, recompute a new value of the loss, and then apply limit definition of the derivative to get this numeric approximation to the second slot.

Looking at the current weight matrix W, we can see multiple slots with values like 0.34, -1.11, 0.78, and continuing through 0.33. The numeric gradient is demonstrated by adding a small increment of 0.0001 to one slot and observing the change in loss from 1.25347 to 1.25353.

Original weight values and modified weight with +0.0001 showing loss change

The gradient calculation shows specific values like -2.5 and 0.6 in the gradient dL/dW. In this particular example, our weight matrix W has 9 visible slots with specific values. The ellipsis after 0.33 indicates there are additional slots beyond what’s shown.

## Limitations of Numerical Gradients

So this would actually not be super expensive to compute. But as we move to larger neural network based systems, our weight matrices become very, very large. They might have thousands or millions or tens of millions. Or in the biggest cases, perhaps even billions of learnable parameters.

The numeric gradient calculation requires computing (f(x+h) – f(x))/h as h approaches 0, as shown in the limit definition formula. So this would become very impractical as we move to very high-dimensional optimization problems.

So this numeric gradient not only is it slow, but it’s also not even computing the correct thing at all. But it’s still useful to think about in terms of what is actually going on when we try to compute gradients in these large scale systems.

So OK, so now at this point, you might have thought that this was kind of a stupid thing to have done, right?

## Analytic Gradient: A Better Approach

The loss function is defined as a function of the weight matrix W, which is fundamental to our optimization problem. The loss function can be written as L = (1/N)∑Li + ∑Wk², where Li represents individual losses and includes a max margin term max(0, sj – sy_i + 1).

Our objective is to compute ∇W L, the gradient of the loss function with respect to the weight matrix W, where s = f(x;W) = Wx. Understanding vector and matrix calculus rules will help us derive this gradient efficiently.

The loss function L is defined as the sum of individual losses Li plus a regularization term of squared weights, where Li uses a max-margin formulation comparing scores sj with the true class score, and s is computed as the product of weights W and input x.

The slide introduces how to compute the analytic gradient where loss is a function of W. The slide shows portraits of two historical mathematicians who contributed to calculus. The goal is to derive ∇WL using vector calculus to find an exact gradient expression as a function of the weight matrix W.

Historical portraits of mathematicians with formal attire and long dark hair

But now what we want to do is then hopefully we can then just use our knowledge of vector calculus to write down an expression that tells us directly what is an algebraic or analytic expression for the exact gradient as a function of the weight matrix W.

Looking at our current weight matrix W [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33] with a loss of 1.25347, we derive the expression for dL/dW (the derivative of the loss with respect to the weight matrix). This allows us to compute the gradient dL/dW, which yields values [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1] in a single operation.

That hopefully now it will no longer require a number of four passes linear in the number of dimensions of the weight matrix. And the details of exactly how we will derive these gradient expressions.

And in practice, we’ll often rely on back combination, which we’ll talk about in much more detail, I think, one week from today’s lecture six. So for the time being, you can just hope to pull back on your knowledge of calculus.

But moving forward, we’ll use back combination as a more structured algorithm to derive gradients for arbitrarily large arbitrarily complex expressions. But for the linear classification, for the linear classifiers we’ve considered so far, hopefully you should be able to work it out on paper. Or if not, then you can use back combination, even for those, as we’ll talk about on the topic.

## Computing Gradients: Two Approaches

Computing gradients can be done through two main approaches, as shown in the title slide. We’re going to compute gradients at every point and use that gradient to iteratively improve our estimate of where we want to fall in the loss landscape.

There are two distinct methods for computing gradients: numeric and analytic gradients. The numeric gradient is approximate, slow, but very easy to write and implement. In contrast, the analytic gradient is exact and fast, but can be error-prone.

In practice, we always use the analytic gradient but check its implementation with the numerical gradient – a process called gradient check. Because deriving analytic gradients requires calculus work on paper, it can sometimes be error-prone. Mistakes can occur in computation or derivation. This is particularly common when computing gradients for large complicated expressions.

Therefore, in practice, we utilize both methods. The numeric gradient serves as a debugging tool to verify proper implementation of the analytic gradient. So here, we would typically then write our code in a way that’s agnostic to the number of dimensions and the weight matrix. And then we can test the analytic version of our gradient computation on a very low dimensional version.

## Gradient Check: Best Practices

When comparing numerical and analytic gradients, numerical gradients are approximate, slow but easy to write, while analytic gradients are exact, fast but error-prone. In practice, you should always use analytic gradients but verify the implementation with numerical gradients through a gradient check.

This gradient check approach helps prevent implementation errors in your gradient computations. So we will use this strategy on the homework assignments. And you’ll see starting on assignment two, that again will be released later today.

You will be implementing your own gradient computations for linear classifiers. To debug these implementations, we’ve provided a gradient check function that compares your analytic gradient against an internally computed numerical gradient.

This is a very useful strategy and it’s applicable beyond homework assignments as well. In the PyTorch documentation, you’ll find that PyTorch provides a similar gradient check function.

When implementing gradient computations in PyTorch, it’s strongly recommended to use torch.autograd.gradcheck() function with parameters like eps=1e-06 and atol=1e-05 to verify proper gradient implementation.

PyTorch gradcheck function documentation and usage

The documentation also includes allclose() functionality to check between numerical and analytical gradients, which is essential when implementing derivatives in PyTorch.

We’ll explore examples of both numeric and analytic gradients later in the course, where numeric gradients are approximate but easier to implement, while analytic gradients are exact but more error-prone. PyTorch provides torch.autograd.gradgradcheck for verifying gradient implementations.

The gradgradcheck function verifies that backpropagating through computed gradients to the specified grad_outputs is correct, ensuring proper gradient computation. When implementing gradient code in practice, these gradient checking tools are essential for verifying correct implementation.

## Gradient Descent Algorithm

The algorithm of gradient descent fundamentally involves iteratively stepping in the direction of the negative gradient, which represents the direction of local steepest descent. So recall that we had talked about this way, this strategy for optimizing the loss function, where we’re going to start somewhere, and then feel with our feet to see what is the local gradient direction, and then step in that direction, and then repeat.

The gradient descent algorithm is remarkably concise, implemented in just four lines of Python code that initialize weights and iteratively update them based on computed gradients.

Python code implementation of gradient descent showing weight updates

The implementation initializes weights, loops for a specified number of steps, computes the gradient of the loss function, and updates weights by stepping in the negative gradient direction with a learning rate.

Gradient descent has three key hyperparameters: the weight initialization method, number of steps, and learning rate.

## Understanding Gradient Descent Hyperparameters

The concept of Gradient Descent involves iteratively stepping in the direction of the negative gradient, which represents the direction of local steepest descent. The implementation begins with w = initialize_weights(), followed by iterating through steps where we compute the gradient and update weights using the learning rate.

There are three key hyperparameters in gradient descent: the weight initialization method, number of steps, and learning rate. The number of steps determines how long we run the algorithm, as we need finite execution time for practical applications.

The learning rate (learning_rate * dw in the code) controls how much we trust the gradient and determines the size of each step in the direction of steepest descent. There’s many different stopping criteria. You might imagine using ingredient descent, but the most common one, indeed learning, is simply to run for some fixed number of iterations.

And this hyperparameter is usually running more iterations, tends to work better for most be flirting applications. So this hyperparameter is usually constrained by your computational budget and how long you’re willing to wait for your models to train.

And the third hyperparameter here is the learning rate. So when we take the step in the negative gradient, we actually need to say how much do we actually trust this gradient? How big of a step do we want to take?

The local gradient tells us the direction of greatest decrease for our function, but we need to actually decide how much we want to step in that direction of greatest decrease. The hyperparameter controlling this step size is usually called the learning rate because it controls how fast your network learns, where higher learning rates mean we’re taking larger steps.

And maybe you might hope that your algorithm would converge faster. Lower learning rates are maybe less prone to numeric explosion, but we’ll take longer to converge. We’ll take longer to learn.

So these three hyperparameters will be some of your friends throughout the rest of the semester.

## Visualizing Gradient Descent

The concept of gradient descent can be visualized through a contour plot or heat map. The algorithm of gradient descent can be understood through this heat map visualization, where different colors represent the loss function’s values.

Colored contour plot showing gradient descent path

In this visualization, the axes represent two dimensions of the weight matrix (W_1), while the colors indicate the height of the objective function at each point. Starting from an initial point (marked in white), we compute the negative gradient, which indicates the direction of steepest descent in the loss function.

The algorithm then takes iterative steps along this gradient direction using a specified learning rate, as shown by the dashed line path in the visualization.

The gradient descent algorithm starts at a point in the blue region of a rainbow-colored contour visualization, where the colors transition from blue at the edges through green, yellow, and red, creating a bowl-shaped objective landscape.

Rainbow gradient contour visualization showing optimization landscape

Following the direction of steepest descent, as defined in the algorithm’s core equation w -= learning_rate * dw, the process iteratively moves toward the red region at the center, where the loss is minimized.

## Key Properties of Gradient Descent

The gradient descent algorithm, which iteratively steps in the direction of negative gradient for steepest descent, exhibits several interesting features in optimization landscapes. In the colorful optimization landscape shown, the descent path does not proceed directly to the minimum.

Due to the rainbow-colored bowl-shaped landscape, the gradient direction follows an angular path rather than descending straight to the minimum. The algorithm’s path deviates from a direct descent to the minimum. The descent path curves along the contours of the optimization landscape before converging to the minimum, as illustrated by the black curved line in the visualization.

The algorithm begins with rapid descent, as shown by the steeper initial portion of the curved path. The slide introduces the concept of Gradient Descent, which iteratively steps in the direction of the negative gradient to find local steepest descent. The implementation requires three key hyperparameters: weight initialization method, number of steps, and learning rate.

The gradient descent algorithm is implemented with a multiplicative learning rate applied to the gradient, as shown in the code: w -= learning_rate * dw

As we approach the bottom of the bowl, the objective landscape flattens out, as visualized in the rainbow gradient image. The magnitudes of our gradients will become small. The step sizes naturally become smaller due to the multiplicative relationship between learning rate and gradient.

This parameterization of gradient descent allows the algorithm to naturally slow down as it approaches flat regions or minima of the objective function.

## Batch Gradient Descent vs. Stochastic Gradient Descent

This version of gradient descent is called Batch Gradient Descent, as shown in the slide title. The loss function L(W) is defined as a sum over N individual loss functions Li(xi, yi, W) for all training examples, plus a regularization term λR(W).

Li represents the loss function on how well our classifier is doing on that one individual example. The full loss function L(W) averages these individual losses over N training examples and includes a regularization term.

The gradient ∇W L(W) follows the same structure, being the average of individual gradients plus the gradient of the regularization term. This full sum becomes very expensive when N is large.

If n is something like a million, or 10 million, or a billion, if you’re lucky enough to have a very, very large data set, then simply computing the value of this loss function will involve a loop over tens of thousands or millions or billions of training samples.

Computing a single step of gradient descent requires waiting a very long time to loop through the entire dataset. In practice, this approach becomes infeasible for training on large datasets where high performance is desired.

In practice, we often use a variant of gradient descent called Stochastic Gradient Descent (SGD), which is computationally more efficient for large datasets. Instead of computing the sum over the full training dataset, SGD approximates the loss function L(W) and its gradient by using minibatches of typically 32, 64, or 128 examples, as shown in the mathematical formulation where each Li represents the loss for a single training example.

## Stochastic Gradient Descent Implementation

The slide introduces Stochastic Gradient Descent (SGD), showing both the mathematical formulation and implementation approach. Common minibatch sizes are 32, 64, or 128 examples, as shown in the implementation notes. The implementation shows how to modify the algorithm to use minibatches instead of the full dataset, with a Python-style pseudocode demonstration.

SGD implementation pseudocode

The key hyperparameters for SGD include weight initialization, number of steps, learning rate, batch size, and data sampling strategy.

Once we move from full batch to stochastic gradient descent, we introduce a couple more hyper parameters into the system. We still have our old friends of weight initialization, number of steps, and learning rates. We’ve also introduced batch size as a new hyper parameter.

That is, when we’re computing these mini batches, how many elements should be in each mini batch? Thankfully, it turns out, empirically, that this one tends not to be too sensitive by the hyper parameter.

And the general here is just make the mini batch size as large as you can until you run out of GPU memory. Or if you happen to have access to multiple GPUs or multiple machines, it turns out that you can distribute this and have very, very large batches that are distributed over multiple GPUs or multiple machines.

And it turns out, this is all empirically speaking, that for many neural network systems, the exact batch size you use doesn’t matter too much as long as you properly account for other things in the system. So we’ll talk more about those details later in the class.

But the general rule of thumb here is, don’t worry too much about the batch size as a hyper parameter. Instead, just try to make it as big as you can fit.

And then we’ve introduced another hyper parameter here, which is the method by which we’re going to sample our training data at every iteration.

For classification problems like we’ve been talking about so far, this tends not to matter too much. On the homework assignments, what we’ll do is just draw the data at random for each iteration.

Another common strategy would be to shuffle the data set at the beginning. We’re going to shuffle the data set and then march through that shuffle data set in order and then shuffle the data set again and then march through it again in order.

For classification problems like we’ve been considering, the latter is more common. But I think it tends not to matter too much for your final results.

But if you think about other types of problems, then things like structured prediction problems or maybe triplet matching problems or ranking problems, then the way in which you iterate over your data can sometimes be an important hyper parameter. But thankfully, that’s not the case for many image classification problems.

## Understanding the “Stochastic” in SGD

So you might be wondering why this is called Stochastic Gradient Descent (SGD). It’s kind of a funny name. The reason for that is that we now think of our loss function probabilistically, as an expectation over the full data distribution pdata.

When we think about our data as having been sampled from some underlying probability distribution pdata, we can write the loss function L(W) as an expectation over all possible samples, plus a regularization term λR(W). This expectation can be approximated using a Monte Carlo estimation, where we average over N independent loss functions L(xi, yi, W) from our dataset, plus the regularization term.

When thinking about loss functions this way, we can choose how many samples to take to approximate the full expectation over the data distribution pdata. This probabilistic thinking introduces the ‘stochastic’ element in Stochastic Gradient Descent.

Similarly, we can approximate the gradient using a small Monte Carlo sample from the full distribution. Where this term stochastic comes from in the names stochastic gradient descent. And of course, the gradient is much the same so that we can approximate the gradient as well by taking a small Monte Carlo estimate of samples over the full distribution.

## Interactive Learning Tools

We have an interactive web demo at vision.stanford.edu/teaching/cs231n-demos/linear-classify/ that lets you experiment with linear classifiers and gradient descent training in your web browser.

Interactive classifier visualization with parameter controls

The demo allows you to adjust loss functions, weights, and step size while visualizing how these hyperparameters affect decision boundaries across different colored regions, with real-time updates to total loss (2.57), data loss (0.64), and regularization loss (1.92).

So it turns out that this is actually, this actually works. Now at this point, you know enough to go and implement the linear classifier portion on assigned to.

## Problems with Basic SGD

But it turns out that this simple algorithm of stochastic gradient descent, even though it’s actually quite effective, there are some potential problems with the approach. Let’s examine some situations where this basic version of stochastic gradient descent might cause problems.

One problem with SGD occurs when the loss landscape has dramatically different rates of change in different directions. The contour plot shows concentric elliptical curves representing the loss landscape.

Contour plot with concentric elliptical curves showing optimization challenges

This loss landscape exhibits a condition number problem, where the ratio of largest to smallest singular values of the Hessian matrix is large, creating an elongated elliptical shape.

When applying gradient descent on such a landscape, the optimization path may oscillate between the elongated sides of the contours rather than taking a direct path to the optimum at the center. The optimization path tends to oscillate around the contours when moving across the narrow dimension.

The step size presents a crucial trade-off in this type of loss landscape. If the step size is too large, the gradient updates may overshoot the optimal point, leading to continued oscillations.

## Conclusion

We’ve covered the essential foundations of optimization in neural networks, from the intuitive mountain landscape metaphors to the mathematical rigor of gradient computation and the practical realities of training large-scale models.

**Key takeaways from our optimization journey:**

– **Gradient descent transforms optimization** from an abstract mathematical problem into an intuitive process of following slopes downhill through high-dimensional parameter spaces

– **The choice between numerical and analytic gradients** reveals the trade-off between simplicity and efficiency, with gradient checking serving as our safety net for implementation correctness

– **Hyperparameter tuning** (learning rate, batch size, initialization) becomes an art that balances convergence speed, stability, and computational resources

– **Stochastic Gradient Descent** emerges as the practical solution for large datasets, trading exact gradients for computational feasibility through mini-batch sampling

– **Real-world challenges** like ill-conditioned loss landscapes remind us that even simple algorithms can face complex optimization problems

The elegant simplicity of gradient descent—just four lines of code—belies its power to train models with billions of parameters. Yet as we’ve seen with oscillation problems in elongated loss landscapes, this basic approach has limitations that motivate more sophisticated optimization methods.

Understanding these fundamentals provides the foundation for exploring advanced optimizers like momentum, Adam, and adaptive learning rate methods that address the challenges we’ve identified. The mathematical intuition and practical considerations covered here will serve as essential building blocks as neural networks continue to grow in complexity and capability.

Whether you’re implementing your first linear classifier or debugging training dynamics in large-scale models, these optimization principles remain at the heart of making neural networks learn effectively.