#003 Machine Learning – Improving The Performance Of A Learning Algorithm
Highlights: Welcome back to our new Machine Learning series. In the previous post, we covered Linear Regression, Cost Functions, and Gradient Descent. We also built a simple Linear Regression model using Python.
In this tutorial post, we will learn how to make our Linear Regression model faster and more powerful. We will start by building a Linear Regression model using multiple features and then, enhance its performance using various techniques. And finally, we’ll implement what we learn about Multiple Linear Regression models using simple code in Python. So, let’s begin!
Tutorial Overview:
- Linear Regression Using Multiple Features
- Understanding Vectorization
- Gradient Descent For Multiple Linear Regression
- Feature Scaling
- Choosing The Right Learning Rate
- Feature Engineering
- Implementing Multiple Linear Regression In Python
1. Linear Regression Using Multiple Features
In our previous post, we studied an example for predicting the price of a house given the size of the house. In that particular example, we worked with the original version of Linear Regression which utilized only a single feature \(x \), the size of the house, in order to predict \(y \), the price of the house.
The equation for our original Linear Regression model was:
$$ f_{w,b}(x) = wx+b $$
Using this function, we learned how to predict the price of the house with just one feature which was the size of the house.
However, what do we do if we have other features apart from the size of the house given to us, such as the number of bedrooms, the number of floors, and the age of the home?
In this case, we have a lot more information with us to predict the price of the house.
Let us introduce some new notation here. We will use the variables \(x_{1} \), \(x_{2} \), \(x_{3} \), and \(x_{4} \) to denote these four features. For simplicity, let’s introduce a little bit more notation.
- \(x_{j} \) – \(j^{th} \) feature
- \(n \) – number of features
- \(\vec{x}^{(i)} \) – features of the \(i^{th} \) training example
- \(x_{j}^{(i)} \) – value of feature \(j \) in the \(i^{th} \) training example
Here, \(x_{j} \) represents the \(j^{th} \) feature, and \(j \) ranges from one to four since we have four features. We will use \(n \) to denote the total number of features. In our example above, \(n=4 \).
Furthermore, we’ll use \(\vec{x}^{(i)} \) to denote the features of the \(i^{th} \) training example. Here, \(\vec{x}^{(i)} \) is actually going to be a list, or a vector, containing four numbers: all the features of the \(i^{th} \) training example. For instance, \(\vec{x}^{(2)} \) will be a vector of the features for the second training example, so it will be equal to the second row of the table.
$$ \vec{x}^{(2)} = [560,4,1,12] $$
Sometimes, we call this a row vector.
To refer to a specific feature in the \(i^{th} \) training example, we will write \(x_{j}^{(i)} \). For instance, \(x_{3}^{(2)} \) will be the value of the third feature, that is the number of floors in the second training example. Therefore, \(x_{3}^{(2)} = 1\).
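To connect this notation with code, suppose the training examples are stored as the rows of a NumPy array. Keep in mind that code is zero-indexed, so the second training example is row 1. In the sketch below, the first row is made up purely for illustration; the second row is the example vector shown above.
import numpy as np
# Illustrative only: X holds the training examples as rows (zero-indexed in code)
X = np.array([[952, 2, 1, 65],     # a made-up first training example
              [560, 4, 1, 12]])    # the second training example shown above
x_2 = X[1]        # all features of the 2nd training example -> [560, 4, 1, 12]
x_3_2 = X[1, 2]   # 3rd feature of the 2nd training example -> 1 (number of floors)
print(x_2, x_3_2)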
Now that we have multiple features, let’s take a look at what our model would look like. In our previous post, we defined the model using the following formula:
$$ f_{w,b}(x) = wx+b $$
Here, \(x \) was a single feature.
However, with multiple features, we’re going to define it in the following way:
$$ f_{w,b}(x) = w_{1}x_{1}+w_{2}x_{2}+w_{3}x_{3}+w_{4}x_{4}+ b $$
Using the formula above, we can also define one possible model to concretely estimate the house price:
$$ f_{w,b}(x) = 0.1x_{1}+4x_{2}+10x_{3}-2x_{4}+80 $$
Let us now interpret the parameters used above.
If the model predicts the price of the house in thousands of dollars, you can think of the term \(b = 80 \) as the base price of the house. The base price starts off at $80,000, assuming the house has no size, no bedrooms, no floors, and no age. You can think of \(w_{1} = 0.1 \) as meaning that for every additional square foot, the price will increase by $100.
Similarly, for each additional bedroom, the price increases by $4,000, and for each additional floor, the price may increase by $10,000. Finally, for each additional year of the house’s age, the price may decrease by $2,000, because the parameter \(w_{4} \) is negative 2.
We can generalize the above formula to write it for \(n \) given features, as follows:
$$ f_{w,b}(x) = w_{1}x_{1}+w_{2}x_{2}+\dots+w_{n}x_{n}+ b $$
Let us simplify our expression further by introducing some more notations.
We will define \(\vec{w} = [w_{1} \dots w_{n}] \) as a vector of the parameters. Same as before, \(b \) is a single number. We can also collect the features into a vector \(\vec{x} = [x_{1} \dots x_{n}] \).
With this new notation, the model can now be rewritten in the following way:
$$ f_{\vec{w},b}(\vec{x})= \vec{w}\cdot \vec{x}+b $$
In linear algebra, this ‘dot’ refers to the dot product of two vectors. This particular notation lets us write the model in a more compact form with fewer characters.
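Written out, the dot product is simply the sum of the element-wise products, so the compact form is exactly the same model as before:
$$ \vec{w}\cdot \vec{x} = \sum_{j=1}^{n} w_{j}x_{j} = w_{1}x_{1}+w_{2}x_{2}+\dots+w_{n}x_{n} $$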
The name for this type of Linear Regression model with multiple input features is Multiple Linear Regression. This is in contrast to Univariate Linear Regression, which has just one feature.
Before we move ahead to see the implementation of this Multiple Linear Regression model in Python, we need to understand a neat trick that can make our implementation much simpler. We call it Vectorization and it is useful not just in the case of Linear Regression but can also be applied while implementing other learning algorithms.
2. Understanding Vectorization
Vectorization is a very useful concept. It has the power to make your code shorter and more efficient. Learning how to write vectorized code will allow you to take advantage of modern numerical linear algebra libraries as well as GPU hardware.
Vectorization actually has two distinct benefits. First, it makes code shorter, and second, it also results in your code running much faster. We will understand the concept better by studying an example in Python.
Let’s create a model for predicting output using synthetic data. Here, we will create an example with parameters \(w \) and \(b \), where \(w \) is a vector of 1,000,000 random numbers, \(b \) is a single number, and we have a vector of features \(x \) which also contains 1,000,000 numbers.
First, let’s import NumPy and the time module. We can use time to measure how long the code takes to execute. We will create the vectors \(x \) and \(w \) using the function np.random.rand(), and a variable \(b \) which will be equal to 4.
import numpy as np
import time
x = np.random.rand(1000000)
w = np.random.rand(1000000)
b = 4
Have a look at an implementation without vectorization for computing the model’s prediction. First, we will use a for loop to compute the prediction.
y_hat = 0
tic = time.time()
for i in range(1000000):
    y_hat += x[i] * w[i]   # accumulate each product one element at a time
y_hat += b                 # add the bias term once at the end
toc = time.time()
print("for loop:"+ str(1000*(toc-tic))+"ms")
674.6037006378174ms
Initially, we set the value of the prediction y_hat to 0. Then, we create a variable tic where we store the time at the beginning of the for loop, and a variable toc that stores the time at the end of the loop. Inside the for loop, we iterate through all 1,000,000 elements, accumulating the products, and add \(b \) once at the end to complete the prediction.
As you can see, this experiment took us 0.67 seconds.
Now, let’s calculate the prediction again but this time using Vectorization. The function \(f \) is actually the dot product of \(w \) and \(x \), where \(b \) is added at the end. We can implement this with a single line of code by using the NumPy function np.dot().
tic = time.time()
y_hat = np.dot(x, w) + b   # single vectorized operation
toc = time.time()
print("Vectorized:"+ str(1000*(toc-tic))+"ms")
Vectorized:9.137153625488281ms
Wow! Take a look at this result. Using Vectorization, our code runs much faster compared to the previous example without Vectorization.
Now that we have a fair bit of understanding about Vectorization, let us move ahead and compute Gradient Descent for Multiple Linear Regression, using the technique of Vectorization.
3. Gradient Descent For Multiple Linear Regression
First, let’s quickly review what Multiple Linear Regression looks like. The parameters of the model are a vector of weights and a single number:
$$ \vec{w}= [w_{1}\dots w_{n}] $$
$$ b = \text{a number} $$
Using vector notation, we can write the model as:
$$ f_{\vec{w},b}(\vec{x})= \vec{w}\cdot \vec{x}+b $$
We can also write a Cost Function using this vector notation.
$$ J(\vec{w},b) $$
Here’s what Gradient Descent, for parameters \(w \) and \(b \), looks like when it is written in vectorized form.
$$ w_{j} = w_{j}-\alpha \frac{dJ(\vec{w},b)}{dw_{j}} $$
$$ b = b-\alpha \frac{dJ(\vec{w},b)}{db} $$
Next, let’s see what this looks like when we implement Gradient Descent. Notice the derivative term in particular.
$$ w_{j}= w_{j}-\alpha \frac{1}{m}\sum_{i=1}^{m}(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})x_{j}^{(i)} $$
$$ b= b-\alpha \frac{1}{m}\sum_{i=1}^{m}(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}) $$
We can see that Gradient Descent becomes just a little bit different with multiple features compared to just one feature. One difference is that \(w \) and \(x \) are now vectors. For Multiple Linear Regression, \(j \) ranges from 1 through \(n \). If we implement these equations, we will get Gradient Descent for Multiple Linear Regression.
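Just to see how Vectorization applies here, the derivative terms for all \(n \) weights can be computed at once with matrix operations. The snippet below is only an illustrative sketch: the variable names and the setup are assumptions, and the data values are simply borrowed from the house example.
import numpy as np
# Illustrative vectorized gradient for Multiple Linear Regression (setup is assumed)
X = np.array([[2104., 5., 1., 45.],
              [1416., 3., 2., 40.]])   # (m, n) matrix of training examples
y = np.array([460., 232.])             # (m,) target prices
w = np.zeros(4)                        # (n,) weights
b = 0.0
m = X.shape[0]
err = X @ w + b - y                    # prediction errors for all m examples at once
dj_dw = X.T @ err / m                  # derivative of J with respect to each w_j
dj_db = err.sum() / m                  # derivative of J with respect to b
print(dj_dw, dj_db)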
Now, let us learn some handy tricks to improve the performance of Multiple Linear Regression. One such trick is called Feature Scaling.
4. Feature Scaling
Feature Scaling is a method used to normalize the range of features of data. In data processing, this method is also known as Data Normalization and is, generally, performed during the data preprocessing step. So, if we are given multiple features like the size of the house, the number of bedrooms, and the number of floors, Feature Scaling is usually used to scale the features to be in the same range. Let’s understand this using an example.
For instance, suppose we need to predict the price of a house using two given features – the size of the house and the number of bedrooms. Let’s say that the size typically ranges from 300 to 2000 square meters, and the number of bedrooms in the dataset ranges from 0 to 5.
Therefore, in this example, the size feature takes a relatively large range of values and the number of bedrooms feature takes on a relatively small range of values.
Now, consider a sample house that has a size of 2000 square meters, five bedrooms, and a price of $500,000. For this training example, what do you think are reasonable values for the parameters \(w_{1} \) and \(w_{2} \)? Well, let’s look at one possible set of parameters: say \(w_{1}=50 \), \(w_{2}=0.1 \), and \(b=50 \).
Here, the estimated price is $100,000K plus $0.5K plus $50K, which is slightly over $100 million. This is clearly very far from the actual price of $500,000, so these do not seem to be very good parameter choices for \(w_{1} \) and \(w_{2} \). Now, let’s take a look at another possibility.
Let’s say that \(w_{1} \) and \(w_{2} \) are the other way around: \(w_{1}=0.1 \) is relatively small and \(w_{2}=50 \) is relatively large. In this case, the predicted price is $200K plus $250K plus $50K.
As you can see, this version of the model predicts a price of $500,000, which is a much more reasonable estimate and happens to be the same as the true price of the house.
So, how does this relate to Gradient Descent?
Take a look at a scatter plot of the features, where the size in square meters is along the horizontal axis and the number of bedrooms is along the vertical axis. If we plot the training data, the horizontal axis covers a much larger scale, or a much larger range of values, than the vertical axis.
Now notice how the Cost Function is represented in a contour plot. Here, the horizontal axis (\(w_{1} \)) has a much narrower range, say between zero and one, whereas the vertical axis (\(w_{2} \)) takes on much larger values, say between 10 and 100. The contours form ovals or ellipses that are short in one direction and long in the other. The reason is that a very small change to \(w_{1} \) has a very large impact on the estimated price, and therefore a very large impact on the cost \(J \), because \(w_{1} \) is multiplied by a very large number (the size).
In contrast, it takes a much larger change in \(w_{2} \) to change the predictions, so small changes to \(w_{2} \) don’t change the Cost Function nearly as much.
Such tall, skinny contours make Gradient Descent bounce back and forth and take a long time to reach the global minimum, which is not ideal. In order to simplify things, a useful thing to do is to scale the features. This involves transforming the training data so that \(x_{1} \) and \(x_{2} \) both range from 0 to 1, which makes the data points occupy comparable ranges along both axes.
With scaled features, Gradient Descent can find a much more direct path to the global minimum.
There are multiple ways to perform Feature Scaling. Two of the most common methods used for Feature Scaling are:
- Normalization
- Standardization
Normalization
Normalization is also known as Min-max Scaling or Min-max Normalization. It is the simplest method for rescaling the range of features to fit in the range [0, 1]. The general formula for Normalization is given as:
$$ x' = \frac{x-\min(x)}{\max(x)-\min(x)} $$
Here, \(max(x) \) and \(min(x) \) are the maximum and the minimum values of the feature respectively.
Standardization
Feature Standardization makes the values of each feature in the data have zero mean and unit variance. The general method of calculation is to determine the distribution mean and standard deviation for each feature and then, calculate the new data point using the following formula:
$$ x' = \frac{x-\bar{x}}{\sigma } $$
Here, \(\sigma \) is the standard deviation of the feature vector and \(\bar{x} \) is the average of the feature vector.
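To make this concrete, here is a small NumPy sketch of both methods applied to a single feature column (the house sizes from our example). The variable names are only for illustration.
import numpy as np
# Min-max normalization and standardization of one feature column (illustrative)
size = np.array([2104., 1416., 1534., 852.])                 # house sizes from the example
size_norm = (size - size.min()) / (size.max() - size.min())  # rescaled into [0, 1]
size_std = (size - size.mean()) / size.std()                 # zero mean, unit variance
print(size_norm)
print(size_std)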
Now, in order to proceed towards training and optimizing our learning algorithm, we must decide on a suitable learning rate. Let’s move to the next section and understand how to set an appropriate learning rate.
5. Choosing The Right Learning Rate
One of the most important hyperparameters that we need to set when training a model is the learning rate for the optimization algorithm. This parameter is a small number, usually ranging between 0.1 and 0.0001, that scales the magnitude of the weight updates used to minimize the model’s loss function.
Often during the training process, we use the same learning rate. However, it is highly recommended to adjust the learning rate in order to get better results. Here is why:
The goal of Gradient Descent is to minimize the loss between the actual and the predicted output. Remember that we start the training process with arbitrarily set weights and biases. Then, we update these weights and biases as we move closer to the minimum of the loss function.
The size of these steps when we move towards the minimized loss depends on the learning rate. Now, if we choose a step that is too large we can pass the minimum and miss it. On the other hand, if we choose a small step it will take a very long time for us to reach the minimum. To better understand this, have a look at the following image.
The solution to this problem is to decrease the learning rate as we move closer to the minimum of the loss function, instead of keeping the learning rate fixed. In this way, we will take smaller and smaller steps until we reach closer to the minimum.
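Here is a minimal sketch of that idea in Python: the learning rate shrinks as the iterations go on. The decay rule and the numbers are assumptions chosen purely for illustration, not values from this post.
# Illustrative learning-rate decay: alpha shrinks as training progresses (numbers are assumed)
alpha_initial = 0.1
decay_rate = 0.5
for iteration in range(5):
    alpha = alpha_initial / (1 + decay_rate * iteration)
    print(f"iteration {iteration}: learning rate = {alpha:.4f}")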
6. Feature Engineering
The choice of features can have a huge impact on your learning algorithm’s performance. In fact, for many practical applications, choosing or entering the right features is a critical step to making the algorithm work well.
Let’s take a look at how we can choose or engineer the most appropriate features for our learning algorithm. Observe the following image.
Here, our goal is to predict the price of a house.
Say you have two features for each house. The first feature is the frontage of the lot and the second feature is the depth of the lot. Given these two features, we can build our model.
However, there’s another option for how you might choose a different way to use these features in the model that could be even more effective. You might notice that the area of the land can be calculated as the frontage or width times the depth. Intuitively, we can notice that the area of the land is more predictive of the price than the frontage and depth as separate features. You might define a new feature \(x_{3} \) that is equal to the area of the plot of land. With this feature, you can, then, have another model.
Now, the model can choose parameters \(w_{1} \), \(w_{2} \), and \(w_{3} \) depending on whether the data shows that the frontage, the depth, or the area is the most important feature for predicting the price of the house.
This is an example of Feature Engineering. You will use your knowledge and intuition to design new features by transforming or combining the original features of the problem. This will make it easier for the learning algorithm to make accurate predictions.
In short, even if you are given certain features to start off your model, you can define new features and design a much better model, depending on your application.
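As a small sketch of this idea in code, the new feature can be computed directly from the original columns. The frontage and depth values below are hypothetical and only serve to illustrate the transformation.
import numpy as np
# Hypothetical lot dimensions, used only to illustrate engineering a new feature
frontage = np.array([40., 30., 25.])          # x1: width of the lot
depth = np.array([100., 80., 60.])            # x2: depth of the lot
area = frontage * depth                       # engineered feature x3 = x1 * x2
X = np.column_stack([frontage, depth, area])  # feature matrix with the new column
print(X)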
We have learnt all about Multiple Linear Regression in the sections above. Let us move ahead and implement some of our knowledge using Python and build a simple Multiple Linear Regression model from scratch.
7. Implementing Multiple Linear Regression In Python
First, we will import the necessary libraries and read in the data. We will use the same example we studied above and try to predict the house price. The data is stored in the data frame “House_features”.
import pandas as pd
import numpy as np
import copy
import math
df = pd.read_excel("/content/House_features.xlsx")
df
| | size | n_of_bedrooms | number_of_flors | age_of_home | price |
|---|---|---|---|---|---|
| 0 | 2104 | 5 | 1 | 45 | 460 |
| 1 | 1416 | 3 | 2 | 40 | 232 |
| 2 | 1534 | 3 | 2 | 30 | 315 |
| 3 | 852 | 2 | 1 | 36 | 178 |
For simplicity, we will just take 4 houses in this example. Each house has 4 features: size, number of bedrooms, number of floors, and age. Using this training data, we want to create a program that can estimate the price of any other house.
Next, we will create the variable X_train, where we will store the values of all 4 features, and y_train, where we will store the values of the prices.
X = df.drop("price", axis='columns')
X_train = X.values
y = df["price"]
y_train = y.values
Great! Now, we are ready to start building our Multiple Linear Regression model. The next step is to choose optimal values for our parameters.
w_init = np.array([0.1,4,10,-2])
b_init = 80
Next, let’s create a function predict() that will calculate the prediction of our model.
def predict(x, w, b):
    p = np.dot(x, w) + b
    return p
Now, let’s call our function predict() and calculate the estimated values for each of these 4 houses.
f_wb = predict(X_train, w_init,b_init)
f_wb
array([230.4, 173.6, 205.4, 111.2])
The subsequent step is to calculate the cost for our model. For this, we will define the function compute_cost(). Here, we will implement the following equation for the cost:
$$ J(\vec{w},b)=\frac{1}{2m}\sum_{i=1}^{m}(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})^{2} $$
def compute_cost(X, y, w, b):
    m = X.shape[0]                    # number of training examples
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b  # model prediction for example i
        cost = cost + (f_wb_i - y[i])**2
    cost = cost / (2 * m)
    return np.squeeze(cost)
Now, let’s compute the cost using pre-chosen optimal parameters.
cost = compute_cost(X_train,y_train,w_init,b_init)
cost
6589.5199999999995
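As a side note, and keeping the earlier discussion of Vectorization in mind, the same cost can also be computed without the explicit loop. The function below is just an alternative sketch that reuses the variables we already defined; it is not used in the rest of the post.
def compute_cost_vectorized(X, y, w, b):
    # All predictions at once: err holds (prediction - target) for every example
    err = np.dot(X, w) + b - y
    return np.dot(err, err) / (2 * len(y))

# Should match the looped version above
print(compute_cost_vectorized(X_train, y_train, w_init, b_init))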
After calculating the cost, we are ready to compute the gradients. To do this, we will define the function compute_gradient().
The gradients are the derivative terms that appear in the following parameter update equations.
$$ w_{j} = w_{j}-\alpha \frac{dJ(\vec{w},b)}{dw_{j}} $$
$$ b = b-\alpha \frac{dJ(\vec{w},b)}{db} $$
In the above expressions, \(j \) ranges over the \(n \) features, and the parameters are updated simultaneously. Inside compute_gradient(), we will calculate the derivative terms using the following equations.
$$ \frac{dJ(\vec{w},b)}{dw_{j}} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)})x_{j}^{(i)} $$
$$ \frac{dJ(\vec{w},b)}{db} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}) $$
def compute_gradient(X, y, w, b):
    m, n = X.shape                           # number of examples, number of features
    dj_dw = np.zeros((n,))
    dj_db = 0.
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]   # prediction error for example i
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_db, dj_dw
tmp_dj_db, tmp_dj_dw = compute_gradient(X_train, y_train,w_init, b_init)
print(tmp_dj_dw)
print(tmp_dj_db)
[-1.977032e+05 -4.464000e+02 -1.581000e+02 -4.590200e+03]
-116.10000000000001
The function gradient_descent() will compute the derivatives, update the parameters, and store the cost history for each iteration. In this way, we can track how the value of the cost changes through the iterations.
def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iter):
    J_history = []                     # cost recorded at each iteration
    w = copy.deepcopy(w_in)            # avoid modifying the caller's array
    b = b_in
    for i in range(num_iter):
        dj_db, dj_dw = gradient_function(X, y, w, b)
        # update parameters simultaneously
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        if i < 100000:                 # cap the stored history to limit memory use
            J_history.append(cost_function(X, y, w, b))
        if i % math.ceil(num_iter / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f}")
    return w, b, J_history
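Before we can plot anything, we need to actually run gradient_descent() to obtain a cost history J_hist. A minimal, illustrative call might look like the following; the starting values, learning rate, and number of iterations here are assumptions, not values given above.
# Illustrative run of gradient descent (initial values, alpha, and iterations are assumed)
initial_w = np.zeros_like(w_init)
initial_b = 0.
alpha = 5.0e-7
iterations = 1000
w_final, b_final, J_hist = gradient_descent(X_train, y_train, initial_w, initial_b,
                                            compute_cost, compute_gradient, alpha, iterations)
print(f"w: {w_final}, b: {b_final:0.2f}")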
Finally, let’s plot our cost through each iteration.
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, constrained_layout=True, figsize=(12, 4))
ax1.plot(J_hist)                                              # full cost history
ax2.plot(100 + np.arange(len(J_hist[100:])), J_hist[100:])    # history after iteration 100
plt.show()
We can see that the cost drops sharply at the beginning and then, at some point, becomes quite stable.
Well, this brings us to the end of our topic for today, i.e., Multiple Linear Regression, which, together with techniques like Vectorization, Feature Scaling, and a well-chosen learning rate, improves the speed, efficiency, and overall performance of a learning algorithm. Let’s revise what we learned to reinforce the concepts.
Improving The Performance Of A Learning Algorithm Using Multiple Linear Regression
- Linear Regression with one feature is called Univariate Linear Regression
- Linear Regression with multiple features is called Multiple Linear Regression
- Techniques such as Vectorization and Feature Scaling can make the implementation of Multiple Linear Regression much faster and more efficient
- Multiple Linear Regression models make use of Vectorization to improve the performance of a learning algorithm
- The Gradient Descent algorithm in Multiple Linear Regression models involves the use of vectors
- Feature Scaling or Data Normalization is used to scale the features to a certain range
- Feature Scaling can be performed by using either Normalization or Standardization methods
- Decreasing the learning rate as training progresses, instead of keeping it fixed, can help Gradient Descent converge
- Feature Engineering is useful in defining new features by a combination or transformation of given features
Summary
This marks the end of yet another post in our exciting new series on Machine Learning. We hope you are now clear about the difference between Univariate and Multiple Linear Regression models. Try building your own model with another example and send in your results to us. We are always up for a good discussion! We’ll see you with another interesting topic on Machine Learning soon. Till then, take care! 🙂