#001 Machine Learning – Introduction To Machine Learning
Highlights: Machine Learning is one of the hottest career choices of the 21st century, and equipping oneself with the right skills to teach and train a machine is becoming more important by the day. In response to popular demand from the industry and our dear subscribers, we are starting a fresh tutorial series on Machine Learning.
In this blog post, you will learn about the basics of Machine Learning, the key challenges and problems that Machine Learning can tackle, and some important Machine Learning algorithms. So, let’s begin!
Tutorial Overview:
1. Machine Learning: An Overview
2. Supervised Learning
3. Unsupervised Learning
4. Summary
1. Machine Learning: An Overview
Machine learning as a concept has been around for quite some time. The term “Machine Learning” was coined by Arthur Samuel, a computer scientist at IBM and a pioneer in AI and computer gaming. Samuel designed a computer program for playing checkers. The more the program played the games of checkers, the more it gained experience and learned how to make predictions using algorithms.
Machine Learning is, in essence, the study and construction of algorithms that are capable of learning from data and making predictions using this data. Just as a human brain gains knowledge by doing a task over and over again, a machine brain understands entities, domains and mutual connections with the help of inputs such as training data, knowledge graphs and more.
A simple definition of Machine learning can be written as follows:
“Machine learning is an application of AI that enables systems to learn and improve from experience without being explicitly programmed. Machine learning focuses on developing computer programs that can access data and use it to learn for themselves”.
There are essentially two ways in which machines can be taught and trained. These two Machine Learning approaches have their own set of features and advantages.
- Supervised Learning
- Unsupervised Learning
Let us learn about each of these methods with their respective sub-categories, examples and some practical code implementation using Python.
2. Supervised Learning
Supervised learning is the category of Machine Learning that involves modelling the relationship between measured features of data \(x \) and a desired output label \(y \), which the learning algorithm eventually learns from.
Once this model is determined, it can be used to apply labels to new or unknown data. In other words, the model learns to take just the input alone without the output label and gives a reasonably accurate prediction of the output.
Now, this process of Supervised Learning can be divided into two tasks:
- Regression: Here, labels are considered as continuous quantities
- Classification: Here, labels are considered as discrete categories
Let us understand each of these tasks using their respective examples.
Regression Algorithms
Linear Regression
Consider an example wherein we want to predict housing prices based on the size of the house. First, we are going to collect some data and plot them.
In the graph above, on the horizontal axis is the size of the house measured in square meters. On the vertical axis is the price of the house in thousands of dollars.
Now, let’s say that we want to find out the price of a 750-square-meter house.
How can a learning algorithm help here? One thing a learning algorithm might be able to do is fit a straight line to the data. This is called Linear Regression.
In such a case, we can see that the house could be sold for about $130,000. However, fitting a straight line isn’t the only learning algorithm we can use. Other algorithms may work better for this application. For example, instead of fitting a straight line, you might decide that it’s better to fit a curve. Then, if we make a prediction, in that case, the house could be sold for $200,000.
This is an example of Supervised Learning. We gave the algorithm a dataset that contains a correct answer. That answer is the label or the correct price \(y \) which is given for every house on the plot. The task of the learning algorithm is to predict the right price for any given house size. That’s why we can categorize this as Supervised Learning.
In fact, this housing price prediction example is a particular type of Supervised Learning task called Regression. Here, we’re trying to predict a number from infinitely many possible numbers such as the house prices in our example.
Let us better understand Regression in Supervised Learning and specifically, Linear Regression by writing a simple code.
Linear Regression In Python
In order to explain how to implement regression tasks in Python, a good starting point would be to use the simplest form of Linear regression which is fitting a straight line to data.
Let’s begin with the standard imports.
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
Next, let’s write the formula for a straight line.
$$ y = wx + b $$
Here, \(w \) is commonly known as the slope of a line, and \(b \) is commonly known as the intercept of a line with \(y \) axis.
Now let’s create 100 random synthetic data points that are scattered about a straight line with a slope of 3 and an intercept of 10.
rng = np.random.RandomState(1)    # fixed seed so the results are reproducible
x = 10 * rng.rand(100)            # 100 random points in [0, 10)
y = 3 * x + 10 + rng.randn(100)   # slope 3, intercept 10, plus Gaussian noise
plt.scatter(x, y)
We can use Scikit-Learn’s LinearRegression estimator to fit this data and construct the best-fit line.
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)    # reshape x into a 2D column, as scikit-learn expects
xfit = np.linspace(0, 10, 100)
yfit = model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit)
The slope and intercept of the data are contained in the model’s fit parameters. Here, the relevant parameters are coef_ and intercept_, as shown below.
print("Model slope: ", model.coef_[0])
print("Model intercept:", model.intercept_)
Model slope: 2.968492508765533
Model intercept: 10.236957254148901
As you can see, the results are very close to the inputs.
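As a quick sanity check (a minimal sketch of our own; np.polyfit is simply an independent least-squares fit, not part of the tutorial’s pipeline), we can compare these parameters against NumPy’s own polynomial fit:
# Independent least-squares fit of a degree-1 polynomial (returns slope, intercept)
w, b = np.polyfit(x, y, deg=1)
print("polyfit slope:    ", w)
print("polyfit intercept:", b)
Both values should agree with model.coef_[0] and model.intercept_ up to floating-point noise.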
So, this is how we implement Linear Regression in Python as part of our Supervised Learning process.
Next, let us try and see how Classification is performed in Supervised Learning along with a simple example and of course, its Python implementation.
Classification Algorithms
The second major type of Supervised Learning algorithm is called a Classification algorithm. Let’s take a look at what this means.
We will take breast cancer detection as an example of a Classification problem.
Let’s say you’re building a Machine Learning system as a diagnostic tool for doctors to detect breast cancer. This tool can be extremely useful because early detection could potentially save a patient’s life. Using a patient’s medical records, your Machine Learning system tries to figure out whether a tumour is malignant or benign.
To start with, we first assume that our dataset has tumours of various sizes. All these tumours are labelled as either 0 for benign or 1 for malignant. Then, we can plot our data on a graph where the horizontal axis represents the size of the tumour and the vertical axis represents the diagnosis. It takes only two values 0 or 1 depending on whether the tumour is benign or malignant.
One reason that this is different from Regression is that we’re trying to predict only a small number of possible outputs or categories. In this case, we have only two possible outputs 0 or 1. This is different from Regression which tries to predict any number.
We can even plot our dataset on a single axis, as shown below.
In the graph above, we use two different symbols to denote the category of cancer. ‘O’ denotes the benign examples and ‘X’ represents the malignant examples.
If new patients walk in for a diagnosis and they have a lump of a particular size, then the question is, will your system classify this tumour as benign or malignant? It turns out that in Classification problems, you can also have more than two possible output categories.
The learning algorithm can even output multiple types of cancer diagnosis if it turns out to be malignant.
Let’s call the two different types of cancer Type 1 and Type 2. In this case, the algorithm would have three possible output categories it could predict – No Cancer, Type 1 Cancer or Type 2 Cancer. Do note that in Classification, the terms ‘output classes’ and ‘output categories’ are often used interchangeably.
Therefore, in short, Classification algorithms predict categories. Categories don’t have to be numbered. They could be non-numeric as well.
For example, a Classification algorithm can predict whether the given picture is of a cat or a dog. It can predict if a tumor is benign or malignant, as discussed in our example above. It can even categorize output in terms of numbers such as 0, 1 or 2.
The essential difference between Classification and Regression is that in Classification, only a small finite set of possible output categories can be predicted such as 0, 1 or 2 but not the numbers in between such as 0.5 or 1.7.
In our example, there was only a single input, i.e., the size of the tumour. However, we can also use more than one input value to predict an output. Have a look at the image below.
Here, instead of just knowing the tumour size, we also have each patient’s age in years. Now, our new dataset has two inputs – age and tumour size. In this new dataset, we’re going to use ‘circles’ to show patients whose tumours are benign and ‘crosses’ to show the patients with a tumour that was malignant. So, when a new patient comes in, the doctor can measure the patient’s tumour size and also record the patient’s age.
The question that arises here is, given two inputs about the patient’s tumour, how can we determine the type of tumour it is?
Well, what the learning algorithm might do is find some boundary that separates the malignant tumours from the benign ones. Here, the learning algorithm has to decide how to fit a boundary line through this data. The boundary line found by the learning algorithm would help the doctor with the diagnosis.
In the graph above, a blue dot represents the new tumour that we want to classify. We can notice that the tumour is more likely to be benign.
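To make this concrete, here is a hedged sketch (with made-up synthetic data standing in for age and tumour size, not a real diagnostic tool) of how scikit-learn’s LogisticRegression can learn such a boundary from two input features:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic two-feature data standing in for (age, tumour size);
# labels are 0 (benign) or 1 (malignant). Purely illustrative.
X_demo, y_demo = make_classification(n_samples=200, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     random_state=1)

clf = LogisticRegression()
clf.fit(X_demo, y_demo)

# Classify a new, unseen "patient" described by the same two inputs
print("Predicted class:", clf.predict([[0.5, -0.2]]))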
Naive Bayes Classification
To understand Classification better, let us take the example of a simple algorithm known as Naive Bayes Classification.
Naive Bayes models are a group of extremely fast and simple classification algorithms that are often suitable for very high-dimensional datasets. A Naive Bayes Classifier assumes that the effect of a particular feature in a class is independent of other features. This assumption is called Class Conditional Independence and it simplifies computation. That is why it is considered ‘naive’.
Let’s take a look at the formula for Naive Bayes Classification.
$$ P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)} $$
Here:
- P(H) represents the probability of hypothesis H being true (regardless of the data). This is known as the prior probability of H.
- P(D) represents the probability of the data (regardless of the hypothesis). This is known as the prior probability of D.
- P(H|D) represents the probability of hypothesis H given the data D. This is known as posterior probability.
- P(D|H) represents the probability of the data D given that hypothesis H was true.
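Under the Class Conditional Independence assumption described earlier, the likelihood of the data factorizes over the individual features. Writing the features as \(d_1, \dots, d_n \) (this expanded form is our addition, implied by the assumption rather than stated above):
$$ P(H \mid d_1, \dots, d_n) \propto P(H) \prod_{i=1}^{n} P(d_i \mid H) $$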
Let us take an example.
An insurance company insured 2,000 scooter drivers, 4,000 car drivers, and 6,000 truck drivers. The probabilities of an accident involving a scooter driver, a car driver, and a truck driver are 0.01, 0.03, and 0.15 respectively. One of the insured persons meets with an accident. What is the probability that he is a scooter driver?
Let E1, E2, E3, and A be the events defined as follows:
- E1 = person chosen is a scooter driver
- E2 = person chosen is a car driver
- E3 = person chosen is a truck driver
- A = person who met with an accident
Since there are 12,000 people, therefore:
- P(E1) = 2000/12000 = ⅙
- P(E2) = 4000/12000 = ⅓
- P(E3) = 6000/12000 = ½
It is given that P(A | E1) = Probability that a person meets with an accident given that he is a scooter driver = 0.01
Similarly, you have:
- P(A | E2) = 0.03
- P(A | E3) = 0.15
You are required to find P(E1 | A), i.e., the probability that the person is a scooter driver given that the person has met with an accident.
$$ P(E_1 \mid A) = \frac{P(E_1)\,P(A \mid E_1)}{P(E_1)\,P(A \mid E_1) + P(E_2)\,P(A \mid E_2) + P(E_3)\,P(A \mid E_3)} $$
$$ = \frac{\frac{1}{6} \times 0.01}{\left(\frac{1}{6} \times 0.01\right) + \left(\frac{1}{3} \times 0.03\right) + \left(\frac{1}{2} \times 0.15\right)} = \frac{1}{52} $$
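We can quickly verify this arithmetic with a throwaway Python check (just the formula above written out, not part of any classifier):
# Priors and accident likelihoods from the insurance example above
p_e = [2000/12000, 4000/12000, 6000/12000]   # P(E1), P(E2), P(E3)
p_a_given_e = [0.01, 0.03, 0.15]             # P(A|E1), P(A|E2), P(A|E3)

numerator = p_e[0] * p_a_given_e[0]
denominator = sum(p * q for p, q in zip(p_e, p_a_given_e))
print(numerator / denominator)               # ≈ 0.01923, i.e., 1/52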
Now let’s examine another example in Python using the Naive Bayes Classification algorithm.
Naive Bayes Classification In Python
In this example, we will use a dummy dataset with three columns: weather, temperature, and play. The first two are features (weather and temperature) and the other is the label.
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']
play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
Next, we need to convert these string labels into numbers. This is known as Label Encoding. Scikit-learn provides the LabelEncoder() function for encoding labels with a value between 0 and one less than the number of discrete classes.
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)
temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)
weather_encoded = list(weather_encoded)
temp_encoded = list(temp_encoded)
label = list(label)
print("weather:", weather_encoded)
print("Temp:",temp_encoded)
print("Play:",label)
weather: [2, 2, 0, 1, 1, 1, 0, 2, 2, 1, 2, 0, 0, 1]
Temp: [1, 1, 1, 2, 0, 0, 0, 2, 0, 2, 2, 2, 1, 2]
Play: [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
Now let’s combine both the features (weather and temperature) in a single list of tuples.
#Combining weather and temp into a single list of tuples
features=list(zip(weather_encoded,temp_encoded))
print(features)
[(2, 1), (2, 1), (0, 1), (1, 2), (1, 0), (1, 0), (0, 0), (2, 2), (2, 0), (1, 2), (2, 2), (0, 2), (0, 1), (1, 2)]
Next, we will generate a model using a Naive Bayes Classifier in the following steps:
- Create a Naive Bayes Classifier
- Fit the dataset on the Classifier
- Perform prediction
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
#Create a Gaussian Classifier
model = GaussianNB()
# Train the model using the training sets
model.fit(features,label)
#Predict Output
predicted = model.predict([[2, 1]]) # 2: Sunny (weather), 1: Hot (temp)
print ("Predicted Value:", predicted)
Predicted Value: [0]
As you can see, the predicted value for sunny weather and hot temperature is 0, which indicates that the players should not play golf.
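If we also want the probabilities behind this prediction, GaussianNB exposes the standard scikit-learn predict_proba method (an optional extra, not required for the rest of the tutorial):
# Probability estimates for each class (0 = No, 1 = Yes)
print("P(No), P(Yes):", model.predict_proba([[2, 1]]))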
So, this was a brief overview of Supervised Learning and the two types of Supervised Learning – Regression and Classification. Supervised learning maps input \(x \) to output \(y \), where the learning algorithm learns from the right answers. While in the Regression application, the learning algorithm predicts numbers from an infinite number of output values, in the Classification application, the algorithm makes a prediction based on a small set of output values.
Now that we have a fair bit of understanding about Supervised Learning, let us study about the other major type of Machine Learning, i.e., Unsupervised Learning.
3. Unsupervised Learning
After Supervised Learning, the most widely used form of Machine Learning is Unsupervised Learning. There are two types of algorithms that come under Unsupervised Learning.
- Clustering Algorithms: Involve discrete labelling of inputs
- Dimensionality Reduction Algorithms: Involve reducing the number of input values
Clustering Algorithms
Let’s take a look at an example that we already used for Supervised Learning about breast cancer detection.
Recall that in Supervised Learning, each example was associated with an output label \(y \) such as benign or malignant.
Have a look at the right-hand side plot below. Instead of representing the categories of cancer with ‘crosses’ for malignant and ‘circles’ for benign, we won’t associate any particular symbol with our output labels. So, here, even though we have data for a patient’s tumour size and age, we don’t know whether it is benign or malignant.
Our key job with Unsupervised Learning is not to diagnose whether the tumour is benign or malignant. Instead, our job is to find some structure or pattern or points of interest in the data. Here, we are not trying to supervise the algorithm. Therefore, an unsupervised learning algorithm is free to decide whether or not and how the data needs to be divided into clusters.
This particular type of Unsupervised Learning is called Clustering.
Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labelling of groups of points.
k-Means Clustering in Python
Many clustering algorithms are available in Scikit-Learn and elsewhere. One of the most common algorithms is k-means clustering.
First, let’s import the necessary libraries.
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
The k-means algorithm searches for a predetermined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:
- The “cluster center” is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.
Those two assumptions are the basis of the k-means model. We will soon dive into exactly how the algorithm reaches this solution, but for now, let’s take a look at a simple dataset and see the k-means result.
First, let’s generate a two-dimensional dataset containing four distinct blobs.
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=400, centers=4,
cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=30)
Now let’s identify these four clusters using the k-means algorithm.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
Let’s visualize the results by plotting the data coloured by these labels. We will also plot the cluster centers as determined by the k-means estimator:
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
As you can see, the algorithm correctly identified all four clusters.
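To demystify what KMeans does internally, here is a minimal from-scratch sketch of the classic two-step loop: assign each point to its nearest center, then move each center to the mean of its assigned points. This is a simplification for intuition (fixed iteration count, naive random initialization), not a replacement for the scikit-learn estimator:
def simple_kmeans(X, n_clusters, seed=2, n_iters=10):
    # Pick random data points as the initial cluster centers
    rng = np.random.RandomState(seed)
    centers = X[rng.permutation(X.shape[0])[:n_clusters]]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # Update step: move each center to the mean of its assigned points
        centers = np.array([X[labels == k].mean(axis=0) for k in range(n_clusters)])
    return centers, labels

centers_scratch, labels_scratch = simple_kmeans(X, 4)
plt.scatter(X[:, 0], X[:, 1], c=labels_scratch, s=50, cmap='viridis')
On this well-separated dataset, the loop converges to essentially the same clusters that KMeans found above.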
Apart from Clustering algorithms, let us now study one of the most broadly used Unsupervised Learning algorithms, known as Dimensionality Reduction, and specifically, Principal Component Analysis (PCA).
Dimensionality Reduction Algorithms
Dimensionality Reduction refers to the techniques that reduce the number of input variables in a dataset. Such algorithms are very useful because more input features often make a predictive modelling task more challenging. High-dimensionality statistics and Dimensionality Reduction techniques are often used for data visualization. Nevertheless, these techniques can be used in applied Machine Learning to simplify a Classification or Regression dataset in order to fit a predictive model better.
Now, let’s explore some of these algorithms in Python.
Principal Component Analysis (PCA) In Python
PCA is fundamentally a dimensionality reduction algorithm, but it can also be useful as a tool for visualization, noise filtering, feature extraction, and engineering.
Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components. This results in a lower-dimensional projection of the data that preserves the maximal data variance.
Starting with the Python implementation, let us, first, create synthetic data points.
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');
Here is an example of using PCA as a dimensionality reduction transform:
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape: ", X.shape)
print("transformed shape:", X_pca.shape)
The transformed data has been reduced to a single dimension. To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)
plt.axis('equal');
In the plot above, the ‘light blue’ points are the original data, while the ‘dark orange’ points are the projected version. This makes clear what a PCA dimensionality reduction means.
The information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance. The fraction of variance that is cut out (proportional to the spread of points about the line formed in this figure) is roughly a measure of how much “information” is discarded in this reduction of dimensionality.
This reduced-dimension dataset is in some senses “good enough” to encode the most important relationships between the points. Despite reducing the dimension of the data by 50%, the overall relationship between the data points is mostly preserved.
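We can put a number on how much variance the single retained component captures using the fitted estimator’s explained_variance_ratio_ attribute:
# Fraction of total variance captured by the retained principal component
print("variance retained:", pca.explained_variance_ratio_[0])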
Introduction To Machine Learning
- Machine Learning is all about making machines intelligent using the power of algorithms, data and learning experience
- There are two types of Machine Learning – Supervised Learning and Unsupervised Learning
- Supervised Learning involves labelling inputs and using popular algorithms such as Regression and Classification
- Linear Regression is a popular Regression algorithm in Supervised Learning techniques
- Naive Bayes Classification is a popular Classification algorithm in Supervised Learning techniques
- Unsupervised Learning involves discrete division of inputs (Clustering Algorithm) or reducing the number of inputs (Dimensionality Reduction Algorithms)
- k-Means Clustering is a popular Clustering algorithm in Unsupervised Learning techniques
- Principal Component Analysis (PCA) is a popular Dimensionality Reduction algorithm in Unsupervised Learning techniques
Summary
So, folks, this brings us to the end of the first tutorial post of this amazingly knowledgeable series on Machine Learning. We covered both theory and code in this post, which makes for a very hands-on journey, as practical examples make every concept more fun to understand. Stay tuned for more detailed posts on Machine Learning, where we will take the concepts learnt today further. Also, please feel free to ask any doubts or questions you may have by dropping a message in the comments section. See you soon! 🙂