#003 GANs – Autoencoder implemented with PyTorch


Highlights: In this post, we will talk about autoencoders. In particular, you will gain a deeper insight into the working mechanisms of autoencoders. They are important machine learning models for data compression, analysis, and modeling. Moreover, we will present several autoencoder architectures and show how they can be implemented in PyTorch. So, let’s get started!

Tutorial Overview:

  1. Introduction to Autoencoders
  2. Image Reconstruction in Autoencoders
  3. Autoencoder based on a Fully Connected Neural Network implemented in PyTorch
  4. Autoencoder with Convolutional layers implemented in PyTorch

1. Introduction to Autoencoders

Our goal in generative modeling is to find ways to learn the hidden factors that are embedded in data. However, we cannot measure them directly, and the only data that we have at our disposal are the observed data. For instance, imagine that you have a large number of face images. This represents an observed dataset. In some of the images the person is smiling, and this factor can be termed a hidden or latent variable of the face dataset. With this in mind, an autoencoder is a very simple generative model which tries to learn the underlying latent variables in the data by coding its input. If they are so simple, how do they work?


2. Image Reconstruction in Autoencoders

The simplest version of an autoencoder can be a shallow neural network with a single hidden layer. This hidden layer connects the input with the output. The output is not labeled, and therefore we can already conclude that we are operating in an unsupervised learning domain. The goal of this network is to “pass” the input data and encode it within the hidden layer activations. Next, these activations should allow, as accurately as possible, the reconstruction of the input vectorized image. Hence, to clarify once more, we use a vector that represents a vectorized image. The following image explains this concept nicely.

We first begin by feeding the raw input data into the model, where it is passed through one (or more) neural network layers. We call the first part of the network the encoder, and the output of the encoder is a low-dimensional latent space. It is a feature vector representation that we are trying to reveal.

Specifically, this network maps the input data \(x \) into a vector of latent variables \(z \). Note that we can use a fully connected (dense) layer as the first layer. In this case, we will have to vectorize or flatten our input image. On the other hand, if we opt to use a convolutional layer as the first one, we can use the original image as the input.
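To make these two input conventions concrete, here is a minimal sketch (the tensor and variable names are just illustrative) of how an image batch would be prepared for a dense encoder versus a convolutional one.

import torch

# a dummy mini-batch of 64 grayscale images: (batch, channels, height, width)
images = torch.randn(64, 1, 28, 28)

# dense (fully connected) encoder: flatten each image into a 784-dimensional vector
flat_input = images.view(-1, 28 * 28)   # shape: (64, 784)

# convolutional encoder: keep the original (channel, height, width) layout
conv_input = images                     # shape: (64, 1, 28, 28)

print(flat_input.shape, conv_input.shape)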

Why do we care about this low-dimensional latent space?

Well, it can be very useful for the compression of our data. Furthermore, this step can also be useful for data visualization (e.g. similar to PCA). Moreover, the holy grail that we are searching for is compact and distinctive features. And last but not least, autoencoders are used for image denoising and reconstruction (image inpainting). Hence, those are the main applications of autoencoders.

To illustrate this topic further: when we work with images, the pixel-based space is high-dimensional. So, our goal is to take that high-dimensional data and encode it into a compressed latent vector representation.

Our next question is: How do we train the weights of a neural network to get this latent variable vector \(z \)? Well, the problem is that we never actually have access to this data (hence the name hidden 🙂 ), since we cannot directly observe it. Recall the cocktail party problem and the sound of a piano that we cannot measure directly. Even if we put hundreds of microphones, a piano tune is still going to be mixed with other sound sources in the room.

Hence, we do not have labeled data and we cannot cast this encoding process as a supervised learning problem. However, we can find a solution by adding a decoder structure. In simple words, the decoder will also be a fully connected or a convolutional neural network. Its architecture will be a symmetrical/mirrored version of the encoder. The goal of the decoder is to reconstruct a replica of the original image from this learned latent space.

In other words, with an image example, we can simply take the mean squared error between the input and the reconstructed image at the output. Here, the really important thing is that the loss function does not involve any labels. Therefore, this is an unsupervised learning problem. The only components of the loss are the input \(x \) and the reconstruction \(\hat{x} \).

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}\left( x_{i} - \hat{x}_{i} \right)^{2} $$

We will call the output of the decoder network the reconstructed output \(\hat{x} \).
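As a tiny sanity check of this idea, the reconstruction loss can be written directly from the formula above; the tensors below are just placeholders for a batch of inputs and their reconstructions.

import torch
import torch.nn as nn

x = torch.rand(64, 784)       # a batch of (vectorized) inputs
x_hat = torch.rand(64, 784)   # the corresponding reconstructions from the decoder

# mean squared error written out explicitly ...
mse_manual = ((x - x_hat) ** 2).mean()
# ... and the equivalent built-in criterion (default reduction is 'mean')
mse_builtin = nn.MSELoss()(x_hat, x)

print(mse_manual.item(), mse_builtin.item())   # the two values match

Note that there are no labels anywhere in this loss, which is exactly what makes the training unsupervised.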

Does this remind you of something?

So, this is going to be a lossy reconstruction of the original input \(x \). You may have noticed the similarity to the concept of PCA. There, we reduce the number of dimensions, and as a result, the data cannot be perfectly reconstructed if we choose just a few principal components (lossy compression). Nevertheless, we still enjoy many great properties of PCA and use it very often in practice.

Hence, this network will be trained using the reconstruction error as our objective function. That is, we want the input \(x \) and our reconstructed output \(\hat{x} \) to be as similar as possible. In addition, do pay special attention to this statement: the key concept here is that from a reduced set of variables, the vector \(z \), we need to reconstruct the output \(\hat{x} \), which will be of a much higher dimension. This is not an easy task!

Well, we know this was indeed a lot of theory.

Finally, it’s time for some coding.

3. Autoencoder based on a Fully Connected Neural Network implemented in PyTorch

Naturally, we start this part with the necessary libraries that we need to import.

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from torch import optim

The next step is to load the MNIST dataset. This has been explained numerous times on our blog, but for completeness here is how we can do it. We load both the training and the test dataset and use DataLoader for efficient dataset processing.

transform = transforms.Compose( [transforms.ToTensor()] )
mnist = datasets.MNIST(root = "./", train = True, download = True, transform = transform)
mnist_test = datasets.MNIST(root = "./", train = False, download = True, transform = transform)

train_loader = DataLoader ( mnist , batch_size = 64, shuffle = True)
test_loader = DataLoader ( mnist_test , batch_size = 64, shuffle= True)

# here we can also plot the MNIST image
plt.imshow(mnist[0][0].numpy().reshape(28,28), cmap = 'gray') 

The following class is the most important code block of this section. Here, we define an autoencoder that uses fully connected (dense) layers. It is a very simple network, but still, for the MNIST dataset, we can get insightful results. Observe the symmetry between the encoder and decoder parts of the network. Note that all layers are linear and that a ReLU activation function follows each of them.

class AE(nn.Module):

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
        )

        self.decoder = nn.Sequential(
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
            nn.Linear(128, 28*28),
            nn.ReLU()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
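Before training, it can be useful to sanity-check the model with a dummy batch. The short snippet below (the throwaway name ae_check is ours) just confirms that the encoder compresses the 784 inputs down to 32 latent features and that the decoder maps them back.

ae_check = AE()

dummy = torch.randn(64, 28*28)       # a fake mini-batch of flattened images
z = ae_check.encoder(dummy)          # latent representation
x_hat = ae_check(dummy)              # full reconstruction

print(z.shape)       # torch.Size([64, 32])
print(x_hat.shape)   # torch.Size([64, 784])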

As with any neural network, we need to set up the training process. Here, we define the necessary parameters, and the most important step is to set the loss function to the Mean Squared Error (MSE) loss.

model = AE()
lr = 0.001
weight_decay = 1e-5 
loss = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr = lr, weight_decay = weight_decay)
epochs = 20

Next, we will proceed with the training step. The framework for this is familiar to us. We will feed data from the train_loader object in mini-batches and obtain the reconstructed images using the model() call. Here, note that we need to set the input dimensions carefully. Namely, as this is a fully connected neural network, it accepts a vectorized/flattened image. As such, we need to pass the tensor mini-batch in a suitable shape, and for that we will use the view() function. The first argument is set to -1, which will be inferred as the batch size (64 in our case). The remaining shape of \(1\times28\times28 \) (\(channel\times height\times width \)) will be flattened into a single vector of size 784.
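To make the reshaping step concrete, here is a tiny check of what view(-1, 28*28) does to one mini-batch (assuming the train_loader defined above).

batch_images, batch_labels = next(iter(train_loader))

print(batch_images.shape)                     # torch.Size([64, 1, 28, 28])
print(batch_images.view(-1, 28*28).shape)     # torch.Size([64, 784])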

all_loss = []
for i in range(epochs):
    for idx, batch in enumerate(train_loader):

        output = model(batch[0].view(-1, 28*28))
        loss_value = loss(output, batch[0].view(-1, 28*28))

        optimizer.zero_grad()
        loss_value.backward()
        optimizer.step()

        # store the loss as a plain Python number (not a tensor with its graph)
        all_loss.append(loss_value.item())

    print(f"Epoch {i}, loss: {all_loss[-1]:.4f}")

Another important thing that we need to do is to set the gradients to zero before each update, as always. Then, we call the .backward() function and, subsequently, optimizer.step(). We also store the loss of every mini-batch as a plain Python number (using .item()), and after every epoch we print the current loss.
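Since all mini-batch losses were collected in all_loss, a quick plot of the training curve can confirm that the optimization converges. This is just an optional sketch.

plt.figure(figsize=(8, 4))
plt.plot(all_loss)
plt.xlabel('Mini-batch iteration')
plt.ylabel('MSE reconstruction loss')
plt.title('Training loss of the fully connected autoencoder')
plt.show()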

When our model has finished with the training we can evaluate the results. Here, we will use one (or more) input images from the test set. We will forward pass it through the autoencoder network and we will plot the output.

# take one mini-batch from the test set
batch = next(iter(test_loader))

output_test = model(batch[0].view(-1, 28*28))
output_test = output_test.view(-1, 28, 28)

# original test image and its reconstruction
plt.imshow(batch[0][0].view(28, 28).numpy(), cmap='gray')
plt.show()
plt.imshow(output_test[0].detach().numpy(), cmap='gray')
plt.show()

The first image is the original image from the test dataset. The second one is the reconstructed image and represents the output from the autoencoder. Obviously, this is a very simple autoencoder, but the results are satisfying.

At the beginning, we mentioned that there is a similarity between PCA and the autoencoder approach. Both methods compute a reduced set of features that is then used for the reconstruction. Here, we will fit PCA on the training set. For this, we will use the sklearn library and convert our tensors back to NumPy arrays. Finally, we select 10 test images and show the reconstruction results for both PCA and the autoencoder.

X_train = []
for i in range(len(mnist)):
  X_train.append(mnist[i][0].view(784).numpy())
X_train = np.array(X_train)
X_test = []
for i in range(len(mnist_test)):
  X_test.append(mnist_test[i][0].view(784).numpy())
X_test = np.array(X_test)

Here, we have created the NumPy arrays X_train, used to fit PCA, and X_test, used to compare the two methods. Some commented lines can be helpful if you want to debug the code on your own and gain further insight.

X_test_torch = torch.tensor(X_test[:100,:])
output_AE = model(X_test_torch)
#plt.imshow(output_AE[0].detach().numpy().reshape((28,28)), cmap = 'gray')

X_rec_autoencoder = output_AE.detach().numpy()

# Here, we convert back to a square shape, so that we can plot the outputs
X_rec_autoencoder = X_rec_autoencoder.reshape(-1, 28,28)
#print(X_rec_autoencoder.shape)

For PCA, we set the number of components to 32. Then, we fit the model on X_train, transform the test images, and perform the back reconstruction (X_rec_pca).

from sklearn.decomposition import PCA

n_components = 32
# Train the PCA model on the training set
pca = PCA(n_components=n_components)
pca.fit(X_train)

# project the test images onto the principal components and reconstruct them
X_test_PCA = pca.transform(X_test)
X_rec_pca = pca.inverse_transform(X_test_PCA)

X_rec_pca = np.reshape(X_rec_pca[:10], (10,28,28))
X_test = np.reshape(X_test[:10], (10,28,28))

Finally, we use the following code to complete the plot.

# Function for plotting the data and results
def plotter(data, title):
    fig = plt.figure(figsize=(12, 6))
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
    for i in range(10):
        ax = fig.add_subplot(1, 10, i + 1, xticks=[], yticks=[])
        ax.imshow(data[i], 
                cmap=plt.cm.binary, interpolation='nearest')
    ax.yaxis.set_label_position("right")
    ax.set_ylabel(title, fontsize='medium')

# Now let's see how the predictions look
# Along with the difference from the original
plotter(X_test,'X_test')
plotter(X_rec_autoencoder,'Autoencoder')
plotter(X_rec_pca,'PCA')

Here is the final output. We can see that the autoencoder indeed does a better job than a linear PCA model. This is something that can be expected. In essence, PCA can model only linearly dependent data. On the other hand, the neural networks used within the autoencoder can capture nonlinear relationships as well.

Some additional comments. It is actually very interesting to observe how we can reconstruct an image of 784 pixels that was compressed to 32 elements. In both models, this can be seen as a “multiplication with a matrix”. In PCA, these 32 elements are scalars that scale the 32 most informative principal components, which are then summed. In the decoder part, the 32 elements are multiplied with a matrix as well; moreover, we have two (or more) such matrices. In addition, the ReLU functions are there to introduce additional nonlinearities and thus enable a more efficient coding/decoding process. So, if this process was initially fuzzy, just connect it with PCA and things should become more intuitive.
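One way to see this “multiplication with a matrix” analogy in code (assuming the pca object and the trained model from above are still in memory) is sketched below: PCA reconstructs an image with a single linear map built from its 32 components, while the decoder stacks several learned matrices with ReLUs in between.

# PCA reconstruction of one test image: a single linear map back to 784 pixels
z_pca = pca.transform(X_test[:1].reshape(1, 784))    # 32 coefficients
x_rec_pca = z_pca @ pca.components_ + pca.mean_      # shape (1, 784)

# the decoder's weight matrices play an analogous role, but there are several
# of them, interleaved with ReLU nonlinearities
for layer in model.decoder:
    if isinstance(layer, nn.Linear):
        print(layer.weight.shape)   # (64, 32), (128, 64), (784, 128)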

4. Autoencoder with Convolutional layers implemented in PyTorch

Now, we will repeat the very same experiment. The only difference is that we will use a neural network that also contains convolutional layers.

Now comes the important question: how are we going to create the output image? That is, how do we use a decoder structure to go from the very tiny, compact image representation (the encoder output, i.e. the bottleneck layer) back to the original image resolution? For this, we of course need to start increasing the size of the image (feature maps). We have two options.

  1. One option is to use regular convolutional filters combined with upsampling by a factor of 2. This upsampling is analogous to the max pooling with which we compress the data in the encoder part.
  2. The other option is to use the so-called transposed convolution (sometimes also referred to as a deconvolution or fractionally-strided convolution). It is, for instance, used in semantic segmentation networks such as U-Net. A short comparison of the two options is sketched right after this list.
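Here is a minimal sketch of the two upsampling options on a dummy feature map (the layer parameters are just illustrative and not the ones used in the autoencoder below).

feature_map = torch.randn(1, 8, 7, 7)   # a dummy encoder output: 8 channels, 7x7

# Option 1: fixed upsampling by a factor of 2, followed by a regular convolution
upsample_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),   # 8 x 14 x 14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),    # 16 x 14 x 14
)

# Option 2: a learnable transposed convolution that doubles the spatial size
transposed_conv = nn.ConvTranspose2d(8, 16, kernel_size=2, stride=2)

print(upsample_conv(feature_map).shape)    # torch.Size([1, 16, 14, 14])
print(transposed_conv(feature_map).shape)  # torch.Size([1, 16, 14, 14])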

Transposed Convolution

The main goal of transposed convolutions is to increase the size of a feature map, from a smaller size to a bigger one.

(Image: an illustration of the transposed convolution, from the Data Science Stack Exchange post “What are deconvolutional layers?”)

Looking at the example above, we want to go from a \(3\times3 \) input up to a \(5\times5 \) output feature map. The image that we want to upsample is represented by the blue squares at the bottom. The process of a transposed convolution consists of the following steps:

  • We zero-pad the original \(3\times3 \) image, placing the padding around the pixels. The padding is represented by the white squares. This way, we go from the original image to a zero-padded \(7\times7 \) image.
  • Then, we will have a \(3\times3 \) convolution filter (kernel) that we will use to produce our output. Note that this filter is shown as a grey square (\(3\times3 \)) that is moving across the image.
  • By moving the kernel, we can see that at each particular position it generates one output element (pixel) of the \(5\times5 \) output feature map on the top. A small PyTorch shape check of this example is given right after this list.
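The example from the figure can be reproduced with a single PyTorch layer: a \(3\times3 \) input, a \(3\times3 \) kernel, stride 1 and no padding yield a \(5\times5 \) output (this is just a shape check with random weights).

upconv = nn.ConvTranspose2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=0)

small_map = torch.randn(1, 1, 3, 3)   # a batch of one single-channel 3x3 "image"
print(upconv(small_map).shape)        # torch.Size([1, 1, 5, 5])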

So, this is what such an autoencoder may look like. In the encoder part, we have convolutional and pooling layers. Note that the input and output channel numbers are carefully selected. In contrast, the decoder part uses these values in the opposite direction. We start from the encoder output and upsample the feature map (image) using transposed convolutions. In addition, the ReLU function is used throughout the network as the main activation function. In the final layer, we use a tanh activation function, so that the output is bounded and can be compared with the input (for the loss calculation).

class AE_conv(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=3, padding=1),   # b, 16, 10, 10
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=2),                   # b, 16, 5, 5
            nn.Conv2d(16, 8, 3, stride=2, padding=1),    # b, 8, 3, 3
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=1)                    # b, 8, 2, 2
        )

        self.decoder = nn.Sequential(
            # note that the channel numbers mirror the encoder, in reverse order
            nn.ConvTranspose2d(8, 16, 3, stride=2),              # b, 16, 5, 5
            nn.ReLU(True),
            nn.ConvTranspose2d(16, 8, 5, stride=3, padding=1),   # b, 8, 15, 15
            nn.ReLU(True),
            nn.ConvTranspose2d(8, 1, 2, stride=2, padding=1),    # b, 1, 28, 28
            nn.Tanh()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

Slight adjustments of the previous code are needed so that, instead of simple fully connected layers, we use a model that consists of convolutional layers. The first thing to note is the input: now we use the image in its original size of \(28\times28 \), whereas in the previous experiment we treated the input as a vectorized image of 784 elements. Since these changes are relatively minor, try to make them on your own as an exercise, have a look at our GitHub repo, or check the short sketch below. Finally, we show the results obtained with the convolutional autoencoder.
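Here is a minimal, hedged sketch of how the training loop from the previous section could be adapted (reusing the same train_loader, loss and optimizer setup; the name model_conv is ours, and the full version is in the repository).

model_conv = AE_conv()
loss = nn.MSELoss()
optimizer = optim.Adam(model_conv.parameters(), lr=0.001, weight_decay=1e-5)

for epoch in range(20):
    for images, _ in train_loader:
        # no .view() here: the convolutional encoder takes (batch, 1, 28, 28) directly
        output = model_conv(images)
        loss_value = loss(output, images)

        optimizer.zero_grad()
        loss_value.backward()
        optimizer.step()

    print(f"Epoch {epoch}, loss: {loss_value.item():.4f}")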

Not impressed? Yes, we agree. However, in this case we have used only two convolutional layers in the encoder. In practice, when we work with images, convolutional layers should provide superior results compared to a fully connected network. Try to add a few more layers on your own and find a more elegant solution.

In addition, autoencoders cannot simply be taken “off-the-shelf” when you start working on a new custom dataset. There is some engineering work that we need to do. The most important detail is how we select the dimensionality of the latent space. This is the so-called hidden bottleneck layer, and its size represents a trade-off between feature compactness (compression) and the accuracy of the reconstruction. The smaller the size, the higher the reconstruction error will be, and vice versa. One approach would be to use the well-known Akaike Information Criterion (AIC) and plot this value versus the number of features. For more information, see the Data Science Handbook, GMM example.
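As a rough illustration of this trade-off (this experiment is not part of the original post), one could parameterize the bottleneck size of the fully connected autoencoder and compare the reconstruction error for a few values. The class and helper names below are hypothetical and only meant as a starting point.

class AE_bottleneck(nn.Module):
    """Same idea as the AE class above, but with a configurable latent size."""
    def __init__(self, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 128), nn.ReLU(),
            nn.Linear(128, latent_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 28*28), nn.ReLU(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


def reconstruction_error(latent_dim, epochs=5):
    """Train an autoencoder with the given bottleneck size and return the
    average reconstruction MSE on the test set."""
    ae = AE_bottleneck(latent_dim)
    opt = optim.Adam(ae.parameters(), lr=1e-3)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for images, _ in train_loader:
            x = images.view(-1, 28*28)
            opt.zero_grad()
            criterion(ae(x), x).backward()
            opt.step()
    with torch.no_grad():
        errors = [criterion(ae(images.view(-1, 28*28)), images.view(-1, 28*28)).item()
                  for images, _ in test_loader]
    return sum(errors) / len(errors)


# the reconstruction error should decrease as the bottleneck grows
for latent_dim in [2, 8, 32, 128]:
    print(latent_dim, reconstruction_error(latent_dim))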

Summary

To sum things up, autoencoders use a bottleneck hidden layer that forces the network to learn a compressed latent representation of the data. By using the reconstruction loss, we can train the network in a completely unsupervised manner. The name autoencoder comes from the fact that we are automatically encoding the information within the data into a smaller latent space. Here, we showed how you can implement a simple autoencoder structure with or without convolutional layers. There is one drawback: we cannot generate new images. Therefore, we will continue our generative model journey with variational autoencoders in the following post.