# dh 008#: Generative Adversarial Networks: The Foundation of Modern Image Generation

*Understanding the game-theoretic approach that revolutionized artificial intelligence and computer vision*

## Introduction

Starting with a broad view, there are in general two types of generative models: Generative Adversarial Networks and Diffusion Models. In recent years, diffusion models have largely replaced generative adversarial networks as the preferred technology for generating synthetic images, in large part because GANs are notoriously difficult to train, while diffusion models are much more straightforward to train and can match or exceed GANs.

Two types of generative models

Diffusion models are one of the hottest topics in the popular press conversation about AI, with lots of discussion about whether they will replace artists, their ability to generate realistic fake news and spam, and so on. But in this lecture, I’d like to focus on the applications of most obvious relevance to economic research, which are very niche applications in this broader space.

## Supervised vs Unsupervised Learning Foundation

There are supervised models, where we provide data (x) and labels (y) and estimate a function that maps the data to the labels, y = f(x). And so we’ve talked about things like classification, localization, segmentation, and object detection.

Supervised learning overview

On the other hand, we’ve also seen unsupervised methods in the course, where we only provide data with no labels. There is no ground truth; rather, what we’re trying to do is learn some underlying structure that’s present in the data. And so we’ve seen dimensionality reduction (e.g., t-SNE, UMAP), clustering, density estimation, and autoencoders.

## The Challenge of Generative Modeling

Traditionally, generative models are a type of unsupervised model that generates new samples from the same distribution as the training data. In other words, given the distribution of the training data, we need to learn a model distribution that matches it.

Generative model objective

I think as economists, if we were thinking about how to go about this, we might think about explicit density estimation: how can we estimate that density? That’s not something we’re going to discuss here.

Instead, I’ll start by talking about implicit density estimation, which aims to learn a way to sample from the data generating process without explicitly defining it, because it’s going to be a very, very complicated object.

Implicit vs explicit density estimation

## The GAN Revolution

Recall that our motivation is to be able to sample from the training data distribution, which is a complex and high-dimensional object. One approach would be to try to estimate or approximate this density, and I think that’s where this literature started. It’s intuitive that it would start there, but this approach is slow and really, really hard.

And so GANs use a different approach that turns out to be much more tractable. They sample from a simple distribution that we know how to sample from, unlike the complex distribution that real-world images are drawn from. So they sample random noise from, say, a Gaussian or uniform distribution.

They then learn a model that can transform a sample from the simple distribution into a sample from the training distribution. That’s going to be a really complex transformation. And what do we use to approximate complex functions? If you get one thing out of this course: we use a neural network.

GAN architecture overview
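To make the sampling idea concrete, here is a minimal sketch in PyTorch; the layer sizes, latent dimension, and output shape are illustrative assumptions, not from any particular paper. We draw z from a Gaussian, which we know how to sample from, and a neural network transforms it into something image-shaped:

```python
import torch
import torch.nn as nn

# Illustrative generator: maps a simple latent distribution to image space.
latent_dim = 100

generator = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Linear(512, 28 * 28),  # e.g., a flattened 28x28 grayscale image
    nn.Tanh(),                # pixel values in [-1, 1]
)

# Sample from a distribution we know how to sample from...
z = torch.randn(16, latent_dim)  # 16 draws from a standard Gaussian
# ...and transform each draw into a sample that (after training)
# should look like it came from the training distribution.
fake_images = generator(z).view(16, 1, 28, 28)
```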

## The Game-Theoretic Breakthrough

The big breakthrough in this literature came from the 2014 paper by Goodfellow et al., “Generative Adversarial Nets.” Their insight is that we can use a game-theoretic approach between two adversarial networks to learn this function.

Goodfellow et al. paper reference

And so we have a generator network that generates images from the initial random noise that look like the real images, and then we have a discriminator network that tries to distinguish between the real images and the fake images. The discriminator is just a classification network that answers: is this image real or fake?

## The Training Process

When the discriminator successfully identifies real and fake images, no changes are needed to its model parameters, but the generator is penalized with large updates to its parameters, because it’s not doing a good job of fooling the discriminator. Conversely, when the generator does fool the discriminator, no changes are needed to its parameters, but the discriminator’s parameters need to be updated.

Training dynamics

In the limit, the generator would produce perfect replicas, and the discriminator would always predict a class score of 0.5 for real and 0.5 for fake.

## Mathematical Framework

And so we can think of GANs as a two-player minimax game, where the generator has parameters $\theta_g$ and the discriminator has parameters $\theta_d$:

$$\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_{\theta_d}(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big]$$

Here $D$ is the likelihood that an image is real: $D_{\theta_d}(x)$ is the discriminator output for real data $x$, and $D_{\theta_d}(G_{\theta_g}(z))$ is the discriminator output for fake data $G_{\theta_g}(z)$. To maximize the objective, the discriminator wants $D_{\theta_d}(x)$ close to 1 and $D_{\theta_d}(G_{\theta_g}(z))$ close to 0.

GAN mathematical formulation

To minimize the objective, the generator wants $D_{\theta_d}(G_{\theta_g}(z))$ to be close to 1, fooling the discriminator. And so we alternate between gradient ascent on the discriminator and gradient descent on the generator.

## Training Challenges and Solutions

In practice, the proposed approach of alternating between gradient ascent on the discriminator,

$$\max_{\theta_d} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D_{\theta_d}(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big],$$

and gradient descent on the generator,

$$\min_{\theta_g} \; \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D_{\theta_d}(G_{\theta_g}(z))\big)\big],$$

does not work well, because the gradient of $\log(1 - D_{\theta_d}(G_{\theta_g}(z)))$ is flat when we have bad samples and only becomes steep when the generator is already doing a pretty good job.

Training difficulties

So instead, we maximize the likelihood of the discriminator being wrong, doing gradient ascent on $\mathbb{E}_{z \sim p(z)}[\log D_{\theta_d}(G_{\theta_g}(z))]$, which has strong gradients exactly where the original generator objective is flat. This is the procedure for training GANs that comes from the seminal Goodfellow et al. (2014) paper.

Training procedure
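As a rough sketch of what this alternating procedure looks like in code, here is a minimal PyTorch loop. It assumes `generator` and `discriminator` networks like the one sketched above (with the discriminator ending in a sigmoid), plus a `dataloader` of real images; the learning rates are placeholders:

```python
import torch
import torch.nn.functional as F

opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

for real_images in dataloader:
    batch_size = real_images.size(0)
    z = torch.randn(batch_size, latent_dim)

    # Discriminator step (gradient ascent on the original objective):
    # push D(x) toward 1 and D(G(z)) toward 0.
    d_real = discriminator(real_images)
    d_fake = discriminator(generator(z).detach())  # don't update G here
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: instead of minimizing log(1 - D(G(z))),
    # maximize log D(G(z)) -- "the likelihood of the discriminator
    # being wrong" -- which gives useful gradients even when the
    # generated samples are still bad.
    d_fake = discriminator(generator(z))
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```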

## Architecture Improvements

The generator is an upsampling network with fractionally strided convolutions, and the discriminator is a convolutional network (a real/fake classifier).

Network architectures

There was a further breakthrough in 2016 by Alec Radford and co-authors (the DCGAN paper). They gave some practical tips that make a pretty big difference to the performance of GANs.

DCGAN improvements

They replaced pooling layers with strided convolutions in the discriminator and fractionally strided convolutions in the generator. They used batch norm in both the generator and the discriminator. They removed fully connected hidden layers for deeper architectures. And they used ReLU activations for all layers of the generator except the output, and LeakyReLU in the discriminator for all layers.
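To see what those tips look like in practice, here is a hedged sketch of DCGAN-style building blocks in PyTorch (kernel sizes and channel counts are illustrative): fractionally strided convolutions upsample in the generator, strided convolutions replace pooling in the discriminator, batch norm appears in both, and the activations follow the ReLU/LeakyReLU recipe:

```python
import torch.nn as nn

# Generator block: a fractionally strided (transposed) convolution
# doubles spatial resolution; batch norm + ReLU per the DCGAN tips.
def gen_block(in_ch, out_ch):
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )

# Discriminator block: a strided convolution replaces pooling;
# batch norm + LeakyReLU.
def disc_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2),
    )
```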

## Applications in Economic Research

While these are really interesting issues, I will focus in this lecture on the applications of most obvious relevance to economic research, which are very niche in this broader space. In fact, a lot of the relevant literature is still in the GAN space, with the diffusion literature not really having as much to say yet about these particular applications.

Economic research applications

And so even though some people would say that GANs are Stone Age technology, meaning we’ve had something better for a year and a half or so, I’m going to start with the GAN literature, because for the things we want to do, the best approach might still come from GANs at this point in time.

Some applications to have in the back of your mind: cleaning up document backgrounds, and data augmentation for layout recognition and OCR. You can see why this wouldn’t be a major priority for the diffusion literature: who, besides people like us who would like to custom-train layout detection or OCR for historical research, has a legitimate reason to want to generate fake documents?

Document processing applications

But if we can generate very realistic synthetic historical texts and synthetic document layouts, that changes the game as far as OCR goes, because then we can have a bunch of labels for free. And so I think this is a potentially relevant application of generative models.

## Evolution of GAN Quality

So this is an example of the generator architecture that comes from the Radford et al. paper. And this matters in practice. If you look at the original Goodfellow et al. paper and what it generates, it’s not bad at generating MNIST digits. But the faces really don’t look terribly realistic. You’re not going to take these generated images, put them on a fake news website, and fool people, right?

Original GAN results

So it’s a proof of concept that you can do this, but it’s not particularly realistic. If you go through and make the changes I talked about, which are all fairly small tinkerings with the architecture, they make a pretty big difference in terms of how realistic your generated images look.

And so this is just showing the succession of GANs as people take the same original idea but then make these modifications to the architecture to make them more stable to train. You get more and more realistic generated images out of them.

GAN evolution timeline

## Practical Application: Text Bleed Removal

I want to talk about one application that we did that was pretty straightforward, which is to remove text bleed. Basically, we have images that look like the top one, and you can see that there’s quite a lot of bleed-through from the opposite side of the page.

Text bleed problem

We were concerned that this affects OCR and layout detection, especially if you want to mostly do those things off the shelf. And so we trained a GAN model to turn the top image into the bottom image. And we did this with entirely unpaired data, because we don’t have clean and dirty versions of the same image.

But what we did have is a publication that looked like the top one, and a publication from a different year, with a cleaner scan, that looked like the bottom one.

## CycleGAN: Unpaired Image Translation

In order to train the model that turns the top image into the bottom image, we used an architecture called CycleGAN. It’s a model that can take an image from one domain and generate a synthetic version of the image with a specific modification to it. The most famous example in this literature, which everybody uses, is turning horses into zebras or zebras into horses, which you see here.

CycleGAN examples

But you also see other examples, like turning a Monet painting into a photograph, or a photograph into a Monet painting. It can take a photograph and turn it into paintings in the style of different artists. It can turn summer into winter, winter into summer, et cetera.

All right. And so the real innovation of CycleGAN is that you can do all of this on unpaired data. Traditional image-to-image translation required paired data, which is very costly to collect, and in some cases would be impossible to collect.

Unpaired data advantage

I have these dirty scans that I want to clean up, and if I had a clean version of them in the first place, I wouldn’t need the model. The whole reason for needing this model is that I could use traditional computer vision thresholding methods to try to get rid of that background noise, but they work pretty poorly.
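For reference, the kind of classical baseline I mean is something like adaptive thresholding in OpenCV. A minimal sketch (the filenames and threshold parameters are hypothetical), which in practice tends to handle heavy bleed-through poorly:

```python
import cv2

# Load a grayscale scan (hypothetical path).
img = cv2.imread("dirty_scan.png", cv2.IMREAD_GRAYSCALE)

# Classical cleanup attempt: adaptive thresholding binarizes the page
# against a local mean, which tends to leave (or amplify) bleed-through
# whose intensity is close to that of the true foreground text.
clean = cv2.adaptiveThreshold(
    img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY,
    31, 15,  # neighborhood size and offset, tuned by hand
)
cv2.imwrite("thresholded.png", clean)
```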

So I just don’t have any paired data, and CycleGAN does not require images in the source and target domains to be paired. Here you’d have a bunch of photographs on the one hand, and paintings in different styles on the other.

CycleGAN architecture

## CycleGAN Architecture and Training

CycleGAN simultaneously trains two generator models and two discriminator models. One generator takes images from the first domain as inputs and outputs images from the second domain, and the other generator takes images from the second domain as inputs and generates images from the first domain.

Dual generators

And so if you put an image output by the first generator into the second generator, it should return the original image, which is called cycle consistency. If you convert a horse into a zebra and then convert it back into a horse, the reconverted image should still look similar to the original.

A loss term measures this discrepancy going in both directions. It’s analogous to a method used for text translation between languages; you can think of it as a translator for images.

Cycle consistency

Going from x to y and back to x, we impose a cycle consistency loss there, and in the same way, when it goes from y to x and back to y, there’s also a cycle consistency loss imposed.
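A minimal sketch of the cycle consistency term, assuming two generator networks `G_xy` (first domain to second) and `G_yx` (second domain to first); the L1 formulation follows the CycleGAN paper:

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_xy, G_yx, real_x, real_y):
    # x -> y -> back to x should reconstruct the original x...
    recon_x = G_yx(G_xy(real_x))
    # ...and y -> x -> back to y should reconstruct the original y.
    recon_y = G_xy(G_yx(real_y))
    return F.l1_loss(recon_x, real_x) + F.l1_loss(recon_y, real_y)
```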

## Technical Implementation Details

The generators are deep convolutional networks implemented using multiple residual blocks, and the discriminators are PatchGANs, which try to classify whether patches of the image are real or fake; the discriminator is run convolutionally across the image, and all responses are averaged.

Technical architecture
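A hedged sketch of a PatchGAN-style discriminator (channel counts and kernel sizes are illustrative): because it is fully convolutional, its output is a grid of real/fake scores, one per patch, which are then averaged:

```python
import torch.nn as nn

# PatchGAN-style discriminator: fully convolutional, so the output is a
# grid of scores, each with a receptive field covering one patch of the
# input image rather than a single score for the whole image.
patch_discriminator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, kernel_size=4, stride=1, padding=1),  # per-patch score
)

# scores = patch_discriminator(images)  # shape (N, 1, H', W')
# The loss averages over all patch responses, e.g. via scores.mean().
```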

The paper uses Adam and a low learning rate for a hundred epochs, and then an additional hundred epochs with learning rate decay, with a batch size of one.
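That schedule might look roughly like this. The 2e-4 learning rate is the one reported in the CycleGAN paper; the linear decay implementation, the Adam betas, and the `model`/`dataloader` names are illustrative assumptions:

```python
import torch

opt = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Constant learning rate for the first 100 epochs, then linear decay
# toward zero over the next 100 epochs.
def lr_lambda(epoch):
    return 1.0 if epoch < 100 else 1.0 - (epoch - 100) / 100.0

scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_lambda)

for epoch in range(200):
    for x in dataloader:  # batch size 1
        ...               # forward pass, losses, opt.step()
    scheduler.step()
```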

These are examples from the paper. As I said, this literature really loves turning horses into zebras, but you also see turning oranges into apples and whatnot.

CycleGAN results

## Style Transfer Applications

CycleGANs can be used for style transfer, where you take something in one style and turn it into another style. Here it is turning paintings into photos and doing some photo editing.

Style transfer examples

## Limitations and Failure Cases

But they do also give examples of failures. The model can fail due to distributional features of the training data: if the model is only trained on horses and zebras in the wild, this is what it generates for the image here, which is clearly ridiculous.

CycleGAN failures

Also, CycleGAN often succeeds with color and texture changes, but it tends to fail when geometric transformations are required. In the document context, it has a hard time converting one font into another font, but it works great for getting rid of background noise, because that’s ultimately about changing the color and the texture.

Geometric vs texture limitations

## Font Style Transfer with ReStyle

Okay, so this is another example, where we used fully unsupervised generation to take a modern font on the left and generate the output on the right, which is meant to mimic the target shown in the middle, but without using paired data.

Font style transfer

It’s totally unsupervised, and we do that with a model called ReStyle, which is a much more recent contribution to the style transfer literature. In the interest of time, I don’t want to talk about this in depth. This is just something we tried and haven’t really pursued beyond this example, but it seems to work reasonably well.

And we think it would be interesting in the context of, say, FOCR. Right now we’re just using digital fonts for the pre-training, but you could potentially use these unsupervised methods, with crops from a whole bunch of different documents, to generate more realistic data for the self-supervised pre-training. That would make it work better purely off the shelf, without needing any target data at all.

## Handwriting Generation: State of the Art

But I want to talk about actual published work on handwriting generation. In particular, I’m going to talk about a 2021 model called Handwriting Transformers, which I think is still essentially the state of the art.

There was one diffusion model that uses diffusion for handwriting generation, but its results were not very impressive, and it used an older version of diffusion. So to the best of our knowledge, this is still the state of the art, but maybe somebody will email me tomorrow and tell me something better was just posted on arXiv.

And that’s the nature of this literature, which is a great thing. Okay, so we have some desired style of handwriting that we would like to mimic, which is shown at the top, and then we have query text. And we want to write this query text in the desired style.

And why do we want to do this? Again, this could be incredibly useful for generating data to train an OCR model. And I should say this model is for English; like most of this literature, it’s about generating English handwriting, which is a big problem if your interest is in, say, ancient Chinese. But that’s the way it is.

## Handwriting Transformer Architecture

This is showing the Handwriting Transformer at the top, the GANwriting model in the middle, and another approach at the bottom, and they’re arguing that theirs is the best. We’ll see what this architecture does in a minute, but in short, it uses a transformer within it.

Handwriting comparison

That allows it to capture more of the long-range dependencies. If you look at the word “also,” underlined in green, you can see that in the GANwriting output it’s not connecting all the components; having the ligatures that connect handwriting is a matter of long-range dependencies.

Long-range dependencies

You have dependencies between the characters because they’re connected. The paper also points to other, more problematic issues with the other approaches and argues that their approach provides the most realistic mimicking of this particular style.

And we see that by looking at words that appear in both, like “throughout” and “also,” and at the “ies” at the end, where the Handwriting Transformer seems to mimic the style most realistically.

Style comparison results

## Complex Multi-Component Architecture

This is a pretty complicated model, which is in the spirit of the GAN literature; I don’t know how much of a pain it was to train. They have a large generator network that contains a transformer encoder and a transformer decoder, and this is what actually generates the handwriting.

Complex architecture overview

These learn the long- and short-range dependencies and thus can encode both global and local style patterns, and the decoder generates the query text in a specific style. Again, you see here, with the word “also,” how it seems able to really capture these long-range dependencies, in terms of having the ligatures between different characters in the handwriting.

Long-range dependency capture

This is not just a transformer model, though. To make it more trainable (and again, GANs are difficult to train), they use a hybrid architecture that includes a CNN backbone to obtain sequences of convolutional features from the style images. The training algorithm closely follows the GAN literature, and there are four parts of this architecture that are important to its performance.

Hybrid architecture details

## Four-Component Training System

So you have a discriminator network. It’s convolutional, and it’s trained on an adversarial loss to promote realistic-looking images, along the lines of what we just saw with GANs. Then you have a recognizer network, and there’s a recognition loss.

Four-component system

It’s trained to make sure generated text images contain real text rather than just hallucinations that somehow look like writing, and it’s optimized with real labeled handwritten samples. You can imagine that without this incentive, the model might hallucinate things; as we’ve seen with many other generative models, like GPT, a model can hallucinate something that looks realistic when there really are no such characters, and that’s not what we want.

And there’s a style classifier, to be able to produce a given style of handwriting, trained on a style loss, which is just a classification loss over different styles of handwriting.

And then there’s the generator, which is the transformer encoder-decoder. It’s trained on a cycle loss that ensures the encoded style features have cycle consistency, such that the original style feature sequence can be reconstructed from the generated image.

The total loss adds up all these different components: schematically, $\mathcal{L} = \mathcal{L}_{\text{adv}} + \mathcal{L}_{\text{recog}} + \mathcal{L}_{\text{style}} + \mathcal{L}_{\text{cycle}}$. And this is the model architecture: you’re given a style example.

We want to generate handwriting in this style, so you encode the style example with the CNN and pass that into the transformer encoder. Then you have the query words that you want to write, and you pass those into the decoder, along with the encoding of the style.

Complete model architecture

That is then passed further into a CNN decoder, and the whole thing is trained on these different losses.
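Putting the four pieces together, here is a heavily simplified sketch of the training objective. All module names (`generator`, `discriminator`, `recognizer`, `style_classifier`, `style_encoder`) and the concrete loss functions are placeholders standing in for the paper’s components, not its actual API, and the per-term weights are omitted:

```python
import torch
import torch.nn.functional as F

def total_hwt_loss(generator, discriminator, recognizer, style_classifier,
                   style_encoder, style_images, style_features, style_id,
                   query_text):
    """Schematic combination of the four loss components."""
    # Generate the query text in the target style.
    fake = generator(style_images, query_text)

    # 1. Adversarial loss: generated images should look like real handwriting.
    d_fake = discriminator(fake)
    loss_adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))

    # 2. Recognition loss: a recognizer should read the query text back out,
    #    so the generator cannot produce realistic-looking non-text.
    loss_recog = F.cross_entropy(recognizer(fake), query_text)

    # 3. Style loss: a classifier over writer identities should recover
    #    the target style.
    loss_style = F.cross_entropy(style_classifier(fake), style_id)

    # 4. Cycle loss: the original style feature sequence should be
    #    reconstructable from the generated image.
    loss_cycle = F.l1_loss(style_encoder(fake), style_features)

    return loss_adv + loss_recog + loss_style + loss_cycle
```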

*This comprehensive exploration of Generative Adversarial Networks demonstrates their evolution from basic proof-of-concept models to sophisticated systems capable of complex image-to-image translation and handwriting generation. The applications in document processing and historical research showcase the practical value of these techniques for economic research, despite the challenges in training and the emergence of newer diffusion-based approaches.*