Generative Adversarial Networks (GAN)

In machine learning, there are two primary categories of models: generative models and discriminative models. A discriminative model tries to assign a given input to one of a fixed set of output classes. A generative model, by contrast, has no fixed set of output classes to categorize the data into; as its name suggests, it tries to generate data that fits the distribution exhibited by the input data. Mathematically, we can state this as follows: for an input \( X \) and a class \( A \), a discriminative model estimates the conditional probability \( P(A|X) \), whereas a generative model estimates the joint probability \( P(X,A) \).

It is generally known that neural networks struggle to model \( P(X,A) \) directly and have most widely been used as discriminative models. The models of choice for generative tasks have historically been Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs), as can be seen from their dominance in generative applications such as speech synthesis and audio generation. But with recent developments in neural networks as generative models (notably Google's WaveNet), neural approaches such as GANs have risen to prominence in the ML domain.

What are GANs?

GANs are a combination of a generative model and a discriminative model working together to outsmart each other. Let us look at each of these components in turn.

Problem Statement: Building a neural network that can generate an image as close as possible to a real world image.


The goal of the generator is to predict 784 numbers (a 28x28 MNIST image) that look almost like a real image of handwritten digits.

The real images follow some distribution (say, distribution Y) in image space. The generator takes a random point from a Gaussian distribution and tries to transform it in such a way that it fits into the distribution followed by the real images. Let us break that down. A Gaussian distribution is simply a distribution of random numbers, so picking a point from it is analogous to picking a vector with random values. Let us assume the input vector space has dimension 100. The input to the generative model is then a vector of 100 random values, and the goal of the generator is to turn this vector into a realistic image.
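Concretely, "picking a point from the Gaussian distribution" is just drawing a 100-dimensional vector of random values. A minimal sketch in NumPy (the batch size of 16 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# One latent point = one 100-dimensional vector drawn from a standard Gaussian.
# A batch of 16 such points is simply a (16, 100) matrix of random values.
latent_dim = 100
z = rng.normal(loc=0.0, scale=1.0, size=(16, latent_dim))
```

Each row of `z` is one input to the generator, which must map it to a (1, 28, 28) image.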

Here the problem becomes apparent: there is no reference image to compare the generated image against. We need a model that can judge whether a generated image is as close to a real-world image as possible. This is where the discriminator model comes into the picture.


The goal of the discriminator is to differentiate between the fake images generated by the generator and the real images. The input to the discriminator model is a 28x28 image and the output is a single neuron indicating whether the image is fake or real.

The discriminator is trained using binary cross-entropy as the cost function (as is the norm for classification with neural networks), and, most importantly, this error is backpropagated to the generative model as well.

This process of backpropagating the error to the generative model forces the generator to produce more authentic images. Since the discriminator is trained to differentiate between real and fake images, and this error is propagated to the generator, the generator is pushed to generate images matching the distribution of the real images.
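This alternating dynamic can be illustrated with a deliberately tiny one-dimensional "GAN" in plain NumPy: the generator is an affine map of Gaussian noise, the discriminator a single logistic unit, and the discriminator's error is chained back into the generator's parameters. (The data distribution N(4, 1), learning rate, and step count are illustrative choices, not from the original; the generator step uses the commonly used non-saturating variant of the generator loss.)

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Generator g(z) = a*z + b tries to map N(0, 1) noise onto the "real"
# data distribution N(4, 1); discriminator d(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0          # generator parameters
w, c = 0.1, 0.0          # discriminator parameters
lr, batch = 0.05, 64

for _ in range(2000):
    # --- discriminator step (labels: real = 1, fake = 0) ---
    real = rng.normal(4.0, 1.0, batch)
    fake = a * rng.normal(0.0, 1.0, batch) + b
    p_real = sigmoid(w * real + c)
    p_fake = sigmoid(w * fake + c)
    # for binary cross-entropy, d(cost)/d(logit) = p - y
    w -= lr * (np.mean((p_real - 1.0) * real) + np.mean(p_fake * fake))
    c -= lr * (np.mean(p_real - 1.0) + np.mean(p_fake))

    # --- generator step: the discriminator's error is chained back
    # through d(.) into the generator's parameters ---
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b
    p = sigmoid(w * fake + c)
    grad_fake = (p - 1.0) * w    # gradient of -log d(g(z)) w.r.t. the fake sample
    a -= lr * np.mean(grad_fake * z)
    b -= lr * np.mean(grad_fake)
```

After training, the generator's offset `b` has drifted from 0 toward the real mean of 4: the only signal it ever received was the discriminator's error.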

Now that the concept behind GANs is clear, let us delve deeper into the technical aspects.

Let the generative model be represented by \( G \) and the discriminative model by \( D \). Since both of them are neural networks, both \( G \) and \( D \) are differentiable functions that can be applied to the input. Let \( x \) be the input vector of dimension 100.

$$z = G(x)$$ $$y = D(z)$$ where \( y \) is 1 or 0, indicating whether the input image is real or fake.

Let \( \theta_{G} \) and \( \theta_{D} \) be the model parameters of \( G \) and \( D \) respectively. The cost functions \( J_{G} \) of the generator and \( J_{D} \) of the discriminator depend on both \( \theta_{G} \) and \( \theta_{D} \). But \( G \) does not control \( \theta_{D} \) and \( D \) does not control \( \theta_{G} \). Therefore, we must minimize \( J_{G}\) with respect to \( \theta_{G} \) alone and \( J_{D} \) with respect to \( \theta_{D} \) alone, until \( J_{G} \) and \( J_{D} \) reach an equilibrium.

As we know, $$J_{D}=-\frac{1}{2}\left ( y_{i}log(p_{i}) + (1-y_{i})log(1-p_{i}) \right )$$ $$J_{G}=-J_{D}$$ NOTE: The generator and the discriminator are trained in alternate cycles.
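The two cost functions above translate directly into NumPy; the factor of 1/2 reflects averaging over the real and fake halves of a batch. The label and probability values below are purely illustrative.

```python
import numpy as np

def j_d(y, p):
    """Discriminator cost: binary cross-entropy (halved), as in the formula above."""
    return -0.5 * np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def j_g(y, p):
    """Generator cost in the minimax formulation: simply the negation of J_D."""
    return -j_d(y, p)

# Labels: 1 = real, 0 = fake; p = discriminator's predicted probability of "real".
y = np.array([1.0, 1.0, 0.0, 0.0])
p = np.array([0.9, 0.8, 0.2, 0.1])
```

Because \( J_{G} = -J_{D} \), any step that lowers the discriminator's cost raises the generator's, which is why the two are trained in alternate cycles.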

Generator Model:

Input Layer: Vector of dimension 100
Layer 1:  Dense Neurons (1024)
Layer 2: Dense Neurons (128 * 7 * 7)
Layer 3: Reshape Neurons (128 * 7 * 7) -> (128, 7, 7)
Layer 4: Upsampling2D Neurons (2, 2) [ (128, 7, 7) -> (128, 14, 14) ]
Layer 5: Convolution2D Neurons (64, 5, 5) with border=same [ (128, 14, 14) -> (64, 14, 14) ]
Layer 6: Upsampling2D Neurons (2, 2) [ (64, 14, 14) -> (64, 28, 28) ]
Layer 7: Convolution2D Neurons (1, 5, 5) with border=same [ (64, 28, 28) -> (1, 28, 28) ]

At the end, as we can see, the output is a (1, 28, 28) tensor, the same shape as an MNIST image. Note that the ReLU activations in the hidden layers produce continuous non-negative values, not strictly 0 or 1; in practice the final layer uses a bounded activation such as tanh so that pixel values fall in a fixed range.
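The shape bookkeeping in the listing above can be verified with a few helper functions (pure Python, channels-first shapes; the helpers are illustrative stand-ins for the actual layers, not a library API):

```python
from math import prod

# Each helper returns a layer's output shape given its input shape
# (channels-first convention, matching the listing above).
def dense(shape, units):
    return (units,)

def reshape(shape, target):
    assert prod(shape) == prod(target), "reshape must preserve element count"
    return target

def upsample2d(shape, factor=2):
    ch, h, w = shape
    return (ch, h * factor, w * factor)

def conv2d_same(shape, filters):
    # 'same' border mode keeps the spatial size; only the channel count changes
    ch, h, w = shape
    return (filters, h, w)

shape = (100,)                        # input: latent vector
shape = dense(shape, 1024)            # Layer 1
shape = dense(shape, 128 * 7 * 7)     # Layer 2
shape = reshape(shape, (128, 7, 7))   # Layer 3
shape = upsample2d(shape)             # Layer 4: (128, 14, 14)
shape = conv2d_same(shape, 64)        # Layer 5: (64, 14, 14)
shape = upsample2d(shape)             # Layer 6: (64, 28, 28)
shape = conv2d_same(shape, 1)         # Layer 7: (1, 28, 28)
```

The final shape matches an MNIST image, confirming the layer dimensions line up.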

Discriminator model: 

Input Layer: Image of shape (1, 28, 28)
Layer 1: Convolution2D Neurons (64, 5, 5) with border=same
Layer 2: MaxPooling2D Neurons with pool_size=(2, 2)
Layer 3: Convolution2D Neurons (128, 5, 5) with border=same
Layer 4: MaxPooling2D Neurons with pool_size=(2, 2)
Layer 5: Flatten
Layer 6:  Dense Neurons (1024)
Layer 7: Dense Neurons (1)
Activation: Sigmoid on the final neuron, so the output can be read as the probability that the input image is real (tanh is used in the hidden layers).
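The discriminator's shapes can be traced the same way (pure Python, channels-first; the helpers are illustrative stand-ins for the layers, and max pooling with pool size 2 halves each spatial dimension):

```python
from math import prod

def conv2d_same(shape, filters):
    ch, h, w = shape
    return (filters, h, w)

def maxpool2d(shape, pool=2):
    ch, h, w = shape
    return (ch, h // pool, w // pool)

def flatten(shape):
    return (prod(shape),)

def dense(shape, units):
    return (units,)

shape = (1, 28, 28)              # input image
shape = conv2d_same(shape, 64)   # Layer 1: (64, 28, 28)
shape = maxpool2d(shape)         # Layer 2: (64, 14, 14)
shape = conv2d_same(shape, 128)  # Layer 3: (128, 14, 14)
shape = maxpool2d(shape)         # Layer 4: (128, 7, 7)
shape = flatten(shape)           # Layer 5: (6272,)
shape = dense(shape, 1024)       # Layer 6
shape = dense(shape, 1)          # Layer 7: single output neuron
```

The single output neuron, passed through a sigmoid, gives the probability that the input image is real rather than generated.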