Mathematical Foundations of GANs

Foundations of GAN

(Update: 29/07/2025 22:43)

This document provides a comprehensive overview of Generative Adversarial Networks (GANs), their mathematical foundations, and various implementations. It covers the motivation behind GANs, the process of generative modeling, and different types of GAN architectures including DC-GAN, Conditional GAN, and Wasserstein GAN.

This is a live document, and I will keep updating it as I implement more GANs.


Motivation

I’ve been given sample data, drawn i.i.d. from a population. I intend to learn the parameters of the population’s probability distribution and then sample from it at will.

Goal

So, I can decompose my goal into:

  1. Estimate the probability distribution of the population $\mathbb{P}_{\text{data}}$
  2. Learn to sample from that estimate

Initial Plan

Given $\{x_1, x_2, x_3, \ldots, x_n\} \sim \mathbb{P}_{\text{data}}$ (an unknown distribution), I’ll construct a $\mathbb{P}_{\text{gen}}$ and push $\mathbb{P}_{\text{gen}}$ towards $\mathbb{P}_{\text{data}}$.

From here on, I will refer to $\mathbb{P}_{\text{data}}$ as $\mathbb{P}_{X}$ and $\mathbb{P}_{\text{gen}}$ as $\mathbb{P}_{\theta}$.

Also, $\mathbb{P}$ refers to a distribution function and $P$ refers to a probability density function.

Assumption: From here on, we shall work solely with $P$, under the assumption that the distribution functions we’re dealing with have well-behaved density functions. [To make the calculations a little easier]

A density function is not a probability. It is a tool for evaluating probabilities.

The 3-step Process of Generative Modelling

  1. Assume a parametric family on $P_{X}$, denoted by $P_{\theta}$
    1. This $P_{\theta}$ is represented by neural networks
  2. Define & compute a divergence metric between $P_{\theta}$ and $P_{X}$
    1. We don’t know $P_{\theta}$ and $P_{X}$ though?
  3. Solve an optimisation problem over the parameters of $P_{\theta}$ to minimise the above divergence metric

Okay, assume you’ll somehow be able to compute the divergence metric between probability distributions that you don’t even know, and also optimise it. How will you begin to think about sampling from the estimated probability distribution?

The Push-Forward Methods

Consider some arbitrary but known distribution $Z \sim N(0, I)$ and a map $g_{\theta}(z): \mathcal{Z} \rightarrow \mathcal{X}$. Then $\tilde{X} = g_{\theta}(Z)$ has a different distribution from that of $Z$. So, if you end up constructing a $g_{\theta}$ that produces an $\tilde{X}$ whose distribution matches (or is close to) $P_{X}$, we are in the game.

_Add the generator diagram here_
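
To make the push-forward concrete, here is a minimal sketch (the network shape, the 8-dimensional $Z$, and the 2-d output are arbitrary choices for illustration):

import torch
import torch.nn as nn

# A push-forward map g_theta: Z -> X, represented by a small neural network
g_theta = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))

z = torch.randn(1000, 8)   # samples from the known distribution Z ~ N(0, I)
x_tilde = g_theta(z)       # samples from the induced distribution P_theta

# x_tilde is no longer Gaussian; its distribution is whatever g_theta induces
print(x_tilde.mean(dim=0), x_tilde.std(dim=0))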

Impending Questions

  1. How to compute the divergence metric without knowing $P_{\theta}$ and $P_X$?
  2. What should be the choice of the divergence metric $D$?
  3. How to choose $g_{\theta}(z)$, and in turn, $P_{\theta}$?
  4. How to solve the optimisation problem of minimising the divergence metric?

Variational Divergence Minimisation

Define DIVERGENCE between distributions

f-divergence

Given 2 probability distribution functions with corresponding density functions denoted by $P_X$ & $P_{\theta}$, the f-divergence between them is defined as follows:

$$D_f(P_X \| P_{\theta}) = \int_{\chi} P_{\theta}(x)\, f\!\left(\frac{P_{X}(x)}{P_{\theta}(x)}\right) dx$$

$f(u): \mathbb{R}^{+} \rightarrow \mathbb{R}$ is convex, lower semi-continuous, and satisfies $f(1) = 0$; $\chi$ is the space on which $P_X$ & $P_{\theta}$ are supported.

Properties of f-divergence:

  1. $D_f(\cdot) \geq 0$, for any choice of $f(\cdot)$
  2. $D_f(P_X \| P_{\theta}) = 0$, if $P_X = P_{\theta}$

Examples of f-divergence:

  • KL Divergence: $f(u) = u\log u$

$$D_f(P_X \| P_{\theta}) = \int_{\chi} P_{\theta}(x)\, \frac{P_X(x)}{P_{\theta}(x)} \log\!\left(\frac{P_X(x)}{P_{\theta}(x)}\right) dx = \int_{\chi} P_X(x) \log\!\left(\frac{P_X(x)}{P_{\theta}(x)}\right) dx = D_{KL}$$

$$\overbrace{D_f(P_X \| P_{\theta})}^{\text{Forward KL}} \neq \underbrace{D_f(P_{\theta} \| P_X)}_{\text{Reverse KL}}$$ | KL divergence is not symmetric.

Different choices of $f$ result in different divergence metrics, each with its own properties, which necessitates looking at swathes of divergence metrics.

  • JS Divergence: $f(u) = \frac{1}{2}\left(u\log u - (u+1)\log\frac{u+1}{2}\right)$ | Famously used in GANs
  • Total Variation Distance: $f(u) = \frac{1}{2}|u-1|$
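
As a quick sanity check of the definition, the choice $f(u) = \frac{1}{2}|u-1|$ recovers the familiar total variation distance on a small discrete example (the two PMFs below are made up):

import numpy as np

P = np.array([0.5, 0.3, 0.2])   # a made-up P_X
Q = np.array([0.2, 0.5, 0.3])   # a made-up P_theta

# D_f(P || Q) = sum_x Q(x) f(P(x)/Q(x)) with f(u) = 0.5 * |u - 1|
d_f = np.sum(Q * 0.5 * np.abs(P / Q - 1.0))
tv = 0.5 * np.abs(P - Q).sum()   # total variation distance, computed directly

assert np.isclose(d_f, tv)       # both equal 0.3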

Algorithm for f-divergence minimization

Objective: an algorithm to minimise $D_f$ between $P_X$ and $P_{\theta}$, without knowing either, but having samples from both.

Samples from $P_X$: the dataset $D$. Samples from $P_{\theta}$: the outputs of $g_{\theta}(z)$ for different $z$.

But, without knowing $P_{\theta}(x)$ and $P_X(x)$, solving a high-dimensional integral is infeasible.

Key Idea: Integrals involving density functions can be approximated using samples from the distribution.

The Law of the Unconscious Statistician

Suppose we wanted to compute,

$$I = \int_{\chi} h(x)\, P_{X}(x)\, dx$$

where $h(x)$ is a function and $P_{X}(x)$ is the density function.

Also, suppose that we have samples drawn i.i.d from PXP_X

$$x_1, x_2, x_3, \ldots, x_n \overset{\text{i.i.d}}{\sim} P_X$$

$$I = \mathbb{E}_{P_X}[h(x)]$$

Law of Large Numbers: $$\underbrace{\lim_{n \rightarrow \infty} \frac{1}{n} \sum_{i=1}^{n} h(x_i)}_{x_i \overset{\text{i.i.d}}{\sim} P_X} = \mathbb{E}_{P_X}[h(x)]$$

_If the terms of the f-divergence can be expressed as expectations of functions with respect to $P_X$ and $P_{\theta}$, then one can compute and optimise them._
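
A minimal sketch of this sample-mean approximation (my own toy choice: $h(x) = x^2$ and $P_X = N(0, 1)$, so the true value of $I$ is 1):

import torch

n = 100_000
x = torch.randn(n)      # x_1, ..., x_n ~ N(0, 1), i.i.d.
h = lambda t: t ** 2    # any function h whose expectation we want

I_hat = h(x).mean()     # (1/n) * sum h(x_i) -> E_{P_X}[h(x)] as n -> infinity
print(I_hat)            # ~1.0, since E[x^2] = Var(x) = 1 for N(0, 1)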

Expressing $D_f$ in terms of Expectations over $P_X$ and $P_{\theta}$

$$D_f(P_X \| P_{\theta}) = \int_{\chi} P_{\theta}(x)\, f\!\left(\frac{P_{X}(x)}{P_{\theta}(x)}\right) dx$$

Even though this expression looks eerily similar to the integral form of an expectation, you cannot directly rewrite it as an expectation, precisely because of the argument of $f$: it isn’t simply “$x$”, but a ratio of probability density functions.

Therefore, we somehow have to decouple the ratio of PDFs from ‘f’

SLIGHT DETOUR

Convex Conjugate

If $f(u)$ is a convex function, then there exists a conjugate function

$$f^*(t) := \sup_{u \in \text{dom}(f)} \{ut - f(u)\}$$

Essentially, each $t$ yields an affine lower bound $ut - f^*(t)$ on $f(u)$, and the conjugate makes this bound tight.
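
As a worked example (my own check, for the KL generator $f(u) = u\log u$), the supremum can be computed in closed form by setting the derivative to zero:

$$f^*(t) = \sup_{u > 0}\{ut - u\log u\}, \qquad \frac{d}{du}(ut - u\log u) = t - \log u - 1 = 0 \implies u = e^{t-1}$$

$$f^*(t) = e^{t-1}t - e^{t-1}(t-1) = e^{t-1}$$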

Properties of Conjugate

  1. $f^*(t)$ is also convex
  2. $[f^*(t)]^* = f(u)$ $\implies$ $f(u) = \sup_{t \in \text{dom}(f^*)} \{tu - f^*(t)\}$

DETOUR DONE

$$D_f(P_X \| P_{\theta}) = \int_{\chi} P_{\theta}(x)\, f\Big(\underbrace{\tfrac{P_{X}(x)}{P_{\theta}(x)}}_{u}\Big) \,dx = \int_{\chi} P_{\theta}(x)\, f(u) \,dx, \qquad f(u) = \sup_{t} \{tu - f^*(t)\}$$

$$D_f(P_X \| P_{\theta}) = \int_{\chi} P_{\theta}(x) \sup_{t} \{tu - f^*(t)\} \,dx$$

$$D_f(P_X \| P_{\theta}) = \int_{\chi} P_{\theta}(x) \sup_{t} \Big\{t\Big[\frac{P_{X}(x)}{P_{\theta}(x)}\Big] - f^*(t)\Big\} \,dx$$

$$= \sup_{T \in \mathcal{T}} \int_{\chi} P_{\theta}(x) \Big\{T(x)\Big[\frac{P_{X}(x)}{P_{\theta}(x)}\Big] - f^*(T(x))\Big\} \,dx$$

The inner optimisation problem involves $x$, so its solution depends on $x$ — that is, it is a function of $x$.

$\mathcal{T}$ is a space of functions $T: \mathcal{X} \rightarrow \text{dom}(f^*)$, with $T \in \mathcal{T}$.

But,

$$D_f \geq \sup_{T \in \mathcal{T}} \int_{\chi} P_{\theta}(x) \Big\{T(x)\Big[\frac{P_{X}(x)}{P_{\theta}(x)}\Big] - f^*(T(x))\Big\} \,dx$$

because the space of functions $\mathcal{T}$ that we are optimising over may not contain the optimal $T^*(x)$, i.e., the solution of the inner optimisation problem.

$$= \sup_{T \in \mathcal{T}} \left[\int_{\chi} P_{X}(x)\,T(x)\,dx - \int_{\chi} P_{\theta}(x)\,f^*(T(x))\,dx\right] = \sup_{T \in \mathcal{T}} \left\{\mathbb{E}_{P_{X}}[T(x)] - \mathbb{E}_{P_{\theta}}[f^*(T(x))]\right\}$$

Realisation of VDM

We set out to solve ---> $\theta^* = \mathop{\arg\min}_{\theta} D_f(P_X \| P_{\theta})$

But we are relegated to solving $\theta^* = \mathop{\arg\min}_{\theta} [\text{lower bound on } D_f]$, and that’s the best we can do.

$$= \mathop{\arg\min}_{\theta} \Big[\sup_{T \in \mathcal{T}} \big\{\mathbb{E}_{P_{X}}[T(x)] - \mathbb{E}_{P_{\theta}}[f^*(T(x))]\big\}\Big]$$

Calculating the supremum over a class of functions $T \in \mathcal{T}$ cannot, in general, be done analytically, since it is an optimisation procedure over a space of non-trivial functions.

In practice, we represent $\mathcal{T}$ via neural networks $T_w(X)$, where $w$ denotes the parameters of the network.

With this, the objective will become:

$$\theta^*, w^* = \mathop{\arg\min}_{\theta} \max_{w}\big[\mathbb{E}_{P_{X}}[T_w(X)] - \mathbb{E}_{P_{\theta}}[f^*(T_w(X))]\big]$$

Implementing VDM for Generative Sampling

$$J(\theta, w) = \mathbb{E}_{P_{X}}[T_w(X)] - \mathbb{E}_{P_{\theta}}[f^*(T_w(X))]$$

$$\underbrace{(\theta^*, w^*) = \mathop{\arg\min}_{\theta} \max_{w} J(\theta, w)}_{\text{Saddle-Point Optimisation Problem}}$$

This, essentially, is the blueprint for adversarial training.

Any saddle-point optimisation problem is also an adversarial optimisation problem.

We know, by construction, $T(\cdot): \mathcal{X} \rightarrow \text{dom}(f^*)$ ---> Depending on the choice of $f^*$, we need to tweak the T-network so that its range corresponds to the domain of $f^*$ ---> In practice, we do this by representing $T_w(X)$ as $\sigma_f(V_w(X))$, where $\sigma_f$ is an f-divergence-specific activation.

$V_w(X): \mathcal{X} \rightarrow \mathbb{R}$, $\sigma_f(v): \mathbb{R} \rightarrow \text{dom}(f^*)$

This means the T network that we’ll approximate using neural networks is represented as a composition of 2 functions.

The Discriminator as the composition of 2 functions will go here

$$J(\theta, w) = \mathbb{E}_{P_X}[\sigma_f(V_w(x))] - \mathbb{E}_{P_{\theta}}[f^*(\sigma_f(V_w(x)))]$$

Generative Adversarial Networks

A Special Case of VDM Algorithms

The choice of f-divergence: $\overbrace{u\log u - (u+1)\log(u+1)}^{\text{Awfully similar to JS Div.}}$

$f^*(t) = -\log(1 - e^t)$ and $\text{dom}(f^*) = \mathbb{R}^-$; $\sigma_f(v) = -\log(1 + e^{-v})$

$$J_{GAN}(\theta, w) = \mathbb{E}_{P_X}[\log D_w(X)] + \mathbb{E}_{P_{\theta}}[\log(1 - D_w(X))]$$

where $D_w(X) = \frac{1}{1 + e^{-V_w(X)}}$
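
A quick numerical sanity check (my own, with an arbitrary logit $v$) that this choice of $\sigma_f$ and $f^*$ turns the two VDM terms into $\log D_w$ and $\log(1 - D_w)$:

import math

def sigmoid(v):
  return 1.0 / (1.0 + math.exp(-v))

v = 0.7                                  # an arbitrary discriminator logit V_w(x)
t = -math.log(1.0 + math.exp(-v))        # T_w(x) = sigma_f(v)
f_star = -math.log(1.0 - math.exp(t))    # f*(sigma_f(v))

# T_w(x) = sigma_f(V_w(x)) is exactly log D_w(x) ...
assert abs(t - math.log(sigmoid(v))) < 1e-12
# ... and -f*(T_w(x)) is exactly log(1 - D_w(x))
assert abs(-f_star - math.log(1.0 - sigmoid(v))) < 1e-12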

Exercise: Derive the above $J_{GAN}$ using the given $f$.

GAN Architecture

The Generator and Discriminator go here

$$J_{GAN}(\theta, w) = \mathbb{E}_{x \sim P_X}[\log D_w(x)] + \mathbb{E}_{\hat{x} \sim P_{\theta}}[\log(1 - D_w(\hat{x}))]$$

Assume $\{x_1, x_2, \ldots, x_{B_1}\} \sim P_X$ and $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_{B_2}\} \sim P_{\theta}$.

To Train Discriminator

Keep $\theta$ constant.

$B_1 = \{x_1, x_2, \ldots, x_{B_1}\} \subset D$ | Sample $z_1, z_2, \ldots, z_{B_2}$ & pass them through $g_{\theta}(\cdot)$ with fixed $\theta$.

$$J_{GAN} = \frac{1}{B_1} \sum_{i=1}^{B_1} \log D_w(x_i) + \frac{1}{B_2} \sum_{j=1}^{B_2} \log(1 - D_w(\hat{x}_j))$$

$$w^* = \mathop{\arg\max}_{w} J_{GAN}(\theta, w)$$

$$w^{t+1} \leftarrow w^{t} + \alpha_1 \nabla_{w}J_{GAN}(\theta, w)$$

Backpropagate the gradients all the way to the inputs of the discriminator.

Exercise: What is the behaviour of $D(x)$ if $P_{\theta}(x) = P_X(x)$?

To Train Generator

$$J_{GAN} = \frac{1}{B_2} \sum_{j=1}^{B_2} \log(1 - D_w(g_{\theta}(z_j)))$$

To update the parameters of the generator, we do not need the samples from the data.

$$\theta^* = \mathop{\arg\min}_{\theta} J_{GAN}(\theta, w)$$

$$\theta^{t+1} \leftarrow \theta^{t} - \alpha_2 \nabla_{\theta}J_{GAN}(\theta, w)$$

Typically, we alternate between 1 step of the generator and 1 step of the discriminator while training.

Implementation of GAN

import torch
import torch.nn as nn

class Generator(nn.Module):
  def __init__(self, noise_dim, img_dim):
    super(Generator, self).__init__()
    self.model = nn.Sequential(
        nn.Linear(noise_dim, 256),
        nn.ReLU(True),
        nn.Linear(256, 512),
        nn.ReLU(True),
        nn.Linear(512, 1024),
        nn.ReLU(True),
        nn.Linear(1024, img_dim),
        nn.Tanh() # Because we normalized images to [-1, 1]
    )

  def forward(self, z):
    return self.model(z)
class Discriminator(nn.Module):
  def __init__(self, img_dim):
    super(Discriminator, self).__init__()
    self.model = nn.Sequential(
        nn.Linear(img_dim, 512),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Linear(512, 256),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Linear(256, 1),
        nn.Sigmoid() # Outputs probability between 0 and 1
    )

  def forward(self, img):
    return self.model(img)

# Assumed setup for this sketch: the components and hyperparameters referenced below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
noise_dim, img_dim = 100, 28 * 28
generator = Generator(noise_dim, img_dim).to(device)
discriminator = Discriminator(img_dim).to(device)
criterion = nn.BCELoss()
g_optimizer = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_optimizer = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_gan(train_loader, num_epochs):
  fixed_noise = torch.randn(64, noise_dim).to(device)

  for epoch in range(num_epochs):
    for batch_idx, (real, _) in enumerate(train_loader):
      batch_size = real.size(0)
      real = real.view(batch_size, -1).to(device)

      ### Create Real and Fake Labels
      real_labels = torch.ones(batch_size, 1).to(device)
      fake_labels = torch.zeros(batch_size, 1).to(device)

      ### ==================================
      ### Train Discriminator: max log(D(x)) + log(1 - D(G(z)))
      ### ==================================

      # Real Images
      outputs = discriminator(real)
      d_loss_real = criterion(outputs, real_labels)
      real_score = outputs

      # Fake Images
      z = torch.randn(batch_size, noise_dim).to(device)
      fake = generator(z)
      outputs = discriminator(fake.detach())
      d_loss_fake = criterion(outputs, fake_labels)
      fake_score = outputs

      # Total Loss
      d_loss = d_loss_real + d_loss_fake

      # Backprop
      d_optimizer.zero_grad()
      d_loss.backward()
      d_optimizer.step()

      ### ================================
      ### Train Generator: min log(1 - D(G(z))) <-> max log(D(G(z))
      ### ================================
      # Generate fake images again
      z = torch.randn(batch_size, noise_dim).to(device)
      fake = generator(z)
      outputs = discriminator(fake)

      g_loss = criterion(outputs, real_labels) # trick discriminator

      # Backprop
      g_optimizer.zero_grad()
      g_loss.backward()
      g_optimizer.step()

      if (epoch + 1) % 10 == 0 and batch_idx == 0:  # log once every 10 epochs
        print(f"Epoch [{epoch+1}/{num_epochs}], D_loss: {d_loss.item():.4f}, G_loss: {g_loss.item():.4f}")
        show_generated_images(epoch + 1, generator, fixed_noise)  # plotting helper, assumed defined elsewhere

Vanilla GAN - Recap

Input: $D = \{x_1, x_2, \ldots, x_n\} \overset{\text{i.i.d}}{\sim} P_X$. Data: $\text{MNIST}_{28\times28}$.

Loss Function

$$J(\theta, w) = \mathbb{E}_{P_X}[\log D_w(x)] + \mathbb{E}_{P_{\theta}}[\log(1 - D_w(\hat{x}))]$$

Approximate the expectations using sample means, and use batches:

$$J(\theta, w) = \frac{1}{B_1} \sum_{i=1}^{B_1} \log D_w(x_i) + \frac{1}{B_2} \sum_{j=1}^{B_2} \log(1 - D_w(\hat{x}_j))$$

$$x_1, x_2, \ldots, x_{B_1} \overset{\text{i.i.d}}{\sim} P_X, \qquad \hat{x}_1, \hat{x}_2, \ldots, \hat{x}_{B_2} \overset{\text{i.i.d}}{\sim} P_{\theta}$$

$$J(\theta, w) = \frac{1}{B_1} \sum_{i=1}^{B_1} \log D_w(x_i) + \frac{1}{B_2} \sum_{j=1}^{B_2} \log(1 - D_w(g_{\theta}(z_j)))$$

$$z_1, z_2, \ldots, z_{B_2} \overset{\text{i.i.d}}{\sim} N(0, I)$$

Discriminator N/W:

$$w^* = \mathop{\arg\max}_{w} \frac{1}{B_1} \sum_{i=1}^{B_1} \log D_w(x_i) + \frac{1}{B_2} \sum_{j=1}^{B_2} \log(1 - D_w(g_{\theta}(z_j)))$$

$$w^{t+1} \leftarrow w^t + \alpha_1 \nabla_{w}J(\theta, w)$$

While training the discriminator, the parameters of the generator network are kept constant.

  1. Sample $z_1, z_2, \ldots, z_{B_2} \overset{\text{i.i.d}}{\sim} N(0, I)$
  2. Pass them through $g_{\theta}(z)$ to obtain $g_{\theta}(z_1), g_{\theta}(z_2), \ldots, g_{\theta}(z_{B_2})$
  3. Pass these through $D_w(\cdot)$ to obtain the 2nd term of the loss
  4. Get $x_1, x_2, \ldots, x_{B_1} \sim P_X$
  5. Pass these through $D_w(\cdot)$ to get the 1st term of the loss
  6. TOTAL LOSS = TERM 1 + TERM 2

This is how the gradient ascent is implemented for the discriminator network.

Generator N/W:

$$\theta^* = \mathop{\arg\min}_{\theta} \overbrace{\frac{1}{B_1} \sum_{i=1}^{B_1} \log D_w(x_i)}^{\text{Independent of } \theta} + \frac{1}{B_2} \sum_{j=1}^{B_2} \log(1 - D_w(g_{\theta}(z_j)))$$

$$\theta^{t+1} \leftarrow \theta^t - \alpha_2 \nabla_{\theta}J(\theta, w)$$

While training the generator, the parameters of the discriminator network are kept constant.

GAN as Classifier-Guided Generative Sampler

$$D = \{x_1, x_2, \ldots, x_n\} \overset{\text{i.i.d}}{\sim} P_X$$

Goal: make $P_{\theta}(\hat{x})$ close to $P_X$.

Suppose there’s a binary classifier,

$$D_w(X) = \begin{cases} 1 & \text{if } x \sim P_X \\ 0 & \text{if } x \sim P_{\theta} \end{cases}$$

$D_w(X)$ is a binary classifier between the samples of $P_X$ and $P_{\theta}$.

Question: Can $D_w(X)$ be used to make $P_X$ and $P_{\theta}$ closer? Answer: Yes. Tweak $\theta$ until the classifier fails to distinguish between the samples of $P_X$ and $P_{\theta}$.

However, failure of the classifier need not imply $P_X = P_{\theta}$. (A counterexample supporting this statement can easily be constructed.)

Solution: Keep altering the classifier along with the data (pretty intuitive)

Natural question: what if the data clusters just cycle around 2 positions? We may get stuck!

This is possible. It’s called MODE COLLAPSE, a failure mode in GANs.

Formulation

Suppose $D_{w}: \mathcal{X} \rightarrow [0, 1]$. Let $D_w(X)$ represent the likelihood of sample $X$ coming from $P_X$.

The objective is to maximise the log-likelihood of $X$ coming from $P_X$:

$$w^* = \mathop{\arg\max}_{w}\big[\mathbb{E}_{P_X} \log D_w(x)\big]$$

Maximising the likelihood of $X \sim P_X$.

The classifier should also maximise the likelihood of $\hat{X}$ not coming from $P_X$, when $\hat{X} \sim P_{\theta}$:

$$w^* = \mathop{\arg\max}_{w}\big[\mathbb{E}_{\hat{X} \sim P_{\theta}} \log(1 - D_w(\hat{x}))\big]$$

Maximising the likelihood of $\hat{X} \not\sim P_X$.

The Combined Objective For Classifier Training

$$w^* = \mathop{\arg\max}_{w}\big[\mathbb{E}_{X \sim P_X} \log D_w(x) + \mathbb{E}_{\hat{X} \sim P_{\theta}} \log(1 - D_w(\hat{x}))\big]$$

This is the same loss function we obtained in the VDM.

The Objective for The Generator

Its job is to make the classifier “fail”.

Invert the optimisation for the classifier

$$\theta^* = \mathop{\arg\min}_{\theta} J(\theta, w), \qquad \theta^*, w^* = \mathop{\arg\min}_{\theta} \max_{w} J(\theta, w)$$

This classifier-guided interpretation of GANs does not generalise across other f-divergences, because there is no guarantee that the T-function we use to bound the f-divergence can be interpreted as a classifier. Only with this particular choice of f-divergence do we have that interpretation.

Deep-Convolutional GAN

Typically, in a GAN, $\mathcal{Z} \subset \mathbb{R}^{k}$, $\mathcal{X} \subset \mathbb{R}^d$, with $k \ll d$.

In a DC-GAN, upconvolution (transposed convolution) layers are used in the generator, and the data is images.
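
A minimal sketch of what such a generator could look like for $28 \times 28$ images (the channel sizes and kernel shapes are my own illustrative choices, not the ones from the DC-GAN paper):

import torch
import torch.nn as nn

# Transpose convolutions progressively upsample a noise vector into an image
dcgan_generator = nn.Sequential(
    # z: (B, 100, 1, 1) -> (B, 128, 7, 7)
    nn.ConvTranspose2d(100, 128, kernel_size=7, stride=1, padding=0),
    nn.BatchNorm2d(128), nn.ReLU(True),
    # (B, 128, 7, 7) -> (B, 64, 14, 14)
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
    nn.BatchNorm2d(64), nn.ReLU(True),
    # (B, 64, 14, 14) -> (B, 1, 28, 28)
    nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),
    nn.Tanh(),
)

z = torch.randn(16, 100, 1, 1)
print(dcgan_generator(z).shape)  # torch.Size([16, 1, 28, 28])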

Implementation of DC-GAN

Conditional GAN

Data $D$: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \overset{\text{i.i.d}}{\sim} P_{XY}$. Example: $X$: images, $Y$: class label / textual embedding.

Objective: sample from the conditional distribution $P_{X|Y}$ instead of $P_X$ (naive GAN).

Solution: estimate $P_{X|Y}$ and make $P_{\theta}$ approach $P_{X|Y}$ $\forall y$.

Modify the generator & discriminator to operate on samples of the conditional distribution.

$$J(\theta, w) = \mathbb{E}_{x, y \sim P_{XY}}[\log D_w(x, y)] + \mathbb{E}_{\hat{x}, y \sim P^{\theta}_{\hat{X}|Y}}[\log(1 - D_w(\hat{x}, y))]$$
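
One common way to implement this conditioning is to concatenate a label embedding to the inputs of both networks. A sketch, assuming integer class labels and the flat MNIST-style dimensions used earlier:

import torch
import torch.nn as nn

noise_dim, num_classes, img_dim = 100, 10, 28 * 28

class CondGenerator(nn.Module):
  def __init__(self):
    super().__init__()
    self.embed = nn.Embedding(num_classes, num_classes)
    self.model = nn.Sequential(
        nn.Linear(noise_dim + num_classes, 256), nn.ReLU(True),
        nn.Linear(256, img_dim), nn.Tanh(),
    )

  def forward(self, z, y):
    # condition on y by concatenating its embedding to the noise vector
    return self.model(torch.cat([z, self.embed(y)], dim=1))

class CondDiscriminator(nn.Module):
  def __init__(self):
    super().__init__()
    self.embed = nn.Embedding(num_classes, num_classes)
    self.model = nn.Sequential(
        nn.Linear(img_dim + num_classes, 256), nn.LeakyReLU(0.2, True),
        nn.Linear(256, 1), nn.Sigmoid(),
    )

  def forward(self, x, y):
    # score the pair (x, y): is x a real sample of class y?
    return self.model(torch.cat([x, self.embed(y)], dim=1))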

Implementation of Conditional GAN

Wasserstein GAN

Motivation for the Need

  1. The saddle point optimisation problem is difficult
  2. Any f-divergence minimisation leads to unstable training

Question: What makes GAN training difficult and unstable?

Manifold Hypothesis

The data we observe in the wild resides on a very low-dimensional manifold in the ambient space $(\mathbb{R}^d)$.

Example: consider a $5 \times 5$ binary image where $\text{pixel}_{ij} = 1$ if a coin toss lands heads, and $0$ otherwise. The likelihood that an image generated by this process resembles a real-world image is very low.

Therefore, it can be posited that real data exists on a lower-dimensional manifold.

Consequently, the supports of $P_X$ and $P_{\theta}$ will not be aligned, with very high probability.

It can be shown that a perfect discriminator can be learned when the supports of $P_X$ and $P_{\theta}$ are not aligned.

==> GAN training saturates (Perfect Discriminator Theorem). For a perfect discriminator, the gradients flowing to the generator vanish. Alternatively, it can be said that the lower bound on the loss is not tight, so minimising it would not make sense.

When we have a perfect discriminator ---> $D_f(P_X \| P_{\theta})$ becomes independent of $\theta$ ---> this causes saturation ---> since the gradients won’t flow to the generator, you wouldn’t be able to tweak $\theta$ to effectively align the generated data with the real data.

One empirical solution is to not train the two networks equally. One principled solution is to use a softer divergence metric that does not saturate when the supports of $P_X$ and $P_{\theta}$ do not align.

Wasserstein Metric [Optimal Transport]

Idea: we will show that the Wasserstein metric does not saturate and does not become independent of the generator’s parameters, despite the manifolds of the supports not aligning.

Given 2 distributions, $P_X$ & $P_{\theta}$:

$$W(P_X \| P_{\theta}) = \min_{\lambda \in \Pi(X, \hat{X})} \mathbb{E}_{x, \hat{x} \sim \lambda}\, \|x - \hat{x}\|_2$$

$\lambda$: a joint distribution between $P_X$ and $P_{\hat{X}}$. $\Pi(X, \hat{X})$: all joint distributions such that $\int_x \lambda(x, \hat{x}) \,dx = P_{\hat{X}}(\hat{x})$ & $\int_{\hat{x}} \lambda(x, \hat{x}) \,d\hat{x} = P_{X}(x)$.

Example: suppose $P_X$ and $P_{\hat{X}}$ are two 1-d discrete PMFs.

The mass in $P_X$ can be redistributed so that $P_X$ transforms into $P_{\hat{X}}$. A redistribution scheme can be represented as a joint distribution between $P_X$ and $P_{\hat{X}}$.

_Joint distribution between $P_X$ and $P_{\hat{X}}$_

$$\begin{array}{c|cccc} & \hat{x}_1 & \hat{x}_2 & \cdots & \hat{x}_L \\ \hline x_1 & 0.3 & 0.2 & \cdots & 0.1 \\ x_2 & & & \cdots & \\ \vdots & \vdots & \vdots & & \vdots \\ x_k & & & \cdots & \end{array}$$

Every redistribution scheme is a joint distribution and it is called a transport plan.

Next Goal:

Quantify the effort involved in a transport plan.

Suppose some mass is moved from $X$ to $\hat{X}$; then $\|X - \hat{X}\|_2$ is the distance of the movement of the mass.

$\Pi(X, \hat{X})$: the amount of mass that was moved. $\|X - \hat{X}\|_2 \,\Pi(X, \hat{X})$: the work done in moving the mass from $X \rightarrow \hat{X}$.

Any work done in a transport plan:

$$\int_{X, \hat{X}} \|x - \hat{x}\|_2 \,\Pi(x, \hat{x}) \,dx \,d\hat{x} = \mathbb{E}_{\Pi(x, \hat{x})}\, \|x - \hat{x}\|_2$$

_Why are we doing this? We are in search of a metric that tells us the work needed to move one distribution to another._

Given that multiple transport plans (joint distributions) exist between $X$ and $\hat{X}$, which of them corresponds to the least amount of work done?

$$\min_{\lambda \in \Pi(X, \hat{X})} \mathbb{E}_{\lambda(x, \hat{x})}\, \|x - \hat{x}\|_2$$

This corresponds to the transport plan leading to the least amount of work done.

The closer $P_X$ and $P_{\hat{X}}$ are, the less work the transport plan needs.

The Wasserstein metric between 2 distributions $P_X$ and $P_{\hat{X}}$ is given by:

$$W_{p}(P_X \| P_{\hat{X}}) = \min_{\lambda \in \Pi(X, \hat{X})} \mathbb{E}_{\lambda(x, \hat{x})}\, \|x - \hat{x}\|_p, \quad \text{where } \int_{\hat{x}} \lambda(x, \hat{x}) \,d\hat{x} = P_{X}(x) \text{ and vice-versa}$$
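
For 1-d distributions this optimal-transport problem has a closed-form solution, and scipy exposes it directly, which gives a quick way to build intuition (the PMFs below are made up):

import numpy as np
from scipy.stats import wasserstein_distance

# Two 1-d discrete distributions: mass locations and their weights
x_vals, x_mass = np.array([0.0, 1.0, 2.0]), np.array([0.5, 0.3, 0.2])
y_vals, y_mass = np.array([0.0, 1.0, 2.0]), np.array([0.2, 0.5, 0.3])

# W_1: the least total work over all transport plans between the two PMFs
print(wasserstein_distance(x_vals, y_vals, x_mass, y_mass))  # 0.4

# Moving a distribution further away costs proportionally more work
print(wasserstein_distance(x_vals, x_vals + 5.0, x_mass, x_mass))  # 5.0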

Fact | Take-home message: the Wasserstein metric does not saturate, unlike f-divergences, when the supports of $P_X$ and $P_{\theta}$ do not overlap.

How to do Generative Modelling with Wasserstein Metric?

$\theta^* = \mathop{\arg\min}_{\theta} W(P_X \| P_{\theta})$ ---> How to minimise? ---> Cue the Kantorovich-Rubinstein duality.

$$W(P_X \| P_{\theta}) = \max_{\underbrace{\|T_w\|_{L} \leq 1}_{\text{1-Lipschitz}}}\big[\mathbb{E}_{x \sim P_X} T_w(x) - \mathbb{E}_{\hat{x} \sim P_{\theta}} T_w(\hat{x})\big]$$

==> A function is 1-Lipschitz if its rate of change is bounded by 1:

$$\frac{\|T_w(x_1) - T_w(x_2)\|}{\|x_1 - x_2\|} \leq 1$$

AKA: Bounded Derivatives

$$\theta^*, w^* = \mathop{\arg\min}_{\theta} \max_{\|T_w\|_L \leq 1} \big[\mathbb{E}_{x \sim P_X} T_w(x) - \mathbb{E}_{\hat{x} \sim P_{\theta}} T_w(\hat{x})\big]$$

The above objective is very similar to the GAN objective; thus, this method of minimising the Wasserstein metric is called W-GAN.

$T_w(X)$ has to be made 1-Lipschitz ---> How to achieve this is a research topic in itself ---> One practical way is to constrain the weights of $T_w$ after each gradient step, e.g., normalise them so that $\|w\|_2 = 1$ (the W-GAN paper itself clips each weight to a small interval $[-c, c]$).
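
A minimal sketch of one critic update with the constraint enforced by weight clipping (`critic`, `generator`, and `critic_opt` are assumed to be defined elsewhere; `c` is the clipping constant):

import torch

def critic_step(real, critic, generator, critic_opt, noise_dim=100, c=0.01):
  z = torch.randn(real.size(0), noise_dim)
  fake = generator(z).detach()

  # Maximise E[T_w(x)] - E[T_w(x_hat)]  <=>  minimise its negation
  loss = -(critic(real).mean() - critic(fake).mean())

  critic_opt.zero_grad()
  loss.backward()
  critic_opt.step()

  # Enforce (a crude proxy for) the 1-Lipschitz constraint
  for p in critic.parameters():
    p.data.clamp_(-c, c)
  return loss.item()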

CONCLUSION: training a W-GAN is observed to be more stable than training a naive GAN.

Implementation of WGAN

Inversion with GAN

A trained GAN is a function $Z \rightarrow X$. Suppose one is interested in inverting this function, i.e., given $x_i \sim P_X$, find the corresponding $z$ ---> $z_i: g_{\theta}(z_i) = x_i$.

Question: How to modify a GAN such that the inverse is possible?

First off, why is it worthy of our attention?

  1. Feature extraction: given a dataset, obtain the GAN-inverted vectors and use them as features for the data. They can then be used for downstream tasks.
  2. Data manipulation & editing: suppose $x_i \sim P_X$ needs to be edited. First, obtain $z_i$ with $x_i = g_{\theta}^*(z_i)$ via inversion. Then $z_{\text{edit}} = f_{\text{edit}}(z_i)$ & $x_{\text{edit}} = g_{\theta}^*(z_{\text{edit}})$.

Approach 1: Bi-Directional GAN

In a naive GAN, $g_{\theta}(z): \mathcal{Z} \rightarrow \mathcal{X}$ and $D_w(x): \mathcal{X} \rightarrow [0, 1]$.

In Bi-GAN, in addition to $g_{\theta}(z)$ & $D_w(x)$, there is another function called the encoder (or inverter) network ---> $E_{\phi}: \chi \rightarrow \mathcal{Z}$.

In Bi-GAN, $D_w(\cdot)$ is designed to classify between data tuples of the form $[X, E_{\phi}(X)]$ and $[g_{\theta}(Z), Z]$, where $X \sim P_X$, $E_{\phi}(X) \sim P_{\phi}(\hat{Z})$, $Z \sim N(0, I)$, $g_{\theta}(Z) \sim P_{\theta}$.

In a naive GAN, $D_w(\cdot)$ classifies between $X$ and $g_{\theta}(Z)$.

$$L_{\text{Bi-GAN}}(\theta, w, \phi) = \mathbb{E}_{x \sim P_X}\big[\log D_w(x, E_{\phi}(x))\big] + \mathbb{E}_{z \sim P_Z}\big[\log\{1 - D_w(g_{\theta}(z), z)\}\big]$$

$$\hat{Z} = E_{\phi}(x) \sim P_{\phi}(\hat{Z}), \quad \hat{X} = g_{\theta}(Z) \sim P_{\theta}(\hat{X}), \quad Z \sim P_Z = N(0, I)$$

Objective Function for BiGAN

$$\min_{\theta, \phi} \max_{w} L_{\text{Bi-GAN}}(\theta, w, \phi)$$

Once a Bi-GAN is trained, the generator network $g_{\theta}^*(z)$ is used for generation & $E_{\phi}^*(x)$ is used for inversion.

It can be shown that the optimum of $L_{\text{Bi-GAN}}$ is achieved when $P_{\hat{Z}X} = P_{Z\hat{X}}$, where

$$P_{\hat{Z}X}(\hat{z}, x) = P_X(x)\, P_{\phi}(\hat{z} \mid x), \qquad P_{Z\hat{X}}(z, \hat{x}) = P_Z(z)\, P_{\theta}(\hat{x} \mid z)$$
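
A sketch of the extra pieces Bi-GAN adds on top of a vanilla GAN (the dimensions are placeholders; the point is that $D_w$ now scores $(x, z)$ pairs):

import torch
import torch.nn as nn

noise_dim, img_dim = 100, 28 * 28

# E_phi: X -> Z, the encoder / inverter network
encoder = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(True),
                        nn.Linear(256, noise_dim))

# D_w now takes a concatenated (x, z) tuple instead of just x
joint_discriminator = nn.Sequential(
    nn.Linear(img_dim + noise_dim, 256), nn.LeakyReLU(0.2, True),
    nn.Linear(256, 1), nn.Sigmoid(),
)

x = torch.randn(8, img_dim)         # stand-in for a batch of real images
z_hat = encoder(x)                  # E_phi(x)
score = joint_discriminator(torch.cat([x, z_hat], dim=1))  # D_w(x, E_phi(x))
print(score.shape)                  # torch.Size([8, 1])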

Implementation of BiGAN

Latent Regression

$$L_{\text{Lat-Reg}} = \big[\mathbb{E}_{x \sim P_X} \log D_w(x) + \mathbb{E}_{\hat{x} \sim P_{\theta}} \log\{1 - D_w(\hat{x})\}\big] + \lambda\, \|z - E_{\phi}(\hat{x})\|_2^2$$

Handling Domain-Shift

Suppose you built a classifier in Delhi and want to use it in Boston. How do you make it robust to that change? By extracting domain-invariant features.

Suppose we have data from a distribution,

$$D_{S} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \overset{\text{i.i.d}}{\sim} P_S$$

$D_S$: source distribution.

Task: classification; estimate $P_S(y \mid x)$.

During test time, data comes from a target distribution $D_T$ that is different from $D_S$:

$$D_T \neq D_S$$

Classifiers/regressors trained on $D_S$ would fail on $D_T$ [e.g., the PACS dataset].

Unsupervised Domain Adaptation

Given $D_S = \{(x_i, y_i)\}_{i=1}^n \overset{\text{i.i.d}}{\sim} P_S$, $D_T = \{\hat{x}_j\}_{j=1}^m \overset{\text{i.i.d}}{\sim} P_T$, and $P_S \neq P_T$ (domain shift).

Objective: given $D_S$ & $D_T$, learn features/a classifier that performs well on both source and target data.

Solution: Domain-Adversarial Networks.

$\phi: \chi \rightarrow F$ | $\phi(x) = f_S$ & $\phi(\hat{x}) = f_T$, with $f_S \sim P_{f_S}$ and $f_T \sim P_{f_T}$.

$$\phi^*, w^* = \mathop{\arg\min}_{\phi} \max_{w}\big[\mathbb{E}_{P_{f_S}} \log D_w(f_S) + \mathbb{E}_{P_{f_T}} \log(1 - D_w(f_T))\big]$$

$$\phi^*, \psi^* = \mathop{\arg\min}_{\phi, \psi} \text{BCE}(y, h_{\psi}(f_S))$$

The $\phi$-network has 2 objectives:

  1. Adversarial objective
  2. Classification objective

If we train the feature extractor solely on the adversarial objective, it may get good at producing matched representations for data points across domains, but those representations may not be conducive to the classification task.

During inference, the same classifier can be used on both the source and target distributions, since the classifier is trained on domain-agnostic features.

Given,

$$\begin{aligned} \hat{x}_{\text{test}} &\sim P_T \\ f_t^{\text{test}} &= \phi^*(\hat{x}_{\text{test}}) \\ \hat{y}_{\text{test}} &= h_{\psi^*}(f_t^{\text{test}}) \end{aligned}$$

Since $\phi^*$ ensures that $P_{f_S} = P_{f_T}$, $h_{\psi^*}$ will function the same way irrespective of domain.
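
The standard trick for training $\phi$ on both objectives in a single backward pass is a gradient-reversal layer. A sketch of the usual PyTorch idiom (`lamb` scales the adversarial gradient):

import torch

class GradReverse(torch.autograd.Function):
  """Identity in the forward pass; flips the gradient's sign in the backward pass."""
  @staticmethod
  def forward(ctx, x, lamb):
    ctx.lamb = lamb
    return x.view_as(x)

  @staticmethod
  def backward(ctx, grad_output):
    # phi receives the negated domain-classifier gradient, so minimising
    # the discriminator's loss trains phi adversarially against it
    return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
  return GradReverse.apply(x, lamb)

# Usage: features = phi(x); domain_logits = D_w(grad_reverse(features))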

Implementation of UDA

Evaluation of Generative Models

Given $D_{\text{true}} = \{x_1, x_2, x_3, \ldots, x_n\} \overset{\text{i.i.d}}{\sim} P_X$ (real data) and $D_{\text{gen}} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n\} \overset{\text{i.i.d}}{\sim} P_{\theta^*}$ (generated data).

Goal: compare $D_{\text{true}}$ & $D_{\text{gen}}$.

A popular metric for this comparison is the Fréchet Inception Distance.

Frechet Inception Distance

FID: the Wasserstein metric between $D_{\text{true}}$ & $D_{\text{gen}}$, with a twist.

  1. Take a pretrained Inception net $I_{\psi^*}(x)$, trained on ImageNet.
  2. Pass $D_{\text{true}}$ & $D_{\text{gen}}$ through the above Inception network and extract features from an $l^{\text{th}}$ layer:

$$\hat{D}_{\text{true}} = \{ Z^1_{\text{true}}, Z^2_{\text{true}}, \ldots, Z^n_{\text{true}} \}, \qquad \hat{D}_{\text{gen}} = \{ Z^1_{\text{gen}}, Z^2_{\text{gen}}, \ldots, Z^n_{\text{gen}} \}$$

$$Z_{\text{true}}^i, Z_{\text{gen}}^i = \text{outputs of the } l^{\text{th}} \text{ layer of } I_{\psi^*}(x)$$

  3. Compute means and covariances for $\hat{D}_{\text{true}}$ and $\hat{D}_{\text{gen}}$:

$$\mu_{\text{true}} = \frac{1}{n} \sum_{i=1}^n Z_{\text{true}}^i, \qquad \Sigma_{\text{true}} = \frac{1}{n} \sum_{i=1}^n (Z_{\text{true}}^i - \mu_{\text{true}})(Z_{\text{true}}^i - \mu_{\text{true}})^T$$

    Similarly, compute $\mu_{\text{gen}}$ & $\Sigma_{\text{gen}}$.

  4. Assume $Z_{\text{true}} \sim N(\mu_{\text{true}}, \Sigma_{\text{true}})$ and $Z_{\text{gen}} \sim N(\mu_{\text{gen}}, \Sigma_{\text{gen}})$.

  5. $$\text{FID} = W(N(\mu_{\text{true}}, \Sigma_{\text{true}}) \| N(\mu_{\text{gen}}, \Sigma_{\text{gen}})) = \|\mu_{\text{true}} - \mu_{\text{gen}}\|^2 + \text{tr}\big(\Sigma_{\text{true}} + \Sigma_{\text{gen}} - 2(\Sigma_{\text{true}} \Sigma_{\text{gen}})^{\frac{1}{2}}\big)$$

The generative model is better if the FID is lower.

Lower FID ==> lower Wasserstein distance between $P_X$ and $P_{\theta^*}$.
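
Putting steps 3-5 together, a sketch of the FID computation itself (`Z_true` and `Z_gen` stand for the $n \times d$ Inception feature arrays extracted above; scipy provides the matrix square root):

import numpy as np
from scipy.linalg import sqrtm

def fid(Z_true: np.ndarray, Z_gen: np.ndarray) -> float:
  """Frechet distance between Gaussians fitted to two (n x d) feature sets."""
  mu_t, mu_g = Z_true.mean(axis=0), Z_gen.mean(axis=0)
  sigma_t = np.cov(Z_true, rowvar=False)  # centred covariance estimates
  sigma_g = np.cov(Z_gen, rowvar=False)

  covmean = sqrtm(sigma_t @ sigma_g)
  if np.iscomplexobj(covmean):            # drop tiny imaginary parts
    covmean = covmean.real

  return float(np.sum((mu_t - mu_g) ** 2)
               + np.trace(sigma_t + sigma_g - 2.0 * covmean))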

Reference Papers

  1. GAN Paper: https://arxiv.org/pdf/1406.2661
  2. f-GAN Paper: https://arxiv.org/pdf/1606.00709
  3. Conditional GAN Paper: https://arxiv.org/pdf/1411.1784
  4. DC-GAN Paper: https://arxiv.org/pdf/1511.06434
  5. W-GAN Paper: https://arxiv.org/pdf/1701.07875
  6. BiGAN Paper: https://arxiv.org/pdf/1605.09782