Deep Generative Models

TL;DR: This blog post provides an overview of (deep) generative models and a few foundational mathematical concepts.

1. Introduction

A statistical generative model is a probability distribution $p(x)$.


It is generative because sampling from $p(x)$ generates new data points.


We are interested in learning the data distribution $p(x)$ with a model $p_\theta (x)$, parametrized by $\theta$, from an empirical dataset. The random variable $x$ represents a data sample drawn from the underlying data distribution $p(x)$. In some cases, we may not be able to explicitly model $p(x)$ directly but can instead just generate samples from it.

Example: Consider a data distribution of greyscale images, where \(x \in \mathcal{X}^D\), with \(\mathcal{X}=\{0,1, \ldots, 255\}\) representing the possible pixel intensity values, and $D$ the total number of pixels in each image. Then $p(x)$ represents the probability distribution over the space of all possible greyscale images. In other words, $p(x)$ assigns a probability to each possible configuration of pixel values across the $D$ pixels.
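To get a feel for how large this sample space is, a quick back-of-the-envelope calculation (with an illustrative tiny image size) counts the possible configurations $|\mathcal{X}|^D$:

```python
# Illustrative only: count the configurations of a tiny 4x4 greyscale "image".
D = 4 * 4                # number of pixels
num_intensities = 256    # |X| = {0, 1, ..., 255}
num_images = num_intensities ** D  # |X|^D possible configurations
print(num_images)        # 256^16 = 2^128, roughly 3.4e38 images
```

Even at 4x4 resolution the space is astronomically large, which is why we learn $p(x)$ rather than enumerate it.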

With probability distribution $p(x)$, we can do the following:

Figure adopted from [Stanford CS236 - Fall 2023](https://deepgenerativemodels.github.io/syllabus.html).

2. Useful Concepts and Mathematical Preliminaries

2.1. Control Signals

We often need some form of control signal (such as a latent variable $z$) for generation.

The data distribution $p(x)$ can then be factorized through the control signal $z$. It’s often useful to condition on rich information $z$.

\[p(x)=\int p(x \mid z) p(z) d z\]

The model splits the task of generating data into two parts:

  1. Generating the latent variable $z$.
  2. Generating a sample $x$ conditioned on $z$.

This allows the model to disentangle factors of variation in the data (e.g., shape, color, orientation) and represent them explicitly in the latent space.
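The two-step recipe above is ancestral sampling: first draw $z$ from the prior $p(z)$, then draw $x$ from the conditional $p(x \mid z)$. A minimal sketch, assuming a standard Gaussian prior and a hypothetical linear "decoder" standing in for a neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z):
    # Hypothetical p(x|z): a fixed linear map from a 2-D latent to the
    # means of 4 "pixels"; in practice this would be a neural network.
    W = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.7], [0.2, 0.2]])
    return W @ z

# Step 1: sample the latent from the prior p(z) = N(0, I).
z = rng.standard_normal(2)
# Step 2: sample x from the conditional p(x|z) = N(decoder(z), 0.1^2 I).
x = decoder(z) + 0.1 * rng.standard_normal(4)
print(x.shape)  # (4,)
```

Integrating the two steps over all $z$ recovers the marginal $p(x)=\int p(x \mid z)\, p(z)\, dz$.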

2.2. Discriminative vs. Generative

In a (discriminative) classification problem, our goal is to learn the conditional probability that a sample belongs to a certain class, expressed as:

\[P(Y=c \mid X=x).\]

In a generative problem, the input $X$ is not given; this requires a model of the joint distribution over both $X$ and $Y$. We are interested in learning the marginal probability \(P(X)\) or the joint probability (with $Y$ acting as the control signal):

\[P(Y=c, X=x).\]

In summary, the conditional, marginal, and joint probabilities are related by Bayes' rule.

\[P(Y \mid X)=\frac{P(X \mid Y) P(Y)}{P(X)}=\frac{P(X, Y)}{P(X)}.\]
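As a quick numerical sanity check of this identity, take a small made-up joint distribution over $(X, Y)$ (the numbers are arbitrary, chosen only so the marginals are easy to read off):

```python
# Tiny illustrative joint distribution over (X, Y).
P_joint = {("x0", "y0"): 0.1, ("x0", "y1"): 0.3,
           ("x1", "y0"): 0.2, ("x1", "y1"): 0.4}

P_X = {"x0": 0.4, "x1": 0.6}   # marginal P(X), summed over Y
P_Y = {"y0": 0.3, "y1": 0.7}   # marginal P(Y), summed over X

# P(Y=y1 | X=x0) directly from the joint: P(X, Y) / P(X)
cond_from_joint = P_joint[("x0", "y1")] / P_X["x0"]      # 0.3 / 0.4 = 0.75

# The same quantity via Bayes' rule: P(X | Y) P(Y) / P(X)
P_X_given_Y = P_joint[("x0", "y1")] / P_Y["y1"]
cond_from_bayes = P_X_given_Y * P_Y["y1"] / P_X["x0"]    # also 0.75
print(cond_from_joint, cond_from_bayes)
```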

2.3. Concrete Example: Data Distribution

The MNIST dataset consists of grayscale images with pixel values between $0$ and $255$. We can normalize them to $[0,1]$ and, treating the $784$ pixels as independent, factorize the distribution as:

\[p(x)=p \left( x _ 1, x _ 2, \ldots, x _ {784} \right) =\prod _ {i=1}^{784} p\left(x _ i\right)\]

Note: pixels are clearly not independent, but we make this simplifying assumption. Here, we assume a Bernoulli distribution per pixel; however, you can use other distributions as well, and a Gaussian is another common choice. Later, we will discuss the log-likelihood and KL divergence of Bernoulli and Gaussian distributions, which will help clarify the rationale behind modeling images as Bernoulli or Gaussian variables.
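Under the independent-Bernoulli assumption, the log-likelihood of an image is just a sum of per-pixel terms, $\log p(x) = \sum_i \left[ x_i \log p_i + (1-x_i) \log(1-p_i) \right]$. A minimal sketch with illustrative (random, not learned) parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: each of the 784 pixels is an independent Bernoulli
# with its own parameter p_i (random here; learned in practice).
p = rng.uniform(0.05, 0.95, size=784)
# A binarized "image" standing in for an MNIST sample.
x = (rng.uniform(size=784) < 0.5).astype(float)

# log p(x) = sum_i [ x_i log p_i + (1 - x_i) log(1 - p_i) ]
log_px = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
print(log_px)  # a (large negative) scalar log-likelihood
```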

2.4. Entropy, Cross-entropy, and KL Divergence
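As a quick numerical reference for these three quantities, here is a sketch with two small discrete distributions (the values are arbitrary, for illustration only); it also checks the identity $H(p, q) = H(p) + \mathrm{KL}(p \| q)$ that ties them together:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # "true" distribution (illustrative)
q = np.array([0.4, 0.4, 0.2])     # model distribution (illustrative)

entropy_p = -np.sum(p * np.log(p))        # H(p)
cross_entropy = -np.sum(p * np.log(q))    # H(p, q)
kl_pq = np.sum(p * np.log(p / q))         # KL(p || q), >= 0, zero iff p == q

# Cross-entropy decomposes into entropy plus the KL divergence.
print(np.isclose(cross_entropy, entropy_p + kl_pq))  # True
```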

2.5. Log-likelihoods and KL Divergence of Bernoulli and Gaussian

Oftentimes, we assume a simple probability distribution $p(x)$ over the input. Common choices include (independent) Gaussian and Bernoulli. We are interested in learning a distribution \(p _ \theta(x)\), parametrized by $\theta$, through maximum likelihood learning or by minimizing the KL divergence; here $\theta$ are the parameters of the distribution, which can be given by a neural network.
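A minimal sketch of this idea for the Gaussian case: the maximum-likelihood parameters are the sample mean and variance, and the closed-form KL between two univariate Gaussians lets us measure how close the fit is (the data-generating parameters below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data drawn from a Gaussian that the model does not know.
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

# Maximum-likelihood estimates for a Gaussian: sample mean and (biased) variance.
mu_mle = data.mean()
var_mle = data.var()

def kl_gaussian(mu1, s1, mu2, s2):
    # Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) ) for univariate Gaussians.
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# KL from the true distribution to the fitted one is near zero for a good fit.
print(kl_gaussian(2.0, 1.5, mu_mle, np.sqrt(var_mle)))
```

Maximizing the likelihood of the data is equivalent to minimizing the KL divergence from the data distribution to \(p _ \theta(x)\), which is why the two objectives appear interchangeably.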

2.6. The Reparametrization Trick

When we want to optimize a function involving stochastic variables, we face an issue: the sampling distribution itself depends on the parameters $\theta$, and gradients cannot flow directly through the sampling operation.

\[\mathcal{L}(\theta)=\mathbb{E} _ {z \sim q _ \theta(z)}[f(z)],\]

\[\nabla _ \theta \mathcal{L}(\theta) = \nabla _ \theta \mathbb{E} _ {z \sim q _ \theta(z)}[f(z)].\]

Other Useful Resources for Starters

Books

Lecture Recordings

  1. Stanford CS236 Deep Generative Models (2023)
  2. A Course on Generative AI - Diffusion Models