TL;DR: This blog provides an overview of (deep) generative models and a few foundational mathematical concepts.
A statistical generative model is a probability distribution $p(x)$.

It is generative because sampling from $p(x)$ generates new data points.

We are interested in learning the data distribution $p(x)$ with a model $p_\theta (x)$, parametrized by $\theta$, from an empirical dataset. The random variable $x$ represents a data sample drawn from the underlying data distribution $p(x)$. In some cases, we may not be able to explicitly model $p(x)$ directly but can instead just generate samples from it.
Example: Consider a data distribution of greyscale images, where \(x \in \mathcal{X}^D\), with \(\mathcal{X}=\{0,1, \ldots, 255\}\) representing the possible pixel intensity values, and $D$ the total number of pixels in each image. So, $p(x)$ represents the probability distribution over the space of all possible greyscale images. In other words, $p(x)$ assigns a probability to each possible configuration of pixel values across the $D$ pixels.
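To build intuition for how large this space is, here is a tiny sketch (using a hypothetical $D = 4$ rather than a realistic image size):

```python
# Size of the sample space for D-pixel greyscale images.
# D = 4 is a hypothetical tiny image; a 28x28 image would have D = 784.
D = 4
num_intensities = 256                 # |X| = {0, ..., 255}
num_images = num_intensities ** D     # configurations p(x) assigns mass to
print(num_images)                     # 4294967296, i.e. 256**4
```

Even for this toy $D$, the space has billions of configurations; for real images, explicitly tabulating $p(x)$ is hopeless, which is why we learn a parametric model instead.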
With the probability distribution $p(x)$, we can generate new samples by drawing $x \sim p(x)$, and we can evaluate the likelihood that the model assigns to a given data point.

We often need some form of control signal (such as a latent variable $z$) for generation.
The data distribution $p(x)$ can then be factorized through the control signal $z$. It’s often useful to condition on rich information $z$.
\[p(x)=\int p(x \mid z) p(z) d z\]The model splits the task of generating data into two parts: first sampling a control signal $z \sim p(z)$, then generating a data point $x \sim p(x \mid z)$.
This allows the model to disentangle factors of variation in the data (e.g., shape, color, orientation) and represent them explicitly in the latent space.
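A minimal sketch of this two-step (ancestral) sampling process, assuming a standard Gaussian prior over $z$ and a made-up linear decoder with Gaussian observation noise (all names and dimensions here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_x(n, d_z=2, d_x=5):
    """Ancestral sampling: z ~ p(z), then x ~ p(x|z)."""
    z = rng.standard_normal((n, d_z))       # p(z) = N(0, I)
    W = np.ones((d_z, d_x))                 # hypothetical decoder weights
    mu_x = z @ W                            # mean of p(x|z)
    x = mu_x + 0.1 * rng.standard_normal((n, d_x))  # x ~ N(mu_x, 0.1^2 I)
    return x

samples = sample_x(3)
print(samples.shape)  # (3, 5)
```

In a trained latent-variable model such as a VAE, the mapping from $z$ to the parameters of $p(x \mid z)$ would be a neural network rather than a fixed matrix.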
Given a classification problem (discriminative), our goal is to learn the conditional probability of a sample belonging to a certain class, expressed as:
\[P(Y=c \mid X=x).\]In a generative problem, the input $X$ is not given, so we require a model of the joint distribution over both $X$ and $Y$. We are interested in learning the marginal probability \(P(X)\) or the joint probability (with $Y$ as the control signal):
\[P(Y=c, X=x).\]In summary:
The conditional probability, marginal probability, and joint probability are related by Bayes' rule.
\[P(Y \mid X)=\frac{P(X \mid Y) P(Y)}{P(X)}=\frac{P(X, Y)}{P(X)}.\]The MNIST dataset consists of grayscale images with pixel values between $0$ and $255$; we can normalize them to $[0,1]$.
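As a quick numerical sanity check of Bayes' rule, take a small made-up joint distribution over a binary $X$ and binary $Y$:

```python
import numpy as np

# Made-up joint distribution P(X, Y): rows index X, columns index Y.
P_xy = np.array([[0.1, 0.3],
                 [0.2, 0.4]])
P_x = P_xy.sum(axis=1)                 # marginal P(X)
P_y = P_xy.sum(axis=0)                 # marginal P(Y)
P_y_given_x = P_xy / P_x[:, None]      # P(Y|X) = P(X,Y) / P(X)
P_x_given_y = P_xy / P_y[None, :]      # P(X|Y) = P(X,Y) / P(Y)

# Bayes' rule: P(Y|X) = P(X|Y) P(Y) / P(X)
bayes = P_x_given_y * P_y[None, :] / P_x[:, None]
print(np.allclose(P_y_given_x, bayes))  # True
```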
Note: In reality, pixels are not independent, but we make this simplifying assumption. Here, we assume Bernoulli distributions; other choices, such as the Gaussian, are also common. Later, we will discuss the log-likelihood and KL divergence of Bernoulli and Gaussian distributions, which will help clarify the rationale behind modeling images as Bernoulli or Gaussian variables.
Entropy $H(p)$ is a measure of the uncertainty in the distribution:
\[H(p)=\mathbb{E} _ {X \sim p}[-\log p(X)].\]Cross-entropy $H(p, q)$ measures the expected number of bits needed to encode data from $p$ using the distribution $q$:
\[H(p, q)=\mathbb{E} _ {X \sim p}[-\log q(X)].\]KL divergence $D _ {\mathrm{KL}}(p \| q)$ measures how one probability distribution diverges from another:
\[D _ {\mathrm{KL}}(p \| q)=\mathbb{E} _ {X \sim p}\left[\log \frac{p(X)}{q(X)}\right].\]Oftentimes, we assume a simple probability distribution $p(x)$ over the input. Common choices include (independent) Gaussian and Bernoulli. We are interested in learning a distribution parametrized by \(p _ \theta(x)\) through maximum likelihood learning or minimizing the KL divergence; here $\theta$ are the parameters of the distribution, which can be given by a neural network.
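These three quantities can be computed directly for discrete distributions; in particular, they satisfy the identity $H(p, q) = H(p) + D _ {\mathrm{KL}}(p \| q)$. The example distributions below are made up:

```python
import numpy as np

def entropy(p):
    """H(p) = E_{X~p}[-log p(X)] for a discrete distribution."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = E_{X~p}[-log q(X)]."""
    return -np.sum(p * np.log(q))

def kl(p, q):
    """D_KL(p || q) = E_{X~p}[log p(X)/q(X)]."""
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
# Identity: cross-entropy = entropy + KL divergence
print(np.isclose(cross_entropy(p, q), entropy(p) + kl(p, q)))  # True
```

This identity is why, when $p$ is the fixed data distribution, minimizing the KL divergence $D _ {\mathrm{KL}}(p \| q _ \theta)$ over $\theta$ is equivalent to minimizing the cross-entropy (i.e., maximizing the likelihood).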
For binarized data $x \in \{0,1\}^D$ with independent Bernoulli components, the log-likelihood is
\[\log p _ \theta(x)=\sum _ {i=1}^{D}\left[x _ i \log \theta _ i+\left(1-x _ i\right) \log \left(1-\theta _ i\right)\right],\]
where $\theta$ are the parameters of the Bernoulli distribution.
This is essentially the form of the cross-entropy loss.
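A small sketch of this connection, assuming binarized pixels and hypothetical predicted probabilities $\theta$: the negative Bernoulli log-likelihood is exactly the (summed) binary cross-entropy loss.

```python
import numpy as np

def bernoulli_log_likelihood(x, theta):
    """Log p_theta(x) for independent Bernoulli pixels."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

x = np.array([1.0, 0.0, 1.0, 1.0])      # binarized pixels (made up)
theta = np.array([0.9, 0.2, 0.8, 0.7])  # predicted probabilities (made up)

nll = -bernoulli_log_likelihood(x, theta)  # summed binary cross-entropy
print(nll)
```

Maximizing the log-likelihood of the Bernoulli model is therefore the same optimization as minimizing the binary cross-entropy between targets $x$ and predictions $\theta$.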
Note: If the variances in the Gaussian distributions are fixed (i.e., they are constants and not learnable parameters), then maximizing the log likelihood or minimizing the KL divergence between the true distribution and the predicted distribution reduces to optimizing the mean squared error (MSE) between the means of the distributions.
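A quick check of this reduction with fixed $\sigma = 1$ and made-up values: the Gaussian log-likelihood equals a negative multiple of the MSE plus a constant that does not depend on the mean, so maximizing one is minimizing the other.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma=1.0):
    """Log N(x; mu, sigma^2 I) with a fixed, non-learnable sigma."""
    d = x.size
    return (-0.5 * np.sum((x - mu) ** 2) / sigma**2
            - 0.5 * d * np.log(2 * np.pi * sigma**2))

x = np.array([0.2, 0.7, 0.5])    # made-up data
mu = np.array([0.3, 0.6, 0.5])   # made-up predicted means
mse = np.mean((x - mu) ** 2)

# log-likelihood = -(d/2) * MSE - (d/2) * log(2*pi)  when sigma = 1
print(np.isclose(gaussian_log_likelihood(x, mu),
                 -0.5 * x.size * mse - 0.5 * x.size * np.log(2 * np.pi)))  # True
```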
When we want to optimize a function involving stochastic variables, we face an issue because gradients cannot directly flow through sampling operations.
The sampling operation $z \sim q _ \theta(z)$ introduces randomness that interrupts the gradient flow; another way to see this is that $z$ is not a deterministic function of $\theta$.
The issue arises because the sampled value $z$ is treated as a “constant” once sampled, and the gradient of a constant with respect to the distribution’s parameters $\theta$ is zero.
When you sample $z$ from a probability distribution $q _ \theta(z)$, such as $z \sim \mathcal{N}\left(\mu, \sigma^2\right)$, the random variable $z$ is drawn based on the parameters $\theta=(\mu, \sigma)$ (mean and standard deviation). The sampling step itself is non-differentiable: once the sample $z$ is obtained, it becomes a fixed value.
The reparameterization trick allows us to express the random variable $z$ as a deterministic function of the parameters $\theta$ and a separate random variable $\epsilon$.
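A minimal numpy sketch of the trick, with made-up parameter values: writing $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ makes $z$ a deterministic function of $(\mu, \sigma)$, so a pathwise gradient such as $\partial \mathbb{E}[z^2] / \partial \mu = 2\mu$ can be estimated by differentiating through the samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma, n):
    """z = mu + sigma * eps, eps ~ N(0, 1): z is deterministic in (mu, sigma)."""
    eps = rng.standard_normal(n)   # randomness isolated from the parameters
    return mu + sigma * eps

mu, sigma = 1.5, 0.5               # made-up parameters
z = reparameterize(mu, sigma, 100_000)

# Pathwise gradient of E[z^2] w.r.t. mu:
# d/dmu E[(mu + sigma*eps)^2] = E[2(mu + sigma*eps)] = 2*mu
grad_est = np.mean(2 * z)
print(grad_est)  # close to 2*mu = 3.0
```

In an autodiff framework the same structure lets backpropagation flow through $\mu$ and $\sigma$ while the noise $\epsilon$ is treated as a constant input.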