
Diffusion Probabilistic Models

By shinnku on Nov 28, 2023
Generated samples on CelebA-HQ 256 × 256


The original paper covers everything in detail; this post is just a casual summary. P.S.: Paper link

Prior Methods

VAE:

$$\begin{aligned} L_{\mathrm{VAE}}(\theta, \phi) & =-\log p_\theta(\mathbf{x})+D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z} \mid \mathbf{x})\right) \\ & =-\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})} \log p_\theta(\mathbf{x} \mid \mathbf{z})+D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z})\right) \\ \theta^*, \phi^* & =\arg \min_{\theta, \phi} L_{\mathrm{VAE}} \\ -L_{\mathrm{VAE}} & =\log p_\theta(\mathbf{x})-D_{\mathrm{KL}}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p_\theta(\mathbf{z} \mid \mathbf{x})\right) \leq \log p_\theta(\mathbf{x}) \end{aligned}$$
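To make the two ELBO terms concrete, here is a minimal NumPy sketch of the VAE loss under common modeling assumptions (a diagonal-Gaussian encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$, a Bernoulli decoder, and a standard-normal prior); the function and argument names are hypothetical, not from any particular library.

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar):
    """L_VAE = -E_q[log p_theta(x|z)] + KL(q_phi(z|x) || p(z)), with p(z) = N(0, I).

    Single-sample Monte Carlo estimate of the reconstruction term; the KL between
    the diagonal-Gaussian posterior and the standard-normal prior is closed form.
    """
    eps = 1e-7  # numerical guard for the Bernoulli log-likelihood
    recon_nll = -np.sum(x * np.log(x_recon + eps) + (1 - x) * np.log(1 - x_recon + eps))
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return recon_nll + kl
```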

GAN:

$$\begin{gathered} \min_G \max_D L(D, G)=\mathbb{E}_{x \sim p_r(x)}[\log D(x)]+\mathbb{E}_{z \sim p_z(z)}[\log (1-D(G(z)))] \\ =\mathbb{E}_{x \sim p_r(x)}[\log D(x)]+\mathbb{E}_{x \sim p_g(x)}[\log (1-D(x))] \\ L\left(G, D^*\right)=2 D_{\mathrm{JS}}\left(p_r \| p_g\right)-2 \log 2 \end{gathered}$$

Abstract

Below is the abstract (copied verbatim):

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.

Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding.

On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN.

Our implementation is available at https://github.com/hojonathanho/diffusion.

Honestly, the abstract is rather difficult to follow, but that’s fine. The paper is filled with probability calculations—starting with parameterized Markov chains and transition distributions, then variational inference, and finally maximum likelihood—which can be overwhelming.

Background

Diffusion models are latent variable models of the form $p_\theta(\mathbf{x}_0) := \int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T}$, where $\mathbf{x}_1, \ldots, \mathbf{x}_T$ are latents of the same dimensionality as the data $\mathbf{x}_0 \sim q(\mathbf{x}_0)$.

The initial state $\mathbf{x}_0$ is clean data, while the final state $\mathbf{x}_T$ is nearly pure noise.

Reverse Process

$p_\theta(\mathbf{x}_{0:T})$ represents the joint distribution of $\mathbf{x}_0$ through $\mathbf{x}_T$. The parameter $\theta$ is learned during training and defines the transition probabilities between states. This is known as the reverse process, running from $T$ back to $0$.

The joint distribution $p_\theta(\mathbf{x}_{0:T})$ is called the reverse process, and it is defined as a Markov chain with learned Gaussian transitions starting at $p(\mathbf{x}_T)=\mathcal{N}(\mathbf{x}_T ; \mathbf{0}, \mathbf{I})$:

$$\begin{aligned} &p_\theta\left(\mathbf{x}_{0: T}\right):=p\left(\mathbf{x}_T\right) \prod_{t=1}^T p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right) \\ &p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right):=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \boldsymbol{\Sigma}_\theta\left(\mathbf{x}_t, t\right)\right) \end{aligned}$$

Forward Process

In essence, $q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)$ is a Markov chain where the variables $\beta_1, \ldots, \beta_T$ control how much noise is added at each step.

What distinguishes diffusion models from other types of latent variable models is that the approximate posterior $q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)$, called the forward process or diffusion process, is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_1, \ldots, \beta_T$:

$$\begin{aligned} &q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right):=\prod_{t=1}^T q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) \\ &q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right):=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right) \end{aligned}$$
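As a rough illustration (not code from the paper's repository), one forward noising step is only a few lines of NumPy. `linear_beta_schedule` and `q_step` are hypothetical helper names; the schedule endpoints ($\beta_1 = 10^{-4}$, $\beta_T = 0.02$, $T = 1000$) are the constants used in the paper.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Variance schedule beta_1, ..., beta_T (linear, as in the paper)."""
    return np.linspace(beta_1, beta_T, T)

def q_step(x_prev, beta_t, rng=None):
    """One forward step: x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

# Running all T steps gradually destroys the signal in x_0:
betas = linear_beta_schedule()
x = np.ones(4)  # toy "data"
for beta_t in betas:
    x = q_step(x, beta_t)
print(x)  # approximately a draw from N(0, I)
```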

Training

Training is performed by optimizing the usual variational bound on negative log likelihood:

$$\begin{aligned} \mathbb{E}\left[-\log p_\theta\left(\mathbf{x}_0\right)\right] & \leq \mathbb{E}_q\left[-\log \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}\right] \\ & =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}\right] \\ &=: L \end{aligned}$$

(The following derivation notes were generated by ChatGPT-4 and verified by shinnku.)

  1. Decompose the log-likelihood: Start with the log-likelihood of the data $\log p_\theta(\mathbf{x}_0)$, marginalize over the latents, and multiply and divide by the variational posterior $q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)$:

    $$\begin{aligned} \log p_\theta\left(\mathbf{x}_0\right) &=\log \int p_\theta\left(\mathbf{x}_{0: T}\right) d \mathbf{x}_{1: T} \\ &=\log \int \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)} q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right) d \mathbf{x}_{1: T} \end{aligned}$$

    Here, $p_\theta(\mathbf{x}_{0:T})$ is the reverse process.

  2. Apply Jensen’s inequality: Because the log function is concave, Jensen’s inequality lets us move the log inside the integral (i.e., inside the expectation over $q$), giving a lower bound:

    $$\log p_\theta\left(\mathbf{x}_0\right) \geq \int q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right) \log \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)} d \mathbf{x}_{1: T}$$
  3. Expectation form: Rewrite the expression as an expectation:

    $$\mathbb{E}_q\left[\log p_\theta\left(\mathbf{x}_0\right)\right] \geq \mathbb{E}_q\left[\log \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}\right] = -L$$

The forward process variances $\beta_t$ can be learned by reparameterization or held constant as hyperparameters, and expressiveness of the reverse process is ensured in part by the choice of Gaussian conditionals in $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, because both processes have the same functional form when $\beta_t$ are small.

  1. Negative Log Likelihood:
    • The objective is to minimize the negative log likelihood by optimizing the variational bound.
  2. Variational bound:
    • The bound $L$ is an upper bound on the negative log likelihood.
    • It consists of two parts: a term involving $-\log p(\mathbf{x}_T)$ and a sum of log-ratio terms $\log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}$, both under an expectation over $q$.

A notable property of the forward process is that it admits sampling $\mathbf{x}_t$ at an arbitrary timestep $t$ in closed form: using the notation $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^t \alpha_s$, we have

$$q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)=\mathcal{N}\left(\mathbf{x}_t ; \sqrt{\bar{\alpha}_t} \mathbf{x}_0,\left(1-\bar{\alpha}_t\right) \mathbf{I}\right)$$
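Because of this closed form, training can jump straight from $\mathbf{x}_0$ to any $\mathbf{x}_t$ without simulating the whole chain. A minimal NumPy sketch (hypothetical names; `betas` is the schedule from the sketch above):

```python
import numpy as np

def q_sample(x0, t, betas, rng=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).

    `t` is 1-indexed to match the paper's notation.
    """
    rng = rng or np.random.default_rng()
    alphas = 1.0 - betas
    alpha_bar_t = np.prod(alphas[:t])        # alpha_bar_t = prod_{s=1..t} alpha_s
    eps = rng.standard_normal(x0.shape)      # eps ~ N(0, I)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```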

Efficient training is therefore possible by optimizing random terms of $L$ with stochastic gradient descent. Further improvements come from variance reduction by rewriting $L$ (Eq. 3) as:

$$\mathbb{E}_q[\underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \mid \mathbf{x}_0\right) \| p\left(\mathbf{x}_T\right)\right)}_{L_T}+\sum_{t>1} \underbrace{D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)}_{L_{t-1}} \underbrace{-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}_{L_0}]$$

Below is a derivation of the above reduced-variance variational bound for diffusion models. This material follows Sohl-Dickstein et al.:

$$\begin{aligned} L & =\mathbb{E}_q\left[-\log \frac{p_\theta\left(\mathbf{x}_{0: T}\right)}{q\left(\mathbf{x}_{1: T} \mid \mathbf{x}_0\right)}\right] \\ & =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}\right] \\ & =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t>1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}-\log \frac{p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}{q\left(\mathbf{x}_1 \mid \mathbf{x}_0\right)}\right] \\ & =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t>1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)} \cdot \frac{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)}-\log \frac{p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)}{q\left(\mathbf{x}_1 \mid \mathbf{x}_0\right)}\right] \\ & =\mathbb{E}_q\left[-\log \frac{p\left(\mathbf{x}_T\right)}{q\left(\mathbf{x}_T \mid \mathbf{x}_0\right)}-\sum_{t>1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)}-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)\right] \\ & =\mathbb{E}_q\left[D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T \mid \mathbf{x}_0\right) \| p\left(\mathbf{x}_T\right)\right)+\sum_{t>1} D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)-\log p_\theta\left(\mathbf{x}_0 \mid \mathbf{x}_1\right)\right] \end{aligned}$$

The following is an alternate version of $L$. It is not tractable to estimate, but it is useful for our discussion in Section 4.3.

$$\begin{aligned} L & =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)}\right] \\ & =\mathbb{E}_q\left[-\log p\left(\mathbf{x}_T\right)-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)} \cdot \frac{q\left(\mathbf{x}_{t-1}\right)}{q\left(\mathbf{x}_t\right)}\right] \\ & =\mathbb{E}_q\left[-\log \frac{p\left(\mathbf{x}_T\right)}{q\left(\mathbf{x}_T\right)}-\sum_{t \geq 1} \log \frac{p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)}-\log q\left(\mathbf{x}_0\right)\right] \\ & =D_{\mathrm{KL}}\left(q\left(\mathbf{x}_T\right) \| p\left(\mathbf{x}_T\right)\right)+\mathbb{E}_q\left[\sum_{t \geq 1} D_{\mathrm{KL}}\left(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right) \| p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)\right)\right]+H\left(\mathbf{x}_0\right) \end{aligned}$$

Equation (5) uses KL divergence to directly compare $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ against forward process posteriors, which are tractable when conditioned on $\mathbf{x}_0$:

$$\begin{aligned} q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) & =\mathcal{N}\left(\mathbf{x}_{t-1} ; \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right), \tilde{\beta}_t \mathbf{I}\right) \\ \text { where } \quad \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right) & :=\frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} \mathbf{x}_0+\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{x}_t \quad \text { and } \quad \tilde{\beta}_t:=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t \end{aligned}$$
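In code, Eq. (7) reduces to a pair of coefficients; a hedged NumPy sketch (hypothetical helper, 1-indexed $t$, with the convention $\bar{\alpha}_0 := 1$):

```python
import numpy as np

def q_posterior(x0, xt, t, betas):
    """Mean and variance of the forward-process posterior q(x_{t-1} | x_t, x_0)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    alpha_bar_t = alpha_bar[t - 1]
    alpha_bar_prev = alpha_bar[t - 2] if t > 1 else 1.0   # alpha_bar_0 := 1
    beta_t = betas[t - 1]
    # mu_tilde_t(x_t, x_0)
    mean = (np.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)) * x0 \
         + (np.sqrt(alphas[t - 1]) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)) * xt
    # beta_tilde_t
    var = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t
    return mean, var
```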

Consequently, all KL divergences in Eq. (5) are comparisons between Gaussians, so they can be calculated in a Rao-Blackwellized fashion with closed form expressions instead of high variance Monte Carlo estimates.
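For diagonal Gaussians, that closed form is one line. The snippet below (a sketch, not reference code) also evaluates $L_T = D_{\mathrm{KL}}(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T))$ for a toy $\mathbf{x}_0$; it comes out close to zero because $\bar{\alpha}_T$ is tiny under the linear schedule, which is why $L_T$ can be ignored during training (Section 3.1 below).

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL(N(mu1, diag(var1)) || N(mu2, diag(var2))), summed over dimensions."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar_T = np.prod(1.0 - betas)

x0 = np.random.default_rng(0).standard_normal(8)    # toy data point
mu_q = np.sqrt(alpha_bar_T) * x0                    # mean of q(x_T | x_0)
var_q = np.full_like(x0, 1.0 - alpha_bar_T)         # variance of q(x_T | x_0)

L_T = kl_diag_gaussians(mu_q, var_q, np.zeros_like(x0), np.ones_like(x0))
print(L_T)  # ~0: x_T is essentially pure noise
```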

3. Diffusion models and denoising autoencoders

Diffusion models might appear to be a restricted class of latent variable models, but they allow a large number of degrees of freedom in implementation. One must choose the variances $\beta_t$ of the forward process and the model architecture and Gaussian distribution parameterization of the reverse process. To guide our choices, we establish a new explicit connection between diffusion models and denoising score matching (Section 3.2) that leads to a simplified, weighted variational bound objective for diffusion models (Section 3.4). Ultimately, our model design is justified by simplicity and empirical results (Section 4). Our discussion is categorized by the terms of Eq. (5).

3.1 Forward process and $L_T$

We ignore the fact that the forward process variances $\beta_t$ are learnable by reparameterization and instead fix them to constants (see Section 4 for details). Thus, in our implementation, the approximate posterior $q$ has no learnable parameters, so $L_T$ is a constant during training and can be ignored.

3.2 Reverse process and $L_{1:T-1}$

Now we discuss our choices in $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \boldsymbol{\Sigma}_\theta\left(\mathbf{x}_t, t\right)\right)$ for $1<t \leq T$. First, we set $\boldsymbol{\Sigma}_\theta\left(\mathbf{x}_t, t\right)=\sigma_t^2 \mathbf{I}$ to untrained time dependent constants. Experimentally, both $\sigma_t^2=\beta_t$ and $\sigma_t^2=\tilde{\beta}_t=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$ had similar results. The first choice is optimal for $\mathbf{x}_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and the second is optimal for $\mathbf{x}_0$ deterministically set to one point. These are the two extreme choices corresponding to upper and lower bounds on reverse process entropy for data with coordinatewise unit variance.
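The two untrained variance choices are easy to compare numerically; a quick sketch (assuming the linear schedule used above) shows $\beta_t$ and $\tilde{\beta}_t$ are almost identical except at the very first steps, consistent with the two choices giving similar results.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))   # alpha_bar_0 := 1

sigma2_beta = betas                                                      # sigma_t^2 = beta_t
sigma2_beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * betas   # sigma_t^2 = beta_tilde_t

print(sigma2_beta[:3], sigma2_beta_tilde[:3])    # differ noticeably only for small t
print(sigma2_beta[-3:], sigma2_beta_tilde[-3:])  # nearly identical for large t
```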

Second, to represent the mean $\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)$, we propose a specific parameterization motivated by the following analysis of $L_t$. With $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)=\mathcal{N}\left(\mathbf{x}_{t-1} ; \boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right), \sigma_t^2 \mathbf{I}\right)$, we can write:

$$L_{t-1}=\mathbb{E}_q\left[\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)\right\|^2\right]+C$$

where $C$ is a constant that does not depend on $\theta$. So, we see that the most straightforward parameterization of $\boldsymbol{\mu}_\theta$ is a model that predicts $\tilde{\boldsymbol{\mu}}_t$, the forward process posterior mean. However, we can expand Eq. (8) further by reparameterizing Eq. (4) as $\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right)=\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}$ for $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and applying the forward process posterior formula (7):

$$\begin{aligned} L_{t-1}-C & =\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2 \sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right), \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right)-\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}\right)\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right), t\right)\right\|^2\right] \\ & =\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2 \sigma_t^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right)-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}\right)-\boldsymbol{\mu}_\theta\left(\mathbf{x}_t\left(\mathbf{x}_0, \boldsymbol{\epsilon}\right), t\right)\right\|^2\right] \end{aligned}$$

Equation (10) reveals that $\boldsymbol{\mu}_\theta$ must predict $\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}\right)$ given $\mathbf{x}_t$. Since $\mathbf{x}_t$ is available as input to the model, we may choose the parameterization

$$\boldsymbol{\mu}_\theta\left(\mathbf{x}_t, t\right)=\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t\right)\right)\right)=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)$$

where $\boldsymbol{\epsilon}_\theta$ is a function approximator intended to predict $\boldsymbol{\epsilon}$ from $\mathbf{x}_t$. To sample $\mathbf{x}_{t-1} \sim p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$ is to compute $\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right)+\sigma_t \mathbf{z}$, where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The complete sampling procedure, Algorithm 2, resembles Langevin dynamics with $\boldsymbol{\epsilon}_\theta$ as a learned gradient of the data density. Furthermore, with the parameterization (11), Eq. (10) simplifies to:

$$\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t\left(1-\bar{\alpha}_t\right)}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}, t\right)\right\|^2\right]$$

which resembles denoising score matching over multiple noise scales indexed by $t$ [55]. As Eq. (12) is equal to (one term of) the variational bound for the Langevin-like reverse process (11), we see that optimizing an objective resembling denoising score matching is equivalent to using variational inference to fit the finite-time marginal of a sampling chain resembling Langevin dynamics.
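To make this concrete, here is a minimal PyTorch sketch of one training step on a randomly chosen term of Eq. (12) with the $\boldsymbol{\epsilon}$-prediction parameterization. `eps_model` stands in for whatever network predicts $\boldsymbol{\epsilon}$ from $\mathbf{x}_t$ and $t$ (a U-Net in the paper); the other names and the choice $\sigma_t^2 = \beta_t$ are assumptions made for illustration.

```python
import torch

def training_step(eps_model, x0, betas, optimizer):
    """One SGD step on a single randomly chosen term of the bound, Eq. (12)."""
    T = betas.shape[0]
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    t = int(torch.randint(1, T + 1, (1,)))              # t ~ Uniform({1, ..., T})
    a_bar_t = alpha_bar[t - 1]
    eps = torch.randn_like(x0)                          # eps ~ N(0, I)
    x_t = torch.sqrt(a_bar_t) * x0 + torch.sqrt(1.0 - a_bar_t) * eps

    eps_pred = eps_model(x_t, t)
    sigma2_t = betas[t - 1]                             # using sigma_t^2 = beta_t
    weight = betas[t - 1] ** 2 / (2 * sigma2_t * alphas[t - 1] * (1.0 - a_bar_t))
    loss = weight * torch.sum((eps - eps_pred) ** 2)    # Eq. (12)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```

Dropping the `weight` factor gives the unweighted "simple" objective the paper introduces in its Section 3.4.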

To summarize, we can train the reverse process mean function approximator $\boldsymbol{\mu}_\theta$ to predict $\tilde{\boldsymbol{\mu}}_t$, or by modifying its parameterization, we can train it to predict $\boldsymbol{\epsilon}$. (There is also the possibility of predicting $\mathbf{x}_0$, but we found this to lead to worse sample quality early in our experiments.) We have shown that the $\boldsymbol{\epsilon}$-prediction parameterization both resembles Langevin dynamics and simplifies the diffusion model’s variational bound to an objective that resembles denoising score matching. Nonetheless, it is just another parameterization of $p_\theta\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t\right)$, so we verify its effectiveness in Section 4 in an ablation where we compare predicting $\boldsymbol{\epsilon}$ against predicting $\tilde{\boldsymbol{\mu}}_t$.
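The corresponding sampling loop (what the text above calls Algorithm 2), again as a hedged PyTorch sketch with the hypothetical `eps_model`:

```python
import torch

@torch.no_grad()
def sample(eps_model, shape, betas):
    """Ancestral sampling with the epsilon-parameterization of the reverse process."""
    T = betas.shape[0]
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                              # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = torch.randn(shape) if t > 1 else torch.zeros(shape)  # no noise on the final step
        eps_pred = eps_model(x, t)
        # mu_theta(x_t, t) = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_t)
        mean = (x - betas[t - 1] / torch.sqrt(1.0 - alpha_bar[t - 1]) * eps_pred) / torch.sqrt(alphas[t - 1])
        sigma_t = torch.sqrt(betas[t - 1])              # sigma_t^2 = beta_t
        x = mean + sigma_t * z
    return x                                            # the generated x_0
```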

Thanks for reading this far. - Shinnku