DDPM (ELBO vs Stochastic DE)

Hello,

I decided to ask here as I didn’t know which forum could help. I just took extensive notes after watching the MIT course Flow Matching and Diffusion Models. Prior to that, I viewed other material that dealt with VAEs and presented the math using the Evidence Lower Bound (ELBO). There is also the Tutorial on Diffusion Models for Imaging and Vision (arXiv:2403.18103).

The following is from the MIT course notes, and I need some mathematical support or code examples to understand why the ELBO is not favoured by the instructors. Am I misunderstanding this?

Discrete time vs. continuous time. The first denoising diffusion model papers [41, 42, 17] did not use SDEs but constructed Markov chains in discrete time, i.e. with time steps t = 0, 1, 2, 3, . . . . To this date, you will find a lot of works in the literature working with this discrete-time formulation. While this construction is appealing due to its simplicity, the disadvantage of the time-discrete approach is that it forces you to choose a time discretization before training. Further, the loss function needs to be approximated via an evidence lower bound (ELBO) - which is, as the name suggests, only a lower bound to the loss we actually want to minimize. Later, Song et al. [45] showed that these constructions were essentially an approximation of a time-continuous SDEs. Further, the ELBO loss becomes tight (i.e. it is not a lower bound anymore) in the continuous time case (e.g. note that Theorem 12 and Theorem 22 are equalities and not lower bounds - this would be different in the discrete time case). This made the SDE construction popular because it was considered mathematically “cleaner” and that one could control the simulation error via ODE/SDE samplers post training. It is important to note however that both models employ the same loss and are not fundamentally different.

Thanks,

Mohan

Hmm… This relationship is apparently quite easy to get tangled up in:


A useful way to read the “DDPM ELBO vs stochastic differential equation” issue is that these are not really two competing theories of diffusion models. They are more like different coordinate systems for describing almost the same family of generative models.

My current understanding is:

  • DDPM / ELBO view: natural if you start from a discrete-time latent-variable model.
  • Denoising score matching view: natural if you want to explain why the practical noise-prediction loss works.
  • Continuous-time SDE view: natural if you want a unified forward/reverse process and sampler story.
  • Flow matching view: natural if you want an even more direct vector-field regression story.

So I would be careful with the phrasing “ELBO is not favored”. A more precise statement is probably:

The ELBO is the natural objective when DDPM is presented as a discrete-time latent-variable model, but the score/SDE/flow views often give a cleaner explanation of the practical training objective and the sampling dynamics. That does not mean the ELBO view is wrong, obsolete, or unrelated.

1. The short version

| Question | My answer |
|---|---|
| Is the DDPM ELBO wrong? | No. It is the natural variational objective for a discrete-time latent-variable Markov chain. |
| Is the simplified DDPM loss “just” an ELBO? | Not exactly. It can be derived from a simplified / reweighted variational bound, but it is also very naturally interpreted as denoising score matching or noise prediction. |
| Is the SDE view cleaner? | Often yes, especially conceptually. It directly says: learn the score of noisy marginals, then run a reverse-time SDE or probability-flow ODE. |
| Does the SDE view eliminate ELBOs? | No. Continuous-time diffusion also has variational / likelihood lower-bound interpretations. |
| What does “equality” usually mean in score matching or flow matching notes? | Usually equality up to a parameter-independent constant between an intractable marginal objective and a tractable conditional objective. It does not mean the whole generative modeling problem becomes exact. |
| Are DDPM, score-based diffusion, SDE diffusion, and flow matching different things? | They can be different parameterizations / discretizations / objectives, but many common cases are deeply equivalent or transformable. |

2. DDPM was not “ELBO only” even in the original paper

The original Denoising Diffusion Probabilistic Models paper presents DDPMs as latent-variable models trained through a variational bound, but the paper already emphasizes a connection to denoising score matching. The project page also summarizes the method as using “a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics”:
DDPM project page.

So I would not frame DDPM as:

DDPM = ELBO, while score/SDE = something completely different.

A better framing is:

DDPM starts from a discrete-time variational latent-variable model, but its most useful simplified objective is already closely related to denoising score matching.

This is also why DDPM implementations often look like simple noise prediction rather than like a textbook VAE objective.
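For reference, this is roughly how the two objectives look in the notation of the DDPM paper, as far as I can reconstruct them. Here \alpha_t = 1 - \beta_t, \bar\alpha_t is the running product of the \alpha_s, \sigma_t^2 is the fixed reverse-process variance, and C collects terms that do not depend on \theta:

```latex
% Variational bound on the negative log-likelihood (the "ELBO" form)
\mathbb{E}\big[-\log p_\theta(x_0)\big]
  \le \mathbb{E}_q\Big[
        D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)
      + \sum_{t>1} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)
      - \log p_\theta(x_0 \mid x_1)
      \Big]

% One KL term, written with the noise-prediction parameterization
L_{t-1} = \mathbb{E}_{x_0,\epsilon}\Big[
      \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}
      \big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\big)\big\|^2
      \Big] + C

% Simplified objective used in practice: same regression, time-dependent weight dropped
L_{\mathrm{simple}} = \mathbb{E}_{t,x_0,\epsilon}\Big[
      \big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\big)\big\|^2
      \Big]
```

The only difference between the last two lines is the weight in front of the squared error, which is why the practical loss is best described as a reweighted variational bound rather than the bound itself.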

3. Why the ELBO derivation feels indirect

In the discrete DDPM story, the forward process adds noise step by step:

  • start with data;
  • define a fixed noising Markov chain;
  • learn a reverse Markov chain;
  • introduce latent variables for all intermediate noisy states;
  • optimize a variational lower bound on the data likelihood;
  • simplify / reweight the resulting terms;
  • end up with a denoising-style objective, usually noise prediction.

This is legitimate, but it can feel roundabout because the implemented objective often looks much simpler:

sample a timestep, corrupt the data with Gaussian noise, and train a network to predict the noise / clean data / score / velocity.
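As a minimal sketch of that one-line recipe, here is a NumPy version of a single noise-prediction loss evaluation. The names `make_alpha_bar`, `noise_prediction_loss`, and `zero_predictor` are hypothetical helpers of mine, the predictor stands in for a neural network, and the linear schedule values are the commonly quoted DDPM defaults, used only for illustration:

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_prediction_loss(predict_noise, x0, alpha_bar, rng):
    """One Monte Carlo estimate of the simplified (unweighted) DDPM loss."""
    batch = x0.shape[0]
    # 1. Sample one timestep per example.
    t = rng.integers(0, len(alpha_bar), size=batch)
    # 2. Corrupt the data in closed form: x_t = sqrt(a) x0 + sqrt(1 - a) eps.
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t].reshape(batch, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    # 3. Regress the injected noise.
    eps_hat = predict_noise(x_t, t)
    return np.mean((eps - eps_hat) ** 2)

def zero_predictor(x_t, t):
    """Placeholder for a trained network: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4096, 2))  # toy "data", not a real dataset
loss = noise_prediction_loss(zero_predictor, x0, alpha_bar, rng)
print(f"loss of the zero predictor: {loss:.3f}")
```

Nothing in this loop mentions KL terms or latent variables explicitly, which is why the implemented objective can look disconnected from the ELBO derivation that motivates it.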

That is why tutorials like Calvin Luo’s Understanding Diffusion Models: A Unified Perspective are useful: they explicitly connect the variational perspective and the score-based perspective. The tutorial derives variational diffusion models as a special case of a Markovian hierarchical VAE, then shows that optimization can be viewed as predicting one of several equivalent targets, such as the clean input, the injected noise, or the score.

The blog version is also readable:
Understanding Diffusion Models: A Unified Perspective.

4. Why the SDE view feels cleaner

The SDE view, especially from Score-Based Generative Modeling through Stochastic Differential Equations, says something like this:

  1. Define a continuous-time forward process that gradually turns data into noise.
  2. The reverse-time process exists and depends on the time-dependent score of the perturbed data distribution.
  3. Learn that score with a neural network.
  4. Generate samples by solving the reverse-time SDE, or a related probability-flow ODE.
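In symbols, and using the notation I remember from the Song et al. paper (drift f, diffusion coefficient g, Brownian motion w, reverse-time Brownian motion \bar{w}, and p_t the marginal density of the noised data at time t), the four steps correspond to:

```latex
% Step 1: forward noising SDE
dx = f(x,t)\,dt + g(t)\,dw

% Step 2: reverse-time SDE, driven by the score \nabla_x \log p_t(x)
dx = \big[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar{w}

% Step 3: learn s_\theta(x,t) \approx \nabla_x \log p_t(x)

% Step 4 (alternative): probability-flow ODE with the same marginals p_t
dx = \big[f(x,t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)\big]\,dt
```

The only learned quantity is the score; everything else is fixed by the choice of forward process.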

This is conceptually clean because it separates several ideas that are somewhat entangled in the discrete DDPM presentation:

| Component | DDPM / discrete view | SDE view |
|---|---|---|
| Forward process | finite noising chain | continuous-time noising SDE |
| Reverse process | learned denoising Markov chain | reverse-time SDE using the score |
| Training target | variational bound, often simplified to noise prediction | score / denoising score matching |
| Sampling | fixed or chosen denoising schedule | numerical SDE / ODE solver |
| Discretization | built into the model description | often treated as a solver choice |

This is why the SDE formulation often feels more “modern” or more “principled”. It gives a unified language for DDPM-like variance-preserving processes, score-based / variance-exploding processes, reverse SDEs, probability-flow ODEs, predictor-corrector samplers, etc.

Yang Song’s blog post Generative Modeling by Estimating Gradients of the Data Distribution is a very good intuitive bridge here. The official code repository is also useful for orientation:
score_sde.

5. But “SDE is exact, ELBO is only a lower bound” is too compressed

There is a subtle but important distinction here.

In score matching / denoising score matching, one often proves that an intractable score-matching objective and a tractable denoising score-matching objective differ only by a parameter-independent constant. In flow matching, a similar pattern appears: an intractable marginal flow matching objective can be replaced by a tractable conditional flow matching objective, again with the same optimizer / gradient up to terms independent of the model.

This is probably the kind of “equality” that many modern lecture notes are referring to.
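Written out, the pattern looks like this. The notation is my own shorthand: s_\theta is the score network, u_\theta the velocity network, p_t(\cdot \mid x_0) the Gaussian conditional path, and C_t, C are constants that do not depend on \theta:

```latex
% Denoising score matching: marginal objective = conditional objective + constant
\mathbb{E}_{x_t \sim p_t}\big\|s_\theta(x_t,t) - \nabla_{x_t}\log p_t(x_t)\big\|^2
  = \mathbb{E}_{x_0 \sim p_{\mathrm{data}},\ x_t \sim p_t(\cdot \mid x_0)}
    \big\|s_\theta(x_t,t) - \nabla_{x_t}\log p_t(x_t \mid x_0)\big\|^2 + C_t

% Flow matching: same structure with velocity fields
L_{\mathrm{FM}}(\theta) = L_{\mathrm{CFM}}(\theta) + C
```

Both lines are statements about two regression losses having the same gradient in \theta. Neither says anything by itself about the log-likelihood of the resulting sampler.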

But that does not mean:

continuous-time SDE diffusion gives exact maximum likelihood with no variational or approximation issue.

The approximations just move to different places:

  • the score network is approximate;
  • the SDE / ODE solver is numerical;
  • likelihood computation, if needed, has its own assumptions and costs;
  • the training objective may still be a surrogate for the downstream metric one cares about;
  • weighting across noise levels matters a lot.
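To illustrate just the solver item in that list, here is a toy sketch where the score is known exactly, so any error comes purely from the time discretization. Everything here is an assumption of mine for illustration: 1-D Gaussian data N(0, 0.5^2), a variance-preserving SDE with constant rate `BETA`, a plain Euler-Maruyama integrator, and the helper names `marginal_var`, `exact_score`, and `reverse_sde_sample`:

```python
import numpy as np

BETA = 10.0      # constant noise rate of dx = -0.5*BETA*x dt + sqrt(BETA) dw (assumed)
DATA_STD = 0.5   # toy data distribution N(0, DATA_STD^2) (assumed)

def marginal_var(t):
    """Variance of x_t when x_0 ~ N(0, DATA_STD^2) under the forward SDE."""
    decay = np.exp(-BETA * t)
    return DATA_STD**2 * decay + (1.0 - decay)

def exact_score(x, t):
    """Score of the Gaussian marginal N(0, marginal_var(t)); no network involved."""
    return -x / marginal_var(t)

def reverse_sde_sample(num_steps, num_samples, rng, t_end=1.0):
    """Euler-Maruyama integration of the reverse-time SDE from t_end down to 0."""
    dt = t_end / num_steps
    # Start from the exact marginal at t_end (very close to N(0, 1) here).
    x = rng.standard_normal(num_samples) * np.sqrt(marginal_var(t_end))
    for k in range(num_steps):
        t = t_end - k * dt
        # Reverse drift: f(x, t) - g(t)^2 * score(x, t).
        drift = -0.5 * BETA * x - BETA * exact_score(x, t)
        # Step backwards in time, adding fresh noise of variance BETA * dt.
        x = x - drift * dt + np.sqrt(BETA * dt) * rng.standard_normal(num_samples)
    return x

rng = np.random.default_rng(0)
coarse = reverse_sde_sample(num_steps=10, num_samples=20000, rng=rng)
fine = reverse_sde_sample(num_steps=1000, num_samples=20000, rng=rng)
err_coarse = abs(coarse.std() - DATA_STD)
err_fine = abs(fine.std() - DATA_STD)
print(f"std error with 10 steps:   {err_coarse:.3f}")
print(f"std error with 1000 steps: {err_fine:.3f}")
```

Even with a perfect score, the coarse run misses the data distribution badly while the fine run gets close. In the continuous-time view this is a solver choice that can be revisited after training, but it is still an approximation.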

So the safe version is:

The score/SDE/flow formulations often make the training target cleaner, but they do not magically remove all approximation from generative modeling.

6. Continuous-time SDEs also have a variational interpretation

A key paper here is A Variational Perspective on Diffusion-Based Generative Models and Score Matching. It develops a variational framework for continuous-time generative diffusion and connects score matching to likelihood lower bounds for the plug-in reverse SDE.

This is important because it prevents the mistaken dichotomy:

discrete DDPM = ELBO
continuous SDE = no ELBO

The relationship is more like:

discrete DDPM has a natural ELBO derivation; continuous-time diffusion also admits variational / likelihood lower-bound interpretations; score matching and ELBO are connected rather than opposed.

Another relevant paper is Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation, which argues that commonly used diffusion objectives are closely related to ELBOs over different noise levels. The OpenReview page is here:
OpenReview: Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation.

This makes the statement “ELBO is not favored” look too strong. A better statement is:

In practice, people often do not optimize the original discrete DDPM ELBO literally. They optimize reweighted denoising objectives that are easier to train and often better for perceptual quality. But these objectives remain closely connected to ELBO-like interpretations.

7. The practical DDPM loss and score matching are very close

In many Gaussian diffusion setups, predicting the injected noise is equivalent to predicting the score up to a known scaling. This is why people can talk about:

  • epsilon prediction;
  • x0 prediction;
  • v prediction;
  • score prediction;

as different parameterizations of closely related training targets.
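To make the "known scaling" concrete, here is a small numerical check, assuming the standard variance-preserving corruption x_t = sqrt(alpha_bar) x_0 + sqrt(1 - alpha_bar) eps. The value `alpha_bar = 0.3` and the array shapes are arbitrary choices of mine, and the v target uses the definition v = sqrt(alpha_bar) eps - sqrt(1 - alpha_bar) x_0 as I know it from the v-prediction literature:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar = 0.3                      # arbitrary noise level for illustration
x0 = rng.standard_normal((5, 3))     # toy clean data
eps = rng.standard_normal((5, 3))    # injected noise
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Score of the Gaussian conditional q(x_t | x_0) = N(sqrt(alpha_bar) x0, (1 - alpha_bar) I).
cond_score = -(x_t - np.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)
# The same quantity obtained from the noise alone: a fixed rescaling of eps.
score_from_eps = -eps / np.sqrt(1.0 - alpha_bar)

# Clean data recovered from the noise target.
x0_from_eps = (x_t - np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha_bar)

# v target and the clean data recovered from it.
v = np.sqrt(alpha_bar) * eps - np.sqrt(1.0 - alpha_bar) * x0
x0_from_v = np.sqrt(alpha_bar) * x_t - np.sqrt(1.0 - alpha_bar) * v

print(np.allclose(cond_score, score_from_eps))
print(np.allclose(x0, x0_from_eps), np.allclose(x0, x0_from_v))
```

Given x_t and the noise level, any one of these targets determines the others by a linear map. What changes between them is the effective weighting of the squared error across noise levels.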

The exact equivalence depends on the noise schedule, parameterization, and loss weighting. So I would avoid saying “they are exactly the same loss” without qualifications. But conceptually:

DDPM noise prediction is one parameterization of denoising score estimation.

That is also why the Hugging Face implementation-oriented materials often present DDPM mostly as noise prediction and scheduling rather than as an explicit ELBO optimizer. See:
The Annotated Diffusion Model and Diffusers DDPM docs.

8. Where flow matching fits

Flow matching adds one more viewpoint.

The blog/paper Diffusion Meets Flow Matching: Two Sides of the Same Coin argues that diffusion models and Gaussian flow matching are deeply connected; different model specifications can lead to different network outputs, schedules, and loss weightings, while describing essentially the same generative model in many common cases.
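Here is a minimal sketch of a conditional flow matching training pair, assuming a linear Gaussian path with noise at t = 0 and data at t = 1, i.e. x_t = t * data + (1 - t) * noise. The function names `sample_pair`, `cfm_loss`, and `zero_velocity` are hypothetical, and the predictor is a placeholder for a network:

```python
import numpy as np

def sample_pair(data, rng):
    """Build one batch of (x_t, t, target velocity, noise) for the linear path."""
    t = rng.uniform(0.0, 1.0, size=(data.shape[0], 1))
    noise = rng.standard_normal(data.shape)
    x_t = t * data + (1.0 - t) * noise
    # Time derivative of the conditional path: d/dt x_t = data - noise.
    target = data - noise
    return x_t, t, target, noise

def cfm_loss(predict_velocity, x_t, t, target):
    """Plain mean squared error between predicted and target velocity."""
    return np.mean((predict_velocity(x_t, t) - target) ** 2)

def zero_velocity(x_t, t):
    """Placeholder for a trained network: always predicts zero velocity."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
data = rng.standard_normal((4096, 2))  # toy "data", not a real dataset
x_t, t, target, noise = sample_pair(data, rng)
loss = cfm_loss(zero_velocity, x_t, t, target)

# The velocity target converts back to noise and data predictions.
noise_from_velocity = x_t - t * target
data_from_velocity = x_t + (1.0 - t) * target
print(np.allclose(noise_from_velocity, noise), np.allclose(data_from_velocity, data))
```

Structurally this is the same loop as noise prediction: sample a time, interpolate between data and Gaussian noise, regress a linear combination of the two. That is the sense in which the network output, schedule, and weighting differ while the model family largely does not.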

The MIT notes An Introduction to Flow Matching and Diffusion Models are also useful because they put ODEs, SDEs, flow matching, score matching, and modern diffusion models in one framework. The course page is here:
Flow Matching and Diffusion Models.

For a more implementation-oriented flow matching reference, see Meta’s Flow Matching Guide and Code and the associated arXiv paper:
Flow Matching Guide and Code.

This reinforces the same point:

These frameworks often differ less in the underlying generative family and more in the chosen probability path, parameterization, objective weighting, and numerical sampler.

9. A useful mental model

I would summarize the relationship like this:

| View | Natural starting point | Main object learned | What it explains well | What can be confusing |
|---|---|---|---|---|
| DDPM / ELBO | discrete latent-variable model | reverse denoising kernels, often parameterized by noise prediction | why diffusion can be trained as a variational model | the practical simple loss can look disconnected from the ELBO |
| Denoising score matching | noisy data distributions | score or equivalent noise target | why denoising regression works | likelihood interpretation needs extra work |
| Continuous-time SDE | forward and reverse stochastic processes | time-dependent score | unification of DDPM, score models, reverse SDE, probability-flow ODE | SDE notation can hide discretization and approximation issues |
| Flow matching | probability paths and vector fields | velocity / vector field | direct regression objective and ODE sampling | relationship to diffusion depends on path, parameterization, and weighting |

10. My answer to the original confusion

If the question is:

Is the DDPM ELBO less favored because the SDE formulation gives an equality instead of a lower bound?

I would answer:

Not exactly.

The DDPM ELBO is the natural derivation when DDPM is treated as a discrete-time latent-variable model. It is not wrong, and it is not merely historical baggage. However, the simplified objective used in practice is more directly understood as denoising score matching / noise prediction. The continuous-time SDE formulation makes this score-based interpretation cleaner and unifies the sampling dynamics.

The “equality” in denoising score matching or flow matching usually refers to equality up to a parameter-independent constant between a hard marginal objective and an easier conditional objective. It should not be interpreted as “the whole likelihood problem is now exact and ELBOs are obsolete.”

In fact, continuous-time diffusion also has variational interpretations, and modern diffusion objectives can often be understood as ELBO-related objectives with particular noise-level weightings. So the best mental model is not:

ELBO versus SDE

but rather:

ELBO, score matching, SDEs, and flow matching are different but tightly connected views of the same underlying denoising / transport problem.

11. Suggested reading order

If someone wants to untangle this efficiently, I would read in this order:

| Order | Resource | Why read it |
|---|---|---|
| 1 | Understanding Diffusion Models: A Unified Perspective | Best bridge between variational and score-based views. |
| 2 | DDPM | Original DDPM paper; shows the ELBO and denoising score matching connection. |
| 3 | Yang Song’s score-based generative modeling blog | Intuitive score matching and SDE explanation. |
| 4 | Score-Based Generative Modeling through SDEs | Main SDE unification paper. |
| 5 | A Variational Perspective on Diffusion-Based Generative Models and Score Matching | Shows continuous-time diffusion still has a variational / likelihood-bound interpretation. |
| 6 | Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation | Explains why practical diffusion objectives remain closely related to ELBOs. |
| 7 | Diffusion Meets Flow Matching | Explains the diffusion / flow matching connection in a readable way. |
| 8 | An Introduction to Flow Matching and Diffusion Models | Comprehensive modern notes covering ODEs, SDEs, flow matching, score matching, and guidance. |

12. One-sentence version

DDPM’s ELBO, denoising score matching, continuous-time SDE diffusion, and flow matching are not mutually exclusive explanations; the ELBO view gives a latent-variable / likelihood-bound story, while the score/SDE/flow views often give a cleaner training-and-sampling story, and modern theory connects these views rather than replacing one with another.