Hmm… This relationship is apparently quite easy to get tangled up in:
A useful way to read the “DDPM ELBO vs stochastic differential equation” issue is that these are not really two competing theories of diffusion models. They are more like different coordinate systems for describing almost the same family of generative models.
My current understanding is:
- DDPM / ELBO view: natural if you start from a discrete-time latent-variable model.
- Denoising score matching view: natural if you want to explain why the practical noise-prediction loss works.
- Continuous-time SDE view: natural if you want a unified forward/reverse process and sampler story.
- Flow matching view: natural if you want an even more direct vector-field regression story.
So I would be careful with the phrasing “ELBO is not favored”. A more precise statement is probably:
The ELBO is the natural objective when DDPM is presented as a discrete-time latent-variable model, but the score/SDE/flow views often give a cleaner explanation of the practical training objective and the sampling dynamics. That does not mean the ELBO view is wrong, obsolete, or unrelated.
1. The short version
| Question | My answer |
| --- | --- |
| Is the DDPM ELBO wrong? | No. It is the natural variational objective for a discrete-time latent-variable Markov chain. |
| Is the simplified DDPM loss “just” an ELBO? | Not exactly. It can be derived from a simplified / reweighted variational bound, but it is also very naturally interpreted as denoising score matching or noise prediction. |
| Is the SDE view cleaner? | Often yes, especially conceptually. It directly says: learn the score of noisy marginals, then run a reverse-time SDE or probability-flow ODE. |
| Does the SDE view eliminate ELBOs? | No. Continuous-time diffusion also has variational / likelihood lower-bound interpretations. |
| What does “equality” usually mean in score matching or flow matching notes? | Usually equality up to a parameter-independent constant between an intractable marginal objective and a tractable conditional objective. It does not mean the whole generative modeling problem becomes exact. |
| Are DDPM, score-based diffusion, SDE diffusion, and flow matching different things? | They can be different parameterizations / discretizations / objectives, but many common cases are deeply equivalent or transformable. |
2. DDPM was not “ELBO only” even in the original paper
The original Denoising Diffusion Probabilistic Models paper presents DDPMs as latent-variable models trained through a variational bound, but the paper already emphasizes a connection to denoising score matching. The project page also summarizes the method as using “a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics”:
DDPM project page.
So I would not frame DDPM as:
DDPM = ELBO, while score/SDE = something completely different.
A better framing is:
DDPM starts from a discrete-time variational latent-variable model, but its most useful simplified objective is already closely related to denoising score matching.
This is also why DDPM implementations often look like simple noise prediction rather than like a textbook VAE objective.
3. Why the ELBO derivation feels indirect
In the discrete DDPM story, the derivation proceeds step by step:
- start with data;
- define a fixed noising Markov chain;
- learn a reverse Markov chain;
- introduce latent variables for all intermediate noisy states;
- optimize a variational lower bound on the data likelihood;
- simplify / reweight the resulting terms;
- end up with a denoising-style objective, usually noise prediction.
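For reference, the steps above can be written compactly. This is a sketch in the notation of the DDPM paper (Ho et al., 2020), where $q$ is the fixed noising chain, $p_\theta$ the learned reverse chain, and $\bar\alpha_t$ the cumulative product of $1-\beta_t$; it is meant as orientation, not as a full derivation.

```latex
% Variational bound on the negative log-likelihood, split into per-step terms
-\log p_\theta(x_0)
  \;\le\;
  \mathbb{E}_q\Big[
    \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T}
    + \sum_{t=2}^{T}
      \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}}
    \underbrace{-\,\log p_\theta(x_0 \mid x_1)}_{L_0}
  \Big]

% Simplified (reweighted) objective obtained after the noise-prediction parameterization
L_{\mathrm{simple}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,\epsilon}
    \Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big)\big\|^2\Big]
```

The second expression drops the per-timestep weights that appear when each $L_{t-1}$ is written as a noise-prediction error, which is why it is a reweighted bound rather than the bound itself.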
This is legitimate, but it can feel roundabout because the implemented objective often looks much simpler:
sample a timestep, corrupt the data with Gaussian noise, and train a network to predict the noise / clean data / score / velocity.
That is why tutorials like Calvin Luo’s Understanding Diffusion Models: A Unified Perspective are useful: they explicitly connect the variational perspective and the score-based perspective. The tutorial derives variational diffusion models as a special case of a Markovian hierarchical VAE, then shows that optimization can be viewed as predicting one of several equivalent targets, such as the clean input, the injected noise, or the score.
The blog version is also readable:
Understanding Diffusion Models: A Unified Perspective.
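To make the "sample a timestep, corrupt, predict the noise" recipe from this section concrete, here is a minimal NumPy sketch. The function names (`linear_beta_schedule`, `q_sample`, `simple_loss`) and the placeholder `zero_model` are my own illustrative choices, not any library's API; the schedule endpoints are the linear-schedule values reported in the DDPM paper. A real implementation would use a neural network and an autodiff framework.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (endpoints as reported in the DDPM paper)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, noise):
    """Closed-form forward corruption: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = np.sqrt(alpha_bar[t])[:, None]
    s = np.sqrt(1.0 - alpha_bar[t])[:, None]
    return a * x0 + s * noise

def simple_loss(eps_model, x0, alpha_bar, rng):
    """Monte Carlo estimate of the simplified noise-prediction loss."""
    batch = x0.shape[0]
    t = rng.integers(0, len(alpha_bar), size=batch)   # one random timestep per example
    noise = rng.standard_normal(x0.shape)             # the regression target
    x_t = q_sample(x0, t, alpha_bar, noise)
    pred = eps_model(x_t, t)
    return np.mean((pred - noise) ** 2)

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 2))                    # toy 2-D "data" (assumption)

# Hypothetical placeholder for a network: always predicts zero noise.
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss_value = simple_loss(zero_model, x0, alpha_bar, rng)
```

With the zero predictor, the loss is just the average squared norm of the sampled noise per coordinate, so it sits near 1; any useful network has to do better than that baseline.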
4. Why the SDE view feels cleaner
The SDE view, especially from Score-Based Generative Modeling through Stochastic Differential Equations, says something like this:
- Define a continuous-time forward process that gradually turns data into noise.
- The reverse-time process exists and depends on the time-dependent score of the perturbed data distribution.
- Learn that score with a neural network.
- Generate samples by solving the reverse-time SDE, or a related probability-flow ODE.
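In symbols, the recipe above is usually written as follows. This is the standard form from the score-SDE literature, with drift $f$, diffusion coefficient $g$, forward Wiener process $w$, reverse-time Wiener process $\bar w$, and $p_t$ the marginal density of the noised data at time $t$.

```latex
% Forward (noising) SDE
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w

% Reverse-time SDE (time runs from T down to 0)
\mathrm{d}x = \big[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}

% Probability-flow ODE with the same marginals p_t
\mathrm{d}x = \big[f(x, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)\big]\,\mathrm{d}t
```

The only unknown in the last two equations is the score $\nabla_x \log p_t(x)$, which is what the network $s_\theta(x, t)$ is trained to approximate.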
This is conceptually clean because it separates several ideas that are somewhat entangled in the discrete DDPM presentation:
| Component | DDPM / discrete view | SDE view |
| --- | --- | --- |
| Forward process | finite noising chain | continuous-time noising SDE |
| Reverse process | learned denoising Markov chain | reverse-time SDE using the score |
| Training target | variational bound, often simplified to noise prediction | score / denoising score matching |
| Sampling | fixed or chosen denoising schedule | numerical SDE / ODE solver |
| Discretization | built into the model description | often treated as a solver choice |
This is why the SDE formulation often feels more “modern” or more “principled”. It gives a unified language for DDPM-like variance-preserving processes, score-based / variance-exploding processes, reverse SDEs, probability-flow ODEs, predictor-corrector samplers, etc.
Yang Song’s blog post Generative Modeling by Estimating Gradients of the Data Distribution is a very good intuitive bridge here. The official code repository is also useful for orientation:
score_sde.
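As a toy illustration of "learn the score, then solve the reverse-time SDE", here is a NumPy sketch where the score is known in closed form, so no network is needed. The setup is my own assumption for illustration: 1-D Gaussian data with variance `DATA_VAR`, a variance-preserving SDE with constant rate `BETA`, and a plain Euler-Maruyama solver. It is not taken from the score_sde repository.

```python
import numpy as np

BETA = 10.0       # constant noise rate of the VP SDE (illustrative choice)
T = 1.0           # final time; exp(-BETA * T) is tiny, so p_T is close to N(0, 1)
DATA_VAR = 0.25   # toy data distribution: N(0, DATA_VAR)

def marginal_var(t):
    """Variance of x_t under dx = -0.5*BETA*x dt + sqrt(BETA) dw with Gaussian data."""
    decay = np.exp(-BETA * t)
    return DATA_VAR * decay + (1.0 - decay)

def true_score(x, t):
    """Exact score of the Gaussian marginal p_t; stands in for a trained network."""
    return -x / marginal_var(t)

def reverse_sde_sample(score_fn, num_samples=20000, num_steps=1000, seed=0):
    """Euler-Maruyama on the reverse-time SDE, stepping from t = T down to t = 0."""
    rng = np.random.default_rng(seed)
    dt = T / num_steps
    x = rng.standard_normal(num_samples)                 # start from the prior
    for k in range(num_steps):
        t = T - k * dt
        drift = -0.5 * BETA * x - BETA * score_fn(x, t)  # f - g^2 * score
        noise = rng.standard_normal(num_samples)
        x = x - drift * dt + np.sqrt(BETA * dt) * noise  # one step backward in time
    return x

samples = reverse_sde_sample(true_score)
```

With the exact score, the sample variance lands close to `DATA_VAR`, up to discretization and Monte Carlo error. Swapping the solver or the number of steps changes only the sampler, not the model, which is the sense in which discretization becomes "a solver choice" in this view.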
5. But “SDE is exact, ELBO is only a lower bound” is too compressed
There is a subtle but important distinction here.
In score matching / denoising score matching, one often proves that an intractable score-matching objective and a tractable denoising score-matching objective differ only by a parameter-independent constant. In flow matching, a similar pattern appears: an intractable marginal flow matching objective can be replaced by a tractable conditional flow matching objective, which has the same gradients with respect to the model parameters (and therefore the same minimizers), because the two differ only by terms independent of the model.
This is probably the kind of “equality” that many modern lecture notes are referring to.
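Written out, the two "equalities" look like this. The notation is my own shorthand: $s_\theta$ is the score model, $v_\theta$ the velocity model, $u_t$ the marginal velocity field, $u_t(\cdot \mid x_1)$ the conditional one, and $C_t$, $C'$ are constants that do not depend on $\theta$.

```latex
% Score matching vs. denoising score matching at a fixed noise level t
J_{\mathrm{SM}}(\theta)
  = \tfrac{1}{2}\,\mathbb{E}_{x_t \sim p_t}
    \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t) \big\|^2
\qquad
J_{\mathrm{DSM}}(\theta)
  = \tfrac{1}{2}\,\mathbb{E}_{x_0,\; x_t \sim p_t(\cdot \mid x_0)}
    \big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \big\|^2

J_{\mathrm{SM}}(\theta) = J_{\mathrm{DSM}}(\theta) + C_t

% Flow matching vs. conditional flow matching
L_{\mathrm{FM}}(\theta)
  = \mathbb{E}_{t,\; x_t \sim p_t} \big\| v_\theta(x_t, t) - u_t(x_t) \big\|^2
\qquad
L_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t,\; x_1,\; x_t \sim p_t(\cdot \mid x_1)} \big\| v_\theta(x_t, t) - u_t(x_t \mid x_1) \big\|^2

L_{\mathrm{FM}}(\theta) = L_{\mathrm{CFM}}(\theta) + C'
```

Both identities relate two regression objectives; neither one is a statement about the data log-likelihood.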
But that does not mean:
continuous-time SDE diffusion gives exact maximum likelihood with no variational or approximation issue.
The approximations just move to different places:
- the score network is approximate;
- the SDE / ODE solver is numerical;
- likelihood computation, if needed, has its own assumptions and costs;
- the training objective may still be a surrogate for the downstream metric one cares about;
- weighting across noise levels matters a lot.
So the safe version is:
The score/SDE/flow formulations often make the training target cleaner, but they do not magically remove all approximation from generative modeling.
6. Continuous-time SDEs also have a variational interpretation
A key paper here is A Variational Perspective on Diffusion-Based Generative Models and Score Matching. It develops a variational framework for continuous-time generative diffusion and connects score matching to likelihood lower bounds for the plug-in reverse SDE.
This is important because it prevents the mistaken dichotomy:
discrete DDPM = ELBO
continuous SDE = no ELBO
The relationship is more like:
discrete DDPM has a natural ELBO derivation; continuous-time diffusion also admits variational / likelihood lower-bound interpretations; score matching and ELBO are connected rather than opposed.
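Another relevant paper is Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation, which argues that commonly used diffusion objectives can be written as a weighted integral of ELBOs over different noise levels, and that with a monotonic weighting the objective corresponds to the ELBO combined with simple data augmentation (Gaussian noise perturbation). The OpenReview page is here: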
OpenReview: Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation.
This makes the statement “ELBO is not favored” look too strong. A better statement is:
In practice, people often do not optimize the original discrete DDPM ELBO literally. They optimize reweighted denoising objectives that are easier to train and often better for perceptual quality. But these objectives remain closely connected to ELBO-like interpretations.
7. The practical DDPM loss and score matching are very close
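In many Gaussian diffusion setups, predicting the injected noise is equivalent to predicting the score up to a known scaling that depends only on the noise level (the score is the negative noise divided by the noise standard deviation). This is why people can talk about: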
- epsilon prediction;
- x0 prediction;
- v prediction;
- score prediction;
as different parameterizations of closely related training targets.
The exact equivalence depends on the noise schedule, parameterization, and loss weighting. So I would avoid saying “they are exactly the same loss” without qualifications. But conceptually:
DDPM noise prediction is one parameterization of denoising score estimation.
That is also why the Hugging Face implementation-oriented materials often present DDPM mostly as noise prediction and scheduling rather than as an explicit ELBO optimizer. See:
The Annotated Diffusion Model and Diffusers DDPM docs.
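The parameterizations listed in this section can be converted into each other with a few lines of algebra. The sketch below assumes a Gaussian corruption `x_t = alpha * x0 + sigma * eps` and, for the v-prediction formulas, a variance-preserving schedule with `alpha**2 + sigma**2 == 1`, using the definition `v = alpha * eps - sigma * x0`. The helper names are hypothetical, chosen for illustration.

```python
import numpy as np

def eps_to_x0(x_t, eps, alpha, sigma):
    """Clean-data estimate implied by a noise estimate."""
    return (x_t - sigma * eps) / alpha

def eps_to_score(eps, sigma):
    """Score estimate implied by a noise estimate: score = -eps / sigma."""
    return -eps / sigma

def eps_to_v(x_t, eps, alpha, sigma):
    """v-prediction target, v = alpha * eps - sigma * x0 (assumes alpha^2 + sigma^2 = 1)."""
    x0 = eps_to_x0(x_t, eps, alpha, sigma)
    return alpha * eps - sigma * x0

def v_to_eps(x_t, v, alpha, sigma):
    """Inverse of eps_to_v under the same variance-preserving assumption."""
    return sigma * x_t + alpha * v

# Toy check on random vectors at one noise level (alpha, sigma on the unit circle).
rng = np.random.default_rng(0)
alpha, sigma = np.cos(0.3), np.sin(0.3)
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = alpha * x0 + sigma * eps

x0_rec = eps_to_x0(x_t, eps, alpha, sigma)
eps_rec = v_to_eps(x_t, eps_to_v(x_t, eps, alpha, sigma), alpha, sigma)
```

The conversions are exact, but a squared-error loss on one target equals a squared-error loss on another only after rescaling by a factor that depends on the noise level, which is where the loss-weighting caveat comes from.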
8. Where flow matching fits
Flow matching adds one more viewpoint.
The blog/paper Diffusion Meets Flow Matching: Two Sides of the Same Coin argues that diffusion models and Gaussian flow matching are deeply connected; different model specifications can lead to different network outputs, schedules, and loss weightings, while describing essentially the same generative model in many common cases.
The MIT notes An Introduction to Flow Matching and Diffusion Models are also useful because they put ODEs, SDEs, flow matching, score matching, and modern diffusion models in one framework. The course page is here:
Flow Matching and Diffusion Models.
For a more implementation-oriented flow matching reference, see Meta’s Flow Matching Guide and Code and the associated arXiv paper:
Flow Matching Guide and Code.
This reinforces the same point:
These frameworks often differ less in the underlying generative family and more in the chosen probability path, parameterization, objective weighting, and numerical sampler.
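For comparison with the noise-prediction recipe, here is a minimal NumPy sketch of a conditional flow matching loss. The assumptions are mine: a linear path `x_t = (1 - t) * x0 + t * x1` with `x0` Gaussian noise and `x1` data, a toy dataset that is a single point `DATA_POINT`, and a cap `t_max` on sampled times to avoid dividing by a near-zero `1 - t` in the closed-form field. None of the names come from the Flow Matching code release.

```python
import numpy as np

def sample_path(x0, x1, t):
    """Linear probability path and its conditional velocity target x1 - x0."""
    t = t[:, None]
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return x_t, target

def cfm_loss(velocity_model, x1, rng, t_max=0.999):
    """Monte Carlo estimate of the conditional flow matching loss."""
    x0 = rng.standard_normal(x1.shape)               # noise endpoint of the path
    t = rng.uniform(0.0, t_max, size=x1.shape[0])    # random time per example
    x_t, target = sample_path(x0, x1, t)
    pred = velocity_model(x_t, t)
    return np.mean((pred - target) ** 2)

DATA_POINT = 3.0   # toy "dataset": every sample equals this value (assumption)

def point_mass_velocity(x_t, t):
    """Closed-form velocity field when all data sits at DATA_POINT."""
    return (DATA_POINT - x_t) / (1.0 - t[:, None])

rng = np.random.default_rng(0)
x1 = np.full((1024, 1), DATA_POINT)
oracle_loss = cfm_loss(point_mass_velocity, x1, rng)
zero_loss = cfm_loss(lambda x_t, t: np.zeros_like(x_t), x1, rng)
```

In this degenerate one-point case the conditional and marginal velocity fields coincide, so the closed-form field drives the loss to numerical zero. For real data the best achievable conditional loss is strictly positive, and that residual is the model-independent part of the gap between the marginal and conditional objectives.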
9. A useful mental model
I would summarize the relationship like this:
| View | Natural starting point | Main object learned | What it explains well | What can be confusing |
| --- | --- | --- | --- | --- |
| DDPM / ELBO | discrete latent-variable model | reverse denoising kernels, often parameterized by noise prediction | why diffusion can be trained as a variational model | the practical simple loss can look disconnected from the ELBO |
| Denoising score matching | noisy data distributions | score or equivalent noise target | why denoising regression works | likelihood interpretation needs extra work |
| Continuous-time SDE | forward and reverse stochastic processes | time-dependent score | unification of DDPM, score models, reverse SDE, probability-flow ODE | SDE notation can hide discretization and approximation issues |
| Flow matching | probability paths and vector fields | velocity / vector field | direct regression objective and ODE sampling | relationship to diffusion depends on path, parameterization, and weighting |
10. My answer to the original confusion
If the question is:
Is the DDPM ELBO less favored because the SDE formulation gives an equality instead of a lower bound?
I would answer:
Not exactly.
The DDPM ELBO is the natural derivation when DDPM is treated as a discrete-time latent-variable model. It is not wrong, and it is not merely historical baggage. However, the simplified objective used in practice is more directly understood as denoising score matching / noise prediction. The continuous-time SDE formulation makes this score-based interpretation cleaner and unifies the sampling dynamics.
The “equality” in denoising score matching or flow matching usually refers to equality up to a parameter-independent constant between a hard marginal objective and an easier conditional objective. It should not be interpreted as “the whole likelihood problem is now exact and ELBOs are obsolete.”
In fact, continuous-time diffusion also has variational interpretations, and modern diffusion objectives can often be understood as ELBO-related objectives with particular noise-level weightings. So the best mental model is not:
ELBO versus SDE
but rather:
ELBO, score matching, SDEs, and flow matching are different but tightly connected views of the same underlying denoising / transport problem.
11. Suggested reading order
If someone wants to untangle this efficiently, I would read in this order:
12. One-sentence version
DDPM’s ELBO, denoising score matching, continuous-time SDE diffusion, and flow matching are not mutually exclusive explanations; the ELBO view gives a latent-variable / likelihood-bound story, while the score/SDE/flow views often give a cleaner training-and-sampling story, and modern theory connects these views rather than replacing one with another.