Hello! Just reading some more diffusion model papers, and wanted to write brief summaries for each of them as an aid for learning.

Compositional Visual Generation with Composable Diffusion Models - Liu et al. 2022 https://arxiv.org/pdf/2206.01714

  • The goal is to improve generation of complex images by using different diffusion models to capture different subsets of a specification
  • These models are then composed together to generate an image
  • This method allows for significantly more complex combinations than those seen during training
  • We know the basic rundown of diffusion models so I will not repeat it here.
  • EBMs (energy-based models) are a class of generative models where the data distribution is modeled with an unnormalized probability density:

$p_\theta(x) \propto e^{-E_\theta(x)}$

Where the exponent $E_{\theta}(x)$ is a learnable neural network.

  • The sampling procedure is functionally the same as that for diffusion models, as you can see in the following equation:

$x_{t-1} = x_t - \frac{\lambda}{2}\nabla_x E_\theta(x_t) + \mathcal{N}(0, \sigma_t^2)$
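
To make the parallel with diffusion sampling concrete, here is a minimal Python sketch of this Langevin-style update; the function name `grad_energy`, the step size, and the noise schedule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def langevin_sample(grad_energy, x_init, sigmas, lam=0.01):
    """Sample from p(x) proportional to exp(-E(x)) by repeatedly stepping
    downhill in energy and injecting Gaussian noise, mirroring the reverse
    diffusion update."""
    x = x_init
    for sigma in sigmas:
        x = x - 0.5 * lam * grad_energy(x) + sigma * np.random.randn(*x.shape)
    return x

# Toy usage: a standard Gaussian has E(x) = ||x||^2 / 2, so grad E(x) = x.
x0 = langevin_sample(lambda x: x, np.random.randn(2), sigmas=np.full(1000, 0.1))
```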

  • The key innovation is then treating EBMs and diffusion models as the same thing, where the noise prediction network is essentially equivalent to the gradient of the energy with respect to the input.
  • A trained diffusion model can be viewed as an “implicitly parameterized EBM”.
  • EBMs have been shown previously to have good capabilities in compositional generation.
  • The composed distribution is as follows:

$p_{\text{compose}}(x) \propto \prod_{i=1}^{n} p_i(x) \propto e^{-\sum_{i=1}^{n} E_\theta^i(x)}$

  • Intuitively, we can try to do the same thing for diffusion models by summing the individual noise predictions (which act like energy gradients): $\hat{\epsilon}(x_t, t) = \sum_{i=1}^{n} \epsilon_\theta^i(x_t, t)$

  • Inspired by EBMs, the authors define two compositional operators, conjunction (AND) and negation (NOT), to compose diffusion models
  • They train a set of diffusion models representing the conditional distributions $p(x|c_i)$ of image x given each concept $c_i$, plus an unconditional distribution $p(x)$

AND

$p(x|c_1, \ldots, c_n) \propto p(x, c_1, \ldots, c_n) = p(x)\prod_{i=1}^{n} p(c_i|x) \qquad \text{(Eq. 9)}$

$p(x|c_1, \ldots, c_n) \propto p(x)\prod_{i=1}^{n} \frac{p(x|c_i)}{p(x)} \qquad \text{(Eq. 10)}$

$\hat{\epsilon}(x_t, t) = \epsilon_\theta(x_t, t) + \sum_{i=1}^{n} w_i\left(\epsilon_\theta(x_t, t|c_i) - \epsilon_\theta(x_t, t)\right) \qquad \text{(Eq. 11)}$

  • Equation 9 states that the probability of image x given concepts 1 through n is proportional to the joint probability of x and all the concepts, which factorizes as the probability of x times the product of the probabilities of each concept given x (assuming the concepts are conditionally independent given x)
  • With Bayes' rule, we can substitute $p(c_i|x)$ with $\frac{p(x|c_i)}{p(x)}$ (dropping the constant $p(c_i)$, since the equation is a proportionality rather than an equality) and get Equation 10
  • To generate the composed noise prediction (Equation 11), we take the unconditional noise given $x_t$ and timestep t, and add, for each concept, the difference between the concept-conditioned noise and the unconditional noise, scaled by a weight $w_i$ (temperature scaling)
  • To generate an image with this composed noise, we run the same sampling step as if the noise came from a single diffusion model (Equation 12); a minimal sketch follows this list
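
As a sketch of the conjunction operator in code, assuming we already have the unconditional and per-concept noise predictions as arrays (the function and variable names are illustrative, not the authors' implementation):

```python
def compose_and(eps_uncond, eps_conds, weights):
    """Weighted-sum composition of noise predictions for AND (Eq. 11):
    start from the unconditional prediction and add each concept's
    deviation from it."""
    composed = eps_uncond.copy()
    for w_i, eps_c in zip(weights, eps_conds):
        composed = composed + w_i * (eps_c - eps_uncond)
    return composed

# The composed noise is then plugged into the ordinary reverse-diffusion
# step (Eq. 12), exactly as if it came from a single model:
# eps_hat = compose_and(eps_uncond, [eps_c1, eps_c2], weights=[1.0, 1.0])
```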

NOT

  • The negation of a concept can be ill-defined; for example, the negation of “dark” could be “bright” or could just be random noise
  • As a result, negation needs to be combined with conditioning on other concepts
  • The authors therefore refactorize the joint probability distribution as

$p(x|\text{not } \tilde{c}_1, c_2, \ldots, c_n) \propto \frac{p(x, c_2, \ldots, c_n)}{p(\tilde{c}_1|x)} \propto p(x)\,\frac{\prod_{i=2}^{n} p(x|c_i)/p(x)}{p(x|\tilde{c}_1)/p(x)}$

with the corresponding composed noise prediction

$\hat{\epsilon}(x_t, t) = \epsilon_\theta(x_t, t) + \sum_{i=2}^{n} w_i\left(\epsilon_\theta(x_t, t|c_i) - \epsilon_\theta(x_t, t)\right) - w_1\left(\epsilon_\theta(x_t, t|\tilde{c}_1) - \epsilon_\theta(x_t, t)\right)$
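
Negation is the same sketch with the sign flipped for the negated concept (again, illustrative names only):

```python
def compose_not(eps_uncond, eps_keep, weights, eps_negate, w_neg=1.0):
    """Composed noise for NOT: attract toward the kept concepts c_2..c_n
    and push away from the negated concept."""
    composed = eps_uncond.copy()
    for w_i, eps_c in zip(weights, eps_keep):
        composed = composed + w_i * (eps_c - eps_uncond)     # keep these concepts
    composed = composed - w_neg * (eps_negate - eps_uncond)  # negate this one
    return composed
```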

Experiments

  • Used the CLEVR, Relational CLEVR, and FFHQ datasets, which contain objects, relational scene descriptions, and real-world human faces respectively
  • Evaluated using binary classification accuracy in three settings: generating with 1 component (i.e. conditioning on a single concept), generating with 2 components, and generating with 3 components
  • Also evaluated using FID
  • Compared with StyleGAN2-ADA, LACE, GLIDE, EBM
  • Composed results were generally higher-quality images with correct object relations
  • Three failure cases: (1) pre-trained diffusion models do not understand certain concepts, (2) diffusion models confuse attributes of objects, (3) the composition does not work, which often happens when objects are in the center of the images

Conclusion

  • Limitation of approach: composing multiple models only works if they are instances of the same model
  • Limited success when composing diffusion models trained on different datasets
  • EBMs, in contrast, can successfully compose multiple separately trained models

Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC - Du et al. 2023 https://energy-based-model.github.io/reduce-reuse-recycle/

  • The goal is to repurpose diffusion models, without finetuning, for a variety of downstream tasks.
  • Like in the previous paper, an intuition is to borrow ideas from composing EBMs.
  • However, the formula given previously for how an EBM parametrizes a distribution was slightly imprecise (it hid the normalization constant); the full formulation is:

$p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)}$

where $Z(\theta)$ is

$Z(\theta) = \int e^{-E_\theta(x)} \, dx$

i.e. the total area under the unnormalized density $e^{-E_\theta(x)}$ (so that the normalized distribution integrates to 1)

  • By not modelling the normalization constant, we can no longer efficiently compute likelihoods or draw samples
  • this complicates training because most generative models are trained by maximizing likelihood
  • To try to get likelihoods or sample from the EBM, we must rely on approximate methods such as MCMC.
  • MCMC with unadjusted Langevin dynamics is, as previously noted, functionally very similar to diffusion sampling.
  • The diffusion training objective, rewritten so that the noise prediction is the gradient of an energy function, is $\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\left[\big\| \epsilon - \nabla_{x_t} E_\theta(x_t, t) \big\|^2\right]$ with $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$
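
A short PyTorch-style sketch of what this energy parameterization looks like in practice: the "noise prediction" is literally the input-gradient of a scalar energy network, and the loss is the usual denoising MSE. The network architecture, the schedule name `alpha_bar`, and the other details are my own illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Scalar energy E_theta(x_t, t); its gradient w.r.t. x_t plays the role of eps_theta."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, 1))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t[:, None]], dim=-1)).squeeze(-1)

def denoising_loss(energy, x0, t, alpha_bar):
    """MSE between the true noise and the energy gradient at the noised input."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t][:, None]
    x_t = (a.sqrt() * x0 + (1 - a).sqrt() * eps).requires_grad_(True)
    E = energy(x_t, t.float()).sum()
    eps_pred = torch.autograd.grad(E, x_t, create_graph=True)[0]  # the "implicit EBM" view
    return ((eps - eps_pred) ** 2).mean()
```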

Controllable Generation

  • Learning a conditional diffusion model means learning $p_\theta(x|y; t)$.
  • Exploiting Bayes' rule, this factorizes (exactly when $\lambda = 1$) as $p_\theta(x|y; t) \propto p_\theta(x; t)\, p(y|x; t)^{\lambda}$
  • The intuition of this method is to learn a classifier for y (on noisy inputs) whose gradient can guide the generation of x toward the desired y
  • In practice, it is beneficial to make $\lambda > 1$; a rough sketch of the guided update is below.
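
This is a rough sketch of the classifier-guided update in noise-prediction form; the $\sqrt{1 - \bar{\alpha}_t}$ scaling and the argument names follow the common classifier-guidance recipe and are assumptions rather than the paper's exact parameterization.

```python
import torch

def classifier_guided_eps(eps_model, classifier_logp, x_t, t, y, lam, sqrt_one_minus_abar_t):
    """Shift the unconditional noise prediction by the gradient of log p(y | x_t)."""
    x_t = x_t.detach().requires_grad_(True)
    logp = classifier_logp(x_t, t, y).sum()    # classifier trained on noisy inputs
    grad = torch.autograd.grad(logp, x_t)[0]   # grad_x log p(y | x_t)
    # lam > 1 sharpens the conditioning, as noted above
    return eps_model(x_t.detach(), t) - lam * sqrt_one_minus_abar_t * grad
```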

  • There is another way to do this, which is to learn not an explicit classifier but an implicit model of $x|y$ (essentially classifier-free guidance)
  • The equation ends up being:

$p_\theta(x|y; t) \propto p_\theta(x; t)\left(\frac{p_\theta(x|y; t)}{p_\theta(x; t)}\right)^{\lambda}$

  • The first method allows you to train a bunch of different classifiers and attach them to the generative model
  • The second method has better performance, but it’s harder and more expensive to do; a minimal sketch follows this list.
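
For contrast, the implicit (second) method combines two forward passes of the same model, which is the familiar classifier-free-guidance update; a minimal sketch, assuming the model exposes an optional conditioning argument:

```python
def cfg_eps(eps_model, x_t, t, y, lam):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one by a factor lam."""
    eps_uncond = eps_model(x_t, t, cond=None)  # run without conditioning
    eps_cond = eps_model(x_t, t, cond=y)       # run with conditioning y
    return eps_uncond + lam * (eps_cond - eps_uncond)
```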
