
Diffusion models have become the dominant method of image generation in the last few years. But what are they? And how do they work? In this article, I will explain this intuitively and with some mathematics. It does require some mathematical and machine learning background, but I try to abstract away the complexity as much as possible.
Most of this is lifted from the MIT course on flow and diffusion models [2]: the full lectures, lecture notes, and exercises are available online. A huge thanks to Peter Holderrieth and Ezra Erives, the course lecturers.
Flow and Diffusion Models
A quick physics primer
The nomenclature of these types of models was lifted from physics (where a lot of machine learning originates). Flow models describe, as the name suggests, flows. These are mathematically formalised as an Ordinary Differential Equation (ODE) with an initial condition:
$$\frac{d}{dt}X_t = u_t(X_t), \qquad X_0 = x_0$$
This equation has a unique solution: a function of time called the flow, so named because it describes how the system evolves, or flows, through time.
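To make this concrete, here is a minimal sketch (in Python/NumPy, with a toy vector field chosen purely for illustration) of approximating a flow with Euler's method: start at x_0 and repeatedly follow the vector field for small steps.

```python
import numpy as np

def u(t, x):
    # A toy vector field chosen purely for illustration: it pulls points towards the origin.
    return -x

def simulate_flow(x0, n_steps=100):
    """Approximate the flow X_t solving dX/dt = u(t, X), X_0 = x0, with Euler's method."""
    x = np.asarray(x0, dtype=float)
    t, h = 0.0, 1.0 / n_steps
    for _ in range(n_steps):
        x = x + h * u(t, x)  # follow the vector field for a small step
        t += h
    return x  # the state at t = 1

print(simulate_flow([1.0, -2.0]))  # ends up close to the origin
```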
Diffusion models are simply an extension of flow models. They describe how particles disperse through time. Due to collisions between particles, it is very difficult to model this, and the best that physicists have found is to add a random component to the ODE.
$$\frac{d}{dt}X_t = u_t(X_t) + \text{noise}, \qquad X_0 = x_0$$
n.b. This is a very sloppy notation, but I am doing this for simplicity.
This is a Stochastic Differential Equation (SDE), where the added term represents random noise injected at each step. Because of this random component, an SDE does not have a single solution but a distribution of solutions.
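The simulation changes very little: reusing the imports and the toy u from the sketch above, each Euler step just gains a Gaussian noise term (scaled by the square root of the step size, as in the standard Euler-Maruyama method). Every run now produces a different trajectory.

```python
def simulate_diffusion(x0, sigma=0.5, n_steps=100, seed=0):
    """Euler-Maruyama simulation of the SDE: the Euler loop above plus random noise at each step."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    t, h = 0.0, 1.0 / n_steps
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x + h * u(t, x) + sigma * np.sqrt(h) * noise  # deterministic drift + random kick
        t += h
    return x  # one sample from the distribution of solutions
```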
What the hell does this have to do with machine learning?
The flow and diffusion processes are mappings from one distribution to another. This mapping is controlled by the function u in the equations above, so we can use a neural network to model u. Researchers found that if you set the initial state to random noise and the final state to data (such as an image), you could make a generative model.
Algorithm (Euler method for sampling from a flow model):
1. Set t = 0
2. Set the step size h = 1/n
3. Draw X from the Gaussian noise distribution
4. For i = 1, ..., n:
5.     Update X ← X + h · u_t(X)
6.     Update t ← t + h
7. Return the final X
This is the basic algorithm for generating data with a flow model; the same concept applies to diffusion models: simply add random noise at every update (line 5).
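Here is a rough sketch of how this might look in code (in PyTorch; u_theta(x, t) stands in for a trained network, and all names and shapes are illustrative). With sigma=0 this is the flow-model algorithm; sigma>0 adds the diffusion-style noise at every update.

```python
import torch

@torch.no_grad()
def sample(u_theta, shape, n_steps=50, sigma=0.0):
    """Generate data by flowing Gaussian noise towards the data distribution."""
    x = torch.randn(shape)                      # line 3: start from pure Gaussian noise
    h = 1.0 / n_steps
    t = torch.zeros(shape[0])                   # one time value per sample in the batch
    for _ in range(n_steps):
        x = x + h * u_theta(x, t)               # line 5: Euler update along the learned field
        if sigma > 0:
            x = x + sigma * (h ** 0.5) * torch.randn_like(x)  # the diffusion-style extra noise
        t = t + h
    return x                                    # the state at t = 1 is the generated sample
```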
We have an algorithm for generating data, but how do we train the network to do this specific task of mapping from noise to data? And you might also be thinking: how do you generate an image based on a prompt? This algorithm just generates data with no guidance. I will cover both of these points in the next sections.
Training
A full explanation of how to train this properly requires a lot of mathematical proof, so I won't go too much into it here; please refer to the lecture notes linked above for the full details. There are many flavours of losses for flow and diffusion models, used for various purposes. The most common is the Conditional Flow Matching Loss:
$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, z,\, \epsilon}\left[\,\left\| u_t^{\theta}(x) - (z - \epsilon) \right\|^2\,\right]$$
This looks like a standard mean-squared error loss, but some of the symbols look a bit different. The function u is our neural network, z is a datapoint sampled from our dataset, ϵ is random Gaussian noise, and x is a variable that is a function of t, z, and ϵ (more on this later). In a regular neural network, we would be fitting the function to a label or number. In this case, we fit the network to the difference between the data and some noise. What does this mean?
Essentially, we are fitting the network to predict the vector pointing directly from the noise to the data, i.e. the direction in which to move. This is just a straight line, and it works surprisingly well in practice: for example, it is the loss function used to train Meta's generative video models.
$$x = t\, z + (1 - t)\, \epsilon$$
We can see that x is parametrized as a point lying on the line joining ϵ and z. The network is trained so that, at each point in the space, it points the input closer to a datapoint.
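As a rough sketch of one training step under this loss (in PyTorch; u_theta(x, t) is a hypothetical network and the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def cfm_loss(u_theta, z):
    """Conditional flow matching loss for a batch of data z of shape (batch, ...)."""
    batch = z.shape[0]
    t = torch.rand(batch, device=z.device)           # random time in [0, 1] per sample
    eps = torch.randn_like(z)                        # Gaussian noise sample
    t_ = t.view(batch, *([1] * (z.dim() - 1)))       # reshape t to broadcast over data dims
    x = t_ * z + (1 - t_) * eps                      # point on the line from eps to z
    target = z - eps                                 # the direction from noise to data
    return F.mse_loss(u_theta(x, t), target)         # mean-squared error against that direction
```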
This is specifically the procedure for the Gaussian Conditional Probability Path formalism; the name is not important for this post, but useful if you want to dig into it more. The procedure for training diffusion models is very similar, but the loss, called the Score Matching Loss, is slightly different, and so is the parameterization of x. Again, I recommend going to the original notes to learn more. Flow matching is generally what is used in practice: even Stable Diffusion 3, the state of the art for image generation, uses flow matching.
Score or flow matching: when to choose which?
I’ve told you about two different types of matching, flow and score, but which is better? As is often the case in machine learning, it’s about tradeoffs.
- Flow models are cheaper to train and require fewer steps at inference to generate samples
- Diffusion models are more expensive to train and require many more steps at inference to converge, but the sample quality is much higher.
Your choice will depend on your application and your requirements: for a video generator, flow matching makes more sense, since many frames need to be generated quickly to create a fluid stream. For an image generator, diffusion will be more appropriate: the goal is to create a single high-quality image.
Guidance
We now have the tools to make a flow model, but how can we condition it, say, on a prompt? That is what the image generators we are familiar with do, yet nothing we have seen so far allows for it. This process is called guidance, as we are guiding the network towards a certain solution.
This is not difficult at all, provided you have a dataset of input-output pairs, such as images with captions. Since every datapoint z has an associated caption y, all you need to do is add y as an input to the neural network along with x, and the network will learn to use it as guidance to direct the path (in the case of captions, a language embedding is required to convert the words into numbers the network can interpret).
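As a rough illustration, conditioning simply means the network takes the caption embedding as an extra input. The toy MLP below is purely hypothetical (real models use U-Nets or transformers, see the note on architectures later):

```python
import torch
import torch.nn as nn

class ConditionalVelocityNet(nn.Module):
    """A toy conditional network u_theta(x, t, y): the architecture is purely illustrative."""
    def __init__(self, data_dim, cond_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim + 1 + cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, data_dim),
        )

    def forward(self, x, t, y_emb):
        # Concatenate the noisy input, the time, and the caption embedding.
        return self.net(torch.cat([x, t.unsqueeze(-1), y_emb], dim=-1))
```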
This sounds good, but in practice it was found empirically that the guidance was often not strong enough, e.g. generated images didn’t fit the caption well enough. This led people to use classifier-free guidance, which is a bit of a hack to weight the guidance more heavily.
To train a classifier-free guided flow model, we simply need to change the loss:
$$\mathcal{L}^{\mathrm{CFG}}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\, z,\, \epsilon,\, y}\left[\,\left\| u_t^{\theta}(x \mid y) - (z - \epsilon) \right\|^2\,\right], \qquad y \leftarrow \varnothing \text{ with probability } \eta$$
At each calculation, replace y with a null label (this can be anything that isn’t already a label) with a certain probability. My intuition for this (which could be wrong!) is that the network has to learn bigger “steps” towards a given datapoint, as the guidance to a certain datapoint could be taken away. It’s a bit analogous to dropout.
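A minimal sketch of the label-dropping step, assuming caption embeddings y_emb of shape (batch, dim), a null embedding of shape (dim,), and an illustrative drop probability:

```python
import torch

def drop_labels(y_emb, null_emb, p_drop=0.1):
    """With probability p_drop, replace each caption embedding with the 'null' label embedding."""
    mask = torch.rand(y_emb.shape[0], device=y_emb.device) < p_drop
    y_emb = y_emb.clone()
    y_emb[mask] = null_emb     # these samples are trained without any guidance signal
    return y_emb
```

The result is then passed to the network in place of y, and the loss is computed exactly as before.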
Then, at inference, the guidance can be tuned with the hyperparameter w. By tuning this hyperparameter, we decide how much the network “uses” the guidance. The basic equation is below.
$$\tilde{u}_t(x \mid y) = (1 - w)\, u_t(x \mid \varnothing) + w\, u_t(x \mid y)$$
We can see that with w=1 we reduce to regular guidance, while something higher pushes us further in the guided direction and away from the unconditional one. This isn’t mathematically very sound, but it works empirically.
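At inference, this combination is applied at every step of the sampling loop; a sketch, using the same hypothetical conditional network as above:

```python
import torch

def guided_velocity(u_theta, x, t, y_emb, null_emb, w=3.0):
    """Classifier-free guidance: mix the conditional and unconditional predictions.
    With w=1 this reduces to plain conditional sampling. Names and signature are illustrative."""
    u_cond = u_theta(x, t, y_emb)                          # prediction with the caption
    u_uncond = u_theta(x, t, null_emb.expand_as(y_emb))    # prediction with the null label
    return (1 - w) * u_uncond + w * u_cond
```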
Why does this work?
A valid question: why are we doing all this? Why can’t we simply make a model that takes a prompt as input and train it to generate an image directly? The answer is that this is much, much, much harder. The intuition is that flow and diffusion models are iteratively denoising an image (see the flow matching objective), which is a much easier task than generating an image outright. Second, the generation process is iterative (see the algorithm above), which means that mistakes made on one iteration can be corrected on the next: the model can fix its own mistakes.
A note on architectures
The flow and score matching losses are very flexible objectives that can be used with basically any underlying model architecture. Because of the size of the data, all SOTA models tend to do diffusion in a latent space. This means that they first train a Variational AutoEncoder (VAE) or other embedding model to embed images (or other high-dimensional data) into a smaller, semantically meaningful space. A diffusion model is then much cheaper to train in this space.
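Schematically, generation then looks like this (reusing the sample sketch from earlier, with a purely hypothetical vae object that exposes a decode method):

```python
def generate_image(vae, u_theta, latent_shape):
    """Hypothetical latent-space pipeline: vae and u_theta are placeholder models."""
    latent = sample(u_theta, latent_shape)   # run the flow/diffusion sampler in latent space
    return vae.decode(latent)                # decode the small latent back into pixel space
```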
Traditionally, for image generation, the convolutional U-Net architecture was used. More recently, the Diffusion Transformer (DiT) [1] has dominated the image generation game.
Thanks for reading! Make sure to share this article if you enjoyed it!
References
[1] Scalable Diffusion Models with Transformers, Peebles and Xie, 2022
[2] Introduction to Flow Matching and Diffusion Models, Peter Holderrieth and Ezra Erives, MIT course