Ditching the SDEs: A Simpler Path with Flow Matching

Flow Matching gives us a fresh—and simpler—lens on generative modeling. Instead of reasoning about probability densities and score functions, we reason about vector fields and flows.

Oct 3, 2025 · 35 min read
Tags: DL, Flow-Matching

Human-Crafted: written directly by the author with no AI-generated sections.


This post mainly follows the teaching structure of the video above, with my own re-organization and explanations mixed in. If you spot mistakes, feel free to point them out in the comments!

Flow Matching: Rebuilding Generative Models from First Principles

Alright, let’s talk about generative models. The goal is simple, right? We have a dataset—say, a bunch of cat images—drawn from some wild, high-dimensional distribution $p_1(x_1)$. We want to train a model that can spit out brand-new cat images. The goal is simple, but the method can get… well, pretty gnarly.

You’ve probably heard of diffusion models. The idea is to start from an image, add noise step by step over hundreds of steps, then train a big network to reverse that process. The math involves score functions ($\nabla_x \log p_t(x)$), stochastic differential equations (SDEs)… it’s a whole thing. It works—and works surprisingly well—but as a computer scientist I always wonder: can we reach the same goal in a simpler, more direct way? Is there a way to hack it?

Let’s step back and start from scratch. First principles.

The Core Problem

We have two distributions:

  1. $p_0(x_0)$: a super simple noise distribution we can sample from easily. Think x0 = torch.randn_like(image).
  2. $p_1(x_1)$: the real data distribution—complex and unknown (cats!). We can sample from it by loading an image from the dataset.

We want to learn a mapping that takes a sample from $p_0$ and transforms it into a sample from $p_1$.

The diffusion approach defines a complicated distribution path $p_t(x)$ that gradually morphs $p_1$ into $p_0$, then learns how to reverse it. But that intermediate distribution $p_t(x)$ is exactly where the mathematical complexity comes from.

So what’s the simplest thing we could possibly do?

A Naive “High-School Physics” Idea

What if we just… draw a straight line?

Seriously. Pick a noise sample $x_0$ and a real cat image sample $x_1$. What’s the simplest path between them? Linear interpolation.

$$x_t = (1-t)x_0 + t x_1$$

Here $t$ is our “time” parameter, going from $0$ to $1$.

  • When $t=0$, we’re at noise $x_0$.
  • When $t=1$, we’re at the cat image $x_1$.
  • For any $t$ in between, we’re at some blurry mixture of the two.

Now, imagine a particle moving along this straight-line path from $x_0$ to $x_1$ in one second. What’s its velocity? Let’s stick to high-school physics and differentiate with respect to time $t$:

$$\frac{dx_t}{dt} = \frac{d}{dt}\left((1-t)x_0 + t x_1\right) = -x_0 + x_1 = x_1 - x_0$$

Hold on. Let that result sit in your head for a moment.


Along this simple straight-line path, the particle’s velocity at any time is just a constant vector pointing from start to end. That’s the simplest vector field you can imagine.

That’s the “Aha!” moment. What if this is all we need?
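That derivative is easy to check numerically. Here is a standalone sketch (my own illustration, not from the post): a finite difference of the interpolation matches $x_1 - x_0$ at any $t$.

```python
import torch

torch.manual_seed(0)
x0 = torch.randn(4, 2)   # noise sample (stand-in for a noise image)
x1 = torch.randn(4, 2)   # data sample (stand-in for a cat image)

def interp(t):
    # The straight-line path x_t = (1 - t) * x0 + t * x1
    return (1 - t) * x0 + t * x1

# Finite-difference velocity at an arbitrary time t
t, eps = 0.3, 1e-4
velocity = (interp(t + eps) - interp(t - eps)) / (2 * eps)

# It matches the constant x1 - x0, independent of t
print(torch.allclose(velocity, x1 - x0, atol=1e-3))  # True
```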

Building the Model

Now we have a target: we want to learn a vector field. Let’s define a neural network $v_\theta(x, t)$ that takes any point $x$ and any time $t$ as input, and outputs a vector—its predicted velocity at that point.

How do we train it? We want the network output to match our simple target velocity $x_1 - x_0$. The most direct choice is mean squared error.

So the training objective becomes:

$$L = \mathbb{E}_{t, x_0, x_1} \left[ \| v_\theta((1-t)x_0 + t x_1,\, t) - (x_1 - x_0) \|^2 \right]$$

Let’s break down the training loop. It’s almost comically simple:

x1 = sample_from_dataset()        # grab a real cat image
x0 = torch.randn_like(x1)         # sample some noise
t = torch.rand(1)                 # pick a random time
xt = (1 - t) * x0 + t * x1        # interpolate to get the training input point
predicted_velocity = model(xt, t) # ask the model to predict the velocity
target_velocity = x1 - x0         # this is our ground truth!
loss = mse_loss(predicted_velocity, target_velocity)
loss.backward()
optimizer.step()
optimizer.zero_grad()             # clear gradients for the next iteration

Boom. That’s it. This is the core of Conditional Flow Matching. We turned a puzzling distribution-matching problem into a simple regression problem.

Why This Is So Powerful: “Simulation-Free”

Notice what we didn’t do. We never had to mention the complicated marginal distribution $p_t(x)$. We never had to define or estimate a score function. We bypass the entire SDE/PDE machinery.

All we need is the ability to sample pairs $(x_0, x_1)$ and interpolate between them. That’s why it’s called simulation-free training. It’s unbelievably direct.

Generating Images (Inference)

Now we’ve trained $v_\theta(x, t)$—a great “GPS” that can navigate from noise to data. How do we generate a new cat image?

We just follow its directions:

  1. Start from random noise: x = torch.randn(...).
  2. Start at time $t=0$.
  3. Iterate for a number of steps:
     a. Get a direction from our “GPS”: velocity = model(x, t).
     b. Take a small step: x = x + velocity * dt.
     c. Update time: t = t + dt.
  4. After enough steps (e.g., when $t$ reaches 1), x becomes our new cat image.

This is just solving an ordinary differential equation (ODE). It’s basically Euler’s method—something you might’ve even seen in high school. Pretty cool, right?
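For intuition, here is a toy illustration of my own (not code from the post): if the model had perfectly learned the constant field $x_1 - x_0$ for a single pair, Euler integration from $t=0$ to $t=1$ would land exactly on $x_1$, regardless of the step count.

```python
import torch

torch.manual_seed(0)
x0 = torch.randn(2)      # starting noise
x1 = torch.randn(2)      # target "cat image" (a stand-in point)

# Pretend the model perfectly learned the straight-line field for this pair.
def velocity(x, t):
    return x1 - x0       # constant everywhere along the path

steps = 100
dt = 1.0 / steps
x, t = x0.clone(), 0.0
for _ in range(steps):
    x = x + velocity(x, t) * dt   # Euler step
    t = t + dt

print(torch.allclose(x, x1, atol=1e-5))  # True: a straight field integrates exactly
```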

Summary

So to recap: Flow Matching gives us a fresh, simpler way to think about generative modeling. Instead of thinking in terms of densities and scores, we think in terms of vector fields and flows. We define a simple path from noise to data (like a straight line), then train a neural network to learn the velocity field that generates that path.


It turns out this simple, intuitive idea isn’t just a hack: it’s theoretically sound, and it powers some recent state-of-the-art models like SD3. It’s a great reminder that sometimes the deepest progress comes from finding a cleaner abstraction for a complex problem.

Simplicity wins.

The “Proper” Derivation: Why Does Our Simple “Hack” Work?

So far we derived Flow Matching’s core with an extremely simple intuition: take a noise sample $x_0$, a data sample $x_1$, draw a straight line between them ($x_t = (1-t)x_0 + t x_1$), and say the network $v_\theta$ only needs to learn the velocity—i.e., $x_1 - x_0$. The loss almost writes itself. Done.

Honestly, for most practitioners, that’s 90% of what you need.

But if you’re like me, a little voice might whisper: “Wait… this feels too easy. Our trick is built on a pair of independent samples $(x_0, x_1)$. Why should learning these independent straight lines make the network understand the flow of the entire high-dimensional probability distribution $p_t(x)$? Is this a legit shortcut, or a lucky, slightly cute hack that just happens to work?”

That’s where the formal derivation in the paper comes in. Its job is to show that our simple conditional objective can indeed (almost magically) optimize the bigger, scarier marginal objective.

Let’s put on a mathematician’s hat for a moment and see how they bridge the gap.

The “Official” Theoretical Problem: Marginal Flows

The “real” theoretical setup is: we have a family of distributions $p_t(x)$ that continuously morphs from noise $p_0(x)$ to data $p_1(x)$. This continuous deformation is governed by a vector field $u_t(x)$—the velocity at time $t$ and position $x$.

So the “official” goal is to train $v_\theta(x, t)$ to match the true marginal vector field $u_t(x)$. The loss would be:

$$L_{\text{marginal}} = \mathbb{E}_{t \sim U(0,1),\, x \sim p_t(x)} \left[ \| v_\theta(x, t) - u_t(x) \|^2 \right]$$

And this is immediately a disaster: it’s intractable. We can’t sample from $p_t(x)$, and we don’t know the target $u_t(x)$. So in this form, the loss is useless.

The Bridge: Connecting “Marginal” and “Conditional”

Researchers use a classic trick: “Sure, the marginal field $u_t(x)$ is a beast. But can we express it as an average over simpler conditional vector fields?”

A conditional vector field $u_t(x|x_1)$ is the velocity at point $x$ given that we already know the final destination is the data point $x_1$.

The paper proves (and this is the key theoretical insight) that the scary marginal field $u_t(x)$ is the expectation over these simple conditional fields, weighted by “the probability that the path conditioned on $x_1$ passes through $x$”:

$$u_t(x) = \mathbb{E}_{x_1 \sim p_1(x_1)} \left[ u_t(x|x_1) \cdot (\text{some probability term}) \right]$$

This builds the bridge: we relate an unknown quantity ($u_t(x)$) to many simpler things we might be able to define ($u_t(x|x_1)$).

We start from the “official” marginal flow matching loss. For any fixed time $t$, it is:

$$L_t(v_\theta) = \mathbb{E}_{x \sim p_t(x)} \left[ \| v_\theta(x, t) - u_t(x) \|^2 \right]$$

Here $p_t(x)$ is the marginal density at time $t$, and $u_t(x)$ is the true marginal vector field we want to learn. We can’t access either, so this form can’t be computed. The goal is to transform it into something computable.

Step 1: Expand the squared error

Using the identity $\|A - B\|^2 = \|A\|^2 - 2A \cdot B + \|B\|^2$:

$$L_t(v_\theta) = \mathbb{E}_{x \sim p_t(x)} \left[ \|v_\theta(x,t)\|^2 - 2 v_\theta(x,t) \cdot u_t(x) + \|u_t(x)\|^2 \right]$$

During optimization we only care about terms that depend on $\theta$. The term $\|u_t(x)\|^2$ does not depend on $\theta$, so it can be treated as a constant for gradients. To minimize $L_t(v_\theta)$, it’s enough to minimize:

$$L_t'(v_\theta) = \mathbb{E}_{x \sim p_t(x)} \left[ \|v_\theta(x,t)\|^2 - 2 v_\theta(x,t) \cdot u_t(x) \right]$$

Step 2: Rewrite the expectation as an integral

Using $\mathbb{E}_{x \sim p(x)}[f(x)] = \int p(x)f(x)\,dx$, and focusing on the cross term with the unknown $u_t(x)$:

$$L_t'(v_\theta) = \int p_t(x) \|v_\theta(x,t)\|^2 \,dx - 2 \int p_t(x)\, v_\theta(x,t) \cdot u_t(x) \,dx$$

Step 3: Substitute the bridge identity

The key identity links the hard marginal term $p_t(x) u_t(x)$ to conditional terms:

$$p_t(x)\, u_t(x) = \mathbb{E}_{x_1 \sim p_1(x_1)} \left[ p_t(x|x_1)\, u_t(x|x_1) \right] = \int p_1(x_1)\, p_t(x|x_1)\, u_t(x|x_1) \,dx_1$$

Substitute into the cross term:

$$-2 \int p_t(x)\, v_\theta(x,t) \cdot u_t(x) \,dx = -2 \int v_\theta(x,t) \cdot \left( \int p_1(x_1)\, p_t(x|x_1)\, u_t(x|x_1) \,dx_1 \right) dx$$

Step 4: Swap integration order (Fubini–Tonelli theorem)

We now have a double integral. It looks more complex, but we can swap the order of $dx$ and $dx_1$:

$$= -2 \int p_1(x_1) \left( \int p_t(x|x_1)\, v_\theta(x,t) \cdot u_t(x|x_1) \,dx \right) dx_1$$

Step 5: Convert integrals back to expectations and complete the square

The inner part $\int p_t(x|x_1)\, v_\theta(x,t) \cdot u_t(x|x_1) \,dx$ is an expectation over $p_t(x|x_1)$, so:

$$= -2 \int p_1(x_1)\, \mathbb{E}_{x \sim p_t(\cdot|x_1)} \left[ v_\theta(x,t) \cdot u_t(x|x_1) \right] dx_1$$

And since this is also an integral over $p_1(x_1)$, we can write it as an expectation over $x_1$:

$$= -2\, \mathbb{E}_{x_1 \sim p_1(x_1)} \left[ \mathbb{E}_{x \sim p_t(\cdot|x_1)} \left[ v_\theta(x,t) \cdot u_t(x|x_1) \right] \right]$$

This nested expectation can be merged into an expectation over the joint distribution:

$$\text{Cross term} = -2\, \mathbb{E}_{x_1 \sim p_1,\, x \sim p_t(\cdot|x_1)} \left[ v_\theta(x,t) \cdot u_t(x|x_1) \right]$$

Now substitute this back into $L_t'(v_\theta)$. With a similar transformation, the first term becomes $\mathbb{E}_{x_1,\, x \sim p_t(\cdot|x_1)} \left[ \|v_\theta(x,t)\|^2 \right]$. So:

$$L_t'(v_\theta) = \mathbb{E}_{x_1,\, x \sim p_t(\cdot|x_1)} \left[ \|v_\theta(x,t)\|^2 - 2 v_\theta(x,t) \cdot u_t(x|x_1) \right]$$

To form a perfect square, add and subtract the same term $\mathbb{E}_{x_1,\, x \sim p_t(\cdot|x_1)} \left[ \|u_t(x|x_1)\|^2 \right]$:

$$L_t'(v_\theta) = \mathbb{E}_{x_1, x} \left[ \|v_\theta(x,t)\|^2 - 2 v_\theta(x,t) \cdot u_t(x|x_1) + \|u_t(x|x_1)\|^2 \right] - \mathbb{E}_{x_1, x} \left[ \|u_t(x|x_1)\|^2 \right]$$

The bracketed part is a complete square. The final subtracted term doesn’t depend on $\theta$, so we can ignore it during optimization.
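The add-and-subtract step is easy to verify numerically. A tiny Monte Carlo check of my own (random vectors standing in for the targets $u_t(x|x_1)$, a constant vector standing in for the model output) confirms that the two objectives differ only by a $v$-independent constant, so they share the same minimizer:

```python
import numpy as np

# Check that L'(v) = E[||v||^2 - 2 v.u] and L_CFM(v) = E[||v - u||^2]
# differ only by the v-independent constant E[||u||^2].
rng = np.random.default_rng(0)
u = rng.normal(size=(100_000, 2))        # stand-ins for conditional targets u_t(x|x1)

def L_prime(v):
    return np.mean(np.sum(v ** 2) - 2 * (u @ v))

def L_cfm(v):
    return np.mean(np.sum((v - u) ** 2, axis=1))

const = np.mean(np.sum(u ** 2, axis=1))  # E[||u||^2], does not depend on v
for _ in range(3):
    v = rng.normal(size=2)               # an arbitrary "model output"
    assert abs(L_cfm(v) - (L_prime(v) + const)) < 1e-8
print("the objectives differ by a constant, so they share the same minimizer")
```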


Final result

We’ve shown that minimizing the original intractable marginal loss is equivalent to minimizing the following Conditional Flow Matching Objective:

$$L_{\text{CFM}}(v_\theta) = \mathbb{E}_{t,\, x_1,\, x \sim p_t(\cdot|x_1)} \left[ \| v_\theta(x, t) - u_t(x|x_1) \|^2 \right]$$

So we have a rigorous justification: if we define a simple conditional path (like linear interpolation) and its vector field, and optimize this simple regression loss, we still achieve the grand goal of optimizing the true marginal flow.

This is huge: we eliminated the dependence on the marginal density $p_t(x)$. The loss now depends only on the conditional path density $p_t(\cdot|x_1)$ and the conditional vector field $u_t(x|x_1)$.
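To make the bridge identity feel less magical, here is a standalone numerical check (my own construction, not from the paper): take a toy 1D "dataset" of two points, use Gaussian conditional paths $p_t(x|x_1) = \mathcal{N}(t x_1, (1-t)^2)$, build the marginal field from the identity, and verify the continuity equation $\partial_t p_t + \partial_x (p_t u_t) = 0$, which is precisely what "$u_t$ generates $p_t$" means.

```python
import numpy as np

# Toy 1D setup: the "dataset" p1 is uniform over the two points {-1, +1}.
# Conditional Gaussian path: p_t(x|x1) = N(t*x1, (1-t)^2), so t=0 is
# standard noise and t -> 1 collapses onto the data point x1.
x1s = np.array([-1.0, 1.0])

def cond_pdf(x, t, x1):
    s = 1.0 - t
    return np.exp(-0.5 * ((x - t * x1) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def cond_field(x, t, x1):
    # For a Gaussian path N(mu_t, sigma_t^2) the generating field is
    # u = (sigma_t'/sigma_t)*(x - mu_t) + mu_t'; here mu_t = t*x1, sigma_t = 1-t.
    return -(x - t * x1) / (1.0 - t) + x1

def marginal_pdf(x, t):
    return np.mean([cond_pdf(x, t, x1) for x1 in x1s], axis=0)

def marginal_field(x, t):
    # The bridge identity: p_t(x) u_t(x) = E_{x1}[ p_t(x|x1) u_t(x|x1) ].
    num = np.mean([cond_pdf(x, t, x1) * cond_field(x, t, x1) for x1 in x1s], axis=0)
    return num / marginal_pdf(x, t)

# Verify the continuity equation dp/dt + d(p*u)/dx = 0 on a grid.
x = np.linspace(-3.0, 3.0, 2001)
t, dt = 0.5, 1e-5
dp_dt = (marginal_pdf(x, t + dt) - marginal_pdf(x, t - dt)) / (2 * dt)
div = np.gradient(marginal_pdf(x, t) * marginal_field(x, t), x)
residual = np.max(np.abs(dp_dt + div))
print(residual)  # small: limited only by finite-difference error
```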

Back to Our Original Simple Idea

Where does that leave us? The formal proof says: as long as we can define a conditional path $p_t(x|x_1)$ and its corresponding vector field $u_t(x|x_1)$, we can train with $L_{\text{CFM}}$.

Now we can invite the “high-school physics” idea back in. Since we’re free to define the conditional path, let’s pick the simplest, least pretentious option:

  1. Define the conditional path $p_t(\cdot|x_0, x_1)$: make it deterministic—a straight line. So the probability is 1 on the line $x_t = (1-t)x_0 + t x_1$ and 0 everywhere else. (In diffusion, paths from $x_0$ are stochastic.)
  2. Define the conditional vector field $u_t(x_t|x_0, x_1)$: as we computed earlier, the velocity is $x_1 - x_0$.

[!NOTE] In math, this kind of “all mass at a single point, zero elsewhere” distribution is called the Dirac delta function $\delta$. So when we choose a straight-line path, we’re effectively choosing a Dirac delta distribution for $p_t(x|x_0, x_1)$.

Now plug these into the fancy-looking $L_{\text{CFM}}$ objective we derived. The expectation $\mathbb{E}_{x \sim p_t(\cdot|x_1)}$ becomes “take the point $x_t$ on our line”, and the target $u_t(x|x_1)$ becomes the simple $x_1 - x_0$.

And then—the magic moment—we end up with:

$$L = \mathbb{E}_{t, x_0, x_1} \left[ \| v_\theta((1-t)x_0 + t x_1,\, t) - (x_1 - x_0) \|^2 \right]$$

We’re back at the exact same ultra-clean loss we “guessed” from first principles. That’s what the formal derivation is for.

Summary

Pretty cool. We just went through a bunch of heavy math—integrals, Fubini’s theorem, the whole package—just to prove that our simple intuitive shortcut was correct from the start. We’ve confirmed: learning the simple vector target $x_1 - x_0$ along a straight-line path is a valid way to train a generative model.

From Theory to torch: Coding Flow Matching

Now that we’ve got the intuition (and even the full formal proof), we can remember: this is still just a regression problem. So let’s look at code.

Surprisingly, the PyTorch implementation is almost a 1:1 translation of the final formula. No hidden complexity, no scary math libraries—just pure torch.

Let’s break down the most important parts: the training loop and the sampling (inference) process.

Source code from the video: https://github.com/dome272/Flow-Matching/blob/main/flow-matching.ipynb

Setup: Data and Model

First, the script sets up a 2D checkerboard pattern. This is our tiny “cat image dataset”. These points are our real data $x_1$.

Then it defines a simple MLP (multilayer perceptron). This is our neural network—our “GPS”, our vector-field predictor $v_\theta(x,t)$. It’s a standard network that takes a batch of coordinates x and a batch of times t, and outputs a predicted velocity vector for each point. Nothing fancy in the architecture—the magic is in what we ask it to learn.
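The notebook’s exact architecture isn’t reproduced here, but a minimal stand-in (my sketch; the layer sizes are arbitrary assumptions) looks something like this; the only non-standard bit is concatenating $t$ onto the input:

```python
import torch
import torch.nn as nn

# A minimal stand-in for the notebook's model (my sketch, not its exact
# architecture): an MLP that takes 2D points plus one scalar time per point
# and returns a 2D velocity.
class VectorFieldMLP(nn.Module):
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # Time enters as one extra input feature, concatenated onto x.
        return self.net(torch.cat([x, t[:, None]], dim=-1))

model = VectorFieldMLP()
out = model(torch.randn(8, 2), torch.rand(8))
print(out.shape)  # torch.Size([8, 2])
```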

Training Loop: Where the Magic Happens

This is the core. Let’s recall the final, beautiful loss:

$$L = \mathbb{E}_{t,x_0,x_1} \left[ \bigg\| \underbrace{v_\theta \big( \overbrace{(1-t)x_0 + tx_1}^{\text{Input to Model}},\, t \big)}_{\text{Prediction}} - \underbrace{(x_1 - x_0)}_{\text{Target}} \bigg\|^2 \right]$$

Now let’s walk through the training loop line by line. This is the formula in action.

data = torch.Tensor(sampled_points)
training_steps = 100_000
batch_size = 64
pbar = tqdm.tqdm(range(training_steps))
losses = []

for i in pbar:
    # 1. Sample real data x1 and noise x0
    x1 = data[torch.randint(data.size(0), (batch_size,))]
    x0 = torch.randn_like(x1)

    # 2. Define the target vector
    target = x1 - x0

    # 3. Sample random time t and create the interpolated input xt
    t = torch.rand(x1.size(0))
    xt = (1 - t[:, None]) * x0 + t[:, None] * x1

    # 4. Get the model's prediction (note that t is an input too)
    pred = model(xt, t)

    # 5. Calculate the loss and run the standard optimization boilerplate
    loss = ((target - pred)**2).mean()
    loss.backward()
    optim.step()
    optim.zero_grad()
    pbar.set_postfix(loss=loss.item())
    losses.append(loss.item())

Mapping directly to the formula:

  • x1 = ... and x0 = ...: sample from the data distribution $p_1$ and the noise distribution $p_0$ to supply $x_1$ and $x_0$ for the expectation $\mathbb{E}$.

  • target = x1 - x0: this is the heart of it. This line computes the true vector field for our straight-line path—i.e. the target $(x_1 - x_0)$.

  • xt = (1 - t[:, None]) * x0 + t[:, None] * x1: this is the other key part. It computes the interpolated point $x_t$ on the path—the model input $(1-t)x_0 + tx_1$.

  • pred = model(xt, t): forward pass to get the prediction $v_\theta(x_t,t)$.

  • loss = ((target - pred)**2).mean(): mean squared error between target and prediction—the $\|\cdot\|^2$ part.

That’s it. These five lines are a direct line-by-line implementation of the elegant formula we derived.

Sampling: Following the Flow 🗺️

Now we’ve trained the model. It’s a highly skilled “GPS” that knows the velocity field. How do we generate new checkerboard samples? Start from empty space (noise), and walk where it tells you to go.

Mathematically, we want to solve the ODE:

$$\frac{dx_t}{dt} = v_\theta(x_t, t)$$

The simplest way to solve it is Euler’s method: take small discrete steps to approximate continuous flow.

[!TIP] Since $v_\theta$ is a complex neural network, we can’t solve this analytically. We must simulate it. The simplest approach is to approximate smooth continuous flow with a sequence of small straight-line steps.

By the definition of the derivative, over a small time step $dt$, the position change $dx_t$ is approximately velocity times the time step: $dx_t \approx v_\theta(x_t, t) \cdot dt$.

So to get the new position at t+dtt + dtt+dt, we add this small change to the current position. This gives the update rule:

$$x_{t+dt} = x_t + v_\theta(x_t, t) \cdot dt$$

This “move a tiny bit along the velocity direction” has a famous name: Euler’s method. It’s the simplest numerical solver for ODEs—and as you can see, it almost falls out of first principles.
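To see what Euler’s method costs in accuracy, here is a generic illustration (my own, unrelated to the post’s model) on an ODE with a known solution: for $dx/dt = -x$ with $x(0) = 1$, the exact answer at $t=1$ is $e^{-1}$, and the Euler error shrinks roughly like $1/\text{steps}$.

```python
import math

# Euler's method on dx/dt = -x, x(0) = 1, whose exact solution at t=1
# is exp(-1). Illustration only; nothing here comes from the post's model.
def euler(v, x0, steps):
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + v(x, i * dt) * dt   # the Euler update x <- x + v(x, t) * dt
    return x

exact = math.exp(-1.0)
for steps in (10, 100, 1000):
    err = abs(euler(lambda x, t: -x, 1.0, steps) - exact)
    print(steps, err)  # the error shrinks roughly like 1/steps
```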

# The sampling loop from the script
xt = torch.randn(1000, 2)  # Start with pure noise at t=0
steps = 1000
for i, t in enumerate(torch.linspace(0, 1, steps), start=1):
    pred = model(xt, t.expand(xt.size(0))) # Get velocity prediction

    # This is the Euler method step!
    # dt = 1 / steps
    xt = xt + (1 / steps) * pred

  • xt = torch.randn(...): start from a random point cloud—our initial state.

  • for t in torch.linspace(0, 1, steps): simulate flow from $t=0$ to $t=1$ over discrete steps.

  • pred = model(xt, ...): at each step, query the current velocity $v_\theta(x_t, t)$.

  • xt = xt + (1 / steps) * pred: Euler update. Move xt a tiny step in the predicted direction. Here dt = 1 / steps.

Repeat this update, and the random point cloud gradually gets “pushed” by the learned vector field until it flows into the checkerboard data distribution.

Theoretical simplicity translates directly into clean, efficient code. It’s genuinely beautiful.

DiffusionFlow

But wait… before we celebrate, let’s pause for a “thought bubble” moment.

[!WARNING] The math above works under the assumption that we use a straight-line path between $x_0$ (random noise) and $x_1$ (a random cat image). But… is a straight line really the best, most efficient path?

From a chaotic Gaussian noise cloud to the delicate high-dimensional manifold where real images live, the “true” transformation is likely a wild, winding journey. Forcing a straight line might be too crude. We ask a single neural network $v_\theta$ to learn a vector field that magically works for all these unnatural linear interpolations. This may be one reason we still need many sampling steps for high-quality images: the learned vector field has to keep correcting for our oversimplified path assumption.

So a good “hacker” would naturally ask next: “Can we make the learning problem easier?”

Imagine… instead of pairing completely random $x_0$ and $x_1$, we could find “better” start/end points—say a pair $(z_0, z_1)$ that are naturally related, so the path between them is already close to a straight line.

Where do such pairs come from? Simple: use another generative model (e.g. a standard DDPM) to generate them. Give it a noise vector $z_0$; after hundreds of steps it outputs a decent image $z_1$. Now we have a pair $(z_0, z_1)$ that represents the “real” trajectory taken by a strong model.

This gives us a teacher–student framework: the old, slow model provides these “pre-straightened” trajectories, and we use them to train a new, simpler Flow Matching model. The new model’s learning problem becomes much easier.

This idea—using one model to construct an easier learning problem for another—is powerful. You’re essentially “distilling” a complex, curved path into a simpler, straighter one. Teams at DeepMind and elsewhere had the same idea: it’s the core of Rectified Flow / DiffusionFlow—iteratively straightening the path until it’s so straight you can almost jump from start to end in one step.

It’s a beautiful meta-level extension of our original shortcut. Worth savoring.
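The payoff of straightening can be seen in one line (my own toy sketch): if the velocity along each trajectory is constant, a single Euler step of size $dt = 1$ already lands exactly on the endpoint. That is the limit that iterative path-straightening works toward.

```python
import numpy as np

rng = np.random.default_rng(0)
z0 = rng.normal(size=(5, 2))    # noise endpoints
z1 = rng.normal(size=(5, 2))    # "image" endpoints (stand-ins)

# A perfectly straightened flow has constant velocity z1 - z0 along each
# path, so one Euler step with dt = 1 traverses the whole trajectory.
one_step = z0 + (z1 - z0) * 1.0
print(np.allclose(one_step, z1))  # True
```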

Article Info
Title: Ditching the SDEs: A Simpler Path with Flow Matching
Author: Nagi-ovo

For commercial reuse, contact the site owner for authorization. For non-commercial use, please credit the source and link to this article. You may copy, distribute, and adapt this work as long as derivatives share the same license. Licensed under CC BY-NC-SA 4.0.