Flow Matching

Introduction

Flow matching is a fundamental algorithm used in state-of-the-art image generation models as well as VLA (vision-language-action) models in robotics. In particular, $\pi_{0.5}$ (Physical Intelligence, 2025) and X-VLA (Zheng et al., 2025) are notable high-performing examples of flow matching applied successfully to robotic manipulation.

I thought it would be interesting to dive into the mathematical derivations behind the algorithm, standing on the shoulders of the many excellent existing resources explaining it :)

What is Flow Matching?

The aim of flow matching is to generate a sample from an unknown probability distribution.

For example, let’s consider an image generation model. Say we have a dataset of 50 cat pictures and we want to generate another picture. The unknown probability distribution here is the mathematical function describing the likelihood of all cat pictures. We cannot know this function directly; we only have samples (our dataset). We want to sample from this unknown probability distribution to generate a picture that is distinct from the ones in the dataset, but one that we would still classify as a cat.

Flow matching visualization: the blue distribution represents cat pictures, the black dots are the limited dataset, and an orange arrow shows a new sample being generated.

Blue represents the entire cat picture distribution, black dots represent our limited dataset.

How might we accomplish this? Flow matching proposes we build a path translating from a known distribution to this unknown distribution, so that if we follow this path, we will always end up at some point in the unknown distribution.

It’s sort of like a treasure map. It tells you where to go from where you started, and if you dig around the target, you’re bound to find treasure.

Let’s call the unknown target distribution $X_1$ and the known distribution $X_0$, where $X_0 = \mathcal{N}(0, I)$. Let’s set it up such that at time $t=0$, $x \sim X_0$, and at $t=1$, $x_1 = x \sim X_1$, so as we move forward in time we move closer and closer to the target distribution.

Probability path visualization: Shows a path from a known distribution to an unknown target distribution.

$p$ is known, $q$ is the target, $p_t$ is our path (Lipman et al., 2024)

Now let’s construct a probability path $p_t(x)$ that fulfills these requirements.

Constructing the Probability Path

At a given time $t$, $p_t(x)$ should give us the intermediate probability distribution (the grey distributions in the image above) that our points could possibly be in. By the marginal Law of Total Probability, given two continuous random variables $X$ and $Y$,

\begin{equation} f_X(x) = \int f_{X|Y}(x|y) f_Y(y) \, dy \end{equation}

Thus, we can derive the probability path below! Given an arbitrary time $t \in [0, 1]$ and $q(x_1) = P(x_1)$,

\begin{equation} p_t(x) = \int p_{t|1}(x|x_1) q(x_1) \, dx_1 \text{, where } p_{t|1}(x|x_1) = \mathcal{N}(tx_1, (1-t)^2 I) \end{equation}

As we can see, by defining $p_{t|1}(x|x_1)$ as a Gaussian with $\mu = tx_1$ and $\sigma = (1-t)I$, we can intuitively linearly interpolate over the variable $t$ from $X_0$ to $X_1$: at $t=0$ the conditional is exactly $\mathcal{N}(0, I) = X_0$, and at $t=1$ it collapses onto the target sample $x_1$. This specific probability path is the conditional optimal-transport or linear path.
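To make Equation 2 concrete, here is a minimal sketch (my own toy illustration, not from the paper) that approximates the marginal $p_t(x)$ by averaging the conditional Gaussian densities over a small 2-D stand-in dataset, taking $q$ to be the empirical distribution of that dataset:

import torch

def marginal_density(x, t, dataset):
    # Approximates p_t(x) = (1/N) * sum_i N(x; t * x1_i, (1 - t)^2 I)
    d = dataset.shape[1]
    var = (1 - t) ** 2
    sq_dist = ((x - t * dataset) ** 2).sum(dim=1)          # squared distance to each mean t * x1_i
    norm_const = (2 * torch.pi * var) ** (d / 2)           # Gaussian normalizing constant
    return torch.exp(-sq_dist / (2 * var)).div(norm_const).mean()

# Toy 2-D "dataset" standing in for the cat pictures
dataset = torch.randn(50, 2) + 5.0
print(marginal_density(torch.zeros(2), t=0.0, dataset=dataset))        # ~= density of N(0, I) at the origin
print(marginal_density(0.99 * dataset[0], t=0.99, dataset=dataset))    # large: mass concentrates near the (scaled) data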

Let’s define $X_t$ as the random variable at time $t$ using the Reparameterization Trick.

\begin{equation} \begin{aligned} X_t &\sim p_{t|1}(x|x_1) = \mathcal{N}(tx_1, (1-t)^2 I) \\ X_t &= tx_1 + (1-t)\epsilon \\ X_t &= tX_1 + (1-t)X_0 \text{ as } X_0 \sim \mathcal{N}(0, I) \end{aligned} \end{equation}

Now we have $X_t$ in terms of $X_0$ and $X_1$.
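In code, this reparameterized sample is a one-liner. A quick sketch (my own, with a random Gaussian blob standing in for target samples):

import torch

x_1 = torch.randn(128, 2) + 5.0      # stand-in batch of samples from the (unknown) target X_1
x_0 = torch.randn_like(x_1)          # X_0 ~ N(0, I)
t = torch.rand(x_1.shape[0], 1)      # one time value per sample, t in [0, 1]

x_t = t * x_1 + (1 - t) * x_0        # Equation 3: X_t = t X_1 + (1 - t) X_0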

Let’s take a step back and remind ourselves of the goal. Given a sample from a known probability distribution, we want to be able to map it to a sample from an unknown probability distribution. We now know how to describe a relationship between the probability distributions $X_0$ and $X_1$ by interpolating over intermediary probability distributions $X_t$, but $X_t$ is still incalculable without knowing what $X_1$ is.

Additionally, this is not the function input/output we want. We want to be able to map from an input to some output at $t=1$.

Defining Flow and Vector Fields

This is where flow and vector fields come in.

Let’s define a vector field $u_t(x)$ such that given a coordinate $x$ and time value $t$, $u_t$ outputs the direction to move from that point.

Let’s further define a flow $\phi_t(x_0)$ such that given a starting point $x_0 \sim X_0$ and $t \in [0, 1]$, it returns the coordinates of that point as we follow the vector field over time.

The relationship between the two can be defined as

\begin{equation} \begin{aligned} x_t &= \phi_t(x_0) \\ \frac{d}{dt} \phi_t(x_0) &= u_t (\phi_t(x_0)) \\ \frac{d}{dt} x_t &= u_t(x_t) \end{aligned} \end{equation}
Vector field and flow visualization: flows from the known distribution follow the vector field toward the target distribution.

Arrows represent the vector field, the blue lines represent the flow given a starting input sampled from $X_0$

In the image above, we can see that as we move forward in time, we follow the vector field defined by the grey arrows, and the path we trace out is our flow!

For this vector field specifically, let’s define $u_t$ such that it generates the probability path $p_t$. This means that if we (hypothetically) sampled 1000 points from $X_0$ and found their positions at $t=0.5$ after following $u_t$, their density would be the one described by $X_{0.5}$.
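We can simulate exactly that thought experiment. Below is a hypothetical 1-D toy version (my own sketch, using a hand-picked conditional vector field that points from the current location toward a paired target; we will derive this kind of field properly later):

import torch

# 1000 particles from X_0, each paired with a target drawn from a toy 1-D "dataset" distribution
x_0 = torch.randn(1000, 1)
x_1 = torch.randn(1000, 1) * 0.1 + 5.0

def u_t(x, t, x_1):
    # Toy conditional vector field: points from the current location toward the target
    return (x_1 - x) / (1 - t)

# Follow the vector field with small Euler steps up to t = 0.5
x, dt = x_0.clone(), 0.01
for i in range(50):
    x = x + u_t(x, i * dt, x_1) * dt

# The particle positions now match X_0.5 = 0.5 * X_1 + 0.5 * X_0 (same mean and spread)
print(x.mean(), (0.5 * x_1 + 0.5 * x_0).mean())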

This is an important enough concept to reiterate. A vector field dictates which way the point moves at a given time step. The flow represents the path a particle takes, given a starting location and what time step we’re at. The probability path represents the density of those particles.

(I’d recommend watching 3Blue1Brown’s video on this - his visualizations are extremely helpful)

Now we finally have the function we wanted! A vector field tells us how to move from a point in $X_0$ to a point in $X_1$. If we’re able to learn the correct vector field, we can generate any photo from our target distribution.

We can simply define the flow matching loss as

\begin{equation} \mathcal{L}_{FM}(\theta) = \mathbb{E}_{t, X_t} ||u_t^\theta (X_t) - u_t (X_t)||^2 \end{equation}

where $u_t^\theta$ is our neural-network-parameterized vector field. Minimizing this means our neural network perfectly matches the target vector field $u_t$.

Unfortunately, calculating this loss function is intractable.

Simplifying the Intractable Loss Function

Let’s first show why it’s intractable by deriving an equation for $u_t$.

You’ll notice that I began talking about particles rather than points: some students of physics may recognize this as fluid dynamics. Fluid dynamics gives us the Continuity Equation,

\begin{equation} \frac{\partial p_t}{\partial t} = - \nabla \cdot (p_t u_t) \end{equation}

where $\nabla \cdot (p_t u_t)$ is the divergence, defined by

\begin{equation} \nabla \cdot (p_t u_t) = \sum_{i=1}^d \frac{\partial}{\partial x_i} \big( p_t(x) u_t(x) \big)_i \end{equation}

where $x$ is a single point and the subscript $i$ picks out the $i$-th component.

Multiplying $p_t(x)$ and $u_t(x)$ gives us the probability current (or flux), as $p_t(x)$ represents the density at a point and $u_t(x)$ represents the velocity of that point. Thinking about this in terms of liquids and a hose,

Flux represents how much water is flowing through a point at a given time.

Divergence defines how much a vector field is expanding / compressing at a specific point. Interpreting the continuity equation: if the divergence is positive, the flux is expanding away from the point, so the change in probability (density) there is negative, meaning more water is escaping than arriving. If the divergence is negative, the flux is compressing toward the point, so the change in probability is positive, as water piles up and the fluid becomes denser there.

We can attempt to solve for $u_t(x)$ now. Starting with the Continuity Equation:

\begin{equation} \frac{\partial p_t(x)}{\partial t} + \nabla \cdot (p_t(x) u_t(x)) = 0 \end{equation}

Substitute the marginal definition $p_t(x) = \int p_{t|1}(x|x_1) q(x_1) \, dx_1$ into the first term:

\begin{equation} \frac{\partial}{\partial t} \int p_{t|1}(x|x_1) q(x_1) \, dx_1 + \nabla \cdot (p_t(x) u_t(x)) = 0 \end{equation}

Move the time derivative inside the integral and substitute the conditional continuity equation $\frac{\partial p_{t|1}}{\partial t} = -\nabla \cdot (p_{t|1} u_{t|1})$:

\begin{equation} \int \underbrace{\left[ -\nabla \cdot (p_{t|1}(x|x_1) u_t(x|x_1)) \right]}_{\frac{\partial p_{t|1}}{\partial t}} q(x_1) \, dx_1 + \nabla \cdot (p_t(x) u_t(x)) = 0 \end{equation}

Rearrange to equate the divergence terms:

\begin{equation} \nabla \cdot (p_t(x) u_t(x)) = \nabla \cdot \int u_t(x|x_1) p_{t|1}(x|x_1) q(x_1) \, dx_1 \end{equation}

Removing the divergence operator implies equality of the fields. Solving for $u_t(x)$ yields:

\begin{equation} u_t(x) = \frac{1}{p_t(x)} \int u_t(x|x_1) p_{t|1}(x|x_1) q(x_1) \, dx_1 \end{equation}

\begin{equation} u_t(x) = \int u_t(x|x_1) \underbrace{\frac{p_{t|1}(x|x_1) q(x_1)}{p_t(x)}}_{p(x_1|x)} \, dx_1 \end{equation}

Unfortunately, we then find that $p(x_1|x)$ is intractable: its denominator $p_t(x)$ is an integral over every possible $x_1$, and it’s impossible to check every single possible cat photo in existence.

Instead, we can simplify this by choosing a single objective: fix $x_1$ to be a single target example (i.e. an image in our dataset). Then we have an equation for $X_{t|1}$ which only depends on our Gaussian sample $X_0$ and $t$:

\begin{equation} X_{t|1} = tx_1 + (1-t)X_0 \sim p_{t|1}(x|x_1) = \mathcal{N}(tx_1, (1-t)^2 I) \end{equation}

In this case, differentiating Equation 14 with respect to $t$ and using the identity $\frac{d}{dt}X_{t|1} = u_t(X_{t|1} | x_1)$ yields

\begin{equation} u_t(X_{t|1} | x_1) = x_1 - X_0 \end{equation}

(Technically velocity fields should be a function of the current location rather than the starting point $X_0$, so you may rewrite this as $\frac{x_1 - X_t}{1-t}$, but for our purposes they are equivalent.)
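As a quick check that the two forms agree, rearrange Equation 14 to $X_0 = \frac{X_t - tx_1}{1-t}$ along the conditional path and substitute:

\begin{equation} x_1 - X_0 = x_1 - \frac{X_t - tx_1}{1-t} = \frac{(1-t)x_1 - X_t + tx_1}{1-t} = \frac{x_1 - X_t}{1-t} \end{equation}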

Finally! We have a much simpler velocity field that we can easily regress against with an MSE, yielding the Conditional Flow Matching loss function.

\begin{equation} \begin{aligned} \mathcal{L}_{CFM}(\theta) &= \mathbb{E}_{t, X_t, X_1} || u_t^\theta (X_t) - u_t (X_t | X_1) || ^2 \\ &= \mathbb{E}_{t, X_t, X_1} || u_t^\theta (X_t) - (X_1 - X_0) || ^2 \end{aligned} \end{equation}

At a high level, we are simply training the network to match a vector field that traces a straight line from $X_0$ to $x_1$. Does this work? The gradients of our earlier loss function $\mathcal{L}_{FM}(\theta)$ and this one are identical, so the two objectives are equivalent!
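Here is a rough sketch of that argument (the full proof is in Lipman et al., 2024). Expanding both squared norms, the only terms that depend on $\theta$ are $||u_t^\theta(X_t)||^2$ and the cross terms, and the cross terms agree in expectation because the marginal field is the conditional field averaged under $p(x_1|x)$ (Equation 13):

\begin{equation} \begin{aligned} \mathbb{E}_{t, X_t} \left\langle u_t^\theta(X_t), u_t(X_t) \right\rangle &= \mathbb{E}_{t, X_t} \left\langle u_t^\theta(X_t), \int u_t(X_t|x_1) p(x_1|X_t) \, dx_1 \right\rangle \\ &= \mathbb{E}_{t, X_t, X_1} \left\langle u_t^\theta(X_t), u_t(X_t|X_1) \right\rangle \end{aligned} \end{equation}

The leftover terms $||u_t(X_t)||^2$ and $||u_t(X_t|X_1)||^2$ do not depend on $\theta$, so $\nabla_\theta \mathcal{L}_{FM} = \nabla_\theta \mathcal{L}_{CFM}$.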

We can see this in example flow matching code (Lipman et al., 2024):

import torch 
from torch import nn, Tensor

class Flow(nn.Module):
    """ 
    Parameterized vector field.
    Input: [x_t, t]
    Output: u(x_t)
    """
    ...

...

# Initialize flow matching model, MSE loss function
flow = Flow()
loss_fn = nn.MSELoss()

# x_1 sampled from the target distribution, x_0 sampled from the standard Gaussian
x_0 = torch.randn_like(x_1)
# t sampled uniformly from [0, 1], one value per sample in the batch
t = torch.rand(x_1.shape[0], 1)

# Calculate the delta between the two endpoints: the target velocity x_1 - x_0
dx_t = x_1 - x_0

# Linearly interpolate to get the point at time t (Equation 3)
x_t = t * x_1 + (1 - t) * x_0

# Find MSE between u^theta_t(x_t) and x_1 - x_0!
# Note that x_t represents the coordinates of the point while t is a float in [0, 1] representing the time
loss_fn(flow(x_t, t), dx_t).backward()
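Once trained, generating a sample is just integrating the learned vector field from $t=0$ to $t=1$. A minimal sketch (assuming the Flow model above operates on 2-D points, and using plain Euler steps; a proper ODE solver would do better):

# Start from a sample of the known distribution X_0 ~ N(0, I)
x = torch.randn(1, 2)

n_steps = 100
dt = 1.0 / n_steps
with torch.no_grad():
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + flow(x, t) * dt   # Euler step along the learned field u^theta_t(x_t)

# x is now (approximately) a sample from the target distribution X_1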

I’ll leave the post there - maybe I’ll write more later on. Further reading would include various ODE solvers, score-based generative modeling, and diffusion models!

Resources that greatly helped me: