(diffusion_mfu)=
# Diffusion FLOPs / MFU

Last updated: 06/02/2026

VeRL-Omni reports **Model FLOPs Utilization (MFU)** for diffusion RL
training using the same actor keys upstream
[verl](https://github.com/verl-project/verl) reports for LLM RL — so users
have a single, hardware-agnostic metric to compare across runs,
checkpoints, and clusters. This page describes what is reported, how the
numbers are computed, and how to add an estimator for a new diffusion
architecture.

If you are looking for FlowGRPO-specific training metrics
(`zero_std_ratio`, `ratio_mean`, `pg_clipfrac_*`, ...), see
{ref}`metrics`.

## Reported metrics

The diffusion trainer emits two MFU keys, on the same step cadence as
the rest of the actor metrics:

| Metric                   | What is timed                                                                                                                                              |
|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `perf/mfu/actor`         | Actor `train_batch` — full mini-batch update (includes all forward/backward micro-batches for gradient accumulation).                                      |
| `perf/mfu/actor_infer`   | Actor `infer_batch` — full mini-batch forward-only pass (log-prob recompute on the rollout trajectories).                                                  |

Reference-policy forward passes (e.g. ref log-prob or ref noise-pred) are
not surfaced at the trainer level, matching upstream verl.

`MFU = 1.0` means every GPU in the data-parallel group is sustaining the
device's advertised peak FLOPS for the duration of the timed call. The
peak comes from `get_device_peak_tflops()` in `verl_omni.utils.mfu`,
which wraps upstream `verl.utils.flops_counter.get_device_flops()` and
honors the `VERL_OMNI_DEVICE_FLOPS_TFLOPS` env override — no
diffusion-specific device table is introduced.

Absolute MFU values are model-, hardware-, batch-shape-, and
parallelism-dependent; treat them as **relative** numbers for comparing
configurations (LoRA vs full FT, before/after a kernel change, baseline
vs optimisation) on the same setup, not as cross-cluster benchmarks.

> **LoRA caveat.** The formula counts the full DiT's forward+backward
> FLOPs uniformly for both LoRA and full FT. LoRA's reported MFU is
> therefore an over-estimate of the *achieved* compute (its backward
> skips `∂L/∂w` for frozen weights), but it lets you compare relative
> throughput across runs on one metric.

## How FLOPs are computed

### Streams: latent vs prompt

Every diffusion transformer the counter supports has two token streams
with distinct per-block linear groups:

- **Latent stream** — the VAE-encoded tokens the denoiser processes.
  Image latents for T2I, spatiotemporal latents for T2V, audio latents
  for T2A. These flow through the "image-side" linears of each block
  (`to_q/to_k/to_v`, `to_out`, the FFN; named `img_mod` / `img_mlp` in
  the Qwen-Image / SD3 family, `attn1` / `ffn` in Wan). The naming
  matches diffusers' own (`latents`, `image_latents`, `all_latents`) —
  a "latent" is whatever VAE-space tensor goes into the denoiser,
  noisy or not.
- **Prompt stream** — tokens that condition the generation. Typically
  text-encoder tokens after attention masking. These flow through the
  "text-side" or cross-attention linears (`add_q/k/v_proj`,
  `to_add_out`, `txt_mlp` in MM-DiT; `attn2`'s KV path in
  Wan-style cross-attention).

For variants that introduce extra latents, the rule is precise and
local: **if two tensors are concatenated along the sequence dim before
hitting a linear, they belong to the same stream for counting** — there
is no third bucket.

| Pipeline pattern | `latent_seqlens` per sample is | `prompt_seqlens` per sample is |
|---|---|---|
| T2I / T2V / T2A (Qwen-Image, SD3, Flux, Wan2.2, Hunyuan, LTX, AudioLDM-style) | image / video / audio latent tokens only | text-encoder tokens (after mask) |
| Img2Img / Edit / Inpaint (Qwen-Image-Edit, SD3-Img2Img) | denoise-target latent **plus** reference latent — concatenated on the image side before the transformer block, so they share `to_q/k/v` and the FFN | text-encoder tokens |
| ControlNet | denoise-target latent **plus** ControlNet conditioning latent (same image-side concat) | text-encoder tokens |
| Img2Vid (Wan2.2-I2V) | video latent tokens only | text tokens **plus** vision-encoder tokens — the reference image is encoded by a separate encoder and concatenated to the text-encoder output, so both go through the cross-attention KV |
| Class-conditioned / unconditional (DiT class-cond) | image latent tokens | 0 (no prompt stream) |

The joint attention term inside `estimate_flops` uses
`(latent_seqlens[i] + prompt_seqlens[i]) ** 2` per sample — the
"concatenated" length you asked about comes out of this product, it is
not stored separately. For Wan-style **self-attn + cross-attn** the
self-attention term uses `latent_seqlens[i] ** 2` and the cross-attention
term uses `latent_seqlens[i] * prompt_seqlens[i]`. Either way, the two
per-stream totals carry enough information; no third "joint" field is
needed.

### Per-call FLOPs (Qwen-Image reference implementation)

Qwen-Image's transformer block has **two parallel residual streams**
(image tokens, text tokens) and the only place they interact is a single
**joint full attention** that runs on the concatenated sequence. The
diffusers source for `QwenImageTransformerBlock.__init__` makes the
asymmetry explicit (file:
`diffusers/models/transformers/transformer_qwenimage.py`):

```python
class QwenImageTransformerBlock(nn.Module):
    def __init__(self, dim, num_attention_heads, attention_head_dim, ...):
        # ---- Image stream (image-side linears) --------------------
        self.img_mod = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))   # per-sample
        self.img_norm1 = nn.LayerNorm(dim, ...)
        self.attn = Attention(                                             # JOINT attention
            query_dim=dim,
            added_kv_proj_dim=dim,           # has its own KV proj for text stream
            dim_head=attention_head_dim, heads=num_attention_heads,
            processor=QwenDoubleStreamAttnProcessor2_0(),
        )
        self.img_norm2 = nn.LayerNorm(dim, ...)
        self.img_mlp = FeedForward(dim=dim, dim_out=dim, ...)              # image-only
        # ---- Text stream (text-side linears) ----------------------
        self.txt_mod = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))   # per-sample
        self.txt_norm1 = nn.LayerNorm(dim, ...)
        # NOTE: text stream has NO separate attention module —
        # the joint attention above handles both streams.
        self.txt_norm2 = nn.LayerNorm(dim, ...)
        self.txt_mlp = FeedForward(dim=dim, dim_out=dim, ...)              # text-only
```

In the corresponding `forward`:

```python
# Image stream consumes img_mlp; runs on img_tot tokens per call.
img_modulated, img_gate2 = self._modulate(self.img_norm2(hidden_states), img_mod2, ...)
hidden_states = hidden_states + img_gate2 * self.img_mlp(img_modulated)

# Text stream consumes txt_mlp; runs on txt_tot tokens per call.
txt_modulated, txt_gate2 = self._modulate(self.txt_norm2(encoder_hidden_states), txt_mod2)
encoder_hidden_states = encoder_hidden_states + txt_gate2 * self.txt_mlp(txt_modulated)
```

In other words, `img_mlp` is **never applied to text tokens** and
`txt_mlp` is **never applied to image tokens** — they have disjoint
inputs, so the FLOPs accounting for them must use disjoint token totals.

#### Per-block FLOPs contribution

The block has three groups of linears, each scaling with a different
"token total":

| Module             | Token scope                  | Per-block params         | FLOPs term it generates |
|--------------------|------------------------------|--------------------------|-------------------------|
| `img_attn_qkv`, `img_attn_out`, `img_mlp` | image tokens only (`img_tot`) | $4 \cdot \mathrm{dim}^2 + 8 \cdot \mathrm{dim}^2 = 12 \cdot \mathrm{dim}^2$ | $6 \cdot L \cdot 12\,\mathrm{dim}^2 \cdot \mathrm{img\_tot}$ |
| `txt_added_kv`, `txt_added_q`, `txt_added_out`, `txt_mlp` | text tokens only (`txt_tot`) | $4 \cdot \mathrm{dim}^2 + 8 \cdot \mathrm{dim}^2 = 12 \cdot \mathrm{dim}^2$ | $6 \cdot L \cdot 12\,\mathrm{dim}^2 \cdot \mathrm{txt\_tot}$ |
| `img_mod`, `txt_mod`             | per-sample (`B`)             | $6 \cdot \mathrm{dim} \cdot \mathrm{dim}$ each | $6 \cdot L \cdot 12\,\mathrm{dim}^2 \cdot B$ |
| `attn` (joint QK·V matmul)       | joint seq (`img_s + txt_s`)  | (no extra weights)       | $12 \cdot L \cdot H \cdot d \cdot \sum_i (\mathrm{img}\_s_i + \mathrm{txt}\_s_i)^2$ |

The joint attention adds **no extra weights** in this row because its QKV
projections are already counted in the two stream rows above
(`img_attn_qkv` on the image side, `added_kv_proj` + `added_q` on the
text side); only the data-dependent $\mathrm{softmax}(QK^\top) \cdot V$
matmuls show up here.

#### Closed-form formula

Define:

- $\mathrm{dim} = \mathrm{num\_attention\_heads} \cdot \mathrm{attention\_head\_dim}$
- $L = \mathrm{num\_layers}$, $H = \mathrm{num\_attention\_heads}$, $d = \mathrm{attention\_head\_dim}$
- $\mathrm{img\_tot} = \sum_i \mathrm{img}\_s_i$, $\mathrm{txt\_tot} = \sum_i \mathrm{txt}\_s_i$, $B = \mathrm{batch\_size}$

```python
img_dense   = 6 * (L * 12*dim**2 + in_channels*dim + patch**2*out_channels*dim) * img_tot
txt_dense   = 6 * (L * 12*dim**2 + joint_attention_dim*dim) * txt_tot
mod_flops   = 6 * L * 12*dim**2 * B           # img_mod + txt_mod (per-sample, not per-token)
attn_flops  = 12 * L * H * d * sum_i (img_s_i + txt_s_i)**2

flops_per_call = (img_dense + txt_dense + mod_flops + attn_flops) \
                 * num_timesteps * num_forward_passes
```

The leading `6 *` factor on dense terms is `2 FLOPs/MAC × 3 (fwd+bwd)`;
the `12 *` factor on `attn_flops` adds another `× 2` for the two
non-causal attention matmuls ($Q \cdot K^\top$ and $\mathrm{softmax} \cdot V$).
Forward-only callers divide the resulting MFU by 3 in
`_postprocess_output` to remove the backward contribution. This matches
upstream verl's `_estimate_qwen3_vit_flop` (non-causal); the dense
convention is also identical to `_estimate_qwen2_flops`.

The extra terms in `img_dense` (the patch-embed input projection and the
patch-unembed output projection) and in `txt_dense` (the text-encoder
projection into the joint dim) are the **non-block** weights applied once
per token at the input and output of the DiT — `img_dense` and `txt_dense`
roll them in for completeness. They are small relative to the $L \cdot
12 \mathrm{dim}^2$ term but the counter still tracks them so absolute
FLOPs match a hand-rolled `model.numel()` reference (the
`TestQwenImageFlopsParamCount` regression test asserts this).

Per-call multipliers:

- `num_timesteps` — denoising-loop depth.
  `data["all_timesteps"].shape[1]` for FlowGRPO-family algorithms; `1`
  for diffusion DPO.
- `num_forward_passes` — `1` (no-CFG / guidance-distilled) or `2` (True-CFG /
  standard CFG), resolved per pipeline by `get_forward_passes_per_step`.

### MFU formula

Given `flops_per_call` and the elapsed wall time `delta_time` returned by
the worker's timer:

```python
peak_FLOPS  = get_device_peak_tflops()                                # device peak in TFLOPS
achieved    = flops_per_call / (delta_time * dp_size)
MFU         = achieved / peak_FLOPS
if forward_only:
    MFU /= 3.0                                                        # remove backward contribution
```

Here ``dp_size = torch.distributed.get_world_size(dp_group)`` (or
``engine.get_data_parallel_size()`` when Ulysses/SP is enabled). It matches
the scope of the DP all-gather, not global ``WORLD`` size.

Here `flops_per_call / delta_time` is already in TFLOPS (the architecture
`estimate_flops` implementations divide by `1e12`), and
`DiffusionFlopsCounter.estimate_flops` returns `(achieved_tflops,
promised_tflops)`.

The ``/ dp_size`` divisor matches the doc definition above:
``_postprocess_output`` consumes DP-allgathered ``flops_per_call`` and
divides by ``get_world_size(dp_group)`` (same scope as the seqlen gather).
On the diffusion side this is reached via ``allgather_diffusion_flops_meta``
gathering ``latent_seqlens`` and ``prompt_seqlens`` across the DP group
*before* ``estimate_flops`` runs.

## Adding a new architecture

Adding a new architecture is **one class with one required method**.
Subclass `DiffusionModelFlops`, implement `estimate_flops`, and
register it. The base-class `get_latent_seqlens` and
`get_prompt_seqlens` extractors cover the standard `(B, C, *spatial)`
layouts (including FlowGRPO rollout-stacked variants), so most new
T2I / T2V / T2A models do not write any data-plumbing code.

### Step 1 — Identify the pipeline class name

Open the model directory's `model_index.json` and read the top-level
`_class_name`. That string is the registry key.

```json
{
  "_class_name": "WanPipeline",
  "_diffusers_version": "0.32.0",
  "transformer": ["diffusers", "WanTransformer3DModel"]
}
```

`DiffusionModelConfig.architecture` is set to this value automatically
when the model is loaded, so the counter dispatches on it without any
config plumbing.

### Step 2 — Read the diffusers transformer-block source

Open the corresponding transformer block in the diffusers source (for
Wan: `diffusers/models/transformers/transformer_wan.py`). The attention
topology and the per-block linear weights are what you need.

For Wan2.2, the relevant section is:

```python
class WanTransformerBlock(nn.Module):
    def __init__(self, dim, ffn_dim, num_heads, ...):
        # 1. Self-attention on image tokens only.
        self.attn1 = WanAttention(dim=dim, heads=num_heads,
                                  dim_head=dim // num_heads, ...)
        # 2. Cross-attention from image tokens to text encoder.
        self.attn2 = WanAttention(dim=dim, heads=num_heads,
                                  dim_head=dim // num_heads,
                                  added_kv_proj_dim=added_kv_proj_dim, ...)
        # 3. Feed-forward on image tokens.
        self.ffn = FeedForward(dim, inner_dim=ffn_dim, ...)
```

Two facts matter for the estimator:

- **Attention topology**: `attn1` is **self-attention on the image
  stream only** (cost $\propto \mathrm{img}\_s^2$). `attn2` is
  **cross-attention from image to text** (cost $\propto \mathrm{img}\_s
  \cdot \mathrm{txt}\_s$). This is different from Qwen-Image's joint
  full attention $\propto (\mathrm{img}\_s + \mathrm{txt}\_s)^2$.
- **Per-block linear weights**:
  - Self-attn: $\mathrm{QKV}$ on image ($3 \cdot \mathrm{dim}^2$) plus
    output projection ($\mathrm{dim}^2$) → $4 \cdot \mathrm{dim}^2$.
  - Cross-attn: $\mathrm{Q}$ on image ($\mathrm{dim}^2$), $\mathrm{KV}$
    on the text stream ($2 \cdot \mathrm{dim} \cdot
    \mathrm{added\_kv\_proj\_dim}$), and the output projection
    ($\mathrm{dim}^2$). If `added_kv_proj_dim == dim` this simplifies
    to $4 \cdot \mathrm{dim}^2$; otherwise compute it from the config.
  - FFN: $\mathrm{dim} \to \mathrm{ffn\_dim} \to \mathrm{dim}$ → $2
    \cdot \mathrm{dim} \cdot \mathrm{ffn\_dim}$ (i.e. **not** the $8
    \cdot \mathrm{dim}^2$ assumption Qwen-Image makes; Wan uses an
    explicit `ffn_dim`).

### Step 3 — Write the architecture class

```python
# verl_omni/utils/mfu/qwen_image.py

@register_diffusion_architecture(
    "WanPipeline",
    "WanPipelineWithLogProb",   # alias, if you ship a custom rollout class
)
class WanFlops(DiffusionModelFlops):
    """Wan2.2 DiT FLOPs estimator (self-attn + cross-attn topology)."""

    # latent_seqlens and prompt_seqlens are inherited from the base class.
    # Wan's (B, C, T, H, W) video latents and FlowGRPO's (B, T_steps, C,
    # T, H, W) rollout-stacked variant are both handled by the default
    # extractor — the latent stream here is the video latent tokens.

    def estimate_flops(
        self,
        latent_seqlens: Sequence[int],
        prompt_seqlens: Sequence[int],
        delta_time: float,
        *,
        num_timesteps: int,
        num_forward_passes: int,
    ) -> float:
        num_heads  = int(self.config["num_attention_heads"])
        head_dim   = int(self.config["attention_head_dim"])
        num_layers = int(self.config["num_layers"])
        ffn_dim    = int(self.config["ffn_dim"])
        added_kv   = int(self.config.get("added_kv_proj_dim") or self.dim)
        dim        = self.dim

        # latent_s = tokens flowing through attn1 + ffn (the image-side
        # linears in Wan; the latent stream here is the video latents).
        latent_tot = sum(int(s) for s in latent_seqlens)
        prompt_tot = sum(int(s) for s in prompt_seqlens)
        batch      = max(len(latent_seqlens), len(prompt_seqlens))

        # Per-block linear param counts.
        self_attn_n = 4 * dim * dim                           # QKV + out, latent-side
        cross_q_n   = 1 * dim * dim                           # Q from latent stream
        cross_kv_n  = 2 * dim * added_kv                      # KV from prompt stream
        cross_o_n   = 1 * dim * dim
        ffn_n       = 2 * dim * ffn_dim                       # explicit ffn_dim, no 4x assumption

        # Dense FLOPs.
        latent_dense = self.compute_dense_flops(num_layers * (self_attn_n + ffn_n), latent_tot)
        cross_dense  = self.compute_dense_flops(
            num_layers * (cross_q_n + cross_o_n), latent_tot
        ) + self.compute_dense_flops(
            num_layers * cross_kv_n, prompt_tot
        )
        mod_flops    = self.compute_dense_flops(num_layers * (6 * dim * dim), batch)  # per-sample timestep embed

        # Attention FLOPs. Factor 12 = 2 FLOPs/MAC * 2 matmuls * 3 (fwd+bwd).
        self_attn_flops = 12 * num_layers * num_heads * head_dim * sum(int(s) ** 2 for s in latent_seqlens)
        cross_attn_flops = 12 * num_layers * num_heads * head_dim * sum(
            int(l) * int(p)
            for l, p in zip(latent_seqlens, prompt_seqlens, strict=False)
        )

        flops_per_call = (
            latent_dense + cross_dense + mod_flops + self_attn_flops + cross_attn_flops
        ) * num_timesteps * num_forward_passes
        return flops_per_call / delta_time / 1e12              # → TFLOPS achieved
```

What you did **not** have to write:

- **Latent → seqlen extraction.** The base class default reads
  `data["image_latents"]` (training) or `data["all_latents"]`
  (FlowGRPO rollout-stacked) and returns the product of the spatial
  dims. Wan's `(B, C, T, H, W)` produces `T*H*W` tokens per sample
  out of the box; the rollout-stacked `(B, T_steps, C, T, H, W)`
  collapses to the same per-sample count. MM-DiT-family pipelines
  (Qwen-Image, SD3, Flux, ...) call `diffusers._pack_latents` *before*
  the transformer, reshaping `(B, C, H, W)` into a packed
  `(B, L, C')` (or `(B, T_steps, L, C')` after FlowGRPO stacking) with
  `L = (H/p) * (W/p)` and `C' = C * p**2 == in_channels`;
  `QwenImageFlops.get_latent_seqlens` overrides the default to detect
  this layout via `shape[-1] == in_channels` and return `L` per
  sample, so subclasses inheriting from `QwenImageFlops` get the
  packed handling for free.
- **Prompt → seqlen extraction.** The default reads
  `prompt_embeds_mask` (nested or dense) and falls back to dense
  `prompt_embeds.shape[1]` or zeros for unconditional models.
- **CFG-pass detection.** `get_forward_passes_per_step` already covers Wan's
  `guidance_scale > 1` → 2 passes, including the
  `guidance_embeds=True` short-circuit for guidance-distilled variants.
- **Distributed all-gather.** `TrainingWorker._allgather_diffusion_flops_meta`
  is topology-agnostic.
- **Forward-only divisor.** `_postprocess_output` applies the `/3` after
  `estimate_flops` returns.
- **Device peak lookup.** The counter reuses
  `verl.utils.flops_counter.get_device_flops()`.

#### Sidebar — overriding `get_latent_seqlens` for Img2Img / Edit / ControlNet

Image-edit and ControlNet variants concatenate reference latents to the
denoise-target latents along the sequence dim before the transformer
block, so the reference tokens flow through the **same** image-side
linears (`to_q/k/v`, `to_out`, the FFN) as the denoise targets. They
therefore belong on the latent stream — the effective
`latent_seqlens[i]` becomes
`denoise_target_token_count + reference_token_count` per sample, not a
separate field. Subclass the parent T2I class and override
`get_latent_seqlens` only; `estimate_flops` is inherited:

```python
@register_diffusion_architecture("QwenImageEditPipeline")
class QwenImageEditFlops(QwenImageFlops):
    def get_latent_seqlens(self, data: Any = None, config: Optional[Mapping[str, Any]] = None) -> list[int]:
        # `super()` already handles the diffusers-packed (B, L, C')
        # layout for the denoise-target stream; the reference stream
        # arrives in the same packed shape, so its L is just shape[-2].
        base = super().get_latent_seqlens(data, config)
        ref = data.get("reference_image_latents")
        if ref is None:
            return base
        ref_per_sample = int(ref.shape[-2])
        return [b + ref_per_sample for b in base]
```

The same pattern applies to Img2Img, Inpaint, and ControlNet — just
swap the `reference_*` key for whichever your pipeline stores the
extra latents under. For Img2Vid models that concatenate
vision-encoder tokens to the text-encoder output instead, override
`get_prompt_seqlens` in the same way (add the encoded reference-image
token count to each per-sample entry).

### Step 4 — Add a unit test

```python
# tests/utils/test_diffusion_flops_counter_on_cpu.py

WAN_CONFIG: dict = {
    "_class_name": "WanTransformer3DModel",
    "num_attention_heads": 16,
    "attention_head_dim": 128,
    "num_layers": 30,
    "ffn_dim": 8192,
    "added_kv_proj_dim": 2048,
}

class TestWanFlopsScaling:
    def test_linear_in_num_timesteps(self):
        counter = DiffusionFlopsCounter("WanPipeline", WAN_CONFIG)
        kw = dict(latent_seqlens=[512] * 2, prompt_seqlens=[64] * 2,
                  delta_time=1.0, num_forward_passes=1)
        est_a, _ = counter.estimate_flops(num_timesteps=10, **kw)
        est_b, _ = counter.estimate_flops(num_timesteps=30, **kw)
        assert math.isclose(est_b / est_a, 3.0, rel_tol=1e-9)

    def test_quadratic_in_latent_seqlen(self):
        counter = DiffusionFlopsCounter("WanPipeline", WAN_CONFIG)
        kw = dict(prompt_seqlens=[64], delta_time=1.0,
                  num_timesteps=1, num_forward_passes=1)
        small, _ = counter.estimate_flops(latent_seqlens=[256], **kw)
        large, _ = counter.estimate_flops(latent_seqlens=[512], **kw)
        # Self-attn is quadratic, dense is linear → ratio is between 2 and 4.
        assert 2.0 < large / small < 4.0
```

Mirror the pattern in `TestQwenImageFlopsScaling` for fuller coverage
(hand-rolled reference comparison, `num_forward_passes=2`, batch-shape sweep).
For architectures whose weights are tractable to enumerate, also add a
`TestArchFlopsParamCount` test that asserts the per-block parameter
count baked into the estimator matches `block.numel()` on a
freshly-instantiated tiny model.

### Step 5 — Verify on a smoke run

Use any existing diffusion-RL launch script (e.g.
`examples/flowgrpo_trainer/run_qwen_image_ocr.sh`, or the H200-tuned
`examples/flowgrpo_trainer/run_qwen_image_ocr_h200_mfu_optimized.sh`
with `export VERL_OMNI_DEVICE_FLOPS_TFLOPS=989`). Look for the two keys
in your logger output:

```text
{"perf/mfu/actor": 0.XX, "perf/mfu/actor_infer": 0.XX, ...}
```

(Any in-range value confirms the counter is wired up. Absolute numbers
depend on model, hardware, batch shape, and parallelism.)

If `perf/mfu/actor` is `0` or missing, check:

1. `DiffusionModelConfig.architecture` matches the string you passed
   to `@register_diffusion_architecture`. The pipeline class name in
   `model_index.json` is the source of truth.
2. The transformer config file exists at
   `<local_path>/transformer/config.json`. The counter warns and
   degrades to `0` when this file is missing.
3. The config fields your `estimate_flops` reads (`num_layers`,
   `ffn_dim`, ...) actually appear in the diffusers config. Print
   `counter.config` to confirm.
4. The default `get_latent_seqlens` finds your latents. The default
   looks for `data["image_latents"]` then `data["all_latents"]`. If
   your pipeline stores its latent-stream tensor under a different key
   (e.g. `data["audio_latents"]`), override `get_latent_seqlens` in
   the architecture class — same pattern as the Edit sidebar above.

If `perf/mfu/actor > 1.0`, the two common causes are:

1. **Mis-identified device peak.** `verl.utils.flops_counter.get_device_flops`
   matches `torch.cuda.get_device_name()` by substring against a built-in
   table. On clusters with relabeled SKUs (e.g. H200 cards reporting as
   `"NVIDIA L20X"` via VBIOS), the substring match falls through to the
   first hit (`"L20"`, 119.5 TFLOPS bf16 dense) rather than the real
   silicon peak (`H200`, 989 TFLOPS), inflating reported MFU by roughly
   the peak ratio. Pin the correct peak via the env var:

   ```bash
   export VERL_OMNI_DEVICE_FLOPS_TFLOPS=989   # H200 bf16 dense
   ```

   Honored by `get_device_peak_tflops()` in
   `verl_omni.utils.mfu` and consumed by
   `DiffusionFlopsCounter.estimate_flops`. See
   `tests/utils/test_diffusion_flops_counter_on_cpu.py::TestDevicePeakOverride`.
2. **Missing DP gather of seqlens.** `_allgather_diffusion_flops_meta`
   handles this generically for the shipped path, so it should only
   trigger if you added a new metadata field that bypasses the gather.
   The regression test `TestDPGlobalConsistency` guards against this.

## Tuning and Improving MFU

For diffusion RL workloads (like FlowGRPO), achieving high MFU requires balancing memory constraints with compute and communication overheads. Based on optimizations for 20B+ models on H200 clusters, here are the primary levers to improve MFU:

1. **Disable Offloading (If Memory Permits):**
   - **`param_offload`**: Setting this to `False` provides the largest MFU gain. Offloading parameters requires a massive PCIe round-trip every forward/backward pass.
   - **`optimizer_offload`**: Moving Adam states to CPU and running the update there severely bottlenecks the `update_actor` phase. Set to `False` if possible.
   - *Tuning Strategy*: Start with both off. If you hit an Out of Memory (OOM) error during `update_weights` or `update_actor`, re-enable `optimizer_offload=True` first (as it doesn't impact the forward pass), and only enable `param_offload=True` as a last resort.

2. **Reduce Sequence Parallelism (SP):**
   - For moderate sequence lengths (e.g., ~1024 tokens for 512x512 latents), the all-to-all communication overhead of Ulysses SP outweighs its memory benefits.
   - Setting `ulysses_sequence_parallel_size=1` removes this overhead and increases your Data Parallel (DP) size, which reduces FSDP shard sizes.

3. **Increase Micro-Batch Size:**
   - Increasing `ppo_micro_batch_size_per_gpu` (e.g., from 16 to 32) helps amortize FSDP all-gather and reduce-scatter collective overheads.
   - *Note*: Once the effective matrix dimensions (M, N, K) exceed ~512, tensor cores are generally saturated, so returns diminish quickly.

4. **Layered Summon:**
   - If **both** `param_offload=False` and `optimizer_offload=False`, set
     `layered_summon=False` so weight sync can load the full model at once.
   - When `param_offload=True` (common on colocated hybrid actor/rollout +
     reward setups), keep `layered_summon=True` — disabling it tends to
     OOM during `update_weights`.

5. **Account for Gradient Checkpointing:**
   - If `enable_gradient_checkpointing: true` is set in your config, the *physical* MFU is actually ~33% higher than the reported MFU. The counter formula assumes a standard 1 forward + 2 backward passes (factor of 6), but checkpointing requires an additional recompute pass (1 fwd + 1 recompute + 2 bwd = 8).

## Caveats and limitations

- **LoRA over-estimates achieved compute.** As noted above, the formula
  treats LoRA and full FT identically. Use the absolute number for
  *relative* comparisons across runs, not as a hardware benchmark.
- **SP padding undercounts attention.** The counter feeds
  `prompt_embeds_mask.sum(-1)` (the unpadded length) into the formula,
  but the model runs on prompt embeds padded to a multiple of `sp_size`
  by `_pad_embeds_for_sp`. The undercounted text-side seqlen is at most
  `sp_size - 1` tokens per sample; the corresponding share of
  `flops_per_call` depends on the model's text-vs-image work ratio
  but is dominated by the image stream and joint attention.
- **CFG with gradient detachment.** If a future loss path detaches the
  negative-CFG branch's backward, `num_forward_passes` becomes a slight
  over-estimate ($\le 2\times$). The pipeline can override the
  detection with an explicit `pipeline.num_forward_passes: 1` field.
- **Image-edit / Img2Img / Inpaint / ControlNet variants are not yet
  estimated.** These pipelines concatenate reference latents to the
  denoise-target latents along the sequence dim, so the effective
  `latent_seqlens` is larger than the spatial-dim product of the
  denoise-target tensor alone; the current registry warns + reports
  MFU=0 for them rather than under-counting silently. See [Adding a
  new architecture](#adding-a-new-architecture) for the override
  pattern.
- **Rollout FLOPs are out of scope.** vLLM-Omni runs the rollout
  decoder outside the `TrainingWorker.Timer` block and on possibly
  different hardware; attributing FLOPs there is a follow-up.

## Further reading

- [Upstream `verl.utils.flops_counter`](https://github.com/verl-project/verl/blob/main/verl/utils/flops_counter.py) — the LLM-side counter and the `get_device_flops` table.
- {ref}`metrics` — FlowGRPO-specific metrics (`zero_std_ratio`, `ratio_mean`, `pg_clipfrac_*`).
- [`docs/perf/profiler.md`](profiler.md) — `nsys` / `torch.profiler` recipes when MFU alone is not enough to localise a regression.