Mix-GRPO
Last updated: 05/12/2026.
Mix-GRPO (paper, code) extends Flow-GRPO with a Mixed ODE-SDE rollout and a sliding-window training schedule that together greatly cut the cost of online RL fine-tuning of flow-matching diffusion models.
The rollout uses deterministic ODE sampling outside a contiguous window of denoising steps and stochastic SDE sampling inside the window – only the in-window steps yield meaningful log-probabilities and contribute to the policy gradient.
A trainer-side scheduler slides the window across training iterations, so over the course of training every part of the trajectory is exercised while each individual rollout still pays SDE cost on a small fraction of the trajectory.
In practice this lets you keep a short inference horizon (e.g. 10 steps) for fast iteration while MixGRPO’s ODE/SDE split reduces gradient variance, or scale to a longer horizon (e.g. 50 steps) and still train at the cost of a small SDE window per rollout.
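The ODE/SDE split described above can be illustrated with a minimal sketch. The function name `per_step_noise` and its arguments are hypothetical and chosen for illustration; this is not the verl-omni rollout code, only a picture of which steps receive noise and therefore contribute to the policy gradient.

```python
def per_step_noise(num_steps, window_start, window_size, noise_level):
    """Return the noise level applied at each denoising step.

    Steps inside [window_start, window_start + window_size) use SDE sampling
    with the configured noise level; every other step uses noise 0, i.e. a
    deterministic ODE step.
    """
    window_end = window_start + window_size
    return [
        noise_level if window_start <= i < window_end else 0.0
        for i in range(num_steps)
    ]


# A 10-step trajectory with a 2-step SDE window starting at step 3.
noise = per_step_noise(num_steps=10, window_start=3, window_size=2, noise_level=1.0)

# Only the in-window steps would yield meaningful log-probabilities.
trainable_steps = [i for i, n in enumerate(noise) if n > 0]
```

In this sketch only steps 3 and 4 are stochastic, so each rollout pays SDE cost on 2 of its 10 steps.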
How verl-omni implements MixGRPO
MixGRPO shares the SDE step formula, advantage estimator and loss with
FlowGRPO, so only the (architecture, algorithm) adapter pair changes.
| Layer | What it does | Code |
|---|---|---|
| Algo config | Two extra knobs ( | |
| Adapter pair | Subclasses of the FlowGRPO adapters re-registered under | |
| Rollout | Already supports a contiguous SDE window (ODE outside / SDE inside) – no changes needed. | |
Configuration
Algorithm dispatch lives on actor_rollout_ref.model.algorithm; everything
else is rollout configuration under actor_rollout_ref.rollout.algo.
Because MixGRPO reuses FlowGRPO’s advantage estimator and PPO loss verbatim,
you must also pin the two cascaded fields back to flow_grpo so that they do
not propagate the unknown mix_grpo token into the validators,
DiffusionLossConfig.valid_modes and the DiffusionAdvantageEstimator enum
(see Caveat: cascade vs. validators below):
    algorithm:
      adv_estimator: flow_grpo          # MixGRPO reuses FlowGRPO's estimator
    actor_rollout_ref:
      model:
        algorithm: mix_grpo             # selects the (arch, algo) adapter pair
      actor:
        diffusion_loss:
          loss_mode: flow_grpo          # MixGRPO reuses FlowGRPO's loss
      rollout:
        algo:
          # ----- Common SDE configs ---------------------------------------------
          noise_level: 1.0              # SDE noise magnitude
          sde_type: sde                 # sde | cps
          sde_window_size: null         # window length / "group size"
          sde_window_range: null        # [start, end] envelope; null = full trajectory
          # ----- MixGRPO sliding-window scheduler (mix_grpo only) -------------
          sample_strategy: random       # random | progressive
          iters_per_group: 1            # progressive only
          sde_window_seed: 0            # random only
Caveat: cascade vs. validators
actor_rollout_ref.model.algorithm is wired to four dispatch points via
OmegaConf templates of the form
${oc.select:actor_rollout_ref.model.algorithm,flow_grpo}:
1. The (architecture, algorithm) adapter pair lookups (DiffusionModelBase and VllmOmniPipelineBase).
2. algorithm.adv_estimator.
3. actor_rollout_ref.actor.diffusion_loss.loss_mode.
Setting actor_rollout_ref.model.algorithm=mix_grpo therefore propagates
mix_grpo to all four sites. Adapter dispatch (1) succeeds because both adapters
are registered under algorithm="mix_grpo". The other two points fail at
runtime because:
- DiffusionAdvantageEstimator only enumerates flow_grpo, so compute_advantage raises when it tries to look up mix_grpo.
- DiffusionLossConfig.__post_init__ checks loss_mode in ["flow_grpo"] and raises ValueError: Invalid diffusion loss_mode: mix_grpo.
Pinning algorithm.adv_estimator=flow_grpo and
actor_rollout_ref.actor.diffusion_loss.loss_mode=flow_grpo keeps the
adapter dispatch on mix_grpo while the estimator/loss stay on the
existing FlowGRPO implementations.
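The failure and the fix can be reproduced in miniature. The sketch below uses a hypothetical stand-in class (`LossConfigSketch`) and a hypothetical `resolve` helper that imitates the cascade; it does not import verl-omni or OmegaConf, and the real classes may differ in detail.

```python
from dataclasses import dataclass

VALID_LOSS_MODES = ["flow_grpo"]  # mirrors the check described above


@dataclass
class LossConfigSketch:
    """Hypothetical stand-in for DiffusionLossConfig; not the real class."""

    loss_mode: str = "flow_grpo"

    def __post_init__(self):
        if self.loss_mode not in VALID_LOSS_MODES:
            raise ValueError(f"Invalid diffusion loss_mode: {self.loss_mode}")


def resolve(overrides, model_algorithm):
    """Imitate the cascade: the field follows model.algorithm unless pinned."""
    return overrides.get("loss_mode", model_algorithm)


# Unpinned: mix_grpo cascades into loss_mode and the validator rejects it.
try:
    LossConfigSketch(loss_mode=resolve({}, "mix_grpo"))
    unpinned_ok = True
except ValueError as err:
    unpinned_ok = False
    unpinned_error = str(err)

# Pinned: the explicit override wins, so validation passes.
pinned = LossConfigSketch(loss_mode=resolve({"loss_mode": "flow_grpo"}, "mix_grpo"))
```

The same reasoning applies to algorithm.adv_estimator: an explicit value replaces the cascaded default before the enum lookup happens.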
Field semantics
- actor_rollout_ref.model.algorithm – selects the registered (architecture, algorithm) adapter pair. mix_grpo routes to verl_omni/pipelines/qwen_image_mix_grpo/.
- noise_level – magnitude of injected SDE noise inside the window. Outside the window noise_level is forced to 0 so the step degenerates to a deterministic Euler ODE step.
- sde_type – sde (FlowGRPO formulation) or cps (Coefficients-Preserving Sampling).
- sde_window_size – length of the active SDE window, called “group size” in MixGRPO. null means “use the entire trajectory” (the legacy FlowGRPO setting).
- sde_window_range – a [start, end] envelope of valid window-start positions:
  - Under random, the rollout backend draws the start uniformly from [start, end - sde_window_size + 1).
  - Under progressive, the same envelope clamps the deterministic sliding window.
  - null defaults to the full trajectory [0, num_inference_steps] (minus the last ODE step where sigma_prev = 0).
- sample_strategy – MixGRPO only. random draws a fresh window per step (seeded so all ranks agree); progressive advances the window by sde_window_size every iters_per_group iterations.
- iters_per_group – MixGRPO progressive only. Number of training iterations spent at each window position.
- sde_window_seed – MixGRPO random only. Base seed for the per-step random window draws. Distinct from the rollout generator seed (val_kwargs.seed) so the two random streams stay decoupled.
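The two sampling strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the verl-omni scheduler: the function names are hypothetical, deriving the per-step seed as seed + step is an assumption about how ranks could agree without communicating, and wrapping back to the start of the envelope in progressive mode is also an assumption.

```python
import random


def random_window_start(step, seed, start, end, window_size):
    """Draw a window start uniformly from [start, end - window_size + 1).

    Assumption: seeding from (seed + step) gives every rank the same draw
    for a given training step.
    """
    rng = random.Random(seed + step)
    return rng.randrange(start, end - window_size + 1)


def progressive_window_start(step, iters_per_group, start, end, window_size):
    """Advance the window by window_size every iters_per_group steps.

    Assumption: after the last position inside [start, end] the window
    wraps back to `start`.
    """
    num_positions = max(1, (end - start) // window_size)
    position = (step // iters_per_group) % num_positions
    return start + position * window_size


# Random strategy with the envelope [0, 5] and a 2-step window.
random_starts = [random_window_start(s, 42, 0, 5, 2) for s in range(100)]

# Progressive strategy over [0, 10], 2 iterations per window position.
progressive_starts = [progressive_window_start(s, 2, 0, 10, 2) for s in range(6)]
```

Because the random draw depends only on the base seed and the step index, it stays independent of the rollout generator seed, which matches the decoupling described for sde_window_seed.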
Validation
Validation always uses the deterministic ODE path with noise_level=0, so
the sliding-window settings are irrelevant there.
Reference recipe
A ready-to-run script is provided at
examples/mixgrpo_trainer/run_qwen_image_ocr_lora_mixgrpo.sh. The default
config uses a 10-step trajectory with a 2-step window (random strategy),
matching the FlowGRPO baseline’s inference budget:
    algorithm.adv_estimator=flow_grpo
    actor_rollout_ref.model.algorithm=mix_grpo
    actor_rollout_ref.actor.diffusion_loss.loss_mode=flow_grpo
    actor_rollout_ref.rollout.algo.sample_strategy=random
    actor_rollout_ref.rollout.algo.sde_window_seed=42
    actor_rollout_ref.rollout.algo.sde_window_size=2
    actor_rollout_ref.rollout.algo.sde_window_range=[0,5]
    actor_rollout_ref.rollout.algo.noise_level=1.2
    actor_rollout_ref.rollout.algo.sde_type=sde
The first and third lines pin the cascaded estimator/loss back to
flow_grpo; see Caveat: cascade vs. validators.
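To see how these dotted overrides line up with the nested YAML in the Configuration section, the following sketch expands key=value strings into a nested dict. The helper `overrides_to_dict` is hypothetical and keeps every value as a string; the actual launcher's parsing is not shown in this document and may differ.

```python
def overrides_to_dict(overrides):
    """Expand dotted key=value overrides into a nested dict (values stay strings)."""
    config = {}
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


cfg = overrides_to_dict([
    "algorithm.adv_estimator=flow_grpo",
    "actor_rollout_ref.model.algorithm=mix_grpo",
    "actor_rollout_ref.actor.diffusion_loss.loss_mode=flow_grpo",
    "actor_rollout_ref.rollout.algo.sde_window_size=2",
])
```

The resulting structure has the estimator and loss pinned to flow_grpo while model.algorithm selects mix_grpo, which is the combination the recipe relies on.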
Tuning guide
The two most impactful parameters are num_inference_steps (rollout
trajectory length) and sde_window_size (how many steps use SDE).
| Setting | num_inference_steps | sde_window_size | | Speed | Quality |
|---|---|---|---|---|---|
| Fast (default) | 10 | 2 | | ~7 min/step | Good; matches the FlowGRPO budget |
| Long trajectory | 50 | 4 | | ~23 min/step | Higher reward baseline, but gradients are diluted (only 8% of the trajectory is SDE) |
Guidelines:
- Start with the default (10 steps, window 2). This gives the fastest iteration and strongest learning signal per step because a larger fraction of the trajectory contributes to gradients.
- Increase num_inference_steps (e.g. 50) when image quality at rollout time is important and you can afford the wall-clock cost. Pair it with a larger sde_window_size (e.g. 4) to keep the gradient signal strong.
- The sde_window_size / num_inference_steps ratio controls the trade-off: a higher ratio means more gradient signal per step but higher SDE cost; a lower ratio is cheaper but gradients are noisier.
- sample_strategy: use random for short trajectories (window positions are already well-covered); use progressive with iters_per_group for long trajectories to ensure systematic coverage.
- Validation always uses the deterministic ODE path (noise_level=0) regardless of training settings.
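The ratio mentioned in the guidelines is simple to compute for the two settings in the table. The helper name `sde_fraction` is hypothetical and used only for illustration.

```python
def sde_fraction(num_inference_steps, sde_window_size):
    """Fraction of the rollout trajectory that uses SDE sampling."""
    return sde_window_size / num_inference_steps


fast = sde_fraction(num_inference_steps=10, sde_window_size=2)
long_trajectory = sde_fraction(num_inference_steps=50, sde_window_size=4)

fast_label = f"{fast:.0%}"
long_label = f"{long_trajectory:.0%}"
```

The default setting trains on 20% of each trajectory, while the long-trajectory setting trains on 8%, which is the figure quoted in the table.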
References
MixGRPO: J. Li et al., MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE, arXiv:2507.21802.
MixGRPO repo: https://github.com/Tencent-Hunyuan/MixGRPO.
FlowGRPO: Y. Liu et al., Flow-GRPO: Training Flow Matching Models via Online RL, arXiv:2505.05470.
Coefficients-Preserving Sampling: arXiv:2509.05952.