Common Pitfalls
Last updated: 06/29/2026.
Float32 Precision Loss in Stored Rollout Latents
Symptom
Training metrics show a systematic negative bias at step 1 (before any weight update):
- actor/ratio_mean consistently below 1.0 (e.g. 0.99996)
- actor/ppo_kl and actor/pg_clipfrac inflated at step 1
- actor/pg_clipfrac_higher is zero: all clipping is on the lower side

Most visible with rollout correction (bypass_mode=True), but also degrades stored trajectory precision in standard training.
Root cause
FlowMatchSDEDiscreteScheduler.step() computes log_prob in float32
using the fp32 prev_sample, then casts prev_sample back to
model_output.dtype (bfloat16) before returning. The stored latents
lose precision, creating a mismatch with the log-prob computation.
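The mismatch can be reproduced without any ML framework. The sketch below is illustrative only: `to_bf16` and `gaussian_log_prob` are hypothetical helpers, the Gaussian form of the log-prob and the numbers are assumptions, and bfloat16 is emulated by rounding the float32 bit pattern to its top 16 bits.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps only the top 16 bits of the float32 bit pattern.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def gaussian_log_prob(sample: float, mean: float, std: float) -> float:
    """Log-density of a scalar Gaussian (assumed form of the step log-prob)."""
    return (
        -((sample - mean) ** 2) / (2 * std**2)
        - math.log(std)
        - 0.5 * math.log(2 * math.pi)
    )


mean, std = 0.5, 0.05          # toy values, not taken from a real run
prev_sample = 0.51234567       # full-precision sample used for log_prob
stored = to_bf16(prev_sample)  # what is kept after the cast to bfloat16

log_prob_rollout = gaussian_log_prob(prev_sample, mean, std)
log_prob_train = gaussian_log_prob(stored, mean, std)

print(stored)  # 0.51171875
print(log_prob_rollout - log_prob_train)  # non-zero: the two no longer agree
```

The log-prob recorded at rollout time describes a sample that is no longer the one stored in the trajectory, so recomputing it from the stored latents cannot reproduce the original value.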
Fix
Two changes in the scheduler, one in the rollout adapter. The training adapter is unchanged — it already uses fp32 correctly.
1. Scheduler — step() no longer truncates prev_sample to bfloat16,
and sample_previous_step() asserts model_output is float32 so callers
cannot accidentally pass lower precision.
2. Rollout adapter — latents are cast to the transformer’s native dtype before the forward pass (performance), noise_pred is cast to float32 before the scheduler (precision), and all stored latents are in float32.
Verification
The fix eliminates the systematic precision-loss bias from the scheduler.
In non-bypass mode (no rollout correction) ratio_mean ≈ 1.0 at step 1.
In bypass mode a ~3×10⁻⁵ KL divergence remains due to the vLLM vs PyTorch
attention kernel difference, which is unavoidable when using different
inference backends.
| Metric | Before fix (bypass) | After fix (bypass) | No bypass |
|---|---|---|---|
| actor/ppo_kl | ~3.6×10⁻⁵ | ~3.3×10⁻⁵ | ~1×10⁻⁶ |
| actor/pg_clipfrac | ~12% | ~9% | ~1% |
RoPE Sequence Length Mismatch
Symptom (RoPE)
When step_execution=True, actor/ppo_kl is elevated even at step 1
compared to the full-forward (step_execution=False) path. The effect
persists across training steps and cannot be eliminated by the fp32
latent-storage fix alone.
This also affects the stock vllm-omni (non-stepwise) path in some configurations — the root cause is upstream, not specific to stepwise mode.
Root cause (RoPE)
vllm-omni sets Rotary Position Embedding (RoPE) sequence lengths from
mask.sum() (valid token count), while diffusers sets them from the
padded encoder-tensor width (text_seq_len). Under continuous batching,
vllm-omni pads all requests to a shared target_seq_len, so valid
tokens at positions beyond ~50 receive incorrect RoPE — they get the
positional encoding of a much shorter sequence.
Concretely, if a request has 200 valid tokens and is padded to width
1058, mask.sum() = 200 but the embedding width is 1058. The RoPE
position for token 100 is computed as position 100 of a 200-length
sequence rather than position 100 of a 1058-length sequence.
Fix (RoPE)
In prepare_encode, set txt_seq_lens from the padded embed width
instead of from mask.sum():
# Wrong (vllm-omni default):
txt_seq_lens = [int(mask.sum()) for mask in prompt_embeds_mask]
# Correct (matches diffusers):
txt_seq_lens = [int(prompt_embeds.shape[1])] * int(prompt_embeds.shape[0])
The stepwise adapters in verl_omni/experimental/ already do this.
The stock vllm-omni path is still affected and tracked as an upstream
issue.
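The difference between the two conventions is easy to see on toy data. This is a minimal sketch using plain Python lists in place of tensors; the mask layout (1 for valid tokens, 0 for padding) is an assumption, and the widths reuse the 200 / 1058 example from the root-cause description.

```python
# Toy attention masks: 1 marks a valid token, 0 marks padding.
padded_width = 1058
valid_counts = [200, 1058]
prompt_embeds_mask = [[1] * n + [0] * (padded_width - n) for n in valid_counts]

# Valid-token count per request (the mask.sum() convention).
lens_from_mask = [sum(mask) for mask in prompt_embeds_mask]

# Padded width repeated once per request (the diffusers convention).
lens_from_width = [len(prompt_embeds_mask[0])] * len(prompt_embeds_mask)

print(lens_from_mask)   # [200, 1058]
print(lens_from_width)  # [1058, 1058]
```

Only the second form gives every request in the batch the same sequence length as the padded embedding it is paired with.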
Verification (RoPE)
Compare actor/ppo_kl at step 1 between step_execution=True and
step_execution=False runs with all other knobs identical. After the
fix the difference should be within numerical tolerance (~3×10⁻⁵ KL
divergence due to unavoidable vLLM vs PyTorch attention kernel
difference).
Float32 Precision Loss in Stepwise Scheduler
Symptom (Stepwise Scheduler)
Training metrics show a systematic negative bias at step 1 when
step_execution=True:
- actor/ratio_mean consistently below 1.0
- actor/ppo_kl and actor/pg_clipfrac inflated at step 1
- actor/pg_clipfrac_higher is zero: all clipping is on the lower side

The same model/config produces correct ratio_mean ≈ 1.0 when
step_execution=False.
Root cause (Stepwise Scheduler)
step_scheduler stores new_latents in the model’s compute dtype (bf16)
instead of fp32. The trainer later recomputes log-probs on these stored
latents via FlowMatchSDEDiscreteScheduler.sample_previous_step() in
fp32, creating a precision mismatch. Additionally, under continuous
batching the engine gathers latents across in-flight requests:
a freshly-added request has fp32 latents while stepped requests have
bf16 latents, producing a “Mixed dtypes in latents batch” error.
Fix (Stepwise Scheduler)
Two changes in step_scheduler:
1. Store new_latents.float() in the trajectory lists.
2. Keep state.latents in fp32 throughout: do NOT cast to the model dtype after the scheduler step. denoise_step already casts to the transformer dtype before the forward pass.
# Wrong:
state.latents = new_latents # bf16
# Correct:
state.latents = new_latents.to(torch.float32)
The non-CB diffuse() path already does this correctly — the stepwise
override must match.
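The continuous-batching failure can be modelled with a toy stand-in for tensors. Everything in this sketch is hypothetical (`ToyLatents`, `step`, `gather` are not part of any real API); it only tracks dtype labels to show why a freshly added fp32 request cannot be batched with a request whose latents were stored in bf16.

```python
from dataclasses import dataclass


@dataclass
class ToyLatents:
    """Stand-in for a tensor: only the dtype label matters here."""
    dtype: str  # "float32" or "bfloat16"


def step(latents: ToyLatents, keep_fp32: bool) -> ToyLatents:
    # The forward pass may run in bf16 either way; what matters is the
    # dtype of the latents written back to the per-request state.
    return ToyLatents("float32" if keep_fp32 else "bfloat16")


def gather(batch: list[ToyLatents]) -> str:
    # Mimics batching in-flight requests: all dtypes must agree.
    dtypes = {item.dtype for item in batch}
    if len(dtypes) > 1:
        raise TypeError(f"Mixed dtypes in latents batch: {sorted(dtypes)}")
    return dtypes.pop()


fresh = ToyLatents("float32")  # newly admitted request
stepped_ok = step(ToyLatents("float32"), keep_fp32=True)
stepped_bad = step(ToyLatents("float32"), keep_fp32=False)

print(gather([fresh, stepped_ok]))  # float32
try:
    gather([fresh, stepped_bad])
except TypeError as err:
    print(err)  # Mixed dtypes in latents batch: ['bfloat16', 'float32']
```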
Verification (Stepwise Scheduler)
ratio_mean ≈ 1.0 at step 1 with step_execution=True, matching the
step_execution=False baseline within tolerance.
SDE Window: Per-Request vs Per-GPU Seeding
Symptom
- critic/rewards/std_mean grows over training (e.g. 0.07 → 0.10) instead of shrinking.
- critic/score/mean declines or oscillates instead of improving.
- Training collapses early: the run dies after ~30 steps while a correctly configured run trains healthily for 180+ steps.
- actor/ppo_kl increases and actor/loss becomes unstable.

Most visible with reward functions that produce smooth, fine-grained scores (e.g. PickScore) where small differences between rollouts are the signal the policy should learn from.
Root cause
The SDE window is the contiguous range of timesteps where stochastic noise (exploration) is injected during the diffusion rollout. When the window is chosen per-request — e.g. by hashing the request ID or using an unseeded RNG — different rollouts for the same prompt inject noise at different timestep ranges. The reward scatter combines two sources of variance:
1. Exploration noise: intended, the policy should learn from this.
2. Timestep bias: unintended, two rollouts with identical behaviour can score differently purely because noise was applied at different timesteps.

The second source inflates critic/rewards/std_mean by ~4×, making GRPO
advantage estimates unreliable. The trainer cannot distinguish noise from
signal, drift dominates, and training collapses.
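One way to read the two sources is through the law of total variance. The symbols are introduced here for illustration and do not appear in the code: r is the reward of a rollout for a fixed prompt and w is the SDE window used by that rollout.

```latex
\operatorname{Var}(r)
  = \underbrace{\mathbb{E}_{w}\!\left[\operatorname{Var}(r \mid w)\right]}_{\text{exploration noise}}
  + \underbrace{\operatorname{Var}_{w}\!\left(\mathbb{E}[r \mid w]\right)}_{\text{timestep bias}}
```

When every rollout in a group shares the same w, the second term is zero within that group, and the spread of rewards reflects exploration only.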
Fix
Seed the SDE window RNG with a per-GPU identifier so all rollouts on the same GPU share the same noise-injection window. This removes the timestep bias from the variance budget — all rollouts for a prompt are evaluated under consistent noise structure, and the remaining variance is genuine exploration that GRPO can learn from.
import os
import random

# Wrong: each rollout gets a different SDE window.
rng = random.Random(hash(request_id))
# Correct: all rollouts on the same GPU share one window.
rng = random.Random(int(os.environ["LOCAL_RANK"]))
This matches the reference flow_grpo behaviour of
random.seed(process_index).
Note
Reading LOCAL_RANK from the environment couples the window to process
placement, which is opaque and not reproducible across GPU counts. A better
approach would pass an explicit seed from the launcher.
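A minimal sketch of the explicit-seed variant follows. The helper `sde_window`, the `window_seed` value, and the rule for drawing the window (a uniformly chosen start index) are all assumptions made for illustration; the actual selection logic may differ.

```python
import random


def sde_window(seed: int, num_steps: int, window_size: int) -> range:
    """Pick a contiguous window of timesteps from a dedicated, seeded RNG."""
    if not 0 < window_size <= num_steps:
        raise ValueError("window_size must be between 1 and num_steps")
    rng = random.Random(seed)  # local RNG, leaves the global state untouched
    start = rng.randint(0, num_steps - window_size)  # bounds are inclusive
    return range(start, start + window_size)


# Hypothetical launcher-provided seed, shared by every rollout in a group.
window_seed = 1234
shared = [sde_window(window_seed, num_steps=10, window_size=2) for _ in range(4)]

# Per-request seeding, for contrast: windows may differ between rollouts.
per_request = [
    sde_window(request_id, num_steps=10, window_size=2) for request_id in range(4)
]

print(shared)       # four identical windows
print(per_request)  # windows depend on the request id
```

Passing the seed as an argument keeps the window reproducible regardless of how processes are placed on GPUs.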
Verification
Two BAGEL PickScore LoRA runs on the same 4-GPU setup, differing only in SDE window seeding:
| Metric | Per-request seed | Per-GPU seed |
|---|---|---|
| critic/rewards/std_mean | 0.07 → 0.10 (diverging) | 0.022 → 0.014 (converging) |
| critic/score/mean | 0.78 → 0.62 (collapsing) | 0.82 → 0.85+ (improving) |
| Steps survived | 28 | 181+ |
| actor/ppo_kl | Growing to 8×10⁻⁴ | Near zero, stable |
| | N/A | 0.82 → 0.87 (improving) |
The fix brings critic/rewards/std_mean into the same ~0.01–0.02 range
observed in reference flow_grpo PickScore runs.