Common Pitfalls

Last updated: 06/29/2026.

Float32 Precision Loss in Stored Rollout Latents

Symptom

Training metrics show a systematic negative bias at step 1 (before any weight update):

actor/ratio_mean consistently below 1.0 (e.g. 0.99996)
actor/ppo_kl and actor/pg_clipfrac inflated at step 1
actor/pg_clipfrac_higher is zero — all clipping on the lower side
The bias is most visible with rollout correction (bypass_mode=True), but the underlying precision loss also degrades stored trajectories in standard training.

Root cause

FlowMatchSDEDiscreteScheduler.step() computes log_prob in float32 from the fp32 prev_sample, then casts prev_sample back to model_output.dtype (bfloat16) before returning it. The stored latents are therefore a rounded copy of the sample the log-prob was computed for, so a later log-prob recomputation from the stored latents no longer matches the rollout value.
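The effect of the bfloat16 round trip can be reproduced without the scheduler. The sketch below is a standalone illustration, not the scheduler's code: `to_bf16` emulates bfloat16 rounding with the standard library, `gaussian_log_prob` is a generic Gaussian log-density, and the mean, standard deviation, and sample value are made up for the example.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a finite x to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    # bfloat16 keeps the top 16 bits; round using the 16 bits that are dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gaussian_log_prob(x: float, mean: float, std: float) -> float:
    """Log-density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


mean, std = 0.5, 0.01          # illustrative values, not taken from a real run
prev_sample = 0.507            # the sample the rollout log-prob is computed for
stored = to_bf16(prev_sample)  # what survives a cast to bfloat16

logp_rollout = gaussian_log_prob(prev_sample, mean, std)
logp_recomputed = gaussian_log_prob(stored, mean, std)

print(stored)                          # 0.5078125
print(logp_recomputed - logp_rollout)  # nonzero, although no weights changed
```

When the SDE standard deviation is small, one bfloat16 rounding step is a noticeable fraction of a standard deviation, which is why the recomputed log-prob shifts in this toy setting.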

Fix

Two changes in the scheduler, one in the rollout adapter. The training adapter is unchanged — it already uses fp32 correctly.

1. Scheduler — step() no longer truncates prev_sample to bfloat16, and sample_previous_step() asserts model_output is float32 so callers cannot accidentally pass lower precision.

2. Rollout adapter — latents are cast to the transformer’s native dtype before the forward pass (performance), noise_pred is cast to float32 before the scheduler (precision), and all stored latents are in float32.

Verification

The fix eliminates the systematic precision-loss bias from the scheduler. In non-bypass mode (no rollout correction) ratio_mean ≈ 1.0 at step 1. In bypass mode a ~3×10⁻⁵ KL divergence remains due to the vLLM vs PyTorch attention kernel difference, which is unavoidable when using different inference backends.

Metric	Before fix (bypass)	After fix (bypass)	No bypass
`actor/ppo_kl`	~3.6×10⁻⁵	~3.3×10⁻⁵	~1×10⁻⁶
`actor/pg_clipfrac`	~12%	~9%	~1%
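One way to look for this bias at step 1 is to compute ratio statistics directly from the stored and recomputed log-probs. The sketch below is a generic PPO-style calculation, not the trainer's metric code: `ratio_stats`, its field names, and the `clip_range` value are assumptions chosen for illustration.

```python
import math


def ratio_stats(old_log_probs, new_log_probs, clip_range):
    """Summarise importance ratios exp(new - old) against a symmetric clip range."""
    ratios = [math.exp(new - old) for old, new in zip(old_log_probs, new_log_probs)]
    n = len(ratios)
    return {
        "ratio_mean": sum(ratios) / n,
        "frac_below": sum(r < 1.0 - clip_range for r in ratios) / n,
        "frac_above": sum(r > 1.0 + clip_range for r in ratios) / n,
    }


old = [-1.20, -0.85, -2.10]  # made-up rollout log-probs

# Before any weight update the recomputed log-probs should equal the stored ones.
clean = ratio_stats(old, old, clip_range=1e-4)
print(clean)

# A uniform negative offset mimics the one-sided bias described in the symptom.
biased = ratio_stats(old, [lp - 1e-3 for lp in old], clip_range=1e-4)
print(biased)
```

In the clean case the mean ratio is exactly 1 and nothing is clipped; in the biased case every ratio falls below the lower bound and none exceeds the upper one, which mirrors the one-sided clipping pattern listed under Symptom.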

RoPE Sequence Length Mismatch

Symptom (RoPE)

When step_execution=True, actor/ppo_kl is elevated even at step 1 compared to the full-forward (step_execution=False) path. The effect persists across training steps and cannot be eliminated by the fp32 latent-storage fix alone.

This also affects the stock vllm-omni (non-stepwise) path in some configurations — the root cause is upstream, not specific to stepwise mode.

Root cause (RoPE)

vllm-omni sets Rotary Position Embedding (RoPE) sequence lengths from mask.sum() (valid token count), while diffusers sets them from the padded encoder-tensor width (text_seq_len). Under continuous batching, vllm-omni pads all requests to a shared target_seq_len, so valid tokens at positions beyond ~50 receive incorrect RoPE — they get the positional encoding of a much shorter sequence.

Concretely, if a request has 200 valid tokens and is padded to width 1058, mask.sum() = 200 but the embedding width is 1058. The RoPE position for token 100 is computed as position 100 of a 200-length sequence rather than position 100 of a 1058-length sequence.
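The two length conventions in this example can be compared side by side. The sketch below uses plain Python lists as stand-ins for one row of prompt_embeds_mask and for the padded embedding width; the batch of two identical requests is hypothetical.

```python
valid_tokens = 200
padded_width = 1058

# One mask row: 1 for valid tokens, 0 for padding (stand-in for a tensor row).
mask_row = [1] * valid_tokens + [0] * (padded_width - valid_tokens)
batch_masks = [mask_row, mask_row]  # hypothetical batch of two requests

# Length derived from the mask: counts valid tokens only.
lens_from_mask = [sum(row) for row in batch_masks]

# Length derived from the padded width: the same value for every row.
lens_from_width = [len(batch_masks[0])] * len(batch_masks)

print(lens_from_mask)   # [200, 200]
print(lens_from_width)  # [1058, 1058]
```

The mismatch between these two lists is the quantity that differs between the two code paths described above.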

Fix (RoPE)

In prepare_encode, set txt_seq_lens from the padded embed width instead of from mask.sum():

# Wrong (vllm-omni default):
txt_seq_lens = [int(mask.sum()) for mask in prompt_embeds_mask]

# Correct (matches diffusers):
txt_seq_lens = [int(prompt_embeds.shape[1])] * int(prompt_embeds.shape[0])

The stepwise adapters in verl_omni/experimental/ already do this. The stock vllm-omni path is still affected and tracked as an upstream issue.

Verification (RoPE)

Compare actor/ppo_kl at step 1 between step_execution=True and step_execution=False runs with all other settings identical. After the fix, the difference should be within numerical tolerance (a KL divergence of ~3×10⁻⁵, attributable to the unavoidable vLLM vs PyTorch attention kernel difference).

Float32 Precision Loss in Stepwise Scheduler

Symptom (Stepwise Scheduler)

Training metrics show a systematic negative bias at step 1 when step_execution=True:

actor/ratio_mean consistently below 1.0
actor/ppo_kl and actor/pg_clipfrac inflated at step 1
actor/pg_clipfrac_higher is zero — all clipping on the lower side

The same model/config produces correct ratio_mean ≈ 1.0 when step_execution=False.

Root cause (Stepwise Scheduler)

step_scheduler stores new_latents in the model’s compute dtype (bf16) instead of fp32. The trainer later recomputes log-probs on these stored latents via FlowMatchSDEDiscreteScheduler.sample_previous_step() in fp32, creating a precision mismatch. Additionally, under continuous batching the engine gathers latents across in-flight requests: a freshly-added request has fp32 latents while stepped requests have bf16 latents, producing a “Mixed dtypes in latents batch” error.

Fix (Stepwise Scheduler)

Two changes in step_scheduler:

1. Store new_latents.float() in the trajectory lists.
2. Keep state.latents in fp32 throughout: do NOT cast to the model dtype after the scheduler step. denoise_step already casts to the transformer dtype before the forward pass.

# Wrong:
state.latents = new_latents  # bf16

# Correct:
state.latents = new_latents.to(torch.float32)

The non-CB diffuse() path already does this correctly — the stepwise override must match.

Verification (Stepwise Scheduler)

ratio_mean ≈ 1.0 at step 1 with step_execution=True, matching the step_execution=False baseline within tolerance.

SDE Window: Per-Request vs Per-GPU Seeding

Symptom

critic/rewards/std_mean grows over training (e.g. 0.07 → 0.10) instead of shrinking.
critic/score/mean declines or oscillates instead of improving.
Training collapses early — the run dies after ~30 steps while a correctly-configured run trains healthily for 180+ steps.
actor/ppo_kl increases and actor/loss becomes unstable.

Most visible with reward functions that produce smooth, fine-grained scores (e.g. PickScore) where small differences between rollouts are the signal the policy should learn from.

Root cause

The SDE window is the contiguous range of timesteps where stochastic noise (exploration) is injected during the diffusion rollout. When the window is chosen per-request — e.g. by hashing the request ID or using an unseeded RNG — different rollouts for the same prompt inject noise at different timestep ranges. The reward scatter combines two sources of variance:

1. Exploration noise: intended; the policy should learn from this.
2. Timestep bias: unintended; two rollouts with identical behaviour can score differently purely because noise was applied at different timesteps.

The second source inflates critic/rewards/std_mean by ~4×, making GRPO advantage estimates unreliable. The trainer cannot distinguish noise from signal, drift dominates, and training collapses.
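The variance argument can be made concrete with a toy calculation. All numbers below are invented for illustration and are not measurements: `exploration` stands for the intended per-rollout variation, `window_bias` for the extra reward shift each rollout picks up when it has its own SDE window, and `shared_bias` for the single shift that applies when all rollouts share a window.

```python
import statistics

base_reward = 0.80
exploration = [-0.01, 0.00, 0.01, 0.00]   # intended per-rollout variation
window_bias = [0.04, -0.05, 0.03, -0.03]  # differs per rollout with per-request windows
shared_bias = 0.04                        # identical for all rollouts with a shared window

# Rewards for one prompt group under each seeding scheme.
per_request = [base_reward + e + b for e, b in zip(exploration, window_bias)]
shared = [base_reward + e + shared_bias for e in exploration]

std_per_request = statistics.pstdev(per_request)
std_shared = statistics.pstdev(shared)

print(std_per_request)  # includes the timestep-bias contribution
print(std_shared)       # only the exploration contribution remains
```

A shift that is common to the whole group cancels out of the group standard deviation, so only the exploration term remains in the shared-window case.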

Fix

Seed the SDE window RNG with a per-GPU identifier so all rollouts on the same GPU share the same noise-injection window. This removes the timestep bias from the variance budget — all rollouts for a prompt are evaluated under consistent noise structure, and the remaining variance is genuine exploration that GRPO can learn from.

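import os
import random

# Wrong: each rollout gets a different SDE window
# (hash() of a str is also salted per process, so it is not reproducible across runs).
rng = random.Random(hash(request_id))

# Correct: all rollouts on the same GPU share one window.
rng = random.Random(int(os.environ["LOCAL_RANK"]))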

This matches the reference flow_grpo behaviour of random.seed(process_index).
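The difference between the two seeding schemes can be seen in a small standalone sketch. `pick_sde_window`, the request IDs, and the step counts are hypothetical. The per-request RNG is seeded with the request ID string directly rather than with hash(), so the example is reproducible. How often the real code constructs or advances the RNG is not specified here; the sketch rebuilds it for every rollout for simplicity.

```python
import random


def pick_sde_window(rng: random.Random, num_steps: int, window_size: int) -> range:
    """Pick a contiguous window of timestep indices (hypothetical helper)."""
    start = rng.randint(0, num_steps - window_size)
    return range(start, start + window_size)


num_steps, window_size = 20, 4
request_ids = [f"req-{i}" for i in range(32)]  # made-up request IDs
local_rank = 0                                 # stand-in for LOCAL_RANK

# Per-request seeding: every rollout builds its own RNG from its request ID.
per_request = [
    pick_sde_window(random.Random(rid), num_steps, window_size) for rid in request_ids
]

# Per-GPU seeding: every rollout on this rank builds the RNG from the same seed.
per_gpu = [
    pick_sde_window(random.Random(local_rank), num_steps, window_size) for _ in request_ids
]

print(len({w.start for w in per_request}))  # number of distinct windows across rollouts
print(len({w.start for w in per_gpu}))      # 1
```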

Note

Reading LOCAL_RANK from the environment couples the window to process placement, which is opaque and not reproducible across GPU counts. A better approach would pass an explicit seed from the launcher.
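A minimal sketch of the explicit-seed approach follows. It is an assumption about how such a change could look, not existing code: `sde_window_rng` and the `SDE_WINDOW_SEED` environment variable are hypothetical names, and the fallback keeps the current LOCAL_RANK behaviour.

```python
import os
import random


def sde_window_rng(explicit_seed=None) -> random.Random:
    """Build the SDE window RNG, preferring a seed supplied by the launcher.

    SDE_WINDOW_SEED is a hypothetical environment variable used only in this sketch.
    """
    if explicit_seed is None:
        env_seed = os.environ.get("SDE_WINDOW_SEED")
        if env_seed is not None:
            explicit_seed = int(env_seed)
    if explicit_seed is None:
        # Fall back to the placement-dependent behaviour.
        explicit_seed = int(os.environ.get("LOCAL_RANK", "0"))
    return random.Random(explicit_seed)


# Two workers given the same explicit seed draw the same sequence,
# regardless of which GPU they were placed on.
rng_a = sde_window_rng(explicit_seed=1234)
rng_b = sde_window_rng(explicit_seed=1234)
print(rng_a.random() == rng_b.random())  # True
```

Passing the seed as an argument keeps the choice visible in the launcher configuration, while the environment variable offers a path that does not require changing call sites.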

Verification

Two BAGEL PickScore LoRA runs on the same 4-GPU setup, differing only in SDE window seeding:

Metric	Per-request seed	Per-GPU seed
`critic/rewards/std_mean`	0.07 → 0.10 (diverging)	0.022 → 0.014 (converging)
`critic/score/mean` trend	0.78 → 0.62 (collapsing)	0.82 → 0.85+ (improving)
Steps survived	28	181+
`actor/ppo_kl`	Growing to 8×10⁻⁴	Near zero, stable
`val-core/.../reward/mean@1`	N/A	0.82 → 0.87 (improving)

The fix brings critic/rewards/std_mean into the same ~0.01–0.02 range observed in reference flow_grpo PickScore runs.