Common Pitfalls

Last updated: 06/29/2026.


Float32 Precision Loss in Stored Rollout Latents

Symptom

Training metrics show a systematic negative bias at step 1 (before any weight update):

  • actor/ratio_mean consistently below 1.0 (e.g. 0.99996)

  • actor/ppo_kl and actor/pg_clipfrac inflated at step 1

  • actor/pg_clipfrac_higher is zero — all clipping on the lower side

  • The bias is most visible with rollout correction (bypass_mode=True), but the same precision loss also degrades stored trajectories in standard training.
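As a rough illustration of how a small log-prob deficit produces this pattern, the sketch below computes an importance ratio and clip fractions from per-sample log-probs. It assumes the conventional PPO definitions (ratio = exp(new - old), KL estimated as the mean of old - new). The function name, the clip range, and the numbers are illustrative and are not taken from the trainer.

```python
import math


def ratio_diagnostics(old_log_probs, new_log_probs, clip_range=1e-4):
    """Illustrative PPO diagnostics under assumed (conventional) definitions."""
    ratios = [math.exp(new - old) for old, new in zip(old_log_probs, new_log_probs)]
    n = len(ratios)
    return {
        "ratio_mean": sum(ratios) / n,
        # Simple KL estimate: mean of (old - new) log-probs.
        "kl_estimate": sum(o - nw for o, nw in zip(old_log_probs, new_log_probs)) / n,
        # Fraction of samples outside each side of the clip interval.
        "frac_below_clip": sum(r < 1.0 - clip_range for r in ratios) / n,
        "frac_above_clip": sum(r > 1.0 + clip_range for r in ratios) / n,
    }


# Synthetic example: recomputed log-probs are slightly lower than the stored ones.
old = [-1.0, -2.0, -3.0]
new = [-1.00002, -2.00004, -3.0002]
metrics = ratio_diagnostics(old, new)
print(metrics)
```

Because every ratio in this example sits below 1.0, only the lower clip bound can be hit, which mirrors the one-sided clipping listed above.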

Root cause

FlowMatchSDEDiscreteScheduler.step() computes log_prob in float32 using the fp32 prev_sample, then casts prev_sample back to model_output.dtype (bfloat16) before returning. The stored latents lose precision, creating a mismatch with the log-prob computation.
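The size of this mismatch can be estimated without a GPU. The following sketch emulates a bfloat16 cast in pure Python and evaluates a one-dimensional Gaussian log-density at the full-precision value and at the value that survives the cast. The transition mean and standard deviation are made up for illustration, and the helper names are hypothetical; the scheduler's actual log-prob formula is not reproduced here.

```python
import math
import struct


def round_to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the low 16 bits
    bits &= 0xFFFF0000  # keep sign, exponent, and the top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gaussian_log_prob(x: float, mean: float, std: float) -> float:
    """Log-density of a 1-D Gaussian."""
    return -((x - mean) ** 2) / (2 * std**2) - math.log(std) - 0.5 * math.log(2 * math.pi)


# Made-up transition parameters for a single latent element.
mean, std = 0.5, 0.01
prev_sample_full = 0.507                        # value used when log_prob is computed
prev_sample_stored = round_to_bfloat16(0.507)   # value that survives the bf16 cast

log_prob_rollout = gaussian_log_prob(prev_sample_full, mean, std)
log_prob_recomputed = gaussian_log_prob(prev_sample_stored, mean, std)
print(prev_sample_stored, log_prob_rollout - log_prob_recomputed)
```

In the range [0.5, 1) adjacent bfloat16 values are 2⁻⁸ apart, so when the transition standard deviation is small the rounding error is a sizeable fraction of it, and the recomputed log-prob shifts noticeably.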

Fix

Two changes in the scheduler, one in the rollout adapter. The training adapter is unchanged — it already uses fp32 correctly.

1. Scheduler: step() no longer truncates prev_sample to bfloat16, and sample_previous_step() asserts that model_output is float32 so callers cannot accidentally pass lower precision.

2. Rollout adapter: latents are cast to the transformer’s native dtype before the forward pass (performance), noise_pred is cast to float32 before the scheduler (precision), and all stored latents are kept in float32.

Verification

The fix eliminates the systematic precision-loss bias from the scheduler. In non-bypass mode (no rollout correction), ratio_mean ≈ 1.0 at step 1. In bypass mode, a KL divergence of ~3×10⁻⁵ remains because the vLLM and PyTorch attention kernels differ; this is unavoidable when rollout and training use different inference backends.

| Metric | Before fix (bypass) | After fix (bypass) | No bypass |
|---|---|---|---|
| actor/ppo_kl | ~3.6×10⁻⁵ | ~3.3×10⁻⁵ | ~1×10⁻⁶ |
| actor/pg_clipfrac | ~12% | ~9% | ~1% |


RoPE Sequence Length Mismatch

Symptom (RoPE)

When step_execution=True, actor/ppo_kl is elevated even at step 1 compared to the full-forward (step_execution=False) path. The effect persists across training steps and cannot be eliminated by the fp32 latent-storage fix alone.

This also affects the stock vllm-omni (non-stepwise) path in some configurations — the root cause is upstream, not specific to stepwise mode.

Root cause (RoPE)

vllm-omni sets Rotary Position Embedding (RoPE) sequence lengths from mask.sum() (valid token count), while diffusers sets them from the padded encoder-tensor width (text_seq_len). Under continuous batching, vllm-omni pads all requests to a shared target_seq_len, so valid tokens at positions beyond ~50 receive incorrect RoPE — they get the positional encoding of a much shorter sequence.

Concretely, if a request has 200 valid tokens and is padded to width 1058, mask.sum() = 200 but the embedding width is 1058. The RoPE position for token 100 is computed as position 100 of a 200-length sequence rather than position 100 of a 1058-length sequence.

Fix (RoPE)

In prepare_encode, set txt_seq_lens from the padded embed width instead of from mask.sum():

# Wrong (vllm-omni default):
txt_seq_lens = [int(mask.sum()) for mask in prompt_embeds_mask]

# Correct (matches diffusers):
txt_seq_lens = [int(prompt_embeds.shape[1])] * int(prompt_embeds.shape[0])

The stepwise adapters in verl_omni/experimental/ already do this. The stock vllm-omni path is still affected and tracked as an upstream issue.
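The difference between the two conventions can be checked on the 200-of-1058 example above. This is a minimal sketch that uses plain Python lists as stand-ins for prompt_embeds_mask; the helper names are hypothetical and do not exist in vllm-omni or diffusers.

```python
def seq_lens_from_mask(masks):
    """Valid-token count per request (the mask.sum() convention)."""
    return [int(sum(row)) for row in masks]


def seq_lens_from_width(masks):
    """Padded width for every request (the diffusers convention)."""
    width = len(masks[0])
    return [width] * len(masks)


padded_width = 1058
short_request = [1] * 200 + [0] * (padded_width - 200)  # 200 valid tokens, rest padding
full_request = [1] * padded_width                        # no padding
batch = [short_request, full_request]

print(seq_lens_from_mask(batch))   # per-request valid counts
print(seq_lens_from_width(batch))  # shared padded width
```

Only the second convention yields the same length for every request in a padded batch, which is what the trainer-side diffusers path assumes.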

Verification (RoPE)

Compare actor/ppo_kl at step 1 between step_execution=True and step_execution=False runs with all other knobs identical. After the fix the difference should be within numerical tolerance (~3×10⁻⁵ KL divergence due to unavoidable vLLM vs PyTorch attention kernel difference).


Float32 Precision Loss in Stepwise Scheduler

Symptom (Stepwise Scheduler)

Training metrics show a systematic negative bias at step 1 when step_execution=True:

  • actor/ratio_mean consistently below 1.0

  • actor/ppo_kl and actor/pg_clipfrac inflated at step 1

  • actor/pg_clipfrac_higher is zero — all clipping on the lower side

The same model/config produces the correct ratio_mean ≈ 1.0 when step_execution=False.

Root cause (Stepwise Scheduler)

step_scheduler stores new_latents in the model’s compute dtype (bf16) instead of fp32. The trainer later recomputes log-probs on these stored latents via FlowMatchSDEDiscreteScheduler.sample_previous_step() in fp32, creating a precision mismatch. Additionally, under continuous batching the engine gathers latents across in-flight requests: a freshly-added request has fp32 latents while stepped requests have bf16 latents, producing a “Mixed dtypes in latents batch” error.
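A cheap guard at the point where latents are gathered makes this failure mode explicit. The sketch below is a hypothetical helper, not the engine's actual check; it only relies on each item exposing a .dtype attribute, so SimpleNamespace objects stand in for tensors here.

```python
from types import SimpleNamespace


def check_latents_dtype(latents, expected):
    """Raise if any latent in the batch is not stored in the expected dtype."""
    found = {x.dtype for x in latents}
    # An empty batch also fails this check, since no dtype was observed.
    if found != {expected}:
        names = sorted(str(d) for d in found)
        raise ValueError(f"Expected only {expected} latents, found: {names}")


fresh = SimpleNamespace(dtype="float32")     # newly added request
stepped = SimpleNamespace(dtype="bfloat16")  # request whose state was cast after a step

check_latents_dtype([fresh, fresh], expected="float32")  # passes silently

try:
    check_latents_dtype([fresh, stepped], expected="float32")
    mixed_batch_rejected = False
except ValueError as err:
    mixed_batch_rejected = True
    print(err)
```

With real tensors the same function could be called with the framework's dtype object as expected, assuming dtype objects are hashable and comparable, as they are in PyTorch.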

Fix (Stepwise Scheduler)

Two changes in step_scheduler:

  1. Store new_latents.float() in the trajectory lists.

  2. Keep state.latents in fp32 throughout — do NOT cast to model dtype after the scheduler step. denoise_step already casts to the transformer dtype before the forward pass.

# Wrong:
state.latents = new_latents  # bf16

# Correct:
state.latents = new_latents.to(torch.float32)

The non-CB diffuse() path already does this correctly — the stepwise override must match.

Verification (Stepwise Scheduler)

After the fix, ratio_mean ≈ 1.0 at step 1 with step_execution=True, matching the step_execution=False baseline within tolerance.


SDE Window: Per-Request vs Per-GPU Seeding

Symptom (SDE Window)

  • critic/rewards/std_mean grows over training (e.g. 0.07 → 0.10) instead of shrinking.

  • critic/score/mean declines or oscillates instead of improving.

  • Training collapses early — the run dies after ~30 steps while a correctly-configured run trains healthily for 180+ steps.

  • actor/ppo_kl increases and actor/loss becomes unstable.

Most visible with reward functions that produce smooth, fine-grained scores (e.g. PickScore) where small differences between rollouts are the signal the policy should learn from.

Root cause (SDE Window)

The SDE window is the contiguous range of timesteps where stochastic noise (exploration) is injected during the diffusion rollout. When the window is chosen per-request — e.g. by hashing the request ID or using an unseeded RNG — different rollouts for the same prompt inject noise at different timestep ranges. The reward scatter combines two sources of variance:

  1. Exploration noise — intended, the policy should learn from this.

  2. Timestep bias — unintended, two rollouts with identical behaviour can score differently purely because noise was applied at different timesteps.

The second source inflates critic/rewards/std_mean by ~4×, making GRPO advantage estimates unreliable. The trainer cannot distinguish noise from signal, drift dominates, and training collapses.
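To illustrate why extra reward variance is harmful, the sketch below computes group-normalized advantages and adds a per-rollout offset that stands in for the timestep bias. It assumes the common GRPO form (r - mean) / (std + eps) with a population standard deviation; implementations differ on that detail. All reward values and offsets are synthetic.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for rollouts of one prompt (assumed GRPO form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Synthetic rewards for four rollouts of the same prompt, best rollout last.
clean_rewards = [0.80, 0.81, 0.82, 0.83]
# Synthetic offsets standing in for the effect of differing SDE windows.
window_bias = [0.05, -0.05, 0.04, -0.04]
biased_rewards = [r + b for r, b in zip(clean_rewards, window_bias)]

clean_adv = grpo_advantages(clean_rewards)
biased_adv = grpo_advantages(biased_rewards)
print([round(a, 2) for a in clean_adv])
print([round(a, 2) for a in biased_adv])
```

In this toy case the offsets are larger than the genuine reward differences, so the worst rollout receives a positive advantage and the best a negative one: the gradient then follows the window placement rather than the policy's behaviour.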

Fix (SDE Window)

Seed the SDE window RNG with a per-GPU identifier so all rollouts on the same GPU share the same noise-injection window. This removes the timestep bias from the variance budget — all rollouts for a prompt are evaluated under consistent noise structure, and the remaining variance is genuine exploration that GRPO can learn from.

import os
import random

# Wrong: each rollout gets a different SDE window.
rng = random.Random(hash(request_id))

# Correct: all rollouts on the same GPU share one window.
rng = random.Random(int(os.environ["LOCAL_RANK"]))

This matches the reference flow_grpo behaviour of random.seed(process_index).

Note

Reading LOCAL_RANK from the environment couples the window to process placement, which is opaque and not reproducible across GPU counts. A better approach would pass an explicit seed from the launcher.
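One way the explicit-seed variant might look is sketched below. The function name, its parameters, and the seed value 1234 are hypothetical; how the launcher would deliver the seed (config field, CLI flag, environment variable) is left open.

```python
import random


def choose_sde_window(seed, num_steps, window_size):
    """Return (start, end) of a contiguous noise window, deterministic in `seed`."""
    if not 0 < window_size <= num_steps:
        raise ValueError("window_size must be in 1..num_steps")
    rng = random.Random(seed)
    start = rng.randint(0, num_steps - window_size)  # inclusive bounds
    return start, start + window_size


# Shared seed supplied by the launcher: every rollout gets the same window.
shared_windows = [choose_sde_window(1234, num_steps=10, window_size=2) for _ in range(64)]

# Per-request seeding: windows vary from rollout to rollout.
per_request_windows = [
    choose_sde_window(f"req-{i}", num_steps=10, window_size=2) for i in range(64)
]

print(set(shared_windows))
print(len(set(per_request_windows)))
```

A separate caveat about the per-request pattern: Python salts hash() of strings per process unless PYTHONHASHSEED is set, so random.Random(hash(request_id)) is not reproducible across runs when request_id is a string, whereas passing the string itself as the seed, as done here, is.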

Verification (SDE Window)

Two BAGEL PickScore LoRA runs on the same 4-GPU setup, differing only in SDE window seeding:

| Metric | Per-request seed | Per-GPU seed |
|---|---|---|
| critic/rewards/std_mean | 0.07 → 0.10 (diverging) | 0.022 → 0.014 (converging) |
| critic/score/mean trend | 0.78 → 0.62 (collapsing) | 0.82 → 0.85+ (improving) |
| Steps survived | 28 | 181+ |
| actor/ppo_kl | Growing to 8×10⁻⁴ | Near zero, stable |
| val-core/.../reward/mean@1 | N/A | 0.82 → 0.87 (improving) |

The fix brings critic/rewards/std_mean into the same ~0.01–0.02 range observed in reference flow_grpo PickScore runs.