# Common Pitfalls Last updated: 06/29/2026. --- ## Float32 Precision Loss in Stored Rollout Latents (symptom-float32)= ### Symptom Training metrics show a systematic negative bias **at step 1** (before any weight update): - `actor/ratio_mean` consistently below `1.0` (e.g. `0.99996`) - `actor/ppo_kl` and `actor/pg_clipfrac` inflated at step 1 - `actor/pg_clipfrac_higher` is **zero** — all clipping on the lower side - Most visible with rollout correction (`bypass_mode=True`), but also degrades stored trajectory precision in standard training. (root-cause-float32)= ### Root cause `FlowMatchSDEDiscreteScheduler.step()` computes `log_prob` in **float32** using the fp32 `prev_sample`, then **casts `prev_sample` back to `model_output.dtype` (bfloat16)** before returning. The stored latents lose precision, creating a mismatch with the log-prob computation. (fix-float32)= ### Fix Two changes in the scheduler, one in the rollout adapter. The training adapter is **unchanged** — it already uses fp32 correctly. **1. Scheduler** — `step()` no longer truncates `prev_sample` to bfloat16, and `sample_previous_step()` asserts `model_output` is float32 so callers cannot accidentally pass lower precision. **2. Rollout adapter** — latents are cast to the transformer's native dtype before the forward pass (performance), noise_pred is cast to float32 before the scheduler (precision), and all stored latents are in float32. (verification-float32)= ### Verification The fix eliminates the systematic precision-loss bias from the scheduler. In non-bypass mode (no rollout correction) `ratio_mean ≈ 1.0` at step 1. In bypass mode a ~3×10⁻⁵ KL divergence remains due to the vLLM vs PyTorch attention kernel difference, which is unavoidable when using different inference backends. | Metric | Before fix (bypass) | After fix (bypass) | No bypass | |---|---|---|---| | `actor/ppo_kl` | ~3.6×10⁻⁵ | ~3.3×10⁻⁵ | ~1×10⁻⁶ | | `actor/pg_clipfrac` | ~12% | ~9% | ~1% | --- ## RoPE Sequence Length Mismatch (symptom-rope)= ### Symptom (RoPE) When `step_execution=True`, `actor/ppo_kl` is elevated even at step 1 compared to the full-forward (`step_execution=False`) path. The effect persists across training steps and cannot be eliminated by the fp32 latent-storage fix alone. This also affects the **stock vllm-omni (non-stepwise) path** in some configurations — the root cause is upstream, not specific to stepwise mode. (root-cause-rope)= ### Root cause (RoPE) vllm-omni sets Rotary Position Embedding (RoPE) sequence lengths from `mask.sum()` (valid token count), while diffusers sets them from the padded encoder-tensor width (`text_seq_len`). Under continuous batching, vllm-omni pads all requests to a shared `target_seq_len`, so valid tokens at positions beyond ~50 receive incorrect RoPE — they get the positional encoding of a much shorter sequence. Concretely, if a request has 200 valid tokens and is padded to width 1058, `mask.sum()` = 200 but the embedding width is 1058. The RoPE position for token 100 is computed as position 100 of a 200-length sequence rather than position 100 of a 1058-length sequence. (fix-rope)= ### Fix (RoPE) In `prepare_encode`, set `txt_seq_lens` from the padded embed width instead of from `mask.sum()`: ```python # Wrong (vllm-omni default): txt_seq_lens = [int(mask.sum()) for mask in prompt_embeds_mask] # Correct (matches diffusers): txt_seq_lens = [int(prompt_embeds.shape[1])] * int(prompt_embeds.shape[0]) ``` The stepwise adapters in `verl_omni/experimental/` already do this. The stock vllm-omni path is still affected and tracked as an upstream issue. (verification-rope)= ### Verification (RoPE) Compare `actor/ppo_kl` at step 1 between `step_execution=True` and `step_execution=False` runs with all other knobs identical. After the fix the difference should be within numerical tolerance (~3×10⁻⁵ KL divergence due to unavoidable vLLM vs PyTorch attention kernel difference). --- ## Float32 Precision Loss in Stepwise Scheduler (symptom-fp32-stepwise)= ### Symptom (Stepwise Scheduler) Training metrics show a systematic negative bias **at step 1** when `step_execution=True`: - `actor/ratio_mean` consistently below `1.0` - `actor/ppo_kl` and `actor/pg_clipfrac` inflated at step 1 - `actor/pg_clipfrac_higher` is **zero** — all clipping on the lower side The same model/config produces correct `ratio_mean ≈ 1.0` when `step_execution=False`. (root-cause-fp32-stepwise)= ### Root cause (Stepwise Scheduler) `step_scheduler` stores `new_latents` in the model's compute dtype (bf16) instead of fp32. The trainer later recomputes log-probs on these stored latents via `FlowMatchSDEDiscreteScheduler.sample_previous_step()` in fp32, creating a precision mismatch. Additionally, under continuous batching the engine gathers latents across in-flight requests: a freshly-added request has fp32 latents while stepped requests have bf16 latents, producing a "Mixed dtypes in latents batch" error. (fix-fp32-stepwise)= ### Fix (Stepwise Scheduler) Two changes in `step_scheduler`: 1. Store `new_latents.float()` in the trajectory lists. 2. Keep `state.latents` in fp32 throughout — do NOT cast to model dtype after the scheduler step. `denoise_step` already casts to the transformer dtype before the forward pass. ```python # Wrong: state.latents = new_latents # bf16 # Correct: state.latents = new_latents.to(torch.float32) ``` The non-CB `diffuse()` path already does this correctly — the stepwise override must match. (verification-fp32-stepwise)= ### Verification (Stepwise Scheduler) `ratio_mean ≈ 1.0` at step 1 with `step_execution=True`, matching the `step_execution=False` baseline within tolerance. --- ## SDE Window: Per-Request vs Per-GPU Seeding (symptom-sde-window)= ### Symptom - `critic/rewards/std_mean` grows over training (e.g. 0.07 → 0.10) instead of shrinking. - `critic/score/mean` declines or oscillates instead of improving. - Training collapses early — the run dies after ~30 steps while a correctly-configured run trains healthily for 180+ steps. - `actor/ppo_kl` increases and `actor/loss` becomes unstable. Most visible with reward functions that produce smooth, fine-grained scores (e.g. PickScore) where small differences between rollouts are the signal the policy should learn from. (root-cause-sde-window)= ### Root cause The SDE window is the contiguous range of timesteps where stochastic noise (exploration) is injected during the diffusion rollout. When the window is chosen **per-request** — e.g. by hashing the request ID or using an unseeded RNG — different rollouts for the same prompt inject noise at *different timestep ranges*. The reward scatter combines two sources of variance: 1. **Exploration noise** — intended, the policy should learn from this. 2. **Timestep bias** — unintended, two rollouts with identical behaviour can score differently purely because noise was applied at different timesteps. The second source inflates `critic/rewards/std_mean` by ~4×, making GRPO advantage estimates unreliable. The trainer cannot distinguish noise from signal, drift dominates, and training collapses. (fix-sde-window)= ### Fix Seed the SDE window RNG with a **per-GPU identifier** so all rollouts on the same GPU share the same noise-injection window. This removes the timestep bias from the variance budget — all rollouts for a prompt are evaluated under consistent noise structure, and the remaining variance is genuine exploration that GRPO can learn from. ```python # Wrong — each rollout gets a different SDE window: rng = random.Random(hash(request_id)) # Correct — all rollouts on the same GPU share one window: rng = random.Random(int(os.environ["LOCAL_RANK"])) ``` This matches the reference flow_grpo behaviour of ``random.seed(process_index)``. ```{note} Reading ``LOCAL_RANK`` from the environment couples the window to process placement, which is opaque and not reproducible across GPU counts. A better approach would pass an explicit seed from the launcher. ``` (verification-sde-window)= ### Verification Two BAGEL PickScore LoRA runs on the same 4-GPU setup, differing only in SDE window seeding: | Metric | Per-request seed | Per-GPU seed | |---|---|---| | `critic/rewards/std_mean` | 0.07 → 0.10 (diverging) | 0.022 → 0.014 (converging) | | `critic/score/mean` trend | 0.78 → 0.62 (collapsing) | 0.82 → 0.85+ (improving) | | Steps survived | 28 | 181+ | | `actor/ppo_kl` | Growing to 8×10⁻⁴ | Near zero, stable | | `val-core/.../reward/mean@1` | N/A | 0.82 → 0.87 (improving) | The fix brings `critic/rewards/std_mean` into the same ~0.01–0.02 range observed in reference flow_grpo PickScore runs.