# Common Pitfalls

Last updated: 05/22/2026.

---

## Float32 precision loss in stored rollout latents

### Symptom

Training metrics show a systematic negative bias **at step 1** (before any weight update):

- `actor/ratio_mean` is consistently below `1.0` (e.g. `0.99996`)
- `actor/ppo_kl` and `actor/pg_clipfrac` are inflated at step 1
- `actor/pg_clipfrac_higher` is **zero**: all clipping is on the lower side
- Most visible with rollout correction (`bypass_mode=True`), but it also degrades stored trajectory precision in standard training.

### Root cause

`FlowMatchSDEDiscreteScheduler.step()` computes `log_prob` in **float32** using the fp32 `prev_sample`, then **casts `prev_sample` back to `model_output.dtype` (bfloat16)** before returning. The stored latents lose precision, creating a mismatch with the log-prob computation.

### Fix

Two changes in the scheduler, one in the rollout adapter. The training adapter is **unchanged**; it already uses fp32 correctly.

**1. Scheduler:** `step()` no longer truncates `prev_sample` to bfloat16, and `sample_previous_step()` asserts that `model_output` is float32 so callers cannot accidentally pass lower precision.

**2. Rollout adapter:** latents are cast to the transformer's native dtype before the forward pass (performance), `noise_pred` is cast to float32 before the scheduler (precision), and all stored latents are in float32.

### Verification

The fix eliminates the systematic precision-loss bias from the scheduler. In non-bypass mode (no rollout correction), `ratio_mean ≈ 1.0` at step 1. In bypass mode, a ~3×10⁻⁵ KL divergence remains due to the vLLM vs PyTorch attention kernel difference, which is unavoidable when using different inference backends.

| Metric | Before fix (bypass) | After fix (bypass) | No bypass |
|---|---|---|---|
| `actor/ppo_kl` | ~3.6×10⁻⁵ | ~3.3×10⁻⁵ | ~1×10⁻⁶ |
| `actor/pg_clipfrac` | ~12% | ~9% | ~1% |
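
The mechanism behind the root cause can be reproduced in isolation. The sketch below is a toy model, not the scheduler's code: it assumes a scalar Gaussian transition, emulates bfloat16 storage by bit manipulation (Python floats stand in for the fp32 path), and uses hypothetical helper names (`to_bfloat16`, `gaussian_log_prob`, `mean_log_ratio`) and arbitrary values for `std` and the mean range. It scores each sample once at full precision, as the rollout does, and once after a bfloat16 round trip, as training would see it from the stored latent.

```python
import math
import random
import struct


def to_bfloat16(x: float) -> float:
    """Emulate a round trip through bfloat16 (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of the float32 pattern.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gaussian_log_prob(x: float, mean: float, std: float) -> float:
    """Log-density of N(mean, std^2) at x."""
    return (
        -((x - mean) ** 2) / (2 * std**2)
        - math.log(std)
        - 0.5 * math.log(2 * math.pi)
    )


def mean_log_ratio(store=to_bfloat16, num_samples=100_000, std=0.05, seed=0):
    """Average of log(new_prob / old_prob) when the sample is stored via `store`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        mean = rng.uniform(1.0, 2.0)  # illustrative latent magnitude
        prev_sample = mean + std * rng.gauss(0.0, 1.0)
        # Rollout: log-prob computed from the full-precision sample.
        old_log_prob = gaussian_log_prob(prev_sample, mean, std)
        # Training: log-prob recomputed from the stored (possibly truncated) sample.
        new_log_prob = gaussian_log_prob(store(prev_sample), mean, std)
        total += new_log_prob - old_log_prob
    return total / num_samples


if __name__ == "__main__":
    print("bf16 storage:", mean_log_ratio(store=to_bfloat16))
    print("exact storage:", mean_log_ratio(store=lambda x: x))
```

With exact storage the two log-probs are identical, so the mean log-ratio is zero. With bfloat16 storage, the rounding error δ contributes a term of about −δ²/(2·std²) per sample, which is negative on average, matching the direction of the bias reported under Symptom. The magnitudes in this toy setup are illustrative only and are not expected to match the training metrics.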