# DiffusionNFT

Last updated: 06/03/2026.

DiffusionNFT ([paper](https://arxiv.org/abs/2509.16117), [code](https://github.com/NVlabs/DiffusionNFT), [project page](https://research.nvidia.com/labs/cosmos-lab/diffusionnft/)) is an online RL method for diffusion models that optimizes the **forward diffusion process** instead of applying policy gradients to the reverse sampling chain. It contrasts positive and negative generations under a reward signal, then folds that reinforcement signal into a supervised flow-matching objective.

This makes DiffusionNFT useful when reverse-process likelihoods are expensive or awkward to estimate. It is solver-agnostic during rollout, needs only clean final latents/images and rewards for actor training, and naturally supports an off-policy split between the rollout policy and the training policy.

## Algorithm

For a prompt $c$, the old policy samples $K$ clean images $x_0^{1:K} \sim \pi^{\text{old}}(\cdot \mid c)$. A reward model assigns raw scores $r^{\text{raw}}(x_0, c)$, which are mapped into an optimality probability:

$$
\begin{aligned}
r(x_0,c) &= \frac{1}{2} + \frac{1}{2} \mathrm{clip}\left(
\frac{
r^{\mathrm{raw}}(x_0,c) - \mathbb{E}_{\pi^{\mathrm{old}}(\cdot \mid c)} r^{\mathrm{raw}}(x_0,c)
}{Z_c},
-1, 1
\right).
\end{aligned}
$$

This optimality-probability transform follows the GRPO-style practice of normalizing rewards within a prompt group before clipping them into a bounded training signal. Here $Z_c > 0$ is the reward normalizer for prompt $c$; in practice, it is a standard-deviation term, estimated either from the prompt's sample group or from the global rollout batch. VeRL-Omni defaults to global reward standard deviation normalization for DiffusionNFT (`algorithm.global_std=True`).
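The reward-to-probability mapping above can be illustrated with a small NumPy sketch. This is not VeRL-Omni's implementation: the function name `optimality_prob`, the `eps` stabilizer, and the assumption that rewards arrive as a flat array ordered prompt by prompt are all hypothetical choices made for the example. The `global_std` flag mirrors the two normalizer options described above.

```python
import numpy as np


def optimality_prob(raw_rewards, group_size, global_std=True, eps=1e-6):
    """Map raw rewards to optimality probabilities in [0, 1].

    raw_rewards: flat sequence of length num_prompts * group_size,
                 ordered so that each prompt's samples are contiguous
                 (an assumption of this sketch).
    """
    raw = np.asarray(raw_rewards, dtype=np.float64).reshape(-1, group_size)

    # Center each reward on its own prompt group's mean, which stands in
    # for the expectation under the old policy.
    centered = raw - raw.mean(axis=1, keepdims=True)

    # Z_c: one standard deviation for the whole rollout batch, or one per prompt.
    if global_std:
        z = raw.std() + eps
    else:
        z = raw.std(axis=1, keepdims=True) + eps

    # Clip the normalized advantage to [-1, 1], then shift into [0, 1].
    return 0.5 + 0.5 * np.clip(centered / z, -1.0, 1.0)


# Two prompts, four samples each. The second prompt has identical rewards,
# so every sample in it maps to the neutral probability 0.5.
rewards = [0.0, 1.0, 2.0, 3.0, 10.0, 10.0, 10.0, 10.0]
probs_global = optimality_prob(rewards, group_size=4, global_std=True)
probs_group = optimality_prob(rewards, group_size=4, global_std=False)
```

With per-group normalization, the extreme samples of the first prompt saturate at 0 and 1; with the global normalizer they stay strictly inside the interval because the batch-wide standard deviation is larger.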
DiffusionNFT then samples a noised latent $x_t$ from the forward process and optimizes two implicit branches:

$$
\begin{aligned}
\mathcal{L}(\theta) &= \mathbb{E}_{c,\,x_0 \sim \pi^{\mathrm{old}}(\cdot \mid c),\,t}
\Big[
r\left\|v_\theta^+(x_t,c,t)-v\right\|_2^2 \\
&\quad + (1-r)\left\|v_\theta^-(x_t,c,t)-v\right\|_2^2
\Big],
\end{aligned}
$$

where $r = r(x_0,c)$ is the optimality probability, $v$ is the flow-matching target velocity, and the implicit positive and negative velocities are

$$
v_\theta^+(x_t,c,t) = (1-\beta)v^{\text{old}}(x_t,c,t) + \beta v_\theta(x_t,c,t),
$$

$$
v_\theta^-(x_t,c,t) = (1+\beta)v^{\text{old}}(x_t,c,t) - \beta v_\theta(x_t,c,t).
$$

Here $\beta$ controls the reinforcement guidance strength. In VeRL-Omni this is `actor_rollout_ref.actor.diffusion_loss.mix_beta`.

## How VeRL-Omni Implements DiffusionNFT

VeRL-Omni's DiffusionNFT path uses a direct-preference trainer loop rather than the policy-gradient loop used by Flow-GRPO.

| Layer | What it does | Code |
|---|---|---|
| Rollout adapter | Generates images with the `old` LoRA adapter and returns clean latents plus trainable forward timesteps. | `verl_omni/pipelines/qwen_image_diffusion_nft/` |
| Actor loss | Implements the implicit positive/negative forward-process objective and optional reference prediction MSE. | `verl_omni/trainer/diffusion/diffusion_algos.py` |
| FSDP engine | Trains from clean latents by re-noising at selected forward timesteps. | `verl_omni/workers/engine/fsdp/diffusers_impl.py` |
| Trainer | Runs online rollout, reward evaluation, actor update, and old-policy adapter refresh. | `verl_omni/trainer/diffusion/ray_diffusion_trainer.py` |

The actor keeps two policy adapters:

- `default`: the trainable policy updated by actor optimization.
- `old`: the rollout policy used for data collection and the implicit branch definitions above.

After actor updates, the trainer refreshes the `old` adapter from `default` using `algorithm.old_policy_decay_schedule` and `algorithm.old_policy_update_interval`.
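To make the actor-loss row of the table above more concrete, here is a minimal NumPy sketch of the implicit positive/negative objective from the Algorithm section. It is an illustration under stated assumptions, not the code in `diffusion_algos.py`: the function name `diffusion_nft_loss` and its argument layout are hypothetical, the velocities are passed in as precomputed arrays, and the optional reference-prediction MSE term is omitted.

```python
import numpy as np


def diffusion_nft_loss(v_theta, v_old, v_target, reward_prob, beta):
    """Reward-weighted implicit positive/negative velocity loss.

    v_theta, v_old, v_target: arrays of shape (batch, ...) holding the
        trainable prediction, the old-policy prediction, and the
        flow-matching target velocity at the same (x_t, c, t).
    reward_prob: array of shape (batch,) with values in [0, 1].
    beta: mixing strength (mix_beta in the configuration).
    """
    # Implicit branches: the positive one moves from the old prediction
    # toward the trainable one, the negative one moves away from it.
    v_pos = (1.0 - beta) * v_old + beta * v_theta
    v_neg = (1.0 + beta) * v_old - beta * v_theta

    # Squared L2 error per sample, summed over all non-batch dimensions.
    axes = tuple(range(1, v_theta.ndim))
    pos_err = np.sum((v_pos - v_target) ** 2, axis=axes)
    neg_err = np.sum((v_neg - v_target) ** 2, axis=axes)

    # High-reward samples weight the positive branch, low-reward samples
    # weight the negative branch; average over the batch.
    return np.mean(reward_prob * pos_err + (1.0 - reward_prob) * neg_err)


# Toy check: when the trainable and old predictions coincide, both
# branches equal v_old and the loss does not depend on the reward.
v_old = np.ones((2, 3))
v_target = np.zeros((2, 3))
loss_same = diffusion_nft_loss(v_old, v_old, v_target, np.array([0.2, 0.9]), beta=0.5)
```

One consequence visible in the sketch: right after an `old` adapter refresh that copies `default`, the two branches coincide, so the reward only starts to shape the gradient through the opposite signs with which $v_\theta$ enters `v_pos` and `v_neg`.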
## Configuration

The reference Qwen-Image OCR recipe selects DiffusionNFT with:

```bash
actor_rollout_ref.model.algorithm=diffusion_nft
actor_rollout_ref.model.model_type=diffusion_nft_model
algorithm.trainer_type=direct_preference
algorithm.sample_source=online
algorithm.paired_preference=false
actor_rollout_ref.actor.diffusion_loss.loss_mode=diffusion_nft
actor_rollout_ref.model.policy_state_adapters='["default","old"]'
actor_rollout_ref.rollout.rollout_adapter=old
actor_rollout_ref.rollout.calculate_log_probs=False
```

### Core Parameters

- `actor_rollout_ref.rollout.n`: number of images sampled per prompt. The example uses `16`.
- `algorithm.timestep_fraction`: fraction of rollout timesteps used for forward-process actor training. Use `1.0` to train on all selected rollout timesteps.
- `algorithm.adv_mode`: maps normalized reward advantages into `reward_prob`. The default recipe uses `continuous`.
- `algorithm.old_policy_decay_schedule`: old-policy update schedule. Supported values include `copy`, `linear_to_0_5`, and `delayed_linear_to_0_999`.
- `algorithm.old_policy_update_interval`: optimizer steps between `old` adapter refreshes.
- `actor_rollout_ref.actor.diffusion_loss.mix_beta`: $\beta$ in the implicit positive/negative velocity equations.
- `actor_rollout_ref.actor.diffusion_loss.ref_kl_coef`: coefficient for the prediction-space reference MSE regularizer.
- `actor_rollout_ref.actor.diffusion_loss.adv_clip_max`: clamp used when mapping normalized rewards into `reward_prob`.

## Reference Example

The ready-to-run OCR example post-trains `Qwen/Qwen-Image` with a visual reward model (`Qwen/Qwen3-VL-8B-Instruct`) using `vllm_omni` rollout. It is configured for one node with 4 GPUs, LoRA rank `64`, rollout group size `16`, 10 training rollout steps, and 40 validation steps.
First, set up the base VeRL-Omni environment, then install the OCR reward dependency:

```bash
pip install Levenshtein
```

Obtain the raw OCR dataset from the original Flow-GRPO repository ([dataset/ocr](https://github.com/yifan123/flow_grpo/tree/main/dataset/ocr)) and place it under `$WORKSPACE/data/ocr`, where `WORKSPACE` defaults to `$HOME` if unset. Then preprocess it into the parquet files consumed by the DiffusionNFT script:

```bash
# Fall back to $HOME when WORKSPACE is not set.
export WORKSPACE=${WORKSPACE:-$HOME}

# Convert the raw OCR dataset into train/test parquet files in place.
python3 examples/flowgrpo_trainer/data_process/qwenimage_ocr.py \
    --input_dir "$WORKSPACE/data/ocr" \
    --output_dir "$WORKSPACE/data/ocr"
```

The command writes:

- `$WORKSPACE/data/ocr/train.parquet`
- `$WORKSPACE/data/ocr/test.parquet`

Launch training from the repository root:

```bash
bash examples/diffusionnft_trainer/run_qwen_image_ocr_lora.sh
```

## References

- K. Zheng *et al.*, *DiffusionNFT: Online Diffusion Reinforcement with Forward Process*, arXiv:2509.16117.
- DiffusionNFT official repository: https://github.com/NVlabs/DiffusionNFT.
- DiffusionNFT project page: https://research.nvidia.com/labs/cosmos-lab/diffusionnft/.

## Citation

```bibtex
@article{zheng2025diffusionnft,
  title={DiffusionNFT: Online Diffusion Reinforcement with Forward Process},
  author={Zheng, Kaiwen and Chen, Huayu and Ye, Haotian and Wang, Haoxiang and Zhang, Qinsheng and Jiang, Kai and Su, Hang and Ermon, Stefano and Zhu, Jun and Liu, Ming-Yu},
  journal={arXiv preprint arXiv:2509.16117},
  year={2025}
}
```