Diffusion-DPO

Last updated: 06/05/2026.

Diffusion-DPO (paper, code) adapts Direct Preference Optimization (DPO) to text-to-image diffusion models. It aligns a diffusion policy to pairwise preferences by comparing how well the current model and a frozen reference model explain chosen and rejected images under a forward noising process.

VeRL-Omni supports Diffusion-DPO as a direct-preference algorithm. The default recipe is online DPO: the trainer samples multiple images for each prompt, scores them with a reward model or reward function, converts the best and worst samples into a chosen/rejected pair, and updates the actor with the DPO loss. Offline preference pairs are also supported through the same loss and engine contract.

Algorithm

For a prompt $c$, online Diffusion-DPO first samples $K$ images from the current rollout policy:

\[ x_0^{1:K} \sim \pi_\theta(\cdot \mid c). \]

A reward function assigns scalar scores $r(x_0^k, c)$. VeRL-Omni builds one preference pair per prompt by selecting the highest-scoring sample as the chosen image $x_0^w$ and the lowest-scoring sample as the rejected image $x_0^l$:

\[ x_0^w = \arg\max_{x_0^k} r(x_0^k, c), \qquad x_0^l = \arg\min_{x_0^k} r(x_0^k, c). \]

The pair is then noised with the same noise $\epsilon$ and timestep $t$:

\[ x_t = (1-\sigma_t)x_0 + \sigma_t \epsilon. \]

For flow-matching models, the target velocity is:

\[ u(x_0, \epsilon) = \epsilon - x_0. \]

Diffusion-DPO compares the current model’s prediction error against the reference model’s prediction error. Let

\[ \Delta_\theta(x_0) = \left\|v_\theta(x_t,c,t)-u(x_0,\epsilon)\right\|_2^2 - \left\|v_{\mathrm{ref}}(x_t,c,t)-u(x_0,\epsilon)\right\|_2^2. \]

The pairwise objective is:

\[ \mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(c,x_0^w,x_0^l)} \log \sigma\left( -\frac{\beta}{2} \left[ \Delta_\theta(x_0^w)-\Delta_\theta(x_0^l) \right] \right). \]

Here $\beta$ is the DPO inverse temperature. Larger values make the update more sensitive to the current-vs-reference error margin between the chosen and rejected samples.

How VeRL-Omni Implements Diffusion-DPO

VeRL-Omni runs Diffusion-DPO through the direct-preference trainer.

Layer	What it does	Code
Trainer	Runs online rollout, reward scoring, best/worst pair selection, reference prediction, and actor update.	`verl_omni/trainer/diffusion/ray_diffusion_trainer.py`
Actor loss	Selects online pairs and computes the pairwise DPO objective from model and reference prediction errors.	`verl_omni/trainer/diffusion/diffusion_algos.py`
FSDP engine	Re-noises clean latents with shared pairwise noise/timesteps and performs a one-shot flow-matching forward pass.	`verl_omni/workers/engine/fsdp/diffusers_impl.py`
Pairwise utilities	Samples and validates shared noise/timesteps for adjacent chosen/rejected pairs.	`verl_omni/pipelines/utils.py`
Qwen-Image adapter	Builds Qwen-Image transformer inputs for DPO training and optional True-CFG inference.	`verl_omni/pipelines/qwen_image_dpo/`

The online batch layout is important. After rollout and reward scoring, DPOLoss.prepare_actor_batch(...) groups samples by prompt uid, sorts each group by reward, and keeps [chosen, rejected] adjacent in the actor batch. DPODiffusersFSDPEngine then samples one shared noise tensor and one shared timestep per pair, repeats both across the chosen and rejected samples, and returns:

noise_pred: current actor prediction.
noise: shared pairwise flow noise.
latent: clean latent for the generated image.
timesteps: shared pairwise training timestep.

The trainer computes ref_noise_pred with the reference policy before actor update. The DPO loss consumes noise_pred, ref_noise_pred, noise, latent, and sample_level_rewards; it also checks that adjacent pairs share the same prompt uid and that the chosen reward is not lower than the rejected reward.

Configuration

The reference online Qwen-Image OCR recipe selects Diffusion-DPO with:

algorithm.trainer_type=direct_preference
algorithm.sample_source=online
algorithm.paired_preference=true
actor_rollout_ref.model.algorithm=dpo
actor_rollout_ref.model.model_type=diffusion_dpo_model
actor_rollout_ref.model.external_lib=verl_omni.pipelines.qwen_image_dpo
actor_rollout_ref.actor.diffusion_loss.loss_mode=dpo
actor_rollout_ref.actor.diffusion_loss.dpo_beta=100.0
actor_rollout_ref.rollout.calculate_log_probs=false

Core Parameters

algorithm.trainer_type: must be direct_preference for Diffusion-DPO.
algorithm.sample_source: use online for live rollout and reward scoring. Use offline only when the dataset already contains preference pairs and scores.
algorithm.paired_preference: must be true; DPO trains on adjacent chosen/rejected pairs.
actor_rollout_ref.rollout.n: number of images sampled per prompt before online pair selection. It must be at least 2; the example uses 16.
actor_rollout_ref.actor.diffusion_loss.dpo_beta: $\beta$ in the pairwise DPO objective. The default config value is 2000.0, while the online Qwen-Image OCR recipe uses 100.0.
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu: must be an even number greater than or equal to 2 when paired_preference=true, so a chosen/rejected pair is not split across micro batches.
actor_rollout_ref.actor.shuffle: pair-preserving DPO updates require unshuffled actor batches; the trainer disables shuffling if needed.
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu: controls the reference forward micro-batch size used to compute ref_noise_pred.
actor_rollout_ref.rollout.calculate_log_probs: should be false; DPO does not train from reverse-process log probabilities.

Reference Example

The ready-to-run online DPO example post-trains Qwen/Qwen-Image with an OCR reward model (Qwen/Qwen3-VL-8B-Instruct) using vllm_omni rollout. It is configured for one node with 4 GPUs, LoRA rank 64, rollout group size 16, 35 inference steps during training rollout, and 300 actor update steps.

First install the OCR reward dependency after setting up the base VeRL-Omni environment:

pip install Levenshtein

Obtain the raw OCR dataset from the original Flow-GRPO repository (dataset/ocr) and place it under $WORKSPACE/data/ocr, where WORKSPACE defaults to $HOME if unset. Then preprocess it into the Qwen-Image parquet files consumed by the DPO script:

export WORKSPACE=${WORKSPACE:-$HOME}

python3 examples/flowgrpo_trainer/data_process/qwenimage_ocr.py \
  --input_dir $WORKSPACE/data/ocr \
  --output_dir $WORKSPACE/data/ocr/qwen_image

The command writes:

$WORKSPACE/data/ocr/qwen_image/train.parquet
$WORKSPACE/data/ocr/qwen_image/test.parquet

Launch online DPO training from the repository root:

bash examples/dpo_trainer/qwen_image/run_qwen_image_online_dpo_lora.sh

You can override any Hydra option at launch time. For example, to reduce the rollout group size for a quick smoke run:

bash examples/dpo_trainer/qwen_image/run_qwen_image_online_dpo_lora.sh \
  data.train_batch_size=4 \
  actor_rollout_ref.rollout.n=2 \
  actor_rollout_ref.actor.ppo_mini_batch_size=2 \
  actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
  trainer.total_training_steps=2

References

B. Wallace et al., Diffusion Model Alignment Using Direct Preference Optimization, CVPR 2024.
Diffusion-DPO official repository: https://github.com/SalesforceAIResearch/DiffusionDPO.
FlowGRPO official repository DPO loss implementation: https://github.com/yifan123/flow_grpo/blob/main/scripts/train_sd3_dpo.py.

Citation

@inproceedings{Wallace_2024_CVPR,
  author = {Wallace, Bram and Dang, Meihua and Rafailov, Rafael and Zhou, Linqi and Lou, Aaron and Purushwalkam, Senthil and Ermon, Stefano and Xiong, Caiming and Joty, Shafiq and Naik, Nikhil},
  title = {Diffusion Model Alignment Using Direct Preference Optimization},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2024},
  pages = {8228--8238}
}