Diffusion-DPO
Last updated: 06/05/2026.
Diffusion-DPO (paper, code) adapts Direct Preference Optimization (DPO) to text-to-image diffusion models. It aligns a diffusion policy to pairwise preferences by comparing how well the current model and a frozen reference model explain chosen and rejected images under a forward noising process.
VeRL-Omni supports Diffusion-DPO as a direct-preference algorithm. The default recipe is online DPO: the trainer samples multiple images for each prompt, scores them with a reward model or reward function, converts the best and worst samples into a chosen/rejected pair, and updates the actor with the DPO loss. Offline preference pairs are also supported through the same loss and engine contract.
Algorithm
For a prompt \(c\), online Diffusion-DPO first samples \(K\) images from the current rollout policy:
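In symbols, with \(\pi_\theta\) standing for the rollout policy (a name assumed here for illustration), the sampling step reads:

\[
x_0^k \sim \pi_\theta(\cdot \mid c), \qquad k = 1, \dots, K.
\]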
A reward function assigns scalar scores \(r(x_0^k, c)\). VeRL-Omni builds one preference pair per prompt by selecting the highest-scoring sample as the chosen image \(x_0^w\) and the lowest-scoring sample as the rejected image \(x_0^l\):
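The selection described above can be written as follows; how ties between equal rewards are broken is not specified here and is left open:

\[
x_0^w = \operatorname*{arg\,max}_{x_0^k,\; k = 1, \dots, K} r(x_0^k, c),
\qquad
x_0^l = \operatorname*{arg\,min}_{x_0^k,\; k = 1, \dots, K} r(x_0^k, c).
\]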
The pair is then noised with the same noise \(\epsilon\) and timestep \(t\):
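One common way to write this step assumes the linear (rectified-flow) interpolation with \(t \in [0, 1]\) that flow-matching models typically use; the exact schedule in VeRL-Omni may differ:

\[
x_t^w = (1 - t)\,x_0^w + t\,\epsilon,
\qquad
x_t^l = (1 - t)\,x_0^l + t\,\epsilon,
\qquad
\epsilon \sim \mathcal{N}(0, I).
\]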
For flow-matching models, the target velocity is:
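Assuming the linear interpolation \(x_t = (1 - t)\,x_0 + t\,\epsilon\), the velocity is the time derivative of \(x_t\), which gives one target per image of the pair:

\[
v^w = \frac{\mathrm{d}x_t^w}{\mathrm{d}t} = \epsilon - x_0^w,
\qquad
v^l = \frac{\mathrm{d}x_t^l}{\mathrm{d}t} = \epsilon - x_0^l.
\]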
Diffusion-DPO compares the current model’s prediction error against the reference model’s prediction error on both images of the pair.
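With \(v_\theta\) and \(v_{\mathrm{ref}}\) denoting the actor and reference predictions, and \(v^w\), \(v^l\) the flow-matching targets for the chosen and rejected images (notation assumed here), one plausible way to define the four squared errors is:

\[
\begin{aligned}
e_\theta^w &= \lVert v_\theta(x_t^w, t, c) - v^w \rVert_2^2, &
e_{\mathrm{ref}}^w &= \lVert v_{\mathrm{ref}}(x_t^w, t, c) - v^w \rVert_2^2, \\
e_\theta^l &= \lVert v_\theta(x_t^l, t, c) - v^l \rVert_2^2, &
e_{\mathrm{ref}}^l &= \lVert v_{\mathrm{ref}}(x_t^l, t, c) - v^l \rVert_2^2.
\end{aligned}
\]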
The pairwise objective is:
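In the form used by the official Diffusion-DPO implementation, and writing \(e_\theta^w, e_\theta^l\) and \(e_{\mathrm{ref}}^w, e_{\mathrm{ref}}^l\) for the squared prediction errors of the actor and the reference on the chosen and rejected images (notation assumed here), the loss can be sketched as:

\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{c,\,t,\,\epsilon}\left[
\log \sigma\!\left(
-\frac{\beta}{2}\Big[\big(e_\theta^w - e_{\mathrm{ref}}^w\big) - \big(e_\theta^l - e_{\mathrm{ref}}^l\big)\Big]
\right)\right].
\]

The factor \(1/2\) follows that reference implementation; VeRL-Omni may fold constants into \(\beta\) differently. The loss decreases when the actor lowers its error on the chosen image relative to the reference by more than it does on the rejected image.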
Here \(\beta\) is the DPO inverse temperature. Larger values make the update more sensitive to the current-vs-reference error margin between the chosen and rejected samples.
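The effect of \(\beta\) is easiest to see on a toy batch. The following NumPy sketch is not the VeRL-Omni loss: the function name `diffusion_dpo_loss`, the `noise - latent` target, the per-sample mean over non-batch dimensions, and the `0.5` scale are assumptions modeled on the official Diffusion-DPO code. It expects rows laid out as adjacent `[chosen, rejected]` pairs.

```python
import numpy as np


def diffusion_dpo_loss(noise_pred, ref_noise_pred, noise, latent, beta):
    """Pairwise DPO loss for rows laid out as [chosen, rejected, chosen, rejected, ...]."""
    # Assumed flow-matching target: velocity from clean latent to noise.
    target = noise - latent
    # Per-sample mean squared error over all non-batch dimensions.
    axes = tuple(range(1, noise_pred.ndim))
    model_err = ((noise_pred - target) ** 2).mean(axis=axes)
    ref_err = ((ref_noise_pred - target) ** 2).mean(axis=axes)
    # Even rows are chosen samples, odd rows are rejected samples.
    model_diff = model_err[0::2] - model_err[1::2]
    ref_diff = ref_err[0::2] - ref_err[1::2]
    # -log(sigmoid(z)) equals logaddexp(0, -z), which is numerically stable.
    z = -0.5 * beta * (model_diff - ref_diff)
    return np.logaddexp(0.0, -z).mean()


# One pair: row 0 is chosen, row 1 is rejected.
noise = np.array([[1.0, -1.0], [0.5, 0.5]])
latent = np.zeros((2, 2))
ref_pred = np.zeros((2, 2))

# Actor identical to the reference: the margin is zero, so the loss is log(2).
loss_same = diffusion_dpo_loss(ref_pred, ref_pred, noise, latent, beta=100.0)

# Actor fits the chosen sample perfectly and matches the reference on the rejected one.
actor_pred = ref_pred.copy()
actor_pred[0] = noise[0] - latent[0]
loss_better = diffusion_dpo_loss(actor_pred, ref_pred, noise, latent, beta=100.0)
loss_small_beta = diffusion_dpo_loss(actor_pred, ref_pred, noise, latent, beta=1.0)
```

With the same favorable margin, the larger \(\beta\) drives the loss much closer to zero, which is the sensitivity described above.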
How VeRL-Omni Implements Diffusion-DPO
VeRL-Omni runs Diffusion-DPO through the direct-preference trainer.
| Layer | What it does | Code |
|---|---|---|
| Trainer | Runs online rollout, reward scoring, best/worst pair selection, reference prediction, and actor update. | |
| Actor loss | Selects online pairs and computes the pairwise DPO objective from model and reference prediction errors. | |
| FSDP engine | Re-noises clean latents with shared pairwise noise/timesteps and performs a one-shot flow-matching forward pass. | |
| Pairwise utilities | Samples and validates shared noise/timesteps for adjacent chosen/rejected pairs. | |
| Qwen-Image adapter | Builds Qwen-Image transformer inputs for DPO training and optional True-CFG inference. | |
The online batch layout is important. After rollout and reward scoring,
DPOLoss.prepare_actor_batch(...) groups samples by prompt uid, sorts each
group by reward, and keeps [chosen, rejected] adjacent in the actor batch.
DPODiffusersFSDPEngine then samples one shared noise tensor and one shared
timestep per pair, repeats both across the chosen and rejected samples, and
returns:
- noise_pred: current actor prediction.
- noise: shared pairwise flow noise.
- latent: clean latent for the generated image.
- timesteps: shared pairwise training timestep.
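The shared noise and timestep per pair can be illustrated with a small NumPy sketch. This is not the DPODiffusersFSDPEngine code: `renoise_pairs` is a hypothetical helper, and the uniform timestep distribution and linear interpolation are assumptions chosen for brevity.

```python
import numpy as np


def renoise_pairs(latents, rng):
    """Noise a batch laid out as [chosen, rejected, ...] with shared per-pair noise and timestep."""
    if latents.shape[0] % 2 != 0:
        raise ValueError("expected an even batch of adjacent chosen/rejected pairs")
    num_pairs = latents.shape[0] // 2
    # Sample one noise tensor and one timestep per pair.
    pair_noise = rng.standard_normal((num_pairs, *latents.shape[1:]))
    pair_t = rng.uniform(0.0, 1.0, size=num_pairs)
    # Repeat each draw so both members of a pair see identical values.
    noise = np.repeat(pair_noise, 2, axis=0)
    timesteps = np.repeat(pair_t, 2, axis=0)
    # Reshape timesteps so they broadcast over the non-batch dimensions.
    t_b = timesteps.reshape(-1, *([1] * (latents.ndim - 1)))
    noisy = (1.0 - t_b) * latents + t_b * noise
    return noisy, noise, timesteps


rng = np.random.default_rng(0)
clean = np.ones((4, 3, 2))  # two pairs of toy latents
noisy, noise, timesteps = renoise_pairs(clean, rng)
```

Sharing the draw within a pair means the chosen and rejected errors differ only because of the images themselves, not because of different noise levels.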
The trainer computes ref_noise_pred with the reference policy before actor
update. The DPO loss consumes noise_pred, ref_noise_pred, noise, latent,
and sample_level_rewards; it also checks that adjacent pairs share the same
prompt uid and that the chosen reward is not lower than the rejected reward.
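The grouping and validation described above can be sketched in plain Python. This is an illustration rather than the DPOLoss.prepare_actor_batch implementation: the function names and the dict-based sample format with `uid` and `reward` keys are hypothetical.

```python
from collections import defaultdict


def select_online_pairs(samples):
    """Return a flat list [chosen, rejected, chosen, rejected, ...], one pair per prompt uid."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["uid"]].append(sample)
    batch = []
    for uid, group in groups.items():
        if len(group) < 2:
            raise ValueError(f"prompt {uid!r} needs at least 2 samples to form a pair")
        # Sort by reward, best first; keep the best and the worst sample.
        ranked = sorted(group, key=lambda s: s["reward"], reverse=True)
        batch.extend([ranked[0], ranked[-1]])
    return batch


def check_pairs(batch):
    """Raise ValueError if adjacent rows do not form valid chosen/rejected pairs."""
    if len(batch) % 2 != 0:
        raise ValueError("batch length must be even")
    for chosen, rejected in zip(batch[0::2], batch[1::2]):
        if chosen["uid"] != rejected["uid"]:
            raise ValueError("adjacent pair mixes two prompts")
        if chosen["reward"] < rejected["reward"]:
            raise ValueError("chosen reward is lower than rejected reward")


samples = [
    {"uid": "a", "reward": 0.2},
    {"uid": "a", "reward": 0.9},
    {"uid": "a", "reward": 0.5},
    {"uid": "b", "reward": 0.1},
    {"uid": "b", "reward": 0.4},
]
batch = select_online_pairs(samples)
check_pairs(batch)
```

Keeping each pair adjacent lets later stages recover chosen and rejected rows by even and odd index, which is why shuffling and odd micro-batch sizes would break the layout.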
Configuration
The reference online Qwen-Image OCR recipe selects Diffusion-DPO with:
algorithm.trainer_type=direct_preference
algorithm.sample_source=online
algorithm.paired_preference=true
actor_rollout_ref.model.algorithm=dpo
actor_rollout_ref.model.model_type=diffusion_dpo_model
actor_rollout_ref.model.external_lib=verl_omni.pipelines.qwen_image_dpo
actor_rollout_ref.actor.diffusion_loss.loss_mode=dpo
actor_rollout_ref.actor.diffusion_loss.dpo_beta=100.0
actor_rollout_ref.rollout.calculate_log_probs=false
Core Parameters
- algorithm.trainer_type: must be direct_preference for Diffusion-DPO.
- algorithm.sample_source: use online for live rollout and reward scoring. Use offline only when the dataset already contains preference pairs and scores.
- algorithm.paired_preference: must be true; DPO trains on adjacent chosen/rejected pairs.
- actor_rollout_ref.rollout.n: number of images sampled per prompt before online pair selection. It must be at least 2; the example uses 16.
- actor_rollout_ref.actor.diffusion_loss.dpo_beta: \(\beta\) in the pairwise DPO objective. The default config value is 2000.0, while the online Qwen-Image OCR recipe uses 100.0.
- actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu: must be an even number greater than or equal to 2 when paired_preference=true, so a chosen/rejected pair is not split across micro batches.
- actor_rollout_ref.actor.shuffle: pair-preserving DPO updates require unshuffled actor batches; the trainer disables shuffling if needed.
- actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu: controls the reference forward micro-batch size used to compute ref_noise_pred.
- actor_rollout_ref.rollout.calculate_log_probs: should be false; DPO does not train from reverse-process log probabilities.
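The constraints in this list can be checked before launching a run. The helper below is a hypothetical pre-flight check, not part of VeRL-Omni; it assumes the options have been collected into a flat dict keyed by their dotted names.

```python
def check_dpo_config(cfg):
    """Return a list of problems found in a flat dict of dotted option names."""
    problems = []
    if cfg.get("algorithm.trainer_type") != "direct_preference":
        problems.append("algorithm.trainer_type must be direct_preference")
    if cfg.get("algorithm.paired_preference") is not True:
        problems.append("algorithm.paired_preference must be true")
    # At least two images per prompt are needed to form a chosen/rejected pair.
    if cfg.get("actor_rollout_ref.rollout.n", 0) < 2:
        problems.append("actor_rollout_ref.rollout.n must be at least 2")
    # An even micro-batch size keeps each pair inside one micro batch.
    micro = cfg.get("actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu", 0)
    if micro < 2 or micro % 2 != 0:
        problems.append("ppo_micro_batch_size_per_gpu must be an even number >= 2")
    if cfg.get("actor_rollout_ref.rollout.calculate_log_probs", False):
        problems.append("actor_rollout_ref.rollout.calculate_log_probs should be false")
    return problems


recipe = {
    "algorithm.trainer_type": "direct_preference",
    "algorithm.paired_preference": True,
    "actor_rollout_ref.rollout.n": 16,
    "actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu": 2,
    "actor_rollout_ref.rollout.calculate_log_probs": False,
}
problems = check_dpo_config(recipe)
```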
Reference Example
The ready-to-run online DPO example post-trains Qwen/Qwen-Image with an OCR
reward model (Qwen/Qwen3-VL-8B-Instruct) using vllm_omni rollout. It is
configured for one node with 4 GPUs, LoRA rank 64, rollout group size 16,
35 inference steps during training rollout, and 300 actor update steps.
First install the OCR reward dependency after setting up the base VeRL-Omni environment:
pip install Levenshtein
Obtain the raw OCR dataset (dataset/ocr) from the original Flow-GRPO repository
and place it under $WORKSPACE/data/ocr, where WORKSPACE defaults to
$HOME if unset. Then preprocess it into the Qwen-Image parquet files consumed
by the DPO script:
export WORKSPACE=${WORKSPACE:-$HOME}
python3 examples/flowgrpo_trainer/data_process/qwenimage_ocr.py \
    --input_dir $WORKSPACE/data/ocr \
    --output_dir $WORKSPACE/data/ocr/qwen_image
The command writes:
- $WORKSPACE/data/ocr/qwen_image/train.parquet
- $WORKSPACE/data/ocr/qwen_image/test.parquet
Launch online DPO training from the repository root:
bash examples/dpo_trainer/run_qwen_image_online_dpo_lora.sh
You can override any Hydra option at launch time. For example, to reduce the rollout group size for a quick smoke run:
bash examples/dpo_trainer/run_qwen_image_online_dpo_lora.sh \
    data.train_batch_size=4 \
    actor_rollout_ref.rollout.n=2 \
    actor_rollout_ref.actor.ppo_mini_batch_size=2 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    trainer.total_training_steps=2
References
B. Wallace et al., Diffusion Model Alignment Using Direct Preference Optimization, CVPR 2024.
Diffusion-DPO official repository: https://github.com/SalesforceAIResearch/DiffusionDPO.
Flow-GRPO official repository, DPO loss implementation: https://github.com/yifan123/flow_grpo/blob/main/scripts/train_sd3_dpo.py.
Citation
@inproceedings{Wallace_2024_CVPR,
author = {Wallace, Bram and Dang, Meihua and Rafailov, Rafael and Zhou, Linqi and Lou, Aaron and Purushwalkam, Senthil and Ermon, Stefano and Xiong, Caiming and Joty, Shafiq and Naik, Nikhil},
title = {Diffusion Model Alignment Using Direct Preference Optimization},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {8228--8238}
}