Trainer Interface
Last updated: Jun 05, 2026 (API docstrings are auto-generated).
VeRL-Omni provides Ray-based trainers for diffusion / multimodal RL.
TaskRunner builds worker mappings and
dispatches to a trainer subclass selected by algorithm.trainer_type:
policy_gradient → PolicyGradientRayTrainer (FlowGRPO, MixGRPO, DanceGRPO, GRPO-Guard; multi-timestep reverse-process PG)
direct_preference → DirectPreferenceRayTrainer (DPO, DiffusionNFT, AWM; single forward-timestep preference updates)
Both subclasses inherit shared worker init from
BaseRayDiffusionTrainer.
Rollout and reward engines are initialized only when algorithm.sample_source=online.
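The dispatch on algorithm.trainer_type described above can be pictured as a plain lookup table. The following is a minimal sketch, not the in-tree TaskRunner code: the three classes are empty stand-ins that only reuse the documented names, and TRAINER_TYPES and select_trainer_cls are hypothetical names introduced for illustration.

```python
class BaseRayDiffusionTrainer:
    """Empty stand-in for the shared base class (not the real implementation)."""


class PolicyGradientRayTrainer(BaseRayDiffusionTrainer):
    """Stand-in for the multi-timestep policy-gradient trainer."""


class DirectPreferenceRayTrainer(BaseRayDiffusionTrainer):
    """Stand-in for the single-timestep direct-preference trainer."""


# Hypothetical lookup table keyed by the documented trainer_type values.
TRAINER_TYPES = {
    "policy_gradient": PolicyGradientRayTrainer,
    "direct_preference": DirectPreferenceRayTrainer,
}


def select_trainer_cls(trainer_type: str) -> type:
    """Return the trainer class for a trainer_type string (hypothetical helper)."""
    try:
        return TRAINER_TYPES[trainer_type]
    except KeyError:
        known = ", ".join(sorted(TRAINER_TYPES))
        raise ValueError(
            f"unknown trainer_type {trainer_type!r}; expected one of: {known}"
        ) from None
```

Failing loudly on an unknown key is a design choice of this sketch; how the real TaskRunner handles an unrecognized value is not stated here.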
Base Ray Diffusion Trainer
BaseRayDiffusionTrainer
owns colocated actor/ref worker setup, dataloaders, validation helpers, and
checkpointing. init_workers always builds actor/ref workers; rollout and
reward engines are added only when algorithm.sample_source=online.
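The conditional worker setup can be illustrated with a toy class. This is a sketch under assumptions: SketchTrainer, its workers dictionary, and the placeholder strings are hypothetical, and "offline" is used only as an example of a non-online value. Real workers would be Ray worker groups rather than strings.

```python
class SketchTrainer:
    """Toy model of the init_workers behavior described above (hypothetical)."""

    def __init__(self, sample_source: str):
        self.sample_source = sample_source
        self.workers = {}

    def init_workers(self) -> dict:
        # Actor and reference workers are always built (colocated in the real trainer).
        self.workers["actor_ref"] = "actor/ref worker group"

        # Rollout and reward engines exist only for online sampling.
        if self.sample_source == "online":
            self.workers["rollout"] = "rollout engine"
            self.workers["reward"] = "reward engine"
        return self.workers
```

With sample_source set to "online" the sketch registers three entries; with any other value only the actor/ref entry is present.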
Policy Gradient Ray Trainer
PolicyGradientRayTrainer
implements the online training loop for FlowGRPO-style algorithms: rollout
generation, reward scoring, advantage estimation over denoising timesteps, and
actor updates.
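To make the advantage-estimation step concrete, here is a minimal sketch of a GRPO-style group-relative advantage. It rests on two assumptions that this page does not confirm: rewards are normalized within a group of samples drawn for the same prompt, and each sample's scalar advantage is broadcast unchanged to every denoising timestep. The function name and the eps parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, num_timesteps, eps=1e-6):
    """Normalize rewards within one group and repeat them per timestep.

    rewards: one scalar reward per sample in the group (must be non-empty).
    num_timesteps: number of denoising steps that receive the advantage.
    Returns a list of shape [len(rewards)][num_timesteps].
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; 0.0 when all rewards are equal
    per_sample = [(r - mu) / (sigma + eps) for r in rewards]
    # Assumption: the same advantage is applied at every denoising timestep.
    return [[adv] * num_timesteps for adv in per_sample]
```

The eps term keeps the division finite when every sample in a group receives the same reward, in which case all advantages come out as zero.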
Direct Preference Ray Trainer
DirectPreferenceRayTrainer
is the extension point for direct-preference algorithms (DPO, DiffusionNFT, AWM)
that train with single forward-timestep updates rather than a full multi-step
SDE trajectory. The fit implementation is not yet available in-tree.
Entry Point
Diffusion Algorithms
The verl_omni.trainer.diffusion.diffusion_algos module provides the
loss-function and advantage-estimator registries used by the trainer. Custom
losses and advantage estimators can be registered via the decorators below.
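The registry pattern itself can be sketched in a few lines. The names below (LOSS_REGISTRY, register_loss, toy_pg_loss) are hypothetical and are not the decorators exported by verl_omni.trainer.diffusion.diffusion_algos. The sketch only shows how a decorator-based registry typically maps a string name to a callable.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a loss name to its implementation.
LOSS_REGISTRY: Dict[str, Callable[..., float]] = {}


def register_loss(name: str):
    """Return a decorator that stores the function under `name` (hypothetical)."""

    def decorator(fn: Callable[..., float]) -> Callable[..., float]:
        if name in LOSS_REGISTRY:
            raise ValueError(f"loss {name!r} is already registered")
        LOSS_REGISTRY[name] = fn
        return fn  # return the function unchanged so it stays directly callable

    return decorator


@register_loss("toy_pg")
def toy_pg_loss(prob_ratio: float, advantage: float) -> float:
    """Toy scalar surrogate: negative ratio-weighted advantage."""
    return -prob_ratio * advantage
```

A trainer built on such a registry would look the loss up by the name given in its config, for example LOSS_REGISTRY["toy_pg"]. Rejecting duplicate names is a choice of this sketch, not a documented behavior.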