Trainer Interface

Last updated: Jun 05, 2026 (API docstrings are auto-generated).

VeRL-Omni provides Ray-based trainers for diffusion / multimodal RL. TaskRunner builds worker mappings and dispatches to a trainer subclass selected by algorithm.trainer_type:

  • policy_gradient → PolicyGradientRayTrainer (FlowGRPO, MixGRPO, DanceGRPO, GRPO-Guard; multi-timestep reverse-process PG)

  • direct_preference → DirectPreferenceRayTrainer (DPO, DiffusionNFT, AWM; single forward-timestep preference updates)

Both subclasses inherit shared worker init from BaseRayDiffusionTrainer. Rollout and reward engines are initialized only when algorithm.sample_source=online.
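
The selection by algorithm.trainer_type can be pictured as a lookup table from the config string to a trainer class. The sketch below is illustrative only: the classes are empty stand-ins for the real ones, and TRAINER_CLASSES and select_trainer_cls are hypothetical names, not part of the VeRL-Omni API.

```python
# Empty stand-ins for illustration; the real classes take more arguments.
class BaseRayDiffusionTrainer:
    def __init__(self, config):
        self.config = config


class PolicyGradientRayTrainer(BaseRayDiffusionTrainer):
    pass


class DirectPreferenceRayTrainer(BaseRayDiffusionTrainer):
    pass


# Hypothetical lookup table keyed by algorithm.trainer_type.
TRAINER_CLASSES = {
    "policy_gradient": PolicyGradientRayTrainer,
    "direct_preference": DirectPreferenceRayTrainer,
}


def select_trainer_cls(trainer_type):
    """Return the trainer class for a trainer_type string (hypothetical helper)."""
    try:
        return TRAINER_CLASSES[trainer_type]
    except KeyError:
        raise ValueError(
            f"Unknown algorithm.trainer_type {trainer_type!r}; "
            f"expected one of {sorted(TRAINER_CLASSES)}"
        ) from None


config = {"algorithm": {"trainer_type": "policy_gradient"}}
trainer = select_trainer_cls(config["algorithm"]["trainer_type"])(config)
```

Failing loudly on an unknown value keeps a typo in the config from silently selecting the wrong training loop.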

Base Ray Diffusion Trainer

BaseRayDiffusionTrainer owns colocated actor/ref worker setup, dataloaders, validation helpers, and checkpointing. init_workers always builds actor/ref workers; rollout and reward engines are added only when algorithm.sample_source=online.
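
As a rough picture of the conditional setup described above, the following sketch lists which worker roles would be created for a given algorithm.sample_source. The function plan_worker_roles and the role strings are hypothetical, and "offline" is only an assumed example of a non-online value; the real init_workers creates Ray worker groups rather than returning names.

```python
def plan_worker_roles(sample_source):
    """Return the worker roles to initialize (hypothetical helper)."""
    # Actor and reference workers are always built.
    roles = ["actor", "ref"]
    # Rollout and reward engines are only needed when samples are generated online.
    if sample_source == "online":
        roles += ["rollout", "reward"]
    return roles


online_roles = plan_worker_roles("online")
offline_roles = plan_worker_roles("offline")  # "offline" is an assumed value
```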

Policy Gradient Ray Trainer

PolicyGradientRayTrainer implements the online training loop for FlowGRPO-style algorithms: rollout generation, reward scoring, advantage estimation over denoising timesteps, and actor updates.
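
To make the advantage-estimation step concrete, here is a minimal sketch of a GRPO-style group-relative estimator. It assumes that rewards are normalized within a group of samples sharing one prompt and that each sample's advantage is reused at every denoising timestep. The function name and signature are hypothetical; the estimators actually registered in VeRL-Omni may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, num_timesteps, eps=1e-6):
    """Normalize rewards within one prompt group and repeat them per timestep.

    rewards: scalar rewards for samples generated from the same prompt.
    Returns a list of shape [num_samples][num_timesteps].
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    per_sample = [(r - mu) / (sigma + eps) for r in rewards]
    # Assumption: one scalar advantage per sample, shared by all timesteps.
    return [[a] * num_timesteps for a in per_sample]


adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5], num_timesteps=3)
```

Samples that score above the group mean receive a positive advantage at every timestep, and samples below it receive a negative one.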

Direct Preference Ray Trainer

DirectPreferenceRayTrainer is the extension point for direct-preference algorithms (DPO, DiffusionNFT, AWM) that train with single forward-timestep updates rather than a full multi-step SDE trajectory. The fit implementation is not yet available in-tree.

Entry Point

Diffusion Algorithms

The verl_omni.trainer.diffusion.diffusion_algos module provides the loss-function and advantage-estimator registries used by the trainer. Custom losses and advantage estimators can be registered via the decorators below.
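
The general pattern behind a decorator-based registry can be sketched as follows. The names register_loss, get_loss_fn, and clipped_pg are hypothetical and used only for illustration; they are not the decorators exported by diffusion_algos, and the toy loss operates on plain floats rather than tensors.

```python
# Hypothetical registry mapping a loss name to its function.
_LOSS_FNS = {}


def register_loss(name):
    """Decorator factory that stores the decorated function under `name`."""
    def decorator(fn):
        if name in _LOSS_FNS:
            raise ValueError(f"Loss {name!r} is already registered")
        _LOSS_FNS[name] = fn
        return fn  # return the function unchanged so it stays directly callable
    return decorator


def get_loss_fn(name):
    """Look up a registered loss by name."""
    if name not in _LOSS_FNS:
        raise KeyError(f"Unknown loss {name!r}; registered: {sorted(_LOSS_FNS)}")
    return _LOSS_FNS[name]


@register_loss("clipped_pg")
def clipped_pg_loss(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate loss for a single scalar ratio."""
    clipped = max(min(ratio, 1.0 + clip), 1.0 - clip)
    return -min(ratio * advantage, clipped * advantage)


loss_fn = get_loss_fn("clipped_pg")
loss = loss_fn(1.5, 1.0)
```

With a registry like this, the trainer can resolve a loss from a config string without importing each implementation directly.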

Trainer Config

Metrics