Rollout & Agent Loop

Last updated: Jul 03, 2026 (API docstrings are auto-generated).

VeRL-Omni rollout is built on top of vLLM-Omni for concurrent diffusion and multimodal generation. The agent loop streams per-sample generation requests to one or more rollout replicas, optionally fanning reward computation into asynchronous reward-loop workers.

verl_omni.agent_loop.DiffusionAgentLoopWorker

Diffusion Agent loop worker takes a batch of messages and run each message in an agent loop.

verl_omni.agent_loop.DiffusionSingleTurnAgentLoop

Agent loop for diffusion model serving.

verl_omni.agent_loop.DiffusionAgentLoopOutput

Agent loop output.

verl_omni.workers.rollout.replica.DiffusionOutput

Diffusion Agent Loop

class verl_omni.agent_loop.DiffusionAgentLoopWorker(config: DictConfig, llm_client: LLMServerClient, teacher_client: dict[str, LLMServerClient] | None = None, reward_loop_worker_handles: list[ActorHandle] = None)[source]

Diffusion Agent loop worker takes a batch of messages and run each message in an agent loop.

Parameters:
  • config (DictConfig) – whole config for main entrypoint.

  • llm_client (LLMServerClient) – Client for the LLM server replicas, produced by LLMServerManager.get_client() in the trainer.

  • teacher_client (dict[str, LLMServerClient]) – Not used by diffusion training; accepted to keep the constructor signature compatible with verl’s AgentLoopManager.create(), which positionally forwards a teacher client argument to each worker.

  • reward_loop_worker_handles (List[ray.actor.ActorHandle]) – Actor handles for streaming reward computation.

__init__(config: DictConfig, llm_client: LLMServerClient, teacher_client: dict[str, LLMServerClient] | None = None, reward_loop_worker_handles: list[ActorHandle] = None)[source]
async generate_sequences(batch: DataProto) DataProto[source]

Generate sequences from agent loop.

Parameters:

batch (DataProto) – Input batch.

Returns:

Output batch with the following fields.

  • prompts: [bsz, prompt_length] prompt token ids from dataset.

  • responses: diffusion output, typically [bsz, C, H, W] (image) or [bsz, T, C, H, W] (video).

  • rm_scores (optional): [bsz, 1] reward model scores.

  • meta_info:

    • metrics: List[dict], per-sample agent loop metrics.

    • reward_extra_keys (optional): List[str], keys for reward extra info for logging/validation.

Return type:

DataProto

class verl_omni.agent_loop.DiffusionSingleTurnAgentLoop(trainer_config: DictConfigWrap, server_manager: LLMServerClient, tokenizer: AutoTokenizer, processor: AutoProcessor, dataset_cls: type[RLHFDataset], data_config: DictConfigWrap, **kwargs)[source]

Agent loop for diffusion model serving.

async run(sampling_params: dict[str, Any], **kwargs) DiffusionAgentLoopOutput[source]

Run one diffusion generation turn and package agent-loop output.

Parameters:
  • sampling_params – Generation parameters forwarded to the server manager.

  • **kwargs – Per-sample fields from the dataset, including raw_prompt and optional raw_negative_prompt.

Returns:

Prompt ids, generated diffusion output, optional logprobs, runtime metrics, and extra fields.

Return type:

DiffusionAgentLoopOutput

class verl_omni.agent_loop.DiffusionAgentLoopOutput(*, prompt_ids: list[int], response_diffusion_output: Any, response_logprobs: Any | None = None, reward_score: float | None = None, num_turns: int = 0, metrics: AgentLoopMetrics, extra_fields: dict[str, Any] = {})[source]

Agent loop output.

extra_fields: dict[str, Any]

Extra fields for dynamic addition.

metrics: AgentLoopMetrics

Auxiliary performance metrics

model_config = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

num_turns: int

Number of chat turns, including user, assistant, tool.

prompt_ids: list[int]

Prompt token ids.

response_diffusion_output: Any

image tensor (CHW) / video tensor (TCHW).

Type:

Response diffusion output (torch.Tensor)

response_logprobs: Any | None

Log probabilities for the response tokens. (torch.Tensor)

reward_score: float | None

Reward score for the trajectory.

Rollout Replica

class verl_omni.workers.rollout.replica.DiffusionOutput(*, diffusion_output: Any, log_probs: Any | None = None, stop_reason: str | None = None, num_preempted: int | None = None, extra_fields: dict[str, Any] = {})[source]
diffusion_output: Any

generated image tensor (CHW format) / video tensor (TCHW format)

extra_fields: dict[str, Any]

Extra fields for dynamic addition.

log_probs: Any | None

logprobs of generated image/video

model_config = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

num_preempted: int | None

number of preempted times for metric calculation

stop_reason: str | None

‘completed’, ‘aborted’, or None for unknown

Type:

stop reason

vLLM-Omni Async Server

The async server adapters wire the verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMHttpServer and verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMReplica classes to vLLM-Omni’s diffusion-aware backend:

  • verl_omni.workers.rollout.vllm_rollout.vllm_omni_async_server.vLLMOmniHttpServer: subclass of vLLM’s HTTP server that swaps the model config for DiffusionModelConfig, skips LLM-only validation, and exposes a PIL → tensor converter for image responses.

  • verl_omni.workers.rollout.vllm_rollout.vllm_omni_async_server.vLLMOmniReplica: Ray actor wrapper that boots a vLLM-Omni engine per replica and forwards generation / weight-update / sleep & resume RPCs from the trainer.

These classes are heavy-weight wrappers that depend on running vLLM-Omni at import time, so they are not introspected by autodoc. See verl_omni/workers/rollout/vllm_rollout/vllm_omni_async_server.py for full source.

vLLM-Omni Utilities

class verl_omni.utils.vllm_omni.utils.OmniTensorLoRARequest(lora_name: str, lora_int_id: int, lora_path: str = '', base_model_name: str | None = None, tensorizer_config_dict: dict | None = None, load_inplace: bool = False, is_3d_lora_weight: bool = False, peft_config: dict = None, lora_tensors: dict = None)[source]
class verl_omni.utils.vllm_omni.utils.VLLMOmniHijack[source]

Monkey-patches vLLM + vllm-omni internals to support in-memory LoRA tensors.

Applies verl’s base vLLM LoRA hijack (VLLMHijack.hijack()) first, then layers the vllm-omni diffusion-side patches on top, so callers only need a single VLLMOmniHijack.hijack() call.

class verl_omni.workers.rollout.vllm_rollout.utils.vLLMOmniColocateWorkerExtension(**kwargs)[source]

The class for vLLM-Omni’s worker to inherit from, in the colocate setting. By defining an extension class, the code can work no matter what is the underlying worker class. This way, the code can be compatible with both vLLM V0 and V1. NOTE: we define this class in a separate module, and the main module should pass the full qualified name as worker_extension_cls argument.

Feature support: 1. LoRA 2. NPU (Ascend) memory-pool, sleep, and wake_up — via NPUColocateWorkerMixin

update_weights_from_ipc(peft_config: dict = None, base_sync_done=False, use_shm: bool = False)[source]

Update the weights of the rollout model.

For LoRA updates, all LoRA tensors are accumulated across buckets and loaded atomically via a single add_lora call, avoiding per-bucket partial loading. For full-weight updates, weights are streamed bucket-by-bucket via load_weights to keep GPU memory usage bounded.