Rollout & Agent Loop
Last updated: Jul 03, 2026 (API docstrings are auto-generated).
VeRL-Omni rollout is built on top of vLLM-Omni for concurrent diffusion and multimodal generation. The agent loop streams per-sample generation requests to one or more rollout replicas, optionally fanning reward computation into asynchronous reward-loop workers.
Diffusion Agent loop worker takes a batch of messages and run each message in an agent loop. |
|
Agent loop for diffusion model serving. |
|
Agent loop output. |
|
Diffusion Agent Loop
- class verl_omni.agent_loop.DiffusionAgentLoopWorker(config: DictConfig, llm_client: LLMServerClient, teacher_client: dict[str, LLMServerClient] | None = None, reward_loop_worker_handles: list[ActorHandle] = None)[source]
Diffusion Agent loop worker takes a batch of messages and run each message in an agent loop.
- Parameters:
config (DictConfig) – whole config for main entrypoint.
llm_client (LLMServerClient) – Client for the LLM server replicas, produced by
LLMServerManager.get_client()in the trainer.teacher_client (dict[str, LLMServerClient]) – Not used by diffusion training; accepted to keep the constructor signature compatible with verl’s
AgentLoopManager.create(), which positionally forwards a teacher client argument to each worker.reward_loop_worker_handles (List[ray.actor.ActorHandle]) – Actor handles for streaming reward computation.
- __init__(config: DictConfig, llm_client: LLMServerClient, teacher_client: dict[str, LLMServerClient] | None = None, reward_loop_worker_handles: list[ActorHandle] = None)[source]
- async generate_sequences(batch: DataProto) DataProto[source]
Generate sequences from agent loop.
- Parameters:
batch (DataProto) – Input batch.
- Returns:
Output batch with the following fields.
prompts:[bsz, prompt_length]prompt token ids from dataset.responses: diffusion output, typically[bsz, C, H, W](image) or[bsz, T, C, H, W](video).rm_scores(optional):[bsz, 1]reward model scores.meta_info:metrics:List[dict], per-sample agent loop metrics.reward_extra_keys(optional):List[str], keys for reward extra info for logging/validation.
- Return type:
DataProto
- class verl_omni.agent_loop.DiffusionSingleTurnAgentLoop(trainer_config: DictConfigWrap, server_manager: LLMServerClient, tokenizer: AutoTokenizer, processor: AutoProcessor, dataset_cls: type[RLHFDataset], data_config: DictConfigWrap, **kwargs)[source]
Agent loop for diffusion model serving.
- async run(sampling_params: dict[str, Any], **kwargs) DiffusionAgentLoopOutput[source]
Run one diffusion generation turn and package agent-loop output.
- Parameters:
sampling_params – Generation parameters forwarded to the server manager.
**kwargs – Per-sample fields from the dataset, including
raw_promptand optionalraw_negative_prompt.
- Returns:
Prompt ids, generated diffusion output, optional logprobs, runtime metrics, and extra fields.
- Return type:
- class verl_omni.agent_loop.DiffusionAgentLoopOutput(*, prompt_ids: list[int], response_diffusion_output: Any, response_logprobs: Any | None = None, reward_score: float | None = None, num_turns: int = 0, metrics: AgentLoopMetrics, extra_fields: dict[str, Any] = {})[source]
Agent loop output.
- extra_fields: dict[str, Any]
Extra fields for dynamic addition.
- metrics: AgentLoopMetrics
Auxiliary performance metrics
- model_config = {'arbitrary_types_allowed': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- num_turns: int
Number of chat turns, including user, assistant, tool.
- prompt_ids: list[int]
Prompt token ids.
- response_diffusion_output: Any
image tensor (CHW) / video tensor (TCHW).
- Type:
Response diffusion output (torch.Tensor)
- response_logprobs: Any | None
Log probabilities for the response tokens. (torch.Tensor)
- reward_score: float | None
Reward score for the trajectory.
Rollout Replica
- class verl_omni.workers.rollout.replica.DiffusionOutput(*, diffusion_output: Any, log_probs: Any | None = None, stop_reason: str | None = None, num_preempted: int | None = None, extra_fields: dict[str, Any] = {})[source]
- diffusion_output: Any
generated image tensor (CHW format) / video tensor (TCHW format)
- extra_fields: dict[str, Any]
Extra fields for dynamic addition.
- log_probs: Any | None
logprobs of generated image/video
- model_config = {'arbitrary_types_allowed': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- num_preempted: int | None
number of preempted times for metric calculation
- stop_reason: str | None
‘completed’, ‘aborted’, or None for unknown
- Type:
stop reason
vLLM-Omni Async Server
The async server adapters wire the
verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMHttpServer and
verl.workers.rollout.vllm_rollout.vllm_async_server.vLLMReplica
classes to vLLM-Omni’s diffusion-aware backend:
verl_omni.workers.rollout.vllm_rollout.vllm_omni_async_server.vLLMOmniHttpServer: subclass of vLLM’s HTTP server that swaps the model config forDiffusionModelConfig, skips LLM-only validation, and exposes a PIL → tensor converter for image responses.verl_omni.workers.rollout.vllm_rollout.vllm_omni_async_server.vLLMOmniReplica: Ray actor wrapper that boots a vLLM-Omni engine per replica and forwards generation / weight-update / sleep & resume RPCs from the trainer.
These classes are heavy-weight wrappers that depend on running vLLM-Omni at
import time, so they are not introspected by autodoc. See
verl_omni/workers/rollout/vllm_rollout/vllm_omni_async_server.py for full
source.
vLLM-Omni Utilities
- class verl_omni.utils.vllm_omni.utils.OmniTensorLoRARequest(lora_name: str, lora_int_id: int, lora_path: str = '', base_model_name: str | None = None, tensorizer_config_dict: dict | None = None, load_inplace: bool = False, is_3d_lora_weight: bool = False, peft_config: dict = None, lora_tensors: dict = None)[source]
- class verl_omni.utils.vllm_omni.utils.VLLMOmniHijack[source]
Monkey-patches vLLM + vllm-omni internals to support in-memory LoRA tensors.
Applies verl’s base vLLM LoRA hijack (
VLLMHijack.hijack()) first, then layers the vllm-omni diffusion-side patches on top, so callers only need a singleVLLMOmniHijack.hijack()call.
- class verl_omni.workers.rollout.vllm_rollout.utils.vLLMOmniColocateWorkerExtension(**kwargs)[source]
The class for vLLM-Omni’s worker to inherit from, in the colocate setting. By defining an extension class, the code can work no matter what is the underlying worker class. This way, the code can be compatible with both vLLM V0 and V1. NOTE: we define this class in a separate module, and the main module should pass the full qualified name as worker_extension_cls argument.
Feature support: 1. LoRA 2. NPU (Ascend) memory-pool, sleep, and wake_up — via NPUColocateWorkerMixin
- update_weights_from_ipc(peft_config: dict = None, base_sync_done=False, use_shm: bool = False)[source]
Update the weights of the rollout model.
For LoRA updates, all LoRA tensors are accumulated across buckets and loaded atomically via a single
add_loracall, avoiding per-bucket partial loading. For full-weight updates, weights are streamed bucket-by-bucket viaload_weightsto keep GPU memory usage bounded.