(async_reward)= # Async Reward for Diffusion Training Last updated: 06/09/2026 Async reward lets VeRL-Omni score completed rollout samples through reward-loop workers while other samples are still being generated. It is useful when reward computation is expensive, for example, when a VLM judge, OCR model, preference model, or external HTTP scorer takes a significant fraction of the training step. ## Motivation In a standard online FlowGRPO step, training data flows through three major stages: 1. The rollout engine generates images or videos for each prompt. 2. The reward function scores each generated sample. 3. The trainer computes advantages and updates the actor. If the reward model is colocated with the actor or rollout workers, reward scoring often sits on the critical path. This is especially visible for multimodal reward models: rollout GPUs may finish some samples early, but the trainer cannot use those completed samples until reward computation finishes for the whole batch. Async reward moves reward scoring into reward-loop workers. When a rollout sample finishes, the agent loop immediately sends that sample to a reward worker. Other rollout samples continue running at the same time. With `reward.reward_model.enable_resource_pool=True`, those reward workers can also use a dedicated GPU pool, so expensive reward inference does not time-share the same GPUs as actor training and rollout generation. This reduces the end-to-end step time when reward latency is large enough to hide behind the remaining rollout work. ## What async reward means Async reward in VeRL-Omni is **sample-level streaming reward computation** within an otherwise on-policy training step. image The upper panel shows the synchronous reward case: rollout workers can continue generating later samples, but reward scoring starts only after the full rollout batch is ready. The lower panel shows async reward: each completed sample is streamed to a reward worker immediately, while rollout workers continue on later samples. Training still starts only after the full scored batch is ready, but the reward stage is partly hidden behind the remaining rollout work. The important boundary is the policy update. Async reward does **not** make the actor update proceed on partial or stale batches. The trainer still assembles the full rollout batch, extracts rewards, computes advantages, and then performs the actor update. This keeps the usual on-policy FlowGRPO semantics while reducing idle time inside the rollout/reward phase. ## Quickstart Run the async reward example: ```bash bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora_async_reward.sh ``` The example uses four GPUs for actor/rollout and one GPU for reward inference: ```bash NUM_GPUS_ACTOR_ROLLOUT=4 NUM_GPUS_REWARD=1 ROLLOUT_TP=1 REWARD_TP=1 ``` The key overrides are: ```bash reward.num_workers=$((NUM_GPUS_REWARD / REWARD_TP)) reward.reward_model.enable=True reward.reward_model.model_path=$reward_model_name reward.reward_model.rollout.name=$REWARD_ENGINE reward.reward_model.enable_resource_pool=True reward.reward_model.nnodes=1 reward.reward_model.n_gpus_per_node=$NUM_GPUS_REWARD reward.reward_model.rollout.tensor_model_parallel_size=$REWARD_TP reward.custom_reward_function.path=$reward_function_path reward.custom_reward_function.name=compute_score_ocr ``` ## Config reference The most important settings live under `reward`: | Config | Meaning | | --- | --- | | `reward.reward_model.enable=True` | Enables model-backed reward computation. | | `reward.reward_model.enable_resource_pool=True` | Allocates a separate Ray resource pool for reward-model workers. This is the setting that enables reward computation to run on dedicated GPUs. | | `reward.reward_model.n_gpus_per_node` / `reward.reward_model.nnodes` | Size of the reward-model resource pool. | | `reward.num_workers` | Number of reward-loop workers. Usually set to `NUM_GPUS_REWARD / REWARD_TP`. | | `reward.reward_model.rollout.tensor_model_parallel_size` | Tensor-parallel size for reward-model inference. Increase this when the reward model does not fit on one GPU. | | `reward.custom_reward_function.path` / `name` | Reward function used by the reward manager. It may be a normal function or an `async def` coroutine. | | `reward.reward_manager.name` / `module.path` | Optional reward manager override, for example `MultiVisualRewardManager` when combining multiple rewards. | The base reward config documents these fields in `verl_omni/trainer/config/reward/reward.yaml`. ## How it plugs in Async reward is enabled by passing reward-loop worker handles into the rollout agent loop. This happens when either there is no reward model, or when the reward model has its own resource pool: ```python enable_agent_reward_loop = ( not self.use_rm or self.config.reward.reward_model.enable_resource_pool ) reward_loop_worker_handles = ( self.reward_loop_manager.reward_loop_workers if enable_agent_reward_loop else None ) ``` The diffusion agent loop runs one async task per rollout sample. After a sample finishes generation, `_compute_score` builds a one-sample `DataProto` containing the prompt, visual response, and reward metadata, then sends it to a reward-loop worker: ```python selected_reward_loop_worker_handle = random.choice( self.reward_loop_worker_handles ) result = await selected_reward_loop_worker_handle.compute_score.remote(data) output.reward_score = result["reward_score"] output.extra_fields["reward_extra_info"] = result["reward_extra_info"] ``` When the rollout manager returns to the trainer, samples that were scored through the reward loop already contain `rm_scores`. The trainer therefore skips the colocated reward path: ```python if self.use_rm and "rm_scores" not in batch.batch.keys(): batch_reward = self._compute_reward_colocate(batch) batch = batch.union(batch_reward) ``` This is why async reward can reduce the measured `reward` section in the trainer timer: the reward work has already been streamed during generation. ## External HTTP scorers Async reward also pairs well with external HTTP scorers. The HTTP reward client (`verl_omni.utils.reward_score.http_scorer_client`) is an `async` reward function that sends generated images to a separate scorer service. Because reward workers batch samples with `asyncio.gather`, requests in the batch can hit the HTTP service concurrently rather than serially. See [HTTP Scorer](../start/http_scorer.md) for the service protocol and an end-to-end OCR reward-server example. ## References - [Flow-GRPO: Training Flow Matching Models via Online RL](https://arxiv.org/abs/2505.05470) describes the online RL algorithm used by the FlowGRPO examples. - [HybridFlow: A Flexible and Efficient RLHF Framework](https://arxiv.org/abs/2409.19256) describes the verl systems model behind flexible role placement and resource pools.