Async Reward for Diffusion Training
Last updated: 06/09/2026
Async reward lets VeRL-Omni score completed rollout samples through reward-loop workers while other samples are still being generated. It is useful when reward computation is expensive, for example, when a VLM judge, OCR model, preference model, or external HTTP scorer takes a significant fraction of the training step.
Motivation
In a standard online FlowGRPO step, training data flows through three major stages:
1. The rollout engine generates images or videos for each prompt.
2. The reward function scores each generated sample.
3. The trainer computes advantages and updates the actor.
If the reward model is colocated with the actor or rollout workers, reward scoring often sits on the critical path. This is especially visible for multimodal reward models: rollout GPUs may finish some samples early, but the trainer cannot use those completed samples until reward computation finishes for the whole batch.
Async reward moves reward scoring into reward-loop workers. When a rollout
sample finishes, the agent loop immediately sends that sample to a reward worker.
Other rollout samples continue running at the same time. With
reward.reward_model.enable_resource_pool=True, those reward workers can also
use a dedicated GPU pool, so expensive reward inference does not time-share the
same GPUs as actor training and rollout generation.
This reduces end-to-end step time when reward latency is a significant part of the step and can overlap with the remaining rollout work.
What async reward means
Async reward in VeRL-Omni is sample-level streaming reward computation within an otherwise on-policy training step.
The upper panel shows the synchronous reward case: rollout workers can continue generating later samples, but reward scoring starts only after the full rollout batch is ready. The lower panel shows async reward: each completed sample is streamed to a reward worker immediately, while rollout workers continue on later samples. Training still starts only after the full scored batch is ready, but the reward stage is partly hidden behind the remaining rollout work.
The important boundary is the policy update. Async reward does not make the actor update proceed on partial or stale batches. The trainer still assembles the full rollout batch, extracts rewards, computes advantages, and then performs the actor update. This keeps the usual on-policy FlowGRPO semantics while reducing idle time inside the rollout/reward phase.
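The effect on the rollout/reward phase can be illustrated with a toy timing model. This is a sketch, not VeRL-Omni code: it assumes each reward worker scores one sample at a time with a fixed latency, and the function names are invented for illustration.

```python
import heapq
import math


def sync_finish_time(gen_done, reward_latency, num_workers):
    """Reward scoring starts only after the slowest sample has been generated."""
    batch_ready = max(gen_done)
    # Each worker scores its share of the batch one sample at a time.
    rounds = math.ceil(len(gen_done) / num_workers)
    return batch_ready + rounds * reward_latency


def async_finish_time(gen_done, reward_latency, num_workers):
    """Each sample is scored once it is generated and a worker is free."""
    worker_free_at = [0.0] * num_workers  # min-heap of worker availability times
    finish = 0.0
    for done in sorted(gen_done):
        start = max(done, heapq.heappop(worker_free_at))
        end = start + reward_latency
        heapq.heappush(worker_free_at, end)
        finish = max(finish, end)
    return finish


gen_done = [1.0, 2.0, 3.0, 4.0]  # time at which each sample finishes generation
sync_t = sync_finish_time(gen_done, reward_latency=1.0, num_workers=1)
async_t = async_finish_time(gen_done, reward_latency=1.0, num_workers=1)
print(sync_t, async_t)
```

With these made-up numbers the synchronous phase ends at 8.0 and the streamed phase at 5.0. In both cases the actor update would begin only after that point, so the on-policy boundary is unchanged.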
Quickstart
Run the async reward example:
bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora_async_reward.sh
The example uses four GPUs for actor/rollout and one GPU for reward inference:
NUM_GPUS_ACTOR_ROLLOUT=4
NUM_GPUS_REWARD=1
ROLLOUT_TP=1
REWARD_TP=1
The key overrides are:
reward.num_workers=$((NUM_GPUS_REWARD / REWARD_TP))
reward.reward_model.enable=True
reward.reward_model.model_path=$reward_model_name
reward.reward_model.rollout.name=$REWARD_ENGINE
reward.reward_model.enable_resource_pool=True
reward.reward_model.nnodes=1
reward.reward_model.n_gpus_per_node=$NUM_GPUS_REWARD
reward.reward_model.rollout.tensor_model_parallel_size=$REWARD_TP
reward.custom_reward_function.path=$reward_function_path
reward.custom_reward_function.name=compute_score_ocr
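The relationship between the GPU counts and the overrides can be sketched in Python. The helper below is hypothetical (it is not part of VeRL-Omni) and the model path and engine name are placeholders; it only assembles the same dotted key=value strings shown above.

```python
def build_reward_overrides(num_gpus_reward, reward_tp, model_path, engine):
    """Assemble reward overrides for a dedicated reward pool (illustrative helper)."""
    if num_gpus_reward % reward_tp != 0:
        raise ValueError("reward GPU count must be divisible by the reward TP size")
    return [
        # One reward-loop worker per tensor-parallel group.
        f"reward.num_workers={num_gpus_reward // reward_tp}",
        "reward.reward_model.enable=True",
        f"reward.reward_model.model_path={model_path}",
        f"reward.reward_model.rollout.name={engine}",
        "reward.reward_model.enable_resource_pool=True",
        "reward.reward_model.nnodes=1",
        f"reward.reward_model.n_gpus_per_node={num_gpus_reward}",
        f"reward.reward_model.rollout.tensor_model_parallel_size={reward_tp}",
    ]


# Placeholder values: two reward GPUs forming a single TP=2 group.
overrides = build_reward_overrides(2, 2, "path/to/reward_model", "ENGINE_NAME")
print("\n".join(overrides))
```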
Config reference
The most important settings live under reward:
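| Config | Meaning |
|---|---|
| reward.reward_model.enable | Enables model-backed reward computation. |
| reward.reward_model.enable_resource_pool | Allocates a separate Ray resource pool for reward-model workers. This is the setting that enables reward computation to run on dedicated GPUs. |
| reward.reward_model.nnodes, reward.reward_model.n_gpus_per_node | Size of the reward-model resource pool. |
| reward.num_workers | Number of reward-loop workers. Usually set to the number of reward GPUs divided by the reward tensor-parallel size, as in the quickstart. |
| reward.reward_model.rollout.tensor_model_parallel_size | Tensor-parallel size for reward-model inference. Increase this when the reward model does not fit on one GPU. |
| reward.custom_reward_function.path, reward.custom_reward_function.name | Reward function used by the reward manager. It may be a normal function or an async function. |
| | Optional reward manager override, for example |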
The base reward config documents these fields in
verl_omni/trainer/config/reward/reward.yaml.
How it plugs in
Async reward is enabled by passing reward-loop worker handles into the rollout agent loop. This happens either when there is no reward model or when the reward model has its own resource pool:
enable_agent_reward_loop = (
    not self.use_rm or self.config.reward.reward_model.enable_resource_pool
)
reward_loop_worker_handles = (
    self.reward_loop_manager.reward_loop_workers
    if enable_agent_reward_loop
    else None
)
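As a quick check of that condition, the following sketch enumerates all four combinations using plain booleans in place of the trainer state. The function name is invented for illustration.

```python
from itertools import product


def agent_reward_loop_enabled(use_rm, enable_resource_pool):
    """Same boolean condition as above, without the trainer object."""
    return (not use_rm) or enable_resource_pool


table = {
    (use_rm, pool): agent_reward_loop_enabled(use_rm, pool)
    for use_rm, pool in product([False, True], repeat=2)
}
for (use_rm, pool), enabled in table.items():
    print(f"use_rm={use_rm!s:5} enable_resource_pool={pool!s:5} -> streamed={enabled}")
```

The only combination that does not stream rewards through the agent loop is a reward model without its own resource pool.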
The diffusion agent loop runs one async task per rollout sample. After a sample
finishes generation, _compute_score builds a one-sample DataProto containing
the prompt, visual response, and reward metadata, then sends it to a reward-loop
worker:
selected_reward_loop_worker_handle = random.choice(
    self.reward_loop_worker_handles
)
result = await selected_reward_loop_worker_handle.compute_score.remote(data)
output.reward_score = result["reward_score"]
output.extra_fields["reward_extra_info"] = result["reward_extra_info"]
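The per-sample pattern can be imitated without Ray. In the sketch below, FakeRewardWorker, run_sample, and the dict payload are stand-ins invented for illustration: the real workers are invoked through compute_score.remote and receive a DataProto, and the sleeps only simulate generation and reward latency.

```python
import asyncio
import random


class FakeRewardWorker:
    """Hypothetical stand-in for a reward-loop worker."""

    def __init__(self, name):
        self.name = name
        self.scored = []

    async def compute_score(self, data):
        await asyncio.sleep(0.01)  # simulated reward inference
        self.scored.append(data["sample_id"])
        return {
            "reward_score": float(len(data["response"])),
            "reward_extra_info": {"worker": self.name},
        }


async def run_sample(sample_id, gen_seconds, workers):
    await asyncio.sleep(gen_seconds)  # simulated generation of one sample
    data = {"sample_id": sample_id, "response": "x" * (sample_id + 1)}
    # Score this sample right away; other samples keep generating meanwhile.
    worker = random.choice(workers)
    result = await worker.compute_score(data)
    return {
        "sample_id": sample_id,
        "reward_score": result["reward_score"],
        "extra_fields": {"reward_extra_info": result["reward_extra_info"]},
    }


async def run_batch(num_samples, workers):
    # One async task per rollout sample, with staggered generation times.
    tasks = [run_sample(i, 0.01 * (i + 1), workers) for i in range(num_samples)]
    return await asyncio.gather(*tasks)


workers = [FakeRewardWorker("rw0"), FakeRewardWorker("rw1")]
outputs = asyncio.run(run_batch(4, workers))
print([o["reward_score"] for o in outputs])
```

Random selection keeps dispatch simple and needs no shared state between tasks, at the cost of occasionally uneven load across workers.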
When the rollout manager returns to the trainer, samples that were scored through
the reward loop already contain rm_scores. The trainer therefore skips the
colocated reward path:
if self.use_rm and "rm_scores" not in batch.batch.keys():
    batch_reward = self._compute_reward_colocate(batch)
    batch = batch.union(batch_reward)
This is why async reward can reduce the measured reward section in the trainer
timer: the reward work has already been streamed during generation.
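The fallback check can be exercised in isolation. This sketch uses a plain set of keys in place of the batch object, and needs_colocated_reward is an illustrative name rather than a trainer method.

```python
def needs_colocated_reward(use_rm, batch_keys):
    """True only when a reward model is in use and no streamed scores are present."""
    return use_rm and "rm_scores" not in batch_keys


streamed = {"responses", "rm_scores"}  # scored during rollout by the reward loop
unscored = {"responses"}               # reward loop was not used

print(needs_colocated_reward(True, streamed))
print(needs_colocated_reward(True, unscored))
```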
External HTTP scorers
Async reward also pairs well with external HTTP scorers. The HTTP reward client
(verl_omni.utils.reward_score.http_scorer_client) is an async reward
function that sends generated images to a separate scorer service. Because reward
workers batch samples with asyncio.gather, requests in the batch can hit the
HTTP service concurrently rather than serially.
See HTTP Scorer for the service protocol and an end-to-end OCR reward-server example.
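The concurrency benefit of asyncio.gather can be seen in a small simulation. The sketch below does not use the real HTTP client: fake_http_score is a hypothetical coroutine that sleeps instead of issuing a request, and the stats dict only counts how many simulated requests are in flight at once.

```python
import asyncio

stats = {"in_flight": 0, "max_in_flight": 0}


async def fake_http_score(image_id):
    """Hypothetical stand-in for one request to an external scorer service."""
    stats["in_flight"] += 1
    stats["max_in_flight"] = max(stats["max_in_flight"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulated network and scoring latency
    stats["in_flight"] -= 1
    return {"image_id": image_id, "score": 0.5}  # dummy score


async def score_batch(image_ids):
    # gather schedules every request before waiting on any of them,
    # so the simulated requests overlap instead of running one by one.
    return await asyncio.gather(*(fake_http_score(i) for i in image_ids))


results = asyncio.run(score_batch(range(8)))
print(len(results), stats["max_in_flight"])
```

With a real service, the achievable overlap also depends on how many requests the scorer can handle concurrently.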
References
Flow-GRPO: Training Flow Matching Models via Online RL describes the online RL algorithm used by the FlowGRPO examples.
HybridFlow: A Flexible and Efficient RLHF Framework describes the verl systems model behind flexible role placement and resource pools.