Reward Interface

Last updated: Jul 18, 2026 (API docstrings are auto-generated).

VeRL-Omni reward pipelines support both rule-based scoring (e.g. JPEG compressibility) and model-based generative reward models (e.g. OCR via a vision-language model served behind an OpenAI-compatible router). Reward computation is dispatched per sample by the VisualRewardManager, which plugs into OmniRewardLoopManager — verl’s RewardLoopManager extended with profiler control over the reward-model rollout servers.

`verl_omni.reward_loop.reward_loop.OmniRewardLoopManager`	RewardLoopManager that can start/stop the profiler on the reward-model rollout servers.
`verl_omni.reward_loop.reward_manager.VisualRewardManager`	The reward manager for visual responses.
`verl_omni.utils.reward_score.default_compute_score_image`	Compute the reward score for a visual (image) response.
`verl_omni.utils.reward_score.http_scorer_client.compute_score`	Compute reward by calling an external HTTP scorer service.
`verl_omni.utils.reward_score.unified_reward.compute_score_unified_reward`	Compute a human-preference score via UnifiedReward 2.0.

Reward Loop Manager

class verl_omni.reward_loop.reward_loop.OmniRewardLoopManager(config: DictConfig, rm_resource_pool: RayResourcePool = None)[source]

RewardLoopManager that can start/stop the profiler on the reward-model rollout servers.

The reward-model servers use the same RolloutReplica stack as the actor rollout servers, so the per-server profiler fan-out (RolloutReplica.start_profile) already exists; the upstream RewardLoopManager simply exposes no caller for it. The trainer calls start_profile and stop_profile around the phase in which the servers actually score: the generation phase when reward computation streams with rollout, or compute_rm_score in colocate mode. Profiling is configured via reward.reward_model.rollout.profiler.

start_profile(**kwargs) → None[source]: Start profiling on all reward-model rollout servers. No-op without a reward model.

stop_profile() → None[source]: Stop profiling on all reward-model rollout servers. No-op without a reward model.
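The start/stop pairing described above can be sketched with a context manager, so that stop_profile runs even if the scoring phase raises. This is a minimal illustration, not the trainer's actual code: FakeRewardLoopManager, profiled, and the tag keyword are hypothetical names introduced here; only the start_profile(**kwargs) and stop_profile() method shapes come from the reference.

```python
from contextlib import contextmanager


class FakeRewardLoopManager:
    """Hypothetical stand-in that exposes only the two documented methods."""

    def __init__(self):
        self.events = []

    def start_profile(self, **kwargs) -> None:
        self.events.append(("start", kwargs))

    def stop_profile(self) -> None:
        self.events.append(("stop", {}))


@contextmanager
def profiled(manager, **kwargs):
    """Start profiling on entry and always stop it on exit."""
    manager.start_profile(**kwargs)
    try:
        yield manager
    finally:
        manager.stop_profile()


manager = FakeRewardLoopManager()
# "tag" is an illustrative keyword, not a documented profiler option.
with profiled(manager, tag="reward_phase"):
    manager.events.append(("score", {}))  # placeholder for the scoring phase

print([name for name, _ in manager.events])  # ['start', 'score', 'stop']
```

With a real OmniRewardLoopManager, the body of the with block would presumably be whichever phase the trainer scores in.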

Reward Manager

class verl_omni.reward_loop.reward_manager.VisualRewardManager(config, tokenizer, compute_score, reward_router_address=None, reward_model_tokenizer=None)[source]

The reward manager for visual responses.

__init__(config, tokenizer, compute_score, reward_router_address=None, reward_model_tokenizer=None)[source]

Initialize reward manager.

Parameters:

config (DictConfig) – YAML config.
tokenizer (AutoTokenizer) – Tokenizer used to tokenize messages.

Default Score Dispatcher

Visual (image) reward scoring functions for VeRL-Omni.

verl_omni.utils.reward_score.default_compute_score_image(data_source, solution_image, ground_truth, extra_info=None, **kwargs)[source]

Compute the reward score for a visual (image) response.

Parameters:

data_source (str) – Dataset identifier that determines the scoring method.
solution_image – The generated image, as a torch.Tensor in shape (C, H, W) or (N, C, H, W).
ground_truth (str) – Ground-truth answer (may be unused for rule-based rewards such as jpeg_compressibility).
extra_info (dict, optional) – Additional metadata passed by the reward manager.

Returns:

The computed score (or a dict with a "score" key).

Return type:

float or dict

Raises:

NotImplementedError – If no scorer is registered for data_source.
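The dispatch-by-data_source behavior, including the NotImplementedError case, might look roughly like the registry below. This is a sketch under assumptions: SCORERS, register, dispatch_score, and the toy_brightness scorer are hypothetical and do not reflect how default_compute_score_image is implemented internally. The image is modeled as a flat list of floats so the example runs without torch.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping a data_source name to a scoring function.
SCORERS: Dict[str, Callable[..., Any]] = {}


def register(data_source: str):
    """Decorator that registers a scorer under a data_source name."""
    def decorator(fn):
        SCORERS[data_source] = fn
        return fn
    return decorator


@register("toy_brightness")
def toy_brightness_score(solution_image, ground_truth, extra_info=None):
    # Toy rule-based score: mean of pixel values given as floats in [0, 1].
    # ground_truth is unused, as with other rule-based rewards.
    return sum(solution_image) / len(solution_image)


def dispatch_score(data_source, solution_image, ground_truth, extra_info=None, **kwargs):
    """Look up the scorer for data_source; fail loudly if none is registered."""
    try:
        scorer = SCORERS[data_source]
    except KeyError:
        raise NotImplementedError(
            f"No scorer registered for data_source={data_source!r}"
        ) from None
    return scorer(solution_image, ground_truth, extra_info)


print(dispatch_score("toy_brightness", [0.0, 1.0], ground_truth=""))  # 0.5
```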

Built-in Reward Scorers

JPEG Compressibility

The reward function for JPEG compressibility. It is adapted from https://github.com/kvablack/ddpo-pytorch.

verl_omni.utils.reward_score.jpeg_compressibility.compute_score(solution_image)[source]

The scoring function for JPEG compressibility.

Parameters:

solution_image – The solution image or video, in shape (C, H, W) or (N, C, H, W).

GRM-based OCR Reward

OCR scoring backed by a generative reward model (GRM).

The compute_score_ocr() function sends a generated image to a vision-language model served behind an OpenAI-compatible router, then compares the model’s transcription against the ground truth to produce a score in [0, 1].

async verl_omni.utils.reward_score.genrm_ocr.compute_score_ocr(data_source: str, solution_image: ndarray | Tensor, ground_truth: str, extra_info: dict, reward_router_address: str, reward_model_tokenizer: PythonBackend = None, model_name: str | None = None)[source]

Compute an image OCR score via a generative reward model (GRM).

The image is sent to a GRM via an OpenAI-compatible router; the returned transcription is compared to ground_truth using Levenshtein distance to yield a score in [0, 1] (1 = perfect match).

Parameters:

data_source – Source dataset identifier. Unused, kept for interface consistency.
solution_image – The solution image or video to be evaluated.
ground_truth – The ground truth text for comparison.
extra_info – Additional information; frame_interval controls video frame subsampling.
reward_router_address – host:port of the GRM router.
reward_model_tokenizer – Tokenizer for the reward model. Unused, kept for interface consistency.
model_name – Name or path of the GRM. Defaults to DEFAULT_GRM_MODEL_PATH.

Returns:

{"score": float, "genrm_response": str}.

Return type:

dict
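The Levenshtein-based comparison can be illustrated as follows. The reference states only that the distance is mapped to [0, 1] with 1 as a perfect match; dividing by the longer string's length is an assumption made here, and levenshtein and ocr_score are hypothetical helper names, not functions in genrm_ocr.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def ocr_score(transcription: str, ground_truth: str) -> float:
    """Map edit distance to [0, 1]; normalizing by the longer length is an assumption."""
    longest = max(len(transcription), len(ground_truth))
    if longest == 0:
        return 1.0  # two empty strings match perfectly
    return 1.0 - levenshtein(transcription, ground_truth) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(ocr_score("hello", "hello"))       # 1.0
```

Because the distance never exceeds the longer string's length, this normalization keeps the score within [0, 1].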

HTTP Scorer Client

Generic HTTP reward client for external scorer services.

Sends generated images to an external HTTP scorer service as a pickle-serialized payload and returns the score. Compatible with all scorer services under rewards_services/api_services/ that accept the standard payload format:

POST with pickle-serialized {"images": List[bytes], "prompts": List[str], "metadata": dict}
Response: pickle-serialized {"scores": List[float]}
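The payload format above can be exercised without a network, as in this sketch. build_request_body, parse_response_body, and fake_scorer_service are hypothetical helpers written for illustration; the image bytes are a placeholder rather than a real encoded image, and the scoring rule inside the fake service is arbitrary.

```python
import pickle
from typing import Dict, List


def build_request_body(images: List[bytes], prompts: List[str], metadata: Dict) -> bytes:
    """Serialize the request in the documented {"images", "prompts", "metadata"} shape."""
    return pickle.dumps({"images": images, "prompts": prompts, "metadata": metadata})


def parse_response_body(body: bytes) -> List[float]:
    """Deserialize the documented {"scores": [...]} response."""
    payload = pickle.loads(body)  # only unpickle data from services you trust
    return [float(score) for score in payload["scores"]]


def fake_scorer_service(request_body: bytes) -> bytes:
    """Hypothetical in-process stand-in for a scorer endpoint."""
    request = pickle.loads(request_body)
    # Arbitrary toy rule: score 1.0 for every non-empty prompt.
    scores = [1.0 if prompt else 0.0 for prompt in request["prompts"]]
    return pickle.dumps({"scores": scores})


body = build_request_body([b"placeholder-image-bytes"], ["a red cube"], {})
scores = parse_response_body(fake_scorer_service(body))
print(scores)  # [1.0]
```

Unpickling can execute arbitrary code, so this protocol is only appropriate between mutually trusted services.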

async verl_omni.utils.reward_score.http_scorer_client.compute_score(solution_image: Tensor, ground_truth: str, server_url: str, **kwargs) → dict[source]

Compute reward by calling an external HTTP scorer service.

Parameters:

solution_image – Generated image tensor (C, H, W) or (N, C, H, W).
ground_truth – Prompt string passed directly to the scorer service.
server_url – Full URL of the scorer service (e.g., "http://localhost:19082").

Returns:

A dict with a "score" key.

UnifiedReward Scorer

Human-preference scoring backed by UnifiedReward 2.0.

async verl_omni.utils.reward_score.unified_reward.compute_score_unified_reward(data_source: str, solution_image: ndarray | Tensor, ground_truth: str, extra_info: dict, reward_router_address: str, reward_model_tokenizer: PythonBackend = None, model_name: str | None = None)[source]

Compute a human-preference score via UnifiedReward 2.0.

The reward model scores the generated image against its text caption on the Alignment, Coherence, and Style axes. The returned score is the mean of those three axis scores, normalized from the model’s 1-5 scale to [0, 1].
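One plausible reading of the normalization is a linear min-max rescale of the axis mean, sketched below. The exact formula is not stated in the reference, so the mapping (mean - 1) / 4 is an assumption, and normalize_axes is a hypothetical helper name.

```python
def normalize_axes(alignment: float, coherence: float, style: float,
                   low: float = 1.0, high: float = 5.0) -> float:
    """Average three axis scores and rescale linearly from [low, high] to [0, 1].

    The linear rescale is an assumption; the reference only says the mean is
    normalized from the 1-5 scale to [0, 1].
    """
    mean = (alignment + coherence + style) / 3.0
    return (mean - low) / (high - low)


print(normalize_axes(5, 5, 5))  # 1.0
print(normalize_axes(1, 1, 1))  # 0.0
print(normalize_axes(3, 3, 3))  # 0.5
```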

Reward Utilities

verl_omni.utils.reward_score.reward_utils.pil_image_to_base64(image: Image) → str[source]

Convert a PIL Image to a base64-encoded data URI string.

Parameters:

image – The PIL Image to convert.

Returns:

A base64-encoded PNG data URI string (e.g. data:image/png;base64,...).
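The data URI framing itself can be shown with the standard library alone. This sketch assumes the PNG bytes are already in hand (the real helper takes a PIL Image and would first encode it as PNG); png_bytes_to_data_uri is a hypothetical name, and the input here is just the 8-byte PNG signature rather than a complete image.

```python
import base64


def png_bytes_to_data_uri(png_bytes: bytes) -> str:
    """Wrap already-encoded PNG bytes in a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"


# The 8-byte PNG file signature, used as a stand-in for a full PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
uri = png_bytes_to_data_uri(PNG_SIGNATURE)
print(uri.split(",", 1)[0])  # data:image/png;base64
```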