Reward Interface
Last updated: Jun 05, 2026 (API docstrings are auto-generated).
VeRL-Omni reward pipelines support both rule-based scoring (e.g. JPEG
compressibility) and model-based generative reward models (e.g. OCR via a
vision-language model served behind an OpenAI-compatible router). Reward
computation is dispatched per sample by the
VisualRewardManager, which
plugs into the standard verl.experimental.reward_loop.RewardLoopManager.
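verl_omni.utils.reward_score.default_compute_score_image – Compute the reward score for a visual (image) response.
verl_omni.utils.reward_score.http_scorer_client.compute_score – Compute reward by calling an external HTTP scorer service.
verl_omni.utils.reward_score.unified_reward.compute_score_unified_reward – Compute a human-preference score via UnifiedReward 2.0.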
Reward Manager
Default Score Dispatcher
Visual (image) reward scoring functions for VeRL-Omni.
- verl_omni.utils.reward_score.default_compute_score_image(data_source, solution_image, ground_truth, extra_info=None, **kwargs)[source]
Compute the reward score for a visual (image) response.
- Parameters:
data_source (str) – Dataset identifier that determines the scoring method.
solution_image – The generated image, as a torch.Tensor in shape (C, H, W) or (N, C, H, W).
ground_truth (str) – Ground-truth answer (may be unused for rule-based rewards such as jpeg_compressibility).
extra_info (dict, optional) – Additional metadata passed by the reward manager.
- Returns:
The computed score (or a dict with a "score" key).
- Return type:
float or dict
- Raises:
NotImplementedError – If no scorer is registered for data_source.
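The dispatch behaviour described above (select a scorer by data_source, return a float or a dict with a "score" key, raise NotImplementedError otherwise) can be illustrated with a minimal sketch. The registry, the register_scorer decorator, the dispatch_score function, and the placeholder scorer are all hypothetical names introduced for illustration; they are not the VeRL-Omni implementation.

```python
from typing import Any, Callable, Dict, Optional, Union

# Hypothetical registry mapping data_source -> scorer (illustration only).
_SCORERS: Dict[str, Callable[..., Any]] = {}


def register_scorer(name: str) -> Callable:
    """Register a scorer function under a data_source name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        _SCORERS[name] = fn
        return fn
    return decorator


@register_scorer("jpeg_compressibility")
def _placeholder_scorer(solution_image, ground_truth, extra_info=None, **kwargs):
    # Placeholder: a real rule-based scorer would inspect solution_image.
    return 0.0


def dispatch_score(
    data_source: str,
    solution_image: Any,
    ground_truth: str,
    extra_info: Optional[dict] = None,
    **kwargs,
) -> Union[float, dict]:
    """Look up the scorer for data_source and call it."""
    scorer = _SCORERS.get(data_source)
    if scorer is None:
        raise NotImplementedError(
            f"No scorer registered for data_source={data_source!r}"
        )
    result = scorer(solution_image, ground_truth, extra_info=extra_info, **kwargs)
    # Keep dict results as-is (they must carry "score"); coerce scalars to float.
    if isinstance(result, dict):
        if "score" not in result:
            raise KeyError('dict results must contain a "score" key')
        return result
    return float(result)
```

Calling dispatch_score("jpeg_compressibility", image, "") returns the placeholder value, while an unregistered data_source raises NotImplementedError.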
Built-in Reward Scorers
JPEG Compressibility
The reward function for JPEG compressibility. It is adapted from https://github.com/kvablack/ddpo-pytorch.
GRM-based OCR Reward
OCR scoring backed by a generative reward model (GRM).
The compute_score_ocr() function sends a generated image to a
vision-language model served behind an OpenAI-compatible router, then compares
the model’s transcription with the ground truth to produce a score in [0, 1].
- async verl_omni.utils.reward_score.genrm_ocr.compute_score_ocr(data_source: str, solution_image: ndarray | Tensor, ground_truth: str, extra_info: dict, reward_router_address: str, reward_model_tokenizer: PreTrainedTokenizer = None, model_name: str | None = None)[source]
Compute an image OCR score via a generative reward model (GRM).
The image is sent to a GRM via an OpenAI-compatible router; the returned transcription is compared to ground_truth using Levenshtein distance to yield a score in [0, 1] (1 = perfect match).
- Parameters:
data_source – Source dataset identifier. Unused, kept for interface consistency.
solution_image – The solution image or video to be evaluated.
ground_truth – The ground truth text for comparison.
extra_info – Additional information; frame_interval controls video frame subsampling.
reward_router_address – host:port of the GRM router.
reward_model_tokenizer – Tokenizer for the reward model. Unused, kept for interface consistency.
model_name – Name or path of the GRM. Defaults to DEFAULT_GRM_MODEL_PATH.
- Returns:
{"score": float, "genrm_response": str}.
- Return type:
dict
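As a rough illustration of how a Levenshtein distance can be turned into a score in [0, 1], the sketch below divides the edit distance by the length of the longer string and subtracts the result from 1. This normalization is an assumption; the documentation only states that Levenshtein distance is used and that 1 means a perfect match. The helper names levenshtein and ocr_similarity are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def ocr_similarity(transcription: str, ground_truth: str) -> float:
    """Assumed normalization: 1 - distance / max(len); 1.0 is a perfect match."""
    longest = max(len(transcription), len(ground_truth))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(transcription, ground_truth) / longest
```

With this normalization, a transcription that differs from a four-character ground truth by one substitution scores 0.75.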
HTTP Scorer Client
Generic HTTP reward client for external scorer services.
Sends generated images to an external HTTP scorer service using pickle protocol and returns the score. Compatible with all scorer services under rewards_services/api_services/ that accept the standard payload format:
POST with pickle-serialized {"images": List[bytes], "prompts": List[str], "metadata": dict}
Response: pickle-serialized {"scores": List[float]}
- async verl_omni.utils.reward_score.http_scorer_client.compute_score(solution_image: Tensor, ground_truth: str, server_url: str, **kwargs) dict[source]
Compute reward by calling an external HTTP scorer service.
- Parameters:
solution_image – Generated image tensor (C, H, W) or (N, C, H, W).
ground_truth – Prompt string passed directly to the scorer service.
server_url – Full URL of the scorer service (e.g., “http://localhost:19082”).
- Returns:
dict with a "score" key.
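The pickle payload format described for the scorer services can be sketched with the standard library alone. The helper names build_payload and parse_response are hypothetical, and mapping the first entry of "scores" to the returned "score" is an assumption for the single-image case; the actual client may aggregate differently. No HTTP request is made here; the server response is simulated.

```python
import pickle
from typing import List


def build_payload(images: List[bytes], prompts: List[str], metadata: dict) -> bytes:
    """Serialize the request body in the documented payload shape."""
    return pickle.dumps(
        {"images": images, "prompts": prompts, "metadata": metadata}
    )


def parse_response(body: bytes) -> dict:
    """Decode a pickled {"scores": [...]} response into a dict with a "score" key.

    Only unpickle bytes from a scorer service you trust: pickle.loads can
    execute arbitrary code embedded in the data.
    """
    scores = pickle.loads(body)["scores"]
    # Assumption: one image was sent, so the first score is the reward.
    return {"score": float(scores[0])}


# Simulated round trip (no network): encode a request, fake a server reply.
request_body = build_payload([b"<encoded image bytes>"], ["a red cube"], {})
decoded_request = pickle.loads(request_body)
fake_response = pickle.dumps({"scores": [0.42]})
result = parse_response(fake_response)
```

Because both directions use pickle, this protocol is only appropriate between processes that trust each other.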
UnifiedReward Scorer
Human-preference scoring backed by UnifiedReward 2.0.
- async verl_omni.utils.reward_score.unified_reward.compute_score_unified_reward(data_source: str, solution_image: ndarray | Tensor, ground_truth: str, extra_info: dict, reward_router_address: str, reward_model_tokenizer: PreTrainedTokenizer = None, model_name: str | None = None)[source]
Compute a human-preference score via UnifiedReward 2.0.
The reward model scores the generated image against its text caption on Alignment, Coherence, and Style axes. The returned score is the mean of those axes normalized from the model’s 1-5 scale to [0, 1].
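The normalization described above can be sketched as follows, assuming a linear min-max mapping in which 1 maps to 0 and 5 maps to 1. The documentation does not state the exact formula, so treat this mapping as an assumption. The function name normalize_axes is hypothetical.

```python
def normalize_axes(
    alignment: float,
    coherence: float,
    style: float,
    low: float = 1.0,
    high: float = 5.0,
) -> float:
    """Average the three axis scores, then rescale from [low, high] to [0, 1]."""
    mean = (alignment + coherence + style) / 3.0
    # Assumed linear mapping: low -> 0.0, high -> 1.0.
    return (mean - low) / (high - low)
```

Under this assumption, axis scores of 3, 3, and 3 give 0.5, and because the mapping is linear, normalizing each axis first and then averaging gives the same result.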