Profiling FlowGRPO / diffusion training in VeRL-Omni

Last updated: 05/11/2026.

VeRL-Omni reuses the profiler subsystem from upstream verl (verl.utils.profiler) and exposes the same configuration surface for the diffusion trainer. Three profiling tools are supported:

Tool          Backend                            Use case
nsys          NVIDIA Nsight Systems              End-to-end CUDA / kernel timeline tracing
torch         torch.profiler                     PyTorch-level CPU / CUDA / op profiling
torch_memory  torch.cuda.memory._dump_snapshot   CUDA memory allocation snapshots

supported by the diffusion trainer at this time.

Configuration overview

Profiling is controlled by two layers of configuration that mirror upstream verl conventions:

  1. Global profiler config under global_profiler in diffusion_trainer.yaml. Selects the tool, the steps to profile, the output directory, and global tool-specific options (e.g. nsys controller / worker options).

  2. Per-role profiler config under actor_rollout_ref.{actor,ref,rollout}.profiler. Inherits defaults from profiler/profiler.yaml and selects which ranks to profile and the role-local tool config.

A typical training step automatically calls start_profile before the step begins and stop_profile after validation. The profiler is therefore engaged for every step whose number appears in global_profiler.steps, and stays idle for all other steps.
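The gating described above can be sketched in a few lines of plain Python. This is only an illustration of the ordering of calls, not the trainer's actual code: the function name run_steps and the event tuples are hypothetical, and only start_profile, stop_profile and the steps list come from the text.

```python
def run_steps(total_steps, profile_steps):
    """Simulate a training loop and record where profiling would start and stop.

    `profile_steps` plays the role of global_profiler.steps (None disables).
    The event names are illustrative, not the trainer's real call sites.
    """
    selected = set(profile_steps or [])
    events = []
    for step in range(1, total_steps + 1):
        engaged = step in selected
        if engaged:
            events.append(("start_profile", step))  # before the step begins
        events.append(("train", step))
        events.append(("validate", step))
        if engaged:
            events.append(("stop_profile", step))   # after validation
    return events


if __name__ == "__main__":
    for name, step in run_steps(total_steps=3, profile_steps=[2]):
        print(step, name)
```

With steps set to null (None here) the loop never emits a start_profile or stop_profile event, which matches the "null disables" behaviour of the global config.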

Global profiler fields

global_profiler:
  _target_: verl.utils.profiler.ProfilerConfig
  tool: null                     # one of: nsys, torch, torch_memory (null disables)
  steps: null                    # e.g. [1, 2, 5]
  profile_continuous_steps: False
  save_path: outputs/profile
  global_tool_config:
    nsys: { ... }                # see below
    torch_memory: { ... }
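To make the meaning of these fields concrete, here is a minimal sketch of a validation helper that mirrors the YAML above. GlobalProfilerSettings is a hypothetical stand-in written for this example; it is not verl.utils.profiler.ProfilerConfig, and the rule that profiling is active only when both tool and steps are set is an assumption drawn from the comments in the YAML.

```python
from dataclasses import dataclass
from typing import List, Optional

# Tool names taken from the table of supported tools.
VALID_TOOLS = {"nsys", "torch", "torch_memory"}


@dataclass
class GlobalProfilerSettings:
    """Hypothetical mirror of the global_profiler block, for illustration only."""

    tool: Optional[str] = None
    steps: Optional[List[int]] = None
    profile_continuous_steps: bool = False
    save_path: str = "outputs/profile"

    def __post_init__(self):
        if self.tool is not None and self.tool not in VALID_TOOLS:
            raise ValueError(f"unknown profiler tool: {self.tool!r}")
        if self.steps is not None and any(s < 0 for s in self.steps):
            raise ValueError("profile steps must be non-negative")

    @property
    def enabled(self) -> bool:
        # Assumption: profiling needs both a tool and a non-empty steps list.
        return self.tool is not None and bool(self.steps)


if __name__ == "__main__":
    print(GlobalProfilerSettings().enabled)
    print(GlobalProfilerSettings(tool="torch", steps=[1, 2, 5]).enabled)
```

A check like this can catch a misspelled tool name before a long run is launched.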

Per-role profiler fields

actor_rollout_ref:
  actor:
    profiler:
      tool: torch                # nsys, torch, torch_memory
      enable: False
      all_ranks: False
      ranks: []
      tool_config:
        nsys: { discrete: ... }
        torch:
          contents: []           # cuda, cpu, memory, shapes, stack
          discrete: False
        torch_memory:
          trace_alloc_max_entries: 100000
          stack_depth: 32

The same block exists under actor_rollout_ref.ref.profiler and actor_rollout_ref.rollout.profiler. The rollout profiler is only active when the rollout role is colocated with the actor (the hybrid engine setup used by FlowGRPO today).
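The interaction of enable, all_ranks and ranks can be summarised with a small predicate. This is a sketch under the assumption that all_ranks takes precedence over the explicit ranks list and that enable=False switches the role off entirely; should_profile_rank is a hypothetical helper, not a verl function.

```python
def should_profile_rank(rank, enable=False, all_ranks=False, ranks=()):
    """Decide whether a worker rank is profiled under the per-role config.

    Assumed semantics: nothing is profiled unless `enable` is set; with
    `all_ranks` every rank is selected, otherwise only those in `ranks`.
    """
    if not enable:
        return False
    return all_ranks or rank in ranks


if __name__ == "__main__":
    # With the defaults shown above (enable: False) no rank is profiled.
    print([r for r in range(4) if should_profile_rank(r)])
    # Profile only rank 0, as in the discrete recipe.
    print([r for r in range(4) if should_profile_rank(r, enable=True, ranks=[0])])
```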

Quick recipes

The following recipes add CLI overrides on top of examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh.

1. PyTorch profiler — end-to-end

Capture a single trace per profiled step (combined CPU + CUDA activities).

+global_profiler.tool=torch \
+global_profiler.steps=[1,2,5] \
+global_profiler.save_path=./outputs/profile \
+actor_rollout_ref.actor.profiler.enable=True \
+actor_rollout_ref.actor.profiler.all_ranks=True \
+actor_rollout_ref.actor.profiler.tool=torch \
+actor_rollout_ref.actor.profiler.tool_config.torch.contents=[cpu,cuda] \
+actor_rollout_ref.actor.profiler.tool_config.torch.discrete=False

The traces land under outputs/profile. View them in Perfetto UI or chrome://tracing.
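The override lines in these recipes follow a regular pattern, so they can be generated from a nested dictionary when several profiling variants need to be scripted. The helper below is a minimal sketch: to_overrides is a hypothetical name, and it simply reproduces the +key=value formatting used in this document rather than calling any Hydra or verl API.

```python
def to_overrides(config, prefix=""):
    """Flatten a nested dict into '+dotted.key=value' override strings."""
    overrides = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_overrides(value, path))
        elif isinstance(value, (list, tuple)):
            # Lists are written without spaces, e.g. [1,2,5].
            items = ",".join(str(v) for v in value)
            overrides.append(f"+{path}=[{items}]")
        else:
            overrides.append(f"+{path}={value}")
    return overrides


if __name__ == "__main__":
    recipe = {
        "global_profiler": {"tool": "torch", "steps": [1, 2, 5]},
        "actor_rollout_ref": {
            "actor": {"profiler": {"enable": True, "all_ranks": True}}
        },
    }
    print(" \\\n".join(to_overrides(recipe)))
```

When pasting the generated lines into a shell script, quoting the list-valued overrides avoids accidental glob expansion of the square brackets in some shells.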

2. PyTorch profiler — discrete (per-stage)

Discrete mode produces one trace per @DistProfiler.annotate-decorated function within a step, instead of a single trace for the whole step. This is useful when zooming into a specific phase.

+global_profiler.tool=torch \
+global_profiler.steps=[3] \
+actor_rollout_ref.actor.profiler.enable=True \
+actor_rollout_ref.actor.profiler.ranks=[0] \
+actor_rollout_ref.actor.profiler.tool=torch \
+actor_rollout_ref.actor.profiler.tool_config.torch.discrete=True \
+actor_rollout_ref.actor.profiler.tool_config.torch.contents=[cpu,cuda]

3. CUDA memory snapshots (torch_memory)

The torch_memory tool records allocation history and dumps a snapshot at the end of each profiled step. Visualize the resulting snapshot files at pytorch.org/memory_viz (torch.cuda.memory._dump_snapshot writes pickled snapshots, which the viewer loads directly).

+global_profiler.tool=torch_memory \
+global_profiler.steps=[1,2] \
+actor_rollout_ref.actor.profiler.enable=True \
+actor_rollout_ref.actor.profiler.all_ranks=True \
+actor_rollout_ref.actor.profiler.tool=torch_memory

4. NVIDIA Nsight Systems (nsys)

Nsight Systems profiling requires nsys to be installed on every node and the nvtx Python package to be available in the training environment (pip install nvtx).

+global_profiler.tool=nsys \
+global_profiler.steps=[1,2] \
+global_profiler.profile_continuous_steps=True \
+actor_rollout_ref.actor.profiler.enable=True \
+actor_rollout_ref.actor.profiler.all_ranks=True \
+actor_rollout_ref.actor.profiler.tool=nsys

When global_profiler.tool=nsys and steps is non-empty, the FlowGRPO entrypoint launches the Ray TaskRunner under nsys using the controller_nsight_options from global_profiler.global_tool_config.nsys. Workers are launched with worker_nsight_options, including the required capture-range: cudaProfilerApi flag.

*.nsys-rep files are written by Ray under /tmp/ray/session_latest/logs/nsight/ on each node (this path is fixed by Ray). Open them with nsys-ui.
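Because the reports are scattered across nodes, a small script run on each node can help collect them. The following is a sketch using only the standard library; find_nsys_reports is a hypothetical helper, and the default directory is the Ray path quoted above.

```python
from pathlib import Path


def find_nsys_reports(log_dir="/tmp/ray/session_latest/logs/nsight"):
    """Return all *.nsys-rep files under `log_dir`, newest first.

    Returns an empty list if the directory does not exist, for example on a
    node where no profiled worker was scheduled.
    """
    root = Path(log_dir)
    if not root.is_dir():
        return []
    reports = root.rglob("*.nsys-rep")
    return sorted(reports, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for report in find_nsys_reports():
        size_mb = report.stat().st_size / 1e6
        print(f"{report}  ({size_mb:.1f} MB)")
```

The listed files can then be copied to a workstation and opened with nsys-ui.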

Implementation notes

  • Workers are wrapped with verl.utils.profiler.DistProfilerExtension, which exposes start_profile/stop_profile Ray methods. The diffusion trainer invokes them around each profiled step, mirroring verl/trainer/ppo/ray_trainer.py.

  • global_profiler.profile_continuous_steps=True keeps a single profiling database open across consecutive steps in global_profiler.steps, which is helpful for analysing inter-step behaviour.

  • In hybrid-engine FlowGRPO (the default), the rollout shares the actor worker, so configuring actor_rollout_ref.actor.profiler is usually enough to capture the full step.
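To see what profile_continuous_steps means for a given steps list, it helps to group the list into runs of consecutive steps. Under the assumption that each run is covered by one profiling session when the option is on, the sketch below shows how many sessions to expect. consecutive_runs is a hypothetical helper written for this example.

```python
def consecutive_runs(steps):
    """Group step numbers into (first, last) runs of consecutive steps."""
    runs = []
    for step in sorted(set(steps)):
        if runs and step == runs[-1][1] + 1:
            # Extend the current run by one step.
            runs[-1] = (runs[-1][0], step)
        else:
            # Start a new run at this step.
            runs.append((step, step))
    return runs


if __name__ == "__main__":
    # Steps 1 and 2 are adjacent and share a session; step 5 gets its own.
    print(consecutive_runs([1, 2, 5]))  # [(1, 2), (5, 5)]
```

With profile_continuous_steps=False, each listed step would instead be started and stopped on its own, giving one session per step.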

Further reading