# Profiling FlowGRPO / diffusion training in VeRL-Omni Last updated: 05/11/2026. VeRL-Omni reuses the profiler subsystem from upstream [verl](https://github.com/verl-project/verl) (`verl.utils.profiler`) and exposes the same configuration surface for the diffusion trainer. Three profiling tools are supported: | Tool | Backend | Use case | |---------------|--------------------------------------------|-------------------------------------------| | `nsys` | NVIDIA Nsight Systems | End-to-end CUDA / kernel timeline tracing | | `torch` | `torch.profiler` | PyTorch-level CPU / CUDA / op profiling | | `torch_memory`| `torch.cuda.memory._dump_snapshot` | CUDA memory allocation snapshots | > supported by the diffusion trainer at this time. ## Configuration overview Profiling is controlled by two layers of configuration that mirror upstream verl conventions: 1. **Global** profiler config under `global_profiler` in [`diffusion_trainer.yaml`](https://github.com/verl-project/verl-omni/blob/main/verl_omni/trainer/config/diffusion_trainer.yaml). Selects the tool, the steps to profile, the output directory, and global tool-specific options (e.g. nsys controller / worker options). 2. **Per-role** profiler config under `actor_rollout_ref.{actor,ref,rollout}.profiler`. Inherits defaults from [`profiler/profiler.yaml`](https://github.com/verl-project/verl-omni/blob/main/verl_omni/trainer/config/profiler/profiler.yaml) and selects which ranks to profile and the role-local tool config. A typical training step automatically calls `start_profile` before the step begins and `stop_profile` after validation, so as long as the global `steps` list contains the current step the profiler is engaged. ### Global profiler fields ```yaml global_profiler: _target_: verl.utils.profiler.ProfilerConfig tool: null # one of: nsys, torch, torch_memory (null disables) steps: null # e.g. [1, 2, 5] profile_continuous_steps: False save_path: outputs/profile global_tool_config: nsys: { ... } # see below torch_memory: { ... } ``` ### Per-role profiler fields ```yaml actor_rollout_ref: actor: profiler: tool: torch # nsys, torch, torch_memory enable: False all_ranks: False ranks: [] tool_config: nsys: { discrete: ... } torch: contents: [] # cuda, cpu, memory, shapes, stack discrete: False torch_memory: trace_alloc_max_entries: 100000 stack_depth: 32 ``` The same block exists under `actor_rollout_ref.ref.profiler` and `actor_rollout_ref.rollout.profiler`. The rollout profiler is only active when the rollout role is colocated with the actor (the hybrid engine setup used by FlowGRPO today). ## Quick recipes The following recipes add CLI overrides on top of `examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh`. ### 1. PyTorch profiler — end-to-end Capture a single trace per profiled step (combined CPU + CUDA activities). ```bash +global_profiler.tool=torch \ +global_profiler.steps=[1,2,5] \ +global_profiler.save_path=./outputs/profile \ +actor_rollout_ref.actor.profiler.enable=True \ +actor_rollout_ref.actor.profiler.all_ranks=True \ +actor_rollout_ref.actor.profiler.tool=torch \ +actor_rollout_ref.actor.profiler.tool_config.torch.contents=[cpu,cuda] \ +actor_rollout_ref.actor.profiler.tool_config.torch.discrete=False ``` The traces land under `outputs/profile`. View them in [Perfetto UI](https://ui.perfetto.dev/) or `chrome://tracing`. ### 2. PyTorch profiler — discrete (per-stage) Discrete mode produces one database per `@DistProfiler.annotate`-decorated function within a step, which is useful when zooming into a specific phase. ```bash +global_profiler.tool=torch \ +global_profiler.steps=[3] \ +actor_rollout_ref.actor.profiler.enable=True \ +actor_rollout_ref.actor.profiler.ranks=[0] \ +actor_rollout_ref.actor.profiler.tool=torch \ +actor_rollout_ref.actor.profiler.tool_config.torch.discrete=True \ +actor_rollout_ref.actor.profiler.tool_config.torch.contents=[cpu,cuda] ``` ### 3. CUDA memory snapshots (`torch_memory`) The `torch_memory` tool records allocation history and dumps a snapshot at the end of each profiled step. Visualize the resulting JSON files at [pytorch.org/memory_viz](https://pytorch.org/memory_viz). ```bash +global_profiler.tool=torch_memory \ +global_profiler.steps=[1,2] \ +actor_rollout_ref.actor.profiler.enable=True \ +actor_rollout_ref.actor.profiler.all_ranks=True \ +actor_rollout_ref.actor.profiler.tool=torch_memory ``` ### 4. NVIDIA Nsight Systems (`nsys`) Nsight requires `nsys` to be installed on every node and the `nvtx` Python package available in the training environment (`pip install nvtx`). ```bash +global_profiler.tool=nsys \ +global_profiler.steps=[1,2] \ +global_profiler.profile_continuous_steps=True \ +actor_rollout_ref.actor.profiler.enable=True \ +actor_rollout_ref.actor.profiler.all_ranks=True \ +actor_rollout_ref.actor.profiler.tool=nsys ``` When `global_profiler.tool=nsys` and `steps` is non-empty, the FlowGRPO entrypoint launches the Ray TaskRunner under `nsys` using the `controller_nsight_options` from `global_profiler.global_tool_config.nsys`. Workers are launched with `worker_nsight_options`, including the required `capture-range: cudaProfilerApi` flag. `*.nsys-rep` files are written by Ray under `/tmp/ray/session_latest/logs/nsight/` on each node (this path is fixed by Ray). Open them with `nsys-ui`. ## Implementation notes * Workers are wrapped with `verl.utils.profiler.DistProfilerExtension`, which exposes `start_profile`/`stop_profile` Ray methods. The diffusion trainer invokes them around each profiled step, mirroring [`verl/trainer/ppo/ray_trainer.py`](https://github.com/verl-project/verl/blob/main/verl/trainer/ppo/ray_trainer.py). * `global_profiler.profile_continuous_steps=True` keeps a single profiling database open across consecutive steps in `global_profiler.steps`, which is helpful for analysing inter-step behaviour. * In hybrid-engine FlowGRPO (the default), the rollout shares the actor worker, so configuring `actor_rollout_ref.actor.profiler` is usually enough to capture the full step. ## Further reading * Upstream PyTorch profiler guide: [`docs/perf/torch_profiling.md` in verl](https://github.com/verl-project/verl/blob/main/docs/perf/torch_profiling.md) * Upstream Nsight guide: [`docs/perf/nsight_profiling.md` in verl](https://github.com/verl-project/verl/blob/main/docs/perf/nsight_profiling.md)