Installation

Last updated: 06/22/2026

Requirements

For NVIDIA GPU:

Python: Version >= 3.10
CUDA: Version >= 12.8

For Ascend NPU:

Python: Version >= 3.10
CANN: Version >= 8.5.0

Install

git clone https://github.com/verl-project/verl-omni.git
cd verl-omni

Create a Python virtual environment:

uv venv --python 3.12 --seed
source .venv/bin/activate

Install the platform backend.

For NVIDIA GPU:

uv pip install -e ".[gpu]" --torch-backend=auto

It will install vllm for the CUDA PyTorch stack and kernels for the actor FA3 backend.

For Ascend NPU:

uv pip install vllm==0.22.0
uv pip install "vllm-ascend @ git+https://github.com/vllm-project/vllm-ascend.git@bb4d0776eee8fc45c3484a45c971a7049f1a2bbf"

Install VeRL-Omni:

uv pip install -e ".[vllm-omni,train]"

It will install vllm-omni, verl, and verl-omni.

Extras

Extra	Adds	When
`gpu`	`vllm==0.22.0`, `kernels==0.14.1`, `liger-kernel`	CUDA rollout + actor FA3
`vllm-omni`	`vllm-omni==0.22.0`	vLLM-Omni rollout
`train`	`verl==0.8.0`	RL training
`dev`	`pytest`, `pre-commit`, `Levenshtein`, …	Local development / CI
`ocr`	`Levenshtein`	OCR reward (FlowGRPO)

Optional Dependencies

Extra	Install	When needed
OCR reward	`uv pip install -e ".[ocr]"`	FlowGRPO training with OCR-based reward
Dev tools	`uv pip install -e ".[dev]"`	Linting and unit tests
VeOmni engine backend	See Optional engine backends	VeOmni instead of default FSDP2

Flash Attention 3

The gpu extra pulls kernels==0.14.1 for the Diffusers actor FA3 backend. Rollout FA3 comes from vllm-omni (fa3-fwd), not from kernels.

If FA3 deps are missing at runtime, training falls back to native/SDPA automatically. NPU recipes override with actor_rollout_ref.model.attn_backend=_native_npu.

Optional engine backends

VeRL-Omni defaults to FSDP2 as the training engine for the policy and reference models. The diffusion trainer can alternatively be switched to VeOmni. The engine is selected at the Hydra command line — see examples/flowgrpo_trainer/run_qwen_image_ocr_veomni.sh for a complete recipe.

Installing VeOmni alongside vLLM 0.22.0

VeOmni 0.1.11’s gpu extra pins torch==2.9.1+cu129, which may conflict with the torch version pulled in by vllm==0.22.0. A plain uv pip install veomni[gpu,dit]==0.1.11 therefore fails dependency resolution.

VeOmni itself runs correctly on torch 2.11 — only the [gpu] extra’s pin is too strict. Install it without dependency resolution so the existing torch/vllm stack is preserved, and add the small set of runtime extras that the verl-omni VeOmni engine actually needs:

uv pip install veomni==0.1.11 --no-deps
uv pip install torchcodec librosa soundfile av

Verify the engine is importable:

python -c "import veomni; print('veomni', veomni.__version__)"
python -c "from veomni.distributed.offloading import load_model_to_gpu, load_optimizer, offload_model_to_cpu, offload_optimizer; print('VeOmni offloading helpers OK')"

If you want VeOmni’s full [gpu,dit] extras (flash-attn variants, liger-kernel, cuda-python, etc.), install them in a separate environment not pinned to vllm 0.22.0; verl-omni does not need them.

Post-Installation Verification

For NVIDIA GPU:

python -c "import torch; print('torch', torch.__version__, '| CUDA', torch.version.cuda)"
python -c "import vllm; print('vllm', vllm.__version__)"
python -c "import vllm_omni; print('vllm-omni OK')"
python -c "import verl; print('verl', verl.__version__)"
python -c "import verl_omni; print('VeRL-Omni ready')"

For Ascend NPU:

python -c "import torch; import torch_npu; print('torch', torch.__version__, '| NPU', torch.npu.is_available())"
python -c "import vllm; print('vllm', vllm.__version__)"
python -c "import verl; print('verl', verl.__version__)"
python -c "import verl_omni; print('VeRL-Omni ready')"

Build Your Own Docker Image

The repository has a CUDA Dockerfile at docker/Dockerfile.cuda. The default base image uses CUDA 13.0.2 on Ubuntu 22.04 (override with --build-arg CUDA_VERSION=… if needed). Build context is controlled by the repo-root .dockerignore; keep large local folders such as .venv, data/, and checkpoints/ out of the context.

Prerequisites

Docker with NVIDIA Container Toolkit

Build commands

From the repository root:

# Standard GPU training image (runtime target)
docker build -f docker/Dockerfile.cuda -t verl-omni:gpu .

# OCR reward (adds the `ocr` extra / Levenshtein)
docker build -f docker/Dockerfile.cuda --target ocr -t verl-omni:gpu-ocr .

# Local development tools (adds the `dev` extra)
docker build -f docker/Dockerfile.cuda --target dev -t verl-omni:gpu-dev .

The image bakes in verl_omni and its Python dependencies. Recipe scripts under examples/ are not copied into the image — mount the repository at runtime (see below).

Launch with interactive session for development

Start an interactive shell with GPU access, shared memory for Ray/vLLM, and common host directories mounted:

export REPO=/path/to/verl-omni          # this repository
export WORKSPACE=$HOME                  # data, checkpoints, HF cache root

docker run --gpus all --shm-size=16g -it --rm \
  --name verl-omni-ocr \
  -v "$REPO:/workspace/verl-omni" \
  -v "$WORKSPACE/data:$WORKSPACE/data" \
  -v "$WORKSPACE/checkpoints:$WORKSPACE/checkpoints" \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e WORKSPACE="$WORKSPACE" \
  -e HF_HOME=/root/.cache/huggingface \
  -e WANDB_API_KEY="${WANDB_API_KEY:-}" \
  -w /workspace/verl-omni \
  verl-omni:gpu-ocr \
  /bin/bash

Inside the container, confirm the installation (same checks as Post-Installation Verification).

Notes:

--shm-size=16g — Ray and vLLM use shared memory; larger shared memory is needed training.
Mount the repo — training recipes live in examples/; mounting $REPO lets you edit scripts locally and run them immediately in the container.
WORKSPACE — example scripts read datasets and write checkpoints under this path (default: $HOME inside the container, i.e. /root unless overridden).
Hugging Face cache — mounting ~/.cache/huggingface avoids re-downloading Qwen/Qwen-Image and reward models on every run.

Example: Qwen-Image FlowGRPO training in Docker

This walkthrough follows the FlowGRPO quickstart using the OCR dataset and examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh. Use the ocr image target (verl-omni:gpu-ocr) so the Levenshtein dependency is present.

1. Launch the interactive container (command above).

2. Prepare the OCR dataset inside the container:

export WORKSPACE=${WORKSPACE:-$HOME}
mkdir -p $WORKSPACE/data/ocr

# Obtain raw train.txt / test.txt from the Flow-GRPO repo:
# https://github.com/yifan123/flow_grpo/tree/main/dataset/ocr
# Place them under $WORKSPACE/data/ocr/, then preprocess:

python3 examples/flowgrpo_trainer/data_process/qwenimage_ocr.py \
  --input_dir $WORKSPACE/data/ocr \
  --output_dir $WORKSPACE/data/ocr/qwen_image

3. (Optional) Set W&B credentials:

export WANDB_API_KEY=<your_wandb_api_key>

4. Run FlowGRPO training (4 GPUs by default in the script):

bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh

The script launches python3 -m verl_omni.trainer.main_diffusion with FlowGRPO + vllm_omni rollout and OCR reward (compute_score_ocr). Checkpoints are written to:

checkpoints/flow_grpo/qwen_image_ocr_lora