Installation
Last updated: 06/22/2026
Requirements
For NVIDIA GPU:
Python: Version >= 3.10
CUDA: Version >= 12.8
For Ascend NPU:
Python: Version >= 3.10
CANN: Version >= 8.5.0
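The minimum versions above can be checked programmatically before installing anything. The following is a minimal sketch, not part of VeRL-Omni: `parse_version` and `at_least` are hypothetical helper names, and the comparison is numeric so that 3.10 correctly ranks above 3.9.

```python
import re
import sys


def parse_version(text):
    """Turn '12.8' or '8.5.0' into a tuple of ints.

    Parsing stops at the first dotted part that does not start with a digit.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def at_least(found, minimum):
    """True if `found` is the same as or newer than `minimum` (missing parts count as 0)."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    python_version = "{}.{}".format(*sys.version_info[:2])
    status = "OK" if at_least(python_version, "3.10") else "too old"
    print("Python", python_version, status)
```

The same helper can be applied to a CUDA or CANN version string that you obtain yourself, for example `at_least("12.8", "12.8")`.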
Install
git clone https://github.com/verl-project/verl-omni.git
cd verl-omni
Create a Python virtual environment:
uv venv --python 3.12 --seed
source .venv/bin/activate
Install the platform backend.
For NVIDIA GPU:
uv pip install -e ".[gpu]" --torch-backend=auto
This installs vllm for the CUDA PyTorch stack and kernels for the actor FA3 backend.
For Ascend NPU:
uv pip install vllm==0.22.0
uv pip install "vllm-ascend @ git+https://github.com/vllm-project/vllm-ascend.git@bb4d0776eee8fc45c3484a45c971a7049f1a2bbf"
Install VeRL-Omni:
uv pip install -e ".[vllm-omni,train]"
This installs vllm-omni, verl, and verl-omni.
Extras
| Extra | Adds | When |
|---|---|---|
| gpu | | CUDA rollout + actor FA3 |
| vllm-omni | | vLLM-Omni rollout |
| train | | RL training |
| dev | | Local development / CI |
| ocr | | OCR reward (FlowGRPO) |
Optional Dependencies
| Extra | Install | When needed |
|---|---|---|
| OCR reward | | FlowGRPO training with OCR-based reward |
| Dev tools | | Linting and unit tests |
| VeOmni engine backend | | VeOmni instead of default FSDP2 |
Flash Attention 3
The gpu extra pulls kernels==0.14.1 for the Diffusers actor FA3 backend. Rollout FA3 comes from vllm-omni (fa3-fwd), not from kernels.
If FA3 deps are missing at runtime, training falls back to native/SDPA automatically. NPU recipes override with actor_rollout_ref.model.attn_backend=_native_npu.
Optional engine backends
VeRL-Omni defaults to FSDP2 as the training engine for the policy and reference models. The diffusion trainer can alternatively be switched to VeOmni. The engine is selected at the Hydra command line — see examples/flowgrpo_trainer/run_qwen_image_ocr_veomni.sh for a complete recipe.
Installing VeOmni alongside vLLM 0.22.0
VeOmni 0.1.11’s gpu extra pins torch==2.9.1+cu129, which conflicts with the torch version pulled in by vllm==0.22.0. A plain uv pip install veomni[gpu,dit]==0.1.11 therefore fails dependency resolution.
VeOmni itself runs correctly on torch 2.11 — only the [gpu] extra’s pin is too strict. Install it without dependency resolution so the existing torch/vllm stack is preserved, and add the small set of runtime extras that the verl-omni VeOmni engine actually needs:
uv pip install veomni==0.1.11 --no-deps
uv pip install torchcodec librosa soundfile av
Verify the engine is importable:
python -c "import veomni; print('veomni', veomni.__version__)"
python -c "from veomni.distributed.offloading import load_model_to_gpu, load_optimizer, offload_model_to_cpu, offload_optimizer; print('VeOmni offloading helpers OK')"
If you want VeOmni’s full [gpu,dit] extras (flash-attn variants, liger-kernel, cuda-python, etc.), install them in a separate environment not pinned to vllm 0.22.0; verl-omni does not need them.
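To see whether the torch pin discussed in this section affects a given environment, the installed torch version can be compared against the pinned one. This is a rough sketch under stated assumptions: the helper names are hypothetical, and it compares only the public release number (ignoring the `+cu129` build tag), which is a simplification of what a real dependency resolver does.

```python
from importlib import metadata

VEOMNI_TORCH_PIN = "2.9.1+cu129"  # pin quoted in the text above


def public_release(version):
    """Drop the local build tag, e.g. '2.9.1+cu129' -> '2.9.1'."""
    return version.split("+", 1)[0]


def satisfies_pin(installed, pinned=VEOMNI_TORCH_PIN):
    """True if the installed release number equals the pinned release number."""
    return public_release(installed) == public_release(pinned)


def installed_torch_version():
    """Return the installed torch version string, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_torch_version()
    if found is None:
        print("torch is not installed in this environment")
    elif satisfies_pin(found):
        print(f"torch {found} matches the release pinned by VeOmni's gpu extra")
    else:
        print(f"torch {found} differs from the pin {VEOMNI_TORCH_PIN}; "
              "install veomni with --no-deps as described above")
```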
Post-Installation Verification
For NVIDIA GPU:
python -c "import torch; print('torch', torch.__version__, '| CUDA', torch.version.cuda)"
python -c "import vllm; print('vllm', vllm.__version__)"
python -c "import vllm_omni; print('vllm-omni OK')"
python -c "import verl; print('verl', verl.__version__)"
python -c "import verl_omni; print('VeRL-Omni ready')"
For Ascend NPU:
python -c "import torch; import torch_npu; print('torch', torch.__version__, '| NPU', torch.npu.is_available())"
python -c "import vllm; print('vllm', vllm.__version__)"
python -c "import verl; print('verl', verl.__version__)"
python -c "import verl_omni; print('VeRL-Omni ready')"
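The one-liners above can also be folded into a single script that keeps going when one import fails, which makes it easier to see everything that is missing at once. This is a minimal sketch: `probe` and `report` are hypothetical helper names, and the module list simply mirrors the checks above (add torch_npu on Ascend).

```python
import importlib


def probe(module_name):
    """Import a module and return its version string.

    Returns 'unknown' if the module has no __version__ attribute,
    or None if the import fails for any reason.
    """
    try:
        module = importlib.import_module(module_name)
    except Exception:  # ImportError, or a backend failing to initialise
        return None
    return getattr(module, "__version__", "unknown")


def report(module_names):
    """Map each module name to its probed version (None when missing)."""
    return {name: probe(name) for name in module_names}


if __name__ == "__main__":
    # Module names taken from the verification commands above.
    modules = ["torch", "vllm", "vllm_omni", "verl", "verl_omni"]
    for name, version in report(modules).items():
        print(f"{name:10s}", version if version is not None else "MISSING")
```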
Build Your Own Docker Image
The repository has a CUDA Dockerfile at docker/Dockerfile.cuda. The default base image uses CUDA 13.0.2 on Ubuntu 22.04 (override with --build-arg CUDA_VERSION=… if needed). Build context is controlled by the repo-root .dockerignore; keep large local folders such as .venv, data/, and checkpoints/ out of the context.
Prerequisites
Docker with NVIDIA Container Toolkit
Build commands
From the repository root:
# Standard GPU training image (runtime target)
docker build -f docker/Dockerfile.cuda -t verl-omni:gpu .
# OCR reward (adds the `ocr` extra / Levenshtein)
docker build -f docker/Dockerfile.cuda --target ocr -t verl-omni:gpu-ocr .
# Local development tools (adds the `dev` extra)
docker build -f docker/Dockerfile.cuda --target dev -t verl-omni:gpu-dev .
The image bakes in verl_omni and its Python dependencies. Recipe scripts under examples/ are not copied into the image — mount the repository at runtime (see below).
Launch an interactive session for development
Start an interactive shell with GPU access, shared memory for Ray/vLLM, and common host directories mounted:
export REPO=/path/to/verl-omni # this repository
export WORKSPACE=$HOME # data, checkpoints, HF cache root
docker run --gpus all --shm-size=16g -it --rm \
--name verl-omni-ocr \
-v "$REPO:/workspace/verl-omni" \
-v "$WORKSPACE/data:$WORKSPACE/data" \
-v "$WORKSPACE/checkpoints:$WORKSPACE/checkpoints" \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
-e WORKSPACE="$WORKSPACE" \
-e HF_HOME=/root/.cache/huggingface \
-e WANDB_API_KEY="${WANDB_API_KEY:-}" \
-w /workspace/verl-omni \
verl-omni:gpu-ocr \
/bin/bash
Inside the container, confirm the installation (same checks as Post-Installation Verification).
Notes:
- --shm-size=16g: Ray and vLLM use shared memory, so a larger shared-memory size is needed for training.
- Mount the repo: training recipes live in examples/; mounting $REPO lets you edit scripts locally and run them immediately in the container.
- WORKSPACE: example scripts read datasets and write checkpoints under this path (default: $HOME inside the container, i.e. /root unless overridden).
- Hugging Face cache: mounting ~/.cache/huggingface avoids re-downloading Qwen/Qwen-Image and reward models on every run.
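Once inside the container, it can be useful to confirm that the shared-memory size requested with --shm-size actually took effect. The sketch below assumes a Linux container where shared memory is mounted at /dev/shm; `total_gib` is a hypothetical helper, not something VeRL-Omni provides.

```python
import os
import shutil


def total_gib(path):
    """Total size of the filesystem holding `path` in GiB, or None if the path is missing."""
    if not os.path.exists(path):
        return None
    return shutil.disk_usage(path).total / 2**30


if __name__ == "__main__":
    size = total_gib("/dev/shm")  # assumed mount point for shared memory
    if size is None:
        print("/dev/shm not found; is this a Linux container?")
    else:
        print(f"/dev/shm total: {size:.1f} GiB")
```

With the docker run command above, the reported total should be close to 16 GiB.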
Example: Qwen-Image FlowGRPO training in Docker
This walkthrough follows the FlowGRPO quickstart using the OCR dataset and examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh. Use the ocr image target (verl-omni:gpu-ocr) so the Levenshtein dependency is present.
1. Launch the interactive container (command above).
2. Prepare the OCR dataset inside the container:
export WORKSPACE=${WORKSPACE:-$HOME}
mkdir -p $WORKSPACE/data/ocr
# Obtain raw train.txt / test.txt from the Flow-GRPO repo:
# https://github.com/yifan123/flow_grpo/tree/main/dataset/ocr
# Place them under $WORKSPACE/data/ocr/, then preprocess:
python3 examples/flowgrpo_trainer/data_process/qwenimage_ocr.py \
--input_dir $WORKSPACE/data/ocr \
--output_dir $WORKSPACE/data/ocr/qwen_image
3. (Optional) Set W&B credentials:
export WANDB_API_KEY=<your_wandb_api_key>
4. Run FlowGRPO training (4 GPUs by default in the script):
bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh
The script launches python3 -m verl_omni.trainer.main_diffusion with FlowGRPO + vllm_omni rollout and OCR reward (compute_score_ocr). Checkpoints are written to:
checkpoints/flow_grpo/qwen_image_ocr_lora
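Step 2 depends on the raw train.txt and test.txt files being in place before the preprocessing script runs. The following is a small sketch of a pre-flight check; `missing_raw_files` is a hypothetical helper, and the directory layout is taken from the commands in step 2.

```python
import os
from pathlib import Path


def missing_raw_files(input_dir, names=("train.txt", "test.txt")):
    """Return the expected raw file names that are not present under input_dir."""
    root = Path(input_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    # Same default as the shell snippet in step 2: WORKSPACE falls back to $HOME.
    workspace = os.environ.get("WORKSPACE", str(Path.home()))
    ocr_dir = Path(workspace) / "data" / "ocr"
    missing = missing_raw_files(ocr_dir)
    if missing:
        print(f"missing under {ocr_dir}: {missing}")
    else:
        print(f"raw OCR files found under {ocr_dir}; ready to preprocess")
```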