Multi-Node Training

Last updated: 06/15/2026

Scale FlowGRPO (or any diffusion RL) training across multiple nodes. This guide uses the Qwen-Image OCR LoRA example to explain every change needed when moving from one node to a multi-node cluster.

Introduction

The single-node quickstart runs all components — actor training, rollout generation, reward scoring, and reference log-prob computation — on 4 GPUs inside one machine. When you need more throughput (larger global batch, more rollout samples, or faster iteration), you can scale horizontally by adding more nodes.

Multi-node training in VeRL-Omni distributes the same components across multiple machines connected by a high-speed interconnect (InfiniBand or RoCE). The reference multi-node script (examples/flowgrpo_trainer/run_qwen_image_ocr_lora_multi_node.sh) runs on NNODES × GPUS_PER_NODE GPUs (default: 2 × 4 = 8) and achieves roughly linear throughput scaling by increasing the global batch size while keeping per-GPU work constant.

What changes and what stays the same

What changes

What stays the same

Global batch size (scaled by GPU count)

Per-GPU micro-batch size (constant)

Number of rollout replicas and reward workers

Rollout TP/DP (1 replica per GPU)

FSDP shard boundary (per-node)

Optimizer settings (lr, weight decay)

Trainer topology (nnodes, n_gpus_per_node)

Algorithm (FlowGRPO), model, LoRA config

Attention backend

Rollout sampling config (noise, SDE, CFG)

Architecture: how components are placed across nodes

Node 0 (rank 0..3)                          Node 1 (rank 4..7)
┌─────────────────────────────┐          ┌─────────────────────────────┐
│  GPU 0: replica + reward    │          │  GPU 4: replica + reward    │
│  GPU 1: replica + reward    │          │  GPU 5: replica + reward    │
│  GPU 2: replica + reward    │          │  GPU 6: replica + reward    │
│  GPU 3: replica + reward    │          │  GPU 7: replica + reward    │
│  (reward model TP=4)        │          │  (reward model TP=4)        │
│  FSDP shard group (GPUs 0-3)│          │  FSDP shard group (GPUs 4-7)│
└─────────────────────────────┘          └─────────────────────────────┘
         ▲                                        ▲
         │         NCCL / InfiniBand              │
         └────────────────┬───────────────────────┘
                          │
              ┌───────────┴───────────┐
              │  Shared filesystem    │
              │  (models, data, ckpt) │
              └───────────────────────┘

With the default ROLLOUT_TP=1 and ROLLOUT_DP=1, each GPU hosts one independent vLLM-Omni rollout replica. Each replica runs entirely within one GPU, requiring no cross-node coordination for generation. The reward model (Tensor Parallelism = REWARD_TP=4) is colocated with the rollout replicas on every node — one reward replica spans all 4 GPUs within a node via Ray’s fractional GPU scheduling, sharing each GPU with its rollout replica. With 2 nodes this gives 2 independent reward replicas (reward.num_workers = TOTAL_GPUS / REWARD_TP = 2), one per node. FSDP shards the actor parameters within each node via fsdp_size=$GPUS_PER_NODE, avoiding expensive cross-node all-gathers during training.

Prerequisites

  • Complete the single-node quickstart on one node first. Multi-node training uses the same dataset, models, and base configuration.

  • At least two nodes, each with the same number of GPUs, connected by InfiniBand or RoCE. All nodes must have identical software environments (same Python, CUDA, and pip packages) and shared access to model weights, data, and a writable checkpoint directory (e.g., via NFS or HDFS).

Conversion recipe: single-node → multi-node

The multi-node script at examples/flowgrpo_trainer/run_qwen_image_ocr_lora_multi_node.sh is a mechanical transformation of the single-node baseline at examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh. The seven changes below cover every line that differs.

1. Topology variables

Replace the hardcoded GPU count with cluster-wide variables:

# ── Single-node ─────────────────────
NUM_GPUS_ACTOR_ROLLOUT_REWARD=4

# ── Multi-node ──────────────────────
NNODES=${NNODES:-2}
GPUS_PER_NODE=${GPUS_PER_NODE:-4}
TOTAL_GPUS=$((NNODES * GPUS_PER_NODE))     

The environment variables accept overrides so the same script works on any cluster size:

NNODES=4 GPUS_PER_NODE=8 bash run_qwen_image_ocr_lora_multi_node.sh

2. Batch-size scaling

Scale train_batch_size and ppo_mini_batch_size by the GPU ratio so each GPU processes the same amount of work as in the single-node run:

# ── Single-node ─────────────────────
data.train_batch_size=32
actor_rollout_ref.actor.ppo_mini_batch_size=16

# ── Multi-node ──────────────────────
TRAIN_BATCH_SIZE=$((32 * TOTAL_GPUS / 4))       
PPO_MINI_BATCH_SIZE=$((16 * TOTAL_GPUS / 4))    
PPO_MICRO_BATCH_PER_GPU=16                     

data.train_batch_size=$TRAIN_BATCH_SIZE
actor_rollout_ref.actor.ppo_mini_batch_size=$PPO_MINI_BATCH_SIZE
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=$PPO_MICRO_BATCH_PER_GPU

The divisor 4 is the single-node GPU count. If your baseline uses 8 GPUs, divide by 8 instead. See Flow-GRPO for the detailed relationship between train_batch_size, ppo_mini_batch_size, and ppo_micro_batch_size_per_gpu.

3. Rollout and reward workers

Scale the number of agent-loop workers and reward workers to cover every GPU:

# ── Single-node ─────────────────────
actor_rollout_ref.rollout.agent.num_workers=$((NUM_GPUS_ACTOR_ROLLOUT_REWARD / ROLLOUT_TP))
reward.num_workers=$((NUM_GPUS_ACTOR_ROLLOUT_REWARD / REWARD_TP))

# ── Multi-node ──────────────────────
ROLLOUT_NUM_WORKERS=$((TOTAL_GPUS / ROLLOUT_TP)) 
actor_rollout_ref.rollout.agent.num_workers=$ROLLOUT_NUM_WORKERS

reward.num_workers=$((TOTAL_GPUS / REWARD_TP))

Each rollout replica needs one AgentLoopWorker client; each reward replica needs one reward worker. With ROLLOUT_TP=1, ROLLOUT_DP=1, and REWARD_TP=4 on 8 GPUs, you get 8 rollout replicas + 2 reward workers.

The rollout parallelism triplet (ROLLOUT_TP, ROLLOUT_DP, data_parallel_size) controls how many GPUs each vLLM-Omni replica spans:

actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP    # default: 1
actor_rollout_ref.rollout.data_parallel_size=1                       # default: 1

Replicas = TOTAL_GPUS / (ROLLOUT_TP × data_parallel_size). The script keeps ROLLOUT_TP=1 and data_parallel_size=1 so each GPU runs one self-contained replica — the simplest and most robust layout for multi-node. Increasing ROLLOUT_TP shards the rollout model across GPUs (useful when a single GPU runs out of memory) but requires that GPUS_PER_NODE is divisible by ROLLOUT_TP × data_parallel_size.

4. FSDP configuration

FSDP must shard within each node, not across the network:

# ── Single-node ─────────────────────
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \

# ── Multi-node ──────────────────────
actor_rollout_ref.actor.fsdp_config.fsdp_size=$GPUS_PER_NODE \
actor_rollout_ref.actor.fsdp_config.reshard_after_forward=False \
actor_rollout_ref.actor.fsdp_config.offload_policy=True \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \

Parameter

Value

Why

fsdp_size

$GPUS_PER_NODE

Limits FSDP’s communication group to GPUs on the same physical node, avoiding cross-node all-gather latency

reshard_after_forward

False

Keeps full parameters in memory after the forward pass instead of re-gathering them from shards. Trades memory for speed — safe with offload_policy=True

offload_policy

True

Moves policy parameters to CPU when the actor is idle, freeing GPU memory for rollout and reward

5. Attention backend

Multi-node runs benefit from FlashAttention 3’s varlen hub backend, which optimizes variable-length sequence batching across distributed GPUs:

# ── Multi-node only ─────────────────
actor_rollout_ref.model.attn_backend="_flash_3_varlen_hub"

This replaces the default attention implementation and is strongly recommended for any multi-node diffusion training run.

6. Trainer topology

Tell the trainer how many nodes and GPUs-per-node it has:

# ── Single-node ─────────────────────
trainer.n_gpus_per_node=$NUM_GPUS_ACTOR_ROLLOUT_REWARD \
trainer.nnodes=1 \

# ── Multi-node ──────────────────────
trainer.n_gpus_per_node=$GPUS_PER_NODE \
trainer.nnodes=$NNODES \

7. Experiment name

Tag the run with the cluster topology for bookkeeping:

# ── Single-node ─────────────────────
trainer.experiment_name=qwen_image_ocr_lora \

# ── Multi-node ──────────────────────
trainer.experiment_name=qwen_image_ocr_lora_multinode_${NNODES}x${GPUS_PER_NODE} \

Full reference script

examples/flowgrpo_trainer/run_qwen_image_ocr_lora_multi_node.sh
# Qwen-Image lora RL, vllm_omni rollout
set -x

# Set WORKSPACE to any writable directory; defaults to $HOME
WORKSPACE=${WORKSPACE:-$HOME}

ocr_train_path=$WORKSPACE/data/ocr/qwen_image/train.parquet
ocr_test_path=$WORKSPACE/data/ocr/qwen_image/test.parquet

model_name=Qwen/Qwen-Image
reward_model_name=Qwen/Qwen3-VL-8B-Instruct
reward_function_path=verl_omni/utils/reward_score/genrm_ocr.py

# ---- Cluster topology --------------------------------------------------------
NNODES=${NNODES:-2}
GPUS_PER_NODE=${GPUS_PER_NODE:-4}
TOTAL_GPUS=$((NNODES * GPUS_PER_NODE))  

# ---- Parallelism -------------------------------------------------------------
# Rollout: TP=1, DP=1.
# Each replica's nnodes = (TP*DP)/GPUS_PER_NODE = 1, so run_headless is bypassed.
ROLLOUT_TP=1
# Reward: keep TP=4 so one reward server fits on 4 GPUs on a single node.
# With TOTAL_GPUS=8 this yields 2 reward replicas (1 per node).
REWARD_TP=4

ENGINE=vllm_omni
REWARD_ENGINE=vllm

# ---- Batch sizing ------------------------------------------------------------
# Single-node baseline (run_qwen_image_ocr_lora.sh): 4 GPUs, train_batch=32,
# ppo_mini=16. Linearly scale by TOTAL_GPUS/4 = 4x to keep per-GPU work the
# same while increasing global batch size, matching the RFC's "scale-up"
# validation goal (equivalent per-GPU throughput, larger global batch).
TRAIN_BATCH_SIZE=$((32 * TOTAL_GPUS / 4))   
PPO_MINI_BATCH_SIZE=$((16 * TOTAL_GPUS / 4)) 
PPO_MICRO_BATCH_PER_GPU=16                   # unchanged

# Number of AgentLoopWorker actors that fan out prompts in parallel and call
# the HTTP servers. Following the existing convention (one client per
# rollout replica): TOTAL_GPUS / ROLLOUT_TP. NOTE: this knob controls the
# *clients*, not the number of replicas (replicas = total_gpus / TP / DP).
ROLLOUT_NUM_WORKERS=$((TOTAL_GPUS / ROLLOUT_TP)) 

# Optional reproducibility (yaml defaults are null / unseeded):
#   data.seed=42
#   actor_rollout_ref.rollout.seed=42

python3 -m verl_omni.trainer.main_diffusion \
    data.train_files=$ocr_train_path \
    data.val_files=$ocr_test_path \
    data.train_batch_size=$TRAIN_BATCH_SIZE  \
    data.max_prompt_length=256 \
    actor_rollout_ref.model.algorithm=flow_grpo \
    actor_rollout_ref.model.path=$model_name \
    actor_rollout_ref.model.lora_rank=64 \
    actor_rollout_ref.model.lora_alpha=128 \
    actor_rollout_ref.model.target_modules="['to_q','to_k','to_v','to_out.0','add_q_proj','add_k_proj','add_v_proj','to_add_out','img_mlp.net.0.proj','img_mlp.net.2','txt_mlp.net.0.proj','txt_mlp.net.2']" \
    actor_rollout_ref.model.attn_backend="_flash_3_varlen_hub" \
    actor_rollout_ref.actor.optim.lr=3e-4 \
    actor_rollout_ref.actor.optim.weight_decay=0.0001 \
    actor_rollout_ref.actor.ppo_mini_batch_size=$PPO_MINI_BATCH_SIZE \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=$PPO_MICRO_BATCH_PER_GPU \
    actor_rollout_ref.actor.fsdp_config.fsdp_size=$GPUS_PER_NODE \
    actor_rollout_ref.actor.fsdp_config.reshard_after_forward=False \
    actor_rollout_ref.actor.fsdp_config.offload_policy=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP \
    actor_rollout_ref.rollout.data_parallel_size=1 \
    actor_rollout_ref.rollout.name=$ENGINE \
    actor_rollout_ref.rollout.n=16 \
    actor_rollout_ref.rollout.agent.num_workers=$ROLLOUT_NUM_WORKERS \
    actor_rollout_ref.rollout.load_format=safetensors \
    actor_rollout_ref.rollout.layered_summon=True \
    actor_rollout_ref.rollout.pipeline.true_cfg_scale=4.0 \
    actor_rollout_ref.rollout.pipeline.max_sequence_length=256 \
    actor_rollout_ref.rollout.algo.noise_level=1.2 \
    actor_rollout_ref.rollout.algo.sde_type="sde" \
    actor_rollout_ref.rollout.algo.sde_window_size=2 \
    actor_rollout_ref.rollout.algo.sde_window_range="[0,5]" \
    actor_rollout_ref.rollout.val_kwargs.pipeline.num_inference_steps=50 \
    actor_rollout_ref.rollout.val_kwargs.algo.noise_level=0.0 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
    reward.num_workers=$((TOTAL_GPUS / REWARD_TP)) \
    reward.reward_model.enable=True \
    reward.reward_model.model_path=$reward_model_name \
    reward.reward_model.rollout.name=$REWARD_ENGINE \
    reward.reward_model.rollout.tensor_model_parallel_size=$REWARD_TP \
    reward.custom_reward_function.path=$reward_function_path \
    reward.custom_reward_function.name=compute_score_ocr \
    trainer.logger='["console", "wandb"]' \
    trainer.project_name=flow_grpo \
    trainer.experiment_name=qwen_image_ocr_lora_multinode_${NNODES}x${GPUS_PER_NODE} \
    trainer.log_val_generations=8 \
    trainer.val_before_train=False \
    trainer.n_gpus_per_node=$GPUS_PER_NODE \
    trainer.nnodes=$NNODES \
    trainer.save_freq=30 \
    trainer.test_freq=30 \
    trainer.total_epochs=15 \
    trainer.total_training_steps=300 "$@"

How to run

Environment variables

The script reads these variables at runtime:

Variable

Default

Description

NNODES

2

Total number of nodes

GPUS_PER_NODE

4

GPUs on each node

WORKSPACE

$HOME

Shared directory for data and checkpoints

All other parameters (model paths, learning rate, LoRA config, etc.) are hardcoded in the script. Override them by appending key-value pairs to the command line, just like the single-node script.

Step 1: Start Ray on the master node

ray start --head --num-gpus=$GPUS_PER_NODE \
  --node-ip-address=$MASTER_IP \
  --dashboard-host=0.0.0.0

Replace $MASTER_IP with the master node’s IP address (e.g., 192.168.1.10) and $GPUS_PER_NODE with the number of GPUs on that node.

Step 2: Start Ray on each slave node

ray start --address=$MASTER_IP:6379 --num-gpus=$GPUS_PER_NODE

Run this on every worker node to join the Ray cluster. All nodes must share the same $GPUS_PER_NODE.

Step 3: Launch training on the master node

bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora_multi_node.sh

The script reads NNODES and GPUS_PER_NODE from the environment (defaults: NNODES=2, GPUS_PER_NODE=4). Override as needed:

NNODES=2 GPUS_PER_NODE=4 bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora_multi_node.sh