(multi_node_training)= # Multi-Node Training Last updated: 06/15/2026 Scale FlowGRPO (or any diffusion RL) training across multiple nodes. This guide uses the Qwen-Image OCR LoRA example to explain every change needed when moving from one node to a multi-node cluster. ## Introduction The single-node quickstart runs all components — actor training, rollout generation, reward scoring, and reference log-prob computation — on 4 GPUs inside one machine. When you need more throughput (larger global batch, more rollout samples, or faster iteration), you can scale horizontally by adding more nodes. Multi-node training in VeRL-Omni distributes the same components across multiple machines connected by a high-speed interconnect (InfiniBand or RoCE). The reference multi-node script (`examples/flowgrpo_trainer/run_qwen_image_ocr_lora_multi_node.sh`) runs on `NNODES × GPUS_PER_NODE` GPUs (default: 2 × 4 = 8) and achieves roughly linear throughput scaling by increasing the global batch size while keeping per-GPU work constant. ### What changes and what stays the same | What changes | What stays the same | |---|---| | Global batch size (scaled by GPU count) | Per-GPU micro-batch size (constant) | | Number of rollout replicas and reward workers | Rollout TP/DP (1 replica per GPU) | | FSDP shard boundary (per-node) | Optimizer settings (lr, weight decay) | | Trainer topology (`nnodes`, `n_gpus_per_node`) | Algorithm (FlowGRPO), model, LoRA config | | Attention backend | Rollout sampling config (noise, SDE, CFG) | ### Architecture: how components are placed across nodes ```text Node 0 (rank 0..3) Node 1 (rank 4..7) ┌─────────────────────────────┐ ┌─────────────────────────────┐ │ GPU 0: replica + reward │ │ GPU 4: replica + reward │ │ GPU 1: replica + reward │ │ GPU 5: replica + reward │ │ GPU 2: replica + reward │ │ GPU 6: replica + reward │ │ GPU 3: replica + reward │ │ GPU 7: replica + reward │ │ (reward model TP=4) │ │ (reward model TP=4) │ │ FSDP shard group (GPUs 0-3)│ │ FSDP shard group (GPUs 4-7)│ └─────────────────────────────┘ └─────────────────────────────┘ ▲ ▲ │ NCCL / InfiniBand │ └────────────────┬───────────────────────┘ │ ┌───────────┴───────────┐ │ Shared filesystem │ │ (models, data, ckpt) │ └───────────────────────┘ ``` With the default `ROLLOUT_TP=1` and `ROLLOUT_DP=1`, each GPU hosts one independent vLLM-Omni rollout replica. Each replica runs entirely within one GPU, requiring no cross-node coordination for generation. The reward model (Tensor Parallelism = `REWARD_TP=4`) is **colocated** with the rollout replicas on **every node** — one reward replica spans all 4 GPUs within a node via Ray's fractional GPU scheduling, sharing each GPU with its rollout replica. With 2 nodes this gives 2 independent reward replicas (`reward.num_workers = TOTAL_GPUS / REWARD_TP = 2`), one per node. FSDP shards the actor parameters within each node via `fsdp_size=$GPUS_PER_NODE`, avoiding expensive cross-node all-gathers during training. ## Prerequisites - Complete the {doc}`single-node quickstart ` on one node first. Multi-node training uses the same dataset, models, and base configuration. - At least two nodes, each with the same number of GPUs, connected by InfiniBand or RoCE. All nodes must have identical software environments (same Python, CUDA, and pip packages) and shared access to model weights, data, and a writable checkpoint directory (e.g., via NFS or HDFS). ## Conversion recipe: single-node → multi-node The multi-node script at `examples/flowgrpo_trainer/run_qwen_image_ocr_lora_multi_node.sh` is a mechanical transformation of the single-node baseline at `examples/flowgrpo_trainer/run_qwen_image_ocr_lora.sh`. The seven changes below cover every line that differs. ### 1. Topology variables Replace the hardcoded GPU count with cluster-wide variables: ```bash # ── Single-node ───────────────────── NUM_GPUS_ACTOR_ROLLOUT_REWARD=4 # ── Multi-node ────────────────────── NNODES=${NNODES:-2} GPUS_PER_NODE=${GPUS_PER_NODE:-4} TOTAL_GPUS=$((NNODES * GPUS_PER_NODE)) ``` The environment variables accept overrides so the same script works on any cluster size: ```bash NNODES=4 GPUS_PER_NODE=8 bash run_qwen_image_ocr_lora_multi_node.sh ``` ### 2. Batch-size scaling Scale `train_batch_size` and `ppo_mini_batch_size` by the GPU ratio so each GPU processes the same amount of work as in the single-node run: ```bash # ── Single-node ───────────────────── data.train_batch_size=32 actor_rollout_ref.actor.ppo_mini_batch_size=16 # ── Multi-node ────────────────────── TRAIN_BATCH_SIZE=$((32 * TOTAL_GPUS / 4)) PPO_MINI_BATCH_SIZE=$((16 * TOTAL_GPUS / 4)) PPO_MICRO_BATCH_PER_GPU=16 data.train_batch_size=$TRAIN_BATCH_SIZE actor_rollout_ref.actor.ppo_mini_batch_size=$PPO_MINI_BATCH_SIZE actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=$PPO_MICRO_BATCH_PER_GPU ``` The divisor `4` is the single-node GPU count. If your baseline uses 8 GPUs, divide by 8 instead. See {doc}`../algo/flowgrpo` for the detailed relationship between `train_batch_size`, `ppo_mini_batch_size`, and `ppo_micro_batch_size_per_gpu`. ### 3. Rollout and reward workers Scale the number of agent-loop workers and reward workers to cover every GPU: ```bash # ── Single-node ───────────────────── actor_rollout_ref.rollout.agent.num_workers=$((NUM_GPUS_ACTOR_ROLLOUT_REWARD / ROLLOUT_TP)) reward.num_workers=$((NUM_GPUS_ACTOR_ROLLOUT_REWARD / REWARD_TP)) # ── Multi-node ────────────────────── ROLLOUT_NUM_WORKERS=$((TOTAL_GPUS / ROLLOUT_TP)) actor_rollout_ref.rollout.agent.num_workers=$ROLLOUT_NUM_WORKERS reward.num_workers=$((TOTAL_GPUS / REWARD_TP)) ``` Each rollout replica needs one AgentLoopWorker client; each reward replica needs one reward worker. With `ROLLOUT_TP=1`, `ROLLOUT_DP=1`, and `REWARD_TP=4` on 8 GPUs, you get 8 rollout replicas + 2 reward workers. The rollout parallelism triplet (`ROLLOUT_TP`, `ROLLOUT_DP`, `data_parallel_size`) controls how many GPUs each vLLM-Omni replica spans: ```bash actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP # default: 1 actor_rollout_ref.rollout.data_parallel_size=1 # default: 1 ``` Replicas = `TOTAL_GPUS / (ROLLOUT_TP × data_parallel_size)`. The script keeps `ROLLOUT_TP=1` and `data_parallel_size=1` so each GPU runs one self-contained replica — the simplest and most robust layout for multi-node. Increasing `ROLLOUT_TP` shards the rollout model across GPUs (useful when a single GPU runs out of memory) but requires that `GPUS_PER_NODE` is divisible by `ROLLOUT_TP × data_parallel_size`. ### 4. FSDP configuration FSDP must shard within each node, not across the network: ```bash # ── Single-node ───────────────────── actor_rollout_ref.actor.fsdp_config.param_offload=True \ actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \ actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \ # ── Multi-node ────────────────────── actor_rollout_ref.actor.fsdp_config.fsdp_size=$GPUS_PER_NODE \ actor_rollout_ref.actor.fsdp_config.reshard_after_forward=False \ actor_rollout_ref.actor.fsdp_config.offload_policy=True \ actor_rollout_ref.actor.fsdp_config.param_offload=True \ actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \ actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \ ``` | Parameter | Value | Why | |---|---|---| | `fsdp_size` | `$GPUS_PER_NODE` | Limits FSDP's communication group to GPUs on the same physical node, avoiding cross-node all-gather latency | | `reshard_after_forward` | `False` | Keeps full parameters in memory after the forward pass instead of re-gathering them from shards. Trades memory for speed — safe with `offload_policy=True` | | `offload_policy` | `True` | Moves policy parameters to CPU when the actor is idle, freeing GPU memory for rollout and reward | ### 5. Attention backend Multi-node runs benefit from FlashAttention 3's varlen hub backend, which optimizes variable-length sequence batching across distributed GPUs: ```bash # ── Multi-node only ───────────────── actor_rollout_ref.model.attn_backend="_flash_3_varlen_hub" ``` This replaces the default attention implementation and is strongly recommended for any multi-node diffusion training run. ### 6. Trainer topology Tell the trainer how many nodes and GPUs-per-node it has: ```bash # ── Single-node ───────────────────── trainer.n_gpus_per_node=$NUM_GPUS_ACTOR_ROLLOUT_REWARD \ trainer.nnodes=1 \ # ── Multi-node ────────────────────── trainer.n_gpus_per_node=$GPUS_PER_NODE \ trainer.nnodes=$NNODES \ ``` ### 7. Experiment name Tag the run with the cluster topology for bookkeeping: ```bash # ── Single-node ───────────────────── trainer.experiment_name=qwen_image_ocr_lora \ # ── Multi-node ────────────────────── trainer.experiment_name=qwen_image_ocr_lora_multinode_${NNODES}x${GPUS_PER_NODE} \ ``` ## Full reference script ```{literalinclude} ../../examples/flowgrpo_trainer/run_qwen_image_ocr_lora_multi_node.sh :language: bash :caption: examples/flowgrpo_trainer/run_qwen_image_ocr_lora_multi_node.sh ``` ## How to run ### Environment variables The script reads these variables at runtime: | Variable | Default | Description | |---|---|---| | `NNODES` | `2` | Total number of nodes | | `GPUS_PER_NODE` | `4` | GPUs on each node | | `WORKSPACE` | `$HOME` | Shared directory for data and checkpoints | All other parameters (model paths, learning rate, LoRA config, etc.) are hardcoded in the script. Override them by appending key-value pairs to the command line, just like the single-node script. ### Step 1: Start Ray on the master node ```bash ray start --head --num-gpus=$GPUS_PER_NODE \ --node-ip-address=$MASTER_IP \ --dashboard-host=0.0.0.0 ``` Replace `$MASTER_IP` with the master node's IP address (e.g., `192.168.1.10`) and `$GPUS_PER_NODE` with the number of GPUs on that node. ### Step 2: Start Ray on each slave node ```bash ray start --address=$MASTER_IP:6379 --num-gpus=$GPUS_PER_NODE ``` Run this on **every** worker node to join the Ray cluster. All nodes must share the same `$GPUS_PER_NODE`. ### Step 3: Launch training on the master node ```bash bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora_multi_node.sh ``` The script reads `NNODES` and `GPUS_PER_NODE` from the environment (defaults: `NNODES=2`, `GPUS_PER_NODE=4`). Override as needed: ```bash NNODES=2 GPUS_PER_NODE=4 bash examples/flowgrpo_trainer/run_qwen_image_ocr_lora_multi_node.sh ```