How to Integrate a Non-Diffusers Model for FlowGRPO Training
Last updated: 06/15/2026.
This guide walks you through integrating a non-diffusers model — a
standalone nn.Module that does not inherit from diffusers.ModelMixin
and is not loaded through diffusers.AutoModel.from_pretrained — into
VeRL-Omni so it can be trained end-to-end with the FlowGRPO algorithm.
Non-diffusers models manage their own architecture, configuration format, and weight-loading logic — none of which go through diffusers APIs. BAGEL-7B-MoT is the reference implementation.
If your model is a standard diffusers model, use
integrating_a_diffusion_model.md
instead. This guide extends the contracts defined there and focuses on
what is different for non-diffusers models.
We use the BAGEL-7B-MoT integration
(verl_omni/pipelines/bagel_flow_grpo/)
as the worked example throughout.
TL;DR
A new non-diffusers model needs four files in one new package plus three registry hooks:
verl_omni/pipelines/<model>_flow_grpo/
├── __init__.py # re-exports adapters + model class
├── <model>_model.py # nn.Module subclass of NonDiffusersModelBase
├── diffusers_training_adapter.py # subclass of DiffusionModelBase
└── vllm_omni_rollout_adapter.py # subclass of upstream vllm-omni Pipeline
The training adapter (diffusers_training_adapter.py) follows the
same DiffusionModelBase contract as standard diffusers models, but
overrides build_module() to use your custom from_pretrained() path
instead of diffusers.AutoModel. The rollout adapter is identical
in structure to the diffusers case.
The model module (<model>_model.py) is the new piece: a standalone
nn.Module that subclasses NonDiffusersModelBase and provides:
from_pretrained(model_path, torch_dtype)— classmethod for weight loadingforward(**kwargs)— the generation forward pass_no_split_modules— FSDP sharding hintsOptional gradient checkpointing support
When to Use NonDiffusersModelBase
Use NonDiffusersModelBase when diffusers cannot load the model.
Everything else (custom configs, weight loading, FSDP sharding)
can be handled through the standard DiffusionModelBase path by
overriding build_module().
Mental Model
The training side for non-diffusers models follows the same two-context architecture as diffusers models, but the module is built through a different path:
┌─────────────────────────────────┐ ┌──────────────────────────────┐
│ Rollout worker (vllm-omni) │ trajectory │ Trainer worker (FSDP) │
│ │ ─────────────▶ │ │
│ <Name>Pipeline (upstream) │ latents, │ <Name>ForTraining │
│ └─ forward() + SDE loop │ log_probs, │ (NonDiffusersModelBase) │
│ │ prompt ids │ └─ forward(...) │
│ <Name>PipelineWithLogProb │ │ │
│ └─ wraps with SDE scheduler │ │ │
│ └─ handles prompt format │ │ │
└─────────────────────────────────┘ └──────────────────────────────┘
The key difference from the diffusers path:
Aspect |
Diffusers model |
Non-diffusers model |
|---|---|---|
Module loading |
|
|
Module base class |
|
|
|
Return |
Return |
Config |
|
|
Architecture registration |
Auto-detected from |
Explicit: |
Prerequisites
Before you start, the new model must already be supported upstream by:
vllm-omni — provides the rollout-side
<Name>Pipeline. Your rollout adapter inherits from this class. The pipeline must be capable of running diffusion (text-to-image, or whatever modality you are training).A downloadable checkpoint — the model weights (
.safetensors) and config files must be available locally or on Hugging Face Hub.
Unlike the diffusers path, diffusers does not need to support the model — that is the whole point of the non-diffusers path. However, you must port or reimplement the transformer architecture locally (see Step 2).
If the model is not yet supported in vllm-omni, upstream it there first. Nothing below will work without vllm-omni rollout support.
Step 1 — Understand the Upstream Pipeline and the Rollout→Training Contract
Read the vllm-omni pipeline’s forward() method and answer the same
questions as in the diffusers guide, plus these non-diffusers-specific ones:
How does the upstream pipeline process text? Does it expect raw text strings or token IDs? If it expects strings, you may need a decode workaround in the rollout adapter (see the BAGEL integration’s
_ensure_bagel_prompt_textfor an example).What is the model’s forward signature? Non-diffusers models define their own forward convention — it will not match the diffusers
(sample, timestep, encoder_hidden_states)pattern. Document the signature;prepare_model_inputs()must produce matching kwargs.How does the model handle CFG? If it supports classifier-free guidance, identify the negative-branch parameters (e.g.
Nonefor text-conditioning) so training can replicate the CFG logic.What is the checkpoint format? Note the weight file name, key prefix conventions, and any architectural details that must be remapped.
What are the layer class names? List them — FSDP needs these for
_no_split_modulesso it can wrap the model correctly.What are the special token IDs? If the model uses boundary tokens (start-of-image, end-of-image, etc.), identify whether they come from config or the tokenizer. The data preprocessor must produce consistent token sequences.
Step 2 — Port the Model Architecture
Create verl_omni/pipelines/<model>_flow_grpo/<model>_model.py. This is
the most involved step. You are porting the transformer architecture from
the upstream model into a standalone nn.Module and subclassing
NonDiffusersModelBase.
2.1 Subclass NonDiffusersModelBase
from verl_omni.pipelines.non_diffusers_model_base import NonDiffusersModelBase
class MyModelForTraining(NonDiffusersModelBase):
_no_split_modules = ["MyTransformerLayer"]
_supports_gradient_checkpointing = True
def __init__(self, config: MyTrainingConfig):
super().__init__()
self.config = config
# ... build layers, embeddings, etc.
NonDiffusersModelBase provides for free:
Feature |
How to use |
|---|---|
LoRA/PEFT injection |
Inherits |
Gradient checkpointing |
Set |
FSDP sharding |
Set |
Checkpoint persistence |
Inherits |
2.2 Implement from_pretrained
@classmethod
def from_pretrained(cls, model_path: str, torch_dtype=torch.bfloat16) -> MyModelForTraining:
config = MyTrainingConfig.from_model_path(model_path)
ckpt_path = os.path.join(model_path, "ema.safetensors") # or whatever the checkpoint file is
from safetensors.torch import load_file
state_dict = load_file(ckpt_path)
model = cls(config)
mapped = _map_checkpoint_to_training(state_dict, config)
missing, unexpected = model.load_state_dict(mapped, strict=False)
if missing:
logger.warning(f"Missing keys: {len(missing)}")
model = model.to(torch_dtype)
return model
Key points:
You control the entire loading logic. No
diffusers.AutoModelinvolved.Remap checkpoint keys to match your local parameter names (see the BAGEL integration’s
bagel_model.pyfor an example).Handle dtype conversion yourself.
2.3 Implement forward
The forward signature is model-dependent. For example, an image
generation model might take (hidden_states, timestep, text_token_ids, latent_pos_ids, **kwargs). The only constraint is that
prepare_model_inputs() in the training adapter must build a dict
whose keys match this signature exactly.
For gradient checkpointing, wrap layer calls:
def forward(self, *args, **kwargs):
# ...
for layer in self.layers:
sequence = self._checkpointed_call(layer, sequence, ...)
# ...
return (output,)
Return a tuple (velocity,) — the FSDP engine expects a single-element
tuple.
2.4 Implement the Config
Create a @dataclass config class with a save_pretrained() method and a
from_model_path() classmethod:
@dataclass
class MyTrainingConfig:
hidden_size: int = 3584
num_hidden_layers: int = 28
# ...
def save_pretrained(self, save_directory: str):
output_path = os.path.join(save_directory, "config.json")
with open(output_path, "w") as f:
json.dump(asdict(self), f, indent=4, sort_keys=True)
@classmethod
def from_model_path(cls, model_path: str) -> MyTrainingConfig:
cfg_path = os.path.join(model_path, "config.json")
with open(cfg_path) as f:
raw = json.load(f)
return cls(
hidden_size=raw.get("hidden_size", 3584),
# ...
)
The config is saved alongside weights in save_pretrained().
Step 3 — Write the Training Adapter
Create verl_omni/pipelines/<model>_flow_grpo/diffusers_training_adapter.py.
This follows the same DiffusionModelBase contract as the diffusers guide
(How to Integrate a New Diffusion Model for FlowGRPO Training), with one key difference:
override build_module() to use your custom loading path.
3.1 Override build_module
@DiffusionModelBase.register("OmniMyModelForConditionalGeneration", algorithm="flow_grpo")
class MyModelDiffusion(DiffusionModelBase):
@classmethod
def build_module(cls, model_config: DiffusionModelConfig, torch_dtype: torch.dtype):
logger.info("Loading MyModelForTraining from %s", model_config.local_path)
return MyModelForTraining.from_pretrained(model_config.local_path, torch_dtype=torch_dtype)
When build_module() returns a non-None value, the FSDP engine uses it
directly. When it returns None (the default), the engine falls back to
diffusers.AutoModel.from_pretrained.
3.2 Implement prepare_model_inputs
This classmethod receives the training module, model config, latents,
timesteps, prompt embeddings (plus masks), and the micro-batch from the
trainer engine, returning a pair of model-kwargs dicts (positive and
negative). See How to Integrate a New Diffusion Model for FlowGRPO Training for the full
signature. Note: non-diffusers models often ignore the prompt_embeds*
parameters and read token IDs directly from micro_batch instead
(see the BAGEL integration’s _prompt_token_ids_to_batch for an example).
3.3 Implement forward_and_sample_previous_step
This follows the same pattern as the diffusers guide. If your model
supports classifier-free guidance, implement multi-branch forwarding
(e.g. conditional + unconditional) and combine the outputs before calling
scheduler.sample_previous_step(). The BAGEL integration demonstrates
a 3-branch CFG with sigma-interval gating as a reference.
The return signature is always (log_prob, prev_sample_mean, std_dev_t, sqrt_dt).
3.4 Implement the Scheduler
Non-diffusers models use FlowMatchSDEDiscreteScheduler just like
diffusers models. If your model needs custom sigma schedules (e.g.
time-shifted sigmas), place the setup logic in a shared common.py
so both the training and rollout adapters use the identical schedule.
See the BAGEL integration’s setup_bagel_sigmas for a worked example.
Step 4 — Write the Rollout Adapter
Create verl_omni/pipelines/<model>_flow_grpo/vllm_omni_rollout_adapter.py.
This is nearly identical to the diffusers guide — subclass the upstream
vllm-omni pipeline and wrap with the SDE scheduler.
Key considerations for non-diffusers models:
Prompt format. The verl-omni agent loop ships token IDs in
req.prompts[0]["prompt_token_ids"], but the upstream vllm-omni
pipeline may expect text strings. If needed, add a decode workaround
in your adapter’s forward() (see the BAGEL integration’s
_ensure_bagel_prompt_text for an example).
Scheduler adapter. Some upstream pipelines use non-standard
step() argument conventions. If the standard
FlowMatchSDEDiscreteScheduler is incompatible, wrap it in a
lightweight adapter that reshapes inputs and outputs to match the
pipeline’s expectation. Most models will not need this —
FlowMatchSDEDiscreteScheduler works directly with the standard
interface.
SDE window. forward() must set up an SDE window (selecting a
subset of denoising steps), compensate for any vllm-omni version-specific
step-count quirks, and slice the trajectory to return only the
windowed steps.
Step 5 — Wire Up Registries and Package
5.1 __init__.py
Export the adapter classes so their @register(...) decorators run on import.
The model module is typically imported by the training adapter directly and
does not need to be in the public API, but including it is fine:
from .diffusers_training_adapter import MyModelDiffusion
from .vllm_omni_rollout_adapter import MyModelPipelineWithLogProb
__all__ = ["MyModelDiffusion", "MyModelPipelineWithLogProb"]
5.2 Register in verl_omni/pipelines/__init__.py
from .my_model_flow_grpo import * # noqa: F401, F403
__all__ += my_model_flow_grpo.__all__
5.3 Architecture String
The architecture string passed to both @DiffusionModelBase.register() and
@VllmOmniPipelineBase.register() must be consistent. For non-diffusers
models, there is no model_index.json to auto-detect from, so users must
pass the architecture explicitly on the CLI:
+actor_rollout_ref.model.architecture=OmniMyModelForConditionalGeneration
Step 6 — Add a Data Preprocessor
The data preprocessor must match the tokenisation used by the upstream
pipeline. For most models the agent loop’s prompts tensor is enough.
If the upstream pipeline processes prompts differently from the default
chat template (e.g. it has its own tokenization path), the data
preprocessor must produce token sequences consistent with what the
pipeline expects. The BAGEL integration demonstrates this pattern: the
preprocessor uses a model-specific tokenize_<model>_prompt() wrapper
to match the pipeline’s prepare_prompts output, storing the result as
a pre-tokenized column that the training adapter reads via
_prompt_token_ids_to_batch().
(See examples/flowgrpo_trainer/data_process/ for reference implementations.)
Step 7 — Add a Smoke Test
Follow the same pattern as the diffusers guide (Step 6 of How to Integrate a New Diffusion Model for FlowGRPO Training), but with these additions:
The dummy data must include the
promptchat messages (standard batchprompts/attention_maskfrom the agent loop).The architecture override must be passed explicitly:
+actor_rollout_ref.model.architecture=OmniMyModelForConditionalGeneration.
Reference: BAGEL Implementation Checklist
The BAGEL integration is the canonical non-diffusers example. Use this checklist to verify your implementation against it:
Model module (bagel_model.py)
[ ]
BagelTrainingConfigdataclass withsave_pretrained()andfrom_model_path()[ ]
BagelForTraining(NonDiffusersModelBase)with:[ ]
_no_split_modules = ["BagelMoTLayer"][ ]
_supports_gradient_checkpointing = True[ ]
forward()with gradient checkpointing via_checkpointed_call()[ ]
from_pretrained()loading fromema.safetensorswith key remapping[ ] Token embedding, timestep embedding, VAE projection, position embedding
[ ] MoT dual-pathway attention (text
*_proj+ gen*_moe_gen)[ ] SOI/EOI boundary token handling
Training adapter (diffusers_training_adapter.py)
[ ]
@DiffusionModelBase.register("OmniBagelForConditionalGeneration", algorithm="flow_grpo")[ ]
build_module()returnsBagelForTraining.from_pretrained(...)[ ]
build_scheduler()andset_timesteps()with shifted sigmas[ ]
prepare_model_inputs()readspromptsandattention_maskfrom micro-batch (standard tensors, no extra field needed)[ ]
forward_and_sample_previous_step()with 3-branch CFG combining
Rollout adapter (vllm_omni_rollout_adapter.py)
[ ]
@VllmOmniPipelineBase.register("OmniBagelForConditionalGeneration", algorithm="flow_grpo")[ ] Subclasses
BagelPipelinefrom vllm-omni[ ] Wraps scheduler in
_BagelSchedulerAdapterfor 4-argstep()convention[ ] SDE
step()passes batched(1, tokens, C)tensors so log-probs match training[ ]
_ensure_bagel_prompt_text()workaround for text-prompt requirement[ ]
forward()sets up SDE window, vllm-omni 0.22 timestep compensation, returns sliced trajectory
Data preprocessor (see examples/flowgrpo_trainer/data_process/)
[ ] Stores prompts in standard chat-message format (
promptkey)[ ] (BAGEL only) Pre-tokenizes captions via
tokenize_bagel_prompt()and the training adapter reads them via_prompt_token_ids_to_batch()
Wiring
[ ]
verl_omni/pipelines/bagel_flow_grpo/__init__.pyre-exports the two adapter classes (BagelDiffusionandBagelPipelineWithLogProb)[ ]
verl_omni/pipelines/__init__.pyimportsbagel_flow_grpo[ ] Example launch script at
examples/flowgrpo_trainer/bagel/run_bagel_ocr_lora.sh[ ] Deploy config at
examples/flowgrpo_trainer/bagel/bagel_deploy_config.yaml
When to Use the Diffusers Path Instead
If diffusers can load the model, use
How to Integrate a New Diffusion Model for FlowGRPO Training. Override build_module() there
for any custom loading you need.