How to Integrate a Non-Diffusers Model for FlowGRPO Training

Last updated: 06/15/2026.

This guide walks you through integrating a non-diffusers model — a standalone nn.Module that does not inherit from diffusers.ModelMixin and is not loaded through diffusers.AutoModel.from_pretrained — into VeRL-Omni so it can be trained end-to-end with the FlowGRPO algorithm.

Non-diffusers models manage their own architecture, configuration format, and weight-loading logic — none of which go through diffusers APIs. BAGEL-7B-MoT is the reference implementation.

If your model is a standard diffusers model, use integrating_a_diffusion_model.md instead. This guide extends the contracts defined there and focuses on what is different for non-diffusers models.

We use the BAGEL-7B-MoT integration (verl_omni/pipelines/bagel_flow_grpo/) as the worked example throughout.

TL;DR

A new non-diffusers model needs four files in one new package plus three registry hooks:

verl_omni/pipelines/<model>_flow_grpo/
├── __init__.py                       # re-exports adapters + model class
├── <model>_model.py                  # nn.Module subclass of NonDiffusersModelBase
├── diffusers_training_adapter.py     # subclass of DiffusionModelBase
└── vllm_omni_rollout_adapter.py      # subclass of upstream vllm-omni Pipeline

The training adapter (diffusers_training_adapter.py) follows the same DiffusionModelBase contract as standard diffusers models, but overrides build_module() to use your custom from_pretrained() path instead of diffusers.AutoModel. The rollout adapter is identical in structure to the diffusers case.

The model module (<model>_model.py) is the new piece: a standalone nn.Module that subclasses NonDiffusersModelBase and provides:

from_pretrained(model_path, torch_dtype) — classmethod for weight loading
forward(**kwargs) — the generation forward pass
_no_split_modules — FSDP sharding hints
Optional gradient checkpointing support

When to Use NonDiffusersModelBase

Use NonDiffusersModelBase when diffusers cannot load the model. Everything else (custom configs, weight loading, FSDP sharding) can be handled through the standard DiffusionModelBase path by overriding build_module().

Mental Model

The training side for non-diffusers models follows the same two-context architecture as diffusers models, but the module is built through a different path:

  ┌─────────────────────────────────┐                ┌──────────────────────────────┐
  │ Rollout worker (vllm-omni)      │   trajectory   │ Trainer worker (FSDP)        │
  │                                 │ ─────────────▶ │                              │
  │ <Name>Pipeline (upstream)       │  latents,      │ <Name>ForTraining            │
  │  └─ forward() + SDE loop        │  log_probs,    │  (NonDiffusersModelBase)     │
  │                                 │  prompt ids    │  └─ forward(...)             │
  │ <Name>PipelineWithLogProb       │                │                              │
  │  └─ wraps with SDE scheduler    │                │                              │
  │  └─ handles prompt format       │                │                              │
  └─────────────────────────────────┘                └──────────────────────────────┘

The key difference from the diffusers path:

Aspect	Diffusers model	Non-diffusers model
Module loading	`diffusers.AutoModel.from_pretrained()`	`MyModel.from_pretrained(model_path)`
Module base class	`ModelMixin` (from diffusers)	`NonDiffusersModelBase` (from verl-omni)
`build_module()`	Return `None` → default AutoModel path	Return `MyModel.from_pretrained(...)`
Config	`model_index.json` → `_class_name`	`config.json` → custom struct (e.g. `BagelTrainingConfig`)
Architecture registration	Auto-detected from `model_index.json`	Explicit: `+actor_rollout_ref.model.architecture=...`

Prerequisites

Before you start, the new model must already be supported upstream by:

vllm-omni — provides the rollout-side <Name>Pipeline. Your rollout adapter inherits from this class. The pipeline must be capable of running diffusion (text-to-image, or whatever modality you are training).
A downloadable checkpoint — the model weights (.safetensors) and config files must be available locally or on Hugging Face Hub.

Unlike the diffusers path, diffusers does not need to support the model — that is the whole point of the non-diffusers path. However, you must port or reimplement the transformer architecture locally (see Step 2).

If the model is not yet supported in vllm-omni, upstream it there first. Nothing below will work without vllm-omni rollout support.

Step 1 — Understand the Upstream Pipeline and the Rollout→Training Contract

Read the vllm-omni pipeline’s forward() method and answer the same questions as in the diffusers guide, plus these non-diffusers-specific ones:

How does the upstream pipeline process text? Does it expect raw text strings or token IDs? If it expects strings, you may need a decode workaround in the rollout adapter (see the BAGEL integration’s _ensure_bagel_prompt_text for an example).
What is the model’s forward signature? Non-diffusers models define their own forward convention — it will not match the diffusers (sample, timestep, encoder_hidden_states) pattern. Document the signature; prepare_model_inputs() must produce matching kwargs.
How does the model handle CFG? If it supports classifier-free guidance, identify the negative-branch parameters (e.g. None for text-conditioning) so training can replicate the CFG logic.
What is the checkpoint format? Note the weight file name, key prefix conventions, and any architectural details that must be remapped.
What are the layer class names? List them — FSDP needs these for _no_split_modules so it can wrap the model correctly.
What are the special token IDs? If the model uses boundary tokens (start-of-image, end-of-image, etc.), identify whether they come from config or the tokenizer. The data preprocessor must produce consistent token sequences.

Step 2 — Port the Model Architecture

Create verl_omni/pipelines/<model>_flow_grpo/<model>_model.py. This is the most involved step. You are porting the transformer architecture from the upstream model into a standalone nn.Module and subclassing NonDiffusersModelBase.

2.1 Subclass `NonDiffusersModelBase`

from verl_omni.pipelines.non_diffusers_model_base import NonDiffusersModelBase

class MyModelForTraining(NonDiffusersModelBase):
    _no_split_modules = ["MyTransformerLayer"]
    _supports_gradient_checkpointing = True

    def __init__(self, config: MyTrainingConfig):
        super().__init__()
        self.config = config
        # ... build layers, embeddings, etc.

NonDiffusersModelBase provides for free:

Feature	How to use
LoRA/PEFT injection	Inherits `add_adapter()`, `load_lora_adapter()`, `set_adapter()`, `disable_adapters()`, `enable_adapters()`
Gradient checkpointing	Set `_supports_gradient_checkpointing = True` and wrap layer calls with `self._checkpointed_call(fn, *args)`
FSDP sharding	Set `_no_split_modules` to your layer class names
Checkpoint persistence	Inherits `save_pretrained()` (saves `model.safetensors` + `config.json`)

2.2 Implement `from_pretrained`

@classmethod
def from_pretrained(cls, model_path: str, torch_dtype=torch.bfloat16) -> MyModelForTraining:
    config = MyTrainingConfig.from_model_path(model_path)
    ckpt_path = os.path.join(model_path, "ema.safetensors")  # or whatever the checkpoint file is
    from safetensors.torch import load_file
    state_dict = load_file(ckpt_path)

    model = cls(config)
    mapped = _map_checkpoint_to_training(state_dict, config)
    missing, unexpected = model.load_state_dict(mapped, strict=False)
    if missing:
        logger.warning(f"Missing keys: {len(missing)}")
    model = model.to(torch_dtype)
    return model

Key points:

You control the entire loading logic. No diffusers.AutoModel involved.
Remap checkpoint keys to match your local parameter names (see the BAGEL integration’s bagel_model.py for an example).
Handle dtype conversion yourself.

2.3 Implement `forward`

The forward signature is model-dependent. For example, an image generation model might take (hidden_states, timestep, text_token_ids, latent_pos_ids, **kwargs). The only constraint is that prepare_model_inputs() in the training adapter must build a dict whose keys match this signature exactly.

For gradient checkpointing, wrap layer calls:

def forward(self, *args, **kwargs):
    # ...
    for layer in self.layers:
        sequence = self._checkpointed_call(layer, sequence, ...)
    # ...
    return (output,)

Return a tuple (velocity,) — the FSDP engine expects a single-element tuple.

2.4 Implement the Config

Create a @dataclass config class with a save_pretrained() method and a from_model_path() classmethod:

@dataclass
class MyTrainingConfig:
    hidden_size: int = 3584
    num_hidden_layers: int = 28
    # ...

    def save_pretrained(self, save_directory: str):
        output_path = os.path.join(save_directory, "config.json")
        with open(output_path, "w") as f:
            json.dump(asdict(self), f, indent=4, sort_keys=True)

    @classmethod
    def from_model_path(cls, model_path: str) -> MyTrainingConfig:
        cfg_path = os.path.join(model_path, "config.json")
        with open(cfg_path) as f:
            raw = json.load(f)
        return cls(
            hidden_size=raw.get("hidden_size", 3584),
            # ...
        )

The config is saved alongside weights in save_pretrained().

Step 3 — Write the Training Adapter

Create verl_omni/pipelines/<model>_flow_grpo/diffusers_training_adapter.py. This follows the same DiffusionModelBase contract as the diffusers guide (How to Integrate a New Diffusion Model for FlowGRPO Training), with one key difference: override build_module() to use your custom loading path.

3.1 Override `build_module`

@DiffusionModelBase.register("OmniMyModelForConditionalGeneration", algorithm="flow_grpo")
class MyModelDiffusion(DiffusionModelBase):
    @classmethod
    def build_module(cls, model_config: DiffusionModelConfig, torch_dtype: torch.dtype):
        logger.info("Loading MyModelForTraining from %s", model_config.local_path)
        return MyModelForTraining.from_pretrained(model_config.local_path, torch_dtype=torch_dtype)

When build_module() returns a non-None value, the FSDP engine uses it directly. When it returns None (the default), the engine falls back to diffusers.AutoModel.from_pretrained.

3.2 Implement `prepare_model_inputs`

This classmethod receives the training module, model config, latents, timesteps, prompt embeddings (plus masks), and the micro-batch from the trainer engine, returning a pair of model-kwargs dicts (positive and negative). See How to Integrate a New Diffusion Model for FlowGRPO Training for the full signature. Note: non-diffusers models often ignore the prompt_embeds* parameters and read token IDs directly from micro_batch instead (see the BAGEL integration’s _prompt_token_ids_to_batch for an example).

3.3 Implement `forward_and_sample_previous_step`

This follows the same pattern as the diffusers guide. If your model supports classifier-free guidance, implement multi-branch forwarding (e.g. conditional + unconditional) and combine the outputs before calling scheduler.sample_previous_step(). The BAGEL integration demonstrates a 3-branch CFG with sigma-interval gating as a reference.

The return signature is always (log_prob, prev_sample_mean, std_dev_t, sqrt_dt).

3.4 Implement the Scheduler

Non-diffusers models use FlowMatchSDEDiscreteScheduler just like diffusers models. If your model needs custom sigma schedules (e.g. time-shifted sigmas), place the setup logic in a shared common.py so both the training and rollout adapters use the identical schedule. See the BAGEL integration’s setup_bagel_sigmas for a worked example.

Step 4 — Write the Rollout Adapter

Create verl_omni/pipelines/<model>_flow_grpo/vllm_omni_rollout_adapter.py. This is nearly identical to the diffusers guide — subclass the upstream vllm-omni pipeline and wrap with the SDE scheduler.

Key considerations for non-diffusers models:

Prompt format. The verl-omni agent loop ships token IDs in req.prompts[0]["prompt_token_ids"], but the upstream vllm-omni pipeline may expect text strings. If needed, add a decode workaround in your adapter’s forward() (see the BAGEL integration’s _ensure_bagel_prompt_text for an example).

Scheduler adapter. Some upstream pipelines use non-standard step() argument conventions. If the standard FlowMatchSDEDiscreteScheduler is incompatible, wrap it in a lightweight adapter that reshapes inputs and outputs to match the pipeline’s expectation. Most models will not need this — FlowMatchSDEDiscreteScheduler works directly with the standard interface.

SDE window. forward() must set up an SDE window (selecting a subset of denoising steps), compensate for any vllm-omni version-specific step-count quirks, and slice the trajectory to return only the windowed steps.

Step 5 — Wire Up Registries and Package

5.1 `init.py`

Export the adapter classes so their @register(...) decorators run on import. The model module is typically imported by the training adapter directly and does not need to be in the public API, but including it is fine:

from .diffusers_training_adapter import MyModelDiffusion
from .vllm_omni_rollout_adapter import MyModelPipelineWithLogProb

__all__ = ["MyModelDiffusion", "MyModelPipelineWithLogProb"]

5.2 Register in `verl_omni/pipelines/init.py`

from .my_model_flow_grpo import *  # noqa: F401, F403
__all__ += my_model_flow_grpo.__all__

5.3 Architecture String

The architecture string passed to both @DiffusionModelBase.register() and @VllmOmniPipelineBase.register() must be consistent. For non-diffusers models, there is no model_index.json to auto-detect from, so users must pass the architecture explicitly on the CLI:

+actor_rollout_ref.model.architecture=OmniMyModelForConditionalGeneration

Step 6 — Add a Data Preprocessor

The data preprocessor must match the tokenisation used by the upstream pipeline. For most models the agent loop’s prompts tensor is enough.

If the upstream pipeline processes prompts differently from the default chat template (e.g. it has its own tokenization path), the data preprocessor must produce token sequences consistent with what the pipeline expects. The BAGEL integration demonstrates this pattern: the preprocessor uses a model-specific tokenize_<model>_prompt() wrapper to match the pipeline’s prepare_prompts output, storing the result as a pre-tokenized column that the training adapter reads via _prompt_token_ids_to_batch(). (See examples/flowgrpo_trainer/data_process/ for reference implementations.)

Step 7 — Add a Smoke Test

Follow the same pattern as the diffusers guide (Step 6 of How to Integrate a New Diffusion Model for FlowGRPO Training), but with these additions:

The dummy data must include the prompt chat messages (standard batch prompts / attention_mask from the agent loop).
The architecture override must be passed explicitly: +actor_rollout_ref.model.architecture=OmniMyModelForConditionalGeneration.

Reference: BAGEL Implementation Checklist

The BAGEL integration is the canonical non-diffusers example. Use this checklist to verify your implementation against it:

Model module (`bagel_model.py`)

[ ] BagelTrainingConfig dataclass with save_pretrained() and from_model_path()
[ ] BagelForTraining(NonDiffusersModelBase) with:
- [ ] _no_split_modules = ["BagelMoTLayer"]
- [ ] _supports_gradient_checkpointing = True
- [ ] forward() with gradient checkpointing via _checkpointed_call()
- [ ] from_pretrained() loading from ema.safetensors with key remapping
- [ ] Token embedding, timestep embedding, VAE projection, position embedding
- [ ] MoT dual-pathway attention (text *_proj + gen *_moe_gen)
- [ ] SOI/EOI boundary token handling

Training adapter (`diffusers_training_adapter.py`)

[ ] @DiffusionModelBase.register("OmniBagelForConditionalGeneration", algorithm="flow_grpo")
[ ] build_module() returns BagelForTraining.from_pretrained(...)
[ ] build_scheduler() and set_timesteps() with shifted sigmas
[ ] prepare_model_inputs() reads prompts and attention_mask from micro-batch (standard tensors, no extra field needed)
[ ] forward_and_sample_previous_step() with 3-branch CFG combining

Rollout adapter (`vllm_omni_rollout_adapter.py`)

[ ] @VllmOmniPipelineBase.register("OmniBagelForConditionalGeneration", algorithm="flow_grpo")
[ ] Subclasses BagelPipeline from vllm-omni
[ ] Wraps scheduler in _BagelSchedulerAdapter for 4-arg step() convention
[ ] SDE step() passes batched (1, tokens, C) tensors so log-probs match training
[ ] _ensure_bagel_prompt_text() workaround for text-prompt requirement
[ ] forward() sets up SDE window, vllm-omni 0.22 timestep compensation, returns sliced trajectory

Shared utilities (`common.py`)

[ ] setup_bagel_sigmas() — shared sigma schedule for rollout and training
[ ] bagel_time_shift() — SD3-style timestep shift of 3.0
[ ] CFG defaults (BAGEL_FLOWGRPO_CFG_DEFAULTS) — consistent between adapters

Data preprocessor (see `examples/flowgrpo_trainer/data_process/`)

[ ] Stores prompts in standard chat-message format (prompt key)
[ ] (BAGEL only) Pre-tokenizes captions via tokenize_bagel_prompt() and the training adapter reads them via _prompt_token_ids_to_batch()

Wiring

[ ] verl_omni/pipelines/bagel_flow_grpo/__init__.py re-exports the two adapter classes (BagelDiffusion and BagelPipelineWithLogProb)
[ ] verl_omni/pipelines/__init__.py imports bagel_flow_grpo
[ ] Example launch script at examples/flowgrpo_trainer/bagel/run_bagel_ocr_lora.sh
[ ] Deploy config at examples/flowgrpo_trainer/bagel/bagel_deploy_config.yaml

When to Use the Diffusers Path Instead

If diffusers can load the model, use How to Integrate a New Diffusion Model for FlowGRPO Training. Override build_module() there for any custom loading you need.