(integrating_a_non_diffusers_model)=

# How to Integrate a Non-Diffusers Model for FlowGRPO Training

Last updated: 06/15/2026.

This guide walks you through integrating a **non-diffusers model** — a standalone `nn.Module` that does **not** inherit from `diffusers.ModelMixin` and is **not** loaded through `diffusers.AutoModel.from_pretrained` — into VeRL-Omni so it can be trained end-to-end with the **FlowGRPO** algorithm. Non-diffusers models manage their own architecture, configuration format, and weight-loading logic — none of which go through diffusers APIs. BAGEL-7B-MoT is the reference implementation.

If your model is a standard diffusers model, use [`integrating_a_diffusion_model.md`](integrating_a_diffusion_model.md) instead. This guide extends the contracts defined there and focuses on what is **different** for non-diffusers models.

We use the **BAGEL-7B-MoT** integration ([`verl_omni/pipelines/bagel_flow_grpo/`](../../verl_omni/pipelines/bagel_flow_grpo/__init__.py)) as the worked example throughout.

---

## TL;DR

A new non-diffusers model needs **four files in one new package** plus **three registry hooks**:

```
verl_omni/pipelines/<model>_flow_grpo/
├── __init__.py                      # re-exports adapters + model class
├── <model>_model.py                 # nn.Module subclass of NonDiffusersModelBase
├── diffusers_training_adapter.py    # subclass of DiffusionModelBase
└── vllm_omni_rollout_adapter.py     # subclass of upstream vllm-omni Pipeline
```

The **training adapter** (`diffusers_training_adapter.py`) follows the same `DiffusionModelBase` contract as standard diffusers models, but overrides `build_module()` to use your custom `from_pretrained()` path instead of `diffusers.AutoModel`. The **rollout adapter** is identical in structure to the diffusers case.

The **model module** (`<model>_model.py`) is the new piece: a standalone `nn.Module` that subclasses `NonDiffusersModelBase` and provides:

- `from_pretrained(model_path, torch_dtype)` — classmethod for weight loading
- `forward(**kwargs)` — the generation forward pass
- `_no_split_modules` — FSDP sharding hints
- Optional gradient checkpointing support

---

## When to Use NonDiffusersModelBase

Use `NonDiffusersModelBase` when **diffusers cannot load the model**. Everything else (custom configs, weight loading, FSDP sharding) can be handled through the standard ``DiffusionModelBase`` path by overriding ``build_module()``.

---

## Mental Model

The training side for non-diffusers models follows the same two-context architecture as diffusers models, but the **module** is built through a different path:

```text
┌─────────────────────────────────┐                ┌──────────────────────────────┐
│  Rollout worker (vllm-omni)     │   trajectory   │  Trainer worker (FSDP)       │
│                                 │ ─────────────▶ │                              │
│  <Model>Pipeline (upstream)     │   latents,     │  <Model>ForTraining          │
│   └─ forward() + SDE loop       │   log_probs,   │  (NonDiffusersModelBase)     │
│                                 │   prompt ids   │   └─ forward(...)            │
│  <Model>PipelineWithLogProb     │                │                              │
│   └─ wraps with SDE scheduler   │                │                              │
│   └─ handles prompt format      │                │                              │
└─────────────────────────────────┘                └──────────────────────────────┘
```

The key differences from the diffusers path:

| Aspect | Diffusers model | Non-diffusers model |
|---|---|---|
| Module loading | `diffusers.AutoModel.from_pretrained()` | `MyModel.from_pretrained(model_path)` |
| Module base class | `ModelMixin` (from diffusers) | `NonDiffusersModelBase` (from verl-omni) |
| `build_module()` | Return `None` → default AutoModel path | Return `MyModel.from_pretrained(...)` |
| Config | `model_index.json` → `_class_name` | `config.json` → custom struct (e.g. `BagelTrainingConfig`) |
| Architecture registration | Auto-detected from `model_index.json` | Explicit: `+actor_rollout_ref.model.architecture=...` |

---

## Prerequisites

Before you start, the new model must already be supported upstream by:

- **vllm-omni** — provides the rollout-side `<Model>Pipeline`. Your rollout adapter inherits from this class. The pipeline must be capable of running diffusion (text-to-image, or whatever modality you are training).
- **A downloadable checkpoint** — the model weights (`.safetensors`) and config files must be available locally or on Hugging Face Hub.

Unlike the diffusers path, **diffusers does not need to support the model** — that is the whole point of the non-diffusers path. However, you must port or reimplement the transformer architecture locally (see Step 2).

If the model is not yet supported in vllm-omni, upstream it there first. Nothing below will work without vllm-omni rollout support.

---

## Step 1 — Understand the Upstream Pipeline and the Rollout→Training Contract

Read the vllm-omni pipeline's `forward()` method and answer the same questions as in the diffusers guide, plus these non-diffusers-specific ones:

1. **How does the upstream pipeline process text?** Does it expect raw text strings or token IDs? If it expects strings, you may need a decode workaround in the rollout adapter (see the BAGEL integration's `_ensure_bagel_prompt_text` for an example).
2. **What is the model's forward signature?** Non-diffusers models define their own forward convention — it will not match the diffusers `(sample, timestep, encoder_hidden_states)` pattern. Document the signature; ``prepare_model_inputs()`` must produce matching kwargs.
3. **How does the model handle CFG?** If it supports classifier-free guidance, identify the negative-branch parameters (e.g. `None` for text-conditioning) so training can replicate the CFG logic.
4. **What is the checkpoint format?** Note the weight file name, key prefix conventions, and any architectural details that must be remapped.
5. **What are the layer class names?** List them — FSDP needs these for `_no_split_modules` so it can wrap the model correctly.
6. **What are the special token IDs?** If the model uses boundary tokens (start-of-image, end-of-image, etc.), identify whether they come from config or the tokenizer. The data preprocessor must produce consistent token sequences.

---

## Step 2 — Port the Model Architecture

Create `verl_omni/pipelines/<model>_flow_grpo/<model>_model.py`.

This is the most involved step. You are porting the transformer architecture from the upstream model into a standalone `nn.Module` and subclassing `NonDiffusersModelBase`.

### 2.1 Subclass `NonDiffusersModelBase`

```python
from verl_omni.pipelines.non_diffusers_model_base import NonDiffusersModelBase


class MyModelForTraining(NonDiffusersModelBase):
    # Layer classes that FSDP must keep whole when sharding.
    _no_split_modules = ["MyTransformerLayer"]
    _supports_gradient_checkpointing = True

    def __init__(self, config: "MyTrainingConfig"):  # config class from 2.4
        super().__init__()
        self.config = config
        # ... build layers, embeddings, etc.
```

`NonDiffusersModelBase` provides for free:

| Feature | How to use |
|---|---|
| **LoRA/PEFT injection** | Inherits `add_adapter()`, `load_lora_adapter()`, `set_adapter()`, `disable_adapters()`, `enable_adapters()` |
| **Gradient checkpointing** | Set `_supports_gradient_checkpointing = True` and wrap layer calls with `self._checkpointed_call(fn, *args)` |
| **FSDP sharding** | Set `_no_split_modules` to your layer class names |
| **Checkpoint persistence** | Inherits `save_pretrained()` (saves `model.safetensors` + `config.json`) |

### 2.2 Implement `from_pretrained`

```python
import logging
import os

import torch
from safetensors.torch import load_file

logger = logging.getLogger(__name__)


class MyModelForTraining(NonDiffusersModelBase):
    # ... __init__ from 2.1 ...

    @classmethod
    def from_pretrained(cls, model_path: str, torch_dtype=torch.bfloat16) -> "MyModelForTraining":
        config = MyTrainingConfig.from_model_path(model_path)
        ckpt_path = os.path.join(model_path, "ema.safetensors")  # or whatever the checkpoint file is
        state_dict = load_file(ckpt_path)

        model = cls(config)
        # Remap upstream checkpoint keys to the local parameter names.
        mapped = _map_checkpoint_to_training(state_dict, config)
        missing, unexpected = model.load_state_dict(mapped, strict=False)
        if missing:
            logger.warning("Missing keys: %d", len(missing))
        if unexpected:
            logger.warning("Unexpected keys: %d", len(unexpected))
        model = model.to(torch_dtype)
        return model
```

Key points:

- You control the entire loading logic. No `diffusers.AutoModel` involved.
- Remap checkpoint keys to match your local parameter names (see the BAGEL integration's ``bagel_model.py`` for an example).
- Handle dtype conversion yourself.

### 2.3 Implement `forward`

The forward signature is **model-dependent**. For example, an image generation model might take ``(hidden_states, timestep, text_token_ids, latent_pos_ids, **kwargs)``. The only constraint is that ``prepare_model_inputs()`` in the training adapter must build a dict whose keys match this signature exactly.

For gradient checkpointing, wrap layer calls:

```python
def forward(self, *args, **kwargs):
    # ...
    for layer in self.layers:
        sequence = self._checkpointed_call(layer, sequence, ...)
    # ...
    return (output,)  # single-element tuple holding the predicted velocity
```

Return a **tuple** `(velocity,)` — the FSDP engine expects a single-element tuple.

### 2.4 Implement the Config

Create a `@dataclass` config class with a `save_pretrained()` method and a `from_model_path()` classmethod:

```python
import json
import os
from dataclasses import asdict, dataclass


@dataclass
class MyTrainingConfig:
    hidden_size: int = 3584
    num_hidden_layers: int = 28
    # ...

    def save_pretrained(self, save_directory: str):
        output_path = os.path.join(save_directory, "config.json")
        with open(output_path, "w") as f:
            json.dump(asdict(self), f, indent=4, sort_keys=True)

    @classmethod
    def from_model_path(cls, model_path: str) -> "MyTrainingConfig":
        cfg_path = os.path.join(model_path, "config.json")
        with open(cfg_path) as f:
            raw = json.load(f)
        return cls(
            hidden_size=raw.get("hidden_size", 3584),
            # ...
        )
```

The model's inherited `save_pretrained()` writes this config alongside the weights.

---

## Step 3 — Write the Training Adapter

Create `verl_omni/pipelines/<model>_flow_grpo/diffusers_training_adapter.py`.

This follows the same `DiffusionModelBase` contract as the diffusers guide ({doc}`integrating_a_diffusion_model`), with one key difference: **override `build_module()`** to use your custom loading path.

### 3.1 Override `build_module`

```python
@DiffusionModelBase.register("OmniMyModelForConditionalGeneration", algorithm="flow_grpo")
class MyModelDiffusion(DiffusionModelBase):
    @classmethod
    def build_module(cls, model_config: DiffusionModelConfig, torch_dtype: torch.dtype):
        logger.info("Loading MyModelForTraining from %s", model_config.local_path)
        return MyModelForTraining.from_pretrained(model_config.local_path, torch_dtype=torch_dtype)
```

When `build_module()` returns a non-`None` value, the FSDP engine uses it directly. When it returns `None` (the default), the engine falls back to `diffusers.AutoModel.from_pretrained`.

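The fallback rule above can be pictured with a small, self-contained sketch. Everything in it is hypothetical and exists only for illustration: `resolve_training_module`, `fake_auto_model_loader`, the two stub adapter classes, and the dict used in place of a real model config. The actual FSDP engine code in VeRL-Omni may be organised quite differently.

```python
from typing import Any, Callable, Optional


class DefaultAdapter:
    """Stub adapter that keeps the default behaviour (build_module returns None)."""

    @classmethod
    def build_module(cls, model_config: Any, torch_dtype: Any) -> Optional[Any]:
        return None


class CustomAdapter(DefaultAdapter):
    """Stub adapter that supplies its own module, as a non-diffusers model would."""

    @classmethod
    def build_module(cls, model_config: Any, torch_dtype: Any) -> Optional[Any]:
        # A real adapter would call MyModelForTraining.from_pretrained(...) here.
        return {"source": "custom", "path": model_config["local_path"]}


def fake_auto_model_loader(model_config: Any, torch_dtype: Any) -> Any:
    """Stand-in for the diffusers.AutoModel.from_pretrained fallback."""
    return {"source": "diffusers", "path": model_config["local_path"]}


def resolve_training_module(
    adapter_cls: type,
    model_config: Any,
    torch_dtype: Any,
    default_loader: Callable[[Any, Any], Any],
) -> Any:
    """Hypothetical helper: prefer the adapter's module, else use the default loader."""
    module = adapter_cls.build_module(model_config, torch_dtype)
    if module is None:
        module = default_loader(model_config, torch_dtype)
    return module


cfg = {"local_path": "/tmp/my_model"}  # toy config; the real one exposes .local_path
custom = resolve_training_module(CustomAdapter, cfg, "bfloat16", fake_auto_model_loader)
default = resolve_training_module(DefaultAdapter, cfg, "bfloat16", fake_auto_model_loader)
```

In this sketch `custom` comes from the adapter and `default` comes from the fallback loader, which mirrors why a non-diffusers adapter must always return a module from `build_module()`.
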
### 3.2 Implement `prepare_model_inputs`

This classmethod receives the training module, model config, latents, timesteps, prompt embeddings (plus masks), and the micro-batch from the trainer engine, returning a pair of model-kwargs dicts (positive and negative). See {doc}`integrating_a_diffusion_model` for the full signature.

Note: non-diffusers models often ignore the ``prompt_embeds*`` parameters and read token IDs directly from ``micro_batch`` instead (see the BAGEL integration's ``_prompt_token_ids_to_batch`` for an example).

### 3.3 Implement `forward_and_sample_previous_step`

This follows the same pattern as the diffusers guide. If your model supports classifier-free guidance, implement multi-branch forwarding (e.g. conditional + unconditional) and combine the outputs before calling ``scheduler.sample_previous_step()``. The BAGEL integration demonstrates a 3-branch CFG with sigma-interval gating as a reference.

The return signature is always ``(log_prob, prev_sample_mean, std_dev_t, sqrt_dt)``.

### 3.4 Implement the Scheduler

Non-diffusers models use ``FlowMatchSDEDiscreteScheduler`` just like diffusers models. If your model needs custom sigma schedules (e.g. time-shifted sigmas), place the setup logic in a shared ``common.py`` so both the training and rollout adapters use the identical schedule. See the BAGEL integration's ``setup_bagel_sigmas`` for a worked example.

---

## Step 4 — Write the Rollout Adapter

Create `verl_omni/pipelines/<model>_flow_grpo/vllm_omni_rollout_adapter.py`.

This is nearly identical to the diffusers guide — subclass the upstream vllm-omni pipeline and wrap with the SDE scheduler. Key considerations for non-diffusers models:

**Prompt format.** The verl-omni agent loop ships token IDs in ``req.prompts[0]["prompt_token_ids"]``, but the upstream vllm-omni pipeline may expect text strings. If needed, add a decode workaround in your adapter's ``forward()`` (see the BAGEL integration's ``_ensure_bagel_prompt_text`` for an example).

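A decode workaround of this kind might look like the sketch below. It is not the BAGEL helper itself: `ensure_prompt_text`, the `"prompt"` text key, and `ToyTokenizer` are assumptions made for illustration. The only interface relied on is a tokenizer object exposing `decode(token_ids, skip_special_tokens=...)`, as Hugging Face tokenizers do.

```python
from typing import Any, Dict, List


class ToyTokenizer:
    """Tiny stand-in for a real tokenizer; only decode() is needed here."""

    def __init__(self, vocab: Dict[int, str]):
        self.vocab = vocab

    def decode(self, token_ids: List[int], skip_special_tokens: bool = True) -> str:
        # The toy vocabulary has no special tokens, so the flag is unused.
        return " ".join(self.vocab[i] for i in token_ids)


def ensure_prompt_text(prompt: Dict[str, Any], tokenizer: Any) -> Dict[str, Any]:
    """Return a prompt dict that carries a text string (hypothetical helper).

    If the upstream pipeline needs text but only `prompt_token_ids` were
    shipped, decode the IDs back to a string. The input dict is not mutated.
    """
    text = prompt.get("prompt")
    if isinstance(text, str) and text:
        return prompt  # text already present, nothing to do
    token_ids = prompt.get("prompt_token_ids")
    if token_ids is None:
        raise ValueError("prompt has neither text nor prompt_token_ids")
    patched = dict(prompt)
    patched["prompt"] = tokenizer.decode(list(token_ids), skip_special_tokens=True)
    return patched


tokenizer = ToyTokenizer({1: "a", 2: "red", 3: "cube"})
request_prompt = {"prompt_token_ids": [1, 2, 3]}
patched_prompt = ensure_prompt_text(request_prompt, tokenizer)
```

Decoding and re-encoding does not round-trip exactly for every tokenizer, so a workaround like this should be checked against the token sequences the training side sees.
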
**Scheduler adapter.** Some upstream pipelines use non-standard ``step()`` argument conventions. If the standard ``FlowMatchSDEDiscreteScheduler`` is incompatible, wrap it in a lightweight adapter that reshapes inputs and outputs to match the pipeline's expectation. Most models will not need this — ``FlowMatchSDEDiscreteScheduler`` works directly with the standard interface.

**SDE window.** ``forward()`` must set up an SDE window (selecting a subset of denoising steps), compensate for any vllm-omni version-specific step-count quirks, and slice the trajectory to return only the windowed steps.

---

## Step 5 — Wire Up Registries and Package

### 5.1 `__init__.py`

Export the adapter classes so their `@register(...)` decorators run on import. The model module is typically imported by the training adapter directly and does not need to be in the public API, but including it is fine:

```python
from .diffusers_training_adapter import MyModelDiffusion
from .vllm_omni_rollout_adapter import MyModelPipelineWithLogProb

__all__ = ["MyModelDiffusion", "MyModelPipelineWithLogProb"]
```

### 5.2 Register in `verl_omni/pipelines/__init__.py`

```python
from .my_model_flow_grpo import *  # noqa: F401, F403

__all__ += my_model_flow_grpo.__all__
```

### 5.3 Architecture String

The architecture string passed to both `@DiffusionModelBase.register()` and `@VllmOmniPipelineBase.register()` must be consistent. For non-diffusers models, there is no `model_index.json` to auto-detect from, so users must pass the architecture explicitly on the CLI:

```bash
+actor_rollout_ref.model.architecture=OmniMyModelForConditionalGeneration
```

---

## Step 6 — Add a Data Preprocessor

The data preprocessor must match the tokenization used by the upstream pipeline. For most models the agent loop's ``prompts`` tensor is enough. If the upstream pipeline processes prompts differently from the default chat template (e.g. it has its own tokenization path), the data preprocessor must produce token sequences consistent with what the pipeline expects.

The BAGEL integration demonstrates this pattern: the preprocessor uses a model-specific ``tokenize_<model>_prompt()`` wrapper to match the pipeline's ``prepare_prompts`` output, storing the result as a pre-tokenized column that the training adapter reads via ``_prompt_token_ids_to_batch()``. (See ``examples/flowgrpo_trainer/data_process/`` for reference implementations.)

---

## Step 7 — Add a Smoke Test

Follow the same pattern as the diffusers guide (Step 6 of {doc}`integrating_a_diffusion_model`), but with these additions:

1. The dummy data must include the ``prompt`` chat messages (standard batch ``prompts`` / ``attention_mask`` from the agent loop).
2. The architecture override must be passed explicitly: `+actor_rollout_ref.model.architecture=OmniMyModelForConditionalGeneration`.

---

## Reference: BAGEL Implementation Checklist

The BAGEL integration is the canonical non-diffusers example.
Use this checklist to verify your implementation against it:

### Model module (`bagel_model.py`)

- [ ] `BagelTrainingConfig` dataclass with `save_pretrained()` and `from_model_path()`
- [ ] `BagelForTraining(NonDiffusersModelBase)` with:
  - [ ] `_no_split_modules = ["BagelMoTLayer"]`
  - [ ] `_supports_gradient_checkpointing = True`
  - [ ] `forward()` with gradient checkpointing via `_checkpointed_call()`
  - [ ] `from_pretrained()` loading from `ema.safetensors` with key remapping
- [ ] Token embedding, timestep embedding, VAE projection, position embedding
- [ ] MoT dual-pathway attention (text `*_proj` + gen `*_moe_gen`)
- [ ] SOI/EOI boundary token handling

### Training adapter (`diffusers_training_adapter.py`)

- [ ] `@DiffusionModelBase.register("OmniBagelForConditionalGeneration", algorithm="flow_grpo")`
- [ ] `build_module()` returns `BagelForTraining.from_pretrained(...)`
- [ ] `build_scheduler()` and `set_timesteps()` with shifted sigmas
- [ ] `prepare_model_inputs()` reads ``prompts`` and ``attention_mask`` from micro-batch (standard tensors, no extra field needed)
- [ ] `forward_and_sample_previous_step()` with 3-branch CFG combining

### Rollout adapter (`vllm_omni_rollout_adapter.py`)

- [ ] `@VllmOmniPipelineBase.register("OmniBagelForConditionalGeneration", algorithm="flow_grpo")`
- [ ] Subclasses `BagelPipeline` from vllm-omni
- [ ] Wraps scheduler in `_BagelSchedulerAdapter` for 4-arg `step()` convention
- [ ] SDE `step()` passes batched `(1, tokens, C)` tensors so log-probs match training
- [ ] `_ensure_bagel_prompt_text()` workaround for text-prompt requirement
- [ ] `forward()` sets up SDE window, vllm-omni 0.22 timestep compensation, returns sliced trajectory

### Shared utilities (`common.py`)

- [ ] `setup_bagel_sigmas()` — shared sigma schedule for rollout and training
- [ ] `bagel_time_shift()` — SD3-style timestep shift of `3.0`
- [ ] CFG defaults (`BAGEL_FLOWGRPO_CFG_DEFAULTS`) — consistent between adapters

### Data preprocessor (see ``examples/flowgrpo_trainer/data_process/``)

- [ ] Stores prompts in standard chat-message format (``prompt`` key)
- [ ] (BAGEL only) Pre-tokenizes captions via ``tokenize_bagel_prompt()`` and the training adapter reads them via ``_prompt_token_ids_to_batch()``

### Wiring

- [ ] `verl_omni/pipelines/bagel_flow_grpo/__init__.py` re-exports the two adapter classes (``BagelDiffusion`` and ``BagelPipelineWithLogProb``)
- [ ] `verl_omni/pipelines/__init__.py` imports `bagel_flow_grpo`
- [ ] Example launch script at `examples/flowgrpo_trainer/bagel/run_bagel_ocr_lora.sh`
- [ ] Deploy config at `examples/flowgrpo_trainer/bagel/bagel_deploy_config.yaml`

---

## When to Use the Diffusers Path Instead

If diffusers can load the model, use {doc}`integrating_a_diffusion_model`. Override ``build_module()`` there for any custom loading you need.