Towards speed-of-light text generation with Nemotron-Labs diffusion language models
At a glance:
- Nemotron-Labs Diffusion offers three generation modes—autoregressive, diffusion, and self‑speculation—within a single model family.
- The 8B diffusion model delivers 2.6× higher tokens‑per‑forward‑pass than standard AR models, while self‑speculation reaches up to 6.4× speedup.
- NVIDIA releases 3B, 8B and 14B text models under the Nemotron Open Model License, plus an 8B vision‑language variant under the NVIDIA Source Code License, with training code via Megatron Bridge.
What Nemotron‑Labs Diffusion is
Nemotron‑Labs Diffusion is a new class of diffusion language models (DLMs) that generate text in parallel blocks and then iteratively refine those blocks. Unlike traditional autoregressive (AR) models that emit one token at a time, DLMs can draft multiple tokens simultaneously, allowing modern GPUs to spend more cycles on computation rather than memory fetches. The approach also introduces a natural revision mechanism: generated tokens can be updated in later refinement steps, reducing the propagation of early mistakes.
The research builds on the Efficient‑DLM concept, which showed that a pretrained AR model can be converted into a diffusion model by continuing pre‑training with a block‑wise attention scheme. This preserves the strengths of the original AR model—stability, KV‑cache friendliness, and strong baseline accuracy—while unlocking parallel decoding. NVIDIA’s implementation adds a joint AR‑diffusion objective, letting the same checkpoint serve both paradigms.
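To make the block‑wise attention idea concrete, here is a minimal sketch of one common block‑causal pattern: a position may attend to every earlier block and to its whole own block. The article does not spell out the exact mask used by Efficient‑DLM or Nemotron‑Labs Diffusion, so the function names and the masking rule below are illustrative assumptions.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed rule (not confirmed by the article): attention is bidirectional
    inside a block and causal across blocks.
    """
    return [
        [j // block_size <= i // block_size for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_mask(seq_len: int) -> list[list[bool]]:
    """Standard left-to-right mask used by autoregressive decoding."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Print a small mask: 'x' means "may attend", '.' means "blocked".
for row in block_attention_mask(seq_len=6, block_size=3):
    print("".join("x" if allowed else "." for allowed in row))
```

With a block size of 1 this mask collapses to the ordinary causal mask, which is one way to see why a single set of weights could plausibly serve both decoding styles.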
Three generation modes in one model
Nemotron‑Labs Diffusion supports three distinct inference pathways:
- Autoregressive mode – behaves like any conventional left‑to‑right LLM, useful for compatibility checks or workloads that demand deterministic token‑by‑token output.
- Diffusion mode – fills a 32‑token block at a time, iteratively denoising until a confidence threshold marks tokens as “good enough.” This mode maximises raw throughput.
- Self‑speculation mode – drafts a block bidirectionally with diffusion, then verifies each token using a fast AR pass. Linear self‑speculation yields a 6× speed boost, while quadratic self‑speculation reaches 6.4×, all with accuracy comparable to the AR baseline.
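The confidence‑threshold loop described for diffusion mode can be sketched with a toy example. Everything here is a stand‑in: `toy_forward` is a hypothetical replacement for a real denoising pass, confidences are made‑up integers on a 0 to 100 scale, and the "commit at least one token" fallback is a common safeguard rather than something the article confirms.

```python
MASK = "<mask>"


def toy_forward(block, step):
    """Hypothetical stand-in for one denoising forward pass.

    Returns a (token, confidence_percent) guess for every position.
    Toy rule: left positions become confident sooner, and confidence
    grows with each pass.
    """
    return [
        (f"tok{pos}", min(100, 95 - 10 * pos + 20 * step))
        for pos in range(len(block))
    ]


def denoise_block(block_size, threshold):
    """Fill one block, committing tokens whose confidence meets the threshold."""
    block = [MASK] * block_size
    passes = 0
    while MASK in block:
        guesses = toy_forward(block, passes)
        passes += 1
        masked = [p for p in range(block_size) if block[p] == MASK]
        accepted = [p for p in masked if guesses[p][1] >= threshold]
        if not accepted:
            # Guarantee progress: commit the single most confident guess.
            accepted = [max(masked, key=lambda p: guesses[p][1])]
        for p in accepted:
            block[p] = guesses[p][0]
    return block, passes


block, passes = denoise_block(block_size=4, threshold=90)
print(block, "filled in", passes, "forward passes")
```

Lowering the threshold commits more tokens per pass at the risk of keeping weaker guesses, which is the speed and accuracy trade‑off the mode exposes.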
Switching between these modes requires only a single configuration flag at deployment time, meaning developers can keep their existing application code and experiment with speed‑accuracy trade‑offs on the fly.
Performance highlights
In NVIDIA’s internal benchmarks, the Nemotron‑Labs Diffusion 8B model achieved an average accuracy improvement of 1.2% over the competing Qwen3 8B model. Measured in tokens per forward pass (TPF), diffusion mode delivered a 2.6× increase over standard AR decoding. Self‑speculation went further: the linear variant attained a 6× TPF gain and the quadratic variant 6.4×, while maintaining comparable task‑level accuracy across the evaluated suite.
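As a rough illustration of how TPF arises in a draft‑then‑verify scheme, the toy below drafts a block, keeps the prefix the verifier agrees with, and divides tokens emitted by forward passes used. The drafter, the verifier, the acceptance rule, and the choice to count each draft and each verification as one pass are all assumptions; the article does not describe how the linear and quadratic variants differ, so neither is modeled.

```python
TARGET = list("the quick brown fox")  # toy output the verifier would produce


def noisy_draft(prefix_len, block_size):
    """Hypothetical drafter: copies TARGET but garbles every fifth position."""
    chunk = TARGET[prefix_len:prefix_len + block_size]
    return ["?" if (prefix_len + i) % 5 == 4 else tok
            for i, tok in enumerate(chunk)]


def verify(prefix_len, block_size):
    """Hypothetical verifier: the tokens AR decoding would have chosen."""
    return TARGET[prefix_len:prefix_len + block_size]


def accept(draft, verified):
    """Keep drafted tokens until the first disagreement, then take the
    verifier's token at that position and stop."""
    kept = []
    for drafted, checked in zip(draft, verified):
        kept.append(checked)
        if drafted != checked:
            break
    return kept


def self_speculate(block_size):
    output, passes = [], 0
    while len(output) < len(TARGET):
        draft = noisy_draft(len(output), block_size)   # pass 1: parallel draft
        verified = verify(len(output), block_size)     # pass 2: AR-style check
        passes += 2
        output.extend(accept(draft, verified))
    return output, passes


output, passes = self_speculate(block_size=8)
tpf = len(output) / passes
print("".join(output), "| passes:", passes, "| TPF:", tpf)
```

Because the verifier's token replaces the first rejected draft token, the final text matches what the verifier alone would have produced, which is the intuition behind accuracy staying close to the AR baseline.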
On a B200 GPU running the speedbench dataset, the self‑speculation (LinearSpec) configuration reached roughly 865 tokens per second—about four times the AR baseline on identical hardware. These numbers illustrate how parallel block generation can tap the full computational bandwidth of modern GPUs, especially in latency‑sensitive, batch‑size‑one scenarios.
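A quick back‑of‑the‑envelope calculation shows what those figures imply for a single response. The 865 tokens per second and the roughly 4× ratio come from the paragraph above; the 1,000‑token response length is an arbitrary example, and because "about four times" is approximate, the derived AR number is only an estimate.

```python
spec_tokens_per_s = 865.0  # reported LinearSpec throughput on a B200
speedup_over_ar = 4.0      # "about four times", so treat as approximate

# Implied AR baseline throughput on the same hardware (an estimate).
ar_tokens_per_s = spec_tokens_per_s / speedup_over_ar


def seconds_for(tokens: int, tokens_per_s: float) -> float:
    """Wall-clock time to generate `tokens` at a steady decoding rate."""
    return tokens / tokens_per_s


response_tokens = 1000  # hypothetical response length
print(f"AR baseline: ~{ar_tokens_per_s:.0f} tok/s, "
      f"{seconds_for(response_tokens, ar_tokens_per_s):.2f} s per response")
print(f"LinearSpec:  ~{spec_tokens_per_s:.0f} tok/s, "
      f"{seconds_for(response_tokens, spec_tokens_per_s):.2f} s per response")
```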
How the models were trained
All Nemotron‑Labs Diffusion models were first pretrained on 1.3 trillion tokens from NVIDIA’s Nemotron pre‑training corpora. After this AR‑focused stage, the models underwent a joint AR‑diffusion fine‑tuning phase using an additional 45 billion tokens drawn from the Nemotron post‑training datasets. This two‑stage regimen allowed the models to retain the strong language understanding of the original AR checkpoint while acquiring the parallel drafting capability of diffusion.
Training used the NVIDIA Megatron Bridge framework, which provides a unified codebase for both AR and diffusion objectives. The resulting models, available at 3B, 8B, and 14B parameter scales for text plus an 8B vision‑language variant, are released under the commercially friendly NVIDIA Nemotron Open Model License (text models) or the NVIDIA Source Code License (vision‑language model), encouraging broad research and commercial adoption.
Deployment and inference through SGLang
Support for Nemotron‑Labs Diffusion is being added to the main branch of SGLang, an open‑source, high‑performance serving framework. Developers can select the desired mode with a single line in the SGLang configuration:
- ar_mode=true for plain autoregressive decoding.
- diffusion_mode=true (FastDiffuser) for block‑wise diffusion.
- self_speculation=true (LinearSpec) for the bidirectional draft‑then‑verify workflow.
The integration enables serving the same checkpoint in three ways without duplicating model files, simplifying operations and reducing storage overhead. At the time of writing, the feature is available through an open request on the SGLang issue tracker on GitHub, and NVIDIA plans to merge full support into the upcoming SGLang release.
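Since the three modes are alternatives, a deployment script might want to guarantee that exactly one flag is enabled. The helper below is a hypothetical sketch: the flag names are taken from the list above, but the article does not say how SGLang actually accepts them (command line, config file, or otherwise), and this code does not call SGLang at all.

```python
# Flag names as given in the article; the short mode keys are illustrative.
MODE_FLAGS = {
    "ar": "ar_mode",
    "diffusion": "diffusion_mode",
    "self_speculation": "self_speculation",
}


def build_mode_config(mode: str) -> dict[str, bool]:
    """Return a flag dictionary with exactly one generation mode enabled."""
    if mode not in MODE_FLAGS:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(MODE_FLAGS)}"
        )
    return {flag: (name == mode) for name, flag in MODE_FLAGS.items()}


config = build_mode_config("diffusion")
print(" ".join(f"{flag}={str(value).lower()}" for flag, value in config.items()))
```

Keeping the choice in one place means the rest of the application code stays the same when switching modes for a speed and accuracy comparison.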
Getting started
Developers interested in experimenting can pull the Nemotron‑Labs Diffusion models from Hugging Face, review the technical report for deeper architectural details, and follow the publicly available training recipe on GitHub. Because the models retain full AR compatibility, existing pipelines can be upgraded incrementally, testing diffusion or self‑speculation modes only where latency or throughput gains are most needed.
Prepared by the editorial stack from public data and external sources.