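DiffusionGemma: 4x faster text generation

Summary:
Google releases DiffusionGemma: a new open model with 4x faster text generation
Today, Google officially launched DiffusionGemma, an experimental open text diffusion model released under the Apache 2.0 license. This 26B-parameter Mixture of Experts (MoE) model moves beyond the token-by-token generation of traditional autoregressive large language models and generates entire blocks of text at once, delivering up to 4x faster text generation on GPUs.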
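Core technical breakthroughs
DiffusionGemma builds on the leading intelligence-per-parameter of the Gemma 4 family and the latest Gemini Diffusion research, and integrates a novel diffusion head designed for speed. The model activates only 3.8B parameters during inference, fits within the 18GB VRAM of a high-end consumer GPU when quantized, and generates 1000+ tokens per second on a single NVIDIA H100 and 700+ tokens per second on an NVIDIA GeForce RTX 5090.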
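Developer value and use cases
DiffusionGemma is designed for local workflows that need fast interaction, and is especially suited to in-line editing, rapid iteration, and generating non-linear text structures. Its bi-directional attention generates 256 tokens in parallel with each forward pass, so every token can attend to all others, which gives it a clear advantage in non-linear tasks such as code infilling, amino acid sequences, and mathematical graphs. The model also self-corrects: it can evaluate an entire text block and fix mistakes in real time.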
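Performance trade-offs and positioning
As an experimental model that prioritizes speed, DiffusionGemma delivers lower overall output quality than standard Gemma 4. Google explicitly recommends deploying standard Gemma 4 for production applications that demand maximum quality. The speedup applies best to local and low-concurrency inference; in high-concurrency cloud serving, autoregressive models make more efficient use of compute.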
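Open ecosystem and tooling support
Developers can download the model weights from Hugging Face (Apache 2.0 license) and deploy them with MLX, vLLM (with integration supported by Red Hat), and Hugging Face Transformers. Google also released a fine-tuning tutorial using Hackable Diffusion, and fine-tuning is available with Unsloth and NVIDIA NeMo. On the hardware side, the model has been optimized with NVIDIA and runs on everything from GeForce RTX 5090/4090 consumer GPUs to Hopper/Blackwell enterprise systems, including NVIDIA DGX Spark, DGX Station, and RTX PRO workstations.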
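How text diffusion works, in brief
Much as AI image generators gradually sharpen an image out of noise, DiffusionGemma starts from random placeholder tokens, then locks in correct tokens and refines the rest over multiple iterations until the text converges into high-quality output. This parallel processing enables new behaviors such as perfectly closing complex Markdown formatting and generating and rendering code in near real time.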
English source:
DiffusionGemma: 4x faster text generation
Today, we’re introducing DiffusionGemma, an experimental open model that explores text diffusion, an exceptionally fast approach to text generation. Released under an Apache 2.0 license, this 26B Mixture of Experts (MoE) model moves beyond the sequential token-by-token processing of typical autoregressive Large Language Models (LLMs). Instead, it generates entire blocks of text simultaneously, delivering up to 4x faster text generation on GPUs.
Built upon the industry-leading intelligence-per-parameter of our Gemma 4 family and cutting-edge Gemini Diffusion research, DiffusionGemma integrates a novel diffusion head designed to maximize generation speed. While autoregressive Gemma 4 models remain the standard for high-quality production outputs, DiffusionGemma is designed for researchers and developers exploring speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures.
Unlocking new value for developers
Developers building real-time interactive AI applications often struggle with the latency bottlenecks of local inference. DiffusionGemma addresses these challenges directly, with some key trade-offs:
- Blazing fast inference: By shifting the decode bottleneck from memory bandwidth to compute, DiffusionGemma delivers up to 4x faster token output on dedicated GPUs (1000+ tokens per second on a single NVIDIA H100, 700+ tokens per second on an NVIDIA GeForce RTX 5090).
- Accessible hardware footprint: Operating as a 26B total Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, DiffusionGemma fits comfortably within 18GB VRAM limits of high-end dedicated consumer GPUs when quantized.
- Bi-directional attention: Generating 256 tokens in parallel with each forward pass allows every token to attend to all others. This provides significant advantages for non-linear domains such as in-line editing, code infilling, amino acid sequences or mathematical graphs.
- Intelligent self-correction: The model iteratively refines its own output, allowing it to evaluate the entire text block at once to fix mistakes in real-time.
- Experimental status & production recommendations: Because it prioritizes speed and parallel layout generation, DiffusionGemma’s overall output quality is lower than standard Gemma 4. For applications that demand maximum quality, we recommend deploying standard Gemma 4.
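The 18GB figure in the hardware-footprint bullet can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an illustration only: the bit widths are assumptions (the post does not say which quantization format reaches 18GB), `weight_gb` is a hypothetical helper, and it counts weights only, ignoring activations and runtime overhead.

```python
TOTAL_PARAMS = 26e9    # total MoE parameters, from the post
ACTIVE_PARAMS = 3.8e9  # parameters activated during inference, from the post


def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


# Bit widths below are illustrative assumptions, not published configurations.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    total = weight_gb(TOTAL_PARAMS, bits)
    active = weight_gb(ACTIVE_PARAMS, bits)
    print(f"{label}: {total:.1f} GB for all weights, {active:.1f} GB touched per token")
```

Under these assumptions, only a roughly 4-bit format brings all 26B weights under 18GB. An MoE model typically needs every expert resident in memory even though only a fraction is active for a given token, which is why the total count, not the active count, sets the VRAM requirement.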
You can improve DiffusionGemma's performance on specific tasks through fine-tuning. In the example below, Unsloth fine-tuned DiffusionGemma to play Sudoku — a task autoregressive models struggle with because each token depends on future tokens. DiffusionGemma's bi-directional attention makes this much easier.
Fine-tuned DiffusionGemma solving Sudoku.
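To make the contrast with autoregressive models concrete, here is a minimal sketch of the two attention patterns. It is a generic illustration of causal versus bi-directional masks, not DiffusionGemma's implementation; the function names are hypothetical and the block size is shrunk from 256 to 6 for readability.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[bool]]:
    """Every position may attend to every position in the block."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask: list[list[bool]], i: int) -> list[int]:
    """Indices that position i is allowed to look at."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


n = 6  # illustrative block size; the post describes 256-token blocks
print(visible_positions(causal_mask(n), 1))         # [0, 1]
print(visible_positions(bidirectional_mask(n), 1))  # [0, 1, 2, 3, 4, 5]
```

In a Sudoku grid, the right value for an early cell depends on cells that appear later in the sequence. With the causal mask, position 1 never sees them; with the bi-directional mask it does.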
Why diffusion for text?
While the AI research community has explored diffusion-based text generation for years, applying it to large models has remained a challenge. DiffusionGemma changes this by shifting how models use hardware.
The trade-off with traditional models
Most language models act like a typewriter, generating one token at a time from left to right. In the cloud, this is efficient because servers can batch thousands of user requests together to share the hardware load. But when run locally for a single user, this word-by-word process leaves your dedicated GPU or TPU underutilized — it spends most of its time simply waiting for the next "keystroke."
DiffusionGemma reverses this inefficiency. Instead of predicting words sequentially, it drafts an entire 256-token paragraph simultaneously. By giving the computer's processor a larger chunk of work at once, DiffusionGemma utilizes your hardware to its full potential. It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously.
DiffusionGemma text-to-3D SVG demo by Hugging Face. Step-by-step generation.
This means DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs. The throughput advantage is strongest at low-to-medium batch sizes on a single accelerator.
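The bottleneck shift described above can be sketched with a simple roofline-style model: a forward pass takes as long as the slower of reading the weights and doing the arithmetic. Everything numeric below is a hypothetical assumption chosen for illustration (the bandwidth, peak FLOP/s, 8-bit weights, and especially `passes_per_block`, which the post does not state), and the model ignores MoE routing effects, attention cost, and kernel overheads. It is not a reproduction of the published benchmarks.

```python
ACTIVE_PARAMS = 3.8e9               # active parameters per token, from the post
WEIGHT_BYTES = ACTIVE_PARAMS * 1.0  # assumption: 8-bit weights, one byte each
FLOPS_PER_TOKEN = 2 * ACTIVE_PARAMS  # common rule of thumb: ~2 FLOPs per parameter
BANDWIDTH = 1e12                    # assumption: 1 TB/s memory bandwidth
PEAK_FLOPS = 1e14                   # assumption: 100 TFLOP/s usable compute


def pass_time_s(tokens_per_pass: int) -> tuple[float, str]:
    """Time for one forward pass and which resource limits it."""
    memory_time = WEIGHT_BYTES / BANDWIDTH  # weights are read once per pass
    compute_time = tokens_per_pass * FLOPS_PER_TOKEN / PEAK_FLOPS
    if memory_time >= compute_time:
        return memory_time, "memory"
    return compute_time, "compute"


def tokens_per_second(tokens_per_pass: int, passes_per_block: int = 1) -> float:
    """Finished tokens per second when each block needs several passes."""
    seconds, _ = pass_time_s(tokens_per_pass)
    return tokens_per_pass / (passes_per_block * seconds)


scenarios = [
    ("autoregressive, batch 1", 1, 1),
    ("diffusion, 256-token block", 256, 16),  # 16 passes is a made-up value
    ("autoregressive, batch 64", 64, 1),
]
for name, tokens, passes in scenarios:
    _, bound = pass_time_s(tokens)
    rate = tokens_per_second(tokens, passes)
    print(f"{name}: {bound}-bound, about {rate:,.0f} tokens/s")
```

With these made-up numbers, single-user autoregressive decoding is memory-bound, while the 256-token block is compute-bound and finishes tokens faster despite needing multiple passes. The batch-64 case shows why the advantage shrinks in high-QPS serving: batching already pushes the autoregressive model into the compute-bound regime.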
How text diffusion works
Similar to AI image generators that start with visual static and iteratively refine it into a clear picture, DiffusionGemma applies this to text:
- The canvas: The model starts with a canvas of random placeholder tokens.
- Iterative refinement: The model makes multiple passes, locking in correct tokens and using them as context clues to refine the rest.
- Final polish: The text converges into high-quality output.
Because the model can process the whole paragraph while generating, it unlocks new patterns of model behavior, like perfectly closing complex markdown formatting or generating and rendering code in near real-time.
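The canvas, refinement, and convergence steps can be sketched as a toy decoding loop. This is a simplified illustration under stated assumptions, not DiffusionGemma's algorithm: `toy_denoiser` is a hypothetical stand-in that peeks at a known target instead of running a model, the canvas uses a single `_` placeholder rather than random tokens, characters stand in for tokens, and revising already locked tokens (self-correction) is not modeled.

```python
import random

MASK = "_"  # assumption: one placeholder symbol; the post mentions random placeholders


def toy_denoiser(canvas: list[str], target: str, rng: random.Random) -> dict:
    """Stand-in for the model: propose (token, confidence) for each masked slot.

    A real model would predict these from the whole canvas at once; this toy
    version reads the known target so the loop is runnable.
    """
    return {i: (target[i], rng.random()) for i, tok in enumerate(canvas) if tok == MASK}


def diffusion_decode(target: str, tokens_per_step: int = 4, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    canvas = [MASK] * len(target)  # 1. the canvas: all placeholders
    history = ["".join(canvas)]
    while MASK in canvas:          # 2. iterative refinement
        proposals = toy_denoiser(canvas, target, rng)
        # Lock in the most confident proposals; the rest stay masked for later passes.
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:
            canvas[i] = proposals[i][0]
        history.append("".join(canvas))
    return history                 # 3. the last snapshot is the converged text


for step, snapshot in enumerate(diffusion_decode("print('hello')")):
    print(step, snapshot)
```

Each printed snapshot shows positions filling in out of order, which is the property that lets a diffusion model commit to a closing bracket or fence before the text in between is settled.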
Get started today
- Download the weights: Access the experimental model weights (released under a permissive Apache 2.0 license) right now on Hugging Face.
- Integrate & learn: Learn more in our DiffusionGemma developer guide. Or deep dive into A Visual Guide to DiffusionGemma to understand the mechanics under the hood.
- Use your favorite development tools: Serve the model efficiently using MLX, vLLM (with integration supported by Red Hat), and Hugging Face Transformers. For rapid experimentation, we are releasing a fine-tuning tutorial using Hackable Diffusion, a modular JAX toolbox designed for composability. You can also explore fine-tuning with Unsloth and NVIDIA NeMo. Additionally, official support for llama.cpp is arriving soon.
- Experience optimized performance: We worked with NVIDIA to optimize across their hardware stack, ensuring compatibility with consumer setups (quantized for GeForce RTX 5090 and 4090 GPUs) alongside high performance on enterprise systems (Hopper and Blackwell using advanced NVFP4 kernels), including NVIDIA DGX Spark and DGX Station for local deskside deployment, and RTX PRO for AI professionals. Native support for NVFP4 (4-bit floating-point) accelerates compute throughput, allowing the model to run at faster speeds with near-lossless accuracy.
- Try your way: Run on your desktop dedicated GPU or in the cloud through Gemini Enterprise Agent Platform Model Garden or NVIDIA NIM.
Article title: DiffusionGemma: 4x faster text generation
Article link: https://news.qimuai.cn/?post=4321