Accelerating Gemini Nano Models on Pixel Devices with Frozen Multi-Token Prediction

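Summary:
Google introduces a new on-device AI acceleration architecture: Multi-Token Prediction lands on the Pixel 9 and 10 series
On June 26, 2026, Eden Cohen, Research Product Manager, and Michelle Ramanovich, Research Manager, both of Google Platforms and Devices, jointly announced a technical advance: by retrofitting a Multi-Token Prediction (MTP) module onto already-deployed, frozen production models, on-device inference becomes significantly faster and less energy-hungry without degrading model capability.
Today, on-device large language models such as Gemini Nano and Gemma already run locally on phones, instantly summarizing notifications or proofreading text messages without sending private user data off the device. Mobile devices, however, operate under a strict energy budget and limited memory, and standard "autoregressive" generation, which emits one token at a time, leaves compute idle while straining memory bandwidth. The result is a slower user experience and faster battery drain.
To address this bottleneck, the Google team developed an MTP architecture for the highly constrained mobile environment; it has already rolled out to the Pixel 9 and Pixel 10 series. Traditional speculative decoding requires deploying a separate standalone drafter model (for example, one with 128M parameters), which consumes scarce memory and is inefficient because it cannot use the semantic context the main model has already computed. MTP instead attaches a lightweight Transformer head (the MTP head) to the final layers of the backbone model (Gemini Nano v3) and reuses the hidden states the model has already produced to autoregressively predict several future tokens.
Key technical points:
- Zero-copy architecture: the MTP head keeps no history cache of its own and instead cross-attends directly to the main model's frozen key-value (KV) cache, avoiding duplicated memory. Compared with a standalone drafter, this saves 130 MB of memory per instance and eliminates the drafter's prefill latency.
- No loss of quality: because the backbone weights are fully frozen, MTP acts purely as an efficiency optimization. Incorrect drafts are discarded during verification, so the final output remains bit-for-bit identical to the original model's, giving full backward compatibility.
- Significant performance gains: on Pixel 9 devices, MTP speeds up on-device generation by 50% or more. In production workloads such as AI Notification Summaries and Proofread, it correctly predicts an average of nearly two additional tokens per inference pass, and the resulting drop in verification steps means heavy processors are woken less often, which extends battery life.
The technique removes a major pain point for developers: high-speed on-device AI no longer requires fine-tuning a separate, memory-heavy drafter model for every new task. Google plans to integrate MTP into future Pixel devices and to explore more efficient architectures, including parallel decoding and parallel exploration of branching paths, to further raise generation speed and verification acceptance under strict mobile constraints.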
English source:
June 26, 2026
Eden Cohen, Research Product Manager, and Michelle Ramanovich, Research Manager, Google Platforms and Devices
We introduce a method to retrofit Multi-Token Prediction onto frozen production models, accelerating on-device inference without the inefficiencies of separate drafters.
Having powerful Large Language Models (LLMs) right in your pocket is now a reality with on-device models like Gemini Nano and Gemma. This technology enables everyday features on your phone, such as instantly summarizing a flurry of notifications or proofreading an important text message, all without sending your private data off device. But to make these features useful for everyday users, they need to run very efficiently.
Delivering this kind of speed on a mobile device is a significant challenge. Unlike vast server environments, mobile phones operate under a strict energy budget and hard memory (RAM) limits. Furthermore, standard language models generate text "autoregressively" — meaning they process and output just one word (or token) at a time. This step-by-step process creates a bottleneck, underutilizing the phone's processing power while straining its memory bandwidth, which can ultimately slow down the user experience and drain the battery.
To overcome this bottleneck, we are announcing a new architecture that retrofits Multi-Token Prediction (MTP) onto existing, "frozen" Gemini Nano v3 models. Building on prior approaches like the EAGLE framework and Confident Adaptive Language Modeling (CALM), we designed new architectural components to maximize these efficiency gains specifically for mobile environments. Our recent announcements highlighted accelerating Gemma 4 with MTP and making it available to developers.
Today's article tackles the unique, extreme constraints of edge computing. Recently rolled out to the Pixel 9 and 10 series, this approach acts as an out-of-the-box speedup. For users, this means that features like AI Notification Summaries and Proofread generate text significantly faster and with less energy consumption. For developers, it eliminates a major friction point: delivering high-speed on-device AI without the need to fine-tune separate, memory-heavy drafting models for every new task.
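MTP builds upon the evolution of speculative decoding. In a traditional setup, generating N tokens requires N forward passes of the large model. Speculative decoding decouples this process into two parts: drafting, in which a small, fast drafter proposes several candidate tokens, and verification, in which the large model checks those candidates in a single parallel forward pass and keeps the ones it agrees with.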
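The two phases, drafting by a small model and verification by the large one, can be sketched in a few lines of Python. This is a minimal illustration under greedy decoding, not the Pixel inference stack: `target_model`, `drafter_model`, and the arithmetic rules inside them are hypothetical toys, and the "single pass" of the large model is simulated by counting one verification per draft round.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next token (greedy)


def target_model(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large model: a deterministic next-token rule."""
    return (prefix[-1] * 3 + len(prefix)) % 7


def drafter_model(prefix: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except when len(prefix) % 5 == 0."""
    guess = target_model(prefix)
    return (guess + 1) % 7 if len(prefix) % 5 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, drafter: Model, prompt: List[int],
                       n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns the tokens and the number of verify passes."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(tokens) < goal:
        # Drafting: the cheap model proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(tokens + draft))
        # Verification: a real engine scores all k + 1 positions in one batched
        # pass of the large model; here we call the toy target per position
        # and count the whole round as a single pass.
        passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(tokens + accepted)
            accepted.append(expected)  # always the target's own token
            if i == k or draft[i] != expected:
                break  # all drafts used up, or first mismatch corrected
        tokens.extend(accepted)
    return tokens[:goal], passes
```

Because every emitted token is the target's own choice, the output matches plain greedy decoding with the target alone; the drafter only changes how many large-model passes are needed.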
However, this results in some inefficiencies. Running a separate "standalone" drafter model (e.g., 128M parameters) competes for limited RAM. Furthermore, a standalone drafter is "blind" to the main model's rich internal state, predicting next tokens based solely on text history without the semantic context the main model has already computed. MTP addresses these inefficiencies by moving from a standalone architecture to an integrated one. Instead of training a separate small language model to draft tokens, we append a lightweight Transformer head, the MTP head, to the final layers of the main model.
This architecture, which uses a deep exit layer for drafting, leverages the work already performed by the main model’s backbone. The MTP head takes the final high-dimensional activations (hidden states) of the main model and uses them to autoregressively predict a sequence of future tokens.
While MTP heads are commonly pre-trained in tandem with the backbone — such as in our recent releases of Gemma 4 models — this is prohibitive when leveraging already-deployed on-device foundation models. Instead, our work focuses on retrofitting the drafter head to operate independently of the pre-training pipeline.
We take a fully trained Gemini Nano v3 model, freeze its weights, and attach a dense transformer stack — the MTP head — to the final layers. We train only these parameters to minimize the prediction error on future tokens. With a frozen backbone, MTP becomes strictly an efficiency optimization, ensuring no degradation in the base model's capabilities or safety alignment.
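The freeze-then-train recipe can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the "backbone" is a fixed lookup table of hidden states rather than Gemini Nano, the "head" is a single linear softmax layer rather than a Transformer stack, and the toy task (predict the token two steps ahead in a cyclic sequence) is invented. The point is only that gradient updates touch the head's parameters and nothing else.

```python
import math
from typing import List, Sequence

VOCAB = 4

# Frozen "backbone": a fixed table mapping each token to its final hidden state.
# Tuples are immutable, so training cannot modify these values.
BACKBONE_HIDDEN = tuple(
    tuple(1.0 if j == t else 0.1 for j in range(VOCAB)) for t in range(VOCAB)
)


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_logits(W: List[List[float]], h: Sequence[float]) -> List[float]:
    """Linear head: one row of weights per vocabulary entry."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def train_head(epochs: int = 300, lr: float = 0.5) -> List[List[float]]:
    """Train only the head with softmax cross-entropy; the backbone is never updated."""
    W = [[0.0] * VOCAB for _ in range(VOCAB)]
    for _ in range(epochs):
        for t in range(VOCAB):
            h = BACKBONE_HIDDEN[t]       # hidden state from the frozen backbone
            future = (t + 2) % VOCAB     # toy target: the token two steps ahead
            probs = softmax(head_logits(W, h))
            for c in range(VOCAB):
                grad = probs[c] - (1.0 if c == future else 0.0)
                for j in range(VOCAB):
                    W[c][j] -= lr * grad * h[j]  # gradient step on head weights only
    return W


def predict_future(W: List[List[float]], t: int) -> int:
    logits = head_logits(W, BACKBONE_HIDDEN[t])
    return max(range(VOCAB), key=lambda c: logits[c])
```

After `W = train_head()`, `predict_future(W, t)` recovers the token two steps ahead from the frozen hidden state, while `BACKBONE_HIDDEN` is exactly what it was before training.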
Because incorrect drafts are discarded during verification, the final output remains bit-for-bit identical to the main model, allowing us to roll out efficiency updates with full backward compatibility.
While standard MTP implementations optimize for training efficiency by sharing static parameters (like embedding weights) between the main model and the drafter, on-device inference faces a stricter bottleneck: dynamic memory. Even with shared weights, if a drafter processes context independently, it incurs a "double tax" on memory by generating and maintaining its own key-value (KV) cache. Given the limited memory on mobile, avoiding this redundancy is critical.
To solve this, we engineered a zero-copy architecture where the MTP head effectively leverages the main model's state. Instead of maintaining its own history, the MTP head is designed to cross-attend directly to the main model’s frozen KV cache. This allows the drafter to query the "memories" and context already computed by the backbone without duplication.
This design yields two efficiency gains. First, it eliminates drafter prefill latency: by utilizing the existing cache, the head requires no additional time to process the prompt. Second, it reduces the runtime memory footprint. We observed savings of 130 MB per instance compared to a standalone drafter, which come from eliminating the drafter's embedding lookup tables, prefill dot attention variants, and application-specific tuning parameters.
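To get a feel for the "double tax" of a second KV cache, the standard size formula (keys plus values, per layer, per head, per position) can be written down directly. The drafter dimensions used in the usage note are hypothetical, chosen only to show the order of magnitude; they are not Gemini Nano's or any real drafter's configuration, and this estimate is separate from the 130 MB figure, which the text attributes to other components.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of a KV cache: keys and values for every layer, head, and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def drafter_extra_kv_bytes(shares_main_cache: bool, num_layers: int,
                           num_kv_heads: int, head_dim: int, seq_len: int,
                           bytes_per_value: int = 2) -> int:
    """Extra KV memory a drafter adds on top of the main model's cache.

    A drafter that reads the main model's cache allocates no history cache of
    its own; a standalone drafter must build one for the same context.
    """
    if shares_main_cache:
        return 0
    return kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value)
```

For an assumed standalone drafter with 12 layers, 4 KV heads, a head dimension of 64, a 2048-token context, and 16-bit values, `drafter_extra_kv_bytes(False, 12, 4, 64, 2048)` gives 25,165,824 bytes (24 MiB), whereas the cache-sharing variant adds nothing.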
In our experiments, we found that MTP drafters consistently produce more accurate token predictions than standalone drafters, which results in speedups of 50% or more on Pixel 9 devices.
This performance gap stems from MTP’s access to richer representations. Unlike standalone drafters that treat the main model as a black box, the MTP head directly utilizes final activations already processed by the larger backbone.
For the deployment of MTP on Pixel 9 and 10 devices, we redesigned the on-device inference stack to handle the complex dependency between the verification and drafting phases.
The results validated the architectural choices. In production workloads, such as AI Notification Summaries and Proofread, MTP correctly predicts an average of nearly two additional tokens per inference pass. Furthermore, fewer verification steps mean less time waking heavy processors, reducing energy consumption and improving battery life.
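How accepted tokens translate into end-to-end speedup depends on how much the drafting itself costs. A simplified back-of-the-envelope model, which is an assumption rather than anything measured here, treats one verification pass as costing the same as one ordinary decoding step and charges each drafted token a fixed fraction of that cost. The parameter names and the example values are hypothetical.

```python
def estimated_speedup(accepted_per_pass: float, drafted_per_pass: int,
                      draft_step_cost: float) -> float:
    """Rough speedup over one-token-per-pass decoding.

    accepted_per_pass: average draft tokens accepted per verification pass
    drafted_per_pass:  draft tokens proposed before each verification pass
    draft_step_cost:   cost of one draft step relative to one main-model pass
    """
    tokens_per_cycle = 1.0 + accepted_per_pass  # accepted drafts + the verifier's own token
    cost_per_cycle = 1.0 + drafted_per_pass * draft_step_cost
    return tokens_per_cycle / cost_per_cycle


# Illustrative values only: 1.9 accepted tokens, 4 drafted, each draft step at 10% of a pass.
print(round(estimated_speedup(1.9, 4, 0.1), 2))  # 2.07
```

Under this model, a cheaper head (smaller `draft_step_cost`) or a higher acceptance rate both raise the speedup, which is why the text emphasizes draft latency and acceptance together.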
We look forward to integrating MTP on future Pixel devices, as well as exploring alternative architectures — including parallel decoding and paradigms without auxiliary heads — to further drive down draft latency and increase simultaneous token verification under strict mobile constraints.
We are also investigating ways to handle the inherent ambiguity of language generation more efficiently. While standard speculative decoding assumes a single best future path, we are developing techniques that allow the model to explore branching possibilities in parallel. This aims to maximize the probability of accepting long sequences even in uncertain contexts. Furthermore, we are studying verification leniency: relaxing the strict exact token match between draft and verification for specific use cases to bring further efficiencies to the edge.
This work is part of our efforts to optimize on-device LLM efficiency, with Filippo Galgani, Omri Homburger, Pooja Consul, Matthew Markwell, and Vivek Kumar. Certain elements were built on developments from the Gemini team in Google DeepMind: Tal Schuster, Ziwei Ji, Ivan Korotkov, and Ganesh Jawahar. We’d also like to extend a big thank you for reviews, valuable feedback, and support to Nadav Bar, Utku Evci, Nir Shabat, Joe Zou, and teams in Google Research, Google DeepMind, and Platforms & Devices.
Article link: https://news.qimuai.cn/?post=4449