Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

Posted by qimuai (first-hand compilation)

Source: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

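Summary:

Gemma 4 QAT models released: model compression optimized for mobile devices and laptops

Since releasing Gemma 4 two months ago, Google has continued to expand its capabilities. After introducing Multi-Token Prediction (MTP) to accelerate inference and releasing a 12B model to fill the gap between the E4B and 26B MOE models, it is today releasing new checkpoints optimized with Quantization-Aware Training (QAT). These make Gemma 4 more efficient, so users can run the models locally on everyday edge devices and consumer GPUs.

Keeping quality while shrinking the models

Quantization is a key technique for running models on consumer hardware, because it reduces memory footprint and speeds up decoding. However, standard Post-Training Quantization (PTQ) often degrades performance. QAT integrates quantization directly into training and achieves higher overall quality than standard PTQ baselines. This release includes QAT checkpoints for the popular Q4_0 quantization format and for a new quantization format designed for mobile use. With the mobile format, the memory footprint of Gemma 4 E2B drops to 1GB.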
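
To make the idea of a 4-bit block format concrete, here is a minimal sketch of symmetric block quantization in the spirit of Q4_0. It is not the exact GGML routine and says nothing about how the Gemma 4 checkpoints were produced; the block size of 32, the 16-bit scale, and the helper names `quantize_block`, `dequantize_block`, and `bits_per_weight` are assumptions chosen for illustration.

```python
import random

BLOCK_SIZE = 32  # assumption: number of weights that share one scale


def quantize_block(weights):
    """Map one block of floats to signed 4-bit integers plus one float scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0 if amax > 0 else 1.0
    # Signed 4-bit range is [-8, 7]; round to the nearest level and clamp.
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, q


def dequantize_block(scale, q):
    """Reconstruct approximate float weights from the 4-bit codes."""
    return [scale * v for v in q]


def bits_per_weight(block_size=BLOCK_SIZE, scale_bits=16):
    """Average storage cost: 4 bits per weight plus the amortized scale."""
    return (4 * block_size + scale_bits) / block_size


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print(f"bits per weight: {bits_per_weight()}")
print(f"scale: {scale:.6f}, max abs error: {max_err:.6f}")
```

The rounding error of each weight is bounded by half a quantization step (`scale / 2`), which is the loss that post-training quantization accepts as-is and that QAT tries to make the model robust to.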

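VRAM and storage savings

The approximate VRAM required to load the models is as follows:

Low-level optimizations for mobile devices

Standard compression formats are often hard to run efficiently on mobile processors. To make Gemma 4 run smoothly on mobile, Google designed a mobile quantization schema tailored to edge hardware:

Because many use cases do not need the audio and vision encoders, users can also deploy only the modalities they need to reduce memory further. For example, the text-only Gemma 4 E2B model (without per-layer embeddings) needs less than 1GB of memory.

Getting started

To make adoption easy for developers, Google has partnered with popular developer tools across the ecosystem, which support the Gemma 4 QAT checkpoints starting today: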

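Full text (translated from the Chinese version):

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency
Since releasing Gemma 4 two months ago, we have been continuously expanding its capabilities. First, we introduced Multi-Token Prediction (MTP) to accelerate inference; then, just a few days ago, we released a 12B model to fill the gap between the E4B and 26B MOE models.
Today, we are releasing new checkpoints optimized with Quantization-Aware Training (QAT) to make Gemma 4 more efficient, so that models can run locally on everyday edge devices and consumer GPUs.
By simulating quantization during training, QAT minimizes the quality loss incurred when the model is compressed. This release includes QAT checkpoints for the popular Q4_0 quantization format, as well as a new quantization format tailored to mobile use cases. With this mobile format, we have reduced the memory footprint of Gemma 4 E2B to 1GB. Together, these improvements substantially lower memory requirements while preserving the capabilities and quality expected of Gemma 4.

Keeping quality while shrinking the models

Quantization is a key technique for running models on consumer hardware, because it reduces memory footprint and speeds up decoding. However, standard Post-Training Quantization (PTQ) often degrades performance. Rather than simply quantizing the model after training, QAT integrates the quantization process directly into training. While PTQ is already effective at preserving quality, our QAT approach achieves higher overall quality than standard PTQ baselines.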
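
The article does not describe the actual QAT recipe, so the following is only a toy sketch of the general idea of "simulating quantization during training". It uses the straight-through estimator, a common technique substituted here for illustration: the forward pass sees fake-quantized weights, while gradients update a full-precision latent copy. The function names, the 4-bit grid, the linear model, and all hyperparameters are assumptions.

```python
import random


def fake_quantize(weights, num_bits=4):
    """Round weights onto a symmetric integer grid, then map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1  # 7 for 4 bits
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    return [scale * max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]


def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def mse(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)


def train_qat(data, dim, steps=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    latent = [rng.uniform(-0.5, 0.5) for _ in range(dim)]  # full-precision copy
    for _ in range(steps):
        x, y = rng.choice(data)
        w_q = fake_quantize(latent)  # forward pass uses quantized weights
        err = predict(w_q, x) - y
        # Straight-through estimator: treat d(w_q)/d(latent) as 1, so the
        # gradient computed at w_q is applied directly to the latent weights.
        latent = [w - lr * 2 * err * xi for w, xi in zip(latent, x)]
    return latent


# Synthetic regression data from hypothetical "true" weights.
rng = random.Random(1)
true_w = [0.7, -0.3, 0.1, 0.5]
data = []
for _ in range(200):
    x = [rng.uniform(-1, 1) for _ in true_w]
    data.append((x, predict(true_w, x)))

latent = train_qat(data, dim=len(true_w))
print("latent weights:   ", [round(w, 3) for w in latent])
print("quantized weights:", [round(w, 3) for w in fake_quantize(latent)])
print("MSE with quantized weights:", mse(fake_quantize(latent), data))
```

Because the loss is always evaluated on the quantized weights, training steers the latent weights toward values that still work after rounding, which is the property PTQ cannot provide.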
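We applied this QAT recipe to the popular Q4_0 format to maximize performance across all models. For the edge models (E2B and E4B), we rethought our approach to quantization and adopted a special quantization schema optimized for mobile devices.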
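
As a back-of-the-envelope illustration of why a 4-bit format matters, the sketch below estimates weight storage from a parameter count and an average number of bits per weight. These are not official figures: the 12B count is taken nominally from the model name, the 4.5 bits per weight assumes 4-bit codes plus one 16-bit scale per 32 weights, and real memory use also depends on tensors kept at higher precision, the KV cache, and activations. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NOMINAL_PARAMS = 12e9  # assumption: nominal size of a "12B" model

for label, bits in [("16-bit weights", 16), ("4-bit blocks with scales", 4.5)]:
    gb = weight_memory_gb(NOMINAL_PARAMS, bits)
    print(f"{label}: about {gb:.2f} GB for weights alone")
```

Under these assumptions the 4-bit layout needs roughly 28% of the 16-bit storage (4.5 / 16).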

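Saving VRAM and storage

Below are the approximate VRAM requirements for loading the models:

Under the hood: optimizing for mobile devices

Standard compression formats are often hard to run efficiently on mobile processors. To ensure Gemma 4 runs smoothly on mobile devices, we designed a custom mobile quantization schema for edge hardware:

Because the audio and vision encoders are not needed in many scenarios, you can also deploy only the modalities you need to reduce the memory footprint further. For example, the text-only Gemma 4 E2B model (without the per-layer embeddings) requires less than 1GB of memory.

Get started now

To make these models fit easily into your preferred workflow, we have partnered with popular developer tools across the ecosystem, which support the Gemma 4 QAT checkpoints seamlessly starting today:

We can't wait to see what you build with Gemma 4 running locally!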

English source:

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

Since releasing Gemma 4 two months ago, we've been continuously working to expand its capabilities. First, we introduced Multi-Token Prediction (MTP) to accelerate inference, and just a couple of days ago, we released a 12B model to bridge the gap between our E4B and 26B MOE models.
Today, we are releasing new checkpoints optimized with Quantization-Aware Training (QAT) to make Gemma 4 even more efficient, so you can run models locally on everyday edge devices and consumer GPUs.
By simulating quantization during training, QAT minimizes quality loss when the model is compressed. This release includes QAT checkpoints for the popular Q4_0 quantization format as well as a novel quantization format specialized for mobile use cases. Using this mobile format, we’ve reduced the memory footprint of Gemma 4 E2B to 1GB. Together, these dramatically reduce memory requirements while preserving the capabilities and quality you expect from Gemma 4.

Keeping model quality while making them smaller

Quantization is a key technology to run models on consumer hardware by reducing their memory footprint while also accelerating decode speed. However, standard Post-Training Quantization (PTQ) often leads to performance degradation. Instead of simply quantizing the model after training, QAT integrates the quantization process directly into training. While PTQ is already effective at preserving quality, our QAT results yield even higher overall quality compared to standard PTQ baselines.
We applied this QAT recipe to the popular Q4_0 format to maximize performance for all the models. For the edge models (E2B and E4B), we rethought how we approach quantization with a special mobile-specialized quantization schema.

Saving on VRAM and Storage

Below are the approximate memory requirements indicating how much VRAM is required to load the models:

Optimizing for mobile devices under the hood

Standard compression formats are often hard for mobile processors to run efficiently. To ensure Gemma 4 performs smoothly on mobile, we engineered a custom mobile-quantization schema designed for edge hardware:
