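Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

Summary:
Gemma 4 QAT models released: model compression optimized for mobile devices and laptops
Since releasing Gemma 4 two months ago, Google has kept expanding its capabilities. After introducing Multi-Token Prediction (MTP) to accelerate inference and releasing a 12B model to fill the gap between the E4B and 26B MOE models, Google today released new checkpoints optimized with Quantization-Aware Training (QAT). They make Gemma 4 more efficient, so that users can run the models locally on everyday edge devices and consumer GPUs.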
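Preserving quality while shrinking the models
Quantization is a key technique for running models on consumer hardware because it reduces memory footprint and speeds up decoding. However, standard Post-Training Quantization (PTQ) often degrades performance. QAT integrates quantization directly into training and achieves higher overall quality than standard PTQ baselines. This release includes QAT checkpoints for the popular Q4_0 quantization format and for a new quantization format designed for mobile use cases. With the mobile format, the memory footprint of Gemma 4 E2B drops to 1GB.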
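VRAM and storage savings
Approximate VRAM required to load each model:
- Gemma 4 E2B: mobile quantization format, 1GB; Q4_0 format, about 3GB
- Gemma 4 E4B: Q4_0 format, about 6GB
- Gemma 4 12B: Q4_0 format, about 8GB
- Gemma 4 26B: Q4_0 format, about 18GB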
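Low-level optimizations for mobile devices
Standard compression formats are often hard for mobile processors to run efficiently. To keep Gemma 4 running smoothly on mobile, Google designed a mobile quantization schema tailored to edge hardware:
- Static activations: activation scaling settings are precomputed, which lightens the load on mobile chips and speeds up responses
- Channel-wise quantization: the compressed data is structured to fit mobile accelerator designs, enabling native computation
- Targeted 2-bit quantization: the specific parts of the model that generate tokens are heavily compressed to 2 bits while the core reasoning layers stay at higher precision, saving storage without making the model less capable
- Embedding and KV cache optimization: compression focuses on the model's vocabulary and short-term memory, sharply reducing active memory use and supporting long conversations without running out of space
Because the audio and vision encoders are not needed in many use cases, users can reduce memory further by deploying only the modalities they need. For example, the Gemma 4 E2B text-only model (without Per-Layer Embeddings) needs less than 1GB of memory.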
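Get started today
To make the models easy to adopt, Google has partnered with popular developer tools across the ecosystem, which support the Gemma 4 QAT checkpoints starting today:
- Download the weights: get the Q4_0 and mobile model weights on Hugging Face, available as GGUF (for llama.cpp) and compressed tensors (for vLLM)
- Integrate and learn: consult the documentation for the best way to deploy
- Try on your desktop: run locally on the desktop through user-friendly interfaces such as llama.cpp, Ollama and LM Studio
- Deploy on-device: use Google's lightweight LiteRT-LM runtime for edge deployment, or run directly in the browser with Transformers.js
- Use development tools: serve larger models efficiently with SGLang and vLLM, optimize for Apple Silicon with MLX, use the MTP QAT checkpoints to keep the MTP speedup after quantization, and fine-tune the weights directly with Hugging Face Transformers and Unsloth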
English source:
Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency
Since releasing Gemma 4 two months ago, we've been continuously working to expand its capabilities. First, we introduced Multi-Token Prediction (MTP) to accelerate inference, and just a couple of days ago, we released a 12B model to bridge the gap between our E4B and 26B MOE models.
Today, we are releasing new checkpoints optimized with Quantization-Aware Training (QAT) to make Gemma 4 even more efficient, so you can run models locally on everyday edge devices and consumer GPUs.
By simulating quantization during training, QAT minimizes quality loss when the model is compressed. This release includes QAT checkpoints for the popular Q4_0 quantization format as well as a novel quantization format specialized for mobile use cases. Using this mobile format, we’ve reduced the memory footprint of Gemma 4 E2B to 1GB. Together, these dramatically reduce memory requirements while preserving the capabilities and quality you expect from Gemma 4.
Keeping model quality while making models smaller
Quantization is a key technology to run models on consumer hardware by reducing their memory footprint while also accelerating decode speed. However, standard Post-Training Quantization (PTQ) often leads to performance degradation. Instead of simply quantizing the model after training, QAT integrates the quantization process directly into training. While PTQ is already effective at preserving quality, our QAT results yield even higher overall quality compared to standard PTQ baselines.
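The idea of integrating quantization into training can be illustrated with a toy example. The following is a minimal sketch, not the recipe used for Gemma 4: it assumes a single trainable weight, a symmetric 4-bit grid with a fixed scale, and the straight-through estimator (STE), a common QAT technique that the source does not name. All function names and hyperparameters here are hypothetical.

```python
import random


def fake_quantize(w: float, scale: float, bits: int = 4) -> float:
    """Snap w to a signed integer grid of the given width, then map back to float."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_with_qat(w_true: float = 0.8, scale: float = 0.25, steps: int = 2000,
                   lr: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Fit y = w_true * x with one weight; return (latent weight, quantized weight)."""
    rng = random.Random(seed)
    w = 0.0  # latent full-precision weight that the optimizer updates
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        y = w_true * x
        w_q = fake_quantize(w, scale)       # forward pass sees the quantized weight
        grad_w_q = 2.0 * (w_q * x - y) * x  # d(loss)/d(w_q) for squared error
        w -= lr * grad_w_q                  # STE: treat d(w_q)/d(w) as 1
    return w, fake_quantize(w, scale)


if __name__ == "__main__":
    latent, quantized = train_with_qat()
    print(f"latent weight {latent:.3f}, deployed 4-bit weight {quantized:.2f}")
```

Because the loss is computed on the quantized weight, training already accounts for rounding, whereas PTQ rounds only after training has finished. In this toy setup the target 0.8 lies between the grid points 0.75 and 1.0, so the latent weight should settle near the rounding boundary between them.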
We applied this QAT recipe to the popular Q4_0 format to maximize performance for all the models. For the edge models (E2B and E4B), we rethought how we approach quantization with a special mobile-specialized quantization schema.
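For readers unfamiliar with Q4_0, here is a hedged sketch of block-wise 4-bit quantization in the style of the llama.cpp Q4_0 layout: 32 weights share one scale, and each weight is stored as a 4-bit code. The rounding rule below follows the llama.cpp reference implementation as commonly described; treat the details as an assumption rather than a specification. The function names are hypothetical.

```python
BLOCK_SIZE = 32      # weights per block, all sharing one scale
SCALE_BITS = 16      # the scale is assumed to be stored as a 16-bit float
BITS_PER_WEIGHT = (BLOCK_SIZE * 4 + SCALE_BITS) / BLOCK_SIZE


def quantize_block(block):
    """Return (scale, codes), where every code is an integer in 0..15."""
    signed_peak = max(block, key=abs)  # largest magnitude, sign preserved
    d = signed_peak / -8.0             # the peak value maps exactly to code 0
    if d == 0.0:
        return 0.0, [8] * len(block)   # all-zero block: code 8 decodes to 0
    codes = [min(15, int(x / d + 8.5)) for x in block]
    return d, codes


def dequantize_block(d, codes):
    """Reconstruct approximate weights from the scale and 4-bit codes."""
    return [d * (q - 8) for q in codes]


if __name__ == "__main__":
    # Deterministic made-up weights in [-1, 1).
    block = [((i * 37) % 64 - 32) / 32 for i in range(BLOCK_SIZE)]
    d, codes = quantize_block(block)
    restored = dequantize_block(d, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={d}, worst error={worst}, bits/weight={BITS_PER_WEIGHT}")
```

Under these assumptions each block costs 32 four-bit codes plus one 16-bit scale, which is why such a format averages slightly more than 4 bits per weight.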
Saving on VRAM and Storage
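Below are the approximate memory requirements indicating how much VRAM is required to load the models:
- Gemma 4 E2B: 1GB in the mobile quantization format; about 3GB in Q4_0
- Gemma 4 E4B: about 6GB in Q4_0
- Gemma 4 12B: about 8GB in Q4_0
- Gemma 4 26B: about 18GB in Q4_0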
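A back-of-the-envelope estimate helps put such figures in context. The sketch below assumes that "12B" means exactly 12 billion parameters and that a Q4_0-style format averages 4.5 bits per weight (32 four-bit codes plus one 16-bit scale per block); both are assumptions, and the helper name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


Q4_0_BITS = 4.5   # assumed average for a Q4_0-style block format
BF16_BITS = 16.0  # unquantized 16-bit baseline

if __name__ == "__main__":
    params = 12e9  # assumption: "12B" taken literally
    print(f"16-bit weights: {weight_memory_gb(params, BF16_BITS):.2f} GB")
    print(f"Q4_0 weights:   {weight_memory_gb(params, Q4_0_BITS):.2f} GB")
```

The Q4_0 estimate covers weights only. The roughly 8GB quoted for Gemma 4 12B is higher, presumably because loading a model also involves runtime buffers, the KV cache, and possibly some tensors kept at higher precision; the source does not give a breakdown.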
Optimizing for mobile devices under the hood
Standard compression formats are often hard for mobile processors to run efficiently. To ensure Gemma 4 performs smoothly on mobile, we engineered a custom mobile-quantization schema designed for edge hardware:
- Static activations: Normally, models waste processing power calculating how to scale data on the fly. We pre-calculate these settings during training, which reduces workload on mobile chips and makes responses faster.
- Channel-wise quantization: We structured the compressed data to fit the design of mobile accelerators. This allows the phone to run calculations natively without needing slow workarounds.
- Targeted 2-bit quantization: We heavily compressed (to 2-bit) the specific parts of the model that generate tokens, while keeping the core reasoning layers at higher precision. This saves storage without making the model less smart.
- Embedding and KV cache optimization: We focused compression on the model’s vocabulary list and its short-term memory. This drastically reduces the active memory footprint, letting you have long chats without running out of space.
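Two of the ideas in the list above, channel-wise scales and static activation scales, can be sketched in a few lines. This is an illustration under stated assumptions, not the actual mobile schema: it uses symmetric abs-max scaling, made-up weight values, and a calibration set to fix the activation scale ahead of time, whereas the source says only that the settings are pre-calculated during training. All names are hypothetical.

```python
def abs_max_scale(values, bits=4):
    """Symmetric scale that maps the largest magnitude onto the top integer code."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


def quantize_dequantize(values, scale, bits=4):
    """Round to the signed integer grid and map back, so the error can be measured."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Two output channels with very different magnitudes (made-up values).
WEIGHTS = [
    [0.010, -0.020, 0.015, -0.005],
    [1.000, -0.800, 0.600, -0.900],
]


def per_tensor_error(rows):
    flat = [v for row in rows for v in row]
    scale = abs_max_scale(flat)  # one scale shared by every channel
    return mean_abs_error(flat, quantize_dequantize(flat, scale))


def per_channel_error(rows):
    flat, restored = [], []
    for row in rows:
        scale = abs_max_scale(row)  # one scale per channel
        flat += row
        restored += quantize_dequantize(row, scale)
    return mean_abs_error(flat, restored)


def calibrate_static_scale(calibration_batches, bits=8):
    """Fix the activation scale once, ahead of time, from sample activations."""
    return max(abs_max_scale(batch, bits) for batch in calibration_batches)


if __name__ == "__main__":
    print("per-tensor error: ", per_tensor_error(WEIGHTS))
    print("per-channel error:", per_channel_error(WEIGHTS))
    static_scale = calibrate_static_scale([[0.5, -1.2], [2.0, 0.3]])
    # At inference time the stored scale is reused; no pass over the input is needed.
    print(quantize_dequantize([1.0, -0.4], static_scale, bits=8))
```

With a single shared scale, the small-magnitude channel rounds entirely to zero, while per-channel scales preserve it. A static scale trades some accuracy on unusual inputs for skipping the per-input range computation that a dynamic scale would require.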
Because our audio and vision encoders are not needed in many use cases, you can optimize your memory footprint even further by deploying only the modalities you need. For example, the Gemma 4 E2B text-only model (without Per-Layer Embeddings) requires less than 1 GB of memory.
Get started today
To make those models easily usable with your preferred workflow, we’ve partnered with popular developer tools across the ecosystem to seamlessly support the Gemma 4 QAT checkpoints starting today:
- Download the weights: Access the Q4_0 and mobile model weights right now on Hugging Face. We've tailored the formats to fit your workflow: GGUF formats are ready for use with llama.cpp, and compressed tensors are provided for vLLM. For everything else, we share unquantized checkpoints that can be converted and quantized into formats supporting Q4_0.
- Integrate & learn: Explore our documentation to learn how to best deploy the QAT checkpoints.
- Try on your desktop: Easily download, manage, and run Gemma 4 QAT models locally on your desktop using user-friendly interfaces like llama.cpp, Ollama and LM Studio.
- Deploy on-device: Use Google's lightweight LiteRT-LM runtime for optimized edge deployment or run the models directly on the web with Transformers.js.
- Use your favorite development tools: Serve larger models efficiently with SGLang and vLLM, optimize for Apple Silicon with MLX. Use the MTP QAT checkpoints to preserve the speedup of MTP while quantizing the models. Fine-tune weights directly using Hugging Face Transformers and Unsloth.
We can't wait to see what you build with Gemma 4 running locally!
Article link: https://news.qimuai.cn/?post=4280