Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Posted by qimuai on 2026-6-4 14:00 · First-hand compilation

Source: https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/

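Summary:

Google releases Gemma 4 12B: a multimodal AI model that runs on a laptop

Google has released Gemma 4 12B, its latest mid-sized AI model. Described as a "unified, encoder-free multimodal model," it is designed to run agentic multimodal workloads directly on a laptop, and it sits between the edge-friendly E4B model and the more advanced 26B Mixture of Experts (MoE) model.

Gemma 4 12B is the first mid-sized Gemma model with native audio input. Its unified architecture drops the separate vision and audio encoders used by conventional multimodal models and feeds image and audio inputs directly into the LLM backbone, which reduces latency and memory use. On standard benchmarks its performance approaches that of the larger 26B model at less than half the memory footprint. It needs only 16GB of VRAM or unified memory to run locally, and it supports multi-step reasoning and agentic workflows.

Gemma 4 models have now passed 150 million downloads and have been used for projects ranging from wearable robotic arms to enterprise-grade AI security. The new model is released under the Apache 2.0 license, is supported across the developer ecosystem, and ships with Multi-Token Prediction (MTP) drafters to reduce response latency.

Developers can experiment with the model in LM Studio and Ollama or download the weights from Hugging Face. Google has also released an official Skills Repository to help developers build agent applications with the latest Gemma technology, and the model can be deployed to production through Google Cloud.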

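Translation:

Today we are introducing Gemma 4 12B: a unified, encoder-free multimodal model.

Gemma 4 12B is our latest model, designed to bring agentic multimodal capabilities directly to laptops. It fills the gap between the edge-friendly E4B and the more advanced 26B Mixture of Experts (MoE) model, packing powerful capabilities into a smaller memory footprint. It is also our first mid-sized model with native audio input.

Thanks to the developer community, Gemma 4 models have now passed 150 million downloads. You have built everything from wearable robotic arms for physical assistance to enterprise-grade AI security solutions. We look forward to seeing what you create with this new model.

Here is an overview of what makes Gemma 4 12B unique:

Novel unified architecture: No multimodal encoders. Vision and audio inputs flow directly into the LLM backbone.
Advanced reasoning: Benchmark performance close to our 26B model, unlocking powerful multi-step reasoning and agentic workflows.
Laptop ready: Small enough to run locally with just 16GB of VRAM or unified memory.
Open and accessible: Released under the Apache 2.0 license, with support across the developer ecosystem.
Drafter-ready: Gemma 4 12B comes with Multi-Token Prediction (MTP) drafters to reduce latency.

Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning. Let's take a closer look at how Gemma 4 12B achieves this.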

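Run state-of-the-art agents locally

Gemma 4 12B delivers performance close to our larger 26B MoE model on standard benchmarks, at less than half the total memory footprint. It is small enough to run locally on consumer laptops with 16GB of memory, unlocking powerful multimodal and agentic experiences right on your machine.

Experience a uniquely efficient, unified architecture

What makes Gemma 4 12B stand out is its streamlined approach to processing visual and audio inputs. Traditional multimodal models typically rely on separate encoders to translate images and audio before passing those representations to the language model. Because these separate encoders add latency and memory use, we trained Gemma 4 12B with an encoder-free architecture that integrates audio and vision inputs directly.

Here is how Gemma 4 12B processes multimodal inputs natively:

Vision: We replaced Gemma 4's vision encoder with a lightweight embedding module consisting only of a single matrix multiplication, positional embedding, and normalization. This lets the LLM backbone take over visual processing.
Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.

Developers who want a deeper look can consult our companion Gemma 4 12B Developer Guide.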

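Get started today

Try it yourself: Experiment with a few clicks in LM Studio, Ollama, the Google AI Edge Gallery App, the Google AI Edge Eloquent App, and the LiteRT-LM CLI.
Download the weights: Download the pre-trained and instruction-tuned checkpoints directly from Hugging Face and Kaggle.
Integrate and learn: Review the developer documentation and the quick start notebook.
Use your favorite development tools: Build local inference pipelines with Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM, or fine-tune efficiently with Unsloth.
Unlock agentic development with Gemma Skills: To help developers build agents with the latest Gemma advancements, we are releasing our official Skills Repository, a library of skills designed specifically to enable agents to build with Gemma models.
Deploy your way: Spin up production endpoints on Google Cloud. Deploy through Gemini Enterprise Agent Platform Model Garden, Cloud Run, and GKE.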

English source:

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B and our more advanced 26B Mixture of Experts (MoE), Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs.

Thanks to the developer community, Gemma 4 models have now crossed 150 million downloads. You’ve built everything from wearable robotic arms for physical assistance to enterprise-grade AI security. We're excited to see what you build with this latest addition.

Here’s an overview of what makes Gemma 4 12B unique:

Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.
Advanced reasoning: Benchmark performance nearing our 26B model, unlocking powerful multi-step reasoning and agentic workflows.
Laptop ready: Small enough to run locally with just 16GB of VRAM or unified memory.
Open and accessible: Released under an Apache 2.0 license with support across the developer ecosystem.
Drafter-ready: Gemma 4 12B comes equipped with Multi-Token Prediction (MTP) drafters to reduce latency.

Together, these features bring advanced multimodal capabilities to everyday hardware without sacrificing speed or reasoning. Let's now take a closer look at how Gemma 4 12B achieves this.
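
The "Drafter-ready" item above says MTP drafters reduce latency but not how a drafter helps. The sketch below shows the general draft-then-verify idea in its greedy form, using toy stand-in functions rather than real models. The names `speculative_step`, `draft_next`, `target_next`, and `k` are hypothetical, and nothing here is taken from Gemma's actual MTP implementation, which the post does not describe.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextFn,
                     target_next: NextFn, k: int) -> List[Token]:
    """Draft k tokens cheaply, then keep only what the target model agrees with."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    # 2. The target model checks each proposed position. A real system scores
    #    all k positions in one batched forward pass; this loop is for clarity.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # fix the first mismatch, discard the rest
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target's pass yields one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[Token], n_new: int, draft_next: NextFn,
             target_next: NextFn, k: int = 3) -> List[Token]:
    """Repeat speculative steps until n_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + n_new]


# Toy stand-ins: the "target" counts upward; the "drafter" usually agrees
# but guesses wrong whenever the last token is 2 modulo 4.
def toy_target(ctx: List[Token]) -> Token:
    return ctx[-1] + 1


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] % 4 == 2 else ctx[-1] + 1


first_step = speculative_step([0], toy_draft, toy_target, k=3)
result = generate([0], 10, toy_draft, toy_target, k=3)
```

In this greedy variant the output is identical to what the target model would produce alone; the saving comes from the target confirming several tokens per pass instead of one. How much latency that removes depends on how often the drafter's guesses are accepted.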

Run state-of-the-art agents locally

Gemma 4 12B delivers performance nearing our larger 26B MoE model on standard benchmarks, but at less than half the total memory footprint. Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
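
The 16GB figure is easier to interpret with a back-of-the-envelope estimate. The sketch below computes the memory needed for the weights alone at a few common precisions. It assumes "12B" means roughly 12 billion parameters and ignores the KV cache, activations, and runtime overhead; the post does not state which precision its 16GB figure refers to, so the precisions listed are illustrative.

```python
GIB = 2 ** 30  # bytes per gibibyte


def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


N_PARAMS = 12e9   # assumption: "12B" read as 12 billion parameters
BUDGET_GIB = 16   # the 16GB figure quoted in the post

# Weights-only estimates at three illustrative precisions.
estimates = {
    name: weight_gib(N_PARAMS, bits)
    for name, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]
}

for name, gib in estimates.items():
    verdict = "fits" if gib < BUDGET_GIB else "does not fit"
    print(f"{name:>5}: {gib:5.1f} GiB of weights, {verdict} in {BUDGET_GIB} GiB")
```

Under these assumptions, 16-bit weights alone (about 22 GiB) exceed the budget, while 8-bit (about 11 GiB) and 4-bit (about 5.6 GiB) weights leave headroom for the cache and activations. That would be consistent with the 16GB figure describing a quantized build, though the post itself does not say so.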

Experience a uniquely efficient, unified architecture

What makes Gemma 4 12B stand out is its streamlined approach to processing visual and audio inputs. Traditional multimodal models typically rely on separate encoders to translate images and audio before passing those representations to the language model. Because these split encoders add latency and increase memory usage, we trained Gemma 4 12B with an encoder-free architecture to integrate audio and vision input directly.

Here is how Gemma 4 12B processes multimodal inputs natively:

Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations. This allows the LLM backbone to take over visual processing.
Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.
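
The two items above can be sketched in a few lines. This is a dependency-free toy, not Gemma's code: the patch size, frame length, model width, the use of RMS normalization, the order of operations, and every function name are assumptions made for illustration. The post only says the vision module is one matrix multiplication plus positional embedding and normalization, and that raw audio is projected to the text-token dimension.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a (d_model x d_in) matrix by a d_in vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Scale a vector to unit root-mean-square (assumed norm, no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Stand-in for learned weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def embed_image(image: Matrix, patch: int, w: Matrix, pos: Matrix) -> List[Vector]:
    """Cut a grayscale image into patch x patch tiles and embed each tile."""
    tokens: List[Vector] = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            flat = [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            projected = matvec(w, flat)  # the single matrix multiplication
            with_pos = [p + q for p, q in zip(projected, pos[len(tokens)])]
            tokens.append(rms_norm(with_pos))  # normalization
    return tokens


def embed_audio(samples: Vector, frame: int, w: Matrix) -> List[Vector]:
    """Cut the raw waveform into fixed-size frames and project each linearly."""
    usable = len(samples) - len(samples) % frame  # drop a trailing partial frame
    return [matvec(w, samples[i:i + frame]) for i in range(0, usable, frame)]


rng = random.Random(0)
D_MODEL, PATCH, FRAME = 8, 4, 16  # toy sizes, chosen only for the example

image = [[rng.random() for _ in range(8)] for _ in range(8)]  # 8x8 grayscale
w_img = random_matrix(D_MODEL, PATCH * PATCH, rng)
pos = random_matrix((8 // PATCH) ** 2, D_MODEL, rng)
image_tokens = embed_image(image, PATCH, w_img, pos)

waveform = [math.sin(0.1 * t) for t in range(100)]  # 100 raw samples
w_audio = random_matrix(D_MODEL, FRAME, rng)
audio_tokens = embed_audio(waveform, FRAME, w_audio)

# Both modalities now have the same width as text embeddings, so they can be
# placed in one sequence and handed to the language model backbone.
sequence = image_tokens + audio_tokens
```

The point of the design is that neither path contains a deep network of its own: each modality costs a single linear map per patch or frame, and all further processing happens in the shared backbone.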

For developers who want a breakdown, head over to our companion Gemma 4 12B Developer Guide.

Get started today

Try it yourself: Experiment with a couple of clicks in LM Studio, Ollama, Google AI Edge Gallery App, the Google AI Edge Eloquent app and the LiteRT-LM CLI.
Download the weights: Download the pre-trained and instruction-tuned checkpoints directly from Hugging Face and Kaggle.
Integrate & learn: Review the developer documentation and the quick start notebook.
Use your favorite development tools: Implement local inference pipelines with Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM, or fine-tune with efficiency using Unsloth.
Unlock Agentic Development with Gemma Skills: To support agents to build with the latest Gemma advancements, we are releasing our official Skills Repository. This is a library of skills designed specifically to enable agents to build with Gemma models.
Deploy your way: Spin up endpoints in production using Google Cloud. Deploy your way through Gemini Enterprise Agent Platform Model Garden, Cloud Run and GKE.
