Unlocking dependable responses with Gemini Enterprise Agent Platform's Agentic RAG

Posted by qimuai on 2026-6-6 08:00. First-hand compilation.

Source: https://research.google/blog/unlocking-dependable-responses-with-gemini-enterprise-agent-platforms-agentic-rag/

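Summary:

Google launches a new "agentic RAG" framework: multi-agent collaboration for complex enterprise queries

[Tech Frontier] On June 5, Google Research released a new agentic retrieval-augmented generation (Agentic RAG) framework. Developed jointly by Google Research and Google Cloud, it uses a multi-agent workflow to turn the single-pass retrieval of traditional RAG into an iterative plan-retrieve-verify loop, aimed at enterprise queries that span multiple data sources and require multiple hops.

Enterprise data often sits in "information silos". Asked for "the specs of the server used in Project X", a traditional system may find only a server ID in the project documents and never run a second search against another database. The new framework addresses this with a Sufficient Context Agent that acts as a quality inspector: it checks whether the retrieved content is complete, identifies exactly what is missing, and triggers a targeted follow-up search, so the model does not guess or fall back on "not enough information" when the answer exists elsewhere.

In the medical example, a doctor asks about a patient's discharge medications, dietary restrictions, and allergic reactions during the stay. The system starts a Root Agent, a Planner Agent, a Query Rewriter, and other specialized agents. If the Sufficient Context Agent finds after the first retrieval round that allergy information is missing, it generates targeted search instructions such as "rashes" or "adverse events". Only when all three items are covered does the Synthesis Agent write the final summary.

According to the evaluation, the framework improves accuracy on factuality datasets by up to 34% over standard RAG. In the cross-corpus FramesQA test, where the planner must pick the correct corpus out of four candidates (the FramesQA corpus plus three distractors), the system still answers 90.1% of questions correctly, with average latency within 3% of the single-corpus setting. The feature is now available as a public preview in Gemini Enterprise Agent Platform.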

English source:

June 5, 2026
Cyrus Rashtchian, Research Scientist, and Da-Cheng Juan, Engineering Manager, Google Research
We introduce our new agentic RAG framework. Based on a collaboration between Google Research and Google Cloud, our multi-agent workflow goes beyond standard RAG by breaking down complex enterprise queries and iteratively searching for sufficient context before generating dependable responses.
Current single-step retrieval-augmented generation (RAG) systems weren’t designed for the multi-source, multi-hop queries of modern business workflows. If, for example, the query is, "What are the specs of the server used in Project X?", the system might find documents about Project X, but those documents might only mention a server ID. It won't know to take that ID and perform a second search in another database to find the specs. The result is a partial answer or a "not found" response because the information is spread across different "islands" of data, requiring deeper exploration to find the facts.
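
The two-hop pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: `PROJECT_DOCS`, `SERVER_INVENTORY`, the `SRV-1042` identifier, and the spec values are all hypothetical stand-ins for two separate data "islands", not anything from the actual system.

```python
import re
from typing import Optional

# Hypothetical data "islands": project docs mention only an ID,
# while the specs live in a separate inventory store.
PROJECT_DOCS = {
    "Project X": "Project X runs its batch jobs on server SRV-1042.",
}
SERVER_INVENTORY = {
    "SRV-1042": {"cpu_cores": 64, "ram_gb": 512},
}

def find_server_id(project: str) -> Optional[str]:
    """Hop 1: extract a server ID from the project's documentation."""
    match = re.search(r"SRV-\d+", PROJECT_DOCS.get(project, ""))
    return match.group(0) if match else None

def server_specs_for(project: str) -> Optional[dict]:
    """Hop 2: use the ID from hop 1 to query a different source."""
    server_id = find_server_id(project)
    if server_id is None:
        return None  # a single-step system would stop before this point
    return SERVER_INVENTORY.get(server_id)

print(server_specs_for("Project X"))  # {'cpu_cores': 64, 'ram_gb': 512}
```

A single-step retriever effectively stops after `find_server_id`; the second lookup is the part that needs planning.
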
Enter “agentic RAG”, which plans, reasons, and iteratively interacts with data sources, enabling the handling of complex queries to increase dependability and accuracy.
Today, we’re excited to introduce Google’s Gemini Enterprise Agent Platform-hosted version of Cross-Corpus Retrieval powered by Agentic RAG. Like other multi-agent RAG frameworks, ours employs various agents that work together to reliably answer complex queries. Unlike other multi-agent frameworks, ours incorporates sufficient context to confirm if there is enough information for an accurate answer. Compared to standard RAG, our framework increases accuracy on factuality datasets by up to 34%. We also evaluated our system with proprietary, internal datasets and found that we achieve better grounding and improved reasoning accuracy on multiple domain-specific tasks.
It helps to think of multi-agent RAG not as a single search engine but as an organized research department. In a "monolithic" or “Vanilla” RAG system, the retrieval component just looks at your question and tries to find matching documents before an LLM generates a response.
In a multi-agent framework, the system breaks the job down into specialized roles:
The key difference with our new agentic RAG framework is persistence. Compared to other RAG solutions, our framework is effective because it knows when it is missing information and continues searching until the context is complete. This prevents the AI from "guessing" when the first search comes up empty, or from simply saying, “I don’t have enough information.” While this is an appropriate response in some cases, sometimes the information is there and we just need to find it.
For example, imagine a doctor asking about a patient’s medications, diet, and allergies:
"What are the discharge medications and dietary restrictions for John Doe after his knee surgery, and did he have any allergic reactions during his stay? Do not include medications only administered during hospital inpatient or emergency department visits except for heparin IV drip or Tenecteplase."
In response, our framework kicks off many specialized agents. We give an overview of our solution in the figure below and then describe it in more detail afterwards.
The Root Agent parses the doctor's request and delegates the tasks to sub-agents. The Planner Agent identifies that it needs to check three distinct areas: Pharmacy, Nutrition, and Clinical Notes. The Query Rewriter breaks the long request into simple, searchable questions so the retriever can more accurately find relevant content.
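
One way to picture the output of the planning and rewriting step is as a small list of per-area sub-queries. The sketch below is hand-written and hypothetical (the `SubQuery` type and `fan_out` function are illustrative names); in the framework described here, agents produce the fan-out rather than a fixed template.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubQuery:
    area: str      # record area the planner wants searched
    question: str  # short, searchable question from the rewriter

def fan_out(patient: str) -> list[SubQuery]:
    """Hand-written fan-out for the doctor's request (illustrative only)."""
    return [
        SubQuery("Pharmacy", f"discharge medications for {patient}"),
        SubQuery("Nutrition", f"dietary restrictions for {patient} after knee surgery"),
        SubQuery("Clinical Notes", f"allergic reactions for {patient} during hospital stay"),
    ]

for sub_query in fan_out("John Doe"):
    print(f"[{sub_query.area}] {sub_query.question}")
```
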
The RAG Agent searches the patient's records for all the query fanouts at once. It finds the medications and the diet information, but it can’t find any mention of allergies in the most obvious files. In a standard or “Vanilla” RAG system, the process might end here with an incomplete answer.
Think of the Sufficient Context Agent as a quality-control inspector standing at the end of an assembly line. It examines three specific findings before allowing a response to be generated:
The Sufficient Context Agent evaluates the actual text chunks pulled from the database by the RAG Agent. In the doctor's example, these could be the specific paragraphs found in the "Discharge Summary" and "Nutrition Notes." It reads these to see if the information needed to answer the query is present in those sentences.
The system also creates a "rough draft" response. The Sufficient Context Agent then reviews the prompt, draft, and retrieved snippets to evaluate whether the model has everything it needs to provide a comprehensive and grounded answer. If the prompt asks for three things (meds, diet, allergies) but the snippets only contain information about two, the Sufficient Context Agent flags it as “insufficient context.”
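
The "three things asked, two found" check can be approximated with a toy coverage test. This sketch assumes a simple keyword match (`REQUIRED_TOPICS` and its cue words are made up for illustration); the agent described above reasons over the prompt, draft, and snippets with a model rather than matching keywords.

```python
# Hypothetical cue words for each item the prompt asks for.
REQUIRED_TOPICS = {
    "medications": ("medication",),
    "diet": ("diet",),
    "allergies": ("allerg", "rash", "adverse"),
}

def missing_topics(snippets: list[str]) -> list[str]:
    """Return the requested topics that no retrieved snippet mentions."""
    text = " ".join(snippets).lower()
    return [
        topic
        for topic, cues in REQUIRED_TOPICS.items()
        if not any(cue in text for cue in cues)
    ]

snippets = [
    "Discharge Summary: medication list attached.",
    "Nutrition Notes: low-sodium diet instructions.",
]
gaps = missing_topics(snippets)
verdict = "insufficient context" if gaps else "sufficient context"
print(verdict, gaps)  # insufficient context ['allergies']
```
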
This is the most critical part. The Sufficient Context Agent identifies exactly what is not there. It doesn't just output that "this is insufficient"; it generates a specific "Reason" and "Feedback" log. For example:
Finding: "We have the medication list and the low-sodium diet instructions."
Gap: "We are missing information from the source documents about allergic reactions or adverse events during the stay."
The Sufficient Context Agent compares what was found against the original request and asks: "Did we answer the allergy question?" If not, it issues an "Insufficient Context" signal and provides specific feedback: "You found meds and diet, but you missed allergies. Go back and search specifically for 'rashes' or 'adverse events'." In a multi-source situation, it can also request more information or decide that the source isn't relevant to the query.
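
The "Reason" and "Feedback" log could be represented as a small structured record that the Query Rewriter consumes. The field names below (`finding`, `gap`, `suggested_searches`) are assumptions chosen to mirror the example; the post does not specify the actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SufficiencyVerdict:
    sufficient: bool
    finding: str                      # what the retrieved context covers
    gap: str = ""                     # what is still missing
    suggested_searches: list[str] = field(default_factory=list)

    def feedback(self) -> str:
        """Render the verdict as an instruction for the next search round."""
        if self.sufficient:
            return "Context is sufficient; proceed to synthesis."
        terms = " or ".join(repr(term) for term in self.suggested_searches)
        return f"Missing: {self.gap} Go back and search specifically for {terms}."

verdict = SufficiencyVerdict(
    sufficient=False,
    finding="Medication list and low-sodium diet instructions.",
    gap="Allergic reactions or adverse events during the stay.",
    suggested_searches=["rashes", "adverse events"],
)
print(verdict.feedback())
```

Keeping the gap and the suggested searches as separate fields lets a downstream step build new queries without parsing free text.
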
Because of the Sufficient Context Agent feedback, the Query Rewriter creates a new search for "rashes." Then, the RAG Agent dives deeper into files it ignored the first time and finds the missing information.
The Sufficient Context Agent checks the data one last time. Now that it has the meds, diet, and allergy info, it determines we can stop searching. Finally, the Synthesis Agent writes a clean, accurate summary for the doctor.
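
Put together, the walkthrough amounts to a retrieve, check, rewrite loop. The sketch below is a toy version under several assumptions: `RECORDS`, `TOPIC_CUES`, and `FOLLOW_UP` are invented, retrieval is plain word overlap, and the `max_rounds` cap is an added safeguard that the post does not mention.

```python
# Hypothetical patient files; the allergy detail lives in a less obvious one.
RECORDS = {
    "discharge_summary": "medication list at discharge",
    "nutrition_notes": "low-sodium diet instructions",
    "nursing_log": "mild rash noted during the stay",
}
TOPIC_CUES = {"medications": "medication", "diet": "diet", "allergies": "rash"}
FOLLOW_UP = {"allergies": "rash"}  # rewritten query suggested by the feedback

def retrieve(query: str) -> set[str]:
    """Toy retriever: return texts that share a word with the query."""
    words = set(query.lower().split())
    return {text for text in RECORDS.values() if words & set(text.split())}

def missing(context: set[str]) -> list[str]:
    """Toy sufficiency check: which requested topics are not covered yet?"""
    joined = " ".join(context)
    return [topic for topic, cue in TOPIC_CUES.items() if cue not in joined]

def gather(queries: list[str], max_rounds: int = 3) -> tuple[set[str], list[str]]:
    """Search, check sufficiency, and re-search until complete or out of rounds."""
    context: set[str] = set()
    for _ in range(max_rounds):
        for query in queries:
            context |= retrieve(query)
        gaps = missing(context)
        if not gaps:
            break  # sufficient context: hand over to synthesis
        queries = [FOLLOW_UP.get(gap, gap) for gap in gaps]
    return context, missing(context)

context, gaps = gather(["medication", "diet", "allergies"])
print(len(context), gaps)  # 3 []
```

With `max_rounds=1` the same call returns `['allergies']` as the remaining gap, which corresponds to where a single-pass system would stop.
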
We evaluated agentic RAG on FramesQA, which is based on the FRAMES paper. An example multi-hop question is:
“Of the top two most watched television season finales (as of June 2024), which finale ran the longest in length and by how much?”
The RAG system needs to perform multiple steps to arrive at the correct answer. First, it has to identify that the two most watched finales are from the shows MASH and Cheers. Then, it has to find their running times, and calculate the length difference. In many RAG settings (Vanilla RAG or agentic RAG without sufficient context), we could end up in a situation where the model says something like:
“Despite multiple scans, I found no explicit runtimes for MASH or Cheers. The documents provide viewership data, but not the duration in minutes or hours.”
This does not answer the question.
Fortunately, our agentic RAG can solve this by first searching for the TV shows, then using the Query Rewriter and Sufficient Context Agent to run a targeted search for the running times of MASH and Cheers. Then, Gemini can easily determine which finale ran longer and by how much:
“The MASH finale ran for 150 minutes, making it the longest of the top two. It was 52 minutes longer than the Cheers finale, which ran for approximately 98 minutes.”
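
Once both running times are retrieved, the last step is simple arithmetic. A quick check using the figures quoted in the answer above (150 and roughly 98 minutes):

```python
# Running times in minutes, as quoted in the example answer.
runtimes_min = {"MASH": 150, "Cheers": 98}

longest = max(runtimes_min, key=runtimes_min.get)
shortest = min(runtimes_min, key=runtimes_min.get)
difference = runtimes_min[longest] - runtimes_min[shortest]

print(f"{longest} ran {difference} minutes longer than {shortest}.")
# MASH ran 52 minutes longer than Cheers.
```
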
We ran an experiment to test this ability at scale (FramesQA has 824 queries along with a corpus containing 2,676 PDF documents). In the “Vanilla” RAG setting, we use Google’s RAG Engine (which has an advanced retrieval engine, LLM parser, and re-ranker). We compared this with our agentic RAG in two settings. In the single-corpus setting, we retrieve from the FramesQA documents. In the cross-corpus setting, we also include three other distracting datasets, where the Planner Agent must determine where to retrieve from. This cross-corpus setting mimics use cases where companies have databases managed by separate teams. We compute accuracy by using an LLM-as-a-judge to compare the system responses to the ground truth answers in the dataset.
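
The accuracy computation can be sketched as a judge function applied to each (question, response, ground truth) triple. Here `substring_judge` is a deliberately crude stand-in for the LLM-as-a-judge, and the "52 minutes" ground truth is assumed for illustration rather than taken from the dataset.

```python
from typing import Callable, Sequence

# (question, system_response, ground_truth) -> True if judged correct.
Judge = Callable[[str, str, str], bool]

def substring_judge(question: str, response: str, truth: str) -> bool:
    """Stand-in judge; a real setup would prompt an LLM to compare the answers."""
    return truth.lower() in response.lower()

def accuracy(
    questions: Sequence[str],
    responses: Sequence[str],
    truths: Sequence[str],
    judge: Judge,
) -> float:
    """Fraction of responses the judge marks as matching the ground truth."""
    verdicts = [judge(q, r, t) for q, r, t in zip(questions, responses, truths)]
    return sum(verdicts) / len(verdicts)

question = "Which of the top two most watched finales ran the longest, and by how much?"
truth = "52 minutes"
responses = [
    "The MASH finale was 52 minutes longer than the Cheers finale.",
    "I found no explicit runtimes for MASH or Cheers.",
]
score = accuracy([question, question], responses, [truth, truth], substring_judge)
print(score)  # 0.5
```
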
In the cross-corpus setting, our system nearly matches its single-corpus accuracy. Even when the Planner Agent must select the correct corpus out of 4 possibilities, we successfully route the search queries and answer 90.1% of questions correctly. Also, the latency of both single- and cross-corpus versions is about the same (within 3% on average). This demonstrates that our Agentic RAG system can reason over multiple, unrelated data sources, which opens up possibilities for more flexible retrieval scenarios.
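
Corpus routing can be illustrated with a toy scorer that picks the corpus whose description overlaps most with the query. The corpus names and descriptions in `CORPORA` are hypothetical, and word overlap is a swapped-in heuristic; the Planner Agent described above makes this decision with a model.

```python
# Four hypothetical corpora: one relevant source plus three distractors.
CORPORA = {
    "frames_wiki": "encyclopedia articles about television history people places",
    "hr_policies": "employee leave benefits payroll policy",
    "server_inventory": "server hardware specs rack cpu memory",
    "clinical_notes": "patient discharge medication allergy diet",
}

def route(query: str) -> str:
    """Pick the corpus whose description shares the most words with the query.

    Ties (including zero overlap) fall back to the first corpus listed.
    """
    words = set(query.lower().split())
    return max(CORPORA, key=lambda name: len(words & set(CORPORA[name].split())))

print(route("most watched television season finales"))  # frames_wiki
print(route("cpu and memory specs of the server"))       # server_inventory
```
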
By combining advanced query planning, routing, and sufficient context, our agentic RAG system ensures that AI-generated responses are auditable, traceable, and grounded. We look forward to seeing how the machine learning community leverages these new agentic capabilities to build the next generation of dependable AI systems. This new feature is now available as a public preview offering in Gemini Enterprise Agent Platform.
This project is joint work with Bo Li, Zhongjie Mao, Tiger Jin, Yuhong Kan, Mohd Abdullah (Obito), Chun-Sung Ferng, Pooneh Mortazavi, Roger (Peng) Yu, Eran Lewis, and Ivan Kuznetsov. We thank Kimberly Schwede for designing the graphics and Mark Simborg for writing assistance. We also thank our key enterprise partners for critical user feedback, data, and insights.
