ReasoningBank: Enabling Agents to Learn from Experience

Posted by qimuai on 2026-04-22 08:01 · First-hand compilation

Source: https://research.google/blog/reasoningbank-enabling-agents-to-learn-from-experience/

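Summary:

Google introduces the ReasoningBank framework, enabling AI agents to learn continuously from experience

Google Cloud research scientists Jun Yan and Chen-Yu Lee and their team have published an ICLR paper, "ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory", and released a new agent memory framework called ReasoningBank on GitHub. The framework targets a bottleneck in current AI agents: after deployment, they struggle to learn from their own successes and failures. ReasoningBank lets an agent distill generalizable reasoning strategies from accumulated experience and keep improving over time.

When AI agents handle complex real-world tasks such as web browsing and software engineering, the lack of an effective experience memory means they repeat the same mistakes and fail to reuse earlier insights. Existing memory approaches mostly either record every step of an action trajectory in detail or summarize only the workflows of successful attempts. Both miss higher-level, transferable reasoning patterns, and both overlook how much can be learned from failure.

The core of ReasoningBank is a continuously running closed loop of retrieval, extraction, and consolidation. Before acting, the agent retrieves relevant high-level strategy memories from the bank. After acting, it uses a large language model as a judge to assess its own trajectory, extracts insights from successful trajectories, and, critically, reflects on failed ones and turns those lessons into preventative strategies. For example, rather than only learning to "click the 'Load More' button", the agent can learn from a past mistake to "always verify the current page identifier first to avoid infinite scroll traps before attempting to load more results".

To make learning more efficient, the team also proposes memory-aware test-time scaling (MaTTS). It uses parallel scaling (exploring several solution paths for the same query and contrasting them) and sequential scaling (iteratively refining the reasoning within a single trajectory) to distill the large amount of trial-and-error data produced during exploration into high-quality memories. Better memories guide more effective exploration, and richer exploration data in turn feeds a stronger memory bank, forming a virtuous cycle.

On dynamic-environment benchmarks for web navigation (WebArena) and software engineering (SWE-Bench-Verified), agents built on Gemini-2.5-Flash with ReasoningBank outperformed the memory-free baseline by 8.3% and 4.6% in success rate, respectively. Because aimless exploration was reduced, they also saved almost 3 execution steps per task on SWE-Bench-Verified. Adding MaTTS improved results further. The team also observed an emergent "strategic maturity": the agent's memories evolved from simple procedural checklists into strategies with compositional, preventative logic.

The work is a step toward agents that learn continuously and autonomously. By building a memory system that keeps distilling, consolidating, and applying strategic knowledge from both successes and failures, ReasoningBank offers a new technical path toward more efficient and more reliable long-running AI agents.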

English source:

ReasoningBank: Enabling agents to learn from experience
April 21, 2026
Jun Yan and Chen-Yu Lee, Research Scientists, Google Cloud
ReasoningBank is a novel agent memory framework that uses successful and failed experiences to distill generalizable reasoning strategies, enabling an agent to continuously learn from experience after deployment.
Agents are becoming increasingly crucial in tackling complex real-world tasks, ranging from general web navigation to assisting with extensive software engineering codebases. However, as these agents transition into persistent, long-running roles in the real world, they face a critical limitation: they struggle to analyze and learn from successful and failed experiences after deployment.
Agents approaching each new task without a memory mechanism will repeatedly make the same strategic errors and discard valuable insights. To address this, various forms of agent memory have been introduced to store information about past interactions for reuse. However, existing methods generally focus on saving exhaustive records of every action taken (such as the trajectory memory used in Synapse) or on documenting only the workflows summarized from successful attempts (as in Agent Workflow Memory). These approaches have two fundamental drawbacks: first, by recording detailed actions instead of tactical foresight, they fail to distill higher-level, transferable reasoning patterns; second, by over-emphasizing successful experiences, they miss out on a primary source of learning, namely their own failures.
To bridge this gap, in our ICLR paper, "ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory", we introduce a novel agent memory framework (GitHub) that distills useful insights from both successful and failed experiences for test-time self-evolution. When evaluated on web browsing and software engineering benchmarks, ReasoningBank enhances both agent effectiveness (higher success rates) and efficiency (fewer task steps) compared to baseline approaches.
Distilling insights with ReasoningBank
ReasoningBank distills global reasoning patterns into high-level, structured memories. Each structured memory item contains the following:

Title: A concise identifier summarizing the core strategy.
Description: A brief summary of the memory item.
Content: The distilled reasoning steps, decision rationales, or operational insights extracted from past experiences.
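
As a minimal sketch of the three fields listed above, a memory item could be modeled as a small record. The class name `MemoryItem`, the helper `render_for_prompt`, the plain-text layout, and the example values are illustrative assumptions, not the framework's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    """One structured memory item with the three fields named above."""
    title: str        # concise identifier summarizing the core strategy
    description: str  # brief summary of the memory item
    content: str      # distilled reasoning steps, rationales, or insights


def render_for_prompt(items: list[MemoryItem]) -> str:
    """Format items as plain text for an agent's context (hypothetical layout)."""
    blocks = [
        f"Title: {m.title}\nDescription: {m.description}\nContent: {m.content}"
        for m in items
    ]
    return "\n\n".join(blocks)


# Illustrative values only; the wording is not taken from a real memory bank.
example = MemoryItem(
    title="Verify page identifier before loading more",
    description="Guard against infinite scroll traps.",
    content="Always verify the current page identifier first before "
            "attempting to load more results.",
)
print(render_for_prompt([example]))
```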
The memory workflow operates in a continuous, closed loop of retrieval, extraction, and consolidation. Before taking action, the agent draws upon the ReasoningBank to gather relevant memories into its context. It then interacts with the environment, uses an LLM-as-a-judge to self-assess the resulting trajectory, and extracts success insights or failure reflections. Notably, this self-judgment does not need to be perfectly accurate, as we find ReasoningBank to be quite robust against judgment noise. During extraction, the agent distills workflows and generalizable insights from the trajectory into new memories. For simplicity, we directly append these to the ReasoningBank, leaving more sophisticated consolidation strategies for future work.
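
The loop described above can be sketched roughly as follows. This is a toy under stated assumptions: the word-overlap `retrieve` is only a stand-in for whatever retrieval the framework actually uses, and `act`, `judge`, and `extract` are hypothetical callables that would wrap the environment and LLM calls in a real system. Only the append-style consolidation is taken directly from the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MemoryItem:
    title: str
    description: str
    content: str


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def retrieve(bank: list[MemoryItem], query: str, top_k: int = 2) -> list[MemoryItem]:
    """Rank items by word overlap with the query (stand-in for real retrieval)."""
    q = _tokens(query)
    scored = [(len(q & _tokens(m.title + " " + m.description)), m) for m in bank]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [m for score, m in ranked[:top_k] if score > 0]


def run_task(
    bank: list[MemoryItem],
    query: str,
    act: Callable[[str, list[MemoryItem]], str],
    judge: Callable[[str, str], bool],
    extract: Callable[[str, str, bool], list[MemoryItem]],
) -> bool:
    memories = retrieve(bank, query)                 # 1. retrieval
    trajectory = act(query, memories)                # interact with the environment
    success = judge(query, trajectory)               # LLM-as-a-judge self-assessment
    new_items = extract(query, trajectory, success)  # 2. extraction
    bank.extend(new_items)                           # 3. consolidation: plain append
    return success


# Hypothetical stubs standing in for the environment and the LLM calls.
def stub_act(query: str, memories: list[MemoryItem]) -> str:
    return f"memories used: {len(memories)}; steps taken for: {query}"


def stub_judge(query: str, trajectory: str) -> bool:
    return "steps taken" in trajectory


def stub_extract(query: str, trajectory: str, success: bool) -> list[MemoryItem]:
    kind = "Success insight" if success else "Failure reflection"
    return [MemoryItem(f"{kind}: {query}", f"Distilled from a task about {query}.", trajectory)]


bank: list[MemoryItem] = []
run_task(bank, "sort products by price", stub_act, stub_judge, stub_extract)
run_task(bank, "sort orders by price", stub_act, stub_judge, stub_extract)
print(len(bank), bank[1].content)
```

The second call retrieves the item written by the first, which is the point of the closed loop: each task can start from what earlier tasks left behind.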
Crucially, unlike existing workflow memory strategies that only focus on successful runs, ReasoningBank actively analyzes failed experiences to source counterfactual signals and pitfalls. By distilling these mistakes into preventative lessons, ReasoningBank builds powerful strategic guardrails. For example, instead of merely learning a procedural rule like "click the 'Load More' button", the agent might learn from a past failure to "always verify the current page identifier first to avoid infinite scroll traps before attempting to load more results".
Memory-aware test-time scaling (MaTTS)
Test-time scaling (TTS) — scaling compute at inference time — has shown immense effectiveness in reasoning domains like math and competitive programming. However, in agentic environments, existing TTS methods often discard the exploration trajectory and treat the final answer as the only useful outcome. This overlooked exploration is actually a rich data source that could accelerate an agent's ability to learn from experience over time.
We bridge this gap by explicitly linking memory with scaling through memory-aware test-time scaling (MaTTS). By using ReasoningBank as a powerful experience learner, MaTTS distills extensive exploration into high-quality memories via contrastive and refinement signals. We demonstrate the power of MaTTS through two distinct forms of scaling:
Parallel scaling: The agent generates multiple distinct trajectories for the same query under the guidance of memory. Through self-contrast, ReasoningBank compares successful and spuriously reasoned trajectories to distill more robust strategies and synthesize higher-quality memories.
Sequential scaling: The agent iteratively refines reasoning within a single trajectory to produce strong intermediate rationales. ReasoningBank captures these intermediate insights from the agent's trial and error and progressive improvement as high-quality memory items.
MaTTS establishes a strong synergy: high-quality memory from ReasoningBank steers the scaled exploration towards more promising strategies, and in return, the scaled interactions generate significantly richer learning signals that feed back into an even smarter ReasoningBank to help the agent.
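
The two forms of scaling can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: `rollout`, `judge`, `contrast`, and `refine` are hypothetical callables that would wrap LLM and environment calls, trajectories are reduced to strings, and the stubs exist only to make the example runnable. The default `k=5` mirrors the scaling factor quoted in the results.

```python
from typing import Callable

Trajectory = str


def parallel_scaling(
    query: str,
    memories: list[str],
    rollout: Callable[[str, list[str], int], Trajectory],
    judge: Callable[[str, Trajectory], bool],
    contrast: Callable[[str, list[Trajectory], list[Trajectory]], list[str]],
    k: int = 5,
) -> list[str]:
    """Generate k trajectories for one query, then distill memories by self-contrast."""
    trajectories = [rollout(query, memories, seed) for seed in range(k)]
    verdicts = [judge(query, t) for t in trajectories]
    successes = [t for t, ok in zip(trajectories, verdicts) if ok]
    failures = [t for t, ok in zip(trajectories, verdicts) if not ok]
    # Keep the exploration: contrast good and bad attempts instead of
    # returning only a final answer.
    return contrast(query, successes, failures)


def sequential_scaling(
    query: str,
    memories: list[str],
    rollout: Callable[[str, list[str], int], Trajectory],
    refine: Callable[[str, Trajectory], Trajectory],
    steps: int = 3,
) -> list[Trajectory]:
    """Refine a single trajectory and keep every intermediate version."""
    trajectory = rollout(query, memories, 0)
    intermediate = [trajectory]
    for _ in range(steps):
        trajectory = refine(query, trajectory)
        intermediate.append(trajectory)
    return intermediate  # all versions are available for memory extraction


# Hypothetical stubs so the sketch runs without an LLM or a browser.
def stub_rollout(query: str, memories: list[str], seed: int) -> Trajectory:
    action = "checked page id" if seed % 2 == 0 else "kept scrolling"
    return f"attempt {seed}: {action}"


def stub_judge(query: str, trajectory: Trajectory) -> bool:
    return "checked page id" in trajectory


def stub_contrast(query: str, successes: list[Trajectory], failures: list[Trajectory]) -> list[str]:
    return [f"{len(successes)} successful vs {len(failures)} failed attempts on '{query}'"]


def stub_refine(query: str, trajectory: Trajectory) -> Trajectory:
    return trajectory + " -> refined"


print(parallel_scaling("load all reviews", [], stub_rollout, stub_judge, stub_contrast))
print(sequential_scaling("load all reviews", [], stub_rollout, stub_refine))
```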
Performance & emergent capabilities
We evaluated ReasoningBank across challenging benchmarks covering dynamic environments. Using the ReAct prompting strategy as the foundation for all agents, we compared ReasoningBank against three memory configurations: a memory-free baseline (Vanilla ReAct), Synapse (Trajectory Memory) and AWM (Workflow Memory). From our main evaluation results with Gemini-2.5-Flash on WebArena and SWE-Bench-Verified, we have the following key observations:
Superior success rates: ReasoningBank without scaling outperformed memory-free agents by 8.3% on WebArena and 4.6% on SWE-Bench-Verified.
Efficiency gains: Because the agent actively accesses past decision rationales, it executes commands with vastly reduced aimless exploration. On SWE-Bench-Verified, ReasoningBank saved almost 3 total execution steps per task over memory-free baselines.
MaTTS synergy: When adding MaTTS (parallel scaling with a scaling factor k=5), success rates are further boosted. On WebArena, ReasoningBank with MaTTS raises the success rate by a further 3% and uses 0.4 fewer steps than ReasoningBank without scaling.
Importantly, during evaluation, we observed the emergence of strategic maturity. In a web-browsing example, the agent's initial curated rules resembled simple procedural checklists (e.g., "Look for page links"). As the agent persisted through more problem sets, these memories were incorporated during execution. Building upon existing knowledge, the agent distilled new trajectories into more advanced memories. Over time, simple checklists evolved into memories with compositional, preventative logic structures (e.g., "Cross-reference tasks continuously with active page filters to ensure retrieved datasets aren't paginated prematurely"). See the paper for more details.
Conclusion
ReasoningBank provides a powerful framework for enabling LLMs to learn from experience and evolve into continuous learners at test time. We believe memory-driven experience scaling represents a crucial new frontier for agent scaling.
We are excited to share this with the broader research community.
Acknowledgements
This research was conducted by Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, and Tomas Pfister.
