Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs

Posted by qimuai · first-hand compilation

Source: https://research.google/blog/thinking-to-recall-how-reasoning-unlocks-parametric-knowledge-in-llms/

Summary:

Google Research finds that reasoning can "unlock" a language model's built-in knowledge

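(Silicon Valley, June 24, 2026) Zorik Gekhman and Jonathan Herzig, two research scientists at Google Research, recently published a counterintuitive result: when answering simple factual questions that need no complex logical reasoning, letting a large language model (LLM) first generate a "thinking process" (a chain of thought) significantly improves its ability to recall the correct answer. Through a series of controlled experiments, the study, titled "Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs", identifies two underlying mechanisms: a computational buffer effect and factual priming.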

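It has long been widely accepted that chain-of-thought is essential for hard math problems and complex logical reasoning. But for a single-hop factual question such as "What year was Mary Engle Pennington inducted into the National Inventors Hall of Fame?", the answer is either stored in the model's parameters or it is not. Most people would assume that generating reasoning steps does not help with such tasks.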

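However, when the team tested Gemini-2.5 (Flash and Pro) and Qwen3-32B, it found something striking: with reasoning enabled, the models retrieved correct answers that were virtually unreachable with reasoning off. The trend held on datasets consisting predominantly of simple, single-hop factual questions, so it cannot be explained by the decomposition of complex questions.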

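To explain the phenomenon, the team tested two hypotheses. The first is the "computational buffer" hypothesis. The researchers intercepted the model's reasoning and replaced its meaningful content with the meaningless string "Let me think", repeated until it matched the length of the original trace. Even when conditioned only on this string of meaningless tokens, the model answered more accurately than with reasoning turned off entirely. This suggests that the additional forward passes themselves give the model more "computation time" to refine its internal state. The effect has a ceiling, however: it never fully matches the gains from natural reasoning traces.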

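Further analysis of the natural reasoning traces revealed a second mechanism, "factual priming". Much like "spreading activation" in human cognition, the model often first recalls and writes out facts related to the question (for example, asked for the name of the 10th King of Nepal, it may first list the previous nine). These successfully recalled facts act as semantic stepping stones, building a contextual bridge that leads the model to retrieve the target answer. The experiments confirm that conditioning the model on only the filtered facts recovers most of the gains from reasoning.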

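This mechanism carries a notable risk, however. Using a large-scale auditing pipeline, the team found that if a reasoning trace contains even a single false statement (a hallucination), the model is significantly less likely to reach the correct final answer. Factual priming is powerful, but it is highly sensitive to the accuracy of the intermediate steps.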

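Based on these findings, the researchers propose a practical strategy: generate multiple reasoning trajectories for the same question and use a search-enabled verifier to keep those whose intermediate facts are all correct. Prioritizing these hallucination-free trajectories considerably improves final-answer accuracy. The study suggests that process rewards targeting the factual accuracy of intermediate steps during training could yield more reliable models that are less prone to hallucination.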

English source:

June 24, 2026
Zorik Gekhman and Jonathan Herzig, Research Scientists, Google Research
We study the counterintuitive phenomenon where reasoning helps language models recall simple facts, even when no complex step-by-step solutions are required. We show that this phenomenon is driven by two mechanisms: (1) using generated reasoning tokens to perform latent computation, and (2) generating related facts to prime correct answer recall.
It is well-established that allowing large language models (LLMs) to generate step-by-step reasoning traces, commonly known as chain-of-thought (CoT), enhances performance on complex tasks. When a model solves difficult math equations, writes software, or answers multi-hop factual questions, breaking the problem down into manageable logical steps is highly effective.
However, the utility of this approach remains unclear for simple, single-hop factual questions. For instance, consider a query like: "What year was Mary Engle Pennington inducted into the National Inventors Hall of Fame?" An LLM either has the fact stored in its parametric memory (knowledge encoded directly into its weights) or it doesn't; no complex arithmetic or logical deduction is required. So why would a reasoning trace help?
In "Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs", we investigate this phenomenon. We demonstrate that allowing a model to generate a reasoning trace unlocks correct answers that are otherwise effectively unreachable. To understand why reasoning aids parametric knowledge recall when there are no complex reasoning steps to execute, we conduct a series of hypothesis-driven controlled experiments. Our findings reveal two complementary mechanisms driving this: a computational buffer effect and factual priming.
We first measure the parametric recall capability boundary using the pass@k metric. Instead of only checking one model-generated answer, pass@k checks if the correct fact exists within multiple generated attempts. By evaluating the presence of successful reasoning paths in the model’s output distribution while being less sensitive to their exact ranking, pass@k helps us estimate the potential of reasoning for factual recall, rather than only looking at the current model’s top-1 behavior. To assess the impact of reasoning while controlling for parametric knowledge, we focus on reasoning LLMs (R-LLMs) where reasoning can be enabled or disabled (toggled on or off), and compare pass@k between these two modes. We focus on the Gemini-2.5 (Flash and Pro) and Qwen3-32B models, using two challenging closed-book QA datasets: SimpleQA Verified and EntityQuestions.
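As a rough illustration of the metric, the sketch below computes pass@k with the unbiased estimator commonly used in code-generation evaluation: draw k of the n sampled answers without replacement and ask how likely it is that at least one is correct. Whether the study uses this exact estimator is an assumption, and the function names `pass_at_k` and `mean_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one question.

    n: number of sampled answers, c: how many of them are correct,
    k: number of attempts the metric allows.
    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that k answers drawn without replacement are all wrong.
    # comb() returns 0 when fewer than k wrong answers exist.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over questions, given one (n, c) pair per question."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Comparing `mean_pass_at_k` for the same questions with reasoning on and with reasoning off gives the kind of two-mode comparison described above.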
The results are surprisingly consistent. When reasoning is enabled, the models successfully recall answers that are virtually unrecoverable when reasoning is off. Importantly, this improvement isn't just because the model is decomposing complex questions: we deliberately focus on datasets containing predominantly simple, single-hop questions.
These results raise the question: if the effect does not come from step-by-step reasoning, what reasoning patterns enable the model to retrieve the correct answer?
Our first hypothesis focuses on the mechanics of generation. We take the long-standing hypothesis that generating extra tokens acts as extended computation time by providing additional forward passes, and test it in the new setting of parametric knowledge recall in R-LLMs. Specifically, we hypothesize that models implicitly use these reasoning tokens as a computational buffer to perform latent processing, independent of the actual semantic content being generated.
To test this, we design an experiment that removes all meaningful content from the reasoning trace. We intercept the model's reasoning process and replace its generated trace with a meaningless string "Let me think", repeated over and over until it matches the length of the original reasoning trace. We then let the model predict the final answer conditioned on this dummy text.
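The construction of the dummy trace can be sketched as follows. This is a minimal illustration, not the experimental code: it measures length in whitespace-separated words as a stand-in for model tokens (an assumption), and the helper name `build_dummy_trace` is hypothetical.

```python
def build_dummy_trace(original_trace: str, filler: str = "Let me think") -> str:
    """Return filler text repeated to match the length of the original trace.

    Assumption: length is counted in whitespace-separated words; a real
    setup would count tokens with the model's own tokenizer.
    """
    target_len = len(original_trace.split())
    filler_words = filler.split()
    if target_len == 0 or not filler_words:
        return ""
    # Repeat the filler enough times to cover the target, then trim.
    repeats = -(-target_len // len(filler_words))  # ceiling division
    words = (filler_words * repeats)[:target_len]
    return " ".join(words)
```

The resulting string would then take the place of the model's own reasoning before the final answer is generated.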
Remarkably, conditioning the model on this meaningless trace substantially improves its ability to recall the correct answer compared to the baseline where reasoning is completely turned off. This provides strong evidence that simply giving the model more computational runway helps it refine its internal state and fetch hard-to-reach facts.
However, this compute-buffer effect has its limits. Pushing the dummy text to longer lengths eventually offers diminishing returns, and it never fully matches the performance of the model's natural reasoning traces. This means that while extra computation helps, the actual content of the thoughts still matters.
When we analyze the natural reasoning traces generated for simple factual questions, we notice a common pattern. The models aren't writing out logical proofs; they are surfacing related facts.
In human cognition, there is a concept known as spreading activation, where processing a specific concept primes related concepts in semantic memory, making them easier to retrieve. We hypothesize that language models exhibit a similar generative self-retrieval mechanism, which we call factual priming. By generating facts topically related to the question, the model builds a contextual bridge that facilitates the retrieval of the correct answer.
To test this hypothesis, we extract just the concrete facts from the model’s reasoning traces, applying strict filtering to strip away any filler text, search plans, or explicit mentions of the final target answer. We then isolate the effect of the recalled facts, and show that conditioning on a short list of recalled facts recovers most of reasoning’s gains and helps even when reasoning is off.
For example, if asked for the name of the 10th King of Nepal, a reasoning model might first list the previous nine kings. Recalling those first nine acts as a semantic warm-up, priming the network to successfully recall the 10th. The facts themselves are the stepping stones.
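The filtering and conditioning steps might look roughly like the sketch below. It assumes the facts have already been extracted from the trace as plain strings, uses a simple case-insensitive substring check to drop facts that mention the target answer, and builds a prompt in an invented format. The helper names and the prompt layout are hypothetical, and the study's actual filtering is likely more involved.

```python
def filter_facts(facts: list[str], target_answer: str) -> list[str]:
    """Keep non-empty facts that do not mention the target answer.

    Assumption: a case-insensitive substring match stands in for the
    stricter filtering described in the text.
    """
    needle = target_answer.strip().casefold()
    kept = []
    for fact in facts:
        text = fact.strip()
        if not text:
            continue  # drop empty lines
        if needle and needle in text.casefold():
            continue  # drop facts that leak the answer
        kept.append(text)
    return kept


def build_fact_prompt(question: str, facts: list[str]) -> str:
    """Build a prompt that conditions the model on the recalled facts.

    The prompt layout is illustrative only.
    """
    lines = ["Relevant facts:"]
    lines += [f"- {fact}" for fact in facts]
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)
```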
While generative self-retrieval is a powerful mechanism, it introduces a fundamental risk. Because the model generates these intermediate facts itself, they might be hallucinated. We thus check how these reasoning-stage errors impact the final answer. To find out, we build a large-scale auditing pipeline using a search-enabled verifier to independently check the correctness of every single intermediate fact generated across hundreds of thousands of reasoning traces.
The audit reveals a distinct pattern. If a reasoning trace contains even a single hallucinated intermediate fact, the model is significantly less likely to arrive at the correct final answer. This suggests that, while effective, the factual priming mechanism might be fragile.
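The comparison behind this finding can be sketched as a simple aggregation: split audited traces by whether the verifier flagged any intermediate fact, then compare final-answer accuracy between the two groups. The record type `AuditedTrace` and the function name are hypothetical, and the verifier labels are assumed to be given.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditedTrace:
    fact_labels: list[bool]  # True = verifier judged the intermediate fact correct
    answer_correct: bool     # whether the final answer was correct


def accuracy_by_hallucination(traces: list[AuditedTrace]) -> dict[str, Optional[float]]:
    """Final-answer accuracy for clean traces vs. traces with any hallucination.

    A trace with no extracted facts counts as clean, since all([]) is True.
    Returns None for a group that has no traces.
    """
    buckets = {"clean": [0, 0], "hallucinated": [0, 0]}  # [correct, total]
    for trace in traces:
        key = "clean" if all(trace.fact_labels) else "hallucinated"
        buckets[key][0] += int(trace.answer_correct)
        buckets[key][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in buckets.items()}
```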
Understanding these mechanisms provides practical avenues for improving model reliability. Because factual priming is effective and hallucinated intermediate facts degrade performance, we can leverage both insights to improve model accuracy.
To evaluate the potential of these insights, we use a test-time selection strategy that generates multiple reasoning trajectories for a single question, retaining only those that contain verifiable, hallucination-free facts. Prioritizing these trajectories considerably improves accuracy. In practice, this prioritization could be implemented during training via process rewards that encourage factually supported intermediate steps.
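A minimal sketch of such a selection step is shown below. It assumes each trajectory is a pair of (intermediate facts, final answer) and that `verify_fact` is a caller-supplied stand-in for the search-enabled verifier. Taking a majority vote over the surviving trajectories and falling back to all trajectories when none survive are illustrative choices, not details given in the text.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

Trajectory = tuple[list[str], str]  # (intermediate facts, final answer)


def select_answer(
    trajectories: Sequence[Trajectory],
    verify_fact: Callable[[str], bool],
) -> Optional[str]:
    """Pick an answer, preferring trajectories whose facts all verify.

    Assumptions: majority vote among the kept trajectories, and a fallback
    to all trajectories if every one contains a flagged fact.
    """
    clean = [t for t in trajectories if all(verify_fact(f) for f in t[0])]
    pool = clean if clean else list(trajectories)
    if not pool:
        return None
    counts = Counter(answer for _, answer in pool)
    return counts.most_common(1)[0][0]
```

With a toy verifier that rejects facts starting with "bad", three flagged trajectories answering "A" are outvoted by two clean ones answering "B", even though "A" would win a plain majority vote.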
Our findings highlight that reasoning in language models serves a much broader purpose than just task decomposition or mathematical logic. It acts as a fundamental mechanism for exposing a model's internal memory and expanding its parametric knowledge boundary. These insights open up exciting directions for future research. Knowing that factually accurate reasoning traces yield better answers suggests that training recipes can be further optimized. By utilizing process rewards that specifically encourage factually supported intermediate steps, we might be able to train models that are inherently more reliable and less prone to hallucination. We look forward to seeing how the research community continues to explore the intersections of reasoning, memory, and retrieval.
This research was conducted by Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart and Jonathan Herzig. We thank Eyal Ben-David and Avinatan Hassidim for reviewing the work and their valuable suggestions.
