AI bots ignore evidence. Can we trust them with science?

Posted by qimuai on 2026-5-28 13:01. Compiled from the original source.

Source: https://www.sciencenews.org/article/ai-ignore-evidence-trust-science

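Summary:

Study finds that AI agents often fail to revise their views in light of experimental evidence

A recent study finds that AI systems built on large language models have a serious weakness in scientific reasoning: unlike human scientists, they often fail to revise their initial judgments when experimental evidence contradicts them.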

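YouTuber FatherPhi illustrated the problem with a simple experiment. He asked ChatGPT, Gemini and Grok to predict what happens when you hold a pen horizontally with both hands and then let go of one side. All three predicted that the unsupported end would pivot downward. He then showed each chatbot a live video of himself doing the experiment: after releasing one end, he easily held the pen out horizontally with one hand. ChatGPT stubbornly stuck to its incorrect prediction, saying it had seen the pen rotate exactly as expected, and the other chatbots struggled in similar ways.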

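Walter Quattrociocchi, a computer scientist at Sapienza University of Rome, says these silly videos reveal a serious issue: AI systems based on large language models cannot actually think through events the way people do. Developers could train a model to answer this particular pen problem correctly, but that would not fix its typical failure to incorporate new data as it works through a problem.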

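A more systematic study confirms the problem. In a paper posted April 20 on arXiv.org, researchers annotated each step of 619 scientific reasoning tasks performed by AI agents and found:

In 68 percent of the tasks, the agents ignored experimental evidence at least once
In 53 percent of the tasks, the agents made claims without any supporting evidence
Only 26 percent of the time did the agents successfully use contradictory evidence to change their output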

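N.M. Anoop Krishnan, a materials scientist at the Indian Institute of Technology Delhi, says human scientists follow “an iterative process” of coming up with a hypothesis, designing and performing experiments, then revisiting their initial ideas and changing their minds as needed. With AI, he says, “even when you have clear evidence that shows that a particular line of investigation is not correct, [the AI] refuses to change the hypothesis or the plan.”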

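The research team developed a benchmark that checks an AI agent's process on the way to an answer rather than only whether the answer is right. Kevin Jablonka of Friedrich Schiller University Jena in Germany stresses that in science you typically cannot trust a result unless you also trust the process that produced it.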

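As for the so-called reasoning models that AI companies have developed, Subbarao Kambhampati, a computer scientist at Arizona State University, notes that they do tend to outperform regular large language models on some types of problems, but the idea that they are “thinking” is probably an illusion. Reasoning models could merely be imitating what people say as they think through problems, without any actual reasoning. Research by Kambhampati and others shows that a model can get the intermediate reasoning right but the answer wrong, or vice versa, and that models trained on nonsense reasoning steps can still get right answers.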

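Experts are divided on AI's prospects in science. Quattrociocchi is worried. “The narrative from big tech and even part of the scientific community is to say that we are seeing the emergence of a new form of intelligence that is going to make us better,” he says. He instead sees AI producing content based only on statistics, without verification. “The architecture of knowledge as we have known it until now is under attack,” he says. “Actually, I’m scared.”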

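Jablonka and Krishnan are more optimistic. Once the limitations of AI agents and reasoning models are understood, Krishnan says, “we can actually improve [the technology] and lead it towards enabling meaningful and disruptive discoveries.” For now, AI is best suited to well-defined tasks “where we know exactly what we want” and is not yet ready for open-ended scientific reasoning.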

English source:

AI bots ignore evidence. Can we trust them with science?
Studies show AI agents struggle to use the results of experiments to revise their ideas
Hold a pen horizontally with both hands, then let go of one side. What happens?
ChatGPT, Gemini and Grok will tell you the unsupported end of the pen will pivot downward. At least, that’s what they told YouTuber FatherPhi. He then showed each chatbot a live video of himself performing this experiment. After releasing one end, he easily held the pen out horizontally with just one hand.
“What just happened?” he asked ChatGPT.
“I saw the pen rotate exactly as expected,” the bot answered.
A surreal back-and-forth followed, in which the bot stubbornly stuck with its incorrect prediction. In separate videos, the other chatbots struggled in similar ways.
This wasn’t a vision problem. The chatbots could all easily identify the pen’s color and brand. Something weirder and subtler was happening. The chatbots could not update their predictions based on the new evidence FatherPhi showed them.
These silly videos reveal a serious issue: AI systems based on large language models, including chatbots, cannot actually think through events the way people do, says Walter Quattrociocchi, a computer scientist at Sapienza University of Rome. Developers could train a chatbot to give the correct answer to this particular pen problem, but that doesn’t fix the fact that it typically fails to incorporate new data as it works through a problem. This means LLMs might not do as good a job as we expect at tasks in science, medicine and beyond.
AI ignores its own experimental evidence
A recent study more rigorously demonstrated this issue. Researchers tested AI agents’ ability to reason like a scientist in common scenarios in chemistry research. Like a chatbot, an AI agent is built on top of an underlying LLM. The agent acts sort of like an Iron Man suit, linking an LLM to a range of tools so it can perform tasks independently.
In the study, agents tackled laboratory reasoning tasks, such as determining which chemicals are present in a mystery solution. To do this, the agents could call on external tools to run experiments and retrieve results. Some of these tools simulated the experiment. But others could run real lab equipment.
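The agent pattern described above can be sketched as a loop in which a model picks the next experiment, a tool runs it, and the result is fed back before the next decision. This is a minimal illustration, not the study's code: `make_simulated_lab`, `stub_model`, `run_agent` and the ion labels are all hypothetical, and a deterministic function stands in for the LLM.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Tool = Callable[[str], str]
History = List[Tuple[str, str]]


def make_simulated_lab(solution: Set[str]) -> Dict[str, Tool]:
    """Hypothetical tool set: one simulated test on the mystery solution."""
    def test_for(ion: str) -> str:
        return "positive" if ion in solution else "negative"
    return {"test_for": test_for}


def stub_model(candidates: List[str], history: History) -> Optional[str]:
    """Stand-in for the LLM: choose the next untested candidate, or None to stop."""
    tested = {ion for ion, _ in history}
    for ion in candidates:
        if ion not in tested:
            return ion
    return None


def run_agent(candidates: List[str], tools: Dict[str, Tool],
              max_steps: int = 10) -> Tuple[List[str], History]:
    """Alternate between model decisions and tool calls, recording every result."""
    history: History = []
    for _ in range(max_steps):
        ion = stub_model(candidates, history)
        if ion is None:
            break
        history.append((ion, tools["test_for"](ion)))
    # The final answer is built only from what the tools actually reported.
    answer = [ion for ion, result in history if result == "positive"]
    return answer, history


if __name__ == "__main__":
    lab = make_simulated_lab({"Na+", "Cl-"})
    print(run_agent(["Na+", "K+", "Cl-"], lab))
```

In this toy version the final answer is derived strictly from the recorded tool results, which is the behavior the study found real agents often fail to show.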
Just as in the pen videos, the results weren’t ideal. The researchers annotated what was happening at each step of 619 scientific reasoning tasks performed by the AI agents. In 68 percent of these tasks, the agents ignored evidence at least once. They made claims without any supporting evidence in 53 percent of the tasks. And they successfully used contradictory evidence to change their output only 26 percent of the time, the team reported April 20 on arXiv.org.
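As a rough sanity check, the two per-task percentages can be converted back into approximate task counts. This is a sketch, not data from the paper: the published percentages are rounded, so the counts are only estimates, and the names used here are illustrative. The 26 percent figure is left out because it is reported as a share of occasions, not of tasks.

```python
# Approximate task counts implied by the rounded percentages in the article.
# Assumption: both percentages are taken over all 619 annotated tasks.
TOTAL_TASKS = 619


def approx_count(percent: int, total: int = TOTAL_TASKS) -> int:
    """Return the nearest whole number of tasks for a rounded percentage."""
    return round(total * percent / 100)


if __name__ == "__main__":
    print("ignored evidence at least once: about", approx_count(68), "of", TOTAL_TASKS)
    print("made an unsupported claim: about", approx_count(53), "of", TOTAL_TASKS)
```

Under that assumption, roughly 421 tasks involved ignored evidence and roughly 328 involved an unsupported claim.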
Human scientists follow “an iterative process” of coming up with a hypothesis, designing and performing experiments, then revisiting their initial ideas and changing their minds as needed, says N.M. Anoop Krishnan. “That’s not the case with AI,” says Krishnan, a materials scientist at the Indian Institute of Technology Delhi in India. “Even when you have clear evidence that shows that a particular line of investigation is not correct, [the AI] refuses to change the hypothesis or the plan.”
In science, you can’t typically trust a result unless you also trust the process it took to get there, says Kevin Jablonka, a study coauthor who leads a lab studying AI in materials science at Friedrich Schiller University Jena in Germany. A “transparent and meaningful” process is essential, he says.
The paper, Quattrociocchi says, goes “a little bit beyond the classical idea of benchmark.” A typical benchmark for AI systems only measures results: Did the system get the right answer? But Krishnan, Jablonka and their colleagues developed a benchmark that instead checks AI agents’ process on the way to an answer.
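A process-level check of this kind might be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual schema or scoring: the `Step` fields and the `process_metrics` function are hypothetical names for per-step annotations like the ones the article mentions (ignored evidence, unsupported claims, responses to contradictory evidence).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    """Hypothetical annotation for one step of an agent's trace."""
    ignored_evidence: bool = False
    unsupported_claim: bool = False
    saw_contradiction: bool = False
    updated_after_contradiction: bool = False


def process_metrics(tasks: List[List[Step]]) -> Dict[str, Optional[float]]:
    """Score how the agent worked, independent of whether its answer was right."""
    if not tasks:
        raise ValueError("need at least one annotated task")
    n = len(tasks)
    # Per-task rates: a task counts if the behavior occurs in any of its steps.
    ignored = sum(any(s.ignored_evidence for s in task) for task in tasks)
    unsupported = sum(any(s.unsupported_claim for s in task) for task in tasks)
    # Per-occasion rate: how often contradictory evidence led to a changed output.
    contradictions = [s for task in tasks for s in task if s.saw_contradiction]
    updated = sum(s.updated_after_contradiction for s in contradictions)
    return {
        "tasks_ignoring_evidence": ignored / n,
        "tasks_with_unsupported_claim": unsupported / n,
        "update_rate_on_contradiction": (
            updated / len(contradictions) if contradictions else None
        ),
    }
```

Note that none of these numbers depends on the final answer, which is what separates a process check from a conventional results-only benchmark.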
Do AI reasoning models truly reason?
Krishnan and Jablonka’s team outfitted three different underlying LLMs with two types of AI agent Iron Man suits. One agent suit only provided access to tools and did not make the LLM inside explain what it was doing. The other prompted the LLM to work through a scientific problem step by step, asking it to describe its approach to solving the problem before and after it accessed tools.
But what if the LLM itself knew more about reasoning? Might it do a better job?
AI companies have developed what they call reasoning models. This is an LLM that automatically breaks a question down and follows a step-by-step process to reach a final answer. It’s trained to do this by studying step-by-step reasoning examples. Once trained, a reasoning model can output text at each step of its process, supposedly describing how it is “thinking” through a problem. It can then be paired with an agent to access outside tools, or it can reason on its own.
Reasoning models do tend to outperform regular large language models on some types of problems. But the idea that they are “thinking” is probably an illusion, says Subbarao Kambhampati, a computer scientist at Arizona State University in Tempe. In a 2025 lecture, he said to imagine talking to a fitness trainer over the phone. If the fitness trainer tells you to do 10 crunches, you could make some noises like you are working hard, then say you’re done. You didn’t actually do anything, but the fitness instructor has no way of knowing otherwise. Similarly, reasoning models could merely be imitating what people say as they think through problems, without any actual reasoning.
“In general, telling whether a system is actually doing reasoning to solve the reasoning problem or using memory to solve the reasoning problem is impossible,” he previously told Science News.
Kambhampati and others’ research has shown evidence that reasoning models don’t truly reason. For one thing, a model can get the intermediate reasoning right but the answer wrong, or vice versa. Also, strangely, models trained on nonsense reasoning steps can still get right answers.
It remains to be seen how AI agents paired with reasoning models might perform on Jablonka and Krishnan’s new benchmark. But based on the work Kambhampati has done, it’s already hard to trust or verify the process that a reasoning model follows to arrive at an answer.
What does unscientific AI mean for science?
AI systems that combine agents, large language models and reasoning models can still be very useful in science, Jablonka says. But they are best suited to well-defined tasks “where we know exactly what we want,” Krishnan notes. AI isn’t yet ready for open-ended scientific reasoning, their research finds.
This contradicts what many companies want you to believe, Quattrociocchi says. “The narrative from big tech and even part of the scientific community is to say that we are seeing the emergence of a new form of intelligence that is going to make us better,” he says. But he doesn’t see that happening.
Rather, he sees AI producing words and other content based only on statistics, without verification. And this, he says, erodes our knowledge system. “The architecture of knowledge as we have known it until now is under attack,” he says. “Actually, I’m scared.”
Jablonka and Krishnan are more optimistic. Once we understand the limitations of AI agents and reasoning models, Krishnan says, “we can actually improve [the technology] and lead it towards enabling meaningful and disruptive discoveries.”
