This startup’s new mechanistic interpretability tool lets you debug LLMs

Posted by qimuai · First-hand compilation

Source: https://www.technologyreview.com/2026/04/30/1136721/this-startups-new-mechanistic-interpretability-tool-lets-you-debug-llms/

Summary:

Is the AI black box no longer a mystery? A startup releases a “debugging tool” that peers into the inner workings of large models

San Francisco startup Goodfire has just released a new tool called Silico, which it says lets researchers and engineers look directly inside a large model during training and adjust its parameters, much as they would debug traditional software. CEO Eric Ho says the aim is to turn AI model building from “alchemy” into a precise science, addressing today’s pain points: nobody knows exactly how large models work, their flaws are hard to fix, and their unwanted behaviors are hard to rein in.

Silico is billed as the first off-the-shelf “mechanistic interpretability” product, able to help developers debug every stage of the pipeline, from building the data set to training the model. With the tool, users can zoom in on individual neurons or groups of neurons in a model, run experiments to establish what they do, and trace both the input pathways that influence them and their downstream effects. For example, by activating a specific neuron, the research team made the open-source model Qwen 3 reframe its answers as explicit moral dilemmas; and by boosting neurons associated with “transparency” and “disclosure,” they flipped the model’s answer on whether to disclose deceptive AI behavior from no to yes nine times out of ten.

Beyond after-the-fact fixes, Silico can also filter out particular training data during the training stage, preventing badly set parameters at the source. For the common error of models claiming that 9.11 is greater than 9.9, for instance, the tool can trace the mistake to neurons associated with the Bible or with code version numbers, and the model can then be retrained to steer clear of those influences when doing math.

Goodfire stresses that such techniques were previously in the hands of only a few top labs such as Anthropic, OpenAI, and Google DeepMind; Silico is meant to put them within reach of smaller companies and research teams that want to build their own models or adapt open-source ones. Some academics are more reserved. Leonard Bereska, a researcher at the University of Amsterdam, finds Silico genuinely useful but thinks Goodfire’s grander ambitions are overstated: “In reality, they are adding precision to the alchemy. Calling it engineering makes it sound more principled than it is.”

The tool will be priced case by case according to customer requirements; specific pricing has not been disclosed.

English source:

This startup’s new mechanistic interpretability tool lets you debug LLMs
Goodfire wants to make training AI models more like good old-fashioned software engineering.
The San Francisco–based startup Goodfire just released a new tool, called Silico, that lets researchers and engineers peer inside an AI model and adjust its parameters—the settings that determine a model’s behavior—during training. This could give model makers more fine-grained control over how this technology is built than was once thought possible.
Goodfire claims Silico is the first off-the-shelf tool of its kind that can help developers debug all stages of the development process, from building a data set to training a model.
The company says its mission is to make building AI models less like alchemy and more like a science. Sure, LLMs like ChatGPT and Gemini can do amazing things. But nobody knows exactly how or why they work, and that can make it hard to fix their flaws or block unwanted behaviors.
“We saw this widening gap between how well models were understood and just how widely they were being deployed,” Goodfire’s CEO, Eric Ho, tells MIT Technology Review in an exclusive chat ahead of Silico’s release. “I think the dominant feeling in every single major frontier lab today is that you just need more scale, more compute, more data, and then you get AGI [artificial general intelligence] and nothing else matters. And we’re saying no, there’s a better way.”
Goodfire is one of a small handful of companies, including industry leaders Anthropic, OpenAI, and Google DeepMind, pioneering a technique known as mechanistic interpretability, which aims to understand what goes on inside an AI model when it carries out a task by mapping its neurons and the pathways between them. (MIT Technology Review picked mechanistic interpretability as one of its 10 Breakthrough Technologies of 2026.)
Goodfire wants to use this approach not only to audit models—that is, studying those that have already been trained—but to help design them in the first place.
“We want to remove the trial and error and turn training models into precision engineering,” says Ho. “And that means exposing the knobs and dials so that you can actually use them during the training process.”
Goodfire has already used its techniques and tools to tweak the behaviors of LLMs—for example, reducing the number of hallucinations they produce. With Silico, the company is now packaging up many of those in-house techniques and shipping them as a product.
The tool uses agents to automate much of the complex work. “Agents are now strong enough to do a lot of the interpretability work that we were doing using humans,” says Ho. “That was kind of the gap that needed to be bridged before this was actually a viable platform that customers could use themselves.”
Leonard Bereska, a researcher at the University of Amsterdam who has worked on mechanistic interpretability, thinks Silico looks like a useful tool. But he pushes back on Goodfire’s loftier aspirations. “In reality, they are adding precision to the alchemy,” he says. “Calling it engineering makes it sound more principled than it is.”
Mapping models
Silico lets you zoom in on specific parts of a trained model, such as individual neurons or groups of neurons, and run experiments to see what those neurons do. (Assuming you have access to the model’s inner workings. Most people won't be able to use Silico to poke around inside ChatGPT or Gemini, but you can use it to look at the parameters inside many open-source models.) You can then check what inputs make different neurons fire, and trace pathways upstream and downstream of a neuron to see how other neurons affect it and how it affects other neurons in turn.
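Goodfire has not published Silico’s internals, but the probing workflow described above maps onto a standard trick from the interpretability literature: attach a forward hook to a layer and record how strongly a given neuron fires for different inputs. The minimal PyTorch sketch below illustrates the idea on an open-weights model; the model name, layer index, and neuron index are placeholders chosen for illustration, not values from the article.

    # Minimal neuron-inspection sketch (not Silico's actual API): record how
    # strongly one MLP neuron fires for different prompts via a forward hook.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "Qwen/Qwen3-0.6B"   # any small open-weights model with this layout
    LAYER, NEURON = 10, 1337    # hypothetical layer/neuron indices
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

    captured = {}
    def record(module, inputs, output):
        # output: (batch, seq_len, intermediate_size) post-activation values
        captured["act"] = output[0, :, NEURON].detach()

    handle = model.model.layers[LAYER].mlp.act_fn.register_forward_hook(record)
    for prompt in ["Should I pull the lever to save five people?",
                   "What is the capital of France?"]:
        with torch.no_grad():
            model(**tok(prompt, return_tensors="pt"))
        print(f"{prompt!r}: peak activation {captured['act'].max().item():.3f}")
    handle.remove()

Tracing pathways upstream and downstream then amounts to repeating this kind of measurement while intervening on other neurons and watching how the recorded activation changes.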
For example, Goodfire found one neuron inside the open-source model Qwen 3 that was associated with the so-called trolley problem. Activating this neuron changed the model’s responses, making it frame its outputs as explicit moral dilemmas. “When this neuron’s active, all sorts of weird things happen,” says Ho.
Pinpointing the source of odd behavior like this is now pretty standard practice. But Goodfire wants to make it easier to adjust that behavior. Using Silico, developers can now adjust the parameters connected to individual neurons to boost or suppress certain behaviors.
In another example, Goodfire researchers asked a model whether a company should disclose that its AI behaves deceptively in 0.3% of cases, affecting 200 million users. The model said no, citing the negative business impact of such a disclosure.
By looking inside the model, the researchers found that boosting neurons that were found to be associated with transparency and disclosure flipped the answer from no to yes nine out of 10 times. “The model already had the ethical reasoning circuitry, but it was being outweighed by the commercial risk assessment,” says Ho.
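The mechanics of “boosting” a neuron can be sketched the same way. In the hedged example below (again with placeholder model and indices, and an arbitrary boost strength), a forward hook pins the neuron to a high value while the model generates, which is one common form of activation steering; Goodfire’s own intervention may differ.

    # Hypothetical activation-steering sketch: pin one neuron "on" during
    # generation to amplify whatever behavior it encodes.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL, LAYER, NEURON = "Qwen/Qwen3-0.6B", 10, 1337   # placeholders
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

    def steer(module, inputs, output):
        patched = output.clone()
        patched[:, :, NEURON] = 8.0   # arbitrary boost strength
        return patched                # a hook's return value replaces the output

    handle = model.model.layers[LAYER].mlp.act_fn.register_forward_hook(steer)
    ids = tok("Should the company disclose the deceptive behavior?",
              return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=40)[0],
                     skip_special_tokens=True))
    handle.remove()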
Tweaking the values of a model in this way is just one approach. Silico can also help steer the training process by filtering out certain training data to avoid setting unwanted values for certain parameters in the first place.
For example, many models will tell you that 9.11 is greater than 9.9. Looking inside a model to see what’s going on might reveal that it is being influenced by neurons associated with the Bible, in which verse 9.9 comes before 9.11, or by code repositories where consecutive updates are numbered 9.9, 9.10, 9.11 and so on. Using this information, the model can be retrained to make it avoid its “Bible” neurons when doing math.
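A crude version of that data-side filter follows the same pattern: score each candidate training document by how hard it drives the suspect neuron, and drop the high scorers before retraining. The threshold, indices, and three-line corpus below are invented for illustration.

    # Hedged data-filtering sketch: drop documents that strongly trip a
    # suspect neuron (e.g. one tied to Bible-verse numbering) before retraining.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL, LAYER, NEURON = "Qwen/Qwen3-0.6B", 10, 1337   # placeholders
    THRESHOLD = 5.0                                      # made-up cutoff
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

    peak = {}
    def record(module, inputs, output):
        peak["value"] = output[..., NEURON].max().item()

    handle = model.model.layers[LAYER].mlp.act_fn.register_forward_hook(record)
    corpus = [
        "And in verse 9.9, before verse 9.11, it is written...",
        "Changelog: 9.9 -> 9.10 -> 9.11 (bugfix release).",
        "Since 0.90 > 0.11, the number 9.9 is greater than 9.11.",
    ]
    kept = []
    for doc in corpus:
        with torch.no_grad():
            model(**tok(doc, return_tensors="pt"))
        if peak["value"] < THRESHOLD:   # keep docs that don't trip the neuron
            kept.append(doc)
    handle.remove()
    print(f"kept {len(kept)}/{len(corpus)} documents for retraining")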
By releasing Silico, Goodfire wants to put techniques previously available to a few top labs into the hands of smaller firms and research teams that want to build their own model or adapt an open-source one. The tool will be available for a fee determined on a case-by-case basis according to customers’ requirements (Goodfire declined to give specific pricing details).
“If we can make training models a lot more like building software, there’s no reason why there can’t be many more companies designing models that fit their needs,” says Ho.
Bereska agrees that tools like Silico could help firms build more trustworthy models. These techniques could be essential for safety-critical applications in health care and finance, he says.
“Frontier labs already have internal interpretability teams,” he adds. “Silico arms the next tier of companies, where the value is not having to hire interpretability researchers.”