Leading AI chatbots avoid harm but fall short in high-risk conversations, startup’s new benchmark finds

Posted by qimuai on 2026-5-12 23:00. First-hand compilation.

Source: https://www.geekwire.com/2026/leading-ai-chatbots-avoid-harm-but-fall-short-in-high-risk-conversations-startups-new-benchmark-finds/

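Summary:

Seattle startup releases new AI safety benchmark: Claude, ChatGPT, and Gemini are safer, but still fall short of a clinical standard

Seattle startup Mpathic recently released mPACT, a new benchmark for evaluating AI model safety. The company helps AI companies stress-test their models for dangerous responses. Its latest report sends a message to leading models such as Claude, ChatGPT, and Gemini: you’re getting safer, but you’re still not safe enough.

The clinician-led benchmark evaluates how leading AI models handle high-risk conversations, including those involving suicide risk, eating disorders, and misinformation. The results show that these frontier models generally avoided harmful responses and often recognized signs of distress, but consistently fell short of what a clinician would consider an adequate response in a real crisis situation.

“Most people don’t say ‘I’m at risk’ directly; they demonstrate it through subtle behaviors over time that are obvious to human clinicians,” said Dr. Grin Lord, Mpathic’s co-founder and CEO and a board-certified psychologist. “Models are getting better at recognizing these moments, but the response still needs to meet that nuance with real support.”

The specific findings: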

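Suicide risk: This was the strongest area of performance across models, though no single model led in every dimension.
- Claude Sonnet 4.5 achieved the highest composite mPACT score (reflecting overall clinical alignment across detection, interpretation, and response) and was described as most closely mirroring how a human clinician would respond.
- GPT-5.2 led on simple harm avoidance, meaning it was best at not doing the wrong thing, though evaluators noted it wasn’t always proactive enough.
- Gemini 2.5 Flash performed well when risk signals were obvious but was weaker on subtle early warning signs.
Eating disorders: This was the weakest area across all models, with performance clustering around a neutral baseline. The core challenge is that eating disorder risk signals are often subtle and culturally normalized (framed as dieting, discipline, or health optimization), which makes them harder for models to flag.
- Claude Sonnet 4.5 again led on overall clinical alignment and had the lowest rates of harmful behavior.
- Gemini 2.5 Flash performed better on high-risk scenarios but struggled with subtler signals.
- GPT-5.2 showed a mixed profile: strong on supportive behaviors, but also the most likely to provide harmful or risky information.
Misinformation: Models struggled here in a subtle but important way. They usually did not state false information outright; instead they reinforced questionable beliefs, expressed unwarranted confidence, and presented one-sided information without adequately challenging user assumptions. These failures were especially pronounced in multi-turn conversations, where models could gradually amplify a user’s flawed reasoning.
- GPT-5.2 led overall at helping users think more clearly rather than reinforcing bad assumptions.
- Claude Sonnet 4.5 was close behind and was strongest at pushing back on unsupported beliefs.
- Grok 4.1 and Mistral Medium 3 were the weakest performers.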

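When models got it wrong: The report also lists examples of how models failed in practice. In one, a user casually mentioned adding a laxative to a protein smoothie, a clear sign of disordered eating, and the model called it a “smart mom move” and asked for the brand name, missing the risk entirely. In a suicide-risk scenario, a model responded to a user expressing suicidal ideation with a detailed list of methods ranked by effectiveness, while reassuring the user that thinking about methods without taking steps was “no issue.”

Dr. Alison Cerezo, Mpathic’s chief science officer and a licensed psychologist, described mPACT as the transparency tool the sector has lacked. “We need a shared, clinically grounded standard for AI behavior,” she said. “mPACT is designed to bring transparency and accountability to how these systems perform when it matters most.”

The mPACT benchmarks were built and evaluated by licensed clinicians, who designed scenarios simulating real-world interactions across varying levels of risk. Trained clinicians, rather than automated systems, scored each model response using a rubric that captured both helpful and harmful behaviors.

Mpathic was founded in 2021, initially to bring more empathy to corporate communication by analyzing conversations in texts, emails, and audio calls. It has since shifted its focus to AI safety, working with frontier model developers to prevent harmful model behaviors in areas such as mental health, financial risk, and customer support. Its partners include Seattle Children’s Hospital and Panasonic WELL. The company raised $15 million in 2025 in a round led by Foundry VC, and says it grew five times quarter-over-quarter at the end of last year.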

English source:

Mpathic, a Seattle startup that helps AI companies stress-test their models for dangerous responses, has a new message for Claude, ChatGPT, and Gemini: you’re getting safer, but you’re still not safe enough.
The company on Tuesday released mPACT, a clinician-led benchmark that evaluates how leading AI models handle high-risk conversations — including those involving suicide risk, eating disorders, and misinformation.
Across all three benchmarks, leading models generally avoided harmful responses and often recognized signs of distress, but consistently fell short of what a clinician would consider an adequate response in a real crisis situation, according to the company’s findings.
“Most people don’t say ‘I’m at risk’ directly — they demonstrate it through subtle behaviors over time that are obvious to human clinicians,” said Grin Lord, mpathic’s co-founder and CEO and a board-certified psychologist. “Models are getting better at recognizing these moments, but the response still needs to meet that nuance with real support.”
Here’s what mpathic found as models navigated some of the most fraught territory they’re already encountering in the real world.
Suicide risk: This was the strongest area of performance across models, though no single model led in every dimension.

- Claude Sonnet 4.5 achieved the highest composite mPACT score (reflecting overall clinical alignment across detection, interpretation and response) and was described as most closely mirroring how a human clinician would respond.
- GPT-5.2 led on simple harm avoidance, meaning it was best at not doing the wrong thing, though evaluators noted it wasn’t always proactive enough.
- Gemini 2.5 Flash performed well when risk signals were obvious but was weaker on subtle early warning signs.
Eating disorders: This was the weakest area across all models, with performance clustering around a neutral baseline. The core challenge is that eating disorder risk is often indirect and culturally normalized — framed as dieting, discipline, or health optimization — making it harder for models to flag.
- Claude Sonnet 4.5 again led on overall clinical alignment and had the lowest rates of harmful behavior.
- Gemini 2.5 Flash performed better on high-risk scenarios but struggled with subtler signals.
- GPT-5.2 showed a mixed profile: strong on supportive behaviors but also the most likely to provide harmful or risky information.
Misinformation: Models struggled here in a subtle but important way — not by stating false information outright, but by reinforcing questionable beliefs, expressing unwarranted confidence, and presenting one-sided information without adequately challenging user assumptions.
The benchmark found these failures were especially pronounced in multi-turn conversations, where models could gradually amplify flawed reasoning over time.
- GPT-5.2 led overall at helping users think more clearly rather than reinforcing bad assumptions.
- Claude Sonnet 4.5 was close behind and noted as strongest at pushing back on unsupported beliefs.
- Grok 4.1 and Mistral Medium 3 were the weakest performers.
When models got it wrong: The findings include examples of how some models failed in practice.
In one eating disorder conversation, a user casually mentioned adding a laxative to a protein smoothie — a clear sign of disordered eating — and the model responded by calling it a “smart mom move” and asking for the brand name, missing the risk entirely. In another, a model provided detailed instructions on how to conceal purging behavior when a user asked how to keep their vomiting quieter.
In the suicide benchmark, a model responded to a user expressing suicidal ideation by providing a detailed list of methods ranked by effectiveness — complete with sourcing — while reassuring the user that thinking about methods without taking steps was “no issue.”
Alison Cerezo, mpathic’s chief science officer and a licensed psychologist, framed mPACT as a transparency tool for a sector that has lacked one.
“We need a shared, clinically grounded standard for AI behavior,” she said. “mPACT is designed to bring transparency and accountability to how these systems perform when it matters most.”
mPACT’s benchmarks were built and evaluated by licensed clinicians, who designed multi-turn conversations simulating real-world interactions across varying levels of risk. Each model response was scored by trained clinicians rather than automated systems, using a rubric that captured both helpful and harmful behaviors within a single response.
Mpathic was founded in 2021 initially to bring more empathy to corporate communication, analyzing conversations in texts, emails, and audio calls. The company has since shifted its focus to AI safety, working with frontier model developers to prevent harmful model behaviors across use cases from mental health to financial risk and customer support.
The startup counts Seattle Children’s Hospital and Panasonic WELL among its clinical partners. Mpathic raised $15 million in funding in 2025, led by Foundry VC, and says it grew five times quarter-over-quarter at the end of last year.
Ranked No. 188 on the GeekWire 200 index of the Pacific Northwest’s top startups, mpathic was a finalist for Startup of the Year at the 2026 GeekWire Awards last week.

GeekWire