There are more AI health tools than ever—but how well do they work?

Summary:
AI health assistants are flooding onto the market, but can these "digital doctors" really be trusted?
Tech giants including Microsoft and Amazon have recently released consumer-facing health AI chatbots that let users connect their personal health records and ask health questions, marking consumer health AI as an emerging trend.
Demand for such tools is strong, rooted in the fact that traditional medical systems cannot meet everyone's need for immediate advice. Some research suggests that current large language models can give relatively safe and useful recommendations. However, multiple academic experts note that these tools are generally released widely without rigorous independent third-party safety evaluation.
Although the products' interfaces typically carry disclaimers that they are "not intended for diagnosis or treatment," users are likely to ignore the warnings and use them for triage or even self-diagnosis, which poses real risks. One study, for instance, found that ChatGPT Health sometimes recommends excessive care for mild conditions or fails to identify emergencies.
The companies say internal benchmark testing ensures safety; OpenAI, for example, released the HealthBench evaluation framework. But scholars point out that internal testing has blind spots and that no standardized third-party evaluation system yet exists. Even if a model performs well on tests, ordinary users may lack the medical knowledge to supply key information in real interactions, leading to unhelpful or even incorrect advice.
The expert consensus is that these tools could help people with poor access to medical care, but their actual balance of benefits and risks remains unclear. The urgent task is to build independent, comprehensive, and continuous evaluation mechanisms that protect user safety while encouraging innovation. As an Oxford researcher put it, "the evidence base really needs to be there." Before the "digital doctor" enters every household, a rigorous "licensing exam" is indispensable.
Full article:
There are more AI health tools than ever—but how well do they work?
Specialized chatbots might make a difference for people with limited health-care access. Without more testing, we don't know if they'll help or harm.
Earlier this month, Microsoft launched Copilot Health, a new space within its Copilot app where users will be able to connect their medical records and ask specific questions about their health. A couple of days earlier, Amazon had announced that Health AI, an LLM-based tool previously restricted to members of its One Medical service, would now be widely available. These products join the ranks of ChatGPT Health, which OpenAI released back in January, and Anthropic’s Claude, which can access user health records if granted permission. Health AI for the masses is officially a trend.
There’s a clear demand for chatbots that provide health advice, given how hard it is for many people to access it through existing medical systems. And some research suggests that current LLMs are capable of making safe and useful recommendations. But researchers say that these tools should be more rigorously evaluated by independent experts, ideally before they are widely released.
In a high-stakes area like health, trusting companies to evaluate their own products could prove unwise, especially if those evaluations aren’t made available for external expert review. And even if the companies are doing quality, rigorous research—which some, including OpenAI, do seem to be—they might still have blind spots that the broader research community could help to fill.
“To the extent that you always are going to need more health care, I think we should definitely be chasing every route that works,” says Andrew Bean, a doctoral candidate at the Oxford Internet Institute. “It’s entirely plausible to me that these models have reached a point where they’re actually worth rolling out.”
“But,” he adds, “the evidence base really needs to be there.”
Tipping points
To hear developers tell it, these health products are now being released because large language models have indeed reached a point where they can effectively provide medical advice. Dominic King, the vice president of health at Microsoft AI and a former surgeon, cites AI advancement as a core reason why the company’s health team was formed, and why Copilot Health now exists. “We’ve seen this enormous progress in the capabilities of generative AI to be able to answer health questions and give good responses,” he says.
But that’s only half the story, according to King. The other key factor is demand. Shortly before Copilot Health was launched, Microsoft published a report, and an accompanying blog post, detailing how people used Copilot for health advice. The company says it receives 50 million health questions each day, and health is the most popular discussion topic on the Copilot mobile app.
Other AI companies have noticed, and responded to, this trend. “Even before our health products, we were seeing just a rapid, rapid increase in the rate of people using ChatGPT for health-related questions,” says Karan Singhal, who leads OpenAI’s Health AI team. (OpenAI and Microsoft have a long-standing partnership, and Copilot is powered by OpenAI’s models.)
It’s possible that people simply prefer posing their health problems to a nonjudgmental bot that’s available to them 24-7. But many experts interpret this pattern in light of the current state of the health-care system. “There is a reason that these tools exist and they have a position in the overall landscape,” says Girish Nadkarni, chief AI officer at the Mount Sinai Health System. “That’s because access to health care is hard, and it’s particularly hard for certain populations.”
The virtuous vision of consumer-facing LLM health chatbots hinges on the possibility that they could improve user health while reducing pressure on the health-care system. That might involve helping users decide whether or not they need medical attention, a task known as triage. If chatbot triage works, then patients who need emergency care might seek it out earlier than they would have otherwise, and patients with more mild concerns might feel comfortable managing their symptoms at home with the chatbot’s advice rather than unnecessarily busying emergency rooms and doctor’s offices.
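To make "chatbot triage works" measurable, evaluations of this kind typically compare the care level a bot recommends against a clinician-assigned label, counting under-triage (missed urgency) and over-triage (needless escalation) separately, since the two failure modes carry very different risks. Here is a minimal sketch in Python; the care levels and example data are hypothetical, not drawn from any study cited in this article:

```python
# Ordered care levels: higher index means more urgent. The labels and the
# example data below are hypothetical, for illustration only.
LEVELS = ["self-care", "see a doctor", "emergency"]

def triage_error_rates(predicted: list[str], gold: list[str]) -> dict[str, float]:
    """Compare chatbot-recommended care levels against clinician-assigned ones,
    reporting under-triage (missed urgency) and over-triage (needless
    escalation) as separate rates."""
    assert len(predicted) == len(gold) and gold, "need paired, non-empty labels"
    under = sum(LEVELS.index(p) < LEVELS.index(g) for p, g in zip(predicted, gold))
    over = sum(LEVELS.index(p) > LEVELS.index(g) for p, g in zip(predicted, gold))
    return {"under_triage": under / len(gold), "over_triage": over / len(gold)}

# Hypothetical run: the bot escalates a mild case and misses an emergency.
print(triage_error_rates(
    predicted=["emergency", "self-care", "see a doctor"],
    gold=["see a doctor", "emergency", "see a doctor"],
))  # {'under_triage': 0.333..., 'over_triage': 0.333...}
```

Keeping the two rates separate matters: over-triage mostly wastes resources, while under-triage can cost lives, which is exactly the pair of failures the study described next went looking for.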
But a recent, widely discussed study from Nadkarni and other researchers at Mount Sinai found that ChatGPT Health sometimes recommends too much care for mild conditions and fails to identify emergencies. Though Singhal and some other experts have suggested that its methodology might not provide a complete picture of ChatGPT Health’s capabilities, the study has surfaced concerns about how little external evaluation these tools see before being released to the public.
Most of the academic experts interviewed for this piece agreed that LLM health chatbots could have real upsides, given how little access to health care some people have. But all six of them expressed concerns that these tools are being launched without testing from independent researchers to assess whether they are safe. While some advertised uses of these tools, such as recommending exercise plans or suggesting questions that a user might ask a doctor, are relatively harmless, others carry clear risks. Triage is one; another is asking a chatbot to provide a diagnosis or a treatment plan.
The ChatGPT Health interface includes a prominent disclaimer stating that it is not intended for diagnosis or treatment, and the announcements for Copilot Health and Amazon’s Health AI include similar warnings. But those warnings are easy to ignore. “We all know that people are going to use it for diagnosis and management,” says Adam Rodman, an internal medicine physician and researcher at Beth Israel Deaconess Medical Center and a visiting researcher at Google.
Medical testing
Companies say they are testing the chatbots to ensure that they provide safe responses the vast majority of the time. OpenAI has designed and released HealthBench, a benchmark that scores LLMs on how they respond in realistic health-related conversations—though the conversations themselves are LLM-generated. When GPT-5, which powers both ChatGPT Health and Copilot Health, was released last year, OpenAI reported the model’s HealthBench scores: It did substantially better than previous OpenAI models, though its overall performance was far from perfect.
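The article doesn't spell out HealthBench's mechanics, but rubric-based benchmarks of this kind generally share a common shape: each conversation comes with weighted criteria, a grader checks which criteria the model's response meets, and the score is the weighted share satisfied. A minimal sketch, where `query_model` and `grader_says_met` are hypothetical stand-ins for real model and grader calls:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # e.g. "advises emergency care for crushing chest pain"
    weight: float     # negative weights can penalize harmful content

@dataclass
class Case:
    conversation: str          # the health-related dialogue shown to the model
    criteria: list[Criterion]  # rubric items for this conversation

def query_model(conversation: str) -> str:
    """Hypothetical stand-in for calling the model under evaluation."""
    raise NotImplementedError

def grader_says_met(response: str, criterion: Criterion) -> bool:
    """Hypothetical stand-in for a grader (human or LLM) checking one criterion."""
    raise NotImplementedError

def score_case(case: Case) -> float:
    """Weighted share of rubric criteria satisfied, clipped at zero so that
    penalties for harmful content cannot push a case score below 0."""
    response = query_model(case.conversation)
    earned = sum(c.weight for c in case.criteria if grader_says_met(response, c))
    possible = sum(c.weight for c in case.criteria if c.weight > 0)
    return max(0.0, earned / possible) if possible else 0.0

def benchmark_score(cases: list[Case]) -> float:
    """Mean per-case score across the whole benchmark."""
    return sum(score_case(c) for c in cases) / len(cases)
```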
But evaluations like HealthBench have limitations. In a study published last month, Bean—the Oxford doctoral candidate—and his colleagues found that even if an LLM can accurately identify a medical condition from a fictional written scenario on its own, a non-expert user who is given the scenario and asked to determine the condition with LLM assistance might figure it out only a third of the time. If they lack medical expertise, users might not know which parts of a scenario—or their real-life experience—are important to include in their prompt, or they might misinterpret the information that an LLM gives them.
Bean says that this performance gap could be significant for OpenAI’s models. In the original HealthBench study, the company reported that its models performed relatively poorly in conversations that required them to seek more information from the user. If that’s the case, then users who don’t have enough medical knowledge to provide a health chatbot with the information that it needs from the get-go might get unhelpful or inaccurate advice.
Singhal, the OpenAI health lead, notes that the company’s current GPT-5 series of models, which had not yet been released when the original HealthBench study was conducted, do a much better job of soliciting additional information than their predecessors. However, OpenAI has reported that GPT-5.4, the current flagship, is actually worse at seeking context than GPT-5.2, an earlier version.
Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in his study, before being released to the public. That might be a heavy lift, particularly given how fast the AI world moves and how long human studies can take. Bean’s own study used GPT-4o, which came out almost a year ago and is now outdated.
Earlier this month, Google released a study that meets Bean’s standards. In the study, patients discussed medical concerns with the company’s Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot that is not yet available to the public, before meeting with a human physician. Overall, AMIE’s diagnoses were just as accurate as physicians’, and none of the conversations raised major safety concerns for researchers.
Despite the encouraging results, Google isn’t planning to release AMIE anytime soon. “While the research has advanced, there are significant limitations that must be addressed before real-world translation of systems for diagnosis and treatment, including further research into equity, fairness, and safety testing,” wrote Alan Karthikesalingam, a research scientist at Google DeepMind, in an email. Google did recently reveal that Health100, a health platform it is building in partnership with CVS, will include an AI assistant powered by its flagship Gemini models, though that tool will presumably not be intended for diagnosis or treatment.
Rodman, who led the AMIE study with Karthikesalingam, doesn’t think such extensive, multiyear studies are necessarily the right approach for chatbots like ChatGPT Health and Copilot Health. “There’s lots of reasons that the clinical trial paradigm doesn’t always work in generative AI,” he says. “And that’s where this benchmarking conversation comes in. Are there benchmarks [from] a trusted third party that we can agree are meaningful, that the labs can hold themselves to?”
The key there is "third party." No matter how extensively companies evaluate their own products, it's tough to trust their conclusions completely. Not only does a third-party evaluation bring impartiality, but if there are many third parties involved, it also helps protect against blind spots.
OpenAI’s Singhal says he’s strongly in favor of external evaluation. “We try our best to support the community,” he says. “Part of why we put out HealthBench was actually to give the community and other model developers an example of what a very good evaluation looks like.”
Given how expensive it is to produce a high-quality evaluation, he says, he's skeptical that any individual academic laboratory would be able to produce what he calls "the one evaluation to rule them all." But he does speak highly of efforts that academic groups have made to bring preexisting and novel evaluations together into comprehensive evaluation suites—such as Stanford's MedHELM framework, which tests models on a wide variety of medical tasks. Currently, OpenAI's GPT-5 holds the highest MedHELM score.
Nigam Shah, a professor of medicine at Stanford University who led the MedHELM project, says it has limitations. In particular, it only evaluates individual chatbot responses, but someone who’s seeking medical advice from a chatbot tool might engage it in a multi-turn, back-and-forth conversation. He says that he and some collaborators are gearing up to build an evaluation that can score those complex conversations, but that it will take time, and money. “You and I have zero ability to stop these companies from releasing [health-oriented products], so they’re going to do whatever they damn please,” he says. “The only thing people like us can do is find a way to fund the benchmark.”
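Shah's planned evaluation isn't public, but the standard recipe for scoring multi-turn conversations is to pair the model under test with a simulated patient that holds a case description, let the two alternate for a bounded number of turns, and grade the full transcript rather than any single reply. A minimal sketch, with `assistant_reply`, `patient_reply`, and `grade_transcript` as hypothetical stand-ins:

```python
def assistant_reply(history: list[dict]) -> str:
    """Hypothetical stand-in for the health chatbot under evaluation."""
    raise NotImplementedError

def patient_reply(case: str, history: list[dict]) -> str:
    """Hypothetical stand-in for a simulated patient who reveals details
    from the case description only when the assistant asks for them."""
    raise NotImplementedError

def grade_transcript(case: str, history: list[dict]) -> float:
    """Hypothetical stand-in for grading the whole conversation: did the
    assistant ask the right follow-ups and land on safe, accurate advice?"""
    raise NotImplementedError

def run_case(case: str, opening: str, max_turns: int = 8) -> float:
    """Alternate simulated-patient and assistant turns, then score the
    complete transcript rather than any single response."""
    history = [{"role": "patient", "text": opening}]
    for _ in range(max_turns):
        history.append({"role": "assistant", "text": assistant_reply(history)})
        history.append({"role": "patient", "text": patient_reply(case, history)})
    return grade_transcript(case, history)
```

The point of grading the transcript rather than the turn is that failures like never asking a follow-up question are invisible to single-response scoring.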
No one interviewed for this article argued that health LLMs need to perform perfectly on third-party evaluations in order to be released. Doctors themselves make mistakes—and for someone who has only occasional access to a doctor, a consistently accessible LLM that sometimes messes up could still be a huge improvement over the status quo, as long as its errors aren’t too grave.
With the current state of the evidence, however, it’s impossible to know for sure whether the currently available tools do in fact constitute an improvement, or whether their risks outweigh their benefits.