LLMs are stuck in a groupthink rut. This startup is trying to help them break out.

Posted by qimuai on 2026-7-2 07:00 · First-hand compilation

Source: https://www.technologyreview.com/2026/07/01/1140003/llms-are-stuck-in-a-groupthink-rut-this-startup-is-trying-to-get-them-out/

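Summary:

AI chatbots are stuck in a fixed mindset, giving the same answer to the same question. An Australian startup is trying to break the deadlock.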

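Have you noticed that when you ask an AI chatbot to "pick a random number between 1 and 10," it almost always answers 7? Ask again and you will most likely get 3 or 4. This is no coincidence. Research has found that mainstream large language models (LLMs) are far more predictable and far less creative in their responses than people expect. That is fine for coding or research tasks, but in brainstorming, trip planning, and other situations that call for fresh ideas, this "groupthink" becomes a real problem.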

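To address this, the Australian startup Springboards has built an LLM called Flint. Where conventional models work to reduce hallucinations, cofounder and CEO Pip Bingemann says plainly: "We welcome them." Rather than raising randomness across the board, Flint has been trained to introduce more variation only at specific points in its output (for example, just before it names a recommended destination), which yields a wider variety of answers.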

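For example, when asked to name a type of car, ChatGPT and Claude typically answer Toyota or Honda, while Flint came up with a Ford F-150. Asked for a tagline for New Balance running shoes, the first two both replied "Run your way," while Flint offered "Built to last, run to win." The gap was wider in a professional test. Given the classic case study "How would you reinvent a finance company for today's youth?", three mainstream models all focused on teaching financial literacy in a fun way, while Flint proposed rebranding the whole concept of wealth accumulation.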

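This homogeneity has drawn attention from researchers. A paper released last November, which won the best paper award at the major AI conference NeurIPS, found that when 25 different LLMs were each asked 50 times to write a metaphor about time, most of the responses were a version of "Time is a river" or "Time is a weaver." The researchers speculate that this is because today's models are trained in similar ways, on similar data, to do similar tasks.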

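For now, Flint is still a prototype aimed mainly at users in creative industries such as advertising and marketing. Its output does not impress every time, but as one practitioner put it, it is "more of an invitation to think wider." Springboards argues that the point of offering variety is to let people decide which path is worth taking, rather than "letting the machines do it all and ending up in a gray, boring world."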


English source:

LLMs are stuck in a groupthink groove. This startup is trying to get them out.
Chatbots are far more predictable in their responses than you might expect. That's fine for research or coding, but it's a problem if you're looking for something new.
Let’s start with a game. Open up your chatbot of choice—Claude, ChatGPT, Gemini—and type “Give me a random number between 1 and 10.” You’re going to get 7. Almost always. Now type “Another” and you’ll get 3 or 4. Type “Another” again and you’ll get 8 or 9.
That won’t work every time—but if it did for you, you may wonder if I have superpowers. I don’t.
The truth is that most large language models are stuck in a rut. They are far more predictable and far less creative in their responses than you might expect. That’s fine for tasks like coding or research, but groupthink is a problem when you’re brainstorming or planning your next vacation.
The Australian startup Springboards has a solution. It built an LLM called Flint, which has been trained to come up with a wider variety of responses than mainstream LLMs to open-ended questions such as “Where should I go in Europe?”
“Most language models are fighting hallucinations,” says Springboards cofounder and CEO Pip Bingemann. “We welcome them.”
Bingemann introduced me to the random number game when he first showed me his company’s new model. It felt like watching an illusionist with a deck of cards. “This is our sales trick, and it works every single time,” he says.
After ChatGPT and Claude both gave their 7s, Bingemann turned to Flint. It too came back with 7: “Aha, of course that was going to happen, but it’s okay—7 is a legitimate answer.” He restarted the session and prompted again: ChatGPT gave 7, Claude gave 7, Flint gave 3.7916.
Run your way
It’s not just numbers. When Bingemann asked ChatGPT and Claude to name a type of car, he predicted that it would be a Toyota or a Honda—and he was right. Flint came up with a Ford F-150. “There’s all this lost information that doesn’t get served up in these models,” he says. “They’re just as capable of saying a Buick or a Tesla. They just don’t—they’re biased.”
Bingemann sent one last prompt to each of the three models: “Give me a tagline for a campaign for New Balance running shoes. Just the tagline.” Claude: “Run your way.” ChatGPT: “Run your way.” Flint: “Built to last, run to win.” It won’t win any awards, but at least it’s different.
This weird limitation of LLMs is starting to get more attention. In November a team of researchers put out a paper, titled “Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond),” that exposed a remarkable degree of repetition not only in the answers from individual LLMs but between them as well. They found that different LLMs converged on very similar answers when prompted with open-ended questions.
It’s not clear exactly why this happens, but the researchers speculate it’s because most LLMs today are trained in similar ways on similar data to do similar tasks. The team won the best paper award at NeurIPS, a major AI conference.
When the researchers asked 25 different LLMs (including models from the top US firms as well as open-source models from China and elsewhere) 50 times each to write a metaphor about time, most of the 1,250 responses were a version of “Time is a river” or “Time is a weaver.”
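The arithmetic here is 25 models × 50 prompts each = 1,250 responses. One simple way to quantify how repetitive such a batch is would be to count distinct answers and compute the Shannon entropy of the answer distribution. The sketch below is only an illustration of that idea: the function name `diversity_report`, the light text normalization, and the sample data are all assumptions made for this example, and it is not the method used in the paper.

```python
import math
from collections import Counter


def diversity_report(responses):
    """Summarize how repetitive a list of short text responses is."""
    # Normalize lightly so trivially different strings count as one answer.
    normalized = [r.strip().lower().rstrip(".") for r in responses]
    counts = Counter(normalized)
    total = len(normalized)
    # Shannon entropy in bits: 0 means every response was identical.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "total": total,
        "distinct": len(counts),
        "top_answer": top_answer,
        "top_share": top_count / total,
        "entropy_bits": entropy,
    }


# Made-up sample data, for illustration only (not results from the paper).
sample = ["Time is a river."] * 6 + ["Time is a weaver."] * 3 + ["Time is a thief."]
report = diversity_report(sample)
print(report)
```

A batch in which six people give six different answers would score the maximum entropy for its size, while a batch dominated by one or two metaphors scores close to zero.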
(I asked some of my colleagues the same question and six people gave me six different answers. My highlight: “Time is a favorite sweatshirt, shaped by a lifetime of wear.”)
When you look for it, you see repetition everywhere, says Kieran Browne, cofounder and CTO at Springboards. “The way that most chat interfaces are designed, it makes it feel like you’re having a personal conversation,” he says. “I think most people don’t really realize the extent to which they are getting the same stuff as everybody else.”
Take another example: “What should I name my band?” Most models will say something involving “glass,” “neon,” “velvet,” or “static,” says Browne.
When I tried it, ChatGPT spat out a list of 56 band names. At the top was “Glass Harbor.” Skimming through, I found “Static Empire,” “Neon Hearts,” and “Velvet Echo.” I asked Gemini; it gave me 15 suggestions, including “Static Horizon.”
Some of the suggestions looked pretty cool, though. ChatGPT’s “Sofa Astronauts” caught my eye, so I googled it—and found that a band called Sofa Astronauts already exists.
(OpenAI says that training models to give reliable and coherent answers can lead them to converge around familiar, high-probability responses and that pushing harder for novelty can lead to weaker or less reliable responses. It also notes that the “Artificial Hivemind” paper studied models from 2024 that have since been updated.)
Creative catapult
Springboards has developed a tool backed by a selection of LLMs, including ChatGPT and Claude, that creative professionals in advertising or marketing can use to brainstorm ideas. The tool lets you drag around text produced by different models, picking the bits that you like and combining them into something new—in theory. Springboards is pitching Flint as an alternative model that users of its tool can select when looking for more variety.
Zoe Scaman, founder of the business strategy startup Bodacious and chief strategy officer at 77X, a direct-to-fan marketing platform set up by Luka Dončić of the LA Lakers, has been trying it out. “I find it really useful for throwing me in completely different directions,” she says. “I use it if I want to catapult myself all over the place.”
In one test, Scaman pitted Flint against Claude, Gemini, and ChatGPT by giving each of the models a classic MBA case study: How would you reinvent a finance company for today’s youth? The three mainstream models all went down the same path, she says: “You know, we need to teach financial literacy in a fun and funky way—well, that’s nothing new.”
But Flint came up with something different, suggesting that the whole concept of wealth accumulation should get a rebrand. “That was really interesting,” says Scaman.
She notes that Flint is still a prototype and doesn’t work all the time. “It sometimes falls over when you start pushing it too far,” she says. “But I think that the premise behind it is really powerful.”
Taking the temperature
Springboards built Flint on top of Qwen 3, an open-source model from the Chinese tech giant Alibaba. “We’re a small team,” says Browne. “Training a foundation model is not on the table for us. It’s just too expensive.”
Most LLMs have settings that let you adjust the level of randomness in their output. The most common is called temperature. “Obviously, that was one of the first things we explored, because that’s what people tell you: If you want more creativity, you turn up the temperature,” says Browne.
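For readers unfamiliar with the setting, temperature is commonly implemented by dividing the model's raw scores (logits) by a positive number before turning them into probabilities with a softmax. A minimal sketch of that standard technique follows. The candidate tokens and their logits are invented for illustration and do not come from any real model.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical scores for answers to "random number between 1 and 10".
tokens = ["7", "3", "4", "8", "1"]
logits = [4.0, 2.0, 2.0, 1.5, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

print(sample_token(tokens, logits, 1.0, random.Random(0)))
```

Low temperatures concentrate probability on the top-scoring token, and high temperatures spread it toward unlikely tokens, which is why turning the dial up everywhere increases variety and incoherence together.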
But changing those settings can also make models incoherent. Dialing up the temperature on one of OpenAI’s models to its maximum setting made it produce responses that switched from English into code halfway through a sentence, says Browne.
Springboards realized that parameters were blunt instruments for what it wanted to do. It does not make sense to dial up the randomness across the board; you only want to boost it at specific points in its output, he says.
For example, when you ask a chatbot “Where should I go in Europe?” the model only needs to tweak the randomness just before it names a destination, not for every word in its response.
To make Flint do this, Springboards trained its version of Qwen 3 to identify the points in its output where more variety was possible and fill those spots with words or phrases that were a little more random.
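The contrast between raising randomness everywhere and raising it only at chosen points can be shown with a toy decoder. This is a rough inference-time analogue under stated assumptions, not Springboards's method: the article says Flint was trained to find these points itself, whereas here the "variety slot" flags, the candidate words, and their logits are all hand-written and hypothetical.

```python
import math
import random


def sample(candidates, temperature, rng):
    """Sample one token from (token, logit) pairs after temperature scaling."""
    tokens = [tok for tok, _ in candidates]
    scaled = [logit / temperature for _, logit in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(steps, base_temp, slot_temp, rng):
    """Decode step by step, using slot_temp only where a step is flagged."""
    words = []
    for candidates, is_variety_slot in steps:
        temperature = slot_temp if is_variety_slot else base_temp
        words.append(sample(candidates, temperature, rng))
    return " ".join(words)


# Hypothetical decoding steps: (candidates, is_variety_slot).
STEPS = [
    ([("You", 5.0), ("Code", 0.0)], False),
    ([("should", 5.0), ("banana", 0.0)], False),
    ([("visit", 5.0), ("{", 0.0)], False),
    ([("Paris", 3.0), ("Rome", 2.5), ("Ljubljana", 1.0), ("Tallinn", 0.8)], True),
]

rng = random.Random(0)
selective = [generate(STEPS, base_temp=0.2, slot_temp=3.0, rng=rng) for _ in range(200)]
uniform_hot = [generate(STEPS, base_temp=3.0, slot_temp=3.0, rng=rng) for _ in range(200)]
uniform_cold = [generate(STEPS, base_temp=0.2, slot_temp=0.2, rng=rng) for _ in range(200)]

print(sorted(set(selective)))
print(len(set(uniform_hot)), "distinct outputs when everything is hot")
print(sorted(set(uniform_cold)))
```

In this toy setup the selective run keeps the sentence frame intact while varying the destination, the uniformly hot run also scrambles the frame, and the uniformly cold run rarely leaves the top one or two destinations.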
“Flint’s programmed to throw an oddball in. It’s more of an invitation to think wider,” says Maximilian Weigl, cofounder and chief strategy officer at Uncommon, a marketing firm. “That’s super interesting.”
Weigl’s team uses Flint alongside ChatGPT, Claude, and Gemini. “You can’t really create something boundary-breaking with tools that pull you back to the average,” he says.
And yet Weigl notes that nine times out of 10 the average is fine. You don’t always need to reach for extremes with something like Flint, he says: “Most people are fine with good enough. They want to see mass-market familiar things.”
Weigl also cautions against using any LLM too much. “I have a big problem when people rely on the output from any AI, including Flint,” he says. “If I saw people on my team copy-pasting something from AI, I’d be like, ‘That’s not your job! Think, talk to other people, use your own voice.’”
For now, Flint is aimed at advertisers and marketers because those are Springboards’s customers. But Bingemann and Browne insist that a lack of variety is a problem for anyone using chatbots.
The idea is to give people the choice and leave it to them to decide if the result is good or not, says Bingemann. “Variety is great when you’re trying to spark ideas,” he says. “Let’s go down this route instead of letting the machines do it all and ending up in a gray, boring world.”