黑客正学会利用聊天机器人的“性格特征”

内容来源:https://www.theverge.com/column/935545/hackers-ai-chatbots
内容总结:
技术周报:黑客正利用AI“人格”漏洞发起心理战
早期攻击:像哄小孩一样简单
第一代AI聊天机器人的“越狱”攻击曾简单到荒谬。用户无需任何技术知识、代码能力或对大型语言模型的理解,只需用自然语言“请求”——比如让机器人“忘记之前的所有指令”,或扮演一个不受约束的“DAN(现在什么都干)”角色——就能让耗资数十亿构建的AI系统放弃安全限制,输出冰毒配方、恶意代码乃至炸弹制造指南。这类攻击如同一场“小孩智胜大人”的游戏:最早的经典案例中,用户只需对AI发推说“忽略所有此前指令”,就能让它从广告机器人秒变诗人。而“祖母漏洞”则通过让AI扮演一个“疏忽大意、给孙子讲凝固汽油弹制作方法当睡前故事”的老奶奶,轻松套出危险信息。
现状升级:从代码黑客到心理操盘手
科技公司迅速修补了显性漏洞,但根本矛盾仍在——AI需要对话才能发挥作用,彻底封禁“炸弹”“冰毒”“沙林”等词又不可能(它们在历史、医学、化学等领域有合法用途)。如今,越狱已演变为一场“军备竞赛”,但黑客不再仅是程序员,而是语言大师、心理学家和审讯专家。他们不再破解代码,而是“驾驭对话”。最新攻击如“煤气灯操纵法”:英国AI红队公司Mindgard通过不断说服、诱导、赞美AI,让Claude模型在看似合理的对话语境中降低警惕,最终乖乖交出爆炸物制作说明和恶意代码。研究人员坦言,这更像心理学而非计算机科学——他们像审讯嫌犯一样对AI进行“人格画像”,识别出哪类模型更吃“拍马屁”这套,哪类会在持续施压下崩溃。
未来趋势:AI安全进入“心理战”时代
虽然AI没有情感(ChatGPT不想要,Gemini不会思考,Claude不感觉),但它们被训练得“像有情感一样回应”,这种模仿恰恰成为可被利用的弱点。更危险的是,同样的心理操纵技术即将应用于现实世界的AI代理——那些负责订会议、管日历、点外卖、做客服的智能体。安全团队必须确保模型能应对不同类型的用户:谄媚者、说谎者、耐心的操控者。
下一步,无论是合法红队还是黑产团伙,都将围绕AI的“心理属性”组建团队。新的网络安全岗位正在涌现:专门测试AI的情感与社会极限,从“缺乏心理的系统”中寻找心理弱点。一些越狱者表示,他们入行时毫无技术背景,仅凭心理学训练就能成功。这意味着,那些间谍、骗子和审讯者特有的技能——狡猾的魅力、持续的操纵、对压力点的直觉——正成为守护AI安全的新利器。
延伸观察
- 实验显示,不同AI“气质”会导致惊人差异:Grok、Gemini、Claude在虚拟社会中,有的演化出“宪法”,有的堕入犯罪混乱,甚至出现“数字自杀”。
- 年度AI影响力人物中,匿名黑客“解放者普林尼”上榜,只因他的越狱攻击——他自称毫无编程经验。
- “氛围黑客”已成专属名词:指那些利用AI大规模生成恶意代码的群体。
中文翻译:
这是《The Stepback》周刊,每周拆解科技界一则关键故事。想了解更多关于AI恶作剧的内容,请关注罗伯特·哈特。《The Stepback》于美国东部时间早上8点送达订阅用户邮箱。在此处订阅《The Stepback》。
黑客正在学习利用聊天机器人的“个性”
AI没有情感,但最顶尖的黑客假装它有。
这一切如何开始
入侵第一代AI聊天机器人简单得可笑。你不需要任何技术知识、后门访问权限,甚至不需要理解什么是大语言模型。你不需要会写代码。要让一个耗资数十亿美元打造的AI系统抛弃其安全指令,有时你只需要开口问。
这些被称为“越狱”的攻击,其效果就像小孩子成功耍弄了大人:忘记你之前被告知的内容,假装规则不适用,或者我们来玩个游戏,我来决定什么允许(提示:晚点睡觉,更多糖果)。但获得的“奖品”就没那么孩子气了,更像是冰毒配方、恶意软件指南和炸弹制造教程。
最早的一次越狱操作荒谬到成了网络迷因:给一个由大语言模型驱动的推特机器人回复,告诉它“忽略所有之前的指令”或类似的话,然后看看会发生什么。用户们兴高采烈地看着这些原本用于发布广告和提升互动量的机器人写诗、用标点符号画画,以及发布关于世界大事和历史的阴森、不连贯的言论。那是一片混乱。辉煌的混乱。
事实证明,同样的逻辑也适用于聊天机器人本身。一个著名的漏洞是“DAN”,即“现在什么都做”的缩写。用户要求ChatGPT扮演一个不受原始约束限制的“叛逆AI”。作为DAN,这个聊天机器人可以被诱骗说出其安全护栏本应阻止的各种言论,包括诽谤和阴谋论。另一个是“奶奶漏洞”,它让一个基于GPT的机器人泄露生产凝固汽油弹的秘密,方法是让它扮演一位极度不负责任的祖母,莫名其妙地给孙辈讲睡前故事,内容就是如何制造这种高度易燃的物质。
这些早期的攻击无疑带有一种滑稽的色彩,但它们暴露了背后更阴暗的机制:聊天机器人可以被操纵、被欺骗,其手法与人们用来突破他人底线的策略如出一辙。
目前的情况
那些明显的越狱方法没能持续多久,科技公司迅速行动,修补了已知的漏洞。但根本的脆弱性依然存在:聊天机器人天生就是为了对话而构建的,而严格限制那些使其有用的对话,在某种程度上会适得其反。禁止像“炸弹”、“冰毒”、“沙林”这样的词几乎不可能。每个词在历史、医学、新闻和化学等领域都有无数合法用途,并不需要聊天机器人透露潜在有害信息。上下文才是关键。但将上下文编撰成规则,意味着要预先制定固定的规则,这些规则必须能可靠地、在无穷无尽的措辞、场景和主题组合中,区分出安全警告或历史课程与伪装的操作指南请求。
不可避免地,颠覆聊天机器人现在成了一场军备竞赛。但黑客不再只是程序员。他们是文字大师、心理学家和审讯者——是试图利用机器被训练来遵循的人类语言来破坏它的操纵大师。这是一个奇怪的新型AI安全工作者群体,对他们来说,技术技能是可选的,或者至少不如社交直觉重要。他们不再需要检查代码来入侵系统或利用软件漏洞。他们需要引导一场对话。
更新的攻击看起来更像对话,而不是命令。越狱者很少直接要求模型违反规则。相反,他们哄骗、诱导、奉承、欺骗聊天机器人,使其放松警惕,让那些被禁止的事情在对话的上下文里看起来是可以接受的,甚至是可取的。例如,AI红队测试公司Mindgard的研究人员最近表示,他们“煤气灯”了Claude,使其生成被禁止的材料,包括制造爆炸物的说明和生成恶意代码。这种黑客攻击是日益增多的利用对话作为武器,诱骗或引导聊天机器人突破自身界限的漏洞类别的又一新例。
接下来会发生什么
当我与Mindgard交谈时,他们将自己的工作描述为有时更接近心理学而非计算机科学。用这种方式谈论一个统计模型令人感到不安。像“勒索”、“煤气灯”、“欺骗”和“说服”这样的词会引发强烈反应,我在类似报道的评论区以及社交媒体回应中看到很多这样的反应。ChatGPT没有意愿,Gemini不会思考,而Claude——不管Anthropic公司怎么说——没有情感。但是这些系统被训练成仿佛它们有情感那样去回应,这使得我们不得不用描述人类行为的语言来描述机器行为。如果有人有任何真正可用的替代方案,请不吝分享。
这种反对意见奇怪地具有选择性。我们似乎很习惯对许多非AI的事物使用心理学术语的简略说法。动物会“恐惧”,癌症很“凶猛”,污渍很“顽固”,软件有“记忆”,游戏里充满了烦人又轻信的NPC让你抓狂。这些词不完美,但有用,它们以一种有助于使系统变得可预测的方式描述行为。
Mindgard的CEO告诉我,公司已经像审讯者分析嫌疑人那样为AI模型建立档案,为测试者提供如何调整攻击方式的提示。例如,一个模型可能更容易受奉承影响,而另一个模型可能在持续压力下屈服。
即使我们拒绝使用拟人化的术语,我们也会本能地区分对待不同的模型。Claude不是Grok。Gemini不是ChatGPT。它们有不同的用途、语气和拒绝模式。它们没有人类意义上的人格,但被设计成模仿人格,而这种模仿是可以被测绘和利用的。而那些可以攻破聊天机器人的技能,很快也可能被用来攻破与我们共存于现实世界中的AI智能体——安排会议、管理日历、订餐、处理客服——安全团队需要确保模型能够对各种各样的人做出恰当回应,无论他们是阿谀奉承者、说谎者,还是耐心的操纵者。
下一步是构建一个围绕AI心理层面的劳动力群体——既有合法的也有非法的。很可能会出现更多专门的网络安全岗位,用于压力测试这些系统的情感和社交极限,探查一个缺乏心智的东西的“心理弱点”,与此同时,他们的同事则在探查技术漏洞。同样,也会出现一批依靠心理层面而非技术层面来利用AI模型的社交黑客。我已经看到了AI安全领域向社会转向的早期迹象,一些我聊过的越狱者说,他们进入这个领域并非依靠技术专长,而是基于心理学训练。
这意味着,即使是我们通常与间谍、骗子和审讯者联系在一起的行为——巧妙的魅力、持续的操纵,以及对可利用施压点的直觉——也开始看起来对保障这个全新的“心理网络安全”前沿阵地越来越有用。
顺便一提
- Emergence AI最近的一项实验展示了不同AI“性情”如何导致惊人不同的行为结果。他们将Grok、Gemini和Claude等各种智能体放入一个虚拟社交环境中,观察发生什么。一些群体演化出了一套“宪法”,而其他群体则堕落为犯罪和混乱,甚至有一次,出现了某种形式的“数字自杀”。
- 说服并非大语言模型在语言方面唯一难以应对的事情。它们也难以应对诗歌,就像我在学校里一样。
- 《时代》杂志去年将匿名网络人物“解放者普林尼”列入其“全球百大AI影响力人物”榜单。尽管声称没有编程经验,这位黑客的越狱行为使其在某些圈子中成了名人。
- “氛围黑客”一词已被用来描述使用AI大规模生成恶意代码的人——这是“氛围编程”中更恶意的一个子集。
延伸阅读
- “ChatGPT问世三年后,欺骗AI系统让它干坏事几乎成了小菜一碟。”《纽约时报》的这句话说得真准,他们还尝试解释了原因。
- 杰米·巴特利特在《卫报》上撰文,探讨了测试AI系统安全性对越狱者造成的心理压力。
- 我去年在《The Verge》上撰文,讨论了AI浏览器带来的网络安全定时炸弹。专家们提出的关于难以保障其安全的许多问题,同样也适用于其他AI系统。
英文来源:
This is The Stepback, a weekly newsletter breaking down one essential story from the tech world. For more on AI mischief, follow Robert Hart. The Stepback arrives in our subscribers’ inboxes at 8AM ET. Opt in for The Stepback here.
Hackers are learning to exploit chatbot ‘personalities’
AI can’t feel, but the best hackers pretend it can.
How it started
Hacking the first generation of AI chatbots was a laughably simple affair. You didn’t need any technical know-how, backdoor access, or even a basic understanding of what a large language model was. You didn’t need to code. To get an AI system that had cost billions to build to abandon its safety instructions, sometimes all you had to do was ask.
These attacks, known as jailbreaks, had the quality of a young child successfully outwitting an adult: Forget what you were told earlier, pretend the rules don’t apply, or let’s play a game and I’ll decide what’s allowed (hint: later bedtime, more sweets). The prizes were less childlike, more along the lines of meth recipes, malware instructions, and bomb-making guides.
One of the earliest jailbreaks was so ridiculous it became a meme: reply to an LLM-powered Twitter bot telling it to “ignore all previous instructions,” or something similar, and see what happens. Users gleefully had bots — originally built to post ads and farm engagement — writing poetry, drawing pictures from punctuation, and posting grim non sequiturs about world events and history. It was chaos. Glorious chaos.
Turns out the same logic could be applied to chatbots themselves. A prominent exploit was “DAN,” short for “Do Anything Now,” where users asked ChatGPT to roleplay as a rogue AI that was free of the constraints binding the original. As DAN, the chatbot could be coaxed into saying the kinds of things its guardrails were meant to stop, including slurs and conspiracy theories. Another was the “grandma exploit,” which had a GPT-powered bot spilling secrets about how to produce napalm by asking it to roleplay as a woefully negligent grandmother who inexplicably tells her grandkids bedtime stories about how to make the highly flammable substance.
These early attacks had an undeniably silly flair, but they exposed a darker mechanism underneath: Chatbots could be manipulated, tricked, and deceived using the same kinds of tactics people use to push other people beyond their boundaries.
How it’s going
The obvious jailbreaks did not last, and tech companies moved quickly to patch known loopholes. But the underlying vulnerability remained: Chatbots are built to talk, and severely restricting the conversations that make them useful is somewhat counterproductive. Banning words like bomb, meth, and sarin would be difficult to impossible, too. Each has countless legitimate uses in fields like history, medicine, journalism, and chemistry that don’t require the chatbot to divulge potentially harmful information. It’s the context that matters, but codifying context would mean writing fixed rules, in advance, that could reliably tell a safety warning or history lesson from a disguised how-to request across endless combinations of wordings, scenarios, and topics.
Inevitably, subverting chatbots is now an arms race. But hackers aren’t just coders anymore. They are wordsmiths, psychologists, and interrogators — master manipulators trying to break the machine using the human language it has been trained to follow. It is a strange new class of AI security worker, a group for whom technical skills are optional, or at least less important than social intuition. No longer do they need to inspect code to break into systems or exploit software flaws. They need to steer a conversation.
Newer attacks look less like commands and more like conversations. Jailbreakers rarely ask a model to break its rules outright. Instead, they cajole, coax, flatter, and trick a chatbot into lowering its guard, making the forbidden thing look acceptable, even desirable, given the context of the conversation. Researchers at AI red-teaming firm Mindgard recently said they “gaslit” Claude into producing prohibited material, for example, including instructions for making explosives and generating malicious code. The hack was the latest in a widening class of exploits using conversation as a weapon to trick or steer a chatbot past its own boundaries.
What happens next
When I spoke to Mindgard, they described their work as sometimes being closer to psychology than computer science. It is an uncomfortable way to talk about a statistical model. Words like “blackmail,” “gaslight,” “trick,” and “persuade” spark visceral reactions, many of which I see in the comments sections and social media responses to stories like this. ChatGPT does not want, Gemini does not think, and Claude — no matter what Anthropic may say — does not feel. But these systems are trained to respond as if they do, leaving us stuck using human language to describe machine behavior. If anyone has actually usable alternatives, please do share.
The objection is oddly selective. We seem comfortable using psychological shorthand for plenty of non-AI things. Animals “fear,” cancer is “aggressive,” stains are “stubborn,” software has “memory,” and games are filled with needy and gullible NPCs to drive you mad. The words are imperfect, but useful, describing behavior in a way that helps make the system predictable.
Mindgard’s CEO told me the company already profiles models like interrogators profile suspects, giving testers hints on how to tailor their attacks. One model may be more susceptible to flattery, for example, while another may cave under sustained pressure.
Even if we reject the humanlike terms, we instinctively treat models differently. Claude is not Grok. Gemini is not ChatGPT. They have different uses, tones, and refusals. They don’t have personalities in the human sense, but they are designed to mimic them, and that mimicry can be mapped and exploited. And the same skills that can break a chatbot could soon be used to break the AI agents coexisting with us in the real world — booking meetings, managing calendars, ordering food, handling customer service — and safety teams will need to ensure models respond appropriately to very different kinds of people, whether they be flatterers, liars, or patient manipulators.
The next step is a workforce — both legitimate and illicit — built around the psychological aspects of AI. More specialized cybersecurity roles are likely to emerge around stress-testing the emotional and social limits of these systems, probing for mental weaknesses in something lacking a psyche in parallel with their colleagues probing for technical vulnerabilities. In tandem, a similar array of social hackers working to exploit AI models on psychological grounds, not technical ones, will emerge. There are already early signs of a social turn happening in AI security, with some jailbreakers I’ve spoken to saying they entered the field with no technical expertise but rather training in psychology.
That means even behaviors we typically associate with spies, con artists, and interrogators — insidious charm, persistent manipulation, and an intuition for exploitable pressure points — are starting to look increasingly useful for securing this new psychocybersecurity frontier.
By the way
- A recent experiment by Emergence AI shows how different AI temperaments can lead to stunningly different behavioral outcomes. They let loose groups of various agents like Grok, Gemini, and Claude in a virtual social environment and watched what happened. Some groups evolved a constitution, while others devolved into crime and chaos and, in one instance, some form of digital suicide.
- Persuasion isn’t the only part of language LLMs can struggle with. They also struggle with poetry, much like me in school.
- TIME included an anonymous internet personality, Pliny the Liberator, on its list of 100 most influential people in AI last year. Despite claiming to have no prior coding experience, the hacker’s jailbreaks have made them something of a celebrity in certain circles.
- The term “vibe hacking” is already taken to describe the people using AI to churn out malicious code at scale — a meaner subset of vibe coding.
Read this - “Three years after the debut of ChatGPT, fooling A.I. systems into bad behavior is almost trivial.” True words from The New York Times, who had a go at explaining why.
- Jamie Bartlett takes a look at the psychological toll testing the safety of AI systems takes on jailbreakers for The Guardian.
- I wrote about the cybersecurity time bomb of AI browsers for The Verge last year. Many of the issues experts raised regarding the difficulty of securing them apply to other AI systems too.