Google's new anything-to-anything AI model is wild

Posted by qimuai on 2026-5-24 09:00 | Views: 19 | First-hand compilation


Source: https://www.theverge.com/tech/936507/gemini-omni-hands-on-deepfake-ai-video

Summary:

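Hands-on with Google's Omni AI model: a deepfaked plush deer goes on vacation and the author lands in front of the Eiffel Tower, with striking results but obvious flaws

Last year, I used AI deepfake tools to send my kid's plush deer, Buddy, on a virtual vacation. Google's new anything-to-anything AI model, Omni, makes that experiment even wilder: it sent the toy rafting and deepfaked me in front of the Eiffel Tower. Still, the results fall well short of the "singularity."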

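Background: from re-creating an ad to testing the technology

Last year's experiment began as an attempt to re-create the scenes in a Gemini ad Google was running. I never showed the resulting videos of Buddy's adventures to my four-year-old, but the exercise made me think hard about where the line falls between harmless fun with generative AI and outright "slop." Maybe the two overlap completely; maybe not. What is certain is that the tools for making realistic videos are surprisingly easy to use and require almost no expertise, and that trend is only accelerating in the Omni era.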

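Omni explained: multimodal conversion and upgraded video generation

Omni is a new family of generative models from Google that is supposed to eventually turn any kind of input (photo, video, text) into any kind of output. For now it only generates video. The first release, Omni Flash, is available in Flow, Google's AI video generation and editing platform. Compared with the previous model, Veo, it accepts an uploaded video as a starting point, and Google claims it draws on more "real-world knowledge" to keep characters consistent. To test those claims, I sent AI Buddy back on the road.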

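Test results: a mixed bag with frequent AI "jump scares"

The results were mixed. Some clips were much better, with character consistency and prompt adherence well beyond what I got from Veo five months ago. But even the best clips contained AI "jump scares," such as Buddy abruptly switching orientation while skydiving. A more typical case: I asked Omni for a montage of Buddy packing and boarding a cruise ship for a vacation. It had him pack a jar of honey and later reach for it as if it were sunscreen. A good gag, except that the honey container kept changing shape, from a jar to a clear squirt bottle filled with water and then to a squeeze bottle filled with honey, and the final frame was a jumble of elements from the sequence.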

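Editing: improved but still imperfect

Text-based edit prompts work better than they did with Veo 3, but the results are still inconsistent. When I asked Omni to emphasize Buddy's facial reactions, the footage looked strange. It also added antlers at random to Buddy, a fawn that has none. When I asked it to remove the antlers from one scene, it complied, then added antlers to all the other scenes.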

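Cost: experimenting is not cheap

Video generation is not free. Each clip costs 15 to 40 credits depending on scene length and starting "ingredients," and one round of edits costs 40 credits. My $20-per-month AI Pro plan includes 1,000 credits per month; after generating about 20 clips and editing a few of them, I had 145 left. Anyone with a specific vision for a video should expect a lot of costly back-and-forth.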
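To put the credit figures in dollar terms, here is a minimal Python sketch. It assumes, purely for illustration, that the whole $20 monthly fee pays for the 1,000 credits (the plan may include other benefits, so this overstates the per-credit price), and the helper names are hypothetical rather than part of any Google tool.

```python
# Figures as reported in the article for the $20-per-month AI Pro plan.
PLAN_PRICE_USD = 20.0
PLAN_CREDITS = 1000
CLIP_COST_MIN, CLIP_COST_MAX = 15, 40  # credits per generated clip
EDIT_COST = 40                          # credits per round of edits
CREDITS_LEFT = 145                      # balance after the author's testing


def credits_to_usd(credits: int) -> float:
    """Convert credits to dollars.

    Assumption: the full plan price is attributed to credits alone.
    """
    return round(credits * PLAN_PRICE_USD / PLAN_CREDITS, 2)


def clips_per_month(cost_per_clip: int) -> int:
    """Whole clips one month of credits buys if no edits are made."""
    return PLAN_CREDITS // cost_per_clip


# Credits used during the author's testing.
credits_spent = PLAN_CREDITS - CREDITS_LEFT

# Attempts per month if every clip is a priciest clip plus one edit round.
worst_case_attempts = PLAN_CREDITS // (CLIP_COST_MAX + EDIT_COST)

if __name__ == "__main__":
    low, high = credits_to_usd(CLIP_COST_MIN), credits_to_usd(CLIP_COST_MAX)
    print(f"Per clip: ${low:.2f} to ${high:.2f}")
    print(f"Per edit round: ${credits_to_usd(EDIT_COST):.2f}")
    print(f"Spent so far: {credits_spent} credits (${credits_to_usd(credits_spent):.2f})")
    print(f"Clips per month, no edits: {clips_per_month(CLIP_COST_MAX)} to {clips_per_month(CLIP_COST_MIN)}")
    print(f"Clip-plus-edit attempts per month: {worst_case_attempts}")
```

Under that assumption a credit works out to about two cents, so a clip costs roughly $0.30 to $0.80 and each round of edits another $0.80, which is why iterating toward a specific vision adds up quickly.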

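Deepfake selfies: unsettlingly convincing

The most striking feature is Omni's touted ability to add AI-generated elements to real video. Starting from a selfie video with a neutral expression, I generated clips of myself eating spaghetti, sitting in an airplane seat, and biting into a baguette in front of the Eiffel Tower. There were small tells (the clink of the fork against the bowl sounded too manufactured, and a woman in the background of the airplane clip appeared twice), but overall the clips were convincing. My husband knew I was testing an AI video tool but not which parts of the scene were generated. He believed I had really been filmed eating pasta, and his only clue that something was off was that the bowl looked unfamiliar. This is a man who has seen me nearly every day for the past decade.

The other deepfakes vary in how well they would fool people on social media. A couple of the Eiffel Tower clips look slightly cartoonish, but one is convincing enough that it might take several viewings to spot the AI. When the AI version of me turns her head and reveals a ponytail, I know it isn't me, but I doubt anyone else could tell, and that unsettles me.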

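Conclusion: deep in the uncanny valley, and the tools are already dangerous enough

Honestly, I'm a little exhausted by it all. Veo 3's realism shocked me, and over the past few years I have been shocked again and again by how easy it is to make fake photos. Omni should probably shock me too, but the edge has worn off. Google would like you to believe that generating a cinematic masterpiece is effortless; it is not. Still, with a Google account and a credit card, you can make a video of yourself sitting at home look like you are on a flight to Maui with trivial effort. We may not be at the "foothills of the singularity," but we are definitely deep in the uncanny valley, the blurry zone between real and fake, impressive and unsettling.

(Note: all images and videos in this story were generated by Google Gemini.)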

English original:

Google’s new anything-to-anything AI model is wild
Omni sent my kid’s stuffie rafting and deepfaked me in front of the Eiffel Tower. But it’s not quite the singularity.
Last year I deepfaked my kid’s stuffed animal to make it look like his plush deer was on vacation. It was an experiment to see if I could re-create the events depicted in a Gemini ad Google was running, and I never showed the videos of Buddy the deer on his adventures to my four-year-old. But it was a revealing exercise that made me think a lot about the difference between some harmless fun with generative AI and full-on slop. Maybe that Venn diagram is a perfect circle! Maybe not. But what I know for sure is that the tools to make realistic videos are surprisingly good, requiring surprisingly little effort and know-how. And that trend is continuing hot into Gemini’s Omni era.
Omni is a new family of generative models that will allegedly one day be able to turn any kind of input — photo, video, text — into anything else. But for starters, it’s just creating video. Omni Flash is the first of these models Google has released, now available in the company’s AI video generation and editing platform, Flow. You can still use the previous model, Veo, if you want, but Omni improves on Veo in a few ways.
With Omni, you can upload a video and use that along with a text prompt as the starting point for your AI-generated creation. Google also claims Omni incorporates more real-world knowledge when producing videos and can do a better job of keeping characters consistent throughout a video as a result. There was only one way to really know if those claims are true: I brought back AI Buddy to pack his little AI-generated bags for another adventure.
The results are such a mixed bag they’re baffling. Some were very good — much more consistent and true to my prompt than when I was testing out Veo five months ago. But even the best clips Omni cooked up for me still have certain AI jump scares, like when Buddy suddenly switches orientation while he’s skydiving.
For another video, I gave Omni some artistic freedom. “Create a montage of Buddy packing for a vacation and embarking on a cruise ship for a tropical vacation. The mood is cute and playful. Buddy packs something funny in his suitcase that comes into play later in the clip.” It had Buddy pack a jar of honey; later in the clip he reaches for it as if it’s a bottle of sunscreen. “Uh oh,” the character says as he squirts honey onto his hoof.
Honestly, not a bad bit. Except that the bottle of honey constantly changes throughout the video, from a jar, to a clear squirt bottle filled with water, then back to a squeeze bottle filled with honey. And I can’t even begin to describe how the model came up with the final frame of the video — almost as if it just barfed up a bunch of elements of the sequence it just made.
You can use text-based prompts to suggest edits to your videos, and I’ll give Google credit: This works better with Omni than it did when I tested Veo 3. But the results were bad with Veo — so bad that I found it way easier to just prompt a new video from scratch every time I wanted something changed. Omni will actually take your edits on board, but the results don’t always hit.
I had it emphasize Buddy’s facial reactions in his vacation clips, and the results just wound up looking strange. It would also give Buddy antlers from time to time, which he does not have. Buddy is a baby, thank you very much. When I prompted it to remove the antlers that appeared in one scene, it obliged — and then added antlers in all the other ones.
The thing is, none of this is free. Generating videos costs credits, varying from 15 to 40 credits based on the length of the scene and the “ingredients” you start with. One round of edits costs 40 credits. I have the $20-per-month AI Pro plan that comes with 1,000 credits each month. After around 20 clips generated with a few edits on some, I’m down to 145. If you have specific ideas about the video you want Omni to generate, you might be looking at a lot of costly back-and-forth with the model to get a video that’s close to your vision.
I can genuinely say I wasn’t prepared for what I saw
One of Omni’s purported strengths is adding AI-generated stuff to real videos, so I gave Buddy a break and deepfaked myself. Starting with a selfie video with a neutral expression, I prompted Omni to generate videos of me eating a plate of spaghetti, sitting in an airplane seat, and standing in front of the Eiffel Tower taking a bite out of a baguette. And I can genuinely say I wasn’t prepared for what I saw.
There are AI tells in my deepfake videos. The clink of the fork hitting the bowl of pasta is a little too manufactured. There’s a woman in the background of the airplane video who shows up twice. But aside from those little glitches and a vaguely uncanny sense about them, they’re convincing as hell.
I showed my husband the pasta clip; he knew I was testing an AI video tool but I didn’t tell him what in the scene had been generated by AI. Without knowing what was AI-generated about it, he bought that I was sitting in front of a camera eating pasta, and said that his only clue something was up was that the bowl looked unfamiliar. The pasta-eating itself looked real enough to convince my husband. A man who has looked at me in real life basically every single day for the last decade.
My other deepfakes are varying levels of “good enough to fool people on social media.” A couple of the Eiffel Tower clips look slightly cartoonish, but one of them is convincing enough that you might need to rewatch it a few times to clock that it’s AI. I know it’s not me when the AI me turns her head and reveals her hair pulled back in a ponytail. But I’m not sure anyone else would know the difference, and that makes me feel weird.
We’re definitely deep in the uncanny valley
I’m a little exhausted by it all, to be honest. I was shocked when I tested Veo 3 at the realism it could produce. I’ve been shocked at how easy it is to make fake people in fake photos again and again over the past few years. I should probably be shocked by Omni too, and I guess I am, but the edge has worn off.
It’s still not quite as easy to make an AI-generated cinematic masterpiece as Google would like you to believe. But Omni does improve on Veo in some recognizable ways. If you have a Google account and a credit card, then you can take a video of yourself sitting at home and make it look like you’re on a flight to Maui with a trivial amount of effort. I don’t think we’re at the “foothills of the singularity” exactly, but we’re definitely deep in the uncanny valley.
All images and videos in this story were generated by Google Gemini.
