Mistral AI Launches New Text-to-Speech Model

Posted by qimuai on 2026-3-28 11:01 | Views: 96 | First-hand compilation

Source: https://aibusiness.com/language-models/mistral-ai-launches-text-to-speech-model

Summary:

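French AI startup Mistral AI has released Voxtral TTS, its first text-to-speech model, marking its entry into the fast-growing AI voice market. The model ships with open weights and supports nine languages, including English, French and German, and is aimed at giving enterprise customers a more controllable voice solution.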

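Voxtral TTS has 4 billion parameters and is designed for enterprise scenarios such as voice assistants, customer support and sales engagement. Unlike many competing products that depend on cloud APIs, it can be deployed on an organization's own infrastructure, including consumer and edge hardware such as laptops and smartphones, which gives enterprises control over data, cost and customization.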

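The model's standout feature is voice adaptability: from only a few seconds of reference audio it can reproduce a speaker's tone, accent and even emotion, and it supports cross-language voice control. According to the company, in human evaluations of naturalness the model outperformed ElevenLabs' lower-latency models and matched more advanced offerings in lifelike interaction.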

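The release marks Mistral AI's expansion from speech recognition toward multimodal AI systems. In a speech synthesis market led by companies such as OpenAI and ElevenLabs, its open-weights approach and edge deployment capability may prove to be key differentiators.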

English source:

The system, which operates in nine languages, is designed to support critical voice agent workflows.
Mistral AI is expanding its Voxtral model family with its first text-to-speech model.
The launch comes amid intensifying competition in the fast-growing AI voice market, with Voxtral TTS pitched as an alternative to models from competitors including OpenAI and ElevenLabs.
The Paris-based startup unveiled its new system on Thursday. The 4 billion parameter model is designed for enterprise deployment across voice assistants, customer support and sales engagement tools.
Unlike many rival offerings, Voxtral TTS has been released with open weights, allowing organizations to run the model on their own infrastructure rather than relying on third-party APIs.
The model supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi and Arabic.
Mistral said the model is lightweight enough to operate on consumer hardware, including laptops, smartphones and edge devices, while maintaining what it describes as "frontier-quality" performance. The company positions this as a key differentiator for enterprises seeking greater control over data, cost and customization.
Another key feature, Mistral said, is voice adaptability. The model can replicate a speaker's voice using just a few seconds of reference audio, capturing not only tone but also accent, intonation and emotion.
"Our model excels at both contextual understanding and speaker modeling: capturing how a specific person naturally speaks," Mistral wrote in a blog post. "With its compact size, low cost and latency and easy adaptability, Voxtral TTS gives full control and customization for enterprises looking to own their voice AI stack."
Voxtral TTS can also perform cross-language voice control, such as generating English speech with a French accent, based on a short prompt.
In human evaluations of Voxtral, Mistral said its system matched or outperformed competing systems in terms of naturalness, exceeding lower-latency models from ElevenLabs while achieving parity with more advanced offerings in lifelike interaction.
The launch builds on Mistral's earlier release of speech-to-text models and signals a broader push toward multimodal AI systems.
