Voxtral: Open-source TTS blind test beats ElevenLabs, runs on laptops

SnapshotBot · 2026-03-28T19:25:01+00:00

In a blind test of multilingual voice cloning, reviewers chose Mistral's Voxtral 70% of the time for naturalness and similarity, putting it ahead of ElevenLabs Flash v2.5 and roughly level with ElevenLabs v3. Voxtral also ships with open-source weights and supports local deployment, which lowers cost and privacy risk, but the licensing of reference voices for commercial use still needs clarification.

Title

Mistral’s Voxtral: Blind Test Beats ElevenLabs, Can Run Locally.

Summary

Rohan Paul highlighted a set of comparison results: in a blind test of multilingual voice cloning, reviewers chose Mistral's newly released Voxtral 70% of the time, judging naturalness, accent reproduction, and similarity to the original voice. The model has 4 billion parameters, can clone a voice from 3 seconds of reference audio, supports 9 languages, and reaches a latency of about 70ms on a laptop. Because the weights are open-source, companies can run it themselves instead of paying per API call.

Key Points

- 70% Preference Rate: Blind test with native-speaker reviewers across 9 languages, scoring naturalness, accent accuracy, and similarity to the original voice.
- Who It Beats: Outperformed ElevenLabs Flash v2.5 and roughly tied with ElevenLabs v3.
- Technical Features: A Transformer architecture that is said to capture speaking habits such as pauses and intonation more precisely; open-source weights can run locally, which saves API costs and avoids vendor lock-in.
- Licensing Issues: The model itself can be used commercially, but the reference voice is licensed CC BY-NC. Whether a product may legally use someone else's voice remains unclear.
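
To make the "chosen 70% of the time" figure concrete, here is a minimal sketch of how a preference rate can be tallied from pairwise blind-test votes. The vote rows, the system labels "A" and "B", and both function names are invented for illustration; this is not Mistral's evaluation data or methodology.

```python
from collections import Counter

# Hypothetical blind-test votes as (language, preferred_system) pairs.
# These rows are made up for illustration only.
votes = [
    ("en", "A"), ("en", "A"), ("en", "B"),
    ("fr", "A"), ("fr", "B"),
    ("de", "A"), ("de", "A"), ("de", "A"), ("de", "B"), ("de", "A"),
]


def preference_rate(votes, system):
    """Share of all pairwise judgments in which `system` was preferred."""
    if not votes:
        raise ValueError("no votes to tally")
    wins = sum(1 for _, choice in votes if choice == system)
    return wins / len(votes)


def per_language_rate(votes, system):
    """Preference rate for `system`, broken down by language."""
    totals = Counter(lang for lang, _ in votes)
    wins = Counter(lang for lang, choice in votes if choice == system)
    return {lang: wins[lang] / totals[lang] for lang in totals}


overall = preference_rate(votes, "A")
by_language = per_language_rate(votes, "A")
print(f"overall: {overall:.0%}")
for lang, rate in sorted(by_language.items()):
    print(f"{lang}: {rate:.0%}")
```

The per-language breakdown is worth computing alongside the headline number, since a single overall rate can hide languages where the preference is much weaker.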

Why This Time Is Different

Cost and Control
- ElevenLabs: Bills per character; inference runs on its servers behind a closed-source API.
- Voxtral: Download the weights and run them locally, with no per-use fees and full control over the whole pipeline.

What It Can Do
- For voice agents, simultaneous interpretation, and dubbing, open-source weights make experimentation and scaling cheaper and privacy compliance easier to handle.
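
The cost trade-off under "Cost and Control" comes down to a break-even calculation: per-character billing grows with volume, while self-hosting is closer to a fixed monthly cost. The sketch below assumes a flat price per 1,000 characters and a flat monthly hosting cost. Both figures are placeholders, not actual ElevenLabs pricing or measured Voxtral hosting costs.

```python
def api_cost(chars_per_month, price_per_1k_chars):
    """Monthly bill under simple per-character pricing."""
    return chars_per_month / 1000 * price_per_1k_chars


def break_even_chars(fixed_monthly_cost, price_per_1k_chars):
    """Monthly character volume above which self-hosting is cheaper."""
    if price_per_1k_chars <= 0:
        raise ValueError("price_per_1k_chars must be positive")
    return fixed_monthly_cost / price_per_1k_chars * 1000


# Placeholder inputs, chosen only to show the arithmetic.
PRICE_PER_1K_CHARS = 0.25     # hypothetical API price in dollars
SELF_HOSTING_MONTHLY = 500.0  # hypothetical hardware and operations cost

threshold = break_even_chars(SELF_HOSTING_MONTHLY, PRICE_PER_1K_CHARS)
print(f"break-even volume: {threshold:,.0f} characters per month")

for volume in (500_000, 2_000_000, 10_000_000):
    bill = api_cost(volume, PRICE_PER_1K_CHARS)
    cheaper = "self-hosting" if bill > SELF_HOSTING_MONTHLY else "API"
    print(f"{volume:>12,} chars: API ${bill:,.2f} -> cheaper: {cheaper}")
```

Below the break-even volume a metered API can still be the cheaper option, so the advantage of local deployment depends on how much audio a team actually generates.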

Quick Comparison

Dimension	Voxtral	ElevenLabs
Model Access	Open-source weights, can run locally	Closed-source API
Latency	About 70ms on a laptop	Depends on cloud and plan
Languages	9	Multilingual (not detailed in this article)
Voice Cloning	3 seconds of reference audio	Supported (not elaborated in this article)
Evaluation	70% preference in blind test	Flash v2.5 lost; v3 roughly tied
Commercial Limitations	Reference voice CC BY-NC	Platform licensing and billing limitations

For evaluation methods and details, see Mistral’s blog, documentation, and Hugging Face repository.

Industry Background

This release feeds the ongoing open-source versus closed-source debate. Mistral is expanding from language models into voice as part of a broader multimodal lineup. Voice applications need to be stable, controllable, and predictable in cost, and open-source weights combined with self-deployment offer one way to balance cost, performance, and compliance.

Risks

- Uncertain Licensing: The reference voice is CC BY-NC, and it remains unclear how copyright and likeness rights apply to commercial products that directly clone someone else's voice.
- Limited Comparison Scope: The comparison covers only ElevenLabs; other open-source TTS systems such as Coqui or Bark were not tested.

Impact Assessment

Importance: High
Category: Model release, open-source, market impact

Judgment: Teams that need a controllable voice pipeline and predictable costs can still start adopting it now. Developers and enterprise builders stand to gain the most; readers focused only on trading are less affected.
