Text-to-speech models

Choose a model for speech synthesis, voice cloning, and voice design.

There are 18 text-to-speech models across four families: Qwen3-TTS, CosyVoice, Voice Cloning, and Voice Design. Two questions narrow the field:
  1. Do you need a custom voice, or will a built-in voice work?
  2. Do you need real-time streaming, or is non-streaming acceptable?

Standard TTS or custom voice?

Standard TTS

Built-in voices, no setup required. Choose a model, pick a voice, and start synthesizing.
| Model | Family | Key strength |
| --- | --- | --- |
| qwen3-tts-flash | Qwen3-TTS | Low latency, good quality |
| qwen3-tts-flash-2025-11-27 | Qwen3-TTS | Low latency, good quality (snapshot) |
| qwen3-tts-flash-2025-09-18 | Qwen3-TTS | Low latency, good quality (snapshot) |
| qwen3-tts-flash-realtime | Qwen3-TTS | Realtime streaming, low latency |
| qwen3-tts-flash-realtime-2025-11-27 | Qwen3-TTS | Realtime streaming, low latency (snapshot) |
| qwen3-tts-flash-realtime-2025-09-18 | Qwen3-TTS | Realtime streaming, low latency (snapshot) |
| qwen3-tts-instruct-flash | Qwen3-TTS | Instruction control (speed, emotion, style) |
| qwen3-tts-instruct-flash-2026-01-26 | Qwen3-TTS | Instruction control (snapshot) |
| qwen3-tts-instruct-flash-realtime | Qwen3-TTS | Realtime streaming with instruction control |
| qwen3-tts-instruct-flash-realtime-2026-01-22 | Qwen3-TTS | Realtime streaming with instruction control (snapshot) |
| cosyvoice-v3-plus | CosyVoice | High quality, rich voice library |
| cosyvoice-v3-flash | CosyVoice | Fast synthesis |
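
Several model IDs in the table end in a date and are labeled as snapshots. The sketch below shows one way to split such an ID into its base name and snapshot date. It assumes the suffix is always `YYYY-MM-DD`, which matches every dated ID on this page but is not stated as a rule. `split_snapshot` is a hypothetical helper, not part of any SDK.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Assumption: a snapshot ID is "<base>-YYYY-MM-DD", as in the table above.
_SNAPSHOT = re.compile(r"^(?P<base>.+)-(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})$")


def split_snapshot(model_id: str) -> Tuple[str, Optional[date]]:
    """Return (base name, snapshot date); the date is None for undated IDs."""
    match = _SNAPSHOT.match(model_id)
    if match is None:
        return model_id, None
    snapshot = date(int(match["y"]), int(match["m"]), int(match["d"]))
    return match["base"], snapshot


if __name__ == "__main__":
    for model in ("qwen3-tts-flash", "qwen3-tts-flash-2025-11-27", "cosyvoice-v3-plus"):
        print(model, "->", split_snapshot(model))
```

Pinning a dated ID is a common way to keep output stable across releases. How an undated ID relates to its snapshots is not described here, so check the model reference before relying on it.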

Custom voice

Create a unique voice by cloning from audio samples or designing one from a text description.
| Model | Family | Key strength |
| --- | --- | --- |
| qwen3-tts-vc-2026-01-22 | Voice Cloning | Clone a voice from audio samples |
| qwen3-tts-vc-realtime-2026-01-15 | Voice Cloning | Real-time voice cloning |
| qwen3-tts-vc-realtime-2025-11-27 | Voice Cloning | Real-time voice cloning |
| qwen3-tts-vd-2026-01-26 | Voice Design | Design a voice from a text description |
| qwen3-tts-vd-realtime-2026-01-15 | Voice Design | Real-time voice design |
| qwen3-tts-vd-realtime-2025-12-16 | Voice Design | Real-time voice design |

Cloning vs. design: Voice cloning reproduces a specific voice from audio samples. Voice design creates a new voice from a text description of the desired characteristics (such as "a warm, low-pitched female voice"). Use cloning when you have a target voice; use design when you want to create one from scratch.
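
The two questions at the top of this page can be written as a small lookup. This is a minimal sketch, not an official selection rule. The function name `pick_model` and the `voice` labels are hypothetical. Defaulting to Qwen3-TTS rather than CosyVoice for built-in voices is an arbitrary choice, as is using the most recent dated ID listed for each custom-voice case.

```python
from typing import Dict, Literal, Tuple

VoiceSource = Literal["built-in", "clone", "design"]

# Keys: (voice source, needs real-time streaming). Values are model IDs
# taken from the tables on this page; the defaults are assumptions.
_CHOICES: Dict[Tuple[str, bool], str] = {
    ("built-in", False): "qwen3-tts-flash",
    ("built-in", True): "qwen3-tts-flash-realtime",
    ("clone", False): "qwen3-tts-vc-2026-01-22",
    ("clone", True): "qwen3-tts-vc-realtime-2026-01-15",
    ("design", False): "qwen3-tts-vd-2026-01-26",
    ("design", True): "qwen3-tts-vd-realtime-2026-01-15",
}


def pick_model(voice: VoiceSource, streaming: bool) -> str:
    """Map the two selection questions to one model ID from this page."""
    try:
        return _CHOICES[(voice, streaming)]
    except KeyError:
        raise ValueError(f"unknown voice source: {voice!r}") from None


if __name__ == "__main__":
    print(pick_model("built-in", streaming=True))
    print(pick_model("design", streaming=False))
```

If per-request control over speed, emotion, or style matters, the instruct models in the Standard TTS table would replace the built-in defaults used here.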

Controlling how the voice sounds

Three approaches, ranked by flexibility:
  1. Instruction control (qwen3-tts-instruct-flash, qwen3-tts-instruct-flash-realtime) — Describe the desired delivery in natural language. Control speed, emotion, and style per request. Most flexible.
  2. Voice design (qwen3-tts-vd-*) — Generate a custom voice from a text description. Good for creating a brand voice without audio samples.
  3. Voice cloning (qwen3-tts-vc-*) — Reproduce an existing voice from audio samples. Best when you need to match a specific person's voice.
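
The list above refers to model groups with wildcard patterns such as `qwen3-tts-vd-*`. The sketch below expands those patterns against the 18 model IDs on this page using Python's standard `fnmatch` module. The pattern `qwen3-tts-instruct-flash*` is an assumption made here so that the dated instruct snapshots are included. `models_for` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import List

# The 18 model IDs listed on this page.
MODELS = [
    "qwen3-tts-flash",
    "qwen3-tts-flash-2025-11-27",
    "qwen3-tts-flash-2025-09-18",
    "qwen3-tts-flash-realtime",
    "qwen3-tts-flash-realtime-2025-11-27",
    "qwen3-tts-flash-realtime-2025-09-18",
    "qwen3-tts-instruct-flash",
    "qwen3-tts-instruct-flash-2026-01-26",
    "qwen3-tts-instruct-flash-realtime",
    "qwen3-tts-instruct-flash-realtime-2026-01-22",
    "qwen3-tts-vc-2026-01-22",
    "qwen3-tts-vc-realtime-2026-01-15",
    "qwen3-tts-vc-realtime-2025-11-27",
    "qwen3-tts-vd-2026-01-26",
    "qwen3-tts-vd-realtime-2026-01-15",
    "qwen3-tts-vd-realtime-2025-12-16",
    "cosyvoice-v3-plus",
    "cosyvoice-v3-flash",
]

# Approach -> wildcard pattern. The instruct pattern is an assumption.
APPROACH_PATTERNS = {
    "instruction control": "qwen3-tts-instruct-flash*",
    "voice design": "qwen3-tts-vd-*",
    "voice cloning": "qwen3-tts-vc-*",
}


def models_for(approach: str) -> List[str]:
    """List the model IDs whose names match the approach's pattern."""
    pattern = APPROACH_PATTERNS[approach]
    return [m for m in MODELS if fnmatchcase(m, pattern)]


if __name__ == "__main__":
    for approach in APPROACH_PATTERNS:
        print(approach, "->", models_for(approach))
```

Matching on names is only a convenience for filtering a list. It says nothing about what a model supports beyond what this page documents.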

Full comparison

| Model | Family | Streaming | Custom voice | Instruction control |
| --- | --- | --- | --- | --- |
| qwen3-tts-flash | Qwen3-TTS | | | |
| qwen3-tts-flash-2025-11-27 | Qwen3-TTS | | | |
| qwen3-tts-flash-2025-09-18 | Qwen3-TTS | | | |
| qwen3-tts-flash-realtime | Qwen3-TTS | Yes | | |
| qwen3-tts-flash-realtime-2025-11-27 | Qwen3-TTS | Yes | | |
| qwen3-tts-flash-realtime-2025-09-18 | Qwen3-TTS | Yes | | |
| qwen3-tts-instruct-flash | Qwen3-TTS | | | Yes |
| qwen3-tts-instruct-flash-2026-01-26 | Qwen3-TTS | | | Yes |
| qwen3-tts-instruct-flash-realtime | Qwen3-TTS | Yes | | Yes |
| qwen3-tts-instruct-flash-realtime-2026-01-22 | Qwen3-TTS | Yes | | Yes |
| qwen3-tts-vc-2026-01-22 | Voice Cloning | | Yes | |
| qwen3-tts-vc-realtime-2026-01-15 | Voice Cloning | Yes | Yes | |
| qwen3-tts-vc-realtime-2025-11-27 | Voice Cloning | Yes | Yes | |
| qwen3-tts-vd-2026-01-26 | Voice Design | | Yes | |
| qwen3-tts-vd-realtime-2026-01-15 | Voice Design | Yes | Yes | |
| qwen3-tts-vd-realtime-2025-12-16 | Voice Design | Yes | Yes | |
| cosyvoice-v3-plus | CosyVoice | | | |
| cosyvoice-v3-flash | CosyVoice | | | |
