Choose a model for voice conversation, speech translation, and more.
S2S vs pipeline
Two ways to build voice-enabled apps:
| S2S | Pipeline (ASR + LLM + TTS) | |
|---|---|---|
| Latency | Low — single model, streaming | Higher — 3 sequential hops |
| Audio understanding | End-to-end — hears tone, emotion, responds in kind | Transcribes to text first — audio nuance lost |
| Voice customization | Preset voices via system prompt | Voice cloning, voice design (CosyVoice) |
- Use S2S when interactive conversation, low latency, and audio-aware responses matter. Continue reading this page.
- Use Pipeline when you need custom voices or want to mix-and-match the best ASR, LLM, and TTS for each stage.
Real-time or file-based?
-
Real-time (WebSocket) — Use for live voice interfaces: voice assistants, call centers, simultaneous interpretation. Audio streams in, speech streams out. Model names contain
-realtime. - File-based (HTTP) — Use when you can trade latency for better results: video dubbing, podcast translation, offline content processing. Unlocks thinking mode and function calling (Omni) and video context (Livetranslate).
Function calling
Let the model take actions based on what it hears and sees — check a knowledge base, query a schedule, trigger a workflow. Use qwen3-omni-flash (HTTP). Not available on realtime or Livetranslate models.
Thinking mode
Use qwen3-omni-flash (HTTP) when answer quality matters more than latency. The model reasons step-by-step before producing speech — useful for technical support, complex Q&A, or multi-step instructions.
Translation
Both model families can translate speech:
- Qwen3-Livetranslate — 18 languages + 6 Chinese dialects, ~3-second latency, out of the box. File-based variant accepts video for context-aware accuracy. 7 languages output text only (no audio).
- Qwen3-Omni — 10 languages + 8 Chinese dialects. Inject terminology and domain context via system prompt for specialized fields. Both realtime and file-based. More engineering effort, more control.
Livetranslate for quick setup; Omni for domain control.
Supported languages
Supported languages
| Language | Qwen3-Livetranslate | Qwen3-Omni |
|---|---|---|
| English | ✓ | ✓ |
| Chinese (Mandarin) | ✓ | ✓ |
| + Cantonese | ✓ | ✓ |
| + Sichuanese | ✓ | ✓ |
| + Shanghainese | ✓ | ✓ |
| + Beijing | ✓ | ✓ |
| + Tianjin | ✓ | ✓ |
| + Nanjing | — | ✓ |
| + Shaanxi | — | ✓ |
| + Hokkien | — | ✓ |
| French | ✓ | ✓ |
| German | ✓ | ✓ |
| Russian | ✓ | ✓ |
| Italian | ✓ | ✓ |
| Spanish | ✓ | ✓ |
| Portuguese | ✓ | ✓ |
| Japanese | ✓ | ✓ |
| Korean | ✓ | ✓ |
| Indonesian | Text only | — |
| Vietnamese | Text only | — |
| Thai | Text only | — |
| Arabic | Text only | — |
| Hindi | Text only | — |
| Greek | Text only | — |
| Turkish | Text only | — |
qwen-omni-turbo supports Chinese and English only.Recommended models
| Model | API | Input | Function calling | Built-in tools | Thinking | Batch |
|---|---|---|---|---|---|---|
qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | — | — | — | — |
qwen3-omni-flash | HTTP | Text, audio, image, video | ✓ | — | ✓ | — |
qwen3-livetranslate-flash-realtime | WebSocket | Audio | — | — | — | — |
qwen3-livetranslate-flash | HTTP | Audio, video | — | — | — | — |
All models
Qwen3-Omni
Qwen3-Omni
| Model | API | Input | Function calling | Built-in tools | Thinking | Batch |
|---|---|---|---|---|---|---|
qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | — | — | — | — |
qwen3-omni-flash-realtime-2025-12-01 | WebSocket | Text, audio, image, video | — | — | — | — |
qwen3-omni-flash-realtime-2025-09-15 | WebSocket | Text, audio, image, video | — | — | — | — |
qwen3-omni-flash | HTTP | Text, audio, image, video | ✓ | — | ✓ | — |
qwen3-omni-flash-2025-12-01 | HTTP | Text, audio, image, video | ✓ | — | ✓ | — |
qwen3-omni-flash-2025-09-15 | HTTP | Text, audio, image, video | ✓ | — | ✓ | — |
Qwen3-Livetranslate
Qwen3-Livetranslate
| Model | API | Input | Languages |
|---|---|---|---|
qwen3-livetranslate-flash-realtime | WebSocket | Audio | 18 |
qwen3-livetranslate-flash-realtime-2025-09-22 | WebSocket | Audio | 18 |
qwen3-livetranslate-flash | HTTP | Audio, video | 18 |
qwen3-livetranslate-flash-2025-12-01 | HTTP | Audio, video | 18 |
Legacy
Legacy
These models are no longer updated. Use Qwen3-Omni for new projects.
| Model | Input | API |
|---|---|---|
qwen2.5-omni-7b | Text, audio, image, video | HTTP |
qwen-omni-turbo | Text, audio, image, video | HTTP |
qwen-omni-turbo-latest | Text, audio, image, video | HTTP |
qwen-omni-turbo-2025-03-26 | Text, audio, image, video | HTTP |
qwen-omni-turbo-realtime | Text, audio | WebSocket |
qwen-omni-turbo-realtime-latest | Text, audio | WebSocket |
qwen-omni-turbo-realtime-2025-05-08 | Text, audio | WebSocket |