Choose a model for voice-in → voice-out scenarios: voice conversation, speech translation, simultaneous interpretation, and more.
This page covers voice-in → voice-out scenarios. For vision understanding, audio/video analysis, content moderation, and broader multimodal capabilities, see Omni-modal.
S2S vs pipeline
Two ways to build voice apps:
| | S2S | Pipeline (ASR + LLM + TTS) |
|---|---|---|
| Latency | Low — single model, streaming | Higher — 3 sequential hops |
| Audio understanding | End-to-end — hears tone, emotion, responds in kind | Transcribes to text first — audio nuance lost |
| Voice customization | Preset voices via system prompt | Voice cloning, voice design (CosyVoice) |
- Use S2S when interactive conversation, low latency, and audio-aware responses matter. Continue reading this page.
- Use Pipeline when you need custom voices or want to mix-and-match the best ASR, LLM, and TTS for each stage.
- ASR (speech recognition): Speech-to-text
- LLM (language model): Text generation
- TTS (speech synthesis): Text-to-speech
Real-time or file-based?
- Real-time (WebSocket): Use for live voice interfaces such as voice assistants, call centers, and simultaneous interpretation. Audio streams in, speech streams out. Model names contain -realtime.
- File-based (HTTP): Use when you can trade latency for better results, for example video dubbing, podcast translation, and offline content processing. File-based mode also unlocks function calling, web search, thinking mode, and video context (see "Companion capabilities" below).
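The naming convention above (real-time model names contain -realtime) can be turned into a small helper. This is a minimal sketch: the function name `transport_for` and the returned labels are illustrative choices, not part of any SDK.

```python
def transport_for(model_name: str) -> str:
    """Guess the API transport from a model name.

    Relies only on the convention stated on this page: real-time
    (WebSocket) model names contain "-realtime"; everything else is
    treated as file-based (HTTP).
    """
    return "websocket" if "-realtime" in model_name else "http"


if __name__ == "__main__":
    for name in ("qwen3.5-omni-plus-realtime", "qwen3-livetranslate-flash"):
        print(name, "->", transport_for(name))
```

A substring check is used rather than a suffix check so that pinned versions such as qwen3.5-omni-plus-realtime-2026-03-15 are still recognized as real-time.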
Choose a model by scenario (S2S single-model route)
The following scenarios are for the S2S single-model route. For the Pipeline route, choose components from the ASR / LLM / TTS docs linked above.
| Scenario | Recommended model | API |
|---|---|---|
| Voice assistant / customer service | qwen3.5-omni-plus-realtime | WebSocket |
| Cost-sensitive conversations | qwen3.5-omni-flash-realtime | WebSocket |
| Simultaneous interpretation / live translation | qwen3.5-livetranslate-flash-realtime | WebSocket |
| Video dubbing / podcast translation | qwen3-livetranslate-flash | HTTP |
| Video analysis / batch tagging (thinking mode) | qwen3-omni-flash | HTTP |
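If the choice is made in code, the table above can be kept as a lookup. The sketch below copies the model names and APIs from the table; the scenario keys (`voice_assistant`, `cost_sensitive`, and so on) are hypothetical labels invented for this example.

```python
from typing import Dict, NamedTuple


class Choice(NamedTuple):
    model: str
    api: str


# Rows copied from the scenario table; the dictionary keys are made-up labels.
RECOMMENDED: Dict[str, Choice] = {
    "voice_assistant": Choice("qwen3.5-omni-plus-realtime", "WebSocket"),
    "cost_sensitive": Choice("qwen3.5-omni-flash-realtime", "WebSocket"),
    "live_translation": Choice("qwen3.5-livetranslate-flash-realtime", "WebSocket"),
    "file_translation": Choice("qwen3-livetranslate-flash", "HTTP"),
    "video_analysis": Choice("qwen3-omni-flash", "HTTP"),
}


def recommend(scenario: str) -> Choice:
    """Return the recommended model and API for a scenario label."""
    try:
        return RECOMMENDED[scenario]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED))
        raise ValueError(f"unknown scenario {scenario!r}; expected one of: {known}") from None
```

Keeping the mapping in one place means a model upgrade only touches the table, not every call site.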
Companion capabilities of S2S models
The following capabilities are provided directly by the Qwen3.5-Omni / Qwen3-Omni models in the S2S single-model route. In the Pipeline route, these capabilities need to be supported by the respective LLM or other components.
Function calling
Let the model take actions based on what it hears and sees, such as checking a knowledge base, querying a schedule, or triggering a workflow. Supported by Qwen3.5-Omni (WebSocket and HTTP) and Qwen3-Omni-Flash (HTTP only).
Livetranslate models and qwen3-omni-flash-realtime do not support this feature.
Web search
Let the model retrieve real-time information to answer questions about current events, stock prices, weather, and more. Supported by Qwen3.5-Omni (HTTP and WebSocket), in both the Plus and Flash variants. The model decides autonomously whether to search.
Qwen3-Omni-Flash and Livetranslate models do not support this feature. Web search and function calling cannot be enabled at the same time.
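The support rules for function calling and web search, including the restriction that they cannot be combined, lend themselves to a pre-flight check. The following is a sketch only: the capability flags are copied from the "Recommended models" table on this page, while `check_tools` and the feature labels are hypothetical names, not an official API.

```python
from typing import Dict, FrozenSet, List

FC, WS = "function_calling", "web_search"

# Flags transcribed from the "Recommended models" table on this page.
CAPABILITIES: Dict[str, FrozenSet[str]] = {
    "qwen3.5-omni-plus-realtime": frozenset({FC, WS}),
    "qwen3.5-omni-plus": frozenset({FC, WS}),
    "qwen3.5-omni-flash-realtime": frozenset({FC, WS}),
    "qwen3.5-omni-flash": frozenset({FC, WS}),
    "qwen3-omni-flash-realtime": frozenset(),
    "qwen3-omni-flash": frozenset({FC}),
    "qwen3.5-livetranslate-flash-realtime": frozenset(),
    "qwen3-livetranslate-flash-realtime": frozenset(),
    "qwen3-livetranslate-flash": frozenset(),
}


def check_tools(model: str, function_calling: bool = False,
                web_search: bool = False) -> List[str]:
    """Return a list of problems; an empty list means the combination is allowed."""
    if model not in CAPABILITIES:
        return [f"unknown model: {model}"]
    problems: List[str] = []
    for feature, wanted in ((FC, function_calling), (WS, web_search)):
        if wanted and feature not in CAPABILITIES[model]:
            problems.append(f"{model} does not support {feature}")
    # Stated above: the two features are mutually exclusive.
    if function_calling and web_search:
        problems.append("web_search and function_calling cannot be enabled together")
    return problems
```

Returning a list instead of raising on the first problem lets a caller report every issue with a configuration at once.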
Thinking mode
Use qwen3-omni-flash (HTTP) when answer quality matters more than latency. The model reasons step-by-step before responding — useful for video analysis, batch tagging, and complex Q&A.
Thinking mode does not support speech output.
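Because thinking mode rules out speech output, a client may want to decide the output modalities up front. The sketch below encodes two assumptions: that thinking is available only on the file-based qwen3-omni-flash models (as the model tables on this page indicate), and that a request otherwise asks for both text and audio. The function names and the modality labels are illustrative, not taken from an SDK.

```python
from typing import List


def supports_thinking(model: str) -> bool:
    """True for file-based qwen3-omni-flash models, including pinned versions."""
    return model.startswith("qwen3-omni-flash") and "-realtime" not in model


def output_modalities(model: str, thinking: bool) -> List[str]:
    """Pick output modalities, dropping audio when thinking mode is on."""
    if thinking and not supports_thinking(model):
        raise ValueError(f"{model} does not offer thinking mode")
    # Thinking mode does not support speech output, so only text is requested.
    return ["text"] if thinking else ["text", "audio"]
```

For a voice-out product this makes the trade-off explicit: enabling thinking means adding a separate TTS step or accepting text-only replies.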
Translation
All model families can translate speech:
- Qwen3.5-Livetranslate — 60 languages (29 audio+text, 31 text-only), ~3-second latency, out of the box. WebSocket realtime only.
- Qwen3-Livetranslate — 18 languages + 5 Chinese dialects, ~3-second latency, out of the box. File-based variant accepts video for context-aware accuracy. 7 languages output text only (no audio).
- Qwen3.5-Omni — 29 output languages + 7 Chinese dialects. Superior audio-video understanding and web search. Inject terminology and domain context via system prompt. Both realtime and file-based.
- Qwen3-Omni-Flash — 11 output languages + 8 Chinese dialects. Inject terminology and domain context via system prompt for specialized fields. Both realtime and file-based. Lower cost.
In short: choose Qwen3.5-Livetranslate for quick setup and the broadest language coverage (60 languages, ~3-second latency), Qwen3.5-Omni for the best quality, with web search and the widest language coverage among the Omni models (29 output languages), and Qwen3-Omni-Flash for cost-sensitive scenarios.
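When the target language is known in advance, the choice can be narrowed programmatically. The sketch below uses a five-language excerpt of the "Supported languages" table, not the full list, and assumes that a ✓ in the table means spoken output is available while "Text only" means text output only. The name `families_for` is hypothetical.

```python
from typing import Dict, List

# Excerpt of the "Supported languages" table: "audio" = ✓, "text" = Text only.
# Families that do not support a language are simply absent.
SUPPORT: Dict[str, Dict[str, str]] = {
    "English": {"Qwen3.5-Livetranslate": "audio", "Qwen3-Livetranslate": "audio",
                "Qwen3.5-Omni": "audio", "Qwen3-Omni-Flash": "audio"},
    "Arabic": {"Qwen3.5-Livetranslate": "audio", "Qwen3-Livetranslate": "text",
               "Qwen3.5-Omni": "audio"},
    "Thai": {"Qwen3.5-Livetranslate": "audio", "Qwen3-Livetranslate": "text",
             "Qwen3.5-Omni": "audio", "Qwen3-Omni-Flash": "audio"},
    "Dutch": {"Qwen3.5-Livetranslate": "audio", "Qwen3.5-Omni": "audio"},
    "Greek": {"Qwen3.5-Livetranslate": "text", "Qwen3-Livetranslate": "text"},
}


def families_for(language: str, need_speech: bool) -> List[str]:
    """List model families that can output the given target language."""
    row = SUPPORT.get(language, {})
    return [family for family, level in row.items()
            if level == "audio" or not need_speech]
```

For example, `families_for("Greek", need_speech=True)` returns an empty list, which signals that Greek speech output would need the Pipeline route with a separate TTS component.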
Supported languages
| Language | Qwen3.5-Livetranslate | Qwen3-Livetranslate | Qwen3.5-Omni | Qwen3-Omni-Flash |
|---|---|---|---|---|
| English | ✓ | ✓ | ✓ | ✓ |
| Chinese (Mandarin) | ✓ | ✓ | ✓ | ✓ |
| + Cantonese | Text only | ✓ | ✓ | ✓ |
| + Sichuanese | ✓ | ✓ | ✓ | ✓ |
| + Shanghainese | ✓ | ✓ | ✓ | ✓ |
| + Beijing | ✓ | ✓ | ✓ | ✓ |
| + Tianjin | ✓ | ✓ | ✓ | ✓ |
| + Nanjing | — | — | ✓ | ✓ |
| + Shaanxi | — | — | ✓ | ✓ |
| + Hokkien | — | — | ✓ | ✓ |
| French | ✓ | ✓ | ✓ | ✓ |
| German | ✓ | ✓ | ✓ | ✓ |
| Russian | ✓ | ✓ | ✓ | ✓ |
| Italian | ✓ | ✓ | ✓ | ✓ |
| Spanish | ✓ | ✓ | ✓ | ✓ |
| Portuguese | ✓ | ✓ | ✓ | ✓ |
| Japanese | ✓ | ✓ | ✓ | ✓ |
| Korean | ✓ | ✓ | ✓ | ✓ |
| Arabic | ✓ | Text only | ✓ | — |
| Thai | ✓ | Text only | ✓ | ✓ |
| Vietnamese | ✓ | Text only | ✓ | — |
| Indonesian | ✓ | Text only | ✓ | — |
| Turkish | ✓ | Text only | ✓ | — |
| Hindi | ✓ | Text only | ✓ | — |
| Malay | ✓ | — | ✓ | — |
| Dutch | ✓ | — | ✓ | — |
| Urdu | ✓ | — | ✓ | — |
| Norwegian | ✓ | — | ✓ | — |
| Swedish | ✓ | — | ✓ | — |
| Danish | ✓ | — | ✓ | — |
| Hebrew | ✓ | — | ✓ | — |
| Finnish | ✓ | — | ✓ | — |
| Polish | ✓ | — | ✓ | — |
| Icelandic | ✓ | — | ✓ | — |
| Czech | ✓ | — | ✓ | — |
| Tagalog | ✓ | — | ✓ | — |
| Persian | ✓ | — | ✓ | — |
| Greek | Text only | Text only | — | — |
| Afrikaans | Text only | — | — | — |
| Asturian | Text only | — | — | — |
| Belarusian | Text only | — | — | — |
| Bulgarian | Text only | — | — | — |
| Bengali | Text only | — | — | — |
| Bosnian | Text only | — | — | — |
| Catalan | Text only | — | — | — |
| Cebuano | Text only | — | — | — |
| Estonian | Text only | — | — | — |
| Galician | Text only | — | — | — |
| Gujarati | Text only | — | — | — |
| Croatian | Text only | — | — | — |
| Hungarian | Text only | — | — | — |
| Javanese | Text only | — | — | — |
| Kazakh | Text only | — | — | — |
| Kannada | Text only | — | — | — |
| Kyrgyz | Text only | — | — | — |
| Latvian | Text only | — | — | — |
| Macedonian | Text only | — | — | — |
| Malayalam | Text only | — | — | — |
| Marathi | Text only | — | — | — |
| Punjabi | Text only | — | — | — |
| Romanian | Text only | — | — | — |
| Slovak | Text only | — | — | — |
| Slovenian | Text only | — | — | — |
| Swahili | Text only | — | — | — |
| Tajik | Text only | — | — | — |
| Azerbaijani | Text only | — | — | — |
| Ukrainian | Text only | — | — | — |
qwen-omni-turbo supports Chinese and English only.
Recommended models
The table below lists the common entry point for each family. For pinned date versions (for version regression or stability), see "All models" below.
| Model | API | Input | Function calling | Web search | Thinking | Translation |
|---|---|---|---|---|---|---|
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image, video | ✓ | ✓ | — | 29 langs |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video | ✓ | ✓ | — | 29 langs |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image, video | ✓ | ✓ | — | 29 langs |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video | ✓ | ✓ | — | 29 langs |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | — | — | — | 11 langs |
| qwen3-omni-flash | HTTP | Text, audio, image, video | ✓ | — | ✓ | 11 langs |
| qwen3.5-livetranslate-flash-realtime | WebSocket | Audio | — | — | — | 60 langs |
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | — | — | — | 18 langs |
| qwen3-livetranslate-flash | HTTP | Audio, video | — | — | — | 18 langs |
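The pinned versions listed under "All models" appear to follow a simple pattern: the entry-point name followed by a -YYYY-MM-DD suffix. Assuming that pattern holds, a name can be split into its base model and snapshot date as sketched below. The function name `split_snapshot` is hypothetical.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Assumed pattern: "<base>-YYYY-MM-DD", inferred from the names on this page.
_PINNED = re.compile(r"^(?P<base>.+)-(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})$")


def split_snapshot(model_name: str) -> Tuple[str, Optional[date]]:
    """Return (base_name, snapshot_date); the date is None for unpinned names."""
    match = _PINNED.match(model_name)
    if match is None:
        return model_name, None
    snapshot = date(int(match["y"]), int(match["m"]), int(match["d"]))
    return match["base"], snapshot


if __name__ == "__main__":
    print(split_snapshot("qwen3.5-omni-plus-realtime-2026-03-15"))
    print(split_snapshot("qwen3-omni-flash"))
```

A helper like this makes it easy to log which snapshot a deployment is pinned to, or to group pinned versions under their entry-point name.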
All models
Qwen3.5-Omni
| Model | API | Input | Function calling | Web search | Thinking | Batch |
|---|---|---|---|---|---|---|
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image, video | ✓ | ✓ | — | — |
| qwen3.5-omni-plus-realtime-2026-03-15 | WebSocket | Text, audio, image, video | ✓ | ✓ | — | — |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image, video | ✓ | ✓ | — | — |
| qwen3.5-omni-flash-realtime-2026-03-15 | WebSocket | Text, audio, image, video | ✓ | ✓ | — | — |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video | ✓ | ✓ | — | — |
| qwen3.5-omni-plus-2026-03-15 | HTTP | Text, audio, image, video | ✓ | ✓ | — | — |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video | ✓ | ✓ | — | — |
| qwen3.5-omni-flash-2026-03-15 | HTTP | Text, audio, image, video | ✓ | ✓ | — | — |
Qwen3-Omni-Flash
| Model | API | Input | Function calling | Web search | Thinking | Batch |
|---|---|---|---|---|---|---|
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | — | — | — | — |
| qwen3-omni-flash-realtime-2025-12-01 | WebSocket | Text, audio, image, video | — | — | — | — |
| qwen3-omni-flash-realtime-2025-09-15 | WebSocket | Text, audio, image, video | — | — | — | — |
| qwen3-omni-flash | HTTP | Text, audio, image, video | ✓ | — | ✓ | — |
| qwen3-omni-flash-2025-12-01 | HTTP | Text, audio, image, video | ✓ | — | ✓ | — |
| qwen3-omni-flash-2025-09-15 | HTTP | Text, audio, image, video | ✓ | — | ✓ | — |
Qwen3.5-Livetranslate
| Model | API | Input | Languages |
|---|---|---|---|
| qwen3.5-livetranslate-flash-realtime | WebSocket | Audio | 60 |
| qwen3.5-livetranslate-flash-realtime-2026-05-19 | WebSocket | Audio | 60 |
Qwen3-Livetranslate
| Model | API | Input | Languages |
|---|---|---|---|
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash | HTTP | Audio, video | 18 |
| qwen3-livetranslate-flash-2025-12-01 | HTTP | Audio, video | 18 |
Legacy
These models are no longer updated. Use Qwen3.5-Omni or Qwen3-Omni-Flash for new projects.
| Model | Input | API |
|---|---|---|
| qwen2.5-omni-7b | Text, audio, image, video | HTTP |
| qwen-omni-turbo | Text, audio, image, video | HTTP |
| qwen-omni-turbo-latest | Text, audio, image, video | HTTP |
| qwen-omni-turbo-2025-03-26 | Text, audio, image, video | HTTP |
| qwen-omni-turbo-realtime | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-latest | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-2025-05-08 | Text, audio | WebSocket |
Next steps
After choosing a model, refer to the corresponding usage guide:
- Qwen3.5-Omni / Qwen3-Omni (WebSocket, real-time) → Real-time multimodal speech
- Qwen3.5-Omni / Qwen3-Omni (HTTP, file-based) → Multimodal speech
- Qwen3.5-Livetranslate (WebSocket, real-time) → Real-time translation
- Qwen3-Livetranslate (HTTP, file-based) → File-based translation