
Speech-to-speech models

Choose a model for voice conversation, speech translation, and more.

S2S vs pipeline

Two ways to build voice-enabled apps:
| | S2S | Pipeline (ASR + LLM + TTS) |
| --- | --- | --- |
| Latency | Low: single model, streaming | Higher: 3 sequential hops |
| Audio understanding | End-to-end; hears tone and emotion, responds in kind | Transcribes to text first; audio nuance lost |
| Voice customization | Preset voices via system prompt | Voice cloning, voice design (CosyVoice) |
  • Use S2S when interactive conversation, low latency, and audio-aware responses matter. Continue reading this page.
  • Use Pipeline when you need custom voices or want to mix-and-match the best ASR, LLM, and TTS for each stage.
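The "3 sequential hops" cost of the pipeline approach can be made concrete with a stub sketch. The function names and canned outputs below are illustrative placeholders, not real API calls:

```python
# Sketch of the pipeline approach (ASR -> LLM -> TTS): three sequential
# hops, each adding latency. All three stages are stubs.

def transcribe(audio: bytes) -> str:          # hop 1: ASR
    return "what's the weather in Berlin?"    # stub result

def generate_reply(text: str) -> str:         # hop 2: LLM
    return "It's sunny in Berlin today."      # stub result

def synthesize(text: str) -> bytes:           # hop 3: TTS
    return text.encode()                      # stub: pretend this is audio

def pipeline(audio: bytes) -> bytes:
    # Each stage must finish before the next can begin -- this
    # serialization is where the extra latency comes from. An S2S model
    # replaces all three hops with a single streaming call.
    return synthesize(generate_reply(transcribe(audio)))
```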

Real-time or file-based?

  • Real-time (WebSocket) — Use for live voice interfaces: voice assistants, call centers, simultaneous interpretation. Audio streams in, speech streams out. Model names contain -realtime.
  • File-based (HTTP) — Use when you can trade latency for better results: video dubbing, podcast translation, offline content processing. Unlocks thinking mode and function calling (Omni) and video context (Livetranslate).
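A realtime session is a WebSocket connection that streams audio chunks as events. This is a minimal sketch only: the endpoint URL is a placeholder and the event shape is an assumption modeled on common realtime-API conventions; check the realtime API reference for the exact protocol.

```python
# Hedged sketch of a realtime S2S session setup. BASE_WS and the
# "input_audio_buffer.append" event type are assumptions, not confirmed
# protocol details.
import json

BASE_WS = "wss://example-endpoint/api-ws/v1/realtime"  # placeholder endpoint

def session_url(model: str) -> str:
    # Realtime model names contain "-realtime" (see above).
    assert "-realtime" in model, "use a realtime model over WebSocket"
    return f"{BASE_WS}?model={model}"

def audio_append_event(b64_chunk: str) -> str:
    # One streamed input chunk: audio streams in, speech streams out.
    return json.dumps({"type": "input_audio_buffer.append",
                       "audio": b64_chunk})
```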

Function calling

Let the model take actions based on what it hears and sees — check a knowledge base, query a schedule, trigger a workflow. Use qwen3-omni-flash (HTTP). Not available on realtime or Livetranslate models.
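A request with a declared tool might look like the sketch below. The tool schema follows the widely used OpenAI-style function format; the `lookup_schedule` tool and the audio content shape are hypothetical, so confirm field names against the API reference before use.

```python
# Hedged sketch: declare a tool for qwen3-omni-flash over HTTP.
# "lookup_schedule" and the input_audio content block are illustrative
# assumptions, not documented values.
def build_request(question_audio_url: str) -> dict:
    return {
        "model": "qwen3-omni-flash",
        "messages": [{
            "role": "user",
            "content": [{"type": "input_audio",
                         "input_audio": {"data": question_audio_url}}],
        }],
        "tools": [{
            "type": "function",
            "function": {
                "name": "lookup_schedule",  # hypothetical tool
                "description": "Query a user's schedule by date.",
                "parameters": {
                    "type": "object",
                    "properties": {"date": {"type": "string"}},
                    "required": ["date"],
                },
            },
        }],
    }
```

The model hears the question, decides whether to call the tool, and your code executes it and returns the result for the spoken answer.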

Thinking mode

Use qwen3-omni-flash (HTTP) when answer quality matters more than latency. The model reasons step-by-step before producing speech — useful for technical support, complex Q&A, or multi-step instructions.
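As a sketch, enabling reasoning might look like the request builder below. The `enable_thinking` flag name is an assumption carried over from other Qwen3 models; verify the exact parameter in the API reference.

```python
# Hedged sketch: request step-by-step reasoning from qwen3-omni-flash
# over HTTP. The "enable_thinking" flag is an assumed parameter name.
def thinking_request(messages: list) -> dict:
    return {
        "model": "qwen3-omni-flash",
        "messages": messages,
        "extra_body": {"enable_thinking": True},  # assumed flag name
        "stream": True,  # reasoning tokens typically stream before the answer
    }
```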

Translation

Both model families can translate speech:
  • Qwen3-Livetranslate — 18 languages + 6 Chinese dialects, ~3-second latency, out of the box. File-based variant accepts video for context-aware accuracy. 7 languages output text only (no audio).
  • Qwen3-Omni — 10 languages + 8 Chinese dialects. Inject terminology and domain context via system prompt for specialized fields. Both realtime and file-based. More engineering effort, more control.
Livetranslate for quick setup; Omni for domain control.
| Language | Qwen3-Livetranslate | Qwen3-Omni |
| --- | --- | --- |
| English | ✓ | ✓ |
| Chinese (Mandarin) | ✓ | ✓ |
| + Cantonese | † | ✓ |
| + Sichuanese | † | ✓ |
| + Shanghainese | † | ✓ |
| + Beijing | † | ✓ |
| + Tianjin | † | ✓ |
| + Nanjing | † | ✓ |
| + Shaanxi | † | ✓ |
| + Hokkien | † | ✓ |
| French | ✓ | ✓ |
| German | ✓ | ✓ |
| Russian | ✓ | ✓ |
| Italian | ✓ | ✓ |
| Spanish | ✓ | ✓ |
| Portuguese | ✓ | ✓ |
| Japanese | ✓ | ✓ |
| Korean | ✓ | ✓ |
| Indonesian | Text only | — |
| Vietnamese | Text only | — |
| Thai | Text only | — |
| Arabic | Text only | — |
| Hindi | Text only | — |
| Greek | Text only | — |
| Turkish | Text only | — |

✓ = audio + text output. "Text only" = text output only (no audio) for that language. — = not supported.
† Qwen3-Livetranslate supports 6 of the 8 listed Chinese dialects; Qwen3-Omni supports all 8. See the model reference for the exact list.
Legacy qwen-omni-turbo supports Chinese and English only.
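The terminology injection that Qwen3-Omni supports can be sketched as a system-prompt builder. The prompt wording and glossary below are illustrative, not a documented template:

```python
# Hedged sketch: inject domain terminology into an Omni translation
# system prompt. The phrasing and glossary entries are examples only.
def translation_system_prompt(src: str, dst: str, glossary: dict) -> str:
    terms = "\n".join(f"- {k} -> {v}" for k, v in glossary.items())
    return (
        f"You are a simultaneous interpreter. Translate {src} speech "
        f"into {dst} speech. Always use this terminology:\n{terms}"
    )

prompt = translation_system_prompt(
    "English", "Japanese",
    {"latency": "レイテンシ", "function calling": "関数呼び出し"},
)
```

This is the "more engineering effort, more control" trade-off in practice: Livetranslate translates out of the box, while Omni lets you pin domain vocabulary per request.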
| Model | API | Input | Function calling | Built-in tools | Thinking | Batch |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | | | | |
| qwen3-omni-flash | HTTP | Text, audio, image, video | ✓ | | ✓ | |
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | | | | |
| qwen3-livetranslate-flash | HTTP | Audio, video | | | | |

All models

| Model | API | Input | Function calling | Built-in tools | Thinking | Batch |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | | | | |
| qwen3-omni-flash-realtime-2025-12-01 | WebSocket | Text, audio, image, video | | | | |
| qwen3-omni-flash-realtime-2025-09-15 | WebSocket | Text, audio, image, video | | | | |
| qwen3-omni-flash | HTTP | Text, audio, image, video | ✓ | | ✓ | |
| qwen3-omni-flash-2025-12-01 | HTTP | Text, audio, image, video | ✓ | | ✓ | |
| qwen3-omni-flash-2025-09-15 | HTTP | Text, audio, image, video | ✓ | | ✓ | |
| Model | API | Input | Languages |
| --- | --- | --- | --- |
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash | HTTP | Audio, video | 18 |
| qwen3-livetranslate-flash-2025-12-01 | HTTP | Audio, video | 18 |
These models are no longer updated. Use Qwen3-Omni for new projects.
| Model | Input | API |
| --- | --- | --- |
| qwen2.5-omni-7b | Text, audio, image, video | HTTP |
| qwen-omni-turbo | Text, audio, image, video | HTTP |
| qwen-omni-turbo-latest | Text, audio, image, video | HTTP |
| qwen-omni-turbo-2025-03-26 | Text, audio, image, video | HTTP |
| qwen-omni-turbo-realtime | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-latest | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-2025-05-08 | Text, audio | WebSocket |

Learn more