
Speech-to-speech models

Choose a model for voice-in → voice-out scenarios: voice conversation, speech translation, simultaneous interpretation, and more.

This page covers voice-in → voice-out scenarios. For vision understanding, audio/video analysis, content moderation, and broader multimodal capabilities, see Omni-modal.

S2S vs pipeline

Two ways to build voice apps:
| | S2S | Pipeline (ASR + LLM + TTS) |
|---|---|---|
| Latency | Low: single model, streaming | Higher: 3 sequential hops |
| Audio understanding | End-to-end: hears tone, emotion, responds in kind | Transcribes to text first, so audio nuance is lost |
| Voice customization | Preset voices via system prompt | Voice cloning, voice design (CosyVoice) |
  • Use S2S when interactive conversation, low latency, and audio-aware responses matter. Continue reading this page.
  • Use Pipeline when you need custom voices or want to mix-and-match the best ASR, LLM, and TTS for each stage.
This page covers the S2S single-model route (Omni, Livetranslate). For the Pipeline route, choose your three components (ASR, LLM, and TTS) separately.

Real-time or file-based?

  • Real-time (WebSocket) — Use for live voice interfaces: voice assistants, call centers, simultaneous interpretation. Audio streams in, speech streams out. Model names contain -realtime.
  • File-based (HTTP) — Use when you can trade latency for better results: video dubbing, podcast translation, offline content processing. File-based mode also unlocks function calling, web search, thinking mode, and video context (see "Companion capabilities" below).
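The naming rule above can be turned into a small routing helper. This is a minimal sketch, not an official API: the function name `api_for_model` is hypothetical, and it assumes the `-realtime` convention holds for every model name passed in, as it does for the models listed on this page.

```python
def api_for_model(model_name: str) -> str:
    """Return the transport implied by a model name.

    Assumption: realtime (WebSocket) models contain "-realtime" in their
    name, and every other model on this page is file-based (HTTP).
    """
    return "WebSocket" if "-realtime" in model_name else "HTTP"


if __name__ == "__main__":
    for name in ("qwen3.5-omni-plus-realtime", "qwen3-livetranslate-flash"):
        print(name, "->", api_for_model(name))
```

Pinned date versions such as `qwen3-omni-flash-realtime-2025-12-01` keep the `-realtime` marker, so a substring check is enough here.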

Choose a model by scenario (S2S single-model route)

The following scenarios are for the S2S single-model route. For the Pipeline route, choose components from the ASR / LLM / TTS docs linked above.
| Scenario | Recommended model | API |
|---|---|---|
| Voice assistant / customer service | qwen3.5-omni-plus-realtime | WebSocket |
| Cost-sensitive conversations | qwen3.5-omni-flash-realtime | WebSocket |
| Simultaneous interpretation / live translation | qwen3.5-livetranslate-flash-realtime | WebSocket |
| Video dubbing / podcast translation | qwen3-livetranslate-flash | HTTP |
| Video analysis / batch tagging (thinking mode) | qwen3-omni-flash | HTTP |
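If you select models in application code, the scenario recommendations above can be kept in one lookup table. The sketch below is illustrative only: the scenario keys and the `recommend` helper are hypothetical names, while the model names and APIs are copied from the recommendations above.

```python
# Scenario key -> (recommended model, API). Keys are hypothetical labels;
# the values mirror the scenario recommendations on this page.
RECOMMENDED = {
    "voice_assistant": ("qwen3.5-omni-plus-realtime", "WebSocket"),
    "cost_sensitive_conversation": ("qwen3.5-omni-flash-realtime", "WebSocket"),
    "live_translation": ("qwen3.5-livetranslate-flash-realtime", "WebSocket"),
    "video_dubbing": ("qwen3-livetranslate-flash", "HTTP"),
    "video_analysis": ("qwen3-omni-flash", "HTTP"),
}


def recommend(scenario: str) -> tuple[str, str]:
    """Return (model, api) for a scenario key, or raise ValueError."""
    try:
        return RECOMMENDED[scenario]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED))
        raise ValueError(f"unknown scenario {scenario!r}; expected one of: {known}") from None


if __name__ == "__main__":
    model, api = recommend("live_translation")
    print(model, api)
```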

Companion capabilities of S2S models

The following capabilities are provided directly by the Qwen3.5-Omni / Qwen3-Omni models in the S2S single-model route. In the Pipeline route, these capabilities need to be supported by the respective LLM or other components.

Function calling

Let the model take actions based on what it hears and sees: check a knowledge base, query a schedule, or trigger a workflow. Supported by Qwen3.5-Omni (WebSocket and HTTP) and Qwen3-Omni-Flash (HTTP only).
Livetranslate models and qwen3-omni-flash-realtime do not support this feature.

Web search

Let the model retrieve real-time information to answer questions about current events, stock prices, weather, and more. Supported by Qwen3.5-Omni (HTTP and WebSocket), including the Plus and Flash variants. The model decides autonomously whether to search.
Qwen3-Omni-Flash and Livetranslate models do not support this feature. Web search and function calling cannot be enabled at the same time.
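The support rules for function calling and web search can be checked before a request is sent. The following is a minimal sketch under stated assumptions: `check_tools` is a hypothetical helper, model families are recognized by name prefix, and only the families described on this page are covered (other names raise `ValueError` because this page does not state their support).

```python
def check_tools(model: str, function_calling: bool = False,
                web_search: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the combination is allowed."""
    realtime = "-realtime" in model
    is_q35_omni = model.startswith("qwen3.5-omni")
    is_q3_omni_flash = model.startswith("qwen3-omni-flash")
    is_livetranslate = "livetranslate" in model
    if not (is_q35_omni or is_q3_omni_flash or is_livetranslate):
        raise ValueError(f"support for {model!r} is not described on this page")

    problems = []
    if function_calling and web_search:
        problems.append("web search and function calling cannot be enabled together")
    # Function calling: Qwen3.5-Omni (WebSocket and HTTP), Qwen3-Omni-Flash (HTTP only).
    if function_calling and not (is_q35_omni or (is_q3_omni_flash and not realtime)):
        problems.append(f"{model} does not support function calling")
    # Web search: Qwen3.5-Omni only.
    if web_search and not is_q35_omni:
        problems.append(f"{model} does not support web search")
    return problems


if __name__ == "__main__":
    print(check_tools("qwen3-omni-flash-realtime", function_calling=True))
```

Returning a list of problems rather than raising on the first one lets a caller report every conflict at once.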

Thinking mode

Use qwen3-omni-flash (HTTP) when answer quality matters more than latency. The model reasons step-by-step before responding — useful for video analysis, batch tagging, and complex Q&A.
Thinking mode does not support speech output.
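Because thinking mode and speech output are mutually exclusive, it can help to validate the combination up front. This is a hedged sketch: `output_kinds` and its return values are hypothetical names for illustration and do not correspond to a documented request parameter.

```python
def output_kinds(thinking: bool, want_speech: bool) -> list[str]:
    """Return the output kinds a request may ask for.

    Raises ValueError for the one combination this page rules out:
    thinking mode together with speech output.
    """
    if thinking and want_speech:
        raise ValueError("thinking mode does not support speech output")
    return ["text", "audio"] if want_speech else ["text"]


if __name__ == "__main__":
    print(output_kinds(thinking=True, want_speech=False))
```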

Translation

All model families can translate speech:
  • Qwen3.5-Livetranslate — 60 languages (29 audio+text, 31 text-only), ~3-second latency, out of the box. WebSocket realtime only.
  • Qwen3-Livetranslate — 18 languages + 5 Chinese dialects, ~3-second latency, out of the box. File-based variant accepts video for context-aware accuracy. 7 languages output text only (no audio).
  • Qwen3.5-Omni — 29 output languages + 7 Chinese dialects. Superior audio-video understanding and web search. Inject terminology and domain context via system prompt. Both realtime and file-based.
  • Qwen3-Omni-Flash — 11 output languages + 8 Chinese dialects. Inject terminology and domain context via system prompt for specialized fields. Both realtime and file-based. Lower cost.
In short: choose Qwen3.5-Livetranslate for quick setup with the broadest translation language coverage (60 languages, ~3-second latency), Qwen3.5-Omni for the best quality with web search and the widest input coverage (113 input languages/dialects), and Qwen3-Omni-Flash for cost-sensitive scenarios.
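The family comparison above can also be expressed as data and filtered by requirement. This is a sketch only: the `Family` class and `candidates` function are hypothetical, the language counts are the headline numbers quoted in the list above (Chinese dialects not included), and it says nothing about which specific languages each family covers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Family:
    name: str
    languages: int    # headline language count quoted above, dialects excluded
    realtime: bool    # a WebSocket variant exists
    file_based: bool  # an HTTP variant exists


FAMILIES = [
    Family("Qwen3.5-Livetranslate", 60, realtime=True, file_based=False),
    Family("Qwen3-Livetranslate", 18, realtime=True, file_based=True),
    Family("Qwen3.5-Omni", 29, realtime=True, file_based=True),
    Family("Qwen3-Omni-Flash", 11, realtime=True, file_based=True),
]


def candidates(min_languages: int, need_file_based: bool = False) -> list[str]:
    """Names of families that meet the language count and API requirement."""
    return [
        f.name
        for f in FAMILIES
        if f.languages >= min_languages and (f.file_based or not need_file_based)
    ]


if __name__ == "__main__":
    print(candidates(18, need_file_based=True))
```

A count-based filter is only a first pass; confirm the individual languages you need against the language table before committing to a family.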
Language | Qwen3.5-Livetranslate | Qwen3-Livetranslate | Qwen3.5-Omni | Qwen3-Omni-Flash
English
Chinese (Mandarin)
  + Cantonese: Text only
  + Sichuanese
  + Shanghainese
  + Beijing
  + Tianjin
  + Nanjing
  + Shaanxi
  + Hokkien
French
German
Russian
Italian
Spanish
Portuguese
Japanese
Korean
Arabic: Text only
Thai: Text only
Vietnamese: Text only
Indonesian: Text only
Turkish: Text only
Hindi: Text only
Malay
Dutch
Urdu
Norwegian
Swedish
Danish
Hebrew
Finnish
Polish
Icelandic
Czech
Tagalog
Persian
Greek: Text only; Text only
Afrikaans: Text only
Asturian: Text only
Belarusian: Text only
Bulgarian: Text only
Bengali: Text only
Bosnian: Text only
Catalan: Text only
Cebuano: Text only
Estonian: Text only
Galician: Text only
Gujarati: Text only
Croatian: Text only
Hungarian: Text only
Javanese: Text only
Kazakh: Text only
Kannada: Text only
Kyrgyz: Text only
Latvian: Text only
Macedonian: Text only
Malayalam: Text only
Marathi: Text only
Punjabi: Text only
Romanian: Text only
Slovak: Text only
Slovenian: Text only
Swahili: Text only
Tajik: Text only
Azerbaijani: Text only
Ukrainian: Text only
✓ = audio + text output. "Text only" = no audio output for that language. Qwen3.5-Livetranslate supports 60 languages total (29 audio+text, 31 text-only). Qwen3.5-Omni supports 113 input languages/dialects total. See full list for details. Legacy qwen-omni-turbo supports Chinese and English only.
The table below lists the common entry point for each family. For pinned date versions (for version regression or stability), see "All models" below.
| Model | API | Input | Function calling | Web search | Thinking | Translation |
|---|---|---|---|---|---|---|
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image, video | | | | 29 langs |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video | | | | 29 langs |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image, video | | | | 29 langs |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video | | | | 29 langs |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | | | | 11 langs |
| qwen3-omni-flash | HTTP | Text, audio, image, video | | | | 11 langs |
| qwen3.5-livetranslate-flash-realtime | WebSocket | Audio | | | | 60 langs |
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | | | | 18 langs |
| qwen3-livetranslate-flash | HTTP | Audio, video | | | | 18 langs |

All models

| Model | API | Input | Function calling | Web search | Thinking | Batch |
|---|---|---|---|---|---|---|
| qwen3.5-omni-plus-realtime | WebSocket | Text, audio, image, video | | | | |
| qwen3.5-omni-plus-realtime-2026-03-15 | WebSocket | Text, audio, image, video | | | | |
| qwen3.5-omni-flash-realtime | WebSocket | Text, audio, image, video | | | | |
| qwen3.5-omni-flash-realtime-2026-03-15 | WebSocket | Text, audio, image, video | | | | |
| qwen3.5-omni-plus | HTTP | Text, audio, image, video | | | | |
| qwen3.5-omni-plus-2026-03-15 | HTTP | Text, audio, image, video | | | | |
| qwen3.5-omni-flash | HTTP | Text, audio, image, video | | | | |
| qwen3.5-omni-flash-2026-03-15 | HTTP | Text, audio, image, video | | | | |

| Model | API | Input | Function calling | Web search | Thinking | Batch |
|---|---|---|---|---|---|---|
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | | | | |
| qwen3-omni-flash-realtime-2025-12-01 | WebSocket | Text, audio, image, video | | | | |
| qwen3-omni-flash-realtime-2025-09-15 | WebSocket | Text, audio, image, video | | | | |
| qwen3-omni-flash | HTTP | Text, audio, image, video | | | | |
| qwen3-omni-flash-2025-12-01 | HTTP | Text, audio, image, video | | | | |
| qwen3-omni-flash-2025-09-15 | HTTP | Text, audio, image, video | | | | |

| Model | API | Input | Languages |
|---|---|---|---|
| qwen3.5-livetranslate-flash-realtime | WebSocket | Audio | 60 |
| qwen3.5-livetranslate-flash-realtime-2026-05-19 | WebSocket | Audio | 60 |

| Model | API | Input | Languages |
|---|---|---|---|
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash | HTTP | Audio, video | 18 |
| qwen3-livetranslate-flash-2025-12-01 | HTTP | Audio, video | 18 |

These models are no longer updated. Use Qwen3.5-Omni or Qwen3-Omni-Flash for new projects.
| Model | Input | API |
|---|---|---|
| qwen2.5-omni-7b | Text, audio, image, video | HTTP |
| qwen-omni-turbo | Text, audio, image, video | HTTP |
| qwen-omni-turbo-latest | Text, audio, image, video | HTTP |
| qwen-omni-turbo-2025-03-26 | Text, audio, image, video | HTTP |
| qwen-omni-turbo-realtime | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-latest | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-2025-05-08 | Text, audio | WebSocket |
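
The pinned date versions listed above follow an `alias-YYYY-MM-DD` naming pattern. If you log or compare model names, a small parser can separate the alias from the pinned date. This is a sketch: `split_pinned` is a hypothetical helper, and the pattern is inferred from the names on this page, not from a documented naming specification.

```python
import re
from typing import Optional

# Matches a trailing "-YYYY-MM-DD" suffix; everything before it is the alias.
_PINNED = re.compile(r"^(?P<alias>.+)-(?P<date>\d{4}-\d{2}-\d{2})$")


def split_pinned(model_name: str) -> tuple[str, Optional[str]]:
    """Return (alias, date) for a pinned name, or (name, None) otherwise."""
    match = _PINNED.match(model_name)
    if match:
        return match.group("alias"), match.group("date")
    return model_name, None


if __name__ == "__main__":
    print(split_pinned("qwen3.5-omni-plus-realtime-2026-03-15"))
    print(split_pinned("qwen3-omni-flash"))
```

Names without a date suffix, including `-latest` aliases, are returned unchanged with `None` as the date.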

Next steps

After choosing a model, refer to the corresponding usage guide:
