
Speech-to-speech models

Choose a model for voice conversation, speech translation, and more.

S2S vs pipeline

Two ways to build voice-enabled apps:
| | S2S | Pipeline (ASR + LLM + TTS) |
| --- | --- | --- |
| Latency | Low: single model, streaming | Higher: 3 sequential hops |
| Audio understanding | End-to-end; hears tone and emotion, responds in kind | Transcribes to text first; audio nuance lost |
| Voice customization | Preset voices via system prompt | Voice cloning, voice design (CosyVoice) |
  • Use S2S when interactive conversation, low latency, and audio-aware responses matter. Continue reading this page.
  • Use Pipeline when you need custom voices or want to mix-and-match the best ASR, LLM, and TTS for each stage.
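The "3 sequential hops" cost of the pipeline approach can be made concrete with a stub sketch. The function names and canned outputs below are illustrative placeholders, not real API calls:

```python
# Sketch of the pipeline approach (ASR -> LLM -> TTS): three sequential
# hops, each adding latency. All three stages are stubs.

def transcribe(audio: bytes) -> str:          # hop 1: ASR
    return "what's the weather in Berlin?"    # stub result

def generate_reply(text: str) -> str:         # hop 2: LLM
    return "It's sunny in Berlin today."      # stub result

def synthesize(text: str) -> bytes:           # hop 3: TTS
    return text.encode()                      # stub: pretend this is audio

def pipeline(audio: bytes) -> bytes:
    # Each stage must finish before the next can begin -- this
    # serialization is where the extra latency comes from. An S2S model
    # replaces all three hops with a single streaming call.
    return synthesize(generate_reply(transcribe(audio)))
```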

Real-time or file-based?

  • Real-time (WebSocket) — Use for live voice interfaces: voice assistants, call centers, simultaneous interpretation. Audio streams in, speech streams out. Model names contain -realtime.
  • File-based (HTTP) — Use when you can trade latency for better results: video dubbing, podcast translation, offline content processing. Unlocks thinking mode and function calling (Omni) and video context (Livetranslate).
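A realtime session is a WebSocket connection that streams audio chunks as events. This is a minimal sketch only: the endpoint URL is a placeholder and the event shape is an assumption modeled on common realtime-API conventions; check the realtime API reference for the exact protocol.

```python
# Hedged sketch of a realtime S2S session setup. BASE_WS and the
# "input_audio_buffer.append" event type are assumptions, not confirmed
# protocol details.
import json

BASE_WS = "wss://example-endpoint/api-ws/v1/realtime"  # placeholder endpoint

def session_url(model: str) -> str:
    # Realtime model names contain "-realtime" (see above).
    assert "-realtime" in model, "use a realtime model over WebSocket"
    return f"{BASE_WS}?model={model}"

def audio_append_event(b64_chunk: str) -> str:
    # One streamed input chunk: audio streams in, speech streams out.
    return json.dumps({"type": "input_audio_buffer.append",
                       "audio": b64_chunk})
```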

Function calling

Let the model take actions based on what it hears and sees — check a knowledge base, query a schedule, trigger a workflow. Use qwen3-omni-flash (HTTP). Not available on realtime or Livetranslate models.
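A request with a declared tool might look like the sketch below. The tool schema follows the widely used OpenAI-style function format; the `lookup_schedule` tool and the audio content shape are hypothetical, so confirm field names against the API reference before use.

```python
# Hedged sketch: declare a tool for qwen3-omni-flash over HTTP.
# "lookup_schedule" and the input_audio content block are illustrative
# assumptions, not documented values.
def build_request(question_audio_url: str) -> dict:
    return {
        "model": "qwen3-omni-flash",
        "messages": [{
            "role": "user",
            "content": [{"type": "input_audio",
                         "input_audio": {"data": question_audio_url}}],
        }],
        "tools": [{
            "type": "function",
            "function": {
                "name": "lookup_schedule",  # hypothetical tool
                "description": "Query a user's schedule by date.",
                "parameters": {
                    "type": "object",
                    "properties": {"date": {"type": "string"}},
                    "required": ["date"],
                },
            },
        }],
    }
```

The model hears the question, decides whether to call the tool, and your code executes it and returns the result for the spoken answer.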

Thinking mode

Use qwen3-omni-flash (HTTP) when answer quality matters more than latency. The model reasons step-by-step before producing speech — useful for technical support, complex Q&A, or multi-step instructions.
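As a sketch, enabling reasoning might look like the request builder below. The `enable_thinking` flag name is an assumption carried over from other Qwen3 models; verify the exact parameter in the API reference.

```python
# Hedged sketch: request step-by-step reasoning from qwen3-omni-flash
# over HTTP. The "enable_thinking" flag is an assumed parameter name.
def thinking_request(messages: list) -> dict:
    return {
        "model": "qwen3-omni-flash",
        "messages": messages,
        "extra_body": {"enable_thinking": True},  # assumed flag name
        "stream": True,  # reasoning tokens typically stream before the answer
    }
```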

Translation

Both model families can translate speech:
  • Qwen3-Livetranslate — 18 languages + 6 Chinese dialects, ~3-second latency, out of the box. File-based variant accepts video for context-aware accuracy. 7 languages output text only (no audio).
  • Qwen3-Omni — 10 languages + 8 Chinese dialects. Inject terminology and domain context via system prompt for specialized fields. Both realtime and file-based. More engineering effort, more control.
Livetranslate for quick setup; Omni for domain control.
| Language | Qwen3-Livetranslate | Qwen3-Omni |
| --- | --- | --- |
| English | ✓ | ✓ |
| Chinese (Mandarin) | ✓ | ✓ |
| + Cantonese | † | ✓ |
| + Sichuanese | † | ✓ |
| + Shanghainese | † | ✓ |
| + Beijing | † | ✓ |
| + Tianjin | † | ✓ |
| + Nanjing | † | ✓ |
| + Shaanxi | † | ✓ |
| + Hokkien | † | ✓ |
| French | ✓ | ✓ |
| German | ✓ | ✓ |
| Russian | ✓ | ✓ |
| Italian | ✓ | ✓ |
| Spanish | ✓ | ✓ |
| Portuguese | ✓ | ✓ |
| Japanese | ✓ | ✓ |
| Korean | ✓ | ✓ |
| Indonesian | Text only | — |
| Vietnamese | Text only | — |
| Thai | Text only | — |
| Arabic | Text only | — |
| Hindi | Text only | — |
| Greek | Text only | — |
| Turkish | Text only | — |

✓ = audio + text output. "Text only" = text output only (no audio) for that language. — = not supported.
† Qwen3-Livetranslate supports 6 of the 8 listed Chinese dialects; Qwen3-Omni supports all 8. See the model reference for the exact list.
Legacy qwen-omni-turbo supports Chinese and English only.
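The terminology injection that Qwen3-Omni supports can be sketched as a system-prompt builder. The prompt wording and glossary below are illustrative, not a documented template:

```python
# Hedged sketch: inject domain terminology into an Omni translation
# system prompt. The phrasing and glossary entries are examples only.
def translation_system_prompt(src: str, dst: str, glossary: dict) -> str:
    terms = "\n".join(f"- {k} -> {v}" for k, v in glossary.items())
    return (
        f"You are a simultaneous interpreter. Translate {src} speech "
        f"into {dst} speech. Always use this terminology:\n{terms}"
    )

prompt = translation_system_prompt(
    "English", "Japanese",
    {"latency": "レイテンシ", "function calling": "関数呼び出し"},
)
```

This is the "more engineering effort, more control" trade-off in practice: Livetranslate translates out of the box, while Omni lets you pin domain vocabulary per request.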
| Model | API | Input | Function calling | Built-in tools | Thinking | Batch |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | | | | |
| qwen3-omni-flash | HTTP | Text, audio, image, video | ✓ | | ✓ | |
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | | | | |
| qwen3-livetranslate-flash | HTTP | Audio, video | | | | |

All models

| Model | API | Input | Function calling | Built-in tools | Thinking | Batch |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3-omni-flash-realtime | WebSocket | Text, audio, image, video | | | | |
| qwen3-omni-flash-realtime-2025-12-01 | WebSocket | Text, audio, image, video | | | | |
| qwen3-omni-flash-realtime-2025-09-15 | WebSocket | Text, audio, image, video | | | | |
| qwen3-omni-flash | HTTP | Text, audio, image, video | ✓ | | ✓ | |
| qwen3-omni-flash-2025-12-01 | HTTP | Text, audio, image, video | ✓ | | ✓ | |
| qwen3-omni-flash-2025-09-15 | HTTP | Text, audio, image, video | ✓ | | ✓ | |
| Model | API | Input | Languages |
| --- | --- | --- | --- |
| qwen3-livetranslate-flash-realtime | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash-realtime-2025-09-22 | WebSocket | Audio | 18 |
| qwen3-livetranslate-flash | HTTP | Audio, video | 18 |
| qwen3-livetranslate-flash-2025-12-01 | HTTP | Audio, video | 18 |
These models are no longer updated. Use Qwen3-Omni for new projects.
| Model | Input | API |
| --- | --- | --- |
| qwen2.5-omni-7b | Text, audio, image, video | HTTP |
| qwen-omni-turbo | Text, audio, image, video | HTTP |
| qwen-omni-turbo-latest | Text, audio, image, video | HTTP |
| qwen-omni-turbo-2025-03-26 | Text, audio, image, video | HTTP |
| qwen-omni-turbo-realtime | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-latest | Text, audio | WebSocket |
| qwen-omni-turbo-realtime-2025-05-08 | Text, audio | WebSocket |

Learn more