Skip to main content
Speech-to-text

Speech-to-text models

Choose a model for live captions, file transcription, and more.

You have ten speech-to-text models across three technology families. Two questions narrow the field.
  1. Do you need results as the user speaks, or after the recording ends?
  2. Does your audio contain domain-specific terms?

Real-time or batch?

Real-time

WebSocket. Audio streams in, text streams out. Use for live captions, voice assistants, and meeting transcription. Real-time model IDs end with -realtime.
  • Fun-ASR (fun-asr-realtime) — hotwords, VAD, and dialect support
  • Qwen3-ASR (qwen3-asr-flash-realtime) — emotion recognition alongside transcription
  • Qwen-Omni (qwen3.5-omni-plus-realtime) — multilingual coverage (113 languages/dialects), prompt-based context, and long sessions up to 120 minutes

Batch

Submit a file, poll for results. Handles recordings up to 12 hours and 2 GB. Batch model IDs have no -realtime suffix.
  • Fun-ASR (fun-asr) — speaker diarization, hotwords, and singing recognition
  • Qwen3-ASR (qwen3-asr-flash-filetrans) — emotion recognition
  • Qwen-Omni (qwen3.5-omni-flash) — multilingual, prompt context, and OpenAI-compatible HTTP

Near-real-time workaround

Batch APIs accept short clips. Submit 5-second chunks for near-real-time results without WebSocket. But real WebSocket avoids reconnection overhead. If latency matters, use a real-time model.

Handling domain terminology

Two approaches, ranked by flexibility:
  1. Prompt-based context (Qwen-Omni) — Describe your domain in the system prompt. No pre-configuration needed. The model adapts on every request. Trade-off: higher per-request latency than dedicated ASR.
  2. Hotwords (Fun-ASR) — Provide a vocabulary list with weights. Best for stable term lists that rarely change.
Qwen-Omni is not traditional ASR. It is an LLM that understands audio. You inject context through the prompt. It adapts without a hotword list.

Speaker diarization

Only fun-asr (batch) supports speaker diarization. If you need to know who said what, this is it.

Emotion and sentiment

qwen3-asr-flash-realtime, qwen3-asr-flash-filetrans, and Qwen-Omni models detect emotion alongside transcription. Traditional ASR models do not.
ModelModeAccuracy boostEmotionSpeaker diarizationLanguages
fun-asr-realtimeReal-timeHotwords----zh, en, ja, dialects
fun-asrBatchHotwords--30 languages
qwen3-asr-flash-realtimeReal-time----26 languages
qwen3.5-omni-plus-realtimeReal-timePrompt context--113 languages/dialects
qwen3.5-omni-flashBatchPrompt context--113 languages/dialects

All models

ModelModeAPIAccuracy boostEmotionSpeaker diarizationLanguagesMax duration
fun-asr-realtimeReal-timeWebSocketHotwordszh, en, ja, dialectsStreaming
fun-asrBatchAsync RESTHotwords30 languages12h / 2GB
ModelModeAPIAccuracy boostEmotionSpeaker diarizationLanguagesMax duration
qwen3-asr-flash-realtimeReal-timeWebSocket26 languagesStreaming
qwen3-asr-flash-filetransBatchAsync REST26 languages
ModelModeAPIAccuracy boostEmotionSpeaker diarizationLanguagesMax duration
qwen3.5-omni-plusBatchHTTP (OpenAI)Prompt context113 languages/dialectsPer-request limit
qwen3.5-omni-flashBatchHTTP (OpenAI)Prompt context113 languages/dialectsPer-request limit
qwen3.5-omni-plus-realtimeReal-timeWebSocketPrompt context113 languages/dialects120 min
qwen3.5-omni-flash-realtimeReal-timeWebSocketPrompt context113 languages/dialects120 min
qwen3-omni-flashBatchHTTP (OpenAI)Prompt context19 languages/dialectsPer-request limit
qwen3-omni-flash-realtimeReal-timeWebSocketPrompt context19 languages/dialects120 min
All models support WAV, MP3, AAC, and more.
Need speech translated to a different language? See Speech-to-Speech models for real-time and file-based translation with LiveTranslate and Qwen-Omni.

Learn more

Speech-to-text models - Qwen Cloud