
Speech-to-text models

Choose a model for live captions, file transcription, and more.

You have six speech-to-text models across three technology families. Two questions narrow the field.
  1. Do you need results as the user speaks, or after the recording ends?
  2. Does your audio contain domain-specific terms?
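The two questions above can be sketched as a small selection helper. The mapping is an illustrative simplification of this guide's recommendations, not an official decision rule: stable domain vocabularies favor Fun-ASR hotwords, and Qwen3-ASR serves as the default otherwise.

```python
def pick_model(need_realtime: bool, has_domain_terms: bool) -> str:
    """Map the two triage questions to a model name (illustrative only).

    Domain terms -> Fun-ASR, which accepts a weighted hotword list.
    Otherwise   -> Qwen3-ASR as a general-purpose default.
    """
    if has_domain_terms:
        return "fun-asr-realtime" if need_realtime else "fun-asr"
    return "qwen3-asr-flash-realtime" if need_realtime else "qwen3-asr-flash-filetrans"
```

Prompt-based context with Qwen-Omni (covered below) is a third option when the terminology changes per request.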

Real-time or batch?

Real-time

WebSocket. Audio streams in, text streams out. Use for live captions, voice assistants, and meeting transcription.
| Model | Family | Key strength |
| --- | --- | --- |
| fun-asr-realtime | Fun-ASR | Hotwords, VAD, dialect support |
| qwen3-asr-flash-realtime | Qwen3-ASR | Emotion recognition |
| qwen3-omni-flash-realtime | Qwen-Omni | Multimodal streaming, prompt context, 120-min sessions |

Batch

Submit a file. Poll for results. Handles recordings up to 12 hours and 2 GB. Use for call center archives, podcasts, and interviews.
| Model | Family | Key strength |
| --- | --- | --- |
| fun-asr | Fun-ASR | Speaker diarization, hotwords, singing recognition |
| qwen3-asr-flash-filetrans | Qwen3-ASR | Emotion recognition |
| qwen3-omni-flash | Qwen-Omni | Prompt-based context, multimodal, OpenAI-compatible HTTP |
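The submit-then-poll pattern is the same regardless of model. A minimal, generic poll loop is sketched below; `get_status` stands in for whichever task-query call your SDK exposes, and the `status`/`result` field names are assumptions, not the exact wire format.

```python
import time

def wait_for_transcript(get_status, poll_interval_s: float = 5.0,
                        timeout_s: float = 3600.0):
    """Poll an async batch-transcription job until it finishes.

    `get_status` is any callable returning a dict shaped like
    {"status": "PENDING" | "SUCCEEDED" | "FAILED", "result": ...}.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = get_status()
        if job["status"] == "SUCCEEDED":
            return job["result"]
        if job["status"] == "FAILED":
            raise RuntimeError("transcription failed")
        time.sleep(poll_interval_s)
    raise TimeoutError("transcription did not finish in time")
```

A multi-hour recording can take a while to process, so a generous timeout and a poll interval of several seconds keep request volume reasonable.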

Near-real-time workaround

Batch APIs accept short clips. Submit 5-second chunks for near-real-time results without a WebSocket connection. But a true WebSocket stream avoids per-chunk connection overhead. If latency matters, use a real-time model.
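Splitting audio into 5-second chunks is simple arithmetic on the raw byte stream. A sketch for mono PCM, assuming 16 kHz / 16-bit audio (adjust the parameters to your format):

```python
def chunk_pcm(audio: bytes, sample_rate: int = 16000, sample_width: int = 2,
              chunk_seconds: float = 5.0) -> list[bytes]:
    """Split raw mono PCM audio into fixed-length chunks for batch submission.

    Each chunk covers `chunk_seconds` of audio; the last chunk may be shorter.
    """
    bytes_per_chunk = int(sample_rate * sample_width * chunk_seconds)
    return [audio[i:i + bytes_per_chunk]
            for i in range(0, len(audio), bytes_per_chunk)]
```

Each chunk would then be submitted as its own batch job, with results stitched together in order.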

Handling domain terminology

Two approaches, ranked by flexibility:
  1. Prompt-based context (Qwen-Omni) — Describe your domain in the system prompt. No pre-configuration needed. The model adapts on every request. Trade-off: higher per-request latency than dedicated ASR.
  2. Hotwords (Fun-ASR) — Provide a vocabulary list with weights. Best for stable term lists that rarely change.
Qwen-Omni is not traditional ASR. It is an LLM that understands audio. You inject context through the prompt. It adapts without a hotword list.
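The two approaches differ in shape as well as flexibility. The snippets below sketch both payloads; the field names (`vocabulary`, `text`, `weight`) are illustrative, not the exact wire format of either API.

```python
def fun_asr_hotwords(terms: dict[str, int]) -> dict:
    """Build a weighted hotword list for a Fun-ASR-style request (hypothetical schema)."""
    return {"vocabulary": [{"text": t, "weight": w} for t, w in terms.items()]}

def omni_prompt_context(domain: str) -> list[dict]:
    """Inject domain context via the system prompt for an LLM-based model like Qwen-Omni."""
    return [{"role": "system",
             "content": f"You transcribe audio about {domain}. Prefer its terminology."}]
```

Note the operational difference: the hotword list must be maintained as terms change, while the prompt is rewritten freely on every request.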

Speaker diarization

Only fun-asr (batch) supports speaker diarization. If you need to know who said what, this is it.
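Diarized output typically arrives as speaker-labeled segments. A small helper for regrouping them per speaker, assuming an illustrative segment schema (`speaker`, `text`) rather than the exact Fun-ASR response format:

```python
def lines_by_speaker(segments: list[dict]) -> dict:
    """Group diarized segments into one transcript string per speaker.

    `segments` is assumed shaped like [{"speaker": 0, "text": "..."}, ...],
    ordered by time; the real response schema may differ.
    """
    grouped: dict = {}
    for seg in segments:
        grouped.setdefault(seg["speaker"], []).append(seg["text"])
    return {spk: " ".join(texts) for spk, texts in grouped.items()}
```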

Emotion and sentiment

qwen3-asr-flash-realtime, qwen3-asr-flash-filetrans, and Qwen-Omni models detect emotion alongside transcription. Traditional ASR models do not.

Full comparison

| Model | Mode | API | Accuracy boost | Emotion | Speaker diarization | Languages | Max duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| fun-asr-realtime | Real-time | WebSocket | Hotwords | — | — | zh, en, ja, dialects | Streaming |
| fun-asr | Batch | Async REST | Hotwords | — | ✓ | zh, en, ja, ko, de, fr, ru | 12h / 2GB |
| qwen3-asr-flash-realtime | Real-time | WebSocket | — | ✓ | — | 26 languages | Streaming |
| qwen3-asr-flash-filetrans | Batch | Async REST | — | ✓ | — | 26 languages | |
| qwen3-omni-flash | Batch | HTTP (OpenAI) | Prompt context | ✓ | — | zh, en, ja, ko, de, fr, it, es, pt, ru; Chinese dialects: Sichuan, Shanghainese, Cantonese, Minnan, Shaanxi, Nanjing, Tianjin, Beijing | Per-request limit |
| qwen3-omni-flash-realtime | Real-time | WebSocket | Prompt context | ✓ | — | zh, en, ja, ko, de, fr, it, es, pt, ru; Chinese dialects: Sichuan, Shanghainese, Cantonese, Minnan, Shaanxi, Nanjing, Tianjin, Beijing | 120 min |
All models support WAV, MP3, AAC, and more.
Need speech translated to a different language? See Speech-to-Speech models for real-time and file-based translation with LiveTranslate and Qwen-Omni.
