Choose a model for live captions, file transcription, and more.
You have six speech-to-text models across three technology families. Two questions narrow the field.
- Do you need results as the user speaks, or after the recording ends?
- Does your audio contain domain-specific terms?
## Real-time or batch?

### Real-time
WebSocket. Audio streams in, text streams out. Use for live captions, voice assistants, and meeting transcription.
| Model | Family | Key strength |
|---|---|---|
| fun-asr-realtime | Fun-ASR | Hotwords, VAD, dialect support |
| qwen3-asr-flash-realtime | Qwen3-ASR | Emotion recognition |
| qwen3-omni-flash-realtime | Qwen-Omni | Multimodal streaming, prompt context, 120-min sessions |
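The streaming loop is the same shape for all three real-time models: open a WebSocket, send fixed-size PCM frames paced at real time, and read transcript messages as they arrive. A minimal sketch follows; the endpoint URL, the end-of-stream message, and the response schema are placeholders, not the actual protocol of any model above — check the provider's WebSocket documentation for the real message format.

```python
import asyncio
import json

WS_URL = "wss://example.com/v1/asr/realtime"  # hypothetical endpoint

def frame_bytes(sample_rate_hz: int, frame_ms: int, bytes_per_sample: int = 2) -> int:
    """Bytes in one frame of 16-bit mono PCM at the given sample rate."""
    return sample_rate_hz * frame_ms // 1000 * bytes_per_sample

async def stream_audio(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 100):
    import websockets  # pip install websockets
    size = frame_bytes(sample_rate, frame_ms)
    async with websockets.connect(WS_URL) as ws:
        for i in range(0, len(pcm), size):
            await ws.send(pcm[i:i + size])        # binary audio frame
            await asyncio.sleep(frame_ms / 1000)  # pace at real time
        await ws.send(json.dumps({"action": "finish"}))  # hypothetical end signal
        async for msg in ws:
            print(json.loads(msg))  # partial and final transcripts
```

At 16 kHz, 16-bit mono, a 100 ms frame is 3,200 bytes; pacing frames at wall-clock speed keeps the server's VAD and partial results aligned with the live audio.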
### Batch
Submit a file. Poll for results. Handles recordings up to 12 hours and 2 GB. Use for call center archives, podcasts, and interviews.
| Model | Family | Key strength |
|---|---|---|
| fun-asr | Fun-ASR | Speaker diarization, hotwords, singing recognition |
| qwen3-asr-flash-filetrans | Qwen3-ASR | Emotion recognition |
| qwen3-omni-flash | Qwen-Omni | Prompt-based context, multimodal, OpenAI-compatible HTTP |
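The submit-then-poll pattern behind every batch API can be sketched as below. The URLs, response fields, and state names are hypothetical stand-ins for whichever async REST API you call; only the polling structure carries over.

```python
import time

SUBMIT_URL = "https://example.com/v1/asr/tasks"  # hypothetical endpoint

def backoff_schedule(base_s: float = 2.0, factor: float = 2.0,
                     cap_s: float = 60.0, attempts: int = 6) -> list:
    """Exponential poll intervals, capped at cap_s seconds."""
    return [min(base_s * factor ** i, cap_s) for i in range(attempts)]

def transcribe(file_url: str, api_key: str) -> dict:
    import requests  # pip install requests
    headers = {"Authorization": f"Bearer {api_key}"}
    # Submit the file by URL, then poll the returned task id.
    task = requests.post(SUBMIT_URL, json={"file_url": file_url},
                         headers=headers).json()
    for delay in backoff_schedule():
        time.sleep(delay)
        status = requests.get(f"{SUBMIT_URL}/{task['task_id']}",
                              headers=headers).json()
        if status["state"] in ("succeeded", "failed"):  # assumed state names
            return status
    raise TimeoutError("transcription did not finish in time")
```

Exponential backoff matters for 12-hour recordings: a fixed 2-second poll would hammer the API for hours, while the capped schedule settles at one request per minute.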
### Near-real-time workaround
Batch APIs accept short clips. Submitting 5-second chunks gives near-real-time results without a WebSocket, but true streaming avoids the per-chunk connection and request overhead. If latency matters, use a real-time model.
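The chunking step of this workaround is simple byte arithmetic on raw PCM; submitting each chunk to the batch API is omitted here. A sketch, assuming 16-bit mono audio:

```python
def split_pcm(pcm: bytes, sample_rate: int = 16000, chunk_s: float = 5.0,
              bytes_per_sample: int = 2) -> list:
    """Split 16-bit mono PCM into fixed-duration chunks for batch submission."""
    size = int(sample_rate * chunk_s) * bytes_per_sample
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```

Note that naive chunking can cut a word in half at each boundary; real implementations split on silence (VAD) instead of fixed intervals, which is one more reason to prefer a real-time model when latency matters.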
## Handling domain terminology
Two approaches, ranked by flexibility:
- Prompt-based context (Qwen-Omni) — Describe your domain in the system prompt. No pre-configuration needed. The model adapts on every request. Trade-off: higher per-request latency than dedicated ASR.
- Hotwords (Fun-ASR) — Provide a vocabulary list with weights. Best for stable term lists that rarely change.
Qwen-Omni is not traditional ASR. It is an LLM that understands audio. You inject context through the prompt. It adapts without a hotword list.
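Prompt-based context injection can be sketched as a request builder for an OpenAI-compatible chat endpoint. The model name matches the table above, and the `input_audio` content part follows the OpenAI chat schema; the system-prompt wording is an illustration, not a documented recipe.

```python
import base64

def build_request(wav_bytes: bytes, domain_hint: str) -> dict:
    """Chat-completions payload that transcribes audio with domain context."""
    return {
        "model": "qwen3-omni-flash",
        "messages": [
            {"role": "system",
             "content": f"Transcribe the audio. Domain context: {domain_hint}"},
            {"role": "user",
             "content": [{
                 "type": "input_audio",
                 "input_audio": {
                     "data": base64.b64encode(wav_bytes).decode("ascii"),
                     "format": "wav",
                 },
             }]},
        ],
    }
```

Because the context lives in the request, it can change per call: pass "cardiology consult" for one file and "quarterly earnings call" for the next, with no vocabulary list to maintain.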
## Speaker diarization
Only fun-asr (batch) supports speaker diarization. If you need to know who said what, this is it.
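Diarized output typically arrives as time-stamped segments labeled by speaker. The field names below (`speaker`, `start`, `text`) are assumptions about the response shape, not fun-asr's documented schema; the grouping logic is what carries over.

```python
from collections import defaultdict

def lines_by_speaker(segments: list) -> dict:
    """Collect each speaker's utterances in chronological order."""
    out = defaultdict(list)
    for seg in sorted(segments, key=lambda s: s["start"]):
        out[seg["speaker"]].append(seg["text"])
    return dict(out)
```

This turns a flat segment list into per-speaker transcripts, which is usually the first post-processing step for call-center QA or interview summaries.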
## Emotion and sentiment
qwen3-asr-flash-realtime, qwen3-asr-flash-filetrans, and Qwen-Omni models detect emotion alongside transcription. Traditional ASR models do not.
## Full comparison
| Model | Mode | API | Accuracy boost | Emotion | Speaker diarization | Languages | Max duration |
|---|---|---|---|---|---|---|---|
| fun-asr-realtime | Real-time | WebSocket | Hotwords | ✗ | ✗ | zh, en, ja, dialects | Streaming |
| fun-asr | Batch | Async REST | Hotwords | ✗ | ✓ | zh, en, ja, ko, de, fr, ru | 12h / 2GB |
| qwen3-asr-flash-realtime | Real-time | WebSocket | — | ✓ | ✗ | 26 languages | Streaming |
| qwen3-asr-flash-filetrans | Batch | Async REST | — | ✓ | ✗ | 26 languages | — |
| qwen3-omni-flash | Batch | HTTP (OpenAI) | Prompt context | ✓ | ✗ | zh, en, ja, ko, de, fr, it, es, pt, ru; Chinese dialects: Sichuan, Shanghainese, Cantonese, Minnan, Shaanxi, Nanjing, Tianjin, Beijing | Per-request limit |
| qwen3-omni-flash-realtime | Real-time | WebSocket | Prompt context | ✓ | ✗ | zh, en, ja, ko, de, fr, it, es, pt, ru; Chinese dialects: Sichuan, Shanghainese, Cantonese, Minnan, Shaanxi, Nanjing, Tianjin, Beijing | 120 min |
All models support WAV, MP3, AAC, and more.
Need speech translated to a different language? See Speech-to-Speech models for real-time and file-based translation with LiveTranslate and Qwen-Omni.