Choose a model for live captions, file transcription, and more.
You have ten speech-to-text models across three technology families. Two questions narrow the field.
WebSocket. Audio streams in, text streams out. Use for live captions, voice assistants, and meeting transcription. Real-time model IDs end with
Submit a file, poll for results. Handles recordings up to 12 hours and 2 GB. Batch model IDs have no
Batch APIs accept short clips. Submit 5-second chunks for near-real-time results without WebSocket. But real WebSocket avoids reconnection overhead. If latency matters, use a real-time model.
Two approaches, ranked by flexibility:
Only
- Do you need results as the user speaks, or after the recording ends?
- Does your audio contain domain-specific terms?
Real-time or batch?
Real-time
WebSocket. Audio streams in, text streams out. Use for live captions, voice assistants, and meeting transcription. Real-time model IDs end with -realtime.
- Fun-ASR (
fun-asr-realtime) — hotwords, VAD, and dialect support - Qwen3-ASR (
qwen3-asr-flash-realtime) — emotion recognition alongside transcription - Qwen-Omni (
qwen3.5-omni-plus-realtime) — multilingual coverage (113 languages/dialects), prompt-based context, and long sessions up to 120 minutes
Batch
Submit a file, poll for results. Handles recordings up to 12 hours and 2 GB. Batch model IDs have no -realtime suffix.
- Fun-ASR (
fun-asr) — speaker diarization, hotwords, and singing recognition - Qwen3-ASR (
qwen3-asr-flash-filetrans) — emotion recognition - Qwen-Omni (
qwen3.5-omni-flash) — multilingual, prompt context, and OpenAI-compatible HTTP
Near-real-time workaround
Batch APIs accept short clips. Submit 5-second chunks for near-real-time results without WebSocket. But real WebSocket avoids reconnection overhead. If latency matters, use a real-time model.
Handling domain terminology
Two approaches, ranked by flexibility:
- Prompt-based context (Qwen-Omni) — Describe your domain in the system prompt. No pre-configuration needed. The model adapts on every request. Trade-off: higher per-request latency than dedicated ASR.
- Hotwords (Fun-ASR) — Provide a vocabulary list with weights. Best for stable term lists that rarely change.
Qwen-Omni is not traditional ASR. It is an LLM that understands audio. You inject context through the prompt. It adapts without a hotword list.
Speaker diarization
Only fun-asr (batch) supports speaker diarization. If you need to know who said what, this is it.
Emotion and sentiment
qwen3-asr-flash-realtime, qwen3-asr-flash-filetrans, and Qwen-Omni models detect emotion alongside transcription. Traditional ASR models do not.
Recommended models
| Model | Mode | Accuracy boost | Emotion | Speaker diarization | Languages |
|---|---|---|---|---|---|
fun-asr-realtime | Real-time | Hotwords | -- | -- | zh, en, ja, dialects |
fun-asr | Batch | Hotwords | -- | ✓ | 30 languages |
qwen3-asr-flash-realtime | Real-time | -- | ✓ | -- | 26 languages |
qwen3.5-omni-plus-realtime | Real-time | Prompt context | ✓ | -- | 113 languages/dialects |
qwen3.5-omni-flash | Batch | Prompt context | ✓ | -- | 113 languages/dialects |
All models
Fun-ASR
Fun-ASR
| Model | Mode | API | Accuracy boost | Emotion | Speaker diarization | Languages | Max duration |
|---|---|---|---|---|---|---|---|
fun-asr-realtime | Real-time | WebSocket | Hotwords | ✗ | ✗ | zh, en, ja, dialects | Streaming |
fun-asr | Batch | Async REST | Hotwords | ✗ | ✓ | 30 languages | 12h / 2GB |
Qwen3-ASR
Qwen3-ASR
| Model | Mode | API | Accuracy boost | Emotion | Speaker diarization | Languages | Max duration |
|---|---|---|---|---|---|---|---|
qwen3-asr-flash-realtime | Real-time | WebSocket | — | ✓ | ✗ | 26 languages | Streaming |
qwen3-asr-flash-filetrans | Batch | Async REST | — | ✓ | ✗ | 26 languages | — |
Qwen-Omni
Qwen-Omni
| Model | Mode | API | Accuracy boost | Emotion | Speaker diarization | Languages | Max duration |
|---|---|---|---|---|---|---|---|
qwen3.5-omni-plus | Batch | HTTP (OpenAI) | Prompt context | ✓ | ✗ | 113 languages/dialects | Per-request limit |
qwen3.5-omni-flash | Batch | HTTP (OpenAI) | Prompt context | ✓ | ✗ | 113 languages/dialects | Per-request limit |
qwen3.5-omni-plus-realtime | Real-time | WebSocket | Prompt context | ✓ | ✗ | 113 languages/dialects | 120 min |
qwen3.5-omni-flash-realtime | Real-time | WebSocket | Prompt context | ✓ | ✗ | 113 languages/dialects | 120 min |
qwen3-omni-flash | Batch | HTTP (OpenAI) | Prompt context | ✓ | ✗ | 19 languages/dialects | Per-request limit |
qwen3-omni-flash-realtime | Real-time | WebSocket | Prompt context | ✓ | ✗ | 19 languages/dialects | 120 min |
All models support WAV, MP3, AAC, and more.
Need speech translated to a different language? See Speech-to-Speech models for real-time and file-based translation with LiveTranslate and Qwen-Omni.