Choose a model for live captions, file transcription, and more.
You have ten speech-to-text models across three technology families. Two questions narrow the field.
Replacing Whisper, Deepgram, or Google speech recognition? Use these Qwen Cloud equivalents.
WebSocket. Audio streams in, text streams out. Use for live captions, voice assistants, and meeting transcription. Real-time model IDs end with
Submit a file, poll for results. Handles recordings up to 12 hours and 2 GB. Batch model IDs have no
Batch APIs accept short clips. Submit 5-second chunks for near-real-time results without WebSocket. But real WebSocket avoids reconnection overhead. If latency matters, use a real-time model.
Two approaches, ranked by flexibility:
Only
- Do you need results as the user speaks, or after the recording ends?
- Does your audio contain domain-specific terms?
Migrate from closed-source models
Replacing Whisper, Deepgram, or Google speech recognition? Use these Qwen Cloud equivalents.
| Use case | Closed-source examples | Qwen Cloud recommendation |
|---|---|---|
| Real-time recognition | Deepgram Nova-3, Google Chirp 3 | fun-asr-realtime, qwen3.5-omni-plus-realtime |
| Offline / file transcription | OpenAI gpt-4o-transcribe, Whisper | fun-asr, qwen3.5-omni-plus |
Real-time or batch?
Real-time
WebSocket. Audio streams in, text streams out. Use for live captions, voice assistants, and meeting transcription. Real-time model IDs end with -realtime.
- Fun-ASR (
fun-asr-realtime) — hotwords, VAD, and dialect support - Qwen3-ASR (
qwen3-asr-flash-realtime) — emotion recognition alongside transcription - Qwen-Omni (
qwen3.5-omni-plus-realtime) — multilingual coverage (113 languages/dialects), prompt-based context, and long sessions up to 120 minutes
Batch
Submit a file, poll for results. Handles recordings up to 12 hours and 2 GB. Batch model IDs have no -realtime suffix.
- Fun-ASR (
fun-asr) — speaker diarization, hotwords, and singing recognition - Qwen3-ASR (
qwen3-asr-flash-filetrans) — emotion recognition - Qwen-Omni (
qwen3.5-omni-flash) — multilingual, prompt context, and OpenAI-compatible HTTP
Near-real-time workaround
Batch APIs accept short clips. Submit 5-second chunks for near-real-time results without WebSocket. But real WebSocket avoids reconnection overhead. If latency matters, use a real-time model.
Handling domain terminology
Two approaches, ranked by flexibility:
- Prompt-based context (Qwen-Omni) — Describe your domain in the system prompt. No pre-configuration needed. The model adapts on every request. Trade-off: higher per-request latency than dedicated ASR.
- Hotwords (Fun-ASR) — Provide a vocabulary list with weights. Best for stable term lists that rarely change.
Qwen-Omni is not traditional ASR. It is an LLM that understands audio. You inject context through the prompt. It adapts without a hotword list.
Speaker diarization
Only fun-asr (batch) supports speaker diarization. If you need to know who said what, this is it.
Emotion and sentiment
qwen3-asr-flash-realtime, qwen3-asr-flash-filetrans, and Qwen-Omni models detect emotion alongside transcription. Traditional ASR models do not.
Recommended models
| Model | Mode | Accuracy boost | Emotion | Speaker diarization | Languages |
|---|---|---|---|---|---|
fun-asr-realtime | Real-time | Hotwords | -- | -- | zh, en, ja, dialects |
fun-asr | Batch | Hotwords | -- | ✓ | 30 languages |
qwen3-asr-flash-realtime | Real-time | -- | ✓ | -- | 26 languages |
qwen3.5-omni-plus-realtime | Real-time | Prompt context | ✓ | -- | 113 languages/dialects |
qwen3.5-omni-flash | Batch | Prompt context | ✓ | -- | 113 languages/dialects |
All models
Fun-ASR
Fun-ASR
| Model | Mode | API | Accuracy boost | Emotion | Speaker diarization | Languages | Max duration |
|---|---|---|---|---|---|---|---|
fun-asr-realtime | Real-time | WebSocket | Hotwords | ✗ | ✗ | zh, en, ja, dialects | Streaming |
fun-asr | Batch | Async REST | Hotwords | ✗ | ✓ | 30 languages | 12h / 2GB |
fun-asr-flash-2026-06-15 | Batch | HTTP sync | Prompt context | ✗ | ✗ | zh, en, ja, ko and 37 more | 5 min / 2GB |
Qwen3-ASR
Qwen3-ASR
| Model | Mode | API | Accuracy boost | Emotion | Speaker diarization | Languages | Max duration |
|---|---|---|---|---|---|---|---|
qwen3-asr-flash-realtime | Real-time | WebSocket | — | ✓ | ✗ | 26 languages | Streaming |
qwen3-asr-flash-filetrans | Batch | Async REST | — | ✓ | ✗ | 26 languages | — |
Qwen-Omni
Qwen-Omni
| Model | Mode | API | Accuracy boost | Emotion | Speaker diarization | Languages | Max duration |
|---|---|---|---|---|---|---|---|
qwen3.5-omni-plus | Batch | HTTP (OpenAI) | Prompt context | ✓ | ✗ | 113 languages/dialects | Per-request limit |
qwen3.5-omni-flash | Batch | HTTP (OpenAI) | Prompt context | ✓ | ✗ | 113 languages/dialects | Per-request limit |
qwen3.5-omni-plus-realtime | Real-time | WebSocket | Prompt context | ✓ | ✗ | 113 languages/dialects | 120 min |
qwen3.5-omni-flash-realtime | Real-time | WebSocket | Prompt context | ✓ | ✗ | 113 languages/dialects | 120 min |
qwen3-omni-flash | Batch | HTTP (OpenAI) | Prompt context | ✓ | ✗ | 19 languages/dialects | Per-request limit |
qwen3-omni-flash-realtime | Real-time | WebSocket | Prompt context | ✓ | ✗ | 19 languages/dialects | 120 min |
All models support WAV, MP3, AAC, and more.
Need speech translated to a different language? See Speech-to-Speech models for real-time and file-based translation with LiveTranslate and Qwen-Omni.