
Speech-to-text models

Choose a model for live captions, file transcription, and more.

You have six speech-to-text models across three technology families. Two questions narrow the field.
  1. Do you need results as the user speaks, or after the recording ends?
  2. Does your audio contain domain-specific terms?
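The two questions above can be sketched as a small selection helper. The mapping is an illustrative simplification of this guide's recommendations, not an official decision rule: stable domain vocabularies favor Fun-ASR hotwords, and Qwen3-ASR serves as the default otherwise.

```python
def pick_model(need_realtime: bool, has_domain_terms: bool) -> str:
    """Map the two triage questions to a model name (illustrative only).

    Domain terms -> Fun-ASR, which accepts a weighted hotword list.
    Otherwise   -> Qwen3-ASR as a general-purpose default.
    """
    if has_domain_terms:
        return "fun-asr-realtime" if need_realtime else "fun-asr"
    return "qwen3-asr-flash-realtime" if need_realtime else "qwen3-asr-flash-filetrans"
```

Prompt-based context with Qwen-Omni (covered below) is a third option when the terminology changes per request.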

Real-time or batch?

Real-time

WebSocket. Audio streams in, text streams out. Use for live captions, voice assistants, and meeting transcription.
| Model | Family | Key strength |
| --- | --- | --- |
| fun-asr-realtime | Fun-ASR | Hotwords, VAD, dialect support |
| qwen3-asr-flash-realtime | Qwen3-ASR | Emotion recognition |
| qwen3-omni-flash-realtime | Qwen-Omni | Multimodal streaming, prompt context, 120-min sessions |

Batch

Submit a file. Poll for results. Handles recordings up to 12 hours and 2 GB. Use for call center archives, podcasts, and interviews.
| Model | Family | Key strength |
| --- | --- | --- |
| fun-asr | Fun-ASR | Speaker diarization, hotwords, singing recognition |
| qwen3-asr-flash-filetrans | Qwen3-ASR | Emotion recognition |
| qwen3-omni-flash | Qwen-Omni | Prompt-based context, multimodal, OpenAI-compatible HTTP |
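The submit-then-poll pattern is the same regardless of model. A minimal, generic poll loop is sketched below; `get_status` stands in for whichever task-query call your SDK exposes, and the `status`/`result` field names are assumptions, not the exact wire format.

```python
import time

def wait_for_transcript(get_status, poll_interval_s: float = 5.0,
                        timeout_s: float = 3600.0):
    """Poll an async batch-transcription job until it finishes.

    `get_status` is any callable returning a dict shaped like
    {"status": "PENDING" | "SUCCEEDED" | "FAILED", "result": ...}.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = get_status()
        if job["status"] == "SUCCEEDED":
            return job["result"]
        if job["status"] == "FAILED":
            raise RuntimeError("transcription failed")
        time.sleep(poll_interval_s)
    raise TimeoutError("transcription did not finish in time")
```

A multi-hour recording can take a while to process, so a generous timeout and a poll interval of several seconds keep request volume reasonable.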

Near-real-time workaround

Batch APIs accept short clips. Submit 5-second chunks for near-real-time results without a WebSocket connection. But a true WebSocket stream avoids per-chunk connection overhead. If latency matters, use a real-time model.
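Splitting audio into 5-second chunks is simple arithmetic on the raw byte stream. A sketch for mono PCM, assuming 16 kHz / 16-bit audio (adjust the parameters to your format):

```python
def chunk_pcm(audio: bytes, sample_rate: int = 16000, sample_width: int = 2,
              chunk_seconds: float = 5.0) -> list[bytes]:
    """Split raw mono PCM audio into fixed-length chunks for batch submission.

    Each chunk covers `chunk_seconds` of audio; the last chunk may be shorter.
    """
    bytes_per_chunk = int(sample_rate * sample_width * chunk_seconds)
    return [audio[i:i + bytes_per_chunk]
            for i in range(0, len(audio), bytes_per_chunk)]
```

Each chunk would then be submitted as its own batch job, with results stitched together in order.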

Handling domain terminology

Two approaches, ranked by flexibility:
  1. Prompt-based context (Qwen-Omni) — Describe your domain in the system prompt. No pre-configuration needed. The model adapts on every request. Trade-off: higher per-request latency than dedicated ASR.
  2. Hotwords (Fun-ASR) — Provide a vocabulary list with weights. Best for stable term lists that rarely change.
Qwen-Omni is not traditional ASR. It is an LLM that understands audio. You inject context through the prompt. It adapts without a hotword list.
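The two approaches differ in shape as well as flexibility. The snippets below sketch both payloads; the field names (`vocabulary`, `text`, `weight`) are illustrative, not the exact wire format of either API.

```python
def fun_asr_hotwords(terms: dict[str, int]) -> dict:
    """Build a weighted hotword list for a Fun-ASR-style request (hypothetical schema)."""
    return {"vocabulary": [{"text": t, "weight": w} for t, w in terms.items()]}

def omni_prompt_context(domain: str) -> list[dict]:
    """Inject domain context via the system prompt for an LLM-based model like Qwen-Omni."""
    return [{"role": "system",
             "content": f"You transcribe audio about {domain}. Prefer its terminology."}]
```

Note the operational difference: the hotword list must be maintained as terms change, while the prompt is rewritten freely on every request.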

Speaker diarization

Only fun-asr (batch) supports speaker diarization. If you need to know who said what, this is it.
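Diarized output typically arrives as speaker-labeled segments. A small helper for regrouping them per speaker, assuming an illustrative segment schema (`speaker`, `text`) rather than the exact Fun-ASR response format:

```python
def lines_by_speaker(segments: list[dict]) -> dict:
    """Group diarized segments into one transcript string per speaker.

    `segments` is assumed shaped like [{"speaker": 0, "text": "..."}, ...],
    ordered by time; the real response schema may differ.
    """
    grouped: dict = {}
    for seg in segments:
        grouped.setdefault(seg["speaker"], []).append(seg["text"])
    return {spk: " ".join(texts) for spk, texts in grouped.items()}
```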

Emotion and sentiment

qwen3-asr-flash-realtime, qwen3-asr-flash-filetrans, and Qwen-Omni models detect emotion alongside transcription. Traditional ASR models do not.

Full comparison

| Model | Mode | API | Accuracy boost | Emotion | Speaker diarization | Languages | Max duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| fun-asr-realtime | Real-time | WebSocket | Hotwords | — | — | zh, en, ja, dialects | Streaming |
| fun-asr | Batch | Async REST | Hotwords | — | ✓ | zh, en, ja, ko, de, fr, ru | 12h / 2GB |
| qwen3-asr-flash-realtime | Real-time | WebSocket | — | ✓ | — | 26 languages | Streaming |
| qwen3-asr-flash-filetrans | Batch | Async REST | — | ✓ | — | 26 languages | |
| qwen3-omni-flash | Batch | HTTP (OpenAI) | Prompt context | ✓ | — | zh, en, ja, ko, de, fr, it, es, pt, ru; Chinese dialects: Sichuan, Shanghainese, Cantonese, Minnan, Shaanxi, Nanjing, Tianjin, Beijing | Per-request limit |
| qwen3-omni-flash-realtime | Real-time | WebSocket | Prompt context | ✓ | — | zh, en, ja, ko, de, fr, it, es, pt, ru; Chinese dialects: Sichuan, Shanghainese, Cantonese, Minnan, Shaanxi, Nanjing, Tianjin, Beijing | 120 min |
All models support WAV, MP3, AAC, and more.
Need speech translated to a different language? See Speech-to-Speech models for real-time and file-based translation with LiveTranslate and Qwen-Omni.
