Live speech to text
Convert continuous audio streams into text in real time for scenarios such as live streaming captions, online meetings, voice chats, smart assistants, and intelligent customer service. Real-time speech recognition supports transcription from microphones, meeting recordings, or local audio files with punctuation, timestamps, and custom hotwords.
For model availability, supported languages, and feature comparison, see Speech-to-text models.
Getting started
- Fun-ASR
- Qwen-ASR
For more code samples, see GitHub. Get an API key and set it as an environment variable. To use the SDK, install it.
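A typical setup, assuming the DashScope Python SDK and its `DASHSCOPE_API_KEY` environment variable:

```shell
# Install the DashScope SDK for Python
pip install -U dashscope

# Set the API key as an environment variable (bash/zsh)
export DASHSCOPE_API_KEY="sk-..."
```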
Recognize speech from a microphone
Recognizes speech from a microphone and outputs results in real time. Before running the Python example, run `pip install pyaudio` to install PyAudio, a third-party library for audio capture and playback.
Recognize a local audio file
This feature recognizes and transcribes local audio files. It is ideal for near real-time scenarios with short audio, such as voice chats, voice commands, voice input, and voice search. The audio file used in the examples below is asr_example.wav.
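Streaming a local file usually means reading it in small chunks, as if it were live audio. A minimal sketch of the reading side using only the standard library; `send_audio_frame` is a placeholder for the SDK's actual send method:

```python
import wave

CHUNK_MS = 100  # send 100 ms of audio per frame

def stream_wav(path, send_audio_frame):
    """Read a PCM WAV file and pass it to the recognizer in small chunks."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = wav.getframerate() * CHUNK_MS // 1000
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            send_audio_frame(chunk)  # placeholder for the SDK's send method
```

In a real client you would also sleep roughly `CHUNK_MS` milliseconds between sends to pace the stream like real time.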
Going live
Improve recognition accuracy
- Select a model with the correct sample rate: For 8 kHz telephone audio, use an 8 kHz model directly instead of upsampling the audio to 16 kHz for recognition. This avoids distortion and yields better results.
- Use the custom vocabulary feature: For proprietary nouns, names, and brand names specific to your business, you can configure a custom vocabulary to significantly improve recognition accuracy. For more information, see Customize a vocabulary.
- Optimize input audio quality: Use high-quality microphones whenever possible and ensure a high signal-to-noise ratio (SNR) and an echo-free recording environment. At the application level, you can integrate algorithms such as noise reduction (for example, RNNoise) and acoustic echo cancellation (AEC) to preprocess the audio to obtain a cleaner signal.
- Specify the recognition language: For multilingual models, if you can predetermine the audio language when making a call, it helps the model converge and avoid confusion between similarly pronounced languages, which improves accuracy.
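As a quick check for the first point, you can read the source sample rate before choosing a model. A minimal sketch using the standard `wave` module for PCM WAV files; the model labels are placeholders, not real model names:

```python
import wave

def pick_model_hint(path):
    """Return a rough model hint based on the WAV file's sample rate."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    # 8 kHz telephone audio should go to an 8 kHz model directly,
    # without upsampling to 16 kHz first.
    return ("8k-model" if rate <= 8000 else "16k-model", rate)
```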
Set a fault tolerance policy
- Client-side reconnection: The client should implement an automatic reconnection mechanism to handle network jitter. For the Python SDK, consider the following suggestions:
  - Catch exceptions: Implement the `on_error` method in the `Callback` class. The `dashscope` SDK calls this method when it encounters a network error or other issues.
  - Notify status: When `on_error` is triggered, set a reconnection signal. In Python, you can use `threading.Event`, which is a thread-safe flag.
  - Reconnection loop: Wrap the main logic in a `for` loop (for example, to retry 3 times). When the reconnection signal is detected, interrupt the current recognition, clean up resources, and restart the loop after a few seconds to create a new connection.
- Set a heartbeat to prevent connection loss: To maintain a persistent connection with the server, set the `heartbeat` parameter to `true`. This keeps the connection alive even during long periods of silence in the audio.
- Rate limits: When you call the model interface, take note of the model's rate limit rules.
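The reconnection suggestions above can be sketched as follows. `Callback` and `start_recognition` here are simplified stand-ins for illustration, not the actual `dashscope` classes:

```python
import threading
import time

class Callback:
    """Stand-in for the SDK callback; on_error sets a thread-safe flag."""
    def __init__(self):
        self.reconnect = threading.Event()

    def on_error(self, message):
        print(f"error: {message}, scheduling reconnect")
        self.reconnect.set()

def run_with_retries(start_recognition, max_retries=3, backoff_s=2):
    """Retry loop: restart recognition when the callback signals an error."""
    for attempt in range(max_retries):
        callback = Callback()
        start_recognition(callback)   # blocks until the session ends
        if not callback.reconnect.is_set():
            return True               # finished cleanly
        time.sleep(backoff_s)         # wait before opening a new connection
    return False
```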
Core usage: Context biasing (Qwen-ASR)
By providing context, you can optimize the recognition of domain-specific vocabulary, such as names, places, and product terms.
Length limit: The context content cannot exceed 10,000 tokens.
Usage:
- WebSocket API: Set the `session.input_audio_transcription.corpus.text` parameter in the `session.update` event.
- Python SDK: Set the `corpus_text` parameter.
- Java SDK: Set the `corpusText` parameter.
You can add any of the following content to the context:
- Hotword lists in various separator formats, such as Hotword 1, Hotword 2, Hotword 3, Hotword 4
- Text paragraphs or chapters of any format and length
- Mixed content: Any combination of word lists and paragraphs
- Irrelevant or meaningless text, including garbled text. The feature is highly fault-tolerant and is almost never negatively affected by irrelevant text.
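Over the WebSocket API, this means placing the context in the `session.update` event at `session.input_audio_transcription.corpus.text`. A sketch of the payload as a plain dict; the documented parameter path comes from above, while the `type`/`session` envelope shape is an assumption:

```python
import json

# Context for biasing; must stay under the 10,000-token limit.
hotwords = "Bulge Bracket, Goldman Sachs, Morgan Stanley"

session_update = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {
            "corpus": {"text": hotwords}
        }
    },
}

payload = json.dumps(session_update)  # send this over the WebSocket
```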
| Without context enhancement | With context enhancement |
|---|---|
| Without context enhancement, some investment bank names may be misrecognized. For example, "Bird Rock" should be "Bulge Bracket". Recognition result: "What internal jargon from the investment banking circle do you know? First, the nine major foreign investment banks, Bird Rock, BB..." | With context enhancement, investment bank names are recognized correctly. Recognition result: "What internal jargon from the investment banking circle do you know? First, the nine major foreign investment banks, the Bulge Bracket, BB..." |
- Word lists:
- Word list 1:
- Word list 2:
- Word list 3:
- Natural language:
- Natural language with interference: Some text is irrelevant to the recognition content, such as the names in the example below.
API reference
- Fun-ASR
- Qwen-ASR
Interaction flow (Qwen-ASR-Realtime)
Qwen real-time speech recognition streams audio over WebSocket. Two modes are available: VAD mode (default) and Manual mode.
URL
Replace `<model_name>` with your model name.
Headers
VAD mode (default)
The server detects speech boundaries and segments sentences. The client streams audio, and the server returns results when each sentence ends. Best for conversations and meeting transcription.
Enable: Set `session.turn_detection` in `session.update`.
1. The client sends `input_audio_buffer.append` to add audio to the buffer.
2. The server returns `input_audio_buffer.speech_started` when it detects speech. If the client sends `session.finish` before this event, the server returns `session.finished` and the client must disconnect.
3. The client continues sending `input_audio_buffer.append`.
4. After all audio is sent, the client sends `session.finish` to end the session.
5. The server returns `input_audio_buffer.speech_stopped` when it detects the end of speech.
6. The server returns `input_audio_buffer.committed`.
7. The server returns `conversation.item.created`.
8. The server returns `conversation.item.input_audio_transcription.text` with real-time transcription results.
9. The server returns `conversation.item.input_audio_transcription.completed` with the final transcription result.
10. The server returns `session.finished` when recognition completes. The client must then disconnect.
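On the client side, the VAD-mode flow above reduces to two event types. A minimal sketch of how those messages might be built; only the event names come from the flow above, while the envelope fields (`type`, `audio`) are assumptions:

```python
import base64
import json

def append_event(pcm_chunk: bytes) -> str:
    """input_audio_buffer.append carries base64-encoded audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def finish_event() -> str:
    """session.finish tells the server that no more audio is coming."""
    return json.dumps({"type": "session.finish"})
```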
Manual mode
The client controls sentence segmentation by sending audio for a complete sentence, then sending `input_audio_buffer.commit`. Best when the client knows sentence boundaries, for example voice messages in a chat app.
Enable: Set `session.turn_detection` to `null` in `session.update`.
1. The client sends `input_audio_buffer.append` to add audio to the buffer.
2. The client sends `input_audio_buffer.commit` to create a new user message.
3. The client sends `session.finish` to end the session.
4. The server returns `input_audio_buffer.committed`.
5. The server returns `conversation.item.input_audio_transcription.text` with real-time transcription results.
6. The server returns `conversation.item.input_audio_transcription.completed` with the final transcription result.
7. The server returns `session.finished` when recognition completes. The client must then disconnect.
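The Manual-mode specifics above amount to two small messages: disabling server-side VAD and committing each sentence. A sketch, with the event envelope shape assumed:

```python
import json

# Disable server-side VAD: turn_detection is set to null in session.update.
manual_session = json.dumps({
    "type": "session.update",
    "session": {"turn_detection": None},
})

# After sending all audio for one complete sentence, commit the buffer.
commit = json.dumps({"type": "input_audio_buffer.commit"})
```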
Alternative: Use Qwen-Omni
You can also use Qwen-Omni (`qwen3-omni-flash-realtime`) for real-time speech recognition over WebSocket. Omni is an LLM that understands audio: you provide domain context through the system prompt instead of hotword lists.
When to use Omni for ASR: Clean speech inputs (microphone, voice calls) where you need domain-specific terminology handling via prompt.
When to use dedicated ASR models instead: Noisy or mixed audio (meetings with background music, videos with sound effects), or when you need hotwords, speaker diarization, or timestamps.
Qwen-Omni interprets all audio, not just speech. Music, typing, or ambient noise may produce descriptions instead of transcription. For mixed audio, preprocess with VAD to isolate speech, or use a dedicated ASR model.
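As a rough illustration of such preprocessing, here is a toy energy-threshold gate that drops near-silent 16-bit PCM frames before sending audio. A real deployment would use a proper VAD (for example, WebRTC VAD) rather than this sketch:

```python
import struct

def frame_energy(pcm16: bytes) -> float:
    """Mean absolute amplitude of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def keep_speechlike(frames, threshold=500):
    """Drop frames whose energy falls below the threshold (toy gate)."""
    return [f for f in frames if frame_energy(f) >= threshold]
```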
Qwen-Omni-Realtime uses WebSocket for bidirectional streaming. For the full API and SDK reference, see Realtime conversation.