WebSocket client reference
Events you send to the server during a Qwen-ASR Realtime WebSocket session.
Connection
Connect to the service's WebSocket endpoint and authenticate with request headers. In the endpoint URL and headers, replace {model_name} with a supported model such as qwen3-asr-flash-realtime, and $DASHSCOPE_API_KEY with your API key.
For a feature overview and sample code, see Realtime speech recognition. For server events, see Server events for Qwen-ASR-Realtime.
Event lifecycle
A typical session:
- Open a WebSocket connection.
- Send session.update to set the audio format, language, and VAD options.
- Send input_audio_buffer.append repeatedly to stream audio.
- In Manual mode, send input_audio_buffer.commit to trigger recognition. In VAD mode, the server triggers recognition automatically.
- Send session.finish to end the session. Disconnect after receiving session.finished.
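The ordering of client events in this lifecycle can be sketched as a small planner that lists the event types a client would send for one utterance. This is only an illustration: the function name plan_client_events and the mode labels "vad" and "manual" are assumptions made for this sketch, not part of the API.

```python
from typing import List


def plan_client_events(mode: str, chunk_count: int) -> List[str]:
    """Return the ordered client event types for a single-utterance session.

    `mode` is a hypothetical label ("vad" or "manual") used only in this sketch.
    """
    if mode not in ("vad", "manual"):
        raise ValueError(f"unknown mode: {mode!r}")

    events = ["session.update"]                           # configure first
    events += ["input_audio_buffer.append"] * chunk_count  # stream audio
    if mode == "manual":
        # Only Manual mode commits explicitly; in VAD mode the server
        # decides when to trigger recognition.
        events.append("input_audio_buffer.commit")
    events.append("session.finish")                       # then wait for session.finished
    return events
```

Calling plan_client_events("manual", 2) yields one session.update, two appends, one commit, and one session.finish, in that order.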
session.update
Send this right after connecting to set the audio format, language, and Voice Activity Detection (VAD) options. Defaults apply if omitted.
On success, the server responds with session.updated.
Example
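A minimal sketch of building this event in Python. It assumes the three fields of the event are keyed as type, event_id, and session; those key names, the helper name build_session_update, the event ID format, and the "language" key inside the configuration are all assumptions for illustration and should be checked against the official field list.

```python
import json
import uuid


def build_session_update(session_config: dict) -> str:
    """Serialize a session.update event as a JSON text frame.

    The keys "type", "event_id", and "session" are assumed names.
    """
    event = {
        "type": "session.update",                 # fixed value for this event
        "event_id": f"event_{uuid.uuid4().hex}",  # any unique string; format is arbitrary
        "session": session_config,                # session configuration object
    }
    return json.dumps(event)


# Hypothetical configuration key, shown only to illustrate the shape.
frame = build_session_update({"language": "en"})
```

The resulting string would be sent as a text frame right after the connection opens.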
| Type | Location | Required | Description |
|---|---|---|---|
| string | body | Yes | Fixed value: session.update. |
| string | body | Yes | A unique event ID. |
| object | body | Yes | Session configuration. |
Recommended VAD presets
Start with these presets, then adjust as needed:
| Preset | threshold | silence_duration_ms | Best for |
|---|---|---|---|
| Low latency | 0.0 | 400 | Fast interactions (voice commands, agent assist) where quick responses matter more than handling long pauses |
| Balanced (default) | 0.2 | 800 | General transcription balancing speed and accuracy |
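One way to keep these presets in client code is a lookup table with per-call overrides, sketched here. The values come from the table above; the names VAD_PRESETS and vad_settings are hypothetical, and where these settings sit inside the session configuration is not shown here.

```python
VAD_PRESETS = {
    "low_latency": {"threshold": 0.0, "silence_duration_ms": 400},
    "balanced": {"threshold": 0.2, "silence_duration_ms": 800},  # default
}


def vad_settings(preset: str = "balanced", **overrides) -> dict:
    """Return a copy of a preset with optional overrides applied."""
    if preset not in VAD_PRESETS:
        raise KeyError(f"unknown preset: {preset!r}")
    settings = dict(VAD_PRESETS[preset])  # copy so the preset table is not mutated
    unknown = set(overrides) - set(settings)
    if unknown:
        raise TypeError(f"unknown VAD option(s): {sorted(unknown)}")
    settings.update(overrides)
    return settings
```

For example, vad_settings("low_latency", silence_duration_ms=500) starts from the low-latency preset and lengthens only the silence window.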
Supported languages
| Code | Language |
|---|---|
| zh | Chinese (Mandarin, Sichuanese, Minnan, and Wu) |
| yue | Cantonese |
| en | English |
| ja | Japanese |
| de | German |
| ko | Korean |
| ru | Russian |
| fr | French |
| pt | Portuguese |
| ar | Arabic |
| it | Italian |
| es | Spanish |
| hi | Hindi |
| id | Indonesian |
| th | Thai |
| tr | Turkish |
| uk | Ukrainian |
| vi | Vietnamese |
| cs | Czech |
| da | Danish |
| fil | Filipino |
| fi | Finnish |
| is | Icelandic |
| ms | Malay |
| no | Norwegian |
| pl | Polish |
| sv | Swedish |
input_audio_buffer.append
Stream audio chunks to the server buffer.
Behavior differs by mode:
- VAD mode: The server monitors the buffer for voice activity and triggers recognition automatically.
- Manual mode: You control utterance boundaries. Send smaller chunks for lower latency.
The audio field is Base64-encoded. In Manual mode, the maximum size per event is 15 MiB. The server does not confirm this event.
Example
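A minimal sketch of chunking raw audio into append events. The audio field name comes from the text above; the type and event_id key names, the helper name append_events, and the 3200-byte default chunk (which would be 100 ms of 16 kHz, 16-bit mono PCM) are assumptions. The sketch also assumes the 15 MiB limit applies to the serialized event.

```python
import base64
import json
import uuid
from typing import Iterator

MAX_EVENT_BYTES = 15 * 1024 * 1024  # 15 MiB per event (Manual mode limit)


def append_events(audio: bytes, chunk_bytes: int = 3200) -> Iterator[str]:
    """Yield one input_audio_buffer.append JSON frame per audio chunk."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    for start in range(0, len(audio), chunk_bytes):
        chunk = audio[start:start + chunk_bytes]
        frame = json.dumps({
            "type": "input_audio_buffer.append",      # assumed key name
            "event_id": f"event_{uuid.uuid4().hex}",  # assumed key name
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
        # Base64 inflates the payload by about a third, so check the final size.
        if len(frame.encode("utf-8")) > MAX_EVENT_BYTES:
            raise ValueError("event exceeds 15 MiB")
        yield frame
```

Because the server does not confirm these events, a client would simply send each frame in order without waiting for a reply.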
| Type | Location | Required | Description |
|---|---|---|---|
| string | body | Yes | Fixed value: input_audio_buffer.append. |
| string | body | Yes | A unique event ID. |
| string | body | Yes | Base64-encoded audio data (the audio field). |
input_audio_buffer.commit
Trigger recognition for all audio in the buffer as a single utterance. Use this in Manual mode when you control utterance boundaries (such as push-to-talk where a button press marks speech end).
Not available in VAD mode.
On success, the server responds with input_audio_buffer.committed.
Example
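A minimal sketch of building the commit event, with a guard that reflects the rule that it is not available in VAD mode. The type and event_id key names, the helper name build_commit, and the mode labels "manual" and "vad" are assumptions for illustration.

```python
import json
import uuid


def build_commit(mode: str) -> str:
    """Serialize an input_audio_buffer.commit event for Manual mode.

    `mode` is a hypothetical client-side label ("manual" or "vad").
    """
    if mode != "manual":
        # In VAD mode the server triggers recognition on its own.
        raise RuntimeError("input_audio_buffer.commit is only available in Manual mode")
    return json.dumps({
        "type": "input_audio_buffer.commit",      # assumed key name
        "event_id": f"event_{uuid.uuid4().hex}",  # assumed key name
    })
```

In a push-to-talk client, this frame would be sent when the button is released, after the last append event.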
| Type | Location | Required | Description |
|---|---|---|---|
| string | body | Yes | Fixed value: input_audio_buffer.commit. |
| string | body | Yes | A unique event ID. |
session.finish
End the session. The server response depends on whether speech was detected:
- Speech detected: The server finishes recognition, sends conversation.item.input_audio_transcription.completed with results, then session.finished.
- No speech detected: The server sends session.finished directly.
After receiving session.finished, disconnect the WebSocket.
Example
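A minimal sketch of ending a session: build the session.finish event, then read server messages until session.finished arrives. The type and event_id key names, the assumption that server events also carry a type key, and the helper names build_session_finish and drain_until_finished are all hypothetical.

```python
import json
import uuid
from typing import Iterable, List

COMPLETED = "conversation.item.input_audio_transcription.completed"


def build_session_finish() -> str:
    """Serialize a session.finish event (key names are assumed)."""
    return json.dumps({
        "type": "session.finish",
        "event_id": f"event_{uuid.uuid4().hex}",
    })


def drain_until_finished(messages: Iterable[str]) -> List[dict]:
    """Collect completed-transcription events until session.finished arrives.

    `messages` is any iterable of raw JSON text frames from the server.
    Returns an empty list when no speech was detected.
    """
    completed = []
    for raw in messages:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == COMPLETED:
            completed.append(event)
        elif kind == "session.finished":
            return completed  # safe to disconnect now
    raise ConnectionError("stream ended before session.finished")
```

The caller would close the WebSocket once drain_until_finished returns.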
| Type | Location | Required | Description |
|---|---|---|---|
| string | body | Yes | Fixed value: session.finish. |
| string | body | Yes | A unique event ID. |