
Qwen-ASR client events

WebSocket client reference

Events you send to the server during a Qwen-ASR Realtime WebSocket session.

Connection

Connect to:
wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model={model_name}
Authenticate with these headers:
Authorization: Bearer $DASHSCOPE_API_KEY
OpenAI-Beta: realtime=v1
Replace {model_name} with a supported model like qwen3-asr-flash-realtime and $DASHSCOPE_API_KEY with your API key. For a feature overview and sample code, see Realtime speech recognition. For server events, see Server events for Qwen-ASR-Realtime.
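As a minimal sketch (Python standard library only), the URL and headers above can be assembled like this before handing them to whichever WebSocket client library you use:

```python
import os

# Sketch: assemble the connection URL and auth headers described above.
# Pass these to your WebSocket client of choice.
BASE_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"

def connection_params(model_name: str, api_key: str):
    url = f"{BASE_URL}?model={model_name}"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "realtime=v1",
    }
    return url, headers

url, headers = connection_params("qwen3-asr-flash-realtime",
                                 os.environ.get("DASHSCOPE_API_KEY", ""))
```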

Event lifecycle

A typical session:
  1. Open a WebSocket connection.
  2. Send session.update to set the audio format, language, and VAD options.
  3. Send input_audio_buffer.append repeatedly to stream audio.
  4. In Manual mode, send input_audio_buffer.commit to trigger recognition. In VAD mode, the server triggers recognition automatically.
  5. Send session.finish to end the session. Disconnect after receiving session.finished.
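The steps above can be sketched as an ordered sequence of Manual-mode client payloads. The event IDs and the short silent audio chunk are illustrative, and the actual WebSocket send/receive plumbing is omitted:

```python
import base64
import itertools
import json

# Sketch of the Manual-mode event sequence as ordered JSON payloads.
# Event IDs are illustrative; any unique strings work.
def manual_session_events(audio_chunks):
    counter = itertools.count(1)

    def ev(event_type, **fields):
        return json.dumps({"event_id": f"event_{next(counter)}",
                           "type": event_type, **fields})

    # Step 2: configure the session first.
    yield ev("session.update",
             session={"input_audio_format": "pcm", "sample_rate": 16000})
    # Step 3: stream audio.
    for chunk in audio_chunks:
        yield ev("input_audio_buffer.append",
                 audio=base64.b64encode(chunk).decode("ascii"))
    # Step 4 (Manual mode): mark the end of the utterance.
    yield ev("input_audio_buffer.commit")
    # Step 5: end the session, then wait for session.finished.
    yield ev("session.finish")

events = list(manual_session_events([b"\x00\x00" * 160]))  # one 10 ms silent chunk
```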

session.update

Send this right after connecting to set the audio format, language, and Voice Activity Detection (VAD) options. Defaults apply if omitted. On success, the server responds with session.updated.
Example
{
  "event_id": "event_123",
  "type": "session.update",
  "session": {
    "input_audio_format": "pcm",
    "sample_rate": 16000,
    "input_audio_transcription": {
      "language": "zh"
    },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.0,
      "silence_duration_ms": 400
    }
  }
}
Parameters (names as in the example above):
  • type (string, body, required): Fixed value: session.update.
  • event_id (string, body, required): A unique event ID.
  • session (object, body, required): Session configuration.
Start with these presets, then adjust as needed:
  • Low latency (threshold 0.0, silence_duration_ms 400): fast interactions (voice commands, agent assist) where quick responses matter more than handling long pauses.
  • Balanced, the default (threshold 0.2, silence_duration_ms 800): general transcription balancing speed and accuracy.
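As one way to apply a preset, here is a small builder for the payload. The field names mirror the example earlier in this section; the helper name and its defaults (the Low latency preset) are ours:

```python
import json

# Sketch: build a session.update payload from one of the presets above.
# Defaults correspond to the Low latency preset (0.0 / 400 ms).
def build_session_update(event_id, language="zh",
                         threshold=0.0, silence_duration_ms=400):
    return json.dumps({
        "event_id": event_id,
        "type": "session.update",
        "session": {
            "input_audio_format": "pcm",
            "sample_rate": 16000,
            "input_audio_transcription": {"language": language},
            "turn_detection": {
                "type": "server_vad",
                "threshold": threshold,
                "silence_duration_ms": silence_duration_ms,
            },
        },
    })
```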

Supported languages

Code   Language
zh     Chinese (Mandarin, Sichuanese, Minnan, and Wu)
yue    Cantonese
en     English
ja     Japanese
de     German
ko     Korean
ru     Russian
fr     French
pt     Portuguese
ar     Arabic
it     Italian
es     Spanish
hi     Hindi
id     Indonesian
th     Thai
tr     Turkish
uk     Ukrainian
vi     Vietnamese
cs     Czech
da     Danish
fil    Filipino
fi     Finnish
is     Icelandic
ms     Malay
no     Norwegian
pl     Polish
sv     Swedish

input_audio_buffer.append

Stream audio chunks to the server buffer. Behavior differs by mode:
  • VAD mode: The server monitors the buffer for voice activity and triggers recognition automatically.
  • Manual mode: You control utterance boundaries. Send smaller chunks for lower latency.
The audio field is Base64-encoded. In Manual mode, the maximum size per event is 15 MiB. The server does not send a confirmation for this event.
Example
{
  "event_id": "event_2728",
  "type": "input_audio_buffer.append",
  "audio": "<Base64-encoded-audio-data>"
}
Parameters (names as in the example above):
  • type (string, body, required): Fixed value: input_audio_buffer.append.
  • event_id (string, body, required): A unique event ID.
  • audio (string, body, required): Base64-encoded audio data.
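A sketch of chunking raw PCM into append events. The chunk size is our choice, not a requirement: 3200 bytes is 100 ms of 16 kHz 16-bit mono PCM, far below the 15 MiB Manual-mode cap per event:

```python
import base64
import json

# Sketch: wrap raw PCM bytes in input_audio_buffer.append events,
# one event per fixed-size chunk. Event IDs here are illustrative.
CHUNK_BYTES = 3200  # 100 ms of 16 kHz, 16-bit, mono PCM

def append_events(pcm: bytes):
    for i in range(0, len(pcm), CHUNK_BYTES):
        yield json.dumps({
            "event_id": f"event_append_{i // CHUNK_BYTES}",
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm[i:i + CHUNK_BYTES]).decode("ascii"),
        })
```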

input_audio_buffer.commit

Trigger recognition for all audio in the buffer as a single utterance. Use this in Manual mode when you control utterance boundaries (such as push-to-talk where a button press marks speech end). Not available in VAD mode. On success, the server responds with input_audio_buffer.committed.
Example
{
  "event_id": "event_789",
  "type": "input_audio_buffer.commit"
}
Parameters (names as in the example above):
  • type (string, body, required): Fixed value: input_audio_buffer.commit.
  • event_id (string, body, required): A unique event ID.
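A push-to-talk sketch: when the talk button is released, commit the buffered audio as one utterance. `send` stands in for your WebSocket client's send method; both it and the event ID are assumptions here:

```python
import json

# Sketch: on button release, commit everything appended so far as
# one utterance. `send` is a stand-in for your client's send method.
def on_button_release(send, event_id="event_789"):
    send(json.dumps({"event_id": event_id,
                     "type": "input_audio_buffer.commit"}))

sent = []
on_button_release(sent.append)
```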

session.finish

End the session. The server's response sequence depends on whether speech was detected; in all cases, after you receive session.finished, disconnect the WebSocket.
Example
{
  "event_id": "event_341",
  "type": "session.finish"
}
Parameters (names as in the example above):
  • type (string, body, required): Fixed value: session.finish.
  • event_id (string, body, required): A unique event ID.
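A sketch of the shutdown handshake: send session.finish, then drain server events until session.finished arrives before closing. `send`, `recv`, and `close` stand in for your WebSocket client and are assumptions here:

```python
import json

# Sketch: finish the session and wait for session.finished before
# disconnecting. send/recv/close are stand-ins for your client.
def finish_session(send, recv, close, event_id="event_341"):
    send(json.dumps({"event_id": event_id, "type": "session.finish"}))
    while True:
        if json.loads(recv()).get("type") == "session.finished":
            break
    close()
```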