Qwen-ASR client events

Events you send to the server during a Qwen-ASR Realtime WebSocket session.

Connection

Connect to:

wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model={model_name}

Authenticate with these headers:

Authorization: Bearer $DASHSCOPE_API_KEY
OpenAI-Beta: realtime=v1

Replace {model_name} with a supported model like qwen3-asr-flash-realtime and $DASHSCOPE_API_KEY with your API key. For a feature overview and sample code, see Realtime speech recognition. For server events, see Server events for Qwen-ASR-Realtime.
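As a minimal sketch of the connection details above, the following builds the URL and headers without opening a socket. The helper name build_connection is hypothetical; pass its result to whichever WebSocket client you use.

```python
import os
from urllib.parse import urlencode

# Endpoint taken from the Connection section above.
BASE_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"


def build_connection(model_name: str, api_key: str) -> tuple[str, dict[str, str]]:
    """Return the WebSocket URL and the authentication headers.

    build_connection is an illustrative helper, not part of any SDK.
    """
    url = f"{BASE_URL}?{urlencode({'model': model_name})}"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "realtime=v1",
    }
    return url, headers


# Usage: read the key from the environment rather than hard-coding it.
url, headers = build_connection(
    "qwen3-asr-flash-realtime",
    os.environ.get("DASHSCOPE_API_KEY", "<your-api-key>"),
)
```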

Event lifecycle

A typical session:

1. Open a WebSocket connection.
2. Send session.update to set the audio format, language, and VAD options.
3. Send input_audio_buffer.append repeatedly to stream audio.
4. In Manual mode, send input_audio_buffer.commit to trigger recognition. In VAD mode, the server triggers recognition automatically.
5. Send session.finish to end the session. Disconnect after receiving session.finished.
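The sending side of this lifecycle can be sketched as a generator that yields the JSON messages in order. This is an illustration only: client_events and the event_N ID scheme are hypothetical, and a real client would also read server events (for example, wait for session.finished) between sends.

```python
import base64
import json
from itertools import count
from typing import Iterable, Iterator


def client_events(chunks: Iterable[bytes], manual: bool) -> Iterator[str]:
    """Yield the client events for one session, in the order described above."""
    ids = count(1)

    def event(type_: str, **fields) -> str:
        # Every client event carries a unique event_id and a type.
        return json.dumps({"event_id": f"event_{next(ids)}", "type": type_, **fields})

    # turn_detection = null selects Manual mode; an object enables VAD mode.
    turn_detection = None if manual else {"type": "server_vad"}
    yield event("session.update", session={"turn_detection": turn_detection})

    for chunk in chunks:
        audio = base64.b64encode(chunk).decode("ascii")
        yield event("input_audio_buffer.append", audio=audio)

    if manual:
        # Only Manual mode needs an explicit commit to trigger recognition.
        yield event("input_audio_buffer.commit")

    yield event("session.finish")
```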

session.update

Send this right after connecting to set the audio format, language, and Voice Activity Detection (VAD) options. Defaults apply if omitted. On success, the server responds with session.updated.

Example

{
  "event_id": "event_123",
  "type": "session.update",
  "session": {
    "input_audio_format": "pcm",
    "sample_rate": 16000,
    "input_audio_transcription": {
      "language": "zh"
    },
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.0,
      "silence_duration_ms": 400
    }
  }
}

type (string, required). Fixed value: session.update.

event_id (string, required). A unique event ID.

session (object, required). Session configuration. Properties:

session.input_audio_format (string, optional). Audio encoding format. Valid values: pcm, opus. Default: pcm.

session.sample_rate (integer, optional). Sample rate in Hz. Valid values: 16000, 8000. Default: 16000. When set to 8000, the server upsamples to 16,000 Hz, which adds minor latency. Use 8000 only for native 8 kHz sources such as telephony audio.

session.input_audio_transcription (object, optional). Recognition settings. Properties:

session.input_audio_transcription.language (string, optional). Audio language. See Supported languages.

Context text (string, optional, under session.input_audio_transcription). Text for contextual biasing, such as background text, entity lists, or reference material, that improves accuracy. Maximum 10,000 tokens.

session.turn_detection (object, optional). VAD settings. Set to null to disable VAD mode and use Manual mode. If present, VAD mode is enabled. Properties:

session.turn_detection.type (string, required). Fixed value: server_vad.

session.turn_detection.threshold (float, optional). VAD sensitivity. Default: 0.2. Valid range: [-1, 1]. Lower values increase sensitivity but may fire on background noise. Higher values reduce sensitivity and avoid false triggers in noisy settings. See Recommended VAD presets.

session.turn_detection.silence_duration_ms (integer, optional). Silence duration in milliseconds that marks the utterance end. Default: 800. Valid range: [200, 6000]. Shorter values (such as 300 ms) speed up responses but may split mid-sentence pauses. Longer values (such as 1,200 ms) handle pauses better but add latency. See Recommended VAD presets.
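Because the valid values and ranges are fixed, a client can check them before sending. The sketch below builds a session.update event and rejects out-of-range settings locally. build_session_update and its vad flag are hypothetical names, and the UUID-based event ID is just one way to get uniqueness.

```python
import uuid


def build_session_update(
    language: str,
    audio_format: str = "pcm",
    sample_rate: int = 16000,
    vad: bool = True,
    threshold: float = 0.2,
    silence_duration_ms: int = 800,
) -> dict:
    """Build a session.update event, validating against the documented ranges."""
    if audio_format not in ("pcm", "opus"):
        raise ValueError("input_audio_format must be pcm or opus")
    if sample_rate not in (16000, 8000):
        raise ValueError("sample_rate must be 16000 or 8000")
    if not -1 <= threshold <= 1:
        raise ValueError("threshold must be within [-1, 1]")
    if not 200 <= silence_duration_ms <= 6000:
        raise ValueError("silence_duration_ms must be within [200, 6000]")

    # A null turn_detection selects Manual mode; an object enables VAD mode.
    turn_detection = (
        {
            "type": "server_vad",
            "threshold": threshold,
            "silence_duration_ms": silence_duration_ms,
        }
        if vad
        else None
    )
    return {
        "event_id": f"event_{uuid.uuid4().hex}",
        "type": "session.update",
        "session": {
            "input_audio_format": audio_format,
            "sample_rate": sample_rate,
            "input_audio_transcription": {"language": language},
            "turn_detection": turn_detection,
        },
    }
```

Validating on the client gives an immediate, local error instead of a round trip to the server; the server remains the authority on what it accepts.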

Recommended VAD presets

Start with these presets, then adjust as needed:

Preset	threshold	silence_duration_ms	Best for
Low latency	0.0	400	Fast interactions (voice commands, agent assist) where quick responses matter more than handling long pauses
Balanced (default)	0.2	800	General transcription balancing speed and accuracy
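One way to keep these presets in code is a small lookup table that expands into a turn_detection object. The preset keys low_latency and balanced and the helper turn_detection_for are hypothetical names chosen for this sketch; the numbers come from the table above.

```python
# Values copied from the presets table; the dictionary keys are illustrative.
VAD_PRESETS = {
    "low_latency": {"threshold": 0.0, "silence_duration_ms": 400},
    "balanced": {"threshold": 0.2, "silence_duration_ms": 800},
}


def turn_detection_for(preset: str) -> dict:
    """Return a turn_detection object for the named preset."""
    if preset not in VAD_PRESETS:
        raise KeyError(f"unknown VAD preset: {preset}")
    return {"type": "server_vad", **VAD_PRESETS[preset]}
```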

Supported languages

Code	Language
zh	Chinese (Mandarin, Sichuanese, Minnan, and Wu)
yue	Cantonese
en	English
ja	Japanese
de	German
ko	Korean
ru	Russian
fr	French
pt	Portuguese
ar	Arabic
it	Italian
es	Spanish
hi	Hindi
id	Indonesian
th	Thai
tr	Turkish
uk	Ukrainian
vi	Vietnamese
cs	Czech
da	Danish
fil	Filipino
fi	Finnish
is	Icelandic
ms	Malay
no	Norwegian
pl	Polish
sv	Swedish

input_audio_buffer.append

Stream audio chunks to the server buffer. Behavior differs by mode:

VAD mode: The server monitors the buffer for voice activity and triggers recognition automatically.
Manual mode: You control utterance boundaries. Send smaller chunks for lower latency.

The audio field carries Base64-encoded audio. In Manual mode, each event can carry at most 15 MiB. The server sends no acknowledgment for this event.

Example

{
  "event_id": "event_2728",
  "type": "input_audio_buffer.append",
  "audio": "<Base64-encoded-audio-data>"
}

type (string, required). Fixed value: input_audio_buffer.append.

event_id (string, required). A unique event ID.

audio (string, required). Base64-encoded audio data.
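A sketch of splitting raw audio into small append events follows. It assumes 16-bit mono PCM at 16 kHz, so 3,200 bytes is about 100 ms of audio; the sample width, channel count, chunk size, and the name append_events are assumptions of this example, not values stated by the API.

```python
import base64
import uuid
from typing import Iterator

# Assumption: 16-bit mono PCM at 16 kHz, so 16000 * 2 * 0.1 = 3200 bytes per 100 ms.
CHUNK_BYTES = 3200


def append_events(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[dict]:
    """Yield one input_audio_buffer.append event per chunk of raw audio."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield {
            "event_id": f"event_{uuid.uuid4().hex}",
            "type": "input_audio_buffer.append",
            # The audio field must be Base64 text, not raw bytes.
            "audio": base64.b64encode(chunk).decode("ascii"),
        }
```

Serialize each yielded dictionary with json.dumps before sending it over the socket.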

input_audio_buffer.commit

Trigger recognition for all audio in the buffer as a single utterance. Use this in Manual mode when you control utterance boundaries (such as push-to-talk where a button press marks speech end). Not available in VAD mode. On success, the server responds with input_audio_buffer.committed.

Example

{
  "event_id": "event_789",
  "type": "input_audio_buffer.commit"
}

type (string, required). Fixed value: input_audio_buffer.commit.

event_id (string, required). A unique event ID.
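Since commit is not available in VAD mode, a client may want to guard against sending it by mistake. The sketch below assumes the client keeps the session object it last sent in session.update and treats an explicit null turn_detection as Manual mode; is_manual_mode and commit_event are hypothetical helpers.

```python
def is_manual_mode(session: dict) -> bool:
    """True only when turn_detection was explicitly set to null (None)."""
    return "turn_detection" in session and session["turn_detection"] is None


def commit_event(session: dict, event_id: str) -> dict:
    """Build input_audio_buffer.commit, refusing to do so in VAD mode."""
    if not is_manual_mode(session):
        raise RuntimeError("input_audio_buffer.commit is only valid in Manual mode")
    return {"event_id": event_id, "type": "input_audio_buffer.commit"}
```

In a push-to-talk client, commit_event would be called when the button is released, after the last input_audio_buffer.append for that utterance.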

session.finish

End the session. The server response depends on whether speech was detected:

Speech detected: The server finishes recognition, sends conversation.item.input_audio_transcription.completed with results, then session.finished.
No speech detected: The server sends session.finished directly.

After you receive session.finished, disconnect the WebSocket.

Example

{
  "event_id": "event_341",
  "type": "session.finish"
}

type (string, required). Fixed value: session.finish.

event_id (string, required). A unique event ID.
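After sending session.finish, the client keeps reading until session.finished arrives. The sketch below processes an iterable of raw server messages and collects any completed transcription events on the way. It assumes server messages are JSON objects with a type field, as client events are; drain_until_finished is a hypothetical name, and the events are returned whole because their payload fields are described in the server events reference, not here.

```python
import json
from typing import Iterable

COMPLETED = "conversation.item.input_audio_transcription.completed"


def drain_until_finished(messages: Iterable[str]) -> list[dict]:
    """Collect completed-transcription events until session.finished is seen."""
    results = []
    for raw in messages:
        event = json.loads(raw)
        if event.get("type") == COMPLETED:
            results.append(event)
        elif event.get("type") == "session.finished":
            # Safe to disconnect the WebSocket now.
            return results
    # The stream ended without the final event, so report it to the caller.
    raise ConnectionError("stream ended before session.finished")
```

An empty result corresponds to the "No speech detected" case, where the server sends session.finished directly.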
