Qwen-ASR-Realtime WebSocket API

Qwen-ASR-Realtime receives audio streams and transcribes speech in real time over WebSocket. The service supports two interaction modes: VAD mode and Manual mode. User guide: For model overviews and selection guidance, see Speech-to-text models. For sample code, see Realtime speech recognition.

Service endpoint

Use the following WebSocket URL. The model query parameter specifies the model. Replace <model_name> with the model name:

wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=<model_name>

Connections must use the wss:// scheme. Pass the API key in the request headers (see Request headers).
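As a minimal sketch, the URL can be assembled as follows. The helper name `build_realtime_url` and the model name in the usage line are hypothetical placeholders, not part of the service.

```python
from urllib.parse import urlencode

# Base endpoint taken from this section.
BASE_URL = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"


def build_realtime_url(model_name: str) -> str:
    """Return the WebSocket URL with the model passed as a query parameter.

    `build_realtime_url` is a hypothetical helper used for illustration only.
    """
    if not model_name:
        raise ValueError("model_name must be a non-empty string")
    # urlencode escapes any characters that are not URL-safe.
    return f"{BASE_URL}?{urlencode({'model': model_name})}"


# "example-model" is a placeholder; substitute a real model name.
url = build_realtime_url("example-model")
```

Encoding the query string through `urlencode` rather than plain concatenation keeps the URL valid if a model name ever contains reserved characters.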

Request headers

Include the following fields in the request headers:

Parameter	Type	Required	Description
Authorization	string	Yes	Authentication token in the format `Bearer $DASHSCOPE_API_KEY`. Replace `$DASHSCOPE_API_KEY` with your API key.
user-agent	string	No	Client identifier that helps the server track request sources.
X-DashScope-WorkSpace	string	No	Qwen Cloud workspace ID.
X-DashScope-DataInspection	string	No	Whether to enable data inspection. Omit this header unless data inspection is required; if it is, set the value to `enable`.

Authorization is verified during the WebSocket handshake. If the API key is invalid or missing, the handshake fails with an HTTP 401 or 403 error.
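The header table above could be turned into a handshake header set roughly like this. The function name `build_headers` and its parameter names are assumptions made for illustration. Only the header names and the `enable` value come from the table.

```python
from typing import Dict, Optional


def build_headers(
    api_key: str,
    workspace_id: Optional[str] = None,
    user_agent: Optional[str] = None,
    data_inspection: bool = False,
) -> Dict[str, str]:
    """Build handshake headers. Hypothetical helper, not an official client API."""
    if not api_key:
        raise ValueError("api_key is required; the handshake fails without it")
    # Authorization is the only required header.
    headers = {"Authorization": f"Bearer {api_key}"}
    if user_agent:
        headers["user-agent"] = user_agent
    if workspace_id:
        headers["X-DashScope-WorkSpace"] = workspace_id
    # Omit the data inspection header entirely unless it is needed.
    if data_inspection:
        headers["X-DashScope-DataInspection"] = "enable"
    return headers
```

In practice the key would typically be read from an environment variable such as `DASHSCOPE_API_KEY` and the resulting dictionary passed to whichever WebSocket client library is in use.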

Interaction flows

For details about client and server events, see Client events and Server events. Qwen-ASR-Realtime supports two interaction modes:

VAD mode (default): The server uses voice activity detection (VAD) to automatically detect the start and end of each utterance. Use this mode for real-time conversations, meeting transcription, and similar scenarios.
Manual mode: The client controls utterance boundaries. Use this mode when the client can determine those boundaries explicitly, such as sending a voice message in a messaging app.

VAD mode (default)

The server automatically detects the start and end of each utterance. Send the audio stream continuously; the server returns the final transcription for each utterance once it detects the end. To enable: Configure the session.turn_detection parameter in the client session.update event.

1. The client sends input_audio_buffer.append events to append audio to the buffer.
2. The server returns a conversation.item.input_audio_transcription.delta event when speech is detected.
3. The client continues to send input_audio_buffer.append events to submit audio.
4. After sending all audio, the client sends a session.finish event to end the session.
5. The server returns a conversation.item.input_audio_transcription.delta event when it detects the end of speech.
6. The server returns a conversation.item.input_audio_transcription.delta event that contains partial transcription results.
7. The server returns a conversation.item.input_audio_transcription.completed event that contains the final transcription result.
8. The server returns a session.finished event to signal that recognition is complete. The client must then close the connection.

If the client sends session.finish before step 5, the server immediately returns a session.finished event. The client must then close the connection.
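The server side of this flow can be sketched as a small event handler that collects final transcriptions and records when the connection should be closed. The event names come from the steps above. The JSON field names `type` and `transcript` are assumptions (check Server events for the actual payload), and `TranscriptCollector` is a hypothetical class.

```python
import json
from typing import List

# Event names as listed in the VAD mode flow.
DELTA = "conversation.item.input_audio_transcription.delta"
COMPLETED = "conversation.item.input_audio_transcription.completed"
FINISHED = "session.finished"


class TranscriptCollector:
    """Hypothetical helper that tracks server events for one session."""

    def __init__(self) -> None:
        self.finals: List[str] = []  # one entry per completed utterance
        self.done = False            # True once session.finished arrives

    def handle(self, raw_message: str) -> None:
        event = json.loads(raw_message)
        event_type = event.get("type")  # assumed field name
        if event_type == COMPLETED:
            # "transcript" is an assumed field name for the final text.
            self.finals.append(event.get("transcript", ""))
        elif event_type == FINISHED:
            # The client must close the connection after this event.
            self.done = True
        # Delta events carry partial results and are ignored in this sketch.
```

A receive loop would call `handle` for each incoming message and close the socket once `done` becomes true.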

Manual mode

The client controls utterance boundaries. After sending the audio for a complete utterance, the client sends an input_audio_buffer.commit event to notify the server. To enable: Set session.turn_detection to null in the client session.update event.

1. The client sends input_audio_buffer.append events to append audio to the buffer.
2. The client sends an input_audio_buffer.commit event to commit the input audio buffer. The commit creates a new user message item in the conversation.
3. The client sends a session.finish event to end the session.
4. The server returns a conversation.item.input_audio_transcription.delta event that contains partial transcription results.
5. The server returns a conversation.item.input_audio_transcription.completed event that contains the final transcription result.
6. The server returns a session.finished event to signal that recognition is complete. The client must then close the connection.
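The client half of the Manual mode flow might look like the generator below, which yields the messages in the order described: session.update with turn_detection set to null, then append events, commit, and finish. The event names come from this section. The `type` field, the `audio` field, and base64 encoding of the audio are assumptions (see Client events for the actual schema), and `manual_mode_messages` is a hypothetical name.

```python
import base64
import json
from typing import Iterable, Iterator


def manual_mode_messages(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield JSON messages for one utterance in Manual mode (illustrative sketch)."""
    # Disable server-side VAD; None serializes to JSON null.
    yield json.dumps({"type": "session.update", "session": {"turn_detection": None}})
    for chunk in chunks:
        # Assumption: audio is sent base64-encoded in an "audio" field.
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    # Mark the end of the utterance, then end the session.
    yield json.dumps({"type": "input_audio_buffer.commit"})
    yield json.dumps({"type": "session.finish"})
```

Each yielded string would be sent as one WebSocket text message. After the last one, the client keeps reading until session.finished arrives and then closes the connection.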
