WebSocket connection, headers, and interaction flows for Qwen-ASR-Realtime
Qwen-ASR-Realtime receives audio streams and transcribes speech in real time over WebSocket. The service supports two interaction modes: VAD mode and Manual mode.
User guide: For model overviews and selection guidance, see Speech-to-text models. For sample code, see Realtime speech recognition.
Service endpoint
Use the following WebSocket URL. The model query parameter specifies the model. Replace <model_name> with the model name:
Use the wss:// scheme. Set authorization in the request headers (see Request headers). Specify the model through the model query parameter.
Request headers
Include the following fields in the request headers:
| Parameter | Type | Required | Description |
|---|---|---|---|
| Authorization | string | Yes | Authentication token in the format Bearer $DASHSCOPE_API_KEY. Replace $DASHSCOPE_API_KEY with your API key. |
| user-agent | string | No | Client identifier that helps the server track request sources. |
| X-DashScope-WorkSpace | string | No | Qwen Cloud workspace ID. |
| X-DashScope-DataInspection | string | No | Whether to enable data inspection. Omit this header unless data inspection is required; if it is, set the value to enable. |
Authorization is verified during the WebSocket handshake. If the API key is invalid or missing, the handshake fails with an HTTP 401 or 403 error.
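As a minimal sketch of the endpoint and header rules above, the following Python helpers assemble the connection URL and the request headers. The names build_url and build_headers and the BASE_URL placeholder are illustrative assumptions, not part of the service; substitute the WebSocket URL given under Service endpoint. Passing the result to a WebSocket client is left out because this page does not name a client library.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: replace with the WebSocket URL from "Service endpoint".
BASE_URL = "wss://example.invalid/realtime"


def build_url(base_url: str, model_name: str) -> str:
    """Append the model query parameter to a wss:// endpoint."""
    if not base_url.startswith("wss://"):
        raise ValueError("the endpoint must use the wss:// scheme")
    return f"{base_url}?{urlencode({'model': model_name})}"


def build_headers(api_key, user_agent=None, workspace_id=None, data_inspection=False):
    """Build the request headers listed in the table above."""
    if not api_key:
        # A missing key would fail the handshake with HTTP 401 or 403.
        raise ValueError("an API key is required")
    headers = {"Authorization": f"Bearer {api_key}"}
    if user_agent:
        headers["user-agent"] = user_agent
    if workspace_id:
        headers["X-DashScope-WorkSpace"] = workspace_id
    if data_inspection:
        # Sent only when data inspection is required; omitted otherwise.
        headers["X-DashScope-DataInspection"] = "enable"
    return headers
```

The optional headers are added only when a value is supplied, which matches the guidance to omit X-DashScope-DataInspection unless data inspection is required.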
Interaction flows
For details about client and server events, see Client events and Server events.
Qwen-ASR-Realtime supports two interaction modes:
- VAD mode (default): The server uses voice activity detection (VAD) to automatically detect the start and end of each utterance. Use this mode for real-time conversations, meeting transcription, and similar scenarios.
- Manual mode: The client controls utterance boundaries. Use this mode when the client can determine those boundaries explicitly, such as sending a voice message in a messaging app.
VAD mode (default)
The server automatically detects the start and end of each utterance. Send the audio stream continuously; the server returns the final transcription for each utterance once it detects the end.
To enable: Configure the session.turn_detection parameter in the client session.update event.
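The configuration step above could be serialized as follows. This is a sketch under assumptions: the event name is carried in a "type" field and the settings sit under "session", and the turn_detection value shown ({"type": "server_vad"}) is a hypothetical example. See Client events for the actual fields.

```python
import json


def vad_session_update(turn_detection: dict) -> str:
    """Serialize a session.update event that sets session.turn_detection."""
    if not isinstance(turn_detection, dict):
        raise TypeError("VAD mode expects a turn_detection object")
    event = {
        "type": "session.update",  # assumed envelope field for the event name
        "session": {"turn_detection": turn_detection},
    }
    return json.dumps(event)


# Hypothetical settings; the real keys are defined in Client events.
message = vad_session_update({"type": "server_vad"})
```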
1. The client sends input_audio_buffer.append events to append audio to the buffer.
2. The server returns a conversation.item.input_audio_transcription.delta event when speech is detected.
3. The client continues to send input_audio_buffer.append events to submit audio.
4. After sending all audio, the client sends a session.finish event to end the session.
5. The server returns a conversation.item.input_audio_transcription.delta event when it detects the end of speech.
6. The server returns a conversation.item.input_audio_transcription.delta event that contains partial transcription results.
7. The server returns a conversation.item.input_audio_transcription.completed event that contains the final transcription result.
8. The server returns a session.finished event to signal that recognition is complete. The client must then close the connection.
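The server side of the flow above can be handled with a small dispatcher. The sketch below reads already-received JSON messages, keeps the final transcripts, and reports whether session.finished arrived so that the caller knows to close the connection. The "type" and "transcript" field names and the collect_transcripts helper are assumptions made for illustration; check Server events for the real payload layout.

```python
import json

COMPLETED = "conversation.item.input_audio_transcription.completed"
FINISHED = "session.finished"


def collect_transcripts(raw_messages):
    """Return (final_transcripts, finished) from an iterable of JSON strings."""
    finals = []
    for raw in raw_messages:
        event = json.loads(raw)
        kind = event.get("type")  # assumed field carrying the event name
        if kind == COMPLETED:
            # "transcript" is an assumed field name for the final text.
            finals.append(event.get("transcript", ""))
        elif kind == FINISHED:
            # Recognition is complete; the client must now close the connection.
            return finals, True
        # Partial (delta) events and anything else are ignored in this sketch.
    return finals, False
```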
If the client sends session.finish before step 5, the server immediately returns a session.finished event. The client must then close the connection.
Manual mode
The client controls utterance boundaries. After sending the audio for a complete utterance, the client sends an input_audio_buffer.commit event to notify the server.
To enable: Set session.turn_detection to null in the client session.update event.
1. The client sends input_audio_buffer.append events to append audio to the buffer.
2. The client sends an input_audio_buffer.commit event to commit the input audio buffer. The commit creates a new user message item in the conversation.
3. The client sends a session.finish event to end the session.
4. The server returns a conversation.item.input_audio_transcription.delta event that contains partial transcription results.
5. The server returns a conversation.item.input_audio_transcription.completed event that contains the final transcription result.
6. The server returns a session.finished event to signal that recognition is complete. The client must then close the connection.
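The client half of the Manual mode flow can be sketched as a generator that yields the messages in order: a session.update with turn_detection set to null, the append events, one commit, and session.finish. The "type" envelope, the base64-encoded "audio" field, the chunk size, and the manual_mode_events name are assumptions for illustration only; see Client events for the actual payloads.

```python
import base64
import json


def manual_mode_events(utterance_audio: bytes, chunk_size: int = 3200):
    """Yield JSON messages for one complete utterance in Manual mode."""
    # Python's None serializes to JSON null, which selects Manual mode.
    yield json.dumps({"type": "session.update", "session": {"turn_detection": None}})
    for start in range(0, len(utterance_audio), chunk_size):
        chunk = utterance_audio[start:start + chunk_size]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            # Assumed encoding: base64 text in an "audio" field.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    # The client, not the server, marks the end of the utterance.
    yield json.dumps({"type": "input_audio_buffer.commit"})
    yield json.dumps({"type": "session.finish"})
```

Each yielded string would be sent as one WebSocket text message; the client then reads server events until session.finished arrives and closes the connection.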