
Qwen-ASR server events

WebSocket server reference

Events the server sends during a WebSocket session.
User guide: For an overview of features and sample code, see Realtime speech recognition.

error

Sent when a client or server error occurs.
Example
{
  "event_id": "event_B2uoU7VOt1AAITsPRPH9n",
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "code": "invalid_value",
    "message": "Invalid value: 'whisper-1xx'. Supported values are: 'whisper-1'.",
    "param": "session.input_audio_transcription.model",
    "event_id": "event_123"
  }
}
• event_id (string): Unique identifier for this event.
• type (string): Event type. Always error.
• error (object): Error details.
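As a sketch (field names taken from the example above), a client might surface error events as exceptions so that a receive loop fails loudly instead of silently dropping them:

```python
import json

def handle_error_event(raw: str) -> None:
    """Raise if a WebSocket text frame is an `error` event; ignore others."""
    event = json.loads(raw)
    if event.get("type") != "error":
        return
    err = event.get("error", {})
    raise RuntimeError(
        f"{err.get('type')}/{err.get('code')}: {err.get('message')}"
        f" (param: {err.get('param')})"
    )
```

In a receive loop you would call this on every incoming frame before dispatching; non-error events pass through untouched.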

session.created

First event after connection. Contains default session settings.
Example
{
  "event_id": "event_1234",
  "type": "session.created",
  "session": {
    "id": "sess_001",
    "object": "realtime.session",
    "model": "qwen3-asr-flash-realtime",
    "modalities": ["text"],
    "input_audio_format": "pcm16",
    "input_audio_transcription": null,
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "silence_duration_ms": 200
    }
  }
}
• event_id (string): Unique identifier for this event.
• type (string): Event type. Always session.created.
• session (object): Session configuration.
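One way to consume these events is a small dispatcher keyed on the type field. The handler registrations below are illustrative, not part of the API:

```python
import json

def make_dispatcher(handlers):
    """Return a function that routes a raw WebSocket frame to the handler
    registered for its event `type`; unknown types are ignored."""
    def dispatch(raw: str):
        event = json.loads(raw)
        handler = handlers.get(event.get("type"))
        return handler(event) if handler is not None else None
    return dispatch

# Illustrative handler: capture the defaults announced by session.created.
dispatch = make_dispatcher({
    "session.created": lambda e: e["session"],
})
```

Because session.created is the first event after connecting, capturing its payload gives you the effective defaults before you send any session.update.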

session.updated

Sent after your session.update event is processed. If processing fails, an error event is sent instead. For other parameter descriptions, see session.created.
Example
{
  "event_id": "event_1234",
  "type": "session.updated",
  "session": {
    "id": "sess_001",
    "object": "realtime.session",
    "model": "qwen3-asr-flash-realtime",
    "modalities": ["text"],
    "input_audio_format": "pcm16",
    "input_audio_transcription": null,
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "silence_duration_ms": 200
    }
  }
}
• event_id (string): Unique identifier for this event.
• type (string): Event type. Always session.updated.

input_audio_buffer.speech_started

Sent in VAD mode when the server detects the start of speech in the audio buffer.
It can be triggered whenever audio is appended to the buffer, unless a speech start has already been detected.
Example
{
  "event_id": "event_B1lV7FPbgTv9qGxPI1tH4",
  "type": "input_audio_buffer.speech_started",
  "audio_start_ms": 64,
  "item_id": "item_B1lV7jWLscp4mMV8hSs8c"
}
• event_id (string): Unique identifier for this event.
• type (string): Event type. Always input_audio_buffer.speech_started.
• audio_start_ms (integer): Milliseconds from the buffer start to speech detection.
• item_id (string): ID of the user message item to be created.

input_audio_buffer.speech_stopped

Sent in VAD mode when speech ends in the audio buffer. Immediately followed by a conversation.item.created event with the user message item.
Example
{
  "event_id": "event_B3GGEYh2orwNIdhUagZPz",
  "type": "input_audio_buffer.speech_stopped",
  "audio_end_ms": 28128,
  "item_id": "item_B3GGE8ry4yqbqJGzrVhEM"
}
• event_id (string): Unique identifier for this event.
• type (string): Event type. Always input_audio_buffer.speech_stopped.
• audio_end_ms (integer): Milliseconds from the session start to when speech stopped.
• item_id (string): ID of the user message item created when speech stops.
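The two VAD events can be paired by item_id to measure each detected speech segment. This is a sketch that assumes audio_start_ms and audio_end_ms are measured on the same audio timeline:

```python
import json

class SpeechSegments:
    """Pair speech_started/speech_stopped events per item_id."""

    def __init__(self):
        self.starts = {}     # item_id -> audio_start_ms
        self.durations = {}  # item_id -> segment length in ms

    def on_event(self, raw: str) -> None:
        event = json.loads(raw)
        if event.get("type") == "input_audio_buffer.speech_started":
            self.starts[event["item_id"]] = event["audio_start_ms"]
        elif event.get("type") == "input_audio_buffer.speech_stopped":
            start = self.starts.get(event["item_id"])
            if start is not None:
                self.durations[event["item_id"]] = event["audio_end_ms"] - start
```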

input_audio_buffer.committed

Sent when the input audio buffer is committed.
Example
{
  "event_id": "event_1121",
  "type": "input_audio_buffer.committed",
  "previous_item_id": "msg_001",
  "item_id": "msg_002"
}
• event_id (string): Unique identifier for this event.
• type (string): Event type. Always input_audio_buffer.committed.
• previous_item_id (string): ID of the previous conversation item.
• item_id (string): ID of the user conversation item to be created.

conversation.item.created

Sent when a conversation item is created.
Example
{
  "type": "conversation.item.created",
  "event_id": "event_B3GGKbCfBZTpqFHZ0P8vg",
  "previous_item_id": "item_B3GGE8ry4yqbqJGzrVhEM",
  "item": {
    "id": "item_B3GGEPlolCqdMiVbYIf5L",
    "object": "realtime.item",
    "type": "message",
    "status": "completed",
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "transcript": null
      }
    ]
  }
}
• event_id (string): Unique identifier for this event.
• type (string): Event type. Always conversation.item.created.
• previous_item_id (string): ID of the previous conversation item.
• item (object): The conversation item.

conversation.item.input_audio_transcription.text

Sent repeatedly during recognition to deliver incremental real-time results.
Example
{
  "event_id": "event_R7Pfu8QVBfP5HmpcbEFSd",
  "type": "conversation.item.input_audio_transcription.text",
  "item_id": "item_MpJQPNQzqVRc9aC9zMwSj",
  "content_index": 0,
  "language": "en",
  "emotion": "neutral",
  "text": "",
  "stash": "Beijing's"
}
• event_id (string): Unique identifier for this event.
• type (string): Event type. Always conversation.item.input_audio_transcription.text.
• item_id (string): ID of the associated conversation item.
• content_index (integer): Index of the content part that contains the audio.
• language (string): Detected language. If you set the language request parameter, this value matches that setting. Possible values:
  • zh: Chinese (Mandarin, Sichuanese, Minnan, and Wu)
  • yue: Cantonese
  • en: English
  • ja: Japanese
  • de: German
  • ko: Korean
  • ru: Russian
  • fr: French
  • pt: Portuguese
  • ar: Arabic
  • it: Italian
  • es: Spanish
  • hi: Hindi
  • id: Indonesian
  • th: Thai
  • tr: Turkish
  • uk: Ukrainian
  • vi: Vietnamese
• emotion (string): Detected emotion. Supported values: surprised, neutral, happy, sad, disgusted, angry, fearful.
• text (string): Confirmed text prefix. The model has finalized this part and will not change it.
• stash (string): Pre-recognized text suffix. A temporary draft that follows the confirmed part. The model may still correct it. To get the most complete preview, concatenate both fields: text + stash.
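A minimal sketch of building that preview from one incoming frame:

```python
import json

def live_preview(raw: str) -> str:
    """Concatenate the confirmed `text` prefix with the draft `stash` suffix
    to get the most complete preview of the current utterance."""
    event = json.loads(raw)
    return event.get("text", "") + event.get("stash", "")
```

With the example payload above, text is empty and stash is "Beijing's", so the preview is just the draft suffix; as recognition progresses, finalized words migrate from stash into text.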

conversation.item.input_audio_transcription.completed

Carries the final recognition result and marks the end of a conversation item.
Example
{
  "event_id": "event_B3GGEjPT2sLzjBM74W6kB",
  "type": "conversation.item.input_audio_transcription.completed",
  "item_id": "item_B3GGC53jGOuIFcjZkmEQ9",
  "content_index": 0,
  "language": "en",
  "emotion": "neutral",
  "transcript": "What's the weather like today?"
}
• event_id (string): Unique identifier for this event.
• type (string): Event type. Always conversation.item.input_audio_transcription.completed.
• item_id (string): ID of the associated conversation item.
• content_index (integer): Index of the content part that contains the audio.
• language (string): Detected language. If you set the language request parameter, this value matches that setting. Possible values:
  • zh: Chinese (Mandarin, Sichuanese, Minnan, and Wu)
  • yue: Cantonese
  • en: English
  • ja: Japanese
  • de: German
  • ko: Korean
  • ru: Russian
  • fr: French
  • pt: Portuguese
  • ar: Arabic
  • it: Italian
  • es: Spanish
  • hi: Hindi
  • id: Indonesian
  • th: Thai
  • tr: Turkish
  • uk: Ukrainian
  • vi: Vietnamese
• emotion (string): Detected emotion. Supported values: surprised, neutral, happy, sad, disgusted, angry, fearful.
• transcript (string): Transcription result.
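Since incremental text events are superseded by the completed event, a client typically keeps only the final transcript per item. A sketch that collects finals from a stream of frames:

```python
import json

FINAL = "conversation.item.input_audio_transcription.completed"

def collect_transcripts(frames):
    """Map item_id -> final transcript from an iterable of raw frames,
    ignoring incremental and unrelated events."""
    final = {}
    for raw in frames:
        event = json.loads(raw)
        if event.get("type") == FINAL:
            final[event["item_id"]] = event["transcript"]
    return final
```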

conversation.item.input_audio_transcription.failed

Sent when recognition of the input audio fails. It is delivered separately from other error events so you can identify which item failed.
Example
{
  "type": "conversation.item.input_audio_transcription.failed",
  "item_id": "<item_id>",
  "content_index": 0,
  "error": {
    "code": "<code>",
    "message": "<message>",
    "param": "<param>"
  }
}
• type (string): Event type. Always conversation.item.input_audio_transcription.failed.
• item_id (string): ID of the associated conversation item.
• content_index (integer): Index of the content part that contains the audio.
• error (object): Error details.

session.finished

Confirms that all recognition is complete. Sent after you send session.finish. You can disconnect after receiving this event.
Example
{
  "event_id": "event_2239",
  "type": "session.finished"
}
• event_id (string): Unique identifier for this event.
• type (string): Event type. Always session.finished.
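A sketch of the shutdown handshake: after sending session.finish, keep reading until session.finished arrives, and only then close the connection. Here the receive loop is modeled as a plain iterable of frames rather than a live socket:

```python
import json

def drain_until_finished(frames):
    """Consume events until session.finished; return the event types seen.

    In a real client, `frames` would be the WebSocket receive loop, and the
    caller would close the connection once this returns.
    """
    seen = []
    for raw in frames:
        event_type = json.loads(raw).get("type")
        seen.append(event_type)
        if event_type == "session.finished":
            break
    return seen
```

Draining this way ensures any final transcription events still in flight are processed before the client disconnects.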