Real-time TTS Java SDK
For models and voice options, see Realtime streaming TTS.
Getting started
Both interaction modes follow the same pattern: build QwenTtsRealtimeParam, create a QwenTtsRealtime instance with a callback, connect, configure the session, send text, and close. For complete examples and tutorials, see Real-time text-to-speech.
- Server commit mode (server_commit): The server decides when to synthesize buffered text. Best for most use cases.
- Commit mode (commit): The client calls commit() to trigger synthesis. Lowest latency, but you must manage sentence boundaries.
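The sketch below shows the call order described above and how it differs between the two modes. It is a plain-Java illustration, not the SDK. `TtsSession` is a hypothetical interface whose method names mirror the `QwenTtsRealtime` methods in the API reference. Its `updateSession` takes a mode string for brevity, whereas the real method takes a `QwenTtsRealtimeConfig`. `splitSentences` is a hypothetical helper that uses a simple punctuation rule to stand in for the sentence-boundary handling that commit mode leaves to the client.

```java
import java.util.ArrayList;
import java.util.List;

public class TtsLifecycleSketch {

    // Hypothetical stand-in mirroring the documented QwenTtsRealtime methods.
    // The real updateSession() takes a QwenTtsRealtimeConfig; a mode string is used here for brevity.
    interface TtsSession {
        void connect();
        void updateSession(String mode);
        void appendText(String text);
        void commit();
        void finish();
        void close();
    }

    // Hypothetical helper: splits text after sentence-ending punctuation.
    // \u3002 \uFF01 \uFF1F are the full-width CJK period, exclamation mark, and question mark.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            current.append(c);
            if (".!?\u3002\uFF01\uFF1F".indexOf(c) >= 0) {
                String s = current.toString().trim();
                if (!s.isEmpty()) sentences.add(s);
                current.setLength(0);
            }
        }
        String rest = current.toString().trim();
        if (!rest.isEmpty()) sentences.add(rest); // trailing text without a terminator
        return sentences;
    }

    // Drives one session: connect, configure, send text, finish, and always close.
    static void synthesize(TtsSession session, String mode, String text) {
        session.connect();
        try {
            session.updateSession(mode);
            if ("commit".equals(mode)) {
                // Commit mode: the client picks the boundaries and triggers synthesis per sentence.
                for (String sentence : splitSentences(text)) {
                    session.appendText(sentence);
                    session.commit();
                }
            } else {
                // server_commit mode: append the text and let the server decide when to synthesize.
                session.appendText(text);
            }
            session.finish();
        } finally {
            session.close();
        }
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        // Recording fake so the call order can be inspected without a network connection.
        TtsSession recorder = new TtsSession() {
            public void connect() { calls.add("connect"); }
            public void updateSession(String mode) { calls.add("updateSession:" + mode); }
            public void appendText(String text) { calls.add("appendText:" + text); }
            public void commit() { calls.add("commit"); }
            public void finish() { calls.add("finish"); }
            public void close() { calls.add("close"); }
        };
        synthesize(recorder, "commit", "Hello there. How are you?");
        System.out.println(calls);
    }
}
```

A production splitter would need to handle abbreviations, decimals, and very long sentences. The punctuation rule here only keeps the example short.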
Request parameters
QwenTtsRealtimeParam
Build a QwenTtsRealtimeParam and pass it to the QwenTtsRealtime constructor.
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | String | Yes | Model name. See Supported models. |
| url | String | Yes | WebSocket endpoint: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime. |
QwenTtsRealtimeConfig
Build a QwenTtsRealtimeConfig and pass it to updateSession().
| Parameter | Type | Required | Description |
|---|---|---|---|
| voice | String | Yes | Voice for synthesis. System voices: available for Qwen3-TTS-Instruct-Flash-Realtime, Qwen3-TTS-Flash-Realtime, and Qwen-TTS-Realtime. See Supported voices for samples. Custom voices: created through Voice cloning (Qwen) (Qwen3-TTS-VC-Realtime only) or Voice design (Qwen) (Qwen3-TTS-VD-Realtime only). |
| languageType | String | No | Language of the output audio. Default: Auto (the model detects language per segment; accuracy is not guaranteed). Set a specific language to improve quality for single-language text. Valid values: Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian. |
| mode | String | No | Interaction mode. server_commit (default): the server decides when to synthesize buffered text, balancing latency and quality. commit: the client triggers synthesis for the lowest latency; requires manual sentence boundary management. |
| format | String | No | Audio output format. Valid values: pcm (default), wav, mp3, opus. Qwen-TTS-Realtime supports only pcm. See Supported models. |
| sampleRate | int | No | Sample rate in Hz. Valid values: 8000, 16000, 24000 (default), 48000. Qwen-TTS-Realtime supports only 24000. See Supported models. |
| speechRate | float | No | Speech rate multiplier. Range: [0.5, 2.0]. Default: 1.0. Not supported by Qwen-TTS-Realtime. |
| volume | int | No | Audio volume. Range: [0, 100]. Default: 50. Not supported by Qwen-TTS-Realtime. |
| pitchRate | float | No | Pitch multiplier. Range: [0.5, 2.0]. Default: 1.0. Not supported by Qwen-TTS-Realtime. |
| bitRate | int | No | Audio bitrate in kbps. Range: [6, 510]. Default: 128. Only for opus format. Higher values produce better quality at larger file sizes. Not supported by Qwen-TTS-Realtime. |
| instructions | String | No | Instruction text for controlling speech style. Max 1,600 tokens. Chinese and English only. Only for Qwen3-TTS-Instruct-Flash-Realtime. See Instruction control. |
| optimizeInstructions | boolean | No | Optimizes instructions for more natural synthesis. Default: false. No effect if instructions is not set. Only for Qwen3-TTS-Instruct-Flash-Realtime. |
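The server validates the configuration only after `updateSession()` is sent. A client may therefore want to catch out-of-range values locally first. The sketch below checks the numeric and enumerated ranges from the table above. `ConfigCheck` and `validate` are hypothetical names, not part of the SDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ConfigCheck {

    private static final Set<String> FORMATS = Set.of("pcm", "wav", "mp3", "opus");
    private static final Set<Integer> SAMPLE_RATES = Set.of(8000, 16000, 24000, 48000);

    // Returns a list of problems; an empty list means every value is within the documented ranges.
    // bitRate is nullable because it applies only to opus output.
    static List<String> validate(String format, int sampleRate, float speechRate,
                                 int volume, float pitchRate, Integer bitRate) {
        List<String> errors = new ArrayList<>();
        if (!FORMATS.contains(format)) errors.add("format must be one of " + FORMATS);
        if (!SAMPLE_RATES.contains(sampleRate)) errors.add("sampleRate must be one of " + SAMPLE_RATES);
        if (speechRate < 0.5f || speechRate > 2.0f) errors.add("speechRate must be in [0.5, 2.0]");
        if (volume < 0 || volume > 100) errors.add("volume must be in [0, 100]");
        if (pitchRate < 0.5f || pitchRate > 2.0f) errors.add("pitchRate must be in [0.5, 2.0]");
        if (bitRate != null) {
            if (!"opus".equals(format)) errors.add("bitRate applies only to opus");
            else if (bitRate < 6 || bitRate > 510) errors.add("bitRate must be in [6, 510]");
        }
        return errors;
    }

    public static void main(String[] args) {
        // Documented defaults: pcm, 24000 Hz, speechRate 1.0, volume 50, pitchRate 1.0.
        System.out.println(validate("pcm", 24000, 1.0f, 50, 1.0f, null)); // prints []
        // 22050 Hz is not a valid sample rate and 600 kbps exceeds the opus range.
        System.out.println(validate("opus", 22050, 1.0f, 50, 1.0f, 600).size()); // prints 2
    }
}
```

The sketch does not check model-specific restrictions. For example, Qwen-TTS-Realtime accepts only pcm at 24000 Hz and ignores speechRate, volume, pitchRate, and bitRate. It also does not check voice names or instructions, so the server remains the authority on validity.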
API reference
QwenTtsRealtime class
| Method | Signature | Server events | Description |
|---|---|---|---|
| connect | public void connect() throws NoApiKeyException, InterruptedException | session.created, session.updated | Opens a WebSocket connection. |
| updateSession | public void updateSession(QwenTtsRealtimeConfig config) | session.updated | Updates the session configuration. Call immediately after connect() to override defaults. The server validates parameters and returns an error if invalid. |
| appendText | public void appendText(String text) | None | Appends text to the server-side input buffer. In server_commit mode, the server decides when to synthesize. In commit mode, call commit() to trigger synthesis. |
| clearAppendedText | public void clearAppendedText() | input_text_buffer.cleared | Clears all text in the server-side input buffer. |
| commit | public void commit() | input_text_buffer.committed, response.output_item.added, response.content_part.added, response.audio.delta, response.audio.done, response.content_part.done, response.output_item.done, response.done | Commits buffered text and starts synthesis. Returns an error if the buffer is empty. In server_commit mode, the server commits automatically. In commit mode, call this to trigger synthesis. |
| finish | public void finish() | session.finished | Stops the current task. |
| close | public void close() | None | Closes the WebSocket connection. |
| getSessionId | public String getSessionId() | None | Returns the current session ID. |
| getResponseId | public String getResponseId() | None | Returns the most recent response ID. |
| getFirstAudioDelay | public long getFirstAudioDelay() | None | Returns the first audio packet latency in milliseconds. |
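In commit mode, `commit()` returns an error when the input buffer is empty, and `clearAppendedText()` empties that buffer. A client can avoid the error by tracking locally whether any text has been appended since the last commit or clear. The sketch below assumes that a successful `commit()` consumes the buffer and that only these three methods change it. `CommitGuard` is a hypothetical helper that you would call alongside the real methods, not an SDK class.

```java
public class CommitGuard {

    // Characters appended since the last commit or clear.
    // Assumption: a successful commit() consumes the server-side buffer.
    private int pendingChars = 0;

    // Call together with QwenTtsRealtime.appendText(text).
    public void onAppend(String text) {
        pendingChars += text.length();
    }

    // Call together with QwenTtsRealtime.clearAppendedText().
    public void onClear() {
        pendingChars = 0;
    }

    // Returns true if commit() is expected to succeed, and resets the counter in that case.
    public boolean tryCommit() {
        if (pendingChars == 0) {
            return false; // Committing an empty buffer would return an error from the server.
        }
        pendingChars = 0;
        return true;
    }

    public static void main(String[] args) {
        CommitGuard guard = new CommitGuard();
        System.out.println(guard.tryCommit()); // prints false: nothing appended yet
        guard.onAppend("Hello.");
        System.out.println(guard.tryCommit()); // prints true
        guard.onAppend("Discard me.");
        guard.onClear();
        System.out.println(guard.tryCommit()); // prints false: buffer was cleared
    }
}
```

In use, a guarded commit might look like `if (guard.tryCommit()) { tts.commit(); }`, where `tts` is a `QwenTtsRealtime` instance. This pattern is not needed in server_commit mode, because the server commits automatically there.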
QwenTtsRealtimeCallback interface
| Method | Parameters | Description |
|---|---|---|
| onOpen() | None | Called when the WebSocket connection opens. |
| onEvent(JsonObject message) | message: The server event payload. | Called when a server event arrives. For event types, see Server-side events. |
| onClose(int code, String reason) | code: WebSocket close status code. reason: Close reason. | Called when the server closes the connection. |
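A callback typically accumulates audio from `onEvent()` and signals another thread when the session ends. The sketch below is stdlib-only, so it takes the event's type string and audio payload instead of the `JsonObject`. It assumes the payload exposes a `type` field and that `response.audio.delta` events carry base64-encoded audio in a `delta` field. With Gson, those values could be extracted with `message.get("type").getAsString()`. It also assumes 16-bit mono samples when converting PCM byte counts to seconds. `AudioCollector` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.concurrent.CountDownLatch;

public class AudioCollector {

    private final ByteArrayOutputStream audio = new ByteArrayOutputStream(); // synchronized writes
    private final CountDownLatch done = new CountDownLatch(1);

    // Call from onEvent() with the event's "type" and, for audio chunks, its base64 "delta" field.
    public void handleEvent(String type, String base64Delta) {
        switch (type) {
            case "response.audio.delta":
                byte[] chunk = Base64.getDecoder().decode(base64Delta);
                audio.write(chunk, 0, chunk.length);
                break;
            case "session.finished":
                done.countDown(); // the server has acknowledged finish()
                break;
            default:
                break; // Other events (session.created, response.done, ...) are ignored here.
        }
    }

    // Call from onClose() so a waiting thread is released even if session.finished never arrives.
    public void handleClose(int code, String reason) {
        done.countDown();
    }

    public CountDownLatch doneSignal() { return done; }

    public byte[] audioBytes() { return audio.toByteArray(); }

    // Duration of raw PCM, assuming 16-bit (2-byte) mono samples.
    public static double pcmSeconds(int byteCount, int sampleRate) {
        return byteCount / (2.0 * sampleRate);
    }

    public static void main(String[] args) throws InterruptedException {
        AudioCollector collector = new AudioCollector();
        byte[] fakePcm = new byte[48000]; // one second of 16-bit mono silence at 24000 Hz
        collector.handleEvent("response.audio.delta", Base64.getEncoder().encodeToString(fakePcm));
        collector.handleEvent("session.finished", null);
        collector.doneSignal().await();
        System.out.println(pcmSeconds(collector.audioBytes().length, 24000)); // prints 1.0
    }
}
```

A thread that calls `finish()` can wait on `doneSignal()`, ideally with a timeout, before calling `close()`. This ensures that all audio deltas have arrived before the connection is torn down.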
Next steps
- Realtime streaming TTS: Models, supported voices, and instruction control.
- Server-side events: WebSocket event type reference.
- Voice cloning (Qwen): Create custom voices with Qwen3-TTS-VC-Realtime.
- Voice design (Qwen): Design custom voices with Qwen3-TTS-VD-Realtime.
- GitHub samples: More code examples.