Voice design

Create voices from text

Use the returned voice name with Qwen TTS or Realtime streaming TTS. For an overview of how voice design works, the supported models and languages, and tips for writing effective voice descriptions, see the Voice design guide.
The target_model in voice design must match the model in synthesis. Mismatched models cause failures.

Prerequisites

  1. Get an API key and set it as an environment variable.
  2. Install the DashScope SDK (SDK examples only).

API reference

All operations use the same endpoint and authentication. Set the action parameter to choose the operation.
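Because every operation shares one request shape, the body can be assembled by a single helper. The following is a minimal sketch: `make_payload` is a hypothetical name, not part of any SDK, and the field layout follows the request bodies documented in this reference.

```python
from typing import Optional


def make_payload(action: str, parameters: Optional[dict] = None, **input_fields) -> dict:
    """Assemble a request body; `action` selects create, list, query, or delete."""
    payload = {
        "model": "qwen-voice-design",  # fixed for all voice design operations
        "input": {"action": action, **input_fields},
    }
    if parameters:
        # Only the create operation documents a top-level "parameters" object.
        payload["parameters"] = parameters
    return payload
```

For example, `make_payload("query", voice="<voice-name>")` yields the body shown under Query a voice.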

Common request details

Endpoint
POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization
Request headers
| Header | Type | Required | Description |
| --- | --- | --- | --- |
| Authorization | string | Yes | Bearer $DASHSCOPE_API_KEY |
| Content-Type | string | Yes | application/json |
Use the same account for voice design and synthesis.
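The endpoint and headers above can be wired up with only the Python standard library. This is a sketch rather than an official client: `build_request` and `call_voice_design` are hypothetical helper names, and the 30-second timeout is an arbitrary choice.

```python
import json
import os
import urllib.request

# Endpoint copied from the section above.
ENDPOINT = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"


def build_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for any voice design action."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def call_voice_design(payload: dict) -> dict:
    """Send the request and parse the JSON body.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    api_key = os.environ["DASHSCOPE_API_KEY"]  # same variable as in Prerequisites
    with urllib.request.urlopen(build_request(payload, api_key), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the headers and body easy to inspect without network access.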

Create a voice

Creates a custom voice from a text description and returns preview audio.
Request syntax
{
  "model": "qwen-voice-design",
  "input": {
    "action": "create",
    "target_model": "<target-synthesis-model>",
    "voice_prompt": "<voice-description>",
    "preview_text": "<text-for-preview-audio>",
    "preferred_name": "<keyword-for-voice-name>",
    "language": "<language-code>"
  },
  "parameters": {
    "sample_rate": 24000,
    "response_format": "wav"
  }
}
model is the voice design model (always qwen-voice-design). target_model is the synthesis model that drives the created voice. Do not confuse them.
Request parameters
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | -- | Yes | Voice design model. Fixed to qwen-voice-design. |
| action | string | -- | Yes | Operation type. Fixed to create. |
| target_model | string | -- | Yes | Synthesis model for the voice. Must match the model in subsequent synthesis calls. Values: qwen3-tts-vd-realtime-2026-01-15, qwen3-tts-vd-realtime-2025-12-16 (real-time), qwen3-tts-vd-2026-01-26 (non-real-time). |
| voice_prompt | string | -- | Yes | Voice description. Max 2,048 characters. Chinese and English only. See Write effective voice descriptions. |
| preview_text | string | -- | Yes | Text for the preview audio. Max 1,024 characters. Must be in a supported language. |
| preferred_name | string | -- | No | Keyword for the voice name (alphanumeric and underscores, max 16 characters). Appears in the generated voice name. Example: announcer produces qwen-tts-vd-announcer-voice-20251201102800-a1b2. |
| language | string | zh | No | Language code for the generated voice. Must match the preview_text language. Valid values: zh, en, de, it, pt, es, ja, ko, fr, ru. |
| sample_rate | int | 24000 | No | Sample rate in Hz for the preview audio. Valid values: 8000, 16000, 24000, 48000. |
| response_format | string | wav | No | Audio format for the preview. Valid values: pcm, wav, mp3, opus. |
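The documented limits can be checked on the client before a request is sent, which avoids a round trip for obviously invalid input. The sketch below is illustrative: `validate_create` is a hypothetical helper, the model list is copied from the table and may lag behind newer snapshots, and it assumes the character limits correspond to Python's `len()`.

```python
import re
from typing import List, Optional

# Values copied from the request parameters table; treat them as a snapshot.
TARGET_MODELS = {
    "qwen3-tts-vd-realtime-2026-01-15",
    "qwen3-tts-vd-realtime-2025-12-16",
    "qwen3-tts-vd-2026-01-26",
}
LANGUAGES = {"zh", "en", "de", "it", "pt", "es", "ja", "ko", "fr", "ru"}
SAMPLE_RATES = {8000, 16000, 24000, 48000}
RESPONSE_FORMATS = {"pcm", "wav", "mp3", "opus"}


def validate_create(input_fields: dict, parameters: Optional[dict] = None) -> List[str]:
    """Return a list of problems; an empty list means the documented limits are met."""
    parameters = parameters or {}
    problems = []
    if input_fields.get("target_model") not in TARGET_MODELS:
        problems.append("target_model is not a documented value")
    prompt = input_fields.get("voice_prompt") or ""
    if not 0 < len(prompt) <= 2048:
        problems.append("voice_prompt must be 1 to 2,048 characters")
    preview = input_fields.get("preview_text") or ""
    if not 0 < len(preview) <= 1024:
        problems.append("preview_text must be 1 to 1,024 characters")
    name = input_fields.get("preferred_name")
    if name is not None and not re.fullmatch(r"[A-Za-z0-9_]{1,16}", name):
        problems.append("preferred_name must be alphanumeric or underscore, max 16 characters")
    # Omitted optional fields fall back to the documented defaults.
    if input_fields.get("language", "zh") not in LANGUAGES:
        problems.append("language is not a documented value")
    if parameters.get("sample_rate", 24000) not in SAMPLE_RATES:
        problems.append("sample_rate is not a documented value")
    if parameters.get("response_format", "wav") not in RESPONSE_FORMATS:
        problems.append("response_format is not a documented value")
    return problems
```

The check does not verify that language matches the language of preview_text; that rule is still enforced only by the service.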
Response example
{
  "output": {
    "preview_audio": {
      "data": "{base64_encoded_audio}",
      "sample_rate": 24000,
      "response_format": "wav"
    },
    "target_model": "qwen3-tts-vd-realtime-2026-01-15",
    "voice": "qwen-tts-vd-announcer-voice-20251201102800-a1b2"
  },
  "usage": {
    "count": 1
  },
  "request_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| voice | string | Generated voice name. Pass this as the voice parameter in the synthesis API. |
| preview_audio.data | string | Base64-encoded preview audio. |
| preview_audio.sample_rate | int | Sample rate of the preview audio (matches request or defaults to 24000). |
| preview_audio.response_format | string | Format of the preview audio (matches request or defaults to wav). |
| target_model | string | Synthesis model bound to this voice. |
| usage.count | int | Voice creations billed. Always 1 for a successful creation ($0.2 per count). |
| request_id | string | Request ID for troubleshooting. |
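Since the preview arrives as Base64 text inside the JSON response, it has to be decoded before it can be played or saved. A minimal sketch, assuming a parsed create response as a dict; `decode_preview` is a hypothetical helper, and naming the file after the voice is only a convention chosen here.

```python
import base64
from typing import Tuple


def decode_preview(response: dict) -> Tuple[str, bytes]:
    """Return a suggested file name and the raw preview audio bytes."""
    output = response["output"]
    audio = output["preview_audio"]
    raw = base64.b64decode(audio["data"])
    # Use the returned format (pcm, wav, mp3, or opus) as the file extension.
    filename = f'{output["voice"]}.{audio["response_format"]}'
    return filename, raw
```

To write the audio to disk, pass the result to `pathlib.Path(filename).write_bytes(raw)`.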

List voices

Returns a paginated list of voices under your account.
Request syntax
{
  "model": "qwen-voice-design",
  "input": {
    "action": "list",
    "page_size": 10,
    "page_index": 0
  }
}
Request parameters
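| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | -- | Yes | Fixed to qwen-voice-design. |
| action | string | -- | Yes | Fixed to list. |
| page_index | integer | 0 | No | Page number. Range: 0 to 200. |
| page_size | integer | 10 | No | Results per page. Must be greater than 0. |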
Response example
{
  "output": {
    "page_index": 0,
    "page_size": 2,
    "total_count": 26,
    "voice_list": [
      {
        "gmt_create": "2025-12-10 17:04:54",
        "gmt_modified": "2025-12-10 17:04:54",
        "language": "zh",
        "preview_text": "Dear listeners, hello everyone. Welcome to today's program.",
        "target_model": "qwen3-tts-vd-realtime-2026-01-15",
        "voice": "qwen-tts-vd-announcer-voice-20251210170454-a1b2",
        "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, suitable for news broadcasting or documentary commentary."
      }
    ]
  },
  "usage": {},
  "request_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| page_index | integer | Current page number. |
| page_size | integer | Entries per page. |
| total_count | integer | Total number of voices. |
| voice_list[].voice | string | Voice name. |
| voice_list[].target_model | string | Synthesis model bound to this voice. |
| voice_list[].language | string | Language code. |
| voice_list[].voice_prompt | string | Voice description. |
| voice_list[].preview_text | string | Preview text. |
| voice_list[].gmt_create | string | Creation timestamp. |
| voice_list[].gmt_modified | string | Last modified timestamp. |
| request_id | string | Request ID. |
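Collecting every voice means requesting successive pages until total_count entries have been seen. The sketch below shows one way to do that. `iter_voices` is a hypothetical helper, and `fetch_page` is a caller-supplied function (an assumption of this example) that sends a list request for the given page_index and page_size and returns the parsed response.

```python
from typing import Callable, Iterator


def iter_voices(fetch_page: Callable[[int, int], dict], page_size: int = 10) -> Iterator[dict]:
    """Yield every voice by walking pages until total_count is reached."""
    page_index = 0
    seen = 0
    while page_index <= 200:  # documented page_index range is 0 to 200
        output = fetch_page(page_index, page_size)["output"]
        voices = output.get("voice_list", [])
        yield from voices
        seen += len(voices)
        # Stop on an empty page or once every reported voice has been yielded.
        if not voices or seen >= output.get("total_count", 0):
            break
        page_index += 1
```

Passing the fetch function in keeps the pagination logic independent of the HTTP client, so it can be exercised with a stub.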

Query a voice

Returns details about a specific voice.
Request syntax
{
  "model": "qwen-voice-design",
  "input": {
    "action": "query",
    "voice": "<voice-name>"
  }
}
Request parameters
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | -- | Yes | Fixed to qwen-voice-design. |
| action | string | -- | Yes | Fixed to query. |
| voice | string | -- | Yes | Voice name to query. |
Response example (voice found)
{
  "output": {
    "gmt_create": "2025-12-10 14:54:09",
    "gmt_modified": "2025-12-10 17:47:48",
    "language": "zh",
    "preview_text": "Dear listeners, hello everyone.",
    "target_model": "qwen3-tts-vd-realtime-2026-01-15",
    "voice": "qwen-tts-vd-announcer-voice-20251210145409-a1b2",
    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, suitable for news broadcasting or documentary commentary."
  },
  "usage": {},
  "request_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}
Response example (voice not found)
If the voice does not exist, the API returns HTTP 400 with VoiceNotFound:
{
  "request_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "code": "VoiceNotFound",
  "message": "Voice not found: qwen-tts-vd-announcer-voice-xxxx"
}
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| voice | string | Voice name. |
| target_model | string | Synthesis model bound to this voice. |
| language | string | Language code. |
| voice_prompt | string | Voice description. |
| preview_text | string | Preview text. |
| gmt_create | string | Creation time. |
| gmt_modified | string | Last modification time. |
| request_id | string | Request ID. |
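A client usually wants to treat "voice not found" as an ordinary result rather than a failure. The following sketch separates the two documented query outcomes; `parse_query_result` is a hypothetical helper that takes the HTTP status code and raw response body, and it assumes a found voice is returned with HTTP 200.

```python
import json
from typing import Optional


def parse_query_result(status: int, body: str) -> Optional[dict]:
    """Return the voice details, None if the voice does not exist, else raise."""
    data = json.loads(body)
    if status == 200 and "output" in data:
        return data["output"]
    if status == 400 and data.get("code") == "VoiceNotFound":
        return None
    # Any other error: surface the code, message, and request ID for troubleshooting.
    raise RuntimeError(
        f'{data.get("code")}: {data.get("message")} (request_id={data.get("request_id")})'
    )
```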

Delete a voice

Deletes a voice and releases its quota.
Request syntax
{
  "model": "qwen-voice-design",
  "input": {
    "action": "delete",
    "voice": "<voice-name>"
  }
}
Request parameters
| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| model | string | -- | Yes | Fixed to qwen-voice-design. |
| action | string | -- | Yes | Fixed to delete. |
| voice | string | -- | Yes | Voice name to delete. |
Response example
{
  "output": {
    "voice": "qwen-tts-vd-announcer-voice-20251210145409-a1b2"
  },
  "usage": {},
  "request_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}
Response parameters
| Parameter | Type | Description |
| --- | --- | --- |
| voice | string | Deleted voice name. |
| request_id | string | Request ID. |

Sample code

Create a voice and preview

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-voice-design",
  "input": {
    "action": "create",
    "target_model": "qwen3-tts-vd-realtime-2026-01-15",
    "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, suitable for news broadcasting or documentary commentary.",
    "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
    "preferred_name": "announcer",
    "language": "en"
  },
  "parameters": {
    "sample_rate": 24000,
    "response_format": "wav"
  }
}'

Use a custom voice for synthesis

After creating a voice, pass the returned voice name to the synthesis API. The model must match the target_model from voice design.
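One way to avoid a mismatch is to derive the synthesis model directly from the voice's own metadata instead of hard-coding it. A small sketch, where `voice_info` is the output object of a create or query response and both function names are hypothetical:

```python
def synthesis_settings(voice_info: dict) -> dict:
    """Derive the model and voice for synthesis from a create or query output."""
    return {"model": voice_info["target_model"], "voice": voice_info["voice"]}


def check_model(voice_info: dict, synthesis_model: str) -> None:
    """Raise early if the synthesis model differs from the voice's target_model."""
    if voice_info["target_model"] != synthesis_model:
        raise ValueError(
            f'voice {voice_info["voice"]} is bound to {voice_info["target_model"]}, '
            f"not {synthesis_model}"
        )
```

Calling `check_model` before opening a synthesis session turns a server-side failure into an immediate local error.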

Bidirectional streaming (real-time)

Use with qwen3-tts-vd-realtime-2026-01-15. See Realtime streaming TTS for details.
# pyaudio installation:
#   macOS:   brew install portaudio && pip install pyaudio
#   Ubuntu:  sudo apt-get install python3-pyaudio  (or pip install pyaudio)
#   CentOS:  sudo yum install -y portaudio portaudio-devel && pip install pyaudio
#   Windows: python -m pip install pyaudio

import pyaudio
import os
import base64
import threading
import time
import dashscope
from dashscope.audio.qwen_tts_realtime import QwenTtsRealtime, QwenTtsRealtimeCallback, AudioFormat

TEXT_TO_SYNTHESIZE = [
  "Right? I really like this kind of supermarket,",
  "especially during the New Year.",
  "Going to the supermarket",
  "just makes me feel",
  "super, super happy!",
  "I want to buy so many things!"
]

def init_dashscope_api_key():
  """Load the API key from environment variable."""
  dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

class MyCallback(QwenTtsRealtimeCallback):
  """Callback for streaming TTS playback."""
  def __init__(self):
    self.complete_event = threading.Event()
    self._player = pyaudio.PyAudio()
    self._stream = self._player.open(
      format=pyaudio.paInt16, channels=1, rate=24000, output=True
    )

  def on_open(self) -> None:
    print("[TTS] Connection established")

  def on_close(self, close_status_code, close_msg) -> None:
    self._stream.stop_stream()
    self._stream.close()
    self._player.terminate()
    print(f"[TTS] Connection closed, code={close_status_code}, msg={close_msg}")

  def on_event(self, response: dict) -> None:
    event_type = response.get("type", "")
    if event_type == "session.created":
      print(f'[TTS] Session started: {response["session"]["id"]}')
    elif event_type == "response.audio.delta":
      audio_data = base64.b64decode(response["delta"])
      self._stream.write(audio_data)
    elif event_type == "response.done":
      # qwen_tts_realtime is the module-level client created in the __main__ block below.
      print(f"[TTS] Response complete, ID: {qwen_tts_realtime.get_last_response_id()}")
    elif event_type == "session.finished":
      print("[TTS] Session finished")
      self.complete_event.set()

  def wait_for_finished(self):
    self.complete_event.wait()

if __name__ == "__main__":
  init_dashscope_api_key()

  callback = MyCallback()
  qwen_tts_realtime = QwenTtsRealtime(
    model="qwen3-tts-vd-realtime-2026-01-15",
    callback=callback,
    url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
  )
  qwen_tts_realtime.connect()

  qwen_tts_realtime.update_session(
    voice="<your-voice-name>",  # Replace with your voice design voice name
    response_format=AudioFormat.PCM_24000HZ_MONO_16BIT,
    mode="server_commit"
  )

  for text_chunk in TEXT_TO_SYNTHESIZE:
    print(f"[Sending text]: {text_chunk}")
    qwen_tts_realtime.append_text(text_chunk)
    time.sleep(0.1)

  qwen_tts_realtime.finish()
  callback.wait_for_finished()

  print(f"[Metric] session_id={qwen_tts_realtime.get_session_id()}, "
          f"first_audio_delay={qwen_tts_realtime.get_first_audio_delay()}s")

Non-streaming and unidirectional streaming

Use with qwen3-tts-vd-2026-01-26. Pass the returned voice name to the synthesis API with the matching model. See Qwen TTS for details and code examples.

Query voices

# Query a specific voice
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-voice-design",
  "input": {
    "action": "query",
    "voice": "<your-voice-name>"
  }
}'
# List all voices (paginated)
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-voice-design",
  "input": {
    "action": "list",
    "page_size": 10,
    "page_index": 0
  }
}'

Delete a voice

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-voice-design",
  "input": {
    "action": "delete",
    "voice": "<your-voice-name>"
  }
}'

Voice quota and cleanup

  • Account limit: 1,000 voices per account. Check the total_count field in the List voices response.
  • Automatic cleanup: Voices unused for synthesis in the past year are deleted automatically.
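The remaining quota follows directly from total_count in a List voices response. A minimal sketch; `remaining_quota` is a hypothetical helper and the limit constant simply restates the account limit above.

```python
ACCOUNT_VOICE_LIMIT = 1000  # account limit stated above


def remaining_quota(list_response: dict) -> int:
    """Return how many more voices the account can create."""
    used = list_response["output"]["total_count"]
    return max(0, ACCOUNT_VOICE_LIMIT - used)
```

When the result reaches 0, delete unused voices before creating new ones.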