Create a voice | Qwen Cloud

POST /services/audio/tts/customization

cURL

curl --request POST \
  --url 'https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization' \
  --header 'Authorization: Bearer <YOUR_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "model": "qwen-voice-design",
  "input": {
    "action": "create",
    "target_model": "qwen3-tts-vd-realtime-2026-01-15",
    "voice_prompt": "<string>",
    "preview_text": "<string>",
    "preferred_name": "<string>",
    "language": "zh"
  },
  "parameters": {
    "sample_rate": 24000,
    "response_format": "wav"
  }
}
'

{
  "output": {
    "voice": "qwen-tts-vd-announcer-voice-20251201102800-a1b2",
    "preview_audio": {
      "data": "{base64_encoded_audio}",
      "sample_rate": 24000,
      "response_format": "wav"
    },
    "target_model": "qwen3-tts-vd-realtime-2026-01-15"
  },
  "usage": {
    "count": 1
  },
  "request_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}

model is the design model (always qwen-voice-design). target_model is the synthesis model that drives the created voice. The target_model must match the model used in subsequent synthesis calls; a mismatch causes those calls to fail.
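A minimal Python sketch of the request above, using only the standard library. The helper names `build_create_voice_payload` and `create_voice` are hypothetical; the endpoint URL and field names are copied from the cURL example. Reading the key from a `DASHSCOPE_API_KEY` environment variable is an assumption, not a documented convention.

```python
import json
import os
import urllib.request

# Endpoint copied from the cURL example above.
API_URL = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"


def build_create_voice_payload(voice_prompt, preview_text, target_model,
                               preferred_name=None, language="zh",
                               sample_rate=24000, response_format="wav"):
    """Assemble the JSON body for a voice-creation request (hypothetical helper)."""
    payload_input = {
        "action": "create",
        "target_model": target_model,
        "voice_prompt": voice_prompt,
        "preview_text": preview_text,
        "language": language,
    }
    if preferred_name is not None:  # optional field, omitted when unset
        payload_input["preferred_name"] = preferred_name
    return {
        "model": "qwen-voice-design",  # the design model is fixed
        "input": payload_input,
        "parameters": {"sample_rate": sample_rate, "response_format": response_format},
    }


def create_voice(payload, api_key=None, timeout=60):
    """POST the payload and return the parsed JSON response (hypothetical helper)."""
    # Assumption: the key is stored in the DASHSCOPE_API_KEY environment variable.
    api_key = api_key or os.environ["DASHSCOPE_API_KEY"]
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keep the target_model value passed here, since the same value has to be used as the model in later synthesis calls.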

Authorizations

Authorization (string, header, required)

DashScope API key, sent as Authorization: Bearer <YOUR_API_KEY>. Get one from the API key page.

Body

application/json

model (enum<string>, required)

Voice design model. Fixed to qwen-voice-design.

Available options: qwen-voice-design

Example: qwen-voice-design

input (object, required)

input.action (enum<string>, required)

Operation type. Fixed to create.

Available options: create

Example: create

input.target_model (enum<string>, required)

Synthesis model for the voice. Must match the model in subsequent synthesis calls. Values: qwen3-tts-vd-realtime-2026-01-15 and qwen3-tts-vd-realtime-2025-12-16 (real-time), qwen3-tts-vd-2026-01-26 (non-real-time).

Available options: qwen3-tts-vd-realtime-2026-01-15, qwen3-tts-vd-realtime-2025-12-16, qwen3-tts-vd-2026-01-26

Example: qwen3-tts-vd-realtime-2026-01-15

input.voice_prompt (string, required)

Voice description. Max 2,048 characters. Chinese and English only. See Write effective voice descriptions.

Example: A composed middle-aged male announcer with a deep, rich and magnetic voice, suitable for news broadcasting.

Required range: length <= 2048

input.preview_text (string, required)

Text for the preview audio. Max 1,024 characters. Must be in one of the languages supported by the language field.

Example: Dear listeners, hello everyone. Welcome to the evening news.

Required range: length <= 1024

input.preferred_name (string)

Keyword for the voice name (alphanumeric characters and underscores, max 16 characters). Appears in the generated voice name. For example, announcer produces qwen-tts-vd-announcer-voice-20251201102800-a1b2.

Example: announcer

Required range: length <= 16; pattern: ^[a-zA-Z0-9_]+$

input.language (enum<string>, default: "zh")

Language code for the generated voice. Must match the language of preview_text.

Available options: zh, en, de, it, pt, es, ja, ko, fr, ru

Example: en
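The constraints on the input fields above can be checked client-side before sending a request. This is a sketch only: `validate_input` is a hypothetical helper, the limits and option lists are copied from this page, and the assumption that the service counts characters the way Python's `len()` does (Unicode code points) is unverified.

```python
import re

TARGET_MODELS = {
    "qwen3-tts-vd-realtime-2026-01-15",
    "qwen3-tts-vd-realtime-2025-12-16",
    "qwen3-tts-vd-2026-01-26",
}
LANGUAGES = {"zh", "en", "de", "it", "pt", "es", "ja", "ko", "fr", "ru"}


def validate_input(fields):
    """Return a list of problems found in an `input` object (empty if none)."""
    problems = []
    if fields.get("action") != "create":
        problems.append("action must be 'create'")
    if fields.get("target_model") not in TARGET_MODELS:
        problems.append("target_model is not a listed option")
    # Assumption: limits are counted in Unicode code points, as len() does.
    for name, limit in (("voice_prompt", 2048), ("preview_text", 1024)):
        value = fields.get(name)
        if not isinstance(value, str):
            problems.append(f"{name} is required")
        elif len(value) > limit:
            problems.append(f"{name} exceeds {limit} characters")
    preferred_name = fields.get("preferred_name")
    if preferred_name is not None:
        # fullmatch applies the documented pattern ^[a-zA-Z0-9_]+$ to the whole string.
        if (not isinstance(preferred_name, str) or len(preferred_name) > 16
                or not re.fullmatch(r"[a-zA-Z0-9_]+", preferred_name)):
            problems.append("preferred_name must be 1-16 letters, digits or underscores")
    # language is optional and defaults to "zh".
    if fields.get("language", "zh") not in LANGUAGES:
        problems.append("language is not a listed option")
    return problems
```

The check cannot confirm that preview_text is actually written in the chosen language; that rule is still enforced only by the service.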

parameters (object)

parameters.sample_rate (enum<integer>, default: 24000)

Sample rate in Hz for the preview audio.

Available options: 8000, 16000, 24000, 48000

Example: 24000

parameters.response_format (enum<string>, default: "wav")

Audio format for the preview.

Available options: pcm, wav, mp3, opus

Example: wav

Response

200 - application/json

output (object)

output.voice (string)

Generated voice name. Pass this as the voice parameter in the synthesis API.

Example: qwen-tts-vd-announcer-voice-20251201102800-a1b2

output.preview_audio (object)

output.preview_audio.data (string)

Base64-encoded preview audio. Decode it to get the audio file.

Example: {base64_encoded_audio}

output.preview_audio.sample_rate (integer)

Sample rate of the preview audio in Hz.

Example: 24000

output.preview_audio.response_format (string)

Format of the preview audio.

Example: wav
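Decoding the preview can be sketched as follows. `decode_preview` is a hypothetical helper. Two assumptions are not confirmed by this page: that pcm payloads are headerless 16-bit mono samples, and that wav, mp3 and opus payloads can be written to disk unchanged.

```python
import base64
import io
import wave


def decode_preview(response, pcm_sample_width=2, pcm_channels=1):
    """Return (audio_bytes, file_extension) for output.preview_audio."""
    preview = response["output"]["preview_audio"]
    audio = base64.b64decode(preview["data"])
    fmt = preview["response_format"]
    if fmt != "pcm":
        # Assumption: these payloads are complete files and need no wrapping.
        return audio, fmt
    # Assumption: raw PCM is 16-bit mono; wrap it in a WAV header so players
    # know the sample rate reported by the service.
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(pcm_channels)
        wav_file.setsampwidth(pcm_sample_width)
        wav_file.setframerate(preview["sample_rate"])
        wav_file.writeframes(audio)
    return buffer.getvalue(), "wav"
```

A caller would typically write the result to disk with `open(f"preview.{extension}", "wb").write(audio_bytes)`.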

output.target_model (string)

Synthesis model bound to this voice.

Example: qwen3-tts-vd-realtime-2026-01-15
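Because a voice only works with the model it was created for, it can help to store the two values together and check them before synthesis. The sketch below assumes a hypothetical `DesignedVoice` type. The dictionary it returns is just a model/voice pair for illustration, not the actual layout of a synthesis request, which this page does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignedVoice:
    """Voice name plus the synthesis model it is bound to (hypothetical type)."""
    voice: str
    target_model: str

    @classmethod
    def from_response(cls, response):
        """Build the record from a parsed voice-creation response."""
        output = response["output"]
        return cls(voice=output["voice"], target_model=output["target_model"])

    def synthesis_args(self, model):
        """Return the model/voice pair for a synthesis call, or raise on mismatch."""
        if model != self.target_model:
            raise ValueError(
                f"voice {self.voice!r} is bound to {self.target_model!r}, not {model!r}"
            )
        return {"model": model, "voice": self.voice}
```

Raising locally gives a clearer error than letting the mismatched synthesis call fail on the server.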

usage (object)

usage.count (integer)

Number of voice creations billed. Always 1 for a successful creation ($0.2 per count).

Example: 1
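As a small worked example of the billing rule, the cost of a batch of creations is the summed count times the per-count price. `creation_cost` is a hypothetical helper, and the price constant simply restates the figure quoted above.

```python
from decimal import Decimal

PRICE_PER_COUNT = Decimal("0.2")  # price per count, as stated above


def creation_cost(responses):
    """Sum usage.count over parsed responses and multiply by the per-count price."""
    total_count = sum(response["usage"]["count"] for response in responses)
    return total_count * PRICE_PER_COUNT
```

Decimal is used instead of float so that repeated additions of 0.2 do not accumulate rounding error.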

request_id (string)

Request ID for troubleshooting.

Example: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx