Voice design - Qwen Cloud

Voice design generates custom voices from text descriptions. After creating a voice, use the returned voice name with CosyVoice TTS, Qwen TTS, or Realtime streaming TTS.

The target_model in voice design must match the model in synthesis. Mismatched models cause failures.

How it works

Write a voice description (voice_prompt) and preview text (preview_text).
Send a Create voice request with your target_model.
The API returns a voice name and Base64-encoded preview audio. Decode the Base64 string to get the audio file (WAV format).
Listen to the preview. If satisfied, use the voice name for synthesis. Otherwise, create a new voice.
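
The decode step above can be sketched in Python as follows. This is a minimal illustration, not the official client: the `output.voice` and `output.preview_audio.data` field names follow the Qwen-TTS example later on this page, while `save_preview` and the stand-in response are hypothetical. No request is sent.

```python
import base64
import os
import tempfile


def save_preview(response_json: dict, directory: str) -> str:
    """Decode the Base64 preview audio, write it as a WAV file, and return the path."""
    output = response_json["output"]
    voice_name = output["voice"]
    audio_bytes = base64.b64decode(output["preview_audio"]["data"])
    path = os.path.join(directory, f"{voice_name}_preview.wav")
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return path


# Stand-in for a real create-voice response (placeholder bytes, not real audio).
fake_response = {
    "output": {
        "voice": "narrator-demo",
        "preview_audio": {
            "data": base64.b64encode(b"RIFF-placeholder").decode("ascii")
        },
    }
}

preview_path = save_preview(fake_response, tempfile.gettempdir())
print(f"Preview written to {preview_path}")
```

With a real response, pass the parsed JSON body instead of `fake_response` and open the resulting file in any audio player.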

Quick start

Prerequisites

Get an API key and set the DASHSCOPE_API_KEY environment variable.

Endpoint

All voice design operations use a single endpoint:

POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization

CosyVoice Voice Design

The following example shows how to create a CosyVoice voice from a text description and use it for speech synthesis.

CosyVoice Voice Design is available only in the Beijing region (v3.5 series and v3 series).

Step 1: Create a voice from a description

Call the API with two parameters: voice_prompt for the voice description, and preview_text for the text read aloud in the preview audio.

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "create_voice",
        "target_model": "cosyvoice-v3.5-plus",
        "voice_prompt": "A composed middle-aged male announcer with a deep, rich and magnetic voice, a steady speaking speed and clear articulation, is suitable for news broadcasting or documentary commentary.",
        "preview_text": "Dear listeners, hello everyone. Welcome to the evening news.",
        "prefix": "announcer"
    },
    "parameters": {
        "sample_rate": 24000,
        "response_format": "wav"
    }
}'

Step 2: Synthesize speech with the designed voice

In the following request, use the voice_id value returned in the previous step.

# coding=utf-8
import dashscope
from dashscope.audio.tts_v2 import *
import os
# Get an API key: https://www.alibabacloud.com/help/en/model-studio/get-api-key
# If you have not configured the environment variable, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')
dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'
# Use the same model for voice design and speech synthesis
model = "cosyvoice-v3.5-plus"
# Replace the voice parameter with the custom voice generated by voice design
voice = "voice_id"
# Instantiate SpeechSynthesizer, passing the model, voice, and other request parameters in the constructor
synthesizer = SpeechSynthesizer(model=model, voice=voice)
# Send text for synthesis and get binary audio
audio = synthesizer.call("What is the weather like today?")
# Establishing the WebSocket connection is required when sending text for the first time, so the first-package latency includes the connection setup time
print('[Metric] requestId: {}, first-package latency: {} ms'.format(
  synthesizer.get_last_request_id(),
  synthesizer.get_first_package_delay()))
# Save the audio to a local file
with open('output.mp3', 'wb') as f:
  f.write(audio)

Qwen-TTS Voice Design

Examples in cURL, Python, and Java, in that order:

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "qwen-voice-design",
  "input": {
    "action": "create",
    "target_model": "qwen3-tts-vd-2026-01-26",
    "voice_prompt": "A calm young female voice with clear articulation and gentle tone, suitable for audiobook narration.",
    "preview_text": "Hello, welcome to our program. Today we will explore the wonders of nature.",
    "preferred_name": "narrator",
    "language": "en"
  },
  "parameters": {
    "sample_rate": 24000,
    "response_format": "wav"
  }
}'

import requests
import base64
import os

response = requests.post(
  "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
  headers={
    "Authorization": f"Bearer {os.getenv('DASHSCOPE_API_KEY')}",
    "Content-Type": "application/json"
  },
  json={
    "model": "qwen-voice-design",
    "input": {
      "action": "create",
      "target_model": "qwen3-tts-vd-2026-01-26",
      "voice_prompt": "A calm young female voice with clear articulation "
                      "and gentle tone, suitable for audiobook narration.",
      "preview_text": "Hello, welcome to our program. "
                      "Today we will explore the wonders of nature.",
      "preferred_name": "narrator",
      "language": "en"
    },
    "parameters": {
      "sample_rate": 24000,
      "response_format": "wav"
    }
  },
  timeout=60
)

response.raise_for_status()  # Stop here on an HTTP error instead of parsing an error body
result = response.json()
voice_name = result["output"]["voice"]
print(f"Voice created: {voice_name}")

# Decode and save preview audio
audio_bytes = base64.b64decode(result["output"]["preview_audio"]["data"])
with open(f"{voice_name}_preview.wav", "wb") as f:
  f.write(audio_bytes)

import com.google.gson.Gson;
import com.google.gson.JsonObject;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class VoiceDesign {
  public static void main(String[] args) {
    String apiKey = System.getenv("DASHSCOPE_API_KEY");
    String apiUrl = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization";

    try {
      String body = "{"
        + "\"model\": \"qwen-voice-design\","
        + "\"input\": {"
        +   "\"action\": \"create\","
        +   "\"target_model\": \"qwen3-tts-vd-2026-01-26\","
        +   "\"voice_prompt\": \"A calm young female voice with clear articulation "
        +     "and gentle tone, suitable for audiobook narration.\","
        +   "\"preview_text\": \"Hello, welcome to our program. "
        +     "Today we will explore the wonders of nature.\","
        +   "\"preferred_name\": \"narrator\","
        +   "\"language\": \"en\""
        + "},"
        + "\"parameters\": {"
        +   "\"sample_rate\": 24000,"
        +   "\"response_format\": \"wav\""
        + "}"
        + "}";

      HttpURLConnection conn = (HttpURLConnection) new URL(apiUrl).openConnection();
      conn.setRequestMethod("POST");
      conn.setRequestProperty("Authorization", "Bearer " + apiKey);
      conn.setRequestProperty("Content-Type", "application/json");
      conn.setDoOutput(true);

      try (OutputStream os = conn.getOutputStream()) {
        os.write(body.getBytes("UTF-8"));
      }

      int status = conn.getResponseCode();
      InputStream is = (status >= 200 && status < 300)
        ? conn.getInputStream()
        : conn.getErrorStream();

      StringBuilder sb = new StringBuilder();
      try (BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"))) {
        String line;
        while ((line = br.readLine()) != null) {
          sb.append(line);
        }
      }

      if (status == 200) {
        Gson gson = new Gson();
        JsonObject result = gson.fromJson(sb.toString(), JsonObject.class);
        JsonObject output = result.getAsJsonObject("output");
        String voiceName = output.get("voice").getAsString();
        System.out.println("Voice created: " + voiceName);

        // Decode and save preview audio
        String audioData = output.getAsJsonObject("preview_audio").get("data").getAsString();
        byte[] audioBytes = Base64.getDecoder().decode(audioData);
        try (FileOutputStream fos = new FileOutputStream(voiceName + "_preview.wav")) {
          fos.write(audioBytes);
        }
        System.out.println("Preview saved: " + voiceName + "_preview.wav");
      } else {
        System.err.println("Error " + status + ": " + sb.toString());
      }

    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

The response includes the voice name and Base64-encoded preview audio. Decode the Base64 string to get the WAV file and listen to the preview.

Use the voice for synthesis

Use the returned voice name with the matching synthesis model. The model in synthesis must match the target_model used during voice creation.
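
One way to guard against a model mismatch is to record the target_model when a voice is created and check it before each synthesis call. The sketch below is illustrative only: `voice_models`, `remember_voice`, and `synthesis_model_for` are hypothetical helpers, not part of any SDK, and the voice name is a placeholder.

```python
# Hypothetical in-memory record: voice name -> target_model used at creation.
voice_models: dict[str, str] = {}


def remember_voice(voice_name: str, target_model: str) -> None:
    """Store the target_model a voice was designed for."""
    voice_models[voice_name] = target_model


def synthesis_model_for(voice_name: str, requested_model: str) -> str:
    """Return requested_model if it matches the model the voice was designed for."""
    target_model = voice_models.get(voice_name)
    if target_model is None:
        raise KeyError(f"No recorded target_model for voice {voice_name!r}")
    if target_model != requested_model:
        raise ValueError(
            f"Voice {voice_name!r} was created for {target_model!r}, "
            f"not {requested_model!r}"
        )
    return requested_model


remember_voice("narrator-demo", "qwen3-tts-vd-2026-01-26")
print(synthesis_model_for("narrator-demo", "qwen3-tts-vd-2026-01-26"))
```

Failing locally with a clear message is usually easier to debug than a failed synthesis request.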

Examples in cURL, Python, and Java, in that order:

curl -X POST 'https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H 'Content-Type: application/json' \
-d '{
  "model": "qwen3-tts-vd-2026-01-26",
  "input": {
    "text": "Welcome to our audiobook. Let me take you on a journey through the wonders of nature.",
    "voice": "VOICE_NAME"
  }
}'

Replace VOICE_NAME with the voice name returned from the create step. The response contains an output.audio.url field with a download link (valid for 24 hours).

import requests
import os

voice_name = "VOICE_NAME"  # <-- from the create step

response = requests.post(
  "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation",
  headers={
    "Authorization": f"Bearer {os.getenv('DASHSCOPE_API_KEY')}",
    "Content-Type": "application/json"
  },
  json={
    "model": "qwen3-tts-vd-2026-01-26",
    "input": {
      "text": "Welcome to our audiobook. "
              "Let me take you on a journey through the wonders of nature.",
      "voice": voice_name
    }
  },
  timeout=60
)

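response.raise_for_status()
result = response.json()
audio_url = result["output"]["audio"]["url"]
print(f"Audio URL: {audio_url}")

# Download the audio file (the URL is valid for 24 hours)
audio_response = requests.get(audio_url, timeout=60)
audio_response.raise_for_status()
with open("synthesis_output.wav", "wb") as f:
  f.write(audio_response.content)
print("Audio saved: synthesis_output.wav")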

import com.google.gson.Gson;
import com.google.gson.JsonObject;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class VoiceDesignSynthesize {
  public static void main(String[] args) {
    String apiKey = System.getenv("DASHSCOPE_API_KEY");
    String voiceName = "VOICE_NAME"; // <-- from the create step

    try {
      String body = "{"
        + "\"model\": \"qwen3-tts-vd-2026-01-26\","
        + "\"input\": {"
        +   "\"text\": \"Welcome to our audiobook. "
        +     "Let me take you on a journey through the wonders of nature.\","
        +   "\"voice\": \"" + voiceName + "\""
        + "}"
        + "}";

      HttpURLConnection conn = (HttpURLConnection) new URL(
        "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation"
      ).openConnection();
      conn.setRequestMethod("POST");
      conn.setRequestProperty("Authorization", "Bearer " + apiKey);
      conn.setRequestProperty("Content-Type", "application/json");
      conn.setDoOutput(true);

      try (OutputStream os = conn.getOutputStream()) {
        os.write(body.getBytes("UTF-8"));
      }

      int status = conn.getResponseCode();
      InputStream is = (status >= 200 && status < 300)
        ? conn.getInputStream()
        : conn.getErrorStream();

      StringBuilder sb = new StringBuilder();
      try (BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"))) {
        String line;
        while ((line = br.readLine()) != null) {
          sb.append(line);
        }
      }

      if (status == 200) {
        Gson gson = new Gson();
        JsonObject result = gson.fromJson(sb.toString(), JsonObject.class);
        String audioUrl = result.getAsJsonObject("output")
          .getAsJsonObject("audio").get("url").getAsString();
        System.out.println("Audio URL: " + audioUrl);

        // Download the audio file
        try (InputStream in = new URL(audioUrl).openStream();
             FileOutputStream out = new FileOutputStream("synthesis_output.wav")) {
          byte[] buffer = new byte[4096];
          int bytesRead;
          while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
          }
        }
        System.out.println("Audio saved: synthesis_output.wav");
      } else {
        System.err.println("Error " + status + ": " + sb.toString());
      }

    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

For real-time streaming synthesis with custom voices, see Realtime streaming TTS. For complete API parameters and more operations (list, query, delete), see the Voice design API reference.

Supported models

Voice design uses a design model and a target synthesis model.

CosyVoice:

Voice design model: voice-enrollment
Target synthesis models: cosyvoice-v3.5-plus (Beijing only), cosyvoice-v3.5-flash (Beijing only), cosyvoice-v3-plus (Beijing only), cosyvoice-v3-flash (Beijing only). Use designed voices with CosyVoice TTS.

Qwen-TTS:

Voice design model: qwen-voice-design
Target synthesis models: Qwen3-TTS-VD-Realtime (for Realtime streaming TTS), Qwen3-TTS-VD (for Qwen TTS)

For model IDs and snapshot versions, see Text-to-speech models.

The target synthesis models for voice design (qwen3-tts-vd-*) support only custom-designed voices. They do not support system voices (Chelsie, Serena, Ethan, Cherry).

Supported languages

Code	Language
`zh`	Chinese
`en`	English
`de`	German
`it`	Italian
`pt`	Portuguese
`es`	Spanish
`ja`	Japanese
`ko`	Korean
`fr`	French
`ru`	Russian

voice_prompt supports Chinese and English only. The language parameter must match the preview_text language.
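
A client can reject an unsupported language code before sending a request. The sketch below copies the codes from the table above; `check_language` is a hypothetical helper, and it does not verify that the code actually matches the language of preview_text.

```python
# Language codes taken from the table above.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese",
    "en": "English",
    "de": "German",
    "it": "Italian",
    "pt": "Portuguese",
    "es": "Spanish",
    "ja": "Japanese",
    "ko": "Korean",
    "fr": "French",
    "ru": "Russian",
}


def check_language(code: str) -> str:
    """Return the code unchanged if it is supported, otherwise raise ValueError."""
    if code not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {code!r}; expected one of: {supported}")
    return code


print(check_language("en"), "=", SUPPORTED_LANGUAGES["en"])
```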

Write effective voice descriptions

A voice description (voice_prompt) tells the model what voice to generate. Combine gender, age, tone, and use case to define a distinctive voice.

Constraints

Max length: 2,048 characters.
Languages: Chinese and English only.
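
The length limit can be checked locally before calling the API. This is a sketch under one assumption: that the service counts characters the same way Python's `len` does (Unicode code points), which this page does not confirm. `check_voice_prompt` is a hypothetical helper, and it does not check the language of the description.

```python
MAX_PROMPT_CHARS = 2048  # limit stated in the constraints above


def check_voice_prompt(prompt: str) -> str:
    """Return the prompt unchanged if it is non-empty and within the length limit."""
    if not prompt.strip():
        raise ValueError("voice_prompt must not be empty")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(
            f"voice_prompt has {len(prompt)} characters; "
            f"the limit is {MAX_PROMPT_CHARS}"
        )
    return prompt


prompt = check_voice_prompt(
    "A calm young female voice with clear articulation and gentle tone, "
    "suitable for audiobook narration."
)
print(f"Prompt accepted ({len(prompt)} characters)")
```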

Description dimensions

Dimension	Examples
Gender	Male, female, neutral
Age	Child (5--12), teenager (13--18), young adult (19--35), middle-aged (36--55), elderly (55+)
Pitch	High, medium, low, high-pitched, low-pitched
Pace	Fast, medium, slow, fast-paced, slow-paced
Emotion	Cheerful, calm, gentle, serious, lively, composed, soothing
Characteristics	Magnetic, crisp, hoarse, mellow, sweet, rich, powerful
Use case	News broadcast, ad voice-over, audiobook, animation character, voice assistant, documentary narration
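
To illustrate how the dimensions in the table combine into a single description, the sketch below assembles a voice_prompt from individual fields. `describe_voice` and its sentence template are hypothetical; any wording that covers the same dimensions should work equally well.

```python
def describe_voice(
    gender: str,
    age: str,
    emotion: str,
    pace: str,
    characteristics: list[str],
    use_case: str,
) -> str:
    """Combine several description dimensions into one voice_prompt sentence."""
    article = "An" if emotion[0].lower() in "aeiou" else "A"
    traits = ", ".join(characteristics)
    return (
        f"{article} {emotion}, {age} {gender} voice with a {pace} pace "
        f"and {traits} tone, suitable for {use_case}."
    )


voice_prompt = describe_voice(
    gender="male",
    age="middle-aged",
    emotion="calm",
    pace="slow",
    characteristics=["deep", "magnetic"],
    use_case="news or documentary narration",
)
print(voice_prompt)
```

Keeping each dimension as a separate field makes it easy to vary one quality at a time when comparing previews.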

Tips

Be specific. Use concrete qualities like "deep," "crisp," or "fast-paced." Avoid vague terms like "nice" or "normal."
Use multiple dimensions. Combine gender, age, emotion, and use case. "Female voice" alone is too broad.
Be objective. Focus on physical and perceptual features. Write "high-pitched and energetic" instead of "my favorite voice."
Be original. Describe voice qualities directly. Celebrity imitation is not supported and involves copyright risks.
Be concise. Every word should serve a purpose. Avoid synonyms and meaningless intensifiers.

Examples

Good descriptions:

"A young, lively female voice with a fast pace and noticeable upward inflection, suitable for fashion product introductions."
"A calm, middle-aged male voice with a slow pace and deep, magnetic tone, suitable for news or documentary narration."
"A cute child's voice, around 8 years old, with a slightly childish tone, suitable for animation character voice-overs."

Ineffective descriptions:

Description	Issue	Improvement
"A nice voice"	Too vague	"A young female voice with a clear vocal line and gentle tone."
"A voice like a certain celebrity"	Celebrity imitation not supported	"A mature, magnetic male voice with a calm pace."
"A very, very, very nice female voice"	Redundant repetition	"A female voice, 20--24 years old, with a light tone and sweet quality."

Quota and billing

Voice quota and automatic cleanup

Total voice limit: Each Qwen Cloud account has a separate limit of 1,000 custom voices for CosyVoice and 1,000 for Qwen-TTS. The two quotas are counted independently.
Automatic cleanup: If a voice isn't used in any speech synthesis request for one year, the system automatically deletes it.

Billing rules

The prices listed below are list prices. For current promotions and discounted pricing, visit the Model Marketplace.

CosyVoice: Voice design is free.
Qwen-TTS: Each voice design costs USD 0.2. Failed creations aren't charged. Voice cloning has separate pricing.

Free quota (Singapore region only):
- You get 10 free voice design creations during the first 90 days after activating Qwen Cloud.
- Failed creations don't consume the free quota.
- Deleting a voice doesn't restore the free quota.
- After the free quota is used up or the 90-day window expires, voice design is billed at USD 0.2 per voice.
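
The Qwen-TTS rules above reduce to simple arithmetic. The sketch below estimates the list-price cost for a number of successful creations. It assumes the Singapore free quota applies and ignores promotions; `qwen_voice_design_cost_usd` is a hypothetical helper, not a billing API. Amounts are computed in cents to avoid floating-point drift.

```python
PRICE_CENTS_PER_VOICE = 20  # USD 0.2 list price per Qwen-TTS voice design
FREE_CREATIONS = 10         # free quota in the first 90 days (Singapore region only)


def qwen_voice_design_cost_usd(
    successful_creations: int,
    free_remaining: int = FREE_CREATIONS,
    in_free_window: bool = True,
) -> float:
    """Estimate the list-price cost in USD; failed creations must not be counted."""
    if successful_creations < 0 or free_remaining < 0:
        raise ValueError("counts must be non-negative")
    free_used = min(successful_creations, free_remaining) if in_free_window else 0
    billable = successful_creations - free_used
    return billable * PRICE_CENTS_PER_VOICE / 100


# 25 successful creations inside the free window: 10 free, 15 billed.
print(qwen_voice_design_cost_usd(25))
```

Because deleting a voice doesn't restore the free quota, `free_remaining` should track creations made, not voices currently stored.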

Error codes

If a call fails, see Error messages. Common voice design errors:

HTTP status	Error code	Cause	Resolution
400	BadRequest.VoiceNotFound	The specified voice does not exist (in voice design or synthesis operations)	Verify the voice name with List voices or Query a voice. If the voice does not exist, create a new voice with Create a voice.
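
A client might map this error to the resolution in the table as sketched below. The error body shape is an assumption: the sketch expects a JSON object with `code` and `message` fields, which this page does not show. `explain_failure` is a hypothetical helper.

```python
def explain_failure(status: int, body: dict) -> str:
    """Turn a failed response into a short, actionable message."""
    # Assumed error body fields: "code" and "message".
    code = body.get("code", "")
    message = body.get("message", "")
    if status == 400 and code == "BadRequest.VoiceNotFound":
        return (
            "Voice not found. Verify the voice name by listing or querying "
            "voices, or create a new voice."
        )
    return f"Request failed with HTTP {status}: {code} {message}".strip()


# Stand-in error body for illustration; no request is sent.
print(explain_failure(400, {"code": "BadRequest.VoiceNotFound", "message": "voice not found"}))
```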

Next steps

Voice design API reference (Qwen) -- Qwen-TTS voice design API parameters and response format
Voice design API reference (CosyVoice) -- CosyVoice voice design API parameters and response format
Realtime streaming TTS -- Use custom voices for real-time synthesis
Qwen TTS -- Use custom voices for non-streaming synthesis
Get an API key -- Set up authentication