Create a designed voice

POST /services/audio/tts/customization

cURL

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "voice-enrollment",
  "input": {
    "action": "create_voice",
    "target_model": "cosyvoice-v3.5-plus",
    "voice_prompt": "A composed middle-aged male announcer with a deep, rich voice",
    "preview_text": "Dear listeners, hello everyone",
    "prefix": "announcer",
    "language_hints": ["en"]
  },
  "parameters": {
    "sample_rate": 24000,
    "response_format": "wav"
  }
}'

Example response (200)

{
  "output": {
    "voice_id": "cosyvoice-v3.5-plus-vd-announcer-xxxxxx",
    "preview_audio": {
      "data": "{base64_encoded_audio}",
      "sample_rate": 24000,
      "response_format": "wav"
    },
    "target_model": "cosyvoice-v3.5-plus"
  },
  "usage": {
    "count": 1
  },
  "request_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}
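The same request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names `build_create_voice_body` and `create_voice` are hypothetical, and the URL, headers, and field names are copied from the cURL example above.

```python
import json
import urllib.request

# Endpoint taken from the cURL example above.
API_URL = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"


def build_create_voice_body(target_model, voice_prompt, preview_text, prefix,
                            language="en", sample_rate=24000, response_format="wav"):
    """Assemble the JSON body from the cURL example (hypothetical helper)."""
    return {
        "model": "voice-enrollment",      # fixed value
        "input": {
            "action": "create_voice",     # fixed value
            "target_model": target_model,
            "voice_prompt": voice_prompt,
            "preview_text": preview_text,
            "prefix": prefix,
            "language_hints": [language],
        },
        "parameters": {
            "sample_rate": sample_rate,
            "response_format": response_format,
        },
    }


def create_voice(api_key, body, timeout=60):
    """POST the body with a Bearer token and return the parsed JSON response."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call would look like `create_voice(os.environ["DASHSCOPE_API_KEY"], body)`, with `body` built by the first helper; error handling is left out of this sketch.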

Authorizations

Authorization

string

header

required

DashScope API key, sent as a Bearer token (Authorization: Bearer $DASHSCOPE_API_KEY). Get one from the API key page.

Body

application/json

model

enum<string>

required

Fixed to voice-enrollment (shared with voice cloning).

Available options: voice-enrollment

Example: voice-enrollment

input

object

required

input.action

enum<string>

required

Fixed to create_voice (same as voice cloning).

Available options: create_voice

Example: create_voice

input.target_model

enum<string>

required

Speech synthesis model that the designed voice is created for. Subsequent synthesis calls that use this voice must specify the same model.

Available options: cosyvoice-v3.5-plus, cosyvoice-v3.5-flash, cosyvoice-v3-plus, cosyvoice-v3-flash

Example: cosyvoice-v3.5-plus

input.voice_prompt

string

required

Natural-language description of the voice. Chinese and English only. Maximum 500 characters.

Example: A composed middle-aged male announcer with a deep, rich voice

Required range: length <= 500

input.preview_text

string

required

Text spoken in the preview audio. Chinese and English only. Maximum 200 characters.

Example: Dear listeners, hello everyone

Required range: length <= 200
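The two length limits above can be checked before sending a request. The sketch below is one way to do that; the function name is hypothetical, and it assumes the service counts characters the way Python's `len()` does (Unicode code points), which the page does not state.

```python
MAX_VOICE_PROMPT_CHARS = 500   # limit documented for voice_prompt
MAX_PREVIEW_TEXT_CHARS = 200   # limit documented for preview_text


def check_text_lengths(voice_prompt, preview_text):
    """Return a list of problems; an empty list means both texts fit.

    Assumption: characters are counted as Unicode code points (len()).
    """
    problems = []
    if len(voice_prompt) > MAX_VOICE_PROMPT_CHARS:
        problems.append(
            f"voice_prompt has {len(voice_prompt)} characters "
            f"(max {MAX_VOICE_PROMPT_CHARS})"
        )
    if len(preview_text) > MAX_PREVIEW_TEXT_CHARS:
        problems.append(
            f"preview_text has {len(preview_text)} characters "
            f"(max {MAX_PREVIEW_TEXT_CHARS})"
        )
    return problems
```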

input.prefix

string

required

Voice name prefix. Alphanumeric characters only, maximum 10 characters. The generated voice name has the format {target_model}-vd-{prefix}-{unique_id}.

Example: announcer

Required range: length <= 10; pattern: ^[a-zA-Z0-9]+$
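The prefix rule (alphanumeric, at most 10 characters) and the resulting name format can be illustrated with a short client-side check. Both function names are hypothetical; the pattern and the format string come from the field description above, and the unique ID is left as a placeholder because the service generates it.

```python
import re

# Same character class as the documented pattern ^[a-zA-Z0-9]+$;
# fullmatch() anchors both ends, so a trailing newline is rejected too.
PREFIX_PATTERN = re.compile(r"[a-zA-Z0-9]+")
MAX_PREFIX_LENGTH = 10


def is_valid_prefix(prefix):
    """True if the prefix is non-empty, alphanumeric, and at most 10 characters."""
    return (
        len(prefix) <= MAX_PREFIX_LENGTH
        and PREFIX_PATTERN.fullmatch(prefix) is not None
    )


def voice_name_template(target_model, prefix):
    """Return the documented name format with {unique_id} left as a placeholder."""
    return f"{target_model}-vd-{prefix}-{{unique_id}}"
```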

input.language_hints

string[]

default: ["zh"]

Language tendency of the generated voice. Must match the language of preview_text. Only the first element is used. Supported values: zh, en.

Example:

[
  "en"
]
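Because only the first element of language_hints is used, it can help to resolve the effective language on the client side before sending the request. The sketch below does that; the function name is hypothetical, and treating an empty list like an omitted field is an assumption, not documented behavior.

```python
SUPPORTED_LANGUAGES = ("zh", "en")   # values listed in the field description
DEFAULT_LANGUAGE_HINTS = ["zh"]      # documented default


def effective_language(language_hints=None):
    """Return the language the service would act on.

    Assumption: an empty list is treated like an omitted field.
    """
    hints = language_hints if language_hints else DEFAULT_LANGUAGE_HINTS
    language = hints[0]              # only the first element is used
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"unsupported language hint {language!r}; "
            f"expected one of {SUPPORTED_LANGUAGES}"
        )
    return language
```

For example, `effective_language(["en", "zh"])` resolves to `"en"`, so the preview_text for that request should be in English.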

parameters

object

Preview audio settings.

parameters.sample_rate

enum<integer>

default: 24000

Sample rate in Hz for the preview audio.

Available options: 16000, 24000, 48000

Example: 24000

parameters.response_format

enum<string>

default: "wav"

Audio format for the preview.

Available options: pcm, wav, mp3

Example: wav
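The optional parameters object can be built with the documented defaults and checked against the listed options. This is a minimal sketch with a hypothetical function name; the allowed values and defaults are the ones given above.

```python
ALLOWED_SAMPLE_RATES = (16000, 24000, 48000)   # Hz, from the options above
ALLOWED_FORMATS = ("pcm", "wav", "mp3")


def build_parameters(sample_rate=24000, response_format="wav"):
    """Return the parameters object, rejecting values outside the documented options."""
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {ALLOWED_SAMPLE_RATES}")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"response_format must be one of {ALLOWED_FORMATS}")
    return {"sample_rate": sample_rate, "response_format": response_format}
```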

Response

200 - application/json

output

object

output.voice_id

string

Generated voice ID. Pass it as the voice parameter in CosyVoice synthesis calls.

Example: cosyvoice-v3.5-plus-vd-announcer-xxxxxx

output.preview_audio

object

Preview audio data.

output.preview_audio.data

string

Base64-encoded preview audio.

Example: {base64_encoded_audio}

output.preview_audio.sample_rate

integer

Sample rate of the preview audio in Hz.

Example: 24000

output.preview_audio.response_format

string

Format of the preview audio.

Example: wav
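Since the preview audio arrives as a Base64 string, it has to be decoded before it can be played. The sketch below writes it to a file named after the voice ID; the function name and the file naming scheme are hypothetical, while the response field names are the ones documented here.

```python
import base64
from pathlib import Path


def save_preview_audio(response, directory="."):
    """Decode output.preview_audio.data and write it to <voice_id>.<format>.

    `response` is the parsed JSON response as a dict.
    """
    output = response["output"]
    preview = output["preview_audio"]
    audio_bytes = base64.b64decode(preview["data"])
    # File name is an illustrative choice: voice ID plus the reported format.
    path = Path(directory) / f"{output['voice_id']}.{preview['response_format']}"
    path.write_bytes(audio_bytes)
    return path
```

Note that with response_format set to pcm the file holds raw samples without a header, so a player needs the sample_rate from the response to interpret it.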

output.target_model

string

Speech synthesis model bound to this voice.

Example: cosyvoice-v3.5-plus
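Because a designed voice is bound to one synthesis model, a client can compare output.target_model with the model it plans to synthesize with before using the voice. The helper below is a hypothetical sketch of that check and does not call the synthesis API.

```python
def voice_matches_model(create_voice_response, synthesis_model):
    """True if the designed voice is bound to the model planned for synthesis.

    `create_voice_response` is the parsed JSON response as a dict.
    """
    return create_voice_response["output"]["target_model"] == synthesis_model
```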

usage

object

usage.count

integer

Number of billed voice creation operations. Always 1.

Example: 1

request_id

string

Request ID for troubleshooting.

Example: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx