Create a cloned voice

POST /services/audio/tts/customization

Example request (cURL):

curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "model": "voice-enrollment",
  "input": {
    "action": "create_voice",
    "target_model": "cosyvoice-v3-plus",
    "prefix": "myvoice",
    "url": "https://your-audio-url.wav",
    "language_hints": ["zh"]
  }
}'

Example response:

{
  "output": {
    "voice_id": "cosyvoice-v3-plus-myvoice-xxxxxx"
  },
  "usage": {
    "count": 1
  },
  "request_id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
}
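The same request can be sent from Python using only the standard library. This is a minimal sketch. It assumes the endpoint and JSON body shown in the cURL example and an API key in the DASHSCOPE_API_KEY environment variable. build_create_voice_request and create_voice are hypothetical helper names, and the sketch does no retry or error handling.

```python
import json
import os
import urllib.request

# Endpoint copied from the cURL example.
ENDPOINT = "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"


def build_create_voice_request(api_key, target_model, prefix, audio_url, language_hints=None):
    """Build, but do not send, the POST request that creates a cloned voice."""
    voice_input = {
        "action": "create_voice",      # fixed value for this action
        "target_model": target_model,  # must match the model used for synthesis later
        "prefix": prefix,
        "url": audio_url,              # publicly accessible audio file
    }
    if language_hints is not None:
        # Omitted when None, so the service applies its default (["zh"]).
        voice_input["language_hints"] = language_hints
    body = {"model": "voice-enrollment", "input": voice_input}
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_voice(target_model, prefix, audio_url, language_hints=None):
    """Send the request (network call) and return the decoded JSON response."""
    request = build_create_voice_request(
        os.environ["DASHSCOPE_API_KEY"], target_model, prefix, audio_url, language_hints
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, create_voice("cosyvoice-v3-plus", "myvoice", "https://your-audio-url.wav", ["zh"])["output"]["voice_id"] would return the new voice ID. Keeping request construction separate from sending makes the body easy to inspect without a network call. Note that urlopen raises urllib.error.HTTPError for non-2xx responses.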

Authorizations

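Authorization · string · header · required

DashScope API key, sent as Bearer followed by the key (for example, Bearer $DASHSCOPE_API_KEY). Obtain a key on the API key page.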

Body

application/json

model · enum<string> · required

Fixed to voice-enrollment.

Available options: voice-enrollment

Example: voice-enrollment

input · object · required

Voice creation parameters, listed as input.* below.

input.action · enum<string> · required

Fixed to create_voice.

Available options: create_voice

Example: create_voice

input.target_model · enum<string> · required

Speech synthesis model that will use the cloned voice. It must match the model specified in subsequent synthesis calls.

Available options: cosyvoice-v3-plus, cosyvoice-v3-flash

Example: cosyvoice-v3-plus

input.prefix · string · required

Voice name prefix. Use alphanumeric characters only, with a maximum of 10 characters. The generated voice name has the format {target_model}-{prefix}-{unique_id}.

Example: myvoice

Constraints: length <= 10; pattern ^[a-zA-Z0-9]+$
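The prefix constraints can be checked on the client before a request is sent. A returned voice ID can also be compared against the model it will be used with. This is a minimal sketch. The helper names are hypothetical, and the voice ID check relies only on the {target_model}-{prefix}-{unique_id} format stated above, treating unique_id as opaque.

```python
import re

# Constraints from the prefix description: alphanumeric only, at most 10 characters.
PREFIX_PATTERN = re.compile(r"^[a-zA-Z0-9]+$")
MAX_PREFIX_LENGTH = 10


def is_valid_prefix(prefix: str) -> bool:
    """Return True if prefix satisfies the documented pattern and length limit."""
    # fullmatch rejects a trailing newline that a bare "$" would otherwise allow.
    return len(prefix) <= MAX_PREFIX_LENGTH and PREFIX_PATTERN.fullmatch(prefix) is not None


def voice_id_matches(voice_id: str, target_model: str, prefix: str) -> bool:
    """Check that voice_id begins with '{target_model}-{prefix}-'.

    Useful before synthesis, since the voice must be used with the same model
    it was created for. The unique_id part is not inspected.
    """
    return voice_id.startswith(f"{target_model}-{prefix}-")
```

Rejecting an invalid prefix locally avoids a round trip that the service would refuse anyway.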

input.url · string · required

URL of the audio file to clone. The URL must be publicly accessible.

Example: https://your-audio-url.wav

input.language_hints · string[] · optional · default ["zh"]

Language of the audio. Setting it helps the model clone the voice more accurately. Only the first element is used. If the specified language does not match the audio, the system detects the language automatically. Supported values: zh, en, fr, de, ja, ko, ru for cosyvoice-v3-plus. cosyvoice-v3-flash additionally supports pt, es, it, th, id, vi.

Example: ["zh"]
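The language rules above (first element only, per-model code lists, default ["zh"]) can be sketched as a client-side check. check_language_hints is a hypothetical helper. Rejecting unsupported codes and empty lists locally is a design choice of this sketch, not documented service behavior.

```python
# Language codes listed in the language_hints description.
V3_PLUS_LANGUAGES = {"zh", "en", "fr", "de", "ja", "ko", "ru"}
SUPPORTED_LANGUAGES = {
    "cosyvoice-v3-plus": V3_PLUS_LANGUAGES,
    # cosyvoice-v3-flash supports everything v3-plus does, plus six more codes.
    "cosyvoice-v3-flash": V3_PLUS_LANGUAGES | {"pt", "es", "it", "th", "id", "vi"},
}


def check_language_hints(target_model, language_hints=None):
    """Return the hint the service will use, rejecting unsupported codes early.

    None stands for an omitted field and maps to the documented default ["zh"].
    Only the first element counts; later elements are ignored.
    """
    hints = ["zh"] if language_hints is None else list(language_hints)
    if not hints:
        raise ValueError("language_hints must contain at least one code")
    first = hints[0]
    if first not in SUPPORTED_LANGUAGES[target_model]:
        raise ValueError(f"{first!r} is not supported by {target_model}")
    return first
```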

number · optional · default 10

Maximum duration, in seconds, of the audio used for cloning after preprocessing. Longer audio generally produces better results. Supported only by cosyvoice-v3-plus and cosyvoice-v3-flash.

Example: 15

Range: 3 <= x <= 30
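The documented range and default for this duration can be enforced before a request is built. This is a minimal sketch. check_max_audio_seconds is a hypothetical helper, and it validates only the value. It does not assume a field name for the parameter.

```python
MIN_SECONDS = 3
MAX_SECONDS = 30
DEFAULT_SECONDS = 10


def check_max_audio_seconds(value=None):
    """Return the duration to request, enforcing the documented 3 <= x <= 30 range.

    None stands for an omitted parameter and maps to the documented default of 10.
    """
    if value is None:
        return DEFAULT_SECONDS
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("duration must be a number of seconds")
    if not MIN_SECONDS <= value <= MAX_SECONDS:
        raise ValueError(f"duration must be between {MIN_SECONDS} and {MAX_SECONDS} seconds")
    return value
```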

boolean · optional · default false

Enables audio preprocessing (noise reduction, enhancement, and volume normalization). Recommended for noisy audio. Leave it disabled for clean audio to preserve voice fidelity. Supported only by cosyvoice-v3-plus and cosyvoice-v3-flash.

Example: false

Response

200 · application/json

output · object

output.voice_id · string

Generated voice ID. Pass it as the voice parameter in CosyVoice synthesis calls.

Example: cosyvoice-v3-plus-myvoice-xxxxxx

usage · object

usage.count · integer

Number of billed voice creation operations. Always 1.

Example: 1

request_id · string

Request ID, useful for troubleshooting.

Example: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
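A 200 response body like the example above can be unpacked into its documented fields. This is a minimal sketch. CreateVoiceResult and parse_create_voice_response are hypothetical names, and missing fields surface as a KeyError rather than being handled.

```python
import json
from dataclasses import dataclass


@dataclass
class CreateVoiceResult:
    voice_id: str    # pass as the voice parameter in later synthesis calls
    count: int       # billed voice creation operations (always 1)
    request_id: str  # keep for troubleshooting


def parse_create_voice_response(text: str) -> CreateVoiceResult:
    """Extract output.voice_id, usage.count, and request_id from a 200 response body."""
    data = json.loads(text)
    return CreateVoiceResult(
        voice_id=data["output"]["voice_id"],
        count=data["usage"]["count"],
        request_id=data["request_id"],
    )
```

Logging request_id alongside voice_id makes failed synthesis calls easier to trace later.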