Voice cloning

Clone a voice from audio samples for use with CosyVoice, Qwen-TTS, or Qwen-Omni models.

Clone a voice from 10-20 seconds of audio. The API returns a voice identifier instantly -- no training required.

How it works

  1. Clone a voice -- Call the voice cloning API with an audio sample and a target_model. The API returns a voice identifier instantly.
  2. Use the cloned voice -- Pass the voice identifier to the target model's API. The model must match the target_model from step 1.
Note: If the model used in a later API call differs from the target_model set during voice creation, synthesis fails.
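The matching rule can also be checked on the client before any synthesis request is sent. The sketch below is only an illustration and not part of the API: `VoiceRegistry`, `register`, and `check` are hypothetical names, and the voice identifier is made up.

```python
class VoiceRegistry:
    """Remembers which target_model each cloned voice was created for."""

    def __init__(self):
        # Maps voice identifier -> target_model used in step 1.
        self._target_by_voice = {}

    def register(self, voice, target_model):
        # Call right after step 1 with the returned voice identifier.
        self._target_by_voice[voice] = target_model

    def check(self, voice, model):
        # Call before step 2; raises locally instead of failing remotely.
        expected = self._target_by_voice.get(voice)
        if expected is None:
            raise KeyError(f"unknown voice: {voice}")
        if expected != model:
            raise ValueError(
                f"voice {voice!r} was cloned for {expected!r}, not {model!r}"
            )


registry = VoiceRegistry()
registry.register("example-voice-id", "cosyvoice-v3-plus")  # made-up identifier
registry.check("example-voice-id", "cosyvoice-v3-plus")     # passes silently
```

Calling `registry.check("example-voice-id", "cosyvoice-v3-flash")` would raise `ValueError`, because the voice was registered for a different model.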

Supported models

Voice cloning model:
  • Qwen-TTS / Qwen-Omni: qwen-voice-enrollment
  • CosyVoice: voice-enrollment
Target models (target_model):
Note: Qwen-TTS-VC models support only custom cloned voices, not system voices such as Chelsie, Serena, Ethan, or Cherry.
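As a rough illustration of the mapping above, the following sketch picks the voice cloning model from a target_model name. The prefix test on "cosyvoice" is an assumption based on the model names that appear on this page, and `enrollment_model_for` is a hypothetical helper, not an SDK function.

```python
def enrollment_model_for(target_model):
    """Return the voice cloning model to call for a given target_model."""
    # Assumption: CosyVoice target models are recognizable by their name prefix.
    if target_model.startswith("cosyvoice"):
        return "voice-enrollment"
    # Qwen-TTS and Qwen-Omni targets share one voice cloning model.
    return "qwen-voice-enrollment"


print(enrollment_model_for("cosyvoice-v3-flash"))
print(enrollment_model_for("qwen3.5-omni-plus-realtime"))
```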

Audio requirements

The quality of the input audio directly affects the cloning result. Each model family has different audio requirements.
Requirements for CosyVoice:
  • Format: WAV (16-bit), MP3, M4A
  • Duration: 10 -- 20 seconds recommended. 60 seconds maximum.
  • File size: 10 MB or less
  • Sample rate: 16 kHz or higher
  • Channels: Mono or stereo. For stereo audio, only the first channel is processed. Make sure the first channel contains valid speech.
  • Content: At least 5 seconds of continuous, clear speech. Brief pauses must not exceed 2 seconds. No background music, ambient noise, or other voices. Use normal-speed spoken audio; do not use singing.
  • Language: Varies by target_model.
      cosyvoice-v3-flash: Chinese (Mandarin, Cantonese, and regional dialects), English, French, German, Japanese, Korean, Russian, Portuguese, Thai, Indonesian, Vietnamese.
      cosyvoice-v3-plus: Chinese (Mandarin), English, French, German, Japanese, Korean, Russian.
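Some of the CosyVoice limits above can be checked locally before uploading. The following is a minimal sketch that uses only the Python standard library and handles WAV files only (MP3 and M4A would need a decoder). `check_wav` and `write_silence` are hypothetical helpers, treating 1 MB as 1024 * 1024 bytes is an assumption, and the content requirements (clear speech, no noise, short pauses) are not checked.

```python
import os
import tempfile
import wave

MAX_BYTES = 10 * 1024 * 1024  # "10 MB or less"; assumes 1 MB = 1024 * 1024 bytes
MAX_SECONDS = 60.0            # hard upper limit on duration
MIN_SECONDS = 5.0             # at least 5 seconds of speech cannot fit in less
MIN_SAMPLE_RATE = 16_000      # 16 kHz or higher


def check_wav(path):
    """Return a list of problems found; an empty list means no limit was hit."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 10 MB")
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            problems.append("samples are not 16-bit")
        rate = wav.getframerate()
        if rate < MIN_SAMPLE_RATE:
            problems.append("sample rate is below 16 kHz")
        seconds = wav.getnframes() / rate
        if seconds > MAX_SECONDS:
            problems.append("audio is longer than 60 seconds")
        elif seconds < MIN_SECONDS:
            problems.append("audio is shorter than 5 seconds")
    return problems


def write_silence(path, seconds, rate=16_000):
    """Write a silent 16-bit mono WAV file, used here only as demo input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.wav")
    write_silence(sample, seconds=12)
    print(check_wav(sample))
```

A file that passes these checks can still be rejected or clone poorly, since silence, noise, and reverberation are not measured here.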

Recording tips

Quick-start checklist

Use this checklist in a standard bedroom or similar small room:
  1. Close all windows and doors to block external noise.
  2. Turn off air conditioners, fans, and other electrical devices.
  3. Draw curtains to reduce glass reflections.
  4. Cover your desk with clothing or a blanket to reduce surface reflections.
  5. Read through your script. Define your character's tone and practice delivering naturally.
  6. Position the recording device approximately 10 cm from your mouth. Too close causes plosive distortion; too far produces a weak signal.
  7. Start recording.

Recording devices

Use a smartphone, digital voice recorder, or professional audio recorder.

Set up your recording environment

Choose the right room:
  • Room size: Record in a small enclosed space (max 10 m²).
  • Acoustic treatment: Choose a room with sound-absorbing materials: acoustic foam, carpets, or curtains.
  • Spaces to avoid: Avoid auditoriums, conference rooms, and classrooms. These large spaces cause strong reverberation that degrades clone quality.
Control noise:
  • Outdoor noise: Close all windows and doors. Avoid recording near traffic or construction.
  • Indoor noise: Turn off air conditioners, fans, and fluorescent lamp ballasts before recording.
Record a few seconds of ambient sound on your smartphone, then play it back at high volume to identify hidden noise sources.
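Listening is the check described here; a numeric level can complement it when comparing rooms or device positions. The sketch below computes the RMS level of a 16-bit WAV recording in dBFS using only the standard library. `rms_dbfs` is a hypothetical helper, and the page gives no noise threshold, so none is assumed: the number is useful only for comparing recordings made with the same device and settings.

```python
import array
import math
import os
import tempfile
import wave


def rms_dbfs(path):
    """RMS level of a 16-bit WAV file in dB relative to full scale."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)  # all channels, interleaved
    if not samples:
        raise ValueError("file contains no audio")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square / 32768 ** 2)


# Demo with a synthetic file: a square wave at half of full-scale amplitude.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "ambient.wav")
    with wave.open(demo_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16_000)
        wav.writeframes(array.array("h", [16384, -16384] * 8000).tobytes())
    print(f"{rms_dbfs(demo_path):.1f} dBFS")
```

Recording the same room twice, once with the fan on and once with it off, and comparing the two values gives a quick indication of how much a given noise source contributes.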
Reduce reverberation: Reverberation blurs speech and reduces definition, directly impacting clone fidelity.
  • Draw curtains, open closet doors, or cover desks/cabinets with clothing or bed sheets to reduce reflections from smooth surfaces.
  • Place irregular objects (bookshelves, upholstered furniture) to scatter sound waves.

Prepare your script

  • Content: No strict restrictions on topic apply, apart from the content restrictions listed below. Align content with your target use case.
  • Sentence structure: Use complete sentences. Avoid short phrases ("Hello", "Yes") that lack vocal information for cloning.
  • Continuity: Maintain semantic continuity. Pause infrequently and aim for 3+ seconds of uninterrupted speech per segment.
  • Emotional expression: Add appropriate emotional expression (warmth, friendliness, seriousness). Monotone delivery reduces clone naturalness.
  • Content restrictions: Do not include sensitive words (politics, pornography, violence). Recordings with this content will fail cloning.

End-to-end examples

Create a cloned voice from a local audio file, then use it with a matching model. Both steps must use the same target_model. Replace voice.mp3 with the path to your own audio file.
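The Qwen-TTS and Qwen-Omni examples on this page send the audio as a base64 data URI with the MIME type fixed to audio/mpeg, which fits an MP3 file. If you substitute a file in another format, the MIME type presumably needs to change with it. The sketch below derives it from the file extension. `to_data_uri` is a hypothetical helper, and the WAV and M4A MIME types are common values assumed here rather than values confirmed by this page.

```python
import base64
import os
import pathlib
import tempfile

# "audio/mpeg" is the value the examples use for MP3.
# The WAV and M4A entries are assumptions; verify them against the API reference.
MIME_BY_SUFFIX = {
    ".mp3": "audio/mpeg",
    ".wav": "audio/wav",
    ".m4a": "audio/mp4",
}


def to_data_uri(path):
    """Encode an audio file as a data URI with a MIME type chosen by extension."""
    file_path = pathlib.Path(path)
    mime = MIME_BY_SUFFIX.get(file_path.suffix.lower())
    if mime is None:
        raise ValueError(f"unsupported audio extension: {file_path.suffix!r}")
    payload = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


# Demo with a tiny placeholder file (not real audio).
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "voice.wav")
    pathlib.Path(demo).write_bytes(b"hi")
    print(to_data_uri(demo))
```

The returned string can take the place of the hard-coded `data_uri` in the `create_voice` functions.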

CosyVoice

Clone a voice and use it with CosyVoice TTS. Applies to cosyvoice-v3-plus and cosyvoice-v3-flash. For details, see CosyVoice TTS.
CosyVoice voice cloning uses a different API than Qwen-TTS/Qwen-Omni: the cloning model is voice-enrollment (not qwen-voice-enrollment), the action is create_voice, and the audio is passed as a public URL (not base64).
Step 1: Create a voice
The url parameter must be a publicly accessible URL of your audio file. The prefix parameter sets a prefix for the voice name.
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "voice-enrollment",
    "input": {
        "action": "create_voice",
        "target_model": "cosyvoice-v3-plus",
        "prefix": "myvoice",
        "url": "https://your-audio-url.wav",
        "language_hints": ["en"]
    }
}'
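The same request can be assembled in Python. The sketch below mirrors the curl command using only the standard library; it builds the request without sending it and makes no assumption about the response format. `build_create_voice_request` is a hypothetical helper, and treating language_hints as optional is an assumption.

```python
import json
import os
import urllib.request

CUSTOMIZATION_URL = (
    "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization"
)


def build_create_voice_request(audio_url, target_model, prefix, api_key,
                               language_hints=None):
    """Build (but do not send) the create_voice request from the curl command."""
    payload = {
        "model": "voice-enrollment",
        "input": {
            "action": "create_voice",
            "target_model": target_model,
            "prefix": prefix,
            "url": audio_url,  # must be publicly accessible
        },
    }
    if language_hints:
        payload["input"]["language_hints"] = list(language_hints)
    return urllib.request.Request(
        CUSTOMIZATION_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_voice_request(
    audio_url="https://your-audio-url.wav",
    target_model="cosyvoice-v3-plus",
    prefix="myvoice",
    api_key=os.getenv("DASHSCOPE_API_KEY", ""),
    language_hints=["en"],
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request, timeout=30), then read the JSON body.
```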
Step 2: Synthesize speech with the cloned voice
Replace YOUR_VOICE_ID with the voice value returned in the previous step.
curl -X POST https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/SpeechSynthesizer \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
    "model": "cosyvoice-v3-plus",
    "input": {
      "text": "How is the weather today?",
      "voice": "YOUR_VOICE_ID",
      "format": "wav",
      "sample_rate": 24000
    }
}'
For SDK examples (Python, Java), see CosyVoice voice cloning SDK.

Qwen-Omni: Realtime conversation

Clone a voice and use it in a realtime conversation. Applies to qwen3.5-omni-plus-realtime and qwen3.5-omni-flash-realtime. For details, see Real-time multimodal speech.
Python:
# Requirements: dashscope >= 1.23.9, pyaudio, requests
import os
import requests
import base64
import pathlib
import time
import pyaudio
from dashscope.audio.qwen_omni import MultiModality, OmniRealtimeCallback, OmniRealtimeConversation
import dashscope

TARGET_MODEL = "qwen3.5-omni-plus-realtime"
PREFERRED_NAME = "guanyu"
VOICE_FILE_PATH = "voice.mp3"

def create_voice(file_path: str) -> str:
  api_key = os.getenv("DASHSCOPE_API_KEY")
  file_path_obj = pathlib.Path(file_path)
  base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
  data_uri = f"data:audio/mpeg;base64,{base64_str}"

  resp = requests.post(
    "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
    headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    json={
      "model": "qwen-voice-enrollment",
      "input": {
        "action": "create",
        "target_model": TARGET_MODEL,
        "preferred_name": PREFERRED_NAME,
        "audio": {"data": data_uri}
      }
    }
  )
  if resp.status_code != 200:
    raise RuntimeError(f"Failed to create voice: {resp.status_code}, {resp.text}")
  return resp.json()["output"]["voice"]

class SimpleCallback(OmniRealtimeCallback):
  def __init__(self, pya):
    self.pya = pya
    self.out = None
  def on_open(self):
    self.out = self.pya.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)
  def on_event(self, response):
    if response['type'] == 'response.audio.delta':
      self.out.write(base64.b64decode(response['delta']))
    elif response['type'] == 'conversation.item.input_audio_transcription.completed':
      print(f"[User] {response['transcript']}")
    elif response['type'] == 'response.audio_transcript.done':
      print(f"[LLM] {response['transcript']}")

if __name__ == '__main__':
  dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

  # Step 1: Clone a voice
  voice = create_voice(VOICE_FILE_PATH)
  print(f"Voice cloning complete. Voice: {voice}")

  # Step 2: Start a conversation with the cloned voice
  pya = pyaudio.PyAudio()
  callback = SimpleCallback(pya)
  conv = OmniRealtimeConversation(
    model=TARGET_MODEL, callback=callback,
    url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"
  )
  conv.connect()
  conv.update_session(
    output_modalities=[MultiModality.AUDIO, MultiModality.TEXT],
    voice=voice
  )
  mic = pya.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True)
  print("Conversation started. Speak into your microphone (Ctrl+C to exit)...")
  try:
    while True:
      audio_data = mic.read(3200, exception_on_overflow=False)
      conv.append_audio(base64.b64encode(audio_data).decode())
      time.sleep(0.01)
  except KeyboardInterrupt:
    conv.close()
    mic.close()
    callback.out.close()
    pya.terminate()
    print("\nConversation ended")

Qwen-Omni: Non-realtime conversation

Clone a voice and use it in a non-realtime conversation. Applies to qwen3.5-omni-plus and qwen3.5-omni-flash. For details, see Multimodal speech.
Python:
import os
import requests
import base64
import pathlib
import numpy as np
import soundfile as sf
import dashscope

TARGET_MODEL = "qwen3.5-omni-plus"
PREFERRED_NAME = "guanyu"
VOICE_FILE_PATH = "voice.mp3"

def create_voice(file_path: str) -> str:
  api_key = os.getenv("DASHSCOPE_API_KEY")
  file_path_obj = pathlib.Path(file_path)
  base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
  data_uri = f"data:audio/mpeg;base64,{base64_str}"

  resp = requests.post(
    "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
    headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    json={
      "model": "qwen-voice-enrollment",
      "input": {
        "action": "create",
        "target_model": TARGET_MODEL,
        "preferred_name": PREFERRED_NAME,
        "audio": {"data": data_uri}
      }
    }
  )
  if resp.status_code != 200:
    raise RuntimeError(f"Failed to create voice: {resp.status_code}, {resp.text}")
  return resp.json()["output"]["voice"]

if __name__ == '__main__':
  dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

  voice = create_voice(VOICE_FILE_PATH)
  print(f"Voice cloning complete. Voice: {voice}")

  messages = [{"role": "user", "content": [{"text": "Hello, please introduce yourself"}]}]

  response = dashscope.MultiModalConversation.call(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    model=TARGET_MODEL,
    messages=messages,
    modalities=["text", "audio"],
    audio={"voice": voice, "format": "wav"},
    stream=True
  )

  print("Model response:")
  audio_base64_string = ""
  for r in response:
    try:
      content = r.output.choices[0].message.content
      for item in content:
        if isinstance(item, dict):
          if "audio" in item:
            audio_base64_string += item["audio"].get("data", "")
          elif "text" in item:
            print(item["text"], end="")
    except (AttributeError, IndexError, KeyError, TypeError):
      # Skip chunks that carry no message content.
      pass

  if audio_base64_string:
    wav_bytes = base64.b64decode(audio_base64_string)
    audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
    sf.write("audio_cloned.wav", audio_np, samplerate=24000)
    print("\nAudio saved to: audio_cloned.wav")
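The example above joins the base64 strings first and decodes once at the end. If individual chunks carry their own base64 padding, decoding the joined string may stop early, so decoding each chunk separately is a more defensive variant. The sketch below shows that variant and writes the result with the standard-library wave module instead of NumPy and soundfile. `join_audio_chunks` and `save_pcm_as_wav` are hypothetical helpers, and the 16-bit mono 24 kHz PCM layout is taken from the np.int16 and samplerate=24000 settings above.

```python
import base64
import os
import tempfile
import wave


def join_audio_chunks(chunks):
    """Decode each base64 chunk on its own and concatenate the raw bytes."""
    return b"".join(base64.b64decode(chunk) for chunk in chunks if chunk)


def save_pcm_as_wav(pcm, path, sample_rate=24000):
    """Wrap raw 16-bit mono PCM in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)


# Demo with two tiny made-up chunks, each padded independently.
demo_chunks = [
    base64.b64encode(b"\x01\x00\x02\x00").decode(),
    base64.b64encode(b"\x03\x00").decode(),
]
pcm = join_audio_chunks(demo_chunks)
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "audio_cloned.wav")
    save_pcm_as_wav(pcm, out_path)
    with wave.open(out_path, "rb") as wav:
        print(wav.getnframes(), "frames written")
```

In the streaming loop, this would mean appending each `item["audio"].get("data", "")` to a list rather than to a single string, then passing the list to `join_audio_chunks`.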

Qwen-TTS: Bidirectional streaming (real-time)

Applies to Qwen3-TTS-VC-Realtime models. For parameter details, see Realtime streaming TTS.
Python:
# pyaudio installation:
#   macOS:   brew install portaudio && pip install pyaudio
#   Ubuntu:  sudo apt-get install python3-pyaudio  (or pip install pyaudio)
#   CentOS:  sudo yum install -y portaudio portaudio-devel && pip install pyaudio
#   Windows: python -m pip install pyaudio

import pyaudio
import os
import requests
import base64
import pathlib
import threading
import time
import dashscope
from dashscope.audio.qwen_tts_realtime import QwenTtsRealtime, QwenTtsRealtimeCallback, AudioFormat

TARGET_MODEL = "qwen3-tts-vc-realtime-2026-01-15"
VOICE_FILE = "voice.mp3"  # Replace with your audio file

TEXT_TO_SYNTHESIZE = [
  'Today we explore the wonders of speech synthesis.',
  'Each voice carries a unique character.',
  'With voice cloning, you can bring any text to life.',
  "Let's create something amazing together."
]

def create_voice(file_path: str) -> str:
  """Create a cloned voice and return the voice identifier."""
  api_key = os.getenv("DASHSCOPE_API_KEY")
  file_path_obj = pathlib.Path(file_path)
  base64_str = base64.b64encode(file_path_obj.read_bytes()).decode()
  data_uri = f"data:audio/mpeg;base64,{base64_str}"

  response = requests.post(
    "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
    headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    json={
      "model": "qwen-voice-enrollment",
      "input": {
        "action": "create",
        "target_model": TARGET_MODEL,
        "preferred_name": "myvoice",
        "audio": {"data": data_uri}
      }
    }
  )
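  if response.status_code != 200:
    raise RuntimeError(f"Failed to create voice: {response.status_code}, {response.text}")
  return response.json()["output"]["voice"]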

class MyCallback(QwenTtsRealtimeCallback):
  def __init__(self):
    self.complete_event = threading.Event()
    self._player = pyaudio.PyAudio()
    self._stream = self._player.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

  def on_event(self, response: dict) -> None:
    if response.get("type") == "response.audio.delta":
      audio_data = base64.b64decode(response["delta"])
      self._stream.write(audio_data)
    elif response.get("type") == "session.finished":
      self.complete_event.set()

if __name__ == "__main__":
  dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
  callback = MyCallback()
  tts = QwenTtsRealtime(model=TARGET_MODEL, callback=callback,
                        url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime")
  tts.connect()
  tts.update_session(voice=create_voice(VOICE_FILE),
                     response_format=AudioFormat.PCM_24000HZ_MONO_16BIT, mode="server_commit")

  for text in TEXT_TO_SYNTHESIZE:
    tts.append_text(text)
    time.sleep(0.1)

  tts.finish()
  callback.complete_event.wait()
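The final wait() call in the example blocks indefinitely if the session.finished event never arrives, for example after a dropped connection. A bounded wait is one way to guard against that. The sketch below uses only threading.Event from the standard library; `wait_for_completion` is a hypothetical helper, and the timer merely stands in for the callback setting the event.

```python
import threading


def wait_for_completion(event, timeout_seconds):
    """Wait for the completion event, but give up after timeout_seconds."""
    finished = event.wait(timeout=timeout_seconds)
    if not finished:
        raise TimeoutError(
            f"no session.finished event within {timeout_seconds} s"
        )


done = threading.Event()
threading.Timer(0.05, done.set).start()  # stands in for the session.finished handler
wait_for_completion(done, timeout_seconds=2.0)
print("synthesis finished")
```

In the example above, this would replace the last line with `wait_for_completion(callback.complete_event, timeout_seconds=...)`, with a timeout chosen to fit the length of the text.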

Qwen-TTS: Non-streaming synthesis

Applies to Qwen3-TTS-VC models. For details, see Qwen TTS.
Python:
import os
import requests
import base64
import pathlib
import dashscope

TARGET_MODEL = "qwen3-tts-vc-2026-01-22"
VOICE_FILE = "voice.mp3"

def create_voice(file_path: str) -> str:
  api_key = os.getenv("DASHSCOPE_API_KEY")
  base64_str = base64.b64encode(pathlib.Path(file_path).read_bytes()).decode()
  data_uri = f"data:audio/mpeg;base64,{base64_str}"

  response = requests.post(
    "https://dashscope-intl.aliyuncs.com/api/v1/services/audio/tts/customization",
    headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    json={
      "model": "qwen-voice-enrollment",
      "input": {"action": "create", "target_model": TARGET_MODEL,
                "preferred_name": "myvoice", "audio": {"data": data_uri}}
    }
  )
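  if response.status_code != 200:
    raise RuntimeError(f"Failed to create voice: {response.status_code}, {response.text}")
  return response.json()["output"]["voice"]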

if __name__ == "__main__":
  dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'
  response = dashscope.MultiModalConversation.call(
    model=TARGET_MODEL,
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    text="Today we explore the wonders of speech synthesis.",
    voice=create_voice(VOICE_FILE),
    stream=False
  )
  print(response)

Troubleshooting

If you encounter errors, see Error messages.

Next steps