
Audio & speech FAQ

CosyVoice TTS, Qwen-Omni Realtime, and Fun-ASR — common questions about synthesis, real-time conversation, and speech recognition.

CosyVoice real-time synthesis

APIs: Python SDK, Java SDK, WebSocket API

How do I fix inaccurate pronunciation?

Use SSML, including the phoneme tag when needed.

How do I get the billed character count?

Why WebSocket instead of HTTP for TTS? (WebSocket)

The server pushes audio and progress; WebSocket fits low-latency streaming synthesis.

How do I get the request ID?

Why does SSML fail?

  1. Check the SSML limits (supported tags and length).
  2. Upgrade to the latest SDK.
  3. SDK: SSML is supported only on the call / non-streaming interface, not in streaming_call-only flows.
  4. WebSocket: verify SSML support for your configuration.

Why can't the audio play?

  1. File: Match output format to extension; use a compatible player.
  2. Stream: For MP3/Opus use a streaming player (FFmpeg, PyAudio, AudioFormat, MediaSource). Append chunks in arrival order; the container header (WAV/MP3) arrives only once, in the first chunk.
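The appending rule above can be sketched as a minimal buffer. This is a hypothetical helper, not part of the SDK: `on_data` stands in for whatever callback receives each binary WebSocket frame.

```python
import io

class StreamBuffer:
    """Accumulates audio chunks pushed by a synthesis callback.

    Illustrative sketch: `on_data` is assumed to be invoked once per
    binary frame; chunk boundaries are arbitrary, so the buffer must
    simply append bytes in arrival order.
    """

    def __init__(self):
        self._buf = io.BytesIO()

    def on_data(self, chunk: bytes) -> None:
        # Append in order; do NOT re-add container headers per chunk --
        # the first chunk may already carry the WAV/MP3 header.
        self._buf.write(chunk)

    def bytes(self) -> bytes:
        return self._buf.getvalue()
```

Feed the accumulated bytes to a streaming decoder (ffplay, PyAudio after decoding, or a browser MediaSource) rather than a file player that expects a complete container.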

Why does playback stutter?

Send text segments with shorter gaps between them; keep callbacks light (offload heavy work off the WebSocket thread); ensure a stable network.

Why does synthesis take a long time?

Avoid long pauses between sending segments; a healthy session shows roughly 500 ms to the first packet and a real-time factor (RTF) below 1.

Why is trailing text missing or no speech returned?

Why is audio scrambled or garbled? (WebSocket)

Use one task_id across run-task, continue-task, and finish-task, and do not reorder writes; mixed IDs or out-of-order frames produce scrambled audio.
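The ordering constraint above can be made concrete by building all frames up front. The frame names follow the WebSocket API; the payload fields here are illustrative, not the full schema.

```python
import json
import uuid

def build_task_frames(text_segments):
    """Build the ordered run-task / continue-task / finish-task frames
    for one synthesis task, sharing a single task_id.

    Sketch only: real frames carry additional header and payload
    fields defined by the WebSocket API.
    """
    task_id = uuid.uuid4().hex  # one id for the whole task
    frames = [json.dumps({"header": {"action": "run-task",
                                     "task_id": task_id}})]
    for seg in text_segments:
        frames.append(json.dumps(
            {"header": {"action": "continue-task", "task_id": task_id},
             "payload": {"input": {"text": seg}}}))
    frames.append(json.dumps({"header": {"action": "finish-task",
                                         "task_id": task_id}}))
    return frames
```

Writing the frames strictly in this order, on one connection, avoids the interleaving that garbles output audio.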

WebSocket closes with code 1007 or auth errors (WebSocket)

SSL or WebSocketApp errors (Python)

Configure a CA bundle (set SSL_CERT_FILE) or fix the macOS certificate path. For an AttributeError on WebSocketApp, reinstall websocket-client (pip uninstall websocket-client websocket, then pip install websocket-client). Details are in the CosyVoice Python SDK docs.

How do I restrict my API key to CosyVoice only?

Manage workspaces.

More CosyVoice questions

CosyVoice Q&A on GitHub.

Qwen-Omni Realtime

APIs: Python SDK, Java SDK, Client events, Server events

How are input audio and images aligned?

Audio is the timeline. Images attach at send time; you can turn video on or off during the session. About 2 fps for images and ~100 ms audio packets for real-time use.

What is the difference between turn_detection on and off?

With turn_detection enabled (server VAD):
  • End of utterance is detected; inference runs automatically; text/audio responses return.
  • Input can continue during the model response; then the session returns to listening.
  • Barge-in: speaking during playback stops the response and returns to input.
With turn_detection disabled:
  • You end the turn and call commit and create_response / createResponse yourself.
  • Pause audio and video while the model is responding; resume after it finishes.
  • Use cancel_response / cancelResponse to interrupt.
With turn_detection on, you can still use commit, create_response, and cancel_response manually.
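With turn_detection off, the manual flow above amounts to sending two client events at end of turn. The event type strings below mirror the commit / create_response naming the docs reference and are assumptions; your SDK may wrap them differently.

```python
import json

def end_of_turn_events():
    """Events the client sends when it decides the turn is over
    (turn_detection disabled). Event names are illustrative."""
    return [
        # close the audio input turn
        json.dumps({"type": "input_audio_buffer.commit"}),
        # ask the model to start responding
        json.dumps({"type": "response.create"}),
    ]

def interrupt_event():
    # cancel_response equivalent: stop an in-flight response
    return json.dumps({"type": "response.cancel"})
```

Remember to pause audio/video sends after `end_of_turn_events()` and resume only after the response finishes, as described above.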

Why use another model for input_audio_transcription?

Omni responds to input; it is not a dedicated ASR transcript pipeline. Use a separate speech-to-text model for verbatim transcription.

Common audio issues

Recognition issues often come from container vs codec mismatch or wrong sample_rate / format. Verify the real encoding, not only the file extension.

Convert audio with FFmpeg

Use FFmpeg to transcode into a supported format.
ffmpeg -i input_audio.ext -c:a encoder_name -b:a bit_rate -ar sample_rate -ac number_of_channels output.ext

# WAV → MP3 at the highest VBR quality
ffmpeg -i input.wav -c:a libmp3lame -q:a 0 output.mp3
# MP3 → 16-bit PCM WAV, 44.1 kHz stereo
ffmpeg -i input.mp3 -c:a pcm_s16le -ar 44100 -ac 2 output.wav
# M4A → AAC without re-encoding (stream copy)
ffmpeg -i input.m4a -c:a copy output.aac
# M4A → AAC re-encoded at 256 kbps
ffmpeg -i input.m4a -c:a aac -b:a 256k output.aac
# FLAC → Opus at 128 kbps VBR
ffmpeg -i input.flac -c:a libopus -b:a 128k -vbr on output.opus

Inspect container, codec, sample rate, and channels

Use ffprobe:
ffprobe -v error -show_entries format=format_name -show_entries stream=codec_name,sample_rate,channels -of default=noprint_wrappers=1 input.xxx
Use the reported values for format, sample_rate / sampleRate, and channels where applicable.
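The `default=noprint_wrappers=1` output above is plain `key=value` lines, which are easy to turn into request parameters. A small parser, assuming single-stream audio files:

```python
def parse_ffprobe(output: str) -> dict:
    """Parse ffprobe's default=noprint_wrappers=1 key=value lines.

    Duplicate keys (from extra streams) are ignored so the first
    stream's values win -- a simple policy for single-stream audio.
    """
    info = {}
    for line in output.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            info.setdefault(key.strip(), value.strip())
    return info
```

Use the parsed `sample_rate` and `channels` values directly when filling in the corresponding request parameters, instead of trusting the file extension.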

Fun-ASR realtime

APIs: Python SDK, Java SDK, WebSocket

How do I keep a connection alive during long silence?

Set heartbeat to true and send silent audio continuously. Generate silent audio with FFmpeg:
# Generate 1 second of silent audio at 16kHz, mono
ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 1 -acodec pcm_s16le silent.wav
Alternatively, use Audacity or Adobe Audition to create silent segments.
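If you would rather stay in Python, the standard-library wave module can write the same silent clip; silence in 16-bit PCM is simply all-zero samples.

```python
import wave

def write_silence(path: str, seconds: float = 1.0, rate: int = 16000) -> None:
    """Write a silent 16-bit mono PCM WAV file (FFmpeg-free alternative)."""
    n_frames = int(seconds * rate)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(rate)
        # all-zero samples = silence
        wav.writeframes(b"\x00\x00" * n_frames)
```

Send this clip on a loop during long silences to keep the connection alive.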

How do I recognize a local audio file?

Why use WebSocket instead of HTTP? (WebSocket)

WebSocket is full-duplex; HTTP request/response alone is not suited to continuous real-time audio.

Why is speech not recognized?

  1. Match format and sample_rate / sampleRate to the real audio. See Common audio issues.
  2. Match language_hints (Python) / languageHints (Java) to the spoken language.
  3. Use custom vocabulary / hotwords for business terms, product names, or proper nouns that need precise recognition. See Custom hotwords.

Fun-ASR file transcription

APIs: Python SDK, RESTful API, Java SDK

Is Base64-encoded audio supported?

No. Use public HTTP(S) URLs only — not Base64, raw binary uploads, or local paths as the file reference.
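A quick client-side check can reject unusable references before submission. This is a hypothetical helper, not part of the API; it only enforces the HTTP(S)-URL rule above.

```python
from urllib.parse import urlparse

def is_valid_file_reference(ref: str) -> bool:
    """Reject references the file-transcription API cannot fetch:
    Base64 data URIs, local paths, and non-HTTP(S) schemes."""
    parsed = urlparse(ref)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Run it on each file reference before creating a task to fail fast on local paths or data: URIs.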

How do I host audio at a public URL?

  • Object storage (recommended): create a bucket, upload the file, and use public-read or a temporary signed URL; CDN-friendly. URL form: https://<bucket>.<region>.aliyuncs.com/<object-key> or a custom domain.
  • Web server: serve a directory (for example /var/www/html/audio/) over HTTPS for small tests, e.g. https://your-domain.com/audio/file.mp3.
  • CDN: for high concurrency, e.g. https://cdn.your-domain.com/audio/file.mp3.
Verify with a browser or curl/Postman: expect HTTP 200 and playable audio.
See also Limitations.

How long does recognition take?

Tasks go PENDING → RUNNING → SUCCEEDED or FAILED. Queue time depends on load and file length.

Why can't I get a result after polling?

Possible rate limiting: inspect the error responses and back off between polls.
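Backing off can be as simple as doubling the delay between status queries. In this sketch, `fetch_status` is a caller-supplied placeholder for the task-query call, returning one of the states above.

```python
import time

def poll_with_backoff(fetch_status, max_attempts=8, base_delay=1.0):
    """Poll a task until it reaches a terminal state, doubling the
    delay after each attempt.

    `fetch_status` is a hypothetical wrapper around the task-query
    endpoint returning PENDING / RUNNING / SUCCEEDED / FAILED.
    """
    delay = base_delay
    for _ in range(max_attempts):
        status = fetch_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(delay)
        delay = min(delay * 2, 30.0)  # cap the backoff at 30 s
    return "TIMEOUT"
```

Spacing polls this way keeps you under rate limits while long files sit in the queue.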

What if an OSS temporary URL is inaccessible? (REST)

Set the X-DashScope-OssResourceResolve header to enable resolution. This is intended for raw REST calls; the official Java/Python SDKs may not support this header, so prefer stable public URLs where possible.

Why is the audio not recognized?

Check the format and sample rate; inspect the file with ffprobe as described in Common audio issues.