CosyVoice TTS, Qwen-Omni Realtime, and Fun-ASR — common questions about synthesis, real-time conversation, and speech recognition.
CosyVoice real-time synthesis
APIs: Python SDK, Java SDK, WebSocket API
How do I fix inaccurate pronunciation?
Use SSML, including the `phoneme` tag when needed.
How do I get the billed character count?
- Python, non-streaming: see Character counting rules.
- Python, streaming / callbacks: parse the JSON delivered to `on_event` (ResultCallback) for `usage.characters`; use the value from the last message.
- Python, logging: set `DASHSCOPE_LOGGING_LEVEL=debug` and read `characters` from the last log line.
- Java, non-streaming: see Character counting rules.
- Java, other modes: call `getUsage().getCharacters()` on `SpeechSynthesisResult`; the last value is final.
- WebSocket: `payload.usage.characters` in the `result-generated` event.
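For the streaming/callback cases, the rule is the same everywhere: the last `usage.characters` value wins. A minimal sketch of that parsing, assuming the `payload.usage.characters` layout from the WebSocket `result-generated` event (the simulated messages below are illustrative, not real server output):

```python
import json

def last_billed_characters(messages):
    """Return usage.characters from the last message that carries it."""
    characters = None
    for raw in messages:
        msg = json.loads(raw)
        usage = msg.get("payload", {}).get("usage") or {}
        if "characters" in usage:
            characters = usage["characters"]  # later messages override earlier ones
    return characters

# Simulated result-generated events: only the final usage value is authoritative.
events = [
    '{"payload": {"usage": {"characters": 12}}}',
    '{"payload": {"usage": {"characters": 27}}}',
]
print(last_billed_characters(events))  # 27
```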
Why WebSocket instead of HTTP for TTS? (WebSocket)
The server pushes audio and progress; WebSocket fits low-latency streaming synthesis.
How do I get the request ID?
- Python: read it from the `on_event` JSON, or call `get_last_request_id` on `SpeechSynthesizer`.
- Java: call `getRequestId()` on `SpeechSynthesisResult`, or `getLastRequestId` on `SpeechSynthesizer`.
- WebSocket: in the `result-generated` or `task-finished` event.
Why does SSML fail?
- Check the SSML limits.
- Use the latest SDK.
- SDK: SSML works only with `call` (non-streaming), not with `streaming_call`-only flows.
- WebSocket: confirm SSML support.
Why can't the audio play?
- File: match the output format to the file extension and use a compatible player.
- Stream: for MP3/Opus, use a streaming player (FFmpeg, PyAudio, `AudioFormat`, `MediaSource`). Append chunks in arrival order; the first chunk may carry the WAV/MP3 header, which appears only once.
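A sketch of the append-only chunk handling described above (the chunk contents are illustrative placeholder bytes, not real audio data):

```python
import io

class StreamAssembler:
    """Append audio chunks in arrival order.

    The container header arrives once, in the first chunk,
    so never reorder chunks or re-prepend the header."""

    def __init__(self):
        self.buf = io.BytesIO()

    def on_chunk(self, chunk: bytes):
        self.buf.write(chunk)  # append only; order matters

    def bytes(self) -> bytes:
        return self.buf.getvalue()

asm = StreamAssembler()
for chunk in (b"RIFF....WAVE", b"\x01\x02", b"\x03\x04"):  # fake header + frames
    asm.on_chunk(chunk)
assert asm.bytes().startswith(b"RIFF")
```

A real player (PyAudio, FFmpeg via a pipe, `MediaSource` in a browser) would consume the same byte stream incrementally instead of buffering it whole.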
Why does playback stutter?
Send text segments with shorter gaps between them, keep callbacks lightweight (offload heavy work off the WebSocket thread), and ensure a stable network.
Why does synthesis take a long time?
Avoid long pauses between segments; a healthy session shows a first-packet latency of about 500 ms and a real-time factor (RTF) below 1.
Why is trailing text missing or no speech returned?
- Python: call `streaming_complete` on `SpeechSynthesizer`.
- Java: call `streamingComplete` on `SpeechSynthesizer`.
- WebSocket: send `finish-task`.
Why is audio scrambled or garbled? (WebSocket)
Use a single `task_id` across `run-task`, `continue-task`, and `finish-task`, and do not reorder writes.
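A sketch of the expected message sequence, using the event names from this section (all JSON fields other than `header.action` and `header.task_id` are elided here; real messages carry more):

```python
import uuid

task_id = uuid.uuid4().hex  # one id for the whole task

def envelope(action, task_id):
    # Minimal header only; real messages also carry streaming mode, payload, etc.
    return {"header": {"action": action, "task_id": task_id}}

sequence = [
    envelope("run-task", task_id),       # first: open the task
    envelope("continue-task", task_id),  # then: one or more text sends
    envelope("finish-task", task_id),    # last: close the task
]
# Every message in the task shares the same task_id, sent in this order.
assert len({m["header"]["task_id"] for m in sequence}) == 1
```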
WebSocket closes with code 1007 or auth errors (WebSocket)
- 1007: fix the JSON and required fields; `payload.input` must be `{}` or text only.
- 401/403: check the API key; see Troubleshoot authentication failures.
SSL or WebSocketApp errors (Python)
Configure the CA bundle (`SSL_CERT_FILE`) or fix the macOS certificate path. For an AttributeError on `WebSocketApp`, reinstall websocket-client: `pip uninstall websocket-client websocket`, then `pip install websocket-client`. Details are in the CosyVoice Python SDK.
How do I restrict my API key to CosyVoice only?
Manage workspaces.
More CosyVoice questions
CosyVoice Q&A on GitHub.
Qwen-Omni Realtime
APIs: Python SDK, Java SDK, Client events, Server events
How are input audio and images aligned?
Audio is the timeline. Images attach at send time; you can turn video on or off during the session.
What input rates are recommended?
About 2 fps for images and ~100 ms audio packets for real-time use.
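The byte size of a ~100 ms audio packet depends on sample rate and sample width; a quick calculation (16 kHz, 16-bit mono PCM is an assumption here, not a mandated format):

```python
def chunk_bytes(ms, sample_rate=16000, sample_width=2, channels=1):
    """Bytes of raw PCM covering `ms` milliseconds."""
    return int(sample_rate * ms / 1000) * sample_width * channels

print(chunk_bytes(100))  # 3200 bytes per ~100 ms packet at 16 kHz / 16-bit mono
```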
What is the difference between turn_detection on and off?
With turn_detection enabled (server VAD):
- End of utterance is detected; inference runs automatically; text/audio responses return.
- Input can continue during the model response; then the session returns to listening.
- Barge-in: speaking during playback stops the response and returns to input.
turn_detection disabled:
- You end the turn yourself by calling `commit` and `create_response` / `createResponse`.
- Pause audio and video while the model is responding; resume after it finishes.
- Use `cancel_response` / `cancelResponse` to interrupt.

With `turn_detection` on, you can still use `commit`, `create_response`, and `cancel_response` manually.
Why use another model for input_audio_transcription?
Omni responds to input; it is not a dedicated ASR transcript pipeline. Use a separate speech-to-text model for verbatim transcription.
Common audio issues
Recognition issues often come from container vs codec mismatch or wrong sample_rate / format. Verify the real encoding, not only the file extension.
Convert audio with FFmpeg
Use FFmpeg to transcode into a supported format.
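A sketch of a typical transcode invocation, wrapped in Python so the run is skipped when FFmpeg or the input file is absent. The 16 kHz mono WAV target and the `input.m4a` / `output.wav` names are assumptions; match them to the model's documented formats and your files:

```python
import os
import shutil
import subprocess

def transcode_cmd(src, dst, rate=16000):
    # -y: overwrite output, -ac 1: downmix to mono, -ar: resample
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(rate), dst]

cmd = transcode_cmd("input.m4a", "output.wav")
if shutil.which("ffmpeg") and os.path.exists("input.m4a"):
    subprocess.run(cmd, check=True)
```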
Inspect container, codec, sample rate, and channels
Use ffprobe to check `format`, `sample_rate` / `sampleRate`, and `channels` where applicable.
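A sketch of an ffprobe call that prints those fields as JSON, guarded so execution is skipped when ffprobe or the file (`input.wav` is a placeholder name) is missing:

```python
import json
import os
import shutil
import subprocess

def probe_cmd(path):
    # JSON output: container format plus per-stream codec, sample rate, channels
    return [
        "ffprobe", "-v", "error", "-of", "json",
        "-show_entries", "format=format_name:stream=codec_name,sample_rate,channels",
        path,
    ]

cmd = probe_cmd("input.wav")
if shutil.which("ffprobe") and os.path.exists("input.wav"):
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    info = json.loads(out)
    print(info["format"]["format_name"], info["streams"])
```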
Fun-ASR realtime
APIs: Python SDK, Java SDK, WebSocket
How do I keep a connection alive during long silence?
Set `heartbeat` to `true` and send silent audio continuously.
Generate silent audio with FFmpeg:
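FFmpeg's `anullsrc` source can generate silence, for example `ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 10 -f s16le silence.pcm` (the 16 kHz rate and 10 s duration are assumptions; match them to your stream). Equivalently, raw 16-bit PCM silence is just zero bytes, which a sender can produce directly:

```python
def silent_pcm(ms, sample_rate=16000, sample_width=2, channels=1):
    """Raw PCM silence: every sample is zero."""
    n_samples = int(sample_rate * ms / 1000)
    return b"\x00" * (n_samples * sample_width * channels)

frame = silent_pcm(100)  # one ~100 ms keep-alive frame at 16 kHz / 16-bit mono
assert len(frame) == 3200
```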
How do I recognize a local audio file?
- Python: call `call` on the `Recognition` class for a full file (non-streaming), or `send_audio_frame` for streaming (bidirectional streaming).
- Java: `call` (non-streaming), or `sendAudioFrame` / `streamCall` (callback / Flowable streaming).
Why use WebSocket instead of HTTP? (WebSocket)
WebSocket is full-duplex; HTTP request/response alone is not suited to continuous real-time audio.
Why is speech not recognized?
- Match `format` and `sample_rate` / `sampleRate` to the real audio. See Common audio issues.
- Match `language_hints` (Python) / `languageHints` (Java) to the spoken language.
- Use custom vocabulary / hotwords for business terms, product names, or proper nouns that need precise recognition. See Custom hotwords.
Fun-ASR file transcription
APIs: Python SDK, RESTful API, Java SDK
Is Base64-encoded audio supported?
No. Use public HTTP(S) URLs only — not Base64, raw binary uploads, or local paths as the file reference.
How do I host audio at a public URL?
1. Choose a storage method
- Object storage (recommended): Public read or signed URLs; CDN-friendly.
- Web server: HTTPS for small tests.
- CDN: For high concurrency.
2. Upload the file
- Object storage: Bucket, upload, public-read or temporary link.
- Web server: a served directory (for example `/var/www/html/audio/`).
3. Get the public URL
- Object storage: `https://<bucket>.<region>.aliyuncs.com/<object-key>` or a custom domain.
- Web server: `https://your-domain.com/audio/file.mp3`
- CDN: `https://cdn.your-domain.com/audio/file.mp3`
4. Verify the URL
Use a browser or `curl`/Postman; expect HTTP 200 and playable audio.
How long does recognition take?
Tasks go PENDING → RUNNING → SUCCEEDED or FAILED. Queue time depends on load and file length.
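A sketch of polling these states with exponential backoff, which also helps avoid the rate limiting discussed below; `fetch_status` is a hypothetical stand-in for the real task-query call:

```python
import time

def wait_for_task(fetch_status, max_wait=600.0, base_delay=1.0, max_delay=30.0):
    """Poll until SUCCEEDED/FAILED, doubling the delay to avoid rate limits."""
    delay, waited = base_delay, 0.0
    while waited < max_wait:
        status = fetch_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)
    raise TimeoutError("task still not finished")

# Simulated status sequence: PENDING -> RUNNING -> SUCCEEDED
states = iter(["PENDING", "RUNNING", "SUCCEEDED"])
print(wait_for_task(lambda: next(states), base_delay=0.1))  # SUCCEEDED
```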
Why can't I get a result after polling?
Possible rate limiting: inspect the responses and back off before retrying.
What if an OSS temporary URL is inaccessible? (REST)
Set the `X-DashScope-OssResourceResolve` header to enable it.
This is intended for raw REST; the official Java/Python SDKs may not support this header, so prefer stable public URLs where possible.