Real-time ASR WebSocket
Use the WebSocket API to connect to Fun-ASR real-time speech recognition from any programming language. For easier integration, use the Python SDK or Java SDK instead.
User guide: For model details and selection, see Realtime speech recognition.
Getting started
Prerequisites
- Get an API key and export it as an environment variable.
- Download the sample audio: asr_example.wav.
Sample code
- Node.js
- C#
- PHP
- Go
Install dependencies, then run the sample code for your language.
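The language-specific samples are not reproduced here. The following is a minimal Python sketch of the full flow. It assumes the third-party websockets package (pip install websockets); ENDPOINT is a placeholder you must replace with the connection endpoint from the API reference below.

```python
"""Minimal Fun-ASR real-time recognition client (sketch, not official sample code)."""
import json
import os
import uuid

# Placeholder; replace with the connection endpoint from the API reference.
ENDPOINT = "wss://example.com/api-ws/v1/inference"


def run_task_message(task_id: str, audio_format: str = "pcm",
                     sample_rate: int = 16000) -> str:
    """Build the run-task instruction from the parameter tables."""
    return json.dumps({
        "header": {"action": "run-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {
            "task_group": "audio", "task": "asr", "function": "recognition",
            "model": "fun-asr-realtime",
            "parameters": {"format": audio_format, "sample_rate": sample_rate},
            "input": {},
        },
    })


def finish_task_message(task_id: str) -> str:
    """Build the finish-task instruction; task_id must match run-task."""
    return json.dumps({
        "header": {"action": "finish-task", "task_id": task_id, "streaming": "duplex"},
        "payload": {"input": {}},
    })


async def recognize(path: str) -> None:
    import websockets  # third-party; on older versions use extra_headers=

    task_id = uuid.uuid4().hex
    headers = {"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}"}
    async with websockets.connect(ENDPOINT, additional_headers=headers) as ws:
        await ws.send(run_task_message(task_id))
        # Wait for task-started before streaming audio.
        while json.loads(await ws.recv())["header"]["event"] != "task-started":
            pass
        with open(path, "rb") as f:
            while chunk := f.read(3200):  # ~100 ms of 16 kHz 16-bit mono PCM
                await ws.send(chunk)
        await ws.send(finish_task_message(task_id))
        # Print results until the task finishes or fails.
        while True:
            msg = json.loads(await ws.recv())
            event = msg["header"]["event"]
            if event == "result-generated":
                print(msg["payload"]["output"]["sentence"]["text"])
            elif event in ("task-finished", "task-failed"):
                break
```

The chunk size is a judgment call: around 100 ms of audio per frame keeps latency low without flooding the server.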
Core concepts
Interaction flow
The client and server interact in this sequence:
1. Connect: Send a WebSocket connection request with authentication in the header.
2. Start the task: Send a run-task instruction with the model and audio parameters.
3. Confirm the task: The server returns a task-started event. You can now send audio.
4. Stream audio: Send binary audio data continuously. The server returns result-generated events with intermediate and final results in real time.
5. End the task: Send a finish-task instruction after all audio is sent.
6. Confirm completion: The server returns a task-finished event after processing remaining audio.
7. Disconnect: Either side closes the WebSocket connection.
Audio requirements
- Channels: Mono only.
- Formats: pcm, wav, mp3, opus, speex, aac, amr. WAV files must use PCM encoding. Opus and Speex files must use an Ogg container. The amr format supports AMR-NB only.
- Sample rate: Must match sample_rate in the run-task instruction.
Models
| Model | Version | Unit price | Free quota |
|---|---|---|---|
| fun-asr-realtime (currently fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
- Languages: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports regional Mandarin accents (Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong-Taiwan), covering Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. English and Japanese are also supported.
- Sample rate: 16 kHz
- Audio formats: pcm, wav, mp3, opus, speex, aac, amr
API reference
Connection endpoint
Headers
| Parameter | Type | Required | Description |
|---|---|---|---|
| Authorization | string | Yes | Authentication token. Format: Bearer $DASHSCOPE_API_KEY. |
| user-agent | string | No | Client identifier. Helps the server track request sources. |
| X-DashScope-WorkSpace | string | No | Qwen Cloud workspace ID. |
| X-DashScope-DataInspection | string | No | Enable data compliance checks. Default: enable. Disable only when necessary. |
Instructions (client to server)
Instructions are JSON messages that control the task lifecycle.
1. run-task instruction: Start a task
Start a recognition task and set its parameters after connecting.
Example:
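The original example payload is not shown on this page. Reconstructed from the header and payload tables below, a run-task instruction has this shape (the task_id value is illustrative):

```json
{
  "header": {
    "action": "run-task",
    "task_id": "2bf83b9a-baeb-4fda-8d9a-f01a12345678",
    "streaming": "duplex"
  },
  "payload": {
    "task_group": "audio",
    "task": "asr",
    "function": "recognition",
    "model": "fun-asr-realtime",
    "parameters": {
      "format": "pcm",
      "sample_rate": 16000
    },
    "input": {}
  }
}
```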
header parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| header.action | string | Yes | Instruction type. Set to run-task. |
| header.task_id | string | Yes | Unique task ID. Use the same value in the finish-task instruction. |
| header.streaming | string | Yes | Communication pattern. Set to duplex. |
payload parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| payload.task_group | string | Yes | Task group. Set to audio. |
| payload.task | string | Yes | Task type. Set to asr. |
| payload.function | string | Yes | Function type. Set to recognition. |
| payload.model | string | Yes | Model name. See the model list. |
| payload.input | object | Yes | Input configuration. Set to {}. |
payload.parameters fields:

| Parameter | Type | Required | Description |
|---|---|---|---|
| format | string | Yes | Audio format: pcm, wav, mp3, opus, speex, aac, amr. See Audio requirements. |
| sample_rate | integer | Yes | Audio sample rate in Hz. fun-asr-realtime supports 16000 Hz. |
| vocabulary_id | string | No | Vocabulary ID for hotword recognition. See Customize hotwords. |
| semantic_punctuation_enabled | boolean | No | Enable semantic punctuation. Default: false. - true: High-accuracy punctuation suited for meetings. Disables VAD punctuation. - false: Low-latency VAD punctuation suited for interactive use. Semantic punctuation finds sentence boundaries more accurately. VAD responds faster. |
| max_sentence_silence | integer | No | VAD silence threshold in ms. A sentence ends when silence exceeds this value. Default: 1300. Range: [200, 6000]. Only applies when semantic_punctuation_enabled is false. |
| multi_threshold_mode_enabled | boolean | No | Prevent overly long sentences in VAD mode. Default: false. Only applies when semantic_punctuation_enabled is false. |
| heartbeat | boolean | No | Enable keep-alive. Default: false. - true: Connection stays open when you send silent audio continuously. - false: Connection times out after 60 seconds of silent audio. |
| language_hints | array[string] | No | Language codes for recognition. Leave unset for automatic detection. Supported codes: zh (Chinese), en (English), ja (Japanese). |
| speech_noise_threshold | float | No | Speech-noise detection threshold for VAD sensitivity. Range: [-1.0, 1.0]. Near -1: more noise may be transcribed as speech. Near +1: some speech may be filtered as noise. |
speech_noise_threshold is an advanced parameter. Small changes significantly affect recognition quality. Adjust it in steps of 0.1 and test thoroughly.

2. finish-task instruction: End a task
Tell the server that audio transmission is complete.
Example:
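The original example payload is not shown on this page. Reconstructed from the tables below, a finish-task instruction has this shape (the task_id must match the one from run-task; the value here is illustrative):

```json
{
  "header": {
    "action": "finish-task",
    "task_id": "2bf83b9a-baeb-4fda-8d9a-f01a12345678",
    "streaming": "duplex"
  },
  "payload": {
    "input": {}
  }
}
```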
header parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| header.action | string | Yes | Instruction type. Set to finish-task. |
| header.task_id | string | Yes | Task ID. Must match task_id from the run-task instruction. |
| header.streaming | string | Yes | Communication pattern. Set to duplex. |
payload parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| payload.input | object | Yes | Input configuration. Set to {}. |
Events (server to client)
Events are JSON messages that report task status and recognition results.
1. task-started
Returned when the server processes the run-task instruction. You can now send audio.
Example:
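The original example payload is not shown on this page. Based on the header table below, a task-started event has this shape (the task_id is illustrative, and the empty payload is an assumption since no payload fields are documented for this event):

```json
{
  "header": {
    "event": "task-started",
    "task_id": "2bf83b9a-baeb-4fda-8d9a-f01a12345678"
  },
  "payload": {}
}
```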
header parameters:
| Parameter | Type | Description |
|---|---|---|
| header.event | string | Event type. Set to task-started. |
| header.task_id | string | Task ID. |
2. result-generated
Returned when the server produces a recognition result. Contains intermediate and final sentences.
Example:
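The original example payload is not shown on this page. Reconstructed from the header and payload tables below, a result-generated event for a completed sentence has this shape (all values are illustrative):

```json
{
  "header": {
    "event": "result-generated",
    "task_id": "2bf83b9a-baeb-4fda-8d9a-f01a12345678"
  },
  "payload": {
    "output": {
      "sentence": {
        "begin_time": 170,
        "end_time": 1540,
        "text": "Hello world.",
        "words": [
          {"begin_time": 170, "end_time": 480, "text": "Hello", "punctuation": ""},
          {"begin_time": 480, "end_time": 1540, "text": "world", "punctuation": "."}
        ],
        "sentence_end": true
      }
    },
    "usage": {
      "duration": 2
    }
  }
}
```

For an intermediate result, end_time would be null, sentence_end would be false, and usage would be null.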
header parameters:
| Parameter | Type | Description |
|---|---|---|
| header.event | string | Event type. Set to result-generated. |
| header.task_id | string | Task ID. |
payload parameters:
| Parameter | Type | Description |
|---|---|---|
| output | object | output.sentence contains the recognition result. See below. |
| usage | object | null when the sentence is incomplete (sentence_end = false). When complete (sentence_end = true), usage.duration is the billable duration in seconds. |
payload.usage parameters:
| Parameter | Type | Description |
|---|---|---|
| duration | integer | Billable duration in seconds. |
payload.output.sentence parameters:
| Parameter | Type | Description |
|---|---|---|
| begin_time | integer | Sentence start time in ms. |
| end_time | integer | null | Sentence end time in ms. null for intermediate results. |
| text | string | Recognized text. |
| words | array | Word-level timestamps. |
| heartbeat | boolean | null | If true, skip this result. Matches the heartbeat setting in the run-task instruction. |
| sentence_end | boolean | Whether the sentence has ended. |
payload.output.sentence.words parameters:
| Parameter | Type | Description |
|---|---|---|
| begin_time | integer | Word start time in ms. |
| end_time | integer | Word end time in ms. |
| text | string | Recognized word. |
| punctuation | string | Trailing punctuation. |
3. task-finished
Returned after the server receives the finish-task instruction and finishes processing remaining audio.
Example:
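The original example payload is not shown on this page. Based on the header table below, a task-finished event has this shape (the task_id is illustrative, and the empty payload is an assumption since no payload fields are documented for this event):

```json
{
  "header": {
    "event": "task-finished",
    "task_id": "2bf83b9a-baeb-4fda-8d9a-f01a12345678"
  },
  "payload": {}
}
```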
header parameters:
| Parameter | Type | Description |
|---|---|---|
| header.event | string | Event type. Set to task-finished. |
| header.task_id | string | Task ID. |
4. task-failed
Returned when an error occurs during task processing. Close the connection and handle the error.
Example:
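The original example payload is not shown on this page. Based on the header table below, a task-failed event has this shape (the task_id, error_code, and error_message values are illustrative, not actual server output):

```json
{
  "header": {
    "event": "task-failed",
    "task_id": "2bf83b9a-baeb-4fda-8d9a-f01a12345678",
    "error_code": "CLIENT_ERROR",
    "error_message": "request timeout"
  },
  "payload": {}
}
```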
header parameters:
| Parameter | Type | Description |
|---|---|---|
| header.event | string | Event type. Set to task-failed. |
| header.task_id | string | Task ID. |
| header.error_code | string | Error type. |
| header.error_message | string | Error details. |
Connection reuse
You can reuse a WebSocket connection across tasks. After the server returns a task-finished event, send another run-task instruction on the same connection.
- Each task on a reused connection must have a unique task_id.
- Failed tasks trigger a task-failed event and close the connection (no reuse).
- Connections time out after 60 seconds of inactivity.