
Fun-ASR WebSocket API


Use the WebSocket API to connect to Fun-ASR real-time speech recognition from any programming language. For easier integration, use the Python SDK or Java SDK. For model details and selection, see Realtime speech recognition.

Getting started

Prerequisites

  1. Get an API key and export it as an environment variable.
  2. Download the sample audio: asr_example.wav.

Sample code

The sample below uses Node.js; C#, PHP, and Go samples are also available.
Install dependencies:
npm install ws
npm install uuid
Sample code:
const fs = require('fs');
const WebSocket = require('ws');
const { v4: uuidv4 } = require('uuid'); // Used to generate a UUID

// If you have not configured environment variables, replace the following line with your API key: const apiKey = "sk-xxx"
const apiKey = process.env.DASHSCOPE_API_KEY;
const url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference/'; // WebSocket server address
const audioFile = 'asr_example.wav'; // Replace with the path to your audio file

// Generate a 32-character random task ID
const TASK_ID = uuidv4().replace(/-/g, '').slice(0, 32);

// Create a WebSocket client
const ws = new WebSocket(url, {
  headers: {
    Authorization: `Bearer ${apiKey}`
  }
});

let taskStarted = false; // A flag that indicates whether the task has started

// Send the run-task instruction when the connection is opened
ws.on('open', () => {
  console.log('Connected to the server');
  sendRunTask();
});

// Process received messages
ws.on('message', (data) => {
  const message = JSON.parse(data);
  switch (message.header.event) {
    case 'task-started':
      console.log('The task has started');
      taskStarted = true;
      sendAudioStream();
      break;
    case 'result-generated':
      console.log('Recognition result:', message.payload.output.sentence.text);
      if (message.payload.usage) {
        console.log('Billable duration of the task (in seconds):', message.payload.usage.duration);
      }
      break;
    case 'task-finished':
      console.log('The task is complete');
      ws.close();
      break;
    case 'task-failed':
      console.error('The task failed:', message.header.error_message);
      ws.close();
      break;
    default:
      console.log('Unknown event:', message.header.event);
  }
});

// Warn if the connection closes before the task has started
ws.on('close', () => {
  if (!taskStarted) {
    console.error('The connection closed before the task started.');
  }
});

// Send the run-task instruction
function sendRunTask() {
  const runTaskMessage = {
    header: {
      action: 'run-task',
      task_id: TASK_ID,
      streaming: 'duplex'
    },
    payload: {
      task_group: 'audio',
      task: 'asr',
      function: 'recognition',
      model: 'fun-asr-realtime',
      parameters: {
        sample_rate: 16000,
        format: 'wav'
      },
      input: {}
    }
  };
  ws.send(JSON.stringify(runTaskMessage));
}

// Send the audio stream in paced, fixed-size chunks
function sendAudioStream() {
  // 3200 bytes = 100 ms of 16 kHz, 16-bit mono PCM
  const audioStream = fs.createReadStream(audioFile, { highWaterMark: 3200 });

  audioStream.on('data', (chunk) => {
    ws.send(chunk);
    audioStream.pause();
    setTimeout(() => audioStream.resume(), 100); // send one chunk every 100 ms
  });

  audioStream.on('end', () => {
    console.log('The audio stream has ended');
    sendFinishTask();
  });

  audioStream.on('error', (err) => {
    console.error('Error reading the audio file:', err);
    ws.close();
  });
}

// Send the finish-task instruction
function sendFinishTask() {
  const finishTaskMessage = {
    header: {
      action: 'finish-task',
      task_id: TASK_ID,
      streaming: 'duplex'
    },
    payload: {
      input: {}
    }
  };
  ws.send(JSON.stringify(finishTaskMessage));
}

// Handle errors
ws.on('error', (error) => {
  console.error('WebSocket error:', error);
});

Core concepts

Interaction flow

The client and server interact in this sequence:
  1. Connect: Send a WebSocket connection request with authentication in the header.
  2. Start the task: Send a run-task instruction with the model and audio parameters.
  3. Confirm the task: The server returns a task-started event. You can now send audio.
  4. Stream audio: Send binary audio data continuously. The server returns result-generated events with intermediate and final results in real time.
  5. End the task: Send a finish-task instruction after all audio is sent.
  6. Confirm completion: The server returns a task-finished event after processing remaining audio.
  7. Disconnect: Either side closes the WebSocket connection.

Audio requirements

  • Channels: Mono only.
  • Formats: pcm, wav, mp3, opus, speex, aac, amr. WAV files must use PCM encoding. Opus and Speex files must use an Ogg container. The amr format supports AMR-NB only.
  • Sample rate: Must match sample_rate in the run-task instruction.
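To catch format mismatches before opening a connection, you can inspect the WAV header locally. The sketch below assumes a standard 44-byte RIFF header with the fmt chunk immediately after the WAVE tag (the common layout, though not guaranteed for every file); it is an illustrative check, not part of the API.

```javascript
// Sketch: validate a WAV file against the audio requirements above
// (PCM encoding, mono, sample rate matching the run-task sample_rate).
// Assumes the fmt chunk starts at byte 12 of a standard RIFF header.
function checkWavHeader(buf, expectedRate) {
  if (buf.length < 44 || buf.toString('ascii', 0, 4) !== 'RIFF' ||
      buf.toString('ascii', 8, 12) !== 'WAVE') {
    return { ok: false, reason: 'not a RIFF/WAVE file' };
  }
  const audioFormat = buf.readUInt16LE(20); // 1 = PCM
  const channels = buf.readUInt16LE(22);
  const sampleRate = buf.readUInt32LE(24);
  if (audioFormat !== 1) return { ok: false, reason: 'WAV must use PCM encoding' };
  if (channels !== 1) return { ok: false, reason: 'audio must be mono' };
  if (sampleRate !== expectedRate) {
    return { ok: false, reason: `sample rate ${sampleRate} does not match ${expectedRate}` };
  }
  return { ok: true };
}
```

Call it on the first 44 bytes of the file (for example, via fs.readSync) before sending the run-task instruction.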

Models

Model: fun-asr-realtime (Stable; currently points to fun-asr-realtime-2025-11-07)
Snapshot: fun-asr-realtime-2025-11-07
Unit price: $0.00009/second
Free quota: 36,000 seconds (10 hours), valid for 90 days
  • Languages: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from the Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong-Taiwan regions, including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. English and Japanese are also supported.
  • Sample rate: 16 kHz
  • Audio formats: pcm, wav, mp3, opus, speex, aac, amr
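Billing follows directly from the listed unit price and free quota. A minimal sketch of the arithmetic (the helper name and quota handling are illustrative; consult your billing console for authoritative figures):

```javascript
// Sketch: estimate cost from billable seconds at the listed unit price,
// assuming the free quota is consumed before any charges apply.
const UNIT_PRICE = 0.00009;        // USD per second (fun-asr-realtime)
const FREE_QUOTA_SECONDS = 36000;  // 10 hours, valid for 90 days

function estimateCost(billableSeconds, freeQuotaSeconds = FREE_QUOTA_SECONDS) {
  const charged = Math.max(0, billableSeconds - freeQuotaSeconds);
  return charged * UNIT_PRICE;
}
```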

API reference

Connection endpoint

wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference

Headers

  • Authorization (string, required): Authentication token. Format: Bearer $DASHSCOPE_API_KEY.
  • user-agent (string, optional): Client identifier. Helps the server track request sources.
  • X-DashScope-WorkSpace (string, optional): Qwen Cloud workspace ID.
  • X-DashScope-DataInspection (string, optional): Enable data compliance checks. Default: enable. Disable only when necessary.
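The headers above can be assembled before opening the connection. A minimal sketch (the helper and its option names are illustrative, not part of any SDK):

```javascript
// Sketch: build the WebSocket connection headers described above.
// Only Authorization is required; the rest are optional.
function buildHeaders(apiKey, options = {}) {
  const headers = { Authorization: `Bearer ${apiKey}` }; // note the "Bearer " prefix
  if (options.workspaceId) headers['X-DashScope-WorkSpace'] = options.workspaceId;
  if (options.dataInspection) headers['X-DashScope-DataInspection'] = options.dataInspection;
  return headers;
}
```

Pass the result as the headers option when constructing the ws client, as in the sample code above.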

Instructions (client to server)

Instructions are JSON messages that control the task lifecycle.

1. run-task instruction: Start a task

Start a recognition task and set its parameters after connecting. Example:
{
  "header": {
    "action": "run-task",
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "streaming": "duplex"
  },
  "payload": {
    "task_group": "audio",
    "task": "asr",
    "function": "recognition",
    "model": "fun-asr-realtime",
    "parameters": {
      "format": "pcm",
      "sample_rate": 16000,
      "vocabulary_id": "vocab-xxx-24ee19fa8cfb4d52902170a0xxxxxxxx"
    },
    "input": {}
  }
}
header parameters:
  • header.action (string, required): Instruction type. Set to run-task.
  • header.task_id (string, required): Unique task ID. Use the same value in the finish-task instruction.
  • header.streaming (string, required): Communication pattern. Set to duplex.
payload parameters:
  • payload.task_group (string, required): Task group. Set to audio.
  • payload.task (string, required): Task type. Set to asr.
  • payload.function (string, required): Function type. Set to recognition.
  • payload.model (string, required): Model name. See the model list.
  • payload.input (object, required): Input configuration. Set to {}.
payload.parameters:
  • format (string, required): Audio format: pcm, wav, mp3, opus, speex, aac, amr. See Audio requirements.
  • sample_rate (integer, required): Audio sample rate in Hz. fun-asr-realtime supports 16000 Hz.
  • vocabulary_id (string, optional): Vocabulary ID for hotword recognition. See Customize hotwords.
  • semantic_punctuation_enabled (boolean, optional): Enable semantic punctuation. Default: false.
    - true: High-accuracy punctuation suited for meetings. Disables VAD punctuation.
    - false: Low-latency VAD punctuation suited for interactive use.
    Semantic punctuation finds sentence boundaries more accurately; VAD responds faster.
  • max_sentence_silence (integer, optional): VAD silence threshold in ms. A sentence ends when silence exceeds this value. Default: 1300. Range: [200, 6000]. Applies only when semantic_punctuation_enabled is false.
  • multi_threshold_mode_enabled (boolean, optional): Prevent overly long sentences in VAD mode. Default: false. Applies only when semantic_punctuation_enabled is false.
  • heartbeat (boolean, optional): Enable keep-alive. Default: false.
    - true: The connection stays open while you send silent audio continuously.
    - false: The connection times out after 60 seconds of silent audio.
  • language_hints (array[string], optional): Language codes for recognition. Leave unset for automatic detection. Supported codes: zh (Chinese), en (English), ja (Japanese).
  • speech_noise_threshold (float, optional): Speech-noise detection threshold for VAD sensitivity. Range: [-1.0, 1.0]. Near -1: more noise may be transcribed as speech. Near +1: some speech may be filtered as noise.
    speech_noise_threshold is an advanced parameter. Small changes significantly affect recognition quality. Adjust in 0.1 steps and test thoroughly.
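Because an out-of-range parameter surfaces only as a task-failed event, it can help to validate the payload.parameters object client-side first. A sketch using the documented ranges (the validator itself is illustrative, not part of the API):

```javascript
// Sketch: check optional run-task parameters against the documented
// constraints before sending, returning a list of problems found.
function validateParameters(p) {
  const errors = [];
  const formats = ['pcm', 'wav', 'mp3', 'opus', 'speex', 'aac', 'amr'];
  if (!formats.includes(p.format)) errors.push(`unsupported format: ${p.format}`);
  if (p.sample_rate !== 16000) errors.push('fun-asr-realtime supports 16000 Hz only');
  if (p.max_sentence_silence !== undefined &&
      (p.max_sentence_silence < 200 || p.max_sentence_silence > 6000)) {
    errors.push('max_sentence_silence must be in [200, 6000]');
  }
  if (p.speech_noise_threshold !== undefined &&
      (p.speech_noise_threshold < -1.0 || p.speech_noise_threshold > 1.0)) {
    errors.push('speech_noise_threshold must be in [-1.0, 1.0]');
  }
  return errors;
}
```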

2. finish-task instruction: End a task

Tell the server that audio transmission is complete. Example:
{
  "header": {
    "action": "finish-task",
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "streaming": "duplex"
  },
  "payload": {
    "input": {}
  }
}
header parameters:
  • header.action (string, required): Instruction type. Set to finish-task.
  • header.task_id (string, required): Task ID. Must match task_id from the run-task instruction.
  • header.streaming (string, required): Communication pattern. Set to duplex.
payload parameters:
  • payload.input (object, required): Input configuration. Set to {}.

Events (server to client)

Events are JSON messages that report task status and recognition results.

1. task-started

Returned when the server processes the run-task instruction. You can now send audio. Example:
{
  "header": {
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "event": "task-started",
    "attributes": {}
  },
  "payload": {}
}
header parameters:
  • header.event (string): Event type. Set to task-started.
  • header.task_id (string): Task ID.

2. result-generated

Returned when the server produces a recognition result. Contains intermediate and final sentences. Example:
{
  "header": {
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "event": "result-generated",
    "attributes": {}
  },
  "payload": {
    "output": {
      "sentence": {
        "begin_time": 170,
        "end_time": 920,
        "text": "Okay, I got it",
        "heartbeat": false,
        "sentence_end": true,
        "words": [
          {
            "begin_time": 170,
            "end_time": 295,
            "text": "Okay",
            "punctuation": ","
          },
          {
            "begin_time": 295,
            "end_time": 503,
            "text": "I",
            "punctuation": ""
          },
          {
            "begin_time": 503,
            "end_time": 711,
            "text": "got",
            "punctuation": ""
          },
          {
            "begin_time": 711,
            "end_time": 920,
            "text": "it",
            "punctuation": ""
          }
        ]
      }
    },
    "usage": {
      "duration": 3
    }
  }
}
header parameters:
  • header.event (string): Event type. Set to result-generated.
  • header.task_id (string): Task ID.
payload parameters:
  • output (object): output.sentence contains the recognition result. See below.
  • usage (object): null when the sentence is incomplete (sentence_end = false). When complete (sentence_end = true), usage.duration is the billable duration in seconds.
payload.usage parameters:
  • duration (integer): Billable duration in seconds.
payload.output.sentence parameters:
  • begin_time (integer): Sentence start time in ms.
  • end_time (integer | null): Sentence end time in ms. null for intermediate results.
  • text (string): Recognized text.
  • words (array): Word-level timestamps.
  • heartbeat (boolean | null): If true, skip this result. Corresponds to the heartbeat setting in the run-task instruction.
  • sentence_end (boolean): Whether the sentence has ended.
payload.output.sentence.words parameters:
  • begin_time (integer): Word start time in ms.
  • end_time (integer): Word end time in ms.
  • text (string): Recognized word.
  • punctuation (string): Trailing punctuation.
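The field semantics above imply a simple rule for building a final transcript: keep only completed sentences and ignore heartbeat results. A minimal sketch (the helper name is illustrative):

```javascript
// Sketch: assemble a transcript from a stream of server events.
// Skips heartbeat results and intermediate (sentence_end = false)
// updates, which are superseded by the final sentence.
function collectTranscript(events) {
  const sentences = [];
  for (const msg of events) {
    if (msg.header.event !== 'result-generated') continue;
    const s = msg.payload.output.sentence;
    if (s.heartbeat) continue;     // keep-alive result for silent audio
    if (!s.sentence_end) continue; // intermediate result, will be revised
    sentences.push(s.text);
  }
  return sentences.join(' ');
}
```

In practice you would call this incrementally from the result-generated handler rather than buffering all events.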

3. task-finished

Returned after the server receives the finish-task instruction and finishes processing remaining audio. Example:
{
  "header": {
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "event": "task-finished",
    "attributes": {}
  },
  "payload": {
    "output": {}
  }
}
header parameters:
  • header.event (string): Event type. Set to task-finished.
  • header.task_id (string): Task ID.

4. task-failed

Returned when an error occurs during task processing. Close the connection and handle the error. Example:
{
  "header": {
    "task_id": "2bf83b9a-baeb-4fda-8d9a-xxxxxxxxxxxx",
    "event": "task-failed",
    "error_code": "CLIENT_ERROR",
    "error_message": "request timeout after 23 seconds.",
    "attributes": {}
  },
  "payload": {}
}
header parameters:
  • header.event (string): Event type. Set to task-failed.
  • header.task_id (string): Task ID.
  • header.error_code (string): Error type.
  • header.error_message (string): Error details.

Connection reuse

You can reuse a WebSocket connection across tasks. After the server returns a task-finished event, send another run-task instruction on the same connection.
  1. Each task on a reused connection must have a unique task_id.
  2. Failed tasks trigger a task-failed event and close the connection (no reuse).
  3. Connections time out after 60 seconds of inactivity.