Real-time ASR Python SDK
User guide: For model selection, see Real-time speech recognition.
Prerequisites
- Create an API key and export it as an environment variable. Do not hard-code it.
- Install the latest DashScope SDK.
- For microphone examples, install pyaudio:
  pip install pyaudio
  Note: pyaudio depends on the portaudio library. On Ubuntu/Debian: sudo apt-get install libportaudio2 portaudio19-dev. On macOS: brew install portaudio.
Model availability
| Model | Version | Unit price | Free quota |
|---|---|---|---|
| fun-asr-realtime (currently equivalent to fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
- Languages: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from the Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong-Taiwan regions, including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. English and Japanese are also supported.
- Sample rate: 16 kHz
- Audio formats: pcm, wav, mp3, opus, speex, aac, amr
Getting started
The Recognition class supports both non-streaming and bidirectional streaming calls.
- Non-streaming call: Recognizes a local file and returns the complete result at once.
- Bidirectional streaming call: Recognizes an audio stream and returns results in real time. The stream can come from a microphone or a local file.
Non-streaming call
Submit a speech-to-text task for a single audio file. This call blocks until the result is returned.
Instantiate the Recognition class, set the request parameters, and call the call method to obtain the recognition result (RecognitionResult).
The example uses this audio file: asr_example.wav.
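A minimal sketch of this flow, assuming DASHSCOPE_API_KEY is exported and asr_example.wav is in the working directory. The extract_texts helper is illustrative, not part of the SDK, and the service call is guarded so it only runs when the API key is set:

```python
import os

def extract_texts(sentences):
    """Collect recognized text from a get_sentence() return value.
    Illustrative helper, not part of the SDK."""
    if isinstance(sentences, dict):  # callbacks deliver a single sentence dict
        sentences = [sentences]
    return [s.get('text', '') for s in sentences or []]

if os.getenv('DASHSCOPE_API_KEY'):  # read from the environment; never hard-code
    from dashscope.audio.asr import Recognition

    recognition = Recognition(
        model='fun-asr-realtime',
        format='wav',
        sample_rate=16000,
        callback=None,  # no callback is needed for a blocking call
    )
    result = recognition.call('asr_example.wav')  # blocks until done
    for text in extract_texts(result.get_sentence()):
        print(text)
```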
Bidirectional streaming call
Submit a speech-to-text task and receive results via callback.
1. Start streaming recognition: instantiate the Recognition class, configure the request parameters and the callback (RecognitionCallback), and call start.
2. Send audio: call send_audio_frame repeatedly to send binary audio data from a local file or a device (such as a microphone). The server returns results in real time through the on_event callback. Each audio frame should contain about 100 ms of audio and be 1-16 KB in size.
3. Stop recognition: call stop to end recognition. This call blocks until on_complete or on_error is triggered.

The complete examples cover two scenarios:
- Recognize speech from a microphone
- Recognize a local audio file
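The steps above can be sketched for the local-file scenario. This is a sketch, not the official example: frame_bytes and PrintingCallback are illustrative names, the callback overrides only on_event (assuming RecognitionCallback provides defaults for the other methods), and the service calls are guarded behind DASHSCOPE_API_KEY:

```python
import os
import time

def frame_bytes(sample_rate, seconds=0.1, sample_width=2, channels=1):
    """Bytes per audio frame. ~100 ms per frame is recommended (1-16 KB)."""
    return int(sample_rate * seconds) * sample_width * channels

if os.getenv('DASHSCOPE_API_KEY'):  # only contact the service when a key is set
    from dashscope.audio.asr import Recognition, RecognitionCallback, RecognitionResult

    class PrintingCallback(RecognitionCallback):
        def on_event(self, result: RecognitionResult) -> None:
            sentence = result.get_sentence()
            if RecognitionResult.is_sentence_end(sentence):
                print(sentence['text'])

    recognition = Recognition(
        model='fun-asr-realtime',
        format='wav',
        sample_rate=16000,
        callback=PrintingCallback(),
    )
    recognition.start()            # non-blocking
    chunk = frame_bytes(16000)     # 3200 bytes, roughly 100 ms of 16-bit mono audio
    with open('asr_example.wav', 'rb') as f:
        while data := f.read(chunk):
            recognition.send_audio_frame(data)
            time.sleep(0.1)        # pace the stream roughly in real time
    recognition.stop()             # blocks until on_complete or on_error
```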
Request parameters
Set request parameters in the Recognition class constructor (__init__).
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| model | str | - | Yes | Model for real-time speech recognition. |
| sample_rate | int | - | Yes | Audio sample rate in Hz. Supports 16000 Hz. |
| format | str | - | Yes | Audio format: pcm, wav, mp3, opus, speex, aac, amr. Important: opus/speex must be Ogg-encapsulated. wav must be PCM-encoded. amr supports only AMR-NB. |
| vocabulary_id | str | - | No | Vocabulary ID for hotword customization. See Customize hotwords. |
| semantic_punctuation_enabled | bool | False | No | Enable semantic punctuation. When true, sentence segmentation is driven by semantics; when false (default), it is driven by VAD silence detection. |
| max_sentence_silence | int | 1300 | No | Silence threshold for VAD sentence segmentation, in ms. Sentences end when silence exceeds this value. Range: 200-6000 ms. Only applies when semantic_punctuation_enabled is false. |
| multi_threshold_mode_enabled | bool | False | No | Prevents VAD from creating excessively long segments. Only applies when semantic_punctuation_enabled is false. |
| punctuation_prediction_enabled | bool | True | No | Add punctuation to results automatically. |
| heartbeat | bool | False | No | Maintain a persistent server connection. true: keep the connection alive during periods when no audio is sent. false: the server may close the connection after a period of inactivity. |
| language_hints | list[str] | ["zh", "en"] | No | Language codes for recognition, for example zh (Chinese), en (English), and ja (Japanese). Leave unset for automatic detection. |
| speech_noise_threshold | float | - | No | Speech-noise detection threshold controlling VAD sensitivity. Range: [-1.0, 1.0]. Values closer to -1.0 make audio more likely to be judged as speech; values closer to 1.0 make it more likely to be judged as noise. |
| callback | RecognitionCallback | - | No | The RecognitionCallback interface. |
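A sketch of how several optional parameters might be combined in the constructor. The validate_max_sentence_silence helper is illustrative, not part of the SDK; it simply enforces the documented 200-6000 ms range. The constructor call is guarded behind DASHSCOPE_API_KEY:

```python
import os

def validate_max_sentence_silence(ms):
    """Enforce the documented 200-6000 ms range for max_sentence_silence.
    Illustrative helper, not part of the SDK."""
    if not 200 <= ms <= 6000:
        raise ValueError('max_sentence_silence must be within 200-6000 ms')
    return ms

if os.getenv('DASHSCOPE_API_KEY'):
    from dashscope.audio.asr import Recognition

    recognition = Recognition(
        model='fun-asr-realtime',
        format='pcm',
        sample_rate=16000,
        max_sentence_silence=validate_max_sentence_silence(800),
        punctuation_prediction_enabled=True,
        language_hints=['zh', 'en'],
        callback=None,
    )
```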
Key interfaces
Recognition class
Import with from dashscope.audio.asr import *.
| Member method | Method signature | Description |
|---|---|---|
| call | def call(self, file: str, phrase_id: str = None, **kwargs) -> RecognitionResult | Run non-streaming recognition on a local file. Blocks until processing completes, then returns RecognitionResult. |
| start | def start(self, phrase_id: str = None, **kwargs) | Start streaming recognition. Non-blocking. Use with send_audio_frame and stop. |
| send_audio_frame | def send_audio_frame(self, buffer: bytes) | Send an audio frame (~100 ms, 1-16 KB per packet). Get results via the on_event callback of RecognitionCallback. |
| stop | def stop(self) | Stop recognition. Blocks until all audio is processed. |
| get_last_request_id | def get_last_request_id(self) | Return the request ID. Available after the Recognition object is created. |
| get_first_package_delay | def get_first_package_delay(self) | Return the first-packet latency (time from first audio packet sent to first result received). Available after the task completes. |
| get_last_package_delay | def get_last_package_delay(self) | Return the last-packet latency (time from stop to final result). Available after the task completes. |
Callback interface (RecognitionCallback)
In bidirectional streaming, the server returns data via callbacks. Implement a callback to handle responses.
| Method | Parameter | Return value | Description |
|---|---|---|---|
| def on_open(self) -> None | None | None | Called when a server connection is established. |
| def on_event(self, result: RecognitionResult) -> None | result: RecognitionResult | None | Called when a recognition result is returned. |
| def on_complete(self) -> None | None | None | Called after all results are returned. |
| def on_error(self, result: RecognitionResult) -> None | result: RecognitionResult | None | Called when an error occurs. |
| def on_close(self) -> None | None | None | Called when the connection closes. |
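The interface above can be implemented as follows (a sketch: format_sentence is an illustrative helper, not part of the SDK, and the class definition is guarded behind DASHSCOPE_API_KEY so the snippet loads without the SDK installed):

```python
import os

def format_sentence(sentence):
    """Render a sentence dict (see Sentence information below) for display.
    Illustrative helper, not part of the SDK."""
    return '[{}-{} ms] {}'.format(
        sentence.get('begin_time'), sentence.get('end_time'), sentence.get('text', ''))

if os.getenv('DASHSCOPE_API_KEY'):
    from dashscope.audio.asr import RecognitionCallback, RecognitionResult

    class MyCallback(RecognitionCallback):
        def on_open(self) -> None:
            print('connection established')

        def on_event(self, result: RecognitionResult) -> None:
            sentence = result.get_sentence()
            if RecognitionResult.is_sentence_end(sentence):
                print(format_sentence(sentence))

        def on_complete(self) -> None:
            print('all results received')

        def on_error(self, result: RecognitionResult) -> None:
            print('error:', result)

        def on_close(self) -> None:
            print('connection closed')
```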
Response
Recognition result (RecognitionResult)
RecognitionResult represents the result of a streaming call or a non-streaming call.
| Member method | Method signature | Description |
|---|---|---|
| get_sentence | def get_sentence(self) -> Union[Dict[str, Any], List[Any]] | Return the current recognized sentence with timestamps. In a callback, returns a single sentence as Dict[str, Any]. See Sentence information. |
| get_request_id | def get_request_id(self) -> str | Return the request ID. |
| get_usage | def get_usage(self, sentence: Dict[str, Any]) -> Dict | Return usage information for the sentence. |
| is_sentence_end | @staticmethod def is_sentence_end(sentence: Dict[str, Any]) -> bool | Check whether the sentence has ended. |
Sentence information (Sentence)
| Parameter | Type | Description |
|---|---|---|
| begin_time | int | Start time of the sentence, in ms. |
| end_time | int | End time of the sentence, in ms. |
| text | str | Recognized text. |
| words | A list of Word objects | Word timestamp information. |
Word timestamp information (Word)
| Parameter | Type | Description |
|---|---|---|
| begin_time | int | Start time of the word, in ms. |
| end_time | int | End time of the word, in ms. |
| text | str | The word. |
| punctuation | str | The punctuation mark. |
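The Sentence and Word structures above can be traversed with a small helper (a sketch; word_timeline and the sample dict are illustrative, not part of the SDK, but the field names match the tables above):

```python
def word_timeline(sentence):
    """Flatten a Sentence dict into (text, begin_time, end_time) tuples,
    using the Word fields documented above. Illustrative helper."""
    return [(w.get('text', ''), w.get('begin_time'), w.get('end_time'))
            for w in sentence.get('words', [])]

# A sentence shaped like the structures documented above:
sentence = {
    'begin_time': 0,
    'end_time': 1200,
    'text': 'hello world',
    'words': [
        {'begin_time': 0, 'end_time': 500, 'text': 'hello', 'punctuation': ''},
        {'begin_time': 500, 'end_time': 1200, 'text': 'world', 'punctuation': '.'},
    ],
}
print(word_timeline(sentence))  # [('hello', 0, 500), ('world', 500, 1200)]
```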