Fun-ASR Real-time ASR Python SDK

User guide: For model selection, see Real-time speech recognition.

Prerequisites

Model availability

Model | Version | Unit price | Free quota (Note)
fun-asr-realtime (currently points to fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days
fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days
  • Languages: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from the Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong-Taiwan regions, including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. English and Japanese are also supported.
  • Sample rate: 16 kHz
  • Audio formats: pcm, wav, mp3, opus, speex, aac, amr
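Before sending audio, you may want to verify that a local wav file already matches these requirements. A minimal sketch using only the standard library (the file path is whatever you pass in; files that fail the check should be resampled or converted first):

```python
import wave

def check_wav(path: str) -> bool:
  """Return True if the wav file is 16 kHz, mono, 16-bit PCM."""
  with wave.open(path, 'rb') as wf:
    return (wf.getframerate() == 16000
            and wf.getnchannels() == 1
            and wf.getsampwidth() == 2)  # 2 bytes per sample = 16-bit
```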

Getting started

The Recognition class supports both non-streaming and bidirectional streaming calls.
  • Non-streaming call: Recognizes a local file and returns the complete result at once.
  • Bidirectional streaming call: Recognizes an audio stream and returns results in real time. The stream can come from a microphone or a local file.

Non-streaming call

Submit a speech-to-text task for a single audio file. This call blocks until the result is returned. Instantiate the Recognition class, set the request parameters, and invoke the call method to get the recognition result (RecognitionResult).
The example uses this audio file: asr_example.wav.
from http import HTTPStatus
import dashscope
from dashscope.audio.asr import Recognition
import os

# If you have not configured an environment variable, replace the following line with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

recognition = Recognition(model='fun-asr-realtime',
                          format='wav',
                          sample_rate=16000,
                          callback=None)
result = recognition.call('asr_example.wav')
if result.status_code == HTTPStatus.OK:
  print('Recognition result:')
  print(result.get_sentence())
else:
  print('Error: ', result.message)

print(
  '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
  .format(
    recognition.get_last_request_id(),
    recognition.get_first_package_delay(),
    recognition.get_last_package_delay(),
  ))
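For a non-streaming call, get_sentence() returns a list with one dict per sentence, each carrying begin_time, end_time (in ms), and text (see Sentence information later in this page). A small helper to render that list as a time-stamped transcript, sketched under that assumption:

```python
def format_transcript(sentences):
  """Render sentence dicts as 'mm:ss.mmm - mm:ss.mmm  text' lines."""
  def ts(ms):
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    return f'{m:02d}:{s:02d}.{ms:03d}'
  return '\n'.join(f'{ts(x["begin_time"])} - {ts(x["end_time"])}  {x["text"]}'
                   for x in sentences)
```

For example, format_transcript(result.get_sentence()) after a successful call prints one line per recognized sentence.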

Bidirectional streaming call

Submit a speech-to-text task and receive results via callback.
1. Start streaming recognition
Instantiate the Recognition class, configure the request parameters and callback (RecognitionCallback), and call start.

2. Send audio
Call send_audio_frame repeatedly to send binary audio data from a local file or a device such as a microphone. The server returns results in real time through the on_event callback. Each audio segment should be ~100 ms and 1-16 KB.

3. Stop recognition
Call stop to end recognition. This blocks until on_complete or on_error is triggered.
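The ~100 ms guidance maps to a concrete byte count once the PCM parameters are fixed. A sketch assuming the 16 kHz, 16-bit, mono format used throughout this page:

```python
SAMPLE_RATE = 16000   # samples per second
BYTES_PER_SAMPLE = 2  # 16-bit PCM
CHANNELS = 1          # mono

def chunk_bytes(duration_ms: int = 100) -> int:
  """Bytes of raw PCM covering duration_ms of audio."""
  return SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS * duration_ms // 1000
```

chunk_bytes(100) evaluates to 3200 bytes, comfortably inside the 1-16 KB window.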
The following example recognizes speech from a microphone. The same call pattern applies to recognizing a local audio file.
import os
import signal  # handle Ctrl+C to stop recording
import sys

import dashscope
import pyaudio
from dashscope.audio.asr import *

mic = None
stream = None

# Set recording parameters
sample_rate = 16000  # sampling rate (Hz)
channels = 1  # mono channel
dtype = 'int16'  # data type
format_pcm = 'pcm'  # the format of the audio data
block_size = 3200  # number of frames per buffer


# Real-time speech recognition callback
class Callback(RecognitionCallback):
  def on_open(self) -> None:
    global mic
    global stream
    print('RecognitionCallback open.')
    mic = pyaudio.PyAudio()
    stream = mic.open(format=pyaudio.paInt16,
                      channels=1,
                      rate=16000,
                      input=True)

  def on_close(self) -> None:
    global mic
    global stream
    print('RecognitionCallback close.')
    stream.stop_stream()
    stream.close()
    mic.terminate()
    stream = None
    mic = None

  def on_complete(self) -> None:
    print('RecognitionCallback completed.')  # recognition completed

  def on_error(self, message) -> None:
    print('RecognitionCallback task_id: ', message.request_id)
    print('RecognitionCallback error: ', message.message)
    # Stop and close the audio stream if it is running
    if stream is not None and stream.is_active():
      stream.stop_stream()
      stream.close()
    # Forcefully exit the program
    sys.exit(1)

  def on_event(self, result: RecognitionResult) -> None:
    sentence = result.get_sentence()
    if 'text' in sentence:
      print('RecognitionCallback text: ', sentence['text'])
      if RecognitionResult.is_sentence_end(sentence):
        print(
          'RecognitionCallback sentence end, request_id:%s, usage:%s'
          % (result.get_request_id(), result.get_usage(sentence)))


def signal_handler(sig, frame):
  print('Ctrl+C pressed, stop recognition ...')
  # Stop recognition
  recognition.stop()
  print('Recognition stopped.')
  print(
    '[Metric] requestId: {}, first package delay ms: {}, last package delay ms: {}'
    .format(
      recognition.get_last_request_id(),
      recognition.get_first_package_delay(),
      recognition.get_last_package_delay(),
    ))
  # Forcefully exit the program
  sys.exit(0)


# main function
if __name__ == '__main__':
  # If you have not configured an environment variable, replace the following line with your API key: dashscope.api_key = "sk-xxx"
  dashscope.api_key = os.environ.get('DASHSCOPE_API_KEY')

  dashscope.base_websocket_api_url='wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

  # Create the recognition callback
  callback = Callback()

  # Call the recognition service in asynchronous mode. You can customize
  # recognition parameters such as model, format, and sample_rate.
  recognition = Recognition(
    model='fun-asr-realtime',
    format=format_pcm,
    # 'pcm', 'wav', 'opus', 'speex', 'aac', 'amr'. You can check the supported formats in the document.
    sample_rate=sample_rate,
    # Supports 16000 Hz.
    semantic_punctuation_enabled=False,
    callback=callback)

  # Start recognition
  recognition.start()

  signal.signal(signal.SIGINT, signal_handler)
  print("Press 'Ctrl+C' to stop recording and recognition...")
  # Read audio from the microphone in a loop until "Ctrl+C" is pressed

  while True:
    if stream:
      data = stream.read(block_size, exception_on_overflow=False)
      recognition.send_audio_frame(data)
    else:
      break

  recognition.stop()
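For the local-file variant, the same send_audio_frame loop applies, with chunks read from disk and paced to roughly real time. A hedged sketch (the helper name, file path, and pacing are illustrative; send stands in for recognition.send_audio_frame, called between start() and stop()):

```python
import time

def stream_pcm_file(path, send, chunk_size=3200, pace_s=0.1):
  """Read raw PCM in chunk_size-byte pieces and pass each to send()."""
  with open(path, 'rb') as f:
    while True:
      chunk = f.read(chunk_size)
      if not chunk:
        break
      send(chunk)
      time.sleep(pace_s)  # approximate real-time pacing for 100 ms chunks
```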

Request parameters

Set request parameters in the Recognition class constructor (__init__).
Parameter | Type | Default | Required | Description
model | str | - | Yes | Model for real-time speech recognition.
sample_rate | int | - | Yes | Audio sample rate in Hz. Supports 16000 Hz.
format | str | - | Yes | Audio format: pcm, wav, mp3, opus, speex, aac, amr.
Important: opus/speex must be Ogg-encapsulated, wav must be PCM-encoded, and amr supports only AMR-NB.
vocabulary_id | str | - | No | Vocabulary ID for hotword customization. See Customize hotwords.
semantic_punctuation_enabled | bool | False | No | Enable semantic punctuation.
  • true: Uses semantic punctuation (disables VAD-based punctuation). Higher accuracy for conference transcription.
  • false (default): Uses VAD-based punctuation (disables semantic punctuation). Lower latency for interactive scenarios.
max_sentence_silence | int | 1300 | No | Silence threshold for VAD sentence segmentation, in ms. A sentence ends when silence exceeds this value. Range: 200-6000 ms. Only applies when semantic_punctuation_enabled is false.
multi_threshold_mode_enabled | bool | False | No | Prevents VAD from creating excessively long segments. Only applies when semantic_punctuation_enabled is false.
punctuation_prediction_enabled | bool | True | No | Automatically add punctuation to results. Fixed at true; cannot be modified.
heartbeat | bool | False | No | Maintain a persistent server connection:
  • true: The connection stays active as long as continuous silent audio is sent.
  • false (default): The connection times out after 60 seconds, even with continuous silent audio.
Silent audio contains no sound signal. Generate it with audio editing software (Audacity, Adobe Audition) or FFmpeg. Requires SDK version 1.23.1 or later.
language_hints | list[str] | ["zh", "en"] | No | Language codes for recognition. Leave unset for automatic detection. Supported codes:
  • fun-asr-realtime, fun-asr-realtime-2025-11-07: zh (Chinese), en (English), ja (Japanese)
  • fun-asr-realtime-2025-09-15: zh (Chinese), en (English)
speech_noise_threshold | float | - | No | Speech-noise detection threshold for VAD sensitivity. Range: [-1.0, 1.0].
  • Near -1: Lowers the noise threshold; more noise may be transcribed as speech.
  • Near +1: Raises the noise threshold; some speech may be filtered out as noise.
Important: This is an advanced parameter. Adjustments significantly affect recognition quality. Test thoroughly and adjust in small steps (0.1) based on your audio environment.
callback | RecognitionCallback | - | No | The RecognitionCallback interface.
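For the heartbeat scenario, silent PCM is simply zero-valued samples, so a silent wav can be generated with the standard library instead of an external tool. A sketch assuming the 16 kHz, 16-bit, mono format used elsewhere on this page:

```python
import wave

def write_silence(path: str, seconds: float, rate: int = 16000) -> None:
  """Write a mono 16-bit PCM wav containing only silence."""
  with wave.open(path, 'wb') as wf:
    wf.setnchannels(1)                 # mono
    wf.setsampwidth(2)                 # 16-bit samples
    wf.setframerate(rate)
    wf.writeframes(b'\x00\x00' * int(rate * seconds))  # zero samples
```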

Key interfaces

Recognition class

Import with from dashscope.audio.asr import *.
Member method | Method signature | Description
call | def call(self, file: str, phrase_id: str = None, **kwargs) -> RecognitionResult | Run non-streaming recognition on a local file. Blocks until processing completes, then returns a RecognitionResult.
start | def start(self, phrase_id: str = None, **kwargs) | Start streaming recognition. Non-blocking. Use with send_audio_frame and stop.
send_audio_frame | def send_audio_frame(self, buffer: bytes) | Send an audio frame (~100 ms, 1-16 KB per packet). Receive results via the on_event callback of RecognitionCallback.
stop | def stop(self) | Stop recognition. Blocks until all audio is processed.
get_last_request_id | def get_last_request_id(self) | Return the request ID. Available after the Recognition object is created.
get_first_package_delay | def get_first_package_delay(self) | Return the first-packet latency (time from the first audio packet sent to the first result received). Available after the task completes.
get_last_package_delay | def get_last_package_delay(self) | Return the last-packet latency (time from stop to the final result). Available after the task completes.

Callback interface (RecognitionCallback)

In bidirectional streaming, the server returns data via callbacks. Implement a callback to handle responses.
class Callback(RecognitionCallback):
  def on_open(self) -> None:
    print('Connection successful')

  def on_event(self, result: RecognitionResult) -> None:
    # Implement the logic to receive recognition results
    print('Recognition result: ', result.get_sentence())

  def on_complete(self) -> None:
    print('Task complete')

  def on_error(self, result: RecognitionResult) -> None:
    print('An exception occurred: ', result)

  def on_close(self) -> None:
    print('Connection closed')


callback = Callback()
Method | Parameter | Return value | Description
def on_open(self) -> None | None | None | Called when a connection to the server is established.
def on_event(self, result: RecognitionResult) -> None | result: RecognitionResult | None | Called when a recognition result is returned.
def on_complete(self) -> None | None | None | Called after all results have been returned.
def on_error(self, result: RecognitionResult) -> None | result: RecognitionResult | None | Called when an error occurs.
def on_close(self) -> None | None | None | Called when the connection closes.

Response

Recognition result (RecognitionResult)

RecognitionResult represents the result of a streaming call or a non-streaming call.
Member method | Method signature | Description
get_sentence | def get_sentence(self) -> Union[Dict[str, Any], List[Any]] | Return the recognized sentence(s) with timestamps. In a callback, returns a single sentence as Dict[str, Any]. See Sentence information.
get_request_id | def get_request_id(self) -> str | Return the request ID.
get_usage | def get_usage(self, sentence: Dict[str, Any]) -> Dict | Return usage information for the sentence.
is_sentence_end | @staticmethod def is_sentence_end(sentence: Dict[str, Any]) -> bool | Check whether the sentence has ended.

Sentence information (Sentence)

Parameter | Type | Description
begin_time | int | Start time of the sentence, in ms.
end_time | int | End time of the sentence, in ms.
text | str | Recognized text.
words | list of Word | Word timestamp information.

Word timestamp information (Word)

Parameter | Type | Description
begin_time | int | Start time of the word, in ms.
end_time | int | End time of the word, in ms.
text | str | The word.
punctuation | str | The punctuation mark.
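Since each Word entry carries its punctuation separately from the word text, sentence text can be rebuilt by concatenating the two fields. A sketch assuming Word entries are plain dicts with text and an optional punctuation key (whether text includes inter-word spacing is model-dependent):

```python
def join_words(words):
  """Rebuild sentence text from Word dicts (text plus trailing punctuation)."""
  return ''.join(w['text'] + w.get('punctuation', '') for w in words)
```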