Real-time ASR Python SDK
User guide: For model selection, see Real-time speech recognition.
Prerequisites
- Create an API key and export it as an environment variable. Do not hard-code it.
- Install the latest DashScope SDK.
- For microphone examples, install pyaudio:
  pip install pyaudio
  Note: pyaudio depends on the portaudio library. On Ubuntu/Debian: sudo apt-get install libportaudio2 portaudio19-dev. On macOS: brew install portaudio.
Model availability
| Model | Version | Unit price | Free quota |
|---|---|---|---|
| fun-asr-realtime (currently equivalent to fun-asr-realtime-2025-11-07) | Stable | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
| fun-asr-realtime-2025-11-07 | Snapshot | $0.00009/second | 36,000 seconds (10 hours), valid for 90 days |
- Languages: Mandarin, Cantonese, Wu, Minnan, Hakka, Gan, Xiang, and Jin. Also supports Mandarin accents from the Zhongyuan, Southwest, Jilu, Jianghuai, Lanyin, Jiaoliao, Northeast, Beijing, and Hong Kong-Taiwan regions, including Henan, Shaanxi, Hubei, Sichuan, Chongqing, Yunnan, Guizhou, Guangdong, Guangxi, Hebei, Tianjin, Shandong, Anhui, Nanjing, Jiangsu, Hangzhou, Gansu, and Ningxia. English and Japanese are also supported.
- Sample rate: 16 kHz
- Audio formats: pcm, wav, mp3, opus, speex, aac, amr
Getting started
The Recognition class supports both non-streaming and bidirectional streaming calls.
- Non-streaming call: Recognizes a local file and returns the complete result at once.
- Bidirectional streaming call: Recognizes an audio stream and returns results in real time. The stream can come from a microphone or a local file.
Non-streaming call
Submit a speech-to-text task for a single audio file. This call blocks until the result is returned.
Instantiate the Recognition class, set the request parameters, and call the call method to obtain the recognition result (RecognitionResult).
The example uses this audio file: asr_example.wav.
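A minimal sketch of this flow, assuming DASHSCOPE_API_KEY is exported and asr_example.wav is in the working directory. The extract_texts helper is illustrative, not part of the SDK, and the service call is guarded so it only runs when the API key is set:

```python
import os

def extract_texts(sentences):
    """Collect recognized text from a get_sentence() return value.
    Illustrative helper, not part of the SDK."""
    if isinstance(sentences, dict):  # callbacks deliver a single sentence dict
        sentences = [sentences]
    return [s.get('text', '') for s in sentences or []]

if os.getenv('DASHSCOPE_API_KEY'):  # read from the environment; never hard-code
    from dashscope.audio.asr import Recognition

    recognition = Recognition(
        model='fun-asr-realtime',
        format='wav',
        sample_rate=16000,
        callback=None,  # no callback is needed for a blocking call
    )
    result = recognition.call('asr_example.wav')  # blocks until done
    for text in extract_texts(result.get_sentence()):
        print(text)
```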
Bidirectional streaming call
Submit a speech-to-text task and receive results via callback.
1. Start streaming recognition: instantiate the Recognition class, configure the request parameters and the callback (RecognitionCallback), and call start.
2. Send audio: call send_audio_frame repeatedly to send binary audio data from a local file or a device (such as a microphone). The server returns results in real time through the on_event callback. Each audio frame should contain about 100 ms of audio and be 1-16 KB in size.
3. Stop recognition: call stop to end recognition. This call blocks until on_complete or on_error is triggered.

The complete examples cover two scenarios:
- Recognize speech from a microphone
- Recognize a local audio file
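The steps above can be sketched for the local-file scenario. This is a sketch, not the official example: frame_bytes and PrintingCallback are illustrative names, the callback overrides only on_event (assuming RecognitionCallback provides defaults for the other methods), and the service calls are guarded behind DASHSCOPE_API_KEY:

```python
import os
import time

def frame_bytes(sample_rate, seconds=0.1, sample_width=2, channels=1):
    """Bytes per audio frame. ~100 ms per frame is recommended (1-16 KB)."""
    return int(sample_rate * seconds) * sample_width * channels

if os.getenv('DASHSCOPE_API_KEY'):  # only contact the service when a key is set
    from dashscope.audio.asr import Recognition, RecognitionCallback, RecognitionResult

    class PrintingCallback(RecognitionCallback):
        def on_event(self, result: RecognitionResult) -> None:
            sentence = result.get_sentence()
            if RecognitionResult.is_sentence_end(sentence):
                print(sentence['text'])

    recognition = Recognition(
        model='fun-asr-realtime',
        format='wav',
        sample_rate=16000,
        callback=PrintingCallback(),
    )
    recognition.start()            # non-blocking
    chunk = frame_bytes(16000)     # 3200 bytes, roughly 100 ms of 16-bit mono audio
    with open('asr_example.wav', 'rb') as f:
        while data := f.read(chunk):
            recognition.send_audio_frame(data)
            time.sleep(0.1)        # pace the stream roughly in real time
    recognition.stop()             # blocks until on_complete or on_error
```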
Request parameters
Set request parameters in the Recognition class constructor (__init__).
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| model | str | - | Yes | Model for real-time speech recognition. |
| sample_rate | int | - | Yes | Audio sample rate in Hz. Supports 16000 Hz. |
| format | str | - | Yes | Audio format: pcm, wav, mp3, opus, speex, aac, amr. Important: opus/speex must be Ogg-encapsulated. wav must be PCM-encoded. amr supports only AMR-NB. |
| vocabulary_id | str | - | No | Vocabulary ID for hotword customization. See Customize hotwords. |
| semantic_punctuation_enabled | bool | False | No | Enable semantic punctuation. When true, sentence segmentation is driven by semantics; when false (default), it is driven by VAD silence detection. |
| max_sentence_silence | int | 1300 | No | Silence threshold for VAD sentence segmentation, in ms. Sentences end when silence exceeds this value. Range: 200-6000 ms. Only applies when semantic_punctuation_enabled is false. |
| multi_threshold_mode_enabled | bool | False | No | Prevents VAD from creating excessively long segments. Only applies when semantic_punctuation_enabled is false. |
| punctuation_prediction_enabled | bool | True | No | Add punctuation to results automatically. |
| heartbeat | bool | False | No | Maintain a persistent server connection. true: keep the connection alive during periods when no audio is sent. false: the server may close the connection after a period of inactivity. |
| language_hints | list[str] | ["zh", "en"] | No | Language codes for recognition, for example zh (Chinese), en (English), and ja (Japanese). Leave unset for automatic detection. |
| speech_noise_threshold | float | - | No | Speech-noise detection threshold controlling VAD sensitivity. Range: [-1.0, 1.0]. Values closer to -1.0 make audio more likely to be judged as speech; values closer to 1.0 make it more likely to be judged as noise. |
| callback | RecognitionCallback | - | No | The RecognitionCallback interface. |
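A sketch of how several optional parameters might be combined in the constructor. The validate_max_sentence_silence helper is illustrative, not part of the SDK; it simply enforces the documented 200-6000 ms range. The constructor call is guarded behind DASHSCOPE_API_KEY:

```python
import os

def validate_max_sentence_silence(ms):
    """Enforce the documented 200-6000 ms range for max_sentence_silence.
    Illustrative helper, not part of the SDK."""
    if not 200 <= ms <= 6000:
        raise ValueError('max_sentence_silence must be within 200-6000 ms')
    return ms

if os.getenv('DASHSCOPE_API_KEY'):
    from dashscope.audio.asr import Recognition

    recognition = Recognition(
        model='fun-asr-realtime',
        format='pcm',
        sample_rate=16000,
        max_sentence_silence=validate_max_sentence_silence(800),
        punctuation_prediction_enabled=True,
        language_hints=['zh', 'en'],
        callback=None,
    )
```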
Key interfaces
Recognition class
Import with from dashscope.audio.asr import *.
| Member method | Method signature | Description |
|---|---|---|
| call | def call(self, file: str, phrase_id: str = None, **kwargs) -> RecognitionResult | Run non-streaming recognition on a local file. Blocks until processing completes, then returns RecognitionResult. |
| start | def start(self, phrase_id: str = None, **kwargs) | Start streaming recognition. Non-blocking. Use with send_audio_frame and stop. |
| send_audio_frame | def send_audio_frame(self, buffer: bytes) | Send an audio frame (~100 ms, 1-16 KB per packet). Get results via the on_event callback of RecognitionCallback. |
| stop | def stop(self) | Stop recognition. Blocks until all audio is processed. |
| get_last_request_id | def get_last_request_id(self) | Return the request ID. Available after the Recognition object is created. |
| get_first_package_delay | def get_first_package_delay(self) | Return the first-packet latency (time from first audio packet sent to first result received). Available after the task completes. |
| get_last_package_delay | def get_last_package_delay(self) | Return the last-packet latency (time from stop to final result). Available after the task completes. |
Callback interface (RecognitionCallback)
In bidirectional streaming, the server returns data via callbacks. Implement a callback to handle responses.
| Method | Parameter | Return value | Description |
|---|---|---|---|
| def on_open(self) -> None | None | None | Called when a server connection is established. |
| def on_event(self, result: RecognitionResult) -> None | result: RecognitionResult | None | Called when a recognition result is returned. |
| def on_complete(self) -> None | None | None | Called after all results are returned. |
| def on_error(self, result: RecognitionResult) -> None | result: RecognitionResult | None | Called when an error occurs. |
| def on_close(self) -> None | None | None | Called when the connection closes. |
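The interface above can be implemented as follows (a sketch: format_sentence is an illustrative helper, not part of the SDK, and the class definition is guarded behind DASHSCOPE_API_KEY so the snippet loads without the SDK installed):

```python
import os

def format_sentence(sentence):
    """Render a sentence dict (see Sentence information below) for display.
    Illustrative helper, not part of the SDK."""
    return '[{}-{} ms] {}'.format(
        sentence.get('begin_time'), sentence.get('end_time'), sentence.get('text', ''))

if os.getenv('DASHSCOPE_API_KEY'):
    from dashscope.audio.asr import RecognitionCallback, RecognitionResult

    class MyCallback(RecognitionCallback):
        def on_open(self) -> None:
            print('connection established')

        def on_event(self, result: RecognitionResult) -> None:
            sentence = result.get_sentence()
            if RecognitionResult.is_sentence_end(sentence):
                print(format_sentence(sentence))

        def on_complete(self) -> None:
            print('all results received')

        def on_error(self, result: RecognitionResult) -> None:
            print('error:', result)

        def on_close(self) -> None:
            print('connection closed')
```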
Response
Recognition result (RecognitionResult)
RecognitionResult represents the result of a streaming call or a non-streaming call.
| Member method | Method signature | Description |
|---|---|---|
| get_sentence | def get_sentence(self) -> Union[Dict[str, Any], List[Any]] | Return the current recognized sentence with timestamps. In a callback, returns a single sentence as Dict[str, Any]. See Sentence information. |
| get_request_id | def get_request_id(self) -> str | Return the request ID. |
| get_usage | def get_usage(self, sentence: Dict[str, Any]) -> Dict | Return usage information for the sentence. |
| is_sentence_end | @staticmethod def is_sentence_end(sentence: Dict[str, Any]) -> bool | Check whether the sentence has ended. |
Sentence information (Sentence)
| Parameter | Type | Description |
|---|---|---|
| begin_time | int | Start time of the sentence, in ms. |
| end_time | int | End time of the sentence, in ms. |
| text | str | Recognized text. |
| words | A list of Word objects | Word timestamp information. |
Word timestamp information (Word)
| Parameter | Type | Description |
|---|---|---|
| begin_time | int | Start time of the word, in ms. |
| end_time | int | End time of the word, in ms. |
| text | str | The word. |
| punctuation | str | The punctuation mark. |
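The Sentence and Word structures above can be traversed with a small helper (a sketch; word_timeline and the sample dict are illustrative, not part of the SDK, but the field names match the tables above):

```python
def word_timeline(sentence):
    """Flatten a Sentence dict into (text, begin_time, end_time) tuples,
    using the Word fields documented above. Illustrative helper."""
    return [(w.get('text', ''), w.get('begin_time'), w.get('end_time'))
            for w in sentence.get('words', [])]

# A sentence shaped like the structures documented above:
sentence = {
    'begin_time': 0,
    'end_time': 1200,
    'text': 'hello world',
    'words': [
        {'begin_time': 0, 'end_time': 500, 'text': 'hello', 'punctuation': ''},
        {'begin_time': 500, 'end_time': 1200, 'text': 'world', 'punctuation': '.'},
    ],
}
print(word_timeline(sentence))  # [('hello', 0, 500), ('world', 500, 1200)]
```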