Connection reuse and pooling

Reusing connections reduces resource consumption and improves throughput. The strategy depends on the protocol:

HTTP APIs (text generation, multimodal, embeddings): reuse TCP connections via connection pool configuration (Java) or Session objects (Python).
WebSocket APIs (TTS, real-time speech): pool synthesizer objects that hold long-lived WebSocket connections.

Prerequisites

Obtain an API key and set it as the DASHSCOPE_API_KEY environment variable.
Install the latest DashScope SDK:
- Python SDK: >= 1.25.2
- Java SDK: >= 2.16.6

HTTP connection reuse

The DashScope endpoint differs by model type:

Text models (qwen-plus, qwen3-max, etc.): use the Generation class, which routes to /services/aigc/text-generation/generation.
Multimodal models (qwen3.6-plus, qwen3-vl-plus, etc.): use the MultiModalConversation class, which routes to /services/aigc/multimodal-generation/generation.

Java SDK

Connection pooling is enabled by default. Adjust the following parameters as needed.

Parameter	Description	Default	Unit	Notes
`connectTimeout`	Timeout for establishing a connection.	120	seconds	Shorter timeouts reduce wait time in low-latency scenarios.
`readTimeout`	Timeout for reading data.	300	seconds
`writeTimeout`	Timeout for writing data.	60	seconds
`connectionIdleTimeout`	Timeout for idle connections.	300	seconds	Longer idle timeouts avoid frequent reconnections under high concurrency.
`connectionPoolSize`	Maximum connections in the pool.	32	items	Too few connections cause blocking; too many increase server load.
`maximumAsyncRequests`	Maximum concurrent requests across all hosts. Must be ≤ `connectionPoolSize`.	32	requests
`maximumAsyncRequestsPerHost`	Maximum concurrent requests per host. Must be ≤ `maximumAsyncRequests`.	32	items
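The two "Must be ≤" constraints in the table can be checked before the values are applied. The following is a minimal sketch in Python, not part of either SDK: `validate_pool_config` and its argument names are hypothetical, and it only encodes the two ordering rules stated in the table.

```python
def validate_pool_config(pool_size, max_requests, max_requests_per_host):
    """Return a list of constraint violations (empty if the values are consistent)."""
    problems = []
    # Rule from the table: maximumAsyncRequests must be <= connectionPoolSize
    if max_requests > pool_size:
        problems.append("maximumAsyncRequests must be <= connectionPoolSize")
    # Rule from the table: maximumAsyncRequestsPerHost must be <= maximumAsyncRequests
    if max_requests_per_host > max_requests:
        problems.append("maximumAsyncRequestsPerHost must be <= maximumAsyncRequests")
    return problems


if __name__ == "__main__":
    print(validate_pool_config(256, 256, 256))  # consistent values
    print(validate_pool_config(32, 64, 64))     # request limit exceeds pool size
```

Running a check like this at startup surfaces a misconfiguration early instead of as blocked requests under load.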

Configure connection pool parameters and call a model service:

// Recommended DashScope SDK version >= 2.16.6 (see Prerequisites)
import java.time.Duration;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.protocol.ConnectionConfigurations;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.utils.Constants;

public class Main {
  public static MultiModalConversationResult callWithMessage() throws ApiException, NoApiKeyException, InputRequiredException {
    MultiModalConversation conv = new MultiModalConversation(Protocol.HTTP.getValue(), "https://dashscope-intl.aliyuncs.com/api/v1");
    Map<String, Object> textContent = new HashMap<>();
    textContent.put("text", "Who are you?");
    MultiModalMessage userMsg = MultiModalMessage.builder()
        .role(Role.USER.getValue())
        .content(Collections.singletonList(textContent))
        .build();
    MultiModalConversationParam param = MultiModalConversationParam.builder()
        // If you have not configured the environment variable, replace with your API key: .apiKey("sk-xxx")
        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
        .model("qwen3.6-plus")
        .messages(Collections.singletonList(userMsg))
        .build();

    return conv.call(param);
  }
  public static void main(String[] args) {
    // Connection pool configuration
    Constants.connectionConfigurations = ConnectionConfigurations.builder()
        .connectTimeout(Duration.ofSeconds(10))  // Timeout for establishing a connection, default 120s
        .readTimeout(Duration.ofSeconds(300)) // Timeout for reading data, default 300s
        .writeTimeout(Duration.ofSeconds(60)) // Timeout for writing data, default 60s
        .connectionIdleTimeout(Duration.ofSeconds(300)) // Timeout for idle connections, default 300s
        .connectionPoolSize(256) // Maximum connections in the connection pool, default 32
        .maximumAsyncRequests(256)  // Maximum concurrent requests, default 32
        .maximumAsyncRequestsPerHost(256) // Maximum concurrent requests per host, default 32
        .build();

    try {
      MultiModalConversationResult result = callWithMessage();
      System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    } catch (ApiException | NoApiKeyException | InputRequiredException e) {
      System.err.println("An error occurred while calling the service: " + e.getMessage());
    }
    System.exit(0);
  }
}

Python SDK

The Python SDK supports connection reuse via a custom Session. Two methods are available: async (aiohttp) and sync (requests.Session).

Async (aiohttp)

Use aiohttp.ClientSession with aiohttp.TCPConnector for async connection reuse.

Parameter	Description	Default	Notes
`limit`	Total connection limit	100	Higher values improve concurrency.
`limit_per_host`	Connection limit per host	0 (unlimited)	Prevents excessive load on a single host.
`ssl`	SSL context configuration	None	SSL certificate validation for HTTPS connections.

import asyncio
import aiohttp
import ssl
import certifi
from dashscope import AioMultiModalConversation
import dashscope
import os

async def main():
  dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

  # If you have not configured the environment variable, replace with your API key: dashscope.api_key = "sk-xxx"
  dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

  # Configure connection parameters
  connector = aiohttp.TCPConnector(
    limit=100,           # Total connection limit
    limit_per_host=30,   # Connection limit per host
    ssl=ssl.create_default_context(cafile=certifi.where()),
  )

  # Create a custom Session and pass it to the call method
  async with aiohttp.ClientSession(connector=connector) as session:
    response = await AioMultiModalConversation.call(
      model='qwen3.6-plus',
      messages=[{'role': 'user', 'content': [{'text': 'Hello, please introduce yourself'}]}],
      session=session,  # Pass the custom Session
    )
    print(response)

asyncio.run(main())

Sync (requests.Session)

Use requests.Session for sync connection reuse. Requests within the same Session reuse the TCP connection.

import requests
from dashscope import MultiModalConversation
import dashscope
import os

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# If you have not configured the environment variable, replace with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

# Use a with statement to ensure the Session closes correctly
with requests.Session() as session:
  response = MultiModalConversation.call(
    model='qwen3.6-plus',
    messages=[{'role': 'user', 'content': [{'text': 'Hello'}]}],
    session=session  # Pass the custom Session
  )
  print(response)

Reuse the same Session across multiple calls:

import requests
from dashscope import MultiModalConversation
import dashscope
import os

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# If you have not configured the environment variable, replace with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

# Create a Session object
session = requests.Session()

try:
  # Reuse the same Session for multiple calls
  response1 = MultiModalConversation.call(
    model='qwen3.6-plus',
    messages=[{'role': 'user', 'content': [{'text': 'Hello'}]}],
    session=session
  )
  print(response1)

  response2 = MultiModalConversation.call(
    model='qwen3.6-plus',
    messages=[{'role': 'user', 'content': [{'text': 'Introduce yourself'}]}],
    session=session
  )
  print(response2)
finally:
  # Ensure the Session closes correctly
  session.close()

WebSocket connection pooling

TTS services use WebSocket connections for real-time streaming. In production, creating a new connection per request wastes resources and adds latency. This section covers connection pooling, object pooling, and concurrent request management for high-throughput TTS workloads.

Python: Object pool

The Python SDK provides SpeechSynthesizerObjectPool to manage and reuse SpeechSynthesizer instances. The pool pre-creates objects and establishes WebSocket connections at initialization, eliminating per-request connection overhead.

Pool sizing: Set max_size to 1.5x-2x your peak concurrency. Do not exceed your account's QPS limit.

import os
import threading
import dashscope
from dashscope.audio.tts_v2 import *

dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Create a global object pool (one-time cost at startup)
pool = SpeechSynthesizerObjectPool(max_size=20)

def synthesize(text, task_id):
  complete_event = threading.Event()

  class Callback(ResultCallback):
    def on_open(self):
      self.file = open(f'result_{task_id}.mp3', 'wb')

    def on_complete(self):
      complete_event.set()

    def on_error(self, message):
      print(f'[task_{task_id}] Error: {message}')
      self.error_message = message  # Remember the failure for the caller
      complete_event.set()  # Unblock the waiting thread on failure as well

    def on_data(self, data):
      self.file.write(data)

    def on_close(self):
      if hasattr(self, 'file'):
        self.file.close()

  callback = Callback()

  # Borrow a pre-connected synthesizer from the pool
  synth = pool.borrow_synthesizer(
    model='cosyvoice-v3-flash',
    voice='longanyang',
    callback=callback
  )

  try:
    synth.call(text)
    complete_event.wait()
    # If on_error fired, treat the task as failed so the object is not returned
    if getattr(callback, 'error_message', None) is not None:
      raise RuntimeError(f'synthesis failed: {callback.error_message}')
    print(f'[task_{task_id}] First packet delay: '
          f'{synth.get_first_package_delay()} ms')
    # Return the synthesizer to the pool for reuse
    pool.return_synthesizer(synth)
  except Exception as e:
    print(f'[task_{task_id}] Failed: {e}')
    synth.close()  # Do not return failed objects

# Run concurrent tasks
texts = ["First sentence.", "Second sentence.", "Third sentence."]
threads = [threading.Thread(target=synthesize, args=(t, i))
     for i, t in enumerate(texts)]
for t in threads:
  t.start()
for t in threads:
  t.join()

pool.shutdown()

Never return a synthesizer to the pool if the task failed or is still running. Close it manually instead.

Java: Connection pool + object pool

The Java SDK uses OkHttp3 connection pooling (enabled by default) plus an optional Apache Commons Pool2 object pool for SpeechSynthesizer instances.

Step 1: Configure connection pool via environment variables

Variable	Default	Recommendation
`DASHSCOPE_CONNECTION_POOL_SIZE`	32	2x peak concurrency
`DASHSCOPE_MAXIMUM_ASYNC_REQUESTS`	32	Match connection pool size
`DASHSCOPE_MAXIMUM_ASYNC_REQUESTS_PER_HOST`	32	Match connection pool size

export DASHSCOPE_CONNECTION_POOL_SIZE=2000
export DASHSCOPE_MAXIMUM_ASYNC_REQUESTS=2000
export DASHSCOPE_MAXIMUM_ASYNC_REQUESTS_PER_HOST=2000

Step 2: Add commons-pool2 dependency

Maven

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-pool2</artifactId>
  <version>the-latest-version</version>
</dependency>

Step 3: Create and use the object pool

Variable	Default	Recommendation
`SAMBERT_OBJECTPOOL_SIZE` (Sambert)	500	1.5x-2x peak concurrency, must not exceed connection pool size

import com.alibaba.dashscope.audio.tts.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.tts.SpeechSynthesizer;
import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

// Factory
class SynthesizerFactory extends BasePooledObjectFactory<SpeechSynthesizer> {
  public SpeechSynthesizer create() { return new SpeechSynthesizer(); }
  public PooledObject<SpeechSynthesizer> wrap(SpeechSynthesizer obj) {
    return new DefaultPooledObject<>(obj);
  }
}

// Pool (global singleton)
GenericObjectPoolConfig<SpeechSynthesizer> config = new GenericObjectPoolConfig<>();
config.setMaxTotal(1200);
config.setMaxIdle(1200);
config.setMinIdle(1200);
GenericObjectPool<SpeechSynthesizer> pool =
  new GenericObjectPool<>(new SynthesizerFactory(), config);

// Usage in each task
SpeechSynthesizer synth = pool.borrowObject();
try {
  // ... configure params and call synth
  pool.returnObject(synth);
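} catch (Exception e) {
  // Do not return on failure. Invalidate the object so the pool discards it
  // and frees the borrowed slot instead of counting it as active forever.
  pool.invalidateObject(synth);
}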

Reference server sizing: a 4-core 8 GiB machine can handle ~600 concurrent Sambert TTS tasks with an object pool of 1200 and a connection pool of 2000.
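The reference figures follow the sizing rules stated earlier: 600 concurrent tasks times 2 gives an object pool of 1200, which stays below the connection pool of 2000. The following is a small sketch of that arithmetic in Python. `suggest_object_pool_size` is a hypothetical helper, and capping the result at the connection pool size is one way to honor the "must not exceed connection pool size" rule, not an SDK behavior.

```python
import math


def suggest_object_pool_size(peak_concurrency, factor=2.0, connection_pool_size=None):
    """Suggest an object pool size from peak concurrency.

    factor follows the 1.5x-2x guidance; connection_pool_size, if given,
    is used as an upper bound on the result.
    """
    if not 1.5 <= factor <= 2.0:
        raise ValueError("factor should be between 1.5 and 2.0")
    size = math.ceil(peak_concurrency * factor)
    if connection_pool_size is not None:
        size = min(size, connection_pool_size)
    return size


if __name__ == "__main__":
    # Reference sizing: 600 concurrent tasks, connection pool of 2000
    print(suggest_object_pool_size(600, factor=2.0, connection_pool_size=2000))
```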

Best practices

Java SDK: Set connectionPoolSize and maximumAsyncRequests based on your concurrent workload. Too few connections cause blocking; too many increase server load.
Python SDK: Use with statements to manage the Session lifecycle and ensure proper resource cleanup.
Choose the right method: Use async calls for async applications (like asyncio or FastAPI). Use sync calls for synchronous (blocking) applications.
WebSocket object pools: Never return a synthesizer to the pool if the task failed or is still running. Close it manually instead.

Performance monitoring

Track these metrics to maintain healthy production TTS services:

Metric	Description	Target
First packet delay	Time from request to first audio chunk	< 500 ms
End-to-end latency	Total time for complete synthesis	Depends on text length
Error rate	Percentage of failed requests	< 0.1%
Pool utilization	Borrowed objects / pool size	60%-80% at peak
Connection reuse ratio	Reused connections / total requests	> 95%

Access these metrics from the SDK:

# TTS
print(f"Request ID: {synthesizer.get_last_request_id()}")
print(f"First packet delay: {synthesizer.get_first_package_delay()} ms")

// TTS
System.out.println("Request ID: " + synthesizer.getLastRequestId());
System.out.println("First packet delay: " + synthesizer.getFirstPackageDelay() + " ms");
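The SDK calls above cover request ID and first packet delay. The two ratio metrics in the table can be derived from counters that the application maintains itself. The following is a sketch under that assumption: `PoolStats` and its fields are hypothetical, and how a request is classified as reusing a connection is left to the application.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    """Application-side counters for the ratio metrics (illustration only)."""
    pool_size: int
    borrowed: int = 0
    total_requests: int = 0
    reused_connections: int = 0

    def pool_utilization(self):
        # Borrowed objects / pool size
        return self.borrowed / self.pool_size

    def connection_reuse_ratio(self):
        # Reused connections / total requests; 0.0 before any request is made
        if self.total_requests == 0:
            return 0.0
        return self.reused_connections / self.total_requests


if __name__ == "__main__":
    stats = PoolStats(pool_size=20, borrowed=14, total_requests=200, reused_connections=194)
    print(f"Pool utilization: {stats.pool_utilization():.0%}")
    print(f"Connection reuse ratio: {stats.connection_reuse_ratio():.0%}")
```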

Production checklist

Before going live, verify the following:

API Key stored in environment variable, not hardcoded.
Connection pool and object pool sizes configured for expected peak load.
Pool sizes do not exceed your account's QPS limit.
Error handling closes or discards failed objects instead of returning them to the pool.
Graceful shutdown calls pool.shutdown() (Python) or pool.close() (Java).
WebSocket connections use the correct endpoint (wss://dashscope-intl.aliyuncs.com/...).
Monitoring dashboards track first-packet delay, error rate, and pool utilization.
Load tested with 2x expected peak concurrency.
Retry logic with exponential backoff for transient failures.
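The last checklist item can be sketched as a small wrapper around any call. This is a generic illustration, not an SDK feature: `call_with_retry` is a hypothetical helper, and the default exception types in `retry_on` are assumptions that should be replaced with whatever the application treats as transient.

```python
import random
import time


def call_with_retry(func, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # Delay doubles per attempt (base, 2x base, 4x base, ...) up to max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries from many clients failing together
            sleep(delay * random.uniform(0.5, 1.0))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Stand-in for a request that fails twice, then succeeds
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(call_with_retry(flaky, base_delay=0.01))
```

Only errors listed in `retry_on` are retried, so permanent failures such as an invalid API key are raised immediately.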

Related

Text to Speech -- TTS models, parameters, and streaming modes.
Realtime streaming -- realtime TTS streaming guide.
Improve recognition accuracy -- ASR optimization including high-concurrency ASR patterns.

