Connection reuse and pooling

HTTP connection reuse and WebSocket connection pooling for high-concurrency workloads.

Reusing connections reduces resource consumption and improves throughput. The strategy depends on the protocol:
  • HTTP APIs (text generation, multimodal, embeddings): reuse TCP connections via connection pool configuration (Java) or Session objects (Python).
  • WebSocket APIs (TTS, real-time speech): pool synthesizer objects that hold long-lived WebSocket connections.

Prerequisites

  • Obtain and configure your API Key as the DASHSCOPE_API_KEY environment variable.
  • Install the latest DashScope SDK:
    • Python SDK: >= 1.25.2
    • Java SDK: >= 2.16.6

HTTP connection reuse

The DashScope endpoint differs by model type:
  • Text models (qwen-plus, qwen3-max, etc.): use the Generation class, which routes to /services/aigc/text-generation/generation.
  • Multimodal models (qwen3.6-plus, qwen3-vl-plus, etc.): use the MultiModalConversation class, which routes to /services/aigc/multimodal-generation/generation.

Java SDK

Connection pooling is enabled by default. Adjust the following parameters as needed.
| Parameter | Description | Default | Unit | Notes |
| --- | --- | --- | --- | --- |
| connectTimeout | Timeout for establishing a connection. | 120 | seconds | Shorter timeouts reduce wait time in low-latency scenarios. |
| readTimeout | Timeout for reading data. | 300 | seconds | |
| writeTimeout | Timeout for writing data. | 60 | seconds | |
| connectionIdleTimeout | Timeout for idle connections. | 300 | seconds | Longer idle timeouts avoid frequent reconnections under high concurrency. |
| connectionPoolSize | Maximum connections in the pool. | 32 | connections | Too few connections cause blocking; too many increase server load. |
| maximumAsyncRequests | Maximum concurrent requests across all hosts. Must be ≤ connectionPoolSize. | 32 | requests | |
| maximumAsyncRequestsPerHost | Maximum concurrent requests per host. Must be ≤ maximumAsyncRequests. | 32 | requests | |

Configure connection pool parameters and call a model service:
// Recommended DashScope SDK version >= 2.12.0
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.InputRequiredException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.protocol.ConnectionConfigurations;
import com.alibaba.dashscope.protocol.Protocol;
import com.alibaba.dashscope.utils.Constants;

public class Main {
  public static MultiModalConversationResult callWithMessage() throws ApiException, NoApiKeyException, InputRequiredException {
    MultiModalConversation conv = new MultiModalConversation(Protocol.HTTP.getValue(), "https://dashscope-intl.aliyuncs.com/api/v1");
    Map<String, Object> textContent = new HashMap<>();
    textContent.put("text", "Who are you?");
    MultiModalMessage userMsg = MultiModalMessage.builder()
        .role(Role.USER.getValue())
        .content(Collections.singletonList(textContent))
        .build();
    MultiModalConversationParam param = MultiModalConversationParam.builder()
        // If you have not configured the environment variable, replace with your API key: .apiKey("sk-xxx")
        .apiKey(System.getenv("DASHSCOPE_API_KEY"))
        .model("qwen3.6-plus")
        .messages(Collections.singletonList(userMsg))
        .build();

    return conv.call(param);
  }
  public static void main(String[] args) {
    // Connection pool configuration
    Constants.connectionConfigurations = ConnectionConfigurations.builder()
        .connectTimeout(Duration.ofSeconds(10))  // Timeout for establishing a connection, default 120s
        .readTimeout(Duration.ofSeconds(300)) // Timeout for reading data, default 300s
        .writeTimeout(Duration.ofSeconds(60)) // Timeout for writing data, default 60s
        .connectionIdleTimeout(Duration.ofSeconds(300)) // Timeout for idle connections, default 300s
        .connectionPoolSize(256) // Maximum connections in the connection pool, default 32
        .maximumAsyncRequests(256)  // Maximum concurrent requests, default 32
        .maximumAsyncRequestsPerHost(256) // Maximum concurrent requests per host, default 32
        .build();

    try {
      MultiModalConversationResult result = callWithMessage();
      System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    } catch (ApiException | NoApiKeyException | InputRequiredException e) {
      System.err.println("An error occurred while calling the service: " + e.getMessage());
    }
    System.exit(0);
  }
}
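
The parameter table above states two ordering rules: maximumAsyncRequestsPerHost ≤ maximumAsyncRequests ≤ connectionPoolSize. The following is a minimal sketch of a pre-deployment check for those rules, written in Python only to keep it short. The function name check_pool_settings is hypothetical and not part of the DashScope SDK.

```python
def check_pool_settings(connection_pool_size, maximum_async_requests,
                        maximum_async_requests_per_host):
    """Return a list of violated rules (empty if the settings are consistent).

    Hypothetical helper, not part of the DashScope SDK. It only encodes the
    ordering rules stated in the parameter table.
    """
    problems = []
    if maximum_async_requests > connection_pool_size:
        problems.append("maximumAsyncRequests must be <= connectionPoolSize")
    if maximum_async_requests_per_host > maximum_async_requests:
        problems.append(
            "maximumAsyncRequestsPerHost must be <= maximumAsyncRequests")
    return problems


# The values used in the Java example above are consistent.
print(check_pool_settings(256, 256, 256))  # []

# Raising only the request limits violates the first rule.
print(check_pool_settings(32, 64, 64))
```

Running a check like this when configuration is loaded surfaces a mismatch before it shows up as blocked requests under load.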

Python SDK

The Python SDK supports connection reuse via a custom Session. Two methods are available: async (aiohttp) and sync (requests.Session).

Async (aiohttp)

Use aiohttp.ClientSession with aiohttp.TCPConnector for async connection reuse.
| Parameter | Description | Default | Notes |
| --- | --- | --- | --- |
| limit | Total connection limit | 100 | Higher values improve concurrency. |
| limit_per_host | Connection limit per host | 0 (unlimited) | Prevents excessive load on a single host. |
| ssl | SSL context configuration | None | SSL certificate validation for HTTPS connections. |

import asyncio
import aiohttp
import ssl
import certifi
from dashscope import AioMultiModalConversation
import dashscope
import os

async def main():
  dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

  # If you have not configured the environment variable, replace with your API key: dashscope.api_key = "sk-xxx"
  dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

  # Configure connection parameters
  connector = aiohttp.TCPConnector(
    limit=100,           # Total connection limit
    limit_per_host=30,   # Connection limit per host
    ssl=ssl.create_default_context(cafile=certifi.where()),
  )

  # Create a custom Session and pass it to the call method
  async with aiohttp.ClientSession(connector=connector) as session:
    response = await AioMultiModalConversation.call(
      model='qwen3.6-plus',
      messages=[{'role': 'user', 'content': [{'text': 'Hello, please introduce yourself'}]}],
      session=session,  # Pass the custom Session
    )
    print(response)

asyncio.run(main())
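
When an application launches more coroutines than the connector allows, the extra calls wait for a free connection. One way to make that bound explicit in application code is an asyncio.Semaphore sized to match limit_per_host. The sketch below assumes this pattern is appropriate for your workload; fake_call is a hypothetical stand-in for an SDK call and sleeps instead of using the network.

```python
import asyncio

LIMIT_PER_HOST = 30  # Mirrors limit_per_host in the connector example above


async def fake_call(i):
    """Hypothetical stand-in for an SDK call; sleeps instead of doing I/O."""
    await asyncio.sleep(0.01)
    return i


async def run_bounded(n_tasks, limit=LIMIT_PER_HOST):
    """Run n_tasks calls with at most `limit` in flight; return results and peak."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(i):
        nonlocal in_flight, peak
        async with semaphore:  # At most `limit` calls pass this point at once
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_call(i)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(i) for i in range(n_tasks)))
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(run_bounded(100))
    print(len(results), "calls finished; peak in flight:", peak)
```

In a real application, the body of guarded would await the SDK call with the shared session instead of fake_call.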

Sync (requests.Session)

Use requests.Session for sync connection reuse. Requests within the same Session reuse the TCP connection.
import requests
from dashscope import MultiModalConversation
import dashscope
import os

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# If you have not configured the environment variable, replace with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

# Use a with statement to ensure the Session closes correctly
with requests.Session() as session:
  response = MultiModalConversation.call(
    model='qwen3.6-plus',
    messages=[{'role': 'user', 'content': [{'text': 'Hello'}]}],
    session=session  # Pass the custom Session
  )
  print(response)
Reuse the same Session across multiple calls:
import requests
from dashscope import MultiModalConversation
import dashscope
import os

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# If you have not configured the environment variable, replace with your API key: dashscope.api_key = "sk-xxx"
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

# Create a Session object
session = requests.Session()

try:
  # Reuse the same Session for multiple calls
  response1 = MultiModalConversation.call(
    model='qwen3.6-plus',
    messages=[{'role': 'user', 'content': [{'text': 'Hello'}]}],
    session=session
  )
  print(response1)

  response2 = MultiModalConversation.call(
    model='qwen3.6-plus',
    messages=[{'role': 'user', 'content': [{'text': 'Introduce yourself'}]}],
    session=session
  )
  print(response2)
finally:
  # Ensure the Session closes correctly
  session.close()
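
To illustrate what reusing a connection saves, the following standard-library sketch counts how many TCP connections a local test server accepts when three requests share one connection versus when each request opens its own. It does not use the DashScope SDK or requests; CountingHandler and connections_used are hypothetical names, and the local server is only a stand-in for a remote endpoint.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # Allow keep-alive on the server side
    connections = 0                # TCP connections accepted so far

    def setup(self):
        # setup() runs once per accepted connection, not once per request
        super().setup()
        type(self).connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # Keep the demo output quiet


def connections_used(reuse, n_requests=3):
    """Send n_requests GETs to a local server and count TCP connections."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    shared = http.client.HTTPConnection(host, port) if reuse else None
    try:
        for _ in range(n_requests):
            conn = shared if shared is not None else http.client.HTTPConnection(host, port)
            conn.request("GET", "/")
            conn.getresponse().read()  # Drain the body so the socket can be reused
            if not reuse:
                conn.close()
    finally:
        if shared is not None:
            shared.close()
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


print("with reuse:", connections_used(reuse=True))      # with reuse: 1
print("without reuse:", connections_used(reuse=False))  # without reuse: 3
```

Against a remote HTTPS endpoint, each avoided connection also avoids a TCP and TLS handshake, which is where most of the saving comes from.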

WebSocket connection pooling

TTS services use WebSocket connections for real-time streaming. In production, creating a new connection per request wastes resources and adds latency. This section covers connection pooling, object pooling, and concurrent request management for high-throughput TTS workloads.

Python: Object pool

The Python SDK provides SpeechSynthesizerObjectPool to manage and reuse SpeechSynthesizer instances. The pool pre-creates objects and establishes WebSocket connections at initialization, eliminating per-request connection overhead.

Pool sizing: Set max_size to 1.5x-2x your peak concurrency. Do not exceed your account's QPS limit.
import os
import threading
import dashscope
from dashscope.audio.tts_v2 import *

dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Create a global object pool (one-time cost at startup)
pool = SpeechSynthesizerObjectPool(max_size=20)

def synthesize(text, task_id):
  complete_event = threading.Event()
  errors = []  # Filled by on_error so the caller can tell success from failure

  class Callback(ResultCallback):
    def on_open(self):
      self.file = open(f'result_{task_id}.mp3', 'wb')

    def on_complete(self):
      complete_event.set()

    def on_error(self, message):
      print(f'[task_{task_id}] Error: {message}')
      errors.append(message)
      complete_event.set()  # Unblock the waiting thread on failure too

    def on_data(self, data):
      self.file.write(data)

    def on_close(self):
      if hasattr(self, 'file'):
        self.file.close()

  callback = Callback()

  # Borrow a pre-connected synthesizer from the pool
  synth = pool.borrow_synthesizer(
    model='cosyvoice-v3-flash',
    voice='longanyang',
    callback=callback
  )

  try:
    synth.call(text)
    complete_event.wait()
    if errors:
      # Treat a reported error as a failed task so the object is not reused
      raise RuntimeError(errors[0])
    print(f'[task_{task_id}] First packet delay: '
       f'{synth.get_first_package_delay()} ms')
    # Return the synthesizer to the pool for reuse
    pool.return_synthesizer(synth)
  except Exception as e:
    print(f'[task_{task_id}] Failed: {e}')
    synth.close()  # Do not return failed objects

# Run concurrent tasks
texts = ["First sentence.", "Second sentence.", "Third sentence."]
threads = [threading.Thread(target=synthesize, args=(t, i))
     for i, t in enumerate(texts)]
for t in threads:
  t.start()
for t in threads:
  t.join()

pool.shutdown()
Never return a synthesizer to the pool if the task failed or is still running. Close it manually instead.
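
The return-on-success, close-on-failure rule can be wrapped in a context manager so that every call site follows it. This is a minimal sketch, assuming only that the pool exposes borrow_synthesizer() and return_synthesizer() and that the borrowed object exposes close(), as in the example above. The helper name borrowed and the FakePool and FakeSynth classes are hypothetical and exist only to exercise the helper without a network.

```python
from contextlib import contextmanager


@contextmanager
def borrowed(pool, **kwargs):
    """Borrow a synthesizer; return it on success, close it on failure."""
    synth = pool.borrow_synthesizer(**kwargs)
    try:
        yield synth
    except BaseException:
        synth.close()  # Failed or interrupted: never goes back to the pool
        raise
    else:
        pool.return_synthesizer(synth)


class FakeSynth:
    """Hypothetical stand-in for a pooled synthesizer."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class FakePool:
    """Hypothetical stand-in for the object pool."""
    def __init__(self):
        self.returned = []

    def borrow_synthesizer(self, **kwargs):
        return FakeSynth()

    def return_synthesizer(self, synth):
        self.returned.append(synth)


fake_pool = FakePool()

with borrowed(fake_pool) as good:
    pass  # Task succeeded
print(len(fake_pool.returned), good.closed)  # 1 False

try:
    with borrowed(fake_pool) as bad:
        raise RuntimeError("task failed")
except RuntimeError:
    pass
print(len(fake_pool.returned), bad.closed)  # 1 True
```

With the callback-based API, the body of the with block should wait for completion and raise if the callback reported an error. Otherwise a task that is still running would be returned to the pool.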

Java: Connection pool + object pool

The Java SDK uses OkHttp3 connection pooling (enabled by default) plus an optional Apache Commons Pool2 object pool for SpeechSynthesizer instances.

Step 1: Configure connection pool via environment variables

| Variable | Default | Recommendation |
| --- | --- | --- |
| DASHSCOPE_CONNECTION_POOL_SIZE | 32 | 2x peak concurrency |
| DASHSCOPE_MAXIMUM_ASYNC_REQUESTS | 32 | Match connection pool size |
| DASHSCOPE_MAXIMUM_ASYNC_REQUESTS_PER_HOST | 32 | Match connection pool size |

export DASHSCOPE_CONNECTION_POOL_SIZE=2000
export DASHSCOPE_MAXIMUM_ASYNC_REQUESTS=2000
export DASHSCOPE_MAXIMUM_ASYNC_REQUESTS_PER_HOST=2000
Step 2: Add the commons-pool2 dependency (Maven shown)
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-pool2</artifactId>
  <version>the-latest-version</version>
</dependency>
Step 3: Create and use the object pool
| Variable | Default | Recommendation |
| --- | --- | --- |
| SAMBERT_OBJECTPOOL_SIZE (Sambert) | 500 | 1.5x-2x peak concurrency, must not exceed connection pool size |

import com.alibaba.dashscope.audio.tts.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.tts.SpeechSynthesizer;
import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

// Factory
class SynthesizerFactory extends BasePooledObjectFactory<SpeechSynthesizer> {
  public SpeechSynthesizer create() { return new SpeechSynthesizer(); }
  public PooledObject<SpeechSynthesizer> wrap(SpeechSynthesizer obj) {
    return new DefaultPooledObject<>(obj);
  }
}

// Pool (global singleton)
GenericObjectPoolConfig<SpeechSynthesizer> config = new GenericObjectPoolConfig<>();
config.setMaxTotal(1200);
config.setMaxIdle(1200);
config.setMinIdle(1200);
GenericObjectPool<SpeechSynthesizer> pool =
  new GenericObjectPool<>(new SynthesizerFactory(), config);

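// Usage in each task
SpeechSynthesizer synth = pool.borrowObject();
try {
  // ... configure params and call synth
  pool.returnObject(synth);
} catch (Exception e) {
  // Do not return on failure. Invalidate the object so the pool
  // stops counting it as borrowed and can create a replacement.
  pool.invalidateObject(synth);
}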
Reference server sizing: a 4-core 8 GiB machine can handle ~600 concurrent Sambert TTS tasks with an object pool of 1200 and a connection pool of 2000.
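
The sizing recommendations can be checked with simple arithmetic. The sketch below is a hypothetical helper (check_plan is not part of any SDK) that tests a planned configuration against three rules from this section: the object pool is 1.5x-2x peak concurrency, the object pool does not exceed the connection pool, and the connection pool is at least 2x peak concurrency. Treating the 2x connection pool recommendation as a minimum is an assumption.

```python
import math


def check_plan(peak_concurrency, object_pool, connection_pool):
    """Check planned pool sizes against the recommendations in this section.

    Hypothetical helper; treats "2x peak concurrency" for the connection
    pool as a lower bound, which is an assumption.
    """
    low = math.ceil(1.5 * peak_concurrency)
    high = 2 * peak_concurrency
    return {
        "object_pool_in_range": low <= object_pool <= high,
        "object_pool_fits_connections": object_pool <= connection_pool,
        "connection_pool_at_least_2x": connection_pool >= 2 * peak_concurrency,
    }


# Reference configuration: 600 concurrent tasks, object pool 1200,
# connection pool 2000.
print(check_plan(600, 1200, 2000))
```

For the reference configuration all three checks pass: the recommended object pool range for 600 concurrent tasks is 900 to 1200, and 1200 fits within the 2000 available connections.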

Best practices

  • Java SDK: Set connectionPoolSize and maximumAsyncRequests based on your concurrent workload. Too few connections cause blocking; too many increase server load.
  • Python SDK: Use with statements to manage the Session lifecycle and ensure proper resource cleanup.
  • Choose the right method: Use async calls for async applications (like asyncio or FastAPI). Use sync calls for traditional applications.
  • WebSocket object pools: Never return a synthesizer to the pool if the task failed or is still running. Close it manually instead.

Performance monitoring

Track these metrics to maintain healthy production TTS services:
| Metric | Description | Target |
| --- | --- | --- |
| First packet delay | Time from request to first audio chunk | < 500 ms |
| End-to-end latency | Total time for complete synthesis | Depends on text length |
| Error rate | Percentage of failed requests | < 0.1% |
| Pool utilization | Borrowed objects / pool size | 60%-80% at peak |
| Connection reuse ratio | Reused connections / total requests | > 95% |

Access these metrics from the SDK:
# Python (TTS)
print(f"Request ID: {synthesizer.get_last_request_id()}")
print(f"First packet delay: {synthesizer.get_first_package_delay()} ms")
// Java (TTS)
System.out.println("Request ID: " + synthesizer.getLastRequestId());
System.out.println("First packet delay: " + synthesizer.getFirstPackageDelay() + " ms");
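
The ratio metrics in the table (error rate, pool utilization, connection reuse ratio) are derived values. A minimal sketch of computing them from counters and comparing them with the targets follows. It assumes your application collects the counters itself; the function name pool_health and its parameters are hypothetical, and the SDK calls shown above supply only the request ID and first packet delay.

```python
def pool_health(borrowed, pool_size, failed, total, reused_connections):
    """Compute ratio metrics and compare them with the targets in the table.

    Hypothetical helper: all inputs are counters the application is assumed
    to track itself.
    """
    if pool_size <= 0 or total <= 0:
        raise ValueError("pool_size and total must be positive")
    utilization = borrowed / pool_size
    error_rate = failed / total
    reuse_ratio = reused_connections / total
    return {
        "utilization": utilization,
        "utilization_ok": 0.60 <= utilization <= 0.80,  # 60%-80% at peak
        "error_rate": error_rate,
        "error_rate_ok": error_rate < 0.001,            # < 0.1%
        "reuse_ratio": reuse_ratio,
        "reuse_ratio_ok": reuse_ratio > 0.95,           # > 95%
    }


# Example: 14 of 20 objects borrowed, 1 failure and 1950 reused
# connections out of 2000 requests.
report = pool_health(borrowed=14, pool_size=20, failed=1,
                     total=2000, reused_connections=1950)
for name, value in report.items():
    print(f"{name}: {value}")
```

Utilization is only meaningful when sampled at peak load, so a low value measured off-peak does not by itself mean the pool is oversized.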

Production checklist

Before going live, verify the following:
  • API Key stored in environment variable, not hardcoded.
  • Connection pool and object pool sizes configured for expected peak load.
  • Pool sizes do not exceed your account's QPS limit.
  • Error handling disposes of failed objects instead of returning them to the pool.
  • Graceful shutdown calls pool.shutdown() (Python) or pool.close() (Java).
  • WebSocket connections use the correct endpoint (wss://dashscope-intl.aliyuncs.com/...).
  • Monitoring dashboards track first-packet delay, error rate, and pool utilization.
  • Load tested with 2x expected peak concurrency.
  • Retry logic with exponential backoff for transient failures.
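
The last checklist item, retry with exponential backoff, can be sketched as follows. This is a generic pattern and not an SDK feature. The helper name call_with_backoff, the default delays, and the choice of ConnectionError and TimeoutError as transient failures are all assumptions to adapt to the errors your calls actually raise.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call func(), retrying transient failures with exponential backoff.

    Hypothetical helper. The delay doubles after each failed attempt, is
    capped at max_delay, and gets up to 50% random jitter added so that
    many clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


# Example: a call that fails twice, then succeeds.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


# sleep is replaced with a no-op here so the demo finishes immediately.
print(call_with_backoff(flaky, sleep=lambda seconds: None))  # ok
print(state["calls"])  # 3
```

Errors that are not listed in retry_on propagate immediately, which keeps permanent failures such as invalid parameters from being retried.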