Connection pooling for high-concurrency speech

Connection pooling, object pooling, high-concurrency patterns, and performance monitoring for speech services.

TTS services use WebSocket connections for real-time streaming. In production, creating a new connection per request wastes resources and adds latency. This guide covers connection pooling, object pooling, and concurrent request management for high-throughput TTS workloads.

Prerequisites

  • Obtain and configure your API Key as the DASHSCOPE_API_KEY environment variable.
  • Install the latest DashScope SDK:
    • Python SDK: >= 1.25.2
    • Java SDK: >= 2.16.6

High-concurrency TTS patterns

Python: Object pool

The Python SDK provides SpeechSynthesizerObjectPool to manage and reuse SpeechSynthesizer instances. The pool pre-creates objects and establishes WebSocket connections at initialization, eliminating per-request connection overhead.

Pool sizing: Set max_size to 1.5x-2x your peak concurrency. Do not exceed your account's QPS limit.
import os
import threading
import dashscope
from dashscope.audio.tts_v2 import *

dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
dashscope.base_websocket_api_url = 'wss://dashscope-intl.aliyuncs.com/api-ws/v1/inference'

# Create a global object pool (one-time cost at startup)
pool = SpeechSynthesizerObjectPool(max_size=20)

def synthesize(text, task_id):
  complete_event = threading.Event()
  errors = []  # Filled by on_error so the caller can tell failure from success

  class Callback(ResultCallback):
    def on_open(self):
      self.file = open(f'result_{task_id}.mp3', 'wb')

    def on_complete(self):
      complete_event.set()

    def on_error(self, message):
      print(f'[task_{task_id}] Error: {message}')
      errors.append(message)
      complete_event.set()  # Unblock the waiting thread on failure too

    def on_data(self, data):
      self.file.write(data)

    def on_close(self):
      if hasattr(self, 'file'):
        self.file.close()

  callback = Callback()

  # Borrow a pre-connected synthesizer from the pool
  synth = pool.borrow_synthesizer(
    model='cosyvoice-v3-flash',
    voice='longanyang',
    callback=callback
  )

  try:
    synth.call(text)
    complete_event.wait()
    if errors:
      # Treat a callback-reported error as a failed task
      raise RuntimeError(errors[0])
    print(f'[task_{task_id}] First packet delay: '
       f'{synth.get_first_package_delay()} ms')
    # Return the synthesizer to the pool for reuse
    pool.return_synthesizer(synth)
  except Exception as e:
    print(f'[task_{task_id}] Failed: {e}')
    synth.close()  # Do not return failed objects

# Run concurrent tasks
texts = ["First sentence.", "Second sentence.", "Third sentence."]
threads = [threading.Thread(target=synthesize, args=(t, i))
     for i, t in enumerate(texts)]
for t in threads:
  t.start()
for t in threads:
  t.join()

pool.shutdown()
Never return a synthesizer to the pool if the task failed or is still running. Close it manually instead.
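
The return-or-close rule can be wrapped in a small context manager so that every task follows it. The following is a minimal sketch, not part of the SDK: borrowed_synthesizer is a hypothetical helper name, and it assumes only the borrow_synthesizer, return_synthesizer, and close calls shown in the sample above.

```python
from contextlib import contextmanager


@contextmanager
def borrowed_synthesizer(pool, **borrow_kwargs):
    """Borrow a synthesizer; return it on success, close it on failure.

    Hypothetical helper: `pool` is any object with borrow_synthesizer()
    and return_synthesizer(), as in the sample above.
    """
    synth = pool.borrow_synthesizer(**borrow_kwargs)
    try:
        yield synth
    except BaseException:
        # The task failed or was interrupted: discard instead of reusing.
        synth.close()
        raise
    else:
        # The body finished without raising: safe to reuse.
        pool.return_synthesizer(synth)
```

A task would then run as `with borrowed_synthesizer(pool, model=..., voice=..., callback=cb) as synth: synth.call(text)`. Errors that arrive only through the on_error callback must be re-raised inside the with block; otherwise the helper treats the task as successful and returns the object.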

Java: Connection pool + object pool

The Java SDK uses OkHttp3 connection pooling (enabled by default) plus an optional Apache Commons Pool2 object pool for SpeechSynthesizer instances.

Step 1: Configure connection pool via environment variables

| Variable | Default | Recommendation |
| --- | --- | --- |
| DASHSCOPE_CONNECTION_POOL_SIZE | 32 | 2x peak concurrency |
| DASHSCOPE_MAXIMUM_ASYNC_REQUESTS | 32 | Match connection pool size |
| DASHSCOPE_MAXIMUM_ASYNC_REQUESTS_PER_HOST | 32 | Match connection pool size |

export DASHSCOPE_CONNECTION_POOL_SIZE=2000
export DASHSCOPE_MAXIMUM_ASYNC_REQUESTS=2000
export DASHSCOPE_MAXIMUM_ASYNC_REQUESTS_PER_HOST=2000
Step 2: Add commons-pool2 dependency
Maven:
<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-pool2</artifactId>
  <version>the-latest-version</version>
</dependency>
Step 3: Create and use the object pool

| Variable | Default | Recommendation |
| --- | --- | --- |
| SAMBERT_OBJECTPOOL_SIZE (Sambert) | 500 | 1.5x-2x peak concurrency; must not exceed connection pool size |

import com.alibaba.dashscope.audio.tts.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.tts.SpeechSynthesizer;
import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

// Factory: tells the pool how to create and wrap SpeechSynthesizer instances
class SynthesizerFactory extends BasePooledObjectFactory<SpeechSynthesizer> {
  @Override
  public SpeechSynthesizer create() { return new SpeechSynthesizer(); }

  @Override
  public PooledObject<SpeechSynthesizer> wrap(SpeechSynthesizer obj) {
    return new DefaultPooledObject<>(obj);
  }
}

// Pool (global singleton)
GenericObjectPoolConfig<SpeechSynthesizer> config = new GenericObjectPoolConfig<>();
config.setMaxTotal(1200);
config.setMaxIdle(1200);
config.setMinIdle(1200);
GenericObjectPool<SpeechSynthesizer> pool =
  new GenericObjectPool<>(new SynthesizerFactory(), config);

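// Usage in each task
SpeechSynthesizer synth = pool.borrowObject();
try {
  // ... configure params and call synth
  pool.returnObject(synth);
} catch (Exception e) {
  // Do not return on failure. Invalidate the object so the pool
  // destroys it and frees its slot; dropping the reference alone
  // would leave it counted as borrowed.
  pool.invalidateObject(synth);
}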
Reference server sizing: a 4-core 8 GiB machine can handle ~600 concurrent Sambert TTS tasks with an object pool of 1200 and a connection pool of 2000.
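
The sizing rules in this guide (object pool at 1.5x-2x peak concurrency, connection pool at 2x peak concurrency and no smaller than the object pool, neither above the account limit) can be sketched as a small calculation. The function name recommend_pool_sizes and its parameters are hypothetical and not part of any SDK. Treat the result as a starting point: the reference configuration above uses a larger connection pool (2000) than the minimum this sketch returns.

```python
import math


def recommend_pool_sizes(peak_concurrency, headroom=2.0, account_limit=None):
    """Return (object_pool_size, connection_pool_size) for a given peak.

    Hypothetical helper that encodes the guideline values from this guide.
    """
    if peak_concurrency <= 0:
        raise ValueError("peak_concurrency must be positive")
    if not 1.5 <= headroom <= 2.0:
        raise ValueError("headroom should stay within the 1.5x-2x guideline")

    object_pool = math.ceil(peak_concurrency * headroom)
    # Connection pool: 2x peak, and never smaller than the object pool.
    connection_pool = max(2 * peak_concurrency, object_pool)

    if account_limit is not None:
        # Cap both pools at the account limit.
        object_pool = min(object_pool, account_limit)
        connection_pool = min(connection_pool, account_limit)
    return object_pool, connection_pool


print(recommend_pool_sizes(600))                # (1200, 1200)
print(recommend_pool_sizes(600, headroom=1.5))  # (900, 1200)
```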

Performance monitoring

Track these metrics to maintain healthy production TTS services:

| Metric | Description | Target |
| --- | --- | --- |
| First packet delay | Time from request to first audio chunk | < 500 ms |
| End-to-end latency | Total time for complete synthesis | Depends on text length |
| Error rate | Percentage of failed requests | < 0.1% |
| Pool utilization | Borrowed objects / pool size | 60%-80% at peak |
| Connection reuse ratio | Reused connections / total requests | > 95% |

The SDK reports the request ID and first packet delay for each synthesizer:
Python:
# TTS
print(f"Request ID: {synthesizer.get_last_request_id()}")
print(f"First packet delay: {synthesizer.get_first_package_delay()} ms")
Java:
// TTS
System.out.println("Request ID: " + synthesizer.getLastRequestId());
System.out.println("First packet delay: " + synthesizer.getFirstPackageDelay() + " ms");
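
Pool utilization and error rate are not returned by the calls above, so the application has to count them itself. The following is a minimal sketch of such a counter, assuming you call it around each borrow and return. PoolMetrics and its method names are hypothetical, and the 95th percentile is one possible way to summarize first packet delay, not something the SDK prescribes.

```python
import math
import threading
from collections import deque


class PoolMetrics:
    """Thread-safe counters for pool utilization, error rate, and delay."""

    def __init__(self, pool_size, window=1000):
        self.pool_size = pool_size
        self._lock = threading.Lock()
        self._borrowed = 0
        self._requests = 0
        self._errors = 0
        # Keep only the most recent delays so memory stays bounded.
        self._delays_ms = deque(maxlen=window)

    def on_borrow(self):
        with self._lock:
            self._borrowed += 1

    def on_finish(self, first_packet_delay_ms=None, failed=False):
        with self._lock:
            self._borrowed -= 1
            self._requests += 1
            if failed:
                self._errors += 1
            elif first_packet_delay_ms is not None:
                self._delays_ms.append(first_packet_delay_ms)

    def snapshot(self):
        with self._lock:
            delays = sorted(self._delays_ms)
            # Nearest-rank 95th percentile of first packet delay.
            p95 = delays[math.ceil(0.95 * len(delays)) - 1] if delays else None
            return {
                "pool_utilization": self._borrowed / self.pool_size,
                "error_rate": self._errors / self._requests if self._requests else 0.0,
                "first_packet_p95_ms": p95,
            }
```

Call on_borrow() right after borrowing a synthesizer and on_finish() when the task ends, passing the value of get_first_package_delay() on success. Export snapshot() to your monitoring system on a timer and compare it with the targets in the table.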

Production checklist

Before going live, verify the following:
  • API Key stored in environment variable, not hardcoded.
  • Connection pool and object pool sizes configured for expected peak load.
  • Pool sizes do not exceed your account's QPS limit.
  • Error handling closes or invalidates failed objects instead of returning them to the pool.
  • Graceful shutdown calls pool.shutdown() (Python) or pool.close() (Java).
  • WebSocket connections use the correct region endpoint.
  • Monitoring dashboards track first-packet delay, error rate, and pool utilization.
  • Load tested with 2x expected peak concurrency.
  • Retry logic with exponential backoff for transient failures.
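
The retry item in the checklist can be sketched as follows. This is an illustrative helper, not an SDK feature: retry_with_backoff and its parameters are hypothetical, and the retryable exception types are an assumption, because this guide does not say which exceptions the SDK raises for transient failures. Adjust them to what you observe in production.

```python
import random
import time


def retry_with_backoff(task, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       retryable=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Run task(); on a retryable error, wait and try again.

    The delay cap doubles after each failed attempt (0.5 s, 1 s, 2 s, ...)
    up to max_delay. Non-retryable errors propagate immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except retryable:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time up to the cap so that many
            # clients failing together do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

Each attempt should borrow a fresh synthesizer from the pool, since an object that failed must be closed rather than reused. Keep max_attempts small so that retries do not push traffic past your account's QPS limit.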