Administration

Rate limits

Understand and manage API rate limits

How rate limits work

Rate limits control how many API requests and tokens your account can consume per minute for each model. There are two types of limits:
  • RPM (Requests Per Minute): Maximum number of API calls per minute.
  • TPM (Tokens Per Minute): Maximum number of tokens processed per minute.
Limits are applied at the account level — they are shared across all workspaces and API keys under one account.
Rate limits also apply per second: RPS = RPM / 60, TPS = TPM / 60. Burst requests within a single second can trigger throttling even if total usage stays below the per-minute limit.
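As a worked example, the per-second limits implied by per-minute limits can be computed directly (the figures below are illustrative, not actual account limits):

```python
# Illustrative per-minute limits (not actual account values)
RPM = 600        # requests per minute
TPM = 1_200_000  # tokens per minute

# Per-second limits derived from the per-minute limits
rps = RPM / 60  # 10.0 requests per second
tps = TPM / 60  # 20000.0 tokens per second

print(rps, tps)
```

With these limits, sending 20 requests within a single second would exceed the 10 RPS ceiling and could trigger throttling, even though 20 requests is well under the 600 RPM budget.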

View your rate limits

Go to Monitoring to see the rate limits and real-time usage for every model in your account. The Monitoring tab displays:
  • Summary metrics: Total Models, Total Calls, Failures, Avg Time to First Token, and Avg Latency for the selected time window.
  • Per-model breakdown: A table showing each model's Workspace, Avg TPM, Avg RPM, Total Calls, Failed Calls, Failure Rate, Avg Time to First Token, and Avg Latency.
Use the time range selector (such as 3 Hours, 24 Hours) to adjust the monitoring window.

Set rate limits per workspace

You can set custom RPM and TPM limits for individual models in a workspace.
1

Go to the Workspaces page

Go to Settings > Workspaces and click Edit on a sub-workspace.
2

Add models and set limits

Under Model Permission, click All Models to add models. For each model, set the Times / min (RPM) and Token / min (TPM) values, then click Apply.
3

Save changes

Click Save Changes to apply the new rate limits.
The RPM and TPM values you set cannot exceed the account-level limits for that model. The default workspace uses account-level limits and cannot be modified.

Temporarily increase rate limits

If you need higher throughput for a specific model, you can request a temporary increase through your account settings.
1

Go to the Rate Limit tab

Go to Monitoring and select the Rate Limit tab.
2

Request an increase

Click Increase Rate Limit Temporarily. Select the model, then enter the desired Token Rate Limit (tokens per 60 seconds). The dialog shows your current quota and the upper limit.
3

Save changes

Click Save Changes to apply the temporary increase.
Apply quotas based on actual needs. Unused capacity may be downsized to default limits after a period of inactivity.
The Rate Limit tab also shows a history of all temporary increase requests, including the application time, model code, and account TPM limit for each request.

Rate limit errors

When a rate limit is triggered, the API returns HTTP status code 429. The error message indicates which limit was hit:
  • Requests rate limit exceeded or You exceeded your current requests list: RPM limit reached.
  • Allocated quota exceeded or You exceeded your current quota: TPM limit reached.
  • Request rate increased too quickly: a sudden request surge triggered stability protection, even if RPM/TPM limits were not reached.
Limits reset within one minute. For other errors, see Error messages.

Best practices

Smooth your request rate

Spread requests evenly over time rather than sending them in bursts. Use constant-rate scheduling, exponential backoff, or a request queue to avoid triggering per-second limits.
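One way to apply constant-rate scheduling is to start each request at a fixed interval derived from your RPM limit. A minimal sketch (the `send` coroutine is a placeholder standing in for a real API call, and the `rpm` value is illustrative):

```python
import asyncio
import random

async def send(i):
    # Placeholder for a real API call.
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return i

async def paced_requests(n, rpm):
    """Start n requests at a constant rate of rpm requests per minute."""
    interval = 60 / rpm  # seconds between request starts
    tasks = []
    for i in range(n):
        tasks.append(asyncio.create_task(send(i)))
        await asyncio.sleep(interval)  # pace the next request start
    return await asyncio.gather(*tasks)

results = asyncio.run(paced_requests(5, rpm=600))
print(results)  # [0, 1, 2, 3, 4]
```

Pacing request starts rather than completions keeps the outgoing rate steady regardless of how long individual responses take, which is what per-second limits measure.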

Use a backup model

If a request is rate-limited, fall back to an alternative model to maintain availability:
import os
import asyncio
from openai import AsyncOpenAI, APIStatusError

API_KEY = os.getenv("DASHSCOPE_API_KEY")
MODEL = "qwen-plus-2025-07-28"          # primary model
BACKUP_MODEL = "qwen-plus-2025-07-14"   # fallback when the primary is rate-limited
QUESTION = "Who are you?"
NUM_REQUESTS = 10

client = AsyncOpenAI(
    api_key=API_KEY,
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

async def send_request(model):
    """Send one chat request; return True on success, False on rate limit or failure."""
    try:
        await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": QUESTION}]
        )
        return True
    except APIStatusError as e:
        if e.status_code == 429:  # rate limit triggered
            print(f"[Rate limit triggered] Model {model}")
            return False
        raise  # re-raise other API errors
    except Exception as e:
        print(f"[Request failed] Model {model}, Error: {e}")
        return False

async def task(i):
    # Try the primary model first; fall back to the backup model on rate limit.
    if await send_request(MODEL):
        return True
    return await send_request(BACKUP_MODEL)

async def main():
    # Fire all requests concurrently and count successes.
    results = await asyncio.gather(*(task(i) for i in range(NUM_REQUESTS)))
    print(f"Successful: {sum(results)}, Failed: {len(results) - sum(results)}")

if __name__ == "__main__":
    asyncio.run(main())

Split large tasks

Long conversations or large documents consume many tokens quickly. Split large batch tasks into smaller batches and submit them at different times to stay within TPM limits.
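A simple way to split work is to chunk the task list before submission; a sketch (the `docs` list and batch size are illustrative):

```python
def split_into_batches(items, batch_size):
    """Split a large task list into smaller fixed-size batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

docs = [f"doc-{i}" for i in range(10)]
batches = split_into_batches(docs, batch_size=4)
print([len(b) for b in batches])  # [4, 4, 2]
```

Each batch can then be submitted at a different time (for example, with a delay between submissions) so that the token consumption of any single minute stays within the TPM limit.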

Choose higher-limit models

Stable or latest model versions typically have higher rate limits than older snapshots. Where possible, use the latest version of a model.

Use batch inference

If you don't need real-time results, use the Batch API. Batch jobs are not subject to real-time rate limits but may have queuing and processing delays.