Understand and manage API rate limits
How rate limits work
Rate limits control how many API requests and tokens your account can consume per minute for each model. There are two types of limits:
- RPM (Requests Per Minute): Maximum number of API calls per minute.
- TPM (Tokens Per Minute): Maximum number of tokens processed per minute.
Rate limits are also enforced per second: RPS = RPM / 60 and TPS = TPM / 60. A burst of requests within a single second can therefore trigger throttling even when total usage stays below the per-minute limit.
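As an illustration of the per-second conversion, the sketch below computes RPS and TPS from per-minute limits and spaces requests evenly so that no single second exceeds RPM / 60. The names `per_second_limits` and `RequestPacer`, and the example limits of 600 RPM and 120,000 TPM, are assumptions made for this example and are not part of the platform's API. Treat it as one possible client-side approach, not a required one.

```python
import time


def per_second_limits(rpm: int, tpm: int) -> tuple[float, float]:
    """Convert per-minute limits to per-second limits (RPS, TPS)."""
    return rpm / 60, tpm / 60


class RequestPacer:
    """Hypothetical helper that spaces requests at least 60 / RPM seconds apart."""

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm      # minimum gap between two requests
        self._clock = clock             # injectable for testing
        self._sleep = sleep             # injectable for testing
        self._next_slot = clock()       # earliest time the next request may start

    def wait(self) -> float:
        """Block until the next request slot is free; return the time waited."""
        now = self._clock()
        delay = max(0.0, self._next_slot - now)
        if delay > 0:
            self._sleep(delay)
        # Reserve the following slot, measured from when this request starts.
        self._next_slot = max(now, self._next_slot) + self.interval
        return delay


if __name__ == "__main__":
    rps, tps = per_second_limits(rpm=600, tpm=120_000)  # example values only
    print(f"RPS = {rps}, TPS = {tps}")

    pacer = RequestPacer(rpm=600)
    for i in range(3):
        waited = pacer.wait()
        print(f"request {i} sent after waiting {waited:.3f}s")
```

Calling `pacer.wait()` before each API request keeps the request rate smooth. The sketch paces requests only; it does not account for the token (TPS) limit.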
View your rate limits
Go to Monitoring to see the rate limits and real-time usage for every model in your account.
The Monitoring tab displays:
- Summary metrics: Total Models, Total Calls, Failures, Avg Time to First Token, and Avg Latency for the selected time window.
- Per-model breakdown: A table showing each model's Workspace, Avg TPM, Avg RPM, Total Calls, Failed Calls, Failure Rate, Avg Time to First Token, and Avg Latency.
Set rate limits per workspace
You can set custom RPM and TPM limits for individual models in a workspace.
1. Go to the Workspaces page: go to Settings > Workspaces and click Edit on a sub-workspace.
2. Add models and set limits: under Model Permission, click All Models to add models. For each model, set the Times / min (RPM) and Token / min (TPM) values, then click Apply.
3. Save changes: click Save Changes to apply the new rate limits.
Temporarily increase rate limits
If you need higher throughput for a specific model, you can request a temporary increase from the Rate Limit tab in Monitoring.
1. Go to the Rate Limit tab: go to Monitoring and select the Rate Limit tab.
2. Request an increase: click Increase Rate Limit Temporarily. Select the model, then enter the desired Token Rate Limit (tokens per 60 seconds). The dialog shows your current quota and the upper limit.
3. Save changes: click Save Changes to apply the temporary increase.
Request only the quota you actually need. Unused capacity may be reduced to the default limits after a period of inactivity.
Rate limit errors
When a rate limit is triggered, the API returns HTTP status code 429. The error message indicates which limit was hit:
| Error message | Cause |
|---|---|
| Requests rate limit exceeded or You exceeded your current requests list | RPM limit reached |
| Allocated quota exceeded or You exceeded your current quota | TPM limit reached |
| Request rate increased too quickly | Sudden request surge triggered stability protection, even if RPM/TPM limits were not reached |
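A client can react to these errors by identifying which limit was hit and retrying after a growing delay. The sketch below is a minimal example under stated assumptions: `send` is a hypothetical callable that performs one API request and returns a `(status_code, message)` pair, and `classify_rate_limit_error`, `call_with_backoff`, and `RateLimitExceeded` are illustrative names, not part of any SDK. The message matching relies only on the strings listed in the table above. The delay values are arbitrary starting points, not documented recommendations.

```python
import random
import time


def classify_rate_limit_error(message: str) -> str:
    """Map a 429 error message to 'rpm', 'tpm', 'surge', or 'unknown'."""
    text = message.lower()
    if "increased too quickly" in text:
        return "surge"
    if "requests rate limit exceeded" in text or "current requests list" in text:
        return "rpm"
    if "quota" in text:
        return "tpm"
    return "unknown"


class RateLimitExceeded(Exception):
    """Raised when every retry still ends in HTTP 429 (hypothetical helper)."""


def call_with_backoff(send, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() and retry on HTTP 429 with exponential backoff and jitter.

    send is an assumed callable returning (status_code, message).
    """
    message = ""
    for attempt in range(max_retries + 1):
        status, message = send()
        if status != 429:
            return status, message
        if attempt == max_retries:
            break
        # Double the delay on each attempt, capped at max_delay.
        delay = min(max_delay, base_delay * 2 ** attempt)
        # Add up to 50% random jitter so parallel clients do not retry in sync.
        sleep(delay + random.uniform(0, delay / 2))
    kind = classify_rate_limit_error(message)
    raise RateLimitExceeded(f"still rate limited ({kind}): {message}")
```

Backing off gradually also helps avoid the "Request rate increased too quickly" case, because retries do not arrive as a sudden surge.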