
Production AI monitoring

Overview

Qwen Cloud provides two complementary observability features for your model deployments:
  • Analytics: View token consumption, request counts, latency, and success rates
  • Monitoring: Track per-model performance metrics, view call logs, and manage rate limits

Analytics

Go to the Analytics page to view usage and analytics for your workspace.

Filters

  • Time range: Select the time window (such as 24 Hours).
  • Models: Filter by specific model or view all models.
  • Granularity: Choose the aggregation interval (such as 1 Hour).
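The granularity filter controls how raw requests are grouped into fixed aggregation intervals. A minimal sketch of that bucketing, assuming a hypothetical record schema with an ISO-8601 `timestamp` field (not the platform's actual log format):

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def bucket_requests(requests, granularity_hours=1):
    """Group request records into fixed intervals, mirroring the
    Analytics page's granularity filter. `requests` is a list of
    dicts with an ISO-8601 'timestamp' field (hypothetical schema)."""
    buckets = defaultdict(int)
    step_s = int(timedelta(hours=granularity_hours).total_seconds())
    for req in requests:
        ts = datetime.fromisoformat(req["timestamp"])
        # Truncate the timestamp down to the start of its interval.
        bucket_start = datetime.fromtimestamp(
            int(ts.timestamp()) // step_s * step_s, tz=timezone.utc)
        buckets[bucket_start] += 1
    return dict(buckets)
```

With 1-hour granularity, two requests at 00:10 and 00:50 land in the same bucket, while one at 01:05 starts a new bucket.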

Metrics

The page shows four key metrics:
  • Tokens: Total token consumption
  • Requests: Total number of API requests
  • Avg Latency: Average response latency
  • Success rate: Percentage of successful requests
The Tokens Analysis chart below provides a visual breakdown of token consumption over time.
Cost figures include all consumption across the entire platform; refer to your billing data for details.
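The four metrics above are straightforward aggregates over call records. A minimal sketch of how they relate, assuming a hypothetical record schema with `tokens`, `latency_ms`, and `ok` fields:

```python
def summarize(records):
    """Compute the four Analytics metrics from raw call records.
    Each record is a dict with 'tokens', 'latency_ms', and 'ok'
    fields -- a hypothetical schema, not the platform's actual format."""
    total = len(records)
    if total == 0:
        return {"tokens": 0, "requests": 0,
                "avg_latency_ms": 0.0, "success_rate": 0.0}
    return {
        "tokens": sum(r["tokens"] for r in records),
        "requests": total,
        "avg_latency_ms": sum(r["latency_ms"] for r in records) / total,
        # Success rate is reported as a percentage of all requests.
        "success_rate": sum(1 for r in records if r["ok"]) / total * 100,
    }
```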

Usage units by model type

  • Large language model (Text generation, Deep thinking, Vision understanding): Unit is the token; billed by input and output token count.
  • Vision model (Image generation): Unit is the image; billed by the number of successfully generated images.
  • Vision model (Video generation): Unit is the second; billed by successfully generated video duration.
  • Speech model (TTS, Realtime TTS, File ASR, Realtime ASR, Audio/video translation): Unit is seconds, characters, or tokens; varies by model, which may bill by audio duration, text characters, or token count.
  • Omni-modal model (Omni-modal, Realtime multimodal): Unit is the token; text is billed by tokens, and other modalities (audio, image, video) by their corresponding token count.
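When reconciling usage programmatically, it helps to resolve each model category to its billing unit before aggregating. A sketch of such a lookup, using illustrative category and unit names (the actual identifiers in API responses may differ):

```python
# Hypothetical mapping of model categories to billing units, based on
# the table above; the key and unit names are illustrative only.
BILLING_UNITS = {
    ("llm", "text_generation"): "token",
    ("vision", "image_generation"): "image",
    ("vision", "video_generation"): "second",
    ("omni", "realtime_multimodal"): "token",
}

def billing_unit(model_type, subcategory):
    """Return the billing unit for a model category, or raise if the
    category is not in the (illustrative) mapping."""
    try:
        return BILLING_UNITS[(model_type, subcategory)]
    except KeyError:
        raise ValueError(f"Unknown model category: {model_type}/{subcategory}")
```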

Monitoring

Go to the Monitoring page to monitor your API usage, configure alert rules, and manage rate limits.

Monitoring tab

The Monitoring tab shows a dashboard of your model performance for the selected workspace. At the top, summary cards display aggregate metrics including total models called, total calls, failures, average time to first token, and average latency. You can adjust the time range to focus on a specific period. Below the summary, a per-model table breaks down throughput (TPM/RPM), call volume, failure rate, and latency for each model. Use this to identify underperforming models or unexpected error spikes.
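The per-model table can be reproduced from raw call logs by grouping on the model name and averaging over the selected window. A minimal sketch, again assuming a hypothetical record schema with `model`, `tokens`, `latency_ms`, and `ok` fields:

```python
from collections import defaultdict

def per_model_stats(calls, window_minutes):
    """Build a per-model breakdown like the Monitoring tab's table.
    TPM/RPM are averaged over the selected window; failure rate is a
    percentage of that model's calls. The record schema is hypothetical."""
    grouped = defaultdict(list)
    for c in calls:
        grouped[c["model"]].append(c)
    stats = {}
    for model, rows in grouped.items():
        n = len(rows)
        failures = sum(1 for r in rows if not r["ok"])
        stats[model] = {
            "rpm": n / window_minutes,
            "tpm": sum(r["tokens"] for r in rows) / window_minutes,
            "failure_rate": failures / n * 100,
            "avg_latency_ms": sum(r["latency_ms"] for r in rows) / n,
        }
    return stats
```

Sorting this result by `failure_rate` or `avg_latency_ms` is a quick way to surface the underperforming models the dashboard is meant to highlight.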

Rate limit tab

The Rate Limit tab lets you request temporary rate limit increases for specific models. Click Increase Rate Limit Temporarily to submit a request, and track the status of previous requests in the table below.
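While a temporary increase request is pending, clients can stay within the current limit by retrying rate-limited calls with exponential backoff and jitter. A generic sketch; `RateLimitError` is a placeholder for whatever your client library raises on a 429 response:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for the client library's rate-limit (429) error."""

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0):
    """Retry `request_fn` when the API signals rate limiting.
    Waits base_delay * 2**attempt plus random jitter between tries,
    and re-raises after the final attempt fails."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Jitter spreads retries out so that many clients hitting the same limit do not retry in lockstep.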