Visual understanding

Analyze images and videos

Generate content from visual inputs

Vision models understand images and videos to answer questions, extract text, solve problems, and generate descriptions. These multimodal models combine visual understanding with language capabilities for tasks ranging from OCR to creative writing.

Visual input structure

Vision models accept images and videos alongside text prompts. Each message can contain multiple content types:
  • Text prompt: Your question or instruction about the visual content
  • Image URL: Direct link to an online image
  • Base64 image: Encoded image data for local files
  • Video URL: Direct link to video content (select models)

Make your first vision call

Prerequisites

Which API should you use?
  • OpenAI Compatible: Best for new integrations and for migrating from OpenAI.
  • DashScope: Use this if you prefer the native SDK or need DashScope-specific features.
  • OpenAI compatible
  • DashScope
from openai import OpenAI
import os

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1"
)

completion = client.chat.completions.create(
  model="qwen3.6-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"
          },
        },
        {"type": "text", "text": "Describe what you see in this image"},
      ],
    },
  ],
)
print(completion.choices[0].message.content)
Response
This is a photo taken on a beach. In the photo, a person and a dog are sitting on the sand, with the sea and sky in the background. The person and dog appear to be interacting, with the dog's front paw resting on the person's hand. Sunlight is coming from the right side of the frame, adding a warm atmosphere to the scene.
{
  "choices": [
    {
      "message": {
        "content": "This image depicts a heartwarming scene on a sandy beach...",
        "reasoning_content": "The user wants a description of the image.\n\n1. **Identify the main subjects:** A woman and a dog...",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 2520,
    "completion_tokens": 777,
    "total_tokens": 3297,
    "completion_tokens_details": {
      "reasoning_tokens": 539,
      "text_tokens": 238
    },
    "prompt_tokens_details": {
      "image_tokens": 2503,
      "text_tokens": 17
    }
  },
  "created": 1774322504,
  "system_fingerprint": null,
  "model": "qwen3.6-plus",
  "id": "chatcmpl-be9bf2d1-2e70-91c4-b8bc-c7f5bbd30320"
}

Choose your vision model

  • Qwen3.6/3.5: Supports multimodal reasoning, 2D/3D image understanding, complex document parsing, visual programming, video understanding, and multimodal agents.
    • qwen3.6-plus: Best quality. Significantly improved object recognition, OCR, and object localization over qwen3.5-plus.
    • qwen3.5-plus: More affordable than qwen3.6-plus with strong overall performance.
    • qwen3.5-flash: Fastest and most affordable in the series.
    • qwen3.5-397b-a17b, qwen3.5-122b-a10b, qwen3.5-27b, qwen3.5-35b-a3b: Open-source Qwen3.5 models.
  • Qwen3-VL: Supports high-precision object recognition and localization (including 3D), agent tool calling, document and webpage parsing, complex problem solving, and long video understanding.
    • qwen3-vl-plus: Best quality in the Qwen3-VL series.
    • qwen3-vl-flash: Faster and more cost-effective.
For model names, context, pricing, and snapshot versions, see Model list. For concurrent request limits, see Rate limits.
Model feature comparison:
  • Qwen3.6/3.5 series:
    • Reasoning: Supported.
    • Tool calling: Supported.
    • Context cache: Supported by the stable versions of qwen3.6-plus, qwen3.5-plus, and qwen3.5-flash (explicit cache only).
    • Structured output: Supported in non-thinking mode.
    • Detected languages: 33 languages: Chinese, Japanese, Korean, Indonesian, Vietnamese, Thai, English, French, German, Russian, Portuguese, Spanish, Italian, Swedish, Danish, Czech, Norwegian, Dutch, Finnish, Turkish, Polish, Swahili, Romanian, Serbian, Greek, Kazakh, Uzbek, Cebuano, Arabic, Urdu, Persian, Hindi/Devanagari, and Hebrew.
  • Qwen3-VL series:
    • Reasoning: Supported.
    • Tool calling: Supported.
    • Context cache: Supported by the stable versions of qwen3-vl-plus and qwen3-vl-flash.
    • Structured output: Supported in non-thinking mode.
    • Detected languages: The same 33 languages as the Qwen3.6/3.5 series.
  • Qwen2.5-VL series:
    • Reasoning: Not supported.
    • Tool calling: Not supported.
    • Context cache: Supported by the stable versions of qwen-vl-max and qwen-vl-plus.
    • Structured output: Supported by the stable and latest versions of qwen-vl-max and qwen-vl-plus.
    • Detected languages: 11 languages: Chinese, English, Japanese, Korean, Arabic, Vietnamese, French, German, Italian, Spanish, and Russian.

Compare model performance

Answer questions about images

Describe the content of an image or classify and label it, such as identifying people, places, animals, and plants.
Input: If the sun is glaring, what item from this picture should I use?
Output: When the sun is glaring, you should use the pink sunglasses from the picture. Sunglasses can effectively block strong light, reduce UV damage to your eyes, and help protect your vision and improve visual comfort in bright sunlight.

Generate creative content from images

Generate vivid text descriptions based on image or video content. This is suitable for creative scenarios such as story writing, copywriting, and short video scripts.
Input: Please help me write an interesting social media post based on the content of the picture.
Output: Merry Christmas from our little winter wonderland! We're getting ready for the holidays with warm lights, pinecones, and plenty of rustic charm. Hope your season is filled with this much warmth and joy!

Extract text and information

Recognize text and formulas in images or extract information from receipts, certificates, and forms, with support for formatted text output. The Qwen3-VL model has expanded its language support to 33 languages. For a list of supported languages, see Model feature comparison.
Input: Extract the following from the image: ['Invoice Code', 'Invoice Number', 'Destination', 'Fuel Surcharge', 'Fare', 'Travel Date', 'Departure Time', 'Train Number', 'Seat Number']. Please output in JSON format.
Output: {"Invoice Code": "221021325353", "Invoice Number": "10283819", "Destination": "Development Zone", "Fuel Surcharge": "2.0", "Fare": "8.00<Full>", "Travel Date": "2013-06-29", "Departure Time": "Serial", "Train Number": "040", "Seat Number": "371"}

Solve complex visual problems

Solve problems in images, such as math, physics, and chemistry problems. This feature is suitable for primary, secondary, university, and adult education.
Input: Please solve the math problem in the image step by step.

Generate code from visual designs

Generate code from images or videos. This can be used to create HTML, CSS, and JS code from design drafts, website screenshots, and more.
Input: Design a webpage using HTML and CSS based on my sketch, with black as the main color.
Output: Webpage preview

Locate objects in images

The model supports 2D and 3D localization to determine object orientation, perspective changes, and occlusion relationships. Qwen3-VL adds 3D localization.
For Qwen2.5-VL, object detection is robust within 480x480 to 2560x2560 resolution. Outside this range, accuracy may decrease with occasional bounding box drift. To draw localization results on the original image, see FAQ.
2D localization:
  • Return bounding box coordinates: Detect all food items in the image and output their bbox coordinates in JSON format.
  • Return point (centroid) coordinates: Locate all food items in the image as points and output their point coordinates in XML format.
Visualization of 2D localization results
3D localization: Detect the car in the image and predict its 3D position. Output JSON: [{"bbox_3d": [x_center, y_center, z_center, x_size, y_size, z_size, roll, pitch, yaw], "label": "category"}].

Parse documents and PDFs

Parse image-based documents, such as scans or image PDFs, into QwenVL HTML or QwenVL Markdown format. This format not only accurately recognizes text but also obtains the position information of elements such as images and tables. The Qwen3-VL model adds the ability to parse documents into Markdown format.
Recommended prompts are: qwenvl html (to parse into HTML format) or qwenvl markdown (to parse into Markdown format).
Input: qwenvl markdown
Visualization of results

Analyze video content

Analyze video content, such as locating specific events and obtaining timestamps, or generating summaries of key time periods.
Input: Please describe the series of actions of the person in the video. Output in JSON format with start_time, end_time, and event. Use HH:mm:ss for timestamps.
Output: {"events": [{"start_time": "00:00:00", "end_time": "00:00:05", "event": "The person walks towards the table holding a cardboard box and places it on the table."}, {"start_time": "00:00:05", "end_time": "00:00:15", "event": "The person picks up a scanner and scans the label on the cardboard box."}, {"start_time": "00:00:15", "end_time": "00:00:21", "event": "The person puts the scanner back in its place and then picks up a pen to write information in a notebook."}]}

Work with visual content

Thinking mode

For enable/disable, streaming output, and thinking_budget, see Thinking.
Vision defaults: thinking is disabled by default for qwen3-vl-plus and qwen3-vl-flash, and enabled by default for Qwen3.5 models. Models with a -thinking suffix always think.
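As a minimal sketch of overriding the default, the thinking flag can be passed through extra_body on the OpenAI-compatible endpoint. The enable_thinking parameter name is taken from DashScope's Thinking guide; confirm the exact flag for your model snapshot there. The helper below only assembles the request keyword arguments, so no API call is made:

```python
# Sketch: build the kwargs for client.chat.completions.create() with an
# explicit thinking-mode override. "enable_thinking" is assumed to be the
# extra_body flag documented in the Thinking guide.

def build_vision_request(model, image_url, prompt, enable_thinking):
    """Assemble keyword arguments for a vision chat completion request."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        # Overrides the model default (off for qwen3-vl-plus/flash,
        # on for Qwen3.5 models).
        "extra_body": {"enable_thinking": enable_thinking},
    }

kwargs = build_vision_request(
    "qwen3-vl-plus",
    "https://example.com/photo.jpg",  # placeholder URL
    "Describe this image",
    enable_thinking=True,
)
```

You would then call `client.chat.completions.create(**kwargs)` exactly as in the earlier examples.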

Work with multiple images

Pass multiple images in a single request for tasks like product comparison and multi-page document processing. Include multiple image objects in the user message's content array.
Per request: up to 256 images when passed as a public URL or local file path, and up to 250 images when passed as Base64-encoded images. Independently, total tokens for all images and text must stay below the model's maximum input (combined image-and-text token limit).
  • OpenAI compatible
  • DashScope
  • Python
  • Node.js
  • curl
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.6-plus",
  messages=[
    {"role": "user","content": [
      {"type": "image_url","image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241022/emyrja/dog_and_girl.jpeg"},},
      {"type": "image_url","image_url": {"url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/tiger.png"},},
      {"type": "text", "text": "What do these images depict?"},
      ],
    }
  ],
)

print(completion.choices[0].message.content)
Response
Image 1 shows a scene of a woman and a Labrador retriever interacting on a beach. The woman is wearing a plaid shirt and sitting on the sand, shaking hands with the dog. The background is ocean waves and the sky, and the whole picture is filled with a warm and pleasant atmosphere.

Image 2 shows a scene of a tiger walking in a forest. The tiger's coat is orange with black stripes, and it is stepping forward. The surroundings are dense trees and vegetation, and the ground is covered with fallen leaves. The whole picture gives a feeling of wild nature.

Analyze video content

Visual understanding models support understanding video content. You can provide either an image list (sampled video frames) or a video file. The following examples analyze an online video or an image list specified by URL. For video limits and the number of images allowed in an image list, see the Video limits section.
We recommend using the latest or a recent snapshot version of the model for better performance in understanding video files.
  • Video file
  • Image list
Visual understanding models analyze content by extracting a sequence of frames from a video. You can control the frame extraction policy using the following two parameters:
  • fps: Controls the sampling frequency: one frame is extracted every 1/fps seconds. The value range is [0.1, 10], and the default is 2.0.
    • High-speed motion scenes: Set a higher fps value to capture more detail.
    • Static or long videos: Set a lower fps value for efficiency.
  • max_frames: The upper limit of frames extracted. When the number calculated based on fps exceeds max_frames, the system automatically and evenly samples frames to stay within the limit. This parameter is active only for the DashScope SDK.
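The interplay of the two parameters above can be sketched with a little arithmetic. This is only an estimate; the service's exact rounding and sampling policy may differ:

```python
import math

def estimate_frames(duration_s, fps=2.0, max_frames=None):
    """Approximate how many frames are sampled from a video.

    One frame is taken every 1/fps seconds; if that count exceeds
    max_frames, frames are instead sampled evenly to stay within the cap.
    The server-side policy may round differently -- treat as an estimate.
    """
    if not 0.1 <= fps <= 10:
        raise ValueError("fps must be in [0.1, 10]")
    n = math.ceil(duration_s * fps)
    if max_frames is not None:
        n = min(n, max_frames)
    return max(n, 1)

# A 60-second clip at the default fps=2 yields about 120 frames;
# with max_frames=80 the model sees 80 evenly sampled frames instead.
print(estimate_frames(60))                 # 120
print(estimate_frames(60, max_frames=80))  # 80
```

For high-motion content, raising fps increases the frame count (and token usage) proportionally until max_frames caps it.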
  • OpenAI compatible
  • DashScope
When you directly input a video file to a visual understanding model using the OpenAI SDK or HTTP method, you must set the "type" parameter in the user message to "video_url".
  • Python
  • Node.js
  • curl
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
  model="qwen3.6-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "video_url",
          "video_url": {
            "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241115/cqqkru/1.mp4"
          },
          "fps": 2
        },
        {
          "type": "text",
          "text": "Summarize what happens in this video"
        }
      ]
    }
  ]
)

print(completion.choices[0].message.content)

Use local files

Visual understanding models provide two ways to upload local files: Base64 encoding and direct file path upload. Choose the upload method based on the file size and SDK type. For specific recommendations, see How to choose a file upload method. Both methods must meet the file requirements described in Image limits.
  • Base64 encoding upload
  • File path upload
Convert the file to a Base64 encoded string and then pass it to the model. This method is supported by the OpenAI and DashScope SDKs, and HTTP requests.

Steps to pass a Base64-encoded string (image example)

1

File encoding

Convert the local image to a Base64 encoding.
#  Encoding function: Converts a local file to a Base64 encoded string
import base64
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxxx/eagle.png with the absolute path of your local image
base64_image = encode_image("xxx/eagle.png")
2

Construct a Data URL

The format is as follows: data:[MIME_type];base64,<base64_image>.
  1. Replace MIME_type with the actual media type. Ensure it matches the MIME type value in the Supported image formats table, such as image/jpeg or image/png.
  2. base64_image is the Base64 string generated in the previous step.
3

Call the model

Pass the Data URL through the image or image_url parameter and call the model.
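Putting steps 1 and 2 together, building the Data URL is plain string formatting. The MIME type shown is for a PNG; match it to your actual file format:

```python
import base64

def to_data_url(path, mime_type):
    """Read a local file and wrap it as data:<MIME_type>;base64,<data>."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{b64}"

# Example (path is a placeholder): the result can be passed directly
# as the image_url value in step 3.
# data_url = to_data_url("xxx/eagle.png", "image/png")
```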
The code examples below show how to pass local images, videos, and image lists using both Base64 encoding and file path methods. Due to the large number of examples, they are organized by file type.
  • Python
  • Java
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Replace xxx/eagle.png with the absolute path of your local image
local_path = "xxx/eagle.png"
image_path = f"file://{local_path}"
messages = [
  {'role':'user',
  'content': [{'image': image_path},
  {'text': 'Describe what you see in this image'}]}]
response = dashscope.MultiModalConversation.call(
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model='qwen3.6-plus',
  messages=messages)
print(response.output.choices[0].message.content[0]["text"])
  • OpenAI compatible
  • DashScope
  • Python
  • Node.js
  • curl
from openai import OpenAI
import os
import base64

def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image("xxx/eagle.png")
client = OpenAI(
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
  model="qwen3.6-plus",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {"url": f"data:image/png;base64,{base64_image}"},
        },
        {"type": "text", "text": "Describe what you see in this image"},
      ],
    }
  ],
)
print(completion.choices[0].message.content)
This example uses a locally saved test.mp4 file.
  • Python
  • Java
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

local_path = "xxx/test.mp4"
video_path = f"file://{local_path}"
messages = [
  {'role':'user',
  'content': [{'video': video_path,"fps":2},
  {'text': 'What scene does this video depict?'}]}]
response = dashscope.MultiModalConversation.call(
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model='qwen3.6-plus',  
  messages=messages)
print(response.output.choices[0].message.content[0]["text"])
  • OpenAI compatible
  • DashScope
  • Python
  • Node.js
  • curl
from openai import OpenAI
import os
import base64

def encode_video(video_path):
  with open(video_path, "rb") as video_file:
    return base64.b64encode(video_file.read()).decode("utf-8")

base64_video = encode_video("xxx/test.mp4")
client = OpenAI(
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
  model="qwen3.6-plus",  
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "video_url",
          "video_url": {"url": f"data:video/mp4;base64,{base64_video}"},
          "fps":2
        },
        {"type": "text", "text": "What scene does this video depict?"},
      ],
    }
  ],
)
print(completion.choices[0].message.content)
This example uses locally saved files: football1.jpg, football2.jpg, football3.jpg, and football4.jpg.
  • Python
  • Java
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

local_path1 = "football1.jpg"
local_path2 = "football2.jpg"
local_path3 = "football3.jpg"
local_path4 = "football4.jpg"

image_path1 = f"file://{local_path1}"
image_path2 = f"file://{local_path2}"
image_path3 = f"file://{local_path3}"
image_path4 = f"file://{local_path4}"

messages = [{'role':'user',
  'content': [{'video': [image_path1,image_path2,image_path3,image_path4],"fps":2},
  {'text': 'What scene does this video depict?'}]}]
response = dashscope.MultiModalConversation.call(
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model='qwen3.6-plus',
  messages=messages)

print(response.output.choices[0].message.content[0]["text"])
  • OpenAI compatible
  • DashScope
  • Python
  • Node.js
  • curl
import os
from openai import OpenAI
import base64

def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode("utf-8")

base64_image1 = encode_image("football1.jpg")
base64_image2 = encode_image("football2.jpg")
base64_image3 = encode_image("football3.jpg")
base64_image4 = encode_image("football4.jpg")
client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
  model="qwen3.6-plus",
  messages=[  
  {"role": "user","content": [
    {"type": "video","video": [
      f"data:image/jpeg;base64,{base64_image1}",
      f"data:image/jpeg;base64,{base64_image2}",
      f"data:image/jpeg;base64,{base64_image3}",
      f"data:image/jpeg;base64,{base64_image4}",]},
    {"type": "text","text": "Describe the specific process of this video"},
  ]}]
)
print(completion.choices[0].message.content)

Handle high-resolution images

The visual understanding model API has a limit on the number of visual tokens for a single image after encoding. With default configurations, high-resolution images are compressed, which may result in a loss of detail and affect understanding accuracy. Enable vl_high_resolution_images or adjust max_pixels to increase the number of visual tokens, which preserves more image details and improves understanding.
If an input image has more pixels than the model's pixel limit, the image is scaled down to fit within the limit.
  • Qwen3.5 and Qwen3-VL series (32*32 pixels per token):
    • vl_high_resolution_images=true: max_pixels is ignored. Token limit: 16384 tokens. Pixel limit: 16777216 (which is 16384*32*32).
    • vl_high_resolution_images=false (default): max_pixels is customizable (default 2621440, maximum 16777216). Token limit: max_pixels/32/32. Pixel limit: max_pixels.
  • qwen-vl-max, qwen-vl-max-latest, qwen-vl-max-2025-08-13, qwen-vl-plus, qwen-vl-plus-latest, qwen-vl-plus-2025-08-15 (32*32 pixels per token): Same behavior as the Qwen3.5 and Qwen3-VL series above.
  • QVQ and other Qwen2.5-VL models (28*28 pixels per token): vl_high_resolution_images is not supported. max_pixels is customizable (default 1003520, maximum 12845056). Token limit: max_pixels/28/28. Pixel limit: max_pixels.
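These limits reduce to simple arithmetic. A small sketch for a 32*32-pixels-per-token model:

```python
TOKEN_PIXELS = 32 * 32  # pixels represented by one visual token

def pixel_limit(vl_high_resolution_images, max_pixels=2621440):
    """Effective pixel ceiling for a 32x32 model, per the limits above."""
    if vl_high_resolution_images:
        return 16384 * TOKEN_PIXELS  # max_pixels is ignored in this mode
    return max_pixels

# Default configuration: 2621440 pixels -> 2560 visual tokens.
print(pixel_limit(False) // TOKEN_PIXELS)  # 2560
# High-resolution mode: 16777216 pixels -> 16384 tokens.
print(pixel_limit(True))                   # 16777216
```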
  • OpenAI compatible
  • DashScope
  • Python
  • Node.js
  • curl
import os
import time
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3.6-plus",
  messages=[
    {"role": "user","content": [
      {"type": "image_url","image_url": {"url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250212/earbrt/vcg_VCG211286867973_RF.jpg"},
      # max_pixels represents the maximum pixel threshold for the input image. It is invalid when vl_high_resolution_images=True, but customizable when vl_high_resolution_images=False. The maximum value varies by model.
      # "max_pixels": 16384 * 32 * 32
      },
      {"type": "text", "text": "What festival atmosphere does this picture show?"},
      ],
    }
  ],
  extra_body={"vl_high_resolution_images": True}
)
print(f"Model output: {completion.choices[0].message.content}")
print(f"Total input tokens: {completion.usage.prompt_tokens}")

Advanced features

Limits

Input file limits

  • Image limits
  • Video limits
  • Image resolution:
    • Minimum size: The width and height of the image must both be greater than 10 pixels.
    • Aspect ratio: The ratio of the long side to the short side of the image cannot exceed 200:1.
    • Pixel limit:
      • We recommend keeping the image resolution within 8K (7680x4320). Images that exceed this resolution may cause API call timeouts because of large file sizes and long network transmission times.
      • Automatic scaling: The model can adjust the image size using max_pixels and min_pixels. Therefore, providing ultra-high-resolution images does not improve recognition accuracy but increases the risk of call failures. We recommend scaling the image to a reasonable size on the client in advance.
  • Supported image formats
    • For resolutions below 4K (3840x2160), the supported image formats are as follows:
      • BMP: .bmp (MIME type: image/bmp)
      • JPEG: .jpe, .jpeg, .jpg (MIME type: image/jpeg)
      • PNG: .png (MIME type: image/png)
      • TIFF: .tif, .tiff (MIME type: image/tiff)
      • WEBP: .webp (MIME type: image/webp)
      • HEIC: .heic (MIME type: image/heic)
    • For resolutions between 4K (3840x2160) and 8K (7680x4320), only the JPEG, JPG, and PNG formats are supported.
  • Image size:
    • When passed as a public URL: A single image cannot exceed 20 MB for Qwen3.5, and 10 MB for other models.
    • When passed as a local path: A single image cannot exceed 10 MB.
    • When passed as a Base64-encoded string: The encoded string cannot exceed 10 MB.
    For more information about how to compress the file size, see How to compress an image or video to the required size.
  • Number of supported images:
    • When passed as a public URL or local file path: up to 256 images per request.
    • When passed as Base64-encoded strings: up to 250 images per request.
    These per-request caps are not the only constraint: the total tokens from all images and all text must also stay within the model's maximum input length.
    For example, if you use the qwen3-vl-plus model in thinking mode, the maximum input is 258048 tokens. If the input text consumes 100 tokens and each image consumes 2560 tokens, you can pass a maximum of (258048 - 100) / 2560 = 100 images.
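The worked example above can be checked with a couple of lines:

```python
def max_images(max_input_tokens, text_tokens, tokens_per_image):
    """How many equally sized images fit in the remaining input budget."""
    return (max_input_tokens - text_tokens) // tokens_per_image

# qwen3-vl-plus in thinking mode: 258048-token input limit,
# 100 text tokens, 2560 tokens per image -> 100 images.
print(max_images(258048, 100, 2560))  # 100
```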

File input methods

  • Public URL: Provide a publicly accessible file address that supports the HTTP or HTTPS protocol. For optimal stability and performance, upload the file to OSS to get a public URL.
To ensure that the model can successfully download the file, the request header of the public URL must include Content-Length (file size) and Content-Type (media type, such as image/jpeg). If either field is missing or incorrect, the file download fails.
  • Pass as a Base64-encoded string: Convert the file to a Base64-encoded string and then pass it.
  • Pass as a local file path (DashScope SDK only): Pass the path of the local file.
For recommendations on file input methods, see How to choose a file upload method?
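A quick sanity check on the Content-Length and Content-Type requirement for public URLs might look like the sketch below. Pair it with a HEAD request from your HTTP client of choice to obtain the headers; the function itself only inspects a header mapping:

```python
def headers_ok(headers):
    """Check that a URL's response headers carry the two fields the model
    needs to download the file: Content-Length and Content-Type."""
    h = {k.lower(): v for k, v in headers.items()}
    length = h.get("content-length", "")
    ctype = h.get("content-type", "")
    return length.isdigit() and int(length) > 0 and "/" in ctype

print(headers_ok({"Content-Length": "52341", "Content-Type": "image/jpeg"}))  # True
print(headers_ok({"Content-Type": "image/jpeg"}))                             # False
```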

Deploy to production

  • Image/video pre-processing: Visual understanding models have size limits for input files. For more information about how to compress files, see Image or video compression methods.
  • Process text files: Visual understanding models support processing files only in image format and cannot directly process text files. Convert the text file to an image format. We recommend using an image processing library, such as Python's pdf2image, to convert the file page by page into multiple high-quality images. Then pass them to the model using the multiple image input method.
  • Fault tolerance and stability
    • Timeout handling: In non-streaming calls, if the model does not finish within 180 seconds, a timeout error is usually triggered. To improve the user experience, the response body returns any content already generated before the timeout. If the response header contains x-dashscope-partialresponse: true, the response was truncated by a timeout. You can use the partial mode feature (supported by some models) to append the generated content to the messages array and resend the request so the model continues generating. For more information, see Continue writing based on incomplete output.
    • Retry mechanism: Design a reasonable API call retry logic, such as exponential backoff, to handle network fluctuations or temporary service unavailability.

Understand pricing and limits

  • Billing: The total cost is based on the total number of input and output tokens. For input and output prices, see Model list.
    • Token composition: Input tokens consist of text tokens and tokens converted from images or videos. Output tokens are the text that the model generates. In thinking mode, the model's thought process also counts toward the output tokens. If thinking mode produces no thought-process output, the price for non-thinking mode applies.
    • Calculate image and video tokens: Use the following code to estimate the token consumption for an image or video. The estimate is for reference only. The actual usage is based on the API response.
  • Image
  • Video
Formula: Image tokens = (h_bar * w_bar) / token_pixels + 2
  • h_bar, w_bar: The height and width of the scaled image. Before processing an image, the model pre-processes it by scaling it down to a specific pixel limit. The pixel limit depends on the values of the max_pixels and vl_high_resolution_images parameters. For more information, see Process high-resolution images.
  • token_pixels: The pixel value that corresponds to each visual token. This value varies by model:
    • Qwen3.5, Qwen3-VL, qwen-vl-max, qwen-vl-max-latest, qwen-vl-max-2025-08-13, qwen-vl-plus, qwen-vl-plus-latest, qwen-vl-plus-2025-08-15: Each token corresponds to 32x32 pixels.
    • QVQ and other Qwen2.5-VL models: Each token corresponds to 28x28 pixels.
The following code shows the approximate image scaling logic within the model. Use it to estimate the token count for an image. The actual billing is based on the API response.
import math
# Use the following command to install the Pillow library: pip install Pillow
from PIL import Image

def token_calculate(image_path, max_pixels, vl_high_resolution_images):
  # Open the specified image file.
  image = Image.open(image_path)

  # Get the original dimensions of the image.
  height = image.height
  width = image.width

  # Adjust the width and height to be multiples of 32 or 28, depending on the model.
  h_bar = round(height / 32) * 32
  w_bar = round(width / 32) * 32

  # Lower limit for image tokens: 4 tokens.
  min_pixels = 4 * 32 * 32
  # If vl_high_resolution_images is True, the upper limit for input image tokens
  # is 16384, and the corresponding maximum pixel value is 16384 * 32 * 32
  # (or 16384 * 28 * 28 for 28-pixel models). Otherwise, the value passed in
  # max_pixels is used as-is.
  if vl_high_resolution_images:
    max_pixels = 16384 * 32 * 32

  # Scale the image so that the total number of pixels is within the range of [min_pixels, max_pixels].
  if h_bar * w_bar > max_pixels:
    beta = math.sqrt((height * width) / max_pixels)
    h_bar = math.floor(height / beta / 32) * 32
    w_bar = math.floor(width / beta / 32) * 32
  elif h_bar * w_bar < min_pixels:
    beta = math.sqrt(min_pixels / (height * width))
    h_bar = math.ceil(height * beta / 32) * 32
    w_bar = math.ceil(width * beta / 32) * 32
  return h_bar, w_bar

if __name__ == "__main__":
  # Replace xxx/test.jpg with the path to your local image.
  h_bar, w_bar =  token_calculate("xxx/test.jpg", max_pixels=16384*32*32, vl_high_resolution_images=False)
  print(f"Scaled image dimensions: height {h_bar}, width {w_bar}")
  # The system automatically adds the <vision_bos> and <vision_eos> visual markers (1 token each).
  token = int((h_bar * w_bar) / (32 * 32))+2
  print(f"Number of tokens for the image: {token}")
  • View bills: View your bills or top up your account in the Billing section.
  • Rate limits: See Rate limits.
  • Free quota: Visual understanding models offer a free quota of 1 million tokens, valid for 90 days from the date you activate Qwen Cloud or your model request is approved.

Reference

For the input and output parameters of visual understanding models, see Chat API.

FAQ

Choose the most suitable upload method based on the SDK type, file size, and network stability.
Upload method by file type and size:
  • Image, 7 MB to 10 MB: DashScope SDK (Python, Java): pass the local path. OpenAI compatible / DashScope HTTP: only public URLs are supported; we recommend Object Storage Service.
  • Image, less than 7 MB: DashScope SDK: pass the local path. OpenAI compatible / DashScope HTTP: Base64 encoding.
  • Video, greater than 100 MB: Only public URLs are supported in all cases; we recommend Object Storage Service.
  • Video, 7 MB to 100 MB: DashScope SDK: pass the local path. OpenAI compatible / DashScope HTTP: only public URLs are supported; we recommend Object Storage Service.
  • Video, less than 7 MB: DashScope SDK: pass the local path. OpenAI compatible / DashScope HTTP: Base64 encoding.
Base64 encoding increases data size, so the original file must be under 7 MB. Using Base64 or a local path avoids server-side download timeouts and improves stability.
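The guidance above could be encoded as a small helper (sizes in MB; the function name and return strings are illustrative, not part of any SDK):

```python
def choose_upload_method(file_type, size_mb, sdk):
    """Pick an upload method per the guidance above.

    file_type: "image" or "video"; sdk: "dashscope" or "openai"
    (the latter covering OpenAI-compatible and DashScope HTTP calls).
    """
    limit = 10 if file_type == "image" else 100
    if size_mb >= limit:
        return "public URL (OSS recommended)"
    if sdk == "dashscope":
        return "local file path"
    # OpenAI-compatible / HTTP: Base64 only works for files under 7 MB.
    return "Base64 encoding" if size_mb < 7 else "public URL (OSS recommended)"

print(choose_upload_method("image", 5, "openai"))      # Base64 encoding
print(choose_upload_method("video", 50, "dashscope"))  # local file path
```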
Visual understanding models have size limits for input files. Compress files using the following methods.
Image compression methods
  • Online tools: Use online tools such as CompressJPEG or TinyPng.
  • Local software: Use software such as Photoshop to adjust the quality during export.
  • Code implementation:
# pip install pillow

from PIL import Image
def compress_image(input_path, output_path, quality=85):
  with Image.open(input_path) as img:
    img.save(output_path, "JPEG", optimize=True, quality=quality)

# Pass a local image
compress_image("/xxx/before-large.jpeg","/xxx/after-min.jpeg")
Video compression methods
# Basic conversion command
# -i: input file path
# -vcodec: video encoder (libx264 recommended)
# -crf: controls video quality. Recommended range: 18-28. Smaller values mean higher quality and larger files.
# -preset: trades encoding speed for compression efficiency. Common values: slow, fast, faster.
# -y: overwrite the output file if it exists.

ffmpeg -y -i input.mp4 -vcodec libx264 -crf 28 -preset slow output.mp4
After the visual understanding model outputs object localization results, use the following code to draw the bounding boxes and their labels on the original image.
  • Qwen2.5-VL: Returns coordinates as absolute values in pixels. These coordinates are relative to the top-left corner of the scaled image. To draw the bounding boxes, see the code in qwen2_5_vl_2d.py.
  • Qwen3-VL: Returns relative coordinates that are normalized to the range [0, 999]. To draw the bounding boxes, see the code in qwen3_vl_2d.py (for 2D localization) or qwen3_vl_3d.zip (for 3D localization).
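For Qwen3-VL's normalized output, mapping a [0, 999] box back to pixel coordinates on the original image is straightforward (a sketch; the helper name is illustrative):

```python
def denormalize_bbox(bbox, width, height):
    """Map a Qwen3-VL [x1, y1, x2, y2] box normalized to [0, 999]
    back to pixel coordinates on the original image."""
    x1, y1, x2, y2 = bbox
    return [x1 / 999 * width, y1 / 999 * height,
            x2 / 999 * width, y2 / 999 * height]

# A full-frame box on a 1920x1080 image maps back to the image corners.
print(denormalize_bbox([0, 0, 999, 999], 1920, 1080))  # [0.0, 0.0, 1920.0, 1080.0]
```

Qwen2.5-VL boxes need no conversion; they are already absolute pixel values on the scaled image.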

Error codes

If a call fails, see Error messages.