Skip to main content
Visual understanding

Visual understanding models

Choose a model for image analysis, video understanding, OCR, and more.

Image and video understanding

Start with qwen3.6-plus — strongest accuracy, 1M context, 2-hour video support, and the full feature set including function calling and built-in tools. Once your use case works well, try qwen3.5-flash to reduce cost — near-flagship quality with the same context and features.

Image resolution

Most models support up to 16M pixels per image. Higher resolution costs more tokens: each image uses h × w / (32 × 32) + 2 tokens.

Video support

  • Up to 2 hours / 2GB → qwen3.6-plus, qwen3.5-plus, qwen3.5-flash
  • Up to 1 hour / 2GB → qwen3-vl-plus, qwen3-vl-flash
  • Up to 40 seconds / 150MB → qwen3-omni-flash (also accepts audio — see Speech models)

Function calling + built-in tools

Let the model take actions based on what it sees in images or video.
  • Function calling: Qwen3.6, Qwen3.5, and Qwen3-VL models
  • Built-in tools (web search, code execution — no setup): qwen3.6-plus, qwen3.5-plus, qwen3.5-flash only

Structured output

Get valid JSON from visual input — e.g., extract product info from a photo. Available on Qwen3.6, Qwen3.5, and Qwen3-VL in non-thinking mode.

OCR and document extraction

qwen-vl-ocr — specialized in documents, tables, exam questions, and handwriting. Or use qwen3.6-plus / qwen3.5-flash for general text extraction from images.
ModelContextMax pixels/imageMax video durationMax video sizeMax imagesMax videosFunction callingBuilt-in toolsStructured outputBatch
qwen3.6-plus1M16M2h2GB256 / 25064
qwen3.5-flash1M16M2h2GB256 / 25064
qwen3-vl-plus256k16M1h2GB256 / 25064
qwen3-vl-flash256k16M1h2GB256 / 25064
qwen3-omni-flash64k40s150MB2,0481
Max images: via URL / via Base64.

All models

Model IDInputOutputContextMax OutputMax imagesMax videosFunction callingBuilt-in toolsStructured outputBatchCoding Plan
qwen3.6-plusText, image, videoText1M64k256 / 25064
qwen3.6-plus-2026-04-02Text, image, videoText1M64k256 / 25064
Model IDInputOutputContextMax OutputMax imagesMax videosFunction callingBuilt-in toolsStructured outputBatchCoding Plan
qwen3.5-plusText, image, videoText1M64k256 / 25064
qwen3.5-plus-2026-02-15Text, image, videoText1M64k256 / 25064
qwen3.5-flashText, image, videoText1M64k256 / 25064
qwen3.5-flash-2026-02-23Text, image, videoText1M64k256 / 25064
qwen3.5-397b-a17bText, image, videoText32k8k256 / 25064
qwen3.5-122b-a10bText, image, videoText32k8k256 / 25064
qwen3.5-27bText, image, videoText32k8k256 / 25064
qwen3.5-35b-a3bText, image, videoText32k8k256 / 25064
Model IDInputOutputContextMax OutputMax imagesMax videosFunction callingBuilt-in toolsStructured outputBatch
qwen3-vl-plusText, image, videoText256k32k256 / 25064
qwen3-vl-plus-2025-12-19Text, image, videoText256k32k256 / 25064
qwen3-vl-plus-2025-09-23Text, image, videoText256k32k256 / 25064
qwen3-vl-flashText, image, videoText256k32k256 / 25064
qwen3-vl-flash-2026-01-22Text, image, videoText256k32k256 / 25064
qwen3-vl-flash-2025-10-15Text, image, videoText256k32k256 / 25064
Unlike other models on this page, Qwen-Omni accepts audio input and can output both text and speech.Standard
Model IDInputOutputContextMax OutputMax imagesMax videosFunction callingBuilt-in toolsStructured outputBatch
qwen3-omni-flashText, image, audio, videoText, audio64k16k2,0481
qwen3-omni-flash-2025-12-01Text, image, audio, videoText, audio64k16k2,0481
qwen3-omni-flash-2025-09-15Text, image, audio, videoText, audio64k16k2,0481
qwen-omni-turboText, image, audio, videoText, audio32k2k2,0481
qwen-omni-turbo-latestText, image, audio, videoText, audio32k2k2,0481
qwen-omni-turbo-2025-03-26Text, image, audio, videoText, audio32k2k2,0481
Realtime — streaming audio input with built-in Voice Activity Detection (VAD).
Model IDInputOutput
qwen3-omni-flash-realtimeText, image, audio (streaming)Text, audio
qwen3-omni-flash-realtime-2025-12-01Text, image, audio (streaming)Text, audio
qwen3-omni-flash-realtime-2025-09-15Text, image, audio (streaming)Text, audio
qwen-omni-turbo-realtimeText, audio (streaming)Text, audio
qwen-omni-turbo-realtime-latestText, audio (streaming)Text, audio
qwen-omni-turbo-realtime-2025-05-08Text, audio (streaming)Text, audio
Captioner (open source) — audio captioning model.
Model IDInputOutputContextMax OutputMax imagesMax videosFunction callingBuilt-in toolsStructured outputBatch
qwen3-omni-30b-a3b-captionerAudioText64k32k
Specializes in extracting text from documents, tables, exam questions, and handwriting.
Model IDInputOutputContextMax OutputMax imagesMax videosFunction callingBuilt-in toolsStructured outputBatch
qwen-vl-ocrText, imageText38k8k256 / 250
qwen-vl-ocr-2025-11-20Text, imageText38k8k256 / 250
Older model versions retained for backward compatibility. We recommend Qwen3.5 or Qwen3-VL for new projects.
Model IDInputOutputContextMax OutputMax imagesMax videosFunction callingBuilt-in toolsStructured outputBatch
qwen3-vl-235b-a22b-thinkingText, image, videoText128k8k256 / 25064
qwen3-vl-235b-a22b-instructText, image, videoText128k8k256 / 25064
qwen3-vl-32b-thinkingText, image, videoText128k8k256 / 25064
qwen3-vl-32b-instructText, image, videoText128k8k256 / 25064
qwen3-vl-30b-a3b-thinkingText, image, videoText128k8k256 / 25064
qwen3-vl-30b-a3b-instructText, image, videoText128k8k256 / 25064
qwen3-vl-8b-thinkingText, image, videoText128k8k256 / 25064
qwen3-vl-8b-instructText, image, videoText128k8k256 / 25064
qwen2.5-vl-72b-instructText, image, videoText128k8k256 / 25064
qwen2.5-vl-32b-instructText, image, videoText128k8k256 / 25064
qwen2.5-vl-7b-instructText, image, videoText128k8k256 / 25064
qwen2.5-vl-3b-instructText, image, videoText128k8k256 / 25064
qwen2.5-omni-7bText, image, audio, videoText, audio32k8k2,0481
qwen-vl-maxText, imageText32k8k256 / 250
qwen-vl-max-latestText, imageText128k8k256 / 250
qwen-vl-max-2025-08-13Text, imageText128k8k256 / 250
qwen-vl-max-2025-04-08Text, imageText128k8k256 / 250
qwen-vl-plusText, imageText128k8k256 / 250
qwen-vl-plus-latestText, imageText128k8k256 / 250
qwen-vl-plus-2025-08-15Text, imageText128k8k256 / 250
qwen-vl-plus-2025-05-07Text, imageText128k8k256 / 250
qwen-vl-plus-2025-01-25Text, imageText128k8k256 / 250
qvq-maxText, imageText128k8k256 / 250
qvq-max-latestText, imageText128k8k256 / 250
qvq-max-2025-03-25Text, imageText128k8k256 / 250

Learn more