Multimodal embeddings

Generate vectors from text, images, and video for cross-modal search and retrieval

Multimodal embedding models convert text, images, and video into numerical vectors. These vectors enable cross-modal search (text-to-image, image-to-image, text-to-video), image classification, video classification, and content retrieval.
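
As a rough illustration of cross-modal search, the sketch below ranks image vectors against a text query vector. It assumes cosine similarity as the comparison metric, which this page does not specify. The helper names and the toy 3-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for a text embedding and two image embeddings.
text_query = [1.0, 0.0, 0.0]
image_vectors = {
    "cat.png": [0.9, 0.1, 0.0],
    "car.png": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity(text_query, image_vectors)
print(ranking[0][0])  # prints "cat.png"
```

For more than a few thousand vectors, a vector database or an approximate nearest-neighbor index is usually a better fit than this linear scan.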

Prerequisites

Get an API key and set it as the DASHSCOPE_API_KEY environment variable, which the examples on this page read.
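
One way to fail early when the key is missing is sketched below. The variable name DASHSCOPE_API_KEY comes from the sample code on this page. The helper get_api_key is hypothetical and not part of the SDK.

```python
import os

def get_api_key(env=os.environ):
    """Return the key from DASHSCOPE_API_KEY, or raise a clear error if it is unset."""
    key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DASHSCOPE_API_KEY is not set; export it before running the examples."
        )
    return key
```

Calling get_api_key() once at startup gives a readable error instead of a failed request later.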

Independent vectors

Generate separate vectors for each input modality (text, image, or video). Use this when you need to process each content type independently.
Multimodal independent embedding requires the DashScope SDK or API. OpenAI-compatible endpoints are not supported.
import json
import os
from http import HTTPStatus

import dashscope

# Input can be a video
# video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
# inputs = [{'video': video}]

# Or an image
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
inputs = [{'image': image}]  # named "inputs" to avoid shadowing the built-in input()

resp = dashscope.MultiModalEmbedding.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="tongyi-embedding-vision-plus",
    input=inputs
)

# Print the embeddings on success; otherwise show the error code and message.
if resp.status_code == HTTPStatus.OK:
    print(json.dumps(resp.output, indent=4))
else:
    print(f"Request failed: {resp.code} {resp.message}")
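
To use the vectors rather than print them, you might pull them out of the response output as sketched below. This assumes the output is a mapping of the form {"embeddings": [{"index": ..., "type": ..., "embedding": [...]}]}. This page does not show that shape, so check it against the API reference before relying on it. The helper extract_vectors and the sample data are hypothetical.

```python
def extract_vectors(output):
    """Collect (index, type, vector) tuples, ordered by input index.

    Assumed shape (not confirmed by this page):
    {"embeddings": [{"index": int, "type": str, "embedding": [float, ...]}]}
    """
    items = sorted(output.get("embeddings", []), key=lambda item: item.get("index", 0))
    return [(item.get("index"), item.get("type"), item["embedding"]) for item in items]

# Hand-written stand-in for resp.output; real vectors are much longer.
sample_output = {
    "embeddings": [
        {"index": 1, "type": "text", "embedding": [0.3, 0.4]},
        {"index": 0, "type": "image", "embedding": [0.1, 0.2]},
    ]
}
vectors = extract_vectors(sample_output)
print(vectors[0][1], len(vectors[0][2]))  # prints "image 2"
```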

Model selection

To generate a separate vector for each input (such as an image and its text caption), use tongyi-embedding-vision-plus or tongyi-embedding-vision-flash.

Available models

Model | Dimensions | Text limit | Image limit | Video limit
tongyi-embedding-vision-plus | 64, 128, 256, 512, 1024, 1152 (default) | 1,024 tokens | Max 3 MB per image | Max 10 MB per video
tongyi-embedding-vision-flash | 64, 128, 256, 512, 768 (default) | 1,024 tokens | Max 3 MB per image | Max 10 MB per video
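
The dimension options above can be checked on the client before a request is sent. The sketch below encodes the table as plain data. The helper resolve_dimension is hypothetical, and how the chosen dimension is passed to the API is not covered on this page, so see the API reference for that parameter.

```python
# Values copied from the "Available models" table.
SUPPORTED_DIMENSIONS = {
    "tongyi-embedding-vision-plus": (64, 128, 256, 512, 1024, 1152),
    "tongyi-embedding-vision-flash": (64, 128, 256, 512, 768),
}
DEFAULT_DIMENSION = {
    "tongyi-embedding-vision-plus": 1152,
    "tongyi-embedding-vision-flash": 768,
}

def resolve_dimension(model, requested=None):
    """Return the dimension to use, falling back to the model's default."""
    if model not in SUPPORTED_DIMENSIONS:
        raise ValueError(f"unknown model: {model}")
    if requested is None:
        return DEFAULT_DIMENSION[model]
    if requested not in SUPPORTED_DIMENSIONS[model]:
        allowed = ", ".join(str(d) for d in SUPPORTED_DIMENSIONS[model])
        raise ValueError(f"{model} does not support {requested}; choose one of: {allowed}")
    return requested

print(resolve_dimension("tongyi-embedding-vision-flash"))  # prints 768
```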

Input and language support

Model | Text | Image | Video | Multi-images | Max items per request
tongyi-embedding-vision-plus | Chinese and English | JPEG, PNG, BMP (URL or Base64) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | Max 8 images | No element count limit. Total tokens must stay within the batch token limit.
tongyi-embedding-vision-flash | Chinese and English | JPEG, PNG, BMP (URL or Base64) | MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only) | Max 8 images | No element count limit. Total tokens must stay within the batch token limit.
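
A local image can be screened against the format and size limits above before it is encoded or uploaded. The sketch below is a client-side check only. It assumes 1 MB means 1024 * 1024 bytes and identifies the format by file extension, neither of which this page states. The helper check_image is hypothetical.

```python
import os

MAX_IMAGE_BYTES = 3 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}  # JPEG, PNG, BMP per the table

def check_image(filename, size_bytes):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in IMAGE_EXTENSIONS:
        problems.append(f"unsupported image format: {ext or '(none)'}")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append(f"image is {size_bytes} bytes; the limit is {MAX_IMAGE_BYTES}")
    return problems

print(check_image("photo.png", 2_000_000))  # prints []
```

For a file on disk, os.path.getsize(path) supplies size_bytes. The service may still reject a file that passes these checks, so treat this as a convenience rather than a guarantee.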

API reference

For detailed parameter descriptions and response schemas, see the Multimodal Embedding API reference.

Error codes

If a call fails, see Error messages.