Multimodal embeddings

Multimodal embedding models convert text, images, and video into numerical vectors. These vectors enable cross-modal search (text-to-image, image-to-image, text-to-video), image classification, video classification, and content retrieval.
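As a rough illustration of cross-modal search, the sketch below ranks image vectors against a text query vector. It assumes cosine similarity as the ranking metric, which this page does not specify, and it uses hand-written 3-dimensional toy vectors in place of real model output. The function names are hypothetical helpers, not part of any SDK.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (hypothetical values).
text_query = [1.0, 0.0, 0.0]
image_vectors = {
    "cat.png": [0.9, 0.1, 0.0],
    "car.png": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity(text_query, image_vectors)
print(ranking[0][0])
```

In a real pipeline the query and candidate vectors would come from the same model with the same dimension setting, since vectors of different lengths cannot be compared this way.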

Prerequisites

Get an API key and set it as the DASHSCOPE_API_KEY environment variable. The examples on this page read the key from that variable.
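One way to catch a missing key before making a request is a small guard like the following. This is a minimal sketch: get_api_key is a hypothetical helper, and the variable name DASHSCOPE_API_KEY is taken from the Python example on this page.

```python
import os

def get_api_key(env_var="DASHSCOPE_API_KEY"):
    """Return the API key from the environment, or raise if it is missing or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API")
    return key

# Usage: pass the result wherever an api_key argument is expected, for example
# api_key=get_api_key()
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the service.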

Independent vectors

Generate a separate vector for each input item, whether it is text, an image, or a video. Use this mode when each content type must be embedded and processed independently.

Independent multimodal embeddings are available only through the DashScope SDK or the DashScope API. OpenAI-compatible endpoints are not supported.

Python

import dashscope
import json
import os
from http import HTTPStatus

# Input can be a video
# video = "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4"
# inputs = [{'video': video}]

# Or an image
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"
inputs = [{'image': image}]  # named "inputs" to avoid shadowing the built-in input()

resp = dashscope.MultiModalEmbedding.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="tongyi-embedding-vision-plus",
    input=inputs
)

# Print the embeddings on success; otherwise print the error code and message
if resp.status_code == HTTPStatus.OK:
    print(json.dumps(resp.output, indent=4))
else:
    print(f"Request failed: {resp.code} - {resp.message}")

Java

import com.alibaba.dashscope.embeddings.MultiModalEmbedding;
import com.alibaba.dashscope.embeddings.MultiModalEmbeddingItemImage;
import com.alibaba.dashscope.embeddings.MultiModalEmbeddingItemVideo;
import com.alibaba.dashscope.embeddings.MultiModalEmbeddingParam;
import com.alibaba.dashscope.embeddings.MultiModalEmbeddingResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

import java.util.Collections;

public class Main {
  public static void main(String[] args) {
    try {
      MultiModalEmbedding embedding = new MultiModalEmbedding();
      // Input can be a video
      // MultiModalEmbeddingItemVideo video = new MultiModalEmbeddingItemVideo(
      //     "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250107/lbcemt/new+video.mp4");

      // Or an image
      MultiModalEmbeddingItemImage image = new MultiModalEmbeddingItemImage(
        "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png");

      MultiModalEmbeddingParam param = MultiModalEmbeddingParam.builder()
        .model("tongyi-embedding-vision-plus")
        .contents(Collections.singletonList(image))
        .build();

      MultiModalEmbeddingResult result = embedding.call(param);
      System.out.println(result);

    } catch (ApiException | NoApiKeyException | UploadFileException e) {
      System.err.println("API call error: " + e.getMessage());
      e.printStackTrace();
    }
  }
}
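After a successful call, the vectors usually need to be pulled out of the response before they can be stored or compared. The sketch below assumes the output is a dictionary with an "embeddings" list whose entries carry "index", "type", and "embedding" fields. That shape is an assumption, not something this page states, so check it against the Multimodal Embedding API reference. extract_embeddings is a hypothetical helper, and the sample dictionary is hand-written rather than real API output.

```python
def extract_embeddings(output):
    """Return (type, vector) pairs ordered by index.

    Assumed shape (verify against the API reference):
    {"embeddings": [{"index": int, "type": str, "embedding": [float, ...]}]}
    """
    items = sorted(output.get("embeddings", []), key=lambda item: item.get("index", 0))
    return [(item.get("type"), item["embedding"]) for item in items]

# Hand-written sample in the assumed shape (not real API output).
sample_output = {
    "embeddings": [
        {"index": 1, "type": "text", "embedding": [0.3, 0.4]},
        {"index": 0, "type": "image", "embedding": [0.1, 0.2]},
    ]
}

for kind, vector in extract_embeddings(sample_output):
    print(kind, len(vector))
```

With the Python example above, the helper would be called as extract_embeddings(resp.output) once the status check has passed.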

Supported models

Model	Dimensions	Text limit	Image limit	Video limit
tongyi-embedding-vision-plus	64, 128, 256, 512, 1024, 1152 (default)	1,024 tokens	Max 3 MB per image	Max 10 MB per video
tongyi-embedding-vision-flash	64, 128, 256, 512, 768 (default)	1,024 tokens	Max 3 MB per image	Max 10 MB per video
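The limits in the table can be checked on the client before a request is sent. The following is a minimal sketch with hypothetical helper names. It assumes 1 MB means 1,048,576 bytes, which this page does not define, and it does not show how a dimension is passed to the API.

```python
# Values copied from the table above.
SUPPORTED_DIMENSIONS = {
    "tongyi-embedding-vision-plus": (64, 128, 256, 512, 1024, 1152),
    "tongyi-embedding-vision-flash": (64, 128, 256, 512, 768),
}
DEFAULT_DIMENSION = {
    "tongyi-embedding-vision-plus": 1152,
    "tongyi-embedding-vision-flash": 768,
}
# Assumption: 1 MB = 1024 * 1024 bytes.
MAX_BYTES = {"image": 3 * 1024 * 1024, "video": 10 * 1024 * 1024}

def resolve_dimension(model, dimension=None):
    """Return the requested dimension, or the model default when none is given."""
    if model not in SUPPORTED_DIMENSIONS:
        raise ValueError(f"unknown model: {model}")
    if dimension is None:
        return DEFAULT_DIMENSION[model]
    if dimension not in SUPPORTED_DIMENSIONS[model]:
        raise ValueError(f"{model} does not support dimension {dimension}")
    return dimension

def fits_size_limit(kind, size_bytes):
    """Return True if a file of the given kind ('image' or 'video') is within its limit."""
    return size_bytes <= MAX_BYTES[kind]

print(resolve_dimension("tongyi-embedding-vision-flash"))
```

Smaller dimensions reduce storage and comparison cost, so validating the choice up front avoids a round trip for an unsupported value.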

Input and language support

Model	Text	Image	Video	Multi-images	Max items per request
tongyi-embedding-vision-plus	Chinese and English	JPEG, PNG, BMP (URL or Base64)	MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only)	Max 8 images	No element count limit. Total tokens must stay within the batch token limit.
tongyi-embedding-vision-flash	Chinese and English	JPEG, PNG, BMP (URL or Base64)	MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV (URL only)	Max 8 images	No element count limit. Total tokens must stay within the batch token limit.
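The format rules in this table can also be screened locally. The sketch below uses hypothetical helpers and assumes that the file extension in a path or URL reflects the actual format, which is not guaranteed. It treats the 8-image limit as a cap on the number of images supplied together.

```python
import os
from urllib.parse import urlparse

# Extensions assumed to correspond to the formats listed in the table above.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}
VIDEO_EXTENSIONS = {".mp4", ".mpeg", ".mov", ".mpg", ".webm", ".avi", ".flv", ".mkv"}
MAX_MULTI_IMAGES = 8

def _extension(location):
    """Return the lowercase file extension of a path or URL, ignoring any query string."""
    return os.path.splitext(urlparse(location).path)[1].lower()

def is_url(location):
    """Return True for http and https URLs."""
    return urlparse(location).scheme in ("http", "https")

def is_supported_image(location):
    """Check the extension only; Base64 input would need a separate check."""
    return _extension(location) in IMAGE_EXTENSIONS

def is_supported_video(location):
    """Videos must be given as a URL and use one of the listed formats."""
    return is_url(location) and _extension(location) in VIDEO_EXTENSIONS

def within_multi_image_limit(images):
    """Return True if the number of images does not exceed the limit."""
    return len(images) <= MAX_MULTI_IMAGES

print(is_supported_image("https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"))
```

These checks only filter obvious mistakes. The service remains the authority on whether a given file is accepted.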

API reference

For detailed parameter descriptions and response schemas, see the Multimodal Embedding API reference.

Error codes

If a call fails, see Error messages.
