Skip to main content
Video gen & edits

Reference-to-video

Replicate motion and look

Wan-R2V accepts multimodal input (text, image, video, and audio) to generate performance videos. Use prompts to cast people or objects as the main characters. Quick links: API reference | Prompt guide

Getting started

Before you start, get an API key and set it as an environment variable. To use an SDK, install the DashScope SDK. Input prompt: "Video 1 walks in from the deep left side of the frame. Then the shot cuts to a close-up of Image 1. Video 1 is leaning against the rusty wall on the right side from Image 2. Hearing the footsteps, she slowly turns her head. After seeing Image 1, Video 1 says, 'Why did you still come?' Image 1 replies, 'Let's talk.'"
InputTypeRole
wan-r2v-girl-en.mp4 + reference voiceVideoVideo 1 (character)
wan-r2v-boy-en.jpg + reference voiceImageImage 1 (character)
wan-r2v-bg-en.jpgImageImage 2 (background)
Output: Multi-shot video with audio.
  • Python
  • Java
  • curl
Ensure that the DashScope Python SDK version is at least 1.25.16.
# -*- coding: utf-8 -*-
from http import HTTPStatus
from dashscope import VideoSynthesis
import dashscope
import os

# Set the base URL
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Set your API key
api_key = os.getenv("DASHSCOPE_API_KEY")

media = [
  {
    "type": "reference_video",
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260416/pfgcuv/wan-r2v-girl-en.mp4",
    "reference_voice": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260416/exiikq/wan-r2v-girl-demo-voice-en.mp3"
  },
  {
    "type": "reference_image",
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260416/skhalj/wan-r2v-boy-en.jpg",
    "reference_voice": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260416/pqxdoi/wan-r2v-boy-voice-en.mp3"
  },
  {
    "type": "reference_image",
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260416/vyqjxd/wan-r2v-bg-en.jpg"
  }
]

print('please wait...')
rsp = VideoSynthesis.call(
  api_key=api_key,
  model="wan2.7-r2v",
  media=media,
  resolution="720P",
  ratio="16:9",
  duration=10,
  prompt_extend=False,
  watermark=True,
  prompt="Video 1 walks in from the deep left side of the frame. Then the shot cuts to a close-up of Image 1. Video 1 is leaning against the rusty wall on the right side from Image 2. Hearing the footsteps, she slowly turns her head. After seeing Image 1, Video 1 says, \"Why did you still come?\" Image 1 replies, \"Let's talk.\"",
)
print(rsp)
if rsp.status_code == HTTPStatus.OK:
  print("video_url:", rsp.output.video_url)
else:
  print('Failed, status_code: %s, code: %s, message: %s' % (rsp.status_code, rsp.code, rsp.message))

Supported models

See Video generation models for a complete list of available models.

Core capabilities (wan2.7)

Single-image reference (multi-panel image)

Supported models: wan2.7 series. Description: You can input a multi-panel image (storyboard). The model automatically detects the multi-panel layout and generates a video with consistent characters, scenes, and shots. You can input only one multi-panel image at a time. Parameters:
  • media.type: Set to reference_image.
  • media.url: The URL or base64-encoded string of the multi-panel image.
  • prompt: If you provide only one reference image or video, use "reference image" or "reference video".
Input prompt: "Reference image, 3D cartoon adventure movie style, chibi characters with detailed textures, smooth actions, and vibrant colors. Keep the characters and forest scene consistent. Do not add text. Atmosphere: Adventurous, lighthearted, mysterious, whimsical. Characters: Boy explorer: round hat, backpack, short cloak. Sidekick: a flying small robot with a round body and blue glowing eyes. Scene: Fantasy forest with giant tree roots, mushrooms, vines, a treasure cave entrance, and sunbeams. Storyboard: 1. Wide shot: Tall trees and interlaced light beams in a mysterious and bright fantasy forest. 2. Medium shot: The boy pushes aside vines to explore. 3. Medium shot: The small robot flies beside him, scanning ahead with a blue light. 4. Close-up: An old treasure map unfolds in the boy's hands. 5. Close-up: He shows an excited expression, his eyes lighting up. 6. Action shot: The two jump over tree roots and a stream, continuing deeper into the forest. 7. Medium shot: A moss-covered treasure chest is revealed behind the vines. 8. Close-up: A golden glow shines from the edge of the treasure chest. 9. Final shot: The boy and the small robot stand before the treasure chest, looking at each other in surprise, full of adventure."
Input multi-panel imageOutput video
banana_storyboard
Generated video
  • Python
  • Java
  • curl
Ensure that the DashScope Python SDK version is at least 1.25.16.
# -*- coding: utf-8 -*-
from http import HTTPStatus
from dashscope import VideoSynthesis
import dashscope
import os

# Set the base URL
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Set your API key
api_key = os.getenv("DASHSCOPE_API_KEY")

media = [
  {
    "type": "reference_image",
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260403/wgjaxy/banana_storyboard_00000020.png"
  }
]


def sample_sync_call():
  print('----sync call, please wait a moment----')
  rsp = VideoSynthesis.call(
    api_key=api_key,
    model="wan2.7-r2v",
    media=media,
    resolution="720P",
    ratio="16:9",
    duration=10,
    prompt_extend=False,
    watermark=True,
    prompt="Reference image, 3D cartoon adventure movie style, chibi characters with detailed textures, smooth actions, and vibrant colors. Keep the characters and forest scene consistent. Do not add text. Atmosphere: Adventurous, lighthearted, mysterious, whimsical. Characters: Boy explorer: round hat, backpack, short cloak. Sidekick: a flying small robot with a round body and blue glowing eyes. Scene: Fantasy forest with giant tree roots, mushrooms, vines, a treasure cave entrance, and sunbeams. Storyboard: 1. Wide shot: Tall trees and interlaced light beams in a mysterious and bright fantasy forest. 2. Medium shot: The boy pushes aside vines to explore. 3. Medium shot: The small robot flies beside him, scanning ahead with a blue light. 4. Close-up: An old treasure map unfolds in the boy's hands. 5. Close-up: He shows an excited expression, his eyes lighting up. 6. Action shot: The two jump over tree roots and a stream, continuing deeper into the forest. 7. Medium shot: A moss-covered treasure chest is revealed behind the vines. 8. Close-up: A golden glow shines from the edge of the treasure chest. 9. Final shot: The boy and the small robot stand before the treasure chest, looking at each other in surprise, full of adventure.",
  )
  if rsp.status_code == HTTPStatus.OK:
    print(rsp.output.video_url)
  else:
    print('Failed, status_code: %s, code: %s, message: %s' %
      (rsp.status_code, rsp.code, rsp.message))


if __name__ == '__main__':
  sample_sync_call()

Multi-entity reference and voice customization

Supported models: wan2.7 series. Description: You can input multiple reference images and videos as entity materials. You can also specify a unique voice for each entity to enable multi-character interaction and voice differentiation. Parameters:
  • media: An array of reference materials.
    • media.type: Supports reference_image and reference_video. The total number of reference images and videos cannot exceed 5.
    • media.url: The URL of the material. Images also support base64-encoded strings.
    • media.reference_voice (optional): The audio URL to specify the voice for the entity. Use this with reference_image or reference_video. Audio logic: If a reference_video contains audio and reference_voice is not specified, the original video audio is used by default. If both are provided, reference_voice overwrites the original video audio.
  • prompt: Refer to the reference materials in the prompt according to the following rules:
    • Use identifiers such as Image 1, Image 2 for reference_image assets and Video 1, Video 2 for reference_video assets.
    • The reference order of the materials is defined by the media array. Images and videos are counted separately.
Input prompt: "Video 1 walks in from the deep left side of the frame. Then the shot cuts to a close-up of Image 1. Video 1 is leaning against the rusty wall on the right side from Image 2. Hearing the footsteps, she slowly turns her head. After seeing Image 1, Video 1 says, 'Why did you still come?' Image 1 replies, 'Let's talk.'"
InputTypeRole
wan-r2v-girl-en.mp4 + reference voiceVideoVideo 1 (character)
wan-r2v-boy-en.jpg + reference voiceImageImage 1 (character)
wan-r2v-bg-en.jpgImageImage 2 (background)
Output: Multi-shot video with audio.
  • Python
  • Java
  • curl
Ensure that the DashScope Python SDK version is at least 1.25.16.
# -*- coding: utf-8 -*-
from http import HTTPStatus
from dashscope import VideoSynthesis
import dashscope
import os

# Set the base URL
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Set your API key
api_key = os.getenv("DASHSCOPE_API_KEY")

media = [
  {
    "type": "reference_video",
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260416/pfgcuv/wan-r2v-girl-en.mp4",
    "reference_voice": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260416/exiikq/wan-r2v-girl-demo-voice-en.mp3"
  },
  {
    "type": "reference_image",
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260416/skhalj/wan-r2v-boy-en.jpg",
    "reference_voice": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260416/pqxdoi/wan-r2v-boy-voice-en.mp3"
  },
  {
    "type": "reference_image",
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260416/vyqjxd/wan-r2v-bg-en.jpg"
  }
]


def sample_sync_call():
  print('----sync call, please wait a moment----')
  rsp = VideoSynthesis.call(
    api_key=api_key,
    model="wan2.7-r2v",
    media=media,
    resolution="720P",
    ratio="16:9",
    duration=10,
    prompt_extend=False,
    watermark=True,
    prompt="Video 1 walks in from the deep left side of the frame. Then the shot cuts to a close-up of Image 1. Video 1 is leaning against the rusty wall on the right side from Image 2. Hearing the footsteps, she slowly turns her head. After seeing Image 1, Video 1 says, \"Why did you still come?\" Image 1 replies, \"Let's talk.\"",
  )
  if rsp.status_code == HTTPStatus.OK:
    print(rsp.output.video_url)
  else:
    print('Failed, status_code: %s, code: %s, message: %s' %
      (rsp.status_code, rsp.code, rsp.message))


if __name__ == '__main__':
  sample_sync_call()

Multi-entity reference and first-frame control

Supported models: wan2.7 series. Description: This feature adds first-frame control to the entity reference feature, which gives you more control over the composition and content flow of the video. Parameters:
  • media: An array of reference materials.
    • media.type: Supports first_frame, reference_image, and reference_video. You can provide a maximum of one first-frame image. You must provide at least one reference image or video. The total number of reference images and videos cannot exceed 5.
    • media.url: The URL of the material. Images also support base64-encoded strings.
  • prompt: Refer to the reference materials in the prompt according to the following rules:
    • Use "Image 1, Image 2" to refer to reference_image assets and "Video 1, Video 2" to refer to reference_video assets.
    • The reference order of the materials is defined by the media array. Images and videos are counted separately.
    • You do not need to reference the first frame in the prompt.
Input prompt: "An overhead shot captures a blue planet. The camera gradually zooms in toward the surface and cuts to a close-up of Image 1, who is holding Image 2 and eating it while saying: Why is not anyone coming to hang out with me?"
InputTypeRole
first-frame
First frameReference first frame
image-1
ImageImage 1 (entity)
image-2
ImageImage 2 (object)
Output: Video generated with the aspect ratio of the first frame.
  • Python
  • Java
  • curl
Ensure that the DashScope Python SDK version is at least 1.25.16.
# -*- coding: utf-8 -*-
from http import HTTPStatus
from dashscope import VideoSynthesis
import dashscope
import os

# Set the base URL
dashscope.base_http_api_url = 'https://dashscope-intl.aliyuncs.com/api/v1'

# Set your API key
api_key = os.getenv("DASHSCOPE_API_KEY")

media = [
  {
    "type": "first_frame",
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260414/ixwovg/wan2.7-r2v-first-frame.webp"
  },
  {
    "type": "reference_image",
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260414/fkltfw/wan2.7-r2v-image-qq.webp"
  },
  {
    "type": "reference_image",
    "url": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20260414/kxkbsv/wan2.7-r2v-image-ob.webp"
  }
]


def sample_sync_call():
  print('----sync call, please wait a moment----')
  rsp = VideoSynthesis.call(
    api_key=api_key,
    model="wan2.7-r2v",
    media=media,
    resolution="720P",
    duration=10,
    prompt_extend=False,
    watermark=True,
    prompt="An overhead shot captures a blue planet. The camera gradually zooms in toward the surface and cuts to a close-up of Image 1, who is holding Image 2 and eating it while saying: Why is not anyone coming to hang out with me?",
  )
  if rsp.status_code == HTTPStatus.OK:
    print(rsp.output.video_url)
  else:
    print('Failed, status_code: %s, code: %s, message: %s' %
      (rsp.status_code, rsp.code, rsp.message))


if __name__ == '__main__':
  sample_sync_call()

Provide references

Pass reference images, videos, and audio to the media array.

Input images

  • Number of first frames: A maximum of one first frame (media.type=first_frame) is allowed.
  • Number of reference images: A maximum of five reference images (media.type=reference_image) are allowed. The total number of reference images and reference videos cannot exceed 5.
  • Input methods:
    • Public URL: Supports HTTP or HTTPS protocols. Example: https://xxxx/xxx.png.
    • Base64-encoded string: Use the data:{MIME_type};base64,{base64_data} format, where:
      • {base64_data}: The Base64-encoded string of the image file.
      • {MIME_type}: The Multipurpose Internet Mail Extensions (MIME) type of the image. The type must match the file format.
        Image formatMIME type
        JPEGimage/jpeg
        JPGimage/jpeg
        PNGimage/png
        BMPimage/bmp
        WEBPimage/webp

Input videos

  • Number of reference videos: A maximum of five reference videos (media.type=reference_video) are allowed. The total number of reference images and reference videos cannot exceed 5.
  • Input methods:
    • Public URL: Supports HTTP or HTTPS protocols. Example: https://xxxx/xxx.mp4.

Input audio

  • Limits: The reference voice (media.reference_voice) can be used only with reference_image or reference_video to specify the voice for the corresponding entity role.
  • Input methods:
    • Public URL: Supports HTTP or HTTPS protocols. Example: https://xxxx/xxx.mp3.

Output video

  • Number of videos: 1.
  • Video specifications: The format is MP4.
  • Video URL validity period: 24 hours.
  • Video dimensions:
    • wan2.7 series: The resolution parameter controls the resolution level (720p or 1080p), and the ratio parameter controls the aspect ratio (16:9, 9:16, 1:1, 4:3, or 3:4).
      • If a first frame image is provided, the ratio parameter is ignored. The aspect ratio of the output video approximates that of the first frame image.
      • If a first frame image is not provided, the aspect ratio is specified by the ratio parameter. The default is 16:9.

Billing and rate limiting

  • For free quota information, see Free quota.
  • Billing details:
    • Input images are free of charge. Input and output videos are billed based on their duration in seconds.
    • Failed model calls or processing faults do not incur charges or consume the free quota.
  • Billing formula: Total billable duration (seconds) = Billable duration of input video (seconds) + Duration of output video (seconds).
  • Wan 2.7 series models
  • Wan 2.6 series models
Billable duration of input video: The maximum is 5 seconds. Truncation limit per video = 5 seconds / Number of input reference videos (reference images and the first frame image are excluded). Each video is billed based on min(actual duration, truncation limit). The billable durations for multiple videos are added together.
Number of reference videosTruncation limit per video
15s
22.5s
31.65s
41.25s
51s
Example: If the input is 2 reference videos + 1 image, the image is excluded from the count. The truncation limit is calculated based on 2 reference videos, resulting in 2.5 seconds per video. Billable input duration = min(video 1 duration, 2.5 seconds) + min(video 2 duration, 2.5 seconds).Billable duration of output video: The duration in seconds of the video successfully generated by the model.

API reference

FAQ

How do I reference materials in a prompt?

The reference method depends on the model and feature used:
  • Reference images are identified as Image 1, Image 2, and so on. Reference videos are identified as Video 1, Video 2, and so on.
  • Images and videos are counted separately. The order matches the order of the same type of material in the media array.
  • If you have only one reference image or video, you can simplify the identifier to "reference image" or "reference video".
  • Usually, you do not need to reference the first frame image in the prompt.
{
    "input": {
        "prompt": "Video 1 is playing the guitar, and Image 1 is holding a bouquet of flowers and walks past Video 1.",
        "media": [
            {
                "type": "first_frame",
                "url":  "https://example.com/scene.jpg"
            },
            {
                "type": "reference_video",
                "url":  "https://example.com/girl.mp4"
            },
            {
                "type": "reference_image",
                "url": "https://example.com/boy.png"
            }
        ]
    }
}

Can reference_voice be used with a first frame image?

This is not recommended. Use media.reference_voice with reference_image or reference_video to specify the timbre for the corresponding entity.