The Complete Guide to Seedance 2.0: Multimodal AI Video Creation from Scratch
Seedance 2.0 is ByteDance's multimodal AI video model that generates cinematic video from text, images, video clips, and audio. It offers two creation modes, an @ reference system for precise asset control, and native audio generation — all in one workflow. Here's how to use every feature.
Two Creation Modes
Seedance 2.0 provides two entry points, each suited to different workflows:
First/Last Frame Mode
- Upload one image as the opening or closing frame
- Add a text description of the desired motion and scene
- Best for: simple animations, image-to-video conversions, quick tests
All-in-One Reference Mode (Recommended)
- Combine images + video clips + audio + text in a single generation
- Supports up to 12 reference files simultaneously
- Best for: complex multi-asset productions, music videos, character-driven narratives
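The two modes can also be thought of as two differently shaped inputs. Here is a minimal Python sketch of that shape; the field names and file names are illustrative assumptions, not the schema of any official Seedance interface or API.

```python
# First/Last Frame mode: one image anchoring the clip, plus a text prompt.
# All field names ("mode", "frame_image", ...) are hypothetical.
first_last_frame_request = {
    "mode": "first_last_frame",
    "frame_image": "sunset_portrait.png",   # used as the opening frame
    "frame_position": "first",              # or "last" for a closing frame
    "prompt": "The character turns toward the camera as the sun sets.",
}

# All-in-One Reference mode: up to 12 mixed files plus a prompt that
# assigns each one a role with @ (covered in the next sections).
all_in_one_request = {
    "mode": "all_in_one_reference",
    "references": [
        {"type": "image", "file": "hero.png"},
        {"type": "video", "file": "dolly_shot.mp4"},
        {"type": "audio", "file": "theme.mp3"},
    ],
    "prompt": "@image1 as the main character, reference @video1 for camera "
              "movement, use @audio1 for background music.",
}
```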
Input Specifications
| Input Type | Limit | What It Controls |
|---|---|---|
| Images | Up to 9 | Character appearance, scene style, product details |
| Video clips | Up to 3 (total ≤15s) | Camera movement, action rhythm, transition effects |
| Audio files | Up to 3 MP3 files (total ≤15s) | Background music, sound effects, voiceover tone |
| Text | Natural language | Scene description, action instructions, mood |
Total file limit: 12 reference files per generation.
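If you script your asset preparation, these limits are easy to check before uploading. The following is a minimal pre-flight sketch based only on the numbers above; the Asset structure and check_limits helper are illustrative, not part of any official SDK.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    kind: str               # "image", "video", or "audio"
    duration_s: float = 0.0 # only meaningful for video and audio

def check_limits(assets: list[Asset]) -> list[str]:
    """Return a list of limit violations (empty list means the set is valid)."""
    errors = []
    images = [a for a in assets if a.kind == "image"]
    videos = [a for a in assets if a.kind == "video"]
    audios = [a for a in assets if a.kind == "audio"]

    if len(assets) > 12:
        errors.append(f"{len(assets)} files uploaded; the limit is 12.")
    if len(images) > 9:
        errors.append(f"{len(images)} images; the limit is 9.")
    if len(videos) > 3:
        errors.append(f"{len(videos)} video clips; the limit is 3.")
    if sum(v.duration_s for v in videos) > 15:
        errors.append("Combined video length exceeds 15 seconds.")
    if len(audios) > 3:
        errors.append(f"{len(audios)} audio files; the limit is 3.")
    if sum(a.duration_s for a in audios) > 15:
        errors.append("Combined audio length exceeds 15 seconds.")
    return errors
```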
The @ Reference System
This is the most important feature to learn. The @ system lets you assign a specific role to each uploaded file — the model follows your assignments precisely instead of guessing.
How to Use @
- Upload your assets (images, videos, audio)
- In the prompt box, type @ to open the asset picker
- Select a file and describe its role in the generation
Example Prompt with @ References
@image1 as the opening frame character,
reference @video1 for camera movement (slow push-in to close-up),
use @audio1 for background music,
@image2 as the environment reference.
The character walks toward the camera under warm sunset lighting.
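The same prompt can be assembled from a tag-to-role mapping if you keep your shot plans as data. A minimal sketch, assuming the picker labels files as @image1, @video1, and so on; the helper is illustrative, not a product feature.

```python
# Assemble the example prompt above from a tag -> role mapping.
roles = {
    "@image1": "as the opening frame character",
    "@video1": "for camera movement (slow push-in to close-up)",
    "@audio1": "for background music",
    "@image2": "as the environment reference",
}
scene = "The character walks toward the camera under warm sunset lighting."

prompt = ",\n".join(f"{tag} {role}" for tag, role in roles.items()) + ".\n" + scene
print(prompt)
```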
Key Rules
- Every uploaded file should be explicitly assigned a role with @
- Hover over assets to preview and verify you're referencing the correct file
- The model executes exactly what you assign — no guessing
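One way to enforce the first rule is to confirm that every uploaded file's tag actually appears in the prompt before generating. A small sketch with placeholder upload tags and prompt text:

```python
# Catch uploads that were never assigned a role with @.
# "uploads" and "prompt" stand in for your own asset tags and prompt text.
uploads = ["@image1", "@image2", "@video1", "@audio1"]
prompt = (
    "@image1 as the opening frame character, "
    "reference @video1 for camera movement, "
    "use @audio1 for background music."
)

unassigned = [tag for tag in uploads if tag not in prompt]
if unassigned:
    print("Not referenced in the prompt:", ", ".join(unassigned))  # -> @image2
```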
Prompt Writing Techniques
1. Write by Timeline
Break your prompt into time segments for precise control:
- 0–3s: "Wide shot of a city skyline at dawn, slow pan right"
- 4–8s: "Cut to medium shot, character enters from the left, walking"
- 9–12s: "Push-in to close-up on character's face, soft focus background"
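If you build prompts in a script, the segments can stay as structured data until the end. A minimal sketch using the example segments above; the timestamp format is one reasonable convention, not a required syntax.

```python
# Keep timeline segments as (start, end, description) and join them at the end.
segments = [
    (0, 3, "Wide shot of a city skyline at dawn, slow pan right"),
    (4, 8, "Cut to medium shot, character enters from the left, walking"),
    (9, 12, "Push-in to close-up on character's face, soft focus background"),
]

prompt = "\n".join(f"{start}-{end}s: {desc}" for start, end, desc in segments)
print(prompt)
```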
2. Use Specific Camera Language
The model understands professional cinematography terms:
- Push-in / Pull-out — camera moves toward or away from the subject
- Pan — horizontal camera movement
- Tilt — vertical camera movement
- Tracking shot — camera follows the subject's movement
- Orbit — camera circles around the subject
- One-take — continuous unbroken shot
3. Describe Transitions
When creating multi-shot sequences, specify how scenes connect:
- "Fade from outdoor scene to indoor close-up"
- "Match cut from spinning coin to spinning globe"
- "Whip pan transition to the next scene"
4. Distinguish Reference vs. Instruction
- Reference: "@video1 for camera movement" — the model extracts and replicates the camera work
- Instruction: "slow push-in from wide to close-up" — the model generates the movement from your text description
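The same camera move can therefore be requested either way. A short sketch of the two phrasings, assuming @image1 is an uploaded character image and @video1 is an uploaded clip that already contains the move:

```python
# The same push-in, written as a reference vs. as a plain instruction.
reference_prompt = (
    "@image1 as the main character, "
    "reference @video1 for camera movement."    # replicate the clip's camera work
)
instruction_prompt = (
    "@image1 as the main character, "
    "slow push-in from wide shot to close-up."  # generate the move from text alone
)
```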
Core Capabilities
Image Quality
- Physics-accurate motion (gravity, fabric draping, fluid dynamics)
- Smooth, natural human and animal movement
- Precise prompt adherence
- Consistent visual style throughout
Multimodal Combination
- Extract camera movement from a reference video
- Extract character appearance from reference images
- Extract musical rhythm from reference audio
- Combine all three in a single generation
Character Consistency
- Face, clothing, and expression preservation across shots
- Brand element consistency (logos, colors, typography)
- Scene style consistency (lighting, atmosphere)
Camera and Motion Replication
- Replicate specific cinematography techniques from reference videos
- Hitchcock zoom, orbit tracking, one-take sequences
- Precise motion speed and rhythm matching
Output Specifications
- Duration: 4–15 seconds (selectable)
- Resolution: Up to 2K / 1080p
- Aspect ratios: 16:9 (landscape), 9:16 (portrait), 1:1 (square)
- Audio: Native — includes dialogue sync, background music, sound effects
- Cost and speed: ~30 points per 15-second video; generation is roughly 10x faster than the previous model generation
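For planning, the output settings and point cost can be estimated up front. The sketch below assumes the cost scales roughly linearly from the ~30 points per 15-second figure, which is an extrapolation; the helper is illustrative rather than an official pricing API.

```python
# Validate output settings against the published ranges and estimate point cost.
VALID_ASPECTS = {"16:9", "9:16", "1:1"}
POINTS_PER_SECOND = 30 / 15  # assumed linear rate: ~2 points per second of output

def estimate_points(duration_s: int, aspect: str) -> float:
    if not 4 <= duration_s <= 15:
        raise ValueError("Duration must be between 4 and 15 seconds.")
    if aspect not in VALID_ASPECTS:
        raise ValueError(f"Aspect ratio must be one of {sorted(VALID_ASPECTS)}.")
    return duration_s * POINTS_PER_SECOND

print(estimate_points(12, "16:9"))  # -> 24.0 points (approximate)
```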
Important Notes
- No real human faces — uploads containing clear real human faces are blocked by content moderation
- Quality over quantity — upload only the assets that have the strongest impact on your desired output
- Verify @ assignments — hover over each asset reference to confirm correct file mapping
- Model randomness — results vary between generations; generate multiple times and pick the best
- Available on: Jimeng (即梦), Doubao (豆包), Volcano Engine (火山引擎)
Frequently Asked Questions
What are the two creation modes?
First/Last Frame mode (one image + text) for simple generations, and All-in-One Reference mode (up to 12 multimodal files) for complex productions.
How does the @ reference system work?
Type @ in the prompt box, select an uploaded file, and describe its role. Example: "@image1 as character reference, @video1 for camera movement." The model follows your assignments precisely.
What are the input limits?
Up to 9 images, 3 video clips (≤15s total), 3 audio files (≤15s total), and text. Maximum 12 files per generation.
What output does it produce?
4–15 seconds of video at up to 2K resolution with native audio, in 16:9, 9:16, or 1:1 aspect ratios.
Can I use real human photos?
No. Uploads with clear real human faces are blocked by content moderation. Use stylized or illustrated character references.
Ready to start creating? Try Seedance 2.0 now — free trial available.