What is I2V?
Image-to-Video (I2V) is a technique where you start from a still image and use a video model to animate it. Instead of the model generating both the appearance and the motion from text alone (T2V), I2V separates the two concerns:
- Stage 1: Generate a high-quality still image (using SDXL, Illustrious, or similar image models)
- Stage 2: Feed that image into WAN I2V to add motion and animation
This two-stage approach produces significantly better results than text-to-video alone, because the image model handles visual quality and character consistency, while the video model focuses only on motion.
Why Use I2V Instead of T2V?
- Better character consistency: The reference image locks in the character's appearance
- Higher visual quality: Image models are more mature and produce sharper results than video models generating from scratch
- More control: You can carefully craft the starting frame — fix poses, expressions, and composition before animating
- LoRA compatibility: Many character and style LoRAs are available for SDXL/Illustrious that aren't available for WAN
Most content on NijiTube is created using the I2V pipeline.
The Two-Stage Pipeline
Stage 1: Image Generation
Use an image generation model like SDXL or Illustrious XL to create a starting frame. This is done in ComfyUI with a standard image generation workflow:
- Load an anime checkpoint (e.g., Illustrious XL, AnimagineXL, or similar)
- Write your prompt describing the character, pose, and scene
- Generate at a resolution that matches your target video (e.g., 832x480 for landscape)
- Optionally add character LoRAs for specific characters
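Matching the Stage 1 render size to the video model's constraints avoids a lossy resize before Stage 2. A minimal sketch of snapping a rough target size down to an allowed resolution (the divisibility multiple is an assumption — SDXL latents need multiples of 8, and many video models expect multiples of 16; check your model card):

```python
def snap_resolution(width: int, height: int, multiple: int = 16) -> tuple[int, int]:
    """Round dimensions down to the nearest allowed multiple.

    The multiple-of-16 default is an assumption for WAN-style video
    models; SDXL itself only requires multiples of 8.
    """
    return (width // multiple) * multiple, (height // multiple) * multiple

# A rough 833x481 target snaps to the 832x480 landscape size used above.
print(snap_resolution(833, 481))  # (832, 480)
```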
Stage 2: Video Generation
Feed the generated image into a WAN I2V model (distinct from the T2V model used in the beginner guide):
- The I2V model takes your still image as the first frame
- A text prompt describes the motion — what happens in the video
- The model generates subsequent frames that animate the starting image
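Running Stage 2 repeatedly — for example, batch-animating a folder of Stage 1 stills — is easier through ComfyUI's HTTP API than through the browser UI. A minimal sketch, assuming a local ComfyUI server on the default port 8188 and a workflow graph exported in API format (the `client_id` value is arbitrary):

```python
import json
import urllib.request

def build_payload(workflow: dict, client_id: str = "i2v-batch") -> bytes:
    # ComfyUI's /prompt endpoint expects {"prompt": <API-format graph>, ...}
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")

def queue_workflow(workflow: dict, host: str = "127.0.0.1:8188") -> dict:
    """Queue one workflow run; the response includes the assigned prompt_id."""
    req = urllib.request.Request(
        f"http://{host}/prompt",
        data=build_payload(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

To use this, export your I2V workflow with ComfyUI's "Save (API Format)" option (requires dev mode), load it as JSON, patch the image-loading node's filename for each still, and call `queue_workflow` per image.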
The Key Difference
- T2V prompt: Describes the entire scene — appearance, setting, and motion
- I2V prompt: Describes only the motion — the appearance is already defined by the input image
For example, an I2V prompt might be: "Her hair sways gently in the breeze, she blinks and tilts her head slightly, soft ambient lighting, anime illustration style"
What You Need
- Everything from the beginner guide (ComfyUI, custom nodes)
- An SDXL or Illustrious XL checkpoint for image generation (placed in ComfyUI/models/checkpoints/)
- A WAN I2V model (different from the T2V model). Search for "Wan2.1-I2V-14B" on HuggingFace. GGUF versions are available for 12 GB VRAM
- Character or style LoRAs from CivitAI (optional, for SDXL stage)
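Before a first run, it helps to confirm the downloads landed in the folders ComfyUI scans. A small sketch assuming the usual layout (the filenames are hypothetical placeholders, and GGUF files may instead belong under models/unet depending on your ComfyUI-GGUF setup):

```python
from pathlib import Path

# Filenames are illustrative placeholders -- substitute your own downloads.
REQUIRED = {
    "checkpoints": ["illustrious-xl.safetensors"],   # Stage 1 image model
    "diffusion_models": ["wan2.1-i2v-14b-q4.gguf"],  # Stage 2 I2V model (GGUF)
}

def missing_models(comfy_root: str) -> list[str]:
    """Return required model files not found under <comfy_root>/models/."""
    root = Path(comfy_root) / "models"
    return [
        f"{subdir}/{name}"
        for subdir, names in REQUIRED.items()
        for name in names
        if not (root / subdir / name).exists()
    ]
```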
Getting Started
This guide is an overview. For hands-on setup:
- Search CivitAI for "WAN I2V workflow" to find ready-made ComfyUI workflows
- Browse NijiTube videos and download their attached workflows — many creators use I2V and share their complete setup
- Community guides on CivitAI and Reddit (r/StableDiffusion, r/comfyui) offer detailed step-by-step walkthroughs
I2V takes more setup than T2V, but the quality improvement is substantial. Most creators who start with T2V move to I2V as their primary workflow.