NijiTube

Image to Video (I2V)

What is I2V?

Image-to-Video (I2V) is a technique where you start from a still image and use a video model to animate it. Instead of the model generating both the appearance and the motion from text alone (T2V), I2V separates the two concerns:

  1. Stage 1: Generate a high-quality still image (using SDXL, Illustrious, or similar image models)
  2. Stage 2: Feed that image into WAN I2V to add motion and animation

This two-stage approach produces significantly better results than text-to-video alone, because the image model handles visual quality and character consistency, while the video model focuses only on motion.

Why Use I2V Instead of T2V?

  • Better character consistency: The reference image locks in the character's appearance
  • Higher visual quality: Image models are more mature and produce sharper results than video models generating from scratch
  • More control: You can carefully craft the starting frame — fix poses, expressions, and composition before animating
  • LoRA compatibility: Many character and style LoRAs are available for SDXL/Illustrious that aren't available for WAN

Most content on NijiTube is created using the I2V pipeline.

The Two-Stage Pipeline

Stage 1: Image Generation

Use an image generation model like SDXL or Illustrious XL to create a starting frame. This is done in ComfyUI with a standard image generation workflow:

  • Load an anime checkpoint (e.g., Illustrious XL, AnimagineXL, or similar)
  • Write your prompt describing the character, pose, and scene
  • Generate at a resolution that matches your target video (e.g., 832x480 for landscape)
  • Optionally add character LoRAs for specific characters
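
Matching the Stage 1 resolution to the target video can be sketched as a small helper. This is an illustration, not part of any official tool; the multiple-of-16 constraint is a common latent-grid assumption for SDXL-family and WAN models, so treat the `multiple` value as an assumption to adjust per model:

```python
def snap_resolution(width: int, height: int, multiple: int = 16) -> tuple[int, int]:
    """Round a target resolution to the nearest multiple required by the
    model's latent grid (assumption: 16 is a safe common denominator for
    SDXL-family image models and WAN video models)."""
    def snap(v: int) -> int:
        return max(multiple, round(v / multiple) * multiple)
    return snap(width), snap(height)

# The 832x480 landscape target from this guide is already grid-aligned:
print(snap_resolution(832, 480))   # -> (832, 480)
# An arbitrary target gets snapped to the nearest valid size:
print(snap_resolution(837, 475))   # -> (832, 480)
```

Generating the still at the exact video resolution avoids a lossy resize between the two stages.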

Stage 2: Video Generation

Feed the generated image into a WAN I2V model (distinct from the T2V model used in the beginner guide):

  • The I2V model takes your still image as the first frame
  • A text prompt describes the motion — what happens in the video
  • The model generates subsequent frames that animate the starting image
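
ComfyUI can also be driven programmatically: it exposes an HTTP endpoint (`/prompt` on port 8188 by default) that accepts a workflow graph as JSON. The sketch below builds a minimal, hypothetical I2V job and submits it. The node class names (`LoadImage`, `WanImageToVideo`) and input keys are illustrative placeholders; export a real workflow with "Save (API Format)" to get the exact graph your models expect:

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188/prompt"  # default ComfyUI API endpoint

def build_i2v_job(image_name: str, motion_prompt: str) -> dict:
    """Assemble a minimal workflow in ComfyUI's API format.
    Node names and inputs here are placeholders for illustration."""
    workflow = {
        "1": {"class_type": "LoadImage",
              "inputs": {"image": image_name}},
        "2": {"class_type": "WanImageToVideo",   # placeholder node name
              "inputs": {"start_image": ["1", 0],  # wire node 1's output in
                         "prompt": motion_prompt}},
    }
    return {"prompt": workflow}

def submit(job: dict) -> None:
    """Queue the job on a locally running ComfyUI instance."""
    req = urllib.request.Request(
        COMFY_URL,
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    submit(build_i2v_job(
        "stage1_frame.png",
        "Her hair sways gently in the breeze, she blinks and tilts her head",
    ))
```

Note how the motion prompt is the only text the I2V stage receives — the appearance is carried entirely by the input image.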

The Key Difference

  • T2V prompt: Describes the entire scene — appearance, setting, and motion
  • I2V prompt: Describes only the motion — the appearance is already defined by the input image

For example, an I2V prompt might be: "Her hair sways gently in the breeze, she blinks and tilts her head slightly, soft ambient lighting, anime illustration style"

What You Need

  • Everything from the beginner guide (ComfyUI, custom nodes)
  • An SDXL or Illustrious XL checkpoint for image generation (placed in ComfyUI/models/checkpoints/)
  • A WAN I2V model (different from the T2V model). Search for "Wan2.1-I2V-14B" on HuggingFace. Quantized GGUF versions are available that fit in 12 GB of VRAM
  • Character or style LoRAs from CivitAI (optional, for SDXL stage)
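
As a sanity check on where the files above should live, the helper below lists typical locations under a stock ComfyUI install. This assumes the default folder layout, and the filenames are made-up examples, not exact release names:

```python
from pathlib import Path

def expected_model_paths(comfy_root: str) -> list[Path]:
    """Typical model locations under a default ComfyUI install (assumption:
    stock folder layout; GGUF video models commonly go in models/unet or
    models/diffusion_models, depending on your loader node)."""
    root = Path(comfy_root)
    return [
        root / "models" / "checkpoints" / "illustrious_xl.safetensors",   # example name
        root / "models" / "diffusion_models" / "wan2.1_i2v_14b.gguf",     # example name
        root / "models" / "loras",  # optional character/style LoRAs go here
    ]

missing = [p for p in expected_model_paths("ComfyUI") if not p.exists()]
print(f"{len(missing)} expected path(s) not found")
```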

Getting Started

This guide is an overview. For hands-on setup:

  • Search CivitAI for "WAN I2V workflow" to find ready-made ComfyUI workflows
  • Browse NijiTube videos and download their attached workflows — many creators use I2V and share their complete setup
  • Look for detailed community walkthroughs on CivitAI and Reddit (r/StableDiffusion, r/comfyui)

I2V takes more setup than T2V, but the quality improvement is substantial. Most creators who start with T2V move to I2V as their primary workflow.
