NijiTube

Your First AI Anime Video

Overview

This guide walks you through generating your first AI anime video from scratch. By the end you will have a short anime-style clip ready to upload.

The pipeline: ComfyUI (node-based UI) + WAN 2.2 14B GGUF (text-to-video model). The 14B model produces high-quality output, and the GGUF format lets it run on consumer GPUs with as little as 8 GB VRAM.

What You Need

Hardware

  • GPU: NVIDIA with at least 8 GB VRAM (RTX 3060 or better). 12 GB recommended
  • RAM: 32 GB recommended (16 GB minimum — may be tight)
  • Storage: ~15 GB free for models and outputs

Software

  • Windows 10/11 (this guide uses the Windows portable version)
  • An internet connection to download models (~12 GB total)

Step 1: Install ComfyUI

ComfyUI is a node-based interface for running AI generation models. It handles the entire pipeline visually.

  1. Go to the ComfyUI GitHub releases page and download the latest Windows portable package (the .7z file)
  2. Extract it to a folder with a short path, for example C:\ComfyUI
  3. Run run_nvidia_gpu.bat
  4. Your browser will open http://127.0.0.1:8188 — this is the ComfyUI interface

If nothing happens, make sure you have an NVIDIA GPU with up-to-date drivers. AMD GPUs require a different setup not covered here.
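Before moving on, you can confirm the server is actually answering. This is an optional sketch (the helper name is mine, not part of ComfyUI); it assumes the default port 8188 from the step above.

```python
import urllib.request
import urllib.error

def server_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `url` within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    # 8188 is ComfyUI's default port; adjust if you changed it at launch.
    print("ComfyUI reachable:", server_is_up("http://127.0.0.1:8188"))
```

If this prints False while the terminal window shows the server running, check that nothing else is bound to port 8188.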

Step 2: Install the GGUF Custom Node

The 14B model uses the GGUF format for memory efficiency. ComfyUI needs one custom node to load GGUF files.

  1. In ComfyUI, click the Manager button (top-right toolbar)
  2. Click Install Custom Nodes
  3. Search for ComfyUI-GGUF
  4. Click Install and wait for it to finish
  5. Restart ComfyUI (close the terminal window and run run_nvidia_gpu.bat again)

If you do not see the Manager button, install ComfyUI-Manager first: search "ComfyUI-Manager" on GitHub for instructions.

Step 3: Download Models

You need three files. Download them and place each in the correct folder inside your ComfyUI installation.

The WAN 2.2 14B Model (GGUF)

This is the core video generation model. The Q4 version (~9 GB) fits on 8 GB VRAM GPUs.

  • Search HuggingFace for "city96 Wan2.2-T2V-14B-GGUF" and download wan2.2_t2v_low_noise_14B_Q4_K_M.gguf (~9 GB)
  • Place in: ComfyUI/models/diffusion_models/

The "low noise" variant produces cleaner output and is recommended for beginners. If you have 12+ GB VRAM, you can use the Q6 version for even better quality.

The VAE

  • Search HuggingFace for "Comfy-Org Wan_2.1_ComfyUI_repackaged" and download wan_2.1_vae.safetensors (~300 MB)
  • Place in: ComfyUI/models/vae/

Note: The 14B model uses the 2.1 VAE, not the 2.2 VAE. This is correct.

The Text Encoder

  • Search HuggingFace for "Comfy-Org Wan_2.2_ComfyUI_Repackaged" and download umt5_xxl_fp8_e4m3fn_scaled.safetensors (~2.5 GB)
  • Place in: ComfyUI/models/text_encoders/
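A misplaced or misnamed file is the most common cause of "model not found" errors later, so it is worth verifying all three downloads landed in the right subfolders. A minimal check, using the folders and filenames from the steps above (the function name is my own):

```python
from pathlib import Path

# (subfolder under ComfyUI/models/, expected filename) from the download steps
EXPECTED = [
    ("diffusion_models", "wan2.2_t2v_low_noise_14B_Q4_K_M.gguf"),
    ("vae", "wan_2.1_vae.safetensors"),
    ("text_encoders", "umt5_xxl_fp8_e4m3fn_scaled.safetensors"),
]

def missing_models(comfy_root: str) -> list[str]:
    """Return the expected model files that are NOT present under ComfyUI/models/."""
    models = Path(comfy_root) / "models"
    return [str(models / sub / name)
            for sub, name in EXPECTED
            if not (models / sub / name).is_file()]

if __name__ == "__main__":
    for path in missing_models(r"C:\ComfyUI"):
        print("Missing:", path)
```

Run it with the folder you extracted ComfyUI into; no output means all three files are in place.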

Step 4: Build the Workflow

Clear the default workflow first: click Menu (☰) → Clear Workflow (or press Ctrl+Shift+Delete). You will build the workflow from scratch.

To add a node, double-click on the empty canvas. A search bar appears — type the node name listed below and select it from the results.

4-1. Model Loading Chain

Add and connect these three nodes in order:

  1. Double-click canvas → search UnetLoaderGGUF → add Unet Loader (GGUF)
  2. Double-click canvas → search ModelSamplingSD3 → add ModelSamplingSD3
  3. Connect: Unet Loader (GGUF) MODEL output → ModelSamplingSD3 model input

4-2. Text Encoding

  1. Double-click canvas → search CLIPLoader → add Load CLIP
  2. Double-click canvas → search CLIPTextEncode → add CLIP Text Encode (do this twice — one for positive, one for negative)
  3. Connect: Load CLIP CLIP output → first CLIP Text Encode clip input
  4. Connect: Load CLIP CLIP output → second CLIP Text Encode clip input

4-3. VAE and Video Latent

  1. Double-click canvas → search VAELoader → add Load VAE
  2. Double-click canvas → search Wan22ImageToVideoLatent → add Wan22ImageToVideoLatent
  3. Connect: Load VAE VAE output → Wan22ImageToVideoLatent vae input

4-4. Sampler

  1. Double-click canvas → search KSampler → add KSampler
  2. Connect: ModelSamplingSD3 MODEL output → KSampler model input
  3. Connect: first CLIP Text Encode CONDITIONING output → KSampler positive input
  4. Connect: second CLIP Text Encode CONDITIONING output → KSampler negative input
  5. Connect: Wan22ImageToVideoLatent LATENT output → KSampler latent_image input

4-5. Decode and Save

  1. Double-click canvas → search VAEDecode → add VAE Decode
  2. Double-click canvas → search SaveWEBM → add SaveWEBM
  3. Connect: Load VAE VAE output → VAE Decode vae input
  4. Connect: KSampler LATENT output → VAE Decode samples input
  5. Connect: VAE Decode IMAGE output → SaveWEBM images input
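The wiring from 4-1 through 4-5 can be summarized as a plain edge list, which is a handy checklist before you proceed. This is just a sanity sketch, not ComfyUI's internal format; the node names are the display labels used above.

```python
# Each tuple is (source node, output, target node, input), mirroring steps 4-1 to 4-5.
EDGES = [
    ("Unet Loader (GGUF)", "MODEL", "ModelSamplingSD3", "model"),
    ("Load CLIP", "CLIP", "CLIP Text Encode (positive)", "clip"),
    ("Load CLIP", "CLIP", "CLIP Text Encode (negative)", "clip"),
    ("Load VAE", "VAE", "Wan22ImageToVideoLatent", "vae"),
    ("ModelSamplingSD3", "MODEL", "KSampler", "model"),
    ("CLIP Text Encode (positive)", "CONDITIONING", "KSampler", "positive"),
    ("CLIP Text Encode (negative)", "CONDITIONING", "KSampler", "negative"),
    ("Wan22ImageToVideoLatent", "LATENT", "KSampler", "latent_image"),
    ("Load VAE", "VAE", "VAE Decode", "vae"),
    ("KSampler", "LATENT", "VAE Decode", "samples"),
    ("VAE Decode", "IMAGE", "SaveWEBM", "images"),
]

def inputs_of(node: str) -> dict[str, str]:
    """Map each input slot of `node` to the source node feeding it."""
    return {inp: src for src, _, dst, inp in EDGES if dst == node}

if __name__ == "__main__":
    # The sampler is the hub: all four of its inputs should be connected.
    print(sorted(inputs_of("KSampler")))
```

If any of the four KSampler inputs is unconnected in your canvas, generation will fail with a validation error.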

Configure each node

Unet Loader (GGUF):

  • unet_name: wan2.2_t2v_low_noise_14B_Q4_K_M.gguf

Load CLIP:

  • clip_name: umt5_xxl_fp8_e4m3fn_scaled.safetensors
  • type: wan

Load VAE:

  • vae_name: wan_2.1_vae.safetensors

ModelSamplingSD3:

  • shift: 8.0

Wan22ImageToVideoLatent:

  • Width: 832
  • Height: 480
  • Length: 49 (about 3 seconds at 16fps)

KSampler:

  • seed: any number (or leave random)
  • Steps: 20
  • CFG: 5
  • Sampler: uni_pc
  • Scheduler: simple
  • denoise: 1.0

SaveWEBM:

  • codec: vp9
  • fps: 16
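Once the workflow is built and configured, you do not have to drive it from the browser: ComfyUI also exposes an HTTP API, and a workflow exported in API format (enable Dev mode in settings, then use Save (API Format)) can be queued with a POST to /prompt. The sketch below only builds the request; the helper name and the export filename are my own, while the endpoint and the {"prompt": ...} body shape are ComfyUI's API.

```python
import json
import urllib.request

def build_queue_request(workflow_json_path: str,
                        server: str = "http://127.0.0.1:8188") -> urllib.request.Request:
    """Build a POST /prompt request from a workflow exported in API format."""
    with open(workflow_json_path, encoding="utf-8") as f:
        workflow = json.load(f)
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Usage (with the server from Step 1 running):
#   req = build_queue_request("workflow_api.json")  # hypothetical export filename
#   urllib.request.urlopen(req)                     # queues one generation
```

This is optional; everything in this guide also works entirely through the browser UI.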

Step 5: Write Your Prompt

WAN uses natural language descriptions, not Danbooru-style tags. Describe the scene as if explaining a short video clip to someone.

Positive Prompt Example

In the first CLIP Text Encode node (positive), enter:

Close-up shot of a girl with long blue hair. The wind gently blows her hair across her face. She turns to look at the camera with a soft smile. Warm sunset lighting illuminates her face. Anime illustration style, cel-shaded, vibrant colors.

Negative Prompt Example

In the second CLIP Text Encode node (negative):

live action, realistic, photo, 3d render, ugly, blurry, low quality, distorted face, extra fingers, text, watermark, full body, wide shot, distant

Key Tips

  • Describe motion: "hair blowing in the wind", "she turns to look"
  • Specify the art style: "anime illustration style", "cel-shaded"
  • Keep it focused — one scene, one subject, one action
  • Tags like masterpiece or best quality have no effect on WAN — use natural sentences instead

Step 6: Generate

Click Queue Prompt (or press Ctrl+Enter) to start generation.

With the settings above (832x480, 49 frames, 20 steps), generation typically takes 2-5 minutes on a modern GPU (e.g., RTX 4070 with 12 GB VRAM). GPUs with 8 GB VRAM may take 5-10 minutes as the model swaps between RAM and VRAM. The progress bar shows which step is currently processing.

If You Run Out of Memory

  • Reduce frame count to 25 (about 1.5 seconds at 16 fps)
  • Close every other application that uses the GPU (browsers, games, video players) so ComfyUI has the VRAM to itself

Step 7: Check Your Video

The output is saved in the ComfyUI/output/ folder as a WEBM file.

Watch the clip. Common issues on first attempts:

  • Blurry or melted faces: Add "detailed anime face, clear eyes" to your prompt. Add "blurry face, distorted" to negative
  • No motion / frozen: Make sure your prompt describes movement. "Wind blowing, hair moving, slight head tilt"
  • Realistic instead of anime: Add "2d anime, cel-shaded" to your prompt. Add "realistic, 3d, photo" to negative

Step 8: Upload to NijiTube

  1. Log in to NijiTube (Google, Discord, or X)
  2. Go to Upload from the navigation
  3. Select your video file (MP4 or WEBM, max 200 MB)
  4. Add a title, tags, and select the appropriate rating (SFW or R-18)
  5. In the AI Generation Settings section, enter the model name (WAN 2.2 14B) and paste your prompt
  6. Optionally attach your ComfyUI workflow JSON — this lets other creators learn from your setup
  7. Click Upload
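Failed uploads are usually a format or size problem, so a quick local pre-check against the limits in step 3 (MP4 or WEBM, at most 200 MB) can save a round trip. This is my own sketch, not part of NijiTube:

```python
from pathlib import Path

MAX_MB = 200                      # NijiTube's stated upload limit
ALLOWED = {".mp4", ".webm"}       # accepted container formats

def upload_ok(video_path: str) -> tuple[bool, str]:
    """Check a file against NijiTube's stated limits before uploading."""
    p = Path(video_path)
    if p.suffix.lower() not in ALLOWED:
        return False, f"unsupported format: {p.suffix}"
    size_mb = p.stat().st_size / (1024 * 1024)
    if size_mb > MAX_MB:
        return False, f"too large: {size_mb:.1f} MB (limit {MAX_MB} MB)"
    return True, "ok"
```

For example, `upload_ok("ComfyUI/output/clip.webm")` returns `(True, "ok")` for a valid file and a reason string otherwise.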

Your first few uploads are briefly checked before publishing — this is only to verify compliance with our content policy (no illegal content, real persons, etc.), not to judge quality. All valid uploads will be approved. Once you build a track record, your uploads will go live instantly.

What Next?

  • LoRAs: Add style LoRAs to improve anime quality — see the Anime Style LoRA guide
  • Image to Video (I2V): Start from a high-quality still image for much more consistent results. See the I2V guide
  • Higher Resolution: Once comfortable, try 1280x704 for 720p output (takes longer but looks much sharper)
  • Browse NijiTube: Check what other creators are making and download their workflows to learn
