NijiTube

Your First AI Anime Video

Overview

This guide walks you through generating your first AI anime video from scratch. By the end you will have a short anime-style clip ready to upload.

The pipeline: ComfyUI (node-based UI) + WAN 2.2 14B GGUF (text-to-video model). The 14B model produces high-quality output, and the GGUF format lets it run on consumer GPUs with as little as 8 GB VRAM.

What You Need

Hardware

  • GPU: NVIDIA with at least 8 GB VRAM (RTX 3060 or better). 12 GB recommended
  • RAM: 32 GB recommended (16 GB minimum — may be tight)
  • Storage: ~15 GB free for models and outputs

Software

  • Windows 10/11 (this guide uses the Windows portable version)
  • An internet connection to download models (~12 GB total)

Step 1: Install ComfyUI

ComfyUI is a node-based interface for running AI generation models. It handles the entire pipeline visually.

  1. Go to the ComfyUI GitHub releases page and download the latest Windows portable package (the .7z file)
  2. Extract it to a folder with a short path, for example C:\ComfyUI
  3. Run run_nvidia_gpu.bat
  4. Your browser will open http://127.0.0.1:8188 — this is the ComfyUI interface

If nothing happens, make sure you have an NVIDIA GPU with up-to-date drivers. AMD GPUs require a different setup not covered here.
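Before moving on, you can confirm the server is actually answering. This is an optional sketch (the helper name is mine, not part of ComfyUI); it assumes the default port 8188 from the step above.

```python
import urllib.request
import urllib.error

def server_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `url` within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    # 8188 is ComfyUI's default port; adjust if you changed it at launch.
    print("ComfyUI reachable:", server_is_up("http://127.0.0.1:8188"))
```

If this prints False while the terminal window shows the server running, check that nothing else is bound to port 8188.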

Step 2: Install the GGUF Custom Node

The 14B model uses the GGUF format for memory efficiency. ComfyUI needs one custom node to load GGUF files.

  1. In ComfyUI, click the Manager button (top-right toolbar)
  2. Click Install Custom Nodes
  3. Search for ComfyUI-GGUF
  4. Click Install and wait for it to finish
  5. Restart ComfyUI (close the terminal window and run run_nvidia_gpu.bat again)

If you do not see the Manager button, install ComfyUI-Manager first: search "ComfyUI-Manager" on GitHub for instructions.

Step 3: Download Models

You need three files. Download them and place each in the correct folder inside your ComfyUI installation.

The WAN 2.2 14B Model (GGUF)

This is the core video generation model. The Q4 version (~9 GB) fits on 8 GB VRAM GPUs.

  • Search HuggingFace for "city96 Wan2.2-T2V-14B-GGUF" and download wan2.2_t2v_low_noise_14B_Q4_K_M.gguf (~9 GB)
  • Place in: ComfyUI/models/diffusion_models/

The "low noise" variant produces cleaner output and is recommended for beginners. If you have 12+ GB VRAM, you can use the Q6 version for even better quality.

The VAE

  • Search HuggingFace for "Comfy-Org Wan_2.1_ComfyUI_repackaged" and download wan_2.1_vae.safetensors (~300 MB)
  • Place in: ComfyUI/models/vae/

Note: The 14B model uses the 2.1 VAE, not the 2.2 VAE. This is correct.

The Text Encoder

  • Search HuggingFace for "Comfy-Org Wan_2.2_ComfyUI_Repackaged" and download umt5_xxl_fp8_e4m3fn_scaled.safetensors (~2.5 GB)
  • Place in: ComfyUI/models/text_encoders/
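A misplaced or misnamed file is the most common cause of "model not found" errors later, so it is worth verifying all three downloads landed in the right subfolders. A minimal check, using the folders and filenames from the steps above (the function name is my own):

```python
from pathlib import Path

# (subfolder under ComfyUI/models/, expected filename) from the download steps
EXPECTED = [
    ("diffusion_models", "wan2.2_t2v_low_noise_14B_Q4_K_M.gguf"),
    ("vae", "wan_2.1_vae.safetensors"),
    ("text_encoders", "umt5_xxl_fp8_e4m3fn_scaled.safetensors"),
]

def missing_models(comfy_root: str) -> list[str]:
    """Return the expected model files that are NOT present under ComfyUI/models/."""
    models = Path(comfy_root) / "models"
    return [str(models / sub / name)
            for sub, name in EXPECTED
            if not (models / sub / name).is_file()]

if __name__ == "__main__":
    for path in missing_models(r"C:\ComfyUI"):
        print("Missing:", path)
```

Run it with the folder you extracted ComfyUI into; no output means all three files are in place.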

Step 4: Build the Workflow

Clear the default workflow first: click Menu (☰) → Clear Workflow (or press Ctrl+Shift+Delete). You will build the workflow from scratch.

To add a node, double-click on the empty canvas. A search bar appears — type the node name listed below and select it from the results.

4-1. Model Loading Chain

Add and connect these three nodes in order:

  1. Double-click canvas → search UnetLoaderGGUF → add Unet Loader (GGUF)
  2. Double-click canvas → search ModelSamplingSD3 → add ModelSamplingSD3
  3. Connect: Unet Loader (GGUF) MODEL output → ModelSamplingSD3 model input

4-2. Text Encoding

  1. Double-click canvas → search CLIPLoader → add Load CLIP
  2. Double-click canvas → search CLIPTextEncode → add CLIP Text Encode (do this twice — one for positive, one for negative)
  3. Connect: Load CLIP CLIP output → first CLIP Text Encode clip input
  4. Connect: Load CLIP CLIP output → second CLIP Text Encode clip input

4-3. VAE and Video Latent

  1. Double-click canvas → search VAELoader → add Load VAE
  2. Double-click canvas → search Wan22ImageToVideoLatent → add Wan22ImageToVideoLatent
  3. Connect: Load VAE VAE output → Wan22ImageToVideoLatent vae input

4-4. Sampler

  1. Double-click canvas → search KSampler → add KSampler
  2. Connect: ModelSamplingSD3 MODEL output → KSampler model input
  3. Connect: first CLIP Text Encode CONDITIONING output → KSampler positive input
  4. Connect: second CLIP Text Encode CONDITIONING output → KSampler negative input
  5. Connect: Wan22ImageToVideoLatent LATENT output → KSampler latent_image input

4-5. Decode and Save

  1. Double-click canvas → search VAEDecode → add VAE Decode
  2. Double-click canvas → search SaveWEBM → add SaveWEBM
  3. Connect: Load VAE VAE output → VAE Decode vae input
  4. Connect: KSampler LATENT output → VAE Decode samples input
  5. Connect: VAE Decode IMAGE output → SaveWEBM images input
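The wiring from 4-1 through 4-5 can be summarized as a plain edge list, which is a handy checklist before you proceed. This is just a sanity sketch, not ComfyUI's internal format; the node names are the display labels used above.

```python
# Each tuple is (source node, output, target node, input), mirroring steps 4-1 to 4-5.
EDGES = [
    ("Unet Loader (GGUF)", "MODEL", "ModelSamplingSD3", "model"),
    ("Load CLIP", "CLIP", "CLIP Text Encode (positive)", "clip"),
    ("Load CLIP", "CLIP", "CLIP Text Encode (negative)", "clip"),
    ("Load VAE", "VAE", "Wan22ImageToVideoLatent", "vae"),
    ("ModelSamplingSD3", "MODEL", "KSampler", "model"),
    ("CLIP Text Encode (positive)", "CONDITIONING", "KSampler", "positive"),
    ("CLIP Text Encode (negative)", "CONDITIONING", "KSampler", "negative"),
    ("Wan22ImageToVideoLatent", "LATENT", "KSampler", "latent_image"),
    ("Load VAE", "VAE", "VAE Decode", "vae"),
    ("KSampler", "LATENT", "VAE Decode", "samples"),
    ("VAE Decode", "IMAGE", "SaveWEBM", "images"),
]

def inputs_of(node: str) -> dict[str, str]:
    """Map each input slot of `node` to the source node feeding it."""
    return {inp: src for src, _, dst, inp in EDGES if dst == node}

if __name__ == "__main__":
    # The sampler is the hub: all four of its inputs should be connected.
    print(sorted(inputs_of("KSampler")))
```

If any of the four KSampler inputs is unconnected in your canvas, generation will fail with a validation error.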

Configure each node

Unet Loader (GGUF):

  • unet_name: wan2.2_t2v_low_noise_14B_Q4_K_M.gguf

Load CLIP:

  • clip_name: umt5_xxl_fp8_e4m3fn_scaled.safetensors
  • type: wan

Load VAE:

  • vae_name: wan_2.1_vae.safetensors

ModelSamplingSD3:

  • shift: 8.0

Wan22ImageToVideoLatent:

  • Width: 832
  • Height: 480
  • Length: 49 (about 3 seconds at 16fps)

KSampler:

  • seed: any number (or leave random)
  • Steps: 20
  • CFG: 5
  • Sampler: uni_pc
  • Scheduler: simple
  • denoise: 1.0

SaveWEBM:

  • codec: vp9
  • fps: 16
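Once the workflow is built and configured, you do not have to drive it from the browser: ComfyUI also exposes an HTTP API, and a workflow exported in API format (enable Dev mode in settings, then use Save (API Format)) can be queued with a POST to /prompt. The sketch below only builds the request; the helper name and the export filename are my own, while the endpoint and the {"prompt": ...} body shape are ComfyUI's API.

```python
import json
import urllib.request

def build_queue_request(workflow_json_path: str,
                        server: str = "http://127.0.0.1:8188") -> urllib.request.Request:
    """Build a POST /prompt request from a workflow exported in API format."""
    with open(workflow_json_path, encoding="utf-8") as f:
        workflow = json.load(f)
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Usage (with the server from Step 1 running):
#   req = build_queue_request("workflow_api.json")  # hypothetical export filename
#   urllib.request.urlopen(req)                     # queues one generation
```

This is optional; everything in this guide also works entirely through the browser UI.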

Step 5: Write Your Prompt

WAN uses natural language descriptions, not Danbooru-style tags. Describe the scene as if explaining a short video clip to someone.

Positive Prompt Example

In the first CLIP Text Encode node (positive), enter:

Close-up shot of a girl with long blue hair. The wind gently blows her hair across her face. She turns to look at the camera with a soft smile. Warm sunset lighting illuminates her face. Anime illustration style, cel-shaded, vibrant colors.

Negative Prompt Example

In the second CLIP Text Encode node (negative):

live action, realistic, photo, 3d render, ugly, blurry, low quality, distorted face, extra fingers, text, watermark, full body, wide shot, distant

Key Tips

  • Describe motion: "hair blowing in the wind", "she turns to look"
  • Specify the art style: "anime illustration style", "cel-shaded"
  • Keep it focused — one scene, one subject, one action
  • Tags like masterpiece or best quality have no effect on WAN — use natural sentences instead

Step 6: Generate

Click Queue Prompt (or press Ctrl+Enter) to start generation.

With the settings above (832x480, 49 frames, 20 steps), generation typically takes 2-5 minutes on a modern GPU (e.g., RTX 4070 with 12 GB VRAM). GPUs with 8 GB VRAM may take 5-10 minutes as the model swaps between RAM and VRAM. The progress bar shows which step is currently processing.

If You Run Out of Memory

  • Reduce frame count to 25 (about 1.5 seconds at 16 fps)
  • Close every other application that uses the GPU (browsers, games, video players) so ComfyUI has the VRAM to itself

Step 7: Check Your Video

The output is saved in the ComfyUI/output/ folder as a WEBM file.

Watch the clip. Common issues on first attempts:

  • Blurry or melted faces: Add "detailed anime face, clear eyes" to your prompt. Add "blurry face, distorted" to negative
  • No motion / frozen: Make sure your prompt describes movement. "Wind blowing, hair moving, slight head tilt"
  • Realistic instead of anime: Add "2d anime, cel-shaded" to your prompt. Add "realistic, 3d, photo" to negative

Step 8: Upload to NijiTube

  1. Log in to NijiTube (Google, Discord, or X)
  2. Go to Upload from the navigation
  3. Select your video file (MP4 or WEBM, max 200 MB)
  4. Add a title, tags, and select the appropriate rating (SFW or R-18)
  5. In the AI Generation Settings section, enter the model name (WAN 2.2 14B) and paste your prompt
  6. Optionally attach your ComfyUI workflow JSON — this lets other creators learn from your setup
  7. Click Upload
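Failed uploads are usually a format or size problem, so a quick local pre-check against the limits in step 3 (MP4 or WEBM, at most 200 MB) can save a round trip. This is my own sketch, not part of NijiTube:

```python
from pathlib import Path

MAX_MB = 200                      # NijiTube's stated upload limit
ALLOWED = {".mp4", ".webm"}       # accepted container formats

def upload_ok(video_path: str) -> tuple[bool, str]:
    """Check a file against NijiTube's stated limits before uploading."""
    p = Path(video_path)
    if p.suffix.lower() not in ALLOWED:
        return False, f"unsupported format: {p.suffix}"
    size_mb = p.stat().st_size / (1024 * 1024)
    if size_mb > MAX_MB:
        return False, f"too large: {size_mb:.1f} MB (limit {MAX_MB} MB)"
    return True, "ok"
```

For example, `upload_ok("ComfyUI/output/clip.webm")` returns `(True, "ok")` for a valid file and a reason string otherwise.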

Your first few uploads are briefly checked before publishing — this is only to verify compliance with our content policy (no illegal content, real persons, etc.), not to judge quality. All valid uploads will be approved. Once you build a track record, your uploads will go live instantly.

What Next?

  • LoRAs: Add style LoRAs to improve anime quality — see the Anime Style LoRA guide
  • Image to Video (I2V): Start from a high-quality still image for much more consistent results. See the I2V guide
  • Higher Resolution: Once comfortable, try 1280x704 for 720p output (takes longer but looks much sharper)
  • Browse NijiTube: Check what other creators are making and download their workflows to learn
