Generating AI Video on a Local GPU with Wan2.1 and ComfyUI

I spent today generating marketing videos for KOAP with an AI video model running entirely on my local RTX 4070 Ti. No cloud APIs, no per-minute billing, no waiting in queues. Just ComfyUI, a 1.3-billion-parameter model, and a lot of prompt iteration.
Here's what I learned.
The Setup
Wan2.1 is an open-source text-to-video model from Alibaba. It comes in two sizes: 1.3B and 14B parameters. The 1.3B version runs comfortably on consumer GPUs with 12GB+ VRAM. The 14B version needs quantization tricks (more on that later).
I'm running it through ComfyUI, which gives you a node-based workflow editor and, more importantly, an API you can hit programmatically. Load the model once, fire prompts at it, get videos back. No GUI clicking required.
The pipeline is straightforward: text prompt goes in, 2-4 second video clip comes out as a series of frames. Each generation takes about 60-90 seconds on my 4070 Ti for a 480x832 clip at 16 frames.
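For a sense of scale, here are those numbers as plain Python constants (the names are my own shorthand, not ComfyUI fields), plus the back-of-envelope throughput they imply:
```python
# Rough generation settings from this post (my own constant names, not ComfyUI inputs).
WIDTH, HEIGHT = 480, 832      # portrait clip resolution used here
NUM_FRAMES = 16               # frames per generated clip
SECONDS_PER_CLIP = (60, 90)   # observed wall-clock time per clip on an RTX 4070 Ti

# 60-90 s per clip works out to roughly 40-60 clips per hour of iteration.
clips_per_hour = (3600 // SECONDS_PER_CLIP[1], 3600 // SECONDS_PER_CLIP[0])
print(clips_per_hour)  # (40, 60)
```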
Iteration 1: The Luxury Resort Problem
My first prompts were too aspirational. I wanted footage of community spaces, pools, shared amenities. What I got looked like a Four Seasons brochure.
Aerial drone shot of... a luxury resort pool. Not exactly "community association management software" vibes.
The problem was obvious in hindsight. When you prompt for "beautiful community pool, aerial view, cinematic," the model has been trained on tons of stock footage of luxury resorts. That's what "beautiful pool" means in the training data. I needed to fight the model's defaults.
Iteration 2: The Oversaturated Nature Phase
So I pivoted. More specific prompts about residential communities, apartment complexes, suburban neighborhoods. Added negative prompts to suppress the luxury look. Dropped the CFG scale from 7 to around 4-5.
The results were better conceptually but way too saturated. Everything looked like it had been run through an Instagram filter from 2014.
Getting closer to "residential community" but the colors are cranked to 11.
The golden hour lighting was beautiful but completely wrong for what I needed. Marketing videos for property management software should look real, not dreamy.
Iteration 3: Muted and Realistic
The breakthrough was changing the prompt strategy entirely. Instead of describing what I wanted to see, I described the style I wanted: "muted colors, documentary style, overcast natural lighting, realistic." I also added aggressive negative prompts: "oversaturated, vibrant, HDR, luxury, resort, cinematic color grading."
Lower CFG (around 3-4) helped too. Higher CFG makes the model follow the prompt more literally, but it also amplifies the model's biases. Lower CFG gives you more natural-looking output.
Now we're talking. Actual apartment complex, realistic lighting, muted tones.
These actually look like footage you'd see on a real property management website. Mission accomplished.
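Condensed into a config, the settings from this iteration looked roughly like the sketch below. The dictionary keys are my own labels, not actual ComfyUI node inputs; the prompt and CFG values are the ones described above.
```python
# Sketch of the iteration-3 settings; keys are illustrative, not ComfyUI field names.
generation_settings = {
    "prompt": (
        "apartment complex exterior, muted colors, documentary style, "
        "overcast natural lighting, realistic"
    ),
    "negative_prompt": (
        "oversaturated, vibrant, HDR, luxury, resort, cinematic color grading"
    ),
    "cfg": 3.5,  # lower CFG = less literal prompt-following, fewer stock-footage defaults
}
```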
Quantization: The JPEG of AI Models
Now, about that 14B model. It's significantly better than 1.3B, but 14 billion parameters in FP16 (the standard precision for released model weights) need about 28GB of VRAM. My 4070 Ti has 12GB. So how do people run it?
Quantization. And the simplest way to understand it is the JPEG analogy.
A raw photo might be 25MB. A JPEG of the same image might be 2MB. You lost some information, but for most purposes the image looks identical. Quantization does the same thing to AI model weights.
FP16 (16-bit Float): Each parameter is stored as a 16-bit floating point number. This is your raw photo. Maximum quality, maximum size. The 14B model needs ~28GB.
FP8 (8-bit Float): Cut each parameter to 8 bits. Like saving as a high-quality JPEG. You lose some precision in the decimal places, but the model barely notices. Now the 14B model fits in ~14GB. Almost fits on my card.
INT4 (4-bit Integer): Aggressively compress each parameter to just 4 bits. This is like saving a JPEG at quality 60. You can see artifacts if you look closely, but it's still remarkably usable. The 14B model drops to ~7GB. That fits on basically any modern GPU.
The quality loss from INT4 is noticeable but not dramatic. For video generation especially, where individual frames don't need to be pixel-perfect, it's a totally viable tradeoff. You get a model roughly 10x larger than the 1.3B running on hardware that costs $800 instead of $2000+.
FP16: 2 bytes per parameter × 14B = 28GB VRAM
FP8: 1 byte per parameter × 14B = 14GB VRAM
INT4: 0.5 bytes per parameter × 14B = 7GB VRAM
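The arithmetic is simple enough to sanity-check in a few lines (this only counts the weights themselves; activations and caches come on top):
```python
def weight_vram_gb(num_params: float, bits_per_param: float) -> float:
    """VRAM needed just to hold the model weights at a given precision."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(label, weight_vram_gb(14e9, bits), "GB")
# FP16 28.0 GB, FP8 14.0 GB, INT4 7.0 GB
```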
There are also mixed-precision approaches where critical layers stay at FP16 while less important layers get quantized harder. GGUF format (from the llama.cpp ecosystem) supports this, and ComfyUI can load GGUF-quantized Wan2.1 models directly.
Image-to-Video: Animate Your Stills
One of the coolest capabilities I haven't fully explored yet is image-to-video (I2V). You give the model a still image and a text prompt describing the motion you want, and it generates a video starting from that image.
This is huge for marketing. Got a nice photo of a property? Animate it. Add gentle camera movement, people walking through, clouds drifting. It takes a static hero image and turns it into something that catches the eye on a landing page.
The workflow in ComfyUI is almost identical to text-to-video. You just add an image input node that feeds into the model's conditioning. The text prompt shifts from describing the scene to describing the motion: "slow pan right, gentle breeze moving trees, people walking in background."
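I haven't scripted this part yet, but the shape of it would be: export the I2V workflow from ComfyUI in API format (a JSON dict keyed by node ID), then patch the image path and the motion prompt before queuing it. The node IDs and filenames below are placeholders for whatever your exported graph actually contains.
```python
import json

# Placeholder node IDs -- replace with the IDs from your own exported workflow.
IMAGE_NODE_ID = "12"   # the image-loader node feeding the conditioning
PROMPT_NODE_ID = "6"   # the positive text-prompt node

with open("wan_i2v_workflow_api.json") as f:   # exported via ComfyUI's "Save (API Format)"
    workflow = json.load(f)

workflow[IMAGE_NODE_ID]["inputs"]["image"] = "property_hero.png"
workflow[PROMPT_NODE_ID]["inputs"]["text"] = (
    "slow pan right, gentle breeze moving trees, people walking in background"
)
# ...then queue `workflow` through the API call shown in the next section.
```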
The Practical Workflow
Here's what the actual day-to-day looks like when you're using this for real content:
ComfyUI API: ComfyUI exposes a REST API. You POST a workflow JSON (which describes the node graph), and it queues the generation. You can poll for completion or use websockets. This means you can script batch generations, iterate on prompts programmatically, and build this into actual pipelines.
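A minimal version of that loop, assuming ComfyUI is running locally on its default port (8188) and you've exported a text-to-video workflow in API format; this polls /history rather than listening on the websocket:
```python
import json
import time
import uuid
import requests

COMFY = "http://127.0.0.1:8188"   # default local ComfyUI address

def queue_and_wait(workflow: dict, timeout: float = 300.0) -> dict:
    """Queue an API-format workflow and poll /history until the job finishes."""
    client_id = str(uuid.uuid4())
    resp = requests.post(f"{COMFY}/prompt", json={"prompt": workflow, "client_id": client_id})
    resp.raise_for_status()
    prompt_id = resp.json()["prompt_id"]

    deadline = time.time() + timeout
    while time.time() < deadline:
        history = requests.get(f"{COMFY}/history/{prompt_id}").json()
        if prompt_id in history:          # entry appears once the generation is done
            return history[prompt_id]["outputs"]
        time.sleep(2)
    raise TimeoutError(f"generation {prompt_id} did not finish in {timeout}s")

with open("wan_t2v_workflow_api.json") as f:
    outputs = queue_and_wait(json.load(f))
print(outputs)   # per-node outputs, including the saved frame/video filenames
```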
Prompt Engineering for Video: Video prompts are different from image prompts. You need to describe motion, camera work, and temporal changes. "Slow drone shot ascending over suburban neighborhood, morning light, cars parked in driveways" works better than "beautiful suburban neighborhood." The model needs to know what happens, not just what exists.
Negative Prompts Matter: For the 1.3B model especially, negative prompts are critical for steering away from the model's defaults. My go-to negative prompt became: "oversaturated, vibrant colors, HDR, luxury, resort, stock footage look, lens flare, cinematic color grading, unrealistic lighting." Basically, everything that makes AI video look like AI video.
Batch and Cherry-Pick: I generate 5 clips per prompt, pick the best 1-2. The model is stochastic, so you'll get variation. Some clips have weird artifacts, some nail it. At 60-90 seconds per generation, you can try a lot of variations quickly.
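In practice that's just a loop over seeds on top of the queue_and_wait helper sketched above; the sampler node ID is again a placeholder for whatever your exported workflow uses.
```python
import copy
import random

SAMPLER_NODE_ID = "3"   # placeholder: the KSampler-style node that takes a seed

def batch_generate(workflow: dict, n: int = 5) -> list:
    """Queue n variations of the same prompt with different seeds; cherry-pick later."""
    results = []
    for _ in range(n):
        wf = copy.deepcopy(workflow)
        wf[SAMPLER_NODE_ID]["inputs"]["seed"] = random.randint(0, 2**32 - 1)
        results.append(queue_and_wait(wf))   # helper defined in the API sketch above
    return results
```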
What's Next
The 14B model with GGUF quantization is the obvious next step. Better coherence, better motion, fewer artifacts. I also want to explore longer clips by chaining generations, using the last frame of one clip as the first frame of the next.
For KOAP specifically, I'm building a library of community-style B-roll clips that we can use across the website, social media, and pitch decks. All generated locally, no licensing issues, no stock footage watermarks.
The fact that this runs on a single consumer GPU in my office is still wild to me. A year ago this required cloud compute and serious money. Now it's a pip install and an afternoon of prompt tweaking.
The future of video content is local, iterative, and cheap. And it's already here.