Quant-VideoGen on RTX 5090

Based on Quant-VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization by Xi et al. (2026).

ArXiv Link: https://arxiv.org/abs/2602.02958v1

GitHub: svg-project/Quant-VideoGen

Why Quant-VideoGen?

Autoregressive video generation models face a critical memory bottleneck that severely limits long video synthesis.

Picture this: generating just 5 seconds of 480p video demands approximately 34 GB of memory for KV cache storage alone, already more than the 32 GB of VRAM on a single RTX 5090.

Why is this such a problem? These models generate frames sequentially, and here's the catch: each new frame must maintain complete references to all previously generated frames. This means memory consumption grows linearly with every additional frame added, quickly overwhelming even the most powerful consumer hardware.
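The linear growth can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a standard attention cache that stores one key and one value vector per token per layer; the layer count, hidden size, and tokens-per-frame values are illustrative placeholders, not figures taken from the paper.

```python
def kv_cache_bytes(num_layers, hidden_dim, tokens_per_frame, num_frames, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every generated token.

    bytes_per_elem=2 corresponds to BF16 storage.
    """
    tokens = tokens_per_frame * num_frames
    return 2 * num_layers * tokens * hidden_dim * bytes_per_elem  # 2 = keys + values


# Illustrative (assumed) model shape, not the configuration of any specific backend.
ILLUSTRATIVE = dict(num_layers=30, hidden_dim=1536, tokens_per_frame=1560)

for frames in (10, 20, 40):
    gib = kv_cache_bytes(num_frames=frames, **ILLUSTRATIVE) / 1024**3
    print(f"{frames:>3} latent frames -> {gib:.1f} GiB of KV cache")
```

Doubling the number of cached frames doubles the cache size, which is why long videos exhaust VRAM even when a short clip fits comfortably.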

However, there is a remedy: quantization. Quantizing a video model's KV cache, though, is very different from quantizing a text model's.

Quantization in Text Models (Why It Works):

Traditional quantization works straightforwardly for text. In text models, KV cache values tend to fall within a narrow, fairly uniform range. For example:

Sentence: "The cat is on the mat"

Token 1 (The):  [-0.5, 0.3, 1.2, -0.8, 0.6, ...]
Token 2 (cat):  [-0.4, 0.2, 1.1, -0.7, 0.5, ...]
Token 3 (is):   [-0.6, 0.4, 1.3, -0.9, 0.7, ...]
...

Observation: All values stay within [-1, 1.5]—nicely uniform!

Why does it work for text?

Because the value range is fixed and uniform across all tokens—no extreme outliers, all tokens treated equally.


Quantization in Video Models (Why It Fails):

Video KV caches exhibit highly non-uniform value distributions, even within a single frame: most values are small, but a few are extreme outliers.

Applying the quantization scheme that works for text then fails catastrophically.

The root cause: a single extreme outlier (for example, a value of 1300) stretches the entire quantization range to accommodate it. Most normal values (in the 0.1-50 range) are forced into the same coarse bins, destroying their precision.
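The effect is easy to reproduce with a toy example. The sketch below applies plain min-max 2-bit quantization (a generic baseline, not QVG's method) to the text-like values shown earlier and to a hand-made "video-like" vector whose numbers are illustrative, chosen only to match the ranges mentioned above.

```python
def quantize_minmax(values, bits=2):
    """Uniform min-max quantization followed by dequantization."""
    levels = 2 ** bits - 1                      # 2 bits -> 4 codes (0..3)
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return [lo + c * scale for c in codes]


def max_abs_error(values, bits=2):
    restored = quantize_minmax(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


text_like = [-0.5, 0.3, 1.2, -0.8, 0.6]          # from the text example above
video_like = [0.1, 12.0, 50.0, 3.5, 1300.0]      # illustrative values with one outlier

print("text-like  max error:", max_abs_error(text_like))
print("video-like max error:", max_abs_error(video_like))
print("video-like restored :", quantize_minmax(video_like))
```

With the outlier present, every "normal" value lands in the lowest bin and is restored to the same number, so the differences between them are lost entirely.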

To address this challenge, researchers from UC Berkeley, MIT, NVIDIA, Amazon, and the University of Texas at Austin introduce Quant-VideoGen (QVG), a framework that exploits the inherent spatiotemporal redundancy in video content. The key insight is that adjacent video frames and spatially nearby regions are highly similar, which enables aggressive compression.

The solution comprises two main components:

(1) Semantic-Aware Smoothing groups similar tokens using k-means clustering, quantizing residuals relative to cluster centroids rather than raw values. This reduces key cache quantization error by 6.9× and value cache error by 2.6×.

(2) Progressive Residual Quantization applies the smoothing process iteratively across multiple stages, capturing semantic information at different granularities while maintaining uniform value distributions suitable for low-precision quantization.
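The two components can be illustrated with a small NumPy sketch. This is a simplified reading of the description above, not the project's Triton implementation: the k-means routine, the per-token min-max INT2 quantizer, the number of stages, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def kmeans(x, k, iters=10):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [x[0]]
    for _ in range(k - 1):
        dist = np.min([((x - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(x[dist.argmax()])
    centroids = np.stack(centroids).astype(np.float64)
    for _ in range(iters):
        assign = ((x[:, None, :] - centroids[None]) ** 2).sum(axis=-1).argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    assign = ((x[:, None, :] - centroids[None]) ** 2).sum(axis=-1).argmin(axis=1)
    return centroids, assign


def quantize_int2(x):
    """Per-token min-max quantization to 4 levels, returned in dequantized form."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)
    return lo + np.round((x - lo) / scale) * scale


def progressive_residual_quant(x, k=4, stages=2):
    """Subtract cluster centroids stage by stage, then quantize the final residual."""
    approx = np.zeros_like(x)
    residual = x.copy()
    for _ in range(stages):
        centroids, assign = kmeans(residual, k)
        approx += centroids[assign]             # centroids stay in high precision
        residual = residual - centroids[assign]
    return approx + quantize_int2(residual)     # only the smooth residual is 2-bit


# Synthetic tokens: four groups of similar tokens with very different magnitudes.
rng = np.random.default_rng(0)
centers = rng.normal(0.0, 100.0, size=(4, 16))
tokens = np.repeat(centers, 32, axis=0) + rng.normal(0.0, 1.0, size=(128, 16))

err_direct = np.mean((tokens - quantize_int2(tokens)) ** 2)
err_smoothed = np.mean((tokens - progressive_residual_quant(tokens)) ** 2)
print(f"direct INT2 MSE:   {err_direct:.3f}")
print(f"smoothed INT2 MSE: {err_smoothed:.3f}")
```

Because tokens in the same cluster share most of their magnitude, the residuals span a much narrower range than the raw values, so the same four quantization levels cover them far more precisely.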

Experimental results demonstrate exceptional performance across multiple state-of-the-art models (LongCat-Video, HY-WorldPlay, Self-Forcing). QVG achieves 6.94-7.05× compression ratios while maintaining near-lossless visual quality (PSNR >28.7). Crucially, end-to-end latency overhead remains below 4%, making the approach practical for real applications.
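For readers who want to run the same kind of quality check on their own outputs, PSNR can be computed per frame as sketched below. This is the standard definition for 8-bit images; how frames are decoded from the BF16 and QVG videos is left out, and the arrays used here are synthetic stand-ins.

```python
import numpy as np


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")                     # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)


# Synthetic stand-ins for a BF16 reference frame and a slightly perturbed frame.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(480, 832, 3), dtype=np.uint8)
noise = rng.integers(-3, 4, size=reference.shape)
perturbed = np.clip(reference.astype(np.int64) + noise, 0, 255).astype(np.uint8)

print(f"PSNR: {psnr(reference, perturbed):.1f} dB")
```

Higher values mean the quantized output is closer to the reference; identical frames give infinity.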

This dramatically lowers the hardware barrier: long video generation that previously required enterprise-level hardware becomes feasible on consumer-grade GPUs.

How to run Quant-VideoGen on a single RTX 5090

Prerequisites

  • A Yotta Labs account with sufficient credits

  • A Hugging Face account and access token (HF_TOKEN)

  • Familiarity with basic Python and terminal commands

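Supported backends:

  • 🐱 LongCat-Video: smallest KV cache of the three, but its model weights (~83 GB) do not fit on a single RTX 5090

  • Self-Forcing: streaming video generation; the starting point used in this guide

  • 🌍 HY-WorldPlay: world model video generation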

| Model | BF16 Total KV | INT2 Total KV | Compression |
| --- | --- | --- | --- |
| LongCat-Video | 22,272 MB | 3,231 MB | 6.9× |
| Self-Forcing | 46,073 MB | 6,614 MB | 7.0× |
| HY-WorldPlay | 29,700 MB | 4,235 MB | 7.0× |

Why RTX 5090? Self-Forcing's BF16 KV cache alone requires ~46 GB — impossible on a single 32 GB GPU. QVG brings it down to ~6.6 GB, making long video generation fully feasible on RTX 5090.
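The compression ratios and the "does it fit" reasoning follow directly from the KV cache sizes listed above. The short check below recomputes them; it considers the KV cache only and deliberately ignores model weights and activations, which also consume VRAM.

```python
# (BF16 total KV in MB, INT2 total KV in MB), copied from the table above.
KV_CACHE_MB = {
    "LongCat-Video": (22272, 3231),
    "Self-Forcing": (46073, 6614),
    "HY-WorldPlay": (29700, 4235),
}
GPU_VRAM_MB = 32 * 1024  # RTX 5090


def summarize(table, vram_mb):
    """Return {model: (compression ratio, BF16 KV fits, INT2 KV fits)}."""
    summary = {}
    for model, (bf16_mb, int2_mb) in table.items():
        summary[model] = (bf16_mb / int2_mb, bf16_mb <= vram_mb, int2_mb <= vram_mb)
    return summary


for model, (ratio, bf16_fits, int2_fits) in summarize(KV_CACHE_MB, GPU_VRAM_MB).items():
    print(f"{model:<14} {ratio:.1f}x  BF16 KV fits: {bf16_fits}  INT2 KV fits: {int2_fits}")
```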

Resources: GitHub · Project Page · arXiv


Prerequisites

GPU Requirements

| GPU | VRAM | Notes |
| --- | --- | --- |
| RTX 5090 | 32 GB | ✅ Recommended on Yotta Labs |
| RTX 4090 | 24 GB | Feasible for LongCat-Video only |
| H100 / A100 | 80 GB | Can run BF16 baseline without QVG |


Step 1 — Deploy a Pod

  1. Navigate to Compute → Pods and click Deploy.

  2. Select RTX 5090 as the GPU.

  3. Under Pod Template, choose pytorch.

  4. Set System Volume to at least 150 GB (checkpoints vary: LongCat ~15 GB, Self-Forcing ~30 GB, HY-WorldPlay ~25 GB).

  5. Click Deploy and wait for the Pod to reach Running state.
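Once the Pod is running, it can be worth confirming from a notebook cell that the expected GPU is visible. The helper below is a minimal sketch that assumes the pytorch template ships with PyTorch installed; it falls back to a message if PyTorch or CUDA is unavailable.

```python
def describe_gpu():
    """Return a one-line description of GPU 0, or a message if none is usable."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible"
    props = torch.cuda.get_device_properties(0)
    return f"{props.name}: {props.total_memory / 1024**3:.1f} GiB of VRAM"


print(describe_gpu())
```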


Step 2 — Open JupyterLab

  1. Click Connect on the Pod card.

  2. Click the link for port 8888 to open JupyterLab.

  3. Create a new notebook: File → New → Notebook, select the default kernel.

  4. For steps marked [Terminal], open a terminal via File → New → Terminal.

Steps below are clearly marked as either [Terminal] or [Notebook Cell].


Step 3 — Set Up the Environment

3.1 — Clone the Repository

[Terminal]

3.2 — Create a Conda Environment

3.3 — Install QVG and Dependencies

[Terminal] — Run inside the qvg environment.

If you only need one backend, install just that extra — e.g. uv pip install -e ".[longcat]".

3.4 — Install Flash Attention

If this wheel doesn't match your environment, check your versions first:
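One way to collect the relevant versions from a notebook cell is sketched below. It reports the Python tag, platform, PyTorch version, CUDA version, and C++11 ABI setting, which are the properties prebuilt wheels are typically keyed on; which of them appear in a given wheel's filename is an assumption to verify against the release list.

```python
import platform
import sys


def environment_summary():
    """Gather the version details that usually determine wheel compatibility."""
    info = {
        "python": platform.python_version(),
        "python_tag": f"cp{sys.version_info.major}{sys.version_info.minor}",
        "platform": f"{platform.system().lower()}_{platform.machine()}",
    }
    try:
        import torch
    except ImportError:
        info["torch"] = None                    # PyTorch not installed here
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda           # None for CPU-only builds
    info["cxx11_abi"] = torch.compiled_with_cxx11_abi()
    return info


for key, value in environment_summary().items():
    print(f"{key:>10}: {value}")
```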

Then find the matching wheel on the flash-attention releases page.


Step 4 — Download Model Checkpoints

The official repository provides checkpoints for three backends (LongCat-Video, Self-Forcing, and HY-WorldPlay) with very different hardware requirements.

  1. Self-Forcing is the most lightweight option, built on Wan2.1-T2V-1.3B (~5 GB of weights), and runs comfortably on a single RTX 5090 (32 GB VRAM).

  2. HY-WorldPlay uses a mid-sized model (~25 GB of weights) and fits within a single RTX 5090 with QVG INT2 quantization applied to the KV cache.

  3. LongCat-Video is the most memory-intensive backend: it uses Mistral-Small-24B as its text encoder (22 GB) paired with a large DiT (51 GB), totaling ~83 GB of model weights. Since QVG quantizes only the inference-time KV cache and not the model weights, LongCat-Video cannot be loaded onto a single 32 GB GPU regardless of quantization settings. It requires at least 2× H200 (141 GB of VRAM each) to run.

We start with Self-Forcing video generation, using Wan2.1-T2V-1.3B on a single RTX 5090 GPU.

| Backend | Model | Weights | Min GPU |
| --- | --- | --- | --- |
| Self-Forcing | Wan2.1-T2V-1.3B + DMD | ~5 GB | 1× RTX 5090 |
| HY-WorldPlay | | ~25 GB | 1× RTX 5090 |
| LongCat-Video | Mistral-Small-24B + DiT | ~83 GB | 2× H200 |

Verify the downloaded files:

Expected output:


Step 5 — Run Video Generation

Output videos are saved to ~/Quant-VideoGen/results/.
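To see what a run produced, the results directory can be scanned from a notebook cell. The sketch below only assumes that outputs are .mp4 files somewhere under the directory named above; the exact subfolder layout depends on the script that was run.

```python
from pathlib import Path


def list_videos(results_dir="~/Quant-VideoGen/results"):
    """Return (relative path, size in bytes) for every .mp4 under results_dir."""
    root = Path(results_dir).expanduser()
    if not root.is_dir():
        return []
    return [
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in sorted(root.rglob("*.mp4"))
    ]


for name, size in list_videos():
    print(f"{size / 1024**2:8.1f} MB  {name}")
```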


Step 6 — View Output Videos

[Notebook Cell]
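A minimal way to preview a result inline is sketched below, using IPython's built-in Video display class. It picks the most recently modified .mp4 under the results directory; the helper names are illustrative and not part of the Quant-VideoGen repository.

```python
from pathlib import Path


def latest_video(results_dir="~/Quant-VideoGen/results"):
    """Return the most recently modified .mp4 under results_dir, or None."""
    root = Path(results_dir).expanduser()
    videos = list(root.rglob("*.mp4")) if root.is_dir() else []
    return max(videos, key=lambda p: p.stat().st_mtime) if videos else None


def show_latest(results_dir="~/Quant-VideoGen/results", width=640):
    """Display the newest video inline (requires a Jupyter/IPython session)."""
    from IPython.display import Video, display

    path = latest_video(results_dir)
    if path is None:
        print("No .mp4 files found under", results_dir)
        return
    # embed=True stores the video in the notebook, so it plays from any path.
    display(Video(str(path), embed=True, width=width))
```

Calling `show_latest()` in a cell renders the player; pass a different directory to preview the outputs of another backend.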


Step 7 — Want to Customize Quantization Settings?

Quantization options are defined inside each run_qvg.sh script. Key parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| quant_method | triton-nstages-kmeans-int2 | Quantization algorithm. INT2 gives ~7× compression. |
| block_size | 64 | Token block size. Smaller = finer-grained but slower. |
| num_centroids | 256 | K-means centroids. More = better quality, slightly more memory. |


Larger Models & Further Exploration

GPU Requirements

  • LongCat-Video — requires 2× H200 (model weights total ~83 GB: Mistral-Small-24B text encoder + large DiT)

  • HY-WorldPlay — runs on a single RTX 5090 (32 GB); uses --offload_text_encoder to stay within VRAM budget


LongCat-Video (2× H200)

LongCat-Video generates long videos auto-regressively by extending from an initial base clip. The workflow has three sequential steps: generate a base clip → extend with BF16 → extend with QVG INT2 quantization.

Step 1 — Download Checkpoints

Verify:

Step 2 — Generate Base Clip

The base clip (results/longcat/base/1-0.mp4) is required by both run_bf16.sh and run_qvg.sh as the starting frame. Always run this first.

Confirm the output exists before proceeding:
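A small check like the following can be run from a notebook cell. It assumes the repository was cloned to ~/Quant-VideoGen, as in the earlier steps, and only tests that the base clip exists and is not empty.

```python
from pathlib import Path


def base_clip_ready(repo_root="~/Quant-VideoGen"):
    """True if results/longcat/base/1-0.mp4 exists and is non-empty."""
    clip = Path(repo_root).expanduser() / "results" / "longcat" / "base" / "1-0.mp4"
    return clip.is_file() and clip.stat().st_size > 0


print("Base clip ready:", base_clip_ready())
```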

Step 3 — Extend with BF16 (quality reference)

Output: results/longcat/bf16/

Step 4 — Extend with QVG INT2

QVG reduces the KV cache by ~7× via 2-bit quantization, significantly cutting memory use during the long auto-regressive extension.

Output: results/longcat/triton-nstages-kmeans-int2_64/.../

View Output in JupyterLab Notebook

Open a new notebook cell and run:


HY-WorldPlay (1× RTX 5090)

HY-WorldPlay is an image-to-video world model — it takes a single input image and a camera movement sequence (--pose) to generate a navigable video. It uses --offload_text_encoder to keep the text encoder on CPU and fit within 32 GB VRAM.

Step 1 — Download Checkpoints

Verify:

Step 2 — Generate with QVG INT2

Output: results/hyworldplay/triton-nstages-kmeans-int2_64/.../

The default input image is assets/hyworld.png and the default prompt describes a stone arch bridge scene. To use your own image and prompt, edit the script directly:

Change these two lines:

You can also customize the camera trajectory via --pose. The default is:

Each entry is a direction followed by number of frames: w = forward, s = backward, a = left, d = right, up = tilt up, down = tilt down.
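To sanity-check a trajectory before launching a run, the entries can be parsed and summed. The sketch below assumes a comma-separated "direction-count" syntax such as "w-4, d-2"; that exact syntax is an assumption, so compare it with the default --pose value in the script before relying on it.

```python
# Direction keys as described above.
DIRECTIONS = {
    "w": "forward",
    "s": "backward",
    "a": "left",
    "d": "right",
    "up": "tilt up",
    "down": "tilt down",
}


def parse_pose(pose):
    """Parse an assumed 'direction-count, ...' string into (movement, frames) pairs."""
    steps = []
    for entry in pose.split(","):
        entry = entry.strip()
        if not entry:
            continue
        key, sep, count = entry.partition("-")
        if not sep or key not in DIRECTIONS or not count.isdigit():
            raise ValueError(f"Unrecognized pose entry: {entry!r}")
        steps.append((DIRECTIONS[key], int(count)))
    return steps


def total_frames(pose):
    return sum(frames for _, frames in parse_pose(pose))


example = "w-4, d-2, up-1"  # hypothetical trajectory for illustration
print(parse_pose(example), "->", total_frames(example), "frames")
```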

Step 3 — Run BF16 baseline (optional, quality comparison)


Unlike LongCat, HY-WorldPlay does not require a base clip first — run_bf16.sh and run_qvg.sh are both standalone.

View Output in JupyterLab Notebook


Sample output: one of our output files, segment_2 (3 MB).

Troubleshooting

  • LongCat: IndexError: list index out of range
    The base clip is missing. Run bash scripts/LongCat/base.sh first before running run_bf16.sh or run_qvg.sh.

  • LongCat: CUDA out of memory on H200
    Make sure you have 2× H200. The model weights alone total ~83 GB and cannot be split across GPUs with the current single-process setup.

  • HY-WorldPlay: CUDA out of memory on RTX 5090
    Confirm --offload_text_encoder is present in the script. Check with:
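One way to perform that check from a notebook cell is sketched below. The script path is left as an argument because the exact location of the HY-WorldPlay run script is not shown here; the helper simply looks for the flag on any non-comment line.

```python
from pathlib import Path


def script_has_flag(script_path, flag="--offload_text_encoder"):
    """True if the flag appears on a line of the script that is not commented out."""
    text = Path(script_path).expanduser().read_text(encoding="utf-8")
    active_lines = [
        line for line in text.splitlines() if not line.lstrip().startswith("#")
    ]
    return any(flag in line for line in active_lines)
```

Call it with the path to the run_qvg.sh script you are launching; a result of False means the flag is missing or commented out.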
