Quant-VideoGen on RTX 5090
Based on Quant-VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization by Xi et al. (2026).
ArXiv Link: https://arxiv.org/abs/2602.02958v1
GitHub: svg-project/Quant-VideoGen
Why Quant-VideoGen?
Autoregressive video generation models face a critical memory bottleneck that severely limits long video synthesis.
Picture this: generating just 5 seconds of 480p video demands approximately 34 GB of memory for KV cache storage alone, already more than the 32 GB available on a single RTX 5090.
Why is this such a problem? These models generate frames sequentially, and here's the catch: each new frame must maintain complete references to all previously generated frames. This means memory consumption grows linearly with every additional frame added, quickly overwhelming even the most powerful consumer hardware.
However, there is a remedy we can apply: quantization. Quantizing the KV cache of a video model, though, is very different from quantizing that of a text model.
Quantization in Text Models (Why It Works):
Traditional quantization works straightforwardly here. In autoregressive text models, KV cache values tend to fall within a narrow, consistent range. For example:
Sentence: "The cat is on the mat"
Token 1 (The): [-0.5, 0.3, 1.2, -0.8, 0.6, ...]
Token 2 (cat): [-0.4, 0.2, 1.1, -0.7, 0.5, ...]
Token 3 (is): [-0.6, 0.4, 1.3, -0.9, 0.7, ...]
...
Observation: all values stay within [-1, 1.5], a narrow and consistent range.
Why does it work for text? Because the value range is similar across all tokens, with no extreme outliers, a single quantization scale serves every token equally well.
Quantization in Video Models (Why It Fails):
Video KV cache exhibits wildly non-uniform value distributions. Within a single frame, most values sit in a modest range (roughly 0.1-50), while a few extreme outliers reach values such as 1300.
Applying traditional text-style quantization to such data causes catastrophic failure.
The root cause: the quantization range must stretch to accommodate the extreme outlier (1300). This forces most normal values (0.1-50 range) into the same few coarse bins, destroying precision.
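The effect is easy to reproduce on toy numbers. The sketch below is a minimal illustration, not the paper's code: it applies plain min-max 2-bit quantization to a text-like vector and to a video-like vector. The video-like values are made up to match the ranges quoted above (0.1-50 plus one outlier at 1300), and `quantize_minmax` and `mean_abs_error` are hypothetical helper names.

```python
def quantize_minmax(values, bits=2):
    """Uniform min-max quantization: snap each value to one of 2**bits levels."""
    levels = 2 ** bits - 1                      # 3 steps between 4 levels for 2-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return [lo + c * scale for c in codes]      # dequantized values


def mean_abs_error(values, bits=2):
    """Average absolute difference between the input and its dequantized copy."""
    restored = quantize_minmax(values, bits)
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


text_like = [-0.5, 0.3, 1.2, -0.8, 0.6]         # narrow, consistent range
video_like = [0.1, 3.0, 12.0, 50.0, 1300.0]     # illustrative values with one outlier

print("text-like error: ", round(mean_abs_error(text_like), 3))
print("video-like error:", round(mean_abs_error(video_like), 3))
print("video-like restored:", quantize_minmax(video_like))
```

With the outlier present, every value between 0.1 and 50 lands in the same bin, so the differences between them are lost entirely.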
To address this challenge, researchers from UC Berkeley, MIT, NVIDIA, Amazon, and the University of Texas at Austin introduce Quant-VideoGen (QVG), a framework that exploits the inherent spatial-temporal redundancy in video content. The key insight is that adjacent video frames and spatially nearby regions exhibit high similarity, enabling aggressive compression.

The solution comprises two main components:
(1) Semantic-Aware Smoothing groups similar tokens using k-means clustering, quantizing residuals relative to cluster centroids rather than raw values. This reduces key cache quantization error by 6.9× and value cache error by 2.6×.
(2) Progressive Residual Quantization applies the smoothing process iteratively across multiple stages, capturing semantic information at different granularities while maintaining uniform value distributions suitable for low-precision quantization.
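The two components can be sketched on scalar toy data. This is a loose illustration of the description above, not the authors' implementation: real QVG works on token vectors with Triton kernels, and the sketch ignores the cost of storing centroids and cluster assignments. The helper names (`kmeans_1d`, `progressive_quantize`) and the input values are assumptions made for the example. With `stages=0` the function reduces to plain min-max quantization.

```python
def quantize_minmax(values, bits=2):
    """Uniform min-max quantization followed by dequantization."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def kmeans_1d(values, k, iters=10):
    """Tiny 1-D k-means with deterministic init (evenly spaced sorted samples)."""
    ordered = sorted(values)
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda j: abs(v - centroids[j]))

    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v)].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, [nearest(v) for v in values]


def progressive_quantize(values, stages=1, k=2, bits=2):
    """Subtract cluster centroids `stages` times, then quantize the last residual."""
    residual = list(values)
    offsets = [0.0] * len(values)               # sum of centroids removed per value
    for _ in range(stages):
        centroids, assign = kmeans_1d(residual, k)
        offsets = [o + centroids[a] for o, a in zip(offsets, assign)]
        residual = [r - centroids[a] for r, a in zip(residual, assign)]
    restored_residual = quantize_minmax(residual, bits)
    return [o + r for o, r in zip(offsets, restored_residual)]


def mean_abs_error(values, restored):
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


values = [0.1, 3.0, 12.0, 50.0, 1290.0, 1300.0]   # illustrative, not real KV data
for stages in (0, 1, 2):
    err = mean_abs_error(values, progressive_quantize(values, stages=stages))
    print(f"stages={stages}: mean abs error = {err:.2f}")
```

On this input the error shrinks with each added stage, because subtracting centroids narrows the range that the 2-bit quantizer has to cover.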
Experimental results demonstrate exceptional performance across multiple state-of-the-art models (LongCat-Video, HY-WorldPlay, Self-Forcing). QVG achieves 6.94-7.05× compression ratios while maintaining near-lossless visual quality (PSNR >28.7). Crucially, end-to-end latency overhead remains below 4%, making the approach practical for real applications.
This dramatically lowers the hardware barrier: long video generation that previously required enterprise-level hardware becomes feasible on consumer-grade GPUs.
How to run Quant-VideoGen on a single RTX 5090
Prerequisites
A Yotta Labs account with sufficient credits
A Hugging Face account and access token (HF_TOKEN) with read permissions
Basic familiarity with Python and terminal commands
🐱 LongCat-Video: smallest KV cache of the three; its ~83 GB of model weights need 2× H200 (see Step 4)
⚡ Self-Forcing: streaming video generation, and the backend this guide starts with
🌍 HY-WorldPlay: world model video generation
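| Model | BF16 KV cache | QVG KV cache | Compression |
| --- | --- | --- | --- |
| LongCat-Video | 22,272 MB | 3,231 MB | 6.9× |
| Self-Forcing | 46,073 MB | 6,614 MB | 7.0× |
| HY-WorldPlay | 29,700 MB | 4,235 MB | 7.0× |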
Why RTX 5090? Self-Forcing's BF16 KV cache alone requires ~46 GB — impossible on a single 32 GB GPU. QVG brings it down to ~6.6 GB, making long video generation fully feasible on RTX 5090.
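The compression ratios follow directly from the KV cache figures listed above. A quick check, using only those numbers (the dictionary layout and function name are just for illustration):

```python
kv_cache_mb = {
    # model: (BF16 KV cache, QVG KV cache) in MB, as quoted in this guide
    "LongCat-Video": (22_272, 3_231),
    "Self-Forcing": (46_073, 6_614),
    "HY-WorldPlay": (29_700, 4_235),
}


def compression_ratio(bf16_mb, qvg_mb):
    """How many times smaller the quantized cache is than the BF16 cache."""
    return bf16_mb / qvg_mb


for model, (bf16, qvg) in kv_cache_mb.items():
    ratio = compression_ratio(bf16, qvg)
    saved = bf16 - qvg
    print(f"{model:14s} {bf16:>7,} MB -> {qvg:>6,} MB  ({ratio:.1f}x, saves {saved:,} MB)")
```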
GPU Requirements
| GPU | VRAM | Notes |
| --- | --- | --- |
| RTX 5090 | 32 GB | ✅ Recommended on Yotta Labs |
| RTX 4090 | 24 GB | Feasible for LongCat-Video only |
| H100 / A100 | 80 GB | Can run BF16 baseline without QVG |
Step 1 — Deploy a Pod
Log in to the Yotta Labs Console.
Navigate to Compute → Pods and click Deploy.
Select RTX 5090 as the GPU.
Under Pod Template, choose pytorch.
Set System Volume to at least 150 GB (checkpoints vary: LongCat ~15 GB, Self-Forcing ~30 GB, HY-WorldPlay ~25 GB).
Click Deploy and wait for the Pod to reach the Running state.
Step 2 — Open JupyterLab
Click Connect on the Pod card.
Click the link for port 8888 to open JupyterLab.
Create a new notebook: File → New → Notebook, select the default kernel.
For steps marked [Terminal], open a terminal via File → New → Terminal.
Steps below are clearly marked as either [Terminal] or [Notebook Cell].
Step 3 — Set Up the Environment
3.1 — Clone the Repository
[Terminal]
3.2 — Create a Conda Environment
3.3 — Install QVG and Dependencies
[Terminal] — Run inside the qvg environment.
If you only need one backend, install just that extra, e.g. uv pip install -e ".[longcat]".
3.4 — Install Flash Attention
If this wheel doesn't match your environment, check your versions first:
Then find the matching wheel at the flash-attention releases page.
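The version check mentioned above can be sketched from a notebook cell as follows. This is one possible way to collect the details that prebuilt flash-attention wheels are usually named after (Python tag, PyTorch version, CUDA version, C++11 ABI flag); `describe_env` is a hypothetical helper, and it assumes the notebook kernel runs in the qvg environment.

```python
import platform
import sys


def describe_env():
    """Collect the version details that a prebuilt flash-attention wheel must match."""
    info = {
        "python": f"cp{sys.version_info.major}{sys.version_info.minor}",
        "platform": f"{platform.system().lower()}_{platform.machine()}",
    }
    try:
        import torch
    except ImportError:
        info["torch"] = None                    # PyTorch not installed in this kernel
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda           # CUDA version PyTorch was built with
    info["cxx11_abi"] = torch.compiled_with_cxx11_abi()
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_env().items():
    print(f"{key:10s} {value}")
```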
Step 4 — Download Model Checkpoints
The official repository provides checkpoints for three backends: LongCat-Video, Self-Forcing, and HY-WorldPlay. Their hardware requirements differ widely.
Self-Forcing is the most lightweight option, built on Wan2.1-T2V-1.3B (~5 GB of weights), and runs comfortably on a single RTX 5090 (32 GB VRAM).
HY-WorldPlay uses a mid-sized model (~25 GB of weights) and fits within a single RTX 5090 with QVG INT2 quantization applied to the KV cache.
LongCat-Video is the most memory-intensive backend: it uses Mistral-Small-24B as its text encoder (22 GB) paired with a large DiT (51 GB), totaling ~83 GB of model weights. Since QVG only quantizes the inference-time KV cache and not the model weights themselves, LongCat-Video cannot be loaded onto a single 32 GB GPU regardless of quantization settings. It requires at least 2× H200 (141 GB of VRAM each) to run.
We start with Self-Forcing video generation, using Wan2.1-T2V-1.3B on a single RTX 5090.
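| Backend | Base model | Model weights | Minimum GPU |
| --- | --- | --- | --- |
| Self-Forcing | Wan2.1-T2V-1.3B + DMD | ~5 GB | 1× RTX 5090 |
| HY-WorldPlay | — | ~25 GB | 1× RTX 5090 |
| LongCat-Video | Mistral-Small-24B + DiT | ~83 GB | 2× H200 |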
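A rough budget check makes the GPU assignments above concrete. The sketch below adds the model weights to the QVG-compressed KV cache (rounded from the MB figures quoted earlier in this guide) and compares the sum with 32 GB. It is only a lower bound under stated assumptions: activations, the VAE, and framework overhead are ignored, and `fits` is a hypothetical helper.

```python
GPU_VRAM_GB = 32  # single RTX 5090

backends = {
    # name: (model weights in GB, QVG KV cache in GB), figures quoted in this guide
    "Self-Forcing": (5, 6.6),
    "HY-WorldPlay": (25, 4.2),
    "LongCat-Video": (83, 3.2),
}


def fits(weights_gb, kv_cache_gb, vram_gb=GPU_VRAM_GB):
    """Rough lower bound: ignores activations, the VAE, and framework overhead."""
    return weights_gb + kv_cache_gb <= vram_gb


for name, (weights, kv) in backends.items():
    total = weights + kv
    verdict = "fits" if fits(weights, kv) else "does not fit"
    print(f"{name:14s} {total:5.1f} GB needed -> {verdict} in {GPU_VRAM_GB} GB")
```

By this count HY-WorldPlay leaves less than 3 GB of headroom, which is consistent with the guide offloading its text encoder to CPU.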
Verify the downloaded files:
Expected output:
Step 5 — Run Video Generation
Output videos are saved to ~/Quant-VideoGen/results/.

Step 6 — View Output Videos
[Notebook Cell]
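A minimal sketch of such a cell, assuming the videos are .mp4 files under ~/Quant-VideoGen/results/ as stated in Step 5. `find_videos` and `show_videos` are hypothetical helper names; `IPython.display.Video` is the standard Jupyter way to embed a local video file.

```python
from pathlib import Path


def find_videos(root):
    """Return all .mp4 files under `root`, sorted by path (empty list if root is missing)."""
    root = Path(root).expanduser()
    return sorted(root.rglob("*.mp4")) if root.is_dir() else []


def show_videos(root, width=640):
    """Print each video path and, inside Jupyter, render an embedded player for it."""
    videos = find_videos(root)
    try:
        from IPython.display import Video, display
    except ImportError:                         # not running inside Jupyter
        for path in videos:
            print(path)
        return videos
    for path in videos:
        print(path)
        display(Video(str(path), embed=True, width=width))
    return videos


_ = show_videos("~/Quant-VideoGen/results")
```

Pointing `show_videos` at a subfolder such as results/longcat or results/hyworldplay narrows the listing to one backend.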
Step 7 — Want to Customize Quantization Settings?
Quantization options are defined inside each run_qvg.sh script. Key parameters:
| Parameter | Default | Description |
| --- | --- | --- |
| quant_method | triton-nstages-kmeans-int2 | Quantization algorithm. INT2 gives ~7× compression. |
| block_size | 64 | Token block size. Smaller = finer-grained but slower. |
| num_centroids | 256 | K-means centroids. More = better quality, slightly more memory. |
Larger Models and Further Exploration
GPU Requirements
LongCat-Video: requires 2× H200 (model weights total ~83 GB: Mistral-Small-24B text encoder + large DiT)
HY-WorldPlay: runs on a single RTX 5090 (32 GB); uses --offload_text_encoder to stay within the VRAM budget
LongCat-Video (2× H200)
LongCat-Video generates long videos auto-regressively by extending from an initial base clip. The workflow has three sequential steps: generate a base clip → extend with BF16 → extend with QVG INT2 quantization.
Step 1 — Download Checkpoints
Verify:
Step 2 — Generate Base Clip
The base clip (results/longcat/base/1-0.mp4) is required by both run_bf16.sh and run_qvg.sh as the starting frame. Always run this first.
Confirm the output exists before proceeding:
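One way to run this check from a notebook cell is sketched below. It assumes the repository lives at ~/Quant-VideoGen, so the base clip path from the previous paragraph resolves under it; `clip_ready` is a hypothetical helper.

```python
from pathlib import Path


def clip_ready(path):
    """True if the file exists and is non-empty."""
    p = Path(path).expanduser()
    return p.is_file() and p.stat().st_size > 0


base_clip = "~/Quant-VideoGen/results/longcat/base/1-0.mp4"
print("base clip ready:", clip_ready(base_clip))
```

If this prints False, rerun the base clip step before starting run_bf16.sh or run_qvg.sh.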
Step 3 — Extend with BF16 (quality reference)
Output: results/longcat/bf16/
Step 4 — Extend with QVG INT2 (recommended)
QVG reduces the KV cache by ~7× via 2-bit quantization, significantly cutting memory during the long auto-regressive extension.
Output: results/longcat/triton-nstages-kmeans-int2_64/.../
View Output in JupyterLab Notebook
Open a new notebook cell and run:
HY-WorldPlay (1× RTX 5090)
HY-WorldPlay is an image-to-video world model — it takes a single input image and a camera movement sequence (--pose) to generate a navigable video. It uses --offload_text_encoder to keep the text encoder on CPU and fit within 32 GB VRAM.
Step 1 — Download Checkpoints
Verify:
Step 2 — Run with QVG INT2 (recommended)
Output: results/hyworldplay/triton-nstages-kmeans-int2_64/.../
The default input image is assets/hyworld.png and the default prompt describes a stone arch bridge scene. To use your own image and prompt, edit the script directly:
Change these two lines:
You can also customize the camera trajectory via --pose. The default is:
Each entry is a direction followed by number of frames: w = forward, s = backward, a = left, d = right, up = tilt up, down = tilt down.
Step 3 — Run BF16 baseline (optional, quality comparison)
Unlike LongCat, HY-WorldPlay does not require a base clip first: run_bf16.sh and run_qvg.sh are both standalone.
View Output in JupyterLab Notebook
Check one of our output files, segment_2.

Troubleshooting
LongCat: IndexError: list index out of range
The base clip is missing. Run bash scripts/LongCat/base.sh first before running run_bf16.sh or run_qvg.sh.
LongCat: CUDA out of memory on H200
Make sure you have 2× H200. The model weights alone total ~83 GB and cannot be split across GPUs with the current single-process setup.
HY-WorldPlay: CUDA out of memory on RTX 5090
Confirm --offload_text_encoder is present in the script. Check with:
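A small Python sketch of this check, usable from a notebook cell. The script path below is an assumption made by analogy with scripts/LongCat/base.sh and may differ in the actual repository; `script_has_flag` is a hypothetical helper.

```python
from pathlib import Path


def script_has_flag(script_path, flag="--offload_text_encoder"):
    """Return True if `flag` appears on a non-comment line of the shell script."""
    text = Path(script_path).expanduser().read_text()
    return any(
        flag in line
        for line in text.splitlines()
        if not line.lstrip().startswith("#")
    )


# Assumed location of the HY-WorldPlay run script; adjust to your checkout.
script = Path("~/Quant-VideoGen/scripts/HY-WorldPlay/run_qvg.sh").expanduser()
if script.is_file():
    print("flag present:", script_has_flag(script))
else:
    print(f"{script} not found; adjust the path")
```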