This PR introduces an **optional 4-bit (NF4) quantization path** for the **Qwen2.5 LLM component** inside VibeVoice, using Transformers + bitsandbytes. The diffusion head and processors remain BF16/FP32. This mirrors the project’s architecture and enables the **7B preview** to run on smaller GPUs while preserving output quality.
**Changes / Additions:**
* New toggle to run the LLM in **4-bit NF4** via `BitsAndBytesConfig`; the default remains full precision (a minimal sketch follows this list).
* With Q4 enabled, the loader prefers **SDPA** attention for stability; a Flash-Attention request automatically downshifts to SDPA.
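A minimal sketch of what the Q4 path configures, assuming the standard Transformers + bitsandbytes API (the `use_4bit` flag and the checkpoint id here are illustrative; the PR's actual names may differ):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

use_4bit = True  # hypothetical toggle name; off by default in this PR

quant_config = None
if use_4bit:
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",               # NF4 data type for the 4-bit weights
        bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls still run in BF16
        bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
    )

llm = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",                           # illustrative; VibeVoice loads its own checkpoint
    quantization_config=quant_config,            # None -> full-precision path (the default)
    attn_implementation="sdpa" if use_4bit else "flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

With `quantization_config=None` the call degrades to the existing full-precision load, which is why the toggle can default to off without touching the current behavior.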
**Results on my RTX 3080 12 GB:**
* **7B**
  * **Timing:** 29m 27s (FP16) → **3m 23s** (Q4, 203.47s) — **−88.5% time** (~**8.68× faster**)
  * **VRAM:** Q4 ≈ **7.6 GB** vs. FP16 ≳ **12 GB** — Q4 saves **≥4.4 GB** (≥**36.7%**)
* **1.5B**
  * **Timing:** 105s (FP16) → **154s** (Q4) — **+49s** (~**1.47× slower**)
  * **VRAM:** Q4 ≈ **3.2 GB** vs. FP16 ≈ **8.7 GB** — Q4 saves **~5.5 GB** (~**63.2%**)
Net effect: in my limited testing, the 7B model sees a nearly 90% reduction in inference time and a roughly 37% reduction in VRAM usage in VRAM-constrained environments, with no perceptible change in output quality.
While the 1.5B model is slower under Q4, some may consider the ~63% smaller VRAM footprint worth the trade-off.
**Fixes (ComfyUI integration):**
* Fixed an `IndexError` in ComfyUI's model management system when unloading models
* Improved memory cleanup to prevent VRAM leaks when switching between models (see the cleanup sketch below)
* Updated cache key handling to properly track attention mode variants
* Enhanced patcher lifecycle management to work with ComfyUI's internal systems
* Added safer model cleanup that doesn't interfere with ComfyUI's model tracking
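The cleanup fixes follow roughly the pattern below: release memory through ComfyUI's own `comfy.model_management` helpers rather than deleting tensors behind its back, then reclaim the CUDA cache. The `safe_unload` wrapper and its exact call sequence are illustrative, not the code shipped in this PR:

```python
import gc
import torch
import comfy.model_management as mm

def safe_unload(patcher):
    """Illustrative cleanup sketch; the PR's actual sequence may differ."""
    # Ask ComfyUI to evict loaded models through its own bookkeeping so
    # its internal loaded-model list stays consistent (freeing tensors
    # behind its back is what triggers the IndexError on unload).
    mm.free_memory(1e30, mm.get_torch_device(), keep_loaded=[])
    # Drop our reference, collect cycles, then release cached VRAM.
    del patcher
    gc.collect()
    mm.soft_empty_cache()
    if torch.cuda.is_available():
        torch.cuda.ipc_collect()
```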