Update README.md

Commit 7ec3a8a378 (parent 696ef69152) by WildAi, committed via GitHub, 2025-09-24 17:51:40 +03:00
<!-- ABOUT THE PROJECT -->
## About The Project
This project brings the power of **VibeVoice** into the modular workflow of ComfyUI. VibeVoice is a novel framework by Microsoft for generating expressive, long-form, multi-speaker conversational audio. It excels at creating natural-sounding dialogue, podcasts, and more, with consistent voices for up to 4 speakers.
The custom node handles everything from model downloading and memory management to audio processing, allowing you to generate high-quality speech directly from a text script and reference audio files.
**✨ Key Features:**
* **Multi-Speaker TTS:** Generate conversations with up to 4 distinct voices in a single audio output.
* **High-Fidelity Voice Cloning:** Use any audio file (`.wav`, `.mp3`) as a reference for a speaker's voice.
* **Hybrid Generation Mode:** Mix and match cloned voices with high-quality, zero-shot generated voices in the same script.
* **Flexible Scripting:** Use simple `[1]` tags or the classic `Speaker 1:` format to write your dialogue.
* **Advanced Attention Mechanisms:** Choose between `eager`, `sdpa`, `flash_attention_2`, and the high-performance `sage` attention for fine-tuned control over speed and compatibility.
* **Robust 4-Bit Quantization:** Run the large language model component in 4-bit mode to significantly reduce VRAM usage.
* **Automatic Model Management:** Models are downloaded automatically and managed efficiently by ComfyUI to save VRAM.
* **Fine-Grained Control:** Adjust parameters like CFG scale, temperature, and sampling methods to tune the performance and style of the generated speech.
<p align="right">(<a href="#readme-top">back to top</a>)</p>
<!-- GETTING STARTED -->
## 🚀 Getting Started
The easiest way to install is through the **ComfyUI Manager:**
1. Go to `Manager` -> `Install Custom Nodes`.
2. Search for `ComfyUI-VibeVoice` and click "Install".
3. Restart ComfyUI.
Alternatively, to install manually:
1. **Clone the Repository:**
Navigate to your `ComfyUI/custom_nodes/` directory and clone this repository:
2. **Install Dependencies:**
Open a terminal or command prompt, navigate into the cloned directory, and install the required Python packages. **For quantization support, you must install `bitsandbytes`**.
```sh
cd ComfyUI-VibeVoice
pip install -r requirements.txt
```
3. **Optional: Install SageAttention**
To enable the `sage` attention mode, you must install the `sageattention` library. For Windows users, a pre-compiled wheel is available at [AI-windows-whl](https://github.com/wildminder/AI-windows-whl).
> **Note:** This is only required if you intend to use the `sage` attention mode.
4. **Start/Restart ComfyUI:**
Launch ComfyUI. The "VibeVoice TTS" node will appear under the `audio/tts` category. The first time you use the node, it will automatically download the selected model to your `ComfyUI/models/tts/VibeVoice/` folder.
## Models
<!-- USAGE EXAMPLES -->
## 🛠️ Usage
The node is designed for maximum flexibility within your ComfyUI workflow.
1. **Add Nodes:** Add the `VibeVoice TTS` node to your graph. Use ComfyUI's built-in `Load Audio` node to load your reference voice files.
2. **Connect Voices (Optional):** Connect the `AUDIO` output from each `Load Audio` node to the corresponding `speaker_*_voice` input.
3. **Write Your Script:** In the `text` input, write your dialogue using one of the supported formats.
4. **Generate:** Queue the prompt. The node will process the script and generate a single audio file containing the full conversation.
> **Tip:** For a complete workflow, you can drag the example image from the `example_workflows` folder onto your ComfyUI canvas.
### Scripting and Voice Modes
#### Speaker Tagging
You can assign lines to speakers in two ways. Both are treated identically.
* **Modern Format (Recommended):** `[1] This is the first speaker.`
* **Classic Format:** `Speaker 1: This is the first speaker.`
You can also add an optional colon to the modern format (e.g., `[1]: ...`). The node handles all variations consistently.
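The two tag formats (and the optional-colon variant) could be normalized with a small parser like the following sketch. This is a hypothetical helper for illustration, not the node's actual implementation:

```python
import re

# Matches "[1] text", "[1]: text", and "Speaker 1: text" (hypothetical sketch;
# the node's real parser may differ in details).
TAG_RE = re.compile(r"^\s*(?:\[(\d+)\]:?|Speaker\s+(\d+):)\s*(.*)$")

def parse_line(line: str):
    """Return (speaker_number, text) for a tagged line, or None if untagged."""
    m = TAG_RE.match(line)
    if not m:
        return None
    speaker = int(m.group(1) or m.group(2))
    return speaker, m.group(3)
```

Both formats resolve to the same speaker number, which is why the node can treat them identically.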
#### Hybrid Voice Generation
This is a powerful feature that lets you mix cloned voices and generated (zero-shot) voices.
* **To Clone a Voice:** Connect a `Load Audio` node to the speaker's input (e.g., `speaker_1_voice`).
* **To Generate a Voice:** Leave the speaker's input empty. The model will create a unique, high-quality voice for that speaker.
**Example Hybrid Script:**
```
[1] This line will use the audio from speaker_1_voice.
[2] This line will have a new, unique voice generated for it.
[1] I'm back with my cloned voice.
```
In this example, you would only connect an audio source to `speaker_1_voice`.
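Conceptually, the hybrid mode boils down to a per-speaker lookup: use the connected reference audio if present, otherwise fall back to zero-shot generation. A minimal sketch (hypothetical helper names, not the node's actual code):

```python
def resolve_voices(script_speakers, voice_inputs):
    """Map each speaker number to its reference audio, or to None,
    where None means "generate a unique zero-shot voice"."""
    return {spk: voice_inputs.get(spk) for spk in script_speakers}
```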
### Node Inputs
* **`model_name`**: Select the VibeVoice model to use (`1.5B` or `Large`).
* **`text`**: The conversational script. See "Scripting and Voice Modes" above for formatting.
* **`quantize_llm_4bit`**: Enable to run the LLM component in 4-bit (NF4) mode, dramatically reducing VRAM usage.
* **`attention_mode`**: Select the attention implementation: `eager` (safest), `sdpa` (balanced), `flash_attention_2` (fastest), or `sage` (quantized high-performance).
* **`cfg_scale`**: Controls how strongly the model adheres to the reference voice's timbre. Higher values are stricter. Recommended: `1.3`.
* **`inference_steps`**: Number of diffusion steps for audio generation. Recommended: `10`.
* **`seed`**: A seed for reproducibility. Set to 0 for a random seed on each run.
* **`do_sample`, `temperature`, `top_p`, `top_k`**: Standard sampling parameters for controlling the creativity and determinism of the speech generation.
* **`force_offload`**: Forces the model to be completely offloaded from VRAM after generation.
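For orientation, a ComfyUI node exposing inputs like these is typically declared via an `INPUT_TYPES` classmethod. The stub below is illustrative only; the class name, option strings, and defaults are assumptions based on the list above, not the node's actual source:

```python
class VibeVoiceTTSSketch:
    """Illustrative ComfyUI-style node stub (not the real implementation)."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "model_name": (["VibeVoice-1.5B", "VibeVoice-Large"],),
                "text": ("STRING", {"multiline": True}),
                "quantize_llm_4bit": ("BOOLEAN", {"default": False}),
                "attention_mode": (["eager", "sdpa", "flash_attention_2", "sage"],),
                "cfg_scale": ("FLOAT", {"default": 1.3}),
                "inference_steps": ("INT", {"default": 10}),
                "seed": ("INT", {"default": 0}),
            },
            "optional": {
                # One input per speaker; unconnected inputs trigger zero-shot voices.
                "speaker_1_voice": ("AUDIO",),
            },
        }

    RETURN_TYPES = ("AUDIO",)
    FUNCTION = "generate"
    CATEGORY = "audio/tts"
```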
<!-- PERFORMANCE SECTION -->
## ⚙️ Performance & Advanced Features
This node features a sophisticated system for managing performance, memory, and stability.
### Feature Compatibility & VRAM Matrix
| Quantize LLM | Attention Mode | Behavior | VRAM Usage |
|---|---|---|---|
| **ON** | `sage` | **Recommended for stability.** Uses `fp32` compute to ensure numerical stability with quantization, resulting in slightly higher VRAM usage. | **Medium** |
A key feature of this node is the optional **4-bit quantization** for the language model component. This is highly recommended for users with memory-constrained GPUs (e.g., <= 16GB VRAM) who wish to run the larger `VibeVoice-Large-pt` model.
**Benefits of `quantize_llm = Enabled`:**
| Model | Performance Impact | VRAM Savings |
|---|---|---|
| **VibeVoice-Large (7B)** | **~8.5x faster** inference | Saves **>4.4 GB** (over 36%) |
| **VibeVoice-1.5B** | ~1.5x slower inference | Saves **~5.5 GB** (over 63%) |
As shown, quantization provides a massive speedup and VRAM reduction for the 7B model, making it accessible on a wider range of hardware. While it slightly slows down the 1.5B model, the significant VRAM savings may still be beneficial for complex workflows.
*\*Note: `flash_attention_2` with Q4 automatically falls back to `sdpa`.*
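The automatic fallback could be sketched as a simple guard like the following (a hypothetical helper, not the node's actual code):

```python
def effective_attention_mode(requested: str, quantize_4bit: bool) -> str:
    """4-bit quantization is incompatible with flash_attention_2, so the
    node falls back to the stable sdpa implementation in that case."""
    if quantize_4bit and requested == "flash_attention_2":
        return "sdpa"
    return requested
```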
<!-- CHANGELOG -->
## Changelog
<details open>
<summary><strong>v1.5.0 - Stability and Prompting</strong></summary>
### ✨ New Features & Improvements
* **Total Generation Stability:** Fixed the bug where a speaker's voice could unintentionally change or blend with another reference voice mid-sentence.
* **Improved Voice Cloning Fidelity**
* **Consistent Speaker Tagging:** The node now intelligently handles multiple script formats (`[1]`, `[1]:`, and `Speaker 1:`) to produce identical, high-quality results, removing all previous inconsistencies.
* **Hybrid Voice Generation:** Mix and match cloned voices with high-quality, zero-shot generated voices in the same script. If a speaker's voice input is empty, a unique voice will be generated for them automatically.
</details>
<details>
<summary><strong>v1.3.0 - SageAttention & Quantization Overhaul</strong></summary>
### ✨ New Features
* **SageAttention Support:** Full integration with the `sageattention` library for a high-performance, mixed-precision attention option.
* **Robust 4-Bit LLM Quantization:** The "Quantize LLM (4-bit)" option is now highly stable and delivers significant VRAM savings.
* **Smart Configuration & Fallbacks:** The node now automatically handles incompatible settings (e.g., 4-bit with `flash_attention_2`) by gracefully falling back to a stable alternative (`sdpa`) and notifying the user.
### 🐛 Bug Fixes & Stability Improvements
* **Fixed SageAttention Crashes**
* **Fixed Numerical Instability (`NaN`/`Inf` Errors)**
* **Resolved All `dtype` Mismatches**
* **Corrected SageAttention Kernel Assertions**
* **Addressed Deprecation Warning**
</details>
<details>
<summary><strong>v1.2.0 - Compatibility Update</strong></summary>
### ✅ Compatibility
* **Transformers Library:** Includes automatic detection and compatibility for both older and newer versions of the Transformers library (pre- and post-4.56).
### 🐛 Bug Fixes
* **Force Offload:** Resolved an `AttributeError` to ensure the force offload option works correctly with all versions of ComfyUI.
* **Multi-Speaker DynamicCache:** Fixed a `'DynamicCache' object has no attribute 'key_cache'` error when using multiple speakers with newer versions of the Transformers library.
</details>
<p align="right">(<a href="#readme-top">back to top</a>)</p>