diff --git a/README.md b/README.md
index f9f8822..26ca9d8 100644
--- a/README.md
+++ b/README.md
@@ -26,26 +26,30 @@
 ## About The Project
 
-This project brings the power of **VibeVoice** into the modular workflow of ComfyUI. VibeVoice is a novel framework by Microsoft for generating expressive, long-form, multi-speaker conversational audio. It excels at creating natural-sounding dialogue, podcasts, and more, with consistent voices for up to 4 speakers.
+VibeVoice is a novel framework by Microsoft for generating expressive, long-form, multi-speaker conversational audio. It excels at creating natural-sounding dialogue, podcasts, and more, with consistent voices for up to 4 speakers.
 
 The custom node handles everything from model downloading and memory management to audio processing, allowing you to generate high-quality speech directly from a text script and reference audio files.
 
 **✨ Key Features:**
 
 * **Multi-Speaker TTS:** Generate conversations with up to 4 distinct voices in a single audio output.
-* **Zero-Shot Voice Cloning:** Use any audio file (`.wav`, `.mp3`) as a reference for a speaker's voice.
-* **Advanced Attention Mechanisms:** Choose between `eager`, `sdpa`, `flash_attention_2`, and the new high-performance `sage` attention for fine-tuned control over speed, memory, and compatibility.
-* **Robust 4-Bit Quantization:** Run the large language model component in 4-bit mode to significantly reduce VRAM usage, with smart, stable configurations for all attention modes.
+* **High-Fidelity Voice Cloning:** Use any audio file (`.wav`, `.mp3`) as a reference for a speaker's voice.
+* **Hybrid Generation Mode:** Mix and match cloned voices with high-quality, zero-shot generated voices in the same script.
+* **Flexible Scripting:** Use simple `[1]` tags or the classic `Speaker 1:` format to write your dialogue.
+* **Advanced Attention Mechanisms:** Choose between `eager`, `sdpa`, `flash_attention_2`, and the high-performance `sage` attention for fine-tuned control over speed and compatibility.
+* **Robust 4-Bit Quantization:** Run the large language model component in 4-bit mode to significantly reduce VRAM usage.
 * **Automatic Model Management:** Models are downloaded automatically and managed efficiently by ComfyUI to save VRAM.
-* **Fine-Grained Control:** Adjust parameters like CFG scale, temperature, and sampling methods to tune the performance and style of the generated speech.
 ## 🚀 Getting Started
 
-The node can be installed via **ComfyUI Manager:** Find `ComfyUI-VibeVoice` and click "Install".
+The easiest way to install is through the **ComfyUI Manager:**
+1. Go to `Manager` -> `Install Custom Nodes`.
+2. Search for `ComfyUI-VibeVoice` and click "Install".
+3. Restart ComfyUI.
 
-Alternatively, install it manually:
+Alternatively, to install manually:
 
 1. **Clone the Repository:**
    Navigate to your `ComfyUI/custom_nodes/` directory and clone this repository:
@@ -54,18 +58,17 @@ Alternatively, install it manually:
    ```
 
 2. **Install Dependencies:**
-   Open a terminal or command prompt, navigate into the cloned directory, and install the required Python packages. **For quantization support, ensure you install `bitsandbytes`**.
+   Open a terminal or command prompt, navigate into the cloned directory, and install the required Python packages. **For quantization support, you must install `bitsandbytes`**.
    ```sh
    cd ComfyUI-VibeVoice
    pip install -r requirements.txt
    ```
 3. **Optional: Install SageAttention**
-   To enable the new `sage` attention mode, you must install the `sageattention` library in your ComfyUI Python environment. For Windows users, please refer to this [AI-windows-whl](https://github.com/wildminder/AI-windows-whl) for the required package.
-
+   To enable the `sage` attention mode, you must install the `sageattention` library. For Windows users, a pre-compiled wheel is available at [AI-windows-whl](https://github.com/wildminder/AI-windows-whl).
    > **Note:** This is only required if you intend to use the `sage` attention mode.
-
-3. **Start/Restart ComfyUI:**
+
+4. **Start/Restart ComfyUI:**
    Launch ComfyUI. The "VibeVoice TTS" node will appear under the `audio/tts` category. The first time you use the node, it will automatically download the selected model to your `ComfyUI/models/tts/VibeVoice/` folder.
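A reviewer-side note on the optional dependencies above: a quick way to confirm that `bitsandbytes` and `sageattention` landed in the right ComfyUI Python environment is to probe for them with the standard library. This is a minimal illustrative sketch, not part of the node; the `check_optional` helper is hypothetical.

```python
import importlib.util

def check_optional(packages):
    """Map each package name to whether it is importable in this environment."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}

# Optional packages named in the install steps: quantization and sage attention.
for pkg, ok in check_optional(["bitsandbytes", "sageattention"]).items():
    print(f"{pkg}: {'installed' if ok else 'missing'}")
```

Run it with the same interpreter that launches ComfyUI; a "missing" result usually means the packages were installed into a different environment.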
 ## Models
@@ -79,31 +82,55 @@ Alternatively, install it manually:
 
 ## 🛠️ Usage
 
-The node is designed to be intuitive within the ComfyUI workflow.
+The node is designed for maximum flexibility within your ComfyUI workflow.
 
 1. **Add Nodes:** Add the `VibeVoice TTS` node to your graph. Use ComfyUI's built-in `Load Audio` node to load your reference voice files.
-2. **Connect Voices:** Connect the `AUDIO` output from each `Load Audio` node to the corresponding `speaker_*_voice` input on the VibeVoice TTS node.
-3. **Write Script:** In the `text` input, write your dialogue. Assign lines to speakers using the format `Speaker 1: ...`, `Speaker 2: ...`, etc., on separate lines.
+2. **Connect Voices (Optional):** Connect the `AUDIO` output from each `Load Audio` node to the corresponding `speaker_*_voice` input.
+3. **Write Your Script:** In the `text` input, write your dialogue using one of the supported formats.
 4. **Generate:** Queue the prompt. The node will process the script and generate a single audio file containing the full conversation.
 
 > **Tip:** For a complete workflow, you can drag the example image from the `example_workflows` folder onto your ComfyUI canvas.
 
+### Scripting and Voice Modes
+
+#### Speaker Tagging
+You can assign lines to speakers in two ways. Both are treated identically.
+
+* **Modern Format (Recommended):** `[1] This is the first speaker.`
+* **Classic Format:** `Speaker 1: This is the first speaker.`
+
+You can also add an optional colon to the modern format (e.g., `[1]: ...`). The node handles all variations consistently.
+
+#### Hybrid Voice Generation
+This is a powerful feature that lets you mix cloned voices and generated (zero-shot) voices.
+
+* **To Clone a Voice:** Connect a `Load Audio` node to the speaker's input (e.g., `speaker_1_voice`).
+* **To Generate a Voice:** Leave the speaker's input empty. The model will create a unique, high-quality voice for that speaker.
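To make the "both formats are treated identically" claim concrete, the two speaker-tag styles described in this hunk can be normalized by a single regex. This is a hedged sketch, not the node's actual parser; `TAG_RE` and `parse_script` are illustrative names.

```python
import re

# Matches "[1]", "[1]:", or "Speaker 1:" at the start of a line and
# captures the speaker number plus the spoken text.
TAG_RE = re.compile(r"^\s*(?:\[(\d)\]:?|Speaker\s+(\d)\s*:)\s*(.*)$")

def parse_script(script):
    """Return (speaker_number, text) pairs for every tagged line; skip untagged lines."""
    parsed = []
    for line in script.splitlines():
        match = TAG_RE.match(line)
        if match:
            speaker = int(match.group(1) or match.group(2))
            parsed.append((speaker, match.group(3)))
    return parsed
```

Under this sketch, `[1] Hi`, `[1]: Hi`, and `Speaker 1: Hi` all resolve to speaker 1 with the text `Hi`, which matches the behavior the section describes.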
+
+**Example Hybrid Script:**
+```
+[1] This line will use the audio from speaker_1_voice.
+[2] This line will have a new, unique voice generated for it.
+[1] I'm back with my cloned voice.
+```
+In this example, you would only connect an audio source to `speaker_1_voice`.
+
 ### Node Inputs
 
 * **`model_name`**: Select the VibeVoice model to use (`1.5B` or `Large`).
-* **`quantize_llm_4bit`**: **(Overhauled!)** Enable to run the LLM component in 4-bit (NF4) mode. This dramatically reduces VRAM usage.
-* **`attention_mode`**: **(New!)** Select the attention implementation: `eager` (safest), `sdpa` (balanced), `flash_attention_2` (fastest), or `sage` (quantized high-performance).
-* **`text`**: The conversational script. Lines must be prefixed with `Speaker