Doing some experimentation

Support predict_ratio changing with timesteps
Implement Sortblock for single cond usage
2026-02-12 03:00:03 +00:00 · 2025-09-02 22:19:12 -07:00 · 2025-09-02 15:23:28 -07:00 · 2025-09-02 00:45:59 -07:00 · 2025-09-01 09:39:40 -07:00 · 2025-08-31 20:26:49 -07:00
224 changed files with 11132 additions and 23853 deletions
--- a/.ci/windows_amd_base_files/README_VERY_IMPORTANT.txt
+++ b/.ci/windows_amd_base_files/README_VERY_IMPORTANT.txt
@@ -1,27 +0,0 @@
-As of the time of writing this you need this preview driver for best results:
-https://www.amd.com/en/resources/support-articles/release-notes/RN-AMDGPU-WINDOWS-PYTORCH-PREVIEW.html
-
-HOW TO RUN:
-
-If you have a AMD gpu:
-
-run_amd_gpu.bat
-
-If you have memory issues you can try disabling the smart memory management by running comfyui with:
-
-run_amd_gpu_disable_smart_memory.bat
-
-IF YOU GET A RED ERROR IN THE UI MAKE SURE YOU HAVE A MODEL/CHECKPOINT IN: ComfyUI\models\checkpoints
-
-You can download the stable diffusion XL one from: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0_0.9vae.safetensors
-
-
-RECOMMENDED WAY TO UPDATE:
-To update the ComfyUI code: update\update_comfyui.bat
-
-
-TO SHARE MODELS BETWEEN COMFYUI AND ANOTHER UI:
-In the ComfyUI directory you will find a file: extra_model_paths.yaml.example
-Rename this file to: extra_model_paths.yaml and edit it with your favorite text editor.
-
-
--- a/.ci/windows_amd_base_files/run_amd_gpu.bat
+++ b/.ci/windows_amd_base_files/run_amd_gpu.bat
@@ -1,2 +0,0 @@
-.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
-pause
--- a/.ci/windows_amd_base_files/run_amd_gpu_disable_smart_memory.bat
+++ b/.ci/windows_amd_base_files/run_amd_gpu_disable_smart_memory.bat
@@ -1,2 +0,0 @@
-.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --disable-smart-memory
-pause
--- a/.ci/windows_nvidia_base_files/README_VERY_IMPORTANT.txt
+++ b/.ci/windows_nvidia_base_files/README_VERY_IMPORTANT.txt
--- a/.ci/windows_nvidia_base_files/run_cpu.bat
+++ b/.ci/windows_nvidia_base_files/run_cpu.bat
--- a/.ci/windows_nvidia_base_files/run_nvidia_gpu.bat
+++ b/.ci/windows_nvidia_base_files/run_nvidia_gpu.bat
--- a/.ci/windows_nvidia_base_files/run_nvidia_gpu_fast_fp16_accumulation.bat
+++ b/.ci/windows_nvidia_base_files/run_nvidia_gpu_fast_fp16_accumulation.bat
--- a/.ci/windows_nvidia_base_files/advanced/run_nvidia_gpu_disable_api_nodes.bat
+++ b/.ci/windows_nvidia_base_files/advanced/run_nvidia_gpu_disable_api_nodes.bat
@@ -1,2 +0,0 @@
-..\python_embeded\python.exe -s ..\ComfyUI\main.py --windows-standalone-build --disable-api-nodes
-pause
--- a/.github/workflows/release-stable-all.yml
+++ b/.github/workflows/release-stable-all.yml
@@ -1,61 +0,0 @@
-name: "Release Stable All Portable Versions"
-
-on:
-  workflow_dispatch:
-    inputs:
-      git_tag:
-        description: 'Git tag'
-        required: true
-        type: string
-
-jobs:
-  release_nvidia_default:
-    permissions:
-      contents: "write"
-      packages: "write"
-      pull-requests: "read"
-    name: "Release NVIDIA Default (cu129)"
-    uses: ./.github/workflows/stable-release.yml
-    with:
-      git_tag: ${{ inputs.git_tag }}
-      cache_tag: "cu129"
-      python_minor: "13"
-      python_patch: "6"
-      rel_name: "nvidia"
-      rel_extra_name: ""
-      test_release: true
-    secrets: inherit
-
-  release_nvidia_cu128:
-    permissions:
-      contents: "write"
-      packages: "write"
-      pull-requests: "read"
-    name: "Release NVIDIA cu128"
-    uses: ./.github/workflows/stable-release.yml
-    with:
-      git_tag: ${{ inputs.git_tag }}
-      cache_tag: "cu128"
-      python_minor: "12"
-      python_patch: "10"
-      rel_name: "nvidia"
-      rel_extra_name: "_cu128"
-      test_release: true
-    secrets: inherit
-
-  release_amd_rocm:
-    permissions:
-      contents: "write"
-      packages: "write"
-      pull-requests: "read"
-    name: "Release AMD ROCm 6.4.4"
-    uses: ./.github/workflows/stable-release.yml
-    with:
-      git_tag: ${{ inputs.git_tag }}
-      cache_tag: "rocm644"
-      python_minor: "12"
-      python_patch: "10"
-      rel_name: "amd"
-      rel_extra_name: ""
-      test_release: false
-    secrets: inherit
--- a/.github/workflows/ruff.yml
+++ b/.github/workflows/ruff.yml
@@ -21,28 +21,3 @@ jobs:

    - name: Run Ruff
      run: ruff check .
-
-  pylint:
-    name: Run Pylint
-    runs-on: ubuntu-latest
-
-    steps:
-    - name: Checkout repository
-      uses: actions/checkout@v4
-
-    - name: Set up Python
-      uses: actions/setup-python@v4
-      with:
-        python-version: '3.12'
-
-    - name: Install requirements
-      run: |
-        python -m pip install --upgrade pip
-        pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-        pip install -r requirements.txt
-
-    - name: Install Pylint
-      run: pip install pylint
-
-    - name: Run Pylint
-      run: pylint comfy_api_nodes
--- a/.github/workflows/stable-release.yml
+++ b/.github/workflows/stable-release.yml
@@ -2,53 +2,17 @@
 name: "Release Stable Version"

 on:
-  workflow_call:
-    inputs:
-      git_tag:
-        description: 'Git tag'
-        required: true
-        type: string
-      cache_tag:
-        description: 'Cached dependencies tag'
-        required: true
-        type: string
-        default: "cu129"
-      python_minor:
-        description: 'Python minor version'
-        required: true
-        type: string
-        default: "13"
-      python_patch:
-        description: 'Python patch version'
-        required: true
-        type: string
-        default: "6"
-      rel_name:
-        description: 'Release name'
-        required: true
-        type: string
-        default: "nvidia"
-      rel_extra_name:
-        description: 'Release extra name'
-        required: false
-        type: string
-        default: ""
-      test_release:
-        description: 'Test Release'
-        required: true
-        type: boolean
-        default: true
  workflow_dispatch:
    inputs:
      git_tag:
        description: 'Git tag'
        required: true
        type: string
-      cache_tag:
-        description: 'Cached dependencies tag'
+      cu:
+        description: 'CUDA version'
        required: true
        type: string
-        default: "cu129"
+        default: "129"
      python_minor:
        description: 'Python minor version'
        required: true
@@ -59,21 +23,7 @@ on:
        required: true
        type: string
        default: "6"
-      rel_name:
-        description: 'Release name'
-        required: true
-        type: string
-        default: "nvidia"
-      rel_extra_name:
-        description: 'Release extra name'
-        required: false
-        type: string
-        default: ""
-      test_release:
-        description: 'Test Release'
-        required: true
-        type: boolean
-        default: true
+

 jobs:
  package_comfy_windows:
@@ -92,15 +42,15 @@ jobs:
        id: cache
        with:
          path: |
-            ${{ inputs.cache_tag }}_python_deps.tar
+            cu${{ inputs.cu }}_python_deps.tar
            update_comfyui_and_python_dependencies.bat
-          key: ${{ runner.os }}-build-${{ inputs.cache_tag }}-${{ inputs.python_minor }}
+          key: ${{ runner.os }}-build-cu${{ inputs.cu }}-${{ inputs.python_minor }}
      - shell: bash
        run: |
-          mv ${{ inputs.cache_tag }}_python_deps.tar ../
+          mv cu${{ inputs.cu }}_python_deps.tar ../
          mv update_comfyui_and_python_dependencies.bat ../
          cd ..
-          tar xf ${{ inputs.cache_tag }}_python_deps.tar
+          tar xf cu${{ inputs.cu }}_python_deps.tar
          pwd
          ls

@@ -115,19 +65,12 @@ jobs:
          echo 'import site' >> ./python3${{ inputs.python_minor }}._pth
          curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
          ./python.exe get-pip.py
-          ./python.exe -s -m pip install ../${{ inputs.cache_tag }}_python_deps/*
-
-          grep comfyui ../ComfyUI/requirements.txt > ./requirements_comfyui.txt
-          ./python.exe -s -m pip install -r requirements_comfyui.txt
-          rm requirements_comfyui.txt
-
+          ./python.exe -s -m pip install ../cu${{ inputs.cu }}_python_deps/*
          sed -i '1i../ComfyUI' ./python3${{ inputs.python_minor }}._pth

-          if test -f ./Lib/site-packages/torch/lib/dnnl.lib; then
-            rm ./Lib/site-packages/torch/lib/dnnl.lib #I don't think this is actually used and I need the space
-            rm ./Lib/site-packages/torch/lib/libprotoc.lib
-            rm ./Lib/site-packages/torch/lib/libprotobuf.lib
-          fi
+          rm ./Lib/site-packages/torch/lib/dnnl.lib #I don't think this is actually used and I need the space
+          rm ./Lib/site-packages/torch/lib/libprotoc.lib
+          rm ./Lib/site-packages/torch/lib/libprotobuf.lib

          cd ..

@@ -142,18 +85,14 @@ jobs:

          mkdir update
          cp -r ComfyUI/.ci/update_windows/* ./update/
-          cp -r ComfyUI/.ci/windows_${{ inputs.rel_name }}_base_files/* ./
+          cp -r ComfyUI/.ci/windows_base_files/* ./
          cp ../update_comfyui_and_python_dependencies.bat ./update/

          cd ..

          "C:\Program Files\7-Zip\7z.exe" a -t7z -m0=lzma2 -mx=9 -mfb=128 -md=768m -ms=on -mf=BCJ2 ComfyUI_windows_portable.7z ComfyUI_windows_portable
-          mv ComfyUI_windows_portable.7z ComfyUI/ComfyUI_windows_portable_${{ inputs.rel_name }}${{ inputs.rel_extra_name }}.7z
+          mv ComfyUI_windows_portable.7z ComfyUI/ComfyUI_windows_portable_nvidia.7z

-      - shell: bash
-        if: ${{ inputs.test_release }}
-        run: |
-          cd ..
          cd ComfyUI_windows_portable
          python_embeded/python.exe -s ComfyUI/main.py --quick-test-for-ci --cpu

@@ -162,9 +101,10 @@ jobs:
          ls

      - name: Upload binaries to release
-        uses: softprops/action-gh-release@v2
+        uses: svenstaro/upload-release-action@v2
        with:
-          files: ComfyUI_windows_portable_${{ inputs.rel_name }}${{ inputs.rel_extra_name }}.7z
-          tag_name: ${{ inputs.git_tag }}
+          repo_token: ${{ secrets.GITHUB_TOKEN }}
+          file: ComfyUI_windows_portable_nvidia.7z
+          tag: ${{ inputs.git_tag }}
+          overwrite: true
          draft: true
-          overwrite_files: true
--- a/.github/workflows/test-execution.yml
+++ b/.github/workflows/test-execution.yml
@@ -1,30 +0,0 @@
-name: Execution Tests
-
-on:
-  push:
-    branches: [ main, master ]
-  pull_request:
-    branches: [ main, master ]
-
-jobs:
-  test:
-    strategy:
-      matrix:
-        os: [ubuntu-latest, windows-latest, macos-latest]
-    runs-on: ${{ matrix.os }}
-    continue-on-error: true
-    steps:
-    - uses: actions/checkout@v4
-    - name: Set up Python      
-      uses: actions/setup-python@v4
-      with:
-        python-version: '3.12'
-    - name: Install requirements
-      run: |
-        python -m pip install --upgrade pip
-        pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-        pip install -r requirements.txt
-        pip install -r tests-unit/requirements.txt
-    - name: Run Execution Tests
-      run: |
-        python -m pytest tests/execution -v --skip-timing-checks
--- a/.github/workflows/test-unit.yml
+++ b/.github/workflows/test-unit.yml
@@ -10,7 +10,7 @@ jobs:
  test:
    strategy:
      matrix:
-        os: [ubuntu-latest, windows-2022, macos-latest]
+        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    continue-on-error: true
    steps:
--- a/.github/workflows/windows_release_dependencies.yml
+++ b/.github/workflows/windows_release_dependencies.yml
@@ -17,7 +17,7 @@ on:
        description: 'cuda version'
        required: true
        type: string
-        default: "130"
+        default: "129"

      python_minor:
        description: 'python minor version'
@@ -29,7 +29,7 @@ on:
        description: 'python patch version'
        required: true
        type: string
-        default: "9"
+        default: "6"
 #  push:
 #    branches:
 #      - master
@@ -56,8 +56,7 @@ jobs:
            ..\python_embeded\python.exe -s -m pip install --upgrade torch torchvision torchaudio ${{ inputs.xformers }} --extra-index-url https://download.pytorch.org/whl/cu${{ inputs.cu }} -r ../ComfyUI/requirements.txt pygit2
            pause" > update_comfyui_and_python_dependencies.bat

-            grep -v comfyui requirements.txt > requirements_nocomfyui.txt
-            python -m pip wheel --no-cache-dir torch torchvision torchaudio ${{ inputs.xformers }} ${{ inputs.extra_dependencies }} --extra-index-url https://download.pytorch.org/whl/cu${{ inputs.cu }} -r requirements_nocomfyui.txt pygit2 -w ./temp_wheel_dir
+            python -m pip wheel --no-cache-dir torch torchvision torchaudio ${{ inputs.xformers }} ${{ inputs.extra_dependencies }} --extra-index-url https://download.pytorch.org/whl/cu${{ inputs.cu }} -r requirements.txt pygit2 -w ./temp_wheel_dir
            python -m pip install --no-cache-dir ./temp_wheel_dir/*
            echo installed basic
            ls -lah temp_wheel_dir
--- a/.github/workflows/windows_release_dependencies_manual.yml
+++ b/.github/workflows/windows_release_dependencies_manual.yml
@@ -1,64 +0,0 @@
-name: "Windows Release dependencies Manual"
-
-on:
-  workflow_dispatch:
-    inputs:
-      torch_dependencies:
-        description: 'torch dependencies'
-        required: false
-        type: string
-        default: "torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu128"
-      cache_tag:
-        description: 'Cached dependencies tag'
-        required: true
-        type: string
-        default: "cu128"
-
-      python_minor:
-        description: 'python minor version'
-        required: true
-        type: string
-        default: "12"
-
-      python_patch:
-        description: 'python patch version'
-        required: true
-        type: string
-        default: "10"
-
-jobs:
-  build_dependencies:
-    runs-on: windows-latest
-    steps:
-        - uses: actions/checkout@v4
-        - uses: actions/setup-python@v5
-          with:
-            python-version: 3.${{ inputs.python_minor }}.${{ inputs.python_patch }}
-
-        - shell: bash
-          run: |
-            echo "@echo off
-            call update_comfyui.bat nopause
-            echo -
-            echo This will try to update pytorch and all python dependencies.
-            echo -
-            echo If you just want to update normally, close this and run update_comfyui.bat instead.
-            echo -
-            pause
-            ..\python_embeded\python.exe -s -m pip install --upgrade ${{ inputs.torch_dependencies }} -r ../ComfyUI/requirements.txt pygit2
-            pause" > update_comfyui_and_python_dependencies.bat
-
-            grep -v comfyui requirements.txt > requirements_nocomfyui.txt
-            python -m pip wheel --no-cache-dir ${{ inputs.torch_dependencies }} -r requirements_nocomfyui.txt pygit2 -w ./temp_wheel_dir
-            python -m pip install --no-cache-dir ./temp_wheel_dir/*
-            echo installed basic
-            ls -lah temp_wheel_dir
-            mv temp_wheel_dir ${{ inputs.cache_tag }}_python_deps
-            tar cf ${{ inputs.cache_tag }}_python_deps.tar ${{ inputs.cache_tag }}_python_deps
-
-        - uses: actions/cache/save@v4
-          with:
-            path: |
-              ${{ inputs.cache_tag }}_python_deps.tar
-              update_comfyui_and_python_dependencies.bat
-            key: ${{ runner.os }}-build-${{ inputs.cache_tag }}-${{ inputs.python_minor }}
--- a/.github/workflows/windows_release_nightly_pytorch.yml
+++ b/.github/workflows/windows_release_nightly_pytorch.yml
@@ -68,7 +68,7 @@ jobs:

            mkdir update
            cp -r ComfyUI/.ci/update_windows/* ./update/
-            cp -r ComfyUI/.ci/windows_nvidia_base_files/* ./
+            cp -r ComfyUI/.ci/windows_base_files/* ./
            cp -r ComfyUI/.ci/windows_nightly_base_files/* ./

            echo "call update_comfyui.bat nopause
--- a/.github/workflows/windows_release_package.yml
+++ b/.github/workflows/windows_release_package.yml
@@ -81,7 +81,7 @@ jobs:

            mkdir update
            cp -r ComfyUI/.ci/update_windows/* ./update/
-            cp -r ComfyUI/.ci/windows_nvidia_base_files/* ./
+            cp -r ComfyUI/.ci/windows_base_files/* ./
            cp ../update_comfyui_and_python_dependencies.bat ./update/

            cd ..
--- a/24
+++ b/24
@@ -1,3 +1,25 @@
 # Admins
 * @comfyanonymous
-* @kosinkadink
+
+# Note: Github teams syntax cannot be used here as the repo is not owned by Comfy-Org.
+# Inlined the team members for now.
+
+# Maintainers
+*.md @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/tests/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/tests-unit/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/notebooks/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/script_examples/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/.github/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/requirements.txt @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/pyproject.toml @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+
+# Python web server
+/api_server/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne @guill
+/app/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne @guill
+/utils/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne @guill
+
+# Node developers
+/comfy_extras/ @yoland68 @robinjhuang @pythongosssss @ltdrdata @Kosinkadink @webfiltered @christian-byrne @guill
+/comfy/comfy_types/ @yoland68 @robinjhuang @pythongosssss @ltdrdata @Kosinkadink @webfiltered @christian-byrne @guill
+/comfy_api_nodes/ @yoland68 @robinjhuang @pythongosssss @ltdrdata @Kosinkadink @webfiltered @christian-byrne @guill
--- a/README.md
+++ b/README.md
@@ -66,7 +66,6 @@ See what ComfyUI can do with the [example workflows](https://comfyanonymous.gith
   - [Lumina Image 2.0](https://comfyanonymous.github.io/ComfyUI_examples/lumina2/)
   - [HiDream](https://comfyanonymous.github.io/ComfyUI_examples/hidream/)
   - [Qwen Image](https://comfyanonymous.github.io/ComfyUI_examples/qwen_image/)
-   - [Hunyuan Image 2.1](https://comfyanonymous.github.io/ComfyUI_examples/hunyuan_image/)
 - Image Editing Models
   - [Omnigen 2](https://comfyanonymous.github.io/ComfyUI_examples/omnigen/)
   - [Flux Kontext](https://comfyanonymous.github.io/ComfyUI_examples/flux/#flux-kontext-image-editing-model)
@@ -176,12 +175,6 @@ Simply download, extract with [7-Zip](https://7-zip.org) and run. Make sure you

 If you have trouble extracting it, right click the file -> properties -> unblock

-#### Alternative Downloads:
-
-[Experimental portable for AMD GPUs](https://github.com/comfyanonymous/ComfyUI/releases/latest/download/ComfyUI_windows_portable_amd.7z)
-
-[Portable with pytorch cuda 12.8 and python 3.12](https://github.com/comfyanonymous/ComfyUI/releases/latest/download/ComfyUI_windows_portable_nvidia_cu128.7z) (Supports Nvidia 10 series and older GPUs).
-
 #### How do I share models between another UI and ComfyUI?

 See the [Config file](extra_model_paths.yaml.example) to set the search paths for models. In the standalone windows build you can find this file in the ComfyUI directory. Rename this file to extra_model_paths.yaml and edit it with your favorite text editor.
@@ -197,11 +190,7 @@ comfy install

 ## Manual Install (Windows, Linux)

-Python 3.14 will work if you comment out the `kornia` dependency in the requirements.txt file (breaks the canny node) but it is not recommended.
-
-Python 3.13 is very well supported. If you have trouble with some custom node dependencies on 3.13 you can try 3.12
-
-### Instructions:
+Python 3.13 is very well supported. If you have trouble with some custom node dependencies you can try 3.12

 Git clone this repo.

@@ -210,32 +199,14 @@ Put your SD checkpoints (the huge ckpt/safetensors files) in: models/checkpoints
 Put your VAE in: models/vae


-### AMD GPUs (Linux)
-
+### AMD GPUs (Linux only)
 AMD users can install rocm and pytorch with pip if you don't have it already installed, this is the command to install the stable version:

 ```pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.4```

-This is the command to install the nightly with ROCm 7.0 which might have some performance improvements:
+This is the command to install the nightly with ROCm 6.4 which might have some performance improvements:

-```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.0```
-
-
-### AMD GPUs (Experimental: Windows and Linux), RDNA 3, 3.5 and 4 only.
-
-These have less hardware support than the builds above but they work on windows. You also need to install the pytorch version specific to your hardware.
-
-RDNA 3 (RX 7000 series):
-
-```pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-dgpu/```
-
-RDNA 3.5 (Strix halo/Ryzen AI Max+ 365):
-
-```pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx1151/```
-
-RDNA 4 (RX 9000 series):
-
-```pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/```
+```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4```

 ### Intel GPUs (Windows and Linux)

@@ -257,11 +228,11 @@ This is the command to install the Pytorch xpu nightly which might have some per

 Nvidia users should install stable pytorch using this command:

-```pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu130```
+```pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu129```

 This is the command to install pytorch nightly instead which might have performance improvements.

-```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu130```
+```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu129```

 #### Troubleshooting

@@ -292,6 +263,12 @@ You can install ComfyUI in Apple Mac silicon (M1 or M2) with any recent macOS ve

 > **Note**: Remember to add your models, VAE, LoRAs etc. to the corresponding Comfy folders, as discussed in [ComfyUI manual installation](#manual-install-windows-linux).

+#### DirectML (AMD Cards on Windows)
+
+This is very badly supported and is not recommended. There are some unofficial builds of pytorch ROCm on windows that exist that will give you a much better experience than this. This readme will be updated once official pytorch ROCm builds for windows come out.
+
+```pip install torch-directml``` Then you can launch ComfyUI with: ```python main.py --directml```
+
 #### Ascend NPUs

 For models compatible with Ascend Extension for PyTorch (torch_npu). To get started, ensure your environment meets the prerequisites outlined on the [installation](https://ascend.github.io/docs/sources/ascend/quick_install.html) page. Here's a step-by-step guide tailored to your platform and installation method:
--- a/app/frontend_management.py
+++ b/app/frontend_management.py
@@ -42,7 +42,6 @@ def get_installed_frontend_version():
    frontend_version_str = version("comfyui-frontend-package")
    return frontend_version_str

-
 def get_required_frontend_version():
    """Get the required frontend version from requirements.txt."""
    try:
@@ -64,7 +63,6 @@ def get_required_frontend_version():
        logging.error(f"Error reading requirements.txt: {e}")
        return None

-
 def check_frontend_version():
    """Check if the frontend version is up to date."""

@@ -205,37 +203,6 @@ class FrontendManager:
        """Get the required frontend package version."""
        return get_required_frontend_version()

-    @classmethod
-    def get_installed_templates_version(cls) -> str:
-        """Get the currently installed workflow templates package version."""
-        try:
-            templates_version_str = version("comfyui-workflow-templates")
-            return templates_version_str
-        except Exception:
-            return None
-
-    @classmethod
-    def get_required_templates_version(cls) -> str:
-        """Get the required workflow templates version from requirements.txt."""
-        try:
-            with open(requirements_path, "r", encoding="utf-8") as f:
-                for line in f:
-                    line = line.strip()
-                    if line.startswith("comfyui-workflow-templates=="):
-                        version_str = line.split("==")[-1]
-                        if not is_valid_version(version_str):
-                            logging.error(f"Invalid templates version format in requirements.txt: {version_str}")
-                            return None
-                        return version_str
-                logging.error("comfyui-workflow-templates not found in requirements.txt")
-                return None
-        except FileNotFoundError:
-            logging.error("requirements.txt not found. Cannot determine required templates version.")
-            return None
-        except Exception as e:
-            logging.error(f"Error reading requirements.txt: {e}")
-            return None
-
    @classmethod
    def default_frontend_path(cls) -> str:
        try:
--- a/app/subgraph_manager.py
+++ b/app/subgraph_manager.py
@@ -1,112 +0,0 @@
-from __future__ import annotations
-
-from typing import TypedDict
-import os
-import folder_paths
-import glob
-from aiohttp import web
-import hashlib
-
-
-class Source:
-    custom_node = "custom_node"
-
-class SubgraphEntry(TypedDict):
-    source: str
-    """
-    Source of subgraph - custom_nodes vs templates.
-    """
-    path: str
-    """
-    Relative path of the subgraph file.
-    For custom nodes, will be the relative directory like <custom_node_dir>/subgraphs/<name>.json
-    """
-    name: str
-    """
-    Name of subgraph file.
-    """
-    info: CustomNodeSubgraphEntryInfo
-    """
-    Additional info about subgraph; in the case of custom_nodes, will contain nodepack name
-    """
-    data: str
-
-class CustomNodeSubgraphEntryInfo(TypedDict):
-    node_pack: str
-    """Node pack name."""
-
-class SubgraphManager:
-    def __init__(self):
-        self.cached_custom_node_subgraphs: dict[SubgraphEntry] | None = None
-
-    async def load_entry_data(self, entry: SubgraphEntry):
-        with open(entry['path'], 'r') as f:
-            entry['data'] = f.read()
-        return entry
-
-    async def sanitize_entry(self, entry: SubgraphEntry | None, remove_data=False) -> SubgraphEntry | None:
-        if entry is None:
-            return None
-        entry = entry.copy()
-        entry.pop('path', None)
-        if remove_data:
-            entry.pop('data', None)
-        return entry
-
-    async def sanitize_entries(self, entries: dict[str, SubgraphEntry], remove_data=False) -> dict[str, SubgraphEntry]:
-        entries = entries.copy()
-        for key in list(entries.keys()):
-            entries[key] = await self.sanitize_entry(entries[key], remove_data)
-        return entries
-
-    async def get_custom_node_subgraphs(self, loadedModules, force_reload=False):
-        # if not forced to reload and cached, return cache
-        if not force_reload and self.cached_custom_node_subgraphs is not None:
-            return self.cached_custom_node_subgraphs
-        # Load subgraphs from custom nodes
-        subfolder = "subgraphs"
-        subgraphs_dict: dict[SubgraphEntry] = {}
-
-        for folder in folder_paths.get_folder_paths("custom_nodes"):
-            pattern = os.path.join(folder, f"*/{subfolder}/*.json")
-            matched_files = glob.glob(pattern)
-            for file in matched_files:
-                # replace backslashes with forward slashes
-                file = file.replace('\\', '/')
-                info: CustomNodeSubgraphEntryInfo = {
-                    "node_pack": "custom_nodes." + file.split('/')[-3]
-                }
-                source = Source.custom_node
-                # hash source + path to make sure id will be as unique as possible, but
-                # reproducible across backend reloads
-                id = hashlib.sha256(f"{source}{file}".encode()).hexdigest()
-                entry: SubgraphEntry = {
-                    "source": Source.custom_node,
-                    "name": os.path.splitext(os.path.basename(file))[0],
-                    "path": file,
-                    "info": info,
-                }
-                subgraphs_dict[id] = entry
-        self.cached_custom_node_subgraphs = subgraphs_dict
-        return subgraphs_dict
-
-    async def get_custom_node_subgraph(self, id: str, loadedModules):
-        subgraphs = await self.get_custom_node_subgraphs(loadedModules)
-        entry: SubgraphEntry = subgraphs.get(id, None)
-        if entry is not None and entry.get('data', None) is None:
-            await self.load_entry_data(entry)
-        return entry
-
-    def add_routes(self, routes, loadedModules):
-        @routes.get("/global_subgraphs")
-        async def get_global_subgraphs(request):
-            subgraphs_dict = await self.get_custom_node_subgraphs(loadedModules)
-            # NOTE: we may want to include other sources of global subgraphs such as templates in the future;
-            # that's the reasoning for the current implementation
-            return web.json_response(await self.sanitize_entries(subgraphs_dict, remove_data=True))
-
-        @routes.get("/global_subgraphs/{id}")
-        async def get_global_subgraph(request):
-            id = request.match_info.get("id", None)
-            subgraph = await self.get_custom_node_subgraph(id, loadedModules)
-            return web.json_response(await self.sanitize_entry(subgraph))
--- a/comfy/audio_encoders/audio_encoders.py
+++ b/comfy/audio_encoders/audio_encoders.py
@@ -1,5 +1,4 @@
 from .wav2vec2 import Wav2Vec2Model
-from .whisper import WhisperLargeV3
 import comfy.model_management
 import comfy.ops
 import comfy.utils
@@ -12,18 +11,7 @@ class AudioEncoderModel():
        self.load_device = comfy.model_management.text_encoder_device()
        offload_device = comfy.model_management.text_encoder_offload_device()
        self.dtype = comfy.model_management.text_encoder_dtype(self.load_device)
-        model_type = config.pop("model_type")
-        model_config = dict(config)
-        model_config.update({
-            "dtype": self.dtype,
-            "device": offload_device,
-            "operations": comfy.ops.manual_cast
-        })
-
-        if model_type == "wav2vec2":
-            self.model = Wav2Vec2Model(**model_config)
-        elif model_type == "whisper3":
-            self.model = WhisperLargeV3(**model_config)
+        self.model = Wav2Vec2Model(dtype=self.dtype, device=offload_device, operations=comfy.ops.manual_cast)
        self.model.eval()
        self.patcher = comfy.model_patcher.ModelPatcher(self.model, load_device=self.load_device, offload_device=offload_device)
        self.model_sample_rate = 16000
@@ -41,51 +29,14 @@ class AudioEncoderModel():
        outputs = {}
        outputs["encoded_audio"] = out
        outputs["encoded_audio_all_layers"] = all_layers
-        outputs["audio_samples"] = audio.shape[2]
        return outputs


 def load_audio_encoder_from_sd(sd, prefix=""):
+    audio_encoder = AudioEncoderModel(None)
    sd = comfy.utils.state_dict_prefix_replace(sd, {"wav2vec2.": ""})
-    if "encoder.layer_norm.bias" in sd: #wav2vec2
-        embed_dim = sd["encoder.layer_norm.bias"].shape[0]
-        if embed_dim == 1024:# large
-            config = {
-                "model_type": "wav2vec2",
-                "embed_dim": 1024,
-                "num_heads": 16,
-                "num_layers": 24,
-                "conv_norm": True,
-                "conv_bias": True,
-                "do_normalize": True,
-                "do_stable_layer_norm": True
-                }
-        elif embed_dim == 768: # base
-            config = {
-                "model_type": "wav2vec2",
-                "embed_dim": 768,
-                "num_heads": 12,
-                "num_layers": 12,
-                "conv_norm": False,
-                "conv_bias": False,
-                "do_normalize": False, # chinese-wav2vec2-base has this False
-                "do_stable_layer_norm": False
-            }
-        else:
-            raise RuntimeError("ERROR: audio encoder file is invalid or unsupported embed_dim: {}".format(embed_dim))
-    elif "model.encoder.embed_positions.weight" in sd:
-        sd = comfy.utils.state_dict_prefix_replace(sd, {"model.": ""})
-        config = {
-            "model_type": "whisper3",
-        }
-    else:
-        raise RuntimeError("ERROR: audio encoder not supported.")
-
-    audio_encoder = AudioEncoderModel(config)
    m, u = audio_encoder.load_sd(sd)
    if len(m) > 0:
        logging.warning("missing audio encoder: {}".format(m))
-    if len(u) > 0:
-        logging.warning("unexpected audio encoder: {}".format(u))

    return audio_encoder
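
Review note on the hunk above: the removed branch of `load_audio_encoder_from_sd` chose a configuration by inspecting state-dict keys and one tensor shape, whereas the new code always builds the default `Wav2Vec2Model`. The sketch below restates that removed detection logic in isolation, assuming a plain mapping from key to shape tuple in place of a real state dict. The function name `detect_audio_encoder` and the `shapes` argument are hypothetical, and only a subset of the original config fields is reproduced.

```python
def detect_audio_encoder(shapes):
    """Pick an encoder config from state-dict keys.

    `shapes` maps a state-dict key to a shape tuple; it is a hypothetical
    stand-in for a real state dict of tensors.
    """
    if "encoder.layer_norm.bias" in shapes:  # wav2vec2 family
        embed_dim = shapes["encoder.layer_norm.bias"][0]
        if embed_dim == 1024:  # large
            return {"model_type": "wav2vec2", "embed_dim": 1024, "num_heads": 16, "num_layers": 24}
        if embed_dim == 768:  # base
            return {"model_type": "wav2vec2", "embed_dim": 768, "num_heads": 12, "num_layers": 12}
        raise RuntimeError("unsupported embed_dim: {}".format(embed_dim))
    if "model.encoder.embed_positions.weight" in shapes:  # whisper family
        return {"model_type": "whisper3"}
    raise RuntimeError("audio encoder not supported")


large = detect_audio_encoder({"encoder.layer_norm.bias": (1024,)})
base = detect_audio_encoder({"encoder.layer_norm.bias": (768,)})
whisper = detect_audio_encoder({"model.encoder.embed_positions.weight": (1500, 1280)})
```

After this change, a 768-wide or Whisper checkpoint is no longer routed to a matching architecture, so the only remaining signal would be the "missing audio encoder" warning from `load_sd`.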
--- a/comfy/audio_encoders/wav2vec2.py
+++ b/comfy/audio_encoders/wav2vec2.py
@@ -13,49 +13,19 @@ class LayerNormConv(nn.Module):
        x = self.conv(x)
        return torch.nn.functional.gelu(self.layer_norm(x.transpose(-2, -1)).transpose(-2, -1))

-class LayerGroupNormConv(nn.Module):
-    def __init__(self, in_channels, out_channels, kernel_size, stride, bias=False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.conv = operations.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, bias=bias, device=device, dtype=dtype)
-        self.layer_norm = operations.GroupNorm(num_groups=out_channels, num_channels=out_channels, affine=True, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.conv(x)
-        return torch.nn.functional.gelu(self.layer_norm(x))
-
-class ConvNoNorm(nn.Module):
-    def __init__(self, in_channels, out_channels, kernel_size, stride, bias=False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.conv = operations.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, bias=bias, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.conv(x)
-        return torch.nn.functional.gelu(x)
-

 class ConvFeatureEncoder(nn.Module):
-    def __init__(self, conv_dim, conv_bias=False, conv_norm=True, dtype=None, device=None, operations=None):
+    def __init__(self, conv_dim, dtype=None, device=None, operations=None):
        super().__init__()
-        if conv_norm:
-            self.conv_layers = nn.ModuleList([
-                LayerNormConv(1, conv_dim, kernel_size=10, stride=5, bias=True, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-            ])
-        else:
-            self.conv_layers = nn.ModuleList([
-                LayerGroupNormConv(1, conv_dim, kernel_size=10, stride=5, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-            ])
+        self.conv_layers = nn.ModuleList([
+            LayerNormConv(1, conv_dim, kernel_size=10, stride=5, bias=True, device=device, dtype=dtype, operations=operations),
+            LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=True, device=device, dtype=dtype, operations=operations),
+            LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=True, device=device, dtype=dtype, operations=operations),
+            LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=True, device=device, dtype=dtype, operations=operations),
+            LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=True, device=device, dtype=dtype, operations=operations),
+            LayerNormConv(conv_dim, conv_dim, kernel_size=2, stride=2, bias=True, device=device, dtype=dtype, operations=operations),
+            LayerNormConv(conv_dim, conv_dim, kernel_size=2, stride=2, bias=True, device=device, dtype=dtype, operations=operations),
+        ])

    def forward(self, x):
        x = x.unsqueeze(1)
@@ -106,7 +76,6 @@ class TransformerEncoder(nn.Module):
        num_heads=12,
        num_layers=12,
        mlp_ratio=4.0,
-        do_stable_layer_norm=True,
        dtype=None, device=None, operations=None
    ):
        super().__init__()
@@ -117,25 +86,20 @@ class TransformerEncoder(nn.Module):
                embed_dim=embed_dim,
                num_heads=num_heads,
                mlp_ratio=mlp_ratio,
-                do_stable_layer_norm=do_stable_layer_norm,
                device=device, dtype=dtype, operations=operations
            )
            for _ in range(num_layers)
        ])

        self.layer_norm = operations.LayerNorm(embed_dim, eps=1e-05, device=device, dtype=dtype)
-        self.do_stable_layer_norm = do_stable_layer_norm

    def forward(self, x, mask=None):
        x = x + self.pos_conv_embed(x)
        all_x = ()
-        if not self.do_stable_layer_norm:
-            x = self.layer_norm(x)
        for layer in self.layers:
            all_x += (x,)
            x = layer(x, mask)
-        if self.do_stable_layer_norm:
-            x = self.layer_norm(x)
+        x = self.layer_norm(x)
        all_x += (x,)
        return x, all_x

@@ -181,7 +145,6 @@ class TransformerEncoderLayer(nn.Module):
        embed_dim=768,
        num_heads=12,
        mlp_ratio=4.0,
-        do_stable_layer_norm=True,
        dtype=None, device=None, operations=None
    ):
        super().__init__()
@@ -191,19 +154,15 @@ class TransformerEncoderLayer(nn.Module):
        self.layer_norm = operations.LayerNorm(embed_dim, device=device, dtype=dtype)
        self.feed_forward = FeedForward(embed_dim, mlp_ratio, device=device, dtype=dtype, operations=operations)
        self.final_layer_norm = operations.LayerNorm(embed_dim, device=device, dtype=dtype)
-        self.do_stable_layer_norm = do_stable_layer_norm

    def forward(self, x, mask=None):
        residual = x
-        if self.do_stable_layer_norm:
-            x = self.layer_norm(x)
+        x = self.layer_norm(x)
        x = self.attention(x, mask=mask)
        x = residual + x
-        if not self.do_stable_layer_norm:
-            x = self.layer_norm(x)
-            return self.final_layer_norm(x + self.feed_forward(x))
-        else:
-            return x + self.feed_forward(self.final_layer_norm(x))
+
+        x = x + self.feed_forward(self.final_layer_norm(x))
+        return x


 class Wav2Vec2Model(nn.Module):
@@ -215,38 +174,34 @@ class Wav2Vec2Model(nn.Module):
        final_dim=256,
        num_heads=16,
        num_layers=24,
-        conv_norm=True,
-        conv_bias=True,
-        do_normalize=True,
-        do_stable_layer_norm=True,
        dtype=None, device=None, operations=None
    ):
        super().__init__()

        conv_dim = 512
-        self.feature_extractor = ConvFeatureEncoder(conv_dim, conv_norm=conv_norm, conv_bias=conv_bias, device=device, dtype=dtype, operations=operations)
+        self.feature_extractor = ConvFeatureEncoder(conv_dim, device=device, dtype=dtype, operations=operations)
        self.feature_projection = FeatureProjection(conv_dim, embed_dim, device=device, dtype=dtype, operations=operations)

        self.masked_spec_embed = nn.Parameter(torch.empty(embed_dim, device=device, dtype=dtype))
-        self.do_normalize = do_normalize

        self.encoder = TransformerEncoder(
            embed_dim=embed_dim,
            num_heads=num_heads,
            num_layers=num_layers,
-            do_stable_layer_norm=do_stable_layer_norm,
            device=device, dtype=dtype, operations=operations
        )

    def forward(self, x, mask_time_indices=None, return_dict=False):
+
        x = torch.mean(x, dim=1)

-        if self.do_normalize:
-            x = (x - x.mean()) / torch.sqrt(x.var() + 1e-7)
+        x = (x - x.mean()) / torch.sqrt(x.var() + 1e-7)

        features = self.feature_extractor(x)
        features = self.feature_projection(features)
+
        batch_size, seq_len, _ = features.shape

        x, all_x = self.encoder(features)
+
        return x, all_x
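
Review note on the hunk above: with the `do_normalize` flag gone, `Wav2Vec2Model.forward` always standardizes the mono waveform with `(x - x.mean()) / torch.sqrt(x.var() + 1e-7)` before feature extraction. A minimal stdlib-only sketch of that step follows, assuming a plain list of floats in place of a tensor. The helper name `standardize` is hypothetical; `statistics.variance` is used because it is the sample variance (n - 1 denominator), which matches the default of `torch.Tensor.var`.

```python
import math
import statistics


def standardize(samples, eps=1e-7):
    """Zero-mean, unit-variance scaling: (x - mean) / sqrt(var + eps)."""
    mean = statistics.fmean(samples)
    var = statistics.variance(samples)  # sample variance, n - 1 denominator
    scale = math.sqrt(var + eps)
    return [(s - mean) / scale for s in samples]


waveform = [0.0, 0.5, -0.5, 1.0, -1.0]
normalized = standardize(waveform)
```

The small `eps` keeps the division finite for a silent (constant) input, where the variance is zero.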
--- a/comfy/audio_encoders/whisper.py
+++ b/comfy/audio_encoders/whisper.py
@@ -1,186 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import torchaudio
-from typing import Optional
-from comfy.ldm.modules.attention import optimized_attention_masked
-import comfy.ops
-
-class WhisperFeatureExtractor(nn.Module):
-    def __init__(self, n_mels=128, device=None):
-        super().__init__()
-        self.sample_rate = 16000
-        self.n_fft = 400
-        self.hop_length = 160
-        self.n_mels = n_mels
-        self.chunk_length = 30
-        self.n_samples = 480000
-
-        self.mel_spectrogram = torchaudio.transforms.MelSpectrogram(
-            sample_rate=self.sample_rate,
-            n_fft=self.n_fft,
-            hop_length=self.hop_length,
-            n_mels=self.n_mels,
-            f_min=0,
-            f_max=8000,
-            norm="slaney",
-            mel_scale="slaney",
-        ).to(device)
-
-    def __call__(self, audio):
-        audio = torch.mean(audio, dim=1)
-        batch_size = audio.shape[0]
-        processed_audio = []
-
-        for i in range(batch_size):
-            aud = audio[i]
-            if aud.shape[0] > self.n_samples:
-                aud = aud[:self.n_samples]
-            elif aud.shape[0] < self.n_samples:
-                aud = F.pad(aud, (0, self.n_samples - aud.shape[0]))
-            processed_audio.append(aud)
-
-        audio = torch.stack(processed_audio)
-
-        mel_spec = self.mel_spectrogram(audio.to(self.mel_spectrogram.spectrogram.window.device))[:, :, :-1].to(audio.device)
-
-        log_mel_spec = torch.clamp(mel_spec, min=1e-10).log10()
-        log_mel_spec = torch.maximum(log_mel_spec, log_mel_spec.max() - 8.0)
-        log_mel_spec = (log_mel_spec + 4.0) / 4.0
-
-        return log_mel_spec
-
-
-class MultiHeadAttention(nn.Module):
-    def __init__(self, d_model: int, n_heads: int, dtype=None, device=None, operations=None):
-        super().__init__()
-        assert d_model % n_heads == 0
-
-        self.d_model = d_model
-        self.n_heads = n_heads
-        self.d_k = d_model // n_heads
-
-        self.q_proj = operations.Linear(d_model, d_model, dtype=dtype, device=device)
-        self.k_proj = operations.Linear(d_model, d_model, bias=False, dtype=dtype, device=device)
-        self.v_proj = operations.Linear(d_model, d_model, dtype=dtype, device=device)
-        self.out_proj = operations.Linear(d_model, d_model, dtype=dtype, device=device)
-
-    def forward(
-        self,
-        query: torch.Tensor,
-        key: torch.Tensor,
-        value: torch.Tensor,
-        mask: Optional[torch.Tensor] = None,
-    ) -> torch.Tensor:
-        batch_size, seq_len, _ = query.shape
-
-        q = self.q_proj(query)
-        k = self.k_proj(key)
-        v = self.v_proj(value)
-
-        attn_output = optimized_attention_masked(q, k, v, self.n_heads, mask)
-        attn_output = self.out_proj(attn_output)
-
-        return attn_output
-
-
-class EncoderLayer(nn.Module):
-    def __init__(self, d_model: int, n_heads: int, d_ff: int, dtype=None, device=None, operations=None):
-        super().__init__()
-
-        self.self_attn = MultiHeadAttention(d_model, n_heads, dtype=dtype, device=device, operations=operations)
-        self.self_attn_layer_norm = operations.LayerNorm(d_model, dtype=dtype, device=device)
-
-        self.fc1 = operations.Linear(d_model, d_ff, dtype=dtype, device=device)
-        self.fc2 = operations.Linear(d_ff, d_model, dtype=dtype, device=device)
-        self.final_layer_norm = operations.LayerNorm(d_model, dtype=dtype, device=device)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        attention_mask: Optional[torch.Tensor] = None
-    ) -> torch.Tensor:
-        residual = x
-        x = self.self_attn_layer_norm(x)
-        x = self.self_attn(x, x, x, attention_mask)
-        x = residual + x
-
-        residual = x
-        x = self.final_layer_norm(x)
-        x = self.fc1(x)
-        x = F.gelu(x)
-        x = self.fc2(x)
-        x = residual + x
-
-        return x
-
-
-class AudioEncoder(nn.Module):
-    def __init__(
-        self,
-        n_mels: int = 128,
-        n_ctx: int = 1500,
-        n_state: int = 1280,
-        n_head: int = 20,
-        n_layer: int = 32,
-        dtype=None,
-        device=None,
-        operations=None
-    ):
-        super().__init__()
-
-        self.conv1 = operations.Conv1d(n_mels, n_state, kernel_size=3, padding=1, dtype=dtype, device=device)
-        self.conv2 = operations.Conv1d(n_state, n_state, kernel_size=3, stride=2, padding=1, dtype=dtype, device=device)
-
-        self.embed_positions = operations.Embedding(n_ctx, n_state, dtype=dtype, device=device)
-
-        self.layers = nn.ModuleList([
-            EncoderLayer(n_state, n_head, n_state * 4, dtype=dtype, device=device, operations=operations)
-            for _ in range(n_layer)
-        ])
-
-        self.layer_norm = operations.LayerNorm(n_state, dtype=dtype, device=device)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        x = F.gelu(self.conv1(x))
-        x = F.gelu(self.conv2(x))
-
-        x = x.transpose(1, 2)
-
-        x = x + comfy.ops.cast_to_input(self.embed_positions.weight[:, :x.shape[1]], x)
-
-        all_x = ()
-        for layer in self.layers:
-            all_x += (x,)
-            x = layer(x)
-
-        x = self.layer_norm(x)
-        all_x += (x,)
-        return x, all_x
-
-
-class WhisperLargeV3(nn.Module):
-    def __init__(
-        self,
-        n_mels: int = 128,
-        n_audio_ctx: int = 1500,
-        n_audio_state: int = 1280,
-        n_audio_head: int = 20,
-        n_audio_layer: int = 32,
-        dtype=None,
-        device=None,
-        operations=None
-    ):
-        super().__init__()
-
-        self.feature_extractor = WhisperFeatureExtractor(n_mels=n_mels, device=device)
-
-        self.encoder = AudioEncoder(
-            n_mels, n_audio_ctx, n_audio_state, n_audio_head, n_audio_layer,
-            dtype=dtype, device=device, operations=operations
-        )
-
-    def forward(self, audio):
-        mel = self.feature_extractor(audio)
-        x, all_x = self.encoder(mel)
-        return x, all_x
--- a/comfy/cli_args.py
+++ b/comfy/cli_args.py
@@ -143,9 +143,8 @@ class PerformanceFeature(enum.Enum):
    Fp16Accumulation = "fp16_accumulation"
    Fp8MatrixMultiplication = "fp8_matrix_mult"
    CublasOps = "cublas_ops"
-    AutoTune = "autotune"

-parser.add_argument("--fast", nargs="*", type=PerformanceFeature, help="Enable some untested and potentially quality deteriorating optimizations. --fast with no arguments enables everything. You can pass a list specific optimizations if you only want to enable specific ones. Current valid optimizations: {}".format(" ".join(map(lambda c: c.value, PerformanceFeature))))
+parser.add_argument("--fast", nargs="*", type=PerformanceFeature, help="Enable some untested and potentially quality deteriorating optimizations. --fast with no arguments enables everything. You can pass a list of specific optimizations if you only want to enable specific ones. Current valid optimizations: fp16_accumulation fp8_matrix_mult cublas_ops")

 parser.add_argument("--mmap-torch-files", action="store_true", help="Use mmap when loading ckpt/pt files.")
 parser.add_argument("--disable-mmap", action="store_true", help="Don't use mmap when loading safetensors.")
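
Review note on the hunk above: the help text for `--fast` goes from a list generated out of `PerformanceFeature` to a hard-coded one. The sketch below is a stdlib-only illustration of how `argparse` handles `nargs="*"` with an enum as `type`, and how the list can be derived from the enum so the two cannot drift apart. The `build_parser` helper and the shortened help string are hypothetical; whether the hard-coding is intentional is not stated in this diff.

```python
import argparse
import enum


class PerformanceFeature(enum.Enum):
    Fp16Accumulation = "fp16_accumulation"
    Fp8MatrixMultiplication = "fp8_matrix_mult"
    CublasOps = "cublas_ops"


def build_parser():
    parser = argparse.ArgumentParser()
    # Derive the list of valid values from the enum itself.
    valid = " ".join(f.value for f in PerformanceFeature)
    parser.add_argument(
        "--fast", nargs="*", type=PerformanceFeature,
        help="Current valid optimizations: {}".format(valid),
    )
    return parser


parser = build_parser()
absent = parser.parse_args([]).fast        # flag not given
bare = parser.parse_args(["--fast"]).fast  # flag given with no values
picked = parser.parse_args(["--fast", "cublas_ops", "fp8_matrix_mult"]).fast
```

Here `absent` is `None` and `bare` is an empty list, so calling code can tell "not requested" apart from "enable everything".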
--- a/comfy/clip_model.py
+++ b/comfy/clip_model.py
@@ -61,12 +61,8 @@ class CLIPEncoder(torch.nn.Module):
    def forward(self, x, mask=None, intermediate_output=None):
        optimized_attention = optimized_attention_for_device(x.device, mask=mask is not None, small_input=True)

-        all_intermediate = None
        if intermediate_output is not None:
-            if intermediate_output == "all":
-                all_intermediate = []
-                intermediate_output = None
-            elif intermediate_output < 0:
+            if intermediate_output < 0:
                intermediate_output = len(self.layers) + intermediate_output

        intermediate = None
@@ -74,12 +70,6 @@ class CLIPEncoder(torch.nn.Module):
            x = l(x, mask, optimized_attention)
            if i == intermediate_output:
                intermediate = x.clone()
-            if all_intermediate is not None:
-                all_intermediate.append(x.unsqueeze(1).clone())
-
-        if all_intermediate is not None:
-            intermediate = torch.cat(all_intermediate, dim=1)
-
        return x, intermediate

 class CLIPEmbeddings(torch.nn.Module):
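
Review note on the hunk above: once the `"all"` branch is removed, `intermediate_output` is only ever treated as an integer layer index, with negative values counted from the end. The sketch below isolates that index handling, assuming plain callables in place of transformer layers. The names `resolve_layer_index` and `run_layers` are hypothetical.

```python
def resolve_layer_index(intermediate_output, num_layers):
    """Map a possibly negative layer index to a non-negative one."""
    if intermediate_output is not None and intermediate_output < 0:
        return num_layers + intermediate_output
    return intermediate_output


def run_layers(x, layers, intermediate_output=None):
    """Apply layers in order and capture the output of one selected layer."""
    target = resolve_layer_index(intermediate_output, len(layers))
    intermediate = None
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == target:
            intermediate = x
    return x, intermediate


# Three toy "layers" that each add a distinct amount.
layers = [lambda v, k=k: v + k for k in (1, 10, 100)]
final, penultimate = run_layers(0, layers, intermediate_output=-2)
```

One consequence worth checking in callers: a string such as `"all"` would now fail at the `< 0` comparison with a `TypeError` instead of returning every hidden state.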
--- a/comfy/clip_vision.py
+++ b/comfy/clip_vision.py
@@ -50,13 +50,7 @@ class ClipVisionModel():
        self.image_size = config.get("image_size", 224)
        self.image_mean = config.get("image_mean", [0.48145466, 0.4578275, 0.40821073])
        self.image_std = config.get("image_std", [0.26862954, 0.26130258, 0.27577711])
-        model_type = config.get("model_type", "clip_vision_model")
-        model_class = IMAGE_ENCODERS.get(model_type)
-        if model_type == "siglip_vision_model":
-            self.return_all_hidden_states = True
-        else:
-            self.return_all_hidden_states = False
-
+        model_class = IMAGE_ENCODERS.get(config.get("model_type", "clip_vision_model"))
        self.load_device = comfy.model_management.text_encoder_device()
        offload_device = comfy.model_management.text_encoder_offload_device()
        self.dtype = comfy.model_management.text_encoder_dtype(self.load_device)
@@ -74,18 +68,12 @@ class ClipVisionModel():
    def encode_image(self, image, crop=True):
        comfy.model_management.load_model_gpu(self.patcher)
        pixel_values = clip_preprocess(image.to(self.load_device), size=self.image_size, mean=self.image_mean, std=self.image_std, crop=crop).float()
-        out = self.model(pixel_values=pixel_values, intermediate_output='all' if self.return_all_hidden_states else -2)
+        out = self.model(pixel_values=pixel_values, intermediate_output=-2)

        outputs = Output()
        outputs["last_hidden_state"] = out[0].to(comfy.model_management.intermediate_device())
        outputs["image_embeds"] = out[2].to(comfy.model_management.intermediate_device())
-        if self.return_all_hidden_states:
-            all_hs = out[1].to(comfy.model_management.intermediate_device())
-            outputs["penultimate_hidden_states"] = all_hs[:, -2]
-            outputs["all_hidden_states"] = all_hs
-        else:
-            outputs["penultimate_hidden_states"] = out[1].to(comfy.model_management.intermediate_device())
-
+        outputs["penultimate_hidden_states"] = out[1].to(comfy.model_management.intermediate_device())
        outputs["mm_projected"] = out[3]
        return outputs

@@ -136,12 +124,8 @@ def load_clipvision_from_sd(sd, prefix="", convert_keys=False):
                json_config = os.path.join(os.path.dirname(os.path.realpath(__file__)), "clip_vision_config_vitl_336.json")
        else:
            json_config = os.path.join(os.path.dirname(os.path.realpath(__file__)), "clip_vision_config_vitl.json")
-
-    # Dinov2
-    elif 'encoder.layer.39.layer_scale2.lambda1' in sd:
+    elif "embeddings.patch_embeddings.projection.weight" in sd:
        json_config = os.path.join(os.path.join(os.path.dirname(os.path.realpath(__file__)), "image_encoders"), "dino2_giant.json")
-    elif 'encoder.layer.23.layer_scale2.lambda1' in sd:
-        json_config = os.path.join(os.path.join(os.path.dirname(os.path.realpath(__file__)), "image_encoders"), "dino2_large.json")
    else:
        return None

--- a/comfy/controlnet.py
+++ b/comfy/controlnet.py
@@ -253,10 +253,7 @@ class ControlNet(ControlBase):
                to_concat = []
                for c in self.extra_concat_orig:
                    c = c.to(self.cond_hint.device)
-                    c = comfy.utils.common_upscale(c, self.cond_hint.shape[-1], self.cond_hint.shape[-2], self.upscale_algorithm, "center")
-                    if c.ndim < self.cond_hint.ndim:
-                        c = c.unsqueeze(2)
-                        c = comfy.utils.repeat_to_batch_size(c, self.cond_hint.shape[2], dim=2)
+                    c = comfy.utils.common_upscale(c, self.cond_hint.shape[3], self.cond_hint.shape[2], self.upscale_algorithm, "center")
                    to_concat.append(comfy.utils.repeat_to_batch_size(c, self.cond_hint.shape[0]))
                self.cond_hint = torch.cat([self.cond_hint] + to_concat, dim=1)

@@ -588,18 +585,11 @@ def load_controlnet_flux_instantx(sd, model_options={}):

 def load_controlnet_qwen_instantx(sd, model_options={}):
    model_config, operations, load_device, unet_dtype, manual_cast_dtype, offload_device = controlnet_config(sd, model_options=model_options)
-    control_latent_channels = sd.get("controlnet_x_embedder.weight").shape[1]
-
-    extra_condition_channels = 0
-    concat_mask = False
-    if control_latent_channels == 68: #inpaint controlnet
-        extra_condition_channels = control_latent_channels - 64
-        concat_mask = True
-    control_model = comfy.ldm.qwen_image.controlnet.QwenImageControlNetModel(extra_condition_channels=extra_condition_channels, operations=operations, device=offload_device, dtype=unet_dtype, **model_config.unet_config)
+    control_model = comfy.ldm.qwen_image.controlnet.QwenImageControlNetModel(operations=operations, device=offload_device, dtype=unet_dtype, **model_config.unet_config)
    control_model = controlnet_load_state_dict(control_model, sd)
    latent_format = comfy.latent_formats.Wan21()
    extra_conds = []
-    control = ControlNet(control_model, compression_ratio=1, latent_format=latent_format, concat_mask=concat_mask, load_device=load_device, manual_cast_dtype=manual_cast_dtype, extra_conds=extra_conds)
+    control = ControlNet(control_model, compression_ratio=1, latent_format=latent_format, load_device=load_device, manual_cast_dtype=manual_cast_dtype, extra_conds=extra_conds)
    return control

 def convert_mistoline(sd):
--- a/comfy/image_encoders/dino2.py
+++ b/comfy/image_encoders/dino2.py
@@ -31,20 +31,6 @@ class LayerScale(torch.nn.Module):
    def forward(self, x):
        return x * comfy.model_management.cast_to_device(self.lambda1, x.device, x.dtype)

-class Dinov2MLP(torch.nn.Module):
-    def __init__(self, hidden_size: int, dtype, device, operations):
-        super().__init__()
-
-        mlp_ratio = 4
-        hidden_features = int(hidden_size * mlp_ratio)
-        self.fc1 = operations.Linear(hidden_size, hidden_features, bias = True, device=device, dtype=dtype)
-        self.fc2 = operations.Linear(hidden_features, hidden_size, bias = True, device=device, dtype=dtype)
-
-    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
-        hidden_state = self.fc1(hidden_state)
-        hidden_state = torch.nn.functional.gelu(hidden_state)
-        hidden_state = self.fc2(hidden_state)
-        return hidden_state

 class SwiGLUFFN(torch.nn.Module):
    def __init__(self, dim, dtype, device, operations):
@@ -64,15 +50,12 @@ class SwiGLUFFN(torch.nn.Module):


 class Dino2Block(torch.nn.Module):
-    def __init__(self, dim, num_heads, layer_norm_eps, dtype, device, operations, use_swiglu_ffn):
+    def __init__(self, dim, num_heads, layer_norm_eps, dtype, device, operations):
        super().__init__()
        self.attention = Dino2AttentionBlock(dim, num_heads, layer_norm_eps, dtype, device, operations)
        self.layer_scale1 = LayerScale(dim, dtype, device, operations)
        self.layer_scale2 = LayerScale(dim, dtype, device, operations)
-        if use_swiglu_ffn:
-            self.mlp = SwiGLUFFN(dim, dtype, device, operations)
-        else:
-            self.mlp = Dinov2MLP(dim, dtype, device, operations)
+        self.mlp = SwiGLUFFN(dim, dtype, device, operations)
        self.norm1 = operations.LayerNorm(dim, eps=layer_norm_eps, dtype=dtype, device=device)
        self.norm2 = operations.LayerNorm(dim, eps=layer_norm_eps, dtype=dtype, device=device)

@@ -83,10 +66,9 @@ class Dino2Block(torch.nn.Module):


 class Dino2Encoder(torch.nn.Module):
-    def __init__(self, dim, num_heads, layer_norm_eps, num_layers, dtype, device, operations, use_swiglu_ffn):
+    def __init__(self, dim, num_heads, layer_norm_eps, num_layers, dtype, device, operations):
        super().__init__()
-        self.layer = torch.nn.ModuleList([Dino2Block(dim, num_heads, layer_norm_eps, dtype, device, operations, use_swiglu_ffn = use_swiglu_ffn)
-                                          for _ in range(num_layers)])
+        self.layer = torch.nn.ModuleList([Dino2Block(dim, num_heads, layer_norm_eps, dtype, device, operations) for _ in range(num_layers)])

    def forward(self, x, intermediate_output=None):
        optimized_attention = optimized_attention_for_device(x.device, False, small_input=True)
@@ -96,8 +78,8 @@ class Dino2Encoder(torch.nn.Module):
                intermediate_output = len(self.layer) + intermediate_output

        intermediate = None
-        for i, layer in enumerate(self.layer):
-            x = layer(x, optimized_attention)
+        for i, l in enumerate(self.layer):
+            x = l(x, optimized_attention)
            if i == intermediate_output:
                intermediate = x.clone()
        return x, intermediate
@@ -146,10 +128,9 @@ class Dinov2Model(torch.nn.Module):
        dim = config_dict["hidden_size"]
        heads = config_dict["num_attention_heads"]
        layer_norm_eps = config_dict["layer_norm_eps"]
-        use_swiglu_ffn = config_dict["use_swiglu_ffn"]

        self.embeddings = Dino2Embeddings(dim, dtype, device, operations)
-        self.encoder = Dino2Encoder(dim, heads, layer_norm_eps, num_layers, dtype, device, operations, use_swiglu_ffn = use_swiglu_ffn)
+        self.encoder = Dino2Encoder(dim, heads, layer_norm_eps, num_layers, dtype, device, operations)
        self.layernorm = operations.LayerNorm(dim, eps=layer_norm_eps, dtype=dtype, device=device)

    def forward(self, pixel_values, attention_mask=None, intermediate_output=None):
--- a/comfy/image_encoders/dino2_large.json
+++ b/comfy/image_encoders/dino2_large.json
@@ -1,22 +0,0 @@
-{
-  "hidden_size": 1024,
-  "use_mask_token": true,
-  "patch_size": 14,
-  "image_size": 518,
-  "num_channels": 3,
-  "num_attention_heads": 16,
-  "initializer_range": 0.02,
-  "attention_probs_dropout_prob": 0.0,
-  "hidden_dropout_prob": 0.0,
-  "hidden_act": "gelu",
-  "mlp_ratio": 4,
-  "model_type": "dinov2",
-  "num_hidden_layers": 24,
-  "layer_norm_eps": 1e-6,
-  "qkv_bias": true,
-  "use_swiglu_ffn": false,
-  "layerscale_value": 1.0,
-  "drop_path_rate": 0.0,
-  "image_mean": [0.485, 0.456, 0.406],
-  "image_std": [0.229, 0.224, 0.225]
-}
--- a/comfy/k_diffusion/sampling.py
+++ b/comfy/k_diffusion/sampling.py
@@ -86,24 +86,24 @@ class BatchedBrownianTree:
    """A wrapper around torchsde.BrownianTree that enables batches of entropy."""

    def __init__(self, x, t0, t1, seed=None, **kwargs):
-        self.cpu_tree = kwargs.pop("cpu", True)
+        self.cpu_tree = True
+        if "cpu" in kwargs:
+            self.cpu_tree = kwargs.pop("cpu")
        t0, t1, self.sign = self.sort(t0, t1)
-        w0 = kwargs.pop('w0', None)
-        if w0 is None:
-            w0 = torch.zeros_like(x)
-        self.batched = False
+        w0 = kwargs.pop('w0', torch.zeros_like(x))
        if seed is None:
-            seed = (torch.randint(0, 2 ** 63 - 1, ()).item(),)
-        elif isinstance(seed, (tuple, list)):
-            if len(seed) != x.shape[0]:
-                raise ValueError("Passing a list or tuple of seeds to BatchedBrownianTree requires a length matching the batch size.")
-            self.batched = True
+            seed = torch.randint(0, 2 ** 63 - 1, []).item()
+        self.batched = True
+        try:
+            assert len(seed) == x.shape[0]
            w0 = w0[0]
-        else:
-            seed = (seed,)
+        except TypeError:
+            seed = [seed]
+            self.batched = False
        if self.cpu_tree:
-            t0, w0, t1 = t0.detach().cpu(), w0.detach().cpu(), t1.detach().cpu()
-        self.trees = tuple(torchsde.BrownianTree(t0, w0, t1, entropy=s, **kwargs) for s in seed)
+            self.trees = [torchsde.BrownianTree(t0.cpu(), w0.cpu(), t1.cpu(), entropy=s, **kwargs) for s in seed]
+        else:
+            self.trees = [torchsde.BrownianTree(t0, w0, t1, entropy=s, **kwargs) for s in seed]

    @staticmethod
    def sort(a, b):
@@ -111,10 +111,11 @@ class BatchedBrownianTree:

    def __call__(self, t0, t1):
        t0, t1, sign = self.sort(t0, t1)
-        device, dtype = t0.device, t0.dtype
        if self.cpu_tree:
-            t0, t1 = t0.detach().cpu().float(), t1.detach().cpu().float()
-        w = torch.stack([tree(t0, t1) for tree in self.trees]).to(device=device, dtype=dtype) * (self.sign * sign)
+            w = torch.stack([tree(t0.cpu().float(), t1.cpu().float()).to(t0.dtype).to(t0.device) for tree in self.trees]) * (self.sign * sign)
+        else:
+            w = torch.stack([tree(t0, t1) for tree in self.trees]) * (self.sign * sign)
+
        return w if self.batched else w[0]


@@ -170,16 +171,6 @@ def offset_first_sigma_for_snr(sigmas, model_sampling, percent_offset=1e-4):
    return sigmas


-def ei_h_phi_1(h: torch.Tensor) -> torch.Tensor:
-    """Compute the result of h*phi_1(h) in exponential integrator methods."""
-    return torch.expm1(h)
-
-
-def ei_h_phi_2(h: torch.Tensor) -> torch.Tensor:
-    """Compute the result of h*phi_2(h) in exponential integrator methods."""
-    return (torch.expm1(h) - h) / h
-
-
@torch.no_grad()
def sample_euler(model, x, sigmas, extra_args=None, callback=None, disable=None, s_churn=0., s_tmin=0., s_tmax=float('inf'), s_noise=1.):
    """Implements Algorithm 2 (Euler steps) from Karras et al. (2022)."""
@@ -1559,12 +1550,13 @@ def sample_er_sde(model, x, sigmas, extra_args=None, callback=None, disable=None
@torch.no_grad()
def sample_seeds_2(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, r=0.5):
    """SEEDS-2 - Stochastic Explicit Exponential Derivative-free Solvers (VP Data Prediction) stage 2.
-    arXiv: https://arxiv.org/abs/2305.14267 (NeurIPS 2023)
+    arXiv: https://arxiv.org/abs/2305.14267
    """
    extra_args = {} if extra_args is None else extra_args
    seed = extra_args.get("seed", None)
    noise_sampler = default_noise_sampler(x, seed=seed) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])
+
    inject_noise = eta > 0 and s_noise > 0

    model_sampling = model.inner_model.model_patcher.get_model_object('model_sampling')
@@ -1572,53 +1564,55 @@ def sample_seeds_2(model, x, sigmas, extra_args=None, callback=None, disable=Non
    lambda_fn = partial(sigma_to_half_log_snr, model_sampling=model_sampling)
    sigmas = offset_first_sigma_for_snr(sigmas, model_sampling)

-    fac = 1 / (2 * r)
-
    for i in trange(len(sigmas) - 1, disable=disable):
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
-
        if sigmas[i + 1] == 0:
            x = denoised
-            continue
+        else:
+            lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
+            h = lambda_t - lambda_s
+            h_eta = h * (eta + 1)
+            lambda_s_1 = lambda_s + r * h
+            fac = 1 / (2 * r)
+            sigma_s_1 = sigma_fn(lambda_s_1)

-        lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
-        h = lambda_t - lambda_s
-        h_eta = h * (eta + 1)
-        lambda_s_1 = torch.lerp(lambda_s, lambda_t, r)
-        sigma_s_1 = sigma_fn(lambda_s_1)
+            # alpha_t = sigma_t * exp(log(alpha_t / sigma_t)) = sigma_t * exp(lambda_t)
+            alpha_s_1 = sigma_s_1 * lambda_s_1.exp()
+            alpha_t = sigmas[i + 1] * lambda_t.exp()

-        alpha_s_1 = sigma_s_1 * lambda_s_1.exp()
-        alpha_t = sigmas[i + 1] * lambda_t.exp()
+            coeff_1, coeff_2 = (-r * h_eta).expm1(), (-h_eta).expm1()
+            if inject_noise:
+                # 0 < r < 1
+                noise_coeff_1 = (-2 * r * h * eta).expm1().neg().sqrt()
+                noise_coeff_2 = (-r * h * eta).exp() * (-2 * (1 - r) * h * eta).expm1().neg().sqrt()
+                noise_1, noise_2 = noise_sampler(sigmas[i], sigma_s_1), noise_sampler(sigma_s_1, sigmas[i + 1])

-        # Step 1
-        x_2 = sigma_s_1 / sigmas[i] * (-r * h * eta).exp() * x - alpha_s_1 * ei_h_phi_1(-r * h_eta) * denoised
-        if inject_noise:
-            sde_noise = (-2 * r * h * eta).expm1().neg().sqrt() * noise_sampler(sigmas[i], sigma_s_1)
-            x_2 = x_2 + sde_noise * sigma_s_1 * s_noise
-        denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)
+            # Step 1
+            x_2 = sigma_s_1 / sigmas[i] * (-r * h * eta).exp() * x - alpha_s_1 * coeff_1 * denoised
+            if inject_noise:
+                x_2 = x_2 + sigma_s_1 * (noise_coeff_1 * noise_1) * s_noise
+            denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)

-        # Step 2
-        denoised_d = torch.lerp(denoised, denoised_2, fac)
-        x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x - alpha_t * ei_h_phi_1(-h_eta) * denoised_d
-        if inject_noise:
-            segment_factor = (r - 1) * h * eta
-            sde_noise = sde_noise * segment_factor.exp()
-            sde_noise = sde_noise + segment_factor.mul(2).expm1().neg().sqrt() * noise_sampler(sigma_s_1, sigmas[i + 1])
-            x = x + sde_noise * sigmas[i + 1] * s_noise
+            # Step 2
+            denoised_d = (1 - fac) * denoised + fac * denoised_2
+            x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x - alpha_t * coeff_2 * denoised_d
+            if inject_noise:
+                x = x + sigmas[i + 1] * (noise_coeff_2 * noise_1 + noise_coeff_1 * noise_2) * s_noise
    return x


@torch.no_grad()
def sample_seeds_3(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, r_1=1./3, r_2=2./3):
    """SEEDS-3 - Stochastic Explicit Exponential Derivative-free Solvers (VP Data Prediction) stage 3.
-    arXiv: https://arxiv.org/abs/2305.14267 (NeurIPS 2023)
+    arXiv: https://arxiv.org/abs/2305.14267
    """
    extra_args = {} if extra_args is None else extra_args
    seed = extra_args.get("seed", None)
    noise_sampler = default_noise_sampler(x, seed=seed) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])
+
    inject_noise = eta > 0 and s_noise > 0

    model_sampling = model.inner_model.model_patcher.get_model_object('model_sampling')
@@ -1630,49 +1624,45 @@ def sample_seeds_3(model, x, sigmas, extra_args=None, callback=None, disable=Non
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
-
        if sigmas[i + 1] == 0:
            x = denoised
-            continue
+        else:
+            lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
+            h = lambda_t - lambda_s
+            h_eta = h * (eta + 1)
+            lambda_s_1 = lambda_s + r_1 * h
+            lambda_s_2 = lambda_s + r_2 * h
+            sigma_s_1, sigma_s_2 = sigma_fn(lambda_s_1), sigma_fn(lambda_s_2)

-        lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
-        h = lambda_t - lambda_s
-        h_eta = h * (eta + 1)
-        lambda_s_1 = torch.lerp(lambda_s, lambda_t, r_1)
-        lambda_s_2 = torch.lerp(lambda_s, lambda_t, r_2)
-        sigma_s_1, sigma_s_2 = sigma_fn(lambda_s_1), sigma_fn(lambda_s_2)
+            # alpha_t = sigma_t * exp(log(alpha_t / sigma_t)) = sigma_t * exp(lambda_t)
+            alpha_s_1 = sigma_s_1 * lambda_s_1.exp()
+            alpha_s_2 = sigma_s_2 * lambda_s_2.exp()
+            alpha_t = sigmas[i + 1] * lambda_t.exp()

-        alpha_s_1 = sigma_s_1 * lambda_s_1.exp()
-        alpha_s_2 = sigma_s_2 * lambda_s_2.exp()
-        alpha_t = sigmas[i + 1] * lambda_t.exp()
+            coeff_1, coeff_2, coeff_3 = (-r_1 * h_eta).expm1(), (-r_2 * h_eta).expm1(), (-h_eta).expm1()
+            if inject_noise:
+                # 0 < r_1 < r_2 < 1
+                noise_coeff_1 = (-2 * r_1 * h * eta).expm1().neg().sqrt()
+                noise_coeff_2 = (-r_1 * h * eta).exp() * (-2 * (r_2 - r_1) * h * eta).expm1().neg().sqrt()
+                noise_coeff_3 = (-r_2 * h * eta).exp() * (-2 * (1 - r_2) * h * eta).expm1().neg().sqrt()
+                noise_1, noise_2, noise_3 = noise_sampler(sigmas[i], sigma_s_1), noise_sampler(sigma_s_1, sigma_s_2), noise_sampler(sigma_s_2, sigmas[i + 1])

-        # Step 1
-        x_2 = sigma_s_1 / sigmas[i] * (-r_1 * h * eta).exp() * x - alpha_s_1 * ei_h_phi_1(-r_1 * h_eta) * denoised
-        if inject_noise:
-            sde_noise = (-2 * r_1 * h * eta).expm1().neg().sqrt() * noise_sampler(sigmas[i], sigma_s_1)
-            x_2 = x_2 + sde_noise * sigma_s_1 * s_noise
-        denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)
+            # Step 1
+            x_2 = sigma_s_1 / sigmas[i] * (-r_1 * h * eta).exp() * x - alpha_s_1 * coeff_1 * denoised
+            if inject_noise:
+                x_2 = x_2 + sigma_s_1 * (noise_coeff_1 * noise_1) * s_noise
+            denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)

-        # Step 2
-        a3_2 = r_2 / r_1 * ei_h_phi_2(-r_2 * h_eta)
-        a3_1 = ei_h_phi_1(-r_2 * h_eta) - a3_2
-        x_3 = sigma_s_2 / sigmas[i] * (-r_2 * h * eta).exp() * x - alpha_s_2 * (a3_1 * denoised + a3_2 * denoised_2)
-        if inject_noise:
-            segment_factor = (r_1 - r_2) * h * eta
-            sde_noise = sde_noise * segment_factor.exp()
-            sde_noise = sde_noise + segment_factor.mul(2).expm1().neg().sqrt() * noise_sampler(sigma_s_1, sigma_s_2)
-            x_3 = x_3 + sde_noise * sigma_s_2 * s_noise
-        denoised_3 = model(x_3, sigma_s_2 * s_in, **extra_args)
+            # Step 2
+            x_3 = sigma_s_2 / sigmas[i] * (-r_2 * h * eta).exp() * x - alpha_s_2 * coeff_2 * denoised + (r_2 / r_1) * alpha_s_2 * (coeff_2 / (r_2 * h_eta) + 1) * (denoised_2 - denoised)
+            if inject_noise:
+                x_3 = x_3 + sigma_s_2 * (noise_coeff_2 * noise_1 + noise_coeff_1 * noise_2) * s_noise
+            denoised_3 = model(x_3, sigma_s_2 * s_in, **extra_args)

-        # Step 3
-        b3 = ei_h_phi_2(-h_eta) / r_2
-        b1 = ei_h_phi_1(-h_eta) - b3
-        x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x - alpha_t * (b1 * denoised + b3 * denoised_3)
-        if inject_noise:
-            segment_factor = (r_2 - 1) * h * eta
-            sde_noise = sde_noise * segment_factor.exp()
-            sde_noise = sde_noise + segment_factor.mul(2).expm1().neg().sqrt() * noise_sampler(sigma_s_2, sigmas[i + 1])
-            x = x + sde_noise * sigmas[i + 1] * s_noise
+            # Step 3
+            x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x - alpha_t * coeff_3 * denoised + (1. / r_2) * alpha_t * (coeff_3 / h_eta + 1) * (denoised_3 - denoised)
+            if inject_noise:
+                x = x + sigmas[i + 1] * (noise_coeff_3 * noise_1 + noise_coeff_2 * noise_2 + noise_coeff_1 * noise_3) * s_noise
    return x


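Review note on the SEEDS hunks above: the two formulations can be compared with scalars. The sketch below is not part of the patch. It uses plain floats instead of tensors, and the helper names other than `ei_h_phi_1` and `ei_h_phi_2` are hypothetical. It suggests that (a) the deterministic part of SEEDS-3 "Step 3" is algebraically the same on both sides of the diff, and (b) the total noise variance of the final SEEDS-2 update is 1 - exp(-2 * h * eta) on both sides, while the individual weights on `noise_1` and `noise_2` match only at r = 0.5.

```python
import math


def ei_h_phi_1(h):
    """h * phi_1(h), as in the helper on the '-' side of the diff."""
    return math.expm1(h)


def ei_h_phi_2(h):
    """h * phi_2(h), as in the helper on the '-' side of the diff."""
    return (math.expm1(h) - h) / h


def step3_drift_minus_side(h_eta, r_2, alpha_t, denoised, denoised_3):
    # Scalar restatement of the '-' lines of SEEDS-3 "Step 3" (without the x term).
    b3 = ei_h_phi_2(-h_eta) / r_2
    b1 = ei_h_phi_1(-h_eta) - b3
    return -alpha_t * (b1 * denoised + b3 * denoised_3)


def step3_drift_plus_side(h_eta, r_2, alpha_t, denoised, denoised_3):
    # Scalar restatement of the '+' lines of SEEDS-3 "Step 3" (without the x term).
    coeff_3 = math.expm1(-h_eta)
    return (-alpha_t * coeff_3 * denoised
            + (1.0 / r_2) * alpha_t * (coeff_3 / h_eta + 1) * (denoised_3 - denoised))


def seeds2_noise_weights_minus_side(r, h, eta):
    # Weights applied to (noise_1, noise_2) in the final update on the '-' side.
    segment_factor = (r - 1) * h * eta
    w1 = math.sqrt(-math.expm1(-2 * r * h * eta)) * math.exp(segment_factor)
    w2 = math.sqrt(-math.expm1(2 * segment_factor))
    return w1, w2


def seeds2_noise_weights_plus_side(r, h, eta):
    # Weights applied to (noise_1, noise_2) in the final update on the '+' side.
    noise_coeff_1 = math.sqrt(-math.expm1(-2 * r * h * eta))
    noise_coeff_2 = math.exp(-r * h * eta) * math.sqrt(-math.expm1(-2 * (1 - r) * h * eta))
    return noise_coeff_2, noise_coeff_1


def total_variance(weights):
    # Variance of a weighted sum of independent unit-variance noises.
    return sum(w * w for w in weights)


if __name__ == "__main__":
    h, eta = 0.4, 1.0
    print(step3_drift_minus_side(h * (eta + 1), 2 / 3, 0.8, 0.25, 0.3))
    print(step3_drift_plus_side(h * (eta + 1), 2 / 3, 0.8, 0.25, 0.3))
    for r in (0.3, 0.5):
        print(r, seeds2_noise_weights_minus_side(r, h, eta), seeds2_noise_weights_plus_side(r, h, eta))
```

The check assumes h > 0 and 0 < r < 1, and treats `noise_1` and `noise_2` as independent unit-variance draws, which is an assumption about the noise sampler rather than something this diff states.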
--- a/comfy/latent_formats.py
+++ b/comfy/latent_formats.py
@@ -533,94 +533,11 @@ class Wan22(Wan21):
                0.3971, 1.0600, 0.3943, 0.5537, 0.5444, 0.4089, 0.7468, 0.7744
            ]).view(1, self.latent_channels, 1, 1, 1)

-class HunyuanImage21(LatentFormat):
-    latent_channels = 64
-    latent_dimensions = 2
-    scale_factor = 0.75289
-
-    latent_rgb_factors = [
-        [-0.0154, -0.0397, -0.0521],
-        [ 0.0005,  0.0093,  0.0006],
-        [-0.0805, -0.0773, -0.0586],
-        [-0.0494, -0.0487, -0.0498],
-        [-0.0212, -0.0076, -0.0261],
-        [-0.0179, -0.0417, -0.0505],
-        [ 0.0158,  0.0310,  0.0239],
-        [ 0.0409,  0.0516,  0.0201],
-        [ 0.0350,  0.0553,  0.0036],
-        [-0.0447, -0.0327, -0.0479],
-        [-0.0038, -0.0221, -0.0365],
-        [-0.0423, -0.0718, -0.0654],
-        [ 0.0039,  0.0368,  0.0104],
-        [ 0.0655,  0.0217,  0.0122],
-        [ 0.0490,  0.1638,  0.2053],
-        [ 0.0932,  0.0829,  0.0650],
-        [-0.0186, -0.0209, -0.0135],
-        [-0.0080, -0.0076, -0.0148],
-        [-0.0284, -0.0201,  0.0011],
-        [-0.0642, -0.0294, -0.0777],
-        [-0.0035,  0.0076, -0.0140],
-        [ 0.0519,  0.0731,  0.0887],
-        [-0.0102,  0.0095,  0.0704],
-        [ 0.0068,  0.0218, -0.0023],
-        [-0.0726, -0.0486, -0.0519],
-        [ 0.0260,  0.0295,  0.0263],
-        [ 0.0250,  0.0333,  0.0341],
-        [ 0.0168, -0.0120, -0.0174],
-        [ 0.0226,  0.1037,  0.0114],
-        [ 0.2577,  0.1906,  0.1604],
-        [-0.0646, -0.0137, -0.0018],
-        [-0.0112,  0.0309,  0.0358],
-        [-0.0347,  0.0146, -0.0481],
-        [ 0.0234,  0.0179,  0.0201],
-        [ 0.0157,  0.0313,  0.0225],
-        [ 0.0423,  0.0675,  0.0524],
-        [-0.0031,  0.0027, -0.0255],
-        [ 0.0447,  0.0555,  0.0330],
-        [-0.0152,  0.0103,  0.0299],
-        [-0.0755, -0.0489, -0.0635],
-        [ 0.0853,  0.0788,  0.1017],
-        [-0.0272, -0.0294, -0.0471],
-        [ 0.0440,  0.0400, -0.0137],
-        [ 0.0335,  0.0317, -0.0036],
-        [-0.0344, -0.0621, -0.0984],
-        [-0.0127, -0.0630, -0.0620],
-        [-0.0648,  0.0360,  0.0924],
-        [-0.0781, -0.0801, -0.0409],
-        [ 0.0363,  0.0613,  0.0499],
-        [ 0.0238,  0.0034,  0.0041],
-        [-0.0135,  0.0258,  0.0310],
-        [ 0.0614,  0.1086,  0.0589],
-        [ 0.0428,  0.0350,  0.0205],
-        [ 0.0153,  0.0173, -0.0018],
-        [-0.0288, -0.0455, -0.0091],
-        [ 0.0344,  0.0109, -0.0157],
-        [-0.0205, -0.0247, -0.0187],
-        [ 0.0487,  0.0126,  0.0064],
-        [-0.0220, -0.0013,  0.0074],
-        [-0.0203, -0.0094, -0.0048],
-        [-0.0719,  0.0429, -0.0442],
-        [ 0.1042,  0.0497,  0.0356],
-        [-0.0659, -0.0578, -0.0280],
-        [-0.0060, -0.0322, -0.0234]]
-
-    latent_rgb_factors_bias = [0.0007, -0.0256, -0.0206]
-
-class HunyuanImage21Refiner(LatentFormat):
-    latent_channels = 64
-    latent_dimensions = 3
-    scale_factor = 1.03682
-
 class Hunyuan3Dv2(LatentFormat):
    latent_channels = 64
    latent_dimensions = 1
    scale_factor = 0.9990943042622529

-class Hunyuan3Dv2_1(LatentFormat):
-    scale_factor = 1.0039506158752403
-    latent_channels = 64
-    latent_dimensions = 1
-
 class Hunyuan3Dv2mini(LatentFormat):
    latent_channels = 64
    latent_dimensions = 1
@@ -629,20 +546,3 @@ class Hunyuan3Dv2mini(LatentFormat):
class ACEAudio(LatentFormat):
    latent_channels = 8
    latent_dimensions = 2
-
-class ChromaRadiance(LatentFormat):
-    latent_channels = 3
-
-    def __init__(self):
-        self.latent_rgb_factors = [
-            # R    G    B
-            [ 1.0, 0.0, 0.0 ],
-            [ 0.0, 1.0, 0.0 ],
-            [ 0.0, 0.0, 1.0 ]
-        ]
-
-    def process_in(self, latent):
-        return latent
-
-    def process_out(self, latent):
-        return latent
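Review note on the `latent_rgb_factors` tables removed above: a minimal sketch of what such a table appears to encode, assuming it is applied as a per-pixel linear map from latent channels to RGB (one row per latent channel, one column per colour) with `latent_rgb_factors_bias` added afterwards. The function name is hypothetical and the code uses plain lists rather than tensors.

```python
def latent_pixel_to_rgb(latent_pixel, rgb_factors, bias=(0.0, 0.0, 0.0)):
    """Project one latent pixel (a list of channel values) to an [R, G, B] triple."""
    if len(latent_pixel) != len(rgb_factors):
        raise ValueError("need one row of RGB factors per latent channel")
    rgb = list(bias)
    for value, row in zip(latent_pixel, rgb_factors):
        for k in range(3):
            rgb[k] += value * row[k]
    return rgb


# The identity table from the removed ChromaRadiance format: the three latent
# channels are already R, G and B, so the projection leaves them unchanged.
IDENTITY_FACTORS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    print(latent_pixel_to_rgb([0.2, 0.5, 0.9], IDENTITY_FACTORS))
```

Under that assumption, a 64-channel format such as the removed `HunyuanImage21` needs 64 rows, which matches the length of the table in the hunk.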
--- a/comfy/ldm/ace/vae/music_dcae_pipeline.py
+++ b/comfy/ldm/ace/vae/music_dcae_pipeline.py
@@ -23,6 +23,8 @@ class MusicDCAE(torch.nn.Module):
        else:
            self.source_sample_rate = source_sample_rate

+        # self.resampler = torchaudio.transforms.Resample(source_sample_rate, 44100)
+
        self.transform = transforms.Compose([
            transforms.Normalize(0.5, 0.5),
        ])
@@ -35,6 +37,10 @@ class MusicDCAE(torch.nn.Module):
        self.scale_factor = 0.1786
        self.shift_factor = -1.9091

+    def load_audio(self, audio_path):
+        audio, sr = torchaudio.load(audio_path)
+        return audio, sr
+
    def forward_mel(self, audios):
        mels = []
        for i in range(len(audios)):
@@ -67,8 +73,10 @@ class MusicDCAE(torch.nn.Module):
            latent = self.dcae.encoder(mel.unsqueeze(0))
            latents.append(latent)
        latents = torch.cat(latents, dim=0)
+        # latent_lengths = (audio_lengths / sr * 44100 / 512 / self.time_dimention_multiple).long()
        latents = (latents - self.shift_factor) * self.scale_factor
        return latents
+        # return latents, latent_lengths

    @torch.no_grad()
    def decode(self, latents, audio_lengths=None, sr=None):
@@ -83,7 +91,9 @@ class MusicDCAE(torch.nn.Module):
            wav = self.vocoder.decode(mels[0]).squeeze(1)

            if sr is not None:
+                # resampler = torchaudio.transforms.Resample(44100, sr).to(latents.device).to(latents.dtype)
                wav = torchaudio.functional.resample(wav, 44100, sr)
+                # wav = resampler(wav)
            else:
                sr = 44100
            pred_wavs.append(wav)
@@ -91,6 +101,7 @@ class MusicDCAE(torch.nn.Module):
        if audio_lengths is not None:
            pred_wavs = [wav[:, :length].cpu() for wav, length in zip(pred_wavs, audio_lengths)]
        return torch.stack(pred_wavs)
+        # return sr, pred_wavs

    def forward(self, audios, audio_lengths=None, sr=None):
        latents, latent_lengths = self.encode(audios=audios, audio_lengths=audio_lengths, sr=sr)
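Review note on the `encode` context above: the latents are normalized with `(latents - self.shift_factor) * self.scale_factor`. The sketch below shows that affine map on a single float together with its algebraic inverse. The constants are copied from the context lines; the function names are hypothetical, and the assumption that `decode` undoes the map in exactly this way is not confirmed by this diff.

```python
SCALE_FACTOR = 0.1786   # self.scale_factor in the context lines
SHIFT_FACTOR = -1.9091  # self.shift_factor in the context lines


def normalize_latent(value):
    """Same affine map as the encode path, applied to one float."""
    return (value - SHIFT_FACTOR) * SCALE_FACTOR


def denormalize_latent(value):
    """Algebraic inverse of normalize_latent."""
    return value / SCALE_FACTOR + SHIFT_FACTOR


if __name__ == "__main__":
    raw = 0.75
    print(normalize_latent(raw), denormalize_latent(normalize_latent(raw)))
```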
--- a/comfy/ldm/audio/dit.py
+++ b/comfy/ldm/audio/dit.py
@@ -635,7 +635,7 @@ class ContinuousTransformer(nn.Module):
        # Attention layers

        if self.rotary_pos_emb is not None:
-            rotary_pos_emb = self.rotary_pos_emb.forward_from_seq_len(x.shape[1], dtype=torch.float, device=x.device)
+            rotary_pos_emb = self.rotary_pos_emb.forward_from_seq_len(x.shape[1], dtype=x.dtype, device=x.device)
        else:
            rotary_pos_emb = None

--- a/comfy/ldm/chroma/model.py
+++ b/comfy/ldm/chroma/model.py
@@ -151,6 +151,8 @@ class Chroma(nn.Module):
        attn_mask: Tensor = None,
    ) -> Tensor:
        patches_replace = transformer_options.get("patches_replace", {})
+        if img.ndim != 3 or txt.ndim != 3:
+            raise ValueError("Input img and txt tensors must have 3 dimensions.")

        # running on sequences img
        img = self.img_in(img)
@@ -252,9 +254,8 @@ class Chroma(nn.Module):
                            img[:, txt.shape[1] :, ...] += add

        img = img[:, txt.shape[1] :, ...]
-        if hasattr(self, "final_layer"):
-            final_mod = self.get_modulations(mod_vectors, "final")
-            img = self.final_layer(img, vec=final_mod)  # (N, T, patch_size ** 2 * out_channels)
+        final_mod = self.get_modulations(mod_vectors, "final")
+        img = self.final_layer(img, vec=final_mod)  # (N, T, patch_size ** 2 * out_channels)
        return img

    def forward(self, x, timestep, context, guidance, control=None, transformer_options={}, **kwargs):
@@ -270,9 +271,6 @@ class Chroma(nn.Module):

        img = rearrange(x, "b c (h ph) (w pw) -> b (h w) (c ph pw)", ph=self.patch_size, pw=self.patch_size)

-        if img.ndim != 3 or context.ndim != 3:
-            raise ValueError("Input img and txt tensors must have 3 dimensions.")
-
        h_len = ((h + (self.patch_size // 2)) // self.patch_size)
        w_len = ((w + (self.patch_size // 2)) // self.patch_size)
        img_ids = torch.zeros((h_len, w_len, 3), device=x.device, dtype=x.dtype)
--- a/comfy/ldm/chroma_radiance/layers.py
+++ b/comfy/ldm/chroma_radiance/layers.py
@@ -1,206 +0,0 @@
-# Adapted from https://github.com/lodestone-rock/flow
-from functools import lru_cache
-
-import torch
-from torch import nn
-
-from comfy.ldm.flux.layers import RMSNorm
-
-
-class NerfEmbedder(nn.Module):
-    """
-    An embedder module that combines input features with a 2D positional
-    encoding that mimics the Discrete Cosine Transform (DCT).
-
-    This module takes an input tensor of shape (B, P^2, C), where P is the
-    patch size, and enriches it with positional information before projecting
-    it to a new hidden size.
-    """
-    def __init__(
-        self,
-        in_channels: int,
-        hidden_size_input: int,
-        max_freqs: int,
-        dtype=None,
-        device=None,
-        operations=None,
-    ):
-        """
-        Initializes the NerfEmbedder.
-
-        Args:
-            in_channels (int): The number of channels in the input tensor.
-            hidden_size_input (int): The desired dimension of the output embedding.
-            max_freqs (int): The number of frequency components to use for both
-                             the x and y dimensions of the positional encoding.
-                             The total number of positional features will be max_freqs^2.
-        """
-        super().__init__()
-        self.dtype = dtype
-        self.max_freqs = max_freqs
-        self.hidden_size_input = hidden_size_input
-
-        # A linear layer to project the concatenated input features and
-        # positional encodings to the final output dimension.
-        self.embedder = nn.Sequential(
-            operations.Linear(in_channels + max_freqs**2, hidden_size_input, dtype=dtype, device=device)
-        )
-
-    @lru_cache(maxsize=4)
-    def fetch_pos(self, patch_size: int, device: torch.device, dtype: torch.dtype) -> torch.Tensor:
-        """
-        Generates and caches 2D DCT-like positional embeddings for a given patch size.
-
-        The LRU cache is a performance optimization that avoids recomputing the
-        same positional grid on every forward pass.
-
-        Args:
-            patch_size (int): The side length of the square input patch.
-            device: The torch device to create the tensors on.
-            dtype: The torch dtype for the tensors.
-
-        Returns:
-            A tensor of shape (1, patch_size^2, max_freqs^2) containing the
-            positional embeddings.
-        """
-        # Create normalized 1D coordinate grids from 0 to 1.
-        pos_x = torch.linspace(0, 1, patch_size, device=device, dtype=dtype)
-        pos_y = torch.linspace(0, 1, patch_size, device=device, dtype=dtype)
-
-        # Create a 2D meshgrid of coordinates.
-        pos_y, pos_x = torch.meshgrid(pos_y, pos_x, indexing="ij")
-
-        # Reshape positions to be broadcastable with frequencies.
-        # Shape becomes (patch_size^2, 1, 1).
-        pos_x = pos_x.reshape(-1, 1, 1)
-        pos_y = pos_y.reshape(-1, 1, 1)
-
-        # Create a 1D tensor of frequency values from 0 to max_freqs-1.
-        freqs = torch.linspace(0, self.max_freqs - 1, self.max_freqs, dtype=dtype, device=device)
-
-        # Reshape frequencies to be broadcastable for creating 2D basis functions.
-        # freqs_x shape: (1, max_freqs, 1)
-        # freqs_y shape: (1, 1, max_freqs)
-        freqs_x = freqs[None, :, None]
-        freqs_y = freqs[None, None, :]
-
-        # A custom weighting coefficient, not part of standard DCT.
-        # This seems to down-weight the contribution of higher-frequency interactions.
-        coeffs = (1 + freqs_x * freqs_y) ** -1
-
-        # Calculate the 1D cosine basis functions for x and y coordinates.
-        # This is the core of the DCT formulation.
-        dct_x = torch.cos(pos_x * freqs_x * torch.pi)
-        dct_y = torch.cos(pos_y * freqs_y * torch.pi)
-
-        # Combine the 1D basis functions to create 2D basis functions by element-wise
-        # multiplication, and apply the custom coefficients. Broadcasting handles the
-        # combination of all (pos_x, freqs_x) with all (pos_y, freqs_y).
-        # The result is flattened into a feature vector for each position.
-        dct = (dct_x * dct_y * coeffs).view(1, -1, self.max_freqs ** 2)
-
-        return dct
-
-    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
-        """
-        Forward pass for the embedder.
-
-        Args:
-            inputs (Tensor): The input tensor of shape (B, P^2, C).
-
-        Returns:
-            Tensor: The output tensor of shape (B, P^2, hidden_size_input).
-        """
-        # Get the batch size, number of pixels, and number of channels.
-        B, P2, C = inputs.shape
-
-        # Infer the patch side length from the number of pixels (P^2).
-        patch_size = int(P2 ** 0.5)
-
-        input_dtype = inputs.dtype
-        inputs = inputs.to(dtype=self.dtype)
-
-        # Fetch the pre-computed or cached positional embeddings.
-        dct = self.fetch_pos(patch_size, inputs.device, self.dtype)
-
-        # Repeat the positional embeddings for each item in the batch.
-        dct = dct.repeat(B, 1, 1)
-
-        # Concatenate the original input features with the positional embeddings
-        # along the feature dimension.
-        inputs = torch.cat((inputs, dct), dim=-1)
-
-        # Project the combined tensor to the target hidden size.
-        return self.embedder(inputs).to(dtype=input_dtype)
-
-
-class NerfGLUBlock(nn.Module):
-    """
-    A NerfBlock using a Gated Linear Unit (GLU) like MLP.
-    """
-    def __init__(self, hidden_size_s: int, hidden_size_x: int, mlp_ratio, dtype=None, device=None, operations=None):
-        super().__init__()
-        # The total number of parameters for the MLP is increased to accommodate
-        # the gate, value, and output projection matrices.
-        # We now need to generate parameters for 3 matrices.
-        total_params = 3 * hidden_size_x**2 * mlp_ratio
-        self.param_generator = operations.Linear(hidden_size_s, total_params, dtype=dtype, device=device)
-        self.norm = RMSNorm(hidden_size_x, dtype=dtype, device=device, operations=operations)
-        self.mlp_ratio = mlp_ratio
-
-
-    def forward(self, x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
-        batch_size, num_x, hidden_size_x = x.shape
-        mlp_params = self.param_generator(s)
-
-        # Split the generated parameters into three parts for the gate, value, and output projection.
-        fc1_gate_params, fc1_value_params, fc2_params = mlp_params.chunk(3, dim=-1)
-
-        # Reshape the parameters into matrices for batch matrix multiplication.
-        fc1_gate = fc1_gate_params.view(batch_size, hidden_size_x, hidden_size_x * self.mlp_ratio)
-        fc1_value = fc1_value_params.view(batch_size, hidden_size_x, hidden_size_x * self.mlp_ratio)
-        fc2 = fc2_params.view(batch_size, hidden_size_x * self.mlp_ratio, hidden_size_x)
-
-        # Normalize the generated weight matrices as in the original implementation.
-        fc1_gate = torch.nn.functional.normalize(fc1_gate, dim=-2)
-        fc1_value = torch.nn.functional.normalize(fc1_value, dim=-2)
-        fc2 = torch.nn.functional.normalize(fc2, dim=-2)
-
-        res_x = x
-        x = self.norm(x)
-
-        # Apply the final output projection.
-        x = torch.bmm(torch.nn.functional.silu(torch.bmm(x, fc1_gate)) * torch.bmm(x, fc1_value), fc2)
-
-        return x + res_x
-
-
-class NerfFinalLayer(nn.Module):
-    def __init__(self, hidden_size, out_channels, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.norm = RMSNorm(hidden_size, dtype=dtype, device=device, operations=operations)
-        self.linear = operations.Linear(hidden_size, out_channels, dtype=dtype, device=device)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        # RMSNorm normalizes over the last dimension, but our channel dim (C) is at dim=1.
-        # So we temporarily move the channel dimension to the end for the norm operation.
-        return self.linear(self.norm(x.movedim(1, -1))).movedim(-1, 1)
-
-
-class NerfFinalLayerConv(nn.Module):
-    def __init__(self, hidden_size: int, out_channels: int, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.norm = RMSNorm(hidden_size, dtype=dtype, device=device, operations=operations)
-        self.conv = operations.Conv2d(
-            in_channels=hidden_size,
-            out_channels=out_channels,
-            kernel_size=3,
-            padding=1,
-            dtype=dtype,
-            device=device,
-        )
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        # RMSNorm normalizes over the last dimension, but our channel dim (C) is at dim=1.
-        # So we temporarily move the channel dimension to the end for the norm operation.
-        return self.conv(self.norm(x.movedim(1, -1)).movedim(-1, 1))
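Review note on the removed `NerfEmbedder.fetch_pos`: the broadcasting in that method can be hard to follow, so here is a loop-based restatement of the same grid, as a sketch only. It uses the standard library instead of torch, leaves out the leading batch dimension and the dtype/device handling, and the function name is hypothetical. Rows are ordered y-major (matching `indexing="ij"` followed by `reshape(-1, ...)`), and features are ordered with the x frequency as the outer index.

```python
import math


def dct_like_positions(patch_size, max_freqs):
    """Return patch_size**2 rows, each holding max_freqs**2 positional features."""
    # Equivalent of torch.linspace(0, 1, patch_size), as a plain list.
    if patch_size == 1:
        coords = [0.0]
    else:
        coords = [k / (patch_size - 1) for k in range(patch_size)]

    rows = []
    for i in range(patch_size):        # y index (row of the patch)
        for j in range(patch_size):    # x index (column of the patch)
            pos_y, pos_x = coords[i], coords[j]
            features = []
            for fx in range(max_freqs):
                for fy in range(max_freqs):
                    # Custom down-weighting of mixed high frequencies.
                    coeff = 1.0 / (1 + fx * fy)
                    basis = math.cos(pos_x * fx * math.pi) * math.cos(pos_y * fy * math.pi)
                    features.append(basis * coeff)
            rows.append(features)
    return rows


if __name__ == "__main__":
    grid = dct_like_positions(patch_size=2, max_freqs=2)
    for row in grid:
        print([round(v, 3) for v in row])
```

In the removed module this grid is concatenated to the per-pixel input channels before the linear projection, which is why the embedder's input width is `in_channels + max_freqs**2`.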
--- a/comfy/ldm/chroma_radiance/model.py
+++ b/comfy/ldm/chroma_radiance/model.py
@@ -1,320 +0,0 @@
-# Credits:
-# Original Flux code can be found on: https://github.com/black-forest-labs/flux
-# Chroma Radiance adaption referenced from https://github.com/lodestone-rock/flow
-
-from dataclasses import dataclass
-from typing import Optional
-
-import torch
-from torch import Tensor, nn
-from einops import repeat
-import comfy.ldm.common_dit
-
-from comfy.ldm.flux.layers import EmbedND
-
-from comfy.ldm.chroma.model import Chroma, ChromaParams
-from comfy.ldm.chroma.layers import (
-    DoubleStreamBlock,
-    SingleStreamBlock,
-    Approximator,
-)
-from .layers import (
-    NerfEmbedder,
-    NerfGLUBlock,
-    NerfFinalLayer,
-    NerfFinalLayerConv,
-)
-
-
-@dataclass
-class ChromaRadianceParams(ChromaParams):
-    patch_size: int
-    nerf_hidden_size: int
-    nerf_mlp_ratio: int
-    nerf_depth: int
-    nerf_max_freqs: int
-    # Setting nerf_tile_size to 0 disables tiling.
-    nerf_tile_size: int
-    # Currently one of linear (legacy) or conv.
-    nerf_final_head_type: str
-    # None means use the same dtype as the model.
-    nerf_embedder_dtype: Optional[torch.dtype]
-
-
-class ChromaRadiance(Chroma):
-    """
-    Transformer model for flow matching on sequences.
-    """
-
-    def __init__(self, image_model=None, final_layer=True, dtype=None, device=None, operations=None, **kwargs):
-        if operations is None:
-            raise RuntimeError("Attempt to create ChromaRadiance object without setting operations")
-        nn.Module.__init__(self)
-        self.dtype = dtype
-        params = ChromaRadianceParams(**kwargs)
-        self.params = params
-        self.patch_size = params.patch_size
-        self.in_channels = params.in_channels
-        self.out_channels = params.out_channels
-        if params.hidden_size % params.num_heads != 0:
-            raise ValueError(
-                f"Hidden size {params.hidden_size} must be divisible by num_heads {params.num_heads}"
-            )
-        pe_dim = params.hidden_size // params.num_heads
-        if sum(params.axes_dim) != pe_dim:
-            raise ValueError(f"Got {params.axes_dim} but expected positional dim {pe_dim}")
-        self.hidden_size = params.hidden_size
-        self.num_heads = params.num_heads
-        self.in_dim = params.in_dim
-        self.out_dim = params.out_dim
-        self.hidden_dim = params.hidden_dim
-        self.n_layers = params.n_layers
-        self.pe_embedder = EmbedND(dim=pe_dim, theta=params.theta, axes_dim=params.axes_dim)
-        self.img_in_patch = operations.Conv2d(
-            params.in_channels,
-            params.hidden_size,
-            kernel_size=params.patch_size,
-            stride=params.patch_size,
-            bias=True,
-            dtype=dtype,
-            device=device,
-        )
-        self.txt_in = operations.Linear(params.context_in_dim, self.hidden_size, dtype=dtype, device=device)
-        # set as nn identity for now, will overwrite it later.
-        self.distilled_guidance_layer = Approximator(
-                    in_dim=self.in_dim,
-                    hidden_dim=self.hidden_dim,
-                    out_dim=self.out_dim,
-                    n_layers=self.n_layers,
-                    dtype=dtype, device=device, operations=operations
-                )
-
-
-        self.double_blocks = nn.ModuleList(
-            [
-                DoubleStreamBlock(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=params.mlp_ratio,
-                    qkv_bias=params.qkv_bias,
-                    dtype=dtype, device=device, operations=operations
-                )
-                for _ in range(params.depth)
-            ]
-        )
-
-        self.single_blocks = nn.ModuleList(
-            [
-                SingleStreamBlock(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=params.mlp_ratio,
-                    dtype=dtype, device=device, operations=operations,
-                )
-                for _ in range(params.depth_single_blocks)
-            ]
-        )
-
-        # pixel channel concat with DCT
-        self.nerf_image_embedder = NerfEmbedder(
-            in_channels=params.in_channels,
-            hidden_size_input=params.nerf_hidden_size,
-            max_freqs=params.nerf_max_freqs,
-            dtype=params.nerf_embedder_dtype or dtype,
-            device=device,
-            operations=operations,
-        )
-
-        self.nerf_blocks = nn.ModuleList([
-            NerfGLUBlock(
-                hidden_size_s=params.hidden_size,
-                hidden_size_x=params.nerf_hidden_size,
-                mlp_ratio=params.nerf_mlp_ratio,
-                dtype=dtype,
-                device=device,
-                operations=operations,
-            ) for _ in range(params.nerf_depth)
-        ])
-
-        if params.nerf_final_head_type == "linear":
-            self.nerf_final_layer = NerfFinalLayer(
-                params.nerf_hidden_size,
-                out_channels=params.in_channels,
-                dtype=dtype,
-                device=device,
-                operations=operations,
-            )
-        elif params.nerf_final_head_type == "conv":
-            self.nerf_final_layer_conv = NerfFinalLayerConv(
-                params.nerf_hidden_size,
-                out_channels=params.in_channels,
-                dtype=dtype,
-                device=device,
-                operations=operations,
-            )
-        else:
-            errstr = f"Unsupported nerf_final_head_type {params.nerf_final_head_type}"
-            raise ValueError(errstr)
-
-        self.skip_mmdit = []
-        self.skip_dit = []
-        self.lite = False
-
-    @property
-    def _nerf_final_layer(self) -> nn.Module:
-        if self.params.nerf_final_head_type == "linear":
-            return self.nerf_final_layer
-        if self.params.nerf_final_head_type == "conv":
-            return self.nerf_final_layer_conv
-        # Impossible to get here as we raise an error on unexpected types on initialization.
-        raise NotImplementedError
-
-    def img_in(self, img: Tensor) -> Tensor:
-        img = self.img_in_patch(img) # -> [B, Hidden, H/P, W/P]
-        # flatten into a sequence for the transformer.
-        return img.flatten(2).transpose(1, 2) # -> [B, NumPatches, Hidden]
-
-    def forward_nerf(
-        self,
-        img_orig: Tensor,
-        img_out: Tensor,
-        params: ChromaRadianceParams,
-    ) -> Tensor:
-        B, C, H, W = img_orig.shape
-        num_patches = img_out.shape[1]
-        patch_size = params.patch_size
-
-        # Store the raw pixel values of each patch for the NeRF head later.
-        # unfold creates patches: [B, C * P * P, NumPatches]
-        nerf_pixels = nn.functional.unfold(img_orig, kernel_size=patch_size, stride=patch_size)
-        nerf_pixels = nerf_pixels.transpose(1, 2) # -> [B, NumPatches, C * P * P]
-
-        # Reshape for per-patch processing
-        nerf_hidden = img_out.reshape(B * num_patches, params.hidden_size)
-        nerf_pixels = nerf_pixels.reshape(B * num_patches, C, patch_size**2).transpose(1, 2)
-
-        if params.nerf_tile_size > 0 and num_patches > params.nerf_tile_size:
-            # Enable tiling if nerf_tile_size isn't 0 and we actually have more patches than
-            # the tile size.
-            img_dct = self.forward_tiled_nerf(nerf_hidden, nerf_pixels, B, C, num_patches, patch_size, params)
-        else:
-            # Get DCT-encoded pixel embeddings [pixel-dct]
-            img_dct = self.nerf_image_embedder(nerf_pixels)
-
-            # Pass through the dynamic MLP blocks (the NeRF)
-            for block in self.nerf_blocks:
-                img_dct = block(img_dct, nerf_hidden)
-
-        # Reassemble the patches into the final image.
-        img_dct = img_dct.transpose(1, 2) # -> [B*NumPatches, C, P*P]
-        # Reshape to combine with batch dimension for fold
-        img_dct = img_dct.reshape(B, num_patches, -1) # -> [B, NumPatches, C*P*P]
-        img_dct = img_dct.transpose(1, 2) # -> [B, C*P*P, NumPatches]
-        img_dct = nn.functional.fold(
-            img_dct,
-            output_size=(H, W),
-            kernel_size=patch_size,
-            stride=patch_size,
-        )
-        return self._nerf_final_layer(img_dct)
-
-    def forward_tiled_nerf(
-        self,
-        nerf_hidden: Tensor,
-        nerf_pixels: Tensor,
-        batch: int,
-        channels: int,
-        num_patches: int,
-        patch_size: int,
-        params: ChromaRadianceParams,
-    ) -> Tensor:
-        """
-        Processes the NeRF head in tiles to save memory.
-        nerf_hidden has shape [B * L, D] (batch and patches flattened together)
-        nerf_pixels has shape [B * L, P * P, C]
-        """
-        tile_size = params.nerf_tile_size
-        output_tiles = []
-        # Iterate over the patches in tiles. Batch and patches are flattened into dim 0, so each tile is a slice of rows.
-        for i in range(0, num_patches, tile_size):
-            end = min(i + tile_size, num_patches)
-
-            # Slice the current tile from the input tensors
-            nerf_hidden_tile = nerf_hidden[i * batch:end * batch]
-            nerf_pixels_tile = nerf_pixels[i * batch:end * batch]
-
-            # get DCT-encoded pixel embeddings [pixel-dct]
-            img_dct_tile = self.nerf_image_embedder(nerf_pixels_tile)
-
-            # pass through the dynamic MLP blocks (the NeRF)
-            for block in self.nerf_blocks:
-                img_dct_tile = block(img_dct_tile, nerf_hidden_tile)
-
-            output_tiles.append(img_dct_tile)
-
-        # Concatenate the processed tiles along the patch dimension
-        return torch.cat(output_tiles, dim=0)
-
-    def radiance_get_override_params(self, overrides: dict) -> ChromaRadianceParams:
-        params = self.params
-        if not overrides:
-            return params
-        params_dict = {k: getattr(params, k) for k in params.__dataclass_fields__}
-        nullable_keys = frozenset(("nerf_embedder_dtype",))
-        bad_keys = tuple(k for k in overrides if k not in params_dict)
-        if bad_keys:
-            e = f"Unknown key(s) in transformer_options chroma_radiance_options: {', '.join(bad_keys)}"
-            raise ValueError(e)
-        bad_keys = tuple(
-            k
-            for k, v in overrides.items()
-            if type(v) != type(getattr(params, k)) and (v is not None or k not in nullable_keys)
-        )
-        if bad_keys:
-            e = f"Invalid value(s) in transformer_options chroma_radiance_options: {', '.join(bad_keys)}"
-            raise ValueError(e)
-        # At this point it's all valid keys and values so we can merge with the existing params.
-        params_dict |= overrides
-        return params.__class__(**params_dict)
-
-    def _forward(
-        self,
-        x: Tensor,
-        timestep: Tensor,
-        context: Tensor,
-        guidance: Optional[Tensor],
-        control: Optional[dict]=None,
-        transformer_options: dict={},
-        **kwargs: dict,
-    ) -> Tensor:
-        bs, c, h, w = x.shape
-        img = comfy.ldm.common_dit.pad_to_patch_size(x, (self.patch_size, self.patch_size))
-
-        if img.ndim != 4:
-            raise ValueError("Input img tensor must be in [B, C, H, W] format.")
-        if context.ndim != 3:
-            raise ValueError("Input txt tensors must have 3 dimensions.")
-
-        params = self.radiance_get_override_params(transformer_options.get("chroma_radiance_options", {}))
-
-        h_len = (img.shape[-2] // self.patch_size)
-        w_len = (img.shape[-1] // self.patch_size)
-
-        img_ids = torch.zeros((h_len, w_len, 3), device=x.device, dtype=x.dtype)
-        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(0, h_len - 1, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1)
-        img_ids[:, :, 2] = img_ids[:, :, 2] + torch.linspace(0, w_len - 1, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0)
-        img_ids = repeat(img_ids, "h w c -> b (h w) c", b=bs)
-        txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
-
-        img_out = self.forward_orig(
-            img,
-            img_ids,
-            context,
-            txt_ids,
-            timestep,
-            guidance,
-            control,
-            transformer_options,
-            attn_mask=kwargs.get("attention_mask", None),
-        )
-        return self.forward_nerf(img, img_out, params)[:, :, :h, :w]
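Note on `radiance_get_override_params` above: the removed method validates the override keys, then the value types, and only then merges into a fresh params object. A minimal sketch of that pattern follows, using only the standard library. `DemoParams`, its fields and defaults, and `apply_overrides` are hypothetical names for illustration and are not part of the repository; `dataclasses.replace` is swapped in for the dict merge used in the original.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class DemoParams:
    """Hypothetical stand-in for ChromaRadianceParams (fields are illustrative)."""
    patch_size: int = 16
    nerf_tile_size: int = 0
    nerf_embedder_dtype: Optional[str] = "bfloat16"


# Keys that may be overridden with None regardless of their current type.
NULLABLE_KEYS = frozenset({"nerf_embedder_dtype"})


def apply_overrides(params: DemoParams, overrides: dict) -> DemoParams:
    """Return a copy of params with validated overrides applied."""
    if not overrides:
        return params
    # Step 1: reject keys that are not dataclass fields.
    known = {f.name for f in fields(params)}
    unknown = [k for k in overrides if k not in known]
    if unknown:
        raise ValueError(f"Unknown key(s): {', '.join(unknown)}")
    # Step 2: reject values whose type differs from the current value,
    # except None for keys explicitly marked nullable.
    invalid = [
        k for k, v in overrides.items()
        if type(v) is not type(getattr(params, k))
        and not (v is None and k in NULLABLE_KEYS)
    ]
    if invalid:
        raise ValueError(f"Invalid value(s): {', '.join(invalid)}")
    # Step 3: everything is valid, so build a new instance.
    return replace(params, **overrides)
```

Because the type check compares against the current value, a field that is currently `None` cannot be overridden with a non-`None` value under this scheme. The removed code appears to share that property.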
--- a/comfy/ldm/flux/math.py
+++ b/comfy/ldm/flux/math.py
@@ -35,13 +35,11 @@ def rope(pos: Tensor, dim: int, theta: int) -> Tensor:
    out = rearrange(out, "b n d (i j) -> b n d i j", i=2, j=2)
    return out.to(dtype=torch.float32, device=pos.device)

-def apply_rope1(x: Tensor, freqs_cis: Tensor):
-    x_ = x.to(dtype=freqs_cis.dtype).reshape(*x.shape[:-1], -1, 1, 2)
-
-    x_out = freqs_cis[..., 0] * x_[..., 0]
-    x_out.addcmul_(freqs_cis[..., 1], x_[..., 1])
-
-    return x_out.reshape(*x.shape).type_as(x)

 def apply_rope(xq: Tensor, xk: Tensor, freqs_cis: Tensor):
-    return apply_rope1(xq, freqs_cis), apply_rope1(xk, freqs_cis)
+    xq_ = xq.to(dtype=freqs_cis.dtype).reshape(*xq.shape[:-1], -1, 1, 2)
+    xk_ = xk.to(dtype=freqs_cis.dtype).reshape(*xk.shape[:-1], -1, 1, 2)
+    xq_out = freqs_cis[..., 0] * xq_[..., 0] + freqs_cis[..., 1] * xq_[..., 1]
+    xk_out = freqs_cis[..., 0] * xk_[..., 0] + freqs_cis[..., 1] * xk_[..., 1]
+    return xq_out.reshape(*xq.shape).type_as(xq), xk_out.reshape(*xk.shape).type_as(xk)
+
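Note on the `apply_rope` change above: both versions compute the same thing. Each consecutive pair of channels is multiplied by a 2x2 matrix taken from `freqs_cis`, and the new code simply inlines the per-tensor helper for `xq` and `xk`. The sketch below illustrates that arithmetic in plain Python rather than torch so it runs without dependencies. It assumes the matrices are rotations of the form `[[cos, -sin], [sin, cos]]`; only the tail of `rope()` is visible in this diff, so that layout is an assumption. `rotation`, `rotate_pairs` and `rotate_q_and_k` are hypothetical names.

```python
import math


def rotation(angle: float):
    """2x2 rotation matrix [[cos, -sin], [sin, cos]] as nested lists."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def rotate_pairs(x, mats):
    """Rotate each consecutive pair (x[2p], x[2p+1]) by mats[p]."""
    out = []
    for p, m in enumerate(mats):
        x0, x1 = x[2 * p], x[2 * p + 1]
        # Column 0 scaled by x0 plus column 1 scaled by x1, mirroring
        # freqs_cis[..., 0] * x_[..., 0] + freqs_cis[..., 1] * x_[..., 1].
        out.append(m[0][0] * x0 + m[0][1] * x1)
        out.append(m[1][0] * x0 + m[1][1] * x1)
    return out


def rotate_q_and_k(q, k, mats):
    """Fused form: apply the same per-pair rotations to q and k."""
    return rotate_pairs(q, mats), rotate_pairs(k, mats)
```

Since query and key receive the same rotation at a given position, their dot product at that position is unchanged. Only a difference in position between the two shows up in the attention scores.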
--- a/comfy/ldm/flux/model.py
+++ b/comfy/ldm/flux/model.py
@@ -106,7 +106,6 @@ class Flux(nn.Module):
        if y is None:
            y = torch.zeros((img.shape[0], self.params.vec_in_dim), device=img.device, dtype=img.dtype)

-        patches = transformer_options.get("patches", {})
        patches_replace = transformer_options.get("patches_replace", {})
        if img.ndim != 3 or txt.ndim != 3:
            raise ValueError("Input img and txt tensors must have 3 dimensions.")
@@ -118,17 +117,9 @@ class Flux(nn.Module):
            if guidance is not None:
                vec = vec + self.guidance_in(timestep_embedding(guidance, 256).to(img.dtype))

-        vec = vec + self.vector_in(y[:, :self.params.vec_in_dim])
+        vec = vec + self.vector_in(y[:,:self.params.vec_in_dim])
        txt = self.txt_in(txt)

-        if "post_input" in patches:
-            for p in patches["post_input"]:
-                out = p({"img": img, "txt": txt, "img_ids": img_ids, "txt_ids": txt_ids})
-                img = out["img"]
-                txt = out["txt"]
-                img_ids = out["img_ids"]
-                txt_ids = out["txt_ids"]
-
        if img_ids is not None:
            ids = torch.cat((txt_ids, img_ids), dim=1)
            pe = self.pe_embedder(ids)
@@ -137,6 +128,7 @@ class Flux(nn.Module):

        blocks_replace = patches_replace.get("dit", {})
        for i, block in enumerate(self.double_blocks):
+            transformer_options["block"] = ("double_block", i, 2)
            if ("double_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
@@ -178,6 +170,7 @@ class Flux(nn.Module):
        img = torch.cat((txt, img), 1)

        for i, block in enumerate(self.single_blocks):
+            transformer_options["block"] = ("single_block", i, 1)
            if ("single_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
@@ -247,18 +240,12 @@ class Flux(nn.Module):
            h = 0
            w = 0
            index = 0
-            ref_latents_method = kwargs.get("ref_latents_method", "offset")
+            index_ref_method = kwargs.get("ref_latents_method", "offset") == "index"
            for ref in ref_latents:
-                if ref_latents_method == "index":
+                if index_ref_method:
                    index += 1
                    h_offset = 0
                    w_offset = 0
-                elif ref_latents_method == "uxo":
-                    index = 0
-                    h_offset = h_len * patch_size + h
-                    w_offset = w_len * patch_size + w
-                    h += ref.shape[-2]
-                    w += ref.shape[-1]
                else:
                    index = 1
                    h_offset = 0
--- a/comfy/ldm/hunyuan3d/vae.py
+++ b/comfy/ldm/hunyuan3d/vae.py
@@ -4,458 +4,81 @@
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
+
+
+from typing import Union, Tuple, List, Callable, Optional
+
 import numpy as np
-import math
+from einops import repeat, rearrange
 from tqdm import tqdm
-
-from typing import Optional
-
 import logging

 import comfy.ops
 ops = comfy.ops.disable_weight_init

-def fps(src: torch.Tensor, batch: torch.Tensor, sampling_ratio: float, start_random: bool = True):
-
-    # manually create the pointer vector
-    assert src.size(0) == batch.numel()
-
-    batch_size = int(batch.max()) + 1
-    deg = src.new_zeros(batch_size, dtype = torch.long)
-
-    deg.scatter_add_(0, batch, torch.ones_like(batch))
-
-    ptr_vec = deg.new_zeros(batch_size + 1)
-    torch.cumsum(deg, 0, out=ptr_vec[1:])
-
-    #return fps_sampling(src, ptr_vec, ratio)
-    sampled_indicies = []
-
-    for b in range(batch_size):
-        # start and the end of each batch
-        start, end = ptr_vec[b].item(), ptr_vec[b + 1].item()
-        # points from the point cloud
-        points = src[start:end]
-
-        num_points = points.size(0)
-        num_samples = max(1, math.ceil(num_points * sampling_ratio))
-
-        selected = torch.zeros(num_samples, device = src.device, dtype = torch.long)
-        distances = torch.full((num_points,), float("inf"), device = src.device)
-
-        # select a random start point
-        if start_random:
-            farthest = torch.randint(0, num_points, (1,), device = src.device)
-        else:
-            farthest = torch.tensor([0], device = src.device, dtype = torch.long)
-
-        for i in range(num_samples):
-            selected[i] = farthest
-            centroid = points[farthest].squeeze(0)
-            dist = torch.norm(points - centroid, dim = 1) # compute euclidean distance
-            distances = torch.minimum(distances, dist)
-            farthest = torch.argmax(distances)
-
-        sampled_indicies.append(torch.arange(start, end)[selected])
-
-    return torch.cat(sampled_indicies, dim = 0)
-class PointCrossAttention(nn.Module):
-    def __init__(self,
-        num_latents: int,
-        downsample_ratio: float,
-        pc_size: int,
-        pc_sharpedge_size: int,
-        point_feats: int,
-        width: int,
-        heads: int,
-        layers: int,
-        fourier_embedder,
-        normal_pe: bool = False,
-        qkv_bias: bool = False,
-        use_ln_post: bool = True,
-        qk_norm: bool = True):
-
-        super().__init__()
-
-        self.fourier_embedder = fourier_embedder
-
-        self.pc_size = pc_size
-        self.normal_pe = normal_pe
-        self.downsample_ratio = downsample_ratio
-        self.pc_sharpedge_size = pc_sharpedge_size
-        self.num_latents = num_latents
-        self.point_feats = point_feats
-
-        self.input_proj = nn.Linear(self.fourier_embedder.out_dim + point_feats, width)
-
-        self.cross_attn = ResidualCrossAttentionBlock(
-            width = width,
-            heads = heads,
-            qkv_bias = qkv_bias,
-            qk_norm = qk_norm
-        )
-
-        self.self_attn = None
-        if layers > 0:
-            self.self_attn = Transformer(
-                width = width,
-                heads = heads,
-                qkv_bias = qkv_bias,
-                qk_norm = qk_norm,
-                layers = layers
-            )
-
-        if use_ln_post:
-            self.ln_post = nn.LayerNorm(width)
-        else:
-            self.ln_post = None
-
-    def sample_points_and_latents(self, point_cloud: torch.Tensor, features: torch.Tensor):
-
-        """
-        Subsample points randomly from the point cloud (input_pc)
-        Further sample the subsampled points to get query_pc
-        take the fourier embeddings for both input and query pc
-
-        Mental Note: FPS-sampled points (query_pc) act as latent tokens that attend to and learn from the broader context in input_pc.
-        Goal: get a smaller representation (query_pc) to represent the entire scene structure by learning from a broader subset (input_pc).
-        More computationally efficient.
-
-        Features are additional information for each point in the cloud
-        """
-
-        B, _, D = point_cloud.shape
-
-        num_latents = int(self.num_latents)
-
-        num_random_query = self.pc_size / (self.pc_size + self.pc_sharpedge_size) * num_latents
-        num_sharpedge_query = num_latents - num_random_query
-
-        # Split random and sharpedge surface points
-        random_pc, sharpedge_pc = torch.split(point_cloud, [self.pc_size, self.pc_sharpedge_size], dim=1)
-
-        # assert statements
-        assert random_pc.shape[1] <= self.pc_size, "Random surface points size must be less than or equal to pc_size"
-        assert sharpedge_pc.shape[1] <= self.pc_sharpedge_size, "Sharpedge surface points size must be less than or equal to pc_sharpedge_size"
-
-        input_random_pc_size = int(num_random_query * self.downsample_ratio)
-        random_query_pc, random_input_pc, random_idx_pc, random_idx_query = \
-            self.subsample(pc = random_pc, num_query = num_random_query, input_pc_size = input_random_pc_size)
-
-        input_sharpedge_pc_size = int(num_sharpedge_query * self.downsample_ratio)
-
-        if input_sharpedge_pc_size == 0:
-            sharpedge_input_pc = torch.zeros(B, 0, D, dtype = random_input_pc.dtype).to(point_cloud.device)
-            sharpedge_query_pc = torch.zeros(B, 0, D, dtype= random_query_pc.dtype).to(point_cloud.device)
-
-        else:
-            sharpedge_query_pc, sharpedge_input_pc, sharpedge_idx_pc, sharpedge_idx_query = \
-            self.subsample(pc = sharpedge_pc, num_query = num_sharpedge_query, input_pc_size = input_sharpedge_pc_size)
-
-        # concat the random and sharpedges
-        query_pc = torch.cat([random_query_pc, sharpedge_query_pc], dim = 1)
-        input_pc = torch.cat([random_input_pc, sharpedge_input_pc], dim = 1)
-
-        query = self.fourier_embedder(query_pc)
-        data = self.fourier_embedder(input_pc)
-
-        if self.point_feats > 0:
-            random_surface_features, sharpedge_surface_features = torch.split(features, [self.pc_size, self.pc_sharpedge_size], dim = 1)
-
-            input_random_surface_features, query_random_features = \
-                self.handle_features(features = random_surface_features, idx_pc = random_idx_pc, batch_size = B,
-                                     input_pc_size = input_random_pc_size, idx_query = random_idx_query)
-
-            if input_sharpedge_pc_size == 0:
-                input_sharpedge_surface_features = torch.zeros(B, 0, self.point_feats,
-                                                               dtype = input_random_surface_features.dtype, device = point_cloud.device)
-
-                query_sharpedge_features = torch.zeros(B, 0, self.point_feats,
-                                                       dtype = query_random_features.dtype, device = point_cloud.device)
-            else:
-
-                input_sharpedge_surface_features, query_sharpedge_features = \
-                    self.handle_features(idx_pc = sharpedge_idx_pc, features = sharpedge_surface_features,
-                                         batch_size = B, idx_query = sharpedge_idx_query, input_pc_size = input_sharpedge_pc_size)
-
-            query_features = torch.cat([query_random_features, query_sharpedge_features], dim = 1)
-            input_features = torch.cat([input_random_surface_features, input_sharpedge_surface_features], dim = 1)
-
-            if self.normal_pe:
-                # apply the fourier embeddings on the first 3 dims (xyz)
-                input_features_pe = self.fourier_embedder(input_features[..., :3])
-                query_features_pe = self.fourier_embedder(query_features[..., :3])
-                # replace the first 3 dims with the new PE ones
-                input_features = torch.cat([input_features_pe, input_features[..., :3]], dim = -1)
-                query_features = torch.cat([query_features_pe, query_features[..., :3]], dim = -1)
-
-            # concat at the channels dim
-            query = torch.cat([query, query_features], dim = -1)
-            data = torch.cat([data, input_features], dim = -1)
-
-        # don't return pc_info to avoid unnecessary memory usage
-        return query.view(B, -1, query.shape[-1]), data.view(B, -1, data.shape[-1])
-
-    def forward(self, point_cloud: torch.Tensor, features: torch.Tensor):
-
-        query, data = self.sample_points_and_latents(point_cloud = point_cloud, features = features)
-
-        # apply projections
-        query = self.input_proj(query)
-        data = self.input_proj(data)
-
-        # apply cross attention between query and data
-        latents = self.cross_attn(query, data)
-
-        if self.self_attn is not None:
-            latents = self.self_attn(latents)
-
-        if self.ln_post is not None:
-            latents = self.ln_post(latents)
-
-        return latents
-
-
-    def subsample(self, pc, num_query, input_pc_size: int):
-
-        """
-        num_query: number of points to keep after FPS
-        input_pc_size: number of points to select before FPS
-        """
-
-        B, _, D = pc.shape
-        query_ratio = num_query / input_pc_size
-
-        # random subsampling of points inside the point cloud
-        idx_pc = torch.randperm(pc.shape[1], device = pc.device)[:input_pc_size]
-        input_pc = pc[:, idx_pc, :]
-
-        # flatten to allow applying fps across the whole batch
-        flattent_input_pc = input_pc.view(B * input_pc_size, D)
-
-        # construct a batch_down tensor to tell fps
-        # which points belong to which batch
-        N_down = int(flattent_input_pc.shape[0] / B)
-        batch_down = torch.arange(B).to(pc.device)
-        batch_down = torch.repeat_interleave(batch_down, N_down)
-
-        idx_query = fps(flattent_input_pc, batch_down, sampling_ratio = query_ratio)
-        query_pc = flattent_input_pc[idx_query].view(B, -1, D)
-
-        return query_pc, input_pc, idx_pc, idx_query
-
-    def handle_features(self, features, idx_pc, input_pc_size, batch_size: int, idx_query):
-
-        B = batch_size
-
-        input_surface_features = features[:, idx_pc, :]
-        flattent_input_features = input_surface_features.view(B * input_pc_size, -1)
-        query_features = flattent_input_features[idx_query].view(B, -1,
-                                                                 flattent_input_features.shape[-1])
-
-        return input_surface_features, query_features
-
-def normalize_mesh(mesh, scale = 0.9999):
-    """Normalize mesh to fit in [-scale, scale]. Translate mesh so its center is [0,0,0]"""
-
-    bbox = mesh.bounds
-    center = (bbox[1] + bbox[0]) / 2
-
-    max_extent = (bbox[1] - bbox[0]).max()
-    mesh.apply_translation(-center)
-    mesh.apply_scale((2 * scale) / max_extent)
-
-    return mesh
-
-def sample_pointcloud(mesh, num = 200000):
-    """ Uniformly sample points from the surface of the mesh """
-
-    points, face_idx = mesh.sample(num, return_index = True)
-    normals = mesh.face_normals[face_idx]
-    return torch.from_numpy(points.astype(np.float32)), torch.from_numpy(normals.astype(np.float32))
-
-def detect_sharp_edges(mesh, threshold=0.985):
-    """Return edge indices (a, b) that lie on sharp boundaries of the mesh."""
-
-    V, F = mesh.vertices, mesh.faces
-    VN, FN = mesh.vertex_normals, mesh.face_normals
-
-    sharp_mask = np.ones(V.shape[0])
-    for i in range(3):
-        indices = F[:, i]
-        alignment = np.einsum('ij,ij->i', VN[indices], FN)
-        dot_stack = np.stack((sharp_mask[indices], alignment), axis=-1)
-        sharp_mask[indices] = np.min(dot_stack, axis=-1)
-
-    edge_a = np.concatenate([F[:, 0], F[:, 1], F[:, 2]])
-    edge_b = np.concatenate([F[:, 1], F[:, 2], F[:, 0]])
-    sharp_edges = (sharp_mask[edge_a] < threshold) & (sharp_mask[edge_b] < threshold)
-
-    return edge_a[sharp_edges], edge_b[sharp_edges]
-
-
-def sharp_sample_pointcloud(mesh, num = 16384):
-    """ Sample points preferentially from sharp edges in the mesh. """
-
-    edge_a, edge_b = detect_sharp_edges(mesh)
-    V, VN = mesh.vertices, mesh.vertex_normals
-
-    va, vb = V[edge_a], V[edge_b]
-    na, nb = VN[edge_a], VN[edge_b]
-
-    edge_lengths = np.linalg.norm(vb - va, axis=-1)
-    weights = edge_lengths / edge_lengths.sum()
-
-    indices = np.searchsorted(np.cumsum(weights), np.random.rand(num))
-    t = np.random.rand(num, 1)
-
-    samples = t * va[indices] + (1 - t) * vb[indices]
-    normals = t * na[indices] + (1 - t) * nb[indices]
-
-    return samples.astype(np.float32), normals.astype(np.float32)
-
-def load_surface_sharpedge(mesh, num_points=4096, num_sharp_points=4096, sharpedge_flag = True, device = "cuda"):
-    """Load a surface with optional sharp-edge annotations from a trimesh mesh."""
-
-    import trimesh
-
-    try:
-        mesh_full = trimesh.util.concatenate(mesh.dump())
-    except Exception:
-        mesh_full = trimesh.util.concatenate(mesh)
-
-    mesh_full = normalize_mesh(mesh_full)
-
-    faces = mesh_full.faces
-    vertices = mesh_full.vertices
-    origin_face_count = faces.shape[0]
-
-    mesh_surface = trimesh.Trimesh(vertices=vertices, faces=faces[:origin_face_count])
-    mesh_fill = trimesh.Trimesh(vertices=vertices, faces=faces[origin_face_count:])
-
-    area_surface = mesh_surface.area
-    area_fill = mesh_fill.area
-    total_area = area_surface + area_fill
-
-    sample_num = 499712 // 2
-    fill_ratio = area_fill / total_area if total_area > 0 else 0
-
-    num_fill = int(sample_num * fill_ratio)
-    num_surface = sample_num - num_fill
-
-    surf_pts, surf_normals = sample_pointcloud(mesh_surface, num_surface)
-    fill_pts, fill_normals = (torch.zeros(0, 3), torch.zeros(0, 3)) if num_fill == 0 else sample_pointcloud(mesh_fill, num_fill)
-
-    sharp_pts, sharp_normals = sharp_sample_pointcloud(mesh_surface, sample_num)
-
-    def assemble_tensor(points, normals, label=None):
-
-        data = torch.cat([points, normals], dim=1).half().to(device)
-
-        if label is not None:
-            label_tensor = torch.full((data.shape[0], 1), float(label), dtype=torch.float16).to(device)
-            data = torch.cat([data, label_tensor], dim=1)
-
-        return data
-
-    surface = assemble_tensor(torch.cat([surf_pts.to(device), fill_pts.to(device)], dim=0),
-                              torch.cat([surf_normals.to(device), fill_normals.to(device)], dim=0),
-                              label = 0 if sharpedge_flag else None)
-
-    sharp_surface = assemble_tensor(torch.from_numpy(sharp_pts), torch.from_numpy(sharp_normals),
-                                    label = 1 if sharpedge_flag else None)
-
-    rng = np.random.default_rng()
-
-    surface = surface[rng.choice(surface.shape[0], num_points, replace = False)]
-    sharp_surface = sharp_surface[rng.choice(sharp_surface.shape[0], num_sharp_points, replace = False)]
-
-    full = torch.cat([surface, sharp_surface], dim = 0).unsqueeze(0)
-
-    return full
-
-class SharpEdgeSurfaceLoader:
-    """ Load mesh surface and sharp edge samples. """
-
-    def __init__(self, num_uniform_points = 8192, num_sharp_points = 8192):
-
-        self.num_uniform_points = num_uniform_points
-        self.num_sharp_points = num_sharp_points
-        self.total_points = num_uniform_points + num_sharp_points
-
-    def __call__(self, mesh_input, device = "cuda"):
-        mesh = self._load_mesh(mesh_input)
-        return load_surface_sharpedge(mesh, self.num_uniform_points, self.num_sharp_points, device = device)
-
-    @staticmethod
-    def _load_mesh(mesh_input):
-        import trimesh
-
-        if isinstance(mesh_input, str):
-            mesh = trimesh.load(mesh_input, force="mesh", merge_primitives = True)
-        else:
-            mesh = mesh_input
-
-        if isinstance(mesh, trimesh.Scene):
-            combined = None
-            for obj in mesh.geometry.values():
-                combined = obj if combined is None else combined + obj
-            return combined
-
-        return mesh
-
-class DiagonalGaussianDistribution:
-    def __init__(self, params: torch.Tensor, feature_dim: int = -1):
-
-        # divide quant channels (8) into mean and log variance
-        self.mean, self.logvar = torch.chunk(params, 2, dim = feature_dim)
-
-        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
-        self.std = torch.exp(0.5 * self.logvar)
-
-    def sample(self):
-
-        eps = torch.randn_like(self.std)
-        z = self.mean + eps * self.std
-
-        return z
-
-################################################
-# Volume Decoder
-################################################
-
-class VanillaVolumeDecoder():
+def generate_dense_grid_points(
+    bbox_min: np.ndarray,
+    bbox_max: np.ndarray,
+    octree_resolution: int,
+    indexing: str = "ij",
+):
+    length = bbox_max - bbox_min
+    num_cells = octree_resolution
+
+    x = np.linspace(bbox_min[0], bbox_max[0], int(num_cells) + 1, dtype=np.float32)
+    y = np.linspace(bbox_min[1], bbox_max[1], int(num_cells) + 1, dtype=np.float32)
+    z = np.linspace(bbox_min[2], bbox_max[2], int(num_cells) + 1, dtype=np.float32)
+    [xs, ys, zs] = np.meshgrid(x, y, z, indexing=indexing)
+    xyz = np.stack((xs, ys, zs), axis=-1)
+    grid_size = [int(num_cells) + 1, int(num_cells) + 1, int(num_cells) + 1]
+
+    return xyz, grid_size, length
+
+
+class VanillaVolumeDecoder:
    @torch.no_grad()
-    def __call__(self, latents: torch.Tensor, geo_decoder: callable, octree_resolution: int, bounds = 1.01,
-                 num_chunks: int = 10_000, enable_pbar: bool = True, **kwargs):
+    def __call__(
+        self,
+        latents: torch.FloatTensor,
+        geo_decoder: Callable,
+        bounds: Union[Tuple[float], List[float], float] = 1.01,
+        num_chunks: int = 10000,
+        octree_resolution: Optional[int] = None,
+        enable_pbar: bool = True,
+        **kwargs,
+    ):
+        device = latents.device
+        dtype = latents.dtype
+        batch_size = latents.shape[0]

+        # 1. generate query points
        if isinstance(bounds, float):
            bounds = [-bounds, -bounds, -bounds, bounds, bounds, bounds]

-        bbox_min, bbox_max = torch.tensor(bounds[:3]), torch.tensor(bounds[3:])
-
-        x = torch.linspace(bbox_min[0], bbox_max[0], int(octree_resolution) + 1, dtype = torch.float32)
-        y = torch.linspace(bbox_min[1], bbox_max[1], int(octree_resolution) + 1, dtype = torch.float32)
-        z = torch.linspace(bbox_min[2], bbox_max[2], int(octree_resolution) + 1, dtype = torch.float32)
-
-        [xs, ys, zs] = torch.meshgrid(x, y, z, indexing = "ij")
-        xyz = torch.stack((xs, ys, zs), axis=-1).to(latents.device, dtype = latents.dtype).contiguous().reshape(-1, 3)
-        grid_size = [int(octree_resolution) + 1, int(octree_resolution) + 1, int(octree_resolution) + 1]
+        bbox_min, bbox_max = np.array(bounds[0:3]), np.array(bounds[3:6])
+        xyz_samples, grid_size, length = generate_dense_grid_points(
+            bbox_min=bbox_min,
+            bbox_max=bbox_max,
+            octree_resolution=octree_resolution,
+            indexing="ij"
+        )
+        xyz_samples = torch.from_numpy(xyz_samples).to(device, dtype=dtype).contiguous().reshape(-1, 3)

+        # 2. latents to 3d volume
        batch_logits = []
-        for start in tqdm(range(0, xyz.shape[0], num_chunks), desc="Volume Decoding",
+        for start in tqdm(range(0, xyz_samples.shape[0], num_chunks), desc="Volume Decoding",
                          disable=not enable_pbar):
-
-            chunk_queries = xyz[start: start + num_chunks, :]
-            chunk_queries = chunk_queries.unsqueeze(0).repeat(latents.shape[0], 1, 1)
-            logits = geo_decoder(queries = chunk_queries, latents = latents)
+            chunk_queries = xyz_samples[start: start + num_chunks, :]
+            chunk_queries = repeat(chunk_queries, "p c -> b p c", b=batch_size)
+            logits = geo_decoder(queries=chunk_queries, latents=latents)
            batch_logits.append(logits)

-        grid_logits = torch.cat(batch_logits, dim = 1)
-        grid_logits = grid_logits.view((latents.shape[0], *grid_size)).float()
+        grid_logits = torch.cat(batch_logits, dim=1)
+        grid_logits = grid_logits.view((batch_size, *grid_size)).float()

        return grid_logits

+
 class FourierEmbedder(nn.Module):
    """The sin/cosine positional embedding. Given an input tensor `x` of shape [n_batch, ..., c_dim], it converts
    each feature dimension of `x[..., i]` into:
@@ -552,11 +175,13 @@ class FourierEmbedder(nn.Module):
        else:
            return x

+
 class CrossAttentionProcessor:
    def __call__(self, attn, q, k, v):
        out = comfy.ops.scaled_dot_product_attention(q, k, v)
        return out

+
 class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    """
@@ -607,42 +232,39 @@ class MLP(nn.Module):
    def forward(self, x):
        return self.drop_path(self.c_proj(self.gelu(self.c_fc(x))))

+
 class QKVMultiheadCrossAttention(nn.Module):
    def __init__(
        self,
+        *,
        heads: int,
-        n_data = None,
        width=None,
        qk_norm=False,
        norm_layer=ops.LayerNorm
    ):
        super().__init__()
        self.heads = heads
-        self.n_data = n_data
        self.q_norm = norm_layer(width // heads, elementwise_affine=True, eps=1e-6) if qk_norm else nn.Identity()
        self.k_norm = norm_layer(width // heads, elementwise_affine=True, eps=1e-6) if qk_norm else nn.Identity()

-    def forward(self, q, kv):
+        self.attn_processor = CrossAttentionProcessor()

+    def forward(self, q, kv):
        _, n_ctx, _ = q.shape
        bs, n_data, width = kv.shape
-
        attn_ch = width // self.heads // 2
        q = q.view(bs, n_ctx, self.heads, -1)
-
        kv = kv.view(bs, n_data, self.heads, -1)
        k, v = torch.split(kv, attn_ch, dim=-1)

        q = self.q_norm(q)
        k = self.k_norm(k)
-
-        q, k, v = [t.permute(0, 2, 1, 3) for t in (q, k, v)]
-        out = F.scaled_dot_product_attention(q, k, v)
-
+        q, k, v = map(lambda t: rearrange(t, 'b n h d -> b h n d', h=self.heads), (q, k, v))
+        out = self.attn_processor(self, q, k, v)
        out = out.transpose(1, 2).reshape(bs, n_ctx, -1)
-
        return out

+
 class MultiheadCrossAttention(nn.Module):
    def __init__(
        self,
@@ -684,6 +306,7 @@ class MultiheadCrossAttention(nn.Module):
        x = self.c_proj(x)
        return x

+
 class ResidualCrossAttentionBlock(nn.Module):
    def __init__(
        self,
@@ -743,7 +366,7 @@ class QKVMultiheadAttention(nn.Module):
        q = self.q_norm(q)
        k = self.k_norm(k)

-        q, k, v = [t.permute(0, 2, 1, 3) for t in (q, k, v)]
+        q, k, v = map(lambda t: rearrange(t, 'b n h d -> b h n d', h=self.heads), (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v).transpose(1, 2).reshape(bs, n_ctx, -1)
        return out

@@ -760,7 +383,8 @@ class MultiheadAttention(nn.Module):
        drop_path_rate: float = 0.0
    ):
        super().__init__()
-
+        self.width = width
+        self.heads = heads
        self.c_qkv = ops.Linear(width, width * 3, bias=qkv_bias)
        self.c_proj = ops.Linear(width, width)
        self.attention = QKVMultiheadAttention(
@@ -867,7 +491,7 @@ class CrossAttentionDecoder(nn.Module):
        self.query_proj = ops.Linear(self.fourier_embedder.out_dim, width)
        if self.downsample_ratio != 1:
            self.latents_proj = ops.Linear(width * downsample_ratio, width)
-        if not self.enable_ln_post:
+        if self.enable_ln_post == False:
            qk_norm = False
        self.cross_attn_decoder = ResidualCrossAttentionBlock(
            width=width,
@@ -898,44 +522,28 @@ class CrossAttentionDecoder(nn.Module):

 class ShapeVAE(nn.Module):
    def __init__(
-            self,
-            *,
-            num_latents: int = 4096,
-            embed_dim: int = 64,
-            width: int = 1024,
-            heads: int = 16,
-            num_decoder_layers: int = 16,
-            num_encoder_layers: int = 8,
-            pc_size: int = 81920,
-            pc_sharpedge_size: int = 0,
-            point_feats: int = 4,
-            downsample_ratio: int = 20,
-            geo_decoder_downsample_ratio: int = 1,
-            geo_decoder_mlp_expand_ratio: int = 4,
-            geo_decoder_ln_post: bool = True,
-            num_freqs: int = 8,
-            qkv_bias: bool = False,
-            qk_norm: bool = True,
-            drop_path_rate: float = 0.0,
-            include_pi: bool = False,
-            scale_factor: float = 1.0039506158752403,
-            label_type: str = "binary",
+        self,
+        *,
+        embed_dim: int,
+        width: int,
+        heads: int,
+        num_decoder_layers: int,
+        geo_decoder_downsample_ratio: int = 1,
+        geo_decoder_mlp_expand_ratio: int = 4,
+        geo_decoder_ln_post: bool = True,
+        num_freqs: int = 8,
+        include_pi: bool = True,
+        qkv_bias: bool = True,
+        qk_norm: bool = False,
+        label_type: str = "binary",
+        drop_path_rate: float = 0.0,
+        scale_factor: float = 1.0,
    ):
        super().__init__()
        self.geo_decoder_ln_post = geo_decoder_ln_post

        self.fourier_embedder = FourierEmbedder(num_freqs=num_freqs, include_pi=include_pi)

-        self.encoder = PointCrossAttention(layers = num_encoder_layers,
-                                    num_latents = num_latents,
-                                    downsample_ratio = downsample_ratio,
-                                    heads = heads,
-                                    pc_size = pc_size,
-                                    width = width,
-                                    point_feats = point_feats,
-                                    fourier_embedder = self.fourier_embedder,
-                                    pc_sharpedge_size = pc_sharpedge_size)
-
        self.post_kl = ops.Linear(embed_dim, width)

        self.transformer = Transformer(
@@ -975,14 +583,5 @@ class ShapeVAE(nn.Module):
        grid_logits = self.volume_decoder(latents, self.geo_decoder, bounds=bounds, num_chunks=num_chunks, octree_resolution=octree_resolution, enable_pbar=enable_pbar)
        return grid_logits.movedim(-2, -1)

-    def encode(self, surface):
-
-        pc, feats = surface[:, :, :3], surface[:, :, 3:]
-        latents = self.encoder(pc, feats)
-
-        moments = self.pre_kl(latents)
-        posterior = DiagonalGaussianDistribution(moments, feature_dim = -1)
-
-        latents = posterior.sample()
-
-        return latents
+    def encode(self, x):
+        return None
--- a/comfy/ldm/hunyuan3dv2_1/hunyuandit.py
+++ b/comfy/ldm/hunyuan3dv2_1/hunyuandit.py
@@ -1,659 +0,0 @@
-import math
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.modules.attention import optimized_attention
-import comfy.model_management
-
-class GELU(nn.Module):
-
-    def __init__(self, dim_in: int, dim_out: int, operations, device, dtype):
-        super().__init__()
-        self.proj = operations.Linear(dim_in, dim_out, device = device, dtype = dtype)
-
-    def gelu(self, gate: torch.Tensor) -> torch.Tensor:
-
-        if gate.device.type == "mps":
-            return F.gelu(gate.to(dtype = torch.float32)).to(dtype = gate.dtype)
-
-        return F.gelu(gate)
-
-    def forward(self, hidden_states):
-
-        hidden_states = self.proj(hidden_states)
-        hidden_states = self.gelu(hidden_states)
-
-        return hidden_states
-
-class FeedForward(nn.Module):
-
-    def __init__(self, dim: int, dim_out = None, mult: int = 4,
-                dropout: float = 0.0, inner_dim = None, operations = None, device = None, dtype = None):
-
-        super().__init__()
-        if inner_dim is None:
-            inner_dim = int(dim * mult)
-
-        dim_out = dim_out if dim_out is not None else dim
-
-        act_fn = GELU(dim, inner_dim, operations = operations, device = device, dtype = dtype)
-
-        self.net = nn.ModuleList([])
-        self.net.append(act_fn)
-
-        self.net.append(nn.Dropout(dropout))
-        self.net.append(operations.Linear(inner_dim, dim_out, device = device, dtype = dtype))
-
-    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-        for module in self.net:
-            hidden_states = module(hidden_states)
-        return hidden_states
-
-class AddAuxLoss(torch.autograd.Function):
-
-    @staticmethod
-    def forward(ctx, x, loss):
-        # do nothing in forward (no computation)
-        ctx.requires_aux_loss = loss.requires_grad
-        ctx.dtype = loss.dtype
-
-        return x
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        # add the aux loss gradients
-        grad_loss = None
-        # put the aux grad the same as the main grad loss
-        # aux grad contributes equally
-        if ctx.requires_aux_loss:
-            grad_loss = torch.ones(1, dtype = ctx.dtype, device = grad_output.device)
-
-        return grad_output, grad_loss
-
-class MoEGate(nn.Module):
-
-    def __init__(self, embed_dim, num_experts=16, num_experts_per_tok=2, aux_loss_alpha=0.01, device = None, dtype = None):
-
-        super().__init__()
-        self.top_k = num_experts_per_tok
-        self.n_routed_experts = num_experts
-
-        self.alpha = aux_loss_alpha
-
-        self.gating_dim = embed_dim
-        self.weight = nn.Parameter(torch.empty((self.n_routed_experts, self.gating_dim), device = device, dtype = dtype))
-
-    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-
-        # flatten hidden states
-        hidden_states = hidden_states.view(-1, hidden_states.size(-1))
-
-        # get logits and pass it to softmax
-        logits = F.linear(hidden_states, comfy.model_management.cast_to(self.weight, dtype=hidden_states.dtype, device=hidden_states.device), bias = None)
-        scores = logits.softmax(dim = -1)
-
-        topk_weight, topk_idx = torch.topk(scores, k = self.top_k, dim = -1, sorted = False)
-
-        if self.training and self.alpha > 0.0:
-            scores_for_aux = scores
-
-            # used bincount instead of one hot encoding
-            counts = torch.bincount(topk_idx.view(-1), minlength = self.n_routed_experts).float()
-            ce = counts / topk_idx.numel()  # normalized expert usage
-
-            # mean expert score
-            Pi = scores_for_aux.mean(0)
-
-            # expert balance loss
-            aux_loss = (Pi * ce * self.n_routed_experts).sum() * self.alpha
-        else:
-            aux_loss = None
-
-        return topk_idx, topk_weight, aux_loss
-
-class MoEBlock(nn.Module):
-    def __init__(self, dim, num_experts: int = 6, moe_top_k: int = 2, dropout: float = 0.0,
-                 ff_inner_dim: int = None, operations = None, device = None, dtype = None):
-        super().__init__()
-
-        self.moe_top_k = moe_top_k
-        self.num_experts = num_experts
-
-        self.experts = nn.ModuleList([
-            FeedForward(dim, dropout = dropout, inner_dim = ff_inner_dim, operations = operations, device = device, dtype = dtype)
-            for _ in range(num_experts)
-        ])
-
-        self.gate = MoEGate(dim, num_experts = num_experts, num_experts_per_tok = moe_top_k, device = device, dtype = dtype)
-        self.shared_experts = FeedForward(dim, dropout = dropout, inner_dim = ff_inner_dim, operations = operations, device = device, dtype = dtype)
-
-    def forward(self, hidden_states) -> torch.Tensor:
-
-        identity = hidden_states
-        orig_shape = hidden_states.shape
-        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
-
-        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
-        flat_topk_idx = topk_idx.view(-1)
-
-        if self.training:
-
-            hidden_states = hidden_states.repeat_interleave(self.moe_top_k, dim = 0)
-            y = torch.empty_like(hidden_states, dtype = hidden_states.dtype)
-
-            for i, expert in enumerate(self.experts):
-                tmp = expert(hidden_states[flat_topk_idx == i])
-                y[flat_topk_idx == i] = tmp.to(hidden_states.dtype)
-
-            y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim = 1)
-            y =  y.view(*orig_shape)
-
-            y = AddAuxLoss.apply(y, aux_loss)
-        else:
-            y = self.moe_infer(hidden_states, flat_expert_indices = flat_topk_idx,flat_expert_weights = topk_weight.view(-1, 1)).view(*orig_shape)
-
-        y = y + self.shared_experts(identity)
-
-        return y
-
-    @torch.no_grad()
-    def moe_infer(self, x, flat_expert_indices, flat_expert_weights):
-
-        expert_cache = torch.zeros_like(x)
-        idxs = flat_expert_indices.argsort()
-
-        # no need for .numpy().cpu() here
-        tokens_per_expert = flat_expert_indices.bincount().cumsum(0)
-        token_idxs = idxs // self.moe_top_k
-
-        for i, end_idx in enumerate(tokens_per_expert):
-
-            start_idx = 0 if i == 0 else tokens_per_expert[i-1]
-
-            if start_idx == end_idx:
-                continue
-
-            expert = self.experts[i]
-            exp_token_idx = token_idxs[start_idx:end_idx]
-
-            expert_tokens = x[exp_token_idx]
-            expert_out = expert(expert_tokens)
-
-            expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]])
-
-            # using index_add_ with a 1-D index tensor avoids building a large [N, D] index map and the extra memcopy required by scatter_reduce_,
-            # and also avoids a dtype conversion
-            expert_cache.index_add_(0, exp_token_idx, expert_out)
-
-        return expert_cache
-
-class Timesteps(nn.Module):
-    def __init__(self, num_channels: int, downscale_freq_shift: float = 0.0,
-                 scale: float = 1.0, max_period: int = 10000):
-        super().__init__()
-
-        self.num_channels = num_channels
-        half_dim = num_channels // 2
-
-        # precompute the “inv_freq” vector once
-        exponent = -math.log(max_period) * torch.arange(
-            half_dim, dtype=torch.float32
-        ) / (half_dim - downscale_freq_shift)
-
-        inv_freq = torch.exp(exponent)
-
-        # pad
-        if num_channels % 2 == 1:
-            # we’ll pad a zero at the end of the cos-half
-            inv_freq = torch.cat([inv_freq, inv_freq.new_zeros(1)])
-
-        # register to buffer so it moves with the device
-        self.register_buffer("inv_freq", inv_freq, persistent = False)
-        self.scale = scale
-
-    def forward(self, timesteps: torch.Tensor):
-
-        x = timesteps.float().unsqueeze(1) * self.inv_freq.to(timesteps.device).unsqueeze(0)
-
-
-        # fused CUDA kernels for sin and cos
-        sin_emb = x.sin()
-        cos_emb = x.cos()
-
-        emb = torch.cat([sin_emb, cos_emb], dim = 1)
-
-        # scale factor
-        if self.scale != 1.0:
-            emb = emb * self.scale
-
-        # If inv_freq was padded for an odd num_channels, emb is one column too wide; trim it
-        if emb.shape[1] > self.num_channels:
-            emb = emb[:, :self.num_channels]
-
-        return emb
-
-class TimestepEmbedder(nn.Module):
-    def __init__(self, hidden_size, frequency_embedding_size = 256, cond_proj_dim = None, operations = None, device = None, dtype = None):
-        super().__init__()
-
-        self.mlp = nn.Sequential(
-            operations.Linear(hidden_size, frequency_embedding_size, bias=True, device = device, dtype = dtype),
-            nn.GELU(),
-            operations.Linear(frequency_embedding_size, hidden_size, bias=True, device = device, dtype = dtype),
-        )
-        self.frequency_embedding_size = frequency_embedding_size
-
-        if cond_proj_dim is not None:
-            self.cond_proj = operations.Linear(cond_proj_dim, frequency_embedding_size, bias=False, device = device, dtype = dtype)
-
-        self.time_embed = Timesteps(hidden_size)
-
-    def forward(self, timesteps, condition):
-
-        timestep_embed = self.time_embed(timesteps).type(self.mlp[0].weight.dtype)
-
-        if condition is not None:
-            cond_embed = self.cond_proj(condition)
-            timestep_embed = timestep_embed + cond_embed
-
-        time_conditioned = self.mlp(timestep_embed)
-
-        # for broadcasting with image tokens
-        return time_conditioned.unsqueeze(1)
-
-class MLP(nn.Module):
-    def __init__(self, *, width: int, operations = None, device = None, dtype = None):
-        super().__init__()
-        self.width = width
-        self.fc1 = operations.Linear(width, width * 4, device = device, dtype = dtype)
-        self.fc2 = operations.Linear(width * 4, width, device = device, dtype = dtype)
-        self.gelu = nn.GELU()
-
-    def forward(self, x):
-        return self.fc2(self.gelu(self.fc1(x)))
-
-class CrossAttention(nn.Module):
-    def __init__(
-        self,
-        qdim,
-        kdim,
-        num_heads,
-        qkv_bias=True,
-        qk_norm=False,
-        norm_layer=nn.LayerNorm,
-        use_fp16: bool = False,
-        operations = None,
-        dtype = None,
-        device = None,
-        **kwargs,
-    ):
-        super().__init__()
-        self.qdim = qdim
-        self.kdim = kdim
-
-        self.num_heads = num_heads
-        self.head_dim = self.qdim // num_heads
-
-        self.scale = self.head_dim ** -0.5
-
-        self.to_q = operations.Linear(qdim, qdim, bias=qkv_bias, device = device, dtype = dtype)
-        self.to_k = operations.Linear(kdim, qdim, bias=qkv_bias, device = device, dtype = dtype)
-        self.to_v = operations.Linear(kdim, qdim, bias=qkv_bias, device = device, dtype = dtype)
-
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        if norm_layer == nn.LayerNorm:
-            norm_layer = operations.LayerNorm
-        else:
-            norm_layer = operations.RMSNorm
-
-        self.q_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.k_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.out_proj = operations.Linear(qdim, qdim, bias=True, device = device, dtype = dtype)
-
-    def forward(self, x, y):
-
-        b, s1, _ = x.shape
-        _, s2, _ = y.shape
-
-        y = y.to(next(self.to_k.parameters()).dtype)
-
-        q = self.to_q(x)
-        k = self.to_k(y)
-        v = self.to_v(y)
-
-        kv = torch.cat((k, v), dim=-1)
-        split_size = kv.shape[-1] // self.num_heads // 2
-
-        kv = kv.view(1, -1, self.num_heads, split_size * 2)
-        k, v = torch.split(kv, split_size, dim=-1)
-
-        q = q.view(b, s1, self.num_heads, self.head_dim)
-        k = k.view(b, s2, self.num_heads, self.head_dim)
-        v = v.reshape(b, s2, self.num_heads * self.head_dim)
-
-        q = self.q_norm(q)
-        k = self.k_norm(k)
-
-        x = optimized_attention(
-            q.reshape(b, s1, self.num_heads * self.head_dim),
-            k.reshape(b, s2, self.num_heads * self.head_dim),
-            v,
-            heads=self.num_heads,
-        )
-
-        out = self.out_proj(x)
-
-        return out
-
-class Attention(nn.Module):
-
-    def __init__(
-        self,
-        dim,
-        num_heads,
-        qkv_bias = True,
-        qk_norm = False,
-        norm_layer = nn.LayerNorm,
-        use_fp16: bool = False,
-        operations = None,
-        device = None,
-        dtype = None
-    ):
-        super().__init__()
-        self.dim = dim
-        self.num_heads = num_heads
-        self.head_dim = self.dim // num_heads
-        self.scale = self.head_dim ** -0.5
-
-        self.to_q = operations.Linear(dim, dim, bias = qkv_bias, device = device, dtype = dtype)
-        self.to_k = operations.Linear(dim, dim, bias = qkv_bias, device = device, dtype = dtype)
-        self.to_v = operations.Linear(dim, dim, bias = qkv_bias, device = device, dtype = dtype)
-
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        if norm_layer == nn.LayerNorm:
-            norm_layer = operations.LayerNorm
-        else:
-            norm_layer = operations.RMSNorm
-
-        self.q_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.k_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.out_proj = operations.Linear(dim, dim, device = device, dtype = dtype)
-
-    def forward(self, x):
-        B, N, _ = x.shape
-
-        query = self.to_q(x)
-        key = self.to_k(x)
-        value = self.to_v(x)
-
-        qkv_combined = torch.cat((query, key, value), dim=-1)
-        split_size = qkv_combined.shape[-1] // self.num_heads // 3
-
-        qkv = qkv_combined.view(1, -1, self.num_heads, split_size * 3)
-        query, key, value = torch.split(qkv, split_size, dim=-1)
-
-        query = query.reshape(B, N, self.num_heads, self.head_dim)
-        key = key.reshape(B, N, self.num_heads, self.head_dim)
-        value = value.reshape(B, N, self.num_heads * self.head_dim)
-
-        query = self.q_norm(query)
-        key = self.k_norm(key)
-
-        x = optimized_attention(
-            query.reshape(B, N, self.num_heads * self.head_dim),
-            key.reshape(B, N, self.num_heads * self.head_dim),
-            value,
-            heads=self.num_heads,
-        )
-
-        x = self.out_proj(x)
-        return x
-
-class HunYuanDiTBlock(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        c_emb_size,
-        num_heads,
-        text_states_dim=1024,
-        qk_norm=False,
-        norm_layer=nn.LayerNorm,
-        qk_norm_layer=True,
-        qkv_bias=True,
-        skip_connection=True,
-        timested_modulate=False,
-        use_moe: bool = False,
-        num_experts: int = 8,
-        moe_top_k: int = 2,
-        use_fp16: bool = False,
-        operations = None,
-        device = None, dtype = None
-    ):
-        super().__init__()
-
-        # eps can't be 1e-6 in fp16 mode because of numerical stability issues
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        self.norm1 = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-
-        self.attn1 = Attention(hidden_size, num_heads=num_heads, qkv_bias=qkv_bias, qk_norm=qk_norm,
-                               norm_layer=qk_norm_layer, use_fp16 = use_fp16, device = device, dtype = dtype, operations = operations)
-
-        self.norm2 = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-
-        self.timested_modulate = timested_modulate
-        if self.timested_modulate:
-            self.default_modulation = nn.Sequential(
-                nn.SiLU(),
-                operations.Linear(c_emb_size, hidden_size, bias=True, device = device, dtype = dtype)
-            )
-
-        self.attn2 = CrossAttention(hidden_size, text_states_dim, num_heads=num_heads, qkv_bias=qkv_bias,
-                                    qk_norm=qk_norm, norm_layer=qk_norm_layer, use_fp16 = use_fp16,
-                                    device = device, dtype = dtype, operations = operations)
-
-        self.norm3 = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-
-        if skip_connection:
-            self.skip_norm = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-            self.skip_linear = operations.Linear(2 * hidden_size, hidden_size, device = device, dtype = dtype)
-        else:
-            self.skip_linear = None
-
-        self.use_moe = use_moe
-
-        if self.use_moe:
-            self.moe = MoEBlock(
-                hidden_size,
-                num_experts = num_experts,
-                moe_top_k = moe_top_k,
-                dropout = 0.0,
-                ff_inner_dim = int(hidden_size * 4.0),
-                device = device, dtype = dtype,
-                operations = operations
-            )
-        else:
-            self.mlp = MLP(width=hidden_size, operations=operations, device = device, dtype = dtype)
-
-    def forward(self, hidden_states, conditioning=None, text_states=None, skip_tensor=None):
-
-        if self.skip_linear is not None:
-            combined = torch.cat([skip_tensor, hidden_states], dim=-1)
-            hidden_states = self.skip_linear(combined)
-            hidden_states = self.skip_norm(hidden_states)
-
-        # self attention
-        if self.timested_modulate:
-            modulation_shift = self.default_modulation(conditioning).unsqueeze(dim=1)
-            hidden_states = hidden_states + modulation_shift
-
-        self_attn_out = self.attn1(self.norm1(hidden_states))
-        hidden_states = hidden_states + self_attn_out
-
-        # cross attention
-        hidden_states = hidden_states + self.attn2(self.norm2(hidden_states), text_states)
-
-        # MLP Layer
-        mlp_input = self.norm3(hidden_states)
-
-        if self.use_moe:
-            hidden_states = hidden_states + self.moe(mlp_input)
-        else:
-            hidden_states = hidden_states + self.mlp(mlp_input)
-
-        return hidden_states
-
-class FinalLayer(nn.Module):
-
-    def __init__(self, final_hidden_size, out_channels, operations, use_fp16: bool = False, device = None, dtype = None):
-        super().__init__()
-
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        self.norm_final = operations.LayerNorm(final_hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-        self.linear = operations.Linear(final_hidden_size, out_channels, bias = True, device = device, dtype = dtype)
-
-    def forward(self, x):
-        x = self.norm_final(x)
-        x = x[:, 1:]
-        x = self.linear(x)
-        return x
-
-class HunYuanDiTPlain(nn.Module):
-
-    # init with the defaults values from https://huggingface.co/tencent/Hunyuan3D-2.1/blob/main/hunyuan3d-dit-v2-1/config.yaml
-    def __init__(
-        self,
-        in_channels: int = 64,
-        hidden_size: int = 2048,
-        context_dim: int = 1024,
-        depth: int = 21,
-        num_heads: int = 16,
-        qk_norm: bool = True,
-        qkv_bias: bool = False,
-        num_moe_layers: int = 6,
-        guidance_cond_proj_dim = 2048,
-        norm_type = 'layer',
-        num_experts: int = 8,
-        moe_top_k: int = 2,
-        use_fp16: bool = False,
-        dtype = None,
-        device = None,
-        operations = None,
-        **kwargs
-        ):
-
-        self.dtype = dtype
-
-        super().__init__()
-
-        self.depth = depth
-
-        self.in_channels = in_channels
-        self.out_channels = in_channels
-
-        self.num_heads = num_heads
-        self.hidden_size = hidden_size
-
-        norm = operations.LayerNorm if norm_type == 'layer' else operations.RMSNorm
-        qk_norm = operations.RMSNorm
-
-        self.context_dim = context_dim
-        self.guidance_cond_proj_dim = guidance_cond_proj_dim
-
-        self.x_embedder = operations.Linear(in_channels, hidden_size, bias = True, device = device, dtype = dtype)
-        self.t_embedder = TimestepEmbedder(hidden_size, hidden_size * 4, cond_proj_dim = guidance_cond_proj_dim, device = device, dtype = dtype, operations = operations)
-
-
-        # HunYuanDiT blocks
-        self.blocks = nn.ModuleList([
-            HunYuanDiTBlock(hidden_size=hidden_size,
-                            c_emb_size=hidden_size,
-                            num_heads=num_heads,
-                            text_states_dim=context_dim,
-                            qk_norm=qk_norm,
-                            norm_layer = norm,
-                            qk_norm_layer = qk_norm,
-                            skip_connection=layer > depth // 2,
-                            qkv_bias=qkv_bias,
-                            use_moe=True if depth - layer <= num_moe_layers else False,
-                            num_experts=num_experts,
-                            moe_top_k=moe_top_k,
-                            use_fp16 = use_fp16,
-                            device = device, dtype = dtype, operations = operations)
-            for layer in range(depth)
-        ])
-
-        self.depth = depth
-
-        self.final_layer = FinalLayer(hidden_size, self.out_channels, use_fp16 = use_fp16, operations = operations, device = device, dtype = dtype)
-
-    def forward(self, x, t, context, transformer_options = {}, **kwargs):
-
-        x = x.movedim(-1, -2)
-        uncond_emb, cond_emb = context.chunk(2, dim = 0)
-
-        context = torch.cat([cond_emb, uncond_emb], dim = 0)
-        main_condition = context
-
-        t = 1.0 - t
-
-        time_embedded = self.t_embedder(t, condition = kwargs.get('guidance_cond'))
-
-        x = x.to(dtype = next(self.x_embedder.parameters()).dtype)
-        x_embedded = self.x_embedder(x)
-
-        combined = torch.cat([time_embedded, x_embedded], dim=1)
-
-        def block_wrap(args):
-            return block(
-                args["x"],
-                args["t"],
-                args["cond"],
-                skip_tensor=args.get("skip"),)
-
-        skip_stack = []
-        patches_replace = transformer_options.get("patches_replace", {})
-        blocks_replace = patches_replace.get("dit", {})
-        for idx, block in enumerate(self.blocks):
-            if idx <= self.depth // 2:
-                skip_input = None
-            else:
-                skip_input = skip_stack.pop()
-
-            if ("block", idx) in blocks_replace:
-
-                combined = blocks_replace[("block", idx)](
-                    {
-                        "x": combined,
-                        "t": time_embedded,
-                        "cond": main_condition,
-                        "skip": skip_input,
-                    },
-                    {"original_block": block_wrap},
-                )
-            else:
-                combined = block(combined, time_embedded, main_condition, skip_tensor=skip_input)
-
-            if idx < self.depth // 2:
-                skip_stack.append(combined)
-
-        output = self.final_layer(combined)
-        output =  output.movedim(-2, -1) * (-1.0)
-
-        cond_emb, uncond_emb = output.chunk(2, dim = 0)
-        return torch.cat([uncond_emb, cond_emb])
--- a/comfy/ldm/hunyuan_video/model.py
+++ b/comfy/ldm/hunyuan_video/model.py
@@ -40,8 +40,6 @@ class HunyuanVideoParams:
    patch_size: list
    qkv_bias: bool
    guidance_embed: bool
-    byt5: bool
-    meanflow: bool


 class SelfAttentionRef(nn.Module):
@@ -164,30 +162,6 @@ class TokenRefiner(nn.Module):
        x = self.individual_token_refiner(x, c, mask, transformer_options=transformer_options)
        return x

-
-class ByT5Mapper(nn.Module):
-    def __init__(self, in_dim, out_dim, hidden_dim, out_dim1, use_res=False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.layernorm = operations.LayerNorm(in_dim, dtype=dtype, device=device)
-        self.fc1 = operations.Linear(in_dim, hidden_dim, dtype=dtype, device=device)
-        self.fc2 = operations.Linear(hidden_dim, out_dim, dtype=dtype, device=device)
-        self.fc3 = operations.Linear(out_dim, out_dim1, dtype=dtype, device=device)
-        self.use_res = use_res
-        self.act_fn = nn.GELU()
-
-    def forward(self, x):
-        if self.use_res:
-            res = x
-        x = self.layernorm(x)
-        x = self.fc1(x)
-        x = self.act_fn(x)
-        x = self.fc2(x)
-        x2 = self.act_fn(x)
-        x2 = self.fc3(x2)
-        if self.use_res:
-            x2 = x2 + res
-        return x2
-
 class HunyuanVideo(nn.Module):
    """
    Transformer model for flow matching on sequences.
@@ -212,13 +186,9 @@ class HunyuanVideo(nn.Module):
        self.num_heads = params.num_heads
        self.pe_embedder = EmbedND(dim=pe_dim, theta=params.theta, axes_dim=params.axes_dim)

-        self.img_in = comfy.ldm.modules.diffusionmodules.mmdit.PatchEmbed(None, self.patch_size, self.in_channels, self.hidden_size, conv3d=len(self.patch_size) == 3, dtype=dtype, device=device, operations=operations)
+        self.img_in = comfy.ldm.modules.diffusionmodules.mmdit.PatchEmbed(None, self.patch_size, self.in_channels, self.hidden_size, conv3d=True, dtype=dtype, device=device, operations=operations)
        self.time_in = MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, dtype=dtype, device=device, operations=operations)
-        if params.vec_in_dim is not None:
-            self.vector_in = MLPEmbedder(params.vec_in_dim, self.hidden_size, dtype=dtype, device=device, operations=operations)
-        else:
-            self.vector_in = None
-
+        self.vector_in = MLPEmbedder(params.vec_in_dim, self.hidden_size, dtype=dtype, device=device, operations=operations)
        self.guidance_in = (
            MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, dtype=dtype, device=device, operations=operations) if params.guidance_embed else nn.Identity()
        )
@@ -246,23 +216,6 @@ class HunyuanVideo(nn.Module):
            ]
        )

-        if params.byt5:
-            self.byt5_in = ByT5Mapper(
-                in_dim=1472,
-                out_dim=2048,
-                hidden_dim=2048,
-                out_dim1=self.hidden_size,
-                use_res=False,
-                dtype=dtype, device=device, operations=operations
-            )
-        else:
-            self.byt5_in = None
-
-        if params.meanflow:
-            self.time_r_in = MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, dtype=dtype, device=device, operations=operations)
-        else:
-            self.time_r_in = None
-
        if final_layer:
            self.final_layer = LastLayer(self.hidden_size, self.patch_size[-1], self.out_channels, dtype=dtype, device=device, operations=operations)

@@ -274,12 +227,10 @@ class HunyuanVideo(nn.Module):
        txt_ids: Tensor,
        txt_mask: Tensor,
        timesteps: Tensor,
-        y: Tensor = None,
-        txt_byt5=None,
+        y: Tensor,
        guidance: Tensor = None,
        guiding_frame_index=None,
        ref_latent=None,
-        disable_time_r=False,
        control=None,
        transformer_options={},
    ) -> Tensor:
@@ -290,14 +241,6 @@ class HunyuanVideo(nn.Module):
        img = self.img_in(img)
        vec = self.time_in(timestep_embedding(timesteps, 256, time_factor=1.0).to(img.dtype))

-        if (self.time_r_in is not None) and (not disable_time_r):
-            w = torch.where(transformer_options['sigmas'][0] == transformer_options['sample_sigmas'])[0]  # This most likely could be improved
-            if len(w) > 0:
-                timesteps_r = transformer_options['sample_sigmas'][w[0] + 1]
-                timesteps_r = timesteps_r.unsqueeze(0).to(device=timesteps.device, dtype=timesteps.dtype)
-                vec_r = self.time_r_in(timestep_embedding(timesteps_r, 256, time_factor=1000.0).to(img.dtype))
-                vec = (vec + vec_r) / 2
-
        if ref_latent is not None:
            ref_latent_ids = self.img_ids(ref_latent)
            ref_latent = self.img_in(ref_latent)
@@ -308,17 +251,13 @@ class HunyuanVideo(nn.Module):

        if guiding_frame_index is not None:
            token_replace_vec = self.time_in(timestep_embedding(guiding_frame_index, 256, time_factor=1.0))
-            if self.vector_in is not None:
-                vec_ = self.vector_in(y[:, :self.params.vec_in_dim])
-                vec = torch.cat([(vec_ + token_replace_vec).unsqueeze(1), (vec_ + vec).unsqueeze(1)], dim=1)
-            else:
-                vec = torch.cat([(token_replace_vec).unsqueeze(1), (vec).unsqueeze(1)], dim=1)
+            vec_ = self.vector_in(y[:, :self.params.vec_in_dim])
+            vec = torch.cat([(vec_ + token_replace_vec).unsqueeze(1), (vec_ + vec).unsqueeze(1)], dim=1)
            frame_tokens = (initial_shape[-1] // self.patch_size[-1]) * (initial_shape[-2] // self.patch_size[-2])
            modulation_dims = [(0, frame_tokens, 0), (frame_tokens, None, 1)]
            modulation_dims_txt = [(0, None, 1)]
        else:
-            if self.vector_in is not None:
-                vec = vec + self.vector_in(y[:, :self.params.vec_in_dim])
+            vec = vec + self.vector_in(y[:, :self.params.vec_in_dim])
            modulation_dims = None
            modulation_dims_txt = None

@@ -331,12 +270,6 @@ class HunyuanVideo(nn.Module):

        txt = self.txt_in(txt, timesteps, txt_mask, transformer_options=transformer_options)

-        if self.byt5_in is not None and txt_byt5 is not None:
-            txt_byt5 = self.byt5_in(txt_byt5)
-            txt_byt5_ids = torch.zeros((txt_ids.shape[0], txt_byt5.shape[1], txt_ids.shape[-1]), device=txt_ids.device, dtype=txt_ids.dtype)
-            txt = torch.cat((txt, txt_byt5), dim=1)
-            txt_ids = torch.cat((txt_ids, txt_byt5_ids), dim=1)
-
        ids = torch.cat((img_ids, txt_ids), dim=1)
        pe = self.pe_embedder(ids)

@@ -396,16 +329,12 @@ class HunyuanVideo(nn.Module):

        img = self.final_layer(img, vec, modulation_dims=modulation_dims)  # (N, T, patch_size ** 2 * out_channels)

-        shape = initial_shape[-len(self.patch_size):]
+        shape = initial_shape[-3:]
        for i in range(len(shape)):
            shape[i] = shape[i] // self.patch_size[i]
        img = img.reshape([img.shape[0]] + shape + [self.out_channels] + self.patch_size)
-        if img.ndim == 8:
-            img = img.permute(0, 4, 1, 5, 2, 6, 3, 7)
-            img = img.reshape(initial_shape[0], self.out_channels, initial_shape[2], initial_shape[3], initial_shape[4])
-        else:
-            img = img.permute(0, 3, 1, 4, 2, 5)
-            img = img.reshape(initial_shape[0], self.out_channels, initial_shape[2], initial_shape[3])
+        img = img.permute(0, 4, 1, 5, 2, 6, 3, 7)
+        img = img.reshape(initial_shape[0], self.out_channels, initial_shape[2], initial_shape[3], initial_shape[4])
        return img

    def img_ids(self, x):
@@ -420,30 +349,16 @@ class HunyuanVideo(nn.Module):
        img_ids[:, :, :, 2] = img_ids[:, :, :, 2] + torch.linspace(0, w_len - 1, steps=w_len, device=x.device, dtype=x.dtype).reshape(1, 1, -1)
        return repeat(img_ids, "t h w c -> b (t h w) c", b=bs)

-    def img_ids_2d(self, x):
-        bs, c, h, w = x.shape
-        patch_size = self.patch_size
-        h_len = ((h + (patch_size[0] // 2)) // patch_size[0])
-        w_len = ((w + (patch_size[1] // 2)) // patch_size[1])
-        img_ids = torch.zeros((h_len, w_len, 2), device=x.device, dtype=x.dtype)
-        img_ids[:, :, 0] = img_ids[:, :, 0] + torch.linspace(0, h_len - 1, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1)
-        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(0, w_len - 1, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0)
-        return repeat(img_ids, "h w c -> b (h w) c", b=bs)
-
-    def forward(self, x, timestep, context, y=None, txt_byt5=None, guidance=None, attention_mask=None, guiding_frame_index=None, ref_latent=None, disable_time_r=False, control=None, transformer_options={}, **kwargs):
+    def forward(self, x, timestep, context, y, guidance=None, attention_mask=None, guiding_frame_index=None, ref_latent=None, control=None, transformer_options={}, **kwargs):
        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
            self._forward,
            self,
            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, y, txt_byt5, guidance, attention_mask, guiding_frame_index, ref_latent, disable_time_r, control, transformer_options, **kwargs)
+        ).execute(x, timestep, context, y, guidance, attention_mask, guiding_frame_index, ref_latent, control, transformer_options, **kwargs)

-    def _forward(self, x, timestep, context, y=None, txt_byt5=None, guidance=None, attention_mask=None, guiding_frame_index=None, ref_latent=None, disable_time_r=False, control=None, transformer_options={}, **kwargs):
-        bs = x.shape[0]
-        if len(self.patch_size) == 3:
-            img_ids = self.img_ids(x)
-            txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
-        else:
-            img_ids = self.img_ids_2d(x)
-            txt_ids = torch.zeros((bs, context.shape[1], 2), device=x.device, dtype=x.dtype)
-        out = self.forward_orig(x, img_ids, context, txt_ids, attention_mask, timestep, y, txt_byt5, guidance, guiding_frame_index, ref_latent, disable_time_r=disable_time_r, control=control, transformer_options=transformer_options)
+    def _forward(self, x, timestep, context, y, guidance=None, attention_mask=None, guiding_frame_index=None, ref_latent=None, control=None, transformer_options={}, **kwargs):
+        bs, c, t, h, w = x.shape
+        img_ids = self.img_ids(x)
+        txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
+        out = self.forward_orig(x, img_ids, context, txt_ids, attention_mask, timestep, y, guidance, guiding_frame_index, ref_latent, control=control, transformer_options=transformer_options)
        return out
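Review note on the hunks above: with the 2D branch removed, the output is always reshaped through a fixed 8-dimensional layout. The sketch below traces only the shapes through that reshape/permute/reshape sequence, in plain Python without tensors. The function name `unpatchify_shapes` and the example sizes (a 16-channel latent with patch size `[1, 2, 2]`) are assumptions for illustration and are not taken from this diff. It also assumes every latent dimension is divisible by its patch size.

```python
def unpatchify_shapes(latent_shape, patch_size, out_channels):
    """Trace shapes for the 3D-only unpatchify path (hypothetical helper).

    latent_shape: (B, C, T, H, W) of the input latent
    patch_size:   [pt, ph, pw]
    """
    b = latent_shape[0]
    # Token grid: one entry per (T, H, W) axis, as in `shape[i] // self.patch_size[i]`.
    grid = [dim // p for dim, p in zip(latent_shape[-3:], patch_size)]
    # Tokens belonging to a single frame, as used for `modulation_dims`.
    frame_tokens = grid[1] * grid[2]
    # Shape after the first reshape: [B, t, h, w, C, pt, ph, pw].
    pre = [b] + grid + [out_channels] + list(patch_size)
    # The diff permutes with (0, 4, 1, 5, 2, 6, 3, 7): [B, C, t, pt, h, ph, w, pw].
    order = (0, 4, 1, 5, 2, 6, 3, 7)
    permuted = [pre[i] for i in order]
    # Adjacent (grid, patch) pairs are then merged back into T, H, W.
    final = [b, out_channels] + [g * p for g, p in zip(grid, patch_size)]
    return frame_tokens, pre, permuted, final


frame_tokens, pre, permuted, final = unpatchify_shapes((1, 16, 9, 60, 104), [1, 2, 2], 16)
print(frame_tokens, pre, permuted, final)
```

Because each grid axis ends up next to its patch axis after the permute, the final reshape can merge them pairwise. That is why the index order interleaves 1/5, 2/6 and 3/7.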
--- a/comfy/ldm/hunyuan_video/vae.py
+++ b/comfy/ldm/hunyuan_video/vae.py
@@ -1,136 +0,0 @@
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.modules.diffusionmodules.model import ResnetBlock, AttnBlock
-import comfy.ops
-ops = comfy.ops.disable_weight_init
-
-
-class PixelShuffle2D(nn.Module):
-    def __init__(self, in_dim, out_dim, op=ops.Conv2d):
-        super().__init__()
-        self.conv = op(in_dim, out_dim >> 2, 3, 1, 1)
-        self.ratio = (in_dim << 2) // out_dim
-
-    def forward(self, x):
-        b, c, h, w = x.shape
-        h2, w2 = h >> 1, w >> 1
-        y = self.conv(x).view(b, -1, h2, 2, w2, 2).permute(0, 3, 5, 1, 2, 4).reshape(b, -1, h2, w2)
-        r = x.view(b, c, h2, 2, w2, 2).permute(0, 3, 5, 1, 2, 4).reshape(b, c << 2, h2, w2)
-        return y + r.view(b, y.shape[1], self.ratio, h2, w2).mean(2)
-
-
-class PixelUnshuffle2D(nn.Module):
-    def __init__(self, in_dim, out_dim, op=ops.Conv2d):
-        super().__init__()
-        self.conv = op(in_dim, out_dim << 2, 3, 1, 1)
-        self.scale = (out_dim << 2) // in_dim
-
-    def forward(self, x):
-        b, c, h, w = x.shape
-        h2, w2 = h << 1, w << 1
-        y = self.conv(x).view(b, 2, 2, -1, h, w).permute(0, 3, 4, 1, 5, 2).reshape(b, -1, h2, w2)
-        r = x.repeat_interleave(self.scale, 1).view(b, 2, 2, -1, h, w).permute(0, 3, 4, 1, 5, 2).reshape(b, -1, h2, w2)
-        return y + r
-
-
-class Encoder(nn.Module):
-    def __init__(self, in_channels, z_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, downsample_match_channel=True, **_):
-        super().__init__()
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-        self.conv_in = ops.Conv2d(in_channels, block_out_channels[0], 3, 1, 1)
-
-        self.down = nn.ModuleList()
-        ch = block_out_channels[0]
-        depth = (ffactor_spatial >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([ResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                     out_channels=tgt,
-                                                     temb_channels=0,
-                                                     conv_op=ops.Conv2d)
-                                        for j in range(num_res_blocks)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and downsample_match_channel else ch
-                stage.downsample = PixelShuffle2D(ch, nxt, ops.Conv2d)
-                ch = nxt
-            self.down.append(stage)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv2d)
-        self.mid.block_2 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-
-        self.norm_out = ops.GroupNorm(32, ch, 1e-6, True)
-        self.conv_out = ops.Conv2d(ch, z_channels << 1, 3, 1, 1)
-
-    def forward(self, x):
-        x = self.conv_in(x)
-
-        for stage in self.down:
-            for blk in stage.block:
-                x = blk(x)
-            if hasattr(stage, 'downsample'):
-                x = stage.downsample(x)
-
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(x)))
-
-        b, c, h, w = x.shape
-        grp = c // (self.z_channels << 1)
-        skip = x.view(b, c // grp, grp, h, w).mean(2)
-
-        return self.conv_out(F.silu(self.norm_out(x))) + skip
-
-
-class Decoder(nn.Module):
-    def __init__(self, z_channels, out_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, upsample_match_channel=True, **_):
-        super().__init__()
-        block_out_channels = block_out_channels[::-1]
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-
-        ch = block_out_channels[0]
-        self.conv_in = ops.Conv2d(z_channels, ch, 3, 1, 1)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv2d)
-        self.mid.block_2 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-
-        self.up = nn.ModuleList()
-        depth = (ffactor_spatial >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([ResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                     out_channels=tgt,
-                                                     temb_channels=0,
-                                                     conv_op=ops.Conv2d)
-                                        for j in range(num_res_blocks + 1)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and upsample_match_channel else ch
-                stage.upsample = PixelUnshuffle2D(ch, nxt, ops.Conv2d)
-                ch = nxt
-            self.up.append(stage)
-
-        self.norm_out = ops.GroupNorm(32, ch, 1e-6, True)
-        self.conv_out = ops.Conv2d(ch, out_channels, 3, 1, 1)
-
-    def forward(self, z):
-        x = self.conv_in(z) + z.repeat_interleave(self.block_out_channels[0] // self.z_channels, 1)
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(x)))
-
-        for stage in self.up:
-            for blk in stage.block:
-                x = blk(x)
-            if hasattr(stage, 'upsample'):
-                x = stage.upsample(x)
-
-        return self.conv_out(F.silu(self.norm_out(x)))
--- a/comfy/ldm/hunyuan_video/vae_refiner.py
+++ b/comfy/ldm/hunyuan_video/vae_refiner.py
@@ -1,301 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.modules.diffusionmodules.model import ResnetBlock, AttnBlock, VideoConv3d, Normalize
-import comfy.ops
-import comfy.ldm.models.autoencoder
-ops = comfy.ops.disable_weight_init
-
-class RMS_norm(nn.Module):
-    def __init__(self, dim):
-        super().__init__()
-        shape = (dim, 1, 1, 1)
-        self.scale = dim**0.5
-        self.gamma = nn.Parameter(torch.empty(shape))
-
-    def forward(self, x):
-        return F.normalize(x, dim=1) * self.scale * self.gamma
-
-class DnSmpl(nn.Module):
-    def __init__(self, ic, oc, tds=True, refiner_vae=True, op=VideoConv3d):
-        super().__init__()
-        fct = 2 * 2 * 2 if tds else 1 * 2 * 2
-        assert oc % fct == 0
-        self.conv = op(ic, oc // fct, kernel_size=3, stride=1, padding=1)
-        self.refiner_vae = refiner_vae
-
-        self.tds = tds
-        self.gs = fct * ic // oc
-
-    def forward(self, x):
-        r1 = 2 if self.tds else 1
-        h = self.conv(x)
-
-        if self.tds and self.refiner_vae:
-            hf = h[:, :, :1, :, :]
-            b, c, f, ht, wd = hf.shape
-            hf = hf.reshape(b, c, f, ht // 2, 2, wd // 2, 2)
-            hf = hf.permute(0, 4, 6, 1, 2, 3, 5)
-            hf = hf.reshape(b, 2 * 2 * c, f, ht // 2, wd // 2)
-            hf = torch.cat([hf, hf], dim=1)
-
-            hn = h[:, :, 1:, :, :]
-            b, c, frms, ht, wd = hn.shape
-            nf = frms // r1
-            hn = hn.reshape(b, c, nf, r1, ht // 2, 2, wd // 2, 2)
-            hn = hn.permute(0, 3, 5, 7, 1, 2, 4, 6)
-            hn = hn.reshape(b, r1 * 2 * 2 * c, nf, ht // 2, wd // 2)
-
-            h = torch.cat([hf, hn], dim=2)
-
-            xf = x[:, :, :1, :, :]
-            b, ci, f, ht, wd = xf.shape
-            xf = xf.reshape(b, ci, f, ht // 2, 2, wd // 2, 2)
-            xf = xf.permute(0, 4, 6, 1, 2, 3, 5)
-            xf = xf.reshape(b, 2 * 2 * ci, f, ht // 2, wd // 2)
-            B, C, T, H, W = xf.shape
-            xf = xf.view(B, h.shape[1], self.gs // 2, T, H, W).mean(dim=2)
-
-            xn = x[:, :, 1:, :, :]
-            b, ci, frms, ht, wd = xn.shape
-            nf = frms // r1
-            xn = xn.reshape(b, ci, nf, r1, ht // 2, 2, wd // 2, 2)
-            xn = xn.permute(0, 3, 5, 7, 1, 2, 4, 6)
-            xn = xn.reshape(b, r1 * 2 * 2 * ci, nf, ht // 2, wd // 2)
-            B, C, T, H, W = xn.shape
-            xn = xn.view(B, h.shape[1], self.gs, T, H, W).mean(dim=2)
-            sc = torch.cat([xf, xn], dim=2)
-        else:
-            b, c, frms, ht, wd = h.shape
-
-            nf = frms // r1
-            h = h.reshape(b, c, nf, r1, ht // 2, 2, wd // 2, 2)
-            h = h.permute(0, 3, 5, 7, 1, 2, 4, 6)
-            h = h.reshape(b, r1 * 2 * 2 * c, nf, ht // 2, wd // 2)
-
-            b, ci, frms, ht, wd = x.shape
-            nf = frms // r1
-            sc = x.reshape(b, ci, nf, r1, ht // 2, 2, wd // 2, 2)
-            sc = sc.permute(0, 3, 5, 7, 1, 2, 4, 6)
-            sc = sc.reshape(b, r1 * 2 * 2 * ci, nf, ht // 2, wd // 2)
-            B, C, T, H, W = sc.shape
-            sc = sc.view(B, h.shape[1], self.gs, T, H, W).mean(dim=2)
-
-        return h + sc
-
-
-class UpSmpl(nn.Module):
-    def __init__(self, ic, oc, tus=True, refiner_vae=True, op=VideoConv3d):
-        super().__init__()
-        fct = 2 * 2 * 2 if tus else 1 * 2 * 2
-        self.conv = op(ic, oc * fct, kernel_size=3, stride=1, padding=1)
-        self.refiner_vae = refiner_vae
-
-        self.tus = tus
-        self.rp = fct * oc // ic
-
-    def forward(self, x):
-        r1 = 2 if self.tus else 1
-        h = self.conv(x)
-
-        if self.tus and self.refiner_vae:
-            hf = h[:, :, :1, :, :]
-            b, c, f, ht, wd = hf.shape
-            nc = c // (2 * 2)
-            hf = hf.reshape(b, 2, 2, nc, f, ht, wd)
-            hf = hf.permute(0, 3, 4, 5, 1, 6, 2)
-            hf = hf.reshape(b, nc, f, ht * 2, wd * 2)
-            hf = hf[:, : hf.shape[1] // 2]
-
-            hn = h[:, :, 1:, :, :]
-            b, c, frms, ht, wd = hn.shape
-            nc = c // (r1 * 2 * 2)
-            hn = hn.reshape(b, r1, 2, 2, nc, frms, ht, wd)
-            hn = hn.permute(0, 4, 5, 1, 6, 2, 7, 3)
-            hn = hn.reshape(b, nc, frms * r1, ht * 2, wd * 2)
-
-            h = torch.cat([hf, hn], dim=2)
-
-            xf = x[:, :, :1, :, :]
-            b, ci, f, ht, wd = xf.shape
-            xf = xf.repeat_interleave(repeats=self.rp // 2, dim=1)
-            b, c, f, ht, wd = xf.shape
-            nc = c // (2 * 2)
-            xf = xf.reshape(b, 2, 2, nc, f, ht, wd)
-            xf = xf.permute(0, 3, 4, 5, 1, 6, 2)
-            xf = xf.reshape(b, nc, f, ht * 2, wd * 2)
-
-            xn = x[:, :, 1:, :, :]
-            xn = xn.repeat_interleave(repeats=self.rp, dim=1)
-            b, c, frms, ht, wd = xn.shape
-            nc = c // (r1 * 2 * 2)
-            xn = xn.reshape(b, r1, 2, 2, nc, frms, ht, wd)
-            xn = xn.permute(0, 4, 5, 1, 6, 2, 7, 3)
-            xn = xn.reshape(b, nc, frms * r1, ht * 2, wd * 2)
-            sc = torch.cat([xf, xn], dim=2)
-        else:
-            b, c, frms, ht, wd = h.shape
-            nc = c // (r1 * 2 * 2)
-            h = h.reshape(b, r1, 2, 2, nc, frms, ht, wd)
-            h = h.permute(0, 4, 5, 1, 6, 2, 7, 3)
-            h = h.reshape(b, nc, frms * r1, ht * 2, wd * 2)
-
-            sc = x.repeat_interleave(repeats=self.rp, dim=1)
-            b, c, frms, ht, wd = sc.shape
-            nc = c // (r1 * 2 * 2)
-            sc = sc.reshape(b, r1, 2, 2, nc, frms, ht, wd)
-            sc = sc.permute(0, 4, 5, 1, 6, 2, 7, 3)
-            sc = sc.reshape(b, nc, frms * r1, ht * 2, wd * 2)
-
-        return h + sc
-
-class Encoder(nn.Module):
-    def __init__(self, in_channels, z_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, ffactor_temporal, downsample_match_channel=True, refiner_vae=True, **_):
-        super().__init__()
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-        self.ffactor_temporal = ffactor_temporal
-
-        self.refiner_vae = refiner_vae
-        if self.refiner_vae:
-            conv_op = VideoConv3d
-            norm_op = RMS_norm
-        else:
-            conv_op = ops.Conv3d
-            norm_op = Normalize
-
-        self.conv_in = conv_op(in_channels, block_out_channels[0], 3, 1, 1)
-
-        self.down = nn.ModuleList()
-        ch = block_out_channels[0]
-        depth = (ffactor_spatial >> 1).bit_length()
-        depth_temporal = ((ffactor_spatial // self.ffactor_temporal) >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([ResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                     out_channels=tgt,
-                                                     temb_channels=0,
-                                                     conv_op=conv_op, norm_op=norm_op)
-                                        for j in range(num_res_blocks)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and downsample_match_channel else ch
-                stage.downsample = DnSmpl(ch, nxt, tds=i >= depth_temporal, refiner_vae=self.refiner_vae, op=conv_op)
-                ch = nxt
-            self.down.append(stage)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=conv_op, norm_op=norm_op)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv3d, norm_op=norm_op)
-        self.mid.block_2 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=conv_op, norm_op=norm_op)
-
-        self.norm_out = norm_op(ch)
-        self.conv_out = conv_op(ch, z_channels << 1, 3, 1, 1)
-
-        self.regul = comfy.ldm.models.autoencoder.DiagonalGaussianRegularizer()
-
-    def forward(self, x):
-        if not self.refiner_vae and x.shape[2] == 1:
-            x = x.expand(-1, -1, self.ffactor_temporal, -1, -1)
-
-        x = self.conv_in(x)
-
-        for stage in self.down:
-            for blk in stage.block:
-                x = blk(x)
-            if hasattr(stage, 'downsample'):
-                x = stage.downsample(x)
-
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(x)))
-
-        b, c, t, h, w = x.shape
-        grp = c // (self.z_channels << 1)
-        skip = x.view(b, c // grp, grp, t, h, w).mean(2)
-
-        out = self.conv_out(F.silu(self.norm_out(x))) + skip
-
-        if self.refiner_vae:
-            out = self.regul(out)[0]
-
-            out = torch.cat((out[:, :, :1], out), dim=2)
-            out = out.permute(0, 2, 1, 3, 4)
-            b, f_times_2, c, h, w = out.shape
-            out = out.reshape(b, f_times_2 // 2, 2 * c, h, w)
-            out = out.permute(0, 2, 1, 3, 4).contiguous()
-
-        return out
-
-class Decoder(nn.Module):
-    def __init__(self, z_channels, out_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, ffactor_temporal, upsample_match_channel=True, refiner_vae=True, **_):
-        super().__init__()
-        block_out_channels = block_out_channels[::-1]
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-
-        self.refiner_vae = refiner_vae
-        if self.refiner_vae:
-            conv_op = VideoConv3d
-            norm_op = RMS_norm
-        else:
-            conv_op = ops.Conv3d
-            norm_op = Normalize
-
-        ch = block_out_channels[0]
-        self.conv_in = conv_op(z_channels, ch, kernel_size=3, stride=1, padding=1)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=conv_op, norm_op=norm_op)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv3d, norm_op=norm_op)
-        self.mid.block_2 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=conv_op, norm_op=norm_op)
-
-        self.up = nn.ModuleList()
-        depth = (ffactor_spatial >> 1).bit_length()
-        depth_temporal = (ffactor_temporal >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([ResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                     out_channels=tgt,
-                                                     temb_channels=0,
-                                                     conv_op=conv_op, norm_op=norm_op)
-                                        for j in range(num_res_blocks + 1)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and upsample_match_channel else ch
-                stage.upsample = UpSmpl(ch, nxt, tus=i < depth_temporal, refiner_vae=self.refiner_vae, op=conv_op)
-                ch = nxt
-            self.up.append(stage)
-
-        self.norm_out = norm_op(ch)
-        self.conv_out = conv_op(ch, out_channels, 3, stride=1, padding=1)
-
-    def forward(self, z):
-        if self.refiner_vae:
-            z = z.permute(0, 2, 1, 3, 4)
-            b, f, c, h, w = z.shape
-            z = z.reshape(b, f, 2, c // 2, h, w)
-            z = z.permute(0, 1, 2, 3, 4, 5).reshape(b, f * 2, c // 2, h, w)
-            z = z.permute(0, 2, 1, 3, 4)
-            z = z[:, :, 1:]
-
-        x = self.conv_in(z) + z.repeat_interleave(self.block_out_channels[0] // self.z_channels, 1)
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(x)))
-
-        for stage in self.up:
-            for blk in stage.block:
-                x = blk(x)
-            if hasattr(stage, 'upsample'):
-                x = stage.upsample(x)
-
-        out = self.conv_out(F.silu(self.norm_out(x)))
-
-        if not self.refiner_vae:
-            if z.shape[-3] == 1:
-                out = out[:, :, -1:]
-
-        return out
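Review note on the removed `vae_refiner.py`: the encoder derives its stage schedule from the compression factors with `bit_length()`, which equals log2 when the factors are powers of two. The sketch below reproduces only that scheduling arithmetic. The helper name `encoder_downsample_plan` and the example factors (16 spatial, 4 temporal, five stages) are assumptions for illustration, not values stated in this diff.

```python
def encoder_downsample_plan(ffactor_spatial, ffactor_temporal, num_stages):
    """Per-stage downsampling schedule, mirroring the removed Encoder.__init__ (hypothetical helper)."""
    # Number of stages that halve H and W; log2(ffactor_spatial) for powers of two.
    depth = (ffactor_spatial >> 1).bit_length()
    # Stages before this index are spatial-only; later ones also halve T.
    depth_temporal = ((ffactor_spatial // ffactor_temporal) >> 1).bit_length()
    plan = []
    for i in range(num_stages):
        if i >= depth:
            plan.append("none")
        elif i >= depth_temporal:
            plan.append("spatial+temporal")
        else:
            plan.append("spatial")
    return plan


print(encoder_downsample_plan(16, 4, 5))
```

The removed decoder uses the mirrored rule (`tus=i < depth_temporal` with `depth_temporal = (ffactor_temporal >> 1).bit_length()`), so temporal upsampling happens in its first stages rather than its last.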
--- a/comfy/ldm/mmaudio/vae/__init__.py
+++ b/comfy/ldm/mmaudio/vae/__init__.py
--- a/comfy/ldm/mmaudio/vae/activations.py
+++ b/comfy/ldm/mmaudio/vae/activations.py
@@ -1,120 +0,0 @@
-# Implementation adapted from https://github.com/EdwardDixon/snake under the MIT license.
-#   LICENSE is in incl_licenses directory.
-
-import torch
-from torch import nn, sin, pow
-from torch.nn import Parameter
-import comfy.model_management
-
-class Snake(nn.Module):
-    '''
-    Implementation of a sine-based periodic activation function
-    Shape:
-        - Input: (B, C, T)
-        - Output: (B, C, T), same shape as the input
-    Parameters:
-        - alpha - trainable parameter
-    References:
-        - This activation function is from this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
-        https://arxiv.org/abs/2006.08195
-    Examples:
-        >>> a1 = snake(256)
-        >>> x = torch.randn(256)
-        >>> x = a1(x)
-    '''
-    def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
-        '''
-        Initialization.
-        INPUT:
-            - in_features: shape of the input
-            - alpha: trainable parameter
-            alpha is initialized to 1 by default, higher values = higher-frequency.
-            alpha will be trained along with the rest of your model.
-        '''
-        super(Snake, self).__init__()
-        self.in_features = in_features
-
-        # initialize alpha
-        self.alpha_logscale = alpha_logscale
-        if self.alpha_logscale:
-            self.alpha = Parameter(torch.empty(in_features))
-        else:
-            self.alpha = Parameter(torch.empty(in_features))
-
-        self.alpha.requires_grad = alpha_trainable
-
-        self.no_div_by_zero = 0.000000001
-
-    def forward(self, x):
-        '''
-        Forward pass of the function.
-        Applies the function to the input elementwise.
-        Snake ∶= x + 1/a * sin^2 (xa)
-        '''
-        alpha = comfy.model_management.cast_to(self.alpha, dtype=x.dtype, device=x.device).unsqueeze(0).unsqueeze(-1) # line up with x to [B, C, T]
-        if self.alpha_logscale:
-            alpha = torch.exp(alpha)
-        x = x + (1.0 / (alpha + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
-
-        return x
-
-
-class SnakeBeta(nn.Module):
-    '''
-    A modified Snake function which uses separate parameters for the magnitude of the periodic components
-    Shape:
-        - Input: (B, C, T)
-        - Output: (B, C, T), same shape as the input
-    Parameters:
-        - alpha - trainable parameter that controls frequency
-        - beta - trainable parameter that controls magnitude
-    References:
-        - This activation function is a modified version based on this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
-        https://arxiv.org/abs/2006.08195
-    Examples:
-        >>> a1 = snakebeta(256)
-        >>> x = torch.randn(256)
-        >>> x = a1(x)
-    '''
-    def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
-        '''
-        Initialization.
-        INPUT:
-            - in_features: shape of the input
-            - alpha - trainable parameter that controls frequency
-            - beta - trainable parameter that controls magnitude
-            alpha is initialized to 1 by default, higher values = higher-frequency.
-            beta is initialized to 1 by default, higher values = higher-magnitude.
-            alpha will be trained along with the rest of your model.
-        '''
-        super(SnakeBeta, self).__init__()
-        self.in_features = in_features
-
-        # initialize alpha
-        self.alpha_logscale = alpha_logscale
-        if self.alpha_logscale:
-            self.alpha = Parameter(torch.empty(in_features))
-            self.beta = Parameter(torch.empty(in_features))
-        else:
-            self.alpha = Parameter(torch.empty(in_features))
-            self.beta = Parameter(torch.empty(in_features))
-
-        self.alpha.requires_grad = alpha_trainable
-        self.beta.requires_grad = alpha_trainable
-
-        self.no_div_by_zero = 0.000000001
-
-    def forward(self, x):
-        '''
-        Forward pass of the function.
-        Applies the function to the input elementwise.
-        SnakeBeta ∶= x + 1/b * sin^2 (xa)
-        '''
-        alpha = comfy.model_management.cast_to(self.alpha, dtype=x.dtype, device=x.device).unsqueeze(0).unsqueeze(-1) # line up with x to [B, C, T]
-        beta = comfy.model_management.cast_to(self.beta, dtype=x.dtype, device=x.device).unsqueeze(0).unsqueeze(-1)
-        if self.alpha_logscale:
-            alpha = torch.exp(alpha)
-            beta = torch.exp(beta)
-        x = x + (1.0 / (beta + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
-
-        return x
--- a/comfy/ldm/mmaudio/vae/alias_free_torch.py
+++ b/comfy/ldm/mmaudio/vae/alias_free_torch.py
@@ -1,157 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import math
-import comfy.model_management
-
-if 'sinc' in dir(torch):
-    sinc = torch.sinc
-else:
-    # This code is adapted from adefossez's julius.core.sinc under the MIT License
-    # https://adefossez.github.io/julius/julius/core.html
-    #   LICENSE is in incl_licenses directory.
-    def sinc(x: torch.Tensor):
-        """
-        Implementation of sinc, i.e. sin(pi * x) / (pi * x)
-        __Warning__: Unlike julius.sinc, the input is multiplied by `pi`!
-        """
-        return torch.where(x == 0,
-                           torch.tensor(1., device=x.device, dtype=x.dtype),
-                           torch.sin(math.pi * x) / math.pi / x)
-
-
-# This code is adapted from adefossez's julius.lowpass.LowPassFilters under the MIT License
-# https://adefossez.github.io/julius/julius/lowpass.html
-#   LICENSE is in incl_licenses directory.
-def kaiser_sinc_filter1d(cutoff, half_width, kernel_size): # return filter [1,1,kernel_size]
-    even = (kernel_size % 2 == 0)
-    half_size = kernel_size // 2
-
-    #For kaiser window
-    delta_f = 4 * half_width
-    A = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
-    if A > 50.:
-        beta = 0.1102 * (A - 8.7)
-    elif A >= 21.:
-        beta = 0.5842 * (A - 21)**0.4 + 0.07886 * (A - 21.)
-    else:
-        beta = 0.
-    window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)
-
-    # ratio = 0.5/cutoff -> 2 * cutoff = 1 / ratio
-    if even:
-        time = (torch.arange(-half_size, half_size) + 0.5)
-    else:
-        time = torch.arange(kernel_size) - half_size
-    if cutoff == 0:
-        filter_ = torch.zeros_like(time, dtype=window.dtype)
-    else:
-        filter_ = 2 * cutoff * window * sinc(2 * cutoff * time)
-        # Normalize filter to have sum = 1, otherwise we will have a small leakage
-        # of the constant component in the input signal.
-        filter_ /= filter_.sum()
-    filter = filter_.view(1, 1, kernel_size)  # defined on both branches, including cutoff == 0
-
-    return filter
-
-
-class LowPassFilter1d(nn.Module):
-    def __init__(self,
-                 cutoff=0.5,
-                 half_width=0.6,
-                 stride: int = 1,
-                 padding: bool = True,
-                 padding_mode: str = 'replicate',
-                 kernel_size: int = 12):
-        # kernel_size should be an even number for the StyleGAN3 setup;
-        # this implementation also supports odd numbers.
-        super().__init__()
-        if cutoff < 0.:
-            raise ValueError("Cutoff must be non-negative.")
-        if cutoff > 0.5:
-            raise ValueError("A cutoff above 0.5 does not make sense.")
-        self.kernel_size = kernel_size
-        self.even = (kernel_size % 2 == 0)
-        self.pad_left = kernel_size // 2 - int(self.even)
-        self.pad_right = kernel_size // 2
-        self.stride = stride
-        self.padding = padding
-        self.padding_mode = padding_mode
-        filter = kaiser_sinc_filter1d(cutoff, half_width, kernel_size)
-        self.register_buffer("filter", filter)
-
-    #input [B, C, T]
-    def forward(self, x):
-        _, C, _ = x.shape
-
-        if self.padding:
-            x = F.pad(x, (self.pad_left, self.pad_right),
-                      mode=self.padding_mode)
-        out = F.conv1d(x, comfy.model_management.cast_to(self.filter.expand(C, -1, -1), dtype=x.dtype, device=x.device),
-                       stride=self.stride, groups=C)
-
-        return out
-
-
-class UpSample1d(nn.Module):
-    def __init__(self, ratio=2, kernel_size=None):
-        super().__init__()
-        self.ratio = ratio
-        self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
-        self.stride = ratio
-        self.pad = self.kernel_size // ratio - 1
-        self.pad_left = self.pad * self.stride + (self.kernel_size - self.stride) // 2
-        self.pad_right = self.pad * self.stride + (self.kernel_size - self.stride + 1) // 2
-        filter = kaiser_sinc_filter1d(cutoff=0.5 / ratio,
-                                      half_width=0.6 / ratio,
-                                      kernel_size=self.kernel_size)
-        self.register_buffer("filter", filter)
-
-    # x: [B, C, T]
-    def forward(self, x):
-        _, C, _ = x.shape
-
-        x = F.pad(x, (self.pad, self.pad), mode='replicate')
-        x = self.ratio * F.conv_transpose1d(
-            x, comfy.model_management.cast_to(self.filter.expand(C, -1, -1), dtype=x.dtype, device=x.device), stride=self.stride, groups=C)
-        x = x[..., self.pad_left:-self.pad_right]
-
-        return x
-
-
-class DownSample1d(nn.Module):
-    def __init__(self, ratio=2, kernel_size=None):
-        super().__init__()
-        self.ratio = ratio
-        self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
-        self.lowpass = LowPassFilter1d(cutoff=0.5 / ratio,
-                                       half_width=0.6 / ratio,
-                                       stride=ratio,
-                                       kernel_size=self.kernel_size)
-
-    def forward(self, x):
-        xx = self.lowpass(x)
-
-        return xx
-
-class Activation1d(nn.Module):
-    def __init__(self,
-                 activation,
-                 up_ratio: int = 2,
-                 down_ratio: int = 2,
-                 up_kernel_size: int = 12,
-                 down_kernel_size: int = 12):
-        super().__init__()
-        self.up_ratio = up_ratio
-        self.down_ratio = down_ratio
-        self.act = activation
-        self.upsample = UpSample1d(up_ratio, up_kernel_size)
-        self.downsample = DownSample1d(down_ratio, down_kernel_size)
-
-    # x: [B,C,T]
-    def forward(self, x):
-        x = self.upsample(x)
-        x = self.act(x)
-        x = self.downsample(x)
-
-        return x
--- a/comfy/ldm/mmaudio/vae/autoencoder.py
+++ b/comfy/ldm/mmaudio/vae/autoencoder.py
@@ -1,156 +0,0 @@
-from typing import Literal
-
-import torch
-import torch.nn as nn
-
-from .distributions import DiagonalGaussianDistribution
-from .vae import VAE_16k
-from .bigvgan import BigVGANVocoder
-import logging
-
-try:
-    import torchaudio
-except ImportError:
-    logging.warning("torchaudio missing, MMAudio VAE model will be broken")
-
-def dynamic_range_compression_torch(x, C=1, clip_val=1e-5, *, norm_fn):
-    return norm_fn(torch.clamp(x, min=clip_val) * C)
-
-
-def spectral_normalize_torch(magnitudes, norm_fn):
-    output = dynamic_range_compression_torch(magnitudes, norm_fn=norm_fn)
-    return output
-
-class MelConverter(nn.Module):
-
-    def __init__(
-        self,
-        *,
-        sampling_rate: float,
-        n_fft: int,
-        num_mels: int,
-        hop_size: int,
-        win_size: int,
-        fmin: float,
-        fmax: float,
-        norm_fn,
-    ):
-        super().__init__()
-        self.sampling_rate = sampling_rate
-        self.n_fft = n_fft
-        self.num_mels = num_mels
-        self.hop_size = hop_size
-        self.win_size = win_size
-        self.fmin = fmin
-        self.fmax = fmax
-        self.norm_fn = norm_fn
-
-        # mel = librosa_mel_fn(sr=self.sampling_rate,
-        #                      n_fft=self.n_fft,
-        #                      n_mels=self.num_mels,
-        #                      fmin=self.fmin,
-        #                      fmax=self.fmax)
-        # mel_basis = torch.from_numpy(mel).float()
-        mel_basis = torch.empty((num_mels, 1 + n_fft // 2))
-        hann_window = torch.hann_window(self.win_size)
-
-        self.register_buffer('mel_basis', mel_basis)
-        self.register_buffer('hann_window', hann_window)
-
-    @property
-    def device(self):
-        return self.mel_basis.device
-
-    def forward(self, waveform: torch.Tensor, center: bool = False) -> torch.Tensor:
-        waveform = waveform.clamp(min=-1., max=1.).to(self.device)
-
-        waveform = torch.nn.functional.pad(
-            waveform.unsqueeze(1),
-            [int((self.n_fft - self.hop_size) / 2),
-             int((self.n_fft - self.hop_size) / 2)],
-            mode='reflect')
-        waveform = waveform.squeeze(1)
-
-        spec = torch.stft(waveform,
-                          self.n_fft,
-                          hop_length=self.hop_size,
-                          win_length=self.win_size,
-                          window=self.hann_window,
-                          center=center,
-                          pad_mode='reflect',
-                          normalized=False,
-                          onesided=True,
-                          return_complex=True)
-
-        spec = torch.view_as_real(spec)
-        spec = torch.sqrt(spec.pow(2).sum(-1) + (1e-9))
-        spec = torch.matmul(self.mel_basis, spec)
-        spec = spectral_normalize_torch(spec, self.norm_fn)
-
-        return spec
-
-class AudioAutoencoder(nn.Module):
-
-    def __init__(
-        self,
-        *,
-        # ckpt_path: str,
-        mode: Literal['16k', '44k'] = '16k',
-        need_vae_encoder: bool = True,
-    ):
-        super().__init__()
-
-        assert mode == "16k", "Only 16k mode is supported currently."
-        self.mel_converter = MelConverter(sampling_rate=16_000,
-                            n_fft=1024,
-                            num_mels=80,
-                            hop_size=256,
-                            win_size=1024,
-                            fmin=0,
-                            fmax=8_000,
-                            norm_fn=torch.log10)
-
-        self.vae = VAE_16k().eval()
-
-        bigvgan_config = {
-            "resblock": "1",
-            "num_mels": 80,
-            "upsample_rates": [4, 4, 2, 2, 2, 2],
-            "upsample_kernel_sizes": [8, 8, 4, 4, 4, 4],
-            "upsample_initial_channel": 1536,
-            "resblock_kernel_sizes": [3, 7, 11],
-            "resblock_dilation_sizes": [
-                [1, 3, 5],
-                [1, 3, 5],
-                [1, 3, 5],
-            ],
-            "activation": "snakebeta",
-            "snake_logscale": True,
-        }
-
-        self.vocoder = BigVGANVocoder(
-            bigvgan_config
-        ).eval()
-
-    @torch.inference_mode()
-    def encode_audio(self, x) -> DiagonalGaussianDistribution:
-        # x: (B, L)
-        mel = self.mel_converter(x)
-        dist = self.vae.encode(mel)
-
-        return dist
-
-    @torch.no_grad()
-    def decode(self, z):
-        mel_decoded = self.vae.decode(z)
-        audio = self.vocoder(mel_decoded)
-
-        audio = torchaudio.functional.resample(audio, 16000, 44100)
-        return audio
-
-    @torch.no_grad()
-    def encode(self, audio):
-        audio = audio.mean(dim=1)
-        audio = torchaudio.functional.resample(audio, 44100, 16000)
-        dist = self.encode_audio(audio)
-        return dist.mean
--- a/comfy/ldm/mmaudio/vae/bigvgan.py
+++ b/comfy/ldm/mmaudio/vae/bigvgan.py
@@ -1,219 +0,0 @@
-# Copyright (c) 2022 NVIDIA CORPORATION.
-#   Licensed under the MIT license.
-
-# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
-#   LICENSE is in incl_licenses directory.
-
-import torch
-import torch.nn as nn
-from types import SimpleNamespace
-from . import activations
-from .alias_free_torch import Activation1d
-import comfy.ops
-ops = comfy.ops.disable_weight_init
-
-def get_padding(kernel_size, dilation=1):
-    return int((kernel_size * dilation - dilation) / 2)
-
-class AMPBlock1(torch.nn.Module):
-
-    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5), activation=None):
-        super(AMPBlock1, self).__init__()
-        self.h = h
-
-        self.convs1 = nn.ModuleList([
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=dilation[0],
-                       padding=get_padding(kernel_size, dilation[0])),
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=dilation[1],
-                       padding=get_padding(kernel_size, dilation[1])),
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=dilation[2],
-                       padding=get_padding(kernel_size, dilation[2]))
-        ])
-
-        self.convs2 = nn.ModuleList([
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=1,
-                       padding=get_padding(kernel_size, 1)),
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=1,
-                       padding=get_padding(kernel_size, 1)),
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=1,
-                       padding=get_padding(kernel_size, 1))
-        ])
-
-        self.num_layers = len(self.convs1) + len(self.convs2)  # total number of conv layers
-
-        if activation == 'snake':  # periodic nonlinearity with snake function and anti-aliasing
-            self.activations = nn.ModuleList([
-                Activation1d(
-                    activation=activations.Snake(channels, alpha_logscale=h.snake_logscale))
-                for _ in range(self.num_layers)
-            ])
-        elif activation == 'snakebeta':  # periodic nonlinearity with snakebeta function and anti-aliasing
-            self.activations = nn.ModuleList([
-                Activation1d(
-                    activation=activations.SnakeBeta(channels, alpha_logscale=h.snake_logscale))
-                for _ in range(self.num_layers)
-            ])
-        else:
-            raise NotImplementedError(
-                "activation incorrectly specified. check the config file and look for 'activation'."
-            )
-
-    def forward(self, x):
-        acts1, acts2 = self.activations[::2], self.activations[1::2]
-        for c1, c2, a1, a2 in zip(self.convs1, self.convs2, acts1, acts2):
-            xt = a1(x)
-            xt = c1(xt)
-            xt = a2(xt)
-            xt = c2(xt)
-            x = xt + x
-
-        return x
-
-
-class AMPBlock2(torch.nn.Module):
-
-    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3), activation=None):
-        super(AMPBlock2, self).__init__()
-        self.h = h
-
-        self.convs = nn.ModuleList([
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=dilation[0],
-                       padding=get_padding(kernel_size, dilation[0])),
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=dilation[1],
-                       padding=get_padding(kernel_size, dilation[1]))
-        ])
-
-        self.num_layers = len(self.convs)  # total number of conv layers
-
-        if activation == 'snake':  # periodic nonlinearity with snake function and anti-aliasing
-            self.activations = nn.ModuleList([
-                Activation1d(
-                    activation=activations.Snake(channels, alpha_logscale=h.snake_logscale))
-                for _ in range(self.num_layers)
-            ])
-        elif activation == 'snakebeta':  # periodic nonlinearity with snakebeta function and anti-aliasing
-            self.activations = nn.ModuleList([
-                Activation1d(
-                    activation=activations.SnakeBeta(channels, alpha_logscale=h.snake_logscale))
-                for _ in range(self.num_layers)
-            ])
-        else:
-            raise NotImplementedError(
-                "activation incorrectly specified. check the config file and look for 'activation'."
-            )
-
-    def forward(self, x):
-        for c, a in zip(self.convs, self.activations):
-            xt = a(x)
-            xt = c(xt)
-            x = xt + x
-
-        return x
-
-
-class BigVGANVocoder(torch.nn.Module):
-    # this is our main BigVGAN model. Applies anti-aliased periodic activation for resblocks.
-    def __init__(self, h):
-        super().__init__()
-        if isinstance(h, dict):
-            h = SimpleNamespace(**h)
-        self.h = h
-
-        self.num_kernels = len(h.resblock_kernel_sizes)
-        self.num_upsamples = len(h.upsample_rates)
-
-        # pre conv
-        self.conv_pre = ops.Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3)
-
-        # define which AMPBlock to use. BigVGAN uses AMPBlock1 as default
-        resblock = AMPBlock1 if h.resblock == '1' else AMPBlock2
-
-        # transposed conv-based upsamplers. does not apply anti-aliasing
-        self.ups = nn.ModuleList()
-        for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
-            self.ups.append(
-                nn.ModuleList([
-                        ops.ConvTranspose1d(h.upsample_initial_channel // (2**i),
-                                        h.upsample_initial_channel // (2**(i + 1)),
-                                        k,
-                                        u,
-                                        padding=(k - u) // 2)
-                ]))
-
-        # residual blocks using anti-aliased multi-periodicity composition modules (AMP)
-        self.resblocks = nn.ModuleList()
-        for i in range(len(self.ups)):
-            ch = h.upsample_initial_channel // (2**(i + 1))
-            for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
-                self.resblocks.append(resblock(h, ch, k, d, activation=h.activation))
-
-        # post conv
-        if h.activation == "snake":  # periodic nonlinearity with snake function and anti-aliasing
-            activation_post = activations.Snake(ch, alpha_logscale=h.snake_logscale)
-            self.activation_post = Activation1d(activation=activation_post)
-        elif h.activation == "snakebeta":  # periodic nonlinearity with snakebeta function and anti-aliasing
-            activation_post = activations.SnakeBeta(ch, alpha_logscale=h.snake_logscale)
-            self.activation_post = Activation1d(activation=activation_post)
-        else:
-            raise NotImplementedError(
-                "activation incorrectly specified. check the config file and look for 'activation'."
-            )
-
-        self.conv_post = ops.Conv1d(ch, 1, 7, 1, padding=3)
-
-
-    def forward(self, x):
-        # pre conv
-        x = self.conv_pre(x)
-
-        for i in range(self.num_upsamples):
-            # upsampling
-            for i_up in range(len(self.ups[i])):
-                x = self.ups[i][i_up](x)
-            # AMP blocks
-            xs = None
-            for j in range(self.num_kernels):
-                if xs is None:
-                    xs = self.resblocks[i * self.num_kernels + j](x)
-                else:
-                    xs += self.resblocks[i * self.num_kernels + j](x)
-            x = xs / self.num_kernels
-
-        # post conv
-        x = self.activation_post(x)
-        x = self.conv_post(x)
-        x = torch.tanh(x)
-
-        return x
--- a/comfy/ldm/mmaudio/vae/distributions.py
+++ b/comfy/ldm/mmaudio/vae/distributions.py
@@ -1,92 +0,0 @@
-import torch
-import numpy as np
-
-
-class AbstractDistribution:
-    def sample(self):
-        raise NotImplementedError()
-
-    def mode(self):
-        raise NotImplementedError()
-
-
-class DiracDistribution(AbstractDistribution):
-    def __init__(self, value):
-        self.value = value
-
-    def sample(self):
-        return self.value
-
-    def mode(self):
-        return self.value
-
-
-class DiagonalGaussianDistribution(object):
-    def __init__(self, parameters, deterministic=False):
-        self.parameters = parameters
-        self.mean, self.logvar = torch.chunk(parameters, 2, dim=1)
-        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
-        self.deterministic = deterministic
-        self.std = torch.exp(0.5 * self.logvar)
-        self.var = torch.exp(self.logvar)
-        if self.deterministic:
-            self.var = self.std = torch.zeros_like(self.mean, device=self.parameters.device)
-
-    def sample(self, rng=None):
-        x = self.mean + self.std * torch.randn(self.mean.shape, generator=rng, device=self.parameters.device)
-        return x
-
-    def kl(self, other=None):
-        if self.deterministic:
-            return torch.Tensor([0.])
-        else:
-            if other is None:
-                return 0.5 * torch.sum(torch.pow(self.mean, 2)
-                                       + self.var - 1.0 - self.logvar,
-                                       dim=[1, 2, 3])
-            else:
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean - other.mean, 2) / other.var
-                    + self.var / other.var - 1.0 - self.logvar + other.logvar,
-                    dim=[1, 2, 3])
-
-    def nll(self, sample, dims=[1,2,3]):
-        if self.deterministic:
-            return torch.Tensor([0.])
-        logtwopi = np.log(2.0 * np.pi)
-        return 0.5 * torch.sum(
-            logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var,
-            dim=dims)
-
-    def mode(self):
-        return self.mean
-
-
-def normal_kl(mean1, logvar1, mean2, logvar2):
-    """
-    source: https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/losses.py#L12
-    Compute the KL divergence between two gaussians.
-    Shapes are automatically broadcast, so batches can be compared to
-    scalars, among other use cases.
-    """
-    tensor = None
-    for obj in (mean1, logvar1, mean2, logvar2):
-        if isinstance(obj, torch.Tensor):
-            tensor = obj
-            break
-    assert tensor is not None, "at least one argument must be a Tensor"
-
-    # Force variances to be Tensors. Broadcasting helps convert scalars to
-    # Tensors, but it does not work for torch.exp().
-    logvar1, logvar2 = [
-        x if isinstance(x, torch.Tensor) else torch.tensor(x).to(tensor)
-        for x in (logvar1, logvar2)
-    ]
-
-    return 0.5 * (
-        -1.0
-        + logvar2
-        - logvar1
-        + torch.exp(logvar1 - logvar2)
-        + ((mean1 - mean2) ** 2) * torch.exp(-logvar2)
-    )
--- a/comfy/ldm/mmaudio/vae/vae.py
+++ b/comfy/ldm/mmaudio/vae/vae.py
@@ -1,358 +0,0 @@
-import logging
-from typing import Optional
-
-import torch
-import torch.nn as nn
-
-from .vae_modules import (AttnBlock1D, Downsample1D, ResnetBlock1D,
-                                                 Upsample1D, nonlinearity)
-from .distributions import DiagonalGaussianDistribution
-
-import comfy.ops
-ops = comfy.ops.disable_weight_init
-
-log = logging.getLogger()
-
-DATA_MEAN_80D = [
-    -1.6058, -1.3676, -1.2520, -1.2453, -1.2078, -1.2224, -1.2419, -1.2439, -1.2922, -1.2927,
-    -1.3170, -1.3543, -1.3401, -1.3836, -1.3907, -1.3912, -1.4313, -1.4152, -1.4527, -1.4728,
-    -1.4568, -1.5101, -1.5051, -1.5172, -1.5623, -1.5373, -1.5746, -1.5687, -1.6032, -1.6131,
-    -1.6081, -1.6331, -1.6489, -1.6489, -1.6700, -1.6738, -1.6953, -1.6969, -1.7048, -1.7280,
-    -1.7361, -1.7495, -1.7658, -1.7814, -1.7889, -1.8064, -1.8221, -1.8377, -1.8417, -1.8643,
-    -1.8857, -1.8929, -1.9173, -1.9379, -1.9531, -1.9673, -1.9824, -2.0042, -2.0215, -2.0436,
-    -2.0766, -2.1064, -2.1418, -2.1855, -2.2319, -2.2767, -2.3161, -2.3572, -2.3954, -2.4282,
-    -2.4659, -2.5072, -2.5552, -2.6074, -2.6584, -2.7107, -2.7634, -2.8266, -2.8981, -2.9673
-]
-
-DATA_STD_80D = [
-    1.0291, 1.0411, 1.0043, 0.9820, 0.9677, 0.9543, 0.9450, 0.9392, 0.9343, 0.9297, 0.9276, 0.9263,
-    0.9242, 0.9254, 0.9232, 0.9281, 0.9263, 0.9315, 0.9274, 0.9247, 0.9277, 0.9199, 0.9188, 0.9194,
-    0.9160, 0.9161, 0.9146, 0.9161, 0.9100, 0.9095, 0.9145, 0.9076, 0.9066, 0.9095, 0.9032, 0.9043,
-    0.9038, 0.9011, 0.9019, 0.9010, 0.8984, 0.8983, 0.8986, 0.8961, 0.8962, 0.8978, 0.8962, 0.8973,
-    0.8993, 0.8976, 0.8995, 0.9016, 0.8982, 0.8972, 0.8974, 0.8949, 0.8940, 0.8947, 0.8936, 0.8939,
-    0.8951, 0.8956, 0.9017, 0.9167, 0.9436, 0.9690, 1.0003, 1.0225, 1.0381, 1.0491, 1.0545, 1.0604,
-    1.0761, 1.0929, 1.1089, 1.1196, 1.1176, 1.1156, 1.1117, 1.1070
-]
-
-DATA_MEAN_128D = [
-    -3.3462, -2.6723, -2.4893, -2.3143, -2.2664, -2.3317, -2.1802, -2.4006, -2.2357, -2.4597,
-    -2.3717, -2.4690, -2.5142, -2.4919, -2.6610, -2.5047, -2.7483, -2.5926, -2.7462, -2.7033,
-    -2.7386, -2.8112, -2.7502, -2.9594, -2.7473, -3.0035, -2.8891, -2.9922, -2.9856, -3.0157,
-    -3.1191, -2.9893, -3.1718, -3.0745, -3.1879, -3.2310, -3.1424, -3.2296, -3.2791, -3.2782,
-    -3.2756, -3.3134, -3.3509, -3.3750, -3.3951, -3.3698, -3.4505, -3.4509, -3.5089, -3.4647,
-    -3.5536, -3.5788, -3.5867, -3.6036, -3.6400, -3.6747, -3.7072, -3.7279, -3.7283, -3.7795,
-    -3.8259, -3.8447, -3.8663, -3.9182, -3.9605, -3.9861, -4.0105, -4.0373, -4.0762, -4.1121,
-    -4.1488, -4.1874, -4.2461, -4.3170, -4.3639, -4.4452, -4.5282, -4.6297, -4.7019, -4.7960,
-    -4.8700, -4.9507, -5.0303, -5.0866, -5.1634, -5.2342, -5.3242, -5.4053, -5.4927, -5.5712,
-    -5.6464, -5.7052, -5.7619, -5.8410, -5.9188, -6.0103, -6.0955, -6.1673, -6.2362, -6.3120,
-    -6.3926, -6.4797, -6.5565, -6.6511, -6.8130, -6.9961, -7.1275, -7.2457, -7.3576, -7.4663,
-    -7.6136, -7.7469, -7.8815, -8.0132, -8.1515, -8.3071, -8.4722, -8.7418, -9.3975, -9.6628,
-    -9.7671, -9.8863, -9.9992, -10.0860, -10.1709, -10.5418, -11.2795, -11.3861
-]
-
-DATA_STD_128D = [
-    2.3804, 2.4368, 2.3772, 2.3145, 2.2803, 2.2510, 2.2316, 2.2083, 2.1996, 2.1835, 2.1769, 2.1659,
-    2.1631, 2.1618, 2.1540, 2.1606, 2.1571, 2.1567, 2.1612, 2.1579, 2.1679, 2.1683, 2.1634, 2.1557,
-    2.1668, 2.1518, 2.1415, 2.1449, 2.1406, 2.1350, 2.1313, 2.1415, 2.1281, 2.1352, 2.1219, 2.1182,
-    2.1327, 2.1195, 2.1137, 2.1080, 2.1179, 2.1036, 2.1087, 2.1036, 2.1015, 2.1068, 2.0975, 2.0991,
-    2.0902, 2.1015, 2.0857, 2.0920, 2.0893, 2.0897, 2.0910, 2.0881, 2.0925, 2.0873, 2.0960, 2.0900,
-    2.0957, 2.0958, 2.0978, 2.0936, 2.0886, 2.0905, 2.0845, 2.0855, 2.0796, 2.0840, 2.0813, 2.0817,
-    2.0838, 2.0840, 2.0917, 2.1061, 2.1431, 2.1976, 2.2482, 2.3055, 2.3700, 2.4088, 2.4372, 2.4609,
-    2.4731, 2.4847, 2.5072, 2.5451, 2.5772, 2.6147, 2.6529, 2.6596, 2.6645, 2.6726, 2.6803, 2.6812,
-    2.6899, 2.6916, 2.6931, 2.6998, 2.7062, 2.7262, 2.7222, 2.7158, 2.7041, 2.7485, 2.7491, 2.7451,
-    2.7485, 2.7233, 2.7297, 2.7233, 2.7145, 2.6958, 2.6788, 2.6439, 2.6007, 2.4786, 2.2469, 2.1877,
-    2.1392, 2.0717, 2.0107, 1.9676, 1.9140, 1.7102, 0.9101, 0.7164
-]
-
-
-class VAE(nn.Module):
-
-    def __init__(
-        self,
-        *,
-        data_dim: int,
-        embed_dim: int,
-        hidden_dim: int,
-    ):
-        super().__init__()
-
-        if data_dim == 80:
-            self.data_mean = nn.Buffer(torch.tensor(DATA_MEAN_80D, dtype=torch.float32))
-            self.data_std = nn.Buffer(torch.tensor(DATA_STD_80D, dtype=torch.float32))
-        elif data_dim == 128:
-            self.data_mean = nn.Buffer(torch.tensor(DATA_MEAN_128D, dtype=torch.float32))
-            self.data_std = nn.Buffer(torch.tensor(DATA_STD_128D, dtype=torch.float32))
-
-        self.data_mean = self.data_mean.view(1, -1, 1)
-        self.data_std = self.data_std.view(1, -1, 1)
-
-        self.encoder = Encoder1D(
-            dim=hidden_dim,
-            ch_mult=(1, 2, 4),
-            num_res_blocks=2,
-            attn_layers=[3],
-            down_layers=[0],
-            in_dim=data_dim,
-            embed_dim=embed_dim,
-        )
-        self.decoder = Decoder1D(
-            dim=hidden_dim,
-            ch_mult=(1, 2, 4),
-            num_res_blocks=2,
-            attn_layers=[3],
-            down_layers=[0],
-            in_dim=data_dim,
-            out_dim=data_dim,
-            embed_dim=embed_dim,
-        )
-
-        self.embed_dim = embed_dim
-        # self.quant_conv = nn.Conv1d(2 * embed_dim, 2 * embed_dim, 1)
-        # self.post_quant_conv = nn.Conv1d(embed_dim, embed_dim, 1)
-
-        self.initialize_weights()
-
-    def initialize_weights(self):
-        pass
-
-    def encode(self, x: torch.Tensor, normalize: bool = True) -> DiagonalGaussianDistribution:
-        if normalize:
-            x = self.normalize(x)
-        moments = self.encoder(x)
-        posterior = DiagonalGaussianDistribution(moments)
-        return posterior
-
-    def decode(self, z: torch.Tensor, unnormalize: bool = True) -> torch.Tensor:
-        dec = self.decoder(z)
-        if unnormalize:
-            dec = self.unnormalize(dec)
-        return dec
-
-    def normalize(self, x: torch.Tensor) -> torch.Tensor:
-        return (x - comfy.model_management.cast_to(self.data_mean, dtype=x.dtype, device=x.device)) / comfy.model_management.cast_to(self.data_std, dtype=x.dtype, device=x.device)
-
-    def unnormalize(self, x: torch.Tensor) -> torch.Tensor:
-        return x * comfy.model_management.cast_to(self.data_std, dtype=x.dtype, device=x.device) + comfy.model_management.cast_to(self.data_mean, dtype=x.dtype, device=x.device)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        sample_posterior: bool = True,
-        rng: Optional[torch.Generator] = None,
-        normalize: bool = True,
-        unnormalize: bool = True,
-    ) -> tuple[torch.Tensor, DiagonalGaussianDistribution]:
-
-        posterior = self.encode(x, normalize=normalize)
-        if sample_posterior:
-            z = posterior.sample(rng)
-        else:
-            z = posterior.mode()
-        dec = self.decode(z, unnormalize=unnormalize)
-        return dec, posterior
-
-    def load_weights(self, src_dict) -> None:
-        self.load_state_dict(src_dict, strict=True)
-
-    @property
-    def device(self) -> torch.device:
-        return next(self.parameters()).device
-
-    def get_last_layer(self):
-        return self.decoder.conv_out.weight
-
-    def remove_weight_norm(self):
-        return self
-
-
-class Encoder1D(nn.Module):
-
-    def __init__(self,
-                 *,
-                 dim: int,
-                 ch_mult: tuple[int, ...] = (1, 2, 4, 8),
-                 num_res_blocks: int,
-                 attn_layers: list[int] = [],
-                 down_layers: list[int] = [],
-                 resamp_with_conv: bool = True,
-                 in_dim: int,
-                 embed_dim: int,
-                 double_z: bool = True,
-                 kernel_size: int = 3,
-                 clip_act: float = 256.0):
-        super().__init__()
-        self.dim = dim
-        self.num_layers = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.in_channels = in_dim
-        self.clip_act = clip_act
-        self.down_layers = down_layers
-        self.attn_layers = attn_layers
-        self.conv_in = ops.Conv1d(in_dim, self.dim, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-
-        in_ch_mult = (1, ) + tuple(ch_mult)
-        self.in_ch_mult = in_ch_mult
-        # downsampling
-        self.down = nn.ModuleList()
-        for i_level in range(self.num_layers):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_in = dim * in_ch_mult[i_level]
-            block_out = dim * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks):
-                block.append(
-                    ResnetBlock1D(in_dim=block_in,
-                                  out_dim=block_out,
-                                  kernel_size=kernel_size,
-                                  use_norm=True))
-                block_in = block_out
-                if i_level in attn_layers:
-                    attn.append(AttnBlock1D(block_in))
-            down = nn.Module()
-            down.block = block
-            down.attn = attn
-            if i_level in down_layers:
-                down.downsample = Downsample1D(block_in, resamp_with_conv)
-            self.down.append(down)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock1D(in_dim=block_in,
-                                         out_dim=block_in,
-                                         kernel_size=kernel_size,
-                                         use_norm=True)
-        self.mid.attn_1 = AttnBlock1D(block_in)
-        self.mid.block_2 = ResnetBlock1D(in_dim=block_in,
-                                         out_dim=block_in,
-                                         kernel_size=kernel_size,
-                                         use_norm=True)
-
-        # end
-        self.conv_out = ops.Conv1d(block_in,
-                                 2 * embed_dim if double_z else embed_dim,
-                                 kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-
-        self.learnable_gain = nn.Parameter(torch.zeros([]))
-
-    def forward(self, x):
-
-        # downsampling
-        h = self.conv_in(x)
-        for i_level in range(self.num_layers):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](h)
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-                h = h.clamp(-self.clip_act, self.clip_act)
-            if i_level in self.down_layers:
-                h = self.down[i_level].downsample(h)
-
-        # middle
-        h = self.mid.block_1(h)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h)
-        h = h.clamp(-self.clip_act, self.clip_act)
-
-        # end
-        h = nonlinearity(h)
-        h = self.conv_out(h) * (self.learnable_gain + 1)
-        return h
-
-
-class Decoder1D(nn.Module):
-
-    def __init__(self,
-                 *,
-                 dim: int,
-                 out_dim: int,
-                 ch_mult: tuple[int] = (1, 2, 4, 8),
-                 num_res_blocks: int,
-                 attn_layers: list[int] = [],
-                 down_layers: list[int] = [],
-                 kernel_size: int = 3,
-                 resamp_with_conv: bool = True,
-                 in_dim: int,
-                 embed_dim: int,
-                 clip_act: float = 256.0):
-        super().__init__()
-        self.ch = dim
-        self.num_layers = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.in_channels = in_dim
-        self.clip_act = clip_act
-        self.down_layers = [i + 1 for i in down_layers]  # each downlayer add one
-
-        # compute in_ch_mult, block_in and curr_res at lowest res
-        block_in = dim * ch_mult[self.num_layers - 1]
-
-        # z to block_in
-        self.conv_in = ops.Conv1d(embed_dim, block_in, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock1D(in_dim=block_in, out_dim=block_in, use_norm=True)
-        self.mid.attn_1 = AttnBlock1D(block_in)
-        self.mid.block_2 = ResnetBlock1D(in_dim=block_in, out_dim=block_in, use_norm=True)
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_layers)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = dim * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(ResnetBlock1D(in_dim=block_in, out_dim=block_out, use_norm=True))
-                block_in = block_out
-                if i_level in attn_layers:
-                    attn.append(AttnBlock1D(block_in))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level in self.down_layers:
-                up.upsample = Upsample1D(block_in, resamp_with_conv)
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        # end
-        self.conv_out = ops.Conv1d(block_in, out_dim, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-        self.learnable_gain = nn.Parameter(torch.zeros([]))
-
-    def forward(self, z):
-        # z to block_in
-        h = self.conv_in(z)
-
-        # middle
-        h = self.mid.block_1(h)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h)
-        h = h.clamp(-self.clip_act, self.clip_act)
-
-        # upsampling
-        for i_level in reversed(range(self.num_layers)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h)
-                h = h.clamp(-self.clip_act, self.clip_act)
-            if i_level in self.down_layers:
-                h = self.up[i_level].upsample(h)
-
-        h = nonlinearity(h)
-        h = self.conv_out(h) * (self.learnable_gain + 1)
-        return h
-
-
-def VAE_16k(**kwargs) -> VAE:
-    return VAE(data_dim=80, embed_dim=20, hidden_dim=384, **kwargs)
-
-
-def VAE_44k(**kwargs) -> VAE:
-    return VAE(data_dim=128, embed_dim=40, hidden_dim=512, **kwargs)
-
-
-def get_my_vae(name: str, **kwargs) -> VAE:
-    if name == '16k':
-        return VAE_16k(**kwargs)
-    if name == '44k':
-        return VAE_44k(**kwargs)
-    raise ValueError(f'Unknown model: {name}')
-
--- a/comfy/ldm/mmaudio/vae/vae_modules.py
+++ b/comfy/ldm/mmaudio/vae/vae_modules.py
@@ -1,121 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.modules.diffusionmodules.model import vae_attention
-import math
-import comfy.ops
-ops = comfy.ops.disable_weight_init
-
-def nonlinearity(x):
-    # swish
-    return torch.nn.functional.silu(x) / 0.596
-
-def mp_sum(a, b, t=0.5):
-    return a.lerp(b, t) / math.sqrt((1 - t)**2 + t**2)
-
-def normalize(x, dim=None, eps=1e-4):
-    if dim is None:
-        dim = list(range(1, x.ndim))
-    norm = torch.linalg.vector_norm(x, dim=dim, keepdim=True, dtype=torch.float32)
-    norm = torch.add(eps, norm, alpha=math.sqrt(norm.numel() / x.numel()))
-    return x / norm.to(x.dtype)
-
-class ResnetBlock1D(nn.Module):
-
-    def __init__(self, *, in_dim, out_dim=None, conv_shortcut=False, kernel_size=3, use_norm=True):
-        super().__init__()
-        self.in_dim = in_dim
-        out_dim = in_dim if out_dim is None else out_dim
-        self.out_dim = out_dim
-        self.use_conv_shortcut = conv_shortcut
-        self.use_norm = use_norm
-
-        self.conv1 = ops.Conv1d(in_dim, out_dim, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-        self.conv2 = ops.Conv1d(out_dim, out_dim, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-        if self.in_dim != self.out_dim:
-            if self.use_conv_shortcut:
-                self.conv_shortcut = ops.Conv1d(in_dim, out_dim, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-            else:
-                self.nin_shortcut = ops.Conv1d(in_dim, out_dim, kernel_size=1, padding=0, bias=False)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-
-        # pixel norm
-        if self.use_norm:
-            x = normalize(x, dim=1)
-
-        h = x
-        h = nonlinearity(h)
-        h = self.conv1(h)
-
-        h = nonlinearity(h)
-        h = self.conv2(h)
-
-        if self.in_dim != self.out_dim:
-            if self.use_conv_shortcut:
-                x = self.conv_shortcut(x)
-            else:
-                x = self.nin_shortcut(x)
-
-        return mp_sum(x, h, t=0.3)
-
-
-class AttnBlock1D(nn.Module):
-
-    def __init__(self, in_channels, num_heads=1):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.num_heads = num_heads
-        self.qkv = ops.Conv1d(in_channels, in_channels * 3, kernel_size=1, padding=0, bias=False)
-        self.proj_out = ops.Conv1d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
-        self.optimized_attention = vae_attention()
-
-    def forward(self, x):
-        h = x
-        y = self.qkv(h)
-        y = y.reshape(y.shape[0], -1, 3, y.shape[-1])
-        q, k, v = normalize(y, dim=1).unbind(2)
-
-        h = self.optimized_attention(q, k, v)
-        h = self.proj_out(h)
-
-        return mp_sum(x, h, t=0.3)
-
-
-class Upsample1D(nn.Module):
-
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            self.conv = ops.Conv1d(in_channels, in_channels, kernel_size=3, padding=1, bias=False)
-
-    def forward(self, x):
-        x = F.interpolate(x, scale_factor=2.0, mode='nearest-exact')  # support 3D tensor(B,C,T)
-        if self.with_conv:
-            x = self.conv(x)
-        return x
-
-
-class Downsample1D(nn.Module):
-
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            # no asymmetric padding in torch conv, must do it ourselves
-            self.conv1 = ops.Conv1d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
-            self.conv2 = ops.Conv1d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
-
-    def forward(self, x):
-
-        if self.with_conv:
-            x = self.conv1(x)
-
-        x = F.avg_pool1d(x, kernel_size=2, stride=2)
-
-        if self.with_conv:
-            x = self.conv2(x)
-
-        return x
--- a/comfy/ldm/models/autoencoder.py
+++ b/comfy/ldm/models/autoencoder.py
@@ -26,12 +26,6 @@ class DiagonalGaussianRegularizer(torch.nn.Module):
            z = posterior.mode()
        return z, None

-class EmptyRegularizer(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-
-    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, dict]:
-        return z, None

 class AbstractAutoencoder(torch.nn.Module):
    """
--- a/comfy/ldm/modules/attention.py
+++ b/comfy/ldm/modules/attention.py
@@ -22,7 +22,7 @@ SAGE_ATTENTION_IS_AVAILABLE = False
 try:
    from sageattention import sageattn
    SAGE_ATTENTION_IS_AVAILABLE = True
-except ImportError as e:
+except ModuleNotFoundError as e:
    if model_management.sage_attention_enabled():
        if e.name == "sageattention":
            logging.error(f"\n\nTo use the `--use-sage-attention` feature, the `sageattention` package must be installed first.\ncommand:\n\t{sys.executable} -m pip install sageattention")
@@ -34,7 +34,7 @@ FLASH_ATTENTION_IS_AVAILABLE = False
 try:
    from flash_attn import flash_attn_func
    FLASH_ATTENTION_IS_AVAILABLE = True
-except ImportError:
+except ModuleNotFoundError:
    if model_management.flash_attention_enabled():
        logging.error(f"\n\nTo use the `--use-flash-attention` feature, the `flash-attn` package must be installed first.\ncommand:\n\t{sys.executable} -m pip install flash-attn")
        exit(-1)
@@ -600,8 +600,7 @@ def attention_flash(q, k, v, heads, mask=None, attn_precision=None, skip_reshape
            mask = mask.unsqueeze(1)

    try:
-        if mask is not None:
-            raise RuntimeError("Mask must not be set for Flash attention")
+        assert mask is None
        out = flash_attn_wrapper(
            q.transpose(1, 2),
            k.transpose(1, 2),
--- a/comfy/ldm/modules/diffusionmodules/model.py
+++ b/comfy/ldm/modules/diffusionmodules/model.py
@@ -145,7 +145,7 @@ class Downsample(nn.Module):

 class ResnetBlock(nn.Module):
    def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False,
-                 dropout=0.0, temb_channels=512, conv_op=ops.Conv2d, norm_op=Normalize):
+                 dropout, temb_channels=512, conv_op=ops.Conv2d):
        super().__init__()
        self.in_channels = in_channels
        out_channels = in_channels if out_channels is None else out_channels
@@ -153,7 +153,7 @@ class ResnetBlock(nn.Module):
        self.use_conv_shortcut = conv_shortcut

        self.swish = torch.nn.SiLU(inplace=True)
-        self.norm1 = norm_op(in_channels)
+        self.norm1 = Normalize(in_channels)
        self.conv1 = conv_op(in_channels,
                                     out_channels,
                                     kernel_size=3,
@@ -162,7 +162,7 @@ class ResnetBlock(nn.Module):
        if temb_channels > 0:
            self.temb_proj = ops.Linear(temb_channels,
                                             out_channels)
-        self.norm2 = norm_op(out_channels)
+        self.norm2 = Normalize(out_channels)
        self.dropout = torch.nn.Dropout(dropout, inplace=True)
        self.conv2 = conv_op(out_channels,
                                     out_channels,
@@ -183,7 +183,7 @@ class ResnetBlock(nn.Module):
                                                    stride=1,
                                                    padding=0)

-    def forward(self, x, temb=None):
+    def forward(self, x, temb):
        h = x
        h = self.norm1(h)
        h = self.swish(h)
@@ -305,11 +305,11 @@ def vae_attention():
        return normal_attention

 class AttnBlock(nn.Module):
-    def __init__(self, in_channels, conv_op=ops.Conv2d, norm_op=Normalize):
+    def __init__(self, in_channels, conv_op=ops.Conv2d):
        super().__init__()
        self.in_channels = in_channels

-        self.norm = norm_op(in_channels)
+        self.norm = Normalize(in_channels)
        self.q = conv_op(in_channels,
                                 in_channels,
                                 kernel_size=1,
--- a/comfy/ldm/wan/model.py
+++ b/comfy/ldm/wan/model.py
@@ -8,7 +8,7 @@ from einops import rearrange

 from comfy.ldm.modules.attention import optimized_attention
 from comfy.ldm.flux.layers import EmbedND
-from comfy.ldm.flux.math import apply_rope1
+from comfy.ldm.flux.math import apply_rope
 import comfy.ldm.common_dit
 import comfy.model_management
 import comfy.patcher_extension
@@ -34,9 +34,7 @@ class WanSelfAttention(nn.Module):
                 num_heads,
                 window_size=(-1, -1),
                 qk_norm=True,
-                 eps=1e-6,
-                 kv_dim=None,
-                 operation_settings={}):
+                 eps=1e-6, operation_settings={}):
        assert dim % num_heads == 0
        super().__init__()
        self.dim = dim
@@ -45,13 +43,11 @@ class WanSelfAttention(nn.Module):
        self.window_size = window_size
        self.qk_norm = qk_norm
        self.eps = eps
-        if kv_dim is None:
-            kv_dim = dim

        # layers
        self.q = operation_settings.get("operations").Linear(dim, dim, device=operation_settings.get("device"), dtype=operation_settings.get("dtype"))
-        self.k = operation_settings.get("operations").Linear(kv_dim, dim, device=operation_settings.get("device"), dtype=operation_settings.get("dtype"))
-        self.v = operation_settings.get("operations").Linear(kv_dim, dim, device=operation_settings.get("device"), dtype=operation_settings.get("dtype"))
+        self.k = operation_settings.get("operations").Linear(dim, dim, device=operation_settings.get("device"), dtype=operation_settings.get("dtype"))
+        self.v = operation_settings.get("operations").Linear(dim, dim, device=operation_settings.get("device"), dtype=operation_settings.get("dtype"))
        self.o = operation_settings.get("operations").Linear(dim, dim, device=operation_settings.get("device"), dtype=operation_settings.get("dtype"))
        self.norm_q = operation_settings.get("operations").RMSNorm(dim, eps=eps, elementwise_affine=True, device=operation_settings.get("device"), dtype=operation_settings.get("dtype")) if qk_norm else nn.Identity()
        self.norm_k = operation_settings.get("operations").RMSNorm(dim, eps=eps, elementwise_affine=True, device=operation_settings.get("device"), dtype=operation_settings.get("dtype")) if qk_norm else nn.Identity()
@@ -64,24 +60,20 @@ class WanSelfAttention(nn.Module):
        """
        b, s, n, d = *x.shape[:2], self.num_heads, self.head_dim

-        def qkv_fn_q(x):
+        # query, key, value function
+        def qkv_fn(x):
            q = self.norm_q(self.q(x)).view(b, s, n, d)
-            return apply_rope1(q, freqs)
-
-        def qkv_fn_k(x):
            k = self.norm_k(self.k(x)).view(b, s, n, d)
-            return apply_rope1(k, freqs)
+            v = self.v(x).view(b, s, n * d)
+            return q, k, v

-        #These two are VRAM hogs, so we want to do all of q computation and
-        #have pytorch garbage collect the intermediates on the sub function
-        #return before we touch k
-        q = qkv_fn_q(x)
-        k = qkv_fn_k(x)
+        q, k, v = qkv_fn(x)
+        q, k = apply_rope(q, k, freqs)

        x = optimized_attention(
            q.view(b, s, n * d),
            k.view(b, s, n * d),
-            self.v(x).view(b, s, n * d),
+            v,
            heads=self.num_heads,
            transformer_options=transformer_options,
        )
@@ -237,7 +229,6 @@ class WanAttentionBlock(nn.Module):
            freqs, transformer_options=transformer_options)

        x = torch.addcmul(x, y, repeat_e(e[2], x))
-        del y

        # cross-attention & ffn
        x = x + self.cross_attn(self.norm3(x), context, context_img_len=context_img_len, transformer_options=transformer_options)
@@ -407,7 +398,6 @@ class WanModel(torch.nn.Module):
                 eps=1e-6,
                 flf_pos_embed_token_number=None,
                 in_dim_ref_conv=None,
-                 wan_attn_block_class=WanAttentionBlock,
                 image_model=None,
                 device=None,
                 dtype=None,
@@ -485,8 +475,8 @@ class WanModel(torch.nn.Module):
        # blocks
        cross_attn_type = 't2v_cross_attn' if model_type == 't2v' else 'i2v_cross_attn'
        self.blocks = nn.ModuleList([
-            wan_attn_block_class(cross_attn_type, dim, ffn_dim, num_heads,
-                                 window_size, qk_norm, cross_attn_norm, eps, operation_settings=operation_settings)
+            WanAttentionBlock(cross_attn_type, dim, ffn_dim, num_heads,
+                              window_size, qk_norm, cross_attn_norm, eps, operation_settings=operation_settings)
            for _ in range(num_layers)
        ])

@@ -903,7 +893,7 @@ class MotionEncoder_tc(nn.Module):
    def __init__(self,
                 in_dim: int,
                 hidden_dim: int,
-                 num_heads: int,
+                 num_heads=int,
                 need_global=True,
                 dtype=None,
                 device=None,
@@ -1331,250 +1321,3 @@ class WanModel_S2V(WanModel):
        # unpatchify
        x = self.unpatchify(x, grid_sizes)
        return x
-
-
-class WanT2VCrossAttentionGather(WanSelfAttention):
-
-    def forward(self, x, context, transformer_options={}, **kwargs):
-        r"""
-        Args:
-            x(Tensor): Shape [B, L1, C] - video tokens
-            context(Tensor): Shape [B, L2, C] - audio tokens with shape [B, frames*16, 1536]
-        """
-        b, n, d = x.size(0), self.num_heads, self.head_dim
-
-        q = self.norm_q(self.q(x))
-        k = self.norm_k(self.k(context))
-        v = self.v(context)
-
-        # Handle audio temporal structure (16 tokens per frame)
-        k = k.reshape(-1, 16, n, d).transpose(1, 2)
-        v = v.reshape(-1, 16, n, d).transpose(1, 2)
-
-        # Handle video spatial structure
-        q = q.reshape(k.shape[0], -1, n, d).transpose(1, 2)
-
-        x = optimized_attention(q, k, v, heads=self.num_heads, skip_reshape=True, skip_output_reshape=True, transformer_options=transformer_options)
-
-        x = x.transpose(1, 2).reshape(b, -1, n * d)
-        x = self.o(x)
-        return x
-
-
-class AudioCrossAttentionWrapper(nn.Module):
-    def __init__(self, dim, kv_dim, num_heads, qk_norm=True, eps=1e-6, operation_settings={}):
-        super().__init__()
-
-        self.audio_cross_attn = WanT2VCrossAttentionGather(dim, num_heads, qk_norm=qk_norm, kv_dim=kv_dim, eps=eps, operation_settings=operation_settings)
-        self.norm1_audio = operation_settings.get("operations").LayerNorm(dim, eps, elementwise_affine=True, device=operation_settings.get("device"), dtype=operation_settings.get("dtype"))
-
-    def forward(self, x, audio, transformer_options={}):
-        x = x + self.audio_cross_attn(self.norm1_audio(x), audio, transformer_options=transformer_options)
-        return x
-
-
-class WanAttentionBlockAudio(WanAttentionBlock):
-
-    def __init__(self,
-                 cross_attn_type,
-                 dim,
-                 ffn_dim,
-                 num_heads,
-                 window_size=(-1, -1),
-                 qk_norm=True,
-                 cross_attn_norm=False,
-                 eps=1e-6, operation_settings={}):
-        super().__init__(cross_attn_type, dim, ffn_dim, num_heads, window_size, qk_norm, cross_attn_norm, eps, operation_settings)
-        self.audio_cross_attn_wrapper = AudioCrossAttentionWrapper(dim, 1536, num_heads, qk_norm, eps, operation_settings=operation_settings)
-
-    def forward(
-        self,
-        x,
-        e,
-        freqs,
-        context,
-        context_img_len=257,
-        audio=None,
-        transformer_options={},
-    ):
-        r"""
-        Args:
-            x(Tensor): Shape [B, L, C]
-            e(Tensor): Shape [B, 6, C]
-            freqs(Tensor): Rope freqs, shape [1024, C / num_heads / 2]
-        """
-        # assert e.dtype == torch.float32
-
-        if e.ndim < 4:
-            e = (comfy.model_management.cast_to(self.modulation, dtype=x.dtype, device=x.device) + e).chunk(6, dim=1)
-        else:
-            e = (comfy.model_management.cast_to(self.modulation, dtype=x.dtype, device=x.device).unsqueeze(0) + e).unbind(2)
-        # assert e[0].dtype == torch.float32
-
-        # self-attention
-        y = self.self_attn(
-            torch.addcmul(repeat_e(e[0], x), self.norm1(x), 1 + repeat_e(e[1], x)),
-            freqs, transformer_options=transformer_options)
-
-        x = torch.addcmul(x, y, repeat_e(e[2], x))
-
-        # cross-attention & ffn
-        x = x + self.cross_attn(self.norm3(x), context, context_img_len=context_img_len, transformer_options=transformer_options)
-        if audio is not None:
-            x = self.audio_cross_attn_wrapper(x, audio, transformer_options=transformer_options)
-        y = self.ffn(torch.addcmul(repeat_e(e[3], x), self.norm2(x), 1 + repeat_e(e[4], x)))
-        x = torch.addcmul(x, y, repeat_e(e[5], x))
-        return x
-
-class DummyAdapterLayer(nn.Module):
-    def __init__(self, layer):
-        super().__init__()
-        self.layer = layer
-
-    def forward(self, *args, **kwargs):
-        return self.layer(*args, **kwargs)
-
-
-class AudioProjModel(nn.Module):
-    def __init__(
-        self,
-        seq_len=5,
-        blocks=13,  # add a new parameter blocks
-        channels=768,  # add a new parameter channels
-        intermediate_dim=512,
-        output_dim=1536,
-        context_tokens=16,
-        device=None,
-        dtype=None,
-        operations=None,
-    ):
-        super().__init__()
-
-        self.seq_len = seq_len
-        self.blocks = blocks
-        self.channels = channels
-        self.input_dim = seq_len * blocks * channels  # update input_dim to be the product of blocks and channels.
-        self.intermediate_dim = intermediate_dim
-        self.context_tokens = context_tokens
-        self.output_dim = output_dim
-
-        # define multiple linear layers
-        self.audio_proj_glob_1 = DummyAdapterLayer(operations.Linear(self.input_dim, intermediate_dim, dtype=dtype, device=device))
-        self.audio_proj_glob_2 = DummyAdapterLayer(operations.Linear(intermediate_dim, intermediate_dim, dtype=dtype, device=device))
-        self.audio_proj_glob_3 = DummyAdapterLayer(operations.Linear(intermediate_dim, context_tokens * output_dim, dtype=dtype, device=device))
-
-        self.audio_proj_glob_norm = DummyAdapterLayer(operations.LayerNorm(output_dim, dtype=dtype, device=device))
-
-    def forward(self, audio_embeds):
-        video_length = audio_embeds.shape[1]
-        audio_embeds = rearrange(audio_embeds, "bz f w b c -> (bz f) w b c")
-        batch_size, window_size, blocks, channels = audio_embeds.shape
-        audio_embeds = audio_embeds.view(batch_size, window_size * blocks * channels)
-
-        audio_embeds = torch.relu(self.audio_proj_glob_1(audio_embeds))
-        audio_embeds = torch.relu(self.audio_proj_glob_2(audio_embeds))
-
-        context_tokens = self.audio_proj_glob_3(audio_embeds).reshape(batch_size, self.context_tokens, self.output_dim)
-
-        context_tokens = self.audio_proj_glob_norm(context_tokens)
-        context_tokens = rearrange(context_tokens, "(bz f) m c -> bz f m c", f=video_length)
-
-        return context_tokens
-
-
-class HumoWanModel(WanModel):
-    r"""
-    Wan diffusion backbone supporting both text-to-video and image-to-video.
-    """
-
-    def __init__(self,
-                 model_type='humo',
-                 patch_size=(1, 2, 2),
-                 text_len=512,
-                 in_dim=16,
-                 dim=2048,
-                 ffn_dim=8192,
-                 freq_dim=256,
-                 text_dim=4096,
-                 out_dim=16,
-                 num_heads=16,
-                 num_layers=32,
-                 window_size=(-1, -1),
-                 qk_norm=True,
-                 cross_attn_norm=True,
-                 eps=1e-6,
-                 flf_pos_embed_token_number=None,
-                 image_model=None,
-                 audio_token_num=16,
-                 device=None,
-                 dtype=None,
-                 operations=None,
-                 ):
-
-        super().__init__(model_type='t2v', patch_size=patch_size, text_len=text_len, in_dim=in_dim, dim=dim, ffn_dim=ffn_dim, freq_dim=freq_dim, text_dim=text_dim, out_dim=out_dim, num_heads=num_heads, num_layers=num_layers, window_size=window_size, qk_norm=qk_norm, cross_attn_norm=cross_attn_norm, eps=eps, flf_pos_embed_token_number=flf_pos_embed_token_number, wan_attn_block_class=WanAttentionBlockAudio, image_model=image_model, device=device, dtype=dtype, operations=operations)
-
-        self.audio_proj = AudioProjModel(seq_len=8, blocks=5, channels=1280, intermediate_dim=512, output_dim=1536, context_tokens=audio_token_num, dtype=dtype, device=device, operations=operations)
-
-    def forward_orig(
-        self,
-        x,
-        t,
-        context,
-        freqs=None,
-        audio_embed=None,
-        reference_latent=None,
-        transformer_options={},
-        **kwargs,
-    ):
-        bs, _, time, height, width = x.shape
-
-        # embeddings
-        x = self.patch_embedding(x.float()).to(x.dtype)
-        grid_sizes = x.shape[2:]
-        x = x.flatten(2).transpose(1, 2)
-
-        # time embeddings
-        e = self.time_embedding(
-            sinusoidal_embedding_1d(self.freq_dim, t.flatten()).to(dtype=x[0].dtype))
-        e = e.reshape(t.shape[0], -1, e.shape[-1])
-        e0 = self.time_projection(e).unflatten(2, (6, self.dim))
-
-        if reference_latent is not None:
-            ref = self.patch_embedding(reference_latent.float()).to(x.dtype)
-            ref = ref.flatten(2).transpose(1, 2)
-            freqs_ref = self.rope_encode(reference_latent.shape[-3], reference_latent.shape[-2], reference_latent.shape[-1], t_start=time, device=x.device, dtype=x.dtype)
-            x = torch.cat([x, ref], dim=1)
-            freqs = torch.cat([freqs, freqs_ref], dim=1)
-            del ref, freqs_ref
-
-        # context
-        context = self.text_embedding(context)
-        context_img_len = None
-
-        if audio_embed is not None:
-            if reference_latent is not None:
-                zero_audio_pad = torch.zeros(audio_embed.shape[0], reference_latent.shape[-3], *audio_embed.shape[2:], device=audio_embed.device, dtype=audio_embed.dtype)
-                audio_embed = torch.cat([audio_embed, zero_audio_pad], dim=1)
-            audio = self.audio_proj(audio_embed).permute(0, 3, 1, 2).flatten(2).transpose(1, 2)
-        else:
-            audio = None
-
-        patches_replace = transformer_options.get("patches_replace", {})
-        blocks_replace = patches_replace.get("dit", {})
-        for i, block in enumerate(self.blocks):
-            if ("double_block", i) in blocks_replace:
-                def block_wrap(args):
-                    out = {}
-                    out["img"] = block(args["img"], context=args["txt"], e=args["vec"], freqs=args["pe"], context_img_len=context_img_len, audio=audio, transformer_options=args["transformer_options"])
-                    return out
-                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": e0, "pe": freqs, "transformer_options": transformer_options}, {"original_block": block_wrap})
-                x = out["img"]
-            else:
-                x = block(x, e=e0, freqs=freqs, context=context, context_img_len=context_img_len, audio=audio, transformer_options=transformer_options)
-
-        # head
-        x = self.head(x, e)
-
-        # unpatchify
-        x = self.unpatchify(x, grid_sizes)
-        return x
--- a/comfy/ldm/wan/model_animate.py
+++ b/comfy/ldm/wan/model_animate.py
@@ -1,548 +0,0 @@
-from torch import nn
-import torch
-from typing import Tuple, Optional
-from einops import rearrange
-import torch.nn.functional as F
-import math
-from .model import WanModel, sinusoidal_embedding_1d
-from comfy.ldm.modules.attention import optimized_attention
-import comfy.model_management
-
-class CausalConv1d(nn.Module):
-
-    def __init__(self, chan_in, chan_out, kernel_size=3, stride=1, dilation=1, pad_mode="replicate", operations=None, **kwargs):
-        super().__init__()
-
-        self.pad_mode = pad_mode
-        padding = (kernel_size - 1, 0)  # T
-        self.time_causal_padding = padding
-
-        self.conv = operations.Conv1d(chan_in, chan_out, kernel_size, stride=stride, dilation=dilation, **kwargs)
-
-    def forward(self, x):
-        x = F.pad(x, self.time_causal_padding, mode=self.pad_mode)
-        return self.conv(x)
-
-
-class FaceEncoder(nn.Module):
-    def __init__(self, in_dim: int, hidden_dim: int, num_heads=int, dtype=None, device=None, operations=None):
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-
-        self.num_heads = num_heads
-        self.conv1_local = CausalConv1d(in_dim, 1024 * num_heads, 3, stride=1, operations=operations, **factory_kwargs)
-        self.norm1 = operations.LayerNorm(hidden_dim // 8, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-        self.act = nn.SiLU()
-        self.conv2 = CausalConv1d(1024, 1024, 3, stride=2, operations=operations, **factory_kwargs)
-        self.conv3 = CausalConv1d(1024, 1024, 3, stride=2, operations=operations, **factory_kwargs)
-
-        self.out_proj = operations.Linear(1024, hidden_dim, **factory_kwargs)
-        self.norm1 = operations.LayerNorm(1024, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.norm2 = operations.LayerNorm(1024, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.norm3 = operations.LayerNorm(1024, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.padding_tokens = nn.Parameter(torch.empty(1, 1, 1, hidden_dim, **factory_kwargs))
-
-    def forward(self, x):
-
-        x = rearrange(x, "b t c -> b c t")
-        b, c, t = x.shape
-
-        x = self.conv1_local(x)
-        x = rearrange(x, "b (n c) t -> (b n) t c", n=self.num_heads)
-
-        x = self.norm1(x)
-        x = self.act(x)
-        x = rearrange(x, "b t c -> b c t")
-        x = self.conv2(x)
-        x = rearrange(x, "b c t -> b t c")
-        x = self.norm2(x)
-        x = self.act(x)
-        x = rearrange(x, "b t c -> b c t")
-        x = self.conv3(x)
-        x = rearrange(x, "b c t -> b t c")
-        x = self.norm3(x)
-        x = self.act(x)
-        x = self.out_proj(x)
-        x = rearrange(x, "(b n) t c -> b t n c", b=b)
-        padding = comfy.model_management.cast_to(self.padding_tokens, dtype=x.dtype, device=x.device).repeat(b, x.shape[1], 1, 1)
-        x = torch.cat([x, padding], dim=-2)
-        x_local = x.clone()
-
-        return x_local
-
-
-def get_norm_layer(norm_layer, operations=None):
-    """
-    Get the normalization layer.
-
-    Args:
-        norm_layer (str): The type of normalization layer.
-
-    Returns:
-        norm_layer (nn.Module): The normalization layer.
-    """
-    if norm_layer == "layer":
-        return operations.LayerNorm
-    elif norm_layer == "rms":
-        return operations.RMSNorm
-    else:
-        raise NotImplementedError(f"Norm layer {norm_layer} is not implemented")
-
-
-class FaceAdapter(nn.Module):
-    def __init__(
-        self,
-        hidden_dim: int,
-        heads_num: int,
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        num_adapter_layers: int = 1,
-        dtype=None, device=None, operations=None
-    ):
-
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-        self.hidden_size = hidden_dim
-        self.heads_num = heads_num
-        self.fuser_blocks = nn.ModuleList(
-            [
-                FaceBlock(
-                    self.hidden_size,
-                    self.heads_num,
-                    qk_norm=qk_norm,
-                    qk_norm_type=qk_norm_type,
-                    operations=operations,
-                    **factory_kwargs,
-                )
-                for _ in range(num_adapter_layers)
-            ]
-        )
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        motion_embed: torch.Tensor,
-        idx: int,
-        freqs_cis_q: Tuple[torch.Tensor, torch.Tensor] = None,
-        freqs_cis_k: Tuple[torch.Tensor, torch.Tensor] = None,
-    ) -> torch.Tensor:
-
-        return self.fuser_blocks[idx](x, motion_embed, freqs_cis_q, freqs_cis_k)
-
-
-
-class FaceBlock(nn.Module):
-    def __init__(
-        self,
-        hidden_size: int,
-        heads_num: int,
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        qk_scale: float = None,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-        operations=None
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-
-        self.deterministic = False
-        self.hidden_size = hidden_size
-        self.heads_num = heads_num
-        head_dim = hidden_size // heads_num
-        self.scale = qk_scale or head_dim**-0.5
-
-        self.linear1_kv = operations.Linear(hidden_size, hidden_size * 2, **factory_kwargs)
-        self.linear1_q = operations.Linear(hidden_size, hidden_size, **factory_kwargs)
-
-        self.linear2 = operations.Linear(hidden_size, hidden_size, **factory_kwargs)
-
-        qk_norm_layer = get_norm_layer(qk_norm_type, operations=operations)
-        self.q_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs) if qk_norm else nn.Identity()
-        )
-        self.k_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs) if qk_norm else nn.Identity()
-        )
-
-        self.pre_norm_feat = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.pre_norm_motion = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        motion_vec: torch.Tensor,
-        motion_mask: Optional[torch.Tensor] = None,
-        # use_context_parallel=False,
-    ) -> torch.Tensor:
-
-        B, T, N, C = motion_vec.shape
-        T_comp = T
-
-        x_motion = self.pre_norm_motion(motion_vec)
-        x_feat = self.pre_norm_feat(x)
-
-        kv = self.linear1_kv(x_motion)
-        q = self.linear1_q(x_feat)
-
-        k, v = rearrange(kv, "B L N (K H D) -> K B L N H D", K=2, H=self.heads_num)
-        q = rearrange(q, "B S (H D) -> B S H D", H=self.heads_num)
-
-        # Apply QK-Norm if needed.
-        q = self.q_norm(q).to(v)
-        k = self.k_norm(k).to(v)
-
-        k = rearrange(k, "B L N H D -> (B L) N H D")
-        v = rearrange(v, "B L N H D -> (B L) N H D")
-
-        q = rearrange(q, "B (L S) H D -> (B L) S (H D)", L=T_comp)
-
-        attn = optimized_attention(q, k, v, heads=self.heads_num)
-
-        attn = rearrange(attn, "(B L) S C -> B (L S) C", L=T_comp)
-
-        output = self.linear2(attn)
-
-        if motion_mask is not None:
-            output = output * rearrange(motion_mask, "B T H W -> B (T H W)").unsqueeze(-1)
-
-        return output
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/ops/upfirdn2d/upfirdn2d.py#L162
-def upfirdn2d_native(input, kernel, up_x, up_y, down_x, down_y, pad_x0, pad_x1, pad_y0, pad_y1):
-    _, minor, in_h, in_w = input.shape
-    kernel_h, kernel_w = kernel.shape
-
-    out = input.view(-1, minor, in_h, 1, in_w, 1)
-    out = F.pad(out, [0, up_x - 1, 0, 0, 0, up_y - 1, 0, 0])
-    out = out.view(-1, minor, in_h * up_y, in_w * up_x)
-
-    out = F.pad(out, [max(pad_x0, 0), max(pad_x1, 0), max(pad_y0, 0), max(pad_y1, 0)])
-    out = out[:, :, max(-pad_y0, 0): out.shape[2] - max(-pad_y1, 0), max(-pad_x0, 0): out.shape[3] - max(-pad_x1, 0)]
-
-    out = out.reshape([-1, 1, in_h * up_y + pad_y0 + pad_y1, in_w * up_x + pad_x0 + pad_x1])
-    w = torch.flip(kernel, [0, 1]).view(1, 1, kernel_h, kernel_w)
-    out = F.conv2d(out, w)
-    out = out.reshape(-1, minor, in_h * up_y + pad_y0 + pad_y1 - kernel_h + 1, in_w * up_x + pad_x0 + pad_x1 - kernel_w + 1)
-    return out[:, :, ::down_y, ::down_x]
-
-def upfirdn2d(input, kernel, up=1, down=1, pad=(0, 0)):
-    return upfirdn2d_native(input, kernel, up, up, down, down, pad[0], pad[1], pad[0], pad[1])
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/ops/fused_act/fused_act.py#L81
-class FusedLeakyReLU(torch.nn.Module):
-    def __init__(self, channel, negative_slope=0.2, scale=2 ** 0.5, dtype=None, device=None):
-        super().__init__()
-        self.bias = torch.nn.Parameter(torch.empty(1, channel, 1, 1, dtype=dtype, device=device))
-        self.negative_slope = negative_slope
-        self.scale = scale
-
-    def forward(self, input):
-        return fused_leaky_relu(input, comfy.model_management.cast_to(self.bias, device=input.device, dtype=input.dtype), self.negative_slope, self.scale)
-
-def fused_leaky_relu(input, bias, negative_slope=0.2, scale=2 ** 0.5):
-    return F.leaky_relu(input + bias, negative_slope) * scale
-
-class Blur(torch.nn.Module):
-    def __init__(self, kernel, pad, dtype=None, device=None):
-        super().__init__()
-        kernel = torch.tensor(kernel, dtype=dtype, device=device)
-        kernel = kernel[None, :] * kernel[:, None]
-        kernel = kernel / kernel.sum()
-        self.register_buffer('kernel', kernel)
-        self.pad = pad
-
-    def forward(self, input):
-        return upfirdn2d(input, comfy.model_management.cast_to(self.kernel, dtype=input.dtype, device=input.device), pad=self.pad)
-
-#https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L590
-class ScaledLeakyReLU(torch.nn.Module):
-    def __init__(self, negative_slope=0.2):
-        super().__init__()
-        self.negative_slope = negative_slope
-
-    def forward(self, input):
-        return F.leaky_relu(input, negative_slope=self.negative_slope)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L605
-class EqualConv2d(torch.nn.Module):
-    def __init__(self, in_channel, out_channel, kernel_size, stride=1, padding=0, bias=True, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.weight = torch.nn.Parameter(torch.empty(out_channel, in_channel, kernel_size, kernel_size, device=device, dtype=dtype))
-        self.scale = 1 / math.sqrt(in_channel * kernel_size ** 2)
-        self.stride = stride
-        self.padding = padding
-        self.bias = torch.nn.Parameter(torch.empty(out_channel, device=device, dtype=dtype)) if bias else None
-
-    def forward(self, input):
-        if self.bias is None:
-            bias = None
-        else:
-            bias = comfy.model_management.cast_to(self.bias, device=input.device, dtype=input.dtype)
-
-        return F.conv2d(input, comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) * self.scale, bias=bias, stride=self.stride, padding=self.padding)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L134
-class EqualLinear(torch.nn.Module):
-    def __init__(self, in_dim, out_dim, bias=True, bias_init=0, lr_mul=1, activation=None, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.weight = torch.nn.Parameter(torch.empty(out_dim, in_dim, device=device, dtype=dtype))
-        self.bias = torch.nn.Parameter(torch.empty(out_dim, device=device, dtype=dtype)) if bias else None
-        self.activation = activation
-        self.scale = (1 / math.sqrt(in_dim)) * lr_mul
-        self.lr_mul = lr_mul
-
-    def forward(self, input):
-        if self.bias is None:
-            bias = None
-        else:
-            bias = comfy.model_management.cast_to(self.bias, device=input.device, dtype=input.dtype) * self.lr_mul
-
-        if self.activation:
-            out = F.linear(input, comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) * self.scale)
-            return fused_leaky_relu(out, bias)
-        return F.linear(input, comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) * self.scale, bias=bias)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L654
-class ConvLayer(torch.nn.Sequential):
-    def __init__(self, in_channel, out_channel, kernel_size, downsample=False, blur_kernel=[1, 3, 3, 1], bias=True, activate=True, dtype=None, device=None, operations=None):
-        layers = []
-
-        if downsample:
-            factor = 2
-            p = (len(blur_kernel) - factor) + (kernel_size - 1)
-            layers.append(Blur(blur_kernel, pad=((p + 1) // 2, p // 2)))
-            stride, padding = 2, 0
-        else:
-            stride, padding = 1, kernel_size // 2
-
-        layers.append(EqualConv2d(in_channel, out_channel, kernel_size, padding=padding, stride=stride, bias=bias and not activate, dtype=dtype, device=device, operations=operations))
-
-        if activate:
-            layers.append(FusedLeakyReLU(out_channel) if bias else ScaledLeakyReLU(0.2))
-
-        super().__init__(*layers)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L704
-class ResBlock(torch.nn.Module):
-    def __init__(self, in_channel, out_channel, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.conv1 = ConvLayer(in_channel, in_channel, 3, dtype=dtype, device=device, operations=operations)
-        self.conv2 = ConvLayer(in_channel, out_channel, 3, downsample=True, dtype=dtype, device=device, operations=operations)
-        self.skip = ConvLayer(in_channel, out_channel, 1, downsample=True, activate=False, bias=False, dtype=dtype, device=device, operations=operations)
-
-    def forward(self, input):
-        out = self.conv2(self.conv1(input))
-        skip = self.skip(input)
-        return (out + skip) / math.sqrt(2)
-
-
-class EncoderApp(torch.nn.Module):
-    def __init__(self, w_dim=512, dtype=None, device=None, operations=None):
-        super().__init__()
-        kwargs = {"device": device, "dtype": dtype, "operations": operations}
-
-        self.convs = torch.nn.ModuleList([
-            ConvLayer(3, 32, 1, **kwargs), ResBlock(32, 64, **kwargs),
-            ResBlock(64, 128, **kwargs), ResBlock(128, 256, **kwargs),
-            ResBlock(256, 512, **kwargs), ResBlock(512, 512, **kwargs),
-            ResBlock(512, 512, **kwargs), ResBlock(512, 512, **kwargs),
-            EqualConv2d(512, w_dim, 4, padding=0, bias=False, **kwargs)
-        ])
-
-    def forward(self, x):
-        h = x
-        for conv in self.convs:
-            h = conv(h)
-        return h.squeeze(-1).squeeze(-1)
-
-class Encoder(torch.nn.Module):
-    def __init__(self, dim=512, motion_dim=20, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.net_app = EncoderApp(dim, dtype=dtype, device=device, operations=operations)
-        self.fc = torch.nn.Sequential(*[EqualLinear(dim, dim, dtype=dtype, device=device, operations=operations) for _ in range(4)] + [EqualLinear(dim, motion_dim, dtype=dtype, device=device, operations=operations)])
-
-    def encode_motion(self, x):
-        return self.fc(self.net_app(x))
-
-class Direction(torch.nn.Module):
-    def __init__(self, motion_dim, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.weight = torch.nn.Parameter(torch.empty(512, motion_dim, device=device, dtype=dtype))
-        self.motion_dim = motion_dim
-
-    def forward(self, input):
-        stabilized_weight = comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) + 1e-8 * torch.eye(512, self.motion_dim, device=input.device, dtype=input.dtype)
-        Q, _ = torch.linalg.qr(stabilized_weight.float())
-        if input is None:
-            return Q
-        return torch.sum(input.unsqueeze(-1) * Q.T.to(input.dtype), dim=1)
-
-class Synthesis(torch.nn.Module):
-    def __init__(self, motion_dim, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.direction = Direction(motion_dim, dtype=dtype, device=device, operations=operations)
-
-class Generator(torch.nn.Module):
-    def __init__(self, style_dim=512, motion_dim=20, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.enc = Encoder(style_dim, motion_dim, dtype=dtype, device=device, operations=operations)
-        self.dec = Synthesis(motion_dim, dtype=dtype, device=device, operations=operations)
-
-    def get_motion(self, img):
-        motion_feat = self.enc.encode_motion(img)
-        return self.dec.direction(motion_feat)
-
-class AnimateWanModel(WanModel):
-    r"""
-    Wan diffusion backbone supporting both text-to-video and image-to-video.
-    """
-
-    def __init__(self,
-                 model_type='animate',
-                 patch_size=(1, 2, 2),
-                 text_len=512,
-                 in_dim=16,
-                 dim=2048,
-                 ffn_dim=8192,
-                 freq_dim=256,
-                 text_dim=4096,
-                 out_dim=16,
-                 num_heads=16,
-                 num_layers=32,
-                 window_size=(-1, -1),
-                 qk_norm=True,
-                 cross_attn_norm=True,
-                 eps=1e-6,
-                 flf_pos_embed_token_number=None,
-                 motion_encoder_dim=512,
-                 image_model=None,
-                 device=None,
-                 dtype=None,
-                 operations=None,
-                 ):
-
-        super().__init__(model_type='i2v', patch_size=patch_size, text_len=text_len, in_dim=in_dim, dim=dim, ffn_dim=ffn_dim, freq_dim=freq_dim, text_dim=text_dim, out_dim=out_dim, num_heads=num_heads, num_layers=num_layers, window_size=window_size, qk_norm=qk_norm, cross_attn_norm=cross_attn_norm, eps=eps, flf_pos_embed_token_number=flf_pos_embed_token_number, image_model=image_model, device=device, dtype=dtype, operations=operations)
-
-        self.pose_patch_embedding = operations.Conv3d(
-            16, dim, kernel_size=patch_size, stride=patch_size, device=device, dtype=dtype
-        )
-
-        self.motion_encoder = Generator(style_dim=512, motion_dim=20, device=device, dtype=dtype, operations=operations)
-
-        self.face_adapter = FaceAdapter(
-            heads_num=self.num_heads,
-            hidden_dim=self.dim,
-            num_adapter_layers=self.num_layers // 5,
-            device=device, dtype=dtype, operations=operations
-        )
-
-        self.face_encoder = FaceEncoder(
-            in_dim=motion_encoder_dim,
-            hidden_dim=self.dim,
-            num_heads=4,
-            device=device, dtype=dtype, operations=operations
-        )
-
-    def after_patch_embedding(self, x, pose_latents, face_pixel_values):
-        if pose_latents is not None:
-            pose_latents = self.pose_patch_embedding(pose_latents)
-            x[:, :, 1:pose_latents.shape[2] + 1] += pose_latents[:, :, :x.shape[2] - 1]
-
-        if face_pixel_values is None:
-            return x, None
-
-        b, c, T, h, w = face_pixel_values.shape
-        face_pixel_values = rearrange(face_pixel_values, "b c t h w -> (b t) c h w")
-        encode_bs = 8
-        face_pixel_values_tmp = []
-        for i in range(math.ceil(face_pixel_values.shape[0] / encode_bs)):
-            face_pixel_values_tmp.append(self.motion_encoder.get_motion(face_pixel_values[i * encode_bs: (i + 1) * encode_bs]))
-
-        motion_vec = torch.cat(face_pixel_values_tmp)
-
-        motion_vec = rearrange(motion_vec, "(b t) c -> b t c", t=T)
-        motion_vec = self.face_encoder(motion_vec)
-
-        B, L, H, C = motion_vec.shape
-        pad_face = torch.zeros(B, 1, H, C).type_as(motion_vec)
-        motion_vec = torch.cat([pad_face, motion_vec], dim=1)
-
-        if motion_vec.shape[1] < x.shape[2]:
-            B, L, H, C = motion_vec.shape
-            pad = torch.zeros(B, x.shape[2] - motion_vec.shape[1], H, C).type_as(motion_vec)
-            motion_vec = torch.cat([motion_vec, pad], dim=1)
-        else:
-            motion_vec = motion_vec[:, :x.shape[2]]
-        return x, motion_vec
-
-    def forward_orig(
-        self,
-        x,
-        t,
-        context,
-        clip_fea=None,
-        pose_latents=None,
-        face_pixel_values=None,
-        freqs=None,
-        transformer_options={},
-        **kwargs,
-    ):
-        # embeddings
-        x = self.patch_embedding(x.float()).to(x.dtype)
-        x, motion_vec = self.after_patch_embedding(x, pose_latents, face_pixel_values)
-        grid_sizes = x.shape[2:]
-        x = x.flatten(2).transpose(1, 2)
-
-        # time embeddings
-        e = self.time_embedding(
-            sinusoidal_embedding_1d(self.freq_dim, t.flatten()).to(dtype=x[0].dtype))
-        e = e.reshape(t.shape[0], -1, e.shape[-1])
-        e0 = self.time_projection(e).unflatten(2, (6, self.dim))
-
-        full_ref = None
-        if self.ref_conv is not None:
-            full_ref = kwargs.get("reference_latent", None)
-            if full_ref is not None:
-                full_ref = self.ref_conv(full_ref).flatten(2).transpose(1, 2)
-                x = torch.concat((full_ref, x), dim=1)
-
-        # context
-        context = self.text_embedding(context)
-
-        context_img_len = None
-        if clip_fea is not None:
-            if self.img_emb is not None:
-                context_clip = self.img_emb(clip_fea)  # bs x 257 x dim
-                context = torch.concat([context_clip, context], dim=1)
-            context_img_len = clip_fea.shape[-2]
-
-        patches_replace = transformer_options.get("patches_replace", {})
-        blocks_replace = patches_replace.get("dit", {})
-        for i, block in enumerate(self.blocks):
-            if ("double_block", i) in blocks_replace:
-                def block_wrap(args):
-                    out = {}
-                    out["img"] = block(args["img"], context=args["txt"], e=args["vec"], freqs=args["pe"], context_img_len=context_img_len, transformer_options=args["transformer_options"])
-                    return out
-                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": e0, "pe": freqs, "transformer_options": transformer_options}, {"original_block": block_wrap})
-                x = out["img"]
-            else:
-                x = block(x, e=e0, freqs=freqs, context=context, context_img_len=context_img_len, transformer_options=transformer_options)
-
-            if i % 5 == 0 and motion_vec is not None:
-                x = x + self.face_adapter.fuser_blocks[i // 5](x, motion_vec)
-
-        # head
-        x = self.head(x, e)
-
-        if full_ref is not None:
-            x = x[:, full_ref.shape[1]:]
-
-        # unpatchify
-        x = self.unpatchify(x, grid_sizes)
-        return x
--- a/comfy/ldm/wan/vae.py
+++ b/comfy/ldm/wan/vae.py
@@ -468,46 +468,55 @@ class WanVAE(nn.Module):
                                 attn_scales, self.temperal_upsample, dropout)

    def encode(self, x):
-        conv_idx = [0]
-        feat_map = [None] * count_conv3d(self.decoder)
+        self.clear_cache()
        ## cache
        t = x.shape[2]
        iter_ = 1 + (t - 1) // 4
        ## Split the encode input x along the time axis into chunks of 1, 4, 4, 4, ...
        for i in range(iter_):
-            conv_idx = [0]
+            self._enc_conv_idx = [0]
            if i == 0:
                out = self.encoder(
                    x[:, :, :1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx)
+                    feat_cache=self._enc_feat_map,
+                    feat_idx=self._enc_conv_idx)
            else:
                out_ = self.encoder(
                    x[:, :, 1 + 4 * (i - 1):1 + 4 * i, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx)
+                    feat_cache=self._enc_feat_map,
+                    feat_idx=self._enc_conv_idx)
                out = torch.cat([out, out_], 2)
        mu, log_var = self.conv1(out).chunk(2, dim=1)
+        self.clear_cache()
        return mu

    def decode(self, z):
-        conv_idx = [0]
-        feat_map = [None] * count_conv3d(self.decoder)
+        self.clear_cache()
        # z: [b,c,t,h,w]

        iter_ = z.shape[2]
        x = self.conv2(z)
        for i in range(iter_):
-            conv_idx = [0]
+            self._conv_idx = [0]
            if i == 0:
                out = self.decoder(
                    x[:, :, i:i + 1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx)
+                    feat_cache=self._feat_map,
+                    feat_idx=self._conv_idx)
            else:
                out_ = self.decoder(
                    x[:, :, i:i + 1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx)
+                    feat_cache=self._feat_map,
+                    feat_idx=self._conv_idx)
                out = torch.cat([out, out_], 2)
+        self.clear_cache()
        return out
+
+    def clear_cache(self):
+        self._conv_num = count_conv3d(self.decoder)
+        self._conv_idx = [0]
+        self._feat_map = [None] * self._conv_num
+        #cache encode
+        self._enc_conv_num = count_conv3d(self.encoder)
+        self._enc_conv_idx = [0]
+        self._enc_feat_map = [None] * self._enc_conv_num
--- a/comfy/ldm/wan/vae2_2.py
+++ b/comfy/ldm/wan/vae2_2.py
@@ -657,51 +657,51 @@ class WanVAE(nn.Module):
        )

    def encode(self, x):
-        conv_idx = [0]
-        feat_map = [None] * count_conv3d(self.encoder)
+        self.clear_cache()
        x = patchify(x, patch_size=2)
        t = x.shape[2]
        iter_ = 1 + (t - 1) // 4
        for i in range(iter_):
-            conv_idx = [0]
+            self._enc_conv_idx = [0]
            if i == 0:
                out = self.encoder(
                    x[:, :, :1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx,
+                    feat_cache=self._enc_feat_map,
+                    feat_idx=self._enc_conv_idx,
                )
            else:
                out_ = self.encoder(
                    x[:, :, 1 + 4 * (i - 1):1 + 4 * i, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx,
+                    feat_cache=self._enc_feat_map,
+                    feat_idx=self._enc_conv_idx,
                )
                out = torch.cat([out, out_], 2)
        mu, log_var = self.conv1(out).chunk(2, dim=1)
+        self.clear_cache()
        return mu

    def decode(self, z):
-        conv_idx = [0]
-        feat_map = [None] * count_conv3d(self.decoder)
+        self.clear_cache()
        iter_ = z.shape[2]
        x = self.conv2(z)
        for i in range(iter_):
-            conv_idx = [0]
+            self._conv_idx = [0]
            if i == 0:
                out = self.decoder(
                    x[:, :, i:i + 1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx,
+                    feat_cache=self._feat_map,
+                    feat_idx=self._conv_idx,
                    first_chunk=True,
                )
            else:
                out_ = self.decoder(
                    x[:, :, i:i + 1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx,
+                    feat_cache=self._feat_map,
+                    feat_idx=self._conv_idx,
                )
                out = torch.cat([out, out_], 2)
        out = unpatchify(out, patch_size=2)
+        self.clear_cache()
        return out

    def reparameterize(self, mu, log_var):
@@ -715,3 +715,12 @@ class WanVAE(nn.Module):
            return mu
        std = torch.exp(0.5 * log_var.clamp(-30.0, 20.0))
        return mu + std * torch.randn_like(std)
+
+    def clear_cache(self):
+        self._conv_num = count_conv3d(self.decoder)
+        self._conv_idx = [0]
+        self._feat_map = [None] * self._conv_num
+        # cache encode
+        self._enc_conv_num = count_conv3d(self.encoder)
+        self._enc_conv_idx = [0]
+        self._enc_feat_map = [None] * self._enc_conv_num
--- a/comfy/lora.py
+++ b/comfy/lora.py
@@ -260,10 +260,6 @@ def model_lora_keys_unet(model, key_map={}):
                key_map["transformer.{}".format(k[:-len(".weight")])] = to #simpletrainer and probably regular diffusers flux lora format
                key_map["lycoris_{}".format(k[:-len(".weight")].replace(".", "_"))] = to #simpletrainer lycoris
                key_map["lora_transformer_{}".format(k[:-len(".weight")].replace(".", "_"))] = to #onetrainer
-        for k in sdk:
-            hidden_size = model.model_config.unet_config.get("hidden_size", 0)
-            if k.endswith(".weight") and ".linear1." in k:
-                key_map["{}".format(k.replace(".linear1.weight", ".linear1_qkv"))] = (k, (0, 0, hidden_size * 3))

    if isinstance(model, comfy.model_base.GenmoMochi):
        for k in sdk:
@@ -297,12 +293,6 @@ def model_lora_keys_unet(model, key_map={}):
                key_lora = k[len("diffusion_model."):-len(".weight")]
                key_map["{}".format(key_lora)] = k

-    if isinstance(model, comfy.model_base.Omnigen2):
-        for k in sdk:
-            if k.startswith("diffusion_model.") and k.endswith(".weight"):
-                key_lora = k[len("diffusion_model."):-len(".weight")]
-                key_map["{}".format(key_lora)] = k
-
    if isinstance(model, comfy.model_base.QwenImage):
        for k in sdk:
            if k.startswith("diffusion_model.") and k.endswith(".weight"): #QwenImage lora format
--- a/comfy/lora_convert.py
+++ b/comfy/lora_convert.py
@@ -15,29 +15,10 @@ def convert_lora_bfl_control(sd): #BFL loras for Flux
 def convert_lora_wan_fun(sd): #Wan Fun loras
    return comfy.utils.state_dict_prefix_replace(sd, {"lora_unet__": "lora_unet_"})

-def convert_uso_lora(sd):
-    sd_out = {}
-    for k in sd:
-        tensor = sd[k]
-        k_to = "diffusion_model.{}".format(k.replace(".down.weight", ".lora_down.weight")
-                                           .replace(".up.weight", ".lora_up.weight")
-                                           .replace(".qkv_lora2.", ".txt_attn.qkv.")
-                                           .replace(".qkv_lora1.", ".img_attn.qkv.")
-                                           .replace(".proj_lora1.", ".img_attn.proj.")
-                                           .replace(".proj_lora2.", ".txt_attn.proj.")
-                                           .replace(".qkv_lora.", ".linear1_qkv.")
-                                           .replace(".proj_lora.", ".linear2.")
-                                           .replace(".processor.", ".")
-                                           )
-        sd_out[k_to] = tensor
-    return sd_out
-

 def convert_lora(sd):
    if "img_in.lora_A.weight" in sd and "single_blocks.0.norm.key_norm.scale" in sd:
        return convert_lora_bfl_control(sd)
    if "lora_unet__blocks_0_cross_attn_k.lora_down.weight" in sd:
        return convert_lora_wan_fun(sd)
-    if "single_blocks.37.processor.qkv_lora.up.weight" in sd and "double_blocks.18.processor.qkv_lora2.up.weight" in sd:
-        return convert_uso_lora(sd)
    return sd
--- a/comfy/model_base.py
+++ b/comfy/model_base.py
@@ -16,8 +16,6 @@
    along with this program.  If not, see <https://www.gnu.org/licenses/>.
 """

-import comfy.ldm.hunyuan3dv2_1
-import comfy.ldm.hunyuan3dv2_1.hunyuandit
 import torch
 import logging
 from comfy.ldm.modules.diffusionmodules.openaimodel import UNetModel, Timestep
@@ -39,11 +37,9 @@ import comfy.ldm.cosmos.model
 import comfy.ldm.cosmos.predict2
 import comfy.ldm.lumina.model
 import comfy.ldm.wan.model
-import comfy.ldm.wan.model_animate
 import comfy.ldm.hunyuan3d.model
 import comfy.ldm.hidream.model
 import comfy.ldm.chroma.model
-import comfy.ldm.chroma_radiance.model
 import comfy.ldm.ace.model
 import comfy.ldm.omnigen.omnigen2
 import comfy.ldm.qwen_image.model
@@ -138,7 +134,6 @@ class BaseModel(torch.nn.Module):
            else:
                operations = model_config.custom_operations
            self.diffusion_model = unet_model(**unet_config, device=device, operations=operations)
-            self.diffusion_model.eval()
            if comfy.model_management.force_channels_last():
                self.diffusion_model.to(memory_format=torch.channels_last)
                logging.debug("using channels last mode for diffusion model")
@@ -197,14 +192,8 @@ class BaseModel(torch.nn.Module):
            extra_conds[o] = extra

        t = self.process_timestep(t, x=x, **extra_conds)
-        if "latent_shapes" in extra_conds:
-            xc = utils.unpack_latents(xc, extra_conds.pop("latent_shapes"))
-
-        model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds)
-        if len(model_output) > 1 and not torch.is_tensor(model_output):
-            model_output, _ = utils.pack_latents(model_output)
-
-        return self.model_sampling.calculate_denoised(sigma, model_output.float(), x)
+        model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
+        return self.model_sampling.calculate_denoised(sigma, model_output, x)

    def process_timestep(self, timestep, **kwargs):
        return timestep
@@ -676,6 +665,7 @@ class Lotus(BaseModel):
 class StableCascade_C(BaseModel):
    def __init__(self, model_config, model_type=ModelType.STABLE_CASCADE, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=StageC)
+        self.diffusion_model.eval().requires_grad_(False)

    def extra_conds(self, **kwargs):
        out = {}
@@ -704,6 +694,7 @@ class StableCascade_C(BaseModel):
 class StableCascade_B(BaseModel):
    def __init__(self, model_config, model_type=ModelType.STABLE_CASCADE, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=StageB)
+        self.diffusion_model.eval().requires_grad_(False)

    def extra_conds(self, **kwargs):
        out = {}
@@ -1219,63 +1210,6 @@ class WAN21_Camera(WAN21):
            out['camera_conditions'] = comfy.conds.CONDRegular(camera_conditions)
        return out

-class WAN21_HuMo(WAN21):
-    def __init__(self, model_config, model_type=ModelType.FLOW, image_to_video=False, device=None):
-        super(WAN21, self).__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model.HumoWanModel)
-        self.image_to_video = image_to_video
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        noise = kwargs.get("noise", None)
-
-        audio_embed = kwargs.get("audio_embed", None)
-        if audio_embed is not None:
-            out['audio_embed'] = comfy.conds.CONDRegular(audio_embed)
-
-        if "c_concat" not in out:  # 1.7B model
-            reference_latents = kwargs.get("reference_latents", None)
-            if reference_latents is not None:
-                out['reference_latent'] = comfy.conds.CONDRegular(self.process_latent_in(reference_latents[-1]))
-        else:
-            noise_shape = list(noise.shape)
-            noise_shape[1] += 4
-            concat_latent = torch.zeros(noise_shape, device=noise.device, dtype=noise.dtype)
-            zero_vae_values_first = torch.tensor([0.8660, -0.4326, -0.0017, -0.4884, -0.5283, 0.9207, -0.9896, 0.4433, -0.5543, -0.0113, 0.5753, -0.6000, -0.8346, -0.3497, -0.1926, -0.6938]).view(1, 16, 1, 1, 1)
-            zero_vae_values_second = torch.tensor([1.0869, -1.2370, 0.0206, -0.4357, -0.6411, 2.0307, -1.5972, 1.2659, -0.8595, -0.4654, 0.9638, -1.6330, -1.4310, -0.1098, -0.3856, -1.4583]).view(1, 16, 1, 1, 1)
-            zero_vae_values = torch.tensor([0.8642, -1.8583, 0.1577, 0.1350, -0.3641, 2.5863, -1.9670, 1.6065, -1.0475, -0.8678, 1.1734, -1.8138, -1.5933, -0.7721, -0.3289, -1.3745]).view(1, 16, 1, 1, 1)
-            concat_latent[:, 4:] = zero_vae_values
-            concat_latent[:, 4:, :1] = zero_vae_values_first
-            concat_latent[:, 4:, 1:2] = zero_vae_values_second
-            out['c_concat'] = comfy.conds.CONDNoiseShape(concat_latent)
-            reference_latents = kwargs.get("reference_latents", None)
-            if reference_latents is not None:
-                ref_latent = self.process_latent_in(reference_latents[-1])
-                ref_latent_shape = list(ref_latent.shape)
-                ref_latent_shape[1] += 4 + ref_latent_shape[1]
-                ref_latent_full = torch.zeros(ref_latent_shape, device=ref_latent.device, dtype=ref_latent.dtype)
-                ref_latent_full[:, 20:] = ref_latent
-                ref_latent_full[:, 16:20] = 1.0
-                out['reference_latent'] = comfy.conds.CONDRegular(ref_latent_full)
-
-        return out
-
-class WAN22_Animate(WAN21):
-    def __init__(self, model_config, model_type=ModelType.FLOW, image_to_video=False, device=None):
-        super(WAN21, self).__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model_animate.AnimateWanModel)
-        self.image_to_video = image_to_video
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-
-        face_video_pixels = kwargs.get("face_video_pixels", None)
-        if face_video_pixels is not None:
-            out['face_pixel_values'] = comfy.conds.CONDRegular(face_video_pixels)
-
-        pose_latents = kwargs.get("pose_video_latent", None)
-        if pose_latents is not None:
-            out['pose_latents'] = comfy.conds.CONDRegular(self.process_latent_in(pose_latents))
-        return out
-
 class WAN22_S2V(WAN21):
    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
        super(WAN21, self).__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model.WanModel_S2V)
@@ -1348,21 +1282,6 @@ class Hunyuan3Dv2(BaseModel):
            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
        return out

-class Hunyuan3Dv2_1(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.hunyuan3dv2_1.hunyuandit.HunYuanDiTPlain)
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        cross_attn = kwargs.get("cross_attn", None)
-        if cross_attn is not None:
-            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-
-        guidance = kwargs.get("guidance", 5.0)
-        if guidance is not None:
-            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
-        return out
-
 class HiDream(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.hidream.model.HiDreamImageTransformer2DModel)
@@ -1384,8 +1303,8 @@ class HiDream(BaseModel):
        return out

 class Chroma(Flux):
-    def __init__(self, model_config, model_type=ModelType.FLUX, device=None, unet_model=comfy.ldm.chroma.model.Chroma):
-        super().__init__(model_config, model_type, device=device, unet_model=unet_model)
+    def __init__(self, model_config, model_type=ModelType.FLUX, device=None):
+        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.chroma.model.Chroma)

    def extra_conds(self, **kwargs):
        out = super().extra_conds(**kwargs)
@@ -1395,10 +1314,6 @@ class Chroma(Flux):
            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
        return out

-class ChromaRadiance(Chroma):
-    def __init__(self, model_config, model_type=ModelType.FLUX, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.chroma_radiance.model.ChromaRadiance)
-
 class ACEStep(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.ace.model.ACEStepTransformer2DModel)
@@ -1476,55 +1391,3 @@ class QwenImage(BaseModel):
        if ref_latents is not None:
            out['ref_latents'] = list([1, 16, sum(map(lambda a: math.prod(a.size()), ref_latents)) // 16])
        return out
-
-class HunyuanImage21(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.hunyuan_video.model.HunyuanVideo)
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        attention_mask = kwargs.get("attention_mask", None)
-        if attention_mask is not None:
-            if torch.numel(attention_mask) != attention_mask.sum():
-                out['attention_mask'] = comfy.conds.CONDRegular(attention_mask)
-        cross_attn = kwargs.get("cross_attn", None)
-        if cross_attn is not None:
-            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-
-        conditioning_byt5small = kwargs.get("conditioning_byt5small", None)
-        if conditioning_byt5small is not None:
-            out['txt_byt5'] = comfy.conds.CONDRegular(conditioning_byt5small)
-
-        guidance = kwargs.get("guidance", 6.0)
-        if guidance is not None:
-            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
-
-        return out
-
-class HunyuanImage21Refiner(HunyuanImage21):
-    def concat_cond(self, **kwargs):
-        noise = kwargs.get("noise", None)
-        image = kwargs.get("concat_latent_image", None)
-        noise_augmentation = kwargs.get("noise_augmentation", 0.0)
-        device = kwargs["device"]
-
-        if image is None:
-            shape_image = list(noise.shape)
-            image = torch.zeros(shape_image, dtype=noise.dtype, layout=noise.layout, device=noise.device)
-        else:
-            image = utils.common_upscale(image.to(device), noise.shape[-1], noise.shape[-2], "bilinear", "center")
-            image = self.process_latent_in(image)
-            image = utils.resize_to_batch_size(image, noise.shape[0])
-            if noise_augmentation > 0:
-                generator = torch.Generator(device="cpu")
-                generator.manual_seed(kwargs.get("seed", 0) - 10)
-                noise = torch.randn(image.shape, generator=generator, dtype=image.dtype, device="cpu").to(image.device)
-                image = noise_augmentation * noise + min(1.0 - noise_augmentation, 0.75) * image
-            else:
-                image = 0.75 * image
-        return image
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        out['disable_time_r'] = comfy.conds.CONDConstant(True)
-        return out
--- a/comfy/model_detection.py
+++ b/comfy/model_detection.py
@@ -136,45 +136,25 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):

    if '{}txt_in.individual_token_refiner.blocks.0.norm1.weight'.format(key_prefix) in state_dict_keys: #Hunyuan Video
        dit_config = {}
-        in_w = state_dict['{}img_in.proj.weight'.format(key_prefix)]
-        out_w = state_dict['{}final_layer.linear.weight'.format(key_prefix)]
        dit_config["image_model"] = "hunyuan_video"
-        dit_config["in_channels"] = in_w.shape[1] #SkyReels img2video has 32 input channels
-        dit_config["patch_size"] = list(in_w.shape[2:])
-        dit_config["out_channels"] = out_w.shape[0] // math.prod(dit_config["patch_size"])
-        if any(s.startswith('{}vector_in.'.format(key_prefix)) for s in state_dict_keys):
-            dit_config["vec_in_dim"] = 768
-        else:
-            dit_config["vec_in_dim"] = None
-
-        if len(dit_config["patch_size"]) == 2:
-            dit_config["axes_dim"] = [64, 64]
-        else:
-            dit_config["axes_dim"] = [16, 56, 56]
-
-        if any(s.startswith('{}time_r_in.'.format(key_prefix)) for s in state_dict_keys):
-            dit_config["meanflow"] = True
-        else:
-            dit_config["meanflow"] = False
-
-        dit_config["context_in_dim"] = state_dict['{}txt_in.input_embedder.weight'.format(key_prefix)].shape[1]
-        dit_config["hidden_size"] = in_w.shape[0]
+        dit_config["in_channels"] = state_dict['{}img_in.proj.weight'.format(key_prefix)].shape[1] #SkyReels img2video has 32 input channels
+        dit_config["patch_size"] = [1, 2, 2]
+        dit_config["out_channels"] = 16
+        dit_config["vec_in_dim"] = 768
+        dit_config["context_in_dim"] = 4096
+        dit_config["hidden_size"] = 3072
        dit_config["mlp_ratio"] = 4.0
-        dit_config["num_heads"] = in_w.shape[0] // 128
+        dit_config["num_heads"] = 24
        dit_config["depth"] = count_blocks(state_dict_keys, '{}double_blocks.'.format(key_prefix) + '{}.')
        dit_config["depth_single_blocks"] = count_blocks(state_dict_keys, '{}single_blocks.'.format(key_prefix) + '{}.')
+        dit_config["axes_dim"] = [16, 56, 56]
        dit_config["theta"] = 256
        dit_config["qkv_bias"] = True
-        if '{}byt5_in.fc1.weight'.format(key_prefix) in state_dict:
-            dit_config["byt5"] = True
-        else:
-            dit_config["byt5"] = False
-
        guidance_keys = list(filter(lambda a: a.startswith("{}guidance_in.".format(key_prefix)), state_dict_keys))
        dit_config["guidance_embed"] = len(guidance_keys) > 0
        return dit_config

-    if '{}double_blocks.0.img_attn.norm.key_norm.scale'.format(key_prefix) in state_dict_keys and ('{}img_in.weight'.format(key_prefix) in state_dict_keys or f"{key_prefix}distilled_guidance_layer.norms.0.scale" in state_dict_keys): #Flux, Chroma or Chroma Radiance (has no img_in.weight)
+    if '{}double_blocks.0.img_attn.norm.key_norm.scale'.format(key_prefix) in state_dict_keys and '{}img_in.weight'.format(key_prefix) in state_dict_keys: #Flux
        dit_config = {}
        dit_config["image_model"] = "flux"
        dit_config["in_channels"] = 16
@@ -204,18 +184,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
            dit_config["out_dim"] = 3072
            dit_config["hidden_dim"] = 5120
            dit_config["n_layers"] = 5
-            if f"{key_prefix}nerf_blocks.0.norm.scale" in state_dict_keys: #Chroma Radiance
-                dit_config["image_model"] = "chroma_radiance"
-                dit_config["in_channels"] = 3
-                dit_config["out_channels"] = 3
-                dit_config["patch_size"] = 16
-                dit_config["nerf_hidden_size"] = 64
-                dit_config["nerf_mlp_ratio"] = 4
-                dit_config["nerf_depth"] = 4
-                dit_config["nerf_max_freqs"] = 8
-                dit_config["nerf_tile_size"] = 512
-                dit_config["nerf_final_head_type"] = "conv" if f"{key_prefix}nerf_final_layer_conv.norm.scale" in state_dict_keys else "linear"
-                dit_config["nerf_embedder_dtype"] = torch.float32
        else:
            dit_config["guidance_embed"] = "{}guidance_in.in_layer.weight".format(key_prefix) in state_dict_keys
        return dit_config
@@ -365,8 +333,8 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
        dit_config["patch_size"] = 2
        dit_config["in_channels"] = 16
        dit_config["dim"] = 2304
-        dit_config["cap_feat_dim"] = state_dict['{}cap_embedder.1.weight'.format(key_prefix)].shape[1]
-        dit_config["n_layers"] = count_blocks(state_dict_keys, '{}layers.'.format(key_prefix) + '{}.')
+        dit_config["cap_feat_dim"] = 2304
+        dit_config["n_layers"] = 26
        dit_config["n_heads"] = 24
        dit_config["n_kv_heads"] = 8
        dit_config["qk_norm"] = True
@@ -402,10 +370,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
                dit_config["model_type"] = "camera_2.2"
        elif '{}casual_audio_encoder.encoder.final_linear.weight'.format(key_prefix) in state_dict_keys:
            dit_config["model_type"] = "s2v"
-        elif '{}audio_proj.audio_proj_glob_1.layer.bias'.format(key_prefix) in state_dict_keys:
-            dit_config["model_type"] = "humo"
-        elif '{}face_adapter.fuser_blocks.0.k_norm.weight'.format(key_prefix) in state_dict_keys:
-            dit_config["model_type"] = "animate"
        else:
            if '{}img_emb.proj.0.bias'.format(key_prefix) in state_dict_keys:
                dit_config["model_type"] = "i2v"
@@ -436,20 +400,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
        dit_config["guidance_embed"] = "{}guidance_in.in_layer.weight".format(key_prefix) in state_dict_keys
        return dit_config

-    if f"{key_prefix}t_embedder.mlp.2.weight" in state_dict_keys:  # Hunyuan 3D 2.1
-
-        dit_config = {}
-        dit_config["image_model"] = "hunyuan3d2_1"
-        dit_config["in_channels"] = state_dict[f"{key_prefix}x_embedder.weight"].shape[1]
-        dit_config["context_dim"] = 1024
-        dit_config["hidden_size"] = state_dict[f"{key_prefix}x_embedder.weight"].shape[0]
-        dit_config["mlp_ratio"] = 4.0
-        dit_config["num_heads"] = 16
-        dit_config["depth"] = count_blocks(state_dict_keys, f"{key_prefix}blocks.{{}}")
-        dit_config["qkv_bias"] = False
-        dit_config["guidance_cond_proj_dim"] = None#f"{key_prefix}t_embedder.cond_proj.weight" in state_dict_keys
-        return dit_config
-
    if '{}caption_projection.0.linear.weight'.format(key_prefix) in state_dict_keys:  # HiDream
        dit_config = {}
        dit_config["image_model"] = "hidream"
--- a/comfy/model_management.py
+++ b/comfy/model_management.py
@@ -22,7 +22,6 @@ from enum import Enum
 from comfy.cli_args import args, PerformanceFeature
 import torch
 import sys
-import importlib
 import platform
 import weakref
 import gc
@@ -89,7 +88,6 @@ if args.deterministic:

 directml_enabled = False
 if args.directml is not None:
-    logging.warning("WARNING: torch-directml barely works, is very slow, has not been updated in over 1 year and might be removed soon, please don't use it, there are better options.")
    import torch_directml
    directml_enabled = True
    device_index = args.directml
@@ -291,24 +289,6 @@ def is_amd():
            return True
    return False

-def amd_min_version(device=None, min_rdna_version=0):
-    if not is_amd():
-        return False
-
-    if is_device_cpu(device):
-        return False
-
-    arch = torch.cuda.get_device_properties(device).gcnArchName
-    if arch.startswith('gfx') and len(arch) == 7:
-        try:
-            cmp_rdna_version = int(arch[4]) + 2
-        except:
-            cmp_rdna_version = 0
-        if cmp_rdna_version >= min_rdna_version:
-            return True
-
-    return False
-
 MIN_WEIGHT_MEMORY_RATIO = 0.4
 if is_nvidia():
    MIN_WEIGHT_MEMORY_RATIO = 0.0
@@ -331,33 +311,24 @@ except:


 SUPPORT_FP8_OPS = args.supports_fp8_compute
-
-AMD_RDNA2_AND_OLDER_ARCH = ["gfx1030", "gfx1031", "gfx1010", "gfx1011", "gfx1012", "gfx906", "gfx900", "gfx803"]
-
 try:
    if is_amd():
-        arch = torch.cuda.get_device_properties(get_torch_device()).gcnArchName
-        if not (any((a in arch) for a in AMD_RDNA2_AND_OLDER_ARCH)):
-            torch.backends.cudnn.enabled = False  # Seems to improve things a lot on AMD
-            logging.info("Set: torch.backends.cudnn.enabled = False for better AMD performance.")
-
        try:
            rocm_version = tuple(map(int, str(torch.version.hip).split(".")[:2]))
        except:
            rocm_version = (6, -1)
-
+        arch = torch.cuda.get_device_properties(get_torch_device()).gcnArchName
        logging.info("AMD arch: {}".format(arch))
        logging.info("ROCm version: {}".format(rocm_version))
        if args.use_split_cross_attention == False and args.use_quad_cross_attention == False:
-            if importlib.util.find_spec('triton') is not None:  # AMD efficient attention implementation depends on triton. TODO: better way of detecting if it's compiled in or not.
-                if torch_version_numeric >= (2, 7):  # works on 2.6 but doesn't actually seem to improve much
-                    if any((a in arch) for a in ["gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1151"]):  # TODO: more arches, TODO: gfx950
-                        ENABLE_PYTORCH_ATTENTION = True
-                if rocm_version >= (7, 0):
-                   if any((a in arch) for a in ["gfx1201"]):
-                       ENABLE_PYTORCH_ATTENTION = True
+            if torch_version_numeric >= (2, 7):  # works on 2.6 but doesn't actually seem to improve much
+                if any((a in arch) for a in ["gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1151"]):  # TODO: more arches, TODO: gfx950
+                    ENABLE_PYTORCH_ATTENTION = True
+#            if torch_version_numeric >= (2, 8):
+#                if any((a in arch) for a in ["gfx1201"]):
+#                    ENABLE_PYTORCH_ATTENTION = True
        if torch_version_numeric >= (2, 7) and rocm_version >= (6, 4):
-            if any((a in arch) for a in ["gfx1200", "gfx1201", "gfx950"]):  # TODO: more arches, "gfx942" gives error on pytorch nightly 2.10 1013 rocm7.0
+            if any((a in arch) for a in ["gfx1201", "gfx942", "gfx950"]):  # TODO: more arches
                SUPPORT_FP8_OPS = True

 except:
@@ -379,9 +350,6 @@ try:
 except:
    pass

-if torch.cuda.is_available() and torch.backends.cudnn.is_available() and PerformanceFeature.AutoTune in args.fast:
-    torch.backends.cudnn.benchmark = True
-
 try:
    if torch_version_numeric >= (2, 5):
        torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
@@ -657,9 +625,7 @@ def load_models_gpu(models, memory_required=0, force_patch_weights=False, minimu
            if loaded_model.model.is_clone(current_loaded_models[i].model):
                to_unload = [i] + to_unload
        for i in to_unload:
-            model_to_unload = current_loaded_models.pop(i)
-            model_to_unload.model.detach(unpatch_all=False)
-            model_to_unload.model_finalizer.detach()
+            current_loaded_models.pop(i).model.detach(unpatch_all=False)

    total_memory_required = {}
    for loaded_model in models_to_load:
@@ -937,7 +903,9 @@ def vae_dtype(device=None, allowed_dtypes=[]):
        if d == torch.float16 and should_use_fp16(device):
            return d

-        if d == torch.bfloat16 and should_use_bf16(device):
+        # NOTE: bfloat16 seems to work on AMD for the VAE but is extremely slow in some cases compared to fp32
+        # slowness still a problem on pytorch nightly 2.9.0.dev20250720+rocm6.4 tested on RDNA3
+        if d == torch.bfloat16 and (not is_amd()) and should_use_bf16(device):
            return d

    return torch.float32
@@ -999,6 +967,12 @@ def device_supports_non_blocking(device):
        return False
    return True

+def device_should_use_non_blocking(device):
+    if not device_supports_non_blocking(device):
+        return False
+    return False
+    # return True #TODO: figure out why this causes memory issues on Nvidia and possibly others
+
 def force_channels_last():
    if args.force_channels_last:
        return True
@@ -1332,7 +1306,7 @@ def should_use_bf16(device=None, model_params=0, prioritize_performance=True, ma

    if is_amd():
        arch = torch.cuda.get_device_properties(device).gcnArchName
-        if any((a in arch) for a in AMD_RDNA2_AND_OLDER_ARCH):  # RDNA2 and older don't support bf16
+        if any((a in arch) for a in ["gfx1030", "gfx1031", "gfx1010", "gfx1011", "gfx1012", "gfx906", "gfx900", "gfx803"]):  # RDNA2 and older don't support bf16
            if manual_cast:
                return True
            return False
--- a/comfy/model_patcher.py
+++ b/comfy/model_patcher.py
@@ -123,30 +123,16 @@ def move_weight_functions(m, device):
    return memory

 class LowVramPatch:
-    def __init__(self, key, patches, convert_func=None, set_func=None):
+    def __init__(self, key, patches):
        self.key = key
        self.patches = patches
-        self.convert_func = convert_func
-        self.set_func = set_func
-
    def __call__(self, weight):
        intermediate_dtype = weight.dtype
-        if self.convert_func is not None:
-            weight = self.convert_func(weight.to(dtype=torch.float32, copy=True), inplace=True)
-
        if intermediate_dtype not in [torch.float32, torch.float16, torch.bfloat16]: #intermediate_dtype has to be one that is supported in math ops
            intermediate_dtype = torch.float32
-            out = comfy.lora.calculate_weight(self.patches[self.key], weight.to(intermediate_dtype), self.key, intermediate_dtype=intermediate_dtype)
-            if self.set_func is None:
-                return comfy.float.stochastic_rounding(out, weight.dtype, seed=string_to_seed(self.key))
-            else:
-                return self.set_func(out, seed=string_to_seed(self.key), return_weight=True)
+            return comfy.float.stochastic_rounding(comfy.lora.calculate_weight(self.patches[self.key], weight.to(intermediate_dtype), self.key, intermediate_dtype=intermediate_dtype), weight.dtype, seed=string_to_seed(self.key))

-        out = comfy.lora.calculate_weight(self.patches[self.key], weight, self.key, intermediate_dtype=intermediate_dtype)
-        if self.set_func is not None:
-            return self.set_func(out, seed=string_to_seed(self.key), return_weight=True).to(dtype=intermediate_dtype)
-        else:
-            return out
+        return comfy.lora.calculate_weight(self.patches[self.key], weight, self.key, intermediate_dtype=intermediate_dtype)

 def get_key_weight(model, key):
    set_func = None
@@ -447,9 +433,6 @@ class ModelPatcher:
    def set_model_double_block_patch(self, patch):
        self.set_model_patch(patch, "double_block")

-    def set_model_post_input_patch(self, patch):
-        self.set_model_patch(patch, "post_input")
-
    def add_object_patch(self, name, obj):
        self.object_patches[name] = obj

@@ -671,15 +654,13 @@ class ModelPatcher:
                        if force_patch_weights:
                            self.patch_weight_to_device(weight_key)
                        else:
-                            _, set_func, convert_func = get_key_weight(self.model, weight_key)
-                            m.weight_function = [LowVramPatch(weight_key, self.patches, convert_func, set_func)]
+                            m.weight_function = [LowVramPatch(weight_key, self.patches)]
                            patch_counter += 1
                    if bias_key in self.patches:
                        if force_patch_weights:
                            self.patch_weight_to_device(bias_key)
                        else:
-                            _, set_func, convert_func = get_key_weight(self.model, bias_key)
-                            m.bias_function = [LowVramPatch(bias_key, self.patches, convert_func, set_func)]
+                            m.bias_function = [LowVramPatch(bias_key, self.patches)]
                            patch_counter += 1

                    cast_weight = True
@@ -841,12 +822,10 @@ class ModelPatcher:
                        module_mem += move_weight_functions(m, device_to)
                        if lowvram_possible:
                            if weight_key in self.patches:
-                                _, set_func, convert_func = get_key_weight(self.model, weight_key)
-                                m.weight_function.append(LowVramPatch(weight_key, self.patches, convert_func, set_func))
+                                m.weight_function.append(LowVramPatch(weight_key, self.patches))
                                patch_counter += 1
                            if bias_key in self.patches:
-                                _, set_func, convert_func = get_key_weight(self.model, bias_key)
-                                m.bias_function.append(LowVramPatch(bias_key, self.patches, convert_func, set_func))
+                                m.bias_function.append(LowVramPatch(bias_key, self.patches))
                                patch_counter += 1
                            cast_weight = True

--- a/comfy/model_sampling.py
+++ b/comfy/model_sampling.py
@@ -21,23 +21,17 @@ def rescale_zero_terminal_snr_sigmas(sigmas):
    alphas_bar[-1] = 4.8973451890853435e-08
    return ((1 - alphas_bar) / alphas_bar) ** 0.5

-def reshape_sigma(sigma, noise_dim):
-    if sigma.nelement() == 1:
-        return sigma.view(())
-    else:
-        return sigma.view(sigma.shape[:1] + (1,) * (noise_dim - 1))
-
 class EPS:
    def calculate_input(self, sigma, noise):
-        sigma = reshape_sigma(sigma, noise.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (noise.ndim - 1))
        return noise / (sigma ** 2 + self.sigma_data ** 2) ** 0.5

    def calculate_denoised(self, sigma, model_output, model_input):
-        sigma = reshape_sigma(sigma, model_output.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (model_output.ndim - 1))
        return model_input - model_output * sigma

    def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
-        sigma = reshape_sigma(sigma, noise.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (noise.ndim - 1))
        if max_denoise:
            noise = noise * torch.sqrt(1.0 + sigma ** 2.0)
        else:
@@ -51,12 +45,12 @@ class EPS:

 class V_PREDICTION(EPS):
    def calculate_denoised(self, sigma, model_output, model_input):
-        sigma = reshape_sigma(sigma, model_output.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (model_output.ndim - 1))
        return model_input * self.sigma_data ** 2 / (sigma ** 2 + self.sigma_data ** 2) - model_output * sigma * self.sigma_data / (sigma ** 2 + self.sigma_data ** 2) ** 0.5

 class EDM(V_PREDICTION):
    def calculate_denoised(self, sigma, model_output, model_input):
-        sigma = reshape_sigma(sigma, model_output.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (model_output.ndim - 1))
        return model_input * self.sigma_data ** 2 / (sigma ** 2 + self.sigma_data ** 2) + model_output * sigma * self.sigma_data / (sigma ** 2 + self.sigma_data ** 2) ** 0.5

 class CONST:
@@ -64,15 +58,15 @@ class CONST:
        return noise

    def calculate_denoised(self, sigma, model_output, model_input):
-        sigma = reshape_sigma(sigma, model_output.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (model_output.ndim - 1))
        return model_input - model_output * sigma

    def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
-        sigma = reshape_sigma(sigma, noise.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (noise.ndim - 1))
        return sigma * noise + (1.0 - sigma) * latent_image

    def inverse_noise_scaling(self, sigma, latent):
-        sigma = reshape_sigma(sigma, latent.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (latent.ndim - 1))
        return latent / (1.0 - sigma)

 class X0(EPS):
@@ -86,16 +80,16 @@ class IMG_TO_IMG(X0):
 class COSMOS_RFLOW:
    def calculate_input(self, sigma, noise):
        sigma = (sigma / (sigma + 1))
-        sigma = reshape_sigma(sigma, noise.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (noise.ndim - 1))
        return noise * (1.0 - sigma)

    def calculate_denoised(self, sigma, model_output, model_input):
        sigma = (sigma / (sigma + 1))
-        sigma = reshape_sigma(sigma, model_output.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (model_output.ndim - 1))
        return model_input * (1.0 - sigma) - model_output * sigma

    def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
-        sigma = reshape_sigma(sigma, noise.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (noise.ndim - 1))
        noise = noise * sigma
        noise += latent_image
        return noise
--- a/comfy/nested_tensor.py
+++ b/comfy/nested_tensor.py
@@ -1,91 +0,0 @@
-import torch
-
-class NestedTensor:
-    def __init__(self, tensors):
-        self.tensors = list(tensors)
-        self.is_nested = True
-
-    def _copy(self):
-        return NestedTensor(self.tensors)
-
-    def apply_operation(self, other, operation):
-        o = self._copy()
-        if isinstance(other, NestedTensor):
-            for i, t in enumerate(o.tensors):
-                o.tensors[i] = operation(t, other.tensors[i])
-        else:
-            for i, t in enumerate(o.tensors):
-                o.tensors[i] = operation(t, other)
-        return o
-
-    def __add__(self, b):
-        return self.apply_operation(b, lambda x, y: x + y)
-
-    def __sub__(self, b):
-        return self.apply_operation(b, lambda x, y: x - y)
-
-    def __mul__(self, b):
-        return self.apply_operation(b, lambda x, y: x * y)
-
-    # def __itruediv__(self, b):
-    #     return self.apply_operation(b, lambda x, y: x / y)
-
-    def __truediv__(self, b):
-        return self.apply_operation(b, lambda x, y: x / y)
-
-    def __getitem__(self, *args, **kwargs):
-        return self.apply_operation(None, lambda x, y: x.__getitem__(*args, **kwargs))
-
-    def unbind(self):
-        return self.tensors
-
-    def to(self, *args, **kwargs):
-        o = self._copy()
-        for i, t in enumerate(o.tensors):
-            o.tensors[i] = t.to(*args, **kwargs)
-        return o
-
-    def new_ones(self, *args, **kwargs):
-        return self.tensors[0].new_ones(*args, **kwargs)
-
-    def float(self):
-        return self.to(dtype=torch.float)
-
-    def chunk(self, *args, **kwargs):
-        return self.apply_operation(None, lambda x, y: x.chunk(*args, **kwargs))
-
-    def size(self):
-        return self.tensors[0].size()
-
-    @property
-    def shape(self):
-        return self.tensors[0].shape
-
-    @property
-    def ndim(self):
-        dims = 0
-        for t in self.tensors:
-            dims = max(t.ndim, dims)
-        return dims
-
-    @property
-    def device(self):
-        return self.tensors[0].device
-
-    @property
-    def dtype(self):
-        return self.tensors[0].dtype
-
-    @property
-    def layout(self):
-        return self.tensors[0].layout
-
-
-def cat_nested(tensors, *args, **kwargs):
-    cated_tensors = []
-    for i in range(len(tensors[0].tensors)):
-        tens = []
-        for j in range(len(tensors)):
-            tens.append(tensors[j].tensors[i])
-        cated_tensors.append(torch.cat(tens, *args, **kwargs))
-    return NestedTensor(cated_tensors)
--- a/comfy/ops.py
+++ b/comfy/ops.py
@@ -24,11 +24,6 @@ import comfy.float
 import comfy.rmsnorm
 import contextlib

-def run_every_op():
-    if torch.compiler.is_compiling():
-        return
-
-    comfy.model_management.throw_exception_if_processing_interrupted()

 def scaled_dot_product_attention(q, k, v, *args, **kwargs):
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, *args, **kwargs)
@@ -55,22 +50,11 @@ try:
 except (ModuleNotFoundError, TypeError):
    logging.warning("Could not set sdpa backend priority.")

-NVIDIA_MEMORY_CONV_BUG_WORKAROUND = False
-try:
-    if comfy.model_management.is_nvidia():
-        if torch.backends.cudnn.version() >= 91002 and comfy.model_management.torch_version_numeric >= (2, 9) and comfy.model_management.torch_version_numeric <= (2, 10):
-            #TODO: change upper bound version once it's fixed'
-            NVIDIA_MEMORY_CONV_BUG_WORKAROUND = True
-            logging.info("working around nvidia conv3d memory bug.")
-except:
-    pass
-
 cast_to = comfy.model_management.cast_to #TODO: remove once no more references

 def cast_to_input(weight, input, non_blocking=False, copy=True):
    return comfy.model_management.cast_to(weight, input.dtype, input.device, non_blocking=non_blocking, copy=copy)

-@torch.compiler.disable()
 def cast_bias_weight(s, input=None, dtype=None, device=None, bias_dtype=None):
    if input is not None:
        if dtype is None:
@@ -122,7 +106,6 @@ class disable_weight_init:
            return torch.nn.functional.linear(input, weight, bias)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -137,7 +120,6 @@ class disable_weight_init:
            return self._conv_forward(input, weight, bias)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -152,7 +134,6 @@ class disable_weight_init:
            return self._conv_forward(input, weight, bias)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -162,21 +143,11 @@ class disable_weight_init:
        def reset_parameters(self):
            return None

-        def _conv_forward(self, input, weight, bias, *args, **kwargs):
-            if NVIDIA_MEMORY_CONV_BUG_WORKAROUND and weight.dtype in (torch.float16, torch.bfloat16):
-                out = torch.cudnn_convolution(input, weight, self.padding, self.stride, self.dilation, self.groups, benchmark=False, deterministic=False, allow_tf32=True)
-                if bias is not None:
-                    out += bias.reshape((1, -1) + (1,) * (out.ndim - 2))
-                return out
-            else:
-                return super()._conv_forward(input, weight, bias, *args, **kwargs)
-
        def forward_comfy_cast_weights(self, input):
            weight, bias = cast_bias_weight(self, input)
            return self._conv_forward(input, weight, bias)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -191,7 +162,6 @@ class disable_weight_init:
            return torch.nn.functional.group_norm(input, self.num_groups, weight, bias, self.eps)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -210,7 +180,6 @@ class disable_weight_init:
            return torch.nn.functional.layer_norm(input, self.normalized_shape, weight, bias, self.eps)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -230,7 +199,6 @@ class disable_weight_init:
            # return torch.nn.functional.rms_norm(input, self.normalized_shape, weight, self.eps)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -252,7 +220,6 @@ class disable_weight_init:
                output_padding, self.groups, self.dilation)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -274,7 +241,6 @@ class disable_weight_init:
                output_padding, self.groups, self.dilation)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -293,7 +259,6 @@ class disable_weight_init:
            return torch.nn.functional.embedding(input, weight, self.padding_idx, self.max_norm, self.norm_type, self.scale_grad_by_freq, self.sparse).to(dtype=output_dtype)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -397,13 +362,12 @@ class fp8_ops(manual_cast):
            return None

        def forward_comfy_cast_weights(self, input):
-            if not self.training:
-                try:
-                    out = fp8_linear(self, input)
-                    if out is not None:
-                        return out
-                except Exception as e:
-                    logging.info("Exception during fp8 op: {}".format(e))
+            try:
+                out = fp8_linear(self, input)
+                if out is not None:
+                    return out
+            except Exception as e:
+                logging.info("Exception during fp8 op: {}".format(e))

            weight, bias = cast_bias_weight(self, input)
            return torch.nn.functional.linear(input, weight, bias)
@@ -448,10 +412,8 @@ def scaled_fp8_ops(fp8_matrix_mult=False, scale_input=False, override_dtype=None
                else:
                    return weight * self.scale_weight.to(device=weight.device, dtype=weight.dtype)

-            def set_weight(self, weight, inplace_update=False, seed=None, return_weight=False, **kwargs):
+            def set_weight(self, weight, inplace_update=False, seed=None, **kwargs):
                weight = comfy.float.stochastic_rounding(weight / self.scale_weight.to(device=weight.device, dtype=weight.dtype), self.weight.dtype, seed=seed)
-                if return_weight:
-                    return weight
                if inplace_update:
                    self.weight.data.copy_(weight)
                else:
--- a/comfy/patcher_extension.py
+++ b/comfy/patcher_extension.py
@@ -150,7 +150,7 @@ def merge_nested_dicts(dict1: dict, dict2: dict, copy_dict1=True):
    for key, value in dict2.items():
        if isinstance(value, dict):
            curr_value = merged_dict.setdefault(key, {})
-            merged_dict[key] = merge_nested_dicts(curr_value, value)
+            merged_dict[key] = merge_nested_dicts(value, curr_value)
        elif isinstance(value, list):
            merged_dict.setdefault(key, []).extend(value)
        else:
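Review note: swapping the arguments of the recursive call changes the merge result, not just its style. The sketch below rebuilds the function from the loop body visible in this hunk. The deep copy at the top, the `return`, and the scalar branch body are assumptions, since those lines are outside the hunk, and the `swapped` flag is a hypothetical switch added only to compare the two argument orders side by side.

```python
import copy


def merge_nested_dicts(dict1, dict2, copy_dict1=True, swapped=False):
    # Assumed prologue: work on a deep copy of dict1 unless told otherwise.
    merged = copy.deepcopy(dict1) if copy_dict1 else dict1
    for key, value in dict2.items():
        if isinstance(value, dict):
            curr_value = merged.setdefault(key, {})
            if swapped:
                # Argument order on the "+" side of the hunk
                merged[key] = merge_nested_dicts(value, curr_value, swapped=True)
            else:
                # Argument order on the "-" side of the hunk
                merged[key] = merge_nested_dicts(curr_value, value, swapped=False)
        elif isinstance(value, list):
            merged.setdefault(key, []).extend(value)
        else:
            # Assumed scalar branch: the incoming value overwrites.
            merged[key] = value
    return merged


existing = {"patches": {"attn1": ["a"], "scale": 1}}
incoming = {"patches": {"attn1": ["b"], "scale": 2}}

old_order = merge_nested_dicts(existing, incoming, swapped=False)
new_order = merge_nested_dicts(existing, incoming, swapped=True)

print(old_order)
print(new_order)
```

Under these assumptions, the `-` order appends incoming list items after the existing ones and lets incoming scalars win. The `+` order does the opposite inside nested dicts: incoming list items come first and existing scalars win. If patch execution order matters to callers, that difference is worth confirming before merging.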
--- a/comfy/pixel_space_convert.py
+++ b/comfy/pixel_space_convert.py
@@ -1,16 +0,0 @@
-import torch
-
-
-# "Fake" VAE that converts from IMAGE B, H, W, C and values on the scale of 0..1
-# to LATENT B, C, H, W and values on the scale of -1..1.
-class PixelspaceConversionVAE(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.pixel_space_vae = torch.nn.Parameter(torch.tensor(1.0))
-
-    def encode(self, pixels: torch.Tensor, *_args, **_kwargs) -> torch.Tensor:
-        return pixels
-
-    def decode(self, samples: torch.Tensor, *_args, **_kwargs) -> torch.Tensor:
-        return samples
-
--- a/comfy/sample.py
+++ b/comfy/sample.py
@@ -4,9 +4,13 @@ import comfy.samplers
 import comfy.utils
 import numpy as np
 import logging
-import comfy.nested_tensor

-def prepare_noise_inner(latent_image, generator, noise_inds=None):
+def prepare_noise(latent_image, seed, noise_inds=None):
+    """
+    creates random noise given a latent image and a seed.
+    optional arg skip can be used to skip and discard x number of noise generations for a given seed
+    """
+    generator = torch.manual_seed(seed)
    if noise_inds is None:
        return torch.randn(latent_image.size(), dtype=latent_image.dtype, layout=latent_image.layout, generator=generator, device="cpu")

@@ -17,29 +21,10 @@ def prepare_noise_inner(latent_image, generator, noise_inds=None):
        if i in unique_inds:
            noises.append(noise)
    noises = [noises[i] for i in inverse]
-    return torch.cat(noises, axis=0)
-
-def prepare_noise(latent_image, seed, noise_inds=None):
-    """
-    creates random noise given a latent image and a seed.
-    optional arg skip can be used to skip and discard x number of noise generations for a given seed
-    """
-    generator = torch.manual_seed(seed)
-
-    if latent_image.is_nested:
-        tensors = latent_image.unbind()
-        noises = []
-        for t in tensors:
-            noises.append(prepare_noise_inner(t, generator, noise_inds))
-        noises = comfy.nested_tensor.NestedTensor(noises)
-    else:
-        noises = prepare_noise_inner(latent_image, generator, noise_inds)
-
+    noises = torch.cat(noises, axis=0)
    return noises

 def fix_empty_latent_channels(model, latent_image):
-    if latent_image.is_nested:
-        return latent_image
    latent_format = model.get_model_object("latent_format") #Resize the empty latent image so it has the right number of channels
    if latent_format.latent_channels != latent_image.shape[1] and torch.count_nonzero(latent_image) == 0:
        latent_image = comfy.utils.repeat_to_batch_size(latent_image, latent_format.latent_channels, dim=1)
--- a/comfy/samplers.py
+++ b/comfy/samplers.py
@@ -306,10 +306,17 @@ def _calc_cond_batch(model: BaseModel, conds: list[list[dict]], x_in: torch.Tens
                                                                                 copy_dict1=False)

            if patches is not None:
-                transformer_options["patches"] = comfy.patcher_extension.merge_nested_dicts(
-                    transformer_options.get("patches", {}),
-                    patches
-                )
+                # TODO: replace with merge_nested_dicts function
+                if "patches" in transformer_options:
+                    cur_patches = transformer_options["patches"].copy()
+                    for p in patches:
+                        if p in cur_patches:
+                            cur_patches[p] = cur_patches[p] + patches[p]
+                        else:
+                            cur_patches[p] = patches[p]
+                    transformer_options["patches"] = cur_patches
+                else:
+                    transformer_options["patches"] = patches

            transformer_options["cond_or_uncond"] = cond_or_uncond[:]
            transformer_options["uuids"] = uuids[:]
@@ -353,7 +360,7 @@ def calc_cond_uncond_batch(model, cond, uncond, x_in, timestep, model_options):
 def cfg_function(model, cond_pred, uncond_pred, cond_scale, x, timestep, model_options={}, cond=None, uncond=None):
    if "sampler_cfg_function" in model_options:
        args = {"cond": x - cond_pred, "uncond": x - uncond_pred, "cond_scale": cond_scale, "timestep": timestep, "input": x, "sigma": timestep,
-                "cond_denoised": cond_pred, "uncond_denoised": uncond_pred, "model": model, "model_options": model_options, "input_cond": cond, "input_uncond": uncond}
+                "cond_denoised": cond_pred, "uncond_denoised": uncond_pred, "model": model, "model_options": model_options}
        cfg_result = x - model_options["sampler_cfg_function"](args)
    else:
        cfg_result = uncond_pred + (cond_pred - uncond_pred) * cond_scale
@@ -383,7 +390,7 @@ def sampling_function(model, x, timestep, uncond, cond, cond_scale, model_option
    for fn in model_options.get("sampler_pre_cfg_function", []):
        args = {"conds":conds, "conds_out": out, "cond_scale": cond_scale, "timestep": timestep,
                "input": x, "sigma": timestep, "model": model, "model_options": model_options}
-        out = fn(args)
+        out  = fn(args)

    return cfg_function(model, out[0], out[1], cond_scale, x, timestep, model_options=model_options, cond=cond, uncond=uncond_)

@@ -782,7 +789,7 @@ def ksampler(sampler_name, extra_options={}, inpaint_options={}):
    return KSAMPLER(sampler_function, extra_options, inpaint_options)


-def process_conds(model, noise, conds, device, latent_image=None, denoise_mask=None, seed=None, latent_shapes=None):
+def process_conds(model, noise, conds, device, latent_image=None, denoise_mask=None, seed=None):
    for k in conds:
        conds[k] = conds[k][:]
        resolve_areas_and_cond_masks_multidim(conds[k], noise.shape[2:], device)
@@ -792,7 +799,7 @@ def process_conds(model, noise, conds, device, latent_image=None, denoise_mask=N

    if hasattr(model, 'extra_conds'):
        for k in conds:
-            conds[k] = encode_model_conds(model.extra_conds, conds[k], noise, device, k, latent_image=latent_image, denoise_mask=denoise_mask, seed=seed, latent_shapes=latent_shapes)
+            conds[k] = encode_model_conds(model.extra_conds, conds[k], noise, device, k, latent_image=latent_image, denoise_mask=denoise_mask, seed=seed)

    #make sure each cond area has an opposite one with the same area
    for k in conds:
@@ -962,11 +969,11 @@ class CFGGuider:
    def predict_noise(self, x, timestep, model_options={}, seed=None):
        return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)

-    def inner_sample(self, noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=None):
+    def inner_sample(self, noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed):
        if latent_image is not None and torch.count_nonzero(latent_image) > 0: #Don't shift the empty latent image.
            latent_image = self.inner_model.process_latent_in(latent_image)

-        self.conds = process_conds(self.inner_model, noise, self.conds, device, latent_image, denoise_mask, seed, latent_shapes=latent_shapes)
+        self.conds = process_conds(self.inner_model, noise, self.conds, device, latent_image, denoise_mask, seed)

        extra_model_options = comfy.model_patcher.create_model_options_clone(self.model_options)
        extra_model_options.setdefault("transformer_options", {})["sample_sigmas"] = sigmas
@@ -980,7 +987,7 @@ class CFGGuider:
        samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
        return self.inner_model.process_latent_out(samples.to(torch.float32))

-    def outer_sample(self, noise, latent_image, sampler, sigmas, denoise_mask=None, callback=None, disable_pbar=False, seed=None, latent_shapes=None):
+    def outer_sample(self, noise, latent_image, sampler, sigmas, denoise_mask=None, callback=None, disable_pbar=False, seed=None):
        self.inner_model, self.conds, self.loaded_models = comfy.sampler_helpers.prepare_sampling(self.model_patcher, noise.shape, self.conds, self.model_options)
        device = self.model_patcher.load_device

@@ -994,7 +1001,7 @@ class CFGGuider:

        try:
            self.model_patcher.pre_run()
-            output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
+            output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
        finally:
            self.model_patcher.cleanup()

@@ -1007,12 +1014,6 @@ class CFGGuider:
        if sigmas.shape[-1] == 0:
            return latent_image

-        if latent_image.is_nested:
-            latent_image, latent_shapes = comfy.utils.pack_latents(latent_image.unbind())
-            noise, _ = comfy.utils.pack_latents(noise.unbind())
-        else:
-            latent_shapes = [latent_image.shape]
-
        self.conds = {}
        for k in self.original_conds:
            self.conds[k] = list(map(lambda a: a.copy(), self.original_conds[k]))
@@ -1032,7 +1033,7 @@ class CFGGuider:
                self,
                comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.OUTER_SAMPLE, self.model_options, is_model_options=True)
            )
-            output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
+            output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
        finally:
            cast_to_load_options(self.model_options, device=self.model_patcher.offload_device)
            self.model_options = orig_model_options
@@ -1040,9 +1041,6 @@ class CFGGuider:
            self.model_patcher.restore_hook_patches()

        del self.conds
-
-        if len(latent_shapes) > 1:
-            output = comfy.nested_tensor.NestedTensor(comfy.utils.unpack_latents(output, latent_shapes))
        return output


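Review note: the `_calc_cond_batch` hunk earlier in this file replaces the `merge_nested_dicts` call with a hand-written loop. The sketch below isolates that loop in a standalone function so its behavior can be checked. `merge_patches` is a hypothetical wrapper name, and the patch keys in the example are made up for illustration.

```python
def merge_patches(transformer_options, patches):
    """Merge per-key patch lists into transformer_options["patches"]."""
    if "patches" in transformer_options:
        # Shallow copy, so the dict that was there before is not mutated.
        cur_patches = transformer_options["patches"].copy()
        for p in patches:
            if p in cur_patches:
                # List concatenation builds a new list: existing entries first.
                cur_patches[p] = cur_patches[p] + patches[p]
            else:
                cur_patches[p] = patches[p]
        transformer_options["patches"] = cur_patches
    else:
        transformer_options["patches"] = patches
    return transformer_options


options = {"patches": {"attn1_patch": [1]}}
before = options["patches"]

merge_patches(options, {"attn1_patch": [2], "attn2_patch": [3]})

print(options["patches"])
print(before)
```

The loop merges only one level deep and expects every value to support `+`, which holds for lists of patches. Existing entries keep their position ahead of the new ones, and the previous `patches` dict is left unchanged.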
--- a/comfy/sd.py
+++ b/comfy/sd.py
@@ -17,9 +17,6 @@ import comfy.ldm.wan.vae
 import comfy.ldm.wan.vae2_2
 import comfy.ldm.hunyuan3d.vae
 import comfy.ldm.ace.vae.music_dcae_pipeline
-import comfy.ldm.hunyuan_video.vae
-import comfy.ldm.mmaudio.vae.autoencoder
-import comfy.pixel_space_convert
 import yaml
 import math
 import os
@@ -51,7 +48,6 @@ import comfy.text_encoders.hidream
 import comfy.text_encoders.ace
 import comfy.text_encoders.omnigen2
 import comfy.text_encoders.qwen_image
-import comfy.text_encoders.hunyuan_image

 import comfy.model_patcher
 import comfy.lora
@@ -276,13 +272,8 @@ class VAE:
        if 'decoder.up_blocks.0.resnets.0.norm1.weight' in sd.keys(): #diffusers format
            sd = diffusers_convert.convert_vae_state_dict(sd)

-        if model_management.is_amd():
-            VAE_KL_MEM_RATIO = 2.73
-        else:
-            VAE_KL_MEM_RATIO = 1.0
-
-        self.memory_used_encode = lambda shape, dtype: (1767 * shape[2] * shape[3]) * model_management.dtype_size(dtype) * VAE_KL_MEM_RATIO #These are for AutoencoderKL and need tweaking (should be lower)
-        self.memory_used_decode = lambda shape, dtype: (2178 * shape[2] * shape[3] * 64) * model_management.dtype_size(dtype) * VAE_KL_MEM_RATIO
+        self.memory_used_encode = lambda shape, dtype: (1767 * shape[2] * shape[3]) * model_management.dtype_size(dtype) #These are for AutoencoderKL and need tweaking (should be lower)
+        self.memory_used_decode = lambda shape, dtype: (2178 * shape[2] * shape[3] * 64) * model_management.dtype_size(dtype)
        self.downscale_ratio = 8
        self.upscale_ratio = 8
        self.latent_channels = 4
@@ -292,12 +283,10 @@ class VAE:
        self.process_output = lambda image: torch.clamp((image + 1.0) / 2.0, min=0.0, max=1.0)
        self.working_dtypes = [torch.bfloat16, torch.float32]
        self.disable_offload = False
-        self.not_video = False

        self.downscale_index_formula = None
        self.upscale_index_formula = None
        self.extra_1d_channel = None
-        self.crop_input = True

        if config is None:
            if "decoder.mid.block_1.mix_factor" in sd:
@@ -340,50 +329,21 @@ class VAE:
                self.downscale_ratio = 32
                self.latent_channels = 16
            elif "decoder.conv_in.weight" in sd:
-                if sd['decoder.conv_in.weight'].shape[1] == 64:
-                    ddconfig = {"block_out_channels": [128, 256, 512, 512, 1024, 1024], "in_channels": 3, "out_channels": 3, "num_res_blocks": 2, "ffactor_spatial": 32, "downsample_match_channel": True, "upsample_match_channel": True}
-                    self.latent_channels = ddconfig['z_channels'] = sd["decoder.conv_in.weight"].shape[1]
-                    self.downscale_ratio = 32
-                    self.upscale_ratio = 32
-                    self.working_dtypes = [torch.float16, torch.bfloat16, torch.float32]
-                    self.first_stage_model = AutoencodingEngine(regularizer_config={'target': "comfy.ldm.models.autoencoder.DiagonalGaussianRegularizer"},
-                                                                encoder_config={'target': "comfy.ldm.hunyuan_video.vae.Encoder", 'params': ddconfig},
-                                                                decoder_config={'target': "comfy.ldm.hunyuan_video.vae.Decoder", 'params': ddconfig})
+                #default SD1.x/SD2.x VAE parameters
+                ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}

-                    self.memory_used_encode = lambda shape, dtype: (700 * shape[2] * shape[3]) * model_management.dtype_size(dtype)
-                    self.memory_used_decode = lambda shape, dtype: (700 * shape[2] * shape[3] * 32 * 32) * model_management.dtype_size(dtype)
-                elif sd['decoder.conv_in.weight'].shape[1] == 32:
-                    ddconfig = {"block_out_channels": [128, 256, 512, 1024, 1024], "in_channels": 3, "out_channels": 3, "num_res_blocks": 2, "ffactor_spatial": 16, "ffactor_temporal": 4, "downsample_match_channel": True, "upsample_match_channel": True, "refiner_vae": False}
-                    self.latent_channels = ddconfig['z_channels'] = sd["decoder.conv_in.weight"].shape[1]
-                    self.working_dtypes = [torch.float16, torch.bfloat16, torch.float32]
-                    self.upscale_ratio = (lambda a: max(0, a * 4 - 3), 16, 16)
-                    self.upscale_index_formula = (4, 16, 16)
-                    self.downscale_ratio = (lambda a: max(0, math.floor((a + 3) / 4)), 16, 16)
-                    self.downscale_index_formula = (4, 16, 16)
-                    self.latent_dim = 3
-                    self.not_video = True
-                    self.first_stage_model = AutoencodingEngine(regularizer_config={'target': "comfy.ldm.models.autoencoder.DiagonalGaussianRegularizer"},
-                                                                encoder_config={'target': "comfy.ldm.hunyuan_video.vae_refiner.Encoder", 'params': ddconfig},
-                                                                decoder_config={'target': "comfy.ldm.hunyuan_video.vae_refiner.Decoder", 'params': ddconfig})
+                if 'encoder.down.2.downsample.conv.weight' not in sd and 'decoder.up.3.upsample.conv.weight' not in sd: #Stable diffusion x4 upscaler VAE
+                    ddconfig['ch_mult'] = [1, 2, 4]
+                    self.downscale_ratio = 4
+                    self.upscale_ratio = 4

-                    self.memory_used_encode = lambda shape, dtype: (2800 * shape[-2] * shape[-1]) * model_management.dtype_size(dtype)
-                    self.memory_used_decode = lambda shape, dtype: (2800 * shape[-3] * shape[-2] * shape[-1] * 16 * 16) * model_management.dtype_size(dtype)
+                self.latent_channels = ddconfig['z_channels'] = sd["decoder.conv_in.weight"].shape[1]
+                if 'post_quant_conv.weight' in sd:
+                    self.first_stage_model = AutoencoderKL(ddconfig=ddconfig, embed_dim=sd['post_quant_conv.weight'].shape[1])
                else:
-                    #default SD1.x/SD2.x VAE parameters
-                    ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}
-
-                    if 'encoder.down.2.downsample.conv.weight' not in sd and 'decoder.up.3.upsample.conv.weight' not in sd: #Stable diffusion x4 upscaler VAE
-                        ddconfig['ch_mult'] = [1, 2, 4]
-                        self.downscale_ratio = 4
-                        self.upscale_ratio = 4
-
-                    self.latent_channels = ddconfig['z_channels'] = sd["decoder.conv_in.weight"].shape[1]
-                    if 'post_quant_conv.weight' in sd:
-                        self.first_stage_model = AutoencoderKL(ddconfig=ddconfig, embed_dim=sd['post_quant_conv.weight'].shape[1])
-                    else:
-                        self.first_stage_model = AutoencodingEngine(regularizer_config={'target': "comfy.ldm.models.autoencoder.DiagonalGaussianRegularizer"},
-                                                                    encoder_config={'target': "comfy.ldm.modules.diffusionmodules.model.Encoder", 'params': ddconfig},
-                                                                    decoder_config={'target': "comfy.ldm.modules.diffusionmodules.model.Decoder", 'params': ddconfig})
+                    self.first_stage_model = AutoencodingEngine(regularizer_config={'target': "comfy.ldm.models.autoencoder.DiagonalGaussianRegularizer"},
+                                                                encoder_config={'target': "comfy.ldm.modules.diffusionmodules.model.Encoder", 'params': ddconfig},
+                                                                decoder_config={'target': "comfy.ldm.modules.diffusionmodules.model.Decoder", 'params': ddconfig})
            elif "decoder.layers.1.layers.0.beta" in sd:
                self.first_stage_model = AudioOobleckVAE()
                self.memory_used_encode = lambda shape, dtype: (1000 * shape[2]) * model_management.dtype_size(dtype)
@@ -434,23 +394,6 @@ class VAE:
                self.downscale_ratio = (lambda a: max(0, math.floor((a + 7) / 8)), 32, 32)
                self.downscale_index_formula = (8, 32, 32)
                self.working_dtypes = [torch.bfloat16, torch.float32]
-            elif "decoder.conv_in.conv.weight" in sd and sd['decoder.conv_in.conv.weight'].shape[1] == 32:
-                ddconfig = {"block_out_channels": [128, 256, 512, 1024, 1024], "in_channels": 3, "out_channels": 3, "num_res_blocks": 2, "ffactor_spatial": 16, "ffactor_temporal": 4, "downsample_match_channel": True, "upsample_match_channel": True}
-                ddconfig['z_channels'] = sd["decoder.conv_in.conv.weight"].shape[1]
-                self.latent_channels = 64
-                self.upscale_ratio = (lambda a: max(0, a * 4 - 3), 16, 16)
-                self.upscale_index_formula = (4, 16, 16)
-                self.downscale_ratio = (lambda a: max(0, math.floor((a + 3) / 4)), 16, 16)
-                self.downscale_index_formula = (4, 16, 16)
-                self.latent_dim = 3
-                self.not_video = True
-                self.working_dtypes = [torch.float16, torch.bfloat16, torch.float32]
-                self.first_stage_model = AutoencodingEngine(regularizer_config={'target': "comfy.ldm.models.autoencoder.EmptyRegularizer"},
-                                                            encoder_config={'target': "comfy.ldm.hunyuan_video.vae_refiner.Encoder", 'params': ddconfig},
-                                                            decoder_config={'target': "comfy.ldm.hunyuan_video.vae_refiner.Decoder", 'params': ddconfig})
-
-                self.memory_used_encode = lambda shape, dtype: (1400 * shape[-2] * shape[-1]) * model_management.dtype_size(dtype)
-                self.memory_used_decode = lambda shape, dtype: (1400 * shape[-3] * shape[-2] * shape[-1] * 16 * 16) * model_management.dtype_size(dtype)
            elif "decoder.conv_in.conv.weight" in sd:
                ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}
                ddconfig["conv3d"] = True
@@ -503,29 +446,17 @@ class VAE:
                    self.working_dtypes = [torch.bfloat16, torch.float16, torch.float32]
                    self.memory_used_encode = lambda shape, dtype: 6000 * shape[3] * shape[4] * model_management.dtype_size(dtype)
                    self.memory_used_decode = lambda shape, dtype: 7000 * shape[3] * shape[4] * (8 * 8) * model_management.dtype_size(dtype)
-            # Hunyuan 3d v2 2.0 & 2.1
            elif "geo_decoder.cross_attn_decoder.ln_1.bias" in sd:
-
                self.latent_dim = 1
-
-                def estimate_memory(shape, dtype, num_layers = 16, kv_cache_multiplier = 2):
-                    batch, num_tokens, hidden_dim = shape
-                    dtype_size = model_management.dtype_size(dtype)
-
-                    total_mem = batch * num_tokens * hidden_dim * dtype_size * (1 + kv_cache_multiplier * num_layers)
-                    return total_mem
-
-                # better memory estimations
-                self.memory_used_encode = lambda shape, dtype, num_layers = 8, kv_cache_multiplier = 0:\
-                    estimate_memory(shape, dtype, num_layers, kv_cache_multiplier)
-
-                self.memory_used_decode = lambda shape, dtype, num_layers = 16, kv_cache_multiplier = 2: \
-                    estimate_memory(shape, dtype, num_layers, kv_cache_multiplier)
-
-                self.first_stage_model = comfy.ldm.hunyuan3d.vae.ShapeVAE()
+                ln_post = "geo_decoder.ln_post.weight" in sd
+                inner_size = sd["geo_decoder.output_proj.weight"].shape[1]
+                downsample_ratio = sd["post_kl.weight"].shape[0] // inner_size
+                mlp_expand = sd["geo_decoder.cross_attn_decoder.mlp.c_fc.weight"].shape[0] // inner_size
+                self.memory_used_encode = lambda shape, dtype: (1000 * shape[2]) * model_management.dtype_size(dtype)  # TODO
+                self.memory_used_decode = lambda shape, dtype: (1024 * 1024 * 1024 * 2.0) * model_management.dtype_size(dtype)  # TODO
+                ddconfig = {"embed_dim": 64, "num_freqs": 8, "include_pi": False, "heads": 16, "width": 1024, "num_decoder_layers": 16, "qkv_bias": False, "qk_norm": True, "geo_decoder_mlp_expand_ratio": mlp_expand, "geo_decoder_downsample_ratio": downsample_ratio, "geo_decoder_ln_post": ln_post}
+                self.first_stage_model = comfy.ldm.hunyuan3d.vae.ShapeVAE(**ddconfig)
                self.working_dtypes = [torch.float16, torch.bfloat16, torch.float32]
-
-
            elif "vocoder.backbone.channel_layers.0.0.bias" in sd: #Ace Step Audio
                self.first_stage_model = comfy.ldm.ace.vae.music_dcae_pipeline.MusicDCAE(source_sample_rate=44100)
                self.memory_used_encode = lambda shape, dtype: (shape[2] * 330) * model_management.dtype_size(dtype)
@@ -540,34 +471,6 @@ class VAE:
                self.working_dtypes = [torch.bfloat16, torch.float16, torch.float32]
                self.disable_offload = True
                self.extra_1d_channel = 16
-            elif "pixel_space_vae" in sd:
-                self.first_stage_model = comfy.pixel_space_convert.PixelspaceConversionVAE()
-                self.memory_used_encode = lambda shape, dtype: (1 * shape[2] * shape[3]) * model_management.dtype_size(dtype)
-                self.memory_used_decode = lambda shape, dtype: (1 * shape[2] * shape[3]) * model_management.dtype_size(dtype)
-                self.downscale_ratio = 1
-                self.upscale_ratio = 1
-                self.latent_channels = 3
-                self.latent_dim = 2
-                self.output_channels = 3
-            elif "vocoder.activation_post.downsample.lowpass.filter" in sd: #MMAudio VAE
-                sample_rate = 16000
-                if sample_rate == 16000:
-                    mode = '16k'
-                else:
-                    mode = '44k'
-
-                self.first_stage_model = comfy.ldm.mmaudio.vae.autoencoder.AudioAutoencoder(mode=mode)
-                self.memory_used_encode = lambda shape, dtype: (30 * shape[2]) * model_management.dtype_size(dtype)
-                self.memory_used_decode = lambda shape, dtype: (90 * shape[2] * 1411.2) * model_management.dtype_size(dtype)
-                self.latent_channels = 20
-                self.output_channels = 2
-                self.upscale_ratio = 512 * (44100 / sample_rate)
-                self.downscale_ratio = 512 * (44100 / sample_rate)
-                self.latent_dim = 1
-                self.process_output = lambda audio: audio
-                self.process_input = lambda audio: audio
-                self.working_dtypes = [torch.float32]
-                self.crop_input = False
            else:
                logging.warning("WARNING: No VAE weights detected, VAE not initalized.")
                self.first_stage_model = None
@@ -601,9 +504,6 @@ class VAE:
            raise RuntimeError("ERROR: VAE is invalid: None\n\nIf the VAE is from a checkpoint loader node your checkpoint does not contain a valid VAE.")

    def vae_encode_crop_pixels(self, pixels):
-        if not self.crop_input:
-            return pixels
-
        downscale_ratio = self.spacial_compression_encode()

        dims = pixels.shape[1:-1]
@@ -681,7 +581,6 @@ class VAE:
    def decode(self, samples_in, vae_options={}):
        self.throw_exception_if_invalid()
        pixel_samples = None
-        do_tile = False
        try:
            memory_used = self.memory_used_decode(samples_in.shape, self.vae_dtype)
            model_management.load_models_gpu([self.patcher], memory_required=memory_used, force_full_load=self.disable_offload)
@@ -697,13 +596,6 @@ class VAE:
                pixel_samples[x:x+batch_number] = out
        except model_management.OOM_EXCEPTION:
            logging.warning("Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.")
-            #NOTE: We don't know what tensors were allocated to stack variables at the time of the
-            #exception and the exception itself refs them all until we get out of this except block.
-            #So we just set a flag for tiler fallback so that tensor gc can happen once the
-            #exception is fully off the books.
-            do_tile = True
-
-        if do_tile:
            dims = samples_in.ndim - 2
            if dims == 1 or self.extra_1d_channel is not None:
                pixel_samples = self.decode_tiled_1d(samples_in)
@@ -750,12 +642,8 @@ class VAE:
        self.throw_exception_if_invalid()
        pixel_samples = self.vae_encode_crop_pixels(pixel_samples)
        pixel_samples = pixel_samples.movedim(-1, 1)
-        do_tile = False
        if self.latent_dim == 3 and pixel_samples.ndim < 5:
-            if not self.not_video:
-                pixel_samples = pixel_samples.movedim(1, 0).unsqueeze(0)
-            else:
-                pixel_samples = pixel_samples.unsqueeze(2)
+            pixel_samples = pixel_samples.movedim(1, 0).unsqueeze(0)
        try:
            memory_used = self.memory_used_encode(pixel_samples.shape, self.vae_dtype)
            model_management.load_models_gpu([self.patcher], memory_required=memory_used, force_full_load=self.disable_offload)
@@ -772,13 +660,6 @@ class VAE:

        except model_management.OOM_EXCEPTION:
            logging.warning("Warning: Ran out of memory when regular VAE encoding, retrying with tiled VAE encoding.")
-            #NOTE: We don't know what tensors were allocated to stack variables at the time of the
-            #exception and the exception itself refs them all until we get out of this except block.
-            #So we just set a flag for tiler fallback so that tensor gc can happen once the
-            #exception is fully off the books.
-            do_tile = True
-
-        if do_tile:
            if self.latent_dim == 3:
                tile = 256
                overlap = tile // 4
@@ -796,10 +677,7 @@ class VAE:
        dims = self.latent_dim
        pixel_samples = pixel_samples.movedim(-1, 1)
        if dims == 3:
-            if not self.not_video:
-                pixel_samples = pixel_samples.movedim(1, 0).unsqueeze(0)
-            else:
-                pixel_samples = pixel_samples.unsqueeze(2)
+            pixel_samples = pixel_samples.movedim(1, 0).unsqueeze(0)

        memory_used = self.memory_used_encode(pixel_samples.shape, self.vae_dtype)  # TODO: calculate mem required for tile
        model_management.load_models_gpu([self.patcher], memory_required=memory_used, force_full_load=self.disable_offload)
@@ -856,7 +734,6 @@ class VAE:
        except:
            return None

-
 class StyleModel:
    def __init__(self, model, device="cpu"):
        self.model = model
@@ -896,7 +773,6 @@ class CLIPType(Enum):
    ACE = 16
    OMNIGEN2 = 17
    QWEN_IMAGE = 18
-    HUNYUAN_IMAGE = 19


 def load_clip(ckpt_paths, embedding_directory=None, clip_type=CLIPType.STABLE_DIFFUSION, model_options={}):
@@ -918,8 +794,6 @@ class TEModel(Enum):
    GEMMA_2_2B = 9
    QWEN25_3B = 10
    QWEN25_7B = 11
-    BYT5_SMALL_GLYPH = 12
-    GEMMA_3_4B = 13

 def detect_te_model(sd):
    if "text_model.encoder.layers.30.mlp.fc1.weight" in sd:
@@ -937,13 +811,8 @@ def detect_te_model(sd):
    if 'encoder.block.23.layer.1.DenseReluDense.wi.weight' in sd:
        return TEModel.T5_XXL_OLD
    if "encoder.block.0.layer.0.SelfAttention.k.weight" in sd:
-        weight = sd['encoder.block.0.layer.0.SelfAttention.k.weight']
-        if weight.shape[0] == 384:
-            return TEModel.BYT5_SMALL_GLYPH
        return TEModel.T5_BASE
    if 'model.layers.0.post_feedforward_layernorm.weight' in sd:
-        if 'model.layers.0.self_attn.q_norm.weight' in sd:
-            return TEModel.GEMMA_3_4B
        return TEModel.GEMMA_2_2B
    if 'model.layers.0.self_attn.k_proj.bias' in sd:
        weight = sd['model.layers.0.self_attn.k_proj.bias']
@@ -1048,10 +917,6 @@ def load_text_encoder_state_dicts(state_dicts=[], embedding_directory=None, clip
            clip_target.clip = comfy.text_encoders.lumina2.te(**llama_detect(clip_data))
            clip_target.tokenizer = comfy.text_encoders.lumina2.LuminaTokenizer
            tokenizer_data["spiece_model"] = clip_data[0].get("spiece_model", None)
-        elif te_model == TEModel.GEMMA_3_4B:
-            clip_target.clip = comfy.text_encoders.lumina2.te(**llama_detect(clip_data), model_type="gemma3_4b")
-            clip_target.tokenizer = comfy.text_encoders.lumina2.NTokenizer
-            tokenizer_data["spiece_model"] = clip_data[0].get("spiece_model", None)
        elif te_model == TEModel.LLAMA3_8:
            clip_target.clip = comfy.text_encoders.hidream.hidream_clip(**llama_detect(clip_data),
                                                                        clip_l=False, clip_g=False, t5=False, llama=True, dtype_t5=None, t5xxl_scaled_fp8=None)
@@ -1060,12 +925,8 @@ def load_text_encoder_state_dicts(state_dicts=[], embedding_directory=None, clip
            clip_target.clip = comfy.text_encoders.omnigen2.te(**llama_detect(clip_data))
            clip_target.tokenizer = comfy.text_encoders.omnigen2.Omnigen2Tokenizer
        elif te_model == TEModel.QWEN25_7B:
-            if clip_type == CLIPType.HUNYUAN_IMAGE:
-                clip_target.clip = comfy.text_encoders.hunyuan_image.te(byt5=False, **llama_detect(clip_data))
-                clip_target.tokenizer = comfy.text_encoders.hunyuan_image.HunyuanImageTokenizer
-            else:
-                clip_target.clip = comfy.text_encoders.qwen_image.te(**llama_detect(clip_data))
-                clip_target.tokenizer = comfy.text_encoders.qwen_image.QwenImageTokenizer
+            clip_target.clip = comfy.text_encoders.qwen_image.te(**llama_detect(clip_data))
+            clip_target.tokenizer = comfy.text_encoders.qwen_image.QwenImageTokenizer
        else:
            # clip_l
            if clip_type == CLIPType.SD3:
@@ -1109,9 +970,6 @@ def load_text_encoder_state_dicts(state_dicts=[], embedding_directory=None, clip

            clip_target.clip = comfy.text_encoders.hidream.hidream_clip(clip_l=clip_l, clip_g=clip_g, t5=t5, llama=llama, **t5_kwargs, **llama_kwargs)
            clip_target.tokenizer = comfy.text_encoders.hidream.HiDreamTokenizer
-        elif clip_type == CLIPType.HUNYUAN_IMAGE:
-            clip_target.clip = comfy.text_encoders.hunyuan_image.te(**llama_detect(clip_data))
-            clip_target.tokenizer = comfy.text_encoders.hunyuan_image.HunyuanImageTokenizer
        else:
            clip_target.clip = sdxl_clip.SDXLClipModel
            clip_target.tokenizer = sdxl_clip.SDXLTokenizer
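
Review note on the `decode`/`encode` hunks above: the removed comment explains why the tiled fallback was deferred with a `do_tile` flag. While the `except` block is running, the exception still references the frames (and tensors) that were live when it was raised. The sketch below shows that flag pattern in isolation. It is an illustration only: `run_with_tiled_fallback`, `full` and `tiled` are hypothetical names, and `MemoryError` stands in for `model_management.OOM_EXCEPTION`.

```python
import sys


def run_with_tiled_fallback(run_full, run_tiled, samples):
    """Try the full pass first; fall back to tiling only after the except block ends."""
    result = None
    do_tile = False
    try:
        result = run_full(samples)
    except MemoryError:  # stand-in for model_management.OOM_EXCEPTION (assumption)
        # Only record the failure here. The exception is still being handled,
        # so anything its traceback references cannot be collected yet.
        do_tile = True

    if do_tile:
        # The except block has exited, so the exception is no longer held.
        result = run_tiled(samples)
    return result


def full(samples):
    # Hypothetical full-size pass that always runs out of memory.
    raise MemoryError("simulated OOM")


def tiled(samples):
    # Hypothetical tiled pass; reports whether an exception is still being handled.
    return {"tiles": len(samples), "handling_exception": sys.exc_info()[0] is not None}


result = run_with_tiled_fallback(full, tiled, [1, 2, 3])
print(result)
```

With the fallback placed directly inside the `except` block, as on the `+` side of this diff, `sys.exc_info()` would still report the out-of-memory exception while the tiled pass runs.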
--- a/comfy/supported_models.py
+++ b/comfy/supported_models.py
@@ -20,7 +20,6 @@ import comfy.text_encoders.wan
 import comfy.text_encoders.ace
 import comfy.text_encoders.omnigen2
 import comfy.text_encoders.qwen_image
-import comfy.text_encoders.hunyuan_image

 from . import supported_models_base
 from . import latent_formats
@@ -995,7 +994,7 @@ class WAN21_T2V(supported_models_base.BASE):
    unet_extra_config = {}
    latent_format = latent_formats.Wan21

-    memory_usage_factor = 0.9
+    memory_usage_factor = 1.0

    supported_inference_dtypes = [torch.float16, torch.bfloat16, torch.float32]

@@ -1004,7 +1003,7 @@ class WAN21_T2V(supported_models_base.BASE):

    def __init__(self, unet_config):
        super().__init__(unet_config)
-        self.memory_usage_factor = self.unet_config.get("dim", 2000) / 2222
+        self.memory_usage_factor = self.unet_config.get("dim", 2000) / 2000

    def get_model(self, state_dict, prefix="", device=None):
        out = model_base.WAN21(self, device=device)
@@ -1073,16 +1072,6 @@ class WAN21_Vace(WAN21_T2V):
        out = model_base.WAN21_Vace(self, image_to_video=False, device=device)
        return out

-class WAN21_HuMo(WAN21_T2V):
-    unet_config = {
-        "image_model": "wan2.1",
-        "model_type": "humo",
-    }
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.WAN21_HuMo(self, image_to_video=False, device=device)
-        return out
-
 class WAN22_S2V(WAN21_T2V):
    unet_config = {
        "image_model": "wan2.1",
@@ -1096,19 +1085,6 @@ class WAN22_S2V(WAN21_T2V):
        out = model_base.WAN22_S2V(self, device=device)
        return out

-class WAN22_Animate(WAN21_T2V):
-    unet_config = {
-        "image_model": "wan2.1",
-        "model_type": "animate",
-    }
-
-    def __init__(self, unet_config):
-        super().__init__(unet_config)
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.WAN22_Animate(self, device=device)
-        return out
-
 class WAN22_T2V(WAN21_T2V):
    unet_config = {
        "image_model": "wan2.1",
@@ -1152,17 +1128,6 @@ class Hunyuan3Dv2(supported_models_base.BASE):
    def clip_target(self, state_dict={}):
        return None

-class Hunyuan3Dv2_1(Hunyuan3Dv2):
-    unet_config = {
-        "image_model": "hunyuan3d2_1",
-    }
-
-    latent_format = latent_formats.Hunyuan3Dv2_1
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.Hunyuan3Dv2_1(self, device = device)
-        return out
-
 class Hunyuan3Dv2mini(Hunyuan3Dv2):
    unet_config = {
        "image_model": "hunyuan3d2",
@@ -1228,19 +1193,6 @@ class Chroma(supported_models_base.BASE):
        t5_detect = comfy.text_encoders.sd3_clip.t5_xxl_detect(state_dict, "{}t5xxl.transformer.".format(pref))
        return supported_models_base.ClipTarget(comfy.text_encoders.pixart_t5.PixArtTokenizer, comfy.text_encoders.pixart_t5.pixart_te(**t5_detect))

-class ChromaRadiance(Chroma):
-    unet_config = {
-        "image_model": "chroma_radiance",
-    }
-
-    latent_format = comfy.latent_formats.ChromaRadiance
-
-    # Pixel-space model, no spatial compression for model input.
-    memory_usage_factor = 0.038
-
-    def get_model(self, state_dict, prefix="", device=None):
-        return model_base.ChromaRadiance(self, device=device)
-
 class ACEStep(supported_models_base.BASE):
    unet_config = {
        "audio_model": "ace",
@@ -1332,48 +1284,7 @@ class QwenImage(supported_models_base.BASE):
        hunyuan_detect = comfy.text_encoders.hunyuan_video.llama_detect(state_dict, "{}qwen25_7b.transformer.".format(pref))
        return supported_models_base.ClipTarget(comfy.text_encoders.qwen_image.QwenImageTokenizer, comfy.text_encoders.qwen_image.te(**hunyuan_detect))

-class HunyuanImage21(HunyuanVideo):
-    unet_config = {
-        "image_model": "hunyuan_video",
-        "vec_in_dim": None,
-    }

-    sampling_settings = {
-        "shift": 5.0,
-    }
-
-    latent_format = latent_formats.HunyuanImage21
-
-    memory_usage_factor = 7.7
-
-    supported_inference_dtypes = [torch.bfloat16, torch.float32]
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.HunyuanImage21(self, device=device)
-        return out
-
-    def clip_target(self, state_dict={}):
-        pref = self.text_encoder_key_prefix[0]
-        hunyuan_detect = comfy.text_encoders.hunyuan_video.llama_detect(state_dict, "{}qwen25_7b.transformer.".format(pref))
-        return supported_models_base.ClipTarget(comfy.text_encoders.hunyuan_image.HunyuanImageTokenizer, comfy.text_encoders.hunyuan_image.te(**hunyuan_detect))
-
-class HunyuanImage21Refiner(HunyuanVideo):
-    unet_config = {
-        "image_model": "hunyuan_video",
-        "patch_size": [1, 1, 1],
-        "vec_in_dim": None,
-    }
-
-    sampling_settings = {
-        "shift": 4.0,
-    }
-
-    latent_format = latent_formats.HunyuanImage21Refiner
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.HunyuanImage21Refiner(self, device=device)
-        return out
-
-models = [LotusD, Stable_Zero123, SD15_instructpix2pix, SD15, SD20, SD21UnclipL, SD21UnclipH, SDXL_instructpix2pix, SDXLRefiner, SDXL, SSD1B, KOALA_700M, KOALA_1B, Segmind_Vega, SD_X4Upscaler, Stable_Cascade_C, Stable_Cascade_B, SV3D_u, SV3D_p, SD3, StableAudio, AuraFlow, PixArtAlpha, PixArtSigma, HunyuanDiT, HunyuanDiT1, FluxInpaint, Flux, FluxSchnell, GenmoMochi, LTXV, HunyuanImage21Refiner, HunyuanImage21, HunyuanVideoSkyreelsI2V, HunyuanVideoI2V, HunyuanVideo, CosmosT2V, CosmosI2V, CosmosT2IPredict2, CosmosI2VPredict2, Lumina2, WAN22_T2V, WAN21_T2V, WAN21_I2V, WAN21_FunControl2V, WAN21_Vace, WAN21_Camera, WAN22_Camera, WAN22_S2V, WAN21_HuMo, WAN22_Animate, Hunyuan3Dv2mini, Hunyuan3Dv2, Hunyuan3Dv2_1, HiDream, Chroma, ChromaRadiance, ACEStep, Omnigen2, QwenImage]
+models = [LotusD, Stable_Zero123, SD15_instructpix2pix, SD15, SD20, SD21UnclipL, SD21UnclipH, SDXL_instructpix2pix, SDXLRefiner, SDXL, SSD1B, KOALA_700M, KOALA_1B, Segmind_Vega, SD_X4Upscaler, Stable_Cascade_C, Stable_Cascade_B, SV3D_u, SV3D_p, SD3, StableAudio, AuraFlow, PixArtAlpha, PixArtSigma, HunyuanDiT, HunyuanDiT1, FluxInpaint, Flux, FluxSchnell, GenmoMochi, LTXV, HunyuanVideoSkyreelsI2V, HunyuanVideoI2V, HunyuanVideo, CosmosT2V, CosmosI2V, CosmosT2IPredict2, CosmosI2VPredict2, Lumina2, WAN22_T2V, WAN21_T2V, WAN21_I2V, WAN21_FunControl2V, WAN21_Vace, WAN21_Camera, WAN22_Camera, WAN22_S2V, Hunyuan3Dv2mini, Hunyuan3Dv2, HiDream, Chroma, ACEStep, Omnigen2, QwenImage]

 models += [SVD_img2vid]
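
Review note on the `WAN21_T2V` hunks above: the class-level default moves from 0.9 to 1.0, and the divisor in `__init__` moves from 2222 to 2000, so the two changes agree at the fallback `dim` of 2000. A small arithmetic sketch follows. `wan_memory_usage_factor` is a hypothetical helper, and the `dim` values other than 2000 are illustrative assumptions rather than values taken from this diff.

```python
def wan_memory_usage_factor(dim=2000, divisor=2000):
    """Mirror of `self.unet_config.get("dim", 2000) / divisor` from the hunk above."""
    return dim / divisor


# Compare the '-' side (divisor 2222) with the '+' side (divisor 2000).
for dim in (1536, 2000, 5120):  # 1536 and 5120 are illustrative widths (assumption)
    old = wan_memory_usage_factor(dim, 2222)
    new = wan_memory_usage_factor(dim, 2000)
    print(f"dim={dim}: old={old:.3f} new={new:.3f} ratio={new / old:.3f}")
```

The ratio between the two sides is constant (2222 / 2000 = 1.111), so the `+` side requests roughly 11% more memory for every `dim`.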
--- a/comfy/text_encoders/byt5_config_small_glyph.json
+++ b/comfy/text_encoders/byt5_config_small_glyph.json
@@ -1,22 +0,0 @@
-{
-  "d_ff": 3584,
-  "d_kv": 64,
-  "d_model": 1472,
-  "decoder_start_token_id": 0,
-  "dropout_rate": 0.1,
-  "eos_token_id": 1,
-  "dense_act_fn": "gelu_pytorch_tanh",
-  "initializer_factor": 1.0,
-  "is_encoder_decoder": true,
-  "is_gated_act": true,
-  "layer_norm_epsilon": 1e-06,
-  "model_type": "t5",
-  "num_decoder_layers": 4,
-  "num_heads": 6,
-  "num_layers": 12,
-  "output_past": true,
-  "pad_token_id": 0,
-  "relative_attention_num_buckets": 32,
-  "tie_word_embeddings": false,
-  "vocab_size": 1510
-}
--- a/comfy/text_encoders/byt5_tokenizer/added_tokens.json
+++ b/comfy/text_encoders/byt5_tokenizer/added_tokens.json
@@ -1,127 +0,0 @@
-{
-  "<extra_id_0>": 259,
-  "<extra_id_100>": 359,
-  "<extra_id_101>": 360,
-  "<extra_id_102>": 361,
-  "<extra_id_103>": 362,
-  "<extra_id_104>": 363,
-  "<extra_id_105>": 364,
-  "<extra_id_106>": 365,
-  "<extra_id_107>": 366,
-  "<extra_id_108>": 367,
-  "<extra_id_109>": 368,
-  "<extra_id_10>": 269,
-  "<extra_id_110>": 369,
-  "<extra_id_111>": 370,
-  "<extra_id_112>": 371,
-  "<extra_id_113>": 372,
-  "<extra_id_114>": 373,
-  "<extra_id_115>": 374,
-  "<extra_id_116>": 375,
-  "<extra_id_117>": 376,
-  "<extra_id_118>": 377,
-  "<extra_id_119>": 378,
-  "<extra_id_11>": 270,
-  "<extra_id_120>": 379,
-  "<extra_id_121>": 380,
-  "<extra_id_122>": 381,
-  "<extra_id_123>": 382,
-  "<extra_id_124>": 383,
-  "<extra_id_12>": 271,
-  "<extra_id_13>": 272,
-  "<extra_id_14>": 273,
-  "<extra_id_15>": 274,
-  "<extra_id_16>": 275,
-  "<extra_id_17>": 276,
-  "<extra_id_18>": 277,
-  "<extra_id_19>": 278,
-  "<extra_id_1>": 260,
-  "<extra_id_20>": 279,
-  "<extra_id_21>": 280,
-  "<extra_id_22>": 281,
-  "<extra_id_23>": 282,
-  "<extra_id_24>": 283,
-  "<extra_id_25>": 284,
-  "<extra_id_26>": 285,
-  "<extra_id_27>": 286,
-  "<extra_id_28>": 287,
-  "<extra_id_29>": 288,
-  "<extra_id_2>": 261,
-  "<extra_id_30>": 289,
-  "<extra_id_31>": 290,
-  "<extra_id_32>": 291,
-  "<extra_id_33>": 292,
-  "<extra_id_34>": 293,
-  "<extra_id_35>": 294,
-  "<extra_id_36>": 295,
-  "<extra_id_37>": 296,
-  "<extra_id_38>": 297,
-  "<extra_id_39>": 298,
-  "<extra_id_3>": 262,
-  "<extra_id_40>": 299,
-  "<extra_id_41>": 300,
-  "<extra_id_42>": 301,
-  "<extra_id_43>": 302,
-  "<extra_id_44>": 303,
-  "<extra_id_45>": 304,
-  "<extra_id_46>": 305,
-  "<extra_id_47>": 306,
-  "<extra_id_48>": 307,
-  "<extra_id_49>": 308,
-  "<extra_id_4>": 263,
-  "<extra_id_50>": 309,
-  "<extra_id_51>": 310,
-  "<extra_id_52>": 311,
-  "<extra_id_53>": 312,
-  "<extra_id_54>": 313,
-  "<extra_id_55>": 314,
-  "<extra_id_56>": 315,
-  "<extra_id_57>": 316,
-  "<extra_id_58>": 317,
-  "<extra_id_59>": 318,
-  "<extra_id_5>": 264,
-  "<extra_id_60>": 319,
-  "<extra_id_61>": 320,
-  "<extra_id_62>": 321,
-  "<extra_id_63>": 322,
-  "<extra_id_64>": 323,
-  "<extra_id_65>": 324,
-  "<extra_id_66>": 325,
-  "<extra_id_67>": 326,
-  "<extra_id_68>": 327,
-  "<extra_id_69>": 328,
-  "<extra_id_6>": 265,
-  "<extra_id_70>": 329,
-  "<extra_id_71>": 330,
-  "<extra_id_72>": 331,
-  "<extra_id_73>": 332,
-  "<extra_id_74>": 333,
-  "<extra_id_75>": 334,
-  "<extra_id_76>": 335,
-  "<extra_id_77>": 336,
-  "<extra_id_78>": 337,
-  "<extra_id_79>": 338,
-  "<extra_id_7>": 266,
-  "<extra_id_80>": 339,
-  "<extra_id_81>": 340,
-  "<extra_id_82>": 341,
-  "<extra_id_83>": 342,
-  "<extra_id_84>": 343,
-  "<extra_id_85>": 344,
-  "<extra_id_86>": 345,
-  "<extra_id_87>": 346,
-  "<extra_id_88>": 347,
-  "<extra_id_89>": 348,
-  "<extra_id_8>": 267,
-  "<extra_id_90>": 349,
-  "<extra_id_91>": 350,
-  "<extra_id_92>": 351,
-  "<extra_id_93>": 352,
-  "<extra_id_94>": 353,
-  "<extra_id_95>": 354,
-  "<extra_id_96>": 355,
-  "<extra_id_97>": 356,
-  "<extra_id_98>": 357,
-  "<extra_id_99>": 358,
-  "<extra_id_9>": 268
-}
--- a/comfy/text_encoders/byt5_tokenizer/special_tokens_map.json
+++ b/comfy/text_encoders/byt5_tokenizer/special_tokens_map.json
@@ -1,150 +0,0 @@
-{
-  "additional_special_tokens": [
-    "<extra_id_0>",
-    "<extra_id_1>",
-    "<extra_id_2>",
-    "<extra_id_3>",
-    "<extra_id_4>",
-    "<extra_id_5>",
-    "<extra_id_6>",
-    "<extra_id_7>",
-    "<extra_id_8>",
-    "<extra_id_9>",
-    "<extra_id_10>",
-    "<extra_id_11>",
-    "<extra_id_12>",
-    "<extra_id_13>",
-    "<extra_id_14>",
-    "<extra_id_15>",
-    "<extra_id_16>",
-    "<extra_id_17>",
-    "<extra_id_18>",
-    "<extra_id_19>",
-    "<extra_id_20>",
-    "<extra_id_21>",
-    "<extra_id_22>",
-    "<extra_id_23>",
-    "<extra_id_24>",
-    "<extra_id_25>",
-    "<extra_id_26>",
-    "<extra_id_27>",
-    "<extra_id_28>",
-    "<extra_id_29>",
-    "<extra_id_30>",
-    "<extra_id_31>",
-    "<extra_id_32>",
-    "<extra_id_33>",
-    "<extra_id_34>",
-    "<extra_id_35>",
-    "<extra_id_36>",
-    "<extra_id_37>",
-    "<extra_id_38>",
-    "<extra_id_39>",
-    "<extra_id_40>",
-    "<extra_id_41>",
-    "<extra_id_42>",
-    "<extra_id_43>",
-    "<extra_id_44>",
-    "<extra_id_45>",
-    "<extra_id_46>",
-    "<extra_id_47>",
-    "<extra_id_48>",
-    "<extra_id_49>",
-    "<extra_id_50>",
-    "<extra_id_51>",
-    "<extra_id_52>",
-    "<extra_id_53>",
-    "<extra_id_54>",
-    "<extra_id_55>",
-    "<extra_id_56>",
-    "<extra_id_57>",
-    "<extra_id_58>",
-    "<extra_id_59>",
-    "<extra_id_60>",
-    "<extra_id_61>",
-    "<extra_id_62>",
-    "<extra_id_63>",
-    "<extra_id_64>",
-    "<extra_id_65>",
-    "<extra_id_66>",
-    "<extra_id_67>",
-    "<extra_id_68>",
-    "<extra_id_69>",
-    "<extra_id_70>",
-    "<extra_id_71>",
-    "<extra_id_72>",
-    "<extra_id_73>",
-    "<extra_id_74>",
-    "<extra_id_75>",
-    "<extra_id_76>",
-    "<extra_id_77>",
-    "<extra_id_78>",
-    "<extra_id_79>",
-    "<extra_id_80>",
-    "<extra_id_81>",
-    "<extra_id_82>",
-    "<extra_id_83>",
-    "<extra_id_84>",
-    "<extra_id_85>",
-    "<extra_id_86>",
-    "<extra_id_87>",
-    "<extra_id_88>",
-    "<extra_id_89>",
-    "<extra_id_90>",
-    "<extra_id_91>",
-    "<extra_id_92>",
-    "<extra_id_93>",
-    "<extra_id_94>",
-    "<extra_id_95>",
-    "<extra_id_96>",
-    "<extra_id_97>",
-    "<extra_id_98>",
-    "<extra_id_99>",
-    "<extra_id_100>",
-    "<extra_id_101>",
-    "<extra_id_102>",
-    "<extra_id_103>",
-    "<extra_id_104>",
-    "<extra_id_105>",
-    "<extra_id_106>",
-    "<extra_id_107>",
-    "<extra_id_108>",
-    "<extra_id_109>",
-    "<extra_id_110>",
-    "<extra_id_111>",
-    "<extra_id_112>",
-    "<extra_id_113>",
-    "<extra_id_114>",
-    "<extra_id_115>",
-    "<extra_id_116>",
-    "<extra_id_117>",
-    "<extra_id_118>",
-    "<extra_id_119>",
-    "<extra_id_120>",
-    "<extra_id_121>",
-    "<extra_id_122>",
-    "<extra_id_123>",
-    "<extra_id_124>"
-  ],
-  "eos_token": {
-    "content": "</s>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  },
-  "pad_token": {
-    "content": "<pad>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  },
-  "unk_token": {
-    "content": "<unk>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  }
-}
--- a/comfy/text_encoders/byt5_tokenizer/tokenizer_config.json
+++ b/comfy/text_encoders/byt5_tokenizer/tokenizer_config.json
--- a/comfy/text_encoders/hunyuan_image.py
+++ b/comfy/text_encoders/hunyuan_image.py
@@ -1,103 +0,0 @@
-from comfy import sd1_clip
-import comfy.text_encoders.llama
-from .qwen_image import QwenImageTokenizer, QwenImageTEModel
-from transformers import ByT5Tokenizer
-import os
-import re
-
-class ByT5SmallTokenizer(sd1_clip.SDTokenizer):
-    def __init__(self, embedding_directory=None, tokenizer_data={}):
-        tokenizer_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "byt5_tokenizer")
-        super().__init__(tokenizer_path, pad_with_end=False, embedding_size=1472, embedding_key='byt5_small', tokenizer_class=ByT5Tokenizer, has_start_token=False, pad_to_max_length=False, max_length=99999999, min_length=1, tokenizer_data=tokenizer_data)
-
-class HunyuanImageTokenizer(QwenImageTokenizer):
-    def __init__(self, embedding_directory=None, tokenizer_data={}):
-        super().__init__(embedding_directory=embedding_directory, tokenizer_data=tokenizer_data)
-        self.llama_template = "<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>"
-        # self.llama_template_images = "{}"
-        self.byt5 = ByT5SmallTokenizer(embedding_directory=embedding_directory, tokenizer_data=tokenizer_data)
-
-    def tokenize_with_weights(self, text:str, return_word_ids=False, **kwargs):
-        out = super().tokenize_with_weights(text, return_word_ids, **kwargs)
-
-        # ByT5 processing for HunyuanImage
-        text_prompt_texts = []
-        pattern_quote_double = r'\"(.*?)\"'
-        pattern_quote_chinese_single = r'‘(.*?)’'
-        pattern_quote_chinese_double = r'“(.*?)”'
-
-        matches_quote_double = re.findall(pattern_quote_double, text)
-        matches_quote_chinese_single = re.findall(pattern_quote_chinese_single, text)
-        matches_quote_chinese_double = re.findall(pattern_quote_chinese_double, text)
-
-        text_prompt_texts.extend(matches_quote_double)
-        text_prompt_texts.extend(matches_quote_chinese_single)
-        text_prompt_texts.extend(matches_quote_chinese_double)
-
-        if len(text_prompt_texts) > 0:
-            out['byt5'] = self.byt5.tokenize_with_weights(''.join(map(lambda a: 'Text "{}". '.format(a), text_prompt_texts)), return_word_ids, **kwargs)
-        return out
-
-class Qwen25_7BVLIModel(sd1_clip.SDClipModel):
-    def __init__(self, device="cpu", layer="hidden", layer_idx=-3, dtype=None, attention_mask=True, model_options={}):
-        llama_scaled_fp8 = model_options.get("qwen_scaled_fp8", None)
-        if llama_scaled_fp8 is not None:
-            model_options = model_options.copy()
-            model_options["scaled_fp8"] = llama_scaled_fp8
-        super().__init__(device=device, layer=layer, layer_idx=layer_idx, textmodel_json_config={}, dtype=dtype, special_tokens={"pad": 151643}, layer_norm_hidden_state=False, model_class=comfy.text_encoders.llama.Qwen25_7BVLI, enable_attention_masks=attention_mask, return_attention_masks=attention_mask, model_options=model_options)
-
-
-class ByT5SmallModel(sd1_clip.SDClipModel):
-    def __init__(self, device="cpu", layer="last", layer_idx=None, dtype=None, model_options={}):
-        textmodel_json_config = os.path.join(os.path.dirname(os.path.realpath(__file__)), "byt5_config_small_glyph.json")
-        super().__init__(device=device, layer=layer, layer_idx=layer_idx, textmodel_json_config=textmodel_json_config, dtype=dtype, model_options=model_options, special_tokens={"end": 1, "pad": 0}, model_class=comfy.text_encoders.t5.T5, enable_attention_masks=True, zero_out_masked=True)
-
-
-class HunyuanImageTEModel(QwenImageTEModel):
-    def __init__(self, byt5=True, device="cpu", dtype=None, model_options={}):
-        super(QwenImageTEModel, self).__init__(device=device, dtype=dtype, name="qwen25_7b", clip_model=Qwen25_7BVLIModel, model_options=model_options)
-
-        if byt5:
-            self.byt5_small = ByT5SmallModel(device=device, dtype=dtype, model_options=model_options)
-        else:
-            self.byt5_small = None
-
-    def encode_token_weights(self, token_weight_pairs):
-        tok_pairs = token_weight_pairs["qwen25_7b"][0]
-        template_end = -1
-        if tok_pairs[0][0] == 27:
-            if len(tok_pairs) > 36:  # refiner prompt uses a fixed 36 template_end
-                template_end = 36
-
-        cond, p, extra = super().encode_token_weights(token_weight_pairs, template_end=template_end)
-        if self.byt5_small is not None and "byt5" in token_weight_pairs:
-            out = self.byt5_small.encode_token_weights(token_weight_pairs["byt5"])
-            extra["conditioning_byt5small"] = out[0]
-        return cond, p, extra
-
-    def set_clip_options(self, options):
-        super().set_clip_options(options)
-        if self.byt5_small is not None:
-            self.byt5_small.set_clip_options(options)
-
-    def reset_clip_options(self):
-        super().reset_clip_options()
-        if self.byt5_small is not None:
-            self.byt5_small.reset_clip_options()
-
-    def load_sd(self, sd):
-        if "encoder.block.0.layer.0.SelfAttention.o.weight" in sd:
-            return self.byt5_small.load_sd(sd)
-        else:
-            return super().load_sd(sd)
-
-def te(byt5=True, dtype_llama=None, llama_scaled_fp8=None):
-    class QwenImageTEModel_(HunyuanImageTEModel):
-        def __init__(self, device="cpu", dtype=None, model_options={}):
-            if llama_scaled_fp8 is not None and "scaled_fp8" not in model_options:
-                model_options = model_options.copy()
-                model_options["qwen_scaled_fp8"] = llama_scaled_fp8
-            if dtype_llama is not None:
-                dtype = dtype_llama
-            super().__init__(byt5=byt5, device=device, dtype=dtype, model_options=model_options)
-    return QwenImageTEModel_
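
Review note on the removed `HunyuanImageTokenizer.tokenize_with_weights`: it collects text inside ASCII double quotes and inside Chinese single and double quotes, then formats each match as `Text "...". ` before passing the result to the ByT5 tokenizer. The sketch below reproduces only that string handling, without the tokenizer. `build_byt5_prompt` and `QUOTE_PATTERNS` are hypothetical names introduced for illustration.

```python
import re

# Same three patterns as the removed code: "..." then ‘...’ then “...”.
QUOTE_PATTERNS = [r'"(.*?)"', r'‘(.*?)’', r'“(.*?)”']


def build_byt5_prompt(text):
    """Return the glyph prompt built from quoted spans, or None if there are none."""
    quoted = []
    for pattern in QUOTE_PATTERNS:
        # Matches are grouped per pattern, not by position in the prompt.
        quoted.extend(re.findall(pattern, text))
    if not quoted:
        return None
    return ''.join('Text "{}". '.format(q) for q in quoted)


prompt = build_byt5_prompt('A neon sign that says "OPEN" above a menu reading “咖啡”')
print(prompt)
```

Because matches are gathered one pattern at a time, ASCII-quoted spans always come first in the output, whatever their position in the original prompt.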
--- a/comfy/text_encoders/llama.py
+++ b/comfy/text_encoders/llama.py
@@ -3,7 +3,6 @@ import torch.nn as nn
 from dataclasses import dataclass
 from typing import Optional, Any
 import math
-import logging

 from comfy.ldm.modules.attention import optimized_attention_for_device
 import comfy.model_management
@@ -29,9 +28,6 @@ class Llama2Config:
    mlp_activation = "silu"
    qkv_bias = False
    rope_dims = None
-    q_norm = None
-    k_norm = None
-    rope_scale = None

@dataclass
 class Qwen25_3BConfig:
@@ -50,9 +46,6 @@ class Qwen25_3BConfig:
    mlp_activation = "silu"
    qkv_bias = True
    rope_dims = None
-    q_norm = None
-    k_norm = None
-    rope_scale = None

@dataclass
 class Qwen25_7BVLI_Config:
@@ -71,9 +64,6 @@ class Qwen25_7BVLI_Config:
    mlp_activation = "silu"
    qkv_bias = True
    rope_dims = [16, 24, 24]
-    q_norm = None
-    k_norm = None
-    rope_scale = None

@dataclass
 class Gemma2_2B_Config:
@@ -92,32 +82,6 @@ class Gemma2_2B_Config:
    mlp_activation = "gelu_pytorch_tanh"
    qkv_bias = False
    rope_dims = None
-    q_norm = None
-    k_norm = None
-    sliding_attention = None
-    rope_scale = None
-
-@dataclass
-class Gemma3_4B_Config:
-    vocab_size: int = 262208
-    hidden_size: int = 2560
-    intermediate_size: int = 10240
-    num_hidden_layers: int = 34
-    num_attention_heads: int = 8
-    num_key_value_heads: int = 4
-    max_position_embeddings: int = 131072
-    rms_norm_eps: float = 1e-6
-    rope_theta = [10000.0, 1000000.0]
-    transformer_type: str = "gemma3"
-    head_dim = 256
-    rms_norm_add = True
-    mlp_activation = "gelu_pytorch_tanh"
-    qkv_bias = False
-    rope_dims = None
-    q_norm = "gemma3"
-    k_norm = "gemma3"
-    sliding_attention = [False, False, False, False, False, 1024]
-    rope_scale = [1.0, 8.0]

 class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5, add=False, device=None, dtype=None):
@@ -142,49 +106,33 @@ def rotate_half(x):
    return torch.cat((-x2, x1), dim=-1)


-def precompute_freqs_cis(head_dim, position_ids, theta, rope_scale=None, rope_dims=None, device=None):
-    if not isinstance(theta, list):
-        theta = [theta]
+def precompute_freqs_cis(head_dim, position_ids, theta, rope_dims=None, device=None):
+    theta_numerator = torch.arange(0, head_dim, 2, device=device).float()
+    inv_freq = 1.0 / (theta ** (theta_numerator / head_dim))

-    out = []
-    for index, t in enumerate(theta):
-        theta_numerator = torch.arange(0, head_dim, 2, device=device).float()
-        inv_freq = 1.0 / (t ** (theta_numerator / head_dim))
+    inv_freq_expanded = inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
+    position_ids_expanded = position_ids[:, None, :].float()
+    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+    emb = torch.cat((freqs, freqs), dim=-1)
+    cos = emb.cos()
+    sin = emb.sin()
+    if rope_dims is not None and position_ids.shape[0] > 1:
+        mrope_section = rope_dims * 2
+        cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1).unsqueeze(0)
+        sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1).unsqueeze(0)
+    else:
+        cos = cos.unsqueeze(1)
+        sin = sin.unsqueeze(1)

-        if rope_scale is not None:
-            if isinstance(rope_scale, list):
-                inv_freq /= rope_scale[index]
-            else:
-                inv_freq /= rope_scale
-
-        inv_freq_expanded = inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
-        position_ids_expanded = position_ids[:, None, :].float()
-        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
-        emb = torch.cat((freqs, freqs), dim=-1)
-        cos = emb.cos()
-        sin = emb.sin()
-        if rope_dims is not None and position_ids.shape[0] > 1:
-            mrope_section = rope_dims * 2
-            cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1).unsqueeze(0)
-            sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1).unsqueeze(0)
-        else:
-            cos = cos.unsqueeze(1)
-            sin = sin.unsqueeze(1)
-        out.append((cos, sin))
-
-    if len(out) == 1:
-        return out[0]
-
-    return out
+    return (cos, sin)


 def apply_rope(xq, xk, freqs_cis):
-    org_dtype = xq.dtype
    cos = freqs_cis[0]
    sin = freqs_cis[1]
    q_embed = (xq * cos) + (rotate_half(xq) * sin)
    k_embed = (xk * cos) + (rotate_half(xk) * sin)
-    return q_embed.to(org_dtype), k_embed.to(org_dtype)
+    return q_embed, k_embed


 class Attention(nn.Module):
@@ -203,14 +151,6 @@ class Attention(nn.Module):
        self.v_proj = ops.Linear(config.hidden_size, self.num_kv_heads * self.head_dim, bias=config.qkv_bias, device=device, dtype=dtype)
        self.o_proj = ops.Linear(self.inner_size, config.hidden_size, bias=False, device=device, dtype=dtype)

-        self.q_norm = None
-        self.k_norm = None
-
-        if config.q_norm == "gemma3":
-            self.q_norm = RMSNorm(self.head_dim, eps=config.rms_norm_eps, add=config.rms_norm_add, device=device, dtype=dtype)
-        if config.k_norm == "gemma3":
-            self.k_norm = RMSNorm(self.head_dim, eps=config.rms_norm_eps, add=config.rms_norm_add, device=device, dtype=dtype)
-
    def forward(
        self,
        hidden_states: torch.Tensor,
@@ -227,11 +167,6 @@ class Attention(nn.Module):
        xk = xk.view(batch_size, seq_length, self.num_kv_heads, self.head_dim).transpose(1, 2)
        xv = xv.view(batch_size, seq_length, self.num_kv_heads, self.head_dim).transpose(1, 2)

-        if self.q_norm is not None:
-            xq = self.q_norm(xq)
-        if self.k_norm is not None:
-            xk = self.k_norm(xk)
-
        xq, xk = apply_rope(xq, xk, freqs_cis=freqs_cis)

        xk = xk.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
@@ -256,7 +191,7 @@ class MLP(nn.Module):
        return self.down_proj(self.activation(self.gate_proj(x)) * self.up_proj(x))

 class TransformerBlock(nn.Module):
-    def __init__(self, config: Llama2Config, index, device=None, dtype=None, ops: Any = None):
+    def __init__(self, config: Llama2Config, device=None, dtype=None, ops: Any = None):
        super().__init__()
        self.self_attn = Attention(config, device=device, dtype=dtype, ops=ops)
        self.mlp = MLP(config, device=device, dtype=dtype, ops=ops)
@@ -290,7 +225,7 @@ class TransformerBlock(nn.Module):
        return x

 class TransformerBlockGemma2(nn.Module):
-    def __init__(self, config: Llama2Config, index, device=None, dtype=None, ops: Any = None):
+    def __init__(self, config: Llama2Config, device=None, dtype=None, ops: Any = None):
        super().__init__()
        self.self_attn = Attention(config, device=device, dtype=dtype, ops=ops)
        self.mlp = MLP(config, device=device, dtype=dtype, ops=ops)
@@ -299,13 +234,6 @@ class TransformerBlockGemma2(nn.Module):
        self.pre_feedforward_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, add=config.rms_norm_add, device=device, dtype=dtype)
        self.post_feedforward_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, add=config.rms_norm_add, device=device, dtype=dtype)

-        if config.sliding_attention is not None:  # TODO: implement. (Not that necessary since models are trained on less than 1024 tokens)
-            self.sliding_attention = config.sliding_attention[index % len(config.sliding_attention)]
-        else:
-            self.sliding_attention = False
-
-        self.transformer_type = config.transformer_type
-
    def forward(
        self,
        x: torch.Tensor,
@@ -313,14 +241,6 @@ class TransformerBlockGemma2(nn.Module):
        freqs_cis: Optional[torch.Tensor] = None,
        optimized_attention=None,
    ):
-        if self.transformer_type == 'gemma3':
-            if self.sliding_attention:
-                if x.shape[1] > self.sliding_attention:
-                    logging.warning("Warning: sliding attention not implemented, results may be incorrect")
-                freqs_cis = freqs_cis[1]
-            else:
-                freqs_cis = freqs_cis[0]
-
        # Self Attention
        residual = x
        x = self.input_layernorm(x)
@@ -355,7 +275,7 @@ class Llama2_(nn.Module):
            device=device,
            dtype=dtype
        )
-        if self.config.transformer_type == "gemma2" or self.config.transformer_type == "gemma3":
+        if self.config.transformer_type == "gemma2":
            transformer = TransformerBlockGemma2
            self.normalize_in = True
        else:
@@ -363,8 +283,8 @@ class Llama2_(nn.Module):
            self.normalize_in = False

        self.layers = nn.ModuleList([
-            transformer(config, index=i, device=device, dtype=dtype, ops=ops)
-            for i in range(config.num_hidden_layers)
+            transformer(config, device=device, dtype=dtype, ops=ops)
+            for _ in range(config.num_hidden_layers)
        ])
        self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, add=config.rms_norm_add, device=device, dtype=dtype)
        # self.lm_head = ops.Linear(config.hidden_size, config.vocab_size, bias=False, device=device, dtype=dtype)
@@ -384,7 +304,6 @@ class Llama2_(nn.Module):
        freqs_cis = precompute_freqs_cis(self.config.head_dim,
                                         position_ids,
                                         self.config.rope_theta,
-                                         self.config.rope_scale,
                                         self.config.rope_dims,
                                         device=x.device)

@@ -480,25 +399,21 @@ class Qwen25_7BVLI(BaseLlama, torch.nn.Module):

    def forward(self, x, attention_mask=None, embeds=None, num_tokens=None, intermediate_output=None, final_layer_norm_intermediate=True, dtype=None, embeds_info=[]):
        grid = None
-        position_ids = None
-        offset = 0
        for e in embeds_info:
            if e.get("type") == "image":
                grid = e.get("extra", None)
+                position_ids = torch.zeros((3, embeds.shape[1]), device=embeds.device)
                start = e.get("index")
-                if position_ids is None:
-                    position_ids = torch.zeros((3, embeds.shape[1]), device=embeds.device)
-                    position_ids[:, :start] = torch.arange(0, start, device=embeds.device)
+                position_ids[:, :start] = torch.arange(0, start, device=embeds.device)
                end = e.get("size") + start
                len_max = int(grid.max()) // 2
                start_next = len_max + start
-                position_ids[:, end:] = torch.arange(start_next + offset, start_next + (embeds.shape[1] - end) + offset, device=embeds.device)
-                position_ids[0, start:end] = start + offset
+                position_ids[:, end:] = torch.arange(start_next, start_next + (embeds.shape[1] - end), device=embeds.device)
+                position_ids[0, start:end] = start
                max_d = int(grid[0][1]) // 2
-                position_ids[1, start:end] = torch.arange(start + offset, start + max_d + offset, device=embeds.device).unsqueeze(1).repeat(1, math.ceil((end - start) / max_d)).flatten(0)[:end - start]
+                position_ids[1, start:end] = torch.arange(start, start + max_d, device=embeds.device).unsqueeze(1).repeat(1, math.ceil((end - start) / max_d)).flatten(0)[:end - start]
                max_d = int(grid[0][2]) // 2
-                position_ids[2, start:end] = torch.arange(start + offset, start + max_d + offset, device=embeds.device).unsqueeze(0).repeat(math.ceil((end - start) / max_d), 1).flatten(0)[:end - start]
-                offset += len_max - (end - start)
+                position_ids[2, start:end] = torch.arange(start, start + max_d, device=embeds.device).unsqueeze(0).repeat(math.ceil((end - start) / max_d), 1).flatten(0)[:end - start]

        if grid is None:
            position_ids = None
@@ -513,12 +428,3 @@ class Gemma2_2B(BaseLlama, torch.nn.Module):

        self.model = Llama2_(config, device=device, dtype=dtype, ops=operations)
        self.dtype = dtype
-
-class Gemma3_4B(BaseLlama, torch.nn.Module):
-    def __init__(self, config_dict, dtype, device, operations):
-        super().__init__()
-        config = Gemma3_4B_Config(**config_dict)
-        self.num_layers = config.num_hidden_layers
-
-        self.model = Llama2_(config, device=device, dtype=dtype, ops=operations)
-        self.dtype = dtype
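Review note on the `precompute_freqs_cis` / `apply_rope` hunks above: after this change `theta` is a single scalar and one `(cos, sin)` pair is returned. The sketch below reproduces that arithmetic for one position in plain Python, as a rough aid for reading the hunk. It is not the project's code: plain lists stand in for tensors, the `rope_dims` sectioning is left out, `rope_cos_sin` is a hypothetical name, and `rotate_half` assumes the usual first-half/second-half split (only its `return` line is visible in the diff).

```python
import math


def rope_cos_sin(head_dim, position, theta=10000.0):
    """Return (cos, sin) lists of length head_dim for a single position."""
    # inv_freq[i] = 1 / theta ** (2i / head_dim), as in the hunk.
    inv_freq = [1.0 / (theta ** (n / head_dim)) for n in range(0, head_dim, 2)]
    freqs = [position * f for f in inv_freq]
    # The hunk concatenates the frequencies with themselves: cat((freqs, freqs)).
    emb = freqs + freqs
    return [math.cos(e) for e in emb], [math.sin(e) for e in emb]


def rotate_half(x):
    """List analogue of torch.cat((-x2, x1)), assuming x1/x2 are the two halves."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, cos, sin):
    """Apply x * cos + rotate_half(x) * sin elementwise."""
    rotated = rotate_half(x)
    return [a * c + r * s for a, c, r, s in zip(x, cos, rotated, sin)]


cos, sin = rope_cos_sin(head_dim=4, position=3)
q = apply_rope([1.0, 2.0, 3.0, 4.0], cos, sin)
```

Each pair `(x[i], x[i + head_dim // 2])` is rotated by the same angle, so the vector norm is unchanged, and position 0 leaves the input as it was.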
--- a/comfy/text_encoders/lumina2.py
+++ b/comfy/text_encoders/lumina2.py
@@ -11,41 +11,23 @@ class Gemma2BTokenizer(sd1_clip.SDTokenizer):
    def state_dict(self):
        return {"spiece_model": self.tokenizer.serialize_model()}

-class Gemma3_4BTokenizer(sd1_clip.SDTokenizer):
-    def __init__(self, embedding_directory=None, tokenizer_data={}):
-        tokenizer = tokenizer_data.get("spiece_model", None)
-        super().__init__(tokenizer, pad_with_end=False, embedding_size=2560, embedding_key='gemma3_4b', tokenizer_class=SPieceTokenizer, has_end_token=False, pad_to_max_length=False, max_length=99999999, min_length=1, tokenizer_args={"add_bos": True, "add_eos": False}, tokenizer_data=tokenizer_data)
-
-    def state_dict(self):
-        return {"spiece_model": self.tokenizer.serialize_model()}

 class LuminaTokenizer(sd1_clip.SD1Tokenizer):
    def __init__(self, embedding_directory=None, tokenizer_data={}):
        super().__init__(embedding_directory=embedding_directory, tokenizer_data=tokenizer_data, name="gemma2_2b", tokenizer=Gemma2BTokenizer)

-class NTokenizer(sd1_clip.SD1Tokenizer):
-    def __init__(self, embedding_directory=None, tokenizer_data={}):
-        super().__init__(embedding_directory=embedding_directory, tokenizer_data=tokenizer_data, name="gemma3_4b", tokenizer=Gemma3_4BTokenizer)

 class Gemma2_2BModel(sd1_clip.SDClipModel):
    def __init__(self, device="cpu", layer="hidden", layer_idx=-2, dtype=None, attention_mask=True, model_options={}):
        super().__init__(device=device, layer=layer, layer_idx=layer_idx, textmodel_json_config={}, dtype=dtype, special_tokens={"start": 2, "pad": 0}, layer_norm_hidden_state=False, model_class=comfy.text_encoders.llama.Gemma2_2B, enable_attention_masks=attention_mask, return_attention_masks=attention_mask, model_options=model_options)

-class Gemma3_4BModel(sd1_clip.SDClipModel):
-    def __init__(self, device="cpu", layer="hidden", layer_idx=-2, dtype=None, attention_mask=True, model_options={}):
-        super().__init__(device=device, layer=layer, layer_idx=layer_idx, textmodel_json_config={}, dtype=dtype, special_tokens={"start": 2, "pad": 0}, layer_norm_hidden_state=False, model_class=comfy.text_encoders.llama.Gemma3_4B, enable_attention_masks=attention_mask, return_attention_masks=attention_mask, model_options=model_options)

 class LuminaModel(sd1_clip.SD1ClipModel):
-    def __init__(self, device="cpu", dtype=None, model_options={}, name="gemma2_2b", clip_model=Gemma2_2BModel):
-        super().__init__(device=device, dtype=dtype, name=name, clip_model=clip_model, model_options=model_options)
+    def __init__(self, device="cpu", dtype=None, model_options={}):
+        super().__init__(device=device, dtype=dtype, name="gemma2_2b", clip_model=Gemma2_2BModel, model_options=model_options)


-def te(dtype_llama=None, llama_scaled_fp8=None, model_type="gemma2_2b"):
-    if model_type == "gemma2_2b":
-        model = Gemma2_2BModel
-    elif model_type == "gemma3_4b":
-        model = Gemma3_4BModel
-
+def te(dtype_llama=None, llama_scaled_fp8=None):
    class LuminaTEModel_(LuminaModel):
        def __init__(self, device="cpu", dtype=None, model_options={}):
            if llama_scaled_fp8 is not None and "scaled_fp8" not in model_options:
@@ -53,5 +35,5 @@ def te(dtype_llama=None, llama_scaled_fp8=None, model_type="gemma2_2b"):
                model_options["scaled_fp8"] = llama_scaled_fp8
            if dtype_llama is not None:
                dtype = dtype_llama
-            super().__init__(device=device, dtype=dtype, name=model_type, model_options=model_options, clip_model=model)
+            super().__init__(device=device, dtype=dtype, model_options=model_options)
    return LuminaTEModel_
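Review note on `te()` above: the function is a class factory. It closes over `dtype_llama` and `llama_scaled_fp8` and returns a subclass whose constructor applies them. A minimal sketch of that pattern follows, under stated assumptions: `make_te`, `Base` and `TEModel` are hypothetical stand-ins, strings stand in for dtypes, and the `.copy()` step is taken from the analogous factory earlier in this diff (it falls between the two hunks shown for this file).

```python
def make_te(dtype_override=None, scaled_fp8=None):
    """Return a class whose constructor bakes in the two captured settings."""

    class Base:
        def __init__(self, dtype=None, model_options=None):
            self.dtype = dtype
            self.model_options = {} if model_options is None else model_options

    class TEModel(Base):
        def __init__(self, dtype=None, model_options=None):
            model_options = {} if model_options is None else model_options
            if scaled_fp8 is not None and "scaled_fp8" not in model_options:
                # Copy before writing so the caller's dict is never mutated.
                model_options = model_options.copy()
                model_options["scaled_fp8"] = scaled_fp8
            if dtype_override is not None:
                # The captured dtype wins over the one passed at construction.
                dtype = dtype_override
            super().__init__(dtype=dtype, model_options=model_options)

    return TEModel


caller_options = {"a": 1}
model = make_te(dtype_override="bf16", scaled_fp8="fp8")(
    dtype="fp32", model_options=caller_options
)
```

The copy matters here because the real signatures use a shared default `model_options={}`; writing into it in place would leak the setting into later instances.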
--- a/comfy/text_encoders/qwen_image.py
+++ b/comfy/text_encoders/qwen_image.py
@@ -18,22 +18,13 @@ class QwenImageTokenizer(sd1_clip.SD1Tokenizer):
        self.llama_template_images = "<|im_start|>system\nDescribe the key features of the input image (color, shape, size, texture, objects, background), then explain how the user's text instruction should alter or modify the image. Generate a new image that meets the user's requirements while maintaining consistency with the original input where appropriate.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>{}<|im_end|>\n<|im_start|>assistant\n"

    def tokenize_with_weights(self, text, return_word_ids=False, llama_template=None, images=[], **kwargs):
-        skip_template = False
-        if text.startswith('<|im_start|>'):
-            skip_template = True
-        if text.startswith('<|start_header_id|>'):
-            skip_template = True
-
-        if skip_template:
-            llama_text = text
-        else:
-            if llama_template is None:
-                if len(images) > 0:
-                    llama_text = self.llama_template_images.format(text)
-                else:
-                    llama_text = self.llama_template.format(text)
+        if llama_template is None:
+            if len(images) > 0:
+                llama_text = self.llama_template_images.format(text)
            else:
-                llama_text = llama_template.format(text)
+                llama_text = self.llama_template.format(text)
+        else:
+            llama_text = llama_template.format(text)
        tokens = super().tokenize_with_weights(llama_text, return_word_ids=return_word_ids, disable_weights=True, **kwargs)
        key_name = next(iter(tokens))
        embed_count = 0
@@ -56,23 +47,22 @@ class QwenImageTEModel(sd1_clip.SD1ClipModel):
    def __init__(self, device="cpu", dtype=None, model_options={}):
        super().__init__(device=device, dtype=dtype, name="qwen25_7b", clip_model=Qwen25_7BVLIModel, model_options=model_options)

-    def encode_token_weights(self, token_weight_pairs, template_end=-1):
+    def encode_token_weights(self, token_weight_pairs):
        out, pooled, extra = super().encode_token_weights(token_weight_pairs)
        tok_pairs = token_weight_pairs["qwen25_7b"][0]
        count_im_start = 0
-        if template_end == -1:
-            for i, v in enumerate(tok_pairs):
-                elem = v[0]
-                if not torch.is_tensor(elem):
-                    if isinstance(elem, numbers.Integral):
-                        if elem == 151644 and count_im_start < 2:
-                            template_end = i
-                            count_im_start += 1
+        for i, v in enumerate(tok_pairs):
+            elem = v[0]
+            if not torch.is_tensor(elem):
+                if isinstance(elem, numbers.Integral):
+                    if elem == 151644 and count_im_start < 2:
+                        template_end = i
+                        count_im_start += 1

-            if out.shape[1] > (template_end + 3):
-                if tok_pairs[template_end + 1][0] == 872:
-                    if tok_pairs[template_end + 2][0] == 198:
-                        template_end += 3
+        if out.shape[1] > (template_end + 3):
+            if tok_pairs[template_end + 1][0] == 872:
+                if tok_pairs[template_end + 2][0] == 198:
+                    template_end += 3

        out = out[:, template_end:]

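Review note on `encode_token_weights` above: the loop records the index of the second occurrence of token id 151644, then skips three more positions if the next two ids are 872 and 198. The sketch below isolates that scan on a plain list of ids. Assumptions: `find_template_end` is a hypothetical helper, the constant names reflect my reading of the ids (chat-template start marker, `user`, newline) and are not confirmed by the diff, `len(token_ids)` stands in for `out.shape[1]`, and the `default` argument is an addition so the sketch stays defined when no marker is present.

```python
import numbers

IM_START = 151644  # id the hunk compares against
USER = 872         # assumed meaning of the id checked at template_end + 1
NEWLINE = 198      # assumed meaning of the id checked at template_end + 2


def find_template_end(token_ids, default=0):
    """Return the index at which the prompt proper starts."""
    template_end = default
    seen = 0
    for i, tok in enumerate(token_ids):
        # Non-integer entries (e.g. embedded tensors) are skipped, as in the hunk.
        if isinstance(tok, numbers.Integral) and tok == IM_START and seen < 2:
            template_end = i
            seen += 1
    if len(token_ids) > template_end + 3:
        if token_ids[template_end + 1] == USER and token_ids[template_end + 2] == NEWLINE:
            template_end += 3
    return template_end


ids = [151644, 8948, 198, 1, 151645, 198, 151644, 872, 198, 42, 43]
start = find_template_end(ids)
```

With the ids above, the second marker sits at index 6 and is followed by 872 and 198, so the returned index is 9.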
--- a/comfy/utils.py
+++ b/comfy/utils.py
@@ -39,11 +39,7 @@ if hasattr(torch.serialization, "add_safe_globals"):  # TODO: this was added in
        pass
    ModelCheckpoint.__module__ = "pytorch_lightning.callbacks.model_checkpoint"

-    def scalar(*args, **kwargs):
-        from numpy.core.multiarray import scalar as sc
-        return sc(*args, **kwargs)
-    scalar.__module__ = "numpy.core.multiarray"
-
+    from numpy.core.multiarray import scalar
    from numpy import dtype
    from numpy.dtypes import Float64DType
    from _codecs import encode
@@ -1106,25 +1102,3 @@ def upscale_dit_mask(mask: torch.Tensor, img_size_in, img_size_out):
            dim=1
        )
        return out
-
-def pack_latents(latents):
-    latent_shapes = []
-    tensors = []
-    for tensor in latents:
-        latent_shapes.append(tensor.shape)
-        tensors.append(tensor.reshape(tensor.shape[0], 1, -1))
-
-    latent = torch.cat(tensors, dim=-1)
-    return latent, latent_shapes
-
-def unpack_latents(combined_latent, latent_shapes):
-    if len(latent_shapes) > 1:
-        output_tensors = []
-        for shape in latent_shapes:
-            cut = math.prod(shape[1:])
-            tens = combined_latent[:, :, :cut]
-            combined_latent = combined_latent[:, :, cut:]
-            output_tensors.append(tens.reshape([tens.shape[0]] + list(shape)[1:]))
-    else:
-        output_tensors = combined_latent
-    return output_tensors
--- a/comfy/weight_adapter/loha.py
+++ b/comfy/weight_adapter/loha.py
@@ -130,12 +130,12 @@ class LoHaAdapter(WeightAdapterBase):
    def create_train(cls, weight, rank=1, alpha=1.0):
        out_dim = weight.shape[0]
        in_dim = weight.shape[1:].numel()
-        mat1 = torch.empty(out_dim, rank, device=weight.device, dtype=torch.float32)
-        mat2 = torch.empty(rank, in_dim, device=weight.device, dtype=torch.float32)
+        mat1 = torch.empty(out_dim, rank, device=weight.device, dtype=weight.dtype)
+        mat2 = torch.empty(rank, in_dim, device=weight.device, dtype=weight.dtype)
        torch.nn.init.normal_(mat1, 0.1)
        torch.nn.init.constant_(mat2, 0.0)
-        mat3 = torch.empty(out_dim, rank, device=weight.device, dtype=torch.float32)
-        mat4 = torch.empty(rank, in_dim, device=weight.device, dtype=torch.float32)
+        mat3 = torch.empty(out_dim, rank, device=weight.device, dtype=weight.dtype)
+        mat4 = torch.empty(rank, in_dim, device=weight.device, dtype=weight.dtype)
        torch.nn.init.normal_(mat3, 0.1)
        torch.nn.init.normal_(mat4, 0.01)
        return LohaDiff(
--- a/comfy/weight_adapter/lokr.py
+++ b/comfy/weight_adapter/lokr.py
@@ -89,8 +89,8 @@ class LoKrAdapter(WeightAdapterBase):
        in_dim = weight.shape[1:].numel()
        out1, out2 = factorization(out_dim, rank)
        in1, in2 = factorization(in_dim, rank)
-        mat1 = torch.empty(out1, in1, device=weight.device, dtype=torch.float32)
-        mat2 = torch.empty(out2, in2, device=weight.device, dtype=torch.float32)
+        mat1 = torch.empty(out1, in1, device=weight.device, dtype=weight.dtype)
+        mat2 = torch.empty(out2, in2, device=weight.device, dtype=weight.dtype)
        torch.nn.init.kaiming_uniform_(mat2, a=5**0.5)
        torch.nn.init.constant_(mat1, 0.0)
        return LokrDiff(
--- a/comfy/weight_adapter/lora.py
+++ b/comfy/weight_adapter/lora.py
@@ -66,8 +66,8 @@ class LoRAAdapter(WeightAdapterBase):
    def create_train(cls, weight, rank=1, alpha=1.0):
        out_dim = weight.shape[0]
        in_dim = weight.shape[1:].numel()
-        mat1 = torch.empty(out_dim, rank, device=weight.device, dtype=torch.float32)
-        mat2 = torch.empty(rank, in_dim, device=weight.device, dtype=torch.float32)
+        mat1 = torch.empty(out_dim, rank, device=weight.device, dtype=weight.dtype)
+        mat2 = torch.empty(rank, in_dim, device=weight.device, dtype=weight.dtype)
        torch.nn.init.kaiming_uniform_(mat1, a=5**0.5)
        torch.nn.init.constant_(mat2, 0.0)
        return LoraDiff(
--- a/comfy/weight_adapter/oft.py
+++ b/comfy/weight_adapter/oft.py
@@ -68,7 +68,7 @@ class OFTAdapter(WeightAdapterBase):
    def create_train(cls, weight, rank=1, alpha=1.0):
        out_dim = weight.shape[0]
        block_size, block_num = factorization(out_dim, rank)
-        block = torch.zeros(block_num, block_size, block_size, device=weight.device, dtype=torch.float32)
+        block = torch.zeros(block_num, block_size, block_size, device=weight.device, dtype=weight.dtype)
        return OFTDiff(
            (block, None, alpha, None)
        )
--- a/comfy_api/latest/__init__.py
+++ b/comfy_api/latest/__init__.py
@@ -8,8 +8,8 @@ from comfy_api.internal.async_to_sync import create_sync_class
 from comfy_api.latest._input import ImageInput, AudioInput, MaskInput, LatentInput, VideoInput
 from comfy_api.latest._input_impl import VideoFromFile, VideoFromComponents
 from comfy_api.latest._util import VideoCodec, VideoContainer, VideoComponents
-from . import _io as io
-from . import _ui as ui
+from comfy_api.latest._io import _IO as io  #noqa: F401
+from comfy_api.latest._ui import _UI as ui  #noqa: F401
 # from comfy_api.latest._resources import _RESOURCES as resources  #noqa: F401
 from comfy_execution.utils import get_executing_context
 from comfy_execution.progress import get_progress_state, PreviewImageTuple
@@ -114,10 +114,6 @@ if TYPE_CHECKING:
    ComfyAPISync: Type[comfy_api.latest.generated.ComfyAPISyncStub.ComfyAPISyncStub]
 ComfyAPISync = create_sync_class(ComfyAPI_latest)

-# create new aliases for io and ui
-IO = io
-UI = ui
-
 __all__ = [
    "ComfyAPI",
    "ComfyAPISync",
@@ -125,8 +121,4 @@ __all__ = [
    "InputImpl",
    "Types",
    "ComfyExtension",
-    "io",
-    "IO",
-    "ui",
-    "UI",
 ]
--- a/comfy_api/latest/_input/video_types.py
+++ b/comfy_api/latest/_input/video_types.py
@@ -1,6 +1,6 @@
 from __future__ import annotations
 from abc import ABC, abstractmethod
-from typing import Optional, Union, IO
+from typing import Optional, Union
 import io
 import av
 from comfy_api.util import VideoContainer, VideoCodec, VideoComponents
@@ -23,7 +23,7 @@ class VideoInput(ABC):
    @abstractmethod
    def save_to(
        self,
-        path: Union[str, IO[bytes]],
+        path: str,
        format: VideoContainer = VideoContainer.AUTO,
        codec: VideoCodec = VideoCodec.AUTO,
        metadata: Optional[dict] = None
--- a/comfy_api/latest/_io.py
+++ b/comfy_api/latest/_io.py
@@ -331,30 +331,16 @@ class String(ComfyTypeIO):
            })

 @comfytype(io_type="COMBO")
-class Combo(ComfyTypeIO):
+class Combo(ComfyTypeI):
    Type = str
    class Input(WidgetInput):
        """Combo input (dropdown)."""
        Type = str
-        def __init__(
-            self,
-            id: str,
-            options: list[str] | list[int] | type[Enum] = None,
-            display_name: str=None,
-            optional=False,
-            tooltip: str=None,
-            lazy: bool=None,
-            default: str | int | Enum = None,
-            control_after_generate: bool=None,
-            upload: UploadType=None,
-            image_folder: FolderType=None,
-            remote: RemoteOptions=None,
-            socketless: bool=None,
-        ):
-            if isinstance(options, type) and issubclass(options, Enum):
-                options = [v.value for v in options]
-            if isinstance(default, Enum):
-                default = default.value
+        def __init__(self, id: str, options: list[str]=None, display_name: str=None, optional=False, tooltip: str=None, lazy: bool=None,
+                    default: str=None, control_after_generate: bool=None,
+                    upload: UploadType=None, image_folder: FolderType=None,
+                    remote: RemoteOptions=None,
+                    socketless: bool=None):
            super().__init__(id, display_name, optional, tooltip, lazy, default, socketless)
            self.multiselect = False
            self.options = options
@@ -374,14 +360,6 @@ class Combo(ComfyTypeIO):
                "remote": self.remote.as_dict() if self.remote else None,
            })

-    class Output(Output):
-        def __init__(self, id: str=None, display_name: str=None, options: list[str]=None, tooltip: str=None, is_output_list=False):
-            super().__init__(id, display_name, tooltip, is_output_list)
-            self.options = options if options is not None else []
-
-        @property
-        def io_type(self):
-            return self.options

 @comfytype(io_type="COMBO")
 class MultiCombo(ComfyTypeI):
@@ -1212,18 +1190,13 @@ class _ComfyNodeBaseInternal(_ComfyNodeInternal):
        raise NotImplementedError

    @classmethod
-    def validate_inputs(cls, **kwargs) -> bool | str:
-        """Optionally, define this function to validate inputs; equivalent to V1's VALIDATE_INPUTS.
-
-        If the function returns a string, it will be used as the validation error message for the node.
-        """
+    def validate_inputs(cls, **kwargs) -> bool:
+        """Optionally, define this function to validate inputs; equivalent to V1's VALIDATE_INPUTS."""
        raise NotImplementedError

    @classmethod
    def fingerprint_inputs(cls, **kwargs) -> Any:
-        """Optionally, define this function to fingerprint inputs; equivalent to V1's IS_CHANGED.
-
-        If this function returns the same value as last run, the node will not be executed."""
+        """Optionally, define this function to fingerprint inputs; equivalent to V1's IS_CHANGED."""
        raise NotImplementedError

    @classmethod
@@ -1582,78 +1555,77 @@ class _UIOutput(ABC):
        ...


-__all__ = [
-    "FolderType",
-    "UploadType",
-    "RemoteOptions",
-    "NumberDisplay",
+class _IO:
+    FolderType = FolderType
+    UploadType = UploadType
+    RemoteOptions = RemoteOptions
+    NumberDisplay = NumberDisplay

-    "comfytype",
-    "Custom",
-    "Input",
-    "WidgetInput",
-    "Output",
-    "ComfyTypeI",
-    "ComfyTypeIO",
+    comfytype = staticmethod(comfytype)
+    Custom = staticmethod(Custom)
+    Input = Input
+    WidgetInput = WidgetInput
+    Output = Output
+    ComfyTypeI = ComfyTypeI
+    ComfyTypeIO = ComfyTypeIO
+    #---------------------------------
    # Supported Types
-    "Boolean",
-    "Int",
-    "Float",
-    "String",
-    "Combo",
-    "MultiCombo",
-    "Image",
-    "WanCameraEmbedding",
-    "Webcam",
-    "Mask",
-    "Latent",
-    "Conditioning",
-    "Sampler",
-    "Sigmas",
-    "Noise",
-    "Guider",
-    "Clip",
-    "ControlNet",
-    "Vae",
-    "Model",
-    "ClipVision",
-    "ClipVisionOutput",
-    "AudioEncoder",
-    "AudioEncoderOutput",
-    "StyleModel",
-    "Gligen",
-    "UpscaleModel",
-    "Audio",
-    "Video",
-    "SVG",
-    "LoraModel",
-    "LossMap",
-    "Voxel",
-    "Mesh",
-    "Hooks",
-    "HookKeyframes",
-    "TimestepsRange",
-    "LatentOperation",
-    "FlowControl",
-    "Accumulation",
-    "Load3DCamera",
-    "Load3D",
-    "Load3DAnimation",
-    "Photomaker",
-    "Point",
-    "FaceAnalysis",
-    "BBOX",
-    "SEGS",
-    "AnyType",
-    "MultiType",
-    # Other classes
-    "HiddenHolder",
-    "Hidden",
-    "NodeInfoV1",
-    "NodeInfoV3",
-    "Schema",
-    "ComfyNode",
-    "NodeOutput",
-    "add_to_dict_v1",
-    "add_to_dict_v3",
-]
+    Boolean = Boolean
+    Int = Int
+    Float = Float
+    String = String
+    Combo = Combo
+    MultiCombo = MultiCombo
+    Image = Image
+    WanCameraEmbedding = WanCameraEmbedding
+    Webcam = Webcam
+    Mask = Mask
+    Latent = Latent
+    Conditioning = Conditioning
+    Sampler = Sampler
+    Sigmas = Sigmas
+    Noise = Noise
+    Guider = Guider
+    Clip = Clip
+    ControlNet = ControlNet
+    Vae = Vae
+    Model = Model
+    ClipVision = ClipVision
+    ClipVisionOutput = ClipVisionOutput
+    AudioEncoderOutput = AudioEncoderOutput
+    StyleModel = StyleModel
+    Gligen = Gligen
+    UpscaleModel = UpscaleModel
+    Audio = Audio
+    Video = Video
+    SVG = SVG
+    LoraModel = LoraModel
+    LossMap = LossMap
+    Voxel = Voxel
+    Mesh = Mesh
+    Hooks = Hooks
+    HookKeyframes = HookKeyframes
+    TimestepsRange = TimestepsRange
+    LatentOperation = LatentOperation
+    FlowControl = FlowControl
+    Accumulation = Accumulation
+    Load3DCamera = Load3DCamera
+    Load3D = Load3D
+    Load3DAnimation = Load3DAnimation
+    Photomaker = Photomaker
+    Point = Point
+    FaceAnalysis = FaceAnalysis
+    BBOX = BBOX
+    SEGS = SEGS
+    AnyType = AnyType
+    MultiType = MultiType
+    #---------------------------------
+    HiddenHolder = HiddenHolder
+    Hidden = Hidden
+    NodeInfoV1 = NodeInfoV1
+    NodeInfoV3 = NodeInfoV3
+    Schema = Schema
+    ComfyNode = ComfyNode
+    NodeOutput = NodeOutput
+    add_to_dict_v1 = staticmethod(add_to_dict_v1)
+    add_to_dict_v3 = staticmethod(add_to_dict_v3)
--- a/comfy_api/latest/_ui.py
+++ b/comfy_api/latest/_ui.py
@@ -449,16 +449,15 @@ class PreviewText(_UIOutput):
        return {"text": (self.value,)}


-__all__ = [
-    "SavedResult",
-    "SavedImages",
-    "SavedAudios",
-    "ImageSaveHelper",
-    "AudioSaveHelper",
-    "PreviewImage",
-    "PreviewMask",
-    "PreviewAudio",
-    "PreviewVideo",
-    "PreviewUI3D",
-    "PreviewText",
-]
+class _UI:
+    SavedResult = SavedResult
+    SavedImages = SavedImages
+    SavedAudios = SavedAudios
+    ImageSaveHelper = ImageSaveHelper
+    AudioSaveHelper = AudioSaveHelper
+    PreviewImage = PreviewImage
+    PreviewMask = PreviewMask
+    PreviewAudio = PreviewAudio
+    PreviewVideo = PreviewVideo
+    PreviewUI3D = PreviewUI3D
+    PreviewText = PreviewText
--- a/comfy_api_nodes/apinode_utils.py
+++ b/comfy_api_nodes/apinode_utils.py
@@ -1,8 +1,14 @@
 from __future__ import annotations
 import aiohttp
+import io
+import logging
 import mimetypes
 from typing import Optional, Union
 from comfy.utils import common_upscale
+from comfy_api.input_impl import VideoFromFile
+from comfy_api.util import VideoContainer, VideoCodec
+from comfy_api.input.video_types import VideoInput
+from comfy_api.input.basic_types import AudioInput
 from comfy_api_nodes.apis.client import (
    ApiClient,
    ApiEndpoint,
@@ -12,15 +18,48 @@ from comfy_api_nodes.apis.client import (
    UploadResponse,
 )
 from server import PromptServer
-from comfy.cli_args import args
+

 import numpy as np
 from PIL import Image
 import torch
 import math
 import base64
-from .util import tensor_to_bytesio, bytesio_to_image_tensor
+import uuid
 from io import BytesIO
+import av
+
+
+async def download_url_to_video_output(video_url: str, timeout: int = None) -> VideoFromFile:
+    """Downloads a video from a URL and returns a `VIDEO` output.
+
+    Args:
+        video_url: The URL of the video to download.
+
+    Returns:
+        A Comfy node `VIDEO` output.
+    """
+    video_io = await download_url_to_bytesio(video_url, timeout)
+    if video_io is None:
+        error_msg = f"Failed to download video from {video_url}"
+        logging.error(error_msg)
+        raise ValueError(error_msg)
+    return VideoFromFile(video_io)
+
+
+def downscale_image_tensor(image, total_pixels=1536 * 1024) -> torch.Tensor:
+    """Downscale input image tensor to roughly the specified total pixels."""
+    samples = image.movedim(-1, 1)
+    total = int(total_pixels)
+    scale_by = math.sqrt(total / (samples.shape[3] * samples.shape[2]))
+    if scale_by >= 1:
+        return image
+    width = round(samples.shape[3] * scale_by)
+    height = round(samples.shape[2] * scale_by)
+
+    s = common_upscale(samples, width, height, "lanczos", "disabled")
+    s = s.movedim(1, -1)
+    return s


 async def validate_and_cast_response(
@@ -113,16 +152,19 @@ def validate_aspect_ratio(
            raise TypeError(
                f"Aspect ratio cannot reduce to any less than {minimum_ratio_str} ({minimum_ratio}), but was {aspect_ratio} ({calculated_ratio})."
            )
-        if calculated_ratio > maximum_ratio:
+        elif calculated_ratio > maximum_ratio:
            raise TypeError(
                f"Aspect ratio cannot reduce to any greater than {maximum_ratio_str} ({maximum_ratio}), but was {aspect_ratio} ({calculated_ratio})."
            )
    return aspect_ratio


-async def download_url_to_bytesio(
-    url: str, timeout: int = None, auth_kwargs: Optional[dict[str, str]] = None
-) -> BytesIO:
+def mimetype_to_extension(mime_type: str) -> str:
+    """Converts a MIME type to a file extension."""
+    return mime_type.split("/")[-1].lower()
+
+
+async def download_url_to_bytesio(url: str, timeout: int = None) -> BytesIO:
    """Downloads content from a URL using requests and returns it as BytesIO.

    Args:
@@ -132,27 +174,143 @@ async def download_url_to_bytesio(
    Returns:
        BytesIO object containing the downloaded content.
    """
-    headers = {}
-    if url.startswith("/proxy/"):
-        url = str(args.comfy_api_base).rstrip("/") + url
-        auth_token = auth_kwargs.get("auth_token")
-        comfy_api_key = auth_kwargs.get("comfy_api_key")
-        if auth_token:
-            headers["Authorization"] = f"Bearer {auth_token}"
-        elif comfy_api_key:
-            headers["X-API-KEY"] = comfy_api_key
    timeout_cfg = aiohttp.ClientTimeout(total=timeout) if timeout else None
    async with aiohttp.ClientSession(timeout=timeout_cfg) as session:
-        async with session.get(url, headers=headers) as resp:
+        async with session.get(url) as resp:
            resp.raise_for_status()  # Raises HTTPError for bad responses (4XX or 5XX)
            return BytesIO(await resp.read())


+def bytesio_to_image_tensor(image_bytesio: BytesIO, mode: str = "RGBA") -> torch.Tensor:
+    """Converts image data from BytesIO to a torch.Tensor.
+
+    Args:
+        image_bytesio: BytesIO object containing the image data.
+        mode: The PIL mode to convert the image to (e.g., "RGB", "RGBA").
+
+    Returns:
+        A torch.Tensor representing the image (1, H, W, C).
+
+    Raises:
+        PIL.UnidentifiedImageError: If the image data cannot be identified.
+        ValueError: If the specified mode is invalid.
+    """
+    image = Image.open(image_bytesio)
+    image = image.convert(mode)
+    image_array = np.array(image).astype(np.float32) / 255.0
+    return torch.from_numpy(image_array).unsqueeze(0)
+
+
+async def download_url_to_image_tensor(url: str, timeout: int = None) -> torch.Tensor:
+    """Downloads an image from a URL and returns a [B, H, W, C] tensor."""
+    image_bytesio = await download_url_to_bytesio(url, timeout)
+    return bytesio_to_image_tensor(image_bytesio)
+
+
 def process_image_response(response_content: bytes | str) -> torch.Tensor:
    """Uses content from a Response object and converts it to a torch.Tensor"""
    return bytesio_to_image_tensor(BytesIO(response_content))


+def _tensor_to_pil(image: torch.Tensor, total_pixels: int = 2048 * 2048) -> Image.Image:
+    """Converts a single torch.Tensor image [H, W, C] to a PIL Image, optionally downscaling."""
+    if len(image.shape) > 3:
+        image = image[0]
+    # TODO: remove alpha if not allowed and present
+    input_tensor = image.cpu()
+    input_tensor = downscale_image_tensor(
+        input_tensor.unsqueeze(0), total_pixels=total_pixels
+    ).squeeze()
+    image_np = (input_tensor.numpy() * 255).astype(np.uint8)
+    img = Image.fromarray(image_np)
+    return img
+
+
+def _pil_to_bytesio(img: Image.Image, mime_type: str = "image/png") -> BytesIO:
+    """Converts a PIL Image to a BytesIO object."""
+    if not mime_type:
+        mime_type = "image/png"
+
+    img_byte_arr = io.BytesIO()
+    # Derive PIL format from MIME type (e.g., 'image/png' -> 'PNG')
+    pil_format = mime_type.split("/")[-1].upper()
+    if pil_format == "JPG":
+        pil_format = "JPEG"
+    img.save(img_byte_arr, format=pil_format)
+    img_byte_arr.seek(0)
+    return img_byte_arr
+
+
+def tensor_to_bytesio(
+    image: torch.Tensor,
+    name: Optional[str] = None,
+    total_pixels: int = 2048 * 2048,
+    mime_type: str = "image/png",
+) -> BytesIO:
+    """Converts a torch.Tensor image to a named BytesIO object.
+
+    Args:
+        image: Input torch.Tensor image.
+        name: Optional filename for the BytesIO object.
+        total_pixels: Maximum total pixels for potential downscaling.
+        mime_type: Target image MIME type (e.g., 'image/png', 'image/jpeg', 'image/webp').
+
+    Returns:
+        Named BytesIO object containing the image data.
+    """
+    if not mime_type:
+        mime_type = "image/png"
+
+    pil_image = _tensor_to_pil(image, total_pixels=total_pixels)
+    img_binary = _pil_to_bytesio(pil_image, mime_type=mime_type)
+    img_binary.name = (
+        f"{name if name else uuid.uuid4()}.{mimetype_to_extension(mime_type)}"
+    )
+    return img_binary
+
+
+def tensor_to_base64_string(
+    image_tensor: torch.Tensor,
+    total_pixels: int = 2048 * 2048,
+    mime_type: str = "image/png",
+) -> str:
+    """Convert [B, H, W, C] or [H, W, C] tensor to a base64 string.
+
+    Args:
+        image_tensor: Input torch.Tensor image.
+        total_pixels: Maximum total pixels for potential downscaling.
+        mime_type: Target image MIME type (e.g., 'image/png', 'image/jpeg', 'image/webp').
+
+    Returns:
+        Base64 encoded string of the image.
+    """
+    pil_image = _tensor_to_pil(image_tensor, total_pixels=total_pixels)
+    img_byte_arr = _pil_to_bytesio(pil_image, mime_type=mime_type)
+    img_bytes = img_byte_arr.getvalue()
+    # Encode bytes to base64 string
+    base64_encoded_string = base64.b64encode(img_bytes).decode("utf-8")
+    return base64_encoded_string
+
+
+def tensor_to_data_uri(
+    image_tensor: torch.Tensor,
+    total_pixels: int = 2048 * 2048,
+    mime_type: str = "image/png",
+) -> str:
+    """Converts a tensor image to a Data URI string.
+
+    Args:
+        image_tensor: Input torch.Tensor image.
+        total_pixels: Maximum total pixels for potential downscaling.
+        mime_type: Target image MIME type (e.g., 'image/png', 'image/jpeg', 'image/webp').
+
+    Returns:
+        Data URI string (e.g., 'data:image/png;base64,...').
+    """
+    base64_string = tensor_to_base64_string(image_tensor, total_pixels, mime_type)
+    return f"data:{mime_type};base64,{base64_string}"
+
+
 def text_filepath_to_base64_string(filepath: str) -> str:
    """Converts a text file to a base64 string."""
    with open(filepath, "rb") as f:
@@ -207,6 +365,173 @@ async def upload_file_to_comfyapi(
    return response.download_url


+def video_to_base64_string(
+    video: VideoInput,
+    container_format: VideoContainer = None,
+    codec: VideoCodec = None
+) -> str:
+    """
+    Converts a video input to a base64 string.
+
+    Args:
+        video: The video input to convert
+        container_format: Optional container format to use (defaults to video.container if available)
+        codec: Optional codec to use (defaults to video.codec if available)
+    """
+    video_bytes_io = io.BytesIO()
+
+    # Use provided format/codec if specified, otherwise use video's own if available
+    format_to_use = container_format if container_format is not None else getattr(video, 'container', VideoContainer.MP4)
+    codec_to_use = codec if codec is not None else getattr(video, 'codec', VideoCodec.H264)
+
+    video.save_to(video_bytes_io, format=format_to_use, codec=codec_to_use)
+    video_bytes_io.seek(0)
+    return base64.b64encode(video_bytes_io.getvalue()).decode("utf-8")
+
+
+async def upload_video_to_comfyapi(
+    video: VideoInput,
+    auth_kwargs: Optional[dict[str, str]] = None,
+    container: VideoContainer = VideoContainer.MP4,
+    codec: VideoCodec = VideoCodec.H264,
+    max_duration: Optional[int] = None,
+) -> str:
+    """
+    Uploads a single video to ComfyUI API and returns its download URL.
+    Uses the specified container and codec for saving the video before upload.
+
+    Args:
+        video: VideoInput object (Comfy VIDEO type).
+        auth_kwargs: Optional authentication token(s).
+        container: The video container format to use (default: MP4).
+        codec: The video codec to use (default: H264).
+        max_duration: Optional maximum duration of the video in seconds. If the video is longer than this, an error will be raised.
+
+    Returns:
+        The download URL for the uploaded video file.
+    """
+    if max_duration is not None:
+        try:
+            actual_duration = video.duration_seconds
+        except Exception as e:
+            logging.error(f"Error getting video duration: {e}")
+            raise ValueError(f"Could not verify video duration from source: {e}") from e
+        if actual_duration is not None and actual_duration > max_duration:
+            raise ValueError(
+                f"Video duration ({actual_duration:.2f}s) exceeds the maximum allowed ({max_duration}s)."
+            )
+
+    upload_mime_type = f"video/{container.value.lower()}"
+    filename = f"uploaded_video.{container.value.lower()}"
+
+    # Convert VideoInput to BytesIO using specified container/codec
+    video_bytes_io = io.BytesIO()
+    video.save_to(video_bytes_io, format=container, codec=codec)
+    video_bytes_io.seek(0)
+
+    return await upload_file_to_comfyapi(video_bytes_io, filename, upload_mime_type, auth_kwargs)
+
+
+def audio_tensor_to_contiguous_ndarray(waveform: torch.Tensor) -> np.ndarray:
+    """
+    Prepares audio waveform for av library by converting to a contiguous numpy array.
+
+    Args:
+        waveform: a tensor of shape (batch, channels, samples) derived from a Comfy `AUDIO` type.
+
+    Returns:
+        Contiguous numpy array of the audio waveform. If the audio was batched,
+            the first item is taken.
+    """
+    if waveform.ndim != 3:
+        raise ValueError("Expected waveform tensor shape (batch, channels, samples)")
+
+    # If batch is > 1, keep only the first item (the batch dim is squeezed below)
+    if waveform.shape[0] > 1:
+        waveform = waveform[:1]
+
+    # Prepare for av: remove batch dim, move to CPU, make contiguous, convert to numpy array
+    audio_data_np = waveform.squeeze(0).cpu().contiguous().numpy()
+    if audio_data_np.dtype != np.float32:
+        audio_data_np = audio_data_np.astype(np.float32)
+
+    return audio_data_np
+
+
+def audio_ndarray_to_bytesio(
+    audio_data_np: np.ndarray,
+    sample_rate: int,
+    container_format: str = "mp4",
+    codec_name: str = "aac",
+) -> BytesIO:
+    """
+    Encodes a numpy array of audio data into a BytesIO object.
+    """
+    audio_bytes_io = io.BytesIO()
+    with av.open(audio_bytes_io, mode="w", format=container_format) as output_container:
+        audio_stream = output_container.add_stream(codec_name, rate=sample_rate)
+        frame = av.AudioFrame.from_ndarray(
+            audio_data_np,
+            format="fltp",
+            layout="stereo" if audio_data_np.shape[0] > 1 else "mono",
+        )
+        frame.sample_rate = sample_rate
+        frame.pts = 0
+
+        for packet in audio_stream.encode(frame):
+            output_container.mux(packet)
+
+        # Flush stream
+        for packet in audio_stream.encode(None):
+            output_container.mux(packet)
+
+    audio_bytes_io.seek(0)
+    return audio_bytes_io
+
+
+async def upload_audio_to_comfyapi(
+    audio: AudioInput,
+    auth_kwargs: Optional[dict[str, str]] = None,
+    container_format: str = "mp4",
+    codec_name: str = "aac",
+    mime_type: str = "audio/mp4",
+    filename: str = "uploaded_audio.mp4",
+) -> str:
+    """
+    Uploads a single audio input to ComfyUI API and returns its download URL.
+    Encodes the raw waveform into the specified format before uploading.
+
+    Args:
+        audio: a Comfy `AUDIO` type (contains waveform tensor and sample_rate)
+        auth_kwargs: Optional authentication token(s).
+
+    Returns:
+        The download URL for the uploaded audio file.
+    """
+    sample_rate: int = audio["sample_rate"]
+    waveform: torch.Tensor = audio["waveform"]
+    audio_data_np = audio_tensor_to_contiguous_ndarray(waveform)
+    audio_bytes_io = audio_ndarray_to_bytesio(
+        audio_data_np, sample_rate, container_format, codec_name
+    )
+
+    return await upload_file_to_comfyapi(audio_bytes_io, filename, mime_type, auth_kwargs)
+
+
+def audio_to_base64_string(
+    audio: AudioInput, container_format: str = "mp4", codec_name: str = "aac"
+) -> str:
+    """Converts an audio input to a base64 string."""
+    sample_rate: int = audio["sample_rate"]
+    waveform: torch.Tensor = audio["waveform"]
+    audio_data_np = audio_tensor_to_contiguous_ndarray(waveform)
+    audio_bytes_io = audio_ndarray_to_bytesio(
+        audio_data_np, sample_rate, container_format, codec_name
+    )
+    audio_bytes = audio_bytes_io.getvalue()
+    return base64.b64encode(audio_bytes).decode("utf-8")
+
+
 async def upload_images_to_comfyapi(
    image: torch.Tensor,
    max_images=8,
@@ -259,3 +584,43 @@ def resize_mask_to_image(
    if not allow_gradient:
        mask = (mask > 0.5).float()
    return mask
+
+
+def validate_string(
+    string: str,
+    strip_whitespace=True,
+    field_name="prompt",
+    min_length=None,
+    max_length=None,
+):
+    if string is None:
+        raise Exception(f"Field '{field_name}' cannot be empty.")
+    if strip_whitespace:
+        string = string.strip()
+    if min_length and len(string) < min_length:
+        raise Exception(
+            f"Field '{field_name}' cannot be shorter than {min_length} characters; was {len(string)} characters long."
+        )
+    if max_length and len(string) > max_length:
+        raise Exception(
+            f"Field '{field_name}' cannot be longer than {max_length} characters; was {len(string)} characters long."
+        )
+
+
+def image_tensor_pair_to_batch(
+    image1: torch.Tensor, image2: torch.Tensor
+) -> torch.Tensor:
+    """
+    Converts a pair of image tensors to a batch tensor.
+    If the images are not the same size, image2 is resized (bilinear,
+    center-cropped) to match image1.
+    """
+    if image1.shape[1:] != image2.shape[1:]:
+        image2 = common_upscale(
+            image2.movedim(-1, 1),
+            image1.shape[2],
+            image1.shape[1],
+            "bilinear",
+            "center",
+        ).movedim(1, -1)
+    return torch.cat((image1, image2), dim=0)
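Note on the image helpers above: `tensor_to_bytesio` and `tensor_to_data_uri` both derive their output naming from the MIME type. The sketch below reproduces that step with the standard library only. `bytes_to_data_uri` is a hypothetical helper, not part of the patch; it takes already-encoded image bytes instead of a tensor so that it runs without torch or PIL.

```python
import base64


def mimetype_to_extension(mime_type: str) -> str:
    """Mirror of the helper in the diff: 'image/PNG' -> 'png'."""
    return mime_type.split("/")[-1].lower()


def bytes_to_data_uri(payload: bytes, mime_type: str = "image/png") -> str:
    """Hypothetical helper: wrap already-encoded bytes in a Data URI."""
    encoded = base64.b64encode(payload).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


# The payload here is just a few bytes, not a complete image file.
uri = bytes_to_data_uri(b"\x89PNG", "image/png")
extension = mimetype_to_extension("image/JPEG")
```

The extension is taken verbatim from the MIME subtype, so `image/jpg` would yield `jpg`. The patch's `_pil_to_bytesio` handles that case separately by mapping `JPG` to the `JPEG` format name before saving.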
--- a/comfy_api_nodes/apis/__init__.py
+++ b/comfy_api_nodes/apis/__init__.py
@@ -2,7 +2,6 @@
 #   filename:  filtered-openapi.yaml
 #   timestamp: 2025-07-30T08:54:00+00:00

-# pylint: disable
 from __future__ import annotations

 from datetime import date, datetime
@@ -952,11 +951,7 @@ class MagicPrompt2(str, Enum):


 class StyleType1(str, Enum):
-    AUTO = 'AUTO'
    GENERAL = 'GENERAL'
-    REALISTIC = 'REALISTIC'
-    DESIGN = 'DESIGN'
-    FICTION = 'FICTION'


 class ImagenImageGenerationInstance(BaseModel):
@@ -1321,7 +1316,6 @@ class KlingTextToVideoModelName(str, Enum):
    kling_v1 = 'kling-v1'
    kling_v1_6 = 'kling-v1-6'
    kling_v2_1_master = 'kling-v2-1-master'
-    kling_v2_5_turbo = 'kling-v2-5-turbo'


 class KlingVideoGenAspectRatio(str, Enum):
@@ -1356,7 +1350,6 @@ class KlingVideoGenModelName(str, Enum):
    kling_v2_master = 'kling-v2-master'
    kling_v2_1 = 'kling-v2-1'
    kling_v2_1_master = 'kling-v2-1-master'
-    kling_v2_5_turbo = 'kling-v2-5-turbo'


 class KlingVideoResult(BaseModel):
@@ -2683,7 +2676,7 @@ class ReleaseNote(BaseModel):


 class RenderingSpeed(str, Enum):
-    DEFAULT = 'DEFAULT'
+    BALANCED = 'BALANCED'
    TURBO = 'TURBO'
    QUALITY = 'QUALITY'

@@ -4925,14 +4918,6 @@ class IdeogramV3EditRequest(BaseModel):
        None,
        description='A set of images to use as style references (maximum total size 10MB across all style references). The images should be in JPEG, PNG or WebP format.',
    )
-    character_reference_images: Optional[List[str]] = Field(
-        None,
-        description='Generations with character reference are subject to the character reference pricing. A set of images to use as character references (maximum total size 10MB across all character references), currently only supports 1 character reference image. The images should be in JPEG, PNG or WebP format.'
-    )
-    character_reference_images_mask: Optional[List[str]] = Field(
-        None,
-        description='Optional masks for character reference images. When provided, must match the number of character_reference_images. Each mask should be a grayscale image of the same dimensions as the corresponding character reference image. The images should be in JPEG, PNG or WebP format.'
-    )


 class IdeogramV3Request(BaseModel):
@@ -4966,14 +4951,6 @@ class IdeogramV3Request(BaseModel):
    style_type: Optional[StyleType1] = Field(
        None, description='The type of style to apply'
    )
-    character_reference_images: Optional[List[str]] = Field(
-        None,
-        description='Generations with character reference are subject to the character reference pricing. A set of images to use as character references (maximum total size 10MB across all character references), currently only supports 1 character reference image. The images should be in JPEG, PNG or WebP format.'
-    )
-    character_reference_images_mask: Optional[List[str]] = Field(
-        None,
-        description='Optional masks for character reference images. When provided, must match the number of character_reference_images. Each mask should be a grayscale image of the same dimensions as the corresponding character reference image. The images should be in JPEG, PNG or WebP format.'
-    )


 class ImagenGenerateImageResponse(BaseModel):
--- a/comfy_api_nodes/apis/bfl_api.py
+++ b/comfy_api_nodes/apis/bfl_api.py
@@ -50,6 +50,44 @@ class BFLFluxFillImageRequest(BaseModel):
    mask: str = Field(None, description='A Base64-encoded string representing the mask of the areas you with to modify.')


+class BFLFluxCannyImageRequest(BaseModel):
+    prompt: str = Field(..., description='Text prompt for image generation')
+    prompt_upsampling: Optional[bool] = Field(
+        None, description='Whether to perform upsampling on the prompt. If active, automatically modifies the prompt for more creative generation.'
+    )
+    canny_low_threshold: Optional[int] = Field(None, description='Low threshold for Canny edge detection')
+    canny_high_threshold: Optional[int] = Field(None, description='High threshold for Canny edge detection')
+    seed: Optional[int] = Field(None, description='The seed value for reproducibility.')
+    steps: conint(ge=15, le=50) = Field(..., description='Number of steps for the image generation process')
+    guidance: confloat(ge=1, le=100) = Field(..., description='Guidance strength for the image generation process')
+    safety_tolerance: Optional[conint(ge=0, le=6)] = Field(
+        6, description='Tolerance level for input and output moderation. Between 0 and 6, 0 being most strict, 6 being least strict. Defaults to 6.'
+    )
+    output_format: Optional[BFLOutputFormat] = Field(
+        BFLOutputFormat.png, description="Output format for the generated image. Can be 'jpeg' or 'png'.", examples=['png']
+    )
+    control_image: Optional[str] = Field(None, description='Base64 encoded image to use as control input if no preprocessed image is provided')
+    preprocessed_image: Optional[str] = Field(None, description='Optional pre-processed image that will bypass the control preprocessing step')
+
+
+class BFLFluxDepthImageRequest(BaseModel):
+    prompt: str = Field(..., description='Text prompt for image generation')
+    prompt_upsampling: Optional[bool] = Field(
+        None, description='Whether to perform upsampling on the prompt. If active, automatically modifies the prompt for more creative generation.'
+    )
+    seed: Optional[int] = Field(None, description='The seed value for reproducibility.')
+    steps: conint(ge=15, le=50) = Field(..., description='Number of steps for the image generation process')
+    guidance: confloat(ge=1, le=100) = Field(..., description='Guidance strength for the image generation process')
+    safety_tolerance: Optional[conint(ge=0, le=6)] = Field(
+        6, description='Tolerance level for input and output moderation. Between 0 and 6, 0 being most strict, 6 being least strict. Defaults to 6.'
+    )
+    output_format: Optional[BFLOutputFormat] = Field(
+        BFLOutputFormat.png, description="Output format for the generated image. Can be 'jpeg' or 'png'.", examples=['png']
+    )
+    control_image: Optional[str] = Field(None, description='Base64 encoded image to use as control input if no preprocessed image is provided')
+    preprocessed_image: Optional[str] = Field(None, description='Optional pre-processed image that will bypass the control preprocessing step')
+
+
 class BFLFluxProGenerateRequest(BaseModel):
    prompt: str = Field(..., description='The text prompt for image generation.')
    prompt_upsampling: Optional[bool] = Field(
@@ -122,8 +160,15 @@ class BFLStatus(str, Enum):
    error = "Error"


-class BFLFluxStatusResponse(BaseModel):
+class BFLFluxProStatusResponse(BaseModel):
    id: str = Field(..., description="The unique identifier for the generation task.")
    status: BFLStatus = Field(..., description="The status of the task.")
-    result: Optional[Dict[str, Any]] = Field(None, description="The result of the task (null if not completed).")
-    progress: Optional[float] = Field(None, description="The progress of the task (0.0 to 1.0).", ge=0.0, le=1.0)
+    result: Optional[Dict[str, Any]] = Field(
+        None, description="The result of the task (null if not completed)."
+    )
+    progress: confloat(ge=0.0, le=1.0) = Field(
+        ..., description="The progress of the task (0.0 to 1.0)."
+    )
+    details: Optional[Dict[str, Any]] = Field(
+        None, description="Additional details about the task (null if not available)."
+    )
--- a/comfy_api_nodes/apis/client.py
+++ b/comfy_api_nodes/apis/client.py
@@ -95,10 +95,9 @@ import aiohttp
 import asyncio
 import logging
 import io
-import os
 import socket
 from aiohttp.client_exceptions import ClientError, ClientResponseError
-from typing import Type, Optional, Any, TypeVar, Generic, Callable
+from typing import Dict, Type, Optional, Any, TypeVar, Generic, Callable, Tuple
 from enum import Enum
 import json
 from urllib.parse import urljoin, urlparse
@@ -175,7 +174,7 @@ class ApiClient:
        max_retries: int = 3,
        retry_delay: float = 1.0,
        retry_backoff_factor: float = 2.0,
-        retry_status_codes: Optional[tuple[int, ...]] = None,
+        retry_status_codes: Optional[Tuple[int, ...]] = None,
        session: Optional[aiohttp.ClientSession] = None,
    ):
        self.base_url = base_url
@@ -199,9 +198,9 @@ class ApiClient:

    @staticmethod
    def _create_json_payload_args(
-        data: Optional[dict[str, Any]] = None,
-        headers: Optional[dict[str, str]] = None,
-    ) -> dict[str, Any]:
+        data: Optional[Dict[str, Any]] = None,
+        headers: Optional[Dict[str, str]] = None,
+    ) -> Dict[str, Any]:
        return {
            "json": data,
            "headers": headers,
@@ -209,27 +208,24 @@ class ApiClient:

    def _create_form_data_args(
        self,
-        data: dict[str, Any] | None,
-        files: dict[str, Any] | None,
-        headers: Optional[dict[str, str]] = None,
+        data: Dict[str, Any] | None,
+        files: Dict[str, Any] | None,
+        headers: Optional[Dict[str, str]] = None,
        multipart_parser: Callable | None = None,
-    ) -> dict[str, Any]:
+    ) -> Dict[str, Any]:
        if headers and "Content-Type" in headers:
            del headers["Content-Type"]

        if multipart_parser and data:
            data = multipart_parser(data)

-        if isinstance(data, aiohttp.FormData):
-            form = data  # If the parser already returned a FormData, pass it through
-        else:
-            form = aiohttp.FormData(default_to_multipart=True)
-            if data:  # regular text fields
-                for k, v in data.items():
-                    if v is None:
-                        continue  # aiohttp fails to serialize "None" values
-                    # aiohttp expects strings or bytes; convert enums etc.
-                    form.add_field(k, str(v) if not isinstance(v, (bytes, bytearray)) else v)
+        form = aiohttp.FormData(default_to_multipart=True)
+        if data:  # regular text fields
+            for k, v in data.items():
+                if v is None:
+                    continue  # aiohttp fails to serialize "None" values
+                # aiohttp expects strings or bytes; convert enums etc.
+                form.add_field(k, str(v) if not isinstance(v, (bytes, bytearray)) else v)

        if files:
            file_iter = files if isinstance(files, list) else files.items()
@@ -254,9 +250,9 @@ class ApiClient:

    @staticmethod
    def _create_urlencoded_form_data_args(
-        data: dict[str, Any],
-        headers: Optional[dict[str, str]] = None,
-    ) -> dict[str, Any]:
+        data: Dict[str, Any],
+        headers: Optional[Dict[str, str]] = None,
+    ) -> Dict[str, Any]:
        headers = headers or {}
        headers["Content-Type"] = "application/x-www-form-urlencoded"
        return {
@@ -264,7 +260,7 @@ class ApiClient:
            "headers": headers,
        }

-    def get_headers(self) -> dict[str, str]:
+    def get_headers(self) -> Dict[str, str]:
        """Get headers for API requests, including authentication if available"""
        headers = {"Content-Type": "application/json", "Accept": "application/json"}

@@ -275,7 +271,7 @@ class ApiClient:

        return headers

-    async def _check_connectivity(self, target_url: str) -> dict[str, bool]:
+    async def _check_connectivity(self, target_url: str) -> Dict[str, bool]:
        """
        Check connectivity to determine if network issues are local or server-related.

@@ -316,14 +312,14 @@ class ApiClient:
        self,
        method: str,
        path: str,
-        params: Optional[dict[str, Any]] = None,
-        data: Optional[dict[str, Any]] = None,
-        files: Optional[dict[str, Any] | list[tuple[str, Any]]] = None,
-        headers: Optional[dict[str, str]] = None,
+        params: Optional[Dict[str, Any]] = None,
+        data: Optional[Dict[str, Any]] = None,
+        files: Optional[Dict[str, Any] | list[tuple[str, Any]]] = None,
+        headers: Optional[Dict[str, str]] = None,
        content_type: str = "application/json",
        multipart_parser: Callable | None = None,
        retry_count: int = 0,  # Used internally for tracking retries
-    ) -> dict[str, Any]:
+    ) -> Dict[str, Any]:
        """
        Make an HTTP request to the API with automatic retries for transient errors.

@@ -359,10 +355,10 @@ class ApiClient:
        if params:
            params = {k: v for k, v in params.items() if v is not None}  # aiohttp fails to serialize None values

-        logging.debug("[DEBUG] Request Headers: %s", request_headers)
-        logging.debug("[DEBUG] Files: %s", files)
-        logging.debug("[DEBUG] Params: %s", params)
-        logging.debug("[DEBUG] Data: %s", data)
+        logging.debug(f"[DEBUG] Request Headers: {request_headers}")
+        logging.debug(f"[DEBUG] Files: {files}")
+        logging.debug(f"[DEBUG] Params: {params}")
+        logging.debug(f"[DEBUG] Data: {data}")

        if content_type == "application/x-www-form-urlencoded":
            payload_args = self._create_urlencoded_form_data_args(data or {}, request_headers)
@@ -485,7 +481,7 @@ class ApiClient:
            retry_delay: Initial delay between retries in seconds
            retry_backoff_factor: Multiplier for the delay after each retry
        """
-        headers: dict[str, str] = {}
+        headers: Dict[str, str] = {}
        skip_auto_headers: set[str] = set()
        if content_type:
            headers["Content-Type"] = content_type
@@ -503,9 +499,7 @@ class ApiClient:
        else:
            raise ValueError("File must be BytesIO or str path")

-        parsed = urlparse(upload_url)
-        basename = os.path.basename(parsed.path) or parsed.netloc or "upload"
-        operation_id = f"upload_{basename}_{uuid.uuid4().hex[:8]}"
+        operation_id = f"upload_{upload_url.split('/')[-1]}_{uuid.uuid4().hex[:8]}"
        request_logger.log_request_response(
            operation_id=operation_id,
            request_method="PUT",
@@ -538,7 +532,7 @@ class ApiClient:
                    request_method="PUT",
                    request_url=upload_url,
                    response_status_code=e.status if hasattr(e, "status") else None,
-                    response_headers=dict(e.headers) if hasattr(e, "headers") else None,
+                    response_headers=dict(e.headers) if getattr(e, "headers", None) else None,
                    response_content=None,
                    error_message=f"{type(e).__name__}: {str(e)}",
                )
@@ -558,7 +552,7 @@ class ApiClient:
        *req_meta,
        retry_count: int,
        response_content: dict | str = "",
-    ) -> dict[str, Any]:
+    ) -> Dict[str, Any]:
        status_code = exc.status
        if status_code == 401:
            user_friendly = "Unauthorized: Please login first to use this node."
@@ -592,9 +586,9 @@ class ApiClient:
            error_message=f"HTTP Error {exc.status}",
        )

-        logging.debug("[DEBUG] API Error: %s (Status: %s)", user_friendly, status_code)
+        logging.debug(f"[DEBUG] API Error: {user_friendly} (Status: {status_code})")
        if response_content:
-            logging.debug("[DEBUG] Response content: %s", response_content)
+            logging.debug(f"[DEBUG] Response content: {response_content}")

        # Retry if eligible
        if status_code in self.retry_status_codes and retry_count < self.max_retries:
@@ -659,7 +653,7 @@ class ApiEndpoint(Generic[T, R]):
        method: HttpMethod,
        request_model: Type[T],
        response_model: Type[R],
-        query_params: Optional[dict[str, Any]] = None,
+        query_params: Optional[Dict[str, Any]] = None,
    ):
        """Initialize an API endpoint definition.

@@ -684,12 +678,12 @@ class SynchronousOperation(Generic[T, R]):
        self,
        endpoint: ApiEndpoint[T, R],
        request: T,
-        files: Optional[dict[str, Any] | list[tuple[str, Any]]] = None,
+        files: Optional[Dict[str, Any] | list[tuple[str, Any]]] = None,
        api_base: str | None = None,
        auth_token: Optional[str] = None,
        comfy_api_key: Optional[str] = None,
-        auth_kwargs: Optional[dict[str, str]] = None,
-        timeout: float = 7200.0,
+        auth_kwargs: Optional[Dict[str, str]] = None,
+        timeout: float = 604800.0,
        verify_ssl: bool = True,
        content_type: str = "application/json",
        multipart_parser: Callable | None = None,
@@ -729,7 +723,7 @@ class SynchronousOperation(Generic[T, R]):
            )

        try:
-            request_dict: Optional[dict[str, Any]]
+            request_dict: Optional[Dict[str, Any]]
            if isinstance(self.request, EmptyRequest):
                request_dict = None
            else:
@@ -738,9 +732,11 @@ class SynchronousOperation(Generic[T, R]):
                    if isinstance(v, Enum):
                        request_dict[k] = v.value

-            logging.debug("[DEBUG] API Request: %s %s", self.endpoint.method.value, self.endpoint.path)
-            logging.debug("[DEBUG] Request Data: %s", json.dumps(request_dict, indent=2))
-            logging.debug("[DEBUG] Query Params: %s", self.endpoint.query_params)
+            logging.debug(
+                f"[DEBUG] API Request: {self.endpoint.method.value} {self.endpoint.path}"
+            )
+            logging.debug(f"[DEBUG] Request Data: {json.dumps(request_dict, indent=2)}")
+            logging.debug(f"[DEBUG] Query Params: {self.endpoint.query_params}")

            response_json = await client.request(
                self.endpoint.method.value,
@@ -755,11 +751,11 @@ class SynchronousOperation(Generic[T, R]):
            logging.debug("=" * 50)
            logging.debug("[DEBUG] RESPONSE DETAILS:")
            logging.debug("[DEBUG] Status Code: 200 (Success)")
-            logging.debug("[DEBUG] Response Body: %s", json.dumps(response_json, indent=2))
+            logging.debug(f"[DEBUG] Response Body: {json.dumps(response_json, indent=2)}")
            logging.debug("=" * 50)

            parsed_response = self.endpoint.response_model.model_validate(response_json)
-            logging.debug("[DEBUG] Parsed Response: %s", parsed_response)
+            logging.debug(f"[DEBUG] Parsed Response: {parsed_response}")
            return parsed_response
        finally:
            if owns_client:
@@ -782,16 +778,14 @@ class PollingOperation(Generic[T, R]):
        poll_endpoint: ApiEndpoint[EmptyRequest, R],
        completed_statuses: list[str],
        failed_statuses: list[str],
-        *,
-        status_extractor: Callable[[R], Optional[str]],
-        progress_extractor: Callable[[R], Optional[float]] | None = None,
-        result_url_extractor: Callable[[R], Optional[str]] | None = None,
-        price_extractor: Callable[[R], Optional[float]] | None = None,
+        status_extractor: Callable[[R], str],
+        progress_extractor: Callable[[R], float] | None = None,
+        result_url_extractor: Callable[[R], str] | None = None,
        request: Optional[T] = None,
        api_base: str | None = None,
        auth_token: Optional[str] = None,
        comfy_api_key: Optional[str] = None,
-        auth_kwargs: Optional[dict[str, str]] = None,
+        auth_kwargs: Optional[Dict[str, str]] = None,
        poll_interval: float = 5.0,
        max_poll_attempts: int = 120,  # Default max polling attempts (10 minutes with 5s interval)
        max_retries: int = 3,  # Max retries per individual API call
@@ -817,12 +811,10 @@ class PollingOperation(Generic[T, R]):
        self.status_extractor = status_extractor or (lambda x: getattr(x, "status", None))
        self.progress_extractor = progress_extractor
        self.result_url_extractor = result_url_extractor
-        self.price_extractor = price_extractor
        self.node_id = node_id
        self.completed_statuses = completed_statuses
        self.failed_statuses = failed_statuses
        self.final_response: Optional[R] = None
-        self.extracted_price: Optional[float] = None

    async def execute(self, client: Optional[ApiClient] = None) -> R:
        owns_client = client is None
@@ -844,8 +836,6 @@ class PollingOperation(Generic[T, R]):
    def _display_text_on_node(self, text: str):
        if not self.node_id:
            return
-        if self.extracted_price is not None:
-            text = f"Price: ${self.extracted_price}\n{text}"
        PromptServer.instance.send_progress_text(text, self.node_id)

    def _display_time_progress_on_node(self, time_completed: int | float):
@@ -881,19 +871,18 @@ class PollingOperation(Generic[T, R]):
        status = TaskStatus.PENDING
        for poll_count in range(1, self.max_poll_attempts + 1):
            try:
-                logging.debug("[DEBUG] Polling attempt #%s", poll_count)
+                logging.debug(f"[DEBUG] Polling attempt #{poll_count}")

-                request_dict = None if self.request is None else self.request.model_dump(exclude_none=True)
+                request_dict = (
+                    None if self.request is None else self.request.model_dump(exclude_none=True)
+                )

                if poll_count == 1:
                    logging.debug(
-                        "[DEBUG] Poll Request: %s %s",
-                        self.poll_endpoint.method.value,
-                        self.poll_endpoint.path,
+                        f"[DEBUG] Poll Request: {self.poll_endpoint.method.value} {self.poll_endpoint.path}"
                    )
                    logging.debug(
-                        "[DEBUG] Poll Request Data: %s",
-                        json.dumps(request_dict, indent=2) if request_dict else "None",
+                        f"[DEBUG] Poll Request Data: {json.dumps(request_dict, indent=2) if request_dict else 'None'}"
                    )

                # Query task status
@@ -908,7 +897,7 @@ class PollingOperation(Generic[T, R]):

                # Check if task is complete
                status = self._check_task_status(response_obj)
-                logging.debug("[DEBUG] Task Status: %s", status)
+                logging.debug(f"[DEBUG] Task Status: {status}")

                # If progress extractor is provided, extract progress
                if self.progress_extractor:
@@ -916,18 +905,13 @@ class PollingOperation(Generic[T, R]):
                    if new_progress is not None:
                        progress.update_absolute(new_progress, total=PROGRESS_BAR_MAX)

-                if self.price_extractor:
-                    price = self.price_extractor(response_obj)
-                    if price is not None:
-                        self.extracted_price = price
-
                if status == TaskStatus.COMPLETED:
                    message = "Task completed successfully"
                    if self.result_url_extractor:
                        result_url = self.result_url_extractor(response_obj)
                        if result_url:
                            message = f"Result URL: {result_url}"
-                    logging.debug("[DEBUG] %s", message)
+                    logging.debug(f"[DEBUG] {message}")
                    self._display_text_on_node(message)
                    self.final_response = response_obj
                    if self.progress_extractor:
@@ -935,7 +919,7 @@ class PollingOperation(Generic[T, R]):
                    return self.final_response
                if status == TaskStatus.FAILED:
                    message = f"Task failed: {json.dumps(resp)}"
-                    logging.error("[DEBUG] %s", message)
+                    logging.error(f"[DEBUG] {message}")
                    raise Exception(message)
                logging.debug("[DEBUG] Task still pending, continuing to poll...")
                # Task pending – wait
@@ -949,12 +933,7 @@ class PollingOperation(Generic[T, R]):
                    raise Exception(
                        f"Polling aborted after {consecutive_errors} network errors: {str(e)}"
                    ) from e
-                logging.warning(
-                    "Network error (%s/%s): %s",
-                    consecutive_errors,
-                    max_consecutive_errors,
-                    str(e),
-                )
+                logging.warning("Network error (%s/%s): %s", consecutive_errors, max_consecutive_errors, str(e))
                await asyncio.sleep(self.poll_interval)
            except Exception as e:
                # For other errors, increment count and potentially abort
@@ -964,13 +943,10 @@ class PollingOperation(Generic[T, R]):
                        f"Polling aborted after {consecutive_errors} consecutive errors: {str(e)}"
                    ) from e

-                logging.error("[DEBUG] Polling error: %s", str(e))
+                logging.error(f"[DEBUG] Polling error: {str(e)}")
                logging.warning(
-                    "Error during polling (attempt %s/%s): %s. Will retry in %s seconds.",
-                    poll_count,
-                    self.max_poll_attempts,
-                    str(e),
-                    self.poll_interval,
+                    f"Error during polling (attempt {poll_count}/{self.max_poll_attempts}): {str(e)}. "
+                    f"Will retry in {self.poll_interval} seconds."
                )
                await asyncio.sleep(self.poll_interval)

--- a/comfy_api_nodes/apis/gemini_api.py
+++ b/comfy_api_nodes/apis/gemini_api.py
@@ -1,22 +1,19 @@
-from typing import Optional
+from __future__ import annotations
+
+from typing import List, Optional

 from comfy_api_nodes.apis import GeminiGenerationConfig, GeminiContent, GeminiSafetySetting, GeminiSystemInstructionContent, GeminiTool, GeminiVideoMetadata
 from pydantic import BaseModel


-class GeminiImageConfig(BaseModel):
-    aspectRatio: Optional[str] = None
-
-
 class GeminiImageGenerationConfig(GeminiGenerationConfig):
-    responseModalities: Optional[list[str]] = None
-    imageConfig: Optional[GeminiImageConfig] = None
+    responseModalities: Optional[List[str]] = None


 class GeminiImageGenerateContentRequest(BaseModel):
-    contents: list[GeminiContent]
+    contents: List[GeminiContent]
    generationConfig: Optional[GeminiImageGenerationConfig] = None
-    safetySettings: Optional[list[GeminiSafetySetting]] = None
+    safetySettings: Optional[List[GeminiSafetySetting]] = None
    systemInstruction: Optional[GeminiSystemInstructionContent] = None
-    tools: Optional[list[GeminiTool]] = None
+    tools: Optional[List[GeminiTool]] = None
    videoMetadata: Optional[GeminiVideoMetadata] = None
--- a/comfy_api_nodes/apis/pika_defs.py
+++ b/comfy_api_nodes/apis/pika_defs.py
@@ -1,100 +0,0 @@
-from typing import Optional
-from enum import Enum
-from pydantic import BaseModel, Field
-
-
-class Pikaffect(str, Enum):
-    Cake_ify = "Cake-ify"
-    Crumble = "Crumble"
-    Crush = "Crush"
-    Decapitate = "Decapitate"
-    Deflate = "Deflate"
-    Dissolve = "Dissolve"
-    Explode = "Explode"
-    Eye_pop = "Eye-pop"
-    Inflate = "Inflate"
-    Levitate = "Levitate"
-    Melt = "Melt"
-    Peel = "Peel"
-    Poke = "Poke"
-    Squish = "Squish"
-    Ta_da = "Ta-da"
-    Tear = "Tear"
-
-
-class PikaBodyGenerate22C2vGenerate22PikascenesPost(BaseModel):
-    aspectRatio: Optional[float] = Field(None, description='Aspect ratio (width / height)')
-    duration: Optional[int] = Field(5)
-    ingredientsMode: str = Field(...)
-    negativePrompt: Optional[str] = Field(None)
-    promptText: Optional[str] = Field(None)
-    resolution: Optional[str] = Field('1080p')
-    seed: Optional[int] = Field(None)
-
-
-class PikaGenerateResponse(BaseModel):
-    video_id: str = Field(...)
-
-
-class PikaBodyGenerate22I2vGenerate22I2vPost(BaseModel):
-    duration: Optional[int] = 5
-    negativePrompt: Optional[str] = Field(None)
-    promptText: Optional[str] = Field(None)
-    resolution: Optional[str] = '1080p'
-    seed: Optional[int] = Field(None)
-
-
-class PikaBodyGenerate22KeyframeGenerate22PikaframesPost(BaseModel):
-    duration: Optional[int] = Field(None, ge=5, le=10)
-    negativePrompt: Optional[str] = Field(None)
-    promptText: str = Field(...)
-    resolution: Optional[str] = '1080p'
-    seed: Optional[int] = Field(None)
-
-
-class PikaBodyGenerate22T2vGenerate22T2vPost(BaseModel):
-    aspectRatio: Optional[float] = Field(
-        1.7777777777777777,
-        description='Aspect ratio (width / height)',
-        ge=0.4,
-        le=2.5,
-    )
-    duration: Optional[int] = 5
-    negativePrompt: Optional[str] = Field(None)
-    promptText: str = Field(...)
-    resolution: Optional[str] = '1080p'
-    seed: Optional[int] = Field(None)
-
-
-class PikaBodyGeneratePikadditionsGeneratePikadditionsPost(BaseModel):
-    negativePrompt: Optional[str] = Field(None)
-    promptText: Optional[str] = Field(None)
-    seed: Optional[int] = Field(None)
-
-
-class PikaBodyGeneratePikaffectsGeneratePikaffectsPost(BaseModel):
-    negativePrompt: Optional[str] = Field(None)
-    pikaffect: Optional[str] = None
-    promptText: Optional[str] = Field(None)
-    seed: Optional[int] = Field(None)
-
-
-class PikaBodyGeneratePikaswapsGeneratePikaswapsPost(BaseModel):
-    negativePrompt: Optional[str] = Field(None)
-    promptText: Optional[str] = Field(None)
-    seed: Optional[int] = Field(None)
-    modifyRegionRoi: Optional[str] = Field(None)
-
-
-class PikaStatusEnum(str, Enum):
-    queued = "queued"
-    started = "started"
-    finished = "finished"
-    failed = "failed"
-
-
-class PikaVideoResponse(BaseModel):
-    id: str = Field(...)
-    progress: Optional[int] = Field(None)
-    status: PikaStatusEnum
-    url: Optional[str] = Field(None)
--- a/comfy_api_nodes/apis/request_logger.py
+++ b/comfy_api_nodes/apis/request_logger.py
@@ -4,99 +4,62 @@ import os
 import datetime
 import json
 import logging
-import re
-import hashlib
-from typing import Any
-
 import folder_paths

 # Get the logger instance
 logger = logging.getLogger(__name__)

-
 def get_log_directory():
-    """Ensures the API log directory exists within ComfyUI's temp directory and returns its path."""
+    """
+    Ensures the API log directory exists within ComfyUI's temp directory
+    and returns its path.
+    """
    base_temp_dir = folder_paths.get_temp_directory()
    log_dir = os.path.join(base_temp_dir, "api_logs")
    try:
        os.makedirs(log_dir, exist_ok=True)
    except Exception as e:
-        logger.error("Error creating API log directory %s: %s", log_dir, str(e))
+        logger.error(f"Error creating API log directory {log_dir}: {e}")
        # Fallback to base temp directory if sub-directory creation fails
        return base_temp_dir
    return log_dir

-
-def _sanitize_filename_component(name: str) -> str:
-    if not name:
-        return "log"
-    sanitized = re.sub(r"[^A-Za-z0-9._-]+", "_", name)  # Replace disallowed characters with underscore
-    sanitized = sanitized.strip(" ._")  # Windows: trailing dots or spaces are not allowed
-    if not sanitized:
-        sanitized = "log"
-    return sanitized
-
-
-def _short_hash(*parts: str, length: int = 10) -> str:
-    return hashlib.sha1(("|".join(parts)).encode("utf-8")).hexdigest()[:length]
-
-
-def _build_log_filepath(log_dir: str, operation_id: str, request_url: str) -> str:
-    """Build log filepath. We keep it well under common path length limits aiming for <= 240 characters total."""
-    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S_%f")
-    slug = _sanitize_filename_component(operation_id)  # Best-effort human-readable slug from operation_id
-    h = _short_hash(operation_id or "", request_url or "")  # Short hash ties log to the full operation and URL
-
-    # Compute how much room we have for the slug given the directory length
-    # Keep total path length reasonably below ~260 on Windows.
-    max_total_path = 240
-    prefix = f"{timestamp}_"
-    suffix = f"_{h}.log"
-    if not slug:
-        slug = "op"
-    max_filename_len = max(60, max_total_path - len(log_dir) - 1)
-    max_slug_len = max(8, max_filename_len - len(prefix) - len(suffix))
-    if len(slug) > max_slug_len:
-        slug = slug[:max_slug_len].rstrip(" ._-")
-    return os.path.join(log_dir, f"{prefix}{slug}{suffix}")
-
-
-def _format_data_for_logging(data: Any) -> str:
+def _format_data_for_logging(data):
    """Helper to format data (dict, str, bytes) for logging."""
    if isinstance(data, bytes):
        try:
-            return data.decode("utf-8")  # Try to decode as text
+            return data.decode('utf-8')  # Try to decode as text
        except UnicodeDecodeError:
            return f"[Binary data of length {len(data)} bytes]"
    elif isinstance(data, (dict, list)):
        try:
            return json.dumps(data, indent=2, ensure_ascii=False)
        except TypeError:
-            return str(data)  # Fallback for non-serializable objects
+            return str(data) # Fallback for non-serializable objects
    return str(data)

-
 def log_request_response(
    operation_id: str,
    request_method: str,
    request_url: str,
    request_headers: dict | None = None,
    request_params: dict | None = None,
-    request_data: Any = None,
+    request_data: any = None,
    response_status_code: int | None = None,
    response_headers: dict | None = None,
-    response_content: Any = None,
-    error_message: str | None = None,
+    response_content: any = None,
+    error_message: str | None = None
 ):
    """
    Logs API request and response details to a file in the temp/api_logs directory.
-    Filenames are sanitized and length-limited for cross-platform safety.
-    If we still fail to write, we fall back to appending into api.log.
    """
    log_dir = get_log_directory()
-    filepath = _build_log_filepath(log_dir, operation_id, request_url)
+    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S_%f")
+    filename = f"{timestamp}_{operation_id.replace('/', '_').replace(':', '_')}.log"
+    filepath = os.path.join(log_dir, filename)
+
+    log_content = []

-    log_content: list[str] = []
    log_content.append(f"Timestamp: {datetime.datetime.now().isoformat()}")
    log_content.append(f"Operation ID: {operation_id}")
    log_content.append("-" * 30 + " REQUEST " + "-" * 30)
@@ -106,7 +69,7 @@ def log_request_response(
        log_content.append(f"Headers:\n{_format_data_for_logging(request_headers)}")
    if request_params:
        log_content.append(f"Params:\n{_format_data_for_logging(request_params)}")
-    if request_data is not None:
+    if request_data:
        log_content.append(f"Data/Body:\n{_format_data_for_logging(request_data)}")

    log_content.append("\n" + "-" * 30 + " RESPONSE " + "-" * 30)
@@ -114,7 +77,7 @@ def log_request_response(
        log_content.append(f"Status Code: {response_status_code}")
    if response_headers:
        log_content.append(f"Headers:\n{_format_data_for_logging(response_headers)}")
-    if response_content is not None:
+    if response_content:
        log_content.append(f"Content:\n{_format_data_for_logging(response_content)}")
    if error_message:
        log_content.append(f"Error:\n{error_message}")
@@ -122,10 +85,9 @@ def log_request_response(
    try:
        with open(filepath, "w", encoding="utf-8") as f:
            f.write("\n".join(log_content))
-        logger.debug("API log saved to: %s", filepath)
+        logger.debug(f"API log saved to: {filepath}")
    except Exception as e:
-        logger.error("Error writing API log to %s: %s", filepath, str(e))
-
+        logger.error(f"Error writing API log to {filepath}: {e}")

 if __name__ == '__main__':
    # Example usage (for testing the logger directly)
--- a/comfy_api_nodes/apis/rodin_api.py
+++ b/comfy_api_nodes/apis/rodin_api.py
@@ -9,9 +9,8 @@ class Rodin3DGenerateRequest(BaseModel):
    seed: int = Field(..., description="seed_")
    tier: str = Field(..., description="Tier of generation.")
    material: str = Field(..., description="The material type.")
-    quality_override: int = Field(..., description="The poly count of the mesh.")
+    quality: str = Field(..., description="The generation quality of the mesh.")
    mesh_mode: str = Field(..., description="It controls the type of faces of generated models.")
-    TAPose: Optional[bool] = Field(None, description="")

 class GenerateJobsData(BaseModel):
    uuids: List[str] = Field(..., description="str LIST")
@@ -52,3 +51,7 @@ class RodinResourceItem(BaseModel):

 class Rodin3DDownloadResponse(BaseModel):
    list: List[RodinResourceItem] = Field(..., description="Source List")
+
+
+
+
--- a/comfy_api_nodes/apis/stability_api.py
+++ b/comfy_api_nodes/apis/stability_api.py
@@ -125,25 +125,3 @@ class StabilityResultsGetResponse(BaseModel):

 class StabilityAsyncResponse(BaseModel):
    id: Optional[str] = Field(None)
-
-
-class StabilityTextToAudioRequest(BaseModel):
-    model: str = Field(...)
-    prompt: str = Field(...)
-    duration: int = Field(190, ge=1, le=190)
-    seed: int = Field(0, ge=0, le=4294967294)
-    steps: int = Field(8, ge=4, le=8)
-    output_format: str = Field("wav")
-
-
-class StabilityAudioToAudioRequest(StabilityTextToAudioRequest):
-    strength: float = Field(0.01, ge=0.01, le=1.0)
-
-
-class StabilityAudioInpaintRequest(StabilityTextToAudioRequest):
-    mask_start: int = Field(30, ge=0, le=190)
-    mask_end: int = Field(190, ge=0, le=190)
-
-
-class StabilityAudioResponse(BaseModel):
-    audio: Optional[str] = Field(None)
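
Note on the `ApiClient` upload error path in the diff above: it guards access to `e.headers` before copying the headers into the log entry. The sketch below is a minimal illustration of how such a guard behaves, not the project's code. `ErrorWithHeaders`, `safe_headers` and `two_arg_getattr_raises` are hypothetical names invented for this example. It relies only on the documented behaviour of the built-in `getattr`: with two arguments it raises `AttributeError` when the attribute is missing, and with a third argument it returns that default instead.

```python
class ErrorWithHeaders(Exception):
    """Hypothetical stand-in for an HTTP client error that carries headers."""

    def __init__(self, message, headers):
        super().__init__(message)
        self.headers = headers


def safe_headers(exc):
    """Return the exception's headers as a plain dict, or None if it has none."""
    # The third argument is the default, so a missing attribute yields None
    # instead of raising AttributeError.
    headers = getattr(exc, "headers", None)
    return dict(headers) if headers else None


def two_arg_getattr_raises(exc):
    """Report whether getattr without a default raises for this exception."""
    try:
        getattr(exc, "headers")
    except AttributeError:
        return True
    return False
```

With these definitions, `safe_headers(ValueError("boom"))` returns `None`, while the two-argument form raises for the same exception. That matters inside an `except` block, where a second exception would mask the original error.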
Author	SHA1	Message	Date
Jedrzej Kosinski	295b49c165	Doing some experimentation	2025-09-02 22:19:12 -07:00
Jedrzej Kosinski	a40c5ae341	Support predict_ratio changing with timesteps	2025-09-02 15:23:28 -07:00
Jedrzej Kosinski	953b906f63	Implement Sortblock for single cond usage	2025-09-02 00:45:59 -07:00
Jedrzej Kosinski	d4a8752c8c	some exploration of sortblock as more things from paper/source code need to be added	2025-09-01 09:39:40 -07:00
Jedrzej Kosinski	cf26d3d58e	More progress on Sortblock	2025-08-31 20:26:49 -07:00
Jedrzej Kosinski	f655fcc5ce	Progress on scaffolding for an EasyCache style implementation of Sortblock	2025-08-31 00:59:01 -07:00
Jedrzej Kosinski	e2491f44e8	Merge branch 'attention-select' into sortblock	2025-08-30 20:04:48 -07:00
Jedrzej Kosinski	66c4eb006b	Remove AttentionOverrideTest node, that's something to cook up for later	2025-08-30 15:19:36 -07:00
Jedrzej Kosinski	dd0a5093f6	Satisfy ruff	2025-08-30 14:58:30 -07:00
Jedrzej Kosinski	c092b8a4ac	Remove _register_core_attention_functions, as we wouldn't want someone to call that, just in case	2025-08-30 14:49:04 -07:00
Jedrzej Kosinski	eaa9433ff8	Remove attention logging code	2025-08-30 14:45:12 -07:00
Jedrzej Kosinski	720d0a88e6	Disable attention logs for now	2025-08-30 01:11:34 -07:00
Jedrzej Kosinski	d9bb4530b0	Merge branch 'master' into attention-select	2025-08-29 23:35:38 -07:00
Jedrzej Kosinski	cb959f9669	Add optimized to get_attention_function	2025-08-29 21:48:36 -07:00
Jedrzej Kosinski	d553073a1e	Fixed WAN 2.1 VACE transformer_options passthrough	2025-08-29 13:20:43 -07:00
Jedrzej Kosinski	af288b9946	Fixed Wan2.1 Fun Camera transformer_options passthrough	2025-08-29 13:06:37 -07:00
Jedrzej Kosinski	1ae6fe14a7	Fix WanI2VCrossAttention so that it expects to receive transformer_options	2025-08-29 02:31:16 -07:00
Jedrzej Kosinski	2d13bf1c7a	Made SVD work with optimized_attention_override	2025-08-28 22:45:45 -07:00
Jedrzej Kosinski	8be3edb606	Made Chroma work with optimized_attention_override	2025-08-28 22:45:31 -07:00
Jedrzej Kosinski	d644aba6bc	Made Lumina work with optimized_attention_override	2025-08-28 22:00:44 -07:00
Jedrzej Kosinski	17090c56be	Made AuraFlow work with optimized_attention_override	2025-08-28 21:46:56 -07:00
Jedrzej Kosinski	034d6c12e6	Made StableCascade work with optimized_attention_override	2025-08-28 21:42:08 -07:00
Jedrzej Kosinski	09c84b31a2	Made Omnigen 2 work with optimized_attention_override	2025-08-28 21:30:18 -07:00
Jedrzej Kosinski	8fe2dea297	Made CosmosVideo work with optimized_attention_override	2025-08-28 21:23:03 -07:00
Jedrzej Kosinski	4a44ed4a76	Make CosmosPredict2 work with optimized_attention_override	2025-08-28 21:18:34 -07:00
Jedrzej Kosinski	8b9b4bbb62	Made Hunyuan3D work with optimized_attention_override	2025-08-28 21:06:44 -07:00
Jedrzej Kosinski	27ebd312ae	Made optimized_attention_override work with ACE Step	2025-08-28 21:03:28 -07:00
Jedrzej Kosinski	9461f30387	Made StableAudio work with optimized_attention_override	2025-08-28 20:56:56 -07:00
Jedrzej Kosinski	2cda45d1b4	Made LTX work with optimized_attention_override	2025-08-28 20:42:22 -07:00
Jedrzej Kosinski	61b5c5fc75	Made Mochi work with optimized_attention_override	2025-08-28 20:34:06 -07:00
Jedrzej Kosinski	ef894cdf08	Made HunyuanVideo work with optimized_attention_override	2025-08-28 20:26:53 -07:00
Jedrzej Kosinski	0ac5c6344f	Made SD3 work with optimized_attention_override	2025-08-28 20:21:14 -07:00
Jedrzej Kosinski	1ddfb5bb14	Made wan patches_replace work with optimized_attention_override	2025-08-28 20:13:51 -07:00
Jedrzej Kosinski	4cafd58f71	Made hidream work with optimized_attention_override	2025-08-28 20:10:50 -07:00
Jedrzej Kosinski	f752715aac	Make Qwen work with optimized_attention_override	2025-08-28 19:52:52 -07:00
Jedrzej Kosinski	48ed71caf8	Add logs to verify optimized_attention_override is passed all the way into attention function	2025-08-28 19:43:39 -07:00
Jedrzej Kosinski	a7d70e42a0	Make flux work with optimized_attention_override	2025-08-28 19:33:02 -07:00
Jedrzej Kosinski	1f499f0794	Turn off attention logging for now, make AttentionOverrideTestNode have a dropdown with available attention (this is a test node only)	2025-08-28 18:54:22 -07:00
Jedrzej Kosinski	51a30c2ad7	Make sure wrap_attn doesn't make itself recurse infinitely, attempt to load SageAttention and FlashAttention if not enabled so that they can be marked as available or not, create registry for available attention	2025-08-28 18:53:20 -07:00
Jedrzej Kosinski	669b9ef8e6	Added **kwargs to all attention functions so transformer_options could potentially be passed through	2025-08-28 13:14:41 -07:00
Jedrzej Kosinski	dd21b4aa51	Made WAN attention receive transformer_options, test node added to wan to test out attention override later	2025-08-27 17:56:21 -07:00
Jedrzej Kosinski	29b7990dc2	Fix memory usage issue with inspect	2025-08-27 17:55:35 -07:00
Jedrzej Kosinski	68b00e9c60	Created logging code for this branch so that it can be used to track down all the code paths where transformer_options would need to be added	2025-08-27 17:13:33 -07:00
Jedrzej Kosinski	b58db6934c	Looking into a @wrap_attn decorator to look for 'optimized_attention_override' entry in transformer_options	2025-08-27 14:18:18 -07:00