Change node id to reflect node name

Make sure models_memory_reserve is considered with inference_memory as well in max func calls
Renamed to Reserve Additional Memory
2026-02-12 03:00:03 +00:00 · 2025-08-18 15:39:16 -07:00 · 2025-08-18 15:34:53 -07:00 · 2025-08-18 15:04:49 -07:00 · 2025-08-18 14:51:11 -07:00 · 2025-08-18 14:45:21 -07:00
284 changed files with 14579 additions and 34624 deletions
--- a/.ci/windows_amd_base_files/README_VERY_IMPORTANT.txt
+++ b/.ci/windows_amd_base_files/README_VERY_IMPORTANT.txt
@@ -1,27 +0,0 @@
-As of the time of writing this you need this preview driver for best results:
-https://www.amd.com/en/resources/support-articles/release-notes/RN-AMDGPU-WINDOWS-PYTORCH-PREVIEW.html
-
-HOW TO RUN:
-
-If you have an AMD GPU:
-
-run_amd_gpu.bat
-
-If you have memory issues you can try disabling the smart memory management by running comfyui with:
-
-run_amd_gpu_disable_smart_memory.bat
-
-IF YOU GET A RED ERROR IN THE UI MAKE SURE YOU HAVE A MODEL/CHECKPOINT IN: ComfyUI\models\checkpoints
-
-You can download the stable diffusion XL one from: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0_0.9vae.safetensors
-
-
-RECOMMENDED WAY TO UPDATE:
-To update the ComfyUI code: update\update_comfyui.bat
-
-
-TO SHARE MODELS BETWEEN COMFYUI AND ANOTHER UI:
-In the ComfyUI directory you will find a file: extra_model_paths.yaml.example
-Rename this file to: extra_model_paths.yaml and edit it with your favorite text editor.
-
-
--- a/.ci/windows_nvidia_base_files/README_VERY_IMPORTANT.txt
+++ b/.ci/windows_nvidia_base_files/README_VERY_IMPORTANT.txt
--- a/.ci/windows_nvidia_base_files/run_cpu.bat
+++ b/.ci/windows_nvidia_base_files/run_cpu.bat
--- a/.ci/windows_amd_base_files/run_amd_gpu.bat
+++ b/.ci/windows_amd_base_files/run_amd_gpu.bat
--- a/.ci/windows_base_files/run_nvidia_gpu_fast_fp16_accumulation.bat
+++ b/.ci/windows_base_files/run_nvidia_gpu_fast_fp16_accumulation.bat
@@ -1,2 +1,2 @@
-.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --disable-smart-memory
+.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --fast fp16_accumulation
 pause
--- a/.ci/windows_nvidia_base_files/advanced/run_nvidia_gpu_disable_api_nodes.bat
+++ b/.ci/windows_nvidia_base_files/advanced/run_nvidia_gpu_disable_api_nodes.bat
@@ -1,3 +0,0 @@
-..\python_embeded\python.exe -s ..\ComfyUI\main.py --windows-standalone-build --disable-api-nodes
-echo If you see this and ComfyUI did not start try updating your Nvidia Drivers to the latest.
-pause
--- a/.ci/windows_nvidia_base_files/run_nvidia_gpu.bat
+++ b/.ci/windows_nvidia_base_files/run_nvidia_gpu.bat
@@ -1,3 +0,0 @@
-.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
-echo If you see this and ComfyUI did not start try updating your Nvidia Drivers to the latest.
-pause
--- a/.ci/windows_nvidia_base_files/run_nvidia_gpu_fast_fp16_accumulation.bat
+++ b/.ci/windows_nvidia_base_files/run_nvidia_gpu_fast_fp16_accumulation.bat
@@ -1,3 +0,0 @@
-.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --fast fp16_accumulation
-echo If you see this and ComfyUI did not start try updating your Nvidia Drivers to the latest.
-pause
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -8,15 +8,13 @@ body:
        Before submitting a **Bug Report**, please ensure the following:

        - **1:** You are running the latest version of ComfyUI.
-        - **2:** You have your ComfyUI logs and relevant workflow on hand and will post them in this bug report.
+        - **2:** You have looked at the existing bug reports and made sure this isn't already reported.
        - **3:** You confirmed that the bug is not caused by a custom node. You can disable all custom nodes by passing
-        `--disable-all-custom-nodes` command line argument. If you have custom nodes, try updating them to the latest version.
+        `--disable-all-custom-nodes` command line argument.
        - **4:** This is an actual bug in ComfyUI, not just a support question. A bug is when you can specify exact
        steps to replicate what went wrong and others will be able to repeat your steps and see the same issue happen.

-        ## Very Important
-
-        Please make sure that you post ALL your ComfyUI logs in the bug report. A bug report without logs will likely be ignored.
+        If unsure, ask on the [ComfyUI Matrix Space](https://app.element.io/#/room/%23comfyui_space%3Amatrix.org) or the [Comfy Org Discord](https://discord.gg/comfyorg) first.
  - type: checkboxes
    id: custom-nodes-test
    attributes:
--- a/.github/PULL_REQUEST_TEMPLATE/api-node.md
+++ b/.github/PULL_REQUEST_TEMPLATE/api-node.md
@@ -1,21 +0,0 @@
-<!-- API_NODE_PR_CHECKLIST: do not remove -->
-
-## API Node PR Checklist
-
-### Scope
-- [ ] **Is API Node Change**
-
-### Pricing & Billing
-- [ ] **Need pricing update**
-- [ ] **No pricing update**
-
-If **Need pricing update**:
-- [ ] Metronome rate cards updated
-- [ ] Auto‑billing tests updated and passing
-
-### QA
-- [ ] **QA done**
-- [ ] **QA not required**
-
-### Comms
-- [ ] Informed **Kosinkadink**
--- a/.github/workflows/api-node-template.yml
+++ b/.github/workflows/api-node-template.yml
@@ -1,58 +0,0 @@
-name: Append API Node PR template
-
-on:
-  pull_request_target:
-    types: [opened, reopened, synchronize, ready_for_review]
-    paths:
-      - 'comfy_api_nodes/**'   # only run if these files changed
-
-permissions:
-  contents: read
-  pull-requests: write
-
-jobs:
-  inject:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Ensure template exists and append to PR body
-        uses: actions/github-script@v7
-        with:
-          script: |
-            const { owner, repo } = context.repo;
-            const number = context.payload.pull_request.number;
-            const templatePath = '.github/PULL_REQUEST_TEMPLATE/api-node.md';
-            const marker = '<!-- API_NODE_PR_CHECKLIST: do not remove -->';
-
-            const { data: pr } = await github.rest.pulls.get({ owner, repo, pull_number: number });
-
-            let templateText;
-            try {
-              const res = await github.rest.repos.getContent({
-                owner,
-                repo,
-                path: templatePath,
-                ref: pr.base.ref
-              });
-              const buf = Buffer.from(res.data.content, res.data.encoding || 'base64');
-              templateText = buf.toString('utf8');
-            } catch (e) {
-              core.setFailed(`Required PR template not found at "${templatePath}" on ${pr.base.ref}. Please add it to the repo.`);
-              return;
-            }
-
-            // Enforce the presence of the marker inside the template (for idempotence)
-            if (!templateText.includes(marker)) {
-              core.setFailed(`Template at "${templatePath}" does not contain the required marker:\n${marker}\nAdd it so we can detect duplicates safely.`);
-              return;
-            }
-
-            // If the PR already contains the marker, do not append again.
-            const body = pr.body || '';
-            if (body.includes(marker)) {
-              core.info('Template already present in PR body; nothing to inject.');
-              return;
-            }
-
-            const newBody = (body ? body + '\n\n' : '') + templateText + '\n';
-            await github.rest.pulls.update({ owner, repo, pull_number: number, body: newBody });
-            core.notice('API Node template appended to PR description.');
--- a/.github/workflows/release-stable-all.yml
+++ b/.github/workflows/release-stable-all.yml
@@ -1,78 +0,0 @@
-name: "Release Stable All Portable Versions"
-
-on:
-  workflow_dispatch:
-    inputs:
-      git_tag:
-        description: 'Git tag'
-        required: true
-        type: string
-
-jobs:
-  release_nvidia_default:
-    permissions:
-      contents: "write"
-      packages: "write"
-      pull-requests: "read"
-    name: "Release NVIDIA Default (cu130)"
-    uses: ./.github/workflows/stable-release.yml
-    with:
-      git_tag: ${{ inputs.git_tag }}
-      cache_tag: "cu130"
-      python_minor: "13"
-      python_patch: "9"
-      rel_name: "nvidia"
-      rel_extra_name: ""
-      test_release: true
-    secrets: inherit
-
-  release_nvidia_cu128:
-    permissions:
-      contents: "write"
-      packages: "write"
-      pull-requests: "read"
-    name: "Release NVIDIA cu128"
-    uses: ./.github/workflows/stable-release.yml
-    with:
-      git_tag: ${{ inputs.git_tag }}
-      cache_tag: "cu128"
-      python_minor: "12"
-      python_patch: "10"
-      rel_name: "nvidia"
-      rel_extra_name: "_cu128"
-      test_release: true
-    secrets: inherit
-
-  release_nvidia_cu126:
-    permissions:
-      contents: "write"
-      packages: "write"
-      pull-requests: "read"
-    name: "Release NVIDIA cu126"
-    uses: ./.github/workflows/stable-release.yml
-    with:
-      git_tag: ${{ inputs.git_tag }}
-      cache_tag: "cu126"
-      python_minor: "12"
-      python_patch: "10"
-      rel_name: "nvidia"
-      rel_extra_name: "_cu126"
-      test_release: true
-    secrets: inherit
-
-  release_amd_rocm:
-    permissions:
-      contents: "write"
-      packages: "write"
-      pull-requests: "read"
-    name: "Release AMD ROCm 6.4.4"
-    uses: ./.github/workflows/stable-release.yml
-    with:
-      git_tag: ${{ inputs.git_tag }}
-      cache_tag: "rocm644"
-      python_minor: "12"
-      python_patch: "10"
-      rel_name: "amd"
-      rel_extra_name: ""
-      test_release: false
-    secrets: inherit
--- a/.github/workflows/ruff.yml
+++ b/.github/workflows/ruff.yml
@@ -21,28 +21,3 @@ jobs:

    - name: Run Ruff
      run: ruff check .
-
-  pylint:
-    name: Run Pylint
-    runs-on: ubuntu-latest
-
-    steps:
-    - name: Checkout repository
-      uses: actions/checkout@v4
-
-    - name: Set up Python
-      uses: actions/setup-python@v4
-      with:
-        python-version: '3.12'
-
-    - name: Install requirements
-      run: |
-        python -m pip install --upgrade pip
-        pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-        pip install -r requirements.txt
-
-    - name: Install Pylint
-      run: pip install pylint
-
-    - name: Run Pylint
-      run: pylint comfy_api_nodes
--- a/.github/workflows/stable-release.yml
+++ b/.github/workflows/stable-release.yml
@@ -2,53 +2,17 @@
 name: "Release Stable Version"

 on:
-  workflow_call:
-    inputs:
-      git_tag:
-        description: 'Git tag'
-        required: true
-        type: string
-      cache_tag:
-        description: 'Cached dependencies tag'
-        required: true
-        type: string
-        default: "cu129"
-      python_minor:
-        description: 'Python minor version'
-        required: true
-        type: string
-        default: "13"
-      python_patch:
-        description: 'Python patch version'
-        required: true
-        type: string
-        default: "6"
-      rel_name:
-        description: 'Release name'
-        required: true
-        type: string
-        default: "nvidia"
-      rel_extra_name:
-        description: 'Release extra name'
-        required: false
-        type: string
-        default: ""
-      test_release:
-        description: 'Test Release'
-        required: true
-        type: boolean
-        default: true
  workflow_dispatch:
    inputs:
      git_tag:
        description: 'Git tag'
        required: true
        type: string
-      cache_tag:
-        description: 'Cached dependencies tag'
+      cu:
+        description: 'CUDA version'
        required: true
        type: string
-        default: "cu129"
+        default: "129"
      python_minor:
        description: 'Python minor version'
        required: true
@@ -59,21 +23,7 @@ on:
        required: true
        type: string
        default: "6"
-      rel_name:
-        description: 'Release name'
-        required: true
-        type: string
-        default: "nvidia"
-      rel_extra_name:
-        description: 'Release extra name'
-        required: false
-        type: string
-        default: ""
-      test_release:
-        description: 'Test Release'
-        required: true
-        type: boolean
-        default: true
+

 jobs:
  package_comfy_windows:
@@ -92,15 +42,15 @@ jobs:
        id: cache
        with:
          path: |
-            ${{ inputs.cache_tag }}_python_deps.tar
+            cu${{ inputs.cu }}_python_deps.tar
            update_comfyui_and_python_dependencies.bat
-          key: ${{ runner.os }}-build-${{ inputs.cache_tag }}-${{ inputs.python_minor }}
+          key: ${{ runner.os }}-build-cu${{ inputs.cu }}-${{ inputs.python_minor }}
      - shell: bash
        run: |
-          mv ${{ inputs.cache_tag }}_python_deps.tar ../
+          mv cu${{ inputs.cu }}_python_deps.tar ../
          mv update_comfyui_and_python_dependencies.bat ../
          cd ..
-          tar xf ${{ inputs.cache_tag }}_python_deps.tar
+          tar xf cu${{ inputs.cu }}_python_deps.tar
          pwd
          ls

@@ -115,19 +65,12 @@ jobs:
          echo 'import site' >> ./python3${{ inputs.python_minor }}._pth
          curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
          ./python.exe get-pip.py
-          ./python.exe -s -m pip install ../${{ inputs.cache_tag }}_python_deps/*
-
-          grep comfyui ../ComfyUI/requirements.txt > ./requirements_comfyui.txt
-          ./python.exe -s -m pip install -r requirements_comfyui.txt
-          rm requirements_comfyui.txt
-
+          ./python.exe -s -m pip install ../cu${{ inputs.cu }}_python_deps/*
          sed -i '1i../ComfyUI' ./python3${{ inputs.python_minor }}._pth

-          if test -f ./Lib/site-packages/torch/lib/dnnl.lib; then
-            rm ./Lib/site-packages/torch/lib/dnnl.lib #I don't think this is actually used and I need the space
-            rm ./Lib/site-packages/torch/lib/libprotoc.lib
-            rm ./Lib/site-packages/torch/lib/libprotobuf.lib
-          fi
+          rm ./Lib/site-packages/torch/lib/dnnl.lib #I don't think this is actually used and I need the space
+          rm ./Lib/site-packages/torch/lib/libprotoc.lib
+          rm ./Lib/site-packages/torch/lib/libprotobuf.lib

          cd ..

@@ -142,18 +85,14 @@ jobs:

          mkdir update
          cp -r ComfyUI/.ci/update_windows/* ./update/
-          cp -r ComfyUI/.ci/windows_${{ inputs.rel_name }}_base_files/* ./
+          cp -r ComfyUI/.ci/windows_base_files/* ./
          cp ../update_comfyui_and_python_dependencies.bat ./update/

          cd ..

          "C:\Program Files\7-Zip\7z.exe" a -t7z -m0=lzma2 -mx=9 -mfb=128 -md=768m -ms=on -mf=BCJ2 ComfyUI_windows_portable.7z ComfyUI_windows_portable
-          mv ComfyUI_windows_portable.7z ComfyUI/ComfyUI_windows_portable_${{ inputs.rel_name }}${{ inputs.rel_extra_name }}.7z
+          mv ComfyUI_windows_portable.7z ComfyUI/ComfyUI_windows_portable_nvidia.7z

-      - shell: bash
-        if: ${{ inputs.test_release }}
-        run: |
-          cd ..
          cd ComfyUI_windows_portable
          python_embeded/python.exe -s ComfyUI/main.py --quick-test-for-ci --cpu

@@ -162,9 +101,10 @@ jobs:
          ls

      - name: Upload binaries to release
-        uses: softprops/action-gh-release@v2
+        uses: svenstaro/upload-release-action@v2
        with:
-          files: ComfyUI_windows_portable_${{ inputs.rel_name }}${{ inputs.rel_extra_name }}.7z
-          tag_name: ${{ inputs.git_tag }}
+          repo_token: ${{ secrets.GITHUB_TOKEN }}
+          file: ComfyUI_windows_portable_nvidia.7z
+          tag: ${{ inputs.git_tag }}
+          overwrite: true
          draft: true
-          overwrite_files: true
--- a/.github/workflows/test-ci.yml
+++ b/.github/workflows/test-ci.yml
@@ -21,15 +21,14 @@ jobs:
      fail-fast: false
      matrix:
        # os: [macos, linux, windows]
-        # os: [macos, linux]
-        os: [linux]
-        python_version: ["3.10", "3.11", "3.12"]
+        os: [macos, linux]
+        python_version: ["3.9", "3.10", "3.11", "3.12"]
        cuda_version: ["12.1"]
        torch_version: ["stable"]
        include:
-          # - os: macos
-          #   runner_label: [self-hosted, macOS]
-          #   flags: "--use-pytorch-cross-attention"
+          - os: macos
+            runner_label: [self-hosted, macOS]
+            flags: "--use-pytorch-cross-attention"
          - os: linux
            runner_label: [self-hosted, Linux]
            flags: ""
@@ -74,15 +73,14 @@ jobs:
    strategy:
      fail-fast: false
      matrix:
-        # os: [macos, linux]
-        os: [linux]
+        os: [macos, linux]
        python_version: ["3.11"]
        cuda_version: ["12.1"]
        torch_version: ["nightly"]
        include:
-          # - os: macos
-          #   runner_label: [self-hosted, macOS]
-          #   flags: "--use-pytorch-cross-attention"
+          - os: macos
+            runner_label: [self-hosted, macOS]
+            flags: "--use-pytorch-cross-attention"
          - os: linux
            runner_label: [self-hosted, Linux]
            flags: ""
--- a/.github/workflows/test-execution.yml
+++ b/.github/workflows/test-execution.yml
@@ -1,30 +0,0 @@
-name: Execution Tests
-
-on:
-  push:
-    branches: [ main, master ]
-  pull_request:
-    branches: [ main, master ]
-
-jobs:
-  test:
-    strategy:
-      matrix:
-        os: [ubuntu-latest, windows-latest, macos-latest]
-    runs-on: ${{ matrix.os }}
-    continue-on-error: true
-    steps:
-    - uses: actions/checkout@v4
-    - name: Set up Python      
-      uses: actions/setup-python@v4
-      with:
-        python-version: '3.12'
-    - name: Install requirements
-      run: |
-        python -m pip install --upgrade pip
-        pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-        pip install -r requirements.txt
-        pip install -r tests-unit/requirements.txt
-    - name: Run Execution Tests
-      run: |
-        python -m pytest tests/execution -v --skip-timing-checks
--- a/.github/workflows/test-unit.yml
+++ b/.github/workflows/test-unit.yml
@@ -10,7 +10,7 @@ jobs:
  test:
    strategy:
      matrix:
-        os: [ubuntu-latest, windows-2022, macos-latest]
+        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    continue-on-error: true
    steps:
--- a/.github/workflows/windows_release_dependencies.yml
+++ b/.github/workflows/windows_release_dependencies.yml
@@ -17,7 +17,7 @@ on:
        description: 'cuda version'
        required: true
        type: string
-        default: "130"
+        default: "129"

      python_minor:
        description: 'python minor version'
@@ -29,7 +29,7 @@ on:
        description: 'python patch version'
        required: true
        type: string
-        default: "9"
+        default: "6"
 #  push:
 #    branches:
 #      - master
@@ -56,8 +56,7 @@ jobs:
            ..\python_embeded\python.exe -s -m pip install --upgrade torch torchvision torchaudio ${{ inputs.xformers }} --extra-index-url https://download.pytorch.org/whl/cu${{ inputs.cu }} -r ../ComfyUI/requirements.txt pygit2
            pause" > update_comfyui_and_python_dependencies.bat

-            grep -v comfyui requirements.txt > requirements_nocomfyui.txt
-            python -m pip wheel --no-cache-dir torch torchvision torchaudio ${{ inputs.xformers }} ${{ inputs.extra_dependencies }} --extra-index-url https://download.pytorch.org/whl/cu${{ inputs.cu }} -r requirements_nocomfyui.txt pygit2 -w ./temp_wheel_dir
+            python -m pip wheel --no-cache-dir torch torchvision torchaudio ${{ inputs.xformers }} ${{ inputs.extra_dependencies }} --extra-index-url https://download.pytorch.org/whl/cu${{ inputs.cu }} -r requirements.txt pygit2 -w ./temp_wheel_dir
            python -m pip install --no-cache-dir ./temp_wheel_dir/*
            echo installed basic
            ls -lah temp_wheel_dir
--- a/.github/workflows/windows_release_dependencies_manual.yml
+++ b/.github/workflows/windows_release_dependencies_manual.yml
@@ -1,64 +0,0 @@
-name: "Windows Release dependencies Manual"
-
-on:
-  workflow_dispatch:
-    inputs:
-      torch_dependencies:
-        description: 'torch dependencies'
-        required: false
-        type: string
-        default: "torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu128"
-      cache_tag:
-        description: 'Cached dependencies tag'
-        required: true
-        type: string
-        default: "cu128"
-
-      python_minor:
-        description: 'python minor version'
-        required: true
-        type: string
-        default: "12"
-
-      python_patch:
-        description: 'python patch version'
-        required: true
-        type: string
-        default: "10"
-
-jobs:
-  build_dependencies:
-    runs-on: windows-latest
-    steps:
-        - uses: actions/checkout@v4
-        - uses: actions/setup-python@v5
-          with:
-            python-version: 3.${{ inputs.python_minor }}.${{ inputs.python_patch }}
-
-        - shell: bash
-          run: |
-            echo "@echo off
-            call update_comfyui.bat nopause
-            echo -
-            echo This will try to update pytorch and all python dependencies.
-            echo -
-            echo If you just want to update normally, close this and run update_comfyui.bat instead.
-            echo -
-            pause
-            ..\python_embeded\python.exe -s -m pip install --upgrade ${{ inputs.torch_dependencies }} -r ../ComfyUI/requirements.txt pygit2
-            pause" > update_comfyui_and_python_dependencies.bat
-
-            grep -v comfyui requirements.txt > requirements_nocomfyui.txt
-            python -m pip wheel --no-cache-dir ${{ inputs.torch_dependencies }} -r requirements_nocomfyui.txt pygit2 -w ./temp_wheel_dir
-            python -m pip install --no-cache-dir ./temp_wheel_dir/*
-            echo installed basic
-            ls -lah temp_wheel_dir
-            mv temp_wheel_dir ${{ inputs.cache_tag }}_python_deps
-            tar cf ${{ inputs.cache_tag }}_python_deps.tar ${{ inputs.cache_tag }}_python_deps
-
-        - uses: actions/cache/save@v4
-          with:
-            path: |
-              ${{ inputs.cache_tag }}_python_deps.tar
-              update_comfyui_and_python_dependencies.bat
-            key: ${{ runner.os }}-build-${{ inputs.cache_tag }}-${{ inputs.python_minor }}
--- a/.github/workflows/windows_release_nightly_pytorch.yml
+++ b/.github/workflows/windows_release_nightly_pytorch.yml
@@ -68,7 +68,7 @@ jobs:

            mkdir update
            cp -r ComfyUI/.ci/update_windows/* ./update/
-            cp -r ComfyUI/.ci/windows_nvidia_base_files/* ./
+            cp -r ComfyUI/.ci/windows_base_files/* ./
            cp -r ComfyUI/.ci/windows_nightly_base_files/* ./

            echo "call update_comfyui.bat nopause
--- a/.github/workflows/windows_release_package.yml
+++ b/.github/workflows/windows_release_package.yml
@@ -81,7 +81,7 @@ jobs:

            mkdir update
            cp -r ComfyUI/.ci/update_windows/* ./update/
-            cp -r ComfyUI/.ci/windows_nvidia_base_files/* ./
+            cp -r ComfyUI/.ci/windows_base_files/* ./
            cp ../update_comfyui_and_python_dependencies.bat ./update/

            cd ..
--- a/24
+++ b/24
@@ -1,3 +1,25 @@
 # Admins
 * @comfyanonymous
-* @kosinkadink
+
+# Note: Github teams syntax cannot be used here as the repo is not owned by Comfy-Org.
+# Inlined the team members for now.
+
+# Maintainers
+*.md @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/tests/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/tests-unit/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/notebooks/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/script_examples/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/.github/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/requirements.txt @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+/pyproject.toml @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+
+# Python web server
+/api_server/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne @guill
+/app/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne @guill
+/utils/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne @guill
+
+# Node developers
+/comfy_extras/ @yoland68 @robinjhuang @pythongosssss @ltdrdata @Kosinkadink @webfiltered @christian-byrne @guill
+/comfy/comfy_types/ @yoland68 @robinjhuang @pythongosssss @ltdrdata @Kosinkadink @webfiltered @christian-byrne @guill
+/comfy_api_nodes/ @yoland68 @robinjhuang @pythongosssss @ltdrdata @Kosinkadink @webfiltered @christian-byrne @guill
--- a/QUANTIZATION.md
+++ b/QUANTIZATION.md
@@ -1,168 +0,0 @@
-# The Comfy guide to Quantization
-
-
-## How does quantization work?
-
-Quantization aims to map a high-precision value x_f to a lower-precision format with minimal loss in accuracy. These smaller formats reduce the model's memory footprint and increase throughput by using specialized hardware.
-
-When simply converting a value from FP16 to FP8 using the round-nearest method we might hit two issues:
-- The dynamic range of FP16 (-65,504, 65,504) far exceeds FP8 formats like E4M3 (-448, 448) or E5M2 (-57,344, 57,344), potentially resulting in clipped values
-- The original values are concentrated in a small range (e.g. -1,1) leaving many FP8-bits "unused"
-
-By using a scaling factor, we aim to map these values into the quantized-dtype range, making use of the full spectrum. One of the simplest and most common approaches is per-tensor absolute-maximum scaling.
-
-```
-absmax = max(abs(tensor))
-scale = absmax / max_dynamic_range_low_precision
-
-# Quantization
-tensor_q = (tensor / scale).to(low_precision_dtype)
-
-# De-Quantization
-tensor_dq = tensor_q.to(fp16) * scale
-
-tensor_dq ~ tensor
-```
-
-Given that additional information (scaling factor) is needed to "interpret" the quantized values, we describe those as derived datatypes.
-
-
-## Quantization in Comfy
-
-```
-QuantizedTensor (torch.Tensor subclass)
-  ↓ __torch_dispatch__
-Two-Level Registry (generic + layout handlers)
-  ↓
-MixedPrecisionOps + Metadata Detection
-```
-
-### Representation
-
-To represent these derived datatypes, ComfyUI uses the `QuantizedTensor` class, a subclass of torch.Tensor found in `comfy/quant_ops.py`.
-
-A `Layout` class defines how a specific quantization format behaves:
-- Required parameters
-- Quantize method
-- De-Quantize method
-
-```python
-from comfy.quant_ops import QuantizedLayout
-
-class MyLayout(QuantizedLayout):
-    @classmethod
-    def quantize(cls, tensor, **kwargs):
-        # Convert to quantized format
-        qdata = ...
-        params = {'scale': ..., 'orig_dtype': tensor.dtype}
-        return qdata, params
-    
-    @staticmethod
-    def dequantize(qdata, scale, orig_dtype, **kwargs):
-        return qdata.to(orig_dtype) * scale
-```
-
-To then run operations using these QuantizedTensors we use two registry systems to define supported operations. 
-The first is a **generic registry** that handles operations common to all quantized formats (e.g., `.to()`, `.clone()`, `.reshape()`).
-
-The second registry is layout-specific and allows implementing fast paths for operations such as nn.Linear.
-```python
-from comfy.quant_ops import register_layout_op
-
-@register_layout_op(torch.ops.aten.linear.default, MyLayout)
-def my_linear(func, args, kwargs):
-    # Extract tensors, call optimized kernel
-    ...
-```
-When `torch.nn.functional.linear()` is called with QuantizedTensor arguments, `__torch_dispatch__` automatically routes to the registered implementation.
-For any unsupported operation, QuantizedTensor falls back to calling `dequantize` and dispatching to the high-precision implementation.
-
-
-### Mixed Precision
-
-The `MixedPrecisionOps` class (lines 542-648 in `comfy/ops.py`) enables per-layer quantization decisions, allowing different layers in a model to use different precisions. This is activated when a model config contains a `layer_quant_config` dictionary that specifies which layers should be quantized and how.
-
-**Architecture:**
-
-```python
-class MixedPrecisionOps(disable_weight_init):
-    _layer_quant_config = {}  # Maps layer names to quantization configs
-    _compute_dtype = torch.bfloat16  # Default compute / dequantize precision
-```
-
-**Key mechanism:**
-
-The custom `Linear._load_from_state_dict()` method inspects each layer during model loading:
-- If the layer name is **not** in `_layer_quant_config`: load weight as regular tensor in `_compute_dtype`
-- If the layer name **is** in `_layer_quant_config`:
-  - Load weight as `QuantizedTensor` with the specified layout (e.g., `TensorCoreFP8Layout`)
-  - Load associated quantization parameters (scales, block_size, etc.)
-
-**Why it's needed:**
-
-Not all layers tolerate quantization equally. Sensitive operations like final projections can be kept in higher precision, while compute-heavy matmuls are quantized. This provides most of the performance benefits while maintaining quality.
-
-The system is selected in `pick_operations()` when `model_config.layer_quant_config` is present, making it the highest-priority operation mode.
-
-
-## Checkpoint Format
-
-Quantized checkpoints are stored as standard safetensors files with quantized weight tensors and associated scaling parameters, plus a `_quantization_metadata` JSON entry describing the quantization scheme.
-
-The quantized checkpoint will contain the same layers as the original checkpoint but:
-- The weights are stored as quantized values, sometimes using a different storage datatype. E.g. uint8 container for fp8.
-- For each quantized weight a number of additional scaling parameters are stored alongside depending on the recipe.
-- We store a metadata.json in the metadata of the final safetensor containing the `_quantization_metadata` describing which layers are quantized and what layout has been used.
-
-### Scaling Parameters details
-We define 4 possible scaling parameters that should cover most recipes in the near-future:
-- **weight_scale**: quantization scalers for the weights
-- **weight_scale_2**: global scalers in the context of double scaling
-- **pre_quant_scale**: scalers used for smoothing salient weights
-- **input_scale**: quantization scalers for the activations
-
-| Format | Storage dtype | weight_scale | weight_scale_2 | pre_quant_scale | input_scale |
-|--------|---------------|--------------|----------------|-----------------|-------------|
-| float8_e4m3fn | float32 | float32 (scalar) | - | - | float32 (scalar) |
-
-You can find the defined formats in `comfy/quant_ops.py` (QUANT_ALGOS).
-
-### Quantization Metadata
-
-The metadata stored alongside the checkpoint contains:
-- **format_version**: String to define a version of the standard
-- **layers**: A dictionary mapping layer names to their quantization format. The format string maps to the definitions found in `QUANT_ALGOS`.
-
-Example:
-```json
-{
-  "_quantization_metadata": {
-    "format_version": "1.0",
-    "layers": {
-      "model.layers.0.mlp.up_proj": "float8_e4m3fn",
-      "model.layers.0.mlp.down_proj": "float8_e4m3fn",
-      "model.layers.1.mlp.up_proj": "float8_e4m3fn"
-    }
-  }
-}
-```
-
-
-## Creating Quantized Checkpoints
-
-To create compatible checkpoints, use any quantization tool provided the output follows the checkpoint format described above and uses a layout defined in `QUANT_ALGOS`.
-
-### Weight Quantization
-
-Weight quantization is straightforward - compute the scaling factor directly from the weight tensor using the absolute maximum method described earlier. Each layer's weights are quantized independently and stored with their corresponding `weight_scale` parameter.
-
-### Calibration (for Activation Quantization)
-
-Activation quantization (e.g., for FP8 Tensor Core operations) requires `input_scale` parameters that cannot be determined from static weights alone. Since activation values depend on actual inputs, we calibrate them as part of **post-training quantization (PTQ)**:
-
-1. **Collect statistics**: Run inference on N representative samples
-2. **Track activations**: Record the absolute maximum (`amax`) of inputs to each quantized layer
-3. **Compute scales**: Derive `input_scale` from collected statistics
-4. **Store in checkpoint**: Save `input_scale` parameters alongside weights
-
-The calibration dataset should be representative of your target use case. For diffusion models, this typically means a diverse set of prompts and generation parameters.
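Reviewer note: the weight-scale and calibration steps described in the removed document can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in `comfy/quant_ops.py`. The helper names `absmax_scale` and `calibrate_input_scale` are hypothetical, tensors are modelled as plain Python lists, and the scale is assumed to be `amax` divided by 448, the largest finite float8_e4m3fn value.

```python
# Hypothetical sketch; these helpers are illustrative and not part of ComfyUI.
from typing import Iterable, Sequence

FP8_E4M3FN_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def absmax_scale(values: Sequence[float], qmax: float = FP8_E4M3FN_MAX) -> float:
    """Per-tensor scale from the absolute maximum (assumed: scale = amax / qmax)."""
    amax = max((abs(v) for v in values), default=0.0)
    # Fall back to 1.0 for an empty or all-zero tensor to avoid a zero scale.
    return amax / qmax if amax > 0.0 else 1.0


def calibrate_input_scale(
    batches: Iterable[Sequence[float]], qmax: float = FP8_E4M3FN_MAX
) -> float:
    """Track the running amax of a layer's inputs over all calibration samples."""
    running_amax = 0.0
    for batch in batches:
        batch_amax = max((abs(v) for v in batch), default=0.0)
        running_amax = max(running_amax, batch_amax)
    return running_amax / qmax if running_amax > 0.0 else 1.0


# Weights need no calibration data; activations need representative inputs.
weight_scale = absmax_scale([0.5, -2.24, 1.0])
input_scale = calibrate_input_scale([[1.0, -3.0], [4.48, 0.2]])
```

The resulting `weight_scale` and `input_scale` correspond to the per-layer scalars that the document says are stored alongside the quantized weights.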
--- a/README.md
+++ b/README.md
@@ -65,19 +65,18 @@ See what ComfyUI can do with the [example workflows](https://comfyanonymous.gith
   - [Flux](https://comfyanonymous.github.io/ComfyUI_examples/flux/)
   - [Lumina Image 2.0](https://comfyanonymous.github.io/ComfyUI_examples/lumina2/)
   - [HiDream](https://comfyanonymous.github.io/ComfyUI_examples/hidream/)
+   - [Cosmos Predict2](https://comfyanonymous.github.io/ComfyUI_examples/cosmos_predict2/)
   - [Qwen Image](https://comfyanonymous.github.io/ComfyUI_examples/qwen_image/)
-   - [Hunyuan Image 2.1](https://comfyanonymous.github.io/ComfyUI_examples/hunyuan_image/)
-   - [Flux 2](https://comfyanonymous.github.io/ComfyUI_examples/flux2/)
 - Image Editing Models
   - [Omnigen 2](https://comfyanonymous.github.io/ComfyUI_examples/omnigen/)
   - [Flux Kontext](https://comfyanonymous.github.io/ComfyUI_examples/flux/#flux-kontext-image-editing-model)
   - [HiDream E1.1](https://comfyanonymous.github.io/ComfyUI_examples/hidream/#hidream-e11)
-   - [Qwen Image Edit](https://comfyanonymous.github.io/ComfyUI_examples/qwen_image/#edit-model)
 - Video Models
   - [Stable Video Diffusion](https://comfyanonymous.github.io/ComfyUI_examples/video/)
   - [Mochi](https://comfyanonymous.github.io/ComfyUI_examples/mochi/)
   - [LTX-Video](https://comfyanonymous.github.io/ComfyUI_examples/ltxv/)
   - [Hunyuan Video](https://comfyanonymous.github.io/ComfyUI_examples/hunyuan_video/)
+   - [Nvidia Cosmos](https://comfyanonymous.github.io/ComfyUI_examples/cosmos/) and [Cosmos Predict2](https://comfyanonymous.github.io/ComfyUI_examples/cosmos_predict2/)
   - [Wan 2.1](https://comfyanonymous.github.io/ComfyUI_examples/wan/)
   - [Wan 2.2](https://comfyanonymous.github.io/ComfyUI_examples/wan22/)
 - Audio Models
@@ -113,11 +112,10 @@ Workflow examples can be found on the [Examples page](https://comfyanonymous.git

 ## Release Process

-ComfyUI follows a weekly release cycle targeting Monday but this regularly changes because of model releases or large changes to the codebase. There are three interconnected repositories:
+ComfyUI follows a weekly release cycle targeting Friday but this regularly changes because of model releases or large changes to the codebase. There are three interconnected repositories:

 1. **[ComfyUI Core](https://github.com/comfyanonymous/ComfyUI)**
-   - Releases a new stable version (e.g., v0.7.0) roughly every week.
-   - Commits outside of the stable release tags may be very unstable and break many custom nodes.
+   - Releases a new stable version (e.g., v0.7.0)
   - Serves as the foundation for the desktop release

 2. **[ComfyUI Desktop](https://github.com/Comfy-Org/desktop)**
@@ -174,20 +172,10 @@ There is a portable standalone build for Windows that should work for running on

 ### [Direct link to download](https://github.com/comfyanonymous/ComfyUI/releases/latest/download/ComfyUI_windows_portable_nvidia.7z)

-Simply download, extract with [7-Zip](https://7-zip.org) or with the windows explorer on recent windows versions and run. For smaller models you normally only need to put the checkpoints (the huge ckpt/safetensors files) in: ComfyUI\models\checkpoints but many of the larger models have multiple files. Make sure to follow the instructions to know which subfolder to put them in ComfyUI\models\
+Simply download, extract with [7-Zip](https://7-zip.org) and run. Make sure you put your Stable Diffusion checkpoints/models (the huge ckpt/safetensors files) in: ComfyUI\models\checkpoints

 If you have trouble extracting it, right click the file -> properties -> unblock

-Update your Nvidia drivers if it doesn't start.
-
-#### Alternative Downloads:
-
-[Experimental portable for AMD GPUs](https://github.com/comfyanonymous/ComfyUI/releases/latest/download/ComfyUI_windows_portable_amd.7z)
-
-[Portable with pytorch cuda 12.8 and python 3.12](https://github.com/comfyanonymous/ComfyUI/releases/latest/download/ComfyUI_windows_portable_nvidia_cu128.7z).
-
-[Portable with pytorch cuda 12.6 and python 3.12](https://github.com/comfyanonymous/ComfyUI/releases/latest/download/ComfyUI_windows_portable_nvidia_cu126.7z) (Supports Nvidia 10 series and older GPUs).
-
 #### How do I share models between another UI and ComfyUI?

 See the [Config file](extra_model_paths.yaml.example) to set the search paths for models. In the standalone windows build you can find this file in the ComfyUI directory. Rename this file to extra_model_paths.yaml and edit it with your favorite text editor.
@@ -203,11 +191,7 @@ comfy install

 ## Manual Install (Windows, Linux)

-Python 3.14 works but you may encounter issues with the torch compile node. The free threaded variant is still missing some dependencies.
-
-Python 3.13 is very well supported. If you have trouble with some custom node dependencies on 3.13 you can try 3.12
-
-### Instructions:
+Python 3.13 is supported, but 3.12 is recommended because some custom nodes and their dependencies might not support 3.13 yet.

 Git clone this repo.

@@ -216,36 +200,18 @@ Put your SD checkpoints (the huge ckpt/safetensors files) in: models/checkpoints
 Put your VAE in: models/vae


-### AMD GPUs (Linux)
-
+### AMD GPUs (Linux only)
 AMD users can install ROCm and PyTorch with pip if they are not already installed. This is the command to install the stable version:

 ```pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.4```

-This is the command to install the nightly with ROCm 7.0 which might have some performance improvements:
+This is the command to install the nightly with ROCm 6.4 which might have some performance improvements:

-```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm7.1```
-
-
-### AMD GPUs (Experimental: Windows and Linux), RDNA 3, 3.5 and 4 only.
-
-These have less hardware support than the builds above but they work on windows. You also need to install the pytorch version specific to your hardware.
-
-RDNA 3 (RX 7000 series):
-
-```pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx110X-dgpu/```
-
-RDNA 3.5 (Strix halo/Ryzen AI Max+ 365):
-
-```pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx1151/```
-
-RDNA 4 (RX 9000 series):
-
-```pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/```
+```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4```

 ### Intel GPUs (Windows and Linux)

-Intel Arc GPU users can install native PyTorch with torch.xpu support using pip. More information can be found [here](https://pytorch.org/docs/main/notes/get_start_xpu.html)
+(Option 1) Intel Arc GPU users can install native PyTorch with torch.xpu support using pip. More information can be found [here](https://pytorch.org/docs/main/notes/get_start_xpu.html)

 1. To install PyTorch xpu, use the following command:

@@ -255,15 +221,19 @@ This is the command to install the Pytorch xpu nightly which might have some per

 ```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/xpu```

+(Option 2) Alternatively, Intel GPUs supported by Intel Extension for PyTorch (IPEX) can leverage IPEX for improved performance.
+
+1. Visit the [Installation](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu) page for more information.
+
 ### NVIDIA

 Nvidia users should install stable pytorch using this command:

-```pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu130```
+```pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu129```

 This is the command to install pytorch nightly instead which might have performance improvements.

-```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu130```
+```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu129```

 #### Troubleshooting

@@ -294,6 +264,12 @@ You can install ComfyUI in Apple Mac silicon (M1 or M2) with any recent macOS ve

 > **Note**: Remember to add your models, VAE, LoRAs etc. to the corresponding Comfy folders, as discussed in [ComfyUI manual installation](#manual-install-windows-linux).

+#### DirectML (AMD Cards on Windows)
+
+DirectML is poorly supported and is not recommended. Some unofficial PyTorch ROCm builds for Windows will give you a much better experience. This README will be updated once official PyTorch ROCm builds for Windows are released.
+
+```pip install torch-directml``` Then you can launch ComfyUI with: ```python main.py --directml```
+
 #### Ascend NPUs

 For models compatible with Ascend Extension for PyTorch (torch_npu). To get started, ensure your environment meets the prerequisites outlined on the [installation](https://ascend.github.io/docs/sources/ascend/quick_install.html) page. Here's a step-by-step guide tailored to your platform and installation method:
--- a/app/frontend_management.py
+++ b/app/frontend_management.py
@@ -10,8 +10,7 @@ import importlib
 from dataclasses import dataclass
 from functools import cached_property
 from pathlib import Path
-from typing import Dict, TypedDict, Optional
-from aiohttp import web
+from typing import TypedDict, Optional
 from importlib.metadata import version

 import requests
@@ -43,7 +42,6 @@ def get_installed_frontend_version():
    frontend_version_str = version("comfyui-frontend-package")
    return frontend_version_str

-
 def get_required_frontend_version():
    """Get the required frontend version from requirements.txt."""
    try:
@@ -65,7 +63,6 @@ def get_required_frontend_version():
        logging.error(f"Error reading requirements.txt: {e}")
        return None

-
 def check_frontend_version():
    """Check if the frontend version is up to date."""

@@ -206,37 +203,6 @@ class FrontendManager:
        """Get the required frontend package version."""
        return get_required_frontend_version()

-    @classmethod
-    def get_installed_templates_version(cls) -> Optional[str]:
-        """Get the currently installed workflow templates package version."""
-        try:
-            templates_version_str = version("comfyui-workflow-templates")
-            return templates_version_str
-        except Exception:
-            return None
-
-    @classmethod
-    def get_required_templates_version(cls) -> Optional[str]:
-        """Get the required workflow templates version from requirements.txt."""
-        try:
-            with open(requirements_path, "r", encoding="utf-8") as f:
-                for line in f:
-                    line = line.strip()
-                    if line.startswith("comfyui-workflow-templates=="):
-                        version_str = line.split("==")[-1]
-                        if not is_valid_version(version_str):
-                            logging.error(f"Invalid templates version format in requirements.txt: {version_str}")
-                            return None
-                        return version_str
-                logging.error("comfyui-workflow-templates not found in requirements.txt")
-                return None
-        except FileNotFoundError:
-            logging.error("requirements.txt not found. Cannot determine required templates version.")
-            return None
-        except Exception as e:
-            logging.error(f"Error reading requirements.txt: {e}")
-            return None
-
    @classmethod
    def default_frontend_path(cls) -> str:
        try:
@@ -258,54 +224,7 @@ comfyui-frontend-package is not installed.
            sys.exit(-1)

    @classmethod
-    def template_asset_map(cls) -> Optional[Dict[str, str]]:
-        """Return a mapping of template asset names to their absolute paths."""
-        try:
-            from comfyui_workflow_templates import (
-                get_asset_path,
-                iter_templates,
-            )
-        except ImportError:
-            logging.error(
-                f"""
-********** ERROR ***********
-
-comfyui-workflow-templates is not installed.
-
-{frontend_install_warning_message()}
-
-********** ERROR ***********
-""".strip()
-            )
-            return None
-
-        try:
-            template_entries = list(iter_templates())
-        except Exception as exc:
-            logging.error(f"Failed to enumerate workflow templates: {exc}")
-            return None
-
-        asset_map: Dict[str, str] = {}
-        try:
-            for entry in template_entries:
-                for asset in entry.assets:
-                    asset_map[asset.filename] = get_asset_path(
-                        entry.template_id, asset.filename
-                    )
-        except Exception as exc:
-            logging.error(f"Failed to resolve template asset paths: {exc}")
-            return None
-
-        if not asset_map:
-            logging.error("No workflow template assets found. Did the packages install correctly?")
-            return None
-
-        return asset_map
-
-
-    @classmethod
-    def legacy_templates_path(cls) -> Optional[str]:
-        """Return the legacy templates directory shipped inside the meta package."""
+    def templates_path(cls) -> str:
        try:
            import comfyui_workflow_templates

@@ -324,7 +243,6 @@ comfyui-workflow-templates is not installed.
 ********** ERROR ***********
 """.strip()
            )
-            return None

    @classmethod
    def embedded_docs_path(cls) -> str:
@@ -441,17 +359,3 @@ comfyui-workflow-templates is not installed.
            logging.info("Falling back to the default frontend.")
            check_frontend_version()
            return cls.default_frontend_path()
-    @classmethod
-    def template_asset_handler(cls):
-        assets = cls.template_asset_map()
-        if not assets:
-            return None
-
-        async def serve_template(request: web.Request) -> web.StreamResponse:
-            rel_path = request.match_info.get("path", "")
-            target = assets.get(rel_path)
-            if target is None:
-                raise web.HTTPNotFound()
-            return web.FileResponse(target)
-
-        return serve_template
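Reviewer note: the pinned-version lookup that the removed `get_required_templates_version` performed can be sketched in isolation as below. This is a hedged illustration, not the project's code. `find_pinned_version` is a hypothetical name, the version `1.2.3` is a made-up example value, and the dotted-numeric regex stands in for `is_valid_version`, whose real rules are not shown in this diff.

```python
# Hypothetical helper; mirrors the line-by-line lookup in the removed method.
import re
from typing import Iterable, Optional


def find_pinned_version(lines: Iterable[str], package: str) -> Optional[str]:
    """Return the version pinned with `package==X`, or None if absent or invalid."""
    prefix = f"{package}=="
    for line in lines:
        line = line.strip()
        if line.startswith(prefix):
            version_str = line.split("==")[-1]
            # Assumed validity rule: dotted numeric version such as 1.2.3.
            if re.fullmatch(r"\d+(\.\d+)*", version_str) is None:
                return None
            return version_str
    return None


# Example requirements content; package versions here are illustrative only.
requirements = ["torch", "comfyui-workflow-templates==1.2.3", "numpy>=1.25.0"]
pinned = find_pinned_version(requirements, "comfyui-workflow-templates")
```

Returning `None` for both "not found" and "invalid" matches the removed method, which logged an error and returned `None` in either case.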
--- a/app/subgraph_manager.py
+++ b/app/subgraph_manager.py
@@ -1,112 +0,0 @@
-from __future__ import annotations
-
-from typing import TypedDict
-import os
-import folder_paths
-import glob
-from aiohttp import web
-import hashlib
-
-
-class Source:
-    custom_node = "custom_node"
-
-class SubgraphEntry(TypedDict):
-    source: str
-    """
-    Source of subgraph - custom_nodes vs templates.
-    """
-    path: str
-    """
-    Relative path of the subgraph file.
-    For custom nodes, will be the relative directory like <custom_node_dir>/subgraphs/<name>.json
-    """
-    name: str
-    """
-    Name of subgraph file.
-    """
-    info: CustomNodeSubgraphEntryInfo
-    """
-    Additional info about subgraph; in the case of custom_nodes, will contain nodepack name
-    """
-    data: str
-
-class CustomNodeSubgraphEntryInfo(TypedDict):
-    node_pack: str
-    """Node pack name."""
-
-class SubgraphManager:
-    def __init__(self):
-        self.cached_custom_node_subgraphs: dict[str, SubgraphEntry] | None = None
-
-    async def load_entry_data(self, entry: SubgraphEntry):
-        with open(entry['path'], 'r') as f:
-            entry['data'] = f.read()
-        return entry
-
-    async def sanitize_entry(self, entry: SubgraphEntry | None, remove_data=False) -> SubgraphEntry | None:
-        if entry is None:
-            return None
-        entry = entry.copy()
-        entry.pop('path', None)
-        if remove_data:
-            entry.pop('data', None)
-        return entry
-
-    async def sanitize_entries(self, entries: dict[str, SubgraphEntry], remove_data=False) -> dict[str, SubgraphEntry]:
-        entries = entries.copy()
-        for key in list(entries.keys()):
-            entries[key] = await self.sanitize_entry(entries[key], remove_data)
-        return entries
-
-    async def get_custom_node_subgraphs(self, loadedModules, force_reload=False):
-        # if not forced to reload and cached, return cache
-        if not force_reload and self.cached_custom_node_subgraphs is not None:
-            return self.cached_custom_node_subgraphs
-        # Load subgraphs from custom nodes
-        subfolder = "subgraphs"
-        subgraphs_dict: dict[str, SubgraphEntry] = {}
-
-        for folder in folder_paths.get_folder_paths("custom_nodes"):
-            pattern = os.path.join(folder, f"*/{subfolder}/*.json")
-            matched_files = glob.glob(pattern)
-            for file in matched_files:
-                # replace backslashes with forward slashes
-                file = file.replace('\\', '/')
-                info: CustomNodeSubgraphEntryInfo = {
-                    "node_pack": "custom_nodes." + file.split('/')[-3]
-                }
-                source = Source.custom_node
-                # hash source + path to make sure id will be as unique as possible, but
-                # reproducible across backend reloads
-                id = hashlib.sha256(f"{source}{file}".encode()).hexdigest()
-                entry: SubgraphEntry = {
-                    "source": Source.custom_node,
-                    "name": os.path.splitext(os.path.basename(file))[0],
-                    "path": file,
-                    "info": info,
-                }
-                subgraphs_dict[id] = entry
-        self.cached_custom_node_subgraphs = subgraphs_dict
-        return subgraphs_dict
-
-    async def get_custom_node_subgraph(self, id: str, loadedModules):
-        subgraphs = await self.get_custom_node_subgraphs(loadedModules)
-        entry: SubgraphEntry = subgraphs.get(id, None)
-        if entry is not None and entry.get('data', None) is None:
-            await self.load_entry_data(entry)
-        return entry
-
-    def add_routes(self, routes, loadedModules):
-        @routes.get("/global_subgraphs")
-        async def get_global_subgraphs(request):
-            subgraphs_dict = await self.get_custom_node_subgraphs(loadedModules)
-            # NOTE: we may want to include other sources of global subgraphs such as templates in the future;
-            # that's the reasoning for the current implementation
-            return web.json_response(await self.sanitize_entries(subgraphs_dict, remove_data=True))
-
-        @routes.get("/global_subgraphs/{id}")
-        async def get_global_subgraph(request):
-            id = request.match_info.get("id", None)
-            subgraph = await self.get_custom_node_subgraph(id, loadedModules)
-            return web.json_response(await self.sanitize_entry(subgraph))
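Reviewer note: the removed `SubgraphManager` derived each subgraph id by hashing the source name together with the normalized file path, so ids stay stable across backend reloads. A minimal sketch of that derivation follows. The function names `subgraph_id` and `node_pack_name` and the example paths are hypothetical, and only the standard library is used.

```python
# Hypothetical sketch of the id and node-pack derivation in the removed file.
import hashlib


def normalize(path: str) -> str:
    """Replace backslashes so Windows and POSIX paths hash identically."""
    return path.replace("\\", "/")


def subgraph_id(source: str, path: str) -> str:
    """SHA-256 hex digest of source + normalized path, stable across reloads."""
    return hashlib.sha256(f"{source}{normalize(path)}".encode()).hexdigest()


def node_pack_name(path: str) -> str:
    """Third path component from the end is the node pack directory."""
    return "custom_nodes." + normalize(path).split("/")[-3]


# Example paths are made up for illustration.
posix_id = subgraph_id("custom_node", "custom_nodes/my_pack/subgraphs/demo.json")
windows_id = subgraph_id("custom_node", "custom_nodes\\my_pack\\subgraphs\\demo.json")
pack = node_pack_name("custom_nodes\\my_pack\\subgraphs\\demo.json")
```

Because the path is normalized before hashing, the same file yields the same id regardless of the platform's path separator.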
--- a/app/user_manager.py
+++ b/app/user_manager.py
@@ -363,17 +363,10 @@ class UserManager():
            if not overwrite and os.path.exists(path):
                return web.Response(status=409, text="File already exists")

-            try:
-                body = await request.read()
+            body = await request.read()

-                with open(path, "wb") as f:
-                    f.write(body)
-            except OSError as e:
-                logging.warning(f"Error saving file '{path}': {e}")
-                return web.Response(
-                    status=400,
-                    reason="Invalid filename. Please avoid special characters like :\\/*?\"<>|"
-                )
+            with open(path, "wb") as f:
+                f.write(body)

            user_path = self.get_request_user_filepath(request, None)
            if full_info:
--- a/comfy/audio_encoders/audio_encoders.py
+++ b/comfy/audio_encoders/audio_encoders.py
@@ -1,91 +0,0 @@
-from .wav2vec2 import Wav2Vec2Model
-from .whisper import WhisperLargeV3
-import comfy.model_management
-import comfy.ops
-import comfy.utils
-import logging
-import torchaudio
-
-
-class AudioEncoderModel():
-    def __init__(self, config):
-        self.load_device = comfy.model_management.text_encoder_device()
-        offload_device = comfy.model_management.text_encoder_offload_device()
-        self.dtype = comfy.model_management.text_encoder_dtype(self.load_device)
-        model_type = config.pop("model_type")
-        model_config = dict(config)
-        model_config.update({
-            "dtype": self.dtype,
-            "device": offload_device,
-            "operations": comfy.ops.manual_cast
-        })
-
-        if model_type == "wav2vec2":
-            self.model = Wav2Vec2Model(**model_config)
-        elif model_type == "whisper3":
-            self.model = WhisperLargeV3(**model_config)
-        self.model.eval()
-        self.patcher = comfy.model_patcher.ModelPatcher(self.model, load_device=self.load_device, offload_device=offload_device)
-        self.model_sample_rate = 16000
-
-    def load_sd(self, sd):
-        return self.model.load_state_dict(sd, strict=False)
-
-    def get_sd(self):
-        return self.model.state_dict()
-
-    def encode_audio(self, audio, sample_rate):
-        comfy.model_management.load_model_gpu(self.patcher)
-        audio = torchaudio.functional.resample(audio, sample_rate, self.model_sample_rate)
-        out, all_layers = self.model(audio.to(self.load_device))
-        outputs = {}
-        outputs["encoded_audio"] = out
-        outputs["encoded_audio_all_layers"] = all_layers
-        outputs["audio_samples"] = audio.shape[2]
-        return outputs
-
-
-def load_audio_encoder_from_sd(sd, prefix=""):
-    sd = comfy.utils.state_dict_prefix_replace(sd, {"wav2vec2.": ""})
-    if "encoder.layer_norm.bias" in sd: #wav2vec2
-        embed_dim = sd["encoder.layer_norm.bias"].shape[0]
-        if embed_dim == 1024:# large
-            config = {
-                "model_type": "wav2vec2",
-                "embed_dim": 1024,
-                "num_heads": 16,
-                "num_layers": 24,
-                "conv_norm": True,
-                "conv_bias": True,
-                "do_normalize": True,
-                "do_stable_layer_norm": True
-                }
-        elif embed_dim == 768: # base
-            config = {
-                "model_type": "wav2vec2",
-                "embed_dim": 768,
-                "num_heads": 12,
-                "num_layers": 12,
-                "conv_norm": False,
-                "conv_bias": False,
-                "do_normalize": False, # chinese-wav2vec2-base has this False
-                "do_stable_layer_norm": False
-            }
-        else:
-            raise RuntimeError("ERROR: audio encoder file is invalid or unsupported embed_dim: {}".format(embed_dim))
-    elif "model.encoder.embed_positions.weight" in sd:
-        sd = comfy.utils.state_dict_prefix_replace(sd, {"model.": ""})
-        config = {
-            "model_type": "whisper3",
-        }
-    else:
-        raise RuntimeError("ERROR: audio encoder not supported.")
-
-    audio_encoder = AudioEncoderModel(config)
-    m, u = audio_encoder.load_sd(sd)
-    if len(m) > 0:
-        logging.warning("missing audio encoder: {}".format(m))
-    if len(u) > 0:
-        logging.warning("unexpected audio encoder: {}".format(u))
-
-    return audio_encoder
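Reviewer note: the removed `load_audio_encoder_from_sd` picked a model configuration by inspecting which keys the state dict contained and the size of one bias tensor. The sketch below isolates that dispatch. It is an illustration under assumptions: `detect_audio_encoder` is a hypothetical name, the state dict is modelled as a mapping from key to shape tuple instead of real tensors, and the returned labels are invented for readability.

```python
# Hypothetical sketch; shapes stand in for tensors so no torch import is needed.
from typing import Dict, Tuple


def detect_audio_encoder(shapes: Dict[str, Tuple[int, ...]]) -> str:
    """Classify a checkpoint from its keys, following the removed loader's order."""
    if "encoder.layer_norm.bias" in shapes:  # wav2vec2 family
        embed_dim = shapes["encoder.layer_norm.bias"][0]
        if embed_dim == 1024:
            return "wav2vec2-large"
        if embed_dim == 768:
            return "wav2vec2-base"
        raise RuntimeError(f"unsupported embed_dim: {embed_dim}")
    if "model.encoder.embed_positions.weight" in shapes:  # whisper family
        return "whisper3"
    raise RuntimeError("audio encoder not supported")


# The shape values for the whisper key are arbitrary; only the key is checked.
large = detect_audio_encoder({"encoder.layer_norm.bias": (1024,)})
base = detect_audio_encoder({"encoder.layer_norm.bias": (768,)})
whisper = detect_audio_encoder({"model.encoder.embed_positions.weight": (1, 1)})
```

The removed loader additionally stripped a `wav2vec2.` prefix from the keys before this check, which the sketch omits.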
--- a/comfy/audio_encoders/wav2vec2.py
+++ b/comfy/audio_encoders/wav2vec2.py
@@ -1,252 +0,0 @@
-import torch
-import torch.nn as nn
-from comfy.ldm.modules.attention import optimized_attention_masked
-
-
-class LayerNormConv(nn.Module):
-    def __init__(self, in_channels, out_channels, kernel_size, stride, bias=False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.conv = operations.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, bias=bias, device=device, dtype=dtype)
-        self.layer_norm = operations.LayerNorm(out_channels, elementwise_affine=True, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.conv(x)
-        return torch.nn.functional.gelu(self.layer_norm(x.transpose(-2, -1)).transpose(-2, -1))
-
-class LayerGroupNormConv(nn.Module):
-    def __init__(self, in_channels, out_channels, kernel_size, stride, bias=False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.conv = operations.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, bias=bias, device=device, dtype=dtype)
-        self.layer_norm = operations.GroupNorm(num_groups=out_channels, num_channels=out_channels, affine=True, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.conv(x)
-        return torch.nn.functional.gelu(self.layer_norm(x))
-
-class ConvNoNorm(nn.Module):
-    def __init__(self, in_channels, out_channels, kernel_size, stride, bias=False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.conv = operations.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, bias=bias, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.conv(x)
-        return torch.nn.functional.gelu(x)
-
-
-class ConvFeatureEncoder(nn.Module):
-    def __init__(self, conv_dim, conv_bias=False, conv_norm=True, dtype=None, device=None, operations=None):
-        super().__init__()
-        if conv_norm:
-            self.conv_layers = nn.ModuleList([
-                LayerNormConv(1, conv_dim, kernel_size=10, stride=5, bias=True, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-            ])
-        else:
-            self.conv_layers = nn.ModuleList([
-                LayerGroupNormConv(1, conv_dim, kernel_size=10, stride=5, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-            ])
-
-    def forward(self, x):
-        x = x.unsqueeze(1)
-
-        for conv in self.conv_layers:
-            x = conv(x)
-
-        return x.transpose(1, 2)
-
-
-class FeatureProjection(nn.Module):
-    def __init__(self, conv_dim, embed_dim, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.layer_norm = operations.LayerNorm(conv_dim, eps=1e-05, device=device, dtype=dtype)
-        self.projection = operations.Linear(conv_dim, embed_dim, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.layer_norm(x)
-        x = self.projection(x)
-        return x
-
-
-class PositionalConvEmbedding(nn.Module):
-    def __init__(self, embed_dim=768, kernel_size=128, groups=16):
-        super().__init__()
-        self.conv = nn.Conv1d(
-            embed_dim,
-            embed_dim,
-            kernel_size=kernel_size,
-            padding=kernel_size // 2,
-            groups=groups,
-        )
-        self.conv = torch.nn.utils.parametrizations.weight_norm(self.conv, name="weight", dim=2)
-        self.activation = nn.GELU()
-
-    def forward(self, x):
-        x = x.transpose(1, 2)
-        x = self.conv(x)[:, :, :-1]
-        x = self.activation(x)
-        x = x.transpose(1, 2)
-        return x
-
-
-class TransformerEncoder(nn.Module):
-    def __init__(
-        self,
-        embed_dim=768,
-        num_heads=12,
-        num_layers=12,
-        mlp_ratio=4.0,
-        do_stable_layer_norm=True,
-        dtype=None, device=None, operations=None
-    ):
-        super().__init__()
-
-        self.pos_conv_embed = PositionalConvEmbedding(embed_dim=embed_dim)
-        self.layers = nn.ModuleList([
-            TransformerEncoderLayer(
-                embed_dim=embed_dim,
-                num_heads=num_heads,
-                mlp_ratio=mlp_ratio,
-                do_stable_layer_norm=do_stable_layer_norm,
-                device=device, dtype=dtype, operations=operations
-            )
-            for _ in range(num_layers)
-        ])
-
-        self.layer_norm = operations.LayerNorm(embed_dim, eps=1e-05, device=device, dtype=dtype)
-        self.do_stable_layer_norm = do_stable_layer_norm
-
-    def forward(self, x, mask=None):
-        x = x + self.pos_conv_embed(x)
-        all_x = ()
-        if not self.do_stable_layer_norm:
-            x = self.layer_norm(x)
-        for layer in self.layers:
-            all_x += (x,)
-            x = layer(x, mask)
-        if self.do_stable_layer_norm:
-            x = self.layer_norm(x)
-        all_x += (x,)
-        return x, all_x
-
-
-class Attention(nn.Module):
-    def __init__(self, embed_dim, num_heads, bias=True, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.embed_dim = embed_dim
-        self.num_heads = num_heads
-        self.head_dim = embed_dim // num_heads
-
-        self.k_proj = operations.Linear(embed_dim, embed_dim, bias=bias, device=device, dtype=dtype)
-        self.v_proj = operations.Linear(embed_dim, embed_dim, bias=bias, device=device, dtype=dtype)
-        self.q_proj = operations.Linear(embed_dim, embed_dim, bias=bias, device=device, dtype=dtype)
-        self.out_proj = operations.Linear(embed_dim, embed_dim, bias=bias, device=device, dtype=dtype)
-
-    def forward(self, x, mask=None):
-        assert (mask is None)  # TODO?
-        q = self.q_proj(x)
-        k = self.k_proj(x)
-        v = self.v_proj(x)
-
-        out = optimized_attention_masked(q, k, v, self.num_heads)
-        return self.out_proj(out)
-
-
-class FeedForward(nn.Module):
-    def __init__(self, embed_dim, mlp_ratio, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.intermediate_dense = operations.Linear(embed_dim, int(embed_dim * mlp_ratio), device=device, dtype=dtype)
-        self.output_dense = operations.Linear(int(embed_dim * mlp_ratio), embed_dim, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.intermediate_dense(x)
-        x = torch.nn.functional.gelu(x)
-        x = self.output_dense(x)
-        return x
-
-
-class TransformerEncoderLayer(nn.Module):
-    def __init__(
-        self,
-        embed_dim=768,
-        num_heads=12,
-        mlp_ratio=4.0,
-        do_stable_layer_norm=True,
-        dtype=None, device=None, operations=None
-    ):
-        super().__init__()
-
-        self.attention = Attention(embed_dim, num_heads, device=device, dtype=dtype, operations=operations)
-
-        self.layer_norm = operations.LayerNorm(embed_dim, device=device, dtype=dtype)
-        self.feed_forward = FeedForward(embed_dim, mlp_ratio, device=device, dtype=dtype, operations=operations)
-        self.final_layer_norm = operations.LayerNorm(embed_dim, device=device, dtype=dtype)
-        self.do_stable_layer_norm = do_stable_layer_norm
-
-    def forward(self, x, mask=None):
-        residual = x
-        if self.do_stable_layer_norm:
-            x = self.layer_norm(x)
-        x = self.attention(x, mask=mask)
-        x = residual + x
-        if not self.do_stable_layer_norm:
-            x = self.layer_norm(x)
-            return self.final_layer_norm(x + self.feed_forward(x))
-        else:
-            return x + self.feed_forward(self.final_layer_norm(x))
-
-
-class Wav2Vec2Model(nn.Module):
-    """Complete Wav2Vec 2.0 model."""
-
-    def __init__(
-        self,
-        embed_dim=1024,
-        final_dim=256,
-        num_heads=16,
-        num_layers=24,
-        conv_norm=True,
-        conv_bias=True,
-        do_normalize=True,
-        do_stable_layer_norm=True,
-        dtype=None, device=None, operations=None
-    ):
-        super().__init__()
-
-        conv_dim = 512
-        self.feature_extractor = ConvFeatureEncoder(conv_dim, conv_norm=conv_norm, conv_bias=conv_bias, device=device, dtype=dtype, operations=operations)
-        self.feature_projection = FeatureProjection(conv_dim, embed_dim, device=device, dtype=dtype, operations=operations)
-
-        self.masked_spec_embed = nn.Parameter(torch.empty(embed_dim, device=device, dtype=dtype))
-        self.do_normalize = do_normalize
-
-        self.encoder = TransformerEncoder(
-            embed_dim=embed_dim,
-            num_heads=num_heads,
-            num_layers=num_layers,
-            do_stable_layer_norm=do_stable_layer_norm,
-            device=device, dtype=dtype, operations=operations
-        )
-
-    def forward(self, x, mask_time_indices=None, return_dict=False):
-        x = torch.mean(x, dim=1)
-
-        if self.do_normalize:
-            x = (x - x.mean()) / torch.sqrt(x.var() + 1e-7)
-
-        features = self.feature_extractor(x)
-        features = self.feature_projection(features)
-        batch_size, seq_len, _ = features.shape
-
-        x, all_x = self.encoder(features)
-        return x, all_x
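Review note: the removed `Wav2Vec2Model.forward` downmixes the input by averaging over the channel dimension and, when `do_normalize` is set, standardizes the waveform with `(x - x.mean()) / sqrt(x.var() + 1e-7)`. The sketch below reproduces that arithmetic in plain Python for a single clip. The function name `downmix_and_normalize` and the list-based input are assumptions made for illustration. The removed code works on a batched tensor and takes the mean and variance over the whole tensor. `statistics.variance` is used because it is the sample variance (n - 1 denominator), which is also the default of `torch.var`.

```python
import math
import statistics


def downmix_and_normalize(channels, eps=1e-7):
    """Average channels to mono, then standardize to zero mean and ~unit variance.

    channels: list of equal-length per-channel sample lists (hypothetical layout).
    """
    n_channels = len(channels)
    # Channel average, the counterpart of torch.mean(x, dim=1) in the removed code.
    mono = [sum(frame) / n_channels for frame in zip(*channels)]
    mean = statistics.fmean(mono)
    # Sample variance (n - 1 denominator), like torch.var's default.
    var = statistics.variance(mono)
    scale = math.sqrt(var + eps)
    return [(s - mean) / scale for s in mono]


stereo = [[0.0, 0.5, 1.0, 0.5], [0.0, 0.5, 1.0, 0.5]]
normalized = downmix_and_normalize(stereo)
```

The small `eps` keeps the division finite for a silent (constant) clip, at the cost of a variance that is very slightly below 1.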
--- a/comfy/audio_encoders/whisper.py
+++ b/comfy/audio_encoders/whisper.py
@@ -1,186 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import torchaudio
-from typing import Optional
-from comfy.ldm.modules.attention import optimized_attention_masked
-import comfy.ops
-
-class WhisperFeatureExtractor(nn.Module):
-    def __init__(self, n_mels=128, device=None):
-        super().__init__()
-        self.sample_rate = 16000
-        self.n_fft = 400
-        self.hop_length = 160
-        self.n_mels = n_mels
-        self.chunk_length = 30
-        self.n_samples = 480000
-
-        self.mel_spectrogram = torchaudio.transforms.MelSpectrogram(
-            sample_rate=self.sample_rate,
-            n_fft=self.n_fft,
-            hop_length=self.hop_length,
-            n_mels=self.n_mels,
-            f_min=0,
-            f_max=8000,
-            norm="slaney",
-            mel_scale="slaney",
-        ).to(device)
-
-    def __call__(self, audio):
-        audio = torch.mean(audio, dim=1)
-        batch_size = audio.shape[0]
-        processed_audio = []
-
-        for i in range(batch_size):
-            aud = audio[i]
-            if aud.shape[0] > self.n_samples:
-                aud = aud[:self.n_samples]
-            elif aud.shape[0] < self.n_samples:
-                aud = F.pad(aud, (0, self.n_samples - aud.shape[0]))
-            processed_audio.append(aud)
-
-        audio = torch.stack(processed_audio)
-
-        mel_spec = self.mel_spectrogram(audio.to(self.mel_spectrogram.spectrogram.window.device))[:, :, :-1].to(audio.device)
-
-        log_mel_spec = torch.clamp(mel_spec, min=1e-10).log10()
-        log_mel_spec = torch.maximum(log_mel_spec, log_mel_spec.max() - 8.0)
-        log_mel_spec = (log_mel_spec + 4.0) / 4.0
-
-        return log_mel_spec
-
-
-class MultiHeadAttention(nn.Module):
-    def __init__(self, d_model: int, n_heads: int, dtype=None, device=None, operations=None):
-        super().__init__()
-        assert d_model % n_heads == 0
-
-        self.d_model = d_model
-        self.n_heads = n_heads
-        self.d_k = d_model // n_heads
-
-        self.q_proj = operations.Linear(d_model, d_model, dtype=dtype, device=device)
-        self.k_proj = operations.Linear(d_model, d_model, bias=False, dtype=dtype, device=device)
-        self.v_proj = operations.Linear(d_model, d_model, dtype=dtype, device=device)
-        self.out_proj = operations.Linear(d_model, d_model, dtype=dtype, device=device)
-
-    def forward(
-        self,
-        query: torch.Tensor,
-        key: torch.Tensor,
-        value: torch.Tensor,
-        mask: Optional[torch.Tensor] = None,
-    ) -> torch.Tensor:
-        batch_size, seq_len, _ = query.shape
-
-        q = self.q_proj(query)
-        k = self.k_proj(key)
-        v = self.v_proj(value)
-
-        attn_output = optimized_attention_masked(q, k, v, self.n_heads, mask)
-        attn_output = self.out_proj(attn_output)
-
-        return attn_output
-
-
-class EncoderLayer(nn.Module):
-    def __init__(self, d_model: int, n_heads: int, d_ff: int, dtype=None, device=None, operations=None):
-        super().__init__()
-
-        self.self_attn = MultiHeadAttention(d_model, n_heads, dtype=dtype, device=device, operations=operations)
-        self.self_attn_layer_norm = operations.LayerNorm(d_model, dtype=dtype, device=device)
-
-        self.fc1 = operations.Linear(d_model, d_ff, dtype=dtype, device=device)
-        self.fc2 = operations.Linear(d_ff, d_model, dtype=dtype, device=device)
-        self.final_layer_norm = operations.LayerNorm(d_model, dtype=dtype, device=device)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        attention_mask: Optional[torch.Tensor] = None
-    ) -> torch.Tensor:
-        residual = x
-        x = self.self_attn_layer_norm(x)
-        x = self.self_attn(x, x, x, attention_mask)
-        x = residual + x
-
-        residual = x
-        x = self.final_layer_norm(x)
-        x = self.fc1(x)
-        x = F.gelu(x)
-        x = self.fc2(x)
-        x = residual + x
-
-        return x
-
-
-class AudioEncoder(nn.Module):
-    def __init__(
-        self,
-        n_mels: int = 128,
-        n_ctx: int = 1500,
-        n_state: int = 1280,
-        n_head: int = 20,
-        n_layer: int = 32,
-        dtype=None,
-        device=None,
-        operations=None
-    ):
-        super().__init__()
-
-        self.conv1 = operations.Conv1d(n_mels, n_state, kernel_size=3, padding=1, dtype=dtype, device=device)
-        self.conv2 = operations.Conv1d(n_state, n_state, kernel_size=3, stride=2, padding=1, dtype=dtype, device=device)
-
-        self.embed_positions = operations.Embedding(n_ctx, n_state, dtype=dtype, device=device)
-
-        self.layers = nn.ModuleList([
-            EncoderLayer(n_state, n_head, n_state * 4, dtype=dtype, device=device, operations=operations)
-            for _ in range(n_layer)
-        ])
-
-        self.layer_norm = operations.LayerNorm(n_state, dtype=dtype, device=device)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        x = F.gelu(self.conv1(x))
-        x = F.gelu(self.conv2(x))
-
-        x = x.transpose(1, 2)
-
-        x = x + comfy.ops.cast_to_input(self.embed_positions.weight[:, :x.shape[1]], x)
-
-        all_x = ()
-        for layer in self.layers:
-            all_x += (x,)
-            x = layer(x)
-
-        x = self.layer_norm(x)
-        all_x += (x,)
-        return x, all_x
-
-
-class WhisperLargeV3(nn.Module):
-    def __init__(
-        self,
-        n_mels: int = 128,
-        n_audio_ctx: int = 1500,
-        n_audio_state: int = 1280,
-        n_audio_head: int = 20,
-        n_audio_layer: int = 32,
-        dtype=None,
-        device=None,
-        operations=None
-    ):
-        super().__init__()
-
-        self.feature_extractor = WhisperFeatureExtractor(n_mels=n_mels, device=device)
-
-        self.encoder = AudioEncoder(
-            n_mels, n_audio_ctx, n_audio_state, n_audio_head, n_audio_layer,
-            dtype=dtype, device=device, operations=operations
-        )
-
-    def forward(self, audio):
-        mel = self.feature_extractor(audio)
-        x, all_x = self.encoder(mel)
-        return x, all_x
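Review note: the removed `WhisperFeatureExtractor` fixes every clip to `n_samples = 480000` (30 s at 16 kHz) by truncating or zero-padding, then converts the mel power spectrogram to a clamped, rescaled log scale. The sketch below shows only those two steps in plain Python, without the `torchaudio` mel transform. The helper names `pad_or_trim` and `normalize_log_mel` and the nested-list input are assumptions for illustration, not part of the removed file.

```python
import math

N_SAMPLES = 480000  # 30 s at 16 kHz, the value used by the removed extractor


def pad_or_trim(samples, n_samples=N_SAMPLES):
    """Truncate to n_samples, or right-pad with zeros up to n_samples."""
    if len(samples) > n_samples:
        return samples[:n_samples]
    return samples + [0.0] * (n_samples - len(samples))


def normalize_log_mel(mel):
    """mel: 2-D list of non-negative mel power values (hypothetical layout)."""
    # Clamp before the log so zero-power bins do not produce -inf.
    log_mel = [[math.log10(max(v, 1e-10)) for v in row] for row in mel]
    # Limit the dynamic range to 8 decades (80 dB in power) below the global maximum.
    floor = max(max(row) for row in log_mel) - 8.0
    # Shift and scale, as in (log_mel_spec + 4.0) / 4.0.
    return [[(max(v, floor) + 4.0) / 4.0 for v in row] for row in log_mel]


clip = pad_or_trim([0.1, -0.2, 0.3], n_samples=8)
scaled = normalize_log_mel([[1.0, 1e-12], [100.0, 0.01]])
```

In the removed code the maximum is taken over the whole batched tensor, so the floor is shared by every clip in the batch. The sketch mirrors that by using one global maximum.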
--- a/comfy/cldm/cldm.py
+++ b/comfy/cldm/cldm.py
@@ -413,8 +413,7 @@ class ControlNet(nn.Module):
        out_middle = []

        if self.num_classes is not None:
-            if y is None:
-                raise ValueError("y is None, did you try using a controlnet for SDXL on SD1?")
+            assert y.shape[0] == x.shape[0]
            emb = emb + self.label_emb(y)

        h = x
--- a/comfy/cli_args.py
+++ b/comfy/cli_args.py
@@ -105,7 +105,6 @@ cache_group = parser.add_mutually_exclusive_group()
 cache_group.add_argument("--cache-classic", action="store_true", help="Use the old style (aggressive) caching.")
 cache_group.add_argument("--cache-lru", type=int, default=0, help="Use LRU caching with a maximum of N node results cached. May use more RAM/VRAM.")
 cache_group.add_argument("--cache-none", action="store_true", help="Reduced RAM/VRAM usage at the expense of executing every node for each run.")
-cache_group.add_argument("--cache-ram", nargs='?', const=4.0, type=float, default=0, help="Use RAM pressure caching with the specified headroom threshold. If available RAM drops below the threhold the cache remove large items to free RAM. Default 4GB")

 attn_group = parser.add_mutually_exclusive_group()
 attn_group.add_argument("--use-split-cross-attention", action="store_true", help="Use the split cross attention optimization. Ignored when xformers is used.")
@@ -144,11 +143,8 @@ class PerformanceFeature(enum.Enum):
    Fp16Accumulation = "fp16_accumulation"
    Fp8MatrixMultiplication = "fp8_matrix_mult"
    CublasOps = "cublas_ops"
-    AutoTune = "autotune"

-parser.add_argument("--fast", nargs="*", type=PerformanceFeature, help="Enable some untested and potentially quality deteriorating optimizations. This is used to test new features so using it might crash your comfyui. --fast with no arguments enables everything. You can pass a list specific optimizations if you only want to enable specific ones. Current valid optimizations: {}".format(" ".join(map(lambda c: c.value, PerformanceFeature))))
-
-parser.add_argument("--disable-pinned-memory", action="store_true", help="Disable pinned memory use.")
+parser.add_argument("--fast", nargs="*", type=PerformanceFeature, help="Enable some untested and potentially quality deteriorating optimizations. --fast with no arguments enables everything. You can pass a list specific optimizations if you only want to enable specific ones. Current valid optimizations: fp16_accumulation fp8_matrix_mult cublas_ops")

 parser.add_argument("--mmap-torch-files", action="store_true", help="Use mmap when loading ckpt/pt files.")
 parser.add_argument("--disable-mmap", action="store_true", help="Don't use mmap when loading safetensors.")
@@ -160,7 +156,7 @@ parser.add_argument("--windows-standalone-build", action="store_true", help="Win
 parser.add_argument("--disable-metadata", action="store_true", help="Disable saving prompt metadata in files.")
 parser.add_argument("--disable-all-custom-nodes", action="store_true", help="Disable loading all custom nodes.")
 parser.add_argument("--whitelist-custom-nodes", type=str, nargs='+', default=[], help="Specify custom node folders to load even when --disable-all-custom-nodes is enabled.")
-parser.add_argument("--disable-api-nodes", action="store_true", help="Disable loading all api nodes. Also prevents the frontend from communicating with the internet.")
+parser.add_argument("--disable-api-nodes", action="store_true", help="Disable loading all api nodes.")

 parser.add_argument("--multi-user", action="store_true", help="Enables per-user storage.")

--- a/comfy/clip_model.py
+++ b/comfy/clip_model.py
@@ -61,12 +61,8 @@ class CLIPEncoder(torch.nn.Module):
    def forward(self, x, mask=None, intermediate_output=None):
        optimized_attention = optimized_attention_for_device(x.device, mask=mask is not None, small_input=True)

-        all_intermediate = None
        if intermediate_output is not None:
-            if intermediate_output == "all":
-                all_intermediate = []
-                intermediate_output = None
-            elif intermediate_output < 0:
+            if intermediate_output < 0:
                intermediate_output = len(self.layers) + intermediate_output

        intermediate = None
@@ -74,12 +70,6 @@ class CLIPEncoder(torch.nn.Module):
            x = l(x, mask, optimized_attention)
            if i == intermediate_output:
                intermediate = x.clone()
-            if all_intermediate is not None:
-                all_intermediate.append(x.unsqueeze(1).clone())
-
-        if all_intermediate is not None:
-            intermediate = torch.cat(all_intermediate, dim=1)
-
        return x, intermediate

 class CLIPEmbeddings(torch.nn.Module):
@@ -107,7 +97,7 @@ class CLIPTextModel_(torch.nn.Module):
        self.encoder = CLIPEncoder(num_layers, embed_dim, heads, intermediate_size, intermediate_activation, dtype, device, operations)
        self.final_layer_norm = operations.LayerNorm(embed_dim, dtype=dtype, device=device)

-    def forward(self, input_tokens=None, attention_mask=None, embeds=None, num_tokens=None, intermediate_output=None, final_layer_norm_intermediate=True, dtype=torch.float32, embeds_info=[]):
+    def forward(self, input_tokens=None, attention_mask=None, embeds=None, num_tokens=None, intermediate_output=None, final_layer_norm_intermediate=True, dtype=torch.float32):
        if embeds is not None:
            x = embeds + comfy.ops.cast_to(self.embeddings.position_embedding.weight, dtype=dtype, device=embeds.device)
        else:
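Review note: after this change `CLIPEncoder.forward` only accepts an integer (or `None`) for `intermediate_output`, and a negative value is resolved against the number of layers, so `-2` selects the output of the second-to-last layer. The sketch below shows that index handling in isolation. `run_layers` and the toy lambda layers are hypothetical stand-ins, not the real modules.

```python
def run_layers(layers, x, intermediate_output=None):
    """Apply layers in order and capture the output of one selected layer."""
    if intermediate_output is not None and intermediate_output < 0:
        # Same resolution as len(self.layers) + intermediate_output in the diff.
        intermediate_output = len(layers) + intermediate_output

    intermediate = None
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == intermediate_output:
            intermediate = x
    return x, intermediate


# Toy "layers" operating on a number instead of a tensor.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
final, penultimate = run_layers(layers, 5, intermediate_output=-2)
```

With the `"all"` branch removed, passing the string `"all"` would reach the `< 0` comparison and raise a `TypeError` in Python 3, so callers that relied on it need to be updated in the same change.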
--- a/comfy/clip_vision.py
+++ b/comfy/clip_vision.py
@@ -50,13 +50,7 @@ class ClipVisionModel():
        self.image_size = config.get("image_size", 224)
        self.image_mean = config.get("image_mean", [0.48145466, 0.4578275, 0.40821073])
        self.image_std = config.get("image_std", [0.26862954, 0.26130258, 0.27577711])
-        model_type = config.get("model_type", "clip_vision_model")
-        model_class = IMAGE_ENCODERS.get(model_type)
-        if model_type == "siglip_vision_model":
-            self.return_all_hidden_states = True
-        else:
-            self.return_all_hidden_states = False
-
+        model_class = IMAGE_ENCODERS.get(config.get("model_type", "clip_vision_model"))
        self.load_device = comfy.model_management.text_encoder_device()
        offload_device = comfy.model_management.text_encoder_offload_device()
        self.dtype = comfy.model_management.text_encoder_dtype(self.load_device)
@@ -74,18 +68,12 @@ class ClipVisionModel():
    def encode_image(self, image, crop=True):
        comfy.model_management.load_model_gpu(self.patcher)
        pixel_values = clip_preprocess(image.to(self.load_device), size=self.image_size, mean=self.image_mean, std=self.image_std, crop=crop).float()
-        out = self.model(pixel_values=pixel_values, intermediate_output='all' if self.return_all_hidden_states else -2)
+        out = self.model(pixel_values=pixel_values, intermediate_output=-2)

        outputs = Output()
        outputs["last_hidden_state"] = out[0].to(comfy.model_management.intermediate_device())
        outputs["image_embeds"] = out[2].to(comfy.model_management.intermediate_device())
-        if self.return_all_hidden_states:
-            all_hs = out[1].to(comfy.model_management.intermediate_device())
-            outputs["penultimate_hidden_states"] = all_hs[:, -2]
-            outputs["all_hidden_states"] = all_hs
-        else:
-            outputs["penultimate_hidden_states"] = out[1].to(comfy.model_management.intermediate_device())
-
+        outputs["penultimate_hidden_states"] = out[1].to(comfy.model_management.intermediate_device())
        outputs["mm_projected"] = out[3]
        return outputs

@@ -136,12 +124,8 @@ def load_clipvision_from_sd(sd, prefix="", convert_keys=False):
                json_config = os.path.join(os.path.dirname(os.path.realpath(__file__)), "clip_vision_config_vitl_336.json")
        else:
            json_config = os.path.join(os.path.dirname(os.path.realpath(__file__)), "clip_vision_config_vitl.json")
-
-    # Dinov2
-    elif 'encoder.layer.39.layer_scale2.lambda1' in sd:
+    elif "embeddings.patch_embeddings.projection.weight" in sd:
        json_config = os.path.join(os.path.join(os.path.dirname(os.path.realpath(__file__)), "image_encoders"), "dino2_giant.json")
-    elif 'encoder.layer.23.layer_scale2.lambda1' in sd:
-        json_config = os.path.join(os.path.join(os.path.dirname(os.path.realpath(__file__)), "image_encoders"), "dino2_large.json")
    else:
        return None

--- a/comfy/controlnet.py
+++ b/comfy/controlnet.py
@@ -36,7 +36,6 @@ import comfy.ldm.cascade.controlnet
 import comfy.cldm.mmdit
 import comfy.ldm.hydit.controlnet
 import comfy.ldm.flux.controlnet
-import comfy.ldm.qwen_image.controlnet
 import comfy.cldm.dit_embedder
 from typing import TYPE_CHECKING
 if TYPE_CHECKING:
@@ -237,11 +236,11 @@ class ControlNet(ControlBase):
            self.cond_hint = None
            compression_ratio = self.compression_ratio
            if self.vae is not None:
-                compression_ratio *= self.vae.spacial_compression_encode()
+                compression_ratio *= self.vae.downscale_ratio
            else:
                if self.latent_format is not None:
                    raise ValueError("This Controlnet needs a VAE but none was provided, please use a ControlNetApply node with a VAE input and connect it.")
-            self.cond_hint = comfy.utils.common_upscale(self.cond_hint_original, x_noisy.shape[-1] * compression_ratio, x_noisy.shape[-2] * compression_ratio, self.upscale_algorithm, "center")
+            self.cond_hint = comfy.utils.common_upscale(self.cond_hint_original, x_noisy.shape[3] * compression_ratio, x_noisy.shape[2] * compression_ratio, self.upscale_algorithm, "center")
            self.cond_hint = self.preprocess_image(self.cond_hint)
            if self.vae is not None:
                loaded_models = comfy.model_management.loaded_models(only_currently_used=True)
@@ -253,10 +252,7 @@ class ControlNet(ControlBase):
                to_concat = []
                for c in self.extra_concat_orig:
                    c = c.to(self.cond_hint.device)
-                    c = comfy.utils.common_upscale(c, self.cond_hint.shape[-1], self.cond_hint.shape[-2], self.upscale_algorithm, "center")
-                    if c.ndim < self.cond_hint.ndim:
-                        c = c.unsqueeze(2)
-                        c = comfy.utils.repeat_to_batch_size(c, self.cond_hint.shape[2], dim=2)
+                    c = comfy.utils.common_upscale(c, self.cond_hint.shape[3], self.cond_hint.shape[2], self.upscale_algorithm, "center")
                    to_concat.append(comfy.utils.repeat_to_batch_size(c, self.cond_hint.shape[0]))
                self.cond_hint = torch.cat([self.cond_hint] + to_concat, dim=1)

@@ -310,13 +306,11 @@ class ControlLoraOps:
            self.bias = None

        def forward(self, input):
-            weight, bias, offload_stream = comfy.ops.cast_bias_weight(self, input, offloadable=True)
+            weight, bias = comfy.ops.cast_bias_weight(self, input)
            if self.up is not None:
-                x = torch.nn.functional.linear(input, weight + (torch.mm(self.up.flatten(start_dim=1), self.down.flatten(start_dim=1))).reshape(self.weight.shape).type(input.dtype), bias)
+                return torch.nn.functional.linear(input, weight + (torch.mm(self.up.flatten(start_dim=1), self.down.flatten(start_dim=1))).reshape(self.weight.shape).type(input.dtype), bias)
            else:
-                x = torch.nn.functional.linear(input, weight, bias)
-            comfy.ops.uncast_bias_weight(self, weight, bias, offload_stream)
-            return x
+                return torch.nn.functional.linear(input, weight, bias)

    class Conv2d(torch.nn.Module, comfy.ops.CastWeightBiasOp):
        def __init__(
@@ -352,13 +346,12 @@ class ControlLoraOps:


        def forward(self, input):
-            weight, bias, offload_stream = comfy.ops.cast_bias_weight(self, input, offloadable=True)
+            weight, bias = comfy.ops.cast_bias_weight(self, input)
            if self.up is not None:
-                x = torch.nn.functional.conv2d(input, weight + (torch.mm(self.up.flatten(start_dim=1), self.down.flatten(start_dim=1))).reshape(self.weight.shape).type(input.dtype), bias, self.stride, self.padding, self.dilation, self.groups)
+                return torch.nn.functional.conv2d(input, weight + (torch.mm(self.up.flatten(start_dim=1), self.down.flatten(start_dim=1))).reshape(self.weight.shape).type(input.dtype), bias, self.stride, self.padding, self.dilation, self.groups)
            else:
-                x = torch.nn.functional.conv2d(input, weight, bias, self.stride, self.padding, self.dilation, self.groups)
-            comfy.ops.uncast_bias_weight(self, weight, bias, offload_stream)
-            return x
+                return torch.nn.functional.conv2d(input, weight, bias, self.stride, self.padding, self.dilation, self.groups)
+

 class ControlLora(ControlNet):
    def __init__(self, control_weights, global_average_pooling=False, model_options={}): #TODO? model_options
@@ -589,22 +582,6 @@ def load_controlnet_flux_instantx(sd, model_options={}):
    control = ControlNet(control_model, compression_ratio=1, latent_format=latent_format, concat_mask=concat_mask, load_device=load_device, manual_cast_dtype=manual_cast_dtype, extra_conds=extra_conds)
    return control

-def load_controlnet_qwen_instantx(sd, model_options={}):
-    model_config, operations, load_device, unet_dtype, manual_cast_dtype, offload_device = controlnet_config(sd, model_options=model_options)
-    control_latent_channels = sd.get("controlnet_x_embedder.weight").shape[1]
-
-    extra_condition_channels = 0
-    concat_mask = False
-    if control_latent_channels == 68: #inpaint controlnet
-        extra_condition_channels = control_latent_channels - 64
-        concat_mask = True
-    control_model = comfy.ldm.qwen_image.controlnet.QwenImageControlNetModel(extra_condition_channels=extra_condition_channels, operations=operations, device=offload_device, dtype=unet_dtype, **model_config.unet_config)
-    control_model = controlnet_load_state_dict(control_model, sd)
-    latent_format = comfy.latent_formats.Wan21()
-    extra_conds = []
-    control = ControlNet(control_model, compression_ratio=1, latent_format=latent_format, concat_mask=concat_mask, load_device=load_device, manual_cast_dtype=manual_cast_dtype, extra_conds=extra_conds)
-    return control
-
 def convert_mistoline(sd):
    return comfy.utils.state_dict_prefix_replace(sd, {"single_controlnet_blocks.": "controlnet_single_blocks."})

@@ -678,11 +655,8 @@ def load_controlnet_state_dict(state_dict, model=None, model_options={}):
                return load_controlnet_sd35(controlnet_data, model_options=model_options) #Stability sd3.5 format
            else:
                return load_controlnet_mmdit(controlnet_data, model_options=model_options) #SD3 diffusers controlnet
-        elif "transformer_blocks.0.img_mlp.net.0.proj.weight" in controlnet_data:
-            return load_controlnet_qwen_instantx(controlnet_data, model_options=model_options)
        elif "controlnet_x_embedder.weight" in controlnet_data:
            return load_controlnet_flux_instantx(controlnet_data, model_options=model_options)
-
    elif "controlnet_blocks.0.linear.weight" in controlnet_data: #mistoline flux
        return load_controlnet_flux_xlabs_mistoline(convert_mistoline(controlnet_data), mistoline=True, model_options=model_options)

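Review note: both `ControlLoraOps` forward methods add a low-rank correction to the base weight before calling the functional op: `weight + mm(up.flatten(1), down.flatten(1)).reshape(weight.shape)`. The sketch below works through the linear case with nested lists so the shapes are easy to follow. `matmul`, `effective_weight`, and `linear` are hypothetical helpers written for this note. The real code operates on tensors and also casts dtypes.

```python
def matmul(a, b):
    """Plain (rows x inner) @ (inner x cols) matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def effective_weight(weight, up, down):
    """Base weight plus the rank-r update up @ down (same shape as weight)."""
    delta = matmul(up, down)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


def linear(inp, weight, bias):
    """y = W x + b for a single input vector, as torch.nn.functional.linear does per row."""
    return [sum(w * x for w, x in zip(row, inp)) + b for row, b in zip(weight, bias)]


weight = [[1.0, 0.0], [0.0, 1.0]]   # out_features x in_features
up = [[1.0], [2.0]]                 # out_features x rank
down = [[3.0, 4.0]]                 # rank x in_features
patched = effective_weight(weight, up, down)
output = linear([1.0, 1.0], patched, [0.0, 0.0])
```

The update is recomputed on every forward call rather than merged into the stored weight, which is why the cast weight returned by `cast_bias_weight` appears inside the sum.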
--- a/comfy/image_encoders/dino2.py
+++ b/comfy/image_encoders/dino2.py
@@ -31,20 +31,6 @@ class LayerScale(torch.nn.Module):
    def forward(self, x):
        return x * comfy.model_management.cast_to_device(self.lambda1, x.device, x.dtype)

-class Dinov2MLP(torch.nn.Module):
-    def __init__(self, hidden_size: int, dtype, device, operations):
-        super().__init__()
-
-        mlp_ratio = 4
-        hidden_features = int(hidden_size * mlp_ratio)
-        self.fc1 = operations.Linear(hidden_size, hidden_features, bias = True, device=device, dtype=dtype)
-        self.fc2 = operations.Linear(hidden_features, hidden_size, bias = True, device=device, dtype=dtype)
-
-    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
-        hidden_state = self.fc1(hidden_state)
-        hidden_state = torch.nn.functional.gelu(hidden_state)
-        hidden_state = self.fc2(hidden_state)
-        return hidden_state

 class SwiGLUFFN(torch.nn.Module):
    def __init__(self, dim, dtype, device, operations):
@@ -64,15 +50,12 @@ class SwiGLUFFN(torch.nn.Module):


 class Dino2Block(torch.nn.Module):
-    def __init__(self, dim, num_heads, layer_norm_eps, dtype, device, operations, use_swiglu_ffn):
+    def __init__(self, dim, num_heads, layer_norm_eps, dtype, device, operations):
        super().__init__()
        self.attention = Dino2AttentionBlock(dim, num_heads, layer_norm_eps, dtype, device, operations)
        self.layer_scale1 = LayerScale(dim, dtype, device, operations)
        self.layer_scale2 = LayerScale(dim, dtype, device, operations)
-        if use_swiglu_ffn:
-            self.mlp = SwiGLUFFN(dim, dtype, device, operations)
-        else:
-            self.mlp = Dinov2MLP(dim, dtype, device, operations)
+        self.mlp = SwiGLUFFN(dim, dtype, device, operations)
        self.norm1 = operations.LayerNorm(dim, eps=layer_norm_eps, dtype=dtype, device=device)
        self.norm2 = operations.LayerNorm(dim, eps=layer_norm_eps, dtype=dtype, device=device)

@@ -83,10 +66,9 @@ class Dino2Block(torch.nn.Module):


 class Dino2Encoder(torch.nn.Module):
-    def __init__(self, dim, num_heads, layer_norm_eps, num_layers, dtype, device, operations, use_swiglu_ffn):
+    def __init__(self, dim, num_heads, layer_norm_eps, num_layers, dtype, device, operations):
        super().__init__()
-        self.layer = torch.nn.ModuleList([Dino2Block(dim, num_heads, layer_norm_eps, dtype, device, operations, use_swiglu_ffn = use_swiglu_ffn)
-                                          for _ in range(num_layers)])
+        self.layer = torch.nn.ModuleList([Dino2Block(dim, num_heads, layer_norm_eps, dtype, device, operations) for _ in range(num_layers)])

    def forward(self, x, intermediate_output=None):
        optimized_attention = optimized_attention_for_device(x.device, False, small_input=True)
@@ -96,8 +78,8 @@ class Dino2Encoder(torch.nn.Module):
                intermediate_output = len(self.layer) + intermediate_output

        intermediate = None
-        for i, layer in enumerate(self.layer):
-            x = layer(x, optimized_attention)
+        for i, l in enumerate(self.layer):
+            x = l(x, optimized_attention)
            if i == intermediate_output:
                intermediate = x.clone()
        return x, intermediate
@@ -146,10 +128,9 @@ class Dinov2Model(torch.nn.Module):
        dim = config_dict["hidden_size"]
        heads = config_dict["num_attention_heads"]
        layer_norm_eps = config_dict["layer_norm_eps"]
-        use_swiglu_ffn = config_dict["use_swiglu_ffn"]

        self.embeddings = Dino2Embeddings(dim, dtype, device, operations)
-        self.encoder = Dino2Encoder(dim, heads, layer_norm_eps, num_layers, dtype, device, operations, use_swiglu_ffn = use_swiglu_ffn)
+        self.encoder = Dino2Encoder(dim, heads, layer_norm_eps, num_layers, dtype, device, operations)
        self.layernorm = operations.LayerNorm(dim, eps=layer_norm_eps, dtype=dtype, device=device)

    def forward(self, pixel_values, attention_mask=None, intermediate_output=None):
--- a/comfy/image_encoders/dino2_large.json
+++ b/comfy/image_encoders/dino2_large.json
@@ -1,22 +0,0 @@
-{
-  "hidden_size": 1024,
-  "use_mask_token": true,
-  "patch_size": 14,
-  "image_size": 518,
-  "num_channels": 3,
-  "num_attention_heads": 16,
-  "initializer_range": 0.02,
-  "attention_probs_dropout_prob": 0.0,
-  "hidden_dropout_prob": 0.0,
-  "hidden_act": "gelu",
-  "mlp_ratio": 4,
-  "model_type": "dinov2",
-  "num_hidden_layers": 24,
-  "layer_norm_eps": 1e-6,
-  "qkv_bias": true,
-  "use_swiglu_ffn": false,
-  "layerscale_value": 1.0,
-  "drop_path_rate": 0.0,
-  "image_mean": [0.485, 0.456, 0.406],
-  "image_std": [0.229, 0.224, 0.225]
-}
--- a/comfy/k_diffusion/sampling.py
+++ b/comfy/k_diffusion/sampling.py
@@ -86,24 +86,24 @@ class BatchedBrownianTree:
    """A wrapper around torchsde.BrownianTree that enables batches of entropy."""

    def __init__(self, x, t0, t1, seed=None, **kwargs):
-        self.cpu_tree = kwargs.pop("cpu", True)
+        self.cpu_tree = True
+        if "cpu" in kwargs:
+            self.cpu_tree = kwargs.pop("cpu")
        t0, t1, self.sign = self.sort(t0, t1)
-        w0 = kwargs.pop('w0', None)
-        if w0 is None:
-            w0 = torch.zeros_like(x)
-        self.batched = False
+        w0 = kwargs.get('w0', torch.zeros_like(x))
        if seed is None:
-            seed = (torch.randint(0, 2 ** 63 - 1, ()).item(),)
-        elif isinstance(seed, (tuple, list)):
-            if len(seed) != x.shape[0]:
-                raise ValueError("Passing a list or tuple of seeds to BatchedBrownianTree requires a length matching the batch size.")
-            self.batched = True
+            seed = torch.randint(0, 2 ** 63 - 1, []).item()
+        self.batched = True
+        try:
+            assert len(seed) == x.shape[0]
            w0 = w0[0]
-        else:
-            seed = (seed,)
+        except TypeError:
+            seed = [seed]
+            self.batched = False
        if self.cpu_tree:
-            t0, w0, t1 = t0.detach().cpu(), w0.detach().cpu(), t1.detach().cpu()
-        self.trees = tuple(torchsde.BrownianTree(t0, w0, t1, entropy=s, **kwargs) for s in seed)
+            self.trees = [torchsde.BrownianTree(t0.cpu(), w0.cpu(), t1.cpu(), entropy=s, **kwargs) for s in seed]
+        else:
+            self.trees = [torchsde.BrownianTree(t0, w0, t1, entropy=s, **kwargs) for s in seed]

    @staticmethod
    def sort(a, b):
@@ -111,10 +111,11 @@ class BatchedBrownianTree:

    def __call__(self, t0, t1):
        t0, t1, sign = self.sort(t0, t1)
-        device, dtype = t0.device, t0.dtype
        if self.cpu_tree:
-            t0, t1 = t0.detach().cpu().float(), t1.detach().cpu().float()
-        w = torch.stack([tree(t0, t1) for tree in self.trees]).to(device=device, dtype=dtype) * (self.sign * sign)
+            w = torch.stack([tree(t0.cpu().float(), t1.cpu().float()).to(t0.dtype).to(t0.device) for tree in self.trees]) * (self.sign * sign)
+        else:
+            w = torch.stack([tree(t0, t1) for tree in self.trees]) * (self.sign * sign)
+
        return w if self.batched else w[0]
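
Review note on the seed handling in `BatchedBrownianTree.__init__`: the two sides of this hunk normalize `seed` differently. One checks `isinstance(seed, (tuple, list))` and raises `ValueError` on a length mismatch. The other calls `len(seed)` inside `try`/`except TypeError` and relies on `assert`. The sketch below restates the explicit variant in plain Python so the three cases (no seed, scalar seed, per-sample seeds) can be checked in isolation. `normalize_seeds` is a hypothetical helper, not part of the codebase, and `random.getrandbits` is only a stand-in for `torch.randint`.

```python
import random


def normalize_seeds(seed, batch_size):
    """Return (seeds, batched) for a scalar, missing, or per-sample seed.

    Hypothetical helper mirroring the isinstance-based branch of the hunk.
    """
    if seed is None:
        # One random seed shared by the whole batch (stand-in for torch.randint).
        return (random.getrandbits(63),), False
    if isinstance(seed, (tuple, list)):
        if len(seed) != batch_size:
            raise ValueError("a list or tuple of seeds must match the batch size")
        # One tree per sample, so the result is treated as batched.
        return tuple(seed), True
    # A single scalar seed is wrapped so callers can always iterate.
    return (seed,), False


seeds, batched = normalize_seeds(1234, batch_size=4)
per_sample, per_sample_batched = normalize_seeds([1, 2, 3, 4], batch_size=4)
```

The explicit check keeps the error path independent of `assert`, which Python skips when run with `-O`.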


@@ -170,16 +171,6 @@ def offset_first_sigma_for_snr(sigmas, model_sampling, percent_offset=1e-4):
    return sigmas


-def ei_h_phi_1(h: torch.Tensor) -> torch.Tensor:
-    """Compute the result of h*phi_1(h) in exponential integrator methods."""
-    return torch.expm1(h)
-
-
-def ei_h_phi_2(h: torch.Tensor) -> torch.Tensor:
-    """Compute the result of h*phi_2(h) in exponential integrator methods."""
-    return (torch.expm1(h) - h) / h
-
-
 @torch.no_grad()
 def sample_euler(model, x, sigmas, extra_args=None, callback=None, disable=None, s_churn=0., s_tmin=0., s_tmax=float('inf'), s_noise=1.):
    """Implements Algorithm 2 (Euler steps) from Karras et al. (2022)."""
@@ -862,11 +853,6 @@ def sample_dpmpp_2m_sde(model, x, sigmas, extra_args=None, callback=None, disabl
    return x


-@torch.no_grad()
-def sample_dpmpp_2m_sde_heun(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, solver_type='heun'):
-    return sample_dpmpp_2m_sde(model, x, sigmas, extra_args=extra_args, callback=callback, disable=disable, eta=eta, s_noise=s_noise, noise_sampler=noise_sampler, solver_type=solver_type)
-
-
 @torch.no_grad()
 def sample_dpmpp_3m_sde(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None):
    """DPM-Solver++(3M) SDE."""
@@ -939,16 +925,6 @@ def sample_dpmpp_3m_sde_gpu(model, x, sigmas, extra_args=None, callback=None, di
    return sample_dpmpp_3m_sde(model, x, sigmas, extra_args=extra_args, callback=callback, disable=disable, eta=eta, s_noise=s_noise, noise_sampler=noise_sampler)


-@torch.no_grad()
-def sample_dpmpp_2m_sde_heun_gpu(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, solver_type='heun'):
-    if len(sigmas) <= 1:
-        return x
-    extra_args = {} if extra_args is None else extra_args
-    sigma_min, sigma_max = sigmas[sigmas > 0].min(), sigmas.max()
-    noise_sampler = BrownianTreeNoiseSampler(x, sigma_min, sigma_max, seed=extra_args.get("seed", None), cpu=False) if noise_sampler is None else noise_sampler
-    return sample_dpmpp_2m_sde_heun(model, x, sigmas, extra_args=extra_args, callback=callback, disable=disable, eta=eta, s_noise=s_noise, noise_sampler=noise_sampler, solver_type=solver_type)
-
-
 @torch.no_grad()
 def sample_dpmpp_2m_sde_gpu(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, solver_type='midpoint'):
    if len(sigmas) <= 1:
@@ -1559,12 +1535,13 @@ def sample_er_sde(model, x, sigmas, extra_args=None, callback=None, disable=None
 @torch.no_grad()
 def sample_seeds_2(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, r=0.5):
    """SEEDS-2 - Stochastic Explicit Exponential Derivative-free Solvers (VP Data Prediction) stage 2.
-    arXiv: https://arxiv.org/abs/2305.14267 (NeurIPS 2023)
+    arXiv: https://arxiv.org/abs/2305.14267
    """
    extra_args = {} if extra_args is None else extra_args
    seed = extra_args.get("seed", None)
    noise_sampler = default_noise_sampler(x, seed=seed) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])
+
    inject_noise = eta > 0 and s_noise > 0

    model_sampling = model.inner_model.model_patcher.get_model_object('model_sampling')
@@ -1572,53 +1549,55 @@ def sample_seeds_2(model, x, sigmas, extra_args=None, callback=None, disable=Non
    lambda_fn = partial(sigma_to_half_log_snr, model_sampling=model_sampling)
    sigmas = offset_first_sigma_for_snr(sigmas, model_sampling)

-    fac = 1 / (2 * r)
-
    for i in trange(len(sigmas) - 1, disable=disable):
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
-
        if sigmas[i + 1] == 0:
            x = denoised
-            continue
+        else:
+            lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
+            h = lambda_t - lambda_s
+            h_eta = h * (eta + 1)
+            lambda_s_1 = lambda_s + r * h
+            fac = 1 / (2 * r)
+            sigma_s_1 = sigma_fn(lambda_s_1)

-        lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
-        h = lambda_t - lambda_s
-        h_eta = h * (eta + 1)
-        lambda_s_1 = torch.lerp(lambda_s, lambda_t, r)
-        sigma_s_1 = sigma_fn(lambda_s_1)
+            # alpha_t = sigma_t * exp(log(alpha_t / sigma_t)) = sigma_t * exp(lambda_t)
+            alpha_s_1 = sigma_s_1 * lambda_s_1.exp()
+            alpha_t = sigmas[i + 1] * lambda_t.exp()

-        alpha_s_1 = sigma_s_1 * lambda_s_1.exp()
-        alpha_t = sigmas[i + 1] * lambda_t.exp()
+            coeff_1, coeff_2 = (-r * h_eta).expm1(), (-h_eta).expm1()
+            if inject_noise:
+                # 0 < r < 1
+                noise_coeff_1 = (-2 * r * h * eta).expm1().neg().sqrt()
+                noise_coeff_2 = (-r * h * eta).exp() * (-2 * (1 - r) * h * eta).expm1().neg().sqrt()
+                noise_1, noise_2 = noise_sampler(sigmas[i], sigma_s_1), noise_sampler(sigma_s_1, sigmas[i + 1])

-        # Step 1
-        x_2 = sigma_s_1 / sigmas[i] * (-r * h * eta).exp() * x - alpha_s_1 * ei_h_phi_1(-r * h_eta) * denoised
-        if inject_noise:
-            sde_noise = (-2 * r * h * eta).expm1().neg().sqrt() * noise_sampler(sigmas[i], sigma_s_1)
-            x_2 = x_2 + sde_noise * sigma_s_1 * s_noise
-        denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)
+            # Step 1
+            x_2 = sigma_s_1 / sigmas[i] * (-r * h * eta).exp() * x - alpha_s_1 * coeff_1 * denoised
+            if inject_noise:
+                x_2 = x_2 + sigma_s_1 * (noise_coeff_1 * noise_1) * s_noise
+            denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)

-        # Step 2
-        denoised_d = torch.lerp(denoised, denoised_2, fac)
-        x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x - alpha_t * ei_h_phi_1(-h_eta) * denoised_d
-        if inject_noise:
-            segment_factor = (r - 1) * h * eta
-            sde_noise = sde_noise * segment_factor.exp()
-            sde_noise = sde_noise + segment_factor.mul(2).expm1().neg().sqrt() * noise_sampler(sigma_s_1, sigmas[i + 1])
-            x = x + sde_noise * sigmas[i + 1] * s_noise
+            # Step 2
+            denoised_d = (1 - fac) * denoised + fac * denoised_2
+            x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x - alpha_t * coeff_2 * denoised_d
+            if inject_noise:
+                x = x + sigmas[i + 1] * (noise_coeff_2 * noise_1 + noise_coeff_1 * noise_2) * s_noise
    return x


 @torch.no_grad()
 def sample_seeds_3(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, r_1=1./3, r_2=2./3):
    """SEEDS-3 - Stochastic Explicit Exponential Derivative-free Solvers (VP Data Prediction) stage 3.
-    arXiv: https://arxiv.org/abs/2305.14267 (NeurIPS 2023)
+    arXiv: https://arxiv.org/abs/2305.14267
    """
    extra_args = {} if extra_args is None else extra_args
    seed = extra_args.get("seed", None)
    noise_sampler = default_noise_sampler(x, seed=seed) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])
+
    inject_noise = eta > 0 and s_noise > 0

    model_sampling = model.inner_model.model_patcher.get_model_object('model_sampling')
@@ -1630,49 +1609,45 @@ def sample_seeds_3(model, x, sigmas, extra_args=None, callback=None, disable=Non
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
-
        if sigmas[i + 1] == 0:
            x = denoised
-            continue
+        else:
+            lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
+            h = lambda_t - lambda_s
+            h_eta = h * (eta + 1)
+            lambda_s_1 = lambda_s + r_1 * h
+            lambda_s_2 = lambda_s + r_2 * h
+            sigma_s_1, sigma_s_2 = sigma_fn(lambda_s_1), sigma_fn(lambda_s_2)

-        lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
-        h = lambda_t - lambda_s
-        h_eta = h * (eta + 1)
-        lambda_s_1 = torch.lerp(lambda_s, lambda_t, r_1)
-        lambda_s_2 = torch.lerp(lambda_s, lambda_t, r_2)
-        sigma_s_1, sigma_s_2 = sigma_fn(lambda_s_1), sigma_fn(lambda_s_2)
+            # alpha_t = sigma_t * exp(log(alpha_t / sigma_t)) = sigma_t * exp(lambda_t)
+            alpha_s_1 = sigma_s_1 * lambda_s_1.exp()
+            alpha_s_2 = sigma_s_2 * lambda_s_2.exp()
+            alpha_t = sigmas[i + 1] * lambda_t.exp()

-        alpha_s_1 = sigma_s_1 * lambda_s_1.exp()
-        alpha_s_2 = sigma_s_2 * lambda_s_2.exp()
-        alpha_t = sigmas[i + 1] * lambda_t.exp()
+            coeff_1, coeff_2, coeff_3 = (-r_1 * h_eta).expm1(), (-r_2 * h_eta).expm1(), (-h_eta).expm1()
+            if inject_noise:
+                # 0 < r_1 < r_2 < 1
+                noise_coeff_1 = (-2 * r_1 * h * eta).expm1().neg().sqrt()
+                noise_coeff_2 = (-r_1 * h * eta).exp() * (-2 * (r_2 - r_1) * h * eta).expm1().neg().sqrt()
+                noise_coeff_3 = (-r_2 * h * eta).exp() * (-2 * (1 - r_2) * h * eta).expm1().neg().sqrt()
+                noise_1, noise_2, noise_3 = noise_sampler(sigmas[i], sigma_s_1), noise_sampler(sigma_s_1, sigma_s_2), noise_sampler(sigma_s_2, sigmas[i + 1])

-        # Step 1
-        x_2 = sigma_s_1 / sigmas[i] * (-r_1 * h * eta).exp() * x - alpha_s_1 * ei_h_phi_1(-r_1 * h_eta) * denoised
-        if inject_noise:
-            sde_noise = (-2 * r_1 * h * eta).expm1().neg().sqrt() * noise_sampler(sigmas[i], sigma_s_1)
-            x_2 = x_2 + sde_noise * sigma_s_1 * s_noise
-        denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)
+            # Step 1
+            x_2 = sigma_s_1 / sigmas[i] * (-r_1 * h * eta).exp() * x - alpha_s_1 * coeff_1 * denoised
+            if inject_noise:
+                x_2 = x_2 + sigma_s_1 * (noise_coeff_1 * noise_1) * s_noise
+            denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)

-        # Step 2
-        a3_2 = r_2 / r_1 * ei_h_phi_2(-r_2 * h_eta)
-        a3_1 = ei_h_phi_1(-r_2 * h_eta) - a3_2
-        x_3 = sigma_s_2 / sigmas[i] * (-r_2 * h * eta).exp() * x - alpha_s_2 * (a3_1 * denoised + a3_2 * denoised_2)
-        if inject_noise:
-            segment_factor = (r_1 - r_2) * h * eta
-            sde_noise = sde_noise * segment_factor.exp()
-            sde_noise = sde_noise + segment_factor.mul(2).expm1().neg().sqrt() * noise_sampler(sigma_s_1, sigma_s_2)
-            x_3 = x_3 + sde_noise * sigma_s_2 * s_noise
-        denoised_3 = model(x_3, sigma_s_2 * s_in, **extra_args)
+            # Step 2
+            x_3 = sigma_s_2 / sigmas[i] * (-r_2 * h * eta).exp() * x - alpha_s_2 * coeff_2 * denoised + (r_2 / r_1) * alpha_s_2 * (coeff_2 / (r_2 * h_eta) + 1) * (denoised_2 - denoised)
+            if inject_noise:
+                x_3 = x_3 + sigma_s_2 * (noise_coeff_2 * noise_1 + noise_coeff_1 * noise_2) * s_noise
+            denoised_3 = model(x_3, sigma_s_2 * s_in, **extra_args)

-        # Step 3
-        b3 = ei_h_phi_2(-h_eta) / r_2
-        b1 = ei_h_phi_1(-h_eta) - b3
-        x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x - alpha_t * (b1 * denoised + b3 * denoised_3)
-        if inject_noise:
-            segment_factor = (r_2 - 1) * h * eta
-            sde_noise = sde_noise * segment_factor.exp()
-            sde_noise = sde_noise + segment_factor.mul(2).expm1().neg().sqrt() * noise_sampler(sigma_s_2, sigmas[i + 1])
-            x = x + sde_noise * sigmas[i + 1] * s_noise
+            # Step 3
+            x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x - alpha_t * coeff_3 * denoised + (1. / r_2) * alpha_t * (coeff_3 / h_eta + 1) * (denoised_3 - denoised)
+            if inject_noise:
+                x = x + sigmas[i + 1] * (noise_coeff_3 * noise_1 + noise_coeff_2 * noise_2 + noise_coeff_1 * noise_3) * s_noise
    return x
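
Review note on the SEEDS hunks: the two sides appear algebraically equivalent in their deterministic terms, and their injected noise has the same total variance. The scalar check below is a sketch under assumed values (`h`, `eta`, `r`, `r_1`, `r_2` are picked arbitrarily). `h_phi_2` restates the `ei_h_phi_2` helper from the diff with `math` in place of `torch`. It compares the SEEDS-3 step-2 weight on `(denoised_2 - denoised)` in both spellings, and confirms that the two SEEDS-2 noise coefficients combine to a variance of `1 - exp(-2 * h * eta)`.

```python
import math


def h_phi_2(h):
    """h * phi_2(h), the scalar counterpart of ei_h_phi_2 in the diff."""
    return (math.expm1(h) - h) / h


# Arbitrary illustrative values (assumptions, not taken from a real schedule).
h, eta = 0.7, 1.0
r_1, r_2 = 1.0 / 3, 2.0 / 3
h_eta = h * (eta + 1)

# SEEDS-3 step 2: weight applied to (denoised_2 - denoised).
a = r_2 * h_eta
inline_weight = (r_2 / r_1) * (math.expm1(-a) / a + 1)  # coeff_2 / (r_2 * h_eta) + 1 form
helper_weight = -(r_2 / r_1) * h_phi_2(-a)              # -a3_2 in the helper-based form

# SEEDS-2 noise: two independent unit-variance draws with these coefficients.
r = 0.5
noise_coeff_1 = math.sqrt(-math.expm1(-2 * r * h * eta))
noise_coeff_2 = math.exp(-r * h * eta) * math.sqrt(-math.expm1(-2 * (1 - r) * h * eta))
total_variance = noise_coeff_1 ** 2 + noise_coeff_2 ** 2
expected_variance = -math.expm1(-2 * h * eta)
```

The check covers scalar values only, not the tensor code paths. It also does not show that the two sides assign the same coefficient to the same noise draw.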


--- a/comfy/latent_formats.py
+++ b/comfy/latent_formats.py
@@ -178,15 +178,6 @@ class Flux(SD3):
    def process_out(self, latent):
        return (latent / self.scale_factor) + self.shift_factor

-class Flux2(LatentFormat):
-    latent_channels = 128
-
-    def process_in(self, latent):
-        return latent
-
-    def process_out(self, latent):
-        return latent
-
 class Mochi(LatentFormat):
    latent_channels = 12
    latent_dimensions = 3
@@ -542,154 +533,11 @@ class Wan22(Wan21):
                0.3971, 1.0600, 0.3943, 0.5537, 0.5444, 0.4089, 0.7468, 0.7744
            ]).view(1, self.latent_channels, 1, 1, 1)

-class HunyuanImage21(LatentFormat):
-    latent_channels = 64
-    latent_dimensions = 2
-    scale_factor = 0.75289
-
-    latent_rgb_factors = [
-        [-0.0154, -0.0397, -0.0521],
-        [ 0.0005,  0.0093,  0.0006],
-        [-0.0805, -0.0773, -0.0586],
-        [-0.0494, -0.0487, -0.0498],
-        [-0.0212, -0.0076, -0.0261],
-        [-0.0179, -0.0417, -0.0505],
-        [ 0.0158,  0.0310,  0.0239],
-        [ 0.0409,  0.0516,  0.0201],
-        [ 0.0350,  0.0553,  0.0036],
-        [-0.0447, -0.0327, -0.0479],
-        [-0.0038, -0.0221, -0.0365],
-        [-0.0423, -0.0718, -0.0654],
-        [ 0.0039,  0.0368,  0.0104],
-        [ 0.0655,  0.0217,  0.0122],
-        [ 0.0490,  0.1638,  0.2053],
-        [ 0.0932,  0.0829,  0.0650],
-        [-0.0186, -0.0209, -0.0135],
-        [-0.0080, -0.0076, -0.0148],
-        [-0.0284, -0.0201,  0.0011],
-        [-0.0642, -0.0294, -0.0777],
-        [-0.0035,  0.0076, -0.0140],
-        [ 0.0519,  0.0731,  0.0887],
-        [-0.0102,  0.0095,  0.0704],
-        [ 0.0068,  0.0218, -0.0023],
-        [-0.0726, -0.0486, -0.0519],
-        [ 0.0260,  0.0295,  0.0263],
-        [ 0.0250,  0.0333,  0.0341],
-        [ 0.0168, -0.0120, -0.0174],
-        [ 0.0226,  0.1037,  0.0114],
-        [ 0.2577,  0.1906,  0.1604],
-        [-0.0646, -0.0137, -0.0018],
-        [-0.0112,  0.0309,  0.0358],
-        [-0.0347,  0.0146, -0.0481],
-        [ 0.0234,  0.0179,  0.0201],
-        [ 0.0157,  0.0313,  0.0225],
-        [ 0.0423,  0.0675,  0.0524],
-        [-0.0031,  0.0027, -0.0255],
-        [ 0.0447,  0.0555,  0.0330],
-        [-0.0152,  0.0103,  0.0299],
-        [-0.0755, -0.0489, -0.0635],
-        [ 0.0853,  0.0788,  0.1017],
-        [-0.0272, -0.0294, -0.0471],
-        [ 0.0440,  0.0400, -0.0137],
-        [ 0.0335,  0.0317, -0.0036],
-        [-0.0344, -0.0621, -0.0984],
-        [-0.0127, -0.0630, -0.0620],
-        [-0.0648,  0.0360,  0.0924],
-        [-0.0781, -0.0801, -0.0409],
-        [ 0.0363,  0.0613,  0.0499],
-        [ 0.0238,  0.0034,  0.0041],
-        [-0.0135,  0.0258,  0.0310],
-        [ 0.0614,  0.1086,  0.0589],
-        [ 0.0428,  0.0350,  0.0205],
-        [ 0.0153,  0.0173, -0.0018],
-        [-0.0288, -0.0455, -0.0091],
-        [ 0.0344,  0.0109, -0.0157],
-        [-0.0205, -0.0247, -0.0187],
-        [ 0.0487,  0.0126,  0.0064],
-        [-0.0220, -0.0013,  0.0074],
-        [-0.0203, -0.0094, -0.0048],
-        [-0.0719,  0.0429, -0.0442],
-        [ 0.1042,  0.0497,  0.0356],
-        [-0.0659, -0.0578, -0.0280],
-        [-0.0060, -0.0322, -0.0234]]
-
-    latent_rgb_factors_bias = [0.0007, -0.0256, -0.0206]
-
-class HunyuanImage21Refiner(LatentFormat):
-    latent_channels = 64
-    latent_dimensions = 3
-    scale_factor = 1.03682
-
-    def process_in(self, latent):
-        out = latent * self.scale_factor
-        out = torch.cat((out[:, :, :1], out), dim=2)
-        out = out.permute(0, 2, 1, 3, 4)
-        b, f_times_2, c, h, w = out.shape
-        out = out.reshape(b, f_times_2 // 2, 2 * c, h, w)
-        out = out.permute(0, 2, 1, 3, 4).contiguous()
-        return out
-
-    def process_out(self, latent):
-        z = latent / self.scale_factor
-        z = z.permute(0, 2, 1, 3, 4)
-        b, f, c, h, w = z.shape
-        z = z.reshape(b, f, 2, c // 2, h, w)
-        z = z.permute(0, 1, 2, 3, 4, 5).reshape(b, f * 2, c // 2, h, w)
-        z = z.permute(0, 2, 1, 3, 4)
-        z = z[:, :, 1:]
-        return z
-
-class HunyuanVideo15(LatentFormat):
-    latent_rgb_factors = [
-        [ 0.0568, -0.0521, -0.0131],
-        [ 0.0014,  0.0735,  0.0326],
-        [ 0.0186,  0.0531, -0.0138],
-        [-0.0031,  0.0051,  0.0288],
-        [ 0.0110,  0.0556,  0.0432],
-        [-0.0041, -0.0023, -0.0485],
-        [ 0.0530,  0.0413,  0.0253],
-        [ 0.0283,  0.0251,  0.0339],
-        [ 0.0277, -0.0372, -0.0093],
-        [ 0.0393,  0.0944,  0.1131],
-        [ 0.0020,  0.0251,  0.0037],
-        [-0.0017,  0.0012,  0.0234],
-        [ 0.0468,  0.0436,  0.0203],
-        [ 0.0354,  0.0439, -0.0233],
-        [ 0.0090,  0.0123,  0.0346],
-        [ 0.0382,  0.0029,  0.0217],
-        [ 0.0261, -0.0300,  0.0030],
-        [-0.0088, -0.0220, -0.0283],
-        [-0.0272, -0.0121, -0.0363],
-        [-0.0664, -0.0622,  0.0144],
-        [ 0.0414,  0.0479,  0.0529],
-        [ 0.0355,  0.0612, -0.0247],
-        [ 0.0147,  0.0264,  0.0174],
-        [ 0.0438,  0.0038,  0.0542],
-        [ 0.0431, -0.0573, -0.0033],
-        [-0.0162, -0.0211, -0.0406],
-        [-0.0487, -0.0295, -0.0393],
-        [ 0.0005, -0.0109,  0.0253],
-        [ 0.0296,  0.0591,  0.0353],
-        [ 0.0119,  0.0181, -0.0306],
-        [-0.0085, -0.0362,  0.0229],
-        [ 0.0005, -0.0106,  0.0242]
-    ]
-
-    latent_rgb_factors_bias = [ 0.0456, -0.0202, -0.0644]
-    latent_channels = 32
-    latent_dimensions = 3
-    scale_factor = 1.03682
-
 class Hunyuan3Dv2(LatentFormat):
    latent_channels = 64
    latent_dimensions = 1
    scale_factor = 0.9990943042622529

-class Hunyuan3Dv2_1(LatentFormat):
-    scale_factor = 1.0039506158752403
-    latent_channels = 64
-    latent_dimensions = 1
-
 class Hunyuan3Dv2mini(LatentFormat):
    latent_channels = 64
    latent_dimensions = 1
@@ -698,20 +546,3 @@ class Hunyuan3Dv2mini(LatentFormat):
 class ACEAudio(LatentFormat):
    latent_channels = 8
    latent_dimensions = 2
-
-class ChromaRadiance(LatentFormat):
-    latent_channels = 3
-
-    def __init__(self):
-        self.latent_rgb_factors = [
-            # R    G    B
-            [ 1.0, 0.0, 0.0 ],
-            [ 0.0, 1.0, 0.0 ],
-            [ 0.0, 0.0, 1.0 ]
-        ]
-
-    def process_in(self, latent):
-        return latent
-
-    def process_out(self, latent):
-        return latent
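
Review note on `latent_rgb_factors`: each table in this file has one row per latent channel and three columns (R, G, B). The sketch below assumes the factors are consumed as a per-pixel linear projection with an optional bias. That reading is an assumption, since the consumer is not part of this diff, and `latent_pixel_to_rgb` is a hypothetical helper. With the identity table from `ChromaRadiance`, a 3-channel latent pixel maps to itself.

```python
def latent_pixel_to_rgb(latent_pixel, rgb_factors, bias=None):
    """Project one latent pixel (C channel values) to [R, G, B].

    Hypothetical helper: computes sum_c latent[c] * factors[c][k] + bias[k].
    """
    if len(latent_pixel) != len(rgb_factors):
        raise ValueError("need exactly one factor row per latent channel")
    rgb = list(bias) if bias is not None else [0.0, 0.0, 0.0]
    for value, row in zip(latent_pixel, rgb_factors):
        for k in range(3):
            rgb[k] += value * row[k]
    return rgb


# Identity table as in the ChromaRadiance class shown in the diff.
identity_factors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
rgb = latent_pixel_to_rgb([0.25, -0.5, 1.0], identity_factors)
```

The same helper would accept a 64-row table such as the one in `HunyuanImage21`, given a 64-value latent pixel and its `latent_rgb_factors_bias`.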
--- a/comfy/ldm/ace/attention.py
+++ b/comfy/ldm/ace/attention.py
@@ -133,7 +133,6 @@ class Attention(nn.Module):
        hidden_states: torch.Tensor,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
-        transformer_options={},
        **cross_attention_kwargs,
    ) -> torch.Tensor:
        return self.processor(
@@ -141,7 +140,6 @@ class Attention(nn.Module):
            hidden_states,
            encoder_hidden_states=encoder_hidden_states,
            attention_mask=attention_mask,
-            transformer_options=transformer_options,
            **cross_attention_kwargs,
        )

@@ -368,7 +366,6 @@ class CustomerAttnProcessor2_0:
        encoder_attention_mask: Optional[torch.FloatTensor] = None,
        rotary_freqs_cis: Union[torch.Tensor, Tuple[torch.Tensor]] = None,
        rotary_freqs_cis_cross: Union[torch.Tensor, Tuple[torch.Tensor]] = None,
-        transformer_options={},
        *args,
        **kwargs,
    ) -> torch.Tensor:
@@ -436,7 +433,7 @@ class CustomerAttnProcessor2_0:

        # the output of sdp = (batch, num_heads, seq_len, head_dim)
        hidden_states = optimized_attention(
-            query, key, value, heads=query.shape[1], mask=attention_mask, skip_reshape=True, transformer_options=transformer_options,
+            query, key, value, heads=query.shape[1], mask=attention_mask, skip_reshape=True,
        ).to(query.dtype)

        # linear proj
@@ -700,7 +697,6 @@ class LinearTransformerBlock(nn.Module):
        rotary_freqs_cis: Union[torch.Tensor, Tuple[torch.Tensor]] = None,
        rotary_freqs_cis_cross: Union[torch.Tensor, Tuple[torch.Tensor]] = None,
        temb: torch.FloatTensor = None,
-        transformer_options={},
    ):

        N = hidden_states.shape[0]
@@ -724,7 +720,6 @@ class LinearTransformerBlock(nn.Module):
                encoder_attention_mask=encoder_attention_mask,
                rotary_freqs_cis=rotary_freqs_cis,
                rotary_freqs_cis_cross=rotary_freqs_cis_cross,
-                transformer_options=transformer_options,
            )
        else:
            attn_output, _ = self.attn(
@@ -734,7 +729,6 @@ class LinearTransformerBlock(nn.Module):
                encoder_attention_mask=None,
                rotary_freqs_cis=rotary_freqs_cis,
                rotary_freqs_cis_cross=None,
-                transformer_options=transformer_options,
            )

        if self.use_adaln_single:
@@ -749,7 +743,6 @@ class LinearTransformerBlock(nn.Module):
                encoder_attention_mask=encoder_attention_mask,
                rotary_freqs_cis=rotary_freqs_cis,
                rotary_freqs_cis_cross=rotary_freqs_cis_cross,
-                transformer_options=transformer_options,
            )
            hidden_states = attn_output + hidden_states

--- a/comfy/ldm/ace/model.py
+++ b/comfy/ldm/ace/model.py
@@ -19,7 +19,6 @@ import torch
 from torch import nn

 import comfy.model_management
-import comfy.patcher_extension

 from comfy.ldm.lightricks.model import TimestepEmbedding, Timesteps
 from .attention import LinearTransformerBlock, t2i_modulate
@@ -314,7 +313,6 @@ class ACEStepTransformer2DModel(nn.Module):
        output_length: int = 0,
        block_controlnet_hidden_states: Optional[Union[List[torch.Tensor], torch.Tensor]] = None,
        controlnet_scale: Union[float, torch.Tensor] = 1.0,
-        transformer_options={},
    ):
        embedded_timestep = self.timestep_embedder(self.time_proj(timestep).to(dtype=hidden_states.dtype))
        temb = self.t_block(embedded_timestep)
@@ -340,34 +338,12 @@ class ACEStepTransformer2DModel(nn.Module):
                rotary_freqs_cis=rotary_freqs_cis,
                rotary_freqs_cis_cross=encoder_rotary_freqs_cis,
                temb=temb,
-                transformer_options=transformer_options,
            )

        output = self.final_layer(hidden_states, embedded_timestep, output_length)
        return output

-    def forward(self,
-        x,
-        timestep,
-        attention_mask=None,
-        context: Optional[torch.Tensor] = None,
-        text_attention_mask: Optional[torch.LongTensor] = None,
-        speaker_embeds: Optional[torch.FloatTensor] = None,
-        lyric_token_idx: Optional[torch.LongTensor] = None,
-        lyric_mask: Optional[torch.LongTensor] = None,
-        block_controlnet_hidden_states: Optional[Union[List[torch.Tensor], torch.Tensor]] = None,
-        controlnet_scale: Union[float, torch.Tensor] = 1.0,
-        lyrics_strength=1.0,
-        **kwargs
-    ):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, kwargs.get("transformer_options", {}))
-        ).execute(x, timestep, attention_mask, context, text_attention_mask, speaker_embeds, lyric_token_idx, lyric_mask, block_controlnet_hidden_states,
-                  controlnet_scale, lyrics_strength, **kwargs)
-
-    def _forward(
+    def forward(
        self,
        x,
        timestep,
@@ -395,7 +371,6 @@ class ACEStepTransformer2DModel(nn.Module):

        output_length = hidden_states.shape[-1]

-        transformer_options = kwargs.get("transformer_options", {})
        output = self.decode(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
@@ -405,7 +380,6 @@ class ACEStepTransformer2DModel(nn.Module):
            output_length=output_length,
            block_controlnet_hidden_states=block_controlnet_hidden_states,
            controlnet_scale=controlnet_scale,
-            transformer_options=transformer_options,
        )

        return output
--- a/comfy/ldm/ace/vae/music_dcae_pipeline.py
+++ b/comfy/ldm/ace/vae/music_dcae_pipeline.py
@@ -23,6 +23,8 @@ class MusicDCAE(torch.nn.Module):
        else:
            self.source_sample_rate = source_sample_rate

+        # self.resampler = torchaudio.transforms.Resample(source_sample_rate, 44100)
+
        self.transform = transforms.Compose([
            transforms.Normalize(0.5, 0.5),
        ])
@@ -35,6 +37,10 @@ class MusicDCAE(torch.nn.Module):
        self.scale_factor = 0.1786
        self.shift_factor = -1.9091

+    def load_audio(self, audio_path):
+        audio, sr = torchaudio.load(audio_path)
+        return audio, sr
+
    def forward_mel(self, audios):
        mels = []
        for i in range(len(audios)):
@@ -67,8 +73,10 @@ class MusicDCAE(torch.nn.Module):
            latent = self.dcae.encoder(mel.unsqueeze(0))
            latents.append(latent)
        latents = torch.cat(latents, dim=0)
+        # latent_lengths = (audio_lengths / sr * 44100 / 512 / self.time_dimention_multiple).long()
        latents = (latents - self.shift_factor) * self.scale_factor
        return latents
+        # return latents, latent_lengths

    @torch.no_grad()
    def decode(self, latents, audio_lengths=None, sr=None):
@@ -83,7 +91,9 @@ class MusicDCAE(torch.nn.Module):
            wav = self.vocoder.decode(mels[0]).squeeze(1)

            if sr is not None:
+                # resampler = torchaudio.transforms.Resample(44100, sr).to(latents.device).to(latents.dtype)
                wav = torchaudio.functional.resample(wav, 44100, sr)
+                # wav = resampler(wav)
            else:
                sr = 44100
            pred_wavs.append(wav)
@@ -91,6 +101,7 @@ class MusicDCAE(torch.nn.Module):
        if audio_lengths is not None:
            pred_wavs = [wav[:, :length].cpu() for wav, length in zip(pred_wavs, audio_lengths)]
        return torch.stack(pred_wavs)
+        # return sr, pred_wavs

    def forward(self, audios, audio_lengths=None, sr=None):
        latents, latent_lengths = self.encode(audios=audios, audio_lengths=audio_lengths, sr=sr)
--- a/comfy/ldm/audio/dit.py
+++ b/comfy/ldm/audio/dit.py
@@ -298,8 +298,7 @@ class Attention(nn.Module):
        mask = None,
        context_mask = None,
        rotary_pos_emb = None,
-        causal = None,
-        transformer_options={},
+        causal = None
    ):
        h, kv_h, has_context = self.num_heads, self.kv_heads, context is not None

@@ -364,7 +363,7 @@ class Attention(nn.Module):
            heads_per_kv_head = h // kv_h
            k, v = map(lambda t: t.repeat_interleave(heads_per_kv_head, dim = 1), (k, v))

-        out = optimized_attention(q, k, v, h, skip_reshape=True, transformer_options=transformer_options)
+        out = optimized_attention(q, k, v, h, skip_reshape=True)
        out = self.to_out(out)

        if mask is not None:
@@ -489,8 +488,7 @@ class TransformerBlock(nn.Module):
        global_cond=None,
        mask = None,
        context_mask = None,
-        rotary_pos_emb = None,
-        transformer_options={}
+        rotary_pos_emb = None
    ):
        if self.global_cond_dim is not None and self.global_cond_dim > 0 and global_cond is not None:

@@ -500,12 +498,12 @@ class TransformerBlock(nn.Module):
            residual = x
            x = self.pre_norm(x)
            x = x * (1 + scale_self) + shift_self
-            x = self.self_attn(x, mask = mask, rotary_pos_emb = rotary_pos_emb, transformer_options=transformer_options)
+            x = self.self_attn(x, mask = mask, rotary_pos_emb = rotary_pos_emb)
            x = x * torch.sigmoid(1 - gate_self)
            x = x + residual

            if context is not None:
-                x = x + self.cross_attn(self.cross_attend_norm(x), context = context, context_mask = context_mask, transformer_options=transformer_options)
+                x = x + self.cross_attn(self.cross_attend_norm(x), context = context, context_mask = context_mask)

            if self.conformer is not None:
                x = x + self.conformer(x)
@@ -519,10 +517,10 @@ class TransformerBlock(nn.Module):
            x = x + residual

        else:
-            x = x + self.self_attn(self.pre_norm(x), mask = mask, rotary_pos_emb = rotary_pos_emb, transformer_options=transformer_options)
+            x = x + self.self_attn(self.pre_norm(x), mask = mask, rotary_pos_emb = rotary_pos_emb)

            if context is not None:
-                x = x + self.cross_attn(self.cross_attend_norm(x), context = context, context_mask = context_mask, transformer_options=transformer_options)
+                x = x + self.cross_attn(self.cross_attend_norm(x), context = context, context_mask = context_mask)

            if self.conformer is not None:
                x = x + self.conformer(x)
@@ -608,8 +606,7 @@ class ContinuousTransformer(nn.Module):
        return_info = False,
        **kwargs
    ):
-        transformer_options = kwargs.get("transformer_options", {})
-        patches_replace = transformer_options.get("patches_replace", {})
+        patches_replace = kwargs.get("transformer_options", {}).get("patches_replace", {})
        batch, seq, device = *x.shape[:2], x.device
        context = kwargs["context"]

@@ -635,7 +632,7 @@ class ContinuousTransformer(nn.Module):
        # Attention layers

        if self.rotary_pos_emb is not None:
-            rotary_pos_emb = self.rotary_pos_emb.forward_from_seq_len(x.shape[1], dtype=torch.float, device=x.device)
+            rotary_pos_emb = self.rotary_pos_emb.forward_from_seq_len(x.shape[1], dtype=x.dtype, device=x.device)
        else:
            rotary_pos_emb = None

@@ -648,13 +645,13 @@ class ContinuousTransformer(nn.Module):
            if ("double_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
-                    out["img"] = layer(args["img"], rotary_pos_emb=args["pe"], global_cond=args["vec"], context=args["txt"], transformer_options=args["transformer_options"])
+                    out["img"] = layer(args["img"], rotary_pos_emb=args["pe"], global_cond=args["vec"], context=args["txt"])
                    return out

-                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": global_cond, "pe": rotary_pos_emb, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": global_cond, "pe": rotary_pos_emb}, {"original_block": block_wrap})
                x = out["img"]
            else:
-                x = layer(x, rotary_pos_emb = rotary_pos_emb, global_cond=global_cond, context=context, transformer_options=transformer_options)
+                x = layer(x, rotary_pos_emb = rotary_pos_emb, global_cond=global_cond, context=context)
            # x = checkpoint(layer, x, rotary_pos_emb = rotary_pos_emb, global_cond=global_cond, **kwargs)

            if return_info:
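
Review note on the `blocks_replace` path in `ContinuousTransformer`: a replacement callable receives a dict of inputs plus `{"original_block": block_wrap}`, and returns a dict whose `"img"` entry becomes the new `x`. The toy sketch below reproduces that calling convention with plain numbers instead of tensors. `run_blocks`, `add_context`, and `double_output` are hypothetical names, and the `"dit"` lookup key is an assumption, since the hunk does not show how `blocks_replace` is derived from `patches_replace`.

```python
def run_blocks(layers, x, context, patches_replace=None):
    """Run layers in order, letting a patch replace any block by index."""
    # Assumed lookup key; the diff only shows the resulting blocks_replace dict.
    blocks_replace = (patches_replace or {}).get("dit", {})
    for i, layer in enumerate(layers):
        key = ("double_block", i)
        if key in blocks_replace:
            # Bind the current layer so the patch can still call the original.
            def block_wrap(args, layer=layer):
                return {"img": layer(args["img"], context=args["txt"])}

            out = blocks_replace[key]({"img": x, "txt": context}, {"original_block": block_wrap})
            x = out["img"]
        else:
            x = layer(x, context=context)
    return x


def add_context(x, context):
    """Toy layer standing in for a transformer block."""
    return x + context


def double_output(args, extra):
    """Toy patch: run the original block, then double its output."""
    out = extra["original_block"](args)
    out["img"] = out["img"] * 2
    return out


layers = [add_context, add_context]
plain = run_blocks(layers, 1, 10)
patched = run_blocks(layers, 1, 10, {"dit": {("double_block", 1): double_output}})
```

A patch only sees the keys placed in the args dict. On the side of this hunk that omits `"transformer_options"` from that dict, a replacement block that reads `args["transformer_options"]` would fail with a `KeyError`.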
--- a/comfy/ldm/aura/mmdit.py
+++ b/comfy/ldm/aura/mmdit.py
@@ -9,7 +9,6 @@ import torch.nn.functional as F

 from comfy.ldm.modules.attention import optimized_attention
 import comfy.ops
-import comfy.patcher_extension
 import comfy.ldm.common_dit

 def modulate(x, shift, scale):
@@ -85,7 +84,7 @@ class SingleAttention(nn.Module):
        )

    #@torch.compile()
-    def forward(self, c, transformer_options={}):
+    def forward(self, c):

        bsz, seqlen1, _ = c.shape

@@ -95,7 +94,7 @@ class SingleAttention(nn.Module):
        v = v.view(bsz, seqlen1, self.n_heads, self.head_dim)
        q, k = self.q_norm1(q), self.k_norm1(k)

-        output = optimized_attention(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3), v.permute(0, 2, 1, 3), self.n_heads, skip_reshape=True, transformer_options=transformer_options)
+        output = optimized_attention(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3), v.permute(0, 2, 1, 3), self.n_heads, skip_reshape=True)
        c = self.w1o(output)
        return c

@@ -144,7 +143,7 @@ class DoubleAttention(nn.Module):


    #@torch.compile()
-    def forward(self, c, x, transformer_options={}):
+    def forward(self, c, x):

        bsz, seqlen1, _ = c.shape
        bsz, seqlen2, _ = x.shape
@@ -168,7 +167,7 @@ class DoubleAttention(nn.Module):
            torch.cat([cv, xv], dim=1),
        )

-        output = optimized_attention(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3), v.permute(0, 2, 1, 3), self.n_heads, skip_reshape=True, transformer_options=transformer_options)
+        output = optimized_attention(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3), v.permute(0, 2, 1, 3), self.n_heads, skip_reshape=True)

        c, x = output.split([seqlen1, seqlen2], dim=1)
        c = self.w1o(c)
@@ -207,7 +206,7 @@ class MMDiTBlock(nn.Module):
        self.is_last = is_last

    #@torch.compile()
-    def forward(self, c, x, global_cond, transformer_options={}, **kwargs):
+    def forward(self, c, x, global_cond, **kwargs):

        cres, xres = c, x

@@ -225,7 +224,7 @@ class MMDiTBlock(nn.Module):
        x = modulate(self.normX1(x), xshift_msa, xscale_msa)

        # attention
-        c, x = self.attn(c, x, transformer_options=transformer_options)
+        c, x = self.attn(c, x)


        c = self.normC2(cres + cgate_msa.unsqueeze(1) * c)
@@ -255,13 +254,13 @@ class DiTBlock(nn.Module):
        self.mlp = MLP(dim, hidden_dim=dim * 4, dtype=dtype, device=device, operations=operations)

    #@torch.compile()
-    def forward(self, cx, global_cond, transformer_options={}, **kwargs):
+    def forward(self, cx, global_cond, **kwargs):
        cxres = cx
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.modCX(
            global_cond
        ).chunk(6, dim=1)
        cx = modulate(self.norm1(cx), shift_msa, scale_msa)
-        cx = self.attn(cx, transformer_options=transformer_options)
+        cx = self.attn(cx)
        cx = self.norm2(cxres + gate_msa.unsqueeze(1) * cx)
        mlpout = self.mlp(modulate(cx, shift_mlp, scale_mlp))
        cx = gate_mlp.unsqueeze(1) * mlpout
@@ -437,13 +436,6 @@ class MMDiT(nn.Module):
        return x + pos_encoding.reshape(1, -1, self.positional_encoding.shape[-1])

    def forward(self, x, timestep, context, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, transformer_options, **kwargs)
-
-    def _forward(self, x, timestep, context, transformer_options={}, **kwargs):
        patches_replace = transformer_options.get("patches_replace", {})
        # patchify x, add PE
        b, c, h, w = x.shape
@@ -473,14 +465,13 @@ class MMDiT(nn.Module):
                        out = {}
                        out["txt"], out["img"] = layer(args["txt"],
                                                       args["img"],
-                                                       args["vec"],
-                                                       transformer_options=args["transformer_options"])
+                                                       args["vec"])
                        return out
-                    out = blocks_replace[("double_block", i)]({"img": x, "txt": c, "vec": global_cond, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                    out = blocks_replace[("double_block", i)]({"img": x, "txt": c, "vec": global_cond}, {"original_block": block_wrap})
                    c = out["txt"]
                    x = out["img"]
                else:
-                    c, x = layer(c, x, global_cond, transformer_options=transformer_options, **kwargs)
+                    c, x = layer(c, x, global_cond, **kwargs)

        if len(self.single_layers) > 0:
            c_len = c.size(1)
@@ -489,13 +480,13 @@ class MMDiT(nn.Module):
                if ("single_block", i) in blocks_replace:
                    def block_wrap(args):
                        out = {}
-                        out["img"] = layer(args["img"], args["vec"], transformer_options=args["transformer_options"])
+                        out["img"] = layer(args["img"], args["vec"])
                        return out

-                    out = blocks_replace[("single_block", i)]({"img": cx, "vec": global_cond, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                    out = blocks_replace[("single_block", i)]({"img": cx, "vec": global_cond}, {"original_block": block_wrap})
                    cx = out["img"]
                else:
-                    cx = layer(cx, global_cond, transformer_options=transformer_options, **kwargs)
+                    cx = layer(cx, global_cond, **kwargs)

            x = cx[:, c_len:]

--- a/comfy/ldm/cascade/common.py
+++ b/comfy/ldm/cascade/common.py
@@ -32,12 +32,12 @@ class OptimizedAttention(nn.Module):

        self.out_proj = operations.Linear(c, c, bias=True, dtype=dtype, device=device)

-    def forward(self, q, k, v, transformer_options={}):
+    def forward(self, q, k, v):
        q = self.to_q(q)
        k = self.to_k(k)
        v = self.to_v(v)

-        out = optimized_attention(q, k, v, self.heads, transformer_options=transformer_options)
+        out = optimized_attention(q, k, v, self.heads)

        return self.out_proj(out)

@@ -47,13 +47,13 @@ class Attention2D(nn.Module):
        self.attn = OptimizedAttention(c, nhead, dtype=dtype, device=device, operations=operations)
        # self.attn = nn.MultiheadAttention(c, nhead, dropout=dropout, bias=True, batch_first=True, dtype=dtype, device=device)

-    def forward(self, x, kv, self_attn=False, transformer_options={}):
+    def forward(self, x, kv, self_attn=False):
        orig_shape = x.shape
        x = x.view(x.size(0), x.size(1), -1).permute(0, 2, 1)  # Bx4xHxW -> Bx(HxW)x4
        if self_attn:
            kv = torch.cat([x, kv], dim=1)
        # x = self.attn(x, kv, kv, need_weights=False)[0]
-        x = self.attn(x, kv, kv, transformer_options=transformer_options)
+        x = self.attn(x, kv, kv)
        x = x.permute(0, 2, 1).view(*orig_shape)
        return x

@@ -114,9 +114,9 @@ class AttnBlock(nn.Module):
            operations.Linear(c_cond, c, dtype=dtype, device=device)
        )

-    def forward(self, x, kv, transformer_options={}):
+    def forward(self, x, kv):
        kv = self.kv_mapper(kv)
-        x = x + self.attention(self.norm(x), kv, self_attn=self.self_attn, transformer_options=transformer_options)
+        x = x + self.attention(self.norm(x), kv, self_attn=self.self_attn)
        return x


--- a/comfy/ldm/cascade/stage_b.py
+++ b/comfy/ldm/cascade/stage_b.py
@@ -173,7 +173,7 @@ class StageB(nn.Module):
        clip = self.clip_norm(clip)
        return clip

-    def _down_encode(self, x, r_embed, clip, transformer_options={}):
+    def _down_encode(self, x, r_embed, clip):
        level_outputs = []
        block_group = zip(self.down_blocks, self.down_downscalers, self.down_repeat_mappers)
        for down_block, downscaler, repmap in block_group:
@@ -187,7 +187,7 @@ class StageB(nn.Module):
                    elif isinstance(block, AttnBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  AttnBlock)):
-                        x = block(x, clip, transformer_options=transformer_options)
+                        x = block(x, clip)
                    elif isinstance(block, TimestepBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  TimestepBlock)):
@@ -199,7 +199,7 @@ class StageB(nn.Module):
            level_outputs.insert(0, x)
        return level_outputs

-    def _up_decode(self, level_outputs, r_embed, clip, transformer_options={}):
+    def _up_decode(self, level_outputs, r_embed, clip):
        x = level_outputs[0]
        block_group = zip(self.up_blocks, self.up_upscalers, self.up_repeat_mappers)
        for i, (up_block, upscaler, repmap) in enumerate(block_group):
@@ -216,7 +216,7 @@ class StageB(nn.Module):
                    elif isinstance(block, AttnBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  AttnBlock)):
-                        x = block(x, clip, transformer_options=transformer_options)
+                        x = block(x, clip)
                    elif isinstance(block, TimestepBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  TimestepBlock)):
@@ -228,7 +228,7 @@ class StageB(nn.Module):
            x = upscaler(x)
        return x

-    def forward(self, x, r, effnet, clip, pixels=None, transformer_options={}, **kwargs):
+    def forward(self, x, r, effnet, clip, pixels=None, **kwargs):
        if pixels is None:
            pixels = x.new_zeros(x.size(0), 3, 8, 8)

@@ -245,8 +245,8 @@ class StageB(nn.Module):
            nn.functional.interpolate(effnet, size=x.shape[-2:], mode='bilinear', align_corners=True))
        x = x + nn.functional.interpolate(self.pixels_mapper(pixels), size=x.shape[-2:], mode='bilinear',
                                          align_corners=True)
-        level_outputs = self._down_encode(x, r_embed, clip, transformer_options=transformer_options)
-        x = self._up_decode(level_outputs, r_embed, clip, transformer_options=transformer_options)
+        level_outputs = self._down_encode(x, r_embed, clip)
+        x = self._up_decode(level_outputs, r_embed, clip)
        return self.clf(x)

    def update_weights_ema(self, src_model, beta=0.999):
--- a/comfy/ldm/cascade/stage_c.py
+++ b/comfy/ldm/cascade/stage_c.py
@@ -182,7 +182,7 @@ class StageC(nn.Module):
        clip = self.clip_norm(clip)
        return clip

-    def _down_encode(self, x, r_embed, clip, cnet=None, transformer_options={}):
+    def _down_encode(self, x, r_embed, clip, cnet=None):
        level_outputs = []
        block_group = zip(self.down_blocks, self.down_downscalers, self.down_repeat_mappers)
        for down_block, downscaler, repmap in block_group:
@@ -201,7 +201,7 @@ class StageC(nn.Module):
                    elif isinstance(block, AttnBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  AttnBlock)):
-                        x = block(x, clip, transformer_options=transformer_options)
+                        x = block(x, clip)
                    elif isinstance(block, TimestepBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  TimestepBlock)):
@@ -213,7 +213,7 @@ class StageC(nn.Module):
            level_outputs.insert(0, x)
        return level_outputs

-    def _up_decode(self, level_outputs, r_embed, clip, cnet=None, transformer_options={}):
+    def _up_decode(self, level_outputs, r_embed, clip, cnet=None):
        x = level_outputs[0]
        block_group = zip(self.up_blocks, self.up_upscalers, self.up_repeat_mappers)
        for i, (up_block, upscaler, repmap) in enumerate(block_group):
@@ -235,7 +235,7 @@ class StageC(nn.Module):
                    elif isinstance(block, AttnBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  AttnBlock)):
-                        x = block(x, clip, transformer_options=transformer_options)
+                        x = block(x, clip)
                    elif isinstance(block, TimestepBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  TimestepBlock)):
@@ -247,7 +247,7 @@ class StageC(nn.Module):
            x = upscaler(x)
        return x

-    def forward(self, x, r, clip_text, clip_text_pooled, clip_img, control=None, transformer_options={}, **kwargs):
+    def forward(self, x, r, clip_text, clip_text_pooled, clip_img, control=None, **kwargs):
        # Process the conditioning embeddings
        r_embed = self.gen_r_embedding(r).to(dtype=x.dtype)
        for c in self.t_conds:
@@ -262,8 +262,8 @@ class StageC(nn.Module):

        # Model Blocks
        x = self.embedding(x)
-        level_outputs = self._down_encode(x, r_embed, clip, cnet, transformer_options=transformer_options)
-        x = self._up_decode(level_outputs, r_embed, clip, cnet, transformer_options=transformer_options)
+        level_outputs = self._down_encode(x, r_embed, clip, cnet)
+        x = self._up_decode(level_outputs, r_embed, clip, cnet)
        return self.clf(x)

    def update_weights_ema(self, src_model, beta=0.999):
--- a/comfy/ldm/chroma/layers.py
+++ b/comfy/ldm/chroma/layers.py
@@ -1,15 +1,15 @@
 import torch
 from torch import Tensor, nn

+from comfy.ldm.flux.math import attention
 from comfy.ldm.flux.layers import (
    MLPEmbedder,
    RMSNorm,
+    QKNorm,
+    SelfAttention,
    ModulationOut,
 )

-# TODO: remove this in a few months
-SingleStreamBlock = None
-DoubleStreamBlock = None


 class ChromaModulationOut(ModulationOut):
@@ -48,6 +48,124 @@ class Approximator(nn.Module):
        return x


+class DoubleStreamBlock(nn.Module):
+    def __init__(self, hidden_size: int, num_heads: int, mlp_ratio: float, qkv_bias: bool = False, flipped_img_txt=False, dtype=None, device=None, operations=None):
+        super().__init__()
+
+        mlp_hidden_dim = int(hidden_size * mlp_ratio)
+        self.num_heads = num_heads
+        self.hidden_size = hidden_size
+        self.img_norm1 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
+        self.img_attn = SelfAttention(dim=hidden_size, num_heads=num_heads, qkv_bias=qkv_bias, dtype=dtype, device=device, operations=operations)
+
+        self.img_norm2 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
+        self.img_mlp = nn.Sequential(
+            operations.Linear(hidden_size, mlp_hidden_dim, bias=True, dtype=dtype, device=device),
+            nn.GELU(approximate="tanh"),
+            operations.Linear(mlp_hidden_dim, hidden_size, bias=True, dtype=dtype, device=device),
+        )
+
+        self.txt_norm1 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
+        self.txt_attn = SelfAttention(dim=hidden_size, num_heads=num_heads, qkv_bias=qkv_bias, dtype=dtype, device=device, operations=operations)
+
+        self.txt_norm2 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
+        self.txt_mlp = nn.Sequential(
+            operations.Linear(hidden_size, mlp_hidden_dim, bias=True, dtype=dtype, device=device),
+            nn.GELU(approximate="tanh"),
+            operations.Linear(mlp_hidden_dim, hidden_size, bias=True, dtype=dtype, device=device),
+        )
+        self.flipped_img_txt = flipped_img_txt
+
+    def forward(self, img: Tensor, txt: Tensor, pe: Tensor, vec: Tensor, attn_mask=None):
+        (img_mod1, img_mod2), (txt_mod1, txt_mod2) = vec
+
+        # prepare image for attention
+        img_modulated = torch.addcmul(img_mod1.shift, 1 + img_mod1.scale, self.img_norm1(img))
+        img_qkv = self.img_attn.qkv(img_modulated)
+        img_q, img_k, img_v = img_qkv.view(img_qkv.shape[0], img_qkv.shape[1], 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
+        img_q, img_k = self.img_attn.norm(img_q, img_k, img_v)
+
+        # prepare txt for attention
+        txt_modulated = torch.addcmul(txt_mod1.shift, 1 + txt_mod1.scale, self.txt_norm1(txt))
+        txt_qkv = self.txt_attn.qkv(txt_modulated)
+        txt_q, txt_k, txt_v = txt_qkv.view(txt_qkv.shape[0], txt_qkv.shape[1], 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
+        txt_q, txt_k = self.txt_attn.norm(txt_q, txt_k, txt_v)
+
+        # run actual attention
+        attn = attention(torch.cat((txt_q, img_q), dim=2),
+                         torch.cat((txt_k, img_k), dim=2),
+                         torch.cat((txt_v, img_v), dim=2),
+                         pe=pe, mask=attn_mask)
+
+        txt_attn, img_attn = attn[:, : txt.shape[1]], attn[:, txt.shape[1] :]
+
+        # calculate the img blocks
+        img.addcmul_(img_mod1.gate, self.img_attn.proj(img_attn))
+        img.addcmul_(img_mod2.gate, self.img_mlp(torch.addcmul(img_mod2.shift, 1 + img_mod2.scale, self.img_norm2(img))))
+
+        # calculate the txt blocks
+        txt.addcmul_(txt_mod1.gate, self.txt_attn.proj(txt_attn))
+        txt.addcmul_(txt_mod2.gate, self.txt_mlp(torch.addcmul(txt_mod2.shift, 1 + txt_mod2.scale, self.txt_norm2(txt))))
+
+        if txt.dtype == torch.float16:
+            txt = torch.nan_to_num(txt, nan=0.0, posinf=65504, neginf=-65504)
+
+        return img, txt
+
+
+class SingleStreamBlock(nn.Module):
+    """
+    A DiT block with parallel linear layers as described in
+    https://arxiv.org/abs/2302.05442 and adapted modulation interface.
+    """
+
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        mlp_ratio: float = 4.0,
+        qk_scale: float = None,
+        dtype=None,
+        device=None,
+        operations=None
+    ):
+        super().__init__()
+        self.hidden_dim = hidden_size
+        self.num_heads = num_heads
+        head_dim = hidden_size // num_heads
+        self.scale = qk_scale or head_dim**-0.5
+
+        self.mlp_hidden_dim = int(hidden_size * mlp_ratio)
+        # qkv and mlp_in
+        self.linear1 = operations.Linear(hidden_size, hidden_size * 3 + self.mlp_hidden_dim, dtype=dtype, device=device)
+        # proj and mlp_out
+        self.linear2 = operations.Linear(hidden_size + self.mlp_hidden_dim, hidden_size, dtype=dtype, device=device)
+
+        self.norm = QKNorm(head_dim, dtype=dtype, device=device, operations=operations)
+
+        self.hidden_size = hidden_size
+        self.pre_norm = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
+
+        self.mlp_act = nn.GELU(approximate="tanh")
+
+    def forward(self, x: Tensor, pe: Tensor, vec: Tensor, attn_mask=None) -> Tensor:
+        mod = vec
+        x_mod = torch.addcmul(mod.shift, 1 + mod.scale, self.pre_norm(x))
+        qkv, mlp = torch.split(self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1)
+
+        q, k, v = qkv.view(qkv.shape[0], qkv.shape[1], 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
+        q, k = self.norm(q, k, v)
+
+        # compute attention
+        attn = attention(q, k, v, pe=pe, mask=attn_mask)
+        # compute activation in mlp stream, cat again and run second linear layer
+        output = self.linear2(torch.cat((attn, self.mlp_act(mlp)), 2))
+        x.addcmul_(mod.gate, output)
+        if x.dtype == torch.float16:
+            x = torch.nan_to_num(x, nan=0.0, posinf=65504, neginf=-65504)
+        return x
+
+
 class LastLayer(nn.Module):
    def __init__(self, hidden_size: int, patch_size: int, out_channels: int, dtype=None, device=None, operations=None):
        super().__init__()
--- a/comfy/ldm/chroma/model.py
+++ b/comfy/ldm/chroma/model.py
@@ -5,18 +5,17 @@ from dataclasses import dataclass
 import torch
 from torch import Tensor, nn
 from einops import rearrange, repeat
-import comfy.patcher_extension
 import comfy.ldm.common_dit

 from comfy.ldm.flux.layers import (
    EmbedND,
    timestep_embedding,
-    DoubleStreamBlock,
-    SingleStreamBlock,
 )

 from .layers import (
+    DoubleStreamBlock,
    LastLayer,
+    SingleStreamBlock,
    Approximator,
    ChromaModulationOut,
 )
@@ -90,7 +89,6 @@ class Chroma(nn.Module):
                    self.num_heads,
                    mlp_ratio=params.mlp_ratio,
                    qkv_bias=params.qkv_bias,
-                    modulation=False,
                    dtype=dtype, device=device, operations=operations
                )
                for _ in range(params.depth)
@@ -99,7 +97,7 @@ class Chroma(nn.Module):

        self.single_blocks = nn.ModuleList(
            [
-                SingleStreamBlock(self.hidden_size, self.num_heads, mlp_ratio=params.mlp_ratio, modulation=False, dtype=dtype, device=device, operations=operations)
+                SingleStreamBlock(self.hidden_size, self.num_heads, mlp_ratio=params.mlp_ratio, dtype=dtype, device=device, operations=operations)
                for _ in range(params.depth_single_blocks)
            ]
        )
@@ -152,6 +150,8 @@ class Chroma(nn.Module):
        attn_mask: Tensor = None,
    ) -> Tensor:
        patches_replace = transformer_options.get("patches_replace", {})
+        if img.ndim != 3 or txt.ndim != 3:
+            raise ValueError("Input img and txt tensors must have 3 dimensions.")

        # running on sequences img
        img = self.img_in(img)
@@ -179,10 +179,7 @@ class Chroma(nn.Module):
        pe = self.pe_embedder(ids)

        blocks_replace = patches_replace.get("dit", {})
-        transformer_options["total_blocks"] = len(self.double_blocks)
-        transformer_options["block_type"] = "double"
        for i, block in enumerate(self.double_blocks):
-            transformer_options["block_index"] = i
            if i not in self.skip_mmdit:
                double_mod = (
                    self.get_modulations(mod_vectors, "double_img", idx=i),
@@ -195,16 +192,14 @@ class Chroma(nn.Module):
                                                       txt=args["txt"],
                                                       vec=args["vec"],
                                                       pe=args["pe"],
-                                                       attn_mask=args.get("attn_mask"),
-                                                       transformer_options=args.get("transformer_options"))
+                                                       attn_mask=args.get("attn_mask"))
                        return out

                    out = blocks_replace[("double_block", i)]({"img": img,
                                                               "txt": txt,
                                                               "vec": double_mod,
                                                               "pe": pe,
-                                                               "attn_mask": attn_mask,
-                                                               "transformer_options": transformer_options},
+                                                               "attn_mask": attn_mask},
                                                              {"original_block": block_wrap})
                    txt = out["txt"]
                    img = out["img"]
@@ -213,8 +208,7 @@ class Chroma(nn.Module):
                                     txt=txt,
                                     vec=double_mod,
                                     pe=pe,
-                                     attn_mask=attn_mask,
-                                     transformer_options=transformer_options)
+                                     attn_mask=attn_mask)

                if control is not None: # Controlnet
                    control_i = control.get("input")
@@ -225,10 +219,7 @@ class Chroma(nn.Module):

        img = torch.cat((txt, img), 1)

-        transformer_options["total_blocks"] = len(self.single_blocks)
-        transformer_options["block_type"] = "single"
        for i, block in enumerate(self.single_blocks):
-            transformer_options["block_index"] = i
            if i not in self.skip_dit:
                single_mod = self.get_modulations(mod_vectors, "single", idx=i)
                if ("single_block", i) in blocks_replace:
@@ -237,19 +228,17 @@ class Chroma(nn.Module):
                        out["img"] = block(args["img"],
                                           vec=args["vec"],
                                           pe=args["pe"],
-                                           attn_mask=args.get("attn_mask"),
-                                           transformer_options=args.get("transformer_options"))
+                                           attn_mask=args.get("attn_mask"))
                        return out

                    out = blocks_replace[("single_block", i)]({"img": img,
                                                               "vec": single_mod,
                                                               "pe": pe,
-                                                               "attn_mask": attn_mask,
-                                                               "transformer_options": transformer_options},
+                                                               "attn_mask": attn_mask},
                                                              {"original_block": block_wrap})
                    img = out["img"]
                else:
-                    img = block(img, vec=single_mod, pe=pe, attn_mask=attn_mask, transformer_options=transformer_options)
+                    img = block(img, vec=single_mod, pe=pe, attn_mask=attn_mask)

                if control is not None: # Controlnet
                    control_o = control.get("output")
@@ -259,27 +248,16 @@ class Chroma(nn.Module):
                            img[:, txt.shape[1] :, ...] += add

        img = img[:, txt.shape[1] :, ...]
-        if hasattr(self, "final_layer"):
-            final_mod = self.get_modulations(mod_vectors, "final")
-            img = self.final_layer(img, vec=final_mod)  # (N, T, patch_size ** 2 * out_channels)
+        final_mod = self.get_modulations(mod_vectors, "final")
+        img = self.final_layer(img, vec=final_mod)  # (N, T, patch_size ** 2 * out_channels)
        return img

    def forward(self, x, timestep, context, guidance, control=None, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, guidance, control, transformer_options, **kwargs)
-
-    def _forward(self, x, timestep, context, guidance, control=None, transformer_options={}, **kwargs):
        bs, c, h, w = x.shape
        x = comfy.ldm.common_dit.pad_to_patch_size(x, (self.patch_size, self.patch_size))

        img = rearrange(x, "b c (h ph) (w pw) -> b (h w) (c ph pw)", ph=self.patch_size, pw=self.patch_size)

-        if img.ndim != 3 or context.ndim != 3:
-            raise ValueError("Input img and txt tensors must have 3 dimensions.")
-
        h_len = ((h + (self.patch_size // 2)) // self.patch_size)
        w_len = ((w + (self.patch_size // 2)) // self.patch_size)
        img_ids = torch.zeros((h_len, w_len, 3), device=x.device, dtype=x.dtype)
--- a/comfy/ldm/chroma_radiance/layers.py
+++ b/comfy/ldm/chroma_radiance/layers.py
@@ -1,206 +0,0 @@
-# Adapted from https://github.com/lodestone-rock/flow
-from functools import lru_cache
-
-import torch
-from torch import nn
-
-from comfy.ldm.flux.layers import RMSNorm
-
-
-class NerfEmbedder(nn.Module):
-    """
-    An embedder module that combines input features with a 2D positional
-    encoding that mimics the Discrete Cosine Transform (DCT).
-
-    This module takes an input tensor of shape (B, P^2, C), where P is the
-    patch size, and enriches it with positional information before projecting
-    it to a new hidden size.
-    """
-    def __init__(
-        self,
-        in_channels: int,
-        hidden_size_input: int,
-        max_freqs: int,
-        dtype=None,
-        device=None,
-        operations=None,
-    ):
-        """
-        Initializes the NerfEmbedder.
-
-        Args:
-            in_channels (int): The number of channels in the input tensor.
-            hidden_size_input (int): The desired dimension of the output embedding.
-            max_freqs (int): The number of frequency components to use for both
-                             the x and y dimensions of the positional encoding.
-                             The total number of positional features will be max_freqs^2.
-        """
-        super().__init__()
-        self.dtype = dtype
-        self.max_freqs = max_freqs
-        self.hidden_size_input = hidden_size_input
-
-        # A linear layer to project the concatenated input features and
-        # positional encodings to the final output dimension.
-        self.embedder = nn.Sequential(
-            operations.Linear(in_channels + max_freqs**2, hidden_size_input, dtype=dtype, device=device)
-        )
-
-    @lru_cache(maxsize=4)
-    def fetch_pos(self, patch_size: int, device: torch.device, dtype: torch.dtype) -> torch.Tensor:
-        """
-        Generates and caches 2D DCT-like positional embeddings for a given patch size.
-
-        The LRU cache is a performance optimization that avoids recomputing the
-        same positional grid on every forward pass.
-
-        Args:
-            patch_size (int): The side length of the square input patch.
-            device: The torch device to create the tensors on.
-            dtype: The torch dtype for the tensors.
-
-        Returns:
-            A tensor of shape (1, patch_size^2, max_freqs^2) containing the
-            positional embeddings.
-        """
-        # Create normalized 1D coordinate grids from 0 to 1.
-        pos_x = torch.linspace(0, 1, patch_size, device=device, dtype=dtype)
-        pos_y = torch.linspace(0, 1, patch_size, device=device, dtype=dtype)
-
-        # Create a 2D meshgrid of coordinates.
-        pos_y, pos_x = torch.meshgrid(pos_y, pos_x, indexing="ij")
-
-        # Reshape positions to be broadcastable with frequencies.
-        # Shape becomes (patch_size^2, 1, 1).
-        pos_x = pos_x.reshape(-1, 1, 1)
-        pos_y = pos_y.reshape(-1, 1, 1)
-
-        # Create a 1D tensor of frequency values from 0 to max_freqs-1.
-        freqs = torch.linspace(0, self.max_freqs - 1, self.max_freqs, dtype=dtype, device=device)
-
-        # Reshape frequencies to be broadcastable for creating 2D basis functions.
-        # freqs_x shape: (1, max_freqs, 1)
-        # freqs_y shape: (1, 1, max_freqs)
-        freqs_x = freqs[None, :, None]
-        freqs_y = freqs[None, None, :]
-
-        # A custom weighting coefficient, not part of standard DCT.
-        # This seems to down-weight the contribution of higher-frequency interactions.
-        coeffs = (1 + freqs_x * freqs_y) ** -1
-
-        # Calculate the 1D cosine basis functions for x and y coordinates.
-        # This is the core of the DCT formulation.
-        dct_x = torch.cos(pos_x * freqs_x * torch.pi)
-        dct_y = torch.cos(pos_y * freqs_y * torch.pi)
-
-        # Combine the 1D basis functions to create 2D basis functions by element-wise
-        # multiplication, and apply the custom coefficients. Broadcasting handles the
-        # combination of all (pos_x, freqs_x) with all (pos_y, freqs_y).
-        # The result is flattened into a feature vector for each position.
-        dct = (dct_x * dct_y * coeffs).view(1, -1, self.max_freqs ** 2)
-
-        return dct
-
-    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
-        """
-        Forward pass for the embedder.
-
-        Args:
-            inputs (Tensor): The input tensor of shape (B, P^2, C).
-
-        Returns:
-            Tensor: The output tensor of shape (B, P^2, hidden_size_input).
-        """
-        # Get the batch size, number of pixels, and number of channels.
-        B, P2, C = inputs.shape
-
-        # Infer the patch side length from the number of pixels (P^2).
-        patch_size = int(P2 ** 0.5)
-
-        input_dtype = inputs.dtype
-        inputs = inputs.to(dtype=self.dtype)
-
-        # Fetch the pre-computed or cached positional embeddings.
-        dct = self.fetch_pos(patch_size, inputs.device, self.dtype)
-
-        # Repeat the positional embeddings for each item in the batch.
-        dct = dct.repeat(B, 1, 1)
-
-        # Concatenate the original input features with the positional embeddings
-        # along the feature dimension.
-        inputs = torch.cat((inputs, dct), dim=-1)
-
-        # Project the combined tensor to the target hidden size.
-        return self.embedder(inputs).to(dtype=input_dtype)
-
-
-class NerfGLUBlock(nn.Module):
-    """
-    A NerfBlock using a Gated Linear Unit (GLU) like MLP.
-    """
-    def __init__(self, hidden_size_s: int, hidden_size_x: int, mlp_ratio, dtype=None, device=None, operations=None):
-        super().__init__()
-        # The total number of parameters for the MLP is increased to accommodate
-        # the gate, value, and output projection matrices.
-        # We now need to generate parameters for 3 matrices.
-        total_params = 3 * hidden_size_x**2 * mlp_ratio
-        self.param_generator = operations.Linear(hidden_size_s, total_params, dtype=dtype, device=device)
-        self.norm = RMSNorm(hidden_size_x, dtype=dtype, device=device, operations=operations)
-        self.mlp_ratio = mlp_ratio
-
-
-    def forward(self, x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
-        batch_size, num_x, hidden_size_x = x.shape
-        mlp_params = self.param_generator(s)
-
-        # Split the generated parameters into three parts for the gate, value, and output projection.
-        fc1_gate_params, fc1_value_params, fc2_params = mlp_params.chunk(3, dim=-1)
-
-        # Reshape the parameters into matrices for batch matrix multiplication.
-        fc1_gate = fc1_gate_params.view(batch_size, hidden_size_x, hidden_size_x * self.mlp_ratio)
-        fc1_value = fc1_value_params.view(batch_size, hidden_size_x, hidden_size_x * self.mlp_ratio)
-        fc2 = fc2_params.view(batch_size, hidden_size_x * self.mlp_ratio, hidden_size_x)
-
-        # Normalize the generated weight matrices as in the original implementation.
-        fc1_gate = torch.nn.functional.normalize(fc1_gate, dim=-2)
-        fc1_value = torch.nn.functional.normalize(fc1_value, dim=-2)
-        fc2 = torch.nn.functional.normalize(fc2, dim=-2)
-
-        res_x = x
-        x = self.norm(x)
-
-        # Gated MLP: silu(x @ fc1_gate) * (x @ fc1_value), followed by the output projection fc2.
-        x = torch.bmm(torch.nn.functional.silu(torch.bmm(x, fc1_gate)) * torch.bmm(x, fc1_value), fc2)
-
-        return x + res_x
-
-
-class NerfFinalLayer(nn.Module):
-    def __init__(self, hidden_size, out_channels, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.norm = RMSNorm(hidden_size, dtype=dtype, device=device, operations=operations)
-        self.linear = operations.Linear(hidden_size, out_channels, dtype=dtype, device=device)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        # RMSNorm normalizes over the last dimension, but our channel dim (C) is at dim=1.
-        # So we temporarily move the channel dimension to the end for the norm operation.
-        return self.linear(self.norm(x.movedim(1, -1))).movedim(-1, 1)
-
-
-class NerfFinalLayerConv(nn.Module):
-    def __init__(self, hidden_size: int, out_channels: int, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.norm = RMSNorm(hidden_size, dtype=dtype, device=device, operations=operations)
-        self.conv = operations.Conv2d(
-            in_channels=hidden_size,
-            out_channels=out_channels,
-            kernel_size=3,
-            padding=1,
-            dtype=dtype,
-            device=device,
-        )
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        # RMSNorm normalizes over the last dimension, but our channel dim (C) is at dim=1.
-        # So we temporarily move the channel dimension to the end for the norm operation.
-        return self.conv(self.norm(x.movedim(1, -1)).movedim(-1, 1))
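Reviewer note on the removed fetch_pos: the sketch below restates the same DCT-like basis in plain Python so the feature layout can be checked by hand. The function name dct_like_pos_embedding is hypothetical, and the code uses nested lists and the math module rather than the torch broadcasting in the removed file. It is an illustration under those assumptions, not a drop-in replacement.

```python
import math


def dct_like_pos_embedding(patch_size: int, max_freqs: int) -> list[list[float]]:
    """Return patch_size**2 rows, each holding max_freqs**2 features.

    Hypothetical helper: mirrors the math of the removed fetch_pos
    (cosine basis in x and y, damped by 1 / (1 + fx * fy)).
    """
    # Equivalent of linspace(0, 1, patch_size); a single point maps to 0.0.
    if patch_size == 1:
        coords = [0.0]
    else:
        coords = [i / (patch_size - 1) for i in range(patch_size)]

    rows = []
    # "ij" meshgrid followed by a row-major flatten: y is the slow axis.
    for y in coords:
        for x in coords:
            feats = []
            # Feature index is fx * max_freqs + fy, matching the final view().
            for fx in range(max_freqs):
                for fy in range(max_freqs):
                    coeff = 1.0 / (1.0 + fx * fy)
                    value = math.cos(math.pi * x * fx) * math.cos(math.pi * y * fy)
                    feats.append(value * coeff)
            rows.append(feats)
    return rows


emb = dct_like_pos_embedding(patch_size=4, max_freqs=3)
print(len(emb), len(emb[0]))  # 16 9
```

At the origin every cosine equals 1, so the first row reduces to the damping coefficients alone, which makes the effect of the custom weighting easy to read off.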
--- a/comfy/ldm/chroma_radiance/model.py
+++ b/comfy/ldm/chroma_radiance/model.py
@@ -1,319 +0,0 @@
-# Credits:
-# Original Flux code can be found on: https://github.com/black-forest-labs/flux
-# Chroma Radiance adaption referenced from https://github.com/lodestone-rock/flow
-
-from dataclasses import dataclass
-from typing import Optional
-
-import torch
-from torch import Tensor, nn
-from einops import repeat
-import comfy.ldm.common_dit
-
-from comfy.ldm.flux.layers import EmbedND, DoubleStreamBlock, SingleStreamBlock
-
-from comfy.ldm.chroma.model import Chroma, ChromaParams
-from comfy.ldm.chroma.layers import (
-    Approximator,
-)
-from .layers import (
-    NerfEmbedder,
-    NerfGLUBlock,
-    NerfFinalLayer,
-    NerfFinalLayerConv,
-)
-
-
-@dataclass
-class ChromaRadianceParams(ChromaParams):
-    patch_size: int
-    nerf_hidden_size: int
-    nerf_mlp_ratio: int
-    nerf_depth: int
-    nerf_max_freqs: int
-    # Setting nerf_tile_size to 0 disables tiling.
-    nerf_tile_size: int
-    # Currently one of linear (legacy) or conv.
-    nerf_final_head_type: str
-    # None means use the same dtype as the model.
-    nerf_embedder_dtype: Optional[torch.dtype]
-
-
-class ChromaRadiance(Chroma):
-    """
-    Transformer model for flow matching on sequences.
-    """
-
-    def __init__(self, image_model=None, final_layer=True, dtype=None, device=None, operations=None, **kwargs):
-        if operations is None:
-            raise RuntimeError("Attempt to create ChromaRadiance object without setting operations")
-        nn.Module.__init__(self)
-        self.dtype = dtype
-        params = ChromaRadianceParams(**kwargs)
-        self.params = params
-        self.patch_size = params.patch_size
-        self.in_channels = params.in_channels
-        self.out_channels = params.out_channels
-        if params.hidden_size % params.num_heads != 0:
-            raise ValueError(
-                f"Hidden size {params.hidden_size} must be divisible by num_heads {params.num_heads}"
-            )
-        pe_dim = params.hidden_size // params.num_heads
-        if sum(params.axes_dim) != pe_dim:
-            raise ValueError(f"Got {params.axes_dim} but expected positional dim {pe_dim}")
-        self.hidden_size = params.hidden_size
-        self.num_heads = params.num_heads
-        self.in_dim = params.in_dim
-        self.out_dim = params.out_dim
-        self.hidden_dim = params.hidden_dim
-        self.n_layers = params.n_layers
-        self.pe_embedder = EmbedND(dim=pe_dim, theta=params.theta, axes_dim=params.axes_dim)
-        self.img_in_patch = operations.Conv2d(
-            params.in_channels,
-            params.hidden_size,
-            kernel_size=params.patch_size,
-            stride=params.patch_size,
-            bias=True,
-            dtype=dtype,
-            device=device,
-        )
-        self.txt_in = operations.Linear(params.context_in_dim, self.hidden_size, dtype=dtype, device=device)
-        # Approximator used as the distilled guidance layer.
-        self.distilled_guidance_layer = Approximator(
-                    in_dim=self.in_dim,
-                    hidden_dim=self.hidden_dim,
-                    out_dim=self.out_dim,
-                    n_layers=self.n_layers,
-                    dtype=dtype, device=device, operations=operations
-                )
-
-        self.double_blocks = nn.ModuleList(
-            [
-                DoubleStreamBlock(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=params.mlp_ratio,
-                    qkv_bias=params.qkv_bias,
-                    modulation=False,
-                    dtype=dtype, device=device, operations=operations
-                )
-                for _ in range(params.depth)
-            ]
-        )
-
-        self.single_blocks = nn.ModuleList(
-            [
-                SingleStreamBlock(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=params.mlp_ratio,
-                    modulation=False,
-                    dtype=dtype, device=device, operations=operations,
-                )
-                for _ in range(params.depth_single_blocks)
-            ]
-        )
-
-        # pixel channel concat with DCT
-        self.nerf_image_embedder = NerfEmbedder(
-            in_channels=params.in_channels,
-            hidden_size_input=params.nerf_hidden_size,
-            max_freqs=params.nerf_max_freqs,
-            dtype=params.nerf_embedder_dtype or dtype,
-            device=device,
-            operations=operations,
-        )
-
-        self.nerf_blocks = nn.ModuleList([
-            NerfGLUBlock(
-                hidden_size_s=params.hidden_size,
-                hidden_size_x=params.nerf_hidden_size,
-                mlp_ratio=params.nerf_mlp_ratio,
-                dtype=dtype,
-                device=device,
-                operations=operations,
-            ) for _ in range(params.nerf_depth)
-        ])
-
-        if params.nerf_final_head_type == "linear":
-            self.nerf_final_layer = NerfFinalLayer(
-                params.nerf_hidden_size,
-                out_channels=params.in_channels,
-                dtype=dtype,
-                device=device,
-                operations=operations,
-            )
-        elif params.nerf_final_head_type == "conv":
-            self.nerf_final_layer_conv = NerfFinalLayerConv(
-                params.nerf_hidden_size,
-                out_channels=params.in_channels,
-                dtype=dtype,
-                device=device,
-                operations=operations,
-            )
-        else:
-            errstr = f"Unsupported nerf_final_head_type {params.nerf_final_head_type}"
-            raise ValueError(errstr)
-
-        self.skip_mmdit = []
-        self.skip_dit = []
-        self.lite = False
-
-    @property
-    def _nerf_final_layer(self) -> nn.Module:
-        if self.params.nerf_final_head_type == "linear":
-            return self.nerf_final_layer
-        if self.params.nerf_final_head_type == "conv":
-            return self.nerf_final_layer_conv
-        # Impossible to get here as we raise an error on unexpected types on initialization.
-        raise NotImplementedError
-
-    def img_in(self, img: Tensor) -> Tensor:
-        img = self.img_in_patch(img) # -> [B, Hidden, H/P, W/P]
-        # flatten into a sequence for the transformer.
-        return img.flatten(2).transpose(1, 2) # -> [B, NumPatches, Hidden]
-
-    def forward_nerf(
-        self,
-        img_orig: Tensor,
-        img_out: Tensor,
-        params: ChromaRadianceParams,
-    ) -> Tensor:
-        B, C, H, W = img_orig.shape
-        num_patches = img_out.shape[1]
-        patch_size = params.patch_size
-
-        # Store the raw pixel values of each patch for the NeRF head later.
-        # unfold creates patches: [B, C * P * P, NumPatches]
-        nerf_pixels = nn.functional.unfold(img_orig, kernel_size=patch_size, stride=patch_size)
-        nerf_pixels = nerf_pixels.transpose(1, 2) # -> [B, NumPatches, C * P * P]
-
-        # Reshape for per-patch processing
-        nerf_hidden = img_out.reshape(B * num_patches, params.hidden_size)
-        nerf_pixels = nerf_pixels.reshape(B * num_patches, C, patch_size**2).transpose(1, 2)
-
-        if params.nerf_tile_size > 0 and num_patches > params.nerf_tile_size:
-            # Enable tiling if nerf_tile_size isn't 0 and we actually have more patches than
-            # the tile size.
-            img_dct = self.forward_tiled_nerf(nerf_hidden, nerf_pixels, B, C, num_patches, patch_size, params)
-        else:
-            # Get DCT-encoded pixel embeddings [pixel-dct]
-            img_dct = self.nerf_image_embedder(nerf_pixels)
-
-            # Pass through the dynamic MLP blocks (the NeRF)
-            for block in self.nerf_blocks:
-                img_dct = block(img_dct, nerf_hidden)
-
-        # Reassemble the patches into the final image.
-        img_dct = img_dct.transpose(1, 2) # -> [B*NumPatches, C, P*P]
-        # Reshape to combine with batch dimension for fold
-        img_dct = img_dct.reshape(B, num_patches, -1) # -> [B, NumPatches, C*P*P]
-        img_dct = img_dct.transpose(1, 2) # -> [B, C*P*P, NumPatches]
-        img_dct = nn.functional.fold(
-            img_dct,
-            output_size=(H, W),
-            kernel_size=patch_size,
-            stride=patch_size,
-        )
-        return self._nerf_final_layer(img_dct)
-
-    def forward_tiled_nerf(
-        self,
-        nerf_hidden: Tensor,
-        nerf_pixels: Tensor,
-        batch: int,
-        channels: int,
-        num_patches: int,
-        patch_size: int,
-        params: ChromaRadianceParams,
-    ) -> Tensor:
-        """
-        Processes the NeRF head in tiles to save memory.
-        nerf_hidden has shape [B * L, D]
-        nerf_pixels has shape [B * L, P * P, C]
-        """
-        tile_size = params.nerf_tile_size
-        output_tiles = []
-        # Iterate over the patches in tiles. The inputs are already flattened to B * L rows,
-        # so each tile is a contiguous slice along dim 0.
-        for i in range(0, num_patches, tile_size):
-            end = min(i + tile_size, num_patches)
-
-            # Slice the current tile from the input tensors
-            nerf_hidden_tile = nerf_hidden[i * batch:end * batch]
-            nerf_pixels_tile = nerf_pixels[i * batch:end * batch]
-
-            # get DCT-encoded pixel embeddings [pixel-dct]
-            img_dct_tile = self.nerf_image_embedder(nerf_pixels_tile)
-
-            # pass through the dynamic MLP blocks (the NeRF)
-            for block in self.nerf_blocks:
-                img_dct_tile = block(img_dct_tile, nerf_hidden_tile)
-
-            output_tiles.append(img_dct_tile)
-
-        # Concatenate the processed tiles along the flattened (B * L) dimension
-        return torch.cat(output_tiles, dim=0)
-
-    def radiance_get_override_params(self, overrides: dict) -> ChromaRadianceParams:
-        params = self.params
-        if not overrides:
-            return params
-        params_dict = {k: getattr(params, k) for k in params.__dataclass_fields__}
-        nullable_keys = frozenset(("nerf_embedder_dtype",))
-        bad_keys = tuple(k for k in overrides if k not in params_dict)
-        if bad_keys:
-            e = f"Unknown key(s) in transformer_options chroma_radiance_options: {', '.join(bad_keys)}"
-            raise ValueError(e)
-        bad_keys = tuple(
-            k
-            for k, v in overrides.items()
-            if type(v) != type(getattr(params, k)) and (v is not None or k not in nullable_keys)
-        )
-        if bad_keys:
-            e = f"Invalid value(s) in transformer_options chroma_radiance_options: {', '.join(bad_keys)}"
-            raise ValueError(e)
-        # At this point it's all valid keys and values so we can merge with the existing params.
-        params_dict |= overrides
-        return params.__class__(**params_dict)
-
-    def _forward(
-        self,
-        x: Tensor,
-        timestep: Tensor,
-        context: Tensor,
-        guidance: Optional[Tensor],
-        control: Optional[dict]=None,
-        transformer_options: dict={},
-        **kwargs: dict,
-    ) -> Tensor:
-        bs, c, h, w = x.shape
-        img = comfy.ldm.common_dit.pad_to_patch_size(x, (self.patch_size, self.patch_size))
-
-        if img.ndim != 4:
-            raise ValueError("Input img tensor must be in [B, C, H, W] format.")
-        if context.ndim != 3:
-            raise ValueError("Input txt tensors must have 3 dimensions.")
-
-        params = self.radiance_get_override_params(transformer_options.get("chroma_radiance_options", {}))
-
-        h_len = (img.shape[-2] // self.patch_size)
-        w_len = (img.shape[-1] // self.patch_size)
-
-        img_ids = torch.zeros((h_len, w_len, 3), device=x.device, dtype=x.dtype)
-        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(0, h_len - 1, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1)
-        img_ids[:, :, 2] = img_ids[:, :, 2] + torch.linspace(0, w_len - 1, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0)
-        img_ids = repeat(img_ids, "h w c -> b (h w) c", b=bs)
-        txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
-
-        img_out = self.forward_orig(
-            img,
-            img_ids,
-            context,
-            txt_ids,
-            timestep,
-            guidance,
-            control,
-            transformer_options,
-            attn_mask=kwargs.get("attention_mask", None),
-        )
-        return self.forward_nerf(img, img_out, params)[:, :, :h, :w]
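Reviewer note on the removed forward_tiled_nerf: the slices are taken as [i * batch : end * batch] over tensors that were already flattened to B * num_patches rows. The sketch below, using a hypothetical helper named tile_bounds, reproduces only that index arithmetic to show that the tiles are contiguous and cover every row exactly once. It assumes, as in the removed code, that each row is processed independently, so the grouping of rows into tiles does not affect the result.

```python
def tile_bounds(num_patches: int, tile_size: int, batch: int) -> list[tuple[int, int]]:
    """Return the (start, stop) row ranges used for each tile.

    Hypothetical helper: same loop and slice arithmetic as the removed
    forward_tiled_nerf, without any tensors.
    """
    bounds = []
    for i in range(0, num_patches, tile_size):
        end = min(i + tile_size, num_patches)
        bounds.append((i * batch, end * batch))
    return bounds


def covers_all_rows(bounds: list[tuple[int, int]], total_rows: int) -> bool:
    """Check that the ranges are back to back and span [0, total_rows)."""
    expected_start = 0
    for start, stop in bounds:
        if start != expected_start or stop <= start:
            return False
        expected_start = stop
    return expected_start == total_rows


bounds = tile_bounds(num_patches=10, tile_size=4, batch=2)
print(bounds)  # [(0, 8), (8, 16), (16, 20)]
print(covers_all_rows(bounds, total_rows=10 * 2))  # True
```

Each tile therefore holds at most tile_size * batch rows, which is the quantity that bounds peak memory in the per-tile pass.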
--- a/comfy/ldm/cosmos/blocks.py
+++ b/comfy/ldm/cosmos/blocks.py
@@ -176,7 +176,6 @@ class Attention(nn.Module):
        context=None,
        mask=None,
        rope_emb=None,
-        transformer_options={},
        **kwargs,
    ):
        """
@@ -185,7 +184,7 @@ class Attention(nn.Module):
            context (Optional[Tensor]): The key tensor of shape [B, Mk, K] or use x as context [self attention] if None
        """
        q, k, v = self.cal_qkv(x, context, mask, rope_emb=rope_emb, **kwargs)
-        out = optimized_attention(q, k, v, self.heads, skip_reshape=True, mask=mask, skip_output_reshape=True, transformer_options=transformer_options)
+        out = optimized_attention(q, k, v, self.heads, skip_reshape=True, mask=mask, skip_output_reshape=True)
        del q, k, v
        out = rearrange(out, " b n s c -> s b (n c)")
        return self.to_out(out)
@@ -547,7 +546,6 @@ class VideoAttn(nn.Module):
        context: Optional[torch.Tensor] = None,
        crossattn_mask: Optional[torch.Tensor] = None,
        rope_emb_L_1_1_D: Optional[torch.Tensor] = None,
-        transformer_options: Optional[dict] = {},
    ) -> torch.Tensor:
        """
        Forward pass for video attention.
@@ -573,7 +571,6 @@ class VideoAttn(nn.Module):
            context_M_B_D,
            crossattn_mask,
            rope_emb=rope_emb_L_1_1_D,
-            transformer_options=transformer_options,
        )
        x_T_H_W_B_D = rearrange(x_THW_B_D, "(t h w) b d -> t h w b d", h=H, w=W)
        return x_T_H_W_B_D
@@ -668,7 +665,6 @@ class DITBuildingBlock(nn.Module):
        crossattn_mask: Optional[torch.Tensor] = None,
        rope_emb_L_1_1_D: Optional[torch.Tensor] = None,
        adaln_lora_B_3D: Optional[torch.Tensor] = None,
-        transformer_options: Optional[dict] = {},
    ) -> torch.Tensor:
        """
        Forward pass for dynamically configured blocks with adaptive normalization.
@@ -706,7 +702,6 @@ class DITBuildingBlock(nn.Module):
                adaln_norm_state(self.norm_state, x, scale_1_1_1_B_D, shift_1_1_1_B_D),
                context=None,
                rope_emb_L_1_1_D=rope_emb_L_1_1_D,
-                transformer_options=transformer_options,
            )
        elif self.block_type in ["cross_attn", "ca"]:
            x = x + gate_1_1_1_B_D * self.block(
@@ -714,7 +709,6 @@ class DITBuildingBlock(nn.Module):
                context=crossattn_emb,
                crossattn_mask=crossattn_mask,
                rope_emb_L_1_1_D=rope_emb_L_1_1_D,
-                transformer_options=transformer_options,
            )
        else:
            raise ValueError(f"Unknown block type: {self.block_type}")
@@ -790,7 +784,6 @@ class GeneralDITTransformerBlock(nn.Module):
        crossattn_mask: Optional[torch.Tensor] = None,
        rope_emb_L_1_1_D: Optional[torch.Tensor] = None,
        adaln_lora_B_3D: Optional[torch.Tensor] = None,
-        transformer_options: Optional[dict] = {},
    ) -> torch.Tensor:
        for block in self.blocks:
            x = block(
@@ -800,6 +793,5 @@ class GeneralDITTransformerBlock(nn.Module):
                crossattn_mask,
                rope_emb_L_1_1_D=rope_emb_L_1_1_D,
                adaln_lora_B_3D=adaln_lora_B_3D,
-                transformer_options=transformer_options,
            )
        return x
--- a/comfy/ldm/cosmos/model.py
+++ b/comfy/ldm/cosmos/model.py
@@ -27,8 +27,6 @@ from torchvision import transforms
 from enum import Enum
 import logging

-import comfy.patcher_extension
-
 from .blocks import (
    FinalLayer,
    GeneralDITTransformerBlock,
@@ -437,42 +435,6 @@ class GeneralDIT(nn.Module):
        latent_condition_sigma: Optional[torch.Tensor] = None,
        condition_video_augment_sigma: Optional[torch.Tensor] = None,
        **kwargs,
-    ):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, kwargs.get("transformer_options", {}))
-        ).execute(x,
-                timesteps,
-                context,
-                attention_mask,
-                fps,
-                image_size,
-                padding_mask,
-                scalar_feature,
-                data_type,
-                latent_condition,
-                latent_condition_sigma,
-                condition_video_augment_sigma,
-                **kwargs)
-
-    def _forward(
-        self,
-        x: torch.Tensor,
-        timesteps: torch.Tensor,
-        context: torch.Tensor,
-        attention_mask: Optional[torch.Tensor] = None,
-        # crossattn_emb: torch.Tensor,
-        # crossattn_mask: Optional[torch.Tensor] = None,
-        fps: Optional[torch.Tensor] = None,
-        image_size: Optional[torch.Tensor] = None,
-        padding_mask: Optional[torch.Tensor] = None,
-        scalar_feature: Optional[torch.Tensor] = None,
-        data_type: Optional[DataType] = DataType.VIDEO,
-        latent_condition: Optional[torch.Tensor] = None,
-        latent_condition_sigma: Optional[torch.Tensor] = None,
-        condition_video_augment_sigma: Optional[torch.Tensor] = None,
-        **kwargs,
    ):
        """
        Args:
@@ -520,7 +482,6 @@ class GeneralDIT(nn.Module):
                x.shape == extra_pos_emb_B_T_H_W_D_or_T_H_W_B_D.shape
            ), f"{x.shape} != {extra_pos_emb_B_T_H_W_D_or_T_H_W_B_D.shape} {original_shape}"

-        transformer_options = kwargs.get("transformer_options", {})
        for _, block in self.blocks.items():
            assert (
                self.blocks["block0"].x_format == block.x_format
@@ -535,7 +496,6 @@ class GeneralDIT(nn.Module):
                crossattn_mask,
                rope_emb_L_1_1_D=rope_emb_L_1_1_D,
                adaln_lora_B_3D=adaln_lora_B_3D,
-                transformer_options=transformer_options,
            )

        x_B_T_H_W_D = rearrange(x, "T H W B D -> B T H W D")
--- a/comfy/ldm/cosmos/predict2.py
+++ b/comfy/ldm/cosmos/predict2.py
@@ -11,7 +11,6 @@ import math
 from .position_embedding import VideoRopePosition3DEmb, LearnablePosEmbAxis
 from torchvision import transforms

-import comfy.patcher_extension
 from comfy.ldm.modules.attention import optimized_attention

 def apply_rotary_pos_emb(
@@ -44,7 +43,7 @@ class GPT2FeedForward(nn.Module):
        return x


-def torch_attention_op(q_B_S_H_D: torch.Tensor, k_B_S_H_D: torch.Tensor, v_B_S_H_D: torch.Tensor, transformer_options: Optional[dict] = {}) -> torch.Tensor:
+def torch_attention_op(q_B_S_H_D: torch.Tensor, k_B_S_H_D: torch.Tensor, v_B_S_H_D: torch.Tensor) -> torch.Tensor:
    """Computes multi-head attention using PyTorch's native implementation.

    This function provides a PyTorch backend alternative to Transformer Engine's attention operation.
@@ -71,7 +70,7 @@ def torch_attention_op(q_B_S_H_D: torch.Tensor, k_B_S_H_D: torch.Tensor, v_B_S_H
    q_B_H_S_D = rearrange(q_B_S_H_D, "b ... h k -> b h ... k").view(in_q_shape[0], in_q_shape[-2], -1, in_q_shape[-1])
    k_B_H_S_D = rearrange(k_B_S_H_D, "b ... h v -> b h ... v").view(in_k_shape[0], in_k_shape[-2], -1, in_k_shape[-1])
    v_B_H_S_D = rearrange(v_B_S_H_D, "b ... h v -> b h ... v").view(in_k_shape[0], in_k_shape[-2], -1, in_k_shape[-1])
-    return optimized_attention(q_B_H_S_D, k_B_H_S_D, v_B_H_S_D, in_q_shape[-2], skip_reshape=True, transformer_options=transformer_options)
+    return optimized_attention(q_B_H_S_D, k_B_H_S_D, v_B_H_S_D, in_q_shape[-2], skip_reshape=True)


 class Attention(nn.Module):
@@ -180,8 +179,8 @@ class Attention(nn.Module):

        return q, k, v

-    def compute_attention(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, transformer_options: Optional[dict] = {}) -> torch.Tensor:
-        result = self.attn_op(q, k, v, transformer_options=transformer_options)  # [B, S, H, D]
+    def compute_attention(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
+        result = self.attn_op(q, k, v)  # [B, S, H, D]
        return self.output_dropout(self.output_proj(result))

    def forward(
@@ -189,7 +188,6 @@ class Attention(nn.Module):
        x: torch.Tensor,
        context: Optional[torch.Tensor] = None,
        rope_emb: Optional[torch.Tensor] = None,
-        transformer_options: Optional[dict] = {},
    ) -> torch.Tensor:
        """
        Args:
@@ -197,7 +195,7 @@ class Attention(nn.Module):
            context (Optional[Tensor]): The key tensor of shape [B, Mk, K] or use x as context [self attention] if None
        """
        q, k, v = self.compute_qkv(x, context, rope_emb=rope_emb)
-        return self.compute_attention(q, k, v, transformer_options=transformer_options)
+        return self.compute_attention(q, k, v)


 class Timesteps(nn.Module):
@@ -460,7 +458,6 @@ class Block(nn.Module):
        rope_emb_L_1_1_D: Optional[torch.Tensor] = None,
        adaln_lora_B_T_3D: Optional[torch.Tensor] = None,
        extra_per_block_pos_emb: Optional[torch.Tensor] = None,
-        transformer_options: Optional[dict] = {},
    ) -> torch.Tensor:
        if extra_per_block_pos_emb is not None:
            x_B_T_H_W_D = x_B_T_H_W_D + extra_per_block_pos_emb
@@ -514,7 +511,6 @@ class Block(nn.Module):
                rearrange(normalized_x_B_T_H_W_D, "b t h w d -> b (t h w) d"),
                None,
                rope_emb=rope_emb_L_1_1_D,
-                transformer_options=transformer_options,
            ),
            "b (t h w) d -> b t h w d",
            t=T,
@@ -528,7 +524,6 @@ class Block(nn.Module):
            layer_norm_cross_attn: Callable,
            _scale_cross_attn_B_T_1_1_D: torch.Tensor,
            _shift_cross_attn_B_T_1_1_D: torch.Tensor,
-            transformer_options: Optional[dict] = {},
        ) -> torch.Tensor:
            _normalized_x_B_T_H_W_D = _fn(
                _x_B_T_H_W_D, layer_norm_cross_attn, _scale_cross_attn_B_T_1_1_D, _shift_cross_attn_B_T_1_1_D
@@ -538,7 +533,6 @@ class Block(nn.Module):
                    rearrange(_normalized_x_B_T_H_W_D, "b t h w d -> b (t h w) d"),
                    crossattn_emb,
                    rope_emb=rope_emb_L_1_1_D,
-                    transformer_options=transformer_options,
                ),
                "b (t h w) d -> b t h w d",
                t=T,
@@ -552,7 +546,6 @@ class Block(nn.Module):
            self.layer_norm_cross_attn,
            scale_cross_attn_B_T_1_1_D,
            shift_cross_attn_B_T_1_1_D,
-            transformer_options=transformer_options,
        )
        x_B_T_H_W_D = result_B_T_H_W_D * gate_cross_attn_B_T_1_1_D + x_B_T_H_W_D

@@ -812,21 +805,7 @@ class MiniTrainDIT(nn.Module):
        )
        return x_B_C_Tt_Hp_Wp

-    def forward(self,
-        x: torch.Tensor,
-        timesteps: torch.Tensor,
-        context: torch.Tensor,
-        fps: Optional[torch.Tensor] = None,
-        padding_mask: Optional[torch.Tensor] = None,
-        **kwargs,
-    ):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, kwargs.get("transformer_options", {}))
-        ).execute(x, timesteps, context, fps, padding_mask, **kwargs)
-
-    def _forward(
+    def forward(
        self,
        x: torch.Tensor,
        timesteps: torch.Tensor,
@@ -871,7 +850,6 @@ class MiniTrainDIT(nn.Module):
            "rope_emb_L_1_1_D": rope_emb_L_1_1_D.unsqueeze(1).unsqueeze(0),
            "adaln_lora_B_T_3D": adaln_lora_B_T_3D,
            "extra_per_block_pos_emb": extra_pos_emb_B_T_H_W_D_or_T_H_W_B_D,
-            "transformer_options": kwargs.get("transformer_options", {}),
        }
        for block in self.blocks:
            x_B_T_H_W_D = block(
--- a/comfy/ldm/flux/layers.py
+++ b/comfy/ldm/flux/layers.py
@@ -48,11 +48,11 @@ def timestep_embedding(t: Tensor, dim, max_period=10000, time_factor: float = 10
    return embedding

 class MLPEmbedder(nn.Module):
-    def __init__(self, in_dim: int, hidden_dim: int, bias=True, dtype=None, device=None, operations=None):
+    def __init__(self, in_dim: int, hidden_dim: int, dtype=None, device=None, operations=None):
        super().__init__()
-        self.in_layer = operations.Linear(in_dim, hidden_dim, bias=bias, dtype=dtype, device=device)
+        self.in_layer = operations.Linear(in_dim, hidden_dim, bias=True, dtype=dtype, device=device)
        self.silu = nn.SiLU()
-        self.out_layer = operations.Linear(hidden_dim, hidden_dim, bias=bias, dtype=dtype, device=device)
+        self.out_layer = operations.Linear(hidden_dim, hidden_dim, bias=True, dtype=dtype, device=device)

    def forward(self, x: Tensor) -> Tensor:
        return self.out_layer(self.silu(self.in_layer(x)))
@@ -80,14 +80,14 @@ class QKNorm(torch.nn.Module):


 class SelfAttention(nn.Module):
-    def __init__(self, dim: int, num_heads: int = 8, qkv_bias: bool = False, proj_bias: bool = True, dtype=None, device=None, operations=None):
+    def __init__(self, dim: int, num_heads: int = 8, qkv_bias: bool = False, dtype=None, device=None, operations=None):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads

        self.qkv = operations.Linear(dim, dim * 3, bias=qkv_bias, dtype=dtype, device=device)
        self.norm = QKNorm(head_dim, dtype=dtype, device=device, operations=operations)
-        self.proj = operations.Linear(dim, dim, bias=proj_bias, dtype=dtype, device=device)
+        self.proj = operations.Linear(dim, dim, dtype=dtype, device=device)


@dataclass
@@ -98,11 +98,11 @@ class ModulationOut:


 class Modulation(nn.Module):
-    def __init__(self, dim: int, double: bool, bias=True, dtype=None, device=None, operations=None):
+    def __init__(self, dim: int, double: bool, dtype=None, device=None, operations=None):
        super().__init__()
        self.is_double = double
        self.multiplier = 6 if double else 3
-        self.lin = operations.Linear(dim, self.multiplier * dim, bias=bias, dtype=dtype, device=device)
+        self.lin = operations.Linear(dim, self.multiplier * dim, bias=True, dtype=dtype, device=device)

    def forward(self, vec: Tensor) -> tuple:
        if vec.ndim == 2:
@@ -129,129 +129,77 @@ def apply_mod(tensor, m_mult, m_add=None, modulation_dims=None):
        return tensor


-class SiLUActivation(nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.gate_fn = nn.SiLU()
-
-    def forward(self, x: Tensor) -> Tensor:
-        x1, x2 = x.chunk(2, dim=-1)
-        return self.gate_fn(x1) * x2
-
-
 class DoubleStreamBlock(nn.Module):
-    def __init__(self, hidden_size: int, num_heads: int, mlp_ratio: float, qkv_bias: bool = False, flipped_img_txt=False, modulation=True, mlp_silu_act=False, proj_bias=True, dtype=None, device=None, operations=None):
+    def __init__(self, hidden_size: int, num_heads: int, mlp_ratio: float, qkv_bias: bool = False, flipped_img_txt=False, dtype=None, device=None, operations=None):
        super().__init__()

        mlp_hidden_dim = int(hidden_size * mlp_ratio)
        self.num_heads = num_heads
        self.hidden_size = hidden_size
-        self.modulation = modulation
-
-        if self.modulation:
-            self.img_mod = Modulation(hidden_size, double=True, dtype=dtype, device=device, operations=operations)
-
+        self.img_mod = Modulation(hidden_size, double=True, dtype=dtype, device=device, operations=operations)
        self.img_norm1 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
-        self.img_attn = SelfAttention(dim=hidden_size, num_heads=num_heads, qkv_bias=qkv_bias, proj_bias=proj_bias, dtype=dtype, device=device, operations=operations)
+        self.img_attn = SelfAttention(dim=hidden_size, num_heads=num_heads, qkv_bias=qkv_bias, dtype=dtype, device=device, operations=operations)

        self.img_norm2 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
+        self.img_mlp = nn.Sequential(
+            operations.Linear(hidden_size, mlp_hidden_dim, bias=True, dtype=dtype, device=device),
+            nn.GELU(approximate="tanh"),
+            operations.Linear(mlp_hidden_dim, hidden_size, bias=True, dtype=dtype, device=device),
+        )

-        if mlp_silu_act:
-            self.img_mlp = nn.Sequential(
-                operations.Linear(hidden_size, mlp_hidden_dim * 2, bias=False, dtype=dtype, device=device),
-                SiLUActivation(),
-                operations.Linear(mlp_hidden_dim, hidden_size, bias=False, dtype=dtype, device=device),
-            )
-        else:
-            self.img_mlp = nn.Sequential(
-                operations.Linear(hidden_size, mlp_hidden_dim, bias=True, dtype=dtype, device=device),
-                nn.GELU(approximate="tanh"),
-                operations.Linear(mlp_hidden_dim, hidden_size, bias=True, dtype=dtype, device=device),
-            )
-
-        if self.modulation:
-            self.txt_mod = Modulation(hidden_size, double=True, dtype=dtype, device=device, operations=operations)
-
+        self.txt_mod = Modulation(hidden_size, double=True, dtype=dtype, device=device, operations=operations)
        self.txt_norm1 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
-        self.txt_attn = SelfAttention(dim=hidden_size, num_heads=num_heads, qkv_bias=qkv_bias, proj_bias=proj_bias, dtype=dtype, device=device, operations=operations)
+        self.txt_attn = SelfAttention(dim=hidden_size, num_heads=num_heads, qkv_bias=qkv_bias, dtype=dtype, device=device, operations=operations)

        self.txt_norm2 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
-
-        if mlp_silu_act:
-            self.txt_mlp = nn.Sequential(
-                operations.Linear(hidden_size, mlp_hidden_dim * 2, bias=False, dtype=dtype, device=device),
-                SiLUActivation(),
-                operations.Linear(mlp_hidden_dim, hidden_size, bias=False, dtype=dtype, device=device),
-            )
-        else:
-            self.txt_mlp = nn.Sequential(
-                operations.Linear(hidden_size, mlp_hidden_dim, bias=True, dtype=dtype, device=device),
-                nn.GELU(approximate="tanh"),
-                operations.Linear(mlp_hidden_dim, hidden_size, bias=True, dtype=dtype, device=device),
-            )
-
+        self.txt_mlp = nn.Sequential(
+            operations.Linear(hidden_size, mlp_hidden_dim, bias=True, dtype=dtype, device=device),
+            nn.GELU(approximate="tanh"),
+            operations.Linear(mlp_hidden_dim, hidden_size, bias=True, dtype=dtype, device=device),
+        )
        self.flipped_img_txt = flipped_img_txt

-    def forward(self, img: Tensor, txt: Tensor, vec: Tensor, pe: Tensor, attn_mask=None, modulation_dims_img=None, modulation_dims_txt=None, transformer_options={}):
-        if self.modulation:
-            img_mod1, img_mod2 = self.img_mod(vec)
-            txt_mod1, txt_mod2 = self.txt_mod(vec)
-        else:
-            (img_mod1, img_mod2), (txt_mod1, txt_mod2) = vec
+    def forward(self, img: Tensor, txt: Tensor, vec: Tensor, pe: Tensor, attn_mask=None, modulation_dims_img=None, modulation_dims_txt=None):
+        img_mod1, img_mod2 = self.img_mod(vec)
+        txt_mod1, txt_mod2 = self.txt_mod(vec)

        # prepare image for attention
        img_modulated = self.img_norm1(img)
        img_modulated = apply_mod(img_modulated, (1 + img_mod1.scale), img_mod1.shift, modulation_dims_img)
        img_qkv = self.img_attn.qkv(img_modulated)
-        del img_modulated
        img_q, img_k, img_v = img_qkv.view(img_qkv.shape[0], img_qkv.shape[1], 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
-        del img_qkv
        img_q, img_k = self.img_attn.norm(img_q, img_k, img_v)

        # prepare txt for attention
        txt_modulated = self.txt_norm1(txt)
        txt_modulated = apply_mod(txt_modulated, (1 + txt_mod1.scale), txt_mod1.shift, modulation_dims_txt)
        txt_qkv = self.txt_attn.qkv(txt_modulated)
-        del txt_modulated
        txt_q, txt_k, txt_v = txt_qkv.view(txt_qkv.shape[0], txt_qkv.shape[1], 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
-        del txt_qkv
        txt_q, txt_k = self.txt_attn.norm(txt_q, txt_k, txt_v)

        if self.flipped_img_txt:
-            q = torch.cat((img_q, txt_q), dim=2)
-            del img_q, txt_q
-            k = torch.cat((img_k, txt_k), dim=2)
-            del img_k, txt_k
-            v = torch.cat((img_v, txt_v), dim=2)
-            del img_v, txt_v
            # run actual attention
-            attn = attention(q, k, v,
-                             pe=pe, mask=attn_mask, transformer_options=transformer_options)
-            del q, k, v
+            attn = attention(torch.cat((img_q, txt_q), dim=2),
+                             torch.cat((img_k, txt_k), dim=2),
+                             torch.cat((img_v, txt_v), dim=2),
+                             pe=pe, mask=attn_mask)

            img_attn, txt_attn = attn[:, : img.shape[1]], attn[:, img.shape[1]:]
        else:
-            q = torch.cat((txt_q, img_q), dim=2)
-            del txt_q, img_q
-            k = torch.cat((txt_k, img_k), dim=2)
-            del txt_k, img_k
-            v = torch.cat((txt_v, img_v), dim=2)
-            del txt_v, img_v
            # run actual attention
-            attn = attention(q, k, v,
-                             pe=pe, mask=attn_mask, transformer_options=transformer_options)
-            del q, k, v
+            attn = attention(torch.cat((txt_q, img_q), dim=2),
+                             torch.cat((txt_k, img_k), dim=2),
+                             torch.cat((txt_v, img_v), dim=2),
+                             pe=pe, mask=attn_mask)

            txt_attn, img_attn = attn[:, : txt.shape[1]], attn[:, txt.shape[1]:]

        # calculate the img blocks
-        img += apply_mod(self.img_attn.proj(img_attn), img_mod1.gate, None, modulation_dims_img)
-        del img_attn
-        img += apply_mod(self.img_mlp(apply_mod(self.img_norm2(img), (1 + img_mod2.scale), img_mod2.shift, modulation_dims_img)), img_mod2.gate, None, modulation_dims_img)
+        img = img + apply_mod(self.img_attn.proj(img_attn), img_mod1.gate, None, modulation_dims_img)
+        img = img + apply_mod(self.img_mlp(apply_mod(self.img_norm2(img), (1 + img_mod2.scale), img_mod2.shift, modulation_dims_img)), img_mod2.gate, None, modulation_dims_img)

        # calculate the txt blocks
        txt += apply_mod(self.txt_attn.proj(txt_attn), txt_mod1.gate, None, modulation_dims_txt)
-        del txt_attn
        txt += apply_mod(self.txt_mlp(apply_mod(self.txt_norm2(txt), (1 + txt_mod2.scale), txt_mod2.shift, modulation_dims_txt)), txt_mod2.gate, None, modulation_dims_txt)

        if txt.dtype == torch.float16:
@@ -272,9 +220,6 @@ class SingleStreamBlock(nn.Module):
        num_heads: int,
        mlp_ratio: float = 4.0,
        qk_scale: float = None,
-        modulation=True,
-        mlp_silu_act=False,
-        bias=True,
        dtype=None,
        device=None,
        operations=None
@@ -286,47 +231,30 @@ class SingleStreamBlock(nn.Module):
        self.scale = qk_scale or head_dim**-0.5

        self.mlp_hidden_dim = int(hidden_size * mlp_ratio)
-
-        self.mlp_hidden_dim_first = self.mlp_hidden_dim
-        if mlp_silu_act:
-            self.mlp_hidden_dim_first = int(hidden_size * mlp_ratio * 2)
-            self.mlp_act = SiLUActivation()
-        else:
-            self.mlp_act = nn.GELU(approximate="tanh")
-
        # qkv and mlp_in
-        self.linear1 = operations.Linear(hidden_size, hidden_size * 3 + self.mlp_hidden_dim_first, bias=bias, dtype=dtype, device=device)
+        self.linear1 = operations.Linear(hidden_size, hidden_size * 3 + self.mlp_hidden_dim, dtype=dtype, device=device)
        # proj and mlp_out
-        self.linear2 = operations.Linear(hidden_size + self.mlp_hidden_dim, hidden_size, bias=bias, dtype=dtype, device=device)
+        self.linear2 = operations.Linear(hidden_size + self.mlp_hidden_dim, hidden_size, dtype=dtype, device=device)

        self.norm = QKNorm(head_dim, dtype=dtype, device=device, operations=operations)

        self.hidden_size = hidden_size
        self.pre_norm = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)

-        if modulation:
-            self.modulation = Modulation(hidden_size, double=False, dtype=dtype, device=device, operations=operations)
-        else:
-            self.modulation = None
+        self.mlp_act = nn.GELU(approximate="tanh")
+        self.modulation = Modulation(hidden_size, double=False, dtype=dtype, device=device, operations=operations)

-    def forward(self, x: Tensor, vec: Tensor, pe: Tensor, attn_mask=None, modulation_dims=None, transformer_options={}) -> Tensor:
-        if self.modulation:
-            mod, _ = self.modulation(vec)
-        else:
-            mod = vec
-
-        qkv, mlp = torch.split(self.linear1(apply_mod(self.pre_norm(x), (1 + mod.scale), mod.shift, modulation_dims)), [3 * self.hidden_size, self.mlp_hidden_dim_first], dim=-1)
+    def forward(self, x: Tensor, vec: Tensor, pe: Tensor, attn_mask=None, modulation_dims=None) -> Tensor:
+        mod, _ = self.modulation(vec)
+        qkv, mlp = torch.split(self.linear1(apply_mod(self.pre_norm(x), (1 + mod.scale), mod.shift, modulation_dims)), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1)

        q, k, v = qkv.view(qkv.shape[0], qkv.shape[1], 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
-        del qkv
        q, k = self.norm(q, k, v)

        # compute attention
-        attn = attention(q, k, v, pe=pe, mask=attn_mask, transformer_options=transformer_options)
-        del q, k, v
+        attn = attention(q, k, v, pe=pe, mask=attn_mask)
        # compute activation in mlp stream, cat again and run second linear layer
-        mlp = self.mlp_act(mlp)
-        output = self.linear2(torch.cat((attn, mlp), 2))
+        output = self.linear2(torch.cat((attn, self.mlp_act(mlp)), 2))
        x += apply_mod(output, mod.gate, None, modulation_dims)
        if x.dtype == torch.float16:
            x = torch.nan_to_num(x, nan=0.0, posinf=65504, neginf=-65504)
@@ -334,11 +262,11 @@ class SingleStreamBlock(nn.Module):


 class LastLayer(nn.Module):
-    def __init__(self, hidden_size: int, patch_size: int, out_channels: int, bias=True, dtype=None, device=None, operations=None):
+    def __init__(self, hidden_size: int, patch_size: int, out_channels: int, dtype=None, device=None, operations=None):
        super().__init__()
        self.norm_final = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
-        self.linear = operations.Linear(hidden_size, patch_size * patch_size * out_channels, bias=bias, dtype=dtype, device=device)
-        self.adaLN_modulation = nn.Sequential(nn.SiLU(), operations.Linear(hidden_size, 2 * hidden_size, bias=bias, dtype=dtype, device=device))
+        self.linear = operations.Linear(hidden_size, patch_size * patch_size * out_channels, bias=True, dtype=dtype, device=device)
+        self.adaLN_modulation = nn.Sequential(nn.SiLU(), operations.Linear(hidden_size, 2 * hidden_size, bias=True, dtype=dtype, device=device))

    def forward(self, x: Tensor, vec: Tensor, modulation_dims=None) -> Tensor:
        if vec.ndim == 2:
--- a/comfy/ldm/flux/math.py
+++ b/comfy/ldm/flux/math.py
@@ -6,11 +6,18 @@ from comfy.ldm.modules.attention import optimized_attention
 import comfy.model_management


-def attention(q: Tensor, k: Tensor, v: Tensor, pe: Tensor, mask=None, transformer_options={}) -> Tensor:
+def attention(q: Tensor, k: Tensor, v: Tensor, pe: Tensor, mask=None) -> Tensor:
+    q_shape = q.shape
+    k_shape = k.shape
+
    if pe is not None:
-        q, k = apply_rope(q, k, pe)
+        q = q.to(dtype=pe.dtype).reshape(*q.shape[:-1], -1, 1, 2)
+        k = k.to(dtype=pe.dtype).reshape(*k.shape[:-1], -1, 1, 2)
+        q = (pe[..., 0] * q[..., 0] + pe[..., 1] * q[..., 1]).reshape(*q_shape).type_as(v)
+        k = (pe[..., 0] * k[..., 0] + pe[..., 1] * k[..., 1]).reshape(*k_shape).type_as(v)
+
    heads = q.shape[1]
-    x = optimized_attention(q, k, v, heads, skip_reshape=True, mask=mask, transformer_options=transformer_options)
+    x = optimized_attention(q, k, v, heads, skip_reshape=True, mask=mask)
    return x


@@ -28,13 +35,11 @@ def rope(pos: Tensor, dim: int, theta: int) -> Tensor:
    out = rearrange(out, "b n d (i j) -> b n d i j", i=2, j=2)
    return out.to(dtype=torch.float32, device=pos.device)

-def apply_rope1(x: Tensor, freqs_cis: Tensor):
-    x_ = x.to(dtype=freqs_cis.dtype).reshape(*x.shape[:-1], -1, 1, 2)
-
-    x_out = freqs_cis[..., 0] * x_[..., 0]
-    x_out.addcmul_(freqs_cis[..., 1], x_[..., 1])
-
-    return x_out.reshape(*x.shape).type_as(x)

 def apply_rope(xq: Tensor, xk: Tensor, freqs_cis: Tensor):
-    return apply_rope1(xq, freqs_cis), apply_rope1(xk, freqs_cis)
+    xq_ = xq.to(dtype=freqs_cis.dtype).reshape(*xq.shape[:-1], -1, 1, 2)
+    xk_ = xk.to(dtype=freqs_cis.dtype).reshape(*xk.shape[:-1], -1, 1, 2)
+    xq_out = freqs_cis[..., 0] * xq_[..., 0] + freqs_cis[..., 1] * xq_[..., 1]
+    xk_out = freqs_cis[..., 0] * xk_[..., 0] + freqs_cis[..., 1] * xk_[..., 1]
+    return xq_out.reshape(*xq.shape).type_as(xq), xk_out.reshape(*xk.shape).type_as(xk)
+
--- a/comfy/ldm/flux/model.py
+++ b/comfy/ldm/flux/model.py
@@ -6,7 +6,6 @@ import torch
 from torch import Tensor, nn
 from einops import rearrange, repeat
 import comfy.ldm.common_dit
-import comfy.patcher_extension

 from .layers import (
    DoubleStreamBlock,
@@ -15,7 +14,6 @@ from .layers import (
    MLPEmbedder,
    SingleStreamBlock,
    timestep_embedding,
-    Modulation
 )

@dataclass
@@ -34,11 +32,6 @@ class FluxParams:
    patch_size: int
    qkv_bias: bool
    guidance_embed: bool
-    global_modulation: bool = False
-    mlp_silu_act: bool = False
-    ops_bias: bool = True
-    default_ref_method: str = "offset"
-    ref_index_scale: float = 1.0


 class Flux(nn.Module):
@@ -64,17 +57,13 @@ class Flux(nn.Module):
        self.hidden_size = params.hidden_size
        self.num_heads = params.num_heads
        self.pe_embedder = EmbedND(dim=pe_dim, theta=params.theta, axes_dim=params.axes_dim)
-        self.img_in = operations.Linear(self.in_channels, self.hidden_size, bias=params.ops_bias, dtype=dtype, device=device)
-        self.time_in = MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, bias=params.ops_bias, dtype=dtype, device=device, operations=operations)
-        if params.vec_in_dim is not None:
-            self.vector_in = MLPEmbedder(params.vec_in_dim, self.hidden_size, dtype=dtype, device=device, operations=operations)
-        else:
-            self.vector_in = None
-
+        self.img_in = operations.Linear(self.in_channels, self.hidden_size, bias=True, dtype=dtype, device=device)
+        self.time_in = MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, dtype=dtype, device=device, operations=operations)
+        self.vector_in = MLPEmbedder(params.vec_in_dim, self.hidden_size, dtype=dtype, device=device, operations=operations)
        self.guidance_in = (
-            MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, bias=params.ops_bias, dtype=dtype, device=device, operations=operations) if params.guidance_embed else nn.Identity()
+            MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, dtype=dtype, device=device, operations=operations) if params.guidance_embed else nn.Identity()
        )
-        self.txt_in = operations.Linear(params.context_in_dim, self.hidden_size, bias=params.ops_bias, dtype=dtype, device=device)
+        self.txt_in = operations.Linear(params.context_in_dim, self.hidden_size, dtype=dtype, device=device)

        self.double_blocks = nn.ModuleList(
            [
@@ -83,9 +72,6 @@ class Flux(nn.Module):
                    self.num_heads,
                    mlp_ratio=params.mlp_ratio,
                    qkv_bias=params.qkv_bias,
-                    modulation=params.global_modulation is False,
-                    mlp_silu_act=params.mlp_silu_act,
-                    proj_bias=params.ops_bias,
                    dtype=dtype, device=device, operations=operations
                )
                for _ in range(params.depth)
@@ -94,30 +80,13 @@ class Flux(nn.Module):

        self.single_blocks = nn.ModuleList(
            [
-                SingleStreamBlock(self.hidden_size, self.num_heads, mlp_ratio=params.mlp_ratio, modulation=params.global_modulation is False, mlp_silu_act=params.mlp_silu_act, bias=params.ops_bias, dtype=dtype, device=device, operations=operations)
+                SingleStreamBlock(self.hidden_size, self.num_heads, mlp_ratio=params.mlp_ratio, dtype=dtype, device=device, operations=operations)
                for _ in range(params.depth_single_blocks)
            ]
        )

        if final_layer:
-            self.final_layer = LastLayer(self.hidden_size, 1, self.out_channels, bias=params.ops_bias, dtype=dtype, device=device, operations=operations)
-
-        if params.global_modulation:
-            self.double_stream_modulation_img = Modulation(
-                self.hidden_size,
-                double=True,
-                bias=False,
-                dtype=dtype, device=device, operations=operations
-            )
-            self.double_stream_modulation_txt = Modulation(
-                self.hidden_size,
-                double=True,
-                bias=False,
-                dtype=dtype, device=device, operations=operations
-            )
-            self.single_stream_modulation = Modulation(
-                self.hidden_size, double=False, bias=False, dtype=dtype, device=device, operations=operations
-            )
+            self.final_layer = LastLayer(self.hidden_size, 1, self.out_channels, dtype=dtype, device=device, operations=operations)

    def forward_orig(
        self,
@@ -133,7 +102,9 @@ class Flux(nn.Module):
        attn_mask: Tensor = None,
    ) -> Tensor:

-        patches = transformer_options.get("patches", {})
+        if y is None:
+            y = torch.zeros((img.shape[0], self.params.vec_in_dim), device=img.device, dtype=img.dtype)
+
        patches_replace = transformer_options.get("patches_replace", {})
        if img.ndim != 3 or txt.ndim != 3:
            raise ValueError("Input img and txt tensors must have 3 dimensions.")
@@ -145,25 +116,9 @@ class Flux(nn.Module):
            if guidance is not None:
                vec = vec + self.guidance_in(timestep_embedding(guidance, 256).to(img.dtype))

-        if self.vector_in is not None:
-            if y is None:
-                y = torch.zeros((img.shape[0], self.params.vec_in_dim), device=img.device, dtype=img.dtype)
-            vec = vec + self.vector_in(y[:, :self.params.vec_in_dim])
-
+        vec = vec + self.vector_in(y[:,:self.params.vec_in_dim])
        txt = self.txt_in(txt)

-        vec_orig = vec
-        if self.params.global_modulation:
-            vec = (self.double_stream_modulation_img(vec_orig), self.double_stream_modulation_txt(vec_orig))
-
-        if "post_input" in patches:
-            for p in patches["post_input"]:
-                out = p({"img": img, "txt": txt, "img_ids": img_ids, "txt_ids": txt_ids})
-                img = out["img"]
-                txt = out["txt"]
-                img_ids = out["img_ids"]
-                txt_ids = out["txt_ids"]
-
        if img_ids is not None:
            ids = torch.cat((txt_ids, img_ids), dim=1)
            pe = self.pe_embedder(ids)
@@ -179,16 +134,14 @@ class Flux(nn.Module):
                                                   txt=args["txt"],
                                                   vec=args["vec"],
                                                   pe=args["pe"],
-                                                   attn_mask=args.get("attn_mask"),
-                                                   transformer_options=args.get("transformer_options"))
+                                                   attn_mask=args.get("attn_mask"))
                    return out

                out = blocks_replace[("double_block", i)]({"img": img,
                                                           "txt": txt,
                                                           "vec": vec,
                                                           "pe": pe,
-                                                           "attn_mask": attn_mask,
-                                                           "transformer_options": transformer_options},
+                                                           "attn_mask": attn_mask},
                                                          {"original_block": block_wrap})
                txt = out["txt"]
                img = out["img"]
@@ -197,24 +150,20 @@ class Flux(nn.Module):
                                 txt=txt,
                                 vec=vec,
                                 pe=pe,
-                                 attn_mask=attn_mask,
-                                 transformer_options=transformer_options)
+                                 attn_mask=attn_mask)

            if control is not None: # Controlnet
                control_i = control.get("input")
                if i < len(control_i):
                    add = control_i[i]
                    if add is not None:
-                        img[:, :add.shape[1]] += add
+                        img += add

        if img.dtype == torch.float16:
            img = torch.nan_to_num(img, nan=0.0, posinf=65504, neginf=-65504)

        img = torch.cat((txt, img), 1)

-        if self.params.global_modulation:
-            vec, _ = self.single_stream_modulation(vec_orig)
-
        for i, block in enumerate(self.single_blocks):
            if ("single_block", i) in blocks_replace:
                def block_wrap(args):
@@ -222,33 +171,31 @@ class Flux(nn.Module):
                    out["img"] = block(args["img"],
                                       vec=args["vec"],
                                       pe=args["pe"],
-                                       attn_mask=args.get("attn_mask"),
-                                       transformer_options=args.get("transformer_options"))
+                                       attn_mask=args.get("attn_mask"))
                    return out

                out = blocks_replace[("single_block", i)]({"img": img,
                                                           "vec": vec,
                                                           "pe": pe,
-                                                           "attn_mask": attn_mask,
-                                                           "transformer_options": transformer_options},
+                                                           "attn_mask": attn_mask},
                                                          {"original_block": block_wrap})
                img = out["img"]
            else:
-                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask, transformer_options=transformer_options)
+                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask)

            if control is not None: # Controlnet
                control_o = control.get("output")
                if i < len(control_o):
                    add = control_o[i]
                    if add is not None:
-                        img[:, txt.shape[1] : txt.shape[1] + add.shape[1], ...] += add
+                        img[:, txt.shape[1] :, ...] += add

        img = img[:, txt.shape[1] :, ...]

-        img = self.final_layer(img, vec_orig)  # (N, T, patch_size ** 2 * out_channels)
+        img = self.final_layer(img, vec)  # (N, T, patch_size ** 2 * out_channels)
        return img

-    def process_img(self, x, index=0, h_offset=0, w_offset=0, transformer_options={}):
+    def process_img(self, x, index=0, h_offset=0, w_offset=0):
        bs, c, h, w = x.shape
        patch_size = self.patch_size
        x = comfy.ldm.common_dit.pad_to_patch_size(x, (patch_size, patch_size))
@@ -260,55 +207,30 @@ class Flux(nn.Module):
        h_offset = ((h_offset + (patch_size // 2)) // patch_size)
        w_offset = ((w_offset + (patch_size // 2)) // patch_size)

-        steps_h = h_len
-        steps_w = w_len
-
-        rope_options = transformer_options.get("rope_options", None)
-        if rope_options is not None:
-            h_len = (h_len - 1.0) * rope_options.get("scale_y", 1.0) + 1.0
-            w_len = (w_len - 1.0) * rope_options.get("scale_x", 1.0) + 1.0
-
-            index += rope_options.get("shift_t", 0.0)
-            h_offset += rope_options.get("shift_y", 0.0)
-            w_offset += rope_options.get("shift_x", 0.0)
-
-        img_ids = torch.zeros((steps_h, steps_w, len(self.params.axes_dim)), device=x.device, dtype=torch.float32)
+        img_ids = torch.zeros((h_len, w_len, 3), device=x.device, dtype=x.dtype)
        img_ids[:, :, 0] = img_ids[:, :, 1] + index
-        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(h_offset, h_len - 1 + h_offset, steps=steps_h, device=x.device, dtype=torch.float32).unsqueeze(1)
-        img_ids[:, :, 2] = img_ids[:, :, 2] + torch.linspace(w_offset, w_len - 1 + w_offset, steps=steps_w, device=x.device, dtype=torch.float32).unsqueeze(0)
+        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(h_offset, h_len - 1 + h_offset, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1)
+        img_ids[:, :, 2] = img_ids[:, :, 2] + torch.linspace(w_offset, w_len - 1 + w_offset, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0)
        return img, repeat(img_ids, "h w c -> b (h w) c", b=bs)

    def forward(self, x, timestep, context, y=None, guidance=None, ref_latents=None, control=None, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, y, guidance, ref_latents, control, transformer_options, **kwargs)
-
-    def _forward(self, x, timestep, context, y=None, guidance=None, ref_latents=None, control=None, transformer_options={}, **kwargs):
        bs, c, h_orig, w_orig = x.shape
        patch_size = self.patch_size

        h_len = ((h_orig + (patch_size // 2)) // patch_size)
        w_len = ((w_orig + (patch_size // 2)) // patch_size)
-        img, img_ids = self.process_img(x, transformer_options=transformer_options)
+        img, img_ids = self.process_img(x)
        img_tokens = img.shape[1]
        if ref_latents is not None:
            h = 0
            w = 0
            index = 0
-            ref_latents_method = kwargs.get("ref_latents_method", self.params.default_ref_method)
+            index_ref_method = kwargs.get("ref_latents_method", "offset") == "index"
            for ref in ref_latents:
-                if ref_latents_method == "index":
-                    index += self.params.ref_index_scale
+                if index_ref_method:
+                    index += 1
                    h_offset = 0
                    w_offset = 0
-                elif ref_latents_method == "uxo":
-                    index = 0
-                    h_offset = h_len * patch_size + h
-                    w_offset = w_len * patch_size + w
-                    h += ref.shape[-2]
-                    w += ref.shape[-1]
                else:
                    index = 1
                    h_offset = 0
@@ -324,11 +246,7 @@ class Flux(nn.Module):
                img = torch.cat([img, kontext], dim=1)
                img_ids = torch.cat([img_ids, kontext_ids], dim=1)

-        txt_ids = torch.zeros((bs, context.shape[1], len(self.params.axes_dim)), device=x.device, dtype=torch.float32)
-
-        if len(self.params.axes_dim) == 4: # Flux 2
-            txt_ids[:, :, 3] = torch.linspace(0, context.shape[1] - 1, steps=context.shape[1], device=x.device, dtype=torch.float32)
-
+        txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
        out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control, transformer_options, attn_mask=kwargs.get("attention_mask", None))
        out = out[:, :img_tokens]
-        return rearrange(out, "b (h w) (c ph pw) -> b c (h ph) (w pw)", h=h_len, w=w_len, ph=self.patch_size, pw=self.patch_size)[:,:,:h_orig,:w_orig]
+        return rearrange(out, "b (h w) (c ph pw) -> b c (h ph) (w pw)", h=h_len, w=w_len, ph=2, pw=2)[:,:,:h_orig,:w_orig]
--- a/comfy/ldm/genmo/joint_model/asymm_models_joint.py
+++ b/comfy/ldm/genmo/joint_model/asymm_models_joint.py
@@ -109,7 +109,6 @@ class AsymmetricAttention(nn.Module):
        scale_x: torch.Tensor,  # (B, dim_x), modulation for pre-RMSNorm.
        scale_y: torch.Tensor,  # (B, dim_y), modulation for pre-RMSNorm.
        crop_y,
-        transformer_options={},
        **rope_rotation,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        rope_cos = rope_rotation.get("rope_cos")
@@ -144,7 +143,7 @@ class AsymmetricAttention(nn.Module):

        xy = optimized_attention(q,
                                 k,
-                                 v, self.num_heads, skip_reshape=True, transformer_options=transformer_options)
+                                 v, self.num_heads, skip_reshape=True)

        x, y = torch.tensor_split(xy, (q_x.shape[1],), dim=1)
        x = self.proj_x(x)
@@ -225,7 +224,6 @@ class AsymmetricJointBlock(nn.Module):
        x: torch.Tensor,
        c: torch.Tensor,
        y: torch.Tensor,
-        transformer_options={},
        **attn_kwargs,
    ):
        """Forward pass of a block.
@@ -258,7 +256,6 @@ class AsymmetricJointBlock(nn.Module):
            y,
            scale_x=scale_msa_x,
            scale_y=scale_msa_y,
-            transformer_options=transformer_options,
            **attn_kwargs,
        )

@@ -527,11 +524,10 @@ class AsymmDiTJoint(nn.Module):
                                                    args["txt"],
                                                    rope_cos=args["rope_cos"],
                                                    rope_sin=args["rope_sin"],
-                                                    crop_y=args["num_tokens"],
-                                                    transformer_options=args["transformer_options"]
+                                                    crop_y=args["num_tokens"]
                                                    )
                    return out
-                out = blocks_replace[("double_block", i)]({"img": x, "txt": y_feat, "vec": c, "rope_cos": rope_cos, "rope_sin": rope_sin, "num_tokens": num_tokens, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("double_block", i)]({"img": x, "txt": y_feat, "vec": c, "rope_cos": rope_cos, "rope_sin": rope_sin, "num_tokens": num_tokens}, {"original_block": block_wrap})
                y_feat = out["txt"]
                x = out["img"]
            else:
@@ -542,7 +538,6 @@ class AsymmDiTJoint(nn.Module):
                    rope_cos=rope_cos,
                    rope_sin=rope_sin,
                    crop_y=num_tokens,
-                    transformer_options=transformer_options,
                )  # (B, M, D), (B, L, D)
        del y_feat  # Final layers don't use dense text features.

--- a/comfy/ldm/hidream/model.py
+++ b/comfy/ldm/hidream/model.py
@@ -13,7 +13,6 @@ from comfy.ldm.flux.layers import LastLayer

 from comfy.ldm.modules.attention import optimized_attention
 import comfy.model_management
-import comfy.patcher_extension
 import comfy.ldm.common_dit


@@ -72,8 +71,8 @@ class TimestepEmbed(nn.Module):
        return t_emb


-def attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, transformer_options={}):
-    return optimized_attention(query.view(query.shape[0], -1, query.shape[-1] * query.shape[-2]), key.view(key.shape[0], -1, key.shape[-1] * key.shape[-2]), value.view(value.shape[0], -1, value.shape[-1] * value.shape[-2]), query.shape[2], transformer_options=transformer_options)
+def attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor):
+    return optimized_attention(query.view(query.shape[0], -1, query.shape[-1] * query.shape[-2]), key.view(key.shape[0], -1, key.shape[-1] * key.shape[-2]), value.view(value.shape[0], -1, value.shape[-1] * value.shape[-2]), query.shape[2])


 class HiDreamAttnProcessor_flashattn:
@@ -86,7 +85,6 @@ class HiDreamAttnProcessor_flashattn:
        image_tokens_masks: Optional[torch.FloatTensor] = None,
        text_tokens: Optional[torch.FloatTensor] = None,
        rope: torch.FloatTensor = None,
-        transformer_options={},
        *args,
        **kwargs,
    ) -> torch.FloatTensor:
@@ -134,7 +132,7 @@ class HiDreamAttnProcessor_flashattn:
            query = torch.cat([query_1, query_2], dim=-1)
            key = torch.cat([key_1, key_2], dim=-1)

-        hidden_states = attention(query, key, value, transformer_options=transformer_options)
+        hidden_states = attention(query, key, value)

        if not attn.single:
            hidden_states_i, hidden_states_t = torch.split(hidden_states, [num_image_tokens, num_text_tokens], dim=1)
@@ -200,7 +198,6 @@ class HiDreamAttention(nn.Module):
        image_tokens_masks: torch.FloatTensor = None,
        norm_text_tokens: torch.FloatTensor = None,
        rope: torch.FloatTensor = None,
-        transformer_options={},
    ) -> torch.Tensor:
        return self.processor(
            self,
@@ -208,7 +205,6 @@ class HiDreamAttention(nn.Module):
            image_tokens_masks = image_tokens_masks,
            text_tokens = norm_text_tokens,
            rope = rope,
-            transformer_options=transformer_options,
        )


@@ -409,7 +405,7 @@ class HiDreamImageSingleTransformerBlock(nn.Module):
        text_tokens: Optional[torch.FloatTensor] = None,
        adaln_input: Optional[torch.FloatTensor] = None,
        rope: torch.FloatTensor = None,
-        transformer_options={},
+
    ) -> torch.FloatTensor:
        wtype = image_tokens.dtype
        shift_msa_i, scale_msa_i, gate_msa_i, shift_mlp_i, scale_mlp_i, gate_mlp_i = \
@@ -422,7 +418,6 @@ class HiDreamImageSingleTransformerBlock(nn.Module):
            norm_image_tokens,
            image_tokens_masks,
            rope = rope,
-            transformer_options=transformer_options,
        )
        image_tokens = gate_msa_i * attn_output_i + image_tokens

@@ -487,7 +482,6 @@ class HiDreamImageTransformerBlock(nn.Module):
        text_tokens: Optional[torch.FloatTensor] = None,
        adaln_input: Optional[torch.FloatTensor] = None,
        rope: torch.FloatTensor = None,
-        transformer_options={},
    ) -> torch.FloatTensor:
        wtype = image_tokens.dtype
        shift_msa_i, scale_msa_i, gate_msa_i, shift_mlp_i, scale_mlp_i, gate_mlp_i, \
@@ -505,7 +499,6 @@ class HiDreamImageTransformerBlock(nn.Module):
            image_tokens_masks,
            norm_text_tokens,
            rope = rope,
-            transformer_options=transformer_options,
        )

        image_tokens = gate_msa_i * attn_output_i + image_tokens
@@ -556,7 +549,6 @@ class HiDreamImageBlock(nn.Module):
        text_tokens: Optional[torch.FloatTensor] = None,
        adaln_input: torch.FloatTensor = None,
        rope: torch.FloatTensor = None,
-        transformer_options={},
    ) -> torch.FloatTensor:
        return self.block(
            image_tokens,
@@ -564,7 +556,6 @@ class HiDreamImageBlock(nn.Module):
            text_tokens,
            adaln_input,
            rope,
-            transformer_options=transformer_options,
        )


@@ -701,23 +692,7 @@ class HiDreamImageTransformer2DModel(nn.Module):
            raise NotImplementedError
        return x, x_masks, img_sizes

-    def forward(self,
-        x: torch.Tensor,
-        t: torch.Tensor,
-        y: Optional[torch.Tensor] = None,
-        context: Optional[torch.Tensor] = None,
-        encoder_hidden_states_llama3=None,
-        image_cond=None,
-        control = None,
-        transformer_options = {},
-    ):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, t, y, context, encoder_hidden_states_llama3, image_cond, control, transformer_options)
-
-    def _forward(
+    def forward(
        self,
        x: torch.Tensor,
        t: torch.Tensor,
@@ -794,7 +769,6 @@ class HiDreamImageTransformer2DModel(nn.Module):
                text_tokens = cur_encoder_hidden_states,
                adaln_input = adaln_input,
                rope = rope,
-                transformer_options=transformer_options,
            )
            initial_encoder_hidden_states = initial_encoder_hidden_states[:, :initial_encoder_hidden_states_seq_len]
            block_id += 1
@@ -818,7 +792,6 @@ class HiDreamImageTransformer2DModel(nn.Module):
                text_tokens=None,
                adaln_input=adaln_input,
                rope=rope,
-                transformer_options=transformer_options,
            )
            hidden_states = hidden_states[:, :hidden_states_seq_len]
            block_id += 1
--- a/comfy/ldm/hunyuan3d/model.py
+++ b/comfy/ldm/hunyuan3d/model.py
@@ -7,7 +7,6 @@ from comfy.ldm.flux.layers import (
    SingleStreamBlock,
    timestep_embedding,
 )
-import comfy.patcher_extension


 class Hunyuan3Dv2(nn.Module):
@@ -68,13 +67,6 @@ class Hunyuan3Dv2(nn.Module):
        self.final_layer = LastLayer(hidden_size, 1, in_channels, dtype=dtype, device=device, operations=operations)

    def forward(self, x, timestep, context, guidance=None, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, guidance, transformer_options, **kwargs)
-
-    def _forward(self, x, timestep, context, guidance=None, transformer_options={}, **kwargs):
        x = x.movedim(-1, -2)
        timestep = 1.0 - timestep
        txt = context
@@ -99,16 +91,14 @@ class Hunyuan3Dv2(nn.Module):
                                                   txt=args["txt"],
                                                   vec=args["vec"],
                                                   pe=args["pe"],
-                                                   attn_mask=args.get("attn_mask"),
-                                                   transformer_options=args["transformer_options"])
+                                                   attn_mask=args.get("attn_mask"))
                    return out

                out = blocks_replace[("double_block", i)]({"img": img,
                                                           "txt": txt,
                                                           "vec": vec,
                                                           "pe": pe,
-                                                           "attn_mask": attn_mask,
-                                                           "transformer_options": transformer_options},
+                                                           "attn_mask": attn_mask},
                                                          {"original_block": block_wrap})
                txt = out["txt"]
                img = out["img"]
@@ -117,8 +107,7 @@ class Hunyuan3Dv2(nn.Module):
                                 txt=txt,
                                 vec=vec,
                                 pe=pe,
-                                 attn_mask=attn_mask,
-                                 transformer_options=transformer_options)
+                                 attn_mask=attn_mask)

        img = torch.cat((txt, img), 1)

@@ -129,19 +118,17 @@ class Hunyuan3Dv2(nn.Module):
                    out["img"] = block(args["img"],
                                       vec=args["vec"],
                                       pe=args["pe"],
-                                       attn_mask=args.get("attn_mask"),
-                                       transformer_options=args["transformer_options"])
+                                       attn_mask=args.get("attn_mask"))
                    return out

                out = blocks_replace[("single_block", i)]({"img": img,
                                                           "vec": vec,
                                                           "pe": pe,
-                                                           "attn_mask": attn_mask,
-                                                           "transformer_options": transformer_options},
+                                                           "attn_mask": attn_mask},
                                                          {"original_block": block_wrap})
                img = out["img"]
            else:
-                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask, transformer_options=transformer_options)
+                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask)

        img = img[:, txt.shape[1]:, ...]
        img = self.final_layer(img, vec)
--- a/comfy/ldm/hunyuan3d/vae.py
+++ b/comfy/ldm/hunyuan3d/vae.py
@@ -4,458 +4,81 @@
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
+
+
+from typing import Union, Tuple, List, Callable, Optional
+
 import numpy as np
-import math
+from einops import repeat, rearrange
 from tqdm import tqdm
-
-from typing import Optional
-
 import logging

 import comfy.ops
 ops = comfy.ops.disable_weight_init

-def fps(src: torch.Tensor, batch: torch.Tensor, sampling_ratio: float, start_random: bool = True):
-
-    # manually create the pointer vector
-    assert src.size(0) == batch.numel()
-
-    batch_size = int(batch.max()) + 1
-    deg = src.new_zeros(batch_size, dtype = torch.long)
-
-    deg.scatter_add_(0, batch, torch.ones_like(batch))
-
-    ptr_vec = deg.new_zeros(batch_size + 1)
-    torch.cumsum(deg, 0, out=ptr_vec[1:])
-
-    #return fps_sampling(src, ptr_vec, ratio)
-    sampled_indicies = []
-
-    for b in range(batch_size):
-        # start and the end of each batch
-        start, end = ptr_vec[b].item(), ptr_vec[b + 1].item()
-        # points from the point cloud
-        points = src[start:end]
-
-        num_points = points.size(0)
-        num_samples = max(1, math.ceil(num_points * sampling_ratio))
-
-        selected = torch.zeros(num_samples, device = src.device, dtype = torch.long)
-        distances = torch.full((num_points,), float("inf"), device = src.device)
-
-        # select a random start point
-        if start_random:
-            farthest = torch.randint(0, num_points, (1,), device = src.device)
-        else:
-            farthest = torch.tensor([0], device = src.device, dtype = torch.long)
-
-        for i in range(num_samples):
-            selected[i] = farthest
-            centroid = points[farthest].squeeze(0)
-            dist = torch.norm(points - centroid, dim = 1) # compute euclidean distance
-            distances = torch.minimum(distances, dist)
-            farthest = torch.argmax(distances)
-
-        sampled_indicies.append(torch.arange(start, end)[selected])
-
-    return torch.cat(sampled_indicies, dim = 0)
-class PointCrossAttention(nn.Module):
-    def __init__(self,
-        num_latents: int,
-        downsample_ratio: float,
-        pc_size: int,
-        pc_sharpedge_size: int,
-        point_feats: int,
-        width: int,
-        heads: int,
-        layers: int,
-        fourier_embedder,
-        normal_pe: bool = False,
-        qkv_bias: bool = False,
-        use_ln_post: bool = True,
-        qk_norm: bool = True):
-
-        super().__init__()
-
-        self.fourier_embedder = fourier_embedder
-
-        self.pc_size = pc_size
-        self.normal_pe = normal_pe
-        self.downsample_ratio = downsample_ratio
-        self.pc_sharpedge_size = pc_sharpedge_size
-        self.num_latents = num_latents
-        self.point_feats = point_feats
-
-        self.input_proj = nn.Linear(self.fourier_embedder.out_dim + point_feats, width)
-
-        self.cross_attn = ResidualCrossAttentionBlock(
-            width = width,
-            heads = heads,
-            qkv_bias = qkv_bias,
-            qk_norm = qk_norm
-        )
-
-        self.self_attn = None
-        if layers > 0:
-            self.self_attn = Transformer(
-                width = width,
-                heads = heads,
-                qkv_bias = qkv_bias,
-                qk_norm = qk_norm,
-                layers = layers
-            )
-
-        if use_ln_post:
-            self.ln_post = nn.LayerNorm(width)
-        else:
-            self.ln_post = None
-
-    def sample_points_and_latents(self, point_cloud: torch.Tensor, features: torch.Tensor):
-
-        """
-        Subsample points randomly from the point cloud (input_pc)
-        Further sample the subsampled points to get query_pc
-        take the fourier embeddings for both input and query pc
-
-        Mental Note: FPS-sampled points (query_pc) act as latent tokens that attend to and learn from the broader context in input_pc.
-        Goal: get a smaller represenation (query_pc) to represent the entire scence structure by learning from a broader subset (input_pc).
-        More computationally efficient.
-
-        Features are additional information for each point in the cloud
-        """
-
-        B, _, D = point_cloud.shape
-
-        num_latents = int(self.num_latents)
-
-        num_random_query = self.pc_size / (self.pc_size + self.pc_sharpedge_size) * num_latents
-        num_sharpedge_query = num_latents - num_random_query
-
-        # Split random and sharpedge surface points
-        random_pc, sharpedge_pc = torch.split(point_cloud, [self.pc_size, self.pc_sharpedge_size], dim=1)
-
-        # assert statements
-        assert random_pc.shape[1] <= self.pc_size, "Random surface points size must be less than or equal to pc_size"
-        assert sharpedge_pc.shape[1] <= self.pc_sharpedge_size, "Sharpedge surface points size must be less than or equal to pc_sharpedge_size"
-
-        input_random_pc_size = int(num_random_query * self.downsample_ratio)
-        random_query_pc, random_input_pc, random_idx_pc, random_idx_query = \
-            self.subsample(pc = random_pc, num_query = num_random_query, input_pc_size = input_random_pc_size)
-
-        input_sharpedge_pc_size = int(num_sharpedge_query * self.downsample_ratio)
-
-        if input_sharpedge_pc_size == 0:
-            sharpedge_input_pc = torch.zeros(B, 0, D, dtype = random_input_pc.dtype).to(point_cloud.device)
-            sharpedge_query_pc = torch.zeros(B, 0, D, dtype= random_query_pc.dtype).to(point_cloud.device)
-
-        else:
-            sharpedge_query_pc, sharpedge_input_pc, sharpedge_idx_pc, sharpedge_idx_query = \
-            self.subsample(pc = sharpedge_pc, num_query = num_sharpedge_query, input_pc_size = input_sharpedge_pc_size)
-
-        # concat the random and sharpedges
-        query_pc = torch.cat([random_query_pc, sharpedge_query_pc], dim = 1)
-        input_pc = torch.cat([random_input_pc, sharpedge_input_pc], dim = 1)
-
-        query = self.fourier_embedder(query_pc)
-        data = self.fourier_embedder(input_pc)
-
-        if self.point_feats > 0:
-            random_surface_features, sharpedge_surface_features = torch.split(features, [self.pc_size, self.pc_sharpedge_size], dim = 1)
-
-            input_random_surface_features, query_random_features = \
-                self.handle_features(features = random_surface_features, idx_pc = random_idx_pc, batch_size = B,
-                                     input_pc_size = input_random_pc_size, idx_query = random_idx_query)
-
-            if input_sharpedge_pc_size == 0:
-                input_sharpedge_surface_features = torch.zeros(B, 0, self.point_feats,
-                                                               dtype = input_random_surface_features.dtype, device = point_cloud.device)
-
-                query_sharpedge_features = torch.zeros(B, 0, self.point_feats,
-                                                       dtype = query_random_features.dtype, device = point_cloud.device)
-            else:
-
-                input_sharpedge_surface_features, query_sharpedge_features = \
-                    self.handle_features(idx_pc = sharpedge_idx_pc, features = sharpedge_surface_features,
-                                         batch_size = B, idx_query = sharpedge_idx_query, input_pc_size = input_sharpedge_pc_size)
-
-            query_features = torch.cat([query_random_features, query_sharpedge_features], dim = 1)
-            input_features = torch.cat([input_random_surface_features, input_sharpedge_surface_features], dim = 1)
-
-            if self.normal_pe:
-                # apply the fourier embeddings on the first 3 dims (xyz)
-                input_features_pe = self.fourier_embedder(input_features[..., :3])
-                query_features_pe = self.fourier_embedder(query_features[..., :3])
-                # replace the first 3 dims with the new PE ones
-                input_features = torch.cat([input_features_pe, input_features[..., :3]], dim = -1)
-                query_features = torch.cat([query_features_pe, query_features[..., :3]], dim = -1)
-
-            # concat at the channels dim
-            query = torch.cat([query, query_features], dim = -1)
-            data = torch.cat([data, input_features], dim = -1)
-
-        # don't return pc_info to avoid unnecessary memory usuage
-        return query.view(B, -1, query.shape[-1]), data.view(B, -1, data.shape[-1])
-
-    def forward(self, point_cloud: torch.Tensor, features: torch.Tensor):
-
-        query, data = self.sample_points_and_latents(point_cloud = point_cloud, features = features)
-
-        # apply projections
-        query = self.input_proj(query)
-        data = self.input_proj(data)
-
-        # apply cross attention between query and data
-        latents = self.cross_attn(query, data)
-
-        if self.self_attn is not None:
-            latents = self.self_attn(latents)
-
-        if self.ln_post is not None:
-            latents = self.ln_post(latents)
-
-        return latents
-
-
-    def subsample(self, pc, num_query, input_pc_size: int):
-
-        """
-        num_query: number of points to keep after FPS
-        input_pc_size: number of points to select before FPS
-        """
-
-        B, _, D = pc.shape
-        query_ratio = num_query / input_pc_size
-
-        # random subsampling of points inside the point cloud
-        idx_pc = torch.randperm(pc.shape[1], device = pc.device)[:input_pc_size]
-        input_pc = pc[:, idx_pc, :]
-
-        # flatten to allow applying fps across the whole batch
-        flattent_input_pc = input_pc.view(B * input_pc_size, D)
-
-        # construct a batch_down tensor to tell fps
-        # which points belong to which batch
-        N_down = int(flattent_input_pc.shape[0] / B)
-        batch_down = torch.arange(B).to(pc.device)
-        batch_down = torch.repeat_interleave(batch_down, N_down)
-
-        idx_query = fps(flattent_input_pc, batch_down, sampling_ratio = query_ratio)
-        query_pc = flattent_input_pc[idx_query].view(B, -1, D)
-
-        return query_pc, input_pc, idx_pc, idx_query
-
-    def handle_features(self, features, idx_pc, input_pc_size, batch_size: int, idx_query):
-
-        B = batch_size
-
-        input_surface_features = features[:, idx_pc, :]
-        flattent_input_features = input_surface_features.view(B * input_pc_size, -1)
-        query_features = flattent_input_features[idx_query].view(B, -1,
-                                                                 flattent_input_features.shape[-1])
-
-        return input_surface_features, query_features
-
-def normalize_mesh(mesh, scale = 0.9999):
-    """Normalize mesh to fit in [-scale, scale]. Translate mesh so its center is [0,0,0]"""
-
-    bbox = mesh.bounds
-    center = (bbox[1] + bbox[0]) / 2
-
-    max_extent = (bbox[1] - bbox[0]).max()
-    mesh.apply_translation(-center)
-    mesh.apply_scale((2 * scale) / max_extent)
-
-    return mesh
-
-def sample_pointcloud(mesh, num = 200000):
-    """ Uniformly sample points from the surface of the mesh """
-
-    points, face_idx = mesh.sample(num, return_index = True)
-    normals = mesh.face_normals[face_idx]
-    return torch.from_numpy(points.astype(np.float32)), torch.from_numpy(normals.astype(np.float32))
-
-def detect_sharp_edges(mesh, threshold=0.985):
-    """Return edge indices (a, b) that lie on sharp boundaries of the mesh."""
-
-    V, F = mesh.vertices, mesh.faces
-    VN, FN = mesh.vertex_normals, mesh.face_normals
-
-    sharp_mask = np.ones(V.shape[0])
-    for i in range(3):
-        indices = F[:, i]
-        alignment = np.einsum('ij,ij->i', VN[indices], FN)
-        dot_stack = np.stack((sharp_mask[indices], alignment), axis=-1)
-        sharp_mask[indices] = np.min(dot_stack, axis=-1)
-
-    edge_a = np.concatenate([F[:, 0], F[:, 1], F[:, 2]])
-    edge_b = np.concatenate([F[:, 1], F[:, 2], F[:, 0]])
-    sharp_edges = (sharp_mask[edge_a] < threshold) & (sharp_mask[edge_b] < threshold)
-
-    return edge_a[sharp_edges], edge_b[sharp_edges]
-
-
-def sharp_sample_pointcloud(mesh, num = 16384):
-    """ Sample points preferentially from sharp edges in the mesh. """
-
-    edge_a, edge_b = detect_sharp_edges(mesh)
-    V, VN = mesh.vertices, mesh.vertex_normals
-
-    va, vb = V[edge_a], V[edge_b]
-    na, nb = VN[edge_a], VN[edge_b]
-
-    edge_lengths = np.linalg.norm(vb - va, axis=-1)
-    weights = edge_lengths / edge_lengths.sum()
-
-    indices = np.searchsorted(np.cumsum(weights), np.random.rand(num))
-    t = np.random.rand(num, 1)
-
-    samples = t * va[indices] + (1 - t) * vb[indices]
-    normals = t * na[indices] + (1 - t) * nb[indices]
-
-    return samples.astype(np.float32), normals.astype(np.float32)
-
-def load_surface_sharpedge(mesh, num_points=4096, num_sharp_points=4096, sharpedge_flag = True, device = "cuda"):
-    """Load a surface with optional sharp-edge annotations from a trimesh mesh."""
-
-    import trimesh
-
-    try:
-        mesh_full = trimesh.util.concatenate(mesh.dump())
-    except Exception:
-        mesh_full = trimesh.util.concatenate(mesh)
-
-    mesh_full = normalize_mesh(mesh_full)
-
-    faces = mesh_full.faces
-    vertices = mesh_full.vertices
-    origin_face_count = faces.shape[0]
-
-    mesh_surface = trimesh.Trimesh(vertices=vertices, faces=faces[:origin_face_count])
-    mesh_fill = trimesh.Trimesh(vertices=vertices, faces=faces[origin_face_count:])
-
-    area_surface = mesh_surface.area
-    area_fill = mesh_fill.area
-    total_area = area_surface + area_fill
-
-    sample_num = 499712 // 2
-    fill_ratio = area_fill / total_area if total_area > 0 else 0
-
-    num_fill = int(sample_num * fill_ratio)
-    num_surface = sample_num - num_fill
-
-    surf_pts, surf_normals = sample_pointcloud(mesh_surface, num_surface)
-    fill_pts, fill_normals = (torch.zeros(0, 3), torch.zeros(0, 3)) if num_fill == 0 else sample_pointcloud(mesh_fill, num_fill)
-
-    sharp_pts, sharp_normals = sharp_sample_pointcloud(mesh_surface, sample_num)
-
-    def assemble_tensor(points, normals, label=None):
-
-        data = torch.cat([points, normals], dim=1).half().to(device)
-
-        if label is not None:
-            label_tensor = torch.full((data.shape[0], 1), float(label), dtype=torch.float16).to(device)
-            data = torch.cat([data, label_tensor], dim=1)
-
-        return data
-
-    surface = assemble_tensor(torch.cat([surf_pts.to(device), fill_pts.to(device)], dim=0),
-                              torch.cat([surf_normals.to(device), fill_normals.to(device)], dim=0),
-                              label = 0 if sharpedge_flag else None)
-
-    sharp_surface = assemble_tensor(torch.from_numpy(sharp_pts), torch.from_numpy(sharp_normals),
-                                    label = 1 if sharpedge_flag else None)
-
-    rng = np.random.default_rng()
-
-    surface = surface[rng.choice(surface.shape[0], num_points, replace = False)]
-    sharp_surface = sharp_surface[rng.choice(sharp_surface.shape[0], num_sharp_points, replace = False)]
-
-    full = torch.cat([surface, sharp_surface], dim = 0).unsqueeze(0)
-
-    return full
-
-class SharpEdgeSurfaceLoader:
-    """ Load mesh surface and sharp edge samples. """
-
-    def __init__(self, num_uniform_points = 8192, num_sharp_points = 8192):
-
-        self.num_uniform_points = num_uniform_points
-        self.num_sharp_points = num_sharp_points
-        self.total_points = num_uniform_points + num_sharp_points
-
-    def __call__(self, mesh_input, device = "cuda"):
-        mesh = self._load_mesh(mesh_input)
-        return load_surface_sharpedge(mesh, self.num_uniform_points, self.num_sharp_points, device = device)
-
-    @staticmethod
-    def _load_mesh(mesh_input):
-        import trimesh
-
-        if isinstance(mesh_input, str):
-            mesh = trimesh.load(mesh_input, force="mesh", merge_primitives = True)
-        else:
-            mesh = mesh_input
-
-        if isinstance(mesh, trimesh.Scene):
-            combined = None
-            for obj in mesh.geometry.values():
-                combined = obj if combined is None else combined + obj
-            return combined
-
-        return mesh
-
-class DiagonalGaussianDistribution:
-    def __init__(self, params: torch.Tensor, feature_dim: int = -1):
-
-        # divide quant channels (8) into mean and log variance
-        self.mean, self.logvar = torch.chunk(params, 2, dim = feature_dim)
-
-        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
-        self.std = torch.exp(0.5 * self.logvar)
-
-    def sample(self):
-
-        eps = torch.randn_like(self.std)
-        z = self.mean + eps * self.std
-
-        return z
-
-################################################
-# Volume Decoder
-################################################
-
-class VanillaVolumeDecoder():
+def generate_dense_grid_points(
+    bbox_min: np.ndarray,
+    bbox_max: np.ndarray,
+    octree_resolution: int,
+    indexing: str = "ij",
+):
+    length = bbox_max - bbox_min
+    num_cells = octree_resolution
+
+    x = np.linspace(bbox_min[0], bbox_max[0], int(num_cells) + 1, dtype=np.float32)
+    y = np.linspace(bbox_min[1], bbox_max[1], int(num_cells) + 1, dtype=np.float32)
+    z = np.linspace(bbox_min[2], bbox_max[2], int(num_cells) + 1, dtype=np.float32)
+    [xs, ys, zs] = np.meshgrid(x, y, z, indexing=indexing)
+    xyz = np.stack((xs, ys, zs), axis=-1)
+    grid_size = [int(num_cells) + 1, int(num_cells) + 1, int(num_cells) + 1]
+
+    return xyz, grid_size, length
+
+
+class VanillaVolumeDecoder:
    @torch.no_grad()
-    def __call__(self, latents: torch.Tensor, geo_decoder: callable, octree_resolution: int, bounds = 1.01,
-                 num_chunks: int = 10_000, enable_pbar: bool = True, **kwargs):
+    def __call__(
+        self,
+        latents: torch.FloatTensor,
+        geo_decoder: Callable,
+        bounds: Union[Tuple[float, ...], List[float], float] = 1.01,
+        num_chunks: int = 10000,
+        octree_resolution: Optional[int] = None,
+        enable_pbar: bool = True,
+        **kwargs,
+    ):
+        device = latents.device
+        dtype = latents.dtype
+        batch_size = latents.shape[0]

+        # 1. generate query points
        if isinstance(bounds, float):
            bounds = [-bounds, -bounds, -bounds, bounds, bounds, bounds]

-        bbox_min, bbox_max = torch.tensor(bounds[:3]), torch.tensor(bounds[3:])
-
-        x = torch.linspace(bbox_min[0], bbox_max[0], int(octree_resolution) + 1, dtype = torch.float32)
-        y = torch.linspace(bbox_min[1], bbox_max[1], int(octree_resolution) + 1, dtype = torch.float32)
-        z = torch.linspace(bbox_min[2], bbox_max[2], int(octree_resolution) + 1, dtype = torch.float32)
-
-        [xs, ys, zs] = torch.meshgrid(x, y, z, indexing = "ij")
-        xyz = torch.stack((xs, ys, zs), axis=-1).to(latents.device, dtype = latents.dtype).contiguous().reshape(-1, 3)
-        grid_size = [int(octree_resolution) + 1, int(octree_resolution) + 1, int(octree_resolution) + 1]
+        bbox_min, bbox_max = np.array(bounds[0:3]), np.array(bounds[3:6])
+        xyz_samples, grid_size, length = generate_dense_grid_points(
+            bbox_min=bbox_min,
+            bbox_max=bbox_max,
+            octree_resolution=octree_resolution,
+            indexing="ij"
+        )
+        xyz_samples = torch.from_numpy(xyz_samples).to(device, dtype=dtype).contiguous().reshape(-1, 3)

+        # 2. latents to 3d volume
        batch_logits = []
-        for start in tqdm(range(0, xyz.shape[0], num_chunks), desc="Volume Decoding",
+        for start in tqdm(range(0, xyz_samples.shape[0], num_chunks), desc="Volume Decoding",
                          disable=not enable_pbar):
-
-            chunk_queries = xyz[start: start + num_chunks, :]
-            chunk_queries = chunk_queries.unsqueeze(0).repeat(latents.shape[0], 1, 1)
-            logits = geo_decoder(queries = chunk_queries, latents = latents)
+            chunk_queries = xyz_samples[start: start + num_chunks, :]
+            chunk_queries = repeat(chunk_queries, "p c -> b p c", b=batch_size)
+            logits = geo_decoder(queries=chunk_queries, latents=latents)
            batch_logits.append(logits)

-        grid_logits = torch.cat(batch_logits, dim = 1)
-        grid_logits = grid_logits.view((latents.shape[0], *grid_size)).float()
+        grid_logits = torch.cat(batch_logits, dim=1)
+        grid_logits = grid_logits.view((batch_size, *grid_size)).float()

        return grid_logits

+
 class FourierEmbedder(nn.Module):
    """The sin/cosine positional embedding. Given an input tensor `x` of shape [n_batch, ..., c_dim], it converts
    each feature dimension of `x[..., i]` into:
@@ -552,11 +175,13 @@ class FourierEmbedder(nn.Module):
        else:
            return x

+
 class CrossAttentionProcessor:
    def __call__(self, attn, q, k, v):
        out = comfy.ops.scaled_dot_product_attention(q, k, v)
        return out

+
 class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    """
@@ -607,42 +232,39 @@ class MLP(nn.Module):
    def forward(self, x):
        return self.drop_path(self.c_proj(self.gelu(self.c_fc(x))))

+
 class QKVMultiheadCrossAttention(nn.Module):
    def __init__(
        self,
+        *,
        heads: int,
-        n_data = None,
        width=None,
        qk_norm=False,
        norm_layer=ops.LayerNorm
    ):
        super().__init__()
        self.heads = heads
-        self.n_data = n_data
        self.q_norm = norm_layer(width // heads, elementwise_affine=True, eps=1e-6) if qk_norm else nn.Identity()
        self.k_norm = norm_layer(width // heads, elementwise_affine=True, eps=1e-6) if qk_norm else nn.Identity()

-    def forward(self, q, kv):
+        self.attn_processor = CrossAttentionProcessor()

+    def forward(self, q, kv):
        _, n_ctx, _ = q.shape
        bs, n_data, width = kv.shape
-
        attn_ch = width // self.heads // 2
        q = q.view(bs, n_ctx, self.heads, -1)
-
        kv = kv.view(bs, n_data, self.heads, -1)
        k, v = torch.split(kv, attn_ch, dim=-1)

        q = self.q_norm(q)
        k = self.k_norm(k)
-
-        q, k, v = [t.permute(0, 2, 1, 3) for t in (q, k, v)]
-        out = F.scaled_dot_product_attention(q, k, v)
-
+        q, k, v = map(lambda t: rearrange(t, 'b n h d -> b h n d', h=self.heads), (q, k, v))
+        out = self.attn_processor(self, q, k, v)
        out = out.transpose(1, 2).reshape(bs, n_ctx, -1)
-
        return out

+
 class MultiheadCrossAttention(nn.Module):
    def __init__(
        self,
@@ -684,6 +306,7 @@ class MultiheadCrossAttention(nn.Module):
        x = self.c_proj(x)
        return x

+
 class ResidualCrossAttentionBlock(nn.Module):
    def __init__(
        self,
@@ -743,7 +366,7 @@ class QKVMultiheadAttention(nn.Module):
        q = self.q_norm(q)
        k = self.k_norm(k)

-        q, k, v = [t.permute(0, 2, 1, 3) for t in (q, k, v)]
+        q, k, v = map(lambda t: rearrange(t, 'b n h d -> b h n d', h=self.heads), (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v).transpose(1, 2).reshape(bs, n_ctx, -1)
        return out

@@ -760,7 +383,8 @@ class MultiheadAttention(nn.Module):
        drop_path_rate: float = 0.0
    ):
        super().__init__()
-
+        self.width = width
+        self.heads = heads
        self.c_qkv = ops.Linear(width, width * 3, bias=qkv_bias)
        self.c_proj = ops.Linear(width, width)
        self.attention = QKVMultiheadAttention(
@@ -867,7 +491,7 @@ class CrossAttentionDecoder(nn.Module):
        self.query_proj = ops.Linear(self.fourier_embedder.out_dim, width)
        if self.downsample_ratio != 1:
            self.latents_proj = ops.Linear(width * downsample_ratio, width)
-        if not self.enable_ln_post:
+        if self.enable_ln_post == False:
            qk_norm = False
        self.cross_attn_decoder = ResidualCrossAttentionBlock(
            width=width,
@@ -898,44 +522,28 @@ class CrossAttentionDecoder(nn.Module):

 class ShapeVAE(nn.Module):
    def __init__(
-            self,
-            *,
-            num_latents: int = 4096,
-            embed_dim: int = 64,
-            width: int = 1024,
-            heads: int = 16,
-            num_decoder_layers: int = 16,
-            num_encoder_layers: int = 8,
-            pc_size: int = 81920,
-            pc_sharpedge_size: int = 0,
-            point_feats: int = 4,
-            downsample_ratio: int = 20,
-            geo_decoder_downsample_ratio: int = 1,
-            geo_decoder_mlp_expand_ratio: int = 4,
-            geo_decoder_ln_post: bool = True,
-            num_freqs: int = 8,
-            qkv_bias: bool = False,
-            qk_norm: bool = True,
-            drop_path_rate: float = 0.0,
-            include_pi: bool = False,
-            scale_factor: float = 1.0039506158752403,
-            label_type: str = "binary",
+        self,
+        *,
+        embed_dim: int,
+        width: int,
+        heads: int,
+        num_decoder_layers: int,
+        geo_decoder_downsample_ratio: int = 1,
+        geo_decoder_mlp_expand_ratio: int = 4,
+        geo_decoder_ln_post: bool = True,
+        num_freqs: int = 8,
+        include_pi: bool = True,
+        qkv_bias: bool = True,
+        qk_norm: bool = False,
+        label_type: str = "binary",
+        drop_path_rate: float = 0.0,
+        scale_factor: float = 1.0,
    ):
        super().__init__()
        self.geo_decoder_ln_post = geo_decoder_ln_post

        self.fourier_embedder = FourierEmbedder(num_freqs=num_freqs, include_pi=include_pi)

-        self.encoder = PointCrossAttention(layers = num_encoder_layers,
-                                    num_latents = num_latents,
-                                    downsample_ratio = downsample_ratio,
-                                    heads = heads,
-                                    pc_size = pc_size,
-                                    width = width,
-                                    point_feats = point_feats,
-                                    fourier_embedder = self.fourier_embedder,
-                                    pc_sharpedge_size = pc_sharpedge_size)
-
        self.post_kl = ops.Linear(embed_dim, width)

        self.transformer = Transformer(
@@ -975,14 +583,5 @@ class ShapeVAE(nn.Module):
        grid_logits = self.volume_decoder(latents, self.geo_decoder, bounds=bounds, num_chunks=num_chunks, octree_resolution=octree_resolution, enable_pbar=enable_pbar)
        return grid_logits.movedim(-2, -1)

-    def encode(self, surface):
-
-        pc, feats = surface[:, :, :3], surface[:, :, 3:]
-        latents = self.encoder(pc, feats)
-
-        moments = self.pre_kl(latents)
-        posterior = DiagonalGaussianDistribution(moments, feature_dim = -1)
-
-        latents = posterior.sample()
-
-        return latents
+    def encode(self, x):
+        return None
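Note on the volume decoding hunk above: both the removed and the added code sample `octree_resolution + 1` points per axis, flatten them in "ij" order, and feed the decoder fixed-size slices. `num_chunks` is used as the step of the `range`, so it acts as a chunk size rather than a chunk count. A minimal pure-Python sketch of that pattern follows. It assumes `octree_resolution >= 1`. `linspace`, `dense_grid`, `decode_in_chunks` and `toy_decoder` are hypothetical names for illustration, not the repository's helpers, and the internals of `generate_dense_grid_points` are not shown in this diff.

```python
from itertools import product


def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def dense_grid(bbox_min, bbox_max, resolution):
    """Return the flattened grid points and the per-axis grid size."""
    n = int(resolution) + 1  # resolution cells per axis means resolution + 1 samples
    axes = [linspace(bbox_min[d], bbox_max[d], n) for d in range(3)]
    # product() varies the last axis fastest, which matches "ij" indexing
    # followed by a reshape to (-1, 3)
    points = list(product(*axes))
    return points, [n, n, n]


def decode_in_chunks(points, decoder, num_chunks):
    """Evaluate decoder on slices of at most num_chunks points each."""
    values = []
    for start in range(0, len(points), num_chunks):
        values.extend(decoder(points[start:start + num_chunks]))
    return values


def toy_decoder(chunk):
    # hypothetical stand-in for geo_decoder: distance to the unit sphere,
    # negative inside the sphere
    return [(x * x + y * y + z * z) ** 0.5 - 1.0 for x, y, z in chunk]


points, grid_size = dense_grid([-1.0, -1.0, -1.0], [1.0, 1.0, 1.0], 4)
values = decode_in_chunks(points, toy_decoder, num_chunks=10)
```

Because the flattened order is preserved across chunks, the concatenated values can be viewed back as a `grid_size` volume, which is what the `torch.cat(...).view(...)` at the end of the hunk relies on.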
--- a/comfy/ldm/hunyuan3dv2_1/hunyuandit.py
+++ b/comfy/ldm/hunyuan3dv2_1/hunyuandit.py
@@ -1,659 +0,0 @@
-import math
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.modules.attention import optimized_attention
-import comfy.model_management
-
-class GELU(nn.Module):
-
-    def __init__(self, dim_in: int, dim_out: int, operations, device, dtype):
-        super().__init__()
-        self.proj = operations.Linear(dim_in, dim_out, device = device, dtype = dtype)
-
-    def gelu(self, gate: torch.Tensor) -> torch.Tensor:
-
-        if gate.device.type == "mps":
-            return F.gelu(gate.to(dtype = torch.float32)).to(dtype = gate.dtype)
-
-        return F.gelu(gate)
-
-    def forward(self, hidden_states):
-
-        hidden_states = self.proj(hidden_states)
-        hidden_states = self.gelu(hidden_states)
-
-        return hidden_states
-
-class FeedForward(nn.Module):
-
-    def __init__(self, dim: int, dim_out = None, mult: int = 4,
-                dropout: float = 0.0, inner_dim = None, operations = None, device = None, dtype = None):
-
-        super().__init__()
-        if inner_dim is None:
-            inner_dim = int(dim * mult)
-
-        dim_out = dim_out if dim_out is not None else dim
-
-        act_fn = GELU(dim, inner_dim, operations = operations, device = device, dtype = dtype)
-
-        self.net = nn.ModuleList([])
-        self.net.append(act_fn)
-
-        self.net.append(nn.Dropout(dropout))
-        self.net.append(operations.Linear(inner_dim, dim_out, device = device, dtype = dtype))
-
-    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-        for module in self.net:
-            hidden_states = module(hidden_states)
-        return hidden_states
-
-class AddAuxLoss(torch.autograd.Function):
-
-    @staticmethod
-    def forward(ctx, x, loss):
-        # do nothing in forward (no computation)
-        ctx.requires_aux_loss = loss.requires_grad
-        ctx.dtype = loss.dtype
-
-        return x
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        # add the aux loss gradients
-        grad_loss = None
-        # put the aux grad the same as the main grad loss
-        # aux grad contributes equally
-        if ctx.requires_aux_loss:
-            grad_loss = torch.ones(1, dtype = ctx.dtype, device = grad_output.device)
-
-        return grad_output, grad_loss
-
-class MoEGate(nn.Module):
-
-    def __init__(self, embed_dim, num_experts=16, num_experts_per_tok=2, aux_loss_alpha=0.01, device = None, dtype = None):
-
-        super().__init__()
-        self.top_k = num_experts_per_tok
-        self.n_routed_experts = num_experts
-
-        self.alpha = aux_loss_alpha
-
-        self.gating_dim = embed_dim
-        self.weight = nn.Parameter(torch.empty((self.n_routed_experts, self.gating_dim), device = device, dtype = dtype))
-
-    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-
-        # flatten hidden states
-        hidden_states = hidden_states.view(-1, hidden_states.size(-1))
-
-        # get logits and pass it to softmax
-        logits = F.linear(hidden_states, comfy.model_management.cast_to(self.weight, dtype=hidden_states.dtype, device=hidden_states.device), bias = None)
-        scores = logits.softmax(dim = -1)
-
-        topk_weight, topk_idx = torch.topk(scores, k = self.top_k, dim = -1, sorted = False)
-
-        if self.training and self.alpha > 0.0:
-            scores_for_aux = scores
-
-            # used bincount instead of one hot encoding
-            counts = torch.bincount(topk_idx.view(-1), minlength = self.n_routed_experts).float()
-            ce = counts / topk_idx.numel()  # normalized expert usage
-
-            # mean expert score
-            Pi = scores_for_aux.mean(0)
-
-            # expert balance loss
-            aux_loss = (Pi * ce * self.n_routed_experts).sum() * self.alpha
-        else:
-            aux_loss = None
-
-        return topk_idx, topk_weight, aux_loss
-
-class MoEBlock(nn.Module):
-    def __init__(self, dim, num_experts: int = 6, moe_top_k: int = 2, dropout: float = 0.0,
-                 ff_inner_dim: int = None, operations = None, device = None, dtype = None):
-        super().__init__()
-
-        self.moe_top_k = moe_top_k
-        self.num_experts = num_experts
-
-        self.experts = nn.ModuleList([
-            FeedForward(dim, dropout = dropout, inner_dim = ff_inner_dim, operations = operations, device = device, dtype = dtype)
-            for _ in range(num_experts)
-        ])
-
-        self.gate = MoEGate(dim, num_experts = num_experts, num_experts_per_tok = moe_top_k, device = device, dtype = dtype)
-        self.shared_experts = FeedForward(dim, dropout = dropout, inner_dim = ff_inner_dim, operations = operations, device = device, dtype = dtype)
-
-    def forward(self, hidden_states) -> torch.Tensor:
-
-        identity = hidden_states
-        orig_shape = hidden_states.shape
-        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
-
-        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
-        flat_topk_idx = topk_idx.view(-1)
-
-        if self.training:
-
-            hidden_states = hidden_states.repeat_interleave(self.moe_top_k, dim = 0)
-            y = torch.empty_like(hidden_states, dtype = hidden_states.dtype)
-
-            for i, expert in enumerate(self.experts):
-                tmp = expert(hidden_states[flat_topk_idx == i])
-                y[flat_topk_idx == i] = tmp.to(hidden_states.dtype)
-
-            y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim = 1)
-            y =  y.view(*orig_shape)
-
-            y = AddAuxLoss.apply(y, aux_loss)
-        else:
-            y = self.moe_infer(hidden_states, flat_expert_indices = flat_topk_idx,flat_expert_weights = topk_weight.view(-1, 1)).view(*orig_shape)
-
-        y = y + self.shared_experts(identity)
-
-        return y
-
-    @torch.no_grad()
-    def moe_infer(self, x, flat_expert_indices, flat_expert_weights):
-
-        expert_cache = torch.zeros_like(x)
-        idxs = flat_expert_indices.argsort()
-
-        # no need for .cpu().numpy() here
-        tokens_per_expert = flat_expert_indices.bincount().cumsum(0)
-        token_idxs = idxs // self.moe_top_k
-
-        for i, end_idx in enumerate(tokens_per_expert):
-
-            start_idx = 0 if i == 0 else tokens_per_expert[i-1]
-
-            if start_idx == end_idx:
-                continue
-
-            expert = self.experts[i]
-            exp_token_idx = token_idxs[start_idx:end_idx]
-
-            expert_tokens = x[exp_token_idx]
-            expert_out = expert(expert_tokens)
-
-            expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]])
-
-            # using index_add_ with a 1-D index tensor avoids building the large [N, D] index map and the extra memory copy required by scatter_reduce_
-            # and also avoids a dtype conversion
-            expert_cache.index_add_(0, exp_token_idx, expert_out)
-
-        return expert_cache
-
-class Timesteps(nn.Module):
-    def __init__(self, num_channels: int, downscale_freq_shift: float = 0.0,
-                 scale: float = 1.0, max_period: int = 10000):
-        super().__init__()
-
-        self.num_channels = num_channels
-        half_dim = num_channels // 2
-
-        # precompute the “inv_freq” vector once
-        exponent = -math.log(max_period) * torch.arange(
-            half_dim, dtype=torch.float32
-        ) / (half_dim - downscale_freq_shift)
-
-        inv_freq = torch.exp(exponent)
-
-        # pad
-        if num_channels % 2 == 1:
-            # we’ll pad a zero at the end of the cos-half
-            inv_freq = torch.cat([inv_freq, inv_freq.new_zeros(1)])
-
-        # register to buffer so it moves with the device
-        self.register_buffer("inv_freq", inv_freq, persistent = False)
-        self.scale = scale
-
-    def forward(self, timesteps: torch.Tensor):
-
-        x = timesteps.float().unsqueeze(1) * self.inv_freq.to(timesteps.device).unsqueeze(0)
-
-
-        # fused CUDA kernels for sin and cos
-        sin_emb = x.sin()
-        cos_emb = x.cos()
-
-        emb = torch.cat([sin_emb, cos_emb], dim = 1)
-
-        # scale factor
-        if self.scale != 1.0:
-            emb = emb * self.scale
-
-        # if inv_freq was padded for an odd num_channels, emb is one column too wide, so trim it:
-        if emb.shape[1] > self.num_channels:
-            emb = emb[:, :self.num_channels]
-
-        return emb
-
-class TimestepEmbedder(nn.Module):
-    def __init__(self, hidden_size, frequency_embedding_size = 256, cond_proj_dim = None, operations = None, device = None, dtype = None):
-        super().__init__()
-
-        self.mlp = nn.Sequential(
-            operations.Linear(hidden_size, frequency_embedding_size, bias=True, device = device, dtype = dtype),
-            nn.GELU(),
-            operations.Linear(frequency_embedding_size, hidden_size, bias=True, device = device, dtype = dtype),
-        )
-        self.frequency_embedding_size = frequency_embedding_size
-
-        if cond_proj_dim is not None:
-            self.cond_proj = operations.Linear(cond_proj_dim, frequency_embedding_size, bias=False, device = device, dtype = dtype)
-
-        self.time_embed = Timesteps(hidden_size)
-
-    def forward(self, timesteps, condition):
-
-        timestep_embed = self.time_embed(timesteps).type(self.mlp[0].weight.dtype)
-
-        if condition is not None:
-            cond_embed = self.cond_proj(condition)
-            timestep_embed = timestep_embed + cond_embed
-
-        time_conditioned = self.mlp(timestep_embed)
-
-        # for broadcasting with image tokens
-        return time_conditioned.unsqueeze(1)
-
-class MLP(nn.Module):
-    def __init__(self, *, width: int, operations = None, device = None, dtype = None):
-        super().__init__()
-        self.width = width
-        self.fc1 = operations.Linear(width, width * 4, device = device, dtype = dtype)
-        self.fc2 = operations.Linear(width * 4, width, device = device, dtype = dtype)
-        self.gelu = nn.GELU()
-
-    def forward(self, x):
-        return self.fc2(self.gelu(self.fc1(x)))
-
-class CrossAttention(nn.Module):
-    def __init__(
-        self,
-        qdim,
-        kdim,
-        num_heads,
-        qkv_bias=True,
-        qk_norm=False,
-        norm_layer=nn.LayerNorm,
-        use_fp16: bool = False,
-        operations = None,
-        dtype = None,
-        device = None,
-        **kwargs,
-    ):
-        super().__init__()
-        self.qdim = qdim
-        self.kdim = kdim
-
-        self.num_heads = num_heads
-        self.head_dim = self.qdim // num_heads
-
-        self.scale = self.head_dim ** -0.5
-
-        self.to_q = operations.Linear(qdim, qdim, bias=qkv_bias, device = device, dtype = dtype)
-        self.to_k = operations.Linear(kdim, qdim, bias=qkv_bias, device = device, dtype = dtype)
-        self.to_v = operations.Linear(kdim, qdim, bias=qkv_bias, device = device, dtype = dtype)
-
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        if norm_layer == nn.LayerNorm:
-            norm_layer = operations.LayerNorm
-        else:
-            norm_layer = operations.RMSNorm
-
-        self.q_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.k_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.out_proj = operations.Linear(qdim, qdim, bias=True, device = device, dtype = dtype)
-
-    def forward(self, x, y):
-
-        b, s1, _ = x.shape
-        _, s2, _ = y.shape
-
-        y = y.to(next(self.to_k.parameters()).dtype)
-
-        q = self.to_q(x)
-        k = self.to_k(y)
-        v = self.to_v(y)
-
-        kv = torch.cat((k, v), dim=-1)
-        split_size = kv.shape[-1] // self.num_heads // 2
-
-        kv = kv.view(1, -1, self.num_heads, split_size * 2)
-        k, v = torch.split(kv, split_size, dim=-1)
-
-        q = q.view(b, s1, self.num_heads, self.head_dim)
-        k = k.view(b, s2, self.num_heads, self.head_dim)
-        v = v.reshape(b, s2, self.num_heads * self.head_dim)
-
-        q = self.q_norm(q)
-        k = self.k_norm(k)
-
-        x = optimized_attention(
-            q.reshape(b, s1, self.num_heads * self.head_dim),
-            k.reshape(b, s2, self.num_heads * self.head_dim),
-            v,
-            heads=self.num_heads,
-        )
-
-        out = self.out_proj(x)
-
-        return out
-
-class Attention(nn.Module):
-
-    def __init__(
-        self,
-        dim,
-        num_heads,
-        qkv_bias = True,
-        qk_norm = False,
-        norm_layer = nn.LayerNorm,
-        use_fp16: bool = False,
-        operations = None,
-        device = None,
-        dtype = None
-    ):
-        super().__init__()
-        self.dim = dim
-        self.num_heads = num_heads
-        self.head_dim = self.dim // num_heads
-        self.scale = self.head_dim ** -0.5
-
-        self.to_q = operations.Linear(dim, dim, bias = qkv_bias, device = device, dtype = dtype)
-        self.to_k = operations.Linear(dim, dim, bias = qkv_bias, device = device, dtype = dtype)
-        self.to_v = operations.Linear(dim, dim, bias = qkv_bias, device = device, dtype = dtype)
-
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        if norm_layer == nn.LayerNorm:
-            norm_layer = operations.LayerNorm
-        else:
-            norm_layer = operations.RMSNorm
-
-        self.q_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.k_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.out_proj = operations.Linear(dim, dim, device = device, dtype = dtype)
-
-    def forward(self, x):
-        B, N, _ = x.shape
-
-        query = self.to_q(x)
-        key = self.to_k(x)
-        value = self.to_v(x)
-
-        qkv_combined = torch.cat((query, key, value), dim=-1)
-        split_size = qkv_combined.shape[-1] // self.num_heads // 3
-
-        qkv = qkv_combined.view(1, -1, self.num_heads, split_size * 3)
-        query, key, value = torch.split(qkv, split_size, dim=-1)
-
-        query = query.reshape(B, N, self.num_heads, self.head_dim)
-        key = key.reshape(B, N, self.num_heads, self.head_dim)
-        value = value.reshape(B, N, self.num_heads * self.head_dim)
-
-        query = self.q_norm(query)
-        key = self.k_norm(key)
-
-        x = optimized_attention(
-            query.reshape(B, N, self.num_heads * self.head_dim),
-            key.reshape(B, N, self.num_heads * self.head_dim),
-            value,
-            heads=self.num_heads,
-        )
-
-        x = self.out_proj(x)
-        return x
-
-class HunYuanDiTBlock(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        c_emb_size,
-        num_heads,
-        text_states_dim=1024,
-        qk_norm=False,
-        norm_layer=nn.LayerNorm,
-        qk_norm_layer=True,
-        qkv_bias=True,
-        skip_connection=True,
-        timested_modulate=False,
-        use_moe: bool = False,
-        num_experts: int = 8,
-        moe_top_k: int = 2,
-        use_fp16: bool = False,
-        operations = None,
-        device = None, dtype = None
-    ):
-        super().__init__()
-
-        # eps can't be 1e-6 in fp16 mode because of numerical stability issues
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        self.norm1 = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-
-        self.attn1 = Attention(hidden_size, num_heads=num_heads, qkv_bias=qkv_bias, qk_norm=qk_norm,
-                               norm_layer=qk_norm_layer, use_fp16 = use_fp16, device = device, dtype = dtype, operations = operations)
-
-        self.norm2 = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-
-        self.timested_modulate = timested_modulate
-        if self.timested_modulate:
-            self.default_modulation = nn.Sequential(
-                nn.SiLU(),
-                operations.Linear(c_emb_size, hidden_size, bias=True, device = device, dtype = dtype)
-            )
-
-        self.attn2 = CrossAttention(hidden_size, text_states_dim, num_heads=num_heads, qkv_bias=qkv_bias,
-                                    qk_norm=qk_norm, norm_layer=qk_norm_layer, use_fp16 = use_fp16,
-                                    device = device, dtype = dtype, operations = operations)
-
-        self.norm3 = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-
-        if skip_connection:
-            self.skip_norm = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-            self.skip_linear = operations.Linear(2 * hidden_size, hidden_size, device = device, dtype = dtype)
-        else:
-            self.skip_linear = None
-
-        self.use_moe = use_moe
-
-        if self.use_moe:
-            self.moe = MoEBlock(
-                hidden_size,
-                num_experts = num_experts,
-                moe_top_k = moe_top_k,
-                dropout = 0.0,
-                ff_inner_dim = int(hidden_size * 4.0),
-                device = device, dtype = dtype,
-                operations = operations
-            )
-        else:
-            self.mlp = MLP(width=hidden_size, operations=operations, device = device, dtype = dtype)
-
-    def forward(self, hidden_states, conditioning=None, text_states=None, skip_tensor=None):
-
-        if self.skip_linear is not None:
-            combined = torch.cat([skip_tensor, hidden_states], dim=-1)
-            hidden_states = self.skip_linear(combined)
-            hidden_states = self.skip_norm(hidden_states)
-
-        # self attention
-        if self.timested_modulate:
-            modulation_shift = self.default_modulation(conditioning).unsqueeze(dim=1)
-            hidden_states = hidden_states + modulation_shift
-
-        self_attn_out = self.attn1(self.norm1(hidden_states))
-        hidden_states = hidden_states + self_attn_out
-
-        # cross attention
-        hidden_states = hidden_states + self.attn2(self.norm2(hidden_states), text_states)
-
-        # MLP Layer
-        mlp_input = self.norm3(hidden_states)
-
-        if self.use_moe:
-            hidden_states = hidden_states + self.moe(mlp_input)
-        else:
-            hidden_states = hidden_states + self.mlp(mlp_input)
-
-        return hidden_states
-
-class FinalLayer(nn.Module):
-
-    def __init__(self, final_hidden_size, out_channels, operations, use_fp16: bool = False, device = None, dtype = None):
-        super().__init__()
-
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        self.norm_final = operations.LayerNorm(final_hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-        self.linear = operations.Linear(final_hidden_size, out_channels, bias = True, device = device, dtype = dtype)
-
-    def forward(self, x):
-        x = self.norm_final(x)
-        x = x[:, 1:]
-        x = self.linear(x)
-        return x
-
-class HunYuanDiTPlain(nn.Module):
-
-    # init with the default values from https://huggingface.co/tencent/Hunyuan3D-2.1/blob/main/hunyuan3d-dit-v2-1/config.yaml
-    def __init__(
-        self,
-        in_channels: int = 64,
-        hidden_size: int = 2048,
-        context_dim: int = 1024,
-        depth: int = 21,
-        num_heads: int = 16,
-        qk_norm: bool = True,
-        qkv_bias: bool = False,
-        num_moe_layers: int = 6,
-        guidance_cond_proj_dim = 2048,
-        norm_type = 'layer',
-        num_experts: int = 8,
-        moe_top_k: int = 2,
-        use_fp16: bool = False,
-        dtype = None,
-        device = None,
-        operations = None,
-        **kwargs
-        ):
-
-        self.dtype = dtype
-
-        super().__init__()
-
-        self.depth = depth
-
-        self.in_channels = in_channels
-        self.out_channels = in_channels
-
-        self.num_heads = num_heads
-        self.hidden_size = hidden_size
-
-        norm = operations.LayerNorm if norm_type == 'layer' else operations.RMSNorm
-        qk_norm = operations.RMSNorm
-
-        self.context_dim = context_dim
-        self.guidance_cond_proj_dim = guidance_cond_proj_dim
-
-        self.x_embedder = operations.Linear(in_channels, hidden_size, bias = True, device = device, dtype = dtype)
-        self.t_embedder = TimestepEmbedder(hidden_size, hidden_size * 4, cond_proj_dim = guidance_cond_proj_dim, device = device, dtype = dtype, operations = operations)
-
-
-        # HunYuanDiT blocks
-        self.blocks = nn.ModuleList([
-            HunYuanDiTBlock(hidden_size=hidden_size,
-                            c_emb_size=hidden_size,
-                            num_heads=num_heads,
-                            text_states_dim=context_dim,
-                            qk_norm=qk_norm,
-                            norm_layer = norm,
-                            qk_norm_layer = qk_norm,
-                            skip_connection=layer > depth // 2,
-                            qkv_bias=qkv_bias,
-                            use_moe=True if depth - layer <= num_moe_layers else False,
-                            num_experts=num_experts,
-                            moe_top_k=moe_top_k,
-                            use_fp16 = use_fp16,
-                            device = device, dtype = dtype, operations = operations)
-            for layer in range(depth)
-        ])
-
-        self.depth = depth
-
-        self.final_layer = FinalLayer(hidden_size, self.out_channels, use_fp16 = use_fp16, operations = operations, device = device, dtype = dtype)
-
-    def forward(self, x, t, context, transformer_options = {}, **kwargs):
-
-        x = x.movedim(-1, -2)
-        uncond_emb, cond_emb = context.chunk(2, dim = 0)
-
-        context = torch.cat([cond_emb, uncond_emb], dim = 0)
-        main_condition = context
-
-        t = 1.0 - t
-
-        time_embedded = self.t_embedder(t, condition = kwargs.get('guidance_cond'))
-
-        x = x.to(dtype = next(self.x_embedder.parameters()).dtype)
-        x_embedded = self.x_embedder(x)
-
-        combined = torch.cat([time_embedded, x_embedded], dim=1)
-
-        def block_wrap(args):
-            return block(
-                args["x"],
-                args["t"],
-                args["cond"],
-                skip_tensor=args.get("skip"),)
-
-        skip_stack = []
-        patches_replace = transformer_options.get("patches_replace", {})
-        blocks_replace = patches_replace.get("dit", {})
-        for idx, block in enumerate(self.blocks):
-            if idx <= self.depth // 2:
-                skip_input = None
-            else:
-                skip_input = skip_stack.pop()
-
-            if ("block", idx) in blocks_replace:
-
-                combined = blocks_replace[("block", idx)](
-                    {
-                        "x": combined,
-                        "t": time_embedded,
-                        "cond": main_condition,
-                        "skip": skip_input,
-                    },
-                    {"original_block": block_wrap},
-                )
-            else:
-                combined = block(combined, time_embedded, main_condition, skip_tensor=skip_input)
-
-            if idx < self.depth // 2:
-                skip_stack.append(combined)
-
-        output = self.final_layer(combined)
-        output =  output.movedim(-2, -1) * (-1.0)
-
-        cond_emb, uncond_emb = output.chunk(2, dim = 0)
-        return torch.cat([uncond_emb, cond_emb])
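Note on the removed `HunYuanDiTPlain.forward`: the skip connections follow a U-shaped schedule. Blocks with `idx < depth // 2` push their output onto `skip_stack`, blocks with `idx > depth // 2` pop from it, and the middle block does neither. The constructor separately enables MoE for layers where `depth - layer <= num_moe_layers`. The sketch below reproduces only that index bookkeeping in plain Python, with no tensors. `skip_schedule` and `moe_layers` are hypothetical helper names introduced here for illustration.

```python
def skip_schedule(depth):
    """Map each consuming block index to the block whose output it receives."""
    half = depth // 2
    stack = []  # indices of blocks whose outputs have not been consumed yet
    pairs = {}
    for idx in range(depth):
        # second-half blocks take the most recently pushed output (LIFO)
        if idx > half:
            pairs[idx] = stack.pop()
        # first-half blocks store their output for a later block
        if idx < half:
            stack.append(idx)
    return pairs


def moe_layers(depth, num_moe_layers):
    """Layer indices that satisfy the `depth - layer <= num_moe_layers` test."""
    return [layer for layer in range(depth) if depth - layer <= num_moe_layers]


pairs = skip_schedule(21)    # depth default from the removed constructor
experts = moe_layers(21, 6)  # num_moe_layers default from the removed constructor
```

With the defaults `depth=21` and `num_moe_layers=6`, block 11 receives the output of block 9, block 20 receives the output of block 0, and layers 15 through 20 use the MoE feed-forward path.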
--- a/comfy/ldm/hunyuan_video/model.py
+++ b/comfy/ldm/hunyuan_video/model.py
@@ -1,11 +1,11 @@
 #Based on Flux code because of weird hunyuan video code license.

 import torch
-import comfy.patcher_extension
 import comfy.ldm.flux.layers
 import comfy.ldm.modules.diffusionmodules.mmdit
 from comfy.ldm.modules.attention import optimized_attention

+
 from dataclasses import dataclass
 from einops import repeat

@@ -39,10 +39,6 @@ class HunyuanVideoParams:
    patch_size: list
    qkv_bias: bool
    guidance_embed: bool
-    byt5: bool
-    meanflow: bool
-    use_cond_type_embedding: bool
-    vision_in_dim: int


 class SelfAttentionRef(nn.Module):
@@ -81,13 +77,13 @@ class TokenRefinerBlock(nn.Module):
            operations.Linear(mlp_hidden_dim, hidden_size, bias=True, dtype=dtype, device=device),
        )

-    def forward(self, x, c, mask, transformer_options={}):
+    def forward(self, x, c, mask):
        mod1, mod2 = self.adaLN_modulation(c).chunk(2, dim=1)

        norm_x = self.norm1(x)
        qkv = self.self_attn.qkv(norm_x)
        q, k, v = qkv.reshape(qkv.shape[0], qkv.shape[1], 3, self.heads, -1).permute(2, 0, 3, 1, 4)
-        attn = optimized_attention(q, k, v, self.heads, mask=mask, skip_reshape=True, transformer_options=transformer_options)
+        attn = optimized_attention(q, k, v, self.heads, mask=mask, skip_reshape=True)

        x = x + self.self_attn.proj(attn) * mod1.unsqueeze(1)
        x = x + self.mlp(self.norm2(x)) * mod2.unsqueeze(1)
@@ -118,14 +114,14 @@ class IndividualTokenRefiner(nn.Module):
            ]
        )

-    def forward(self, x, c, mask, transformer_options={}):
+    def forward(self, x, c, mask):
        m = None
        if mask is not None:
            m = mask.view(mask.shape[0], 1, 1, mask.shape[1]).repeat(1, 1, mask.shape[1], 1)
            m = m + m.transpose(2, 3)

        for block in self.blocks:
-            x = block(x, c, m, transformer_options=transformer_options)
+            x = block(x, c, m)
        return x


@@ -153,45 +149,17 @@ class TokenRefiner(nn.Module):
        x,
        timesteps,
        mask,
-        transformer_options={},
    ):
        t = self.t_embedder(timestep_embedding(timesteps, 256, time_factor=1.0).to(x.dtype))
        # m = mask.float().unsqueeze(-1)
        # c = (x.float() * m).sum(dim=1) / m.sum(dim=1) #TODO: the following works when the x.shape is the same length as the tokens but might break otherwise
-        if x.dtype == torch.float16:
-            c = x.float().sum(dim=1) / x.shape[1]
-        else:
-            c = x.sum(dim=1) / x.shape[1]
+        c = x.sum(dim=1) / x.shape[1]

        c = t + self.c_embedder(c.to(x.dtype))
        x = self.input_embedder(x)
-        x = self.individual_token_refiner(x, c, mask, transformer_options=transformer_options)
+        x = self.individual_token_refiner(x, c, mask)
        return x

-
-class ByT5Mapper(nn.Module):
-    def __init__(self, in_dim, out_dim, hidden_dim, out_dim1, use_res=False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.layernorm = operations.LayerNorm(in_dim, dtype=dtype, device=device)
-        self.fc1 = operations.Linear(in_dim, hidden_dim, dtype=dtype, device=device)
-        self.fc2 = operations.Linear(hidden_dim, out_dim, dtype=dtype, device=device)
-        self.fc3 = operations.Linear(out_dim, out_dim1, dtype=dtype, device=device)
-        self.use_res = use_res
-        self.act_fn = nn.GELU()
-
-    def forward(self, x):
-        if self.use_res:
-            res = x
-        x = self.layernorm(x)
-        x = self.fc1(x)
-        x = self.act_fn(x)
-        x = self.fc2(x)
-        x2 = self.act_fn(x)
-        x2 = self.fc3(x2)
-        if self.use_res:
-            x2 = x2 + res
-        return x2
-
 class HunyuanVideo(nn.Module):
    """
    Transformer model for flow matching on sequences.
@@ -200,15 +168,11 @@ class HunyuanVideo(nn.Module):
    def __init__(self, image_model=None, final_layer=True, dtype=None, device=None, operations=None, **kwargs):
        super().__init__()
        self.dtype = dtype
-        operation_settings = {"operations": operations, "device": device, "dtype": dtype}
-
        params = HunyuanVideoParams(**kwargs)
        self.params = params
        self.patch_size = params.patch_size
        self.in_channels = params.in_channels
        self.out_channels = params.out_channels
-        self.use_cond_type_embedding = params.use_cond_type_embedding
-        self.vision_in_dim = params.vision_in_dim
        if params.hidden_size % params.num_heads != 0:
            raise ValueError(
                f"Hidden size {params.hidden_size} must be divisible by num_heads {params.num_heads}"
@@ -220,13 +184,9 @@ class HunyuanVideo(nn.Module):
        self.num_heads = params.num_heads
        self.pe_embedder = EmbedND(dim=pe_dim, theta=params.theta, axes_dim=params.axes_dim)

-        self.img_in = comfy.ldm.modules.diffusionmodules.mmdit.PatchEmbed(None, self.patch_size, self.in_channels, self.hidden_size, conv3d=len(self.patch_size) == 3, dtype=dtype, device=device, operations=operations)
+        self.img_in = comfy.ldm.modules.diffusionmodules.mmdit.PatchEmbed(None, self.patch_size, self.in_channels, self.hidden_size, conv3d=True, dtype=dtype, device=device, operations=operations)
        self.time_in = MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, dtype=dtype, device=device, operations=operations)
-        if params.vec_in_dim is not None:
-            self.vector_in = MLPEmbedder(params.vec_in_dim, self.hidden_size, dtype=dtype, device=device, operations=operations)
-        else:
-            self.vector_in = None
-
+        self.vector_in = MLPEmbedder(params.vec_in_dim, self.hidden_size, dtype=dtype, device=device, operations=operations)
        self.guidance_in = (
            MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, dtype=dtype, device=device, operations=operations) if params.guidance_embed else nn.Identity()
        )
@@ -254,38 +214,9 @@ class HunyuanVideo(nn.Module):
            ]
        )

-        if params.byt5:
-            self.byt5_in = ByT5Mapper(
-                in_dim=1472,
-                out_dim=2048,
-                hidden_dim=2048,
-                out_dim1=self.hidden_size,
-                use_res=False,
-                dtype=dtype, device=device, operations=operations
-            )
-        else:
-            self.byt5_in = None
-
-        if params.meanflow:
-            self.time_r_in = MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, dtype=dtype, device=device, operations=operations)
-        else:
-            self.time_r_in = None
-
        if final_layer:
            self.final_layer = LastLayer(self.hidden_size, self.patch_size[-1], self.out_channels, dtype=dtype, device=device, operations=operations)

-        # HunyuanVideo 1.5 specific modules
-        if self.vision_in_dim is not None:
-            from comfy.ldm.wan.model import MLPProj
-            self.vision_in = MLPProj(in_dim=self.vision_in_dim, out_dim=self.hidden_size, operation_settings=operation_settings)
-        else:
-            self.vision_in = None
-        if self.use_cond_type_embedding:
-            # 0: text_encoder feature 1: byt5 feature 2: vision_encoder feature
-            self.cond_type_embedding = nn.Embedding(3, self.hidden_size)
-        else:
-            self.cond_type_embedding = None
-
    def forward_orig(
        self,
        img: Tensor,
@@ -294,13 +225,10 @@ class HunyuanVideo(nn.Module):
        txt_ids: Tensor,
        txt_mask: Tensor,
        timesteps: Tensor,
-        y: Tensor = None,
-        txt_byt5=None,
-        clip_fea=None,
+        y: Tensor,
        guidance: Tensor = None,
        guiding_frame_index=None,
        ref_latent=None,
-        disable_time_r=False,
        control=None,
        transformer_options={},
    ) -> Tensor:
@@ -311,14 +239,6 @@ class HunyuanVideo(nn.Module):
        img = self.img_in(img)
        vec = self.time_in(timestep_embedding(timesteps, 256, time_factor=1.0).to(img.dtype))

-        if (self.time_r_in is not None) and (not disable_time_r):
-            w = torch.where(transformer_options['sigmas'][0] == transformer_options['sample_sigmas'])[0]  # This most likely could be improved
-            if len(w) > 0:
-                timesteps_r = transformer_options['sample_sigmas'][w[0] + 1]
-                timesteps_r = timesteps_r.unsqueeze(0).to(device=timesteps.device, dtype=timesteps.dtype)
-                vec_r = self.time_r_in(timestep_embedding(timesteps_r, 256, time_factor=1000.0).to(img.dtype))
-                vec = (vec + vec_r) / 2
-
        if ref_latent is not None:
            ref_latent_ids = self.img_ids(ref_latent)
            ref_latent = self.img_in(ref_latent)
@@ -329,17 +249,13 @@ class HunyuanVideo(nn.Module):

        if guiding_frame_index is not None:
            token_replace_vec = self.time_in(timestep_embedding(guiding_frame_index, 256, time_factor=1.0))
-            if self.vector_in is not None:
-                vec_ = self.vector_in(y[:, :self.params.vec_in_dim])
-                vec = torch.cat([(vec_ + token_replace_vec).unsqueeze(1), (vec_ + vec).unsqueeze(1)], dim=1)
-            else:
-                vec = torch.cat([(token_replace_vec).unsqueeze(1), (vec).unsqueeze(1)], dim=1)
+            vec_ = self.vector_in(y[:, :self.params.vec_in_dim])
+            vec = torch.cat([(vec_ + token_replace_vec).unsqueeze(1), (vec_ + vec).unsqueeze(1)], dim=1)
            frame_tokens = (initial_shape[-1] // self.patch_size[-1]) * (initial_shape[-2] // self.patch_size[-2])
            modulation_dims = [(0, frame_tokens, 0), (frame_tokens, None, 1)]
            modulation_dims_txt = [(0, None, 1)]
        else:
-            if self.vector_in is not None:
-                vec = vec + self.vector_in(y[:, :self.params.vec_in_dim])
+            vec = vec + self.vector_in(y[:, :self.params.vec_in_dim])
            modulation_dims = None
            modulation_dims_txt = None

@@ -350,32 +266,7 @@ class HunyuanVideo(nn.Module):
        if txt_mask is not None and not torch.is_floating_point(txt_mask):
            txt_mask = (txt_mask - 1).to(img.dtype) * torch.finfo(img.dtype).max

-        txt = self.txt_in(txt, timesteps, txt_mask, transformer_options=transformer_options)
-
-        if self.cond_type_embedding is not None:
-            self.cond_type_embedding.to(txt.device)
-            cond_emb = self.cond_type_embedding(torch.zeros_like(txt[:, :, 0], device=txt.device, dtype=torch.long))
-            txt = txt + cond_emb.to(txt.dtype)
-
-        if self.byt5_in is not None and txt_byt5 is not None:
-            txt_byt5 = self.byt5_in(txt_byt5)
-            if self.cond_type_embedding is not None:
-                cond_emb = self.cond_type_embedding(torch.ones_like(txt_byt5[:, :, 0], device=txt_byt5.device, dtype=torch.long))
-                txt_byt5 = txt_byt5 + cond_emb.to(txt_byt5.dtype)
-                txt = torch.cat((txt_byt5, txt), dim=1) # byt5 first for HunyuanVideo1.5
-            else:
-                txt = torch.cat((txt, txt_byt5), dim=1)
-            txt_byt5_ids = torch.zeros((txt_ids.shape[0], txt_byt5.shape[1], txt_ids.shape[-1]), device=txt_ids.device, dtype=txt_ids.dtype)
-            txt_ids = torch.cat((txt_ids, txt_byt5_ids), dim=1)
-
-        if clip_fea is not None:
-            txt_vision_states = self.vision_in(clip_fea)
-            if self.cond_type_embedding is not None:
-                cond_emb = self.cond_type_embedding(2 * torch.ones_like(txt_vision_states[:, :, 0], dtype=torch.long, device=txt_vision_states.device))
-                txt_vision_states = txt_vision_states + cond_emb
-            txt = torch.cat((txt_vision_states.to(txt.dtype), txt), dim=1)
-            extra_txt_ids = torch.zeros((txt_ids.shape[0], txt_vision_states.shape[1], txt_ids.shape[-1]), device=txt_ids.device, dtype=txt_ids.dtype)
-            txt_ids = torch.cat((txt_ids, extra_txt_ids), dim=1)
+        txt = self.txt_in(txt, timesteps, txt_mask)

        ids = torch.cat((img_ids, txt_ids), dim=1)
        pe = self.pe_embedder(ids)
@@ -389,21 +280,18 @@ class HunyuanVideo(nn.Module):
            attn_mask = None

        blocks_replace = patches_replace.get("dit", {})
-        transformer_options["total_blocks"] = len(self.double_blocks)
-        transformer_options["block_type"] = "double"
        for i, block in enumerate(self.double_blocks):
-            transformer_options["block_index"] = i
            if ("double_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
-                    out["img"], out["txt"] = block(img=args["img"], txt=args["txt"], vec=args["vec"], pe=args["pe"], attn_mask=args["attention_mask"], modulation_dims_img=args["modulation_dims_img"], modulation_dims_txt=args["modulation_dims_txt"], transformer_options=args["transformer_options"])
+                    out["img"], out["txt"] = block(img=args["img"], txt=args["txt"], vec=args["vec"], pe=args["pe"], attn_mask=args["attention_mask"], modulation_dims_img=args["modulation_dims_img"], modulation_dims_txt=args["modulation_dims_txt"])
                    return out

-                out = blocks_replace[("double_block", i)]({"img": img, "txt": txt, "vec": vec, "pe": pe, "attention_mask": attn_mask, 'modulation_dims_img': modulation_dims, 'modulation_dims_txt': modulation_dims_txt, 'transformer_options': transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("double_block", i)]({"img": img, "txt": txt, "vec": vec, "pe": pe, "attention_mask": attn_mask, 'modulation_dims_img': modulation_dims, 'modulation_dims_txt': modulation_dims_txt}, {"original_block": block_wrap})
                txt = out["txt"]
                img = out["img"]
            else:
-                img, txt = block(img=img, txt=txt, vec=vec, pe=pe, attn_mask=attn_mask, modulation_dims_img=modulation_dims, modulation_dims_txt=modulation_dims_txt, transformer_options=transformer_options)
+                img, txt = block(img=img, txt=txt, vec=vec, pe=pe, attn_mask=attn_mask, modulation_dims_img=modulation_dims, modulation_dims_txt=modulation_dims_txt)

            if control is not None: # Controlnet
                control_i = control.get("input")
@@ -414,20 +302,17 @@ class HunyuanVideo(nn.Module):

        img = torch.cat((img, txt), 1)

-        transformer_options["total_blocks"] = len(self.single_blocks)
-        transformer_options["block_type"] = "single"
        for i, block in enumerate(self.single_blocks):
-            transformer_options["block_index"] = i
            if ("single_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
-                    out["img"] = block(args["img"], vec=args["vec"], pe=args["pe"], attn_mask=args["attention_mask"], modulation_dims=args["modulation_dims"], transformer_options=args["transformer_options"])
+                    out["img"] = block(args["img"], vec=args["vec"], pe=args["pe"], attn_mask=args["attention_mask"], modulation_dims=args["modulation_dims"])
                    return out

-                out = blocks_replace[("single_block", i)]({"img": img, "vec": vec, "pe": pe, "attention_mask": attn_mask, 'modulation_dims': modulation_dims, 'transformer_options': transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("single_block", i)]({"img": img, "vec": vec, "pe": pe, "attention_mask": attn_mask, 'modulation_dims': modulation_dims}, {"original_block": block_wrap})
                img = out["img"]
            else:
-                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask, modulation_dims=modulation_dims, transformer_options=transformer_options)
+                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask, modulation_dims=modulation_dims)

            if control is not None: # Controlnet
                control_o = control.get("output")
@@ -442,16 +327,12 @@ class HunyuanVideo(nn.Module):

        img = self.final_layer(img, vec, modulation_dims=modulation_dims)  # (N, T, patch_size ** 2 * out_channels)

-        shape = initial_shape[-len(self.patch_size):]
+        shape = initial_shape[-3:]
        for i in range(len(shape)):
            shape[i] = shape[i] // self.patch_size[i]
        img = img.reshape([img.shape[0]] + shape + [self.out_channels] + self.patch_size)
-        if img.ndim == 8:
-            img = img.permute(0, 4, 1, 5, 2, 6, 3, 7)
-            img = img.reshape(initial_shape[0], self.out_channels, initial_shape[2], initial_shape[3], initial_shape[4])
-        else:
-            img = img.permute(0, 3, 1, 4, 2, 5)
-            img = img.reshape(initial_shape[0], self.out_channels, initial_shape[2], initial_shape[3])
+        img = img.permute(0, 4, 1, 5, 2, 6, 3, 7)
+        img = img.reshape(initial_shape[0], self.out_channels, initial_shape[2], initial_shape[3], initial_shape[4])
        return img

    def img_ids(self, x):
@@ -466,30 +347,9 @@ class HunyuanVideo(nn.Module):
        img_ids[:, :, :, 2] = img_ids[:, :, :, 2] + torch.linspace(0, w_len - 1, steps=w_len, device=x.device, dtype=x.dtype).reshape(1, 1, -1)
        return repeat(img_ids, "t h w c -> b (t h w) c", b=bs)

-    def img_ids_2d(self, x):
-        bs, c, h, w = x.shape
-        patch_size = self.patch_size
-        h_len = ((h + (patch_size[0] // 2)) // patch_size[0])
-        w_len = ((w + (patch_size[1] // 2)) // patch_size[1])
-        img_ids = torch.zeros((h_len, w_len, 2), device=x.device, dtype=x.dtype)
-        img_ids[:, :, 0] = img_ids[:, :, 0] + torch.linspace(0, h_len - 1, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1)
-        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(0, w_len - 1, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0)
-        return repeat(img_ids, "h w c -> b (h w) c", b=bs)
-
-    def forward(self, x, timestep, context, y=None, txt_byt5=None, clip_fea=None, guidance=None, attention_mask=None, guiding_frame_index=None, ref_latent=None, disable_time_r=False, control=None, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, y, txt_byt5, clip_fea, guidance, attention_mask, guiding_frame_index, ref_latent, disable_time_r, control, transformer_options, **kwargs)
-
-    def _forward(self, x, timestep, context, y=None, txt_byt5=None, clip_fea=None, guidance=None, attention_mask=None, guiding_frame_index=None, ref_latent=None, disable_time_r=False, control=None, transformer_options={}, **kwargs):
-        bs = x.shape[0]
-        if len(self.patch_size) == 3:
-            img_ids = self.img_ids(x)
-            txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
-        else:
-            img_ids = self.img_ids_2d(x)
-            txt_ids = torch.zeros((bs, context.shape[1], 2), device=x.device, dtype=x.dtype)
-        out = self.forward_orig(x, img_ids, context, txt_ids, attention_mask, timestep, y, txt_byt5, clip_fea, guidance, guiding_frame_index, ref_latent, disable_time_r=disable_time_r, control=control, transformer_options=transformer_options)
+    def forward(self, x, timestep, context, y, guidance=None, attention_mask=None, guiding_frame_index=None, ref_latent=None, control=None, transformer_options={}, **kwargs):
+        bs, c, t, h, w = x.shape
+        img_ids = self.img_ids(x)
+        txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
+        out = self.forward_orig(x, img_ids, context, txt_ids, attention_mask, timestep, y, guidance, guiding_frame_index, ref_latent, control=control, transformer_options=transformer_options)
        return out
--- a/comfy/ldm/hunyuan_video/upsampler.py
+++ b/comfy/ldm/hunyuan_video/upsampler.py
@@ -1,120 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.hunyuan_video.vae_refiner import RMS_norm, ResnetBlock, VideoConv3d
-import model_management, model_patcher
-
-class SRResidualCausalBlock3D(nn.Module):
-    def __init__(self, channels: int):
-        super().__init__()
-        self.block = nn.Sequential(
-            VideoConv3d(channels, channels, kernel_size=3),
-            nn.SiLU(inplace=True),
-            VideoConv3d(channels, channels, kernel_size=3),
-            nn.SiLU(inplace=True),
-            VideoConv3d(channels, channels, kernel_size=3),
-        )
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        return x + self.block(x)
-
-class SRModel3DV2(nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        out_channels: int,
-        hidden_channels: int = 64,
-        num_blocks: int = 6,
-        global_residual: bool = False,
-    ):
-        super().__init__()
-        self.in_conv = VideoConv3d(in_channels, hidden_channels, kernel_size=3)
-        self.blocks = nn.ModuleList([SRResidualCausalBlock3D(hidden_channels) for _ in range(num_blocks)])
-        self.out_conv = VideoConv3d(hidden_channels, out_channels, kernel_size=3)
-        self.global_residual = bool(global_residual)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        residual = x
-        y = self.in_conv(x)
-        for blk in self.blocks:
-            y = blk(y)
-        y = self.out_conv(y)
-        if self.global_residual and (y.shape == residual.shape):
-            y = y + residual
-        return y
-
-
-class Upsampler(nn.Module):
-    def __init__(
-        self,
-        z_channels: int,
-        out_channels: int,
-        block_out_channels: tuple[int, ...],
-        num_res_blocks: int = 2,
-    ):
-        super().__init__()
-        self.num_res_blocks = num_res_blocks
-        self.block_out_channels = block_out_channels
-        self.z_channels = z_channels
-
-        ch = block_out_channels[0]
-        self.conv_in = VideoConv3d(z_channels, ch, kernel_size=3)
-
-        self.up = nn.ModuleList()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([ResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                    out_channels=tgt,
-                                                    temb_channels=0,
-                                                    conv_shortcut=False,
-                                                    conv_op=VideoConv3d, norm_op=RMS_norm)
-                                        for j in range(num_res_blocks + 1)])
-            ch = tgt
-            self.up.append(stage)
-
-        self.norm_out = RMS_norm(ch)
-        self.conv_out = VideoConv3d(ch, out_channels, kernel_size=3)
-
-    def forward(self, z):
-        """
-        Args:
-            z: (B, C, T, H, W)
-            target_shape: (H, W)
-        """
-        # z to block_in
-        repeats = self.block_out_channels[0] // (self.z_channels)
-        x = self.conv_in(z) + z.repeat_interleave(repeats=repeats, dim=1)
-
-        # upsampling
-        for stage in self.up:
-            for blk in stage.block:
-                x = blk(x)
-
-        out = self.conv_out(F.silu(self.norm_out(x)))
-        return out
-
-UPSAMPLERS = {
-    "720p": SRModel3DV2,
-    "1080p": Upsampler,
-}
-
-class HunyuanVideo15SRModel():
-    def __init__(self, model_type, config):
-        self.load_device = model_management.vae_device()
-        offload_device = model_management.vae_offload_device()
-        self.dtype = model_management.vae_dtype(self.load_device)
-        self.model_class = UPSAMPLERS.get(model_type)
-        self.model = self.model_class(**config).eval()
-
-        self.patcher = model_patcher.ModelPatcher(self.model, load_device=self.load_device, offload_device=offload_device)
-
-    def load_sd(self, sd):
-        return self.model.load_state_dict(sd, strict=True)
-
-    def get_sd(self):
-        return self.model.state_dict()
-
-    def resample_latent(self, latent):
-        model_management.load_model_gpu(self.patcher)
-        return self.model(latent.to(self.load_device))
--- a/comfy/ldm/hunyuan_video/vae.py
+++ b/comfy/ldm/hunyuan_video/vae.py
@@ -1,136 +0,0 @@
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.modules.diffusionmodules.model import ResnetBlock, AttnBlock
-import comfy.ops
-ops = comfy.ops.disable_weight_init
-
-
-class PixelShuffle2D(nn.Module):
-    def __init__(self, in_dim, out_dim, op=ops.Conv2d):
-        super().__init__()
-        self.conv = op(in_dim, out_dim >> 2, 3, 1, 1)
-        self.ratio = (in_dim << 2) // out_dim
-
-    def forward(self, x):
-        b, c, h, w = x.shape
-        h2, w2 = h >> 1, w >> 1
-        y = self.conv(x).view(b, -1, h2, 2, w2, 2).permute(0, 3, 5, 1, 2, 4).reshape(b, -1, h2, w2)
-        r = x.view(b, c, h2, 2, w2, 2).permute(0, 3, 5, 1, 2, 4).reshape(b, c << 2, h2, w2)
-        return y + r.view(b, y.shape[1], self.ratio, h2, w2).mean(2)
-
-
-class PixelUnshuffle2D(nn.Module):
-    def __init__(self, in_dim, out_dim, op=ops.Conv2d):
-        super().__init__()
-        self.conv = op(in_dim, out_dim << 2, 3, 1, 1)
-        self.scale = (out_dim << 2) // in_dim
-
-    def forward(self, x):
-        b, c, h, w = x.shape
-        h2, w2 = h << 1, w << 1
-        y = self.conv(x).view(b, 2, 2, -1, h, w).permute(0, 3, 4, 1, 5, 2).reshape(b, -1, h2, w2)
-        r = x.repeat_interleave(self.scale, 1).view(b, 2, 2, -1, h, w).permute(0, 3, 4, 1, 5, 2).reshape(b, -1, h2, w2)
-        return y + r
-
-
-class Encoder(nn.Module):
-    def __init__(self, in_channels, z_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, downsample_match_channel=True, **_):
-        super().__init__()
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-        self.conv_in = ops.Conv2d(in_channels, block_out_channels[0], 3, 1, 1)
-
-        self.down = nn.ModuleList()
-        ch = block_out_channels[0]
-        depth = (ffactor_spatial >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([ResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                     out_channels=tgt,
-                                                     temb_channels=0,
-                                                     conv_op=ops.Conv2d)
-                                        for j in range(num_res_blocks)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and downsample_match_channel else ch
-                stage.downsample = PixelShuffle2D(ch, nxt, ops.Conv2d)
-                ch = nxt
-            self.down.append(stage)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv2d)
-        self.mid.block_2 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-
-        self.norm_out = ops.GroupNorm(32, ch, 1e-6, True)
-        self.conv_out = ops.Conv2d(ch, z_channels << 1, 3, 1, 1)
-
-    def forward(self, x):
-        x = self.conv_in(x)
-
-        for stage in self.down:
-            for blk in stage.block:
-                x = blk(x)
-            if hasattr(stage, 'downsample'):
-                x = stage.downsample(x)
-
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(x)))
-
-        b, c, h, w = x.shape
-        grp = c // (self.z_channels << 1)
-        skip = x.view(b, c // grp, grp, h, w).mean(2)
-
-        return self.conv_out(F.silu(self.norm_out(x))) + skip
-
-
-class Decoder(nn.Module):
-    def __init__(self, z_channels, out_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, upsample_match_channel=True, **_):
-        super().__init__()
-        block_out_channels = block_out_channels[::-1]
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-
-        ch = block_out_channels[0]
-        self.conv_in = ops.Conv2d(z_channels, ch, 3, 1, 1)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv2d)
-        self.mid.block_2 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-
-        self.up = nn.ModuleList()
-        depth = (ffactor_spatial >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([ResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                     out_channels=tgt,
-                                                     temb_channels=0,
-                                                     conv_op=ops.Conv2d)
-                                        for j in range(num_res_blocks + 1)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and upsample_match_channel else ch
-                stage.upsample = PixelUnshuffle2D(ch, nxt, ops.Conv2d)
-                ch = nxt
-            self.up.append(stage)
-
-        self.norm_out = ops.GroupNorm(32, ch, 1e-6, True)
-        self.conv_out = ops.Conv2d(ch, out_channels, 3, 1, 1)
-
-    def forward(self, z):
-        x = self.conv_in(z) + z.repeat_interleave(self.block_out_channels[0] // self.z_channels, 1)
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(x)))
-
-        for stage in self.up:
-            for blk in stage.block:
-                x = blk(x)
-            if hasattr(stage, 'upsample'):
-                x = stage.upsample(x)
-
-        return self.conv_out(F.silu(self.norm_out(x)))
--- a/comfy/ldm/hunyuan_video/vae_refiner.py
+++ b/comfy/ldm/hunyuan_video/vae_refiner.py
@@ -1,363 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.modules.diffusionmodules.model import ResnetBlock, AttnBlock, VideoConv3d, Normalize
-import comfy.ops
-import comfy.ldm.models.autoencoder
-import comfy.model_management
-ops = comfy.ops.disable_weight_init
-
-class NoPadConv3d(nn.Module):
-    def __init__(self, n_channels, out_channels, kernel_size, stride=1, dilation=1, padding=0, **kwargs):
-        super().__init__()
-        self.conv = ops.Conv3d(n_channels, out_channels, kernel_size, stride=stride, dilation=dilation, **kwargs)
-
-    def forward(self, x):
-        return self.conv(x)
-
-
-def conv_carry_causal_3d(xl, op, conv_carry_in=None, conv_carry_out=None):
-
-    x = xl[0]
-    xl.clear()
-
-    if conv_carry_out is not None:
-        to_push = x[:, :, -2:, :, :].clone()
-        conv_carry_out.append(to_push)
-
-    if isinstance(op, NoPadConv3d):
-        if conv_carry_in is None:
-            x = torch.nn.functional.pad(x, (1, 1, 1, 1, 2, 0), mode = 'replicate')
-        else:
-            carry_len = conv_carry_in[0].shape[2]
-            x = torch.cat([conv_carry_in.pop(0), x], dim=2)
-            x = torch.nn.functional.pad(x, (1, 1, 1, 1, 2 - carry_len, 0), mode = 'replicate')
-
-    out = op(x)
-
-    return out
-
-
-class RMS_norm(nn.Module):
-    def __init__(self, dim):
-        super().__init__()
-        shape = (dim, 1, 1, 1)
-        self.scale = dim**0.5
-        self.gamma = nn.Parameter(torch.empty(shape))
-
-    def forward(self, x):
-        return F.normalize(x, dim=1) * self.scale * comfy.model_management.cast_to(self.gamma, dtype=x.dtype, device=x.device)
-
-class DnSmpl(nn.Module):
-    def __init__(self, ic, oc, tds=True, refiner_vae=True, op=VideoConv3d):
-        super().__init__()
-        fct = 2 * 2 * 2 if tds else 1 * 2 * 2
-        assert oc % fct == 0
-        self.conv = op(ic, oc // fct, kernel_size=3, stride=1, padding=1)
-        self.refiner_vae = refiner_vae
-
-        self.tds = tds
-        self.gs = fct * ic // oc
-
-    def forward(self, x, conv_carry_in=None, conv_carry_out=None):
-        r1 = 2 if self.tds else 1
-        h = conv_carry_causal_3d([x], self.conv, conv_carry_in, conv_carry_out)
-
-        if self.tds and self.refiner_vae and conv_carry_in is None:
-
-            hf = h[:, :, :1, :, :]
-            b, c, f, ht, wd = hf.shape
-            hf = hf.reshape(b, c, f, ht // 2, 2, wd // 2, 2)
-            hf = hf.permute(0, 4, 6, 1, 2, 3, 5)
-            hf = hf.reshape(b, 2 * 2 * c, f, ht // 2, wd // 2)
-            hf = torch.cat([hf, hf], dim=1)
-
-            h = h[:, :, 1:, :, :]
-
-            xf = x[:, :, :1, :, :]
-            b, ci, f, ht, wd = xf.shape
-            xf = xf.reshape(b, ci, f, ht // 2, 2, wd // 2, 2)
-            xf = xf.permute(0, 4, 6, 1, 2, 3, 5)
-            xf = xf.reshape(b, 2 * 2 * ci, f, ht // 2, wd // 2)
-            B, C, T, H, W = xf.shape
-            xf = xf.view(B, hf.shape[1], self.gs // 2, T, H, W).mean(dim=2)
-
-            x = x[:, :, 1:, :, :]
-
-        if h.shape[2] == 0:
-            return hf + xf
-
-        b, c, frms, ht, wd = h.shape
-        nf = frms // r1
-        h = h.reshape(b, c, nf, r1, ht // 2, 2, wd // 2, 2)
-        h = h.permute(0, 3, 5, 7, 1, 2, 4, 6)
-        h = h.reshape(b, r1 * 2 * 2 * c, nf, ht // 2, wd // 2)
-
-        b, ci, frms, ht, wd = x.shape
-        nf = frms // r1
-        x = x.reshape(b, ci, nf, r1, ht // 2, 2, wd // 2, 2)
-        x = x.permute(0, 3, 5, 7, 1, 2, 4, 6)
-        x = x.reshape(b, r1 * 2 * 2 * ci, nf, ht // 2, wd // 2)
-        B, C, T, H, W = x.shape
-        x = x.view(B, h.shape[1], self.gs, T, H, W).mean(dim=2)
-
-        if self.tds and self.refiner_vae and conv_carry_in is None:
-            h = torch.cat([hf, h], dim=2)
-            x = torch.cat([xf, x], dim=2)
-
-        return h + x
-
-
-class UpSmpl(nn.Module):
-    def __init__(self, ic, oc, tus=True, refiner_vae=True, op=VideoConv3d):
-        super().__init__()
-        fct = 2 * 2 * 2 if tus else 1 * 2 * 2
-        self.conv = op(ic, oc * fct, kernel_size=3, stride=1, padding=1)
-        self.refiner_vae = refiner_vae
-
-        self.tus = tus
-        self.rp = fct * oc // ic
-
-    def forward(self, x, conv_carry_in=None, conv_carry_out=None):
-        r1 = 2 if self.tus else 1
-        h = conv_carry_causal_3d([x], self.conv, conv_carry_in, conv_carry_out)
-
-        if self.tus and self.refiner_vae and conv_carry_in is None:
-            hf = h[:, :, :1, :, :]
-            b, c, f, ht, wd = hf.shape
-            nc = c // (2 * 2)
-            hf = hf.reshape(b, 2, 2, nc, f, ht, wd)
-            hf = hf.permute(0, 3, 4, 5, 1, 6, 2)
-            hf = hf.reshape(b, nc, f, ht * 2, wd * 2)
-            hf = hf[:, : hf.shape[1] // 2]
-
-            h = h[:, :, 1:, :, :]
-
-            xf = x[:, :, :1, :, :]
-            b, ci, f, ht, wd = xf.shape
-            xf = xf.repeat_interleave(repeats=self.rp // 2, dim=1)
-            b, c, f, ht, wd = xf.shape
-            nc = c // (2 * 2)
-            xf = xf.reshape(b, 2, 2, nc, f, ht, wd)
-            xf = xf.permute(0, 3, 4, 5, 1, 6, 2)
-            xf = xf.reshape(b, nc, f, ht * 2, wd * 2)
-
-            x = x[:, :, 1:, :, :]
-
-        b, c, frms, ht, wd = h.shape
-        nc = c // (r1 * 2 * 2)
-        h = h.reshape(b, r1, 2, 2, nc, frms, ht, wd)
-        h = h.permute(0, 4, 5, 1, 6, 2, 7, 3)
-        h = h.reshape(b, nc, frms * r1, ht * 2, wd * 2)
-
-        x = x.repeat_interleave(repeats=self.rp, dim=1)
-        b, c, frms, ht, wd = x.shape
-        nc = c // (r1 * 2 * 2)
-        x = x.reshape(b, r1, 2, 2, nc, frms, ht, wd)
-        x = x.permute(0, 4, 5, 1, 6, 2, 7, 3)
-        x = x.reshape(b, nc, frms * r1, ht * 2, wd * 2)
-
-        if self.tus and self.refiner_vae and conv_carry_in is None:
-            h = torch.cat([hf, h], dim=2)
-            x = torch.cat([xf, x], dim=2)
-
-        return h + x
-
-class HunyuanRefinerResnetBlock(ResnetBlock):
-    def __init__(self, in_channels, out_channels, conv_op=NoPadConv3d, norm_op=RMS_norm):
-        super().__init__(in_channels=in_channels, out_channels=out_channels, temb_channels=0, conv_op=conv_op, norm_op=norm_op)
-
-    def forward(self, x, conv_carry_in=None, conv_carry_out=None):
-        h = x
-        h = [ self.swish(self.norm1(x)) ]
-        h = conv_carry_causal_3d(h, self.conv1, conv_carry_in=conv_carry_in, conv_carry_out=conv_carry_out)
-
-        h = [ self.dropout(self.swish(self.norm2(h))) ]
-        h = conv_carry_causal_3d(h, self.conv2, conv_carry_in=conv_carry_in, conv_carry_out=conv_carry_out)
-
-        if self.in_channels != self.out_channels:
-            x = self.nin_shortcut(x)
-
-        return x+h
-
-class Encoder(nn.Module):
-    def __init__(self, in_channels, z_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, ffactor_temporal, downsample_match_channel=True, refiner_vae=True, **_):
-        super().__init__()
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-        self.ffactor_temporal = ffactor_temporal
-
-        self.refiner_vae = refiner_vae
-        if self.refiner_vae:
-            conv_op = NoPadConv3d
-            norm_op = RMS_norm
-        else:
-            conv_op = ops.Conv3d
-            norm_op = Normalize
-
-        self.conv_in = conv_op(in_channels, block_out_channels[0], 3, 1, 1)
-
-        self.down = nn.ModuleList()
-        ch = block_out_channels[0]
-        depth = (ffactor_spatial >> 1).bit_length()
-        depth_temporal = ((ffactor_spatial // self.ffactor_temporal) >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([HunyuanRefinerResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                                   out_channels=tgt,
-                                                                   conv_op=conv_op, norm_op=norm_op)
-                                        for j in range(num_res_blocks)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and downsample_match_channel else ch
-                stage.downsample = DnSmpl(ch, nxt, tds=i >= depth_temporal, refiner_vae=self.refiner_vae, op=conv_op)
-                ch = nxt
-            self.down.append(stage)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = HunyuanRefinerResnetBlock(in_channels=ch, out_channels=ch, conv_op=conv_op, norm_op=norm_op)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv3d, norm_op=norm_op)
-        self.mid.block_2 = HunyuanRefinerResnetBlock(in_channels=ch, out_channels=ch, conv_op=conv_op, norm_op=norm_op)
-
-        self.norm_out = norm_op(ch)
-        self.conv_out = conv_op(ch, z_channels << 1, 3, 1, 1)
-
-        self.regul = comfy.ldm.models.autoencoder.DiagonalGaussianRegularizer()
-
-    def forward(self, x):
-        if not self.refiner_vae and x.shape[2] == 1:
-            x = x.expand(-1, -1, self.ffactor_temporal, -1, -1)
-
-        if self.refiner_vae:
-            xl = [x[:, :, :1, :, :]]
-            if x.shape[2] > self.ffactor_temporal:
-                xl += torch.split(x[:, :, 1: 1 + ((x.shape[2] - 1) // self.ffactor_temporal) * self.ffactor_temporal, :, :], self.ffactor_temporal * 2, dim=2)
-            x = xl
-        else:
-            x = [x]
-        out = []
-
-        conv_carry_in = None
-
-        for i, x1 in enumerate(x):
-            conv_carry_out = []
-            if i == len(x) - 1:
-                conv_carry_out = None
-            x1 = [ x1 ]
-            x1 = conv_carry_causal_3d(x1, self.conv_in, conv_carry_in, conv_carry_out)
-
-            for stage in self.down:
-                for blk in stage.block:
-                    x1 = blk(x1, conv_carry_in, conv_carry_out)
-                if hasattr(stage, 'downsample'):
-                    x1 = stage.downsample(x1, conv_carry_in, conv_carry_out)
-
-            out.append(x1)
-            conv_carry_in = conv_carry_out
-
-        if len(out) > 1:
-            out = torch.cat(out, dim=2)
-        else:
-            out = out[0]
-
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(out)))
-        del out
-
-        b, c, t, h, w = x.shape
-        grp = c // (self.z_channels << 1)
-        skip = x.view(b, c // grp, grp, t, h, w).mean(2)
-
-        out = conv_carry_causal_3d([F.silu(self.norm_out(x))], self.conv_out) + skip
-
-        if self.refiner_vae:
-            out = self.regul(out)[0]
-
-        return out
-
-class Decoder(nn.Module):
-    def __init__(self, z_channels, out_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, ffactor_temporal, upsample_match_channel=True, refiner_vae=True, **_):
-        super().__init__()
-        block_out_channels = block_out_channels[::-1]
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-
-        self.refiner_vae = refiner_vae
-        if self.refiner_vae:
-            conv_op = NoPadConv3d
-            norm_op = RMS_norm
-        else:
-            conv_op = ops.Conv3d
-            norm_op = Normalize
-
-        ch = block_out_channels[0]
-        self.conv_in = conv_op(z_channels, ch, kernel_size=3, stride=1, padding=1)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = HunyuanRefinerResnetBlock(in_channels=ch, out_channels=ch, conv_op=conv_op, norm_op=norm_op)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv3d, norm_op=norm_op)
-        self.mid.block_2 = HunyuanRefinerResnetBlock(in_channels=ch, out_channels=ch,  conv_op=conv_op, norm_op=norm_op)
-
-        self.up = nn.ModuleList()
-        depth = (ffactor_spatial >> 1).bit_length()
-        depth_temporal = (ffactor_temporal >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([HunyuanRefinerResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                                   out_channels=tgt,
-                                                                   conv_op=conv_op, norm_op=norm_op)
-                                        for j in range(num_res_blocks + 1)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and upsample_match_channel else ch
-                stage.upsample = UpSmpl(ch, nxt, tus=i < depth_temporal, refiner_vae=self.refiner_vae, op=conv_op)
-                ch = nxt
-            self.up.append(stage)
-
-        self.norm_out = norm_op(ch)
-        self.conv_out = conv_op(ch, out_channels, 3, stride=1, padding=1)
-
-    def forward(self, z):
-        x = conv_carry_causal_3d([z], self.conv_in) + z.repeat_interleave(self.block_out_channels[0] // self.z_channels, 1)
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(x)))
-
-        if self.refiner_vae:
-            x = torch.split(x, 2, dim=2)
-        else:
-            x = [ x ]
-        out = []
-
-        conv_carry_in = None
-
-        for i, x1 in enumerate(x):
-            conv_carry_out = []
-            if i == len(x) - 1:
-                conv_carry_out = None
-            for stage in self.up:
-                for blk in stage.block:
-                    x1 = blk(x1, conv_carry_in, conv_carry_out)
-                if hasattr(stage, 'upsample'):
-                    x1 = stage.upsample(x1, conv_carry_in, conv_carry_out)
-
-            x1 = [ F.silu(self.norm_out(x1)) ]
-            x1 = conv_carry_causal_3d(x1, self.conv_out, conv_carry_in, conv_carry_out)
-            out.append(x1)
-            conv_carry_in = conv_carry_out
-        del x
-
-        if len(out) > 1:
-            out = torch.cat(out, dim=2)
-        else:
-            out = out[0]
-
-        if not self.refiner_vae:
-            if z.shape[-3] == 1:
-                out = out[:, :, -1:]
-
-        return out
-
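Review note on the removed `Encoder`/`Decoder` above: both derive their stage layout from `(n >> 1).bit_length()`, which equals log2(n) when `n` is a power of two. The sketch below reproduces only that bookkeeping in plain Python. The function names and the sample factors (16 spatial, 4 temporal, 5 stages) are hypothetical, chosen for illustration and not taken from any model config.

```python
def encoder_stage_plan(ffactor_spatial, ffactor_temporal, num_stages):
    """Mirror the removed Encoder loop: which stages get a DnSmpl, and of what kind."""
    depth = (ffactor_spatial >> 1).bit_length()
    depth_temporal = ((ffactor_spatial // ffactor_temporal) >> 1).bit_length()
    plan = []
    for i in range(num_stages):
        if i >= depth:
            plan.append("none")              # no downsample attached to this stage
        elif i >= depth_temporal:            # tds=True in the removed code
            plan.append("spatial+temporal")
        else:
            plan.append("spatial")
    return plan


def decoder_stage_plan(ffactor_spatial, ffactor_temporal, num_stages):
    """Mirror the removed Decoder loop: temporal upsampling happens in the early stages."""
    depth = (ffactor_spatial >> 1).bit_length()
    depth_temporal = (ffactor_temporal >> 1).bit_length()
    plan = []
    for i in range(num_stages):
        if i >= depth:
            plan.append("none")              # no upsample attached to this stage
        elif i < depth_temporal:             # tus=True in the removed code
            plan.append("spatial+temporal")
        else:
            plan.append("spatial")
    return plan


if __name__ == "__main__":
    # Hypothetical factors, for illustration only.
    print(encoder_stage_plan(16, 4, 5))
    print(decoder_stage_plan(16, 4, 5))
```

Under these assumed factors the encoder defers temporal downsampling to its later stages, while the decoder performs temporal upsampling first, so the two plans are mirror images of each other.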
--- a/comfy/ldm/lightricks/model.py
+++ b/comfy/ldm/lightricks/model.py
@@ -1,13 +1,13 @@
 import torch
 from torch import nn
-import comfy.patcher_extension
 import comfy.ldm.modules.attention
 import comfy.ldm.common_dit
+from einops import rearrange
 import math
 from typing import Dict, Optional, Tuple

 from .symmetric_patchifier import SymmetricPatchifier, latent_to_pixel_coords
-from comfy.ldm.flux.math import apply_rope1
+

 def get_timestep_embedding(
    timesteps: torch.Tensor,
@@ -237,6 +237,20 @@ class FeedForward(nn.Module):
        return self.net(x)


+def apply_rotary_emb(input_tensor, freqs_cis): #TODO: remove duplicate funcs and pick the best/fastest one
+    cos_freqs = freqs_cis[0]
+    sin_freqs = freqs_cis[1]
+
+    t_dup = rearrange(input_tensor, "... (d r) -> ... d r", r=2)
+    t1, t2 = t_dup.unbind(dim=-1)
+    t_dup = torch.stack((-t2, t1), dim=-1)
+    input_tensor_rot = rearrange(t_dup, "... d r -> ... (d r)")
+
+    out = input_tensor * cos_freqs + input_tensor_rot * sin_freqs
+
+    return out
+
+
 class CrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0., attn_precision=None, dtype=None, device=None, operations=None):
        super().__init__()
@@ -256,7 +270,7 @@ class CrossAttention(nn.Module):

        self.to_out = nn.Sequential(operations.Linear(inner_dim, query_dim, dtype=dtype, device=device), nn.Dropout(dropout))

-    def forward(self, x, context=None, mask=None, pe=None, transformer_options={}):
+    def forward(self, x, context=None, mask=None, pe=None):
        q = self.to_q(x)
        context = x if context is None else context
        k = self.to_k(context)
@@ -266,13 +280,13 @@ class CrossAttention(nn.Module):
        k = self.k_norm(k)

        if pe is not None:
-            q = apply_rope1(q.unsqueeze(1), pe).squeeze(1)
-            k = apply_rope1(k.unsqueeze(1), pe).squeeze(1)
+            q = apply_rotary_emb(q, pe)
+            k = apply_rotary_emb(k, pe)

        if mask is None:
-            out = comfy.ldm.modules.attention.optimized_attention(q, k, v, self.heads, attn_precision=self.attn_precision, transformer_options=transformer_options)
+            out = comfy.ldm.modules.attention.optimized_attention(q, k, v, self.heads, attn_precision=self.attn_precision)
        else:
-            out = comfy.ldm.modules.attention.optimized_attention_masked(q, k, v, self.heads, mask, attn_precision=self.attn_precision, transformer_options=transformer_options)
+            out = comfy.ldm.modules.attention.optimized_attention_masked(q, k, v, self.heads, mask, attn_precision=self.attn_precision)
        return self.to_out(out)


@@ -288,20 +302,15 @@ class BasicTransformerBlock(nn.Module):

        self.scale_shift_table = nn.Parameter(torch.empty(6, dim, device=device, dtype=dtype))

-    def forward(self, x, context=None, attention_mask=None, timestep=None, pe=None, transformer_options={}):
+    def forward(self, x, context=None, attention_mask=None, timestep=None, pe=None):
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (self.scale_shift_table[None, None].to(device=x.device, dtype=x.dtype) + timestep.reshape(x.shape[0], timestep.shape[1], self.scale_shift_table.shape[0], -1)).unbind(dim=2)

-        attn1_input = comfy.ldm.common_dit.rms_norm(x)
-        attn1_input = torch.addcmul(attn1_input, attn1_input, scale_msa).add_(shift_msa)
-        attn1_input = self.attn1(attn1_input, pe=pe, transformer_options=transformer_options)
-        x.addcmul_(attn1_input, gate_msa)
-        del attn1_input
+        x += self.attn1(comfy.ldm.common_dit.rms_norm(x) * (1 + scale_msa) + shift_msa, pe=pe) * gate_msa

-        x += self.attn2(x, context=context, mask=attention_mask, transformer_options=transformer_options)
+        x += self.attn2(x, context=context, mask=attention_mask)

-        y = comfy.ldm.common_dit.rms_norm(x)
-        y = torch.addcmul(y, y, scale_mlp).add_(shift_mlp)
-        x.addcmul_(self.ff(y), gate_mlp)
+        y = comfy.ldm.common_dit.rms_norm(x) * (1 + scale_mlp) + shift_mlp
+        x += self.ff(y) * gate_mlp

        return x

@@ -317,35 +326,41 @@ def get_fractional_positions(indices_grid, max_pos):


 def precompute_freqs_cis(indices_grid, dim, out_dtype, theta=10000.0, max_pos=[20, 2048, 2048]):
-    dtype = torch.float32
-    device = indices_grid.device
+    dtype = torch.float32 #self.dtype

-    # Get fractional positions and compute frequency indices
    fractional_positions = get_fractional_positions(indices_grid, max_pos)
-    indices = theta ** torch.linspace(0, 1, dim // 6, device=device, dtype=dtype) * math.pi / 2

-    # Compute frequencies and apply cos/sin
-    freqs = (indices * (fractional_positions.unsqueeze(-1) * 2 - 1)).transpose(-1, -2).flatten(2)
-    cos_vals = freqs.cos().repeat_interleave(2, dim=-1)
-    sin_vals = freqs.sin().repeat_interleave(2, dim=-1)
+    start = 1
+    end = theta
+    device = fractional_positions.device

-    # Pad if dim is not divisible by 6
+    indices = theta ** (
+        torch.linspace(
+            math.log(start, theta),
+            math.log(end, theta),
+            dim // 6,
+            device=device,
+            dtype=dtype,
+        )
+    )
+    indices = indices.to(dtype=dtype)
+
+    indices = indices * math.pi / 2
+
+    freqs = (
+        (indices * (fractional_positions.unsqueeze(-1) * 2 - 1))
+        .transpose(-1, -2)
+        .flatten(2)
+    )
+
+    cos_freq = freqs.cos().repeat_interleave(2, dim=-1)
+    sin_freq = freqs.sin().repeat_interleave(2, dim=-1)
    if dim % 6 != 0:
-        padding_size = dim % 6
-        cos_vals = torch.cat([torch.ones_like(cos_vals[:, :, :padding_size]), cos_vals], dim=-1)
-        sin_vals = torch.cat([torch.zeros_like(sin_vals[:, :, :padding_size]), sin_vals], dim=-1)
-
-    # Reshape and extract one value per pair (since repeat_interleave duplicates each value)
-    cos_vals = cos_vals.reshape(*cos_vals.shape[:2], -1, 2)[..., 0].to(out_dtype)  # [B, N, dim//2]
-    sin_vals = sin_vals.reshape(*sin_vals.shape[:2], -1, 2)[..., 0].to(out_dtype)  # [B, N, dim//2]
-
-    # Build rotation matrix [[cos, -sin], [sin, cos]] and add heads dimension
-    freqs_cis = torch.stack([
-        torch.stack([cos_vals, -sin_vals], dim=-1),
-        torch.stack([sin_vals, cos_vals], dim=-1)
-    ], dim=-2).unsqueeze(1)  # [B, 1, N, dim//2, 2, 2]
-
-    return freqs_cis
+        cos_padding = torch.ones_like(cos_freq[:, :, : dim % 6])
+        sin_padding = torch.zeros_like(cos_freq[:, :, : dim % 6])
+        cos_freq = torch.cat([cos_padding, cos_freq], dim=-1)
+        sin_freq = torch.cat([sin_padding, sin_freq], dim=-1)
+    return cos_freq.to(out_dtype), sin_freq.to(out_dtype)


 class LTXVModel(torch.nn.Module):
@@ -405,13 +420,6 @@ class LTXVModel(torch.nn.Module):
        self.patchifier = SymmetricPatchifier(1)

    def forward(self, x, timestep, context, attention_mask, frame_rate=25, transformer_options={}, keyframe_idxs=None, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, attention_mask, frame_rate, transformer_options, keyframe_idxs, **kwargs)
-
-    def _forward(self, x, timestep, context, attention_mask, frame_rate=25, transformer_options={}, keyframe_idxs=None, **kwargs):
        patches_replace = transformer_options.get("patches_replace", {})

        orig_shape = list(x.shape)
@@ -463,10 +471,10 @@ class LTXVModel(torch.nn.Module):
            if ("double_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
-                    out["img"] = block(args["img"], context=args["txt"], attention_mask=args["attention_mask"], timestep=args["vec"], pe=args["pe"], transformer_options=args["transformer_options"])
+                    out["img"] = block(args["img"], context=args["txt"], attention_mask=args["attention_mask"], timestep=args["vec"], pe=args["pe"])
                    return out

-                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "attention_mask": attention_mask, "vec": timestep, "pe": pe, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "attention_mask": attention_mask, "vec": timestep, "pe": pe}, {"original_block": block_wrap})
                x = out["img"]
            else:
                x = block(
@@ -474,8 +482,7 @@ class LTXVModel(torch.nn.Module):
                    context=context,
                    attention_mask=attention_mask,
                    timestep=timestep,
-                    pe=pe,
-                    transformer_options=transformer_options,
+                    pe=pe
                )

        # 3. Output
@@ -485,7 +492,7 @@ class LTXVModel(torch.nn.Module):
        shift, scale = scale_shift_values[:, :, 0], scale_shift_values[:, :, 1]
        x = self.norm_out(x)
        # Modulation
-        x = torch.addcmul(x, x, scale).add_(shift)
+        x = x * (1 + scale) + shift
        x = self.proj_out(x)

        x = self.patchifier.unpatchify(
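Review note on the RoPE change in `comfy/ldm/lightricks/model.py` above: the added `apply_rotary_emb` computes `x * cos + rot(x) * sin`, where `rot` maps each adjacent pair `(a, b)` to `(-b, a)`, while the removed path built a `[[cos, -sin], [sin, cos]]` matrix per pair. As far as we can read the two versions, they describe the same rotation. The sketch below checks that identity in plain Python on a short vector. The helper names and sample values are hypothetical, and it does not exercise the actual tensor shapes or dtypes.

```python
import math


def rotate_cos_sin(x, cos, sin):
    """x * cos + rot(x) * sin, with rot((a, b)) = (-b, a) on adjacent pairs."""
    rot = []
    for i in range(0, len(x), 2):
        a, b = x[i], x[i + 1]
        rot.extend([-b, a])
    return [xi * ci + ri * si for xi, ci, ri, si in zip(x, cos, rot, sin)]


def rotate_matrix(x, angles):
    """Apply the 2x2 rotation [[cos, -sin], [sin, cos]] to each adjacent pair."""
    out = []
    for k, theta in enumerate(angles):
        a, b = x[2 * k], x[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([c * a - s * b, s * a + c * b])
    return out


def repeat_interleave2(values):
    """Duplicate each entry, like tensor.repeat_interleave(2, dim=-1)."""
    return [v for v in values for _ in range(2)]


# Hypothetical sample data: two pairs, two rotation angles.
angles = [0.3, 1.2]
x = [1.0, 2.0, -0.5, 4.0]
cos = repeat_interleave2([math.cos(t) for t in angles])
sin = repeat_interleave2([math.sin(t) for t in angles])

via_cos_sin = rotate_cos_sin(x, cos, sin)
via_matrix = rotate_matrix(x, angles)
```

Both forms should agree to floating-point tolerance, and each pair keeps its Euclidean norm because the operation is a pure rotation.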
--- a/comfy/ldm/lumina/model.py
+++ b/comfy/ldm/lumina/model.py
@@ -11,8 +11,6 @@ import comfy.ldm.common_dit
 from comfy.ldm.modules.diffusionmodules.mmdit import TimestepEmbedder
 from comfy.ldm.modules.attention import optimized_attention_masked
 from comfy.ldm.flux.layers import EmbedND
-from comfy.ldm.flux.math import apply_rope
-import comfy.patcher_extension


 def modulate(x, scale):
@@ -32,7 +30,6 @@ class JointAttention(nn.Module):
        n_heads: int,
        n_kv_heads: Optional[int],
        qk_norm: bool,
-        out_bias: bool = False,
        operation_settings={},
    ):
        """
@@ -61,7 +58,7 @@ class JointAttention(nn.Module):
        self.out = operation_settings.get("operations").Linear(
            n_heads * self.head_dim,
            dim,
-            bias=out_bias,
+            bias=False,
            device=operation_settings.get("device"),
            dtype=operation_settings.get("dtype"),
        )
@@ -72,12 +69,40 @@ class JointAttention(nn.Module):
        else:
            self.q_norm = self.k_norm = nn.Identity()

+    @staticmethod
+    def apply_rotary_emb(
+        x_in: torch.Tensor,
+        freqs_cis: torch.Tensor,
+    ) -> torch.Tensor:
+        """
+        Apply rotary embeddings to input tensors using the given frequency
+        tensor.
+
+        This function applies rotary embeddings to a single query or key
+        tensor 'x_in' using the provided frequency tensor 'freqs_cis'. The
+        last dimension of the input is viewed as pairs of values, and each
+        pair is combined with the matching entries of 'freqs_cis' by
+        broadcasting. The result is reshaped back to the shape of 'x_in'.
+
+        Args:
+            x_in (torch.Tensor): Query or Key tensor to apply rotary embeddings.
+            freqs_cis (torch.Tensor): Precomputed frequency tensor for complex
+                exponentials.
+
+        Returns:
+            torch.Tensor: Tensor with the same shape as 'x_in' with rotary
+                embeddings applied.
+        """
+
+        t_ = x_in.reshape(*x_in.shape[:-1], -1, 1, 2)
+        t_out = freqs_cis[..., 0] * t_[..., 0] + freqs_cis[..., 1] * t_[..., 1]
+        return t_out.reshape(*x_in.shape)
+
    def forward(
        self,
        x: torch.Tensor,
        x_mask: torch.Tensor,
        freqs_cis: torch.Tensor,
-        transformer_options={},
    ) -> torch.Tensor:
        """

@@ -107,13 +132,14 @@ class JointAttention(nn.Module):
        xq = self.q_norm(xq)
        xk = self.k_norm(xk)

-        xq, xk = apply_rope(xq, xk, freqs_cis)
+        xq = JointAttention.apply_rotary_emb(xq, freqs_cis=freqs_cis)
+        xk = JointAttention.apply_rotary_emb(xk, freqs_cis=freqs_cis)

        n_rep = self.n_local_heads // self.n_local_kv_heads
        if n_rep >= 1:
            xk = xk.unsqueeze(3).repeat(1, 1, 1, n_rep, 1).flatten(2, 3)
            xv = xv.unsqueeze(3).repeat(1, 1, 1, n_rep, 1).flatten(2, 3)
-        output = optimized_attention_masked(xq.movedim(1, 2), xk.movedim(1, 2), xv.movedim(1, 2), self.n_local_heads, x_mask, skip_reshape=True, transformer_options=transformer_options)
+        output = optimized_attention_masked(xq.movedim(1, 2), xk.movedim(1, 2), xv.movedim(1, 2), self.n_local_heads, x_mask, skip_reshape=True)

        return self.out(output)

@@ -187,8 +213,6 @@ class JointTransformerBlock(nn.Module):
        norm_eps: float,
        qk_norm: bool,
        modulation=True,
-        z_image_modulation=False,
-        attn_out_bias=False,
        operation_settings={},
    ) -> None:
        """
@@ -209,10 +233,10 @@ class JointTransformerBlock(nn.Module):
        super().__init__()
        self.dim = dim
        self.head_dim = dim // n_heads
-        self.attention = JointAttention(dim, n_heads, n_kv_heads, qk_norm, out_bias=attn_out_bias, operation_settings=operation_settings)
+        self.attention = JointAttention(dim, n_heads, n_kv_heads, qk_norm, operation_settings=operation_settings)
        self.feed_forward = FeedForward(
            dim=dim,
-            hidden_dim=dim,
+            hidden_dim=4 * dim,
            multiple_of=multiple_of,
            ffn_dim_multiplier=ffn_dim_multiplier,
            operation_settings=operation_settings,
@@ -226,27 +250,16 @@ class JointTransformerBlock(nn.Module):

        self.modulation = modulation
        if modulation:
-            if z_image_modulation:
-                self.adaLN_modulation = nn.Sequential(
-                    operation_settings.get("operations").Linear(
-                        min(dim, 256),
-                        4 * dim,
-                        bias=True,
-                        device=operation_settings.get("device"),
-                        dtype=operation_settings.get("dtype"),
-                    ),
-                )
-            else:
-                self.adaLN_modulation = nn.Sequential(
-                    nn.SiLU(),
-                    operation_settings.get("operations").Linear(
-                        min(dim, 1024),
-                        4 * dim,
-                        bias=True,
-                        device=operation_settings.get("device"),
-                        dtype=operation_settings.get("dtype"),
-                    ),
-                )
+            self.adaLN_modulation = nn.Sequential(
+                nn.SiLU(),
+                operation_settings.get("operations").Linear(
+                    min(dim, 1024),
+                    4 * dim,
+                    bias=True,
+                    device=operation_settings.get("device"),
+                    dtype=operation_settings.get("dtype"),
+                ),
+            )

    def forward(
        self,
@@ -254,7 +267,6 @@ class JointTransformerBlock(nn.Module):
        x_mask: torch.Tensor,
        freqs_cis: torch.Tensor,
        adaln_input: Optional[torch.Tensor]=None,
-        transformer_options={},
    ):
        """
        Perform a forward pass through the TransformerBlock.
@@ -277,7 +289,6 @@ class JointTransformerBlock(nn.Module):
                    modulate(self.attention_norm1(x), scale_msa),
                    x_mask,
                    freqs_cis,
-                    transformer_options=transformer_options,
                )
            )
            x = x + gate_mlp.unsqueeze(1).tanh() * self.ffn_norm2(
@@ -292,7 +303,6 @@ class JointTransformerBlock(nn.Module):
                    self.attention_norm1(x),
                    x_mask,
                    freqs_cis,
-                    transformer_options=transformer_options,
                )
            )
            x = x + self.ffn_norm2(
@@ -308,7 +318,7 @@ class FinalLayer(nn.Module):
    The final layer of NextDiT.
    """

-    def __init__(self, hidden_size, patch_size, out_channels, z_image_modulation=False, operation_settings={}):
+    def __init__(self, hidden_size, patch_size, out_channels, operation_settings={}):
        super().__init__()
        self.norm_final = operation_settings.get("operations").LayerNorm(
            hidden_size,
@@ -325,15 +335,10 @@ class FinalLayer(nn.Module):
            dtype=operation_settings.get("dtype"),
        )

-        if z_image_modulation:
-            min_mod = 256
-        else:
-            min_mod = 1024
-
        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            operation_settings.get("operations").Linear(
-                min(hidden_size, min_mod),
+                min(hidden_size, 1024),
                hidden_size,
                bias=True,
                device=operation_settings.get("device"),
@@ -363,16 +368,12 @@ class NextDiT(nn.Module):
        n_heads: int = 32,
        n_kv_heads: Optional[int] = None,
        multiple_of: int = 256,
-        ffn_dim_multiplier: float = 4.0,
+        ffn_dim_multiplier: Optional[float] = None,
        norm_eps: float = 1e-5,
        qk_norm: bool = False,
        cap_feat_dim: int = 5120,
        axes_dims: List[int] = (16, 56, 56),
        axes_lens: List[int] = (1, 512, 512),
-        rope_theta=10000.0,
-        z_image_modulation=False,
-        time_scale=1.0,
-        pad_tokens_multiple=None,
        image_model=None,
        device=None,
        dtype=None,
@@ -384,8 +385,6 @@ class NextDiT(nn.Module):
        self.in_channels = in_channels
        self.out_channels = in_channels
        self.patch_size = patch_size
-        self.time_scale = time_scale
-        self.pad_tokens_multiple = pad_tokens_multiple

        self.x_embedder = operation_settings.get("operations").Linear(
            in_features=patch_size * patch_size * in_channels,
@@ -407,7 +406,6 @@ class NextDiT(nn.Module):
                    norm_eps,
                    qk_norm,
                    modulation=True,
-                    z_image_modulation=z_image_modulation,
                    operation_settings=operation_settings,
                )
                for layer_id in range(n_refiner_layers)
@@ -431,7 +429,7 @@ class NextDiT(nn.Module):
            ]
        )

-        self.t_embedder = TimestepEmbedder(min(dim, 1024), output_size=256 if z_image_modulation else None, **operation_settings)
+        self.t_embedder = TimestepEmbedder(min(dim, 1024), **operation_settings)
        self.cap_embedder = nn.Sequential(
            operation_settings.get("operations").RMSNorm(cap_feat_dim, eps=norm_eps, elementwise_affine=True, device=operation_settings.get("device"), dtype=operation_settings.get("dtype")),
            operation_settings.get("operations").Linear(
@@ -454,24 +452,18 @@ class NextDiT(nn.Module):
                    ffn_dim_multiplier,
                    norm_eps,
                    qk_norm,
-                    z_image_modulation=z_image_modulation,
-                    attn_out_bias=False,
                    operation_settings=operation_settings,
                )
                for layer_id in range(n_layers)
            ]
        )
        self.norm_final = operation_settings.get("operations").RMSNorm(dim, eps=norm_eps, elementwise_affine=True, device=operation_settings.get("device"), dtype=operation_settings.get("dtype"))
-        self.final_layer = FinalLayer(dim, patch_size, self.out_channels, z_image_modulation=z_image_modulation, operation_settings=operation_settings)
-
-        if self.pad_tokens_multiple is not None:
-            self.x_pad_token = nn.Parameter(torch.empty((1, dim), device=device, dtype=dtype))
-            self.cap_pad_token = nn.Parameter(torch.empty((1, dim), device=device, dtype=dtype))
+        self.final_layer = FinalLayer(dim, patch_size, self.out_channels, operation_settings=operation_settings)

        assert (dim // n_heads) == sum(axes_dims)
        self.axes_dims = axes_dims
        self.axes_lens = axes_lens
-        self.rope_embedder = EmbedND(dim=dim // n_heads, theta=rope_theta, axes_dim=axes_dims)
+        self.rope_embedder = EmbedND(dim=dim // n_heads, theta=10000.0, axes_dim=axes_dims)
        self.dim = dim
        self.n_heads = n_heads

@@ -501,58 +493,105 @@ class NextDiT(nn.Module):
        return imgs

    def patchify_and_embed(
-        self, x: List[torch.Tensor] | torch.Tensor, cap_feats: torch.Tensor, cap_mask: torch.Tensor, t: torch.Tensor, num_tokens, transformer_options={}
+        self, x: List[torch.Tensor] | torch.Tensor, cap_feats: torch.Tensor, cap_mask: torch.Tensor, t: torch.Tensor, num_tokens
    ) -> Tuple[torch.Tensor, torch.Tensor, List[Tuple[int, int]], List[int], torch.Tensor]:
        bsz = len(x)
        pH = pW = self.patch_size
        device = x[0].device
+        dtype = x[0].dtype

-        if self.pad_tokens_multiple is not None:
-            pad_extra = (-cap_feats.shape[1]) % self.pad_tokens_multiple
-            cap_feats = torch.cat((cap_feats, self.cap_pad_token.to(device=cap_feats.device, dtype=cap_feats.dtype).unsqueeze(0).repeat(cap_feats.shape[0], pad_extra, 1)), dim=1)
+        if cap_mask is not None:
+            l_effective_cap_len = cap_mask.sum(dim=1).tolist()
+        else:
+            l_effective_cap_len = [num_tokens] * bsz

-        cap_pos_ids = torch.zeros(bsz, cap_feats.shape[1], 3, dtype=torch.float32, device=device)
-        cap_pos_ids[:, :, 0] = torch.arange(cap_feats.shape[1], dtype=torch.float32, device=device) + 1.0
+        if cap_mask is not None and not torch.is_floating_point(cap_mask):
+            cap_mask = (cap_mask - 1).to(dtype) * torch.finfo(dtype).max

-        B, C, H, W = x.shape
-        x = self.x_embedder(x.view(B, C, H // pH, pH, W // pW, pW).permute(0, 2, 4, 3, 5, 1).flatten(3).flatten(1, 2))
+        img_sizes = [(img.size(1), img.size(2)) for img in x]
+        l_effective_img_len = [(H // pH) * (W // pW) for (H, W) in img_sizes]

-        H_tokens, W_tokens = H // pH, W // pW
-        x_pos_ids = torch.zeros((bsz, x.shape[1], 3), dtype=torch.float32, device=device)
-        x_pos_ids[:, :, 0] = cap_feats.shape[1] + 1
-        x_pos_ids[:, :, 1] = torch.arange(H_tokens, dtype=torch.float32, device=device).view(-1, 1).repeat(1, W_tokens).flatten()
-        x_pos_ids[:, :, 2] = torch.arange(W_tokens, dtype=torch.float32, device=device).view(1, -1).repeat(H_tokens, 1).flatten()
+        max_seq_len = max(
+            (cap_len+img_len for cap_len, img_len in zip(l_effective_cap_len, l_effective_img_len))
+        )
+        max_cap_len = max(l_effective_cap_len)
+        max_img_len = max(l_effective_img_len)

-        if self.pad_tokens_multiple is not None:
-            pad_extra = (-x.shape[1]) % self.pad_tokens_multiple
-            x = torch.cat((x, self.x_pad_token.to(device=x.device, dtype=x.dtype).unsqueeze(0).repeat(x.shape[0], pad_extra, 1)), dim=1)
-            x_pos_ids = torch.nn.functional.pad(x_pos_ids, (0, 0, 0, pad_extra))
+        position_ids = torch.zeros(bsz, max_seq_len, 3, dtype=torch.int32, device=device)

-        freqs_cis = self.rope_embedder(torch.cat((cap_pos_ids, x_pos_ids), dim=1)).movedim(1, 2)
+        for i in range(bsz):
+            cap_len = l_effective_cap_len[i]
+            img_len = l_effective_img_len[i]
+            H, W = img_sizes[i]
+            H_tokens, W_tokens = H // pH, W // pW
+            assert H_tokens * W_tokens == img_len
+
+            position_ids[i, :cap_len, 0] = torch.arange(cap_len, dtype=torch.int32, device=device)
+            position_ids[i, cap_len:cap_len+img_len, 0] = cap_len
+            row_ids = torch.arange(H_tokens, dtype=torch.int32, device=device).view(-1, 1).repeat(1, W_tokens).flatten()
+            col_ids = torch.arange(W_tokens, dtype=torch.int32, device=device).view(1, -1).repeat(H_tokens, 1).flatten()
+            position_ids[i, cap_len:cap_len+img_len, 1] = row_ids
+            position_ids[i, cap_len:cap_len+img_len, 2] = col_ids
+
+        freqs_cis = self.rope_embedder(position_ids).movedim(1, 2).to(dtype)
+
+        # build freqs_cis for cap and image individually
+        cap_freqs_cis_shape = list(freqs_cis.shape)
+        # cap_freqs_cis_shape[1] = max_cap_len
+        cap_freqs_cis_shape[1] = cap_feats.shape[1]
+        cap_freqs_cis = torch.zeros(*cap_freqs_cis_shape, device=device, dtype=freqs_cis.dtype)
+
+        img_freqs_cis_shape = list(freqs_cis.shape)
+        img_freqs_cis_shape[1] = max_img_len
+        img_freqs_cis = torch.zeros(*img_freqs_cis_shape, device=device, dtype=freqs_cis.dtype)
+
+        for i in range(bsz):
+            cap_len = l_effective_cap_len[i]
+            img_len = l_effective_img_len[i]
+            cap_freqs_cis[i, :cap_len] = freqs_cis[i, :cap_len]
+            img_freqs_cis[i, :img_len] = freqs_cis[i, cap_len:cap_len+img_len]

        # refine context
        for layer in self.context_refiner:
-            cap_feats = layer(cap_feats, cap_mask, freqs_cis[:, :cap_pos_ids.shape[1]], transformer_options=transformer_options)
+            cap_feats = layer(cap_feats, cap_mask, cap_freqs_cis)

-        padded_img_mask = None
+        # refine image
+        flat_x = []
+        for i in range(bsz):
+            img = x[i]
+            C, H, W = img.size()
+            img = img.view(C, H // pH, pH, W // pW, pW).permute(1, 3, 2, 4, 0).flatten(2).flatten(0, 1)
+            flat_x.append(img)
+        x = flat_x
+        padded_img_embed = torch.zeros(bsz, max_img_len, x[0].shape[-1], device=device, dtype=x[0].dtype)
+        padded_img_mask = torch.zeros(bsz, max_img_len, dtype=dtype, device=device)
+        for i in range(bsz):
+            padded_img_embed[i, :l_effective_img_len[i]] = x[i]
+            padded_img_mask[i, l_effective_img_len[i]:] = -torch.finfo(dtype).max
+
+        padded_img_embed = self.x_embedder(padded_img_embed)
+        padded_img_mask = padded_img_mask.unsqueeze(1)
        for layer in self.noise_refiner:
-            x = layer(x, padded_img_mask, freqs_cis[:, cap_pos_ids.shape[1]:], t, transformer_options=transformer_options)
+            padded_img_embed = layer(padded_img_embed, padded_img_mask, img_freqs_cis, t)
+
+        if cap_mask is not None:
+            mask = torch.zeros(bsz, max_seq_len, dtype=dtype, device=device)
+            mask[:, :max_cap_len] = cap_mask[:, :max_cap_len]
+        else:
+            mask = None
+
+        padded_full_embed = torch.zeros(bsz, max_seq_len, self.dim, device=device, dtype=x[0].dtype)
+        for i in range(bsz):
+            cap_len = l_effective_cap_len[i]
+            img_len = l_effective_img_len[i]
+
+            padded_full_embed[i, :cap_len] = cap_feats[i, :cap_len]
+            padded_full_embed[i, cap_len:cap_len+img_len] = padded_img_embed[i, :img_len]

-        padded_full_embed = torch.cat((cap_feats, x), dim=1)
-        mask = None
-        img_sizes = [(H, W)] * bsz
-        l_effective_cap_len = [cap_feats.shape[1]] * bsz
        return padded_full_embed, mask, img_sizes, l_effective_cap_len, freqs_cis

-    def forward(self, x, timesteps, context, num_tokens, attention_mask=None, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, kwargs.get("transformer_options", {}))
-        ).execute(x, timesteps, context, num_tokens, attention_mask, **kwargs)
-
    # def forward(self, x, t, cap_feats, cap_mask):
-    def _forward(self, x, timesteps, context, num_tokens, attention_mask=None, **kwargs):
+    def forward(self, x, timesteps, context, num_tokens, attention_mask=None, **kwargs):
        t = 1.0 - timesteps
        cap_feats = context
        cap_mask = attention_mask
@@ -564,18 +603,17 @@ class NextDiT(nn.Module):
        y: (N,) tensor of text tokens/features
        """

-        t = self.t_embedder(t * self.time_scale, dtype=x.dtype)  # (N, D)
+        t = self.t_embedder(t, dtype=x.dtype)  # (N, D)
        adaln_input = t

        cap_feats = self.cap_embedder(cap_feats)  # (N, L, D)  # todo check if able to batchify w.o. redundant compute

-        transformer_options = kwargs.get("transformer_options", {})
        x_is_tensor = isinstance(x, torch.Tensor)
-        x, mask, img_size, cap_size, freqs_cis = self.patchify_and_embed(x, cap_feats, cap_mask, t, num_tokens, transformer_options=transformer_options)
+        x, mask, img_size, cap_size, freqs_cis = self.patchify_and_embed(x, cap_feats, cap_mask, t, num_tokens)
        freqs_cis = freqs_cis.to(x.device)

        for layer in self.layers:
-            x = layer(x, mask, freqs_cis, adaln_input, transformer_options=transformer_options)
+            x = layer(x, mask, freqs_cis, adaln_input)

        x = self.final_layer(x, adaln_input)
        x = self.unpatchify(x, img_size, cap_size, return_tensor=x_is_tensor)[:,:,:h,:w]
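
Review note on the `patchify_and_embed` hunks above (illustrative only, not part of the patch). The added code builds one 3-component position id per token: caption tokens count along axis 0 starting at 0, while image tokens pin axis 0 to `cap_len` and use row-major grid coordinates on axes 1 and 2. The removed code instead started caption ids at 1 and placed image tokens at `cap_feats.shape[1] + 1`. The sketch below reproduces the new layout for a single sample in plain Python. The helper names `build_position_ids` and `additive_mask` and the constant `big` are hypothetical; the patch itself works on `torch` tensors and uses `torch.finfo(dtype).max`.

```python
def build_position_ids(cap_len, h_tokens, w_tokens):
    """Return one (axis0, row, col) triple per token for a single sample."""
    # Caption tokens: axis 0 counts 0..cap_len-1, the two spatial axes stay 0.
    ids = [(i, 0, 0) for i in range(cap_len)]
    # Image tokens: axis 0 is pinned to cap_len; axes 1 and 2 are the
    # row-major (row, col) coordinates of the patch grid.
    for row in range(h_tokens):
        for col in range(w_tokens):
            ids.append((cap_len, row, col))
    return ids


def additive_mask(keep_flags, big=3.0e38):
    """Map 1 -> 0.0 (attend) and 0 -> -big (ignore).

    Mirrors `(cap_mask - 1) * finfo.max` from the hunk; `big` is an
    assumed stand-in for the dtype's largest finite value.
    """
    return [float(flag - 1) * big for flag in keep_flags]
```

For example, `build_position_ids(2, 2, 3)` yields two caption triples followed by six image triples that all share `2` on axis 0.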
--- a/comfy/ldm/mmaudio/vae/init.py
+++ b/comfy/ldm/mmaudio/vae/init.py
--- a/comfy/ldm/mmaudio/vae/activations.py
+++ b/comfy/ldm/mmaudio/vae/activations.py
@@ -1,120 +0,0 @@
-# Implementation adapted from https://github.com/EdwardDixon/snake under the MIT license.
-#   LICENSE is in incl_licenses directory.
-
-import torch
-from torch import nn, sin, pow
-from torch.nn import Parameter
-import comfy.model_management
-
-class Snake(nn.Module):
-    '''
-    Implementation of a sine-based periodic activation function
-    Shape:
-        - Input: (B, C, T)
-        - Output: (B, C, T), same shape as the input
-    Parameters:
-        - alpha - trainable parameter
-    References:
-        - This activation function is from this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
-        https://arxiv.org/abs/2006.08195
-    Examples:
-        >>> a1 = snake(256)
-        >>> x = torch.randn(256)
-        >>> x = a1(x)
-    '''
-    def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
-        '''
-        Initialization.
-        INPUT:
-            - in_features: shape of the input
-            - alpha: trainable parameter
-            alpha is initialized to 1 by default, higher values = higher-frequency.
-            alpha will be trained along with the rest of your model.
-        '''
-        super(Snake, self).__init__()
-        self.in_features = in_features
-
-        # initialize alpha
-        self.alpha_logscale = alpha_logscale
-        if self.alpha_logscale:
-            self.alpha = Parameter(torch.empty(in_features))
-        else:
-            self.alpha = Parameter(torch.empty(in_features))
-
-        self.alpha.requires_grad = alpha_trainable
-
-        self.no_div_by_zero = 0.000000001
-
-    def forward(self, x):
-        '''
-        Forward pass of the function.
-        Applies the function to the input elementwise.
-        Snake ∶= x + 1/a * sin^2 (xa)
-        '''
-        alpha = comfy.model_management.cast_to(self.alpha, dtype=x.dtype, device=x.device).unsqueeze(0).unsqueeze(-1) # line up with x to [B, C, T]
-        if self.alpha_logscale:
-            alpha = torch.exp(alpha)
-        x = x + (1.0 / (alpha + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
-
-        return x
-
-
-class SnakeBeta(nn.Module):
-    '''
-    A modified Snake function which uses separate parameters for the magnitude of the periodic components
-    Shape:
-        - Input: (B, C, T)
-        - Output: (B, C, T), same shape as the input
-    Parameters:
-        - alpha - trainable parameter that controls frequency
-        - beta - trainable parameter that controls magnitude
-    References:
-        - This activation function is a modified version based on this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
-        https://arxiv.org/abs/2006.08195
-    Examples:
-        >>> a1 = snakebeta(256)
-        >>> x = torch.randn(256)
-        >>> x = a1(x)
-    '''
-    def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
-        '''
-        Initialization.
-        INPUT:
-            - in_features: shape of the input
-            - alpha - trainable parameter that controls frequency
-            - beta - trainable parameter that controls magnitude
-            alpha is initialized to 1 by default, higher values = higher-frequency.
-            beta is initialized to 1 by default, higher values = higher-magnitude.
-            alpha will be trained along with the rest of your model.
-        '''
-        super(SnakeBeta, self).__init__()
-        self.in_features = in_features
-
-        # initialize alpha
-        self.alpha_logscale = alpha_logscale
-        if self.alpha_logscale:
-            self.alpha = Parameter(torch.empty(in_features))
-            self.beta = Parameter(torch.empty(in_features))
-        else:
-            self.alpha = Parameter(torch.empty(in_features))
-            self.beta = Parameter(torch.empty(in_features))
-
-        self.alpha.requires_grad = alpha_trainable
-        self.beta.requires_grad = alpha_trainable
-
-        self.no_div_by_zero = 0.000000001
-
-    def forward(self, x):
-        '''
-        Forward pass of the function.
-        Applies the function to the input elementwise.
-        SnakeBeta ∶= x + 1/b * sin^2 (xa)
-        '''
-        alpha = comfy.model_management.cast_to(self.alpha, dtype=x.dtype, device=x.device).unsqueeze(0).unsqueeze(-1) # line up with x to [B, C, T]
-        beta = comfy.model_management.cast_to(self.beta, dtype=x.dtype, device=x.device).unsqueeze(0).unsqueeze(-1)
-        if self.alpha_logscale:
-            alpha = torch.exp(alpha)
-            beta = torch.exp(beta)
-        x = x + (1.0 / (beta + self.no_div_by_zero)) * pow(sin(x * alpha), 2)
-
-        return x
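
Review note on the removed `activations.py` (illustrative only, not part of the patch). Both modules apply `x + 1/(p + eps) * sin^2(x * alpha)` elementwise, where `p` is `alpha` for `Snake` and `beta` for `SnakeBeta`, and both parameters go through `exp` first when `alpha_logscale` is set. A scalar sketch of that formula follows; the function names `snake` and `snake_beta` are hypothetical, and the removed modules operate on `(B, C, T)` tensors with per-channel parameters.

```python
import math

NO_DIV_BY_ZERO = 1e-9  # same guard value as in the removed modules


def snake(x, alpha=1.0, logscale=False):
    """Scalar Snake: x + 1/(a + eps) * sin^2(x * a)."""
    a = math.exp(alpha) if logscale else alpha
    return x + (1.0 / (a + NO_DIV_BY_ZERO)) * math.sin(x * a) ** 2


def snake_beta(x, alpha=1.0, beta=1.0, logscale=False):
    """Scalar SnakeBeta: alpha sets the frequency, beta the magnitude."""
    a = math.exp(alpha) if logscale else alpha
    b = math.exp(beta) if logscale else beta
    return x + (1.0 / (b + NO_DIV_BY_ZERO)) * math.sin(x * a) ** 2
```

With `logscale=True`, a stored value of 0 corresponds to an effective parameter of 1, so `snake_beta(x, 0.0, 0.0, logscale=True)` matches `snake(x)`.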
--- a/comfy/ldm/mmaudio/vae/alias_free_torch.py
+++ b/comfy/ldm/mmaudio/vae/alias_free_torch.py
@@ -1,157 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import math
-import comfy.model_management
-
-if 'sinc' in dir(torch):
-    sinc = torch.sinc
-else:
-    # This code is adopted from adefossez's julius.core.sinc under the MIT License
-    # https://adefossez.github.io/julius/julius/core.html
-    #   LICENSE is in incl_licenses directory.
-    def sinc(x: torch.Tensor):
-        """
-        Implementation of sinc, i.e. sin(pi * x) / (pi * x)
-        __Warning__: Different to julius.sinc, the input is multiplied by `pi`!
-        """
-        return torch.where(x == 0,
-                           torch.tensor(1., device=x.device, dtype=x.dtype),
-                           torch.sin(math.pi * x) / math.pi / x)
-
-
-# This code is adopted from adefossez's julius.lowpass.LowPassFilters under the MIT License
-# https://adefossez.github.io/julius/julius/lowpass.html
-#   LICENSE is in incl_licenses directory.
-def kaiser_sinc_filter1d(cutoff, half_width, kernel_size): # return filter [1,1,kernel_size]
-    even = (kernel_size % 2 == 0)
-    half_size = kernel_size // 2
-
-    #For kaiser window
-    delta_f = 4 * half_width
-    A = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
-    if A > 50.:
-        beta = 0.1102 * (A - 8.7)
-    elif A >= 21.:
-        beta = 0.5842 * (A - 21)**0.4 + 0.07886 * (A - 21.)
-    else:
-        beta = 0.
-    window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)
-
-    # ratio = 0.5/cutoff -> 2 * cutoff = 1 / ratio
-    if even:
-        time = (torch.arange(-half_size, half_size) + 0.5)
-    else:
-        time = torch.arange(kernel_size) - half_size
-    if cutoff == 0:
-        filter_ = torch.zeros_like(time)
-    else:
-        filter_ = 2 * cutoff * window * sinc(2 * cutoff * time)
-        # Normalize filter to have sum = 1, otherwise we will have a small leakage
-        # of the constant component in the input signal.
-        filter_ /= filter_.sum()
-        filter = filter_.view(1, 1, kernel_size)
-
-    return filter
-
-
-class LowPassFilter1d(nn.Module):
-    def __init__(self,
-                 cutoff=0.5,
-                 half_width=0.6,
-                 stride: int = 1,
-                 padding: bool = True,
-                 padding_mode: str = 'replicate',
-                 kernel_size: int = 12):
-        # kernel_size should be even number for stylegan3 setup,
-        # in this implementation, odd number is also possible.
-        super().__init__()
-        if cutoff < -0.:
-            raise ValueError("Minimum cutoff must be larger than zero.")
-        if cutoff > 0.5:
-            raise ValueError("A cutoff above 0.5 does not make sense.")
-        self.kernel_size = kernel_size
-        self.even = (kernel_size % 2 == 0)
-        self.pad_left = kernel_size // 2 - int(self.even)
-        self.pad_right = kernel_size // 2
-        self.stride = stride
-        self.padding = padding
-        self.padding_mode = padding_mode
-        filter = kaiser_sinc_filter1d(cutoff, half_width, kernel_size)
-        self.register_buffer("filter", filter)
-
-    #input [B, C, T]
-    def forward(self, x):
-        _, C, _ = x.shape
-
-        if self.padding:
-            x = F.pad(x, (self.pad_left, self.pad_right),
-                      mode=self.padding_mode)
-        out = F.conv1d(x, comfy.model_management.cast_to(self.filter.expand(C, -1, -1), dtype=x.dtype, device=x.device),
-                       stride=self.stride, groups=C)
-
-        return out
-
-
-class UpSample1d(nn.Module):
-    def __init__(self, ratio=2, kernel_size=None):
-        super().__init__()
-        self.ratio = ratio
-        self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
-        self.stride = ratio
-        self.pad = self.kernel_size // ratio - 1
-        self.pad_left = self.pad * self.stride + (self.kernel_size - self.stride) // 2
-        self.pad_right = self.pad * self.stride + (self.kernel_size - self.stride + 1) // 2
-        filter = kaiser_sinc_filter1d(cutoff=0.5 / ratio,
-                                      half_width=0.6 / ratio,
-                                      kernel_size=self.kernel_size)
-        self.register_buffer("filter", filter)
-
-    # x: [B, C, T]
-    def forward(self, x):
-        _, C, _ = x.shape
-
-        x = F.pad(x, (self.pad, self.pad), mode='replicate')
-        x = self.ratio * F.conv_transpose1d(
-            x, comfy.model_management.cast_to(self.filter.expand(C, -1, -1), dtype=x.dtype, device=x.device), stride=self.stride, groups=C)
-        x = x[..., self.pad_left:-self.pad_right]
-
-        return x
-
-
-class DownSample1d(nn.Module):
-    def __init__(self, ratio=2, kernel_size=None):
-        super().__init__()
-        self.ratio = ratio
-        self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
-        self.lowpass = LowPassFilter1d(cutoff=0.5 / ratio,
-                                       half_width=0.6 / ratio,
-                                       stride=ratio,
-                                       kernel_size=self.kernel_size)
-
-    def forward(self, x):
-        xx = self.lowpass(x)
-
-        return xx
-
-class Activation1d(nn.Module):
-    def __init__(self,
-                 activation,
-                 up_ratio: int = 2,
-                 down_ratio: int = 2,
-                 up_kernel_size: int = 12,
-                 down_kernel_size: int = 12):
-        super().__init__()
-        self.up_ratio = up_ratio
-        self.down_ratio = down_ratio
-        self.act = activation
-        self.upsample = UpSample1d(up_ratio, up_kernel_size)
-        self.downsample = DownSample1d(down_ratio, down_kernel_size)
-
-    # x: [B,C,T]
-    def forward(self, x):
-        x = self.upsample(x)
-        x = self.act(x)
-        x = self.downsample(x)
-
-        return x
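
Review note on the removed `alias_free_torch.py` (illustrative only, not part of the patch). `kaiser_sinc_filter1d` picks a Kaiser `beta` from the transition width, multiplies a sinc by the window, and normalizes the taps to sum to 1. As the file is shown above, the final `view` sits inside the `else` branch, so the `cutoff == 0` path appears to reach `return filter` without a bound value. The plain-Python sketch below follows the same steps and returns zeros in that case. The names `bessel_i0`, `kaiser_beta`, and `kaiser_sinc_filter` are hypothetical, the series-based Bessel function stands in for `torch.kaiser_window`, and `kernel_size >= 2` is assumed.

```python
import math


def bessel_i0(x, terms=30):
    """Modified Bessel function I0 via its power series sum(((x/2)^k / k!)^2)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / 2.0) / k
        total += term * term
    return total


def sinc(x):
    """Normalized sinc: sin(pi * x) / (pi * x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def kaiser_beta(half_width, kernel_size):
    """Kaiser beta selection, using the same constants as the removed code."""
    half_size = kernel_size // 2
    delta_f = 4 * half_width
    a = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
    if a > 50.0:
        return 0.1102 * (a - 8.7)
    if a >= 21.0:
        return 0.5842 * (a - 21.0) ** 0.4 + 0.07886 * (a - 21.0)
    return 0.0


def kaiser_sinc_filter(cutoff, half_width, kernel_size):
    """Return kernel_size low-pass taps that sum to 1 (zeros if cutoff == 0)."""
    if cutoff == 0:
        return [0.0] * kernel_size
    beta = kaiser_beta(half_width, kernel_size)
    half_size = kernel_size // 2
    n_max = kernel_size - 1  # assumes kernel_size >= 2
    # Symmetric (non-periodic) Kaiser window.
    window = [
        bessel_i0(beta * math.sqrt(max(0.0, 1.0 - (2.0 * n / n_max - 1.0) ** 2)))
        / bessel_i0(beta)
        for n in range(kernel_size)
    ]
    # Even kernels are centered between two samples, hence the 0.5 offset.
    offset = 0.5 if kernel_size % 2 == 0 else 0.0
    time = [n - half_size + offset for n in range(kernel_size)]
    taps = [2 * cutoff * w * sinc(2 * cutoff * t) for w, t in zip(window, time)]
    total = sum(taps)
    return [tap / total for tap in taps]
```

Calling `kaiser_sinc_filter(0.25, 0.3, 12)` corresponds to the defaults that `UpSample1d` and `DownSample1d` derive for `ratio=2`.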
--- a/comfy/ldm/mmaudio/vae/autoencoder.py
+++ b/comfy/ldm/mmaudio/vae/autoencoder.py
@@ -1,156 +0,0 @@
-from typing import Literal
-
-import torch
-import torch.nn as nn
-
-from .distributions import DiagonalGaussianDistribution
-from .vae import VAE_16k
-from .bigvgan import BigVGANVocoder
-import logging
-
-try:
-    import torchaudio
-except:
-    logging.warning("torchaudio missing, MMAudio VAE model will be broken")
-
-def dynamic_range_compression_torch(x, C=1, clip_val=1e-5, *, norm_fn):
-    return norm_fn(torch.clamp(x, min=clip_val) * C)
-
-
-def spectral_normalize_torch(magnitudes, norm_fn):
-    output = dynamic_range_compression_torch(magnitudes, norm_fn=norm_fn)
-    return output
-
-class MelConverter(nn.Module):
-
-    def __init__(
-        self,
-        *,
-        sampling_rate: float,
-        n_fft: int,
-        num_mels: int,
-        hop_size: int,
-        win_size: int,
-        fmin: float,
-        fmax: float,
-        norm_fn,
-    ):
-        super().__init__()
-        self.sampling_rate = sampling_rate
-        self.n_fft = n_fft
-        self.num_mels = num_mels
-        self.hop_size = hop_size
-        self.win_size = win_size
-        self.fmin = fmin
-        self.fmax = fmax
-        self.norm_fn = norm_fn
-
-        # mel = librosa_mel_fn(sr=self.sampling_rate,
-        #                      n_fft=self.n_fft,
-        #                      n_mels=self.num_mels,
-        #                      fmin=self.fmin,
-        #                      fmax=self.fmax)
-        # mel_basis = torch.from_numpy(mel).float()
-        mel_basis = torch.empty((num_mels, 1 + n_fft // 2))
-        hann_window = torch.hann_window(self.win_size)
-
-        self.register_buffer('mel_basis', mel_basis)
-        self.register_buffer('hann_window', hann_window)
-
-    @property
-    def device(self):
-        return self.mel_basis.device
-
-    def forward(self, waveform: torch.Tensor, center: bool = False) -> torch.Tensor:
-        waveform = waveform.clamp(min=-1., max=1.).to(self.device)
-
-        waveform = torch.nn.functional.pad(
-            waveform.unsqueeze(1),
-            [int((self.n_fft - self.hop_size) / 2),
-             int((self.n_fft - self.hop_size) / 2)],
-            mode='reflect')
-        waveform = waveform.squeeze(1)
-
-        spec = torch.stft(waveform,
-                          self.n_fft,
-                          hop_length=self.hop_size,
-                          win_length=self.win_size,
-                          window=self.hann_window,
-                          center=center,
-                          pad_mode='reflect',
-                          normalized=False,
-                          onesided=True,
-                          return_complex=True)
-
-        spec = torch.view_as_real(spec)
-        spec = torch.sqrt(spec.pow(2).sum(-1) + (1e-9))
-        spec = torch.matmul(self.mel_basis, spec)
-        spec = spectral_normalize_torch(spec, self.norm_fn)
-
-        return spec
-
-class AudioAutoencoder(nn.Module):
-
-    def __init__(
-        self,
-        *,
-        # ckpt_path: str,
-        mode=Literal['16k', '44k'],
-        need_vae_encoder: bool = True,
-    ):
-        super().__init__()
-
-        assert mode == "16k", "Only 16k mode is supported currently."
-        self.mel_converter = MelConverter(sampling_rate=16_000,
-                            n_fft=1024,
-                            num_mels=80,
-                            hop_size=256,
-                            win_size=1024,
-                            fmin=0,
-                            fmax=8_000,
-                            norm_fn=torch.log10)
-
-        self.vae = VAE_16k().eval()
-
-        bigvgan_config = {
-            "resblock": "1",
-            "num_mels": 80,
-            "upsample_rates": [4, 4, 2, 2, 2, 2],
-            "upsample_kernel_sizes": [8, 8, 4, 4, 4, 4],
-            "upsample_initial_channel": 1536,
-            "resblock_kernel_sizes": [3, 7, 11],
-            "resblock_dilation_sizes": [
-                [1, 3, 5],
-                [1, 3, 5],
-                [1, 3, 5],
-            ],
-            "activation": "snakebeta",
-            "snake_logscale": True,
-        }
-
-        self.vocoder = BigVGANVocoder(
-            bigvgan_config
-        ).eval()
-
-    @torch.inference_mode()
-    def encode_audio(self, x) -> DiagonalGaussianDistribution:
-        # x: (B * L)
-        mel = self.mel_converter(x)
-        dist = self.vae.encode(mel)
-
-        return dist
-
-    @torch.no_grad()
-    def decode(self, z):
-        mel_decoded = self.vae.decode(z)
-        audio = self.vocoder(mel_decoded)
-
-        audio = torchaudio.functional.resample(audio, 16000, 44100)
-        return audio
-
-    @torch.no_grad()
-    def encode(self, audio):
-        audio = audio.mean(dim=1)
-        audio = torchaudio.functional.resample(audio, 44100, 16000)
-        dist = self.encode_audio(audio)
-        return dist.mean
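
Review note on the removed `autoencoder.py` (illustrative only, not part of the patch). `MelConverter.forward` reflect-pads the waveform by `(n_fft - hop_size) / 2` on each side, runs an STFT with `center=False`, and then applies `norm_fn(clamp(x, min=clip_val) * C)`. The sketch below works through the resulting frame count and the compression step with plain lists. The function names `num_frames` and `dynamic_range_compression` are hypothetical, and the frame count assumes the usual `1 + (length - n_fft) // hop` rule for an uncentered STFT.

```python
import math


def num_frames(num_samples, n_fft=1024, hop_size=256):
    """Number of STFT frames after the symmetric reflect padding."""
    pad = (n_fft - hop_size) // 2        # padding added on each side
    padded = num_samples + 2 * pad
    return 1 + (padded - n_fft) // hop_size  # uncentered STFT frame count


def dynamic_range_compression(values, C=1.0, clip_val=1e-5, norm_fn=math.log10):
    """Clamp from below, scale by C, then apply norm_fn (log10 in the 16k config)."""
    return [norm_fn(max(v, clip_val) * C) for v in values]
```

With the 16k settings, this padding makes the frame count equal to `num_samples // hop_size` whenever the input is at least one hop long.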
--- a/comfy/ldm/mmaudio/vae/bigvgan.py
+++ b/comfy/ldm/mmaudio/vae/bigvgan.py
@@ -1,219 +0,0 @@
-# Copyright (c) 2022 NVIDIA CORPORATION.
-#   Licensed under the MIT license.
-
-# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
-#   LICENSE is in incl_licenses directory.
-
-import torch
-import torch.nn as nn
-from types import SimpleNamespace
-from . import activations
-from .alias_free_torch import Activation1d
-import comfy.ops
-ops = comfy.ops.disable_weight_init
-
-def get_padding(kernel_size, dilation=1):
-    return int((kernel_size * dilation - dilation) / 2)
-
-class AMPBlock1(torch.nn.Module):
-
-    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3, 5), activation=None):
-        super(AMPBlock1, self).__init__()
-        self.h = h
-
-        self.convs1 = nn.ModuleList([
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=dilation[0],
-                       padding=get_padding(kernel_size, dilation[0])),
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=dilation[1],
-                       padding=get_padding(kernel_size, dilation[1])),
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=dilation[2],
-                       padding=get_padding(kernel_size, dilation[2]))
-        ])
-
-        self.convs2 = nn.ModuleList([
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=1,
-                       padding=get_padding(kernel_size, 1)),
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=1,
-                       padding=get_padding(kernel_size, 1)),
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=1,
-                       padding=get_padding(kernel_size, 1))
-        ])
-
-        self.num_layers = len(self.convs1) + len(self.convs2)  # total number of conv layers
-
-        if activation == 'snake':  # periodic nonlinearity with snake function and anti-aliasing
-            self.activations = nn.ModuleList([
-                Activation1d(
-                    activation=activations.Snake(channels, alpha_logscale=h.snake_logscale))
-                for _ in range(self.num_layers)
-            ])
-        elif activation == 'snakebeta':  # periodic nonlinearity with snakebeta function and anti-aliasing
-            self.activations = nn.ModuleList([
-                Activation1d(
-                    activation=activations.SnakeBeta(channels, alpha_logscale=h.snake_logscale))
-                for _ in range(self.num_layers)
-            ])
-        else:
-            raise NotImplementedError(
-                "activation incorrectly specified. check the config file and look for 'activation'."
-            )
-
-    def forward(self, x):
-        acts1, acts2 = self.activations[::2], self.activations[1::2]
-        for c1, c2, a1, a2 in zip(self.convs1, self.convs2, acts1, acts2):
-            xt = a1(x)
-            xt = c1(xt)
-            xt = a2(xt)
-            xt = c2(xt)
-            x = xt + x
-
-        return x
-
-
-class AMPBlock2(torch.nn.Module):
-
-    def __init__(self, h, channels, kernel_size=3, dilation=(1, 3), activation=None):
-        super(AMPBlock2, self).__init__()
-        self.h = h
-
-        self.convs = nn.ModuleList([
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=dilation[0],
-                       padding=get_padding(kernel_size, dilation[0])),
-                ops.Conv1d(channels,
-                       channels,
-                       kernel_size,
-                       1,
-                       dilation=dilation[1],
-                       padding=get_padding(kernel_size, dilation[1]))
-        ])
-
-        self.num_layers = len(self.convs)  # total number of conv layers
-
-        if activation == 'snake':  # periodic nonlinearity with snake function and anti-aliasing
-            self.activations = nn.ModuleList([
-                Activation1d(
-                    activation=activations.Snake(channels, alpha_logscale=h.snake_logscale))
-                for _ in range(self.num_layers)
-            ])
-        elif activation == 'snakebeta':  # periodic nonlinearity with snakebeta function and anti-aliasing
-            self.activations = nn.ModuleList([
-                Activation1d(
-                    activation=activations.SnakeBeta(channels, alpha_logscale=h.snake_logscale))
-                for _ in range(self.num_layers)
-            ])
-        else:
-            raise NotImplementedError(
-                "activation incorrectly specified. check the config file and look for 'activation'."
-            )
-
-    def forward(self, x):
-        for c, a in zip(self.convs, self.activations):
-            xt = a(x)
-            xt = c(xt)
-            x = xt + x
-
-        return x
-
-
-class BigVGANVocoder(torch.nn.Module):
-    # this is our main BigVGAN model. Applies anti-aliased periodic activation for resblocks.
-    def __init__(self, h):
-        super().__init__()
-        if isinstance(h, dict):
-            h = SimpleNamespace(**h)
-        self.h = h
-
-        self.num_kernels = len(h.resblock_kernel_sizes)
-        self.num_upsamples = len(h.upsample_rates)
-
-        # pre conv
-        self.conv_pre = ops.Conv1d(h.num_mels, h.upsample_initial_channel, 7, 1, padding=3)
-
-        # define which AMPBlock to use. BigVGAN uses AMPBlock1 as default
-        resblock = AMPBlock1 if h.resblock == '1' else AMPBlock2
-
-        # transposed conv-based upsamplers. does not apply anti-aliasing
-        self.ups = nn.ModuleList()
-        for i, (u, k) in enumerate(zip(h.upsample_rates, h.upsample_kernel_sizes)):
-            self.ups.append(
-                nn.ModuleList([
-                        ops.ConvTranspose1d(h.upsample_initial_channel // (2**i),
-                                        h.upsample_initial_channel // (2**(i + 1)),
-                                        k,
-                                        u,
-                                        padding=(k - u) // 2)
-                ]))
-
-        # residual blocks using anti-aliased multi-periodicity composition modules (AMP)
-        self.resblocks = nn.ModuleList()
-        for i in range(len(self.ups)):
-            ch = h.upsample_initial_channel // (2**(i + 1))
-            for j, (k, d) in enumerate(zip(h.resblock_kernel_sizes, h.resblock_dilation_sizes)):
-                self.resblocks.append(resblock(h, ch, k, d, activation=h.activation))
-
-        # post conv
-        if h.activation == "snake":  # periodic nonlinearity with snake function and anti-aliasing
-            activation_post = activations.Snake(ch, alpha_logscale=h.snake_logscale)
-            self.activation_post = Activation1d(activation=activation_post)
-        elif h.activation == "snakebeta":  # periodic nonlinearity with snakebeta function and anti-aliasing
-            activation_post = activations.SnakeBeta(ch, alpha_logscale=h.snake_logscale)
-            self.activation_post = Activation1d(activation=activation_post)
-        else:
-            raise NotImplementedError(
-                "activation incorrectly specified. check the config file and look for 'activation'."
-            )
-
-        self.conv_post = ops.Conv1d(ch, 1, 7, 1, padding=3)
-
-
-    def forward(self, x):
-        # pre conv
-        x = self.conv_pre(x)
-
-        for i in range(self.num_upsamples):
-            # upsampling
-            for i_up in range(len(self.ups[i])):
-                x = self.ups[i][i_up](x)
-            # AMP blocks
-            xs = None
-            for j in range(self.num_kernels):
-                if xs is None:
-                    xs = self.resblocks[i * self.num_kernels + j](x)
-                else:
-                    xs += self.resblocks[i * self.num_kernels + j](x)
-            x = xs / self.num_kernels
-
-        # post conv
-        x = self.activation_post(x)
-        x = self.conv_post(x)
-        x = torch.tanh(x)
-
-        return x
--- a/comfy/ldm/mmaudio/vae/distributions.py
+++ b/comfy/ldm/mmaudio/vae/distributions.py
@@ -1,92 +0,0 @@
-import torch
-import numpy as np
-
-
-class AbstractDistribution:
-    def sample(self):
-        raise NotImplementedError()
-
-    def mode(self):
-        raise NotImplementedError()
-
-
-class DiracDistribution(AbstractDistribution):
-    def __init__(self, value):
-        self.value = value
-
-    def sample(self):
-        return self.value
-
-    def mode(self):
-        return self.value
-
-
-class DiagonalGaussianDistribution(object):
-    def __init__(self, parameters, deterministic=False):
-        self.parameters = parameters
-        self.mean, self.logvar = torch.chunk(parameters, 2, dim=1)
-        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
-        self.deterministic = deterministic
-        self.std = torch.exp(0.5 * self.logvar)
-        self.var = torch.exp(self.logvar)
-        if self.deterministic:
-            self.var = self.std = torch.zeros_like(self.mean, device=self.parameters.device)
-
-    def sample(self):
-        x = self.mean + self.std * torch.randn(self.mean.shape, device=self.parameters.device)
-        return x
-
-    def kl(self, other=None):
-        if self.deterministic:
-            return torch.Tensor([0.])
-        else:
-            if other is None:
-                return 0.5 * torch.sum(torch.pow(self.mean, 2)
-                                       + self.var - 1.0 - self.logvar,
-                                       dim=[1, 2, 3])
-            else:
-                return 0.5 * torch.sum(
-                    torch.pow(self.mean - other.mean, 2) / other.var
-                    + self.var / other.var - 1.0 - self.logvar + other.logvar,
-                    dim=[1, 2, 3])
-
-    def nll(self, sample, dims=[1,2,3]):
-        if self.deterministic:
-            return torch.Tensor([0.])
-        logtwopi = np.log(2.0 * np.pi)
-        return 0.5 * torch.sum(
-            logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var,
-            dim=dims)
-
-    def mode(self):
-        return self.mean
-
-
-def normal_kl(mean1, logvar1, mean2, logvar2):
-    """
-    source: https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/losses.py#L12
-    Compute the KL divergence between two gaussians.
-    Shapes are automatically broadcasted, so batches can be compared to
-    scalars, among other use cases.
-    """
-    tensor = None
-    for obj in (mean1, logvar1, mean2, logvar2):
-        if isinstance(obj, torch.Tensor):
-            tensor = obj
-            break
-    assert tensor is not None, "at least one argument must be a Tensor"
-
-    # Force variances to be Tensors. Broadcasting helps convert scalars to
-    # Tensors, but it does not work for torch.exp().
-    logvar1, logvar2 = [
-        x if isinstance(x, torch.Tensor) else torch.tensor(x).to(tensor)
-        for x in (logvar1, logvar2)
-    ]
-
-    return 0.5 * (
-        -1.0
-        + logvar2
-        - logvar1
-        + torch.exp(logvar1 - logvar2)
-        + ((mean1 - mean2) ** 2) * torch.exp(-logvar2)
-    )
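Review note on the removed `distributions.py`: the `normal_kl` helper and the `other is None` branch of `DiagonalGaussianDistribution.kl` compute the same quantity when the second Gaussian is the standard normal. The scalar sketch below is illustrative only. It uses plain `math` instead of `torch`, and the names `gaussian_kl` and `kl_to_standard_normal` are hypothetical, not part of the removed module.

```python
import math


def gaussian_kl(mean1, logvar1, mean2, logvar2):
    """KL(N(mean1, exp(logvar1)) || N(mean2, exp(logvar2))) for scalar inputs.

    Mirrors the closed-form expression returned by the removed `normal_kl`.
    """
    return 0.5 * (
        -1.0
        + logvar2
        - logvar1
        + math.exp(logvar1 - logvar2)
        + (mean1 - mean2) ** 2 * math.exp(-logvar2)
    )


def kl_to_standard_normal(mean, logvar):
    """Per-element term of the `other is None` branch of `kl()`."""
    return 0.5 * (mean ** 2 + math.exp(logvar) - 1.0 - logvar)


if __name__ == "__main__":
    # With mean2 = 0 and logvar2 = 0 the two expressions coincide.
    print(gaussian_kl(0.7, -0.3, 0.0, 0.0))
    print(kl_to_standard_normal(0.7, -0.3))
```

Setting `mean2 = 0` and `logvar2 = 0` in the general formula drops the `logvar2` term and turns both exponentials into the terms of the special case, which is why the two functions agree.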
--- a/comfy/ldm/mmaudio/vae/vae.py
+++ b/comfy/ldm/mmaudio/vae/vae.py
@@ -1,358 +0,0 @@
-import logging
-from typing import Optional
-
-import torch
-import torch.nn as nn
-
-from .vae_modules import (AttnBlock1D, Downsample1D, ResnetBlock1D,
-                                                 Upsample1D, nonlinearity)
-from .distributions import DiagonalGaussianDistribution
-
-import comfy.ops
-ops = comfy.ops.disable_weight_init
-
-log = logging.getLogger()
-
-DATA_MEAN_80D = [
-    -1.6058, -1.3676, -1.2520, -1.2453, -1.2078, -1.2224, -1.2419, -1.2439, -1.2922, -1.2927,
-    -1.3170, -1.3543, -1.3401, -1.3836, -1.3907, -1.3912, -1.4313, -1.4152, -1.4527, -1.4728,
-    -1.4568, -1.5101, -1.5051, -1.5172, -1.5623, -1.5373, -1.5746, -1.5687, -1.6032, -1.6131,
-    -1.6081, -1.6331, -1.6489, -1.6489, -1.6700, -1.6738, -1.6953, -1.6969, -1.7048, -1.7280,
-    -1.7361, -1.7495, -1.7658, -1.7814, -1.7889, -1.8064, -1.8221, -1.8377, -1.8417, -1.8643,
-    -1.8857, -1.8929, -1.9173, -1.9379, -1.9531, -1.9673, -1.9824, -2.0042, -2.0215, -2.0436,
-    -2.0766, -2.1064, -2.1418, -2.1855, -2.2319, -2.2767, -2.3161, -2.3572, -2.3954, -2.4282,
-    -2.4659, -2.5072, -2.5552, -2.6074, -2.6584, -2.7107, -2.7634, -2.8266, -2.8981, -2.9673
-]
-
-DATA_STD_80D = [
-    1.0291, 1.0411, 1.0043, 0.9820, 0.9677, 0.9543, 0.9450, 0.9392, 0.9343, 0.9297, 0.9276, 0.9263,
-    0.9242, 0.9254, 0.9232, 0.9281, 0.9263, 0.9315, 0.9274, 0.9247, 0.9277, 0.9199, 0.9188, 0.9194,
-    0.9160, 0.9161, 0.9146, 0.9161, 0.9100, 0.9095, 0.9145, 0.9076, 0.9066, 0.9095, 0.9032, 0.9043,
-    0.9038, 0.9011, 0.9019, 0.9010, 0.8984, 0.8983, 0.8986, 0.8961, 0.8962, 0.8978, 0.8962, 0.8973,
-    0.8993, 0.8976, 0.8995, 0.9016, 0.8982, 0.8972, 0.8974, 0.8949, 0.8940, 0.8947, 0.8936, 0.8939,
-    0.8951, 0.8956, 0.9017, 0.9167, 0.9436, 0.9690, 1.0003, 1.0225, 1.0381, 1.0491, 1.0545, 1.0604,
-    1.0761, 1.0929, 1.1089, 1.1196, 1.1176, 1.1156, 1.1117, 1.1070
-]
-
-DATA_MEAN_128D = [
-    -3.3462, -2.6723, -2.4893, -2.3143, -2.2664, -2.3317, -2.1802, -2.4006, -2.2357, -2.4597,
-    -2.3717, -2.4690, -2.5142, -2.4919, -2.6610, -2.5047, -2.7483, -2.5926, -2.7462, -2.7033,
-    -2.7386, -2.8112, -2.7502, -2.9594, -2.7473, -3.0035, -2.8891, -2.9922, -2.9856, -3.0157,
-    -3.1191, -2.9893, -3.1718, -3.0745, -3.1879, -3.2310, -3.1424, -3.2296, -3.2791, -3.2782,
-    -3.2756, -3.3134, -3.3509, -3.3750, -3.3951, -3.3698, -3.4505, -3.4509, -3.5089, -3.4647,
-    -3.5536, -3.5788, -3.5867, -3.6036, -3.6400, -3.6747, -3.7072, -3.7279, -3.7283, -3.7795,
-    -3.8259, -3.8447, -3.8663, -3.9182, -3.9605, -3.9861, -4.0105, -4.0373, -4.0762, -4.1121,
-    -4.1488, -4.1874, -4.2461, -4.3170, -4.3639, -4.4452, -4.5282, -4.6297, -4.7019, -4.7960,
-    -4.8700, -4.9507, -5.0303, -5.0866, -5.1634, -5.2342, -5.3242, -5.4053, -5.4927, -5.5712,
-    -5.6464, -5.7052, -5.7619, -5.8410, -5.9188, -6.0103, -6.0955, -6.1673, -6.2362, -6.3120,
-    -6.3926, -6.4797, -6.5565, -6.6511, -6.8130, -6.9961, -7.1275, -7.2457, -7.3576, -7.4663,
-    -7.6136, -7.7469, -7.8815, -8.0132, -8.1515, -8.3071, -8.4722, -8.7418, -9.3975, -9.6628,
-    -9.7671, -9.8863, -9.9992, -10.0860, -10.1709, -10.5418, -11.2795, -11.3861
-]
-
-DATA_STD_128D = [
-    2.3804, 2.4368, 2.3772, 2.3145, 2.2803, 2.2510, 2.2316, 2.2083, 2.1996, 2.1835, 2.1769, 2.1659,
-    2.1631, 2.1618, 2.1540, 2.1606, 2.1571, 2.1567, 2.1612, 2.1579, 2.1679, 2.1683, 2.1634, 2.1557,
-    2.1668, 2.1518, 2.1415, 2.1449, 2.1406, 2.1350, 2.1313, 2.1415, 2.1281, 2.1352, 2.1219, 2.1182,
-    2.1327, 2.1195, 2.1137, 2.1080, 2.1179, 2.1036, 2.1087, 2.1036, 2.1015, 2.1068, 2.0975, 2.0991,
-    2.0902, 2.1015, 2.0857, 2.0920, 2.0893, 2.0897, 2.0910, 2.0881, 2.0925, 2.0873, 2.0960, 2.0900,
-    2.0957, 2.0958, 2.0978, 2.0936, 2.0886, 2.0905, 2.0845, 2.0855, 2.0796, 2.0840, 2.0813, 2.0817,
-    2.0838, 2.0840, 2.0917, 2.1061, 2.1431, 2.1976, 2.2482, 2.3055, 2.3700, 2.4088, 2.4372, 2.4609,
-    2.4731, 2.4847, 2.5072, 2.5451, 2.5772, 2.6147, 2.6529, 2.6596, 2.6645, 2.6726, 2.6803, 2.6812,
-    2.6899, 2.6916, 2.6931, 2.6998, 2.7062, 2.7262, 2.7222, 2.7158, 2.7041, 2.7485, 2.7491, 2.7451,
-    2.7485, 2.7233, 2.7297, 2.7233, 2.7145, 2.6958, 2.6788, 2.6439, 2.6007, 2.4786, 2.2469, 2.1877,
-    2.1392, 2.0717, 2.0107, 1.9676, 1.9140, 1.7102, 0.9101, 0.7164
-]
-
-
-class VAE(nn.Module):
-
-    def __init__(
-        self,
-        *,
-        data_dim: int,
-        embed_dim: int,
-        hidden_dim: int,
-    ):
-        super().__init__()
-
-        if data_dim == 80:
-            self.data_mean = nn.Buffer(torch.tensor(DATA_MEAN_80D, dtype=torch.float32))
-            self.data_std = nn.Buffer(torch.tensor(DATA_STD_80D, dtype=torch.float32))
-        elif data_dim == 128:
-            self.data_mean = nn.Buffer(torch.tensor(DATA_MEAN_128D, dtype=torch.float32))
-            self.data_std = nn.Buffer(torch.tensor(DATA_STD_128D, dtype=torch.float32))
-
-        self.data_mean = self.data_mean.view(1, -1, 1)
-        self.data_std = self.data_std.view(1, -1, 1)
-
-        self.encoder = Encoder1D(
-            dim=hidden_dim,
-            ch_mult=(1, 2, 4),
-            num_res_blocks=2,
-            attn_layers=[3],
-            down_layers=[0],
-            in_dim=data_dim,
-            embed_dim=embed_dim,
-        )
-        self.decoder = Decoder1D(
-            dim=hidden_dim,
-            ch_mult=(1, 2, 4),
-            num_res_blocks=2,
-            attn_layers=[3],
-            down_layers=[0],
-            in_dim=data_dim,
-            out_dim=data_dim,
-            embed_dim=embed_dim,
-        )
-
-        self.embed_dim = embed_dim
-        # self.quant_conv = nn.Conv1d(2 * embed_dim, 2 * embed_dim, 1)
-        # self.post_quant_conv = nn.Conv1d(embed_dim, embed_dim, 1)
-
-        self.initialize_weights()
-
-    def initialize_weights(self):
-        pass
-
-    def encode(self, x: torch.Tensor, normalize: bool = True) -> DiagonalGaussianDistribution:
-        if normalize:
-            x = self.normalize(x)
-        moments = self.encoder(x)
-        posterior = DiagonalGaussianDistribution(moments)
-        return posterior
-
-    def decode(self, z: torch.Tensor, unnormalize: bool = True) -> torch.Tensor:
-        dec = self.decoder(z)
-        if unnormalize:
-            dec = self.unnormalize(dec)
-        return dec
-
-    def normalize(self, x: torch.Tensor) -> torch.Tensor:
-        return (x - comfy.model_management.cast_to(self.data_mean, dtype=x.dtype, device=x.device)) / comfy.model_management.cast_to(self.data_std, dtype=x.dtype, device=x.device)
-
-    def unnormalize(self, x: torch.Tensor) -> torch.Tensor:
-        return x * comfy.model_management.cast_to(self.data_std, dtype=x.dtype, device=x.device) + comfy.model_management.cast_to(self.data_mean, dtype=x.dtype, device=x.device)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        sample_posterior: bool = True,
-        rng: Optional[torch.Generator] = None,
-        normalize: bool = True,
-        unnormalize: bool = True,
-    ) -> tuple[torch.Tensor, DiagonalGaussianDistribution]:
-
-        posterior = self.encode(x, normalize=normalize)
-        if sample_posterior:
-            z = posterior.sample(rng)
-        else:
-            z = posterior.mode()
-        dec = self.decode(z, unnormalize=unnormalize)
-        return dec, posterior
-
-    def load_weights(self, src_dict) -> None:
-        self.load_state_dict(src_dict, strict=True)
-
-    @property
-    def device(self) -> torch.device:
-        return next(self.parameters()).device
-
-    def get_last_layer(self):
-        return self.decoder.conv_out.weight
-
-    def remove_weight_norm(self):
-        return self
-
-
-class Encoder1D(nn.Module):
-
-    def __init__(self,
-                 *,
-                 dim: int,
-                 ch_mult: tuple[int] = (1, 2, 4, 8),
-                 num_res_blocks: int,
-                 attn_layers: list[int] = [],
-                 down_layers: list[int] = [],
-                 resamp_with_conv: bool = True,
-                 in_dim: int,
-                 embed_dim: int,
-                 double_z: bool = True,
-                 kernel_size: int = 3,
-                 clip_act: float = 256.0):
-        super().__init__()
-        self.dim = dim
-        self.num_layers = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.in_channels = in_dim
-        self.clip_act = clip_act
-        self.down_layers = down_layers
-        self.attn_layers = attn_layers
-        self.conv_in = ops.Conv1d(in_dim, self.dim, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-
-        in_ch_mult = (1, ) + tuple(ch_mult)
-        self.in_ch_mult = in_ch_mult
-        # downsampling
-        self.down = nn.ModuleList()
-        for i_level in range(self.num_layers):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_in = dim * in_ch_mult[i_level]
-            block_out = dim * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks):
-                block.append(
-                    ResnetBlock1D(in_dim=block_in,
-                                  out_dim=block_out,
-                                  kernel_size=kernel_size,
-                                  use_norm=True))
-                block_in = block_out
-                if i_level in attn_layers:
-                    attn.append(AttnBlock1D(block_in))
-            down = nn.Module()
-            down.block = block
-            down.attn = attn
-            if i_level in down_layers:
-                down.downsample = Downsample1D(block_in, resamp_with_conv)
-            self.down.append(down)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock1D(in_dim=block_in,
-                                         out_dim=block_in,
-                                         kernel_size=kernel_size,
-                                         use_norm=True)
-        self.mid.attn_1 = AttnBlock1D(block_in)
-        self.mid.block_2 = ResnetBlock1D(in_dim=block_in,
-                                         out_dim=block_in,
-                                         kernel_size=kernel_size,
-                                         use_norm=True)
-
-        # end
-        self.conv_out = ops.Conv1d(block_in,
-                                 2 * embed_dim if double_z else embed_dim,
-                                 kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-
-        self.learnable_gain = nn.Parameter(torch.zeros([]))
-
-    def forward(self, x):
-
-        # downsampling
-        h = self.conv_in(x)
-        for i_level in range(self.num_layers):
-            for i_block in range(self.num_res_blocks):
-                h = self.down[i_level].block[i_block](h)
-                if len(self.down[i_level].attn) > 0:
-                    h = self.down[i_level].attn[i_block](h)
-                h = h.clamp(-self.clip_act, self.clip_act)
-            if i_level in self.down_layers:
-                h = self.down[i_level].downsample(h)
-
-        # middle
-        h = self.mid.block_1(h)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h)
-        h = h.clamp(-self.clip_act, self.clip_act)
-
-        # end
-        h = nonlinearity(h)
-        h = self.conv_out(h) * (self.learnable_gain + 1)
-        return h
-
-
-class Decoder1D(nn.Module):
-
-    def __init__(self,
-                 *,
-                 dim: int,
-                 out_dim: int,
-                 ch_mult: tuple[int] = (1, 2, 4, 8),
-                 num_res_blocks: int,
-                 attn_layers: list[int] = [],
-                 down_layers: list[int] = [],
-                 kernel_size: int = 3,
-                 resamp_with_conv: bool = True,
-                 in_dim: int,
-                 embed_dim: int,
-                 clip_act: float = 256.0):
-        super().__init__()
-        self.ch = dim
-        self.num_layers = len(ch_mult)
-        self.num_res_blocks = num_res_blocks
-        self.in_channels = in_dim
-        self.clip_act = clip_act
-        self.down_layers = [i + 1 for i in down_layers]  # each downlayer add one
-
-        # compute in_ch_mult, block_in and curr_res at lowest res
-        block_in = dim * ch_mult[self.num_layers - 1]
-
-        # z to block_in
-        self.conv_in = ops.Conv1d(embed_dim, block_in, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-
-        # middle
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock1D(in_dim=block_in, out_dim=block_in, use_norm=True)
-        self.mid.attn_1 = AttnBlock1D(block_in)
-        self.mid.block_2 = ResnetBlock1D(in_dim=block_in, out_dim=block_in, use_norm=True)
-
-        # upsampling
-        self.up = nn.ModuleList()
-        for i_level in reversed(range(self.num_layers)):
-            block = nn.ModuleList()
-            attn = nn.ModuleList()
-            block_out = dim * ch_mult[i_level]
-            for i_block in range(self.num_res_blocks + 1):
-                block.append(ResnetBlock1D(in_dim=block_in, out_dim=block_out, use_norm=True))
-                block_in = block_out
-                if i_level in attn_layers:
-                    attn.append(AttnBlock1D(block_in))
-            up = nn.Module()
-            up.block = block
-            up.attn = attn
-            if i_level in self.down_layers:
-                up.upsample = Upsample1D(block_in, resamp_with_conv)
-            self.up.insert(0, up)  # prepend to get consistent order
-
-        # end
-        self.conv_out = ops.Conv1d(block_in, out_dim, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-        self.learnable_gain = nn.Parameter(torch.zeros([]))
-
-    def forward(self, z):
-        # z to block_in
-        h = self.conv_in(z)
-
-        # middle
-        h = self.mid.block_1(h)
-        h = self.mid.attn_1(h)
-        h = self.mid.block_2(h)
-        h = h.clamp(-self.clip_act, self.clip_act)
-
-        # upsampling
-        for i_level in reversed(range(self.num_layers)):
-            for i_block in range(self.num_res_blocks + 1):
-                h = self.up[i_level].block[i_block](h)
-                if len(self.up[i_level].attn) > 0:
-                    h = self.up[i_level].attn[i_block](h)
-                h = h.clamp(-self.clip_act, self.clip_act)
-            if i_level in self.down_layers:
-                h = self.up[i_level].upsample(h)
-
-        h = nonlinearity(h)
-        h = self.conv_out(h) * (self.learnable_gain + 1)
-        return h
-
-
-def VAE_16k(**kwargs) -> VAE:
-    return VAE(data_dim=80, embed_dim=20, hidden_dim=384, **kwargs)
-
-
-def VAE_44k(**kwargs) -> VAE:
-    return VAE(data_dim=128, embed_dim=40, hidden_dim=512, **kwargs)
-
-
-def get_my_vae(name: str, **kwargs) -> VAE:
-    if name == '16k':
-        return VAE_16k(**kwargs)
-    if name == '44k':
-        return VAE_44k(**kwargs)
-    raise ValueError(f'Unknown model: {name}')
-
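Review note on the removed `vae.py`: `VAE.normalize` and `VAE.unnormalize` apply one mean and one standard deviation per mel channel, broadcast over time through the `view(1, -1, 1)` reshape. A minimal sketch of that round trip on nested lists follows. It assumes a single `[channels][time]` example with no batch dimension, and the function names are hypothetical stand-ins for the tensor methods.

```python
import math


def normalize(x, mean, std):
    """Standardize x[channel][time] with one (mean, std) pair per channel."""
    return [[(v - m) / s for v in row] for row, m, s in zip(x, mean, std)]


def unnormalize(x, mean, std):
    """Inverse of normalize: scale by std, then shift by mean."""
    return [[v * s + m for v in row] for row, m, s in zip(x, mean, std)]


if __name__ == "__main__":
    mel = [[-1.0, -2.0, -3.0], [0.5, 1.5, 2.5]]  # two channels, three frames
    mean = [-2.0, 1.5]
    std = [1.0, 0.5]
    z = normalize(mel, mean, std)
    back = unnormalize(z, mean, std)
    print(z)
    print(all(math.isclose(a, b) for ra, rb in zip(mel, back) for a, b in zip(ra, rb)))
```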
--- a/comfy/ldm/mmaudio/vae/vae_modules.py
+++ b/comfy/ldm/mmaudio/vae/vae_modules.py
@@ -1,121 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.modules.diffusionmodules.model import vae_attention
-import math
-import comfy.ops
-ops = comfy.ops.disable_weight_init
-
-def nonlinearity(x):
-    # swish
-    return torch.nn.functional.silu(x) / 0.596
-
-def mp_sum(a, b, t=0.5):
-    return a.lerp(b, t) / math.sqrt((1 - t)**2 + t**2)
-
-def normalize(x, dim=None, eps=1e-4):
-    if dim is None:
-        dim = list(range(1, x.ndim))
-    norm = torch.linalg.vector_norm(x, dim=dim, keepdim=True, dtype=torch.float32)
-    norm = torch.add(eps, norm, alpha=math.sqrt(norm.numel() / x.numel()))
-    return x / norm.to(x.dtype)
-
-class ResnetBlock1D(nn.Module):
-
-    def __init__(self, *, in_dim, out_dim=None, conv_shortcut=False, kernel_size=3, use_norm=True):
-        super().__init__()
-        self.in_dim = in_dim
-        out_dim = in_dim if out_dim is None else out_dim
-        self.out_dim = out_dim
-        self.use_conv_shortcut = conv_shortcut
-        self.use_norm = use_norm
-
-        self.conv1 = ops.Conv1d(in_dim, out_dim, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-        self.conv2 = ops.Conv1d(out_dim, out_dim, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-        if self.in_dim != self.out_dim:
-            if self.use_conv_shortcut:
-                self.conv_shortcut = ops.Conv1d(in_dim, out_dim, kernel_size=kernel_size, padding=kernel_size // 2, bias=False)
-            else:
-                self.nin_shortcut = ops.Conv1d(in_dim, out_dim, kernel_size=1, padding=0, bias=False)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-
-        # pixel norm
-        if self.use_norm:
-            x = normalize(x, dim=1)
-
-        h = x
-        h = nonlinearity(h)
-        h = self.conv1(h)
-
-        h = nonlinearity(h)
-        h = self.conv2(h)
-
-        if self.in_dim != self.out_dim:
-            if self.use_conv_shortcut:
-                x = self.conv_shortcut(x)
-            else:
-                x = self.nin_shortcut(x)
-
-        return mp_sum(x, h, t=0.3)
-
-
-class AttnBlock1D(nn.Module):
-
-    def __init__(self, in_channels, num_heads=1):
-        super().__init__()
-        self.in_channels = in_channels
-
-        self.num_heads = num_heads
-        self.qkv = ops.Conv1d(in_channels, in_channels * 3, kernel_size=1, padding=0, bias=False)
-        self.proj_out = ops.Conv1d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
-        self.optimized_attention = vae_attention()
-
-    def forward(self, x):
-        h = x
-        y = self.qkv(h)
-        y = y.reshape(y.shape[0], -1, 3, y.shape[-1])
-        q, k, v = normalize(y, dim=1).unbind(2)
-
-        h = self.optimized_attention(q, k, v)
-        h = self.proj_out(h)
-
-        return mp_sum(x, h, t=0.3)
-
-
-class Upsample1D(nn.Module):
-
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            self.conv = ops.Conv1d(in_channels, in_channels, kernel_size=3, padding=1, bias=False)
-
-    def forward(self, x):
-        x = F.interpolate(x, scale_factor=2.0, mode='nearest-exact')  # support 3D tensor(B,C,T)
-        if self.with_conv:
-            x = self.conv(x)
-        return x
-
-
-class Downsample1D(nn.Module):
-
-    def __init__(self, in_channels, with_conv):
-        super().__init__()
-        self.with_conv = with_conv
-        if self.with_conv:
-            # no asymmetric padding in torch conv, must do it ourselves
-            self.conv1 = ops.Conv1d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
-            self.conv2 = ops.Conv1d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
-
-    def forward(self, x):
-
-        if self.with_conv:
-            x = self.conv1(x)
-
-        x = F.avg_pool1d(x, kernel_size=2, stride=2)
-
-        if self.with_conv:
-            x = self.conv2(x)
-
-        return x
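Review note on the removed `vae_modules.py`: `mp_sum` blends two signals with `lerp` and then divides by `sqrt((1 - t)**2 + t**2)`. If the two inputs are uncorrelated and have equal magnitude, that divisor is the factor by which the blend would otherwise shrink, so the output keeps the input magnitude. The list-based sketch below illustrates this under that assumption. It is not the tensor implementation, and the check uses orthogonal unit vectors as a stand-in for uncorrelated activations.

```python
import math


def mp_sum(a, b, t=0.5):
    """Blend two equal-length vectors, then rescale to preserve magnitude.

    (1 - t) * x + t * y is what torch's lerp computes element-wise.
    """
    scale = math.sqrt((1.0 - t) ** 2 + t ** 2)
    return [((1.0 - t) * x + t * y) / scale for x, y in zip(a, b)]


if __name__ == "__main__":
    skip = [1.0, 0.0]    # unit-norm residual input
    branch = [0.0, 1.0]  # unit-norm branch output, orthogonal to skip
    out = mp_sum(skip, branch, t=0.3)  # t=0.3 as used by the removed blocks
    print(math.hypot(out[0], out[1]))
```

For correlated inputs the guarantee does not hold: blending a vector with itself returns it scaled up by `1 / sqrt((1 - t)**2 + t**2)`.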
--- a/comfy/ldm/models/autoencoder.py
+++ b/comfy/ldm/models/autoencoder.py
@@ -9,8 +9,6 @@ from comfy.ldm.modules.distributions.distributions import DiagonalGaussianDistri
 from comfy.ldm.util import get_obj_from_str, instantiate_from_config
 from comfy.ldm.modules.ema import LitEma
 import comfy.ops
-from einops import rearrange
-import comfy.model_management

 class DiagonalGaussianRegularizer(torch.nn.Module):
    def __init__(self, sample: bool = False):
@@ -28,12 +26,6 @@ class DiagonalGaussianRegularizer(torch.nn.Module):
            z = posterior.mode()
        return z, None

-class EmptyRegularizer(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-
-    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, dict]:
-        return z, None

 class AbstractAutoencoder(torch.nn.Module):
    """
@@ -181,21 +173,6 @@ class AutoencodingEngineLegacy(AutoencodingEngine):
        self.post_quant_conv = conv_op(embed_dim, ddconfig["z_channels"], 1)
        self.embed_dim = embed_dim

-        if ddconfig.get("batch_norm_latent", False):
-            self.bn_eps = 1e-4
-            self.bn_momentum = 0.1
-            self.ps = [2, 2]
-            self.bn = torch.nn.BatchNorm2d(math.prod(self.ps) * ddconfig["z_channels"],
-                                           eps=self.bn_eps,
-                                           momentum=self.bn_momentum,
-                                           affine=False,
-                                           track_running_stats=True,
-                                           )
-            self.bn.eval()
-        else:
-            self.bn = None
-
-
    def get_autoencoder_params(self) -> list:
        params = super().get_autoencoder_params()
        return params
@@ -218,36 +195,11 @@ class AutoencodingEngineLegacy(AutoencodingEngine):
            z = torch.cat(z, 0)

        z, reg_log = self.regularization(z)
-
-        if self.bn is not None:
-            z = rearrange(z,
-                          "... c (i pi) (j pj)  -> ... (c pi pj) i j",
-                          pi=self.ps[0],
-                          pj=self.ps[1],
-                          )
-
-            z = torch.nn.functional.batch_norm(z,
-                                               comfy.model_management.cast_to(self.bn.running_mean, dtype=z.dtype, device=z.device),
-                                               comfy.model_management.cast_to(self.bn.running_var, dtype=z.dtype, device=z.device),
-                                               momentum=self.bn_momentum,
-                                               eps=self.bn_eps)
-
        if return_reg_log:
            return z, reg_log
        return z

    def decode(self, z: torch.Tensor, **decoder_kwargs) -> torch.Tensor:
-        if self.bn is not None:
-            s = torch.sqrt(comfy.model_management.cast_to(self.bn.running_var.view(1, -1, 1, 1), dtype=z.dtype, device=z.device) + self.bn_eps)
-            m = comfy.model_management.cast_to(self.bn.running_mean.view(1, -1, 1, 1), dtype=z.dtype, device=z.device)
-            z = z * s + m
-            z = rearrange(
-                z,
-                "... (c pi pj) i j -> ... c (i pi) (j pj)",
-                pi=self.ps[0],
-                pj=self.ps[1],
-            )
-
        if self.max_batch_size is None:
            dec = self.post_quant_conv(z)
            dec = self.decoder(dec, **decoder_kwargs)
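Review note on the removed `batch_norm_latent` path in `autoencoder.py`: the two `rearrange` patterns fold each `pi x pj` spatial patch into the channel axis before batch norm in `encode`, and unfold it again in `decode`. The sketch below reproduces that index mapping for one `[channels][height][width]` example using nested lists. The names `space_to_depth` and `depth_to_space` are hypothetical, and batch dimensions and the normalization itself are left out.

```python
def space_to_depth(x, pi=2, pj=2):
    """'c (i pi) (j pj) -> (c pi pj) i j' for a single [c][h][w] example."""
    h, w = len(x[0]), len(x[0][0])
    assert h % pi == 0 and w % pj == 0, "height/width must divide by patch size"
    out = []
    for plane in x:               # c is the slowest-varying output index
        for p_i in range(pi):     # then the row offset inside the patch
            for p_j in range(pj):  # then the column offset inside the patch
                out.append([[plane[i * pi + p_i][j * pj + p_j]
                             for j in range(w // pj)]
                            for i in range(h // pi)])
    return out


def depth_to_space(z, pi=2, pj=2):
    """'(c pi pj) i j -> c (i pi) (j pj)', the inverse mapping."""
    c_out = len(z) // (pi * pj)
    hh, ww = len(z[0]), len(z[0][0])
    out = [[[None] * (ww * pj) for _ in range(hh * pi)] for _ in range(c_out)]
    for ch, plane in enumerate(z):
        c, rem = divmod(ch, pi * pj)
        p_i, p_j = divmod(rem, pj)
        for i in range(hh):
            for j in range(ww):
                out[c][i * pi + p_i][j * pj + p_j] = plane[i][j]
    return out


if __name__ == "__main__":
    latent = [[[0, 1], [2, 3]]]  # one channel, 2x2
    packed = space_to_depth(latent)
    print(packed)
    print(depth_to_space(packed) == latent)
```

With `ps = [2, 2]` this is why the removed `BatchNorm2d` was built with `math.prod(self.ps) * ddconfig["z_channels"]` features: each latent channel becomes four channels at half the spatial resolution.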
--- a/comfy/ldm/modules/attention.py
+++ b/comfy/ldm/modules/attention.py
@@ -5,9 +5,8 @@ import torch
 import torch.nn.functional as F
 from torch import nn, einsum
 from einops import rearrange, repeat
-from typing import Optional, Any, Callable, Union
+from typing import Optional
 import logging
-import functools

 from .diffusionmodules.util import AlphaBlender, timestep_embedding
 from .sub_quadratic_attention import efficient_dot_product_attention
@@ -18,45 +17,23 @@ if model_management.xformers_enabled():
    import xformers
    import xformers.ops

-SAGE_ATTENTION_IS_AVAILABLE = False
-try:
-    from sageattention import sageattn
-    SAGE_ATTENTION_IS_AVAILABLE = True
-except ImportError as e:
-    if model_management.sage_attention_enabled():
+if model_management.sage_attention_enabled():
+    try:
+        from sageattention import sageattn
+    except ModuleNotFoundError as e:
        if e.name == "sageattention":
            logging.error(f"\n\nTo use the `--use-sage-attention` feature, the `sageattention` package must be installed first.\ncommand:\n\t{sys.executable} -m pip install sageattention")
        else:
            raise e
        exit(-1)

-FLASH_ATTENTION_IS_AVAILABLE = False
-try:
-    from flash_attn import flash_attn_func
-    FLASH_ATTENTION_IS_AVAILABLE = True
-except ImportError:
-    if model_management.flash_attention_enabled():
+if model_management.flash_attention_enabled():
+    try:
+        from flash_attn import flash_attn_func
+    except ModuleNotFoundError:
        logging.error(f"\n\nTo use the `--use-flash-attention` feature, the `flash-attn` package must be installed first.\ncommand:\n\t{sys.executable} -m pip install flash-attn")
        exit(-1)

-REGISTERED_ATTENTION_FUNCTIONS = {}
-def register_attention_function(name: str, func: Callable):
-    # avoid replacing existing functions
-    if name not in REGISTERED_ATTENTION_FUNCTIONS:
-        REGISTERED_ATTENTION_FUNCTIONS[name] = func
-    else:
-        logging.warning(f"Attention function {name} already registered, skipping registration.")
-
-def get_attention_function(name: str, default: Any=...) -> Union[Callable, None]:
-    if name == "optimized":
-        return optimized_attention
-    elif name not in REGISTERED_ATTENTION_FUNCTIONS:
-        if default is ...:
-            raise KeyError(f"Attention function {name} not found.")
-        else:
-            return default
-    return REGISTERED_ATTENTION_FUNCTIONS[name]
-
 from comfy.cli_args import args
 import comfy.ops
 ops = comfy.ops.disable_weight_init
@@ -114,27 +91,7 @@ class FeedForward(nn.Module):
 def Normalize(in_channels, dtype=None, device=None):
    return torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True, dtype=dtype, device=device)

-
-def wrap_attn(func):
-    @functools.wraps(func)
-    def wrapper(*args, **kwargs):
-        remove_attn_wrapper_key = False
-        try:
-            if "_inside_attn_wrapper" not in kwargs:
-                transformer_options = kwargs.get("transformer_options", None)
-                remove_attn_wrapper_key = True
-                kwargs["_inside_attn_wrapper"] = True
-                if transformer_options is not None:
-                    if "optimized_attention_override" in transformer_options:
-                        return transformer_options["optimized_attention_override"](func, *args, **kwargs)
-            return func(*args, **kwargs)
-        finally:
-            if remove_attn_wrapper_key:
-                del kwargs["_inside_attn_wrapper"]
-    return wrapper
-
-@wrap_attn
-def attention_basic(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+def attention_basic(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    attn_precision = get_attn_precision(attn_precision, q.dtype)

    if skip_reshape:
@@ -202,8 +159,8 @@ def attention_basic(q, k, v, heads, mask=None, attn_precision=None, skip_reshape
        )
    return out

-@wrap_attn
-def attention_sub_quad(query, key, value, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+
+def attention_sub_quad(query, key, value, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    attn_precision = get_attn_precision(attn_precision, query.dtype)

    if skip_reshape:
@@ -273,8 +230,7 @@ def attention_sub_quad(query, key, value, heads, mask=None, attn_precision=None,
        hidden_states = hidden_states.unflatten(0, (-1, heads)).transpose(1,2).flatten(start_dim=2)
    return hidden_states

-@wrap_attn
-def attention_split(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+def attention_split(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    attn_precision = get_attn_precision(attn_precision, q.dtype)

    if skip_reshape:
@@ -403,8 +359,7 @@ try:
 except:
    pass

-@wrap_attn
-def attention_xformers(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+def attention_xformers(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    b = q.shape[0]
    dim_head = q.shape[-1]
    # check to make sure xformers isn't broken
@@ -419,7 +374,7 @@ def attention_xformers(q, k, v, heads, mask=None, attn_precision=None, skip_resh
            disabled_xformers = True

    if disabled_xformers:
-        return attention_pytorch(q, k, v, heads, mask, skip_reshape=skip_reshape, **kwargs)
+        return attention_pytorch(q, k, v, heads, mask, skip_reshape=skip_reshape)

    if skip_reshape:
        # b h k d -> b k h d
@@ -472,8 +427,8 @@ else:
    #TODO: other GPUs ?
    SDP_BATCH_LIMIT = 2**31

-@wrap_attn
-def attention_pytorch(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+
+def attention_pytorch(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    if skip_reshape:
        b, _, _, dim_head = q.shape
    else:
@@ -515,8 +470,8 @@ def attention_pytorch(q, k, v, heads, mask=None, attn_precision=None, skip_resha
            ).transpose(1, 2).reshape(-1, q.shape[2], heads * dim_head)
    return out

-@wrap_attn
-def attention_sage(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+
+def attention_sage(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    if skip_reshape:
        b, _, _, dim_head = q.shape
        tensor_layout = "HND"
@@ -546,7 +501,7 @@ def attention_sage(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=
                lambda t: t.transpose(1, 2),
                (q, k, v),
            )
-        return attention_pytorch(q, k, v, heads, mask=mask, skip_reshape=True, skip_output_reshape=skip_output_reshape, **kwargs)
+        return attention_pytorch(q, k, v, heads, mask=mask, skip_reshape=True, skip_output_reshape=skip_output_reshape)

    if tensor_layout == "HND":
        if not skip_output_reshape:
@@ -579,8 +534,8 @@ except AttributeError as error:
                    dropout_p: float = 0.0, causal: bool = False) -> torch.Tensor:
        assert False, f"Could not define flash_attn_wrapper: {FLASH_ATTN_ERROR}"

-@wrap_attn
-def attention_flash(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+
+def attention_flash(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    if skip_reshape:
        b, _, _, dim_head = q.shape
    else:
@@ -600,8 +555,7 @@ def attention_flash(q, k, v, heads, mask=None, attn_precision=None, skip_reshape
            mask = mask.unsqueeze(1)

    try:
-        if mask is not None:
-            raise RuntimeError("Mask must not be set for Flash attention")
+        assert mask is None
        out = flash_attn_wrapper(
            q.transpose(1, 2),
            k.transpose(1, 2),
@@ -643,19 +597,6 @@ else:

 optimized_attention_masked = optimized_attention

-
-# register core-supported attention functions
-if SAGE_ATTENTION_IS_AVAILABLE:
-    register_attention_function("sage", attention_sage)
-if FLASH_ATTENTION_IS_AVAILABLE:
-    register_attention_function("flash", attention_flash)
-if model_management.xformers_enabled():
-    register_attention_function("xformers", attention_xformers)
-register_attention_function("pytorch", attention_pytorch)
-register_attention_function("sub_quad", attention_sub_quad)
-register_attention_function("split", attention_split)
-
-
 def optimized_attention_for_device(device, mask=False, small_input=False):
    if small_input:
        if model_management.pytorch_attention_enabled():
@@ -688,7 +629,7 @@ class CrossAttention(nn.Module):

        self.to_out = nn.Sequential(operations.Linear(inner_dim, query_dim, dtype=dtype, device=device), nn.Dropout(dropout))

-    def forward(self, x, context=None, value=None, mask=None, transformer_options={}):
+    def forward(self, x, context=None, value=None, mask=None):
        q = self.to_q(x)
        context = default(context, x)
        k = self.to_k(context)
@@ -699,9 +640,9 @@ class CrossAttention(nn.Module):
            v = self.to_v(context)

        if mask is None:
-            out = optimized_attention(q, k, v, self.heads, attn_precision=self.attn_precision, transformer_options=transformer_options)
+            out = optimized_attention(q, k, v, self.heads, attn_precision=self.attn_precision)
        else:
-            out = optimized_attention_masked(q, k, v, self.heads, mask, attn_precision=self.attn_precision, transformer_options=transformer_options)
+            out = optimized_attention_masked(q, k, v, self.heads, mask, attn_precision=self.attn_precision)
        return self.to_out(out)


@@ -805,7 +746,7 @@ class BasicTransformerBlock(nn.Module):
            n = attn1_replace_patch[block_attn1](n, context_attn1, value_attn1, extra_options)
            n = self.attn1.to_out(n)
        else:
-            n = self.attn1(n, context=context_attn1, value=value_attn1, transformer_options=transformer_options)
+            n = self.attn1(n, context=context_attn1, value=value_attn1)

        if "attn1_output_patch" in transformer_patches:
            patch = transformer_patches["attn1_output_patch"]
@@ -845,7 +786,7 @@ class BasicTransformerBlock(nn.Module):
                n = attn2_replace_patch[block_attn2](n, context_attn2, value_attn2, extra_options)
                n = self.attn2.to_out(n)
            else:
-                n = self.attn2(n, context=context_attn2, value=value_attn2, transformer_options=transformer_options)
+                n = self.attn2(n, context=context_attn2, value=value_attn2)

        if "attn2_output_patch" in transformer_patches:
            patch = transformer_patches["attn2_output_patch"]
@@ -1076,7 +1017,7 @@ class SpatialVideoTransformer(SpatialTransformer):

            B, S, C = x_mix.shape
            x_mix = rearrange(x_mix, "(b t) s c -> (b s) t c", t=timesteps)
-            x_mix = mix_block(x_mix, context=time_context, transformer_options=transformer_options)
+            x_mix = mix_block(x_mix, context=time_context) #TODO: transformer_options
            x_mix = rearrange(
                x_mix, "(b s) t c -> (b t) s c", s=S, b=B // timesteps, c=C, t=timesteps
            )
--- a/comfy/ldm/modules/diffusionmodules/mmdit.py
+++ b/comfy/ldm/modules/diffusionmodules/mmdit.py
@@ -109,7 +109,7 @@ class PatchEmbed(nn.Module):
 def modulate(x, shift, scale):
    if shift is None:
        shift = torch.zeros_like(scale)
-    return torch.addcmul(shift.unsqueeze(1), x, 1+ scale.unsqueeze(1))
+    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


 #################################################################################
@@ -211,14 +211,12 @@ class TimestepEmbedder(nn.Module):
    Embeds scalar timesteps into vector representations.
    """

-    def __init__(self, hidden_size, frequency_embedding_size=256, output_size=None, dtype=None, device=None, operations=None):
+    def __init__(self, hidden_size, frequency_embedding_size=256, dtype=None, device=None, operations=None):
        super().__init__()
-        if output_size is None:
-            output_size = hidden_size
        self.mlp = nn.Sequential(
            operations.Linear(frequency_embedding_size, hidden_size, bias=True, dtype=dtype, device=device),
            nn.SiLU(),
-            operations.Linear(hidden_size, output_size, bias=True, dtype=dtype, device=device),
+            operations.Linear(hidden_size, hidden_size, bias=True, dtype=dtype, device=device),
        )
        self.frequency_embedding_size = frequency_embedding_size

@@ -566,7 +564,10 @@ class DismantledBlock(nn.Module):
        assert not self.pre_only
        attn1 = self.attn.post_attention(attn)
        attn2 = self.attn2.post_attention(attn2)
-        x = gate_cat(x, gate_msa, gate_msa2, attn1, attn2)
+        out1 = gate_msa.unsqueeze(1) * attn1
+        out2 = gate_msa2.unsqueeze(1) * attn2
+        x = x + out1
+        x = x + out2
        x = x + gate_mlp.unsqueeze(1) * self.mlp(
            modulate(self.norm2(x), shift_mlp, scale_mlp)
        )
@@ -593,11 +594,6 @@ class DismantledBlock(nn.Module):
            )
            return self.post_attention(attn, *intermediates)

-def gate_cat(x, gate_msa, gate_msa2, attn1, attn2):
-    out1 = gate_msa.unsqueeze(1) * attn1
-    out2 = gate_msa2.unsqueeze(1) * attn2
-    x = torch.stack([x, out1, out2], dim=0).sum(dim=0)
-    return x

 def block_mixing(*args, use_checkpoint=True, **kwargs):
    if use_checkpoint:
@@ -608,7 +604,7 @@ def block_mixing(*args, use_checkpoint=True, **kwargs):
        return _block_mixing(*args, **kwargs)


-def _block_mixing(context, x, context_block, x_block, c, transformer_options={}):
+def _block_mixing(context, x, context_block, x_block, c):
    context_qkv, context_intermediates = context_block.pre_attention(context, c)

    if x_block.x_block_self_attn:
@@ -624,7 +620,6 @@ def _block_mixing(context, x, context_block, x_block, c, transformer_options={})
    attn = optimized_attention(
        qkv[0], qkv[1], qkv[2],
        heads=x_block.attn.num_heads,
-        transformer_options=transformer_options,
    )
    context_attn, x_attn = (
        attn[:, : context_qkv[0].shape[1]],
@@ -640,7 +635,6 @@ def _block_mixing(context, x, context_block, x_block, c, transformer_options={})
        attn2 = optimized_attention(
                x_qkv2[0], x_qkv2[1], x_qkv2[2],
                heads=x_block.attn2.num_heads,
-                transformer_options=transformer_options,
            )
        x = x_block.post_attention_x(x_attn, attn2, *x_intermediates)
    else:
@@ -962,10 +956,10 @@ class MMDiT(nn.Module):
            if ("double_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
-                    out["txt"], out["img"] = self.joint_blocks[i](args["txt"], args["img"], c=args["vec"], transformer_options=args["transformer_options"])
+                    out["txt"], out["img"] = self.joint_blocks[i](args["txt"], args["img"], c=args["vec"])
                    return out

-                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": c_mod, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": c_mod}, {"original_block": block_wrap})
                context = out["txt"]
                x = out["img"]
            else:
@@ -974,7 +968,6 @@ class MMDiT(nn.Module):
                    x,
                    c=c_mod,
                    use_checkpoint=self.use_checkpoint,
-                    transformer_options=transformer_options,
                )
            if control is not None:
                control_o = control.get("output")
--- a/comfy/ldm/modules/diffusionmodules/model.py
+++ b/comfy/ldm/modules/diffusionmodules/model.py
@@ -145,7 +145,7 @@ class Downsample(nn.Module):

 class ResnetBlock(nn.Module):
    def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False,
-                 dropout=0.0, temb_channels=512, conv_op=ops.Conv2d, norm_op=Normalize):
+                 dropout, temb_channels=512, conv_op=ops.Conv2d):
        super().__init__()
        self.in_channels = in_channels
        out_channels = in_channels if out_channels is None else out_channels
@@ -153,7 +153,7 @@ class ResnetBlock(nn.Module):
        self.use_conv_shortcut = conv_shortcut

        self.swish = torch.nn.SiLU(inplace=True)
-        self.norm1 = norm_op(in_channels)
+        self.norm1 = Normalize(in_channels)
        self.conv1 = conv_op(in_channels,
                                     out_channels,
                                     kernel_size=3,
@@ -162,7 +162,7 @@ class ResnetBlock(nn.Module):
        if temb_channels > 0:
            self.temb_proj = ops.Linear(temb_channels,
                                             out_channels)
-        self.norm2 = norm_op(out_channels)
+        self.norm2 = Normalize(out_channels)
        self.dropout = torch.nn.Dropout(dropout, inplace=True)
        self.conv2 = conv_op(out_channels,
                                     out_channels,
@@ -183,7 +183,7 @@ class ResnetBlock(nn.Module):
                                                    stride=1,
                                                    padding=0)

-    def forward(self, x, temb=None):
+    def forward(self, x, temb):
        h = x
        h = self.norm1(h)
        h = self.swish(h)
@@ -305,11 +305,11 @@ def vae_attention():
        return normal_attention

 class AttnBlock(nn.Module):
-    def __init__(self, in_channels, conv_op=ops.Conv2d, norm_op=Normalize):
+    def __init__(self, in_channels, conv_op=ops.Conv2d):
        super().__init__()
        self.in_channels = in_channels

-        self.norm = norm_op(in_channels)
+        self.norm = Normalize(in_channels)
        self.q = conv_op(in_channels,
                                 in_channels,
                                 kernel_size=1,
--- a/comfy/ldm/omnigen/omnigen2.py
+++ b/comfy/ldm/omnigen/omnigen2.py
@@ -120,7 +120,7 @@ class Attention(nn.Module):
            nn.Dropout(0.0)
        )

-    def forward(self, hidden_states: torch.Tensor, encoder_hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, image_rotary_emb: Optional[torch.Tensor] = None, transformer_options={}) -> torch.Tensor:
+    def forward(self, hidden_states: torch.Tensor, encoder_hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, image_rotary_emb: Optional[torch.Tensor] = None) -> torch.Tensor:
        batch_size, sequence_length, _ = hidden_states.shape

        query = self.to_q(hidden_states)
@@ -146,7 +146,7 @@ class Attention(nn.Module):
            key = key.repeat_interleave(self.heads // self.kv_heads, dim=1)
            value = value.repeat_interleave(self.heads // self.kv_heads, dim=1)

-        hidden_states = optimized_attention_masked(query, key, value, self.heads, attention_mask, skip_reshape=True, transformer_options=transformer_options)
+        hidden_states = optimized_attention_masked(query, key, value, self.heads, attention_mask, skip_reshape=True)
        hidden_states = self.to_out[0](hidden_states)
        return hidden_states

@@ -182,16 +182,16 @@ class OmniGen2TransformerBlock(nn.Module):
        self.norm2 = operations.RMSNorm(dim, eps=norm_eps, dtype=dtype, device=device)
        self.ffn_norm2 = operations.RMSNorm(dim, eps=norm_eps, dtype=dtype, device=device)

-    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor, image_rotary_emb: torch.Tensor, temb: Optional[torch.Tensor] = None, transformer_options={}) -> torch.Tensor:
+    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor, image_rotary_emb: torch.Tensor, temb: Optional[torch.Tensor] = None) -> torch.Tensor:
        if self.modulation:
            norm_hidden_states, gate_msa, scale_mlp, gate_mlp = self.norm1(hidden_states, temb)
-            attn_output = self.attn(norm_hidden_states, norm_hidden_states, attention_mask, image_rotary_emb, transformer_options=transformer_options)
+            attn_output = self.attn(norm_hidden_states, norm_hidden_states, attention_mask, image_rotary_emb)
            hidden_states = hidden_states + gate_msa.unsqueeze(1).tanh() * self.norm2(attn_output)
            mlp_output = self.feed_forward(self.ffn_norm1(hidden_states) * (1 + scale_mlp.unsqueeze(1)))
            hidden_states = hidden_states + gate_mlp.unsqueeze(1).tanh() * self.ffn_norm2(mlp_output)
        else:
            norm_hidden_states = self.norm1(hidden_states)
-            attn_output = self.attn(norm_hidden_states, norm_hidden_states, attention_mask, image_rotary_emb, transformer_options=transformer_options)
+            attn_output = self.attn(norm_hidden_states, norm_hidden_states, attention_mask, image_rotary_emb)
            hidden_states = hidden_states + self.norm2(attn_output)
            mlp_output = self.feed_forward(self.ffn_norm1(hidden_states))
            hidden_states = hidden_states + self.ffn_norm2(mlp_output)
@@ -390,7 +390,7 @@ class OmniGen2Transformer2DModel(nn.Module):
            ref_img_sizes, img_sizes,
        )

-    def img_patch_embed_and_refine(self, hidden_states, ref_image_hidden_states, padded_img_mask, padded_ref_img_mask, noise_rotary_emb, ref_img_rotary_emb, l_effective_ref_img_len, l_effective_img_len, temb, transformer_options={}):
+    def img_patch_embed_and_refine(self, hidden_states, ref_image_hidden_states, padded_img_mask, padded_ref_img_mask, noise_rotary_emb, ref_img_rotary_emb, l_effective_ref_img_len, l_effective_img_len, temb):
        batch_size = len(hidden_states)

        hidden_states = self.x_embedder(hidden_states)
@@ -405,17 +405,17 @@ class OmniGen2Transformer2DModel(nn.Module):
                    shift += ref_img_len

        for layer in self.noise_refiner:
-            hidden_states = layer(hidden_states, padded_img_mask, noise_rotary_emb, temb, transformer_options=transformer_options)
+            hidden_states = layer(hidden_states, padded_img_mask, noise_rotary_emb, temb)

        if ref_image_hidden_states is not None:
            for layer in self.ref_image_refiner:
-                ref_image_hidden_states = layer(ref_image_hidden_states, padded_ref_img_mask, ref_img_rotary_emb, temb, transformer_options=transformer_options)
+                ref_image_hidden_states = layer(ref_image_hidden_states, padded_ref_img_mask, ref_img_rotary_emb, temb)

            hidden_states = torch.cat([ref_image_hidden_states, hidden_states], dim=1)

        return hidden_states

-    def forward(self, x, timesteps, context, num_tokens, ref_latents=None, attention_mask=None, transformer_options={}, **kwargs):
+    def forward(self, x, timesteps, context, num_tokens, ref_latents=None, attention_mask=None, **kwargs):
        B, C, H, W = x.shape
        hidden_states = comfy.ldm.common_dit.pad_to_patch_size(x, (self.patch_size, self.patch_size))
        _, _, H_padded, W_padded = hidden_states.shape
@@ -444,7 +444,7 @@ class OmniGen2Transformer2DModel(nn.Module):
        )

        for layer in self.context_refiner:
-            text_hidden_states = layer(text_hidden_states, text_attention_mask, context_rotary_emb, transformer_options=transformer_options)
+            text_hidden_states = layer(text_hidden_states, text_attention_mask, context_rotary_emb)

        img_len = hidden_states.shape[1]
        combined_img_hidden_states = self.img_patch_embed_and_refine(
@@ -453,14 +453,13 @@ class OmniGen2Transformer2DModel(nn.Module):
            noise_rotary_emb, ref_img_rotary_emb,
            l_effective_ref_img_len, l_effective_img_len,
            temb,
-            transformer_options=transformer_options,
        )

        hidden_states = torch.cat([text_hidden_states, combined_img_hidden_states], dim=1)
        attention_mask = None

        for layer in self.layers:
-            hidden_states = layer(hidden_states, attention_mask, rotary_emb, temb, transformer_options=transformer_options)
+            hidden_states = layer(hidden_states, attention_mask, rotary_emb, temb)

        hidden_states = self.norm_out(hidden_states, temb)

--- a/comfy/ldm/qwen_image/controlnet.py
+++ b/comfy/ldm/qwen_image/controlnet.py
@@ -1,77 +0,0 @@
-import torch
-import math
-
-from .model import QwenImageTransformer2DModel
-
-
-class QwenImageControlNetModel(QwenImageTransformer2DModel):
-    def __init__(
-        self,
-        extra_condition_channels=0,
-        dtype=None,
-        device=None,
-        operations=None,
-        **kwargs
-    ):
-        super().__init__(final_layer=False, dtype=dtype, device=device, operations=operations, **kwargs)
-        self.main_model_double = 60
-
-        # controlnet_blocks
-        self.controlnet_blocks = torch.nn.ModuleList([])
-        for _ in range(len(self.transformer_blocks)):
-            self.controlnet_blocks.append(operations.Linear(self.inner_dim, self.inner_dim, device=device, dtype=dtype))
-        self.controlnet_x_embedder = operations.Linear(self.in_channels + extra_condition_channels, self.inner_dim, device=device, dtype=dtype)
-
-    def forward(
-        self,
-        x,
-        timesteps,
-        context,
-        attention_mask=None,
-        guidance: torch.Tensor = None,
-        ref_latents=None,
-        hint=None,
-        transformer_options={},
-        **kwargs
-    ):
-        timestep = timesteps
-        encoder_hidden_states = context
-        encoder_hidden_states_mask = attention_mask
-
-        hidden_states, img_ids, orig_shape = self.process_img(x)
-        hint, _, _ = self.process_img(hint)
-
-        txt_start = round(max(((x.shape[-1] + (self.patch_size // 2)) // self.patch_size) // 2, ((x.shape[-2] + (self.patch_size // 2)) // self.patch_size) // 2))
-        txt_ids = torch.arange(txt_start, txt_start + context.shape[1], device=x.device).reshape(1, -1, 1).repeat(x.shape[0], 1, 3)
-        ids = torch.cat((txt_ids, img_ids), dim=1)
-        image_rotary_emb = self.pe_embedder(ids).to(x.dtype).contiguous()
-        del ids, txt_ids, img_ids
-
-        hidden_states = self.img_in(hidden_states) + self.controlnet_x_embedder(hint)
-        encoder_hidden_states = self.txt_norm(encoder_hidden_states)
-        encoder_hidden_states = self.txt_in(encoder_hidden_states)
-
-        if guidance is not None:
-            guidance = guidance * 1000
-
-        temb = (
-            self.time_text_embed(timestep, hidden_states)
-            if guidance is None
-            else self.time_text_embed(timestep, guidance, hidden_states)
-        )
-
-        repeat = math.ceil(self.main_model_double / len(self.controlnet_blocks))
-
-        controlnet_block_samples = ()
-        for i, block in enumerate(self.transformer_blocks):
-            encoder_hidden_states, hidden_states = block(
-                hidden_states=hidden_states,
-                encoder_hidden_states=encoder_hidden_states,
-                encoder_hidden_states_mask=encoder_hidden_states_mask,
-                temb=temb,
-                image_rotary_emb=image_rotary_emb,
-            )
-
-            controlnet_block_samples = controlnet_block_samples + (self.controlnet_blocks[i](hidden_states),) * repeat
-
-        return {"input": controlnet_block_samples[:self.main_model_double]}
--- a/comfy/ldm/qwen_image/model.py
+++ b/comfy/ldm/qwen_image/model.py
@@ -9,8 +9,6 @@ from comfy.ldm.lightricks.model import TimestepEmbedding, Timesteps
 from comfy.ldm.modules.attention import optimized_attention_masked
 from comfy.ldm.flux.layers import EmbedND
 import comfy.ldm.common_dit
-import comfy.patcher_extension
-from comfy.ldm.flux.math import apply_rope1

 class GELU(nn.Module):
    def __init__(self, dim_in: int, dim_out: int, approximate: str = "none", bias: bool = True, dtype=None, device=None, operations=None):
@@ -133,36 +131,34 @@ class Attention(nn.Module):
        encoder_hidden_states_mask: torch.FloatTensor = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        image_rotary_emb: Optional[torch.Tensor] = None,
-        transformer_options={},
    ) -> Tuple[torch.Tensor, torch.Tensor]:
-        batch_size = hidden_states.shape[0]
-        seq_img = hidden_states.shape[1]
        seq_txt = encoder_hidden_states.shape[1]

-        # Project and reshape to BHND format (batch, heads, seq, dim)
-        img_query = self.to_q(hidden_states).view(batch_size, seq_img, self.heads, -1).transpose(1, 2).contiguous()
-        img_key = self.to_k(hidden_states).view(batch_size, seq_img, self.heads, -1).transpose(1, 2).contiguous()
-        img_value = self.to_v(hidden_states).view(batch_size, seq_img, self.heads, -1).transpose(1, 2)
+        img_query = self.to_q(hidden_states).unflatten(-1, (self.heads, -1))
+        img_key = self.to_k(hidden_states).unflatten(-1, (self.heads, -1))
+        img_value = self.to_v(hidden_states).unflatten(-1, (self.heads, -1))

-        txt_query = self.add_q_proj(encoder_hidden_states).view(batch_size, seq_txt, self.heads, -1).transpose(1, 2).contiguous()
-        txt_key = self.add_k_proj(encoder_hidden_states).view(batch_size, seq_txt, self.heads, -1).transpose(1, 2).contiguous()
-        txt_value = self.add_v_proj(encoder_hidden_states).view(batch_size, seq_txt, self.heads, -1).transpose(1, 2)
+        txt_query = self.add_q_proj(encoder_hidden_states).unflatten(-1, (self.heads, -1))
+        txt_key = self.add_k_proj(encoder_hidden_states).unflatten(-1, (self.heads, -1))
+        txt_value = self.add_v_proj(encoder_hidden_states).unflatten(-1, (self.heads, -1))

        img_query = self.norm_q(img_query)
        img_key = self.norm_k(img_key)
        txt_query = self.norm_added_q(txt_query)
        txt_key = self.norm_added_k(txt_key)

-        joint_query = torch.cat([txt_query, img_query], dim=2)
-        joint_key = torch.cat([txt_key, img_key], dim=2)
-        joint_value = torch.cat([txt_value, img_value], dim=2)
+        joint_query = torch.cat([txt_query, img_query], dim=1)
+        joint_key = torch.cat([txt_key, img_key], dim=1)
+        joint_value = torch.cat([txt_value, img_value], dim=1)

-        joint_query = apply_rope1(joint_query, image_rotary_emb)
-        joint_key = apply_rope1(joint_key, image_rotary_emb)
+        joint_query = apply_rotary_emb(joint_query, image_rotary_emb)
+        joint_key = apply_rotary_emb(joint_key, image_rotary_emb)

-        joint_hidden_states = optimized_attention_masked(joint_query, joint_key, joint_value, self.heads,
-                                                         attention_mask, transformer_options=transformer_options,
-                                                         skip_reshape=True)
+        joint_query = joint_query.flatten(start_dim=2)
+        joint_key = joint_key.flatten(start_dim=2)
+        joint_value = joint_value.flatten(start_dim=2)
+
+        joint_hidden_states = optimized_attention_masked(joint_query, joint_key, joint_value, self.heads, attention_mask)

        txt_attn_output = joint_hidden_states[:, :seq_txt, :]
        img_attn_output = joint_hidden_states[:, seq_txt:, :]
@@ -218,9 +214,9 @@ class QwenImageTransformerBlock(nn.Module):
            operations=operations,
        )

-    def _modulate(self, x: torch.Tensor, mod_params: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
-        shift, scale, gate = torch.chunk(mod_params, 3, dim=-1)
-        return torch.addcmul(shift.unsqueeze(1), x, 1 + scale.unsqueeze(1)), gate.unsqueeze(1)
+    def _modulate(self, x, mod_params):
+        shift, scale, gate = mod_params.chunk(3, dim=-1)
+        return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1), gate.unsqueeze(1)

    def forward(
        self,
@@ -229,40 +225,34 @@ class QwenImageTransformerBlock(nn.Module):
        encoder_hidden_states_mask: torch.Tensor,
        temb: torch.Tensor,
        image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
-        transformer_options={},
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        img_mod_params = self.img_mod(temb)
        txt_mod_params = self.txt_mod(temb)
        img_mod1, img_mod2 = img_mod_params.chunk(2, dim=-1)
        txt_mod1, txt_mod2 = txt_mod_params.chunk(2, dim=-1)

-        img_modulated, img_gate1 = self._modulate(self.img_norm1(hidden_states), img_mod1)
-        del img_mod1
-        txt_modulated, txt_gate1 = self._modulate(self.txt_norm1(encoder_hidden_states), txt_mod1)
-        del txt_mod1
+        img_normed = self.img_norm1(hidden_states)
+        img_modulated, img_gate1 = self._modulate(img_normed, img_mod1)
+        txt_normed = self.txt_norm1(encoder_hidden_states)
+        txt_modulated, txt_gate1 = self._modulate(txt_normed, txt_mod1)

        img_attn_output, txt_attn_output = self.attn(
            hidden_states=img_modulated,
            encoder_hidden_states=txt_modulated,
            encoder_hidden_states_mask=encoder_hidden_states_mask,
            image_rotary_emb=image_rotary_emb,
-            transformer_options=transformer_options,
        )
-        del img_modulated
-        del txt_modulated

        hidden_states = hidden_states + img_gate1 * img_attn_output
        encoder_hidden_states = encoder_hidden_states + txt_gate1 * txt_attn_output
-        del img_attn_output
-        del txt_attn_output
-        del img_gate1
-        del txt_gate1

-        img_modulated2, img_gate2 = self._modulate(self.img_norm2(hidden_states), img_mod2)
-        hidden_states = torch.addcmul(hidden_states, img_gate2, self.img_mlp(img_modulated2))
+        img_normed2 = self.img_norm2(hidden_states)
+        img_modulated2, img_gate2 = self._modulate(img_normed2, img_mod2)
+        hidden_states = hidden_states + img_gate2 * self.img_mlp(img_modulated2)

-        txt_modulated2, txt_gate2 = self._modulate(self.txt_norm2(encoder_hidden_states), txt_mod2)
-        encoder_hidden_states = torch.addcmul(encoder_hidden_states, txt_gate2, self.txt_mlp(txt_modulated2))
+        txt_normed2 = self.txt_norm2(encoder_hidden_states)
+        txt_modulated2, txt_gate2 = self._modulate(txt_normed2, txt_mod2)
+        encoder_hidden_states = encoder_hidden_states + txt_gate2 * self.txt_mlp(txt_modulated2)

        return encoder_hidden_states, hidden_states

@@ -285,7 +275,7 @@ class LastLayer(nn.Module):
    def forward(self, x: torch.Tensor, conditioning_embedding: torch.Tensor) -> torch.Tensor:
        emb = self.linear(self.silu(conditioning_embedding))
        scale, shift = torch.chunk(emb, 2, dim=1)
-        x = torch.addcmul(shift[:, None, :], self.norm(x), (1 + scale)[:, None, :])
+        x = self.norm(x) * (1 + scale)[:, None, :] + shift[:, None, :]
        return x


@@ -303,7 +293,6 @@ class QwenImageTransformer2DModel(nn.Module):
        guidance_embeds: bool = False,
        axes_dims_rope: Tuple[int, int, int] = (16, 56, 56),
        image_model=None,
-        final_layer=True,
        dtype=None,
        device=None,
        operations=None,
@@ -311,7 +300,6 @@ class QwenImageTransformer2DModel(nn.Module):
        super().__init__()
        self.dtype = dtype
        self.patch_size = patch_size
-        self.in_channels = in_channels
        self.out_channels = out_channels or in_channels
        self.inner_dim = num_attention_heads * attention_head_dim

@@ -341,9 +329,9 @@ class QwenImageTransformer2DModel(nn.Module):
            for _ in range(num_layers)
        ])

-        if final_layer:
-            self.norm_out = LastLayer(self.inner_dim, self.inner_dim, dtype=dtype, device=device, operations=operations)
-            self.proj_out = operations.Linear(self.inner_dim, patch_size * patch_size * self.out_channels, bias=True, dtype=dtype, device=device)
+        self.norm_out = LastLayer(self.inner_dim, self.inner_dim, dtype=dtype, device=device, operations=operations)
+        self.proj_out = operations.Linear(self.inner_dim, patch_size * patch_size * self.out_channels, bias=True, dtype=dtype, device=device)
+        self.gradient_checkpointing = False

    def process_img(self, x, index=0, h_offset=0, w_offset=0):
        bs, c, t, h, w = x.shape
@@ -359,20 +347,13 @@ class QwenImageTransformer2DModel(nn.Module):
        h_offset = ((h_offset + (patch_size // 2)) // patch_size)
        w_offset = ((w_offset + (patch_size // 2)) // patch_size)

-        img_ids = torch.zeros((h_len, w_len, 3), device=x.device)
+        img_ids = torch.zeros((h_len, w_len, 3), device=x.device, dtype=x.dtype)
        img_ids[:, :, 0] = img_ids[:, :, 1] + index
-        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(h_offset, h_len - 1 + h_offset, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1) - (h_len // 2)
-        img_ids[:, :, 2] = img_ids[:, :, 2] + torch.linspace(w_offset, w_len - 1 + w_offset, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0) - (w_len // 2)
+        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(h_offset, h_len - 1 + h_offset, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1)
+        img_ids[:, :, 2] = img_ids[:, :, 2] + torch.linspace(w_offset, w_len - 1 + w_offset, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0)
        return hidden_states, repeat(img_ids, "h w c -> b (h w) c", b=bs), orig_shape

-    def forward(self, x, timestep, context, attention_mask=None, guidance=None, ref_latents=None, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, attention_mask, guidance, ref_latents, transformer_options, **kwargs)
-
-    def _forward(
+    def forward(
        self,
        x,
        timesteps,
@@ -381,7 +362,6 @@ class QwenImageTransformer2DModel(nn.Module):
        guidance: torch.Tensor = None,
        ref_latents=None,
        transformer_options={},
-        control=None,
        **kwargs
    ):
        timestep = timesteps
@@ -416,11 +396,10 @@ class QwenImageTransformer2DModel(nn.Module):
                hidden_states = torch.cat([hidden_states, kontext], dim=1)
                img_ids = torch.cat([img_ids, kontext_ids], dim=1)

-        txt_start = round(max(((x.shape[-1] + (self.patch_size // 2)) // self.patch_size) // 2, ((x.shape[-2] + (self.patch_size // 2)) // self.patch_size) // 2))
-        txt_ids = torch.arange(txt_start, txt_start + context.shape[1], device=x.device).reshape(1, -1, 1).repeat(x.shape[0], 1, 3)
+        txt_start = round(max(((x.shape[-1] + (self.patch_size // 2)) // self.patch_size), ((x.shape[-2] + (self.patch_size // 2)) // self.patch_size)))
+        txt_ids = torch.linspace(txt_start, txt_start + context.shape[1], steps=context.shape[1], device=x.device, dtype=x.dtype).reshape(1, -1, 1).repeat(x.shape[0], 1, 3)
        ids = torch.cat((txt_ids, img_ids), dim=1)
-        image_rotary_emb = self.pe_embedder(ids).to(x.dtype).contiguous()
-        del ids, txt_ids, img_ids
+        image_rotary_emb = self.pe_embedder(ids).squeeze(1).unsqueeze(2).to(x.dtype)

        hidden_states = self.img_in(hidden_states)
        encoder_hidden_states = self.txt_norm(encoder_hidden_states)
@@ -436,19 +415,15 @@ class QwenImageTransformer2DModel(nn.Module):
        )

        patches_replace = transformer_options.get("patches_replace", {})
-        patches = transformer_options.get("patches", {})
        blocks_replace = patches_replace.get("dit", {})

-        transformer_options["total_blocks"] = len(self.transformer_blocks)
-        transformer_options["block_type"] = "double"
        for i, block in enumerate(self.transformer_blocks):
-            transformer_options["block_index"] = i
            if ("double_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
-                    out["txt"], out["img"] = block(hidden_states=args["img"], encoder_hidden_states=args["txt"], encoder_hidden_states_mask=encoder_hidden_states_mask, temb=args["vec"], image_rotary_emb=args["pe"], transformer_options=args["transformer_options"])
+                    out["txt"], out["img"] = block(hidden_states=args["img"], encoder_hidden_states=args["txt"], encoder_hidden_states_mask=encoder_hidden_states_mask, temb=args["vec"], image_rotary_emb=args["pe"])
                    return out
-                out = blocks_replace[("double_block", i)]({"img": hidden_states, "txt": encoder_hidden_states, "vec": temb, "pe": image_rotary_emb, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("double_block", i)]({"img": hidden_states, "txt": encoder_hidden_states, "vec": temb, "pe": image_rotary_emb}, {"original_block": block_wrap})
                hidden_states = out["img"]
                encoder_hidden_states = out["txt"]
            else:
@@ -458,22 +433,8 @@ class QwenImageTransformer2DModel(nn.Module):
                    encoder_hidden_states_mask=encoder_hidden_states_mask,
                    temb=temb,
                    image_rotary_emb=image_rotary_emb,
-                    transformer_options=transformer_options,
                )

-            if "double_block" in patches:
-                for p in patches["double_block"]:
-                    out = p({"img": hidden_states, "txt": encoder_hidden_states, "x": x, "block_index": i, "transformer_options": transformer_options})
-                    hidden_states = out["img"]
-                    encoder_hidden_states = out["txt"]
-
-            if control is not None: # Controlnet
-                control_i = control.get("input")
-                if i < len(control_i):
-                    add = control_i[i]
-                    if add is not None:
-                        hidden_states[:, :add.shape[1]] += add
-
        hidden_states = self.norm_out(hidden_states, temb)
        hidden_states = self.proj_out(hidden_states)

--- a/comfy/ldm/wan/model.py
+++ b/comfy/ldm/wan/model.py
--- a/comfy/ldm/wan/model_animate.py
+++ b/comfy/ldm/wan/model_animate.py
@@ -1,548 +0,0 @@
-from torch import nn
-import torch
-from typing import Tuple, Optional
-from einops import rearrange
-import torch.nn.functional as F
-import math
-from .model import WanModel, sinusoidal_embedding_1d
-from comfy.ldm.modules.attention import optimized_attention
-import comfy.model_management
-
-class CausalConv1d(nn.Module):
-
-    def __init__(self, chan_in, chan_out, kernel_size=3, stride=1, dilation=1, pad_mode="replicate", operations=None, **kwargs):
-        super().__init__()
-
-        self.pad_mode = pad_mode
-        padding = (kernel_size - 1, 0)  # T
-        self.time_causal_padding = padding
-
-        self.conv = operations.Conv1d(chan_in, chan_out, kernel_size, stride=stride, dilation=dilation, **kwargs)
-
-    def forward(self, x):
-        x = F.pad(x, self.time_causal_padding, mode=self.pad_mode)
-        return self.conv(x)
-
-
-class FaceEncoder(nn.Module):
-    def __init__(self, in_dim: int, hidden_dim: int, num_heads=int, dtype=None, device=None, operations=None):
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-
-        self.num_heads = num_heads
-        self.conv1_local = CausalConv1d(in_dim, 1024 * num_heads, 3, stride=1, operations=operations, **factory_kwargs)
-        self.norm1 = operations.LayerNorm(hidden_dim // 8, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-        self.act = nn.SiLU()
-        self.conv2 = CausalConv1d(1024, 1024, 3, stride=2, operations=operations, **factory_kwargs)
-        self.conv3 = CausalConv1d(1024, 1024, 3, stride=2, operations=operations, **factory_kwargs)
-
-        self.out_proj = operations.Linear(1024, hidden_dim, **factory_kwargs)
-        self.norm1 = operations.LayerNorm(1024, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.norm2 = operations.LayerNorm(1024, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.norm3 = operations.LayerNorm(1024, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.padding_tokens = nn.Parameter(torch.empty(1, 1, 1, hidden_dim, **factory_kwargs))
-
-    def forward(self, x):
-
-        x = rearrange(x, "b t c -> b c t")
-        b, c, t = x.shape
-
-        x = self.conv1_local(x)
-        x = rearrange(x, "b (n c) t -> (b n) t c", n=self.num_heads)
-
-        x = self.norm1(x)
-        x = self.act(x)
-        x = rearrange(x, "b t c -> b c t")
-        x = self.conv2(x)
-        x = rearrange(x, "b c t -> b t c")
-        x = self.norm2(x)
-        x = self.act(x)
-        x = rearrange(x, "b t c -> b c t")
-        x = self.conv3(x)
-        x = rearrange(x, "b c t -> b t c")
-        x = self.norm3(x)
-        x = self.act(x)
-        x = self.out_proj(x)
-        x = rearrange(x, "(b n) t c -> b t n c", b=b)
-        padding = comfy.model_management.cast_to(self.padding_tokens, dtype=x.dtype, device=x.device).repeat(b, x.shape[1], 1, 1)
-        x = torch.cat([x, padding], dim=-2)
-        x_local = x.clone()
-
-        return x_local
-
-
-def get_norm_layer(norm_layer, operations=None):
-    """
-    Get the normalization layer.
-
-    Args:
-        norm_layer (str): The type of normalization layer.
-
-    Returns:
-        norm_layer (nn.Module): The normalization layer.
-    """
-    if norm_layer == "layer":
-        return operations.LayerNorm
-    elif norm_layer == "rms":
-        return operations.RMSNorm
-    else:
-        raise NotImplementedError(f"Norm layer {norm_layer} is not implemented")
-
-
-class FaceAdapter(nn.Module):
-    def __init__(
-        self,
-        hidden_dim: int,
-        heads_num: int,
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        num_adapter_layers: int = 1,
-        dtype=None, device=None, operations=None
-    ):
-
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-        self.hidden_size = hidden_dim
-        self.heads_num = heads_num
-        self.fuser_blocks = nn.ModuleList(
-            [
-                FaceBlock(
-                    self.hidden_size,
-                    self.heads_num,
-                    qk_norm=qk_norm,
-                    qk_norm_type=qk_norm_type,
-                    operations=operations,
-                    **factory_kwargs,
-                )
-                for _ in range(num_adapter_layers)
-            ]
-        )
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        motion_embed: torch.Tensor,
-        idx: int,
-        freqs_cis_q: Tuple[torch.Tensor, torch.Tensor] = None,
-        freqs_cis_k: Tuple[torch.Tensor, torch.Tensor] = None,
-    ) -> torch.Tensor:
-
-        return self.fuser_blocks[idx](x, motion_embed, freqs_cis_q, freqs_cis_k)
-
-
-
-class FaceBlock(nn.Module):
-    def __init__(
-        self,
-        hidden_size: int,
-        heads_num: int,
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        qk_scale: float = None,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-        operations=None
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-
-        self.deterministic = False
-        self.hidden_size = hidden_size
-        self.heads_num = heads_num
-        head_dim = hidden_size // heads_num
-        self.scale = qk_scale or head_dim**-0.5
-
-        self.linear1_kv = operations.Linear(hidden_size, hidden_size * 2, **factory_kwargs)
-        self.linear1_q = operations.Linear(hidden_size, hidden_size, **factory_kwargs)
-
-        self.linear2 = operations.Linear(hidden_size, hidden_size, **factory_kwargs)
-
-        qk_norm_layer = get_norm_layer(qk_norm_type, operations=operations)
-        self.q_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs) if qk_norm else nn.Identity()
-        )
-        self.k_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs) if qk_norm else nn.Identity()
-        )
-
-        self.pre_norm_feat = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.pre_norm_motion = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        motion_vec: torch.Tensor,
-        motion_mask: Optional[torch.Tensor] = None,
-        # use_context_parallel=False,
-    ) -> torch.Tensor:
-
-        B, T, N, C = motion_vec.shape
-        T_comp = T
-
-        x_motion = self.pre_norm_motion(motion_vec)
-        x_feat = self.pre_norm_feat(x)
-
-        kv = self.linear1_kv(x_motion)
-        q = self.linear1_q(x_feat)
-
-        k, v = rearrange(kv, "B L N (K H D) -> K B L N H D", K=2, H=self.heads_num)
-        q = rearrange(q, "B S (H D) -> B S H D", H=self.heads_num)
-
-        # Apply QK-Norm if needed.
-        q = self.q_norm(q).to(v)
-        k = self.k_norm(k).to(v)
-
-        k = rearrange(k, "B L N H D -> (B L) N H D")
-        v = rearrange(v, "B L N H D -> (B L) N H D")
-
-        q = rearrange(q, "B (L S) H D -> (B L) S (H D)", L=T_comp)
-
-        attn = optimized_attention(q, k, v, heads=self.heads_num)
-
-        attn = rearrange(attn, "(B L) S C -> B (L S) C", L=T_comp)
-
-        output = self.linear2(attn)
-
-        if motion_mask is not None:
-            output = output * rearrange(motion_mask, "B T H W -> B (T H W)").unsqueeze(-1)
-
-        return output
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/ops/upfirdn2d/upfirdn2d.py#L162
-def upfirdn2d_native(input, kernel, up_x, up_y, down_x, down_y, pad_x0, pad_x1, pad_y0, pad_y1):
-    _, minor, in_h, in_w = input.shape
-    kernel_h, kernel_w = kernel.shape
-
-    out = input.view(-1, minor, in_h, 1, in_w, 1)
-    out = F.pad(out, [0, up_x - 1, 0, 0, 0, up_y - 1, 0, 0])
-    out = out.view(-1, minor, in_h * up_y, in_w * up_x)
-
-    out = F.pad(out, [max(pad_x0, 0), max(pad_x1, 0), max(pad_y0, 0), max(pad_y1, 0)])
-    out = out[:, :, max(-pad_y0, 0): out.shape[2] - max(-pad_y1, 0), max(-pad_x0, 0): out.shape[3] - max(-pad_x1, 0)]
-
-    out = out.reshape([-1, 1, in_h * up_y + pad_y0 + pad_y1, in_w * up_x + pad_x0 + pad_x1])
-    w = torch.flip(kernel, [0, 1]).view(1, 1, kernel_h, kernel_w)
-    out = F.conv2d(out, w)
-    out = out.reshape(-1, minor, in_h * up_y + pad_y0 + pad_y1 - kernel_h + 1, in_w * up_x + pad_x0 + pad_x1 - kernel_w + 1)
-    return out[:, :, ::down_y, ::down_x]
-
-def upfirdn2d(input, kernel, up=1, down=1, pad=(0, 0)):
-    return upfirdn2d_native(input, kernel, up, up, down, down, pad[0], pad[1], pad[0], pad[1])
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/ops/fused_act/fused_act.py#L81
-class FusedLeakyReLU(torch.nn.Module):
-    def __init__(self, channel, negative_slope=0.2, scale=2 ** 0.5, dtype=None, device=None):
-        super().__init__()
-        self.bias = torch.nn.Parameter(torch.empty(1, channel, 1, 1, dtype=dtype, device=device))
-        self.negative_slope = negative_slope
-        self.scale = scale
-
-    def forward(self, input):
-        return fused_leaky_relu(input, comfy.model_management.cast_to(self.bias, device=input.device, dtype=input.dtype), self.negative_slope, self.scale)
-
-def fused_leaky_relu(input, bias, negative_slope=0.2, scale=2 ** 0.5):
-    return F.leaky_relu(input + bias, negative_slope) * scale
-
-class Blur(torch.nn.Module):
-    def __init__(self, kernel, pad, dtype=None, device=None):
-        super().__init__()
-        kernel = torch.tensor(kernel, dtype=dtype, device=device)
-        kernel = kernel[None, :] * kernel[:, None]
-        kernel = kernel / kernel.sum()
-        self.register_buffer('kernel', kernel)
-        self.pad = pad
-
-    def forward(self, input):
-        return upfirdn2d(input, comfy.model_management.cast_to(self.kernel, dtype=input.dtype, device=input.device), pad=self.pad)
-
-#https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L590
-class ScaledLeakyReLU(torch.nn.Module):
-    def __init__(self, negative_slope=0.2):
-        super().__init__()
-        self.negative_slope = negative_slope
-
-    def forward(self, input):
-        return F.leaky_relu(input, negative_slope=self.negative_slope)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L605
-class EqualConv2d(torch.nn.Module):
-    def __init__(self, in_channel, out_channel, kernel_size, stride=1, padding=0, bias=True, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.weight = torch.nn.Parameter(torch.empty(out_channel, in_channel, kernel_size, kernel_size, device=device, dtype=dtype))
-        self.scale = 1 / math.sqrt(in_channel * kernel_size ** 2)
-        self.stride = stride
-        self.padding = padding
-        self.bias = torch.nn.Parameter(torch.empty(out_channel, device=device, dtype=dtype)) if bias else None
-
-    def forward(self, input):
-        if self.bias is None:
-            bias = None
-        else:
-            bias = comfy.model_management.cast_to(self.bias, device=input.device, dtype=input.dtype)
-
-        return F.conv2d(input, comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) * self.scale, bias=bias, stride=self.stride, padding=self.padding)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L134
-class EqualLinear(torch.nn.Module):
-    def __init__(self, in_dim, out_dim, bias=True, bias_init=0, lr_mul=1, activation=None, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.weight = torch.nn.Parameter(torch.empty(out_dim, in_dim, device=device, dtype=dtype))
-        self.bias = torch.nn.Parameter(torch.empty(out_dim, device=device, dtype=dtype)) if bias else None
-        self.activation = activation
-        self.scale = (1 / math.sqrt(in_dim)) * lr_mul
-        self.lr_mul = lr_mul
-
-    def forward(self, input):
-        if self.bias is None:
-            bias = None
-        else:
-            bias = comfy.model_management.cast_to(self.bias, device=input.device, dtype=input.dtype) * self.lr_mul
-
-        if self.activation:
-            out = F.linear(input, comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) * self.scale)
-            return fused_leaky_relu(out, bias)
-        return F.linear(input, comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) * self.scale, bias=bias)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L654
-class ConvLayer(torch.nn.Sequential):
-    def __init__(self, in_channel, out_channel, kernel_size, downsample=False, blur_kernel=[1, 3, 3, 1], bias=True, activate=True, dtype=None, device=None, operations=None):
-        layers = []
-
-        if downsample:
-            factor = 2
-            p = (len(blur_kernel) - factor) + (kernel_size - 1)
-            layers.append(Blur(blur_kernel, pad=((p + 1) // 2, p // 2)))
-            stride, padding = 2, 0
-        else:
-            stride, padding = 1, kernel_size // 2
-
-        layers.append(EqualConv2d(in_channel, out_channel, kernel_size, padding=padding, stride=stride, bias=bias and not activate, dtype=dtype, device=device, operations=operations))
-
-        if activate:
-            layers.append(FusedLeakyReLU(out_channel) if bias else ScaledLeakyReLU(0.2))
-
-        super().__init__(*layers)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L704
-class ResBlock(torch.nn.Module):
-    def __init__(self, in_channel, out_channel, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.conv1 = ConvLayer(in_channel, in_channel, 3, dtype=dtype, device=device, operations=operations)
-        self.conv2 = ConvLayer(in_channel, out_channel, 3, downsample=True, dtype=dtype, device=device, operations=operations)
-        self.skip = ConvLayer(in_channel, out_channel, 1, downsample=True, activate=False, bias=False, dtype=dtype, device=device, operations=operations)
-
-    def forward(self, input):
-        out = self.conv2(self.conv1(input))
-        skip = self.skip(input)
-        return (out + skip) / math.sqrt(2)
-
-
-class EncoderApp(torch.nn.Module):
-    def __init__(self, w_dim=512, dtype=None, device=None, operations=None):
-        super().__init__()
-        kwargs = {"device": device, "dtype": dtype, "operations": operations}
-
-        self.convs = torch.nn.ModuleList([
-            ConvLayer(3, 32, 1, **kwargs), ResBlock(32, 64, **kwargs),
-            ResBlock(64, 128, **kwargs), ResBlock(128, 256, **kwargs),
-            ResBlock(256, 512, **kwargs), ResBlock(512, 512, **kwargs),
-            ResBlock(512, 512, **kwargs), ResBlock(512, 512, **kwargs),
-            EqualConv2d(512, w_dim, 4, padding=0, bias=False, **kwargs)
-        ])
-
-    def forward(self, x):
-        h = x
-        for conv in self.convs:
-            h = conv(h)
-        return h.squeeze(-1).squeeze(-1)
-
-class Encoder(torch.nn.Module):
-    def __init__(self, dim=512, motion_dim=20, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.net_app = EncoderApp(dim, dtype=dtype, device=device, operations=operations)
-        self.fc = torch.nn.Sequential(*[EqualLinear(dim, dim, dtype=dtype, device=device, operations=operations) for _ in range(4)] + [EqualLinear(dim, motion_dim, dtype=dtype, device=device, operations=operations)])
-
-    def encode_motion(self, x):
-        return self.fc(self.net_app(x))
-
-class Direction(torch.nn.Module):
-    def __init__(self, motion_dim, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.weight = torch.nn.Parameter(torch.empty(512, motion_dim, device=device, dtype=dtype))
-        self.motion_dim = motion_dim
-
-    def forward(self, input):
-        stabilized_weight = comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) + 1e-8 * torch.eye(512, self.motion_dim, device=input.device, dtype=input.dtype)
-        Q, _ = torch.linalg.qr(stabilized_weight.float())
-        if input is None:
-            return Q
-        return torch.sum(input.unsqueeze(-1) * Q.T.to(input.dtype), dim=1)
-
-class Synthesis(torch.nn.Module):
-    def __init__(self, motion_dim, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.direction = Direction(motion_dim, dtype=dtype, device=device, operations=operations)
-
-class Generator(torch.nn.Module):
-    def __init__(self, style_dim=512, motion_dim=20, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.enc = Encoder(style_dim, motion_dim, dtype=dtype, device=device, operations=operations)
-        self.dec = Synthesis(motion_dim, dtype=dtype, device=device, operations=operations)
-
-    def get_motion(self, img):
-        motion_feat = self.enc.encode_motion(img)
-        return self.dec.direction(motion_feat)
-
-class AnimateWanModel(WanModel):
-    r"""
-    Wan diffusion backbone supporting both text-to-video and image-to-video.
-    """
-
-    def __init__(self,
-                 model_type='animate',
-                 patch_size=(1, 2, 2),
-                 text_len=512,
-                 in_dim=16,
-                 dim=2048,
-                 ffn_dim=8192,
-                 freq_dim=256,
-                 text_dim=4096,
-                 out_dim=16,
-                 num_heads=16,
-                 num_layers=32,
-                 window_size=(-1, -1),
-                 qk_norm=True,
-                 cross_attn_norm=True,
-                 eps=1e-6,
-                 flf_pos_embed_token_number=None,
-                 motion_encoder_dim=512,
-                 image_model=None,
-                 device=None,
-                 dtype=None,
-                 operations=None,
-                 ):
-
-        super().__init__(model_type='i2v', patch_size=patch_size, text_len=text_len, in_dim=in_dim, dim=dim, ffn_dim=ffn_dim, freq_dim=freq_dim, text_dim=text_dim, out_dim=out_dim, num_heads=num_heads, num_layers=num_layers, window_size=window_size, qk_norm=qk_norm, cross_attn_norm=cross_attn_norm, eps=eps, flf_pos_embed_token_number=flf_pos_embed_token_number, image_model=image_model, device=device, dtype=dtype, operations=operations)
-
-        self.pose_patch_embedding = operations.Conv3d(
-            16, dim, kernel_size=patch_size, stride=patch_size, device=device, dtype=dtype
-        )
-
-        self.motion_encoder = Generator(style_dim=512, motion_dim=20, device=device, dtype=dtype, operations=operations)
-
-        self.face_adapter = FaceAdapter(
-            heads_num=self.num_heads,
-            hidden_dim=self.dim,
-            num_adapter_layers=self.num_layers // 5,
-            device=device, dtype=dtype, operations=operations
-        )
-
-        self.face_encoder = FaceEncoder(
-            in_dim=motion_encoder_dim,
-            hidden_dim=self.dim,
-            num_heads=4,
-            device=device, dtype=dtype, operations=operations
-        )
-
-    def after_patch_embedding(self, x, pose_latents, face_pixel_values):
-        if pose_latents is not None:
-            pose_latents = self.pose_patch_embedding(pose_latents)
-            x[:, :, 1:pose_latents.shape[2] + 1] += pose_latents[:, :, :x.shape[2] - 1]
-
-        if face_pixel_values is None:
-            return x, None
-
-        b, c, T, h, w = face_pixel_values.shape
-        face_pixel_values = rearrange(face_pixel_values, "b c t h w -> (b t) c h w")
-        encode_bs = 8
-        face_pixel_values_tmp = []
-        for i in range(math.ceil(face_pixel_values.shape[0] / encode_bs)):
-            face_pixel_values_tmp.append(self.motion_encoder.get_motion(face_pixel_values[i * encode_bs: (i + 1) * encode_bs]))
-
-        motion_vec = torch.cat(face_pixel_values_tmp)
-
-        motion_vec = rearrange(motion_vec, "(b t) c -> b t c", t=T)
-        motion_vec = self.face_encoder(motion_vec)
-
-        B, L, H, C = motion_vec.shape
-        pad_face = torch.zeros(B, 1, H, C).type_as(motion_vec)
-        motion_vec = torch.cat([pad_face, motion_vec], dim=1)
-
-        if motion_vec.shape[1] < x.shape[2]:
-            B, L, H, C = motion_vec.shape
-            pad = torch.zeros(B, x.shape[2] - motion_vec.shape[1], H, C).type_as(motion_vec)
-            motion_vec = torch.cat([motion_vec, pad], dim=1)
-        else:
-            motion_vec = motion_vec[:, :x.shape[2]]
-        return x, motion_vec
-
-    def forward_orig(
-        self,
-        x,
-        t,
-        context,
-        clip_fea=None,
-        pose_latents=None,
-        face_pixel_values=None,
-        freqs=None,
-        transformer_options={},
-        **kwargs,
-    ):
-        # embeddings
-        x = self.patch_embedding(x.float()).to(x.dtype)
-        x, motion_vec = self.after_patch_embedding(x, pose_latents, face_pixel_values)
-        grid_sizes = x.shape[2:]
-        x = x.flatten(2).transpose(1, 2)
-
-        # time embeddings
-        e = self.time_embedding(
-            sinusoidal_embedding_1d(self.freq_dim, t.flatten()).to(dtype=x[0].dtype))
-        e = e.reshape(t.shape[0], -1, e.shape[-1])
-        e0 = self.time_projection(e).unflatten(2, (6, self.dim))
-
-        full_ref = None
-        if self.ref_conv is not None:
-            full_ref = kwargs.get("reference_latent", None)
-            if full_ref is not None:
-                full_ref = self.ref_conv(full_ref).flatten(2).transpose(1, 2)
-                x = torch.concat((full_ref, x), dim=1)
-
-        # context
-        context = self.text_embedding(context)
-
-        context_img_len = None
-        if clip_fea is not None:
-            if self.img_emb is not None:
-                context_clip = self.img_emb(clip_fea)  # bs x 257 x dim
-                context = torch.concat([context_clip, context], dim=1)
-            context_img_len = clip_fea.shape[-2]
-
-        patches_replace = transformer_options.get("patches_replace", {})
-        blocks_replace = patches_replace.get("dit", {})
-        for i, block in enumerate(self.blocks):
-            if ("double_block", i) in blocks_replace:
-                def block_wrap(args):
-                    out = {}
-                    out["img"] = block(args["img"], context=args["txt"], e=args["vec"], freqs=args["pe"], context_img_len=context_img_len, transformer_options=args["transformer_options"])
-                    return out
-                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": e0, "pe": freqs, "transformer_options": transformer_options}, {"original_block": block_wrap})
-                x = out["img"]
-            else:
-                x = block(x, e=e0, freqs=freqs, context=context, context_img_len=context_img_len, transformer_options=transformer_options)
-
-            if i % 5 == 0 and motion_vec is not None:
-                x = x + self.face_adapter.fuser_blocks[i // 5](x, motion_vec)
-
-        # head
-        x = self.head(x, e)
-
-        if full_ref is not None:
-            x = x[:, full_ref.shape[1]:]
-
-        # unpatchify
-        x = self.unpatchify(x, grid_sizes)
-        return x
--- a/comfy/ldm/wan/vae.py
+++ b/comfy/ldm/wan/vae.py
@@ -468,46 +468,55 @@ class WanVAE(nn.Module):
                                 attn_scales, self.temperal_upsample, dropout)

    def encode(self, x):
-        conv_idx = [0]
-        feat_map = [None] * count_conv3d(self.decoder)
+        self.clear_cache()
        ## cache
        t = x.shape[2]
        iter_ = 1 + (t - 1) // 4
        ## Split the encode input x along the time axis into chunks of 1, 4, 4, 4, ...
        for i in range(iter_):
-            conv_idx = [0]
+            self._enc_conv_idx = [0]
            if i == 0:
                out = self.encoder(
                    x[:, :, :1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx)
+                    feat_cache=self._enc_feat_map,
+                    feat_idx=self._enc_conv_idx)
            else:
                out_ = self.encoder(
                    x[:, :, 1 + 4 * (i - 1):1 + 4 * i, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx)
+                    feat_cache=self._enc_feat_map,
+                    feat_idx=self._enc_conv_idx)
                out = torch.cat([out, out_], 2)
        mu, log_var = self.conv1(out).chunk(2, dim=1)
+        self.clear_cache()
        return mu

    def decode(self, z):
-        conv_idx = [0]
-        feat_map = [None] * count_conv3d(self.decoder)
+        self.clear_cache()
        # z: [b,c,t,h,w]

        iter_ = z.shape[2]
        x = self.conv2(z)
        for i in range(iter_):
-            conv_idx = [0]
+            self._conv_idx = [0]
            if i == 0:
                out = self.decoder(
                    x[:, :, i:i + 1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx)
+                    feat_cache=self._feat_map,
+                    feat_idx=self._conv_idx)
            else:
                out_ = self.decoder(
                    x[:, :, i:i + 1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx)
+                    feat_cache=self._feat_map,
+                    feat_idx=self._conv_idx)
                out = torch.cat([out, out_], 2)
+        self.clear_cache()
        return out
+
+    def clear_cache(self):
+        self._conv_num = count_conv3d(self.decoder)
+        self._conv_idx = [0]
+        self._feat_map = [None] * self._conv_num
+        #cache encode
+        self._enc_conv_num = count_conv3d(self.encoder)
+        self._enc_conv_idx = [0]
+        self._enc_feat_map = [None] * self._enc_conv_num
--- a/comfy/ldm/wan/vae2_2.py
+++ b/comfy/ldm/wan/vae2_2.py
@@ -657,51 +657,51 @@ class WanVAE(nn.Module):
        )

    def encode(self, x):
-        conv_idx = [0]
-        feat_map = [None] * count_conv3d(self.encoder)
+        self.clear_cache()
        x = patchify(x, patch_size=2)
        t = x.shape[2]
        iter_ = 1 + (t - 1) // 4
        for i in range(iter_):
-            conv_idx = [0]
+            self._enc_conv_idx = [0]
            if i == 0:
                out = self.encoder(
                    x[:, :, :1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx,
+                    feat_cache=self._enc_feat_map,
+                    feat_idx=self._enc_conv_idx,
                )
            else:
                out_ = self.encoder(
                    x[:, :, 1 + 4 * (i - 1):1 + 4 * i, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx,
+                    feat_cache=self._enc_feat_map,
+                    feat_idx=self._enc_conv_idx,
                )
                out = torch.cat([out, out_], 2)
        mu, log_var = self.conv1(out).chunk(2, dim=1)
+        self.clear_cache()
        return mu

    def decode(self, z):
-        conv_idx = [0]
-        feat_map = [None] * count_conv3d(self.decoder)
+        self.clear_cache()
        iter_ = z.shape[2]
        x = self.conv2(z)
        for i in range(iter_):
-            conv_idx = [0]
+            self._conv_idx = [0]
            if i == 0:
                out = self.decoder(
                    x[:, :, i:i + 1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx,
+                    feat_cache=self._feat_map,
+                    feat_idx=self._conv_idx,
                    first_chunk=True,
                )
            else:
                out_ = self.decoder(
                    x[:, :, i:i + 1, :, :],
-                    feat_cache=feat_map,
-                    feat_idx=conv_idx,
+                    feat_cache=self._feat_map,
+                    feat_idx=self._conv_idx,
                )
                out = torch.cat([out, out_], 2)
        out = unpatchify(out, patch_size=2)
+        self.clear_cache()
        return out

    def reparameterize(self, mu, log_var):
@@ -715,3 +715,12 @@ class WanVAE(nn.Module):
            return mu
        std = torch.exp(0.5 * log_var.clamp(-30.0, 20.0))
        return mu + std * torch.randn_like(std)
+
+    def clear_cache(self):
+        self._conv_num = count_conv3d(self.decoder)
+        self._conv_idx = [0]
+        self._feat_map = [None] * self._conv_num
+        # cache encode
+        self._enc_conv_num = count_conv3d(self.encoder)
+        self._enc_conv_idx = [0]
+        self._enc_feat_map = [None] * self._enc_conv_num
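Note on the chunk schedule in `encode` above: the loop feeds the encoder a single frame first, then groups of four frames. The index arithmetic can be sketched in isolation as follows. The helper name `encode_chunk_bounds` is hypothetical and does not exist in the repository; it only mirrors the slices shown in the diff.

```python
def encode_chunk_bounds(t):
    """Return the (start, stop) frame slices in the order encode() visits them.

    Hypothetical helper for illustration only. It mirrors the slicing
    x[:, :, :1] and x[:, :, 1 + 4 * (i - 1):1 + 4 * i] from the diff.
    """
    iter_ = 1 + (t - 1) // 4  # same iteration count as in encode()
    bounds = []
    for i in range(iter_):
        if i == 0:
            bounds.append((0, 1))  # first chunk: one frame
        else:
            bounds.append((1 + 4 * (i - 1), 1 + 4 * i))  # later chunks: four frames
    return bounds


print(encode_chunk_bounds(9))  # [(0, 1), (1, 5), (5, 9)]
```

For frame counts that are not of the form 4k + 1, the trailing frames fall outside every slice: `encode_chunk_bounds(6)` yields only `(0, 1)` and `(1, 5)`.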
--- a/comfy/lora.py
+++ b/comfy/lora.py
@@ -260,10 +260,6 @@ def model_lora_keys_unet(model, key_map={}):
                key_map["transformer.{}".format(k[:-len(".weight")])] = to #simpletrainer and probably regular diffusers flux lora format
                key_map["lycoris_{}".format(k[:-len(".weight")].replace(".", "_"))] = to #simpletrainer lycoris
                key_map["lora_transformer_{}".format(k[:-len(".weight")].replace(".", "_"))] = to #onetrainer
-        for k in sdk:
-            hidden_size = model.model_config.unet_config.get("hidden_size", 0)
-            if k.endswith(".weight") and ".linear1." in k:
-                key_map["{}".format(k.replace(".linear1.weight", ".linear1_qkv"))] = (k, (0, 0, hidden_size * 3))

    if isinstance(model, comfy.model_base.GenmoMochi):
        for k in sdk:
@@ -297,12 +293,6 @@ def model_lora_keys_unet(model, key_map={}):
                key_lora = k[len("diffusion_model."):-len(".weight")]
                key_map["{}".format(key_lora)] = k

-    if isinstance(model, comfy.model_base.Omnigen2):
-        for k in sdk:
-            if k.startswith("diffusion_model.") and k.endswith(".weight"):
-                key_lora = k[len("diffusion_model."):-len(".weight")]
-                key_map["{}".format(key_lora)] = k
-
    if isinstance(model, comfy.model_base.QwenImage):
        for k in sdk:
            if k.startswith("diffusion_model.") and k.endswith(".weight"): #QwenImage lora format
--- a/comfy/lora_convert.py
+++ b/comfy/lora_convert.py
@@ -15,29 +15,10 @@ def convert_lora_bfl_control(sd): #BFL loras for Flux
 def convert_lora_wan_fun(sd): #Wan Fun loras
    return comfy.utils.state_dict_prefix_replace(sd, {"lora_unet__": "lora_unet_"})

-def convert_uso_lora(sd):
-    sd_out = {}
-    for k in sd:
-        tensor = sd[k]
-        k_to = "diffusion_model.{}".format(k.replace(".down.weight", ".lora_down.weight")
-                                           .replace(".up.weight", ".lora_up.weight")
-                                           .replace(".qkv_lora2.", ".txt_attn.qkv.")
-                                           .replace(".qkv_lora1.", ".img_attn.qkv.")
-                                           .replace(".proj_lora1.", ".img_attn.proj.")
-                                           .replace(".proj_lora2.", ".txt_attn.proj.")
-                                           .replace(".qkv_lora.", ".linear1_qkv.")
-                                           .replace(".proj_lora.", ".linear2.")
-                                           .replace(".processor.", ".")
-                                           )
-        sd_out[k_to] = tensor
-    return sd_out
-

 def convert_lora(sd):
    if "img_in.lora_A.weight" in sd and "single_blocks.0.norm.key_norm.scale" in sd:
        return convert_lora_bfl_control(sd)
    if "lora_unet__blocks_0_cross_attn_k.lora_down.weight" in sd:
        return convert_lora_wan_fun(sd)
-    if "single_blocks.37.processor.qkv_lora.up.weight" in sd and "double_blocks.18.processor.qkv_lora2.up.weight" in sd:
-        return convert_uso_lora(sd)
    return sd
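Note on the removed `convert_uso_lora`: its key translation is a fixed chain of string replacements, applied in order. A minimal sketch of that chain on single keys follows. The function name `rename_uso_key` is hypothetical; the replacement pairs are copied from the removed lines above, and the sample keys are the two that `convert_lora` used for detection.

```python
# Replacement pairs copied from the removed convert_uso_lora, in the same order.
USO_REPLACEMENTS = [
    (".down.weight", ".lora_down.weight"),
    (".up.weight", ".lora_up.weight"),
    (".qkv_lora2.", ".txt_attn.qkv."),
    (".qkv_lora1.", ".img_attn.qkv."),
    (".proj_lora1.", ".img_attn.proj."),
    (".proj_lora2.", ".txt_attn.proj."),
    (".qkv_lora.", ".linear1_qkv."),
    (".proj_lora.", ".linear2."),
    (".processor.", "."),
]


def rename_uso_key(key):
    """Hypothetical helper: translate one USO LoRA key the way the removed code did."""
    for old, new in USO_REPLACEMENTS:
        key = key.replace(old, new)
    return "diffusion_model.{}".format(key)


print(rename_uso_key("single_blocks.37.processor.qkv_lora.up.weight"))
print(rename_uso_key("double_blocks.18.processor.qkv_lora2.up.weight"))
```

The first key maps to `diffusion_model.single_blocks.37.linear1_qkv.lora_up.weight`, which is the `.linear1_qkv` name that the block removed from `comfy/lora.py` registered in `key_map`.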
--- a/comfy/model_base.py
+++ b/comfy/model_base.py
@@ -16,8 +16,6 @@
    along with this program.  If not, see <https://www.gnu.org/licenses/>.
 """

-import comfy.ldm.hunyuan3dv2_1
-import comfy.ldm.hunyuan3dv2_1.hunyuandit
 import torch
 import logging
 from comfy.ldm.modules.diffusionmodules.openaimodel import UNetModel, Timestep
@@ -39,11 +37,9 @@ import comfy.ldm.cosmos.model
 import comfy.ldm.cosmos.predict2
 import comfy.ldm.lumina.model
 import comfy.ldm.wan.model
-import comfy.ldm.wan.model_animate
 import comfy.ldm.hunyuan3d.model
 import comfy.ldm.hidream.model
 import comfy.ldm.chroma.model
-import comfy.ldm.chroma_radiance.model
 import comfy.ldm.ace.model
 import comfy.ldm.omnigen.omnigen2
 import comfy.ldm.qwen_image.model
@@ -134,11 +130,10 @@ class BaseModel(torch.nn.Module):
        if not unet_config.get("disable_unet_model_creation", False):
            if model_config.custom_operations is None:
                fp8 = model_config.optimizations.get("fp8", False)
-                operations = comfy.ops.pick_operations(unet_config.get("dtype", None), self.manual_cast_dtype, fp8_optimizations=fp8, scaled_fp8=model_config.scaled_fp8, model_config=model_config)
+                operations = comfy.ops.pick_operations(unet_config.get("dtype", None), self.manual_cast_dtype, fp8_optimizations=fp8, scaled_fp8=model_config.scaled_fp8)
            else:
                operations = model_config.custom_operations
            self.diffusion_model = unet_model(**unet_config, device=device, operations=operations)
-            self.diffusion_model.eval()
            if comfy.model_management.force_channels_last():
                self.diffusion_model.to(memory_format=torch.channels_last)
                logging.debug("using channels last mode for diffusion model")
@@ -155,7 +150,6 @@ class BaseModel(torch.nn.Module):
        logging.debug("adm {}".format(self.adm_channels))
        self.memory_usage_factor = model_config.memory_usage_factor
        self.memory_usage_factor_conds = ()
-        self.memory_usage_shape_process = {}

    def apply_model(self, x, t, c_concat=None, c_crossattn=None, control=None, transformer_options={}, **kwargs):
        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
@@ -197,14 +191,8 @@ class BaseModel(torch.nn.Module):
            extra_conds[o] = extra

        t = self.process_timestep(t, x=x, **extra_conds)
-        if "latent_shapes" in extra_conds:
-            xc = utils.unpack_latents(xc, extra_conds.pop("latent_shapes"))
-
-        model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds)
-        if len(model_output) > 1 and not torch.is_tensor(model_output):
-            model_output, _ = utils.pack_latents(model_output)
-
-        return self.model_sampling.calculate_denoised(sigma, model_output.float(), x)
+        model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
+        return self.model_sampling.calculate_denoised(sigma, model_output, x)

    def process_timestep(self, timestep, **kwargs):
        return timestep
@@ -333,14 +321,6 @@ class BaseModel(torch.nn.Module):
        if self.model_config.scaled_fp8 is not None:
            unet_state_dict["scaled_fp8"] = torch.tensor([], dtype=self.model_config.scaled_fp8)

-        # Save mixed precision metadata
-        if hasattr(self.model_config, 'layer_quant_config') and self.model_config.layer_quant_config:
-            metadata = {
-                "format_version": "1.0",
-                "layers": self.model_config.layer_quant_config
-            }
-            unet_state_dict["_quantization_metadata"] = metadata
-
        unet_state_dict = self.model_config.process_unet_state_dict_for_saving(unet_state_dict)

        if self.model_type == ModelType.V_PREDICTION:
@@ -370,15 +350,8 @@ class BaseModel(torch.nn.Module):
        input_shapes = [input_shape]
        for c in self.memory_usage_factor_conds:
            shape = cond_shapes.get(c, None)
-            if shape is not None:
-                if c in self.memory_usage_shape_process:
-                    out = []
-                    for s in shape:
-                        out.append(self.memory_usage_shape_process[c](s))
-                    shape = out
-
-                if len(shape) > 0:
-                    input_shapes += shape
+            if shape is not None and len(shape) > 0:
+                input_shapes += shape

        if comfy.model_management.xformers_enabled() or comfy.model_management.pytorch_attention_flash_attention():
            dtype = self.get_dtype()
@@ -684,6 +657,7 @@ class Lotus(BaseModel):
 class StableCascade_C(BaseModel):
    def __init__(self, model_config, model_type=ModelType.STABLE_CASCADE, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=StageC)
+        self.diffusion_model.eval().requires_grad_(False)

    def extra_conds(self, **kwargs):
        out = {}
@@ -712,6 +686,7 @@ class StableCascade_C(BaseModel):
 class StableCascade_B(BaseModel):
    def __init__(self, model_config, model_type=ModelType.STABLE_CASCADE, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=StageB)
+        self.diffusion_model.eval().requires_grad_(False)

    def extra_conds(self, **kwargs):
        out = {}
@@ -898,13 +873,12 @@ class Flux(BaseModel):
        attention_mask = kwargs.get("attention_mask", None)
        if attention_mask is not None:
            shape = kwargs["noise"].shape
-            mask_ref_size = kwargs.get("attention_mask_img_shape", None)
-            if mask_ref_size is not None:
-                # the model will pad to the patch size, and then divide
-                # essentially dividing and rounding up
-                (h_tok, w_tok) = (math.ceil(shape[2] / self.diffusion_model.patch_size), math.ceil(shape[3] / self.diffusion_model.patch_size))
-                attention_mask = utils.upscale_dit_mask(attention_mask, mask_ref_size, (h_tok, w_tok))
-                out['attention_mask'] = comfy.conds.CONDRegular(attention_mask)
+            mask_ref_size = kwargs["attention_mask_img_shape"]
+            # the model will pad to the patch size, and then divide
+            # essentially dividing and rounding up
+            (h_tok, w_tok) = (math.ceil(shape[2] / self.diffusion_model.patch_size), math.ceil(shape[3] / self.diffusion_model.patch_size))
+            attention_mask = utils.upscale_dit_mask(attention_mask, mask_ref_size, (h_tok, w_tok))
+            out['attention_mask'] = comfy.conds.CONDRegular(attention_mask)

        guidance = kwargs.get("guidance", 3.5)
        if guidance is not None:
@@ -929,16 +903,6 @@ class Flux(BaseModel):
            out['ref_latents'] = list([1, 16, sum(map(lambda a: math.prod(a.size()), ref_latents)) // 16])
        return out

-class Flux2(Flux):
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        cross_attn = kwargs.get("cross_attn", None)
-        if cross_attn is not None:
-            target_text_len = 512
-            if cross_attn.shape[1] < target_text_len:
-                cross_attn = torch.nn.functional.pad(cross_attn, (0, 0, target_text_len - cross_attn.shape[1], 0))
-            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-        return out

 class GenmoMochi(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
@@ -1114,13 +1078,9 @@ class Lumina2(BaseModel):
            if torch.numel(attention_mask) != attention_mask.sum():
                out['attention_mask'] = comfy.conds.CONDRegular(attention_mask)
            out['num_tokens'] = comfy.conds.CONDConstant(max(1, torch.sum(attention_mask).item()))
-
        cross_attn = kwargs.get("cross_attn", None)
        if cross_attn is not None:
            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-            if 'num_tokens' not in out:
-                out['num_tokens'] = comfy.conds.CONDConstant(cross_attn.shape[1])
-
        return out

 class WAN21(BaseModel):
@@ -1142,10 +1102,9 @@ class WAN21(BaseModel):
            shape_image[1] = extra_channels
            image = torch.zeros(shape_image, dtype=noise.dtype, layout=noise.layout, device=noise.device)
        else:
-            latent_dim = self.latent_format.latent_channels
            image = utils.common_upscale(image.to(device), noise.shape[-1], noise.shape[-2], "bilinear", "center")
-            for i in range(0, image.shape[1], latent_dim):
-                image[:, i: i + latent_dim] = self.process_latent_in(image[:, i: i + latent_dim])
+            for i in range(0, image.shape[1], 16):
+                image[:, i: i + 16] = self.process_latent_in(image[:, i: i + 16])
            image = utils.resize_to_batch_size(image, noise.shape[0])

        if extra_channels != image.shape[1] + 4:
@@ -1242,107 +1201,18 @@ class WAN21_Camera(WAN21):
            out['camera_conditions'] = comfy.conds.CONDRegular(camera_conditions)
        return out

-class WAN21_HuMo(WAN21):
+class WAN22(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLOW, image_to_video=False, device=None):
-        super(WAN21, self).__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model.HumoWanModel)
+        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model.WanModel)
        self.image_to_video = image_to_video

    def extra_conds(self, **kwargs):
        out = super().extra_conds(**kwargs)
-        noise = kwargs.get("noise", None)
+        cross_attn = kwargs.get("cross_attn", None)
+        if cross_attn is not None:
+            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)

-        audio_embed = kwargs.get("audio_embed", None)
-        if audio_embed is not None:
-            out['audio_embed'] = comfy.conds.CONDRegular(audio_embed)
-
-        if "c_concat" not in out:  # 1.7B model
-            reference_latents = kwargs.get("reference_latents", None)
-            if reference_latents is not None:
-                out['reference_latent'] = comfy.conds.CONDRegular(self.process_latent_in(reference_latents[-1]))
-        else:
-            noise_shape = list(noise.shape)
-            noise_shape[1] += 4
-            concat_latent = torch.zeros(noise_shape, device=noise.device, dtype=noise.dtype)
-            zero_vae_values_first = torch.tensor([0.8660, -0.4326, -0.0017, -0.4884, -0.5283, 0.9207, -0.9896, 0.4433, -0.5543, -0.0113, 0.5753, -0.6000, -0.8346, -0.3497, -0.1926, -0.6938]).view(1, 16, 1, 1, 1)
-            zero_vae_values_second = torch.tensor([1.0869, -1.2370, 0.0206, -0.4357, -0.6411, 2.0307, -1.5972, 1.2659, -0.8595, -0.4654, 0.9638, -1.6330, -1.4310, -0.1098, -0.3856, -1.4583]).view(1, 16, 1, 1, 1)
-            zero_vae_values = torch.tensor([0.8642, -1.8583, 0.1577, 0.1350, -0.3641, 2.5863, -1.9670, 1.6065, -1.0475, -0.8678, 1.1734, -1.8138, -1.5933, -0.7721, -0.3289, -1.3745]).view(1, 16, 1, 1, 1)
-            concat_latent[:, 4:] = zero_vae_values
-            concat_latent[:, 4:, :1] = zero_vae_values_first
-            concat_latent[:, 4:, 1:2] = zero_vae_values_second
-            out['c_concat'] = comfy.conds.CONDNoiseShape(concat_latent)
-            reference_latents = kwargs.get("reference_latents", None)
-            if reference_latents is not None:
-                ref_latent = self.process_latent_in(reference_latents[-1])
-                ref_latent_shape = list(ref_latent.shape)
-                ref_latent_shape[1] += 4 + ref_latent_shape[1]
-                ref_latent_full = torch.zeros(ref_latent_shape, device=ref_latent.device, dtype=ref_latent.dtype)
-                ref_latent_full[:, 20:] = ref_latent
-                ref_latent_full[:, 16:20] = 1.0
-                out['reference_latent'] = comfy.conds.CONDRegular(ref_latent_full)
-
-        return out
-
-class WAN22_Animate(WAN21):
-    def __init__(self, model_config, model_type=ModelType.FLOW, image_to_video=False, device=None):
-        super(WAN21, self).__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model_animate.AnimateWanModel)
-        self.image_to_video = image_to_video
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-
-        face_video_pixels = kwargs.get("face_video_pixels", None)
-        if face_video_pixels is not None:
-            out['face_pixel_values'] = comfy.conds.CONDRegular(face_video_pixels)
-
-        pose_latents = kwargs.get("pose_video_latent", None)
-        if pose_latents is not None:
-            out['pose_latents'] = comfy.conds.CONDRegular(self.process_latent_in(pose_latents))
-        return out
-
-class WAN22_S2V(WAN21):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super(WAN21, self).__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model.WanModel_S2V)
-        self.memory_usage_factor_conds = ("reference_latent", "reference_motion")
-        self.memory_usage_shape_process = {"reference_motion": lambda shape: [shape[0], shape[1], 1.5, shape[-2], shape[-1]]}
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        audio_embed = kwargs.get("audio_embed", None)
-        if audio_embed is not None:
-            out['audio_embed'] = comfy.conds.CONDRegular(audio_embed)
-
-        reference_latents = kwargs.get("reference_latents", None)
-        if reference_latents is not None:
-            out['reference_latent'] = comfy.conds.CONDRegular(self.process_latent_in(reference_latents[-1]))
-
-        reference_motion = kwargs.get("reference_motion", None)
-        if reference_motion is not None:
-            out['reference_motion'] = comfy.conds.CONDRegular(self.process_latent_in(reference_motion))
-
-        control_video = kwargs.get("control_video", None)
-        if control_video is not None:
-            out['control_video'] = comfy.conds.CONDRegular(self.process_latent_in(control_video))
-        return out
-
-    def extra_conds_shapes(self, **kwargs):
-        out = {}
-        ref_latents = kwargs.get("reference_latents", None)
-        if ref_latents is not None:
-            out['reference_latent'] = list([1, 16, sum(map(lambda a: math.prod(a.size()), ref_latents)) // 16])
-
-        reference_motion = kwargs.get("reference_motion", None)
-        if reference_motion is not None:
-            out['reference_motion'] = reference_motion.shape
-        return out
-
-class WAN22(WAN21):
-    def __init__(self, model_config, model_type=ModelType.FLOW, image_to_video=False, device=None):
-        super(WAN21, self).__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model.WanModel)
-        self.image_to_video = image_to_video
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        denoise_mask = kwargs.get("denoise_mask", None)
+        denoise_mask = kwargs.get("concat_mask", kwargs.get("denoise_mask", None))
        if denoise_mask is not None:
            out["denoise_mask"] = comfy.conds.CONDRegular(denoise_mask)
        return out
@@ -1371,21 +1241,6 @@ class Hunyuan3Dv2(BaseModel):
            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
        return out

-class Hunyuan3Dv2_1(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.hunyuan3dv2_1.hunyuandit.HunYuanDiTPlain)
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        cross_attn = kwargs.get("cross_attn", None)
-        if cross_attn is not None:
-            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-
-        guidance = kwargs.get("guidance", 5.0)
-        if guidance is not None:
-            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
-        return out
-
 class HiDream(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.hidream.model.HiDreamImageTransformer2DModel)
@@ -1407,8 +1262,8 @@ class HiDream(BaseModel):
        return out

 class Chroma(Flux):
-    def __init__(self, model_config, model_type=ModelType.FLUX, device=None, unet_model=comfy.ldm.chroma.model.Chroma):
-        super().__init__(model_config, model_type, device=device, unet_model=unet_model)
+    def __init__(self, model_config, model_type=ModelType.FLUX, device=None):
+        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.chroma.model.Chroma)

    def extra_conds(self, **kwargs):
        out = super().extra_conds(**kwargs)
@@ -1418,10 +1273,6 @@ class Chroma(Flux):
            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
        return out

-class ChromaRadiance(Chroma):
-    def __init__(self, model_config, model_type=ModelType.FLUX, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.chroma_radiance.model.ChromaRadiance)
-
 class ACEStep(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.ace.model.ACEStepTransformer2DModel)
@@ -1474,7 +1325,6 @@ class Omnigen2(BaseModel):
 class QwenImage(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLUX, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.qwen_image.model.QwenImageTransformer2DModel)
-        self.memory_usage_factor_conds = ("ref_latents",)

    def extra_conds(self, **kwargs):
        out = super().extra_conds(**kwargs)
@@ -1492,153 +1342,3 @@ class QwenImage(BaseModel):
            if ref_latents_method is not None:
                out['ref_latents_method'] = comfy.conds.CONDConstant(ref_latents_method)
        return out
-
-    def extra_conds_shapes(self, **kwargs):
-        out = {}
-        ref_latents = kwargs.get("reference_latents", None)
-        if ref_latents is not None:
-            out['ref_latents'] = list([1, 16, sum(map(lambda a: math.prod(a.size()), ref_latents)) // 16])
-        return out
-
-class HunyuanImage21(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.hunyuan_video.model.HunyuanVideo)
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        attention_mask = kwargs.get("attention_mask", None)
-        if attention_mask is not None:
-            if torch.numel(attention_mask) != attention_mask.sum():
-                out['attention_mask'] = comfy.conds.CONDRegular(attention_mask)
-        cross_attn = kwargs.get("cross_attn", None)
-        if cross_attn is not None:
-            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-
-        conditioning_byt5small = kwargs.get("conditioning_byt5small", None)
-        if conditioning_byt5small is not None:
-            out['txt_byt5'] = comfy.conds.CONDRegular(conditioning_byt5small)
-
-        guidance = kwargs.get("guidance", 6.0)
-        if guidance is not None:
-            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
-
-        return out
-
-class HunyuanImage21Refiner(HunyuanImage21):
-    def concat_cond(self, **kwargs):
-        noise = kwargs.get("noise", None)
-        image = kwargs.get("concat_latent_image", None)
-        noise_augmentation = kwargs.get("noise_augmentation", 0.0)
-        device = kwargs["device"]
-
-        if image is None:
-            shape_image = list(noise.shape)
-            image = torch.zeros(shape_image, dtype=noise.dtype, layout=noise.layout, device=noise.device)
-        else:
-            image = utils.common_upscale(image.to(device), noise.shape[-1], noise.shape[-2], "bilinear", "center")
-            image = self.process_latent_in(image)
-            image = utils.resize_to_batch_size(image, noise.shape[0])
-            if noise_augmentation > 0:
-                generator = torch.Generator(device="cpu")
-                generator.manual_seed(kwargs.get("seed", 0) - 10)
-                noise = torch.randn(image.shape, generator=generator, dtype=image.dtype, device="cpu").to(image.device)
-                image = noise_augmentation * noise + min(1.0 - noise_augmentation, 0.75) * image
-            else:
-                image = 0.75 * image
-        return image
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        out['disable_time_r'] = comfy.conds.CONDConstant(True)
-        return out
-
-class HunyuanVideo15(HunyuanVideo):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device)
-
-    def concat_cond(self, **kwargs):
-        noise = kwargs.get("noise", None)
-        extra_channels = self.diffusion_model.img_in.proj.weight.shape[1] - noise.shape[1] - 1 #noise 32 img cond 32 + mask 1
-        if extra_channels == 0:
-            return None
-
-        image = kwargs.get("concat_latent_image", None)
-        device = kwargs["device"]
-
-        if image is None:
-            shape_image = list(noise.shape)
-            shape_image[1] = extra_channels
-            image = torch.zeros(shape_image, dtype=noise.dtype, layout=noise.layout, device=noise.device)
-        else:
-            latent_dim = self.latent_format.latent_channels
-            image = utils.common_upscale(image.to(device), noise.shape[-1], noise.shape[-2], "bilinear", "center")
-            for i in range(0, image.shape[1], latent_dim):
-                image[:, i: i + latent_dim] = self.process_latent_in(image[:, i: i + latent_dim])
-            image = utils.resize_to_batch_size(image, noise.shape[0])
-
-        mask = kwargs.get("concat_mask", kwargs.get("denoise_mask", None))
-        if mask is None:
-            mask = torch.zeros_like(noise)[:, :1]
-        else:
-            mask = 1.0 - mask
-            mask = utils.common_upscale(mask.to(device), noise.shape[-1], noise.shape[-2], "bilinear", "center")
-            if mask.shape[-3] < noise.shape[-3]:
-                mask = torch.nn.functional.pad(mask, (0, 0, 0, 0, 0, noise.shape[-3] - mask.shape[-3]), mode='constant', value=0)
-            mask = utils.resize_to_batch_size(mask, noise.shape[0])
-
-        return torch.cat((image, mask), dim=1)
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        attention_mask = kwargs.get("attention_mask", None)
-        if attention_mask is not None:
-            if torch.numel(attention_mask) != attention_mask.sum():
-                out['attention_mask'] = comfy.conds.CONDRegular(attention_mask)
-        cross_attn = kwargs.get("cross_attn", None)
-        if cross_attn is not None:
-            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-
-        conditioning_byt5small = kwargs.get("conditioning_byt5small", None)
-        if conditioning_byt5small is not None:
-            out['txt_byt5'] = comfy.conds.CONDRegular(conditioning_byt5small)
-
-        guidance = kwargs.get("guidance", 6.0)
-        if guidance is not None:
-            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
-
-        clip_vision_output = kwargs.get("clip_vision_output", None)
-        if clip_vision_output is not None:
-            out['clip_fea'] = comfy.conds.CONDRegular(clip_vision_output.last_hidden_state)
-
-        return out
-
-class HunyuanVideo15_SR_Distilled(HunyuanVideo15):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device)
-
-    def concat_cond(self, **kwargs):
-        noise = kwargs.get("noise", None)
-        image = kwargs.get("concat_latent_image", None)
-        noise_augmentation = kwargs.get("noise_augmentation", 0.0)
-        device = kwargs["device"]
-
-        if image is None:
-            image = torch.zeros([noise.shape[0], noise.shape[1] * 2 + 2, noise.shape[-3], noise.shape[-2], noise.shape[-1]], device=comfy.model_management.intermediate_device())
-        else:
-            image = utils.common_upscale(image.to(device), noise.shape[-1], noise.shape[-2], "bilinear", "center")
-            #image = self.process_latent_in(image) # scaling wasn't applied in reference code
-            image = utils.resize_to_batch_size(image, noise.shape[0])
-            lq_image_slice = slice(noise.shape[1] + 1, 2 * noise.shape[1] + 1)
-            if noise_augmentation > 0:
-                generator = torch.Generator(device="cpu")
-                generator.manual_seed(kwargs.get("seed", 0) - 10)
-                noise = torch.randn(image[:, lq_image_slice].shape, generator=generator, dtype=image.dtype, device="cpu").to(image.device)
-                image[:, lq_image_slice] = noise_augmentation * noise + min(1.0 - noise_augmentation, 0.75) * image[:, lq_image_slice]
-            else:
-                image[:, lq_image_slice] = 0.75 * image[:, lq_image_slice]
-        return image
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        out['disable_time_r'] = comfy.conds.CONDConstant(False)
-        return out
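Note on the `Flux.extra_conds` hunk above: the token grid passed to `utils.upscale_dit_mask` is a ceiling division of the latent height and width by the patch size, as the in-code comment says ("dividing and rounding up"). A minimal sketch of that arithmetic follows. The helper name `token_grid` and the sample sizes are assumptions for illustration; the patch size of 2 is also an assumption, since the diff reads it from `self.diffusion_model.patch_size`.

```python
import math


def token_grid(height, width, patch_size):
    """Hypothetical helper: number of tokens along each axis after padding to patch_size."""
    return (math.ceil(height / patch_size), math.ceil(width / patch_size))


# An odd latent height is rounded up to the next whole patch.
print(token_grid(129, 64, 2))  # (65, 32)
```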
--- a/comfy/model_detection.py
+++ b/comfy/model_detection.py
@@ -6,20 +6,6 @@ import math
 import logging
 import torch

-
-def detect_layer_quantization(metadata):
-    quant_key = "_quantization_metadata"
-    if metadata is not None and quant_key in metadata:
-        quant_metadata = metadata.pop(quant_key)
-        quant_metadata = json.loads(quant_metadata)
-        if isinstance(quant_metadata, dict) and "layers" in quant_metadata:
-            logging.info(f"Found quantization metadata (version {quant_metadata.get('format_version', 'unknown')})")
-            return quant_metadata["layers"]
-        else:
-            raise ValueError("Invalid quantization metadata format")
-    return None
-
-
 def count_blocks(state_dict_keys, prefix_string):
    count = 0
    while True:
@@ -150,104 +136,46 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):

    if '{}txt_in.individual_token_refiner.blocks.0.norm1.weight'.format(key_prefix) in state_dict_keys: #Hunyuan Video
        dit_config = {}
-        in_w = state_dict['{}img_in.proj.weight'.format(key_prefix)]
-        out_w = state_dict['{}final_layer.linear.weight'.format(key_prefix)]
        dit_config["image_model"] = "hunyuan_video"
-        dit_config["in_channels"] = in_w.shape[1] #SkyReels img2video has 32 input channels
-        dit_config["patch_size"] = list(in_w.shape[2:])
-        dit_config["out_channels"] = out_w.shape[0] // math.prod(dit_config["patch_size"])
-        if any(s.startswith('{}vector_in.'.format(key_prefix)) for s in state_dict_keys):
-            dit_config["vec_in_dim"] = 768
-        else:
-            dit_config["vec_in_dim"] = None
-
-        if len(dit_config["patch_size"]) == 2:
-            dit_config["axes_dim"] = [64, 64]
-        else:
-            dit_config["axes_dim"] = [16, 56, 56]
-
-        if any(s.startswith('{}time_r_in.'.format(key_prefix)) for s in state_dict_keys):
-            dit_config["meanflow"] = True
-        else:
-            dit_config["meanflow"] = False
-
-        dit_config["context_in_dim"] = state_dict['{}txt_in.input_embedder.weight'.format(key_prefix)].shape[1]
-        dit_config["hidden_size"] = in_w.shape[0]
+        dit_config["in_channels"] = state_dict['{}img_in.proj.weight'.format(key_prefix)].shape[1] #SkyReels img2video has 32 input channels
+        dit_config["patch_size"] = [1, 2, 2]
+        dit_config["out_channels"] = 16
+        dit_config["vec_in_dim"] = 768
+        dit_config["context_in_dim"] = 4096
+        dit_config["hidden_size"] = 3072
        dit_config["mlp_ratio"] = 4.0
-        dit_config["num_heads"] = in_w.shape[0] // 128
+        dit_config["num_heads"] = 24
        dit_config["depth"] = count_blocks(state_dict_keys, '{}double_blocks.'.format(key_prefix) + '{}.')
        dit_config["depth_single_blocks"] = count_blocks(state_dict_keys, '{}single_blocks.'.format(key_prefix) + '{}.')
+        dit_config["axes_dim"] = [16, 56, 56]
        dit_config["theta"] = 256
        dit_config["qkv_bias"] = True
-        if '{}byt5_in.fc1.weight'.format(key_prefix) in state_dict:
-            dit_config["byt5"] = True
-        else:
-            dit_config["byt5"] = False
-
        guidance_keys = list(filter(lambda a: a.startswith("{}guidance_in.".format(key_prefix)), state_dict_keys))
        dit_config["guidance_embed"] = len(guidance_keys) > 0
-
-        # HunyuanVideo 1.5
-        if '{}cond_type_embedding.weight'.format(key_prefix) in state_dict_keys:
-            dit_config["use_cond_type_embedding"] = True
-        else:
-            dit_config["use_cond_type_embedding"] = False
-        if '{}vision_in.proj.0.weight'.format(key_prefix) in state_dict_keys:
-            dit_config["vision_in_dim"] = state_dict['{}vision_in.proj.0.weight'.format(key_prefix)].shape[0]
-        else:
-            dit_config["vision_in_dim"] = None
        return dit_config

-    if '{}double_blocks.0.img_attn.norm.key_norm.scale'.format(key_prefix) in state_dict_keys and ('{}img_in.weight'.format(key_prefix) in state_dict_keys or f"{key_prefix}distilled_guidance_layer.norms.0.scale" in state_dict_keys): #Flux, Chroma or Chroma Radiance (has no img_in.weight)
+    if '{}double_blocks.0.img_attn.norm.key_norm.scale'.format(key_prefix) in state_dict_keys and '{}img_in.weight'.format(key_prefix) in state_dict_keys: #Flux
        dit_config = {}
-        if '{}double_stream_modulation_img.lin.weight'.format(key_prefix) in state_dict_keys:
-            dit_config["image_model"] = "flux2"
-            dit_config["axes_dim"] = [32, 32, 32, 32]
-            dit_config["num_heads"] = 48
-            dit_config["mlp_ratio"] = 3.0
-            dit_config["theta"] = 2000
-            dit_config["out_channels"] = 128
-            dit_config["global_modulation"] = True
-            dit_config["vec_in_dim"] = None
-            dit_config["mlp_silu_act"] = True
-            dit_config["qkv_bias"] = False
-            dit_config["ops_bias"] = False
-            dit_config["default_ref_method"] = "index"
-            dit_config["ref_index_scale"] = 10.0
-            patch_size = 1
-        else:
-            dit_config["image_model"] = "flux"
-            dit_config["axes_dim"] = [16, 56, 56]
-            dit_config["num_heads"] = 24
-            dit_config["mlp_ratio"] = 4.0
-            dit_config["theta"] = 10000
-            dit_config["out_channels"] = 16
-            dit_config["qkv_bias"] = True
-            patch_size = 2
-
+        dit_config["image_model"] = "flux"
        dit_config["in_channels"] = 16
-        dit_config["hidden_size"] = 3072
-        dit_config["context_in_dim"] = 4096
-
+        patch_size = 2
        dit_config["patch_size"] = patch_size
        in_key = "{}img_in.weight".format(key_prefix)
        if in_key in state_dict_keys:
-            w = state_dict[in_key]
-            dit_config["in_channels"] = w.shape[1] // (patch_size * patch_size)
-            dit_config["hidden_size"] = w.shape[0]
-
-        txt_in_key = "{}txt_in.weight".format(key_prefix)
-        if txt_in_key in state_dict_keys:
-            w = state_dict[txt_in_key]
-            dit_config["context_in_dim"] = w.shape[1]
-            dit_config["hidden_size"] = w.shape[0]
-
+            dit_config["in_channels"] = state_dict[in_key].shape[1] // (patch_size * patch_size)
+        dit_config["out_channels"] = 16
        vec_in_key = '{}vector_in.in_layer.weight'.format(key_prefix)
        if vec_in_key in state_dict_keys:
            dit_config["vec_in_dim"] = state_dict[vec_in_key].shape[1]
-
+        dit_config["context_in_dim"] = 4096
+        dit_config["hidden_size"] = 3072
+        dit_config["mlp_ratio"] = 4.0
+        dit_config["num_heads"] = 24
        dit_config["depth"] = count_blocks(state_dict_keys, '{}double_blocks.'.format(key_prefix) + '{}.')
        dit_config["depth_single_blocks"] = count_blocks(state_dict_keys, '{}single_blocks.'.format(key_prefix) + '{}.')
+        dit_config["axes_dim"] = [16, 56, 56]
+        dit_config["theta"] = 10000
+        dit_config["qkv_bias"] = True
        if '{}distilled_guidance_layer.0.norms.0.scale'.format(key_prefix) in state_dict_keys or '{}distilled_guidance_layer.norms.0.scale'.format(key_prefix) in state_dict_keys: #Chroma
            dit_config["image_model"] = "chroma"
            dit_config["in_channels"] = 64
@@ -256,18 +184,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
            dit_config["out_dim"] = 3072
            dit_config["hidden_dim"] = 5120
            dit_config["n_layers"] = 5
-            if f"{key_prefix}nerf_blocks.0.norm.scale" in state_dict_keys: #Chroma Radiance
-                dit_config["image_model"] = "chroma_radiance"
-                dit_config["in_channels"] = 3
-                dit_config["out_channels"] = 3
-                dit_config["patch_size"] = 16
-                dit_config["nerf_hidden_size"] = 64
-                dit_config["nerf_mlp_ratio"] = 4
-                dit_config["nerf_depth"] = 4
-                dit_config["nerf_max_freqs"] = 8
-                dit_config["nerf_tile_size"] = 512
-                dit_config["nerf_final_head_type"] = "conv" if f"{key_prefix}nerf_final_layer_conv.norm.scale" in state_dict_keys else "linear"
-                dit_config["nerf_embedder_dtype"] = torch.float32
        else:
            dit_config["guidance_embed"] = "{}guidance_in.in_layer.weight".format(key_prefix) in state_dict_keys
        return dit_config
@@ -416,31 +332,14 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
        dit_config["image_model"] = "lumina2"
        dit_config["patch_size"] = 2
        dit_config["in_channels"] = 16
-        w = state_dict['{}cap_embedder.1.weight'.format(key_prefix)]
-        dit_config["dim"] = w.shape[0]
-        dit_config["cap_feat_dim"] = w.shape[1]
-        dit_config["n_layers"] = count_blocks(state_dict_keys, '{}layers.'.format(key_prefix) + '{}.')
+        dit_config["dim"] = 2304
+        dit_config["cap_feat_dim"] = 2304
+        dit_config["n_layers"] = 26
+        dit_config["n_heads"] = 24
+        dit_config["n_kv_heads"] = 8
        dit_config["qk_norm"] = True
-
-        if dit_config["dim"] == 2304: # Original Lumina 2
-            dit_config["n_heads"] = 24
-            dit_config["n_kv_heads"] = 8
-            dit_config["axes_dims"] = [32, 32, 32]
-            dit_config["axes_lens"] = [300, 512, 512]
-            dit_config["rope_theta"] = 10000.0
-            dit_config["ffn_dim_multiplier"] = 4.0
-        elif dit_config["dim"] == 3840:  # Z image
-            dit_config["n_heads"] = 30
-            dit_config["n_kv_heads"] = 30
-            dit_config["axes_dims"] = [32, 48, 48]
-            dit_config["axes_lens"] = [1536, 512, 512]
-            dit_config["rope_theta"] = 256.0
-            dit_config["ffn_dim_multiplier"] = (8.0 / 3.0)
-            dit_config["z_image_modulation"] = True
-            dit_config["time_scale"] = 1000.0
-            if '{}cap_pad_token'.format(key_prefix) in state_dict_keys:
-                dit_config["pad_tokens_multiple"] = 32
-
+        dit_config["axes_dims"] = [32, 32, 32]
+        dit_config["axes_lens"] = [300, 512, 512]
        return dit_config

    if '{}head.modulation'.format(key_prefix) in state_dict_keys:  # Wan 2.1
@@ -469,12 +368,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
                dit_config["model_type"] = "camera"
            else:
                dit_config["model_type"] = "camera_2.2"
-        elif '{}casual_audio_encoder.encoder.final_linear.weight'.format(key_prefix) in state_dict_keys:
-            dit_config["model_type"] = "s2v"
-        elif '{}audio_proj.audio_proj_glob_1.layer.bias'.format(key_prefix) in state_dict_keys:
-            dit_config["model_type"] = "humo"
-        elif '{}face_adapter.fuser_blocks.0.k_norm.weight'.format(key_prefix) in state_dict_keys:
-            dit_config["model_type"] = "animate"
        else:
            if '{}img_emb.proj.0.bias'.format(key_prefix) in state_dict_keys:
                dit_config["model_type"] = "i2v"
@@ -505,20 +398,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
        dit_config["guidance_embed"] = "{}guidance_in.in_layer.weight".format(key_prefix) in state_dict_keys
        return dit_config

-    if f"{key_prefix}t_embedder.mlp.2.weight" in state_dict_keys:  # Hunyuan 3D 2.1
-
-        dit_config = {}
-        dit_config["image_model"] = "hunyuan3d2_1"
-        dit_config["in_channels"] = state_dict[f"{key_prefix}x_embedder.weight"].shape[1]
-        dit_config["context_dim"] = 1024
-        dit_config["hidden_size"] = state_dict[f"{key_prefix}x_embedder.weight"].shape[0]
-        dit_config["mlp_ratio"] = 4.0
-        dit_config["num_heads"] = 16
-        dit_config["depth"] = count_blocks(state_dict_keys, f"{key_prefix}blocks.{{}}")
-        dit_config["qkv_bias"] = False
-        dit_config["guidance_cond_proj_dim"] = None#f"{key_prefix}t_embedder.cond_proj.weight" in state_dict_keys
-        return dit_config
-
    if '{}caption_projection.0.linear.weight'.format(key_prefix) in state_dict_keys:  # HiDream
        dit_config = {}
        dit_config["image_model"] = "hidream"
@@ -613,8 +492,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
    if '{}txt_norm.weight'.format(key_prefix) in state_dict_keys:  # Qwen Image
        dit_config = {}
        dit_config["image_model"] = "qwen_image"
-        dit_config["in_channels"] = state_dict['{}img_in.weight'.format(key_prefix)].shape[1]
-        dit_config["num_layers"] = count_blocks(state_dict_keys, '{}transformer_blocks.'.format(key_prefix) + '{}.')
        return dit_config

    if '{}input_blocks.0.0.weight'.format(key_prefix) not in state_dict_keys:
@@ -770,12 +647,6 @@ def model_config_from_unet(state_dict, unet_key_prefix, use_base_if_no_match=Fal
        else:
            model_config.optimizations["fp8"] = True

-    # Detect per-layer quantization (mixed precision)
-    layer_quant_config = detect_layer_quantization(metadata)
-    if layer_quant_config:
-        model_config.layer_quant_config = layer_quant_config
-        logging.info(f"Detected mixed precision quantization: {len(layer_quant_config)} layers quantized")
-
    return model_config

 def unet_prefix_from_state_dict(state_dict):
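The `depth` and `depth_single_blocks` entries above are derived by counting numbered block prefixes among the state-dict keys. The body of `count_blocks` is cut off in this diff, so the version below is only an assumed reimplementation that matches how it is called here. The sample key names are hypothetical and are not taken from a real checkpoint.

```python
def count_blocks(state_dict_keys, prefix_string):
    """Count consecutive block indices 0, 1, 2, ... that appear as key prefixes.

    Assumed behaviour: prefix_string contains one '{}' placeholder for the
    block index, and counting stops at the first index with no matching key.
    """
    count = 0
    while any(k.startswith(prefix_string.format(count)) for k in state_dict_keys):
        count += 1
    return count


# Hypothetical keys for illustration only.
example_keys = [
    "model.double_blocks.0.img_attn.qkv.weight",
    "model.double_blocks.1.img_attn.qkv.weight",
    "model.single_blocks.0.linear1.weight",
]
key_prefix = "model."

# Same prefix construction as in detect_unet_config.
depth = count_blocks(example_keys, '{}double_blocks.'.format(key_prefix) + '{}.')
depth_single_blocks = count_blocks(example_keys, '{}single_blocks.'.format(key_prefix) + '{}.')
```

The trailing `.` after the placeholder matters: without it, the prefix for block 1 would also match the keys of blocks 10 to 19.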
--- a/comfy/model_management.py
+++ b/comfy/model_management.py
@@ -22,7 +22,6 @@ from enum import Enum
 from comfy.cli_args import args, PerformanceFeature
 import torch
 import sys
-import importlib
 import platform
 import weakref
 import gc
@@ -89,7 +88,6 @@ if args.deterministic:

 directml_enabled = False
 if args.directml is not None:
-    logging.warning("WARNING: torch-directml barely works, is very slow, has not been updated in over 1 year and might be removed soon, please don't use it, there are better options.")
    import torch_directml
    directml_enabled = True
    device_index = args.directml
@@ -291,24 +289,6 @@ def is_amd():
            return True
    return False

-def amd_min_version(device=None, min_rdna_version=0):
-    if not is_amd():
-        return False
-
-    if is_device_cpu(device):
-        return False
-
-    arch = torch.cuda.get_device_properties(device).gcnArchName
-    if arch.startswith('gfx') and len(arch) == 7:
-        try:
-            cmp_rdna_version = int(arch[4]) + 2
-        except:
-            cmp_rdna_version = 0
-        if cmp_rdna_version >= min_rdna_version:
-            return True
-
-    return False
-
 MIN_WEIGHT_MEMORY_RATIO = 0.4
 if is_nvidia():
    MIN_WEIGHT_MEMORY_RATIO = 0.0
@@ -331,33 +311,24 @@ except:


 SUPPORT_FP8_OPS = args.supports_fp8_compute
-
-AMD_RDNA2_AND_OLDER_ARCH = ["gfx1030", "gfx1031", "gfx1010", "gfx1011", "gfx1012", "gfx906", "gfx900", "gfx803"]
-
 try:
    if is_amd():
-        arch = torch.cuda.get_device_properties(get_torch_device()).gcnArchName
-        if not (any((a in arch) for a in AMD_RDNA2_AND_OLDER_ARCH)):
-            torch.backends.cudnn.enabled = False  # Seems to improve things a lot on AMD
-            logging.info("Set: torch.backends.cudnn.enabled = False for better AMD performance.")
-
        try:
            rocm_version = tuple(map(int, str(torch.version.hip).split(".")[:2]))
        except:
            rocm_version = (6, -1)
-
+        arch = torch.cuda.get_device_properties(get_torch_device()).gcnArchName
        logging.info("AMD arch: {}".format(arch))
        logging.info("ROCm version: {}".format(rocm_version))
        if args.use_split_cross_attention == False and args.use_quad_cross_attention == False:
-            if importlib.util.find_spec('triton') is not None:  # AMD efficient attention implementation depends on triton. TODO: better way of detecting if it's compiled in or not.
-                if torch_version_numeric >= (2, 7):  # works on 2.6 but doesn't actually seem to improve much
-                    if any((a in arch) for a in ["gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1151"]):  # TODO: more arches, TODO: gfx950
-                        ENABLE_PYTORCH_ATTENTION = True
-                if rocm_version >= (7, 0):
-                   if any((a in arch) for a in ["gfx1201"]):
-                       ENABLE_PYTORCH_ATTENTION = True
+            if torch_version_numeric >= (2, 7):  # works on 2.6 but doesn't actually seem to improve much
+                if any((a in arch) for a in ["gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1151"]):  # TODO: more arches, TODO: gfx950
+                    ENABLE_PYTORCH_ATTENTION = True
+#            if torch_version_numeric >= (2, 8):
+#                if any((a in arch) for a in ["gfx1201"]):
+#                    ENABLE_PYTORCH_ATTENTION = True
        if torch_version_numeric >= (2, 7) and rocm_version >= (6, 4):
-            if any((a in arch) for a in ["gfx1200", "gfx1201", "gfx950"]):  # TODO: more arches, "gfx942" gives error on pytorch nightly 2.10 1013 rocm7.0
+            if any((a in arch) for a in ["gfx1201", "gfx942", "gfx950"]):  # TODO: more arches
                SUPPORT_FP8_OPS = True

 except:
@@ -379,9 +350,6 @@ try:
 except:
    pass

-if torch.cuda.is_available() and torch.backends.cudnn.is_available() and PerformanceFeature.AutoTune in args.fast:
-    torch.backends.cudnn.benchmark = True
-
 try:
    if torch_version_numeric >= (2, 5):
        torch.backends.cuda.allow_fp16_bf16_reduction_math_sdp(True)
@@ -504,7 +472,6 @@ class LoadedModel:
        if use_more_vram == 0:
            use_more_vram = 1e32
        self.model_use_more_vram(use_more_vram, force_patch_weights=force_patch_weights)
-
        real_model = self.model.model

        if is_intel_xpu() and not args.disable_ipex_optimize and 'ipex' in globals() and real_model is not None:
@@ -615,24 +582,25 @@ def free_memory(memory_required, device, keep_loaded=[]):
                soft_empty_cache()
    return unloaded_models

+def get_models_memory_reserve(models):
+    total_reserve = 0
+    for model in models:
+        total_reserve += model.get_model_memory_reserve(convert_to_bytes=True)
+    return total_reserve
+
 def load_models_gpu(models, memory_required=0, force_patch_weights=False, minimum_memory_required=None, force_full_load=False):
    cleanup_models_gc()
    global vram_state

    inference_memory = minimum_inference_memory()
-    extra_mem = max(inference_memory, memory_required + extra_reserved_memory())
+    models_memory_reserve = get_models_memory_reserve(models)
+    extra_mem = max(inference_memory + models_memory_reserve, memory_required + extra_reserved_memory() + models_memory_reserve)
    if minimum_memory_required is None:
        minimum_memory_required = extra_mem
    else:
-        minimum_memory_required = max(inference_memory, minimum_memory_required + extra_reserved_memory())
+        minimum_memory_required = max(inference_memory + models_memory_reserve, minimum_memory_required + extra_reserved_memory() + models_memory_reserve)

-    models_temp = set()
-    for m in models:
-        models_temp.add(m)
-        for mm in m.model_patches_models():
-            models_temp.add(mm)
-
-    models = models_temp
+    models = set(models)

    models_to_load = []

@@ -658,9 +626,7 @@ def load_models_gpu(models, memory_required=0, force_patch_weights=False, minimu
            if loaded_model.model.is_clone(current_loaded_models[i].model):
                to_unload = [i] + to_unload
        for i in to_unload:
-            model_to_unload = current_loaded_models.pop(i)
-            model_to_unload.model.detach(unpatch_all=False)
-            model_to_unload.model_finalizer.detach()
+            current_loaded_models.pop(i).model.detach(unpatch_all=False)

    total_memory_required = {}
    for loaded_model in models_to_load:
@@ -690,10 +656,7 @@ def load_models_gpu(models, memory_required=0, force_patch_weights=False, minimu
            current_free_mem = get_free_memory(torch_dev) + loaded_memory

            lowvram_model_memory = max(128 * 1024 * 1024, (current_free_mem - minimum_memory_required), min(current_free_mem * MIN_WEIGHT_MEMORY_RATIO, current_free_mem - minimum_inference_memory()))
-            lowvram_model_memory = lowvram_model_memory - loaded_memory
-
-            if lowvram_model_memory == 0:
-                lowvram_model_memory = 0.1
+            lowvram_model_memory = max(0.1, lowvram_model_memory - loaded_memory)

        if vram_set_state == VRAMState.NO_VRAM:
            lowvram_model_memory = 0.1
@@ -941,7 +904,9 @@ def vae_dtype(device=None, allowed_dtypes=[]):
        if d == torch.float16 and should_use_fp16(device):
            return d

-        if d == torch.bfloat16 and should_use_bf16(device):
+        # NOTE: bfloat16 seems to work on AMD for the VAE, but in some cases it is extremely slow compared to fp32.
+        # The slowness is still present on PyTorch nightly 2.9.0.dev20250720+rocm6.4 (tested on RDNA3).
+        if d == torch.bfloat16 and (not is_amd()) and should_use_bf16(device):
            return d

    return torch.float32
@@ -1003,6 +968,12 @@ def device_supports_non_blocking(device):
        return False
    return True

+def device_should_use_non_blocking(device):
+    if not device_supports_non_blocking(device):
+        return False
+    return False
+    # return True #TODO: figure out why this causes memory issues on Nvidia and possibly others
+
 def force_channels_last():
    if args.force_channels_last:
        return True
@@ -1017,16 +988,6 @@ if args.async_offload:
    NUM_STREAMS = 2
    logging.info("Using async weight offloading with {} streams".format(NUM_STREAMS))

-def current_stream(device):
-    if device is None:
-        return None
-    if is_device_cuda(device):
-        return torch.cuda.current_stream()
-    elif is_device_xpu(device):
-        return torch.xpu.current_stream()
-    else:
-        return None
-
 stream_counters = {}
 def get_offload_stream(device):
    stream_counter = stream_counters.get(device, 0)
@@ -1035,17 +996,21 @@ def get_offload_stream(device):

    if device in STREAMS:
        ss = STREAMS[device]
-        #Sync the oldest stream in the queue with the current
-        ss[stream_counter].wait_stream(current_stream(device))
+        s = ss[stream_counter]
        stream_counter = (stream_counter + 1) % len(ss)
+        if is_device_cuda(device):
+            ss[stream_counter].wait_stream(torch.cuda.current_stream())
+        elif is_device_xpu(device):
+            ss[stream_counter].wait_stream(torch.xpu.current_stream())
        stream_counters[device] = stream_counter
-        return ss[stream_counter]
+        return s
    elif is_device_cuda(device):
        ss = []
        for k in range(NUM_STREAMS):
            ss.append(torch.cuda.Stream(device=device, priority=0))
        STREAMS[device] = ss
        s = ss[stream_counter]
+        stream_counter = (stream_counter + 1) % len(ss)
        stream_counters[device] = stream_counter
        return s
    elif is_device_xpu(device):
@@ -1054,14 +1019,18 @@ def get_offload_stream(device):
            ss.append(torch.xpu.Stream(device=device, priority=0))
        STREAMS[device] = ss
        s = ss[stream_counter]
+        stream_counter = (stream_counter + 1) % len(ss)
        stream_counters[device] = stream_counter
        return s
    return None

 def sync_stream(device, stream):
-    if stream is None or current_stream(device) is None:
+    if stream is None:
        return
-    current_stream(device).wait_stream(stream)
+    if is_device_cuda(device):
+        torch.cuda.current_stream().wait_stream(stream)
+    elif is_device_xpu(device):
+        torch.xpu.current_stream().wait_stream(stream)

 def cast_to(weight, dtype=None, device=None, non_blocking=False, copy=False, stream=None):
    if device is None or weight.device == device:
@@ -1086,83 +1055,6 @@ def cast_to_device(tensor, device, dtype, copy=False):
    non_blocking = device_supports_non_blocking(device)
    return cast_to(tensor, dtype=dtype, device=device, non_blocking=non_blocking, copy=copy)

-
-PINNED_MEMORY = {}
-TOTAL_PINNED_MEMORY = 0
-MAX_PINNED_MEMORY = -1
-if not args.disable_pinned_memory:
-    if is_nvidia() or is_amd():
-        if WINDOWS:
-            MAX_PINNED_MEMORY = get_total_memory(torch.device("cpu")) * 0.45  # Windows limit is apparently 50%
-        else:
-            MAX_PINNED_MEMORY = get_total_memory(torch.device("cpu")) * 0.95
-        logging.info("Enabled pinned memory {}".format(MAX_PINNED_MEMORY // (1024 * 1024)))
-
-PINNING_ALLOWED_TYPES = set(["Parameter", "QuantizedTensor"])
-
-def pin_memory(tensor):
-    global TOTAL_PINNED_MEMORY
-    if MAX_PINNED_MEMORY <= 0:
-        return False
-
-    if type(tensor).__name__ not in PINNING_ALLOWED_TYPES:
-        return False
-
-    if not is_device_cpu(tensor.device):
-        return False
-
-    if tensor.is_pinned():
-        #NOTE: Cuda does detect when a tensor is already pinned and would
-        #error below, but there are proven cases where this also queues an error
-        #on the GPU async. So dont trust the CUDA API and guard here
-        return False
-
-    if not tensor.is_contiguous():
-        return False
-
-    size = tensor.numel() * tensor.element_size()
-    if (TOTAL_PINNED_MEMORY + size) > MAX_PINNED_MEMORY:
-        return False
-
-    ptr = tensor.data_ptr()
-    if ptr == 0:
-        return False
-
-    if torch.cuda.cudart().cudaHostRegister(ptr, size, 1) == 0:
-        PINNED_MEMORY[ptr] = size
-        TOTAL_PINNED_MEMORY += size
-        return True
-
-    return False
-
-def unpin_memory(tensor):
-    global TOTAL_PINNED_MEMORY
-    if MAX_PINNED_MEMORY <= 0:
-        return False
-
-    if not is_device_cpu(tensor.device):
-        return False
-
-    ptr = tensor.data_ptr()
-    size = tensor.numel() * tensor.element_size()
-
-    size_stored = PINNED_MEMORY.get(ptr, None)
-    if size_stored is None:
-        logging.warning("Tried to unpin tensor not pinned by ComfyUI")
-        return False
-
-    if size != size_stored:
-        logging.warning("Size of pinned tensor changed")
-        return False
-
-    if torch.cuda.cudart().cudaHostUnregister(ptr) == 0:
-        TOTAL_PINNED_MEMORY -= PINNED_MEMORY.pop(ptr)
-        if len(PINNED_MEMORY) == 0:
-            TOTAL_PINNED_MEMORY = 0
-        return True
-
-    return False
-
 def sage_attention_enabled():
    return args.use_sage_attention

@@ -1415,7 +1307,7 @@ def should_use_bf16(device=None, model_params=0, prioritize_performance=True, ma

    if is_amd():
        arch = torch.cuda.get_device_properties(device).gcnArchName
-        if any((a in arch) for a in AMD_RDNA2_AND_OLDER_ARCH):  # RDNA2 and older don't support bf16
+        if any((a in arch) for a in ["gfx1030", "gfx1031", "gfx1010", "gfx1011", "gfx1012", "gfx906", "gfx900", "gfx803"]):  # RDNA2 and older don't support bf16
            if manual_cast:
                return True
            return False
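The reserve handling added to `load_models_gpu` above can be sketched in isolation. This is a minimal illustration, not the project's code: `FakePatcher` and `extra_memory` are hypothetical names, and the inference, required, and reserved amounts are passed in as plain numbers rather than queried from the device. The point it shows is that the per-model reserve is added to both arguments of `max`, so it is counted whichever branch wins.

```python
GIB = 1024 * 1024 * 1024


class FakePatcher:
    """Hypothetical stand-in for ModelPatcher; only reserve bookkeeping is modelled."""

    def __init__(self):
        self.model_options = {}

    def add_model_memory_reserve(self, memory_reserve_gb):
        # Reserves accumulate in a list, one entry per call.
        self.model_options.setdefault("model_memory_reserve", []).append(memory_reserve_gb)

    def get_model_memory_reserve(self, convert_to_bytes=False):
        total = sum(self.model_options.get("model_memory_reserve", []))
        return total * GIB if convert_to_bytes else total


def extra_memory(models, inference_memory, memory_required, extra_reserved):
    """Mirror of the extra_mem expression, with all inputs in bytes."""
    reserve = sum(m.get_model_memory_reserve(convert_to_bytes=True) for m in models)
    return max(inference_memory + reserve,
               memory_required + extra_reserved + reserve)


patcher = FakePatcher()
patcher.add_model_memory_reserve(1.5)
patcher.add_model_memory_reserve(0.5)

# Inference floor dominates: 1 GiB + 2 GiB reserve.
small_request = extra_memory([patcher], GIB, 0, 0)
# Explicit request dominates: 4 GiB + 0.5 GiB + 2 GiB reserve.
large_request = extra_memory([patcher], GIB, 4 * GIB, GIB // 2)
```

Because the reserve appears in both arguments, the expression is equivalent to `max(a, b) + reserve`. The diff writes it out in full, presumably to keep the two cases visible.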
--- a/comfy/model_patcher.py
+++ b/comfy/model_patcher.py
@@ -24,7 +24,7 @@ import inspect
 import logging
 import math
 import uuid
-from typing import Callable, Optional
+from typing import Callable, Optional, Union

 import torch

@@ -84,6 +84,12 @@ def set_model_options_pre_cfg_function(model_options, pre_cfg_function, disable_
        model_options["disable_cfg1_optimization"] = True
    return model_options

+def add_model_options_memory_reserve(model_options, memory_reserve_gb: float):
+    if "model_memory_reserve" not in model_options:
+        model_options["model_memory_reserve"] = []
+    model_options["model_memory_reserve"].append(memory_reserve_gb)
+    return model_options
+
 def create_model_options_clone(orig_model_options: dict):
    return comfy.patcher_extension.copy_nested_dicts(orig_model_options)

@@ -123,30 +129,16 @@ def move_weight_functions(m, device):
    return memory

 class LowVramPatch:
-    def __init__(self, key, patches, convert_func=None, set_func=None):
+    def __init__(self, key, patches):
        self.key = key
        self.patches = patches
-        self.convert_func = convert_func
-        self.set_func = set_func
-
    def __call__(self, weight):
        intermediate_dtype = weight.dtype
-        if self.convert_func is not None:
-            weight = self.convert_func(weight, inplace=False)
-
        if intermediate_dtype not in [torch.float32, torch.float16, torch.bfloat16]: #intermediate_dtype has to be one that is supported in math ops
            intermediate_dtype = torch.float32
-            out = comfy.lora.calculate_weight(self.patches[self.key], weight.to(intermediate_dtype), self.key, intermediate_dtype=intermediate_dtype)
-            if self.set_func is None:
-                return comfy.float.stochastic_rounding(out, weight.dtype, seed=string_to_seed(self.key))
-            else:
-                return self.set_func(out, seed=string_to_seed(self.key), return_weight=True)
+            return comfy.float.stochastic_rounding(comfy.lora.calculate_weight(self.patches[self.key], weight.to(intermediate_dtype), self.key, intermediate_dtype=intermediate_dtype), weight.dtype, seed=string_to_seed(self.key))

-        out = comfy.lora.calculate_weight(self.patches[self.key], weight, self.key, intermediate_dtype=intermediate_dtype)
-        if self.set_func is not None:
-            return self.set_func(out, seed=string_to_seed(self.key), return_weight=True).to(dtype=intermediate_dtype)
-        else:
-            return out
+        return comfy.lora.calculate_weight(self.patches[self.key], weight, self.key, intermediate_dtype=intermediate_dtype)

 def get_key_weight(model, key):
    set_func = None
@@ -231,13 +223,13 @@ class ModelPatcher:
        self.object_patches_backup = {}
        self.weight_wrapper_patches = {}
        self.model_options = {"transformer_options":{}}
+        self.model_size()
        self.load_device = load_device
        self.offload_device = offload_device
        self.weight_inplace_update = weight_inplace_update
        self.force_cast_weights = False
        self.patches_uuid = uuid.uuid4()
        self.parent = None
-        self.pinned = set()

        self.attachments: dict[str] = {}
        self.additional_models: dict[str, list[ModelPatcher]] = {}
@@ -275,9 +267,6 @@ class ModelPatcher:
        self.size = comfy.model_management.module_size(self.model)
        return self.size

-    def get_ram_usage(self):
-        return self.model_size()
-
    def loaded_size(self):
        return self.model.model_loaded_weight_memory

@@ -285,7 +274,7 @@ class ModelPatcher:
        return self.model.lowvram_patch_counter

    def clone(self):
-        n = self.__class__(self.model, self.load_device, self.offload_device, self.model_size(), weight_inplace_update=self.weight_inplace_update)
+        n = self.__class__(self.model, self.load_device, self.offload_device, self.size, weight_inplace_update=self.weight_inplace_update)
        n.patches = {}
        for k in self.patches:
            n.patches[k] = self.patches[k][:]
@@ -297,7 +286,6 @@ class ModelPatcher:
        n.backup = self.backup
        n.object_patches_backup = self.object_patches_backup
        n.parent = self
-        n.pinned = self.pinned

        n.force_cast_weights = self.force_cast_weights

@@ -448,25 +436,6 @@ class ModelPatcher:
    def set_model_forward_timestep_embed_patch(self, patch):
        self.set_model_patch(patch, "forward_timestep_embed_patch")

-    def set_model_double_block_patch(self, patch):
-        self.set_model_patch(patch, "double_block")
-
-    def set_model_post_input_patch(self, patch):
-        self.set_model_patch(patch, "post_input")
-
-    def set_model_rope_options(self, scale_x, shift_x, scale_y, shift_y, scale_t, shift_t, **kwargs):
-        rope_options = self.model_options["transformer_options"].get("rope_options", {})
-        rope_options["scale_x"] = scale_x
-        rope_options["scale_y"] = scale_y
-        rope_options["scale_t"] = scale_t
-
-        rope_options["shift_x"] = shift_x
-        rope_options["shift_y"] = shift_y
-        rope_options["shift_t"] = shift_t
-
-        self.model_options["transformer_options"]["rope_options"] = rope_options
-
-
    def add_object_patch(self, name, obj):
        self.object_patches[name] = obj

@@ -476,6 +445,17 @@ class ModelPatcher:
            self.force_cast_weights = True
        self.patches_uuid = uuid.uuid4() #TODO: optimize by preventing a full model reload for this

+    def add_model_memory_reserve(self, memory_reserve_gb: float):
+        """Adds additional expected memory usage for the model, in gigabytes."""
+        self.model_options = add_model_options_memory_reserve(self.model_options, memory_reserve_gb)
+
+    def get_model_memory_reserve(self, convert_to_bytes: bool = False) -> Union[float, int]:
+        """Returns the total additional memory reserved for the model, in gigabytes (or in bytes if convert_to_bytes is True)."""
+        total_reserve = sum(self.model_options.get("model_memory_reserve", []))
+        if convert_to_bytes:
+            return total_reserve * 1024 * 1024 * 1024
+        return total_reserve
+
    def add_weight_wrapper(self, name, function):
        self.weight_wrapper_patches[name] = self.weight_wrapper_patches.get(name, []) + [function]
        self.patches_uuid = uuid.uuid4()
@@ -523,30 +503,6 @@ class ModelPatcher:
            if hasattr(wrap_func, "to"):
                self.model_options["model_function_wrapper"] = wrap_func.to(device)

-    def model_patches_models(self):
-        to = self.model_options["transformer_options"]
-        models = []
-        if "patches" in to:
-            patches = to["patches"]
-            for name in patches:
-                patch_list = patches[name]
-                for i in range(len(patch_list)):
-                    if hasattr(patch_list[i], "models"):
-                        models += patch_list[i].models()
-        if "patches_replace" in to:
-            patches = to["patches_replace"]
-            for name in patches:
-                patch_list = patches[name]
-                for k in patch_list:
-                    if hasattr(patch_list[k], "models"):
-                        models += patch_list[k].models()
-        if "model_function_wrapper" in self.model_options:
-            wrap_func = self.model_options["model_function_wrapper"]
-            if hasattr(wrap_func, "models"):
-                models += wrap_func.models()
-
-        return models
-
    def model_dtype(self):
        if hasattr(self.model, "get_dtype"):
            return self.model.get_dtype()
@@ -635,21 +591,6 @@ class ModelPatcher:
        else:
            set_func(out_weight, inplace_update=inplace_update, seed=string_to_seed(key))

-    def pin_weight_to_device(self, key):
-        weight, set_func, convert_func = get_key_weight(self.model, key)
-        if comfy.model_management.pin_memory(weight):
-            self.pinned.add(key)
-
-    def unpin_weight(self, key):
-        if key in self.pinned:
-            weight, set_func, convert_func = get_key_weight(self.model, key)
-            comfy.model_management.unpin_memory(weight)
-            self.pinned.remove(key)
-
-    def unpin_all_weights(self):
-        for key in list(self.pinned):
-            self.unpin_weight(key)
-
    def _load_list(self):
        loading = []
        for n, m in self.model.named_modules():
@@ -671,11 +612,9 @@ class ModelPatcher:
            mem_counter = 0
            patch_counter = 0
            lowvram_counter = 0
-            lowvram_mem_counter = 0
            loading = self._load_list()

            load_completely = []
-            offloaded = []
            loading.sort(reverse=True)
            for x in loading:
                n = x[1]
@@ -692,7 +631,6 @@ class ModelPatcher:
                    if mem_counter + module_mem >= lowvram_model_memory:
                        lowvram_weight = True
                        lowvram_counter += 1
-                        lowvram_mem_counter += module_mem
                        if hasattr(m, "prev_comfy_cast_weights"): #Already lowvramed
                            continue

@@ -706,19 +644,16 @@ class ModelPatcher:
                        if force_patch_weights:
                            self.patch_weight_to_device(weight_key)
                        else:
-                            _, set_func, convert_func = get_key_weight(self.model, weight_key)
-                            m.weight_function = [LowVramPatch(weight_key, self.patches, convert_func, set_func)]
+                            m.weight_function = [LowVramPatch(weight_key, self.patches)]
                            patch_counter += 1
                    if bias_key in self.patches:
                        if force_patch_weights:
                            self.patch_weight_to_device(bias_key)
                        else:
-                            _, set_func, convert_func = get_key_weight(self.model, bias_key)
-                            m.bias_function = [LowVramPatch(bias_key, self.patches, convert_func, set_func)]
+                            m.bias_function = [LowVramPatch(bias_key, self.patches)]
                            patch_counter += 1

                    cast_weight = True
-                    offloaded.append((module_mem, n, m, params))
                else:
                    if hasattr(m, "comfy_cast_weights"):
                        wipe_lowvram_weight(m)
@@ -749,9 +684,7 @@ class ModelPatcher:
                        continue

                for param in params:
-                    key = "{}.{}".format(n, param)
-                    self.unpin_weight(key)
-                    self.patch_weight_to_device(key, device_to=device_to)
+                    self.patch_weight_to_device("{}.{}".format(n, param), device_to=device_to)

                logging.debug("lowvram: loaded module regularly {} {}".format(n, m))
                m.comfy_patched_weights = True
@@ -759,17 +692,11 @@ class ModelPatcher:
            for x in load_completely:
                x[2].to(device_to)

-            for x in offloaded:
-                n = x[1]
-                params = x[3]
-                for param in params:
-                    self.pin_weight_to_device("{}.{}".format(n, param))
-
            if lowvram_counter > 0:
-                logging.info("loaded partially; {:.2f} MB usable, {:.2f} MB loaded, {:.2f} MB offloaded, lowvram patches: {}".format(lowvram_model_memory / (1024 * 1024), mem_counter / (1024 * 1024), lowvram_mem_counter / (1024 * 1024), patch_counter))
+                logging.info("loaded partially {} {} {}".format(lowvram_model_memory / (1024 * 1024), mem_counter / (1024 * 1024), patch_counter))
                self.model.model_lowvram = True
            else:
-                logging.info("loaded completely; {:.2f} MB usable, {:.2f} MB loaded, full load: {}".format(lowvram_model_memory / (1024 * 1024), mem_counter / (1024 * 1024), full_load))
+                logging.info("loaded completely {} {} {}".format(lowvram_model_memory / (1024 * 1024), mem_counter / (1024 * 1024), full_load))
                self.model.model_lowvram = False
                if full_load:
                    self.model.to(device_to)
@@ -806,7 +733,6 @@ class ModelPatcher:
        self.eject_model()
        if unpatch_weights:
            self.unpatch_hooks()
-            self.unpin_all_weights()
            if self.model.model_lowvram:
                for m in self.model.modules():
                    move_weight_functions(m, device_to)
@@ -842,7 +768,7 @@ class ModelPatcher:

        self.object_patches_backup.clear()

-    def partially_unload(self, device_to, memory_to_free=0, force_patch_weights=False):
+    def partially_unload(self, device_to, memory_to_free=0):
        with self.use_ejected():
            hooks_unpatched = False
            memory_freed = 0
@@ -886,19 +812,11 @@ class ModelPatcher:
                        module_mem += move_weight_functions(m, device_to)
                        if lowvram_possible:
                            if weight_key in self.patches:
-                                if force_patch_weights:
-                                    self.patch_weight_to_device(weight_key)
-                                else:
-                                    _, set_func, convert_func = get_key_weight(self.model, weight_key)
-                                    m.weight_function.append(LowVramPatch(weight_key, self.patches, convert_func, set_func))
-                                    patch_counter += 1
+                                m.weight_function.append(LowVramPatch(weight_key, self.patches))
+                                patch_counter += 1
                            if bias_key in self.patches:
-                                if force_patch_weights:
-                                    self.patch_weight_to_device(bias_key)
-                                else:
-                                    _, set_func, convert_func = get_key_weight(self.model, bias_key)
-                                    m.bias_function.append(LowVramPatch(bias_key, self.patches, convert_func, set_func))
-                                    patch_counter += 1
+                                m.bias_function.append(LowVramPatch(bias_key, self.patches))
+                                patch_counter += 1
                            cast_weight = True

                        if cast_weight:
@@ -908,13 +826,9 @@ class ModelPatcher:
                        memory_freed += module_mem
                        logging.debug("freed {}".format(n))

-                        for param in params:
-                            self.pin_weight_to_device("{}.{}".format(n, param))
-
            self.model.model_lowvram = True
            self.model.lowvram_patch_counter += patch_counter
            self.model.model_loaded_weight_memory -= memory_freed
-            logging.info("loaded partially: {:.2f} MB loaded, lowvram patches: {}".format(self.model.model_loaded_weight_memory / (1024 * 1024), self.model.lowvram_patch_counter))
            return memory_freed

    def partially_load(self, device_to, extra_memory=0, force_patch_weights=False):
@@ -927,9 +841,6 @@ class ModelPatcher:
                extra_memory += (used - self.model.model_loaded_weight_memory)

            self.patch_model(load_weights=False)
-            if extra_memory < 0 and not unpatch_weights:
-                self.partially_unload(self.offload_device, -extra_memory, force_patch_weights=force_patch_weights)
-                return 0
            full_load = False
            if self.model.model_lowvram == False and self.model.model_loaded_weight_memory > 0:
                self.apply_hooks(self.forced_hooks, force_apply=True)
@@ -1317,6 +1228,5 @@ class ModelPatcher:
        self.clear_cached_hook_weights()

    def __del__(self):
-        self.unpin_all_weights()
        self.detach(unpatch_all=False)

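Reviewer note: a minimal sketch of how the reserve values introduced by `add_model_memory_reserve` / `get_model_memory_reserve` in the hunks above might accumulate. The body of `add_model_options_memory_reserve` is not shown in this section, so the stand-in below is an assumption: it appends each value (in gigabytes) to a list stored under `model_options["model_memory_reserve"]`, which is the key that `get_model_memory_reserve` sums. The names `GIB`, `add_memory_reserve` and `get_memory_reserve` are hypothetical and exist only for this illustration.

```python
GIB = 1024 * 1024 * 1024  # bytes per gigabyte, matching the 1024 * 1024 * 1024 factor in the diff


def add_memory_reserve(model_options: dict, memory_reserve_gb: float) -> dict:
    """Hypothetical stand-in for add_model_options_memory_reserve (its body is not in this diff).

    Assumption: reserves are kept as a list so several callers can each add their own amount.
    """
    new_options = dict(model_options)  # shallow copy so the caller's dict is left untouched
    reserves = list(model_options.get("model_memory_reserve", []))
    reserves.append(memory_reserve_gb)
    new_options["model_memory_reserve"] = reserves
    return new_options


def get_memory_reserve(model_options: dict, convert_to_bytes: bool = False):
    """Mirrors get_model_memory_reserve: sum the list, optionally converting GB to bytes."""
    total_reserve = sum(model_options.get("model_memory_reserve", []))
    if convert_to_bytes:
        return total_reserve * GIB
    return total_reserve


empty_options = {}
options = add_memory_reserve(empty_options, 0.5)
options = add_memory_reserve(options, 1.5)

reserve_gb = get_memory_reserve(options)
reserve_bytes = get_memory_reserve(options, convert_to_bytes=True)
```

With no reserve registered, the sum over the missing key falls back to an empty list, so the getter returns 0 in either unit.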
--- a/comfy/model_sampling.py
+++ b/comfy/model_sampling.py
@@ -21,23 +21,17 @@ def rescale_zero_terminal_snr_sigmas(sigmas):
    alphas_bar[-1] = 4.8973451890853435e-08
    return ((1 - alphas_bar) / alphas_bar) ** 0.5

-def reshape_sigma(sigma, noise_dim):
-    if sigma.nelement() == 1:
-        return sigma.view(())
-    else:
-        return sigma.view(sigma.shape[:1] + (1,) * (noise_dim - 1))
-
 class EPS:
    def calculate_input(self, sigma, noise):
-        sigma = reshape_sigma(sigma, noise.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (noise.ndim - 1))
        return noise / (sigma ** 2 + self.sigma_data ** 2) ** 0.5

    def calculate_denoised(self, sigma, model_output, model_input):
-        sigma = reshape_sigma(sigma, model_output.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (model_output.ndim - 1))
        return model_input - model_output * sigma

    def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
-        sigma = reshape_sigma(sigma, noise.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (noise.ndim - 1))
        if max_denoise:
            noise = noise * torch.sqrt(1.0 + sigma ** 2.0)
        else:
@@ -51,12 +45,12 @@ class EPS:

 class V_PREDICTION(EPS):
    def calculate_denoised(self, sigma, model_output, model_input):
-        sigma = reshape_sigma(sigma, model_output.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (model_output.ndim - 1))
        return model_input * self.sigma_data ** 2 / (sigma ** 2 + self.sigma_data ** 2) - model_output * sigma * self.sigma_data / (sigma ** 2 + self.sigma_data ** 2) ** 0.5

 class EDM(V_PREDICTION):
    def calculate_denoised(self, sigma, model_output, model_input):
-        sigma = reshape_sigma(sigma, model_output.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (model_output.ndim - 1))
        return model_input * self.sigma_data ** 2 / (sigma ** 2 + self.sigma_data ** 2) + model_output * sigma * self.sigma_data / (sigma ** 2 + self.sigma_data ** 2) ** 0.5

 class CONST:
@@ -64,15 +58,15 @@ class CONST:
        return noise

    def calculate_denoised(self, sigma, model_output, model_input):
-        sigma = reshape_sigma(sigma, model_output.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (model_output.ndim - 1))
        return model_input - model_output * sigma

    def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
-        sigma = reshape_sigma(sigma, noise.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (noise.ndim - 1))
        return sigma * noise + (1.0 - sigma) * latent_image

    def inverse_noise_scaling(self, sigma, latent):
-        sigma = reshape_sigma(sigma, latent.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (latent.ndim - 1))
        return latent / (1.0 - sigma)

 class X0(EPS):
@@ -86,16 +80,16 @@ class IMG_TO_IMG(X0):
 class COSMOS_RFLOW:
    def calculate_input(self, sigma, noise):
        sigma = (sigma / (sigma + 1))
-        sigma = reshape_sigma(sigma, noise.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (noise.ndim - 1))
        return noise * (1.0 - sigma)

    def calculate_denoised(self, sigma, model_output, model_input):
        sigma = (sigma / (sigma + 1))
-        sigma = reshape_sigma(sigma, model_output.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (model_output.ndim - 1))
        return model_input * (1.0 - sigma) - model_output * sigma

    def noise_scaling(self, sigma, noise, latent_image, max_denoise=False):
-        sigma = reshape_sigma(sigma, noise.ndim)
+        sigma = sigma.view(sigma.shape[:1] + (1,) * (noise.ndim - 1))
        noise = noise * sigma
        noise += latent_image
        return noise
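Reviewer note: the two sides of the `model_sampling.py` hunks above differ only in how `sigma` is reshaped before broadcasting. The sketch below reproduces just the shape arithmetic in plain Python (no tensors involved) so the difference is easy to see. The function names are hypothetical; `helper_view_shape` follows the `reshape_sigma` helper on the `-` side and `inline_view_shape` follows the inline `sigma.view(...)` expression on the `+` side.

```python
from math import prod


def inline_view_shape(sigma_shape: tuple, target_ndim: int) -> tuple:
    """Shape passed to sigma.view(...) on the `+` side: keep dim 0, pad with size-1 dims."""
    return tuple(sigma_shape[:1]) + (1,) * (target_ndim - 1)


def helper_view_shape(sigma_shape: tuple, target_ndim: int) -> tuple:
    """Shape produced by reshape_sigma on the `-` side.

    A single-element sigma becomes a 0-dim view; anything else is padded as above.
    """
    if prod(sigma_shape) == 1:  # same condition as sigma.nelement() == 1
        return ()
    return tuple(sigma_shape[:1]) + (1,) * (target_ndim - 1)


# A batch of two sigmas against a 4-D latent (batch, channels, height, width).
batched_inline = inline_view_shape((2,), 4)
batched_helper = helper_view_shape((2,), 4)

# A single sigma: the only case where the two sides disagree.
single_inline = inline_view_shape((1,), 4)
single_helper = helper_view_shape((1,), 4)
```

For batched sigmas both sides give the same shape. For a single-element sigma the helper yields a 0-dim shape, which does not depend on the rank of the tensor it is later combined with, whereas the inline form always pads to `target_ndim`.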
--- a/comfy/nested_tensor.py
+++ b/comfy/nested_tensor.py
@@ -1,91 +0,0 @@
-import torch
-
-class NestedTensor:
-    def __init__(self, tensors):
-        self.tensors = list(tensors)
-        self.is_nested = True
-
-    def _copy(self):
-        return NestedTensor(self.tensors)
-
-    def apply_operation(self, other, operation):
-        o = self._copy()
-        if isinstance(other, NestedTensor):
-            for i, t in enumerate(o.tensors):
-                o.tensors[i] = operation(t, other.tensors[i])
-        else:
-            for i, t in enumerate(o.tensors):
-                o.tensors[i] = operation(t, other)
-        return o
-
-    def __add__(self, b):
-        return self.apply_operation(b, lambda x, y: x + y)
-
-    def __sub__(self, b):
-        return self.apply_operation(b, lambda x, y: x - y)
-
-    def __mul__(self, b):
-        return self.apply_operation(b, lambda x, y: x * y)
-
-    # def __itruediv__(self, b):
-    #     return self.apply_operation(b, lambda x, y: x / y)
-
-    def __truediv__(self, b):
-        return self.apply_operation(b, lambda x, y: x / y)
-
-    def __getitem__(self, *args, **kwargs):
-        return self.apply_operation(None, lambda x, y: x.__getitem__(*args, **kwargs))
-
-    def unbind(self):
-        return self.tensors
-
-    def to(self, *args, **kwargs):
-        o = self._copy()
-        for i, t in enumerate(o.tensors):
-            o.tensors[i] = t.to(*args, **kwargs)
-        return o
-
-    def new_ones(self, *args, **kwargs):
-        return self.tensors[0].new_ones(*args, **kwargs)
-
-    def float(self):
-        return self.to(dtype=torch.float)
-
-    def chunk(self, *args, **kwargs):
-        return self.apply_operation(None, lambda x, y: x.chunk(*args, **kwargs))
-
-    def size(self):
-        return self.tensors[0].size()
-
-    @property
-    def shape(self):
-        return self.tensors[0].shape
-
-    @property
-    def ndim(self):
-        dims = 0
-        for t in self.tensors:
-            dims = max(t.ndim, dims)
-        return dims
-
-    @property
-    def device(self):
-        return self.tensors[0].device
-
-    @property
-    def dtype(self):
-        return self.tensors[0].dtype
-
-    @property
-    def layout(self):
-        return self.tensors[0].layout
-
-
-def cat_nested(tensors, *args, **kwargs):
-    cated_tensors = []
-    for i in range(len(tensors[0].tensors)):
-        tens = []
-        for j in range(len(tensors)):
-            tens.append(tensors[j].tensors[i])
-        cated_tensors.append(torch.cat(tens, *args, **kwargs))
-    return NestedTensor(cated_tensors)
--- a/comfy/ops.py
+++ b/comfy/ops.py
@@ -24,18 +24,13 @@ import comfy.float
 import comfy.rmsnorm
 import contextlib

-def run_every_op():
-    if torch.compiler.is_compiling():
-        return
-
-    comfy.model_management.throw_exception_if_processing_interrupted()

 def scaled_dot_product_attention(q, k, v, *args, **kwargs):
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, *args, **kwargs)


 try:
-    if torch.cuda.is_available() and comfy.model_management.WINDOWS:
+    if torch.cuda.is_available():
        from torch.nn.attention import SDPBackend, sdpa_kernel
        import inspect
        if "set_priority" in inspect.signature(sdpa_kernel).parameters:
@@ -55,92 +50,46 @@ try:
 except (ModuleNotFoundError, TypeError):
    logging.warning("Could not set sdpa backend priority.")

-NVIDIA_MEMORY_CONV_BUG_WORKAROUND = False
-try:
-    if comfy.model_management.is_nvidia():
-        cudnn_version = torch.backends.cudnn.version()
-        if (cudnn_version >= 91002 and cudnn_version < 91500) and comfy.model_management.torch_version_numeric >= (2, 9) and comfy.model_management.torch_version_numeric <= (2, 10):
-            #TODO: change upper bound version once it's fixed'
-            NVIDIA_MEMORY_CONV_BUG_WORKAROUND = True
-            logging.info("working around nvidia conv3d memory bug.")
-except:
-    pass
-
 cast_to = comfy.model_management.cast_to #TODO: remove once no more references

 def cast_to_input(weight, input, non_blocking=False, copy=True):
    return comfy.model_management.cast_to(weight, input.dtype, input.device, non_blocking=non_blocking, copy=copy)

-
-def cast_bias_weight(s, input=None, dtype=None, device=None, bias_dtype=None, offloadable=False):
-    # NOTE: offloadable=False is a a legacy and if you are a custom node author reading this please pass
-    # offloadable=True and call uncast_bias_weight() after your last usage of the weight/bias. This
-    # will add async-offload support to your cast and improve performance.
+def cast_bias_weight(s, input=None, dtype=None, device=None, bias_dtype=None):
    if input is not None:
        if dtype is None:
-            if isinstance(input, QuantizedTensor):
-                dtype = input._layout_params["orig_dtype"]
-            else:
-                dtype = input.dtype
+            dtype = input.dtype
        if bias_dtype is None:
            bias_dtype = dtype
        if device is None:
            device = input.device

-    if offloadable and (device != s.weight.device or
-                        (s.bias is not None and device != s.bias.device)):
-        offload_stream = comfy.model_management.get_offload_stream(device)
-    else:
-        offload_stream = None
-
+    offload_stream = comfy.model_management.get_offload_stream(device)
    if offload_stream is not None:
        wf_context = offload_stream
    else:
        wf_context = contextlib.nullcontext()

-    non_blocking = comfy.model_management.device_supports_non_blocking(device)
-
-    weight_has_function = len(s.weight_function) > 0
-    bias_has_function = len(s.bias_function) > 0
-
-    weight = comfy.model_management.cast_to(s.weight, None, device, non_blocking=non_blocking, copy=weight_has_function, stream=offload_stream)
-
    bias = None
+    non_blocking = comfy.model_management.device_supports_non_blocking(device)
    if s.bias is not None:
-        bias = comfy.model_management.cast_to(s.bias, bias_dtype, device, non_blocking=non_blocking, copy=bias_has_function, stream=offload_stream)
+        has_function = len(s.bias_function) > 0
+        bias = comfy.model_management.cast_to(s.bias, bias_dtype, device, non_blocking=non_blocking, copy=has_function, stream=offload_stream)

-        if bias_has_function:
+        if has_function:
            with wf_context:
                for f in s.bias_function:
                    bias = f(bias)

-    if weight_has_function or weight.dtype != dtype:
+    has_function = len(s.weight_function) > 0
+    weight = comfy.model_management.cast_to(s.weight, dtype, device, non_blocking=non_blocking, copy=has_function, stream=offload_stream)
+    if has_function:
        with wf_context:
-            weight = weight.to(dtype=dtype)
-            if isinstance(weight, QuantizedTensor):
-                weight = weight.dequantize()
            for f in s.weight_function:
                weight = f(weight)

    comfy.model_management.sync_stream(device, offload_stream)
-    if offloadable:
-        return weight, bias, offload_stream
-    else:
-        #Legacy function signature
-        return weight, bias
-
-
-def uncast_bias_weight(s, weight, bias, offload_stream):
-    if offload_stream is None:
-        return
-    if weight is not None:
-        device = weight.device
-    else:
-        if bias is None:
-            return
-        device = bias.device
-    offload_stream.wait_stream(comfy.model_management.current_stream(device))
-
+    return weight, bias

 class CastWeightBiasOp:
    comfy_cast_weights = False
@@ -153,13 +102,10 @@ class disable_weight_init:
            return None

        def forward_comfy_cast_weights(self, input):
-            weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
-            x = torch.nn.functional.linear(input, weight, bias)
-            uncast_bias_weight(self, weight, bias, offload_stream)
-            return x
+            weight, bias = cast_bias_weight(self, input)
+            return torch.nn.functional.linear(input, weight, bias)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -170,13 +116,10 @@ class disable_weight_init:
            return None

        def forward_comfy_cast_weights(self, input):
-            weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
-            x = self._conv_forward(input, weight, bias)
-            uncast_bias_weight(self, weight, bias, offload_stream)
-            return x
+            weight, bias = cast_bias_weight(self, input)
+            return self._conv_forward(input, weight, bias)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -187,13 +130,10 @@ class disable_weight_init:
            return None

        def forward_comfy_cast_weights(self, input):
-            weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
-            x = self._conv_forward(input, weight, bias)
-            uncast_bias_weight(self, weight, bias, offload_stream)
-            return x
+            weight, bias = cast_bias_weight(self, input)
+            return self._conv_forward(input, weight, bias)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -203,23 +143,11 @@ class disable_weight_init:
        def reset_parameters(self):
            return None

-        def _conv_forward(self, input, weight, bias, *args, **kwargs):
-            if NVIDIA_MEMORY_CONV_BUG_WORKAROUND and weight.dtype in (torch.float16, torch.bfloat16):
-                out = torch.cudnn_convolution(input, weight, self.padding, self.stride, self.dilation, self.groups, benchmark=False, deterministic=False, allow_tf32=True)
-                if bias is not None:
-                    out += bias.reshape((1, -1) + (1,) * (out.ndim - 2))
-                return out
-            else:
-                return super()._conv_forward(input, weight, bias, *args, **kwargs)
-
        def forward_comfy_cast_weights(self, input):
-            weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
-            x = self._conv_forward(input, weight, bias)
-            uncast_bias_weight(self, weight, bias, offload_stream)
-            return x
+            weight, bias = cast_bias_weight(self, input)
+            return self._conv_forward(input, weight, bias)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -230,13 +158,10 @@ class disable_weight_init:
            return None

        def forward_comfy_cast_weights(self, input):
-            weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
-            x = torch.nn.functional.group_norm(input, self.num_groups, weight, bias, self.eps)
-            uncast_bias_weight(self, weight, bias, offload_stream)
-            return x
+            weight, bias = cast_bias_weight(self, input)
+            return torch.nn.functional.group_norm(input, self.num_groups, weight, bias, self.eps)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -248,17 +173,13 @@ class disable_weight_init:

        def forward_comfy_cast_weights(self, input):
            if self.weight is not None:
-                weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
+                weight, bias = cast_bias_weight(self, input)
            else:
                weight = None
                bias = None
-                offload_stream = None
-            x = torch.nn.functional.layer_norm(input, self.normalized_shape, weight, bias, self.eps)
-            uncast_bias_weight(self, weight, bias, offload_stream)
-            return x
+            return torch.nn.functional.layer_norm(input, self.normalized_shape, weight, bias, self.eps)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -271,18 +192,13 @@ class disable_weight_init:

        def forward_comfy_cast_weights(self, input):
            if self.weight is not None:
-                weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
+                weight, bias = cast_bias_weight(self, input)
            else:
                weight = None
-                bias = None
-                offload_stream = None
-            x = comfy.rmsnorm.rms_norm(input, weight, self.eps)  # TODO: switch to commented out line when old torch is deprecated
-            # x = torch.nn.functional.rms_norm(input, self.normalized_shape, weight, self.eps)
-            uncast_bias_weight(self, weight, bias, offload_stream)
-            return x
+            return comfy.rmsnorm.rms_norm(input, weight, self.eps)  # TODO: switch to commented out line when old torch is deprecated
+            # return torch.nn.functional.rms_norm(input, self.normalized_shape, weight, self.eps)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -298,15 +214,12 @@ class disable_weight_init:
                input, output_size, self.stride, self.padding, self.kernel_size,
                num_spatial_dims, self.dilation)

-            weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
-            x = torch.nn.functional.conv_transpose2d(
+            weight, bias = cast_bias_weight(self, input)
+            return torch.nn.functional.conv_transpose2d(
                input, weight, bias, self.stride, self.padding,
                output_padding, self.groups, self.dilation)
-            uncast_bias_weight(self, weight, bias, offload_stream)
-            return x

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -322,15 +235,12 @@ class disable_weight_init:
                input, output_size, self.stride, self.padding, self.kernel_size,
                num_spatial_dims, self.dilation)

-            weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
-            x = torch.nn.functional.conv_transpose1d(
+            weight, bias = cast_bias_weight(self, input)
+            return torch.nn.functional.conv_transpose1d(
                input, weight, bias, self.stride, self.padding,
                output_padding, self.groups, self.dilation)
-            uncast_bias_weight(self, weight, bias, offload_stream)
-            return x

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -345,14 +255,10 @@ class disable_weight_init:
            output_dtype = out_dtype
            if self.weight.dtype == torch.float16 or self.weight.dtype == torch.bfloat16:
                out_dtype = None
-            weight, bias, offload_stream = cast_bias_weight(self, device=input.device, dtype=out_dtype, offloadable=True)
-            x = torch.nn.functional.embedding(input, weight, self.padding_idx, self.max_norm, self.norm_type, self.scale_grad_by_freq, self.sparse).to(dtype=output_dtype)
-            uncast_bias_weight(self, weight, bias, offload_stream)
-            return x
-
+            weight, bias = cast_bias_weight(self, device=input.device, dtype=out_dtype)
+            return torch.nn.functional.embedding(input, weight, self.padding_idx, self.max_norm, self.norm_type, self.scale_grad_by_freq, self.sparse).to(dtype=output_dtype)

        def forward(self, *args, **kwargs):
-            run_every_op()
            if self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
                return self.forward_comfy_cast_weights(*args, **kwargs)
            else:
@@ -403,18 +309,20 @@ class manual_cast(disable_weight_init):


 def fp8_linear(self, input):
-    """
-    Legacy FP8 linear function for backward compatibility.
-    Uses QuantizedTensor subclass for dispatch.
-    """
    dtype = self.weight.dtype
    if dtype not in [torch.float8_e4m3fn]:
        return None

-    input_dtype = input.dtype
+    tensor_2d = False
+    if len(input.shape) == 2:
+        tensor_2d = True
+        input = input.unsqueeze(1)

-    if input.ndim == 3 or input.ndim == 2:
-        w, bias, offload_stream = cast_bias_weight(self, input, dtype=dtype, bias_dtype=input_dtype, offloadable=True)
+    input_shape = input.shape
+    input_dtype = input.dtype
+    if len(input.shape) == 3:
+        w, bias = cast_bias_weight(self, input, dtype=dtype, bias_dtype=input_dtype)
+        w = w.t()

        scale_weight = self.scale_weight
        scale_input = self.scale_input
@@ -426,20 +334,23 @@ def fp8_linear(self, input):
        if scale_input is None:
            scale_input = torch.ones((), device=input.device, dtype=torch.float32)
            input = torch.clamp(input, min=-448, max=448, out=input)
-            layout_params_weight = {'scale': scale_input, 'orig_dtype': input_dtype}
-            quantized_input = QuantizedTensor(input.to(dtype).contiguous(), "TensorCoreFP8Layout", layout_params_weight)
+            input = input.reshape(-1, input_shape[2]).to(dtype).contiguous()
        else:
            scale_input = scale_input.to(input.device)
-            quantized_input = QuantizedTensor.from_float(input, "TensorCoreFP8Layout", scale=scale_input, dtype=dtype)
+            input = (input * (1.0 / scale_input).to(input_dtype)).reshape(-1, input_shape[2]).to(dtype).contiguous()

-        # Wrap weight in QuantizedTensor - this enables unified dispatch
-        # Call F.linear - __torch_dispatch__ routes to fp8_linear handler in quant_ops.py!
-        layout_params_weight = {'scale': scale_weight, 'orig_dtype': input_dtype}
-        quantized_weight = QuantizedTensor(w, "TensorCoreFP8Layout", layout_params_weight)
-        o = torch.nn.functional.linear(quantized_input, quantized_weight, bias)
+        if bias is not None:
+            o = torch._scaled_mm(input, w, out_dtype=input_dtype, bias=bias, scale_a=scale_input, scale_b=scale_weight)
+        else:
+            o = torch._scaled_mm(input, w, out_dtype=input_dtype, scale_a=scale_input, scale_b=scale_weight)

-        uncast_bias_weight(self, w, bias, offload_stream)
-        return o
+        if isinstance(o, tuple):
+            o = o[0]
+
+        if tensor_2d:
+            return o.reshape(input_shape[0], -1)
+
+        return o.reshape((-1, input_shape[1], self.weight.shape[0]))

    return None

@@ -451,18 +362,15 @@ class fp8_ops(manual_cast):
            return None

        def forward_comfy_cast_weights(self, input):
-            if not self.training:
-                try:
-                    out = fp8_linear(self, input)
-                    if out is not None:
-                        return out
-                except Exception as e:
-                    logging.info("Exception during fp8 op: {}".format(e))
+            try:
+                out = fp8_linear(self, input)
+                if out is not None:
+                    return out
+            except Exception as e:
+                logging.info("Exception during fp8 op: {}".format(e))

-            weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
-            x = torch.nn.functional.linear(input, weight, bias)
-            uncast_bias_weight(self, weight, bias, offload_stream)
-            return x
+            weight, bias = cast_bias_weight(self, input)
+            return torch.nn.functional.linear(input, weight, bias)

 def scaled_fp8_ops(fp8_matrix_mult=False, scale_input=False, override_dtype=None):
    logging.info("Using scaled fp8: fp8 matrix mult: {}, scale input: {}".format(fp8_matrix_mult, scale_input))
@@ -490,26 +398,22 @@ def scaled_fp8_ops(fp8_matrix_mult=False, scale_input=False, override_dtype=None
                    if out is not None:
                        return out

-                weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
+                weight, bias = cast_bias_weight(self, input)

                if weight.numel() < input.numel(): #TODO: optimize
-                    x = torch.nn.functional.linear(input, weight * self.scale_weight.to(device=weight.device, dtype=weight.dtype), bias)
+                    return torch.nn.functional.linear(input, weight * self.scale_weight.to(device=weight.device, dtype=weight.dtype), bias)
                else:
-                    x = torch.nn.functional.linear(input * self.scale_weight.to(device=weight.device, dtype=weight.dtype), weight, bias)
-                uncast_bias_weight(self, weight, bias, offload_stream)
-                return x
+                    return torch.nn.functional.linear(input * self.scale_weight.to(device=weight.device, dtype=weight.dtype), weight, bias)

            def convert_weight(self, weight, inplace=False, **kwargs):
                if inplace:
                    weight *= self.scale_weight.to(device=weight.device, dtype=weight.dtype)
                    return weight
                else:
-                    return weight.to(dtype=torch.float32) * self.scale_weight.to(device=weight.device, dtype=torch.float32)
+                    return weight * self.scale_weight.to(device=weight.device, dtype=weight.dtype)

-            def set_weight(self, weight, inplace_update=False, seed=None, return_weight=False, **kwargs):
+            def set_weight(self, weight, inplace_update=False, seed=None, **kwargs):
                weight = comfy.float.stochastic_rounding(weight / self.scale_weight.to(device=weight.device, dtype=weight.dtype), self.weight.dtype, seed=seed)
-                if return_weight:
-                    return weight
                if inplace_update:
                    self.weight.data.copy_(weight)
                else:
@@ -536,142 +440,8 @@ if CUBLAS_IS_AVAILABLE:
            def forward(self, *args, **kwargs):
                return super().forward(*args, **kwargs)

-
-# ==============================================================================
-# Mixed Precision Operations
-# ==============================================================================
-from .quant_ops import QuantizedTensor, QUANT_ALGOS
-
-
-def mixed_precision_ops(layer_quant_config={}, compute_dtype=torch.bfloat16, full_precision_mm=False):
-    class MixedPrecisionOps(manual_cast):
-        _layer_quant_config = layer_quant_config
-        _compute_dtype = compute_dtype
-        _full_precision_mm = full_precision_mm
-
-        class Linear(torch.nn.Module, CastWeightBiasOp):
-            def __init__(
-                self,
-                in_features: int,
-                out_features: int,
-                bias: bool = True,
-                device=None,
-                dtype=None,
-            ) -> None:
-                super().__init__()
-
-                self.factory_kwargs = {"device": device, "dtype": MixedPrecisionOps._compute_dtype}
-                # self.factory_kwargs = {"device": device, "dtype": dtype}
-
-                self.in_features = in_features
-                self.out_features = out_features
-                if bias:
-                    self.bias = torch.nn.Parameter(torch.empty(out_features, **self.factory_kwargs))
-                else:
-                    self.register_parameter("bias", None)
-
-                self.tensor_class = None
-                self._full_precision_mm = MixedPrecisionOps._full_precision_mm
-
-            def reset_parameters(self):
-                return None
-
-            def _load_from_state_dict(self, state_dict, prefix, local_metadata,
-                                    strict, missing_keys, unexpected_keys, error_msgs):
-
-                device = self.factory_kwargs["device"]
-                layer_name = prefix.rstrip('.')
-                weight_key = f"{prefix}weight"
-                weight = state_dict.pop(weight_key, None)
-                if weight is None:
-                    raise ValueError(f"Missing weight for layer {layer_name}")
-
-                manually_loaded_keys = [weight_key]
-
-                if layer_name not in MixedPrecisionOps._layer_quant_config:
-                    self.weight = torch.nn.Parameter(weight.to(device=device, dtype=MixedPrecisionOps._compute_dtype), requires_grad=False)
-                else:
-                    quant_format = MixedPrecisionOps._layer_quant_config[layer_name].get("format", None)
-                    if quant_format is None:
-                        raise ValueError(f"Unknown quantization format for layer {layer_name}")
-
-                    qconfig = QUANT_ALGOS[quant_format]
-                    self.layout_type = qconfig["comfy_tensor_layout"]
-
-                    weight_scale_key = f"{prefix}weight_scale"
-                    layout_params = {
-                        'scale': state_dict.pop(weight_scale_key, None),
-                        'orig_dtype': MixedPrecisionOps._compute_dtype,
-                        'block_size': qconfig.get("group_size", None),
-                    }
-                    if layout_params['scale'] is not None:
-                        manually_loaded_keys.append(weight_scale_key)
-
-                    self.weight = torch.nn.Parameter(
-                        QuantizedTensor(weight.to(device=device), self.layout_type, layout_params),
-                        requires_grad=False
-                    )
-
-                    for param_name in qconfig["parameters"]:
-                        param_key = f"{prefix}{param_name}"
-                        _v = state_dict.pop(param_key, None)
-                        if _v is None:
-                            continue
-                        setattr(self, param_name, torch.nn.Parameter(_v.to(device=device), requires_grad=False))
-                        manually_loaded_keys.append(param_key)
-
-                super()._load_from_state_dict(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
-
-                for key in manually_loaded_keys:
-                    if key in missing_keys:
-                        missing_keys.remove(key)
-
-            def _forward(self, input, weight, bias):
-                return torch.nn.functional.linear(input, weight, bias)
-
-            def forward_comfy_cast_weights(self, input):
-                weight, bias, offload_stream = cast_bias_weight(self, input, offloadable=True)
-                x = self._forward(input, weight, bias)
-                uncast_bias_weight(self, weight, bias, offload_stream)
-                return x
-
-            def forward(self, input, *args, **kwargs):
-                run_every_op()
-
-                if self._full_precision_mm or self.comfy_cast_weights or len(self.weight_function) > 0 or len(self.bias_function) > 0:
-                    return self.forward_comfy_cast_weights(input, *args, **kwargs)
-                if (getattr(self, 'layout_type', None) is not None and
-                    getattr(self, 'input_scale', None) is not None and
-                    not isinstance(input, QuantizedTensor)):
-                    input = QuantizedTensor.from_float(input, self.layout_type, scale=self.input_scale, dtype=self.weight.dtype)
-                return self._forward(input, self.weight, self.bias)
-
-            def convert_weight(self, weight, inplace=False, **kwargs):
-                if isinstance(weight, QuantizedTensor):
-                    return weight.dequantize()
-                else:
-                    return weight
-
-            def set_weight(self, weight, inplace_update=False, seed=None, return_weight=False, **kwargs):
-                if getattr(self, 'layout_type', None) is not None:
-                    weight = QuantizedTensor.from_float(weight, self.layout_type, scale=None, dtype=self.weight.dtype, stochastic_rounding=seed, inplace_ops=True)
-                else:
-                    weight = weight.to(self.weight.dtype)
-                if return_weight:
-                    return weight
-
-                assert inplace_update is False  # TODO: eventually remove the inplace_update stuff
-                self.weight = torch.nn.Parameter(weight, requires_grad=False)
-
-    return MixedPrecisionOps
-
-def pick_operations(weight_dtype, compute_dtype, load_device=None, disable_fast_fp8=False, fp8_optimizations=False, scaled_fp8=None, model_config=None):
-    fp8_compute = comfy.model_management.supports_fp8_compute(load_device) # TODO: if we support more ops this needs to be more granular
-
-    if model_config and hasattr(model_config, 'layer_quant_config') and model_config.layer_quant_config:
-        logging.info(f"Using mixed precision operations: {len(model_config.layer_quant_config)} quantized layers")
-        return mixed_precision_ops(model_config.layer_quant_config, compute_dtype, full_precision_mm=not fp8_compute)
-
+def pick_operations(weight_dtype, compute_dtype, load_device=None, disable_fast_fp8=False, fp8_optimizations=False, scaled_fp8=None):
+    fp8_compute = comfy.model_management.supports_fp8_compute(load_device)
    if scaled_fp8 is not None:
        return scaled_fp8_ops(fp8_matrix_mult=fp8_compute and fp8_optimizations, scale_input=fp8_optimizations, override_dtype=scaled_fp8)

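Review note on the `fp8_linear` hunk in `comfy/ops.py` above: the added lines unsqueeze a 2-D input to 3-D, flatten it to 2-D for `torch._scaled_mm`, then reshape the result back. The sketch below traces only that shape bookkeeping in plain Python, with no tensors involved. The helper name `fp8_linear_shapes` and its return format are hypothetical and exist only for illustration; they are not part of the patch.

```python
# Hypothetical helper (not in the patch): traces the shapes that the added
# fp8_linear lines would produce, under the assumption that the matmul maps
# (rows, in_features) -> (rows, out_features).
def fp8_linear_shapes(input_shape, out_features):
    shape = tuple(input_shape)
    tensor_2d = len(shape) == 2
    if tensor_2d:
        # Mirrors input.unsqueeze(1): (B, C) -> (B, 1, C)
        shape = (shape[0], 1, shape[1])
    if len(shape) != 3:
        # The patched function returns None for any other rank.
        return None
    rows = shape[0] * shape[1]
    flat = (rows, shape[2])  # mirrors input.reshape(-1, input_shape[2])
    if tensor_2d:
        final = (shape[0], out_features)  # o.reshape(input_shape[0], -1)
    else:
        final = (shape[0], shape[1], out_features)  # o.reshape((-1, input_shape[1], out))
    return flat, final


two_d = fp8_linear_shapes((4, 16), 8)
three_d = fp8_linear_shapes((2, 3, 16), 8)
four_d = fp8_linear_shapes((2, 3, 4, 16), 8)
```

With these inputs, a `(4, 16)` input is multiplied as `(4, 16)` and returned as `(4, 8)`, while a `(2, 3, 16)` input is multiplied as `(6, 16)` and returned as `(2, 3, 8)`. A 4-D input falls through to the non-FP8 path.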
--- a/comfy/patcher_extension.py
+++ b/comfy/patcher_extension.py
@@ -50,7 +50,6 @@ class WrappersMP:
    OUTER_SAMPLE = "outer_sample"
    PREPARE_SAMPLING = "prepare_sampling"
    SAMPLER_SAMPLE = "sampler_sample"
-    PREDICT_NOISE = "predict_noise"
    CALC_COND_BATCH = "calc_cond_batch"
    APPLY_MODEL = "apply_model"
    DIFFUSION_MODEL = "diffusion_model"
@@ -150,7 +149,7 @@ def merge_nested_dicts(dict1: dict, dict2: dict, copy_dict1=True):
    for key, value in dict2.items():
        if isinstance(value, dict):
            curr_value = merged_dict.setdefault(key, {})
-            merged_dict[key] = merge_nested_dicts(curr_value, value)
+            merged_dict[key] = merge_nested_dicts(value, curr_value)
        elif isinstance(value, list):
            merged_dict.setdefault(key, []).extend(value)
        else:
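Review note on the `merge_nested_dicts` hunk above: swapping the two arguments of the recursive call changes which side wins for nested scalar keys and the order in which nested lists are concatenated. The sketch below is a simplified stand-in, not the function from `comfy/patcher_extension.py`. The name `merge_nested`, the `swap_nested` flag, and the unconditional `copy.deepcopy` are assumptions made so that both orderings can be compared side by side.

```python
import copy


# Simplified stand-in for merge_nested_dicts (names and flag are hypothetical).
def merge_nested(dict1, dict2, swap_nested=False):
    merged = copy.deepcopy(dict1)
    for key, value in dict2.items():
        if isinstance(value, dict):
            curr = merged.setdefault(key, {})
            if swap_nested:
                # Order used on the "+" side of the hunk: (value, curr_value)
                merged[key] = merge_nested(value, curr, swap_nested=True)
            else:
                # Order used on the "-" side of the hunk: (curr_value, value)
                merged[key] = merge_nested(curr, value, swap_nested=False)
        elif isinstance(value, list):
            merged.setdefault(key, []).extend(value)
        else:
            merged[key] = value
    return merged


base = {"opts": {"steps": 20, "tags": ["a"]}}
override = {"opts": {"steps": 30, "tags": ["b"]}}

unswapped = merge_nested(base, override)
swapped = merge_nested(base, override, swap_nested=True)
```

In this sketch the unswapped call lets `override` win for the nested scalar (`steps` becomes 30) and appends its list items after the existing ones. The swapped call keeps the nested scalar from `base` (`steps` stays 20) and puts the `override` list items first.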
--- a/comfy/pixel_space_convert.py
+++ b/comfy/pixel_space_convert.py
@@ -1,16 +0,0 @@
-import torch
-
-
-# "Fake" VAE that converts from IMAGE B, H, W, C and values on the scale of 0..1
-# to LATENT B, C, H, W and values on the scale of -1..1.
-class PixelspaceConversionVAE(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.pixel_space_vae = torch.nn.Parameter(torch.tensor(1.0))
-
-    def encode(self, pixels: torch.Tensor, *_args, **_kwargs) -> torch.Tensor:
-        return pixels
-
-    def decode(self, samples: torch.Tensor, *_args, **_kwargs) -> torch.Tensor:
-        return samples
-
--- a/comfy/quant_ops.py
+++ b/comfy/quant_ops.py
@@ -1,572 +0,0 @@
-import torch
-import logging
-from typing import Tuple, Dict
-import comfy.float
-
-_LAYOUT_REGISTRY = {}
-_GENERIC_UTILS = {}
-
-
-def register_layout_op(torch_op, layout_type):
-    """
-    Decorator to register a layout-specific operation handler.
-    Args:
-        torch_op: PyTorch operation (e.g., torch.ops.aten.linear.default)
-        layout_type: Layout class (e.g., TensorCoreFP8Layout)
-    Example:
-        @register_layout_op(torch.ops.aten.linear.default, TensorCoreFP8Layout)
-        def fp8_linear(func, args, kwargs):
-            # FP8-specific linear implementation
-            ...
-    """
-    def decorator(handler_func):
-        if torch_op not in _LAYOUT_REGISTRY:
-            _LAYOUT_REGISTRY[torch_op] = {}
-        _LAYOUT_REGISTRY[torch_op][layout_type] = handler_func
-        return handler_func
-    return decorator
-
-
-def register_generic_util(torch_op):
-    """
-    Decorator to register a generic utility that works for all layouts.
-    Args:
-        torch_op: PyTorch operation (e.g., torch.ops.aten.detach.default)
-
-    Example:
-        @register_generic_util(torch.ops.aten.detach.default)
-        def generic_detach(func, args, kwargs):
-            # Works for any layout
-            ...
-    """
-    def decorator(handler_func):
-        _GENERIC_UTILS[torch_op] = handler_func
-        return handler_func
-    return decorator
-
-
-def _get_layout_from_args(args):
-    for arg in args:
-        if isinstance(arg, QuantizedTensor):
-            return arg._layout_type
-        elif isinstance(arg, (list, tuple)):
-            for item in arg:
-                if isinstance(item, QuantizedTensor):
-                    return item._layout_type
-    return None
-
-
-def _move_layout_params_to_device(params, device):
-    new_params = {}
-    for k, v in params.items():
-        if isinstance(v, torch.Tensor):
-            new_params[k] = v.to(device=device)
-        else:
-            new_params[k] = v
-    return new_params
-
-
-def _copy_layout_params(params):
-    new_params = {}
-    for k, v in params.items():
-        if isinstance(v, torch.Tensor):
-            new_params[k] = v.clone()
-        else:
-            new_params[k] = v
-    return new_params
-
-def _copy_layout_params_inplace(src, dst, non_blocking=False):
-    for k, v in src.items():
-        if isinstance(v, torch.Tensor):
-            dst[k].copy_(v, non_blocking=non_blocking)
-        else:
-            dst[k] = v
-
-class QuantizedLayout:
-    """
-    Base class for quantization layouts.
-
-    A layout encapsulates the format-specific logic for quantization/dequantization
-    and provides a uniform interface for extracting raw tensors needed for computation.
-
-    New quantization formats should subclass this and implement the required methods.
-    """
-    @classmethod
-    def quantize(cls, tensor, **kwargs) -> Tuple[torch.Tensor, Dict]:
-        raise NotImplementedError(f"{cls.__name__} must implement quantize()")
-
-    @staticmethod
-    def dequantize(qdata, **layout_params) -> torch.Tensor:
-        raise NotImplementedError("TensorLayout must implement dequantize()")
-
-    @classmethod
-    def get_plain_tensors(cls, qtensor) -> torch.Tensor:
-        raise NotImplementedError(f"{cls.__name__} must implement get_plain_tensors()")
-
-
-class QuantizedTensor(torch.Tensor):
-    """
-    Universal quantized tensor that works with any layout.
-
-    This tensor subclass uses a pluggable layout system to support multiple
-    quantization formats (FP8, INT4, INT8, etc.) without code duplication.
-
-    The layout_type determines format-specific behavior, while common operations
-    (detach, clone, to) are handled generically.
-
-    Attributes:
-        _qdata: The quantized tensor data
-        _layout_type: Layout class (e.g., TensorCoreFP8Layout)
-        _layout_params: Dict with layout-specific params (scale, zero_point, etc.)
-    """
-
-    @staticmethod
-    def __new__(cls, qdata, layout_type, layout_params):
-        """
-        Create a quantized tensor.
-
-        Args:
-            qdata: The quantized data tensor
-            layout_type: Layout class (subclass of QuantizedLayout)
-            layout_params: Dict with layout-specific parameters
-        """
-        return torch.Tensor._make_wrapper_subclass(cls, qdata.shape, device=qdata.device, dtype=qdata.dtype, requires_grad=False)
-
-    def __init__(self, qdata, layout_type, layout_params):
-        self._qdata = qdata
-        self._layout_type = layout_type
-        self._layout_params = layout_params
-
-    def __repr__(self):
-        layout_name = self._layout_type
-        param_str = ", ".join(f"{k}={v}" for k, v in list(self._layout_params.items())[:2])
-        return f"QuantizedTensor(shape={self.shape}, layout={layout_name}, {param_str})"
-
-    @property
-    def layout_type(self):
-        return self._layout_type
-
-    def __tensor_flatten__(self):
-        """
-        Tensor flattening protocol for proper device movement.
-        """
-        inner_tensors = ["_qdata"]
-        ctx = {
-            "layout_type": self._layout_type,
-        }
-
-        tensor_params = {}
-        non_tensor_params = {}
-        for k, v in self._layout_params.items():
-            if isinstance(v, torch.Tensor):
-                tensor_params[k] = v
-            else:
-                non_tensor_params[k] = v
-
-        ctx["tensor_param_keys"] = list(tensor_params.keys())
-        ctx["non_tensor_params"] = non_tensor_params
-
-        for k, v in tensor_params.items():
-            attr_name = f"_layout_param_{k}"
-            object.__setattr__(self, attr_name, v)
-            inner_tensors.append(attr_name)
-
-        return inner_tensors, ctx
-
-    @staticmethod
-    def __tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride):
-        """
-        Tensor unflattening protocol for proper device movement.
-        Reconstructs the QuantizedTensor after device movement.
-        """
-        layout_type = ctx["layout_type"]
-        layout_params = dict(ctx["non_tensor_params"])
-
-        for key in ctx["tensor_param_keys"]:
-            attr_name = f"_layout_param_{key}"
-            layout_params[key] = inner_tensors[attr_name]
-
-        return QuantizedTensor(inner_tensors["_qdata"], layout_type, layout_params)
-
-    @classmethod
-    def from_float(cls, tensor, layout_type, **quantize_kwargs) -> 'QuantizedTensor':
-        qdata, layout_params = LAYOUTS[layout_type].quantize(tensor, **quantize_kwargs)
-        return cls(qdata, layout_type, layout_params)
-
-    def dequantize(self) -> torch.Tensor:
-        return LAYOUTS[self._layout_type].dequantize(self._qdata, **self._layout_params)
-
-    @classmethod
-    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
-        kwargs = kwargs or {}
-
-        # Step 1: Check generic utilities first (detach, clone, to, etc.)
-        if func in _GENERIC_UTILS:
-            return _GENERIC_UTILS[func](func, args, kwargs)
-
-        # Step 2: Check layout-specific handlers (linear, matmul, etc.)
-        layout_type = _get_layout_from_args(args)
-        if layout_type and func in _LAYOUT_REGISTRY:
-            handler = _LAYOUT_REGISTRY[func].get(layout_type)
-            if handler:
-                return handler(func, args, kwargs)
-
-        # Step 3: Fallback to dequantization
-        if isinstance(args[0] if args else None, QuantizedTensor):
-            logging.info(f"QuantizedTensor: Unhandled operation {func}, falling back to dequantization. kwargs={kwargs}")
-        return cls._dequant_and_fallback(func, args, kwargs)
-
-    @classmethod
-    def _dequant_and_fallback(cls, func, args, kwargs):
-        def dequant_arg(arg):
-            if isinstance(arg, QuantizedTensor):
-                return arg.dequantize()
-            elif isinstance(arg, (list, tuple)):
-                return type(arg)(dequant_arg(a) for a in arg)
-            return arg
-
-        new_args = dequant_arg(args)
-        new_kwargs = dequant_arg(kwargs)
-        return func(*new_args, **new_kwargs)
-
-    def data_ptr(self):
-        return self._qdata.data_ptr()
-
-    def is_pinned(self):
-        return self._qdata.is_pinned()
-
-    def is_contiguous(self):
-        return self._qdata.is_contiguous()
-
-# ==============================================================================
-# Generic Utilities (Layout-Agnostic Operations)
-# ==============================================================================
-
-def _create_transformed_qtensor(qt, transform_fn):
-    new_data = transform_fn(qt._qdata)
-    new_params = _copy_layout_params(qt._layout_params)
-    return QuantizedTensor(new_data, qt._layout_type, new_params)
-
-
-def _handle_device_transfer(qt, target_device, target_dtype=None, target_layout=None, op_name="to"):
-    if target_dtype is not None and target_dtype != qt.dtype:
-        logging.warning(
-            f"QuantizedTensor: dtype conversion requested to {target_dtype}, "
-            f"but not supported for quantized tensors. Ignoring dtype."
-        )
-
-    if target_layout is not None and target_layout != torch.strided:
-        logging.warning(
-            f"QuantizedTensor: layout change requested to {target_layout}, "
-            f"but not supported. Ignoring layout."
-        )
-
-    # Handle device transfer
-    current_device = qt._qdata.device
-    if target_device is not None:
-        # Normalize device for comparison
-        if isinstance(target_device, str):
-            target_device = torch.device(target_device)
-        if isinstance(current_device, str):
-            current_device = torch.device(current_device)
-
-        if target_device != current_device:
-            logging.debug(f"QuantizedTensor.{op_name}: Moving from {current_device} to {target_device}")
-            new_q_data = qt._qdata.to(device=target_device)
-            new_params = _move_layout_params_to_device(qt._layout_params, target_device)
-            new_qt = QuantizedTensor(new_q_data, qt._layout_type, new_params)
-            logging.debug(f"QuantizedTensor.{op_name}: Created new tensor on {target_device}")
-            return new_qt
-
-    logging.debug(f"QuantizedTensor.{op_name}: No device change needed, returning original")
-    return qt
-
-
-@register_generic_util(torch.ops.aten.detach.default)
-def generic_detach(func, args, kwargs):
-    """Detach operation - creates a detached copy of the quantized tensor."""
-    qt = args[0]
-    if isinstance(qt, QuantizedTensor):
-        return _create_transformed_qtensor(qt, lambda x: x.detach())
-    return func(*args, **kwargs)
-
-
-@register_generic_util(torch.ops.aten.clone.default)
-def generic_clone(func, args, kwargs):
-    """Clone operation - creates a deep copy of the quantized tensor."""
-    qt = args[0]
-    if isinstance(qt, QuantizedTensor):
-        return _create_transformed_qtensor(qt, lambda x: x.clone())
-    return func(*args, **kwargs)
-
-
-@register_generic_util(torch.ops.aten._to_copy.default)
-def generic_to_copy(func, args, kwargs):
-    """Device/dtype transfer operation - handles .to(device) calls."""
-    qt = args[0]
-    if isinstance(qt, QuantizedTensor):
-        return _handle_device_transfer(
-            qt,
-            target_device=kwargs.get('device', None),
-            target_dtype=kwargs.get('dtype', None),
-            op_name="_to_copy"
-        )
-    return func(*args, **kwargs)
-
-
-@register_generic_util(torch.ops.aten.to.dtype_layout)
-def generic_to_dtype_layout(func, args, kwargs):
-    """Handle .to(device) calls using the dtype_layout variant."""
-    qt = args[0]
-    if isinstance(qt, QuantizedTensor):
-        return _handle_device_transfer(
-            qt,
-            target_device=kwargs.get('device', None),
-            target_dtype=kwargs.get('dtype', None),
-            target_layout=kwargs.get('layout', None),
-            op_name="to"
-        )
-    return func(*args, **kwargs)
-
-
-@register_generic_util(torch.ops.aten.copy_.default)
-def generic_copy_(func, args, kwargs):
-    qt_dest = args[0]
-    src = args[1]
-    non_blocking = args[2] if len(args) > 2 else False
-    if isinstance(qt_dest, QuantizedTensor):
-        if isinstance(src, QuantizedTensor):
-            # Copy from another quantized tensor
-            qt_dest._qdata.copy_(src._qdata, non_blocking=non_blocking)
-            qt_dest._layout_type = src._layout_type
-            _copy_layout_params_inplace(src._layout_params, qt_dest._layout_params, non_blocking=non_blocking)
-        else:
-            # Copy from regular tensor - just copy raw data
-            qt_dest._qdata.copy_(src)
-        return qt_dest
-    return func(*args, **kwargs)
-
-
-@register_generic_util(torch.ops.aten.to.dtype)
-def generic_to_dtype(func, args, kwargs):
-    """Handle .to(dtype) calls - dtype conversion only."""
-    src = args[0]
-    if isinstance(src, QuantizedTensor):
-        # For dtype-only conversion, just change the orig_dtype, no real cast is needed
-        target_dtype = args[1] if len(args) > 1 else kwargs.get('dtype')
-        src._layout_params["orig_dtype"] = target_dtype
-        return src
-    return func(*args, **kwargs)
-
-
-@register_generic_util(torch.ops.aten._has_compatible_shallow_copy_type.default)
-def generic_has_compatible_shallow_copy_type(func, args, kwargs):
-    return True
-
-
-@register_generic_util(torch.ops.aten.empty_like.default)
-def generic_empty_like(func, args, kwargs):
-    """Empty_like operation - creates an empty tensor with the same quantized structure."""
-    qt = args[0]
-    if isinstance(qt, QuantizedTensor):
-        # Create empty tensor with same shape and dtype as the quantized data
-        hp_dtype = kwargs.pop('dtype', qt._layout_params["orig_dtype"])
-        new_qdata = torch.empty_like(qt._qdata, **kwargs)
-
-        # Handle device transfer for layout params
-        target_device = kwargs.get('device', new_qdata.device)
-        new_params = _move_layout_params_to_device(qt._layout_params, target_device)
-
-        # Update orig_dtype if dtype is specified
-        new_params['orig_dtype'] = hp_dtype
-
-        return QuantizedTensor(new_qdata, qt._layout_type, new_params)
-    return func(*args, **kwargs)
-
-# ==============================================================================
-# FP8 Layout + Operation Handlers
-# ==============================================================================
-class TensorCoreFP8Layout(QuantizedLayout):
-    """
-    Storage format:
-    - qdata: FP8 tensor (torch.float8_e4m3fn or torch.float8_e5m2)
-    - scale: Scalar tensor (float32) for dequantization
-    - orig_dtype: Original dtype before quantization (for casting back)
-    """
-    @classmethod
-    def quantize(cls, tensor, scale=None, dtype=torch.float8_e4m3fn, stochastic_rounding=0, inplace_ops=False):
-        orig_dtype = tensor.dtype
-
-        if scale is None:
-            scale = torch.amax(tensor.abs()) / torch.finfo(dtype).max
-
-        if not isinstance(scale, torch.Tensor):
-            scale = torch.tensor(scale)
-        scale = scale.to(device=tensor.device, dtype=torch.float32)
-
-        if inplace_ops:
-            tensor *= (1.0 / scale).to(tensor.dtype)
-        else:
-            tensor = tensor * (1.0 / scale).to(tensor.dtype)
-
-        if stochastic_rounding > 0:
-            tensor = comfy.float.stochastic_rounding(tensor, dtype=dtype, seed=stochastic_rounding)
-        else:
-            lp_amax = torch.finfo(dtype).max
-            torch.clamp(tensor, min=-lp_amax, max=lp_amax, out=tensor)
-            tensor = tensor.to(dtype, memory_format=torch.contiguous_format)
-
-        layout_params = {
-            'scale': scale,
-            'orig_dtype': orig_dtype
-        }
-        return tensor, layout_params
-
-    @staticmethod
-    def dequantize(qdata, scale, orig_dtype, **kwargs):
-        plain_tensor = torch.ops.aten._to_copy.default(qdata, dtype=orig_dtype)
-        return plain_tensor * scale
-
-    @classmethod
-    def get_plain_tensors(cls, qtensor):
-        return qtensor._qdata, qtensor._layout_params['scale']
-
-QUANT_ALGOS = {
-    "float8_e4m3fn": {
-        "storage_t": torch.float8_e4m3fn,
-        "parameters": {"weight_scale", "input_scale"},
-        "comfy_tensor_layout": "TensorCoreFP8Layout",
-    },
-}
-
-LAYOUTS = {
-    "TensorCoreFP8Layout": TensorCoreFP8Layout,
-}
-
-
-@register_layout_op(torch.ops.aten.linear.default, "TensorCoreFP8Layout")
-def fp8_linear(func, args, kwargs):
-    input_tensor = args[0]
-    weight = args[1]
-    bias = args[2] if len(args) > 2 else None
-
-    if isinstance(input_tensor, QuantizedTensor) and isinstance(weight, QuantizedTensor):
-        plain_input, scale_a = TensorCoreFP8Layout.get_plain_tensors(input_tensor)
-        plain_weight, scale_b = TensorCoreFP8Layout.get_plain_tensors(weight)
-
-        out_dtype = kwargs.get("out_dtype")
-        if out_dtype is None:
-            out_dtype = input_tensor._layout_params['orig_dtype']
-
-        weight_t = plain_weight.t()
-
-        tensor_2d = False
-        if len(plain_input.shape) == 2:
-            tensor_2d = True
-            plain_input = plain_input.unsqueeze(1)
-
-        input_shape = plain_input.shape
-        if len(input_shape) != 3:
-            return None
-
-        try:
-            output = torch._scaled_mm(
-                plain_input.reshape(-1, input_shape[2]).contiguous(),
-                weight_t,
-                bias=bias,
-                scale_a=scale_a,
-                scale_b=scale_b,
-                out_dtype=out_dtype,
-            )
-
-            if isinstance(output, tuple):  # TODO: remove when we drop support for torch 2.4
-                output = output[0]
-
-            if not tensor_2d:
-                output = output.reshape((-1, input_shape[1], weight.shape[0]))
-
-            if output.dtype in [torch.float8_e4m3fn, torch.float8_e5m2]:
-                output_scale = scale_a * scale_b
-                output_params = {
-                    'scale': output_scale,
-                    'orig_dtype': input_tensor._layout_params['orig_dtype']
-                }
-                return QuantizedTensor(output, "TensorCoreFP8Layout", output_params)
-            else:
-                return output
-
-        except Exception as e:
-            raise RuntimeError(f"FP8 _scaled_mm failed, falling back to dequantization: {e}")
-
-    # Case 2: DQ Fallback
-    if isinstance(weight, QuantizedTensor):
-        weight = weight.dequantize()
-    if isinstance(input_tensor, QuantizedTensor):
-        input_tensor = input_tensor.dequantize()
-
-    return torch.nn.functional.linear(input_tensor, weight, bias)
-
-def fp8_mm_(input_tensor, weight, bias=None, out_dtype=None):
-    if out_dtype is None:
-        out_dtype = input_tensor._layout_params['orig_dtype']
-
-    plain_input, scale_a = TensorCoreFP8Layout.get_plain_tensors(input_tensor)
-    plain_weight, scale_b = TensorCoreFP8Layout.get_plain_tensors(weight)
-
-    output = torch._scaled_mm(
-        plain_input.contiguous(),
-        plain_weight,
-        bias=bias,
-        scale_a=scale_a,
-        scale_b=scale_b,
-        out_dtype=out_dtype,
-    )
-
-    if isinstance(output, tuple):  # TODO: remove when we drop support for torch 2.4
-        output = output[0]
-    return output
-
-@register_layout_op(torch.ops.aten.addmm.default, "TensorCoreFP8Layout")
-def fp8_addmm(func, args, kwargs):
-    input_tensor = args[1]
-    weight = args[2]
-    bias = args[0]
-
-    if isinstance(input_tensor, QuantizedTensor) and isinstance(weight, QuantizedTensor):
-        return fp8_mm_(input_tensor, weight, bias=bias, out_dtype=kwargs.get("out_dtype", None))
-
-    a = list(args)
-    if isinstance(args[0], QuantizedTensor):
-        a[0] = args[0].dequantize()
-    if isinstance(args[1], QuantizedTensor):
-        a[1] = args[1].dequantize()
-    if isinstance(args[2], QuantizedTensor):
-        a[2] = args[2].dequantize()
-
-    return func(*a, **kwargs)
-
-@register_layout_op(torch.ops.aten.mm.default, "TensorCoreFP8Layout")
-def fp8_mm(func, args, kwargs):
-    input_tensor = args[0]
-    weight = args[1]
-
-    if isinstance(input_tensor, QuantizedTensor) and isinstance(weight, QuantizedTensor):
-        return fp8_mm_(input_tensor, weight, bias=None, out_dtype=kwargs.get("out_dtype", None))
-
-    a = list(args)
-    if isinstance(args[0], QuantizedTensor):
-        a[0] = args[0].dequantize()
-    if isinstance(args[1], QuantizedTensor):
-        a[1] = args[1].dequantize()
-    return func(*a, **kwargs)
-
-@register_layout_op(torch.ops.aten.view.default, "TensorCoreFP8Layout")
-@register_layout_op(torch.ops.aten.t.default, "TensorCoreFP8Layout")
-def fp8_func(func, args, kwargs):
-    input_tensor = args[0]
-    if isinstance(input_tensor, QuantizedTensor):
-        plain_input, scale_a = TensorCoreFP8Layout.get_plain_tensors(input_tensor)
-        ar = list(args)
-        ar[0] = plain_input
-        return QuantizedTensor(func(*ar, **kwargs), "TensorCoreFP8Layout", input_tensor._layout_params)
-    return func(*args, **kwargs)
--- a/comfy/sample.py
+++ b/comfy/sample.py
@@ -4,9 +4,13 @@ import comfy.samplers
 import comfy.utils
 import numpy as np
 import logging
-import comfy.nested_tensor

-def prepare_noise_inner(latent_image, generator, noise_inds=None):
+def prepare_noise(latent_image, seed, noise_inds=None):
+    """
+    creates random noise given a latent image and a seed.
+    optional arg skip can be used to skip and discard x number of noise generations for a given seed
+    """
+    generator = torch.manual_seed(seed)
    if noise_inds is None:
        return torch.randn(latent_image.size(), dtype=latent_image.dtype, layout=latent_image.layout, generator=generator, device="cpu")

@@ -17,29 +21,10 @@ def prepare_noise_inner(latent_image, generator, noise_inds=None):
        if i in unique_inds:
            noises.append(noise)
    noises = [noises[i] for i in inverse]
-    return torch.cat(noises, axis=0)
-
-def prepare_noise(latent_image, seed, noise_inds=None):
-    """
-    creates random noise given a latent image and a seed.
-    optional arg skip can be used to skip and discard x number of noise generations for a given seed
-    """
-    generator = torch.manual_seed(seed)
-
-    if latent_image.is_nested:
-        tensors = latent_image.unbind()
-        noises = []
-        for t in tensors:
-            noises.append(prepare_noise_inner(t, generator, noise_inds))
-        noises = comfy.nested_tensor.NestedTensor(noises)
-    else:
-        noises = prepare_noise_inner(latent_image, generator, noise_inds)
-
+    noises = torch.cat(noises, axis=0)
    return noises

 def fix_empty_latent_channels(model, latent_image):
-    if latent_image.is_nested:
-        return latent_image
    latent_format = model.get_model_object("latent_format") #Resize the empty latent image so it has the right number of channels
    if latent_format.latent_channels != latent_image.shape[1] and torch.count_nonzero(latent_image) == 0:
        latent_image = comfy.utils.repeat_to_batch_size(latent_image, latent_format.latent_channels, dim=1)
Author	SHA1	Message	Date
Jedrzej Kosinski	6c611b0b99	Change node id to reflect node name	2025-08-18 15:39:16 -07:00
Jedrzej Kosinski	cd54d502fc	Make sure models_memory_reserve is considered with inference_memory as well in max func calls	2025-08-18 15:34:53 -07:00
Jedrzej Kosinski	63571c6c3d	Renamed to Reserve Additional Memory	2025-08-18 15:04:49 -07:00
Jedrzej Kosinski	bae0c31a68	Added missing model.clone() call	2025-08-18 14:51:11 -07:00
Jedrzej Kosinski	34b1f51f4a	Created Add Memory to Reserve node	2025-08-18 14:45:21 -07:00