Add the websocket library for automated tests

Add a missing file
It looks like this got caught by .gitignore? There's probably a better place to put it, but I'm not sure what that is.
2026-02-12 03:00:03 +00:00 · 2025-06-13 21:51:32 -07:00 · 2025-06-13 21:45:21 -07:00 · 2025-06-13 21:39:26 -07:00 · 2025-06-13 21:39:26 -07:00
252 changed files with 5809 additions and 184596 deletions
--- a/.ci/windows_base_files/README_VERY_IMPORTANT.txt
+++ b/.ci/windows_base_files/README_VERY_IMPORTANT.txt
@@ -4,9 +4,6 @@ if you have a NVIDIA gpu:

 run_nvidia_gpu.bat

-if you want to enable the fast fp16 accumulation (faster for fp16 models with slightly less quality):
-
-run_nvidia_gpu_fast_fp16_accumulation.bat


 To run it in slow CPU mode:
--- a/.gitattributes
+++ b/.gitattributes
@@ -1,3 +1,2 @@
 /web/assets/** linguist-generated
 /web/** linguist-vendored
-comfy_api_nodes/apis/__init__.py linguist-generated
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -22,7 +22,7 @@ body:
      description: Please confirm you have tried to reproduce the issue with all custom nodes disabled.
      options:
        - label: I have tried disabling custom nodes and the issue persists (see [how to disable custom nodes](https://docs.comfy.org/troubleshooting/custom-node-issues#step-1%3A-test-with-all-custom-nodes-disabled) if you need help)
-          required: false
+          required: true
  - type: textarea
    attributes:
      label: Expected Behavior
--- a/.github/ISSUE_TEMPLATE/user-support.yml
+++ b/.github/ISSUE_TEMPLATE/user-support.yml
@@ -18,7 +18,7 @@ body:
        description: Please confirm you have tried to reproduce the issue with all custom nodes disabled.
        options:
          - label: I have tried disabling custom nodes and the issue persists (see [how to disable custom nodes](https://docs.comfy.org/troubleshooting/custom-node-issues#step-1%3A-test-with-all-custom-nodes-disabled) if you need help)
-            required: false
+            required: true
    - type: textarea
      attributes:
            label: Your question
--- a/.github/workflows/check-line-endings.yml
+++ b/.github/workflows/check-line-endings.yml
@@ -1,40 +0,0 @@
-name: Check for Windows Line Endings
-
-on:
-  pull_request:
-    branches: ['*'] # Trigger on all pull requests to any branch
-
-jobs:
-  check-line-endings:
-    runs-on: ubuntu-latest
-
-    steps:
-      - name: Checkout code
-        uses: actions/checkout@v4
-        with:
-          fetch-depth: 0 # Fetch all history to compare changes
-
-      - name: Check for Windows line endings (CRLF)
-        run: |
-          # Get the list of changed files in the PR
-          CHANGED_FILES=$(git diff --name-only ${{ github.event.pull_request.base.sha }}..${{ github.event.pull_request.head.sha }})
-
-          # Flag to track if CRLF is found
-          CRLF_FOUND=false
-
-          # Loop through each changed file
-          for FILE in $CHANGED_FILES; do
-            # Check if the file exists and is a text file
-            if [ -f "$FILE" ] && file "$FILE" | grep -q "text"; then
-              # Check for CRLF line endings
-              if grep -UP '\r$' "$FILE"; then
-                echo "Error: Windows line endings (CRLF) detected in $FILE"
-                CRLF_FOUND=true
-              fi
-            fi
-          done
-
-          # Exit with error if CRLF was found
-          if [ "$CRLF_FOUND" = true ]; then
-            exit 1
-          fi
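
Review note: for anyone who wants to keep this check locally now that the workflow is gone, the logic above can be approximated in a few lines of Python. This is a minimal sketch, not the project's tooling. The names `has_crlf` and `find_crlf_files` are hypothetical, and the sketch skips the `file ... | grep -q "text"` filter, treating any file containing a NUL byte as binary instead.

```python
from pathlib import Path
from typing import Iterable, List


def has_crlf(path: Path) -> bool:
    """Return True if the file looks like text and contains a CRLF line ending."""
    data = path.read_bytes()
    if b"\x00" in data:
        # Assumption: a NUL byte means binary, so the file is skipped.
        return False
    return b"\r\n" in data


def find_crlf_files(paths: Iterable[Path]) -> List[Path]:
    """Return the subset of existing files that contain CRLF line endings."""
    return [p for p in paths if p.is_file() and has_crlf(p)]
```

A caller would pass the list of changed files (for example, the output of `git diff --name-only`) and fail the check if the returned list is non-empty.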
--- a/.github/workflows/release-webhook.yml
+++ b/.github/workflows/release-webhook.yml
@@ -1,108 +0,0 @@
-name: Release Webhook
-
-on:
-  release:
-    types: [published]
-
-jobs:
-  send-webhook:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Send release webhook
-        env:
-          WEBHOOK_URL: ${{ secrets.RELEASE_GITHUB_WEBHOOK_URL }}
-          WEBHOOK_SECRET: ${{ secrets.RELEASE_GITHUB_WEBHOOK_SECRET }}
-        run: |
-          # Generate UUID for delivery ID
-          DELIVERY_ID=$(uuidgen)
-          HOOK_ID="release-webhook-$(date +%s)"
-          
-          # Create webhook payload matching GitHub release webhook format
-          PAYLOAD=$(cat <<EOF
-          {
-            "action": "published",
-            "release": {
-              "id": ${{ github.event.release.id }},
-              "node_id": "${{ github.event.release.node_id }}",
-              "url": "${{ github.event.release.url }}",
-              "html_url": "${{ github.event.release.html_url }}",
-              "assets_url": "${{ github.event.release.assets_url }}",
-              "upload_url": "${{ github.event.release.upload_url }}",
-              "tag_name": "${{ github.event.release.tag_name }}",
-              "target_commitish": "${{ github.event.release.target_commitish }}",
-              "name": ${{ toJSON(github.event.release.name) }},
-              "body": ${{ toJSON(github.event.release.body) }},
-              "draft": ${{ github.event.release.draft }},
-              "prerelease": ${{ github.event.release.prerelease }},
-              "created_at": "${{ github.event.release.created_at }}",
-              "published_at": "${{ github.event.release.published_at }}",
-              "author": {
-                "login": "${{ github.event.release.author.login }}",
-                "id": ${{ github.event.release.author.id }},
-                "node_id": "${{ github.event.release.author.node_id }}",
-                "avatar_url": "${{ github.event.release.author.avatar_url }}",
-                "url": "${{ github.event.release.author.url }}",
-                "html_url": "${{ github.event.release.author.html_url }}",
-                "type": "${{ github.event.release.author.type }}",
-                "site_admin": ${{ github.event.release.author.site_admin }}
-              },
-              "tarball_url": "${{ github.event.release.tarball_url }}",
-              "zipball_url": "${{ github.event.release.zipball_url }}",
-              "assets": ${{ toJSON(github.event.release.assets) }}
-            },
-            "repository": {
-              "id": ${{ github.event.repository.id }},
-              "node_id": "${{ github.event.repository.node_id }}",
-              "name": "${{ github.event.repository.name }}",
-              "full_name": "${{ github.event.repository.full_name }}",
-              "private": ${{ github.event.repository.private }},
-              "owner": {
-                "login": "${{ github.event.repository.owner.login }}",
-                "id": ${{ github.event.repository.owner.id }},
-                "node_id": "${{ github.event.repository.owner.node_id }}",
-                "avatar_url": "${{ github.event.repository.owner.avatar_url }}",
-                "url": "${{ github.event.repository.owner.url }}",
-                "html_url": "${{ github.event.repository.owner.html_url }}",
-                "type": "${{ github.event.repository.owner.type }}",
-                "site_admin": ${{ github.event.repository.owner.site_admin }}
-              },
-              "html_url": "${{ github.event.repository.html_url }}",
-              "clone_url": "${{ github.event.repository.clone_url }}",
-              "git_url": "${{ github.event.repository.git_url }}",
-              "ssh_url": "${{ github.event.repository.ssh_url }}",
-              "url": "${{ github.event.repository.url }}",
-              "created_at": "${{ github.event.repository.created_at }}",
-              "updated_at": "${{ github.event.repository.updated_at }}",
-              "pushed_at": "${{ github.event.repository.pushed_at }}",
-              "default_branch": "${{ github.event.repository.default_branch }}",
-              "fork": ${{ github.event.repository.fork }}
-            },
-            "sender": {
-              "login": "${{ github.event.sender.login }}",
-              "id": ${{ github.event.sender.id }},
-              "node_id": "${{ github.event.sender.node_id }}",
-              "avatar_url": "${{ github.event.sender.avatar_url }}",
-              "url": "${{ github.event.sender.url }}",
-              "html_url": "${{ github.event.sender.html_url }}",
-              "type": "${{ github.event.sender.type }}",
-              "site_admin": ${{ github.event.sender.site_admin }}
-            }
-          }
-          EOF
-          )
-          
-          # Generate HMAC-SHA256 signature
-          SIGNATURE=$(echo -n "$PAYLOAD" | openssl dgst -sha256 -hmac "$WEBHOOK_SECRET" -hex | cut -d' ' -f2)
-          
-          # Send webhook with required headers
-          curl -X POST "$WEBHOOK_URL" \
-            -H "Content-Type: application/json" \
-            -H "X-GitHub-Event: release" \
-            -H "X-GitHub-Delivery: $DELIVERY_ID" \
-            -H "X-GitHub-Hook-ID: $HOOK_ID" \
-            -H "X-Hub-Signature-256: sha256=$SIGNATURE" \
-            -H "User-Agent: GitHub-Actions-Webhook/1.0" \
-            -d "$PAYLOAD" \
-            --fail --silent --show-error
-          
-          echo "✅ Release webhook sent successfully"
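
Review note: the removed workflow signed the payload with HMAC-SHA256 and sent the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. For reference, the same signing step, plus the matching check a receiver would need, can be sketched with the Python standard library. The function names `sign_payload` and `verify_signature` are hypothetical and not part of this repository.

```python
import hashlib
import hmac


def sign_payload(secret: str, payload: bytes) -> str:
    """Return the header value 'sha256=<hex>' for the given payload bytes."""
    digest = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_signature(secret: str, payload: bytes, header_value: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_payload(secret, payload)
    return hmac.compare_digest(expected, header_value)
```

The receiver has to verify against the exact bytes it received, since any re-serialization of the JSON body would change the digest.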
--- a/.github/workflows/stable-release.yml
+++ b/.github/workflows/stable-release.yml
@@ -12,17 +12,17 @@ on:
        description: 'CUDA version'
        required: true
        type: string
-        default: "129"
+        default: "128"
      python_minor:
        description: 'Python minor version'
        required: true
        type: string
-        default: "13"
+        default: "12"
      python_patch:
        description: 'Python patch version'
        required: true
        type: string
-        default: "6"
+        default: "10"


 jobs:
@@ -66,13 +66,8 @@ jobs:
          curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
          ./python.exe get-pip.py
          ./python.exe -s -m pip install ../cu${{ inputs.cu }}_python_deps/*
-          sed -i '1i../ComfyUI' ./python3${{ inputs.python_minor }}._pth
-
-          rm ./Lib/site-packages/torch/lib/dnnl.lib #I don't think this is actually used and I need the space
-          rm ./Lib/site-packages/torch/lib/libprotoc.lib
-          rm ./Lib/site-packages/torch/lib/libprotobuf.lib
-
-          cd ..
+            sed -i '1i../ComfyUI' ./python3${{ inputs.python_minor }}._pth
+            cd ..

          git clone --depth 1 https://github.com/comfyanonymous/taesd
          cp taesd/*.safetensors ./ComfyUI_copy/models/vae_approx/
@@ -90,7 +85,7 @@ jobs:

          cd ..

-          "C:\Program Files\7-Zip\7z.exe" a -t7z -m0=lzma2 -mx=9 -mfb=128 -md=768m -ms=on -mf=BCJ2 ComfyUI_windows_portable.7z ComfyUI_windows_portable
+          "C:\Program Files\7-Zip\7z.exe" a -t7z -m0=lzma2 -mx=9 -mfb=128 -md=512m -ms=on -mf=BCJ2 ComfyUI_windows_portable.7z ComfyUI_windows_portable
          mv ComfyUI_windows_portable.7z ComfyUI/ComfyUI_windows_portable_nvidia.7z

          cd ComfyUI_windows_portable
@@ -107,4 +102,5 @@ jobs:
          file: ComfyUI_windows_portable_nvidia.7z
          tag: ${{ inputs.git_tag }}
          overwrite: true
-          draft: true
+          prerelease: true
+          make_latest: false
--- a/.github/workflows/test-execution.yml
+++ b/.github/workflows/test-execution.yml
@@ -1,30 +0,0 @@
-name: Execution Tests
-
-on:
-  push:
-    branches: [ main, master ]
-  pull_request:
-    branches: [ main, master ]
-
-jobs:
-  test:
-    strategy:
-      matrix:
-        os: [ubuntu-latest, windows-latest, macos-latest]
-    runs-on: ${{ matrix.os }}
-    continue-on-error: true
-    steps:
-    - uses: actions/checkout@v4
-    - name: Set up Python      
-      uses: actions/setup-python@v4
-      with:
-        python-version: '3.12'
-    - name: Install requirements
-      run: |
-        python -m pip install --upgrade pip
-        pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-        pip install -r requirements.txt
-        pip install -r tests-unit/requirements.txt
-    - name: Run Execution Tests
-      run: |
-        python -m pytest tests/execution -v --skip-timing-checks
--- a/.github/workflows/test-unit.yml
+++ b/.github/workflows/test-unit.yml
@@ -28,3 +28,7 @@ jobs:
      run: |
        pip install -r tests-unit/requirements.txt
        python -m pytest tests-unit
+    - name: Run Execution Model Tests
+      run: |
+        python -m pytest tests/inference/test_execution.py
+
--- a/.github/workflows/windows_release_dependencies.yml
+++ b/.github/workflows/windows_release_dependencies.yml
@@ -17,19 +17,19 @@ on:
        description: 'cuda version'
        required: true
        type: string
-        default: "129"
+        default: "128"

      python_minor:
        description: 'python minor version'
        required: true
        type: string
-        default: "13"
+        default: "12"

      python_patch:
        description: 'python patch version'
        required: true
        type: string
-        default: "6"
+        default: "10"
 #  push:
 #    branches:
 #      - master
--- a/.github/workflows/windows_release_nightly_pytorch.yml
+++ b/.github/workflows/windows_release_nightly_pytorch.yml
@@ -7,7 +7,7 @@ on:
        description: 'cuda version'
        required: true
        type: string
-        default: "129"
+        default: "128"

      python_minor:
        description: 'python minor version'
@@ -19,7 +19,7 @@ on:
        description: 'python patch version'
        required: true
        type: string
-        default: "5"
+        default: "2"
 #  push:
 #    branches:
 #      - master
@@ -53,8 +53,6 @@ jobs:
            ls ../temp_wheel_dir
            ./python.exe -s -m pip install --pre ../temp_wheel_dir/*
            sed -i '1i../ComfyUI' ./python3${{ inputs.python_minor }}._pth
-
-            rm ./Lib/site-packages/torch/lib/dnnl.lib #I don't think this is actually used and I need the space
            cd ..

            git clone --depth 1 https://github.com/comfyanonymous/taesd
--- a/.github/workflows/windows_release_package.yml
+++ b/.github/workflows/windows_release_package.yml
@@ -7,19 +7,19 @@ on:
        description: 'cuda version'
        required: true
        type: string
-        default: "129"
+        default: "128"

      python_minor:
        description: 'python minor version'
        required: true
        type: string
-        default: "13"
+        default: "12"

      python_patch:
        description: 'python patch version'
        required: true
        type: string
-        default: "6"
+        default: "10"
 #  push:
 #    branches:
 #      - master
@@ -64,10 +64,6 @@ jobs:
            ./python.exe get-pip.py
            ./python.exe -s -m pip install ../cu${{ inputs.cu }}_python_deps/*
            sed -i '1i../ComfyUI' ./python3${{ inputs.python_minor }}._pth
-
-            rm ./Lib/site-packages/torch/lib/dnnl.lib #I don't think this is actually used and I need the space
-            rm ./Lib/site-packages/torch/lib/libprotoc.lib
-            rm ./Lib/site-packages/torch/lib/libprotobuf.lib
            cd ..

            git clone --depth 1 https://github.com/comfyanonymous/taesd
@@ -86,7 +82,7 @@ jobs:

            cd ..

-            "C:\Program Files\7-Zip\7z.exe" a -t7z -m0=lzma2 -mx=9 -mfb=128 -md=768m -ms=on -mf=BCJ2 ComfyUI_windows_portable.7z ComfyUI_windows_portable
+            "C:\Program Files\7-Zip\7z.exe" a -t7z -m0=lzma2 -mx=9 -mfb=128 -md=512m -ms=on -mf=BCJ2 ComfyUI_windows_portable.7z ComfyUI_windows_portable
            mv ComfyUI_windows_portable.7z ComfyUI/new_ComfyUI_windows_portable_nvidia_cu${{ inputs.cu }}_or_cpu.7z

            cd ComfyUI_windows_portable
--- a/27
+++ b/27
@@ -5,21 +5,20 @@
 # Inlined the team members for now.

 # Maintainers
-*.md @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
-/tests/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
-/tests-unit/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
-/notebooks/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
-/script_examples/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
-/.github/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
-/requirements.txt @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
-/pyproject.toml @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne @guill
+*.md @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne
+/tests/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne
+/tests-unit/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne
+/notebooks/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne
+/script_examples/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne
+/.github/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne
+/requirements.txt @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne
+/pyproject.toml @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @Kosinkadink @christian-byrne

 # Python web server
-/api_server/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne @guill
-/app/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne @guill
-/utils/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne @guill
+/api_server/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne
+/app/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne
+/utils/ @yoland68 @robinjhuang @webfiltered @pythongosssss @ltdrdata @christian-byrne

 # Node developers
-/comfy_extras/ @yoland68 @robinjhuang @pythongosssss @ltdrdata @Kosinkadink @webfiltered @christian-byrne @guill
-/comfy/comfy_types/ @yoland68 @robinjhuang @pythongosssss @ltdrdata @Kosinkadink @webfiltered @christian-byrne @guill
-/comfy_api_nodes/ @yoland68 @robinjhuang @pythongosssss @ltdrdata @Kosinkadink @webfiltered @christian-byrne @guill
+/comfy_extras/ @yoland68 @robinjhuang @pythongosssss @ltdrdata @Kosinkadink @webfiltered @christian-byrne
+/comfy/comfy_types/ @yoland68 @robinjhuang @pythongosssss @ltdrdata @Kosinkadink @webfiltered @christian-byrne
--- a/README.md
+++ b/README.md
@@ -39,7 +39,7 @@ ComfyUI lets you design and execute advanced stable diffusion pipelines using a
 ## Get Started

 #### [Desktop Application](https://www.comfy.org/download)
- The easiest way to get started.
+- The easiest way to get started. 
 - Available on Windows & macOS.

 #### [Windows Portable Package](#installing)
@@ -55,7 +55,7 @@ See what ComfyUI can do with the [example workflows](https://comfyanonymous.gith
 ## Features
 - Nodes/graph/flowchart interface to experiment and create complex Stable Diffusion workflows without needing to code anything.
 - Image Models
-   - SD1.x, SD2.x ([unCLIP](https://comfyanonymous.github.io/ComfyUI_examples/unclip/))
+   - SD1.x, SD2.x,
   - [SDXL](https://comfyanonymous.github.io/ComfyUI_examples/sdxl/), [SDXL Turbo](https://comfyanonymous.github.io/ComfyUI_examples/sdturbo/)
   - [Stable Cascade](https://comfyanonymous.github.io/ComfyUI_examples/stable_cascade/)
   - [SD3 and SD3.5](https://comfyanonymous.github.io/ComfyUI_examples/sd3/)
@@ -65,20 +65,13 @@ See what ComfyUI can do with the [example workflows](https://comfyanonymous.gith
   - [Flux](https://comfyanonymous.github.io/ComfyUI_examples/flux/)
   - [Lumina Image 2.0](https://comfyanonymous.github.io/ComfyUI_examples/lumina2/)
   - [HiDream](https://comfyanonymous.github.io/ComfyUI_examples/hidream/)
-   - [Qwen Image](https://comfyanonymous.github.io/ComfyUI_examples/qwen_image/)
-   - [Hunyuan Image 2.1](https://comfyanonymous.github.io/ComfyUI_examples/hunyuan_image/)
- Image Editing Models
-   - [Omnigen 2](https://comfyanonymous.github.io/ComfyUI_examples/omnigen/)
-   - [Flux Kontext](https://comfyanonymous.github.io/ComfyUI_examples/flux/#flux-kontext-image-editing-model)
-   - [HiDream E1.1](https://comfyanonymous.github.io/ComfyUI_examples/hidream/#hidream-e11)
-   - [Qwen Image Edit](https://comfyanonymous.github.io/ComfyUI_examples/qwen_image/#edit-model)
 - Video Models
   - [Stable Video Diffusion](https://comfyanonymous.github.io/ComfyUI_examples/video/)
   - [Mochi](https://comfyanonymous.github.io/ComfyUI_examples/mochi/)
   - [LTX-Video](https://comfyanonymous.github.io/ComfyUI_examples/ltxv/)
   - [Hunyuan Video](https://comfyanonymous.github.io/ComfyUI_examples/hunyuan_video/)
+   - [Nvidia Cosmos](https://comfyanonymous.github.io/ComfyUI_examples/cosmos/)
   - [Wan 2.1](https://comfyanonymous.github.io/ComfyUI_examples/wan/)
-   - [Wan 2.2](https://comfyanonymous.github.io/ComfyUI_examples/wan22/)
 - Audio Models
   - [Stable Audio](https://comfyanonymous.github.io/ComfyUI_examples/audio/)
   - [ACE Step](https://comfyanonymous.github.io/ComfyUI_examples/audio/)
@@ -86,10 +79,9 @@ See what ComfyUI can do with the [example workflows](https://comfyanonymous.gith
   - [Hunyuan3D 2.0](https://docs.comfy.org/tutorials/3d/hunyuan3D-2)
 - Asynchronous Queue system
 - Many optimizations: Only re-executes the parts of the workflow that changes between executions.
- Smart memory management: can automatically run large models on GPUs with as low as 1GB vram with smart offloading.
+- Smart memory management: can automatically run models on GPUs with as low as 1GB vram.
 - Works even if you don't have a GPU with: ```--cpu``` (slow)
- Can load ckpt and safetensors: All in one checkpoints or standalone diffusion models, VAEs and CLIP models.
- Safe loading of ckpt, pt, pth, etc.. files.
+- Can load ckpt, safetensors and diffusers models/checkpoints. Standalone VAEs and CLIP models.
 - Embeddings/Textual inversion
 - [Loras (regular, locon and loha)](https://comfyanonymous.github.io/ComfyUI_examples/lora/)
 - [Hypernetworks](https://comfyanonymous.github.io/ComfyUI_examples/hypernetworks/)
@@ -100,10 +92,12 @@ See what ComfyUI can do with the [example workflows](https://comfyanonymous.gith
 - [Inpainting](https://comfyanonymous.github.io/ComfyUI_examples/inpaint/) with both regular and inpainting models.
 - [ControlNet and T2I-Adapter](https://comfyanonymous.github.io/ComfyUI_examples/controlnet/)
 - [Upscale Models (ESRGAN, ESRGAN variants, SwinIR, Swin2SR, etc...)](https://comfyanonymous.github.io/ComfyUI_examples/upscale_models/)
+- [unCLIP Models](https://comfyanonymous.github.io/ComfyUI_examples/unclip/)
 - [GLIGEN](https://comfyanonymous.github.io/ComfyUI_examples/gligen/)
 - [Model Merging](https://comfyanonymous.github.io/ComfyUI_examples/model_merging/)
 - [LCM models and Loras](https://comfyanonymous.github.io/ComfyUI_examples/lcm/)
 - Latent previews with [TAESD](#how-to-show-high-quality-previews)
+- Starts up very fast.
 - Works fully offline: core will never download anything unless you want to.
 - Optional API nodes to use paid models from external providers through the online [Comfy API](https://docs.comfy.org/tutorials/api-nodes/overview).
 - [Config file](extra_model_paths.yaml.example) to set the search paths for models.
@@ -112,7 +106,7 @@ Workflow examples can be found on the [Examples page](https://comfyanonymous.git

 ## Release Process

-ComfyUI follows a weekly release cycle targeting Friday but this regularly changes because of model releases or large changes to the codebase. There are three interconnected repositories:
+ComfyUI follows a weekly release cycle every Friday, with three interconnected repositories:

 1. **[ComfyUI Core](https://github.com/comfyanonymous/ComfyUI)**
   - Releases a new stable version (e.g., v0.7.0)
@@ -180,6 +174,10 @@ If you have trouble extracting it, right click the file -> properties -> unblock

 See the [Config file](extra_model_paths.yaml.example) to set the search paths for models. In the standalone windows build you can find this file in the ComfyUI directory. Rename this file to extra_model_paths.yaml and edit it with your favorite text editor.

+## Jupyter Notebook
+
+To run it on services like paperspace, kaggle or colab you can use my [Jupyter Notebook](notebooks/comfyui_colab.ipynb)
+

 ## [comfy-cli](https://docs.comfy.org/comfy-cli/getting-started)

@@ -191,7 +189,7 @@ comfy install

 ## Manual Install (Windows, Linux)

-Python 3.13 is very well supported. If you have trouble with some custom node dependencies you can try 3.12
+python 3.13 is supported but using 3.12 is recommended because some custom nodes and their dependencies might not support it yet.

 Git clone this repo.

@@ -203,7 +201,7 @@ Put your VAE in: models/vae
 ### AMD GPUs (Linux only)
 AMD users can install rocm and pytorch with pip if you don't have it already installed, this is the command to install the stable version:

-```pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.4```
+```pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3```

 This is the command to install the nightly with ROCm 6.4 which might have some performance improvements:

@@ -211,29 +209,37 @@ This is the command to install the nightly with ROCm 6.4 which might have some p

 ### Intel GPUs (Windows and Linux)

-(Option 1) Intel Arc GPU users can install native PyTorch with torch.xpu support using pip. More information can be found [here](https://pytorch.org/docs/main/notes/get_start_xpu.html)
-
-1. To install PyTorch xpu, use the following command:
-
-```pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/xpu```
-
-This is the command to install the Pytorch xpu nightly which might have some performance improvements:
+(Option 1) Intel Arc GPU users can install native PyTorch with torch.xpu support using pip (currently available in PyTorch nightly builds). More information can be found [here](https://pytorch.org/docs/main/notes/get_start_xpu.html)
+  
+1. To install PyTorch nightly, use the following command:

 ```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/xpu```

+2. Launch ComfyUI by running `python main.py`
+
+
 (Option 2) Alternatively, Intel GPUs supported by Intel Extension for PyTorch (IPEX) can leverage IPEX for improved performance.

-1. visit [Installation](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu) for more information.
+1. For Intel® Arc™ A-Series Graphics utilizing IPEX, create a conda environment and use the commands below:
+
+```
+conda install libuv
+pip install torch==2.3.1.post0+cxx11.abi torchvision==0.18.1.post0+cxx11.abi torchaudio==2.3.1.post0+cxx11.abi intel-extension-for-pytorch==2.3.110.post0+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
+```
+
+For other supported Intel GPUs with IPEX, visit [Installation](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu) for more information.
+
+Additional discussion and help can be found [here](https://github.com/comfyanonymous/ComfyUI/discussions/476).

 ### NVIDIA

 Nvidia users should install stable pytorch using this command:

-```pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu129```
+```pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu128```

 This is the command to install pytorch nightly instead which might have performance improvements.

-```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu129```
+```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128```

 #### Troubleshooting

@@ -266,8 +272,6 @@ You can install ComfyUI in Apple Mac silicon (M1 or M2) with any recent macOS ve

 #### DirectML (AMD Cards on Windows)

-This is very badly supported and is not recommended. There are some unofficial builds of pytorch ROCm on windows that exist that will give you a much better experience than this. This readme will be updated once official pytorch ROCm builds for windows come out.
-
 ```pip install torch-directml``` Then you can launch ComfyUI with: ```python main.py --directml```

 #### Ascend NPUs
@@ -287,13 +291,6 @@ For models compatible with Cambricon Extension for PyTorch (torch_mlu). Here's a
 2. Next, install the PyTorch(torch_mlu) following the instructions on the [Installation](https://www.cambricon.com/docs/sdk_1.15.0/cambricon_pytorch_1.17.0/user_guide_1.9/index.html)
 3. Launch ComfyUI by running `python main.py`

-#### Iluvatar Corex
-
-For models compatible with Iluvatar Extension for PyTorch. Here's a step-by-step guide tailored to your platform and installation method:
-
-1. Install the Iluvatar Corex Toolkit by adhering to the platform-specific instructions on the [Installation](https://support.iluvatar.com/#/DocumentCentre?id=1&nameCenter=2&productId=520117912052801536)
-2. Launch ComfyUI by running `python main.py`
-
 # Running

 ```python main.py```
@@ -344,7 +341,7 @@ Generate a self-signed certificate (not appropriate for shared/production use) a

 Use `--tls-keyfile key.pem --tls-certfile cert.pem` to enable TLS/SSL, the app will now be accessible with `https://...` instead of `http://...`.

-> Note: Windows users can use [alexisrolland/docker-openssl](https://github.com/alexisrolland/docker-openssl) or one of the [3rd party binary distributions](https://wiki.openssl.org/index.php/Binaries) to run the command example above.
+> Note: Windows users can use [alexisrolland/docker-openssl](https://github.com/alexisrolland/docker-openssl) or one of the [3rd party binary distributions](https://wiki.openssl.org/index.php/Binaries) to run the command example above. 
 <br/><br/>If you use a container, note that the volume mount `-v` can be a relative path so `... -v ".\:/openssl-certs" ...` would create the key & cert files in the current directory of your command prompt or powershell terminal.

 ## Support and dev channel
--- a/app/frontend_management.py
+++ b/app/frontend_management.py
@@ -29,48 +29,18 @@ def frontend_install_warning_message():
 This error is happening because the ComfyUI frontend is no longer shipped as part of the main repo but as a pip package instead.
 """.strip()

-def parse_version(version: str) -> tuple[int, int, int]:
-        return tuple(map(int, version.split(".")))
-
-def is_valid_version(version: str) -> bool:
-    """Validate if a string is a valid semantic version (X.Y.Z format)."""
-    pattern = r"^(\d+)\.(\d+)\.(\d+)$"
-    return bool(re.match(pattern, version))
-
-def get_installed_frontend_version():
-    """Get the currently installed frontend package version."""
-    frontend_version_str = version("comfyui-frontend-package")
-    return frontend_version_str
-
-def get_required_frontend_version():
-    """Get the required frontend version from requirements.txt."""
-    try:
-        with open(requirements_path, "r", encoding="utf-8") as f:
-            for line in f:
-                line = line.strip()
-                if line.startswith("comfyui-frontend-package=="):
-                    version_str = line.split("==")[-1]
-                    if not is_valid_version(version_str):
-                        logging.error(f"Invalid version format in requirements.txt: {version_str}")
-                        return None
-                    return version_str
-            logging.error("comfyui-frontend-package not found in requirements.txt")
-            return None
-    except FileNotFoundError:
-        logging.error("requirements.txt not found. Cannot determine required frontend version.")
-        return None
-    except Exception as e:
-        logging.error(f"Error reading requirements.txt: {e}")
-        return None

 def check_frontend_version():
    """Check if the frontend version is up to date."""

+    def parse_version(version: str) -> tuple[int, int, int]:
+        return tuple(map(int, version.split(".")))
+
    try:
-        frontend_version_str = get_installed_frontend_version()
+        frontend_version_str = version("comfyui-frontend-package")
        frontend_version = parse_version(frontend_version_str)
-        required_frontend_str = get_required_frontend_version()
-        required_frontend = parse_version(required_frontend_str)
+        with open(requirements_path, "r", encoding="utf-8") as f:
+            required_frontend = parse_version(f.readline().split("=")[-1])
        if frontend_version < required_frontend:
            app.logger.log_startup_warning(
                f"""
@@ -198,11 +168,6 @@ def download_release_asset_zip(release: Release, destination_path: str) -> None:
 class FrontendManager:
    CUSTOM_FRONTENDS_ROOT = str(Path(__file__).parents[1] / "web_custom_versions")

-    @classmethod
-    def get_required_frontend_version(cls) -> str:
-        """Get the required frontend package version."""
-        return get_required_frontend_version()
-
    @classmethod
    def default_frontend_path(cls) -> str:
        try:
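
Reviewer note on the `app/frontend_management.py` hunks above: the removed helpers search every line of requirements.txt for `comfyui-frontend-package==` and validate the X.Y.Z format, while the replacement parses only the first line of the file. The sketch below illustrates the search-and-validate approach under stated assumptions. It is not code from this repository: `find_required_version` is a hypothetical name, and it takes the file contents as a string so it can be run without a checkout.

```python
import re
from typing import Optional

# X.Y.Z pattern, the same shape the removed is_valid_version() checks for.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(version: str) -> tuple[int, ...]:
    """Turn "1.10.0" into (1, 10, 0) so versions compare numerically."""
    return tuple(map(int, version.split(".")))


def find_required_version(
    requirements_text: str,
    package: str = "comfyui-frontend-package",
) -> Optional[str]:
    """Hypothetical helper: return the pinned X.Y.Z version, or None.

    None is returned both when the package is not pinned with "==" and
    when the pinned value is not in X.Y.Z form.
    """
    prefix = package + "=="
    for line in requirements_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            candidate = line[len(prefix):]
            return candidate if _VERSION_RE.match(candidate) else None
    return None
```

Comparing the parsed tuples rather than the raw strings matters once a component reaches two digits: as strings, "1.9.0" sorts after "1.10.0", whereas `(1, 9, 0) < (1, 10, 0)` holds.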
--- a/app/model_manager.py
+++ b/app/model_manager.py
@@ -130,21 +130,10 @@ class ModelFileManager:

            for file_name in filenames:
                try:
-                    full_path = os.path.join(dirpath, file_name)
-                    relative_path = os.path.relpath(full_path, directory)
-
-                    # Get file metadata
-                    file_info = {
-                        "name": relative_path,
-                        "pathIndex": pathIndex,
-                        "modified": os.path.getmtime(full_path),  # Add modification time
-                        "created": os.path.getctime(full_path),   # Add creation time
-                        "size": os.path.getsize(full_path)        # Add file size
-                    }
-                    result.append(file_info)
-
-                except Exception as e:
-                    logging.warning(f"Warning: Unable to access {file_name}. Error: {e}. Skipping this file.")
+                    relative_path = os.path.relpath(os.path.join(dirpath, file_name), directory)
+                    result.append(relative_path)
+                except:
+                    logging.warning(f"Warning: Unable to access {file_name}. Skipping this file.")
                    continue

            for d in subdirs:
@@ -155,7 +144,7 @@ class ModelFileManager:
                    logging.warning(f"Warning: Unable to access {path}. Skipping this path.")
                    continue

-        return result, dirs, time.perf_counter()
+        return [{"name": f, "pathIndex": pathIndex} for f in result], dirs, time.perf_counter()

    def get_model_previews(self, filepath: str) -> list[str | BytesIO]:
        dirname = os.path.dirname(filepath)
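
Reviewer note on the `app/model_manager.py` hunks above: the removed lines return per-file metadata (`modified`, `created`, `size`) and log the caught exception, while the replacement returns only `name` and `pathIndex` behind a bare `except:`. A minimal sketch of the metadata-collecting variant follows, assuming a plain `os.walk` traversal. `list_files_with_metadata` is a hypothetical name, not a function in this repository, and it catches `OSError` only, so unrelated errors are not silently swallowed.

```python
import logging
import os


def list_files_with_metadata(directory: str, path_index: int = 0) -> list[dict]:
    """Hypothetical helper: walk `directory` and describe each readable file."""
    result = []
    for dirpath, _subdirs, filenames in os.walk(directory):
        for file_name in filenames:
            full_path = os.path.join(dirpath, file_name)
            try:
                # One stat() call supplies all three metadata fields.
                stat = os.stat(full_path)
            except OSError as e:
                logging.warning(
                    f"Warning: Unable to access {file_name}. Error: {e}. Skipping this file."
                )
                continue
            result.append({
                "name": os.path.relpath(full_path, directory),
                "pathIndex": path_index,
                "modified": stat.st_mtime,
                # st_ctime is creation time on Windows but the last
                # metadata change on Unix, like os.path.getctime().
                "created": stat.st_ctime,
                "size": stat.st_size,
            })
    return result
```

A single `os.stat` call replaces the three separate `os.path.get*` lookups in the removed lines; the values are the same, since those helpers read the same stat fields.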
--- a/app/user_manager.py
+++ b/app/user_manager.py
@@ -20,15 +20,13 @@ class FileInfo(TypedDict):
    path: str
    size: int
    modified: int
-    created: int


 def get_file_info(path: str, relative_to: str) -> FileInfo:
    return {
        "path": os.path.relpath(path, relative_to).replace(os.sep, '/'),
        "size": os.path.getsize(path),
-        "modified": os.path.getmtime(path),
-        "created": os.path.getctime(path)
+        "modified": os.path.getmtime(path)
    }


@@ -363,17 +361,10 @@ class UserManager():
            if not overwrite and os.path.exists(path):
                return web.Response(status=409, text="File already exists")

-            try:
-                body = await request.read()
+            body = await request.read()

-                with open(path, "wb") as f:
-                    f.write(body)
-            except OSError as e:
-                logging.warning(f"Error saving file '{path}': {e}")
-                return web.Response(
-                    status=400,
-                    reason="Invalid filename. Please avoid special characters like :\\/*?\"<>|"
-                )
+            with open(path, "wb") as f:
+                f.write(body)

            user_path = self.get_request_user_filepath(request, None)
            if full_info:
--- a/comfy/audio_encoders/audio_encoders.py
+++ b/comfy/audio_encoders/audio_encoders.py
@@ -1,91 +0,0 @@
-from .wav2vec2 import Wav2Vec2Model
-from .whisper import WhisperLargeV3
-import comfy.model_management
-import comfy.ops
-import comfy.utils
-import logging
-import torchaudio
-
-
-class AudioEncoderModel():
-    def __init__(self, config):
-        self.load_device = comfy.model_management.text_encoder_device()
-        offload_device = comfy.model_management.text_encoder_offload_device()
-        self.dtype = comfy.model_management.text_encoder_dtype(self.load_device)
-        model_type = config.pop("model_type")
-        model_config = dict(config)
-        model_config.update({
-            "dtype": self.dtype,
-            "device": offload_device,
-            "operations": comfy.ops.manual_cast
-        })
-
-        if model_type == "wav2vec2":
-            self.model = Wav2Vec2Model(**model_config)
-        elif model_type == "whisper3":
-            self.model = WhisperLargeV3(**model_config)
-        self.model.eval()
-        self.patcher = comfy.model_patcher.ModelPatcher(self.model, load_device=self.load_device, offload_device=offload_device)
-        self.model_sample_rate = 16000
-
-    def load_sd(self, sd):
-        return self.model.load_state_dict(sd, strict=False)
-
-    def get_sd(self):
-        return self.model.state_dict()
-
-    def encode_audio(self, audio, sample_rate):
-        comfy.model_management.load_model_gpu(self.patcher)
-        audio = torchaudio.functional.resample(audio, sample_rate, self.model_sample_rate)
-        out, all_layers = self.model(audio.to(self.load_device))
-        outputs = {}
-        outputs["encoded_audio"] = out
-        outputs["encoded_audio_all_layers"] = all_layers
-        outputs["audio_samples"] = audio.shape[2]
-        return outputs
-
-
-def load_audio_encoder_from_sd(sd, prefix=""):
-    sd = comfy.utils.state_dict_prefix_replace(sd, {"wav2vec2.": ""})
-    if "encoder.layer_norm.bias" in sd: #wav2vec2
-        embed_dim = sd["encoder.layer_norm.bias"].shape[0]
-        if embed_dim == 1024:# large
-            config = {
-                "model_type": "wav2vec2",
-                "embed_dim": 1024,
-                "num_heads": 16,
-                "num_layers": 24,
-                "conv_norm": True,
-                "conv_bias": True,
-                "do_normalize": True,
-                "do_stable_layer_norm": True
-                }
-        elif embed_dim == 768: # base
-            config = {
-                "model_type": "wav2vec2",
-                "embed_dim": 768,
-                "num_heads": 12,
-                "num_layers": 12,
-                "conv_norm": False,
-                "conv_bias": False,
-                "do_normalize": False, # chinese-wav2vec2-base has this False
-                "do_stable_layer_norm": False
-            }
-        else:
-            raise RuntimeError("ERROR: audio encoder file is invalid or unsupported embed_dim: {}".format(embed_dim))
-    elif "model.encoder.embed_positions.weight" in sd:
-        sd = comfy.utils.state_dict_prefix_replace(sd, {"model.": ""})
-        config = {
-            "model_type": "whisper3",
-        }
-    else:
-        raise RuntimeError("ERROR: audio encoder not supported.")
-
-    audio_encoder = AudioEncoderModel(config)
-    m, u = audio_encoder.load_sd(sd)
-    if len(m) > 0:
-        logging.warning("missing audio encoder: {}".format(m))
-    if len(u) > 0:
-        logging.warning("unexpected audio encoder: {}".format(u))
-
-    return audio_encoder
--- a/comfy/audio_encoders/wav2vec2.py
+++ b/comfy/audio_encoders/wav2vec2.py
@@ -1,252 +0,0 @@
-import torch
-import torch.nn as nn
-from comfy.ldm.modules.attention import optimized_attention_masked
-
-
-class LayerNormConv(nn.Module):
-    def __init__(self, in_channels, out_channels, kernel_size, stride, bias=False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.conv = operations.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, bias=bias, device=device, dtype=dtype)
-        self.layer_norm = operations.LayerNorm(out_channels, elementwise_affine=True, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.conv(x)
-        return torch.nn.functional.gelu(self.layer_norm(x.transpose(-2, -1)).transpose(-2, -1))
-
-class LayerGroupNormConv(nn.Module):
-    def __init__(self, in_channels, out_channels, kernel_size, stride, bias=False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.conv = operations.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, bias=bias, device=device, dtype=dtype)
-        self.layer_norm = operations.GroupNorm(num_groups=out_channels, num_channels=out_channels, affine=True, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.conv(x)
-        return torch.nn.functional.gelu(self.layer_norm(x))
-
-class ConvNoNorm(nn.Module):
-    def __init__(self, in_channels, out_channels, kernel_size, stride, bias=False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.conv = operations.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, bias=bias, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.conv(x)
-        return torch.nn.functional.gelu(x)
-
-
-class ConvFeatureEncoder(nn.Module):
-    def __init__(self, conv_dim, conv_bias=False, conv_norm=True, dtype=None, device=None, operations=None):
-        super().__init__()
-        if conv_norm:
-            self.conv_layers = nn.ModuleList([
-                LayerNormConv(1, conv_dim, kernel_size=10, stride=5, bias=True, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                LayerNormConv(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-            ])
-        else:
-            self.conv_layers = nn.ModuleList([
-                LayerGroupNormConv(1, conv_dim, kernel_size=10, stride=5, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=3, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-                ConvNoNorm(conv_dim, conv_dim, kernel_size=2, stride=2, bias=conv_bias, device=device, dtype=dtype, operations=operations),
-            ])
-
-    def forward(self, x):
-        x = x.unsqueeze(1)
-
-        for conv in self.conv_layers:
-            x = conv(x)
-
-        return x.transpose(1, 2)
-
-
-class FeatureProjection(nn.Module):
-    def __init__(self, conv_dim, embed_dim, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.layer_norm = operations.LayerNorm(conv_dim, eps=1e-05, device=device, dtype=dtype)
-        self.projection = operations.Linear(conv_dim, embed_dim, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.layer_norm(x)
-        x = self.projection(x)
-        return x
-
-
-class PositionalConvEmbedding(nn.Module):
-    def __init__(self, embed_dim=768, kernel_size=128, groups=16):
-        super().__init__()
-        self.conv = nn.Conv1d(
-            embed_dim,
-            embed_dim,
-            kernel_size=kernel_size,
-            padding=kernel_size // 2,
-            groups=groups,
-        )
-        self.conv = torch.nn.utils.parametrizations.weight_norm(self.conv, name="weight", dim=2)
-        self.activation = nn.GELU()
-
-    def forward(self, x):
-        x = x.transpose(1, 2)
-        x = self.conv(x)[:, :, :-1]
-        x = self.activation(x)
-        x = x.transpose(1, 2)
-        return x
-
-
-class TransformerEncoder(nn.Module):
-    def __init__(
-        self,
-        embed_dim=768,
-        num_heads=12,
-        num_layers=12,
-        mlp_ratio=4.0,
-        do_stable_layer_norm=True,
-        dtype=None, device=None, operations=None
-    ):
-        super().__init__()
-
-        self.pos_conv_embed = PositionalConvEmbedding(embed_dim=embed_dim)
-        self.layers = nn.ModuleList([
-            TransformerEncoderLayer(
-                embed_dim=embed_dim,
-                num_heads=num_heads,
-                mlp_ratio=mlp_ratio,
-                do_stable_layer_norm=do_stable_layer_norm,
-                device=device, dtype=dtype, operations=operations
-            )
-            for _ in range(num_layers)
-        ])
-
-        self.layer_norm = operations.LayerNorm(embed_dim, eps=1e-05, device=device, dtype=dtype)
-        self.do_stable_layer_norm = do_stable_layer_norm
-
-    def forward(self, x, mask=None):
-        x = x + self.pos_conv_embed(x)
-        all_x = ()
-        if not self.do_stable_layer_norm:
-            x = self.layer_norm(x)
-        for layer in self.layers:
-            all_x += (x,)
-            x = layer(x, mask)
-        if self.do_stable_layer_norm:
-            x = self.layer_norm(x)
-        all_x += (x,)
-        return x, all_x
-
-
-class Attention(nn.Module):
-    def __init__(self, embed_dim, num_heads, bias=True, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.embed_dim = embed_dim
-        self.num_heads = num_heads
-        self.head_dim = embed_dim // num_heads
-
-        self.k_proj = operations.Linear(embed_dim, embed_dim, bias=bias, device=device, dtype=dtype)
-        self.v_proj = operations.Linear(embed_dim, embed_dim, bias=bias, device=device, dtype=dtype)
-        self.q_proj = operations.Linear(embed_dim, embed_dim, bias=bias, device=device, dtype=dtype)
-        self.out_proj = operations.Linear(embed_dim, embed_dim, bias=bias, device=device, dtype=dtype)
-
-    def forward(self, x, mask=None):
-        assert (mask is None)  # TODO?
-        q = self.q_proj(x)
-        k = self.k_proj(x)
-        v = self.v_proj(x)
-
-        out = optimized_attention_masked(q, k, v, self.num_heads)
-        return self.out_proj(out)
-
-
-class FeedForward(nn.Module):
-    def __init__(self, embed_dim, mlp_ratio, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.intermediate_dense = operations.Linear(embed_dim, int(embed_dim * mlp_ratio), device=device, dtype=dtype)
-        self.output_dense = operations.Linear(int(embed_dim * mlp_ratio), embed_dim, device=device, dtype=dtype)
-
-    def forward(self, x):
-        x = self.intermediate_dense(x)
-        x = torch.nn.functional.gelu(x)
-        x = self.output_dense(x)
-        return x
-
-
-class TransformerEncoderLayer(nn.Module):
-    def __init__(
-        self,
-        embed_dim=768,
-        num_heads=12,
-        mlp_ratio=4.0,
-        do_stable_layer_norm=True,
-        dtype=None, device=None, operations=None
-    ):
-        super().__init__()
-
-        self.attention = Attention(embed_dim, num_heads, device=device, dtype=dtype, operations=operations)
-
-        self.layer_norm = operations.LayerNorm(embed_dim, device=device, dtype=dtype)
-        self.feed_forward = FeedForward(embed_dim, mlp_ratio, device=device, dtype=dtype, operations=operations)
-        self.final_layer_norm = operations.LayerNorm(embed_dim, device=device, dtype=dtype)
-        self.do_stable_layer_norm = do_stable_layer_norm
-
-    def forward(self, x, mask=None):
-        residual = x
-        if self.do_stable_layer_norm:
-            x = self.layer_norm(x)
-        x = self.attention(x, mask=mask)
-        x = residual + x
-        if not self.do_stable_layer_norm:
-            x = self.layer_norm(x)
-            return self.final_layer_norm(x + self.feed_forward(x))
-        else:
-            return x + self.feed_forward(self.final_layer_norm(x))
-
-
-class Wav2Vec2Model(nn.Module):
-    """Complete Wav2Vec 2.0 model."""
-
-    def __init__(
-        self,
-        embed_dim=1024,
-        final_dim=256,
-        num_heads=16,
-        num_layers=24,
-        conv_norm=True,
-        conv_bias=True,
-        do_normalize=True,
-        do_stable_layer_norm=True,
-        dtype=None, device=None, operations=None
-    ):
-        super().__init__()
-
-        conv_dim = 512
-        self.feature_extractor = ConvFeatureEncoder(conv_dim, conv_norm=conv_norm, conv_bias=conv_bias, device=device, dtype=dtype, operations=operations)
-        self.feature_projection = FeatureProjection(conv_dim, embed_dim, device=device, dtype=dtype, operations=operations)
-
-        self.masked_spec_embed = nn.Parameter(torch.empty(embed_dim, device=device, dtype=dtype))
-        self.do_normalize = do_normalize
-
-        self.encoder = TransformerEncoder(
-            embed_dim=embed_dim,
-            num_heads=num_heads,
-            num_layers=num_layers,
-            do_stable_layer_norm=do_stable_layer_norm,
-            device=device, dtype=dtype, operations=operations
-        )
-
-    def forward(self, x, mask_time_indices=None, return_dict=False):
-        x = torch.mean(x, dim=1)
-
-        if self.do_normalize:
-            x = (x - x.mean()) / torch.sqrt(x.var() + 1e-7)
-
-        features = self.feature_extractor(x)
-        features = self.feature_projection(features)
-        batch_size, seq_len, _ = features.shape
-
-        x, all_x = self.encoder(features)
-        return x, all_x
--- a/comfy/audio_encoders/whisper.py
+++ b/comfy/audio_encoders/whisper.py
@@ -1,186 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import torchaudio
-from typing import Optional
-from comfy.ldm.modules.attention import optimized_attention_masked
-import comfy.ops
-
-class WhisperFeatureExtractor(nn.Module):
-    def __init__(self, n_mels=128, device=None):
-        super().__init__()
-        self.sample_rate = 16000
-        self.n_fft = 400
-        self.hop_length = 160
-        self.n_mels = n_mels
-        self.chunk_length = 30
-        self.n_samples = 480000
-
-        self.mel_spectrogram = torchaudio.transforms.MelSpectrogram(
-            sample_rate=self.sample_rate,
-            n_fft=self.n_fft,
-            hop_length=self.hop_length,
-            n_mels=self.n_mels,
-            f_min=0,
-            f_max=8000,
-            norm="slaney",
-            mel_scale="slaney",
-        ).to(device)
-
-    def __call__(self, audio):
-        audio = torch.mean(audio, dim=1)
-        batch_size = audio.shape[0]
-        processed_audio = []
-
-        for i in range(batch_size):
-            aud = audio[i]
-            if aud.shape[0] > self.n_samples:
-                aud = aud[:self.n_samples]
-            elif aud.shape[0] < self.n_samples:
-                aud = F.pad(aud, (0, self.n_samples - aud.shape[0]))
-            processed_audio.append(aud)
-
-        audio = torch.stack(processed_audio)
-
-        mel_spec = self.mel_spectrogram(audio.to(self.mel_spectrogram.spectrogram.window.device))[:, :, :-1].to(audio.device)
-
-        log_mel_spec = torch.clamp(mel_spec, min=1e-10).log10()
-        log_mel_spec = torch.maximum(log_mel_spec, log_mel_spec.max() - 8.0)
-        log_mel_spec = (log_mel_spec + 4.0) / 4.0
-
-        return log_mel_spec
-
-
-class MultiHeadAttention(nn.Module):
-    def __init__(self, d_model: int, n_heads: int, dtype=None, device=None, operations=None):
-        super().__init__()
-        assert d_model % n_heads == 0
-
-        self.d_model = d_model
-        self.n_heads = n_heads
-        self.d_k = d_model // n_heads
-
-        self.q_proj = operations.Linear(d_model, d_model, dtype=dtype, device=device)
-        self.k_proj = operations.Linear(d_model, d_model, bias=False, dtype=dtype, device=device)
-        self.v_proj = operations.Linear(d_model, d_model, dtype=dtype, device=device)
-        self.out_proj = operations.Linear(d_model, d_model, dtype=dtype, device=device)
-
-    def forward(
-        self,
-        query: torch.Tensor,
-        key: torch.Tensor,
-        value: torch.Tensor,
-        mask: Optional[torch.Tensor] = None,
-    ) -> torch.Tensor:
-        batch_size, seq_len, _ = query.shape
-
-        q = self.q_proj(query)
-        k = self.k_proj(key)
-        v = self.v_proj(value)
-
-        attn_output = optimized_attention_masked(q, k, v, self.n_heads, mask)
-        attn_output = self.out_proj(attn_output)
-
-        return attn_output
-
-
-class EncoderLayer(nn.Module):
-    def __init__(self, d_model: int, n_heads: int, d_ff: int, dtype=None, device=None, operations=None):
-        super().__init__()
-
-        self.self_attn = MultiHeadAttention(d_model, n_heads, dtype=dtype, device=device, operations=operations)
-        self.self_attn_layer_norm = operations.LayerNorm(d_model, dtype=dtype, device=device)
-
-        self.fc1 = operations.Linear(d_model, d_ff, dtype=dtype, device=device)
-        self.fc2 = operations.Linear(d_ff, d_model, dtype=dtype, device=device)
-        self.final_layer_norm = operations.LayerNorm(d_model, dtype=dtype, device=device)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        attention_mask: Optional[torch.Tensor] = None
-    ) -> torch.Tensor:
-        residual = x
-        x = self.self_attn_layer_norm(x)
-        x = self.self_attn(x, x, x, attention_mask)
-        x = residual + x
-
-        residual = x
-        x = self.final_layer_norm(x)
-        x = self.fc1(x)
-        x = F.gelu(x)
-        x = self.fc2(x)
-        x = residual + x
-
-        return x
-
-
-class AudioEncoder(nn.Module):
-    def __init__(
-        self,
-        n_mels: int = 128,
-        n_ctx: int = 1500,
-        n_state: int = 1280,
-        n_head: int = 20,
-        n_layer: int = 32,
-        dtype=None,
-        device=None,
-        operations=None
-    ):
-        super().__init__()
-
-        self.conv1 = operations.Conv1d(n_mels, n_state, kernel_size=3, padding=1, dtype=dtype, device=device)
-        self.conv2 = operations.Conv1d(n_state, n_state, kernel_size=3, stride=2, padding=1, dtype=dtype, device=device)
-
-        self.embed_positions = operations.Embedding(n_ctx, n_state, dtype=dtype, device=device)
-
-        self.layers = nn.ModuleList([
-            EncoderLayer(n_state, n_head, n_state * 4, dtype=dtype, device=device, operations=operations)
-            for _ in range(n_layer)
-        ])
-
-        self.layer_norm = operations.LayerNorm(n_state, dtype=dtype, device=device)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        x = F.gelu(self.conv1(x))
-        x = F.gelu(self.conv2(x))
-
-        x = x.transpose(1, 2)
-
-        x = x + comfy.ops.cast_to_input(self.embed_positions.weight[:, :x.shape[1]], x)
-
-        all_x = ()
-        for layer in self.layers:
-            all_x += (x,)
-            x = layer(x)
-
-        x = self.layer_norm(x)
-        all_x += (x,)
-        return x, all_x
-
-
-class WhisperLargeV3(nn.Module):
-    def __init__(
-        self,
-        n_mels: int = 128,
-        n_audio_ctx: int = 1500,
-        n_audio_state: int = 1280,
-        n_audio_head: int = 20,
-        n_audio_layer: int = 32,
-        dtype=None,
-        device=None,
-        operations=None
-    ):
-        super().__init__()
-
-        self.feature_extractor = WhisperFeatureExtractor(n_mels=n_mels, device=device)
-
-        self.encoder = AudioEncoder(
-            n_mels, n_audio_ctx, n_audio_state, n_audio_head, n_audio_layer,
-            dtype=dtype, device=device, operations=operations
-        )
-
-    def forward(self, audio):
-        mel = self.feature_extractor(audio)
-        x, all_x = self.encoder(mel)
-        return x, all_x
--- a/comfy/cli_args.py
+++ b/comfy/cli_args.py
@@ -49,8 +49,7 @@ parser.add_argument("--temp-directory", type=str, default=None, help="Set the Co
 parser.add_argument("--input-directory", type=str, default=None, help="Set the ComfyUI input directory. Overrides --base-directory.")
 parser.add_argument("--auto-launch", action="store_true", help="Automatically launch ComfyUI in the default browser.")
 parser.add_argument("--disable-auto-launch", action="store_true", help="Disable auto launching the browser.")
-parser.add_argument("--cuda-device", type=int, default=None, metavar="DEVICE_ID", help="Set the id of the cuda device this instance will use. All other devices will not be visible.")
-parser.add_argument("--default-device", type=int, default=None, metavar="DEFAULT_DEVICE_ID", help="Set the id of the default device, all other devices will stay visible.")
+parser.add_argument("--cuda-device", type=int, default=None, metavar="DEVICE_ID", help="Set the id of the cuda device this instance will use.")
 cm_group = parser.add_mutually_exclusive_group()
 cm_group.add_argument("--cuda-malloc", action="store_true", help="Enable cudaMallocAsync (enabled by default for torch 2.0 and up).")
 cm_group.add_argument("--disable-cuda-malloc", action="store_true", help="Disable cudaMallocAsync.")
@@ -132,8 +131,6 @@ parser.add_argument("--reserve-vram", type=float, default=None, help="Set the am

 parser.add_argument("--async-offload", action="store_true", help="Use async weight offloading.")

-parser.add_argument("--force-non-blocking", action="store_true", help="Force ComfyUI to use non-blocking operations for all applicable tensors. This may improve performance on some non-Nvidia systems but can cause issues with some workflows.")
-
 parser.add_argument("--default-hashing-function", type=str, choices=['md5', 'sha1', 'sha256', 'sha512'], default='sha256', help="Allows you to choose the hash function to use for duplicate filename / contents comparison. Default is sha256.")

 parser.add_argument("--disable-smart-memory", action="store_true", help="Force ComfyUI to aggressively offload to regular ram instead of keeping models in vram when it can.")
@@ -143,12 +140,10 @@ class PerformanceFeature(enum.Enum):
    Fp16Accumulation = "fp16_accumulation"
    Fp8MatrixMultiplication = "fp8_matrix_mult"
    CublasOps = "cublas_ops"
-    AutoTune = "autotune"

-parser.add_argument("--fast", nargs="*", type=PerformanceFeature, help="Enable some untested and potentially quality deteriorating optimizations. --fast with no arguments enables everything. You can pass a list of specific optimizations if you only want to enable specific ones. Current valid optimizations: {}".format(" ".join(map(lambda c: c.value, PerformanceFeature))))
+parser.add_argument("--fast", nargs="*", type=PerformanceFeature, help="Enable some untested and potentially quality deteriorating optimizations. --fast with no arguments enables everything. You can pass a list of specific optimizations if you only want to enable specific ones. Current valid optimizations: fp16_accumulation fp8_matrix_mult cublas_ops")

 parser.add_argument("--mmap-torch-files", action="store_true", help="Use mmap when loading ckpt/pt files.")
-parser.add_argument("--disable-mmap", action="store_true", help="Don't use mmap when loading safetensors.")

 parser.add_argument("--dont-print-server", action="store_true", help="Don't print server output.")
 parser.add_argument("--quick-test-for-ci", action="store_true", help="Quick test for CI.")
@@ -156,7 +151,6 @@ parser.add_argument("--windows-standalone-build", action="store_true", help="Win

 parser.add_argument("--disable-metadata", action="store_true", help="Disable saving prompt metadata in files.")
 parser.add_argument("--disable-all-custom-nodes", action="store_true", help="Disable loading all custom nodes.")
-parser.add_argument("--whitelist-custom-nodes", type=str, nargs='+', default=[], help="Specify custom node folders to load even when --disable-all-custom-nodes is enabled.")
 parser.add_argument("--disable-api-nodes", action="store_true", help="Disable loading all api nodes.")

 parser.add_argument("--multi-user", action="store_true", help="Enables per-user storage.")
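
Reviewer note on the `--fast` hunk above: the removed line builds the list of valid optimizations from the `PerformanceFeature` enum, while the replacement hard-codes it, so the help text has to be edited by hand whenever a member is added or removed. The sketch below shows how the enum-driven form behaves with `nargs="*"`. It is a standalone reduction, not the repository's parser: `build_parser` is a hypothetical wrapper, and the help text is shortened.

```python
import argparse
import enum


class PerformanceFeature(enum.Enum):
    Fp16Accumulation = "fp16_accumulation"
    Fp8MatrixMultiplication = "fp8_matrix_mult"
    CublasOps = "cublas_ops"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical wrapper so the parser can be built and inspected in isolation."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--fast",
        nargs="*",
        # argparse calls PerformanceFeature(value) on each token; a value
        # that matches no member raises ValueError, which argparse reports
        # as a usage error.
        type=PerformanceFeature,
        help="Enable optional optimizations. Current valid optimizations: {}".format(
            " ".join(member.value for member in PerformanceFeature)
        ),
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # None: flag absent. []: bare --fast. Otherwise: the listed members.
    print(args.fast)
```

With `nargs="*"`, the three cases are distinguishable: omitting the flag leaves `args.fast` as `None`, a bare `--fast` yields an empty list, and `--fast cublas_ops` yields `[PerformanceFeature.CublasOps]`.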
--- a/comfy/clip_model.py
+++ b/comfy/clip_model.py
@@ -61,12 +61,8 @@ class CLIPEncoder(torch.nn.Module):
    def forward(self, x, mask=None, intermediate_output=None):
        optimized_attention = optimized_attention_for_device(x.device, mask=mask is not None, small_input=True)

-        all_intermediate = None
        if intermediate_output is not None:
-            if intermediate_output == "all":
-                all_intermediate = []
-                intermediate_output = None
-            elif intermediate_output < 0:
+            if intermediate_output < 0:
                intermediate_output = len(self.layers) + intermediate_output

        intermediate = None
@@ -74,12 +70,6 @@ class CLIPEncoder(torch.nn.Module):
            x = l(x, mask, optimized_attention)
            if i == intermediate_output:
                intermediate = x.clone()
-            if all_intermediate is not None:
-                all_intermediate.append(x.unsqueeze(1).clone())
-
-        if all_intermediate is not None:
-            intermediate = torch.cat(all_intermediate, dim=1)
-
        return x, intermediate

 class CLIPEmbeddings(torch.nn.Module):
@@ -107,7 +97,7 @@ class CLIPTextModel_(torch.nn.Module):
        self.encoder = CLIPEncoder(num_layers, embed_dim, heads, intermediate_size, intermediate_activation, dtype, device, operations)
        self.final_layer_norm = operations.LayerNorm(embed_dim, dtype=dtype, device=device)

-    def forward(self, input_tokens=None, attention_mask=None, embeds=None, num_tokens=None, intermediate_output=None, final_layer_norm_intermediate=True, dtype=torch.float32, embeds_info=[]):
+    def forward(self, input_tokens=None, attention_mask=None, embeds=None, num_tokens=None, intermediate_output=None, final_layer_norm_intermediate=True, dtype=torch.float32):
        if embeds is not None:
            x = embeds + comfy.ops.cast_to(self.embeddings.position_embedding.weight, dtype=dtype, device=embeds.device)
        else:
--- a/comfy/clip_vision.py
+++ b/comfy/clip_vision.py
@@ -50,13 +50,7 @@ class ClipVisionModel():
        self.image_size = config.get("image_size", 224)
        self.image_mean = config.get("image_mean", [0.48145466, 0.4578275, 0.40821073])
        self.image_std = config.get("image_std", [0.26862954, 0.26130258, 0.27577711])
-        model_type = config.get("model_type", "clip_vision_model")
-        model_class = IMAGE_ENCODERS.get(model_type)
-        if model_type == "siglip_vision_model":
-            self.return_all_hidden_states = True
-        else:
-            self.return_all_hidden_states = False
-
+        model_class = IMAGE_ENCODERS.get(config.get("model_type", "clip_vision_model"))
        self.load_device = comfy.model_management.text_encoder_device()
        offload_device = comfy.model_management.text_encoder_offload_device()
        self.dtype = comfy.model_management.text_encoder_dtype(self.load_device)
@@ -74,18 +68,12 @@ class ClipVisionModel():
    def encode_image(self, image, crop=True):
        comfy.model_management.load_model_gpu(self.patcher)
        pixel_values = clip_preprocess(image.to(self.load_device), size=self.image_size, mean=self.image_mean, std=self.image_std, crop=crop).float()
-        out = self.model(pixel_values=pixel_values, intermediate_output='all' if self.return_all_hidden_states else -2)
+        out = self.model(pixel_values=pixel_values, intermediate_output=-2)

        outputs = Output()
        outputs["last_hidden_state"] = out[0].to(comfy.model_management.intermediate_device())
        outputs["image_embeds"] = out[2].to(comfy.model_management.intermediate_device())
-        if self.return_all_hidden_states:
-            all_hs = out[1].to(comfy.model_management.intermediate_device())
-            outputs["penultimate_hidden_states"] = all_hs[:, -2]
-            outputs["all_hidden_states"] = all_hs
-        else:
-            outputs["penultimate_hidden_states"] = out[1].to(comfy.model_management.intermediate_device())
-
+        outputs["penultimate_hidden_states"] = out[1].to(comfy.model_management.intermediate_device())
        outputs["mm_projected"] = out[3]
        return outputs

@@ -136,12 +124,8 @@ def load_clipvision_from_sd(sd, prefix="", convert_keys=False):
                json_config = os.path.join(os.path.dirname(os.path.realpath(__file__)), "clip_vision_config_vitl_336.json")
        else:
            json_config = os.path.join(os.path.dirname(os.path.realpath(__file__)), "clip_vision_config_vitl.json")
-
-    # Dinov2
-    elif 'encoder.layer.39.layer_scale2.lambda1' in sd:
+    elif "embeddings.patch_embeddings.projection.weight" in sd:
        json_config = os.path.join(os.path.join(os.path.dirname(os.path.realpath(__file__)), "image_encoders"), "dino2_giant.json")
-    elif 'encoder.layer.23.layer_scale2.lambda1' in sd:
-        json_config = os.path.join(os.path.join(os.path.dirname(os.path.realpath(__file__)), "image_encoders"), "dino2_large.json")
    else:
        return None

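Review note on the Dinov2 detection in this hunk: the removed branches picked a config by probing for the key of the deepest layer, testing `encoder.layer.39` before `encoder.layer.23`. The order matters, since a 40-layer state dict would presumably also contain the layer-23 key. A minimal sketch of that probing, using plain dicts in place of a real state dict (the helper name `pick_dino2_config` and the fake key sets are illustrative assumptions, not part of the codebase):

```python
import os


def pick_dino2_config(sd, base_dir="image_encoders"):
    """Return a config path based on which layer keys exist in `sd`, or None.

    Hypothetical helper: it only mirrors the key probes from the removed branches.
    """
    # Probe the deepest model first: a 40-layer checkpoint also has a layer-23 key.
    if "encoder.layer.39.layer_scale2.lambda1" in sd:
        return os.path.join(base_dir, "dino2_giant.json")
    if "encoder.layer.23.layer_scale2.lambda1" in sd:
        return os.path.join(base_dir, "dino2_large.json")
    return None


# Fake state dicts: only the key names matter for detection.
giant_sd = {f"encoder.layer.{i}.layer_scale2.lambda1": None for i in range(40)}
large_sd = {f"encoder.layer.{i}.layer_scale2.lambda1": None for i in range(24)}
```

By contrast, the replacement line keys on `embeddings.patch_embeddings.projection.weight` alone and always selects `dino2_giant.json`, so it does not distinguish the two sizes.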
--- a/comfy/conds.py
+++ b/comfy/conds.py
@@ -1,7 +1,6 @@
 import torch
 import math
 import comfy.utils
-import logging


 class CONDRegular:
@@ -11,15 +10,12 @@ class CONDRegular:
    def _copy_with(self, cond):
        return self.__class__(cond)

-    def process_cond(self, batch_size, **kwargs):
-        return self._copy_with(comfy.utils.repeat_to_batch_size(self.cond, batch_size))
+    def process_cond(self, batch_size, device, **kwargs):
+        return self._copy_with(comfy.utils.repeat_to_batch_size(self.cond, batch_size).to(device))

    def can_concat(self, other):
        if self.cond.shape != other.cond.shape:
            return False
-        if self.cond.device != other.cond.device:
-            logging.warning("WARNING: conds not on same device, skipping concat.")
-            return False
        return True

    def concat(self, others):
@@ -33,14 +29,14 @@ class CONDRegular:


 class CONDNoiseShape(CONDRegular):
-    def process_cond(self, batch_size, area, **kwargs):
+    def process_cond(self, batch_size, device, area, **kwargs):
        data = self.cond
        if area is not None:
            dims = len(area) // 2
            for i in range(dims):
                data = data.narrow(i + 2, area[i + dims], area[i])

-        return self._copy_with(comfy.utils.repeat_to_batch_size(data, batch_size))
+        return self._copy_with(comfy.utils.repeat_to_batch_size(data, batch_size).to(device))


 class CONDCrossAttn(CONDRegular):
@@ -55,9 +51,6 @@ class CONDCrossAttn(CONDRegular):
            diff = mult_min // min(s1[1], s2[1])
            if diff > 4: #arbitrary limit on the padding because it's probably going to impact performance negatively if it's too much
                return False
-        if self.cond.device != other.cond.device:
-            logging.warning("WARNING: conds not on same device: skipping concat.")
-            return False
        return True

    def concat(self, others):
@@ -80,7 +73,7 @@ class CONDConstant(CONDRegular):
    def __init__(self, cond):
        self.cond = cond

-    def process_cond(self, batch_size, **kwargs):
+    def process_cond(self, batch_size, device, **kwargs):
        return self._copy_with(self.cond)

    def can_concat(self, other):
@@ -99,10 +92,10 @@ class CONDList(CONDRegular):
    def __init__(self, cond):
        self.cond = cond

-    def process_cond(self, batch_size, **kwargs):
+    def process_cond(self, batch_size, device, **kwargs):
        out = []
        for c in self.cond:
-            out.append(comfy.utils.repeat_to_batch_size(c, batch_size))
+            out.append(comfy.utils.repeat_to_batch_size(c, batch_size).to(device))

        return self._copy_with(out)

--- a/comfy/context_windows.py
+++ b/comfy/context_windows.py
@@ -1,540 +0,0 @@
-from __future__ import annotations
-from typing import TYPE_CHECKING, Callable
-import torch
-import numpy as np
-import collections
-from dataclasses import dataclass
-from abc import ABC, abstractmethod
-import logging
-import comfy.model_management
-import comfy.patcher_extension
-if TYPE_CHECKING:
-    from comfy.model_base import BaseModel
-    from comfy.model_patcher import ModelPatcher
-    from comfy.controlnet import ControlBase
-
-
-class ContextWindowABC(ABC):
-    def __init__(self):
-        ...
-
-    @abstractmethod
-    def get_tensor(self, full: torch.Tensor) -> torch.Tensor:
-        """
-        Get torch.Tensor applicable to current window.
-        """
-        raise NotImplementedError("Not implemented.")
-
-    @abstractmethod
-    def add_window(self, full: torch.Tensor, to_add: torch.Tensor) -> torch.Tensor:
-        """
-        Apply torch.Tensor of window to the full tensor, in place. Returns reference to updated full tensor, not a copy.
-        """
-        raise NotImplementedError("Not implemented.")
-
-class ContextHandlerABC(ABC):
-    def __init__(self):
-        ...
-
-    @abstractmethod
-    def should_use_context(self, model: BaseModel, conds: list[list[dict]], x_in: torch.Tensor, timestep: torch.Tensor, model_options: dict[str]) -> bool:
-        raise NotImplementedError("Not implemented.")
-
-    @abstractmethod
-    def get_resized_cond(self, cond_in: list[dict], x_in: torch.Tensor, window: ContextWindowABC, device=None) -> list:
-        raise NotImplementedError("Not implemented.")
-
-    @abstractmethod
-    def execute(self, calc_cond_batch: Callable, model: BaseModel, conds: list[list[dict]], x_in: torch.Tensor, timestep: torch.Tensor, model_options: dict[str]):
-        raise NotImplementedError("Not implemented.")
-
-
-
-class IndexListContextWindow(ContextWindowABC):
-    def __init__(self, index_list: list[int], dim: int=0):
-        self.index_list = index_list
-        self.context_length = len(index_list)
-        self.dim = dim
-
-    def get_tensor(self, full: torch.Tensor, device=None, dim=None) -> torch.Tensor:
-        if dim is None:
-            dim = self.dim
-        if dim == 0 and full.shape[dim] == 1:
-            return full
-        idx = [slice(None)] * dim + [self.index_list]
-        return full[idx].to(device)
-
-    def add_window(self, full: torch.Tensor, to_add: torch.Tensor, dim=None) -> torch.Tensor:
-        if dim is None:
-            dim = self.dim
-        idx = [slice(None)] * dim + [self.index_list]
-        full[idx] += to_add
-        return full
-
-
-class IndexListCallbacks:
-    EVALUATE_CONTEXT_WINDOWS = "evaluate_context_windows"
-    COMBINE_CONTEXT_WINDOW_RESULTS = "combine_context_window_results"
-    EXECUTE_START = "execute_start"
-    EXECUTE_CLEANUP = "execute_cleanup"
-
-    def init_callbacks(self):
-        return {}
-
-
-@dataclass
-class ContextSchedule:
-    name: str
-    func: Callable
-
-@dataclass
-class ContextFuseMethod:
-    name: str
-    func: Callable
-
-ContextResults = collections.namedtuple("ContextResults", ['window_idx', 'sub_conds_out', 'sub_conds', 'window'])
-class IndexListContextHandler(ContextHandlerABC):
-    def __init__(self, context_schedule: ContextSchedule, fuse_method: ContextFuseMethod, context_length: int=1, context_overlap: int=0, context_stride: int=1, closed_loop=False, dim=0):
-        self.context_schedule = context_schedule
-        self.fuse_method = fuse_method
-        self.context_length = context_length
-        self.context_overlap = context_overlap
-        self.context_stride = context_stride
-        self.closed_loop = closed_loop
-        self.dim = dim
-        self._step = 0
-
-        self.callbacks = {}
-
-    def should_use_context(self, model: BaseModel, conds: list[list[dict]], x_in: torch.Tensor, timestep: torch.Tensor, model_options: dict[str]) -> bool:
-        # for now, assume first dim is batch - should have stored on BaseModel in actual implementation
-        if x_in.size(self.dim) > self.context_length:
-            logging.info(f"Using context windows {self.context_length} for {x_in.size(self.dim)} frames.")
-            return True
-        return False
-
-    def prepare_control_objects(self, control: ControlBase, device=None) -> ControlBase:
-        if control.previous_controlnet is not None:
-            self.prepare_control_objects(control.previous_controlnet, device)
-        return control
-
-    def get_resized_cond(self, cond_in: list[dict], x_in: torch.Tensor, window: IndexListContextWindow, device=None) -> list:
-        if cond_in is None:
-            return None
-        # reuse or resize cond items to match context requirements
-        resized_cond = []
-        # cond object is a list containing a dict - outer list is irrelevant, so just loop through it
-        for actual_cond in cond_in:
-            resized_actual_cond = actual_cond.copy()
-            # now we are in the inner dict - "pooled_output" is a tensor, "control" is a ControlBase object, "model_conds" is dictionary
-            for key in actual_cond:
-                try:
-                    cond_item = actual_cond[key]
-                    if isinstance(cond_item, torch.Tensor):
-                        # check that tensor is the expected length - x.size(0)
-                        if self.dim < cond_item.ndim and cond_item.size(self.dim) == x_in.size(self.dim):
-                            # if so, it's subsetting time - tell controls the expected indices so they can handle them
-                            actual_cond_item = window.get_tensor(cond_item)
-                            resized_actual_cond[key] = actual_cond_item.to(device)
-                        else:
-                            resized_actual_cond[key] = cond_item.to(device)
-                    # look for control
-                    elif key == "control":
-                        resized_actual_cond[key] = self.prepare_control_objects(cond_item, device)
-                    elif isinstance(cond_item, dict):
-                        new_cond_item = cond_item.copy()
-                        # when in dictionary, look for tensors and CONDCrossAttn [comfy/conds.py] (has cond attr that is a tensor)
-                        for cond_key, cond_value in new_cond_item.items():
-                            if isinstance(cond_value, torch.Tensor):
-                                if cond_value.ndim < self.dim and cond_value.size(0) == x_in.size(self.dim):
-                                    new_cond_item[cond_key] = window.get_tensor(cond_value, device)
-                            # if has cond that is a Tensor, check if needs to be subset
-                            elif hasattr(cond_value, "cond") and isinstance(cond_value.cond, torch.Tensor):
-                                if cond_value.cond.ndim < self.dim and cond_value.cond.size(0) == x_in.size(self.dim):
-                                    new_cond_item[cond_key] = cond_value._copy_with(window.get_tensor(cond_value.cond, device))
-                            elif cond_key == "num_video_frames": # for SVD
-                                new_cond_item[cond_key] = cond_value._copy_with(cond_value.cond)
-                                new_cond_item[cond_key].cond = window.context_length
-                        resized_actual_cond[key] = new_cond_item
-                    else:
-                        resized_actual_cond[key] = cond_item
-                finally:
-                    del cond_item  # just in case to prevent VRAM issues
-            resized_cond.append(resized_actual_cond)
-        return resized_cond
-
-    def set_step(self, timestep: torch.Tensor, model_options: dict[str]):
-        mask = torch.isclose(model_options["transformer_options"]["sample_sigmas"], timestep, rtol=0.0001)
-        matches = torch.nonzero(mask)
-        if torch.numel(matches) == 0:
-            raise Exception("No sample_sigmas matched current timestep; something went wrong.")
-        self._step = int(matches[0].item())
-
-    def get_context_windows(self, model: BaseModel, x_in: torch.Tensor, model_options: dict[str]) -> list[IndexListContextWindow]:
-        full_length = x_in.size(self.dim) # TODO: choose dim based on model
-        context_windows = self.context_schedule.func(full_length, self, model_options)
-        context_windows = [IndexListContextWindow(window, dim=self.dim) for window in context_windows]
-        return context_windows
-
-    def execute(self, calc_cond_batch: Callable, model: BaseModel, conds: list[list[dict]], x_in: torch.Tensor, timestep: torch.Tensor, model_options: dict[str]):
-        self.set_step(timestep, model_options)
-        context_windows = self.get_context_windows(model, x_in, model_options)
-        enumerated_context_windows = list(enumerate(context_windows))
-
-        conds_final = [torch.zeros_like(x_in) for _ in conds]
-        if self.fuse_method.name == ContextFuseMethods.RELATIVE:
-            counts_final = [torch.ones(get_shape_for_dim(x_in, self.dim), device=x_in.device) for _ in conds]
-        else:
-            counts_final = [torch.zeros(get_shape_for_dim(x_in, self.dim), device=x_in.device) for _ in conds]
-        biases_final = [([0.0] * x_in.shape[self.dim]) for _ in conds]
-
-        for callback in comfy.patcher_extension.get_all_callbacks(IndexListCallbacks.EXECUTE_START, self.callbacks):
-            callback(self, model, x_in, conds, timestep, model_options)
-
-        for enum_window in enumerated_context_windows:
-            results = self.evaluate_context_windows(calc_cond_batch, model, x_in, conds, timestep, [enum_window], model_options)
-            for result in results:
-                self.combine_context_window_results(x_in, result.sub_conds_out, result.sub_conds, result.window, result.window_idx, len(enumerated_context_windows), timestep,
-                                            conds_final, counts_final, biases_final)
-        try:
-            # finalize conds
-            if self.fuse_method.name == ContextFuseMethods.RELATIVE:
-                # relative is already normalized, so return as is
-                del counts_final
-                return conds_final
-            else:
-                # normalize conds via division by context usage counts
-                for i in range(len(conds_final)):
-                    conds_final[i] /= counts_final[i]
-                del counts_final
-                return conds_final
-        finally:
-            for callback in comfy.patcher_extension.get_all_callbacks(IndexListCallbacks.EXECUTE_CLEANUP, self.callbacks):
-                callback(self, model, x_in, conds, timestep, model_options)
-
-    def evaluate_context_windows(self, calc_cond_batch: Callable, model: BaseModel, x_in: torch.Tensor, conds, timestep: torch.Tensor, enumerated_context_windows: list[tuple[int, IndexListContextWindow]],
-                                model_options, device=None, first_device=None):
-        results: list[ContextResults] = []
-        for window_idx, window in enumerated_context_windows:
-            # allow processing to end between context window executions for faster Cancel
-            comfy.model_management.throw_exception_if_processing_interrupted()
-
-            for callback in comfy.patcher_extension.get_all_callbacks(IndexListCallbacks.EVALUATE_CONTEXT_WINDOWS, self.callbacks):
-                callback(self, model, x_in, conds, timestep, model_options, window_idx, window, model_options, device, first_device)
-
-            # update exposed params
-            model_options["transformer_options"]["context_window"] = window
-            # get subsections of x, timestep, conds
-            sub_x = window.get_tensor(x_in, device)
-            sub_timestep = window.get_tensor(timestep, device, dim=0)
-            sub_conds = [self.get_resized_cond(cond, x_in, window, device) for cond in conds]
-
-            sub_conds_out = calc_cond_batch(model, sub_conds, sub_x, sub_timestep, model_options)
-            if device is not None:
-                for i in range(len(sub_conds_out)):
-                    sub_conds_out[i] = sub_conds_out[i].to(x_in.device)
-            results.append(ContextResults(window_idx, sub_conds_out, sub_conds, window))
-        return results
-
-
-    def combine_context_window_results(self, x_in: torch.Tensor, sub_conds_out, sub_conds, window: IndexListContextWindow, window_idx: int, total_windows: int, timestep: torch.Tensor,
-                                    conds_final: list[torch.Tensor], counts_final: list[torch.Tensor], biases_final: list[torch.Tensor]):
-        if self.fuse_method.name == ContextFuseMethods.RELATIVE:
-            for pos, idx in enumerate(window.index_list):
-                # bias is the influence of a specific index in relation to the whole context window
-                bias = 1 - abs(idx - (window.index_list[0] + window.index_list[-1]) / 2) / ((window.index_list[-1] - window.index_list[0] + 1e-2) / 2)
-                bias = max(1e-2, bias)
-                # take weighted average relative to total bias of current idx
-                for i in range(len(sub_conds_out)):
-                    bias_total = biases_final[i][idx]
-                    prev_weight = (bias_total / (bias_total + bias))
-                    new_weight = (bias / (bias_total + bias))
-                    # account for dims of tensors
-                    idx_window = [slice(None)] * self.dim + [idx]
-                    pos_window = [slice(None)] * self.dim + [pos]
-                    # apply new values
-                    conds_final[i][idx_window] = conds_final[i][idx_window] * prev_weight + sub_conds_out[i][pos_window] * new_weight
-                    biases_final[i][idx] = bias_total + bias
-        else:
-            # add conds and counts based on weights of fuse method
-            weights = get_context_weights(window.context_length, x_in.shape[self.dim], window.index_list, self, sigma=timestep)
-            weights_tensor = match_weights_to_dim(weights, x_in, self.dim, device=x_in.device)
-            for i in range(len(sub_conds_out)):
-                window.add_window(conds_final[i], sub_conds_out[i] * weights_tensor)
-                window.add_window(counts_final[i], weights_tensor)
-
-        for callback in comfy.patcher_extension.get_all_callbacks(IndexListCallbacks.COMBINE_CONTEXT_WINDOW_RESULTS, self.callbacks):
-            callback(self, x_in, sub_conds_out, sub_conds, window, window_idx, total_windows, timestep, conds_final, counts_final, biases_final)
-
-
-def _prepare_sampling_wrapper(executor, model, noise_shape: torch.Tensor, *args, **kwargs):
-    # limit noise_shape length to context_length for more accurate vram use estimation
-    model_options = kwargs.get("model_options", None)
-    if model_options is None:
-        raise Exception("model_options not found in prepare_sampling_wrapper; this should never happen, something went wrong.")
-    handler: IndexListContextHandler = model_options.get("context_handler", None)
-    if handler is not None:
-        noise_shape = list(noise_shape)
-        noise_shape[handler.dim] = min(noise_shape[handler.dim], handler.context_length)
-    return executor(model, noise_shape, *args, **kwargs)
-
-
-def create_prepare_sampling_wrapper(model: ModelPatcher):
-    model.add_wrapper_with_key(
-        comfy.patcher_extension.WrappersMP.PREPARE_SAMPLING,
-        "ContextWindows_prepare_sampling",
-        _prepare_sampling_wrapper
-    )
-
-
-def match_weights_to_dim(weights: list[float], x_in: torch.Tensor, dim: int, device=None) -> torch.Tensor:
-    total_dims = len(x_in.shape)
-    weights_tensor = torch.Tensor(weights).to(device=device)
-    for _ in range(dim):
-        weights_tensor = weights_tensor.unsqueeze(0)
-    for _ in range(total_dims - dim - 1):
-        weights_tensor = weights_tensor.unsqueeze(-1)
-    return weights_tensor
-
-def get_shape_for_dim(x_in: torch.Tensor, dim: int) -> list[int]:
-    total_dims = len(x_in.shape)
-    shape = []
-    for _ in range(dim):
-        shape.append(1)
-    shape.append(x_in.shape[dim])
-    for _ in range(total_dims - dim - 1):
-        shape.append(1)
-    return shape
-
-class ContextSchedules:
-    UNIFORM_LOOPED = "looped_uniform"
-    UNIFORM_STANDARD = "standard_uniform"
-    STATIC_STANDARD = "standard_static"
-    BATCHED = "batched"
-
-
-# from https://github.com/neggles/animatediff-cli/blob/main/src/animatediff/pipelines/context.py
-def create_windows_uniform_looped(num_frames: int, handler: IndexListContextHandler, model_options: dict[str]):
-    windows = []
-    if num_frames < handler.context_length:
-        windows.append(list(range(num_frames)))
-        return windows
-
-    context_stride = min(handler.context_stride, int(np.ceil(np.log2(num_frames / handler.context_length))) + 1)
-    # obtain uniform windows as normal, looping and all
-    for context_step in 1 << np.arange(context_stride):
-        pad = int(round(num_frames * ordered_halving(handler._step)))
-        for j in range(
-            int(ordered_halving(handler._step) * context_step) + pad,
-            num_frames + pad + (0 if handler.closed_loop else -handler.context_overlap),
-            (handler.context_length * context_step - handler.context_overlap),
-        ):
-            windows.append([e % num_frames for e in range(j, j + handler.context_length * context_step, context_step)])
-
-    return windows
-
-def create_windows_uniform_standard(num_frames: int, handler: IndexListContextHandler, model_options: dict[str]):
-    # unlike looped, uniform_standard does NOT allow windows that loop back to the beginning;
-    # instead, they get shifted to the corresponding end of the frames.
-    # in the case that a window (shifted or not) is identical to the previous one, it gets skipped.
-    windows = []
-    if num_frames <= handler.context_length:
-        windows.append(list(range(num_frames)))
-        return windows
-
-    context_stride = min(handler.context_stride, int(np.ceil(np.log2(num_frames / handler.context_length))) + 1)
-    # first, obtain uniform windows as normal, looping and all
-    for context_step in 1 << np.arange(context_stride):
-        pad = int(round(num_frames * ordered_halving(handler._step)))
-        for j in range(
-            int(ordered_halving(handler._step) * context_step) + pad,
-            num_frames + pad + (-handler.context_overlap),
-            (handler.context_length * context_step - handler.context_overlap),
-        ):
-            windows.append([e % num_frames for e in range(j, j + handler.context_length * context_step, context_step)])
-
-    # now that windows are created, shift any windows that loop, and delete duplicate windows
-    delete_idxs = []
-    win_i = 0
-    while win_i < len(windows):
-        # if window rolls over itself, need to shift it
-        is_roll, roll_idx = does_window_roll_over(windows[win_i], num_frames)
-        if is_roll:
-            roll_val = windows[win_i][roll_idx]  # roll_val might not be 0 for windows of higher strides
-            shift_window_to_end(windows[win_i], num_frames=num_frames)
-            # check if next window (cyclical) is missing roll_val
-            if roll_val not in windows[(win_i+1) % len(windows)]:
-                # need to insert new window here - just insert window starting at roll_val
-                windows.insert(win_i+1, list(range(roll_val, roll_val + handler.context_length)))
-        # delete window if it's not unique
-        for pre_i in range(0, win_i):
-            if windows[win_i] == windows[pre_i]:
-                delete_idxs.append(win_i)
-                break
-        win_i += 1
-
-    # reverse delete_idxs so that they will be deleted in an order that doesn't break idx correlation
-    delete_idxs.reverse()
-    for i in delete_idxs:
-        windows.pop(i)
-
-    return windows
-
-
-def create_windows_static_standard(num_frames: int, handler: IndexListContextHandler, model_options: dict[str]):
-    windows = []
-    if num_frames <= handler.context_length:
-        windows.append(list(range(num_frames)))
-        return windows
-    # always return the same set of windows
-    delta = handler.context_length - handler.context_overlap
-    for start_idx in range(0, num_frames, delta):
-        # if past the end of frames, move start_idx back to allow same context_length
-        ending = start_idx + handler.context_length
-        if ending >= num_frames:
-            final_delta = ending - num_frames
-            final_start_idx = start_idx - final_delta
-            windows.append(list(range(final_start_idx, final_start_idx + handler.context_length)))
-            break
-        windows.append(list(range(start_idx, start_idx + handler.context_length)))
-    return windows
-
-
-def create_windows_batched(num_frames: int, handler: IndexListContextHandler, model_options: dict[str]):
-    windows = []
-    if num_frames <= handler.context_length:
-        windows.append(list(range(num_frames)))
-        return windows
-    # always return the same set of windows;
-    # no overlap, just cut up based on context_length;
-    # last window size will be different if num_frames % handler.context_length != 0
-    for start_idx in range(0, num_frames, handler.context_length):
-        windows.append(list(range(start_idx, min(start_idx + handler.context_length, num_frames))))
-    return windows
-
-
-def create_windows_default(num_frames: int, handler: IndexListContextHandler):
-    return [list(range(num_frames))]
-
-
-CONTEXT_MAPPING = {
-    ContextSchedules.UNIFORM_LOOPED: create_windows_uniform_looped,
-    ContextSchedules.UNIFORM_STANDARD: create_windows_uniform_standard,
-    ContextSchedules.STATIC_STANDARD: create_windows_static_standard,
-    ContextSchedules.BATCHED: create_windows_batched,
-}
-
-
-def get_matching_context_schedule(context_schedule: str) -> ContextSchedule:
-    func = CONTEXT_MAPPING.get(context_schedule, None)
-    if func is None:
-        raise ValueError(f"Unknown context_schedule '{context_schedule}'.")
-    return ContextSchedule(context_schedule, func)
-
-
-def get_context_weights(length: int, full_length: int, idxs: list[int], handler: IndexListContextHandler, sigma: torch.Tensor=None):
-    return handler.fuse_method.func(length, sigma=sigma, handler=handler, full_length=full_length, idxs=idxs)
-
-
-def create_weights_flat(length: int, **kwargs) -> list[float]:
-    # weight is the same for all
-    return [1.0] * length
-
-def create_weights_pyramid(length: int, **kwargs) -> list[float]:
-    # weight is based on the distance away from the edge of the context window;
-    # based on weighted average concept in FreeNoise paper
-    if length % 2 == 0:
-        max_weight = length // 2
-        weight_sequence = list(range(1, max_weight + 1, 1)) + list(range(max_weight, 0, -1))
-    else:
-        max_weight = (length + 1) // 2
-        weight_sequence = list(range(1, max_weight, 1)) + [max_weight] + list(range(max_weight - 1, 0, -1))
-    return weight_sequence
-
-def create_weights_overlap_linear(length: int, full_length: int, idxs: list[int], handler: IndexListContextHandler, **kwargs):
-    # based on code in Kijai's WanVideoWrapper: https://github.com/kijai/ComfyUI-WanVideoWrapper/blob/dbb2523b37e4ccdf45127e5ae33e31362f755c8e/nodes.py#L1302
-    # only expected overlap is given different weights
-    weights_torch = torch.ones((length))
-    # blend left-side on all except first window
-    if min(idxs) > 0:
-        ramp_up = torch.linspace(1e-37, 1, handler.context_overlap)
-        weights_torch[:handler.context_overlap] = ramp_up
-    # blend right-side on all except last window
-    if max(idxs) < full_length-1:
-        ramp_down = torch.linspace(1, 1e-37, handler.context_overlap)
-        weights_torch[-handler.context_overlap:] = ramp_down
-    return weights_torch
-
-class ContextFuseMethods:
-    FLAT = "flat"
-    PYRAMID = "pyramid"
-    RELATIVE = "relative"
-    OVERLAP_LINEAR = "overlap-linear"
-
-    LIST = [PYRAMID, FLAT, OVERLAP_LINEAR]
-    LIST_STATIC = [PYRAMID, RELATIVE, FLAT, OVERLAP_LINEAR]
-
-
-FUSE_MAPPING = {
-    ContextFuseMethods.FLAT: create_weights_flat,
-    ContextFuseMethods.PYRAMID: create_weights_pyramid,
-    ContextFuseMethods.RELATIVE: create_weights_pyramid,
-    ContextFuseMethods.OVERLAP_LINEAR: create_weights_overlap_linear,
-}
-
-def get_matching_fuse_method(fuse_method: str) -> ContextFuseMethod:
-    func = FUSE_MAPPING.get(fuse_method, None)
-    if func is None:
-        raise ValueError(f"Unknown fuse_method '{fuse_method}'.")
-    return ContextFuseMethod(fuse_method, func)
-
-# Returns fraction that has denominator that is a power of 2
-def ordered_halving(val):
-    # get binary value, padded with 0s for 64 bits
-    bin_str = f"{val:064b}"
-    # flip binary value, padding included
-    bin_flip = bin_str[::-1]
-    # convert binary to int
-    as_int = int(bin_flip, 2)
-    # divide by 1 << 64, equivalent to 2**64, or 18446744073709551616,
-    # or b10000000000000000000000000000000000000000000000000000000000000000 (1 with 64 zero's)
-    return as_int / (1 << 64)
-
-
-def get_missing_indexes(windows: list[list[int]], num_frames: int) -> list[int]:
-    all_indexes = list(range(num_frames))
-    for w in windows:
-        for val in w:
-            try:
-                all_indexes.remove(val)
-            except ValueError:
-                pass
-    return all_indexes
-
-
-def does_window_roll_over(window: list[int], num_frames: int) -> tuple[bool, int]:
-    prev_val = -1
-    for i, val in enumerate(window):
-        val = val % num_frames
-        if val < prev_val:
-            return True, i
-        prev_val = val
-    return False, -1
-
-
-def shift_window_to_start(window: list[int], num_frames: int):
-    start_val = window[0]
-    for i in range(len(window)):
-        # 1) subtract each element by start_val to move vals relative to the start of all frames
-        # 2) add num_frames and take modulus to get adjusted vals
-        window[i] = ((window[i] - start_val) + num_frames) % num_frames
-
-
-def shift_window_to_end(window: list[int], num_frames: int):
-    # 1) shift window to start
-    shift_window_to_start(window, num_frames)
-    end_val = window[-1]
-    end_delta = num_frames - end_val - 1
-    for i in range(len(window)):
-        # 2) add end_delta to each val to slide windows to end
-        window[i] = window[i] + end_delta
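For reviewers weighing the removal of `comfy/context_windows.py`, the sketch below restates three of its pure-Python helpers as standalone functions, so their behavior can be checked without the rest of the codebase. It is a simplified restatement under assumptions: `static_windows` and `pyramid_weights` are shortened stand-in names that take plain integers instead of a handler object, and no tensors are involved.

```python
def static_windows(num_frames, context_length, context_overlap):
    """Fixed-size overlapping windows, as in the removed static-standard schedule."""
    if num_frames <= context_length:
        return [list(range(num_frames))]
    windows = []
    delta = context_length - context_overlap
    for start in range(0, num_frames, delta):
        end = start + context_length
        if end >= num_frames:
            # Slide the last window back so it keeps the full context_length.
            start = num_frames - context_length
            windows.append(list(range(start, start + context_length)))
            break
        windows.append(list(range(start, start + context_length)))
    return windows


def pyramid_weights(length):
    """Weights that rise toward the window center and fall toward its edges."""
    if length % 2 == 0:
        peak = length // 2
        return list(range(1, peak + 1)) + list(range(peak, 0, -1))
    peak = (length + 1) // 2
    return list(range(1, peak)) + [peak] + list(range(peak - 1, 0, -1))


def ordered_halving(val):
    """Reverse the 64-bit binary form of `val` and scale it into [0, 1)."""
    bin_flip = f"{val:064b}"[::-1]
    return int(bin_flip, 2) / (1 << 64)


if __name__ == "__main__":
    print(static_windows(10, 4, 1))
    print(pyramid_weights(5))
    print([ordered_halving(step) for step in range(4)])
```

With 10 frames, a context length of 4, and an overlap of 1, the schedule yields three windows that share one frame with each neighbor. `ordered_halving` maps steps 0, 1, 2, 3 to 0.0, 0.5, 0.25, 0.75, which is how the uniform schedules shift their window offsets from step to step.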
--- a/comfy/controlnet.py
+++ b/comfy/controlnet.py
@@ -28,7 +28,6 @@ import comfy.model_detection
 import comfy.model_patcher
 import comfy.ops
 import comfy.latent_formats
-import comfy.model_base

 import comfy.cldm.cldm
 import comfy.t2i_adapter.adapter
@@ -36,7 +35,6 @@ import comfy.ldm.cascade.controlnet
 import comfy.cldm.mmdit
 import comfy.ldm.hydit.controlnet
 import comfy.ldm.flux.controlnet
-import comfy.ldm.qwen_image.controlnet
 import comfy.cldm.dit_embedder
 from typing import TYPE_CHECKING
 if TYPE_CHECKING:
@@ -45,6 +43,7 @@ if TYPE_CHECKING:

 def broadcast_image_to(tensor, target_batch_size, batched_number):
    current_batch_size = tensor.shape[0]
+    #print(current_batch_size, target_batch_size)
    if current_batch_size == 1:
        return tensor

@@ -237,11 +236,11 @@ class ControlNet(ControlBase):
            self.cond_hint = None
            compression_ratio = self.compression_ratio
            if self.vae is not None:
-                compression_ratio *= self.vae.spacial_compression_encode()
+                compression_ratio *= self.vae.downscale_ratio
            else:
                if self.latent_format is not None:
                    raise ValueError("This Controlnet needs a VAE but none was provided, please use a ControlNetApply node with a VAE input and connect it.")
-            self.cond_hint = comfy.utils.common_upscale(self.cond_hint_original, x_noisy.shape[-1] * compression_ratio, x_noisy.shape[-2] * compression_ratio, self.upscale_algorithm, "center")
+            self.cond_hint = comfy.utils.common_upscale(self.cond_hint_original, x_noisy.shape[3] * compression_ratio, x_noisy.shape[2] * compression_ratio, self.upscale_algorithm, "center")
            self.cond_hint = self.preprocess_image(self.cond_hint)
            if self.vae is not None:
                loaded_models = comfy.model_management.loaded_models(only_currently_used=True)
@@ -253,10 +252,7 @@ class ControlNet(ControlBase):
                to_concat = []
                for c in self.extra_concat_orig:
                    c = c.to(self.cond_hint.device)
-                    c = comfy.utils.common_upscale(c, self.cond_hint.shape[-1], self.cond_hint.shape[-2], self.upscale_algorithm, "center")
-                    if c.ndim < self.cond_hint.ndim:
-                        c = c.unsqueeze(2)
-                        c = comfy.utils.repeat_to_batch_size(c, self.cond_hint.shape[2], dim=2)
+                    c = comfy.utils.common_upscale(c, self.cond_hint.shape[3], self.cond_hint.shape[2], self.upscale_algorithm, "center")
                    to_concat.append(comfy.utils.repeat_to_batch_size(c, self.cond_hint.shape[0]))
                self.cond_hint = torch.cat([self.cond_hint] + to_concat, dim=1)

@@ -269,12 +265,12 @@ class ControlNet(ControlBase):
        for c in self.extra_conds:
            temp = cond.get(c, None)
            if temp is not None:
-                extra[c] = comfy.model_base.convert_tensor(temp, dtype, x_noisy.device)
+                extra[c] = temp.to(dtype)

        timestep = self.model_sampling_current.timestep(t)
        x_noisy = self.model_sampling_current.calculate_input(t, x_noisy)

-        control = self.control_model(x=x_noisy.to(dtype), hint=self.cond_hint, timesteps=timestep.to(dtype), context=comfy.model_management.cast_to_device(context, x_noisy.device, dtype), **extra)
+        control = self.control_model(x=x_noisy.to(dtype), hint=self.cond_hint, timesteps=timestep.to(dtype), context=context.to(dtype), **extra)
        return self.control_merge(control, control_prev, output_dtype=None)

    def copy(self):
@@ -586,22 +582,6 @@ def load_controlnet_flux_instantx(sd, model_options={}):
    control = ControlNet(control_model, compression_ratio=1, latent_format=latent_format, concat_mask=concat_mask, load_device=load_device, manual_cast_dtype=manual_cast_dtype, extra_conds=extra_conds)
    return control

-def load_controlnet_qwen_instantx(sd, model_options={}):
-    model_config, operations, load_device, unet_dtype, manual_cast_dtype, offload_device = controlnet_config(sd, model_options=model_options)
-    control_latent_channels = sd.get("controlnet_x_embedder.weight").shape[1]
-
-    extra_condition_channels = 0
-    concat_mask = False
-    if control_latent_channels == 68: #inpaint controlnet
-        extra_condition_channels = control_latent_channels - 64
-        concat_mask = True
-    control_model = comfy.ldm.qwen_image.controlnet.QwenImageControlNetModel(extra_condition_channels=extra_condition_channels, operations=operations, device=offload_device, dtype=unet_dtype, **model_config.unet_config)
-    control_model = controlnet_load_state_dict(control_model, sd)
-    latent_format = comfy.latent_formats.Wan21()
-    extra_conds = []
-    control = ControlNet(control_model, compression_ratio=1, latent_format=latent_format, concat_mask=concat_mask, load_device=load_device, manual_cast_dtype=manual_cast_dtype, extra_conds=extra_conds)
-    return control
-
 def convert_mistoline(sd):
    return comfy.utils.state_dict_prefix_replace(sd, {"single_controlnet_blocks.": "controlnet_single_blocks."})

@@ -675,11 +655,8 @@ def load_controlnet_state_dict(state_dict, model=None, model_options={}):
                return load_controlnet_sd35(controlnet_data, model_options=model_options) #Stability sd3.5 format
            else:
                return load_controlnet_mmdit(controlnet_data, model_options=model_options) #SD3 diffusers controlnet
-        elif "transformer_blocks.0.img_mlp.net.0.proj.weight" in controlnet_data:
-            return load_controlnet_qwen_instantx(controlnet_data, model_options=model_options)
        elif "controlnet_x_embedder.weight" in controlnet_data:
            return load_controlnet_flux_instantx(controlnet_data, model_options=model_options)
-
    elif "controlnet_blocks.0.linear.weight" in controlnet_data: #mistoline flux
        return load_controlnet_flux_xlabs_mistoline(convert_mistoline(controlnet_data), mistoline=True, model_options=model_options)

--- a/comfy/gligen.py
+++ b/comfy/gligen.py
@@ -1,10 +1,55 @@
 import math
 import torch
 from torch import nn
-from .ldm.modules.attention import CrossAttention, FeedForward
+from .ldm.modules.attention import CrossAttention
+from inspect import isfunction
 import comfy.ops
 ops = comfy.ops.manual_cast

+def exists(val):
+    return val is not None
+
+
+def uniq(arr):
+    return{el: True for el in arr}.keys()
+
+
+def default(val, d):
+    if exists(val):
+        return val
+    return d() if isfunction(d) else d
+
+
+# feedforward
+class GEGLU(nn.Module):
+    def __init__(self, dim_in, dim_out):
+        super().__init__()
+        self.proj = ops.Linear(dim_in, dim_out * 2)
+
+    def forward(self, x):
+        x, gate = self.proj(x).chunk(2, dim=-1)
+        return x * torch.nn.functional.gelu(gate)
+
+
+class FeedForward(nn.Module):
+    def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.):
+        super().__init__()
+        inner_dim = int(dim * mult)
+        dim_out = default(dim_out, dim)
+        project_in = nn.Sequential(
+            ops.Linear(dim, inner_dim),
+            nn.GELU()
+        ) if not glu else GEGLU(dim, inner_dim)
+
+        self.net = nn.Sequential(
+            project_in,
+            nn.Dropout(dropout),
+            ops.Linear(inner_dim, dim_out)
+        )
+
+    def forward(self, x):
+        return self.net(x)
+

 class GatedCrossAttentionDense(nn.Module):
    def __init__(self, query_dim, context_dim, n_heads, d_head):
--- a/comfy/image_encoders/dino2.py
+++ b/comfy/image_encoders/dino2.py
@@ -31,20 +31,6 @@ class LayerScale(torch.nn.Module):
    def forward(self, x):
        return x * comfy.model_management.cast_to_device(self.lambda1, x.device, x.dtype)

-class Dinov2MLP(torch.nn.Module):
-    def __init__(self, hidden_size: int, dtype, device, operations):
-        super().__init__()
-
-        mlp_ratio = 4
-        hidden_features = int(hidden_size * mlp_ratio)
-        self.fc1 = operations.Linear(hidden_size, hidden_features, bias = True, device=device, dtype=dtype)
-        self.fc2 = operations.Linear(hidden_features, hidden_size, bias = True, device=device, dtype=dtype)
-
-    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
-        hidden_state = self.fc1(hidden_state)
-        hidden_state = torch.nn.functional.gelu(hidden_state)
-        hidden_state = self.fc2(hidden_state)
-        return hidden_state

 class SwiGLUFFN(torch.nn.Module):
    def __init__(self, dim, dtype, device, operations):
@@ -64,15 +50,12 @@ class SwiGLUFFN(torch.nn.Module):


 class Dino2Block(torch.nn.Module):
-    def __init__(self, dim, num_heads, layer_norm_eps, dtype, device, operations, use_swiglu_ffn):
+    def __init__(self, dim, num_heads, layer_norm_eps, dtype, device, operations):
        super().__init__()
        self.attention = Dino2AttentionBlock(dim, num_heads, layer_norm_eps, dtype, device, operations)
        self.layer_scale1 = LayerScale(dim, dtype, device, operations)
        self.layer_scale2 = LayerScale(dim, dtype, device, operations)
-        if use_swiglu_ffn:
-            self.mlp = SwiGLUFFN(dim, dtype, device, operations)
-        else:
-            self.mlp = Dinov2MLP(dim, dtype, device, operations)
+        self.mlp = SwiGLUFFN(dim, dtype, device, operations)
        self.norm1 = operations.LayerNorm(dim, eps=layer_norm_eps, dtype=dtype, device=device)
        self.norm2 = operations.LayerNorm(dim, eps=layer_norm_eps, dtype=dtype, device=device)

@@ -83,10 +66,9 @@ class Dino2Block(torch.nn.Module):


 class Dino2Encoder(torch.nn.Module):
-    def __init__(self, dim, num_heads, layer_norm_eps, num_layers, dtype, device, operations, use_swiglu_ffn):
+    def __init__(self, dim, num_heads, layer_norm_eps, num_layers, dtype, device, operations):
        super().__init__()
-        self.layer = torch.nn.ModuleList([Dino2Block(dim, num_heads, layer_norm_eps, dtype, device, operations, use_swiglu_ffn = use_swiglu_ffn)
-                                          for _ in range(num_layers)])
+        self.layer = torch.nn.ModuleList([Dino2Block(dim, num_heads, layer_norm_eps, dtype, device, operations) for _ in range(num_layers)])

    def forward(self, x, intermediate_output=None):
        optimized_attention = optimized_attention_for_device(x.device, False, small_input=True)
@@ -96,8 +78,8 @@ class Dino2Encoder(torch.nn.Module):
                intermediate_output = len(self.layer) + intermediate_output

        intermediate = None
-        for i, layer in enumerate(self.layer):
-            x = layer(x, optimized_attention)
+        for i, l in enumerate(self.layer):
+            x = l(x, optimized_attention)
            if i == intermediate_output:
                intermediate = x.clone()
        return x, intermediate
@@ -146,10 +128,9 @@ class Dinov2Model(torch.nn.Module):
        dim = config_dict["hidden_size"]
        heads = config_dict["num_attention_heads"]
        layer_norm_eps = config_dict["layer_norm_eps"]
-        use_swiglu_ffn = config_dict["use_swiglu_ffn"]

        self.embeddings = Dino2Embeddings(dim, dtype, device, operations)
-        self.encoder = Dino2Encoder(dim, heads, layer_norm_eps, num_layers, dtype, device, operations, use_swiglu_ffn = use_swiglu_ffn)
+        self.encoder = Dino2Encoder(dim, heads, layer_norm_eps, num_layers, dtype, device, operations)
        self.layernorm = operations.LayerNorm(dim, eps=layer_norm_eps, dtype=dtype, device=device)

    def forward(self, pixel_values, attention_mask=None, intermediate_output=None):
--- a/comfy/image_encoders/dino2_large.json
+++ b/comfy/image_encoders/dino2_large.json
@@ -1,22 +0,0 @@
-{
-  "hidden_size": 1024,
-  "use_mask_token": true,
-  "patch_size": 14,
-  "image_size": 518,
-  "num_channels": 3,
-  "num_attention_heads": 16,
-  "initializer_range": 0.02,
-  "attention_probs_dropout_prob": 0.0,
-  "hidden_dropout_prob": 0.0,
-  "hidden_act": "gelu",
-  "mlp_ratio": 4,
-  "model_type": "dinov2",
-  "num_hidden_layers": 24,
-  "layer_norm_eps": 1e-6,
-  "qkv_bias": true,
-  "use_swiglu_ffn": false,
-  "layerscale_value": 1.0,
-  "drop_path_rate": 0.0,
-  "image_mean": [0.485, 0.456, 0.406],
-  "image_std": [0.229, 0.224, 0.225]
-}
--- a/comfy/k_diffusion/sa_solver.py
+++ b/comfy/k_diffusion/sa_solver.py
@@ -1,121 +0,0 @@
-# SA-Solver: Stochastic Adams Solver (NeurIPS 2023, arXiv:2309.05019)
-# Conference: https://proceedings.neurips.cc/paper_files/paper/2023/file/f4a6806490d31216a3ba667eb240c897-Paper-Conference.pdf
-# Codebase ref: https://github.com/scxue/SA-Solver
-
-import math
-from typing import Union, Callable
-import torch
-
-
-def compute_exponential_coeffs(s: torch.Tensor, t: torch.Tensor, solver_order: int, tau_t: float) -> torch.Tensor:
-    """Compute (1 + tau^2) * integral of exp((1 + tau^2) * x) * x^p dx from s to t with exp((1 + tau^2) * t) factored out, using integration by parts.
-
-    Integral of exp((1 + tau^2) * x) * x^p dx
-        = product_terms[p] - (p / (1 + tau^2)) * integral of exp((1 + tau^2) * x) * x^(p-1) dx,
-    with base case p=0 where integral equals product_terms[0].
-
-    where
-        product_terms[p] = x^p * exp((1 + tau^2) * x) / (1 + tau^2).
-
-    Construct a recursive coefficient matrix following the above recursive relation to compute all integral terms up to p = (solver_order - 1).
-    Return coefficients used by the SA-Solver in data prediction mode.
-
-    Args:
-        s: Start time s.
-        t: End time t.
-        solver_order: Current order of the solver.
-        tau_t: Stochastic strength parameter in the SDE.
-
-    Returns:
-        Exponential coefficients used in data prediction, with exp((1 + tau^2) * t) factored out, ordered from p=0 to p=solver_order−1, shape (solver_order,).
-    """
-    tau_mul = 1 + tau_t ** 2
-    h = t - s
-    p = torch.arange(solver_order, dtype=s.dtype, device=s.device)
-
-    # product_terms after factoring out exp((1 + tau^2) * t)
-    # Includes (1 + tau^2) factor from outside the integral
-    product_terms_factored = (t ** p - s ** p * (-tau_mul * h).exp())
-
-    # Lower triangular recursive coefficient matrix
-    # Accumulates recursive coefficients based on p / (1 + tau^2)
-    recursive_depth_mat = p.unsqueeze(1) - p.unsqueeze(0)
-    log_factorial = (p + 1).lgamma()
-    recursive_coeff_mat = log_factorial.unsqueeze(1) - log_factorial.unsqueeze(0)
-    if tau_t > 0:
-        recursive_coeff_mat = recursive_coeff_mat - (recursive_depth_mat * math.log(tau_mul))
-    signs = torch.where(recursive_depth_mat % 2 == 0, 1.0, -1.0)
-    recursive_coeff_mat = (recursive_coeff_mat.exp() * signs).tril()
-
-    return recursive_coeff_mat @ product_terms_factored
-
-
-def compute_simple_stochastic_adams_b_coeffs(sigma_next: torch.Tensor, curr_lambdas: torch.Tensor, lambda_s: torch.Tensor, lambda_t: torch.Tensor, tau_t: float, is_corrector_step: bool = False) -> torch.Tensor:
-    """Compute simple order-2 b coefficients from SA-Solver paper (Appendix D. Implementation Details)."""
-    tau_mul = 1 + tau_t ** 2
-    h = lambda_t - lambda_s
-    alpha_t = sigma_next * lambda_t.exp()
-    if is_corrector_step:
-        # Simplified 1-step (order-2) corrector
-        b_1 = alpha_t * (0.5 * tau_mul * h)
-        b_2 = alpha_t * (-h * tau_mul).expm1().neg() - b_1
-    else:
-        # Simplified 2-step predictor
-        b_2 = alpha_t * (0.5 * tau_mul * h ** 2) / (curr_lambdas[-2] - lambda_s)
-        b_1 = alpha_t * (-h * tau_mul).expm1().neg() - b_2
-    return torch.stack([b_2, b_1])
-
-
-def compute_stochastic_adams_b_coeffs(sigma_next: torch.Tensor, curr_lambdas: torch.Tensor, lambda_s: torch.Tensor, lambda_t: torch.Tensor, tau_t: float, simple_order_2: bool = False, is_corrector_step: bool = False) -> torch.Tensor:
-    """Compute b_i coefficients for the SA-Solver (see eqs. 15 and 18).
-
-    The solver order corresponds to the number of input lambdas (half-logSNR points).
-
-    Args:
-        sigma_next: Sigma at end time t.
-        curr_lambdas: Lambda time points used to construct the Lagrange basis, shape (N,).
-        lambda_s: Lambda at start time s.
-        lambda_t: Lambda at end time t.
-        tau_t: Stochastic strength parameter in the SDE.
-        simple_order_2: Whether to enable the simple order-2 scheme.
-        is_corrector_step: Flag for corrector step in simple order-2 mode.
-
-    Returns:
-        b_i coefficients for the SA-Solver, shape (N,), where N is the solver order.
-    """
-    num_timesteps = curr_lambdas.shape[0]
-
-    if simple_order_2 and num_timesteps == 2:
-        return compute_simple_stochastic_adams_b_coeffs(sigma_next, curr_lambdas, lambda_s, lambda_t, tau_t, is_corrector_step)
-
-    # Compute coefficients by solving a linear system from Lagrange basis interpolation
-    exp_integral_coeffs = compute_exponential_coeffs(lambda_s, lambda_t, num_timesteps, tau_t)
-    vandermonde_matrix_T = torch.vander(curr_lambdas, num_timesteps, increasing=True).T
-    lagrange_integrals = torch.linalg.solve(vandermonde_matrix_T, exp_integral_coeffs)
-
-    # (sigma_t * exp(-tau^2 * lambda_t)) * exp((1 + tau^2) * lambda_t)
-    # = sigma_t * exp(lambda_t) = alpha_t
-    # exp((1 + tau^2) * lambda_t) is extracted from the integral
-    alpha_t = sigma_next * lambda_t.exp()
-    return alpha_t * lagrange_integrals
-
-
-def get_tau_interval_func(start_sigma: float, end_sigma: float, eta: float = 1.0) -> Callable[[Union[torch.Tensor, float]], float]:
-    """Return a function that controls the stochasticity of SA-Solver.
-
-    When eta = 0, SA-Solver runs as ODE. The official approach uses
-    time t to determine the SDE interval, while here we use sigma instead.
-
-    See:
-        https://github.com/scxue/SA-Solver/blob/main/README.md
-    """
-
-    def tau_func(sigma: Union[torch.Tensor, float]) -> float:
-        if eta <= 0:
-            return 0.0  # ODE
-
-        if isinstance(sigma, torch.Tensor):
-            sigma = sigma.item()
-        return eta if start_sigma >= sigma >= end_sigma else 0.0
-
-    return tau_func
--- a/comfy/k_diffusion/sampling.py
+++ b/comfy/k_diffusion/sampling.py
@@ -1,5 +1,4 @@
 import math
-from functools import partial

 from scipy import integrate
 import torch
@@ -9,7 +8,6 @@ from tqdm.auto import trange, tqdm

 from . import utils
 from . import deis
-from . import sa_solver
 import comfy.model_patcher
 import comfy.model_sampling

@@ -86,24 +84,24 @@ class BatchedBrownianTree:
    """A wrapper around torchsde.BrownianTree that enables batches of entropy."""

    def __init__(self, x, t0, t1, seed=None, **kwargs):
-        self.cpu_tree = kwargs.pop("cpu", True)
+        self.cpu_tree = True
+        if "cpu" in kwargs:
+            self.cpu_tree = kwargs.pop("cpu")
        t0, t1, self.sign = self.sort(t0, t1)
-        w0 = kwargs.pop('w0', None)
-        if w0 is None:
-            w0 = torch.zeros_like(x)
-        self.batched = False
+        w0 = kwargs.get('w0', torch.zeros_like(x))
        if seed is None:
-            seed = (torch.randint(0, 2 ** 63 - 1, ()).item(),)
-        elif isinstance(seed, (tuple, list)):
-            if len(seed) != x.shape[0]:
-                raise ValueError("Passing a list or tuple of seeds to BatchedBrownianTree requires a length matching the batch size.")
-            self.batched = True
+            seed = torch.randint(0, 2 ** 63 - 1, []).item()
+        self.batched = True
+        try:
+            assert len(seed) == x.shape[0]
            w0 = w0[0]
-        else:
-            seed = (seed,)
+        except TypeError:
+            seed = [seed]
+            self.batched = False
        if self.cpu_tree:
-            t0, w0, t1 = t0.detach().cpu(), w0.detach().cpu(), t1.detach().cpu()
-        self.trees = tuple(torchsde.BrownianTree(t0, w0, t1, entropy=s, **kwargs) for s in seed)
+            self.trees = [torchsde.BrownianTree(t0.cpu(), w0.cpu(), t1.cpu(), entropy=s, **kwargs) for s in seed]
+        else:
+            self.trees = [torchsde.BrownianTree(t0, w0, t1, entropy=s, **kwargs) for s in seed]

    @staticmethod
    def sort(a, b):
@@ -111,10 +109,11 @@ class BatchedBrownianTree:

    def __call__(self, t0, t1):
        t0, t1, sign = self.sort(t0, t1)
-        device, dtype = t0.device, t0.dtype
        if self.cpu_tree:
-            t0, t1 = t0.detach().cpu().float(), t1.detach().cpu().float()
-        w = torch.stack([tree(t0, t1) for tree in self.trees]).to(device=device, dtype=dtype) * (self.sign * sign)
+            w = torch.stack([tree(t0.cpu().float(), t1.cpu().float()).to(t0.dtype).to(t0.device) for tree in self.trees]) * (self.sign * sign)
+        else:
+            w = torch.stack([tree(t0, t1) for tree in self.trees]) * (self.sign * sign)
+
        return w if self.batched else w[0]


@@ -143,43 +142,6 @@ class BrownianTreeNoiseSampler:
        return self.tree(t0, t1) / (t1 - t0).abs().sqrt()


-def sigma_to_half_log_snr(sigma, model_sampling):
-    """Convert sigma to half-logSNR log(alpha_t / sigma_t)."""
-    if isinstance(model_sampling, comfy.model_sampling.CONST):
-        # log((1 - t) / t) = log((1 - sigma) / sigma)
-        return sigma.logit().neg()
-    return sigma.log().neg()
-
-
-def half_log_snr_to_sigma(half_log_snr, model_sampling):
-    """Convert half-logSNR log(alpha_t / sigma_t) to sigma."""
-    if isinstance(model_sampling, comfy.model_sampling.CONST):
-        # 1 / (1 + exp(half_log_snr))
-        return half_log_snr.neg().sigmoid()
-    return half_log_snr.neg().exp()
-
-
-def offset_first_sigma_for_snr(sigmas, model_sampling, percent_offset=1e-4):
-    """Adjust the first sigma to avoid invalid logSNR."""
-    if len(sigmas) <= 1:
-        return sigmas
-    if isinstance(model_sampling, comfy.model_sampling.CONST):
-        if sigmas[0] >= 1:
-            sigmas = sigmas.clone()
-            sigmas[0] = model_sampling.percent_to_sigma(percent_offset)
-    return sigmas
-
-
-def ei_h_phi_1(h: torch.Tensor) -> torch.Tensor:
-    """Compute the result of h*phi_1(h) in exponential integrator methods."""
-    return torch.expm1(h)
-
-
-def ei_h_phi_2(h: torch.Tensor) -> torch.Tensor:
-    """Compute the result of h*phi_2(h) in exponential integrator methods."""
-    return (torch.expm1(h) - h) / h
-
-
 @torch.no_grad()
 def sample_euler(model, x, sigmas, extra_args=None, callback=None, disable=None, s_churn=0., s_tmin=0., s_tmax=float('inf'), s_noise=1.):
    """Implements Algorithm 2 (Euler steps) from Karras et al. (2022)."""
@@ -422,13 +384,9 @@ def sample_lms(model, x, sigmas, extra_args=None, callback=None, disable=None, o
            ds.pop(0)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
-        if sigmas[i + 1] == 0:
-            # Denoising step
-            x = denoised
-        else:
-            cur_order = min(i + 1, order)
-            coeffs = [linear_multistep_coeff(cur_order, sigmas_cpu, i, j) for j in range(cur_order)]
-            x = x + sum(coeff * d for coeff, d in zip(coeffs, reversed(ds)))
+        cur_order = min(i + 1, order)
+        coeffs = [linear_multistep_coeff(cur_order, sigmas_cpu, i, j) for j in range(cur_order)]
+        x = x + sum(coeff * d for coeff, d in zip(coeffs, reversed(ds)))
    return x


@@ -724,7 +682,6 @@ def sample_dpmpp_2s_ancestral_RF(model, x, sigmas, extra_args=None, callback=Non
        # logged_x = torch.cat((logged_x, x.unsqueeze(0)), dim=0)
    return x

-
 @torch.no_grad()
 def sample_dpmpp_sde(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, r=1 / 2):
    """DPM-Solver++ (stochastic)."""
@@ -736,49 +693,38 @@ def sample_dpmpp_sde(model, x, sigmas, extra_args=None, callback=None, disable=N
    seed = extra_args.get("seed", None)
    noise_sampler = BrownianTreeNoiseSampler(x, sigma_min, sigma_max, seed=seed, cpu=True) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])
-
-    model_sampling = model.inner_model.model_patcher.get_model_object('model_sampling')
-    sigma_fn = partial(half_log_snr_to_sigma, model_sampling=model_sampling)
-    lambda_fn = partial(sigma_to_half_log_snr, model_sampling=model_sampling)
-    sigmas = offset_first_sigma_for_snr(sigmas, model_sampling)
+    sigma_fn = lambda t: t.neg().exp()
+    t_fn = lambda sigma: sigma.log().neg()

    for i in trange(len(sigmas) - 1, disable=disable):
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
        if sigmas[i + 1] == 0:
-            # Denoising step
-            x = denoised
+            # Euler method
+            d = to_d(x, sigmas[i], denoised)
+            dt = sigmas[i + 1] - sigmas[i]
+            x = x + d * dt
        else:
            # DPM-Solver++
-            lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
-            h = lambda_t - lambda_s
-            lambda_s_1 = lambda_s + r * h
+            t, t_next = t_fn(sigmas[i]), t_fn(sigmas[i + 1])
+            h = t_next - t
+            s = t + h * r
            fac = 1 / (2 * r)

-            sigma_s_1 = sigma_fn(lambda_s_1)
-
-            alpha_s = sigmas[i] * lambda_s.exp()
-            alpha_s_1 = sigma_s_1 * lambda_s_1.exp()
-            alpha_t = sigmas[i + 1] * lambda_t.exp()
-
            # Step 1
-            sd, su = get_ancestral_step(lambda_s.neg().exp(), lambda_s_1.neg().exp(), eta)
-            lambda_s_1_ = sd.log().neg()
-            h_ = lambda_s_1_ - lambda_s
-            x_2 = (alpha_s_1 / alpha_s) * (-h_).exp() * x - alpha_s_1 * (-h_).expm1() * denoised
-            if eta > 0 and s_noise > 0:
-                x_2 = x_2 + alpha_s_1 * noise_sampler(sigmas[i], sigma_s_1) * s_noise * su
-            denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)
+            sd, su = get_ancestral_step(sigma_fn(t), sigma_fn(s), eta)
+            s_ = t_fn(sd)
+            x_2 = (sigma_fn(s_) / sigma_fn(t)) * x - (t - s_).expm1() * denoised
+            x_2 = x_2 + noise_sampler(sigma_fn(t), sigma_fn(s)) * s_noise * su
+            denoised_2 = model(x_2, sigma_fn(s) * s_in, **extra_args)

            # Step 2
-            sd, su = get_ancestral_step(lambda_s.neg().exp(), lambda_t.neg().exp(), eta)
-            lambda_t_ = sd.log().neg()
-            h_ = lambda_t_ - lambda_s
+            sd, su = get_ancestral_step(sigma_fn(t), sigma_fn(t_next), eta)
+            t_next_ = t_fn(sd)
            denoised_d = (1 - fac) * denoised + fac * denoised_2
-            x = (alpha_t / alpha_s) * (-h_).exp() * x - alpha_t * (-h_).expm1() * denoised_d
-            if eta > 0 and s_noise > 0:
-                x = x + alpha_t * noise_sampler(sigmas[i], sigmas[i + 1]) * s_noise * su
+            x = (sigma_fn(t_next_) / sigma_fn(t)) * x - (t - t_next_).expm1() * denoised_d
+            x = x + noise_sampler(sigma_fn(t), sigma_fn(t_next)) * s_noise * su
    return x


@@ -807,7 +753,6 @@ def sample_dpmpp_2m(model, x, sigmas, extra_args=None, callback=None, disable=No
        old_denoised = denoised
    return x

-
 @torch.no_grad()
 def sample_dpmpp_2m_sde(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, solver_type='midpoint'):
    """DPM-Solver++(2M) SDE."""
@@ -823,12 +768,9 @@ def sample_dpmpp_2m_sde(model, x, sigmas, extra_args=None, callback=None, disabl
    noise_sampler = BrownianTreeNoiseSampler(x, sigma_min, sigma_max, seed=seed, cpu=True) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])

-    model_sampling = model.inner_model.model_patcher.get_model_object('model_sampling')
-    lambda_fn = partial(sigma_to_half_log_snr, model_sampling=model_sampling)
-    sigmas = offset_first_sigma_for_snr(sigmas, model_sampling)
-
    old_denoised = None
-    h, h_last = None, None
+    h_last = None
+    h = None

    for i in trange(len(sigmas) - 1, disable=disable):
        denoised = model(x, sigmas[i] * s_in, **extra_args)
@@ -839,34 +781,26 @@ def sample_dpmpp_2m_sde(model, x, sigmas, extra_args=None, callback=None, disabl
            x = denoised
        else:
            # DPM-Solver++(2M) SDE
-            lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
-            h = lambda_t - lambda_s
-            h_eta = h * (eta + 1)
+            t, s = -sigmas[i].log(), -sigmas[i + 1].log()
+            h = s - t
+            eta_h = eta * h

-            alpha_t = sigmas[i + 1] * lambda_t.exp()
-
-            x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x + alpha_t * (-h_eta).expm1().neg() * denoised
+            x = sigmas[i + 1] / sigmas[i] * (-eta_h).exp() * x + (-h - eta_h).expm1().neg() * denoised

            if old_denoised is not None:
                r = h_last / h
                if solver_type == 'heun':
-                    x = x + alpha_t * ((-h_eta).expm1().neg() / (-h_eta) + 1) * (1 / r) * (denoised - old_denoised)
+                    x = x + ((-h - eta_h).expm1().neg() / (-h - eta_h) + 1) * (1 / r) * (denoised - old_denoised)
                elif solver_type == 'midpoint':
-                    x = x + 0.5 * alpha_t * (-h_eta).expm1().neg() * (1 / r) * (denoised - old_denoised)
+                    x = x + 0.5 * (-h - eta_h).expm1().neg() * (1 / r) * (denoised - old_denoised)

-            if eta > 0 and s_noise > 0:
-                x = x + noise_sampler(sigmas[i], sigmas[i + 1]) * sigmas[i + 1] * (-2 * h * eta).expm1().neg().sqrt() * s_noise
+            if eta:
+                x = x + noise_sampler(sigmas[i], sigmas[i + 1]) * sigmas[i + 1] * (-2 * eta_h).expm1().neg().sqrt() * s_noise

        old_denoised = denoised
        h_last = h
    return x

-
-@torch.no_grad()
-def sample_dpmpp_2m_sde_heun(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, solver_type='heun'):
-    return sample_dpmpp_2m_sde(model, x, sigmas, extra_args=extra_args, callback=callback, disable=disable, eta=eta, s_noise=s_noise, noise_sampler=noise_sampler, solver_type=solver_type)
-
-
 @torch.no_grad()
 def sample_dpmpp_3m_sde(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None):
    """DPM-Solver++(3M) SDE."""
@@ -880,10 +814,6 @@ def sample_dpmpp_3m_sde(model, x, sigmas, extra_args=None, callback=None, disabl
    noise_sampler = BrownianTreeNoiseSampler(x, sigma_min, sigma_max, seed=seed, cpu=True) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])

-    model_sampling = model.inner_model.model_patcher.get_model_object('model_sampling')
-    lambda_fn = partial(sigma_to_half_log_snr, model_sampling=model_sampling)
-    sigmas = offset_first_sigma_for_snr(sigmas, model_sampling)
-
    denoised_1, denoised_2 = None, None
    h, h_1, h_2 = None, None, None

@@ -895,16 +825,13 @@ def sample_dpmpp_3m_sde(model, x, sigmas, extra_args=None, callback=None, disabl
            # Denoising step
            x = denoised
        else:
-            lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
-            h = lambda_t - lambda_s
+            t, s = -sigmas[i].log(), -sigmas[i + 1].log()
+            h = s - t
            h_eta = h * (eta + 1)

-            alpha_t = sigmas[i + 1] * lambda_t.exp()
-
-            x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x + alpha_t * (-h_eta).expm1().neg() * denoised
+            x = torch.exp(-h_eta) * x + (-h_eta).expm1().neg() * denoised

            if h_2 is not None:
-                # DPM-Solver++(3M) SDE
                r0 = h_1 / h
                r1 = h_2 / h
                d1_0 = (denoised - denoised_1) / r0
@@ -913,22 +840,20 @@ def sample_dpmpp_3m_sde(model, x, sigmas, extra_args=None, callback=None, disabl
                d2 = (d1_0 - d1_1) / (r0 + r1)
                phi_2 = h_eta.neg().expm1() / h_eta + 1
                phi_3 = phi_2 / h_eta - 0.5
-                x = x + (alpha_t * phi_2) * d1 - (alpha_t * phi_3) * d2
+                x = x + phi_2 * d1 - phi_3 * d2
            elif h_1 is not None:
-                # DPM-Solver++(2M) SDE
                r = h_1 / h
                d = (denoised - denoised_1) / r
                phi_2 = h_eta.neg().expm1() / h_eta + 1
-                x = x + (alpha_t * phi_2) * d
+                x = x + phi_2 * d

-            if eta > 0 and s_noise > 0:
+            if eta:
                x = x + noise_sampler(sigmas[i], sigmas[i + 1]) * sigmas[i + 1] * (-2 * h * eta).expm1().neg().sqrt() * s_noise

        denoised_1, denoised_2 = denoised, denoised_1
        h_1, h_2 = h, h_1
    return x

-
@torch.no_grad()
 def sample_dpmpp_3m_sde_gpu(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None):
    if len(sigmas) <= 1:
@@ -938,17 +863,6 @@ def sample_dpmpp_3m_sde_gpu(model, x, sigmas, extra_args=None, callback=None, di
    noise_sampler = BrownianTreeNoiseSampler(x, sigma_min, sigma_max, seed=extra_args.get("seed", None), cpu=False) if noise_sampler is None else noise_sampler
    return sample_dpmpp_3m_sde(model, x, sigmas, extra_args=extra_args, callback=callback, disable=disable, eta=eta, s_noise=s_noise, noise_sampler=noise_sampler)

-
-@torch.no_grad()
-def sample_dpmpp_2m_sde_heun_gpu(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, solver_type='heun'):
-    if len(sigmas) <= 1:
-        return x
-    extra_args = {} if extra_args is None else extra_args
-    sigma_min, sigma_max = sigmas[sigmas > 0].min(), sigmas.max()
-    noise_sampler = BrownianTreeNoiseSampler(x, sigma_min, sigma_max, seed=extra_args.get("seed", None), cpu=False) if noise_sampler is None else noise_sampler
-    return sample_dpmpp_2m_sde_heun(model, x, sigmas, extra_args=extra_args, callback=callback, disable=disable, eta=eta, s_noise=s_noise, noise_sampler=noise_sampler, solver_type=solver_type)
-
-
@torch.no_grad()
 def sample_dpmpp_2m_sde_gpu(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, solver_type='midpoint'):
    if len(sigmas) <= 1:
@@ -958,7 +872,6 @@ def sample_dpmpp_2m_sde_gpu(model, x, sigmas, extra_args=None, callback=None, di
    noise_sampler = BrownianTreeNoiseSampler(x, sigma_min, sigma_max, seed=extra_args.get("seed", None), cpu=False) if noise_sampler is None else noise_sampler
    return sample_dpmpp_2m_sde(model, x, sigmas, extra_args=extra_args, callback=callback, disable=disable, eta=eta, s_noise=s_noise, noise_sampler=noise_sampler, solver_type=solver_type)

-
@torch.no_grad()
 def sample_dpmpp_sde_gpu(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, r=1 / 2):
    if len(sigmas) <= 1:
@@ -1096,9 +1009,7 @@ def sample_ipndm(model, x, sigmas, extra_args=None, callback=None, disable=None,
        d_cur = (x_cur - denoised) / t_cur

        order = min(max_order, i+1)
-        if t_next == 0:     # Denoising step
-            x_next = denoised
-        elif order == 1:    # First Euler step.
+        if order == 1:      # First Euler step.
            x_next = x_cur + (t_next - t_cur) * d_cur
        elif order == 2:    # Use one history point.
            x_next = x_cur + (t_next - t_cur) * (3 * d_cur - buffer_model[-1]) / 2
@@ -1116,7 +1027,6 @@ def sample_ipndm(model, x, sigmas, extra_args=None, callback=None, disable=None,

    return x_next

-
 #From https://github.com/zju-pi/diff-sampler/blob/main/diff-solvers-main/solvers.py
 #under Apache 2 license
 def sample_ipndm_v(model, x, sigmas, extra_args=None, callback=None, disable=None, max_order=4):
@@ -1140,9 +1050,7 @@ def sample_ipndm_v(model, x, sigmas, extra_args=None, callback=None, disable=Non
        d_cur = (x_cur - denoised) / t_cur

        order = min(max_order, i+1)
-        if t_next == 0:     # Denoising step
-            x_next = denoised
-        elif order == 1:    # First Euler step.
+        if order == 1:      # First Euler step.
            x_next = x_cur + (t_next - t_cur) * d_cur
        elif order == 2:    # Use one history point.
            h_n = (t_next - t_cur)
@@ -1182,7 +1090,6 @@ def sample_ipndm_v(model, x, sigmas, extra_args=None, callback=None, disable=Non

    return x_next

-
 #From https://github.com/zju-pi/diff-sampler/blob/main/diff-solvers-main/solvers.py
 #under Apache 2 license
@torch.no_grad()
@@ -1233,22 +1140,39 @@ def sample_deis(model, x, sigmas, extra_args=None, callback=None, disable=None,

    return x_next

+@torch.no_grad()
+def sample_euler_cfg_pp(model, x, sigmas, extra_args=None, callback=None, disable=None):
+    extra_args = {} if extra_args is None else extra_args
+
+    temp = [0]
+    def post_cfg_function(args):
+        temp[0] = args["uncond_denoised"]
+        return args["denoised"]
+
+    model_options = extra_args.get("model_options", {}).copy()
+    extra_args["model_options"] = comfy.model_patcher.set_model_options_post_cfg_function(model_options, post_cfg_function, disable_cfg1_optimization=True)
+
+    s_in = x.new_ones([x.shape[0]])
+    for i in trange(len(sigmas) - 1, disable=disable):
+        sigma_hat = sigmas[i]
+        denoised = model(x, sigma_hat * s_in, **extra_args)
+        d = to_d(x, sigma_hat, temp[0])
+        if callback is not None:
+            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigma_hat, 'denoised': denoised})
+        # Euler method
+        x = denoised + d * sigmas[i + 1]
+    return x

@torch.no_grad()
 def sample_euler_ancestral_cfg_pp(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None):
-    """Ancestral sampling with Euler method steps (CFG++)."""
+    """Ancestral sampling with Euler method steps."""
    extra_args = {} if extra_args is None else extra_args
    seed = extra_args.get("seed", None)
    noise_sampler = default_noise_sampler(x, seed=seed) if noise_sampler is None else noise_sampler

-    model_sampling = model.inner_model.model_patcher.get_model_object("model_sampling")
-    lambda_fn = partial(sigma_to_half_log_snr, model_sampling=model_sampling)
-
-    uncond_denoised = None
-
+    temp = [0]
    def post_cfg_function(args):
-        nonlocal uncond_denoised
-        uncond_denoised = args["uncond_denoised"]
+        temp[0] = args["uncond_denoised"]
        return args["denoised"]

    model_options = extra_args.get("model_options", {}).copy()
@@ -1257,33 +1181,15 @@ def sample_euler_ancestral_cfg_pp(model, x, sigmas, extra_args=None, callback=No
    s_in = x.new_ones([x.shape[0]])
    for i in trange(len(sigmas) - 1, disable=disable):
        denoised = model(x, sigmas[i] * s_in, **extra_args)
+        sigma_down, sigma_up = get_ancestral_step(sigmas[i], sigmas[i + 1], eta=eta)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
-        if sigmas[i + 1] == 0:
-            # Denoising step
-            x = denoised
-        else:
-            alpha_s = sigmas[i] * lambda_fn(sigmas[i]).exp()
-            alpha_t = sigmas[i + 1] * lambda_fn(sigmas[i + 1]).exp()
-            d = to_d(x, sigmas[i], alpha_s * uncond_denoised)   # to noise
-
-            # DDIM stochastic sampling
-            sigma_down, sigma_up = get_ancestral_step(sigmas[i] / alpha_s, sigmas[i + 1] / alpha_t, eta=eta)
-            sigma_down = alpha_t * sigma_down
-
-            # Euler method
-            x = alpha_t * denoised + sigma_down * d
-            if eta > 0 and s_noise > 0:
-                x = x + alpha_t * noise_sampler(sigmas[i], sigmas[i + 1]) * s_noise * sigma_up
+        d = to_d(x, sigmas[i], temp[0])
+        # Euler method
+        x = denoised + d * sigma_down
+        if sigmas[i + 1] > 0:
+            x = x + noise_sampler(sigmas[i], sigmas[i + 1]) * s_noise * sigma_up
    return x
-
-
-@torch.no_grad()
-def sample_euler_cfg_pp(model, x, sigmas, extra_args=None, callback=None, disable=None):
-    """Euler method steps (CFG++)."""
-    return sample_euler_ancestral_cfg_pp(model, x, sigmas, extra_args=extra_args, callback=callback, disable=disable, eta=0.0, s_noise=0.0, noise_sampler=None)
-
-
@torch.no_grad()
 def sample_dpmpp_2s_ancestral_cfg_pp(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None):
    """Ancestral sampling with DPM-Solver++(2S) second-order steps."""
@@ -1440,7 +1346,6 @@ def sample_res_multistep_ancestral(model, x, sigmas, extra_args=None, callback=N
 def sample_res_multistep_ancestral_cfg_pp(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None):
    return res_multistep(model, x, sigmas, extra_args=extra_args, callback=callback, disable=disable, s_noise=s_noise, noise_sampler=noise_sampler, eta=eta, cfg_pp=True)

-
@torch.no_grad()
 def sample_gradient_estimation(model, x, sigmas, extra_args=None, callback=None, disable=None, ge_gamma=2., cfg_pp=False):
    """Gradient-estimation sampler. Paper: https://openreview.net/pdf?id=o2ND9v0CeK"""
@@ -1467,32 +1372,31 @@ def sample_gradient_estimation(model, x, sigmas, extra_args=None, callback=None,
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
        dt = sigmas[i + 1] - sigmas[i]
-        if sigmas[i + 1] == 0:
-            # Denoising step
-            x = denoised
-        else:
+        if i == 0:
            # Euler method
            if cfg_pp:
                x = denoised + d * sigmas[i + 1]
            else:
                x = x + d * dt
-
-            if i >= 1:
-                # Gradient estimation
+        else:
+            # Gradient estimation
+            if cfg_pp:
                d_bar = (ge_gamma - 1) * (d - old_d)
+                x = denoised + d * sigmas[i + 1] + d_bar * dt
+            else:
+                d_bar = ge_gamma * d + (1 - ge_gamma) * old_d
                x = x + d_bar * dt
        old_d = d
    return x

-
@torch.no_grad()
 def sample_gradient_estimation_cfg_pp(model, x, sigmas, extra_args=None, callback=None, disable=None, ge_gamma=2.):
    return sample_gradient_estimation(model, x, sigmas, extra_args=extra_args, callback=callback, disable=disable, ge_gamma=ge_gamma, cfg_pp=True)

-
@torch.no_grad()
-def sample_er_sde(model, x, sigmas, extra_args=None, callback=None, disable=None, s_noise=1.0, noise_sampler=None, noise_scaler=None, max_stage=3):
-    """Extended Reverse-Time SDE solver (VP ER-SDE-Solver-3). arXiv: https://arxiv.org/abs/2309.06169.
+def sample_er_sde(model, x, sigmas, extra_args=None, callback=None, disable=None, s_noise=1., noise_sampler=None, noise_scaler=None, max_stage=3):
+    """
+    Extended Reverse-Time SDE solver (VE ER-SDE-Solver-3). Arxiv: https://arxiv.org/abs/2309.06169.
    Code reference: https://github.com/QinpengCui/ER-SDE-Solver/blob/main/er_sde_solver.py.
    """
    extra_args = {} if extra_args is None else extra_args
@@ -1500,18 +1404,12 @@ def sample_er_sde(model, x, sigmas, extra_args=None, callback=None, disable=None
    noise_sampler = default_noise_sampler(x, seed=seed) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])

-    def default_er_sde_noise_scaler(x):
-        return x * ((x ** 0.3).exp() + 10.0)
-
-    noise_scaler = default_er_sde_noise_scaler if noise_scaler is None else noise_scaler
+    def default_noise_scaler(sigma):
+        return sigma * ((sigma ** 0.3).exp() + 10.0)
+    noise_scaler = default_noise_scaler if noise_scaler is None else noise_scaler
    num_integration_points = 200.0
    point_indice = torch.arange(0, num_integration_points, dtype=torch.float32, device=x.device)

-    model_sampling = model.inner_model.model_patcher.get_model_object("model_sampling")
-    sigmas = offset_first_sigma_for_snr(sigmas, model_sampling)
-    half_log_snrs = sigma_to_half_log_snr(sigmas, model_sampling)
-    er_lambdas = half_log_snrs.neg().exp()  # er_lambda_t = sigma_t / alpha_t
-
    old_denoised = None
    old_denoised_d = None

@@ -1522,265 +1420,129 @@ def sample_er_sde(model, x, sigmas, extra_args=None, callback=None, disable=None
        stage_used = min(max_stage, i + 1)
        if sigmas[i + 1] == 0:
            x = denoised
+        elif stage_used == 1:
+            r = noise_scaler(sigmas[i + 1]) / noise_scaler(sigmas[i])
+            x = r * x + (1 - r) * denoised
        else:
-            er_lambda_s, er_lambda_t = er_lambdas[i], er_lambdas[i + 1]
-            alpha_s = sigmas[i] / er_lambda_s
-            alpha_t = sigmas[i + 1] / er_lambda_t
-            r_alpha = alpha_t / alpha_s
-            r = noise_scaler(er_lambda_t) / noise_scaler(er_lambda_s)
+            r = noise_scaler(sigmas[i + 1]) / noise_scaler(sigmas[i])
+            x = r * x + (1 - r) * denoised

-            # Stage 1 Euler
-            x = r_alpha * r * x + alpha_t * (1 - r) * denoised
+            dt = sigmas[i + 1] - sigmas[i]
+            sigma_step_size = -dt / num_integration_points
+            sigma_pos = sigmas[i + 1] + point_indice * sigma_step_size
+            scaled_pos = noise_scaler(sigma_pos)

-            if stage_used >= 2:
-                dt = er_lambda_t - er_lambda_s
-                lambda_step_size = -dt / num_integration_points
-                lambda_pos = er_lambda_t + point_indice * lambda_step_size
-                scaled_pos = noise_scaler(lambda_pos)
+            # Stage 2
+            s = torch.sum(1 / scaled_pos) * sigma_step_size
+            denoised_d = (denoised - old_denoised) / (sigmas[i] - sigmas[i - 1])
+            x = x + (dt + s * noise_scaler(sigmas[i + 1])) * denoised_d

-                # Stage 2
-                s = torch.sum(1 / scaled_pos) * lambda_step_size
-                denoised_d = (denoised - old_denoised) / (er_lambda_s - er_lambdas[i - 1])
-                x = x + alpha_t * (dt + s * noise_scaler(er_lambda_t)) * denoised_d
+            if stage_used >= 3:
+                # Stage 3
+                s_u = torch.sum((sigma_pos - sigmas[i]) / scaled_pos) * sigma_step_size
+                denoised_u = (denoised_d - old_denoised_d) / ((sigmas[i] - sigmas[i - 2]) / 2)
+                x = x + ((dt ** 2) / 2 + s_u * noise_scaler(sigmas[i + 1])) * denoised_u
+            old_denoised_d = denoised_d

-                if stage_used >= 3:
-                    # Stage 3
-                    s_u = torch.sum((lambda_pos - er_lambda_s) / scaled_pos) * lambda_step_size
-                    denoised_u = (denoised_d - old_denoised_d) / ((er_lambda_s - er_lambdas[i - 2]) / 2)
-                    x = x + alpha_t * ((dt ** 2) / 2 + s_u * noise_scaler(er_lambda_t)) * denoised_u
-                old_denoised_d = denoised_d
-
-            if s_noise > 0:
-                x = x + alpha_t * noise_sampler(sigmas[i], sigmas[i + 1]) * s_noise * (er_lambda_t ** 2 - er_lambda_s ** 2 * r ** 2).sqrt().nan_to_num(nan=0.0)
+        if s_noise != 0 and sigmas[i + 1] > 0:
+            x = x + noise_sampler(sigmas[i], sigmas[i + 1]) * s_noise * (sigmas[i + 1] ** 2 - sigmas[i] ** 2 * r ** 2).sqrt().nan_to_num(nan=0.0)
        old_denoised = denoised
    return x

-
@torch.no_grad()
 def sample_seeds_2(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, r=0.5):
-    """SEEDS-2 - Stochastic Explicit Exponential Derivative-free Solvers (VP Data Prediction) stage 2.
-    arXiv: https://arxiv.org/abs/2305.14267 (NeurIPS 2023)
-    """
+    '''
+    SEEDS-2 - Stochastic Explicit Exponential Derivative-free Solvers (VE Data Prediction) stage 2
+    Arxiv: https://arxiv.org/abs/2305.14267
+    '''
    extra_args = {} if extra_args is None else extra_args
    seed = extra_args.get("seed", None)
    noise_sampler = default_noise_sampler(x, seed=seed) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])
+
    inject_noise = eta > 0 and s_noise > 0

-    model_sampling = model.inner_model.model_patcher.get_model_object('model_sampling')
-    sigma_fn = partial(half_log_snr_to_sigma, model_sampling=model_sampling)
-    lambda_fn = partial(sigma_to_half_log_snr, model_sampling=model_sampling)
-    sigmas = offset_first_sigma_for_snr(sigmas, model_sampling)
-
-    fac = 1 / (2 * r)
-
    for i in trange(len(sigmas) - 1, disable=disable):
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
-
        if sigmas[i + 1] == 0:
            x = denoised
-            continue
+        else:
+            t, t_next = -sigmas[i].log(), -sigmas[i + 1].log()
+            h = t_next - t
+            h_eta = h * (eta + 1)
+            s = t + r * h
+            fac = 1 / (2 * r)
+            sigma_s = s.neg().exp()

-        lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
-        h = lambda_t - lambda_s
-        h_eta = h * (eta + 1)
-        lambda_s_1 = torch.lerp(lambda_s, lambda_t, r)
-        sigma_s_1 = sigma_fn(lambda_s_1)
+            coeff_1, coeff_2 = (-r * h_eta).expm1(), (-h_eta).expm1()
+            if inject_noise:
+                noise_coeff_1 = (-2 * r * h * eta).expm1().neg().sqrt()
+                noise_coeff_2 = ((-2 * r * h * eta).expm1() - (-2 * h * eta).expm1()).sqrt()
+                noise_1, noise_2 = noise_sampler(sigmas[i], sigma_s), noise_sampler(sigma_s, sigmas[i + 1])

-        alpha_s_1 = sigma_s_1 * lambda_s_1.exp()
-        alpha_t = sigmas[i + 1] * lambda_t.exp()
+            # Step 1
+            x_2 = (coeff_1 + 1) * x - coeff_1 * denoised
+            if inject_noise:
+                x_2 = x_2 + sigma_s * (noise_coeff_1 * noise_1) * s_noise
+            denoised_2 = model(x_2, sigma_s * s_in, **extra_args)

-        # Step 1
-        x_2 = sigma_s_1 / sigmas[i] * (-r * h * eta).exp() * x - alpha_s_1 * ei_h_phi_1(-r * h_eta) * denoised
-        if inject_noise:
-            sde_noise = (-2 * r * h * eta).expm1().neg().sqrt() * noise_sampler(sigmas[i], sigma_s_1)
-            x_2 = x_2 + sde_noise * sigma_s_1 * s_noise
-        denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)
-
-        # Step 2
-        denoised_d = torch.lerp(denoised, denoised_2, fac)
-        x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x - alpha_t * ei_h_phi_1(-h_eta) * denoised_d
-        if inject_noise:
-            segment_factor = (r - 1) * h * eta
-            sde_noise = sde_noise * segment_factor.exp()
-            sde_noise = sde_noise + segment_factor.mul(2).expm1().neg().sqrt() * noise_sampler(sigma_s_1, sigmas[i + 1])
-            x = x + sde_noise * sigmas[i + 1] * s_noise
+            # Step 2
+            denoised_d = (1 - fac) * denoised + fac * denoised_2
+            x = (coeff_2 + 1) * x - coeff_2 * denoised_d
+            if inject_noise:
+                x = x + sigmas[i + 1] * (noise_coeff_2 * noise_1 + noise_coeff_1 * noise_2) * s_noise
    return x

-
@torch.no_grad()
 def sample_seeds_3(model, x, sigmas, extra_args=None, callback=None, disable=None, eta=1., s_noise=1., noise_sampler=None, r_1=1./3, r_2=2./3):
-    """SEEDS-3 - Stochastic Explicit Exponential Derivative-free Solvers (VP Data Prediction) stage 3.
-    arXiv: https://arxiv.org/abs/2305.14267 (NeurIPS 2023)
-    """
+    '''
+    SEEDS-3 - Stochastic Explicit Exponential Derivative-free Solvers (VE Data Prediction) stage 3
+    Arxiv: https://arxiv.org/abs/2305.14267
+    '''
    extra_args = {} if extra_args is None else extra_args
    seed = extra_args.get("seed", None)
    noise_sampler = default_noise_sampler(x, seed=seed) if noise_sampler is None else noise_sampler
    s_in = x.new_ones([x.shape[0]])
-    inject_noise = eta > 0 and s_noise > 0

-    model_sampling = model.inner_model.model_patcher.get_model_object('model_sampling')
-    sigma_fn = partial(half_log_snr_to_sigma, model_sampling=model_sampling)
-    lambda_fn = partial(sigma_to_half_log_snr, model_sampling=model_sampling)
-    sigmas = offset_first_sigma_for_snr(sigmas, model_sampling)
+    inject_noise = eta > 0 and s_noise > 0

    for i in trange(len(sigmas) - 1, disable=disable):
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
-
        if sigmas[i + 1] == 0:
            x = denoised
-            continue
+        else:
+            t, t_next = -sigmas[i].log(), -sigmas[i + 1].log()
+            h = t_next - t
+            h_eta = h * (eta + 1)
+            s_1 = t + r_1 * h
+            s_2 = t + r_2 * h
+            sigma_s_1, sigma_s_2 = s_1.neg().exp(), s_2.neg().exp()

-        lambda_s, lambda_t = lambda_fn(sigmas[i]), lambda_fn(sigmas[i + 1])
-        h = lambda_t - lambda_s
-        h_eta = h * (eta + 1)
-        lambda_s_1 = torch.lerp(lambda_s, lambda_t, r_1)
-        lambda_s_2 = torch.lerp(lambda_s, lambda_t, r_2)
-        sigma_s_1, sigma_s_2 = sigma_fn(lambda_s_1), sigma_fn(lambda_s_2)
+            coeff_1, coeff_2, coeff_3 = (-r_1 * h_eta).expm1(), (-r_2 * h_eta).expm1(), (-h_eta).expm1()
+            if inject_noise:
+                noise_coeff_1 = (-2 * r_1 * h * eta).expm1().neg().sqrt()
+                noise_coeff_2 = ((-2 * r_1 * h * eta).expm1() - (-2 * r_2 * h * eta).expm1()).sqrt()
+                noise_coeff_3 = ((-2 * r_2 * h * eta).expm1() - (-2 * h * eta).expm1()).sqrt()
+                noise_1, noise_2, noise_3 = noise_sampler(sigmas[i], sigma_s_1), noise_sampler(sigma_s_1, sigma_s_2), noise_sampler(sigma_s_2, sigmas[i + 1])

-        alpha_s_1 = sigma_s_1 * lambda_s_1.exp()
-        alpha_s_2 = sigma_s_2 * lambda_s_2.exp()
-        alpha_t = sigmas[i + 1] * lambda_t.exp()
+            # Step 1
+            x_2 = (coeff_1 + 1) * x - coeff_1 * denoised
+            if inject_noise:
+                x_2 = x_2 + sigma_s_1 * (noise_coeff_1 * noise_1) * s_noise
+            denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)

-        # Step 1
-        x_2 = sigma_s_1 / sigmas[i] * (-r_1 * h * eta).exp() * x - alpha_s_1 * ei_h_phi_1(-r_1 * h_eta) * denoised
-        if inject_noise:
-            sde_noise = (-2 * r_1 * h * eta).expm1().neg().sqrt() * noise_sampler(sigmas[i], sigma_s_1)
-            x_2 = x_2 + sde_noise * sigma_s_1 * s_noise
-        denoised_2 = model(x_2, sigma_s_1 * s_in, **extra_args)
+            # Step 2
+            x_3 = (coeff_2 + 1) * x - coeff_2 * denoised + (r_2 / r_1) * (coeff_2 / (r_2 * h_eta) + 1) * (denoised_2 - denoised)
+            if inject_noise:
+                x_3 = x_3 + sigma_s_2 * (noise_coeff_2 * noise_1 + noise_coeff_1 * noise_2) * s_noise
+            denoised_3 = model(x_3, sigma_s_2 * s_in, **extra_args)

-        # Step 2
-        a3_2 = r_2 / r_1 * ei_h_phi_2(-r_2 * h_eta)
-        a3_1 = ei_h_phi_1(-r_2 * h_eta) - a3_2
-        x_3 = sigma_s_2 / sigmas[i] * (-r_2 * h * eta).exp() * x - alpha_s_2 * (a3_1 * denoised + a3_2 * denoised_2)
-        if inject_noise:
-            segment_factor = (r_1 - r_2) * h * eta
-            sde_noise = sde_noise * segment_factor.exp()
-            sde_noise = sde_noise + segment_factor.mul(2).expm1().neg().sqrt() * noise_sampler(sigma_s_1, sigma_s_2)
-            x_3 = x_3 + sde_noise * sigma_s_2 * s_noise
-        denoised_3 = model(x_3, sigma_s_2 * s_in, **extra_args)
-
-        # Step 3
-        b3 = ei_h_phi_2(-h_eta) / r_2
-        b1 = ei_h_phi_1(-h_eta) - b3
-        x = sigmas[i + 1] / sigmas[i] * (-h * eta).exp() * x - alpha_t * (b1 * denoised + b3 * denoised_3)
-        if inject_noise:
-            segment_factor = (r_2 - 1) * h * eta
-            sde_noise = sde_noise * segment_factor.exp()
-            sde_noise = sde_noise + segment_factor.mul(2).expm1().neg().sqrt() * noise_sampler(sigma_s_2, sigmas[i + 1])
-            x = x + sde_noise * sigmas[i + 1] * s_noise
+            # Step 3
+            x = (coeff_3 + 1) * x - coeff_3 * denoised + (1. / r_2) * (coeff_3 / h_eta + 1) * (denoised_3 - denoised)
+            if inject_noise:
+                x = x + sigmas[i + 1] * (noise_coeff_3 * noise_1 + noise_coeff_2 * noise_2 + noise_coeff_1 * noise_3) * s_noise
    return x
-
-
-@torch.no_grad()
-def sample_sa_solver(model, x, sigmas, extra_args=None, callback=None, disable=False, tau_func=None, s_noise=1.0, noise_sampler=None, predictor_order=3, corrector_order=4, use_pece=False, simple_order_2=False):
-    """Stochastic Adams Solver with predictor-corrector method (NeurIPS 2023)."""
-    if len(sigmas) <= 1:
-        return x
-    extra_args = {} if extra_args is None else extra_args
-    seed = extra_args.get("seed", None)
-    noise_sampler = default_noise_sampler(x, seed=seed) if noise_sampler is None else noise_sampler
-    s_in = x.new_ones([x.shape[0]])
-
-    model_sampling = model.inner_model.model_patcher.get_model_object("model_sampling")
-    sigmas = offset_first_sigma_for_snr(sigmas, model_sampling)
-    lambdas = sigma_to_half_log_snr(sigmas, model_sampling=model_sampling)
-
-    if tau_func is None:
-        # Use default interval for stochastic sampling
-        start_sigma = model_sampling.percent_to_sigma(0.2)
-        end_sigma = model_sampling.percent_to_sigma(0.8)
-        tau_func = sa_solver.get_tau_interval_func(start_sigma, end_sigma, eta=1.0)
-
-    max_used_order = max(predictor_order, corrector_order)
-    x_pred = x  # x: current state, x_pred: predicted next state
-
-    h = 0.0
-    tau_t = 0.0
-    noise = 0.0
-    pred_list = []
-
-    # Lower order near the end to improve stability
-    lower_order_to_end = sigmas[-1].item() == 0
-
-    for i in trange(len(sigmas) - 1, disable=disable):
-        # Evaluation
-        denoised = model(x_pred, sigmas[i] * s_in, **extra_args)
-        if callback is not None:
-            callback({"x": x_pred, "i": i, "sigma": sigmas[i], "sigma_hat": sigmas[i], "denoised": denoised})
-        pred_list.append(denoised)
-        pred_list = pred_list[-max_used_order:]
-
-        predictor_order_used = min(predictor_order, len(pred_list))
-        if i == 0 or (sigmas[i + 1] == 0 and not use_pece):
-            corrector_order_used = 0
-        else:
-            corrector_order_used = min(corrector_order, len(pred_list))
-
-        if lower_order_to_end:
-            predictor_order_used = min(predictor_order_used, len(sigmas) - 2 - i)
-            corrector_order_used = min(corrector_order_used, len(sigmas) - 1 - i)
-
-        # Corrector
-        if corrector_order_used == 0:
-            # Update by the predicted state
-            x = x_pred
-        else:
-            curr_lambdas = lambdas[i - corrector_order_used + 1:i + 1]
-            b_coeffs = sa_solver.compute_stochastic_adams_b_coeffs(
-                sigmas[i],
-                curr_lambdas,
-                lambdas[i - 1],
-                lambdas[i],
-                tau_t,
-                simple_order_2,
-                is_corrector_step=True,
-            )
-            pred_mat = torch.stack(pred_list[-corrector_order_used:], dim=1)    # (B, K, ...)
-            corr_res = torch.tensordot(pred_mat, b_coeffs, dims=([1], [0]))  # (B, ...)
-            x = sigmas[i] / sigmas[i - 1] * (-(tau_t ** 2) * h).exp() * x + corr_res
-
-            if tau_t > 0 and s_noise > 0:
-                # The noise from the previous predictor step
-                x = x + noise
-
-            if use_pece:
-                # Evaluate the corrected state
-                denoised = model(x, sigmas[i] * s_in, **extra_args)
-                pred_list[-1] = denoised
-
-        # Predictor
-        if sigmas[i + 1] == 0:
-            # Denoising step
-            x = denoised
-        else:
-            tau_t = tau_func(sigmas[i + 1])
-            curr_lambdas = lambdas[i - predictor_order_used + 1:i + 1]
-            b_coeffs = sa_solver.compute_stochastic_adams_b_coeffs(
-                sigmas[i + 1],
-                curr_lambdas,
-                lambdas[i],
-                lambdas[i + 1],
-                tau_t,
-                simple_order_2,
-                is_corrector_step=False,
-            )
-            pred_mat = torch.stack(pred_list[-predictor_order_used:], dim=1)    # (B, K, ...)
-            pred_res = torch.tensordot(pred_mat, b_coeffs, dims=([1], [0]))  # (B, ...)
-            h = lambdas[i + 1] - lambdas[i]
-            x_pred = sigmas[i + 1] / sigmas[i] * (-(tau_t ** 2) * h).exp() * x + pred_res
-
-            if tau_t > 0 and s_noise > 0:
-                noise = noise_sampler(sigmas[i], sigmas[i + 1]) * sigmas[i + 1] * (-2 * tau_t ** 2 * h).expm1().neg().sqrt() * s_noise
-                x_pred = x_pred + noise
-    return x
-
-
-@torch.no_grad()
-def sample_sa_solver_pece(model, x, sigmas, extra_args=None, callback=None, disable=False, tau_func=None, s_noise=1.0, noise_sampler=None, predictor_order=3, corrector_order=4, simple_order_2=False):
-    """Stochastic Adams Solver with PECE (Predict–Evaluate–Correct–Evaluate) mode (NeurIPS 2023)."""
-    return sample_sa_solver(model, x, sigmas, extra_args=extra_args, callback=callback, disable=disable, tau_func=tau_func, s_noise=s_noise, noise_sampler=noise_sampler, predictor_order=predictor_order, corrector_order=corrector_order, use_pece=True, simple_order_2=simple_order_2)
--- a/comfy/latent_formats.py
+++ b/comfy/latent_formats.py
@@ -457,170 +457,11 @@ class Wan21(LatentFormat):
        latents_std = self.latents_std.to(latent.device, latent.dtype)
        return latent * latents_std / self.scale_factor + latents_mean

-class Wan22(Wan21):
-    latent_channels = 48
-    latent_dimensions = 3
-
-    latent_rgb_factors = [
-            [ 0.0119,  0.0103,  0.0046],
-            [-0.1062, -0.0504,  0.0165],
-            [ 0.0140,  0.0409,  0.0491],
-            [-0.0813, -0.0677,  0.0607],
-            [ 0.0656,  0.0851,  0.0808],
-            [ 0.0264,  0.0463,  0.0912],
-            [ 0.0295,  0.0326,  0.0590],
-            [-0.0244, -0.0270,  0.0025],
-            [ 0.0443, -0.0102,  0.0288],
-            [-0.0465, -0.0090, -0.0205],
-            [ 0.0359,  0.0236,  0.0082],
-            [-0.0776,  0.0854,  0.1048],
-            [ 0.0564,  0.0264,  0.0561],
-            [ 0.0006,  0.0594,  0.0418],
-            [-0.0319, -0.0542, -0.0637],
-            [-0.0268,  0.0024,  0.0260],
-            [ 0.0539,  0.0265,  0.0358],
-            [-0.0359, -0.0312, -0.0287],
-            [-0.0285, -0.1032, -0.1237],
-            [ 0.1041,  0.0537,  0.0622],
-            [-0.0086, -0.0374, -0.0051],
-            [ 0.0390,  0.0670,  0.2863],
-            [ 0.0069,  0.0144,  0.0082],
-            [ 0.0006, -0.0167,  0.0079],
-            [ 0.0313, -0.0574, -0.0232],
-            [-0.1454, -0.0902, -0.0481],
-            [ 0.0714,  0.0827,  0.0447],
-            [-0.0304, -0.0574, -0.0196],
-            [ 0.0401,  0.0384,  0.0204],
-            [-0.0758, -0.0297, -0.0014],
-            [ 0.0568,  0.1307,  0.1372],
-            [-0.0055, -0.0310, -0.0380],
-            [ 0.0239, -0.0305,  0.0325],
-            [-0.0663, -0.0673, -0.0140],
-            [-0.0416, -0.0047, -0.0023],
-            [ 0.0166,  0.0112, -0.0093],
-            [-0.0211,  0.0011,  0.0331],
-            [ 0.1833,  0.1466,  0.2250],
-            [-0.0368,  0.0370,  0.0295],
-            [-0.3441, -0.3543, -0.2008],
-            [-0.0479, -0.0489, -0.0420],
-            [-0.0660, -0.0153,  0.0800],
-            [-0.0101,  0.0068,  0.0156],
-            [-0.0690, -0.0452, -0.0927],
-            [-0.0145,  0.0041,  0.0015],
-            [ 0.0421,  0.0451,  0.0373],
-            [ 0.0504, -0.0483, -0.0356],
-            [-0.0837,  0.0168,  0.0055]
-        ]
-
-    latent_rgb_factors_bias = [0.0317, -0.0878, -0.1388]
-
-    def __init__(self):
-        self.scale_factor = 1.0
-        self.latents_mean = torch.tensor([
-                -0.2289, -0.0052, -0.1323, -0.2339, -0.2799, 0.0174, 0.1838, 0.1557,
-                -0.1382, 0.0542, 0.2813, 0.0891, 0.1570, -0.0098, 0.0375, -0.1825,
-                -0.2246, -0.1207, -0.0698, 0.5109, 0.2665, -0.2108, -0.2158, 0.2502,
-                -0.2055, -0.0322, 0.1109, 0.1567, -0.0729, 0.0899, -0.2799, -0.1230,
-                -0.0313, -0.1649, 0.0117, 0.0723, -0.2839, -0.2083, -0.0520, 0.3748,
-                0.0152, 0.1957, 0.1433, -0.2944, 0.3573, -0.0548, -0.1681, -0.0667,
-            ]).view(1, self.latent_channels, 1, 1, 1)
-        self.latents_std = torch.tensor([
-                0.4765, 1.0364, 0.4514, 1.1677, 0.5313, 0.4990, 0.4818, 0.5013,
-                0.8158, 1.0344, 0.5894, 1.0901, 0.6885, 0.6165, 0.8454, 0.4978,
-                0.5759, 0.3523, 0.7135, 0.6804, 0.5833, 1.4146, 0.8986, 0.5659,
-                0.7069, 0.5338, 0.4889, 0.4917, 0.4069, 0.4999, 0.6866, 0.4093,
-                0.5709, 0.6065, 0.6415, 0.4944, 0.5726, 1.2042, 0.5458, 1.6887,
-                0.3971, 1.0600, 0.3943, 0.5537, 0.5444, 0.4089, 0.7468, 0.7744
-            ]).view(1, self.latent_channels, 1, 1, 1)
-
-class HunyuanImage21(LatentFormat):
-    latent_channels = 64
-    latent_dimensions = 2
-    scale_factor = 0.75289
-
-    latent_rgb_factors = [
-        [-0.0154, -0.0397, -0.0521],
-        [ 0.0005,  0.0093,  0.0006],
-        [-0.0805, -0.0773, -0.0586],
-        [-0.0494, -0.0487, -0.0498],
-        [-0.0212, -0.0076, -0.0261],
-        [-0.0179, -0.0417, -0.0505],
-        [ 0.0158,  0.0310,  0.0239],
-        [ 0.0409,  0.0516,  0.0201],
-        [ 0.0350,  0.0553,  0.0036],
-        [-0.0447, -0.0327, -0.0479],
-        [-0.0038, -0.0221, -0.0365],
-        [-0.0423, -0.0718, -0.0654],
-        [ 0.0039,  0.0368,  0.0104],
-        [ 0.0655,  0.0217,  0.0122],
-        [ 0.0490,  0.1638,  0.2053],
-        [ 0.0932,  0.0829,  0.0650],
-        [-0.0186, -0.0209, -0.0135],
-        [-0.0080, -0.0076, -0.0148],
-        [-0.0284, -0.0201,  0.0011],
-        [-0.0642, -0.0294, -0.0777],
-        [-0.0035,  0.0076, -0.0140],
-        [ 0.0519,  0.0731,  0.0887],
-        [-0.0102,  0.0095,  0.0704],
-        [ 0.0068,  0.0218, -0.0023],
-        [-0.0726, -0.0486, -0.0519],
-        [ 0.0260,  0.0295,  0.0263],
-        [ 0.0250,  0.0333,  0.0341],
-        [ 0.0168, -0.0120, -0.0174],
-        [ 0.0226,  0.1037,  0.0114],
-        [ 0.2577,  0.1906,  0.1604],
-        [-0.0646, -0.0137, -0.0018],
-        [-0.0112,  0.0309,  0.0358],
-        [-0.0347,  0.0146, -0.0481],
-        [ 0.0234,  0.0179,  0.0201],
-        [ 0.0157,  0.0313,  0.0225],
-        [ 0.0423,  0.0675,  0.0524],
-        [-0.0031,  0.0027, -0.0255],
-        [ 0.0447,  0.0555,  0.0330],
-        [-0.0152,  0.0103,  0.0299],
-        [-0.0755, -0.0489, -0.0635],
-        [ 0.0853,  0.0788,  0.1017],
-        [-0.0272, -0.0294, -0.0471],
-        [ 0.0440,  0.0400, -0.0137],
-        [ 0.0335,  0.0317, -0.0036],
-        [-0.0344, -0.0621, -0.0984],
-        [-0.0127, -0.0630, -0.0620],
-        [-0.0648,  0.0360,  0.0924],
-        [-0.0781, -0.0801, -0.0409],
-        [ 0.0363,  0.0613,  0.0499],
-        [ 0.0238,  0.0034,  0.0041],
-        [-0.0135,  0.0258,  0.0310],
-        [ 0.0614,  0.1086,  0.0589],
-        [ 0.0428,  0.0350,  0.0205],
-        [ 0.0153,  0.0173, -0.0018],
-        [-0.0288, -0.0455, -0.0091],
-        [ 0.0344,  0.0109, -0.0157],
-        [-0.0205, -0.0247, -0.0187],
-        [ 0.0487,  0.0126,  0.0064],
-        [-0.0220, -0.0013,  0.0074],
-        [-0.0203, -0.0094, -0.0048],
-        [-0.0719,  0.0429, -0.0442],
-        [ 0.1042,  0.0497,  0.0356],
-        [-0.0659, -0.0578, -0.0280],
-        [-0.0060, -0.0322, -0.0234]]
-
-    latent_rgb_factors_bias = [0.0007, -0.0256, -0.0206]
-
-class HunyuanImage21Refiner(LatentFormat):
-    latent_channels = 64
-    latent_dimensions = 3
-    scale_factor = 1.03682
-
 class Hunyuan3Dv2(LatentFormat):
    latent_channels = 64
    latent_dimensions = 1
    scale_factor = 0.9990943042622529

-class Hunyuan3Dv2_1(LatentFormat):
-    scale_factor = 1.0039506158752403
-    latent_channels = 64
-    latent_dimensions = 1
-
 class Hunyuan3Dv2mini(LatentFormat):
    latent_channels = 64
    latent_dimensions = 1
@@ -629,20 +470,3 @@ class Hunyuan3Dv2mini(LatentFormat):
 class ACEAudio(LatentFormat):
    latent_channels = 8
    latent_dimensions = 2
-
-class ChromaRadiance(LatentFormat):
-    latent_channels = 3
-
-    def __init__(self):
-        self.latent_rgb_factors = [
-            # R    G    B
-            [ 1.0, 0.0, 0.0 ],
-            [ 0.0, 1.0, 0.0 ],
-            [ 0.0, 0.0, 1.0 ]
-        ]
-
-    def process_in(self, latent):
-        return latent
-
-    def process_out(self, latent):
-        return latent
--- a/comfy/ldm/ace/attention.py
+++ b/comfy/ldm/ace/attention.py
@@ -133,7 +133,6 @@ class Attention(nn.Module):
        hidden_states: torch.Tensor,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
-        transformer_options={},
        **cross_attention_kwargs,
    ) -> torch.Tensor:
        return self.processor(
@@ -141,7 +140,6 @@ class Attention(nn.Module):
            hidden_states,
            encoder_hidden_states=encoder_hidden_states,
            attention_mask=attention_mask,
-            transformer_options=transformer_options,
            **cross_attention_kwargs,
        )

@@ -368,7 +366,6 @@ class CustomerAttnProcessor2_0:
        encoder_attention_mask: Optional[torch.FloatTensor] = None,
        rotary_freqs_cis: Union[torch.Tensor, Tuple[torch.Tensor]] = None,
        rotary_freqs_cis_cross: Union[torch.Tensor, Tuple[torch.Tensor]] = None,
-        transformer_options={},
        *args,
        **kwargs,
    ) -> torch.Tensor:
@@ -436,7 +433,7 @@ class CustomerAttnProcessor2_0:

        # the output of sdp = (batch, num_heads, seq_len, head_dim)
        hidden_states = optimized_attention(
-            query, key, value, heads=query.shape[1], mask=attention_mask, skip_reshape=True, transformer_options=transformer_options,
+            query, key, value, heads=query.shape[1], mask=attention_mask, skip_reshape=True,
        ).to(query.dtype)

        # linear proj
@@ -700,7 +697,6 @@ class LinearTransformerBlock(nn.Module):
        rotary_freqs_cis: Union[torch.Tensor, Tuple[torch.Tensor]] = None,
        rotary_freqs_cis_cross: Union[torch.Tensor, Tuple[torch.Tensor]] = None,
        temb: torch.FloatTensor = None,
-        transformer_options={},
    ):

        N = hidden_states.shape[0]
@@ -724,7 +720,6 @@ class LinearTransformerBlock(nn.Module):
                encoder_attention_mask=encoder_attention_mask,
                rotary_freqs_cis=rotary_freqs_cis,
                rotary_freqs_cis_cross=rotary_freqs_cis_cross,
-                transformer_options=transformer_options,
            )
        else:
            attn_output, _ = self.attn(
@@ -734,7 +729,6 @@ class LinearTransformerBlock(nn.Module):
                encoder_attention_mask=None,
                rotary_freqs_cis=rotary_freqs_cis,
                rotary_freqs_cis_cross=None,
-                transformer_options=transformer_options,
            )

        if self.use_adaln_single:
@@ -749,7 +743,6 @@ class LinearTransformerBlock(nn.Module):
                encoder_attention_mask=encoder_attention_mask,
                rotary_freqs_cis=rotary_freqs_cis,
                rotary_freqs_cis_cross=rotary_freqs_cis_cross,
-                transformer_options=transformer_options,
            )
            hidden_states = attn_output + hidden_states

--- a/comfy/ldm/ace/model.py
+++ b/comfy/ldm/ace/model.py
@@ -19,7 +19,6 @@ import torch
 from torch import nn

 import comfy.model_management
-import comfy.patcher_extension

 from comfy.ldm.lightricks.model import TimestepEmbedding, Timesteps
 from .attention import LinearTransformerBlock, t2i_modulate
@@ -314,7 +313,6 @@ class ACEStepTransformer2DModel(nn.Module):
        output_length: int = 0,
        block_controlnet_hidden_states: Optional[Union[List[torch.Tensor], torch.Tensor]] = None,
        controlnet_scale: Union[float, torch.Tensor] = 1.0,
-        transformer_options={},
    ):
        embedded_timestep = self.timestep_embedder(self.time_proj(timestep).to(dtype=hidden_states.dtype))
        temb = self.t_block(embedded_timestep)
@@ -340,34 +338,12 @@ class ACEStepTransformer2DModel(nn.Module):
                rotary_freqs_cis=rotary_freqs_cis,
                rotary_freqs_cis_cross=encoder_rotary_freqs_cis,
                temb=temb,
-                transformer_options=transformer_options,
            )

        output = self.final_layer(hidden_states, embedded_timestep, output_length)
        return output

-    def forward(self,
-        x,
-        timestep,
-        attention_mask=None,
-        context: Optional[torch.Tensor] = None,
-        text_attention_mask: Optional[torch.LongTensor] = None,
-        speaker_embeds: Optional[torch.FloatTensor] = None,
-        lyric_token_idx: Optional[torch.LongTensor] = None,
-        lyric_mask: Optional[torch.LongTensor] = None,
-        block_controlnet_hidden_states: Optional[Union[List[torch.Tensor], torch.Tensor]] = None,
-        controlnet_scale: Union[float, torch.Tensor] = 1.0,
-        lyrics_strength=1.0,
-        **kwargs
-    ):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, kwargs.get("transformer_options", {}))
-        ).execute(x, timestep, attention_mask, context, text_attention_mask, speaker_embeds, lyric_token_idx, lyric_mask, block_controlnet_hidden_states,
-                  controlnet_scale, lyrics_strength, **kwargs)
-
-    def _forward(
+    def forward(
        self,
        x,
        timestep,
@@ -395,7 +371,6 @@ class ACEStepTransformer2DModel(nn.Module):

        output_length = hidden_states.shape[-1]

-        transformer_options = kwargs.get("transformer_options", {})
        output = self.decode(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
@@ -405,7 +380,6 @@ class ACEStepTransformer2DModel(nn.Module):
            output_length=output_length,
            block_controlnet_hidden_states=block_controlnet_hidden_states,
            controlnet_scale=controlnet_scale,
-            transformer_options=transformer_options,
        )

        return output
--- a/comfy/ldm/audio/dit.py
+++ b/comfy/ldm/audio/dit.py
@@ -298,8 +298,7 @@ class Attention(nn.Module):
        mask = None,
        context_mask = None,
        rotary_pos_emb = None,
-        causal = None,
-        transformer_options={},
+        causal = None
    ):
        h, kv_h, has_context = self.num_heads, self.kv_heads, context is not None

@@ -364,7 +363,7 @@ class Attention(nn.Module):
            heads_per_kv_head = h // kv_h
            k, v = map(lambda t: t.repeat_interleave(heads_per_kv_head, dim = 1), (k, v))

-        out = optimized_attention(q, k, v, h, skip_reshape=True, transformer_options=transformer_options)
+        out = optimized_attention(q, k, v, h, skip_reshape=True)
        out = self.to_out(out)

        if mask is not None:
@@ -489,8 +488,7 @@ class TransformerBlock(nn.Module):
        global_cond=None,
        mask = None,
        context_mask = None,
-        rotary_pos_emb = None,
-        transformer_options={}
+        rotary_pos_emb = None
    ):
        if self.global_cond_dim is not None and self.global_cond_dim > 0 and global_cond is not None:

@@ -500,12 +498,12 @@ class TransformerBlock(nn.Module):
            residual = x
            x = self.pre_norm(x)
            x = x * (1 + scale_self) + shift_self
-            x = self.self_attn(x, mask = mask, rotary_pos_emb = rotary_pos_emb, transformer_options=transformer_options)
+            x = self.self_attn(x, mask = mask, rotary_pos_emb = rotary_pos_emb)
            x = x * torch.sigmoid(1 - gate_self)
            x = x + residual

            if context is not None:
-                x = x + self.cross_attn(self.cross_attend_norm(x), context = context, context_mask = context_mask, transformer_options=transformer_options)
+                x = x + self.cross_attn(self.cross_attend_norm(x), context = context, context_mask = context_mask)

            if self.conformer is not None:
                x = x + self.conformer(x)
@@ -519,10 +517,10 @@ class TransformerBlock(nn.Module):
            x = x + residual

        else:
-            x = x + self.self_attn(self.pre_norm(x), mask = mask, rotary_pos_emb = rotary_pos_emb, transformer_options=transformer_options)
+            x = x + self.self_attn(self.pre_norm(x), mask = mask, rotary_pos_emb = rotary_pos_emb)

            if context is not None:
-                x = x + self.cross_attn(self.cross_attend_norm(x), context = context, context_mask = context_mask, transformer_options=transformer_options)
+                x = x + self.cross_attn(self.cross_attend_norm(x), context = context, context_mask = context_mask)

            if self.conformer is not None:
                x = x + self.conformer(x)
@@ -608,8 +606,7 @@ class ContinuousTransformer(nn.Module):
        return_info = False,
        **kwargs
    ):
-        transformer_options = kwargs.get("transformer_options", {})
-        patches_replace = transformer_options.get("patches_replace", {})
+        patches_replace = kwargs.get("transformer_options", {}).get("patches_replace", {})
        batch, seq, device = *x.shape[:2], x.device
        context = kwargs["context"]

@@ -635,7 +632,7 @@ class ContinuousTransformer(nn.Module):
        # Attention layers

        if self.rotary_pos_emb is not None:
-            rotary_pos_emb = self.rotary_pos_emb.forward_from_seq_len(x.shape[1], dtype=torch.float, device=x.device)
+            rotary_pos_emb = self.rotary_pos_emb.forward_from_seq_len(x.shape[1], dtype=x.dtype, device=x.device)
        else:
            rotary_pos_emb = None

@@ -648,13 +645,13 @@ class ContinuousTransformer(nn.Module):
            if ("double_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
-                    out["img"] = layer(args["img"], rotary_pos_emb=args["pe"], global_cond=args["vec"], context=args["txt"], transformer_options=args["transformer_options"])
+                    out["img"] = layer(args["img"], rotary_pos_emb=args["pe"], global_cond=args["vec"], context=args["txt"])
                    return out

-                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": global_cond, "pe": rotary_pos_emb, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": global_cond, "pe": rotary_pos_emb}, {"original_block": block_wrap})
                x = out["img"]
            else:
-                x = layer(x, rotary_pos_emb = rotary_pos_emb, global_cond=global_cond, context=context, transformer_options=transformer_options)
+                x = layer(x, rotary_pos_emb = rotary_pos_emb, global_cond=global_cond, context=context)
            # x = checkpoint(layer, x, rotary_pos_emb = rotary_pos_emb, global_cond=global_cond, **kwargs)

            if return_info:
--- a/comfy/ldm/aura/mmdit.py
+++ b/comfy/ldm/aura/mmdit.py
@@ -9,7 +9,6 @@ import torch.nn.functional as F

 from comfy.ldm.modules.attention import optimized_attention
 import comfy.ops
-import comfy.patcher_extension
 import comfy.ldm.common_dit

 def modulate(x, shift, scale):
@@ -85,7 +84,7 @@ class SingleAttention(nn.Module):
        )

    #@torch.compile()
-    def forward(self, c, transformer_options={}):
+    def forward(self, c):

        bsz, seqlen1, _ = c.shape

@@ -95,7 +94,7 @@ class SingleAttention(nn.Module):
        v = v.view(bsz, seqlen1, self.n_heads, self.head_dim)
        q, k = self.q_norm1(q), self.k_norm1(k)

-        output = optimized_attention(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3), v.permute(0, 2, 1, 3), self.n_heads, skip_reshape=True, transformer_options=transformer_options)
+        output = optimized_attention(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3), v.permute(0, 2, 1, 3), self.n_heads, skip_reshape=True)
        c = self.w1o(output)
        return c

@@ -144,7 +143,7 @@ class DoubleAttention(nn.Module):


    #@torch.compile()
-    def forward(self, c, x, transformer_options={}):
+    def forward(self, c, x):

        bsz, seqlen1, _ = c.shape
        bsz, seqlen2, _ = x.shape
@@ -168,7 +167,7 @@ class DoubleAttention(nn.Module):
            torch.cat([cv, xv], dim=1),
        )

-        output = optimized_attention(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3), v.permute(0, 2, 1, 3), self.n_heads, skip_reshape=True, transformer_options=transformer_options)
+        output = optimized_attention(q.permute(0, 2, 1, 3), k.permute(0, 2, 1, 3), v.permute(0, 2, 1, 3), self.n_heads, skip_reshape=True)

        c, x = output.split([seqlen1, seqlen2], dim=1)
        c = self.w1o(c)
@@ -207,7 +206,7 @@ class MMDiTBlock(nn.Module):
        self.is_last = is_last

    #@torch.compile()
-    def forward(self, c, x, global_cond, transformer_options={}, **kwargs):
+    def forward(self, c, x, global_cond, **kwargs):

        cres, xres = c, x

@@ -225,7 +224,7 @@ class MMDiTBlock(nn.Module):
        x = modulate(self.normX1(x), xshift_msa, xscale_msa)

        # attention
-        c, x = self.attn(c, x, transformer_options=transformer_options)
+        c, x = self.attn(c, x)


        c = self.normC2(cres + cgate_msa.unsqueeze(1) * c)
@@ -255,13 +254,13 @@ class DiTBlock(nn.Module):
        self.mlp = MLP(dim, hidden_dim=dim * 4, dtype=dtype, device=device, operations=operations)

    #@torch.compile()
-    def forward(self, cx, global_cond, transformer_options={}, **kwargs):
+    def forward(self, cx, global_cond, **kwargs):
        cxres = cx
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.modCX(
            global_cond
        ).chunk(6, dim=1)
        cx = modulate(self.norm1(cx), shift_msa, scale_msa)
-        cx = self.attn(cx, transformer_options=transformer_options)
+        cx = self.attn(cx)
        cx = self.norm2(cxres + gate_msa.unsqueeze(1) * cx)
        mlpout = self.mlp(modulate(cx, shift_mlp, scale_mlp))
        cx = gate_mlp.unsqueeze(1) * mlpout
@@ -437,13 +436,6 @@ class MMDiT(nn.Module):
        return x + pos_encoding.reshape(1, -1, self.positional_encoding.shape[-1])

    def forward(self, x, timestep, context, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, transformer_options, **kwargs)
-
-    def _forward(self, x, timestep, context, transformer_options={}, **kwargs):
        patches_replace = transformer_options.get("patches_replace", {})
        # patchify x, add PE
        b, c, h, w = x.shape
@@ -473,14 +465,13 @@ class MMDiT(nn.Module):
                        out = {}
                        out["txt"], out["img"] = layer(args["txt"],
                                                       args["img"],
-                                                       args["vec"],
-                                                       transformer_options=args["transformer_options"])
+                                                       args["vec"])
                        return out
-                    out = blocks_replace[("double_block", i)]({"img": x, "txt": c, "vec": global_cond, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                    out = blocks_replace[("double_block", i)]({"img": x, "txt": c, "vec": global_cond}, {"original_block": block_wrap})
                    c = out["txt"]
                    x = out["img"]
                else:
-                    c, x = layer(c, x, global_cond, transformer_options=transformer_options, **kwargs)
+                    c, x = layer(c, x, global_cond, **kwargs)

        if len(self.single_layers) > 0:
            c_len = c.size(1)
@@ -489,13 +480,13 @@ class MMDiT(nn.Module):
                if ("single_block", i) in blocks_replace:
                    def block_wrap(args):
                        out = {}
-                        out["img"] = layer(args["img"], args["vec"], transformer_options=args["transformer_options"])
+                        out["img"] = layer(args["img"], args["vec"])
                        return out

-                    out = blocks_replace[("single_block", i)]({"img": cx, "vec": global_cond, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                    out = blocks_replace[("single_block", i)]({"img": cx, "vec": global_cond}, {"original_block": block_wrap})
                    cx = out["img"]
                else:
-                    cx = layer(cx, global_cond, transformer_options=transformer_options, **kwargs)
+                    cx = layer(cx, global_cond, **kwargs)

            x = cx[:, c_len:]

--- a/comfy/ldm/cascade/common.py
+++ b/comfy/ldm/cascade/common.py
@@ -32,12 +32,12 @@ class OptimizedAttention(nn.Module):

        self.out_proj = operations.Linear(c, c, bias=True, dtype=dtype, device=device)

-    def forward(self, q, k, v, transformer_options={}):
+    def forward(self, q, k, v):
        q = self.to_q(q)
        k = self.to_k(k)
        v = self.to_v(v)

-        out = optimized_attention(q, k, v, self.heads, transformer_options=transformer_options)
+        out = optimized_attention(q, k, v, self.heads)

        return self.out_proj(out)

@@ -47,13 +47,13 @@ class Attention2D(nn.Module):
        self.attn = OptimizedAttention(c, nhead, dtype=dtype, device=device, operations=operations)
        # self.attn = nn.MultiheadAttention(c, nhead, dropout=dropout, bias=True, batch_first=True, dtype=dtype, device=device)

-    def forward(self, x, kv, self_attn=False, transformer_options={}):
+    def forward(self, x, kv, self_attn=False):
        orig_shape = x.shape
        x = x.view(x.size(0), x.size(1), -1).permute(0, 2, 1)  # Bx4xHxW -> Bx(HxW)x4
        if self_attn:
            kv = torch.cat([x, kv], dim=1)
        # x = self.attn(x, kv, kv, need_weights=False)[0]
-        x = self.attn(x, kv, kv, transformer_options=transformer_options)
+        x = self.attn(x, kv, kv)
        x = x.permute(0, 2, 1).view(*orig_shape)
        return x

@@ -114,9 +114,9 @@ class AttnBlock(nn.Module):
            operations.Linear(c_cond, c, dtype=dtype, device=device)
        )

-    def forward(self, x, kv, transformer_options={}):
+    def forward(self, x, kv):
        kv = self.kv_mapper(kv)
-        x = x + self.attention(self.norm(x), kv, self_attn=self.self_attn, transformer_options=transformer_options)
+        x = x + self.attention(self.norm(x), kv, self_attn=self.self_attn)
        return x


--- a/comfy/ldm/cascade/stage_b.py
+++ b/comfy/ldm/cascade/stage_b.py
@@ -173,7 +173,7 @@ class StageB(nn.Module):
        clip = self.clip_norm(clip)
        return clip

-    def _down_encode(self, x, r_embed, clip, transformer_options={}):
+    def _down_encode(self, x, r_embed, clip):
        level_outputs = []
        block_group = zip(self.down_blocks, self.down_downscalers, self.down_repeat_mappers)
        for down_block, downscaler, repmap in block_group:
@@ -187,7 +187,7 @@ class StageB(nn.Module):
                    elif isinstance(block, AttnBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  AttnBlock)):
-                        x = block(x, clip, transformer_options=transformer_options)
+                        x = block(x, clip)
                    elif isinstance(block, TimestepBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  TimestepBlock)):
@@ -199,7 +199,7 @@ class StageB(nn.Module):
            level_outputs.insert(0, x)
        return level_outputs

-    def _up_decode(self, level_outputs, r_embed, clip, transformer_options={}):
+    def _up_decode(self, level_outputs, r_embed, clip):
        x = level_outputs[0]
        block_group = zip(self.up_blocks, self.up_upscalers, self.up_repeat_mappers)
        for i, (up_block, upscaler, repmap) in enumerate(block_group):
@@ -216,7 +216,7 @@ class StageB(nn.Module):
                    elif isinstance(block, AttnBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  AttnBlock)):
-                        x = block(x, clip, transformer_options=transformer_options)
+                        x = block(x, clip)
                    elif isinstance(block, TimestepBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  TimestepBlock)):
@@ -228,7 +228,7 @@ class StageB(nn.Module):
            x = upscaler(x)
        return x

-    def forward(self, x, r, effnet, clip, pixels=None, transformer_options={}, **kwargs):
+    def forward(self, x, r, effnet, clip, pixels=None, **kwargs):
        if pixels is None:
            pixels = x.new_zeros(x.size(0), 3, 8, 8)

@@ -245,8 +245,8 @@ class StageB(nn.Module):
            nn.functional.interpolate(effnet, size=x.shape[-2:], mode='bilinear', align_corners=True))
        x = x + nn.functional.interpolate(self.pixels_mapper(pixels), size=x.shape[-2:], mode='bilinear',
                                          align_corners=True)
-        level_outputs = self._down_encode(x, r_embed, clip, transformer_options=transformer_options)
-        x = self._up_decode(level_outputs, r_embed, clip, transformer_options=transformer_options)
+        level_outputs = self._down_encode(x, r_embed, clip)
+        x = self._up_decode(level_outputs, r_embed, clip)
        return self.clf(x)

    def update_weights_ema(self, src_model, beta=0.999):
--- a/comfy/ldm/cascade/stage_c.py
+++ b/comfy/ldm/cascade/stage_c.py
@@ -182,7 +182,7 @@ class StageC(nn.Module):
        clip = self.clip_norm(clip)
        return clip

-    def _down_encode(self, x, r_embed, clip, cnet=None, transformer_options={}):
+    def _down_encode(self, x, r_embed, clip, cnet=None):
        level_outputs = []
        block_group = zip(self.down_blocks, self.down_downscalers, self.down_repeat_mappers)
        for down_block, downscaler, repmap in block_group:
@@ -201,7 +201,7 @@ class StageC(nn.Module):
                    elif isinstance(block, AttnBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  AttnBlock)):
-                        x = block(x, clip, transformer_options=transformer_options)
+                        x = block(x, clip)
                    elif isinstance(block, TimestepBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  TimestepBlock)):
@@ -213,7 +213,7 @@ class StageC(nn.Module):
            level_outputs.insert(0, x)
        return level_outputs

-    def _up_decode(self, level_outputs, r_embed, clip, cnet=None, transformer_options={}):
+    def _up_decode(self, level_outputs, r_embed, clip, cnet=None):
        x = level_outputs[0]
        block_group = zip(self.up_blocks, self.up_upscalers, self.up_repeat_mappers)
        for i, (up_block, upscaler, repmap) in enumerate(block_group):
@@ -235,7 +235,7 @@ class StageC(nn.Module):
                    elif isinstance(block, AttnBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  AttnBlock)):
-                        x = block(x, clip, transformer_options=transformer_options)
+                        x = block(x, clip)
                    elif isinstance(block, TimestepBlock) or (
                            hasattr(block, '_fsdp_wrapped_module') and isinstance(block._fsdp_wrapped_module,
                                                                                  TimestepBlock)):
@@ -247,7 +247,7 @@ class StageC(nn.Module):
            x = upscaler(x)
        return x

-    def forward(self, x, r, clip_text, clip_text_pooled, clip_img, control=None, transformer_options={}, **kwargs):
+    def forward(self, x, r, clip_text, clip_text_pooled, clip_img, control=None, **kwargs):
        # Process the conditioning embeddings
        r_embed = self.gen_r_embedding(r).to(dtype=x.dtype)
        for c in self.t_conds:
@@ -262,8 +262,8 @@ class StageC(nn.Module):

        # Model Blocks
        x = self.embedding(x)
-        level_outputs = self._down_encode(x, r_embed, clip, cnet, transformer_options=transformer_options)
-        x = self._up_decode(level_outputs, r_embed, clip, cnet, transformer_options=transformer_options)
+        level_outputs = self._down_encode(x, r_embed, clip, cnet)
+        x = self._up_decode(level_outputs, r_embed, clip, cnet)
        return self.clf(x)

    def update_weights_ema(self, src_model, beta=0.999):
--- a/comfy/ldm/chroma/layers.py
+++ b/comfy/ldm/chroma/layers.py
@@ -76,7 +76,7 @@ class DoubleStreamBlock(nn.Module):
        )
        self.flipped_img_txt = flipped_img_txt

-    def forward(self, img: Tensor, txt: Tensor, pe: Tensor, vec: Tensor, attn_mask=None, transformer_options={}):
+    def forward(self, img: Tensor, txt: Tensor, pe: Tensor, vec: Tensor, attn_mask=None):
        (img_mod1, img_mod2), (txt_mod1, txt_mod2) = vec

        # prepare image for attention
@@ -95,7 +95,7 @@ class DoubleStreamBlock(nn.Module):
        attn = attention(torch.cat((txt_q, img_q), dim=2),
                         torch.cat((txt_k, img_k), dim=2),
                         torch.cat((txt_v, img_v), dim=2),
-                         pe=pe, mask=attn_mask, transformer_options=transformer_options)
+                         pe=pe, mask=attn_mask)

        txt_attn, img_attn = attn[:, : txt.shape[1]], attn[:, txt.shape[1] :]

@@ -148,7 +148,7 @@ class SingleStreamBlock(nn.Module):

        self.mlp_act = nn.GELU(approximate="tanh")

-    def forward(self, x: Tensor, pe: Tensor, vec: Tensor, attn_mask=None, transformer_options={}) -> Tensor:
+    def forward(self, x: Tensor, pe: Tensor, vec: Tensor, attn_mask=None) -> Tensor:
        mod = vec
        x_mod = torch.addcmul(mod.shift, 1 + mod.scale, self.pre_norm(x))
        qkv, mlp = torch.split(self.linear1(x_mod), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1)
@@ -157,7 +157,7 @@ class SingleStreamBlock(nn.Module):
        q, k = self.norm(q, k, v)

        # compute attention
-        attn = attention(q, k, v, pe=pe, mask=attn_mask, transformer_options=transformer_options)
+        attn = attention(q, k, v, pe=pe, mask=attn_mask)
        # compute activation in mlp stream, cat again and run second linear layer
        output = self.linear2(torch.cat((attn, self.mlp_act(mlp)), 2))
        x.addcmul_(mod.gate, output)
--- a/comfy/ldm/chroma/model.py
+++ b/comfy/ldm/chroma/model.py
@@ -5,7 +5,6 @@ from dataclasses import dataclass
 import torch
 from torch import Tensor, nn
 from einops import rearrange, repeat
-import comfy.patcher_extension
 import comfy.ldm.common_dit

 from comfy.ldm.flux.layers import (
@@ -151,6 +150,8 @@ class Chroma(nn.Module):
        attn_mask: Tensor = None,
    ) -> Tensor:
        patches_replace = transformer_options.get("patches_replace", {})
+        if img.ndim != 3 or txt.ndim != 3:
+            raise ValueError("Input img and txt tensors must have 3 dimensions.")

        # running on sequences img
        img = self.img_in(img)
@@ -191,16 +192,14 @@ class Chroma(nn.Module):
                                                       txt=args["txt"],
                                                       vec=args["vec"],
                                                       pe=args["pe"],
-                                                       attn_mask=args.get("attn_mask"),
-                                                       transformer_options=args.get("transformer_options"))
+                                                       attn_mask=args.get("attn_mask"))
                        return out

                    out = blocks_replace[("double_block", i)]({"img": img,
                                                               "txt": txt,
                                                               "vec": double_mod,
                                                               "pe": pe,
-                                                               "attn_mask": attn_mask,
-                                                               "transformer_options": transformer_options},
+                                                               "attn_mask": attn_mask},
                                                              {"original_block": block_wrap})
                    txt = out["txt"]
                    img = out["img"]
@@ -209,8 +208,7 @@ class Chroma(nn.Module):
                                     txt=txt,
                                     vec=double_mod,
                                     pe=pe,
-                                     attn_mask=attn_mask,
-                                     transformer_options=transformer_options)
+                                     attn_mask=attn_mask)

                if control is not None: # Controlnet
                    control_i = control.get("input")
@@ -230,19 +228,17 @@ class Chroma(nn.Module):
                        out["img"] = block(args["img"],
                                           vec=args["vec"],
                                           pe=args["pe"],
-                                           attn_mask=args.get("attn_mask"),
-                                           transformer_options=args.get("transformer_options"))
+                                           attn_mask=args.get("attn_mask"))
                        return out

                    out = blocks_replace[("single_block", i)]({"img": img,
                                                               "vec": single_mod,
                                                               "pe": pe,
-                                                               "attn_mask": attn_mask,
-                                                               "transformer_options": transformer_options},
+                                                               "attn_mask": attn_mask},
                                                              {"original_block": block_wrap})
                    img = out["img"]
                else:
-                    img = block(img, vec=single_mod, pe=pe, attn_mask=attn_mask, transformer_options=transformer_options)
+                    img = block(img, vec=single_mod, pe=pe, attn_mask=attn_mask)

                if control is not None: # Controlnet
                    control_o = control.get("output")
@@ -252,29 +248,19 @@ class Chroma(nn.Module):
                            img[:, txt.shape[1] :, ...] += add

        img = img[:, txt.shape[1] :, ...]
-        if hasattr(self, "final_layer"):
-            final_mod = self.get_modulations(mod_vectors, "final")
-            img = self.final_layer(img, vec=final_mod)  # (N, T, patch_size ** 2 * out_channels)
+        final_mod = self.get_modulations(mod_vectors, "final")
+        img = self.final_layer(img, vec=final_mod)  # (N, T, patch_size ** 2 * out_channels)
        return img

    def forward(self, x, timestep, context, guidance, control=None, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, guidance, control, transformer_options, **kwargs)
-
-    def _forward(self, x, timestep, context, guidance, control=None, transformer_options={}, **kwargs):
        bs, c, h, w = x.shape
-        x = comfy.ldm.common_dit.pad_to_patch_size(x, (self.patch_size, self.patch_size))
+        patch_size = 2
+        x = comfy.ldm.common_dit.pad_to_patch_size(x, (patch_size, patch_size))

-        img = rearrange(x, "b c (h ph) (w pw) -> b (h w) (c ph pw)", ph=self.patch_size, pw=self.patch_size)
+        img = rearrange(x, "b c (h ph) (w pw) -> b (h w) (c ph pw)", ph=patch_size, pw=patch_size)

-        if img.ndim != 3 or context.ndim != 3:
-            raise ValueError("Input img and txt tensors must have 3 dimensions.")
-
-        h_len = ((h + (self.patch_size // 2)) // self.patch_size)
-        w_len = ((w + (self.patch_size // 2)) // self.patch_size)
+        h_len = ((h + (patch_size // 2)) // patch_size)
+        w_len = ((w + (patch_size // 2)) // patch_size)
        img_ids = torch.zeros((h_len, w_len, 3), device=x.device, dtype=x.dtype)
        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(0, h_len - 1, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1)
        img_ids[:, :, 2] = img_ids[:, :, 2] + torch.linspace(0, w_len - 1, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0)
@@ -282,4 +268,4 @@ class Chroma(nn.Module):

        txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
        out = self.forward_orig(img, img_ids, context, txt_ids, timestep, guidance, control, transformer_options, attn_mask=kwargs.get("attention_mask", None))
-        return rearrange(out, "b (h w) (c ph pw) -> b c (h ph) (w pw)", h=h_len, w=w_len, ph=self.patch_size, pw=self.patch_size)[:,:,:h,:w]
+        return rearrange(out, "b (h w) (c ph pw) -> b c (h ph) (w pw)", h=h_len, w=w_len, ph=2, pw=2)[:,:,:h,:w]
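Note on the hunk above: with the patch size fixed at 2, the `h_len` / `w_len` expression `(h + patch_size // 2) // patch_size` coincides with ceiling division, so it matches the number of patches per axis once the input has been padded up to a multiple of the patch size. For larger patch sizes the same expression rounds to nearest rather than up. A minimal pure-Python sketch of that arithmetic follows; the helper names `token_len` and `padded_patches` are hypothetical and are not part of the codebase, and the padding behaviour is an assumption stated in the comment.

```python
import math


def token_len(size: int, patch_size: int) -> int:
    # Same arithmetic as h_len / w_len in the diff.
    return (size + (patch_size // 2)) // patch_size


def padded_patches(size: int, patch_size: int) -> int:
    # Patches per axis, ASSUMING the input is padded up to the next
    # multiple of patch_size before it is split into patches.
    return math.ceil(size / patch_size)


# For patch_size == 2 the two expressions agree for every size checked.
for h in range(1, 65):
    assert token_len(h, 2) == padded_patches(h, 2)

# For a larger patch size they can differ: (5 + 2) // 4 == 1, ceil(5 / 4) == 2.
print(token_len(5, 4), padded_patches(5, 4))  # prints: 1 2
```

This is only an observation about the arithmetic; whether a patch size other than 2 is ever used with this class is not shown in the diff.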
--- a/comfy/ldm/chroma_radiance/layers.py
+++ b/comfy/ldm/chroma_radiance/layers.py
@@ -1,206 +0,0 @@
-# Adapted from https://github.com/lodestone-rock/flow
-from functools import lru_cache
-
-import torch
-from torch import nn
-
-from comfy.ldm.flux.layers import RMSNorm
-
-
-class NerfEmbedder(nn.Module):
-    """
-    An embedder module that combines input features with a 2D positional
-    encoding that mimics the Discrete Cosine Transform (DCT).
-
-    This module takes an input tensor of shape (B, P^2, C), where P is the
-    patch size, and enriches it with positional information before projecting
-    it to a new hidden size.
-    """
-    def __init__(
-        self,
-        in_channels: int,
-        hidden_size_input: int,
-        max_freqs: int,
-        dtype=None,
-        device=None,
-        operations=None,
-    ):
-        """
-        Initializes the NerfEmbedder.
-
-        Args:
-            in_channels (int): The number of channels in the input tensor.
-            hidden_size_input (int): The desired dimension of the output embedding.
-            max_freqs (int): The number of frequency components to use for both
-                             the x and y dimensions of the positional encoding.
-                             The total number of positional features will be max_freqs^2.
-        """
-        super().__init__()
-        self.dtype = dtype
-        self.max_freqs = max_freqs
-        self.hidden_size_input = hidden_size_input
-
-        # A linear layer to project the concatenated input features and
-        # positional encodings to the final output dimension.
-        self.embedder = nn.Sequential(
-            operations.Linear(in_channels + max_freqs**2, hidden_size_input, dtype=dtype, device=device)
-        )
-
-    @lru_cache(maxsize=4)
-    def fetch_pos(self, patch_size: int, device: torch.device, dtype: torch.dtype) -> torch.Tensor:
-        """
-        Generates and caches 2D DCT-like positional embeddings for a given patch size.
-
-        The LRU cache is a performance optimization that avoids recomputing the
-        same positional grid on every forward pass.
-
-        Args:
-            patch_size (int): The side length of the square input patch.
-            device: The torch device to create the tensors on.
-            dtype: The torch dtype for the tensors.
-
-        Returns:
-            A tensor of shape (1, patch_size^2, max_freqs^2) containing the
-            positional embeddings.
-        """
-        # Create normalized 1D coordinate grids from 0 to 1.
-        pos_x = torch.linspace(0, 1, patch_size, device=device, dtype=dtype)
-        pos_y = torch.linspace(0, 1, patch_size, device=device, dtype=dtype)
-
-        # Create a 2D meshgrid of coordinates.
-        pos_y, pos_x = torch.meshgrid(pos_y, pos_x, indexing="ij")
-
-        # Reshape positions to be broadcastable with frequencies.
-        # Shape becomes (patch_size^2, 1, 1).
-        pos_x = pos_x.reshape(-1, 1, 1)
-        pos_y = pos_y.reshape(-1, 1, 1)
-
-        # Create a 1D tensor of frequency values from 0 to max_freqs-1.
-        freqs = torch.linspace(0, self.max_freqs - 1, self.max_freqs, dtype=dtype, device=device)
-
-        # Reshape frequencies to be broadcastable for creating 2D basis functions.
-        # freqs_x shape: (1, max_freqs, 1)
-        # freqs_y shape: (1, 1, max_freqs)
-        freqs_x = freqs[None, :, None]
-        freqs_y = freqs[None, None, :]
-
-        # A custom weighting coefficient, not part of standard DCT.
-        # This seems to down-weight the contribution of higher-frequency interactions.
-        coeffs = (1 + freqs_x * freqs_y) ** -1
-
-        # Calculate the 1D cosine basis functions for x and y coordinates.
-        # This is the core of the DCT formulation.
-        dct_x = torch.cos(pos_x * freqs_x * torch.pi)
-        dct_y = torch.cos(pos_y * freqs_y * torch.pi)
-
-        # Combine the 1D basis functions to create 2D basis functions by element-wise
-        # multiplication, and apply the custom coefficients. Broadcasting handles the
-        # combination of all (pos_x, freqs_x) with all (pos_y, freqs_y).
-        # The result is flattened into a feature vector for each position.
-        dct = (dct_x * dct_y * coeffs).view(1, -1, self.max_freqs ** 2)
-
-        return dct
-
-    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
-        """
-        Forward pass for the embedder.
-
-        Args:
-            inputs (Tensor): The input tensor of shape (B, P^2, C).
-
-        Returns:
-            Tensor: The output tensor of shape (B, P^2, hidden_size_input).
-        """
-        # Get the batch size, number of pixels, and number of channels.
-        B, P2, C = inputs.shape
-
-        # Infer the patch side length from the number of pixels (P^2).
-        patch_size = int(P2 ** 0.5)
-
-        input_dtype = inputs.dtype
-        inputs = inputs.to(dtype=self.dtype)
-
-        # Fetch the pre-computed or cached positional embeddings.
-        dct = self.fetch_pos(patch_size, inputs.device, self.dtype)
-
-        # Repeat the positional embeddings for each item in the batch.
-        dct = dct.repeat(B, 1, 1)
-
-        # Concatenate the original input features with the positional embeddings
-        # along the feature dimension.
-        inputs = torch.cat((inputs, dct), dim=-1)
-
-        # Project the combined tensor to the target hidden size.
-        return self.embedder(inputs).to(dtype=input_dtype)
-
-
-class NerfGLUBlock(nn.Module):
-    """
-    A NerfBlock using a Gated Linear Unit (GLU) like MLP.
-    """
-    def __init__(self, hidden_size_s: int, hidden_size_x: int, mlp_ratio, dtype=None, device=None, operations=None):
-        super().__init__()
-        # The total number of parameters for the MLP is increased to accommodate
-        # the gate, value, and output projection matrices.
-        # We now need to generate parameters for 3 matrices.
-        total_params = 3 * hidden_size_x**2 * mlp_ratio
-        self.param_generator = operations.Linear(hidden_size_s, total_params, dtype=dtype, device=device)
-        self.norm = RMSNorm(hidden_size_x, dtype=dtype, device=device, operations=operations)
-        self.mlp_ratio = mlp_ratio
-
-
-    def forward(self, x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
-        batch_size, num_x, hidden_size_x = x.shape
-        mlp_params = self.param_generator(s)
-
-        # Split the generated parameters into three parts for the gate, value, and output projection.
-        fc1_gate_params, fc1_value_params, fc2_params = mlp_params.chunk(3, dim=-1)
-
-        # Reshape the parameters into matrices for batch matrix multiplication.
-        fc1_gate = fc1_gate_params.view(batch_size, hidden_size_x, hidden_size_x * self.mlp_ratio)
-        fc1_value = fc1_value_params.view(batch_size, hidden_size_x, hidden_size_x * self.mlp_ratio)
-        fc2 = fc2_params.view(batch_size, hidden_size_x * self.mlp_ratio, hidden_size_x)
-
-        # Normalize the generated weight matrices as in the original implementation.
-        fc1_gate = torch.nn.functional.normalize(fc1_gate, dim=-2)
-        fc1_value = torch.nn.functional.normalize(fc1_value, dim=-2)
-        fc2 = torch.nn.functional.normalize(fc2, dim=-2)
-
-        res_x = x
-        x = self.norm(x)
-
-        # Apply the final output projection.
-        x = torch.bmm(torch.nn.functional.silu(torch.bmm(x, fc1_gate)) * torch.bmm(x, fc1_value), fc2)
-
-        return x + res_x
-
-
-class NerfFinalLayer(nn.Module):
-    def __init__(self, hidden_size, out_channels, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.norm = RMSNorm(hidden_size, dtype=dtype, device=device, operations=operations)
-        self.linear = operations.Linear(hidden_size, out_channels, dtype=dtype, device=device)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        # RMSNorm normalizes over the last dimension, but our channel dim (C) is at dim=1.
-        # So we temporarily move the channel dimension to the end for the norm operation.
-        return self.linear(self.norm(x.movedim(1, -1))).movedim(-1, 1)
-
-
-class NerfFinalLayerConv(nn.Module):
-    def __init__(self, hidden_size: int, out_channels: int, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.norm = RMSNorm(hidden_size, dtype=dtype, device=device, operations=operations)
-        self.conv = operations.Conv2d(
-            in_channels=hidden_size,
-            out_channels=out_channels,
-            kernel_size=3,
-            padding=1,
-            dtype=dtype,
-            device=device,
-        )
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        # RMSNorm normalizes over the last dimension, but our channel dim (C) is at dim=1.
-        # So we temporarily move the channel dimension to the end for the norm operation.
-        return self.conv(self.norm(x.movedim(1, -1)).movedim(-1, 1))
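Note on the removed `NerfEmbedder.fetch_pos`: the docstrings describe a DCT-like grid of positional features, one row per pixel of the patch and `max_freqs ** 2` columns. The sketch below reproduces that layout in pure Python for illustration only. The function name `dct_like_features` is hypothetical, and the row and column ordering is my reading of the removed code (row-major positions with y varying slowest, feature index `fx * max_freqs + fy`).

```python
import math


def dct_like_features(patch_size: int, max_freqs: int) -> list[list[float]]:
    # Normalized coordinates in [0, 1], as produced by linspace(0, 1, patch_size).
    if patch_size > 1:
        coords = [i / (patch_size - 1) for i in range(patch_size)]
    else:
        coords = [0.0]

    rows = []
    for y in coords:  # "ij" meshgrid order: y varies slowest
        for x in coords:
            feats = []
            for fx in range(max_freqs):
                for fy in range(max_freqs):
                    # Down-weights products of higher frequencies.
                    coeff = 1.0 / (1 + fx * fy)
                    basis = math.cos(math.pi * x * fx) * math.cos(math.pi * y * fy)
                    feats.append(basis * coeff)
            rows.append(feats)
    return rows


grid = dct_like_features(patch_size=2, max_freqs=3)
print(len(grid), len(grid[0]))  # prints: 4 9
```

At position (0, 0) every cosine equals 1, so that row contains just the weighting coefficients `1 / (1 + fx * fy)`.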
--- a/comfy/ldm/chroma_radiance/model.py
+++ b/comfy/ldm/chroma_radiance/model.py
@@ -1,329 +0,0 @@
-# Credits:
-# Original Flux code can be found on: https://github.com/black-forest-labs/flux
-# Chroma Radiance adaption referenced from https://github.com/lodestone-rock/flow
-
-from dataclasses import dataclass
-from typing import Optional
-
-import torch
-from torch import Tensor, nn
-from einops import repeat
-import comfy.ldm.common_dit
-
-from comfy.ldm.flux.layers import EmbedND
-
-from comfy.ldm.chroma.model import Chroma, ChromaParams
-from comfy.ldm.chroma.layers import (
-    DoubleStreamBlock,
-    SingleStreamBlock,
-    Approximator,
-)
-from .layers import (
-    NerfEmbedder,
-    NerfGLUBlock,
-    NerfFinalLayer,
-    NerfFinalLayerConv,
-)
-
-
-@dataclass
-class ChromaRadianceParams(ChromaParams):
-    patch_size: int
-    nerf_hidden_size: int
-    nerf_mlp_ratio: int
-    nerf_depth: int
-    nerf_max_freqs: int
-    # Setting nerf_tile_size to 0 disables tiling.
-    nerf_tile_size: int
-    # Currently one of linear (legacy) or conv.
-    nerf_final_head_type: str
-    # None means use the same dtype as the model.
-    nerf_embedder_dtype: Optional[torch.dtype]
-
-
-class ChromaRadiance(Chroma):
-    """
-    Transformer model for flow matching on sequences.
-    """
-
-    def __init__(self, image_model=None, final_layer=True, dtype=None, device=None, operations=None, **kwargs):
-        if operations is None:
-            raise RuntimeError("Attempt to create ChromaRadiance object without setting operations")
-        nn.Module.__init__(self)
-        self.dtype = dtype
-        params = ChromaRadianceParams(**kwargs)
-        self.params = params
-        self.patch_size = params.patch_size
-        self.in_channels = params.in_channels
-        self.out_channels = params.out_channels
-        if params.hidden_size % params.num_heads != 0:
-            raise ValueError(
-                f"Hidden size {params.hidden_size} must be divisible by num_heads {params.num_heads}"
-            )
-        pe_dim = params.hidden_size // params.num_heads
-        if sum(params.axes_dim) != pe_dim:
-            raise ValueError(f"Got {params.axes_dim} but expected positional dim {pe_dim}")
-        self.hidden_size = params.hidden_size
-        self.num_heads = params.num_heads
-        self.in_dim = params.in_dim
-        self.out_dim = params.out_dim
-        self.hidden_dim = params.hidden_dim
-        self.n_layers = params.n_layers
-        self.pe_embedder = EmbedND(dim=pe_dim, theta=params.theta, axes_dim=params.axes_dim)
-        self.img_in_patch = operations.Conv2d(
-            params.in_channels,
-            params.hidden_size,
-            kernel_size=params.patch_size,
-            stride=params.patch_size,
-            bias=True,
-            dtype=dtype,
-            device=device,
-        )
-        self.txt_in = operations.Linear(params.context_in_dim, self.hidden_size, dtype=dtype, device=device)
-        # set as nn identity for now, will overwrite it later.
-        self.distilled_guidance_layer = Approximator(
-                    in_dim=self.in_dim,
-                    hidden_dim=self.hidden_dim,
-                    out_dim=self.out_dim,
-                    n_layers=self.n_layers,
-                    dtype=dtype, device=device, operations=operations
-                )
-
-
-        self.double_blocks = nn.ModuleList(
-            [
-                DoubleStreamBlock(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=params.mlp_ratio,
-                    qkv_bias=params.qkv_bias,
-                    dtype=dtype, device=device, operations=operations
-                )
-                for _ in range(params.depth)
-            ]
-        )
-
-        self.single_blocks = nn.ModuleList(
-            [
-                SingleStreamBlock(
-                    self.hidden_size,
-                    self.num_heads,
-                    mlp_ratio=params.mlp_ratio,
-                    dtype=dtype, device=device, operations=operations,
-                )
-                for _ in range(params.depth_single_blocks)
-            ]
-        )
-
-        # pixel channel concat with DCT
-        self.nerf_image_embedder = NerfEmbedder(
-            in_channels=params.in_channels,
-            hidden_size_input=params.nerf_hidden_size,
-            max_freqs=params.nerf_max_freqs,
-            dtype=params.nerf_embedder_dtype or dtype,
-            device=device,
-            operations=operations,
-        )
-
-        self.nerf_blocks = nn.ModuleList([
-            NerfGLUBlock(
-                hidden_size_s=params.hidden_size,
-                hidden_size_x=params.nerf_hidden_size,
-                mlp_ratio=params.nerf_mlp_ratio,
-                dtype=dtype,
-                device=device,
-                operations=operations,
-            ) for _ in range(params.nerf_depth)
-        ])
-
-        if params.nerf_final_head_type == "linear":
-            self.nerf_final_layer = NerfFinalLayer(
-                params.nerf_hidden_size,
-                out_channels=params.in_channels,
-                dtype=dtype,
-                device=device,
-                operations=operations,
-            )
-        elif params.nerf_final_head_type == "conv":
-            self.nerf_final_layer_conv = NerfFinalLayerConv(
-                params.nerf_hidden_size,
-                out_channels=params.in_channels,
-                dtype=dtype,
-                device=device,
-                operations=operations,
-            )
-        else:
-            errstr = f"Unsupported nerf_final_head_type {params.nerf_final_head_type}"
-            raise ValueError(errstr)
-
-        self.skip_mmdit = []
-        self.skip_dit = []
-        self.lite = False
-
-    @property
-    def _nerf_final_layer(self) -> nn.Module:
-        if self.params.nerf_final_head_type == "linear":
-            return self.nerf_final_layer
-        if self.params.nerf_final_head_type == "conv":
-            return self.nerf_final_layer_conv
-        # Impossible to get here as we raise an error on unexpected types on initialization.
-        raise NotImplementedError
-
-    def img_in(self, img: Tensor) -> Tensor:
-        img = self.img_in_patch(img) # -> [B, Hidden, H/P, W/P]
-        # flatten into a sequence for the transformer.
-        return img.flatten(2).transpose(1, 2) # -> [B, NumPatches, Hidden]
-
-    def forward_nerf(
-        self,
-        img_orig: Tensor,
-        img_out: Tensor,
-        params: ChromaRadianceParams,
-    ) -> Tensor:
-        B, C, H, W = img_orig.shape
-        num_patches = img_out.shape[1]
-        patch_size = params.patch_size
-
-        # Store the raw pixel values of each patch for the NeRF head later.
-        # unfold creates patches: [B, C * P * P, NumPatches]
-        nerf_pixels = nn.functional.unfold(img_orig, kernel_size=patch_size, stride=patch_size)
-        nerf_pixels = nerf_pixels.transpose(1, 2) # -> [B, NumPatches, C * P * P]
-
-        if params.nerf_tile_size > 0 and num_patches > params.nerf_tile_size:
-            # Enable tiling if nerf_tile_size isn't 0 and we actually have more patches than
-            # the tile size.
-            img_dct = self.forward_tiled_nerf(img_out, nerf_pixels, B, C, num_patches, patch_size, params)
-        else:
-            # Reshape for per-patch processing
-            nerf_hidden = img_out.reshape(B * num_patches, params.hidden_size)
-            nerf_pixels = nerf_pixels.reshape(B * num_patches, C, patch_size**2).transpose(1, 2)
-
-            # Get DCT-encoded pixel embeddings [pixel-dct]
-            img_dct = self.nerf_image_embedder(nerf_pixels)
-
-            # Pass through the dynamic MLP blocks (the NeRF)
-            for block in self.nerf_blocks:
-                img_dct = block(img_dct, nerf_hidden)
-
-        # Reassemble the patches into the final image.
-        img_dct = img_dct.transpose(1, 2) # -> [B*NumPatches, C, P*P]
-        # Reshape to combine with batch dimension for fold
-        img_dct = img_dct.reshape(B, num_patches, -1) # -> [B, NumPatches, C*P*P]
-        img_dct = img_dct.transpose(1, 2) # -> [B, C*P*P, NumPatches]
-        img_dct = nn.functional.fold(
-            img_dct,
-            output_size=(H, W),
-            kernel_size=patch_size,
-            stride=patch_size,
-        )
-        return self._nerf_final_layer(img_dct)
-
-    def forward_tiled_nerf(
-        self,
-        nerf_hidden: Tensor,
-        nerf_pixels: Tensor,
-        batch: int,
-        channels: int,
-        num_patches: int,
-        patch_size: int,
-        params: ChromaRadianceParams,
-    ) -> Tensor:
-        """
-        Processes the NeRF head in tiles to save memory.
-        nerf_hidden has shape [B, L, D]
-        nerf_pixels has shape [B, L, C * P * P]
-        """
-        tile_size = params.nerf_tile_size
-        output_tiles = []
-        # Iterate over the patches in tiles. The dimension L (num_patches) is at index 1.
-        for i in range(0, num_patches, tile_size):
-            end = min(i + tile_size, num_patches)
-
-            # Slice the current tile from the input tensors
-            nerf_hidden_tile = nerf_hidden[:, i:end, :]
-            nerf_pixels_tile = nerf_pixels[:, i:end, :]
-
-            # Get the actual number of patches in this tile (can be smaller for the last tile)
-            num_patches_tile = nerf_hidden_tile.shape[1]
-
-            # Reshape the tile for per-patch processing
-            # [B, NumPatches_tile, D] -> [B * NumPatches_tile, D]
-            nerf_hidden_tile = nerf_hidden_tile.reshape(batch * num_patches_tile, params.hidden_size)
-            # [B, NumPatches_tile, C*P*P] -> [B*NumPatches_tile, C, P*P] -> [B*NumPatches_tile, P*P, C]
-            nerf_pixels_tile = nerf_pixels_tile.reshape(batch * num_patches_tile, channels, patch_size**2).transpose(1, 2)
-
-            # get DCT-encoded pixel embeddings [pixel-dct]
-            img_dct_tile = self.nerf_image_embedder(nerf_pixels_tile)
-
-            # pass through the dynamic MLP blocks (the NeRF)
-            for block in self.nerf_blocks:
-                img_dct_tile = block(img_dct_tile, nerf_hidden_tile)
-
-            output_tiles.append(img_dct_tile)
-
-        # Concatenate the processed tiles along the patch dimension
-        return torch.cat(output_tiles, dim=0)
-
-    def radiance_get_override_params(self, overrides: dict) -> ChromaRadianceParams:
-        params = self.params
-        if not overrides:
-            return params
-        params_dict = {k: getattr(params, k) for k in params.__dataclass_fields__}
-        nullable_keys = frozenset(("nerf_embedder_dtype",))
-        bad_keys = tuple(k for k in overrides if k not in params_dict)
-        if bad_keys:
-            e = f"Unknown key(s) in transformer_options chroma_radiance_options: {', '.join(bad_keys)}"
-            raise ValueError(e)
-        bad_keys = tuple(
-            k
-            for k, v in overrides.items()
-            if type(v) != type(getattr(params, k)) and (v is not None or k not in nullable_keys)
-        )
-        if bad_keys:
-            e = f"Invalid value(s) in transformer_options chroma_radiance_options: {', '.join(bad_keys)}"
-            raise ValueError(e)
-        # At this point it's all valid keys and values so we can merge with the existing params.
-        params_dict |= overrides
-        return params.__class__(**params_dict)
-
-    def _forward(
-        self,
-        x: Tensor,
-        timestep: Tensor,
-        context: Tensor,
-        guidance: Optional[Tensor],
-        control: Optional[dict]=None,
-        transformer_options: dict={},
-        **kwargs: dict,
-    ) -> Tensor:
-        bs, c, h, w = x.shape
-        img = comfy.ldm.common_dit.pad_to_patch_size(x, (self.patch_size, self.patch_size))
-
-        if img.ndim != 4:
-            raise ValueError("Input img tensor must be in [B, C, H, W] format.")
-        if context.ndim != 3:
-            raise ValueError("Input txt tensors must have 3 dimensions.")
-
-        params = self.radiance_get_override_params(transformer_options.get("chroma_radiance_options", {}))
-
-        h_len = (img.shape[-2] // self.patch_size)
-        w_len = (img.shape[-1] // self.patch_size)
-
-        img_ids = torch.zeros((h_len, w_len, 3), device=x.device, dtype=x.dtype)
-        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(0, h_len - 1, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1)
-        img_ids[:, :, 2] = img_ids[:, :, 2] + torch.linspace(0, w_len - 1, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0)
-        img_ids = repeat(img_ids, "h w c -> b (h w) c", b=bs)
-        txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
-
-        img_out = self.forward_orig(
-            img,
-            img_ids,
-            context,
-            txt_ids,
-            timestep,
-            guidance,
-            control,
-            transformer_options,
-            attn_mask=kwargs.get("attention_mask", None),
-        )
-        return self.forward_nerf(img, img_out, params)[:, :, :h, :w]
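Note on the removed `radiance_get_override_params`: it validates a dict of overrides against the fields of a dataclass (unknown keys, then mismatched value types, with `None` allowed only for nullable fields) and builds a new params object. A torch-free sketch of the same pattern is below. `DemoParams`, `NULLABLE_KEYS` and `apply_overrides` are hypothetical names, and the dtype field is a plain string here only to keep the example self-contained.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass
class DemoParams:  # hypothetical stand-in for ChromaRadianceParams
    patch_size: int
    nerf_tile_size: int
    nerf_final_head_type: str
    nerf_embedder_dtype: Optional[str]  # a string only to avoid a torch dependency


NULLABLE_KEYS = frozenset({"nerf_embedder_dtype"})


def apply_overrides(params: DemoParams, overrides: dict) -> DemoParams:
    if not overrides:
        return params
    known = {f.name for f in fields(params)}
    unknown = [k for k in overrides if k not in known]
    if unknown:
        raise ValueError(f"Unknown key(s): {', '.join(unknown)}")
    # Exact type match against the current value; None passes only for nullable keys.
    bad = [
        k
        for k, v in overrides.items()
        if type(v) is not type(getattr(params, k))
        and (v is not None or k not in NULLABLE_KEYS)
    ]
    if bad:
        raise ValueError(f"Invalid value(s): {', '.join(bad)}")
    # Returns a new object; the original params are left untouched.
    return replace(params, **overrides)


base = DemoParams(patch_size=16, nerf_tile_size=0, nerf_final_head_type="conv", nerf_embedder_dtype=None)
tiled = apply_overrides(base, {"nerf_tile_size": 32})
print(tiled.nerf_tile_size, base.nerf_tile_size)  # prints: 32 0
```

One consequence of comparing against the type of the current value, in this sketch as in the removed code: when a nullable field is currently `None`, a non-`None` override for it is rejected.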
--- a/comfy/ldm/cosmos/blocks.py
+++ b/comfy/ldm/cosmos/blocks.py
@@ -176,7 +176,6 @@ class Attention(nn.Module):
        context=None,
        mask=None,
        rope_emb=None,
-        transformer_options={},
        **kwargs,
    ):
        """
@@ -185,7 +184,7 @@ class Attention(nn.Module):
            context (Optional[Tensor]): The key tensor of shape [B, Mk, K] or use x as context [self attention] if None
        """
        q, k, v = self.cal_qkv(x, context, mask, rope_emb=rope_emb, **kwargs)
-        out = optimized_attention(q, k, v, self.heads, skip_reshape=True, mask=mask, skip_output_reshape=True, transformer_options=transformer_options)
+        out = optimized_attention(q, k, v, self.heads, skip_reshape=True, mask=mask, skip_output_reshape=True)
        del q, k, v
        out = rearrange(out, " b n s c -> s b (n c)")
        return self.to_out(out)
@@ -547,7 +546,6 @@ class VideoAttn(nn.Module):
        context: Optional[torch.Tensor] = None,
        crossattn_mask: Optional[torch.Tensor] = None,
        rope_emb_L_1_1_D: Optional[torch.Tensor] = None,
-        transformer_options: Optional[dict] = {},
    ) -> torch.Tensor:
        """
        Forward pass for video attention.
@@ -573,7 +571,6 @@ class VideoAttn(nn.Module):
            context_M_B_D,
            crossattn_mask,
            rope_emb=rope_emb_L_1_1_D,
-            transformer_options=transformer_options,
        )
        x_T_H_W_B_D = rearrange(x_THW_B_D, "(t h w) b d -> t h w b d", h=H, w=W)
        return x_T_H_W_B_D
@@ -668,7 +665,6 @@ class DITBuildingBlock(nn.Module):
        crossattn_mask: Optional[torch.Tensor] = None,
        rope_emb_L_1_1_D: Optional[torch.Tensor] = None,
        adaln_lora_B_3D: Optional[torch.Tensor] = None,
-        transformer_options: Optional[dict] = {},
    ) -> torch.Tensor:
        """
        Forward pass for dynamically configured blocks with adaptive normalization.
@@ -706,7 +702,6 @@ class DITBuildingBlock(nn.Module):
                adaln_norm_state(self.norm_state, x, scale_1_1_1_B_D, shift_1_1_1_B_D),
                context=None,
                rope_emb_L_1_1_D=rope_emb_L_1_1_D,
-                transformer_options=transformer_options,
            )
        elif self.block_type in ["cross_attn", "ca"]:
            x = x + gate_1_1_1_B_D * self.block(
@@ -714,7 +709,6 @@ class DITBuildingBlock(nn.Module):
                context=crossattn_emb,
                crossattn_mask=crossattn_mask,
                rope_emb_L_1_1_D=rope_emb_L_1_1_D,
-                transformer_options=transformer_options,
            )
        else:
            raise ValueError(f"Unknown block type: {self.block_type}")
@@ -790,7 +784,6 @@ class GeneralDITTransformerBlock(nn.Module):
        crossattn_mask: Optional[torch.Tensor] = None,
        rope_emb_L_1_1_D: Optional[torch.Tensor] = None,
        adaln_lora_B_3D: Optional[torch.Tensor] = None,
-        transformer_options: Optional[dict] = {},
    ) -> torch.Tensor:
        for block in self.blocks:
            x = block(
@@ -800,6 +793,5 @@ class GeneralDITTransformerBlock(nn.Module):
                crossattn_mask,
                rope_emb_L_1_1_D=rope_emb_L_1_1_D,
                adaln_lora_B_3D=adaln_lora_B_3D,
-                transformer_options=transformer_options,
            )
        return x
--- a/comfy/ldm/cosmos/cosmos_tokenizer/utils.py
+++ b/comfy/ldm/cosmos/cosmos_tokenizer/utils.py
@@ -58,8 +58,7 @@ def is_odd(n: int) -> bool:


 def nonlinearity(x):
-    # x * sigmoid(x)
-    return torch.nn.functional.silu(x)
+    return x * torch.sigmoid(x)


 def Normalize(in_channels, num_groups=32):
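Note on the hunk above: both sides compute the same function. SiLU (also called swish) is defined as `x * sigmoid(x)`, which is what the removed comment says and what the replacement line spells out. A scalar sketch using only the standard library, with hypothetical helper names:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def nonlinearity(x: float) -> float:
    # Scalar analogue of the replacement line: x * sigmoid(x).
    return x * sigmoid(x)


def silu_reference(x: float) -> float:
    # Algebraically rearranged SiLU: x / (1 + exp(-x)).
    return x / (1.0 + math.exp(-x))


for value in (-5.0, -1.0, 0.0, 0.5, 3.0):
    assert abs(nonlinearity(value) - silu_reference(value)) < 1e-12
```

The sketch only checks mathematical equivalence on scalars. Any difference in speed or memory use between the two tensor forms is outside what it shows.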
--- a/comfy/ldm/cosmos/model.py
+++ b/comfy/ldm/cosmos/model.py
@@ -27,8 +27,6 @@ from torchvision import transforms
 from enum import Enum
 import logging

-import comfy.patcher_extension
-
 from .blocks import (
    FinalLayer,
    GeneralDITTransformerBlock,
@@ -437,42 +435,6 @@ class GeneralDIT(nn.Module):
        latent_condition_sigma: Optional[torch.Tensor] = None,
        condition_video_augment_sigma: Optional[torch.Tensor] = None,
        **kwargs,
-    ):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, kwargs.get("transformer_options", {}))
-        ).execute(x,
-                timesteps,
-                context,
-                attention_mask,
-                fps,
-                image_size,
-                padding_mask,
-                scalar_feature,
-                data_type,
-                latent_condition,
-                latent_condition_sigma,
-                condition_video_augment_sigma,
-                **kwargs)
-
-    def _forward(
-        self,
-        x: torch.Tensor,
-        timesteps: torch.Tensor,
-        context: torch.Tensor,
-        attention_mask: Optional[torch.Tensor] = None,
-        # crossattn_emb: torch.Tensor,
-        # crossattn_mask: Optional[torch.Tensor] = None,
-        fps: Optional[torch.Tensor] = None,
-        image_size: Optional[torch.Tensor] = None,
-        padding_mask: Optional[torch.Tensor] = None,
-        scalar_feature: Optional[torch.Tensor] = None,
-        data_type: Optional[DataType] = DataType.VIDEO,
-        latent_condition: Optional[torch.Tensor] = None,
-        latent_condition_sigma: Optional[torch.Tensor] = None,
-        condition_video_augment_sigma: Optional[torch.Tensor] = None,
-        **kwargs,
    ):
        """
        Args:
@@ -520,7 +482,6 @@ class GeneralDIT(nn.Module):
                x.shape == extra_pos_emb_B_T_H_W_D_or_T_H_W_B_D.shape
            ), f"{x.shape} != {extra_pos_emb_B_T_H_W_D_or_T_H_W_B_D.shape} {original_shape}"

-        transformer_options = kwargs.get("transformer_options", {})
        for _, block in self.blocks.items():
            assert (
                self.blocks["block0"].x_format == block.x_format
@@ -535,7 +496,6 @@ class GeneralDIT(nn.Module):
                crossattn_mask,
                rope_emb_L_1_1_D=rope_emb_L_1_1_D,
                adaln_lora_B_3D=adaln_lora_B_3D,
-                transformer_options=transformer_options,
            )

        x_B_T_H_W_D = rearrange(x, "T H W B D -> B T H W D")
--- a/comfy/ldm/cosmos/predict2.py
+++ b/comfy/ldm/cosmos/predict2.py
@@ -11,7 +11,6 @@ import math
 from .position_embedding import VideoRopePosition3DEmb, LearnablePosEmbAxis
 from torchvision import transforms

-import comfy.patcher_extension
 from comfy.ldm.modules.attention import optimized_attention

 def apply_rotary_pos_emb(
@@ -44,7 +43,7 @@ class GPT2FeedForward(nn.Module):
        return x


-def torch_attention_op(q_B_S_H_D: torch.Tensor, k_B_S_H_D: torch.Tensor, v_B_S_H_D: torch.Tensor, transformer_options: Optional[dict] = {}) -> torch.Tensor:
+def torch_attention_op(q_B_S_H_D: torch.Tensor, k_B_S_H_D: torch.Tensor, v_B_S_H_D: torch.Tensor) -> torch.Tensor:
    """Computes multi-head attention using PyTorch's native implementation.

    This function provides a PyTorch backend alternative to Transformer Engine's attention operation.
@@ -71,7 +70,11 @@ def torch_attention_op(q_B_S_H_D: torch.Tensor, k_B_S_H_D: torch.Tensor, v_B_S_H
    q_B_H_S_D = rearrange(q_B_S_H_D, "b ... h k -> b h ... k").view(in_q_shape[0], in_q_shape[-2], -1, in_q_shape[-1])
    k_B_H_S_D = rearrange(k_B_S_H_D, "b ... h v -> b h ... v").view(in_k_shape[0], in_k_shape[-2], -1, in_k_shape[-1])
    v_B_H_S_D = rearrange(v_B_S_H_D, "b ... h v -> b h ... v").view(in_k_shape[0], in_k_shape[-2], -1, in_k_shape[-1])
-    return optimized_attention(q_B_H_S_D, k_B_H_S_D, v_B_H_S_D, in_q_shape[-2], skip_reshape=True, transformer_options=transformer_options)
+    result_B_S_HD = rearrange(
+        optimized_attention(q_B_H_S_D, k_B_H_S_D, v_B_H_S_D, in_q_shape[-2], skip_reshape=True, skip_output_reshape=True), "b h ... l -> b ... (h l)"
+    )
+
+    return result_B_S_HD


 class Attention(nn.Module):
@@ -180,8 +183,8 @@ class Attention(nn.Module):

        return q, k, v

-    def compute_attention(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, transformer_options: Optional[dict] = {}) -> torch.Tensor:
-        result = self.attn_op(q, k, v, transformer_options=transformer_options)  # [B, S, H, D]
+    def compute_attention(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
+        result = self.attn_op(q, k, v)  # [B, S, H, D]
        return self.output_dropout(self.output_proj(result))

    def forward(
@@ -189,7 +192,6 @@ class Attention(nn.Module):
        x: torch.Tensor,
        context: Optional[torch.Tensor] = None,
        rope_emb: Optional[torch.Tensor] = None,
-        transformer_options: Optional[dict] = {},
    ) -> torch.Tensor:
        """
        Args:
@@ -197,7 +199,7 @@ class Attention(nn.Module):
            context (Optional[Tensor]): The key tensor of shape [B, Mk, K] or use x as context [self attention] if None
        """
        q, k, v = self.compute_qkv(x, context, rope_emb=rope_emb)
-        return self.compute_attention(q, k, v, transformer_options=transformer_options)
+        return self.compute_attention(q, k, v)


 class Timesteps(nn.Module):
@@ -460,7 +462,6 @@ class Block(nn.Module):
        rope_emb_L_1_1_D: Optional[torch.Tensor] = None,
        adaln_lora_B_T_3D: Optional[torch.Tensor] = None,
        extra_per_block_pos_emb: Optional[torch.Tensor] = None,
-        transformer_options: Optional[dict] = {},
    ) -> torch.Tensor:
        if extra_per_block_pos_emb is not None:
            x_B_T_H_W_D = x_B_T_H_W_D + extra_per_block_pos_emb
@@ -514,7 +515,6 @@ class Block(nn.Module):
                rearrange(normalized_x_B_T_H_W_D, "b t h w d -> b (t h w) d"),
                None,
                rope_emb=rope_emb_L_1_1_D,
-                transformer_options=transformer_options,
            ),
            "b (t h w) d -> b t h w d",
            t=T,
@@ -528,7 +528,6 @@ class Block(nn.Module):
            layer_norm_cross_attn: Callable,
            _scale_cross_attn_B_T_1_1_D: torch.Tensor,
            _shift_cross_attn_B_T_1_1_D: torch.Tensor,
-            transformer_options: Optional[dict] = {},
        ) -> torch.Tensor:
            _normalized_x_B_T_H_W_D = _fn(
                _x_B_T_H_W_D, layer_norm_cross_attn, _scale_cross_attn_B_T_1_1_D, _shift_cross_attn_B_T_1_1_D
@@ -538,7 +537,6 @@ class Block(nn.Module):
                    rearrange(_normalized_x_B_T_H_W_D, "b t h w d -> b (t h w) d"),
                    crossattn_emb,
                    rope_emb=rope_emb_L_1_1_D,
-                    transformer_options=transformer_options,
                ),
                "b (t h w) d -> b t h w d",
                t=T,
@@ -552,7 +550,6 @@ class Block(nn.Module):
            self.layer_norm_cross_attn,
            scale_cross_attn_B_T_1_1_D,
            shift_cross_attn_B_T_1_1_D,
-            transformer_options=transformer_options,
        )
        x_B_T_H_W_D = result_B_T_H_W_D * gate_cross_attn_B_T_1_1_D + x_B_T_H_W_D

@@ -812,21 +809,7 @@ class MiniTrainDIT(nn.Module):
        )
        return x_B_C_Tt_Hp_Wp

-    def forward(self,
-        x: torch.Tensor,
-        timesteps: torch.Tensor,
-        context: torch.Tensor,
-        fps: Optional[torch.Tensor] = None,
-        padding_mask: Optional[torch.Tensor] = None,
-        **kwargs,
-    ):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, kwargs.get("transformer_options", {}))
-        ).execute(x, timesteps, context, fps, padding_mask, **kwargs)
-
-    def _forward(
+    def forward(
        self,
        x: torch.Tensor,
        timesteps: torch.Tensor,
@@ -871,7 +854,6 @@ class MiniTrainDIT(nn.Module):
            "rope_emb_L_1_1_D": rope_emb_L_1_1_D.unsqueeze(1).unsqueeze(0),
            "adaln_lora_B_T_3D": adaln_lora_B_T_3D,
            "extra_per_block_pos_emb": extra_pos_emb_B_T_H_W_D_or_T_H_W_B_D,
-            "transformer_options": kwargs.get("transformer_options", {}),
        }
        for block in self.blocks:
            x_B_T_H_W_D = block(
--- a/comfy/ldm/flux/controlnet.py
+++ b/comfy/ldm/flux/controlnet.py
@@ -123,8 +123,6 @@ class ControlNetFlux(Flux):

        if y is None:
            y = torch.zeros((img.shape[0], self.params.vec_in_dim), device=img.device, dtype=img.dtype)
-        else:
-            y = y[:, :self.params.vec_in_dim]

        # running on sequences img
        img = self.img_in(img)
--- a/comfy/ldm/flux/layers.py
+++ b/comfy/ldm/flux/layers.py
@@ -118,7 +118,7 @@ class Modulation(nn.Module):
 def apply_mod(tensor, m_mult, m_add=None, modulation_dims=None):
    if modulation_dims is None:
        if m_add is not None:
-            return torch.addcmul(m_add, tensor, m_mult)
+            return tensor * m_mult + m_add
        else:
            return tensor * m_mult
    else:
@@ -159,7 +159,7 @@ class DoubleStreamBlock(nn.Module):
        )
        self.flipped_img_txt = flipped_img_txt

-    def forward(self, img: Tensor, txt: Tensor, vec: Tensor, pe: Tensor, attn_mask=None, modulation_dims_img=None, modulation_dims_txt=None, transformer_options={}):
+    def forward(self, img: Tensor, txt: Tensor, vec: Tensor, pe: Tensor, attn_mask=None, modulation_dims_img=None, modulation_dims_txt=None):
        img_mod1, img_mod2 = self.img_mod(vec)
        txt_mod1, txt_mod2 = self.txt_mod(vec)

@@ -182,7 +182,7 @@ class DoubleStreamBlock(nn.Module):
            attn = attention(torch.cat((img_q, txt_q), dim=2),
                             torch.cat((img_k, txt_k), dim=2),
                             torch.cat((img_v, txt_v), dim=2),
-                             pe=pe, mask=attn_mask, transformer_options=transformer_options)
+                             pe=pe, mask=attn_mask)

            img_attn, txt_attn = attn[:, : img.shape[1]], attn[:, img.shape[1]:]
        else:
@@ -190,7 +190,7 @@ class DoubleStreamBlock(nn.Module):
            attn = attention(torch.cat((txt_q, img_q), dim=2),
                             torch.cat((txt_k, img_k), dim=2),
                             torch.cat((txt_v, img_v), dim=2),
-                             pe=pe, mask=attn_mask, transformer_options=transformer_options)
+                             pe=pe, mask=attn_mask)

            txt_attn, img_attn = attn[:, : txt.shape[1]], attn[:, txt.shape[1]:]

@@ -244,7 +244,7 @@ class SingleStreamBlock(nn.Module):
        self.mlp_act = nn.GELU(approximate="tanh")
        self.modulation = Modulation(hidden_size, double=False, dtype=dtype, device=device, operations=operations)

-    def forward(self, x: Tensor, vec: Tensor, pe: Tensor, attn_mask=None, modulation_dims=None, transformer_options={}) -> Tensor:
+    def forward(self, x: Tensor, vec: Tensor, pe: Tensor, attn_mask=None, modulation_dims=None) -> Tensor:
        mod, _ = self.modulation(vec)
        qkv, mlp = torch.split(self.linear1(apply_mod(self.pre_norm(x), (1 + mod.scale), mod.shift, modulation_dims)), [3 * self.hidden_size, self.mlp_hidden_dim], dim=-1)

@@ -252,7 +252,7 @@ class SingleStreamBlock(nn.Module):
        q, k = self.norm(q, k, v)

        # compute attention
-        attn = attention(q, k, v, pe=pe, mask=attn_mask, transformer_options=transformer_options)
+        attn = attention(q, k, v, pe=pe, mask=attn_mask)
        # compute activation in mlp stream, cat again and run second linear layer
        output = self.linear2(torch.cat((attn, self.mlp_act(mlp)), 2))
        x += apply_mod(output, mod.gate, None, modulation_dims)
--- a/comfy/ldm/flux/math.py
+++ b/comfy/ldm/flux/math.py
@@ -6,7 +6,7 @@ from comfy.ldm.modules.attention import optimized_attention
 import comfy.model_management


-def attention(q: Tensor, k: Tensor, v: Tensor, pe: Tensor, mask=None, transformer_options={}) -> Tensor:
+def attention(q: Tensor, k: Tensor, v: Tensor, pe: Tensor, mask=None) -> Tensor:
    q_shape = q.shape
    k_shape = k.shape

@@ -17,7 +17,7 @@ def attention(q: Tensor, k: Tensor, v: Tensor, pe: Tensor, mask=None, transforme
        k = (pe[..., 0] * k[..., 0] + pe[..., 1] * k[..., 1]).reshape(*k_shape).type_as(v)

    heads = q.shape[1]
-    x = optimized_attention(q, k, v, heads, skip_reshape=True, mask=mask, transformer_options=transformer_options)
+    x = optimized_attention(q, k, v, heads, skip_reshape=True, mask=mask)
    return x


@@ -35,10 +35,11 @@ def rope(pos: Tensor, dim: int, theta: int) -> Tensor:
    out = rearrange(out, "b n d (i j) -> b n d i j", i=2, j=2)
    return out.to(dtype=torch.float32, device=pos.device)

-def apply_rope1(x: Tensor, freqs_cis: Tensor):
-    x_ = x.to(dtype=freqs_cis.dtype).reshape(*x.shape[:-1], -1, 1, 2)
-    x_out = freqs_cis[..., 0] * x_[..., 0] + freqs_cis[..., 1] * x_[..., 1]
-    return x_out.reshape(*x.shape).type_as(x)

 def apply_rope(xq: Tensor, xk: Tensor, freqs_cis: Tensor):
-    return apply_rope1(xq, freqs_cis), apply_rope1(xk, freqs_cis)
+    xq_ = xq.to(dtype=freqs_cis.dtype).reshape(*xq.shape[:-1], -1, 1, 2)
+    xk_ = xk.to(dtype=freqs_cis.dtype).reshape(*xk.shape[:-1], -1, 1, 2)
+    xq_out = freqs_cis[..., 0] * xq_[..., 0] + freqs_cis[..., 1] * xq_[..., 1]
+    xk_out = freqs_cis[..., 0] * xk_[..., 0] + freqs_cis[..., 1] * xk_[..., 1]
+    return xq_out.reshape(*xq.shape).type_as(xq), xk_out.reshape(*xk.shape).type_as(xk)
+
--- a/comfy/ldm/flux/model.py
+++ b/comfy/ldm/flux/model.py
@@ -6,7 +6,6 @@ import torch
 from torch import Tensor, nn
 from einops import rearrange, repeat
 import comfy.ldm.common_dit
-import comfy.patcher_extension

 from .layers import (
    DoubleStreamBlock,
@@ -106,7 +105,6 @@ class Flux(nn.Module):
        if y is None:
            y = torch.zeros((img.shape[0], self.params.vec_in_dim), device=img.device, dtype=img.dtype)

-        patches = transformer_options.get("patches", {})
        patches_replace = transformer_options.get("patches_replace", {})
        if img.ndim != 3 or txt.ndim != 3:
            raise ValueError("Input img and txt tensors must have 3 dimensions.")
@@ -118,17 +116,9 @@ class Flux(nn.Module):
            if guidance is not None:
                vec = vec + self.guidance_in(timestep_embedding(guidance, 256).to(img.dtype))

-        vec = vec + self.vector_in(y[:, :self.params.vec_in_dim])
+        vec = vec + self.vector_in(y[:,:self.params.vec_in_dim])
        txt = self.txt_in(txt)

-        if "post_input" in patches:
-            for p in patches["post_input"]:
-                out = p({"img": img, "txt": txt, "img_ids": img_ids, "txt_ids": txt_ids})
-                img = out["img"]
-                txt = out["txt"]
-                img_ids = out["img_ids"]
-                txt_ids = out["txt_ids"]
-
        if img_ids is not None:
            ids = torch.cat((txt_ids, img_ids), dim=1)
            pe = self.pe_embedder(ids)
@@ -144,16 +134,14 @@ class Flux(nn.Module):
                                                   txt=args["txt"],
                                                   vec=args["vec"],
                                                   pe=args["pe"],
-                                                   attn_mask=args.get("attn_mask"),
-                                                   transformer_options=args.get("transformer_options"))
+                                                   attn_mask=args.get("attn_mask"))
                    return out

                out = blocks_replace[("double_block", i)]({"img": img,
                                                           "txt": txt,
                                                           "vec": vec,
                                                           "pe": pe,
-                                                           "attn_mask": attn_mask,
-                                                           "transformer_options": transformer_options},
+                                                           "attn_mask": attn_mask},
                                                          {"original_block": block_wrap})
                txt = out["txt"]
                img = out["img"]
@@ -162,15 +150,14 @@ class Flux(nn.Module):
                                 txt=txt,
                                 vec=vec,
                                 pe=pe,
-                                 attn_mask=attn_mask,
-                                 transformer_options=transformer_options)
+                                 attn_mask=attn_mask)

            if control is not None: # Controlnet
                control_i = control.get("input")
                if i < len(control_i):
                    add = control_i[i]
                    if add is not None:
-                        img[:, :add.shape[1]] += add
+                        img += add

        if img.dtype == torch.float16:
            img = torch.nan_to_num(img, nan=0.0, posinf=65504, neginf=-65504)
@@ -184,97 +171,44 @@ class Flux(nn.Module):
                    out["img"] = block(args["img"],
                                       vec=args["vec"],
                                       pe=args["pe"],
-                                       attn_mask=args.get("attn_mask"),
-                                       transformer_options=args.get("transformer_options"))
+                                       attn_mask=args.get("attn_mask"))
                    return out

                out = blocks_replace[("single_block", i)]({"img": img,
                                                           "vec": vec,
                                                           "pe": pe,
-                                                           "attn_mask": attn_mask,
-                                                           "transformer_options": transformer_options},
+                                                           "attn_mask": attn_mask},
                                                          {"original_block": block_wrap})
                img = out["img"]
            else:
-                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask, transformer_options=transformer_options)
+                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask)

            if control is not None: # Controlnet
                control_o = control.get("output")
                if i < len(control_o):
                    add = control_o[i]
                    if add is not None:
-                        img[:, txt.shape[1] : txt.shape[1] + add.shape[1], ...] += add
+                        img[:, txt.shape[1] :, ...] += add

        img = img[:, txt.shape[1] :, ...]

        img = self.final_layer(img, vec)  # (N, T, patch_size ** 2 * out_channels)
        return img

-    def process_img(self, x, index=0, h_offset=0, w_offset=0):
+    def forward(self, x, timestep, context, y=None, guidance=None, control=None, transformer_options={}, **kwargs):
        bs, c, h, w = x.shape
        patch_size = self.patch_size
        x = comfy.ldm.common_dit.pad_to_patch_size(x, (patch_size, patch_size))

        img = rearrange(x, "b c (h ph) (w pw) -> b (h w) (c ph pw)", ph=patch_size, pw=patch_size)
+
        h_len = ((h + (patch_size // 2)) // patch_size)
        w_len = ((w + (patch_size // 2)) // patch_size)
-
-        h_offset = ((h_offset + (patch_size // 2)) // patch_size)
-        w_offset = ((w_offset + (patch_size // 2)) // patch_size)
-
        img_ids = torch.zeros((h_len, w_len, 3), device=x.device, dtype=x.dtype)
-        img_ids[:, :, 0] = img_ids[:, :, 1] + index
-        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(h_offset, h_len - 1 + h_offset, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1)
-        img_ids[:, :, 2] = img_ids[:, :, 2] + torch.linspace(w_offset, w_len - 1 + w_offset, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0)
-        return img, repeat(img_ids, "h w c -> b (h w) c", b=bs)
-
-    def forward(self, x, timestep, context, y=None, guidance=None, ref_latents=None, control=None, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, y, guidance, ref_latents, control, transformer_options, **kwargs)
-
-    def _forward(self, x, timestep, context, y=None, guidance=None, ref_latents=None, control=None, transformer_options={}, **kwargs):
-        bs, c, h_orig, w_orig = x.shape
-        patch_size = self.patch_size
-
-        h_len = ((h_orig + (patch_size // 2)) // patch_size)
-        w_len = ((w_orig + (patch_size // 2)) // patch_size)
-        img, img_ids = self.process_img(x)
-        img_tokens = img.shape[1]
-        if ref_latents is not None:
-            h = 0
-            w = 0
-            index = 0
-            ref_latents_method = kwargs.get("ref_latents_method", "offset")
-            for ref in ref_latents:
-                if ref_latents_method == "index":
-                    index += 1
-                    h_offset = 0
-                    w_offset = 0
-                elif ref_latents_method == "uxo":
-                    index = 0
-                    h_offset = h_len * patch_size + h
-                    w_offset = w_len * patch_size + w
-                    h += ref.shape[-2]
-                    w += ref.shape[-1]
-                else:
-                    index = 1
-                    h_offset = 0
-                    w_offset = 0
-                    if ref.shape[-2] + h > ref.shape[-1] + w:
-                        w_offset = w
-                    else:
-                        h_offset = h
-                    h = max(h, ref.shape[-2] + h_offset)
-                    w = max(w, ref.shape[-1] + w_offset)
-
-                kontext, kontext_ids = self.process_img(ref, index=index, h_offset=h_offset, w_offset=w_offset)
-                img = torch.cat([img, kontext], dim=1)
-                img_ids = torch.cat([img_ids, kontext_ids], dim=1)
+        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(0, h_len - 1, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1)
+        img_ids[:, :, 2] = img_ids[:, :, 2] + torch.linspace(0, w_len - 1, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0)
+        img_ids = repeat(img_ids, "h w c -> b (h w) c", b=bs)

        txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
        out = self.forward_orig(img, img_ids, context, txt_ids, timestep, y, guidance, control, transformer_options, attn_mask=kwargs.get("attention_mask", None))
-        out = out[:, :img_tokens]
-        return rearrange(out, "b (h w) (c ph pw) -> b c (h ph) (w pw)", h=h_len, w=w_len, ph=2, pw=2)[:,:,:h_orig,:w_orig]
+        return rearrange(out, "b (h w) (c ph pw) -> b c (h ph) (w pw)", h=h_len, w=w_len, ph=2, pw=2)[:,:,:h,:w]
--- a/comfy/ldm/genmo/joint_model/asymm_models_joint.py
+++ b/comfy/ldm/genmo/joint_model/asymm_models_joint.py
@@ -109,7 +109,6 @@ class AsymmetricAttention(nn.Module):
        scale_x: torch.Tensor,  # (B, dim_x), modulation for pre-RMSNorm.
        scale_y: torch.Tensor,  # (B, dim_y), modulation for pre-RMSNorm.
        crop_y,
-        transformer_options={},
        **rope_rotation,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        rope_cos = rope_rotation.get("rope_cos")
@@ -144,7 +143,7 @@ class AsymmetricAttention(nn.Module):

        xy = optimized_attention(q,
                                 k,
-                                 v, self.num_heads, skip_reshape=True, transformer_options=transformer_options)
+                                 v, self.num_heads, skip_reshape=True)

        x, y = torch.tensor_split(xy, (q_x.shape[1],), dim=1)
        x = self.proj_x(x)
@@ -225,7 +224,6 @@ class AsymmetricJointBlock(nn.Module):
        x: torch.Tensor,
        c: torch.Tensor,
        y: torch.Tensor,
-        transformer_options={},
        **attn_kwargs,
    ):
        """Forward pass of a block.
@@ -258,7 +256,6 @@ class AsymmetricJointBlock(nn.Module):
            y,
            scale_x=scale_msa_x,
            scale_y=scale_msa_y,
-            transformer_options=transformer_options,
            **attn_kwargs,
        )

@@ -527,11 +524,10 @@ class AsymmDiTJoint(nn.Module):
                                                    args["txt"],
                                                    rope_cos=args["rope_cos"],
                                                    rope_sin=args["rope_sin"],
-                                                    crop_y=args["num_tokens"],
-                                                    transformer_options=args["transformer_options"]
+                                                    crop_y=args["num_tokens"]
                                                    )
                    return out
-                out = blocks_replace[("double_block", i)]({"img": x, "txt": y_feat, "vec": c, "rope_cos": rope_cos, "rope_sin": rope_sin, "num_tokens": num_tokens, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("double_block", i)]({"img": x, "txt": y_feat, "vec": c, "rope_cos": rope_cos, "rope_sin": rope_sin, "num_tokens": num_tokens}, {"original_block": block_wrap})
                y_feat = out["txt"]
                x = out["img"]
            else:
@@ -542,7 +538,6 @@ class AsymmDiTJoint(nn.Module):
                    rope_cos=rope_cos,
                    rope_sin=rope_sin,
                    crop_y=num_tokens,
-                    transformer_options=transformer_options,
                )  # (B, M, D), (B, L, D)
        del y_feat  # Final layers don't use dense text features.

--- a/comfy/ldm/hidream/model.py
+++ b/comfy/ldm/hidream/model.py
@@ -13,7 +13,6 @@ from comfy.ldm.flux.layers import LastLayer

 from comfy.ldm.modules.attention import optimized_attention
 import comfy.model_management
-import comfy.patcher_extension
 import comfy.ldm.common_dit


@@ -72,8 +71,8 @@ class TimestepEmbed(nn.Module):
        return t_emb


-def attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, transformer_options={}):
-    return optimized_attention(query.view(query.shape[0], -1, query.shape[-1] * query.shape[-2]), key.view(key.shape[0], -1, key.shape[-1] * key.shape[-2]), value.view(value.shape[0], -1, value.shape[-1] * value.shape[-2]), query.shape[2], transformer_options=transformer_options)
+def attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor):
+    return optimized_attention(query.view(query.shape[0], -1, query.shape[-1] * query.shape[-2]), key.view(key.shape[0], -1, key.shape[-1] * key.shape[-2]), value.view(value.shape[0], -1, value.shape[-1] * value.shape[-2]), query.shape[2])


 class HiDreamAttnProcessor_flashattn:
@@ -86,7 +85,6 @@ class HiDreamAttnProcessor_flashattn:
        image_tokens_masks: Optional[torch.FloatTensor] = None,
        text_tokens: Optional[torch.FloatTensor] = None,
        rope: torch.FloatTensor = None,
-        transformer_options={},
        *args,
        **kwargs,
    ) -> torch.FloatTensor:
@@ -134,7 +132,7 @@ class HiDreamAttnProcessor_flashattn:
            query = torch.cat([query_1, query_2], dim=-1)
            key = torch.cat([key_1, key_2], dim=-1)

-        hidden_states = attention(query, key, value, transformer_options=transformer_options)
+        hidden_states = attention(query, key, value)

        if not attn.single:
            hidden_states_i, hidden_states_t = torch.split(hidden_states, [num_image_tokens, num_text_tokens], dim=1)
@@ -200,7 +198,6 @@ class HiDreamAttention(nn.Module):
        image_tokens_masks: torch.FloatTensor = None,
        norm_text_tokens: torch.FloatTensor = None,
        rope: torch.FloatTensor = None,
-        transformer_options={},
    ) -> torch.Tensor:
        return self.processor(
            self,
@@ -208,7 +205,6 @@ class HiDreamAttention(nn.Module):
            image_tokens_masks = image_tokens_masks,
            text_tokens = norm_text_tokens,
            rope = rope,
-            transformer_options=transformer_options,
        )


@@ -409,7 +405,7 @@ class HiDreamImageSingleTransformerBlock(nn.Module):
        text_tokens: Optional[torch.FloatTensor] = None,
        adaln_input: Optional[torch.FloatTensor] = None,
        rope: torch.FloatTensor = None,
-        transformer_options={},
+
    ) -> torch.FloatTensor:
        wtype = image_tokens.dtype
        shift_msa_i, scale_msa_i, gate_msa_i, shift_mlp_i, scale_mlp_i, gate_mlp_i = \
@@ -422,7 +418,6 @@ class HiDreamImageSingleTransformerBlock(nn.Module):
            norm_image_tokens,
            image_tokens_masks,
            rope = rope,
-            transformer_options=transformer_options,
        )
        image_tokens = gate_msa_i * attn_output_i + image_tokens

@@ -487,7 +482,6 @@ class HiDreamImageTransformerBlock(nn.Module):
        text_tokens: Optional[torch.FloatTensor] = None,
        adaln_input: Optional[torch.FloatTensor] = None,
        rope: torch.FloatTensor = None,
-        transformer_options={},
    ) -> torch.FloatTensor:
        wtype = image_tokens.dtype
        shift_msa_i, scale_msa_i, gate_msa_i, shift_mlp_i, scale_mlp_i, gate_mlp_i, \
@@ -505,7 +499,6 @@ class HiDreamImageTransformerBlock(nn.Module):
            image_tokens_masks,
            norm_text_tokens,
            rope = rope,
-            transformer_options=transformer_options,
        )

        image_tokens = gate_msa_i * attn_output_i + image_tokens
@@ -556,7 +549,6 @@ class HiDreamImageBlock(nn.Module):
        text_tokens: Optional[torch.FloatTensor] = None,
        adaln_input: torch.FloatTensor = None,
        rope: torch.FloatTensor = None,
-        transformer_options={},
    ) -> torch.FloatTensor:
        return self.block(
            image_tokens,
@@ -564,7 +556,6 @@ class HiDreamImageBlock(nn.Module):
            text_tokens,
            adaln_input,
            rope,
-            transformer_options=transformer_options,
        )


@@ -701,23 +692,7 @@ class HiDreamImageTransformer2DModel(nn.Module):
            raise NotImplementedError
        return x, x_masks, img_sizes

-    def forward(self,
-        x: torch.Tensor,
-        t: torch.Tensor,
-        y: Optional[torch.Tensor] = None,
-        context: Optional[torch.Tensor] = None,
-        encoder_hidden_states_llama3=None,
-        image_cond=None,
-        control = None,
-        transformer_options = {},
-    ):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, t, y, context, encoder_hidden_states_llama3, image_cond, control, transformer_options)
-
-    def _forward(
+    def forward(
        self,
        x: torch.Tensor,
        t: torch.Tensor,
@@ -794,7 +769,6 @@ class HiDreamImageTransformer2DModel(nn.Module):
                text_tokens = cur_encoder_hidden_states,
                adaln_input = adaln_input,
                rope = rope,
-                transformer_options=transformer_options,
            )
            initial_encoder_hidden_states = initial_encoder_hidden_states[:, :initial_encoder_hidden_states_seq_len]
            block_id += 1
@@ -818,7 +792,6 @@ class HiDreamImageTransformer2DModel(nn.Module):
                text_tokens=None,
                adaln_input=adaln_input,
                rope=rope,
-                transformer_options=transformer_options,
            )
            hidden_states = hidden_states[:, :hidden_states_seq_len]
            block_id += 1
--- a/comfy/ldm/hunyuan3d/model.py
+++ b/comfy/ldm/hunyuan3d/model.py
@@ -7,7 +7,6 @@ from comfy.ldm.flux.layers import (
    SingleStreamBlock,
    timestep_embedding,
 )
-import comfy.patcher_extension


 class Hunyuan3Dv2(nn.Module):
@@ -68,13 +67,6 @@ class Hunyuan3Dv2(nn.Module):
        self.final_layer = LastLayer(hidden_size, 1, in_channels, dtype=dtype, device=device, operations=operations)

    def forward(self, x, timestep, context, guidance=None, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, guidance, transformer_options, **kwargs)
-
-    def _forward(self, x, timestep, context, guidance=None, transformer_options={}, **kwargs):
        x = x.movedim(-1, -2)
        timestep = 1.0 - timestep
        txt = context
@@ -99,16 +91,14 @@ class Hunyuan3Dv2(nn.Module):
                                                   txt=args["txt"],
                                                   vec=args["vec"],
                                                   pe=args["pe"],
-                                                   attn_mask=args.get("attn_mask"),
-                                                   transformer_options=args["transformer_options"])
+                                                   attn_mask=args.get("attn_mask"))
                    return out

                out = blocks_replace[("double_block", i)]({"img": img,
                                                           "txt": txt,
                                                           "vec": vec,
                                                           "pe": pe,
-                                                           "attn_mask": attn_mask,
-                                                           "transformer_options": transformer_options},
+                                                           "attn_mask": attn_mask},
                                                          {"original_block": block_wrap})
                txt = out["txt"]
                img = out["img"]
@@ -117,8 +107,7 @@ class Hunyuan3Dv2(nn.Module):
                                 txt=txt,
                                 vec=vec,
                                 pe=pe,
-                                 attn_mask=attn_mask,
-                                 transformer_options=transformer_options)
+                                 attn_mask=attn_mask)

        img = torch.cat((txt, img), 1)

@@ -129,19 +118,17 @@ class Hunyuan3Dv2(nn.Module):
                    out["img"] = block(args["img"],
                                       vec=args["vec"],
                                       pe=args["pe"],
-                                       attn_mask=args.get("attn_mask"),
-                                       transformer_options=args["transformer_options"])
+                                       attn_mask=args.get("attn_mask"))
                    return out

                out = blocks_replace[("single_block", i)]({"img": img,
                                                           "vec": vec,
                                                           "pe": pe,
-                                                           "attn_mask": attn_mask,
-                                                           "transformer_options": transformer_options},
+                                                           "attn_mask": attn_mask},
                                                          {"original_block": block_wrap})
                img = out["img"]
            else:
-                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask, transformer_options=transformer_options)
+                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask)

        img = img[:, txt.shape[1]:, ...]
        img = self.final_layer(img, vec)
--- a/comfy/ldm/hunyuan3d/vae.py
+++ b/comfy/ldm/hunyuan3d/vae.py
@@ -4,458 +4,81 @@
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
+
+
+from typing import Union, Tuple, List, Callable, Optional
+
 import numpy as np
-import math
+from einops import repeat, rearrange
 from tqdm import tqdm
-
-from typing import Optional
-
 import logging

 import comfy.ops
 ops = comfy.ops.disable_weight_init

-def fps(src: torch.Tensor, batch: torch.Tensor, sampling_ratio: float, start_random: bool = True):
-
-    # manually create the pointer vector
-    assert src.size(0) == batch.numel()
-
-    batch_size = int(batch.max()) + 1
-    deg = src.new_zeros(batch_size, dtype = torch.long)
-
-    deg.scatter_add_(0, batch, torch.ones_like(batch))
-
-    ptr_vec = deg.new_zeros(batch_size + 1)
-    torch.cumsum(deg, 0, out=ptr_vec[1:])
-
-    #return fps_sampling(src, ptr_vec, ratio)
-    sampled_indicies = []
-
-    for b in range(batch_size):
-        # start and the end of each batch
-        start, end = ptr_vec[b].item(), ptr_vec[b + 1].item()
-        # points from the point cloud
-        points = src[start:end]
-
-        num_points = points.size(0)
-        num_samples = max(1, math.ceil(num_points * sampling_ratio))
-
-        selected = torch.zeros(num_samples, device = src.device, dtype = torch.long)
-        distances = torch.full((num_points,), float("inf"), device = src.device)
-
-        # select a random start point
-        if start_random:
-            farthest = torch.randint(0, num_points, (1,), device = src.device)
-        else:
-            farthest = torch.tensor([0], device = src.device, dtype = torch.long)
-
-        for i in range(num_samples):
-            selected[i] = farthest
-            centroid = points[farthest].squeeze(0)
-            dist = torch.norm(points - centroid, dim = 1) # compute euclidean distance
-            distances = torch.minimum(distances, dist)
-            farthest = torch.argmax(distances)
-
-        sampled_indicies.append(torch.arange(start, end)[selected])
-
-    return torch.cat(sampled_indicies, dim = 0)
-class PointCrossAttention(nn.Module):
-    def __init__(self,
-        num_latents: int,
-        downsample_ratio: float,
-        pc_size: int,
-        pc_sharpedge_size: int,
-        point_feats: int,
-        width: int,
-        heads: int,
-        layers: int,
-        fourier_embedder,
-        normal_pe: bool = False,
-        qkv_bias: bool = False,
-        use_ln_post: bool = True,
-        qk_norm: bool = True):
-
-        super().__init__()
-
-        self.fourier_embedder = fourier_embedder
-
-        self.pc_size = pc_size
-        self.normal_pe = normal_pe
-        self.downsample_ratio = downsample_ratio
-        self.pc_sharpedge_size = pc_sharpedge_size
-        self.num_latents = num_latents
-        self.point_feats = point_feats
-
-        self.input_proj = nn.Linear(self.fourier_embedder.out_dim + point_feats, width)
-
-        self.cross_attn = ResidualCrossAttentionBlock(
-            width = width,
-            heads = heads,
-            qkv_bias = qkv_bias,
-            qk_norm = qk_norm
-        )
-
-        self.self_attn = None
-        if layers > 0:
-            self.self_attn = Transformer(
-                width = width,
-                heads = heads,
-                qkv_bias = qkv_bias,
-                qk_norm = qk_norm,
-                layers = layers
-            )
-
-        if use_ln_post:
-            self.ln_post = nn.LayerNorm(width)
-        else:
-            self.ln_post = None
-
-    def sample_points_and_latents(self, point_cloud: torch.Tensor, features: torch.Tensor):
-
-        """
-        Subsample points randomly from the point cloud (input_pc)
-        Further sample the subsampled points to get query_pc
-        take the fourier embeddings for both input and query pc
-
-        Mental Note: FPS-sampled points (query_pc) act as latent tokens that attend to and learn from the broader context in input_pc.
-        Goal: get a smaller represenation (query_pc) to represent the entire scence structure by learning from a broader subset (input_pc).
-        More computationally efficient.
-
-        Features are additional information for each point in the cloud
-        """
-
-        B, _, D = point_cloud.shape
-
-        num_latents = int(self.num_latents)
-
-        num_random_query = self.pc_size / (self.pc_size + self.pc_sharpedge_size) * num_latents
-        num_sharpedge_query = num_latents - num_random_query
-
-        # Split random and sharpedge surface points
-        random_pc, sharpedge_pc = torch.split(point_cloud, [self.pc_size, self.pc_sharpedge_size], dim=1)
-
-        # assert statements
-        assert random_pc.shape[1] <= self.pc_size, "Random surface points size must be less than or equal to pc_size"
-        assert sharpedge_pc.shape[1] <= self.pc_sharpedge_size, "Sharpedge surface points size must be less than or equal to pc_sharpedge_size"
-
-        input_random_pc_size = int(num_random_query * self.downsample_ratio)
-        random_query_pc, random_input_pc, random_idx_pc, random_idx_query = \
-            self.subsample(pc = random_pc, num_query = num_random_query, input_pc_size = input_random_pc_size)
-
-        input_sharpedge_pc_size = int(num_sharpedge_query * self.downsample_ratio)
-
-        if input_sharpedge_pc_size == 0:
-            sharpedge_input_pc = torch.zeros(B, 0, D, dtype = random_input_pc.dtype).to(point_cloud.device)
-            sharpedge_query_pc = torch.zeros(B, 0, D, dtype= random_query_pc.dtype).to(point_cloud.device)
-
-        else:
-            sharpedge_query_pc, sharpedge_input_pc, sharpedge_idx_pc, sharpedge_idx_query = \
-            self.subsample(pc = sharpedge_pc, num_query = num_sharpedge_query, input_pc_size = input_sharpedge_pc_size)
-
-        # concat the random and sharpedges
-        query_pc = torch.cat([random_query_pc, sharpedge_query_pc], dim = 1)
-        input_pc = torch.cat([random_input_pc, sharpedge_input_pc], dim = 1)
-
-        query = self.fourier_embedder(query_pc)
-        data = self.fourier_embedder(input_pc)
-
-        if self.point_feats > 0:
-            random_surface_features, sharpedge_surface_features = torch.split(features, [self.pc_size, self.pc_sharpedge_size], dim = 1)
-
-            input_random_surface_features, query_random_features = \
-                self.handle_features(features = random_surface_features, idx_pc = random_idx_pc, batch_size = B,
-                                     input_pc_size = input_random_pc_size, idx_query = random_idx_query)
-
-            if input_sharpedge_pc_size == 0:
-                input_sharpedge_surface_features = torch.zeros(B, 0, self.point_feats,
-                                                               dtype = input_random_surface_features.dtype, device = point_cloud.device)
-
-                query_sharpedge_features = torch.zeros(B, 0, self.point_feats,
-                                                       dtype = query_random_features.dtype, device = point_cloud.device)
-            else:
-
-                input_sharpedge_surface_features, query_sharpedge_features = \
-                    self.handle_features(idx_pc = sharpedge_idx_pc, features = sharpedge_surface_features,
-                                         batch_size = B, idx_query = sharpedge_idx_query, input_pc_size = input_sharpedge_pc_size)
-
-            query_features = torch.cat([query_random_features, query_sharpedge_features], dim = 1)
-            input_features = torch.cat([input_random_surface_features, input_sharpedge_surface_features], dim = 1)
-
-            if self.normal_pe:
-                # apply the fourier embeddings on the first 3 dims (xyz)
-                input_features_pe = self.fourier_embedder(input_features[..., :3])
-                query_features_pe = self.fourier_embedder(query_features[..., :3])
-                # replace the first 3 dims with the new PE ones
-                input_features = torch.cat([input_features_pe, input_features[..., :3]], dim = -1)
-                query_features = torch.cat([query_features_pe, query_features[..., :3]], dim = -1)
-
-            # concat at the channels dim
-            query = torch.cat([query, query_features], dim = -1)
-            data = torch.cat([data, input_features], dim = -1)
-
-        # don't return pc_info to avoid unnecessary memory usuage
-        return query.view(B, -1, query.shape[-1]), data.view(B, -1, data.shape[-1])
-
-    def forward(self, point_cloud: torch.Tensor, features: torch.Tensor):
-
-        query, data = self.sample_points_and_latents(point_cloud = point_cloud, features = features)
-
-        # apply projections
-        query = self.input_proj(query)
-        data = self.input_proj(data)
-
-        # apply cross attention between query and data
-        latents = self.cross_attn(query, data)
-
-        if self.self_attn is not None:
-            latents = self.self_attn(latents)
-
-        if self.ln_post is not None:
-            latents = self.ln_post(latents)
-
-        return latents
-
-
-    def subsample(self, pc, num_query, input_pc_size: int):
-
-        """
-        num_query: number of points to keep after FPS
-        input_pc_size: number of points to select before FPS
-        """
-
-        B, _, D = pc.shape
-        query_ratio = num_query / input_pc_size
-
-        # random subsampling of points inside the point cloud
-        idx_pc = torch.randperm(pc.shape[1], device = pc.device)[:input_pc_size]
-        input_pc = pc[:, idx_pc, :]
-
-        # flatten to allow applying fps across the whole batch
-        flattent_input_pc = input_pc.view(B * input_pc_size, D)
-
-        # construct a batch_down tensor to tell fps
-        # which points belong to which batch
-        N_down = int(flattent_input_pc.shape[0] / B)
-        batch_down = torch.arange(B).to(pc.device)
-        batch_down = torch.repeat_interleave(batch_down, N_down)
-
-        idx_query = fps(flattent_input_pc, batch_down, sampling_ratio = query_ratio)
-        query_pc = flattent_input_pc[idx_query].view(B, -1, D)
-
-        return query_pc, input_pc, idx_pc, idx_query
-
-    def handle_features(self, features, idx_pc, input_pc_size, batch_size: int, idx_query):
-
-        B = batch_size
-
-        input_surface_features = features[:, idx_pc, :]
-        flattent_input_features = input_surface_features.view(B * input_pc_size, -1)
-        query_features = flattent_input_features[idx_query].view(B, -1,
-                                                                 flattent_input_features.shape[-1])
-
-        return input_surface_features, query_features
-
-def normalize_mesh(mesh, scale = 0.9999):
-    """Normalize mesh to fit in [-scale, scale]. Translate mesh so its center is [0,0,0]"""
-
-    bbox = mesh.bounds
-    center = (bbox[1] + bbox[0]) / 2
-
-    max_extent = (bbox[1] - bbox[0]).max()
-    mesh.apply_translation(-center)
-    mesh.apply_scale((2 * scale) / max_extent)
-
-    return mesh
-
-def sample_pointcloud(mesh, num = 200000):
-    """ Uniformly sample points from the surface of the mesh """
-
-    points, face_idx = mesh.sample(num, return_index = True)
-    normals = mesh.face_normals[face_idx]
-    return torch.from_numpy(points.astype(np.float32)), torch.from_numpy(normals.astype(np.float32))
-
-def detect_sharp_edges(mesh, threshold=0.985):
-    """Return edge indices (a, b) that lie on sharp boundaries of the mesh."""
-
-    V, F = mesh.vertices, mesh.faces
-    VN, FN = mesh.vertex_normals, mesh.face_normals
-
-    sharp_mask = np.ones(V.shape[0])
-    for i in range(3):
-        indices = F[:, i]
-        alignment = np.einsum('ij,ij->i', VN[indices], FN)
-        dot_stack = np.stack((sharp_mask[indices], alignment), axis=-1)
-        sharp_mask[indices] = np.min(dot_stack, axis=-1)
-
-    edge_a = np.concatenate([F[:, 0], F[:, 1], F[:, 2]])
-    edge_b = np.concatenate([F[:, 1], F[:, 2], F[:, 0]])
-    sharp_edges = (sharp_mask[edge_a] < threshold) & (sharp_mask[edge_b] < threshold)
-
-    return edge_a[sharp_edges], edge_b[sharp_edges]
-
-
-def sharp_sample_pointcloud(mesh, num = 16384):
-    """ Sample points preferentially from sharp edges in the mesh. """
-
-    edge_a, edge_b = detect_sharp_edges(mesh)
-    V, VN = mesh.vertices, mesh.vertex_normals
-
-    va, vb = V[edge_a], V[edge_b]
-    na, nb = VN[edge_a], VN[edge_b]
-
-    edge_lengths = np.linalg.norm(vb - va, axis=-1)
-    weights = edge_lengths / edge_lengths.sum()
-
-    indices = np.searchsorted(np.cumsum(weights), np.random.rand(num))
-    t = np.random.rand(num, 1)
-
-    samples = t * va[indices] + (1 - t) * vb[indices]
-    normals = t * na[indices] + (1 - t) * nb[indices]
-
-    return samples.astype(np.float32), normals.astype(np.float32)
-
-def load_surface_sharpedge(mesh, num_points=4096, num_sharp_points=4096, sharpedge_flag = True, device = "cuda"):
-    """Load a surface with optional sharp-edge annotations from a trimesh mesh."""
-
-    import trimesh
-
-    try:
-        mesh_full = trimesh.util.concatenate(mesh.dump())
-    except Exception:
-        mesh_full = trimesh.util.concatenate(mesh)
-
-    mesh_full = normalize_mesh(mesh_full)
-
-    faces = mesh_full.faces
-    vertices = mesh_full.vertices
-    origin_face_count = faces.shape[0]
-
-    mesh_surface = trimesh.Trimesh(vertices=vertices, faces=faces[:origin_face_count])
-    mesh_fill = trimesh.Trimesh(vertices=vertices, faces=faces[origin_face_count:])
-
-    area_surface = mesh_surface.area
-    area_fill = mesh_fill.area
-    total_area = area_surface + area_fill
-
-    sample_num = 499712 // 2
-    fill_ratio = area_fill / total_area if total_area > 0 else 0
-
-    num_fill = int(sample_num * fill_ratio)
-    num_surface = sample_num - num_fill
-
-    surf_pts, surf_normals = sample_pointcloud(mesh_surface, num_surface)
-    fill_pts, fill_normals = (torch.zeros(0, 3), torch.zeros(0, 3)) if num_fill == 0 else sample_pointcloud(mesh_fill, num_fill)
-
-    sharp_pts, sharp_normals = sharp_sample_pointcloud(mesh_surface, sample_num)
-
-    def assemble_tensor(points, normals, label=None):
-
-        data = torch.cat([points, normals], dim=1).half().to(device)
-
-        if label is not None:
-            label_tensor = torch.full((data.shape[0], 1), float(label), dtype=torch.float16).to(device)
-            data = torch.cat([data, label_tensor], dim=1)
-
-        return data
-
-    surface = assemble_tensor(torch.cat([surf_pts.to(device), fill_pts.to(device)], dim=0),
-                              torch.cat([surf_normals.to(device), fill_normals.to(device)], dim=0),
-                              label = 0 if sharpedge_flag else None)
-
-    sharp_surface = assemble_tensor(torch.from_numpy(sharp_pts), torch.from_numpy(sharp_normals),
-                                    label = 1 if sharpedge_flag else None)
-
-    rng = np.random.default_rng()
-
-    surface = surface[rng.choice(surface.shape[0], num_points, replace = False)]
-    sharp_surface = sharp_surface[rng.choice(sharp_surface.shape[0], num_sharp_points, replace = False)]
-
-    full = torch.cat([surface, sharp_surface], dim = 0).unsqueeze(0)
-
-    return full
-
-class SharpEdgeSurfaceLoader:
-    """ Load mesh surface and sharp edge samples. """
-
-    def __init__(self, num_uniform_points = 8192, num_sharp_points = 8192):
-
-        self.num_uniform_points = num_uniform_points
-        self.num_sharp_points = num_sharp_points
-        self.total_points = num_uniform_points + num_sharp_points
-
-    def __call__(self, mesh_input, device = "cuda"):
-        mesh = self._load_mesh(mesh_input)
-        return load_surface_sharpedge(mesh, self.num_uniform_points, self.num_sharp_points, device = device)
-
-    @staticmethod
-    def _load_mesh(mesh_input):
-        import trimesh
-
-        if isinstance(mesh_input, str):
-            mesh = trimesh.load(mesh_input, force="mesh", merge_primitives = True)
-        else:
-            mesh = mesh_input
-
-        if isinstance(mesh, trimesh.Scene):
-            combined = None
-            for obj in mesh.geometry.values():
-                combined = obj if combined is None else combined + obj
-            return combined
-
-        return mesh
-
-class DiagonalGaussianDistribution:
-    def __init__(self, params: torch.Tensor, feature_dim: int = -1):
-
-        # divide quant channels (8) into mean and log variance
-        self.mean, self.logvar = torch.chunk(params, 2, dim = feature_dim)
-
-        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
-        self.std = torch.exp(0.5 * self.logvar)
-
-    def sample(self):
-
-        eps = torch.randn_like(self.std)
-        z = self.mean + eps * self.std
-
-        return z
-
-################################################
-# Volume Decoder
-################################################
-
-class VanillaVolumeDecoder():
+def generate_dense_grid_points(
+    bbox_min: np.ndarray,
+    bbox_max: np.ndarray,
+    octree_resolution: int,
+    indexing: str = "ij",
+):
+    length = bbox_max - bbox_min
+    num_cells = octree_resolution
+
+    x = np.linspace(bbox_min[0], bbox_max[0], int(num_cells) + 1, dtype=np.float32)
+    y = np.linspace(bbox_min[1], bbox_max[1], int(num_cells) + 1, dtype=np.float32)
+    z = np.linspace(bbox_min[2], bbox_max[2], int(num_cells) + 1, dtype=np.float32)
+    [xs, ys, zs] = np.meshgrid(x, y, z, indexing=indexing)
+    xyz = np.stack((xs, ys, zs), axis=-1)
+    grid_size = [int(num_cells) + 1, int(num_cells) + 1, int(num_cells) + 1]
+
+    return xyz, grid_size, length
+
+
+class VanillaVolumeDecoder:
    @torch.no_grad()
-    def __call__(self, latents: torch.Tensor, geo_decoder: callable, octree_resolution: int, bounds = 1.01,
-                 num_chunks: int = 10_000, enable_pbar: bool = True, **kwargs):
+    def __call__(
+        self,
+        latents: torch.FloatTensor,
+        geo_decoder: Callable,
+        bounds: Union[Tuple[float], List[float], float] = 1.01,
+        num_chunks: int = 10000,
+        octree_resolution: Optional[int] = None,
+        enable_pbar: bool = True,
+        **kwargs,
+    ):
+        device = latents.device
+        dtype = latents.dtype
+        batch_size = latents.shape[0]

+        # 1. generate query points
        if isinstance(bounds, float):
            bounds = [-bounds, -bounds, -bounds, bounds, bounds, bounds]

-        bbox_min, bbox_max = torch.tensor(bounds[:3]), torch.tensor(bounds[3:])
-
-        x = torch.linspace(bbox_min[0], bbox_max[0], int(octree_resolution) + 1, dtype = torch.float32)
-        y = torch.linspace(bbox_min[1], bbox_max[1], int(octree_resolution) + 1, dtype = torch.float32)
-        z = torch.linspace(bbox_min[2], bbox_max[2], int(octree_resolution) + 1, dtype = torch.float32)
-
-        [xs, ys, zs] = torch.meshgrid(x, y, z, indexing = "ij")
-        xyz = torch.stack((xs, ys, zs), axis=-1).to(latents.device, dtype = latents.dtype).contiguous().reshape(-1, 3)
-        grid_size = [int(octree_resolution) + 1, int(octree_resolution) + 1, int(octree_resolution) + 1]
+        bbox_min, bbox_max = np.array(bounds[0:3]), np.array(bounds[3:6])
+        xyz_samples, grid_size, length = generate_dense_grid_points(
+            bbox_min=bbox_min,
+            bbox_max=bbox_max,
+            octree_resolution=octree_resolution,
+            indexing="ij"
+        )
+        xyz_samples = torch.from_numpy(xyz_samples).to(device, dtype=dtype).contiguous().reshape(-1, 3)

+        # 2. latents to 3d volume
        batch_logits = []
-        for start in tqdm(range(0, xyz.shape[0], num_chunks), desc="Volume Decoding",
+        for start in tqdm(range(0, xyz_samples.shape[0], num_chunks), desc="Volume Decoding",
                          disable=not enable_pbar):
-
-            chunk_queries = xyz[start: start + num_chunks, :]
-            chunk_queries = chunk_queries.unsqueeze(0).repeat(latents.shape[0], 1, 1)
-            logits = geo_decoder(queries = chunk_queries, latents = latents)
+            chunk_queries = xyz_samples[start: start + num_chunks, :]
+            chunk_queries = repeat(chunk_queries, "p c -> b p c", b=batch_size)
+            logits = geo_decoder(queries=chunk_queries, latents=latents)
            batch_logits.append(logits)

-        grid_logits = torch.cat(batch_logits, dim = 1)
-        grid_logits = grid_logits.view((latents.shape[0], *grid_size)).float()
+        grid_logits = torch.cat(batch_logits, dim=1)
+        grid_logits = grid_logits.view((batch_size, *grid_size)).float()

        return grid_logits

+
 class FourierEmbedder(nn.Module):
    """The sin/cosine positional embedding. Given an input tensor `x` of shape [n_batch, ..., c_dim], it converts
    each feature dimension of `x[..., i]` into:
@@ -552,11 +175,13 @@ class FourierEmbedder(nn.Module):
        else:
            return x

+
 class CrossAttentionProcessor:
    def __call__(self, attn, q, k, v):
-        out = comfy.ops.scaled_dot_product_attention(q, k, v)
+        out = F.scaled_dot_product_attention(q, k, v)
        return out

+
 class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    """
@@ -607,42 +232,39 @@ class MLP(nn.Module):
    def forward(self, x):
        return self.drop_path(self.c_proj(self.gelu(self.c_fc(x))))

+
 class QKVMultiheadCrossAttention(nn.Module):
    def __init__(
        self,
+        *,
        heads: int,
-        n_data = None,
        width=None,
        qk_norm=False,
        norm_layer=ops.LayerNorm
    ):
        super().__init__()
        self.heads = heads
-        self.n_data = n_data
        self.q_norm = norm_layer(width // heads, elementwise_affine=True, eps=1e-6) if qk_norm else nn.Identity()
        self.k_norm = norm_layer(width // heads, elementwise_affine=True, eps=1e-6) if qk_norm else nn.Identity()

-    def forward(self, q, kv):
+        self.attn_processor = CrossAttentionProcessor()

+    def forward(self, q, kv):
        _, n_ctx, _ = q.shape
        bs, n_data, width = kv.shape
-
        attn_ch = width // self.heads // 2
        q = q.view(bs, n_ctx, self.heads, -1)
-
        kv = kv.view(bs, n_data, self.heads, -1)
        k, v = torch.split(kv, attn_ch, dim=-1)

        q = self.q_norm(q)
        k = self.k_norm(k)
-
-        q, k, v = [t.permute(0, 2, 1, 3) for t in (q, k, v)]
-        out = F.scaled_dot_product_attention(q, k, v)
-
+        q, k, v = map(lambda t: rearrange(t, 'b n h d -> b h n d', h=self.heads), (q, k, v))
+        out = self.attn_processor(self, q, k, v)
        out = out.transpose(1, 2).reshape(bs, n_ctx, -1)
-
        return out

+
 class MultiheadCrossAttention(nn.Module):
    def __init__(
        self,
@@ -684,6 +306,7 @@ class MultiheadCrossAttention(nn.Module):
        x = self.c_proj(x)
        return x

+
 class ResidualCrossAttentionBlock(nn.Module):
    def __init__(
        self,
@@ -743,7 +366,7 @@ class QKVMultiheadAttention(nn.Module):
        q = self.q_norm(q)
        k = self.k_norm(k)

-        q, k, v = [t.permute(0, 2, 1, 3) for t in (q, k, v)]
+        q, k, v = map(lambda t: rearrange(t, 'b n h d -> b h n d', h=self.heads), (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v).transpose(1, 2).reshape(bs, n_ctx, -1)
        return out

@@ -760,7 +383,8 @@ class MultiheadAttention(nn.Module):
        drop_path_rate: float = 0.0
    ):
        super().__init__()
-
+        self.width = width
+        self.heads = heads
        self.c_qkv = ops.Linear(width, width * 3, bias=qkv_bias)
        self.c_proj = ops.Linear(width, width)
        self.attention = QKVMultiheadAttention(
@@ -867,7 +491,7 @@ class CrossAttentionDecoder(nn.Module):
        self.query_proj = ops.Linear(self.fourier_embedder.out_dim, width)
        if self.downsample_ratio != 1:
            self.latents_proj = ops.Linear(width * downsample_ratio, width)
-        if not self.enable_ln_post:
+        if self.enable_ln_post == False:
            qk_norm = False
        self.cross_attn_decoder = ResidualCrossAttentionBlock(
            width=width,
@@ -898,44 +522,28 @@ class CrossAttentionDecoder(nn.Module):

 class ShapeVAE(nn.Module):
    def __init__(
-            self,
-            *,
-            num_latents: int = 4096,
-            embed_dim: int = 64,
-            width: int = 1024,
-            heads: int = 16,
-            num_decoder_layers: int = 16,
-            num_encoder_layers: int = 8,
-            pc_size: int = 81920,
-            pc_sharpedge_size: int = 0,
-            point_feats: int = 4,
-            downsample_ratio: int = 20,
-            geo_decoder_downsample_ratio: int = 1,
-            geo_decoder_mlp_expand_ratio: int = 4,
-            geo_decoder_ln_post: bool = True,
-            num_freqs: int = 8,
-            qkv_bias: bool = False,
-            qk_norm: bool = True,
-            drop_path_rate: float = 0.0,
-            include_pi: bool = False,
-            scale_factor: float = 1.0039506158752403,
-            label_type: str = "binary",
+        self,
+        *,
+        embed_dim: int,
+        width: int,
+        heads: int,
+        num_decoder_layers: int,
+        geo_decoder_downsample_ratio: int = 1,
+        geo_decoder_mlp_expand_ratio: int = 4,
+        geo_decoder_ln_post: bool = True,
+        num_freqs: int = 8,
+        include_pi: bool = True,
+        qkv_bias: bool = True,
+        qk_norm: bool = False,
+        label_type: str = "binary",
+        drop_path_rate: float = 0.0,
+        scale_factor: float = 1.0,
    ):
        super().__init__()
        self.geo_decoder_ln_post = geo_decoder_ln_post

        self.fourier_embedder = FourierEmbedder(num_freqs=num_freqs, include_pi=include_pi)

-        self.encoder = PointCrossAttention(layers = num_encoder_layers,
-                                    num_latents = num_latents,
-                                    downsample_ratio = downsample_ratio,
-                                    heads = heads,
-                                    pc_size = pc_size,
-                                    width = width,
-                                    point_feats = point_feats,
-                                    fourier_embedder = self.fourier_embedder,
-                                    pc_sharpedge_size = pc_sharpedge_size)
-
        self.post_kl = ops.Linear(embed_dim, width)

        self.transformer = Transformer(
@@ -975,14 +583,5 @@ class ShapeVAE(nn.Module):
        grid_logits = self.volume_decoder(latents, self.geo_decoder, bounds=bounds, num_chunks=num_chunks, octree_resolution=octree_resolution, enable_pbar=enable_pbar)
        return grid_logits.movedim(-2, -1)

-    def encode(self, surface):
-
-        pc, feats = surface[:, :, :3], surface[:, :, 3:]
-        latents = self.encoder(pc, feats)
-
-        moments = self.pre_kl(latents)
-        posterior = DiagonalGaussianDistribution(moments, feature_dim = -1)
-
-        latents = posterior.sample()
-
-        return latents
+    def encode(self, x):
+        return None
--- a/comfy/ldm/hunyuan3dv2_1/hunyuandit.py
+++ b/comfy/ldm/hunyuan3dv2_1/hunyuandit.py
@@ -1,659 +0,0 @@
-import math
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.modules.attention import optimized_attention
-import comfy.model_management
-
-class GELU(nn.Module):
-
-    def __init__(self, dim_in: int, dim_out: int, operations, device, dtype):
-        super().__init__()
-        self.proj = operations.Linear(dim_in, dim_out, device = device, dtype = dtype)
-
-    def gelu(self, gate: torch.Tensor) -> torch.Tensor:
-
-        if gate.device.type == "mps":
-            return F.gelu(gate.to(dtype = torch.float32)).to(dtype = gate.dtype)
-
-        return F.gelu(gate)
-
-    def forward(self, hidden_states):
-
-        hidden_states = self.proj(hidden_states)
-        hidden_states = self.gelu(hidden_states)
-
-        return hidden_states
-
-class FeedForward(nn.Module):
-
-    def __init__(self, dim: int, dim_out = None, mult: int = 4,
-                dropout: float = 0.0, inner_dim = None, operations = None, device = None, dtype = None):
-
-        super().__init__()
-        if inner_dim is None:
-            inner_dim = int(dim * mult)
-
-        dim_out = dim_out if dim_out is not None else dim
-
-        act_fn = GELU(dim, inner_dim, operations = operations, device = device, dtype = dtype)
-
-        self.net = nn.ModuleList([])
-        self.net.append(act_fn)
-
-        self.net.append(nn.Dropout(dropout))
-        self.net.append(operations.Linear(inner_dim, dim_out, device = device, dtype = dtype))
-
-    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-        for module in self.net:
-            hidden_states = module(hidden_states)
-        return hidden_states
-
-class AddAuxLoss(torch.autograd.Function):
-
-    @staticmethod
-    def forward(ctx, x, loss):
-        # do nothing in forward (no computation)
-        ctx.requires_aux_loss = loss.requires_grad
-        ctx.dtype = loss.dtype
-
-        return x
-
-    @staticmethod
-    def backward(ctx, grad_output):
-        # add the aux loss gradients
-        grad_loss = None
-        # put the aux grad the same as the main grad loss
-        # aux grad contributes equally
-        if ctx.requires_aux_loss:
-            grad_loss = torch.ones(1, dtype = ctx.dtype, device = grad_output.device)
-
-        return grad_output, grad_loss
-
-class MoEGate(nn.Module):
-
-    def __init__(self, embed_dim, num_experts=16, num_experts_per_tok=2, aux_loss_alpha=0.01, device = None, dtype = None):
-
-        super().__init__()
-        self.top_k = num_experts_per_tok
-        self.n_routed_experts = num_experts
-
-        self.alpha = aux_loss_alpha
-
-        self.gating_dim = embed_dim
-        self.weight = nn.Parameter(torch.empty((self.n_routed_experts, self.gating_dim), device = device, dtype = dtype))
-
-    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-
-        # flatten hidden states
-        hidden_states = hidden_states.view(-1, hidden_states.size(-1))
-
-        # get logits and pass it to softmax
-        logits = F.linear(hidden_states, comfy.model_management.cast_to(self.weight, dtype=hidden_states.dtype, device=hidden_states.device), bias = None)
-        scores = logits.softmax(dim = -1)
-
-        topk_weight, topk_idx = torch.topk(scores, k = self.top_k, dim = -1, sorted = False)
-
-        if self.training and self.alpha > 0.0:
-            scores_for_aux = scores
-
-            # used bincount instead of one hot encoding
-            counts = torch.bincount(topk_idx.view(-1), minlength = self.n_routed_experts).float()
-            ce = counts / topk_idx.numel()  # normalized expert usage
-
-            # mean expert score
-            Pi = scores_for_aux.mean(0)
-
-            # expert balance loss
-            aux_loss = (Pi * ce * self.n_routed_experts).sum() * self.alpha
-        else:
-            aux_loss = None
-
-        return topk_idx, topk_weight, aux_loss
-
-class MoEBlock(nn.Module):
-    def __init__(self, dim, num_experts: int = 6, moe_top_k: int = 2, dropout: float = 0.0,
-                 ff_inner_dim: int = None, operations = None, device = None, dtype = None):
-        super().__init__()
-
-        self.moe_top_k = moe_top_k
-        self.num_experts = num_experts
-
-        self.experts = nn.ModuleList([
-            FeedForward(dim, dropout = dropout, inner_dim = ff_inner_dim, operations = operations, device = device, dtype = dtype)
-            for _ in range(num_experts)
-        ])
-
-        self.gate = MoEGate(dim, num_experts = num_experts, num_experts_per_tok = moe_top_k, device = device, dtype = dtype)
-        self.shared_experts = FeedForward(dim, dropout = dropout, inner_dim = ff_inner_dim, operations = operations, device = device, dtype = dtype)
-
-    def forward(self, hidden_states) -> torch.Tensor:
-
-        identity = hidden_states
-        orig_shape = hidden_states.shape
-        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
-
-        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
-        flat_topk_idx = topk_idx.view(-1)
-
-        if self.training:
-
-            hidden_states = hidden_states.repeat_interleave(self.moe_top_k, dim = 0)
-            y = torch.empty_like(hidden_states, dtype = hidden_states.dtype)
-
-            for i, expert in enumerate(self.experts):
-                tmp = expert(hidden_states[flat_topk_idx == i])
-                y[flat_topk_idx == i] = tmp.to(hidden_states.dtype)
-
-            y = (y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1)).sum(dim = 1)
-            y =  y.view(*orig_shape)
-
-            y = AddAuxLoss.apply(y, aux_loss)
-        else:
-            y = self.moe_infer(hidden_states, flat_expert_indices = flat_topk_idx,flat_expert_weights = topk_weight.view(-1, 1)).view(*orig_shape)
-
-        y = y + self.shared_experts(identity)
-
-        return y
-
-    @torch.no_grad()
-    def moe_infer(self, x, flat_expert_indices, flat_expert_weights):
-
-        expert_cache = torch.zeros_like(x)
-        idxs = flat_expert_indices.argsort()
-
-        # no need for .numpy().cpu() here
-        tokens_per_expert = flat_expert_indices.bincount().cumsum(0)
-        token_idxs = idxs // self.moe_top_k
-
-        for i, end_idx in enumerate(tokens_per_expert):
-
-            start_idx = 0 if i == 0 else tokens_per_expert[i-1]
-
-            if start_idx == end_idx:
-                continue
-
-            expert = self.experts[i]
-            exp_token_idx = token_idxs[start_idx:end_idx]
-
-            expert_tokens = x[exp_token_idx]
-            expert_out = expert(expert_tokens)
-
-            expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]])
-
-            # use index_add_ with a 1-D index tensor directly avoids building a large [N, D] index map and extra memcopy required by scatter_reduce_
-            # + avoid dtype conversion
-            expert_cache.index_add_(0, exp_token_idx, expert_out)
-
-        return expert_cache
-
-class Timesteps(nn.Module):
-    def __init__(self, num_channels: int, downscale_freq_shift: float = 0.0,
-                 scale: float = 1.0, max_period: int = 10000):
-        super().__init__()
-
-        self.num_channels = num_channels
-        half_dim = num_channels // 2
-
-        # precompute the “inv_freq” vector once
-        exponent = -math.log(max_period) * torch.arange(
-            half_dim, dtype=torch.float32
-        ) / (half_dim - downscale_freq_shift)
-
-        inv_freq = torch.exp(exponent)
-
-        # pad
-        if num_channels % 2 == 1:
-            # we’ll pad a zero at the end of the cos-half
-            inv_freq = torch.cat([inv_freq, inv_freq.new_zeros(1)])
-
-        # register to buffer so it moves with the device
-        self.register_buffer("inv_freq", inv_freq, persistent = False)
-        self.scale = scale
-
-    def forward(self, timesteps: torch.Tensor):
-
-        x = timesteps.float().unsqueeze(1) * self.inv_freq.to(timesteps.device).unsqueeze(0)
-
-
-        # fused CUDA kernels for sin and cos
-        sin_emb = x.sin()
-        cos_emb = x.cos()
-
-        emb = torch.cat([sin_emb, cos_emb], dim = 1)
-
-        # scale factor
-        if self.scale != 1.0:
-            emb = emb * self.scale
-
-        # If we padded inv_freq for odd, emb is already wide enough; otherwise:
-        if emb.shape[1] > self.num_channels:
-            emb = emb[:, :self.num_channels]
-
-        return emb
-
-class TimestepEmbedder(nn.Module):
-    def __init__(self, hidden_size, frequency_embedding_size = 256, cond_proj_dim = None, operations = None, device = None, dtype = None):
-        super().__init__()
-
-        self.mlp = nn.Sequential(
-            operations.Linear(hidden_size, frequency_embedding_size, bias=True, device = device, dtype = dtype),
-            nn.GELU(),
-            operations.Linear(frequency_embedding_size, hidden_size, bias=True, device = device, dtype = dtype),
-        )
-        self.frequency_embedding_size = frequency_embedding_size
-
-        if cond_proj_dim is not None:
-            self.cond_proj = operations.Linear(cond_proj_dim, frequency_embedding_size, bias=False, device = device, dtype = dtype)
-
-        self.time_embed = Timesteps(hidden_size)
-
-    def forward(self, timesteps, condition):
-
-        timestep_embed = self.time_embed(timesteps).type(self.mlp[0].weight.dtype)
-
-        if condition is not None:
-            cond_embed = self.cond_proj(condition)
-            timestep_embed = timestep_embed + cond_embed
-
-        time_conditioned = self.mlp(timestep_embed)
-
-        # for broadcasting with image tokens
-        return time_conditioned.unsqueeze(1)
-
-class MLP(nn.Module):
-    def __init__(self, *, width: int, operations = None, device = None, dtype = None):
-        super().__init__()
-        self.width = width
-        self.fc1 = operations.Linear(width, width * 4, device = device, dtype = dtype)
-        self.fc2 = operations.Linear(width * 4, width, device = device, dtype = dtype)
-        self.gelu = nn.GELU()
-
-    def forward(self, x):
-        return self.fc2(self.gelu(self.fc1(x)))
-
-class CrossAttention(nn.Module):
-    def __init__(
-        self,
-        qdim,
-        kdim,
-        num_heads,
-        qkv_bias=True,
-        qk_norm=False,
-        norm_layer=nn.LayerNorm,
-        use_fp16: bool = False,
-        operations = None,
-        dtype = None,
-        device = None,
-        **kwargs,
-    ):
-        super().__init__()
-        self.qdim = qdim
-        self.kdim = kdim
-
-        self.num_heads = num_heads
-        self.head_dim = self.qdim // num_heads
-
-        self.scale = self.head_dim ** -0.5
-
-        self.to_q = operations.Linear(qdim, qdim, bias=qkv_bias, device = device, dtype = dtype)
-        self.to_k = operations.Linear(kdim, qdim, bias=qkv_bias, device = device, dtype = dtype)
-        self.to_v = operations.Linear(kdim, qdim, bias=qkv_bias, device = device, dtype = dtype)
-
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        if norm_layer == nn.LayerNorm:
-            norm_layer = operations.LayerNorm
-        else:
-            norm_layer = operations.RMSNorm
-
-        self.q_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.k_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.out_proj = operations.Linear(qdim, qdim, bias=True, device = device, dtype = dtype)
-
-    def forward(self, x, y):
-
-        b, s1, _ = x.shape
-        _, s2, _ = y.shape
-
-        y = y.to(next(self.to_k.parameters()).dtype)
-
-        q = self.to_q(x)
-        k = self.to_k(y)
-        v = self.to_v(y)
-
-        kv = torch.cat((k, v), dim=-1)
-        split_size = kv.shape[-1] // self.num_heads // 2
-
-        kv = kv.view(1, -1, self.num_heads, split_size * 2)
-        k, v = torch.split(kv, split_size, dim=-1)
-
-        q = q.view(b, s1, self.num_heads, self.head_dim)
-        k = k.view(b, s2, self.num_heads, self.head_dim)
-        v = v.reshape(b, s2, self.num_heads * self.head_dim)
-
-        q = self.q_norm(q)
-        k = self.k_norm(k)
-
-        x = optimized_attention(
-            q.reshape(b, s1, self.num_heads * self.head_dim),
-            k.reshape(b, s2, self.num_heads * self.head_dim),
-            v,
-            heads=self.num_heads,
-        )
-
-        out = self.out_proj(x)
-
-        return out
-
-class Attention(nn.Module):
-
-    def __init__(
-        self,
-        dim,
-        num_heads,
-        qkv_bias = True,
-        qk_norm = False,
-        norm_layer = nn.LayerNorm,
-        use_fp16: bool = False,
-        operations = None,
-        device = None,
-        dtype = None
-    ):
-        super().__init__()
-        self.dim = dim
-        self.num_heads = num_heads
-        self.head_dim = self.dim // num_heads
-        self.scale = self.head_dim ** -0.5
-
-        self.to_q = operations.Linear(dim, dim, bias = qkv_bias, device = device, dtype = dtype)
-        self.to_k = operations.Linear(dim, dim, bias = qkv_bias, device = device, dtype = dtype)
-        self.to_v = operations.Linear(dim, dim, bias = qkv_bias, device = device, dtype = dtype)
-
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        if norm_layer == nn.LayerNorm:
-            norm_layer = operations.LayerNorm
-        else:
-            norm_layer = operations.RMSNorm
-
-        self.q_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.k_norm = norm_layer(self.head_dim, elementwise_affine=True, eps = eps, device = device, dtype = dtype) if qk_norm else nn.Identity()
-        self.out_proj = operations.Linear(dim, dim, device = device, dtype = dtype)
-
-    def forward(self, x):
-        B, N, _ = x.shape
-
-        query = self.to_q(x)
-        key = self.to_k(x)
-        value = self.to_v(x)
-
-        qkv_combined = torch.cat((query, key, value), dim=-1)
-        split_size = qkv_combined.shape[-1] // self.num_heads // 3
-
-        qkv = qkv_combined.view(1, -1, self.num_heads, split_size * 3)
-        query, key, value = torch.split(qkv, split_size, dim=-1)
-
-        query = query.reshape(B, N, self.num_heads, self.head_dim)
-        key = key.reshape(B, N, self.num_heads, self.head_dim)
-        value = value.reshape(B, N, self.num_heads * self.head_dim)
-
-        query = self.q_norm(query)
-        key = self.k_norm(key)
-
-        x = optimized_attention(
-            query.reshape(B, N, self.num_heads * self.head_dim),
-            key.reshape(B, N, self.num_heads * self.head_dim),
-            value,
-            heads=self.num_heads,
-        )
-
-        x = self.out_proj(x)
-        return x
-
-class HunYuanDiTBlock(nn.Module):
-    def __init__(
-        self,
-        hidden_size,
-        c_emb_size,
-        num_heads,
-        text_states_dim=1024,
-        qk_norm=False,
-        norm_layer=nn.LayerNorm,
-        qk_norm_layer=True,
-        qkv_bias=True,
-        skip_connection=True,
-        timested_modulate=False,
-        use_moe: bool = False,
-        num_experts: int = 8,
-        moe_top_k: int = 2,
-        use_fp16: bool = False,
-        operations = None,
-        device = None, dtype = None
-    ):
-        super().__init__()
-
-        # eps can't be 1e-6 in fp16 mode because of numerical stability issues
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        self.norm1 = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-
-        self.attn1 = Attention(hidden_size, num_heads=num_heads, qkv_bias=qkv_bias, qk_norm=qk_norm,
-                               norm_layer=qk_norm_layer, use_fp16 = use_fp16, device = device, dtype = dtype, operations = operations)
-
-        self.norm2 = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-
-        self.timested_modulate = timested_modulate
-        if self.timested_modulate:
-            self.default_modulation = nn.Sequential(
-                nn.SiLU(),
-                operations.Linear(c_emb_size, hidden_size, bias=True, device = device, dtype = dtype)
-            )
-
-        self.attn2 = CrossAttention(hidden_size, text_states_dim, num_heads=num_heads, qkv_bias=qkv_bias,
-                                    qk_norm=qk_norm, norm_layer=qk_norm_layer, use_fp16 = use_fp16,
-                                    device = device, dtype = dtype, operations = operations)
-
-        self.norm3 = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-
-        if skip_connection:
-            self.skip_norm = norm_layer(hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-            self.skip_linear = operations.Linear(2 * hidden_size, hidden_size, device = device, dtype = dtype)
-        else:
-            self.skip_linear = None
-
-        self.use_moe = use_moe
-
-        if self.use_moe:
-            self.moe = MoEBlock(
-                hidden_size,
-                num_experts = num_experts,
-                moe_top_k = moe_top_k,
-                dropout = 0.0,
-                ff_inner_dim = int(hidden_size * 4.0),
-                device = device, dtype = dtype,
-                operations = operations
-            )
-        else:
-            self.mlp = MLP(width=hidden_size, operations=operations, device = device, dtype = dtype)
-
-    def forward(self, hidden_states, conditioning=None, text_states=None, skip_tensor=None):
-
-        if self.skip_linear is not None:
-            combined = torch.cat([skip_tensor, hidden_states], dim=-1)
-            hidden_states = self.skip_linear(combined)
-            hidden_states = self.skip_norm(hidden_states)
-
-        # self attention
-        if self.timested_modulate:
-            modulation_shift = self.default_modulation(conditioning).unsqueeze(dim=1)
-            hidden_states = hidden_states + modulation_shift
-
-        self_attn_out = self.attn1(self.norm1(hidden_states))
-        hidden_states = hidden_states + self_attn_out
-
-        # cross attention
-        hidden_states = hidden_states + self.attn2(self.norm2(hidden_states), text_states)
-
-        # MLP Layer
-        mlp_input = self.norm3(hidden_states)
-
-        if self.use_moe:
-            hidden_states = hidden_states + self.moe(mlp_input)
-        else:
-            hidden_states = hidden_states + self.mlp(mlp_input)
-
-        return hidden_states
-
-class FinalLayer(nn.Module):
-
-    def __init__(self, final_hidden_size, out_channels, operations, use_fp16: bool = False, device = None, dtype = None):
-        super().__init__()
-
-        if use_fp16:
-            eps = 1.0 / 65504
-        else:
-            eps = 1e-6
-
-        self.norm_final = operations.LayerNorm(final_hidden_size, elementwise_affine = True, eps = eps, device = device, dtype = dtype)
-        self.linear = operations.Linear(final_hidden_size, out_channels, bias = True, device = device, dtype = dtype)
-
-    def forward(self, x):
-        x = self.norm_final(x)
-        x = x[:, 1:]
-        x = self.linear(x)
-        return x
-
-class HunYuanDiTPlain(nn.Module):
-
-    # init with the defaults values from https://huggingface.co/tencent/Hunyuan3D-2.1/blob/main/hunyuan3d-dit-v2-1/config.yaml
-    def __init__(
-        self,
-        in_channels: int = 64,
-        hidden_size: int = 2048,
-        context_dim: int = 1024,
-        depth: int = 21,
-        num_heads: int = 16,
-        qk_norm: bool = True,
-        qkv_bias: bool = False,
-        num_moe_layers: int = 6,
-        guidance_cond_proj_dim = 2048,
-        norm_type = 'layer',
-        num_experts: int = 8,
-        moe_top_k: int = 2,
-        use_fp16: bool = False,
-        dtype = None,
-        device = None,
-        operations = None,
-        **kwargs
-        ):
-
-        self.dtype = dtype
-
-        super().__init__()
-
-        self.depth = depth
-
-        self.in_channels = in_channels
-        self.out_channels = in_channels
-
-        self.num_heads = num_heads
-        self.hidden_size = hidden_size
-
-        norm = operations.LayerNorm if norm_type == 'layer' else operations.RMSNorm
-        qk_norm = operations.RMSNorm
-
-        self.context_dim = context_dim
-        self.guidance_cond_proj_dim = guidance_cond_proj_dim
-
-        self.x_embedder = operations.Linear(in_channels, hidden_size, bias = True, device = device, dtype = dtype)
-        self.t_embedder = TimestepEmbedder(hidden_size, hidden_size * 4, cond_proj_dim = guidance_cond_proj_dim, device = device, dtype = dtype, operations = operations)
-
-
-        # HUnYuanDiT Blocks
-        self.blocks = nn.ModuleList([
-            HunYuanDiTBlock(hidden_size=hidden_size,
-                            c_emb_size=hidden_size,
-                            num_heads=num_heads,
-                            text_states_dim=context_dim,
-                            qk_norm=qk_norm,
-                            norm_layer = norm,
-                            qk_norm_layer = qk_norm,
-                            skip_connection=layer > depth // 2,
-                            qkv_bias=qkv_bias,
-                            use_moe=True if depth - layer <= num_moe_layers else False,
-                            num_experts=num_experts,
-                            moe_top_k=moe_top_k,
-                            use_fp16 = use_fp16,
-                            device = device, dtype = dtype, operations = operations)
-            for layer in range(depth)
-        ])
-
-        self.depth = depth
-
-        self.final_layer = FinalLayer(hidden_size, self.out_channels, use_fp16 = use_fp16, operations = operations, device = device, dtype = dtype)
-
-    def forward(self, x, t, context, transformer_options = {}, **kwargs):
-
-        x = x.movedim(-1, -2)
-        uncond_emb, cond_emb = context.chunk(2, dim = 0)
-
-        context = torch.cat([cond_emb, uncond_emb], dim = 0)
-        main_condition = context
-
-        t = 1.0 - t
-
-        time_embedded = self.t_embedder(t, condition = kwargs.get('guidance_cond'))
-
-        x = x.to(dtype = next(self.x_embedder.parameters()).dtype)
-        x_embedded = self.x_embedder(x)
-
-        combined = torch.cat([time_embedded, x_embedded], dim=1)
-
-        def block_wrap(args):
-            return block(
-                args["x"],
-                args["t"],
-                args["cond"],
-                skip_tensor=args.get("skip"),)
-
-        skip_stack = []
-        patches_replace = transformer_options.get("patches_replace", {})
-        blocks_replace = patches_replace.get("dit", {})
-        for idx, block in enumerate(self.blocks):
-            if idx <= self.depth // 2:
-                skip_input = None
-            else:
-                skip_input = skip_stack.pop()
-
-            if ("block", idx) in blocks_replace:
-
-                combined = blocks_replace[("block", idx)](
-                    {
-                        "x": combined,
-                        "t": time_embedded,
-                        "cond": main_condition,
-                        "skip": skip_input,
-                    },
-                    {"original_block": block_wrap},
-                )
-            else:
-                combined = block(combined, time_embedded, main_condition, skip_tensor=skip_input)
-
-            if idx < self.depth // 2:
-                skip_stack.append(combined)
-
-        output = self.final_layer(combined)
-        output =  output.movedim(-2, -1) * (-1.0)
-
-        cond_emb, uncond_emb = output.chunk(2, dim = 0)
-        return torch.cat([uncond_emb, cond_emb])
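Review note on the removed forward loop above: blocks with index below depth // 2 push their output onto skip_stack, the middle block neither pushes nor pops, and blocks above depth // 2 pop, so the last block consumes the first block's output. The sketch below is a minimal pure-Python illustration of that pairing, assuming the loop behaves exactly as written in the deleted lines. The function name skip_pairing is hypothetical and is not part of the repository; it tracks block indices instead of tensors.

```python
def skip_pairing(depth):
    """Map each consuming block index to the block index whose output it pops.

    Hypothetical helper for illustration only: it mirrors the push/pop
    conditions of the removed loop but stores indices rather than tensors.
    """
    half = depth // 2
    skip_stack = []  # producer block indices, last pushed is popped first
    pairing = {}
    for idx in range(depth):
        # Blocks past the midpoint consume the most recently stored output.
        if idx > half:
            pairing[idx] = skip_stack.pop()
        # Blocks before the midpoint store their output for later use.
        if idx < half:
            skip_stack.append(idx)
    return pairing


# depth = 21 is the default in the removed HunYuanDiTPlain constructor.
pairs = skip_pairing(21)
for consumer, producer in sorted(pairs.items()):
    print(f"block {consumer} receives the output of block {producer}")
```

With an odd depth such as the default 21, every stored output is consumed and block idx pairs with block depth - 1 - idx, which matches skip_connection=layer > depth // 2 in the removed constructor.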
--- a/comfy/ldm/hunyuan_video/model.py
+++ b/comfy/ldm/hunyuan_video/model.py
@@ -1,7 +1,6 @@
 #Based on Flux code because of weird hunyuan video code license.

 import torch
-import comfy.patcher_extension
 import comfy.ldm.flux.layers
 import comfy.ldm.modules.diffusionmodules.mmdit
 from comfy.ldm.modules.attention import optimized_attention
@@ -40,8 +39,6 @@ class HunyuanVideoParams:
    patch_size: list
    qkv_bias: bool
    guidance_embed: bool
-    byt5: bool
-    meanflow: bool


 class SelfAttentionRef(nn.Module):
@@ -80,13 +77,13 @@ class TokenRefinerBlock(nn.Module):
            operations.Linear(mlp_hidden_dim, hidden_size, bias=True, dtype=dtype, device=device),
        )

-    def forward(self, x, c, mask, transformer_options={}):
+    def forward(self, x, c, mask):
        mod1, mod2 = self.adaLN_modulation(c).chunk(2, dim=1)

        norm_x = self.norm1(x)
        qkv = self.self_attn.qkv(norm_x)
        q, k, v = qkv.reshape(qkv.shape[0], qkv.shape[1], 3, self.heads, -1).permute(2, 0, 3, 1, 4)
-        attn = optimized_attention(q, k, v, self.heads, mask=mask, skip_reshape=True, transformer_options=transformer_options)
+        attn = optimized_attention(q, k, v, self.heads, mask=mask, skip_reshape=True)

        x = x + self.self_attn.proj(attn) * mod1.unsqueeze(1)
        x = x + self.mlp(self.norm2(x)) * mod2.unsqueeze(1)
@@ -117,14 +114,14 @@ class IndividualTokenRefiner(nn.Module):
            ]
        )

-    def forward(self, x, c, mask, transformer_options={}):
+    def forward(self, x, c, mask):
        m = None
        if mask is not None:
            m = mask.view(mask.shape[0], 1, 1, mask.shape[1]).repeat(1, 1, mask.shape[1], 1)
            m = m + m.transpose(2, 3)

        for block in self.blocks:
-            x = block(x, c, m, transformer_options=transformer_options)
+            x = block(x, c, m)
        return x
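Review note on the mask handling in IndividualTokenRefiner.forward, which this hunk leaves unchanged: view + repeat copies the per-token mask along rows, and adding the transpose makes entry (i, j) the sum of the values for tokens i and j. The sketch below is a minimal pure-Python illustration, assuming the mask has already been converted to additive form as HunyuanVideo.forward does for non-float masks, (mask - 1) * finfo.max. BIG is a hypothetical finite stand-in for torch.finfo(dtype).max, and both function names are illustrative and not part of the repository.

```python
BIG = 1e30  # hypothetical stand-in for torch.finfo(dtype).max


def additive_mask(keep):
    """Turn a 0/1 keep mask into additive form: 0.0 for kept, -BIG for padded."""
    return [(k - 1) * BIG for k in keep]


def pairwise_mask(add):
    """Build m[i][j] = add[j] + add[i].

    add[j] corresponds to the view + repeat step (mask copied along rows),
    add[i] to the transposed copy that is added on top.
    """
    n = len(add)
    return [[add[j] + add[i] for j in range(n)] for i in range(n)]


keep = [1, 1, 0]  # third token is padding
m = pairwise_mask(additive_mask(keep))
for row in m:
    print(row)
```

Under these assumptions a pair of kept tokens gets 0.0, any pair involving a padded token gets a large negative value, and a padded token attending to itself gets twice that value.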


@@ -152,7 +149,6 @@ class TokenRefiner(nn.Module):
        x,
        timesteps,
        mask,
-        transformer_options={},
    ):
        t = self.t_embedder(timestep_embedding(timesteps, 256, time_factor=1.0).to(x.dtype))
        # m = mask.float().unsqueeze(-1)
@@ -161,33 +157,9 @@ class TokenRefiner(nn.Module):

        c = t + self.c_embedder(c.to(x.dtype))
        x = self.input_embedder(x)
-        x = self.individual_token_refiner(x, c, mask, transformer_options=transformer_options)
+        x = self.individual_token_refiner(x, c, mask)
        return x

-
-class ByT5Mapper(nn.Module):
-    def __init__(self, in_dim, out_dim, hidden_dim, out_dim1, use_res=False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.layernorm = operations.LayerNorm(in_dim, dtype=dtype, device=device)
-        self.fc1 = operations.Linear(in_dim, hidden_dim, dtype=dtype, device=device)
-        self.fc2 = operations.Linear(hidden_dim, out_dim, dtype=dtype, device=device)
-        self.fc3 = operations.Linear(out_dim, out_dim1, dtype=dtype, device=device)
-        self.use_res = use_res
-        self.act_fn = nn.GELU()
-
-    def forward(self, x):
-        if self.use_res:
-            res = x
-        x = self.layernorm(x)
-        x = self.fc1(x)
-        x = self.act_fn(x)
-        x = self.fc2(x)
-        x2 = self.act_fn(x)
-        x2 = self.fc3(x2)
-        if self.use_res:
-            x2 = x2 + res
-        return x2
-
 class HunyuanVideo(nn.Module):
    """
    Transformer model for flow matching on sequences.
@@ -212,13 +184,9 @@ class HunyuanVideo(nn.Module):
        self.num_heads = params.num_heads
        self.pe_embedder = EmbedND(dim=pe_dim, theta=params.theta, axes_dim=params.axes_dim)

-        self.img_in = comfy.ldm.modules.diffusionmodules.mmdit.PatchEmbed(None, self.patch_size, self.in_channels, self.hidden_size, conv3d=len(self.patch_size) == 3, dtype=dtype, device=device, operations=operations)
+        self.img_in = comfy.ldm.modules.diffusionmodules.mmdit.PatchEmbed(None, self.patch_size, self.in_channels, self.hidden_size, conv3d=True, dtype=dtype, device=device, operations=operations)
        self.time_in = MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, dtype=dtype, device=device, operations=operations)
-        if params.vec_in_dim is not None:
-            self.vector_in = MLPEmbedder(params.vec_in_dim, self.hidden_size, dtype=dtype, device=device, operations=operations)
-        else:
-            self.vector_in = None
-
+        self.vector_in = MLPEmbedder(params.vec_in_dim, self.hidden_size, dtype=dtype, device=device, operations=operations)
        self.guidance_in = (
            MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, dtype=dtype, device=device, operations=operations) if params.guidance_embed else nn.Identity()
        )
@@ -246,23 +214,6 @@ class HunyuanVideo(nn.Module):
            ]
        )

-        if params.byt5:
-            self.byt5_in = ByT5Mapper(
-                in_dim=1472,
-                out_dim=2048,
-                hidden_dim=2048,
-                out_dim1=self.hidden_size,
-                use_res=False,
-                dtype=dtype, device=device, operations=operations
-            )
-        else:
-            self.byt5_in = None
-
-        if params.meanflow:
-            self.time_r_in = MLPEmbedder(in_dim=256, hidden_dim=self.hidden_size, dtype=dtype, device=device, operations=operations)
-        else:
-            self.time_r_in = None
-
        if final_layer:
            self.final_layer = LastLayer(self.hidden_size, self.patch_size[-1], self.out_channels, dtype=dtype, device=device, operations=operations)

@@ -274,12 +225,10 @@ class HunyuanVideo(nn.Module):
        txt_ids: Tensor,
        txt_mask: Tensor,
        timesteps: Tensor,
-        y: Tensor = None,
-        txt_byt5=None,
+        y: Tensor,
        guidance: Tensor = None,
        guiding_frame_index=None,
        ref_latent=None,
-        disable_time_r=False,
        control=None,
        transformer_options={},
    ) -> Tensor:
@@ -290,14 +239,6 @@ class HunyuanVideo(nn.Module):
        img = self.img_in(img)
        vec = self.time_in(timestep_embedding(timesteps, 256, time_factor=1.0).to(img.dtype))

-        if (self.time_r_in is not None) and (not disable_time_r):
-            w = torch.where(transformer_options['sigmas'][0] == transformer_options['sample_sigmas'])[0]  # This most likely could be improved
-            if len(w) > 0:
-                timesteps_r = transformer_options['sample_sigmas'][w[0] + 1]
-                timesteps_r = timesteps_r.unsqueeze(0).to(device=timesteps.device, dtype=timesteps.dtype)
-                vec_r = self.time_r_in(timestep_embedding(timesteps_r, 256, time_factor=1000.0).to(img.dtype))
-                vec = (vec + vec_r) / 2
-
        if ref_latent is not None:
            ref_latent_ids = self.img_ids(ref_latent)
            ref_latent = self.img_in(ref_latent)
@@ -308,17 +249,13 @@ class HunyuanVideo(nn.Module):

        if guiding_frame_index is not None:
            token_replace_vec = self.time_in(timestep_embedding(guiding_frame_index, 256, time_factor=1.0))
-            if self.vector_in is not None:
-                vec_ = self.vector_in(y[:, :self.params.vec_in_dim])
-                vec = torch.cat([(vec_ + token_replace_vec).unsqueeze(1), (vec_ + vec).unsqueeze(1)], dim=1)
-            else:
-                vec = torch.cat([(token_replace_vec).unsqueeze(1), (vec).unsqueeze(1)], dim=1)
+            vec_ = self.vector_in(y[:, :self.params.vec_in_dim])
+            vec = torch.cat([(vec_ + token_replace_vec).unsqueeze(1), (vec_ + vec).unsqueeze(1)], dim=1)
            frame_tokens = (initial_shape[-1] // self.patch_size[-1]) * (initial_shape[-2] // self.patch_size[-2])
            modulation_dims = [(0, frame_tokens, 0), (frame_tokens, None, 1)]
            modulation_dims_txt = [(0, None, 1)]
        else:
-            if self.vector_in is not None:
-                vec = vec + self.vector_in(y[:, :self.params.vec_in_dim])
+            vec = vec + self.vector_in(y[:, :self.params.vec_in_dim])
            modulation_dims = None
            modulation_dims_txt = None

@@ -329,13 +266,7 @@ class HunyuanVideo(nn.Module):
        if txt_mask is not None and not torch.is_floating_point(txt_mask):
            txt_mask = (txt_mask - 1).to(img.dtype) * torch.finfo(img.dtype).max

-        txt = self.txt_in(txt, timesteps, txt_mask, transformer_options=transformer_options)
-
-        if self.byt5_in is not None and txt_byt5 is not None:
-            txt_byt5 = self.byt5_in(txt_byt5)
-            txt_byt5_ids = torch.zeros((txt_ids.shape[0], txt_byt5.shape[1], txt_ids.shape[-1]), device=txt_ids.device, dtype=txt_ids.dtype)
-            txt = torch.cat((txt, txt_byt5), dim=1)
-            txt_ids = torch.cat((txt_ids, txt_byt5_ids), dim=1)
+        txt = self.txt_in(txt, timesteps, txt_mask)

        ids = torch.cat((img_ids, txt_ids), dim=1)
        pe = self.pe_embedder(ids)
@@ -353,14 +284,14 @@ class HunyuanVideo(nn.Module):
            if ("double_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
-                    out["img"], out["txt"] = block(img=args["img"], txt=args["txt"], vec=args["vec"], pe=args["pe"], attn_mask=args["attention_mask"], modulation_dims_img=args["modulation_dims_img"], modulation_dims_txt=args["modulation_dims_txt"], transformer_options=args["transformer_options"])
+                    out["img"], out["txt"] = block(img=args["img"], txt=args["txt"], vec=args["vec"], pe=args["pe"], attn_mask=args["attention_mask"], modulation_dims_img=args["modulation_dims_img"], modulation_dims_txt=args["modulation_dims_txt"])
                    return out

-                out = blocks_replace[("double_block", i)]({"img": img, "txt": txt, "vec": vec, "pe": pe, "attention_mask": attn_mask, 'modulation_dims_img': modulation_dims, 'modulation_dims_txt': modulation_dims_txt, 'transformer_options': transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("double_block", i)]({"img": img, "txt": txt, "vec": vec, "pe": pe, "attention_mask": attn_mask, 'modulation_dims_img': modulation_dims, 'modulation_dims_txt': modulation_dims_txt}, {"original_block": block_wrap})
                txt = out["txt"]
                img = out["img"]
            else:
-                img, txt = block(img=img, txt=txt, vec=vec, pe=pe, attn_mask=attn_mask, modulation_dims_img=modulation_dims, modulation_dims_txt=modulation_dims_txt, transformer_options=transformer_options)
+                img, txt = block(img=img, txt=txt, vec=vec, pe=pe, attn_mask=attn_mask, modulation_dims_img=modulation_dims, modulation_dims_txt=modulation_dims_txt)

            if control is not None: # Controlnet
                control_i = control.get("input")
@@ -375,13 +306,13 @@ class HunyuanVideo(nn.Module):
            if ("single_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
-                    out["img"] = block(args["img"], vec=args["vec"], pe=args["pe"], attn_mask=args["attention_mask"], modulation_dims=args["modulation_dims"], transformer_options=args["transformer_options"])
+                    out["img"] = block(args["img"], vec=args["vec"], pe=args["pe"], attn_mask=args["attention_mask"], modulation_dims=args["modulation_dims"])
                    return out

-                out = blocks_replace[("single_block", i)]({"img": img, "vec": vec, "pe": pe, "attention_mask": attn_mask, 'modulation_dims': modulation_dims, 'transformer_options': transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("single_block", i)]({"img": img, "vec": vec, "pe": pe, "attention_mask": attn_mask, 'modulation_dims': modulation_dims}, {"original_block": block_wrap})
                img = out["img"]
            else:
-                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask, modulation_dims=modulation_dims, transformer_options=transformer_options)
+                img = block(img, vec=vec, pe=pe, attn_mask=attn_mask, modulation_dims=modulation_dims)

            if control is not None: # Controlnet
                control_o = control.get("output")
@@ -396,16 +327,12 @@ class HunyuanVideo(nn.Module):

        img = self.final_layer(img, vec, modulation_dims=modulation_dims)  # (N, T, patch_size ** 2 * out_channels)

-        shape = initial_shape[-len(self.patch_size):]
+        shape = initial_shape[-3:]
        for i in range(len(shape)):
            shape[i] = shape[i] // self.patch_size[i]
        img = img.reshape([img.shape[0]] + shape + [self.out_channels] + self.patch_size)
-        if img.ndim == 8:
-            img = img.permute(0, 4, 1, 5, 2, 6, 3, 7)
-            img = img.reshape(initial_shape[0], self.out_channels, initial_shape[2], initial_shape[3], initial_shape[4])
-        else:
-            img = img.permute(0, 3, 1, 4, 2, 5)
-            img = img.reshape(initial_shape[0], self.out_channels, initial_shape[2], initial_shape[3])
+        img = img.permute(0, 4, 1, 5, 2, 6, 3, 7)
+        img = img.reshape(initial_shape[0], self.out_channels, initial_shape[2], initial_shape[3], initial_shape[4])
        return img

    def img_ids(self, x):
@@ -420,30 +347,9 @@ class HunyuanVideo(nn.Module):
        img_ids[:, :, :, 2] = img_ids[:, :, :, 2] + torch.linspace(0, w_len - 1, steps=w_len, device=x.device, dtype=x.dtype).reshape(1, 1, -1)
        return repeat(img_ids, "t h w c -> b (t h w) c", b=bs)

-    def img_ids_2d(self, x):
-        bs, c, h, w = x.shape
-        patch_size = self.patch_size
-        h_len = ((h + (patch_size[0] // 2)) // patch_size[0])
-        w_len = ((w + (patch_size[1] // 2)) // patch_size[1])
-        img_ids = torch.zeros((h_len, w_len, 2), device=x.device, dtype=x.dtype)
-        img_ids[:, :, 0] = img_ids[:, :, 0] + torch.linspace(0, h_len - 1, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1)
-        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(0, w_len - 1, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0)
-        return repeat(img_ids, "h w c -> b (h w) c", b=bs)
-
-    def forward(self, x, timestep, context, y=None, txt_byt5=None, guidance=None, attention_mask=None, guiding_frame_index=None, ref_latent=None, disable_time_r=False, control=None, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, y, txt_byt5, guidance, attention_mask, guiding_frame_index, ref_latent, disable_time_r, control, transformer_options, **kwargs)
-
-    def _forward(self, x, timestep, context, y=None, txt_byt5=None, guidance=None, attention_mask=None, guiding_frame_index=None, ref_latent=None, disable_time_r=False, control=None, transformer_options={}, **kwargs):
-        bs = x.shape[0]
-        if len(self.patch_size) == 3:
-            img_ids = self.img_ids(x)
-            txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
-        else:
-            img_ids = self.img_ids_2d(x)
-            txt_ids = torch.zeros((bs, context.shape[1], 2), device=x.device, dtype=x.dtype)
-        out = self.forward_orig(x, img_ids, context, txt_ids, attention_mask, timestep, y, txt_byt5, guidance, guiding_frame_index, ref_latent, disable_time_r=disable_time_r, control=control, transformer_options=transformer_options)
+    def forward(self, x, timestep, context, y, guidance=None, attention_mask=None, guiding_frame_index=None, ref_latent=None, control=None, transformer_options={}, **kwargs):
+        bs, c, t, h, w = x.shape
+        img_ids = self.img_ids(x)
+        txt_ids = torch.zeros((bs, context.shape[1], 3), device=x.device, dtype=x.dtype)
+        out = self.forward_orig(x, img_ids, context, txt_ids, attention_mask, timestep, y, guidance, guiding_frame_index, ref_latent, control=control, transformer_options=transformer_options)
        return out
--- a/comfy/ldm/hunyuan_video/vae.py
+++ b/comfy/ldm/hunyuan_video/vae.py
@@ -1,136 +0,0 @@
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.modules.diffusionmodules.model import ResnetBlock, AttnBlock
-import comfy.ops
-ops = comfy.ops.disable_weight_init
-
-
-class PixelShuffle2D(nn.Module):
-    def __init__(self, in_dim, out_dim, op=ops.Conv2d):
-        super().__init__()
-        self.conv = op(in_dim, out_dim >> 2, 3, 1, 1)
-        self.ratio = (in_dim << 2) // out_dim
-
-    def forward(self, x):
-        b, c, h, w = x.shape
-        h2, w2 = h >> 1, w >> 1
-        y = self.conv(x).view(b, -1, h2, 2, w2, 2).permute(0, 3, 5, 1, 2, 4).reshape(b, -1, h2, w2)
-        r = x.view(b, c, h2, 2, w2, 2).permute(0, 3, 5, 1, 2, 4).reshape(b, c << 2, h2, w2)
-        return y + r.view(b, y.shape[1], self.ratio, h2, w2).mean(2)
-
-
-class PixelUnshuffle2D(nn.Module):
-    def __init__(self, in_dim, out_dim, op=ops.Conv2d):
-        super().__init__()
-        self.conv = op(in_dim, out_dim << 2, 3, 1, 1)
-        self.scale = (out_dim << 2) // in_dim
-
-    def forward(self, x):
-        b, c, h, w = x.shape
-        h2, w2 = h << 1, w << 1
-        y = self.conv(x).view(b, 2, 2, -1, h, w).permute(0, 3, 4, 1, 5, 2).reshape(b, -1, h2, w2)
-        r = x.repeat_interleave(self.scale, 1).view(b, 2, 2, -1, h, w).permute(0, 3, 4, 1, 5, 2).reshape(b, -1, h2, w2)
-        return y + r
-
-
-class Encoder(nn.Module):
-    def __init__(self, in_channels, z_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, downsample_match_channel=True, **_):
-        super().__init__()
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-        self.conv_in = ops.Conv2d(in_channels, block_out_channels[0], 3, 1, 1)
-
-        self.down = nn.ModuleList()
-        ch = block_out_channels[0]
-        depth = (ffactor_spatial >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([ResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                     out_channels=tgt,
-                                                     temb_channels=0,
-                                                     conv_op=ops.Conv2d)
-                                        for j in range(num_res_blocks)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and downsample_match_channel else ch
-                stage.downsample = PixelShuffle2D(ch, nxt, ops.Conv2d)
-                ch = nxt
-            self.down.append(stage)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv2d)
-        self.mid.block_2 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-
-        self.norm_out = ops.GroupNorm(32, ch, 1e-6, True)
-        self.conv_out = ops.Conv2d(ch, z_channels << 1, 3, 1, 1)
-
-    def forward(self, x):
-        x = self.conv_in(x)
-
-        for stage in self.down:
-            for blk in stage.block:
-                x = blk(x)
-            if hasattr(stage, 'downsample'):
-                x = stage.downsample(x)
-
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(x)))
-
-        b, c, h, w = x.shape
-        grp = c // (self.z_channels << 1)
-        skip = x.view(b, c // grp, grp, h, w).mean(2)
-
-        return self.conv_out(F.silu(self.norm_out(x))) + skip
-
-
-class Decoder(nn.Module):
-    def __init__(self, z_channels, out_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, upsample_match_channel=True, **_):
-        super().__init__()
-        block_out_channels = block_out_channels[::-1]
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-
-        ch = block_out_channels[0]
-        self.conv_in = ops.Conv2d(z_channels, ch, 3, 1, 1)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv2d)
-        self.mid.block_2 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=ops.Conv2d)
-
-        self.up = nn.ModuleList()
-        depth = (ffactor_spatial >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([ResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                     out_channels=tgt,
-                                                     temb_channels=0,
-                                                     conv_op=ops.Conv2d)
-                                        for j in range(num_res_blocks + 1)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and upsample_match_channel else ch
-                stage.upsample = PixelUnshuffle2D(ch, nxt, ops.Conv2d)
-                ch = nxt
-            self.up.append(stage)
-
-        self.norm_out = ops.GroupNorm(32, ch, 1e-6, True)
-        self.conv_out = ops.Conv2d(ch, out_channels, 3, 1, 1)
-
-    def forward(self, z):
-        x = self.conv_in(z) + z.repeat_interleave(self.block_out_channels[0] // self.z_channels, 1)
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(x)))
-
-        for stage in self.up:
-            for blk in stage.block:
-                x = blk(x)
-            if hasattr(stage, 'upsample'):
-                x = stage.upsample(x)
-
-        return self.conv_out(F.silu(self.norm_out(x)))
--- a/comfy/ldm/hunyuan_video/vae_refiner.py
+++ b/comfy/ldm/hunyuan_video/vae_refiner.py
@@ -1,267 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from comfy.ldm.modules.diffusionmodules.model import ResnetBlock, AttnBlock, VideoConv3d
-import comfy.ops
-import comfy.ldm.models.autoencoder
-ops = comfy.ops.disable_weight_init
-
-class RMS_norm(nn.Module):
-    def __init__(self, dim):
-        super().__init__()
-        shape = (dim, 1, 1, 1)
-        self.scale = dim**0.5
-        self.gamma = nn.Parameter(torch.empty(shape))
-
-    def forward(self, x):
-        return F.normalize(x, dim=1) * self.scale * self.gamma
-
-class DnSmpl(nn.Module):
-    def __init__(self, ic, oc, tds=True):
-        super().__init__()
-        fct = 2 * 2 * 2 if tds else 1 * 2 * 2
-        assert oc % fct == 0
-        self.conv = VideoConv3d(ic, oc // fct, kernel_size=3)
-
-        self.tds = tds
-        self.gs = fct * ic // oc
-
-    def forward(self, x):
-        r1 = 2 if self.tds else 1
-        h = self.conv(x)
-
-        if self.tds:
-            hf = h[:, :, :1, :, :]
-            b, c, f, ht, wd = hf.shape
-            hf = hf.reshape(b, c, f, ht // 2, 2, wd // 2, 2)
-            hf = hf.permute(0, 4, 6, 1, 2, 3, 5)
-            hf = hf.reshape(b, 2 * 2 * c, f, ht // 2, wd // 2)
-            hf = torch.cat([hf, hf], dim=1)
-
-            hn = h[:, :, 1:, :, :]
-            b, c, frms, ht, wd = hn.shape
-            nf = frms // r1
-            hn = hn.reshape(b, c, nf, r1, ht // 2, 2, wd // 2, 2)
-            hn = hn.permute(0, 3, 5, 7, 1, 2, 4, 6)
-            hn = hn.reshape(b, r1 * 2 * 2 * c, nf, ht // 2, wd // 2)
-
-            h = torch.cat([hf, hn], dim=2)
-
-            xf = x[:, :, :1, :, :]
-            b, ci, f, ht, wd = xf.shape
-            xf = xf.reshape(b, ci, f, ht // 2, 2, wd // 2, 2)
-            xf = xf.permute(0, 4, 6, 1, 2, 3, 5)
-            xf = xf.reshape(b, 2 * 2 * ci, f, ht // 2, wd // 2)
-            B, C, T, H, W = xf.shape
-            xf = xf.view(B, h.shape[1], self.gs // 2, T, H, W).mean(dim=2)
-
-            xn = x[:, :, 1:, :, :]
-            b, ci, frms, ht, wd = xn.shape
-            nf = frms // r1
-            xn = xn.reshape(b, ci, nf, r1, ht // 2, 2, wd // 2, 2)
-            xn = xn.permute(0, 3, 5, 7, 1, 2, 4, 6)
-            xn = xn.reshape(b, r1 * 2 * 2 * ci, nf, ht // 2, wd // 2)
-            B, C, T, H, W = xn.shape
-            xn = xn.view(B, h.shape[1], self.gs, T, H, W).mean(dim=2)
-            sc = torch.cat([xf, xn], dim=2)
-        else:
-            b, c, frms, ht, wd = h.shape
-            nf = frms // r1
-            h = h.reshape(b, c, nf, r1, ht // 2, 2, wd // 2, 2)
-            h = h.permute(0, 3, 5, 7, 1, 2, 4, 6)
-            h = h.reshape(b, r1 * 2 * 2 * c, nf, ht // 2, wd // 2)
-
-            b, ci, frms, ht, wd = x.shape
-            nf = frms // r1
-            sc = x.reshape(b, ci, nf, r1, ht // 2, 2, wd // 2, 2)
-            sc = sc.permute(0, 3, 5, 7, 1, 2, 4, 6)
-            sc = sc.reshape(b, r1 * 2 * 2 * ci, nf, ht // 2, wd // 2)
-            B, C, T, H, W = sc.shape
-            sc = sc.view(B, h.shape[1], self.gs, T, H, W).mean(dim=2)
-
-        return h + sc
-
-
-class UpSmpl(nn.Module):
-    def __init__(self, ic, oc, tus=True):
-        super().__init__()
-        fct = 2 * 2 * 2 if tus else 1 * 2 * 2
-        self.conv = VideoConv3d(ic, oc * fct, kernel_size=3)
-
-        self.tus = tus
-        self.rp = fct * oc // ic
-
-    def forward(self, x):
-        r1 = 2 if self.tus else 1
-        h = self.conv(x)
-
-        if self.tus:
-            hf = h[:, :, :1, :, :]
-            b, c, f, ht, wd = hf.shape
-            nc = c // (2 * 2)
-            hf = hf.reshape(b, 2, 2, nc, f, ht, wd)
-            hf = hf.permute(0, 3, 4, 5, 1, 6, 2)
-            hf = hf.reshape(b, nc, f, ht * 2, wd * 2)
-            hf = hf[:, : hf.shape[1] // 2]
-
-            hn = h[:, :, 1:, :, :]
-            b, c, frms, ht, wd = hn.shape
-            nc = c // (r1 * 2 * 2)
-            hn = hn.reshape(b, r1, 2, 2, nc, frms, ht, wd)
-            hn = hn.permute(0, 4, 5, 1, 6, 2, 7, 3)
-            hn = hn.reshape(b, nc, frms * r1, ht * 2, wd * 2)
-
-            h = torch.cat([hf, hn], dim=2)
-
-            xf = x[:, :, :1, :, :]
-            b, ci, f, ht, wd = xf.shape
-            xf = xf.repeat_interleave(repeats=self.rp // 2, dim=1)
-            b, c, f, ht, wd = xf.shape
-            nc = c // (2 * 2)
-            xf = xf.reshape(b, 2, 2, nc, f, ht, wd)
-            xf = xf.permute(0, 3, 4, 5, 1, 6, 2)
-            xf = xf.reshape(b, nc, f, ht * 2, wd * 2)
-
-            xn = x[:, :, 1:, :, :]
-            xn = xn.repeat_interleave(repeats=self.rp, dim=1)
-            b, c, frms, ht, wd = xn.shape
-            nc = c // (r1 * 2 * 2)
-            xn = xn.reshape(b, r1, 2, 2, nc, frms, ht, wd)
-            xn = xn.permute(0, 4, 5, 1, 6, 2, 7, 3)
-            xn = xn.reshape(b, nc, frms * r1, ht * 2, wd * 2)
-            sc = torch.cat([xf, xn], dim=2)
-        else:
-            b, c, frms, ht, wd = h.shape
-            nc = c // (r1 * 2 * 2)
-            h = h.reshape(b, r1, 2, 2, nc, frms, ht, wd)
-            h = h.permute(0, 4, 5, 1, 6, 2, 7, 3)
-            h = h.reshape(b, nc, frms * r1, ht * 2, wd * 2)
-
-            sc = x.repeat_interleave(repeats=self.rp, dim=1)
-            b, c, frms, ht, wd = sc.shape
-            nc = c // (r1 * 2 * 2)
-            sc = sc.reshape(b, r1, 2, 2, nc, frms, ht, wd)
-            sc = sc.permute(0, 4, 5, 1, 6, 2, 7, 3)
-            sc = sc.reshape(b, nc, frms * r1, ht * 2, wd * 2)
-
-        return h + sc
-
-class Encoder(nn.Module):
-    def __init__(self, in_channels, z_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, ffactor_temporal, downsample_match_channel=True, **_):
-        super().__init__()
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-        self.conv_in = VideoConv3d(in_channels, block_out_channels[0], 3, 1, 1)
-
-        self.down = nn.ModuleList()
-        ch = block_out_channels[0]
-        depth = (ffactor_spatial >> 1).bit_length()
-        depth_temporal = ((ffactor_spatial // ffactor_temporal) >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([ResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                     out_channels=tgt,
-                                                     temb_channels=0,
-                                                     conv_op=VideoConv3d, norm_op=RMS_norm)
-                                        for j in range(num_res_blocks)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and downsample_match_channel else ch
-                stage.downsample = DnSmpl(ch, nxt, tds=i >= depth_temporal)
-                ch = nxt
-            self.down.append(stage)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=VideoConv3d, norm_op=RMS_norm)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv3d, norm_op=RMS_norm)
-        self.mid.block_2 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=VideoConv3d, norm_op=RMS_norm)
-
-        self.norm_out = RMS_norm(ch)
-        self.conv_out = VideoConv3d(ch, z_channels << 1, 3, 1, 1)
-
-        self.regul = comfy.ldm.models.autoencoder.DiagonalGaussianRegularizer()
-
-    def forward(self, x):
-        x = self.conv_in(x)
-
-        for stage in self.down:
-            for blk in stage.block:
-                x = blk(x)
-            if hasattr(stage, 'downsample'):
-                x = stage.downsample(x)
-
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(x)))
-
-        b, c, t, h, w = x.shape
-        grp = c // (self.z_channels << 1)
-        skip = x.view(b, c // grp, grp, t, h, w).mean(2)
-
-        out = self.conv_out(F.silu(self.norm_out(x))) + skip
-        out = self.regul(out)[0]
-
-        out = torch.cat((out[:, :, :1], out), dim=2)
-        out = out.permute(0, 2, 1, 3, 4)
-        b, f_times_2, c, h, w = out.shape
-        out = out.reshape(b, f_times_2 // 2, 2 * c, h, w)
-        out = out.permute(0, 2, 1, 3, 4).contiguous()
-        return out
-
-class Decoder(nn.Module):
-    def __init__(self, z_channels, out_channels, block_out_channels, num_res_blocks,
-                 ffactor_spatial, ffactor_temporal, upsample_match_channel=True, **_):
-        super().__init__()
-        block_out_channels = block_out_channels[::-1]
-        self.z_channels = z_channels
-        self.block_out_channels = block_out_channels
-        self.num_res_blocks = num_res_blocks
-
-        ch = block_out_channels[0]
-        self.conv_in = VideoConv3d(z_channels, ch, 3)
-
-        self.mid = nn.Module()
-        self.mid.block_1 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=VideoConv3d, norm_op=RMS_norm)
-        self.mid.attn_1 = AttnBlock(ch, conv_op=ops.Conv3d, norm_op=RMS_norm)
-        self.mid.block_2 = ResnetBlock(in_channels=ch, out_channels=ch, temb_channels=0, conv_op=VideoConv3d, norm_op=RMS_norm)
-
-        self.up = nn.ModuleList()
-        depth = (ffactor_spatial >> 1).bit_length()
-        depth_temporal = (ffactor_temporal >> 1).bit_length()
-
-        for i, tgt in enumerate(block_out_channels):
-            stage = nn.Module()
-            stage.block = nn.ModuleList([ResnetBlock(in_channels=ch if j == 0 else tgt,
-                                                     out_channels=tgt,
-                                                     temb_channels=0,
-                                                     conv_op=VideoConv3d, norm_op=RMS_norm)
-                                        for j in range(num_res_blocks + 1)])
-            ch = tgt
-            if i < depth:
-                nxt = block_out_channels[i + 1] if i + 1 < len(block_out_channels) and upsample_match_channel else ch
-                stage.upsample = UpSmpl(ch, nxt, tus=i < depth_temporal)
-                ch = nxt
-            self.up.append(stage)
-
-        self.norm_out = RMS_norm(ch)
-        self.conv_out = VideoConv3d(ch, out_channels, 3)
-
-    def forward(self, z):
-        z = z.permute(0, 2, 1, 3, 4)
-        b, f, c, h, w = z.shape
-        z = z.reshape(b, f, 2, c // 2, h, w)
-        z = z.permute(0, 1, 2, 3, 4, 5).reshape(b, f * 2, c // 2, h, w)
-        z = z.permute(0, 2, 1, 3, 4)
-        z = z[:, :, 1:]
-
-        x = self.conv_in(z) + z.repeat_interleave(self.block_out_channels[0] // self.z_channels, 1)
-        x = self.mid.block_2(self.mid.attn_1(self.mid.block_1(x)))
-
-        for stage in self.up:
-            for blk in stage.block:
-                x = blk(x)
-            if hasattr(stage, 'upsample'):
-                x = stage.upsample(x)
-
-        return self.conv_out(F.silu(self.norm_out(x)))
--- a/comfy/ldm/lightricks/model.py
+++ b/comfy/ldm/lightricks/model.py
@@ -1,6 +1,5 @@
 import torch
 from torch import nn
-import comfy.patcher_extension
 import comfy.ldm.modules.attention
 import comfy.ldm.common_dit
 from einops import rearrange
@@ -262,8 +261,8 @@ class CrossAttention(nn.Module):
        self.heads = heads
        self.dim_head = dim_head

-        self.q_norm = operations.RMSNorm(inner_dim, eps=1e-5, dtype=dtype, device=device)
-        self.k_norm = operations.RMSNorm(inner_dim, eps=1e-5, dtype=dtype, device=device)
+        self.q_norm = operations.RMSNorm(inner_dim, dtype=dtype, device=device)
+        self.k_norm = operations.RMSNorm(inner_dim, dtype=dtype, device=device)

        self.to_q = operations.Linear(query_dim, inner_dim, bias=True, dtype=dtype, device=device)
        self.to_k = operations.Linear(context_dim, inner_dim, bias=True, dtype=dtype, device=device)
@@ -271,7 +270,7 @@ class CrossAttention(nn.Module):

        self.to_out = nn.Sequential(operations.Linear(inner_dim, query_dim, dtype=dtype, device=device), nn.Dropout(dropout))

-    def forward(self, x, context=None, mask=None, pe=None, transformer_options={}):
+    def forward(self, x, context=None, mask=None, pe=None):
        q = self.to_q(x)
        context = x if context is None else context
        k = self.to_k(context)
@@ -285,9 +284,9 @@ class CrossAttention(nn.Module):
            k = apply_rotary_emb(k, pe)

        if mask is None:
-            out = comfy.ldm.modules.attention.optimized_attention(q, k, v, self.heads, attn_precision=self.attn_precision, transformer_options=transformer_options)
+            out = comfy.ldm.modules.attention.optimized_attention(q, k, v, self.heads, attn_precision=self.attn_precision)
        else:
-            out = comfy.ldm.modules.attention.optimized_attention_masked(q, k, v, self.heads, mask, attn_precision=self.attn_precision, transformer_options=transformer_options)
+            out = comfy.ldm.modules.attention.optimized_attention_masked(q, k, v, self.heads, mask, attn_precision=self.attn_precision)
        return self.to_out(out)


@@ -303,12 +302,12 @@ class BasicTransformerBlock(nn.Module):

        self.scale_shift_table = nn.Parameter(torch.empty(6, dim, device=device, dtype=dtype))

-    def forward(self, x, context=None, attention_mask=None, timestep=None, pe=None, transformer_options={}):
+    def forward(self, x, context=None, attention_mask=None, timestep=None, pe=None):
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (self.scale_shift_table[None, None].to(device=x.device, dtype=x.dtype) + timestep.reshape(x.shape[0], timestep.shape[1], self.scale_shift_table.shape[0], -1)).unbind(dim=2)

-        x += self.attn1(comfy.ldm.common_dit.rms_norm(x) * (1 + scale_msa) + shift_msa, pe=pe, transformer_options=transformer_options) * gate_msa
+        x += self.attn1(comfy.ldm.common_dit.rms_norm(x) * (1 + scale_msa) + shift_msa, pe=pe) * gate_msa

-        x += self.attn2(x, context=context, mask=attention_mask, transformer_options=transformer_options)
+        x += self.attn2(x, context=context, mask=attention_mask)

        y = comfy.ldm.common_dit.rms_norm(x) * (1 + scale_mlp) + shift_mlp
        x += self.ff(y) * gate_mlp
@@ -421,13 +420,6 @@ class LTXVModel(torch.nn.Module):
        self.patchifier = SymmetricPatchifier(1)

    def forward(self, x, timestep, context, attention_mask, frame_rate=25, transformer_options={}, keyframe_idxs=None, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, attention_mask, frame_rate, transformer_options, keyframe_idxs, **kwargs)
-
-    def _forward(self, x, timestep, context, attention_mask, frame_rate=25, transformer_options={}, keyframe_idxs=None, **kwargs):
        patches_replace = transformer_options.get("patches_replace", {})

        orig_shape = list(x.shape)
@@ -479,10 +471,10 @@ class LTXVModel(torch.nn.Module):
            if ("double_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
-                    out["img"] = block(args["img"], context=args["txt"], attention_mask=args["attention_mask"], timestep=args["vec"], pe=args["pe"], transformer_options=args["transformer_options"])
+                    out["img"] = block(args["img"], context=args["txt"], attention_mask=args["attention_mask"], timestep=args["vec"], pe=args["pe"])
                    return out

-                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "attention_mask": attention_mask, "vec": timestep, "pe": pe, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "attention_mask": attention_mask, "vec": timestep, "pe": pe}, {"original_block": block_wrap})
                x = out["img"]
            else:
                x = block(
@@ -490,8 +482,7 @@ class LTXVModel(torch.nn.Module):
                    context=context,
                    attention_mask=attention_mask,
                    timestep=timestep,
-                    pe=pe,
-                    transformer_options=transformer_options,
+                    pe=pe
                )

        # 3. Output
--- a/comfy/ldm/lightricks/vae/causal_video_autoencoder.py
+++ b/comfy/ldm/lightricks/vae/causal_video_autoencoder.py
@@ -973,7 +973,7 @@ class VideoVAE(nn.Module):
            norm_layer=config.get("norm_layer", "group_norm"),
            causal=config.get("causal_decoder", False),
            timestep_conditioning=self.timestep_conditioning,
-            spatial_padding_mode=config.get("spatial_padding_mode", "reflect"),
+            spatial_padding_mode=config.get("spatial_padding_mode", "zeros"),
        )

        self.per_channel_statistics = processor()
--- a/comfy/ldm/lumina/model.py
+++ b/comfy/ldm/lumina/model.py
@@ -11,7 +11,6 @@ import comfy.ldm.common_dit
 from comfy.ldm.modules.diffusionmodules.mmdit import TimestepEmbedder
 from comfy.ldm.modules.attention import optimized_attention_masked
 from comfy.ldm.flux.layers import EmbedND
-import comfy.patcher_extension


 def modulate(x, scale):
@@ -104,7 +103,6 @@ class JointAttention(nn.Module):
        x: torch.Tensor,
        x_mask: torch.Tensor,
        freqs_cis: torch.Tensor,
-        transformer_options={},
    ) -> torch.Tensor:
        """

@@ -141,7 +139,7 @@ class JointAttention(nn.Module):
        if n_rep >= 1:
            xk = xk.unsqueeze(3).repeat(1, 1, 1, n_rep, 1).flatten(2, 3)
            xv = xv.unsqueeze(3).repeat(1, 1, 1, n_rep, 1).flatten(2, 3)
-        output = optimized_attention_masked(xq.movedim(1, 2), xk.movedim(1, 2), xv.movedim(1, 2), self.n_local_heads, x_mask, skip_reshape=True, transformer_options=transformer_options)
+        output = optimized_attention_masked(xq.movedim(1, 2), xk.movedim(1, 2), xv.movedim(1, 2), self.n_local_heads, x_mask, skip_reshape=True)

        return self.out(output)

@@ -269,7 +267,6 @@ class JointTransformerBlock(nn.Module):
        x_mask: torch.Tensor,
        freqs_cis: torch.Tensor,
        adaln_input: Optional[torch.Tensor]=None,
-        transformer_options={},
    ):
        """
        Perform a forward pass through the TransformerBlock.
@@ -292,7 +289,6 @@ class JointTransformerBlock(nn.Module):
                    modulate(self.attention_norm1(x), scale_msa),
                    x_mask,
                    freqs_cis,
-                    transformer_options=transformer_options,
                )
            )
            x = x + gate_mlp.unsqueeze(1).tanh() * self.ffn_norm2(
@@ -307,7 +303,6 @@ class JointTransformerBlock(nn.Module):
                    self.attention_norm1(x),
                    x_mask,
                    freqs_cis,
-                    transformer_options=transformer_options,
                )
            )
            x = x + self.ffn_norm2(
@@ -498,7 +493,7 @@ class NextDiT(nn.Module):
        return imgs

    def patchify_and_embed(
-        self, x: List[torch.Tensor] | torch.Tensor, cap_feats: torch.Tensor, cap_mask: torch.Tensor, t: torch.Tensor, num_tokens, transformer_options={}
+        self, x: List[torch.Tensor] | torch.Tensor, cap_feats: torch.Tensor, cap_mask: torch.Tensor, t: torch.Tensor, num_tokens
    ) -> Tuple[torch.Tensor, torch.Tensor, List[Tuple[int, int]], List[int], torch.Tensor]:
        bsz = len(x)
        pH = pW = self.patch_size
@@ -558,7 +553,7 @@ class NextDiT(nn.Module):

        # refine context
        for layer in self.context_refiner:
-            cap_feats = layer(cap_feats, cap_mask, cap_freqs_cis, transformer_options=transformer_options)
+            cap_feats = layer(cap_feats, cap_mask, cap_freqs_cis)

        # refine image
        flat_x = []
@@ -577,7 +572,7 @@ class NextDiT(nn.Module):
        padded_img_embed = self.x_embedder(padded_img_embed)
        padded_img_mask = padded_img_mask.unsqueeze(1)
        for layer in self.noise_refiner:
-            padded_img_embed = layer(padded_img_embed, padded_img_mask, img_freqs_cis, t, transformer_options=transformer_options)
+            padded_img_embed = layer(padded_img_embed, padded_img_mask, img_freqs_cis, t)

        if cap_mask is not None:
            mask = torch.zeros(bsz, max_seq_len, dtype=dtype, device=device)
@@ -595,15 +590,8 @@ class NextDiT(nn.Module):

        return padded_full_embed, mask, img_sizes, l_effective_cap_len, freqs_cis

-    def forward(self, x, timesteps, context, num_tokens, attention_mask=None, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, kwargs.get("transformer_options", {}))
-        ).execute(x, timesteps, context, num_tokens, attention_mask, **kwargs)
-
    # def forward(self, x, t, cap_feats, cap_mask):
-    def _forward(self, x, timesteps, context, num_tokens, attention_mask=None, **kwargs):
+    def forward(self, x, timesteps, context, num_tokens, attention_mask=None, **kwargs):
        t = 1.0 - timesteps
        cap_feats = context
        cap_mask = attention_mask
@@ -620,13 +608,12 @@ class NextDiT(nn.Module):

        cap_feats = self.cap_embedder(cap_feats)  # (N, L, D)  # todo check if able to batchify w.o. redundant compute

-        transformer_options = kwargs.get("transformer_options", {})
        x_is_tensor = isinstance(x, torch.Tensor)
-        x, mask, img_size, cap_size, freqs_cis = self.patchify_and_embed(x, cap_feats, cap_mask, t, num_tokens, transformer_options=transformer_options)
+        x, mask, img_size, cap_size, freqs_cis = self.patchify_and_embed(x, cap_feats, cap_mask, t, num_tokens)
        freqs_cis = freqs_cis.to(x.device)

        for layer in self.layers:
-            x = layer(x, mask, freqs_cis, adaln_input, transformer_options=transformer_options)
+            x = layer(x, mask, freqs_cis, adaln_input)

        x = self.final_layer(x, adaln_input)
        x = self.unpatchify(x, img_size, cap_size, return_tensor=x_is_tensor)[:,:,:h,:w]
--- a/comfy/ldm/models/autoencoder.py
+++ b/comfy/ldm/models/autoencoder.py
@@ -11,7 +11,7 @@ from comfy.ldm.modules.ema import LitEma
 import comfy.ops

 class DiagonalGaussianRegularizer(torch.nn.Module):
-    def __init__(self, sample: bool = False):
+    def __init__(self, sample: bool = True):
        super().__init__()
        self.sample = sample

@@ -19,19 +19,17 @@ class DiagonalGaussianRegularizer(torch.nn.Module):
        yield from ()

    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, dict]:
+        log = dict()
        posterior = DiagonalGaussianDistribution(z)
        if self.sample:
            z = posterior.sample()
        else:
            z = posterior.mode()
-        return z, None
+        kl_loss = posterior.kl()
+        kl_loss = torch.sum(kl_loss) / kl_loss.shape[0]
+        log["kl_loss"] = kl_loss
+        return z, log

-class EmptyRegularizer(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-
-    def forward(self, z: torch.Tensor) -> Tuple[torch.Tensor, dict]:
-        return z, None

 class AbstractAutoencoder(torch.nn.Module):
    """
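
Note on the autoencoder.py hunk above: the regularizer now returns a log dict whose "kl_loss" entry is the summed KL term divided by the batch size. A minimal plain-Python sketch of that reduction follows. It assumes `posterior.kl()` is the usual closed-form KL divergence from a diagonal Gaussian to a standard normal; the helper names `kl_to_standard_normal` and `batch_kl_loss` are hypothetical and not part of the codebase.

```python
import math

def kl_to_standard_normal(mean, logvar):
    # Assumed closed form: 0.5 * sum(mu^2 + var - 1 - logvar) over one sample.
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, logvar))

def batch_kl_loss(means, logvars):
    # Mirrors torch.sum(kl_loss) / kl_loss.shape[0]: sum per-sample KL, divide by batch size.
    per_sample = [kl_to_standard_normal(m, lv) for m, lv in zip(means, logvars)]
    return sum(per_sample) / len(per_sample)

# Two samples: one already standard normal, one with its mean shifted by 1.
loss = batch_kl_loss([[0.0], [1.0]], [[0.0], [0.0]])
```

The first sample contributes 0 and the second 0.5, so the batch average here is 0.25.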
--- a/comfy/ldm/modules/attention.py
+++ b/comfy/ldm/modules/attention.py
@@ -5,9 +5,8 @@ import torch
 import torch.nn.functional as F
 from torch import nn, einsum
 from einops import rearrange, repeat
-from typing import Optional, Any, Callable, Union
+from typing import Optional
 import logging
-import functools

 from .diffusionmodules.util import AlphaBlender, timestep_embedding
 from .sub_quadratic_attention import efficient_dot_product_attention
@@ -18,45 +17,23 @@ if model_management.xformers_enabled():
    import xformers
    import xformers.ops

-SAGE_ATTENTION_IS_AVAILABLE = False
-try:
-    from sageattention import sageattn
-    SAGE_ATTENTION_IS_AVAILABLE = True
-except ImportError as e:
-    if model_management.sage_attention_enabled():
+if model_management.sage_attention_enabled():
+    try:
+        from sageattention import sageattn
+    except ModuleNotFoundError as e:
        if e.name == "sageattention":
            logging.error(f"\n\nTo use the `--use-sage-attention` feature, the `sageattention` package must be installed first.\ncommand:\n\t{sys.executable} -m pip install sageattention")
        else:
            raise e
        exit(-1)

-FLASH_ATTENTION_IS_AVAILABLE = False
-try:
-    from flash_attn import flash_attn_func
-    FLASH_ATTENTION_IS_AVAILABLE = True
-except ImportError:
-    if model_management.flash_attention_enabled():
+if model_management.flash_attention_enabled():
+    try:
+        from flash_attn import flash_attn_func
+    except ModuleNotFoundError:
        logging.error(f"\n\nTo use the `--use-flash-attention` feature, the `flash-attn` package must be installed first.\ncommand:\n\t{sys.executable} -m pip install flash-attn")
        exit(-1)

-REGISTERED_ATTENTION_FUNCTIONS = {}
-def register_attention_function(name: str, func: Callable):
-    # avoid replacing existing functions
-    if name not in REGISTERED_ATTENTION_FUNCTIONS:
-        REGISTERED_ATTENTION_FUNCTIONS[name] = func
-    else:
-        logging.warning(f"Attention function {name} already registered, skipping registration.")
-
-def get_attention_function(name: str, default: Any=...) -> Union[Callable, None]:
-    if name == "optimized":
-        return optimized_attention
-    elif name not in REGISTERED_ATTENTION_FUNCTIONS:
-        if default is ...:
-            raise KeyError(f"Attention function {name} not found.")
-        else:
-            return default
-    return REGISTERED_ATTENTION_FUNCTIONS[name]
-
 from comfy.cli_args import args
 import comfy.ops
 ops = comfy.ops.disable_weight_init
@@ -114,27 +91,7 @@ class FeedForward(nn.Module):
 def Normalize(in_channels, dtype=None, device=None):
    return torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True, dtype=dtype, device=device)

-
-def wrap_attn(func):
-    @functools.wraps(func)
-    def wrapper(*args, **kwargs):
-        remove_attn_wrapper_key = False
-        try:
-            if "_inside_attn_wrapper" not in kwargs:
-                transformer_options = kwargs.get("transformer_options", None)
-                remove_attn_wrapper_key = True
-                kwargs["_inside_attn_wrapper"] = True
-                if transformer_options is not None:
-                    if "optimized_attention_override" in transformer_options:
-                        return transformer_options["optimized_attention_override"](func, *args, **kwargs)
-            return func(*args, **kwargs)
-        finally:
-            if remove_attn_wrapper_key:
-                del kwargs["_inside_attn_wrapper"]
-    return wrapper
-
-@wrap_attn
-def attention_basic(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+def attention_basic(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    attn_precision = get_attn_precision(attn_precision, q.dtype)

    if skip_reshape:
@@ -202,8 +159,8 @@ def attention_basic(q, k, v, heads, mask=None, attn_precision=None, skip_reshape
        )
    return out

-@wrap_attn
-def attention_sub_quad(query, key, value, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+
+def attention_sub_quad(query, key, value, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    attn_precision = get_attn_precision(attn_precision, query.dtype)

    if skip_reshape:
@@ -273,8 +230,7 @@ def attention_sub_quad(query, key, value, heads, mask=None, attn_precision=None,
        hidden_states = hidden_states.unflatten(0, (-1, heads)).transpose(1,2).flatten(start_dim=2)
    return hidden_states

-@wrap_attn
-def attention_split(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+def attention_split(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    attn_precision = get_attn_precision(attn_precision, q.dtype)

    if skip_reshape:
@@ -403,8 +359,7 @@ try:
 except:
    pass

-@wrap_attn
-def attention_xformers(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+def attention_xformers(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    b = q.shape[0]
    dim_head = q.shape[-1]
    # check to make sure xformers isn't broken
@@ -419,7 +374,7 @@ def attention_xformers(q, k, v, heads, mask=None, attn_precision=None, skip_resh
            disabled_xformers = True

    if disabled_xformers:
-        return attention_pytorch(q, k, v, heads, mask, skip_reshape=skip_reshape, **kwargs)
+        return attention_pytorch(q, k, v, heads, mask, skip_reshape=skip_reshape)

    if skip_reshape:
        # b h k d -> b k h d
@@ -472,8 +427,8 @@ else:
    #TODO: other GPUs ?
    SDP_BATCH_LIMIT = 2**31

-@wrap_attn
-def attention_pytorch(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+
+def attention_pytorch(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    if skip_reshape:
        b, _, _, dim_head = q.shape
    else:
@@ -493,7 +448,7 @@ def attention_pytorch(q, k, v, heads, mask=None, attn_precision=None, skip_resha
            mask = mask.unsqueeze(1)

    if SDP_BATCH_LIMIT >= b:
-        out = comfy.ops.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
+        out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
        if not skip_output_reshape:
            out = (
                out.transpose(1, 2).reshape(b, -1, heads * dim_head)
@@ -506,7 +461,7 @@ def attention_pytorch(q, k, v, heads, mask=None, attn_precision=None, skip_resha
                if mask.shape[0] > 1:
                    m = mask[i : i + SDP_BATCH_LIMIT]

-            out[i : i + SDP_BATCH_LIMIT] = comfy.ops.scaled_dot_product_attention(
+            out[i : i + SDP_BATCH_LIMIT] = torch.nn.functional.scaled_dot_product_attention(
                q[i : i + SDP_BATCH_LIMIT],
                k[i : i + SDP_BATCH_LIMIT],
                v[i : i + SDP_BATCH_LIMIT],
@@ -515,8 +470,8 @@ def attention_pytorch(q, k, v, heads, mask=None, attn_precision=None, skip_resha
            ).transpose(1, 2).reshape(-1, q.shape[2], heads * dim_head)
    return out

-@wrap_attn
-def attention_sage(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+
+def attention_sage(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    if skip_reshape:
        b, _, _, dim_head = q.shape
        tensor_layout = "HND"
@@ -546,7 +501,7 @@ def attention_sage(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=
                lambda t: t.transpose(1, 2),
                (q, k, v),
            )
-        return attention_pytorch(q, k, v, heads, mask=mask, skip_reshape=True, skip_output_reshape=skip_output_reshape, **kwargs)
+        return attention_pytorch(q, k, v, heads, mask=mask, skip_reshape=True, skip_output_reshape=skip_output_reshape)

    if tensor_layout == "HND":
        if not skip_output_reshape:
@@ -579,8 +534,8 @@ except AttributeError as error:
                    dropout_p: float = 0.0, causal: bool = False) -> torch.Tensor:
        assert False, f"Could not define flash_attn_wrapper: {FLASH_ATTN_ERROR}"

-@wrap_attn
-def attention_flash(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False, **kwargs):
+
+def attention_flash(q, k, v, heads, mask=None, attn_precision=None, skip_reshape=False, skip_output_reshape=False):
    if skip_reshape:
        b, _, _, dim_head = q.shape
    else:
@@ -600,8 +555,7 @@ def attention_flash(q, k, v, heads, mask=None, attn_precision=None, skip_reshape
            mask = mask.unsqueeze(1)

    try:
-        if mask is not None:
-            raise RuntimeError("Mask must not be set for Flash attention")
+        assert mask is None
        out = flash_attn_wrapper(
            q.transpose(1, 2),
            k.transpose(1, 2),
@@ -643,19 +597,6 @@ else:

 optimized_attention_masked = optimized_attention

-
-# register core-supported attention functions
-if SAGE_ATTENTION_IS_AVAILABLE:
-    register_attention_function("sage", attention_sage)
-if FLASH_ATTENTION_IS_AVAILABLE:
-    register_attention_function("flash", attention_flash)
-if model_management.xformers_enabled():
-    register_attention_function("xformers", attention_xformers)
-register_attention_function("pytorch", attention_pytorch)
-register_attention_function("sub_quad", attention_sub_quad)
-register_attention_function("split", attention_split)
-
-
 def optimized_attention_for_device(device, mask=False, small_input=False):
    if small_input:
        if model_management.pytorch_attention_enabled():
@@ -688,7 +629,7 @@ class CrossAttention(nn.Module):

        self.to_out = nn.Sequential(operations.Linear(inner_dim, query_dim, dtype=dtype, device=device), nn.Dropout(dropout))

-    def forward(self, x, context=None, value=None, mask=None, transformer_options={}):
+    def forward(self, x, context=None, value=None, mask=None):
        q = self.to_q(x)
        context = default(context, x)
        k = self.to_k(context)
@@ -699,9 +640,9 @@ class CrossAttention(nn.Module):
            v = self.to_v(context)

        if mask is None:
-            out = optimized_attention(q, k, v, self.heads, attn_precision=self.attn_precision, transformer_options=transformer_options)
+            out = optimized_attention(q, k, v, self.heads, attn_precision=self.attn_precision)
        else:
-            out = optimized_attention_masked(q, k, v, self.heads, mask, attn_precision=self.attn_precision, transformer_options=transformer_options)
+            out = optimized_attention_masked(q, k, v, self.heads, mask, attn_precision=self.attn_precision)
        return self.to_out(out)


@@ -805,7 +746,7 @@ class BasicTransformerBlock(nn.Module):
            n = attn1_replace_patch[block_attn1](n, context_attn1, value_attn1, extra_options)
            n = self.attn1.to_out(n)
        else:
-            n = self.attn1(n, context=context_attn1, value=value_attn1, transformer_options=transformer_options)
+            n = self.attn1(n, context=context_attn1, value=value_attn1)

        if "attn1_output_patch" in transformer_patches:
            patch = transformer_patches["attn1_output_patch"]
@@ -845,7 +786,7 @@ class BasicTransformerBlock(nn.Module):
                n = attn2_replace_patch[block_attn2](n, context_attn2, value_attn2, extra_options)
                n = self.attn2.to_out(n)
            else:
-                n = self.attn2(n, context=context_attn2, value=value_attn2, transformer_options=transformer_options)
+                n = self.attn2(n, context=context_attn2, value=value_attn2)

        if "attn2_output_patch" in transformer_patches:
            patch = transformer_patches["attn2_output_patch"]
@@ -1076,7 +1017,7 @@ class SpatialVideoTransformer(SpatialTransformer):

            B, S, C = x_mix.shape
            x_mix = rearrange(x_mix, "(b t) s c -> (b s) t c", t=timesteps)
-            x_mix = mix_block(x_mix, context=time_context, transformer_options=transformer_options)
+            x_mix = mix_block(x_mix, context=time_context) #TODO: transformer_options
            x_mix = rearrange(
                x_mix, "(b s) t c -> (b t) s c", s=S, b=B // timesteps, c=C, t=timesteps
            )
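
Note on the attention.py hunks above: the removed `wrap_attn` decorator let a callable stored under `transformer_options["optimized_attention_override"]` intercept an attention call. After this change the attention functions no longer accept `transformer_options` or extra keyword arguments. A simplified sketch of the removed pattern is below; `toy_attention` and `doubling_override` are hypothetical stand-ins, and the cleanup of the marker key done in the original `finally` block is omitted.

```python
import functools

def wrap_attn(func):
    """Let a caller-supplied override intercept an attention call once."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Nested calls already carry the marker and go straight to the real function.
        if "_inside_attn_wrapper" in kwargs:
            return func(*args, **kwargs)
        kwargs["_inside_attn_wrapper"] = True
        options = kwargs.get("transformer_options") or {}
        override = options.get("optimized_attention_override")
        if override is not None:
            # The override receives the unwrapped function plus the original arguments.
            return override(func, *args, **kwargs)
        return func(*args, **kwargs)
    return wrapper

@wrap_attn
def toy_attention(q, k, v, **kwargs):
    # Stand-in for a real attention kernel: just adds its inputs.
    return q + k + v

def doubling_override(func, *args, **kwargs):
    # Hypothetical override: run the wrapped function, then post-process the result.
    return 2 * func(*args, **kwargs)

plain = toy_attention(1, 2, 3)
overridden = toy_attention(
    1, 2, 3,
    transformer_options={"optimized_attention_override": doubling_override},
)
```

With no override the call falls through to the kernel; with one, the override decides whether and how the kernel runs.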
--- a/comfy/ldm/modules/diffusionmodules/mmdit.py
+++ b/comfy/ldm/modules/diffusionmodules/mmdit.py
@@ -109,7 +109,7 @@ class PatchEmbed(nn.Module):
 def modulate(x, shift, scale):
    if shift is None:
        shift = torch.zeros_like(scale)
-    return torch.addcmul(shift.unsqueeze(1), x, 1+ scale.unsqueeze(1))
+    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


 #################################################################################
@@ -564,7 +564,10 @@ class DismantledBlock(nn.Module):
        assert not self.pre_only
        attn1 = self.attn.post_attention(attn)
        attn2 = self.attn2.post_attention(attn2)
-        x = gate_cat(x, gate_msa, gate_msa2, attn1, attn2)
+        out1 = gate_msa.unsqueeze(1) * attn1
+        out2 = gate_msa2.unsqueeze(1) * attn2
+        x = x + out1
+        x = x + out2
        x = x + gate_mlp.unsqueeze(1) * self.mlp(
            modulate(self.norm2(x), shift_mlp, scale_mlp)
        )
@@ -591,11 +594,6 @@ class DismantledBlock(nn.Module):
            )
            return self.post_attention(attn, *intermediates)

-def gate_cat(x, gate_msa, gate_msa2, attn1, attn2):
-    out1 = gate_msa.unsqueeze(1) * attn1
-    out2 = gate_msa2.unsqueeze(1) * attn2
-    x = torch.stack([x, out1, out2], dim=0).sum(dim=0)
-    return x

 def block_mixing(*args, use_checkpoint=True, **kwargs):
    if use_checkpoint:
@@ -606,7 +604,7 @@ def block_mixing(*args, use_checkpoint=True, **kwargs):
        return _block_mixing(*args, **kwargs)


-def _block_mixing(context, x, context_block, x_block, c, transformer_options={}):
+def _block_mixing(context, x, context_block, x_block, c):
    context_qkv, context_intermediates = context_block.pre_attention(context, c)

    if x_block.x_block_self_attn:
@@ -622,7 +620,6 @@ def _block_mixing(context, x, context_block, x_block, c, transformer_options={})
    attn = optimized_attention(
        qkv[0], qkv[1], qkv[2],
        heads=x_block.attn.num_heads,
-        transformer_options=transformer_options,
    )
    context_attn, x_attn = (
        attn[:, : context_qkv[0].shape[1]],
@@ -638,7 +635,6 @@ def _block_mixing(context, x, context_block, x_block, c, transformer_options={})
        attn2 = optimized_attention(
                x_qkv2[0], x_qkv2[1], x_qkv2[2],
                heads=x_block.attn2.num_heads,
-                transformer_options=transformer_options,
            )
        x = x_block.post_attention_x(x_attn, attn2, *x_intermediates)
    else:
@@ -960,10 +956,10 @@ class MMDiT(nn.Module):
            if ("double_block", i) in blocks_replace:
                def block_wrap(args):
                    out = {}
-                    out["txt"], out["img"] = self.joint_blocks[i](args["txt"], args["img"], c=args["vec"], transformer_options=args["transformer_options"])
+                    out["txt"], out["img"] = self.joint_blocks[i](args["txt"], args["img"], c=args["vec"])
                    return out

-                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": c_mod, "transformer_options": transformer_options}, {"original_block": block_wrap})
+                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": c_mod}, {"original_block": block_wrap})
                context = out["txt"]
                x = out["img"]
            else:
@@ -972,7 +968,6 @@ class MMDiT(nn.Module):
                    x,
                    c=c_mod,
                    use_checkpoint=self.use_checkpoint,
-                    transformer_options=transformer_options,
                )
            if control is not None:
                control_o = control.get("output")
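
Note on the mmdit.py hunks above: `torch.addcmul(input, t1, t2)` computes `input + t1 * t2`, so both versions of `modulate` express the same formula. The sketch below checks this with plain Python lists instead of tensors. `addcmul` here is a hypothetical stand-in for the torch function, and the `unsqueeze(1)` broadcasting is left out. On real tensors a fused kernel may round differently, so results could differ in the last bits.

```python
def addcmul(base, t1, t2, value=1.0):
    # Plain-Python stand-in for torch.addcmul: base + value * t1 * t2, elementwise.
    return [b + value * a * c for b, a, c in zip(base, t1, t2)]

def modulate_old(x, shift, scale):
    # Form on the removed side of the hunk.
    return addcmul(shift, x, [1 + s for s in scale])

def modulate_new(x, shift, scale):
    # Form on the added side of the hunk.
    return [xi * (1 + s) + sh for xi, s, sh in zip(x, scale, shift)]

x = [0.5, -1.0, 2.0]
shift = [0.25, 0.0, -0.5]
scale = [1.0, 0.5, -0.25]

old = modulate_old(x, shift, scale)
new = modulate_new(x, shift, scale)
```

The same reasoning applies to the `gate_cat` removal: stacking three terms and summing them gives the same total as adding them to `x` one at a time.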
--- a/comfy/ldm/modules/diffusionmodules/model.py
+++ b/comfy/ldm/modules/diffusionmodules/model.py
@@ -36,7 +36,7 @@ def get_timestep_embedding(timesteps, embedding_dim):

 def nonlinearity(x):
    # swish
-    return torch.nn.functional.silu(x)
+    return x*torch.sigmoid(x)


 def Normalize(in_channels, num_groups=32):
@@ -145,7 +145,7 @@ class Downsample(nn.Module):

 class ResnetBlock(nn.Module):
    def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False,
-                 dropout=0.0, temb_channels=512, conv_op=ops.Conv2d, norm_op=Normalize):
+                 dropout, temb_channels=512, conv_op=ops.Conv2d):
        super().__init__()
        self.in_channels = in_channels
        out_channels = in_channels if out_channels is None else out_channels
@@ -153,7 +153,7 @@ class ResnetBlock(nn.Module):
        self.use_conv_shortcut = conv_shortcut

        self.swish = torch.nn.SiLU(inplace=True)
-        self.norm1 = norm_op(in_channels)
+        self.norm1 = Normalize(in_channels)
        self.conv1 = conv_op(in_channels,
                                     out_channels,
                                     kernel_size=3,
@@ -162,7 +162,7 @@ class ResnetBlock(nn.Module):
        if temb_channels > 0:
            self.temb_proj = ops.Linear(temb_channels,
                                             out_channels)
-        self.norm2 = norm_op(out_channels)
+        self.norm2 = Normalize(out_channels)
        self.dropout = torch.nn.Dropout(dropout, inplace=True)
        self.conv2 = conv_op(out_channels,
                                     out_channels,
@@ -183,7 +183,7 @@ class ResnetBlock(nn.Module):
                                                    stride=1,
                                                    padding=0)

-    def forward(self, x, temb=None):
+    def forward(self, x, temb):
        h = x
        h = self.norm1(h)
        h = self.swish(h)
@@ -285,7 +285,7 @@ def pytorch_attention(q, k, v):
    )

    try:
-        out = comfy.ops.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False)
+        out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False)
        out = out.transpose(2, 3).reshape(orig_shape)
    except model_management.OOM_EXCEPTION:
        logging.warning("scaled_dot_product_attention OOMed: switched to slice attention")
@@ -305,11 +305,11 @@ def vae_attention():
        return normal_attention

 class AttnBlock(nn.Module):
-    def __init__(self, in_channels, conv_op=ops.Conv2d, norm_op=Normalize):
+    def __init__(self, in_channels, conv_op=ops.Conv2d):
        super().__init__()
        self.in_channels = in_channels

-        self.norm = norm_op(in_channels)
+        self.norm = Normalize(in_channels)
        self.q = conv_op(in_channels,
                                 in_channels,
                                 kernel_size=1,
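
Note on the `nonlinearity` hunk in model.py above: SiLU (swish) is defined as x * sigmoid(x), so `torch.nn.functional.silu(x)` and `x*torch.sigmoid(x)` compute the same function. A minimal scalar sketch using only the standard library; the helper names are illustrative.

```python
import math

def sigmoid(x):
    # Logistic function 1 / (1 + e^-x).
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    # SiLU / swish: the input scaled by its own sigmoid.
    return x * sigmoid(x)

values = [swish(v) for v in (-2.0, 0.0, 1.0)]
```

The explicit product typically materializes an intermediate sigmoid tensor, which is the usual reason to prefer the fused call when it is available.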
--- a/comfy/ldm/modules/sub_quadratic_attention.py
+++ b/comfy/ldm/modules/sub_quadratic_attention.py
@@ -31,7 +31,7 @@ def dynamic_slice(
    starts: List[int],
    sizes: List[int],
 ) -> Tensor:
-    slicing = tuple(slice(start, start + size) for start, size in zip(starts, sizes))
+    slicing = [slice(start, start + size) for start, size in zip(starts, sizes)]
    return x[slicing]

 class AttnChunk(NamedTuple):
--- a/comfy/ldm/omnigen/omnigen2.py
+++ b/comfy/ldm/omnigen/omnigen2.py
@@ -1,470 +0,0 @@
-# Original code: https://github.com/VectorSpaceLab/OmniGen2
-
-from typing import Optional, Tuple
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from einops import rearrange, repeat
-from comfy.ldm.lightricks.model import Timesteps
-from comfy.ldm.flux.layers import EmbedND
-from comfy.ldm.modules.attention import optimized_attention_masked
-import comfy.model_management
-import comfy.ldm.common_dit
-
-
-def apply_rotary_emb(x, freqs_cis):
-    if x.shape[1] == 0:
-        return x
-
-    t_ = x.reshape(*x.shape[:-1], -1, 1, 2)
-    t_out = freqs_cis[..., 0] * t_[..., 0] + freqs_cis[..., 1] * t_[..., 1]
-    return t_out.reshape(*x.shape).to(dtype=x.dtype)
-
-
-def swiglu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
-    return F.silu(x) * y
-
-
-class TimestepEmbedding(nn.Module):
-    def __init__(self, in_channels: int, time_embed_dim: int, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.linear_1 = operations.Linear(in_channels, time_embed_dim, dtype=dtype, device=device)
-        self.act = nn.SiLU()
-        self.linear_2 = operations.Linear(time_embed_dim, time_embed_dim, dtype=dtype, device=device)
-
-    def forward(self, sample: torch.Tensor) -> torch.Tensor:
-        sample = self.linear_1(sample)
-        sample = self.act(sample)
-        sample = self.linear_2(sample)
-        return sample
-
-
-class LuminaRMSNormZero(nn.Module):
-    def __init__(self, embedding_dim: int, norm_eps: float = 1e-5, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.silu = nn.SiLU()
-        self.linear = operations.Linear(min(embedding_dim, 1024), 4 * embedding_dim, dtype=dtype, device=device)
-        self.norm = operations.RMSNorm(embedding_dim, eps=norm_eps, dtype=dtype, device=device)
-
-    def forward(self, x: torch.Tensor, emb: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
-        emb = self.linear(self.silu(emb))
-        scale_msa, gate_msa, scale_mlp, gate_mlp = emb.chunk(4, dim=1)
-        x = self.norm(x) * (1 + scale_msa[:, None])
-        return x, gate_msa, scale_mlp, gate_mlp
-
-
-class LuminaLayerNormContinuous(nn.Module):
-    def __init__(self, embedding_dim: int, conditioning_embedding_dim: int, elementwise_affine: bool = False, eps: float = 1e-6, out_dim: Optional[int] = None, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.silu = nn.SiLU()
-        self.linear_1 = operations.Linear(conditioning_embedding_dim, embedding_dim, dtype=dtype, device=device)
-        self.norm = operations.LayerNorm(embedding_dim, eps, elementwise_affine, dtype=dtype, device=device)
-        self.linear_2 = operations.Linear(embedding_dim, out_dim, bias=True, dtype=dtype, device=device) if out_dim is not None else None
-
-    def forward(self, x: torch.Tensor, conditioning_embedding: torch.Tensor) -> torch.Tensor:
-        emb = self.linear_1(self.silu(conditioning_embedding).to(x.dtype))
-        x = self.norm(x) * (1 + emb)[:, None, :]
-        if self.linear_2 is not None:
-            x = self.linear_2(x)
-        return x
-
-
-class LuminaFeedForward(nn.Module):
-    def __init__(self, dim: int, inner_dim: int, multiple_of: int = 256, dtype=None, device=None, operations=None):
-        super().__init__()
-        inner_dim = multiple_of * ((inner_dim + multiple_of - 1) // multiple_of)
-        self.linear_1 = operations.Linear(dim, inner_dim, bias=False, dtype=dtype, device=device)
-        self.linear_2 = operations.Linear(inner_dim, dim, bias=False, dtype=dtype, device=device)
-        self.linear_3 = operations.Linear(dim, inner_dim, bias=False, dtype=dtype, device=device)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        h1, h2 = self.linear_1(x), self.linear_3(x)
-        return self.linear_2(swiglu(h1, h2))
-
-
-class Lumina2CombinedTimestepCaptionEmbedding(nn.Module):
-    def __init__(self, hidden_size: int = 4096, text_feat_dim: int = 2048, frequency_embedding_size: int = 256, norm_eps: float = 1e-5, timestep_scale: float = 1.0, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.time_proj = Timesteps(num_channels=frequency_embedding_size, flip_sin_to_cos=True, downscale_freq_shift=0.0, scale=timestep_scale)
-        self.timestep_embedder = TimestepEmbedding(in_channels=frequency_embedding_size, time_embed_dim=min(hidden_size, 1024), dtype=dtype, device=device, operations=operations)
-        self.caption_embedder = nn.Sequential(
-            operations.RMSNorm(text_feat_dim, eps=norm_eps, dtype=dtype, device=device),
-            operations.Linear(text_feat_dim, hidden_size, bias=True, dtype=dtype, device=device),
-        )
-
-    def forward(self, timestep: torch.Tensor, text_hidden_states: torch.Tensor, dtype: torch.dtype) -> Tuple[torch.Tensor, torch.Tensor]:
-        timestep_proj = self.time_proj(timestep).to(dtype=dtype)
-        time_embed = self.timestep_embedder(timestep_proj)
-        caption_embed = self.caption_embedder(text_hidden_states)
-        return time_embed, caption_embed
-
-
-class Attention(nn.Module):
-    def __init__(self, query_dim: int, dim_head: int, heads: int, kv_heads: int, eps: float = 1e-5, bias: bool = False, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.heads = heads
-        self.kv_heads = kv_heads
-        self.dim_head = dim_head
-        self.scale = dim_head ** -0.5
-
-        self.to_q = operations.Linear(query_dim, heads * dim_head, bias=bias, dtype=dtype, device=device)
-        self.to_k = operations.Linear(query_dim, kv_heads * dim_head, bias=bias, dtype=dtype, device=device)
-        self.to_v = operations.Linear(query_dim, kv_heads * dim_head, bias=bias, dtype=dtype, device=device)
-
-        self.norm_q = operations.RMSNorm(dim_head, eps=eps, dtype=dtype, device=device)
-        self.norm_k = operations.RMSNorm(dim_head, eps=eps, dtype=dtype, device=device)
-
-        self.to_out = nn.Sequential(
-            operations.Linear(heads * dim_head, query_dim, bias=bias, dtype=dtype, device=device),
-            nn.Dropout(0.0)
-        )
-
-    def forward(self, hidden_states: torch.Tensor, encoder_hidden_states: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, image_rotary_emb: Optional[torch.Tensor] = None, transformer_options={}) -> torch.Tensor:
-        batch_size, sequence_length, _ = hidden_states.shape
-
-        query = self.to_q(hidden_states)
-        key = self.to_k(encoder_hidden_states)
-        value = self.to_v(encoder_hidden_states)
-
-        query = query.view(batch_size, -1, self.heads, self.dim_head)
-        key = key.view(batch_size, -1, self.kv_heads, self.dim_head)
-        value = value.view(batch_size, -1, self.kv_heads, self.dim_head)
-
-        query = self.norm_q(query)
-        key = self.norm_k(key)
-
-        if image_rotary_emb is not None:
-            query = apply_rotary_emb(query, image_rotary_emb)
-            key = apply_rotary_emb(key, image_rotary_emb)
-
-        query = query.transpose(1, 2)
-        key = key.transpose(1, 2)
-        value = value.transpose(1, 2)
-
-        if self.kv_heads < self.heads:
-            key = key.repeat_interleave(self.heads // self.kv_heads, dim=1)
-            value = value.repeat_interleave(self.heads // self.kv_heads, dim=1)
-
-        hidden_states = optimized_attention_masked(query, key, value, self.heads, attention_mask, skip_reshape=True, transformer_options=transformer_options)
-        hidden_states = self.to_out[0](hidden_states)
-        return hidden_states
-
-
-class OmniGen2TransformerBlock(nn.Module):
-    def __init__(self, dim: int, num_attention_heads: int, num_kv_heads: int, multiple_of: int, ffn_dim_multiplier: float, norm_eps: float, modulation: bool = True, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.modulation = modulation
-
-        self.attn = Attention(
-            query_dim=dim,
-            dim_head=dim // num_attention_heads,
-            heads=num_attention_heads,
-            kv_heads=num_kv_heads,
-            eps=1e-5,
-            bias=False,
-            dtype=dtype, device=device, operations=operations,
-        )
-
-        self.feed_forward = LuminaFeedForward(
-            dim=dim,
-            inner_dim=4 * dim,
-            multiple_of=multiple_of,
-            dtype=dtype, device=device, operations=operations
-        )
-
-        if modulation:
-            self.norm1 = LuminaRMSNormZero(embedding_dim=dim, norm_eps=norm_eps, dtype=dtype, device=device, operations=operations)
-        else:
-            self.norm1 = operations.RMSNorm(dim, eps=norm_eps, dtype=dtype, device=device)
-
-        self.ffn_norm1 = operations.RMSNorm(dim, eps=norm_eps, dtype=dtype, device=device)
-        self.norm2 = operations.RMSNorm(dim, eps=norm_eps, dtype=dtype, device=device)
-        self.ffn_norm2 = operations.RMSNorm(dim, eps=norm_eps, dtype=dtype, device=device)
-
-    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor, image_rotary_emb: torch.Tensor, temb: Optional[torch.Tensor] = None, transformer_options={}) -> torch.Tensor:
-        if self.modulation:
-            norm_hidden_states, gate_msa, scale_mlp, gate_mlp = self.norm1(hidden_states, temb)
-            attn_output = self.attn(norm_hidden_states, norm_hidden_states, attention_mask, image_rotary_emb, transformer_options=transformer_options)
-            hidden_states = hidden_states + gate_msa.unsqueeze(1).tanh() * self.norm2(attn_output)
-            mlp_output = self.feed_forward(self.ffn_norm1(hidden_states) * (1 + scale_mlp.unsqueeze(1)))
-            hidden_states = hidden_states + gate_mlp.unsqueeze(1).tanh() * self.ffn_norm2(mlp_output)
-        else:
-            norm_hidden_states = self.norm1(hidden_states)
-            attn_output = self.attn(norm_hidden_states, norm_hidden_states, attention_mask, image_rotary_emb, transformer_options=transformer_options)
-            hidden_states = hidden_states + self.norm2(attn_output)
-            mlp_output = self.feed_forward(self.ffn_norm1(hidden_states))
-            hidden_states = hidden_states + self.ffn_norm2(mlp_output)
-        return hidden_states
-
-
-class OmniGen2RotaryPosEmbed(nn.Module):
-    def __init__(self, theta: int, axes_dim: Tuple[int, int, int], axes_lens: Tuple[int, int, int] = (300, 512, 512), patch_size: int = 2):
-        super().__init__()
-        self.theta = theta
-        self.axes_dim = axes_dim
-        self.axes_lens = axes_lens
-        self.patch_size = patch_size
-        self.rope_embedder = EmbedND(dim=sum(axes_dim), theta=self.theta, axes_dim=axes_dim)
-
-    def forward(self, batch_size, encoder_seq_len, l_effective_cap_len, l_effective_ref_img_len, l_effective_img_len, ref_img_sizes, img_sizes, device):
-        p = self.patch_size
-
-        seq_lengths = [cap_len + sum(ref_img_len) + img_len for cap_len, ref_img_len, img_len in zip(l_effective_cap_len, l_effective_ref_img_len, l_effective_img_len)]
-
-        max_seq_len = max(seq_lengths)
-        max_ref_img_len = max([sum(ref_img_len) for ref_img_len in l_effective_ref_img_len])
-        max_img_len = max(l_effective_img_len)
-
-        position_ids = torch.zeros(batch_size, max_seq_len, 3, dtype=torch.int32, device=device)
-
-        for i, (cap_seq_len, seq_len) in enumerate(zip(l_effective_cap_len, seq_lengths)):
-            position_ids[i, :cap_seq_len] = repeat(torch.arange(cap_seq_len, dtype=torch.int32, device=device), "l -> l 3")
-
-            pe_shift = cap_seq_len
-            pe_shift_len = cap_seq_len
-
-            if ref_img_sizes[i] is not None:
-                for ref_img_size, ref_img_len in zip(ref_img_sizes[i], l_effective_ref_img_len[i]):
-                    H, W = ref_img_size
-                    ref_H_tokens, ref_W_tokens = H // p, W // p
-
-                    row_ids = repeat(torch.arange(ref_H_tokens, dtype=torch.int32, device=device), "h -> h w", w=ref_W_tokens).flatten()
-                    col_ids = repeat(torch.arange(ref_W_tokens, dtype=torch.int32, device=device), "w -> h w", h=ref_H_tokens).flatten()
-                    position_ids[i, pe_shift_len:pe_shift_len + ref_img_len, 0] = pe_shift
-                    position_ids[i, pe_shift_len:pe_shift_len + ref_img_len, 1] = row_ids
-                    position_ids[i, pe_shift_len:pe_shift_len + ref_img_len, 2] = col_ids
-
-                    pe_shift += max(ref_H_tokens, ref_W_tokens)
-                    pe_shift_len += ref_img_len
-
-            H, W = img_sizes[i]
-            H_tokens, W_tokens = H // p, W // p
-
-            row_ids = repeat(torch.arange(H_tokens, dtype=torch.int32, device=device), "h -> h w", w=W_tokens).flatten()
-            col_ids = repeat(torch.arange(W_tokens, dtype=torch.int32, device=device), "w -> h w", h=H_tokens).flatten()
-
-            position_ids[i, pe_shift_len: seq_len, 0] = pe_shift
-            position_ids[i, pe_shift_len: seq_len, 1] = row_ids
-            position_ids[i, pe_shift_len: seq_len, 2] = col_ids
-
-        freqs_cis = self.rope_embedder(position_ids).movedim(1, 2)
-
-        cap_freqs_cis_shape = list(freqs_cis.shape)
-        cap_freqs_cis_shape[1] = encoder_seq_len
-        cap_freqs_cis = torch.zeros(*cap_freqs_cis_shape, device=device, dtype=freqs_cis.dtype)
-
-        ref_img_freqs_cis_shape = list(freqs_cis.shape)
-        ref_img_freqs_cis_shape[1] = max_ref_img_len
-        ref_img_freqs_cis = torch.zeros(*ref_img_freqs_cis_shape, device=device, dtype=freqs_cis.dtype)
-
-        img_freqs_cis_shape = list(freqs_cis.shape)
-        img_freqs_cis_shape[1] = max_img_len
-        img_freqs_cis = torch.zeros(*img_freqs_cis_shape, device=device, dtype=freqs_cis.dtype)
-
-        for i, (cap_seq_len, ref_img_len, img_len, seq_len) in enumerate(zip(l_effective_cap_len, l_effective_ref_img_len, l_effective_img_len, seq_lengths)):
-            cap_freqs_cis[i, :cap_seq_len] = freqs_cis[i, :cap_seq_len]
-            ref_img_freqs_cis[i, :sum(ref_img_len)] = freqs_cis[i, cap_seq_len:cap_seq_len + sum(ref_img_len)]
-            img_freqs_cis[i, :img_len] = freqs_cis[i, cap_seq_len + sum(ref_img_len):cap_seq_len + sum(ref_img_len) + img_len]
-
-        return cap_freqs_cis, ref_img_freqs_cis, img_freqs_cis, freqs_cis, l_effective_cap_len, seq_lengths
-
-
-class OmniGen2Transformer2DModel(nn.Module):
-    def __init__(
-        self,
-        patch_size: int = 2,
-        in_channels: int = 16,
-        out_channels: Optional[int] = None,
-        hidden_size: int = 2304,
-        num_layers: int = 26,
-        num_refiner_layers: int = 2,
-        num_attention_heads: int = 24,
-        num_kv_heads: int = 8,
-        multiple_of: int = 256,
-        ffn_dim_multiplier: Optional[float] = None,
-        norm_eps: float = 1e-5,
-        axes_dim_rope: Tuple[int, int, int] = (32, 32, 32),
-        axes_lens: Tuple[int, int, int] = (300, 512, 512),
-        text_feat_dim: int = 1024,
-        timestep_scale: float = 1.0,
-        image_model=None,
-        device=None,
-        dtype=None,
-        operations=None,
-    ):
-        super().__init__()
-
-        self.patch_size = patch_size
-        self.out_channels = out_channels or in_channels
-        self.hidden_size = hidden_size
-        self.dtype = dtype
-
-        self.rope_embedder = OmniGen2RotaryPosEmbed(
-            theta=10000,
-            axes_dim=axes_dim_rope,
-            axes_lens=axes_lens,
-            patch_size=patch_size,
-        )
-
-        self.x_embedder = operations.Linear(patch_size * patch_size * in_channels, hidden_size, dtype=dtype, device=device)
-        self.ref_image_patch_embedder = operations.Linear(patch_size * patch_size * in_channels, hidden_size, dtype=dtype, device=device)
-
-        self.time_caption_embed = Lumina2CombinedTimestepCaptionEmbedding(
-            hidden_size=hidden_size,
-            text_feat_dim=text_feat_dim,
-            norm_eps=norm_eps,
-            timestep_scale=timestep_scale, dtype=dtype, device=device, operations=operations
-        )
-
-        self.noise_refiner = nn.ModuleList([
-            OmniGen2TransformerBlock(
-                hidden_size, num_attention_heads, num_kv_heads,
-                multiple_of, ffn_dim_multiplier, norm_eps, modulation=True, dtype=dtype, device=device, operations=operations
-            ) for _ in range(num_refiner_layers)
-        ])
-
-        self.ref_image_refiner = nn.ModuleList([
-            OmniGen2TransformerBlock(
-                hidden_size, num_attention_heads, num_kv_heads,
-                multiple_of, ffn_dim_multiplier, norm_eps, modulation=True, dtype=dtype, device=device, operations=operations
-            ) for _ in range(num_refiner_layers)
-        ])
-
-        self.context_refiner = nn.ModuleList([
-            OmniGen2TransformerBlock(
-                hidden_size, num_attention_heads, num_kv_heads,
-                multiple_of, ffn_dim_multiplier, norm_eps, modulation=False, dtype=dtype, device=device, operations=operations
-            ) for _ in range(num_refiner_layers)
-        ])
-
-        self.layers = nn.ModuleList([
-            OmniGen2TransformerBlock(
-                hidden_size, num_attention_heads, num_kv_heads,
-                multiple_of, ffn_dim_multiplier, norm_eps, modulation=True, dtype=dtype, device=device, operations=operations
-            ) for _ in range(num_layers)
-        ])
-
-        self.norm_out = LuminaLayerNormContinuous(
-            embedding_dim=hidden_size,
-            conditioning_embedding_dim=min(hidden_size, 1024),
-            elementwise_affine=False,
-            eps=1e-6,
-            out_dim=patch_size * patch_size * self.out_channels, dtype=dtype, device=device, operations=operations
-        )
-
-        self.image_index_embedding = nn.Parameter(torch.empty(5, hidden_size, device=device, dtype=dtype))
-
-    def flat_and_pad_to_seq(self, hidden_states, ref_image_hidden_states):
-        batch_size = len(hidden_states)
-        p = self.patch_size
-
-        img_sizes = [(img.size(1), img.size(2)) for img in hidden_states]
-        l_effective_img_len = [(H // p) * (W // p) for (H, W) in img_sizes]
-
-        if ref_image_hidden_states is not None:
-            ref_image_hidden_states = list(map(lambda ref: comfy.ldm.common_dit.pad_to_patch_size(ref, (p, p)), ref_image_hidden_states))
-            ref_img_sizes = [[(imgs.size(2), imgs.size(3)) if imgs is not None else None for imgs in ref_image_hidden_states]] * batch_size
-            l_effective_ref_img_len = [[(ref_img_size[0] // p) * (ref_img_size[1] // p) for ref_img_size in _ref_img_sizes] if _ref_img_sizes is not None else [0] for _ref_img_sizes in ref_img_sizes]
-        else:
-            ref_img_sizes = [None for _ in range(batch_size)]
-            l_effective_ref_img_len = [[0] for _ in range(batch_size)]
-
-        flat_ref_img_hidden_states = None
-        if ref_image_hidden_states is not None:
-            imgs = []
-            for ref_img in ref_image_hidden_states:
-                B, C, H, W = ref_img.size()
-                ref_img = rearrange(ref_img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)
-                imgs.append(ref_img)
-            flat_ref_img_hidden_states = torch.cat(imgs, dim=1)
-
-        img = hidden_states
-        B, C, H, W = img.size()
-        flat_hidden_states = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)
-
-        return (
-            flat_hidden_states, flat_ref_img_hidden_states,
-            None, None,
-            l_effective_ref_img_len, l_effective_img_len,
-            ref_img_sizes, img_sizes,
-        )
-
-    def img_patch_embed_and_refine(self, hidden_states, ref_image_hidden_states, padded_img_mask, padded_ref_img_mask, noise_rotary_emb, ref_img_rotary_emb, l_effective_ref_img_len, l_effective_img_len, temb, transformer_options={}):
-        batch_size = len(hidden_states)
-
-        hidden_states = self.x_embedder(hidden_states)
-        if ref_image_hidden_states is not None:
-            ref_image_hidden_states = self.ref_image_patch_embedder(ref_image_hidden_states)
-            image_index_embedding = comfy.model_management.cast_to(self.image_index_embedding, dtype=hidden_states.dtype, device=hidden_states.device)
-
-            for i in range(batch_size):
-                shift = 0
-                for j, ref_img_len in enumerate(l_effective_ref_img_len[i]):
-                    ref_image_hidden_states[i, shift:shift + ref_img_len, :] = ref_image_hidden_states[i, shift:shift + ref_img_len, :] + image_index_embedding[j]
-                    shift += ref_img_len
-
-        for layer in self.noise_refiner:
-            hidden_states = layer(hidden_states, padded_img_mask, noise_rotary_emb, temb, transformer_options=transformer_options)
-
-        if ref_image_hidden_states is not None:
-            for layer in self.ref_image_refiner:
-                ref_image_hidden_states = layer(ref_image_hidden_states, padded_ref_img_mask, ref_img_rotary_emb, temb, transformer_options=transformer_options)
-
-            hidden_states = torch.cat([ref_image_hidden_states, hidden_states], dim=1)
-
-        return hidden_states
-
-    def forward(self, x, timesteps, context, num_tokens, ref_latents=None, attention_mask=None, transformer_options={}, **kwargs):
-        B, C, H, W = x.shape
-        hidden_states = comfy.ldm.common_dit.pad_to_patch_size(x, (self.patch_size, self.patch_size))
-        _, _, H_padded, W_padded = hidden_states.shape
-        timestep = 1.0 - timesteps
-        text_hidden_states = context
-        text_attention_mask = attention_mask
-        ref_image_hidden_states = ref_latents
-        device = hidden_states.device
-
-        temb, text_hidden_states = self.time_caption_embed(timestep, text_hidden_states, hidden_states[0].dtype)
-
-        (
-            hidden_states, ref_image_hidden_states,
-            img_mask, ref_img_mask,
-            l_effective_ref_img_len, l_effective_img_len,
-            ref_img_sizes, img_sizes,
-        ) = self.flat_and_pad_to_seq(hidden_states, ref_image_hidden_states)
-
-        (
-            context_rotary_emb, ref_img_rotary_emb, noise_rotary_emb,
-            rotary_emb, encoder_seq_lengths, seq_lengths,
-        ) = self.rope_embedder(
-            hidden_states.shape[0], text_hidden_states.shape[1], [num_tokens] * text_hidden_states.shape[0],
-            l_effective_ref_img_len, l_effective_img_len,
-            ref_img_sizes, img_sizes, device,
-        )
-
-        for layer in self.context_refiner:
-            text_hidden_states = layer(text_hidden_states, text_attention_mask, context_rotary_emb, transformer_options=transformer_options)
-
-        img_len = hidden_states.shape[1]
-        combined_img_hidden_states = self.img_patch_embed_and_refine(
-            hidden_states, ref_image_hidden_states,
-            img_mask, ref_img_mask,
-            noise_rotary_emb, ref_img_rotary_emb,
-            l_effective_ref_img_len, l_effective_img_len,
-            temb,
-            transformer_options=transformer_options,
-        )
-
-        hidden_states = torch.cat([text_hidden_states, combined_img_hidden_states], dim=1)
-        attention_mask = None
-
-        for layer in self.layers:
-            hidden_states = layer(hidden_states, attention_mask, rotary_emb, temb, transformer_options=transformer_options)
-
-        hidden_states = self.norm_out(hidden_states, temb)
-
-        p = self.patch_size
-        output = rearrange(hidden_states[:, -img_len:], 'b (h w) (p1 p2 c) -> b c (h p1) (w p2)',  h=H_padded // p, w=W_padded// p, p1=p, p2=p)[:, :, :H, :W]
-
-        return -output
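
Review note on the removed `OmniGen2RotaryPosEmbed.forward`: the position-id layout it builds can be hard to follow from the tensor slicing alone. The sketch below restates that layout for a single sample in plain Python, with no torch dependency. The function name `build_position_ids` and its arguments are hypothetical, introduced only for illustration, and the sketch leaves out batching, padding and the `EmbedND` frequency lookup that the removed code performs afterwards.

```python
from typing import List, Sequence, Tuple


def build_position_ids(
    cap_len: int,
    ref_sizes: Sequence[Tuple[int, int]],
    img_size: Tuple[int, int],
    patch: int = 2,
) -> List[Tuple[int, int, int]]:
    """Return one (axis0, row, col) triple per token for a single sample.

    Hypothetical helper: mirrors the per-sample loop of the removed
    OmniGen2RotaryPosEmbed.forward, using lists instead of tensors.
    """
    # Caption tokens: all three axes carry the token index.
    ids = [(t, t, t) for t in range(cap_len)]

    # Axis 0 is a running offset shared by every token of one image.
    shift = cap_len

    for height, width in ref_sizes:
        h_tokens, w_tokens = height // patch, width // patch
        # Row-major order: the row index changes slowest.
        ids.extend((shift, r, c) for r in range(h_tokens) for c in range(w_tokens))
        # Each reference image advances the offset by its longer side in tokens.
        shift += max(h_tokens, w_tokens)

    # The target image comes last and uses the final offset.
    h_tokens, w_tokens = img_size[0] // patch, img_size[1] // patch
    ids.extend((shift, r, c) for r in range(h_tokens) for c in range(w_tokens))
    return ids


# Two caption tokens, one 4x4 reference latent, one 4x6 target latent.
ids = build_position_ids(2, [(4, 4)], (4, 6), patch=2)
```

With these inputs the sequence has 2 + 4 + 6 = 12 tokens: the reference image tokens all share offset 2 on the first axis, and the target image tokens share offset 4.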
--- a/comfy/ldm/pixart/pixartms.py
+++ b/comfy/ldm/pixart/pixartms.py
@@ -1,256 +1,256 @@
-# Based on:
-# https://github.com/PixArt-alpha/PixArt-alpha [Apache 2.0 license]
-# https://github.com/PixArt-alpha/PixArt-sigma [Apache 2.0 license]
-import torch
-import torch.nn as nn
-
-from .blocks import (
-    t2i_modulate,
-    CaptionEmbedder,
-    AttentionKVCompress,
-    MultiHeadCrossAttention,
-    T2IFinalLayer,
-    SizeEmbedder,
-)
-from comfy.ldm.modules.diffusionmodules.mmdit import TimestepEmbedder, PatchEmbed, Mlp, get_1d_sincos_pos_embed_from_grid_torch
-
-
-def get_2d_sincos_pos_embed_torch(embed_dim, w, h, pe_interpolation=1.0, base_size=16, device=None, dtype=torch.float32):
-    grid_h, grid_w = torch.meshgrid(
-        torch.arange(h, device=device, dtype=dtype) / (h/base_size) / pe_interpolation,
-        torch.arange(w, device=device, dtype=dtype) / (w/base_size) / pe_interpolation,
-        indexing='ij'
-    )
-    emb_h = get_1d_sincos_pos_embed_from_grid_torch(embed_dim // 2, grid_h, device=device, dtype=dtype)
-    emb_w = get_1d_sincos_pos_embed_from_grid_torch(embed_dim // 2, grid_w, device=device, dtype=dtype)
-    emb = torch.cat([emb_w, emb_h], dim=1)  # (H*W, D)
-    return emb
-
-class PixArtMSBlock(nn.Module):
-    """
-    A PixArt block with adaptive layer norm zero (adaLN-Zero) conditioning.
-    """
-    def __init__(self, hidden_size, num_heads, mlp_ratio=4.0, drop_path=0., input_size=None,
-                 sampling=None, sr_ratio=1, qk_norm=False, dtype=None, device=None, operations=None, **block_kwargs):
-        super().__init__()
-        self.hidden_size = hidden_size
-        self.norm1 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
-        self.attn = AttentionKVCompress(
-            hidden_size, num_heads=num_heads, qkv_bias=True, sampling=sampling, sr_ratio=sr_ratio,
-            qk_norm=qk_norm, dtype=dtype, device=device, operations=operations, **block_kwargs
-        )
-        self.cross_attn = MultiHeadCrossAttention(
-            hidden_size, num_heads, dtype=dtype, device=device, operations=operations, **block_kwargs
-        )
-        self.norm2 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
-        # to be compatible with lower version pytorch
-        approx_gelu = lambda: nn.GELU(approximate="tanh")
-        self.mlp = Mlp(
-            in_features=hidden_size, hidden_features=int(hidden_size * mlp_ratio), act_layer=approx_gelu,
-            dtype=dtype, device=device, operations=operations
-        )
-        self.scale_shift_table = nn.Parameter(torch.randn(6, hidden_size) / hidden_size ** 0.5)
-
-    def forward(self, x, y, t, mask=None, HW=None, **kwargs):
-        B, N, C = x.shape
-
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (self.scale_shift_table[None].to(dtype=x.dtype, device=x.device) + t.reshape(B, 6, -1)).chunk(6, dim=1)
-        x = x + (gate_msa * self.attn(t2i_modulate(self.norm1(x), shift_msa, scale_msa), HW=HW))
-        x = x + self.cross_attn(x, y, mask)
-        x = x + (gate_mlp * self.mlp(t2i_modulate(self.norm2(x), shift_mlp, scale_mlp)))
-
-        return x
-
-
-### Core PixArt Model ###
-class PixArtMS(nn.Module):
-    """
-    Diffusion model with a Transformer backbone.
-    """
-    def __init__(
-            self,
-            input_size=32,
-            patch_size=2,
-            in_channels=4,
-            hidden_size=1152,
-            depth=28,
-            num_heads=16,
-            mlp_ratio=4.0,
-            class_dropout_prob=0.1,
-            learn_sigma=True,
-            pred_sigma=True,
-            drop_path: float = 0.,
-            caption_channels=4096,
-            pe_interpolation=None,
-            pe_precision=None,
-            config=None,
-            model_max_length=120,
-            micro_condition=True,
-            qk_norm=False,
-            kv_compress_config=None,
-            dtype=None,
-            device=None,
-            operations=None,
-            **kwargs,
-    ):
-        nn.Module.__init__(self)
-        self.dtype = dtype
-        self.pred_sigma = pred_sigma
-        self.in_channels = in_channels
-        self.out_channels = in_channels * 2 if pred_sigma else in_channels
-        self.patch_size = patch_size
-        self.num_heads = num_heads
-        self.pe_interpolation = pe_interpolation
-        self.pe_precision = pe_precision
-        self.hidden_size = hidden_size
-        self.depth = depth
-
-        approx_gelu = lambda: nn.GELU(approximate="tanh")
-        self.t_block = nn.Sequential(
-            nn.SiLU(),
-            operations.Linear(hidden_size, 6 * hidden_size, bias=True, dtype=dtype, device=device)
-        )
-        self.x_embedder = PatchEmbed(
-            patch_size=patch_size,
-            in_chans=in_channels,
-            embed_dim=hidden_size,
-            bias=True,
-            dtype=dtype,
-            device=device,
-            operations=operations
-        )
-        self.t_embedder = TimestepEmbedder(
-            hidden_size, dtype=dtype, device=device, operations=operations,
-        )
-        self.y_embedder = CaptionEmbedder(
-            in_channels=caption_channels, hidden_size=hidden_size, uncond_prob=class_dropout_prob,
-            act_layer=approx_gelu, token_num=model_max_length,
-            dtype=dtype, device=device, operations=operations,
-        )
-
-        self.micro_conditioning = micro_condition
-        if self.micro_conditioning:
-            self.csize_embedder = SizeEmbedder(hidden_size//3, dtype=dtype, device=device, operations=operations)
-            self.ar_embedder = SizeEmbedder(hidden_size//3, dtype=dtype, device=device, operations=operations)
-
-        # For fixed sin-cos embedding:
-        # num_patches = (input_size // patch_size) * (input_size // patch_size)
-        # self.base_size = input_size // self.patch_size
-        # self.register_buffer("pos_embed", torch.zeros(1, num_patches, hidden_size))
-
-        drop_path = [x.item() for x in torch.linspace(0, drop_path, depth)]  # stochastic depth decay rule
-        if kv_compress_config is None:
-            kv_compress_config = {
-                'sampling': None,
-                'scale_factor': 1,
-                'kv_compress_layer': [],
-            }
-        self.blocks = nn.ModuleList([
-            PixArtMSBlock(
-                hidden_size, num_heads, mlp_ratio=mlp_ratio, drop_path=drop_path[i],
-                sampling=kv_compress_config['sampling'],
-                sr_ratio=int(kv_compress_config['scale_factor']) if i in kv_compress_config['kv_compress_layer'] else 1,
-                qk_norm=qk_norm,
-                dtype=dtype,
-                device=device,
-                operations=operations,
-            )
-            for i in range(depth)
-        ])
-        self.final_layer = T2IFinalLayer(
-            hidden_size, patch_size, self.out_channels, dtype=dtype, device=device, operations=operations
-        )
-
-    def forward_orig(self, x, timestep, y, mask=None, c_size=None, c_ar=None, **kwargs):
-        """
-        Original forward pass of PixArt.
-        x: (N, C, H, W) tensor of spatial inputs (images or latent representations of images)
-        t: (N,) tensor of diffusion timesteps
-        y: (N, 1, 120, C) conditioning
-        ar: (N, 1): aspect ratio
-        cs: (N ,2) size conditioning for height/width
-        """
-        B, C, H, W = x.shape
-        c_res = (H + W) // 2
-        pe_interpolation = self.pe_interpolation
-        if pe_interpolation is None or self.pe_precision is not None:
-            # calculate pe_interpolation on-the-fly
-            pe_interpolation = round(c_res / (512/8.0), self.pe_precision or 0)
-
-        pos_embed = get_2d_sincos_pos_embed_torch(
-            self.hidden_size,
-            h=(H // self.patch_size),
-            w=(W // self.patch_size),
-            pe_interpolation=pe_interpolation,
-            base_size=((round(c_res / 64) * 64) // self.patch_size),
-            device=x.device,
-            dtype=x.dtype,
-        ).unsqueeze(0)
-
-        x = self.x_embedder(x) + pos_embed  # (N, T, D), where T = H * W / patch_size ** 2
-        t = self.t_embedder(timestep, x.dtype)  # (N, D)
-
-        if self.micro_conditioning and (c_size is not None and c_ar is not None):
-            bs = x.shape[0]
-            c_size = self.csize_embedder(c_size, bs)  # (N, D)
-            c_ar = self.ar_embedder(c_ar, bs)  # (N, D)
-            t = t + torch.cat([c_size, c_ar], dim=1)
-
-        t0 = self.t_block(t)
-        y = self.y_embedder(y, self.training)  # (N, D)
-
-        if mask is not None:
-            if mask.shape[0] != y.shape[0]:
-                mask = mask.repeat(y.shape[0] // mask.shape[0], 1)
-            mask = mask.squeeze(1).squeeze(1)
-            y = y.squeeze(1).masked_select(mask.unsqueeze(-1) != 0).view(1, -1, x.shape[-1])
-            y_lens = mask.sum(dim=1).tolist()
-        else:
-            y_lens = None
-            y = y.squeeze(1).view(1, -1, x.shape[-1])
-        for block in self.blocks:
-            x = block(x, y, t0, y_lens, (H, W), **kwargs)  # (N, T, D)
-
-        x = self.final_layer(x, t)  # (N, T, patch_size ** 2 * out_channels)
-        x = self.unpatchify(x, H, W)  # (N, out_channels, H, W)
-
-        return x
-
-    def forward(self, x, timesteps, context, c_size=None, c_ar=None, **kwargs):
-        B, C, H, W = x.shape
-
-        # Fallback for missing microconds
-        if self.micro_conditioning:
-            if c_size is None:
-                c_size = torch.tensor([H*8, W*8], dtype=x.dtype, device=x.device).repeat(B, 1)
-
-            if c_ar is None:
-                c_ar = torch.tensor([H/W], dtype=x.dtype, device=x.device).repeat(B, 1)
-
-        ## Still accepts the input w/o that dim but returns garbage
-        if len(context.shape) == 3:
-            context = context.unsqueeze(1)
-
-        ## run original forward pass
-        out = self.forward_orig(x, timesteps, context, c_size=c_size, c_ar=c_ar)
-
-        ## only return EPS
-        if self.pred_sigma:
-            return out[:, :self.in_channels]
-        return out
-
-    def unpatchify(self, x, h, w):
-        """
-        x: (N, T, patch_size**2 * C)
-        imgs: (N, H, W, C)
-        """
-        c = self.out_channels
-        p = self.x_embedder.patch_size[0]
-        h = h // self.patch_size
-        w = w // self.patch_size
-        assert h * w == x.shape[1]
-
-        x = x.reshape(shape=(x.shape[0], h, w, p, p, c))
-        x = torch.einsum('nhwpqc->nchpwq', x)
-        imgs = x.reshape(shape=(x.shape[0], c, h * p, w * p))
-        return imgs
+# Based on:
+# https://github.com/PixArt-alpha/PixArt-alpha [Apache 2.0 license]
+# https://github.com/PixArt-alpha/PixArt-sigma [Apache 2.0 license]
+import torch
+import torch.nn as nn
+
+from .blocks import (
+    t2i_modulate,
+    CaptionEmbedder,
+    AttentionKVCompress,
+    MultiHeadCrossAttention,
+    T2IFinalLayer,
+    SizeEmbedder,
+)
+from comfy.ldm.modules.diffusionmodules.mmdit import TimestepEmbedder, PatchEmbed, Mlp, get_1d_sincos_pos_embed_from_grid_torch
+
+
+def get_2d_sincos_pos_embed_torch(embed_dim, w, h, pe_interpolation=1.0, base_size=16, device=None, dtype=torch.float32):
+    grid_h, grid_w = torch.meshgrid(
+        torch.arange(h, device=device, dtype=dtype) / (h/base_size) / pe_interpolation,
+        torch.arange(w, device=device, dtype=dtype) / (w/base_size) / pe_interpolation,
+        indexing='ij'
+    )
+    emb_h = get_1d_sincos_pos_embed_from_grid_torch(embed_dim // 2, grid_h, device=device, dtype=dtype)
+    emb_w = get_1d_sincos_pos_embed_from_grid_torch(embed_dim // 2, grid_w, device=device, dtype=dtype)
+    emb = torch.cat([emb_w, emb_h], dim=1)  # (H*W, D)
+    return emb
+
+class PixArtMSBlock(nn.Module):
+    """
+    A PixArt block with adaptive layer norm zero (adaLN-Zero) conditioning.
+    """
+    def __init__(self, hidden_size, num_heads, mlp_ratio=4.0, drop_path=0., input_size=None,
+                 sampling=None, sr_ratio=1, qk_norm=False, dtype=None, device=None, operations=None, **block_kwargs):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.norm1 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
+        self.attn = AttentionKVCompress(
+            hidden_size, num_heads=num_heads, qkv_bias=True, sampling=sampling, sr_ratio=sr_ratio,
+            qk_norm=qk_norm, dtype=dtype, device=device, operations=operations, **block_kwargs
+        )
+        self.cross_attn = MultiHeadCrossAttention(
+            hidden_size, num_heads, dtype=dtype, device=device, operations=operations, **block_kwargs
+        )
+        self.norm2 = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, dtype=dtype, device=device)
+        # for compatibility with older PyTorch versions
+        approx_gelu = lambda: nn.GELU(approximate="tanh")
+        self.mlp = Mlp(
+            in_features=hidden_size, hidden_features=int(hidden_size * mlp_ratio), act_layer=approx_gelu,
+            dtype=dtype, device=device, operations=operations
+        )
+        self.scale_shift_table = nn.Parameter(torch.randn(6, hidden_size) / hidden_size ** 0.5)
+
+    def forward(self, x, y, t, mask=None, HW=None, **kwargs):
+        B, N, C = x.shape
+
+        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (self.scale_shift_table[None].to(dtype=x.dtype, device=x.device) + t.reshape(B, 6, -1)).chunk(6, dim=1)
+        x = x + (gate_msa * self.attn(t2i_modulate(self.norm1(x), shift_msa, scale_msa), HW=HW))
+        x = x + self.cross_attn(x, y, mask)
+        x = x + (gate_mlp * self.mlp(t2i_modulate(self.norm2(x), shift_mlp, scale_mlp)))
+
+        return x
+
+
+### Core PixArt Model ###
+class PixArtMS(nn.Module):
+    """
+    Diffusion model with a Transformer backbone.
+    """
+    def __init__(
+            self,
+            input_size=32,
+            patch_size=2,
+            in_channels=4,
+            hidden_size=1152,
+            depth=28,
+            num_heads=16,
+            mlp_ratio=4.0,
+            class_dropout_prob=0.1,
+            learn_sigma=True,
+            pred_sigma=True,
+            drop_path: float = 0.,
+            caption_channels=4096,
+            pe_interpolation=None,
+            pe_precision=None,
+            config=None,
+            model_max_length=120,
+            micro_condition=True,
+            qk_norm=False,
+            kv_compress_config=None,
+            dtype=None,
+            device=None,
+            operations=None,
+            **kwargs,
+    ):
+        nn.Module.__init__(self)
+        self.dtype = dtype
+        self.pred_sigma = pred_sigma
+        self.in_channels = in_channels
+        self.out_channels = in_channels * 2 if pred_sigma else in_channels
+        self.patch_size = patch_size
+        self.num_heads = num_heads
+        self.pe_interpolation = pe_interpolation
+        self.pe_precision = pe_precision
+        self.hidden_size = hidden_size
+        self.depth = depth
+
+        approx_gelu = lambda: nn.GELU(approximate="tanh")
+        self.t_block = nn.Sequential(
+            nn.SiLU(),
+            operations.Linear(hidden_size, 6 * hidden_size, bias=True, dtype=dtype, device=device)
+        )
+        self.x_embedder = PatchEmbed(
+            patch_size=patch_size,
+            in_chans=in_channels,
+            embed_dim=hidden_size,
+            bias=True,
+            dtype=dtype,
+            device=device,
+            operations=operations
+        )
+        self.t_embedder = TimestepEmbedder(
+            hidden_size, dtype=dtype, device=device, operations=operations,
+        )
+        self.y_embedder = CaptionEmbedder(
+            in_channels=caption_channels, hidden_size=hidden_size, uncond_prob=class_dropout_prob,
+            act_layer=approx_gelu, token_num=model_max_length,
+            dtype=dtype, device=device, operations=operations,
+        )
+
+        self.micro_conditioning = micro_condition
+        if self.micro_conditioning:
+            self.csize_embedder = SizeEmbedder(hidden_size//3, dtype=dtype, device=device, operations=operations)
+            self.ar_embedder = SizeEmbedder(hidden_size//3, dtype=dtype, device=device, operations=operations)
+
+        # For fixed sin-cos embedding:
+        # num_patches = (input_size // patch_size) * (input_size // patch_size)
+        # self.base_size = input_size // self.patch_size
+        # self.register_buffer("pos_embed", torch.zeros(1, num_patches, hidden_size))
+
+        drop_path = [x.item() for x in torch.linspace(0, drop_path, depth)]  # stochastic depth decay rule
+        if kv_compress_config is None:
+            kv_compress_config = {
+                'sampling': None,
+                'scale_factor': 1,
+                'kv_compress_layer': [],
+            }
+        self.blocks = nn.ModuleList([
+            PixArtMSBlock(
+                hidden_size, num_heads, mlp_ratio=mlp_ratio, drop_path=drop_path[i],
+                sampling=kv_compress_config['sampling'],
+                sr_ratio=int(kv_compress_config['scale_factor']) if i in kv_compress_config['kv_compress_layer'] else 1,
+                qk_norm=qk_norm,
+                dtype=dtype,
+                device=device,
+                operations=operations,
+            )
+            for i in range(depth)
+        ])
+        self.final_layer = T2IFinalLayer(
+            hidden_size, patch_size, self.out_channels, dtype=dtype, device=device, operations=operations
+        )
+
+    def forward_orig(self, x, timestep, y, mask=None, c_size=None, c_ar=None, **kwargs):
+        """
+        Original forward pass of PixArt.
+        x: (N, C, H, W) tensor of spatial inputs (images or latent representations of images)
+        timestep: (N,) tensor of diffusion timesteps
+        y: (N, 1, 120, C) caption conditioning
+        c_ar: (N, 1) aspect ratio conditioning
+        c_size: (N, 2) size conditioning for height/width
+        """
+        B, C, H, W = x.shape
+        c_res = (H + W) // 2
+        pe_interpolation = self.pe_interpolation
+        if pe_interpolation is None or self.pe_precision is not None:
+            # calculate pe_interpolation on-the-fly
+            pe_interpolation = round(c_res / (512/8.0), self.pe_precision or 0)
+
+        pos_embed = get_2d_sincos_pos_embed_torch(
+            self.hidden_size,
+            h=(H // self.patch_size),
+            w=(W // self.patch_size),
+            pe_interpolation=pe_interpolation,
+            base_size=((round(c_res / 64) * 64) // self.patch_size),
+            device=x.device,
+            dtype=x.dtype,
+        ).unsqueeze(0)
+
+        x = self.x_embedder(x) + pos_embed  # (N, T, D), where T = H * W / patch_size ** 2
+        t = self.t_embedder(timestep, x.dtype)  # (N, D)
+
+        if self.micro_conditioning and (c_size is not None and c_ar is not None):
+            bs = x.shape[0]
+            c_size = self.csize_embedder(c_size, bs)  # (N, 2 * (D // 3))
+            c_ar = self.ar_embedder(c_ar, bs)  # (N, D // 3)
+            t = t + torch.cat([c_size, c_ar], dim=1)
+
+        t0 = self.t_block(t)
+        y = self.y_embedder(y, self.training)  # (N, 1, L, D)
+
+        if mask is not None:
+            if mask.shape[0] != y.shape[0]:
+                mask = mask.repeat(y.shape[0] // mask.shape[0], 1)
+            mask = mask.squeeze(1).squeeze(1)
+            y = y.squeeze(1).masked_select(mask.unsqueeze(-1) != 0).view(1, -1, x.shape[-1])
+            y_lens = mask.sum(dim=1).tolist()
+        else:
+            y_lens = None
+            y = y.squeeze(1).view(1, -1, x.shape[-1])
+        for block in self.blocks:
+            x = block(x, y, t0, y_lens, (H, W), **kwargs)  # (N, T, D)
+
+        x = self.final_layer(x, t)  # (N, T, patch_size ** 2 * out_channels)
+        x = self.unpatchify(x, H, W)  # (N, out_channels, H, W)
+
+        return x
+
+    def forward(self, x, timesteps, context, c_size=None, c_ar=None, **kwargs):
+        B, C, H, W = x.shape
+
+        # Fallback for missing microconds
+        if self.micro_conditioning:
+            if c_size is None:
+                c_size = torch.tensor([H*8, W*8], dtype=x.dtype, device=x.device).repeat(B, 1)
+
+            if c_ar is None:
+                c_ar = torch.tensor([H/W], dtype=x.dtype, device=x.device).repeat(B, 1)
+
+        ## forward_orig expects context as (N, 1, T, C); a 3-D context is still accepted but returns garbage
+        if len(context.shape) == 3:
+            context = context.unsqueeze(1)
+
+        ## run original forward pass
+        out = self.forward_orig(x, timesteps, context, c_size=c_size, c_ar=c_ar)
+
+        ## only return EPS
+        if self.pred_sigma:
+            return out[:, :self.in_channels]
+        return out
+
+    def unpatchify(self, x, h, w):
+        """
+        x: (N, T, patch_size**2 * C)
+        imgs: (N, C, H, W)
+        """
+        c = self.out_channels
+        p = self.x_embedder.patch_size[0]
+        h = h // self.patch_size
+        w = w // self.patch_size
+        assert h * w == x.shape[1]
+
+        x = x.reshape(shape=(x.shape[0], h, w, p, p, c))
+        x = torch.einsum('nhwpqc->nchpwq', x)
+        imgs = x.reshape(shape=(x.shape[0], c, h * p, w * p))
+        return imgs
--- a/comfy/ldm/qwen_image/controlnet.py
+++ b/comfy/ldm/qwen_image/controlnet.py
@@ -1,77 +0,0 @@
-import torch
-import math
-
-from .model import QwenImageTransformer2DModel
-
-
-class QwenImageControlNetModel(QwenImageTransformer2DModel):
-    def __init__(
-        self,
-        extra_condition_channels=0,
-        dtype=None,
-        device=None,
-        operations=None,
-        **kwargs
-    ):
-        super().__init__(final_layer=False, dtype=dtype, device=device, operations=operations, **kwargs)
-        self.main_model_double = 60
-
-        # controlnet_blocks
-        self.controlnet_blocks = torch.nn.ModuleList([])
-        for _ in range(len(self.transformer_blocks)):
-            self.controlnet_blocks.append(operations.Linear(self.inner_dim, self.inner_dim, device=device, dtype=dtype))
-        self.controlnet_x_embedder = operations.Linear(self.in_channels + extra_condition_channels, self.inner_dim, device=device, dtype=dtype)
-
-    def forward(
-        self,
-        x,
-        timesteps,
-        context,
-        attention_mask=None,
-        guidance: torch.Tensor = None,
-        ref_latents=None,
-        hint=None,
-        transformer_options={},
-        **kwargs
-    ):
-        timestep = timesteps
-        encoder_hidden_states = context
-        encoder_hidden_states_mask = attention_mask
-
-        hidden_states, img_ids, orig_shape = self.process_img(x)
-        hint, _, _ = self.process_img(hint)
-
-        txt_start = round(max(((x.shape[-1] + (self.patch_size // 2)) // self.patch_size) // 2, ((x.shape[-2] + (self.patch_size // 2)) // self.patch_size) // 2))
-        txt_ids = torch.arange(txt_start, txt_start + context.shape[1], device=x.device).reshape(1, -1, 1).repeat(x.shape[0], 1, 3)
-        ids = torch.cat((txt_ids, img_ids), dim=1)
-        image_rotary_emb = self.pe_embedder(ids).squeeze(1).unsqueeze(2).to(x.dtype)
-        del ids, txt_ids, img_ids
-
-        hidden_states = self.img_in(hidden_states) + self.controlnet_x_embedder(hint)
-        encoder_hidden_states = self.txt_norm(encoder_hidden_states)
-        encoder_hidden_states = self.txt_in(encoder_hidden_states)
-
-        if guidance is not None:
-            guidance = guidance * 1000
-
-        temb = (
-            self.time_text_embed(timestep, hidden_states)
-            if guidance is None
-            else self.time_text_embed(timestep, guidance, hidden_states)
-        )
-
-        repeat = math.ceil(self.main_model_double / len(self.controlnet_blocks))
-
-        controlnet_block_samples = ()
-        for i, block in enumerate(self.transformer_blocks):
-            encoder_hidden_states, hidden_states = block(
-                hidden_states=hidden_states,
-                encoder_hidden_states=encoder_hidden_states,
-                encoder_hidden_states_mask=encoder_hidden_states_mask,
-                temb=temb,
-                image_rotary_emb=image_rotary_emb,
-            )
-
-            controlnet_block_samples = controlnet_block_samples + (self.controlnet_blocks[i](hidden_states),) * repeat
-
-        return {"input": controlnet_block_samples[:self.main_model_double]}
--- a/comfy/ldm/qwen_image/model.py
+++ b/comfy/ldm/qwen_image/model.py
@@ -1,473 +0,0 @@
-# https://github.com/QwenLM/Qwen-Image (Apache 2.0)
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from typing import Optional, Tuple
-from einops import repeat
-
-from comfy.ldm.lightricks.model import TimestepEmbedding, Timesteps
-from comfy.ldm.modules.attention import optimized_attention_masked
-from comfy.ldm.flux.layers import EmbedND
-import comfy.ldm.common_dit
-import comfy.patcher_extension
-
-class GELU(nn.Module):
-    def __init__(self, dim_in: int, dim_out: int, approximate: str = "none", bias: bool = True, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.proj = operations.Linear(dim_in, dim_out, bias=bias, dtype=dtype, device=device)
-        self.approximate = approximate
-
-    def forward(self, hidden_states):
-        hidden_states = self.proj(hidden_states)
-        hidden_states = F.gelu(hidden_states, approximate=self.approximate)
-        return hidden_states
-
-
-class FeedForward(nn.Module):
-    def __init__(
-        self,
-        dim: int,
-        dim_out: Optional[int] = None,
-        mult: int = 4,
-        dropout: float = 0.0,
-        inner_dim=None,
-        bias: bool = True,
-        dtype=None, device=None, operations=None
-    ):
-        super().__init__()
-        if inner_dim is None:
-            inner_dim = int(dim * mult)
-        dim_out = dim_out if dim_out is not None else dim
-
-        self.net = nn.ModuleList([])
-        self.net.append(GELU(dim, inner_dim, approximate="tanh", bias=bias, dtype=dtype, device=device, operations=operations))
-        self.net.append(nn.Dropout(dropout))
-        self.net.append(operations.Linear(inner_dim, dim_out, bias=bias, dtype=dtype, device=device))
-
-    def forward(self, hidden_states: torch.Tensor, *args, **kwargs) -> torch.Tensor:
-        for module in self.net:
-            hidden_states = module(hidden_states)
-        return hidden_states
-
-
-def apply_rotary_emb(x, freqs_cis):
-    if x.shape[1] == 0:
-        return x
-
-    t_ = x.reshape(*x.shape[:-1], -1, 1, 2)
-    t_out = freqs_cis[..., 0] * t_[..., 0] + freqs_cis[..., 1] * t_[..., 1]
-    return t_out.reshape(*x.shape)
-
-
-class QwenTimestepProjEmbeddings(nn.Module):
-    def __init__(self, embedding_dim, pooled_projection_dim, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.time_proj = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0, scale=1000)
-        self.timestep_embedder = TimestepEmbedding(
-            in_channels=256,
-            time_embed_dim=embedding_dim,
-            dtype=dtype,
-            device=device,
-            operations=operations
-        )
-
-    def forward(self, timestep, hidden_states):
-        timesteps_proj = self.time_proj(timestep)
-        timesteps_emb = self.timestep_embedder(timesteps_proj.to(dtype=hidden_states.dtype))
-        return timesteps_emb
-
-
-class Attention(nn.Module):
-    def __init__(
-        self,
-        query_dim: int,
-        dim_head: int = 64,
-        heads: int = 8,
-        dropout: float = 0.0,
-        bias: bool = False,
-        eps: float = 1e-5,
-        out_bias: bool = True,
-        out_dim: int = None,
-        out_context_dim: int = None,
-        dtype=None,
-        device=None,
-        operations=None
-    ):
-        super().__init__()
-        self.inner_dim = out_dim if out_dim is not None else dim_head * heads
-        self.inner_kv_dim = self.inner_dim
-        self.heads = heads
-        self.dim_head = dim_head
-        self.out_dim = out_dim if out_dim is not None else query_dim
-        self.out_context_dim = out_context_dim if out_context_dim is not None else query_dim
-        self.dropout = dropout
-
-        # Q/K normalization
-        self.norm_q = operations.RMSNorm(dim_head, eps=eps, elementwise_affine=True, dtype=dtype, device=device)
-        self.norm_k = operations.RMSNorm(dim_head, eps=eps, elementwise_affine=True, dtype=dtype, device=device)
-        self.norm_added_q = operations.RMSNorm(dim_head, eps=eps, dtype=dtype, device=device)
-        self.norm_added_k = operations.RMSNorm(dim_head, eps=eps, dtype=dtype, device=device)
-
-        # Image stream projections
-        self.to_q = operations.Linear(query_dim, self.inner_dim, bias=bias, dtype=dtype, device=device)
-        self.to_k = operations.Linear(query_dim, self.inner_kv_dim, bias=bias, dtype=dtype, device=device)
-        self.to_v = operations.Linear(query_dim, self.inner_kv_dim, bias=bias, dtype=dtype, device=device)
-
-        # Text stream projections
-        self.add_q_proj = operations.Linear(query_dim, self.inner_dim, bias=bias, dtype=dtype, device=device)
-        self.add_k_proj = operations.Linear(query_dim, self.inner_kv_dim, bias=bias, dtype=dtype, device=device)
-        self.add_v_proj = operations.Linear(query_dim, self.inner_kv_dim, bias=bias, dtype=dtype, device=device)
-
-        # Output projections
-        self.to_out = nn.ModuleList([
-            operations.Linear(self.inner_dim, self.out_dim, bias=out_bias, dtype=dtype, device=device),
-            nn.Dropout(dropout)
-        ])
-        self.to_add_out = operations.Linear(self.inner_dim, self.out_context_dim, bias=out_bias, dtype=dtype, device=device)
-
-    def forward(
-        self,
-        hidden_states: torch.FloatTensor,  # Image stream
-        encoder_hidden_states: torch.FloatTensor = None,  # Text stream
-        encoder_hidden_states_mask: torch.FloatTensor = None,
-        attention_mask: Optional[torch.FloatTensor] = None,
-        image_rotary_emb: Optional[torch.Tensor] = None,
-        transformer_options={},
-    ) -> Tuple[torch.Tensor, torch.Tensor]:
-        seq_txt = encoder_hidden_states.shape[1]
-
-        img_query = self.to_q(hidden_states).unflatten(-1, (self.heads, -1))
-        img_key = self.to_k(hidden_states).unflatten(-1, (self.heads, -1))
-        img_value = self.to_v(hidden_states).unflatten(-1, (self.heads, -1))
-
-        txt_query = self.add_q_proj(encoder_hidden_states).unflatten(-1, (self.heads, -1))
-        txt_key = self.add_k_proj(encoder_hidden_states).unflatten(-1, (self.heads, -1))
-        txt_value = self.add_v_proj(encoder_hidden_states).unflatten(-1, (self.heads, -1))
-
-        img_query = self.norm_q(img_query)
-        img_key = self.norm_k(img_key)
-        txt_query = self.norm_added_q(txt_query)
-        txt_key = self.norm_added_k(txt_key)
-
-        joint_query = torch.cat([txt_query, img_query], dim=1)
-        joint_key = torch.cat([txt_key, img_key], dim=1)
-        joint_value = torch.cat([txt_value, img_value], dim=1)
-
-        joint_query = apply_rotary_emb(joint_query, image_rotary_emb)
-        joint_key = apply_rotary_emb(joint_key, image_rotary_emb)
-
-        joint_query = joint_query.flatten(start_dim=2)
-        joint_key = joint_key.flatten(start_dim=2)
-        joint_value = joint_value.flatten(start_dim=2)
-
-        joint_hidden_states = optimized_attention_masked(joint_query, joint_key, joint_value, self.heads, attention_mask, transformer_options=transformer_options)
-
-        txt_attn_output = joint_hidden_states[:, :seq_txt, :]
-        img_attn_output = joint_hidden_states[:, seq_txt:, :]
-
-        img_attn_output = self.to_out[0](img_attn_output)
-        img_attn_output = self.to_out[1](img_attn_output)
-        txt_attn_output = self.to_add_out(txt_attn_output)
-
-        return img_attn_output, txt_attn_output
-
-
-class QwenImageTransformerBlock(nn.Module):
-    def __init__(
-        self,
-        dim: int,
-        num_attention_heads: int,
-        attention_head_dim: int,
-        eps: float = 1e-6,
-        dtype=None,
-        device=None,
-        operations=None
-    ):
-        super().__init__()
-        self.dim = dim
-        self.num_attention_heads = num_attention_heads
-        self.attention_head_dim = attention_head_dim
-
-        self.img_mod = nn.Sequential(
-            nn.SiLU(),
-            operations.Linear(dim, 6 * dim, bias=True, dtype=dtype, device=device),
-        )
-        self.img_norm1 = operations.LayerNorm(dim, elementwise_affine=False, eps=eps, dtype=dtype, device=device)
-        self.img_norm2 = operations.LayerNorm(dim, elementwise_affine=False, eps=eps, dtype=dtype, device=device)
-        self.img_mlp = FeedForward(dim=dim, dim_out=dim, dtype=dtype, device=device, operations=operations)
-
-        self.txt_mod = nn.Sequential(
-            nn.SiLU(),
-            operations.Linear(dim, 6 * dim, bias=True, dtype=dtype, device=device),
-        )
-        self.txt_norm1 = operations.LayerNorm(dim, elementwise_affine=False, eps=eps, dtype=dtype, device=device)
-        self.txt_norm2 = operations.LayerNorm(dim, elementwise_affine=False, eps=eps, dtype=dtype, device=device)
-        self.txt_mlp = FeedForward(dim=dim, dim_out=dim, dtype=dtype, device=device, operations=operations)
-
-        self.attn = Attention(
-            query_dim=dim,
-            dim_head=attention_head_dim,
-            heads=num_attention_heads,
-            out_dim=dim,
-            bias=True,
-            eps=eps,
-            dtype=dtype,
-            device=device,
-            operations=operations,
-        )
-
-    def _modulate(self, x: torch.Tensor, mod_params: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
-        shift, scale, gate = torch.chunk(mod_params, 3, dim=-1)
-        return torch.addcmul(shift.unsqueeze(1), x, 1 + scale.unsqueeze(1)), gate.unsqueeze(1)
-
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        encoder_hidden_states: torch.Tensor,
-        encoder_hidden_states_mask: torch.Tensor,
-        temb: torch.Tensor,
-        image_rotary_emb: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
-        transformer_options={},
-    ) -> Tuple[torch.Tensor, torch.Tensor]:
-        img_mod_params = self.img_mod(temb)
-        txt_mod_params = self.txt_mod(temb)
-        img_mod1, img_mod2 = img_mod_params.chunk(2, dim=-1)
-        txt_mod1, txt_mod2 = txt_mod_params.chunk(2, dim=-1)
-
-        img_normed = self.img_norm1(hidden_states)
-        img_modulated, img_gate1 = self._modulate(img_normed, img_mod1)
-        txt_normed = self.txt_norm1(encoder_hidden_states)
-        txt_modulated, txt_gate1 = self._modulate(txt_normed, txt_mod1)
-
-        img_attn_output, txt_attn_output = self.attn(
-            hidden_states=img_modulated,
-            encoder_hidden_states=txt_modulated,
-            encoder_hidden_states_mask=encoder_hidden_states_mask,
-            image_rotary_emb=image_rotary_emb,
-            transformer_options=transformer_options,
-        )
-
-        hidden_states = hidden_states + img_gate1 * img_attn_output
-        encoder_hidden_states = encoder_hidden_states + txt_gate1 * txt_attn_output
-
-        img_normed2 = self.img_norm2(hidden_states)
-        img_modulated2, img_gate2 = self._modulate(img_normed2, img_mod2)
-        hidden_states = torch.addcmul(hidden_states, img_gate2, self.img_mlp(img_modulated2))
-
-        txt_normed2 = self.txt_norm2(encoder_hidden_states)
-        txt_modulated2, txt_gate2 = self._modulate(txt_normed2, txt_mod2)
-        encoder_hidden_states = torch.addcmul(encoder_hidden_states, txt_gate2, self.txt_mlp(txt_modulated2))
-
-        return encoder_hidden_states, hidden_states
-
-
-class LastLayer(nn.Module):
-    def __init__(
-        self,
-        embedding_dim: int,
-        conditioning_embedding_dim: int,
-        elementwise_affine=False,
-        eps=1e-6,
-        bias=True,
-        dtype=None, device=None, operations=None
-    ):
-        super().__init__()
-        self.silu = nn.SiLU()
-        self.linear = operations.Linear(conditioning_embedding_dim, embedding_dim * 2, bias=bias, dtype=dtype, device=device)
-        self.norm = operations.LayerNorm(embedding_dim, eps, elementwise_affine=False, bias=bias, dtype=dtype, device=device)
-
-    def forward(self, x: torch.Tensor, conditioning_embedding: torch.Tensor) -> torch.Tensor:
-        emb = self.linear(self.silu(conditioning_embedding))
-        scale, shift = torch.chunk(emb, 2, dim=1)
-        x = torch.addcmul(shift[:, None, :], self.norm(x), (1 + scale)[:, None, :])
-        return x
-
-
-class QwenImageTransformer2DModel(nn.Module):
-    def __init__(
-        self,
-        patch_size: int = 2,
-        in_channels: int = 64,
-        out_channels: Optional[int] = 16,
-        num_layers: int = 60,
-        attention_head_dim: int = 128,
-        num_attention_heads: int = 24,
-        joint_attention_dim: int = 3584,
-        pooled_projection_dim: int = 768,
-        guidance_embeds: bool = False,
-        axes_dims_rope: Tuple[int, int, int] = (16, 56, 56),
-        image_model=None,
-        final_layer=True,
-        dtype=None,
-        device=None,
-        operations=None,
-    ):
-        super().__init__()
-        self.dtype = dtype
-        self.patch_size = patch_size
-        self.in_channels = in_channels
-        self.out_channels = out_channels or in_channels
-        self.inner_dim = num_attention_heads * attention_head_dim
-
-        self.pe_embedder = EmbedND(dim=attention_head_dim, theta=10000, axes_dim=list(axes_dims_rope))
-
-        self.time_text_embed = QwenTimestepProjEmbeddings(
-            embedding_dim=self.inner_dim,
-            pooled_projection_dim=pooled_projection_dim,
-            dtype=dtype,
-            device=device,
-            operations=operations
-        )
-
-        self.txt_norm = operations.RMSNorm(joint_attention_dim, eps=1e-6, dtype=dtype, device=device)
-        self.img_in = operations.Linear(in_channels, self.inner_dim, dtype=dtype, device=device)
-        self.txt_in = operations.Linear(joint_attention_dim, self.inner_dim, dtype=dtype, device=device)
-
-        self.transformer_blocks = nn.ModuleList([
-            QwenImageTransformerBlock(
-                dim=self.inner_dim,
-                num_attention_heads=num_attention_heads,
-                attention_head_dim=attention_head_dim,
-                dtype=dtype,
-                device=device,
-                operations=operations
-            )
-            for _ in range(num_layers)
-        ])
-
-        if final_layer:
-            self.norm_out = LastLayer(self.inner_dim, self.inner_dim, dtype=dtype, device=device, operations=operations)
-            self.proj_out = operations.Linear(self.inner_dim, patch_size * patch_size * self.out_channels, bias=True, dtype=dtype, device=device)
-
-    def process_img(self, x, index=0, h_offset=0, w_offset=0):
-        bs, c, t, h, w = x.shape
-        patch_size = self.patch_size
-        hidden_states = comfy.ldm.common_dit.pad_to_patch_size(x, (1, self.patch_size, self.patch_size))
-        orig_shape = hidden_states.shape
-        hidden_states = hidden_states.view(orig_shape[0], orig_shape[1], orig_shape[-2] // 2, 2, orig_shape[-1] // 2, 2)
-        hidden_states = hidden_states.permute(0, 2, 4, 1, 3, 5)
-        hidden_states = hidden_states.reshape(orig_shape[0], (orig_shape[-2] // 2) * (orig_shape[-1] // 2), orig_shape[1] * 4)
-        h_len = ((h + (patch_size // 2)) // patch_size)
-        w_len = ((w + (patch_size // 2)) // patch_size)
-
-        h_offset = ((h_offset + (patch_size // 2)) // patch_size)
-        w_offset = ((w_offset + (patch_size // 2)) // patch_size)
-
-        img_ids = torch.zeros((h_len, w_len, 3), device=x.device)
-        img_ids[:, :, 0] = img_ids[:, :, 1] + index
-        img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(h_offset, h_len - 1 + h_offset, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1) - (h_len // 2)
-        img_ids[:, :, 2] = img_ids[:, :, 2] + torch.linspace(w_offset, w_len - 1 + w_offset, steps=w_len, device=x.device, dtype=x.dtype).unsqueeze(0) - (w_len // 2)
-        return hidden_states, repeat(img_ids, "h w c -> b (h w) c", b=bs), orig_shape
-
-    def forward(self, x, timestep, context, attention_mask=None, guidance=None, ref_latents=None, transformer_options={}, **kwargs):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, attention_mask, guidance, ref_latents, transformer_options, **kwargs)
-
-    def _forward(
-        self,
-        x,
-        timesteps,
-        context,
-        attention_mask=None,
-        guidance: torch.Tensor = None,
-        ref_latents=None,
-        transformer_options={},
-        control=None,
-        **kwargs
-    ):
-        timestep = timesteps
-        encoder_hidden_states = context
-        encoder_hidden_states_mask = attention_mask
-
-        hidden_states, img_ids, orig_shape = self.process_img(x)
-        num_embeds = hidden_states.shape[1]
-
-        if ref_latents is not None:
-            h = 0
-            w = 0
-            index = 0
-            index_ref_method = kwargs.get("ref_latents_method", "index") == "index"
-            for ref in ref_latents:
-                if index_ref_method:
-                    index += 1
-                    h_offset = 0
-                    w_offset = 0
-                else:
-                    index = 1
-                    h_offset = 0
-                    w_offset = 0
-                    if ref.shape[-2] + h > ref.shape[-1] + w:
-                        w_offset = w
-                    else:
-                        h_offset = h
-                    h = max(h, ref.shape[-2] + h_offset)
-                    w = max(w, ref.shape[-1] + w_offset)
-
-                kontext, kontext_ids, _ = self.process_img(ref, index=index, h_offset=h_offset, w_offset=w_offset)
-                hidden_states = torch.cat([hidden_states, kontext], dim=1)
-                img_ids = torch.cat([img_ids, kontext_ids], dim=1)
-
-        txt_start = round(max(((x.shape[-1] + (self.patch_size // 2)) // self.patch_size) // 2, ((x.shape[-2] + (self.patch_size // 2)) // self.patch_size) // 2))
-        txt_ids = torch.arange(txt_start, txt_start + context.shape[1], device=x.device).reshape(1, -1, 1).repeat(x.shape[0], 1, 3)
-        ids = torch.cat((txt_ids, img_ids), dim=1)
-        image_rotary_emb = self.pe_embedder(ids).squeeze(1).unsqueeze(2).to(x.dtype)
-        del ids, txt_ids, img_ids
-
-        hidden_states = self.img_in(hidden_states)
-        encoder_hidden_states = self.txt_norm(encoder_hidden_states)
-        encoder_hidden_states = self.txt_in(encoder_hidden_states)
-
-        if guidance is not None:
-            guidance = guidance * 1000
-
-        temb = (
-            self.time_text_embed(timestep, hidden_states)
-            if guidance is None
-            else self.time_text_embed(timestep, guidance, hidden_states)
-        )
-
-        patches_replace = transformer_options.get("patches_replace", {})
-        patches = transformer_options.get("patches", {})
-        blocks_replace = patches_replace.get("dit", {})
-
-        for i, block in enumerate(self.transformer_blocks):
-            if ("double_block", i) in blocks_replace:
-                def block_wrap(args):
-                    out = {}
-                    out["txt"], out["img"] = block(hidden_states=args["img"], encoder_hidden_states=args["txt"], encoder_hidden_states_mask=encoder_hidden_states_mask, temb=args["vec"], image_rotary_emb=args["pe"], transformer_options=args["transformer_options"])
-                    return out
-                out = blocks_replace[("double_block", i)]({"img": hidden_states, "txt": encoder_hidden_states, "vec": temb, "pe": image_rotary_emb, "transformer_options": transformer_options}, {"original_block": block_wrap})
-                hidden_states = out["img"]
-                encoder_hidden_states = out["txt"]
-            else:
-                encoder_hidden_states, hidden_states = block(
-                    hidden_states=hidden_states,
-                    encoder_hidden_states=encoder_hidden_states,
-                    encoder_hidden_states_mask=encoder_hidden_states_mask,
-                    temb=temb,
-                    image_rotary_emb=image_rotary_emb,
-                    transformer_options=transformer_options,
-                )
-
-            if "double_block" in patches:
-                for p in patches["double_block"]:
-                    out = p({"img": hidden_states, "txt": encoder_hidden_states, "x": x, "block_index": i, "transformer_options": transformer_options})
-                    hidden_states = out["img"]
-                    encoder_hidden_states = out["txt"]
-
-            if control is not None: # Controlnet
-                control_i = control.get("input")
-                if i < len(control_i):
-                    add = control_i[i]
-                    if add is not None:
-                        hidden_states[:, :add.shape[1]] += add
-
-        hidden_states = self.norm_out(hidden_states, temb)
-        hidden_states = self.proj_out(hidden_states)
-
-        hidden_states = hidden_states[:, :num_embeds].view(orig_shape[0], orig_shape[-2] // 2, orig_shape[-1] // 2, orig_shape[1], 2, 2)
-        hidden_states = hidden_states.permute(0, 3, 1, 4, 2, 5)
-        return hidden_states.reshape(orig_shape)[:, :, :, :x.shape[-2], :x.shape[-1]]
--- a/comfy/ldm/wan/model.py
+++ b/comfy/ldm/wan/model.py
--- a/comfy/ldm/wan/model_animate.py
+++ b/comfy/ldm/wan/model_animate.py
@@ -1,548 +0,0 @@
-from torch import nn
-import torch
-from typing import Tuple, Optional
-from einops import rearrange
-import torch.nn.functional as F
-import math
-from .model import WanModel, sinusoidal_embedding_1d
-from comfy.ldm.modules.attention import optimized_attention
-import comfy.model_management
-
-class CausalConv1d(nn.Module):
-
-    def __init__(self, chan_in, chan_out, kernel_size=3, stride=1, dilation=1, pad_mode="replicate", operations=None, **kwargs):
-        super().__init__()
-
-        self.pad_mode = pad_mode
-        padding = (kernel_size - 1, 0)  # T
-        self.time_causal_padding = padding
-
-        self.conv = operations.Conv1d(chan_in, chan_out, kernel_size, stride=stride, dilation=dilation, **kwargs)
-
-    def forward(self, x):
-        x = F.pad(x, self.time_causal_padding, mode=self.pad_mode)
-        return self.conv(x)
-
-
-class FaceEncoder(nn.Module):
-    def __init__(self, in_dim: int, hidden_dim: int, num_heads=int, dtype=None, device=None, operations=None):
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-
-        self.num_heads = num_heads
-        self.conv1_local = CausalConv1d(in_dim, 1024 * num_heads, 3, stride=1, operations=operations, **factory_kwargs)
-        self.norm1 = operations.LayerNorm(hidden_dim // 8, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-        self.act = nn.SiLU()
-        self.conv2 = CausalConv1d(1024, 1024, 3, stride=2, operations=operations, **factory_kwargs)
-        self.conv3 = CausalConv1d(1024, 1024, 3, stride=2, operations=operations, **factory_kwargs)
-
-        self.out_proj = operations.Linear(1024, hidden_dim, **factory_kwargs)
-        self.norm1 = operations.LayerNorm(1024, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.norm2 = operations.LayerNorm(1024, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.norm3 = operations.LayerNorm(1024, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.padding_tokens = nn.Parameter(torch.empty(1, 1, 1, hidden_dim, **factory_kwargs))
-
-    def forward(self, x):
-
-        x = rearrange(x, "b t c -> b c t")
-        b, c, t = x.shape
-
-        x = self.conv1_local(x)
-        x = rearrange(x, "b (n c) t -> (b n) t c", n=self.num_heads)
-
-        x = self.norm1(x)
-        x = self.act(x)
-        x = rearrange(x, "b t c -> b c t")
-        x = self.conv2(x)
-        x = rearrange(x, "b c t -> b t c")
-        x = self.norm2(x)
-        x = self.act(x)
-        x = rearrange(x, "b t c -> b c t")
-        x = self.conv3(x)
-        x = rearrange(x, "b c t -> b t c")
-        x = self.norm3(x)
-        x = self.act(x)
-        x = self.out_proj(x)
-        x = rearrange(x, "(b n) t c -> b t n c", b=b)
-        padding = comfy.model_management.cast_to(self.padding_tokens, dtype=x.dtype, device=x.device).repeat(b, x.shape[1], 1, 1)
-        x = torch.cat([x, padding], dim=-2)
-        x_local = x.clone()
-
-        return x_local
-
-
-def get_norm_layer(norm_layer, operations=None):
-    """
-    Get the normalization layer.
-
-    Args:
-        norm_layer (str): The type of normalization layer.
-
-    Returns:
-        norm_layer (nn.Module): The normalization layer.
-    """
-    if norm_layer == "layer":
-        return operations.LayerNorm
-    elif norm_layer == "rms":
-        return operations.RMSNorm
-    else:
-        raise NotImplementedError(f"Norm layer {norm_layer} is not implemented")
-
-
-class FaceAdapter(nn.Module):
-    def __init__(
-        self,
-        hidden_dim: int,
-        heads_num: int,
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        num_adapter_layers: int = 1,
-        dtype=None, device=None, operations=None
-    ):
-
-        factory_kwargs = {"dtype": dtype, "device": device}
-        super().__init__()
-        self.hidden_size = hidden_dim
-        self.heads_num = heads_num
-        self.fuser_blocks = nn.ModuleList(
-            [
-                FaceBlock(
-                    self.hidden_size,
-                    self.heads_num,
-                    qk_norm=qk_norm,
-                    qk_norm_type=qk_norm_type,
-                    operations=operations,
-                    **factory_kwargs,
-                )
-                for _ in range(num_adapter_layers)
-            ]
-        )
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        motion_embed: torch.Tensor,
-        idx: int,
-        freqs_cis_q: Tuple[torch.Tensor, torch.Tensor] = None,
-        freqs_cis_k: Tuple[torch.Tensor, torch.Tensor] = None,
-    ) -> torch.Tensor:
-
-        return self.fuser_blocks[idx](x, motion_embed, freqs_cis_q, freqs_cis_k)
-
-
-
-class FaceBlock(nn.Module):
-    def __init__(
-        self,
-        hidden_size: int,
-        heads_num: int,
-        qk_norm: bool = True,
-        qk_norm_type: str = "rms",
-        qk_scale: float = None,
-        dtype: Optional[torch.dtype] = None,
-        device: Optional[torch.device] = None,
-        operations=None
-    ):
-        factory_kwargs = {"device": device, "dtype": dtype}
-        super().__init__()
-
-        self.deterministic = False
-        self.hidden_size = hidden_size
-        self.heads_num = heads_num
-        head_dim = hidden_size // heads_num
-        self.scale = qk_scale or head_dim**-0.5
-
-        self.linear1_kv = operations.Linear(hidden_size, hidden_size * 2, **factory_kwargs)
-        self.linear1_q = operations.Linear(hidden_size, hidden_size, **factory_kwargs)
-
-        self.linear2 = operations.Linear(hidden_size, hidden_size, **factory_kwargs)
-
-        qk_norm_layer = get_norm_layer(qk_norm_type, operations=operations)
-        self.q_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs) if qk_norm else nn.Identity()
-        )
-        self.k_norm = (
-            qk_norm_layer(head_dim, elementwise_affine=True, eps=1e-6, **factory_kwargs) if qk_norm else nn.Identity()
-        )
-
-        self.pre_norm_feat = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-        self.pre_norm_motion = operations.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6, **factory_kwargs)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        motion_vec: torch.Tensor,
-        motion_mask: Optional[torch.Tensor] = None,
-        # use_context_parallel=False,
-    ) -> torch.Tensor:
-
-        B, T, N, C = motion_vec.shape
-        T_comp = T
-
-        x_motion = self.pre_norm_motion(motion_vec)
-        x_feat = self.pre_norm_feat(x)
-
-        kv = self.linear1_kv(x_motion)
-        q = self.linear1_q(x_feat)
-
-        k, v = rearrange(kv, "B L N (K H D) -> K B L N H D", K=2, H=self.heads_num)
-        q = rearrange(q, "B S (H D) -> B S H D", H=self.heads_num)
-
-        # Apply QK-Norm if needed.
-        q = self.q_norm(q).to(v)
-        k = self.k_norm(k).to(v)
-
-        k = rearrange(k, "B L N H D -> (B L) N H D")
-        v = rearrange(v, "B L N H D -> (B L) N H D")
-
-        q = rearrange(q, "B (L S) H D -> (B L) S (H D)", L=T_comp)
-
-        attn = optimized_attention(q, k, v, heads=self.heads_num)
-
-        attn = rearrange(attn, "(B L) S C -> B (L S) C", L=T_comp)
-
-        output = self.linear2(attn)
-
-        if motion_mask is not None:
-            output = output * rearrange(motion_mask, "B T H W -> B (T H W)").unsqueeze(-1)
-
-        return output
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/ops/upfirdn2d/upfirdn2d.py#L162
-def upfirdn2d_native(input, kernel, up_x, up_y, down_x, down_y, pad_x0, pad_x1, pad_y0, pad_y1):
-    _, minor, in_h, in_w = input.shape
-    kernel_h, kernel_w = kernel.shape
-
-    out = input.view(-1, minor, in_h, 1, in_w, 1)
-    out = F.pad(out, [0, up_x - 1, 0, 0, 0, up_y - 1, 0, 0])
-    out = out.view(-1, minor, in_h * up_y, in_w * up_x)
-
-    out = F.pad(out, [max(pad_x0, 0), max(pad_x1, 0), max(pad_y0, 0), max(pad_y1, 0)])
-    out = out[:, :, max(-pad_y0, 0): out.shape[2] - max(-pad_y1, 0), max(-pad_x0, 0): out.shape[3] - max(-pad_x1, 0)]
-
-    out = out.reshape([-1, 1, in_h * up_y + pad_y0 + pad_y1, in_w * up_x + pad_x0 + pad_x1])
-    w = torch.flip(kernel, [0, 1]).view(1, 1, kernel_h, kernel_w)
-    out = F.conv2d(out, w)
-    out = out.reshape(-1, minor, in_h * up_y + pad_y0 + pad_y1 - kernel_h + 1, in_w * up_x + pad_x0 + pad_x1 - kernel_w + 1)
-    return out[:, :, ::down_y, ::down_x]
-
-def upfirdn2d(input, kernel, up=1, down=1, pad=(0, 0)):
-    return upfirdn2d_native(input, kernel, up, up, down, down, pad[0], pad[1], pad[0], pad[1])
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/ops/fused_act/fused_act.py#L81
-class FusedLeakyReLU(torch.nn.Module):
-    def __init__(self, channel, negative_slope=0.2, scale=2 ** 0.5, dtype=None, device=None):
-        super().__init__()
-        self.bias = torch.nn.Parameter(torch.empty(1, channel, 1, 1, dtype=dtype, device=device))
-        self.negative_slope = negative_slope
-        self.scale = scale
-
-    def forward(self, input):
-        return fused_leaky_relu(input, comfy.model_management.cast_to(self.bias, device=input.device, dtype=input.dtype), self.negative_slope, self.scale)
-
-def fused_leaky_relu(input, bias, negative_slope=0.2, scale=2 ** 0.5):
-    return F.leaky_relu(input + bias, negative_slope) * scale
-
-class Blur(torch.nn.Module):
-    def __init__(self, kernel, pad, dtype=None, device=None):
-        super().__init__()
-        kernel = torch.tensor(kernel, dtype=dtype, device=device)
-        kernel = kernel[None, :] * kernel[:, None]
-        kernel = kernel / kernel.sum()
-        self.register_buffer('kernel', kernel)
-        self.pad = pad
-
-    def forward(self, input):
-        return upfirdn2d(input, comfy.model_management.cast_to(self.kernel, dtype=input.dtype, device=input.device), pad=self.pad)
-
-#https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L590
-class ScaledLeakyReLU(torch.nn.Module):
-    def __init__(self, negative_slope=0.2):
-        super().__init__()
-        self.negative_slope = negative_slope
-
-    def forward(self, input):
-        return F.leaky_relu(input, negative_slope=self.negative_slope)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L605
-class EqualConv2d(torch.nn.Module):
-    def __init__(self, in_channel, out_channel, kernel_size, stride=1, padding=0, bias=True, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.weight = torch.nn.Parameter(torch.empty(out_channel, in_channel, kernel_size, kernel_size, device=device, dtype=dtype))
-        self.scale = 1 / math.sqrt(in_channel * kernel_size ** 2)
-        self.stride = stride
-        self.padding = padding
-        self.bias = torch.nn.Parameter(torch.empty(out_channel, device=device, dtype=dtype)) if bias else None
-
-    def forward(self, input):
-        if self.bias is None:
-            bias = None
-        else:
-            bias = comfy.model_management.cast_to(self.bias, device=input.device, dtype=input.dtype)
-
-        return F.conv2d(input, comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) * self.scale, bias=bias, stride=self.stride, padding=self.padding)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L134
-class EqualLinear(torch.nn.Module):
-    def __init__(self, in_dim, out_dim, bias=True, bias_init=0, lr_mul=1, activation=None, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.weight = torch.nn.Parameter(torch.empty(out_dim, in_dim, device=device, dtype=dtype))
-        self.bias = torch.nn.Parameter(torch.empty(out_dim, device=device, dtype=dtype)) if bias else None
-        self.activation = activation
-        self.scale = (1 / math.sqrt(in_dim)) * lr_mul
-        self.lr_mul = lr_mul
-
-    def forward(self, input):
-        if self.bias is None:
-            bias = None
-        else:
-            bias = comfy.model_management.cast_to(self.bias, device=input.device, dtype=input.dtype) * self.lr_mul
-
-        if self.activation:
-            out = F.linear(input, comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) * self.scale)
-            return fused_leaky_relu(out, bias)
-        return F.linear(input, comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) * self.scale, bias=bias)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L654
-class ConvLayer(torch.nn.Sequential):
-    def __init__(self, in_channel, out_channel, kernel_size, downsample=False, blur_kernel=[1, 3, 3, 1], bias=True, activate=True, dtype=None, device=None, operations=None):
-        layers = []
-
-        if downsample:
-            factor = 2
-            p = (len(blur_kernel) - factor) + (kernel_size - 1)
-            layers.append(Blur(blur_kernel, pad=((p + 1) // 2, p // 2)))
-            stride, padding = 2, 0
-        else:
-            stride, padding = 1, kernel_size // 2
-
-        layers.append(EqualConv2d(in_channel, out_channel, kernel_size, padding=padding, stride=stride, bias=bias and not activate, dtype=dtype, device=device, operations=operations))
-
-        if activate:
-            layers.append(FusedLeakyReLU(out_channel) if bias else ScaledLeakyReLU(0.2))
-
-        super().__init__(*layers)
-
-# https://github.com/XPixelGroup/BasicSR/blob/8d56e3a045f9fb3e1d8872f92ee4a4f07f886b0a/basicsr/archs/stylegan2_arch.py#L704
-class ResBlock(torch.nn.Module):
-    def __init__(self, in_channel, out_channel, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.conv1 = ConvLayer(in_channel, in_channel, 3, dtype=dtype, device=device, operations=operations)
-        self.conv2 = ConvLayer(in_channel, out_channel, 3, downsample=True, dtype=dtype, device=device, operations=operations)
-        self.skip = ConvLayer(in_channel, out_channel, 1, downsample=True, activate=False, bias=False, dtype=dtype, device=device, operations=operations)
-
-    def forward(self, input):
-        out = self.conv2(self.conv1(input))
-        skip = self.skip(input)
-        return (out + skip) / math.sqrt(2)
-
-
-class EncoderApp(torch.nn.Module):
-    def __init__(self, w_dim=512, dtype=None, device=None, operations=None):
-        super().__init__()
-        kwargs = {"device": device, "dtype": dtype, "operations": operations}
-
-        self.convs = torch.nn.ModuleList([
-            ConvLayer(3, 32, 1, **kwargs), ResBlock(32, 64, **kwargs),
-            ResBlock(64, 128, **kwargs), ResBlock(128, 256, **kwargs),
-            ResBlock(256, 512, **kwargs), ResBlock(512, 512, **kwargs),
-            ResBlock(512, 512, **kwargs), ResBlock(512, 512, **kwargs),
-            EqualConv2d(512, w_dim, 4, padding=0, bias=False, **kwargs)
-        ])
-
-    def forward(self, x):
-        h = x
-        for conv in self.convs:
-            h = conv(h)
-        return h.squeeze(-1).squeeze(-1)
-
-class Encoder(torch.nn.Module):
-    def __init__(self, dim=512, motion_dim=20, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.net_app = EncoderApp(dim, dtype=dtype, device=device, operations=operations)
-        self.fc = torch.nn.Sequential(*[EqualLinear(dim, dim, dtype=dtype, device=device, operations=operations) for _ in range(4)] + [EqualLinear(dim, motion_dim, dtype=dtype, device=device, operations=operations)])
-
-    def encode_motion(self, x):
-        return self.fc(self.net_app(x))
-
-class Direction(torch.nn.Module):
-    def __init__(self, motion_dim, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.weight = torch.nn.Parameter(torch.empty(512, motion_dim, device=device, dtype=dtype))
-        self.motion_dim = motion_dim
-
-    def forward(self, input):
-        stabilized_weight = comfy.model_management.cast_to(self.weight, device=input.device, dtype=input.dtype) + 1e-8 * torch.eye(512, self.motion_dim, device=input.device, dtype=input.dtype)
-        Q, _ = torch.linalg.qr(stabilized_weight.float())
-        if input is None:
-            return Q
-        return torch.sum(input.unsqueeze(-1) * Q.T.to(input.dtype), dim=1)
-
-class Synthesis(torch.nn.Module):
-    def __init__(self, motion_dim, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.direction = Direction(motion_dim, dtype=dtype, device=device, operations=operations)
-
-class Generator(torch.nn.Module):
-    def __init__(self, style_dim=512, motion_dim=20, dtype=None, device=None, operations=None):
-        super().__init__()
-        self.enc = Encoder(style_dim, motion_dim, dtype=dtype, device=device, operations=operations)
-        self.dec = Synthesis(motion_dim, dtype=dtype, device=device, operations=operations)
-
-    def get_motion(self, img):
-        motion_feat = self.enc.encode_motion(img)
-        return self.dec.direction(motion_feat)
-
-class AnimateWanModel(WanModel):
-    r"""
-    Wan diffusion backbone supporting both text-to-video and image-to-video.
-    """
-
-    def __init__(self,
-                 model_type='animate',
-                 patch_size=(1, 2, 2),
-                 text_len=512,
-                 in_dim=16,
-                 dim=2048,
-                 ffn_dim=8192,
-                 freq_dim=256,
-                 text_dim=4096,
-                 out_dim=16,
-                 num_heads=16,
-                 num_layers=32,
-                 window_size=(-1, -1),
-                 qk_norm=True,
-                 cross_attn_norm=True,
-                 eps=1e-6,
-                 flf_pos_embed_token_number=None,
-                 motion_encoder_dim=512,
-                 image_model=None,
-                 device=None,
-                 dtype=None,
-                 operations=None,
-                 ):
-
-        super().__init__(model_type='i2v', patch_size=patch_size, text_len=text_len, in_dim=in_dim, dim=dim, ffn_dim=ffn_dim, freq_dim=freq_dim, text_dim=text_dim, out_dim=out_dim, num_heads=num_heads, num_layers=num_layers, window_size=window_size, qk_norm=qk_norm, cross_attn_norm=cross_attn_norm, eps=eps, flf_pos_embed_token_number=flf_pos_embed_token_number, image_model=image_model, device=device, dtype=dtype, operations=operations)
-
-        self.pose_patch_embedding = operations.Conv3d(
-            16, dim, kernel_size=patch_size, stride=patch_size, device=device, dtype=dtype
-        )
-
-        self.motion_encoder = Generator(style_dim=512, motion_dim=20, device=device, dtype=dtype, operations=operations)
-
-        self.face_adapter = FaceAdapter(
-            heads_num=self.num_heads,
-            hidden_dim=self.dim,
-            num_adapter_layers=self.num_layers // 5,
-            device=device, dtype=dtype, operations=operations
-        )
-
-        self.face_encoder = FaceEncoder(
-            in_dim=motion_encoder_dim,
-            hidden_dim=self.dim,
-            num_heads=4,
-            device=device, dtype=dtype, operations=operations
-        )
-
-    def after_patch_embedding(self, x, pose_latents, face_pixel_values):
-        if pose_latents is not None:
-            pose_latents = self.pose_patch_embedding(pose_latents)
-            x[:, :, 1:pose_latents.shape[2] + 1] += pose_latents[:, :, :x.shape[2] - 1]
-
-        if face_pixel_values is None:
-            return x, None
-
-        b, c, T, h, w = face_pixel_values.shape
-        face_pixel_values = rearrange(face_pixel_values, "b c t h w -> (b t) c h w")
-        encode_bs = 8
-        face_pixel_values_tmp = []
-        for i in range(math.ceil(face_pixel_values.shape[0] / encode_bs)):
-            face_pixel_values_tmp.append(self.motion_encoder.get_motion(face_pixel_values[i * encode_bs: (i + 1) * encode_bs]))
-
-        motion_vec = torch.cat(face_pixel_values_tmp)
-
-        motion_vec = rearrange(motion_vec, "(b t) c -> b t c", t=T)
-        motion_vec = self.face_encoder(motion_vec)
-
-        B, L, H, C = motion_vec.shape
-        pad_face = torch.zeros(B, 1, H, C).type_as(motion_vec)
-        motion_vec = torch.cat([pad_face, motion_vec], dim=1)
-
-        if motion_vec.shape[1] < x.shape[2]:
-            B, L, H, C = motion_vec.shape
-            pad = torch.zeros(B, x.shape[2] - motion_vec.shape[1], H, C).type_as(motion_vec)
-            motion_vec = torch.cat([motion_vec, pad], dim=1)
-        else:
-            motion_vec = motion_vec[:, :x.shape[2]]
-        return x, motion_vec
-
-    def forward_orig(
-        self,
-        x,
-        t,
-        context,
-        clip_fea=None,
-        pose_latents=None,
-        face_pixel_values=None,
-        freqs=None,
-        transformer_options={},
-        **kwargs,
-    ):
-        # embeddings
-        x = self.patch_embedding(x.float()).to(x.dtype)
-        x, motion_vec = self.after_patch_embedding(x, pose_latents, face_pixel_values)
-        grid_sizes = x.shape[2:]
-        x = x.flatten(2).transpose(1, 2)
-
-        # time embeddings
-        e = self.time_embedding(
-            sinusoidal_embedding_1d(self.freq_dim, t.flatten()).to(dtype=x[0].dtype))
-        e = e.reshape(t.shape[0], -1, e.shape[-1])
-        e0 = self.time_projection(e).unflatten(2, (6, self.dim))
-
-        full_ref = None
-        if self.ref_conv is not None:
-            full_ref = kwargs.get("reference_latent", None)
-            if full_ref is not None:
-                full_ref = self.ref_conv(full_ref).flatten(2).transpose(1, 2)
-                x = torch.concat((full_ref, x), dim=1)
-
-        # context
-        context = self.text_embedding(context)
-
-        context_img_len = None
-        if clip_fea is not None:
-            if self.img_emb is not None:
-                context_clip = self.img_emb(clip_fea)  # bs x 257 x dim
-                context = torch.concat([context_clip, context], dim=1)
-            context_img_len = clip_fea.shape[-2]
-
-        patches_replace = transformer_options.get("patches_replace", {})
-        blocks_replace = patches_replace.get("dit", {})
-        for i, block in enumerate(self.blocks):
-            if ("double_block", i) in blocks_replace:
-                def block_wrap(args):
-                    out = {}
-                    out["img"] = block(args["img"], context=args["txt"], e=args["vec"], freqs=args["pe"], context_img_len=context_img_len, transformer_options=args["transformer_options"])
-                    return out
-                out = blocks_replace[("double_block", i)]({"img": x, "txt": context, "vec": e0, "pe": freqs, "transformer_options": transformer_options}, {"original_block": block_wrap})
-                x = out["img"]
-            else:
-                x = block(x, e=e0, freqs=freqs, context=context, context_img_len=context_img_len, transformer_options=transformer_options)
-
-            if i % 5 == 0 and motion_vec is not None:
-                x = x + self.face_adapter.fuser_blocks[i // 5](x, motion_vec)
-
-        # head
-        x = self.head(x, e)
-
-        if full_ref is not None:
-            x = x[:, full_ref.shape[1]:]
-
-        # unpatchify
-        x = self.unpatchify(x, grid_sizes)
-        return x
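Review note on the removed `CausalConv1d` in `model_animate.py`: it pads only the left side of the time axis by `kernel_size - 1` samples, using `replicate` mode, so each output step depends only on the current and earlier inputs. The sketch below illustrates that idea in plain Python rather than PyTorch. The names `causal_pad` and `causal_conv1d` are hypothetical, and it assumes a non-empty single-channel sequence with stride 1 and dilation 1.

```python
def causal_pad(seq, kernel_size, mode="replicate"):
    """Left-pad a 1-D sequence with kernel_size - 1 values; no right padding."""
    pad = kernel_size - 1
    if pad == 0:
        return list(seq)
    # "replicate" repeats the first sample; any other mode falls back to zeros.
    fill = seq[0] if mode == "replicate" else 0.0
    return [fill] * pad + list(seq)


def causal_conv1d(seq, weights):
    """Stride-1 cross-correlation over the causally padded sequence.

    The last weight lines up with the current sample, so output[t] never
    reads seq[t + 1] or anything later.
    """
    k = len(weights)
    padded = causal_pad(seq, k)
    return [
        sum(w * padded[t + j] for j, w in enumerate(weights))
        for t in range(len(seq))
    ]
```

Because the padding is all on the left, the output has the same length as the input, and changing a later sample leaves earlier outputs unchanged.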
--- a/comfy/ldm/wan/vae.py
+++ b/comfy/ldm/wan/vae.py
@@ -24,17 +24,12 @@ class CausalConv3d(ops.Conv3d):
                         self.padding[1], 2 * self.padding[0], 0)
        self.padding = (0, 0, 0)

-    def forward(self, x, cache_x=None, cache_list=None, cache_idx=None):
-        if cache_list is not None:
-            cache_x = cache_list[cache_idx]
-            cache_list[cache_idx] = None
-
+    def forward(self, x, cache_x=None):
        padding = list(self._padding)
        if cache_x is not None and self._padding[4] > 0:
            cache_x = cache_x.to(x.device)
            x = torch.cat([cache_x, x], dim=2)
            padding[4] -= cache_x.shape[2]
-            del cache_x
        x = F.pad(x, padding)

        return super().forward(x)
@@ -57,6 +52,15 @@ class RMS_norm(nn.Module):
            x, dim=(1 if self.channel_first else -1)) * self.scale * self.gamma.to(x) + (self.bias.to(x) if self.bias is not None else 0)


+class Upsample(nn.Upsample):
+
+    def forward(self, x):
+        """
+        Fix bfloat16 support for nearest neighbor interpolation.
+        """
+        return super().forward(x.float()).type_as(x)
+
+
 class Resample(nn.Module):

    def __init__(self, dim, mode):
@@ -69,11 +73,11 @@ class Resample(nn.Module):
        # layers
        if mode == 'upsample2d':
            self.resample = nn.Sequential(
-                nn.Upsample(scale_factor=(2., 2.), mode='nearest-exact'),
+                Upsample(scale_factor=(2., 2.), mode='nearest-exact'),
                ops.Conv2d(dim, dim // 2, 3, padding=1))
        elif mode == 'upsample3d':
            self.resample = nn.Sequential(
-                nn.Upsample(scale_factor=(2., 2.), mode='nearest-exact'),
+                Upsample(scale_factor=(2., 2.), mode='nearest-exact'),
                ops.Conv2d(dim, dim // 2, 3, padding=1))
            self.time_conv = CausalConv3d(
                dim, dim * 2, (3, 1, 1), padding=(1, 0, 0))
@@ -153,6 +157,29 @@ class Resample(nn.Module):
                    feat_idx[0] += 1
        return x

+    def init_weight(self, conv):
+        conv_weight = conv.weight
+        nn.init.zeros_(conv_weight)
+        c1, c2, t, h, w = conv_weight.size()
+        one_matrix = torch.eye(c1, c2)
+        init_matrix = one_matrix
+        nn.init.zeros_(conv_weight)
+        #conv_weight.data[:,:,-1,1,1] = init_matrix * 0.5
+        conv_weight.data[:, :, 1, 0, 0] = init_matrix  #* 0.5
+        conv.weight.data.copy_(conv_weight)
+        nn.init.zeros_(conv.bias.data)
+
+    def init_weight2(self, conv):
+        conv_weight = conv.weight.data
+        nn.init.zeros_(conv_weight)
+        c1, c2, t, h, w = conv_weight.size()
+        init_matrix = torch.eye(c1 // 2, c2)
+        #init_matrix = repeat(init_matrix, 'o ... -> (o 2) ...').permute(1,0,2).contiguous().reshape(c1,c2)
+        conv_weight[:c1 // 2, :, -1, 0, 0] = init_matrix
+        conv_weight[c1 // 2:, :, -1, 0, 0] = init_matrix
+        conv.weight.data.copy_(conv_weight)
+        nn.init.zeros_(conv.bias.data)
+

 class ResidualBlock(nn.Module):

@@ -171,7 +198,7 @@ class ResidualBlock(nn.Module):
            if in_dim != out_dim else nn.Identity()

    def forward(self, x, feat_cache=None, feat_idx=[0]):
-        old_x = x
+        h = self.shortcut(x)
        for layer in self.residual:
            if isinstance(layer, CausalConv3d) and feat_cache is not None:
                idx = feat_idx[0]
@@ -183,12 +210,12 @@ class ResidualBlock(nn.Module):
                            cache_x.device), cache_x
                    ],
                                        dim=2)
-                x = layer(x, cache_list=feat_cache, cache_idx=idx)
+                x = layer(x, feat_cache[idx])
                feat_cache[idx] = cache_x
                feat_idx[0] += 1
            else:
                x = layer(x)
-        return x + self.shortcut(old_x)
+        return x + h


 class AttentionBlock(nn.Module):
@@ -467,6 +494,12 @@ class WanVAE(nn.Module):
        self.decoder = Decoder3d(dim, z_dim, dim_mult, num_res_blocks,
                                 attn_scales, self.temperal_upsample, dropout)

+    def forward(self, x):
+        mu, log_var = self.encode(x)
+        z = self.reparameterize(mu, log_var)
+        x_recon = self.decode(z)
+        return x_recon, mu, log_var
+
    def encode(self, x):
        self.clear_cache()
        ## cache
@@ -512,6 +545,18 @@ class WanVAE(nn.Module):
        self.clear_cache()
        return out

+    def reparameterize(self, mu, log_var):
+        std = torch.exp(0.5 * log_var)
+        eps = torch.randn_like(std)
+        return eps * std + mu
+
+    def sample(self, imgs, deterministic=False):
+        mu, log_var = self.encode(imgs)
+        if deterministic:
+            return mu
+        std = torch.exp(0.5 * log_var.clamp(-30.0, 20.0))
+        return mu + std * torch.randn_like(std)
+
    def clear_cache(self):
        self._conv_num = count_conv3d(self.decoder)
        self._conv_idx = [0]
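Review note on the `reparameterize` and `sample` methods added to `WanVAE` in `vae.py`: both draw a latent as `mu + std * eps`, with `std = exp(0.5 * log_var)` and `eps` drawn from a standard normal, and `sample` additionally clamps `log_var` to `[-30, 20]` before exponentiating. The sketch below reproduces that arithmetic on plain Python lists instead of tensors. The `rng` parameter is an assumption added here for reproducibility and is not part of the diff.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + std * eps elementwise, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def sample(mu, log_var, deterministic=False, rng=random):
    """Return the mean when deterministic, otherwise a clamped random draw."""
    if deterministic:
        return list(mu)
    # Clamping keeps exp() in a safe range for extreme log-variances.
    clamped = [min(max(lv, -30.0), 20.0) for lv in log_var]
    return reparameterize(mu, clamped, rng)
```

Passing a seeded `random.Random` instance as `rng` makes the draws repeatable, which can help when comparing outputs across runs.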
--- a/comfy/ldm/wan/vae2_2.py
+++ b/comfy/ldm/wan/vae2_2.py
@@ -1,726 +0,0 @@
-# original version: https://github.com/Wan-Video/Wan2.2/blob/main/wan/modules/vae2_2.py
-# Copyright 2024-2025 The Alibaba Wan Team Authors. All rights reserved.
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from einops import rearrange
-from .vae import AttentionBlock, CausalConv3d, RMS_norm
-
-import comfy.ops
-ops = comfy.ops.disable_weight_init
-
-CACHE_T = 2
-
-
-class Resample(nn.Module):
-
-    def __init__(self, dim, mode):
-        assert mode in (
-            "none",
-            "upsample2d",
-            "upsample3d",
-            "downsample2d",
-            "downsample3d",
-        )
-        super().__init__()
-        self.dim = dim
-        self.mode = mode
-
-        # layers
-        if mode == "upsample2d":
-            self.resample = nn.Sequential(
-                nn.Upsample(scale_factor=(2.0, 2.0), mode="nearest-exact"),
-                ops.Conv2d(dim, dim, 3, padding=1),
-            )
-        elif mode == "upsample3d":
-            self.resample = nn.Sequential(
-                nn.Upsample(scale_factor=(2.0, 2.0), mode="nearest-exact"),
-                ops.Conv2d(dim, dim, 3, padding=1),
-                # ops.Conv2d(dim, dim//2, 3, padding=1)
-            )
-            self.time_conv = CausalConv3d(
-                dim, dim * 2, (3, 1, 1), padding=(1, 0, 0))
-        elif mode == "downsample2d":
-            self.resample = nn.Sequential(
-                nn.ZeroPad2d((0, 1, 0, 1)),
-                ops.Conv2d(dim, dim, 3, stride=(2, 2)))
-        elif mode == "downsample3d":
-            self.resample = nn.Sequential(
-                nn.ZeroPad2d((0, 1, 0, 1)),
-                ops.Conv2d(dim, dim, 3, stride=(2, 2)))
-            self.time_conv = CausalConv3d(
-                dim, dim, (3, 1, 1), stride=(2, 1, 1), padding=(0, 0, 0))
-        else:
-            self.resample = nn.Identity()
-
-    def forward(self, x, feat_cache=None, feat_idx=[0]):
-        b, c, t, h, w = x.size()
-        if self.mode == "upsample3d":
-            if feat_cache is not None:
-                idx = feat_idx[0]
-                if feat_cache[idx] is None:
-                    feat_cache[idx] = "Rep"
-                    feat_idx[0] += 1
-                else:
-                    cache_x = x[:, :, -CACHE_T:, :, :].clone()
-                    if (cache_x.shape[2] < 2 and feat_cache[idx] is not None and
-                            feat_cache[idx] != "Rep"):
-                        # cache last frame of last two chunk
-                        cache_x = torch.cat(
-                            [
-                                feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                                    cache_x.device),
-                                cache_x,
-                            ],
-                            dim=2,
-                        )
-                    if (cache_x.shape[2] < 2 and feat_cache[idx] is not None and
-                            feat_cache[idx] == "Rep"):
-                        cache_x = torch.cat(
-                            [
-                                torch.zeros_like(cache_x).to(cache_x.device),
-                                cache_x
-                            ],
-                            dim=2,
-                        )
-                    if feat_cache[idx] == "Rep":
-                        x = self.time_conv(x)
-                    else:
-                        x = self.time_conv(x, feat_cache[idx])
-                    feat_cache[idx] = cache_x
-                    feat_idx[0] += 1
-                    x = x.reshape(b, 2, c, t, h, w)
-                    x = torch.stack((x[:, 0, :, :, :, :], x[:, 1, :, :, :, :]),
-                                    3)
-                    x = x.reshape(b, c, t * 2, h, w)
-        t = x.shape[2]
-        x = rearrange(x, "b c t h w -> (b t) c h w")
-        x = self.resample(x)
-        x = rearrange(x, "(b t) c h w -> b c t h w", t=t)
-
-        if self.mode == "downsample3d":
-            if feat_cache is not None:
-                idx = feat_idx[0]
-                if feat_cache[idx] is None:
-                    feat_cache[idx] = x.clone()
-                    feat_idx[0] += 1
-                else:
-                    cache_x = x[:, :, -1:, :, :].clone()
-                    x = self.time_conv(
-                        torch.cat([feat_cache[idx][:, :, -1:, :, :], x], 2))
-                    feat_cache[idx] = cache_x
-                    feat_idx[0] += 1
-        return x
-
-
-class ResidualBlock(nn.Module):
-
-    def __init__(self, in_dim, out_dim, dropout=0.0):
-        super().__init__()
-        self.in_dim = in_dim
-        self.out_dim = out_dim
-
-        # layers
-        self.residual = nn.Sequential(
-            RMS_norm(in_dim, images=False),
-            nn.SiLU(),
-            CausalConv3d(in_dim, out_dim, 3, padding=1),
-            RMS_norm(out_dim, images=False),
-            nn.SiLU(),
-            nn.Dropout(dropout),
-            CausalConv3d(out_dim, out_dim, 3, padding=1),
-        )
-        self.shortcut = (
-            CausalConv3d(in_dim, out_dim, 1)
-            if in_dim != out_dim else nn.Identity())
-
-    def forward(self, x, feat_cache=None, feat_idx=[0]):
-        old_x = x
-        for layer in self.residual:
-            if isinstance(layer, CausalConv3d) and feat_cache is not None:
-                idx = feat_idx[0]
-                cache_x = x[:, :, -CACHE_T:, :, :].clone()
-                if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
-                    # cache last frame of last two chunk
-                    cache_x = torch.cat(
-                        [
-                            feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                                cache_x.device),
-                            cache_x,
-                        ],
-                        dim=2,
-                    )
-                x = layer(x, cache_list=feat_cache, cache_idx=idx)
-                feat_cache[idx] = cache_x
-                feat_idx[0] += 1
-            else:
-                x = layer(x)
-        return x + self.shortcut(old_x)
-
-
-def patchify(x, patch_size):
-    if patch_size == 1:
-        return x
-    if x.dim() == 4:
-        x = rearrange(
-            x, "b c (h q) (w r) -> b (c r q) h w", q=patch_size, r=patch_size)
-    elif x.dim() == 5:
-        x = rearrange(
-            x,
-            "b c f (h q) (w r) -> b (c r q) f h w",
-            q=patch_size,
-            r=patch_size,
-        )
-    else:
-        raise ValueError(f"Invalid input shape: {x.shape}")
-
-    return x
-
-
-def unpatchify(x, patch_size):
-    if patch_size == 1:
-        return x
-
-    if x.dim() == 4:
-        x = rearrange(
-            x, "b (c r q) h w -> b c (h q) (w r)", q=patch_size, r=patch_size)
-    elif x.dim() == 5:
-        x = rearrange(
-            x,
-            "b (c r q) f h w -> b c f (h q) (w r)",
-            q=patch_size,
-            r=patch_size,
-        )
-    return x
-
-
-class AvgDown3D(nn.Module):
-
-    def __init__(
-        self,
-        in_channels,
-        out_channels,
-        factor_t,
-        factor_s=1,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.factor_t = factor_t
-        self.factor_s = factor_s
-        self.factor = self.factor_t * self.factor_s * self.factor_s
-
-        assert in_channels * self.factor % out_channels == 0
-        self.group_size = in_channels * self.factor // out_channels
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        pad_t = (self.factor_t - x.shape[2] % self.factor_t) % self.factor_t
-        pad = (0, 0, 0, 0, pad_t, 0)
-        x = F.pad(x, pad)
-        B, C, T, H, W = x.shape
-        x = x.view(
-            B,
-            C,
-            T // self.factor_t,
-            self.factor_t,
-            H // self.factor_s,
-            self.factor_s,
-            W // self.factor_s,
-            self.factor_s,
-        )
-        x = x.permute(0, 1, 3, 5, 7, 2, 4, 6).contiguous()
-        x = x.view(
-            B,
-            C * self.factor,
-            T // self.factor_t,
-            H // self.factor_s,
-            W // self.factor_s,
-        )
-        x = x.view(
-            B,
-            self.out_channels,
-            self.group_size,
-            T // self.factor_t,
-            H // self.factor_s,
-            W // self.factor_s,
-        )
-        x = x.mean(dim=2)
-        return x
-
-
-class DupUp3D(nn.Module):
-
-    def __init__(
-        self,
-        in_channels: int,
-        out_channels: int,
-        factor_t,
-        factor_s=1,
-    ):
-        super().__init__()
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-
-        self.factor_t = factor_t
-        self.factor_s = factor_s
-        self.factor = self.factor_t * self.factor_s * self.factor_s
-
-        assert out_channels * self.factor % in_channels == 0
-        self.repeats = out_channels * self.factor // in_channels
-
-    def forward(self, x: torch.Tensor, first_chunk=False) -> torch.Tensor:
-        x = x.repeat_interleave(self.repeats, dim=1)
-        x = x.view(
-            x.size(0),
-            self.out_channels,
-            self.factor_t,
-            self.factor_s,
-            self.factor_s,
-            x.size(2),
-            x.size(3),
-            x.size(4),
-        )
-        x = x.permute(0, 1, 5, 2, 6, 3, 7, 4).contiguous()
-        x = x.view(
-            x.size(0),
-            self.out_channels,
-            x.size(2) * self.factor_t,
-            x.size(4) * self.factor_s,
-            x.size(6) * self.factor_s,
-        )
-        if first_chunk:
-            x = x[:, :, self.factor_t - 1:, :, :]
-        return x
-
-
-class Down_ResidualBlock(nn.Module):
-
-    def __init__(self,
-                 in_dim,
-                 out_dim,
-                 dropout,
-                 mult,
-                 temperal_downsample=False,
-                 down_flag=False):
-        super().__init__()
-
-        # Shortcut path with downsample
-        self.avg_shortcut = AvgDown3D(
-            in_dim,
-            out_dim,
-            factor_t=2 if temperal_downsample else 1,
-            factor_s=2 if down_flag else 1,
-        )
-
-        # Main path with residual blocks and downsample
-        downsamples = []
-        for _ in range(mult):
-            downsamples.append(ResidualBlock(in_dim, out_dim, dropout))
-            in_dim = out_dim
-
-        # Add the final downsample block
-        if down_flag:
-            mode = "downsample3d" if temperal_downsample else "downsample2d"
-            downsamples.append(Resample(out_dim, mode=mode))
-
-        self.downsamples = nn.Sequential(*downsamples)
-
-    def forward(self, x, feat_cache=None, feat_idx=[0]):
-        x_copy = x
-        for module in self.downsamples:
-            x = module(x, feat_cache, feat_idx)
-
-        return x + self.avg_shortcut(x_copy)
-
-
-class Up_ResidualBlock(nn.Module):
-
-    def __init__(self,
-                 in_dim,
-                 out_dim,
-                 dropout,
-                 mult,
-                 temperal_upsample=False,
-                 up_flag=False):
-        super().__init__()
-        # Shortcut path with upsample
-        if up_flag:
-            self.avg_shortcut = DupUp3D(
-                in_dim,
-                out_dim,
-                factor_t=2 if temperal_upsample else 1,
-                factor_s=2 if up_flag else 1,
-            )
-        else:
-            self.avg_shortcut = None
-
-        # Main path with residual blocks and upsample
-        upsamples = []
-        for _ in range(mult):
-            upsamples.append(ResidualBlock(in_dim, out_dim, dropout))
-            in_dim = out_dim
-
-        # Add the final upsample block
-        if up_flag:
-            mode = "upsample3d" if temperal_upsample else "upsample2d"
-            upsamples.append(Resample(out_dim, mode=mode))
-
-        self.upsamples = nn.Sequential(*upsamples)
-
-    def forward(self, x, feat_cache=None, feat_idx=[0], first_chunk=False):
-        x_main = x
-        for module in self.upsamples:
-            x_main = module(x_main, feat_cache, feat_idx)
-        if self.avg_shortcut is not None:
-            x_shortcut = self.avg_shortcut(x, first_chunk)
-            return x_main + x_shortcut
-        else:
-            return x_main
-
-
-class Encoder3d(nn.Module):
-
-    def __init__(
-        self,
-        dim=128,
-        z_dim=4,
-        dim_mult=[1, 2, 4, 4],
-        num_res_blocks=2,
-        attn_scales=[],
-        temperal_downsample=[True, True, False],
-        dropout=0.0,
-    ):
-        super().__init__()
-        self.dim = dim
-        self.z_dim = z_dim
-        self.dim_mult = dim_mult
-        self.num_res_blocks = num_res_blocks
-        self.attn_scales = attn_scales
-        self.temperal_downsample = temperal_downsample
-
-        # dimensions
-        dims = [dim * u for u in [1] + dim_mult]
-        scale = 1.0
-
-        # init block
-        self.conv1 = CausalConv3d(12, dims[0], 3, padding=1)
-
-        # downsample blocks
-        downsamples = []
-        for i, (in_dim, out_dim) in enumerate(zip(dims[:-1], dims[1:])):
-            t_down_flag = (
-                temperal_downsample[i]
-                if i < len(temperal_downsample) else False)
-            downsamples.append(
-                Down_ResidualBlock(
-                    in_dim=in_dim,
-                    out_dim=out_dim,
-                    dropout=dropout,
-                    mult=num_res_blocks,
-                    temperal_downsample=t_down_flag,
-                    down_flag=i != len(dim_mult) - 1,
-                ))
-            scale /= 2.0
-        self.downsamples = nn.Sequential(*downsamples)
-
-        # middle blocks
-        self.middle = nn.Sequential(
-            ResidualBlock(out_dim, out_dim, dropout),
-            AttentionBlock(out_dim),
-            ResidualBlock(out_dim, out_dim, dropout),
-        )
-
-        # # output blocks
-        self.head = nn.Sequential(
-            RMS_norm(out_dim, images=False),
-            nn.SiLU(),
-            CausalConv3d(out_dim, z_dim, 3, padding=1),
-        )
-
-    def forward(self, x, feat_cache=None, feat_idx=[0]):
-
-        if feat_cache is not None:
-            idx = feat_idx[0]
-            cache_x = x[:, :, -CACHE_T:, :, :].clone()
-            if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
-                cache_x = torch.cat(
-                    [
-                        feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                            cache_x.device),
-                        cache_x,
-                    ],
-                    dim=2,
-                )
-            x = self.conv1(x, feat_cache[idx])
-            feat_cache[idx] = cache_x
-            feat_idx[0] += 1
-        else:
-            x = self.conv1(x)
-
-        ## downsamples
-        for layer in self.downsamples:
-            if feat_cache is not None:
-                x = layer(x, feat_cache, feat_idx)
-            else:
-                x = layer(x)
-
-        ## middle
-        for layer in self.middle:
-            if isinstance(layer, ResidualBlock) and feat_cache is not None:
-                x = layer(x, feat_cache, feat_idx)
-            else:
-                x = layer(x)
-
-        ## head
-        for layer in self.head:
-            if isinstance(layer, CausalConv3d) and feat_cache is not None:
-                idx = feat_idx[0]
-                cache_x = x[:, :, -CACHE_T:, :, :].clone()
-                if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
-                    cache_x = torch.cat(
-                        [
-                            feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                                cache_x.device),
-                            cache_x,
-                        ],
-                        dim=2,
-                    )
-                x = layer(x, feat_cache[idx])
-                feat_cache[idx] = cache_x
-                feat_idx[0] += 1
-            else:
-                x = layer(x)
-
-        return x
-
-
-class Decoder3d(nn.Module):
-
-    def __init__(
-        self,
-        dim=128,
-        z_dim=4,
-        dim_mult=[1, 2, 4, 4],
-        num_res_blocks=2,
-        attn_scales=[],
-        temperal_upsample=[False, True, True],
-        dropout=0.0,
-    ):
-        super().__init__()
-        self.dim = dim
-        self.z_dim = z_dim
-        self.dim_mult = dim_mult
-        self.num_res_blocks = num_res_blocks
-        self.attn_scales = attn_scales
-        self.temperal_upsample = temperal_upsample
-
-        # dimensions
-        dims = [dim * u for u in [dim_mult[-1]] + dim_mult[::-1]]
-        # init block
-        self.conv1 = CausalConv3d(z_dim, dims[0], 3, padding=1)
-
-        # middle blocks
-        self.middle = nn.Sequential(
-            ResidualBlock(dims[0], dims[0], dropout),
-            AttentionBlock(dims[0]),
-            ResidualBlock(dims[0], dims[0], dropout),
-        )
-
-        # upsample blocks
-        upsamples = []
-        for i, (in_dim, out_dim) in enumerate(zip(dims[:-1], dims[1:])):
-            t_up_flag = temperal_upsample[i] if i < len(
-                temperal_upsample) else False
-            upsamples.append(
-                Up_ResidualBlock(
-                    in_dim=in_dim,
-                    out_dim=out_dim,
-                    dropout=dropout,
-                    mult=num_res_blocks + 1,
-                    temperal_upsample=t_up_flag,
-                    up_flag=i != len(dim_mult) - 1,
-                ))
-        self.upsamples = nn.Sequential(*upsamples)
-
-        # output blocks
-        self.head = nn.Sequential(
-            RMS_norm(out_dim, images=False),
-            nn.SiLU(),
-            CausalConv3d(out_dim, 12, 3, padding=1),
-        )
-
-    def forward(self, x, feat_cache=None, feat_idx=[0], first_chunk=False):
-        if feat_cache is not None:
-            idx = feat_idx[0]
-            cache_x = x[:, :, -CACHE_T:, :, :].clone()
-            if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
-                cache_x = torch.cat(
-                    [
-                        feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                            cache_x.device),
-                        cache_x,
-                    ],
-                    dim=2,
-                )
-            x = self.conv1(x, feat_cache[idx])
-            feat_cache[idx] = cache_x
-            feat_idx[0] += 1
-        else:
-            x = self.conv1(x)
-
-        for layer in self.middle:
-            if isinstance(layer, ResidualBlock) and feat_cache is not None:
-                x = layer(x, feat_cache, feat_idx)
-            else:
-                x = layer(x)
-
-        ## upsamples
-        for layer in self.upsamples:
-            if feat_cache is not None:
-                x = layer(x, feat_cache, feat_idx, first_chunk)
-            else:
-                x = layer(x)
-
-        ## head
-        for layer in self.head:
-            if isinstance(layer, CausalConv3d) and feat_cache is not None:
-                idx = feat_idx[0]
-                cache_x = x[:, :, -CACHE_T:, :, :].clone()
-                if cache_x.shape[2] < 2 and feat_cache[idx] is not None:
-                    cache_x = torch.cat(
-                        [
-                            feat_cache[idx][:, :, -1, :, :].unsqueeze(2).to(
-                                cache_x.device),
-                            cache_x,
-                        ],
-                        dim=2,
-                    )
-                x = layer(x, feat_cache[idx])
-                feat_cache[idx] = cache_x
-                feat_idx[0] += 1
-            else:
-                x = layer(x)
-        return x
-
-
-def count_conv3d(model):
-    count = 0
-    for m in model.modules():
-        if isinstance(m, CausalConv3d):
-            count += 1
-    return count
-
-
-class WanVAE(nn.Module):
-
-    def __init__(
-        self,
-        dim=160,
-        dec_dim=256,
-        z_dim=16,
-        dim_mult=[1, 2, 4, 4],
-        num_res_blocks=2,
-        attn_scales=[],
-        temperal_downsample=[True, True, False],
-        dropout=0.0,
-    ):
-        super().__init__()
-        self.dim = dim
-        self.z_dim = z_dim
-        self.dim_mult = dim_mult
-        self.num_res_blocks = num_res_blocks
-        self.attn_scales = attn_scales
-        self.temperal_downsample = temperal_downsample
-        self.temperal_upsample = temperal_downsample[::-1]
-
-        # modules
-        self.encoder = Encoder3d(
-            dim,
-            z_dim * 2,
-            dim_mult,
-            num_res_blocks,
-            attn_scales,
-            self.temperal_downsample,
-            dropout,
-        )
-        self.conv1 = CausalConv3d(z_dim * 2, z_dim * 2, 1)
-        self.conv2 = CausalConv3d(z_dim, z_dim, 1)
-        self.decoder = Decoder3d(
-            dec_dim,
-            z_dim,
-            dim_mult,
-            num_res_blocks,
-            attn_scales,
-            self.temperal_upsample,
-            dropout,
-        )
-
-    def encode(self, x):
-        self.clear_cache()
-        x = patchify(x, patch_size=2)
-        t = x.shape[2]
-        iter_ = 1 + (t - 1) // 4
-        for i in range(iter_):
-            self._enc_conv_idx = [0]
-            if i == 0:
-                out = self.encoder(
-                    x[:, :, :1, :, :],
-                    feat_cache=self._enc_feat_map,
-                    feat_idx=self._enc_conv_idx,
-                )
-            else:
-                out_ = self.encoder(
-                    x[:, :, 1 + 4 * (i - 1):1 + 4 * i, :, :],
-                    feat_cache=self._enc_feat_map,
-                    feat_idx=self._enc_conv_idx,
-                )
-                out = torch.cat([out, out_], 2)
-        mu, log_var = self.conv1(out).chunk(2, dim=1)
-        self.clear_cache()
-        return mu
-
-    def decode(self, z):
-        self.clear_cache()
-        iter_ = z.shape[2]
-        x = self.conv2(z)
-        for i in range(iter_):
-            self._conv_idx = [0]
-            if i == 0:
-                out = self.decoder(
-                    x[:, :, i:i + 1, :, :],
-                    feat_cache=self._feat_map,
-                    feat_idx=self._conv_idx,
-                    first_chunk=True,
-                )
-            else:
-                out_ = self.decoder(
-                    x[:, :, i:i + 1, :, :],
-                    feat_cache=self._feat_map,
-                    feat_idx=self._conv_idx,
-                )
-                out = torch.cat([out, out_], 2)
-        out = unpatchify(out, patch_size=2)
-        self.clear_cache()
-        return out
-
-    def reparameterize(self, mu, log_var):
-        std = torch.exp(0.5 * log_var)
-        eps = torch.randn_like(std)
-        return eps * std + mu
-
-    def sample(self, imgs, deterministic=False):
-        mu, log_var = self.encode(imgs)
-        if deterministic:
-            return mu
-        std = torch.exp(0.5 * log_var.clamp(-30.0, 20.0))
-        return mu + std * torch.randn_like(std)
-
-    def clear_cache(self):
-        self._conv_num = count_conv3d(self.decoder)
-        self._conv_idx = [0]
-        self._feat_map = [None] * self._conv_num
-        # cache encode
-        self._enc_conv_num = count_conv3d(self.encoder)
-        self._enc_conv_idx = [0]
-        self._enc_feat_map = [None] * self._enc_conv_num
--- a/comfy/lora.py
+++ b/comfy/lora.py
@@ -260,10 +260,6 @@ def model_lora_keys_unet(model, key_map={}):
                key_map["transformer.{}".format(k[:-len(".weight")])] = to #simpletrainer and probably regular diffusers flux lora format
                key_map["lycoris_{}".format(k[:-len(".weight")].replace(".", "_"))] = to #simpletrainer lycoris
                key_map["lora_transformer_{}".format(k[:-len(".weight")].replace(".", "_"))] = to #onetrainer
-        for k in sdk:
-            hidden_size = model.model_config.unet_config.get("hidden_size", 0)
-            if k.endswith(".weight") and ".linear1." in k:
-                key_map["{}".format(k.replace(".linear1.weight", ".linear1_qkv"))] = (k, (0, 0, hidden_size * 3))

    if isinstance(model, comfy.model_base.GenmoMochi):
        for k in sdk:
@@ -297,22 +293,6 @@ def model_lora_keys_unet(model, key_map={}):
                key_lora = k[len("diffusion_model."):-len(".weight")]
                key_map["{}".format(key_lora)] = k

-    if isinstance(model, comfy.model_base.Omnigen2):
-        for k in sdk:
-            if k.startswith("diffusion_model.") and k.endswith(".weight"):
-                key_lora = k[len("diffusion_model."):-len(".weight")]
-                key_map["{}".format(key_lora)] = k
-
-    if isinstance(model, comfy.model_base.QwenImage):
-        for k in sdk:
-            if k.startswith("diffusion_model.") and k.endswith(".weight"): #QwenImage lora format
-                key_lora = k[len("diffusion_model."):-len(".weight")]
-                # Direct mapping for transformer_blocks format (QwenImage LoRA format)
-                key_map["{}".format(key_lora)] = k
-                # Support transformer prefix format
-                key_map["transformer.{}".format(key_lora)] = k
-                key_map["lycoris_{}".format(key_lora.replace(".", "_"))] = k #SimpleTuner lycoris format
-
    return key_map


--- a/comfy/lora_convert.py
+++ b/comfy/lora_convert.py
@@ -15,29 +15,10 @@ def convert_lora_bfl_control(sd): #BFL loras for Flux
 def convert_lora_wan_fun(sd): #Wan Fun loras
    return comfy.utils.state_dict_prefix_replace(sd, {"lora_unet__": "lora_unet_"})

-def convert_uso_lora(sd):
-    sd_out = {}
-    for k in sd:
-        tensor = sd[k]
-        k_to = "diffusion_model.{}".format(k.replace(".down.weight", ".lora_down.weight")
-                                           .replace(".up.weight", ".lora_up.weight")
-                                           .replace(".qkv_lora2.", ".txt_attn.qkv.")
-                                           .replace(".qkv_lora1.", ".img_attn.qkv.")
-                                           .replace(".proj_lora1.", ".img_attn.proj.")
-                                           .replace(".proj_lora2.", ".txt_attn.proj.")
-                                           .replace(".qkv_lora.", ".linear1_qkv.")
-                                           .replace(".proj_lora.", ".linear2.")
-                                           .replace(".processor.", ".")
-                                           )
-        sd_out[k_to] = tensor
-    return sd_out
-

 def convert_lora(sd):
    if "img_in.lora_A.weight" in sd and "single_blocks.0.norm.key_norm.scale" in sd:
        return convert_lora_bfl_control(sd)
    if "lora_unet__blocks_0_cross_attn_k.lora_down.weight" in sd:
        return convert_lora_wan_fun(sd)
-    if "single_blocks.37.processor.qkv_lora.up.weight" in sd and "double_blocks.18.processor.qkv_lora2.up.weight" in sd:
-        return convert_uso_lora(sd)
    return sd
--- a/comfy/model_base.py
+++ b/comfy/model_base.py
@@ -16,8 +16,6 @@
    along with this program.  If not, see <https://www.gnu.org/licenses/>.
 """

-import comfy.ldm.hunyuan3dv2_1
-import comfy.ldm.hunyuan3dv2_1.hunyuandit
 import torch
 import logging
 from comfy.ldm.modules.diffusionmodules.openaimodel import UNetModel, Timestep
@@ -39,14 +37,10 @@ import comfy.ldm.cosmos.model
 import comfy.ldm.cosmos.predict2
 import comfy.ldm.lumina.model
 import comfy.ldm.wan.model
-import comfy.ldm.wan.model_animate
 import comfy.ldm.hunyuan3d.model
 import comfy.ldm.hidream.model
 import comfy.ldm.chroma.model
-import comfy.ldm.chroma_radiance.model
 import comfy.ldm.ace.model
-import comfy.ldm.omnigen.omnigen2
-import comfy.ldm.qwen_image.model

 import comfy.model_management
 import comfy.patcher_extension
@@ -111,12 +105,10 @@ def model_sampling(model_config, model_type):
    return ModelSampling(model_config)


-def convert_tensor(extra, dtype, device):
+def convert_tensor(extra, dtype):
    if hasattr(extra, "dtype"):
        if extra.dtype != torch.int and extra.dtype != torch.long:
-            extra = comfy.model_management.cast_to_device(extra, device, dtype)
-        else:
-            extra = comfy.model_management.cast_to_device(extra, device, None)
+            extra = extra.to(dtype)
    return extra


@@ -154,7 +146,6 @@ class BaseModel(torch.nn.Module):
        logging.debug("adm {}".format(self.adm_channels))
        self.memory_usage_factor = model_config.memory_usage_factor
        self.memory_usage_factor_conds = ()
-        self.memory_usage_shape_process = {}

    def apply_model(self, x, t, c_concat=None, c_crossattn=None, control=None, transformer_options={}, **kwargs):
        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
@@ -168,7 +159,7 @@ class BaseModel(torch.nn.Module):
        xc = self.model_sampling.calculate_input(sigma, x)

        if c_concat is not None:
-            xc = torch.cat([xc] + [comfy.model_management.cast_to_device(c_concat, xc.device, xc.dtype)], dim=1)
+            xc = torch.cat([xc] + [c_concat], dim=1)

        context = c_crossattn
        dtype = self.get_dtype()
@@ -177,21 +168,20 @@ class BaseModel(torch.nn.Module):
            dtype = self.manual_cast_dtype

        xc = xc.to(dtype)
-        device = xc.device
        t = self.model_sampling.timestep(t).float()
        if context is not None:
-            context = comfy.model_management.cast_to_device(context, device, dtype)
+            context = context.to(dtype)

        extra_conds = {}
        for o in kwargs:
            extra = kwargs[o]

            if hasattr(extra, "dtype"):
-                extra = convert_tensor(extra, dtype, device)
+                extra = convert_tensor(extra, dtype)
            elif isinstance(extra, list):
                ex = []
                for ext in extra:
-                    ex.append(convert_tensor(ext, dtype, device))
+                    ex.append(convert_tensor(ext, dtype))
                extra = ex
            extra_conds[o] = extra

@@ -355,15 +345,8 @@ class BaseModel(torch.nn.Module):
        input_shapes = [input_shape]
        for c in self.memory_usage_factor_conds:
            shape = cond_shapes.get(c, None)
-            if shape is not None:
-                if c in self.memory_usage_shape_process:
-                    out = []
-                    for s in shape:
-                        out.append(self.memory_usage_shape_process[c](s))
-                    shape = out
-
-                if len(shape) > 0:
-                    input_shapes += shape
+            if shape is not None and len(shape) > 0:
+                input_shapes += shape

        if comfy.model_management.xformers_enabled() or comfy.model_management.pytorch_attention_flash_attention():
            dtype = self.get_dtype()
@@ -414,7 +397,7 @@ class SD21UNCLIP(BaseModel):
        unclip_conditioning = kwargs.get("unclip_conditioning", None)
        device = kwargs["device"]
        if unclip_conditioning is None:
-            return torch.zeros((1, self.adm_channels), device=device)
+            return torch.zeros((1, self.adm_channels))
        else:
            return unclip_adm(unclip_conditioning, device, self.noise_augmentor, kwargs.get("unclip_noise_augment_merge", 0.05), kwargs.get("seed", 0) - 10)

@@ -628,11 +611,9 @@ class IP2P:

        if image is None:
            image = torch.zeros_like(noise)
-        else:
-            image = image.to(device=device)

        if image.shape[1:] != noise.shape[1:]:
-            image = utils.common_upscale(image, noise.shape[-1], noise.shape[-2], "bilinear", "center")
+            image = utils.common_upscale(image.to(device), noise.shape[-1], noise.shape[-2], "bilinear", "center")

        image = utils.resize_to_batch_size(image, noise.shape[0])
        return self.process_ip2p_image_in(image)
@@ -711,7 +692,7 @@ class StableCascade_B(BaseModel):
        #size of prior doesn't really matter if zeros because it gets resized but I still want it to get batched
        prior = kwargs.get("stable_cascade_prior", torch.zeros((1, 16, (noise.shape[2] * 4) // 42, (noise.shape[3] * 4) // 42), dtype=noise.dtype, layout=noise.layout, device=noise.device))

-        out["effnet"] = comfy.conds.CONDRegular(prior.to(device=noise.device))
+        out["effnet"] = comfy.conds.CONDRegular(prior)
        out["sca"] = comfy.conds.CONDRegular(torch.zeros((1,)))
        return out

@@ -834,7 +815,6 @@ class PixArt(BaseModel):
 class Flux(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLUX, device=None, unet_model=comfy.ldm.flux.model.Flux):
        super().__init__(model_config, model_type, device=device, unet_model=unet_model)
-        self.memory_usage_factor_conds = ("ref_latents",)

    def concat_cond(self, **kwargs):
        try:
@@ -895,27 +875,8 @@ class Flux(BaseModel):
        guidance = kwargs.get("guidance", 3.5)
        if guidance is not None:
            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
-
-        ref_latents = kwargs.get("reference_latents", None)
-        if ref_latents is not None:
-            latents = []
-            for lat in ref_latents:
-                latents.append(self.process_latent_in(lat))
-            out['ref_latents'] = comfy.conds.CONDList(latents)
-
-            ref_latents_method = kwargs.get("reference_latents_method", None)
-            if ref_latents_method is not None:
-                out['ref_latents_method'] = comfy.conds.CONDConstant(ref_latents_method)
        return out

-    def extra_conds_shapes(self, **kwargs):
-        out = {}
-        ref_latents = kwargs.get("reference_latents", None)
-        if ref_latents is not None:
-            out['ref_latents'] = list([1, 16, sum(map(lambda a: math.prod(a.size()), ref_latents)) // 16])
-        return out
-
-
 class GenmoMochi(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.genmo.joint_model.asymm_models_joint.AsymmDiTJoint)
@@ -1053,32 +1014,9 @@ class CosmosPredict2(BaseModel):
        if cross_attn is not None:
            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)

-        denoise_mask = kwargs.get("concat_mask", kwargs.get("denoise_mask", None))
-        if denoise_mask is not None:
-            out["denoise_mask"] = comfy.conds.CONDRegular(denoise_mask)
-
        out['fps'] = comfy.conds.CONDConstant(kwargs.get("frame_rate", None))
        return out

-    def process_timestep(self, timestep, x, denoise_mask=None, **kwargs):
-        if denoise_mask is None:
-            return timestep
-        if denoise_mask.ndim <= 4:
-            return timestep
-        condition_video_mask_B_1_T_1_1 = denoise_mask.mean(dim=[1, 3, 4], keepdim=True)
-        c_noise_B_1_T_1_1 = 0.0 * (1.0 - condition_video_mask_B_1_T_1_1) + timestep.reshape(timestep.shape[0], 1, 1, 1, 1) * condition_video_mask_B_1_T_1_1
-        out = c_noise_B_1_T_1_1.squeeze(dim=[1, 3, 4])
-        return out
-
-    def scale_latent_inpaint(self, sigma, noise, latent_image, **kwargs):
-        sigma = sigma.reshape([sigma.shape[0]] + [1] * (len(noise.shape) - 1))
-        sigma_noise_augmentation = 0 #TODO
-        if sigma_noise_augmentation != 0:
-            latent_image = latent_image + noise
-        latent_image = self.model_sampling.calculate_input(torch.tensor([sigma_noise_augmentation], device=latent_image.device, dtype=latent_image.dtype), latent_image)
-        sigma = (sigma / (sigma + 1))
-        return latent_image / (1.0 - sigma)
-
 class Lumina2(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.lumina.model.NextDiT)
@@ -1114,15 +1052,13 @@ class WAN21(BaseModel):
            shape_image[1] = extra_channels
            image = torch.zeros(shape_image, dtype=noise.dtype, layout=noise.layout, device=noise.device)
        else:
-            latent_dim = self.latent_format.latent_channels
            image = utils.common_upscale(image.to(device), noise.shape[-1], noise.shape[-2], "bilinear", "center")
-            for i in range(0, image.shape[1], latent_dim):
-                image[:, i: i + latent_dim] = self.process_latent_in(image[:, i: i + latent_dim])
+            for i in range(0, image.shape[1], 16):
+                image[:, i: i + 16] = self.process_latent_in(image[:, i: i + 16])
            image = utils.resize_to_batch_size(image, noise.shape[0])

-        if extra_channels != image.shape[1] + 4:
-            if not self.image_to_video or extra_channels == image.shape[1]:
-                return image
+        if not self.image_to_video or extra_channels == image.shape[1]:
+            return image

        if image.shape[1] > (extra_channels - 4):
            image = image[:, :(extra_channels - 4)]
@@ -1141,11 +1077,7 @@ class WAN21(BaseModel):
                mask = mask.repeat(1, 4, 1, 1, 1)
            mask = utils.resize_to_batch_size(mask, noise.shape[0])

-        concat_mask_index = kwargs.get("concat_mask_index", 0)
-        if concat_mask_index != 0:
-            return torch.cat((image[:, :concat_mask_index], mask, image[:, concat_mask_index:]), dim=1)
-        else:
-            return torch.cat((mask, image), dim=1)
+        return torch.cat((mask, image), dim=1)

    def extra_conds(self, **kwargs):
        out = super().extra_conds(**kwargs)
@@ -1161,10 +1093,6 @@ class WAN21(BaseModel):
        if time_dim_concat is not None:
            out['time_dim_concat'] = comfy.conds.CONDRegular(self.process_latent_in(time_dim_concat))

-        reference_latents = kwargs.get("reference_latents", None)
-        if reference_latents is not None:
-            out['reference_latent'] = comfy.conds.CONDRegular(self.process_latent_in(reference_latents[-1])[:, :, 0])
-
        return out


@@ -1189,10 +1117,10 @@ class WAN21_Vace(WAN21):

        vace_frames_out = []
        for j in range(len(vace_frames)):
-            vf = vace_frames[j].to(device=noise.device, dtype=noise.dtype, copy=True)
+            vf = vace_frames[j].clone()
            for i in range(0, vf.shape[1], 16):
                vf[:, i:i + 16] = self.process_latent_in(vf[:, i:i + 16])
-            vf = torch.cat([vf, mask[j].to(device=noise.device, dtype=noise.dtype)], dim=1)
+            vf = torch.cat([vf, mask[j]], dim=1)
            vace_frames_out.append(vf)

        vace_frames = torch.stack(vace_frames_out, dim=1)
@@ -1214,120 +1142,6 @@ class WAN21_Camera(WAN21):
            out['camera_conditions'] = comfy.conds.CONDRegular(camera_conditions)
        return out

-class WAN21_HuMo(WAN21):
-    def __init__(self, model_config, model_type=ModelType.FLOW, image_to_video=False, device=None):
-        super(WAN21, self).__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model.HumoWanModel)
-        self.image_to_video = image_to_video
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        noise = kwargs.get("noise", None)
-
-        audio_embed = kwargs.get("audio_embed", None)
-        if audio_embed is not None:
-            out['audio_embed'] = comfy.conds.CONDRegular(audio_embed)
-
-        if "c_concat" not in out:  # 1.7B model
-            reference_latents = kwargs.get("reference_latents", None)
-            if reference_latents is not None:
-                out['reference_latent'] = comfy.conds.CONDRegular(self.process_latent_in(reference_latents[-1]))
-        else:
-            noise_shape = list(noise.shape)
-            noise_shape[1] += 4
-            concat_latent = torch.zeros(noise_shape, device=noise.device, dtype=noise.dtype)
-            zero_vae_values_first = torch.tensor([0.8660, -0.4326, -0.0017, -0.4884, -0.5283, 0.9207, -0.9896, 0.4433, -0.5543, -0.0113, 0.5753, -0.6000, -0.8346, -0.3497, -0.1926, -0.6938]).view(1, 16, 1, 1, 1)
-            zero_vae_values_second = torch.tensor([1.0869, -1.2370, 0.0206, -0.4357, -0.6411, 2.0307, -1.5972, 1.2659, -0.8595, -0.4654, 0.9638, -1.6330, -1.4310, -0.1098, -0.3856, -1.4583]).view(1, 16, 1, 1, 1)
-            zero_vae_values = torch.tensor([0.8642, -1.8583, 0.1577, 0.1350, -0.3641, 2.5863, -1.9670, 1.6065, -1.0475, -0.8678, 1.1734, -1.8138, -1.5933, -0.7721, -0.3289, -1.3745]).view(1, 16, 1, 1, 1)
-            concat_latent[:, 4:] = zero_vae_values
-            concat_latent[:, 4:, :1] = zero_vae_values_first
-            concat_latent[:, 4:, 1:2] = zero_vae_values_second
-            out['c_concat'] = comfy.conds.CONDNoiseShape(concat_latent)
-            reference_latents = kwargs.get("reference_latents", None)
-            if reference_latents is not None:
-                ref_latent = self.process_latent_in(reference_latents[-1])
-                ref_latent_shape = list(ref_latent.shape)
-                ref_latent_shape[1] += 4 + ref_latent_shape[1]
-                ref_latent_full = torch.zeros(ref_latent_shape, device=ref_latent.device, dtype=ref_latent.dtype)
-                ref_latent_full[:, 20:] = ref_latent
-                ref_latent_full[:, 16:20] = 1.0
-                out['reference_latent'] = comfy.conds.CONDRegular(ref_latent_full)
-
-        return out
-
-class WAN22_Animate(WAN21):
-    def __init__(self, model_config, model_type=ModelType.FLOW, image_to_video=False, device=None):
-        super(WAN21, self).__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model_animate.AnimateWanModel)
-        self.image_to_video = image_to_video
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-
-        face_video_pixels = kwargs.get("face_video_pixels", None)
-        if face_video_pixels is not None:
-            out['face_pixel_values'] = comfy.conds.CONDRegular(face_video_pixels)
-
-        pose_latents = kwargs.get("pose_video_latent", None)
-        if pose_latents is not None:
-            out['pose_latents'] = comfy.conds.CONDRegular(self.process_latent_in(pose_latents))
-        return out
-
-class WAN22_S2V(WAN21):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super(WAN21, self).__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model.WanModel_S2V)
-        self.memory_usage_factor_conds = ("reference_latent", "reference_motion")
-        self.memory_usage_shape_process = {"reference_motion": lambda shape: [shape[0], shape[1], 1.5, shape[-2], shape[-1]]}
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        audio_embed = kwargs.get("audio_embed", None)
-        if audio_embed is not None:
-            out['audio_embed'] = comfy.conds.CONDRegular(audio_embed)
-
-        reference_latents = kwargs.get("reference_latents", None)
-        if reference_latents is not None:
-            out['reference_latent'] = comfy.conds.CONDRegular(self.process_latent_in(reference_latents[-1]))
-
-        reference_motion = kwargs.get("reference_motion", None)
-        if reference_motion is not None:
-            out['reference_motion'] = comfy.conds.CONDRegular(self.process_latent_in(reference_motion))
-
-        control_video = kwargs.get("control_video", None)
-        if control_video is not None:
-            out['control_video'] = comfy.conds.CONDRegular(self.process_latent_in(control_video))
-        return out
-
-    def extra_conds_shapes(self, **kwargs):
-        out = {}
-        ref_latents = kwargs.get("reference_latents", None)
-        if ref_latents is not None:
-            out['reference_latent'] = list([1, 16, sum(map(lambda a: math.prod(a.size()), ref_latents)) // 16])
-
-        reference_motion = kwargs.get("reference_motion", None)
-        if reference_motion is not None:
-            out['reference_motion'] = reference_motion.shape
-        return out
-
-class WAN22(WAN21):
-    def __init__(self, model_config, model_type=ModelType.FLOW, image_to_video=False, device=None):
-        super(WAN21, self).__init__(model_config, model_type, device=device, unet_model=comfy.ldm.wan.model.WanModel)
-        self.image_to_video = image_to_video
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        denoise_mask = kwargs.get("denoise_mask", None)
-        if denoise_mask is not None:
-            out["denoise_mask"] = comfy.conds.CONDRegular(denoise_mask)
-        return out
-
-    def process_timestep(self, timestep, x, denoise_mask=None, **kwargs):
-        if denoise_mask is None:
-            return timestep
-        temp_ts = (torch.mean(denoise_mask[:, :, :, :, :], dim=(1, 3, 4), keepdim=True) * timestep.view([timestep.shape[0]] + [1] * (denoise_mask.ndim - 1))).reshape(timestep.shape[0], -1)
-        return temp_ts
-
-    def scale_latent_inpaint(self, sigma, noise, latent_image, **kwargs):
-        return latent_image
-
 class Hunyuan3Dv2(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.hunyuan3d.model.Hunyuan3Dv2)
@@ -1343,21 +1157,6 @@ class Hunyuan3Dv2(BaseModel):
            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
        return out

-class Hunyuan3Dv2_1(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.hunyuan3dv2_1.hunyuandit.HunYuanDiTPlain)
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        cross_attn = kwargs.get("cross_attn", None)
-        if cross_attn is not None:
-            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-
-        guidance = kwargs.get("guidance", 5.0)
-        if guidance is not None:
-            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
-        return out
-
 class HiDream(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.hidream.model.HiDreamImageTransformer2DModel)
@@ -1379,8 +1178,8 @@ class HiDream(BaseModel):
        return out

 class Chroma(Flux):
-    def __init__(self, model_config, model_type=ModelType.FLUX, device=None, unet_model=comfy.ldm.chroma.model.Chroma):
-        super().__init__(model_config, model_type, device=device, unet_model=unet_model)
+    def __init__(self, model_config, model_type=ModelType.FLUX, device=None):
+        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.chroma.model.Chroma)

    def extra_conds(self, **kwargs):
        out = super().extra_conds(**kwargs)
@@ -1390,10 +1189,6 @@ class Chroma(Flux):
            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
        return out

-class ChromaRadiance(Chroma):
-    def __init__(self, model_config, model_type=ModelType.FLUX, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.chroma_radiance.model.ChromaRadiance)
-
 class ACEStep(BaseModel):
    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.ace.model.ACEStepTransformer2DModel)
@@ -1412,114 +1207,3 @@ class ACEStep(BaseModel):
        out['speaker_embeds'] = comfy.conds.CONDRegular(torch.zeros(noise.shape[0], 512, device=noise.device, dtype=noise.dtype))
        out['lyrics_strength'] = comfy.conds.CONDConstant(kwargs.get("lyrics_strength", 1.0))
        return out
-
-class Omnigen2(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.omnigen.omnigen2.OmniGen2Transformer2DModel)
-        self.memory_usage_factor_conds = ("ref_latents",)
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        attention_mask = kwargs.get("attention_mask", None)
-        if attention_mask is not None:
-            if torch.numel(attention_mask) != attention_mask.sum():
-                out['attention_mask'] = comfy.conds.CONDRegular(attention_mask)
-            out['num_tokens'] = comfy.conds.CONDConstant(max(1, torch.sum(attention_mask).item()))
-        cross_attn = kwargs.get("cross_attn", None)
-        if cross_attn is not None:
-            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-        ref_latents = kwargs.get("reference_latents", None)
-        if ref_latents is not None:
-            latents = []
-            for lat in ref_latents:
-                latents.append(self.process_latent_in(lat))
-            out['ref_latents'] = comfy.conds.CONDList(latents)
-        return out
-
-    def extra_conds_shapes(self, **kwargs):
-        out = {}
-        ref_latents = kwargs.get("reference_latents", None)
-        if ref_latents is not None:
-            out['ref_latents'] = list([1, 16, sum(map(lambda a: math.prod(a.size()), ref_latents)) // 16])
-        return out
-
-class QwenImage(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.FLUX, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.qwen_image.model.QwenImageTransformer2DModel)
-        self.memory_usage_factor_conds = ("ref_latents",)
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        cross_attn = kwargs.get("cross_attn", None)
-        if cross_attn is not None:
-            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-        ref_latents = kwargs.get("reference_latents", None)
-        if ref_latents is not None:
-            latents = []
-            for lat in ref_latents:
-                latents.append(self.process_latent_in(lat))
-            out['ref_latents'] = comfy.conds.CONDList(latents)
-
-            ref_latents_method = kwargs.get("reference_latents_method", None)
-            if ref_latents_method is not None:
-                out['ref_latents_method'] = comfy.conds.CONDConstant(ref_latents_method)
-        return out
-
-    def extra_conds_shapes(self, **kwargs):
-        out = {}
-        ref_latents = kwargs.get("reference_latents", None)
-        if ref_latents is not None:
-            out['ref_latents'] = list([1, 16, sum(map(lambda a: math.prod(a.size()), ref_latents)) // 16])
-        return out
-
-class HunyuanImage21(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.hunyuan_video.model.HunyuanVideo)
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        attention_mask = kwargs.get("attention_mask", None)
-        if attention_mask is not None:
-            if torch.numel(attention_mask) != attention_mask.sum():
-                out['attention_mask'] = comfy.conds.CONDRegular(attention_mask)
-        cross_attn = kwargs.get("cross_attn", None)
-        if cross_attn is not None:
-            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-
-        conditioning_byt5small = kwargs.get("conditioning_byt5small", None)
-        if conditioning_byt5small is not None:
-            out['txt_byt5'] = comfy.conds.CONDRegular(conditioning_byt5small)
-
-        guidance = kwargs.get("guidance", 6.0)
-        if guidance is not None:
-            out['guidance'] = comfy.conds.CONDRegular(torch.FloatTensor([guidance]))
-
-        return out
-
-class HunyuanImage21Refiner(HunyuanImage21):
-    def concat_cond(self, **kwargs):
-        noise = kwargs.get("noise", None)
-        image = kwargs.get("concat_latent_image", None)
-        noise_augmentation = kwargs.get("noise_augmentation", 0.0)
-        device = kwargs["device"]
-
-        if image is None:
-            shape_image = list(noise.shape)
-            image = torch.zeros(shape_image, dtype=noise.dtype, layout=noise.layout, device=noise.device)
-        else:
-            image = utils.common_upscale(image.to(device), noise.shape[-1], noise.shape[-2], "bilinear", "center")
-            image = self.process_latent_in(image)
-            image = utils.resize_to_batch_size(image, noise.shape[0])
-            if noise_augmentation > 0:
-                generator = torch.Generator(device="cpu")
-                generator.manual_seed(kwargs.get("seed", 0) - 10)
-                noise = torch.randn(image.shape, generator=generator, dtype=image.dtype, device="cpu").to(image.device)
-                image = noise_augmentation * noise + min(1.0 - noise_augmentation, 0.75) * image
-            else:
-                image = 0.75 * image
-        return image
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        out['disable_time_r'] = comfy.conds.CONDConstant(True)
-        return out
--- a/comfy/model_detection.py
+++ b/comfy/model_detection.py
@@ -136,45 +136,25 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):

    if '{}txt_in.individual_token_refiner.blocks.0.norm1.weight'.format(key_prefix) in state_dict_keys: #Hunyuan Video
        dit_config = {}
-        in_w = state_dict['{}img_in.proj.weight'.format(key_prefix)]
-        out_w = state_dict['{}final_layer.linear.weight'.format(key_prefix)]
        dit_config["image_model"] = "hunyuan_video"
-        dit_config["in_channels"] = in_w.shape[1] #SkyReels img2video has 32 input channels
-        dit_config["patch_size"] = list(in_w.shape[2:])
-        dit_config["out_channels"] = out_w.shape[0] // math.prod(dit_config["patch_size"])
-        if any(s.startswith('{}vector_in.'.format(key_prefix)) for s in state_dict_keys):
-            dit_config["vec_in_dim"] = 768
-        else:
-            dit_config["vec_in_dim"] = None
-
-        if len(dit_config["patch_size"]) == 2:
-            dit_config["axes_dim"] = [64, 64]
-        else:
-            dit_config["axes_dim"] = [16, 56, 56]
-
-        if any(s.startswith('{}time_r_in.'.format(key_prefix)) for s in state_dict_keys):
-            dit_config["meanflow"] = True
-        else:
-            dit_config["meanflow"] = False
-
-        dit_config["context_in_dim"] = state_dict['{}txt_in.input_embedder.weight'.format(key_prefix)].shape[1]
-        dit_config["hidden_size"] = in_w.shape[0]
+        dit_config["in_channels"] = state_dict['{}img_in.proj.weight'.format(key_prefix)].shape[1] #SkyReels img2video has 32 input channels
+        dit_config["patch_size"] = [1, 2, 2]
+        dit_config["out_channels"] = 16
+        dit_config["vec_in_dim"] = 768
+        dit_config["context_in_dim"] = 4096
+        dit_config["hidden_size"] = 3072
        dit_config["mlp_ratio"] = 4.0
-        dit_config["num_heads"] = in_w.shape[0] // 128
+        dit_config["num_heads"] = 24
        dit_config["depth"] = count_blocks(state_dict_keys, '{}double_blocks.'.format(key_prefix) + '{}.')
        dit_config["depth_single_blocks"] = count_blocks(state_dict_keys, '{}single_blocks.'.format(key_prefix) + '{}.')
+        dit_config["axes_dim"] = [16, 56, 56]
        dit_config["theta"] = 256
        dit_config["qkv_bias"] = True
-        if '{}byt5_in.fc1.weight'.format(key_prefix) in state_dict:
-            dit_config["byt5"] = True
-        else:
-            dit_config["byt5"] = False
-
        guidance_keys = list(filter(lambda a: a.startswith("{}guidance_in.".format(key_prefix)), state_dict_keys))
        dit_config["guidance_embed"] = len(guidance_keys) > 0
        return dit_config

-    if '{}double_blocks.0.img_attn.norm.key_norm.scale'.format(key_prefix) in state_dict_keys and ('{}img_in.weight'.format(key_prefix) in state_dict_keys or f"{key_prefix}distilled_guidance_layer.norms.0.scale" in state_dict_keys): #Flux, Chroma or Chroma Radiance (has no img_in.weight)
+    if '{}double_blocks.0.img_attn.norm.key_norm.scale'.format(key_prefix) in state_dict_keys and '{}img_in.weight'.format(key_prefix) in state_dict_keys: #Flux
        dit_config = {}
        dit_config["image_model"] = "flux"
        dit_config["in_channels"] = 16
@@ -204,18 +184,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
            dit_config["out_dim"] = 3072
            dit_config["hidden_dim"] = 5120
            dit_config["n_layers"] = 5
-            if f"{key_prefix}nerf_blocks.0.norm.scale" in state_dict_keys: #Chroma Radiance
-                dit_config["image_model"] = "chroma_radiance"
-                dit_config["in_channels"] = 3
-                dit_config["out_channels"] = 3
-                dit_config["patch_size"] = 16
-                dit_config["nerf_hidden_size"] = 64
-                dit_config["nerf_mlp_ratio"] = 4
-                dit_config["nerf_depth"] = 4
-                dit_config["nerf_max_freqs"] = 8
-                dit_config["nerf_tile_size"] = 32
-                dit_config["nerf_final_head_type"] = "conv" if f"{key_prefix}nerf_final_layer_conv.norm.scale" in state_dict_keys else "linear"
-                dit_config["nerf_embedder_dtype"] = torch.float32
        else:
            dit_config["guidance_embed"] = "{}guidance_in.in_layer.weight".format(key_prefix) in state_dict_keys
        return dit_config
@@ -378,9 +346,7 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
        dit_config = {}
        dit_config["image_model"] = "wan2.1"
        dim = state_dict['{}head.modulation'.format(key_prefix)].shape[-1]
-        out_dim = state_dict['{}head.head.weight'.format(key_prefix)].shape[0] // 4
        dit_config["dim"] = dim
-        dit_config["out_dim"] = out_dim
        dit_config["num_heads"] = dim // 128
        dit_config["ffn_dim"] = state_dict['{}blocks.0.ffn.0.weight'.format(key_prefix)].shape[0]
        dit_config["num_layers"] = count_blocks(state_dict_keys, '{}blocks.'.format(key_prefix) + '{}.')
@@ -396,16 +362,7 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
            dit_config["vace_in_dim"] = state_dict['{}vace_patch_embedding.weight'.format(key_prefix)].shape[1]
            dit_config["vace_layers"] = count_blocks(state_dict_keys, '{}vace_blocks.'.format(key_prefix) + '{}.')
        elif '{}control_adapter.conv.weight'.format(key_prefix) in state_dict_keys:
-            if '{}img_emb.proj.0.bias'.format(key_prefix) in state_dict_keys:
-                dit_config["model_type"] = "camera"
-            else:
-                dit_config["model_type"] = "camera_2.2"
-        elif '{}casual_audio_encoder.encoder.final_linear.weight'.format(key_prefix) in state_dict_keys:
-            dit_config["model_type"] = "s2v"
-        elif '{}audio_proj.audio_proj_glob_1.layer.bias'.format(key_prefix) in state_dict_keys:
-            dit_config["model_type"] = "humo"
-        elif '{}face_adapter.fuser_blocks.0.k_norm.weight'.format(key_prefix) in state_dict_keys:
-            dit_config["model_type"] = "animate"
+            dit_config["model_type"] = "camera"
        else:
            if '{}img_emb.proj.0.bias'.format(key_prefix) in state_dict_keys:
                dit_config["model_type"] = "i2v"
@@ -414,11 +371,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
        flf_weight = state_dict.get('{}img_emb.emb_pos'.format(key_prefix))
        if flf_weight is not None:
            dit_config["flf_pos_embed_token_number"] = flf_weight.shape[1]
-
-        ref_conv_weight = state_dict.get('{}ref_conv.weight'.format(key_prefix))
-        if ref_conv_weight is not None:
-            dit_config["in_dim_ref_conv"] = ref_conv_weight.shape[1]
-
        return dit_config

    if '{}latent_in.weight'.format(key_prefix) in state_dict_keys:  # Hunyuan 3D
@@ -436,20 +388,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
        dit_config["guidance_embed"] = "{}guidance_in.in_layer.weight".format(key_prefix) in state_dict_keys
        return dit_config

-    if f"{key_prefix}t_embedder.mlp.2.weight" in state_dict_keys:  # Hunyuan 3D 2.1
-
-        dit_config = {}
-        dit_config["image_model"] = "hunyuan3d2_1"
-        dit_config["in_channels"] = state_dict[f"{key_prefix}x_embedder.weight"].shape[1]
-        dit_config["context_dim"] = 1024
-        dit_config["hidden_size"] = state_dict[f"{key_prefix}x_embedder.weight"].shape[0]
-        dit_config["mlp_ratio"] = 4.0
-        dit_config["num_heads"] = 16
-        dit_config["depth"] = count_blocks(state_dict_keys, f"{key_prefix}blocks.{{}}")
-        dit_config["qkv_bias"] = False
-        dit_config["guidance_cond_proj_dim"] = None#f"{key_prefix}t_embedder.cond_proj.weight" in state_dict_keys
-        return dit_config
-
    if '{}caption_projection.0.linear.weight'.format(key_prefix) in state_dict_keys:  # HiDream
        dit_config = {}
        dit_config["image_model"] = "hidream"
@@ -503,16 +441,11 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
            dit_config["rope_h_extrapolation_ratio"] = 4.0
            dit_config["rope_w_extrapolation_ratio"] = 4.0
            dit_config["rope_t_extrapolation_ratio"] = 1.0
-        elif dit_config["in_channels"] == 17: # img to video
-            if dit_config["model_channels"] == 2048:
-                dit_config["extra_per_block_abs_pos_emb"] = False
-                dit_config["rope_h_extrapolation_ratio"] = 3.0
-                dit_config["rope_w_extrapolation_ratio"] = 3.0
-                dit_config["rope_t_extrapolation_ratio"] = 1.0
-            elif dit_config["model_channels"] == 5120:
-                dit_config["rope_h_extrapolation_ratio"] = 2.0
-                dit_config["rope_w_extrapolation_ratio"] = 2.0
-                dit_config["rope_t_extrapolation_ratio"] = 0.8333333333333334
+        elif dit_config["in_channels"] == 17:
+            dit_config["extra_per_block_abs_pos_emb"] = False
+            dit_config["rope_h_extrapolation_ratio"] = 3.0
+            dit_config["rope_w_extrapolation_ratio"] = 3.0
+            dit_config["rope_t_extrapolation_ratio"] = 1.0

        dit_config["extra_h_extrapolation_ratio"] = 1.0
        dit_config["extra_w_extrapolation_ratio"] = 1.0
@@ -521,33 +454,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):

        return dit_config

-    if '{}time_caption_embed.timestep_embedder.linear_1.bias'.format(key_prefix) in state_dict_keys:  # Omnigen2
-        dit_config = {}
-        dit_config["image_model"] = "omnigen2"
-        dit_config["axes_dim_rope"] = [40, 40, 40]
-        dit_config["axes_lens"] = [1024, 1664, 1664]
-        dit_config["ffn_dim_multiplier"] = None
-        dit_config["hidden_size"] = 2520
-        dit_config["in_channels"] = 16
-        dit_config["multiple_of"] = 256
-        dit_config["norm_eps"] = 1e-05
-        dit_config["num_attention_heads"] = 21
-        dit_config["num_kv_heads"] = 7
-        dit_config["num_layers"] = 32
-        dit_config["num_refiner_layers"] = 2
-        dit_config["out_channels"] = None
-        dit_config["patch_size"] = 2
-        dit_config["text_feat_dim"] = 2048
-        dit_config["timestep_scale"] = 1000.0
-        return dit_config
-
-    if '{}txt_norm.weight'.format(key_prefix) in state_dict_keys:  # Qwen Image
-        dit_config = {}
-        dit_config["image_model"] = "qwen_image"
-        dit_config["in_channels"] = state_dict['{}img_in.weight'.format(key_prefix)].shape[1]
-        dit_config["num_layers"] = count_blocks(state_dict_keys, '{}transformer_blocks.'.format(key_prefix) + '{}.')
-        return dit_config
-
    if '{}input_blocks.0.0.weight'.format(key_prefix) not in state_dict_keys:
        return None

@@ -934,7 +840,7 @@ def convert_diffusers_mmdit(state_dict, output_prefix=""):
        depth_single_blocks = count_blocks(state_dict, 'single_transformer_blocks.{}.')
        hidden_size = state_dict["x_embedder.bias"].shape[0]
        sd_map = comfy.utils.flux_to_diffusers({"depth": depth, "depth_single_blocks": depth_single_blocks, "hidden_size": hidden_size}, output_prefix=output_prefix)
-    elif 'transformer_blocks.0.attn.add_q_proj.weight' in state_dict and 'pos_embed.proj.weight' in state_dict: #SD3
+    elif 'transformer_blocks.0.attn.add_q_proj.weight' in state_dict: #SD3
        num_blocks = count_blocks(state_dict, 'transformer_blocks.{}.')
        depth = state_dict["pos_embed.proj.weight"].shape[0] // 64
        sd_map = comfy.utils.mmdit_to_diffusers({"depth": depth, "num_blocks": num_blocks}, output_prefix=output_prefix)
--- a/comfy/model_management.py
+++ b/comfy/model_management.py
@@ -22,7 +22,6 @@ from enum import Enum
 from comfy.cli_args import args, PerformanceFeature
 import torch
 import sys
-import importlib
 import platform
 import weakref
 import gc
@@ -79,6 +78,7 @@ try:
    torch_version = torch.version.__version__
    temp = torch_version.split(".")
    torch_version_numeric = (int(temp[0]), int(temp[1]))
+    xpu_available = (torch_version_numeric[0] < 2 or (torch_version_numeric[0] == 2 and torch_version_numeric[1] <= 4)) and torch.xpu.is_available()
 except:
    pass

@@ -101,15 +101,11 @@ if args.directml is not None:
    lowvram_available = False #TODO: need to find a way to get free memory in directml before this can be enabled by default.

 try:
-    import intel_extension_for_pytorch as ipex  # noqa: F401
-except:
-    pass
-
-try:
+    import intel_extension_for_pytorch as ipex
    _ = torch.xpu.device_count()
-    xpu_available = torch.xpu.is_available()
+    xpu_available = xpu_available or torch.xpu.is_available()
 except:
-    xpu_available = False
+    xpu_available = xpu_available or (hasattr(torch, "xpu") and torch.xpu.is_available())

 try:
    if torch.backends.mps.is_available():
@@ -132,11 +128,6 @@ try:
 except:
    mlu_available = False

-try:
-    ixuca_available = hasattr(torch, "corex")
-except:
-    ixuca_available = False
-
 if args.cpu:
    cpu_state = CPUState.CPU

@@ -160,12 +151,6 @@ def is_mlu():
        return True
    return False

-def is_ixuca():
-    global ixuca_available
-    if ixuca_available:
-        return True
-    return False
-
 def get_torch_device():
    global directml_enabled
    global cpu_state
@@ -201,9 +186,8 @@ def get_total_memory(dev=None, torch_total_too=False):
        elif is_intel_xpu():
            stats = torch.xpu.memory_stats(dev)
            mem_reserved = stats['reserved_bytes.all.current']
-            mem_total_xpu = torch.xpu.get_device_properties(dev).total_memory
            mem_total_torch = mem_reserved
-            mem_total = mem_total_xpu
+            mem_total = torch.xpu.get_device_properties(dev).total_memory
        elif is_ascend_npu():
            stats = torch.npu.memory_stats(dev)
            mem_reserved = stats['reserved_bytes.all.current']
@@ -290,24 +274,6 @@ def is_amd():
            return True
    return False

-def amd_min_version(device=None, min_rdna_version=0):
-    if not is_amd():
-        return False
-
-    if is_device_cpu(device):
-        return False
-
-    arch = torch.cuda.get_device_properties(device).gcnArchName
-    if arch.startswith('gfx') and len(arch) == 7:
-        try:
-            cmp_rdna_version = int(arch[4]) + 2
-        except:
-            cmp_rdna_version = 0
-        if cmp_rdna_version >= min_rdna_version:
-            return True
-
-    return False
-
 MIN_WEIGHT_MEMORY_RATIO = 0.4
 if is_nvidia():
    MIN_WEIGHT_MEMORY_RATIO = 0.0
@@ -322,7 +288,7 @@ try:
        if torch_version_numeric[0] >= 2:
            if ENABLE_PYTORCH_ATTENTION == False and args.use_split_cross_attention == False and args.use_quad_cross_attention == False:
                ENABLE_PYTORCH_ATTENTION = True
-    if is_intel_xpu() or is_ascend_npu() or is_mlu() or is_ixuca():
+    if is_intel_xpu() or is_ascend_npu() or is_mlu():
        if args.use_split_cross_attention == False and args.use_quad_cross_attention == False:
            ENABLE_PYTORCH_ATTENTION = True
 except:
@@ -340,15 +306,11 @@ try:
        logging.info("AMD arch: {}".format(arch))
        logging.info("ROCm version: {}".format(rocm_version))
        if args.use_split_cross_attention == False and args.use_quad_cross_attention == False:
-            if importlib.util.find_spec('triton') is not None:  # AMD efficient attention implementation depends on triton. TODO: better way of detecting if it's compiled in or not.
-                if torch_version_numeric >= (2, 7):  # works on 2.6 but doesn't actually seem to improve much
-                    if any((a in arch) for a in ["gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1151"]):  # TODO: more arches, TODO: gfx950
-                        ENABLE_PYTORCH_ATTENTION = True
-#                if torch_version_numeric >= (2, 8):
-#                    if any((a in arch) for a in ["gfx1201"]):
-#                        ENABLE_PYTORCH_ATTENTION = True
+            if torch_version_numeric >= (2, 7):  # works on 2.6 but doesn't actually seem to improve much
+                if any((a in arch) for a in ["gfx90a", "gfx942", "gfx1100", "gfx1101", "gfx1151"]):  # TODO: more arches, TODO: gfx1201 and gfx950
+                    ENABLE_PYTORCH_ATTENTION = True
        if torch_version_numeric >= (2, 7) and rocm_version >= (6, 4):
-            if any((a in arch) for a in ["gfx1200", "gfx1201", "gfx942", "gfx950"]):  # TODO: more arches
+            if any((a in arch) for a in ["gfx1201", "gfx942", "gfx950"]):  # TODO: more arches
                SUPPORT_FP8_OPS = True

 except:
@@ -363,7 +325,7 @@ if ENABLE_PYTORCH_ATTENTION:

 PRIORITIZE_FP16 = False  # TODO: remove and replace with something that shows exactly which dtype is faster than the other
 try:
-    if (is_nvidia() or is_amd()) and PerformanceFeature.Fp16Accumulation in args.fast:
+    if is_nvidia() and PerformanceFeature.Fp16Accumulation in args.fast:
        torch.backends.cuda.matmul.allow_fp16_accumulation = True
        PRIORITIZE_FP16 = True  # TODO: limit to cards where it actually boosts performance
        logging.info("Enabled fp16 accumulation.")
@@ -415,8 +377,6 @@ def get_torch_device_name(device):
            except:
                allocator_backend = ""
            return "{} {} : {}".format(device, torch.cuda.get_device_name(device), allocator_backend)
-        elif device.type == "xpu":
-            return "{} {}".format(device, torch.xpu.get_device_name(device))
        else:
            return "{}".format(device.type)
    elif is_intel_xpu():
@@ -552,8 +512,6 @@ WINDOWS = any(platform.win32_ver())
 EXTRA_RESERVED_VRAM = 400 * 1024 * 1024
 if WINDOWS:
    EXTRA_RESERVED_VRAM = 600 * 1024 * 1024 #Windows is higher because of the shared vram issue
-    if total_vram > (15 * 1024):  # more extra reserved vram on 16GB+ cards
-        EXTRA_RESERVED_VRAM += 100 * 1024 * 1024

 if args.reserve_vram is not None:
    EXTRA_RESERVED_VRAM = args.reserve_vram * 1024 * 1024 * 1024
@@ -613,13 +571,7 @@ def load_models_gpu(models, memory_required=0, force_patch_weights=False, minimu
    else:
        minimum_memory_required = max(inference_memory, minimum_memory_required + extra_reserved_memory())

-    models_temp = set()
-    for m in models:
-        models_temp.add(m)
-        for mm in m.model_patches_models():
-            models_temp.add(mm)
-
-    models = models_temp
+    models = set(models)

    models_to_load = []

@@ -924,10 +876,7 @@ def vae_dtype(device=None, allowed_dtypes=[]):
            return d

        # NOTE: bfloat16 seems to work on AMD for the VAE but is extremely slow in some cases compared to fp32
-        # slowness still a problem on pytorch nightly 2.9.0.dev20250720+rocm6.4 tested on RDNA3
-        # also a problem on RDNA4 except fp32 is also slow there.
-        # This is due to large bf16 convolutions being extremely slow.
-        if d == torch.bfloat16 and ((not is_amd()) or amd_min_version(device, min_rdna_version=4)) and should_use_bf16(device):
+        if d == torch.bfloat16 and (not is_amd()) and should_use_bf16(device):
            return d

    return torch.float32
@@ -977,11 +926,9 @@ def pick_weight_dtype(dtype, fallback_dtype, device=None):
    return dtype

 def device_supports_non_blocking(device):
-    if args.force_non_blocking:
-        return True
    if is_device_mps(device):
        return False #pytorch bug? mps doesn't support non blocking
-    if is_intel_xpu(): #xpu does support non blocking but it is slower on iGPUs for some reason so disable by default until situation changes
+    if is_intel_xpu():
        return False
    if args.deterministic: #TODO: figure out why deterministic breaks non blocking from gpu to cpu (previews)
        return False
@@ -1021,8 +968,6 @@ def get_offload_stream(device):
        stream_counter = (stream_counter + 1) % len(ss)
        if is_device_cuda(device):
            ss[stream_counter].wait_stream(torch.cuda.current_stream())
-        elif is_device_xpu(device):
-            ss[stream_counter].wait_stream(torch.xpu.current_stream())
        stream_counters[device] = stream_counter
        return s
    elif is_device_cuda(device):
@@ -1034,15 +979,6 @@ def get_offload_stream(device):
        stream_counter = (stream_counter + 1) % len(ss)
        stream_counters[device] = stream_counter
        return s
-    elif is_device_xpu(device):
-        ss = []
-        for k in range(NUM_STREAMS):
-            ss.append(torch.xpu.Stream(device=device, priority=0))
-        STREAMS[device] = ss
-        s = ss[stream_counter]
-        stream_counter = (stream_counter + 1) % len(ss)
-        stream_counters[device] = stream_counter
-        return s
    return None

 def sync_stream(device, stream):
@@ -1050,8 +986,6 @@ def sync_stream(device, stream):
        return
    if is_device_cuda(device):
        torch.cuda.current_stream().wait_stream(stream)
-    elif is_device_xpu(device):
-        torch.xpu.current_stream().wait_stream(stream)

 def cast_to(weight, dtype=None, device=None, non_blocking=False, copy=False, stream=None):
    if device is None or weight.device == device:
@@ -1093,8 +1027,6 @@ def xformers_enabled():
        return False
    if is_mlu():
        return False
-    if is_ixuca():
-        return False
    if directml_enabled:
        return False
    return XFORMERS_IS_AVAILABLE
@@ -1130,8 +1062,6 @@ def pytorch_attention_flash_attention():
            return True
        if is_amd():
            return True #if you have pytorch attention enabled on AMD it probably supports at least mem efficient attention
-        if is_ixuca():
-            return True
    return False

 def force_upcast_attention_dtype():
@@ -1162,8 +1092,8 @@ def get_free_memory(dev=None, torch_free_too=False):
            stats = torch.xpu.memory_stats(dev)
            mem_active = stats['active_bytes.all.current']
            mem_reserved = stats['reserved_bytes.all.current']
-            mem_free_xpu = torch.xpu.get_device_properties(dev).total_memory - mem_reserved
            mem_free_torch = mem_reserved - mem_active
+            mem_free_xpu = torch.xpu.get_device_properties(dev).total_memory - mem_reserved
            mem_free_total = mem_free_xpu + mem_free_torch
        elif is_ascend_npu():
            stats = torch.npu.memory_stats(dev)
@@ -1212,9 +1142,6 @@ def is_device_cpu(device):
 def is_device_mps(device):
    return is_device_type(device, 'mps')

-def is_device_xpu(device):
-    return is_device_type(device, 'xpu')
-
 def is_device_cuda(device):
    return is_device_type(device, 'cuda')

@@ -1246,10 +1173,7 @@ def should_use_fp16(device=None, model_params=0, prioritize_performance=True, ma
        return False

    if is_intel_xpu():
-        if torch_version_numeric < (2, 3):
-            return True
-        else:
-            return torch.xpu.get_device_properties(device).has_fp16
+        return True

    if is_ascend_npu():
        return True
@@ -1257,9 +1181,6 @@ def should_use_fp16(device=None, model_params=0, prioritize_performance=True, ma
    if is_mlu():
        return True

-    if is_ixuca():
-        return True
-
    if torch.version.hip:
        return True

@@ -1315,15 +1236,9 @@ def should_use_bf16(device=None, model_params=0, prioritize_performance=True, ma
        return False

    if is_intel_xpu():
-        if torch_version_numeric < (2, 3):
-            return True
-        else:
-            return torch.xpu.is_bf16_supported()
-
-    if is_ascend_npu():
        return True

-    if is_ixuca():
+    if is_ascend_npu():
        return True

    if is_amd():
@@ -1375,13 +1290,6 @@ def supports_fp8_compute(device=None):

    return True

-def extended_fp16_support():
-    # TODO: check why some models work with fp16 on newer torch versions but not on older
-    if torch_version_numeric < (2, 7):
-        return False
-
-    return True
-
 def soft_empty_cache(force=False):
    global cpu_state
    if cpu_state == CPUState.MPS:
--- a/comfy/model_patcher.py
+++ b/comfy/model_patcher.py
@@ -379,9 +379,6 @@ class ModelPatcher:
    def set_model_sampler_pre_cfg_function(self, pre_cfg_function, disable_cfg1_optimization=False):
        self.model_options = set_model_options_pre_cfg_function(self.model_options, pre_cfg_function, disable_cfg1_optimization)

-    def set_model_sampler_calc_cond_batch_function(self, sampler_calc_cond_batch_function):
-        self.model_options["sampler_calc_cond_batch_function"] = sampler_calc_cond_batch_function
-
    def set_model_unet_function_wrapper(self, unet_wrapper_function: UnetWrapperFunction):
        self.model_options["model_function_wrapper"] = unet_wrapper_function

@@ -430,12 +427,6 @@ class ModelPatcher:
    def set_model_forward_timestep_embed_patch(self, patch):
        self.set_model_patch(patch, "forward_timestep_embed_patch")

-    def set_model_double_block_patch(self, patch):
-        self.set_model_patch(patch, "double_block")
-
-    def set_model_post_input_patch(self, patch):
-        self.set_model_patch(patch, "post_input")
-
    def add_object_patch(self, name, obj):
        self.object_patches[name] = obj

@@ -492,30 +483,6 @@ class ModelPatcher:
            if hasattr(wrap_func, "to"):
                self.model_options["model_function_wrapper"] = wrap_func.to(device)

-    def model_patches_models(self):
-        to = self.model_options["transformer_options"]
-        models = []
-        if "patches" in to:
-            patches = to["patches"]
-            for name in patches:
-                patch_list = patches[name]
-                for i in range(len(patch_list)):
-                    if hasattr(patch_list[i], "models"):
-                        models += patch_list[i].models()
-        if "patches_replace" in to:
-            patches = to["patches_replace"]
-            for name in patches:
-                patch_list = patches[name]
-                for k in patch_list:
-                    if hasattr(patch_list[k], "models"):
-                        models += patch_list[k].models()
-        if "model_function_wrapper" in self.model_options:
-            wrap_func = self.model_options["model_function_wrapper"]
-            if hasattr(wrap_func, "models"):
-                models += wrap_func.models()
-
-        return models
-
    def model_dtype(self):
        if hasattr(self.model, "get_dtype"):
            return self.model.get_dtype()
--- a/comfy/ops.py
+++ b/comfy/ops.py
@@ -24,37 +24,8 @@ import comfy.float
 import comfy.rmsnorm
 import contextlib

-
-def scaled_dot_product_attention(q, k, v, *args, **kwargs):
-    return torch.nn.functional.scaled_dot_product_attention(q, k, v, *args, **kwargs)
-
-
-try:
-    if torch.cuda.is_available():
-        from torch.nn.attention import SDPBackend, sdpa_kernel
-        import inspect
-        if "set_priority" in inspect.signature(sdpa_kernel).parameters:
-            SDPA_BACKEND_PRIORITY = [
-                SDPBackend.FLASH_ATTENTION,
-                SDPBackend.EFFICIENT_ATTENTION,
-                SDPBackend.MATH,
-            ]
-
-            SDPA_BACKEND_PRIORITY.insert(0, SDPBackend.CUDNN_ATTENTION)
-
-            def scaled_dot_product_attention(q, k, v, *args, **kwargs):
-                with sdpa_kernel(SDPA_BACKEND_PRIORITY, set_priority=True):
-                    return torch.nn.functional.scaled_dot_product_attention(q, k, v, *args, **kwargs)
-        else:
-            logging.warning("Torch version too old to set sdpa backend priority.")
-except (ModuleNotFoundError, TypeError):
-    logging.warning("Could not set sdpa backend priority.")
-
 cast_to = comfy.model_management.cast_to #TODO: remove once no more references

-if torch.cuda.is_available() and torch.backends.cudnn.is_available() and PerformanceFeature.AutoTune in args.fast:
-    torch.backends.cudnn.benchmark = True
-
 def cast_to_input(weight, input, non_blocking=False, copy=True):
    return comfy.model_management.cast_to(weight, input.dtype, input.device, non_blocking=non_blocking, copy=copy)

@@ -365,13 +336,9 @@ class fp8_ops(manual_cast):
            return None

        def forward_comfy_cast_weights(self, input):
-            if not self.training:
-                try:
-                    out = fp8_linear(self, input)
-                    if out is not None:
-                        return out
-                except Exception as e:
-                    logging.info("Exception during fp8 op: {}".format(e))
+            out = fp8_linear(self, input)
+            if out is not None:
+                return out

            weight, bias = cast_bias_weight(self, input)
            return torch.nn.functional.linear(input, weight, bias)
--- a/comfy/patcher_extension.py
+++ b/comfy/patcher_extension.py
@@ -50,7 +50,6 @@ class WrappersMP:
    OUTER_SAMPLE = "outer_sample"
    PREPARE_SAMPLING = "prepare_sampling"
    SAMPLER_SAMPLE = "sampler_sample"
-    PREDICT_NOISE = "predict_noise"
    CALC_COND_BATCH = "calc_cond_batch"
    APPLY_MODEL = "apply_model"
    DIFFUSION_MODEL = "diffusion_model"
--- a/comfy/pixel_space_convert.py
+++ b/comfy/pixel_space_convert.py
@@ -1,16 +0,0 @@
-import torch
-
-
-# "Fake" VAE that converts from IMAGE B, H, W, C and values on the scale of 0..1
-# to LATENT B, C, H, W and values on the scale of -1..1.
-class PixelspaceConversionVAE(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-        self.pixel_space_vae = torch.nn.Parameter(torch.tensor(1.0))
-
-    def encode(self, pixels: torch.Tensor, *_args, **_kwargs) -> torch.Tensor:
-        return pixels
-
-    def decode(self, samples: torch.Tensor, *_args, **_kwargs) -> torch.Tensor:
-        return samples
-
--- a/comfy/rmsnorm.py
+++ b/comfy/rmsnorm.py
@@ -1,7 +1,6 @@
 import torch
 import comfy.model_management
 import numbers
-import logging

 RMSNorm = None

@@ -10,7 +9,6 @@ try:
    RMSNorm = torch.nn.RMSNorm
 except:
    rms_norm_torch = None
-    logging.warning("Please update pytorch to use native RMSNorm")


 def rms_norm(x, weight=None, eps=1e-6):
--- a/comfy/sampler_helpers.py
+++ b/comfy/sampler_helpers.py
@@ -149,7 +149,7 @@ def cleanup_models(conds, models):

    cleanup_additional_models(set(control_cleanup))

-def prepare_model_patcher(model: ModelPatcher, conds, model_options: dict):
+def prepare_model_patcher(model: 'ModelPatcher', conds, model_options: dict):
    '''
    Registers hooks from conds.
    '''
@@ -158,8 +158,8 @@ def prepare_model_patcher(model: ModelPatcher, conds, model_options: dict):
    for k in conds:
        get_hooks_from_cond(conds[k], hooks)
    # add wrappers and callbacks from ModelPatcher to transformer_options
-    comfy.patcher_extension.merge_nested_dicts(model_options["transformer_options"].setdefault("wrappers", {}), model.wrappers, copy_dict1=False)
-    comfy.patcher_extension.merge_nested_dicts(model_options["transformer_options"].setdefault("callbacks", {}), model.callbacks, copy_dict1=False)
+    model_options["transformer_options"]["wrappers"] = comfy.patcher_extension.copy_nested_dicts(model.wrappers)
+    model_options["transformer_options"]["callbacks"] = comfy.patcher_extension.copy_nested_dicts(model.callbacks)
    # begin registering hooks
    registered = comfy.hooks.HookGroup()
    target_dict = comfy.hooks.create_target_dict(comfy.hooks.EnumWeightTarget.Model)
--- a/comfy/samplers.py
+++ b/comfy/samplers.py
@@ -16,8 +16,6 @@ import comfy.sampler_helpers
 import comfy.model_patcher
 import comfy.patcher_extension
 import comfy.hooks
-import comfy.context_windows
-import comfy.utils
 import scipy.stats
 import numpy

@@ -62,7 +60,7 @@ def get_area_and_mult(conds, x_in, timestep_in):
        if "mask_strength" in conds:
            mask_strength = conds["mask_strength"]
        mask = conds['mask']
-        # assert (mask.shape[1:] == x_in.shape[2:])
+        assert (mask.shape[1:] == x_in.shape[2:])

        mask = mask[:input_x.shape[0]]
        if area is not None:
@@ -70,7 +68,7 @@ def get_area_and_mult(conds, x_in, timestep_in):
                mask = mask.narrow(i + 1, area[len(dims) + i], area[i])

        mask = mask * mask_strength
-        mask = mask.unsqueeze(1).repeat((input_x.shape[0] // mask.shape[0], input_x.shape[1]) + (1, ) * (mask.ndim - 1))
+        mask = mask.unsqueeze(1).repeat(input_x.shape[0] // mask.shape[0], input_x.shape[1], 1, 1)
    else:
        mask = torch.ones_like(input_x)
    mult = mask * strength
@@ -91,7 +89,7 @@ def get_area_and_mult(conds, x_in, timestep_in):
    conditioning = {}
    model_conds = conds["model_conds"]
    for c in model_conds:
-        conditioning[c] = model_conds[c].process_cond(batch_size=x_in.shape[0], area=area)
+        conditioning[c] = model_conds[c].process_cond(batch_size=x_in.shape[0], device=x_in.device, area=area)

    hooks = conds.get('hooks', None)
    control = conds.get('control', None)
@@ -200,20 +198,14 @@ def finalize_default_conds(model: 'BaseModel', hooked_to_run: dict[comfy.hooks.H
            hooked_to_run.setdefault(p.hooks, list())
            hooked_to_run[p.hooks] += [(p, i)]

-def calc_cond_batch(model: BaseModel, conds: list[list[dict]], x_in: torch.Tensor, timestep, model_options: dict[str]):
-    handler: comfy.context_windows.ContextHandlerABC = model_options.get("context_handler", None)
-    if handler is None or not handler.should_use_context(model, conds, x_in, timestep, model_options):
-        return _calc_cond_batch_outer(model, conds, x_in, timestep, model_options)
-    return handler.execute(_calc_cond_batch_outer, model, conds, x_in, timestep, model_options)
-
-def _calc_cond_batch_outer(model: BaseModel, conds: list[list[dict]], x_in: torch.Tensor, timestep, model_options):
+def calc_cond_batch(model: 'BaseModel', conds: list[list[dict]], x_in: torch.Tensor, timestep, model_options):
    executor = comfy.patcher_extension.WrapperExecutor.new_executor(
        _calc_cond_batch,
        comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.CALC_COND_BATCH, model_options, is_model_options=True)
    )
    return executor.execute(model, conds, x_in, timestep, model_options)

-def _calc_cond_batch(model: BaseModel, conds: list[list[dict]], x_in: torch.Tensor, timestep, model_options):
+def _calc_cond_batch(model: 'BaseModel', conds: list[list[dict]], x_in: torch.Tensor, timestep, model_options):
    out_conds = []
    out_counts = []
    # separate conds by matching hooks
@@ -381,11 +373,7 @@ def sampling_function(model, x, timestep, uncond, cond, cond_scale, model_option
        uncond_ = uncond

    conds = [cond, uncond_]
-    if "sampler_calc_cond_batch_function" in model_options:
-        args = {"conds": conds, "input": x, "sigma": timestep, "model": model, "model_options": model_options}
-        out = model_options["sampler_calc_cond_batch_function"](args)
-    else:
-        out = calc_cond_batch(model, conds, x, timestep, model_options)
+    out = calc_cond_batch(model, conds, x, timestep, model_options)

    for fn in model_options.get("sampler_pre_cfg_function", []):
        args = {"conds":conds, "conds_out": out, "cond_scale": cond_scale, "timestep": timestep,
@@ -554,10 +542,7 @@ def resolve_areas_and_cond_masks_multidim(conditions, dims, device):
            if len(mask.shape) == len(dims):
                mask = mask.unsqueeze(0)
            if mask.shape[1:] != dims:
-                if mask.ndim < 4:
-                    mask = comfy.utils.common_upscale(mask.unsqueeze(1), dims[-1], dims[-2], 'bilinear', 'none').squeeze(1)
-                else:
-                    mask = comfy.utils.common_upscale(mask, dims[-1], dims[-2], 'bilinear', 'none')
+                mask = torch.nn.functional.interpolate(mask.unsqueeze(1), size=dims, mode='bilinear', align_corners=False).squeeze(1)

            if modified.get("set_area_to_bounds", False): #TODO: handle dim != 2
                bounds = torch.max(torch.abs(mask),dim=0).values.unsqueeze(0)
@@ -729,9 +714,9 @@ class Sampler:

 KSAMPLER_NAMES = ["euler", "euler_cfg_pp", "euler_ancestral", "euler_ancestral_cfg_pp", "heun", "heunpp2","dpm_2", "dpm_2_ancestral",
                  "lms", "dpm_fast", "dpm_adaptive", "dpmpp_2s_ancestral", "dpmpp_2s_ancestral_cfg_pp", "dpmpp_sde", "dpmpp_sde_gpu",
-                  "dpmpp_2m", "dpmpp_2m_cfg_pp", "dpmpp_2m_sde", "dpmpp_2m_sde_gpu", "dpmpp_2m_sde_heun", "dpmpp_2m_sde_heun_gpu", "dpmpp_3m_sde", "dpmpp_3m_sde_gpu", "ddpm", "lcm",
+                  "dpmpp_2m", "dpmpp_2m_cfg_pp", "dpmpp_2m_sde", "dpmpp_2m_sde_gpu", "dpmpp_3m_sde", "dpmpp_3m_sde_gpu", "ddpm", "lcm",
                  "ipndm", "ipndm_v", "deis", "res_multistep", "res_multistep_cfg_pp", "res_multistep_ancestral", "res_multistep_ancestral_cfg_pp",
-                  "gradient_estimation", "gradient_estimation_cfg_pp", "er_sde", "seeds_2", "seeds_3", "sa_solver", "sa_solver_pece"]
+                  "gradient_estimation", "gradient_estimation_cfg_pp", "er_sde", "seeds_2", "seeds_3"]

 class KSAMPLER(Sampler):
    def __init__(self, sampler_function, extra_options={}, inpaint_options={}):
@@ -957,14 +942,7 @@ class CFGGuider:
            self.original_conds[k] = comfy.sampler_helpers.convert_cond(conds[k])

    def __call__(self, *args, **kwargs):
-        return self.outer_predict_noise(*args, **kwargs)
-
-    def outer_predict_noise(self, x, timestep, model_options={}, seed=None):
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self.predict_noise,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.PREDICT_NOISE, self.model_options, is_model_options=True)
-        ).execute(x, timestep, model_options, seed)
+        return self.predict_noise(*args, **kwargs)

    def predict_noise(self, x, timestep, model_options={}, seed=None):
        return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
@@ -1061,13 +1039,13 @@ class SchedulerHandler(NamedTuple):
    use_ms: bool = True

 SCHEDULER_HANDLERS = {
-    "simple": SchedulerHandler(simple_scheduler),
-    "sgm_uniform": SchedulerHandler(partial(normal_scheduler, sgm=True)),
+    "normal": SchedulerHandler(normal_scheduler),
    "karras": SchedulerHandler(k_diffusion_sampling.get_sigmas_karras, use_ms=False),
    "exponential": SchedulerHandler(k_diffusion_sampling.get_sigmas_exponential, use_ms=False),
+    "sgm_uniform": SchedulerHandler(partial(normal_scheduler, sgm=True)),
+    "simple": SchedulerHandler(simple_scheduler),
    "ddim_uniform": SchedulerHandler(ddim_scheduler),
    "beta": SchedulerHandler(beta_scheduler),
-    "normal": SchedulerHandler(normal_scheduler),
    "linear_quadratic": SchedulerHandler(linear_quadratic_schedule),
    "kl_optimal": SchedulerHandler(kl_optimal_scheduler, use_ms=False),
 }
--- a/comfy/sd.py
+++ b/comfy/sd.py
@@ -14,14 +14,10 @@ import comfy.ldm.genmo.vae.model
 import comfy.ldm.lightricks.vae.causal_video_autoencoder
 import comfy.ldm.cosmos.vae
 import comfy.ldm.wan.vae
-import comfy.ldm.wan.vae2_2
 import comfy.ldm.hunyuan3d.vae
 import comfy.ldm.ace.vae.music_dcae_pipeline
-import comfy.ldm.hunyuan_video.vae
-import comfy.pixel_space_convert
 import yaml
 import math
-import os

 import comfy.utils

@@ -48,9 +44,6 @@ import comfy.text_encoders.lumina2
 import comfy.text_encoders.wan
 import comfy.text_encoders.hidream
 import comfy.text_encoders.ace
-import comfy.text_encoders.omnigen2
-import comfy.text_encoders.qwen_image
-import comfy.text_encoders.hunyuan_image

 import comfy.model_patcher
 import comfy.lora
@@ -286,7 +279,6 @@ class VAE:
        self.process_output = lambda image: torch.clamp((image + 1.0) / 2.0, min=0.0, max=1.0)
        self.working_dtypes = [torch.bfloat16, torch.float32]
        self.disable_offload = False
-        self.not_video = False

        self.downscale_index_formula = None
        self.upscale_index_formula = None
@@ -332,19 +324,6 @@ class VAE:
                self.first_stage_model = StageC_coder()
                self.downscale_ratio = 32
                self.latent_channels = 16
-            elif "decoder.conv_in.weight" in sd and sd['decoder.conv_in.weight'].shape[1] == 64:
-                ddconfig = {"block_out_channels": [128, 256, 512, 512, 1024, 1024], "in_channels": 3, "out_channels": 3, "num_res_blocks": 2, "ffactor_spatial": 32, "downsample_match_channel": True, "upsample_match_channel": True}
-                self.latent_channels = ddconfig['z_channels'] = sd["decoder.conv_in.weight"].shape[1]
-                self.downscale_ratio = 32
-                self.upscale_ratio = 32
-                self.working_dtypes = [torch.float16, torch.bfloat16, torch.float32]
-                self.first_stage_model = AutoencodingEngine(regularizer_config={'target': "comfy.ldm.models.autoencoder.DiagonalGaussianRegularizer"},
-                                                            encoder_config={'target': "comfy.ldm.hunyuan_video.vae.Encoder", 'params': ddconfig},
-                                                            decoder_config={'target': "comfy.ldm.hunyuan_video.vae.Decoder", 'params': ddconfig})
-
-                self.memory_used_encode = lambda shape, dtype: (700 * shape[2] * shape[3]) * model_management.dtype_size(dtype)
-                self.memory_used_decode = lambda shape, dtype: (700 * shape[2] * shape[3] * 32 * 32) * model_management.dtype_size(dtype)
-
            elif "decoder.conv_in.weight" in sd:
                #default SD1.x/SD2.x VAE parameters
                ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}
@@ -411,23 +390,6 @@ class VAE:
                self.downscale_ratio = (lambda a: max(0, math.floor((a + 7) / 8)), 32, 32)
                self.downscale_index_formula = (8, 32, 32)
                self.working_dtypes = [torch.bfloat16, torch.float32]
-            elif "decoder.conv_in.conv.weight" in sd and sd['decoder.conv_in.conv.weight'].shape[1] == 32:
-                ddconfig = {"block_out_channels": [128, 256, 512, 1024, 1024], "in_channels": 3, "out_channels": 3, "num_res_blocks": 2, "ffactor_spatial": 16, "ffactor_temporal": 4, "downsample_match_channel": True, "upsample_match_channel": True}
-                ddconfig['z_channels'] = sd["decoder.conv_in.conv.weight"].shape[1]
-                self.latent_channels = 64
-                self.upscale_ratio = (lambda a: max(0, a * 4 - 3), 16, 16)
-                self.upscale_index_formula = (4, 16, 16)
-                self.downscale_ratio = (lambda a: max(0, math.floor((a + 3) / 4)), 16, 16)
-                self.downscale_index_formula = (4, 16, 16)
-                self.latent_dim = 3
-                self.not_video = True
-                self.working_dtypes = [torch.float16, torch.bfloat16, torch.float32]
-                self.first_stage_model = AutoencodingEngine(regularizer_config={'target': "comfy.ldm.models.autoencoder.EmptyRegularizer"},
-                                                            encoder_config={'target': "comfy.ldm.hunyuan_video.vae_refiner.Encoder", 'params': ddconfig},
-                                                            decoder_config={'target': "comfy.ldm.hunyuan_video.vae_refiner.Decoder", 'params': ddconfig})
-
-                self.memory_used_encode = lambda shape, dtype: (1400 * shape[-2] * shape[-1]) * model_management.dtype_size(dtype)
-                self.memory_used_decode = lambda shape, dtype: (1400 * shape[-3] * shape[-2] * shape[-1] * 16 * 16) * model_management.dtype_size(dtype)
            elif "decoder.conv_in.conv.weight" in sd:
                ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}
                ddconfig["conv3d"] = True
@@ -456,53 +418,28 @@ class VAE:
                self.memory_used_encode = lambda shape, dtype: (50 * (round((shape[2] + 7) / 8) * 8) * shape[3] * shape[4]) * model_management.dtype_size(dtype)
                self.working_dtypes = [torch.bfloat16, torch.float32]
            elif "decoder.middle.0.residual.0.gamma" in sd:
-                if "decoder.upsamples.0.upsamples.0.residual.2.weight" in sd:  # Wan 2.2 VAE
-                    self.upscale_ratio = (lambda a: max(0, a * 4 - 3), 16, 16)
-                    self.upscale_index_formula = (4, 16, 16)
-                    self.downscale_ratio = (lambda a: max(0, math.floor((a + 3) / 4)), 16, 16)
-                    self.downscale_index_formula = (4, 16, 16)
-                    self.latent_dim = 3
-                    self.latent_channels = 48
-                    ddconfig = {"dim": 160, "z_dim": self.latent_channels, "dim_mult": [1, 2, 4, 4], "num_res_blocks": 2, "attn_scales": [], "temperal_downsample": [False, True, True], "dropout": 0.0}
-                    self.first_stage_model = comfy.ldm.wan.vae2_2.WanVAE(**ddconfig)
-                    self.working_dtypes = [torch.bfloat16, torch.float16, torch.float32]
-                    self.memory_used_encode = lambda shape, dtype: 3300 * shape[3] * shape[4] * model_management.dtype_size(dtype)
-                    self.memory_used_decode = lambda shape, dtype: 8000 * shape[3] * shape[4] * (16 * 16) * model_management.dtype_size(dtype)
-                else:  # Wan 2.1 VAE
-                    self.upscale_ratio = (lambda a: max(0, a * 4 - 3), 8, 8)
-                    self.upscale_index_formula = (4, 8, 8)
-                    self.downscale_ratio = (lambda a: max(0, math.floor((a + 3) / 4)), 8, 8)
-                    self.downscale_index_formula = (4, 8, 8)
-                    self.latent_dim = 3
-                    self.latent_channels = 16
-                    ddconfig = {"dim": 96, "z_dim": self.latent_channels, "dim_mult": [1, 2, 4, 4], "num_res_blocks": 2, "attn_scales": [], "temperal_downsample": [False, True, True], "dropout": 0.0}
-                    self.first_stage_model = comfy.ldm.wan.vae.WanVAE(**ddconfig)
-                    self.working_dtypes = [torch.bfloat16, torch.float16, torch.float32]
-                    self.memory_used_encode = lambda shape, dtype: 6000 * shape[3] * shape[4] * model_management.dtype_size(dtype)
-                    self.memory_used_decode = lambda shape, dtype: 7000 * shape[3] * shape[4] * (8 * 8) * model_management.dtype_size(dtype)
-            # Hunyuan 3d v2 2.0 & 2.1
+                self.upscale_ratio = (lambda a: max(0, a * 4 - 3), 8, 8)
+                self.upscale_index_formula = (4, 8, 8)
+                self.downscale_ratio = (lambda a: max(0, math.floor((a + 3) / 4)), 8, 8)
+                self.downscale_index_formula = (4, 8, 8)
+                self.latent_dim = 3
+                self.latent_channels = 16
+                ddconfig = {"dim": 96, "z_dim": self.latent_channels, "dim_mult": [1, 2, 4, 4], "num_res_blocks": 2, "attn_scales": [], "temperal_downsample": [False, True, True], "dropout": 0.0}
+                self.first_stage_model = comfy.ldm.wan.vae.WanVAE(**ddconfig)
+                self.working_dtypes = [torch.bfloat16, torch.float16, torch.float32]
+                self.memory_used_encode = lambda shape, dtype: 6000 * shape[3] * shape[4] * model_management.dtype_size(dtype)
+                self.memory_used_decode = lambda shape, dtype: 7000 * shape[3] * shape[4] * (8 * 8) * model_management.dtype_size(dtype)
            elif "geo_decoder.cross_attn_decoder.ln_1.bias" in sd:
-
                self.latent_dim = 1
-
-                def estimate_memory(shape, dtype, num_layers = 16, kv_cache_multiplier = 2):
-                    batch, num_tokens, hidden_dim = shape
-                    dtype_size = model_management.dtype_size(dtype)
-
-                    total_mem = batch * num_tokens * hidden_dim * dtype_size * (1 + kv_cache_multiplier * num_layers)
-                    return total_mem
-
-                # better memory estimations
-                self.memory_used_encode = lambda shape, dtype, num_layers = 8, kv_cache_multiplier = 0:\
-                    estimate_memory(shape, dtype, num_layers, kv_cache_multiplier)
-
-                self.memory_used_decode = lambda shape, dtype, num_layers = 16, kv_cache_multiplier = 2: \
-                    estimate_memory(shape, dtype, num_layers, kv_cache_multiplier)
-
-                self.first_stage_model = comfy.ldm.hunyuan3d.vae.ShapeVAE()
+                ln_post = "geo_decoder.ln_post.weight" in sd
+                inner_size = sd["geo_decoder.output_proj.weight"].shape[1]
+                downsample_ratio = sd["post_kl.weight"].shape[0] // inner_size
+                mlp_expand = sd["geo_decoder.cross_attn_decoder.mlp.c_fc.weight"].shape[0] // inner_size
+                self.memory_used_encode = lambda shape, dtype: (1000 * shape[2]) * model_management.dtype_size(dtype)  # TODO
+                self.memory_used_decode = lambda shape, dtype: (1024 * 1024 * 1024 * 2.0) * model_management.dtype_size(dtype)  # TODO
+                ddconfig = {"embed_dim": 64, "num_freqs": 8, "include_pi": False, "heads": 16, "width": 1024, "num_decoder_layers": 16, "qkv_bias": False, "qk_norm": True, "geo_decoder_mlp_expand_ratio": mlp_expand, "geo_decoder_downsample_ratio": downsample_ratio, "geo_decoder_ln_post": ln_post}
+                self.first_stage_model = comfy.ldm.hunyuan3d.vae.ShapeVAE(**ddconfig)
                self.working_dtypes = [torch.float16, torch.bfloat16, torch.float32]
-
-
            elif "vocoder.backbone.channel_layers.0.0.bias" in sd: #Ace Step Audio
                self.first_stage_model = comfy.ldm.ace.vae.music_dcae_pipeline.MusicDCAE(source_sample_rate=44100)
                self.memory_used_encode = lambda shape, dtype: (shape[2] * 330) * model_management.dtype_size(dtype)
@@ -517,15 +454,6 @@ class VAE:
                self.working_dtypes = [torch.bfloat16, torch.float16, torch.float32]
                self.disable_offload = True
                self.extra_1d_channel = 16
-            elif "pixel_space_vae" in sd:
-                self.first_stage_model = comfy.pixel_space_convert.PixelspaceConversionVAE()
-                self.memory_used_encode = lambda shape, dtype: (1 * shape[2] * shape[3]) * model_management.dtype_size(dtype)
-                self.memory_used_decode = lambda shape, dtype: (1 * shape[2] * shape[3]) * model_management.dtype_size(dtype)
-                self.downscale_ratio = 1
-                self.upscale_ratio = 1
-                self.latent_channels = 3
-                self.latent_dim = 2
-                self.output_channels = 3
            else:
                logging.warning("WARNING: No VAE weights detected, VAE not initalized.")
                self.first_stage_model = None
@@ -698,10 +626,7 @@ class VAE:
        pixel_samples = self.vae_encode_crop_pixels(pixel_samples)
        pixel_samples = pixel_samples.movedim(-1, 1)
        if self.latent_dim == 3 and pixel_samples.ndim < 5:
-            if not self.not_video:
-                pixel_samples = pixel_samples.movedim(1, 0).unsqueeze(0)
-            else:
-                pixel_samples = pixel_samples.unsqueeze(2)
+            pixel_samples = pixel_samples.movedim(1, 0).unsqueeze(0)
        try:
            memory_used = self.memory_used_encode(pixel_samples.shape, self.vae_dtype)
            model_management.load_models_gpu([self.patcher], memory_required=memory_used, force_full_load=self.disable_offload)
@@ -735,10 +660,7 @@ class VAE:
        dims = self.latent_dim
        pixel_samples = pixel_samples.movedim(-1, 1)
        if dims == 3:
-            if not self.not_video:
-                pixel_samples = pixel_samples.movedim(1, 0).unsqueeze(0)
-            else:
-                pixel_samples = pixel_samples.unsqueeze(2)
+            pixel_samples = pixel_samples.movedim(1, 0).unsqueeze(0)

        memory_used = self.memory_used_encode(pixel_samples.shape, self.vae_dtype)  # TODO: calculate mem required for tile
        model_management.load_models_gpu([self.patcher], memory_required=memory_used, force_full_load=self.disable_offload)
@@ -795,7 +717,6 @@ class VAE:
        except:
            return None

-
 class StyleModel:
    def __init__(self, model, device="cpu"):
        self.model = model
@@ -833,9 +754,6 @@ class CLIPType(Enum):
    HIDREAM = 14
    CHROMA = 15
    ACE = 16
-    OMNIGEN2 = 17
-    QWEN_IMAGE = 18
-    HUNYUAN_IMAGE = 19


 def load_clip(ckpt_paths, embedding_directory=None, clip_type=CLIPType.STABLE_DIFFUSION, model_options={}):
@@ -855,9 +773,6 @@ class TEModel(Enum):
    LLAMA3_8 = 7
    T5_XXL_OLD = 8
    GEMMA_2_2B = 9
-    QWEN25_3B = 10
-    QWEN25_7B = 11
-    BYT5_SMALL_GLYPH = 12

 def detect_te_model(sd):
    if "text_model.encoder.layers.30.mlp.fc1.weight" in sd:
@@ -875,18 +790,9 @@ def detect_te_model(sd):
    if 'encoder.block.23.layer.1.DenseReluDense.wi.weight' in sd:
        return TEModel.T5_XXL_OLD
    if "encoder.block.0.layer.0.SelfAttention.k.weight" in sd:
-        weight = sd['encoder.block.0.layer.0.SelfAttention.k.weight']
-        if weight.shape[0] == 384:
-            return TEModel.BYT5_SMALL_GLYPH
        return TEModel.T5_BASE
    if 'model.layers.0.post_feedforward_layernorm.weight' in sd:
        return TEModel.GEMMA_2_2B
-    if 'model.layers.0.self_attn.k_proj.bias' in sd:
-        weight = sd['model.layers.0.self_attn.k_proj.bias']
-        if weight.shape[0] == 256:
-            return TEModel.QWEN25_3B
-        if weight.shape[0] == 512:
-            return TEModel.QWEN25_7B
    if "model.layers.0.post_attention_layernorm.weight" in sd:
        return TEModel.LLAMA3_8
    return None
@@ -988,16 +894,6 @@ def load_text_encoder_state_dicts(state_dicts=[], embedding_directory=None, clip
            clip_target.clip = comfy.text_encoders.hidream.hidream_clip(**llama_detect(clip_data),
                                                                        clip_l=False, clip_g=False, t5=False, llama=True, dtype_t5=None, t5xxl_scaled_fp8=None)
            clip_target.tokenizer = comfy.text_encoders.hidream.HiDreamTokenizer
-        elif te_model == TEModel.QWEN25_3B:
-            clip_target.clip = comfy.text_encoders.omnigen2.te(**llama_detect(clip_data))
-            clip_target.tokenizer = comfy.text_encoders.omnigen2.Omnigen2Tokenizer
-        elif te_model == TEModel.QWEN25_7B:
-            if clip_type == CLIPType.HUNYUAN_IMAGE:
-                clip_target.clip = comfy.text_encoders.hunyuan_image.te(byt5=False, **llama_detect(clip_data))
-                clip_target.tokenizer = comfy.text_encoders.hunyuan_image.HunyuanImageTokenizer
-            else:
-                clip_target.clip = comfy.text_encoders.qwen_image.te(**llama_detect(clip_data))
-                clip_target.tokenizer = comfy.text_encoders.qwen_image.QwenImageTokenizer
        else:
            # clip_l
            if clip_type == CLIPType.SD3:
@@ -1041,9 +937,6 @@ def load_text_encoder_state_dicts(state_dicts=[], embedding_directory=None, clip

            clip_target.clip = comfy.text_encoders.hidream.hidream_clip(clip_l=clip_l, clip_g=clip_g, t5=t5, llama=llama, **t5_kwargs, **llama_kwargs)
            clip_target.tokenizer = comfy.text_encoders.hidream.HiDreamTokenizer
-        elif clip_type == CLIPType.HUNYUAN_IMAGE:
-            clip_target.clip = comfy.text_encoders.hunyuan_image.te(**llama_detect(clip_data))
-            clip_target.tokenizer = comfy.text_encoders.hunyuan_image.HunyuanImageTokenizer
        else:
            clip_target.clip = sdxl_clip.SDXLClipModel
            clip_target.tokenizer = sdxl_clip.SDXLTokenizer
@@ -1076,12 +969,6 @@ def load_gligen(ckpt_path):
        model = model.half()
    return comfy.model_patcher.ModelPatcher(model, load_device=model_management.get_torch_device(), offload_device=model_management.unet_offload_device())

-def model_detection_error_hint(path, state_dict):
-    filename = os.path.basename(path)
-    if 'lora' in filename.lower():
-        return "\nHINT: This seems to be a Lora file and Lora files should be put in the lora folder and loaded with a lora loader node.."
-    return ""
-
 def load_checkpoint(config_path=None, ckpt_path=None, output_vae=True, output_clip=True, embedding_directory=None, state_dict=None, config=None):
    logging.warning("Warning: The load checkpoint with config function is deprecated and will eventually be removed, please use the other one.")
    model, clip, vae, _ = load_checkpoint_guess_config(ckpt_path, output_vae=output_vae, output_clip=output_clip, output_clipvision=False, embedding_directory=embedding_directory, output_model=True)
@@ -1110,7 +997,7 @@ def load_checkpoint_guess_config(ckpt_path, output_vae=True, output_clip=True, o
    sd, metadata = comfy.utils.load_torch_file(ckpt_path, return_metadata=True)
    out = load_state_dict_guess_config(sd, output_vae, output_clip, output_clipvision, embedding_directory, output_model, model_options, te_model_options=te_model_options, metadata=metadata)
    if out is None:
-        raise RuntimeError("ERROR: Could not detect model type of: {}\n{}".format(ckpt_path, model_detection_error_hint(ckpt_path, sd)))
+        raise RuntimeError("ERROR: Could not detect model type of: {}".format(ckpt_path))
    return out

 def load_state_dict_guess_config(sd, output_vae=True, output_clip=True, output_clipvision=False, embedding_directory=None, output_model=True, model_options={}, te_model_options={}, metadata=None):
@@ -1273,7 +1160,7 @@ def load_diffusion_model_state_dict(sd, model_options={}):
    model.load_model_weights(new_sd, "")
    left_over = sd.keys()
    if len(left_over) > 0:
-        logging.info("left over keys in diffusion model: {}".format(left_over))
+        logging.info("left over keys in unet: {}".format(left_over))
    return comfy.model_patcher.ModelPatcher(model, load_device=load_device, offload_device=offload_device)


@@ -1281,8 +1168,8 @@ def load_diffusion_model(unet_path, model_options={}):
    sd = comfy.utils.load_torch_file(unet_path)
    model = load_diffusion_model_state_dict(sd, model_options=model_options)
    if model is None:
-        logging.error("ERROR UNSUPPORTED DIFFUSION MODEL {}".format(unet_path))
-        raise RuntimeError("ERROR: Could not detect model type of: {}\n{}".format(unet_path, model_detection_error_hint(unet_path, sd)))
+        logging.error("ERROR UNSUPPORTED UNET {}".format(unet_path))
+        raise RuntimeError("ERROR: Could not detect model type of: {}".format(unet_path))
    return model

 def load_unet(unet_path, dtype=None):
--- a/comfy/sd1_clip.py
+++ b/comfy/sd1_clip.py
@@ -204,19 +204,17 @@ class SDClipModel(torch.nn.Module, ClipTokenWeightEncoder):
            tokens_embed = self.transformer.get_input_embeddings()(tokens_embed, out_dtype=torch.float32)
            index = 0
            pad_extra = 0
-            embeds_info = []
            for o in other_embeds:
                emb = o[1]
                if torch.is_tensor(emb):
                    emb = {"type": "embedding", "data": emb}

-                extra = None
                emb_type = emb.get("type", None)
                if emb_type == "embedding":
                    emb = emb.get("data", None)
                else:
                    if hasattr(self.transformer, "preprocess_embed"):
-                        emb, extra = self.transformer.preprocess_embed(emb, device=device)
+                        emb = self.transformer.preprocess_embed(emb, device=device)
                    else:
                        emb = None

@@ -231,7 +229,6 @@ class SDClipModel(torch.nn.Module, ClipTokenWeightEncoder):
                    tokens_embed = torch.cat([tokens_embed[:, :ind], emb, tokens_embed[:, ind:]], dim=1)
                    attention_mask = attention_mask[:ind] + [1] * emb_shape + attention_mask[ind:]
                    index += emb_shape - 1
-                    embeds_info.append({"type": emb_type, "index": ind, "size": emb_shape, "extra": extra})
                else:
                    index += -1
                    pad_extra += emb_shape
@@ -246,11 +243,11 @@ class SDClipModel(torch.nn.Module, ClipTokenWeightEncoder):
            attention_masks.append(attention_mask)
            num_tokens.append(sum(attention_mask))

-        return torch.cat(embeds_out), torch.tensor(attention_masks, device=device, dtype=torch.long), num_tokens, embeds_info
+        return torch.cat(embeds_out), torch.tensor(attention_masks, device=device, dtype=torch.long), num_tokens

    def forward(self, tokens):
        device = self.transformer.get_input_embeddings().weight.device
-        embeds, attention_mask, num_tokens, embeds_info = self.process_tokens(tokens, device)
+        embeds, attention_mask, num_tokens = self.process_tokens(tokens, device)

        attention_mask_model = None
        if self.enable_attention_masks:
@@ -261,7 +258,7 @@ class SDClipModel(torch.nn.Module, ClipTokenWeightEncoder):
        else:
            intermediate_output = self.layer_idx

-        outputs = self.transformer(None, attention_mask_model, embeds=embeds, num_tokens=num_tokens, intermediate_output=intermediate_output, final_layer_norm_intermediate=self.layer_norm_hidden_state, dtype=torch.float32, embeds_info=embeds_info)
+        outputs = self.transformer(None, attention_mask_model, embeds=embeds, num_tokens=num_tokens, intermediate_output=intermediate_output, final_layer_norm_intermediate=self.layer_norm_hidden_state, dtype=torch.float32)

        if self.layer == "last":
            z = outputs[0].float()
@@ -465,7 +462,7 @@ class SDTokenizer:
            tokenizer_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "sd1_tokenizer")
        self.tokenizer = tokenizer_class.from_pretrained(tokenizer_path, **tokenizer_args)
        self.max_length = tokenizer_data.get("{}_max_length".format(embedding_key), max_length)
-        self.min_length = tokenizer_data.get("{}_min_length".format(embedding_key), min_length)
+        self.min_length = min_length
        self.end_token = None
        self.min_padding = min_padding

@@ -485,8 +482,7 @@ class SDTokenizer:
            if end_token is not None:
                self.end_token = end_token
            else:
-                if has_end_token:
-                    self.end_token = empty[0]
+                self.end_token = empty[0]

        if pad_token is not None:
            self.pad_token = pad_token
@@ -534,10 +530,7 @@ class SDTokenizer:
        min_padding = tokenizer_options.get("{}_min_padding".format(self.embedding_key), self.min_padding)

        text = escape_important(text)
-        if kwargs.get("disable_weights", False):
-            parsed_weights = [(text, 1.0)]
-        else:
-            parsed_weights = token_weights(text, 1.0)
+        parsed_weights = token_weights(text, 1.0)

        # tokenize words
        tokens = []
--- a/comfy/sd1_tokenizer/tokenizer_config.json
+++ b/comfy/sd1_tokenizer/tokenizer_config.json
@@ -18,7 +18,7 @@
    "single_word": false
  },
  "errors": "replace",
-  "model_max_length": 8192,
+  "model_max_length": 77,
  "name_or_path": "openai/clip-vit-large-patch14",
  "pad_token": "<|endoftext|>",
  "special_tokens_map_file": "./special_tokens_map.json",
--- a/comfy/supported_models.py
+++ b/comfy/supported_models.py
@@ -18,9 +18,6 @@ import comfy.text_encoders.cosmos
 import comfy.text_encoders.lumina2
 import comfy.text_encoders.wan
 import comfy.text_encoders.ace
-import comfy.text_encoders.omnigen2
-import comfy.text_encoders.qwen_image
-import comfy.text_encoders.hunyuan_image

 from . import supported_models_base
 from . import latent_formats
@@ -701,7 +698,7 @@ class Flux(supported_models_base.BASE):
    unet_extra_config = {}
    latent_format = latent_formats.Flux

-    memory_usage_factor = 3.1 # TODO: debug why flux mem usage is so weird on windows.
+    memory_usage_factor = 2.8

    supported_inference_dtypes = [torch.bfloat16, torch.float16, torch.float32]

@@ -995,7 +992,7 @@ class WAN21_T2V(supported_models_base.BASE):
    unet_extra_config = {}
    latent_format = latent_formats.Wan21

-    memory_usage_factor = 0.9
+    memory_usage_factor = 1.0

    supported_inference_dtypes = [torch.float16, torch.bfloat16, torch.float32]

@@ -1004,7 +1001,7 @@ class WAN21_T2V(supported_models_base.BASE):

    def __init__(self, unet_config):
        super().__init__(unet_config)
-        self.memory_usage_factor = self.unet_config.get("dim", 2000) / 2222
+        self.memory_usage_factor = self.unet_config.get("dim", 2000) / 2000

    def get_model(self, state_dict, prefix="", device=None):
        out = model_base.WAN21(self, device=device)
@@ -1047,18 +1044,6 @@ class WAN21_Camera(WAN21_T2V):
    def get_model(self, state_dict, prefix="", device=None):
        out = model_base.WAN21_Camera(self, image_to_video=False, device=device)
        return out
-
-class WAN22_Camera(WAN21_T2V):
-    unet_config = {
-        "image_model": "wan2.1",
-        "model_type": "camera_2.2",
-        "in_dim": 36,
-    }
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.WAN21_Camera(self, image_to_video=False, device=device)
-        return out
-
 class WAN21_Vace(WAN21_T2V):
    unet_config = {
        "image_model": "wan2.1",
@@ -1073,55 +1058,6 @@ class WAN21_Vace(WAN21_T2V):
        out = model_base.WAN21_Vace(self, image_to_video=False, device=device)
        return out

-class WAN21_HuMo(WAN21_T2V):
-    unet_config = {
-        "image_model": "wan2.1",
-        "model_type": "humo",
-    }
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.WAN21_HuMo(self, image_to_video=False, device=device)
-        return out
-
-class WAN22_S2V(WAN21_T2V):
-    unet_config = {
-        "image_model": "wan2.1",
-        "model_type": "s2v",
-    }
-
-    def __init__(self, unet_config):
-        super().__init__(unet_config)
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.WAN22_S2V(self, device=device)
-        return out
-
-class WAN22_Animate(WAN21_T2V):
-    unet_config = {
-        "image_model": "wan2.1",
-        "model_type": "animate",
-    }
-
-    def __init__(self, unet_config):
-        super().__init__(unet_config)
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.WAN22_Animate(self, device=device)
-        return out
-
-class WAN22_T2V(WAN21_T2V):
-    unet_config = {
-        "image_model": "wan2.1",
-        "model_type": "t2v",
-        "out_dim": 48,
-    }
-
-    latent_format = latent_formats.Wan22
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.WAN22(self, image_to_video=True, device=device)
-        return out
-
 class Hunyuan3Dv2(supported_models_base.BASE):
    unet_config = {
        "image_model": "hunyuan3d2",
@@ -1152,17 +1088,6 @@ class Hunyuan3Dv2(supported_models_base.BASE):
    def clip_target(self, state_dict={}):
        return None

-class Hunyuan3Dv2_1(Hunyuan3Dv2):
-    unet_config = {
-        "image_model": "hunyuan3d2_1",
-    }
-
-    latent_format = latent_formats.Hunyuan3Dv2_1
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.Hunyuan3Dv2_1(self, device = device)
-        return out
-
 class Hunyuan3Dv2mini(Hunyuan3Dv2):
    unet_config = {
        "image_model": "hunyuan3d2",
@@ -1228,19 +1153,6 @@ class Chroma(supported_models_base.BASE):
        t5_detect = comfy.text_encoders.sd3_clip.t5_xxl_detect(state_dict, "{}t5xxl.transformer.".format(pref))
        return supported_models_base.ClipTarget(comfy.text_encoders.pixart_t5.PixArtTokenizer, comfy.text_encoders.pixart_t5.pixart_te(**t5_detect))

-class ChromaRadiance(Chroma):
-    unet_config = {
-        "image_model": "chroma_radiance",
-    }
-
-    latent_format = comfy.latent_formats.ChromaRadiance
-
-    # Pixel-space model, no spatial compression for model input.
-    memory_usage_factor = 0.038
-
-    def get_model(self, state_dict, prefix="", device=None):
-        return model_base.ChromaRadiance(self, device=device)
-
 class ACEStep(supported_models_base.BASE):
    unet_config = {
        "audio_model": "ace",
@@ -1269,111 +1181,6 @@ class ACEStep(supported_models_base.BASE):
    def clip_target(self, state_dict={}):
        return supported_models_base.ClipTarget(comfy.text_encoders.ace.AceT5Tokenizer, comfy.text_encoders.ace.AceT5Model)

-class Omnigen2(supported_models_base.BASE):
-    unet_config = {
-        "image_model": "omnigen2",
-    }
-
-    sampling_settings = {
-        "multiplier": 1.0,
-        "shift": 2.6,
-    }
-
-    memory_usage_factor = 1.65 #TODO
-
-    unet_extra_config = {}
-    latent_format = latent_formats.Flux
-
-    supported_inference_dtypes = [torch.bfloat16, torch.float32]
-
-    vae_key_prefix = ["vae."]
-    text_encoder_key_prefix = ["text_encoders."]
-
-    def __init__(self, unet_config):
-        super().__init__(unet_config)
-        if comfy.model_management.extended_fp16_support():
-            self.supported_inference_dtypes = [torch.float16] + self.supported_inference_dtypes
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.Omnigen2(self, device=device)
-        return out
-
-    def clip_target(self, state_dict={}):
-        pref = self.text_encoder_key_prefix[0]
-        hunyuan_detect = comfy.text_encoders.hunyuan_video.llama_detect(state_dict, "{}qwen25_3b.transformer.".format(pref))
-        return supported_models_base.ClipTarget(comfy.text_encoders.omnigen2.Omnigen2Tokenizer, comfy.text_encoders.omnigen2.te(**hunyuan_detect))
-
-class QwenImage(supported_models_base.BASE):
-    unet_config = {
-        "image_model": "qwen_image",
-    }
-
-    sampling_settings = {
-        "multiplier": 1.0,
-        "shift": 1.15,
-    }
-
-    memory_usage_factor = 1.8 #TODO
-
-    unet_extra_config = {}
-    latent_format = latent_formats.Wan21
-
-    supported_inference_dtypes = [torch.bfloat16, torch.float32]
-
-    vae_key_prefix = ["vae."]
-    text_encoder_key_prefix = ["text_encoders."]
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.QwenImage(self, device=device)
-        return out
-
-    def clip_target(self, state_dict={}):
-        pref = self.text_encoder_key_prefix[0]
-        hunyuan_detect = comfy.text_encoders.hunyuan_video.llama_detect(state_dict, "{}qwen25_7b.transformer.".format(pref))
-        return supported_models_base.ClipTarget(comfy.text_encoders.qwen_image.QwenImageTokenizer, comfy.text_encoders.qwen_image.te(**hunyuan_detect))
-
-class HunyuanImage21(HunyuanVideo):
-    unet_config = {
-        "image_model": "hunyuan_video",
-        "vec_in_dim": None,
-    }
-
-    sampling_settings = {
-        "shift": 5.0,
-    }
-
-    latent_format = latent_formats.HunyuanImage21
-
-    memory_usage_factor = 7.7
-
-    supported_inference_dtypes = [torch.bfloat16, torch.float32]
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.HunyuanImage21(self, device=device)
-        return out
-
-    def clip_target(self, state_dict={}):
-        pref = self.text_encoder_key_prefix[0]
-        hunyuan_detect = comfy.text_encoders.hunyuan_video.llama_detect(state_dict, "{}qwen25_7b.transformer.".format(pref))
-        return supported_models_base.ClipTarget(comfy.text_encoders.hunyuan_image.HunyuanImageTokenizer, comfy.text_encoders.hunyuan_image.te(**hunyuan_detect))
-
-class HunyuanImage21Refiner(HunyuanVideo):
-    unet_config = {
-        "image_model": "hunyuan_video",
-        "patch_size": [1, 1, 1],
-        "vec_in_dim": None,
-    }
-
-    sampling_settings = {
-        "shift": 4.0,
-    }
-
-    latent_format = latent_formats.HunyuanImage21Refiner
-
-    def get_model(self, state_dict, prefix="", device=None):
-        out = model_base.HunyuanImage21Refiner(self, device=device)
-        return out
-
-models = [LotusD, Stable_Zero123, SD15_instructpix2pix, SD15, SD20, SD21UnclipL, SD21UnclipH, SDXL_instructpix2pix, SDXLRefiner, SDXL, SSD1B, KOALA_700M, KOALA_1B, Segmind_Vega, SD_X4Upscaler, Stable_Cascade_C, Stable_Cascade_B, SV3D_u, SV3D_p, SD3, StableAudio, AuraFlow, PixArtAlpha, PixArtSigma, HunyuanDiT, HunyuanDiT1, FluxInpaint, Flux, FluxSchnell, GenmoMochi, LTXV, HunyuanImage21Refiner, HunyuanImage21, HunyuanVideoSkyreelsI2V, HunyuanVideoI2V, HunyuanVideo, CosmosT2V, CosmosI2V, CosmosT2IPredict2, CosmosI2VPredict2, Lumina2, WAN22_T2V, WAN21_T2V, WAN21_I2V, WAN21_FunControl2V, WAN21_Vace, WAN21_Camera, WAN22_Camera, WAN22_S2V, WAN21_HuMo, WAN22_Animate, Hunyuan3Dv2mini, Hunyuan3Dv2, Hunyuan3Dv2_1, HiDream, Chroma, ChromaRadiance, ACEStep, Omnigen2, QwenImage]
+models = [LotusD, Stable_Zero123, SD15_instructpix2pix, SD15, SD20, SD21UnclipL, SD21UnclipH, SDXL_instructpix2pix, SDXLRefiner, SDXL, SSD1B, KOALA_700M, KOALA_1B, Segmind_Vega, SD_X4Upscaler, Stable_Cascade_C, Stable_Cascade_B, SV3D_u, SV3D_p, SD3, StableAudio, AuraFlow, PixArtAlpha, PixArtSigma, HunyuanDiT, HunyuanDiT1, FluxInpaint, Flux, FluxSchnell, GenmoMochi, LTXV, HunyuanVideoSkyreelsI2V, HunyuanVideoI2V, HunyuanVideo, CosmosT2V, CosmosI2V, CosmosT2IPredict2, CosmosI2VPredict2, Lumina2, WAN21_T2V, WAN21_I2V, WAN21_FunControl2V, WAN21_Vace, WAN21_Camera, Hunyuan3Dv2mini, Hunyuan3Dv2, HiDream, Chroma, ACEStep]

 models += [SVD_img2vid]
--- a/comfy/text_encoders/bert.py
+++ b/comfy/text_encoders/bert.py
@@ -116,7 +116,7 @@ class BertModel_(torch.nn.Module):
        self.embeddings = BertEmbeddings(config_dict["vocab_size"], config_dict["max_position_embeddings"], config_dict["type_vocab_size"], config_dict["pad_token_id"], embed_dim, layer_norm_eps, dtype, device, operations)
        self.encoder = BertEncoder(config_dict["num_hidden_layers"], embed_dim, config_dict["intermediate_size"], config_dict["num_attention_heads"], layer_norm_eps, dtype, device, operations)

-    def forward(self, input_tokens, attention_mask=None, embeds=None, num_tokens=None, intermediate_output=None, final_layer_norm_intermediate=True, dtype=None, embeds_info=[]):
+    def forward(self, input_tokens, attention_mask=None, embeds=None, num_tokens=None, intermediate_output=None, final_layer_norm_intermediate=True, dtype=None):
        x = self.embeddings(input_tokens, embeds=embeds, dtype=dtype)
        mask = None
        if attention_mask is not None:
--- a/comfy/text_encoders/byt5_config_small_glyph.json
+++ b/comfy/text_encoders/byt5_config_small_glyph.json
@@ -1,22 +0,0 @@
-{
-  "d_ff": 3584,
-  "d_kv": 64,
-  "d_model": 1472,
-  "decoder_start_token_id": 0,
-  "dropout_rate": 0.1,
-  "eos_token_id": 1,
-  "dense_act_fn": "gelu_pytorch_tanh",
-  "initializer_factor": 1.0,
-  "is_encoder_decoder": true,
-  "is_gated_act": true,
-  "layer_norm_epsilon": 1e-06,
-  "model_type": "t5",
-  "num_decoder_layers": 4,
-  "num_heads": 6,
-  "num_layers": 12,
-  "output_past": true,
-  "pad_token_id": 0,
-  "relative_attention_num_buckets": 32,
-  "tie_word_embeddings": false,
-  "vocab_size": 1510
-}
--- a/comfy/text_encoders/byt5_tokenizer/added_tokens.json
+++ b/comfy/text_encoders/byt5_tokenizer/added_tokens.json
@@ -1,127 +0,0 @@
-{
-  "<extra_id_0>": 259,
-  "<extra_id_100>": 359,
-  "<extra_id_101>": 360,
-  "<extra_id_102>": 361,
-  "<extra_id_103>": 362,
-  "<extra_id_104>": 363,
-  "<extra_id_105>": 364,
-  "<extra_id_106>": 365,
-  "<extra_id_107>": 366,
-  "<extra_id_108>": 367,
-  "<extra_id_109>": 368,
-  "<extra_id_10>": 269,
-  "<extra_id_110>": 369,
-  "<extra_id_111>": 370,
-  "<extra_id_112>": 371,
-  "<extra_id_113>": 372,
-  "<extra_id_114>": 373,
-  "<extra_id_115>": 374,
-  "<extra_id_116>": 375,
-  "<extra_id_117>": 376,
-  "<extra_id_118>": 377,
-  "<extra_id_119>": 378,
-  "<extra_id_11>": 270,
-  "<extra_id_120>": 379,
-  "<extra_id_121>": 380,
-  "<extra_id_122>": 381,
-  "<extra_id_123>": 382,
-  "<extra_id_124>": 383,
-  "<extra_id_12>": 271,
-  "<extra_id_13>": 272,
-  "<extra_id_14>": 273,
-  "<extra_id_15>": 274,
-  "<extra_id_16>": 275,
-  "<extra_id_17>": 276,
-  "<extra_id_18>": 277,
-  "<extra_id_19>": 278,
-  "<extra_id_1>": 260,
-  "<extra_id_20>": 279,
-  "<extra_id_21>": 280,
-  "<extra_id_22>": 281,
-  "<extra_id_23>": 282,
-  "<extra_id_24>": 283,
-  "<extra_id_25>": 284,
-  "<extra_id_26>": 285,
-  "<extra_id_27>": 286,
-  "<extra_id_28>": 287,
-  "<extra_id_29>": 288,
-  "<extra_id_2>": 261,
-  "<extra_id_30>": 289,
-  "<extra_id_31>": 290,
-  "<extra_id_32>": 291,
-  "<extra_id_33>": 292,
-  "<extra_id_34>": 293,
-  "<extra_id_35>": 294,
-  "<extra_id_36>": 295,
-  "<extra_id_37>": 296,
-  "<extra_id_38>": 297,
-  "<extra_id_39>": 298,
-  "<extra_id_3>": 262,
-  "<extra_id_40>": 299,
-  "<extra_id_41>": 300,
-  "<extra_id_42>": 301,
-  "<extra_id_43>": 302,
-  "<extra_id_44>": 303,
-  "<extra_id_45>": 304,
-  "<extra_id_46>": 305,
-  "<extra_id_47>": 306,
-  "<extra_id_48>": 307,
-  "<extra_id_49>": 308,
-  "<extra_id_4>": 263,
-  "<extra_id_50>": 309,
-  "<extra_id_51>": 310,
-  "<extra_id_52>": 311,
-  "<extra_id_53>": 312,
-  "<extra_id_54>": 313,
-  "<extra_id_55>": 314,
-  "<extra_id_56>": 315,
-  "<extra_id_57>": 316,
-  "<extra_id_58>": 317,
-  "<extra_id_59>": 318,
-  "<extra_id_5>": 264,
-  "<extra_id_60>": 319,
-  "<extra_id_61>": 320,
-  "<extra_id_62>": 321,
-  "<extra_id_63>": 322,
-  "<extra_id_64>": 323,
-  "<extra_id_65>": 324,
-  "<extra_id_66>": 325,
-  "<extra_id_67>": 326,
-  "<extra_id_68>": 327,
-  "<extra_id_69>": 328,
-  "<extra_id_6>": 265,
-  "<extra_id_70>": 329,
-  "<extra_id_71>": 330,
-  "<extra_id_72>": 331,
-  "<extra_id_73>": 332,
-  "<extra_id_74>": 333,
-  "<extra_id_75>": 334,
-  "<extra_id_76>": 335,
-  "<extra_id_77>": 336,
-  "<extra_id_78>": 337,
-  "<extra_id_79>": 338,
-  "<extra_id_7>": 266,
-  "<extra_id_80>": 339,
-  "<extra_id_81>": 340,
-  "<extra_id_82>": 341,
-  "<extra_id_83>": 342,
-  "<extra_id_84>": 343,
-  "<extra_id_85>": 344,
-  "<extra_id_86>": 345,
-  "<extra_id_87>": 346,
-  "<extra_id_88>": 347,
-  "<extra_id_89>": 348,
-  "<extra_id_8>": 267,
-  "<extra_id_90>": 349,
-  "<extra_id_91>": 350,
-  "<extra_id_92>": 351,
-  "<extra_id_93>": 352,
-  "<extra_id_94>": 353,
-  "<extra_id_95>": 354,
-  "<extra_id_96>": 355,
-  "<extra_id_97>": 356,
-  "<extra_id_98>": 357,
-  "<extra_id_99>": 358,
-  "<extra_id_9>": 268
-}
--- a/comfy/text_encoders/byt5_tokenizer/special_tokens_map.json
+++ b/comfy/text_encoders/byt5_tokenizer/special_tokens_map.json
@@ -1,150 +0,0 @@
-{
-  "additional_special_tokens": [
-    "<extra_id_0>",
-    "<extra_id_1>",
-    "<extra_id_2>",
-    "<extra_id_3>",
-    "<extra_id_4>",
-    "<extra_id_5>",
-    "<extra_id_6>",
-    "<extra_id_7>",
-    "<extra_id_8>",
-    "<extra_id_9>",
-    "<extra_id_10>",
-    "<extra_id_11>",
-    "<extra_id_12>",
-    "<extra_id_13>",
-    "<extra_id_14>",
-    "<extra_id_15>",
-    "<extra_id_16>",
-    "<extra_id_17>",
-    "<extra_id_18>",
-    "<extra_id_19>",
-    "<extra_id_20>",
-    "<extra_id_21>",
-    "<extra_id_22>",
-    "<extra_id_23>",
-    "<extra_id_24>",
-    "<extra_id_25>",
-    "<extra_id_26>",
-    "<extra_id_27>",
-    "<extra_id_28>",
-    "<extra_id_29>",
-    "<extra_id_30>",
-    "<extra_id_31>",
-    "<extra_id_32>",
-    "<extra_id_33>",
-    "<extra_id_34>",
-    "<extra_id_35>",
-    "<extra_id_36>",
-    "<extra_id_37>",
-    "<extra_id_38>",
-    "<extra_id_39>",
-    "<extra_id_40>",
-    "<extra_id_41>",
-    "<extra_id_42>",
-    "<extra_id_43>",
-    "<extra_id_44>",
-    "<extra_id_45>",
-    "<extra_id_46>",
-    "<extra_id_47>",
-    "<extra_id_48>",
-    "<extra_id_49>",
-    "<extra_id_50>",
-    "<extra_id_51>",
-    "<extra_id_52>",
-    "<extra_id_53>",
-    "<extra_id_54>",
-    "<extra_id_55>",
-    "<extra_id_56>",
-    "<extra_id_57>",
-    "<extra_id_58>",
-    "<extra_id_59>",
-    "<extra_id_60>",
-    "<extra_id_61>",
-    "<extra_id_62>",
-    "<extra_id_63>",
-    "<extra_id_64>",
-    "<extra_id_65>",
-    "<extra_id_66>",
-    "<extra_id_67>",
-    "<extra_id_68>",
-    "<extra_id_69>",
-    "<extra_id_70>",
-    "<extra_id_71>",
-    "<extra_id_72>",
-    "<extra_id_73>",
-    "<extra_id_74>",
-    "<extra_id_75>",
-    "<extra_id_76>",
-    "<extra_id_77>",
-    "<extra_id_78>",
-    "<extra_id_79>",
-    "<extra_id_80>",
-    "<extra_id_81>",
-    "<extra_id_82>",
-    "<extra_id_83>",
-    "<extra_id_84>",
-    "<extra_id_85>",
-    "<extra_id_86>",
-    "<extra_id_87>",
-    "<extra_id_88>",
-    "<extra_id_89>",
-    "<extra_id_90>",
-    "<extra_id_91>",
-    "<extra_id_92>",
-    "<extra_id_93>",
-    "<extra_id_94>",
-    "<extra_id_95>",
-    "<extra_id_96>",
-    "<extra_id_97>",
-    "<extra_id_98>",
-    "<extra_id_99>",
-    "<extra_id_100>",
-    "<extra_id_101>",
-    "<extra_id_102>",
-    "<extra_id_103>",
-    "<extra_id_104>",
-    "<extra_id_105>",
-    "<extra_id_106>",
-    "<extra_id_107>",
-    "<extra_id_108>",
-    "<extra_id_109>",
-    "<extra_id_110>",
-    "<extra_id_111>",
-    "<extra_id_112>",
-    "<extra_id_113>",
-    "<extra_id_114>",
-    "<extra_id_115>",
-    "<extra_id_116>",
-    "<extra_id_117>",
-    "<extra_id_118>",
-    "<extra_id_119>",
-    "<extra_id_120>",
-    "<extra_id_121>",
-    "<extra_id_122>",
-    "<extra_id_123>",
-    "<extra_id_124>"
-  ],
-  "eos_token": {
-    "content": "</s>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  },
-  "pad_token": {
-    "content": "<pad>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  },
-  "unk_token": {
-    "content": "<unk>",
-    "lstrip": false,
-    "normalized": true,
-    "rstrip": false,
-    "single_word": false
-  }
-}
--- a/comfy/text_encoders/byt5_tokenizer/tokenizer_config.json
+++ b/comfy/text_encoders/byt5_tokenizer/tokenizer_config.json
--- a/comfy/text_encoders/hunyuan_image.py
+++ b/comfy/text_encoders/hunyuan_image.py
@@ -1,97 +0,0 @@
-from comfy import sd1_clip
-import comfy.text_encoders.llama
-from .qwen_image import QwenImageTokenizer, QwenImageTEModel
-from transformers import ByT5Tokenizer
-import os
-import re
-
-class ByT5SmallTokenizer(sd1_clip.SDTokenizer):
-    def __init__(self, embedding_directory=None, tokenizer_data={}):
-        tokenizer_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "byt5_tokenizer")
-        super().__init__(tokenizer_path, pad_with_end=False, embedding_size=1472, embedding_key='byt5_small', tokenizer_class=ByT5Tokenizer, has_start_token=False, pad_to_max_length=False, max_length=99999999, min_length=1, tokenizer_data=tokenizer_data)
-
-class HunyuanImageTokenizer(QwenImageTokenizer):
-    def __init__(self, embedding_directory=None, tokenizer_data={}):
-        super().__init__(embedding_directory=embedding_directory, tokenizer_data=tokenizer_data)
-        self.llama_template = "<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>"
-        # self.llama_template_images = "{}"
-        self.byt5 = ByT5SmallTokenizer(embedding_directory=embedding_directory, tokenizer_data=tokenizer_data)
-
-    def tokenize_with_weights(self, text:str, return_word_ids=False, **kwargs):
-        out = super().tokenize_with_weights(text, return_word_ids, **kwargs)
-
-        # ByT5 processing for HunyuanImage
-        text_prompt_texts = []
-        pattern_quote_double = r'\"(.*?)\"'
-        pattern_quote_chinese_single = r'‘(.*?)’'
-        pattern_quote_chinese_double = r'“(.*?)”'
-
-        matches_quote_double = re.findall(pattern_quote_double, text)
-        matches_quote_chinese_single = re.findall(pattern_quote_chinese_single, text)
-        matches_quote_chinese_double = re.findall(pattern_quote_chinese_double, text)
-
-        text_prompt_texts.extend(matches_quote_double)
-        text_prompt_texts.extend(matches_quote_chinese_single)
-        text_prompt_texts.extend(matches_quote_chinese_double)
-
-        if len(text_prompt_texts) > 0:
-            out['byt5'] = self.byt5.tokenize_with_weights(''.join(map(lambda a: 'Text "{}". '.format(a), text_prompt_texts)), return_word_ids, **kwargs)
-        return out
-
-class Qwen25_7BVLIModel(sd1_clip.SDClipModel):
-    def __init__(self, device="cpu", layer="hidden", layer_idx=-3, dtype=None, attention_mask=True, model_options={}):
-        llama_scaled_fp8 = model_options.get("qwen_scaled_fp8", None)
-        if llama_scaled_fp8 is not None:
-            model_options = model_options.copy()
-            model_options["scaled_fp8"] = llama_scaled_fp8
-        super().__init__(device=device, layer=layer, layer_idx=layer_idx, textmodel_json_config={}, dtype=dtype, special_tokens={"pad": 151643}, layer_norm_hidden_state=False, model_class=comfy.text_encoders.llama.Qwen25_7BVLI, enable_attention_masks=attention_mask, return_attention_masks=attention_mask, model_options=model_options)
-
-
-class ByT5SmallModel(sd1_clip.SDClipModel):
-    def __init__(self, device="cpu", layer="last", layer_idx=None, dtype=None, model_options={}):
-        textmodel_json_config = os.path.join(os.path.dirname(os.path.realpath(__file__)), "byt5_config_small_glyph.json")
-        super().__init__(device=device, layer=layer, layer_idx=layer_idx, textmodel_json_config=textmodel_json_config, dtype=dtype, model_options=model_options, special_tokens={"end": 1, "pad": 0}, model_class=comfy.text_encoders.t5.T5, enable_attention_masks=True, zero_out_masked=True)
-
-
-class HunyuanImageTEModel(QwenImageTEModel):
-    def __init__(self, byt5=True, device="cpu", dtype=None, model_options={}):
-        super(QwenImageTEModel, self).__init__(device=device, dtype=dtype, name="qwen25_7b", clip_model=Qwen25_7BVLIModel, model_options=model_options)
-
-        if byt5:
-            self.byt5_small = ByT5SmallModel(device=device, dtype=dtype, model_options=model_options)
-        else:
-            self.byt5_small = None
-
-    def encode_token_weights(self, token_weight_pairs):
-        cond, p, extra = super().encode_token_weights(token_weight_pairs)
-        if self.byt5_small is not None and "byt5" in token_weight_pairs:
-            out = self.byt5_small.encode_token_weights(token_weight_pairs["byt5"])
-            extra["conditioning_byt5small"] = out[0]
-        return cond, p, extra
-
-    def set_clip_options(self, options):
-        super().set_clip_options(options)
-        if self.byt5_small is not None:
-            self.byt5_small.set_clip_options(options)
-
-    def reset_clip_options(self):
-        super().reset_clip_options()
-        if self.byt5_small is not None:
-            self.byt5_small.reset_clip_options()
-
-    def load_sd(self, sd):
-        if "encoder.block.0.layer.0.SelfAttention.o.weight" in sd:
-            return self.byt5_small.load_sd(sd)
-        else:
-            return super().load_sd(sd)
-
-def te(byt5=True, dtype_llama=None, llama_scaled_fp8=None):
-    class QwenImageTEModel_(HunyuanImageTEModel):
-        def __init__(self, device="cpu", dtype=None, model_options={}):
-            if llama_scaled_fp8 is not None and "scaled_fp8" not in model_options:
-                model_options = model_options.copy()
-                model_options["qwen_scaled_fp8"] = llama_scaled_fp8
-            if dtype_llama is not None:
-                dtype = dtype_llama
-            super().__init__(byt5=byt5, device=device, dtype=dtype, model_options=model_options)
-    return QwenImageTEModel_
--- a/comfy/text_encoders/llama.py
+++ b/comfy/text_encoders/llama.py
@@ -2,14 +2,12 @@ import torch
 import torch.nn as nn
 from dataclasses import dataclass
 from typing import Optional, Any
-import math

 from comfy.ldm.modules.attention import optimized_attention_for_device
 import comfy.model_management
 import comfy.ldm.common_dit

 import comfy.model_management
-from . import qwen_vl

@dataclass
 class Llama2Config:
@@ -26,44 +24,6 @@ class Llama2Config:
    head_dim = 128
    rms_norm_add = False
    mlp_activation = "silu"
-    qkv_bias = False
-    rope_dims = None
-
-@dataclass
-class Qwen25_3BConfig:
-    vocab_size: int = 151936
-    hidden_size: int = 2048
-    intermediate_size: int = 11008
-    num_hidden_layers: int = 36
-    num_attention_heads: int = 16
-    num_key_value_heads: int = 2
-    max_position_embeddings: int = 128000
-    rms_norm_eps: float = 1e-6
-    rope_theta: float = 1000000.0
-    transformer_type: str = "llama"
-    head_dim = 128
-    rms_norm_add = False
-    mlp_activation = "silu"
-    qkv_bias = True
-    rope_dims = None
-
-@dataclass
-class Qwen25_7BVLI_Config:
-    vocab_size: int = 152064
-    hidden_size: int = 3584
-    intermediate_size: int = 18944
-    num_hidden_layers: int = 28
-    num_attention_heads: int = 28
-    num_key_value_heads: int = 4
-    max_position_embeddings: int = 128000
-    rms_norm_eps: float = 1e-6
-    rope_theta: float = 1000000.0
-    transformer_type: str = "llama"
-    head_dim = 128
-    rms_norm_add = False
-    mlp_activation = "silu"
-    qkv_bias = True
-    rope_dims = [16, 24, 24]

@dataclass
 class Gemma2_2B_Config:
@@ -80,8 +40,6 @@ class Gemma2_2B_Config:
    head_dim = 256
    rms_norm_add = True
    mlp_activation = "gelu_pytorch_tanh"
-    qkv_bias = False
-    rope_dims = None

 class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5, add=False, device=None, dtype=None):
@@ -106,34 +64,27 @@ def rotate_half(x):
    return torch.cat((-x2, x1), dim=-1)


-def precompute_freqs_cis(head_dim, position_ids, theta, rope_dims=None, device=None):
+def precompute_freqs_cis(head_dim, seq_len, theta, device=None):
    theta_numerator = torch.arange(0, head_dim, 2, device=device).float()
    inv_freq = 1.0 / (theta ** (theta_numerator / head_dim))

+    position_ids = torch.arange(0, seq_len, device=device).unsqueeze(0)
+
    inv_freq_expanded = inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
    position_ids_expanded = position_ids[:, None, :].float()
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
    emb = torch.cat((freqs, freqs), dim=-1)
    cos = emb.cos()
    sin = emb.sin()
-    if rope_dims is not None and position_ids.shape[0] > 1:
-        mrope_section = rope_dims * 2
-        cos = torch.cat([m[i % 3] for i, m in enumerate(cos.split(mrope_section, dim=-1))], dim=-1).unsqueeze(0)
-        sin = torch.cat([m[i % 3] for i, m in enumerate(sin.split(mrope_section, dim=-1))], dim=-1).unsqueeze(0)
-    else:
-        cos = cos.unsqueeze(1)
-        sin = sin.unsqueeze(1)
-
    return (cos, sin)


 def apply_rope(xq, xk, freqs_cis):
-    org_dtype = xq.dtype
-    cos = freqs_cis[0]
-    sin = freqs_cis[1]
+    cos = freqs_cis[0].unsqueeze(1)
+    sin = freqs_cis[1].unsqueeze(1)
    q_embed = (xq * cos) + (rotate_half(xq) * sin)
    k_embed = (xk * cos) + (rotate_half(xk) * sin)
-    return q_embed.to(org_dtype), k_embed.to(org_dtype)
+    return q_embed, k_embed


 class Attention(nn.Module):
@@ -147,9 +98,9 @@ class Attention(nn.Module):
        self.inner_size = self.num_heads * self.head_dim

        ops = ops or nn
-        self.q_proj = ops.Linear(config.hidden_size, self.inner_size, bias=config.qkv_bias, device=device, dtype=dtype)
-        self.k_proj = ops.Linear(config.hidden_size, self.num_kv_heads * self.head_dim, bias=config.qkv_bias, device=device, dtype=dtype)
-        self.v_proj = ops.Linear(config.hidden_size, self.num_kv_heads * self.head_dim, bias=config.qkv_bias, device=device, dtype=dtype)
+        self.q_proj = ops.Linear(config.hidden_size, self.inner_size, bias=False, device=device, dtype=dtype)
+        self.k_proj = ops.Linear(config.hidden_size, self.num_kv_heads * self.head_dim, bias=False, device=device, dtype=dtype)
+        self.v_proj = ops.Linear(config.hidden_size, self.num_kv_heads * self.head_dim, bias=False, device=device, dtype=dtype)
        self.o_proj = ops.Linear(self.inner_size, config.hidden_size, bias=False, device=device, dtype=dtype)

    def forward(
@@ -290,7 +241,7 @@ class Llama2_(nn.Module):
        self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, add=config.rms_norm_add, device=device, dtype=dtype)
        # self.lm_head = ops.Linear(config.hidden_size, config.vocab_size, bias=False, device=device, dtype=dtype)

-    def forward(self, x, attention_mask=None, embeds=None, num_tokens=None, intermediate_output=None, final_layer_norm_intermediate=True, dtype=None, position_ids=None, embeds_info=[]):
+    def forward(self, x, attention_mask=None, embeds=None, num_tokens=None, intermediate_output=None, final_layer_norm_intermediate=True, dtype=None):
        if embeds is not None:
            x = embeds
        else:
@@ -299,13 +250,9 @@ class Llama2_(nn.Module):
        if self.normalize_in:
            x *= self.config.hidden_size ** 0.5

-        if position_ids is None:
-            position_ids = torch.arange(0, x.shape[1], device=x.device).unsqueeze(0)
-
        freqs_cis = precompute_freqs_cis(self.config.head_dim,
-                                         position_ids,
+                                         x.shape[1],
                                         self.config.rope_theta,
-                                         self.config.rope_dims,
                                         device=x.device)

        mask = None
@@ -373,57 +320,6 @@ class Llama2(BaseLlama, torch.nn.Module):
        self.model = Llama2_(config, device=device, dtype=dtype, ops=operations)
        self.dtype = dtype

-class Qwen25_3B(BaseLlama, torch.nn.Module):
-    def __init__(self, config_dict, dtype, device, operations):
-        super().__init__()
-        config = Qwen25_3BConfig(**config_dict)
-        self.num_layers = config.num_hidden_layers
-
-        self.model = Llama2_(config, device=device, dtype=dtype, ops=operations)
-        self.dtype = dtype
-
-class Qwen25_7BVLI(BaseLlama, torch.nn.Module):
-    def __init__(self, config_dict, dtype, device, operations):
-        super().__init__()
-        config = Qwen25_7BVLI_Config(**config_dict)
-        self.num_layers = config.num_hidden_layers
-
-        self.model = Llama2_(config, device=device, dtype=dtype, ops=operations)
-        self.visual = qwen_vl.Qwen2VLVisionTransformer(hidden_size=1280, output_hidden_size=config.hidden_size, device=device, dtype=dtype, ops=operations)
-        self.dtype = dtype
-
-    def preprocess_embed(self, embed, device):
-        if embed["type"] == "image":
-            image, grid = qwen_vl.process_qwen2vl_images(embed["data"])
-            return self.visual(image.to(device, dtype=torch.float32), grid), grid
-        return None, None
-
-    def forward(self, x, attention_mask=None, embeds=None, num_tokens=None, intermediate_output=None, final_layer_norm_intermediate=True, dtype=None, embeds_info=[]):
-        grid = None
-        position_ids = None
-        offset = 0
-        for e in embeds_info:
-            if e.get("type") == "image":
-                grid = e.get("extra", None)
-                start = e.get("index")
-                if position_ids is None:
-                    position_ids = torch.zeros((3, embeds.shape[1]), device=embeds.device)
-                    position_ids[:, :start] = torch.arange(0, start, device=embeds.device)
-                end = e.get("size") + start
-                len_max = int(grid.max()) // 2
-                start_next = len_max + start
-                position_ids[:, end:] = torch.arange(start_next + offset, start_next + (embeds.shape[1] - end) + offset, device=embeds.device)
-                position_ids[0, start:end] = start + offset
-                max_d = int(grid[0][1]) // 2
-                position_ids[1, start:end] = torch.arange(start + offset, start + max_d + offset, device=embeds.device).unsqueeze(1).repeat(1, math.ceil((end - start) / max_d)).flatten(0)[:end - start]
-                max_d = int(grid[0][2]) // 2
-                position_ids[2, start:end] = torch.arange(start + offset, start + max_d + offset, device=embeds.device).unsqueeze(0).repeat(math.ceil((end - start) / max_d), 1).flatten(0)[:end - start]
-                offset += len_max - (end - start)
-
-        if grid is None:
-            position_ids = None
-
-        return super().forward(x, attention_mask=attention_mask, embeds=embeds, num_tokens=num_tokens, intermediate_output=intermediate_output, final_layer_norm_intermediate=final_layer_norm_intermediate, dtype=dtype, position_ids=position_ids)

 class Gemma2_2B(BaseLlama, torch.nn.Module):
    def __init__(self, config_dict, dtype, device, operations):
--- a/comfy/text_encoders/omnigen2.py
+++ b/comfy/text_encoders/omnigen2.py
@@ -1,44 +0,0 @@
-from transformers import Qwen2Tokenizer
-from comfy import sd1_clip
-import comfy.text_encoders.llama
-import os
-
-
-class Qwen25_3BTokenizer(sd1_clip.SDTokenizer):
-    def __init__(self, embedding_directory=None, tokenizer_data={}):
-        tokenizer_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "qwen25_tokenizer")
-        super().__init__(tokenizer_path, pad_with_end=False, embedding_size=2048, embedding_key='qwen25_3b', tokenizer_class=Qwen2Tokenizer, has_start_token=False, has_end_token=False, pad_to_max_length=False, max_length=99999999, min_length=1, pad_token=151643, tokenizer_data=tokenizer_data)
-
-
-class Omnigen2Tokenizer(sd1_clip.SD1Tokenizer):
-    def __init__(self, embedding_directory=None, tokenizer_data={}):
-        super().__init__(embedding_directory=embedding_directory, tokenizer_data=tokenizer_data, name="qwen25_3b", tokenizer=Qwen25_3BTokenizer)
-        self.llama_template = '<|im_start|>system\nYou are a helpful assistant that generates high-quality images based on user instructions.<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n'
-
-    def tokenize_with_weights(self, text, return_word_ids=False, llama_template=None,**kwargs):
-        if llama_template is None:
-            llama_text = self.llama_template.format(text)
-        else:
-            llama_text = llama_template.format(text)
-        return super().tokenize_with_weights(llama_text, return_word_ids=return_word_ids, **kwargs)
-
-class Qwen25_3BModel(sd1_clip.SDClipModel):
-    def __init__(self, device="cpu", layer="last", layer_idx=None, dtype=None, attention_mask=True, model_options={}):
-        super().__init__(device=device, layer=layer, layer_idx=layer_idx, textmodel_json_config={}, dtype=dtype, special_tokens={"pad": 151643}, layer_norm_hidden_state=False, model_class=comfy.text_encoders.llama.Qwen25_3B, enable_attention_masks=attention_mask, return_attention_masks=attention_mask, model_options=model_options)
-
-
-class Omnigen2Model(sd1_clip.SD1ClipModel):
-    def __init__(self, device="cpu", dtype=None, model_options={}):
-        super().__init__(device=device, dtype=dtype, name="qwen25_3b", clip_model=Qwen25_3BModel, model_options=model_options)
-
-
-def te(dtype_llama=None, llama_scaled_fp8=None):
-    class Omnigen2TEModel_(Omnigen2Model):
-        def __init__(self, device="cpu", dtype=None, model_options={}):
-            if llama_scaled_fp8 is not None and "scaled_fp8" not in model_options:
-                model_options = model_options.copy()
-                model_options["scaled_fp8"] = llama_scaled_fp8
-            if dtype_llama is not None:
-                dtype = dtype_llama
-            super().__init__(device=device, dtype=dtype, model_options=model_options)
-    return Omnigen2TEModel_
--- a/comfy/text_encoders/pixart_t5.py
+++ b/comfy/text_encoders/pixart_t5.py
@@ -1,42 +1,42 @@
-import os
-
-from comfy import sd1_clip
-import comfy.text_encoders.t5
-import comfy.text_encoders.sd3_clip
-from comfy.sd1_clip import gen_empty_tokens
-
-from transformers import T5TokenizerFast
-
-class T5XXLModel(comfy.text_encoders.sd3_clip.T5XXLModel):
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
-
-    def gen_empty_tokens(self, special_tokens, *args, **kwargs):
-        # PixArt expects the negative to be all pad tokens
-        special_tokens = special_tokens.copy()
-        special_tokens.pop("end")
-        return gen_empty_tokens(special_tokens, *args, **kwargs)
-
-class PixArtT5XXL(sd1_clip.SD1ClipModel):
-    def __init__(self, device="cpu", dtype=None, model_options={}):
-        super().__init__(device=device, dtype=dtype, name="t5xxl", clip_model=T5XXLModel, model_options=model_options)
-
-class T5XXLTokenizer(sd1_clip.SDTokenizer):
-    def __init__(self, embedding_directory=None, tokenizer_data={}):
-        tokenizer_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "t5_tokenizer")
-        super().__init__(tokenizer_path, embedding_directory=embedding_directory, pad_with_end=False, embedding_size=4096, embedding_key='t5xxl', tokenizer_class=T5TokenizerFast, has_start_token=False, pad_to_max_length=False, max_length=99999999, min_length=1, tokenizer_data=tokenizer_data) # no padding
-
-class PixArtTokenizer(sd1_clip.SD1Tokenizer):
-    def __init__(self, embedding_directory=None, tokenizer_data={}):
-        super().__init__(embedding_directory=embedding_directory, tokenizer_data=tokenizer_data, clip_name="t5xxl", tokenizer=T5XXLTokenizer)
-
-def pixart_te(dtype_t5=None, t5xxl_scaled_fp8=None):
-    class PixArtTEModel_(PixArtT5XXL):
-        def __init__(self, device="cpu", dtype=None, model_options={}):
-            if t5xxl_scaled_fp8 is not None and "t5xxl_scaled_fp8" not in model_options:
-                model_options = model_options.copy()
-                model_options["t5xxl_scaled_fp8"] = t5xxl_scaled_fp8
-            if dtype is None:
-                dtype = dtype_t5
-            super().__init__(device=device, dtype=dtype, model_options=model_options)
-    return PixArtTEModel_
+import os
+
+from comfy import sd1_clip
+import comfy.text_encoders.t5
+import comfy.text_encoders.sd3_clip
+from comfy.sd1_clip import gen_empty_tokens
+
+from transformers import T5TokenizerFast
+
+class T5XXLModel(comfy.text_encoders.sd3_clip.T5XXLModel):
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+
+    def gen_empty_tokens(self, special_tokens, *args, **kwargs):
+        # PixArt expects the negative to be all pad tokens
+        special_tokens = special_tokens.copy()
+        special_tokens.pop("end")
+        return gen_empty_tokens(special_tokens, *args, **kwargs)
+
+class PixArtT5XXL(sd1_clip.SD1ClipModel):
+    def __init__(self, device="cpu", dtype=None, model_options={}):
+        super().__init__(device=device, dtype=dtype, name="t5xxl", clip_model=T5XXLModel, model_options=model_options)
+
+class T5XXLTokenizer(sd1_clip.SDTokenizer):
+    def __init__(self, embedding_directory=None, tokenizer_data={}):
+        tokenizer_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "t5_tokenizer")
+        super().__init__(tokenizer_path, embedding_directory=embedding_directory, pad_with_end=False, embedding_size=4096, embedding_key='t5xxl', tokenizer_class=T5TokenizerFast, has_start_token=False, pad_to_max_length=False, max_length=99999999, min_length=1, tokenizer_data=tokenizer_data) # no padding
+
+class PixArtTokenizer(sd1_clip.SD1Tokenizer):
+    def __init__(self, embedding_directory=None, tokenizer_data={}):
+        super().__init__(embedding_directory=embedding_directory, tokenizer_data=tokenizer_data, clip_name="t5xxl", tokenizer=T5XXLTokenizer)
+
+def pixart_te(dtype_t5=None, t5xxl_scaled_fp8=None):
+    class PixArtTEModel_(PixArtT5XXL):
+        def __init__(self, device="cpu", dtype=None, model_options={}):
+            if t5xxl_scaled_fp8 is not None and "t5xxl_scaled_fp8" not in model_options:
+                model_options = model_options.copy()
+                model_options["t5xxl_scaled_fp8"] = t5xxl_scaled_fp8
+            if dtype is None:
+                dtype = dtype_t5
+            super().__init__(device=device, dtype=dtype, model_options=model_options)
+    return PixArtTEModel_
Author	SHA1	Message	Date
Jacob Segal	1316e608c9	Add the websocket library for automated tests	2025-06-13 21:51:32 -07:00
Jacob Segal	8d28c17369	Add a missing file. It looks like this got caught by .gitignore? There's probably a better place to put it, but I'm not sure what that is.	2025-06-13 21:45:21 -07:00
Jacob Segal	6df907c413	Add the execution model tests to CI	2025-06-13 21:39:26 -07:00
Jacob Segal	f1dc13037e	Support for async execution functions: This commit adds support for node execution functions defined as async. When a node's execution function is defined as async, we can continue executing other nodes while it is processing. Standard uses of `await` should "just work", but people will still have to be careful if they spawn actual threads. Because torch doesn't really have async/await versions of functions, this won't particularly help with most locally-executing nodes, but it does work for e.g. web requests to other machines. In addition to the execute function, the `VALIDATE_INPUTS` and `check_lazy_status` functions can also be defined as async, though we'll only resolve one node at a time right now for those.	2025-06-13 21:39:26 -07:00
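
The async behaviour described in the last commit message can be illustrated with a small standalone sketch. This is not the executor from this change: `fetch_remote` and `run_nodes` are hypothetical names, and `asyncio.sleep` stands in for a web request to another machine. It only shows the general idea that while one async function is awaiting, another can make progress.

```python
import asyncio


async def fetch_remote(delay: float, label: str, log: list) -> str:
    # Hypothetical stand-in for a node whose execution function is async,
    # e.g. one that awaits a web request to another machine.
    log.append(f"start {label}")
    await asyncio.sleep(delay)  # yields control so other nodes can run
    log.append(f"end {label}")
    return label.upper()


async def run_nodes() -> tuple:
    # Hypothetical scheduler: start both "nodes" and wait for both results.
    log = []
    results = await asyncio.gather(
        fetch_remote(0.05, "slow", log),
        fetch_remote(0.01, "fast", log),
    )
    return results, log


if __name__ == "__main__":
    results, log = asyncio.run(run_nodes())
    print(results)  # results keep the order the nodes were submitted in
    print(log)      # the fast node finishes while the slow one is still waiting
```

A blocking call inside one of these functions (a synchronous torch operation, for example) would not yield control, which matches the commit's caveat that async mostly helps nodes that wait on I/O.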