Revert "Fix some fp8 scaled checkpoints no longer working. (#13239 )"

This reverts commit 4504ee718a.
ComfyUI version 0.18.4
2026-04-30 03:11:33 +00:00 · 2026-04-03 03:09:58 -04:00 · 2026-04-03 03:06:35 -04:00 · 2026-04-03 03:04:35 -04:00 · 2026-04-03 03:03:57 -04:00 · 2026-04-03 03:03:50 -04:00
194 changed files with 801 additions and 591558 deletions
--- a/.ci/windows_intel_base_files/run_intel_gpu.bat
+++ b/.ci/windows_intel_base_files/run_intel_gpu.bat
@@ -1,2 +0,0 @@
-.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
-pause
--- a/.github/workflows/release-stable-all.yml
+++ b/.github/workflows/release-stable-all.yml
@@ -20,12 +20,29 @@ jobs:
      git_tag: ${{ inputs.git_tag }}
      cache_tag: "cu130"
      python_minor: "13"
-      python_patch: "12"
+      python_patch: "11"
      rel_name: "nvidia"
      rel_extra_name: ""
      test_release: true
    secrets: inherit

+  release_nvidia_cu128:
+    permissions:
+      contents: "write"
+      packages: "write"
+      pull-requests: "read"
+    name: "Release NVIDIA cu128"
+    uses: ./.github/workflows/stable-release.yml
+    with:
+      git_tag: ${{ inputs.git_tag }}
+      cache_tag: "cu128"
+      python_minor: "12"
+      python_patch: "10"
+      rel_name: "nvidia"
+      rel_extra_name: "_cu128"
+      test_release: true
+    secrets: inherit
+
  release_nvidia_cu126:
    permissions:
      contents: "write"
@@ -59,20 +76,3 @@ jobs:
      rel_extra_name: ""
      test_release: false
    secrets: inherit
-
-  release_xpu:
-    permissions:
-      contents: "write"
-      packages: "write"
-      pull-requests: "read"
-    name: "Release Intel XPU"
-    uses: ./.github/workflows/stable-release.yml
-    with:
-      git_tag: ${{ inputs.git_tag }}
-      cache_tag: "xpu"
-      python_minor: "13"
-      python_patch: "12"
-      rel_name: "intel"
-      rel_extra_name: ""
-      test_release: true
-    secrets: inherit
--- a/.github/workflows/tag-dispatch-cloud.yml
+++ b/.github/workflows/tag-dispatch-cloud.yml
@@ -1,45 +0,0 @@
-name: Tag Dispatch to Cloud
-
-on:
-  push:
-    tags:
-      - 'v*'
-
-jobs:
-  dispatch-cloud:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Send repository dispatch to cloud
-        env:
-          DISPATCH_TOKEN: ${{ secrets.CLOUD_REPO_DISPATCH_TOKEN }}
-          RELEASE_TAG: ${{ github.ref_name }}
-        run: |
-          set -euo pipefail
-
-          if [ -z "${DISPATCH_TOKEN:-}" ]; then
-            echo "::error::CLOUD_REPO_DISPATCH_TOKEN is required but not set."
-            exit 1
-          fi
-
-          RELEASE_URL="https://github.com/${{ github.repository }}/releases/tag/${RELEASE_TAG}"
-
-          PAYLOAD="$(jq -n \
-            --arg release_tag "$RELEASE_TAG" \
-            --arg release_url "$RELEASE_URL" \
-            '{
-              event_type: "comfyui_tag_pushed",
-              client_payload: {
-                release_tag: $release_tag,
-                release_url: $release_url
-              }
-            }')"
-
-          curl -fsSL \
-            -X POST \
-            -H "Accept: application/vnd.github+json" \
-            -H "Content-Type: application/json" \
-            -H "Authorization: Bearer ${DISPATCH_TOKEN}" \
-            https://api.github.com/repos/Comfy-Org/cloud/dispatches \
-            -d "$PAYLOAD"
-
-          echo "✅ Dispatched ComfyUI tag ${RELEASE_TAG} to Comfy-Org/cloud"
--- a/.gitignore
+++ b/.gitignore
@@ -21,5 +21,6 @@ venv*/
 *.log
 web_custom_versions/
 .DS_Store
+openapi.yaml
 filtered-openapi.yaml
 uv.lock
--- a/QUANTIZATION.md
+++ b/QUANTIZATION.md
@@ -139,9 +139,9 @@ Example:
  "_quantization_metadata": {
    "format_version": "1.0",
    "layers": {
-      "model.layers.0.mlp.up_proj": {"format": "float8_e4m3fn"},
-      "model.layers.0.mlp.down_proj": {"format": "float8_e4m3fn"},
-      "model.layers.1.mlp.up_proj": {"format": "float8_e4m3fn"}
+      "model.layers.0.mlp.up_proj": "float8_e4m3fn",
+      "model.layers.0.mlp.down_proj": "float8_e4m3fn",
+      "model.layers.1.mlp.up_proj": "float8_e4m3fn"
    }
  }
 }
--- a/README.md
+++ b/README.md
@@ -61,7 +61,6 @@ See what ComfyUI can do with the [newer template workflows](https://comfy.org/wo

 ## Features
 - Nodes/graph/flowchart interface to experiment and create complex Stable Diffusion workflows without needing to code anything.
- NOTE: There are many more models supported than the list below, if you want to see what is supported see our templates list inside ComfyUI.
 - Image Models
   - SD1.x, SD2.x ([unCLIP](https://comfyanonymous.github.io/ComfyUI_examples/unclip/))
   - [SDXL](https://comfyanonymous.github.io/ComfyUI_examples/sdxl/), [SDXL Turbo](https://comfyanonymous.github.io/ComfyUI_examples/sdturbo/)
@@ -137,7 +136,7 @@ ComfyUI follows a weekly release cycle targeting Monday but this regularly chang
   - Builds a new release using the latest stable core version

 3. **[ComfyUI Frontend](https://github.com/Comfy-Org/ComfyUI_frontend)**
-   - Every 2+ weeks frontend updates are merged into the core repository
+   - Weekly frontend updates are merged into the core repository
   - Features are frozen for the upcoming core release
   - Development continues for the next release cycle

@@ -195,9 +194,7 @@ The portable above currently comes with python 3.13 and pytorch cuda 13.0. Updat

 #### Alternative Downloads:

-[Portable for AMD GPUs](https://github.com/comfyanonymous/ComfyUI/releases/latest/download/ComfyUI_windows_portable_amd.7z)
-
-[Experimental portable for Intel GPUs](https://github.com/comfyanonymous/ComfyUI/releases/latest/download/ComfyUI_windows_portable_intel.7z)
+[Experimental portable for AMD GPUs](https://github.com/comfyanonymous/ComfyUI/releases/latest/download/ComfyUI_windows_portable_amd.7z)

 [Portable with pytorch cuda 12.6 and python 3.12](https://github.com/comfyanonymous/ComfyUI/releases/latest/download/ComfyUI_windows_portable_nvidia_cu126.7z) (Supports Nvidia 10 series and older GPUs).

@@ -235,7 +232,7 @@ Put your VAE in: models/vae

 AMD users can install rocm and pytorch with pip if you don't have it already installed, this is the command to install the stable version:

-```pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.2```
+```pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.1```

 This is the command to install the nightly with ROCm 7.2 which might have some performance improvements:

@@ -278,7 +275,7 @@ Nvidia users should install stable pytorch using this command:

 This is the command to install pytorch nightly instead which might have performance improvements.

-```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu132```
+```pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu130```

 #### Troubleshooting

--- a/api_server/routes/internal/internal_routes.py
+++ b/api_server/routes/internal/internal_routes.py
@@ -67,7 +67,7 @@ class InternalRoutes:
                (entry for entry in os.scandir(directory) if is_visible_file(entry)),
                key=lambda entry: -entry.stat().st_mtime
            )
-            return web.json_response([f"{entry.name} [{directory_type}]" for entry in sorted_files], status=200)
+            return web.json_response([entry.name for entry in sorted_files], status=200)


    def get_app(self):
--- a/app/assets/database/queries/init.py
+++ b/app/assets/database/queries/init.py
@@ -1,7 +1,6 @@
 from app.assets.database.queries.asset import (
    asset_exists_by_hash,
    bulk_insert_assets,
-    create_stub_asset,
    get_asset_by_hash,
    get_existing_asset_ids,
    reassign_asset_references,
@@ -13,7 +12,6 @@ from app.assets.database.queries.asset_reference import (
    UnenrichedReferenceRow,
    bulk_insert_references_ignore_conflicts,
    bulk_update_enrichment_level,
-    count_active_siblings,
    bulk_update_is_missing,
    bulk_update_needs_verify,
    convert_metadata_to_rows,
@@ -82,8 +80,6 @@ __all__ = [
    "bulk_insert_references_ignore_conflicts",
    "bulk_insert_tags_and_meta",
    "bulk_update_enrichment_level",
-    "count_active_siblings",
-    "create_stub_asset",
    "bulk_update_is_missing",
    "bulk_update_needs_verify",
    "convert_metadata_to_rows",
--- a/app/assets/database/queries/asset.py
+++ b/app/assets/database/queries/asset.py
@@ -78,18 +78,6 @@ def upsert_asset(
    return asset, created, updated


-def create_stub_asset(
-    session: Session,
-    size_bytes: int,
-    mime_type: str | None = None,
-) -> Asset:
-    """Create a new asset with no hash (stub for later enrichment)."""
-    asset = Asset(size_bytes=size_bytes, mime_type=mime_type, hash=None)
-    session.add(asset)
-    session.flush()
-    return asset
-
-
 def bulk_insert_assets(
    session: Session,
    rows: list[dict],
--- a/app/assets/database/queries/asset_reference.py
+++ b/app/assets/database/queries/asset_reference.py
@@ -114,23 +114,6 @@ def get_reference_by_file_path(
    )


-def count_active_siblings(
-    session: Session,
-    asset_id: str,
-    exclude_reference_id: str,
-) -> int:
-    """Count active (non-deleted) references to an asset, excluding one reference."""
-    return (
-        session.query(AssetReference)
-        .filter(
-            AssetReference.asset_id == asset_id,
-            AssetReference.id != exclude_reference_id,
-            AssetReference.deleted_at.is_(None),
-        )
-        .count()
-    )
-
-
 def reference_exists_for_asset_id(
    session: Session,
    asset_id: str,
--- a/app/assets/scanner.py
+++ b/app/assets/scanner.py
@@ -13,7 +13,6 @@ from app.assets.database.queries import (
    delete_references_by_ids,
    ensure_tags_exist,
    get_asset_by_hash,
-    get_reference_by_id,
    get_references_for_prefixes,
    get_unenriched_references,
    mark_references_missing_outside_prefixes,
@@ -339,7 +338,6 @@ def build_asset_specs(
                "metadata": metadata,
                "hash": asset_hash,
                "mime_type": mime_type,
-                "job_id": None,
            }
        )
        tag_pool.update(tags)
@@ -428,7 +426,6 @@ def enrich_asset(
    except OSError:
        return new_level

-    initial_mtime_ns = get_mtime_ns(stat_p)
    rel_fname = compute_relative_filename(file_path)
    mime_type: str | None = None
    metadata = None
@@ -492,18 +489,6 @@ def enrich_asset(
        except Exception as e:
            logging.warning("Failed to hash %s: %s", file_path, e)

-    # Optimistic guard: if the reference's mtime_ns changed since we
-    # started (e.g. ingest_existing_file updated it), our results are
-    # stale — discard them to avoid overwriting fresh registration data.
-    ref = get_reference_by_id(session, reference_id)
-    if ref is None or ref.mtime_ns != initial_mtime_ns:
-        session.rollback()
-        logging.info(
-            "Ref %s mtime changed during enrichment, discarding stale result",
-            reference_id,
-        )
-        return ENRICHMENT_STUB
-
    if extract_metadata and metadata:
        system_metadata = metadata.to_user_metadata()
        set_reference_system_metadata(session, reference_id, system_metadata)
--- a/app/assets/seeder.py
+++ b/app/assets/seeder.py
@@ -77,9 +77,7 @@ class _AssetSeeder:
    """

    def __init__(self) -> None:
-        # RLock is required because _run_scan() drains pending work while
-        # holding _lock and re-enters start() which also acquires _lock.
-        self._lock = threading.RLock()
+        self._lock = threading.Lock()
        self._state = State.IDLE
        self._progress: Progress | None = None
        self._last_progress: Progress | None = None
@@ -94,7 +92,6 @@ class _AssetSeeder:
        self._prune_first: bool = False
        self._progress_callback: ProgressCallback | None = None
        self._disabled: bool = False
-        self._pending_enrich: dict | None = None

    def disable(self) -> None:
        """Disable the asset seeder, preventing any scans from starting."""
@@ -199,42 +196,6 @@ class _AssetSeeder:
            compute_hashes=compute_hashes,
        )

-    def enqueue_enrich(
-        self,
-        roots: tuple[RootType, ...] = ("models", "input", "output"),
-        compute_hashes: bool = False,
-    ) -> bool:
-        """Start an enrichment scan now, or queue it for after the current scan.
-
-        If the seeder is idle, starts immediately. Otherwise, the enrich
-        request is stored and will run automatically when the current scan
-        finishes.
-
-        Args:
-            roots: Tuple of root types to scan
-            compute_hashes: If True, compute blake3 hashes
-
-        Returns:
-            True if started immediately, False if queued for later
-        """
-        with self._lock:
-            if self.start_enrich(roots=roots, compute_hashes=compute_hashes):
-                return True
-            if self._pending_enrich is not None:
-                existing_roots = set(self._pending_enrich["roots"])
-                existing_roots.update(roots)
-                self._pending_enrich["roots"] = tuple(existing_roots)
-                self._pending_enrich["compute_hashes"] = (
-                    self._pending_enrich["compute_hashes"] or compute_hashes
-                )
-            else:
-                self._pending_enrich = {
-                    "roots": roots,
-                    "compute_hashes": compute_hashes,
-                }
-            logging.info("Enrich scan queued (roots=%s)", self._pending_enrich["roots"])
-        return False
-
    def cancel(self) -> bool:
        """Request cancellation of the current scan.

@@ -420,10 +381,6 @@ class _AssetSeeder:
            return marked
        finally:
            with self._lock:
-                self._reset_to_idle()
-
-    def _reset_to_idle(self) -> None:
-        """Reset state to IDLE, preserving last progress. Caller must hold _lock."""
                self._last_progress = self._progress
                self._state = State.IDLE
                self._progress = None
@@ -637,18 +594,9 @@ class _AssetSeeder:
                    },
                )
            with self._lock:
-                self._reset_to_idle()
-                pending = self._pending_enrich
-                if pending is not None:
-                    self._pending_enrich = None
-                    if not self.start_enrich(
-                        roots=pending["roots"],
-                        compute_hashes=pending["compute_hashes"],
-                    ):
-                        logging.warning(
-                            "Pending enrich scan could not start (roots=%s)",
-                            pending["roots"],
-                        )
+                self._last_progress = self._progress
+                self._state = State.IDLE
+                self._progress = None

    def _run_fast_phase(self, roots: tuple[RootType, ...]) -> tuple[int, int, int]:
        """Run phase 1: fast scan to create stub records.
--- a/app/assets/services/init.py
+++ b/app/assets/services/init.py
@@ -23,8 +23,6 @@ from app.assets.services.ingest import (
    DependencyMissingError,
    HashMismatchError,
    create_from_hash,
-    ingest_existing_file,
-    register_output_files,
    upload_from_temp_path,
 )
 from app.assets.database.queries import (
@@ -74,8 +72,6 @@ __all__ = [
    "delete_asset_reference",
    "get_asset_by_hash",
    "get_asset_detail",
-    "ingest_existing_file",
-    "register_output_files",
    "get_mtime_ns",
    "get_size_and_mtime_ns",
    "list_assets_page",
--- a/app/assets/services/bulk_ingest.py
+++ b/app/assets/services/bulk_ingest.py
@@ -37,7 +37,6 @@ class SeedAssetSpec(TypedDict):
    metadata: ExtractedMetadata | None
    hash: str | None
    mime_type: str | None
-    job_id: str | None


 class AssetRow(TypedDict):
@@ -61,7 +60,6 @@ class ReferenceRow(TypedDict):
    name: str
    preview_id: str | None
    user_metadata: dict[str, Any] | None
-    job_id: str | None
    created_at: datetime
    updated_at: datetime
    last_access_time: datetime
@@ -169,7 +167,6 @@ def batch_insert_seed_assets(
                "name": spec["info_name"],
                "preview_id": None,
                "user_metadata": user_metadata,
-                "job_id": spec.get("job_id"),
                "created_at": current_time,
                "updated_at": current_time,
                "last_access_time": current_time,
--- a/app/assets/services/ingest.py
+++ b/app/assets/services/ingest.py
@@ -9,9 +9,6 @@ from sqlalchemy.orm import Session
 import app.assets.services.hashing as hashing
 from app.assets.database.queries import (
    add_tags_to_reference,
-    count_active_siblings,
-    create_stub_asset,
-    ensure_tags_exist,
    fetch_reference_and_asset,
    get_asset_by_hash,
    get_reference_by_file_path,
@@ -26,8 +23,7 @@ from app.assets.database.queries import (
    upsert_reference,
    validate_tags_exist,
 )
-from app.assets.helpers import get_utc_now, normalize_tags
-from app.assets.services.bulk_ingest import batch_insert_seed_assets
+from app.assets.helpers import normalize_tags
 from app.assets.services.file_utils import get_size_and_mtime_ns
 from app.assets.services.path_utils import (
    compute_relative_filename,
@@ -134,102 +130,6 @@ def _ingest_file_from_path(
    )


-def register_output_files(
-    file_paths: Sequence[str],
-    user_metadata: UserMetadata = None,
-    job_id: str | None = None,
-) -> int:
-    """Register a batch of output file paths as assets.
-
-    Returns the number of files successfully registered.
-    """
-    registered = 0
-    for abs_path in file_paths:
-        if not os.path.isfile(abs_path):
-            continue
-        try:
-            if ingest_existing_file(
-                abs_path, user_metadata=user_metadata, job_id=job_id
-            ):
-                registered += 1
-        except Exception:
-            logging.exception("Failed to register output: %s", abs_path)
-    return registered
-
-
-def ingest_existing_file(
-    abs_path: str,
-    user_metadata: UserMetadata = None,
-    extra_tags: Sequence[str] = (),
-    owner_id: str = "",
-    job_id: str | None = None,
-) -> bool:
-    """Register an existing on-disk file as an asset stub.
-
-    If a reference already exists for this path, updates mtime_ns, job_id,
-    size_bytes, and resets enrichment so the enricher will re-hash it.
-
-    For brand-new paths, inserts a stub record (hash=NULL) for immediate
-    UX visibility.
-
-    Returns True if a row was inserted or updated, False otherwise.
-    """
-    locator = os.path.abspath(abs_path)
-    size_bytes, mtime_ns = get_size_and_mtime_ns(abs_path)
-    mime_type = mimetypes.guess_type(abs_path, strict=False)[0]
-    name, path_tags = get_name_and_tags_from_asset_path(abs_path)
-    tags = list(dict.fromkeys(path_tags + list(extra_tags)))
-
-    with create_session() as session:
-        existing_ref = get_reference_by_file_path(session, locator)
-        if existing_ref is not None:
-            now = get_utc_now()
-            existing_ref.mtime_ns = mtime_ns
-            existing_ref.job_id = job_id
-            existing_ref.is_missing = False
-            existing_ref.deleted_at = None
-            existing_ref.updated_at = now
-            existing_ref.enrichment_level = 0
-
-            asset = existing_ref.asset
-            if asset:
-                # If other refs share this asset, detach to a new stub
-                # instead of mutating the shared row.
-                siblings = count_active_siblings(session, asset.id, existing_ref.id)
-                if siblings > 0:
-                    new_asset = create_stub_asset(
-                        session,
-                        size_bytes=size_bytes,
-                        mime_type=mime_type or asset.mime_type,
-                    )
-                    existing_ref.asset_id = new_asset.id
-                else:
-                    asset.hash = None
-                    asset.size_bytes = size_bytes
-                    if mime_type:
-                        asset.mime_type = mime_type
-            session.commit()
-            return True
-
-        spec = {
-            "abs_path": abs_path,
-            "size_bytes": size_bytes,
-            "mtime_ns": mtime_ns,
-            "info_name": name,
-            "tags": tags,
-            "fname": os.path.basename(abs_path),
-            "metadata": None,
-            "hash": None,
-            "mime_type": mime_type,
-            "job_id": job_id,
-        }
-        if tags:
-            ensure_tags_exist(session, tags)
-        result = batch_insert_seed_assets(session, [spec], owner_id=owner_id)
-        session.commit()
-        return result.won_paths > 0
-
-
 def _register_existing_asset(
    asset_hash: str,
    name: str,
--- a/app/assets/services/path_utils.py
+++ b/app/assets/services/path_utils.py
@@ -93,13 +93,12 @@ def compute_relative_filename(file_path: str) -> str | None:

 def get_asset_category_and_relative_path(
    file_path: str,
-) -> tuple[Literal["input", "output", "temp", "models"], str]:
+) -> tuple[Literal["input", "output", "models"], str]:
    """Determine which root category a file path belongs to.

    Categories:
      - 'input': under folder_paths.get_input_directory()
      - 'output': under folder_paths.get_output_directory()
-      - 'temp': under folder_paths.get_temp_directory()
      - 'models': under any base path from get_comfy_models_folders()

    Returns:
@@ -130,12 +129,7 @@ def get_asset_category_and_relative_path(
    if _check_is_within(fp_abs, output_base):
        return "output", _compute_relative(fp_abs, output_base)

-    # 3) temp
-    temp_base = os.path.abspath(folder_paths.get_temp_directory())
-    if _check_is_within(fp_abs, temp_base):
-        return "temp", _compute_relative(fp_abs, temp_base)
-
-    # 4) models (check deepest matching base to avoid ambiguity)
+    # 3) models (check deepest matching base to avoid ambiguity)
    best: tuple[int, str, str] | None = None  # (base_len, bucket, rel_inside_bucket)
    for bucket, bases in get_comfy_models_folders():
        for b in bases:
@@ -152,7 +146,7 @@ def get_asset_category_and_relative_path(
        return "models", os.path.relpath(os.path.join(os.sep, combined), os.sep)

    raise ValueError(
-        f"Path is not within input, output, temp, or configured model bases: {file_path}"
+        f"Path is not within input, output, or configured model bases: {file_path}"
    )


--- a/blueprints/.glsl/Color_Balance_15.frag
+++ b/blueprints/.glsl/Color_Balance_15.frag
@@ -1,90 +0,0 @@
-#version 300 es
-precision highp float;
-
-uniform sampler2D u_image0;
-uniform float u_float0;
-uniform float u_float1;
-uniform float u_float2;
-uniform float u_float3;
-uniform float u_float4;
-uniform float u_float5;
-uniform float u_float6;
-uniform float u_float7;
-uniform float u_float8;
-uniform bool u_bool0;
-
-in vec2 v_texCoord;
-out vec4 fragColor;
-
-vec3 rgb2hsl(vec3 c) {
-    float maxC = max(c.r, max(c.g, c.b));
-    float minC = min(c.r, min(c.g, c.b));
-    float l = (maxC + minC) * 0.5;
-    if (maxC == minC) return vec3(0.0, 0.0, l);
-    float d = maxC - minC;
-    float s = l > 0.5 ? d / (2.0 - maxC - minC) : d / (maxC + minC);
-    float h;
-    if (maxC == c.r) {
-        h = (c.g - c.b) / d + (c.g < c.b ? 6.0 : 0.0);
-    } else if (maxC == c.g) {
-        h = (c.b - c.r) / d + 2.0;
-    } else {
-        h = (c.r - c.g) / d + 4.0;
-    }
-    h /= 6.0;
-    return vec3(h, s, l);
-}
-
-float hue2rgb(float p, float q, float t) {
-    if (t < 0.0) t += 1.0;
-    if (t > 1.0) t -= 1.0;
-    if (t < 1.0 / 6.0) return p + (q - p) * 6.0 * t;
-    if (t < 1.0 / 2.0) return q;
-    if (t < 2.0 / 3.0) return p + (q - p) * (2.0 / 3.0 - t) * 6.0;
-    return p;
-}
-
-vec3 hsl2rgb(vec3 hsl) {
-    float h = hsl.x, s = hsl.y, l = hsl.z;
-    if (s == 0.0) return vec3(l);
-    float q = l < 0.5 ? l * (1.0 + s) : l + s - l * s;
-    float p = 2.0 * l - q;
-    return vec3(
-        hue2rgb(p, q, h + 1.0 / 3.0),
-        hue2rgb(p, q, h),
-        hue2rgb(p, q, h - 1.0 / 3.0)
-    );
-}
-
-void main() {
-    vec4 tex = texture(u_image0, v_texCoord);
-    vec3 color = tex.rgb;
-
-    vec3 shadows = vec3(u_float0, u_float1, u_float2) * 0.01;
-    vec3 midtones = vec3(u_float3, u_float4, u_float5) * 0.01;
-    vec3 highlights = vec3(u_float6, u_float7, u_float8) * 0.01;
-
-    float maxC = max(color.r, max(color.g, color.b));
-    float minC = min(color.r, min(color.g, color.b));
-    float lightness = (maxC + minC) * 0.5;
-
-    // GIMP weight curves: linear ramps with constants a=0.25, b=0.333, scale=0.7
-    const float a = 0.25;
-    const float b = 0.333;
-    const float scale = 0.7;
-
-    float sw = clamp((lightness - b) / -a + 0.5, 0.0, 1.0) * scale;
-    float mw = clamp((lightness - b) / a + 0.5, 0.0, 1.0) *
-               clamp((lightness + b - 1.0) / -a + 0.5, 0.0, 1.0) * scale;
-    float hw = clamp((lightness + b - 1.0) / a + 0.5, 0.0, 1.0) * scale;
-
-    color += sw * shadows + mw * midtones + hw * highlights;
-
-    if (u_bool0) {
-        vec3 hsl = rgb2hsl(clamp(color, 0.0, 1.0));
-        hsl.z = lightness;
-        color = hsl2rgb(hsl);
-    }
-
-    fragColor = vec4(clamp(color, 0.0, 1.0), tex.a);
-}
--- a/blueprints/.glsl/Color_Curves_8.frag
+++ b/blueprints/.glsl/Color_Curves_8.frag
@@ -1,49 +0,0 @@
-#version 300 es
-precision highp float;
-
-uniform sampler2D u_image0;
-uniform sampler2D u_curve0;  // RGB master curve (256x1 LUT)
-uniform sampler2D u_curve1;  // Red channel curve
-uniform sampler2D u_curve2;  // Green channel curve
-uniform sampler2D u_curve3;  // Blue channel curve
-
-in vec2 v_texCoord;
-layout(location = 0) out vec4 fragColor0;
-
-// GIMP-compatible curve lookup with manual linear interpolation.
-// Matches gimp_curve_map_value_inline() from gimpcurve-map.c:
-//   index = value * (n_samples - 1)
-//   f = fract(index)
-//   result = (1-f) * samples[floor] + f * samples[ceil]
-//
-// Uses texelFetch (NEAREST) to avoid GPU half-texel offset issues
-// that occur with texture() + GL_LINEAR on small 256x1 LUTs.
-float applyCurve(sampler2D curve, float value) {
-    value = clamp(value, 0.0, 1.0);
-
-    float pos = value * 255.0;
-    int lo = int(floor(pos));
-    int hi = min(lo + 1, 255);
-    float f = pos - float(lo);
-
-    float a = texelFetch(curve, ivec2(lo, 0), 0).r;
-    float b = texelFetch(curve, ivec2(hi, 0), 0).r;
-
-    return a + f * (b - a);
-}
-
-void main() {
-    vec4 color = texture(u_image0, v_texCoord);
-
-    // GIMP order: per-channel curves first, then RGB master curve.
-    // See gimp_curve_map_pixels() default case in gimpcurve-map.c:
-    //   dest = colors_curve( channel_curve( src ) )
-    float tmp_r = applyCurve(u_curve1, color.r);
-    float tmp_g = applyCurve(u_curve2, color.g);
-    float tmp_b = applyCurve(u_curve3, color.b);
-    color.r = applyCurve(u_curve0, tmp_r);
-    color.g = applyCurve(u_curve0, tmp_g);
-    color.b = applyCurve(u_curve0, tmp_b);
-
-    fragColor0 = vec4(color.rgb, color.a);
-}
--- a/blueprints/.glsl/Glow_30.frag
+++ b/blueprints/.glsl/Glow_30.frag
@@ -2,6 +2,7 @@
 precision mediump float;

 uniform sampler2D u_image0;
+uniform vec2 u_resolution;
 uniform int u_int0;      // Blend mode
 uniform int u_int1;      // Color tint
 uniform float u_float0;  // Intensity
@@ -74,7 +75,7 @@ void main() {
    float t0 = threshold - 0.15;
    float t1 = threshold + 0.15;
    
-    vec2 texelSize = 1.0 / vec2(textureSize(u_image0, 0));
+    vec2 texelSize = 1.0 / u_resolution;
    float radius2 = radius * radius;
    
    float sampleScale = clamp(radius * 0.75, 0.35, 1.0);
--- a/blueprints/.glsl/Image_Blur_1.frag
+++ b/blueprints/.glsl/Image_Blur_1.frag
@@ -12,6 +12,7 @@ const int RADIAL_SAMPLES = 12;
 const float RADIAL_STRENGTH = 0.0003;

 uniform sampler2D u_image0;
+uniform vec2 u_resolution;
 uniform int u_int0;      // Blur type (BLUR_GAUSSIAN, BLUR_BOX, BLUR_RADIAL)
 uniform float u_float0;  // Blur radius/amount
 uniform int u_pass;      // Pass index (0 = horizontal, 1 = vertical)
@@ -24,7 +25,7 @@ float gaussian(float x, float sigma) {
 }

 void main() {
-    vec2 texelSize = 1.0 / vec2(textureSize(u_image0, 0));
+    vec2 texelSize = 1.0 / u_resolution;
    float radius = max(u_float0, 0.0);

    // Radial (angular) blur - single pass, doesn't use separable
--- a/blueprints/.glsl/Sharpen_23.frag
+++ b/blueprints/.glsl/Sharpen_23.frag
@@ -2,13 +2,14 @@
 precision highp float;

 uniform sampler2D u_image0;
+uniform vec2 u_resolution;
 uniform float u_float0;  // strength [0.0 – 2.0] typical: 0.3–1.0

 in vec2 v_texCoord;
 layout(location = 0) out vec4 fragColor0;

 void main() {
-    vec2 texel = 1.0 / vec2(textureSize(u_image0, 0));
+    vec2 texel = 1.0 / u_resolution;
    
    // Sample center and neighbors
    vec4 center = texture(u_image0, v_texCoord);
--- a/blueprints/.glsl/Unsharp_Mask_26.frag
+++ b/blueprints/.glsl/Unsharp_Mask_26.frag
@@ -2,6 +2,7 @@
 precision highp float;

 uniform sampler2D u_image0;
+uniform vec2 u_resolution;
 uniform float u_float0;  // amount    [0.0 - 3.0]  typical: 0.5-1.5
 uniform float u_float1;  // radius    [0.5 - 10.0] blur radius in pixels
 uniform float u_float2;  // threshold [0.0 - 0.1]  min difference to sharpen
@@ -18,7 +19,7 @@ float getLuminance(vec3 color) {
 }

 void main() {
-    vec2 texel = 1.0 / vec2(textureSize(u_image0, 0));
+    vec2 texel = 1.0 / u_resolution;
    float radius = max(u_float1, 0.5);
    float amount = u_float0;
    float threshold = u_float2;
--- a/blueprints/Brightness
+++ b/blueprints/Brightness
--- a/(Z-Image-Turbo).json
+++ b/(Z-Image-Turbo).json
--- a/blueprints/Canny
+++ b/blueprints/Canny
--- a/blueprints/Chromatic
+++ b/blueprints/Chromatic
--- a/blueprints/Color
+++ b/blueprints/Color
--- a/blueprints/Color
+++ b/blueprints/Color
--- a/blueprints/Color
+++ b/blueprints/Color
@@ -1,615 +0,0 @@
-{
-  "revision": 0,
-  "last_node_id": 10,
-  "last_link_id": 0,
-  "nodes": [
-    {
-      "id": 10,
-      "type": "d5c462c8-1372-4af8-84f2-547c83470d04",
-      "pos": [
-        3610,
-        -2630
-      ],
-      "size": [
-        270,
-        420
-      ],
-      "flags": {},
-      "order": 0,
-      "mode": 0,
-      "inputs": [
-        {
-          "label": "image",
-          "localized_name": "images.image0",
-          "name": "images.image0",
-          "type": "IMAGE",
-          "link": null
-        }
-      ],
-      "outputs": [
-        {
-          "label": "IMAGE",
-          "localized_name": "IMAGE0",
-          "name": "IMAGE0",
-          "type": "IMAGE",
-          "links": []
-        }
-      ],
-      "properties": {
-        "proxyWidgets": [
-          [
-            "4",
-            "curve"
-          ],
-          [
-            "5",
-            "curve"
-          ],
-          [
-            "6",
-            "curve"
-          ],
-          [
-            "7",
-            "curve"
-          ]
-        ]
-      },
-      "widgets_values": [],
-      "title": "Color Curves"
-    }
-  ],
-  "links": [],
-  "version": 0.4,
-  "definitions": {
-    "subgraphs": [
-      {
-        "id": "d5c462c8-1372-4af8-84f2-547c83470d04",
-        "version": 1,
-        "state": {
-          "lastGroupId": 0,
-          "lastNodeId": 9,
-          "lastLinkId": 38,
-          "lastRerouteId": 0
-        },
-        "revision": 0,
-        "config": {},
-        "name": "Color Curves",
-        "inputNode": {
-          "id": -10,
-          "bounding": [
-            2660,
-            -4500,
-            120,
-            60
-          ]
-        },
-        "outputNode": {
-          "id": -20,
-          "bounding": [
-            4270,
-            -4500,
-            120,
-            60
-          ]
-        },
-        "inputs": [
-          {
-            "id": "abc345b7-f55e-4f32-a11d-3aa4c2b0936b",
-            "name": "images.image0",
-            "type": "IMAGE",
-            "linkIds": [
-              29,
-              34
-            ],
-            "localized_name": "images.image0",
-            "label": "image",
-            "pos": [
-              2760,
-              -4480
-            ]
-          }
-        ],
-        "outputs": [
-          {
-            "id": "eb0ec079-46da-4408-8263-9ef85569d33d",
-            "name": "IMAGE0",
-            "type": "IMAGE",
-            "linkIds": [
-              28
-            ],
-            "localized_name": "IMAGE0",
-            "label": "IMAGE",
-            "pos": [
-              4290,
-              -4480
-            ]
-          }
-        ],
-        "widgets": [],
-        "nodes": [
-          {
-            "id": 4,
-            "type": "CurveEditor",
-            "pos": [
-              3060,
-              -4500
-            ],
-            "size": [
-              270,
-              200
-            ],
-            "flags": {},
-            "order": 0,
-            "mode": 0,
-            "inputs": [
-              {
-                "label": "curve",
-                "localized_name": "curve",
-                "name": "curve",
-                "type": "CURVE",
-                "widget": {
-                  "name": "curve"
-                },
-                "link": null
-              },
-              {
-                "label": "histogram",
-                "localized_name": "histogram",
-                "name": "histogram",
-                "type": "HISTOGRAM",
-                "shape": 7,
-                "link": 35
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "CURVE",
-                "name": "CURVE",
-                "type": "CURVE",
-                "links": [
-                  30
-                ]
-              }
-            ],
-            "title": "RGB Master",
-            "properties": {
-              "Node name for S&R": "CurveEditor"
-            },
-            "widgets_values": []
-          },
-          {
-            "id": 5,
-            "type": "CurveEditor",
-            "pos": [
-              3060,
-              -4250
-            ],
-            "size": [
-              270,
-              200
-            ],
-            "flags": {},
-            "order": 1,
-            "mode": 0,
-            "inputs": [
-              {
-                "label": "curve",
-                "localized_name": "curve",
-                "name": "curve",
-                "type": "CURVE",
-                "widget": {
-                  "name": "curve"
-                },
-                "link": null
-              },
-              {
-                "label": "histogram",
-                "localized_name": "histogram",
-                "name": "histogram",
-                "type": "HISTOGRAM",
-                "shape": 7,
-                "link": 36
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "CURVE",
-                "name": "CURVE",
-                "type": "CURVE",
-                "links": [
-                  31
-                ]
-              }
-            ],
-            "title": "Red",
-            "properties": {
-              "Node name for S&R": "CurveEditor"
-            },
-            "widgets_values": []
-          },
-          {
-            "id": 6,
-            "type": "CurveEditor",
-            "pos": [
-              3060,
-              -4000
-            ],
-            "size": [
-              270,
-              200
-            ],
-            "flags": {},
-            "order": 2,
-            "mode": 0,
-            "inputs": [
-              {
-                "label": "curve",
-                "localized_name": "curve",
-                "name": "curve",
-                "type": "CURVE",
-                "widget": {
-                  "name": "curve"
-                },
-                "link": null
-              },
-              {
-                "label": "histogram",
-                "localized_name": "histogram",
-                "name": "histogram",
-                "type": "HISTOGRAM",
-                "shape": 7,
-                "link": 37
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "CURVE",
-                "name": "CURVE",
-                "type": "CURVE",
-                "links": [
-                  32
-                ]
-              }
-            ],
-            "title": "Green",
-            "properties": {
-              "Node name for S&R": "CurveEditor"
-            },
-            "widgets_values": []
-          },
-          {
-            "id": 7,
-            "type": "CurveEditor",
-            "pos": [
-              3060,
-              -3750
-            ],
-            "size": [
-              270,
-              200
-            ],
-            "flags": {},
-            "order": 3,
-            "mode": 0,
-            "inputs": [
-              {
-                "label": "curve",
-                "localized_name": "curve",
-                "name": "curve",
-                "type": "CURVE",
-                "widget": {
-                  "name": "curve"
-                },
-                "link": null
-              },
-              {
-                "label": "histogram",
-                "localized_name": "histogram",
-                "name": "histogram",
-                "type": "HISTOGRAM",
-                "shape": 7,
-                "link": 38
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "CURVE",
-                "name": "CURVE",
-                "type": "CURVE",
-                "links": [
-                  33
-                ]
-              }
-            ],
-            "title": "Blue",
-            "properties": {
-              "Node name for S&R": "CurveEditor"
-            },
-            "widgets_values": []
-          },
-          {
-            "id": 8,
-            "type": "GLSLShader",
-            "pos": [
-              3590,
-              -4500
-            ],
-            "size": [
-              420,
-              500
-            ],
-            "flags": {},
-            "order": 4,
-            "mode": 0,
-            "inputs": [
-              {
-                "label": "image0",
-                "localized_name": "images.image0",
-                "name": "images.image0",
-                "type": "IMAGE",
-                "link": 29
-              },
-              {
-                "label": "image1",
-                "localized_name": "images.image1",
-                "name": "images.image1",
-                "shape": 7,
-                "type": "IMAGE",
-                "link": null
-              },
-              {
-                "label": "u_curve0",
-                "localized_name": "curves.u_curve0",
-                "name": "curves.u_curve0",
-                "shape": 7,
-                "type": "CURVE",
-                "link": 30
-              },
-              {
-                "label": "u_curve1",
-                "localized_name": "curves.u_curve1",
-                "name": "curves.u_curve1",
-                "shape": 7,
-                "type": "CURVE",
-                "link": 31
-              },
-              {
-                "label": "u_curve2",
-                "localized_name": "curves.u_curve2",
-                "name": "curves.u_curve2",
-                "shape": 7,
-                "type": "CURVE",
-                "link": 32
-              },
-              {
-                "label": "u_curve3",
-                "localized_name": "curves.u_curve3",
-                "name": "curves.u_curve3",
-                "shape": 7,
-                "type": "CURVE",
-                "link": 33
-              },
-              {
-                "localized_name": "fragment_shader",
-                "name": "fragment_shader",
-                "type": "STRING",
-                "widget": {
-                  "name": "fragment_shader"
-                },
-                "link": null
-              },
-              {
-                "localized_name": "size_mode",
-                "name": "size_mode",
-                "type": "COMFY_DYNAMICCOMBO_V3",
-                "widget": {
-                  "name": "size_mode"
-                },
-                "link": null
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "IMAGE0",
-                "name": "IMAGE0",
-                "type": "IMAGE",
-                "links": [
-                  28
-                ]
-              },
-              {
-                "localized_name": "IMAGE1",
-                "name": "IMAGE1",
-                "type": "IMAGE",
-                "links": null
-              },
-              {
-                "localized_name": "IMAGE2",
-                "name": "IMAGE2",
-                "type": "IMAGE",
-                "links": null
-              },
-              {
-                "localized_name": "IMAGE3",
-                "name": "IMAGE3",
-                "type": "IMAGE",
-                "links": null
-              }
-            ],
-            "properties": {
-              "Node name for S&R": "GLSLShader"
-            },
-            "widgets_values": [
-              "#version 300 es\nprecision highp float;\n\nuniform sampler2D u_image0;\nuniform sampler2D u_curve0;  // RGB master curve (256x1 LUT)\nuniform sampler2D u_curve1;  // Red channel curve\nuniform sampler2D u_curve2;  // Green channel curve\nuniform sampler2D u_curve3;  // Blue channel curve\n\nin vec2 v_texCoord;\nlayout(location = 0) out vec4 fragColor0;\n\n// GIMP-compatible curve lookup with manual linear interpolation.\n// Matches gimp_curve_map_value_inline() from gimpcurve-map.c:\n//   index = value * (n_samples - 1)\n//   f = fract(index)\n//   result = (1-f) * samples[floor] + f * samples[ceil]\n//\n// Uses texelFetch (NEAREST) to avoid GPU half-texel offset issues\n// that occur with texture() + GL_LINEAR on small 256x1 LUTs.\nfloat applyCurve(sampler2D curve, float value) {\n    value = clamp(value, 0.0, 1.0);\n\n    float pos = value * 255.0;\n    int lo = int(floor(pos));\n    int hi = min(lo + 1, 255);\n    float f = pos - float(lo);\n\n    float a = texelFetch(curve, ivec2(lo, 0), 0).r;\n    float b = texelFetch(curve, ivec2(hi, 0), 0).r;\n\n    return a + f * (b - a);\n}\n\nvoid main() {\n    vec4 color = texture(u_image0, v_texCoord);\n\n    // GIMP order: per-channel curves first, then RGB master curve.\n    // See gimp_curve_map_pixels() default case in gimpcurve-map.c:\n    //   dest = colors_curve( channel_curve( src ) )\n    float tmp_r = applyCurve(u_curve1, color.r);\n    float tmp_g = applyCurve(u_curve2, color.g);\n    float tmp_b = applyCurve(u_curve3, color.b);\n    color.r = applyCurve(u_curve0, tmp_r);\n    color.g = applyCurve(u_curve0, tmp_g);\n    color.b = applyCurve(u_curve0, tmp_b);\n\n    fragColor0 = vec4(color.rgb, color.a);\n}\n",
-              "from_input"
-            ]
-          },
-          {
-            "id": 9,
-            "type": "ImageHistogram",
-            "pos": [
-              2800,
-              -4300
-            ],
-            "size": [
-              210,
-              150
-            ],
-            "flags": {},
-            "order": 5,
-            "mode": 0,
-            "inputs": [
-              {
-                "label": "image",
-                "localized_name": "image",
-                "name": "image",
-                "type": "IMAGE",
-                "link": 34
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "HISTOGRAM",
-                "name": "rgb",
-                "type": "HISTOGRAM",
-                "links": [
-                  35
-                ]
-              },
-              {
-                "localized_name": "HISTOGRAM",
-                "name": "luminance",
-                "type": "HISTOGRAM",
-                "links": []
-              },
-              {
-                "localized_name": "HISTOGRAM",
-                "name": "red",
-                "type": "HISTOGRAM",
-                "links": [
-                  36
-                ]
-              },
-              {
-                "localized_name": "HISTOGRAM",
-                "name": "green",
-                "type": "HISTOGRAM",
-                "links": [
-                  37
-                ]
-              },
-              {
-                "localized_name": "HISTOGRAM",
-                "name": "blue",
-                "type": "HISTOGRAM",
-                "links": [
-                  38
-                ]
-              }
-            ],
-            "properties": {
-              "Node name for S&R": "ImageHistogram"
-            },
-            "widgets_values": []
-          }
-        ],
-        "groups": [],
-        "links": [
-          {
-            "id": 29,
-            "origin_id": -10,
-            "origin_slot": 0,
-            "target_id": 8,
-            "target_slot": 0,
-            "type": "IMAGE"
-          },
-          {
-            "id": 28,
-            "origin_id": 8,
-            "origin_slot": 0,
-            "target_id": -20,
-            "target_slot": 0,
-            "type": "IMAGE"
-          },
-          {
-            "id": 30,
-            "origin_id": 4,
-            "origin_slot": 0,
-            "target_id": 8,
-            "target_slot": 2,
-            "type": "CURVE"
-          },
-          {
-            "id": 31,
-            "origin_id": 5,
-            "origin_slot": 0,
-            "target_id": 8,
-            "target_slot": 3,
-            "type": "CURVE"
-          },
-          {
-            "id": 32,
-            "origin_id": 6,
-            "origin_slot": 0,
-            "target_id": 8,
-            "target_slot": 4,
-            "type": "CURVE"
-          },
-          {
-            "id": 33,
-            "origin_id": 7,
-            "origin_slot": 0,
-            "target_id": 8,
-            "target_slot": 5,
-            "type": "CURVE"
-          },
-          {
-            "id": 34,
-            "origin_id": -10,
-            "origin_slot": 0,
-            "target_id": 9,
-            "target_slot": 0,
-            "type": "IMAGE"
-          },
-          {
-            "id": 35,
-            "origin_id": 9,
-            "origin_slot": 0,
-            "target_id": 4,
-            "target_slot": 1,
-            "type": "HISTOGRAM"
-          },
-          {
-            "id": 36,
-            "origin_id": 9,
-            "origin_slot": 2,
-            "target_id": 5,
-            "target_slot": 1,
-            "type": "HISTOGRAM"
-          },
-          {
-            "id": 37,
-            "origin_id": 9,
-            "origin_slot": 3,
-            "target_id": 6,
-            "target_slot": 1,
-            "type": "HISTOGRAM"
-          },
-          {
-            "id": 38,
-            "origin_id": 9,
-            "origin_slot": 4,
-            "target_id": 7,
-            "target_slot": 1,
-            "type": "HISTOGRAM"
-          }
-        ],
-        "extra": {
-          "workflowRendererVersion": "LG"
-        },
-        "category": "Image Tools/Color adjust"
-      }
-    ]
-  }
-}
--- a/blueprints/Crop
+++ b/blueprints/Crop
--- a/blueprints/Crop
+++ b/blueprints/Crop
--- a/(Z-Image-Turbo).json
+++ b/(Z-Image-Turbo).json
--- a/blueprints/Depth
+++ b/blueprints/Depth
--- a/blueprints/Edge-Preserving
+++ b/blueprints/Edge-Preserving
--- a/blueprints/Film
+++ b/blueprints/Film
--- a/blueprints/First-Last-Frame
+++ b/blueprints/First-Last-Frame
--- a/blueprints/Glow.json
+++ b/blueprints/Glow.json
--- a/Saturation.json
+++ b/Saturation.json
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/blueprints/Image
+++ b/blueprints/Image
@@ -1,322 +1 @@
-{
-  "revision": 0,
-  "last_node_id": 29,
-  "last_link_id": 0,
-  "nodes": [
-    {
-      "id": 29,
-      "type": "4c9d6ea4-b912-40e5-8766-6793a9758c53",
-      "pos": [
-        1970,
-        -230
-      ],
-      "size": [
-        180,
-        86
-      ],
-      "flags": {},
-      "order": 5,
-      "mode": 0,
-      "inputs": [
-        {
-          "label": "image",
-          "localized_name": "images.image0",
-          "name": "images.image0",
-          "type": "IMAGE",
-          "link": null
-        }
-      ],
-      "outputs": [
-        {
-          "label": "R",
-          "localized_name": "IMAGE0",
-          "name": "IMAGE0",
-          "type": "IMAGE",
-          "links": []
-        },
-        {
-          "label": "G",
-          "localized_name": "IMAGE1",
-          "name": "IMAGE1",
-          "type": "IMAGE",
-          "links": []
-        },
-        {
-          "label": "B",
-          "localized_name": "IMAGE2",
-          "name": "IMAGE2",
-          "type": "IMAGE",
-          "links": []
-        },
-        {
-          "label": "A",
-          "localized_name": "IMAGE3",
-          "name": "IMAGE3",
-          "type": "IMAGE",
-          "links": []
-        }
-      ],
-      "title": "Image Channels",
-      "properties": {
-        "proxyWidgets": []
-      },
-      "widgets_values": []
-    }
-  ],
-  "links": [],
-  "version": 0.4,
-  "definitions": {
-    "subgraphs": [
-      {
-        "id": "4c9d6ea4-b912-40e5-8766-6793a9758c53",
-        "version": 1,
-        "state": {
-          "lastGroupId": 0,
-          "lastNodeId": 28,
-          "lastLinkId": 39,
-          "lastRerouteId": 0
-        },
-        "revision": 0,
-        "config": {},
-        "name": "Image Channels",
-        "inputNode": {
-          "id": -10,
-          "bounding": [
-            1820,
-            -185,
-            120,
-            60
-          ]
-        },
-        "outputNode": {
-          "id": -20,
-          "bounding": [
-            2460,
-            -215,
-            120,
-            120
-          ]
-        },
-        "inputs": [
-          {
-            "id": "3522932b-2d86-4a1f-a02a-cb29f3a9d7fe",
-            "name": "images.image0",
-            "type": "IMAGE",
-            "linkIds": [
-              39
-            ],
-            "localized_name": "images.image0",
-            "label": "image",
-            "pos": [
-              1920,
-              -165
-            ]
-          }
-        ],
-        "outputs": [
-          {
-            "id": "605cb9c3-b065-4d9b-81d2-3ec331889b2b",
-            "name": "IMAGE0",
-            "type": "IMAGE",
-            "linkIds": [
-              26
-            ],
-            "localized_name": "IMAGE0",
-            "label": "R",
-            "pos": [
-              2480,
-              -195
-            ]
-          },
-          {
-            "id": "fb44a77e-0522-43e9-9527-82e7465b3596",
-            "name": "IMAGE1",
-            "type": "IMAGE",
-            "linkIds": [
-              27
-            ],
-            "localized_name": "IMAGE1",
-            "label": "G",
-            "pos": [
-              2480,
-              -175
-            ]
-          },
-          {
-            "id": "81460ee6-0131-402a-874f-6bf3001fc4ff",
-            "name": "IMAGE2",
-            "type": "IMAGE",
-            "linkIds": [
-              28
-            ],
-            "localized_name": "IMAGE2",
-            "label": "B",
-            "pos": [
-              2480,
-              -155
-            ]
-          },
-          {
-            "id": "ae690246-80d4-4951-b1d9-9306d8a77417",
-            "name": "IMAGE3",
-            "type": "IMAGE",
-            "linkIds": [
-              29
-            ],
-            "localized_name": "IMAGE3",
-            "label": "A",
-            "pos": [
-              2480,
-              -135
-            ]
-          }
-        ],
-        "widgets": [],
-        "nodes": [
-          {
-            "id": 23,
-            "type": "GLSLShader",
-            "pos": [
-              2000,
-              -330
-            ],
-            "size": [
-              400,
-              172
-            ],
-            "flags": {},
-            "order": 0,
-            "mode": 0,
-            "inputs": [
-              {
-                "label": "image",
-                "localized_name": "images.image0",
-                "name": "images.image0",
-                "type": "IMAGE",
-                "link": 39
-              },
-              {
-                "localized_name": "fragment_shader",
-                "name": "fragment_shader",
-                "type": "STRING",
-                "widget": {
-                  "name": "fragment_shader"
-                },
-                "link": null
-              },
-              {
-                "localized_name": "size_mode",
-                "name": "size_mode",
-                "type": "COMFY_DYNAMICCOMBO_V3",
-                "widget": {
-                  "name": "size_mode"
-                },
-                "link": null
-              },
-              {
-                "label": "image1",
-                "localized_name": "images.image1",
-                "name": "images.image1",
-                "shape": 7,
-                "type": "IMAGE",
-                "link": null
-              }
-            ],
-            "outputs": [
-              {
-                "label": "R",
-                "localized_name": "IMAGE0",
-                "name": "IMAGE0",
-                "type": "IMAGE",
-                "links": [
-                  26
-                ]
-              },
-              {
-                "label": "G",
-                "localized_name": "IMAGE1",
-                "name": "IMAGE1",
-                "type": "IMAGE",
-                "links": [
-                  27
-                ]
-              },
-              {
-                "label": "B",
-                "localized_name": "IMAGE2",
-                "name": "IMAGE2",
-                "type": "IMAGE",
-                "links": [
-                  28
-                ]
-              },
-              {
-                "label": "A",
-                "localized_name": "IMAGE3",
-                "name": "IMAGE3",
-                "type": "IMAGE",
-                "links": [
-                  29
-                ]
-              }
-            ],
-            "properties": {
-              "Node name for S&R": "GLSLShader"
-            },
-            "widgets_values": [
-              "#version 300 es\nprecision highp float;\n\nuniform sampler2D u_image0;\n\nin vec2 v_texCoord;\nlayout(location = 0) out vec4 fragColor0;\nlayout(location = 1) out vec4 fragColor1;\nlayout(location = 2) out vec4 fragColor2;\nlayout(location = 3) out vec4 fragColor3;\n\nvoid main() {\n  vec4 color = texture(u_image0, v_texCoord);\n  // Output each channel as grayscale to separate render targets\n  fragColor0 = vec4(vec3(color.r), 1.0);  // Red channel\n  fragColor1 = vec4(vec3(color.g), 1.0);  // Green channel\n  fragColor2 = vec4(vec3(color.b), 1.0);  // Blue channel\n  fragColor3 = vec4(vec3(color.a), 1.0);  // Alpha channel\n}\n",
-              "from_input"
-            ]
-          }
-        ],
-        "groups": [],
-        "links": [
-          {
-            "id": 39,
-            "origin_id": -10,
-            "origin_slot": 0,
-            "target_id": 23,
-            "target_slot": 0,
-            "type": "IMAGE"
-          },
-          {
-            "id": 26,
-            "origin_id": 23,
-            "origin_slot": 0,
-            "target_id": -20,
-            "target_slot": 0,
-            "type": "IMAGE"
-          },
-          {
-            "id": 27,
-            "origin_id": 23,
-            "origin_slot": 1,
-            "target_id": -20,
-            "target_slot": 1,
-            "type": "IMAGE"
-          },
-          {
-            "id": 28,
-            "origin_id": 23,
-            "origin_slot": 2,
-            "target_id": -20,
-            "target_slot": 2,
-            "type": "IMAGE"
-          },
-          {
-            "id": 29,
-            "origin_id": 23,
-            "origin_slot": 3,
-            "target_id": -20,
-            "target_slot": 3,
-            "type": "IMAGE"
-          }
-        ],
-        "extra": {
-          "workflowRendererVersion": "LG"
-        },
-        "category": "Image Tools/Color adjust"
-      }
-    ]
-  }
-}
+{"revision": 0, "last_node_id": 29, "last_link_id": 0, "nodes": [{"id": 29, "type": "4c9d6ea4-b912-40e5-8766-6793a9758c53", "pos": [1970, -230], "size": [180, 86], "flags": {}, "order": 5, "mode": 0, "inputs": [{"label": "image", "localized_name": "images.image0", "name": "images.image0", "type": "IMAGE", "link": null}], "outputs": [{"label": "R", "localized_name": "IMAGE0", "name": "IMAGE0", "type": "IMAGE", "links": []}, {"label": "G", "localized_name": "IMAGE1", "name": "IMAGE1", "type": "IMAGE", "links": []}, {"label": "B", "localized_name": "IMAGE2", "name": "IMAGE2", "type": "IMAGE", "links": []}, {"label": "A", "localized_name": "IMAGE3", "name": "IMAGE3", "type": "IMAGE", "links": []}], "title": "Image Channels", "properties": {"proxyWidgets": []}, "widgets_values": []}], "links": [], "version": 0.4, "definitions": {"subgraphs": [{"id": "4c9d6ea4-b912-40e5-8766-6793a9758c53", "version": 1, "state": {"lastGroupId": 0, "lastNodeId": 28, "lastLinkId": 39, "lastRerouteId": 0}, "revision": 0, "config": {}, "name": "Image Channels", "inputNode": {"id": -10, "bounding": [1820, -185, 120, 60]}, "outputNode": {"id": -20, "bounding": [2460, -215, 120, 120]}, "inputs": [{"id": "3522932b-2d86-4a1f-a02a-cb29f3a9d7fe", "name": "images.image0", "type": "IMAGE", "linkIds": [39], "localized_name": "images.image0", "label": "image", "pos": [1920, -165]}], "outputs": [{"id": "605cb9c3-b065-4d9b-81d2-3ec331889b2b", "name": "IMAGE0", "type": "IMAGE", "linkIds": [26], "localized_name": "IMAGE0", "label": "R", "pos": [2480, -195]}, {"id": "fb44a77e-0522-43e9-9527-82e7465b3596", "name": "IMAGE1", "type": "IMAGE", "linkIds": [27], "localized_name": "IMAGE1", "label": "G", "pos": [2480, -175]}, {"id": "81460ee6-0131-402a-874f-6bf3001fc4ff", "name": "IMAGE2", "type": "IMAGE", "linkIds": [28], "localized_name": "IMAGE2", "label": "B", "pos": [2480, -155]}, {"id": "ae690246-80d4-4951-b1d9-9306d8a77417", "name": "IMAGE3", "type": "IMAGE", "linkIds": [29], "localized_name": "IMAGE3", "label": "A", "pos": [2480, -135]}], "widgets": [], "nodes": [{"id": 23, "type": "GLSLShader", "pos": [2000, -330], "size": [400, 172], "flags": {}, "order": 0, "mode": 0, "inputs": [{"label": "image", "localized_name": "images.image0", "name": "images.image0", "type": "IMAGE", "link": 39}, {"localized_name": "fragment_shader", "name": "fragment_shader", "type": "STRING", "widget": {"name": "fragment_shader"}, "link": null}, {"localized_name": "size_mode", "name": "size_mode", "type": "COMFY_DYNAMICCOMBO_V3", "widget": {"name": "size_mode"}, "link": null}, {"label": "image1", "localized_name": "images.image1", "name": "images.image1", "shape": 7, "type": "IMAGE", "link": null}], "outputs": [{"label": "R", "localized_name": "IMAGE0", "name": "IMAGE0", "type": "IMAGE", "links": [26]}, {"label": "G", "localized_name": "IMAGE1", "name": "IMAGE1", "type": "IMAGE", "links": [27]}, {"label": "B", "localized_name": "IMAGE2", "name": "IMAGE2", "type": "IMAGE", "links": [28]}, {"label": "A", "localized_name": "IMAGE3", "name": "IMAGE3", "type": "IMAGE", "links": [29]}], "properties": {"Node name for S&R": "GLSLShader"}, "widgets_values": ["#version 300 es\nprecision highp float;\n\nuniform sampler2D u_image0;\n\nin vec2 v_texCoord;\nlayout(location = 0) out vec4 fragColor0;\nlayout(location = 1) out vec4 fragColor1;\nlayout(location = 2) out vec4 fragColor2;\nlayout(location = 3) out vec4 fragColor3;\n\nvoid main() {\n  vec4 color = texture(u_image0, v_texCoord);\n  // Output each channel as grayscale to separate render targets\n  fragColor0 = vec4(vec3(color.r), 1.0);  // Red channel\n  fragColor1 = vec4(vec3(color.g), 1.0);  // Green channel\n  fragColor2 = vec4(vec3(color.b), 1.0);  // Blue channel\n  fragColor3 = vec4(vec3(color.a), 1.0);  // Alpha channel\n}\n", "from_input"]}], "groups": [], "links": [{"id": 39, "origin_id": -10, "origin_slot": 0, "target_id": 23, "target_slot": 0, "type": "IMAGE"}, {"id": 26, "origin_id": 23, "origin_slot": 0, "target_id": -20, "target_slot": 0, "type": "IMAGE"}, {"id": 27, "origin_id": 23, "origin_slot": 1, "target_id": -20, "target_slot": 1, "type": "IMAGE"}, {"id": 28, "origin_id": 23, "origin_slot": 2, "target_id": -20, "target_slot": 2, "type": "IMAGE"}, {"id": 29, "origin_id": 23, "origin_slot": 3, "target_id": -20, "target_slot": 3, "type": "IMAGE"}], "extra": {"workflowRendererVersion": "LG"}, "category": "Image Tools/Color adjust"}]}}
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/(Qwen-image).json
+++ b/(Qwen-image).json
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/(Qwen-Image).json
+++ b/(Qwen-Image).json
--- a/Upscale(Z-image-Turbo).json
+++ b/Upscale(Z-image-Turbo).json
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/Layers(Qwen-Image-Layered).json
+++ b/Layers(Qwen-Image-Layered).json
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/blueprints/Image
+++ b/blueprints/Image
--- a/(Z-Image-Turbo).json
+++ b/(Z-Image-Turbo).json
--- a/blueprints/Pose
+++ b/blueprints/Pose
--- a/blueprints/Prompt
+++ b/blueprints/Prompt
@@ -1,278 +1 @@
-{
-  "revision": 0,
-  "last_node_id": 15,
-  "last_link_id": 0,
-  "nodes": [
-    {
-      "id": 15,
-      "type": "24d8bbfd-39d4-4774-bff0-3de40cc7a471",
-      "pos": [
-        -1490,
-        2040
-      ],
-      "size": [
-        400,
-        260
-      ],
-      "flags": {},
-      "order": 0,
-      "mode": 0,
-      "inputs": [
-        {
-          "name": "prompt",
-          "type": "STRING",
-          "widget": {
-            "name": "prompt"
-          },
-          "link": null
-        },
-        {
-          "label": "reference images",
-          "name": "images",
-          "type": "IMAGE",
-          "link": null
-        }
-      ],
-      "outputs": [
-        {
-          "name": "STRING",
-          "type": "STRING",
-          "links": null
-        }
-      ],
-      "title": "Prompt Enhance",
-      "properties": {
-        "proxyWidgets": [
-          [
-            "-1",
-            "prompt"
-          ]
-        ],
-        "cnr_id": "comfy-core",
-        "ver": "0.14.1"
-      },
-      "widgets_values": [
-        ""
-      ]
-    }
-  ],
-  "links": [],
-  "version": 0.4,
-  "definitions": {
-    "subgraphs": [
-      {
-        "id": "24d8bbfd-39d4-4774-bff0-3de40cc7a471",
-        "version": 1,
-        "state": {
-          "lastGroupId": 0,
-          "lastNodeId": 15,
-          "lastLinkId": 14,
-          "lastRerouteId": 0
-        },
-        "revision": 0,
-        "config": {},
-        "name": "Prompt Enhance",
-        "inputNode": {
-          "id": -10,
-          "bounding": [
-            -2170,
-            2110,
-            138.876953125,
-            80
-          ]
-        },
-        "outputNode": {
-          "id": -20,
-          "bounding": [
-            -640,
-            2110,
-            120,
-            60
-          ]
-        },
-        "inputs": [
-          {
-            "id": "aeab7216-00e0-4528-a09b-bba50845c5a6",
-            "name": "prompt",
-            "type": "STRING",
-            "linkIds": [
-              11
-            ],
-            "pos": [
-              -2051.123046875,
-              2130
-            ]
-          },
-          {
-            "id": "7b73fd36-aa31-4771-9066-f6c83879994b",
-            "name": "images",
-            "type": "IMAGE",
-            "linkIds": [
-              14
-            ],
-            "label": "reference images",
-            "pos": [
-              -2051.123046875,
-              2150
-            ]
-          }
-        ],
-        "outputs": [
-          {
-            "id": "c7b0d930-68a1-48d1-b496-0519e5837064",
-            "name": "STRING",
-            "type": "STRING",
-            "linkIds": [
-              13
-            ],
-            "pos": [
-              -620,
-              2130
-            ]
-          }
-        ],
-        "widgets": [],
-        "nodes": [
-          {
-            "id": 11,
-            "type": "GeminiNode",
-            "pos": [
-              -1560,
-              1990
-            ],
-            "size": [
-              470,
-              470
-            ],
-            "flags": {},
-            "order": 0,
-            "mode": 0,
-            "inputs": [
-              {
-                "localized_name": "images",
-                "name": "images",
-                "shape": 7,
-                "type": "IMAGE",
-                "link": 14
-              },
-              {
-                "localized_name": "audio",
-                "name": "audio",
-                "shape": 7,
-                "type": "AUDIO",
-                "link": null
-              },
-              {
-                "localized_name": "video",
-                "name": "video",
-                "shape": 7,
-                "type": "VIDEO",
-                "link": null
-              },
-              {
-                "localized_name": "files",
-                "name": "files",
-                "shape": 7,
-                "type": "GEMINI_INPUT_FILES",
-                "link": null
-              },
-              {
-                "localized_name": "prompt",
-                "name": "prompt",
-                "type": "STRING",
-                "widget": {
-                  "name": "prompt"
-                },
-                "link": 11
-              },
-              {
-                "localized_name": "model",
-                "name": "model",
-                "type": "COMBO",
-                "widget": {
-                  "name": "model"
-                },
-                "link": null
-              },
-              {
-                "localized_name": "seed",
-                "name": "seed",
-                "type": "INT",
-                "widget": {
-                  "name": "seed"
-                },
-                "link": null
-              },
-              {
-                "localized_name": "system_prompt",
-                "name": "system_prompt",
-                "shape": 7,
-                "type": "STRING",
-                "widget": {
-                  "name": "system_prompt"
-                },
-                "link": null
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "STRING",
-                "name": "STRING",
-                "type": "STRING",
-                "links": [
-                  13
-                ]
-              }
-            ],
-            "properties": {
-              "cnr_id": "comfy-core",
-              "ver": "0.14.1",
-              "Node name for S&R": "GeminiNode"
-            },
-            "widgets_values": [
-              "",
-              "gemini-3-pro-preview",
-              42,
-              "randomize",
-              "You are an expert in prompt writing.\nBased on the input, rewrite the user's input into a detailed prompt.\nincluding camera settings, lighting, composition, and style.\nReturn the prompt only"
-            ],
-            "color": "#432",
-            "bgcolor": "#653"
-          }
-        ],
-        "groups": [],
-        "links": [
-          {
-            "id": 11,
-            "origin_id": -10,
-            "origin_slot": 0,
-            "target_id": 11,
-            "target_slot": 4,
-            "type": "STRING"
-          },
-          {
-            "id": 13,
-            "origin_id": 11,
-            "origin_slot": 0,
-            "target_id": -20,
-            "target_slot": 0,
-            "type": "STRING"
-          },
-          {
-            "id": 14,
-            "origin_id": -10,
-            "origin_slot": 1,
-            "target_id": 11,
-            "target_slot": 0,
-            "type": "IMAGE"
-          }
-        ],
-        "extra": {
-          "workflowRendererVersion": "LG"
-        },
-        "category": "Text generation/Prompt enhance"
-      }
-    ]
-  },
-  "extra": {}
-}
+{"revision": 0, "last_node_id": 15, "last_link_id": 0, "nodes": [{"id": 15, "type": "24d8bbfd-39d4-4774-bff0-3de40cc7a471", "pos": [-1490, 2040], "size": [400, 260], "flags": {}, "order": 0, "mode": 0, "inputs": [{"name": "prompt", "type": "STRING", "widget": {"name": "prompt"}, "link": null}, {"label": "reference images", "name": "images", "type": "IMAGE", "link": null}], "outputs": [{"name": "STRING", "type": "STRING", "links": null}], "title": "Prompt Enhance", "properties": {"proxyWidgets": [["-1", "prompt"]], "cnr_id": "comfy-core", "ver": "0.14.1"}, "widgets_values": [""]}], "links": [], "version": 0.4, "definitions": {"subgraphs": [{"id": "24d8bbfd-39d4-4774-bff0-3de40cc7a471", "version": 1, "state": {"lastGroupId": 0, "lastNodeId": 15, "lastLinkId": 14, "lastRerouteId": 0}, "revision": 0, "config": {}, "name": "Prompt Enhance", "inputNode": {"id": -10, "bounding": [-2170, 2110, 138.876953125, 80]}, "outputNode": {"id": -20, "bounding": [-640, 2110, 120, 60]}, "inputs": [{"id": "aeab7216-00e0-4528-a09b-bba50845c5a6", "name": "prompt", "type": "STRING", "linkIds": [11], "pos": [-2051.123046875, 2130]}, {"id": "7b73fd36-aa31-4771-9066-f6c83879994b", "name": "images", "type": "IMAGE", "linkIds": [14], "label": "reference images", "pos": [-2051.123046875, 2150]}], "outputs": [{"id": "c7b0d930-68a1-48d1-b496-0519e5837064", "name": "STRING", "type": "STRING", "linkIds": [13], "pos": [-620, 2130]}], "widgets": [], "nodes": [{"id": 11, "type": "GeminiNode", "pos": [-1560, 1990], "size": [470, 470], "flags": {}, "order": 0, "mode": 0, "inputs": [{"localized_name": "images", "name": "images", "shape": 7, "type": "IMAGE", "link": 14}, {"localized_name": "audio", "name": "audio", "shape": 7, "type": "AUDIO", "link": null}, {"localized_name": "video", "name": "video", "shape": 7, "type": "VIDEO", "link": null}, {"localized_name": "files", "name": "files", "shape": 7, "type": "GEMINI_INPUT_FILES", "link": null}, {"localized_name": "prompt", "name": "prompt", "type": "STRING", "widget": {"name": "prompt"}, "link": 11}, {"localized_name": "model", "name": "model", "type": "COMBO", "widget": {"name": "model"}, "link": null}, {"localized_name": "seed", "name": "seed", "type": "INT", "widget": {"name": "seed"}, "link": null}, {"localized_name": "system_prompt", "name": "system_prompt", "shape": 7, "type": "STRING", "widget": {"name": "system_prompt"}, "link": null}], "outputs": [{"localized_name": "STRING", "name": "STRING", "type": "STRING", "links": [13]}], "properties": {"cnr_id": "comfy-core", "ver": "0.14.1", "Node name for S&R": "GeminiNode"}, "widgets_values": ["", "gemini-3-pro-preview", 42, "randomize", "You are an expert in prompt writing.\nBased on the input, rewrite the user's input into a detailed prompt.\nincluding camera settings, lighting, composition, and style.\nReturn the prompt only"], "color": "#432", "bgcolor": "#653"}], "groups": [], "links": [{"id": 11, "origin_id": -10, "origin_slot": 0, "target_id": 11, "target_slot": 4, "type": "STRING"}, {"id": 13, "origin_id": 11, "origin_slot": 0, "target_id": -20, "target_slot": 0, "type": "STRING"}, {"id": 14, "origin_id": -10, "origin_slot": 1, "target_id": 11, "target_slot": 0, "type": "IMAGE"}], "extra": {"workflowRendererVersion": "LG"}, "category": "Text generation/Prompt enhance"}]}, "extra": {}}
--- a/blueprints/Sharpen.json
+++ b/blueprints/Sharpen.json
@@ -1,309 +1 @@
-{
-  "revision": 0,
-  "last_node_id": 25,
-  "last_link_id": 0,
-  "nodes": [
-    {
-      "id": 25,
-      "type": "621ba4e2-22a8-482d-a369-023753198b7b",
-      "pos": [
-        4610,
-        -790
-      ],
-      "size": [
-        230,
-        58
-      ],
-      "flags": {},
-      "order": 4,
-      "mode": 0,
-      "inputs": [
-        {
-          "label": "image",
-          "localized_name": "images.image0",
-          "name": "images.image0",
-          "type": "IMAGE",
-          "link": null
-        }
-      ],
-      "outputs": [
-        {
-          "label": "IMAGE",
-          "localized_name": "IMAGE0",
-          "name": "IMAGE0",
-          "type": "IMAGE",
-          "links": []
-        }
-      ],
-      "title": "Sharpen",
-      "properties": {
-        "proxyWidgets": [
-          [
-            "24",
-            "value"
-          ]
-        ]
-      },
-      "widgets_values": []
-    }
-  ],
-  "links": [],
-  "version": 0.4,
-  "definitions": {
-    "subgraphs": [
-      {
-        "id": "621ba4e2-22a8-482d-a369-023753198b7b",
-        "version": 1,
-        "state": {
-          "lastGroupId": 0,
-          "lastNodeId": 24,
-          "lastLinkId": 36,
-          "lastRerouteId": 0
-        },
-        "revision": 0,
-        "config": {},
-        "name": "Sharpen",
-        "inputNode": {
-          "id": -10,
-          "bounding": [
-            4090,
-            -825,
-            120,
-            60
-          ]
-        },
-        "outputNode": {
-          "id": -20,
-          "bounding": [
-            5150,
-            -825,
-            120,
-            60
-          ]
-        },
-        "inputs": [
-          {
-            "id": "37011fb7-14b7-4e0e-b1a0-6a02e8da1fd7",
-            "name": "images.image0",
-            "type": "IMAGE",
-            "linkIds": [
-              34
-            ],
-            "localized_name": "images.image0",
-            "label": "image",
-            "pos": [
-              4190,
-              -805
-            ]
-          }
-        ],
-        "outputs": [
-          {
-            "id": "e9182b3f-635c-4cd4-a152-4b4be17ae4b9",
-            "name": "IMAGE0",
-            "type": "IMAGE",
-            "linkIds": [
-              35
-            ],
-            "localized_name": "IMAGE0",
-            "label": "IMAGE",
-            "pos": [
-              5170,
-              -805
-            ]
-          }
-        ],
-        "widgets": [],
-        "nodes": [
-          {
-            "id": 24,
-            "type": "PrimitiveFloat",
-            "pos": [
-              4280,
-              -1240
-            ],
-            "size": [
-              270,
-              58
-            ],
-            "flags": {},
-            "order": 0,
-            "mode": 0,
-            "inputs": [
-              {
-                "label": "strength",
-                "localized_name": "value",
-                "name": "value",
-                "type": "FLOAT",
-                "widget": {
-                  "name": "value"
-                },
-                "link": null
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "FLOAT",
-                "name": "FLOAT",
-                "type": "FLOAT",
-                "links": [
-                  36
-                ]
-              }
-            ],
-            "properties": {
-              "Node name for S&R": "PrimitiveFloat",
-              "min": 0,
-              "max": 3,
-              "precision": 2,
-              "step": 0.05
-            },
-            "widgets_values": [
-              0.5
-            ]
-          },
-          {
-            "id": 23,
-            "type": "GLSLShader",
-            "pos": [
-              4570,
-              -1240
-            ],
-            "size": [
-              370,
-              192
-            ],
-            "flags": {},
-            "order": 1,
-            "mode": 0,
-            "inputs": [
-              {
-                "label": "image0",
-                "localized_name": "images.image0",
-                "name": "images.image0",
-                "type": "IMAGE",
-                "link": 34
-              },
-              {
-                "label": "image1",
-                "localized_name": "images.image1",
-                "name": "images.image1",
-                "shape": 7,
-                "type": "IMAGE",
-                "link": null
-              },
-              {
-                "label": "u_float0",
-                "localized_name": "floats.u_float0",
-                "name": "floats.u_float0",
-                "shape": 7,
-                "type": "FLOAT",
-                "link": 36
-              },
-              {
-                "label": "u_float1",
-                "localized_name": "floats.u_float1",
-                "name": "floats.u_float1",
-                "shape": 7,
-                "type": "FLOAT",
-                "link": null
-              },
-              {
-                "label": "u_int0",
-                "localized_name": "ints.u_int0",
-                "name": "ints.u_int0",
-                "shape": 7,
-                "type": "INT",
-                "link": null
-              },
-              {
-                "localized_name": "fragment_shader",
-                "name": "fragment_shader",
-                "type": "STRING",
-                "widget": {
-                  "name": "fragment_shader"
-                },
-                "link": null
-              },
-              {
-                "localized_name": "size_mode",
-                "name": "size_mode",
-                "type": "COMFY_DYNAMICCOMBO_V3",
-                "widget": {
-                  "name": "size_mode"
-                },
-                "link": null
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "IMAGE0",
-                "name": "IMAGE0",
-                "type": "IMAGE",
-                "links": [
-                  35
-                ]
-              },
-              {
-                "localized_name": "IMAGE1",
-                "name": "IMAGE1",
-                "type": "IMAGE",
-                "links": null
-              },
-              {
-                "localized_name": "IMAGE2",
-                "name": "IMAGE2",
-                "type": "IMAGE",
-                "links": null
-              },
-              {
-                "localized_name": "IMAGE3",
-                "name": "IMAGE3",
-                "type": "IMAGE",
-                "links": null
-              }
-            ],
-            "properties": {
-              "Node name for S&R": "GLSLShader"
-            },
-            "widgets_values": [
-              "#version 300 es\nprecision highp float;\n\nuniform sampler2D u_image0;\nuniform float u_float0;  // strength [0.0 – 2.0] typical: 0.3–1.0\n\nin vec2 v_texCoord;\nlayout(location = 0) out vec4 fragColor0;\n\nvoid main() {\n    vec2 texel = 1.0 / vec2(textureSize(u_image0, 0));\n    \n    // Sample center and neighbors\n    vec4 center = texture(u_image0, v_texCoord);\n    vec4 top    = texture(u_image0, v_texCoord + vec2( 0.0, -texel.y));\n    vec4 bottom = texture(u_image0, v_texCoord + vec2( 0.0,  texel.y));\n    vec4 left   = texture(u_image0, v_texCoord + vec2(-texel.x,  0.0));\n    vec4 right  = texture(u_image0, v_texCoord + vec2( texel.x,  0.0));\n    \n    // Edge enhancement (Laplacian)\n    vec4 edges = center * 4.0 - top - bottom - left - right;\n    \n    // Add edges back scaled by strength\n    vec4 sharpened = center + edges * u_float0;\n    \n    fragColor0 = vec4(clamp(sharpened.rgb, 0.0, 1.0), center.a);\n}",
-              "from_input"
-            ]
-          }
-        ],
-        "groups": [],
-        "links": [
-          {
-            "id": 36,
-            "origin_id": 24,
-            "origin_slot": 0,
-            "target_id": 23,
-            "target_slot": 2,
-            "type": "FLOAT"
-          },
-          {
-            "id": 34,
-            "origin_id": -10,
-            "origin_slot": 0,
-            "target_id": 23,
-            "target_slot": 0,
-            "type": "IMAGE"
-          },
-          {
-            "id": 35,
-            "origin_id": 23,
-            "origin_slot": 0,
-            "target_id": -20,
-            "target_slot": 0,
-            "type": "IMAGE"
-          }
-        ],
-        "extra": {
-          "workflowRendererVersion": "LG"
-        },
-        "category": "Image Tools/Sharpen"
-      }
-    ]
-  }
-}
+{"revision": 0, "last_node_id": 25, "last_link_id": 0, "nodes": [{"id": 25, "type": "621ba4e2-22a8-482d-a369-023753198b7b", "pos": [4610, -790], "size": [230, 58], "flags": {}, "order": 4, "mode": 0, "inputs": [{"label": "image", "localized_name": "images.image0", "name": "images.image0", "type": "IMAGE", "link": null}], "outputs": [{"label": "IMAGE", "localized_name": "IMAGE0", "name": "IMAGE0", "type": "IMAGE", "links": []}], "title": "Sharpen", "properties": {"proxyWidgets": [["24", "value"]]}, "widgets_values": []}], "links": [], "version": 0.4, "definitions": {"subgraphs": [{"id": "621ba4e2-22a8-482d-a369-023753198b7b", "version": 1, "state": {"lastGroupId": 0, "lastNodeId": 24, "lastLinkId": 36, "lastRerouteId": 0}, "revision": 0, "config": {}, "name": "Sharpen", "inputNode": {"id": -10, "bounding": [4090, -825, 120, 60]}, "outputNode": {"id": -20, "bounding": [5150, -825, 120, 60]}, "inputs": [{"id": "37011fb7-14b7-4e0e-b1a0-6a02e8da1fd7", "name": "images.image0", "type": "IMAGE", "linkIds": [34], "localized_name": "images.image0", "label": "image", "pos": [4190, -805]}], "outputs": [{"id": "e9182b3f-635c-4cd4-a152-4b4be17ae4b9", "name": "IMAGE0", "type": "IMAGE", "linkIds": [35], "localized_name": "IMAGE0", "label": "IMAGE", "pos": [5170, -805]}], "widgets": [], "nodes": [{"id": 24, "type": "PrimitiveFloat", "pos": [4280, -1240], "size": [270, 58], "flags": {}, "order": 0, "mode": 0, "inputs": [{"label": "strength", "localized_name": "value", "name": "value", "type": "FLOAT", "widget": {"name": "value"}, "link": null}], "outputs": [{"localized_name": "FLOAT", "name": "FLOAT", "type": "FLOAT", "links": [36]}], "properties": {"Node name for S&R": "PrimitiveFloat", "min": 0, "max": 3, "precision": 2, "step": 0.05}, "widgets_values": [0.5]}, {"id": 23, "type": "GLSLShader", "pos": [4570, -1240], "size": [370, 192], "flags": {}, "order": 1, "mode": 0, "inputs": [{"label": "image0", "localized_name": "images.image0", "name": "images.image0", "type": "IMAGE", "link": 34}, {"label": "image1", "localized_name": "images.image1", "name": "images.image1", "shape": 7, "type": "IMAGE", "link": null}, {"label": "u_float0", "localized_name": "floats.u_float0", "name": "floats.u_float0", "shape": 7, "type": "FLOAT", "link": 36}, {"label": "u_float1", "localized_name": "floats.u_float1", "name": "floats.u_float1", "shape": 7, "type": "FLOAT", "link": null}, {"label": "u_int0", "localized_name": "ints.u_int0", "name": "ints.u_int0", "shape": 7, "type": "INT", "link": null}, {"localized_name": "fragment_shader", "name": "fragment_shader", "type": "STRING", "widget": {"name": "fragment_shader"}, "link": null}, {"localized_name": "size_mode", "name": "size_mode", "type": "COMFY_DYNAMICCOMBO_V3", "widget": {"name": "size_mode"}, "link": null}], "outputs": [{"localized_name": "IMAGE0", "name": "IMAGE0", "type": "IMAGE", "links": [35]}, {"localized_name": "IMAGE1", "name": "IMAGE1", "type": "IMAGE", "links": null}, {"localized_name": "IMAGE2", "name": "IMAGE2", "type": "IMAGE", "links": null}, {"localized_name": "IMAGE3", "name": "IMAGE3", "type": "IMAGE", "links": null}], "properties": {"Node name for S&R": "GLSLShader"}, "widgets_values": ["#version 300 es\nprecision highp float;\n\nuniform sampler2D u_image0;\nuniform vec2 u_resolution;\nuniform float u_float0;  // strength [0.0 – 2.0] typical: 0.3–1.0\n\nin vec2 v_texCoord;\nlayout(location = 0) out vec4 fragColor0;\n\nvoid main() {\n    vec2 texel = 1.0 / u_resolution;\n    \n    // Sample center and neighbors\n    vec4 center = texture(u_image0, v_texCoord);\n    vec4 top    = texture(u_image0, v_texCoord + vec2( 0.0, -texel.y));\n    vec4 bottom = texture(u_image0, v_texCoord + vec2( 0.0,  texel.y));\n    vec4 left   = texture(u_image0, v_texCoord + vec2(-texel.x,  0.0));\n    vec4 right  = texture(u_image0, v_texCoord + vec2( texel.x,  0.0));\n    \n    // Edge enhancement (Laplacian)\n    vec4 edges = center * 4.0 - top - bottom - left - right;\n    \n    // Add edges back scaled by strength\n    vec4 sharpened = center + edges * u_float0;\n    \n    fragColor0 = vec4(clamp(sharpened.rgb, 0.0, 1.0), center.a);\n}", "from_input"]}], "groups": [], "links": [{"id": 36, "origin_id": 24, "origin_slot": 0, "target_id": 23, "target_slot": 2, "type": "FLOAT"}, {"id": 34, "origin_id": -10, "origin_slot": 0, "target_id": 23, "target_slot": 0, "type": "IMAGE"}, {"id": 35, "origin_id": 23, "origin_slot": 0, "target_id": -20, "target_slot": 0, "type": "IMAGE"}], "extra": {"workflowRendererVersion": "LG"}, "category": "Image Tools/Sharpen"}]}}
--- a/blueprints/Text
+++ b/blueprints/Text
--- a/blueprints/Text
+++ b/blueprints/Text
--- a/blueprints/Text
+++ b/blueprints/Text
--- a/blueprints/Text
+++ b/blueprints/Text
--- a/blueprints/Text
+++ b/blueprints/Text
--- a/(Qwen-Image).json
+++ b/(Qwen-Image).json
--- a/(Z-Image-Turbo).json
+++ b/(Z-Image-Turbo).json
--- a/blueprints/Text
+++ b/blueprints/Text
--- a/blueprints/Text
+++ b/blueprints/Text
--- a/blueprints/Unsharp
+++ b/blueprints/Unsharp
--- a/blueprints/Video
+++ b/blueprints/Video
--- a/blueprints/Video
+++ b/blueprints/Video
--- a/blueprints/Video
+++ b/blueprints/Video
--- a/blueprints/Video
+++ b/blueprints/Video
@@ -1,420 +1 @@
-{
-  "revision": 0,
-  "last_node_id": 13,
-  "last_link_id": 0,
-  "nodes": [
-    {
-      "id": 13,
-      "type": "cf95b747-3e17-46cb-8097-cac60ff9b2e1",
-      "pos": [
-        1120,
-        330
-      ],
-      "size": [
-        240,
-        58
-      ],
-      "flags": {},
-      "order": 3,
-      "mode": 0,
-      "inputs": [
-        {
-          "localized_name": "video",
-          "name": "video",
-          "type": "VIDEO",
-          "link": null
-        },
-        {
-          "name": "model_name",
-          "type": "COMBO",
-          "widget": {
-            "name": "model_name"
-          },
-          "link": null
-        }
-      ],
-      "outputs": [
-        {
-          "localized_name": "VIDEO",
-          "name": "VIDEO",
-          "type": "VIDEO",
-          "links": []
-        }
-      ],
-      "title": "Video Upscale(GAN x4)",
-      "properties": {
-        "proxyWidgets": [
-          [
-            "-1",
-            "model_name"
-          ]
-        ],
-        "cnr_id": "comfy-core",
-        "ver": "0.14.1"
-      },
-      "widgets_values": [
-        "RealESRGAN_x4plus.safetensors"
-      ]
-    }
-  ],
-  "links": [],
-  "version": 0.4,
-  "definitions": {
-    "subgraphs": [
-      {
-        "id": "cf95b747-3e17-46cb-8097-cac60ff9b2e1",
-        "version": 1,
-        "state": {
-          "lastGroupId": 0,
-          "lastNodeId": 13,
-          "lastLinkId": 19,
-          "lastRerouteId": 0
-        },
-        "revision": 0,
-        "config": {},
-        "name": "Video Upscale(GAN x4)",
-        "inputNode": {
-          "id": -10,
-          "bounding": [
-            550,
-            460,
-            120,
-            80
-          ]
-        },
-        "outputNode": {
-          "id": -20,
-          "bounding": [
-            1490,
-            460,
-            120,
-            60
-          ]
-        },
-        "inputs": [
-          {
-            "id": "666d633e-93e7-42dc-8d11-2b7b99b0f2a6",
-            "name": "video",
-            "type": "VIDEO",
-            "linkIds": [
-              10
-            ],
-            "localized_name": "video",
-            "pos": [
-              650,
-              480
-            ]
-          },
-          {
-            "id": "2e23a087-caa8-4d65-99e6-662761aa905a",
-            "name": "model_name",
-            "type": "COMBO",
-            "linkIds": [
-              19
-            ],
-            "pos": [
-              650,
-              500
-            ]
-          }
-        ],
-        "outputs": [
-          {
-            "id": "0c1768ea-3ec2-412f-9af6-8e0fa36dae70",
-            "name": "VIDEO",
-            "type": "VIDEO",
-            "linkIds": [
-              15
-            ],
-            "localized_name": "VIDEO",
-            "pos": [
-              1510,
-              480
-            ]
-          }
-        ],
-        "widgets": [],
-        "nodes": [
-          {
-            "id": 2,
-            "type": "ImageUpscaleWithModel",
-            "pos": [
-              1110,
-              450
-            ],
-            "size": [
-              320,
-              46
-            ],
-            "flags": {},
-            "order": 1,
-            "mode": 0,
-            "inputs": [
-              {
-                "localized_name": "upscale_model",
-                "name": "upscale_model",
-                "type": "UPSCALE_MODEL",
-                "link": 1
-              },
-              {
-                "localized_name": "image",
-                "name": "image",
-                "type": "IMAGE",
-                "link": 14
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "IMAGE",
-                "name": "IMAGE",
-                "type": "IMAGE",
-                "links": [
-                  13
-                ]
-              }
-            ],
-            "properties": {
-              "cnr_id": "comfy-core",
-              "ver": "0.10.0",
-              "Node name for S&R": "ImageUpscaleWithModel"
-            }
-          },
-          {
-            "id": 11,
-            "type": "CreateVideo",
-            "pos": [
-              1110,
-              550
-            ],
-            "size": [
-              320,
-              78
-            ],
-            "flags": {},
-            "order": 3,
-            "mode": 0,
-            "inputs": [
-              {
-                "localized_name": "images",
-                "name": "images",
-                "type": "IMAGE",
-                "link": 13
-              },
-              {
-                "localized_name": "audio",
-                "name": "audio",
-                "shape": 7,
-                "type": "AUDIO",
-                "link": 16
-              },
-              {
-                "localized_name": "fps",
-                "name": "fps",
-                "type": "FLOAT",
-                "widget": {
-                  "name": "fps"
-                },
-                "link": 12
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "VIDEO",
-                "name": "VIDEO",
-                "type": "VIDEO",
-                "links": [
-                  15
-                ]
-              }
-            ],
-            "properties": {
-              "cnr_id": "comfy-core",
-              "ver": "0.10.0",
-              "Node name for S&R": "CreateVideo"
-            },
-            "widgets_values": [
-              30
-            ]
-          },
-          {
-            "id": 10,
-            "type": "GetVideoComponents",
-            "pos": [
-              1110,
-              330
-            ],
-            "size": [
-              320,
-              70
-            ],
-            "flags": {},
-            "order": 2,
-            "mode": 0,
-            "inputs": [
-              {
-                "localized_name": "video",
-                "name": "video",
-                "type": "VIDEO",
-                "link": 10
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "images",
-                "name": "images",
-                "type": "IMAGE",
-                "links": [
-                  14
-                ]
-              },
-              {
-                "localized_name": "audio",
-                "name": "audio",
-                "type": "AUDIO",
-                "links": [
-                  16
-                ]
-              },
-              {
-                "localized_name": "fps",
-                "name": "fps",
-                "type": "FLOAT",
-                "links": [
-                  12
-                ]
-              }
-            ],
-            "properties": {
-              "cnr_id": "comfy-core",
-              "ver": "0.10.0",
-              "Node name for S&R": "GetVideoComponents"
-            }
-          },
-          {
-            "id": 1,
-            "type": "UpscaleModelLoader",
-            "pos": [
-              750,
-              450
-            ],
-            "size": [
-              280,
-              60
-            ],
-            "flags": {},
-            "order": 0,
-            "mode": 0,
-            "inputs": [
-              {
-                "localized_name": "model_name",
-                "name": "model_name",
-                "type": "COMBO",
-                "widget": {
-                  "name": "model_name"
-                },
-                "link": 19
-              }
-            ],
-            "outputs": [
-              {
-                "localized_name": "UPSCALE_MODEL",
-                "name": "UPSCALE_MODEL",
-                "type": "UPSCALE_MODEL",
-                "links": [
-                  1
-                ]
-              }
-            ],
-            "properties": {
-              "cnr_id": "comfy-core",
-              "ver": "0.10.0",
-              "Node name for S&R": "UpscaleModelLoader",
-              "models": [
-                {
-                  "name": "RealESRGAN_x4plus.safetensors",
-                  "url": "https://huggingface.co/Comfy-Org/Real-ESRGAN_repackaged/resolve/main/RealESRGAN_x4plus.safetensors",
-                  "directory": "upscale_models"
-                }
-              ]
-            },
-            "widgets_values": [
-              "RealESRGAN_x4plus.safetensors"
-            ]
-          }
-        ],
-        "groups": [],
-        "links": [
-          {
-            "id": 1,
-            "origin_id": 1,
-            "origin_slot": 0,
-            "target_id": 2,
-            "target_slot": 0,
-            "type": "UPSCALE_MODEL"
-          },
-          {
-            "id": 14,
-            "origin_id": 10,
-            "origin_slot": 0,
-            "target_id": 2,
-            "target_slot": 1,
-            "type": "IMAGE"
-          },
-          {
-            "id": 13,
-            "origin_id": 2,
-            "origin_slot": 0,
-            "target_id": 11,
-            "target_slot": 0,
-            "type": "IMAGE"
-          },
-          {
-            "id": 16,
-            "origin_id": 10,
-            "origin_slot": 1,
-            "target_id": 11,
-            "target_slot": 1,
-            "type": "AUDIO"
-          },
-          {
-            "id": 12,
-            "origin_id": 10,
-            "origin_slot": 2,
-            "target_id": 11,
-            "target_slot": 2,
-            "type": "FLOAT"
-          },
-          {
-            "id": 10,
-            "origin_id": -10,
-            "origin_slot": 0,
-            "target_id": 10,
-            "target_slot": 0,
-            "type": "VIDEO"
-          },
-          {
-            "id": 15,
-            "origin_id": 11,
-            "origin_slot": 0,
-            "target_id": -20,
-            "target_slot": 0,
-            "type": "VIDEO"
-          },
-          {
-            "id": 19,
-            "origin_id": -10,
-            "origin_slot": 1,
-            "target_id": 1,
-            "target_slot": 0,
-            "type": "COMBO"
-          }
-        ],
-        "extra": {
-          "workflowRendererVersion": "LG"
-        },
-        "category": "Video generation and editing/Enhance video"
-      }
-    ]
-  },
-  "extra": {}
-}
+{"revision": 0, "last_node_id": 13, "last_link_id": 0, "nodes": [{"id": 13, "type": "cf95b747-3e17-46cb-8097-cac60ff9b2e1", "pos": [1120, 330], "size": [240, 58], "flags": {}, "order": 3, "mode": 0, "inputs": [{"localized_name": "video", "name": "video", "type": "VIDEO", "link": null}, {"name": "model_name", "type": "COMBO", "widget": {"name": "model_name"}, "link": null}], "outputs": [{"localized_name": "VIDEO", "name": "VIDEO", "type": "VIDEO", "links": []}], "title": "Video Upscale(GAN x4)", "properties": {"proxyWidgets": [["-1", "model_name"]], "cnr_id": "comfy-core", "ver": "0.14.1"}, "widgets_values": ["RealESRGAN_x4plus.safetensors"]}], "links": [], "version": 0.4, "definitions": {"subgraphs": [{"id": "cf95b747-3e17-46cb-8097-cac60ff9b2e1", "version": 1, "state": {"lastGroupId": 0, "lastNodeId": 13, "lastLinkId": 19, "lastRerouteId": 0}, "revision": 0, "config": {}, "name": "Video Upscale(GAN x4)", "inputNode": {"id": -10, "bounding": [550, 460, 120, 80]}, "outputNode": {"id": -20, "bounding": [1490, 460, 120, 60]}, "inputs": [{"id": "666d633e-93e7-42dc-8d11-2b7b99b0f2a6", "name": "video", "type": "VIDEO", "linkIds": [10], "localized_name": "video", "pos": [650, 480]}, {"id": "2e23a087-caa8-4d65-99e6-662761aa905a", "name": "model_name", "type": "COMBO", "linkIds": [19], "pos": [650, 500]}], "outputs": [{"id": "0c1768ea-3ec2-412f-9af6-8e0fa36dae70", "name": "VIDEO", "type": "VIDEO", "linkIds": [15], "localized_name": "VIDEO", "pos": [1510, 480]}], "widgets": [], "nodes": [{"id": 2, "type": "ImageUpscaleWithModel", "pos": [1110, 450], "size": [320, 46], "flags": {}, "order": 1, "mode": 0, "inputs": [{"localized_name": "upscale_model", "name": "upscale_model", "type": "UPSCALE_MODEL", "link": 1}, {"localized_name": "image", "name": "image", "type": "IMAGE", "link": 14}], "outputs": [{"localized_name": "IMAGE", "name": "IMAGE", "type": "IMAGE", "links": [13]}], "properties": {"cnr_id": "comfy-core", "ver": "0.10.0", "Node name for S&R": "ImageUpscaleWithModel"}}, {"id": 11, "type": "CreateVideo", "pos": [1110, 550], "size": [320, 78], "flags": {}, "order": 3, "mode": 0, "inputs": [{"localized_name": "images", "name": "images", "type": "IMAGE", "link": 13}, {"localized_name": "audio", "name": "audio", "shape": 7, "type": "AUDIO", "link": 16}, {"localized_name": "fps", "name": "fps", "type": "FLOAT", "widget": {"name": "fps"}, "link": 12}], "outputs": [{"localized_name": "VIDEO", "name": "VIDEO", "type": "VIDEO", "links": [15]}], "properties": {"cnr_id": "comfy-core", "ver": "0.10.0", "Node name for S&R": "CreateVideo"}, "widgets_values": [30]}, {"id": 10, "type": "GetVideoComponents", "pos": [1110, 330], "size": [320, 70], "flags": {}, "order": 2, "mode": 0, "inputs": [{"localized_name": "video", "name": "video", "type": "VIDEO", "link": 10}], "outputs": [{"localized_name": "images", "name": "images", "type": "IMAGE", "links": [14]}, {"localized_name": "audio", "name": "audio", "type": "AUDIO", "links": [16]}, {"localized_name": "fps", "name": "fps", "type": "FLOAT", "links": [12]}], "properties": {"cnr_id": "comfy-core", "ver": "0.10.0", "Node name for S&R": "GetVideoComponents"}}, {"id": 1, "type": "UpscaleModelLoader", "pos": [750, 450], "size": [280, 60], "flags": {}, "order": 0, "mode": 0, "inputs": [{"localized_name": "model_name", "name": "model_name", "type": "COMBO", "widget": {"name": "model_name"}, "link": 19}], "outputs": [{"localized_name": "UPSCALE_MODEL", "name": "UPSCALE_MODEL", "type": "UPSCALE_MODEL", "links": [1]}], "properties": {"cnr_id": "comfy-core", "ver": "0.10.0", "Node name for S&R": "UpscaleModelLoader", "models": [{"name": "RealESRGAN_x4plus.safetensors", "url": "https://huggingface.co/Comfy-Org/Real-ESRGAN_repackaged/resolve/main/RealESRGAN_x4plus.safetensors", "directory": "upscale_models"}]}, "widgets_values": ["RealESRGAN_x4plus.safetensors"]}], "groups": [], "links": [{"id": 1, "origin_id": 1, "origin_slot": 0, "target_id": 2, "target_slot": 0, "type": "UPSCALE_MODEL"}, {"id": 14, "origin_id": 10, "origin_slot": 0, "target_id": 2, "target_slot": 1, "type": "IMAGE"}, {"id": 13, "origin_id": 2, "origin_slot": 0, "target_id": 11, "target_slot": 0, "type": "IMAGE"}, {"id": 16, "origin_id": 10, "origin_slot": 1, "target_id": 11, "target_slot": 1, "type": "AUDIO"}, {"id": 12, "origin_id": 10, "origin_slot": 2, "target_id": 11, "target_slot": 2, "type": "FLOAT"}, {"id": 10, "origin_id": -10, "origin_slot": 0, "target_id": 10, "target_slot": 0, "type": "VIDEO"}, {"id": 15, "origin_id": 11, "origin_slot": 0, "target_id": -20, "target_slot": 0, "type": "VIDEO"}, {"id": 19, "origin_id": -10, "origin_slot": 1, "target_id": 1, "target_slot": 0, "type": "COMBO"}], "extra": {"workflowRendererVersion": "LG"}, "category": "Video generation and editing/Enhance video"}]}, "extra": {}}
--- a/comfy/cli_args.py
+++ b/comfy/cli_args.py
@@ -110,13 +110,11 @@ parser.add_argument("--preview-method", type=LatentPreviewMethod, default=Latent

 parser.add_argument("--preview-size", type=int, default=512, help="Sets the maximum preview size for sampler nodes.")

-CACHE_RAM_AUTO_GB = -1.0
-
 cache_group = parser.add_mutually_exclusive_group()
 cache_group.add_argument("--cache-classic", action="store_true", help="Use the old style (aggressive) caching.")
 cache_group.add_argument("--cache-lru", type=int, default=0, help="Use LRU caching with a maximum of N node results cached. May use more RAM/VRAM.")
 cache_group.add_argument("--cache-none", action="store_true", help="Reduced RAM/VRAM usage at the expense of executing every node for each run.")
-cache_group.add_argument("--cache-ram", nargs='?', const=CACHE_RAM_AUTO_GB, type=float, default=0, help="Use RAM pressure caching with the specified headroom threshold. If available RAM drops below the threshold the cache removes large items to free RAM. Default (when no value is provided): 25%% of system RAM (min 4GB, max 32GB).")
+cache_group.add_argument("--cache-ram", nargs='?', const=4.0, type=float, default=0, help="Use RAM pressure caching with the specified headroom threshold. If available RAM drops below the threhold the cache remove large items to free RAM. Default 4GB")

 attn_group = parser.add_mutually_exclusive_group()
 attn_group.add_argument("--use-split-cross-attention", action="store_true", help="Use the split cross attention optimization. Ignored when xformers is used.")
--- a/comfy/latent_formats.py
+++ b/comfy/latent_formats.py
@@ -224,7 +224,6 @@ class Flux2(LatentFormat):

        self.latent_rgb_factors_bias = [-0.0329, -0.0718, -0.0851]
        self.latent_rgb_factors_reshape = lambda t: t.reshape(t.shape[0], 32, 2, 2, t.shape[-2], t.shape[-1]).permute(0, 1, 4, 2, 5, 3).reshape(t.shape[0], 32, t.shape[-2] * 2, t.shape[-1] * 2)
-        self.taesd_decoder_name = "taef2_decoder"

    def process_in(self, latent):
        return latent
@@ -784,28 +783,3 @@ class ZImagePixelSpace(ChromaRadiance):
    No VAE encoding/decoding — the model operates directly on RGB pixels.
    """
    pass
-
-class CogVideoX(LatentFormat):
-    """Latent format for CogVideoX-2b (THUDM/CogVideoX-2b).
-
-    scale_factor matches the vae/config.json scaling_factor for the 2b variant.
-    The 5b-class checkpoints (CogVideoX-5b, CogVideoX-1.5-5B, CogVideoX-Fun-V1.5-*)
-    use a different value; see CogVideoX1_5 below.
-    """
-
-    latent_channels = 16
-    latent_dimensions = 3
-
-    def __init__(self):
-        self.scale_factor = 1.15258426
-
-class CogVideoX1_5(CogVideoX):
-    """Latent format for 5b-class CogVideoX checkpoints.
-
-    Covers THUDM/CogVideoX-5b, THUDM/CogVideoX-1.5-5B, and the CogVideoX-Fun
-    V1.5-5b family (including VOID inpainting). All of these have
-    scaling_factor=0.7 in their vae/config.json. Auto-selected in
-    supported_models.CogVideoX_T2V based on transformer hidden dim.
-    """
-    def __init__(self):
-        self.scale_factor = 0.7
--- a/comfy/ldm/ace/ace_step15.py
+++ b/comfy/ldm/ace/ace_step15.py
@@ -611,7 +611,6 @@ class AceStepDiTModel(nn.Module):
        intermediate_size,
        patch_size,
        audio_acoustic_hidden_dim,
-        condition_dim=None,
        layer_types=None,
        sliding_window=128,
        rms_norm_eps=1e-6,
@@ -641,7 +640,7 @@ class AceStepDiTModel(nn.Module):

        self.time_embed = TimestepEmbedding(256, hidden_size, dtype=dtype, device=device, operations=operations)
        self.time_embed_r = TimestepEmbedding(256, hidden_size, dtype=dtype, device=device, operations=operations)
-        self.condition_embedder = Linear(condition_dim, hidden_size, dtype=dtype, device=device)
+        self.condition_embedder = Linear(hidden_size, hidden_size, dtype=dtype, device=device)

        if layer_types is None:
            layer_types = ["full_attention"] * num_layers
@@ -1036,9 +1035,6 @@ class AceStepConditionGenerationModel(nn.Module):
        fsq_dim=2048,
        fsq_levels=[8, 8, 8, 5, 5, 5],
        fsq_input_num_quantizers=1,
-        encoder_hidden_size=2048,
-        encoder_intermediate_size=6144,
-        encoder_num_heads=16,
        audio_model=None,
        dtype=None,
        device=None,
@@ -1058,24 +1054,24 @@ class AceStepConditionGenerationModel(nn.Module):

        self.decoder = AceStepDiTModel(
            in_channels, hidden_size, num_dit_layers, num_heads, num_kv_heads, head_dim,
-            intermediate_size, patch_size, audio_acoustic_hidden_dim, condition_dim=encoder_hidden_size,
+            intermediate_size, patch_size, audio_acoustic_hidden_dim,
            layer_types=layer_types, sliding_window=sliding_window, rms_norm_eps=rms_norm_eps,
            dtype=dtype, device=device, operations=operations
        )
        self.encoder = AceStepConditionEncoder(
-            text_hidden_dim, timbre_hidden_dim, encoder_hidden_size, num_lyric_layers, num_timbre_layers,
-            encoder_num_heads, num_kv_heads, head_dim, encoder_intermediate_size, rms_norm_eps,
+            text_hidden_dim, timbre_hidden_dim, hidden_size, num_lyric_layers, num_timbre_layers,
+            num_heads, num_kv_heads, head_dim, intermediate_size, rms_norm_eps,
            dtype=dtype, device=device, operations=operations
        )
        self.tokenizer = AceStepAudioTokenizer(
-            audio_acoustic_hidden_dim, encoder_hidden_size, pool_window_size, fsq_dim=fsq_dim, fsq_levels=fsq_levels, fsq_input_num_quantizers=fsq_input_num_quantizers, num_layers=num_tokenizer_layers, head_dim=head_dim, rms_norm_eps=rms_norm_eps,
+            audio_acoustic_hidden_dim, hidden_size, pool_window_size, fsq_dim=fsq_dim, fsq_levels=fsq_levels, fsq_input_num_quantizers=fsq_input_num_quantizers, num_layers=num_tokenizer_layers, head_dim=head_dim, rms_norm_eps=rms_norm_eps,
            dtype=dtype, device=device, operations=operations
        )
        self.detokenizer = AudioTokenDetokenizer(
-            encoder_hidden_size, pool_window_size, audio_acoustic_hidden_dim, num_layers=2, head_dim=head_dim,
+            hidden_size, pool_window_size, audio_acoustic_hidden_dim, num_layers=2, head_dim=head_dim,
            dtype=dtype, device=device, operations=operations
        )
-        self.null_condition_emb = nn.Parameter(torch.empty(1, 1, encoder_hidden_size, dtype=dtype, device=device))
+        self.null_condition_emb = nn.Parameter(torch.empty(1, 1, hidden_size, dtype=dtype, device=device))

    def prepare_condition(
        self,
--- a/comfy/ldm/cogvideo/init.py
+++ b/comfy/ldm/cogvideo/init.py
--- a/comfy/ldm/cogvideo/model.py
+++ b/comfy/ldm/cogvideo/model.py
@@ -1,573 +0,0 @@
-# CogVideoX 3D Transformer - ported to ComfyUI native ops
-# Architecture reference: diffusers CogVideoXTransformer3DModel
-# Style reference: comfy/ldm/wan/model.py
-
-import math
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-from comfy.ldm.modules.attention import optimized_attention
-import comfy.patcher_extension
-import comfy.ldm.common_dit
-
-
-def _get_1d_rotary_pos_embed(dim, pos, theta=10000.0):
-    """Returns (cos, sin) each with shape [seq_len, dim].
-
-    Frequencies are computed at dim//2 resolution then repeat_interleaved
-    to full dim, matching CogVideoX's interleaved (real, imag) pair format.
-    """
-    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32, device=pos.device) / dim))
-    angles = torch.outer(pos.float(), freqs.float())
-    cos = angles.cos().repeat_interleave(2, dim=-1).float()
-    sin = angles.sin().repeat_interleave(2, dim=-1).float()
-    return (cos, sin)
-
-
-def apply_rotary_emb(x, freqs_cos_sin):
-    """Apply CogVideoX rotary embedding to query or key tensor.
-
-    x: [B, heads, seq_len, head_dim]
-    freqs_cos_sin: (cos, sin) each [seq_len, head_dim//2]
-
-    Uses interleaved pair rotation (same as diffusers CogVideoX/Flux).
-    head_dim is reshaped to (-1, 2) pairs, rotated, then flattened back.
-    """
-    cos, sin = freqs_cos_sin
-    cos = cos[None, None, :, :].to(x.device)
-    sin = sin[None, None, :, :].to(x.device)
-
-    # Interleaved pairs: [B, H, S, D] -> [B, H, S, D//2, 2] -> (real, imag)
-    x_real, x_imag = x.reshape(*x.shape[:-1], -1, 2).unbind(-1)
-    x_rotated = torch.stack([-x_imag, x_real], dim=-1).flatten(3)
-
-    return (x.float() * cos + x_rotated.float() * sin).to(x.dtype)
-
-
-def get_timestep_embedding(timesteps, dim, flip_sin_to_cos=True, downscale_freq_shift=0, scale=1, max_period=10000):
-    half = dim // 2
-    freqs = torch.exp(-math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32, device=timesteps.device) / half)
-    args = timesteps[:, None].float() * freqs[None] * scale
-    embedding = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
-    if flip_sin_to_cos:
-        embedding = torch.cat([embedding[:, half:], embedding[:, :half]], dim=-1)
-    if dim % 2:
-        embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
-    return embedding
-
-
-def get_3d_sincos_pos_embed(embed_dim, spatial_size, temporal_size, spatial_interpolation_scale=1.0, temporal_interpolation_scale=1.0, device=None):
-    if isinstance(spatial_size, int):
-        spatial_size = (spatial_size, spatial_size)
-
-    grid_w = torch.arange(spatial_size[0], dtype=torch.float32, device=device) / spatial_interpolation_scale
-    grid_h = torch.arange(spatial_size[1], dtype=torch.float32, device=device) / spatial_interpolation_scale
-    grid_t = torch.arange(temporal_size, dtype=torch.float32, device=device) / temporal_interpolation_scale
-
-    grid_t, grid_h, grid_w = torch.meshgrid(grid_t, grid_h, grid_w, indexing="ij")
-
-    embed_dim_spatial = 2 * (embed_dim // 3)
-    embed_dim_temporal = embed_dim // 3
-
-    pos_embed_spatial = _get_2d_sincos_pos_embed(embed_dim_spatial, grid_h, grid_w, device=device)
-    pos_embed_temporal = _get_1d_sincos_pos_embed(embed_dim_temporal, grid_t[:, 0, 0], device=device)
-
-    T, H, W = grid_t.shape
-    pos_embed_temporal = pos_embed_temporal.unsqueeze(1).unsqueeze(1).expand(-1, H, W, -1)
-    pos_embed = torch.cat([pos_embed_temporal, pos_embed_spatial], dim=-1)
-
-    return pos_embed
-
-
-def _get_2d_sincos_pos_embed(embed_dim, grid_h, grid_w, device=None):
-    T, H, W = grid_h.shape
-    half_dim = embed_dim // 2
-    pos_h = _get_1d_sincos_pos_embed(half_dim, grid_h.reshape(-1), device=device).reshape(T, H, W, half_dim)
-    pos_w = _get_1d_sincos_pos_embed(half_dim, grid_w.reshape(-1), device=device).reshape(T, H, W, half_dim)
-    return torch.cat([pos_h, pos_w], dim=-1)
-
-
-def _get_1d_sincos_pos_embed(embed_dim, pos, device=None):
-    half = embed_dim // 2
-    freqs = torch.exp(-math.log(10000.0) * torch.arange(start=0, end=half, dtype=torch.float32, device=device) / half)
-    args = pos.float().reshape(-1)[:, None] * freqs[None]
-    embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
-    if embed_dim % 2:
-        embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
-    return embedding
-
-
-
-class CogVideoXPatchEmbed(nn.Module):
-    def __init__(self, patch_size=2, patch_size_t=None, in_channels=16, dim=1920,
-                 text_dim=4096, bias=True, sample_width=90, sample_height=60,
-                 sample_frames=49, temporal_compression_ratio=4,
-                 max_text_seq_length=226, spatial_interpolation_scale=1.875,
-                 temporal_interpolation_scale=1.0, use_positional_embeddings=True,
-                 use_learned_positional_embeddings=True,
-                 device=None, dtype=None, operations=None):
-        super().__init__()
-        self.patch_size = patch_size
-        self.patch_size_t = patch_size_t
-        self.dim = dim
-        self.sample_height = sample_height
-        self.sample_width = sample_width
-        self.sample_frames = sample_frames
-        self.temporal_compression_ratio = temporal_compression_ratio
-        self.max_text_seq_length = max_text_seq_length
-        self.spatial_interpolation_scale = spatial_interpolation_scale
-        self.temporal_interpolation_scale = temporal_interpolation_scale
-        self.use_positional_embeddings = use_positional_embeddings
-        self.use_learned_positional_embeddings = use_learned_positional_embeddings
-
-        if patch_size_t is None:
-            self.proj = operations.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size, bias=bias, device=device, dtype=dtype)
-        else:
-            self.proj = operations.Linear(in_channels * patch_size * patch_size * patch_size_t, dim, device=device, dtype=dtype)
-
-        self.text_proj = operations.Linear(text_dim, dim, device=device, dtype=dtype)
-
-        if use_positional_embeddings or use_learned_positional_embeddings:
-            persistent = use_learned_positional_embeddings
-            pos_embedding = self._get_positional_embeddings(sample_height, sample_width, sample_frames)
-            self.register_buffer("pos_embedding", pos_embedding, persistent=persistent)
-
-    def _get_positional_embeddings(self, sample_height, sample_width, sample_frames, device=None):
-        post_patch_height = sample_height // self.patch_size
-        post_patch_width = sample_width // self.patch_size
-        post_time_compression_frames = (sample_frames - 1) // self.temporal_compression_ratio + 1
-        if self.patch_size_t is not None:
-            post_time_compression_frames = post_time_compression_frames // self.patch_size_t
-        num_patches = post_patch_height * post_patch_width * post_time_compression_frames
-
-        pos_embedding = get_3d_sincos_pos_embed(
-            self.dim,
-            (post_patch_width, post_patch_height),
-            post_time_compression_frames,
-            self.spatial_interpolation_scale,
-            self.temporal_interpolation_scale,
-            device=device,
-        )
-        pos_embedding = pos_embedding.reshape(-1, self.dim)
-        joint_pos_embedding = pos_embedding.new_zeros(
-            1, self.max_text_seq_length + num_patches, self.dim, requires_grad=False
-        )
-        joint_pos_embedding.data[:, self.max_text_seq_length:].copy_(pos_embedding)
-        return joint_pos_embedding
-
-    def forward(self, text_embeds, image_embeds):
-        input_dtype = text_embeds.dtype
-        text_embeds = self.text_proj(text_embeds.to(self.text_proj.weight.dtype)).to(input_dtype)
-        batch_size, num_frames, channels, height, width = image_embeds.shape
-
-        proj_dtype = self.proj.weight.dtype
-        if self.patch_size_t is None:
-            image_embeds = image_embeds.reshape(-1, channels, height, width)
-            image_embeds = self.proj(image_embeds.to(proj_dtype)).to(input_dtype)
-            image_embeds = image_embeds.view(batch_size, num_frames, *image_embeds.shape[1:])
-            image_embeds = image_embeds.flatten(3).transpose(2, 3)
-            image_embeds = image_embeds.flatten(1, 2)
-        else:
-            p = self.patch_size
-            p_t = self.patch_size_t
-            image_embeds = image_embeds.permute(0, 1, 3, 4, 2)
-            image_embeds = image_embeds.reshape(
-                batch_size, num_frames // p_t, p_t, height // p, p, width // p, p, channels
-            )
-            image_embeds = image_embeds.permute(0, 1, 3, 5, 7, 2, 4, 6).flatten(4, 7).flatten(1, 3)
-            image_embeds = self.proj(image_embeds.to(proj_dtype)).to(input_dtype)
-
-        embeds = torch.cat([text_embeds, image_embeds], dim=1).contiguous()
-
-        if self.use_positional_embeddings or self.use_learned_positional_embeddings:
-            text_seq_length = text_embeds.shape[1]
-            num_image_patches = image_embeds.shape[1]
-
-            if self.use_learned_positional_embeddings:
-                image_pos = self.pos_embedding[
-                    :, self.max_text_seq_length:self.max_text_seq_length + num_image_patches
-                ].to(device=embeds.device, dtype=embeds.dtype)
-            else:
-                image_pos = get_3d_sincos_pos_embed(
-                    self.dim,
-                    (width // self.patch_size, height // self.patch_size),
-                    num_image_patches // ((height // self.patch_size) * (width // self.patch_size)),
-                    self.spatial_interpolation_scale,
-                    self.temporal_interpolation_scale,
-                    device=embeds.device,
-                ).reshape(1, num_image_patches, self.dim).to(dtype=embeds.dtype)
-
-            # Build joint: zeros for text + sincos for image
-            joint_pos = torch.zeros(1, text_seq_length + num_image_patches, self.dim, device=embeds.device, dtype=embeds.dtype)
-            joint_pos[:, text_seq_length:] = image_pos
-            embeds = embeds + joint_pos
-
-        return embeds
-
-
-class CogVideoXLayerNormZero(nn.Module):
-    def __init__(self, time_dim, dim, elementwise_affine=True, eps=1e-5, bias=True,
-                 device=None, dtype=None, operations=None):
-        super().__init__()
-        self.silu = nn.SiLU()
-        self.linear = operations.Linear(time_dim, 6 * dim, bias=bias, device=device, dtype=dtype)
-        self.norm = operations.LayerNorm(dim, eps=eps, elementwise_affine=elementwise_affine, device=device, dtype=dtype)
-
-    def forward(self, hidden_states, encoder_hidden_states, temb):
-        shift, scale, gate, enc_shift, enc_scale, enc_gate = self.linear(self.silu(temb)).chunk(6, dim=1)
-        hidden_states = self.norm(hidden_states) * (1 + scale)[:, None, :] + shift[:, None, :]
-        encoder_hidden_states = self.norm(encoder_hidden_states) * (1 + enc_scale)[:, None, :] + enc_shift[:, None, :]
-        return hidden_states, encoder_hidden_states, gate[:, None, :], enc_gate[:, None, :]
-
-
-class CogVideoXAdaLayerNorm(nn.Module):
-    def __init__(self, time_dim, dim, elementwise_affine=True, eps=1e-5,
-                 device=None, dtype=None, operations=None):
-        super().__init__()
-        self.silu = nn.SiLU()
-        self.linear = operations.Linear(time_dim, 2 * dim, device=device, dtype=dtype)
-        self.norm = operations.LayerNorm(dim, eps=eps, elementwise_affine=elementwise_affine, device=device, dtype=dtype)
-
-    def forward(self, x, temb):
-        temb = self.linear(self.silu(temb))
-        shift, scale = temb.chunk(2, dim=1)
-        x = self.norm(x) * (1 + scale)[:, None, :] + shift[:, None, :]
-        return x
-
-
-class CogVideoXBlock(nn.Module):
-    def __init__(self, dim, num_heads, head_dim, time_dim,
-                 eps=1e-5, ff_inner_dim=None, ff_bias=True,
-                 device=None, dtype=None, operations=None):
-        super().__init__()
-        self.dim = dim
-        self.num_heads = num_heads
-        self.head_dim = head_dim
-
-        self.norm1 = CogVideoXLayerNormZero(time_dim, dim, eps=eps, device=device, dtype=dtype, operations=operations)
-
-        # Self-attention (joint text + latent)
-        self.q = operations.Linear(dim, dim, bias=True, device=device, dtype=dtype)
-        self.k = operations.Linear(dim, dim, bias=True, device=device, dtype=dtype)
-        self.v = operations.Linear(dim, dim, bias=True, device=device, dtype=dtype)
-        self.norm_q = operations.LayerNorm(head_dim, eps=1e-6, elementwise_affine=True, device=device, dtype=dtype)
-        self.norm_k = operations.LayerNorm(head_dim, eps=1e-6, elementwise_affine=True, device=device, dtype=dtype)
-        self.attn_out = operations.Linear(dim, dim, bias=True, device=device, dtype=dtype)
-
-        self.norm2 = CogVideoXLayerNormZero(time_dim, dim, eps=eps, device=device, dtype=dtype, operations=operations)
-
-        # Feed-forward (GELU approximate)
-        inner_dim = ff_inner_dim or dim * 4
-        self.ff_proj = operations.Linear(dim, inner_dim, bias=ff_bias, device=device, dtype=dtype)
-        self.ff_out = operations.Linear(inner_dim, dim, bias=ff_bias, device=device, dtype=dtype)
-
-    def forward(self, hidden_states, encoder_hidden_states, temb, image_rotary_emb=None, transformer_options=None):
-        if transformer_options is None:
-            transformer_options = {}
-        text_seq_length = encoder_hidden_states.size(1)
-
-        # Norm & modulate
-        norm_hidden, norm_encoder, gate_msa, enc_gate_msa = self.norm1(hidden_states, encoder_hidden_states, temb)
-
-        # Joint self-attention
-        qkv_input = torch.cat([norm_encoder, norm_hidden], dim=1)
-        b, s, _ = qkv_input.shape
-        n, d = self.num_heads, self.head_dim
-
-        q = self.q(qkv_input).view(b, s, n, d)
-        k = self.k(qkv_input).view(b, s, n, d)
-        v = self.v(qkv_input)
-
-        q = self.norm_q(q).view(b, s, n, d)
-        k = self.norm_k(k).view(b, s, n, d)
-
-        # Apply rotary embeddings to image tokens only (diffusers format: [B, heads, seq, head_dim])
-        if image_rotary_emb is not None:
-            q_img = q[:, text_seq_length:].transpose(1, 2)  # [B, heads, img_seq, head_dim]
-            k_img = k[:, text_seq_length:].transpose(1, 2)
-            q_img = apply_rotary_emb(q_img, image_rotary_emb)
-            k_img = apply_rotary_emb(k_img, image_rotary_emb)
-            q = torch.cat([q[:, :text_seq_length], q_img.transpose(1, 2)], dim=1)
-            k = torch.cat([k[:, :text_seq_length], k_img.transpose(1, 2)], dim=1)
-
-        attn_out = optimized_attention(
-            q.reshape(b, s, n * d),
-            k.reshape(b, s, n * d),
-            v,
-            heads=self.num_heads,
-            transformer_options=transformer_options,
-        )
-
-        attn_out = self.attn_out(attn_out)
-
-        attn_encoder, attn_hidden = attn_out.split([text_seq_length, s - text_seq_length], dim=1)
-
-        hidden_states = hidden_states + gate_msa * attn_hidden
-        encoder_hidden_states = encoder_hidden_states + enc_gate_msa * attn_encoder
-
-        # Norm & modulate for FF
-        norm_hidden, norm_encoder, gate_ff, enc_gate_ff = self.norm2(hidden_states, encoder_hidden_states, temb)
-
-        # Feed-forward (GELU on concatenated text + latent)
-        ff_input = torch.cat([norm_encoder, norm_hidden], dim=1)
-        ff_output = self.ff_out(F.gelu(self.ff_proj(ff_input), approximate="tanh"))
-
-        hidden_states = hidden_states + gate_ff * ff_output[:, text_seq_length:]
-        encoder_hidden_states = encoder_hidden_states + enc_gate_ff * ff_output[:, :text_seq_length]
-
-        return hidden_states, encoder_hidden_states
-
-
-class CogVideoXTransformer3DModel(nn.Module):
-    def __init__(self,
-                 num_attention_heads=30,
-                 attention_head_dim=64,
-                 in_channels=16,
-                 out_channels=16,
-                 flip_sin_to_cos=True,
-                 freq_shift=0,
-                 time_embed_dim=512,
-                 ofs_embed_dim=None,
-                 text_embed_dim=4096,
-                 num_layers=30,
-                 dropout=0.0,
-                 attention_bias=True,
-                 sample_width=90,
-                 sample_height=60,
-                 sample_frames=49,
-                 patch_size=2,
-                 patch_size_t=None,
-                 temporal_compression_ratio=4,
-                 max_text_seq_length=226,
-                 spatial_interpolation_scale=1.875,
-                 temporal_interpolation_scale=1.0,
-                 use_rotary_positional_embeddings=False,
-                 use_learned_positional_embeddings=False,
-                 patch_bias=True,
-                 image_model=None,
-                 device=None,
-                 dtype=None,
-                 operations=None,
-                 ):
-        super().__init__()
-        self.dtype = dtype
-        dim = num_attention_heads * attention_head_dim
-        self.dim = dim
-        self.num_attention_heads = num_attention_heads
-        self.attention_head_dim = attention_head_dim
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.patch_size = patch_size
-        self.patch_size_t = patch_size_t
-        self.max_text_seq_length = max_text_seq_length
-        self.use_rotary_positional_embeddings = use_rotary_positional_embeddings
-
-        # 1. Patch embedding
-        self.patch_embed = CogVideoXPatchEmbed(
-            patch_size=patch_size,
-            patch_size_t=patch_size_t,
-            in_channels=in_channels,
-            dim=dim,
-            text_dim=text_embed_dim,
-            bias=patch_bias,
-            sample_width=sample_width,
-            sample_height=sample_height,
-            sample_frames=sample_frames,
-            temporal_compression_ratio=temporal_compression_ratio,
-            max_text_seq_length=max_text_seq_length,
-            spatial_interpolation_scale=spatial_interpolation_scale,
-            temporal_interpolation_scale=temporal_interpolation_scale,
-            use_positional_embeddings=not use_rotary_positional_embeddings,
-            use_learned_positional_embeddings=use_learned_positional_embeddings,
-            device=device, dtype=torch.float32, operations=operations,
-        )
-
-        # 2. Time embedding
-        self.time_proj_dim = dim
-        self.time_proj_flip = flip_sin_to_cos
-        self.time_proj_shift = freq_shift
-        self.time_embedding_linear_1 = operations.Linear(dim, time_embed_dim, device=device, dtype=dtype)
-        self.time_embedding_act = nn.SiLU()
-        self.time_embedding_linear_2 = operations.Linear(time_embed_dim, time_embed_dim, device=device, dtype=dtype)
-
-        # Optional OFS embedding (CogVideoX 1.5 I2V)
-        self.ofs_proj_dim = ofs_embed_dim
-        if ofs_embed_dim:
-            self.ofs_embedding_linear_1 = operations.Linear(ofs_embed_dim, ofs_embed_dim, device=device, dtype=dtype)
-            self.ofs_embedding_act = nn.SiLU()
-            self.ofs_embedding_linear_2 = operations.Linear(ofs_embed_dim, ofs_embed_dim, device=device, dtype=dtype)
-        else:
-            self.ofs_embedding_linear_1 = None
-
-        # 3. Transformer blocks
-        self.blocks = nn.ModuleList([
-            CogVideoXBlock(
-                dim=dim,
-                num_heads=num_attention_heads,
-                head_dim=attention_head_dim,
-                time_dim=time_embed_dim,
-                eps=1e-5,
-                device=device, dtype=dtype, operations=operations,
-            )
-            for _ in range(num_layers)
-        ])
-
-        self.norm_final = operations.LayerNorm(dim, eps=1e-5, elementwise_affine=True, device=device, dtype=dtype)
-
-        # 4. Output
-        self.norm_out = CogVideoXAdaLayerNorm(
-            time_dim=time_embed_dim, dim=dim, eps=1e-5,
-            device=device, dtype=dtype, operations=operations,
-        )
-
-        if patch_size_t is None:
-            output_dim = patch_size * patch_size * out_channels
-        else:
-            output_dim = patch_size * patch_size * patch_size_t * out_channels
-
-        self.proj_out = operations.Linear(dim, output_dim, device=device, dtype=dtype)
-
-        self.spatial_interpolation_scale = spatial_interpolation_scale
-        self.temporal_interpolation_scale = temporal_interpolation_scale
-        self.temporal_compression_ratio = temporal_compression_ratio
-
-    def forward(self, x, timestep, context, ofs=None, transformer_options=None, **kwargs):
-        if transformer_options is None:
-            transformer_options = {}
-        return comfy.patcher_extension.WrapperExecutor.new_class_executor(
-            self._forward,
-            self,
-            comfy.patcher_extension.get_all_wrappers(comfy.patcher_extension.WrappersMP.DIFFUSION_MODEL, transformer_options)
-        ).execute(x, timestep, context, ofs, transformer_options, **kwargs)
-
-    def _forward(self, x, timestep, context, ofs=None, transformer_options=None, **kwargs):
-        if transformer_options is None:
-            transformer_options = {}
-        # ComfyUI passes [B, C, T, H, W]
-        batch_size, channels, t, h, w = x.shape
-
-        # Pad to patch size (temporal + spatial), same pattern as WAN
-        p_t = self.patch_size_t if self.patch_size_t is not None else 1
-        x = comfy.ldm.common_dit.pad_to_patch_size(x, (p_t, self.patch_size, self.patch_size))
-
-        # CogVideoX expects [B, T, C, H, W]
-        x = x.permute(0, 2, 1, 3, 4)
-        batch_size, num_frames, channels, height, width = x.shape
-
-        # Time embedding
-        t_emb = get_timestep_embedding(timestep, self.time_proj_dim, self.time_proj_flip, self.time_proj_shift)
-        t_emb = t_emb.to(dtype=x.dtype)
-        emb = self.time_embedding_linear_2(self.time_embedding_act(self.time_embedding_linear_1(t_emb)))
-
-        if self.ofs_embedding_linear_1 is not None and ofs is not None:
-            ofs_emb = get_timestep_embedding(ofs, self.ofs_proj_dim, self.time_proj_flip, self.time_proj_shift)
-            ofs_emb = ofs_emb.to(dtype=x.dtype)
-            ofs_emb = self.ofs_embedding_linear_2(self.ofs_embedding_act(self.ofs_embedding_linear_1(ofs_emb)))
-            emb = emb + ofs_emb
-
-        # Patch embedding
-        hidden_states = self.patch_embed(context, x)
-
-        text_seq_length = context.shape[1]
-        encoder_hidden_states = hidden_states[:, :text_seq_length]
-        hidden_states = hidden_states[:, text_seq_length:]
-
-        # Rotary embeddings (if used)
-        image_rotary_emb = None
-        if self.use_rotary_positional_embeddings:
-            post_patch_height = height // self.patch_size
-            post_patch_width = width // self.patch_size
-            if self.patch_size_t is None:
-                post_time = num_frames
-            else:
-                post_time = num_frames // self.patch_size_t
-            image_rotary_emb = self._get_rotary_emb(post_patch_height, post_patch_width, post_time, device=x.device)
-
-        # Transformer blocks
-        for i, block in enumerate(self.blocks):
-            hidden_states, encoder_hidden_states = block(
-                hidden_states=hidden_states,
-                encoder_hidden_states=encoder_hidden_states,
-                temb=emb,
-                image_rotary_emb=image_rotary_emb,
-                transformer_options=transformer_options,
-            )
-
-        hidden_states = self.norm_final(hidden_states)
-
-        # Output projection
-        hidden_states = self.norm_out(hidden_states, temb=emb)
-        hidden_states = self.proj_out(hidden_states)
-
-        # Unpatchify
-        p = self.patch_size
-        p_t = self.patch_size_t
-
-        if p_t is None:
-            output = hidden_states.reshape(batch_size, num_frames, height // p, width // p, -1, p, p)
-            output = output.permute(0, 1, 4, 2, 5, 3, 6).flatten(5, 6).flatten(3, 4)
-        else:
-            output = hidden_states.reshape(
-                batch_size, (num_frames + p_t - 1) // p_t, height // p, width // p, -1, p_t, p, p
-            )
-            output = output.permute(0, 1, 5, 4, 2, 6, 3, 7).flatten(6, 7).flatten(4, 5).flatten(1, 2)
-
-        # Back to ComfyUI format [B, C, T, H, W] and crop padding
-        output = output.permute(0, 2, 1, 3, 4)[:, :, :t, :h, :w]
-        return output
-
-    def _get_rotary_emb(self, h, w, t, device):
-        """Compute CogVideoX 3D rotary positional embeddings.
-
-        For CogVideoX 1.5 (patch_size_t != None): uses "slice" mode — grid positions
-        are integer arange computed at max_size, then sliced to actual size.
-        For CogVideoX 1.0 (patch_size_t == None): uses "linspace" mode with crop coords
-        scaled by spatial_interpolation_scale.
-        """
-        d = self.attention_head_dim
-        dim_t = d // 4
-        dim_h = d // 8 * 3
-        dim_w = d // 8 * 3
-
-        if self.patch_size_t is not None:
-            # CogVideoX 1.5: "slice" mode — positions are simple integer indices
-            # Compute at max(sample_size, actual_size) then slice to actual
-            base_h = self.patch_embed.sample_height // self.patch_size
-            base_w = self.patch_embed.sample_width // self.patch_size
-            max_h = max(base_h, h)
-            max_w = max(base_w, w)
-
-            grid_h = torch.arange(max_h, device=device, dtype=torch.float32)
-            grid_w = torch.arange(max_w, device=device, dtype=torch.float32)
-            grid_t = torch.arange(t, device=device, dtype=torch.float32)
-        else:
-            # CogVideoX 1.0: "linspace" mode with interpolation scale
-            grid_h = torch.linspace(0, h - 1, h, device=device, dtype=torch.float32) * self.spatial_interpolation_scale
-            grid_w = torch.linspace(0, w - 1, w, device=device, dtype=torch.float32) * self.spatial_interpolation_scale
-            grid_t = torch.arange(t, device=device, dtype=torch.float32)
-
-        freqs_t = _get_1d_rotary_pos_embed(dim_t, grid_t)
-        freqs_h = _get_1d_rotary_pos_embed(dim_h, grid_h)
-        freqs_w = _get_1d_rotary_pos_embed(dim_w, grid_w)
-
-        t_cos, t_sin = freqs_t
-        h_cos, h_sin = freqs_h
-        w_cos, w_sin = freqs_w
-
-        # Slice to actual size (for "slice" mode where grids may be larger)
-        t_cos, t_sin = t_cos[:t], t_sin[:t]
-        h_cos, h_sin = h_cos[:h], h_sin[:h]
-        w_cos, w_sin = w_cos[:w], w_sin[:w]
-
-        # Broadcast and concatenate into [T*H*W, head_dim]
-        t_cos = t_cos[:, None, None, :].expand(-1, h, w, -1)
-        t_sin = t_sin[:, None, None, :].expand(-1, h, w, -1)
-        h_cos = h_cos[None, :, None, :].expand(t, -1, w, -1)
-        h_sin = h_sin[None, :, None, :].expand(t, -1, w, -1)
-        w_cos = w_cos[None, None, :, :].expand(t, h, -1, -1)
-        w_sin = w_sin[None, None, :, :].expand(t, h, -1, -1)
-
-        cos = torch.cat([t_cos, h_cos, w_cos], dim=-1).reshape(t * h * w, -1)
-        sin = torch.cat([t_sin, h_sin, w_sin], dim=-1).reshape(t * h * w, -1)
-        return (cos, sin)
--- a/comfy/ldm/cogvideo/vae.py
+++ b/comfy/ldm/cogvideo/vae.py
@@ -1,566 +0,0 @@
-# CogVideoX VAE - ported to ComfyUI native ops
-# Architecture reference: diffusers AutoencoderKLCogVideoX
-# Style reference: comfy/ldm/wan/vae.py
-
-import numpy as np
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-import comfy.ops
-ops = comfy.ops.disable_weight_init
-
-
-class CausalConv3d(nn.Module):
-    """Causal 3D convolution with temporal padding.
-
-    Uses comfy.ops.Conv3d with autopad='causal_zero' fast path: when input has
-    a single temporal frame and no cache, the 3D conv weight is sliced to act
-    as a 2D conv, avoiding computation on zero-padded temporal dimensions.
-    """
-    def __init__(self, in_channels, out_channels, kernel_size, stride=1, dilation=1, pad_mode="constant"):
-        super().__init__()
-        if isinstance(kernel_size, int):
-            kernel_size = (kernel_size,) * 3
-
-        time_kernel, height_kernel, width_kernel = kernel_size
-        self.time_kernel_size = time_kernel
-        self.pad_mode = pad_mode
-
-        height_pad = (height_kernel - 1) // 2
-        width_pad = (width_kernel - 1) // 2
-        self.time_causal_padding = (width_pad, width_pad, height_pad, height_pad, time_kernel - 1, 0)
-
-        stride = stride if isinstance(stride, tuple) else (stride, 1, 1)
-        dilation = (dilation, 1, 1)
-        self.conv = ops.Conv3d(
-            in_channels, out_channels, kernel_size,
-            stride=stride, dilation=dilation,
-            padding=(0, height_pad, width_pad),
-        )
-
-    def forward(self, x, conv_cache=None):
-        if self.pad_mode == "replicate":
-            x = F.pad(x, self.time_causal_padding, mode="replicate")
-            conv_cache = None
-        else:
-            kernel_t = self.time_kernel_size
-            if kernel_t > 1:
-                if conv_cache is None and x.shape[2] == 1:
-                    # Fast path: single frame, no cache. All temporal padding
-                    # frames are copies of the input (replicate-style), so the
-                    # 3D conv reduces to a 2D conv with summed temporal kernel.
-                    w = comfy.ops.cast_to_input(self.conv.weight, x)
-                    b = comfy.ops.cast_to_input(self.conv.bias, x) if self.conv.bias is not None else None
-                    w2d = w.sum(dim=2, keepdim=True)
-                    out = F.conv3d(x, w2d, b,
-                                   self.conv.stride, self.conv.padding,
-                                   self.conv.dilation, self.conv.groups)
-                    return out, None
-                cached = [conv_cache] if conv_cache is not None else [x[:, :, :1]] * (kernel_t - 1)
-                x = torch.cat(cached + [x], dim=2)
-            conv_cache = x[:, :, -self.time_kernel_size + 1:].clone() if self.time_kernel_size > 1 else None
-
-        out = self.conv(x)
-        return out, conv_cache
-
-
-def _interpolate_zq(zq, target_size):
-    """Interpolate latent z to target (T, H, W), matching CogVideoX's first-frame-special handling."""
-    t = target_size[0]
-    if t > 1 and t % 2 == 1:
-        z_first = F.interpolate(zq[:, :, :1], size=(1, target_size[1], target_size[2]))
-        z_rest = F.interpolate(zq[:, :, 1:], size=(t - 1, target_size[1], target_size[2]))
-        return torch.cat([z_first, z_rest], dim=2)
-    return F.interpolate(zq, size=target_size)
-
-
-class SpatialNorm3D(nn.Module):
-    """Spatially conditioned normalization."""
-    def __init__(self, f_channels, zq_channels, groups=32):
-        super().__init__()
-        self.norm_layer = ops.GroupNorm(num_channels=f_channels, num_groups=groups, eps=1e-6, affine=True)
-        self.conv_y = CausalConv3d(zq_channels, f_channels, kernel_size=1, stride=1)
-        self.conv_b = CausalConv3d(zq_channels, f_channels, kernel_size=1, stride=1)
-
-    def forward(self, f, zq, conv_cache=None):
-        new_cache = {}
-        conv_cache = conv_cache or {}
-
-        if zq.shape[-3:] != f.shape[-3:]:
-            zq = _interpolate_zq(zq, f.shape[-3:])
-
-        conv_y, new_cache["conv_y"] = self.conv_y(zq, conv_cache=conv_cache.get("conv_y"))
-        conv_b, new_cache["conv_b"] = self.conv_b(zq, conv_cache=conv_cache.get("conv_b"))
-
-        return self.norm_layer(f) * conv_y + conv_b, new_cache
-
-
-class ResnetBlock3D(nn.Module):
-    """3D ResNet block with optional spatial norm."""
-    def __init__(self, in_channels, out_channels=None, temb_channels=512, groups=32,
-                 eps=1e-6, act_fn="silu", spatial_norm_dim=None, pad_mode="first"):
-        super().__init__()
-        out_channels = out_channels or in_channels
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.spatial_norm_dim = spatial_norm_dim
-
-        if act_fn == "silu":
-            self.nonlinearity = nn.SiLU()
-        elif act_fn == "swish":
-            self.nonlinearity = nn.SiLU()
-        else:
-            self.nonlinearity = nn.SiLU()
-
-        if spatial_norm_dim is None:
-            self.norm1 = ops.GroupNorm(num_channels=in_channels, num_groups=groups, eps=eps)
-            self.norm2 = ops.GroupNorm(num_channels=out_channels, num_groups=groups, eps=eps)
-        else:
-            self.norm1 = SpatialNorm3D(in_channels, spatial_norm_dim, groups=groups)
-            self.norm2 = SpatialNorm3D(out_channels, spatial_norm_dim, groups=groups)
-
-        self.conv1 = CausalConv3d(in_channels, out_channels, kernel_size=3, pad_mode=pad_mode)
-
-        if temb_channels > 0:
-            self.temb_proj = ops.Linear(temb_channels, out_channels)
-
-        self.conv2 = CausalConv3d(out_channels, out_channels, kernel_size=3, pad_mode=pad_mode)
-
-        if in_channels != out_channels:
-            self.conv_shortcut = ops.Conv3d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)
-        else:
-            self.conv_shortcut = None
-
-    def forward(self, x, temb=None, zq=None, conv_cache=None):
-        new_cache = {}
-        conv_cache = conv_cache or {}
-        residual = x
-
-        if zq is not None:
-            x, new_cache["norm1"] = self.norm1(x, zq, conv_cache=conv_cache.get("norm1"))
-        else:
-            x = self.norm1(x)
-
-        x = self.nonlinearity(x)
-        x, new_cache["conv1"] = self.conv1(x, conv_cache=conv_cache.get("conv1"))
-
-        if temb is not None and hasattr(self, "temb_proj"):
-            x = x + self.temb_proj(self.nonlinearity(temb))[:, :, None, None, None]
-
-        if zq is not None:
-            x, new_cache["norm2"] = self.norm2(x, zq, conv_cache=conv_cache.get("norm2"))
-        else:
-            x = self.norm2(x)
-
-        x = self.nonlinearity(x)
-        x, new_cache["conv2"] = self.conv2(x, conv_cache=conv_cache.get("conv2"))
-
-        if self.conv_shortcut is not None:
-            residual = self.conv_shortcut(residual)
-
-        return x + residual, new_cache
-
-
-class Downsample3D(nn.Module):
-    """3D downsampling with optional temporal compression."""
-    def __init__(self, in_channels, out_channels, kernel_size=3, stride=2, padding=0, compress_time=False):
-        super().__init__()
-        self.conv = ops.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)
-        self.compress_time = compress_time
-
-    def forward(self, x):
-        if self.compress_time:
-            b, c, t, h, w = x.shape
-            x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
-            if t % 2 == 1:
-                x_first, x_rest = x[..., 0], x[..., 1:]
-                if x_rest.shape[-1] > 0:
-                    x_rest = F.avg_pool1d(x_rest, kernel_size=2, stride=2)
-                x = torch.cat([x_first[..., None], x_rest], dim=-1)
-                x = x.reshape(b, h, w, c, x.shape[-1]).permute(0, 3, 4, 1, 2)
-            else:
-                x = F.avg_pool1d(x, kernel_size=2, stride=2)
-                x = x.reshape(b, h, w, c, x.shape[-1]).permute(0, 3, 4, 1, 2)
-
-        pad = (0, 1, 0, 1)
-        x = F.pad(x, pad, mode="constant", value=0)
-        b, c, t, h, w = x.shape
-        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
-        x = self.conv(x)
-        x = x.reshape(b, t, x.shape[1], x.shape[2], x.shape[3]).permute(0, 2, 1, 3, 4)
-        return x
-
-
-class Upsample3D(nn.Module):
-    """3D upsampling with optional temporal decompression."""
-    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1, compress_time=False):
-        super().__init__()
-        self.conv = ops.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)
-        self.compress_time = compress_time
-
-    def forward(self, x):
-        if self.compress_time:
-            if x.shape[2] > 1 and x.shape[2] % 2 == 1:
-                x_first, x_rest = x[:, :, 0], x[:, :, 1:]
-                x_first = F.interpolate(x_first, scale_factor=2.0)
-                x_rest = F.interpolate(x_rest, scale_factor=2.0)
-                x = torch.cat([x_first[:, :, None, :, :], x_rest], dim=2)
-            elif x.shape[2] > 1:
-                x = F.interpolate(x, scale_factor=2.0)
-            else:
-                x = x.squeeze(2)
-                x = F.interpolate(x, scale_factor=2.0)
-                x = x[:, :, None, :, :]
-        else:
-            b, c, t, h, w = x.shape
-            x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
-            x = F.interpolate(x, scale_factor=2.0)
-            x = x.reshape(b, t, c, *x.shape[2:]).permute(0, 2, 1, 3, 4)
-
-        b, c, t, h, w = x.shape
-        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
-        x = self.conv(x)
-        x = x.reshape(b, t, *x.shape[1:]).permute(0, 2, 1, 3, 4)
-        return x
-
-
-class DownBlock3D(nn.Module):
-    def __init__(self, in_channels, out_channels, temb_channels=0, num_layers=1,
-                 eps=1e-6, act_fn="silu", groups=32, add_downsample=True,
-                 compress_time=False, pad_mode="first"):
-        super().__init__()
-        self.resnets = nn.ModuleList([
-            ResnetBlock3D(
-                in_channels=in_channels if i == 0 else out_channels,
-                out_channels=out_channels,
-                temb_channels=temb_channels,
-                groups=groups, eps=eps, act_fn=act_fn, pad_mode=pad_mode,
-            )
-            for i in range(num_layers)
-        ])
-        self.downsamplers = nn.ModuleList([Downsample3D(out_channels, out_channels, compress_time=compress_time)]) if add_downsample else None
-
-    def forward(self, x, temb=None, zq=None, conv_cache=None):
-        new_cache = {}
-        conv_cache = conv_cache or {}
-        for i, resnet in enumerate(self.resnets):
-            x, new_cache[f"resnet_{i}"] = resnet(x, temb, zq, conv_cache=conv_cache.get(f"resnet_{i}"))
-        if self.downsamplers is not None:
-            for ds in self.downsamplers:
-                x = ds(x)
-        return x, new_cache
-
-
-class MidBlock3D(nn.Module):
-    def __init__(self, in_channels, temb_channels=0, num_layers=1,
-                 eps=1e-6, act_fn="silu", groups=32, spatial_norm_dim=None, pad_mode="first"):
-        super().__init__()
-        self.resnets = nn.ModuleList([
-            ResnetBlock3D(
-                in_channels=in_channels, out_channels=in_channels,
-                temb_channels=temb_channels, groups=groups, eps=eps,
-                act_fn=act_fn, spatial_norm_dim=spatial_norm_dim, pad_mode=pad_mode,
-            )
-            for _ in range(num_layers)
-        ])
-
-    def forward(self, x, temb=None, zq=None, conv_cache=None):
-        new_cache = {}
-        conv_cache = conv_cache or {}
-        for i, resnet in enumerate(self.resnets):
-            x, new_cache[f"resnet_{i}"] = resnet(x, temb, zq, conv_cache=conv_cache.get(f"resnet_{i}"))
-        return x, new_cache
-
-
-class UpBlock3D(nn.Module):
-    def __init__(self, in_channels, out_channels, temb_channels=0, num_layers=1,
-                 eps=1e-6, act_fn="silu", groups=32, spatial_norm_dim=16,
-                 add_upsample=True, compress_time=False, pad_mode="first"):
-        super().__init__()
-        self.resnets = nn.ModuleList([
-            ResnetBlock3D(
-                in_channels=in_channels if i == 0 else out_channels,
-                out_channels=out_channels,
-                temb_channels=temb_channels, groups=groups, eps=eps,
-                act_fn=act_fn, spatial_norm_dim=spatial_norm_dim, pad_mode=pad_mode,
-            )
-            for i in range(num_layers)
-        ])
-        self.upsamplers = nn.ModuleList([Upsample3D(out_channels, out_channels, compress_time=compress_time)]) if add_upsample else None
-
-    def forward(self, x, temb=None, zq=None, conv_cache=None):
-        new_cache = {}
-        conv_cache = conv_cache or {}
-        for i, resnet in enumerate(self.resnets):
-            x, new_cache[f"resnet_{i}"] = resnet(x, temb, zq, conv_cache=conv_cache.get(f"resnet_{i}"))
-        if self.upsamplers is not None:
-            for us in self.upsamplers:
-                x = us(x)
-        return x, new_cache
-
-
-class Encoder3D(nn.Module):
-    def __init__(self, in_channels=3, out_channels=16,
-                 block_out_channels=(128, 256, 256, 512),
-                 layers_per_block=3, act_fn="silu",
-                 eps=1e-6, groups=32, pad_mode="first",
-                 temporal_compression_ratio=4):
-        super().__init__()
-        temporal_compress_level = int(np.log2(temporal_compression_ratio))
-
-        self.conv_in = CausalConv3d(in_channels, block_out_channels[0], kernel_size=3, pad_mode=pad_mode)
-
-        self.down_blocks = nn.ModuleList()
-        output_channel = block_out_channels[0]
-        for i in range(len(block_out_channels)):
-            input_channel = output_channel
-            output_channel = block_out_channels[i]
-            is_final = i == len(block_out_channels) - 1
-            compress_time = i < temporal_compress_level
-
-            self.down_blocks.append(DownBlock3D(
-                in_channels=input_channel, out_channels=output_channel,
-                temb_channels=0, num_layers=layers_per_block,
-                eps=eps, act_fn=act_fn, groups=groups,
-                add_downsample=not is_final, compress_time=compress_time,
-            ))
-
-        self.mid_block = MidBlock3D(
-            in_channels=block_out_channels[-1], temb_channels=0,
-            num_layers=2, eps=eps, act_fn=act_fn, groups=groups, pad_mode=pad_mode,
-        )
-
-        self.norm_out = ops.GroupNorm(groups, block_out_channels[-1], eps=1e-6)
-        self.conv_act = nn.SiLU()
-        self.conv_out = CausalConv3d(block_out_channels[-1], 2 * out_channels, kernel_size=3, pad_mode=pad_mode)
-
-    def forward(self, x, conv_cache=None):
-        new_cache = {}
-        conv_cache = conv_cache or {}
-
-        x, new_cache["conv_in"] = self.conv_in(x, conv_cache=conv_cache.get("conv_in"))
-
-        for i, block in enumerate(self.down_blocks):
-            key = f"down_block_{i}"
-            x, new_cache[key] = block(x, None, None, conv_cache.get(key))
-
-        x, new_cache["mid_block"] = self.mid_block(x, None, None, conv_cache=conv_cache.get("mid_block"))
-
-        x = self.norm_out(x)
-        x = self.conv_act(x)
-        x, new_cache["conv_out"] = self.conv_out(x, conv_cache=conv_cache.get("conv_out"))
-
-        return x, new_cache
-
-
-class Decoder3D(nn.Module):
-    def __init__(self, in_channels=16, out_channels=3,
-                 block_out_channels=(128, 256, 256, 512),
-                 layers_per_block=3, act_fn="silu",
-                 eps=1e-6, groups=32, pad_mode="first",
-                 temporal_compression_ratio=4):
-        super().__init__()
-        reversed_channels = list(reversed(block_out_channels))
-        temporal_compress_level = int(np.log2(temporal_compression_ratio))
-
-        self.conv_in = CausalConv3d(in_channels, reversed_channels[0], kernel_size=3, pad_mode=pad_mode)
-
-        self.mid_block = MidBlock3D(
-            in_channels=reversed_channels[0], temb_channels=0,
-            num_layers=2, eps=eps, act_fn=act_fn, groups=groups,
-            spatial_norm_dim=in_channels, pad_mode=pad_mode,
-        )
-
-        self.up_blocks = nn.ModuleList()
-        output_channel = reversed_channels[0]
-        for i in range(len(block_out_channels)):
-            prev_channel = output_channel
-            output_channel = reversed_channels[i]
-            is_final = i == len(block_out_channels) - 1
-            compress_time = i < temporal_compress_level
-
-            self.up_blocks.append(UpBlock3D(
-                in_channels=prev_channel, out_channels=output_channel,
-                temb_channels=0, num_layers=layers_per_block + 1,
-                eps=eps, act_fn=act_fn, groups=groups,
-                spatial_norm_dim=in_channels,
-                add_upsample=not is_final, compress_time=compress_time,
-            ))
-
-        self.norm_out = SpatialNorm3D(reversed_channels[-1], in_channels, groups=groups)
-        self.conv_act = nn.SiLU()
-        self.conv_out = CausalConv3d(reversed_channels[-1], out_channels, kernel_size=3, pad_mode=pad_mode)
-
-    def forward(self, sample, conv_cache=None):
-        new_cache = {}
-        conv_cache = conv_cache or {}
-
-        x, new_cache["conv_in"] = self.conv_in(sample, conv_cache=conv_cache.get("conv_in"))
-
-        x, new_cache["mid_block"] = self.mid_block(x, None, sample, conv_cache=conv_cache.get("mid_block"))
-
-        for i, block in enumerate(self.up_blocks):
-            key = f"up_block_{i}"
-            x, new_cache[key] = block(x, None, sample, conv_cache=conv_cache.get(key))
-
-        x, new_cache["norm_out"] = self.norm_out(x, sample, conv_cache=conv_cache.get("norm_out"))
-        x = self.conv_act(x)
-        x, new_cache["conv_out"] = self.conv_out(x, conv_cache=conv_cache.get("conv_out"))
-
-        return x, new_cache
-
-
-
-class AutoencoderKLCogVideoX(nn.Module):
-    """CogVideoX VAE. Spatial tiling/slicing handled by ComfyUI's VAE wrapper.
-
-    Uses rolling temporal decode: conv_in + mid_block + temporal up_blocks run
-    on the full (low-res) tensor, then the expensive spatial-only up_blocks +
-    norm_out + conv_out are processed in small temporal chunks with conv_cache
-    carrying causal state between chunks. This keeps peak VRAM proportional to
-    chunk_size rather than total frame count.
-    """
-
-    def __init__(self,
-                 in_channels=3, out_channels=3,
-                 block_out_channels=(128, 256, 256, 512),
-                 latent_channels=16, layers_per_block=3,
-                 act_fn="silu", eps=1e-6, groups=32,
-                 temporal_compression_ratio=4,
-                 ):
-        super().__init__()
-        self.latent_channels = latent_channels
-        self.temporal_compression_ratio = temporal_compression_ratio
-
-        self.encoder = Encoder3D(
-            in_channels=in_channels, out_channels=latent_channels,
-            block_out_channels=block_out_channels, layers_per_block=layers_per_block,
-            act_fn=act_fn, eps=eps, groups=groups,
-            temporal_compression_ratio=temporal_compression_ratio,
-        )
-        self.decoder = Decoder3D(
-            in_channels=latent_channels, out_channels=out_channels,
-            block_out_channels=block_out_channels, layers_per_block=layers_per_block,
-            act_fn=act_fn, eps=eps, groups=groups,
-            temporal_compression_ratio=temporal_compression_ratio,
-        )
-
-        self.num_latent_frames_batch_size = 2
-        self.num_sample_frames_batch_size = 8
-
-    def encode(self, x):
-        t = x.shape[2]
-        frame_batch = self.num_sample_frames_batch_size
-        remainder = t % frame_batch
-        conv_cache = None
-        enc = []
-
-        # Process remainder frames first so only the first chunk can have an
-        # odd temporal dimension — where Downsample3D's first-frame-special
-        # handling in temporal compression is actually correct.
-        if remainder > 0:
-            chunk, conv_cache = self.encoder(x[:, :, :remainder], conv_cache=conv_cache)
-            enc.append(chunk.to(x.device))
-
-        for start in range(remainder, t, frame_batch):
-            chunk, conv_cache = self.encoder(x[:, :, start:start + frame_batch], conv_cache=conv_cache)
-            enc.append(chunk.to(x.device))
-
-        enc = torch.cat(enc, dim=2)
-        mean, _ = enc.chunk(2, dim=1)
-        return mean
-
-    def decode(self, z):
-        return self._decode_rolling(z)
-
-    def _decode_batched(self, z):
-        """Original batched decode - processes 2 latent frames through full decoder."""
-        t = z.shape[2]
-        frame_batch = self.num_latent_frames_batch_size
-        num_batches = max(t // frame_batch, 1)
-        conv_cache = None
-        dec = []
-        for i in range(num_batches):
-            remaining = t % frame_batch
-            start = frame_batch * i + (0 if i == 0 else remaining)
-            end = frame_batch * (i + 1) + remaining
-            chunk, conv_cache = self.decoder(z[:, :, start:end], conv_cache=conv_cache)
-            dec.append(chunk.cpu())
-        return torch.cat(dec, dim=2).to(z.device)
-
-    def _decode_rolling(self, z):
-        """Rolling decode - processes low-res layers on full tensor, then rolls
-        through expensive high-res layers in temporal chunks."""
-        decoder = self.decoder
-        device = z.device
-
-        # Determine which up_blocks have temporal upsample vs spatial-only.
-        # Temporal up_blocks are cheap (low res), spatial-only are expensive.
-        temporal_compress_level = int(np.log2(self.temporal_compression_ratio))
-        split_at = temporal_compress_level  # first N up_blocks do temporal upsample
-
-        # Phase 1: conv_in + mid_block + temporal up_blocks on full tensor (low/medium res)
-        x, _ = decoder.conv_in(z)
-        x, _ = decoder.mid_block(x, None, z)
-
-        for i in range(split_at):
-            x, _ = decoder.up_blocks[i](x, None, z)
-
-        # Phase 2: remaining spatial-only up_blocks + norm_out + conv_out in temporal chunks
-        remaining_blocks = list(range(split_at, len(decoder.up_blocks)))
-        chunk_size = 4  # pixel frames per chunk through high-res layers
-        t_expanded = x.shape[2]
-
-        if t_expanded <= chunk_size or len(remaining_blocks) == 0:
-            # Small enough to process in one go
-            for i in remaining_blocks:
-                x, _ = decoder.up_blocks[i](x, None, z)
-            x, _ = decoder.norm_out(x, z)
-            x = decoder.conv_act(x)
-            x, _ = decoder.conv_out(x)
-            return x
-
-        # Expand z temporally once to match Phase 2's time dimension.
-        # z stays at latent spatial resolution so this is small (~16 MB vs ~1.3 GB
-        # for the old approach of pre-interpolating to every pixel resolution).
-        z_time_expanded = _interpolate_zq(z, (t_expanded, z.shape[3], z.shape[4]))
-
-        # Process in temporal chunks, interpolating spatially per-chunk to avoid
-        # allocating full [B, C, t_expanded, H, W] tensors at each resolution.
-        dec_out = []
-        conv_caches = {}
-
-        for chunk_start in range(0, t_expanded, chunk_size):
-            chunk_end = min(chunk_start + chunk_size, t_expanded)
-            x_chunk = x[:, :, chunk_start:chunk_end]
-            z_t_chunk = z_time_expanded[:, :, chunk_start:chunk_end]
-            z_spatial_cache = {}
-
-            for i in remaining_blocks:
-                block = decoder.up_blocks[i]
-                cache_key = f"up_block_{i}"
-                hw_key = (x_chunk.shape[3], x_chunk.shape[4])
-                if hw_key not in z_spatial_cache:
-                    if z_t_chunk.shape[3] == hw_key[0] and z_t_chunk.shape[4] == hw_key[1]:
-                        z_spatial_cache[hw_key] = z_t_chunk
-                    else:
-                        z_spatial_cache[hw_key] = F.interpolate(z_t_chunk, size=(z_t_chunk.shape[2], hw_key[0], hw_key[1]))
-                x_chunk, new_cache = block(x_chunk, None, z_spatial_cache[hw_key], conv_cache=conv_caches.get(cache_key))
-                conv_caches[cache_key] = new_cache
-
-            hw_key = (x_chunk.shape[3], x_chunk.shape[4])
-            if hw_key not in z_spatial_cache:
-                z_spatial_cache[hw_key] = F.interpolate(z_t_chunk, size=(z_t_chunk.shape[2], hw_key[0], hw_key[1]))
-            x_chunk, new_cache = decoder.norm_out(x_chunk, z_spatial_cache[hw_key], conv_cache=conv_caches.get("norm_out"))
-            conv_caches["norm_out"] = new_cache
-            x_chunk = decoder.conv_act(x_chunk)
-            x_chunk, new_cache = decoder.conv_out(x_chunk, conv_cache=conv_caches.get("conv_out"))
-            conv_caches["conv_out"] = new_cache
-
-            dec_out.append(x_chunk.cpu())
-            del z_spatial_cache
-
-        del x, z_time_expanded
-        return torch.cat(dec_out, dim=2).to(device)
--- a/comfy/ldm/ernie/model.py
+++ b/comfy/ldm/ernie/model.py
@@ -1,301 +0,0 @@
-import math
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-from comfy.ldm.modules.attention import optimized_attention
-import comfy.model_management
-
-def rope(pos: torch.Tensor, dim: int, theta: int) -> torch.Tensor:
-    assert dim % 2 == 0
-    if not comfy.model_management.supports_fp64(pos.device):
-        device = torch.device("cpu")
-    else:
-        device = pos.device
-
-    scale = torch.arange(0, dim, 2, dtype=torch.float64, device=device) / dim
-    omega = 1.0 / (theta**scale)
-    out = torch.einsum("...n,d->...nd", pos.to(device), omega)
-    out = torch.stack([torch.cos(out), torch.sin(out)], dim=0)
-    return out.to(dtype=torch.float32, device=pos.device)
-
-def apply_rotary_emb(x_in: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
-    rot_dim = freqs_cis.shape[-1]
-    x, x_pass = x_in[..., :rot_dim], x_in[..., rot_dim:]
-    cos_ = freqs_cis[0]
-    sin_ = freqs_cis[1]
-    x1, x2 = x.chunk(2, dim=-1)
-    x_rotated = torch.cat((-x2, x1), dim=-1)
-    return torch.cat((x * cos_ + x_rotated * sin_, x_pass), dim=-1)
-
-class ErnieImageEmbedND3(nn.Module):
-    def __init__(self, dim: int, theta: int, axes_dim: tuple):
-        super().__init__()
-        self.dim = dim
-        self.theta = theta
-        self.axes_dim = list(axes_dim)
-
-    def forward(self, ids: torch.Tensor) -> torch.Tensor:
-        emb = torch.cat([rope(ids[..., i], self.axes_dim[i], self.theta) for i in range(3)], dim=-1)
-        emb = emb.unsqueeze(3)  # [2, B, S, 1, head_dim//2]
-        return torch.stack([emb, emb], dim=-1).reshape(*emb.shape[:-1], -1)  # [B, S, 1, head_dim]
-
-class ErnieImagePatchEmbedDynamic(nn.Module):
-    def __init__(self, in_channels: int, embed_dim: int, patch_size: int, operations, device=None, dtype=None):
-        super().__init__()
-        self.patch_size = patch_size
-        self.proj = operations.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size, bias=True, device=device, dtype=dtype)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        x = self.proj(x)
-        batch_size, dim, height, width = x.shape
-        return x.reshape(batch_size, dim, height * width).transpose(1, 2).contiguous()
-
-class Timesteps(nn.Module):
-    def __init__(self, num_channels: int, flip_sin_to_cos: bool = False):
-        super().__init__()
-        self.num_channels = num_channels
-        self.flip_sin_to_cos = flip_sin_to_cos
-
-    def forward(self, timesteps: torch.Tensor) -> torch.Tensor:
-        half_dim = self.num_channels // 2
-        exponent = -math.log(10000) * torch.arange(half_dim, dtype=torch.float32, device=timesteps.device) / half_dim
-        emb = torch.exp(exponent)
-        emb = timesteps[:, None].float() * emb[None, :]
-        if self.flip_sin_to_cos:
-            emb = torch.cat([torch.cos(emb), torch.sin(emb)], dim=-1)
-        else:
-            emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=-1)
-        return emb
-
-class TimestepEmbedding(nn.Module):
-    def __init__(self, in_channels: int, time_embed_dim: int, operations, device=None, dtype=None):
-        super().__init__()
-        Linear = operations.Linear
-        self.linear_1 = Linear(in_channels, time_embed_dim, bias=True, device=device, dtype=dtype)
-        self.act = nn.SiLU()
-        self.linear_2 = Linear(time_embed_dim, time_embed_dim, bias=True, device=device, dtype=dtype)
-
-    def forward(self, sample: torch.Tensor) -> torch.Tensor:
-        sample = self.linear_1(sample)
-        sample = self.act(sample)
-        sample = self.linear_2(sample)
-        return sample
-
-class ErnieImageAttention(nn.Module):
-    def __init__(self, query_dim: int, heads: int, dim_head: int, eps: float = 1e-6, operations=None, device=None, dtype=None):
-        super().__init__()
-        self.heads = heads
-        self.head_dim = dim_head
-        self.inner_dim = heads * dim_head
-
-        Linear = operations.Linear
-        RMSNorm = operations.RMSNorm
-
-        self.to_q = Linear(query_dim, self.inner_dim, bias=False, device=device, dtype=dtype)
-        self.to_k = Linear(query_dim, self.inner_dim, bias=False, device=device, dtype=dtype)
-        self.to_v = Linear(query_dim, self.inner_dim, bias=False, device=device, dtype=dtype)
-
-        self.norm_q = RMSNorm(dim_head, eps=eps, elementwise_affine=True, device=device, dtype=dtype)
-        self.norm_k = RMSNorm(dim_head, eps=eps, elementwise_affine=True, device=device, dtype=dtype)
-
-        self.to_out = nn.ModuleList([Linear(self.inner_dim, query_dim, bias=False, device=device, dtype=dtype)])
-
-    def forward(self, x: torch.Tensor, attention_mask: torch.Tensor = None, image_rotary_emb: torch.Tensor = None) -> torch.Tensor:
-        B, S, _ = x.shape
-
-        q_flat = self.to_q(x)
-        k_flat = self.to_k(x)
-        v_flat = self.to_v(x)
-
-        query = q_flat.view(B, S, self.heads, self.head_dim)
-        key = k_flat.view(B, S, self.heads, self.head_dim)
-
-        query = self.norm_q(query)
-        key = self.norm_k(key)
-
-        if image_rotary_emb is not None:
-            query = apply_rotary_emb(query, image_rotary_emb)
-            key = apply_rotary_emb(key, image_rotary_emb)
-
-        q_flat = query.reshape(B, S, -1)
-        k_flat = key.reshape(B, S, -1)
-
-        hidden_states = optimized_attention(q_flat, k_flat, v_flat, self.heads, mask=attention_mask)
-
-        return self.to_out[0](hidden_states)
-
-class ErnieImageFeedForward(nn.Module):
-    def __init__(self, hidden_size: int, ffn_hidden_size: int, operations, device=None, dtype=None):
-        super().__init__()
-        Linear = operations.Linear
-        self.gate_proj = Linear(hidden_size, ffn_hidden_size, bias=False, device=device, dtype=dtype)
-        self.up_proj = Linear(hidden_size, ffn_hidden_size, bias=False, device=device, dtype=dtype)
-        self.linear_fc2 = Linear(ffn_hidden_size, hidden_size, bias=False, device=device, dtype=dtype)
-
-    def forward(self, x: torch.Tensor) -> torch.Tensor:
-        return self.linear_fc2(self.up_proj(x) * F.gelu(self.gate_proj(x)))
-
-class ErnieImageSharedAdaLNBlock(nn.Module):
-    def __init__(self, hidden_size: int, num_heads: int, ffn_hidden_size: int, eps: float = 1e-6, operations=None, device=None, dtype=None):
-        super().__init__()
-        RMSNorm = operations.RMSNorm
-
-        self.adaLN_sa_ln = RMSNorm(hidden_size, eps=eps, device=device, dtype=dtype)
-        self.self_attention = ErnieImageAttention(
-            query_dim=hidden_size,
-            dim_head=hidden_size // num_heads,
-            heads=num_heads,
-            eps=eps,
-            operations=operations,
-            device=device,
-            dtype=dtype
-        )
-        self.adaLN_mlp_ln = RMSNorm(hidden_size, eps=eps, device=device, dtype=dtype)
-        self.mlp = ErnieImageFeedForward(hidden_size, ffn_hidden_size, operations=operations, device=device, dtype=dtype)
-
-    def forward(self, x, rotary_pos_emb, temb, attention_mask=None):
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = temb
-
-        residual = x
-        x_norm = self.adaLN_sa_ln(x)
-        x_norm = x_norm * (1 + scale_msa) + shift_msa
-
-        attn_out = self.self_attention(x_norm, attention_mask=attention_mask, image_rotary_emb=rotary_pos_emb)
-        x = residual + gate_msa * attn_out
-
-        residual = x
-        x_norm = self.adaLN_mlp_ln(x)
-        x_norm = x_norm * (1 + scale_mlp) + shift_mlp
-
-        return residual + gate_mlp * self.mlp(x_norm)
-
-class ErnieImageAdaLNContinuous(nn.Module):
-    def __init__(self, hidden_size: int, eps: float = 1e-6, operations=None, device=None, dtype=None):
-        super().__init__()
-        LayerNorm = operations.LayerNorm
-        Linear = operations.Linear
-        self.norm = LayerNorm(hidden_size, elementwise_affine=False, eps=eps, device=device, dtype=dtype)
-        self.linear = Linear(hidden_size, hidden_size * 2, device=device, dtype=dtype)
-
-    def forward(self, x: torch.Tensor, conditioning: torch.Tensor) -> torch.Tensor:
-        scale, shift = self.linear(conditioning).chunk(2, dim=-1)
-        x = self.norm(x)
-        x = torch.addcmul(shift.unsqueeze(1), x, 1 + scale.unsqueeze(1))
-        return x
-
-class ErnieImageModel(nn.Module):
-    def __init__(
-        self,
-        hidden_size: int = 4096,
-        num_attention_heads: int = 32,
-        num_layers: int = 36,
-        ffn_hidden_size: int = 12288,
-        in_channels: int = 128,
-        out_channels: int = 128,
-        patch_size: int = 1,
-        text_in_dim: int = 3072,
-        rope_theta: int = 256,
-        rope_axes_dim: tuple = (32, 48, 48),
-        eps: float = 1e-6,
-        qk_layernorm: bool = True,
-        device=None,
-        dtype=None,
-        operations=None,
-        **kwargs
-    ):
-        super().__init__()
-        self.dtype = dtype
-        self.hidden_size = hidden_size
-        self.num_heads = num_attention_heads
-        self.head_dim = hidden_size // num_attention_heads
-        self.patch_size = patch_size
-        self.out_channels = out_channels
-
-        Linear = operations.Linear
-
-        self.x_embedder = ErnieImagePatchEmbedDynamic(in_channels, hidden_size, patch_size, operations, device, dtype)
-        self.text_proj = Linear(text_in_dim, hidden_size, bias=False, device=device, dtype=dtype) if text_in_dim != hidden_size else None
-
-        self.time_proj = Timesteps(hidden_size, flip_sin_to_cos=False)
-        self.time_embedding = TimestepEmbedding(hidden_size, hidden_size, operations, device, dtype)
-
-        self.pos_embed = ErnieImageEmbedND3(dim=self.head_dim, theta=rope_theta, axes_dim=rope_axes_dim)
-
-        self.adaLN_modulation = nn.Sequential(
-            nn.SiLU(),
-            Linear(hidden_size, 6 * hidden_size, device=device, dtype=dtype)
-        )
-
-        self.layers = nn.ModuleList([
-            ErnieImageSharedAdaLNBlock(hidden_size, num_attention_heads, ffn_hidden_size, eps, operations, device, dtype)
-            for _ in range(num_layers)
-        ])
-
-        self.final_norm = ErnieImageAdaLNContinuous(hidden_size, eps, operations, device, dtype)
-        self.final_linear = Linear(hidden_size, patch_size * patch_size * out_channels, device=device, dtype=dtype)
-
-    def forward(self, x, timesteps, context, **kwargs):
-        device, dtype = x.device, x.dtype
-        B, C, H, W = x.shape
-        p, Hp, Wp = self.patch_size, H // self.patch_size, W // self.patch_size
-        N_img = Hp * Wp
-
-        img_bsh = self.x_embedder(x)
-
-        text_bth = context
-        if self.text_proj is not None and text_bth.numel() > 0:
-            text_bth = self.text_proj(text_bth)
-        Tmax = text_bth.shape[1]
-
-        hidden_states = torch.cat([img_bsh, text_bth], dim=1)
-
-        text_ids = torch.zeros((B, Tmax, 3), device=device, dtype=torch.float32)
-        text_ids[:, :, 0] = torch.linspace(0, Tmax - 1, steps=Tmax, device=x.device, dtype=torch.float32)
-        index = float(Tmax)
-
-        transformer_options = kwargs.get("transformer_options", {})
-        rope_options = transformer_options.get("rope_options", None)
-
-        h_len, w_len = float(Hp), float(Wp)
-        h_offset, w_offset = 0.0, 0.0
-
-        if rope_options is not None:
-            h_len = (h_len - 1.0) * rope_options.get("scale_y", 1.0) + 1.0
-            w_len = (w_len - 1.0) * rope_options.get("scale_x", 1.0) + 1.0
-            index += rope_options.get("shift_t", 0.0)
-            h_offset += rope_options.get("shift_y", 0.0)
-            w_offset += rope_options.get("shift_x", 0.0)
-
-        image_ids = torch.zeros((Hp, Wp, 3), device=device, dtype=torch.float32)
-        image_ids[:, :, 0] = image_ids[:, :, 1] + index
-        image_ids[:, :, 1] = image_ids[:, :, 1] + torch.linspace(h_offset, h_len - 1 + h_offset, steps=Hp, device=device, dtype=torch.float32).unsqueeze(1)
-        image_ids[:, :, 2] = image_ids[:, :, 2] + torch.linspace(w_offset, w_len - 1 + w_offset, steps=Wp, device=device, dtype=torch.float32).unsqueeze(0)
-
-        image_ids = image_ids.view(1, N_img, 3).expand(B, -1, -1)
-
-        rotary_pos_emb = self.pos_embed(torch.cat([image_ids, text_ids], dim=1)).to(x.dtype)
-        del image_ids, text_ids
-
-        sample = self.time_proj(timesteps).to(dtype)
-        c = self.time_embedding(sample)
-
-        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = [
-            t.unsqueeze(1).contiguous() for t in self.adaLN_modulation(c).chunk(6, dim=-1)
-        ]
-
-        temb = [shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp]
-        for layer in self.layers:
-            hidden_states = layer(hidden_states, rotary_pos_emb, temb)
-
-        hidden_states = self.final_norm(hidden_states, c).type_as(hidden_states)
-
-        patches = self.final_linear(hidden_states)[:, :N_img, :]
-        output = (
-            patches.view(B, Hp, Wp, p, p, self.out_channels)
-            .permute(0, 5, 1, 3, 2, 4)
-            .contiguous()
-            .view(B, self.out_channels, H, W)
-        )
-
-        return output
--- a/comfy/ldm/flux/math.py
+++ b/comfy/ldm/flux/math.py
@@ -16,7 +16,7 @@ def attention(q: Tensor, k: Tensor, v: Tensor, pe: Tensor, mask=None, transforme

 def rope(pos: Tensor, dim: int, theta: int) -> Tensor:
    assert dim % 2 == 0
-    if not comfy.model_management.supports_fp64(pos.device):
+    if comfy.model_management.is_device_mps(pos.device) or comfy.model_management.is_intel_xpu() or comfy.model_management.is_directml_enabled():
        device = torch.device("cpu")
    else:
        device = pos.device
--- a/comfy/ldm/flux/model.py
+++ b/comfy/ldm/flux/model.py
@@ -386,7 +386,7 @@ class Flux(nn.Module):
                    h = max(h, ref.shape[-2] + h_offset)
                    w = max(w, ref.shape[-1] + w_offset)

-                kontext, kontext_ids = self.process_img(ref, index=index, h_offset=h_offset, w_offset=w_offset, transformer_options=transformer_options)
+                kontext, kontext_ids = self.process_img(ref, index=index, h_offset=h_offset, w_offset=w_offset)
                img = torch.cat([img, kontext], dim=1)
                img_ids = torch.cat([img_ids, kontext_ids], dim=1)
                ref_num_tokens.append(kontext.shape[1])
--- a/comfy/ldm/lightricks/av_model.py
+++ b/comfy/ldm/lightricks/av_model.py
@@ -681,33 +681,6 @@ class LTXAVModel(LTXVModel):
        additional_args["has_spatial_mask"] = has_spatial_mask

        ax, a_latent_coords = self.a_patchifier.patchify(ax)
-
-        # Inject reference audio for ID-LoRA in-context conditioning
-        ref_audio = kwargs.get("ref_audio", None)
-        ref_audio_seq_len = 0
-        if ref_audio is not None:
-            ref_tokens = ref_audio["tokens"].to(dtype=ax.dtype, device=ax.device)
-            if ref_tokens.shape[0] < ax.shape[0]:
-                ref_tokens = ref_tokens.expand(ax.shape[0], -1, -1)
-            ref_audio_seq_len = ref_tokens.shape[1]
-            B = ax.shape[0]
-
-            # Compute negative temporal positions matching ID-LoRA convention:
-            # offset by -(end_of_last_token + time_per_latent) so reference ends just before t=0
-            p = self.a_patchifier
-            tpl = p.hop_length * p.audio_latent_downsample_factor / p.sample_rate
-            ref_start = p._get_audio_latent_time_in_sec(0, ref_audio_seq_len, torch.float32, ax.device)
-            ref_end = p._get_audio_latent_time_in_sec(1, ref_audio_seq_len + 1, torch.float32, ax.device)
-            time_offset = ref_end[-1].item() + tpl
-            ref_start = (ref_start - time_offset).unsqueeze(0).expand(B, -1).unsqueeze(1)
-            ref_end = (ref_end - time_offset).unsqueeze(0).expand(B, -1).unsqueeze(1)
-            ref_pos = torch.stack([ref_start, ref_end], dim=-1)
-
-            additional_args["ref_audio_seq_len"] = ref_audio_seq_len
-            additional_args["target_audio_seq_len"] = ax.shape[1]
-            ax = torch.cat([ref_tokens, ax], dim=1)
-            a_latent_coords = torch.cat([ref_pos.to(a_latent_coords), a_latent_coords], dim=2)
-
        ax = self.audio_patchify_proj(ax)

        # additional_args.update({"av_orig_shape": list(x.shape)})
@@ -748,14 +721,6 @@ class LTXAVModel(LTXVModel):

        # Prepare audio timestep
        a_timestep = kwargs.get("a_timestep")
-        ref_audio_seq_len = kwargs.get("ref_audio_seq_len", 0)
-        if ref_audio_seq_len > 0 and a_timestep is not None:
-            # Reference tokens must have timestep=0, expand scalar/1D timestep to per-token so ref=0 and target=sigma.
-            target_len = kwargs.get("target_audio_seq_len")
-            if a_timestep.dim() <= 1:
-                a_timestep = a_timestep.view(-1, 1).expand(batch_size, target_len)
-            ref_ts = torch.zeros(batch_size, ref_audio_seq_len, *a_timestep.shape[2:], device=a_timestep.device, dtype=a_timestep.dtype)
-            a_timestep = torch.cat([ref_ts, a_timestep], dim=1)
        if a_timestep is not None:
            a_timestep_scaled = a_timestep * self.timestep_scale_multiplier
            a_timestep_flat = a_timestep_scaled.flatten()
@@ -990,13 +955,6 @@ class LTXAVModel(LTXVModel):
        v_embedded_timestep = embedded_timestep[0]
        a_embedded_timestep = embedded_timestep[1]

-        # Trim reference audio tokens before unpatchification
-        ref_audio_seq_len = kwargs.get("ref_audio_seq_len", 0)
-        if ref_audio_seq_len > 0:
-            ax = ax[:, ref_audio_seq_len:]
-            if a_embedded_timestep.shape[1] > 1:
-                a_embedded_timestep = a_embedded_timestep[:, ref_audio_seq_len:]
-
        # Expand compressed video timestep if needed
        if isinstance(v_embedded_timestep, CompressedTimestep):
            v_embedded_timestep = v_embedded_timestep.expand()
--- a/comfy/ldm/lightricks/vae/audio_vae.py
+++ b/comfy/ldm/lightricks/vae/audio_vae.py
@@ -4,6 +4,9 @@ import math
 import torch
 import torchaudio

+import comfy.model_management
+import comfy.model_patcher
+import comfy.utils as utils
 from comfy.ldm.mmaudio.vae.distributions import DiagonalGaussianDistribution
 from comfy.ldm.lightricks.symmetric_patchifier import AudioPatchifier
 from comfy.ldm.lightricks.vae.causal_audio_autoencoder import (
@@ -40,6 +43,30 @@ class AudioVAEComponentConfig:

        return cls(autoencoder=audio_config, vocoder=vocoder_config)

+
+class ModelDeviceManager:
+    """Manages device placement and GPU residency for the composed model."""
+
+    def __init__(self, module: torch.nn.Module):
+        load_device = comfy.model_management.get_torch_device()
+        offload_device = comfy.model_management.vae_offload_device()
+        self.patcher = comfy.model_patcher.ModelPatcher(module, load_device, offload_device)
+
+    def ensure_model_loaded(self) -> None:
+        comfy.model_management.free_memory(
+            self.patcher.model_size(),
+            self.patcher.load_device,
+        )
+        comfy.model_management.load_model_gpu(self.patcher)
+
+    def move_to_load_device(self, tensor: torch.Tensor) -> torch.Tensor:
+        return tensor.to(self.patcher.load_device)
+
+    @property
+    def load_device(self):
+        return self.patcher.load_device
+
+
 class AudioLatentNormalizer:
    """Applies per-channel statistics in patch space and restores original layout."""

@@ -105,17 +132,23 @@ class AudioPreprocessor:
 class AudioVAE(torch.nn.Module):
    """High-level Audio VAE wrapper exposing encode and decode entry points."""

-    def __init__(self, metadata: dict):
+    def __init__(self, state_dict: dict, metadata: dict):
        super().__init__()

        component_config = AudioVAEComponentConfig.from_metadata(metadata)

+        vae_sd = utils.state_dict_prefix_replace(state_dict, {"audio_vae.": ""}, filter_keys=True)
+        vocoder_sd = utils.state_dict_prefix_replace(state_dict, {"vocoder.": ""}, filter_keys=True)
+
        self.autoencoder = CausalAudioAutoencoder(config=component_config.autoencoder)
        if "bwe" in component_config.vocoder:
            self.vocoder = VocoderWithBWE(config=component_config.vocoder)
        else:
            self.vocoder = Vocoder(config=component_config.vocoder)

+        self.autoencoder.load_state_dict(vae_sd, strict=False)
+        self.vocoder.load_state_dict(vocoder_sd, strict=False)
+
        autoencoder_config = self.autoencoder.get_config()
        self.normalizer = AudioLatentNormalizer(
            AudioPatchifier(
@@ -135,12 +168,18 @@ class AudioVAE(torch.nn.Module):
            n_fft=autoencoder_config["n_fft"],
        )

-    def encode(self, audio, sample_rate=44100) -> torch.Tensor:
+        self.device_manager = ModelDeviceManager(self)
+
+    def encode(self, audio: dict) -> torch.Tensor:
        """Encode a waveform dictionary into normalized latent tensors."""

-        waveform = audio
-        waveform_sample_rate = sample_rate
+        waveform = audio["waveform"]
+        waveform_sample_rate = audio["sample_rate"]
        input_device = waveform.device
+        # Ensure that Audio VAE is loaded on the correct device.
+        self.device_manager.ensure_model_loaded()
+
+        waveform = self.device_manager.move_to_load_device(waveform)
        expected_channels = self.autoencoder.encoder.in_channels
        if waveform.shape[1] != expected_channels:
            if waveform.shape[1] == 1:
@@ -151,7 +190,7 @@ class AudioVAE(torch.nn.Module):
                )

        mel_spec = self.preprocessor.waveform_to_mel(
-            waveform, waveform_sample_rate, device=waveform.device
+            waveform, waveform_sample_rate, device=self.device_manager.load_device
        )

        latents = self.autoencoder.encode(mel_spec)
@@ -165,13 +204,17 @@ class AudioVAE(torch.nn.Module):
        """Decode normalized latent tensors into an audio waveform."""
        original_shape = latents.shape

+        # Ensure that Audio VAE is loaded on the correct device.
+        self.device_manager.ensure_model_loaded()
+
+        latents = self.device_manager.move_to_load_device(latents)
        latents = self.normalizer.denormalize(latents)

        target_shape = self.target_shape_from_latents(original_shape)
        mel_spec = self.autoencoder.decode(latents, target_shape=target_shape)

        waveform = self.run_vocoder(mel_spec)
-        return waveform
+        return self.device_manager.move_to_load_device(waveform)

    def target_shape_from_latents(self, latents_shape):
        batch, _, time, _ = latents_shape
--- a/comfy/ldm/models/autoencoder.py
+++ b/comfy/ldm/models/autoencoder.py
@@ -155,7 +155,6 @@ class AutoencodingEngineLegacy(AutoencodingEngine):
    def __init__(self, embed_dim: int, **kwargs):
        self.max_batch_size = kwargs.pop("max_batch_size", None)
        ddconfig = kwargs.pop("ddconfig")
-        decoder_ddconfig = kwargs.pop("decoder_ddconfig", ddconfig)
        super().__init__(
            encoder_config={
                "target": "comfy.ldm.modules.diffusionmodules.model.Encoder",
@@ -163,7 +162,7 @@ class AutoencodingEngineLegacy(AutoencodingEngine):
            },
            decoder_config={
                "target": "comfy.ldm.modules.diffusionmodules.model.Decoder",
-                "params": decoder_ddconfig,
+                "params": ddconfig,
            },
            **kwargs,
        )
--- a/comfy/ldm/modules/diffusionmodules/openaimodel.py
+++ b/comfy/ldm/modules/diffusionmodules/openaimodel.py
@@ -34,16 +34,6 @@ class TimestepBlock(nn.Module):
 #This is needed because accelerate makes a copy of transformer_options which breaks "transformer_index"
 def forward_timestep_embed(ts, x, emb, context=None, transformer_options={}, output_shape=None, time_context=None, num_video_frames=None, image_only_indicator=None):
    for layer in ts:
-        if "patches" in transformer_options and "forward_timestep_embed_patch" in transformer_options["patches"]:
-            found_patched = False
-            for class_type, handler in transformer_options["patches"]["forward_timestep_embed_patch"]:
-                if isinstance(layer, class_type):
-                    x = handler(layer, x, emb, context, transformer_options, output_shape, time_context, num_video_frames, image_only_indicator)
-                    found_patched = True
-                    break
-            if found_patched:
-                continue
-
        if isinstance(layer, VideoResBlock):
            x = layer(x, emb, num_video_frames, image_only_indicator)
        elif isinstance(layer, TimestepBlock):
@@ -59,6 +49,15 @@ def forward_timestep_embed(ts, x, emb, context=None, transformer_options={}, out
        elif isinstance(layer, Upsample):
            x = layer(x, output_shape=output_shape)
        else:
+            if "patches" in transformer_options and "forward_timestep_embed_patch" in transformer_options["patches"]:
+                found_patched = False
+                for class_type, handler in transformer_options["patches"]["forward_timestep_embed_patch"]:
+                    if isinstance(layer, class_type):
+                        x = handler(layer, x, emb, context, transformer_options, output_shape, time_context, num_video_frames, image_only_indicator)
+                        found_patched = True
+                        break
+                if found_patched:
+                    continue
            x = layer(x)
    return x

@@ -895,12 +894,6 @@ class UNetModel(nn.Module):
            h = forward_timestep_embed(self.middle_block, h, emb, context, transformer_options, time_context=time_context, num_video_frames=num_video_frames, image_only_indicator=image_only_indicator)
        h = apply_control(h, control, 'middle')

-        if "middle_block_after_patch" in transformer_patches:
-            patch = transformer_patches["middle_block_after_patch"]
-            for p in patch:
-                out = p({"h": h, "x": x, "emb": emb, "context": context, "y": y,
-                         "timesteps": timesteps, "transformer_options": transformer_options})
-                h = out["h"]

        for id, module in enumerate(self.output_blocks):
            transformer_options["block"] = ("output", id)
@@ -912,7 +905,6 @@ class UNetModel(nn.Module):
                for p in patch:
                    h, hsp = p(h, hsp, transformer_options)

-            if hsp is not None:
            h = th.cat([h, hsp], dim=1)
            del hsp
            if len(hs) > 0:
--- a/comfy/ldm/modules/encoders/noise_aug_modules.py
+++ b/comfy/ldm/modules/encoders/noise_aug_modules.py
@@ -3,9 +3,12 @@ from ..diffusionmodules.openaimodel import Timestep
 import torch

 class CLIPEmbeddingNoiseAugmentation(ImageConcatWithNoiseAugmentation):
-    def __init__(self, *args, timestep_dim=256, **kwargs):
+    def __init__(self, *args, clip_stats_path=None, timestep_dim=256, **kwargs):
        super().__init__(*args, **kwargs)
+        if clip_stats_path is None:
            clip_mean, clip_std = torch.zeros(timestep_dim), torch.ones(timestep_dim)
+        else:
+            clip_mean, clip_std = torch.load(clip_stats_path, map_location="cpu")
        self.register_buffer("data_mean", clip_mean[None, :], persistent=False)
        self.register_buffer("data_std", clip_std[None, :], persistent=False)
        self.time_embed = Timestep(timestep_dim)
--- a/comfy/ldm/modules/sdpose.py
+++ b/comfy/ldm/modules/sdpose.py
@@ -90,7 +90,7 @@ class HeatmapHead(torch.nn.Module):
                origin_max = np.max(hm[k])
                dr = np.zeros((H + 2 * border, W + 2 * border), dtype=np.float32)
                dr[border:-border, border:-border] = hm[k].copy()
-                dr = gaussian_filter(dr, sigma=2.0, truncate=2.5)
+                dr = gaussian_filter(dr, sigma=2.0)
                hm[k] = dr[border:-border, border:-border].copy()
                cur_max = np.max(hm[k])
                if cur_max > 0:
--- a/comfy/ldm/rt_detr/rtdetr_v4.py
+++ b/comfy/ldm/rt_detr/rtdetr_v4.py
@@ -1,725 +0,0 @@
-from collections import OrderedDict
-from typing import List
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-import torchvision
-import comfy.model_management
-from comfy.ldm.modules.attention import optimized_attention_for_device
-
-COCO_CLASSES = [
-    'person','bicycle','car','motorcycle','airplane','bus','train','truck','boat',
-    'traffic light','fire hydrant','stop sign','parking meter','bench','bird','cat',
-    'dog','horse','sheep','cow','elephant','bear','zebra','giraffe','backpack',
-    'umbrella','handbag','tie','suitcase','frisbee','skis','snowboard','sports ball',
-    'kite','baseball bat','baseball glove','skateboard','surfboard','tennis racket',
-    'bottle','wine glass','cup','fork','knife','spoon','bowl','banana','apple',
-    'sandwich','orange','broccoli','carrot','hot dog','pizza','donut','cake','chair',
-    'couch','potted plant','bed','dining table','toilet','tv','laptop','mouse',
-    'remote','keyboard','cell phone','microwave','oven','toaster','sink',
-    'refrigerator','book','clock','vase','scissors','teddy bear','hair drier','toothbrush',
-]
-
-# ---------------------------------------------------------------------------
-# HGNetv2 backbone
-# ---------------------------------------------------------------------------
-
-class ConvBNAct(nn.Module):
-    """Conv→BN→ReLU.  padding='same' adds asymmetric zero-pad (stem)."""
-    def __init__(self, ic, oc, k=3, s=1, groups=1, use_act=True, device=None, dtype=None, operations=None):
-        super().__init__()
-
-        self.conv = operations.Conv2d(ic, oc, k, s, (k - 1) // 2, groups=groups, bias=False, device=device, dtype=dtype)
-        self.bn   = nn.BatchNorm2d(oc, device=device, dtype=dtype)
-        self.act  = nn.ReLU() if use_act else nn.Identity()
-
-    def forward(self, x):
-        return self.act(self.bn(self.conv(x)))
-
-class LightConvBNAct(nn.Module):
-    def __init__(self, ic, oc, k, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.conv1 = ConvBNAct(ic, oc, 1, use_act=False, device=device, dtype=dtype, operations=operations)
-        self.conv2 = ConvBNAct(oc, oc, k, groups=oc, use_act=True, device=device, dtype=dtype, operations=operations)
-
-    def forward(self, x):
-        return self.conv2(self.conv1(x))
-
-class _StemBlock(nn.Module):
-    def __init__(self, ic, mc, oc, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.stem1  = ConvBNAct(ic,    mc,    3, 2, device=device, dtype=dtype, operations=operations)
-        # stem2a/stem2b use kernel=2, stride=1, no internal padding;
-        # padding is applied manually in forward (matching PaddlePaddle original)
-        self.stem2a = ConvBNAct(mc,    mc//2, 2, 1, device=device, dtype=dtype, operations=operations)
-        self.stem2b = ConvBNAct(mc//2, mc,    2, 1, device=device, dtype=dtype, operations=operations)
-        self.stem3  = ConvBNAct(mc*2,  mc,    3, 2, device=device, dtype=dtype, operations=operations)
-        self.stem4  = ConvBNAct(mc,    oc,    1, device=device, dtype=dtype, operations=operations)
-        self.pool   = nn.MaxPool2d(2, 1, ceil_mode=True)
-
-    def forward(self, x):
-        x  = self.stem1(x)
-        x  = F.pad(x, (0, 1, 0, 1))   # pad before pool and stem2a
-        x2 = self.stem2a(x)
-        x2 = F.pad(x2, (0, 1, 0, 1))  # pad before stem2b
-        x2 = self.stem2b(x2)
-        x1 = self.pool(x)
-        return self.stem4(self.stem3(torch.cat([x1, x2], 1)))
-
-
-class _HG_Block(nn.Module):
-    def __init__(self, ic, mc, oc, layer_num, k=3, residual=False, light=False, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.residual = residual
-        if light:
-            self.layers = nn.ModuleList(
-                [LightConvBNAct(ic if i == 0 else mc, mc, k, device=device, dtype=dtype, operations=operations) for i in range(layer_num)])
-        else:
-            self.layers = nn.ModuleList(
-                [ConvBNAct(ic if i == 0 else mc, mc, k, device=device, dtype=dtype, operations=operations) for i in range(layer_num)])
-        total = ic + layer_num * mc
-
-        self.aggregation = nn.Sequential(
-            ConvBNAct(total,   oc // 2, 1, device=device, dtype=dtype, operations=operations),
-            ConvBNAct(oc // 2, oc,      1, device=device, dtype=dtype, operations=operations))
-
-    def forward(self, x):
-        identity = x
-        outs = [x]
-        for layer in self.layers:
-            x = layer(x)
-            outs.append(x)
-        x = self.aggregation(torch.cat(outs, 1))
-        return x + identity if self.residual else x
-
-
-class _HG_Stage(nn.Module):
-    # config order: ic, mc, oc, num_blocks, downsample, light, k, layer_num
-    def __init__(self, ic, mc, oc, num_blocks, downsample=True, light=False, k=3, layer_num=6, device=None, dtype=None, operations=None):
-        super().__init__()
-        if downsample:
-            self.downsample = ConvBNAct(ic, ic, 3, 2, groups=ic, use_act=False, device=device, dtype=dtype, operations=operations)
-        else:
-            self.downsample = nn.Identity()
-        self.blocks = nn.Sequential(*[
-            _HG_Block(ic if i == 0 else oc, mc, oc, layer_num,
-                      k=k, residual=(i != 0), light=light, device=device, dtype=dtype, operations=operations)
-            for i in range(num_blocks)
-        ])
-
-    def forward(self, x):
-        return self.blocks(self.downsample(x))
-
-
-class HGNetv2(nn.Module):
-    # B5 config: stem=[3,32,64], stages=[ic, mc, oc, blocks, down, light, k, layers]
-    _STAGE_CFGS = [[64,  64,  128,  1, False, False, 3, 6],
-                   [128, 128, 512,  2, True,  False, 3, 6],
-                   [512, 256, 1024, 5, True,  True,  5, 6],
-                   [1024,512, 2048, 2, True,  True,  5, 6]]
-
-    def __init__(self, return_idx=(1, 2, 3), device=None, dtype=None, operations=None):
-        super().__init__()
-        self.stem   = _StemBlock(3, 32, 64, device=device, dtype=dtype, operations=operations)
-        self.stages = nn.ModuleList([_HG_Stage(*cfg, device=device, dtype=dtype, operations=operations) for cfg in self._STAGE_CFGS])
-        self.return_idx  = list(return_idx)
-        self.out_channels = [self._STAGE_CFGS[i][2] for i in return_idx]
-
-    def forward(self, x: torch.Tensor) -> List[torch.Tensor]:
-        x = self.stem(x)
-        outs = []
-        for i, stage in enumerate(self.stages):
-            x = stage(x)
-            if i in self.return_idx:
-                outs.append(x)
-        return outs
-
-
-# ---------------------------------------------------------------------------
-# Encoder — HybridEncoder  (dfine version: RepNCSPELAN4 + SCDown PAN)
-# ---------------------------------------------------------------------------
-
-class ConvNormLayer(nn.Module):
-    """Conv→act (expects pre-fused BN weights)."""
-    def __init__(self, ic, oc, k, s, g=1, padding=None, act=None, device=None, dtype=None, operations=None):
-        super().__init__()
-        p = (k - 1) // 2 if padding is None else padding
-        self.conv = operations.Conv2d(ic, oc, k, s, p, groups=g, bias=True, device=device, dtype=dtype)
-        self.act  = nn.SiLU() if act == 'silu' else nn.Identity()
-
-    def forward(self, x):
-        return self.act(self.conv(x))
-
-
-class VGGBlock(nn.Module):
-    """Rep-VGG block (expects pre-fused weights)."""
-    def __init__(self, ic, oc, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.conv = operations.Conv2d(ic, oc, 3, 1, padding=1, bias=True, device=device, dtype=dtype)
-        self.act  = nn.SiLU()
-
-    def forward(self, x):
-        return self.act(self.conv(x))
-
-
-class CSPLayer(nn.Module):
-    def __init__(self, ic, oc, num_blocks=3, expansion=1.0, act='silu', device=None, dtype=None, operations=None):
-        super().__init__()
-        h = int(oc * expansion)
-        self.conv1 = ConvNormLayer(ic, h, 1, 1, act=act, device=device, dtype=dtype, operations=operations)
-        self.conv2 = ConvNormLayer(ic, h, 1, 1, act=act, device=device, dtype=dtype, operations=operations)
-        self.bottlenecks = nn.Sequential(*[VGGBlock(h, h, device=device, dtype=dtype, operations=operations) for _ in range(num_blocks)])
-        self.conv3 = ConvNormLayer(h, oc, 1, 1, act=act, device=device, dtype=dtype, operations=operations) if h != oc else nn.Identity()
-
-    def forward(self, x):
-        return self.conv3(self.bottlenecks(self.conv1(x)) + self.conv2(x))
-
-
-class RepNCSPELAN4(nn.Module):
-    """CSP-ELAN block — the FPN/PAN block in RTv4's HybridEncoder."""
-    def __init__(self, c1, c2, c3, c4, n=3, act='silu', device=None, dtype=None, operations=None):
-        super().__init__()
-        self.c = c3 // 2
-        self.cv1 = ConvNormLayer(c1, c3, 1, 1, act=act, device=device, dtype=dtype, operations=operations)
-        self.cv2 = nn.Sequential(CSPLayer(c3 // 2, c4, n, 1.0, act=act, device=device, dtype=dtype, operations=operations), ConvNormLayer(c4, c4, 3, 1, act=act, device=device, dtype=dtype, operations=operations))
-        self.cv3 = nn.Sequential(CSPLayer(c4, c4, n, 1.0, act=act, device=device, dtype=dtype, operations=operations), ConvNormLayer(c4, c4, 3, 1, act=act, device=device, dtype=dtype, operations=operations))
-        self.cv4 = ConvNormLayer(c3 + 2 * c4, c2, 1, 1, act=act, device=device, dtype=dtype, operations=operations)
-
-    def forward(self, x):
-        y = list(self.cv1(x).split((self.c, self.c), 1))
-        y.extend(m(y[-1]) for m in [self.cv2, self.cv3])
-        return self.cv4(torch.cat(y, 1))
-
-
-class SCDown(nn.Module):
-    """Separable conv downsampling used in HybridEncoder PAN bottom-up path."""
-    def __init__(self, ic, oc, k, s, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.cv1 = ConvNormLayer(ic, oc, 1, 1, device=device, dtype=dtype, operations=operations)
-        self.cv2 = ConvNormLayer(oc, oc, k, s, g=oc, device=device, dtype=dtype, operations=operations)
-
-    def forward(self, x):
-        return self.cv2(self.cv1(x))
-
-
-class SelfAttention(nn.Module):
-    def __init__(self, embed_dim, num_heads, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.embed_dim = embed_dim
-        self.num_heads = num_heads
-        self.head_dim  = embed_dim // num_heads
-        self.q_proj   = operations.Linear(embed_dim, embed_dim, device=device, dtype=dtype)
-        self.k_proj   = operations.Linear(embed_dim, embed_dim, device=device, dtype=dtype)
-        self.v_proj   = operations.Linear(embed_dim, embed_dim, device=device, dtype=dtype)
-        self.out_proj = operations.Linear(embed_dim, embed_dim, device=device, dtype=dtype)
-
-    def forward(self, query, key, value, attn_mask=None):
-        optimized_attention = optimized_attention_for_device(query.device, False, small_input=True)
-        q, k, v = self.q_proj(query), self.k_proj(key), self.v_proj(value)
-        out = optimized_attention(q, k, v, heads=self.num_heads, mask=attn_mask)
-        return self.out_proj(out)
-
-
-class _TransformerEncoderLayer(nn.Module):
-    """Single AIFI encoder layer (pre- or post-norm, GELU by default)."""
-    def __init__(self, d_model, nhead, dim_feedforward, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.self_attn  = SelfAttention(d_model, nhead, device=device, dtype=dtype, operations=operations)
-        self.linear1    = operations.Linear(d_model, dim_feedforward, device=device, dtype=dtype)
-        self.linear2    = operations.Linear(dim_feedforward, d_model, device=device, dtype=dtype)
-        self.norm1      = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.norm2      = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.activation = nn.GELU()
-
-    def forward(self, src, src_mask=None, pos_embed=None):
-        q = k = src if pos_embed is None else src + pos_embed
-        src2 = self.self_attn(q, k, value=src, attn_mask=src_mask)
-        src = self.norm1(src + src2)
-        src2 = self.linear2(self.activation(self.linear1(src)))
-        return self.norm2(src + src2)
-
-
-class _TransformerEncoder(nn.Module):
-    """Thin wrapper so state-dict keys are  encoder.0.layers.N.*"""
-    def __init__(self, num_layers, d_model, nhead, dim_feedforward, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.layers = nn.ModuleList([
-            _TransformerEncoderLayer(d_model, nhead, dim_feedforward, device=device, dtype=dtype, operations=operations)
-            for _ in range(num_layers)
-        ])
-
-    def forward(self, src, src_mask=None, pos_embed=None):
-        for layer in self.layers:
-            src = layer(src, src_mask=src_mask, pos_embed=pos_embed)
-        return src
-
-
-class HybridEncoder(nn.Module):
-    def __init__(self, in_channels=(512, 1024, 2048), feat_strides=(8, 16, 32), hidden_dim=256, nhead=8, dim_feedforward=2048, use_encoder_idx=(2,), num_encoder_layers=1,
-                 pe_temperature=10000, expansion=1.0, depth_mult=1.0, act='silu', eval_spatial_size=(640, 640), device=None, dtype=None, operations=None):
-        super().__init__()
-        self.in_channels       = list(in_channels)
-        self.feat_strides      = list(feat_strides)
-        self.hidden_dim        = hidden_dim
-        self.use_encoder_idx   = list(use_encoder_idx)
-        self.pe_temperature    = pe_temperature
-        self.eval_spatial_size = eval_spatial_size
-        self.out_channels      = [hidden_dim] * len(in_channels)
-        self.out_strides       = list(feat_strides)
-
-        # channel projection (expects pre-fused weights)
-        self.input_proj = nn.ModuleList([
-            nn.Sequential(OrderedDict([('conv', operations.Conv2d(ch, hidden_dim, 1, bias=True, device=device, dtype=dtype))]))
-            for ch in in_channels
-        ])
-
-        # AIFI transformer — use _TransformerEncoder so keys are  encoder.0.layers.N.*
-        self.encoder = nn.ModuleList([
-            _TransformerEncoder(num_encoder_layers, hidden_dim, nhead, dim_feedforward, device=device, dtype=dtype, operations=operations)
-            for _ in range(len(use_encoder_idx))
-        ])
-
-        nb  = round(3 * depth_mult)
-        exp = expansion
-
-        # top-down FPN  (dfine: lateral conv has no act)
-        self.lateral_convs = nn.ModuleList(
-            [ConvNormLayer(hidden_dim, hidden_dim, 1, 1, device=device, dtype=dtype, operations=operations)
-             for _ in range(len(in_channels) - 1)])
-        self.fpn_blocks = nn.ModuleList(
-            [RepNCSPELAN4(hidden_dim * 2, hidden_dim, hidden_dim * 2, round(exp * hidden_dim // 2), nb, act=act, device=device, dtype=dtype, operations=operations)
-             for _ in range(len(in_channels) - 1)])
-
-        # bottom-up PAN  (dfine: nn.Sequential(SCDown) — keeps checkpoint key  .0.cv1/.0.cv2)
-        self.downsample_convs = nn.ModuleList(
-            [nn.Sequential(SCDown(hidden_dim, hidden_dim, 3, 2, device=device, dtype=dtype, operations=operations))
-             for _ in range(len(in_channels) - 1)])
-        self.pan_blocks = nn.ModuleList(
-            [RepNCSPELAN4(hidden_dim * 2, hidden_dim, hidden_dim * 2, round(exp * hidden_dim // 2), nb, act=act, device=device, dtype=dtype, operations=operations)
-             for _ in range(len(in_channels) - 1)])
-
-        # cache positional embeddings for fixed spatial size
-        if eval_spatial_size:
-            for idx in self.use_encoder_idx:
-                stride = self.feat_strides[idx]
-                pe = self._build_pe(eval_spatial_size[1] // stride,
-                                    eval_spatial_size[0] // stride,
-                                    hidden_dim, pe_temperature)
-                setattr(self, f'pos_embed{idx}', pe)
-
-    @staticmethod
-    def _build_pe(w, h, dim=256, temp=10000.):
-        assert dim % 4 == 0
-        gw = torch.arange(w, dtype=torch.float32)
-        gh = torch.arange(h, dtype=torch.float32)
-        gw, gh = torch.meshgrid(gw, gh, indexing='ij')
-        pdim  = dim // 4
-        omega = 1. / (temp ** (torch.arange(pdim, dtype=torch.float32) / pdim))
-        ow = gw.flatten()[:, None] @ omega[None]
-        oh = gh.flatten()[:, None] @ omega[None]
-        return torch.cat([ow.sin(), ow.cos(), oh.sin(), oh.cos()], 1)[None]
-
-    def forward(self, feats: List[torch.Tensor]) -> List[torch.Tensor]:
-        proj = [self.input_proj[i](f) for i, f in enumerate(feats)]
-
-        for i, enc_idx in enumerate(self.use_encoder_idx):
-            h, w = proj[enc_idx].shape[2:]
-            src  = proj[enc_idx].flatten(2).permute(0, 2, 1)
-            pe = getattr(self, f'pos_embed{enc_idx}').to(device=src.device, dtype=src.dtype)
-            for layer in self.encoder[i].layers:
-                src = layer(src, pos_embed=pe)
-            proj[enc_idx] = src.permute(0, 2, 1).reshape(-1, self.hidden_dim, h, w).contiguous()
-
-        n = len(self.in_channels)
-        inner = [proj[-1]]
-        for k in range(n - 1, 0, -1):
-            j = n - 1 - k
-            top = self.lateral_convs[j](inner[0])
-            inner[0] = top
-            up = F.interpolate(top, scale_factor=2., mode='nearest')
-            inner.insert(0, self.fpn_blocks[j](torch.cat([up, proj[k - 1]], 1)))
-
-        outs = [inner[0]]
-        for k in range(n - 1):
-            outs.append(self.pan_blocks[k](
-                torch.cat([self.downsample_convs[k](outs[-1]), inner[k + 1]], 1)))
-        return outs
-
-
-# ---------------------------------------------------------------------------
-# Decoder — DFINETransformer
-# ---------------------------------------------------------------------------
-
-def _deformable_attn_v2(value: list, spatial_shapes, sampling_locations: torch.Tensor, attention_weights: torch.Tensor, num_points_list: List[int]) -> torch.Tensor:
-    """
-    value            : list of per-level tensors  [bs*n_head, c, h_l, w_l]
-    sampling_locations: [bs, Lq, n_head, sum(pts), 2]  in [0,1]
-    attention_weights : [bs, Lq, n_head, sum(pts)]
-    """
-    _, c = value[0].shape[:2]      # bs*n_head, c
-    _, Lq, n_head, _, _ = sampling_locations.shape
-    bs = sampling_locations.shape[0]
-    n_h = n_head
-
-    grids = (2 * sampling_locations - 1)          # [bs, Lq, n_head, sum_pts, 2]
-    grids = grids.permute(0, 2, 1, 3, 4).flatten(0, 1)  # [bs*n_head, Lq, sum_pts, 2]
-    grids_per_lvl = grids.split(num_points_list, dim=2)  # list of [bs*n_head, Lq, pts_l, 2]
-
-    sampled = []
-    for lvl, (h, w) in enumerate(spatial_shapes):
-        val_l = value[lvl].reshape(bs * n_h, c, h, w)
-        sv = F.grid_sample(val_l, grids_per_lvl[lvl], mode='bilinear', padding_mode='zeros', align_corners=False)
-        sampled.append(sv) # sv: [bs*n_head, c, Lq, pts_l]
-
-    attn = attention_weights.permute(0, 2, 1, 3)  # [bs, n_head, Lq, sum_pts]
-    attn = attn.flatten(0, 1).unsqueeze(1)         # [bs*n_head, 1, Lq, sum_pts]
-    out  = (torch.cat(sampled, -1) * attn).sum(-1) # [bs*n_head, c, Lq]
-    out  = out.reshape(bs, n_h * c, Lq)
-    return out.permute(0, 2, 1)                    # [bs, Lq, hidden]
-
-
-class MSDeformableAttention(nn.Module):
-    def __init__(self, embed_dim=256, num_heads=8, num_levels=3, num_points=4, offset_scale=0.5, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.embed_dim, self.num_heads = embed_dim, num_heads
-        self.head_dim  = embed_dim // num_heads
-        pts = num_points if isinstance(num_points, list) else [num_points] * num_levels
-        self.num_points_list = pts
-        self.offset_scale    = offset_scale
-        total = num_heads * sum(pts)
-        self.register_buffer('num_points_scale', torch.tensor([1. / n for n in pts for _ in range(n)], dtype=torch.float32))
-        self.sampling_offsets  = operations.Linear(embed_dim, total * 2, device=device, dtype=dtype)
-        self.attention_weights = operations.Linear(embed_dim, total, device=device, dtype=dtype)
-
-    def forward(self, query, ref_pts, value, spatial_shapes):
-        bs, Lq = query.shape[:2]
-        offsets = self.sampling_offsets(query).reshape(
-            bs, Lq, self.num_heads, sum(self.num_points_list), 2)
-        attn_w  = F.softmax(
-            self.attention_weights(query).reshape(
-                bs, Lq, self.num_heads, sum(self.num_points_list)), -1)
-        scale   = self.num_points_scale.to(query).unsqueeze(-1)
-        offset  = offsets * scale * ref_pts[:, :, None, :, 2:] * self.offset_scale
-        locs    = ref_pts[:, :, None, :, :2] + offset  # [bs, Lq, n_head, sum_pts, 2]
-        return _deformable_attn_v2(value, spatial_shapes, locs, attn_w, self.num_points_list)
-
-
-class Gate(nn.Module):
-    def __init__(self, d_model, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.gate = operations.Linear(2 * d_model, 2 * d_model, device=device, dtype=dtype)
-        self.norm = operations.LayerNorm(d_model, device=device, dtype=dtype)
-
-    def forward(self, x1, x2):
-        g1, g2 = torch.sigmoid(self.gate(torch.cat([x1, x2], -1))).chunk(2, -1)
-        return self.norm(g1 * x1 + g2 * x2)
-
-
-class MLP(nn.Module):
-    def __init__(self, in_dim, hidden_dim, out_dim, num_layers, device=None, dtype=None, operations=None):
-        super().__init__()
-        dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [out_dim]
-        self.layers = nn.ModuleList(operations.Linear(dims[i], dims[i + 1], device=device, dtype=dtype) for i in range(num_layers))
-
-    def forward(self, x):
-        for i, layer in enumerate(self.layers):
-            x = nn.SiLU()(layer(x)) if i < len(self.layers) - 1 else layer(x)
-        return x
-
-
-class TransformerDecoderLayer(nn.Module):
-    def __init__(self, d_model=256, nhead=8, dim_feedforward=1024, num_levels=3, num_points=4, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.self_attn  = SelfAttention(d_model, nhead, device=device, dtype=dtype, operations=operations)
-        self.norm1      = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.cross_attn = MSDeformableAttention(d_model, nhead, num_levels, num_points, device=device, dtype=dtype, operations=operations)
-        self.gateway    = Gate(d_model, device=device, dtype=dtype, operations=operations)
-        self.linear1    = operations.Linear(d_model, dim_feedforward, device=device, dtype=dtype)
-        self.activation = nn.ReLU()
-        self.linear2    = operations.Linear(dim_feedforward, d_model, device=device, dtype=dtype)
-        self.norm3      = operations.LayerNorm(d_model, device=device, dtype=dtype)
-
-    def forward(self, target, ref_pts, value, spatial_shapes, attn_mask=None, query_pos=None):
-        q = k = target if query_pos is None else target + query_pos
-        t2 = self.self_attn(q, k, value=target, attn_mask=attn_mask)
-        target = self.norm1(target + t2)
-        t2 = self.cross_attn(
-            target if query_pos is None else target + query_pos,
-            ref_pts, value, spatial_shapes)
-        target = self.gateway(target, t2)
-        t2 = self.linear2(self.activation(self.linear1(target)))
-        target = self.norm3((target + t2).clamp(-65504, 65504))
-        return target
-
-
-# ---------------------------------------------------------------------------
-# FDR utilities
-# ---------------------------------------------------------------------------
-
-def weighting_function(reg_max, up, reg_scale):
-    """Non-uniform weighting function W(n) for FDR box regression."""
-    ub1 = (abs(up[0]) * abs(reg_scale)).item()
-    ub2 = ub1 * 2
-    step = (ub1 + 1) ** (2 / (reg_max - 2))
-    left  = [-(step ** i) + 1 for i in range(reg_max // 2 - 1, 0, -1)]
-    right = [ (step ** i) - 1 for i in range(1, reg_max // 2)]
-    vals  = [-ub2] + left + [0] + right + [ub2]
-    return torch.tensor(vals, dtype=up.dtype, device=up.device)
-
-
-def distance2bbox(points, distance, reg_scale):
-    """Decode edge-distances → cxcywh boxes."""
-    rs = abs(reg_scale).to(dtype=points.dtype)
-    x1 = points[..., 0] - (0.5 * rs + distance[..., 0]) * (points[..., 2] / rs)
-    y1 = points[..., 1] - (0.5 * rs + distance[..., 1]) * (points[..., 3] / rs)
-    x2 = points[..., 0] + (0.5 * rs + distance[..., 2]) * (points[..., 2] / rs)
-    y2 = points[..., 1] + (0.5 * rs + distance[..., 3]) * (points[..., 3] / rs)
-    x0, y0, x1_, y1_ = (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1
-    return torch.stack([x0, y0, x1_, y1_], -1)
-
-
-class Integral(nn.Module):
-    """Sum Pr(n)·W(n) over the distribution bins."""
-    def __init__(self, reg_max=32):
-        super().__init__()
-        self.reg_max = reg_max
-
-    def forward(self, x, project):
-        shape = x.shape
-        x = F.softmax(x.reshape(-1, self.reg_max + 1), 1)
-        x = F.linear(x, project.to(device=x.device, dtype=x.dtype)).reshape(-1, 4)
-        return x.reshape(list(shape[:-1]) + [-1])
-
-
-class LQE(nn.Module):
-    """Location Quality Estimator — refines class scores using corner distribution."""
-    def __init__(self, k=4, hidden_dim=64, num_layers=2, reg_max=32, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.k, self.reg_max = k, reg_max
-        self.reg_conf = MLP(4 * (k + 1), hidden_dim, 1, num_layers, device=device, dtype=dtype, operations=operations)
-
-    def forward(self, scores, pred_corners):
-        B, L, _ = pred_corners.shape
-        prob     = F.softmax(pred_corners.reshape(B, L, 4, self.reg_max + 1), -1)
-        topk, _  = prob.topk(self.k, -1)
-        stat     = torch.cat([topk, topk.mean(-1, keepdim=True)], -1)
-        return scores + self.reg_conf(stat.reshape(B, L, -1))
-
-
-class TransformerDecoder(nn.Module):
-    def __init__(self, hidden_dim, nhead, dim_feedforward, num_levels, num_points, num_layers, reg_max, reg_scale, up, eval_idx=-1, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.hidden_dim = hidden_dim
-        self.num_layers = num_layers
-        self.nhead      = nhead
-        self.eval_idx   = eval_idx if eval_idx >= 0 else num_layers + eval_idx
-        self.up, self.reg_scale, self.reg_max = up, reg_scale, reg_max
-        self.layers = nn.ModuleList([
-            TransformerDecoderLayer(hidden_dim, nhead, dim_feedforward, num_levels, num_points, device=device, dtype=dtype, operations=operations)
-            for _ in range(self.eval_idx + 1)
-        ])
-        self.lqe_layers = nn.ModuleList([LQE(4, 64, 2, reg_max, device=device, dtype=dtype, operations=operations) for _ in range(self.eval_idx + 1)])
-        self.register_buffer('project', weighting_function(reg_max, up, reg_scale))
-
-    def _value_op(self, memory, spatial_shapes):
-        """Reshape memory to per-level value tensors for deformable attention."""
-        c = self.hidden_dim // self.nhead
-        split = [h * w for h, w in spatial_shapes]
-        val = memory.reshape(memory.shape[0], memory.shape[1], self.nhead, c) # memory: [bs, sum(h*w), hidden_dim]
-        # → [bs, n_head, c, sum_hw]
-        val = val.permute(0, 2, 3, 1).flatten(0, 1)  # [bs*n_head, c, sum_hw]
-        return val.split(split, dim=-1)  # list of [bs*n_head, c, h_l*w_l]
-
-    def forward(self, target, ref_pts_unact, memory, spatial_shapes, bbox_head, score_head, query_pos_head, pre_bbox_head, integral):
-        val_split_flat = self._value_op(memory, spatial_shapes) # pre-split value for deformable attention
-
-        # reshape to [bs*n_head, c, h_l, w_l]
-        value = []
-        for lvl, (h, w) in enumerate(spatial_shapes):
-            v = val_split_flat[lvl]   # [bs*n_head, c, h*w]
-            value.append(v.reshape(v.shape[0], v.shape[1], h, w))
-
-        ref_pts  = F.sigmoid(ref_pts_unact)
-        output   = target
-        output_detach = pred_corners_undetach = 0
-
-        dec_bboxes, dec_logits = [], []
-
-        for i, layer in enumerate(self.layers):
-            ref_input    = ref_pts.unsqueeze(2)           # [bs, Lq, 1, 4]
-            query_pos    = query_pos_head(ref_pts).clamp(-10, 10)
-            output       = layer(output, ref_input, value, spatial_shapes, query_pos=query_pos)
-
-            if i == 0:
-                ref_unact = ref_pts.clamp(1e-5, 1 - 1e-5)
-                ref_unact = torch.log(ref_unact / (1 - ref_unact))
-                pre_bboxes = F.sigmoid(pre_bbox_head(output) + ref_unact)
-                ref_pts_initial = pre_bboxes.detach()
-
-            pred_corners = bbox_head[i](output + output_detach) + pred_corners_undetach
-            inter_ref_bbox = distance2bbox(ref_pts_initial, integral(pred_corners, self.project), self.reg_scale)
-
-            if i == self.eval_idx:
-                scores = score_head[i](output)
-                scores = self.lqe_layers[i](scores, pred_corners)
-                dec_bboxes.append(inter_ref_bbox)
-                dec_logits.append(scores)
-                break
-
-            pred_corners_undetach = pred_corners
-            ref_pts        = inter_ref_bbox.detach()
-            output_detach  = output.detach()
-
-        return torch.stack(dec_bboxes), torch.stack(dec_logits)
-
-
-class DFINETransformer(nn.Module):
-    def __init__(self, num_classes=80, hidden_dim=256, num_queries=300, feat_channels=[256, 256, 256], feat_strides=[8, 16, 32],
-                 num_levels=3, num_points=[3, 6, 3], nhead=8, num_layers=6, dim_feedforward=1024, eval_idx=-1, eps=1e-2, reg_max=32,
-                 reg_scale=8.0, eval_spatial_size=(640, 640), device=None, dtype=None, operations=None):
-        super().__init__()
-        assert len(feat_strides) == len(feat_channels)
-        self.hidden_dim  = hidden_dim
-        self.num_queries = num_queries
-        self.num_levels  = num_levels
-        self.eps         = eps
-        self.eval_spatial_size = eval_spatial_size
-
-        self.feat_strides = list(feat_strides)
-        for i in range(num_levels - len(feat_strides)):
-            self.feat_strides.append(feat_strides[-1] * 2 ** (i + 1))
-
-        # input projection (expects pre-fused weights)
-        self.input_proj = nn.ModuleList()
-        for ch in feat_channels:
-            if ch == hidden_dim:
-                self.input_proj.append(nn.Identity())
-            else:
-                self.input_proj.append(nn.Sequential(OrderedDict([
-                    ('conv', operations.Conv2d(ch, hidden_dim, 1, bias=True, device=device, dtype=dtype))])))
-        in_ch = feat_channels[-1]
-        for i in range(num_levels - len(feat_channels)):
-            self.input_proj.append(nn.Sequential(OrderedDict([
-                ('conv', operations.Conv2d(in_ch if i == 0 else hidden_dim,
-                                           hidden_dim, 3, 2, 1, bias=True, device=device, dtype=dtype))])))
-            in_ch = hidden_dim
-
-        # FDR parameters (non-trainable placeholders, set from config)
-        self.up        = nn.Parameter(torch.tensor([0.5]),      requires_grad=False)
-        self.reg_scale = nn.Parameter(torch.tensor([reg_scale]), requires_grad=False)
-
-        pts = num_points if isinstance(num_points, (list, tuple)) else [num_points] * num_levels
-        self.decoder = TransformerDecoder(hidden_dim, nhead, dim_feedforward, num_levels, pts,
-                                          num_layers, reg_max, self.reg_scale, self.up, eval_idx, device=device, dtype=dtype, operations=operations)
-
-        self.query_pos_head = MLP(4, 2 * hidden_dim, hidden_dim, 2, device=device, dtype=dtype, operations=operations)
-        self.enc_output     = nn.Sequential(OrderedDict([
-            ('proj', operations.Linear(hidden_dim, hidden_dim, device=device, dtype=dtype)),
-            ('norm', operations.LayerNorm(hidden_dim, device=device, dtype=dtype))]))
-        self.enc_score_head = operations.Linear(hidden_dim, num_classes, device=device, dtype=dtype)
-        self.enc_bbox_head  = MLP(hidden_dim, hidden_dim, 4, 3, device=device, dtype=dtype, operations=operations)
-
-        self.eval_idx_ = eval_idx if eval_idx >= 0 else num_layers + eval_idx
-        self.dec_score_head = nn.ModuleList(
-            [operations.Linear(hidden_dim, num_classes, device=device, dtype=dtype) for _ in range(self.eval_idx_ + 1)])
-        self.pre_bbox_head  = MLP(hidden_dim, hidden_dim, 4, 3, device=device, dtype=dtype, operations=operations)
-        self.dec_bbox_head  = nn.ModuleList(
-            [MLP(hidden_dim, hidden_dim, 4 * (reg_max + 1), 3, device=device, dtype=dtype, operations=operations)
-             for _ in range(self.eval_idx_ + 1)])
-        self.integral = Integral(reg_max)
-
-        if eval_spatial_size:
-            # Register as buffers so checkpoint values override the freshly-computed defaults
-            anchors, valid_mask = self._gen_anchors()
-            self.register_buffer('anchors', anchors)
-            self.register_buffer('valid_mask', valid_mask)
-
-    def _gen_anchors(self, spatial_shapes=None, grid_size=0.05, dtype=torch.float32, device='cpu'):
-        if spatial_shapes is None:
-            h0, w0 = self.eval_spatial_size
-            spatial_shapes = [[int(h0 / s), int(w0 / s)] for s in self.feat_strides]
-        anchors = []
-        for lvl, (h, w) in enumerate(spatial_shapes):
-            gy, gx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
-            gxy = (torch.stack([gx, gy], -1).float() + 0.5) / torch.tensor([w, h], dtype=dtype)
-            wh  = torch.ones_like(gxy) * grid_size * (2. ** lvl)
-            anchors.append(torch.cat([gxy, wh], -1).reshape(-1, h * w, 4))
-        anchors    = torch.cat(anchors, 1).to(device)
-        valid_mask = ((anchors > self.eps) & (anchors < 1 - self.eps)).all(-1, keepdim=True)
-        anchors    = torch.log(anchors / (1 - anchors))
-        anchors    = torch.where(valid_mask, anchors, torch.full_like(anchors, float('inf')))
-        return anchors, valid_mask
-
-    def _encoder_input(self, feats: List[torch.Tensor]):
-        proj = [self.input_proj[i](f) for i, f in enumerate(feats)]
-        for i in range(len(feats), self.num_levels):
-            proj.append(self.input_proj[i](feats[-1] if i == len(feats) else proj[-1]))
-        flat, shapes = [], []
-        for f in proj:
-            _, _, h, w = f.shape
-            flat.append(f.flatten(2).permute(0, 2, 1))
-            shapes.append([h, w])
-        return torch.cat(flat, 1), shapes
-
-    def _decoder_input(self, memory: torch.Tensor):
-        anchors, valid_mask = self.anchors.to(memory), self.valid_mask
-        if memory.shape[0] > 1:
-            anchors = anchors.repeat(memory.shape[0], 1, 1)
-
-        mem      = valid_mask.to(memory) * memory
-        out_mem  = self.enc_output(mem)
-        logits   = self.enc_score_head(out_mem)
-        _, idx   = torch.topk(logits.max(-1).values, self.num_queries, dim=-1)
-        idx_e    = idx.unsqueeze(-1)
-        topk_mem = out_mem.gather(1, idx_e.expand(-1, -1, out_mem.shape[-1]))
-        topk_anc = anchors.gather(1, idx_e.expand(-1, -1, anchors.shape[-1]))
-        topk_ref = self.enc_bbox_head(topk_mem) + topk_anc
-        return topk_mem.detach(), topk_ref.detach()
-
-    def forward(self, feats: List[torch.Tensor]):
-        memory, shapes = self._encoder_input(feats)
-        content, ref   = self._decoder_input(memory)
-        out_bboxes, out_logits = self.decoder(
-            content, ref, memory, shapes,
-            self.dec_bbox_head, self.dec_score_head,
-            self.query_pos_head, self.pre_bbox_head, self.integral)
-        return {'pred_logits': out_logits[-1], 'pred_boxes': out_bboxes[-1]}
-
-
-# ---------------------------------------------------------------------------
-# Main model
-# ---------------------------------------------------------------------------
-
-class RTv4(nn.Module):
-    def __init__(self, num_classes=80, num_queries=300, enc_h=256, dec_h=256, enc_ff=2048, dec_ff=1024, feat_strides=[8, 16, 32], device=None, dtype=None, operations=None, **kwargs):
-        super().__init__()
-        self.device = device
-        self.dtype = dtype
-        self.operations = operations
-
-        self.backbone = HGNetv2(device=device, dtype=dtype, operations=operations)
-        self.encoder  = HybridEncoder(hidden_dim=enc_h, dim_feedforward=enc_ff, device=device, dtype=dtype, operations=operations)
-        self.decoder  = DFINETransformer(num_classes=num_classes, hidden_dim=dec_h, num_queries=num_queries,
-            feat_channels=[enc_h] * len(feat_strides), feat_strides=feat_strides, dim_feedforward=dec_ff, device=device, dtype=dtype, operations=operations)
-
-        self.num_classes = num_classes
-        self.num_queries = num_queries
-        self.load_device = comfy.model_management.get_torch_device()
-
-    def _forward(self, x: torch.Tensor):
-        return self.decoder(self.encoder(self.backbone(x)))
-
-    def postprocess(self, outputs, orig_size: tuple = (640, 640)) -> List[dict]:
-        logits = outputs['pred_logits']
-        boxes  = torchvision.ops.box_convert(outputs['pred_boxes'], 'cxcywh', 'xyxy')
-        boxes  = boxes * torch.tensor(orig_size, device=boxes.device, dtype=boxes.dtype).repeat(1, 2).unsqueeze(1)
-        scores = F.sigmoid(logits)
-        scores, idx = torch.topk(scores.flatten(1), self.num_queries, dim=-1)
-        labels = idx % self.num_classes
-        boxes  = boxes.gather(1, (idx // self.num_classes).unsqueeze(-1).expand(-1, -1, 4))
-        return [{'labels': lbl, 'boxes': b, 'scores': s} for lbl, b, s in zip(labels, boxes, scores)]
-
-    def forward(self, x: torch.Tensor, orig_size: tuple = (640, 640), **kwargs):
-        outputs = self._forward(x.to(device=self.load_device, dtype=self.dtype))
-        return self.postprocess(outputs, orig_size)
--- a/comfy/ldm/sam3/detector.py
+++ b/comfy/ldm/sam3/detector.py
@@ -1,596 +0,0 @@
-# SAM3 detector: transformer encoder-decoder, segmentation head, geometry encoder, scoring.
-
-import math
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from torchvision.ops import roi_align
-
-from comfy.ldm.modules.attention import optimized_attention
-from comfy.ldm.sam3.tracker import SAM3Tracker, SAM31Tracker
-from comfy.ldm.sam3.sam import SAM3VisionBackbone  # noqa: used in __init__
-from comfy.ldm.sam3.sam import MLP, PositionEmbeddingSine
-
-TRACKER_CLASSES = {"SAM3": SAM3Tracker, "SAM31": SAM31Tracker}
-from comfy.ops import cast_to_input
-
-
-def box_cxcywh_to_xyxy(x):
-    cx, cy, w, h = x.unbind(-1)
-    return torch.stack([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], dim=-1)
-
-
-def gen_sineembed_for_position(pos_tensor, num_feats=256):
-    """Per-coordinate sinusoidal embedding: (..., N) -> (..., N * num_feats)."""
-    assert num_feats % 2 == 0
-    hdim = num_feats // 2
-    freqs = 10000.0 ** (2 * (torch.arange(hdim, dtype=torch.float32, device=pos_tensor.device) // 2) / hdim)
-    embeds = []
-    for c in range(pos_tensor.shape[-1]):
-        raw = (pos_tensor[..., c].float() * 2 * math.pi).unsqueeze(-1) / freqs
-        embeds.append(torch.stack([raw[..., 0::2].sin(), raw[..., 1::2].cos()], dim=-1).flatten(-2))
-    return torch.cat(embeds, dim=-1).to(pos_tensor.dtype)
-
-
-class SplitMHA(nn.Module):
-    """Multi-head attention with separate Q/K/V projections (split from fused in_proj_weight)."""
-    def __init__(self, d_model, num_heads=8, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.num_heads = num_heads
-        self.q_proj = operations.Linear(d_model, d_model, device=device, dtype=dtype)
-        self.k_proj = operations.Linear(d_model, d_model, device=device, dtype=dtype)
-        self.v_proj = operations.Linear(d_model, d_model, device=device, dtype=dtype)
-        self.out_proj = operations.Linear(d_model, d_model, device=device, dtype=dtype)
-
-    def forward(self, q_input, k_input=None, v_input=None, mask=None):
-        q = self.q_proj(q_input)
-        if k_input is None:
-            k = self.k_proj(q_input)
-            v = self.v_proj(q_input)
-        else:
-            k = self.k_proj(k_input)
-            v = self.v_proj(v_input if v_input is not None else k_input)
-        if mask is not None and mask.ndim == 2:
-            mask = mask[:, None, None, :]  # [B, T] -> [B, 1, 1, T] for SDPA broadcast
-        dtype = q.dtype  # manual_cast may produce mixed dtypes
-        out = optimized_attention(q, k.to(dtype), v.to(dtype), self.num_heads, mask=mask, low_precision_attention=False)
-        return self.out_proj(out)
-
-
-class MLPWithNorm(nn.Module):
-    """MLP with residual connection and output LayerNorm."""
-    def __init__(self, input_dim, hidden_dim, output_dim, num_layers, residual=True, device=None, dtype=None, operations=None):
-        super().__init__()
-        dims = [input_dim] + [hidden_dim] * (num_layers - 1) + [output_dim]
-        self.layers = nn.ModuleList([
-            operations.Linear(dims[i], dims[i + 1], device=device, dtype=dtype)
-            for i in range(num_layers)
-        ])
-        self.out_norm = operations.LayerNorm(output_dim, device=device, dtype=dtype)
-        self.residual = residual and (input_dim == output_dim)
-
-    def forward(self, x):
-        orig = x
-        for i, layer in enumerate(self.layers):
-            x = layer(x)
-            if i < len(self.layers) - 1:
-                x = F.relu(x)
-        if self.residual:
-            x = x + orig
-        return self.out_norm(x)
-
-
-class EncoderLayer(nn.Module):
-    def __init__(self, d_model=256, num_heads=8, dim_ff=2048, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.self_attn = SplitMHA(d_model, num_heads, device=device, dtype=dtype, operations=operations)
-        self.cross_attn_image = SplitMHA(d_model, num_heads, device=device, dtype=dtype, operations=operations)
-        self.linear1 = operations.Linear(d_model, dim_ff, device=device, dtype=dtype)
-        self.linear2 = operations.Linear(dim_ff, d_model, device=device, dtype=dtype)
-        self.norm1 = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.norm2 = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.norm3 = operations.LayerNorm(d_model, device=device, dtype=dtype)
-
-    def forward(self, x, pos, text_memory=None, text_mask=None):
-        normed = self.norm1(x)
-        q_k = normed + pos
-        x = x + self.self_attn(q_k, q_k, normed)
-        if text_memory is not None:
-            normed = self.norm2(x)
-            x = x + self.cross_attn_image(normed, text_memory, text_memory, mask=text_mask)
-        normed = self.norm3(x)
-        x = x + self.linear2(F.relu(self.linear1(normed)))
-        return x
-
-
-class TransformerEncoder(nn.Module):
-    """Checkpoint: transformer.encoder.layers.N.*"""
-    def __init__(self, d_model=256, num_heads=8, dim_ff=2048, num_layers=6, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.layers = nn.ModuleList([
-            EncoderLayer(d_model, num_heads, dim_ff, device=device, dtype=dtype, operations=operations)
-            for _ in range(num_layers)
-        ])
-
-    def forward(self, x, pos, text_memory=None, text_mask=None):
-        for layer in self.layers:
-            x = layer(x, pos, text_memory, text_mask)
-        return x
-
-
-class DecoderLayer(nn.Module):
-    def __init__(self, d_model=256, num_heads=8, dim_ff=2048, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.self_attn = SplitMHA(d_model, num_heads, device=device, dtype=dtype, operations=operations)
-        self.cross_attn = SplitMHA(d_model, num_heads, device=device, dtype=dtype, operations=operations)
-        self.ca_text = SplitMHA(d_model, num_heads, device=device, dtype=dtype, operations=operations)
-        self.norm1 = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.norm2 = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.norm3 = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.catext_norm = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.linear1 = operations.Linear(d_model, dim_ff, device=device, dtype=dtype)
-        self.linear2 = operations.Linear(dim_ff, d_model, device=device, dtype=dtype)
-
-    def forward(self, x, memory, x_pos, memory_pos, text_memory=None, text_mask=None, cross_attn_bias=None):
-        q_k = x + x_pos
-        x = self.norm2(x + self.self_attn(q_k, q_k, x))
-        if text_memory is not None:
-            x = self.catext_norm(x + self.ca_text(x + x_pos, text_memory, text_memory, mask=text_mask))
-        x = self.norm1(x + self.cross_attn(x + x_pos, memory + memory_pos, memory, mask=cross_attn_bias))
-        x = self.norm3(x + self.linear2(F.relu(self.linear1(x))))
-        return x
-
-
-class TransformerDecoder(nn.Module):
-    def __init__(self, d_model=256, num_heads=8, dim_ff=2048, num_layers=6,
-                 num_queries=200, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.d_model = d_model
-        self.num_queries = num_queries
-
-        self.layers = nn.ModuleList([
-            DecoderLayer(d_model, num_heads, dim_ff, device=device, dtype=dtype, operations=operations)
-            for _ in range(num_layers)
-        ])
-        self.norm = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.query_embed = operations.Embedding(num_queries, d_model, device=device, dtype=dtype)
-        self.reference_points = operations.Embedding(num_queries, 4, device=device, dtype=dtype) # Reference points: Embedding(num_queries, 4) — learned anchor boxes
-        self.ref_point_head = MLP(d_model * 2, d_model, d_model, 2, device=device, dtype=dtype, operations=operations) # ref_point_head input: 512 (4 coords * 128 sine features each)
-        self.bbox_embed = MLP(d_model, d_model, 4, 3, device=device, dtype=dtype, operations=operations)
-
-        self.boxRPB_embed_x = MLP(2, d_model, num_heads, 2, device=device, dtype=dtype, operations=operations)
-        self.boxRPB_embed_y = MLP(2, d_model, num_heads, 2, device=device, dtype=dtype, operations=operations)
-
-        self.presence_token = operations.Embedding(1, d_model, device=device, dtype=dtype)
-        self.presence_token_head = MLP(d_model, d_model, 1, 3, device=device, dtype=dtype, operations=operations)
-        self.presence_token_out_norm = operations.LayerNorm(d_model, device=device, dtype=dtype)
-
-    @staticmethod
-    def _inverse_sigmoid(x):
-        return torch.log(x / (1 - x + 1e-6) + 1e-6)
-
-    def _compute_box_rpb(self, ref_points, H, W):
-        """Box rotary position bias: (B, Q, 4) cxcywh -> (B, n_heads, Q+1, H*W) bias."""
-        boxes_xyxy = box_cxcywh_to_xyxy(ref_points)
-        B, Q, _ = boxes_xyxy.shape
-        coords_h = torch.arange(H, device=ref_points.device, dtype=torch.float32) / H
-        coords_w = torch.arange(W, device=ref_points.device, dtype=torch.float32) / W
-        deltas_x = coords_w.view(1, 1, -1, 1) - boxes_xyxy[:, :, None, 0:3:2]
-        deltas_y = coords_h.view(1, 1, -1, 1) - boxes_xyxy[:, :, None, 1:4:2]
-
-        log2_8 = float(math.log2(8))
-        def log_scale(d):
-            return torch.sign(d * 8) * torch.log2(torch.abs(d * 8) + 1.0) / log2_8
-
-        rpb_x = self.boxRPB_embed_x(log_scale(deltas_x).to(ref_points.dtype))
-        rpb_y = self.boxRPB_embed_y(log_scale(deltas_y).to(ref_points.dtype))
-
-        bias = (rpb_y.unsqueeze(3) + rpb_x.unsqueeze(2)).flatten(2, 3).permute(0, 3, 1, 2)
-        pres_bias = torch.zeros(B, bias.shape[1], 1, bias.shape[3], device=bias.device, dtype=bias.dtype)
-        return torch.cat([pres_bias, bias], dim=2)
-
-    def forward(self, memory, memory_pos, text_memory=None, text_mask=None, H=72, W=72):
-        B = memory.shape[0]
-        tgt = cast_to_input(self.query_embed.weight, memory).unsqueeze(0).expand(B, -1, -1)
-        presence_out = cast_to_input(self.presence_token.weight, memory)[None].expand(B, -1, -1)
-        ref_points = cast_to_input(self.reference_points.weight, memory).unsqueeze(0).expand(B, -1, -1).sigmoid()
-
-        for layer_idx, layer in enumerate(self.layers):
-            query_pos = self.ref_point_head(gen_sineembed_for_position(ref_points, self.d_model))
-            tgt_with_pres = torch.cat([presence_out, tgt], dim=1)
-            pos_with_pres = torch.cat([torch.zeros_like(presence_out), query_pos], dim=1)
-            tgt_with_pres = layer(tgt_with_pres, memory, pos_with_pres, memory_pos,
-                                  text_memory, text_mask, self._compute_box_rpb(ref_points, H, W))
-            presence_out, tgt = tgt_with_pres[:, :1], tgt_with_pres[:, 1:]
-            if layer_idx < len(self.layers) - 1:
-                ref_inv = self._inverse_sigmoid(ref_points)
-                ref_points = (ref_inv + self.bbox_embed(self.norm(tgt))).sigmoid().detach()
-
-        query_out = self.norm(tgt)
-        ref_inv = self._inverse_sigmoid(ref_points)
-        boxes = (ref_inv + self.bbox_embed(query_out)).sigmoid()
-        presence = self.presence_token_head(self.presence_token_out_norm(presence_out)).squeeze(-1)
-        return {"decoder_output": query_out, "pred_boxes": boxes, "presence": presence}
-
-
-class Transformer(nn.Module):
-    def __init__(self, d_model=256, num_heads=8, dim_ff=2048, enc_layers=6, dec_layers=6,
-                 num_queries=200, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.encoder = TransformerEncoder(d_model, num_heads, dim_ff, enc_layers, device=device, dtype=dtype, operations=operations)
-        self.decoder = TransformerDecoder(d_model, num_heads, dim_ff, dec_layers, num_queries, device=device, dtype=dtype, operations=operations)
-
-
-class GeometryEncoder(nn.Module):
-    def __init__(self, d_model=256, num_heads=8, num_layers=3, roi_size=7, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.d_model = d_model
-        self.roi_size = roi_size
-        self.pos_enc = PositionEmbeddingSine(num_pos_feats=d_model, normalize=True)
-        self.points_direct_project = operations.Linear(2, d_model, device=device, dtype=dtype)
-        self.points_pool_project = operations.Linear(d_model, d_model, device=device, dtype=dtype)
-        self.points_pos_enc_project = operations.Linear(d_model, d_model, device=device, dtype=dtype)
-        self.boxes_direct_project = operations.Linear(4, d_model, device=device, dtype=dtype)
-        self.boxes_pool_project = operations.Conv2d(d_model, d_model, kernel_size=roi_size, device=device, dtype=dtype)
-        self.boxes_pos_enc_project = operations.Linear(d_model + 2, d_model, device=device, dtype=dtype)
-        self.label_embed = operations.Embedding(2, d_model, device=device, dtype=dtype)
-        self.cls_embed = operations.Embedding(1, d_model, device=device, dtype=dtype)
-        self.norm = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.img_pre_norm = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.encode = nn.ModuleList([
-            EncoderLayer(d_model, num_heads, 2048, device=device, dtype=dtype, operations=operations)
-            for _ in range(num_layers)
-        ])
-        self.encode_norm = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.final_proj = operations.Linear(d_model, d_model, device=device, dtype=dtype)
-
-    def _encode_points(self, coords, labels, img_feat_2d):
-        """Encode point prompts: direct + pool + pos_enc + label. coords: [B, N, 2] normalized."""
-        B, N, _ = coords.shape
-        embed = self.points_direct_project(coords)
-        # Pool features from backbone at point locations via grid_sample
-        grid = (coords * 2 - 1).unsqueeze(2)  # [B, N, 1, 2] in [-1, 1]
-        sampled = F.grid_sample(img_feat_2d, grid, align_corners=False)  # [B, C, N, 1]
-        embed = embed + self.points_pool_project(sampled.squeeze(-1).permute(0, 2, 1))  # [B, N, C]
-        # Positional encoding of coordinates
-        x, y = coords[:, :, 0], coords[:, :, 1]  # [B, N]
-        pos_x, pos_y = self.pos_enc._encode_xy(x.flatten(), y.flatten())
-        enc = torch.cat([pos_x, pos_y], dim=-1).view(B, N, -1)
-        embed = embed + self.points_pos_enc_project(cast_to_input(enc, embed))
-        embed = embed + cast_to_input(self.label_embed(labels.long()), embed)
-        return embed
-
-    def _encode_boxes(self, boxes, labels, img_feat_2d):
-        """Encode box prompts: direct + pool + pos_enc + label. boxes: [B, N, 4] normalized cxcywh."""
-        B, N, _ = boxes.shape
-        embed = self.boxes_direct_project(boxes)
-        # ROI align from backbone at box regions
-        H, W = img_feat_2d.shape[-2:]
-        boxes_xyxy = box_cxcywh_to_xyxy(boxes)
-        scale = torch.tensor([W, H, W, H], dtype=boxes_xyxy.dtype, device=boxes_xyxy.device)
-        boxes_scaled = boxes_xyxy * scale
-        sampled = roi_align(img_feat_2d, boxes_scaled.view(-1, 4).split(N), self.roi_size)
-        proj = self.boxes_pool_project(sampled).view(B, N, -1)  # Conv2d(roi_size) -> [B*N, C, 1, 1] -> [B, N, C]
-        embed = embed + proj
-        # Positional encoding of box center + size
-        cx, cy, w, h = boxes[:, :, 0], boxes[:, :, 1], boxes[:, :, 2], boxes[:, :, 3]
-        enc = self.pos_enc.encode_boxes(cx.flatten(), cy.flatten(), w.flatten(), h.flatten())
-        enc = enc.view(B, N, -1)
-        embed = embed + self.boxes_pos_enc_project(cast_to_input(enc, embed))
-        embed = embed + cast_to_input(self.label_embed(labels.long()), embed)
-        return embed
-
-    def forward(self, points=None, boxes=None, image_features=None):
-        """Encode geometry prompts. image_features: [B, HW, C] flattened backbone features."""
-        # Prepare 2D image features for pooling
-        img_feat_2d = None
-        if image_features is not None:
-            B = image_features.shape[0]
-            HW, C = image_features.shape[1], image_features.shape[2]
-            hw = int(math.sqrt(HW))
-            img_normed = self.img_pre_norm(image_features)
-            img_feat_2d = img_normed.permute(0, 2, 1).view(B, C, hw, hw)
-
-        embeddings = []
-        if points is not None:
-            coords, labels = points
-            embeddings.append(self._encode_points(coords, labels, img_feat_2d))
-        if boxes is not None:
-            B = boxes.shape[0]
-            box_labels = torch.ones(B, boxes.shape[1], dtype=torch.long, device=boxes.device)
-            embeddings.append(self._encode_boxes(boxes, box_labels, img_feat_2d))
-        if not embeddings:
-            return None
-        geo = torch.cat(embeddings, dim=1)
-        geo = self.norm(geo)
-        if image_features is not None:
-            for layer in self.encode:
-                geo = layer(geo, torch.zeros_like(geo), image_features)
-        geo = self.encode_norm(geo)
-        return self.final_proj(geo)
-
-
-class PixelDecoder(nn.Module):
-    """Top-down FPN pixel decoder with GroupNorm + ReLU + nearest interpolation."""
-    def __init__(self, d_model=256, num_stages=3, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.conv_layers = nn.ModuleList([operations.Conv2d(d_model, d_model, kernel_size=3, padding=1, device=device, dtype=dtype) for _ in range(num_stages)])
-        self.norms = nn.ModuleList([operations.GroupNorm(8, d_model, device=device, dtype=dtype) for _ in range(num_stages)])
-
-    def forward(self, backbone_features):
-        prev = backbone_features[-1]
-        for i, feat in enumerate(backbone_features[:-1][::-1]):
-            prev = F.relu(self.norms[i](self.conv_layers[i](feat + F.interpolate(prev, size=feat.shape[-2:], mode="nearest"))))
-        return prev
-
-
-class MaskPredictor(nn.Module):
-    def __init__(self, d_model=256, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.mask_embed = MLP(d_model, d_model, d_model, 3, device=device, dtype=dtype, operations=operations)
-
-    def forward(self, query_embeddings, pixel_features):
-        mask_embed = self.mask_embed(query_embeddings)
-        return torch.einsum("bqc,bchw->bqhw", mask_embed, pixel_features)
-
-
-class SegmentationHead(nn.Module):
-    def __init__(self, d_model=256, num_heads=8, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.d_model = d_model
-        self.pixel_decoder = PixelDecoder(d_model, 3, device=device, dtype=dtype, operations=operations)
-        self.mask_predictor = MaskPredictor(d_model, device=device, dtype=dtype, operations=operations)
-        self.cross_attend_prompt = SplitMHA(d_model, num_heads, device=device, dtype=dtype, operations=operations)
-        self.cross_attn_norm = operations.LayerNorm(d_model, device=device, dtype=dtype)
-        self.instance_seg_head = operations.Conv2d(d_model, d_model, kernel_size=1, device=device, dtype=dtype)
-        self.semantic_seg_head = operations.Conv2d(d_model, 1, kernel_size=1, device=device, dtype=dtype)
-
-    def forward(self, query_embeddings, backbone_features, encoder_hidden_states=None, prompt=None, prompt_mask=None):
-        if encoder_hidden_states is not None and prompt is not None:
-            enc_normed = self.cross_attn_norm(encoder_hidden_states)
-            enc_cross = self.cross_attend_prompt(enc_normed, prompt, prompt, mask=prompt_mask)
-            encoder_hidden_states = enc_cross + encoder_hidden_states
-
-        if encoder_hidden_states is not None:
-            B, H, W = encoder_hidden_states.shape[0], backbone_features[-1].shape[-2], backbone_features[-1].shape[-1]
-            encoder_visual = encoder_hidden_states[:, :H * W].permute(0, 2, 1).view(B, self.d_model, H, W)
-            backbone_features = list(backbone_features)
-            backbone_features[-1] = encoder_visual
-
-        pixel_features = self.pixel_decoder(backbone_features)
-        instance_features = self.instance_seg_head(pixel_features)
-        masks = self.mask_predictor(query_embeddings, instance_features)
-        return masks
-
-
-class DotProductScoring(nn.Module):
-    def __init__(self, d_model=256, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.hs_proj = operations.Linear(d_model, d_model, device=device, dtype=dtype)
-        self.prompt_proj = operations.Linear(d_model, d_model, device=device, dtype=dtype)
-        self.prompt_mlp = MLPWithNorm(d_model, 2048, d_model, 2, device=device, dtype=dtype, operations=operations)
-        self.scale = 1.0 / (d_model ** 0.5)
-
-    def forward(self, query_embeddings, prompt_embeddings, prompt_mask=None):
-        prompt = self.prompt_mlp(prompt_embeddings)
-        if prompt_mask is not None:
-            weight = prompt_mask.unsqueeze(-1).to(dtype=prompt.dtype)
-            pooled = (prompt * weight).sum(dim=1) / weight.sum(dim=1).clamp(min=1)
-        else:
-            pooled = prompt.mean(dim=1)
-        hs = self.hs_proj(query_embeddings)
-        pp = self.prompt_proj(pooled).unsqueeze(-1).to(hs.dtype)
-        scores = torch.matmul(hs, pp)
-        return (scores * self.scale).clamp(-12.0, 12.0).squeeze(-1)
-
-
-class SAM3Detector(nn.Module):
-    def __init__(self, d_model=256, embed_dim=1024, num_queries=200, device=None, dtype=None, operations=None, **kwargs):
-        super().__init__()
-        image_model = kwargs.pop("image_model", "SAM3")
-        for k in ("num_heads", "num_head_channels"):
-            kwargs.pop(k, None)
-        multiplex = image_model == "SAM31"
-        # SAM3: 4 FPN levels, drop last (scalp=1); SAM3.1: 3 levels, use all (scalp=0)
-        self.scalp = 0 if multiplex else 1
-        self.backbone = nn.ModuleDict({
-            "vision_backbone": SAM3VisionBackbone(embed_dim=embed_dim, d_model=d_model, multiplex=multiplex, device=device, dtype=dtype, operations=operations, **kwargs),
-            "language_backbone": nn.ModuleDict({"resizer": operations.Linear(embed_dim, d_model, device=device, dtype=dtype)}),
-        })
-        self.transformer = Transformer(d_model=d_model, num_queries=num_queries, device=device, dtype=dtype, operations=operations)
-        self.segmentation_head = SegmentationHead(d_model=d_model, device=device, dtype=dtype, operations=operations)
-        self.geometry_encoder = GeometryEncoder(d_model=d_model, device=device, dtype=dtype, operations=operations)
-        self.dot_prod_scoring = DotProductScoring(d_model=d_model, device=device, dtype=dtype, operations=operations)
-
-    def _get_backbone_features(self, images):
-        """Run backbone and return (detector_features, detector_positions, tracker_features, tracker_positions)."""
-        bb = self.backbone["vision_backbone"]
-        if bb.multiplex:
-            all_f, all_p, tf, tp = bb(images, tracker_mode="propagation")
-        else:
-            all_f, all_p, tf, tp = bb(images, need_tracker=True)
-        return all_f, all_p, tf, tp
-
-    @staticmethod
-    def _run_geo_layer(layer, x, memory, memory_pos):
-        x = x + layer.self_attn(layer.norm1(x))
-        x = x + layer.cross_attn_image(layer.norm2(x), memory + memory_pos, memory)
-        x = x + layer.linear2(F.relu(layer.linear1(layer.norm3(x))))
-        return x
-
-    def _detect(self, features, positions, text_embeddings=None, text_mask=None,
-                points=None, boxes=None):
-        """Shared detection: geometry encoding, transformer, scoring, segmentation."""
-        B = features[0].shape[0]
-        # Scalp for encoder (use top-level feature), but keep all levels for segmentation head
-        seg_features = features
-        if self.scalp > 0:
-            features = features[:-self.scalp]
-            positions = positions[:-self.scalp]
-        enc_feat, enc_pos = features[-1], positions[-1]
-        _, _, H, W = enc_feat.shape
-        img_flat = enc_feat.flatten(2).permute(0, 2, 1)
-        pos_flat = enc_pos.flatten(2).permute(0, 2, 1)
-
-        has_prompts = text_embeddings is not None or points is not None or boxes is not None
-        if has_prompts:
-            geo_enc = self.geometry_encoder
-            geo_prompts = geo_enc(points=points, boxes=boxes, image_features=img_flat)
-            geo_cls = geo_enc.norm(geo_enc.final_proj(cast_to_input(geo_enc.cls_embed.weight, img_flat).view(1, 1, -1).expand(B, -1, -1)))
-            for layer in geo_enc.encode:
-                geo_cls = self._run_geo_layer(layer, geo_cls, img_flat, pos_flat)
-            geo_cls = geo_enc.encode_norm(geo_cls)
-            if text_embeddings is not None and text_embeddings.shape[0] != B:
-                text_embeddings = text_embeddings.expand(B, -1, -1)
-            if text_mask is not None and text_mask.shape[0] != B:
-                text_mask = text_mask.expand(B, -1)
-            parts = [t for t in [text_embeddings, geo_prompts, geo_cls] if t is not None]
-            text_embeddings = torch.cat(parts, dim=1)
-            n_new = text_embeddings.shape[1] - (text_mask.shape[1] if text_mask is not None else 0)
-            if text_mask is not None:
-                text_mask = torch.cat([text_mask, torch.ones(B, n_new, dtype=torch.bool, device=text_mask.device)], dim=1)
-            else:
-                text_mask = torch.ones(B, text_embeddings.shape[1], dtype=torch.bool, device=text_embeddings.device)
-
-        memory = self.transformer.encoder(img_flat, pos_flat, text_embeddings, text_mask)
-        dec_out = self.transformer.decoder(memory, pos_flat, text_embeddings, text_mask, H, W)
-        query_out, pred_boxes = dec_out["decoder_output"], dec_out["pred_boxes"]
-
-        if text_embeddings is not None:
-            scores = self.dot_prod_scoring(query_out, text_embeddings, text_mask)
-        else:
-            scores = torch.zeros(B, query_out.shape[1], device=query_out.device)
-
-        masks = self.segmentation_head(query_out, seg_features, encoder_hidden_states=memory, prompt=text_embeddings, prompt_mask=text_mask)
-        return box_cxcywh_to_xyxy(pred_boxes), scores, masks, dec_out
-
-    def forward(self, images, text_embeddings=None, text_mask=None, points=None, boxes=None, threshold=0.3, orig_size=None):
-        features, positions, _, _ = self._get_backbone_features(images)
-
-        if text_embeddings is not None:
-            text_embeddings = self.backbone["language_backbone"]["resizer"](text_embeddings)
-            if text_mask is not None:
-                text_mask = text_mask.bool()
-
-        boxes_xyxy, scores, masks, dec_out = self._detect(
-            features, positions, text_embeddings, text_mask, points, boxes)
-
-        if orig_size is not None:
-            oh, ow = orig_size
-            boxes_xyxy = boxes_xyxy * torch.tensor([ow, oh, ow, oh], device=boxes_xyxy.device, dtype=boxes_xyxy.dtype)
-            masks = F.interpolate(masks, size=orig_size, mode="bilinear", align_corners=False)
-
-        return {
-            "boxes": boxes_xyxy,
-            "scores": scores,
-            "masks": masks,
-            "presence": dec_out.get("presence"),
-        }
-
-    def forward_from_trunk(self, trunk_out, text_embeddings, text_mask):
-        """Run detection using a pre-computed ViTDet trunk output.
-
-        text_embeddings must already be resized through language_backbone.resizer.
-        Returns dict with boxes (normalized xyxy), scores, masks at detector resolution.
-        """
-        bb = self.backbone["vision_backbone"]
-        features = [conv(trunk_out) for conv in bb.convs]
-        positions = [cast_to_input(bb.position_encoding(f), f) for f in features]
-
-        if text_mask is not None:
-            text_mask = text_mask.bool()
-
-        boxes_xyxy, scores, masks, _ = self._detect(features, positions, text_embeddings, text_mask)
-        return {"boxes": boxes_xyxy, "scores": scores, "masks": masks}
-
-
-class SAM3Model(nn.Module):
-    def __init__(self, device=None, dtype=None, operations=None, **kwargs):
-        super().__init__()
-        self.dtype = dtype
-        image_model = kwargs.get("image_model", "SAM3")
-        tracker_cls = TRACKER_CLASSES[image_model]
-        self.detector = SAM3Detector(device=device, dtype=dtype, operations=operations, **kwargs)
-        self.tracker = tracker_cls(device=device, dtype=dtype, operations=operations, **kwargs)
-
-    def forward(self, images, **kwargs):
-        return self.detector(images, **kwargs)
-
-    def forward_segment(self, images, point_inputs=None, box_inputs=None, mask_inputs=None):
-        """Interactive segmentation using SAM decoder with point/box/mask prompts.
-
-        Args:
-            images: [B, 3, 1008, 1008] preprocessed images
-            point_inputs: {"point_coords": [B, N, 2], "point_labels": [B, N]} in 1008x1008 pixel space
-            box_inputs: [B, 2, 2] box corners (top-left, bottom-right) in 1008x1008 pixel space
-            mask_inputs: [B, 1, H, W] coarse mask logits to refine
-        Returns:
-            [B, 1, image_size, image_size] high-res mask logits
-        """
-        bb = self.detector.backbone["vision_backbone"]
-        if bb.multiplex:
-            _, _, tracker_features, tracker_positions = bb(images, tracker_mode="interactive")
-        else:
-            _, _, tracker_features, tracker_positions = bb(images, need_tracker=True)
-            if self.detector.scalp > 0:
-                tracker_features = tracker_features[:-self.detector.scalp]
-                tracker_positions = tracker_positions[:-self.detector.scalp]
-
-        high_res = list(tracker_features[:-1])
-        backbone_feat = tracker_features[-1]
-        B, C, H, W = backbone_feat.shape
-        # Add no-memory embedding (init frame path)
-        no_mem = getattr(self.tracker, 'interactivity_no_mem_embed', None)
-        if no_mem is None:
-            no_mem = getattr(self.tracker, 'no_mem_embed', None)
-        if no_mem is not None:
-            feat_flat = backbone_feat.flatten(2).permute(0, 2, 1)
-            feat_flat = feat_flat + cast_to_input(no_mem, feat_flat)
-            backbone_feat = feat_flat.view(B, H, W, C).permute(0, 3, 1, 2)
-
-        num_pts = 0 if point_inputs is None else point_inputs["point_labels"].size(1)
-        _, high_res_masks, _, _ = self.tracker._forward_sam_heads(
-            backbone_features=backbone_feat,
-            point_inputs=point_inputs,
-            mask_inputs=mask_inputs,
-            box_inputs=box_inputs,
-            high_res_features=high_res,
-            multimask_output=(0 < num_pts <= 1),
-        )
-        return high_res_masks
-
-    def forward_video(self, images, initial_masks, pbar=None, text_prompts=None,
-                       new_det_thresh=0.5, max_objects=0, detect_interval=1):
-        """Track video with optional per-frame text-prompted detection."""
-        bb = self.detector.backbone["vision_backbone"]
-
-        def backbone_fn(frame, frame_idx=None):
-            trunk_out = bb.trunk(frame)
-            if bb.multiplex:
-                _, _, tf, tp = bb(frame, tracker_mode="propagation", cached_trunk=trunk_out, tracker_only=True)
-            else:
-                _, _, tf, tp = bb(frame, need_tracker=True, cached_trunk=trunk_out, tracker_only=True)
-            return tf, tp, trunk_out
-
-        detect_fn = None
-        if text_prompts:
-            resizer = self.detector.backbone["language_backbone"]["resizer"]
-            resized = [(resizer(emb), m.bool() if m is not None else None) for emb, m in text_prompts]
-            def detect_fn(trunk_out):
-                all_scores, all_masks = [], []
-                for emb, mask in resized:
-                    det = self.detector.forward_from_trunk(trunk_out, emb, mask)
-                    all_scores.append(det["scores"])
-                    all_masks.append(det["masks"])
-                return {"scores": torch.cat(all_scores, dim=1), "masks": torch.cat(all_masks, dim=1)}
-
-        if hasattr(self.tracker, 'track_video_with_detection'):
-            return self.tracker.track_video_with_detection(
-                backbone_fn, images, initial_masks, detect_fn,
-                new_det_thresh=new_det_thresh, max_objects=max_objects,
-                detect_interval=detect_interval, backbone_obj=bb, pbar=pbar)
-        # SAM3 (non-multiplex) — no detection support, requires initial masks
-        if initial_masks is None:
-            raise ValueError("SAM3 (non-multiplex) requires initial_mask for video tracking")
-        return self.tracker.track_video(backbone_fn, images, initial_masks, pbar=pbar, backbone_obj=bb)
--- a/comfy/ldm/sam3/sam.py
+++ b/comfy/ldm/sam3/sam.py
@@ -1,425 +0,0 @@
-# SAM3 shared components: primitives, ViTDet backbone, FPN neck, position encodings.
-
-import math
-
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-
-from comfy.ldm.modules.attention import optimized_attention
-from comfy.ldm.flux.math import apply_rope
-from comfy.ldm.flux.layers import EmbedND
-from comfy.ops import cast_to_input
-
-
-class MLP(nn.Module):
-    def __init__(self, input_dim, hidden_dim, output_dim, num_layers, sigmoid_output=False, device=None, dtype=None, operations=None):
-        super().__init__()
-        dims = [input_dim] + [hidden_dim] * (num_layers - 1) + [output_dim]
-        self.layers = nn.ModuleList([operations.Linear(dims[i], dims[i + 1], device=device, dtype=dtype) for i in range(num_layers)])
-        self.sigmoid_output = sigmoid_output
-
-    def forward(self, x):
-        for i, layer in enumerate(self.layers):
-            x = F.relu(layer(x)) if i < len(self.layers) - 1 else layer(x)
-        return torch.sigmoid(x) if self.sigmoid_output else x
-
-
-class SAMAttention(nn.Module):
-    def __init__(self, embedding_dim, num_heads, downsample_rate=1, kv_in_dim=None, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.num_heads = num_heads
-        internal_dim = embedding_dim // downsample_rate
-        kv_dim = kv_in_dim if kv_in_dim is not None else embedding_dim
-        self.q_proj = operations.Linear(embedding_dim, internal_dim, device=device, dtype=dtype)
-        self.k_proj = operations.Linear(kv_dim, internal_dim, device=device, dtype=dtype)
-        self.v_proj = operations.Linear(kv_dim, internal_dim, device=device, dtype=dtype)
-        self.out_proj = operations.Linear(internal_dim, embedding_dim, device=device, dtype=dtype)
-
-    def forward(self, q, k, v):
-        q = self.q_proj(q)
-        k = self.k_proj(k)
-        v = self.v_proj(v)
-        return self.out_proj(optimized_attention(q, k, v, self.num_heads, low_precision_attention=False))
-
-
-class TwoWayAttentionBlock(nn.Module):
-    def __init__(self, embedding_dim, num_heads, mlp_dim=2048, attention_downsample_rate=2, skip_first_layer_pe=False, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.skip_first_layer_pe = skip_first_layer_pe
-        self.self_attn = SAMAttention(embedding_dim, num_heads, device=device, dtype=dtype, operations=operations)
-        self.cross_attn_token_to_image = SAMAttention(embedding_dim, num_heads, downsample_rate=attention_downsample_rate, device=device, dtype=dtype, operations=operations)
-        self.cross_attn_image_to_token = SAMAttention(embedding_dim, num_heads, downsample_rate=attention_downsample_rate, device=device, dtype=dtype, operations=operations)
-        self.mlp = nn.Sequential(operations.Linear(embedding_dim, mlp_dim, device=device, dtype=dtype), nn.ReLU(), operations.Linear(mlp_dim, embedding_dim, device=device, dtype=dtype))
-        self.norm1 = operations.LayerNorm(embedding_dim, device=device, dtype=dtype)
-        self.norm2 = operations.LayerNorm(embedding_dim, device=device, dtype=dtype)
-        self.norm3 = operations.LayerNorm(embedding_dim, device=device, dtype=dtype)
-        self.norm4 = operations.LayerNorm(embedding_dim, device=device, dtype=dtype)
-
-    def forward(self, queries, keys, query_pe, key_pe):
-        if self.skip_first_layer_pe:
-            queries = self.norm1(self.self_attn(queries, queries, queries))
-        else:
-            q = queries + query_pe
-            queries = self.norm1(queries + self.self_attn(q, q, queries))
-        q, k = queries + query_pe, keys + key_pe
-        queries = self.norm2(queries + self.cross_attn_token_to_image(q, k, keys))
-        queries = self.norm3(queries + self.mlp(queries))
-        q, k = queries + query_pe, keys + key_pe
-        keys = self.norm4(keys + self.cross_attn_image_to_token(k, q, queries))
-        return queries, keys
-
-
-class TwoWayTransformer(nn.Module):
-    def __init__(self, depth=2, embedding_dim=256, num_heads=8, mlp_dim=2048, attention_downsample_rate=2, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.layers = nn.ModuleList([
-            TwoWayAttentionBlock(embedding_dim, num_heads, mlp_dim, attention_downsample_rate,
-                                 skip_first_layer_pe=(i == 0), device=device, dtype=dtype, operations=operations)
-            for i in range(depth)
-        ])
-        self.final_attn_token_to_image = SAMAttention(embedding_dim, num_heads, downsample_rate=attention_downsample_rate, device=device, dtype=dtype, operations=operations)
-        self.norm_final = operations.LayerNorm(embedding_dim, device=device, dtype=dtype)
-
-    def forward(self, image_embedding, image_pe, point_embedding):
-        queries, keys = point_embedding, image_embedding
-        for layer in self.layers:
-            queries, keys = layer(queries, keys, point_embedding, image_pe)
-        q, k = queries + point_embedding, keys + image_pe
-        queries = self.norm_final(queries + self.final_attn_token_to_image(q, k, keys))
-        return queries, keys
-
-
-class PositionEmbeddingRandom(nn.Module):
-    """Fourier feature positional encoding with random gaussian projection."""
-    def __init__(self, num_pos_feats=64, scale=None):
-        super().__init__()
-        self.register_buffer("positional_encoding_gaussian_matrix", (scale or 1.0) * torch.randn(2, num_pos_feats))
-
-    def _encode(self, normalized_coords):
-        """Map normalized [0,1] coordinates to fourier features via random projection. Computes in fp32."""
-        orig_dtype = normalized_coords.dtype
-        proj_matrix = self.positional_encoding_gaussian_matrix.to(device=normalized_coords.device, dtype=torch.float32)
-        projected = 2 * math.pi * (2 * normalized_coords.float() - 1) @ proj_matrix
-        return torch.cat([projected.sin(), projected.cos()], dim=-1).to(orig_dtype)
-
-    def forward(self, size, device=None):
-        h, w = size
-        dev = device if device is not None else self.positional_encoding_gaussian_matrix.device
-        ones = torch.ones((h, w), device=dev, dtype=torch.float32)
-        norm_xy = torch.stack([(ones.cumsum(1) - 0.5) / w, (ones.cumsum(0) - 0.5) / h], dim=-1)
-        return self._encode(norm_xy).permute(2, 0, 1).unsqueeze(0)
-
-    def forward_with_coords(self, pixel_coords, image_size):
-        norm = pixel_coords.clone()
-        norm[:, :, 0] /= image_size[1]
-        norm[:, :, 1] /= image_size[0]
-        return self._encode(norm)
-
-
-# ViTDet backbone + FPN neck
-
-def window_partition(x: torch.Tensor, window_size: int):
-    B, H, W, C = x.shape
-    pad_h = (window_size - H % window_size) % window_size
-    pad_w = (window_size - W % window_size) % window_size
-    if pad_h > 0 or pad_w > 0:
-        x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))
-    Hp, Wp = H + pad_h, W + pad_w
-    x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C)
-    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
-    return windows, (Hp, Wp)
-
-
-def window_unpartition(windows: torch.Tensor, window_size: int, pad_hw, hw):
-    Hp, Wp = pad_hw
-    H, W = hw
-    B = windows.shape[0] // (Hp * Wp // window_size // window_size)
-    x = windows.view(B, Hp // window_size, Wp // window_size, window_size, window_size, -1)
-    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1)
-    if Hp > H or Wp > W:
-        x = x[:, :H, :W, :].contiguous()
-    return x
-
-
-def rope_2d(end_x: int, end_y: int, dim: int, theta: float = 10000.0, scale_pos: float = 1.0):
-    """Generate 2D axial RoPE using flux EmbedND. Returns [1, 1, HW, dim//2, 2, 2]."""
-    t = torch.arange(end_x * end_y, dtype=torch.float32)
-    ids = torch.stack([(t % end_x) * scale_pos,
-                       torch.div(t, end_x, rounding_mode="floor") * scale_pos], dim=-1)
-    return EmbedND(dim=dim, theta=theta, axes_dim=[dim // 2, dim // 2])(ids.unsqueeze(0))
-
-
-class _ViTMLP(nn.Module):
-    def __init__(self, dim, mlp_ratio=4.0, device=None, dtype=None, operations=None):
-        super().__init__()
-        hidden = int(dim * mlp_ratio)
-        self.fc1 = operations.Linear(dim, hidden, device=device, dtype=dtype)
-        self.act = nn.GELU()
-        self.fc2 = operations.Linear(hidden, dim, device=device, dtype=dtype)
-
-    def forward(self, x):
-        return self.fc2(self.act(self.fc1(x)))
-
-
-class Attention(nn.Module):
-    """ViTDet multi-head attention with fused QKV projection."""
-
-    def __init__(self, dim, num_heads=8, qkv_bias=True, use_rope=False, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.num_heads = num_heads
-        self.head_dim = dim // num_heads
-        self.use_rope = use_rope
-        self.qkv = operations.Linear(dim, dim * 3, bias=qkv_bias, device=device, dtype=dtype)
-        self.proj = operations.Linear(dim, dim, device=device, dtype=dtype)
-
-    def forward(self, x, freqs_cis=None):
-        B, N, C = x.shape
-        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
-        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(dim=0)
-        if self.use_rope and freqs_cis is not None:
-            q, k = apply_rope(q, k, freqs_cis)
-        return self.proj(optimized_attention(q, k, v, self.num_heads, skip_reshape=True, low_precision_attention=False))
-
-
-class Block(nn.Module):
-    def __init__(self, dim, num_heads, mlp_ratio=4.0, qkv_bias=True, window_size=0, use_rope=False, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.window_size = window_size
-        self.norm1 = operations.LayerNorm(dim, device=device, dtype=dtype)
-        self.attn = Attention(dim, num_heads, qkv_bias, use_rope, device=device, dtype=dtype, operations=operations)
-        self.norm2 = operations.LayerNorm(dim, device=device, dtype=dtype)
-        self.mlp = _ViTMLP(dim, mlp_ratio, device=device, dtype=dtype, operations=operations)
-
-    def forward(self, x, freqs_cis=None):
-        shortcut = x
-        x = self.norm1(x)
-        if self.window_size > 0:
-            H, W = x.shape[1], x.shape[2]
-            x, pad_hw = window_partition(x, self.window_size)
-            x = x.view(x.shape[0], self.window_size * self.window_size, -1)
-            x = self.attn(x, freqs_cis=freqs_cis)
-            x = x.view(-1, self.window_size, self.window_size, x.shape[-1])
-            x = window_unpartition(x, self.window_size, pad_hw, (H, W))
-        else:
-            B, H, W, C = x.shape
-            x = x.view(B, H * W, C)
-            x = self.attn(x, freqs_cis=freqs_cis)
-            x = x.view(B, H, W, C)
-        x = shortcut + x
-        x = x + self.mlp(self.norm2(x))
-        return x
-
-
-class PatchEmbed(nn.Module):
-    def __init__(self, patch_size=14, in_chans=3, embed_dim=1024, device=None, dtype=None, operations=None):
-        super().__init__()
-        self.proj = operations.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size, bias=False, device=device, dtype=dtype)
-
-    def forward(self, x):
-        return self.proj(x)
-
-
-class ViTDet(nn.Module):
-    def __init__(self, img_size=1008, patch_size=14, embed_dim=1024, depth=32, num_heads=16, mlp_ratio=4.625, qkv_bias=True, window_size=24,
-                 global_att_blocks=(7, 15, 23, 31), use_rope=True, pretrain_img_size=336, device=None, dtype=None, operations=None, **kwargs):
-        super().__init__()
-        self.img_size = img_size
-        self.patch_size = patch_size
-        self.embed_dim = embed_dim
-        self.num_heads = num_heads
-        self.global_att_blocks = set(global_att_blocks)
-
-        self.patch_embed = PatchEmbed(patch_size, 3, embed_dim, device=device, dtype=dtype, operations=operations)
-
-        num_patches = (pretrain_img_size // patch_size) ** 2 + 1  # +1 for cls token
-        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim, device=device, dtype=dtype))
-
-        self.ln_pre = operations.LayerNorm(embed_dim, device=device, dtype=dtype)
-
-        grid_size = img_size // patch_size
-        pretrain_grid = pretrain_img_size // patch_size
-
-        self.blocks = nn.ModuleList()
-        for i in range(depth):
-            is_global = i in self.global_att_blocks
-            self.blocks.append(Block(
-                embed_dim, num_heads, mlp_ratio, qkv_bias,
-                window_size=0 if is_global else window_size,
-                use_rope=use_rope,
-                device=device, dtype=dtype, operations=operations,
-            ))
-
-        if use_rope:
-            rope_scale = pretrain_grid / grid_size
-            self.register_buffer("freqs_cis", rope_2d(grid_size, grid_size, embed_dim // num_heads, scale_pos=rope_scale), persistent=False)
-            self.register_buffer("freqs_cis_window", rope_2d(window_size, window_size, embed_dim // num_heads), persistent=False)
-        else:
-            self.freqs_cis = None
-            self.freqs_cis_window = None
-
-    def _get_pos_embed(self, num_tokens):
-        pos = self.pos_embed
-        if pos.shape[1] == num_tokens:
-            return pos
-        cls_pos = pos[:, :1]
-        spatial_pos = pos[:, 1:]
-        old_size = int(math.sqrt(spatial_pos.shape[1]))
-        new_size = int(math.sqrt(num_tokens - 1)) if num_tokens > 1 else old_size
-        spatial_2d = spatial_pos.reshape(1, old_size, old_size, -1).permute(0, 3, 1, 2)
-        tiles_h = new_size // old_size + 1
-        tiles_w = new_size // old_size + 1
-        tiled = spatial_2d.tile([1, 1, tiles_h, tiles_w])[:, :, :new_size, :new_size]
-        tiled = tiled.permute(0, 2, 3, 1).reshape(1, new_size * new_size, -1)
-        return torch.cat([cls_pos, tiled], dim=1)
-
-    def forward(self, x):
-        x = self.patch_embed(x)
-        B, C, Hp, Wp = x.shape
-        x = x.permute(0, 2, 3, 1).reshape(B, Hp * Wp, C)
-
-        pos = cast_to_input(self._get_pos_embed(Hp * Wp + 1), x)
-        x = x + pos[:, 1:Hp * Wp + 1]
-
-        x = x.view(B, Hp, Wp, C)
-        x = self.ln_pre(x)
-
-        freqs_cis_global = self.freqs_cis
-        freqs_cis_win = self.freqs_cis_window
-        if freqs_cis_global is not None:
-            freqs_cis_global = cast_to_input(freqs_cis_global, x)
-        if freqs_cis_win is not None:
-            freqs_cis_win = cast_to_input(freqs_cis_win, x)
-
-        for block in self.blocks:
-            fc = freqs_cis_win if block.window_size > 0 else freqs_cis_global
-            x = block(x, freqs_cis=fc)
-
-        return x.permute(0, 3, 1, 2)
-
-
-class FPNScaleConv(nn.Module):
-    def __init__(self, in_dim, out_dim, scale, device=None, dtype=None, operations=None):
-        super().__init__()
-        if scale == 4.0:
-            self.dconv_2x2_0 = operations.ConvTranspose2d(in_dim, in_dim // 2, kernel_size=2, stride=2, device=device, dtype=dtype)
-            self.dconv_2x2_1 = operations.ConvTranspose2d(in_dim // 2, in_dim // 4, kernel_size=2, stride=2, device=device, dtype=dtype)
-            proj_in = in_dim // 4
-        elif scale == 2.0:
-            self.dconv_2x2 = operations.ConvTranspose2d(in_dim, in_dim // 2, kernel_size=2, stride=2, device=device, dtype=dtype)
-            proj_in = in_dim // 2
-        elif scale == 1.0:
-            proj_in = in_dim
-        elif scale == 0.5:
-            self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
-            proj_in = in_dim
-        self.scale = scale
-        self.conv_1x1 = operations.Conv2d(proj_in, out_dim, kernel_size=1, device=device, dtype=dtype)
-        self.conv_3x3 = operations.Conv2d(out_dim, out_dim, kernel_size=3, padding=1, device=device, dtype=dtype)
-
-    def forward(self, x):
-        if self.scale == 4.0:
-            x = F.gelu(self.dconv_2x2_0(x))
-            x = self.dconv_2x2_1(x)
-        elif self.scale == 2.0:
-            x = self.dconv_2x2(x)
-        elif self.scale == 0.5:
-            x = self.pool(x)
-        x = self.conv_1x1(x)
-        x = self.conv_3x3(x)
-        return x
-
-
-class PositionEmbeddingSine(nn.Module):
-    """2D sinusoidal position encoding (DETR-style) with result caching."""
-    def __init__(self, num_pos_feats=256, temperature=10000.0, normalize=True, scale=None):
-        super().__init__()
-        assert num_pos_feats % 2 == 0
-        self.half_dim = num_pos_feats // 2
-        self.temperature = temperature
-        self.normalize = normalize
-        self.scale = scale if scale is not None else 2 * math.pi
-        self._cache = {}
-
-    def _sincos(self, vals):
-        """Encode 1D values to interleaved sin/cos features."""
-        freqs = self.temperature ** (2 * (torch.arange(self.half_dim, dtype=torch.float32, device=vals.device) // 2) / self.half_dim)
-        raw = vals[..., None] * self.scale / freqs
-        return torch.stack((raw[..., 0::2].sin(), raw[..., 1::2].cos()), dim=-1).flatten(-2)
-
-    def _encode_xy(self, x, y):
-        """Encode normalized x, y coordinates to sinusoidal features. Returns (pos_x, pos_y) each [N, half_dim]."""
-        dim_t = self.temperature ** (2 * (torch.arange(self.half_dim, dtype=torch.float32, device=x.device) // 2) / self.half_dim)
-        pos_x = x[:, None] * self.scale / dim_t
-        pos_y = y[:, None] * self.scale / dim_t
-        pos_x = torch.stack((pos_x[:, 0::2].sin(), pos_x[:, 1::2].cos()), dim=2).flatten(1)
-        pos_y = torch.stack((pos_y[:, 0::2].sin(), pos_y[:, 1::2].cos()), dim=2).flatten(1)
-        return pos_x, pos_y
-
-    def encode_boxes(self, cx, cy, w, h):
-        """Encode box center + size to [N, d_model+2] features."""
-        pos_x, pos_y = self._encode_xy(cx, cy)
-        return torch.cat((pos_y, pos_x, h[:, None], w[:, None]), dim=1)
-
-    def forward(self, x):
-        B, C, H, W = x.shape
-        key = (H, W, x.device)
-        if key not in self._cache:
-            gy = torch.arange(H, dtype=torch.float32, device=x.device)
-            gx = torch.arange(W, dtype=torch.float32, device=x.device)
-            if self.normalize:
-                gy, gx = gy / (H - 1 + 1e-6), gx / (W - 1 + 1e-6)
-            yy, xx = torch.meshgrid(gy, gx, indexing="ij")
-            self._cache[key] = torch.cat((self._sincos(yy), self._sincos(xx)), dim=-1).permute(2, 0, 1).unsqueeze(0)
-        return self._cache[key].expand(B, -1, -1, -1)
-
-
-class SAM3VisionBackbone(nn.Module):
-    def __init__(self, embed_dim=1024, d_model=256, multiplex=False, device=None, dtype=None, operations=None, **kwargs):
-        super().__init__()
-        self.trunk = ViTDet(embed_dim=embed_dim, device=device, dtype=dtype, operations=operations, **kwargs)
-        self.position_encoding = PositionEmbeddingSine(num_pos_feats=d_model, normalize=True)
-        self.multiplex = multiplex
-
-        fpn_args = dict(device=device, dtype=dtype, operations=operations)
-        if multiplex:
-            scales = [4.0, 2.0, 1.0]
-            self.convs = nn.ModuleList([FPNScaleConv(embed_dim, d_model, s, **fpn_args) for s in scales])
-            self.propagation_convs = nn.ModuleList([FPNScaleConv(embed_dim, d_model, s, **fpn_args) for s in scales])
-            self.interactive_convs = nn.ModuleList([FPNScaleConv(embed_dim, d_model, s, **fpn_args) for s in scales])
-        else:
-            scales = [4.0, 2.0, 1.0, 0.5]
-            self.convs = nn.ModuleList([FPNScaleConv(embed_dim, d_model, s, **fpn_args) for s in scales])
-            self.sam2_convs = nn.ModuleList([FPNScaleConv(embed_dim, d_model, s, **fpn_args) for s in scales])
-
-    def forward(self, images, need_tracker=False, tracker_mode=None, cached_trunk=None, tracker_only=False):
-        backbone_out = cached_trunk if cached_trunk is not None else self.trunk(images)
-
-        if tracker_only:
-            # Skip detector FPN when only tracker features are needed (video tracking)
-            if self.multiplex:
-                tracker_convs = self.propagation_convs if tracker_mode == "propagation" else self.interactive_convs
-            else:
-                tracker_convs = self.sam2_convs
-            tracker_features = [conv(backbone_out) for conv in tracker_convs]
-            tracker_positions = [cast_to_input(self.position_encoding(f), f) for f in tracker_features]
-            return None, None, tracker_features, tracker_positions
-
-        features = [conv(backbone_out) for conv in self.convs]
-        positions = [cast_to_input(self.position_encoding(f), f) for f in features]
-
-        if self.multiplex:
-            if tracker_mode == "propagation":
-                tracker_convs = self.propagation_convs
-            elif tracker_mode == "interactive":
-                tracker_convs = self.interactive_convs
-            else:
-                return features, positions, None, None
-        elif need_tracker:
-            tracker_convs = self.sam2_convs
-        else:
-            return features, positions, None, None
-
-        tracker_features = [conv(backbone_out) for conv in tracker_convs]
-        tracker_positions = [cast_to_input(self.position_encoding(f), f) for f in tracker_features]
-        return features, positions, tracker_features, tracker_positions
--- a/comfy/ldm/sam3/tracker.py
+++ b/comfy/ldm/sam3/tracker.py
--- a/comfy/ldm/supir/init.py
+++ b/comfy/ldm/supir/init.py
--- a/comfy/ldm/supir/supir_modules.py
+++ b/comfy/ldm/supir/supir_modules.py
@@ -1,226 +0,0 @@
-import torch
-import torch.nn as nn
-
-from comfy.ldm.modules.diffusionmodules.util import timestep_embedding
-from comfy.ldm.modules.diffusionmodules.openaimodel import Downsample, TimestepEmbedSequential, ResBlock, SpatialTransformer
-from comfy.ldm.modules.attention import optimized_attention
-
-
-class ZeroSFT(nn.Module):
-    def __init__(self, label_nc, norm_nc, concat_channels=0, dtype=None, device=None, operations=None):
-        super().__init__()
-
-        ks = 3
-        pw = ks // 2
-
-        self.param_free_norm = operations.GroupNorm(32, norm_nc + concat_channels, dtype=dtype, device=device)
-
-        nhidden = 128
-
-        self.mlp_shared = nn.Sequential(
-            operations.Conv2d(label_nc, nhidden, kernel_size=ks, padding=pw, dtype=dtype, device=device),
-            nn.SiLU()
-        )
-        self.zero_mul = operations.Conv2d(nhidden, norm_nc + concat_channels, kernel_size=ks, padding=pw, dtype=dtype, device=device)
-        self.zero_add = operations.Conv2d(nhidden, norm_nc + concat_channels, kernel_size=ks, padding=pw, dtype=dtype, device=device)
-
-        self.zero_conv = operations.Conv2d(label_nc, norm_nc, 1, 1, 0, dtype=dtype, device=device)
-        self.pre_concat = bool(concat_channels != 0)
-
-    def forward(self, c, h, h_ori=None, control_scale=1):
-        if h_ori is not None and self.pre_concat:
-            h_raw = torch.cat([h_ori, h], dim=1)
-        else:
-            h_raw = h
-
-        h = h + self.zero_conv(c)
-        if h_ori is not None and self.pre_concat:
-            h = torch.cat([h_ori, h], dim=1)
-        actv = self.mlp_shared(c)
-        gamma = self.zero_mul(actv)
-        beta = self.zero_add(actv)
-        h = self.param_free_norm(h)
-        h = torch.addcmul(h + beta, h, gamma)
-        if h_ori is not None and not self.pre_concat:
-            h = torch.cat([h_ori, h], dim=1)
-        return torch.lerp(h_raw, h, control_scale)
-
-
-class _CrossAttnInner(nn.Module):
-    """Inner cross-attention module matching the state_dict layout of the original CrossAttention."""
-    def __init__(self, query_dim, context_dim, heads, dim_head, dtype=None, device=None, operations=None):
-        super().__init__()
-        inner_dim = dim_head * heads
-        self.heads = heads
-        self.to_q = operations.Linear(query_dim, inner_dim, bias=False, dtype=dtype, device=device)
-        self.to_k = operations.Linear(context_dim, inner_dim, bias=False, dtype=dtype, device=device)
-        self.to_v = operations.Linear(context_dim, inner_dim, bias=False, dtype=dtype, device=device)
-        self.to_out = nn.Sequential(
-            operations.Linear(inner_dim, query_dim, dtype=dtype, device=device),
-        )
-
-    def forward(self, x, context):
-        q = self.to_q(x)
-        k = self.to_k(context)
-        v = self.to_v(context)
-        return self.to_out(optimized_attention(q, k, v, self.heads))
-
-
-class ZeroCrossAttn(nn.Module):
-    def __init__(self, context_dim, query_dim, dtype=None, device=None, operations=None):
-        super().__init__()
-        heads = query_dim // 64
-        dim_head = 64
-        self.attn = _CrossAttnInner(query_dim, context_dim, heads, dim_head, dtype=dtype, device=device, operations=operations)
-        self.norm1 = operations.GroupNorm(32, query_dim, dtype=dtype, device=device)
-        self.norm2 = operations.GroupNorm(32, context_dim, dtype=dtype, device=device)
-
-    def forward(self, context, x, control_scale=1):
-        b, c, h, w = x.shape
-        x_in = x
-
-        x = self.attn(
-            self.norm1(x).flatten(2).transpose(1, 2),
-            self.norm2(context).flatten(2).transpose(1, 2),
-        ).transpose(1, 2).unflatten(2, (h, w))
-
-        return x_in + x * control_scale
-
-
-class GLVControl(nn.Module):
-    """SUPIR's Guided Latent Vector control encoder. Truncated UNet (input + middle blocks only)."""
-    def __init__(
-        self,
-        in_channels=4,
-        model_channels=320,
-        num_res_blocks=2,
-        attention_resolutions=(4, 2),
-        channel_mult=(1, 2, 4),
-        num_head_channels=64,
-        transformer_depth=(1, 2, 10),
-        context_dim=2048,
-        adm_in_channels=2816,
-        use_linear_in_transformer=True,
-        use_checkpoint=False,
-        dtype=None,
-        device=None,
-        operations=None,
-        **kwargs,
-    ):
-        super().__init__()
-        self.model_channels = model_channels
-        time_embed_dim = model_channels * 4
-
-        self.time_embed = nn.Sequential(
-            operations.Linear(model_channels, time_embed_dim, dtype=dtype, device=device),
-            nn.SiLU(),
-            operations.Linear(time_embed_dim, time_embed_dim, dtype=dtype, device=device),
-        )
-
-        self.label_emb = nn.Sequential(
-            nn.Sequential(
-                operations.Linear(adm_in_channels, time_embed_dim, dtype=dtype, device=device),
-                nn.SiLU(),
-                operations.Linear(time_embed_dim, time_embed_dim, dtype=dtype, device=device),
-            )
-        )
-
-        self.input_blocks = nn.ModuleList([
-            TimestepEmbedSequential(
-                operations.Conv2d(in_channels, model_channels, 3, padding=1, dtype=dtype, device=device)
-            )
-        ])
-        ch = model_channels
-        ds = 1
-        for level, mult in enumerate(channel_mult):
-            for nr in range(num_res_blocks):
-                layers = [
-                    ResBlock(ch, time_embed_dim, 0, out_channels=mult * model_channels,
-                             dtype=dtype, device=device, operations=operations)
-                ]
-                ch = mult * model_channels
-                if ds in attention_resolutions:
-                    num_heads = ch // num_head_channels
-                    layers.append(
-                        SpatialTransformer(ch, num_heads, num_head_channels,
-                                           depth=transformer_depth[level], context_dim=context_dim,
-                                           use_linear=use_linear_in_transformer,
-                                           use_checkpoint=use_checkpoint,
-                                           dtype=dtype, device=device, operations=operations)
-                    )
-                self.input_blocks.append(TimestepEmbedSequential(*layers))
-            if level != len(channel_mult) - 1:
-                self.input_blocks.append(
-                    TimestepEmbedSequential(
-                        Downsample(ch, True, out_channels=ch, dtype=dtype, device=device, operations=operations)
-                    )
-                )
-                ds *= 2
-
-        num_heads = ch // num_head_channels
-        self.middle_block = TimestepEmbedSequential(
-            ResBlock(ch, time_embed_dim, 0, dtype=dtype, device=device, operations=operations),
-            SpatialTransformer(ch, num_heads, num_head_channels,
-                               depth=transformer_depth[-1], context_dim=context_dim,
-                               use_linear=use_linear_in_transformer,
-                               use_checkpoint=use_checkpoint,
-                               dtype=dtype, device=device, operations=operations),
-            ResBlock(ch, time_embed_dim, 0, dtype=dtype, device=device, operations=operations),
-        )
-
-        self.input_hint_block = TimestepEmbedSequential(
-            operations.Conv2d(in_channels, model_channels, 3, padding=1, dtype=dtype, device=device)
-        )
-
-    def forward(self, x, timesteps, xt, context=None, y=None, **kwargs):
-        t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False).to(x.dtype)
-        emb = self.time_embed(t_emb) + self.label_emb(y)
-
-        guided_hint = self.input_hint_block(x, emb, context)
-
-        hs = []
-        h = xt
-        for module in self.input_blocks:
-            if guided_hint is not None:
-                h = module(h, emb, context)
-                h += guided_hint
-                guided_hint = None
-            else:
-                h = module(h, emb, context)
-            hs.append(h)
-        h = self.middle_block(h, emb, context)
-        hs.append(h)
-        return hs
-
-
-class SUPIR(nn.Module):
-    """
-    SUPIR model containing GLVControl (control encoder) and project_modules (adapters).
-    State dict keys match the original SUPIR checkpoint layout:
-      control_model.*           -> GLVControl
-      project_modules.*         -> nn.ModuleList of ZeroSFT/ZeroCrossAttn
-    """
-    def __init__(self, device=None, dtype=None, operations=None):
-        super().__init__()
-
-        self.control_model = GLVControl(dtype=dtype, device=device, operations=operations)
-
-        project_channel_scale = 2
-        cond_output_channels = [320] * 4 + [640] * 3 + [1280] * 3
-        project_channels = [int(c * project_channel_scale) for c in [160] * 4 + [320] * 3 + [640] * 3]
-        concat_channels = [320] * 2 + [640] * 3 + [1280] * 4 + [0]
-        cross_attn_insert_idx = [6, 3]
-
-        self.project_modules = nn.ModuleList()
-        for i in range(len(cond_output_channels)):
-            self.project_modules.append(ZeroSFT(
-                project_channels[i], cond_output_channels[i],
-                concat_channels=concat_channels[i],
-                dtype=dtype, device=device, operations=operations,
-            ))
-
-        for i in cross_attn_insert_idx:
-            self.project_modules.insert(i, ZeroCrossAttn(
-                cond_output_channels[i], concat_channels[i],
-                dtype=dtype, device=device, operations=operations,
-            ))
--- a/comfy/ldm/supir/supir_patch.py
+++ b/comfy/ldm/supir/supir_patch.py
@@ -1,103 +0,0 @@
-import torch
-from comfy.ldm.modules.diffusionmodules.openaimodel import Upsample
-
-
-class SUPIRPatch:
-    """
-    Holds GLVControl (control encoder) + project_modules (ZeroSFT/ZeroCrossAttn adapters).
-    Runs GLVControl lazily on first patch invocation per step, applies adapters through
-    middle_block_after_patch, output_block_merge_patch, and forward_timestep_embed_patch.
-    """
-    SIGMA_MAX = 14.6146
-
-    def __init__(self, model_patch, project_modules, hint_latent, strength_start, strength_end):
-        self.model_patch = model_patch           # CoreModelPatcher wrapping GLVControl
-        self.project_modules = project_modules   # nn.ModuleList of ZeroSFT/ZeroCrossAttn
-        self.hint_latent = hint_latent           # encoded LQ image latent
-        self.strength_start = strength_start
-        self.strength_end = strength_end
-        self.cached_features = None
-        self.adapter_idx = 0
-        self.control_idx = 0
-        self.current_control_idx = 0
-        self.active = True
-
-    def _ensure_features(self, kwargs):
-        """Run GLVControl on first call per step, cache results."""
-        if self.cached_features is not None:
-            return
-        x = kwargs["x"]
-        b = x.shape[0]
-        hint = self.hint_latent.to(device=x.device, dtype=x.dtype)
-        if hint.shape[0] != b:
-            hint = hint.expand(b, -1, -1, -1) if hint.shape[0] == 1 else hint.repeat((b + hint.shape[0] - 1) // hint.shape[0], 1, 1, 1)[:b]
-        self.cached_features = self.model_patch.model.control_model(
-            hint, kwargs["timesteps"], x,
-            kwargs["context"], kwargs["y"]
-        )
-        self.adapter_idx = len(self.project_modules) - 1
-        self.control_idx = len(self.cached_features) - 1
-
-    def _get_control_scale(self, kwargs):
-        if self.strength_start == self.strength_end:
-            return self.strength_end
-        sigma = kwargs["transformer_options"].get("sigmas")
-        if sigma is None:
-            return self.strength_end
-        s = sigma[0].item() if sigma.dim() > 0 else sigma.item()
-        t = min(s / self.SIGMA_MAX, 1.0)
-        return t * (self.strength_start - self.strength_end) + self.strength_end
-
-    def middle_after(self, kwargs):
-        """middle_block_after_patch: run GLVControl lazily, apply last adapter after middle block."""
-        self.cached_features = None  # reset from previous step
-        self.current_scale = self._get_control_scale(kwargs)
-        self.active = self.current_scale > 0
-        if not self.active:
-            return {"h": kwargs["h"]}
-        self._ensure_features(kwargs)
-        h = kwargs["h"]
-        h = self.project_modules[self.adapter_idx](
-            self.cached_features[self.control_idx], h, control_scale=self.current_scale
-        )
-        self.adapter_idx -= 1
-        self.control_idx -= 1
-        return {"h": h}
-
-    def output_block(self, h, hsp, transformer_options):
-        """output_block_patch: ZeroSFT adapter fusion replaces cat([h, hsp]). Returns (h, None) to skip cat."""
-        if not self.active:
-            return h, hsp
-        self.current_control_idx = self.control_idx
-        h = self.project_modules[self.adapter_idx](
-            self.cached_features[self.control_idx], hsp, h, control_scale=self.current_scale
-        )
-        self.adapter_idx -= 1
-        self.control_idx -= 1
-        return h, None
-
-    def pre_upsample(self, layer, x, emb, context, transformer_options, output_shape, *args, **kw):
-        """forward_timestep_embed_patch for Upsample: extra cross-attn adapter before upsample."""
-        block_type, _ = transformer_options["block"]
-        if block_type == "output" and self.active and self.cached_features is not None:
-            x = self.project_modules[self.adapter_idx](
-                self.cached_features[self.current_control_idx], x, control_scale=self.current_scale
-            )
-            self.adapter_idx -= 1
-        return layer(x, output_shape=output_shape)
-
-    def to(self, device_or_dtype):
-        if isinstance(device_or_dtype, torch.device):
-            self.cached_features = None
-            if self.hint_latent is not None:
-                self.hint_latent = self.hint_latent.to(device_or_dtype)
-        return self
-
-    def models(self):
-        return [self.model_patch]
-
-    def register(self, model_patcher):
-        """Register all patches on a cloned model patcher."""
-        model_patcher.set_model_patch(self.middle_after, "middle_block_after_patch")
-        model_patcher.set_model_output_block_patch(self.output_block)
-        model_patcher.set_model_patch((Upsample, self.pre_upsample), "forward_timestep_embed_patch")
--- a/comfy/memory_management.py
+++ b/comfy/memory_management.py
@@ -141,17 +141,3 @@ def interpret_gathered_like(tensors, gathered):
    return dest_views

 aimdo_enabled = False
-
-extra_ram_release_callback = None
-RAM_CACHE_HEADROOM = 0
-
-def set_ram_cache_release_state(callback, headroom):
-    global extra_ram_release_callback
-    global RAM_CACHE_HEADROOM
-    extra_ram_release_callback = callback
-    RAM_CACHE_HEADROOM = max(0, int(headroom))
-
-def extra_ram_release(target):
-    if extra_ram_release_callback is None:
-        return 0
-    return extra_ram_release_callback(target)
--- a/comfy/model_base.py
+++ b/comfy/model_base.py
@@ -52,10 +52,6 @@ import comfy.ldm.qwen_image.model
 import comfy.ldm.kandinsky5.model
 import comfy.ldm.anima.model
 import comfy.ldm.ace.ace_step15
-import comfy.ldm.cogvideo.model
-import comfy.ldm.rt_detr.rtdetr_v4
-import comfy.ldm.ernie.model
-import comfy.ldm.sam3.detector

 import comfy.model_management
 import comfy.patcher_extension
@@ -82,7 +78,6 @@ class ModelType(Enum):
    IMG_TO_IMG = 9
    FLOW_COSMOS = 10
    IMG_TO_IMG_FLOW = 11
-    V_PREDICTION_DDPM = 12


 def model_sampling(model_config, model_type):
@@ -117,8 +112,6 @@ def model_sampling(model_config, model_type):
        s = comfy.model_sampling.ModelSamplingCosmosRFlow
    elif model_type == ModelType.IMG_TO_IMG_FLOW:
        c = comfy.model_sampling.IMG_TO_IMG_FLOW
-    elif model_type == ModelType.V_PREDICTION_DDPM:
-        c = comfy.model_sampling.V_PREDICTION_DDPM

    class ModelSampling(s, c):
        pass
@@ -583,8 +576,8 @@ class Stable_Zero123(BaseModel):
    def __init__(self, model_config, model_type=ModelType.EPS, device=None, cc_projection_weight=None, cc_projection_bias=None):
        super().__init__(model_config, model_type, device=device)
        self.cc_projection = comfy.ops.manual_cast.Linear(cc_projection_weight.shape[1], cc_projection_weight.shape[0], dtype=self.get_dtype(), device=device)
-        self.cc_projection.weight = torch.nn.Parameter(cc_projection_weight.clone())
-        self.cc_projection.bias = torch.nn.Parameter(cc_projection_bias.clone())
+        self.cc_projection.weight.copy_(cc_projection_weight)
+        self.cc_projection.bias.copy_(cc_projection_bias)

    def extra_conds(self, **kwargs):
        out = {}
@@ -897,7 +890,7 @@ class Flux(BaseModel):
        return torch.cat((image, mask), dim=1)

    def encode_adm(self, **kwargs):
-        return kwargs.get("pooled_output", None)
+        return kwargs["pooled_output"]

    def extra_conds(self, **kwargs):
        out = super().extra_conds(**kwargs)
@@ -944,10 +937,9 @@ class LongCatImage(Flux):
        transformer_options = transformer_options.copy()
        rope_opts = transformer_options.get("rope_options", {})
        rope_opts = dict(rope_opts)
-        pe_len = float(c_crossattn.shape[1]) if c_crossattn is not None else 512.0
        rope_opts.setdefault("shift_t", 1.0)
-        rope_opts.setdefault("shift_y", pe_len)
-        rope_opts.setdefault("shift_x", pe_len)
+        rope_opts.setdefault("shift_y", 512.0)
+        rope_opts.setdefault("shift_x", 512.0)
        transformer_options["rope_options"] = rope_opts
        return super()._apply_model(x, t, c_concat, c_crossattn, control, transformer_options, **kwargs)

@@ -1068,10 +1060,6 @@ class LTXAV(BaseModel):
        if guide_attention_entries is not None:
            out['guide_attention_entries'] = comfy.conds.CONDConstant(guide_attention_entries)

-        ref_audio = kwargs.get("ref_audio", None)
-        if ref_audio is not None:
-            out['ref_audio'] = comfy.conds.CONDConstant(ref_audio)
-
        return out

    def process_timestep(self, timestep, x, denoise_mask=None, audio_denoise_mask=None, **kwargs):
@@ -1964,78 +1952,3 @@ class Kandinsky5Image(Kandinsky5):

    def concat_cond(self, **kwargs):
        return None
-
-class RT_DETR_v4(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.rt_detr.rtdetr_v4.RTv4)
-
-class ErnieImage(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.ernie.model.ErnieImageModel)
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        cross_attn = kwargs.get("cross_attn", None)
-        if cross_attn is not None:
-            out['c_crossattn'] = comfy.conds.CONDRegular(cross_attn)
-        return out
-
-class SAM3(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.FLOW, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.sam3.detector.SAM3Model)
-
-class CogVideoX(BaseModel):
-    def __init__(self, model_config, model_type=ModelType.V_PREDICTION_DDPM, image_to_video=False, device=None):
-        super().__init__(model_config, model_type, device=device, unet_model=comfy.ldm.cogvideo.model.CogVideoXTransformer3DModel)
-        self.image_to_video = image_to_video
-
-    def concat_cond(self, **kwargs):
-        noise = kwargs.get("noise", None)
-        # Detect extra channels needed (e.g. 32 - 16 = 16 for ref latent)
-        extra_channels = self.diffusion_model.in_channels - noise.shape[1]
-        if extra_channels == 0:
-            return None
-
-        image = kwargs.get("concat_latent_image", None)
-        device = kwargs["device"]
-
-        if image is None:
-            shape = list(noise.shape)
-            shape[1] = extra_channels
-            return torch.zeros(shape, dtype=noise.dtype, layout=noise.layout, device=noise.device)
-
-        latent_dim = self.latent_format.latent_channels
-        image = utils.common_upscale(image.to(device), noise.shape[-1], noise.shape[-2], "bilinear", "center")
-
-        if noise.ndim == 5 and image.ndim == 5:
-            if image.shape[-3] < noise.shape[-3]:
-                image = torch.nn.functional.pad(image, (0, 0, 0, 0, 0, noise.shape[-3] - image.shape[-3]), "constant", 0)
-            elif image.shape[-3] > noise.shape[-3]:
-                image = image[:, :, :noise.shape[-3]]
-
-        for i in range(0, image.shape[1], latent_dim):
-            image[:, i:i + latent_dim] = self.process_latent_in(image[:, i:i + latent_dim])
-        image = utils.resize_to_batch_size(image, noise.shape[0])
-
-        if image.shape[1] > extra_channels:
-            image = image[:, :extra_channels]
-        elif image.shape[1] < extra_channels:
-            repeats = extra_channels // image.shape[1]
-            remainder = extra_channels % image.shape[1]
-            parts = [image] * repeats
-            if remainder > 0:
-                parts.append(image[:, :remainder])
-            image = torch.cat(parts, dim=1)
-
-        return image
-
-    def extra_conds(self, **kwargs):
-        out = super().extra_conds(**kwargs)
-        # OFS embedding (CogVideoX 1.5 I2V), default 2.0 as used by SparkVSR
-        if self.diffusion_model.ofs_proj_dim is not None:
-            ofs = kwargs.get("ofs", None)
-            if ofs is None:
-                noise = kwargs.get("noise", None)
-                ofs = torch.full((noise.shape[0],), 2.0, device=noise.device, dtype=noise.dtype)
-            out['ofs'] = comfy.conds.CONDRegular(ofs)
-        return out
--- a/comfy/model_detection.py
+++ b/comfy/model_detection.py
@@ -490,54 +490,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):

        return dit_config

-    if '{}blocks.0.norm1.linear.weight'.format(key_prefix) in state_dict_keys:  # CogVideoX
-        dit_config = {}
-        dit_config["image_model"] = "cogvideox"
-
-        # Extract config from weight shapes
-        norm1_weight = state_dict['{}blocks.0.norm1.linear.weight'.format(key_prefix)]
-        time_embed_dim = norm1_weight.shape[1]
-        dim = norm1_weight.shape[0] // 6
-
-        dit_config["num_attention_heads"] = dim // 64
-        dit_config["attention_head_dim"] = 64
-        dit_config["time_embed_dim"] = time_embed_dim
-        dit_config["num_layers"] = count_blocks(state_dict_keys, '{}blocks.'.format(key_prefix) + '{}.')
-
-        # Detect in_channels from patch_embed
-        patch_proj_key = '{}patch_embed.proj.weight'.format(key_prefix)
-        if patch_proj_key in state_dict_keys:
-            w = state_dict[patch_proj_key]
-            if w.ndim == 4:
-                # Conv2d: [out, in, kh, kw] — CogVideoX 1.0
-                dit_config["in_channels"] = w.shape[1]
-                dit_config["patch_size"] = w.shape[2]
-            elif w.ndim == 2:
-                # Linear: [out, in_channels * patch_size * patch_size * patch_size_t] — CogVideoX 1.5
-                dit_config["patch_size"] = 2
-                dit_config["patch_size_t"] = 2
-                dit_config["in_channels"] = w.shape[1] // (2 * 2 * 2)  # 256 // 8 = 32
-
-        text_proj_key = '{}patch_embed.text_proj.weight'.format(key_prefix)
-        if text_proj_key in state_dict_keys:
-            dit_config["text_embed_dim"] = state_dict[text_proj_key].shape[1]
-
-        # Detect OFS embedding
-        ofs_key = '{}ofs_embedding_linear_1.weight'.format(key_prefix)
-        if ofs_key in state_dict_keys:
-            dit_config["ofs_embed_dim"] = state_dict[ofs_key].shape[1]
-
-        # Detect positional embedding type
-        pos_key = '{}patch_embed.pos_embedding'.format(key_prefix)
-        if pos_key in state_dict_keys:
-            dit_config["use_learned_positional_embeddings"] = True
-            dit_config["use_rotary_positional_embeddings"] = False
-        else:
-            dit_config["use_learned_positional_embeddings"] = False
-            dit_config["use_rotary_positional_embeddings"] = True
-
-        return dit_config
-
    if '{}head.modulation'.format(key_prefix) in state_dict_keys:  # Wan 2.1
        dit_config = {}
        dit_config["image_model"] = "wan2.1"
@@ -744,34 +696,6 @@ def detect_unet_config(state_dict, key_prefix, metadata=None):
    if '{}encoder.lyric_encoder.layers.0.input_layernorm.weight'.format(key_prefix) in state_dict_keys:
        dit_config = {}
        dit_config["audio_model"] = "ace1.5"
-        head_dim = 128
-        dit_config["hidden_size"] = state_dict['{}decoder.layers.0.self_attn_norm.weight'.format(key_prefix)].shape[0]
-        dit_config["intermediate_size"] = state_dict['{}decoder.layers.0.mlp.gate_proj.weight'.format(key_prefix)].shape[0]
-        dit_config["num_heads"] = state_dict['{}decoder.layers.0.self_attn.q_proj.weight'.format(key_prefix)].shape[0] // head_dim
-
-        dit_config["encoder_hidden_size"] = state_dict['{}encoder.lyric_encoder.layers.0.input_layernorm.weight'.format(key_prefix)].shape[0]
-        dit_config["encoder_num_heads"] = state_dict['{}encoder.lyric_encoder.layers.0.self_attn.q_proj.weight'.format(key_prefix)].shape[0] // head_dim
-        dit_config["encoder_intermediate_size"] = state_dict['{}encoder.lyric_encoder.layers.0.mlp.gate_proj.weight'.format(key_prefix)].shape[0]
-        dit_config["num_dit_layers"] = count_blocks(state_dict_keys, '{}decoder.layers.'.format(key_prefix) + '{}.')
-        return dit_config
-
-    if '{}encoder.pan_blocks.1.cv4.conv.weight'.format(key_prefix) in state_dict_keys: # RT-DETR_v4
-        dit_config = {}
-        dit_config["image_model"] = "RT_DETR_v4"
-        dit_config["enc_h"] = state_dict['{}encoder.pan_blocks.1.cv4.conv.weight'.format(key_prefix)].shape[0]
-        return dit_config
-
-    if '{}layers.0.mlp.linear_fc2.weight'.format(key_prefix) in state_dict_keys: # Ernie Image
-        dit_config = {}
-        dit_config["image_model"] = "ernie"
-        return dit_config
-
-    if 'detector.backbone.vision_backbone.trunk.blocks.0.attn.qkv.weight' in state_dict_keys: # SAM3 / SAM3.1
-        if 'detector.transformer.decoder.query_embed.weight' in state_dict_keys:
-            dit_config = {}
-            dit_config["image_model"] = "SAM3"
-            if 'detector.backbone.vision_backbone.propagation_convs.0.conv_1x1.weight' in state_dict_keys:
-                dit_config["image_model"] = "SAM31"
        return dit_config

    if '{}input_blocks.0.0.weight'.format(key_prefix) not in state_dict_keys:
@@ -929,10 +853,6 @@ def model_config_from_unet(state_dict, unet_key_prefix, use_base_if_no_match=Fal
    return model_config

 def unet_prefix_from_state_dict(state_dict):
-    # SAM3: detector.* and tracker.* at top level, no common prefix
-    if any(k.startswith("detector.") for k in state_dict) and any(k.startswith("tracker.") for k in state_dict):
-        return ""
-
    candidates = ["model.diffusion_model.", #ldm/sgm models
                  "model.model.", #audio models
                  "net.", #cosmos
--- a/comfy/model_management.py
+++ b/comfy/model_management.py
@@ -55,7 +55,6 @@ total_vram = 0

 # Training Related State
 in_training = False
-training_fp8_bwd = False


 def get_supported_float8_types():
@@ -663,14 +662,13 @@ def minimum_inference_memory():

 def free_memory(memory_required, device, keep_loaded=[], for_dynamic=False, pins_required=0, ram_required=0):
    cleanup_models_gc()
-    comfy.memory_management.extra_ram_release(max(pins_required, ram_required))
    unloaded_model = []
    can_unload = []
    unloaded_models = []

    for i in range(len(current_loaded_models) -1, -1, -1):
        shift_model = current_loaded_models[i]
-        if device is None or shift_model.device == device:
+        if shift_model.device == device:
            if shift_model not in keep_loaded and not shift_model.is_dead():
                can_unload.append((-shift_model.model_offloaded_memory(), sys.getrefcount(shift_model.model), shift_model.model_memory(), i))
                shift_model.currently_used = False
@@ -680,8 +678,8 @@ def free_memory(memory_required, device, keep_loaded=[], for_dynamic=False, pins
        i = x[-1]
        memory_to_free = 1e32
        pins_to_free = 1e32
-        if not DISABLE_SMART_MEMORY or device is None:
-            memory_to_free = 0 if device is None else memory_required - get_free_memory(device)
+        if not DISABLE_SMART_MEMORY:
+            memory_to_free = memory_required - get_free_memory(device)
            pins_to_free = pins_required - get_free_ram()
            if current_loaded_models[i].model.is_dynamic() and for_dynamic:
                #don't actually unload dynamic models for the sake of other dynamic models
@@ -709,7 +707,7 @@ def free_memory(memory_required, device, keep_loaded=[], for_dynamic=False, pins

    if len(unloaded_model) > 0:
        soft_empty_cache()
-    elif device is not None:
+    else:
        if vram_state != VRAMState.HIGH_VRAM:
            mem_free_total, mem_free_torch = get_free_memory(device, torch_free_too=True)
            if mem_free_torch > mem_free_total * 0.25:
@@ -1733,21 +1731,6 @@ def supports_mxfp8_compute(device=None):

    return True

-def supports_fp64(device=None):
-    if is_device_mps(device):
-        return False
-
-    if is_intel_xpu():
-        return False
-
-    if is_directml_enabled():
-        return False
-
-    if is_ixuca():
-        return False
-
-    return True
-
 def extended_fp16_support():
    # TODO: check why some models work with fp16 on newer torch versions but not on older
    if torch_version_numeric < (2, 7):
@@ -1802,7 +1785,7 @@ def debug_memory_summary():
        return torch.cuda.memory.memory_summary()
    return ""

-class InterruptProcessingException(BaseException):
+class InterruptProcessingException(Exception):
    pass

 interrupt_processing_mutex = threading.RLock()
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
comfyanonymous	c6e15cb366	Revert "Fix some fp8 scaled checkpoints no longer working. (#13239 )" This reverts commit `4504ee718a`.	2026-04-03 03:09:58 -04:00
comfyanonymous	6aba6bd435	ComfyUI version 0.18.4	2026-04-03 03:06:35 -04:00
Daxiong (Lin)	dedec11116	Update template to 0.9.43 (#13265 )	2026-04-03 03:04:35 -04:00
Alexander Piskun	b818eacb0b	feat(api-nodes): new Partner nodes for Wan2.7 (#13264 ) Signed-off-by: bigcat88 <bigcat88@icloud.com>	2026-04-03 03:03:57 -04:00
comfyanonymous	4504ee718a	Fix some fp8 scaled checkpoints no longer working. (#13239 )	2026-04-03 03:03:50 -04:00
rattus	94ab69f609	Fix/tweak pinned memory accounting (#13221 ) * mm: Lower windows pin threshold Some workflows have more extranous use of shared GPU memory than is accounted for in the 5% pin headroom. Lower this for safety. * mm: Remove pin count clearing threshold. TOTAL_PINNED_MEMORY is shared between the legacy and aimdo pinning systems, however this catch-all assumes only the legacy system exists. Remove the catch-all as the PINNED_MEMORY buffer is coherent already.	2026-04-03 03:03:42 -04:00
ComfyUI Wiki	173e1aa2df	chore: update workflow templates to v0.9.38 (#13176 )	2026-03-26 16:02:18 -04:00
Alexander Piskun	bbc8977597	feat(api-nodes): added new Topaz model (#13175 ) Signed-off-by: bigcat88 <bigcat88@icloud.com>	2026-03-26 16:01:16 -04:00
comfyanonymous	a0ae3f3bd4	ComfyUI v0.18.2	2026-03-24 17:41:52 -04:00
Alexander Piskun	cc9273e655	feat(api-nodes): update xAI Grok nodes (#13140 )	2026-03-24 17:39:48 -04:00
comfyanonymous	a6e967a391	Update templates package version. (#13141 )	2026-03-24 17:38:24 -04:00