diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3df2d095c..7a837307f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,40 @@
 
 # CUTLASS 4.x
 
+## [4.6.0](https://github.com/NVIDIA/cutlass/tree/main) (2026-06-11)
+
+### CuTe DSL
+* New features
+  - Support for AoT cross-compilation targeting aarch64-linux-gnu
+  - Support for two launch attributes: launch completion events (cudaLaunchAttributeLaunchCompletionEvent), for recording an event once all thread blocks have begun executing, and launch programmatic events (cudaLaunchAttributeProgrammaticEvent), for PDL event-based synchronization
+  - The per-kernel shared memory carveout preference is now calculated automatically; use the new launch option `preferred_smem_carveout` to set it manually.
+  - Auto-deduced smem size for launching kernels
+    - Launch config `smem` now defaults to `None` for auto-calculating kernel shared memory usage, which is recommended unless manual control is required.
+    - Warnings will be raised when the manually set shared memory size is insufficient or exceeds the GPU maximum.
+    - The default shared memory usage calculation aligns with CUDA C++ static shared memory behavior, i.e. summing all allocations additively.
+    - An additional launch option `smem_merge_branch_allocs` is provided to merge shared memory allocations across mutually exclusive code branches, which is recommended for inlined mega-kernels to reduce total footprint.
+
+* Bug fixing and improvements
+  - Improvements on linter support with more type ignores cleaned up
+  - Improvements on tvm-ffi CUDA runtime error diagnostics
+  - Improvements on dataclass support for TVM-FFI
+  - Fixed a regression on compilation time
+  - Enhanced compile-time checks to reject misaligned smem operands for TMA
+  - Long-deprecated API clean-up, including:
+    - cute.core.ThrMma, please use cute.ThrMma instead
+    - cute.core.ThrCopy, please use cute.ThrCopy instead
+    - cute.make_fragment, please use cute.make_rmem_tensor instead
+
+### CUTLASS C++
+* Add [example 113](https://github.com/NVIDIA/cutlass/tree/main/examples/113_hopper_gemm_activation_fusion) for Hopper GEMM with activation fusion.
+  - Supports standard and gated activations (e.g., SiLu) with fp8 and fp16 inputs.
+  - Covers both regular GEMM and grouped GEMM variants.
+* Improve SM90 grouped/ptr-array GEMM with EVT support.
+  - Adds the EVT (Epilogue Visitor Tree) plumbing required to do activation, bias, and auxiliary-tensor fusion inside SM90 grouped and ptr-array GEMM kernels.
+* Fix `DescriptorIterator::operator+` in `mma_traits_sm100.hpp` to use 32-bit arithmetic on CUDA toolkit version <= 13.3, preserving the high half of the smem descriptor.
+* Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
+* Optimal code generation with CUDA toolkit version 13.3.
+
 ## [4.5.2](https://github.com/NVIDIA/cutlass/releases/tag/v4.5.2) (2026-05-22)
 
 ### CuTe DSL
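Reviewer note on the shared-memory entries in the CHANGELOG hunk above: the text contrasts the default additive sizing with the `smem_merge_branch_allocs` launch option. The sketch below is a toy host-side model of that difference, not the CuTe DSL implementation. The names `BranchAllocs`, `additive_smem_bytes`, and `merged_smem_bytes` are hypothetical, and the model assumes the allocations of each mutually exclusive branch are known up front in bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical model: one entry per mutually exclusive branch, each listing the
// shared-memory allocations (in bytes) made inside that branch.
using BranchAllocs = std::vector<std::size_t>;

// Additive sizing: every allocation contributes to the total, as with static
// shared memory in CUDA C++.
std::size_t additive_smem_bytes(std::vector<BranchAllocs> const& branches) {
  std::size_t total = 0;
  for (auto const& allocs : branches) {
    total = std::accumulate(allocs.begin(), allocs.end(), total);
  }
  return total;
}

// Merged sizing (assumed semantics): only one branch executes, so the footprint
// is the largest per-branch sum rather than the sum over all branches.
std::size_t merged_smem_bytes(std::vector<BranchAllocs> const& branches) {
  std::size_t largest = 0;
  for (auto const& allocs : branches) {
    std::size_t branch_total =
        std::accumulate(allocs.begin(), allocs.end(), std::size_t{0});
    largest = std::max(largest, branch_total);
  }
  return largest;
}
```

Under this model, branches allocating {16 KiB, 8 KiB}, {32 KiB}, and {4 KiB} need 60 KiB with additive sizing and 32 KiB with merged sizing, which is the kind of saving the changelog recommends the option for in inlined mega-kernels.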
diff --git a/README.md b/README.md
index fdbb6f3db..75e590e7f 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,9 @@
 ![ALT](./media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
 # Overview
 
-# CUTLASS 4.5.2
+# CUTLASS 4.6.0
 
-_CUTLASS 4.5.2 - May 2026_
+_CUTLASS 4.6.0 - June 2026_
 
 CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM)
 and related computations at all levels and scales within CUDA. It incorporates strategies for
@@ -37,86 +37,43 @@ We believe it will become an indispensable tool for students, researchers, and p
 engineers alike — flattening the learning curve of GPU programming, rapidly prototyping kernel
 designs, and bringing optimized solutions into production.
 
-CuTe DSL is currently in public beta and will graduate out of beta by end of summer 2025.
+CuTe DSL is currently in public beta and will graduate out of beta by end of summer 2026.
 
 To get started quickly - please refer :
   - [CUTLASS C++ Quick Start Guide](https://docs.nvidia.com/cutlass/latest/media/docs/cpp/quickstart.html).
   - [CuTe DSL Quick Start Guide](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/quick_start.html).
 
-# What's New in CUTLASS 4.5
+# What's New in CUTLASS 4.6
 
 ## CuTe DSL
 * New features
-  - New Block API `block_copy()` to simplify TMA and S2T copy. Users can ignore detail about multicast and 2CTA partition for TMA by `block_copy()` and need not to invoke `tma_partition()`. And users can remove bulk of S2T initialization to simplify S2T copy.
-  - MXF8F6F4 mixed precision support
-    - BlockScaled MMA now supports MXF8*MXF4 or MXF8*MXF6
-  - Block Scaled MMA for SM120 now works on Spark
-  - EFC broadcast semantics support
-    -  EFC epilogue functions can now broadcast and remap tensor modes via `C.remap_modes[:, 0, 1]` subscript syntax (where `:` marks a broadcast dimension and integers select source mode indices). Covers scalar broadcast, row/column broadcast, and arbitrary mode permutations (e.g. transpose). The PyTorch reference evaluator mirrors the same transformations.
-  - Initial linter support: Improved type hints on CuTe DSL APIs to support static type checkers like MyPy
-  - dataclasses.dataclass is now supported for JIT compilaton and cute.compile for both plain and tvm-ffi path
-  - cute.copy now supports user specified loop unrolling
-  - Python 3.14t is now supported with GIL enabled
+  - Support for AoT cross-compilation targeting aarch64-linux-gnu
+  - Support for two launch attributes: launch completion events (cudaLaunchAttributeLaunchCompletionEvent), for recording an event once all thread blocks have begun executing, and launch programmatic events (cudaLaunchAttributeProgrammaticEvent), for PDL event-based synchronization
+  - The per-kernel shared memory carveout preference is now calculated automatically; use the new launch option `preferred_smem_carveout` to set it manually.
+  - Auto-deduced smem size for launching kernels
+    - Launch config `smem` now defaults to `None` for auto-calculating kernel shared memory usage, which is recommended unless manual control is required.
+    - Warnings will be raised when the manually set shared memory size is insufficient or exceeds the GPU maximum.
+    - The default shared memory usage calculation aligns with CUDA C++ static shared memory behavior, i.e. summing all allocations additively.
+    - An additional launch option `smem_merge_branch_allocs` is provided to merge shared memory allocations across mutually exclusive code branches, which is recommended for inlined mega-kernels to reduce total footprint.
 
 * Bug fixing and improvements
-  - Improved source code correlation for profiling/debugging
-  - Fixed an aarch64 segfault issue with tvm-ffi
-  - Re-organization for CuTe DSL examples/tutorials for better discoverability
-  - Fixed following issues:
-    https://github.com/NVIDIA/cutlass/issues/3219
-    https://github.com/NVIDIA/cutlass/issues/3218
-    https://github.com/NVIDIA/cutlass/issues/3212
-    https://github.com/NVIDIA/cutlass/issues/3210
-    https://github.com/NVIDIA/cutlass/issues/3208
-    https://github.com/NVIDIA/cutlass/issues/3201
-    https://github.com/NVIDIA/cutlass/issues/3227
-    https://github.com/NVIDIA/cutlass/issues/3240
-    https://github.com/NVIDIA/cutlass/issues/3241
-  - Fixed Jax int64 stride divisibility issue
-  - Fixed issues for SM120 blockscaled MMAs
-    - added missing MXFP8MMAOP and MXF8F6F4MMAOP for sm120.
-
-* More examples of authorizing peak-performance kernels
-  - MOE examles
-    - A new style of grouped-gemm that aligns to torch's grouped_mm and scaled_groued_mm interface.
-    - Expert-wise tensormap descriptor setup by a cheap helper kernel (~2us) to avoid long latency in tile switching, kernel structure is much more closer to a normal GEMM.
-    - Compared to torch_210_cu13, very few problem has worse perf in B200.
-        - mxfp8_2dx3d: avg 1.29 speedup;
-        - mxfp8_2dx2d: avg 1.41 speedup;
-            - nvfp4_2dx3d: avg 1.11 speedup;
-        - nvfp4_2dx2d: avg 1.12 speedup (worst case 0.98)
-        - bf16_2dx3d: avg 1.15 speedup (worst case 0.98)
-        - bf16_2dx2d: avg 1.17 speedup (worst case 0.96)
-        - Note: The perf is measured from torch profiler, this impl includes the helper kernel + main kernel, while torch's includes its setup kernel and cutlass_cpp main kernel.
-
-* API changes
-  - ab_dtype is deprecated in make_trivial_tiled_mma and make_blockscaled_trivial_tiled_mma from blackwell_helpers.py. Please specify a_dtype and b_dtype separately instead.
+  - Improvements on linter support with more type ignores cleaned up
+  - Improvements on tvm-ffi CUDA runtime error diagnostics
+  - Improvements on dataclass support for TVM-FFI
+  - Fixed a regression on compilation time
+  - Enhanced compile-time checks to reject misaligned smem operands for TMA
+  - Long-deprecated API clean-up, including:
+    - cute.core.ThrMma, please use cute.ThrMma instead
+    - cute.core.ThrCopy, please use cute.ThrCopy instead
+    - cute.make_fragment, please use cute.make_rmem_tensor instead
 
 ## CUTLASS C++
-* Add 2SM MMA instruction support to mixed TMA+CpAsync SM100 vanilla GEMM kernels.
-  - Mixed TMA+CpAsync can now accept static, but non trivial cluster shapes.
-  - Uses TMA multicast for A tile when using non-trivial cluster size along N mode.
-  - Uses an additional barrier (mma_trampoline_barrier) to track cp.async arrivals in both CTAs.
-  - Changes included in [example 92](https://github.com/NVIDIA/cutlass/tree/main/examples/92_blackwell_moe_gemm).
-* Add support for 128x32xK and 128x64xK tile sizes for SM120 blockscaled MMA collective builders, yielding up to 30% performance improvement on Blackwell SM121 related kernels.
-* Add static load to tensor memory support, included in [example 77](https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha/).
-* Use 64-bit adds for SM100 MMA descriptor offsets and reduce move instructions for improved code generation.
-* Add [example 95](https://github.com/NVIDIA/cutlass/tree/main/examples/95_blackwell_gemm_green_context) to support green context SM partition
-  - Enables launching GEMM on stream with partial SM allocation.
-* Add [Snake](https://github.com/NVIDIA/cutlass/blob/main/test/unit/epilogue/thread/activation.cu#L409) activation functor for EVT.
-* Fix SM100 F8F6F4 SS MMA (1SM and 2SM) traits to use typed op templates.
-* Add UE8M0 (uniform exponent distribution) initialization support in tensor fill utilities.
-* Add `cvt.rn.bf16x2.e4m3x2` conversion instruction support to `numeric_conversion.h`.
-* Update [example 93](https://github.com/NVIDIA/cutlass/tree/main/examples/93_blackwell_low_latency_gqa) with paged KV cache support for Blackwell low-latency GQA.
-* Fix some kernel issues:
-  - Fix l2_capacity=0 handling in Blackwell SM100/SM120 kernel templates
-  - Fix CUTLASS clang build issues
-  - Remove `PipelineStorage` shadowing in SM100 complex epilogue
-  - Fix build issue in SM90 epilogue fusion visitor TMA warpspecialized
-  - Fix missing convert fucntion in EVT for fp4 kernels
-* Fix some profiler issues:
-  - Add missing reference kernels for blockwise GEMM profiler.
-  - Avoid instantiate 2sm tma kernels where ctaN is none power of 64 when ctaN > 128 in profiler.
+* Add [example 113](https://github.com/NVIDIA/cutlass/tree/main/examples/113_hopper_gemm_activation_fusion) for Hopper GEMM with activation fusion.
+  - Supports standard and gated activations (e.g., SiLu) with fp8 and fp16 inputs.
+  - Covers both regular GEMM and grouped GEMM variants.
+* Improve SM90 grouped/ptr-array GEMM with EVT support.
+  - Adds the EVT (Epilogue Visitor Tree) plumbing required to do activation, bias, and auxiliary-tensor fusion inside SM90 grouped and ptr-array GEMM kernels.
+* Fix `DescriptorIterator::operator+` in `mma_traits_sm100.hpp` to use 32-bit arithmetic on CUDA toolkit version <= 13.3, preserving the high half of the smem descriptor.
 
 Note: CUTLASS 4.x builds are known to be down on Windows platforms for all CUDA toolkits.
 CUTLASS team is working on a fix.
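Reviewer note on the example 113 entry in the README hunk above: it mentions standard and gated activations such as SiLu. The scalar reference below sketches what those terms usually mean. It is an assumption-laden illustration, not the epilogue in this PR: the formula `act(alpha * acc + beta * c) * scale`, the gated form `act(gate) * up`, and the function names are all hypothetical and may not match the operand order or scaling direction used by `AccCastLinCombEltActScale`.

```cpp
#include <cmath>

// SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)).
inline float silu(float x) {
  return x / (1.0f + std::exp(-x));
}

// Assumed "standard" fused epilogue for one element: linear combination of the
// accumulator and the C/bias operand, then activation, then a per-tensor scale.
inline float fused_activation(float acc, float c, float alpha, float beta, float scale) {
  return silu(alpha * acc + beta * c) * scale;
}

// Assumed "gated" variant: the activation of one GEMM output gates another.
inline float gated_activation(float gate, float up) {
  return silu(gate) * up;
}
```

A device-side reference would apply the same per-element arithmetic across the output tensor; keeping it scalar makes the assumed formula easy to compare against the unfused GEMM plus activation path used for verification.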
diff --git a/examples/113_hopper_gemm_activation_fusion/113_hopper_gemm_fused_act.cu b/examples/113_hopper_gemm_activation_fusion/113_hopper_gemm_fused_act.cu
new file mode 100644
index 000000000..d8b7c7413
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/113_hopper_gemm_fused_act.cu
@@ -0,0 +1,533 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+/*! \file
+    \brief Hopper GEMM with activation fusion example
+*/
+
+#include <cstdlib>
+#include <iostream>
+#include <numeric>
+#include <vector>
+
+#include "cutlass/cutlass.h"
+
+#include "cute/tensor.hpp"
+#include "cutlass/epilogue/collective/collective_builder.hpp"
+#include "cutlass/epilogue/fusion/operations.hpp"
+#include "cutlass/epilogue/thread/activation.h"
+#include "cutlass/gemm/dispatch_policy.hpp"
+#include "cutlass/gemm/collective/collective_builder.hpp"
+#include "cutlass/gemm/device/gemm_universal_adapter.h"
+#include "cutlass/gemm/kernel/gemm_universal.hpp"
+#include "cutlass/gemm/kernel/tile_scheduler_params.h"
+
+#include "cutlass/util/command_line.h"
+#include "cutlass/util/device_memory.h"
+#include "cutlass/util/distribution.h"
+#include "cutlass/util/packed_stride.hpp"
+#include "cutlass/util/reference/device/tensor_compare.h"
+#include "cutlass/util/reference/device/tensor_fill.h"
+
+#include "helper.h"
+#include "options.hpp"
+#include "utils.hpp"
+#include "sm90_lin_comb_elt_act_scaled.hpp"
+#include "activation_kernel.cuh"
+
+using namespace cute;
+
+#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM kernel configurations
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+#if 0
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::ReLu<T>;
+#elif 1
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::SiLu<T>;
+#else
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::Identity<T>;
+#endif
+
+bool constexpr IsFp8         = true;  // whether to run with fp8 or fp16 input/output
+bool constexpr Quantize      = true;  // whether to quantize output with a per-tensor scale factor
+bool constexpr ExactMode     = false; // whether to reproduce unfused dual gemm+activation exactly
+bool constexpr BiasBroadcast = true;  // whether bias is broadcast along columns in each group
+bool constexpr Pingpong      = true;  // whether to use pingpong schedule
+
+// A matrix configuration
+using         ElementA    = conditional_t<IsFp8, cutlass::float_e4m3_t, cutlass::half_t>; // Element type for A matrix operand
+using         LayoutA     = cutlass::layout::RowMajor;                                    // Layout type for A matrix operand
+constexpr int AlignmentA  = 128 / cutlass::sizeof_bits<ElementA>::value;                  // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
+
+// B matrix configuration
+using         ElementB    = conditional_t<IsFp8, cutlass::float_e5m2_t, cutlass::half_t>; // Element type for B matrix operand
+using         LayoutB     = cutlass::layout::ColumnMajor;                                 // Layout type for B matrix operand
+constexpr int AlignmentB  = 128 / cutlass::sizeof_bits<ElementB>::value;                  // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
+
+// C/D matrix configuration
+using         ElementC    = cutlass::half_t;                                              // Element type for C matrix operand
+using         LayoutC     = cutlass::layout::ColumnMajor;                                 // Layout type for C matrix operand
+constexpr int AlignmentC  = 128 / cutlass::sizeof_bits<ElementC>::value;                  // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
+
+using         ElementD    = conditional_t<IsFp8, cutlass::float_e4m3_t, cutlass::half_t>; // Element type for D matrix operand
+using         LayoutD     = cutlass::layout::ColumnMajor;                                 // Layout type for D matrix operand
+constexpr int AlignmentD  = 128 / cutlass::sizeof_bits<ElementD>::value;                  // Memory access granularity/alignment of D matrix in units of elements (up to 16 bytes)
+
+// Core kernel configurations
+using ElementAccumulator  = float;                                           // Element type for internal accumulation
+using ElementCompute      = float;                                           // Element type for epilogue computation
+using ElementScalar       = float;                                           // Element type for epilogue scalars (alpha, beta, scale)
+using ElementIntermediate = cutlass::half_t;                                 // Element type of intermediate result between GEMM and bias+activation
+using ArchTag             = cutlass::arch::Sm90;                             // Tag indicating the minimum SM that supports the intended feature
+using OperatorClass       = cutlass::arch::OpClassTensorOp;                  // Operator class tag
+using EpiTileShape        = cutlass::epilogue::collective::EpilogueTileAuto; // Epilogue sub-tile shape
+using ClusterShape        = Shape<_1,_2,_1>;                                 // Shape of the threadblocks in a cluster
+using TileShapeK          = Int<128 * 8 / sizeof_bits<ElementA>::value>;
+
+using KernelScheduleCooperative = conditional_t<cutlass::gemm::collective::detail::is_input_fp8<ElementA, ElementB>(),
+                                                cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum,
+                                                cutlass::gemm::KernelTmaWarpSpecializedCooperative>;
+
+using KernelSchedulePingpong    = conditional_t<cutlass::gemm::collective::detail::is_input_fp8<ElementA, ElementB>(),
+                                                cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum,
+                                                cutlass::gemm::KernelTmaWarpSpecializedPingpong>;
+
+using KernelSchedule   = conditional_t<Pingpong, KernelSchedulePingpong, KernelScheduleCooperative>;
+using EpilogueSchedule = conditional_t<Pingpong, cutlass::epilogue::TmaWarpSpecialized, cutlass::epilogue::TmaWarpSpecializedCooperative>;
+using TileShape        = conditional_t<Pingpong, Shape<_128,_128,TileShapeK>, Shape<_128,_256,TileShapeK>>;
+
+using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
+    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
+    TileShape, ClusterShape,
+    EpiTileShape,
+    ElementAccumulator, ElementCompute,
+    ElementC, LayoutC, AlignmentC,
+    ElementD, LayoutD, AlignmentD,
+    EpilogueSchedule,
+    cutlass::epilogue::fusion::AccCastLinCombEltActScale<
+      Quantize,
+      ActivationFn,
+      ElementD,
+      ElementCompute,
+      ElementC,
+      ElementScalar,
+      conditional_t<ExactMode, ElementIntermediate, ElementCompute>
+    >
+  >::CollectiveOp;
+
+using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
+    ArchTag, OperatorClass,
+    ElementA, LayoutA, AlignmentA,
+    ElementB, LayoutB, AlignmentB,
+    ElementAccumulator,
+    TileShape, ClusterShape,
+    cutlass::gemm::collective::StageCountAutoCarveout<
+      static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
+    KernelSchedule
+  >::CollectiveOp;
+
+using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
+    Shape<int,int,int,int>, // Indicates ProblemShape
+    CollectiveMainloop,
+    CollectiveEpilogue
+>;
+
+using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
+
+// Reference device GEMM implementation type
+using CollectiveEpilogueRef = typename cutlass::epilogue::collective::CollectiveBuilder<
+    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
+    TileShape, ClusterShape,
+    EpiTileShape,
+    ElementAccumulator, ElementCompute,
+    void,     LayoutC, AlignmentC,
+    ElementIntermediate, LayoutD, AlignmentD,
+    EpilogueSchedule,
+    cutlass::epilogue::fusion::ScaledAcc<ElementIntermediate, ElementCompute, ElementScalar>
+  >::CollectiveOp;
+
+using CollectiveMainloopRef = typename cutlass::gemm::collective::CollectiveBuilder<
+    ArchTag, OperatorClass,
+    ElementA, LayoutA, AlignmentA,
+    ElementB, LayoutB, AlignmentB,
+    ElementAccumulator,
+    TileShape, ClusterShape,
+    cutlass::gemm::collective::StageCountAutoCarveout<
+      static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
+    KernelSchedule
+  >::CollectiveOp;
+
+using GemmKernelRef = cutlass::gemm::kernel::GemmUniversal<
+    Shape<int,int,int,int>, // Indicates ProblemShape
+    CollectiveMainloopRef,
+    CollectiveEpilogueRef
+>;
+
+using GemmRef = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelRef>;
+
+using StrideA = typename GemmKernel::StrideA;
+using StrideB = typename GemmKernel::StrideB;
+using StrideC = typename GemmKernel::StrideC;
+using StrideD = typename GemmKernel::StrideD;
+
+//
+// Data members
+//
+
+/// Initialization
+StrideA stride_A;
+StrideB stride_B;
+StrideC stride_C;
+StrideD stride_D;
+uint64_t seed;
+
+cutlass::DeviceAllocation<ElementA> block_A;
+cutlass::DeviceAllocation<ElementB> block_B;
+cutlass::DeviceAllocation<ElementC> block_C;
+cutlass::DeviceAllocation<ElementD> block_D;
+cutlass::DeviceAllocation<ElementD> block_D_ref;
+cutlass::DeviceAllocation<ElementIntermediate> block_D_ref_gemm;
+cutlass::DeviceAllocation<int64_t> offset_col_D;
+cutlass::DeviceAllocation<ElementScalar> block_scale;
+
+#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// Testbed utility types
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+// Command line options parsing
+struct Options : GemmOptionsBase<cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90Params::RasterOrderOptions> {
+
+  using Base = GemmOptionsBase<cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90Params::RasterOrderOptions>;
+
+  float alpha = 1.f, beta = 0.f;
+  int m = 10240, n = 2048, k = 8192, l = 10;
+
+  // Parses the command line
+  void parse(cutlass::CommandLine const& cmd) {
+    Base::parse(cmd);
+    cmd.get_cmd_line_argument("alpha", alpha);
+    cmd.get_cmd_line_argument("beta", beta);
+    cmd.get_cmd_line_argument("m", m);
+    cmd.get_cmd_line_argument("n", n);
+    cmd.get_cmd_line_argument("k", k);
+    cmd.get_cmd_line_argument("l", l);
+  }
+
+  /// Prints the usage statement.
+  std::ostream & print_usage(std::ostream &out) const {
+
+    out << program_path << "\n"
+           "\n"
+           "  Hopper GEMM with fused activation function.\n"
+           "\n"
+           "Options:\n"
+           "\n"
+           "  --help                      If specified, displays this usage statement\n\n"
+           "  --m=<int>                   Sets the M extent of the GEMM\n"
+           "  --n=<int>                   Sets the N extent of the GEMM\n"
+           "  --k=<int>                   Sets the K extent of the GEMM\n"
+           "  --l=<int>                   Sets the L extent of the GEMM\n"
+           "  --alpha=<f32>               Epilogue scalar alpha\n"
+           "  --beta=<f32>                Epilogue scalar beta\n"
+           "  --raster=<char>             CTA Rasterization direction (N for along N, M for along M, and H for heuristic)\n"
+           "  --swizzle=<int>             CTA Rasterization swizzle\n"
+           "  --warmup=<int>              Number of warmup iterations to perform.\n"
+           "  --iterations=<int>          Number of profiling iterations to perform.\n"
+           "  --verbose                   Verbose mode (output detailed verification result)\n"
+           "  --verify=<bool>             Verification (correctness check) on/off\n"
+           "  --sms                       Number of SMs to run the GEMMs on\n"
+           "  --device                    Device index\n"
+           "\n"
+           "Examples:\n"
+           "\n"
+        << program_path << " --m=1024 --n=512 --k=1024 --l=10 --alpha=2 --beta=0.707\n";
+
+    return out;
+  }
+
+  /// Compute total number of floating point operations
+  double total_flops() const {
+    // Two flops per multiply-add
+    return uint64_t(2) * m * n * k * l;
+  }
+};
+
+#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM setup and evaluation
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+/// Initialize operands to be used in the GEMM and reference GEMM
+void initialize(const Options &options) {
+
+  auto [M, N, K, L] = make_tuple(options.m, options.n, options.k, options.l);
+  auto NC = BiasBroadcast ? 1 : N;
+
+  stride_A = cutlass::make_cute_packed_stride(StrideA{}, {M, K, L});
+  stride_B = cutlass::make_cute_packed_stride(StrideB{}, {N, K, L});
+  stride_C = cutlass::make_cute_packed_stride(StrideC{}, {M, NC, L});
+  stride_D = cutlass::make_cute_packed_stride(StrideD{}, {M, N, L});
+
+  block_A.reset(M * K * L);
+  block_B.reset(N * K * L);
+  block_C.reset(M * NC * L);
+  block_D.reset(M * N * L);
+  block_D_ref.reset(M * N * L);
+  block_D_ref_gemm.reset(M * N * L);
+  block_scale.reset(1);
+
+  if constexpr (BiasBroadcast) {
+    get<1>(stride_C) = 0;
+  }
+
+  std::vector<int64_t> offset_col_D_host(options.l + 1);
+  std::iota(offset_col_D_host.begin(), offset_col_D_host.end(), 0ll);
+  std::transform(offset_col_D_host.begin(), offset_col_D_host.end(), offset_col_D_host.begin(), [&](auto i) { return i * options.n; });
+  offset_col_D.reset(options.l + 1);
+  offset_col_D.copy_from_host(offset_col_D_host.data());
+
+  cutlass::reference::device::BlockFillRandom(block_A.get(), block_A.size(), 2024, options.dist_a);
+  cutlass::reference::device::BlockFillRandom(block_B.get(), block_B.size(), 2025, options.dist_b);
+  cutlass::reference::device::BlockFillRandom(block_C.get(), block_C.size(), 2026, options.dist_c);
+  cutlass::reference::device::BlockFillRandomUniform(block_scale.get(), block_scale.size(), 2027, 0.5, 1.0);
+}
+
+template <typename GemmT>
+auto args_from_options_common(const Options &options)
+{
+  cutlass::KernelHardwareInfo hw_info;
+  hw_info.device_id = 0;
+  hw_info.sm_count = options.sm_count > 0
+                   ? options.sm_count
+                   : cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
+
+  typename GemmT::Arguments arguments;
+  decltype(arguments.epilogue.thread) fusion_args{};
+
+  fusion_args.alpha = options.alpha;
+
+  return make_tuple(fusion_args, hw_info);
+}
+
+template <typename GemmT>
+typename GemmT::Arguments
+args_from_options(const Options &options);
+
+template <>
+typename Gemm::Arguments
+args_from_options<Gemm>(Options const& options)
+{
+  auto [fusion_args, hw_info] = args_from_options_common<Gemm>(options);
+  fusion_args.beta = options.beta;
+  fusion_args.scale_ptr = Quantize ? block_scale.get() : nullptr;
+
+  typename Gemm::Arguments arguments{
+    cutlass::gemm::GemmUniversalMode::kGemm,
+    {options.m, options.n, options.k, options.l},
+    {block_A.get(), stride_A, block_B.get(), stride_B},
+    {fusion_args, block_C.get(), stride_C, block_D.get(), stride_D},
+    hw_info,
+    {options.swizzle, options.raster}
+  };
+
+  return arguments;
+}
+
+template <>
+typename GemmRef::Arguments
+args_from_options<GemmRef>(Options const& options)
+{
+  auto [fusion_args, hw_info] = args_from_options_common<GemmRef>(options);
+
+  typename GemmRef::Arguments arguments{
+    cutlass::gemm::GemmUniversalMode::kGemm,
+    {options.m, options.n, options.k, options.l},
+    {block_A.get(), stride_A, block_B.get(), stride_B},
+    {fusion_args, nullptr, stride_C, block_D_ref_gemm.get(), stride_D},
+    hw_info,
+    {options.swizzle, options.raster}
+  };
+
+  return arguments;
+}
+
+bool verify(const Options &options) {
+  // Check if output from CUTLASS kernel and reference kernel are equal or not
+  bool passed = false;
+  if constexpr (ExactMode) {
+    passed = cutlass::reference::device::BlockCompareEqual(block_D_ref.get(), block_D.get(), block_D.size());
+  }
+  else {
+    passed = cutlass::reference::device::BlockCompareRelativelyEqual(block_D_ref.get(), block_D.get(), block_D.size(), ElementD(options.tolerance), ElementD(options.nonzero_floor));
+  }
+
+  if (!passed && options.verbose) {
+    print("D reference: "); print_device_tensor(make_tensor(block_D_ref.get(), make_shape(options.m, options.n, options.l), stride_D));
+    print("D  computed: "); print_device_tensor(make_tensor(block_D.get(), make_shape(options.m, options.n, options.l), stride_D));
+  }
+
+  return passed;
+}
+
+/// Execute a given example GEMM computation
+bool run(Options& options)
+{
+  if (options.beta != 1.f && options.beta != 0.f) {
+    throw std::runtime_error("Specifying beta != 0/1 is not supported by verification kernel");
+  }
+
+  initialize(options);
+
+  std::cout << "Problem Size: " << shape_string(make_tuple(options.m, options.n, options.k, options.l)) << std::endl;
+  std::cout << "Data types: " << problem_desc_string<ElementA, ElementB, ElementAccumulator, ElementC, ElementD>() << std::endl;
+  std::cout << "Activation function: " << activation_func_string<ActivationFn>() << std::endl;
+  std::cout << "Kernel schedule: " << kernel_schedule_string<KernelSchedule>() << std::endl;
+  std::cout << "GEMM tile shape: " << shape_string(TileShape{}) << std::endl;
+  std::cout << "Epi tile shape: " << epilogue_tile_string(EpiTileShape{}) << std::endl;
+  std::cout << "Cluster shape: " << shape_string(ClusterShape{}) << std::endl;
+  std::cout << "Rasterization: " << options.raster_string() << " with a maximum CTA swizzle of " << options.swizzle << std::endl;
+  std::cout << "Options: Quantize = " << Quantize << ", Exact = " << ExactMode << ", BiasBroadcast = " << BiasBroadcast << std::endl;
+
+  Runner<Gemm> gemm(args_from_options<Gemm>(options));
+  Runner<GemmRef> gemm_ref(args_from_options<GemmRef>(options));
+
+  auto run_fused = [&](){ gemm.run(); };
+  auto run_ref_gemm = [&](){ gemm_ref.run(); };
+  auto run_activation = [&](){ 
+    do_activation<ActivationFn>(
+      block_D_ref.get(),
+      block_D_ref_gemm.get(),
+      Quantize ? block_scale.get() : static_cast<ElementScalar const*>(nullptr),
+      options.beta != 0.f ? block_C.get() : static_cast<ElementC const*>(nullptr),
+      BiasBroadcast,
+      offset_col_D.get(),
+      options.l,
+      options.m,
+      options.n * options.l,
+      false);
+  };
+  auto run_unfused = [&](){ run_ref_gemm(); run_activation(); };
+
+  run_fused();
+  CUDA_CHECK(cudaDeviceSynchronize());
+
+  // Correctness check
+  bool passed = true;
+  if (options.verify) {
+    run_unfused();
+    CUDA_CHECK(cudaDeviceSynchronize());
+
+    passed = verify(options);
+    std::cout << "Disposition: " << (passed ? "Passed" : "Failed") << std::endl;
+  }
+
+  // Run profiling loop
+  if (options.iterations > 0)
+  {
+    auto benchmark = [&](auto name, auto func)
+    {
+      BenchmarkResult result = run_benchmark(func, options.warmup, options.iterations);
+      double avg_tflops = double(options.total_flops()) / result.avg_runtime_ms / 1e9; // FLOP/ms -> TFLOP/s
+      printf(options.csv ? "%s,%.3f,%.0f\n" : "%20s  %20.3f  %20.0f\n",
+             name, result.avg_runtime_ms, avg_tflops);
+    };
+    printf(options.csv ? "%s,%s,%s\n" : "%20s  %20s  %20s\n",
+           "Kernel", "Runtime (ms)", "Throughput (Tflop/s)");
+    benchmark("Fused", run_fused);
+    benchmark("Unfused", run_unfused);
+    benchmark("GEMM only", run_ref_gemm);
+  }
+
+  return passed;
+}
+
+#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+int main(int argc, char const **args) {
+
+  // This example must be compiled with the CUDA 12.0 Toolkit or newer,
+  // and must run on a GPU with compute capability 9.0 (checked below).
+  if (__CUDACC_VER_MAJOR__ < 12) {
+    std::cerr << "This example requires CUDA 12 or newer.\n";
+    // Returning zero so this test passes on older Toolkits. Its actions are no-op.
+    return 0;
+  }
+
+  try {
+    Options options;
+    cutlass::CommandLine cmd(argc, args);
+    options.parse(cmd);
+
+    if (options.help) {
+      options.print_usage(std::cout) << std::endl;
+      return EXIT_SUCCESS;
+    }
+
+    if (options.device >= 0) {
+      CUDA_CHECK(cudaSetDevice(options.device));
+    }
+    else {
+      CUDA_CHECK(cudaGetDevice(&options.device));
+    }
+
+    cudaDeviceProp props;
+    CUDA_CHECK(cudaGetDeviceProperties(&props, options.device));
+    if (props.major != 9 || props.minor != 0) {
+      std::cerr << "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
+      return EXIT_SUCCESS;
+    }
+
+#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+    if (!run(options)) {
+      return EXIT_FAILURE;
+    }
+#endif
+  }
+  catch (std::exception const& e) {
+    std::cerr << e.what() << std::endl;
+    return EXIT_FAILURE;
+  }
+
+  return EXIT_SUCCESS;
+}
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
diff --git a/examples/113_hopper_gemm_activation_fusion/113_hopper_gemm_fused_gated_act.cu b/examples/113_hopper_gemm_activation_fusion/113_hopper_gemm_fused_gated_act.cu
new file mode 100644
index 000000000..f4046c2d0
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/113_hopper_gemm_fused_gated_act.cu
@@ -0,0 +1,559 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+/*! \file
+    \brief Hopper GEMM with gated activation fusion example
+*/
+
+#include <cstdlib>
+#include <iostream>
+#include <numeric>
+#include <vector>
+
+#include "cutlass/cutlass.h"
+
+#include "cute/tensor.hpp"
+#include "cutlass/epilogue/collective/collective_builder.hpp"
+#include "cutlass/epilogue/fusion/operations.hpp"
+#include "cutlass/epilogue/thread/activation.h"
+#include "cutlass/gemm/dispatch_policy.hpp"
+#include "cutlass/gemm/collective/collective_builder.hpp"
+#include "cutlass/gemm/device/gemm_universal_adapter.h"
+#include "cutlass/gemm/kernel/gemm_universal.hpp"
+#include "cutlass/gemm/kernel/tile_scheduler_params.h"
+
+#include "cutlass/util/command_line.h"
+#include "cutlass/util/device_memory.h"
+#include "cutlass/util/distribution.h"
+#include "cutlass/util/packed_stride.hpp"
+#include "cutlass/util/reference/device/tensor_compare.h"
+#include "cutlass/util/reference/device/tensor_fill.h"
+
+#include "helper.h"
+#include "options.hpp"
+#include "utils.hpp"
+#include "gated_stride.hpp"
+#include "gated_builder.hpp"
+#include "activation_kernel.cuh"
+
+using namespace cute;
+
+#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM kernel configurations
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+#if 0
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::ReLu<T>;
+#elif 1
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::SiLu<T>;
+#else
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::Identity<T>;
+#endif
+
+bool constexpr IsFp8         = true;  // whether to run with fp8 or fp16 input/output
+bool constexpr Quantize      = true;  // whether to quantize output with a per-tensor scale factor
+bool constexpr ExactMode     = false; // whether to reproduce unfused dual gemm+activation exactly
+bool constexpr BiasBroadcast = true;  // whether bias is broadcast along columns in each group
+bool constexpr Pingpong      = true;  // whether to use pingpong schedule
+
+using ProblemShape = Shape<int,int,int,int>;
+using GatedProblemShape = decltype(cutlass::sm90_make_gated_shape<0>(ProblemShape{}));
+
+// A matrix configuration
+using         ElementA    = conditional_t<IsFp8, cutlass::float_e4m3_t, cutlass::half_t>; // Element type for A matrix operand
+using         LayoutA     = cutlass::layout::RowMajor;                                    // Layout type for A matrix operand
+constexpr int AlignmentA  = 128 / cutlass::sizeof_bits<ElementA>::value;                  // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
+
+// B matrix configuration
+using         ElementB    = conditional_t<IsFp8, cutlass::float_e5m2_t, cutlass::half_t>; // Element type for B matrix operand
+using         LayoutB     = cutlass::layout::ColumnMajor;                                 // Layout type for B matrix operand
+constexpr int AlignmentB  = 128 / cutlass::sizeof_bits<ElementB>::value;                  // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
+
+// C/D matrix configuration
+using         ElementC    = cutlass::half_t;                                              // Element type for C matrix operand
+using         LayoutC     = cutlass::layout::ColumnMajor;                                 // Layout type for C matrix operand
+constexpr int AlignmentC  = 128 / cutlass::sizeof_bits<ElementC>::value;                  // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
+
+using         ElementD    = conditional_t<IsFp8, cutlass::float_e4m3_t, cutlass::half_t>; // Element type for D matrix operand
+using         LayoutD     = cutlass::layout::ColumnMajor;                                 // Layout type for D matrix operand
+constexpr int AlignmentD  = 128 / cutlass::sizeof_bits<ElementD>::value;                  // Memory access granularity/alignment of D matrix in units of elements (up to 16 bytes)
+
+// Core kernel configurations
+using ElementAccumulator  = float;                                           // Element type for internal accumulation
+using ElementCompute      = float;                                           // Element type for epilogue computation
+using ElementScalar       = float;                                           // Element type for epilogue scalars (alpha, beta, scale factor)
+using ElementIntermediate = cutlass::half_t;                                 // Element type of intermediate result between GEMM and bias+activation
+using ArchTag             = cutlass::arch::Sm90;                             // Tag indicating the minimum SM that supports the intended feature
+using OperatorClass       = cutlass::arch::OpClassTensorOp;                  // Operator class tag
+using EpiTileShape        = cutlass::epilogue::collective::EpilogueTileAuto; // Epilogue sub-tile shape
+using ClusterShape        = Shape<_1,_2,_1>;                                 // Shape of the threadblocks in a cluster
+using TileShapeK          = Int<128 * 8 / sizeof_bits<ElementA>::value>;     // Tile extent along K: 128 bytes' worth of ElementA
+
+using KernelScheduleCooperative = conditional_t<cutlass::gemm::collective::detail::is_input_fp8<ElementA, ElementB>(),
+                                                cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum,
+                                                cutlass::gemm::KernelTmaWarpSpecializedCooperative>;
+
+using KernelSchedulePingpong    = conditional_t<cutlass::gemm::collective::detail::is_input_fp8<ElementA, ElementB>(),
+                                                cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum,
+                                                cutlass::gemm::KernelTmaWarpSpecializedPingpong>;
+
+using KernelSchedule   = conditional_t<Pingpong, KernelSchedulePingpong, KernelScheduleCooperative>;
+using EpilogueSchedule = conditional_t<Pingpong, cutlass::epilogue::TmaWarpSpecialized, cutlass::epilogue::TmaWarpSpecializedCooperative>;
+using TileShape        = conditional_t<Pingpong, Shape<_128,_128,TileShapeK>, Shape<_128,_256,TileShapeK>>;
+
+using CollectiveEpilogue = typename cutlass::epilogue::collective::Sm90CollectiveBuilderGated<
+  OperatorClass,
+  TileShape, ClusterShape,
+  EpiTileShape,
+  ElementAccumulator, ElementCompute, ElementScalar,
+  conditional_t<ExactMode, ElementIntermediate, ElementCompute>,
+  ElementC, LayoutC, AlignmentC,
+  ElementD, LayoutD, AlignmentD,
+  EpilogueSchedule,
+  ActivationFn,
+  Quantize
+>::CollectiveOp;
+
+using CollectiveMainloop = typename cutlass::gemm::collective::Sm90CollectiveBuilderGated<
+  OperatorClass,
+  ElementA, LayoutA, AlignmentA,
+  ElementB, LayoutB, AlignmentB,
+  ElementAccumulator,
+  TileShape, ClusterShape,
+  cutlass::gemm::collective::StageCountAutoCarveout<
+    static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
+  KernelSchedule
+>::CollectiveOp;
+
+using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
+    GatedProblemShape,
+    CollectiveMainloop,
+    CollectiveEpilogue
+>;
+
+using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
+
+// Reference device GEMM implementation type
+using CollectiveEpilogueRef = typename cutlass::epilogue::collective::CollectiveBuilder<
+    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
+    TileShape, ClusterShape,
+    EpiTileShape,
+    ElementAccumulator, ElementCompute,
+    void,     LayoutC, AlignmentC,
+    ElementIntermediate, LayoutD, AlignmentD,
+    EpilogueSchedule,
+    cutlass::epilogue::fusion::ScaledAcc<ElementIntermediate, ElementCompute, ElementScalar>
+  >::CollectiveOp;
+
+using CollectiveMainloopRef = typename cutlass::gemm::collective::CollectiveBuilder<
+    ArchTag, OperatorClass,
+    ElementA, LayoutA, AlignmentA,
+    ElementB, LayoutB, AlignmentB,
+    ElementAccumulator,
+    TileShape, ClusterShape,
+    cutlass::gemm::collective::StageCountAutoCarveout<
+      static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
+    KernelSchedule
+  >::CollectiveOp;
+
+using GemmKernelRef = cutlass::gemm::kernel::GemmUniversal<
+    ProblemShape,
+    CollectiveMainloopRef,
+    CollectiveEpilogueRef
+>;
+
+using GemmRef = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelRef>;
+
+using StrideA = GemmKernel::StrideA;
+using StrideB = GemmKernel::StrideB;
+using StrideC = GemmKernel::StrideC;
+using StrideD = GemmKernel::CollectiveEpilogue::FusionCallbacks::Operation::GmemLayoutTagAux;
+
+using StrideARef = GemmKernelRef::StrideA;
+using StrideBRef = GemmKernelRef::StrideB;
+using StrideCRef = GemmKernelRef::StrideC;
+using StrideDRef = GemmKernelRef::StrideD;
+
+//
+// Data members
+//
+
+/// Initialization
+StrideA stride_A;
+StrideB stride_B;
+StrideC stride_C;
+StrideD stride_D;
+
+StrideARef stride_A_ref;
+StrideBRef stride_B_ref;
+StrideCRef stride_C_ref;
+StrideDRef stride_D_ref;
+StrideDRef stride_D_ref_gemm;
+
+cutlass::DeviceAllocation<ElementA> block_A;
+cutlass::DeviceAllocation<ElementB> block_B;
+cutlass::DeviceAllocation<ElementC> block_C;
+cutlass::DeviceAllocation<ElementD> block_D;
+cutlass::DeviceAllocation<ElementD> block_D_ref;
+cutlass::DeviceAllocation<ElementIntermediate> block_D_ref_gemm;
+cutlass::DeviceAllocation<int64_t> offset_col_D;
+cutlass::DeviceAllocation<ElementScalar> block_scale;
+
+#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// Testbed utility types
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+// Command line options parsing
+struct Options : GemmOptionsBase<cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90Params::RasterOrderOptions> {
+
+  using Base = GemmOptionsBase<cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90Params::RasterOrderOptions>;
+
+  float alpha = 1.f, beta = 0.f;
+  int m = 10240, n = 2048, k = 8192, l = 10;
+
+  // Parses the command line
+  void parse(cutlass::CommandLine const& cmd) {
+    Base::parse(cmd);
+    cmd.get_cmd_line_argument("alpha", alpha);
+    cmd.get_cmd_line_argument("beta", beta);
+    cmd.get_cmd_line_argument("m", m);
+    cmd.get_cmd_line_argument("n", n);
+    cmd.get_cmd_line_argument("k", k);
+    cmd.get_cmd_line_argument("l", l);
+  }
+
+  /// Prints the usage statement.
+  std::ostream & print_usage(std::ostream &out) const {
+
+    out << program_path << "\n"
+           "\n"
+           "  Hopper GEMM with fused activation function.\n"
+           "\n"
+           "Options:\n"
+           "\n"
+           "  --help                      If specified, displays this usage statement\n\n"
+           "  --m=<int>                   Sets the M extent of the GEMM\n"
+           "  --n=<int>                   Sets the N extent of the GEMM\n"
+           "  --k=<int>                   Sets the K extent of the GEMM\n"
+           "  --l=<int>                   Sets the L extent of the GEMM\n"
+           "  --alpha=<f32>               Epilogue scalar alpha\n"
+           "  --beta=<f32>                Epilogue scalar beta\n"
+           "  --raster=<char>             CTA Rasterization direction (N for along N, M for along M, and H for heuristic)\n"
+           "  --swizzle=<int>             CTA Rasterization swizzle\n"
+           "  --warmup=<int>              Number of warmup iterations to perform.\n"
+           "  --iterations=<int>          Number of profiling iterations to perform.\n"
+           "  --verbose                   Verbose mode (output detailed verification result)\n"
+           "  --verify=<bool>             Verification (correctness check) on/off\n"
+           "  --sms=<int>                 Number of SMs to run the GEMMs on\n"
+           "  --device=<int>              Device index\n"
+           "\n"
+           "Examples:\n"
+           "\n"
+        << program_path << " --m=1024 --n=512 --k=1024 --l=10 --alpha=2 --beta=1\n";
+
+    return out;
+  }
+
+  /// Compute total number of floating point operations
+  double total_flops() const {
+    // Two flops per multiply-add
+    return uint64_t(2) * m * n * k * l;
+  }
+};
+
+#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM setup and evaluation
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+/// Initialize operands to be used in the GEMM and reference GEMM
+void initialize(const Options &options) {
+
+  auto [M, N, K, L] = make_tuple(options.m, options.n, options.k, options.l);
+  auto NC = BiasBroadcast ? 1 : N;
+
+  namespace cd = cutlass::gemm::collective::detail;
+
+  stride_A = cutlass::sm90_make_gated_packed_stride(StrideA{}, {M, K, L});
+  stride_B = cutlass::make_cute_packed_stride(StrideB{}, {N, K, L});
+  stride_C = cutlass::sm90_make_gated_packed_stride(StrideC{}, {M, NC, L});
+  stride_D = cutlass::sm90_make_gated_packed_stride(StrideD{}, {M/2, N, L});
+
+  stride_A_ref = cutlass::make_cute_packed_stride(StrideARef{}, {M, K, L});
+  stride_B_ref = cutlass::make_cute_packed_stride(StrideBRef{}, {N, K, L});
+  stride_C_ref = cutlass::make_cute_packed_stride(StrideCRef{}, {M, NC, L});
+  stride_D_ref = cutlass::make_cute_packed_stride(StrideDRef{}, {M/2, N, L});
+  stride_D_ref_gemm = cutlass::make_cute_packed_stride(StrideDRef{}, {M, N, L});
+
+  block_A.reset(M * K * L);
+  block_B.reset(N * K * L);
+  block_C.reset(M * NC * L);
+  block_D.reset(M/2 * N * L);
+  block_D_ref.reset(M/2 * N * L);
+  block_D_ref_gemm.reset(M * N * L);
+  block_scale.reset(1);
+
+  if constexpr (BiasBroadcast) {
+    get<1>(stride_C) = 0;
+    get<1>(stride_C_ref) = 0;
+  }
+
+  std::vector<int64_t> offset_col_D_host(options.l + 1);
+  std::iota(offset_col_D_host.begin(), offset_col_D_host.end(), 0ll);
+  std::transform(offset_col_D_host.begin(), offset_col_D_host.end(), offset_col_D_host.begin(), [&](auto i) { return i * options.n; });
+  offset_col_D.reset(options.l + 1);
+  offset_col_D.copy_from_host(offset_col_D_host.data());
+
+  cutlass::reference::device::BlockFillRandom(block_A.get(), block_A.size(), 2024, options.dist_a);
+  cutlass::reference::device::BlockFillRandom(block_B.get(), block_B.size(), 2025, options.dist_b);
+  cutlass::reference::device::BlockFillRandom(block_C.get(), block_C.size(), 2026, options.dist_c);
+  cutlass::reference::device::BlockFillRandomUniform(block_scale.get(), block_scale.size(), 2027, 0.5, 1.0);
+}
+
+template <typename GemmT>
+auto args_from_options_common(const Options &options)
+{
+  cutlass::KernelHardwareInfo hw_info;
+  hw_info.device_id = options.device;
+  hw_info.sm_count = options.sm_count > 0 
+                   ? options.sm_count 
+                   : cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
+
+  typename GemmT::Arguments arguments;
+  decltype(arguments.epilogue.thread) fusion_args{};
+
+  fusion_args.alpha = options.alpha;
+
+  return make_tuple(fusion_args, hw_info);
+}
+
+template <typename GemmT>
+typename GemmT::Arguments
+args_from_options(const Options &options);
+
+template <>
+typename Gemm::Arguments
+args_from_options<Gemm>(Options const& options)
+{
+  auto [fusion_args, hw_info] = args_from_options_common<Gemm>(options);
+
+  fusion_args.beta = options.beta;
+  fusion_args.scale_ptr = Quantize ? block_scale.get() : nullptr;
+  fusion_args.ptr_D = block_D.get();
+  fusion_args.dD = stride_D;
+
+  namespace cd = cutlass::gemm::collective::detail;
+  auto problem_shape = cutlass::sm90_make_gated_shape<0>(make_shape(options.m, options.n, options.k, options.l));
+
+  typename Gemm::Arguments arguments{
+    cutlass::gemm::GemmUniversalMode::kGemm,
+    problem_shape,
+    {block_A.get(), stride_A, block_B.get(), stride_B},
+    {fusion_args, block_C.get(), stride_C, nullptr, {}},
+    hw_info,
+    {options.swizzle, options.raster}
+  };
+
+  return arguments;
+}
+
+template <>
+typename GemmRef::Arguments
+args_from_options<GemmRef>(Options const& options)
+{
+  auto [fusion_args, hw_info] = args_from_options_common<Gemm>(options);
+
+  auto problem_shape = make_shape(options.m, options.n, options.k, options.l);
+
+  typename GemmRef::Arguments arguments{
+    cutlass::gemm::GemmUniversalMode::kGemm,
+    problem_shape,
+    {block_A.get(), stride_A_ref, block_B.get(), stride_B_ref},
+    {{options.alpha, 0}, nullptr, stride_C_ref, block_D_ref_gemm.get(), stride_D_ref_gemm},
+    hw_info,
+    {options.swizzle, options.raster}
+  };
+
+  return arguments;
+}
+
+bool verify(const Options &options) {
+  // Check if output from CUTLASS kernel and reference kernel are equal or not
+  bool passed = false;
+  if constexpr (ExactMode) {
+    passed = cutlass::reference::device::BlockCompareEqual(block_D_ref.get(), block_D.get(), block_D.size());
+  }
+  else {
+    passed = cutlass::reference::device::BlockCompareRelativelyEqual(block_D_ref.get(), block_D.get(), block_D.size(), ElementD(options.tolerance), ElementD(options.nonzero_floor));
+  }
+
+  if (!passed && options.verbose) {
+    auto [M,N,K,L] = make_shape(options.m, options.n, options.k, options.l);
+    print("D reference: "); print_device_tensor(make_tensor(block_D_ref.get(), make_shape(M/2,N,L), stride_D_ref));
+    print("D  computed: "); print_device_tensor(make_tensor(block_D.get(), make_shape(M/2,N,L), stride_D_ref));
+  }
+
+  return passed;
+}
+
+/// Execute a given example GEMM computation
+bool run(Options& options)
+{
+  if (options.beta != 1.f && options.beta != 0.f) {
+    throw std::runtime_error("beta must be 0 or 1; other values are not supported by the verification kernel");
+  }
+
+  initialize(options);
+
+  std::cout << "Problem Size: " << shape_string(make_tuple(options.m, options.n, options.k, options.l)) << std::endl;
+  std::cout << "Data types: " << problem_desc_string<ElementA, ElementB, ElementAccumulator, ElementC, ElementD>() << std::endl;
+  std::cout << "Activation function: " << activation_func_string<ActivationFn>() << std::endl;
+  std::cout << "Kernel schedule: " << kernel_schedule_string<KernelSchedule>() << std::endl;
+  std::cout << "GEMM tile shape: " << shape_string(TileShape{}) << std::endl;
+  std::cout << "Epi tile shape: " << epilogue_tile_string(EpiTileShape{}) << std::endl;
+  std::cout << "Cluster shape: " << shape_string(ClusterShape{}) << std::endl;
+  std::cout << "Rasterization: " << options.raster_string() << " with a maximum CTA swizzle of " << options.swizzle << std::endl;
+  std::cout << "Options: Quantize = " << Quantize << ", Exact = " << ExactMode << ", BiasBroadcast = " << BiasBroadcast << std::endl;
+
+  Runner<Gemm> gemm(args_from_options<Gemm>(options));
+  Runner<GemmRef> gemm_ref(args_from_options<GemmRef>(options));
+
+  auto run_fused = [&](){ gemm.run(); };
+  auto run_ref_gemm = [&](){ gemm_ref.run(); };
+  auto run_activation = [&](){ 
+    do_activation<ActivationFn>(
+      block_D_ref.get(),
+      block_D_ref_gemm.get(),
+      Quantize ? block_scale.get() : static_cast<ElementScalar const*>(nullptr),
+      options.beta != 0.f ? block_C.get() : static_cast<ElementC const*>(nullptr),
+      BiasBroadcast,
+      offset_col_D.get(),
+      options.l,
+      options.m / 2,
+      options.n * options.l,
+      true);
+  };
+  auto run_unfused = [&](){ run_ref_gemm(); run_activation(); };
+
+  run_fused();
+  CUDA_CHECK(cudaDeviceSynchronize());
+
+  // Correctness check
+  bool passed = true;
+  if (options.verify) {
+    run_unfused();
+    CUDA_CHECK(cudaDeviceSynchronize());
+
+    passed = verify(options);
+    std::cout << "Disposition: " << (passed ? "Passed" : "Failed") << std::endl;
+  }
+
+  // Run profiling loop
+  if (options.iterations > 0)
+  {
+    auto benchmark = [&](auto name, auto func)
+    {
+      BenchmarkResult result = run_benchmark(func, options.warmup, options.iterations);
+      double avg_tflops = double(options.total_flops()) / result.avg_runtime_ms / 1e9; // FLOP/ms -> TFLOP/s
+      printf(options.csv ? "%s,%.3f,%.0f\n" : "%20s  %20.3f  %20.0f\n",
+             name, result.avg_runtime_ms, avg_tflops);
+    };
+    printf(options.csv ? "%s,%s,%s\n" : "%20s  %20s  %20s\n",
+           "Kernel", "Runtime (ms)", "Throughput (Tflop/s)");
+    benchmark("Fused", run_fused);
+    benchmark("Unfused", run_unfused);
+    benchmark("GEMM only", run_ref_gemm);
+  }
+
+  return passed;
+}
+
+#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+int main(int argc, char const **args) {
+
+  // This example must be compiled with the CUDA 12.0 Toolkit or newer,
+  // and must run on a GPU with compute capability 9.0 (checked below).
+  if (__CUDACC_VER_MAJOR__ < 12) {
+    std::cerr << "This example requires CUDA 12 or newer.\n";
+    // Returning zero so this test passes on older Toolkits. Its actions are no-op.
+    return 0;
+  }
+
+  try {
+    Options options;
+    cutlass::CommandLine cmd(argc, args);
+    options.parse(cmd);
+
+    if (options.help) {
+      options.print_usage(std::cout) << std::endl;
+      return EXIT_SUCCESS;
+    }
+
+    if (options.device >= 0) {
+      CUDA_CHECK(cudaSetDevice(options.device));
+    }
+    else {
+      CUDA_CHECK(cudaGetDevice(&options.device));
+    }
+
+    cudaDeviceProp props;
+    CUDA_CHECK(cudaGetDeviceProperties(&props, options.device));
+    if (props.major != 9 || props.minor != 0) {
+      std::cerr << "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
+      return EXIT_SUCCESS;
+    }
+
+#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+    if (!run(options)) {
+      return EXIT_FAILURE;
+    }
+#endif
+  }
+  catch (std::exception const& e) {
+    std::cerr << e.what() << std::endl;
+    return EXIT_FAILURE;
+  }
+
+  return EXIT_SUCCESS;
+}
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
diff --git a/examples/113_hopper_gemm_activation_fusion/113_hopper_grouped_gemm_fused_act.cu b/examples/113_hopper_gemm_activation_fusion/113_hopper_grouped_gemm_fused_act.cu
new file mode 100644
index 000000000..81095fd7b
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/113_hopper_grouped_gemm_fused_act.cu
@@ -0,0 +1,655 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+/*! \file
+    \brief Hopper grouped GEMM with activation fusion example
+*/
+
+#include <iostream>
+#include <vector>
+#include <exception>
+#include <random>
+#include <cfloat>
+
+#include "cute/tensor.hpp"
+
+#include "cutlass/cutlass.h"
+#include "cutlass/epilogue/collective/collective_builder.hpp"
+#include "cutlass/epilogue/fusion/operations.hpp"
+#include "cutlass/epilogue/thread/activation.h"
+#include "cutlass/gemm/dispatch_policy.hpp"
+#include "cutlass/gemm/group_array_problem_shape.hpp"
+#include "cutlass/gemm/collective/collective_builder.hpp"
+#include "cutlass/gemm/kernel/gemm_universal.hpp"
+#include "cutlass/gemm/device/gemm_universal_adapter.h"
+
+#include "cutlass/util/command_line.h"
+#include "cutlass/util/device_memory.h"
+#include "cutlass/util/distribution.h"
+#include "cutlass/util/packed_stride.hpp"
+#include "cutlass/util/reference/device/tensor_compare.h"
+#include "cutlass/util/reference/device/tensor_fill.h"
+
+#include "helper.h"
+#include "options.hpp"
+#include "utils.hpp"
+#include "sm90_lin_comb_elt_act_scaled.hpp"
+#include "activation_kernel.cuh"
+
+using namespace cute;
+
+#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM kernel configurations
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+#if 0
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::ReLu<T>;
+#elif 1
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::SiLu<T>;
+#else
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::Identity<T>;
+#endif
+
+bool constexpr IsFp8         = true;  // whether to run with fp8 or fp16 input/output
+bool constexpr Quantize      = true;  // whether to quantize output with a per-tensor scale factor
+bool constexpr ExactMode     = false; // whether to reproduce unfused dual gemm+activation exactly
+bool constexpr BiasBroadcast = true;  // whether bias is broadcast along columns in each group
+bool constexpr Pingpong      = true;  // whether to use pingpong schedule
+
+using ProblemShape = Shape<int,int,int>; // <M,N,K> per group
+using GroupProblemShape = cutlass::gemm::GroupProblemShape<ProblemShape>;
+
+// A matrix configuration
+using         ElementA    = conditional_t<IsFp8, cutlass::float_e4m3_t, cutlass::half_t>; // Element type for A matrix operand
+using         LayoutA     = cutlass::layout::RowMajor;                                    // Layout type for A matrix operand
+constexpr int AlignmentA  = 128 / cutlass::sizeof_bits<ElementA>::value;                  // Alignment of A matrix in units of elements (up to 16 bytes)
+
+// B matrix configuration
+using         ElementB    = conditional_t<IsFp8, cutlass::float_e5m2_t, cutlass::half_t>; // Element type for B matrix operand
+using         LayoutB     = cutlass::layout::ColumnMajor;                                 // Layout type for B matrix operand
+constexpr int AlignmentB  = 128 / cutlass::sizeof_bits<ElementB>::value;                  // Alignment of B matrix in units of elements (up to 16 bytes)
+
+// C matrix configuration
+using         ElementC    = cutlass::half_t;                                              // Element type for C matrix operand
+using         LayoutC     = cutlass::layout::ColumnMajor;                                 // Layout type for C matrix operand
+constexpr int AlignmentC  = 128 / cutlass::sizeof_bits<ElementC>::value;                  // Alignment of C matrix in units of elements (up to 16 bytes)
+
+// D matrix configuration
+using         ElementD    = conditional_t<IsFp8, cutlass::float_e4m3_t, cutlass::half_t>; // Element type for D matrix operand
+using         LayoutD     = cutlass::layout::ColumnMajor;                                 // Layout type for D matrix operand
+constexpr int AlignmentD  = 128 / cutlass::sizeof_bits<ElementD>::value;                  // Alignment of D matrix in units of elements (up to 16 bytes)
+
+int constexpr AlignmentM = max(make_tuple(cutlass::gemm::detail::is_mn_major_A<LayoutA>() ? AlignmentA : 1,
+                                          cutlass::gemm::detail::is_m_major_C<LayoutC>()  ? AlignmentC : 1,
+                                          cutlass::gemm::detail::is_m_major_C<LayoutD>()  ? AlignmentD : 1));
+int constexpr AlignmentN = max(make_tuple(cutlass::gemm::detail::is_mn_major_B<LayoutB>() ? AlignmentB : 1,
+                                          cutlass::gemm::detail::is_n_major_C<LayoutC>()  ? AlignmentC : 1,
+                                          cutlass::gemm::detail::is_n_major_C<LayoutD>()  ? AlignmentD : 1));
+int constexpr AlignmentK = max(make_tuple(cutlass::gemm::detail::is_k_major_A<LayoutA>()  ? AlignmentA : 1,
+                                          cutlass::gemm::detail::is_k_major_B<LayoutB>()  ? AlignmentB : 1));
+
+// Core kernel configurations
+using ElementAccumulator  = float;                                           // Element type for internal accumulation
+using ElementCompute      = float;                                           // Element type for epilogue compute
+using ElementScalar       = float;                                           // Element type for scalar values (alpha, beta)
+using ElementIntermediate = cutlass::half_t;                                 // Element type of intermediate result between GEMM and bias+activation
+using ArchTag             = cutlass::arch::Sm90;                             // Tag indicating the minimum SM that supports the intended feature
+using OperatorClass       = cutlass::arch::OpClassTensorOp;                  // Operator class tag
+using EpiTileShape        = cutlass::epilogue::collective::EpilogueTileAuto; // Epilogue sub-tile shape
+using ClusterShape        = Shape<_1,_2,_1>;                                 // Cluster shape
+using TileShapeK          = Int<128 * 8 / sizeof_bits<ElementA>::value>;
+
+using KernelScheduleCooperative = conditional_t<cutlass::gemm::collective::detail::is_input_fp8<ElementA, ElementB>(),
+                                                cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperativeFP8FastAccum,
+                                                cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperative>;
+
+using KernelSchedulePingpong    = conditional_t<cutlass::gemm::collective::detail::is_input_fp8<ElementA, ElementB>(),
+                                                cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpongFP8FastAccum,
+                                                cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpong>;
+
+using KernelSchedule   = conditional_t<Pingpong, KernelSchedulePingpong, KernelScheduleCooperative>;
+using EpilogueSchedule = conditional_t<Pingpong, cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong, cutlass::epilogue::PtrArrayTmaWarpSpecializedCooperative>;
+using TileShape        = conditional_t<Pingpong, Shape<_128,_128,TileShapeK>, Shape<_128,_256,TileShapeK>>;
+
+// GEMM setup
+
+using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
+    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
+    TileShape, ClusterShape,
+    EpiTileShape,
+    ElementAccumulator, ElementCompute,
+    ElementC, LayoutC *, AlignmentC,
+    ElementD, LayoutD *, AlignmentD,
+    EpilogueSchedule,
+    cutlass::epilogue::fusion::AccCastLinCombEltActScale<
+      Quantize,
+      ActivationFn,
+      ElementD,
+      ElementCompute,
+      ElementC,
+      ElementScalar,
+      conditional_t<ExactMode, ElementIntermediate, ElementCompute>
+    >
+  >::CollectiveOp;
+
+using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
+    ArchTag, OperatorClass,
+    ElementA, LayoutA *, AlignmentA,
+    ElementB, LayoutB *, AlignmentB,
+    ElementAccumulator,
+    TileShape, ClusterShape,
+    cutlass::gemm::collective::StageCountAutoCarveout<
+      static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
+    KernelSchedule
+  >::CollectiveOp;
+
+using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
+    GroupProblemShape,
+    CollectiveMainloop,
+    CollectiveEpilogue
+>;
+
+using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
+
+// Reference GEMM setup
+
+using CollectiveEpilogueRef = typename cutlass::epilogue::collective::CollectiveBuilder<
+    ArchTag, OperatorClass,
+    TileShape, ClusterShape, EpiTileShape,
+    ElementAccumulator, ElementCompute,
+    void,     LayoutC *, AlignmentC,
+    ElementIntermediate, LayoutD *, AlignmentD,
+    EpilogueSchedule,
+    cutlass::epilogue::fusion::ScaledAcc<ElementIntermediate, ElementCompute, ElementScalar>
+  >::CollectiveOp;
+
+using CollectiveMainloopRef = typename cutlass::gemm::collective::CollectiveBuilder<
+    ArchTag, OperatorClass,
+    ElementA, LayoutA *, AlignmentA,
+    ElementB, LayoutB *, AlignmentB,
+    ElementAccumulator,
+    TileShape, ClusterShape,
+    cutlass::gemm::collective::StageCountAutoCarveout<
+      static_cast<int>(sizeof(typename CollectiveEpilogueRef::SharedStorage))>,
+    KernelSchedule
+  >::CollectiveOp;
+
+using GemmKernelRef = cutlass::gemm::kernel::GemmUniversal<
+    GroupProblemShape,
+    CollectiveMainloopRef,
+    CollectiveEpilogueRef
+>;
+
+using GemmRef = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelRef>;
+
+using StrideA = GemmKernel::InternalStrideA;
+using StrideB = GemmKernel::InternalStrideB;
+using StrideC = GemmKernel::InternalStrideC;
+using StrideD = GemmKernel::InternalStrideD;
+
+// Host-side allocations
+std::vector<ProblemShape> problem_shapes_host;
+
+std::vector<int64_t> offset_A;
+std::vector<int64_t> offset_B;
+std::vector<int64_t> offset_C;
+std::vector<int64_t> offset_D;
+std::vector<int64_t> offset_col_D_host;
+
+std::vector<StrideA> stride_A_host;
+std::vector<StrideB> stride_B_host;
+std::vector<StrideC> stride_C_host;
+std::vector<StrideD> stride_D_host;
+
+std::vector<ElementScalar> alpha_host;
+std::vector<ElementScalar> beta_host;
+std::vector<ElementScalar> scale_host;
+
+// Device-side allocations
+cutlass::DeviceAllocation<ProblemShape> problem_shapes;
+
+cutlass::DeviceAllocation<ElementA> block_A;
+cutlass::DeviceAllocation<ElementB> block_B;
+cutlass::DeviceAllocation<ElementC> block_C;
+cutlass::DeviceAllocation<ElementD> block_D;
+cutlass::DeviceAllocation<ElementD> block_D_ref;
+cutlass::DeviceAllocation<ElementIntermediate> block_D_ref_gemm;
+
+cutlass::DeviceAllocation<ElementA const*> ptr_A;
+cutlass::DeviceAllocation<ElementB const*> ptr_B;
+cutlass::DeviceAllocation<ElementC const*> ptr_C;
+cutlass::DeviceAllocation<ElementD *> ptr_D;
+cutlass::DeviceAllocation<ElementIntermediate *> ptr_D_ref;
+cutlass::DeviceAllocation<int64_t> offset_col_D;
+
+cutlass::DeviceAllocation<StrideA> stride_A;
+cutlass::DeviceAllocation<StrideB> stride_B;
+cutlass::DeviceAllocation<StrideC> stride_C;
+cutlass::DeviceAllocation<StrideD> stride_D;
+
+// Note, this is an array of pointers to alpha and beta scaling values per group
+cutlass::DeviceAllocation<ElementScalar const*> ptr_alpha;
+cutlass::DeviceAllocation<ElementScalar const*> ptr_beta;
+cutlass::DeviceAllocation<ElementScalar> block_alpha;
+cutlass::DeviceAllocation<ElementScalar> block_beta;
+cutlass::DeviceAllocation<ElementScalar> block_scale;
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM setup and evaluation
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+using Options = GroupedGemmOptions;
+
+/// Allocates device-side data
+void allocate(const Options &options) {
+  int64_t total_elements_A = 0;
+  int64_t total_elements_B = 0;
+  int64_t total_elements_C = 0;
+  int64_t total_elements_D = 0;
+  int64_t total_cols_D = 0;
+
+  for (int32_t i = 0; i < options.groups; ++i) {
+
+    cutlass::gemm::GemmCoord const& problem_size = options.problem_sizes[i];
+    auto problem_shape_ref = make_shape(problem_size.m(), problem_size.n(), problem_size.k());
+    auto [M, N, K] = problem_shape_ref;
+
+    problem_shapes_host.push_back(make_shape(M, N, K));
+
+    offset_A.push_back(total_elements_A);
+    offset_B.push_back(total_elements_B);
+    offset_C.push_back(total_elements_C);
+    offset_D.push_back(total_elements_D);
+    offset_col_D_host.push_back(total_cols_D);
+
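+    int64_t elements_A = static_cast<int64_t>(M) * K; // widen before multiplying to avoid 32-bit overflow
+    int64_t elements_B = static_cast<int64_t>(K) * N;
+    int64_t elements_C = static_cast<int64_t>(M) * N;
+    int64_t elements_D = static_cast<int64_t>(M) * N;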
+
+    stride_A_host.push_back(cutlass::make_cute_packed_stride(StrideA{}, {M, K, 1}));
+    stride_B_host.push_back(cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1}));
+    stride_C_host.push_back(cutlass::make_cute_packed_stride(StrideC{}, {M, N, 1}));
+    stride_D_host.push_back(cutlass::make_cute_packed_stride(StrideD{}, {M, N, 1}));
+
+    if constexpr (BiasBroadcast) {
+      get<1>(stride_C_host.back()) = 0;
+      elements_C = M;
+    }
+
+    total_elements_A += elements_A;
+    total_elements_B += elements_B;
+    total_elements_C += elements_C;
+    total_elements_D += elements_D;
+    total_cols_D += N;
+  }
+  offset_col_D_host.push_back(total_cols_D);
+
+  block_A.reset(total_elements_A);
+  block_B.reset(total_elements_B);
+  block_C.reset(total_elements_C);
+  block_D.reset(total_elements_D);
+  block_D_ref.reset(total_elements_D);
+  block_D_ref_gemm.reset(total_elements_D);
+  block_alpha.reset(options.groups);
+  block_beta.reset(options.groups);
+  block_scale.reset(1);
+
+  problem_shapes.reset(options.groups);
+
+  ptr_A.reset(options.groups);
+  ptr_B.reset(options.groups);
+  ptr_C.reset(options.groups);
+  ptr_D.reset(options.groups);
+  ptr_D_ref.reset(options.groups);
+  ptr_alpha.reset(options.groups);
+  ptr_beta.reset(options.groups);
+
+  stride_A.reset(options.groups);
+  stride_B.reset(options.groups);
+  stride_C.reset(options.groups);
+  stride_D.reset(options.groups);
+
+  offset_col_D.reset(options.groups + 1);
+}
+
+/// Initialize operands to be used in the GEMM and reference GEMM
+void initialize(const Options &options) {
+
+  //
+  // Assign pointers
+  //
+
+  std::vector<ElementA const*>      ptr_A_host(options.groups);
+  std::vector<ElementB const*>      ptr_B_host(options.groups);
+  std::vector<ElementC const*>      ptr_C_host(options.groups);
+  std::vector<ElementD*>            ptr_D_host(options.groups);
+  std::vector<ElementIntermediate*> ptr_D_ref_host(options.groups);
+  std::vector<ElementScalar const*> ptr_alpha_host(options.groups);
+  std::vector<ElementScalar const*> ptr_beta_host(options.groups);
+
+  std::mt19937 rng(2024);
+  std::uniform_real_distribution<ElementScalar> alpha_dist(0.5, 2.0);
+  std::uniform_real_distribution<ElementScalar> beta_dist(1.0, 1.0); // (0.0, 4.0);
+  std::uniform_real_distribution<ElementScalar> scale_dist(0.5, 1.0);
+
+  for (int32_t i = 0; i < options.groups; ++i) {
+    ptr_A_host[i] = block_A.get() + offset_A[i];
+    ptr_B_host[i] = block_B.get() + offset_B[i];
+    ptr_C_host[i] = block_C.get() + offset_C[i];
+    ptr_D_host[i] = block_D.get() + offset_D[i];
+    ptr_D_ref_host[i] = block_D_ref_gemm.get() + offset_D[i];
+    alpha_host.push_back(options.alpha == FLT_MAX ? alpha_dist(rng) : options.alpha);
+    beta_host.push_back(options.beta == FLT_MAX ? beta_dist(rng) : options.beta);
+    ptr_alpha_host[i] = block_alpha.get() + i;
+    ptr_beta_host[i] = block_beta.get() + i;
+  }
+  scale_host.push_back(scale_dist(rng));
+
+  problem_shapes.copy_from_host(problem_shapes_host.data());
+
+  ptr_A.copy_from_host(ptr_A_host.data());
+  ptr_B.copy_from_host(ptr_B_host.data());
+  ptr_C.copy_from_host(ptr_C_host.data());
+  ptr_D.copy_from_host(ptr_D_host.data());
+  ptr_D_ref.copy_from_host(ptr_D_ref_host.data());
+  ptr_alpha.copy_from_host(ptr_alpha_host.data());
+  ptr_beta.copy_from_host(ptr_beta_host.data());
+
+  stride_A.copy_from_host(stride_A_host.data());
+  stride_B.copy_from_host(stride_B_host.data());
+  stride_C.copy_from_host(stride_C_host.data());
+  stride_D.copy_from_host(stride_D_host.data());
+
+  offset_col_D.copy_from_host(offset_col_D_host.data());
+
+  block_alpha.copy_from_host(alpha_host.data());
+  block_beta.copy_from_host(beta_host.data());
+  block_scale.copy_from_host(scale_host.data());
+
+  cutlass::reference::device::BlockFillRandom(block_A.get(), block_A.size(), 2024, options.dist_a);
+  cutlass::reference::device::BlockFillRandom(block_B.get(), block_B.size(), 2025, options.dist_b);
+  cutlass::reference::device::BlockFillRandom(block_C.get(), block_C.size(), 2026, options.dist_c);
+}
+
+template <typename GemmT>
+auto args_from_options_common(const Options &options, bool host_problem_shapes_available)
+{
+  cutlass::KernelHardwareInfo hw_info;
+  hw_info.device_id = 0;
+  hw_info.sm_count = options.sm_count > 0
+                   ? options.sm_count
+                   : cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
+
+  typename GemmT::Arguments arguments;
+  decltype(arguments.epilogue.thread) fusion_args{};
+
+  if (options.alpha != FLT_MAX) {
+    fusion_args.alpha = options.alpha;
+  }
+  else {
+    fusion_args.alpha_ptr_array = ptr_alpha.get();
+    fusion_args.dAlpha = {{},{},1};
+  }
+
+  auto problem_shapes_host_ptr = host_problem_shapes_available ? problem_shapes_host.data() : nullptr;
+
+  return make_tuple(fusion_args, hw_info, problem_shapes_host_ptr);
+}
+
+template <typename GemmT>
+typename GemmT::Arguments
+args_from_options(const Options &options, bool host_problem_shapes_available);
+
+template <>
+Gemm::Arguments
+args_from_options<Gemm>(const Options &options, bool host_problem_shapes_available)
+{
+  auto [fusion_args, hw_info, problem_shapes_host_ptr] = args_from_options_common<Gemm>(options, host_problem_shapes_available);
+
+  fusion_args.beta = options.beta;
+  fusion_args.beta_ptr = block_beta.get();
+  fusion_args.scale_ptr = Quantize ? block_scale.get() : nullptr;
+
+  using RasterOrderOptions = cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90GroupParams<ProblemShape>::RasterOrderOptions;
+  RasterOrderOptions raster = [&] {
+    switch (options.raster) {
+      case Options::RasterOrderOptions::Heuristic: return RasterOrderOptions::Heuristic;
+      case Options::RasterOrderOptions::AlongM: return RasterOrderOptions::AlongM;
+      case Options::RasterOrderOptions::AlongN: return RasterOrderOptions::AlongN;
+      default: return RasterOrderOptions::Heuristic;
+    }
+  }();
+
+  Gemm::Arguments arguments{
+    cutlass::gemm::GemmUniversalMode::kGrouped,
+    {options.groups, problem_shapes.get(), problem_shapes_host_ptr},
+    {ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get()},
+    {fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()},
+    hw_info,
+    {options.swizzle, raster}
+  };
+
+  return arguments;
+}
+
+template <>
+GemmRef::Arguments
+args_from_options<GemmRef>(const Options &options, bool host_problem_shapes_available)
+{
+  auto [fusion_args, hw_info, problem_shapes_host_ptr] = args_from_options_common<GemmRef>(options, host_problem_shapes_available);
+
+  using RasterOrderOptions = cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90GroupParams<ProblemShape>::RasterOrderOptions;
+  RasterOrderOptions raster = [&] {
+    switch (options.raster) {
+      case Options::RasterOrderOptions::Heuristic: return RasterOrderOptions::Heuristic;
+      case Options::RasterOrderOptions::AlongM: return RasterOrderOptions::AlongM;
+      case Options::RasterOrderOptions::AlongN: return RasterOrderOptions::AlongN;
+      default: return RasterOrderOptions::Heuristic;
+    }
+  }();
+
+  GemmRef::Arguments arguments{
+    cutlass::gemm::GemmUniversalMode::kGrouped,
+    {options.groups, problem_shapes.get(), problem_shapes_host_ptr},
+    {ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get()},
+    {fusion_args, nullptr, stride_C.get(), ptr_D_ref.get(), stride_D.get()},
+    hw_info,
+    {options.swizzle, raster}
+  };
+
+  return arguments;
+}
+
+bool verify(const Options &options) {
+  bool passed = true;
+  for (int32_t i = 0; i < options.groups; ++i) {
+    auto const& problem_size = options.problem_sizes[i];
+    auto problem_shape = make_shape(problem_size.m(), problem_size.n(), problem_size.k());
+    auto [M, N, K] = problem_shape;
+
+    bool group_passed = false;
+    if constexpr (ExactMode) {
+      group_passed = cutlass::reference::device::BlockCompareEqual(
+        block_D_ref.get() + offset_D[i], block_D.get() + offset_D[i], M * N);
+    }
+    else {
+      group_passed = cutlass::reference::device::BlockCompareRelativelyEqual(
+        block_D_ref.get() + offset_D[i], block_D.get() + offset_D[i], M * N, ElementD(options.tolerance), ElementD(options.nonzero_floor));
+    }
+    if (!group_passed && options.verbose) {
+      std::cout << "Group " << i << " failed" << std::endl;
+      print("D reference: "); print_device_tensor(make_tensor(block_D_ref.get() + offset_D[i], make_shape(M, N, 1), stride_D_host[i]));
+      print("D  computed: "); print_device_tensor(make_tensor(block_D.get()     + offset_D[i], make_shape(M, N, 1), stride_D_host[i]));
+    }
+    passed &= group_passed;
+  }
+  return passed;
+}
+
+bool run(Options &options, bool host_problem_shapes_available = true)
+{
+  // Apply some restrictions on Grouped GEMM options
+  for (int i = 0; i < options.groups; ++i) {
+    if (options.problem_sizes[i].m() != options.problem_sizes[0].m()) {
+      throw std::runtime_error("Variable M problem size is not supported by verification kernel");
+    }
+  }
+  if (options.beta != FLT_MAX && options.beta != 1.f && options.beta != 0.f) {
+    throw std::runtime_error("Specifying beta != 0/1 is not supported by verification kernel");
+  }
+
+  allocate(options);
+  initialize(options);
+
+  std::cout << "Groups      : " << options.groups  << std::endl;
+  std::cout << "Problem Sizes, Alpha, Beta " << std::endl;
+  for (int32_t i = 0; i < options.groups; ++i) {
+    std::cout << "  " << shape_string(make_tuple(options.problem_sizes[i].m(), options.problem_sizes[i].n(), options.problem_sizes[i].k()));
+    std::cout << ", " << alpha_host[i] << ", " << beta_host[i] << std::endl;
+  }
+  std::cout << "Data types: " << problem_desc_string<ElementA, ElementB, ElementAccumulator, ElementC, ElementD>() << std::endl;
+  std::cout << "Activation function: " << activation_func_string<ActivationFn>() << std::endl;
+  std::cout << "Kernel schedule: " << kernel_schedule_string<KernelSchedule>() << std::endl;
+  std::cout << "GEMM tile shape: " << shape_string(TileShape{}) << std::endl;
+  std::cout << "Epi tile shape: " << epilogue_tile_string(EpiTileShape{}) << std::endl;
+  std::cout << "Cluster shape: " << shape_string(ClusterShape{}) << std::endl;
+  std::cout << "Rasterization: " << options.raster_string() << " with a maximum CTA swizzle of " << options.swizzle << std::endl;
+  std::cout << "Options: Quantize = " << Quantize << ", Exact = " << ExactMode << ", BiasBroadcast = " << BiasBroadcast << std::endl;
+
+  Runner<Gemm> gemm(args_from_options<Gemm>(options, host_problem_shapes_available));
+  Runner<GemmRef> gemm_ref(args_from_options<GemmRef>(options, host_problem_shapes_available));
+
+  auto run_fused = [&](){ gemm.run(); };
+  auto run_ref_gemm = [&](){ gemm_ref.run(); };
+  auto run_activation = [&](){
+    do_activation<ActivationFn>(
+      block_D_ref.get(),
+      block_D_ref_gemm.get(),
+      Quantize ? block_scale.get() : static_cast<ElementScalar const*>(nullptr),
+      options.beta != 0.f ? block_C.get() : static_cast<ElementC const*>(nullptr),
+      BiasBroadcast,
+      offset_col_D.get(),
+      options.groups,
+      options.problem_sizes.at(0).m(), // all problems have same M
+      offset_col_D_host[options.groups],
+      false);
+  };
+  auto run_unfused = [&](){ run_ref_gemm(); run_activation(); };
+
+  run_fused();
+  CUDA_CHECK(cudaDeviceSynchronize());
+
+  // Correctness check
+  bool passed = true;
+  if (options.verify) {
+    run_unfused();
+    CUDA_CHECK(cudaDeviceSynchronize());
+
+    passed = verify(options);
+    std::cout << "Disposition: " << (passed ? "Passed" : "Failed") << std::endl;
+  }
+
+  if (options.iterations > 0)
+  {
+    auto benchmark = [&](auto name, auto func)
+    {
+      BenchmarkResult result = run_benchmark(func, options.warmup, options.iterations);
+      double avg_tflops = double(options.total_flops()) / result.avg_runtime_ms / 1e9; // FLOP/ms -> TFLOP/s
+      printf(options.csv ? "%s,%.3f,%.0f\n" : "%20s  %20.3f  %20.0f\n",
+             name, result.avg_runtime_ms, avg_tflops);
+    };
+    printf(options.csv ? "%s,%s,%s\n" : "%20s  %20s  %20s\n",
+           "Kernel", "Runtime (ms)", "Throughput (Tflop/s)");
+    benchmark("Fused", run_fused);
+    benchmark("Unfused", run_unfused);
+    benchmark("GEMM only", run_ref_gemm);
+  }
+
+  return passed;
+}
+
+#endif // defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
+
+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+int main(int argc, char const **args) {
+
+  // CUTLASS must be compiled with the CUDA 12.3 Toolkit or newer to run this example
+  if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 3)) {
+    std::cerr << "This example requires CUDA 12.3 or newer.\n";
+    return EXIT_SUCCESS;
+  }
+
+  try {
+    Options options(AlignmentM, AlignmentN, AlignmentK);
+    cutlass::CommandLine cmd(argc, args);
+    options.parse(cmd);
+
+    if (options.help) {
+      options.print_usage(std::cout) << std::endl;
+      return 0;
+    }
+
+    if (options.device >= 0) {
+      CUDA_CHECK(cudaSetDevice(options.device));
+    }
+    else {
+      CUDA_CHECK(cudaGetDevice(&options.device));
+    }
+
+    cudaDeviceProp props;
+    CUDA_CHECK(cudaGetDeviceProperties(&props, options.device));
+    if (props.major != 9 || props.minor != 0) {
+      std::cerr << "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n";
+      return EXIT_SUCCESS;
+    }
+
+#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
+    if (!run(options, false)) {
+      return EXIT_FAILURE;
+    }
+#endif
+  }
+  catch (std::exception const& e) {
+    std::cerr << e.what() << std::endl;
+    return EXIT_FAILURE;
+  }
+
+  return EXIT_SUCCESS;
+}
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
diff --git a/examples/113_hopper_gemm_activation_fusion/113_hopper_grouped_gemm_fused_gated_act.cu b/examples/113_hopper_gemm_activation_fusion/113_hopper_grouped_gemm_fused_gated_act.cu
new file mode 100644
index 000000000..f111acbd7
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/113_hopper_grouped_gemm_fused_gated_act.cu
@@ -0,0 +1,694 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+/*! \file
+    \brief Example of a Hopper (SM90) grouped GEMM with a fused gated activation epilogue.
+*/
+
+#include <iostream>
+#include <vector>
+#include <exception>
+#include <random>
+#include <cfloat>
+
+#include "cute/tensor.hpp"
+
+#include "cutlass/cutlass.h"
+#include "cutlass/epilogue/collective/collective_builder.hpp"
+#include "cutlass/epilogue/fusion/operations.hpp"
+#include "cutlass/epilogue/thread/activation.h"
+#include "cutlass/gemm/dispatch_policy.hpp"
+#include "cutlass/gemm/group_array_problem_shape.hpp"
+#include "cutlass/gemm/collective/collective_builder.hpp"
+#include "cutlass/gemm/kernel/gemm_universal.hpp"
+#include "cutlass/gemm/device/gemm_universal_adapter.h"
+
+#include "cutlass/util/command_line.h"
+#include "cutlass/util/device_memory.h"
+#include "cutlass/util/distribution.h"
+#include "cutlass/util/packed_stride.hpp"
+#include "cutlass/util/reference/device/tensor_compare.h"
+#include "cutlass/util/reference/device/tensor_fill.h"
+
+#include "helper.h"
+#include "options.hpp"
+#include "utils.hpp"
+#include "tile_scheduler_group.hpp"
+#include "gated_stride.hpp"
+#include "gated_builder.hpp"
+#include "activation_kernel.cuh"
+
+using namespace cute;
+
+#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM kernel configurations
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+#if 0
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::ReLu<T>;
+#elif 1
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::SiLu<T>;
+#else
+template<class T>
+using ActivationFn = cutlass::epilogue::thread::Identity<T>;
+#endif
+
+bool constexpr IsFp8         = true;  // whether to run with fp8 or fp16 input/output
+bool constexpr Quantize      = true;  // whether to quantize output with a per-tensor scale factor
+bool constexpr ExactMode     = false; // whether to reproduce unfused dual gemm+activation exactly
+bool constexpr BiasBroadcast = true;  // whether bias is broadcast along columns in each group
+bool constexpr Pingpong      = true;  // whether to use pingpong schedule
+
+using ProblemShape = Shape<int,int,int>; // <M,N,K> per group
+using GroupProblemShape = cutlass::gemm::GroupProblemShape<ProblemShape>;
+
+using GatedProblemShape = decltype(cutlass::sm90_make_gated_shape<0>(ProblemShape{}));
+using GatedGroupProblemShape = cutlass::gemm::GroupProblemShape<GatedProblemShape>;
+
+// A matrix configuration
+using         ElementA    = conditional_t<IsFp8, cutlass::float_e4m3_t, cutlass::half_t>; // Element type for A matrix operand
+using         LayoutA     = cutlass::layout::RowMajor;                                    // Layout type for A matrix operand
+constexpr int AlignmentA  = 128 / cutlass::sizeof_bits<ElementA>::value;                  // Alignment of A matrix in units of elements (up to 16 bytes)
+
+// B matrix configuration
+using         ElementB    = conditional_t<IsFp8, cutlass::float_e5m2_t, cutlass::half_t>; // Element type for B matrix operand
+using         LayoutB     = cutlass::layout::ColumnMajor;                                 // Layout type for B matrix operand
+constexpr int AlignmentB  = 128 / cutlass::sizeof_bits<ElementB>::value;                  // Alignment of B matrix in units of elements (up to 16 bytes)
+
+// C matrix configuration
+using         ElementC    = cutlass::half_t;                                              // Element type for C matrix operand
+using         LayoutC     = cutlass::layout::ColumnMajor;                                 // Layout type for C matrix operand
+constexpr int AlignmentC  = 128 / cutlass::sizeof_bits<ElementC>::value;                  // Alignment of C matrix in units of elements (up to 16 bytes)
+
+// D matrix configuration
+using         ElementD    = conditional_t<IsFp8, cutlass::float_e4m3_t, cutlass::half_t>; // Element type for D matrix operand
+using         LayoutD     = cutlass::layout::ColumnMajor;                                 // Layout type for D matrix operand
+constexpr int AlignmentD  = 128 / cutlass::sizeof_bits<ElementD>::value;                  // Alignment of D matrix in units of elements (up to 16 bytes)
+
+int constexpr AlignmentM = max(make_tuple(cutlass::gemm::detail::is_mn_major_A<LayoutA>() ? AlignmentA : 1,
+                                          cutlass::gemm::detail::is_m_major_C<LayoutC>()  ? AlignmentC : 1,
+                                          cutlass::gemm::detail::is_m_major_C<LayoutD>()  ? AlignmentD : 1));
+int constexpr AlignmentN = max(make_tuple(cutlass::gemm::detail::is_mn_major_B<LayoutB>() ? AlignmentB : 1,
+                                          cutlass::gemm::detail::is_n_major_C<LayoutC>()  ? AlignmentC : 1,
+                                          cutlass::gemm::detail::is_n_major_C<LayoutD>()  ? AlignmentD : 1));
+int constexpr AlignmentK = max(make_tuple(cutlass::gemm::detail::is_k_major_A<LayoutA>()  ? AlignmentA : 1,
+                                          cutlass::gemm::detail::is_k_major_B<LayoutB>()  ? AlignmentB : 1));
+
+// Core kernel configurations
+using ElementAccumulator  = float;                                           // Element type for internal accumulation
+using ElementCompute      = float;                                           // Element type for epilogue compute
+using ElementScalar       = float;                                           // Element type for scalar values (alpha, beta)
+using ElementIntermediate = cutlass::half_t;                                 // Element type of intermediate result between GEMM and bias+activation
+using ArchTag             = cutlass::arch::Sm90;                             // Tag indicating the minimum SM that supports the intended feature
+using OperatorClass       = cutlass::arch::OpClassTensorOp;                  // Operator class tag
+using EpiTileShape        = cutlass::epilogue::collective::EpilogueTileAuto; // Epilogue sub-tile shape
+using ClusterShape        = Shape<_1,_2,_1>;                                 // Cluster shape
+using TileShapeK          = Int<128 * 8 / sizeof_bits<ElementA>::value>;    // K tile extent: 128 bytes' worth of ElementA
+
+using KernelScheduleCooperative = conditional_t<cutlass::gemm::collective::detail::is_input_fp8<ElementA, ElementB>(),
+                                                cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperativeFP8FastAccum,
+                                                cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperative>;
+
+using KernelSchedulePingpong    = conditional_t<cutlass::gemm::collective::detail::is_input_fp8<ElementA, ElementB>(),
+                                                cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpongFP8FastAccum,
+                                                cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpong>;
+
+using KernelSchedule   = conditional_t<Pingpong, KernelSchedulePingpong, KernelScheduleCooperative>;
+using EpilogueSchedule = conditional_t<Pingpong, cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong, cutlass::epilogue::PtrArrayTmaWarpSpecializedCooperative>;
+using TileShape        = conditional_t<Pingpong, Shape<_128,_128,TileShapeK>, Shape<_128,_256,TileShapeK>>;
+
+// Gated GEMM setup
+
+using CollectiveEpilogue = typename cutlass::epilogue::collective::Sm90CollectiveBuilderGated<
+    OperatorClass,
+    TileShape, ClusterShape,
+    EpiTileShape,
+    ElementAccumulator, ElementCompute, ElementScalar,
+    conditional_t<ExactMode, ElementIntermediate, ElementCompute>,
+    ElementC, LayoutC *, AlignmentC,
+    ElementD, LayoutD *, AlignmentD,
+    EpilogueSchedule,
+    ActivationFn,
+    Quantize
+  >::CollectiveOp;
+
+using CollectiveMainloop = typename cutlass::gemm::collective::Sm90CollectiveBuilderGated<
+    OperatorClass,
+    ElementA, LayoutA *, AlignmentA,
+    ElementB, LayoutB *, AlignmentB,
+    ElementAccumulator,
+    TileShape, ClusterShape,
+    cutlass::gemm::collective::StageCountAutoCarveout<
+      static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
+    KernelSchedule
+  >::CollectiveOp;
+
+using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
+    GatedGroupProblemShape,
+    CollectiveMainloop,
+    CollectiveEpilogue,
+    GroupSchedulerTileShapeDependent
+>;
+
+using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
+
+// Reference GEMM setup
+
+using CollectiveEpilogueRef = typename cutlass::epilogue::collective::CollectiveBuilder<
+    ArchTag, OperatorClass,
+    TileShape, ClusterShape, EpiTileShape,
+    ElementAccumulator, ElementCompute,
+    void,     LayoutC *, AlignmentC,
+    ElementIntermediate, LayoutD *, AlignmentD,
+    EpilogueSchedule,
+    cutlass::epilogue::fusion::ScaledAcc<ElementIntermediate, ElementCompute, ElementScalar>
+  >::CollectiveOp;
+
+using CollectiveMainloopRef = typename cutlass::gemm::collective::CollectiveBuilder<
+    ArchTag, OperatorClass,
+    ElementA, LayoutA *, AlignmentA,
+    ElementB, LayoutB *, AlignmentB,
+    ElementAccumulator,
+    TileShape, ClusterShape,
+    cutlass::gemm::collective::StageCountAutoCarveout<
+      static_cast<int>(sizeof(typename CollectiveEpilogueRef::SharedStorage))>,
+    KernelSchedule
+  >::CollectiveOp;
+
+using GemmKernelRef = cutlass::gemm::kernel::GemmUniversal<
+    GroupProblemShape,
+    CollectiveMainloopRef,
+    CollectiveEpilogueRef
+>;
+
+using GemmRef = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelRef>;
+
+using StrideA = GemmKernel::InternalStrideA;
+using StrideB = GemmKernel::InternalStrideB;
+using StrideC = GemmKernel::InternalStrideC;
+using StrideD = remove_pointer_t<GemmKernel::CollectiveEpilogue::FusionCallbacks::Operation::GmemLayoutTagAux>;
+
+using StrideARef = GemmKernelRef::InternalStrideA;
+using StrideBRef = GemmKernelRef::InternalStrideB;
+using StrideCRef = GemmKernelRef::InternalStrideC;
+using StrideDRef = GemmKernelRef::InternalStrideD;
+
+// Host-side allocations
+std::vector<GatedProblemShape> problem_shapes_host;
+std::vector<ProblemShape> problem_shapes_ref_host;
+
+std::vector<int64_t> offset_A;
+std::vector<int64_t> offset_B;
+std::vector<int64_t> offset_C;
+std::vector<int64_t> offset_D;
+std::vector<int64_t> offset_D_ref;
+std::vector<int64_t> offset_col_D_host;
+
+std::vector<StrideA> stride_A_host;
+std::vector<StrideB> stride_B_host;
+std::vector<StrideC> stride_C_host;
+std::vector<StrideD> stride_D_host;
+
+std::vector<StrideARef> stride_A_ref_host;
+std::vector<StrideBRef> stride_B_ref_host;
+std::vector<StrideCRef> stride_C_ref_host;
+std::vector<StrideDRef> stride_D_ref_host;
+
+std::vector<ElementScalar> alpha_host;
+std::vector<ElementScalar> beta_host;
+std::vector<ElementScalar> scale_host;
+
+// Device-side allocations
+cutlass::DeviceAllocation<GatedProblemShape> problem_shapes;
+cutlass::DeviceAllocation<ProblemShape> problem_shapes_ref;
+
+cutlass::DeviceAllocation<ElementA> block_A;
+cutlass::DeviceAllocation<ElementB> block_B;
+cutlass::DeviceAllocation<ElementC> block_C;
+cutlass::DeviceAllocation<ElementD> block_D;
+cutlass::DeviceAllocation<ElementIntermediate> block_D_ref_gemm;
+cutlass::DeviceAllocation<ElementD> block_D_ref;
+
+cutlass::DeviceAllocation<ElementA const*> ptr_A;
+cutlass::DeviceAllocation<ElementB const*> ptr_B;
+cutlass::DeviceAllocation<ElementC const*> ptr_C;
+cutlass::DeviceAllocation<ElementD *> ptr_D;
+cutlass::DeviceAllocation<ElementIntermediate *> ptr_D_ref;
+cutlass::DeviceAllocation<int64_t> offset_col_D;
+
+cutlass::DeviceAllocation<StrideA> stride_A;
+cutlass::DeviceAllocation<StrideB> stride_B;
+cutlass::DeviceAllocation<StrideC> stride_C;
+cutlass::DeviceAllocation<StrideD> stride_D;
+
+cutlass::DeviceAllocation<StrideARef> stride_A_ref;
+cutlass::DeviceAllocation<StrideBRef> stride_B_ref;
+cutlass::DeviceAllocation<StrideCRef> stride_C_ref;
+cutlass::DeviceAllocation<StrideDRef> stride_D_ref;
+
+// Note: ptr_alpha and ptr_beta are arrays of per-group pointers to the alpha and beta scaling values
+cutlass::DeviceAllocation<ElementScalar const*> ptr_alpha;
+cutlass::DeviceAllocation<ElementScalar const*> ptr_beta;
+cutlass::DeviceAllocation<ElementScalar> block_alpha;
+cutlass::DeviceAllocation<ElementScalar> block_beta;
+cutlass::DeviceAllocation<ElementScalar> block_scale;
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM setup and evaluation
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+using Options = GroupedGemmOptions;
+
+/// Allocates device-side data
+void allocate(const Options &options) {
+  int64_t total_elements_A = 0;
+  int64_t total_elements_B = 0;
+  int64_t total_elements_C = 0;
+  int64_t total_elements_D = 0;
+  int64_t total_cols_D = 0;
+
+  for (int32_t i = 0; i < options.groups; ++i) {
+
+    cutlass::gemm::GemmCoord const& problem_size = options.problem_sizes[i];
+    auto problem_shape_ref = make_shape(problem_size.m(), problem_size.n(), problem_size.k());
+    auto [M, N, K] = problem_shape_ref;
+    auto NC = BiasBroadcast ? 1 : N;
+    
+    problem_shapes_host.push_back(cutlass::sm90_make_gated_shape<0>(make_shape(M, N, K)));
+    problem_shapes_ref_host.push_back(make_shape(M, N, K));
+
+    offset_A.push_back(total_elements_A);
+    offset_B.push_back(total_elements_B);
+    offset_C.push_back(total_elements_C);
+    offset_D.push_back(total_elements_D);
+    offset_D_ref.push_back(total_elements_D * 2);
+    offset_col_D_host.push_back(total_cols_D);
+
+    int64_t elements_A = int64_t(M) * K;   // widen before multiplying to avoid 32-bit overflow
+    int64_t elements_B = int64_t(K) * N;
+    int64_t elements_C = int64_t(M) * NC;
+    int64_t elements_D = int64_t(M/2) * N;
+
+    stride_A_host.push_back(cutlass::sm90_make_gated_packed_stride(StrideA{}, {M, K, 1}));
+    stride_B_host.push_back(cutlass::make_cute_packed_stride(StrideB{}, {N, K, 1}));
+    stride_C_host.push_back(cutlass::sm90_make_gated_packed_stride(StrideC{}, {M, NC, 1}));
+    stride_D_host.push_back(cutlass::sm90_make_gated_packed_stride(StrideD{}, {M/2, N, 1}));
+
+    stride_A_ref_host.push_back(cutlass::make_cute_packed_stride(StrideARef{}, {M, K, 1}));
+    stride_B_ref_host.push_back(cutlass::make_cute_packed_stride(StrideBRef{}, {N, K, 1}));
+    stride_C_ref_host.push_back(cutlass::make_cute_packed_stride(StrideCRef{}, {M, NC, 1}));
+    stride_D_ref_host.push_back(cutlass::make_cute_packed_stride(StrideDRef{}, {M, N, 1}));
+
+    if constexpr (BiasBroadcast) {
+      get<1>(stride_C_host.back()) = 0;
+      get<1>(stride_C_ref_host.back()) = 0;
+    }
+
+    total_elements_A += elements_A;
+    total_elements_B += elements_B;
+    total_elements_C += elements_C;
+    total_elements_D += elements_D;
+    total_cols_D += N;
+  }
+  offset_col_D_host.push_back(total_cols_D);
+
+  block_A.reset(total_elements_A);
+  block_B.reset(total_elements_B);
+  block_C.reset(total_elements_C);
+  block_D.reset(total_elements_D);
+  block_D_ref_gemm.reset(total_elements_D * 2);
+  block_D_ref.reset(total_elements_D);
+  block_alpha.reset(options.groups);
+  block_beta.reset(options.groups);
+  block_scale.reset(1);
+
+  problem_shapes.reset(options.groups);
+  problem_shapes_ref.reset(options.groups);
+
+  ptr_A.reset(options.groups);
+  ptr_B.reset(options.groups);
+  ptr_C.reset(options.groups);
+  ptr_D.reset(options.groups);
+  ptr_D_ref.reset(options.groups);
+  ptr_alpha.reset(options.groups);
+  ptr_beta.reset(options.groups);
+
+  stride_A.reset(options.groups);
+  stride_B.reset(options.groups);
+  stride_C.reset(options.groups);
+  stride_D.reset(options.groups);
+
+  stride_A_ref.reset(options.groups);
+  stride_B_ref.reset(options.groups);
+  stride_C_ref.reset(options.groups);
+  stride_D_ref.reset(options.groups);
+  offset_col_D.reset(options.groups + 1);
+}
+
+/// Initialize operands to be used in the GEMM and reference GEMM
+void initialize(const Options &options) {
+
+  //
+  // Assign pointers
+  //
+
+  std::vector<ElementA const*>      ptr_A_host(options.groups);
+  std::vector<ElementB const*>      ptr_B_host(options.groups);
+  std::vector<ElementC const*>      ptr_C_host(options.groups);
+  std::vector<ElementD*>            ptr_D_host(options.groups);
+  std::vector<ElementIntermediate*> ptr_D_ref_host(options.groups);
+  std::vector<ElementScalar const*> ptr_alpha_host(options.groups);
+  std::vector<ElementScalar const*> ptr_beta_host(options.groups);
+
+  std::mt19937 rng(2024);
+  std::uniform_real_distribution<ElementScalar> alpha_dist(0.5, 2.0);
+  std::uniform_real_distribution<ElementScalar> beta_dist(1.0, 1.0); // fixed at 1: the verification kernel supports only beta of 0 or 1
+  std::uniform_real_distribution<ElementScalar> scale_dist(0.5, 1.0);
+
+  for (int32_t i = 0; i < options.groups; ++i) {
+    ptr_A_host[i] = block_A.get() + offset_A[i];
+    ptr_B_host[i] = block_B.get() + offset_B[i];
+    ptr_C_host[i] = block_C.get() + offset_C[i];
+    ptr_D_host[i] = block_D.get() + offset_D[i];
+    ptr_D_ref_host[i] = block_D_ref_gemm.get() + offset_D_ref[i];
+    alpha_host.push_back(options.alpha == FLT_MAX ? alpha_dist(rng) : options.alpha);
+    beta_host.push_back(options.beta == FLT_MAX ? beta_dist(rng) : options.beta);
+    ptr_alpha_host[i] = block_alpha.get() + i;
+    ptr_beta_host[i] = block_beta.get() + i;
+  }
+  scale_host.push_back(scale_dist(rng));
+
+  problem_shapes.copy_from_host(problem_shapes_host.data());
+  problem_shapes_ref.copy_from_host(problem_shapes_ref_host.data());
+
+  ptr_A.copy_from_host(ptr_A_host.data());
+  ptr_B.copy_from_host(ptr_B_host.data());
+  ptr_C.copy_from_host(ptr_C_host.data());
+  ptr_D.copy_from_host(ptr_D_host.data());
+  ptr_D_ref.copy_from_host(ptr_D_ref_host.data());
+  ptr_alpha.copy_from_host(ptr_alpha_host.data());
+  ptr_beta.copy_from_host(ptr_beta_host.data());
+
+  stride_A.copy_from_host(stride_A_host.data());
+  stride_B.copy_from_host(stride_B_host.data());
+  stride_C.copy_from_host(stride_C_host.data());
+  stride_D.copy_from_host(stride_D_host.data());
+
+  stride_A_ref.copy_from_host(stride_A_ref_host.data());
+  stride_B_ref.copy_from_host(stride_B_ref_host.data());
+  stride_C_ref.copy_from_host(stride_C_ref_host.data());
+  stride_D_ref.copy_from_host(stride_D_ref_host.data());
+  offset_col_D.copy_from_host(offset_col_D_host.data());
+
+  block_alpha.copy_from_host(alpha_host.data());
+  block_beta.copy_from_host(beta_host.data());
+  block_scale.copy_from_host(scale_host.data());
+
+  cutlass::reference::device::BlockFillRandom(block_A.get(), block_A.size(), 2024, options.dist_a);
+  cutlass::reference::device::BlockFillRandom(block_B.get(), block_B.size(), 2025, options.dist_b);
+  cutlass::reference::device::BlockFillRandom(block_C.get(), block_C.size(), 2026, options.dist_c);
+}
+
+template <typename GemmT>
+auto args_from_options_common(const Options &options, bool host_problem_shapes_available)
+{
+  cutlass::KernelHardwareInfo hw_info;
+  hw_info.device_id = 0;
+  hw_info.sm_count = options.sm_count > 0 
+                   ? options.sm_count 
+                   : cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
+
+  typename GemmT::Arguments arguments;
+  decltype(arguments.epilogue.thread) fusion_args{};
+
+  if (options.alpha != FLT_MAX) {
+    fusion_args.alpha = options.alpha;
+  }
+  else {
+    fusion_args.alpha_ptr_array = ptr_alpha.get();
+    fusion_args.dAlpha = {{},{},1};
+  }
+
+  return make_tuple(fusion_args, hw_info);
+}
+
+template <typename GemmT>
+typename GemmT::Arguments
+args_from_options(const Options &options, bool host_problem_shapes_available);
+
+template <>
+Gemm::Arguments
+args_from_options<Gemm>(const Options &options, bool host_problem_shapes_available)
+{
+  auto [fusion_args, hw_info] = args_from_options_common<Gemm>(options, host_problem_shapes_available);
+  auto problem_shapes_host_ptr = host_problem_shapes_available ? problem_shapes_host.data() : nullptr;
+
+  fusion_args.beta = options.beta;
+  fusion_args.beta_ptr = block_beta.get();
+  fusion_args.scale_ptr = Quantize ? block_scale.get() : nullptr;
+  fusion_args.ptr_D = ptr_D.get();
+  fusion_args.dD = stride_D.get();
+  fusion_args.sm_count = hw_info.sm_count;
+
+  using RasterOrderOptions = cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90GroupParams<GatedProblemShape>::RasterOrderOptions;
+  RasterOrderOptions raster = [&] {
+    switch (options.raster) {
+      case Options::RasterOrderOptions::Heuristic: return RasterOrderOptions::Heuristic;
+      case Options::RasterOrderOptions::AlongM: return RasterOrderOptions::AlongM;
+      case Options::RasterOrderOptions::AlongN: return RasterOrderOptions::AlongN;
+      default: return RasterOrderOptions::Heuristic;
+    }
+  }();
+
+  Gemm::Arguments arguments{
+    cutlass::gemm::GemmUniversalMode::kGrouped,
+    {options.groups, problem_shapes.get(), problem_shapes_host_ptr},
+    {ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get()},
+    {fusion_args, ptr_C.get(), stride_C.get(), {}, {}},
+    hw_info,
+    {options.swizzle, raster}
+  };
+
+  return arguments;
+}
+
+template <>
+GemmRef::Arguments
+args_from_options<GemmRef>(const Options &options, bool host_problem_shapes_available)
+{
+  auto [fusion_args, hw_info] = args_from_options_common<GemmRef>(options, host_problem_shapes_available);
+  auto problem_shapes_host_ptr = host_problem_shapes_available ? problem_shapes_ref_host.data() : nullptr;
+
+  using RasterOrderOptions = cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90GroupParams<ProblemShape>::RasterOrderOptions;
+  RasterOrderOptions raster = [&] {
+    switch (options.raster) {
+      case Options::RasterOrderOptions::Heuristic: return RasterOrderOptions::Heuristic;
+      case Options::RasterOrderOptions::AlongM: return RasterOrderOptions::AlongM;
+      case Options::RasterOrderOptions::AlongN: return RasterOrderOptions::AlongN;
+      default: return RasterOrderOptions::Heuristic;
+    }
+  }();
+
+  GemmRef::Arguments arguments{
+    cutlass::gemm::GemmUniversalMode::kGrouped,
+    {options.groups, problem_shapes_ref.get(), problem_shapes_host_ptr},
+    {ptr_A.get(), stride_A_ref.get(), ptr_B.get(), stride_B_ref.get()},
+    {fusion_args, nullptr, stride_C_ref.get(), ptr_D_ref.get(), stride_D_ref.get()},
+    hw_info,
+    {options.swizzle, raster}
+  };
+
+  return arguments;
+}
+
+bool verify(const Options &options) {
+  bool passed = true;
+  for (int32_t i = 0; i < options.groups; ++i) {
+    auto const& problem_size = options.problem_sizes[i];
+    auto problem_shape = make_shape(problem_size.m(), problem_size.n(), problem_size.k());
+    auto [M, N, K] = problem_shape;
+
+    bool group_passed = false;
+    if constexpr (ExactMode) {
+      group_passed = cutlass::reference::device::BlockCompareEqual(
+        block_D_ref.get() + offset_D[i], block_D.get() + offset_D[i], M/2 * N);
+    }
+    else {
+      group_passed = cutlass::reference::device::BlockCompareRelativelyEqual(
+        block_D_ref.get() + offset_D[i], block_D.get() + offset_D[i], M/2 * N, ElementD(options.tolerance), ElementD(options.nonzero_floor));
+    }
+    if (!group_passed && options.verbose) {
+      std::cout << "Group " << i << " failed" << std::endl;
+      print("D reference: "); print_device_tensor(make_tensor(block_D_ref.get() + offset_D[i], make_shape(M/2, N, 1), GenColMajor{}));
+      print("D  computed: "); print_device_tensor(make_tensor(block_D.get()     + offset_D[i], make_shape(M/2, N, 1), GenColMajor{}));
+    }
+    passed &= group_passed;
+  }
+  return passed;
+}
+
+bool run(Options &options, bool host_problem_shapes_available = true)
+{
+  // Apply some restrictions on Grouped GEMM options
+  for (int i = 0; i < options.groups; ++i) {
+    if (options.problem_sizes[i].m() != options.problem_sizes[0].m()) {
+      throw std::runtime_error("Variable M problem size is not supported by verification kernel");
+    }
+  }
+  if (options.beta != FLT_MAX && options.beta != 1.f && options.beta != 0.f) {
+    throw std::runtime_error("Specifying beta != 0/1 is not supported by verification kernel");
+  }
+
+  allocate(options);
+  initialize(options);
+
+  std::cout << "Groups      : " << options.groups  << std::endl;
+  std::cout << "Problem Sizes, Alpha, Beta " << std::endl;
+  for (int32_t i = 0; i < options.groups; ++i) {
+    std::cout << "  " << shape_string(make_tuple(options.problem_sizes[i].m(), options.problem_sizes[i].n(), options.problem_sizes[i].k()));
+    std::cout << ", " << alpha_host[i] << ", " << beta_host[i] << std::endl;
+  }
+  std::cout << "Data types: " << problem_desc_string<ElementA, ElementB, ElementAccumulator, ElementC, ElementD>() << std::endl;
+  std::cout << "Activation function: " << activation_func_string<ActivationFn>() << std::endl;
+  std::cout << "Kernel schedule: " << kernel_schedule_string<KernelSchedule>() << std::endl;
+  std::cout << "GEMM tile shape: " << shape_string(TileShape{}) << std::endl;
+  std::cout << "Epi tile shape: " << epilogue_tile_string(EpiTileShape{}) << std::endl;
+  std::cout << "Cluster shape: " << shape_string(ClusterShape{}) << std::endl;
+  std::cout << "Rasterization: " << options.raster_string() << " with a maximum CTA swizzle of " << options.swizzle << std::endl;
+  std::cout << "Options: Quantize = " << Quantize << ", Exact = " << ExactMode << ", BiasBroadcast = " << BiasBroadcast << std::endl;
+
+  Runner<Gemm> gemm(args_from_options<Gemm>(options, host_problem_shapes_available));
+  Runner<GemmRef> gemm_ref(args_from_options<GemmRef>(options, host_problem_shapes_available));
+
+  auto run_fused = [&](){ gemm.run(); };
+  auto run_ref_gemm = [&](){ gemm_ref.run(); };
+  auto run_activation = [&](){ 
+    do_activation<ActivationFn>(
+      block_D_ref.get(),
+      block_D_ref_gemm.get(),
+      Quantize ? block_scale.get() : static_cast<ElementScalar const*>(nullptr),
+      options.beta != 0.f ? block_C.get() : static_cast<ElementC const*>(nullptr),
+      BiasBroadcast,
+      offset_col_D.get(),
+      options.groups,
+      options.problem_sizes.at(0).m() / 2, // all problems have same M
+      offset_col_D_host[options.groups],
+      true);
+  };
+  auto run_unfused = [&](){ run_ref_gemm(); run_activation(); };
+
+  run_fused();
+  CUDA_CHECK(cudaDeviceSynchronize());
+
+  // Correctness check
+  bool passed = true;
+  if (options.verify) {
+    run_unfused();
+    CUDA_CHECK(cudaDeviceSynchronize());
+
+    passed = verify(options);
+    std::cout << "Disposition: " << (passed ? "Passed" : "Failed") << std::endl;
+  }
+
+  if (options.iterations > 0)
+  {
+    auto benchmark = [&](auto name, auto func)
+    {
+      BenchmarkResult result = run_benchmark(func, options.warmup, options.iterations);
+      double avg_tflops = double(options.total_flops()) / result.avg_runtime_ms / 1e9; // FLOP/ms -> TFLOP/s
+      printf(options.csv ? "%s,%.3f,%.0f\n" : "%20s  %20.3f  %20.0f\n",
+             name, result.avg_runtime_ms, avg_tflops);
+    };
+    printf(options.csv ? "%s,%s,%s\n" : "%20s  %20s  %20s\n",
+           "Kernel", "Runtime (ms)", "Throughput (Tflop/s)");
+    benchmark("Fused", run_fused);
+    benchmark("Unfused", run_unfused);
+    benchmark("GEMM only", run_ref_gemm);
+  }
+
+  return passed;
+}
+
+#endif // defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
+
+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+int main(int argc, char const **args) {
+
+  // CUTLASS must be compiled with the CUDA 12.3 Toolkit or newer to run this example
+  if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 3)) {
+    std::cerr << "This example requires CUDA 12.3 or newer.\n";
+    return EXIT_SUCCESS;
+  }
+
+  try {
+    Options options(AlignmentM, AlignmentN, AlignmentK);
+    cutlass::CommandLine cmd(argc, args);
+    options.parse(cmd);
+
+    if (options.help) {
+      options.print_usage(std::cout) << std::endl;
+      return 0;
+    }
+
+    if (options.device >= 0) {
+      CUDA_CHECK(cudaSetDevice(options.device));
+    }
+    else {
+      CUDA_CHECK(cudaGetDevice(&options.device));
+    }
+
+    cudaDeviceProp props;
+    CUDA_CHECK(cudaGetDeviceProperties(&props, options.device));
+    if (props.major != 9 || props.minor != 0) {
+      std::cerr << "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 9.0).\n";
+      return EXIT_SUCCESS;
+    }
+
+#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
+    if (!run(options, false)) {
+      return EXIT_FAILURE;
+    }
+#endif
+  }
+  catch (std::exception const& e) {
+    std::cerr << e.what() << std::endl;
+    return EXIT_FAILURE;
+  }
+
+  return EXIT_SUCCESS;
+}
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
diff --git a/examples/113_hopper_gemm_activation_fusion/CMakeLists.txt b/examples/113_hopper_gemm_activation_fusion/CMakeLists.txt
new file mode 100644
index 000000000..e948fbf64
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/CMakeLists.txt
@@ -0,0 +1,57 @@
+# Copyright (c) 2023 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+
+cutlass_example_add_executable(
+  113_hopper_gemm_fused_act
+  113_hopper_gemm_fused_act.cu
+  )
+
+cutlass_example_add_executable(
+  113_hopper_gemm_fused_gated_act
+  113_hopper_gemm_fused_gated_act.cu
+  )
+
+cutlass_example_add_executable(
+  113_hopper_grouped_gemm_fused_act
+  113_hopper_grouped_gemm_fused_act.cu
+  )
+
+cutlass_example_add_executable(
+  113_hopper_grouped_gemm_fused_gated_act
+  113_hopper_grouped_gemm_fused_gated_act.cu
+  )
+
+add_custom_target(
+  113_hopper_gemm_activation_fusion
+  DEPENDS
+  113_hopper_gemm_fused_act
+  113_hopper_gemm_fused_gated_act
+  113_hopper_grouped_gemm_fused_act
+  113_hopper_grouped_gemm_fused_gated_act
+)
diff --git a/examples/113_hopper_gemm_activation_fusion/activation_kernel.cuh b/examples/113_hopper_gemm_activation_fusion/activation_kernel.cuh
new file mode 100644
index 000000000..a368d2b5f
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/activation_kernel.cuh
@@ -0,0 +1,220 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+#pragma once
+
+#include "cutlass/cutlass.h"
+#include "cutlass/array.h"
+#include "cutlass/numeric_conversion.h"
+
+template <class T, class U>
+CUTLASS_DEVICE
+constexpr static U
+array_convert(T const& input)
+{
+  using SrcType = typename T::Element;
+  using DstType = typename U::Element;
+  static_assert(T::kElements == U::kElements);
+  using Converter = cutlass::NumericArrayConverter<DstType, SrcType, T::kElements>;
+  return Converter{}(input);
+}
+
+template <class T>
+CUTLASS_DEVICE
+int64_t
+lower_bound(T const* values, size_t const size, T const target) {
+  T const* low = values;
+  T const* high = values + size;
+  while (low < high) {
+    T const* mid = low + (high - low) / 2;
+    if (*mid < target) {
+      low = mid + 1;
+    }
+    else {
+      high = mid;
+    }
+  }
+  return static_cast<int64_t>(low - values);
+}
+
+template <class T>
+CUTLASS_DEVICE
+T
+load_vec(T const& src) {
+  constexpr int B = cute::min(128, cute::sizeof_bits_v<T>);
+  constexpr int N = cute::sizeof_bits_v<T> / B;
+  using V = cute::uint_bit_t<B>;
+  V v[N];
+  V const* vptr = reinterpret_cast<V const*>(&src);
+
+  CUTLASS_PRAGMA_UNROLL
+  for (int i = 0; i < N; ++i) {
+    v[i] = vptr[i];
+  }
+  return *reinterpret_cast<T*>(&v);
+}
+
+template <class T>
+CUTLASS_DEVICE
+void
+store_vec(T& dst, T const& src) {
+  constexpr int B = cute::min(128, cute::sizeof_bits_v<T>);
+  constexpr int N = cute::sizeof_bits_v<T> / B;
+  using V = cute::uint_bit_t<B>;
+  V v[N];
+  V* vptr = reinterpret_cast<V*>(&dst);
+
+  *reinterpret_cast<T*>(&v) = src;
+  CUTLASS_PRAGMA_UNROLL
+  for (int i = 0; i < N; ++i) {
+    vptr[i] = v[i];
+  }
+}
+
+template <
+  int NumThreads,
+  template <class> class ActFn,
+  class ElementOutput,
+  class ElementInput,
+  class ElementBias,
+  class ElementCompute
+>
+CUTLASS_GLOBAL
+void
+activation_kernel(
+    ElementOutput* output,
+    ElementInput const* input,
+    ElementCompute const* scale_ptr,
+    ElementBias const* bias_ptr,
+    bool bias_is_broadcast,
+    int64_t const* group_col_offset,
+    int num_groups,
+    int64_t stride,
+    bool gated)
+{
+  int64_t const tid = threadIdx.x;
+  int64_t const col = blockIdx.x;
+  if (col >= group_col_offset[num_groups])
+  {
+    return;
+  }
+
+  size_t gated_size_mul = gated ? 2 : 1;
+  size_t gated_off = gated ? stride : 0;
+
+
+  input  = input  + col * stride * gated_size_mul;
+  output = output + col * stride;
+
+  float const quant_scale = scale_ptr ? *scale_ptr : 1.f;
+
+  if (bias_ptr) {
+    int64_t group = 0;
+    if (bias_is_broadcast) {
+      group = lower_bound(group_col_offset, num_groups, (int64_t) col + 1) - 1;
+    }
+    size_t bias_offset = (bias_is_broadcast ? group : col) * stride * gated_size_mul;
+    bias_ptr = bias_ptr + bias_offset;
+  }
+
+  // Vectorize all loads up to 128 bits
+  constexpr int64_t VecSize = 128 / cute::max(cutlass::sizeof_bits_v<ElementBias>,
+                                    cute::max(cutlass::sizeof_bits_v<ElementOutput>,
+                                              cutlass::sizeof_bits_v<ElementInput>));
+
+  using BiasElem    = cutlass::Array<ElementBias,    VecSize>;
+  using InputElem   = cutlass::Array<ElementInput,   VecSize>;
+  using OutputElem  = cutlass::Array<ElementOutput,  VecSize>;
+  using ComputeElem = cutlass::Array<ElementCompute, VecSize>;
+
+  auto input_vec    = reinterpret_cast<InputElem const*>(input);
+  auto output_vec   = reinterpret_cast<OutputElem*>(output);
+  auto bias_ptr_vec = reinterpret_cast<BiasElem const*>(bias_ptr);
+  
+  int64_t const num_elems_in_col = stride / VecSize;
+  int64_t const gated_off_vec = gated_off / VecSize;
+
+  ActFn<ComputeElem> fn{};
+  for (int64_t elem_index = tid; elem_index < num_elems_in_col; elem_index += NumThreads)
+  {
+    auto fc1_value = array_convert<InputElem, ComputeElem>(load_vec(input_vec[elem_index + gated_off_vec]));
+    if (bias_ptr_vec)
+    {
+      fc1_value = fc1_value + array_convert<BiasElem, ComputeElem>(load_vec(bias_ptr_vec[elem_index + gated_off_vec]));
+    }
+    auto gate_act = fn(fc1_value);
+
+    if (gated)
+    {
+      auto gate_mul = array_convert<InputElem, ComputeElem>(load_vec(input_vec[elem_index]));
+      if (bias_ptr_vec)
+      {
+        gate_mul = gate_mul + array_convert<BiasElem, ComputeElem>(load_vec(bias_ptr_vec[elem_index]));
+      }
+      gate_act = gate_act * gate_mul;
+    }
+
+    store_vec(output_vec[elem_index], array_convert<ComputeElem, OutputElem>(gate_act * quant_scale));
+  }
+}
+
+template <
+  template <class> class ActFn,
+  class ElementOutput,
+  class ElementInput,
+  class ElementBias,
+  class ElementCompute
+>
+void do_activation(
+    ElementOutput* output,
+    ElementInput const* input,
+    ElementCompute const* scale,
+    ElementBias const* bias,
+    bool bias_is_broadcast,
+    int64_t const* group_col_offset,
+    int num_groups,
+    int64_t stride,
+    int64_t num_tokens,
+    bool gated)
+{
+  int const blocks = num_tokens;
+  int constexpr threads = 256;
+  activation_kernel<threads, ActFn><<<blocks, threads>>>(
+    output,
+    input,
+    scale,
+    bias,
+    bias_is_broadcast,
+    group_col_offset,
+    num_groups,
+    stride,
+    gated);
+}
diff --git a/examples/113_hopper_gemm_activation_fusion/gated_builder.hpp b/examples/113_hopper_gemm_activation_fusion/gated_builder.hpp
new file mode 100644
index 000000000..03c0e40ab
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/gated_builder.hpp
@@ -0,0 +1,173 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+#pragma once
+
+// Temporary workaround for a circular include issue in CuTe
+#include "cute/atom/copy_atom.hpp"
+
+#include "cutlass/gemm/collective/collective_builder.hpp"
+#include "cutlass/epilogue/collective/collective_builder.hpp"
+
+#include "gated_stride.hpp"
+#include "sm90_visitor_gated_act.hpp"
+
+namespace cutlass::detail {
+
+template <int ModeIndex, class InputStride>
+using GatedStride = cute::conditional_t<
+  cutlass::detail::is_major<ModeIndex, InputStride>(),
+  decltype(replace<ModeIndex>(InputStride{}, cute::Stride<cute::_1,int64_t,cute::_8>{})),
+  decltype(replace<ModeIndex>(InputStride{}, cute::Stride< int64_t,int64_t, int64_t>{}))
+>;
+
+template <int ModeIndex, class InputStride>
+using GatedOutputStride = cute::conditional_t<
+  cutlass::detail::is_major<ModeIndex, InputStride>(),
+  decltype(replace<ModeIndex>(InputStride{}, cute::Stride<cute::_1,cute::_8>{})),
+  decltype(replace<ModeIndex>(InputStride{}, cute::Stride< int64_t, int64_t>{}))
+>;
+
+} // namespace cutlass::detail
+
+namespace cutlass::gemm::collective {
+
+template <
+  class OpClass,
+  class ElementA,
+  class GmemLayoutA,
+  int AlignmentA,
+  class ElementB,
+  class GmemLayoutB,
+  int AlignmentB,
+  class ElementAccumulator,
+  class TileShape_MNK_,
+  class ClusterShape_MNK,
+  class StageCountType,
+  class KernelScheduleType
+>
+struct Sm90CollectiveBuilderGated {
+
+  using TileShape_MNK = decltype(cutlass::sm90_make_gated_shape<0>(TileShape_MNK_{}));
+
+  using InternalStrideA = cute::remove_pointer_t<cutlass::gemm::TagToStrideA_t<GmemLayoutA>>;
+  using GatedInternalStrideA = cutlass::detail::GatedStride<0, InternalStrideA>;
+  using StrideA = cute::conditional_t<platform::is_pointer<GmemLayoutA>::value, GatedInternalStrideA *, GatedInternalStrideA>;
+
+  using StrideB = cutlass::gemm::TagToStrideB_t<GmemLayoutB>;
+  
+  using CollectiveOp = typename CollectiveBuilder<
+    arch::Sm90, OpClass,
+    ElementA, StrideA, AlignmentA,
+    ElementB, StrideB, AlignmentB,
+    ElementAccumulator,
+    TileShape_MNK, ClusterShape_MNK,
+    StageCountType, KernelScheduleType
+  >::CollectiveOp;
+};
+
+} // namespace cutlass::gemm::collective
+
+namespace cutlass::epilogue::collective {
+
+template <
+  class OpClass,
+  class TileShape_MNK,
+  class ClusterShape_MNK,
+  class EpilogueTileType,
+  class ElementAccumulator,
+  class ElementCompute,
+  class ElementScalar,
+  class ElementIntermediate,
+  class ElementC,
+  class GmemLayoutTagC,
+  int AlignmentC,
+  class ElementD,
+  class GmemLayoutTagD,
+  int AlignmentD,
+  class EpilogueSchedule,
+  template <class> class ActivationFn,
+  bool Quantize
+>
+struct Sm90CollectiveBuilderGated {
+
+  static constexpr bool IsPtrArray = platform::is_pointer<GmemLayoutTagD>::value;
+
+  using EpilogueTile_MN =
+    decltype(detail::sm90_compute_tile_shape_or_override<ElementD, EpilogueTileType, EpilogueSchedule, TileShape_MNK>());
+
+  // Factor out an 8x2 sub-tile in TileM
+  using GatedTileShape_MNK = decltype(cutlass::sm90_make_gated_shape<0>(TileShape_MNK{}));
+  using GatedEpilogueTile_MN = decltype(cutlass::sm90_make_gated_shape<0>(EpilogueTile_MN{}));
+
+  using InternalStrideC = cute::remove_pointer_t<cutlass::gemm::TagToStrideC_t<GmemLayoutTagC>>;
+  using GatedInternalStrideC = cutlass::detail::GatedStride<0, InternalStrideC>;
+  using StrideC = cute::conditional_t<IsPtrArray, GatedInternalStrideC *, GatedInternalStrideC>;
+
+  using InternalStrideD = cute::remove_pointer_t<cutlass::gemm::TagToStrideC_t<GmemLayoutTagD>>;
+  using GatedInternalStrideD = cutlass::detail::GatedStride<0, InternalStrideD>;
+  using StrideD = cute::conditional_t<IsPtrArray, GatedInternalStrideD *, GatedInternalStrideD>;
+
+  // The gated kernel writes its result through the Aux output instead of D because the output tensor shape differs
+  using InternalStrideDAux = cutlass::detail::GatedOutputStride<0, InternalStrideD>;
+  using StrideDAux = cute::conditional_t<IsPtrArray, InternalStrideDAux *, InternalStrideDAux>;
+
+  using FusionOp = cutlass::epilogue::fusion::LinCombGatedActFunc<
+    Quantize,            // Quantize
+    ActivationFn,        // ActivationFn
+    StrideDAux,          // GmemLayoutTagOutput
+    ElementD,            // ElementOutput
+    ElementCompute,      // ElementCompute
+    ElementC,            // ElementSource
+    ElementScalar,       // ElementScalar
+    ElementIntermediate, // ElementIntermediate
+    AlignmentD           // Alignment
+  >;
+
+  using CollectiveOp = typename CollectiveBuilder<
+    arch::Sm90,
+    OpClass,
+    GatedTileShape_MNK,
+    ClusterShape_MNK,
+    GatedEpilogueTile_MN,
+    ElementAccumulator,
+    ElementCompute,
+    ElementC,
+    StrideC,
+    AlignmentC,
+    void, // output through AuxStore
+    StrideD,
+    AlignmentD,
+    EpilogueSchedule,
+    FusionOp
+  >::CollectiveOp;
+};
+
+} // namespace cutlass::epilogue::collective 
diff --git a/examples/113_hopper_gemm_activation_fusion/gated_stride.hpp b/examples/113_hopper_gemm_activation_fusion/gated_stride.hpp
new file mode 100644
index 000000000..e3f1600d9
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/gated_stride.hpp
@@ -0,0 +1,157 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+#pragma once
+
+#include "cute/layout.hpp"
+#include "cute/algorithm/tuple_algorithms.hpp"
+#include "cutlass/detail/layout.hpp"
+
+/**
+ * Convenience functions for computing input/output shapes and strides for the SM90 gated activation kernel
+ */
+
+namespace cutlass {
+
+template <int ModeIndex, class InputShape>
+CUTLASS_HOST_DEVICE
+auto
+sm90_make_gated_shape(InputShape const& shape) {
+  using namespace cute;
+  using Tiler = Shape<_8,_2>;
+  return replace<ModeIndex>(shape, append(Tiler{}, shape_div(get<ModeIndex>(shape), Tiler{})));
+}
+
+template <int ModeIndex, class InputShape>
+CUTLASS_HOST_DEVICE
+auto
+sm90_make_gated_output_shape(InputShape const& shape) {
+  using namespace cute;
+  using Tiler = Shape<_8>;
+  return replace<ModeIndex>(shape, append(Tiler{}, shape_div(get<ModeIndex>(shape), Tiler{})));
+}
+
+// K-major gated gemm stride
+template <class StrideIntT>
+CUTLASS_HOST_DEVICE
+cute::Stride<cute::Stride<StrideIntT,StrideIntT,StrideIntT>, cute::Int<1>, StrideIntT>
+sm90_make_gated_packed_stride(cute::Stride<cute::Stride<StrideIntT,StrideIntT,StrideIntT>, cute::Int<1>, StrideIntT>, cute::Shape<int,int,int> shape_MKL) {
+  using namespace cute;
+  static_assert(std::is_integral_v<StrideIntT>, "Stride must have an integral type so it can be set dynamically. Static strides not supported.");
+  auto shape = sm90_make_gated_shape<0>(cute::transform(shape_MKL, [](auto s){ return static_cast<StrideIntT>(s); }));
+  auto stride = compact_order(shape, Step<Step<_1,_3,_2>,_0,_4>{});
+  return stride;
+}
+
+// K-major gated gemm output stride
+template <class StrideIntT>
+CUTLASS_HOST_DEVICE
+cute::Stride<cute::Stride<StrideIntT,StrideIntT>, cute::Int<1>, StrideIntT>
+sm90_make_gated_packed_stride(cute::Stride<cute::Stride<StrideIntT,StrideIntT>, cute::Int<1>, StrideIntT>, cute::Shape<int,int,int> shape_MKL) {
+  using namespace cute;
+  static_assert(std::is_integral_v<StrideIntT>, "Stride must have an integral type so it can be set dynamically. Static strides not supported.");
+  auto shape = sm90_make_gated_output_shape<0>(cute::transform(shape_MKL, [](auto s){ return static_cast<StrideIntT>(s); }));
+  auto stride = compact_order(shape, Step<Step<_1,_2>,_0,_3>{});
+  return stride;
+}
+
+// K-major gated grouped gemm stride
+template <class StrideIntT>
+CUTLASS_HOST_DEVICE
+cute::Stride<cute::Stride<StrideIntT,StrideIntT,StrideIntT>, cute::Int<1>, cute::Int<0>>
+sm90_make_gated_packed_stride(cute::Stride<cute::Stride<StrideIntT,StrideIntT,StrideIntT>, cute::Int<1>, cute::Int<0>>, cute::Shape<int,int,int> shape_MKL) {
+  using namespace cute;
+  static_assert(std::is_integral_v<StrideIntT>, "Stride must have an integral type so it can be set dynamically. Static strides not supported.");
+  auto shape = sm90_make_gated_shape<0>(cute::transform(shape_MKL, [](auto s){ return static_cast<StrideIntT>(s); }));
+  auto stride = append(compact_order(take<0,2>(shape), Step<Step<_1,_3,_2>,_0>{}), Int<0>{});
+  return stride;
+}
+
+// K-major gated grouped gemm output stride
+template <class StrideIntT>
+CUTLASS_HOST_DEVICE
+cute::Stride<cute::Stride<StrideIntT,StrideIntT>, cute::Int<1>, cute::Int<0>>
+sm90_make_gated_packed_stride(cute::Stride<cute::Stride<StrideIntT,StrideIntT>, cute::Int<1>, cute::Int<0>>, cute::Shape<int,int,int> shape_MKL) {
+  using namespace cute;
+  static_assert(std::is_integral_v<StrideIntT>, "Stride must have an integral type so it can be set dynamically. Static strides not supported.");
+  auto shape = sm90_make_gated_output_shape<0>(cute::transform(shape_MKL, [](auto s){ return static_cast<StrideIntT>(s); }));
+  auto stride = append(compact_order(take<0,2>(shape), Step<Step<_1,_2>,_0>{}), Int<0>{});
+  return stride;
+}
+
+// MN-major gated gemm stride
+template <class StrideIntT>
+CUTLASS_HOST_DEVICE
+cute::Stride<cute::Stride<cute::Int<1>,StrideIntT,cute::Int<8>>, StrideIntT, StrideIntT>
+sm90_make_gated_packed_stride(cute::Stride<cute::Stride<cute::Int<1>,StrideIntT,cute::Int<8>>, StrideIntT, StrideIntT>, cute::Shape<int,int,int> shape_MKL) {
+  using namespace cute;
+  static_assert(std::is_integral_v<StrideIntT>, "Stride must have an integral type so it can be set dynamically. Static strides not supported.");
+  auto shape = sm90_make_gated_shape<0>(cute::transform(shape_MKL, [](auto s){ return static_cast<StrideIntT>(s); }));
+  auto stride = compact_order(shape, Step<Step<_0,_2,_1>,_3,_4>{});
+  return stride;
+}
+
+// MN-major gated gemm output stride
+template <class StrideIntT>
+CUTLASS_HOST_DEVICE
+cute::Stride<cute::Stride<cute::Int<1>,cute::Int<8>>, StrideIntT, StrideIntT>
+sm90_make_gated_packed_stride(cute::Stride<cute::Stride<cute::Int<1>,cute::Int<8>>, StrideIntT, StrideIntT>, cute::Shape<int,int,int> shape_MKL) {
+  using namespace cute;
+  static_assert(std::is_integral_v<StrideIntT>, "Stride must have an integral type so it can be set dynamically. Static strides not supported.");
+  auto shape = sm90_make_gated_output_shape<0>(cute::transform(shape_MKL, [](auto s){ return static_cast<StrideIntT>(s); }));
+  auto stride = compact_order(shape, Step<Step<_0,_1>,_2,_3>{});
+  return stride;
+}
+
+// MN-major gated grouped gemm stride
+template <class StrideIntT>
+CUTLASS_HOST_DEVICE
+cute::Stride<cute::Stride<cute::Int<1>,StrideIntT,cute::Int<8>>, StrideIntT, cute::Int<0>>
+sm90_make_gated_packed_stride(cute::Stride<cute::Stride<cute::Int<1>,StrideIntT,cute::Int<8>>, StrideIntT, cute::Int<0>>, cute::Shape<int,int,int> shape_MKL) {
+  using namespace cute;
+  static_assert(std::is_integral_v<StrideIntT>, "Stride must have an integral type so it can be set dynamically. Static strides not supported.");
+  auto shape = sm90_make_gated_shape<0>(cute::transform(shape_MKL, [](auto s){ return static_cast<StrideIntT>(s); }));
+  auto stride = append(compact_order(take<0,2>(shape), Step<Step<_0,_2,_1>,_3>{}), Int<0>{});
+  return stride;
+}
+
+// MN-major gated grouped gemm output stride
+template <class StrideIntT>
+CUTLASS_HOST_DEVICE
+cute::Stride<cute::Stride<cute::Int<1>,cute::Int<8>>, StrideIntT, cute::Int<0>>
+sm90_make_gated_packed_stride(cute::Stride<cute::Stride<cute::Int<1>,cute::Int<8>>, StrideIntT, cute::Int<0>>, cute::Shape<int,int,int> shape_MKL) {
+  using namespace cute;
+  static_assert(std::is_integral_v<StrideIntT>, "Stride must have an integral type so it can be set dynamically. Static strides not supported.");
+  auto shape = sm90_make_gated_output_shape<0>(cute::transform(shape_MKL, [](auto s){ return static_cast<StrideIntT>(s); }));
+  auto stride = append(compact_order(take<0,2>(shape), Step<Step<_0,_1>,_2>{}), Int<0>{});
+  return stride;
+}
+
+} // namespace cutlass
diff --git a/examples/113_hopper_gemm_activation_fusion/options.hpp b/examples/113_hopper_gemm_activation_fusion/options.hpp
new file mode 100644
index 000000000..6679a1d18
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/options.hpp
@@ -0,0 +1,380 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+#pragma once
+
+#include <vector>
+#include <string>
+#include <sstream>
+#include <numeric>
+#include <stdexcept>
+#include <algorithm>
+#include <cfloat>
+
+#include "cutlass/util/command_line.h"
+#include "cutlass/util/distribution.h"
+#include "cutlass/gemm_coord.h"
+#include "cutlass/gemm/kernel/tile_scheduler_params.h"
+
+#define OPTIONS_ERROR(...)            \
+do {                                  \
+  std::stringstream ss;               \
+  ss << __VA_ARGS__;                  \
+  throw std::runtime_error(ss.str()); \
+}                                     \
+while (false)
+
+// Command line options parsing
+template <class RasterOrderOptions_>
+struct GemmOptionsBase {
+
+  using RasterOrderOptions = RasterOrderOptions_;
+
+  bool help = false;
+  int iterations = 100;
+  int warmup = 100;
+  RasterOrderOptions raster = RasterOrderOptions::Heuristic;
+  int swizzle = 1;
+  int sm_count = 0;
+  int device = -1;
+  float tolerance = 2e-1f;
+  float nonzero_floor = 3e-1f;
+  bool verify = true;
+  bool verbose = false;
+  bool csv = false;
+  std::string program_path;
+
+  cutlass::Distribution dist_a;
+  cutlass::Distribution dist_b;
+  cutlass::Distribution dist_c;
+
+  std::string raster_string() const {
+    switch (raster) {
+      case RasterOrderOptions::Heuristic: return "Heuristic";
+      case RasterOrderOptions::AlongM: return "AlongM";
+      case RasterOrderOptions::AlongN: return "AlongN";
+    }
+    return "Unknown";
+  }
+
+  static cutlass::Distribution
+  get_distribution(
+      cutlass::CommandLine const& cmd,
+      char const* arg_name) {
+
+    struct {
+      const char *label;
+      cutlass::Distribution::Kind kind;
+    } distribution_kinds[] = {
+      {"uniform", cutlass::Distribution::Uniform},
+      {"gaussian", cutlass::Distribution::Gaussian},
+      {"sequential", cutlass::Distribution::Sequential},
+      {0, cutlass::Distribution::Invalid}
+    };
+
+    cutlass::Distribution dist;
+
+    struct {
+      char const *label;
+      double *member;
+    } members[] = {
+      {"min", &dist.uniform.min},
+      {"max", &dist.uniform.max},
+      {"mean", &dist.gaussian.mean},
+      {"stddev", &dist.gaussian.stddev},
+      {"pnzA", &dist.gaussian.pnzA},
+      {"pnzB", &dist.gaussian.pnzB},
+      {"pnzC", &dist.gaussian.pnzC},
+      {"start", &dist.sequential.start},
+      {"delta", &dist.sequential.delta},
+      {0, 0}
+    };
+
+    using KeyValueVector = std::vector<std::pair<std::string, std::string>>;
+
+    KeyValueVector values;
+    cmd.get_cmd_line_argument_pairs(arg_name, values);
+
+    // The parser expects the first token to be a string identifying the distribution type.
+    auto it = values.begin();
+    if (it != values.end()) {
+      for (int i = 0; distribution_kinds[i].label; ++i) {
+        if (it->first == distribution_kinds[i].label) {
+          dist.kind = distribution_kinds[i].kind;
+          break;
+        }
+      }
+      ++it;
+    }
+
+    // Default initialization
+    switch (dist.kind) {
+      case cutlass::Distribution::Uniform:
+        dist.set_uniform(-1/*min*/, 1/*max*/, -1/*int_scale*/);
+        break;
+      case cutlass::Distribution::Gaussian:
+        dist.set_gaussian(0/*mean*/, 1/*stddev*/, -1/*int_scale*/);
+        break;
+      case cutlass::Distribution::Sequential:
+        dist.set_sequential(0/*start*/, 1/*delta*/, -1/*int_scale*/);
+        break;
+      default:
+        dist.set_gaussian(0/*mean*/, 1/*stddev*/, -1/*int_scale*/);
+        return dist;
+    }
+
+    // Subsequent key-value pairs update the named field of the distribution struct.
+    for (; it != values.end(); ++it) {
+      // Integer scaling factor - if < 0, no integer rounding is performed.
+      if ((it->first.compare("scale") == 0) && !it->second.empty()) {
+        std::stringstream ss;
+        ss << it->second;
+        ss >> dist.int_scale;
+        continue;  // next token
+      }
+
+      // Casts as integer without scaling
+      if (it->first.compare("integer") == 0) {
+        dist.int_scale = 0;
+        continue;  // next token
+      }
+
+      // initialize other members
+      for (int m = 0; members[m].label; ++m) {
+        if (it->first == members[m].label && !it->second.empty()) {
+          std::stringstream ss;
+          ss << it->second;
+          ss >> *(members[m].member);
+        }
+      }
+    }
+
+    return dist;
+  }
+
+  // Parses the command line
+  void parse(cutlass::CommandLine const& cmd) {
+    program_path = cmd.program_path;
+
+    cmd.get_cmd_line_argument("help", help);
+    cmd.get_cmd_line_argument("iterations", iterations);
+    cmd.get_cmd_line_argument("warmup", warmup);
+    cmd.get_cmd_line_argument("sms", sm_count);
+    cmd.get_cmd_line_argument("device", device);
+    cmd.get_cmd_line_argument("tolerance", tolerance);
+    cmd.get_cmd_line_argument("nonzero_floor", nonzero_floor);
+    cmd.get_cmd_line_argument("verify", verify);
+    cmd.get_cmd_line_argument("verbose", verbose);
+    cmd.get_cmd_line_argument("csv", csv);
+
+    char raster_char = 'H';
+    cmd.get_cmd_line_argument("raster", raster_char);
+
+    if (std::toupper(raster_char) == 'N') {
+      raster = RasterOrderOptions::AlongN;
+    }
+    else if (std::toupper(raster_char) == 'M') {
+      raster = RasterOrderOptions::AlongM;
+    }
+    else if (std::toupper(raster_char) == 'H') {
+      raster = RasterOrderOptions::Heuristic;
+    }
+    else {
+      OPTIONS_ERROR("Invalid raster order: " << raster_char);
+    }
+
+    cmd.get_cmd_line_argument("swizzle", swizzle, 1);
+
+    dist_a = get_distribution(cmd, "adist");
+    dist_b = get_distribution(cmd, "bdist");
+    dist_c = get_distribution(cmd, "cdist");
+  }
+};
+
+// Command line options for the grouped GEMM example
+struct GroupedGemmOptions : GemmOptionsBase<typename cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90GroupParams<int>::RasterOrderOptions> {
+
+  using Base = GemmOptionsBase<typename cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90GroupParams<int>::RasterOrderOptions>;
+
+  float alpha = FLT_MAX, beta = FLT_MAX;
+  int groups = 10;
+  std::vector<cutlass::gemm::GemmCoord> problem_sizes;
+
+  int align_m = 1;
+  int align_n = 1;
+  int align_k = 1;
+
+  GroupedGemmOptions(
+      int align_m = 1,
+      int align_n = 1,
+      int align_k = 1) 
+    : Base(),
+      align_m(align_m), 
+      align_n(align_n),
+      align_k(align_k) {}
+
+  // Parses the command line
+  void parse(cutlass::CommandLine const& cmd) {
+    Base::parse(cmd);
+    cmd.get_cmd_line_argument("alpha", alpha);
+    cmd.get_cmd_line_argument("beta", beta);
+    cmd.get_cmd_line_argument("groups", groups);
+    randomize_problems(cmd);
+  }
+
+  template <class T>
+  static T 
+  read_value(std::string const& s, T default_ = {}) {
+    std::istringstream ss(s);
+    T val;
+    ss >> val;
+    if (ss.fail()) {
+      val = default_;
+    }
+    return val;
+  }
+
+  // Read from command line a comma-separated list of ranges of the form <min>(:<max>)[,...].
+  // If only <min> is specified, set <max>=<min>.
+  // If arg_name is not present on the command line, use default_ as the only value.
+  // If num_ranges >= 0, returns exactly num_ranges ranges, truncating the list or extending it
+  // by repeating the last value, if necessary.
+  template <class T>
+  static std::vector<std::pair<T,T>> 
+  get_int_ranges(
+      cutlass::CommandLine const& cmd,
+      char const* arg_name,
+      std::pair<T,T> default_,
+      int num_ranges = -1) {
+    std::vector<std::pair<std::string,std::string>> input;
+    cmd.get_cmd_line_argument_pairs(arg_name, input);
+
+    std::vector<std::pair<T,T>> result;
+    std::transform(input.begin(), input.end(), std::back_inserter(result),
+      [](auto const& range_str) {
+        T minval = read_value<T>(range_str.first);
+        T maxval = read_value<T>(range_str.second, minval);
+        return std::make_pair(minval, maxval);
+      });
+
+    if (result.empty()) {
+      result.push_back(default_);
+    }
+
+    if (num_ranges >= 0) {
+      auto last = result.back();
+      for (int i = static_cast<int>(result.size()); i < num_ranges; ++i) {
+        result.push_back(last);
+      }
+      result.resize(num_ranges);
+    }
+
+    return result;
+  }
+
+  void randomize_problems(cutlass::CommandLine const& cmd) {
+
+    auto m_ranges = get_int_ranges<int>(cmd, "m", {10240, 10240}, groups); // Fixed "inter_size" in MoE
+    auto n_ranges = get_int_ranges<int>(cmd, "n", { 1024,  2048}, groups); // Variable "token per expert" dimension in MoE, should always vary to test correctness
+    auto k_ranges = get_int_ranges<int>(cmd, "k", { 8192,  8192}, groups); // Fixed "hidden_dim" in MoE
+
+    auto random_size = [](auto vmin, auto vmax, auto align, int group, char const * name) {
+      auto avmin = (vmin + align - 1) / align;
+      auto avmax = vmax / align;
+      if (avmax - avmin < 0) {
+        OPTIONS_ERROR("Group " << group << ": problem size " << name << " range=[" << vmin << "," << vmax << "], must contain at least one multiple of " << align);
+      }
+      return align * ((rand() % (avmax - avmin + 1)) + avmin);
+    };
+
+    auto check_size = [](auto value, auto align, int group, char const * name) {
+      if (value <= 0) {
+        OPTIONS_ERROR("Group " << group << ": problem size " << name << "=" << value << ", must be positive");
+      }
+      if (value % align != 0) {
+        OPTIONS_ERROR("Group " << group << ": problem size " << name << "=" << value << ", must be a multiple of " << align);
+      }
+    };
+    
+    problem_sizes.reserve(groups);
+    for (int i = 0; i < groups; ++i) {
+      int M = random_size(m_ranges[i].first, m_ranges[i].second, align_m, i, "M");
+      int N = random_size(n_ranges[i].first, n_ranges[i].second, align_n, i, "N");
+      int K = random_size(k_ranges[i].first, k_ranges[i].second, align_k, i, "K");
+      check_size(M, align_m, i, "M");
+      check_size(N, align_n, i, "N");
+      check_size(K, align_k, i, "K");
+      problem_sizes.emplace_back(M, N, K);
+    }
+  }
+
+  /// Prints the usage statement.
+  std::ostream & print_usage(std::ostream &out) const {
+
+    out << program_path << "\n"
+           "\n"
+           "  Hopper Grouped Dual GEMM with fused activation.\n"
+           "\n"
+           "Options:\n"
+           "\n"
+           "  --help                      Display this usage statement\n"
+           "  --m=<int>(:<int>)[,...]     Set the M range of the GEMM for each group (last range used for remaining groups)\n"
+           "  --n=<int>(:<int>)[,...]     Set the N range of the GEMM for each group (last range used for remaining groups)\n"
+           "  --k=<int>(:<int>)[,...]     Set the K range of the GEMM for each group (last range used for remaining groups)\n"
+           "  --groups=<int>              Set the number of individual GEMM problems for Grouped GEMM\n"
+           "  --alpha=<f32>               Epilogue scalar alpha\n"
+           "  --beta=<f32>                Epilogue scalar beta\n"
+           "  --raster=<char>             CTA Rasterization direction (N for along N, M for along M, and H for heuristic)\n"
+           "  --swizzle=<int>             CTA Rasterization swizzle\n"
+           "  --warmup=<int>              Number of warmup iterations to perform\n"
+           "  --iterations=<int>          Number of profiling iterations to perform\n"
+           "  --verify=<bool>             Verification (correctness check) on/off\n"
+           "  --verbose                   Verbose mode (output detailed verification result)\n"
+           "  --sms=<int>                 Number of SMs to run the GEMMs on\n"
+           "  --device=<int>              Device index\n"
+           "\n"
+           "Any problem size range can be specified as a pair <min>:<max> or a single integer (<min>=<max>) for a fixed size.\n"
+           "\n"
+           "Example:\n"
+        << program_path << " --m=5120 --n=1024:2048 --k=4096 --groups=10 --alpha=1 --beta=0\n";
+
+    return out;
+  }
+
+  /// Compute number of flops
+  uint64_t total_flops() const {
+    // Two flops per multiply-add
+    return 2 * std::accumulate(problem_sizes.begin(), problem_sizes.end(), 0ULL,
+                               [](uint64_t acc, auto p){ return acc + p.product(); });
+  }
+};
+
+#undef OPTIONS_ERROR
diff --git a/examples/113_hopper_gemm_activation_fusion/sm90_lin_comb_elt_act_scaled.hpp b/examples/113_hopper_gemm_activation_fusion/sm90_lin_comb_elt_act_scaled.hpp
new file mode 100644
index 000000000..6cd9a0852
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/sm90_lin_comb_elt_act_scaled.hpp
@@ -0,0 +1,298 @@
+/***************************************************************************************************
+ * Copyright (c) 2024 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+/*! \file
+  \brief Fusion callbacks for a linear combination with elementwise activation and optional output scaling
+*/
+
+#pragma once
+
+#include "cutlass/cutlass.h"
+#include "cutlass/workspace.h"
+
+#include "cute/tensor.hpp"
+
+#include "cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp"       // Sm90EVT
+#include "cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp"    // Sm90ScalarBroadcastPtrArray
+#include "cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp"   // Sm90AuxArrayStore
+#include "cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp" // Sm90Compute
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+namespace cutlass::epilogue::fusion {
+
+template<
+  bool PtrArray,
+  class Element,
+  class Stride>
+using Sm90ScalarBroadcastSelector = cute::conditional_t<PtrArray,
+  Sm90ScalarBroadcastPtrArray<Element, Stride>,
+  Sm90ScalarBroadcast<Element, Stride>
+>;
+
+// D = scale * activation(alpha * acc + beta * C) if DoScale, otherwise D = activation(alpha * acc + beta * C)
+template<
+  bool DoScale,
+  template <class> class ActivationFn_,
+  class ElementOutput,
+  class ElementCompute,
+  class ElementSource = ElementOutput,
+  class ElementScalar = ElementCompute,
+  class ElementIntermediate = ElementOutput,
+  FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest
+>
+struct AccCastLinCombEltActScale
+    : LinearCombination<ElementOutput, ElementCompute, ElementSource, ElementScalar, RoundStyle> {
+  using ActivationFn = ActivationFn_<ElementCompute>;
+  static constexpr bool IsEltActSupported = true;
+};
+
+// D = activation(alpha * acc + beta * C)
+template<
+  bool PtrArray,
+  template <class> class ActivationFn,
+  class ElementOutput,
+  class ElementCompute,
+  class ElementSource = ElementOutput,
+  class ElementScalar = ElementCompute,
+  class ElementIntermediate = ElementOutput,
+  FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest
+>
+using Sm90AccCastLinCombEltAct =
+  Sm90EVT<Sm90Compute<ActivationFn, ElementCompute, ElementCompute, RoundStyle>, // activation(beta * C + (alpha * acc))
+          // This is the same as Sm90LinearCombination except that it performs a round-trip cast to ElementIntermediate
+          // after accumulator scaling but before adding the source (bias)
+          Sm90EVT<Sm90Compute<homogeneous_multiply_add, ElementCompute, ElementCompute, RoundStyle>, // beta * C + (alpha * acc)
+                  Sm90ScalarBroadcastSelector<PtrArray, ElementScalar, Stride<_0,_0,int64_t>>, // beta
+                  Sm90SrcFetch<ElementSource>, // C
+                  Sm90EVT<Sm90Compute<multiplies, ElementIntermediate, ElementCompute, RoundStyle>, // alpha * acc
+                          Sm90ScalarBroadcastSelector<PtrArray, ElementScalar, Stride<_0,_0,int64_t>>, // alpha
+                          Sm90AccFetch // acc
+                  >
+          >
+  >;
+
+// D = scale * activation(alpha * acc + beta * C)
+template<
+  bool PtrArray,
+  bool DoScale,
+  template <class> class ActivationFn,
+  class ElementOutput,
+  class ElementCompute,
+  class ElementSource = ElementOutput,
+  class ElementScalar = ElementCompute,
+  class ElementIntermediate = ElementOutput,
+  FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest
+>
+using Sm90AccCastLinCombEltActScale =
+  cute::conditional_t<DoScale,
+    Sm90EVT<Sm90Compute<multiplies, ElementCompute, ElementCompute, RoundStyle>,
+            Sm90ScalarBroadcastSelector<PtrArray, ElementScalar, Stride<_0,_0,int64_t>>,
+            Sm90AccCastLinCombEltAct<PtrArray, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>
+    >,
+    Sm90AccCastLinCombEltAct<PtrArray, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>
+  >;
+
+template <
+  int StagesC,
+  int StagesD,
+  int FragmentSize,
+  bool ReuseSmemC,
+  bool DelayTmaStore,
+  bool DoScale,
+  template <class> class ActivationFn,
+  class ElementOutput,
+  class ElementCompute,
+  class ElementSource,
+  class ElementScalar,
+  class ElementIntermediate,
+  FloatRoundStyle RoundStyle,
+  class CtaTileShapeMNK,
+  class EpilogueTile
+>
+struct FusionCallbacks<
+    epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>,
+    fusion::AccCastLinCombEltActScale<DoScale, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>,
+    CtaTileShapeMNK,
+    EpilogueTile
+> : Sm90AccCastLinCombEltActScale<false, DoScale, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle> {
+
+  using Impl = Sm90AccCastLinCombEltActScale<false, DoScale, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>;
+  using Operation = fusion::AccCastLinCombEltActScale<DoScale, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>;
+
+  struct Arguments {
+
+    using StrideAlpha = Stride<_0,_0,int64_t>;
+    ElementScalar        alpha = ElementScalar(1);
+    ElementScalar const* alpha_ptr{};
+    StrideAlpha          dAlpha{};
+
+    using StrideBeta  = Stride<_0,_0,int64_t>;
+    ElementScalar        beta = ElementScalar(0);
+    ElementScalar const* beta_ptr{};
+    StrideBeta           dBeta{};
+
+    using StrideScale = Stride<_0,_0,int64_t>;
+    ElementScalar        scale = ElementScalar(1);
+    ElementScalar const* scale_ptr{};
+    StrideScale          dScale{};
+
+    using ActivationArguments = typename Sm90Compute<ActivationFn, ElementCompute, ElementCompute, RoundStyle>::Arguments;
+    ActivationArguments activation = ActivationArguments();
+
+    operator typename Impl::Arguments() const {
+
+      using SubImpl = Sm90AccCastLinCombEltAct<false, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>;
+      typename SubImpl::Arguments actlincomb_args
+      {                                       // unary op: activation(beta * C + (alpha * acc))
+        {                                       // ternary op : beta * C + (alpha * acc)
+          {{beta}, {beta_ptr}, {dBeta}},          // leaf args : beta
+          {},                                     // leaf args : C
+          {                                       // binary op : alpha * acc
+            {{alpha}, {alpha_ptr}, {dAlpha}},       // leaf args : alpha
+            {},                                     // leaf args : acc
+            {}                                      // binary args : multiplies
+          },
+          {}                                      // ternary args : multiply_add
+        },
+        activation                              // unary args: activation
+      };
+
+      return [&]() {
+        if constexpr (DoScale) {
+          return typename Impl::Arguments
+          {                                   // binary op : scale * (actlincomb)
+            {{scale}, {scale_ptr}, {dScale}},   // leaf args : scale
+            actlincomb_args,                    // leaf args : actlincomb
+            {}                                  // binary args : multiplies
+          };
+        }
+        else {
+          return actlincomb_args;
+        }
+      }();
+    }
+  };
+
+  // Ctor inheritance
+  using Impl::Impl;
+};
+
+template <
+  int StagesC,
+  int StagesD,
+  int FragmentSize,
+  bool ReuseSmemC,
+  bool DelayTmaStore,
+  int NumEpilogueWarpGroups,
+  bool DoScale,
+  template <class> class ActivationFn,
+  class ElementOutput,
+  class ElementCompute,
+  class ElementSource,
+  class ElementScalar,
+  class ElementIntermediate,
+  FloatRoundStyle RoundStyle,
+  class CtaTileShapeMNK,
+  class EpilogueTile
+>
+struct FusionCallbacks<
+    epilogue::Sm90PtrArrayTmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore, NumEpilogueWarpGroups>,
+    fusion::AccCastLinCombEltActScale<DoScale, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>,
+    CtaTileShapeMNK,
+    EpilogueTile
+> : Sm90AccCastLinCombEltActScale<true, DoScale, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle> {
+
+  using Impl = Sm90AccCastLinCombEltActScale<true, DoScale, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>;
+  using Operation = fusion::AccCastLinCombEltActScale<DoScale, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>;
+
+  struct Arguments {
+
+    using StrideAlpha = Stride<_0,_0,int64_t>;
+    ElementScalar               alpha = ElementScalar(1);
+    ElementScalar const*        alpha_ptr{};
+    ElementScalar const* const* alpha_ptr_array{};
+    StrideAlpha                 dAlpha{};
+
+    using StrideBeta = Stride<_0,_0,int64_t>;
+    ElementScalar               beta = ElementScalar(0);
+    ElementScalar const*        beta_ptr{};
+    ElementScalar const* const* beta_ptr_array{};
+    StrideBeta                  dBeta{};
+
+    using StrideScale = Stride<_0,_0,int64_t>;
+    ElementScalar               scale = ElementScalar(1);
+    ElementScalar const*        scale_ptr{};
+    ElementScalar const* const* scale_ptr_array{};
+    StrideScale                 dScale{};
+
+    using ActivationArguments = typename Sm90Compute<ActivationFn, ElementCompute, ElementCompute, RoundStyle>::Arguments;
+    ActivationArguments activation = ActivationArguments();
+
+    operator typename Impl::Arguments() const {
+
+      using SubImpl = Sm90AccCastLinCombEltAct<true, ActivationFn, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>;
+      typename SubImpl::Arguments actlincomb_args
+      {                                                          // unary op: activation(beta * C + (alpha * acc))
+        {                                                          // ternary op : beta * C + (alpha * acc)
+          {{beta}, {beta_ptr}, {beta_ptr_array}, {dBeta}},           // leaf args : beta
+          {},                                                        // leaf args : C
+          {                                                          // binary op : alpha * acc
+            {{alpha}, {alpha_ptr}, {alpha_ptr_array}, {dAlpha}},       // leaf args : alpha
+            {},                                                        // leaf args : acc
+            {}                                                         // binary args : multiplies
+          },
+          {}                                                         // ternary args : multiply_add
+        },
+        activation                                                 // unary args: activation
+      };
+
+      return [&]() {
+        if constexpr (DoScale) {
+          return typename Impl::Arguments
+          {                                                      // binary op : scale * (actlincomb)
+            {{scale}, {scale_ptr}, {scale_ptr_array}, {dScale}},   // leaf args : scale
+            actlincomb_args,                                       // leaf args : actlincomb
+            {}                                                     // binary args : multiplies
+          };
+        }
+        else {
+          return actlincomb_args;
+        }
+      }();
+    }
+  };
+
+  // Ctor inheritance
+  using Impl::Impl;
+};
+
+} // namespace cutlass::epilogue::fusion
diff --git a/examples/113_hopper_gemm_activation_fusion/sm90_visitor_gated_act.hpp b/examples/113_hopper_gemm_activation_fusion/sm90_visitor_gated_act.hpp
new file mode 100644
index 000000000..31dd28773
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/sm90_visitor_gated_act.hpp
@@ -0,0 +1,614 @@
+/***************************************************************************************************
+ * Copyright (c) 2024 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+/*! \file
+  \brief Visitor tree node for gated activation function
+*/
+
+#pragma once
+
+#include "cutlass/cutlass.h"
+#include "cutlass/workspace.h"
+
+#include "cute/tensor.hpp"
+
+#include "cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp"       // Sm90EVT
+#include "cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp"    // Sm90ScalarBroadcast(PtrArray)
+#include "cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp"   // Sm90Aux(Array)Store
+#include "cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp" // Sm90Compute
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+namespace cutlass::epilogue::fusion {
+
+template<
+  bool PtrArray,
+  class Element,
+  class Stride>
+using Sm90ScalarBroadcastSelector = cute::conditional_t<PtrArray,
+  Sm90ScalarBroadcastPtrArray<Element, Stride>,
+  Sm90ScalarBroadcast<Element, Stride>
+>;
+
+template<
+  bool PtrArray,
+  bool Quantize,
+  template <class> class ActivationFn,
+  int Stages,
+  int NumEpilogueWarpGroups,
+  class EpilogueTile,
+  class StrideMNL,
+  class SmemLayoutAtom,
+  class CopyOpR2S,
+  class ElementOutput,
+  class ElementCompute,
+  class ElementScalar = ElementCompute,
+  FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest>
+struct Sm90GatedActivation
+{
+  // Transparently handle PtrArray/GroupGemm case by using a dummy shape on host
+  template<class ProblemShape>
+  CUTLASS_HOST_DEVICE
+  static constexpr auto
+  get_problem_shape(ProblemShape const& problem_shape) {
+    if constexpr (PtrArray) {
+      return typename ProblemShape::UnderlyingProblemShape{};
+    }
+    else {
+      return problem_shape;
+    }
+  }
+
+  // Convert input problem shape [(M,2),N,K,L] to output problem shape [M,N,K,L]
+  template<class Shape>
+  CUTLASS_HOST_DEVICE
+  static constexpr auto
+  to_output_shape(Shape const& shape) {
+    using namespace cute;
+    static_assert(CUTE_STATIC_V(rank<0>(shape)) == 3, "Input shape/coord must have a rank-3 M-mode");
+    auto M = remove<1>(get<0>(shape));
+    return replace<0>(shape, M);
+  }
+
+  using EpilogueTileOut = decltype(to_output_shape(EpilogueTile{}));
+
+  // Define sub-EVTs below that will be invoked manually
+  // Cannot compose them using normal EVT structure due to gated activation logic:
+  // 1. Compute EVT (activation) is only visited on "bottom" half of the values
+  // 2. Store EVT is visited after multiplying gating and activation values, 
+  //    which needs access to the whole epilogue tile, i.e. in reduce()
+
+  using ComputeOp = Sm90Compute<ActivationFn, ElementCompute, ElementCompute, RoundStyle>;
+  using ComputeEVT = Sm90EVT<ComputeOp, Sm90AccFetch>; // leaf input slot
+
+  using StoreOp = cute::conditional_t<PtrArray,
+    Sm90AuxArrayStore<Stages, NumEpilogueWarpGroups, EpilogueTileOut, ElementOutput, RoundStyle, StrideMNL, SmemLayoutAtom, CopyOpR2S>,
+    Sm90AuxStore<Stages, EpilogueTileOut, ElementOutput, RoundStyle, StrideMNL, SmemLayoutAtom, CopyOpR2S>
+  >;
+  using ScaleOp = Sm90EVT<Sm90Compute<multiplies, ElementCompute, ElementCompute, RoundStyle>,         // scale op
+                          Sm90ScalarBroadcastSelector<PtrArray, ElementScalar, Stride<_0,_0,int64_t>>, // scale factor broadcast
+                          Sm90AccFetch>;                                                               // leaf input slot
+  using StoreEVT = Sm90EVT<StoreOp, cute::conditional_t<Quantize, ScaleOp, Sm90AccFetch>>;
+
+  // Delegate most operations to generic Sm90Visitor even though we don't inherit from it
+  using Impl = Sm90EVT<StoreEVT, ComputeEVT>;
+
+  using SharedStorage = typename Impl::SharedStorage;
+  using Arguments = typename Impl::Arguments;
+  using Params = typename Impl::Params;
+  template <bool IsLoad>
+  using TensorMaps = typename Impl::template TensorMaps<IsLoad>;
+
+  template <class ProblemShape>
+  static constexpr Params
+  to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) {
+    return Impl::to_underlying_arguments(to_output_shape(get_problem_shape(problem_shape)), args, workspace);
+  }
+
+  template <class ProblemShape>
+  static bool
+  can_implement(ProblemShape const& problem_shape, Arguments const& args) {
+    // Unlike other host APIs, can_implement is passed the underlying problem shape in Grouped GEMM cases
+    return Impl::can_implement(to_output_shape(problem_shape), args);
+  }
+
+  template <class ProblemShape>
+  static size_t
+  get_workspace_size(ProblemShape const& problem_shape, Arguments const& args) {
+    return Impl::get_workspace_size(to_output_shape(get_problem_shape(problem_shape)), args);
+  }
+
+  template <class ProblemShape>
+  static cutlass::Status
+  initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream,
+    CudaHostAdapter* cuda_adapter = nullptr) {
+    return Impl::initialize_workspace(to_output_shape(get_problem_shape(problem_shape)), args, workspace, stream, cuda_adapter);
+  }
+
+  CUTLASS_HOST_DEVICE
+  Sm90GatedActivation() : impl() { }
+
+  CUTLASS_HOST_DEVICE
+  Sm90GatedActivation(Params const& params, SharedStorage const& shared_storage)
+    : impl(params, shared_storage) { }
+
+  Impl impl;
+
+  CUTLASS_DEVICE bool
+  is_producer_load_needed() const {
+    return impl.is_producer_load_needed();
+  }
+
+  CUTLASS_DEVICE bool
+  is_C_load_needed() const {
+    return impl.is_C_load_needed();
+  }
+
+  template <class... Args>
+  CUTLASS_DEVICE auto
+  get_producer_load_callbacks(ProducerLoadArgs<Args...> const& args) {
+    return impl.get_producer_load_callbacks(args);
+  }
+
+  template <
+    class CallbacksImpl,
+    class CrdTensor
+  >
+  struct ConsumerStoreCallbacks : CallbacksImpl {
+    CUTLASS_DEVICE
+    ConsumerStoreCallbacks(
+        CallbacksImpl impl,
+        CrdTensor tRS_cD)
+    : CallbacksImpl(impl), 
+      tRS_cD(tRS_cD) {}
+
+    using CallbacksImpl::callbacks_tuple;
+    CrdTensor tRS_cD;
+
+    template <typename ElementAccumulator, typename ElementInput, int FragmentSize>
+    CUTLASS_DEVICE auto
+    visit(Array<ElementAccumulator, FragmentSize> const& frg_acc, int epi_v, int epi_m, int epi_n,
+          Array<ElementInput, FragmentSize> const& frg_input) {
+      using namespace cute;
+
+      static_assert(FragmentSize % 4 == 0, "Fragment size must be a multiple of 4");
+      using FrgOutput = Array<ElementInput, FragmentSize / 2>;
+
+      // This splitting relies on details of WGMMA accumulator register layout
+      Tensor input = flat_divide(make_tensor(frg_input.data(), Int<FragmentSize>{}), Layout<Shape<_2,_2>>{});
+      Tensor input_val = make_tensor_like(input(_,0,_));
+      Tensor input_act = make_tensor_like(input(_,1,_));
+      copy(input(_,0,_), input_val);
+      copy(input(_,1,_), input_act);
+
+      FrgOutput const& frg_input_val = recast<FrgOutput>(input_val)(0);
+      FrgOutput const& frg_input_act = recast<FrgOutput>(input_act)(0);
+
+      // store(gemm0 * act(gemm1))
+      FrgOutput frg_output_act = get<0>(callbacks_tuple).visit(frg_input_act, epi_v, epi_m, epi_n);
+      FrgOutput frg_output = frg_input_val * frg_output_act;
+      get<1>(callbacks_tuple).visit(frg_output, epi_v, epi_m, epi_n);
+
+      return frg_input;
+    }
+  };
+
+  template <
+    bool ReferenceSrc, // do register tensors reference the src or dst layout of the tiled copy
+    class... Args
+  >
+  CUTLASS_DEVICE auto
+  get_consumer_store_callbacks(ConsumerStoreArgs<Args...> const& args) {
+
+    using namespace cute;
+
+    // Transform TV layout of the tiled copy by removing every other group of 8 rows.
+    // Note: assumes by-mode tilers that are bijective here - not necessarily the case in general!
+
+    auto tiler_mn = typename decltype(args.tiled_copy)::Tiler_MN{};
+    auto layout_tv = typename decltype(args.tiled_copy)::TiledLayout_TV{};
+    auto [tiler_m, tiler_n] = tiler_mn;
+    int constexpr TileM = CUTE_STATIC_V(size(tiler_m));
+    int constexpr TileN = CUTE_STATIC_V(size(tiler_n));
+    auto row_selector = Layout<Shape<_8,Int<TileM/16>>, Stride<_1,_16>>{}; // select every other group of 8 rows
+    auto col_selector = Layout<Int<TileN>>{};                              // select all columns
+
+    auto tiler_mn_out =
+      make_tile(
+        right_inverse(
+          make_layout_like(
+            composition(
+              right_inverse(tiler_m),
+              row_selector
+            )
+          )
+        ),
+        tiler_n
+      );
+
+    auto layout_tv_out =
+      right_inverse(                                                         // t,v/2 -> copy m/2,n
+        composition(                                                         // copy m/2,n -> t,v/2
+          make_layout_like(                                                  // real m/2,n -> t,v/2
+            composition(                                                     // real m/2,n -> t,v/2
+              composition(                                                   // real m,n -> t,v
+                right_inverse(layout_tv).with_shape(shape(tiler_mn)),        // copy m,n -> t,v
+                make_tile(right_inverse(tiler_m), right_inverse(tiler_n))    // real m,n -> copy m,n
+              ),
+              make_tile(row_selector, col_selector)                          // real m,n -> real m/2,n
+            )
+          ),
+          tiler_mn_out
+        )
+      ).with_shape(make_shape(size<0>(layout_tv), size<1>(layout_tv)/_2{})); // t,v/2 -> copy m/2,n
+
+    auto tiled_copy = TiledCopy<Copy_Atom<DefaultCopy,ElementCompute>, decltype(layout_tv_out), decltype(tiler_mn_out)>{};
+    
+    auto args_impl = ConsumerStoreArgs{
+      to_output_shape(args.problem_shape_mnkl),
+      to_output_shape(args.tile_shape_mnk),
+      to_output_shape(args.tile_coord_mnkl),
+      args.tiled_mma,
+      EpilogueTileOut{},
+      tiled_copy,
+      args.cD,
+      args.residue_cD,
+      args.tCcD,
+      args.residue_tCcD,
+      args.tCrC,
+      args.thread_idx
+    };
+
+    auto cst_impl = impl.get_consumer_store_callbacks<ReferenceSrc>(args_impl);
+
+    return ConsumerStoreCallbacks<decltype(cst_impl), decltype(args.tCcD)>(
+      cst_impl, 
+      args.tCcD);
+  }
+
+  template <bool IsLoad, class CallbacksImpl>
+  struct TensorMapCallbacks : CallbacksImpl {
+
+    CUTLASS_DEVICE
+    TensorMapCallbacks(CallbacksImpl&& impl) : CallbacksImpl(cute::move(impl)) {}
+    
+    template <class ProblemShape_MNKL>
+    CUTLASS_DEVICE
+    void
+    perform_update(
+        TensorMaps<IsLoad> tensormaps,
+        ProblemShape_MNKL problem_shape_mnkl,
+        int32_t next_batch,
+        int32_t warp_group_idx)
+    {
+      CallbacksImpl::perform_update(tensormaps, to_output_shape(problem_shape_mnkl), next_batch, warp_group_idx);
+    }
+  };
+
+  template <bool IsLoad>
+  CUTLASS_DEVICE constexpr auto
+  get_tensormap_callbacks() {
+    auto tmap_callbacks = impl.template get_tensormap_callbacks<IsLoad>();
+    return TensorMapCallbacks<IsLoad,decltype(tmap_callbacks)>(cute::move(tmap_callbacks));
+  }
+};
+
+template<
+  bool Quantize, // whether to quantize output with a per-tensor scale factor
+  template <class> class ActivationFn,
+  class GmemLayoutTagOutput,
+  class ElementOutput,
+  class ElementCompute,
+  class ElementSource = ElementOutput,
+  class ElementScalar = ElementCompute,
+  class ElementIntermediate = ElementOutput,
+  int Alignment = 128 / cute::sizeof_bits_v<ElementOutput>,
+  FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest
+>
+struct LinCombGatedActFunc
+    : LinearCombination<ElementOutput, ElementCompute, ElementSource, ElementScalar, RoundStyle> {
+  using ElementAux = ElementOutput;
+  using GmemLayoutTagAux = GmemLayoutTagOutput;
+  static constexpr int AlignmentAux = Alignment;
+  static constexpr bool IsAuxOutSupported = true;
+};
+
+template<
+  bool PtrArray,
+  bool Quantize,
+  template <class> class ActivationFn,
+  int Stages,
+  int NumEpilogueWarpGroups,
+  class EpilogueTile,
+  class StrideMNL,
+  class SmemLayoutAtom,
+  class CopyOpR2S,
+  class ElementOutput,
+  class ElementCompute,
+  class ElementSource = ElementOutput,
+  class ElementScalar = ElementCompute,
+  class ElementIntermediate = ElementOutput,
+  FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest
+>
+using Sm90LinCombGatedActFunc =
+  Sm90EVT<Sm90GatedActivation<PtrArray, Quantize, ActivationFn, Stages, NumEpilogueWarpGroups, EpilogueTile, StrideMNL,
+                              SmemLayoutAtom, CopyOpR2S, ElementOutput, ElementCompute, ElementScalar, RoundStyle>,  // store([scale *] x(0) * f(x(1))), scale applied only if Quantize
+          // This is the same as Sm90LinearCombinationPtrArray except that it performs a round-trip cast to ElementIntermediate
+          // after accumulator scaling but before adding the source (bias), which emulates the precision of the unfused path
+          Sm90EVT<Sm90Compute<homogeneous_multiply_add, ElementCompute, ElementCompute, RoundStyle>, // beta * C + (alpha * acc)
+                  Sm90ScalarBroadcastSelector<PtrArray, ElementScalar, Stride<_0,_0,int64_t>>, // beta
+                  Sm90SrcFetch<ElementSource>, // C
+                  Sm90EVT<Sm90Compute<multiplies, ElementIntermediate, ElementCompute, RoundStyle>, // alpha * acc
+                          Sm90ScalarBroadcastSelector<PtrArray, ElementScalar, Stride<_0,_0,int64_t>>, // alpha
+                          Sm90AccFetch // acc
+                  >
+          >
+  >;
+
+template <
+  // DispatchPolicy args
+  int StagesC,
+  int StagesD,
+  int FragmentSize,
+  bool ReuseSmemC,
+  bool DelayTmaStore,
+  // Fusion op args
+  // Gated act + quantization args
+  bool Quantize,
+  template <class> class ActivationFn,
+  // Store args
+  class GmemLayoutTagOutput,
+  // Element types
+  class ElementOutput,
+  class ElementCompute,
+  class ElementSource,
+  class ElementScalar,
+  class ElementIntermediate,
+  int Alignment,
+  FloatRoundStyle RoundStyle,
+  // Tile shape args
+  class CtaTileShapeMNK,
+  class EpilogueTile,
+  // Aux store args
+  class SmemLayoutAtom,
+  class CopyOpR2S
+>
+struct FusionCallbacks<
+    epilogue::Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore>,
+    fusion::LinCombGatedActFunc<Quantize, ActivationFn, GmemLayoutTagOutput, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, Alignment, RoundStyle>,
+    CtaTileShapeMNK,
+    EpilogueTile,
+    SmemLayoutAtom,
+    CopyOpR2S
+> : Sm90LinCombGatedActFunc<false, Quantize, ActivationFn, StagesD, 2, EpilogueTile, cutlass::gemm::TagToStrideC_t<GmemLayoutTagOutput>,
+                            SmemLayoutAtom, CopyOpR2S, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle> {
+
+  using Impl = Sm90LinCombGatedActFunc<false, Quantize, ActivationFn, StagesD, 2, EpilogueTile, cutlass::gemm::TagToStrideC_t<GmemLayoutTagOutput>,
+                                       SmemLayoutAtom, CopyOpR2S, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>;
+  using Operation = fusion::LinCombGatedActFunc<Quantize, ActivationFn, GmemLayoutTagOutput, ElementOutput, ElementCompute,
+                                                ElementSource, ElementScalar, ElementIntermediate, Alignment, RoundStyle>;
+
+  struct Arguments {
+    using StrideAlpha = Stride<_0,_0,int64_t>;
+    ElementScalar               alpha = ElementScalar(1);
+    ElementScalar const*        alpha_ptr{};
+    StrideAlpha                 dAlpha{};
+
+    using StrideBeta = Stride<_0,_0,int64_t>;
+    ElementScalar               beta = ElementScalar(0);
+    ElementScalar const*        beta_ptr{};
+    StrideBeta                  dBeta{};
+
+    using StrideScale = Stride<_0,_0,int64_t>;
+    ElementScalar               scale = ElementScalar(1);
+    ElementScalar const*        scale_ptr{};
+    StrideScale                 dScale{};
+
+    using StrideOutput = cutlass::gemm::TagToStrideC_t<GmemLayoutTagOutput>;
+    ElementOutput* ptr_D{};
+    StrideOutput dD{};
+
+    int sm_count{};
+
+    operator typename Impl::Arguments() const {
+
+      using StoreArgs = decltype(typename Impl::Arguments{}.op_1.op_1);
+
+      StoreArgs store_args = [&]{
+        if constexpr (Quantize) {
+          return StoreArgs
+          {                                                        // custom node : conversion + store 
+            {                                                        // binary op : conversion + scale
+              {{scale}, {scale_ptr}, {dScale}},                        // leaf args : scalar broadcast (scale)
+              {},                                                      // leaf args : acc fetch (input)
+              {}                                                       // binary args : multiplies
+            },
+            {ptr_D, dD},                                             // unary args : aux store
+          };
+        }
+        else {
+          return StoreArgs
+          {                                                        // unary op : aux store
+            {},                                                      // leaf args : acc fetch (input)
+            {ptr_D, dD}                                              // unary args : aux store
+          };
+        }
+      }();
+
+      return
+        {                                                          // unary op: store(scale(gated_act(beta * C + (alpha * acc))))
+          {                                                          // ternary op : beta * C + (alpha * acc)
+            {{beta}, {beta_ptr}, {dBeta}},                             // leaf args : beta
+            {},                                                        // leaf args : C
+            {                                                          // binary op : alpha * acc
+              {{alpha}, {alpha_ptr}, {dAlpha}},                          // leaf args : alpha
+              {},                                                        // leaf args : acc
+              {}                                                         // binary args : multiplies
+            },
+            {}                                                         // ternary args : multiply_add
+          },
+          {                                                        // custom node : gated_act+scale+store custom node
+            {                                                        // unary op : act_func(input)
+              {},                                                      // leaf args : input
+              {}                                                       // unary args : act_func
+            },
+            store_args
+          }
+        };
+    }
+  };
+
+  // Ctor inheritance
+  using Impl::Impl;
+};
+
+template <
+  // DispatchPolicy args
+  int StagesC,
+  int StagesD,
+  int FragmentSize,
+  bool ReuseSmemC,
+  bool DelayTmaStore,
+  int NumEpilogueWarpGroups,
+  // Fusion op args
+  // Gated act + quantization args
+  bool Quantize,
+  template <class> class ActivationFn,
+  // Store args
+  class GmemLayoutTagOutput,
+  // Element types
+  class ElementOutput,
+  class ElementCompute,
+  class ElementSource,
+  class ElementScalar,
+  class ElementIntermediate,
+  int Alignment,
+  FloatRoundStyle RoundStyle,
+  // Tile shape args
+  class CtaTileShapeMNK,
+  class EpilogueTile,
+  // Aux store args
+  class SmemLayoutAtom,
+  class CopyOpR2S
+>
+struct FusionCallbacks<
+    epilogue::Sm90PtrArrayTmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore, NumEpilogueWarpGroups>,
+    fusion::LinCombGatedActFunc<Quantize, ActivationFn, GmemLayoutTagOutput, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, Alignment, RoundStyle>,
+    CtaTileShapeMNK,
+    EpilogueTile,
+    SmemLayoutAtom,
+    CopyOpR2S
+> : Sm90LinCombGatedActFunc<true, Quantize, ActivationFn, StagesD, NumEpilogueWarpGroups, EpilogueTile, cutlass::gemm::TagToStrideC_t<GmemLayoutTagOutput>,
+                            SmemLayoutAtom, CopyOpR2S, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle> {
+
+  using Impl = Sm90LinCombGatedActFunc<true, Quantize, ActivationFn, StagesD, NumEpilogueWarpGroups, EpilogueTile, cutlass::gemm::TagToStrideC_t<GmemLayoutTagOutput>,
+                                       SmemLayoutAtom, CopyOpR2S, ElementOutput, ElementCompute, ElementSource, ElementScalar, ElementIntermediate, RoundStyle>;
+  using Operation = fusion::LinCombGatedActFunc<Quantize, ActivationFn, GmemLayoutTagOutput, ElementOutput, ElementCompute,
+                                                ElementSource, ElementScalar, ElementIntermediate, Alignment, RoundStyle>;
+
+  struct Arguments {
+    using StrideAlpha = Stride<_0,_0,int64_t>;
+    ElementScalar               alpha = ElementScalar(1);
+    ElementScalar const*        alpha_ptr{};
+    ElementScalar const* const* alpha_ptr_array{};
+    StrideAlpha                 dAlpha{};
+
+    using StrideBeta = Stride<_0,_0,int64_t>;
+    ElementScalar               beta = ElementScalar(0);
+    ElementScalar const*        beta_ptr{};
+    ElementScalar const* const* beta_ptr_array{};
+    StrideBeta                  dBeta{};
+
+    using StrideScale = Stride<_0,_0,int64_t>;
+    ElementScalar               scale = ElementScalar(1);
+    ElementScalar const*        scale_ptr{};
+    ElementScalar const* const* scale_ptr_array{};
+    StrideScale                 dScale{};
+
+    using StrideOutput = cutlass::gemm::TagToStrideC_t<GmemLayoutTagOutput>;
+    ElementOutput** ptr_D{};
+    StrideOutput dD{};
+
+    int sm_count{};
+
+    operator typename Impl::Arguments() const {
+
+      using StoreArgs = decltype(typename Impl::Arguments{}.op_1.op_1);
+
+      StoreArgs store_args = [&]{
+        if constexpr (Quantize) {
+          return StoreArgs
+          {                                                        // custom node : conversion + store
+            {                                                        // binary op : conversion + scale
+              {{scale}, {scale_ptr}, {scale_ptr_array}, {dScale}},     // leaf args : scalar broadcast (scale)
+              {},                                                      // leaf args : acc fetch (input)
+              {}                                                       // binary args : multiplies
+            },
+            {ptr_D, dD, sm_count},                                   // unary args : aux store
+          };
+        }
+        else {
+          return StoreArgs
+          {                                                        // unary op : aux store
+            {},                                                      // leaf args : acc fetch (input)
+            {ptr_D, dD, sm_count}                                    // unary args : aux store
+          };
+        }
+      }();
+
+      return
+        {                                                          // unary op: store(scale(gated_act(beta * C + (alpha * acc))))
+          {                                                          // ternary op : beta * C + (alpha * acc)
+            {{beta}, {beta_ptr}, {beta_ptr_array}, {dBeta}},           // leaf args : beta
+            {},                                                        // leaf args : C
+            {                                                          // binary op : alpha * acc
+              {{alpha}, {alpha_ptr}, {alpha_ptr_array}, {dAlpha}},       // leaf args : alpha
+              {},                                                        // leaf args : acc
+              {}                                                         // binary args : multiplies
+            },
+            {}                                                         // ternary args : multiply_add
+          },
+          {                                                        // custom node : gated_act+scale+store custom node
+            {                                                        // unary op : act_func(input)
+              {},                                                      // leaf args : input
+              {}                                                       // unary args : act_func
+            },
+            store_args
+          }
+        };
+    }
+  };
+
+  // Ctor inheritance
+  using Impl::Impl;
+};
+
+} // namespace cutlass::epilogue::fusion
diff --git a/examples/113_hopper_gemm_activation_fusion/tile_scheduler_group.hpp b/examples/113_hopper_gemm_activation_fusion/tile_scheduler_group.hpp
new file mode 100644
index 000000000..1934965b8
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/tile_scheduler_group.hpp
@@ -0,0 +1,132 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+#pragma once
+
+#include "cutlass/gemm/kernel/tile_scheduler.hpp"
+
+// A variant of the persistent group scheduler that preserves the static, hierarchical (multi-mode) tile shape
+template <class GroupProblemShape, uint32_t SchedulerPipelineStageCount, class TileShape>
+class PersistentTileSchedulerSm90GroupTileShapeDependent
+: public cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90Group<GroupProblemShape, SchedulerPipelineStageCount> {
+
+public:
+
+  static_assert(cute::is_static_v<TileShape>, "TileShape must be static");
+
+  using Base = cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90Group<GroupProblemShape, SchedulerPipelineStageCount>;
+  using Base::Base;
+  using WorkTileInfo = typename Base::WorkTileInfo;
+
+  // Customized to pass the static (hierarchical) tile shape instead of the dynamic, flattened tile shape
+  CUTLASS_DEVICE
+  WorkTileInfo
+  get_current_work_for_linear_idx(uint64_t linear_idx) {
+    if (this->scheduler_params.pre_processed_problem_shapes && linear_idx >= this->scheduler_params.blocks_across_problem_) {
+      return WorkTileInfo::invalid_work_tile();
+    }
+
+    return this->template get_work_idx_m_and_n<WorkTileInfo>(
+      linear_idx,
+      this->current_group_info_,
+      this->scheduler_params.problem_shapes_,
+      this->cached_problem_shapes_,
+      TileShape{},
+      this->scheduler_params.cluster_shape_,
+      this->scheduler_params.divmod_cluster_shape_major_,
+      this->scheduler_params.divmod_cluster_shape_minor_,
+      this->scheduler_params.divmod_cta_shape_m_,
+      this->scheduler_params.divmod_cta_shape_n_,
+      this->scheduler_params.max_swizzle_size_,
+      this->scheduler_params.raster_order_);
+  }
+
+  // Every function that calls get_current_work_for_linear_idx() must be re-implemented here so that the call resolves to the version above
+
+  template <typename TileSchedulerPipeline, typename TileSchedulerPipelineState,
+            typename CallbackBeforeCommit = WorkTileInfo(*)(WorkTileInfo)>
+  CUTLASS_DEVICE
+  auto
+  advance_to_next_work(
+    TileSchedulerPipeline& scheduler_pipeline,
+    TileSchedulerPipelineState scheduler_pipe_producer_state,
+    uint32_t advance_count = 1,
+    CallbackBeforeCommit callback_before_commit = [] (WorkTileInfo info) { return info; }) {
+
+    this->current_work_linear_idx_ += this->total_grid_size_ * uint64_t(advance_count);
+    auto work_tile = get_current_work_for_linear_idx(this->current_work_linear_idx_);
+    using WorkTileWithCallbackInfo = decltype(callback_before_commit(work_tile));
+    WorkTileWithCallbackInfo work_tile_with_callback_info = work_tile;
+    scheduler_pipeline.producer_acquire(scheduler_pipe_producer_state);
+    if (work_tile_with_callback_info.is_valid()) {
+      work_tile_with_callback_info = callback_before_commit(work_tile);
+    }
+    if (cute::elect_one_sync()) {
+      reinterpret_cast<WorkTileWithCallbackInfo *>(this->response_ptr_)[scheduler_pipe_producer_state.index()] = work_tile_with_callback_info;
+      cutlass::arch::fence_view_async_shared();
+      scheduler_pipeline.producer_commit(scheduler_pipe_producer_state);
+    }
+    return cute::make_tuple(work_tile_with_callback_info, true);
+  }
+
+  // Returns the initial work tile info that will be computed over
+  template <class ClusterShape>
+  CUTLASS_DEVICE
+  WorkTileInfo
+  initial_work_tile_info(ClusterShape) {
+    return get_current_work_for_linear_idx(this->current_work_linear_idx_);
+  }
+};
+
+// Derives from GroupScheduler so the cooperative kernel's scheduler-compatibility
+// static_assert (is_base_of_v<GroupScheduler, TileScheduler_>) accepts this tag.
+struct GroupSchedulerTileShapeDependent : cutlass::gemm::GroupScheduler {};
+
+namespace cutlass::gemm::kernel::detail {
+
+template <
+  class TileShape,
+  class ClusterShape,
+  uint32_t SchedulerPipelineStageCount,
+  class GroupProblemShape
+>
+struct TileSchedulerSelector<
+    GroupSchedulerTileShapeDependent,
+    arch::Sm90,
+    TileShape,
+    ClusterShape
+    , SchedulerPipelineStageCount
+    , GroupProblemShape
+  > {
+  using Scheduler = PersistentTileSchedulerSm90GroupTileShapeDependent<GroupProblemShape, SchedulerPipelineStageCount, TileShape>;
+};
+
+} // namespace cutlass::gemm::kernel::detail
diff --git a/examples/113_hopper_gemm_activation_fusion/utils.hpp b/examples/113_hopper_gemm_activation_fusion/utils.hpp
new file mode 100644
index 000000000..6a8c00b49
--- /dev/null
+++ b/examples/113_hopper_gemm_activation_fusion/utils.hpp
@@ -0,0 +1,158 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+#pragma once
+#include <sstream>
+#include <string>
+
+#include "cutlass/gemm/dispatch_policy.hpp"
+#include "cutlass/epilogue/thread/activation.h"
+#include "cutlass/epilogue/collective/collective_builder.hpp"
+
+template <class Schedule>
+std::string kernel_schedule_string() {
+  if constexpr (cute::is_base_of_v<cutlass::gemm::KernelTmaWarpSpecialized, Schedule>) {
+    return "Non-persistent";
+  }
+  else if constexpr (cute::is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedPingpong, Schedule> ||
+                     cute::is_base_of_v<cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpong, Schedule>) {
+    return "Pingpong";
+  }
+  else if constexpr (cute::is_base_of_v<cutlass::gemm::KernelTmaWarpSpecializedCooperative, Schedule> ||
+                     cute::is_base_of_v<cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperative, Schedule>) {
+    return "Cooperative";
+  }
+  else {
+    return "Unknown";
+  }
+}
+
+template <template <class> class ActFn>
+std::string activation_func_string() {
+  if constexpr (cute::is_same_v<cutlass::epilogue::thread::ReLu<float>, ActFn<float>>) {
+    return "ReLU";
+  }
+  else if constexpr (cute::is_same_v<cutlass::epilogue::thread::SiLu<float>, ActFn<float>>) {
+    return "SiLU";
+  }
+  else if constexpr (cute::is_same_v<cutlass::epilogue::thread::Identity<float>, ActFn<float>>) {
+    return "None";
+  }
+  else {
+    return "Unknown";
+  }
+}
+
+template <class Shape>
+std::string shape_string(Shape shape = {}) {
+  std::stringstream ss;
+  cute::for_each(shape, [&](auto s) { ss << 'x' << int(cute::size(s)); });
+  return ss.str().substr(1);
+}
+
+template <class EpiTile>
+std::string epilogue_tile_string(EpiTile = {}) {
+  if constexpr (cute::is_same_v<cutlass::epilogue::collective::EpilogueTileAuto, EpiTile>) {
+    return "Auto";
+  }
+  else if constexpr (cute::is_tuple_v<EpiTile>) {
+    return shape_string(EpiTile{});
+  }
+  else {
+    return "Unknown";
+  }
+}
+
+template <class Element>
+std::string data_type_string(Element = {}) {
+  if constexpr (cute::is_same_v<cutlass::float_e4m3_t, Element> || cute::is_same_v<cutlass::float_e5m2_t, Element>) {
+    return "fp8";
+  }
+  else if constexpr (cute::is_same_v<cutlass::half_t, Element>) {
+    return "fp16";
+  }
+  else if constexpr (cute::is_same_v<cutlass::bfloat16_t, Element>) {
+    return "bf16";
+  }
+  else if constexpr (cute::is_same_v<float, Element>) {
+    return "fp32";
+  }
+  else if constexpr (cute::is_same_v<cutlass::tfloat32_t, Element>) {
+    return "tf32";
+  }
+  else if constexpr (cute::is_same_v<cutlass::int4b_t, Element> || cute::is_same_v<cutlass::uint4b_t, Element>) {
+    return "int4";
+  }
+  else if constexpr (cute::is_same_v<int8_t, Element> || cute::is_same_v<uint8_t, Element>) {
+    return "int8";
+  }
+  else if constexpr (cute::is_same_v<int, Element> || cute::is_same_v<unsigned int, Element>) {
+    return "int32";
+  }
+  else {
+    return "Unknown";
+  }
+}
+
+template <class ElementA, class ElementB, class ElementAcc, class ElementC, class ElementD>
+std::string problem_desc_string(ElementA = {}, ElementB = {}, ElementAcc = {}, ElementC = {}, ElementD = {}) {
+  return data_type_string<ElementA>() + " x " +
+         data_type_string<ElementB>() + " -> " +
+         data_type_string<ElementAcc>() + " + " +
+         data_type_string<ElementC>() + " -> " +
+         data_type_string<ElementD>();
+}
+
+// Simple wrapper that can initialize and repeatedly run a GEMM. Lives here
+// rather than in examples/common/helper.h to avoid colliding with other
+// examples (e.g. 62_hopper_sparse_gemm) that define their own Runner with a
+// different API. Requires CUTLASS_CHECK (helper.h) and
+// cutlass::device_memory::allocation (cutlass/util/device_memory.h) - all
+// four 113 examples include both before this header.
+template<class Gemm>
+struct Runner
+{
+  using Arguments = typename Gemm::Arguments;
+
+  Runner(Arguments const& args)
+  : arguments(args) {
+    workspace.reset(Gemm::get_workspace_size(arguments));
+    CUTLASS_CHECK(gemm.can_implement(arguments));
+  }
+
+  void run() {
+    CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
+    CUTLASS_CHECK(gemm.run());
+  }
+
+  Gemm gemm;
+  Arguments arguments;
+  cutlass::device_memory::allocation<uint8_t> workspace;
+};
diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt
index d7670f642..a0486b1f2 100644
--- a/examples/CMakeLists.txt
+++ b/examples/CMakeLists.txt
@@ -175,6 +175,7 @@ foreach(EXAMPLE
   95_blackwell_gemm_green_context
   111_hopper_ssd
   112_blackwell_ssd
+  113_hopper_gemm_activation_fusion
   )
 
   add_subdirectory(${EXAMPLE})
diff --git a/examples/common/helper.h b/examples/common/helper.h
index 1f9d07fe4..7a2578667 100644
--- a/examples/common/helper.h
+++ b/examples/common/helper.h
@@ -31,78 +31,118 @@
 #pragma once
 
 #include "cuda_runtime.h"
+
 #include <iostream>
+#include <vector>
+
+#include "cutlass/util/device_memory.h"
 
 /**
  * Panic wrapper for unwinding CUTLASS errors
  */
-#define CUTLASS_CHECK(status)                                                                    \
-  {                                                                                              \
-    cutlass::Status error = status;                                                              \
-    if (error != cutlass::Status::kSuccess) {                                                    \
-      std::cerr << "Got cutlass error: " << cutlassGetStatusString(error) << " at: " << __LINE__ \
-                << std::endl;                                                                    \
-      exit(EXIT_FAILURE);                                                                        \
-    }                                                                                            \
-  }
+#define CUTLASS_CHECK(CALL)                                           \
+do {                                                                  \
+  cutlass::Status __status = CALL;                                    \
+  if (__status != cutlass::Status::kSuccess) {                        \
+    std::cerr << "Got CUTLASS error while calling " #CALL " at line " \
+              << __LINE__ << ": " << cutlassGetStatusString(__status) \
+              << std::endl;                                           \
+    exit(EXIT_FAILURE);                                               \
+  }                                                                   \
+} while (false)
 
 
 /**
  * Panic wrapper for unwinding CUDA runtime errors
  */
-#define CUDA_CHECK(status)                                              \
-  {                                                                     \
-    cudaError_t error = status;                                         \
-    if (error != cudaSuccess) {                                         \
-      std::cerr << "Got bad cuda status: " << cudaGetErrorString(error) \
-                << " at line: " << __LINE__ << std::endl;               \
-      exit(EXIT_FAILURE);                                               \
-    }                                                                   \
-  }
-
+#define CUDA_CHECK(CALL)                                           \
+do {                                                               \
+  cudaError_t __status = CALL;                                     \
+  if (__status != cudaSuccess) {                                   \
+    std::cerr << "Got CUDA error while calling " #CALL " at line " \
+              << __LINE__ << ": " << cudaGetErrorString(__status)  \
+              << std::endl;                                        \
+    exit(EXIT_FAILURE);                                            \
+  }                                                                \
+} while (false)
 
 /**
  * GPU timer for recording the elapsed time across kernel(s) launched in GPU stream
  */
 struct GpuTimer
 {
-    cudaStream_t _stream_id;
-    cudaEvent_t _start;
-    cudaEvent_t _stop;
+  cudaStream_t _stream_id;
+  cudaEvent_t _start;
+  cudaEvent_t _stop;
 
-    /// Constructor
-    GpuTimer() : _stream_id(0)
-    {
-        CUDA_CHECK(cudaEventCreate(&_start));
-        CUDA_CHECK(cudaEventCreate(&_stop));
-    }
+  /// Constructor
+  GpuTimer() : _stream_id(0)
+  {
+    CUDA_CHECK(cudaEventCreate(&_start));
+    CUDA_CHECK(cudaEventCreate(&_stop));
+  }
 
-    /// Destructor
-    ~GpuTimer()
-    {
-        CUDA_CHECK(cudaEventDestroy(_start));
-        CUDA_CHECK(cudaEventDestroy(_stop));
-    }
+  /// Destructor
+  ~GpuTimer()
+  {
+    CUDA_CHECK(cudaEventDestroy(_start));
+    CUDA_CHECK(cudaEventDestroy(_stop));
+  }
 
-    /// Start the timer for a given stream (defaults to the default stream)
-    void start(cudaStream_t stream_id = 0)
-    {
-        _stream_id = stream_id;
-        CUDA_CHECK(cudaEventRecord(_start, _stream_id));
-    }
+  /// Start the timer for a given stream (defaults to the default stream)
+  void start(cudaStream_t stream_id = 0)
+  {
+    _stream_id = stream_id;
+    CUDA_CHECK(cudaEventRecord(_start, _stream_id));
+  }
 
-    /// Stop the timer
-    void stop()
-    {
-        CUDA_CHECK(cudaEventRecord(_stop, _stream_id));
-    }
+  /// Stop the timer
+  void stop()
+  {
+    CUDA_CHECK(cudaEventRecord(_stop, _stream_id));
+  }
 
-    /// Return the elapsed time (in milliseconds)
-    float elapsed_millis()
-    {
-        float elapsed = 0.0;
-        CUDA_CHECK(cudaEventSynchronize(_stop));
-        CUDA_CHECK(cudaEventElapsedTime(&elapsed, _start, _stop));
-        return elapsed;
-    }
+  /// Return the elapsed time (in milliseconds)
+  float elapsed_millis()
+  {
+    float elapsed = 0.0;
+    CUDA_CHECK(cudaEventSynchronize(_stop));
+    CUDA_CHECK(cudaEventElapsedTime(&elapsed, _start, _stop));
+    return elapsed;
+  }
 };
+
+struct BenchmarkResult {
+  double avg_runtime_ms;
+};
+
+template <class Func>
+BenchmarkResult run_benchmark(
+    Func func,
+    int warmup_iterations,
+    int bench_iterations) {
+
+  for (int iter = 0; iter < warmup_iterations; ++iter) {
+    func();
+  }
+
+  GpuTimer timer;
+  timer.start();
+  for (int iter = 0; iter < bench_iterations; ++iter) {
+    func();
+  }
+  timer.stop();
+
+  return { timer.elapsed_millis() / double(bench_iterations) };
+}
+
+template<class T>
+__global__ void print_device_tensor_kernel(T t) {
+  print_tensor(t);
+}
+
+template<class T>
+void print_device_tensor(T const& t) {
+  print_device_tensor_kernel<<<1,1>>>(t);
+  CUDA_CHECK(cudaDeviceSynchronize());
+}
diff --git a/examples/python/CuTeDSL/cute/ampere/kernel/attention/hstu_attention.py b/examples/python/CuTeDSL/cute/ampere/kernel/attention/hstu_attention.py
index fbe6f7af4..0f328d7a7 100644
--- a/examples/python/CuTeDSL/cute/ampere/kernel/attention/hstu_attention.py
+++ b/examples/python/CuTeDSL/cute/ampere/kernel/attention/hstu_attention.py
@@ -49,7 +49,7 @@ To run this example:
 
 .. code-block:: bash
 
-    python examples/ampere/hstu_attention.py --batch_size 4 --seqlen_q 8192 --seqlen_kv 8192 --num_head 4 --head_dim 128 --m_block_size 128 --n_block_size 64 --is_causal --perf_test
+    python examples/cute/ampere/kernel/attention/hstu_attention.py --batch_size 4 --seqlen_q 8192 --seqlen_kv 8192 --num_head 4 --head_dim 128 --m_block_size 128 --n_block_size 64 --is_causal --perf_test
 
 The above example tests the performance of HSTU attention with batch size 4, sequence length 8192, 4 attention heads, and head dimension 128. The m_block_size is 128, and n_block_size is 64. The causal masking is enabled.
 
@@ -268,6 +268,104 @@ class HSTUAttentionForwardAmpere(object):
             stream=stream,
         )
 
+    @cute.jit
+    def _copy_with_residue(
+        self,
+        copy_atom,
+        src_tile,
+        dst_tile,
+        coord_tile,
+        head_dim_pred,
+        has_outer_residue: cutlass.Constexpr,
+        has_inner_residue: cutlass.Constexpr,
+        outer_size,
+        block_end,
+        fill_zero_on_oob: cutlass.Constexpr = True,
+        is_known_boundary: cutlass.Constexpr = False,
+    ):
+        # Copy a (CPY_Atom, M, K) tile with optional outer-axis (M) and head-dim (K) residue.
+        # `is_known_boundary=True` skips the runtime boundary check (caller knows the tile straddles outer_size).
+        # `fill_zero_on_oob=False` for stores; loads zero-fill out-of-bounds rows so SMEM contents are well-defined.
+        if cutlass.const_expr(not has_outer_residue and not has_inner_residue):
+            cute.copy(copy_atom, src_tile, dst_tile)
+        elif cutlass.const_expr(not has_outer_residue):
+            cute.copy(copy_atom, src_tile, dst_tile, pred=head_dim_pred)
+        else:
+            if cutlass.const_expr(is_known_boundary):
+                is_boundary = True
+            else:
+                is_boundary = cute.elem_less(outer_size, block_end)
+            if is_boundary:
+                for m in cutlass.range_constexpr(cute.size(dst_tile.shape[1])):
+                    if cute.elem_less(coord_tile[0, m, 0][1], outer_size):
+                        if cutlass.const_expr(has_inner_residue):
+                            cute.copy(
+                                copy_atom,
+                                src_tile[None, m, None],
+                                dst_tile[None, m, None],
+                                pred=head_dim_pred[None, m, None],
+                            )
+                        else:
+                            cute.copy(
+                                copy_atom,
+                                src_tile[None, m, None],
+                                dst_tile[None, m, None],
+                            )
+                    elif cutlass.const_expr(fill_zero_on_oob):
+                        dst_tile[None, m, None].fill(0)
+            else:
+                if cutlass.const_expr(has_inner_residue):
+                    cute.copy(copy_atom, src_tile, dst_tile, pred=head_dim_pred)
+                else:
+                    cute.copy(copy_atom, src_tile, dst_tile)
+
+    @cute.jit
+    def _copy_rab_tile(
+        self,
+        copy_atom,
+        src_tile,
+        dst_tile,
+        coord_tile,
+        has_q_residue: cutlass.Constexpr,
+        has_kv_residue: cutlass.Constexpr,
+        seqlen_q,
+        seqlen_kv,
+        q_block_end,
+        kv_block_end,
+        is_known_kv_interior: cutlass.Constexpr = False,
+    ):
+        # Copy a (CPY_Atom, M, N) RAB tile with optional 2D residue.
+        # Coord entries are 4-tuples from the (B, H, Q, KV) identity tensor; index 2 is q-coord, index 3 is kv-coord.
+        # `is_known_kv_interior=True` skips the kv-side runtime check when the loaded tile is guaranteed inside seqlen_kv.
+        kv_check_active = has_kv_residue and not is_known_kv_interior
+        if cutlass.const_expr(not has_q_residue and not kv_check_active):
+            cute.copy(copy_atom, src_tile, dst_tile)
+        else:
+            # Pick the runtime predicate without mixing cute.Boolean with Python bool.
+            if cutlass.const_expr(has_q_residue and kv_check_active):
+                needs_per_elem = cute.elem_less(
+                    seqlen_q, q_block_end
+                ) or cute.elem_less(seqlen_kv, kv_block_end)
+            elif cutlass.const_expr(has_q_residue):
+                needs_per_elem = cute.elem_less(seqlen_q, q_block_end)
+            else:
+                needs_per_elem = cute.elem_less(seqlen_kv, kv_block_end)
+            if needs_per_elem:
+                for m in cutlass.range_constexpr(cute.size(dst_tile.shape[1])):
+                    for n in cutlass.range_constexpr(cute.size(dst_tile.shape[2])):
+                        if cute.elem_less(
+                            coord_tile[0, m, n][2], seqlen_q
+                        ) and cute.elem_less(coord_tile[0, m, n][3], seqlen_kv):
+                            cute.copy(
+                                copy_atom,
+                                src_tile[None, m, n],
+                                dst_tile[None, m, n],
+                            )
+                        else:
+                            dst_tile[None, m, n].fill(0)
+            else:
+                cute.copy(copy_atom, src_tile, dst_tile)
+
     @cute.kernel
     def kernel(
         self,
@@ -395,7 +493,7 @@ class HSTUAttentionForwardAmpere(object):
         tVsV = gmem_thr_copy_QKV.partition_D(sV)
         # (CPY_Atom, CPY_M, CPY_N, n_block)
         tRABgRAB = gmem_tiled_copy_QKV.get_slice(tidx).partition_S(gRAB)
-        tRabsRAB = gmem_tiled_copy_QKV.get_slice(tidx).partition_D(sRAB)
+        tRABsRAB = gmem_tiled_copy_QKV.get_slice(tidx).partition_D(sRAB)
 
         # ///////////////////////////////////////////////////////////////////////////////
         # Tile MMA compute thread partitions and allocate accumulators
@@ -446,115 +544,118 @@ class HSTUAttentionForwardAmpere(object):
         tOsVt = smem_thr_copy_V.partition_S(sVt)
         tOrVt_copy_view = smem_thr_copy_V.retile(tOrVt)
         tSsRAB = smem_thr_copy_RAB.partition_S(sRAB)
+        has_head_dim_residue = self._head_dim != self._head_dim_padded
+        has_q_residue = self._seqlen_q % self._m_block_size != 0
+        has_kv_residue = self._seqlen_kv % self._n_block_size != 0
+        need_q_predicates = has_head_dim_residue or has_q_residue
+        need_kv_predicates = has_head_dim_residue or has_kv_residue
+        need_rab_predicates = has_q_residue or has_kv_residue
 
         # ///////////////////////////////////////////////////////////////////////////////
         # Predicate: Mark indices that need to copy when problem_shape isn't a multiple
         # of tile_shape
         # ///////////////////////////////////////////////////////////////////////////////
-        # Construct identity layout for Q, KV and RAB
-        mcQ = cute.make_identity_tensor(mQ.layout.shape)
-        mcKV = cute.make_identity_tensor(mK.layout.shape)
-        mcRAB = cute.make_identity_tensor(mRAB.layout.shape)
-
-        cQ = cute.local_tile(
-            mcQ[batch_size, None, num_head, None],
-            (self._m_block_size, self._head_dim_padded),
-            (m_block, 0),
-        )
-        cKV = cute.local_tile(
-            mcKV[batch_size, None, num_head, None],
-            (self._n_block_size, self._head_dim_padded),
-            (n_block, 0),
-        )
-        cRAB = cute.local_tile(
-            mcRAB[batch_size, num_head, None, None],
-            (self._m_block_size, self._n_block_size),
-            (m_block, None),
-        )
-
-        # Repeat the partitioning with identity layouts
-        tQcQ = gmem_thr_copy_QKV.partition_S(cQ)
-        tKVcKV = gmem_thr_copy_QKV.partition_S(cKV)
-        tRABcRAB = gmem_thr_copy_QKV.partition_S(cRAB)
-
-        tQpQ = cute.make_rmem_tensor(
-            cute.make_layout(
-                (
-                    tQsQ.shape[0][1],
-                    cute.size(tQsQ, mode=[1]),
-                    cute.size(tQsQ, mode=[2]),
-                ),
-                stride=(cute.size(tQsQ, mode=[2]), 0, 1),
-            ),
-            cutlass.Boolean,
-        )
-        tKVpKV = cute.make_rmem_tensor(
-            cute.make_layout(
-                (
-                    tKsK.shape[0][1],
-                    cute.size(tKsK, mode=[1]),
-                    cute.size(tKsK, mode=[2]),
-                ),
-                stride=(cute.size(tKsK, mode=[2]), 0, 1),
-            ),
-            cutlass.Boolean,
-        )
-
-        # Set predicates for head_dim bounds, seqlen_q/k/v bounds is processed at the first tile.
-        for rest_v in cutlass.range_constexpr(tQpQ.shape[0]):
-            for rest_k in cutlass.range_constexpr(tQpQ.shape[2]):
-                tQpQ[rest_v, 0, rest_k] = cute.elem_less(
-                    tQcQ[(0, rest_v), 0, rest_k][3], mQ.layout.shape[3]
+        if cutlass.const_expr(need_q_predicates):
+            mcQ = cute.make_identity_tensor(mQ.layout.shape)
+            cQ = cute.local_tile(
+                mcQ[batch_size, None, num_head, None],
+                (self._m_block_size, self._head_dim_padded),
+                (m_block, 0),
+            )
+            tQcQ = gmem_thr_copy_QKV.partition_S(cQ)
+            if cutlass.const_expr(has_head_dim_residue):
+                tQpQ = cute.make_rmem_tensor(
+                    cute.make_layout(
+                        (
+                            tQsQ.shape[0][1],
+                            cute.size(tQsQ, mode=[1]),
+                            cute.size(tQsQ, mode=[2]),
+                        ),
+                        stride=(cute.size(tQsQ, mode=[2]), 0, 1),
+                    ),
+                    cutlass.Boolean,
                 )
-        for rest_v in cutlass.range_constexpr(tKVpKV.shape[0]):
-            for rest_k in cutlass.range_constexpr(tKVpKV.shape[2]):
-                tKVpKV[rest_v, 0, rest_k] = cute.elem_less(
-                    tKVcKV[(0, rest_v), 0, rest_k][3], mK.layout.shape[3]
+                for rest_v in cutlass.range_constexpr(tQpQ.shape[0]):
+                    for rest_k in cutlass.range_constexpr(tQpQ.shape[2]):
+                        tQpQ[rest_v, 0, rest_k] = cute.elem_less(
+                            tQcQ[(0, rest_v), 0, rest_k][3], mQ.layout.shape[3]
+                        )
+
+        if cutlass.const_expr(need_kv_predicates):
+            mcKV = cute.make_identity_tensor(mK.layout.shape)
+            cKV = cute.local_tile(
+                mcKV[batch_size, None, num_head, None],
+                (self._n_block_size, self._head_dim_padded),
+                (n_block, 0),
+            )
+            tKVcKV = gmem_thr_copy_QKV.partition_S(cKV)
+            if cutlass.const_expr(has_head_dim_residue):
+                tKVpKV = cute.make_rmem_tensor(
+                    cute.make_layout(
+                        (
+                            tKsK.shape[0][1],
+                            cute.size(tKsK, mode=[1]),
+                            cute.size(tKsK, mode=[2]),
+                        ),
+                        stride=(cute.size(tKsK, mode=[2]), 0, 1),
+                    ),
+                    cutlass.Boolean,
                 )
+                for rest_v in cutlass.range_constexpr(tKVpKV.shape[0]):
+                    for rest_k in cutlass.range_constexpr(tKVpKV.shape[2]):
+                        tKVpKV[rest_v, 0, rest_k] = cute.elem_less(
+                            tKVcKV[(0, rest_v), 0, rest_k][3], mK.layout.shape[3]
+                        )
+
+        if cutlass.const_expr(need_rab_predicates):
+            mcRAB = cute.make_identity_tensor(mRAB.layout.shape)
+            cRAB = cute.local_tile(
+                mcRAB[batch_size, num_head, None, None],
+                (self._m_block_size, self._n_block_size),
+                (m_block, None),
+            )
+            tRABcRAB = gmem_thr_copy_QKV.partition_S(cRAB)
 
         # ///////////////////////////////////////////////////////////////////////////////
         # Prefetch Prologue
         # ///////////////////////////////////////////////////////////////////////////////
         # Start async loads of the last mn-tile, where we take care of the mn residue
-        for m in cutlass.range_constexpr(cute.size(tQsQ.shape[1])):
-            if cute.elem_less(tQcQ[0, m, 0][1], mQ.layout.shape[1]):
-                cute.copy(
-                    gmem_tiled_copy_QKV,
-                    tQgQ[None, m, None],
-                    tQsQ[None, m, None],
-                    pred=tQpQ[None, m, None],
-                )
-            else:
-                # Clear the smem tiles to account for predicated off loads
-                tQsQ[None, m, None].fill(0)
-
-        for n in cutlass.range_constexpr(cute.size(tKsK.shape[1])):
-            if cute.elem_less(tKVcKV[0, n, 0][1], mK.layout.shape[1]):
-                cute.copy(
-                    gmem_tiled_copy_QKV,
-                    tKgK[None, n, None, n_block],
-                    tKsK[None, n, None],
-                    pred=tKVpKV[None, n, None],
-                )
-            else:
-                # Clear the smem tiles to account for predicated off loads
-                tKsK[None, n, None].fill(0)
-
-        for m in cutlass.range_constexpr(cute.size(tRABcRAB.shape[1])):
-            for n in cutlass.range_constexpr(cute.size(tRABcRAB.shape[2])):
-                if cute.elem_less(
-                    tRABcRAB[0, m, n, n_block][1], mRAB.layout.shape[2]
-                ) and cute.elem_less(
-                    tRABcRAB[0, m, n, n_block][2], mRAB.layout.shape[3]
-                ):
-                    cute.copy(
-                        gmem_tiled_copy_QKV,
-                        tRABgRAB[None, m, n, n_block],
-                        tRabsRAB[None, m, n],
-                    )
-                else:
-                    # Clear the smem tiles to account for predicated off loads
-                    tRabsRAB[None, m, n].fill(0)
+        self._copy_with_residue(
+            gmem_tiled_copy_QKV,
+            tQgQ[None, None, None],
+            tQsQ[None, None, None],
+            tQcQ if has_q_residue else None,
+            tQpQ if has_head_dim_residue else None,
+            has_q_residue,
+            has_head_dim_residue,
+            mQ.layout.shape[1],
+            (m_block + 1) * self._m_block_size,
+        )
+        # n_block is the last n-tile by construction, so any kv-residue is in this tile.
+        self._copy_with_residue(
+            gmem_tiled_copy_QKV,
+            tKgK[None, None, None, n_block],
+            tKsK[None, None, None],
+            tKVcKV if has_kv_residue else None,
+            tKVpKV if has_head_dim_residue else None,
+            has_kv_residue,
+            has_head_dim_residue,
+            mK.layout.shape[1],
+            (n_block + 1) * self._n_block_size,
+            is_known_boundary=has_kv_residue,
+        )
+        self._copy_rab_tile(
+            gmem_tiled_copy_QKV,
+            tRABgRAB[None, None, None, n_block],
+            tRABsRAB[None, None, None],
+            tRABcRAB[None, None, None, n_block] if need_rab_predicates else None,
+            has_q_residue,
+            has_kv_residue,
+            mRAB.layout.shape[2],
+            mRAB.layout.shape[3],
+            (m_block + 1) * self._m_block_size,
+            (n_block + 1) * self._n_block_size,
+        )
         cute.arch.cp_async_commit_group()
 
         # ///////////////////////////////////////////////////////////////////////////////
@@ -565,24 +666,17 @@ class HSTUAttentionForwardAmpere(object):
             cute.arch.cp_async_wait_group(0)
             self.cta_sync_barrier.arrive_and_wait()
 
-            if n_block_idx == n_block:
-                for n in cutlass.range_constexpr(cute.size(tVsV.shape[1])):
-                    if cute.elem_less(tKVcKV[0, n, 0][1], mV.layout.shape[1]):
-                        cute.copy(
-                            gmem_tiled_copy_QKV,
-                            tVgV[None, n, None, n_block_idx],
-                            tVsV[None, n, None],
-                            pred=tKVpKV[None, n, None],
-                        )
-                    else:
-                        tVsV[None, n, None].fill(0)
-            else:
-                cute.copy(
-                    gmem_tiled_copy_QKV,
-                    tVgV[None, None, None, n_block_idx],
-                    tVsV[None, None, None],
-                    pred=tKVpKV[None, None, None],
-                )
+            self._copy_with_residue(
+                gmem_tiled_copy_QKV,
+                tVgV[None, None, None, n_block_idx],
+                tVsV[None, None, None],
+                tKVcKV if has_kv_residue else None,
+                tKVpKV if has_head_dim_residue else None,
+                has_kv_residue,
+                has_head_dim_residue,
+                mV.layout.shape[1],
+                (n_block_idx + 1) * self._n_block_size,
+            )
             cute.arch.cp_async_commit_group()
 
             acc_shape_S = thr_mma.partition_shape_C(
@@ -643,25 +737,33 @@ class HSTUAttentionForwardAmpere(object):
             self.cta_sync_barrier.arrive_and_wait()
 
             if n_block_idx > 0:
-                cute.copy(
+                # Tile (n_block_idx - 1) lies entirely inside seqlen_kv, so only the head_dim residue can apply to K.
+                self._copy_with_residue(
                     gmem_tiled_copy_QKV,
                     tKgK[None, None, None, n_block_idx - 1],
                     tKsK[None, None, None],
-                    pred=tKVpKV[None, None, None],
+                    None,
+                    tKVpKV[None, None, None] if has_head_dim_residue else None,
+                    False,
+                    has_head_dim_residue,
+                    None,
+                    None,
+                )
+                self._copy_rab_tile(
+                    gmem_tiled_copy_QKV,
+                    tRABgRAB[None, None, None, n_block_idx - 1],
+                    tRABsRAB[None, None, None],
+                    tRABcRAB[None, None, None, n_block_idx - 1]
+                    if need_rab_predicates
+                    else None,
+                    has_q_residue,
+                    has_kv_residue,
+                    mRAB.layout.shape[2],
+                    mRAB.layout.shape[3],
+                    (m_block + 1) * self._m_block_size,
+                    None,
+                    is_known_kv_interior=True,
                 )
-                # m residue handling for RAB
-                for m in cutlass.range_constexpr(cute.size(tRABcRAB.shape[1])):
-                    if cute.elem_less(
-                        tRABcRAB[0, m, 0, n_block_idx - 1][1], mRAB.layout.shape[2]
-                    ):
-                        cute.copy(
-                            gmem_tiled_copy_QKV,
-                            tRABgRAB[None, m, None, n_block_idx - 1],
-                            tRabsRAB[None, m, None],
-                        )
-                    else:
-                        tRabsRAB[None, m, None].fill(0)
-
                 cute.arch.cp_async_commit_group()
 
             # ///////////////////////////////////////////////////////////////////////////////
@@ -695,26 +797,29 @@ class HSTUAttentionForwardAmpere(object):
                 t4 = t1 / t3
                 acc_S.store(t4)
 
-            mACC = cute.make_identity_tensor(
-                (mRAB.layout.shape[2], mRAB.layout.shape[3])
-            )  # (seqlen_q, seqlen_kv)
-            cACC = cute.local_tile(
-                mACC[None, None],
-                (self._m_block_size, self._n_block_size),
-                (m_block, n_block_idx),
-            )
+            if cutlass.const_expr(self._is_causal):
+                mACC = cute.make_identity_tensor(
+                    (mRAB.layout.shape[2], mRAB.layout.shape[3])
+                )  # (seqlen_q, seqlen_kv)
+                cACC = cute.local_tile(
+                    mACC[None, None],
+                    (self._m_block_size, self._n_block_size),
+                    (m_block, n_block_idx),
+                )
 
-            if self._is_causal and (n_block - n_block_idx) < cute.ceil_div(
-                self._m_block_size, self._n_block_size
-            ):
-                tACCcACC = thr_mma.partition_C(cACC)
-                for i in cutlass.range_constexpr(cute.size(tACCcACC.shape[0])):
-                    for j in cutlass.range_constexpr(cute.size(tACCcACC.shape[1])):
-                        for k in cutlass.range_constexpr(cute.size(tACCcACC.shape[2])):
-                            if cute.elem_less(
-                                tACCcACC[i, j, k][0], tACCcACC[i, j, k][1]
+                if (n_block - n_block_idx) < cute.ceil_div(
+                    self._m_block_size, self._n_block_size
+                ):
+                    tACCcACC = thr_mma.partition_C(cACC)
+                    for i in cutlass.range_constexpr(cute.size(tACCcACC.shape[0])):
+                        for j in cutlass.range_constexpr(cute.size(tACCcACC.shape[1])):
+                            for k in cutlass.range_constexpr(
+                                cute.size(tACCcACC.shape[2])
                             ):
-                                acc_S[i, j, k] = 0.0
+                                if cute.elem_less(
+                                    tACCcACC[i, j, k][0], tACCcACC[i, j, k][1]
+                                ):
+                                    acc_S[i, j, k] = 0.0
 
             rP = cute.make_rmem_tensor_like(acc_S, self._dtype)
             rP.store(acc_S.load().to(self._dtype))
@@ -803,35 +908,39 @@ class HSTUAttentionForwardAmpere(object):
             tOsO,
             tOrO,
         )
-        # predicate for O
-        mcO = cute.make_identity_tensor(mO.layout.shape)
-        cO = cute.local_tile(
-            mcO[batch_size, None, num_head, None],
-            (self._m_block_size, self._head_dim_padded),
-            (m_block, 0),
-        )
-        tOcO = gmem_thr_copy_O.partition_D(cO)
-        tOpO = cute.make_rmem_tensor(
-            cute.make_layout(
-                (tOgO.shape[0][1], tOgO.shape[1], tOgO.shape[2]),
-                stride=(tOgO.shape[2], 0, 1),
-            ),
-            cutlass.Boolean,
-        )
-        for rest_v in cutlass.range_constexpr(tOpO.shape[0]):
-            for rest_n in cutlass.range_constexpr(cute.size(tOpO.shape[2])):
-                tOpO[rest_v, 0, rest_n] = cute.elem_less(
-                    tOcO[(0, rest_v), 0, rest_n][3], mO.layout.shape[3]
-                )
-        # copy acc O from rmem to gmem
-        for rest_m in cutlass.range_constexpr(cute.size(tOpO.shape[1])):
-            if cute.elem_less(tOcO[0, rest_m, 0][1], mO.layout.shape[1]):
-                cute.copy(
-                    gmem_tiled_copy_O,
-                    tOrO[None, rest_m, None],
-                    tOgO[None, rest_m, None],
-                    pred=tOpO[None, rest_m, None],
+        if cutlass.const_expr(need_q_predicates):
+            mcO = cute.make_identity_tensor(mO.layout.shape)
+            cO = cute.local_tile(
+                mcO[batch_size, None, num_head, None],
+                (self._m_block_size, self._head_dim_padded),
+                (m_block, 0),
+            )
+            tOcO = gmem_thr_copy_O.partition_D(cO)
+            if cutlass.const_expr(has_head_dim_residue):
+                tOpO = cute.make_rmem_tensor(
+                    cute.make_layout(
+                        (tOgO.shape[0][1], tOgO.shape[1], tOgO.shape[2]),
+                        stride=(tOgO.shape[2], 0, 1),
+                    ),
+                    cutlass.Boolean,
                 )
+                for rest_v in cutlass.range_constexpr(tOpO.shape[0]):
+                    for rest_n in cutlass.range_constexpr(cute.size(tOpO.shape[2])):
+                        tOpO[rest_v, 0, rest_n] = cute.elem_less(
+                            tOcO[(0, rest_v), 0, rest_n][3], mO.layout.shape[3]
+                        )
+        self._copy_with_residue(
+            gmem_tiled_copy_O,
+            tOrO[None, None, None],
+            tOgO[None, None, None],
+            tOcO if has_q_residue else None,
+            tOpO if has_head_dim_residue else None,
+            has_q_residue,
+            has_head_dim_residue,
+            mO.layout.shape[1],
+            (m_block + 1) * self._m_block_size,
+            fill_zero_on_oob=False,
+        )
 
 
 def run_pytorch_hstu_test(
diff --git a/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py b/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py
index cb94029af..95d40a3a3 100644
--- a/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py
+++ b/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py
@@ -27,7 +27,6 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-import enum
 import math
 import os
 import sys
@@ -42,12 +41,20 @@ import cutlass
 import cutlass.cute as cute
 import cutlass.cute.testing as testing
 from cutlass.cute.nvgpu import cpasync, tcgen05
+from cutlass.cute.nvgpu.common import OperandMajorMode
 import cutlass.utils as utils
 import cutlass.pipeline as pipeline
 import cutlass.torch as cutlass_torch
 import cutlass.utils.blackwell_helpers as sm100_utils
 from cutlass.cute.runtime import from_dlpack
-from cutlass.cute.typing import Int32, Float32, Float8E4M3FN, Float16, BFloat16, Boolean
+from cutlass.cute.typing import (
+    Int32,
+    Float32,
+    Float8E4M3FN,
+    Float16,
+    BFloat16,
+    Boolean,
+)
 
 if __name__ == "__main__":
     current_dir = os.path.dirname(os.path.abspath(__file__))
@@ -74,7 +81,7 @@ To run this example:
 
 .. code-block:: bash
 
-    python examples/blackwell/fmha_bwd.py \\
+    python examples/cute/blackwell/kernel/attention/fmha/fmha_bwd.py \\
         --s_q_max 1024 --s_k_max 1024 \\
         --h_q 8 --h_k 8 --d 128 --b 1 \\
         --element_dtype float16 --acc_dtype float32 \\
@@ -193,6 +200,22 @@ class BlackwellFusedMultiHeadAttentionBackward:
             barrier_id=5,
             num_threads=self.num_reduce_warps * self.threads_per_warp,
         )
+        self.load_reduce_tma_sync_barrier_0 = pipeline.NamedBarrier(
+            barrier_id=6,
+            num_threads=2 * self.threads_per_warp,
+        )
+        self.load_reduce_tma_sync_barrier_1 = pipeline.NamedBarrier(
+            barrier_id=7,
+            num_threads=2 * self.threads_per_warp,
+        )
+        self.load_reduce_tma_sync_barrier_2 = pipeline.NamedBarrier(
+            barrier_id=8,
+            num_threads=2 * self.threads_per_warp,
+        )
+        self.load_reduce_tma_sync_barrier_3 = pipeline.NamedBarrier(
+            barrier_id=9,
+            num_threads=2 * self.threads_per_warp,
+        )
 
         self.tmem_dK_offset = 0
         self.tmem_dV_offset = self.tmem_dK_offset + mma_tiler[2]
@@ -200,8 +223,8 @@ class BlackwellFusedMultiHeadAttentionBackward:
         self.tmem_dP_offset = self.tmem_dQ_offset
         self.tmem_S_offset = self.tmem_dQ_offset + max(mma_tiler[0], mma_tiler[2])
 
-        self.num_regs_reduce = 152
-        self.num_regs_compute = 128
+        self.num_regs_reduce = 144
+        self.num_regs_compute = 136
         self.num_regs_mma = 96
         self.num_regs_empty = 96
         self.num_regs_load = 96
@@ -344,17 +367,17 @@ class BlackwellFusedMultiHeadAttentionBackward:
         self.dO_major_mode = utils.LayoutEnum.from_tensor(dO).mma_major_mode()
         self.dQ_layout = utils.LayoutEnum.from_tensor(dQ)
 
-        if cutlass.const_expr(self.Q_major_mode != tcgen05.OperandMajorMode.K):
+        if cutlass.const_expr(self.Q_major_mode != OperandMajorMode.K):
             raise RuntimeError("The layout of q is not supported")
-        if cutlass.const_expr(self.dQ_major_mode != tcgen05.OperandMajorMode.K):
+        if cutlass.const_expr(self.dQ_major_mode != OperandMajorMode.K):
             raise RuntimeError("The layout of dq is not supported")
-        if cutlass.const_expr(self.K_major_mode != tcgen05.OperandMajorMode.K):
+        if cutlass.const_expr(self.K_major_mode != OperandMajorMode.K):
             raise RuntimeError("The layout of k is not supported")
-        if cutlass.const_expr(self.dK_major_mode != tcgen05.OperandMajorMode.K):
+        if cutlass.const_expr(self.dK_major_mode != OperandMajorMode.K):
             raise RuntimeError("The layout of dk is not supported")
-        if cutlass.const_expr(self.V_major_mode != tcgen05.OperandMajorMode.K):
+        if cutlass.const_expr(self.V_major_mode != OperandMajorMode.K):
             raise RuntimeError("The layout of v is not supported")
-        if cutlass.const_expr(self.dV_major_mode != tcgen05.OperandMajorMode.K):
+        if cutlass.const_expr(self.dV_major_mode != OperandMajorMode.K):
             raise RuntimeError("The layout of dv is not supported")
 
         self._setup_attributes()
@@ -364,8 +387,9 @@ class BlackwellFusedMultiHeadAttentionBackward:
         # compute S
         KQ_tiled_mma = sm100_utils.make_trivial_tiled_mma(
             self.element_dtype,
-            tcgen05.OperandMajorMode.K,
-            tcgen05.OperandMajorMode.K,
+            self.element_dtype,
+            OperandMajorMode.K,
+            OperandMajorMode.K,
             self.acc_dtype,
             cta_group,
             self.KQ_mma_tiler[:2],
@@ -373,8 +397,9 @@ class BlackwellFusedMultiHeadAttentionBackward:
         # compute dP
         VdO_tiled_mma = sm100_utils.make_trivial_tiled_mma(
             self.element_dtype,
-            tcgen05.OperandMajorMode.K,
-            tcgen05.OperandMajorMode.K,
+            self.element_dtype,
+            OperandMajorMode.K,
+            OperandMajorMode.K,
             self.acc_dtype,
             cta_group,
             self.VdO_mma_tiler[:2],
@@ -382,8 +407,9 @@ class BlackwellFusedMultiHeadAttentionBackward:
         # compute dV
         PdO_tiled_mma = sm100_utils.make_trivial_tiled_mma(
             self.element_dtype,
-            tcgen05.OperandMajorMode.K,
-            tcgen05.OperandMajorMode.MN,
+            self.element_dtype,
+            OperandMajorMode.K,
+            OperandMajorMode.MN,
             self.acc_dtype,
             cta_group,
             self.PdO_mma_tiler[:2],
@@ -392,8 +418,9 @@ class BlackwellFusedMultiHeadAttentionBackward:
         # compute dK
         dSQ_tiled_mma = sm100_utils.make_trivial_tiled_mma(
             self.element_dtype,
-            tcgen05.OperandMajorMode.K,
-            tcgen05.OperandMajorMode.MN,
+            self.element_dtype,
+            OperandMajorMode.K,
+            OperandMajorMode.MN,
             self.acc_dtype,
             cta_group,
             self.dSQ_mma_tiler[:2],
@@ -401,8 +428,9 @@ class BlackwellFusedMultiHeadAttentionBackward:
         # compute dQ
         dSK_tiled_mma = sm100_utils.make_trivial_tiled_mma(
             self.element_dtype,
-            tcgen05.OperandMajorMode.MN,
-            tcgen05.OperandMajorMode.MN,
+            self.element_dtype,
+            OperandMajorMode.MN,
+            OperandMajorMode.MN,
             self.acc_dtype,
             cta_group,
             self.dSK_mma_tiler[:2],
@@ -483,7 +511,7 @@ class BlackwellFusedMultiHeadAttentionBackward:
 
         dQ_smem_layout_atom = sm100_utils.make_smem_layout_atom(
             sm100_utils.get_smem_layout_atom_ab(
-                tcgen05.OperandMajorMode.K,
+                OperandMajorMode.K,
                 self.acc_dtype,
                 (self.tile_shape_Q, 32),
             ),
@@ -844,9 +872,7 @@ class BlackwellFusedMultiHeadAttentionBackward:
         LSE_smem_layout: cute.Layout,
         sum_OdO_smem_layout: cute.Layout,
     ):
-        tidx, tidy, tidz = cute.arch.thread_idx()
         bidx, bidy, bidz = cute.arch.block_idx()
-        grid_dim_x, grid_dim_y, grid_dim_z = cute.arch.grid_dim()
         warp_idx = cute.arch.make_warp_uniform(cute.arch.warp_idx())
 
         if warp_idx == self.load_warp_id:
@@ -1265,12 +1291,10 @@ class BlackwellFusedMultiHeadAttentionBackward:
         # (load_mma_Q_pipeline, load_compute_LSE_pipeline, load_mma_dO_pipeline, load_compute_sum_OdO_pipeline)
         pipeline_args: tuple,
     ):
-        tidx, tidy, tidz = cute.arch.thread_idx()
+        tidx = cute.arch.thread_idx()[0]
         blk_coord_k, blk_coord_h_k, blk_coord_b = cute.arch.block_idx()
         blk_coord_h_r = Int32(0)
         blk_coord_h = (blk_coord_h_r, blk_coord_h_k)
-        seq_Q, seq_K, D, HB = problem_shape
-        H, B = HB
         iter_index = iter_start
         (
             load_mma_Q_pipeline,
@@ -1362,6 +1386,7 @@ class BlackwellFusedMultiHeadAttentionBackward:
         load_compute_sum_OdO_producer_state = pipeline.make_pipeline_state(
             pipeline.PipelineUserType.Producer, self.load_compute_sum_OdO_stage
         )
+
         load_mma_Q_pipeline.producer_acquire(load_mma_Q_producer_state)
         tma_barrier = load_mma_Q_pipeline.producer_get_barrier(
             load_mma_Q_producer_state
@@ -1396,7 +1421,7 @@ class BlackwellFusedMultiHeadAttentionBackward:
 
         async_copy_num_elts = sLSE.shape[0] // self.threads_per_warp
         atom_async_copy = cute.make_copy_atom(
-            cpasync.CopyG2SOp(cache_mode=cpasync.LoadCacheMode.ALWAYS),
+            cpasync.CopyG2SOp(cache_mode=cute.nvgpu.LoadCacheMode.ALWAYS),
             self.acc_dtype,
             num_bits_per_copy=self.acc_dtype.width,
         )
@@ -1486,6 +1511,7 @@ class BlackwellFusedMultiHeadAttentionBackward:
         iter_count -= 1
         iter_index += 1
 
+        load_reduce_tma_sync_phase = Int32(0)
         while iter_count > 0:
             if iter_index == iter_end:
                 iter_index = iter_start
@@ -1507,6 +1533,9 @@ class BlackwellFusedMultiHeadAttentionBackward:
 
             load_mma_Q_producer_state.advance()
 
+            self.load_reduce_tma_sync_arrive(load_reduce_tma_sync_phase)
+            load_reduce_tma_sync_phase += 1
+
             load_compute_LSE_pipeline.producer_acquire(load_compute_LSE_producer_state)
 
             # Load LSE
@@ -1589,6 +1618,30 @@ class BlackwellFusedMultiHeadAttentionBackward:
             iter_count -= 1
             iter_index += 1
 
+    @cute.jit
+    def load_reduce_tma_sync_arrive(self, phase: Int32):
+        phase_mod = phase % 4
+        if phase_mod == 0:
+            self.load_reduce_tma_sync_barrier_0.arrive()
+        elif phase_mod == 1:
+            self.load_reduce_tma_sync_barrier_1.arrive()
+        elif phase_mod == 2:
+            self.load_reduce_tma_sync_barrier_2.arrive()
+        else:
+            self.load_reduce_tma_sync_barrier_3.arrive()
+
+    @cute.jit
+    def load_reduce_tma_sync_wait(self, phase: Int32):
+        phase_mod = phase % 4
+        if phase_mod == 0:
+            self.load_reduce_tma_sync_barrier_0.arrive_and_wait()
+        elif phase_mod == 1:
+            self.load_reduce_tma_sync_barrier_1.arrive_and_wait()
+        elif phase_mod == 2:
+            self.load_reduce_tma_sync_barrier_2.arrive_and_wait()
+        else:
+            self.load_reduce_tma_sync_barrier_3.arrive_and_wait()
+
     @cute.jit
     def mma(
         self,
@@ -1909,11 +1962,10 @@ class BlackwellFusedMultiHeadAttentionBackward:
         # (mma_compute_S_pipeline, compute_mma_P_pipeline, load_compute_LSE_pipeline, load_compute_sum_OdO_pipeline, mma_compute_dP_pipeline, compute_mma_dS_pipeline, mma_compute_dKdV_pipeline)
         pipeline_args: tuple,
     ):
-        tidx, tidy, tidz = cute.arch.thread_idx()
-        bidx, bidy, bidz = cute.arch.block_idx()
+        tidx = cute.arch.thread_idx()[0]
 
-        Q, K, D, HB = problem_shape
-        blk_coord_q, blk_coord_k, blk_coord_d, blk_coord_batch = blk_coord
+        Q, K = problem_shape[0], problem_shape[1]
+        blk_coord_k, blk_coord_batch = blk_coord[1], blk_coord[3]
         iter_index = iter_start
         (
             mma_compute_S_pipeline,
@@ -2205,10 +2257,9 @@ class BlackwellFusedMultiHeadAttentionBackward:
         # (mma_reduce_dQ_pipeline, reduce_tma_store_pipeline)
         pipeline_args: tuple,
     ):
-        tidx, tidy, tidz = cute.arch.thread_idx()
+        tidx = cute.arch.thread_idx()[0]
         warp_idx = cute.arch.make_warp_uniform(cute.arch.warp_idx())
-        Q, K, D, HB = problem_shape
-        blk_coord_q, blk_coord_k, blk_coord_d, blk_coord_batch = blk_coord
+        blk_coord_batch = blk_coord[3]
 
         blk_coord_h, blk_coord_b = blk_coord_batch
         blk_coord_h_r, blk_coord_h_k = blk_coord_h
@@ -2241,7 +2292,6 @@ class BlackwellFusedMultiHeadAttentionBackward:
         thr_t2r = tiled_t2r.get_slice(thread_idx)
 
         tTR_cdQ = thr_t2r.partition_D(cdQ)
-        tTR_gdQ = thr_t2r.partition_D(gdQ)
         tTR_sdQ = thr_t2r.partition_D(sdQ)
         tTR_tdQ = thr_t2r.partition_S(tdQtdQ)
 
@@ -2253,6 +2303,7 @@ class BlackwellFusedMultiHeadAttentionBackward:
             cute.group_modes(gdQ, 0, 2),
         )
 
+        load_reduce_tma_sync_phase = Int32(0)
         while iter_count > 0:
             mma_reduce_dQ_pipeline.consumer_wait(mma_reduce_dQ_consumer_state)
 
@@ -2286,6 +2337,10 @@ class BlackwellFusedMultiHeadAttentionBackward:
                 self.reduce_sync_barrier.arrive_and_wait()
 
                 if warp_idx == 0:
+                    if iter_count > 1 and i == 0:
+                        self.load_reduce_tma_sync_wait(load_reduce_tma_sync_phase)
+                        load_reduce_tma_sync_phase += 1
+
                     cute.copy(
                         tma_atom_dQ_acc,
                         tdQsdQ[None, reduce_tma_store_producer_state.index],
@@ -2346,7 +2401,6 @@ class BlackwellFusedMultiHeadAttentionBackward:
         input: cute.Tensor,
         frg_cnt: Int32,
     ) -> cute.Tensor:
-        tidx, tidy, tidz = cute.arch.thread_idx()
         output = cute.make_rmem_tensor(input.shape, self.element_dtype)
         frg_tile = cute.size(input) // frg_cnt
         t_frg = cute.logical_divide(input, cute.make_layout(frg_cnt))
@@ -2406,9 +2460,9 @@ class BlackwellFusedMultiHeadAttentionBackward:
         # (mma_compute_dKdV_pipeline, mma_compute_dKdV_consumer_state)
         pipeline_args: tuple,
     ):
-        tidx, tidy, tidz = cute.arch.thread_idx()
-        Q, K, D, HB = problem_shape
-        blk_coord_q, blk_coord_k, blk_coord_d, blk_coord_batch = blk_coord
+        tidx = cute.arch.thread_idx()[0]
+        K, D, HB = problem_shape[1], problem_shape[2], problem_shape[3]
+        blk_coord_k, blk_coord_batch = blk_coord[1], blk_coord[3]
         mma_compute_dKdV_pipeline, mma_compute_dKdV_consumer_state = pipeline_args
 
         load_op = cute.make_copy_atom(
@@ -2511,7 +2565,7 @@ class BlackwellFusedMultiHeadAttentionBackward:
         workspace: cute.Tensor,
         acc_dtype: Type[cutlass.Numeric],
     ) -> Tuple[cute.Tensor, cute.Tensor, cute.Tensor]:
-        Q, D, HB = (
+        Q, D, _HB = (
             problem_shape[0],
             problem_shape[2],
             problem_shape[3],
@@ -2524,7 +2578,7 @@ class BlackwellFusedMultiHeadAttentionBackward:
         acc_bytes = acc_dtype.width // 8
         sum_OdO_bytes = cute.assume(B * H * Q * acc_bytes, divby=acc_bytes)
         scaled_lse_bytes = cute.assume(B * H * Q * acc_bytes, divby=acc_bytes)
-        dQ_acc_bytes = cute.assume(B * H * Q * D * acc_bytes, divby=acc_bytes)
+        cute.assume(B * H * Q * D * acc_bytes, divby=acc_bytes)
 
         sum_OdO_iter = workspace.iterator
         scaled_lse_iter = sum_OdO_iter + sum_OdO_bytes
@@ -2899,7 +2953,9 @@ def run(
             window_size_right,
             bottom_right_align,
         ):
-            raise ValueError("sliding window doesn't support current setting")
+            raise testing.CantImplementError(
+                "sliding window doesn't support current setting"
+            )
 
     # create sequence lengths for variable length inputs
     cumulative_s_q = [0]
@@ -2985,10 +3041,10 @@ def run(
 
     lse_ref = cutlass_torch.create_and_permute_torch_tensor(
         (b, h_k, h_r, s_q),
-        cutlass.torch.dtype(acc_dtype),
+        cutlass_torch.dtype(acc_dtype),
         permute_order=(3, 2, 1, 0),
-        init_type=cutlass.torch.TensorInitType.RANDOM,
-        init_config=cutlass.torch.RandomInitConfig(min_val=10, max_val=11),
+        init_type=cutlass_torch.TensorInitType.RANDOM,
+        init_config=cutlass_torch.RandomInitConfig(min_val=10, max_val=11),
     )
     lse_torch = lse_ref.cuda()
     lse_tensor = from_dlpack(lse_torch, assumed_align=16)
@@ -2997,7 +3053,11 @@ def run(
     mma_tiler = (*mma_tiler_mn, d)
 
     fmha_bwd = BlackwellFusedMultiHeadAttentionBackward(
-        element_dtype, acc_dtype, mma_tiler, varlen, mask_type
+        element_dtype,
+        acc_dtype,
+        mma_tiler,
+        varlen,
+        mask_type,
     )
 
     workspace_size = BlackwellFusedMultiHeadAttentionBackward._get_workspace_size(
@@ -3112,11 +3172,11 @@ def run(
         torch.cuda.synchronize()
         print("Verifying results...")
 
-        q_ref = q_ref.cuda().to(cutlass.torch.dtype(element_dtype))
-        k_ref = k_ref.cuda().to(cutlass.torch.dtype(element_dtype))
-        v_ref = v_ref.cuda().to(cutlass.torch.dtype(element_dtype))
-        o_ref = o_ref.cuda().to(cutlass.torch.dtype(element_dtype))
-        do_ref = do_ref.cuda().to(cutlass.torch.dtype(element_dtype))
+        q_ref = q_ref.cuda().to(cutlass_torch.dtype(element_dtype))
+        k_ref = k_ref.cuda().to(cutlass_torch.dtype(element_dtype))
+        v_ref = v_ref.cuda().to(cutlass_torch.dtype(element_dtype))
+        o_ref = o_ref.cuda().to(cutlass_torch.dtype(element_dtype))
+        do_ref = do_ref.cuda().to(cutlass_torch.dtype(element_dtype))
         dv = dv_torch.to(dtype=torch.float32)
         dk = dk_torch.to(dtype=torch.float32)
         dq = dq_torch.to(dtype=torch.float32)
@@ -3227,10 +3287,10 @@ def run(
 
         lse_ref_new = cutlass_torch.create_and_permute_torch_tensor(
             (b, h_k, h_r, s_q),
-            cutlass.torch.dtype(acc_dtype),
+            cutlass_torch.dtype(acc_dtype),
             permute_order=(3, 2, 1, 0),
-            init_type=cutlass.torch.TensorInitType.RANDOM,
-            init_config=cutlass.torch.RandomInitConfig(min_val=10, max_val=11),
+            init_type=cutlass_torch.TensorInitType.RANDOM,
+            init_config=cutlass_torch.RandomInitConfig(min_val=10, max_val=11),
         )
         lse_torch_new = lse_ref_new.cuda()
         lse_tensor_new = from_dlpack(lse_torch_new, assumed_align=16)
diff --git a/examples/python/CuTeDSL/cute/blackwell/kernel/dense_gemm/dense_gemm_persistent_dynamic.py b/examples/python/CuTeDSL/cute/blackwell/kernel/dense_gemm/dense_gemm_persistent_dynamic.py
index d4ce020ed..b0504f8fd 100644
--- a/examples/python/CuTeDSL/cute/blackwell/kernel/dense_gemm/dense_gemm_persistent_dynamic.py
+++ b/examples/python/CuTeDSL/cute/blackwell/kernel/dense_gemm/dense_gemm_persistent_dynamic.py
@@ -27,18 +27,21 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import argparse
-from typing import Optional, Tuple, Type, Union
+from typing import Optional, Tuple, Type, Union, Literal
 from functools import lru_cache
 import cuda.bindings.driver as cuda
 
 import cutlass
 import cutlass.cute as cute
-import cutlass.cute.testing as testing
+from cutlass import testing
 import cutlass.utils as utils
 from cutlass.utils import is_fp8_dtype, create_cute_tensor_for_fp8
 import cutlass.pipeline as pipeline
 from cutlass.pipeline import pipeline_init_arrive, pipeline_init_wait
 from cutlass.cute.nvgpu import cpasync, tcgen05
+from cutlass.cute.arch.smem import map_dsmem_ptr
+from cutlass.cute.nvgpu.cpasync import CopyDsmemStoreOp
+
 
 """
 A high-performance cluster launch control(CLC) dynamic persistent batched dense GEMM example
@@ -83,7 +86,7 @@ Input arguments to this example is same as dense_gemm.py.
 
 .. code-block:: bash
 
-    python examples/blackwell/dense_gemm_persistent_dynamic.py                          \
+    python examples/cute/blackwell/kernel/dense_gemm/dense_gemm_persistent_dynamic.py                          \
       --ab_dtype Float16 --c_dtype Float16 --acc_dtype Float32                  \
       --mma_tiler_mn 256,128 --cluster_shape_mn 2,1                             \
       --mnkl 8192,8192,8192,1                                                   \
@@ -93,7 +96,7 @@ To collect performance with NCU profiler:
 
 .. code-block:: bash
 
-    ncu python examples/blackwell/dense_gemm_persistent_dynamic.py                     \
+    ncu python examples/cute/blackwell/kernel/dense_gemm/dense_gemm_persistent_dynamic.py                     \
       --ab_dtype Float16 --c_dtype Float16 --acc_dtype Float32                 \
       --mma_tiler_mn 256,128 --cluster_shape_mn 2,1                            \
       --mnkl 8192,8192,8192,1                                                  \
@@ -226,7 +229,7 @@ class PersistentDenseGemmKernel:
     :note: Supported C data types:
         - Float32 (for float32 and int32 accumulator data types)
         - Int32 (for float32 and int32 accumulator data types)
-        - Float16/BFloat16 (for fp16 and fp8 accumulator data types)
+        - Float16/BFloat16 (for fp32, fp16, and fp8 accumulator data types)
         - Int8/Uint8 (for uint8/int8 accumulator data types)
         - Float8E4M3FN/Float8E5M2 (for float32 accumulator data types)
 
@@ -241,9 +244,10 @@ class PersistentDenseGemmKernel:
             acc_dtype=cutlass.Float32,
             use_2cta_instrs=True,
             mma_tiler_mn=(128, 128),
-            cluster_shape_mn=(2, 2)
+            cluster_shape_mn=(2, 2),
+            use_tma_store=True
         )
-        gemm(a, b, c, max_active_clusters, stream)
+        gemm(a, b, c, stream, epilogue_op)
     """
 
     def __init__(
@@ -253,6 +257,9 @@ class PersistentDenseGemmKernel:
         mma_tiler_mn: Tuple[int, int],
         cluster_shape_mn: Tuple[int, int],
         use_tma_store: bool,
+        swizzle_size: int = 1,
+        raster_along: Literal["m", "n"] = "m",
+        split_k: int = 1,
     ):
         """Initializes the configuration for a Blackwell dense GEMM kernel.
 
@@ -270,6 +277,10 @@ class PersistentDenseGemmKernel:
         3. Output C tensor store mode:
             - use_tma_store: Boolean indicating whether to use Tensor Memory Access (TMA) for storing results.
 
+        4. Cluster Split-K:
+            - split_k: Number of CTAs along cluster Z that split the K dimension.
+              Each CTA computes a K-partition and they reduce via DSMEM scatter.
+
         :param acc_dtype: Data type of the accumulator.
         :type acc_dtype: type[cutlass.Numeric]
         :param mma_tiler_mn: Tuple (M, N) shape of the MMA instruction.
@@ -280,11 +291,16 @@ class PersistentDenseGemmKernel:
         :type cluster_shape_mn: Tuple[int, int]
         :param use_tma_store: Use Tensor Memory Access (TMA) or normal store for output C tensor.
         :type use_tma_store: bool
+        :param split_k: Cluster split-K factor. CTAs along cluster Z split K and reduce via DSMEM.
+        :type split_k: int
         """
 
         self.acc_dtype: Type[cutlass.Numeric] = acc_dtype
         self.use_2cta_instrs = use_2cta_instrs
         self.cluster_shape_mn = cluster_shape_mn
+        self.swizzle_size = swizzle_size
+        self.raster_along = raster_along
+        self.split_k = split_k
         # K dimension is deferred in _setup_attributes
         self.mma_tiler_mn = mma_tiler_mn
         self.mma_tiler = (*mma_tiler_mn, 1)
@@ -314,9 +330,18 @@ class PersistentDenseGemmKernel:
         self.tmem_alloc_sync_bar_id = 2
         self.tmem_dealloc_sync_bar_id = 3
 
+        # DSMEM mailbox sizing: each of 128 epilogue threads owns
+        # (cta_m * cta_n / 128) accumulator elems, sharded split_k ways.
+        # use_2cta_instrs halves the per-CTA M tile.
+        cta_m = mma_tiler_mn[0] // (2 if use_2cta_instrs else 1)
+        self._mailbox_elems_per_thread = (cta_m * mma_tiler_mn[1]) // 128
+        shard_ept = self._mailbox_elems_per_thread // max(split_k, 1)
+        self._mailbox_total_elems = max(split_k - 1, 0) * 128 * shard_ept
+
     def _create_tiled_mma(self):
         return utils.sm100.make_trivial_tiled_mma(
             self.a_dtype,
+            self.b_dtype,
             self.a_major_mode,
             self.b_major_mode,
             self.acc_dtype,
@@ -355,12 +380,18 @@ class PersistentDenseGemmKernel:
             self.mma_tiler[2],
         )
 
-        # Compute cluster layout
+        # Compute cluster layout (M,N only -- used for multicast CTA counts and TMA partitioning)
         self.cluster_layout_vmnk = cute.tiled_divide(
             cute.make_layout((*self.cluster_shape_mn, 1)),
             (tiled_mma.thr_id.shape,),
         )
 
+        # Full cluster layout including split_k Z-dimension (used for CLC pipeline signaling)
+        self.clc_cluster_layout_vmnk = cute.tiled_divide(
+            cute.make_layout((*self.cluster_shape_mn, self.split_k)),
+            (tiled_mma.thr_id.shape,),
+        )
+
         # Compute number of multicast CTAs for A/B
         self.num_mcast_ctas_a = cute.size(self.cluster_layout_vmnk.shape[2])
         self.num_mcast_ctas_b = cute.size(self.cluster_layout_vmnk.shape[1])
@@ -386,6 +417,13 @@ class PersistentDenseGemmKernel:
 
         self.smem_capacity = utils.get_smem_capacity_in_bytes()
 
+        # Reserve SMEM for DSMEM mailbox when using split-K
+        dsmem_mailbox_bytes = 0
+        if self.split_k > 1:
+            dsmem_mailbox_bytes = (
+                self._mailbox_total_elems * 4 + 16
+            )  # Float32 data (4 B each) + two 8 B mbarriers
+
         # Setup A/B/C stage count in shared memory and ACC stage count in tensor memory
         self.num_acc_stage, self.num_ab_stage, self.num_c_stage = _compute_stages(
             tiled_mma,
@@ -393,7 +431,7 @@ class PersistentDenseGemmKernel:
             self.a_dtype,
             self.b_dtype,
             self.c_dtype,
-            self.smem_capacity,
+            self.smem_capacity - dsmem_mailbox_bytes,
             self.occupancy,
             self.use_tma_store,
             c_smem_layout,
@@ -428,7 +466,6 @@ class PersistentDenseGemmKernel:
         a: cute.Tensor,
         b: cute.Tensor,
         c: cute.Tensor,
-        max_active_clusters: cutlass.Constexpr,
         stream: cuda.CUstream,
         epilogue_op: cutlass.Constexpr = lambda x: x,
     ):
@@ -445,8 +482,6 @@ class PersistentDenseGemmKernel:
         :type b: cute.Tensor
         :param c: Output tensor C
         :type c: cute.Tensor
-        :param max_active_clusters: Maximum number of active clusters
-        :type max_active_clusters: cutlass.Constexpr
         :param stream: CUDA stream for asynchronous execution
         :type stream: cuda.CUstream
         :param epilogue_op: Optional elementwise lambda function to apply to the output tensor
@@ -524,7 +559,12 @@ class PersistentDenseGemmKernel:
 
         # Compute grid size
         self.tile_sched_params, grid = self._compute_grid(
-            c, self.cta_tile_shape_mnk, self.cluster_shape_mn
+            c,
+            self.cta_tile_shape_mnk,
+            self.cluster_shape_mn,
+            self.swizzle_size,
+            self.raster_along,
+            self.split_k,
         )
 
         # Launch the kernel synchronously
@@ -537,6 +577,7 @@ class PersistentDenseGemmKernel:
             tma_atom_c,
             tma_tensor_c if self.use_tma_store else c,
             self.cluster_layout_vmnk,
+            self.clc_cluster_layout_vmnk,
             self.a_smem_layout_staged,
             self.b_smem_layout_staged,
             self.c_smem_layout_staged,
@@ -546,7 +587,7 @@ class PersistentDenseGemmKernel:
         ).launch(
             grid=grid,
             block=[self.threads_per_cta, 1, 1],
-            cluster=(*self.cluster_shape_mn, 1),
+            cluster=(*self.cluster_shape_mn, self.split_k),
             stream=stream,
         )
         return
@@ -563,6 +604,7 @@ class PersistentDenseGemmKernel:
         tma_atom_c: Optional[cute.CopyAtom],
         mC_mnl: cute.Tensor,
         cluster_layout_vmnk: cute.Layout,
+        clc_cluster_layout_vmnk: cute.Layout,
         a_smem_layout_staged: cute.ComposedLayout,
         b_smem_layout_staged: cute.ComposedLayout,
         c_smem_layout_staged: Union[cute.Layout, cute.ComposedLayout, None],
@@ -598,7 +640,19 @@ class PersistentDenseGemmKernel:
             cute.arch.block_idx_in_cluster()
         )
         is_first_cta_in_cluster = cta_rank_in_cluster == 0
+        # When split_k>1 the physical cluster extends along Z; fold out the Z stride
+        # to get the MN-only rank for TMA partition. split_rank is this CTA's Z position.
+        mn_cluster_size = self.cluster_shape_mn[0] * self.cluster_shape_mn[1]
+        cta_rank_in_mn_cluster = cta_rank_in_cluster % mn_cluster_size
+        (
+            block_in_cluster_x,
+            block_in_cluster_y,
+            split_rank,
+        ) = cute.arch.block_in_cluster_idx()
         block_in_cluster_coord_vmnk = cluster_layout_vmnk.get_flat_coord(
+            cta_rank_in_mn_cluster
+        )
+        full_block_in_cluster_coord_vmnk = clc_cluster_layout_vmnk.get_flat_coord(
             cta_rank_in_cluster
         )
         # Coord inside cta
@@ -617,7 +671,20 @@ class PersistentDenseGemmKernel:
             tmem_dealloc_mbar: cutlass.Int64
             tmem_holding_buf: cutlass.Int32
             clc_mbar_ptr: cute.struct.MemRange[cutlass.Int64, 2]
-            clc_response: cute.struct.MemRange[cutlass.Int32, 4]
+            clc_response_align_bytes = self.num_clc_response_bytes
+            clc_response: cute.struct.Align[
+                cute.struct.MemRange[cutlass.Int32, 4],
+                clc_response_align_bytes,
+            ]
+            dsmem_mailbox_full_barrier: cute.struct.MemRange[
+                cutlass.Int64, 1 if self.split_k > 1 else 0
+            ]
+            dsmem_mailbox_empty_barrier: cute.struct.MemRange[
+                cutlass.Int64, 1 if self.split_k > 1 else 0
+            ]
+            dsmem_mailbox: cute.struct.MemRange[
+                cutlass.Float32, self._mailbox_total_elems
+            ]
 
         smem = utils.SmemAllocator()
         storage = smem.allocate(SharedStorage)
@@ -634,7 +701,10 @@ class PersistentDenseGemmKernel:
             producer_group=ab_pipeline_producer_group,
             consumer_group=ab_pipeline_consumer_group,
             tx_count=self.num_tma_load_bytes,
-            cta_layout_vmnk=cluster_layout_vmnk,
+            # Pipeline signaling must use the full physical cluster layout so internal
+            # block_idx_in_cluster -> get_flat_coord() mappings remain valid when split_k>1.
+            # The TMA multicast masks below are also built from the full cluster layout.
+            cta_layout_vmnk=clc_cluster_layout_vmnk,
             defer_sync=True,
         ).make_participants()
 
@@ -651,15 +721,16 @@ class PersistentDenseGemmKernel:
             num_stages=self.num_acc_stage,
             producer_group=acc_pipeline_producer_group,
             consumer_group=acc_pipeline_consumer_group,
-            cta_layout_vmnk=cluster_layout_vmnk,
+            cta_layout_vmnk=clc_cluster_layout_vmnk,
             defer_sync=True,
         )
 
         # Initialize clc_pipeline (barrier) and states
+        # Use full cluster size (including split_k) for CLC pipeline signaling
         clc_pipeline_producer_group = pipeline.CooperativeGroup(pipeline.Agent.Thread)
-        cluster_size = cute.size(self.cluster_shape_mn)
+        full_cluster_size = cute.size(self.cluster_shape_mn) * self.split_k
         num_clc_consumer_threads = 32 * (
-            1 + cluster_size * (1 + len(self.epilogue_warp_id) + 1)
+            1 + full_cluster_size * (1 + len(self.epilogue_warp_id) + 1)
         )
         clc_pipeline_consumer_group = pipeline.CooperativeGroup(
             pipeline.Agent.Thread, num_clc_consumer_threads
@@ -670,7 +741,7 @@ class PersistentDenseGemmKernel:
             producer_group=clc_pipeline_producer_group,
             consumer_group=clc_pipeline_consumer_group,
             tx_count=self.num_clc_response_bytes,
-            cta_layout_vmnk=cluster_layout_vmnk,
+            cta_layout_vmnk=clc_cluster_layout_vmnk,
             defer_sync=True,
         )
 
@@ -693,8 +764,28 @@ class PersistentDenseGemmKernel:
             two_cta_tmem_dealloc_mbar_ptr=storage.tmem_dealloc_mbar.ptr,
         )
 
+        # DSMEM mailbox barriers (split-K only): FULL is armed per-tile by
+        # arrive_and_expect_tx; EMPTY collects one ack from each peer after it reads.
+        # Guard mbarrier_init with a single-warp check -- otherwise every specialized
+        # warp's elect_one would race on the same barrier (7 concurrent writers).
+        # The reference tgv_gemm split-K epilogue uses the same single-warp pattern.
+        dsmem_mailbox_full_mbar = storage.dsmem_mailbox_full_barrier.data_ptr()
+        dsmem_mailbox_empty_mbar = storage.dsmem_mailbox_empty_barrier.data_ptr()
+        dsmem_mailbox_ptr = storage.dsmem_mailbox.data_ptr()
+        if cutlass.const_expr(self.split_k > 1):
+            if warp_idx == 0:
+                with cute.arch.elect_one():
+                    cute.arch.mbarrier_init(dsmem_mailbox_full_mbar, 1)
+                    cute.arch.mbarrier_init(dsmem_mailbox_empty_mbar, self.split_k - 1)
+
         # Cluster arrive after barrier init
-        pipeline_init_arrive(cluster_shape_mn=cluster_layout_vmnk, is_relaxed=True)
+        if cutlass.const_expr(self.split_k > 1):
+            cute.arch.mbarrier_init_fence()
+            cute.arch.cluster_arrive_relaxed()
+        else:
+            pipeline_init_arrive(
+                cluster_shape_mn=self.cluster_shape_mn, is_relaxed=True
+            )
 
         # Initial clc response pointer
         clc_response_ptr = storage.clc_response.data_ptr()
@@ -728,10 +819,10 @@ class PersistentDenseGemmKernel:
         b_full_mcast_mask = None
         if cutlass.const_expr(self.is_a_mcast or self.is_b_mcast or use_2cta_instrs):
             a_full_mcast_mask = cpasync.create_tma_multicast_mask(
-                cluster_layout_vmnk, block_in_cluster_coord_vmnk, mcast_mode=2
+                clc_cluster_layout_vmnk, full_block_in_cluster_coord_vmnk, mcast_mode=2
             )
             b_full_mcast_mask = cpasync.create_tma_multicast_mask(
-                cluster_layout_vmnk, block_in_cluster_coord_vmnk, mcast_mode=1
+                clc_cluster_layout_vmnk, full_block_in_cluster_coord_vmnk, mcast_mode=1
             )
 
         #
@@ -751,6 +842,16 @@ class PersistentDenseGemmKernel:
         )
         k_tile_cnt = cute.size(gA_mkl, mode=[3])
 
+        # Cluster split-K: each CTA owns k_tiles_per_split tiles starting at k_start,
+        # clamped so the last Z CTA gets only what's left. Reduces to the full range
+        # when split_k=1.
+        k_tiles_per_split = (k_tile_cnt + self.split_k - 1) // self.split_k
+        k_start = split_rank * k_tiles_per_split
+        k_start = min(k_start, k_tile_cnt)
+        k_end = k_start + k_tiles_per_split
+        k_end = min(k_end, k_tile_cnt)
+        k_tile_count = k_end - k_start
+
         #
         # Partition global tensor for TiledMMA_A/B/C
         #
@@ -809,7 +910,10 @@ class PersistentDenseGemmKernel:
         #
         # Cluster wait before tensor memory alloc
         #
-        pipeline_init_wait(cluster_shape_mn=cluster_layout_vmnk)
+        if cutlass.const_expr(self.split_k > 1):
+            cute.arch.cluster_wait()
+        else:
+            pipeline_init_wait(cluster_shape_mn=self.cluster_shape_mn)
 
         #
         # Construct the scheduler
@@ -831,12 +935,13 @@ class PersistentDenseGemmKernel:
             # Persistent tile scheduling loop
             #
             while work_tile.is_valid_tile:
-                # Get tile coord from tile scheduler
+                # Get tile coord from tile scheduler. Grid Z is L*split_k so divide
+                # out the split_k factor to recover the batch index (no-op when split_k=1).
                 cur_tile_coord = work_tile.tile_idx
                 mma_tile_coord_mnl = (
                     cur_tile_coord[0] // cute.size(tiled_mma.thr_id.shape),
                     cur_tile_coord[1],
-                    cur_tile_coord[2],
+                    cur_tile_coord[2] // self.split_k,
                 )
 
                 #
@@ -856,31 +961,34 @@ class PersistentDenseGemmKernel:
                 peek_ab_empty_status = ab_producer.try_acquire()
 
                 #
-                # Tma load loop
+                # Tma load loop (K-partitioned for split-K)
                 #
-                for k_tile in cutlass.range(0, k_tile_cnt, 1, unroll=1):
+                for k_tile in cutlass.range(0, k_tile_count, 1, unroll=1):
                     # Conditionally wait for AB buffer empty
                     handle = ab_producer.acquire_and_advance(peek_ab_empty_status)
 
+                    # Offset into global K tiles for this CTA's partition
+                    k_tile_global = handle.count + k_start
+
                     # TMA load A/B
                     cute.copy(
                         tma_atom_a,
-                        tAgA_slice[(None, handle.count)],
+                        tAgA_slice[(None, k_tile_global)],
                         tAsA[(None, handle.index)],
                         tma_bar_ptr=handle.barrier,
                         mcast_mask=a_full_mcast_mask,
                     )
                     cute.copy(
                         tma_atom_b,
-                        tBgB_slice[(None, handle.count)],
+                        tBgB_slice[(None, k_tile_global)],
                         tBsB[(None, handle.index)],
                         tma_bar_ptr=handle.barrier,
                         mcast_mask=b_full_mcast_mask,
                     )
 
-                    # Peek (try_wait) AB buffer empty for k_tile = prefetch_k_tile_cnt + k_tile + 1
+                    # Peek (try_wait) AB buffer empty for next k_tile
                     peek_ab_empty_status = cutlass.Boolean(1)
-                    if handle.count + 1 < k_tile_cnt:
+                    if handle.count + 1 < k_tile_count:
                         peek_ab_empty_status = ab_producer.try_acquire()
 
                 #
@@ -946,7 +1054,7 @@ class PersistentDenseGemmKernel:
                 mma_tile_coord_mnl = (
                     cur_tile_coord[0] // cute.size(tiled_mma.thr_id.shape),
                     cur_tile_coord[1],
-                    cur_tile_coord[2],
+                    cur_tile_coord[2] // self.split_k,
                 )
 
                 # Set tensor memory buffer for current tile
@@ -966,39 +1074,26 @@ class PersistentDenseGemmKernel:
                     acc_pipeline.producer_acquire(acc_producer_state)
 
                 #
-                # Reset the ACCUMULATE field for each tile
+                # Mma mainloop (K-partitioned for split-K)
                 #
-                tiled_mma.set(tcgen05.Field.ACCUMULATE, False)
-
-                #
-                # Mma mainloop
-                #
-                for k_tile in range(k_tile_cnt):
+                for k_tile in range(k_tile_count):
                     if is_leader_cta:
                         # Conditionally wait for AB buffer full
                         handle = ab_consumer.wait_and_advance(peek_ab_full_status)
 
                         # tCtAcc += tCrA * tCrB
-                        num_kblocks = cute.size(tCrA, mode=[2])
-                        for kblk_idx in cutlass.range(num_kblocks, unroll_full=True):
-                            kblk_crd = (None, None, kblk_idx, handle.index)
-
-                            cute.gemm(
-                                tiled_mma,
-                                tCtAcc,
-                                tCrA[kblk_crd],
-                                tCrB[kblk_crd],
-                                tCtAcc,
-                            )
-                            # Enable accumulate on tCtAcc after first kblock
-                            tiled_mma.set(tcgen05.Field.ACCUMULATE, True)
+                        tiled_mma.set(tcgen05.Field.ACCUMULATE, k_tile != 0)
+                        tile_crd = (None, None, None, handle.index)
+                        cute.gemm(
+                            tiled_mma, tCtAcc, tCrA[tile_crd], tCrB[tile_crd], tCtAcc
+                        )
 
                         # Async arrive AB buffer empty
                         handle.release()
 
                         # Peek (try_wait) AB buffer full for k_tile = k_tile + 1
                         peek_ab_full_status = cutlass.Boolean(1)
-                        if handle.count + 1 < k_tile_cnt:
+                        if handle.count + 1 < k_tile_count:
                             peek_ab_full_status = ab_consumer.try_wait()
 
                 #
@@ -1059,16 +1154,204 @@ class PersistentDenseGemmKernel:
                 c_pipeline = pipeline.PipelineTmaStore.create(
                     num_stages=self.num_c_stage, producer_group=c_producer_group
                 )
+
+            # DSMEM split-K: pre-compute epilogue partitioning outside persistent loop
+            if cutlass.const_expr(self.split_k > 1):
+                tCgC_xf = utils.gemm.sm100.transform_partitioned_tensor_layout(tCgC)
+                tCtAcc_xf = utils.gemm.sm100.transform_partitioned_tensor_layout(
+                    tCtAcc_base
+                )
+
+                tiled_copy_t2r, tTR_tAcc_base, _ = (
+                    utils.gemm.sm100.epilogue_tmem_copy_and_partition(
+                        self, tidx, tCtAcc_xf, tCgC_xf, epi_tile, self.use_2cta_instrs
+                    )
+                )
+                thr_copy_t2r = tiled_copy_t2r.get_slice(tidx)
+                tTR_gC_partitioned = thr_copy_t2r.partition_D(
+                    cute.flat_divide(tCgC_xf, epi_tile)
+                )
+
+                epilog_sync_barrier = pipeline.NamedBarrier(
+                    barrier_id=self.epilog_sync_bar_id,
+                    num_threads=32 * len(self.epilogue_warp_id),
+                )
+                dsmem_full_phase = cutlass.Int32(0)
+                dsmem_empty_phase = cutlass.Int32(0)
+                num_tiles_done = cutlass.Int32(0)
+
+                # (epi_tid, elem_in_shard, sender_slot) -> flat mailbox offset.
+                # epi_tid is innermost (stride 1) for coalesced DSMEM stores.
+                shard_ept_static = self._mailbox_elems_per_thread // self.split_k
+                mailbox_layout = cute.make_layout(
+                    (128, shard_ept_static, max(self.split_k - 1, 1)),
+                    stride=(1, 128, 128 * shard_ept_static),
+                )
+                dsmem_store_atom = cute.make_copy_atom(
+                    CopyDsmemStoreOp(), cutlass.Float32
+                )
+                scatter_val = cute.make_rmem_tensor(
+                    cute.make_layout(1), cutlass.Float32
+                )
+
             while work_tile.is_valid_tile:
                 # Get tile coord from tile scheduler
                 cur_tile_coord = work_tile.tile_idx
                 mma_tile_coord_mnl = (
                     cur_tile_coord[0] // cute.size(tiled_mma.thr_id.shape),
                     cur_tile_coord[1],
-                    cur_tile_coord[2],
+                    cur_tile_coord[2] // self.split_k,
                 )
                 num_tiles_executed = tile_sched.num_tiles_executed
-                if cutlass.const_expr(self.use_tma_store):
+                if cutlass.const_expr(self.split_k > 1):
+                    # ---- Cluster split-K epilogue with DSMEM scatter-reduce ----
+                    # Non-TMA epilogue sets epi_tile == cta_tile so subtile_cnt == 1.
+                    tTR_tAcc = tTR_tAcc_base[
+                        (None, None, None, None, None, acc_consumer_state.index)
+                    ]
+                    tTR_gC = tTR_gC_partitioned[
+                        (None, None, None, None, None, *mma_tile_coord_mnl)
+                    ]
+                    tTR_tAcc = cute.group_modes(tTR_tAcc, 3, cute.rank(tTR_tAcc))
+                    tTR_gC = cute.group_modes(tTR_gC, 3, cute.rank(tTR_gC))
+                    subtile_cnt = cute.size(tTR_tAcc.shape, mode=[3])
+
+                    acc_pipeline.consumer_wait(acc_consumer_state)
+
+                    for subtile_idx in range(subtile_cnt):
+                        # Block until every peer has read our previous sender slot.
+                        if num_tiles_done > 0 or subtile_idx > 0:
+                            cute.arch.mbarrier_wait(
+                                dsmem_mailbox_empty_mbar, dsmem_empty_phase
+                            )
+                            dsmem_empty_phase = dsmem_empty_phase ^ 1
+
+                        tTR_gC_subtile = tTR_gC[(None, None, None, subtile_idx)]
+                        tTR_rAcc = cute.make_rmem_tensor(
+                            tTR_gC_subtile.shape, self.acc_dtype
+                        )
+                        cute.copy(
+                            tiled_copy_t2r,
+                            tTR_tAcc[(None, None, None, subtile_idx)],
+                            tTR_rAcc,
+                        )
+
+                        # Release acc pipeline once the final subtile is fetched from TMEM.
+                        if subtile_idx == subtile_cnt - 1:
+                            cute.arch.fence_view_async_tmem_load()
+                            with cute.arch.elect_one():
+                                acc_pipeline.consumer_release(acc_consumer_state)
+                            acc_consumer_state.advance()
+
+                        # Derive shard size from the actual rmem tensor (avoids baking in
+                        # a per-subtile constant that may differ if epi_tile < cta_tile).
+                        shard_ept = cute.size(tTR_rAcc) // self.split_k
+                        # Expected bytes from (split_k - 1) peers: each peer's 128
+                        # epilogue threads (4 warps * 32) send shard_ept elements
+                        # of 4 B (Float32) into our mailbox.
+                        mailbox_tx_total = (self.split_k - 1) * 128 * shard_ept * 4
+                        my_shard_start = split_rank * shard_ept
+
+                        # Arm mailbox FULL (expecting TX from split_k-1 peers) then scatter.
+                        if tidx == 0:
+                            cute.arch.mbarrier_arrive_and_expect_tx(
+                                dsmem_mailbox_full_mbar, mailbox_tx_total
+                            )
+                        epilog_sync_barrier.arrive_and_wait()
+
+                        # We write into each peer's mailbox at a slot indexed by our own
+                        # rank with the peer's rank compressed out (slot = split_rank if
+                        # split_rank < peer else split_rank - 1).
+                        for peer in range(self.split_k):
+                            if peer != split_rank:
+                                sender_idx = split_rank - (
+                                    1 if split_rank > cutlass.Int32(peer) else 0
+                                )
+                                peer_rank = (
+                                    block_in_cluster_x
+                                    + cutlass.Int32(self.cluster_shape_mn[0])
+                                    * block_in_cluster_y
+                                    + cutlass.Int32(peer * mn_cluster_size)
+                                )
+                                remote_mbar = map_dsmem_ptr(
+                                    dsmem_mailbox_full_mbar, peer_rank
+                                )
+                                for i in range(shard_ept):
+                                    scatter_val[0] = tTR_rAcc[peer * shard_ept + i]
+                                    remote_ptr = map_dsmem_ptr(
+                                        dsmem_mailbox_ptr
+                                        + mailbox_layout((tidx, i, sender_idx)),
+                                        peer_rank,
+                                    )
+                                    dst = cute.make_tensor(
+                                        remote_ptr, cute.make_layout(1)
+                                    )
+                                    cute.copy(
+                                        dsmem_store_atom,
+                                        scatter_val,
+                                        dst,
+                                        mbar_ptr=remote_mbar,
+                                    )
+
+                        # Do not let local FULL waits move ahead of this CTA's
+                        # remote DSMEM stores to peer mailboxes.
+                        cute.arch.fence_acq_rel_cta()
+                        cute.arch.mbarrier_wait(
+                            dsmem_mailbox_full_mbar, dsmem_full_phase
+                        )
+                        dsmem_full_phase = dsmem_full_phase ^ 1
+
+                        # Reduce received shards into our own partial.
+                        for s in range(self.split_k - 1):
+                            for i in range(shard_ept):
+                                tTR_rAcc[my_shard_start + i] = (
+                                    tTR_rAcc[my_shard_start + i]
+                                    + (
+                                        dsmem_mailbox_ptr + mailbox_layout((tidx, i, s))
+                                    ).load()
+                                )
+
+                        # Intra-CTA sync: tidx=0 must not signal peer empty barriers
+                        # until every thread in our CTA has finished reading the mailbox,
+                        # otherwise the peer's next-tile scatter may overwrite slots we
+                        # are still reading.
+                        epilog_sync_barrier.arrive_and_wait()
+
+                        # Signal each peer that we've read their shard.
+                        if tidx == 0:
+                            for peer in range(self.split_k):
+                                if peer != split_rank:
+                                    peer_rank = (
+                                        block_in_cluster_x
+                                        + cutlass.Int32(self.cluster_shape_mn[0])
+                                        * block_in_cluster_y
+                                        + cutlass.Int32(peer * mn_cluster_size)
+                                    )
+                                    cute.arch.mbarrier_arrive(
+                                        dsmem_mailbox_empty_mbar,
+                                        peer_cta_rank_in_cluster=peer_rank,
+                                    )
+
+                        # Predicated GMEM store: only write our shard to C.
+                        # (Two-pass predicate setup avoids Python chained comparisons
+                        # and `or` on runtime Int32, which don't short-circuit correctly.)
+                        tTR_rC = cute.make_rmem_tensor(
+                            tTR_gC_subtile.shape, self.c_dtype
+                        )
+                        tTR_rC.store(epilogue_op(tTR_rAcc.load().to(self.c_dtype)))
+                        pred = cute.make_rmem_tensor(
+                            tTR_gC_subtile.shape, cutlass.Boolean
+                        )
+                        for i in range(cute.size(tTR_rC)):
+                            pred[i] = cutlass.Boolean(True)
+                        for i in range(cute.size(tTR_rC)):
+                            if i < my_shard_start or i >= my_shard_start + shard_ept:
+                                pred[i] = cutlass.Boolean(False)
+                        cute.basic_copy_if(pred, tTR_rC, tTR_gC_subtile)
+
+                    num_tiles_done = num_tiles_done + 1
+
+                elif cutlass.const_expr(self.use_tma_store):
                     acc_consumer_state = utils.gemm.sm100.epilogue_tma_store(
                         self,
                         tidx,
@@ -1108,7 +1391,7 @@ class PersistentDenseGemmKernel:
             if cutlass.const_expr(self.use_tma_store):
                 # Wait for C store complete
                 c_pipeline.producer_tail()
-            else:
+            elif cutlass.const_expr(self.split_k <= 1):
                 # Synchronize before TMEM dealloc (done by the caller)
                 tmem_dealloc_barrier.arrive_and_wait()
             #
@@ -1117,11 +1400,21 @@ class PersistentDenseGemmKernel:
             tmem.relinquish_alloc_permit()
             tmem.free(tmem_ptr)
 
+        if cutlass.const_expr(self.split_k > 1):
+            # Keep DSMEM mailbox barriers alive until all peer CTAs finish the
+            # final scatter/read/ack sequence for this physical cluster.
+            # Otherwise a CTA may exit while peers still access its mailbox or barriers.
+            cute.arch.cluster_arrive_relaxed()
+            cute.arch.cluster_wait()
+
     @staticmethod
     def _compute_grid(
         c: cute.Tensor,
         cta_tile_shape_mnk: Tuple[int, int, int],
         cluster_shape_mn: Tuple[int, int],
+        swizzle_size: int,
+        raster_along: Literal["m", "n"],
+        split_k: int = 1,
     ) -> Tuple[utils.ClcDynamicPersistentTileSchedulerParams, Tuple[int, int, int]]:
         """Use persistent tile scheduler to compute the grid size for the output tensor C.
 
@@ -1131,6 +1424,8 @@ class PersistentDenseGemmKernel:
         :type cta_tile_shape_mnk: tuple[int, int, int]
         :param cluster_shape_mn: Shape of each cluster in M, N dimensions.
         :type cluster_shape_mn: tuple[int, int]
+        :param split_k: Cluster split-K factor (number of CTAs along cluster Z).
+        :type split_k: int
 
         :return: A tuple containing:
             - tile_sched_params: Parameters for the persistent tile scheduler.
@@ -1140,10 +1435,19 @@ class PersistentDenseGemmKernel:
         c_shape = cute.slice_(cta_tile_shape_mnk, (None, None, 0))
         gc = cute.zipped_divide(c, tiler=c_shape)
         num_ctas_mnl = gc[(0, (None, None, None))].shape
+        # Grid Z is expanded by split_k so the physical launch uses a 3D cluster, but the
+        # existing CLC scheduler still reasons about output-tile ownership only in M/N.
+        # Keep the scheduler's logical cluster shape at K=1 and let the kernel reinterpret
+        # cluster Z as the split-k rank.
+        num_ctas_mnl_expanded = (
+            num_ctas_mnl[0],
+            num_ctas_mnl[1],
+            num_ctas_mnl[2] * split_k,
+        )
         cluster_shape_mnl = (*cluster_shape_mn, 1)
 
         tile_sched_params = utils.ClcDynamicPersistentTileSchedulerParams(
-            num_ctas_mnl, cluster_shape_mnl
+            num_ctas_mnl_expanded, cluster_shape_mnl, swizzle_size, raster_along == "m"
         )
         grid = utils.ClcDynamicPersistentTileScheduler.get_grid_shape(tile_sched_params)
 
@@ -1290,17 +1594,35 @@ class PersistentDenseGemmKernel:
             raise testing.CantImplementError(
                 f"Invalid cluster shape M: {self.cluster_shape_mn[0]}"
             )
-        # Skip invalid cluster shape
+        # Skip invalid cluster shape (total cluster including split_k must be <= 16)
         is_power_of_2 = lambda x: x > 0 and (x & (x - 1)) == 0
+        total_cluster_size = (
+            self.cluster_shape_mn[0] * self.cluster_shape_mn[1] * self.split_k
+        )
         if (
-            self.cluster_shape_mn[0] * self.cluster_shape_mn[1] > 16
+            total_cluster_size > 16
             or self.cluster_shape_mn[0] <= 0
             or self.cluster_shape_mn[1] <= 0
             or not is_power_of_2(self.cluster_shape_mn[0])
             or not is_power_of_2(self.cluster_shape_mn[1])
         ):
             raise testing.CantImplementError(
-                f"Invalid cluster shape: {self.cluster_shape_mn}"
+                f"Invalid cluster shape: {self.cluster_shape_mn} with split_k={self.split_k}"
+            )
+        # split_k > 1 requires non-TMA store epilogue
+        if self.split_k > 1 and self.use_tma_store:
+            raise testing.CantImplementError("split_k > 1 requires use_tma_store=False")
+        # The current split-k integration layers cluster Z in software on top of the
+        # existing 2D CLC scheduler. Swizzled or raster_along='n' launches would require
+        # a deeper scheduler refactor to encode cluster Z in the scheduler's working id space.
+        if self.split_k > 1 and (self.swizzle_size != 1 or self.raster_along != "m"):
+            raise testing.CantImplementError(
+                "split_k > 1 requires swizzle_size=1 and raster_along='m'"
+            )
+        # Accumulator elements per thread must be divisible by split_k
+        if self.split_k > 1 and self._mailbox_elems_per_thread % self.split_k != 0:
+            raise testing.CantImplementError(
+                f"elems_per_thread={self._mailbox_elems_per_thread} must be divisible by split_k={self.split_k}"
             )
 
     def check_tensor_alignment(
@@ -1370,7 +1692,8 @@ class PersistentDenseGemmKernel:
 
         :raises testing.CantImplementError: If the epilogue store option is invalid
         """
-        # None TMA store version does not have predication, can not support OOB tiles
+
+        # The non-TMA store path has no predication, so it cannot support OOB tiles
         cta_tile_shape_mn = (
             self.mma_tiler_mn[0] // (2 if self.use_2cta_instrs else 1),
             self.mma_tiler_mn[1],
@@ -1378,7 +1701,16 @@ class PersistentDenseGemmKernel:
         if not self.use_tma_store:
             if not (m % cta_tile_shape_mn[0] == 0 and n % cta_tile_shape_mn[1] == 0):
                 raise testing.CantImplementError(
-                    f"Invalid epilog store option: {m}, {n}"
+                    f"Problem shape {m}, {n} must be divisible by cta tile shape {cta_tile_shape_mn} for non TMA store"
+                )
+            # CTA swizzling improves L2 cache utilization; cluster counts along M and N must be divisible by swizzle_size.
+            m_per_swizzle = (m // cta_tile_shape_mn[0]) // self.cluster_shape_mn[0]
+            n_per_swizzle = (n // cta_tile_shape_mn[1]) // self.cluster_shape_mn[1]
+            if (m_per_swizzle % self.swizzle_size != 0) or (
+                n_per_swizzle % self.swizzle_size != 0
+            ):
+                raise testing.CantImplementError(
+                    f"Cluster counts along M and N for problem shape {m}, {n} must be divisible by swizzle size {self.swizzle_size} for non-TMA store"
                 )
 
     def can_implement(
@@ -1396,9 +1728,9 @@ class PersistentDenseGemmKernel:
 
         :param mnkl: Problem size as a tuple (M, N, K, L).
         :type mnkl: Tuple[int, int, int, int]
-        :param a_dtype: Data type for input tensors A.
+        :param a_dtype: Data type for input tensor A.
         :type a_dtype: Type[cutlass.Numeric]
-        :param b_dtype: Data type for input tensors B.
+        :param b_dtype: Data type for input tensor B.
         :type b_dtype: Type[cutlass.Numeric]
         :param c_dtype: Data type for output tensor C.
         :type c_dtype: Type[cutlass.Numeric]
@@ -1435,7 +1767,6 @@ def bmm(
     a: cute.Tensor,  # (l, m, k)
     b: cute.Tensor,  # (l, k, n)
     c: cute.Tensor,  # (l, m, n)
-    max_active_clusters: cutlass.Constexpr,
     stream: cuda.CUstream,
     epilogue_op: cutlass.Constexpr = lambda x: x,
 ):
@@ -1447,7 +1778,7 @@ def bmm(
       - b: (n, k, l)
       - c: (m, n, l)
 
-    :param gemm_op: Kernel operation, expects (a, b, c, max_active_clusters, stream, epilogue_op)
+    :param gemm_op: Kernel operation, expects (a, b, c, stream, epilogue_op)
     :type gemm_op: cutlass.Constexpr
     :param a: Input tensor of shape (l, m, k)
     :type a: cute.Tensor
@@ -1455,8 +1786,6 @@ def bmm(
     :type b: cute.Tensor
     :param c: Output tensor of shape (l, m, n)
     :type c: cute.Tensor
-    :param max_active_clusters: Maximum number of hardware clusters to launch
-    :type max_active_clusters: cutlass.Constexpr
     :param epilogue_op: Optional elementwise lambda function to apply per output element, defaults to identity
     :type epilogue_op: cutlass.Constexpr, optional
     """
@@ -1467,7 +1796,7 @@ def bmm(
     # (l,m,n) -> (m,n,l)
     c = cute.make_tensor(c.iterator, cute.select(c.layout, mode=[1, 2, 0]))
 
-    gemm_op(a, b, c, max_active_clusters, stream, epilogue_op)
+    gemm_op(a, b, c, stream, epilogue_op)
 
 
 @lru_cache(maxsize=1)
@@ -1516,16 +1845,12 @@ def prepare_tensors(
             0, 2, 1
         )
 
-    if init_random:
-        # Uniform random initialization in range [-2, 3)
-        a_f32.random_(-2, 3)
-        b_f32.random_(-2, 3)
-        c_f32.random_(-2, 3)
-    else:
-        # Normal (Gaussian) initialization with user-specified mean and std
-        a_f32.normal_(mean=normal_mean, std=normal_std)
-        b_f32.normal_(mean=normal_mean, std=normal_std)
-        c_f32.normal_(mean=normal_mean, std=normal_std)
+    # Initialize tensors with either uniform random or normal distribution
+    for tensor in [a_f32, b_f32, c_f32]:
+        if init_random:
+            tensor.random_(-2, 3)
+        else:
+            tensor.normal_(mean=normal_mean, std=normal_std)
 
     # For float8 types, use uint8 as storage type to avoid dlpack limitation
     # (dlpack doesn't support float8 types)
@@ -1551,22 +1876,29 @@ def compile_bmm(
     a_major: str,
     b_major: str,
     c_major: str,
-    mma_tiler_mn: Tuple[int, int] = (256, 256),
+    mma_tiler: Union[Tuple[int, int], Tuple[int, int, int]] = (256, 256),
     cluster_shape_mn: Tuple[int, int] = (2, 1),
-    max_active_clusters: cutlass.Constexpr = None,
     use_2cta_instrs: bool = True,
     use_tma_store: bool = True,
+    swizzle_size: int = 1,
+    raster_along: Literal["m", "n"] = "m",
     epilogue_op: cutlass.Constexpr = lambda x: x,
+    split_k: int = 1,
 ):
     from cutlass.cute.runtime import make_fake_stream
 
+    # Build GEMM object
     gemm = PersistentDenseGemmKernel(
         acc_dtype,
         use_2cta_instrs,
-        mma_tiler_mn,
+        mma_tiler,
         cluster_shape_mn,
         use_tma_store,
+        swizzle_size,
+        raster_along,
+        split_k,
     )
+
     # Check if configuration can be implemented
     can_implement = gemm.can_implement(
         mnkl, a.element_type, b.element_type, c.element_type, a_major, b_major, c_major
@@ -1574,12 +1906,16 @@ def compile_bmm(
     if not can_implement:
         raise testing.CantImplementError(
             f"The current config which is invalid/unsupported: use_2cta_instrs = {use_2cta_instrs}, "
-            f"mma_tiler_mn = {mma_tiler_mn}, cluster_shape_mn = {cluster_shape_mn}, "
-            f"use_tma_store = {use_tma_store}"
+            f"mma_tiler = {mma_tiler}, cluster_shape_mn = {cluster_shape_mn}, "
+            f"use_tma_store = {use_tma_store}, "
+            f"swizzle_size = {swizzle_size}, "
+            f"raster_along = {raster_along}, "
+            f"split_k = {split_k}, "
+            f"mnkl = {mnkl}"
         )
 
     stream = make_fake_stream()
-    return cute.compile(bmm, gemm, a, b, c, max_active_clusters, stream, epilogue_op)
+    return cute.compile(bmm, gemm, a, b, c, stream, epilogue_op)
 
 
 def run(
@@ -1592,6 +1928,8 @@ def run(
     c_major: str,
     mma_tiler_mn: Tuple[int, int] = (256, 256),
     cluster_shape_mn: Tuple[int, int] = (2, 1),
+    swizzle_size: int = 1,
+    raster_along: Literal["m", "n"] = "m",
     use_2cta_instrs: bool = True,
     use_tma_store: bool = True,
     tolerance: float = 1e-01,
@@ -1600,6 +1938,10 @@ def run(
     skip_ref_check: bool = False,
     use_cold_l2: bool = False,
     benchmark: bool = False,
+    init_normal: bool = False,
+    normal_mean: float = 0.0,
+    normal_std: float = 1.0,
+    split_k: int = 1,
     **kwargs,
 ):
     """
@@ -1642,6 +1984,14 @@ def run(
     :type use_cold_l2: bool, optional
     :param benchmark: Whether to only benchmark the kernel, defaults to False.
     :type benchmark: bool, optional
+    :param init_normal: Whether to initialize tensors using normal distribution
+        instead of uniform random, defaults to False.
+    :type init_normal: bool, optional
+    :param normal_mean: Mean for normal distribution initialization, defaults to 0.0.
+    :type normal_mean: float, optional
+    :param normal_std: Standard deviation for normal distribution initialization,
+        defaults to 1.0.
+    :type normal_std: float, optional
     :raises RuntimeError: If CUDA GPU is not available.
     :raises ValueError: If the configuration is invalid or unsupported by the kernel.
     :return: Execution time of the GEMM kernel.
@@ -1658,14 +2008,18 @@ def run(
     # Get the raw stream pointer as a CUstream
     current_stream = cuda.CUstream(torch_stream.cuda_stream)
 
-    # Check if configuration can be implemented
-    max_active_clusters = utils.HardwareInfo().get_max_active_clusters(
-        cluster_shape_mn[0] * cluster_shape_mn[1]
-    )
-
     # Run and verify BMM with torch
     a_f32, b_f32, c_f32, a_storage, b_storage, c_storage = prepare_tensors(
-        mnkl, ab_dtype, ab_dtype, c_dtype, a_major, b_major, c_major
+        mnkl,
+        ab_dtype,
+        ab_dtype,
+        c_dtype,
+        a_major,
+        b_major,
+        c_major,
+        init_random=not init_normal,
+        normal_mean=normal_mean,
+        normal_std=normal_std,
     )
 
     leading_dim_a = 2 if a_major == "k" else 1
@@ -1688,6 +2042,8 @@ def run(
     print(f"a_major: {a_major}, b_major: {b_major}, c_major: {c_major}")
     print(f"mma_tiler_mn: {mma_tiler_mn}, cluster_shape_mn: {cluster_shape_mn}")
     print(f"use_2cta_instrs: {use_2cta_instrs}, use_tma_store: {use_tma_store}")
+    print(f"swizzle_size: {swizzle_size}, raster_along: {raster_along}")
+    print(f"split_k: {split_k}")
 
     compiled_fn = compile_bmm(
         mnkl,
@@ -1700,10 +2056,12 @@ def run(
         c_major,
         mma_tiler_mn,
         cluster_shape_mn,
-        max_active_clusters,
         use_2cta_instrs,
         use_tma_store,
         epilogue_op=lambda x: x,
+        swizzle_size=swizzle_size,
+        raster_along=raster_along,
+        split_k=split_k,
     )
 
     print("Running Blackwell Persistent Dense GEMM test with:")
@@ -1736,6 +2094,10 @@ def run(
         return 0
 
     def generate_tensors():
+        # Use init_normal from outer scope, but force random init for Int8/Uint8 types
+        use_normal_init = init_normal and (
+            ab_dtype not in [cutlass.Int8, cutlass.Uint8]
+        )
         a_f32, b_f32, c_f32, a_st, b_st, c_st = prepare_tensors(
             mnkl,
             ab_dtype,
@@ -1744,6 +2106,9 @@ def run(
             a_major,
             b_major,
             c_major,
+            init_random=not use_normal_init,
+            normal_mean=normal_mean,
+            normal_std=normal_std,
         )
         a_ = create_cute_tensor_for_fp8(
             a_st, ab_dtype, leading_dim_a, source_f32_tensor=a_f32
@@ -1768,7 +2133,7 @@ def run(
         )
 
     # Return execution time in microseconds
-    return testing.benchmark(
+    exec_time = testing.benchmark(
         compiled_fn,
         workspace_generator=generate_tensors,
         workspace_count=workspace_count,
@@ -1776,10 +2141,8 @@ def run(
         warmup_iterations=warmup_iterations,
         iterations=iterations,
     )
-
-
-def compute_tflops(time_ns, m, n, k):
-    return 2.0 * m * n * k / time_ns / 1000.0
+    print(f"[DSL INFO] Execution time: {exec_time} microseconds per iteration")
+    return exec_time
 
 
 
@@ -1803,27 +2166,13 @@ def prepare_parser():
         default=(256, 256, 512, 1),
         help="mnkl dimensions (comma-separated)",
     )
-    parser.add_argument(
-        "--cluster_shape_mn",
-        type=_parse_comma_separated_ints,
-        default=(1, 1),
-        help="Cluster shape (comma-separated)",
-    )
     parser.add_argument("--ab_dtype", type=cutlass.dtype, default=cutlass.TFloat32)
     parser.add_argument("--c_dtype", type=cutlass.dtype, default=cutlass.Float32)
     parser.add_argument("--acc_dtype", type=cutlass.dtype, default=cutlass.Float32)
-    parser.add_argument(
-        "--use_2cta_instrs",
-        action="store_true",
-        help="Enable 2CTA MMA instructions feature",
-    )
 
     parser.add_argument("--a_major", choices=["k", "m"], type=str, default="k")
     parser.add_argument("--b_major", choices=["k", "n"], type=str, default="k")
     parser.add_argument("--c_major", choices=["n", "m"], type=str, default="n")
-    parser.add_argument(
-        "--use_tma_store", action="store_true", help="Use tma store or not"
-    )
     parser.add_argument(
         "--tolerance", type=float, default=1e-01, help="Tolerance for validation"
     )
@@ -1835,7 +2184,10 @@ def prepare_parser():
             "default",
             "none",
         ],
-        help="Benchmark the kernel with nsight or default (cute.testing.benchmark) or none",
+        help="Benchmark the kernel with nsight or default (cutlass.testing.benchmark) or none",
+    )
+    parser.add_argument(
+        "--skip_ref_check", action="store_true", help="Skip reference checking"
     )
     parser.add_argument(
         "--warmup_iterations", type=int, default=0, help="Warmup iterations"
@@ -1846,21 +2198,37 @@ def prepare_parser():
         default=1,
         help="Number of iterations to run the kernel",
     )
-    parser.add_argument(
-        "--skip_ref_check", action="store_true", help="Skip reference checking"
-    )
     parser.add_argument(
         "--use_cold_l2",
         action="store_true",
         default=False,
         help="Use circular buffer tensor sets to ensure L2 cold cache",
     )
+    testing.add_tensor_init_args(parser, supports_int_dtypes=True)
 
     return parser
 
 
 if __name__ == "__main__":
     parser = prepare_parser()
+
+    # Kernel Configurations
+    parser.add_argument(
+        "--use_tma_store", action="store_true", help="Use tma store or not"
+    )
+    parser.add_argument(
+        "--cluster_shape_mn",
+        type=_parse_comma_separated_ints,
+        default=(1, 1),
+        help="Cluster shape (comma-separated)",
+    )
+
+    parser.add_argument(
+        "--use_2cta_instrs",
+        action="store_true",
+        help="Enable 2CTA MMA instructions feature",
+    )
+
     parser.add_argument(
         "--mma_tiler_mn",
         type=_parse_comma_separated_ints,
@@ -1868,8 +2236,30 @@ if __name__ == "__main__":
         help="Mma tile shape (comma-separated)",
     )
 
+    parser.add_argument(
+        "--swizzle_size",
+        type=int,
+        default=1,
+        help="Swizzle size, in units of clusters, for improving L2 cache hit rate",
+    )
+    parser.add_argument(
+        "--raster_order",
+        type=str,
+        choices=["m", "n"],
+        default="m",
+        help="Rasterization order of clusters",
+    )
+    parser.add_argument(
+        "--cluster_split_k",
+        type=int,
+        default=1,
+        help="cluster split-k factor (CTAs along cluster Z that split K and reduce via DSMEM)",
+    )
+
     args = parser.parse_args()
 
+    testing.validate_tensor_init_args(args, parser)
+
     if len(args.mnkl) != 4:
         parser.error("--mnkl must contain exactly 4 values")
 
@@ -1879,7 +2269,7 @@ if __name__ == "__main__":
     if len(args.cluster_shape_mn) != 2:
         parser.error("--cluster_shape_mn must contain exactly 2 values")
 
-    print(f"[DSL INFO] Compiling Blackwell Persistent Dense GEMM with:")
+    print("[DSL INFO] Compiling Blackwell Persistent Dense GEMM with:")
     print(
         f"[DSL INFO] A dtype: {args.ab_dtype}, B dtype: {args.c_dtype}, C dtype: {args.acc_dtype}, Acc dtype: {args.acc_dtype}"
     )
@@ -1892,6 +2282,7 @@ if __name__ == "__main__":
         f"[DSL INFO] 2CTA MMA instructions: {'True' if args.use_2cta_instrs else 'False'}"
     )
     print(f"[DSL INFO] Use TMA Store: {'True' if args.use_tma_store else 'False'}")
+    print(f"[DSL INFO] Cluster Split-K: {args.cluster_split_k}")
 
     run(
         args.mnkl,
@@ -1903,6 +2294,8 @@ if __name__ == "__main__":
         args.c_major,
         args.mma_tiler_mn,
         args.cluster_shape_mn,
+        args.swizzle_size,
+        args.raster_order,
         args.use_2cta_instrs,
         args.use_tma_store,
         args.tolerance,
@@ -1911,5 +2304,9 @@ if __name__ == "__main__":
         args.skip_ref_check,
         args.use_cold_l2,
         args.benchmark == "default",
+        args.init_normal,
+        args.normal_mean,
+        args.normal_std,
+        args.cluster_split_k,
     )
     print("PASS")
diff --git a/examples/python/CuTeDSL/cute/blackwell/kernel/distributed/all_reduce_one_shot_lamport.py b/examples/python/CuTeDSL/cute/blackwell/kernel/distributed/all_reduce_one_shot_lamport.py
index eab122987..c5bd783cb 100644
--- a/examples/python/CuTeDSL/cute/blackwell/kernel/distributed/all_reduce_one_shot_lamport.py
+++ b/examples/python/CuTeDSL/cute/blackwell/kernel/distributed/all_reduce_one_shot_lamport.py
@@ -27,54 +27,21 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 import os
-import torch
 import argparse
-
-import numpy as np
-import torch.distributed as dist
-import torch.distributed._symmetric_memory as symm_mem
 import cuda.bindings.driver as cuda
-try:
-    from cuda.core import Device
-except ImportError:
-    from cuda.core.experimental import Device
-from cuda.pathfinder import load_nvidia_dynamic_lib
 
 import cutlass
 import cutlass.cute as cute
-import cutlass.cute.testing as testing
+from cutlass import testing
 from cutlass.cute.runtime import from_dlpack
 from cutlass.cutlass_dsl import T
 from cutlass._mlir.dialects import vector
 
-try:
-    import nvshmem.core
-except ImportError as exc:
-    raise ImportError(
-        "nvshmem4py is required but not installed. Please install it using:\n"
-        "  For CUDA 12: pip install nvshmem4py-cu12\n"
-        "  For CUDA 13: pip install nvshmem4py-cu13\n"
-        "Note: nvshmem4py version >= 0.1.3 is recommended."
-    ) from None
-
-try:
-    load_nvidia_dynamic_lib("nvshmem_host")
-except RuntimeError as exc:
-    raise ImportError(
-        "nvshmem lib is required but not installed. Please install it using:\n"
-        "  For CUDA 12: pip install nvidia-nvshmem-cu12\n"
-        "  For CUDA 13: pip install nvidia-nvshmem-cu13\n"
-    ) from None
-
 """
 A Distributed One-Shot All-Reduce Example using CuTe DSL and fine-grained memory control. This is a mirrored version of the 
 existing tensorrt_llm kernel:
 https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
 
-In Lamport terminology this is a classic flag-based busy-wait: every participant keeps polling the shared slot until the
-flag changes from the sentinel (negative zero) to real data, which indicates that the Lamport-style logical ordering has
-advanced and the payload is safe to consume.
-
 This example kernel demonstrates a one-shot all-reduce operation using the CuTe DSL with fine-grained memory control.
 It uses dedicated communication buffers for data exchange, and these buffers act as ping-pong buffers. During the 
 process, the kernel uses one buffer for communication and initializes the next buffer to all negative zeros.
@@ -90,8 +57,8 @@ The .SYS memory scope and .VOLATILE memory order are used to ensure that the dat
 
 .. code-block:: bash
 
-    torchrun --nproc-per-node 8  examples/distributed/all_reduce_one_shot_lamport.py --M 8192 --N 8192
-    torchrun --nproc-per-node 8  examples/distributed/all_reduce_one_shot_lamport.py \
+    torchrun --nproc-per-node 8  examples/cute/blackwell/distributed/all_reduce_one_shot_lamport.py --M 8192 --N 8192
+    torchrun --nproc-per-node 8  examples/cute/blackwell/distributed/all_reduce_one_shot_lamport.py \
         --M 8192 --N 8192 --benchmark --warmup_iterations 2 --iterations 10
 """
 
@@ -174,14 +141,14 @@ class AllReduceOneShotLamportKernel:
 
         # assume all buffers have the same element type with input
         copy_atom_load = cute.make_copy_atom(
-            cute.nvgpu.CopyUniversalOp(),
+            cute.nvgpu.CopyG2ROp(),
             buffers[0].element_type,
             num_bits_per_copy=128,
             memory_scope=cute.nvgpu.common.MemoryScope.SYS,
             memory_order=cute.nvgpu.common.MemoryOrder.VOLATILE,
         )
         copy_atom_store = cute.make_copy_atom(
-            cute.nvgpu.CopyUniversalOp(),
+            cute.nvgpu.CopyR2GOp(),
             buffers[0].element_type,
             num_bits_per_copy=128,
             memory_scope=cute.nvgpu.common.MemoryScope.SYS,
@@ -276,6 +243,10 @@ def run_all_reduce_one_shot(
     skip_ref_check=False,
     benchmark=True,
 ):
+    import torch
+    import torch.distributed as dist
+    import torch.distributed._symmetric_memory as symm_mem
+
     rank = torch.distributed.get_rank()
     world_size = torch.distributed.get_world_size()
     if rank == 0:
@@ -284,8 +255,20 @@ def run_all_reduce_one_shot(
         print(f"GPU count: {world_size}")
 
     # init buffer tensors to be neg 0
-    local_buffer_tensor = nvshmem.core.tensor([PING_PONG_SIZE, world_size, M, N,], dtype=torch.float32).neg_()
-    buffer_tensor_list = [nvshmem.core.get_peer_tensor(local_buffer_tensor, rank).permute(2, 3, 1, 0) for rank in range(world_size)]
+    t = symm_mem.empty(
+        [
+            PING_PONG_SIZE,
+            world_size,
+            M,
+            N,
+        ],
+        device="cuda",
+    ).neg_()
+    hdl = symm_mem.rendezvous(t, dist.group.WORLD)
+    buffer_tensor_list = [
+        hdl.get_buffer(rank, t.shape, t.dtype).permute(2, 3, 1, 0)
+        for rank in range(world_size)
+    ]
     signal = cutlass.Int32(0)
     input_tensor = torch.randn([M, N], device=f"cuda:{rank}")
     output_tensor = torch.zeros([M, N], device=f"cuda:{rank}")
@@ -319,33 +302,40 @@ def run_all_reduce_one_shot(
         if rank == 0:
             print("Results verified successfully!")
 
-    for t in buffer_tensor_list:
-        nvshmem.core.free_tensor(t)
-
     if not benchmark:
         return
 
-    free_func_and_tensor_pairs = []
-    def add_free_func_and_tensor(free_func, tensor):
-        free_func_and_tensor_pairs.append((free_func, tensor))
-
     def generate_tensors():
-        local_buffer = nvshmem.core.tensor([PING_PONG_SIZE, world_size, M, N,], dtype=torch.float32).neg_()
-        buffer_tensor_list = [nvshmem.core.get_peer_tensor(local_buffer, rank).permute(2, 3, 1, 0) for rank in range(world_size)]
-        input_tensor = torch.randn([M, N], device=f"cuda:{rank}")
-        output_tensor = torch.zeros([M, N], device=f"cuda:{rank}")
+        t = symm_mem.empty(
+            [
+                PING_PONG_SIZE,
+                world_size,
+                M * N,
+            ],
+            device="cuda",
+        ).neg_()
+        hdl = symm_mem.rendezvous(t, group=dist.group.WORLD.group_name)
+        # get tensors from other devices from the symmetric memory
+        buffers = [
+            hdl.get_buffer(rank, t.shape, t.dtype).permute(2, 1, 0)
+            for rank in range(world_size)
+        ]
+        input_tensor = torch.randn(M * N, device=f"cuda:{rank}")
+        output_tensor = torch.zeros(M * N, device=f"cuda:{rank}")
 
         ja = testing.JitArguments(
             cutlass.Int32(0),
             from_dlpack(input_tensor, assumed_align=32),
             from_dlpack(output_tensor, assumed_align=32),
-            [from_dlpack(t, assumed_align=32) for t in buffer_tensor_list],
-            stream=stream
+            [from_dlpack(t, assumed_align=32) for t in buffers],
+            stream=stream,
         )
-        for tensor in buffer_tensor_list:
-            add_free_func_and_tensor(nvshmem.core.free_tensor, tensor)
-        
+        # Hold references to hdl and t on ja to extend their lifetime
+        # for the duration of the kernel execution.
+        ja._hdl = hdl
+        ja._t = t
         return ja
+
     avg_time_us = testing.benchmark(
         compiled_func,
         workspace_generator=generate_tensors,
@@ -361,47 +351,39 @@ def run_all_reduce_one_shot(
             f"Achieved memory throughput: {((world_size + 1) * output_tensor.numel() * 32 // 8) / (avg_time_us / 1e6) / 1e9:.2f} GB/s"
         )
 
-    for free_func, tensor in free_func_and_tensor_pairs:
-        free_func(tensor)
 
-def torchrun_uid_init_bcast():
-    """
-    Initialize NVSHMEM using UniqueID with `torchrun` as the launcher
+def run(
+    M,
+    N,
+    warmup_iterations=2,
+    iterations=10,
+    skip_ref_check=False,
+    benchmark=True,
+):
+    import torch
+    import torch.distributed as dist
+    import torch.distributed._symmetric_memory as symm_mem
 
-    It uses torch.distributed.broadcast on a NumPy array to handle the broadcasting
-    """
-    # Set Torch device
-    local_rank = int(os.environ['LOCAL_RANK'])
+    globals()["torch"] = torch
+    globals()["dist"] = dist
+    globals()["symm_mem"] = symm_mem
+
+    local_rank = int(os.environ["LOCAL_RANK"])
     torch.cuda.set_device(local_rank)
-
-    # nvshmem4py requires a cuda.core Device at init time
-    dev = Device(local_rank)
-    dev.set_current()
-    global stream
-    stream = dev.create_stream()
-
-    # Initialize torch.distributed process group
-    dist.init_process_group(
-        backend="cpu:gloo,cuda:nccl",
-    )
-
-    # Extract rank, nranks from process group
-    num_ranks = dist.get_world_size()
-
-    # Create an empty uniqueid for all ranks
-    uid = nvshmem.core.get_unique_id(empty=(local_rank != 0))
-    uid_bytes = uid._data.view(np.uint8).copy()
-    uid_tensor = torch.from_numpy(uid_bytes).cuda()
-    dist.broadcast(uid_tensor, src=0)
-    dist.barrier()
-    uid._data[:] = uid_tensor.cpu().numpy().view(uid._data.dtype)
-
-    nvshmem.core.init(device=dev, uid=uid, rank=local_rank, nranks=num_ranks, initializer_method="uid")
-
-
-def torchrun_finalize():
-    nvshmem.core.finalize()
-    dist.destroy_process_group()
+    if not dist.is_initialized():
+        dist.init_process_group(backend="nccl")
+    try:
+        run_all_reduce_one_shot(
+            M,
+            N,
+            warmup_iterations,
+            iterations,
+            skip_ref_check,
+            benchmark,
+        )
+    finally:
+        if dist.is_initialized():
+            dist.destroy_process_group()
 
 
 def main():
@@ -414,13 +396,17 @@ def main():
     parser.add_argument("--iterations", default=10, type=int)
     parser.add_argument("--skip_ref_check", action="store_true")
     parser.add_argument("--benchmark", action="store_true")
+
     args = parser.parse_args()
 
-    torchrun_uid_init_bcast()
-
-    run_all_reduce_one_shot(args.M, args.N, args.warmup_iterations, args.iterations, args.skip_ref_check, args.benchmark)
-
-    torchrun_finalize()
+    run(
+        args.M,
+        args.N,
+        args.warmup_iterations,
+        args.iterations,
+        args.skip_ref_check,
+        args.benchmark,
+    )
 
     return
 
diff --git a/examples/python/CuTeDSL/cute/blackwell/kernel/moe/torch_grouped_mm.py b/examples/python/CuTeDSL/cute/blackwell/kernel/moe/torch_grouped_mm.py
index 13c6352e2..d7eb2bb3e 100644
--- a/examples/python/CuTeDSL/cute/blackwell/kernel/moe/torch_grouped_mm.py
+++ b/examples/python/CuTeDSL/cute/blackwell/kernel/moe/torch_grouped_mm.py
@@ -269,8 +269,8 @@ class GroupedGemmKernel:
         c_bytes = c_bytes_per_stage * self.num_c_stage
 
         self.num_sched_stages = 2
-        sched_work_tile_bytes_per_stage = 16  # 4 fields * sizeof(Int32)
-        sched_bytes = sched_work_tile_bytes_per_stage * self.num_sched_stages
+        self.sched_work_tile_bytes_per_stage = 16  # 4 fields * sizeof(Int32)
+        sched_bytes = self.sched_work_tile_bytes_per_stage * self.num_sched_stages
 
         fixed_overhead = mbar_helpers_bytes + c_bytes + sched_bytes
 
@@ -774,7 +774,11 @@ class GroupedGemmKernel:
             acc_full_mbar_ptr: cute.struct.MemRange[
                 cutlass.Int64, self.num_acc_stage * 2
             ]
-            sched_buf: cute.struct.MemRange[cutlass.Int32, self.num_sched_stages * 4]
+            sched_buf_align_bytes = self.sched_work_tile_bytes_per_stage
+            sched_buf: cute.struct.Align[
+                cute.struct.MemRange[cutlass.Int32, self.num_sched_stages * 4],
+                sched_buf_align_bytes,
+            ]
             sched_mbar_ptr: cute.struct.MemRange[
                 cutlass.Int64, self.num_sched_stages * 2
             ]
@@ -2010,9 +2014,6 @@ if __name__ == "__main__":
         compare_with_bmm=args.compare_with_bmm,
         compare_with_sol=args.compare_with_sol,
     )
-    if misc.no_torch_210:
-        misc.compare_with_bmm = True
-        print("Override to set --compare_with_bmm to avoid possible torch crash.")
 
     tester = GroupedGemmTester(problem, impl, misc)
     tester.run()
diff --git a/examples/python/CuTeDSL/cute/blackwell/kernel/moe/torch_scaled_grouped_mm.py b/examples/python/CuTeDSL/cute/blackwell/kernel/moe/torch_scaled_grouped_mm.py
index 5a2d98fe3..1c8405c08 100644
--- a/examples/python/CuTeDSL/cute/blackwell/kernel/moe/torch_scaled_grouped_mm.py
+++ b/examples/python/CuTeDSL/cute/blackwell/kernel/moe/torch_scaled_grouped_mm.py
@@ -359,6 +359,7 @@ class ScaledGroupedGemmKernel:
         )
 
         self.num_sched_stages = 2
+        self.sched_work_tile_bytes_per_stage = 16  # 4 fields * sizeof(Int32)
 
         # ── SMEM layouts ──
         self.a_smem_layout_staged = sm100_utils.make_smem_layout_a(
@@ -1371,7 +1372,11 @@ class ScaledGroupedGemmKernel:
             acc_full_mbar_ptr: cute.struct.MemRange[
                 cutlass.Int64, self.num_acc_pipeline_stages * 2
             ]
-            sched_buf: cute.struct.MemRange[cutlass.Int32, self.num_sched_stages * 4]
+            sched_buf_align_bytes = self.sched_work_tile_bytes_per_stage
+            sched_buf: cute.struct.Align[
+                cute.struct.MemRange[cutlass.Int32, self.num_sched_stages * 4],
+                sched_buf_align_bytes,
+            ]
             sched_mbar_ptr: cute.struct.MemRange[
                 cutlass.Int64, self.num_sched_stages * 2
             ]
diff --git a/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_3_1.py b/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_3_1.py
index 961949e09..3c365b93f 100644
--- a/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_3_1.py
+++ b/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_3_1.py
@@ -1,12 +1,30 @@
-# SPDX-FileCopyrightText: Copyright (c) 2024 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
-#
-# NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
-# property and proprietary rights in and to this material, related
-# documentation and any modifications thereto. Any use, reproduction,
-# disclosure or distribution of this material and related documentation
-# without an express license agreement from NVIDIA CORPORATION or
-# its affiliates is strictly prohibited.
+# Copyright (c) 2024 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 # This is the third tutorial GEMM. It further enhances the second tutorial by adding warp
 # specialization for TMA, MMA, and epilogue warps.
@@ -32,7 +50,7 @@ The dynamic scheduler is more flexible than the static scheduler, as it can hand
 
 To run this example:
 .. code-block:: bash
-    python examples/blackwell/tutorial_gemm/fp16_gemm_3_1.py  \
+    python examples/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_3_1.py  \
       --mnk 8192,8192,8192
 
 Constraints for this example:
@@ -72,29 +90,32 @@ class SharedStorage:
     tmem_holding_buffer: cutlass.Int32
     # Only for CLC Dynamic Scheduler
     clc_mbar_ptr: cute.struct.MemRange[cutlass.Int64, 2]
-    clc_response: cute.struct.MemRange[cutlass.Int32, 4]
+    clc_response_align_bytes = num_clc_response_bytes
+    clc_response: cute.struct.Align[
+        cute.struct.MemRange[cutlass.Int32, 4],
+        clc_response_align_bytes,
+    ]
 
 
 @cute.kernel()
 def kernel(
     tiled_mma: cute.TiledMma,
-    tma_atom_a: cute.CopyAtom,
-    mA_mkl: cute.Tensor,
-    tma_atom_b: cute.CopyAtom,
-    mB_nkl: cute.Tensor,
-    tma_atom_c: cute.CopyAtom,
-    mC_mnl: cute.Tensor,
-    a_smem_layout: cute.ComposedLayout,
-    b_smem_layout: cute.ComposedLayout,
+    tma_a: cpasync.TmaInfo,
+    tma_b: cpasync.TmaInfo,
+    tma_c: cpasync.TmaInfo,
     c_smem_layout_kind: cutlass.Constexpr,
-    epi_smem_layout_staged: cute.ComposedLayout,
     epi_tile: cute.Tile,
     cta_layout_vmnk: cute.Layout,
     tile_sched_params: Union[
         utils.ClcDynamicPersistentTileSchedulerParams,
         utils.PersistentTileSchedulerParams,
     ],
-):  
+):
+    # Extract tma_tensor from TmaInfo
+    mA_mkl = tma_a.tma_tensor
+    mB_nkl = tma_b.tma_tensor
+    mC_mnl = tma_c.tma_tensor
+
     warp_idx = cute.arch.warp_idx()
     warp_idx = cute.arch.make_warp_uniform(warp_idx)
 
@@ -123,9 +144,9 @@ def kernel(
 
     # Prefetch tma descriptor
     if warp_idx == tma_warp_id:
-        cpasync.prefetch_descriptor(tma_atom_a)
-        cpasync.prefetch_descriptor(tma_atom_b)
-        cpasync.prefetch_descriptor(tma_atom_c)
+        cpasync.prefetch_descriptor(tma_a.atom)
+        cpasync.prefetch_descriptor(tma_b.atom)
+        cpasync.prefetch_descriptor(tma_c.atom)
 
     # As many participants as the number of threads issuing the MMA in the same row and column
     # Substract one to not count twice the same thread
@@ -167,8 +188,8 @@ def kernel(
     )
 
     num_tma_copy_bytes = (
-        cute.size_in_bytes(io_dtype, cute.select(a_smem_layout, mode=[0, 1, 2]))
-        + cute.size_in_bytes(io_dtype, cute.select(b_smem_layout, mode=[0, 1, 2]))
+        cute.size_in_bytes(io_dtype, cute.select(tma_a.smem_layout, mode=[0, 1, 2]))
+        + cute.size_in_bytes(io_dtype, cute.select(tma_b.smem_layout, mode=[0, 1, 2]))
     ) * cute.size(cta_layout_vmnk, mode=[0])
 
     # Threads/warps participating in the mainloop pipeline
@@ -248,21 +269,21 @@ def kernel(
     # Allocate SMEM
     sA = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=a_smem_layout.outer,
+        layout=tma_a.smem_layout.outer,
         byte_alignment=128,
-        swizzle=a_smem_layout.inner,
+        swizzle=tma_a.smem_layout.inner,
     )
     sB = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=b_smem_layout.outer,
+        layout=tma_b.smem_layout.outer,
         byte_alignment=128,
-        swizzle=b_smem_layout.inner,
+        swizzle=tma_b.smem_layout.inner,
     )
     sC = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=epi_smem_layout_staged.outer,
+        layout=tma_c.smem_layout.outer,
         byte_alignment=128,
-        swizzle=epi_smem_layout_staged.inner,
+        swizzle=tma_c.smem_layout.inner,
     )
 
     # Partition tensors for MMA and make fragments
@@ -301,7 +322,7 @@ def kernel(
     # ((atom_v, rest_v), STAGE)
     # ((atom_v, rest_v), RestM, RestK)
     tAsA, tAgA = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_a,
+        tma_a.atom,
         cta_in_cluster_coord_vmnk[2],
         cute.make_layout(cute.size(cta_layout_vmnk, mode=[2])),
         cute.group_modes(sA, 0, 3),
@@ -311,7 +332,7 @@ def kernel(
     # ((atom_v, rest_v), STAGE)
     # ((atom_v, rest_v), RestN, RestK)
     tBsB, tBgB = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_b,
+        tma_b.atom,
         cta_in_cluster_coord_vmnk[1],
         cute.make_layout(cute.size(cta_layout_vmnk, mode=[1])),
         cute.group_modes(sB, 0, 3),
@@ -321,7 +342,7 @@ def kernel(
     gC_epi = cute.flat_divide(tCgC[((None, None), 0, 0, None, None)], epi_tile)
 
     tCsC, tCgC_tma = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_c,
+        tma_c.atom,
         0,
         cute.make_layout(1),
         cute.group_modes(sC, 0, 2),
@@ -379,14 +400,14 @@ def kernel(
 
                 # Issue TMA loads
                 cute.copy(
-                    tma_atom_a,
+                    tma_a.atom,
                     tAgA_slice[(None, k_tile_idx)],
                     tAsA[(None, handle.index)],
                     tma_bar_ptr=handle.barrier,
                     mcast_mask=tma_mcast_mask_a,
                 )
                 cute.copy(
-                    tma_atom_b,
+                    tma_b.atom,
                     tBgB_slice[(None, k_tile_idx)],
                     tBsB[(None, handle.index)],
                     tma_bar_ptr=handle.barrier,
@@ -447,23 +468,14 @@ def kernel(
                 # (MMA, MMA_M, MMA_N)
                 tCtAcc = tCtAcc_base[(None, None, None, acc_empty.index)]
 
-                tiled_mma.set(tcgen05.Field.ACCUMULATE, False)
                 for k_tile_idx in range(num_k_tiles):
                     # Wait for TMA copies to complete
                     handle = ab_consumer.wait_and_advance()
 
                     # Execute one K-block worth of MMA instructions
-                    num_k_blocks = cute.size(tCrA, mode=[2])
-                    for k_block_idx in cutlass.range_constexpr(num_k_blocks):
-                        k_block_coord = (None, None, k_block_idx, handle.index)
-                        cute.gemm(
-                            tiled_mma,
-                            tCtAcc,
-                            tCrA[k_block_coord],
-                            tCrB[k_block_coord],
-                            tCtAcc,
-                        )
-                        tiled_mma.set(tcgen05.Field.ACCUMULATE, True)
+                    tiled_mma.set(tcgen05.Field.ACCUMULATE, k_tile_idx != 0)
+                    tile_crd = (None, None, None, handle.index)
+                    cute.gemm(tiled_mma, tCtAcc, tCrA[tile_crd], tCrB[tile_crd], tCtAcc)
 
                     # Signal that the A/B buffers have been consumed and are ready for the next load
                     handle.release()
@@ -496,10 +508,10 @@ def kernel(
         # (MMA, MMA_M, MMA_N, STAGE)
         tCtAcc_base = cute.make_tensor(tmem_ptr, tCtAcc_fake.layout)
 
-        # Initialize TMA store pipeline for epilogue
+        # Initialize TMA store pipeline for epilogue with 4 warps
         epilogue_pipeline_producer_group = pipeline.CooperativeGroup(
-            pipeline.Agent.Thread,
-            size=128,
+            pipeline.Agent.Warp,
+            size=4,
         )
         epilogue_pipeline = pipeline.PipelineTmaStore.create(
             num_stages=epi_stages,
@@ -594,7 +606,7 @@ def kernel(
                 # SMEM -> GMEM
                 if warp_idx == epilogue_warp_ids[0]:
                     cute.copy(
-                        tma_atom_c,
+                        tma_c.atom,
                         tCsC[(None, c_buffer)],
                         tCgC_grouped[(None, subtile_idx)],
                     )
@@ -678,8 +690,8 @@ def host_function(
         mma_inst_shape_mnk,
         tcgen05.CtaGroup.TWO if use_2cta_instrs else tcgen05.CtaGroup.ONE,
         tcgen05.OperandSource.SMEM,
-        tcgen05.OperandMajorMode.K,
-        tcgen05.OperandMajorMode.K,
+        cute.nvgpu.OperandMajorMode.K,
+        cute.nvgpu.OperandMajorMode.K,
     )
     tiled_mma = cute.make_tiled_mma(op)
 
@@ -717,21 +729,18 @@ def host_function(
     op = cute.nvgpu.cpasync.CopyBulkTensorTileG2SMulticastOp(
         tcgen05.CtaGroup.TWO if use_2cta_instrs else tcgen05.CtaGroup.ONE
     )
-    a_smem_layout_slice = cute.slice_(a_smem_layout, (None, None, None, 0))
-    a_tma_atom, a_tma_tensor = cute.nvgpu.make_tiled_tma_atom_A(
+    tma_a = cute.nvgpu.make_tiled_tma_atom_A(
         op,
         a,
-        a_smem_layout_slice,
+        a_smem_layout,
         mma_tiler_mnk,
         tiled_mma,
         cta_layout_vmnk.shape,
-
     )
-    b_smem_layout_slice = cute.slice_(b_smem_layout, (None, None, None, 0))
-    b_tma_atom, b_tma_tensor = cute.nvgpu.make_tiled_tma_atom_B(
+    tma_b = cute.nvgpu.make_tiled_tma_atom_B(
         op,
         b,
-        b_smem_layout_slice,
+        b_smem_layout,
         mma_tiler_mnk,
         tiled_mma,
         cta_layout_vmnk.shape,
@@ -757,11 +766,10 @@ def host_function(
         epi_stages,
     )
 
-    epi_smem_layout = cute.slice_(epi_smem_layout_staged, (None, None, 0))
-    c_tma_atom, c_tma_tensor = cute.nvgpu.cpasync.make_tiled_tma_atom(
+    tma_c = cute.nvgpu.cpasync.make_tiled_tma_atom(
         cute.nvgpu.cpasync.CopyBulkTensorTileS2GOp(),
         c,
-        epi_smem_layout,
+        epi_smem_layout_staged,
         epi_tile,
     )
 
@@ -779,16 +787,10 @@ def host_function(
 
     kernel(
         tiled_mma,
-        a_tma_atom,
-        a_tma_tensor,
-        b_tma_atom,
-        b_tma_tensor,
-        c_tma_atom,
-        c_tma_tensor,
-        a_smem_layout,
-        b_smem_layout,
+        tma_a,
+        tma_b,
+        tma_c,
         c_smem_layout_kind,
-        epi_smem_layout_staged,
         epi_tile,
         cta_layout_vmnk,
         tile_sched_params,
diff --git a/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_4.py b/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_4.py
index 0576778cf..7870f254a 100644
--- a/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_4.py
+++ b/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_4.py
@@ -1,12 +1,30 @@
-# SPDX-FileCopyrightText: Copyright (c) 2024 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
-#
-# NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
-# property and proprietary rights in and to this material, related
-# documentation and any modifications thereto. Any use, reproduction,
-# disclosure or distribution of this material and related documentation
-# without an express license agreement from NVIDIA CORPORATION or
-# its affiliates is strictly prohibited.
+# Copyright (c) 2024 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 # This is the fourth tutorial GEMM (4). It extends fp16_gemm_3_1.py by adding TMA prefetch.
 # TMA prefetch uses cute.prefetch() to bring data into L2 cache before TMA copy needs it,
@@ -55,7 +73,7 @@ CuTe DSL Blackwell SM100 kernels. Users can specify preferred and fallback clust
 
 To run this example:
 .. code-block:: bash
-    python examples/blackwell/tutorial_gemm/fp16_gemm_4.py  \
+    python examples/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_4.py  \
       --mnk 8192,8192,8192
 
 Constraints for this example:
@@ -96,22 +114,20 @@ class SharedStorage:
     tmem_holding_buffer: cutlass.Int32
     # Only for CLC Dynamic Scheduler
     clc_mbar_ptr: cute.struct.MemRange[cutlass.Int64, 2]
-    clc_response: cute.struct.MemRange[cutlass.Int32, 4]
+    clc_response_align_bytes = num_clc_response_bytes
+    clc_response: cute.struct.Align[
+        cute.struct.MemRange[cutlass.Int32, 4],
+        clc_response_align_bytes,
+    ]
 
 
 @cute.jit
 def cluster_specific_kernel(
     tiled_mma: cute.TiledMma,
-    tma_atom_a: cute.CopyAtom,
-    mA_mkl: cute.Tensor,
-    tma_atom_b: cute.CopyAtom,
-    mB_nkl: cute.Tensor,
-    tma_atom_c: cute.CopyAtom,
-    mC_mnl: cute.Tensor,
-    a_smem_layout: cute.ComposedLayout,
-    b_smem_layout: cute.ComposedLayout,
+    tma_a: cpasync.TmaInfo,
+    tma_b: cpasync.TmaInfo,
+    tma_c: cpasync.TmaInfo,
     c_smem_layout_kind: cutlass.Constexpr,
-    epi_smem_layout_staged: cute.ComposedLayout,
     epi_tile: cute.Tile,
     cta_layout_vmnk: cute.Layout,
     cluster_shape_mnk: Tuple[int, int, int],
@@ -120,6 +136,11 @@ def cluster_specific_kernel(
         utils.PersistentTileSchedulerParams,
     ],
 ):
+    # Extract tma_tensor from TmaInfo
+    mA_mkl = tma_a.tma_tensor
+    mB_nkl = tma_b.tma_tensor
+    mC_mnl = tma_c.tma_tensor
+
     warp_idx = cute.arch.warp_idx()
     warp_idx = cute.arch.make_warp_uniform(warp_idx)
 
@@ -148,9 +169,9 @@ def cluster_specific_kernel(
 
     # Prefetch tma descriptor
     if warp_idx == tma_warp_id:
-        cpasync.prefetch_descriptor(tma_atom_a)
-        cpasync.prefetch_descriptor(tma_atom_b)
-        cpasync.prefetch_descriptor(tma_atom_c)
+        cpasync.prefetch_descriptor(tma_a.atom)
+        cpasync.prefetch_descriptor(tma_b.atom)
+        cpasync.prefetch_descriptor(tma_c.atom)
 
     # As many participants as the number of threads issuing the MMA in the same row and column
     # Substract one to not count twice the same thread
@@ -192,8 +213,8 @@ def cluster_specific_kernel(
     )
 
     num_tma_copy_bytes = (
-        cute.size_in_bytes(io_dtype, cute.select(a_smem_layout, mode=[0, 1, 2]))
-        + cute.size_in_bytes(io_dtype, cute.select(b_smem_layout, mode=[0, 1, 2]))
+        cute.size_in_bytes(io_dtype, cute.select(tma_a.smem_layout, mode=[0, 1, 2]))
+        + cute.size_in_bytes(io_dtype, cute.select(tma_b.smem_layout, mode=[0, 1, 2]))
     ) * cute.size(cta_layout_vmnk, mode=[0])
 
     # Threads/warps participating in the mainloop pipeline
@@ -273,21 +294,21 @@ def cluster_specific_kernel(
     # Allocate SMEM
     sA = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=a_smem_layout.outer,
+        layout=tma_a.smem_layout.outer,
         byte_alignment=128,
-        swizzle=a_smem_layout.inner,
+        swizzle=tma_a.smem_layout.inner,
     )
     sB = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=b_smem_layout.outer,
+        layout=tma_b.smem_layout.outer,
         byte_alignment=128,
-        swizzle=b_smem_layout.inner,
+        swizzle=tma_b.smem_layout.inner,
     )
     sC = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=epi_smem_layout_staged.outer,
+        layout=tma_c.smem_layout.outer,
         byte_alignment=128,
-        swizzle=epi_smem_layout_staged.inner,
+        swizzle=tma_c.smem_layout.inner,
     )
 
     # Partition tensors for MMA and make fragments
@@ -326,7 +347,7 @@ def cluster_specific_kernel(
     # ((atom_v, rest_v), STAGE)
     # ((atom_v, rest_v), RestM, RestK)
     tAsA, tAgA = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_a,
+        tma_a.atom,
         cta_in_cluster_coord_vmnk[2],
         cute.make_layout(cute.size(cta_layout_vmnk, mode=[2])),
         cute.group_modes(sA, 0, 3),
@@ -336,7 +357,7 @@ def cluster_specific_kernel(
     # ((atom_v, rest_v), STAGE)
     # ((atom_v, rest_v), RestN, RestK)
     tBsB, tBgB = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_b,
+        tma_b.atom,
         cta_in_cluster_coord_vmnk[1],
         cute.make_layout(cute.size(cta_layout_vmnk, mode=[1])),
         cute.group_modes(sB, 0, 3),
@@ -346,7 +367,7 @@ def cluster_specific_kernel(
     gC_epi = cute.flat_divide(tCgC[((None, None), 0, 0, None, None)], epi_tile)
 
     tCsC, tCgC_tma = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_c,
+        tma_c.atom,
         0,
         cute.make_layout(1),
         cute.group_modes(sC, 0, 2),
@@ -403,14 +424,14 @@ def cluster_specific_kernel(
 
                 # Issue TMA loads
                 cute.copy(
-                    tma_atom_a,
+                    tma_a.atom,
                     tAgA_slice[(None, k_tile_idx)],
                     tAsA[(None, handle.index)],
                     tma_bar_ptr=handle.barrier,
                     mcast_mask=tma_mcast_mask_a,
                 )
                 cute.copy(
-                    tma_atom_b,
+                    tma_b.atom,
                     tBgB_slice[(None, k_tile_idx)],
                     tBsB[(None, handle.index)],
                     tma_bar_ptr=handle.barrier,
@@ -471,23 +492,14 @@ def cluster_specific_kernel(
                 # (MMA, MMA_M, MMA_N)
                 tCtAcc = tCtAcc_base[(None, None, None, acc_empty.index)]
 
-                tiled_mma.set(tcgen05.Field.ACCUMULATE, False)
                 for k_tile_idx in range(num_k_tiles):
                     # Wait for TMA copies to complete
                     handle = ab_consumer.wait_and_advance()
 
                     # Execute one K-block worth of MMA instructions
-                    num_k_blocks = cute.size(tCrA, mode=[2])
-                    for k_block_idx in cutlass.range_constexpr(num_k_blocks):
-                        k_block_coord = (None, None, k_block_idx, handle.index)
-                        cute.gemm(
-                            tiled_mma,
-                            tCtAcc,
-                            tCrA[k_block_coord],
-                            tCrB[k_block_coord],
-                            tCtAcc,
-                        )
-                        tiled_mma.set(tcgen05.Field.ACCUMULATE, True)
+                    tiled_mma.set(tcgen05.Field.ACCUMULATE, k_tile_idx != 0)
+                    tile_crd = (None, None, None, handle.index)
+                    cute.gemm(tiled_mma, tCtAcc, tCrA[tile_crd], tCrB[tile_crd], tCtAcc)
 
                     # Signal that the A/B buffers have been consumed and are ready for the next load
                     handle.release()
@@ -520,10 +532,10 @@ def cluster_specific_kernel(
         # (MMA, MMA_M, MMA_N, STAGE)
         tCtAcc_base = cute.make_tensor(tmem_ptr, tCtAcc_fake.layout)
 
-        # Initialize TMA store pipeline for epilogue
+        # Initialize TMA store pipeline for epilogue with 4 warps
         epilogue_pipeline_producer_group = pipeline.CooperativeGroup(
-            pipeline.Agent.Thread,
-            size=128,
+            pipeline.Agent.Warp,
+            size=4,
         )
         epilogue_pipeline = pipeline.PipelineTmaStore.create(
             num_stages=epi_stages,
@@ -618,7 +630,7 @@ def cluster_specific_kernel(
                 # SMEM -> GMEM
                 if warp_idx == epilogue_warp_ids[0]:
                     cute.copy(
-                        tma_atom_c,
+                        tma_c.atom,
                         tCsC[(None, c_buffer)],
                         tCgC_grouped[(None, subtile_idx)],
                     )
@@ -652,20 +664,12 @@ def cluster_specific_kernel(
 @cute.kernel
 def kernel(
     tiled_mma: cute.TiledMma,
-    tma_atom_a_preferred: cute.CopyAtom,
-    mA_mkl_preferred: cute.Tensor,
-    tma_atom_b_preferred: cute.CopyAtom,
-    mB_nkl_preferred: cute.Tensor,
-    tma_atom_a_fallback: cute.CopyAtom,
-    mA_mkl_fallback: cute.Tensor,
-    tma_atom_b_fallback: cute.CopyAtom,
-    mB_nkl_fallback: cute.Tensor,
-    tma_atom_c: cute.CopyAtom,
-    mC_mnl: cute.Tensor,
-    a_smem_layout: cute.ComposedLayout,
-    b_smem_layout: cute.ComposedLayout,
+    tma_a_preferred: cpasync.TmaInfo,
+    tma_b_preferred: cpasync.TmaInfo,
+    tma_a_fallback: cpasync.TmaInfo,
+    tma_b_fallback: cpasync.TmaInfo,
+    tma_c: cpasync.TmaInfo,
     c_smem_layout_kind: cutlass.Constexpr,
-    epi_smem_layout_staged: cute.ComposedLayout,
     epi_tile: cute.Tile,
     preferred_cta_layout_vmnk: cute.Layout,
     fallback_cta_layout_vmnk: cute.Layout,
@@ -687,19 +691,15 @@ def kernel(
     )
 
     # As for now, only support preferred cluster kernel via the mega-kernel approach
+    # The mega-kernel approach has 2 mutually exclusive code branches; only one path runs per launch.
+    # Specify `smem_merge_branch_allocs=True` at launch to enable shared memory reuse between the two paths.
     if is_preferred_cluster:
         cluster_specific_kernel(
             tiled_mma,
-            tma_atom_a_preferred,
-            mA_mkl_preferred,
-            tma_atom_b_preferred,
-            mB_nkl_preferred,
-            tma_atom_c,
-            mC_mnl,
-            a_smem_layout,
-            b_smem_layout,
+            tma_a_preferred,
+            tma_b_preferred,
+            tma_c,
             c_smem_layout_kind,
-            epi_smem_layout_staged,
             epi_tile,
             preferred_cta_layout_vmnk,
             preferred_cluster_shape_mnk,
@@ -708,16 +708,10 @@ def kernel(
     else:
         cluster_specific_kernel(
             tiled_mma,
-            tma_atom_a_fallback,
-            mA_mkl_fallback,
-            tma_atom_b_fallback,
-            mB_nkl_fallback,
-            tma_atom_c,
-            mC_mnl,
-            a_smem_layout,
-            b_smem_layout,
+            tma_a_fallback,
+            tma_b_fallback,
+            tma_c,
             c_smem_layout_kind,
-            epi_smem_layout_staged,
             epi_tile,
             fallback_cta_layout_vmnk,
             fallback_cluster_shape_mnk,
@@ -814,8 +808,8 @@ def host_function(
         mma_inst_shape_mnk,
         tcgen05.CtaGroup.TWO if use_2cta_instrs else tcgen05.CtaGroup.ONE,
         tcgen05.OperandSource.SMEM,
-        tcgen05.OperandMajorMode.K,
-        tcgen05.OperandMajorMode.K,
+        cute.nvgpu.OperandMajorMode.K,
+        cute.nvgpu.OperandMajorMode.K,
     )
     tiled_mma = cute.make_tiled_mma(op)
 
@@ -860,37 +854,35 @@ def host_function(
     op = cute.nvgpu.cpasync.CopyBulkTensorTileG2SMulticastOp(
         tcgen05.CtaGroup.TWO if use_2cta_instrs else tcgen05.CtaGroup.ONE
     )
-    a_smem_layout_slice = cute.slice_(a_smem_layout, (None, None, None, 0))
-    tma_atom_a_fallback, a_tma_tensor_fallback = cute.nvgpu.make_tiled_tma_atom_A(
+    tma_a_fallback = cute.nvgpu.make_tiled_tma_atom_A(
         op,
         a,
-        a_smem_layout_slice,
+        a_smem_layout,
         mma_tiler_mnk,
         tiled_mma,
         fallback_cta_layout_vmnk.shape,
     )
-    b_smem_layout_slice = cute.slice_(b_smem_layout, (None, None, None, 0))
-    tma_atom_b_fallback, b_tma_tensor_fallback = cute.nvgpu.make_tiled_tma_atom_B(
+    tma_b_fallback = cute.nvgpu.make_tiled_tma_atom_B(
         op,
         b,
-        b_smem_layout_slice,
+        b_smem_layout,
         mma_tiler_mnk,
         tiled_mma,
         fallback_cta_layout_vmnk.shape,
     )
 
-    tma_atom_a_preferred, a_tma_tensor_preferred = cute.nvgpu.make_tiled_tma_atom_A(
+    tma_a_preferred = cute.nvgpu.make_tiled_tma_atom_A(
         op,
         a,
-        a_smem_layout_slice,
+        a_smem_layout,
         mma_tiler_mnk,
         tiled_mma,
         preferred_cta_layout_vmnk.shape,
     )
-    tma_atom_b_preferred, b_tma_tensor_preferred = cute.nvgpu.make_tiled_tma_atom_B(
+    tma_b_preferred = cute.nvgpu.make_tiled_tma_atom_B(
         op,
         b,
-        b_smem_layout_slice,
+        b_smem_layout,
         mma_tiler_mnk,
         tiled_mma,
         preferred_cta_layout_vmnk.shape,
@@ -915,12 +907,11 @@ def host_function(
         epi_tile,
         epi_stages,
     )
-    epi_smem_layout = cute.slice_(epi_smem_layout_staged, (None, None, 0))
 
-    tma_atom_c, c_tma_tensor = cute.nvgpu.cpasync.make_tiled_tma_atom(
+    tma_c = cute.nvgpu.cpasync.make_tiled_tma_atom(
         cute.nvgpu.cpasync.CopyBulkTensorTileS2GOp(),
         c,
-        epi_smem_layout,
+        epi_smem_layout_staged,
         epi_tile,
     )
 
@@ -945,20 +936,12 @@ def host_function(
 
     kernel(
         tiled_mma,
-        tma_atom_a_preferred,
-        a_tma_tensor_preferred,
-        tma_atom_b_preferred,
-        b_tma_tensor_preferred,
-        tma_atom_a_fallback,
-        a_tma_tensor_fallback,
-        tma_atom_b_fallback,
-        b_tma_tensor_fallback,
-        tma_atom_c,
-        c_tma_tensor,
-        a_smem_layout,
-        b_smem_layout,
+        tma_a_preferred,
+        tma_b_preferred,
+        tma_a_fallback,
+        tma_b_fallback,
+        tma_c,
         c_smem_layout_kind,
-        epi_smem_layout_staged,
         epi_tile,
         preferred_cta_layout_vmnk,
         fallback_cta_layout_vmnk,
@@ -969,6 +952,7 @@ def host_function(
         block=[224, 1, 1] if use_clc_dynamic_scheduler else [192, 1, 1],
         cluster=preferred_cluster_shape_mnk,
         fallback_cluster=fallback_cluster_shape_mnk,
+        smem_merge_branch_allocs=True,
     )
 
 
@@ -981,7 +965,7 @@ def run_dense_gemm(
     import cutlass.torch as cutlass_torch
 
     print("===================================================================")
-    print("Running Blackwell fp16 GEMM example 4 (with MIX cluster size support):")
+    print("Running Blackwell fp16 GEMM example 4 (with MIX cluster support):")
     print(f"  mnk:                        {mnk}")
     print(f"  tolerance:                  {tolerance}")
     print(f"  Preferred cluster shape:    {preferred_cluster_shape_mnk}")
diff --git a/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_5.py b/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_5.py
index e1f5dd750..d4b8cdb11 100644
--- a/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_5.py
+++ b/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_5.py
@@ -1,12 +1,30 @@
-# SPDX-FileCopyrightText: Copyright (c) 2024 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
-#
-# NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
-# property and proprietary rights in and to this material, related
-# documentation and any modifications thereto. Any use, reproduction,
-# disclosure or distribution of this material and related documentation
-# without an express license agreement from NVIDIA CORPORATION or
-# its affiliates is strictly prohibited.
+# Copyright (c) 2024 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 # This is the fifth tutorial GEMM (5). It extends fp16_gemm_3_1.py by adding TMA prefetch.
 # TMA prefetch uses cute.prefetch() to bring data into L2 cache before TMA copy needs it,
@@ -44,7 +62,7 @@ Key differences from fp16_gemm_3_1.py:
 
 To run this example:
 .. code-block:: bash
-    python examples/blackwell/tutorial_gemm/fp16_gemm_5.py  \
+    python examples/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_5.py  \
       --mnk 8192,8192,8192
 
 Constraints for this example:
@@ -84,22 +102,20 @@ class SharedStorage:
     tmem_holding_buffer: cutlass.Int32
     # Only for CLC Dynamic Scheduler
     clc_mbar_ptr: cute.struct.MemRange[cutlass.Int64, 2]
-    clc_response: cute.struct.MemRange[cutlass.Int32, 4]
+    clc_response_align_bytes = num_clc_response_bytes
+    clc_response: cute.struct.Align[
+        cute.struct.MemRange[cutlass.Int32, 4],
+        clc_response_align_bytes,
+    ]
 
 
 @cute.kernel()
 def kernel(
     tiled_mma: cute.TiledMma,
-    tma_atom_a: cute.CopyAtom,
-    mA_mkl: cute.Tensor,
-    tma_atom_b: cute.CopyAtom,
-    mB_nkl: cute.Tensor,
-    tma_atom_c: cute.CopyAtom,
-    mC_mnl: cute.Tensor,
-    a_smem_layout: cute.ComposedLayout,
-    b_smem_layout: cute.ComposedLayout,
+    tma_a: cpasync.TmaInfo,
+    tma_b: cpasync.TmaInfo,
+    tma_c: cpasync.TmaInfo,
     c_smem_layout_kind: cutlass.Constexpr,
-    epi_smem_layout_staged: cute.ComposedLayout,
     epi_tile: cute.Tile,
     cta_layout_vmnk: cute.Layout,
     tile_sched_params: Union[
@@ -107,6 +123,11 @@ def kernel(
         utils.PersistentTileSchedulerParams,
     ],
 ):
+    # Extract tma_tensor from TmaInfo
+    mA_mkl = tma_a.tma_tensor
+    mB_nkl = tma_b.tma_tensor
+    mC_mnl = tma_c.tma_tensor
+
     warp_idx = cute.arch.warp_idx()
     warp_idx = cute.arch.make_warp_uniform(warp_idx)
 
@@ -135,9 +156,9 @@ def kernel(
 
     # Prefetch tma descriptor
     if warp_idx == tma_warp_id:
-        cpasync.prefetch_descriptor(tma_atom_a)
-        cpasync.prefetch_descriptor(tma_atom_b)
-        cpasync.prefetch_descriptor(tma_atom_c)
+        cpasync.prefetch_descriptor(tma_a.atom)
+        cpasync.prefetch_descriptor(tma_b.atom)
+        cpasync.prefetch_descriptor(tma_c.atom)
 
     # As many participants as the number of threads issuing the MMA in the same row and column
     # Substract one to not count twice the same thread
@@ -179,8 +200,8 @@ def kernel(
     )
 
     num_tma_copy_bytes = (
-        cute.size_in_bytes(io_dtype, cute.select(a_smem_layout, mode=[0, 1, 2]))
-        + cute.size_in_bytes(io_dtype, cute.select(b_smem_layout, mode=[0, 1, 2]))
+        cute.size_in_bytes(io_dtype, cute.select(tma_a.smem_layout, mode=[0, 1, 2]))
+        + cute.size_in_bytes(io_dtype, cute.select(tma_b.smem_layout, mode=[0, 1, 2]))
     ) * cute.size(cta_layout_vmnk, mode=[0])
 
     # Threads/warps participating in the mainloop pipeline
@@ -260,21 +281,21 @@ def kernel(
     # Allocate SMEM
     sA = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=a_smem_layout.outer,
+        layout=tma_a.smem_layout.outer,
         byte_alignment=128,
-        swizzle=a_smem_layout.inner,
+        swizzle=tma_a.smem_layout.inner,
     )
     sB = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=b_smem_layout.outer,
+        layout=tma_b.smem_layout.outer,
         byte_alignment=128,
-        swizzle=b_smem_layout.inner,
+        swizzle=tma_b.smem_layout.inner,
     )
     sC = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=epi_smem_layout_staged.outer,
+        layout=tma_c.smem_layout.outer,
         byte_alignment=128,
-        swizzle=epi_smem_layout_staged.inner,
+        swizzle=tma_c.smem_layout.inner,
     )
 
     # Partition tensors for MMA and make fragments
@@ -313,7 +334,7 @@ def kernel(
     # ((atom_v, rest_v), STAGE)
     # ((atom_v, rest_v), RestM, RestK)
     tAsA, tAgA = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_a,
+        tma_a.atom,
         cta_in_cluster_coord_vmnk[2],
         cute.make_layout(cute.size(cta_layout_vmnk, mode=[2])),
         cute.group_modes(sA, 0, 3),
@@ -323,7 +344,7 @@ def kernel(
     # ((atom_v, rest_v), STAGE)
     # ((atom_v, rest_v), RestN, RestK)
     tBsB, tBgB = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_b,
+        tma_b.atom,
         cta_in_cluster_coord_vmnk[1],
         cute.make_layout(cute.size(cta_layout_vmnk, mode=[1])),
         cute.group_modes(sB, 0, 3),
@@ -333,7 +354,7 @@ def kernel(
     gC_epi = cute.flat_divide(tCgC[((None, None), 0, 0, None, None)], epi_tile)
 
     tCsC, tCgC_tma = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_c,
+        tma_c.atom,
         0,
         cute.make_layout(1),
         cute.group_modes(sC, 0, 2),
@@ -396,8 +417,8 @@ def kernel(
             for pf_k_tile in cutlass.range(
                 cutlass.min(prefetch_dist, num_k_tiles), unroll=1
             ):
-                cute.prefetch(tma_atom_a, tAgA_slice[(None, pf_k_tile)])
-                cute.prefetch(tma_atom_b, tBgB_slice[(None, pf_k_tile)])
+                cute.prefetch(tma_a.atom, tAgA_slice[(None, pf_k_tile)])
+                cute.prefetch(tma_b.atom, tBgB_slice[(None, pf_k_tile)])
 
             # =========================================================
             # TMA Load Loop with Rolling Prefetch
@@ -408,14 +429,14 @@ def kernel(
 
                 # Issue TMA loads (use k_tile_idx like fp16_gemm_3_1.py)
                 cute.copy(
-                    tma_atom_a,
+                    tma_a.atom,
                     tAgA_slice[(None, k_tile_idx)],
                     tAsA[(None, handle.index)],
                     tma_bar_ptr=handle.barrier,
                     mcast_mask=tma_mcast_mask_a,
                 )
                 cute.copy(
-                    tma_atom_b,
+                    tma_b.atom,
                     tBgB_slice[(None, k_tile_idx)],
                     tBsB[(None, handle.index)],
                     tma_bar_ptr=handle.barrier,
@@ -426,8 +447,8 @@ def kernel(
                 # This keeps the L2 primed as we progress through the K dimension
                 if k_tile_idx + prefetch_dist < num_k_tiles:
                     future_k_tile = k_tile_idx + prefetch_dist
-                    cute.prefetch(tma_atom_a, tAgA_slice[(None, future_k_tile)])
-                    cute.prefetch(tma_atom_b, tBgB_slice[(None, future_k_tile)])
+                    cute.prefetch(tma_a.atom, tAgA_slice[(None, future_k_tile)])
+                    cute.prefetch(tma_b.atom, tBgB_slice[(None, future_k_tile)])
 
             # Advance to next tile
             if cutlass.const_expr(use_clc_dynamic_scheduler):
@@ -483,24 +504,14 @@ def kernel(
                 # (MMA, MMA_M, MMA_N)
                 tCtAcc = tCtAcc_base[(None, None, None, acc_empty.index)]
 
-                tiled_mma.set(tcgen05.Field.ACCUMULATE, False)
-
                 for k_tile_idx in range(num_k_tiles):
                     # Wait for TMA copies to complete
                     handle = ab_consumer.wait_and_advance()
 
                     # Execute one K-block worth of MMA instructions
-                    num_k_blocks = cute.size(tCrA, mode=[2])
-                    for k_block_idx in cutlass.range_constexpr(num_k_blocks):
-                        k_block_coord = (None, None, k_block_idx, handle.index)
-                        cute.gemm(
-                            tiled_mma,
-                            tCtAcc,
-                            tCrA[k_block_coord],
-                            tCrB[k_block_coord],
-                            tCtAcc,
-                        )
-                        tiled_mma.set(tcgen05.Field.ACCUMULATE, True)
+                    tiled_mma.set(tcgen05.Field.ACCUMULATE, k_tile_idx != 0)
+                    tile_crd = (None, None, None, handle.index)
+                    cute.gemm(tiled_mma, tCtAcc, tCrA[tile_crd], tCrB[tile_crd], tCtAcc)
 
                     # Signal that the A/B buffers have been consumed and are ready for the next load
                     handle.release()
@@ -533,10 +544,10 @@ def kernel(
         # (MMA, MMA_M, MMA_N, STAGE)
         tCtAcc_base = cute.make_tensor(tmem_ptr, tCtAcc_fake.layout)
 
-        # Initialize TMA store pipeline for epilogue
+        # Initialize TMA store pipeline for epilogue with 4 warps
         epilogue_pipeline_producer_group = pipeline.CooperativeGroup(
-            pipeline.Agent.Thread,
-            size=128,
+            pipeline.Agent.Warp,
+            size=4,
         )
         epilogue_pipeline = pipeline.PipelineTmaStore.create(
             num_stages=epi_stages,
@@ -631,7 +642,7 @@ def kernel(
                 # SMEM -> GMEM
                 if warp_idx == epilogue_warp_ids[0]:
                     cute.copy(
-                        tma_atom_c,
+                        tma_c.atom,
                         tCsC[(None, c_buffer)],
                         tCgC_grouped[(None, subtile_idx)],
                     )
@@ -715,8 +726,8 @@ def host_function(
         mma_inst_shape_mnk,
         tcgen05.CtaGroup.TWO if use_2cta_instrs else tcgen05.CtaGroup.ONE,
         tcgen05.OperandSource.SMEM,
-        tcgen05.OperandMajorMode.K,
-        tcgen05.OperandMajorMode.K,
+        cute.nvgpu.OperandMajorMode.K,
+        cute.nvgpu.OperandMajorMode.K,
     )
     tiled_mma = cute.make_tiled_mma(op)
 
@@ -754,20 +765,18 @@ def host_function(
     op = cute.nvgpu.cpasync.CopyBulkTensorTileG2SMulticastOp(
         tcgen05.CtaGroup.TWO if use_2cta_instrs else tcgen05.CtaGroup.ONE
     )
-    a_smem_layout_slice = cute.slice_(a_smem_layout, (None, None, None, 0))
-    tma_atom_a, a_tma_tensor = cute.nvgpu.make_tiled_tma_atom_A(
+    tma_a = cute.nvgpu.make_tiled_tma_atom_A(
         op,
         a,
-        a_smem_layout_slice,
+        a_smem_layout,
         mma_tiler_mnk,
         tiled_mma,
         cta_layout_vmnk.shape,
     )
-    b_smem_layout_slice = cute.slice_(b_smem_layout, (None, None, None, 0))
-    tma_atom_b, b_tma_tensor = cute.nvgpu.make_tiled_tma_atom_B(
+    tma_b = cute.nvgpu.make_tiled_tma_atom_B(
         op,
         b,
-        b_smem_layout_slice,
+        b_smem_layout,
         mma_tiler_mnk,
         tiled_mma,
         cta_layout_vmnk.shape,
@@ -793,11 +802,10 @@ def host_function(
         epi_stages,
     )
 
-    epi_smem_layout = cute.slice_(epi_smem_layout_staged, (None, None, 0))
-    tma_atom_c, c_tma_tensor = cute.nvgpu.cpasync.make_tiled_tma_atom(
+    tma_c = cute.nvgpu.cpasync.make_tiled_tma_atom(
         cute.nvgpu.cpasync.CopyBulkTensorTileS2GOp(),
         c,
-        epi_smem_layout,
+        epi_smem_layout_staged,
         epi_tile,
     )
 
@@ -815,16 +823,10 @@ def host_function(
 
     kernel(
         tiled_mma,
-        tma_atom_a,
-        a_tma_tensor,
-        tma_atom_b,
-        b_tma_tensor,
-        tma_atom_c,
-        c_tma_tensor,
-        a_smem_layout,
-        b_smem_layout,
+        tma_a,
+        tma_b,
+        tma_c,
         c_smem_layout_kind,
-        epi_smem_layout_staged,
         epi_tile,
         cta_layout_vmnk,
         tile_sched_params,
diff --git a/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_6.py b/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_6.py
index f06c95444..ad7797131 100644
--- a/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_6.py
+++ b/examples/python/CuTeDSL/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_6.py
@@ -1,12 +1,30 @@
-# SPDX-FileCopyrightText: Copyright (c) 2024 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
-#
-# NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
-# property and proprietary rights in and to this material, related
-# documentation and any modifications thereto. Any use, reproduction,
-# disclosure or distribution of this material and related documentation
-# without an express license agreement from NVIDIA CORPORATION or
-# its affiliates is strictly prohibited.
+# Copyright (c) 2024 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 
 # This is the sixth tutorial GEMM. It enables programmatic dependent launch (PDL) features.
 
@@ -57,7 +75,7 @@ For --mnk 256,8192,128, the speedup pdl v.s. no pdl can be up to 1.16x.
 
 To run this example:
 .. code-block:: bash
-    python examples/blackwell/tutorial_gemm/fp16_gemm_6.py  \
+    python examples/cute/blackwell/tutorial/tutorial_gemm/fp16_gemm_6.py  \
       --mnk 256,8192,128
 
 Constraints for this example:
@@ -100,7 +118,11 @@ class SharedStorage:
     tmem_holding_buffer: cutlass.Int32
     # Only for CLC Dynamic Scheduler
     clc_mbar_ptr: cute.struct.MemRange[cutlass.Int64, 2]
-    clc_response: cute.struct.MemRange[cutlass.Int32, 4]
+    clc_response_align_bytes = num_clc_response_bytes
+    clc_response: cute.struct.Align[
+        cute.struct.MemRange[cutlass.Int32, 4],
+        clc_response_align_bytes,
+    ]
 
 
 @cute.kernel()
@@ -133,16 +155,10 @@ def dequantize(
 @cute.kernel()
 def gemm(
     tiled_mma: cute.TiledMma,
-    tma_atom_a: cute.CopyAtom,
-    mA_mkl: cute.Tensor,
-    tma_atom_b: cute.CopyAtom,
-    mB_nkl: cute.Tensor,
-    tma_atom_c: cute.CopyAtom,
-    mC_mnl: cute.Tensor,
-    a_smem_layout: cute.ComposedLayout,
-    b_smem_layout: cute.ComposedLayout,
+    tma_a: cpasync.TmaInfo,
+    tma_b: cpasync.TmaInfo,
+    tma_c: cpasync.TmaInfo,
     c_smem_layout_kind: cutlass.Constexpr,
-    epi_smem_layout_staged: cute.ComposedLayout,
     epi_tile: cute.Tile,
     cta_layout_vmnk: cute.Layout,
     tile_sched_params: Union[
@@ -150,6 +166,11 @@ def gemm(
         utils.PersistentTileSchedulerParams,
     ],
 ):
+    # Extract tma_tensor from TmaInfo
+    mA_mkl = tma_a.tma_tensor
+    mB_nkl = tma_b.tma_tensor
+    mC_mnl = tma_c.tma_tensor
+
     warp_idx = cute.arch.warp_idx()
     warp_idx = cute.arch.make_warp_uniform(warp_idx)
 
@@ -178,9 +199,9 @@ def gemm(
 
     # Prefetch tma descriptor
     if warp_idx == tma_warp_id:
-        cpasync.prefetch_descriptor(tma_atom_a)
-        cpasync.prefetch_descriptor(tma_atom_b)
-        cpasync.prefetch_descriptor(tma_atom_c)
+        cpasync.prefetch_descriptor(tma_a.atom)
+        cpasync.prefetch_descriptor(tma_b.atom)
+        cpasync.prefetch_descriptor(tma_c.atom)
 
     # As many participants as the number of threads issuing the MMA in the same row and column
     # Substract one to not count twice the same thread
@@ -222,8 +243,8 @@ def gemm(
     )
 
     num_tma_copy_bytes = (
-        cute.size_in_bytes(io_dtype, cute.select(a_smem_layout, mode=[0, 1, 2]))
-        + cute.size_in_bytes(io_dtype, cute.select(b_smem_layout, mode=[0, 1, 2]))
+        cute.size_in_bytes(io_dtype, cute.select(tma_a.smem_layout, mode=[0, 1, 2]))
+        + cute.size_in_bytes(io_dtype, cute.select(tma_b.smem_layout, mode=[0, 1, 2]))
     ) * cute.size(cta_layout_vmnk, mode=[0])
 
     # Threads/warps participating in the mainloop pipeline
@@ -303,21 +324,21 @@ def gemm(
     # Allocate SMEM
     sA = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=a_smem_layout.outer,
+        layout=tma_a.smem_layout.outer,
         byte_alignment=128,
-        swizzle=a_smem_layout.inner,
+        swizzle=tma_a.smem_layout.inner,
     )
     sB = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=b_smem_layout.outer,
+        layout=tma_b.smem_layout.outer,
         byte_alignment=128,
-        swizzle=b_smem_layout.inner,
+        swizzle=tma_b.smem_layout.inner,
     )
     sC = smem.allocate_tensor(
         element_type=io_dtype,
-        layout=epi_smem_layout_staged.outer,
+        layout=tma_c.smem_layout.outer,
         byte_alignment=128,
-        swizzle=epi_smem_layout_staged.inner,
+        swizzle=tma_c.smem_layout.inner,
     )
 
     # Partition tensors for MMA and make fragments
@@ -356,7 +377,7 @@ def gemm(
     # ((atom_v, rest_v), STAGE)
     # ((atom_v, rest_v), RestM, RestK)
     tAsA, tAgA = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_a,
+        tma_a.atom,
         cta_in_cluster_coord_vmnk[2],
         cute.make_layout(cute.size(cta_layout_vmnk, mode=[2])),
         cute.group_modes(sA, 0, 3),
@@ -366,7 +387,7 @@ def gemm(
     # ((atom_v, rest_v), STAGE)
     # ((atom_v, rest_v), RestN, RestK)
     tBsB, tBgB = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_b,
+        tma_b.atom,
         cta_in_cluster_coord_vmnk[1],
         cute.make_layout(cute.size(cta_layout_vmnk, mode=[1])),
         cute.group_modes(sB, 0, 3),
@@ -376,7 +397,7 @@ def gemm(
     gC_epi = cute.flat_divide(tCgC[((None, None), 0, 0, None, None)], epi_tile)
 
     tCsC, tCgC_tma = cute.nvgpu.cpasync.tma_partition(
-        tma_atom_c,
+        tma_c.atom,
         0,
         cute.make_layout(1),
         cute.group_modes(sC, 0, 2),
@@ -440,14 +461,14 @@ def gemm(
 
                 # Issue TMA loads
                 cute.copy(
-                    tma_atom_a,
+                    tma_a.atom,
                     tAgA_slice[(None, k_tile_idx)],
                     tAsA[(None, handle.index)],
                     tma_bar_ptr=handle.barrier,
                     mcast_mask=tma_mcast_mask_a,
                 )
                 cute.copy(
-                    tma_atom_b,
+                    tma_b.atom,
                     tBgB_slice[(None, k_tile_idx)],
                     tBsB[(None, handle.index)],
                     tma_bar_ptr=handle.barrier,
@@ -508,23 +529,14 @@ def gemm(
                 # (MMA, MMA_M, MMA_N)
                 tCtAcc = tCtAcc_base[(None, None, None, acc_empty.index)]
 
-                tiled_mma.set(tcgen05.Field.ACCUMULATE, False)
                 for k_tile_idx in range(num_k_tiles):
                     # Wait for TMA copies to complete
                     handle = ab_consumer.wait_and_advance()
 
                     # Execute one K-block worth of MMA instructions
-                    num_k_blocks = cute.size(tCrA, mode=[2])
-                    for k_block_idx in cutlass.range_constexpr(num_k_blocks):
-                        k_block_coord = (None, None, k_block_idx, handle.index)
-                        cute.gemm(
-                            tiled_mma,
-                            tCtAcc,
-                            tCrA[k_block_coord],
-                            tCrB[k_block_coord],
-                            tCtAcc,
-                        )
-                        tiled_mma.set(tcgen05.Field.ACCUMULATE, True)
+                    tiled_mma.set(tcgen05.Field.ACCUMULATE, k_tile_idx != 0)
+                    tile_crd = (None, None, None, handle.index)
+                    cute.gemm(tiled_mma, tCtAcc, tCrA[tile_crd], tCrB[tile_crd], tCtAcc)
 
                     # Signal that the A/B buffers have been consumed and are ready for the next load
                     handle.release()
@@ -557,10 +569,10 @@ def gemm(
         # (MMA, MMA_M, MMA_N, STAGE)
         tCtAcc_base = cute.make_tensor(tmem_ptr, tCtAcc_fake.layout)
 
-        # Initialize TMA store pipeline for epilogue
+        # Initialize TMA store pipeline for epilogue with 4 warps
         epilogue_pipeline_producer_group = pipeline.CooperativeGroup(
-            pipeline.Agent.Thread,
-            size=128,
+            pipeline.Agent.Warp,
+            size=4,
         )
         epilogue_pipeline = pipeline.PipelineTmaStore.create(
             num_stages=epi_stages,
@@ -654,7 +666,7 @@ def gemm(
                 # SMEM -> GMEM
                 if warp_idx == epilogue_warp_ids[0]:
                     cute.copy(
-                        tma_atom_c,
+                        tma_c.atom,
                         tCsC[(None, c_buffer)],
                         tCgC_grouped[(None, subtile_idx)],
                     )
@@ -750,8 +762,8 @@ def host_function(
         mma_inst_shape_mnk,
         tcgen05.CtaGroup.TWO if use_2cta_instrs else tcgen05.CtaGroup.ONE,
         tcgen05.OperandSource.SMEM,
-        tcgen05.OperandMajorMode.K,
-        tcgen05.OperandMajorMode.K,
+        cute.nvgpu.OperandMajorMode.K,
+        cute.nvgpu.OperandMajorMode.K,
     )
     tiled_mma = cute.make_tiled_mma(op)
 
@@ -789,20 +801,18 @@ def host_function(
     op = cute.nvgpu.cpasync.CopyBulkTensorTileG2SMulticastOp(
         tcgen05.CtaGroup.TWO if use_2cta_instrs else tcgen05.CtaGroup.ONE
     )
-    a_smem_layout_slice = cute.slice_(a_smem_layout, (None, None, None, 0))
-    tma_atom_a, a_tma_tensor = cute.nvgpu.make_tiled_tma_atom_A(
+    tma_a = cute.nvgpu.make_tiled_tma_atom_A(
         op,
         a,
-        a_smem_layout_slice,
+        a_smem_layout,
         mma_tiler_mnk,
         tiled_mma,
         cta_layout_vmnk.shape,
     )
-    b_smem_layout_slice = cute.slice_(b_smem_layout, (None, None, None, 0))
-    tma_atom_b, b_tma_tensor = cute.nvgpu.make_tiled_tma_atom_B(
+    tma_b = cute.nvgpu.make_tiled_tma_atom_B(
         op,
         b_dequantized,
-        b_smem_layout_slice,
+        b_smem_layout,
         mma_tiler_mnk,
         tiled_mma,
         cta_layout_vmnk.shape,
@@ -828,11 +838,10 @@ def host_function(
         epi_stages,
     )
 
-    epi_smem_layout = cute.slice_(epi_smem_layout_staged, (None, None, 0))
-    tma_atom_c, c_tma_tensor = cute.nvgpu.cpasync.make_tiled_tma_atom(
+    tma_c = cute.nvgpu.cpasync.make_tiled_tma_atom(
         cute.nvgpu.cpasync.CopyBulkTensorTileS2GOp(),
         c,
-        epi_smem_layout,
+        epi_smem_layout_staged,
         epi_tile,
     )
 
@@ -862,16 +871,10 @@ def host_function(
 
     gemm(
         tiled_mma,
-        tma_atom_a,
-        a_tma_tensor,
-        tma_atom_b,
-        b_tma_tensor,
-        tma_atom_c,
-        c_tma_tensor,
-        a_smem_layout,
-        b_smem_layout,
+        tma_a,
+        tma_b,
+        tma_c,
         c_smem_layout_kind,
-        epi_smem_layout_staged,
         epi_tile,
         cta_layout_vmnk,
         tile_sched_params,
diff --git a/examples/python/CuTeDSL/cute/blackwell_geforce/kernel/blockscaled_gemm/dense_blockscaled_gemm_persistent_pingpong.py b/examples/python/CuTeDSL/cute/blackwell_geforce/kernel/blockscaled_gemm/dense_blockscaled_gemm_persistent_pingpong.py
index 83005464b..4a0c62d89 100644
--- a/examples/python/CuTeDSL/cute/blackwell_geforce/kernel/blockscaled_gemm/dense_blockscaled_gemm_persistent_pingpong.py
+++ b/examples/python/CuTeDSL/cute/blackwell_geforce/kernel/blockscaled_gemm/dense_blockscaled_gemm_persistent_pingpong.py
@@ -34,7 +34,7 @@ import cuda.bindings.driver as cuda
 import cutlass
 import cutlass.cute as cute
 from cutlass.cute.nvgpu import cpasync
-import cutlass.cute.testing as testing
+from cutlass import testing
 import cutlass.utils as utils
 import cutlass.pipeline as pipeline
 from cutlass.cute.runtime import from_dlpack
@@ -42,6 +42,11 @@ import cutlass.utils.hopper_helpers as sm90_utils
 import cutlass.utils.blockscaled_layout as blockscaled_utils
 import cutlass.utils.blackwell_helpers as sm120_utils
 
+# SM120 block-scaled GEMM dispatch helpers (sibling utility module). Try the
+# namespace-package import path first (used under pytest, where
+# `examples/python/CuTeDSL/cute` is on sys.path); fall back to the bare local
+# import for standalone-script invocation, where Python places only the
+# script's own directory on sys.path[0].
 try:
     from blackwell_geforce.kernel.blockscaled_gemm.blockscaled_gemm_dispatch import (
         FP4_SHIFT_BITS,
@@ -49,8 +54,8 @@ try:
         make_sm120_blockscaled_mma_op,
         validate_blockscaled_args,
     )
-except ImportError:
-    from blockscaled_gemm_dispatch import (
+except ImportError:  # pragma: no cover - exercised only via standalone-script invocation
+    from blockscaled_gemm_dispatch import (  # noqa: E402
         FP4_SHIFT_BITS,
         make_ldmatrix_atom,
         make_sm120_blockscaled_mma_op,
@@ -705,9 +710,6 @@ class Sm120BlockScaledGemmKernel:
         tCrB = tiled_mma.make_fragment_B(tCsB[None, None, None, 0])
         tCrSFA = sm120_utils.partition_fragment_SFA(sSFA[None, None, 0], thr_mma, tidx)
         tCrSFB = sm120_utils.partition_fragment_SFB(sSFB[None, None, 0], thr_mma, tidx)
-        # Keep residual K modes nested to match the C++ SM120 block-scaled mainloop.
-        tCrSFA = cute.group_modes(tCrSFA, 2, cute.rank(tCrSFA))
-        tCrSFB = cute.group_modes(tCrSFB, 2, cute.rank(tCrSFB))
 
         tCgC = thr_mma.partition_C(gC_mnl)
         acc_shape = tCgC.shape[:3]
@@ -879,8 +881,6 @@ class Sm120BlockScaledGemmKernel:
             tCsSFB_copy_view = thr_copy_ldmatrix_SFB.partition_S(sSFB)
             tCrSFB_copy_view = thr_copy_ldmatrix_SFB.retile(tCrSFB)
 
-            epi_buffer = cutlass.Int32(0)
-
             if warp_group_idx == 1:
                 tile_sched.advance_to_next_work()
                 mainloop_consumer_state = self.advance(
@@ -1051,7 +1051,6 @@ class Sm120BlockScaledGemmKernel:
                     )
 
                     if k_block_idx == num_k_blocks - 1:
-                        cute.arch.fence_proxy("async.shared", space="cta")
                         mainloop_pipeline.consumer_release(mainloop_consumer_state)
                         mainloop_consumer_state.advance()
 
@@ -1210,8 +1209,7 @@ class Sm120BlockScaledGemmKernel:
                         tRS_rD_out.store(acc_vec.to(self.c_dtype))
 
                         # Register to shared memory
-                        epi_buffer = epi_buffer + 1
-                        epi_buffer = epi_buffer % cute.size(
+                        epi_buffer = (epi_m * epi_rest_n + epi_n) % cute.size(
                             tRS_sD, mode=[3]
                         )
                         self.epilog_sync_barrier.arrive_and_wait()
@@ -1242,6 +1240,7 @@ class Sm120BlockScaledGemmKernel:
                 tile_sched.advance_to_next_work()
                 tile_sched.advance_to_next_work()
                 work_tile = tile_sched.get_current_work()
+                tma_store_pipeline.producer_tail()
                 math_wg_order_state = math_wg_order_barrier.arrive(math_wg_order_state)
                 # End of for k_tile loop
             # End of while loop
@@ -1886,24 +1885,6 @@ def run_bs(
             cute.testing.convert(ref_f8, ref_tensor)
             ref = ref_device.cpu()
             torch.testing.assert_close(c_ref, ref, atol=tolerance, rtol=1e-02)
-        elif c_dtype is cutlass.Float4E2M1FN:
-            # Convert ref : f32 -> f4 -> f32
-            ref_f4_ = torch.empty(*(l, m, n), dtype=torch.uint8, device="cuda").permute(
-                1, 2, 0
-            )
-            ref_f4 = from_dlpack(ref_f4_, assumed_align=16).mark_layout_dynamic(
-                leading_dim=1
-            )
-            ref_f4.element_type = c_dtype
-            ref_device = ref.permute(2, 0, 1).contiguous().permute(1, 2, 0).cuda()
-            ref_tensor = from_dlpack(ref_device, assumed_align=16).mark_layout_dynamic(
-                leading_dim=1
-            )
-            cute.testing.convert(ref_tensor, ref_f4)
-            cute.testing.convert(ref_f4, ref_tensor)
-            ref = ref_device.cpu()
-            torch.testing.assert_close(c_ref, ref, atol=tolerance, rtol=1e-02)
-
     def generate_tensors():
         a_tensor, _ = cutlass_torch.cute_tensor_like(
             a_ref, a_dtype, is_dynamic_layout=True, assumed_align=16
@@ -1933,7 +1914,7 @@ def run_bs(
 
         _, sfa_tensor, _ = create_scale_factor_tensor(l, m, k, sf_vec_size, sf_dtype)
         _, sfb_tensor, _ = create_scale_factor_tensor(l, n, k, sf_vec_size, sf_dtype)
-        return cute.testing.JitArguments(
+        return cutlass.testing.JitArguments(
             a_tensor, b_tensor, sfa_tensor, sfb_tensor, c_tensor, stream
         )
 
diff --git a/examples/python/CuTeDSL/cute/notebooks/elementwise_add.ipynb b/examples/python/CuTeDSL/cute/notebooks/elementwise_add.ipynb
index f5db9c0f3..dc27e6b06 100644
--- a/examples/python/CuTeDSL/cute/notebooks/elementwise_add.ipynb
+++ b/examples/python/CuTeDSL/cute/notebooks/elementwise_add.ipynb
@@ -13,7 +13,6 @@
    "outputs": [],
    "source": [
     "import torch\n",
-    "from functools import partial\n",
     "from typing import List\n",
     "\n",
     "import cutlass\n",
@@ -332,8 +331,8 @@
     "\n",
     "    # Print results\n",
     "    # ------------\n",
-    "    print(f\"Performance Metrics:\")\n",
-    "    print(f\"-------------------\")\n",
+    "    print(\"Performance Metrics:\")\n",
+    "    print(\"-------------------\")\n",
     "    print(f\"Kernel execution time: {avg_time_us:.4f} us\")\n",
     "    print(f\"Memory throughput: {achieved_bandwidth:.2f} GB/s\")"
    ]
@@ -1082,7 +1081,7 @@
     "    ###############################################################################\n",
     "    # Compute predicate for out of boundary checks\n",
     "    ###############################################################################\n",
-    "    frgPred = cute.make_fragment(thrCrd.shape, cutlass.Boolean)\n",
+    "    frgPred = cute.make_rmem_tensor(thrCrd.shape, cutlass.Boolean)\n",
     "    print(f\"[DSL INFO]   frgPred = {frgPred.type}\")\n",
     "\n",
     "    for i in cutlass.range_constexpr(cute.size(frgPred)):\n",
diff --git a/examples/python/CuTeDSL/dsl_tutorials/README.md b/examples/python/CuTeDSL/dsl_tutorials/README.md
new file mode 100644
index 000000000..2d573789a
--- /dev/null
+++ b/examples/python/CuTeDSL/dsl_tutorials/README.md
@@ -0,0 +1,106 @@
+# DSL Feature Examples
+
+This directory demonstrates **CuTe DSL capabilities** beyond kernel authoring itself:
+exporting compiled kernels for deployment, integrating with ML frameworks, using
+foreign function interfaces, and accessing low-level DSL features like inline PTX
+and shared memory allocation.
+
+---
+
+## Directory Structure
+
+```
+dsl_tutorials/
+  export/                        Exporting kernels to C shared libraries
+    export_to_c.py                 Compile a kernel and export as .so/.dylib
+    load_in_python.py              Load and call the exported library from Python
+    run_with_dynamic_loading.cpp   C++ driver using dlopen
+    run_with_dynamic_loading.sh    Build/run script for dynamic loading
+    run_with_static_linking.cpp    C++ driver using static linking
+    run_with_static_linking.sh     Build/run script for static linking
+  ffi/                           Foreign function interface
+    jit_argument.py                JIT compilation with argument passing
+    tensor.cpp                     C++ tensor interop implementation
+    CMakeLists.txt                 CMake build for FFI examples
+  jax/                           JAX integration
+    cutlass_call_basic.py          Basic CUTLASS kernel call from JAX
+    cutlass_call_export.py         Export a CUTLASS kernel for JAX
+    cutlass_call_sharding.py       Multi-device sharding with CUTLASS kernels
+    elementwise_apply_example.py   Elementwise apply via JAX
+  tvm_ffi/                       TVM FFI integration
+    jit_and_use_in_torch.py        JIT compile and call from PyTorch
+    jit_and_use_in_jax.py          JIT compile and call from JAX
+    aot_export.py                  Ahead-of-time export
+    aot_use_in_torch.py            Use AOT-exported kernel in PyTorch
+    aot_use_in_jax.py              Use AOT-exported kernel in JAX
+    aot_use_in_cpp_bundle.cpp      Use AOT-exported kernel in C++
+    aot_use_in_cpp_bundle.sh       Build/run script for C++ AOT usage
+    compile_with_fake_tensor.py    Compile using fake tensors
+    compile_with_symint_arg.py     Compile with symbolic integer arguments
+    ampere_gemm_with_fake_tensor.py  Ampere GEMM with fake tensor compilation
+    error_reporting.py             Error reporting and diagnostics
+  call_bypass_dlpack.py          Calling kernels bypassing DLPack
+  call_from_jit.py               Calling conventions from JIT-compiled code
+  cooperative_launch.py          Cooperative kernel launch (multi-CTA)
+  dynamic_smem_size.py           Dynamic shared memory allocation
+  inline_ptx.py                  Embedding inline PTX assembly
+  launch_completion_and_programmatic_events.py
+                                 Launch completion / programmatic events with cudaEvent_t and CUevent
+  pointer.py                     Pointer manipulation in DSL
+  print_latex.py                 LaTeX rendering of CuTe layouts
+  programmatic_dependent_launch.py  Programmatic dependent launch (PDL)
+  smem_allocator.py              Shared memory allocator usage
+  torch_fake_tensor.py           PyTorch fake tensor integration
+  torch_fp4.py                   PyTorch FP4 tensor support
+```
+
+---
+
+## Subdirectory Guides
+
+### `export/` -- Kernel Export
+
+Shows how to compile a CuTe DSL kernel into a standalone C shared library (`.so`)
+that can be loaded and called from C++ or Python without any CuTe DSL dependency
+at runtime. Includes complete examples for both dynamic loading (`dlopen`) and
+static linking workflows.
+
+### `ffi/` -- Foreign Function Interface
+
+Demonstrates how to pass arguments between Python/CuTe DSL and C++ code using
+the FFI layer. Useful for integrating CuTe DSL kernels into existing C++
+applications.
+
+### `jax/` -- JAX Integration
+
+Shows how to call CuTe DSL kernels from JAX using `cutlass_call`, including
+basic invocation, kernel export for JAX, multi-device sharding, and elementwise
+application patterns.
+
+### `tvm_ffi/` -- TVM FFI Integration
+
+Comprehensive examples for using CuTe DSL kernels through TVM's foreign function
+interface. Covers both JIT and AOT (ahead-of-time) compilation workflows, with
+usage examples for PyTorch, JAX, and C++. Also demonstrates fake-tensor
+compilation (no GPU required at compile time) and symbolic integer arguments.
+
+---
+
+## Top-Level Files
+
+The top-level Python files demonstrate individual DSL features:
+
+- **`call_bypass_dlpack.py`** / **`call_from_jit.py`** -- Kernel calling conventions
+- **`inline_ptx.py`** -- Embedding inline PTX assembly in CuTe DSL kernels
+- **`launch_completion_and_programmatic_events.py`** -- Examples of the
+  `launch_completion_event` and `programmatic_event` launch attributes,
+  using events created via `torch.cuda.Event(enable_timing=False)` and
+  passed as either `cudaEvent_t` (`cuda.bindings.runtime`) or `CUevent`
+  (`cuda.bindings.driver`). The stream is passed as a `cudaStream_t` (`cuda.bindings.runtime`).
+- **`programmatic_dependent_launch.py`** -- Programmatic dependent launch for
+  chaining kernels with data dependencies
+- **`cooperative_launch.py`** -- Cooperative launch for multi-CTA kernels
+- **`dynamic_smem_size.py`** / **`smem_allocator.py`** -- Shared memory allocation
+- **`torch_fake_tensor.py`** / **`torch_fp4.py`** -- PyTorch integration features
+- **`pointer.py`** -- Pointer manipulation within DSL kernels
+- **`print_latex.py`** -- Render CuTe layouts as LaTeX for visualization
diff --git a/examples/python/CuTeDSL/dsl_tutorials/launch_completion_and_programmatic_events.py b/examples/python/CuTeDSL/dsl_tutorials/launch_completion_and_programmatic_events.py
new file mode 100644
index 000000000..15bda0d11
--- /dev/null
+++ b/examples/python/CuTeDSL/dsl_tutorials/launch_completion_and_programmatic_events.py
@@ -0,0 +1,367 @@
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+"""
+Launch Completion Events and Programmatic Events Example
+=======================================================
+
+This module demonstrates the two CUDA kernel-launch attributes that record a
+``cudaEvent_t`` / ``CUevent`` as part of a launch:
+
+1. ``cudaLaunchAttributeLaunchCompletionEvent``
+   The event is recorded when all blocks of the grid have begun executing
+   (best-effort, used for launch ordering). This attribute is usable on
+   **any** compute capability supported by CuTe DSL.
+
+2. ``cudaLaunchAttributeProgrammaticEvent``
+   Part of the Programmatic Dependent Launch (PDL) model. The event is recorded
+   either:
+
+   * after **every block** in the grid has called
+     ``cute.arch.griddepcontrol_launch_dependents()`` (or terminated) - this is
+     selected with ``trigger_at_block_start=0``. The kernel must call the trigger
+     itself.
+   * automatically at each block start - selected with ``trigger_at_block_start=1``.
+     The timing is similar to the launch-completion event, but the resulting
+     event remains part of the programmatic dependency model and is visible to
+     the next kernel's ``cute.arch.griddepcontrol_wait()``.
+
+   ``programmatic_event`` requires sm_90 (Hopper) or newer.
+
+The example demonstrates each attribute by launching a kernel with the attribute
+attached, passing the stream as a ``cudaStream_t`` (runtime bindings) and the event
+either as a ``cudaEvent_t`` (runtime) or ``CUevent`` (driver). The events
+themselves are created from PyTorch with ``torch.cuda.Event(enable_timing=False)``
+so they carry ``cudaEventDisableTiming``, which is required by both launch
+attributes.
+
+Usage::
+
+    python launch_completion_and_programmatic_events.py
+"""
+
+import argparse
+from typing import Literal, Tuple, Union
+
+import cuda.bindings.driver as cuda_driver
+import cuda.bindings.runtime as cuda_runtime
+import torch
+
+import cutlass
+import cutlass.cute as cute
+from cutlass.cute.runtime import from_dlpack
+
+
+# =============================================================================
+# Feature gate - queried via torch
+# =============================================================================
+
+
+def supports_programmatic_event() -> bool:
+    """``programmatic_event`` is part of the programmatic dependency model and requires Hopper (sm_90+)."""
+    return torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 9
+
+
+# =============================================================================
+# Kernels
+# =============================================================================
+
+
+@cute.kernel
+def simple_kernel(out: cute.Tensor, value: cutlass.Int32):
+    """Write ``value`` into each element of ``out``.
+
+    Used to demonstrate ``launch_completion_event``: the event fires
+    automatically when all blocks have begun executing.
+    """
+    tidx, _, _ = cute.arch.thread_idx()
+    bidx, _, _ = cute.arch.block_idx()
+
+    if tidx < cute.size(out, [0]) and bidx < cute.size(out, [1]):
+        out[tidx, bidx] = value
+
+
+@cute.kernel
+def programmatic_trigger_kernel(
+    out: cute.Tensor,
+    value: cutlass.Int32,
+    trigger_at_block_start: cutlass.Constexpr[bool],
+):
+    """Optionally trigger the programmatic launch-completion signal, then
+    write ``value`` into each element of ``out``.
+
+    Used to demonstrate ``programmatic_event``.
+
+    With ``trigger_at_block_start=False``, every block must execute the PTX
+    ``griddepcontrol.launch_dependents`` instruction (from the DSL:
+    ``cute.arch.griddepcontrol_launch_dependents()``) for the event to fire.
+    With ``trigger_at_block_start=True`` the event fires automatically at the
+    block start.
+    """
+    tidx, _, _ = cute.arch.thread_idx()
+    bidx, _, _ = cute.arch.block_idx()
+
+    cute.arch.griddepcontrol_wait()
+
+    if cutlass.const_expr(not trigger_at_block_start):
+        cute.arch.griddepcontrol_launch_dependents()
+
+    if tidx < cute.size(out, [0]) and bidx < cute.size(out, [1]):
+        out[tidx, bidx] = value
+
+
+# =============================================================================
+# JIT host functions - each exercises a single launch attribute
+# =============================================================================
+
+
+THREADS_PER_BLOCK = 128
+
+
+@cute.jit
+def launch_with_launch_completion_event(
+    out: cute.Tensor,
+    value: cutlass.Int32,
+    stream: cuda_runtime.cudaStream_t,
+    launch_completion_event: Union[cuda_runtime.cudaEvent_t, cuda_driver.CUevent],
+):
+    """Launch ``simple_kernel`` with ``launch_completion_event=launch_completion_event``."""
+    simple_kernel(out, value).launch(
+        grid=(cute.size(out, [1]), 1, 1),
+        block=(cute.size(out, [0]), 1, 1),
+        stream=stream,
+        launch_completion_event=launch_completion_event,
+        launch_completion_event_flags=0,  # Optional flags
+    )
+
+
+@cute.jit
+def launch_with_programmatic_event(
+    out: cute.Tensor,
+    value: cutlass.Int32,
+    stream: cuda_runtime.cudaStream_t,
+    programmatic_event: Union[cuda_runtime.cudaEvent_t, cuda_driver.CUevent],
+    trigger_at_block_start: cutlass.Constexpr[int] = 0,
+):
+    """Launch ``programmatic_trigger_kernel`` with ``programmatic_event=programmatic_event``."""
+    programmatic_trigger_kernel(out, value, trigger_at_block_start == 1).launch(
+        grid=(cute.size(out, [1]), 1, 1),
+        block=(cute.size(out, [0]), 1, 1),
+        stream=stream,
+        programmatic_event=programmatic_event,
+        programmatic_event_flags=0,  # Optional flags
+        programmatic_event_trigger_at_block_start=trigger_at_block_start,  # Defaults to zero
+        use_pdl=True,
+    )
+
+
+# =============================================================================
+# Helpers for event creation and synchronization
+# =============================================================================
+
+
+def _make_event(
+    kind: Literal["runtime", "driver"], init_stream: torch.cuda.Stream
+) -> Tuple[torch.cuda.Event, Union[cuda_runtime.cudaEvent_t, cuda_driver.CUevent]]:
+    """Create a CUDA event from torch with timing disabled and wrap it as the
+    requested low-level API type.
+
+    Using ``torch.cuda.Event(enable_timing=False)`` guarantees the underlying
+    CUDA event is created with ``cudaEventDisableTiming``, which is required
+    for ``launch_completion_event`` and ``programmatic_event`` launch attributes.
+
+    PyTorch lazily initializes the underlying ``cudaEvent_t`` on the first
+    ``record()`` call. We force initialization by recording the event.
+
+    Returns ``(torch_event, cuda_event)``. The torch event
+    must be kept alive for the lifetime of the wrapped cuda event. Torch will
+    destroy the event when ``torch_event`` is garbage-collected.
+    """
+    torch_event = torch.cuda.Event(enable_timing=False)
+    torch_event.record(init_stream)
+    raw_event = int(torch_event.cuda_event)
+    if raw_event == 0:
+        raise RuntimeError("torch.cuda.Event was not created")
+    if kind == "runtime":
+        return torch_event, cuda_runtime.cudaEvent_t(raw_event)
+    elif kind == "driver":
+        return torch_event, cuda_driver.CUevent(raw_event)
+    else:
+        raise ValueError(
+            f"Unknown event kind: {kind!r}; expected 'runtime' or 'driver'"
+        )
+
+
+# =============================================================================
+# Run functions
+# =============================================================================
+
+
+def run_launch_completion_event_example(
+    n_elements_per_block: int = 128,
+    n_blocks: int = 48,
+    launch_completion_event_kind: Literal["runtime", "driver"] = "runtime",
+) -> None:
+    """Run the ``launch_completion_event`` demonstration.
+
+    Allocates an output tensor, launches ``simple_kernel`` with
+    ``launch_completion_event`` attached, then blocks on the owning torch
+    event with ``torch_event.synchronize()`` and verifies the output.
+    """
+    print(
+        f"\n[Launch Completion Event] Running launch_completion_event example "
+        f"(launch_completion_event_kind={launch_completion_event_kind!r}, "
+        f"n_elements_per_block={n_elements_per_block}, n_blocks={n_blocks})"
+    )
+
+    out = torch.full(
+        (n_elements_per_block, n_blocks), -1, dtype=torch.int32, device="cuda"
+    )
+    expected = torch.full(
+        (n_elements_per_block, n_blocks), 0, dtype=torch.int32, device="cuda"
+    )
+
+    torch_stream = torch.cuda.default_stream()
+    stream = cuda_runtime.cudaStream_t(torch_stream.cuda_stream)
+
+    torch_event, launch_completion_event = _make_event(
+        launch_completion_event_kind, torch_stream
+    )
+
+    out_tensor = from_dlpack(out)
+
+    launch_with_launch_completion_event(out_tensor, 0, stream, launch_completion_event)
+
+    torch_event.synchronize()
+
+    torch.testing.assert_close(out, expected)
+    print(
+        f"[Launch Completion Event] {launch_completion_event_kind} event fired and output verified."
+    )
+
+
+def run_programmatic_event_example(
+    n_elements_per_block: int = 128,
+    n_blocks: int = 48,
+    programmatic_event_kind: Literal["runtime", "driver"] = "runtime",
+    trigger_at_block_start: int = 0,
+) -> None:
+    """Run the ``programmatic_event`` demonstration.
+
+    Allocates an output tensor, captures two back-to-back launches of
+    ``programmatic_trigger_kernel`` with ``programmatic_event`` attached into a
+    CUDA graph, replays the graph, synchronizes the device, and verifies the output.
+    """
+    if not supports_programmatic_event():
+        raise RuntimeError("programmatic_event requires Hopper (sm_90) or newer")
+
+    print(
+        f"\n[Programmatic Event]  Running programmatic_event example "
+        f"(programmatic_event_kind={programmatic_event_kind!r}, "
+        f"n_elements_per_block={n_elements_per_block}, n_blocks={n_blocks}, "
+        f"trigger_at_block_start={trigger_at_block_start})"
+    )
+
+    out = torch.full(
+        (n_elements_per_block, n_blocks), -1, dtype=torch.int32, device="cuda"
+    )
+    expected = torch.full(
+        (n_elements_per_block, n_blocks), 1, dtype=torch.int32, device="cuda"
+    )
+
+    torch_stream = torch.cuda.default_stream()
+    stream = cuda_runtime.cudaStream_t(torch_stream.cuda_stream)
+
+    out_tensor = from_dlpack(out)
+
+    graph = torch.cuda.CUDAGraph()
+
+    torch_event = None
+
+    with torch.cuda.graph(graph):
+        stream = cuda_runtime.cudaStream_t(torch.cuda.current_stream().cuda_stream)
+        torch_event, programmatic_event = _make_event(
+            programmatic_event_kind, torch_stream
+        )
+        launch_with_programmatic_event(
+            out_tensor,
+            0,
+            stream,
+            programmatic_event,
+            trigger_at_block_start=trigger_at_block_start,
+        )
+
+        # Overlaps with the prior launch after every CTA has been launched
+        launch_with_programmatic_event(
+            out_tensor,
+            1,
+            stream,
+            programmatic_event,
+            trigger_at_block_start=trigger_at_block_start,
+        )
+
+    graph.replay()
+    assert torch_event is not None, "torch.cuda.Event was not created"
+    torch.cuda.synchronize()
+
+    torch.testing.assert_close(out, expected)
+    print(
+        f"[Programmatic Event]  {programmatic_event_kind} event fired and output verified."
+    )
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description=(
+            "Demonstrate launch_completion_event and programmatic_event launch "
+            "attributes with both cudaEvent_t and CUevent."
+        )
+    )
+    parser.add_argument("--n-blocks", default=48, type=int)
+    parser.add_argument("--n-elements-per-block", default=128, type=int)
+    parser.add_argument("--use-driver-api", action="store_true")
+    parser.add_argument("--trigger-at-block-start", action="store_true")
+    args = parser.parse_args()
+
+    run_launch_completion_event_example(
+        n_elements_per_block=args.n_elements_per_block,
+        n_blocks=args.n_blocks,
+        launch_completion_event_kind="driver" if args.use_driver_api else "runtime",
+    )
+
+    if supports_programmatic_event():
+        run_programmatic_event_example(
+            n_elements_per_block=args.n_elements_per_block,
+            n_blocks=args.n_blocks,
+            programmatic_event_kind="driver" if args.use_driver_api else "runtime",
+            trigger_at_block_start=1 if args.trigger_at_block_start else 0,
+        )
+    else:
+        print("\nSkipping programmatic_event: requires Hopper (sm_90) or newer.")
+
+    print("\nPASS")
diff --git a/include/cute/algorithm/tuple_algorithms.hpp b/include/cute/algorithm/tuple_algorithms.hpp
index 966ae8f6e..5ed947538 100644
--- a/include/cute/algorithm/tuple_algorithms.hpp
+++ b/include/cute/algorithm/tuple_algorithms.hpp
@@ -182,6 +182,26 @@ for_each(T&& t, F&& f)
   CUTE_GCC_UNREACHABLE;
 }
 
+template <class T0, class T1, class F>
+CUTE_HOST_DEVICE constexpr
+void
+for_each(T0&& t0, T1&& t1, F&& f)
+{
+  if constexpr (is_tuple<remove_cvref_t<T0>>::value) {
+    static_assert(tuple_size<remove_cvref_t<T0>>::value == tuple_size<remove_cvref_t<T1>>::value, "Mismatched tuple_size");
+    return transform_apply(static_cast<T0&&>(t0), static_cast<T1&&>(t1),
+      [&](auto&&... a){
+        f(static_cast<decltype(a)&&>(a)...);
+        return tuple<>{}; // dummy return value
+      },
+      [](auto...){});
+  } else {
+    return f(static_cast<T0&&>(t0), static_cast<T1&&>(t1));
+  }
+
+  CUTE_GCC_UNREACHABLE;
+}
+
 template <class T, class F>
 CUTE_HOST_DEVICE constexpr
 auto
diff --git a/include/cute/pointer_flagged.hpp b/include/cute/pointer_flagged.hpp
index 095391f6a..f9b5074f3 100644
--- a/include/cute/pointer_flagged.hpp
+++ b/include/cute/pointer_flagged.hpp
@@ -83,6 +83,24 @@ downcast(ComposedLayout<SwizzleFn,smem_ptr_flag_bits<B>,Layout> const& layout)
   return composition(layout.layout_a(), smem_ptr_flag_bits<B/N>{}, downcast<N>(layout.layout_b()));
 }
 
+template <class Coord, int B, int M, int S, int Bits, class Layout>
+CUTE_HOST_DEVICE constexpr
+auto
+slice_and_offset(Coord const& coord, ComposedLayout<Swizzle<B,M,S>,smem_ptr_flag_bits<Bits>,Layout> const& layout)
+{
+  auto sao = slice_and_offset(coord, layout.layout_b());
+  if constexpr (is_constant<0, decltype(get<1>(sao))>::value) {
+    // Inner slice produced a static zero offset: rebuild a canonical (Swizzle o flag o sliced) layout.
+    return make_tuple(composition(Swizzle<B,M,S>{}, smem_ptr_flag_bits<Bits>{}, get<0>(sao)), Int<0>{});
+  } else {
+    // Inner slice produced a non-static offset (e.g. callers slicing with a runtime literal 0
+    // instead of Int<0>{}). Fall back to the generic ComposedLayout slice_and_offset path that
+    // absorbs the offset into the composed layout's middle slot, so we don't force every
+    // upstream call site to use Int<0>{} purely to satisfy a static_assert here.
+    return cute::make_tuple(ComposedLayout{layout.layout_a(), layout.offset() + get<1>(sao), get<0>(sao)}, Int<0>{});
+  }
+}
+
 //
 // Conversion with swizzle_layout
 //
diff --git a/include/cutlass/cluster_launch.hpp b/include/cutlass/cluster_launch.hpp
index e288cb91a..5f26e1689 100644
--- a/include/cutlass/cluster_launch.hpp
+++ b/include/cutlass/cluster_launch.hpp
@@ -49,10 +49,8 @@
 #endif
 
 #if ((__CUDACC_VER_MAJOR__ >= 12) || ((__CUDACC_VER_MAJOR__ == 11) && (__CUDACC_VER_MINOR__ >= 8)))
-#if !(defined(__QNX__) && __QNX__ >= 800 && defined(NV_IS_SAFETY))
 #  define CUTLASS_SM90_CLUSTER_LAUNCH_ENABLED
 #endif
-#endif
 
 #if (__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 8))
   #  define CUDA_ENABLE_PREFERRED_CLUSTER
diff --git a/include/cutlass/epilogue/collective/builders/sm90_builder.inl b/include/cutlass/epilogue/collective/builders/sm90_builder.inl
index 4a3bf9061..bda6034da 100644
--- a/include/cutlass/epilogue/collective/builders/sm90_builder.inl
+++ b/include/cutlass/epilogue/collective/builders/sm90_builder.inl
@@ -231,7 +231,7 @@ struct CallbacksBuilder<
   cute::enable_if_t<(FusionOp::IsAuxOutSupported ^ FusionOp::IsAuxInSupported) // only one aux tensor
               && not cute::is_subbyte_v<typename FusionOp::ElementAux>> // aux subbyte tensor doesn't use smem
 > {
-  using GmemStrideTypeAux = gemm::TagToStrideC_t<typename FusionOp::GmemLayoutTagAux>;
+  using GmemStrideTypeAux = cute::remove_pointer_t<gemm::TagToStrideC_t<typename FusionOp::GmemLayoutTagAux>>;
   using SmemLayoutAtomAux = decltype(detail::sm90_get_epilogue_smem_swizzle_layout_atom<
     GmemStrideTypeAux, typename FusionOp::ElementAux, EpilogueTile_MN>());
   using CopyOpR2S = decltype(detail::sm90_get_smem_store_op_for_accumulator<
diff --git a/include/cutlass/epilogue/collective/detail.hpp b/include/cutlass/epilogue/collective/detail.hpp
index 05078bb3a..2220479d5 100644
--- a/include/cutlass/epilogue/collective/detail.hpp
+++ b/include/cutlass/epilogue/collective/detail.hpp
@@ -524,31 +524,35 @@ public:
 
   template <bool IsLoad,
             bool WaitForInflightTmaRequests = true,
-            class ProblemShapeMNKL>
+            class ProblemShapeMNKL,
+            class TensorMaps>
   CUTLASS_DEVICE
   void
   tensormaps_perform_update(
       [[maybe_unused]] TensorMapStorage& shared_tensormaps,
       [[maybe_unused]] typename EpilogueOp::Params const& params,
-      [[maybe_unused]] cute::TmaDescriptor const* tensormap,
+      [[maybe_unused]] TensorMaps const& tensormap,
       [[maybe_unused]] ProblemShapeMNKL problem_shape,
       [[maybe_unused]] int32_t next_batch,
       [[maybe_unused]] int32_t warp_group_idx = 0
   ) { }
 
-  template <bool IsLoad, bool WaitForInflightTmaRequests = true>
+  template <bool IsLoad,
+            bool WaitForInflightTmaRequests = true,
+            class TensorMaps>
   CUTLASS_DEVICE
   void
   tensormaps_cp_fence_release(
       [[maybe_unused]] TensorMapStorage& shared_tensormaps,
-      [[maybe_unused]] cute::TmaDescriptor const* tensormap,
+      [[maybe_unused]] TensorMaps const& tensormap,
       [[maybe_unused]] int32_t warp_group_idx = 0
   ) { }
 
-  template <bool IsLoad>
+  template <bool IsLoad,
+            class TensorMaps>
   CUTLASS_DEVICE
   void
-  tensormaps_fence_acquire([[maybe_unused]] cute::TmaDescriptor const* tensormap) { }
+  tensormaps_fence_acquire([[maybe_unused]] TensorMaps const& tensormap) { }
 };
 
 
@@ -887,29 +891,31 @@ public:
   }
 
   // Dummy methods to perform different parts of TMA/Tensormap modifications
-  template <bool IsLoad, bool WaitForInflightTmaRequests = true, class ProblemShape>
+  template <bool IsLoad, bool WaitForInflightTmaRequests = true, class ProblemShape, class TensorMaps>
   CUTLASS_DEVICE
   void
   tensormaps_perform_update(
       [[maybe_unused]] TensorMapStorage& shared_tensormap,
       [[maybe_unused]] typename EpilogueOp::Params const& params,
-      [[maybe_unused]] cute::TmaDescriptor const* tensormap,
+      [[maybe_unused]] TensorMaps const& tensormap,
       [[maybe_unused]] ProblemShape problem_shape,
       [[maybe_unused]] int32_t next_batch
   ) { }
 
-  template <bool IsLoad, bool WaitForInflightTmaRequests = true>
+  template <bool IsLoad,
+            bool WaitForInflightTmaRequests = true,
+            class TensorMaps>
   CUTLASS_DEVICE
   void
   tensormaps_cp_fence_release(
       [[maybe_unused]] TensorMapStorage& shared_tensormap,
-      [[maybe_unused]] cute::TmaDescriptor const* tensormap
+      [[maybe_unused]] TensorMaps const& tensormap
   ) { }
 
-  template <bool IsLoad>
+  template <bool IsLoad, class TensorMaps>
   CUTLASS_DEVICE
   void
-  tensormaps_fence_acquire([[maybe_unused]] cute::TmaDescriptor const* tensormap) { }
+  tensormaps_fence_acquire([[maybe_unused]] TensorMaps const& tensormap) { }
 };
 
 
diff --git a/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp b/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp
index 5601988cb..61760fb17 100644
--- a/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp
+++ b/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp
@@ -304,8 +304,8 @@ public:
       [[maybe_unused]] void* workspace) {
     // These tensor shapes (only applicable for grouped gemm) and pointers are only used to create tensormap/tma desc.
     // These will be replaced with correct values before the initial tma load.
-    auto init_M = int32_t(size<0>(CtaTileMNK{}));
-    auto init_N = int32_t(size<1>(CtaTileMNK{}));
+    auto init_M = transform_leaf(get<0>(CtaTileMNK{}), [](auto v){ return int32_t(v); });
+    auto init_N = transform_leaf(get<1>(CtaTileMNK{}), [](auto v){ return int32_t(v); });
     auto init_L = 1;
 
     static_assert(!is_im2col_C and !is_im2col_D, "Im2Col not supported on C or D");
@@ -355,8 +355,7 @@ public:
 
     auto fusion_workspace = static_cast<char*>(workspace);
     auto fusion_workspace_size = round_nearest(FusionCallbacks::get_workspace_size(problem_shape, args.thread), MinTensorMapWorkspaceAlignment);
-    auto tma_descriptor_workspace = reinterpret_cast<cute::TmaDescriptor*>(
-                                      static_cast<char*>(workspace) + fusion_workspace_size);
+    auto tma_descriptor_workspace = reinterpret_cast<cute::TmaDescriptor*>(fusion_workspace + fusion_workspace_size);
 
     return {
       FusionCallbacks::to_underlying_arguments(problem_shape, args.thread, fusion_workspace),
@@ -374,7 +373,7 @@ public:
   template <class ProblemShape>
   static size_t
   get_workspace_size(ProblemShape const& problem_shape, Arguments const& args, int sm_count) {
-    constexpr uint32_t NumInputTensors = NumEpilogueWarpGroups + (cute::is_void_v<ElementC> ? 0 : 1);
+    constexpr uint32_t NumInputTensors = (is_destination_supported ? NumEpilogueWarpGroups : 0) + (is_source_supported ? 1 : 0);
     auto descriptors_shape = cute::make_shape(sm_count, Int<NumInputTensors>{});
     constexpr size_t SizeOfCuTensorMap = sizeof(cute::TmaDescriptor);
     // Allocate gmem space for input tensormaps per each SM, A tensormap copies followed by B tensormap copies
@@ -493,8 +492,7 @@ public:
     class TileShapeMNK,
     class TileCoordMNKL,
     class TiledMma,
-    class TensorMapC,
-    __CUTE_REQUIRES(std::is_pointer_v<TensorMapC>)
+    class FusionTensorMaps
   >
   CUTLASS_DEVICE auto
   load(
@@ -506,7 +504,7 @@ public:
       TiledMma tiled_mma,
       int thread_idx,
       TensorStorage& shared_tensors,
-      TensorMapC const& load_tensormap,
+      cute::tuple<cute::TmaDescriptor*, FusionTensorMaps> const& load_tensormaps,
       int subtile_idx=-1) {
     using namespace cute;
 
@@ -572,12 +570,12 @@ public:
         load_pipeline.producer_acquire(load_pipe_producer_state);
 
         // Loop fusion callback entry point
-        pld_callbacks.step(tma_barrier, epi_m, epi_n, load_pipe_producer_state.count(), issue_tma_load);
+        pld_callbacks.step(tma_barrier, epi_m, epi_n, load_pipe_producer_state.count(), issue_tma_load, get<1>(load_tensormaps));
 
         // Execute the TMA load for C if needed
         if (is_C_load_needed) {
           if (issue_tma_load) {
-            copy(params.tma_load_c.with(load_tensormap, *tma_barrier, mcast_mask),
+            copy(params.tma_load_c.with(get<0>(load_tensormaps), *tma_barrier, mcast_mask),
                 bGS_gC(_,_,_,epi_m,epi_n), bGS_sC(_,_,_,load_pipe_producer_state.index()));
             load_pipeline.producer_expect_transaction(load_pipe_producer_state);
           }
@@ -597,6 +595,44 @@ public:
     return load_pipe_producer_state;
   }
 
+  // Soft-deprecated overload accepting a bare cute::TmaDescriptor* (pre-EVT-tensormap API).
+  // Forwards to the tuple form with a default-constructed FusionTensorMaps placeholder, which
+  // is structurally a no-op for any fusion tree without AuxLoad/AuxStore nodes — i.e. the only
+  // EVT shapes that out-of-tree kernel layers built against the old API could have had.
+  template<
+    class ProblemShapeMNKL,
+    class TileShapeMNK,
+    class TileCoordMNKL,
+    class TiledMma
+  >
+  [[deprecated(
+    "Passing a bare cute::TmaDescriptor* to CollectiveEpilogue::load is deprecated. "
+    "Pass the cute::tuple<TmaDescriptor*, FusionTensorMaps> returned by load_init() directly so "
+    "that AuxLoad EVT nodes also receive their tensormaps. This compatibility overload skips "
+    "EVT tensormap updates and will be removed in a future release."
+  )]]
+  CUTLASS_DEVICE auto
+  load(
+      LoadPipeline load_pipeline,
+      LoadPipelineState load_pipe_producer_state,
+      ProblemShapeMNKL problem_shape_mnkl,
+      TileShapeMNK tile_shape_MNK,
+      TileCoordMNKL tile_coord_mnkl,
+      TiledMma tiled_mma,
+      int thread_idx,
+      TensorStorage& shared_tensors,
+      cute::TmaDescriptor* const& load_tensormap,
+      int subtile_idx=-1) {
+    using EvtTensormapsType = decltype(
+        fusion_callbacks.template get_tensormap_callbacks<true>().init(0, 0, 0));
+    return load(
+        load_pipeline, load_pipe_producer_state,
+        problem_shape_mnkl, tile_shape_MNK, tile_coord_mnkl, tiled_mma,
+        thread_idx, shared_tensors,
+        cute::make_tuple(load_tensormap, EvtTensormapsType{}),
+        subtile_idx);
+  }
+
   CUTLASS_DEVICE auto
   load_tail(
       LoadPipeline load_pipeline,
@@ -620,7 +656,7 @@ public:
     class TileCoordMNKL,
     class AccEngine, class AccLayout,
     class TiledMma,
-    class TensorMapD
+    class FusionTensorMaps
   >
   CUTLASS_DEVICE auto
   store(
@@ -635,7 +671,7 @@ public:
       TiledMma tiled_mma,
       int thread_idx,
       TensorStorage& shared_tensors,
-      TensorMapD const& store_tensormap,
+      cute::tuple<cute::TmaDescriptor*, FusionTensorMaps> const& store_tensormaps,
       int subtile_idx=-1) {
 
     using namespace cute;
@@ -744,15 +780,15 @@ public:
     Tensor cD_mn = local_tile(mD_crd, take<0,2>(CtaTileMNK{}), make_coord(m_coord, n_coord));          // (CTA_M,CTA_N)
     Tensor tRS_cD_mn = thread_r2s.partition_S(flat_divide(cD_mn, EpilogueTile{}));     // (R2S,R2S_M,R2S_N,EPI_M,EPI_N)
     // Relative coordinate tensors (static)
-    Tensor cD = make_coord_tensor(cD_mn.layout());                                                  // (CTA_M,CTA_N)
-    Tensor tRS_cD = make_coord_tensor(tRS_cD_mn.layout());                          // (R2S,R2S_M,R2S_N,EPI_M,EPI_N)
+    Tensor cD = make_coord_tensor(cD_mn.layout());                                                     // (CTA_M,CTA_N)
+    Tensor tRS_cD = make_coord_tensor(tRS_cD_mn.layout());                                             // (R2S,R2S_M,R2S_N,EPI_M,EPI_N)
     // Subtract the global "bottom right" corner from the local "top left" corner to get the max relative coordinate
     auto residue_cD = make_coord(M,N) - cD_mn(_0{});                                                           // (m,n)
     auto residue_tRS_cD = make_coord(M,N) - tRS_cD_mn(_0{});                                                   // (m,n)
 
     CUTE_STATIC_ASSERT(epi_tile_m % mma_tile_m == 0, "MMA_TILE_M must divide EPI_TILE_M");
 
-    if constexpr (epi_tile_m * epi_tile_n > mma_tile_m * mma_tile_n) {
+    if constexpr (epi_tile_n > mma_tile_n) {
       // When the epilogue subtile is larger than the MMA tiles, loop over multiple MMA tiles
       CUTE_STATIC_ASSERT(epi_tile_n % mma_tile_n == 0, "MMA_TILE_N must divide EPI_TILE_N");
     }
@@ -818,12 +854,12 @@ public:
       synchronize(); // ensure all threads have issued their async fence
       if constexpr (is_destination_supported) {
         if (issue_tma_store) {
-          copy(params.tma_store_d.with(store_tensormap), bSG_sD(_,_,_,store_pipe_producer_state.index()), bSG_gD(_,_,_,epi_m,epi_n));
+          copy(params.tma_store_d.with(get<0>(store_tensormaps)), bSG_sD(_,_,_,store_pipe_producer_state.index()), bSG_gD(_,_,_,epi_m,epi_n));
         }
       }
 
       // Post async fence, pre TMA commit callback entry point
-      cst_callbacks.tma_store(epi_m, epi_n, store_pipe_producer_state.count(), issue_tma_store);
+      cst_callbacks.tma_store(epi_m, epi_n, store_pipe_producer_state.count(), issue_tma_store, get<1>(store_tensormaps));
 
       // Commit the TMA stores for this stage
       if (issue_tma_store) {
@@ -901,8 +937,8 @@ public:
         if constexpr (epi_tile_m * epi_tile_n > mma_tile_m * mma_tile_n) {
           // When the epilogue subtile is larger than the MMA tiles, loop over multiple
           // MMA tiles
-          static constexpr int MmaMPerEpiM = epi_tile_m / mma_tile_m;
-          static constexpr int MmaNPerEpiN = epi_tile_n / mma_tile_n;
+          constexpr int MmaMPerEpiM = epi_tile_m / mma_tile_m;
+          constexpr int MmaNPerEpiN = epi_tile_n / mma_tile_n;
 
           CUTLASS_PRAGMA_UNROLL
           for (int mma_n_in_epi = 0; mma_n_in_epi < MmaNPerEpiN; ++mma_n_in_epi) {
@@ -951,8 +987,8 @@ public:
         // Copy tile from register to regiser if needed
         if constexpr (IsUseR2R) {
           // retile source and destination for tiled_r2r
-          Tensor tRR_rD_src = thread_r2r.retile_S(tRS_rD);                             // (R2R,R2R_M,R2R_N,EPI_M,EPI_N)
-          Tensor tRR_rD_dst = thread_r2r.retile_D(tRS_rD);                             // (R2R,R2R_M,R2R_N,EPI_M,EPI_N)
+          Tensor tRR_rD_src = thread_r2r.retile_S(tRS_rCompute);                             // (R2R,R2R_M,R2R_N,EPI_M,EPI_N)
+          Tensor tRR_rD_dst = thread_r2r.retile_D(tRS_rCompute);                             // (R2R,R2R_M,R2R_N,EPI_M,EPI_N)
 
           // Output needs register shuffling before copying to shared memory.
           copy(tiled_r2r, tRR_rD_src, tRR_rD_dst);
@@ -994,6 +1030,47 @@ public:
     return cute::make_tuple(load_pipe_consumer_state, store_pipe_producer_state);
   }
 
+  // Soft-deprecated overload accepting a bare cute::TmaDescriptor* (pre-EVT-tensormap API).
+  // See the analogous load() overload above for rationale.
+  template<
+    class ProblemShapeMNKL,
+    class TileShapeMNK,
+    class TileCoordMNKL,
+    class AccEngine, class AccLayout,
+    class TiledMma
+  >
+  [[deprecated(
+    "Passing a bare cute::TmaDescriptor* to CollectiveEpilogue::store is deprecated. "
+    "Pass the cute::tuple<TmaDescriptor*, FusionTensorMaps> returned by store_init() directly so "
+    "that AuxStore EVT nodes also receive their tensormaps. This compatibility overload skips "
+    "EVT tensormap updates and will be removed in a future release."
+  )]]
+  CUTLASS_DEVICE auto
+  store(
+      LoadPipeline load_pipeline,
+      LoadPipelineState load_pipe_consumer_state,
+      StorePipeline store_pipeline,
+      StorePipelineState store_pipe_producer_state,
+      ProblemShapeMNKL problem_shape_mnkl,
+      TileShapeMNK tile_shape_MNK,
+      TileCoordMNKL tile_coord_mnkl,
+      cute::Tensor<AccEngine,AccLayout> accumulators,
+      TiledMma tiled_mma,
+      int thread_idx,
+      TensorStorage& shared_tensors,
+      cute::TmaDescriptor* const& store_tensormap,
+      int subtile_idx=-1) {
+    using EvtTensormapsType = decltype(
+        fusion_callbacks.template get_tensormap_callbacks<false>().init(0, 0, 0));
+    return store(
+        load_pipeline, load_pipe_consumer_state,
+        store_pipeline, store_pipe_producer_state,
+        problem_shape_mnkl, tile_shape_MNK, tile_coord_mnkl,
+        accumulators, tiled_mma, thread_idx, shared_tensors,
+        cute::make_tuple(store_tensormap, EvtTensormapsType{}),
+        subtile_idx);
+  }
+
   CUTLASS_DEVICE auto
   store_tail(
       LoadPipeline load_pipeline,
@@ -1027,16 +1104,19 @@ public:
       int32_t sm_count,
       int32_t sm_idx,
       int32_t warp_group_idx) {
+    constexpr bool IsLoad = false;
     int warp_idx_in_warp_group = canonical_warp_idx_sync() % NumWarpsPerWarpGroup;
     // Since only one warp issues TMA store, we only need that one warp to initialize tensormaps
     if (warp_idx_in_warp_group == 0) {
       // Initialize tma
-      constexpr bool IsLoad = false;
       auto store_tensormaps = tensormaps_init<IsLoad>(params, shared_tensormaps, sm_count, sm_idx, warp_group_idx);
       return store_tensormaps;
     }
-    TmaDescriptor* null_tma_desc = nullptr;
-    return cute::make_tuple(null_tma_desc);
+    else {
+      // Return dummy values (value-initialized via {})
+      auto store_tensormaps = decltype(tensormaps_init<IsLoad>(params, shared_tensormaps, sm_count, sm_idx, warp_group_idx)){};
+      return store_tensormaps;
+    }
   }
 
   //
@@ -1052,14 +1132,15 @@ public:
       int32_t sm_idx,
       int32_t warp_group_idx) {
 
-    constexpr uint32_t NumInputTensors = NumEpilogueWarpGroups + (cute::is_void_v<ElementC> ? 0 : 1);
+    constexpr uint32_t NumInputTensors = (is_destination_supported ? NumEpilogueWarpGroups : 0) + (is_source_supported ? 1 : 0);
     Layout desc_layout = make_layout(make_shape(sm_count, Int<NumInputTensors>{}));
 
     Tensor gmem_tensormap = make_tensor(params.tensormaps, desc_layout);                      // (SMs, NumInputTensors)
+    TmaDescriptor* tma_desc = nullptr;
 
     if constexpr (IsLoad) {
-      if (is_source_supported) {
-        constexpr int C_tensormap_index = NumEpilogueWarpGroups;
+      if constexpr (is_source_supported) {
+        constexpr int C_tensormap_index = is_destination_supported ? NumEpilogueWarpGroups : 0;
         Tensor pC_tensormap = make_tensor(params.tma_load_c.get_tma_descriptor(), Int<1>{}, Int<1>{});
         Tensor sC_tensormap = make_tensor(make_smem_ptr(&shared_tensormaps.smem_tensormap_C), Int<1>{}, Int<1>{});
 
@@ -1068,23 +1149,26 @@ public:
           copy(recast<uint128_t>(pC_tensormap), recast<uint128_t>(sC_tensormap));
         }
         __syncwarp();
-        return cute::make_tuple(&gmem_tensormap(sm_idx, C_tensormap_index));
-
+        tma_desc = &gmem_tensormap(sm_idx, C_tensormap_index);
       }
-      TmaDescriptor* null_tma_desc = nullptr;
-      return cute::make_tuple(null_tma_desc);
     }
     else {
-      Tensor pD_tensormap = make_tensor(params.tma_store_d.get_tma_descriptor(), Int<1>{}, Int<1>{});
-      Tensor sD_tensormap = make_tensor(make_smem_ptr(&shared_tensormaps.smem_tensormap_D[warp_group_idx]), Int<1>{}, Int<1>{});
+      if constexpr (is_destination_supported) {
+        Tensor pD_tensormap = make_tensor(params.tma_store_d.get_tma_descriptor(), Int<1>{}, Int<1>{});
+        Tensor sD_tensormap = make_tensor(make_smem_ptr(&shared_tensormaps.smem_tensormap_D[warp_group_idx]), Int<1>{}, Int<1>{});
 
-      if (cute::elect_one_sync()) {
-        // Bringing tensormaps from params to smem for modification later
-        copy(recast<uint128_t>(pD_tensormap), recast<uint128_t>(sD_tensormap));
+        if (cute::elect_one_sync()) {
+          // Bringing tensormaps from params to smem for modification later
+          copy(recast<uint128_t>(pD_tensormap), recast<uint128_t>(sD_tensormap));
+        }
+        __syncwarp();
+        tma_desc = &gmem_tensormap(sm_idx, warp_group_idx);
       }
-      __syncwarp();
-      return cute::make_tuple(&gmem_tensormap(sm_idx, warp_group_idx));
     }
+
+    auto fusion_tensormap_callbacks = fusion_callbacks.template get_tensormap_callbacks<IsLoad>();
+    auto fusion_tensormaps = fusion_tensormap_callbacks.init(sm_count, sm_idx, warp_group_idx);
+    return cute::make_tuple(tma_desc, fusion_tensormaps);
   }
 
   // Replace address for the global tensor (to be done by single thread)
@@ -1105,9 +1189,11 @@ public:
         }
       }
     }
-    else if constexpr (is_destination_supported) {
-      cute::tma_descriptor_replace_addr_in_shared_mem(shared_tensormaps.smem_tensormap_D[warp_group_idx],
-                                                      params.ptr_D[next_batch]);
+    else {
+      if constexpr (is_destination_supported) {
+        cute::tma_descriptor_replace_addr_in_shared_mem(shared_tensormaps.smem_tensormap_D[warp_group_idx],
+                                                        params.ptr_D[next_batch]);
+      }
     }
   }
 
@@ -1121,8 +1207,8 @@ public:
       int32_t next_group,
       ProblemShape_MNKL problem_shape_mnkl,
       int32_t warp_group_idx) {
-    const uint32_t M = get<0>(problem_shape_mnkl);
-    const uint32_t N = get<1>(problem_shape_mnkl);
+    const auto M = get<0>(problem_shape_mnkl);
+    const auto N = get<1>(problem_shape_mnkl);
     // Replace all dims for consistency
     constexpr int MaxTensorRank = 5;
     cute::array<uint32_t, MaxTensorRank> prob_shape  = {1,1,1,1,1};
@@ -1135,7 +1221,7 @@ public:
           Tensor tensor_c = make_tensor(ptr_C, make_layout(make_shape(M,N,Int<1>{}), params.dC[next_group]));
 
           cute::detail::fill_tma_gmem_shape_stride(params.tma_load_c, tensor_c, 
-                                                  prob_shape, prob_stride);
+                                                   prob_shape, prob_stride);
           // Convert strides to byte strides
           for (uint64_t& stride : prob_stride) {
             stride = (stride * sizeof_bits_v<ElementC>) / 8;
@@ -1146,30 +1232,32 @@ public:
         }
       }
     }
-    else if constexpr (is_destination_supported) {
-      ElementD const* ptr_D = nullptr;
-      Tensor tensor_d = make_tensor(ptr_D, make_layout(make_shape(M,N,Int<1>{}), params.dD[next_group]));
+    else {
+      if constexpr (is_destination_supported) {
+        ElementD const* ptr_D = nullptr;
+        Tensor tensor_d = make_tensor(ptr_D, make_layout(make_shape(M,N,Int<1>{}), params.dD[next_group]));
 
-      cute::detail::fill_tma_gmem_shape_stride(params.tma_store_d, tensor_d, 
-                                               prob_shape, prob_stride);
-      // Convert strides to byte strides
-      for (uint64_t& stride : prob_stride) {
-        stride = (stride * sizeof_bits_v<ElementD>) / 8;
+        cute::detail::fill_tma_gmem_shape_stride(params.tma_store_d, tensor_d,
+                                                 prob_shape, prob_stride);
+        // Convert strides to byte strides
+        for (uint64_t& stride : prob_stride) {
+          stride = (stride * sizeof_bits_v<ElementD>) / 8;
+        }
+
+        cute::tma_descriptor_replace_dims_strides_in_shared_mem(shared_tensormaps.smem_tensormap_D[warp_group_idx],
+                                                                prob_shape,
+                                                                prob_stride);
       }
-
-      cute::tma_descriptor_replace_dims_strides_in_shared_mem(shared_tensormaps.smem_tensormap_D[warp_group_idx],
-                                                              prob_shape,
-                                                              prob_stride);
     }
   }
 
-  template <bool IsLoad, class ProblemShape_MNKL>
+  template <bool IsLoad, class ProblemShape_MNKL, class FusionTensorMaps>
   CUTLASS_DEVICE
   void
   tensormaps_perform_update(
       TensorMapStorage& shared_tensormaps,
       Params const& params,
-      cute::TmaDescriptor const* tensormap,
+      cute::tuple<cute::TmaDescriptor*, FusionTensorMaps> const& tensormaps,
       ProblemShape_MNKL problem_shape_mnkl,
       int32_t next_batch,
       int32_t warp_group_idx) {
@@ -1183,15 +1271,45 @@ public:
             shared_tensormaps, params, next_batch, problem_shape_mnkl, warp_group_idx);
       }
 
+      auto fusion_tensormap_callbacks = fusion_callbacks.template get_tensormap_callbacks<IsLoad>();
+      fusion_tensormap_callbacks.perform_update(cute::get<1>(tensormaps), problem_shape_mnkl, next_batch, warp_group_idx);
     }
   }
 
-  template <bool IsLoad>
+  // Soft-deprecated overload accepting a bare cute::TmaDescriptor* (pre-EVT-tensormap API).
+  // Forwards to the tuple form with a default-constructed FusionTensorMaps placeholder; the
+  // fusion-side perform_update is invoked with empty per-op tensormaps and is a structural
+  // no-op for any fusion tree without AuxLoad/AuxStore nodes.
+  template <bool IsLoad, class ProblemShape_MNKL>
+  [[deprecated(
+    "Passing a bare cute::TmaDescriptor* to CollectiveEpilogue::tensormaps_perform_update is "
+    "deprecated. Pass the cute::tuple<TmaDescriptor*, FusionTensorMaps> returned by "
+    "load_init()/store_init() directly so AuxLoad/AuxStore EVT nodes also get their tensormaps "
+    "updated. This compatibility overload will be removed in a future release."
+  )]]
+  CUTLASS_DEVICE
+  void
+  tensormaps_perform_update(
+      TensorMapStorage& shared_tensormaps,
+      Params const& params,
+      cute::TmaDescriptor* const& tensormap,
+      ProblemShape_MNKL problem_shape_mnkl,
+      int32_t next_batch,
+      int32_t warp_group_idx) {
+    using EvtTensormapsType = decltype(
+        fusion_callbacks.template get_tensormap_callbacks<IsLoad>().init(0, 0, 0));
+    tensormaps_perform_update<IsLoad>(
+        shared_tensormaps, params,
+        cute::make_tuple(tensormap, EvtTensormapsType{}),
+        problem_shape_mnkl, next_batch, warp_group_idx);
+  }
+
+  template <bool IsLoad, class FusionTensorMaps>
   CUTLASS_DEVICE
   void
   tensormaps_cp_fence_release(
       TensorMapStorage& shared_tensormaps,
-      cute::TmaDescriptor const* tensormap,
+      cute::tuple<cute::TmaDescriptor*, FusionTensorMaps> const& tensormaps,
       const int32_t warp_group_idx = 0) {
     // Commit and wait for all TMA load/store instructions before updating the tensormap in gmem.
     // This operation only happens when the group/batch changes between consecutive tiles.
@@ -1206,27 +1324,76 @@ public:
     if constexpr (IsLoad) {
       if constexpr (is_source_supported) {
         tma_desc_wait_all_fn();
-        tma_descriptor_cp_fence_release(tensormap, shared_tensormaps.smem_tensormap_C);
+        tma_descriptor_cp_fence_release(cute::get<0>(tensormaps), shared_tensormaps.smem_tensormap_C);
       }
     }
-    else if constexpr (is_destination_supported) {
-      tma_desc_wait_all_fn();
-      tma_descriptor_cp_fence_release(tensormap, shared_tensormaps.smem_tensormap_D[warp_group_idx]);
+    else {
+      if constexpr (is_destination_supported) {
+        tma_desc_wait_all_fn();
+        tma_descriptor_cp_fence_release(cute::get<0>(tensormaps), shared_tensormaps.smem_tensormap_D[warp_group_idx]);
+      }
     }
+
+    auto fusion_tensormap_callbacks = fusion_callbacks.template get_tensormap_callbacks<IsLoad>();
+    fusion_tensormap_callbacks.cp_fence_release(cute::get<1>(tensormaps), warp_group_idx);
   }
 
+  // Soft-deprecated overload accepting a bare cute::TmaDescriptor* (pre-EVT-tensormap API).
   template <bool IsLoad>
+  [[deprecated(
+    "Passing a bare cute::TmaDescriptor* to CollectiveEpilogue::tensormaps_cp_fence_release is "
+    "deprecated. Pass the cute::tuple<TmaDescriptor*, FusionTensorMaps> returned by "
+    "load_init()/store_init() directly so AuxLoad/AuxStore EVT nodes also get their fence "
+    "release. This compatibility overload will be removed in a future release."
+  )]]
   CUTLASS_DEVICE
   void
-  tensormaps_fence_acquire(cute::TmaDescriptor const* tensormap) {
+  tensormaps_cp_fence_release(
+      TensorMapStorage& shared_tensormaps,
+      cute::TmaDescriptor* const& tensormap,
+      const int32_t warp_group_idx = 0) {
+    using EvtTensormapsType = decltype(
+        fusion_callbacks.template get_tensormap_callbacks<IsLoad>().init(0, 0, 0));
+    tensormaps_cp_fence_release<IsLoad>(
+        shared_tensormaps,
+        cute::make_tuple(tensormap, EvtTensormapsType{}),
+        warp_group_idx);
+  }
+
+  template <bool IsLoad, class FusionTensorMaps>
+  CUTLASS_DEVICE
+  void
+  tensormaps_fence_acquire(cute::tuple<cute::TmaDescriptor*, FusionTensorMaps> const& tensormaps) {
     if constexpr (IsLoad) {
       if constexpr (is_source_supported) {
-        cute::tma_descriptor_fence_acquire(tensormap);
+        cute::tma_descriptor_fence_acquire(cute::get<0>(tensormaps));
       }
     } 
     else {
-      cute::tma_descriptor_fence_acquire(tensormap);
+      if constexpr (is_destination_supported) {
+        cute::tma_descriptor_fence_acquire(cute::get<0>(tensormaps));
+      }
     }
+
+    auto fusion_tensormap_callbacks = fusion_callbacks.template get_tensormap_callbacks<IsLoad>();
+    fusion_tensormap_callbacks.fence_acquire(cute::get<1>(tensormaps));
+  }
+
+  // Soft-deprecated overload accepting a bare cute::TmaDescriptor* (pre-EVT-tensormap API).
+  template <bool IsLoad>
+  [[deprecated(
+    "Passing a bare cute::TmaDescriptor* to CollectiveEpilogue::tensormaps_fence_acquire is "
+    "deprecated. Pass the cute::tuple<TmaDescriptor*, FusionTensorMaps> returned by "
+    "load_init()/store_init() directly so AuxLoad/AuxStore EVT nodes also get their fence "
+    "acquire. This compatibility overload will be removed in a future release."
+  )]]
+  CUTLASS_DEVICE
+  void
+  tensormaps_fence_acquire(cute::TmaDescriptor* const& tensormap) {
+    using EvtTensormapsType = decltype(
+        fusion_callbacks.template get_tensormap_callbacks<IsLoad>().init(0, 0, 0));
+    tensormaps_fence_acquire<IsLoad>(
+        cute::make_tuple(tensormap, EvtTensormapsType{}));
   }
 
 private:
diff --git a/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp b/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp
index c15c47231..686804652 100644
--- a/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp
+++ b/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp
@@ -662,7 +662,7 @@ public:
       }
     }();
     // Relative coordinate tensors (static)
-    Tensor cD = make_coord_tensor(cD_mn.layout());                                                  // (CTA_M,CTA_N)
+    Tensor cD = make_coord_tensor(cD_mn.layout());                                                     // (CTA_M,CTA_N)
     Tensor tRS_cD = make_coord_tensor(tRS_cD_mn.layout());                          // (R2S,R2S_M,R2S_N,EPI_M,EPI_N)
     // Subtract the global "bottom right" corner from the local "top left" corner to get the max relative coordinate
     auto residue_cD = make_coord(M,N) - cD_mn(_0{});                                                           // (m,n)
diff --git a/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp
index 0a931dd32..a0cfe55b3 100644
--- a/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp
+++ b/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp
@@ -113,6 +113,62 @@ struct FusionCallbacks<
   using Impl::Impl;
 };
 
+// D = alpha * acc
+template <
+  int StagesC,
+  int StagesD,
+  int FragmentSize,
+  bool ReuseSmemC,
+  bool DelayTmaStore,
+  int NumEpilogueWarpGroups,
+  class ElementOutput,
+  class ElementCompute,
+  class ElementScalar,
+  FloatRoundStyle RoundStyle,
+  class CtaTileShapeMNK,
+  class EpilogueTile
+>
+struct FusionCallbacks<
+    epilogue::Sm90PtrArrayTmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmemC, DelayTmaStore, NumEpilogueWarpGroups>,
+    fusion::ScaledAcc<ElementOutput, ElementCompute, ElementScalar, RoundStyle>,
+    CtaTileShapeMNK,
+    EpilogueTile
+> : Sm90EVT<Sm90Compute<multiplies, ElementOutput, ElementCompute, RoundStyle>,
+      Sm90ScalarBroadcastPtrArray<ElementScalar, Stride<_0,_0,int64_t>>, 
+      Sm90AccFetch
+    > {
+  using Impl = 
+    Sm90EVT<Sm90Compute<multiplies, ElementOutput, ElementCompute, RoundStyle>,
+      Sm90ScalarBroadcastPtrArray<ElementScalar, Stride<_0,_0,int64_t>>,
+      Sm90AccFetch
+    >;
+  using Operation = fusion::ScaledAcc<ElementOutput, ElementCompute, ElementScalar, RoundStyle>;
+
+  struct Arguments {
+    // Give a name and flat ordering to the fusion callback args
+    ElementScalar alpha = ElementScalar(1);
+    ElementScalar const* alpha_ptr = nullptr;
+    ElementScalar const* const* alpha_ptr_array = nullptr;
+
+    using StrideAlpha = Stride<_0,_0,int64_t>;
+    StrideAlpha dAlpha = {_0{}, _0{}, 0};
+
+    // Conversion to the args expected by the visitor implementation
+    // to_underlying_arguments will implicitly call this
+    operator typename Impl::Arguments() const {
+      return
+        {    // binary op : alpha * acc
+          {{alpha}, {alpha_ptr}, {alpha_ptr_array}, {dAlpha}}, // leaf args : alpha
+          {},                     // leaf args : acc
+          {} // binary args : multiplies
+        };   // end binary op
+    }
+  };
+
+  // Ctor inheritance
+  using Impl::Impl;
+};
+
 /////////////////////////////////////////////////////////////////////////////////////////////////
 
 // D = alpha * acc + beta * C
diff --git a/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp
index a58475601..939b149f8 100644
--- a/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp
+++ b/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp
@@ -603,6 +603,396 @@ struct Sm90AuxLoad<
   }
 };
 
+template <
+  int Stages,
+  int NumEpilogueWarpGroups,
+  class EpilogueTile,
+  class Element,
+  class StrideMNL,
+  class SmemLayoutAtom,
+  class CopyOpS2R,
+  int Alignment = 128 / sizeof_bits_v<Element>,
+  bool EnableNullptr = true // Fallback scalar broadcast for nullptr params
+>
+struct Sm90AuxArrayLoad {
+  static_assert(Alignment * sizeof_bits_v<Element> % 128 == 0, "sub-16B alignment not supported yet");
+
+  using InternalStrideMNL = cute::remove_pointer_t<StrideMNL>;
+  static constexpr bool IsGroupedGemmKernel = !cute::is_same_v<InternalStrideMNL, StrideMNL>;
+
+  constexpr static bool is_m_major = epilogue::collective::detail::is_m_major<StrideMNL>();
+  // Find the max contiguous layout usable by TMA (if EpilogueTile is a non-compact tiler)
+  using SmemShapeTma = decltype(make_shape(
+      max_common_vector(make_layout(get<0>(EpilogueTile{})),make_layout(get<0>(EpilogueTile{}))),
+      max_common_vector(make_layout(get<1>(EpilogueTile{})),make_layout(get<1>(EpilogueTile{})))));
+  using SmemLayoutTma = decltype(tile_to_shape(
+      SmemLayoutAtom{}, SmemShapeTma{},
+      cute::conditional_t<is_m_major, Step<_2,_1>, Step<_1,_2>>{} ));
+  using SmemLayout = decltype(tile_to_shape(
+      SmemLayoutTma{},
+      make_shape(size<0>(shape(EpilogueTile{})), size<1>(shape(EpilogueTile{})), Int<Stages>{}),
+      cute::conditional_t<is_m_major, Step<_2,_1,_3>, Step<_1,_2,_3>>{} ));
+  using CopyOpG2S =
+      SM90_TMA_LOAD
+    ;
+
+  struct SharedStorage {
+    alignas(cutlass::detail::alignment_for_swizzle(SmemLayout{}))
+    array_aligned<Element, size(SmemLayout{})> smem_aux;
+    cute::array<cute::TmaDescriptor, NumEpilogueWarpGroups> smem_tensormap_aux;
+  };
+
+  struct Arguments {
+    Element const** ptr_aux = nullptr;
+    Element null_default = Element(0);
+    StrideMNL dAux = {};
+    int aux_sm_count = 0;
+  };
+
+  struct Params {
+    using TMA_Aux = decltype(make_tma_copy(
+        CopyOpG2S{},
+        make_tensor(make_gmem_ptr(static_cast<Element const*>(nullptr)), repeat_like(StrideMNL{}, int32_t(0)), append<3>(StrideMNL{}, _0{})),
+        take<0,2>(SmemLayoutTma{})));
+    TMA_Aux tma_load_aux;
+    cute::TmaDescriptor* tensormaps{};
+    Element const** ptr_aux{};
+    StrideMNL dAux{};
+    Element null_default = Element(0);
+    bool use_default = false;
+  };
+
+  template <bool IsLoad>
+  using TensorMaps = cute::conditional_t<IsLoad, cute::TmaDescriptor*, cute::tuple<>>;
+
+  template <class ProblemShape>
+  static constexpr Params
+  to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) {
+
+    bool use_default = false;
+    if constexpr (EnableNullptr) {
+      use_default = args.ptr_aux == nullptr;
+    }
+
+    typename Params::TMA_Aux tma_load_aux;
+    if (not use_default) {
+      // These tensor shapes (only applicable for grouped gemm) and pointers are only used to create tensormap/tma desc.
+      // These will be replaced with correct values before the initial tma load.
+      // auto init_shape = repeat_like(append<4>(typename ProblemShape::UnderlyingProblemShape{}, 1), int32_t(1));
+      auto init_shape = repeat_like(append<4>(problem_shape, 1), int32_t(1));
+      auto init_M = get<0>(init_shape);
+      auto init_N = get<1>(init_shape);
+      auto init_L = get<3>(init_shape);
+
+      // Strides for Grouped Gemm will be replaced prior to the first access regardless.
+      InternalStrideMNL stride_mnl{};
+      if constexpr (not IsGroupedGemmKernel) {
+        // Tensor shapes for Ptr-Array are initialized correctly only here.
+        // auto problem_shape_MNKL = append<4>(problem_shape.get_host_problem_shape(0), 1);
+        auto problem_shape_MNKL = append<4>(problem_shape, 1);
+        init_M = get<0>(problem_shape_MNKL);
+        init_N = get<1>(problem_shape_MNKL);
+        init_L = get<3>(problem_shape_MNKL);
+
+        stride_mnl = args.dAux;
+      }
+
+      // Tensor pointers will be fixed before the first access
+      Element* ptr_aux_first_batch = nullptr;
+      Tensor tensor_aux = make_tensor(ptr_aux_first_batch, make_layout(make_shape(init_M,init_N,init_L), stride_mnl));
+      tma_load_aux = make_tma_copy(CopyOpG2S{}, tensor_aux, SmemLayoutTma{}, EpilogueTile{}, Shape<_1,_1,_1>{});
+    }
+
+    return {
+      tma_load_aux,
+      static_cast<cute::TmaDescriptor*>(workspace),
+      args.ptr_aux,
+      args.dAux,
+      args.null_default,
+      use_default
+    };
+  }
+
+  template <class ProblemShape>
+  static bool
+  can_implement(ProblemShape const& problem_shape, Arguments const& args) {
+    return true;
+  }
+
+  template <class ProblemShape>
+  static size_t
+  get_workspace_size(ProblemShape const& problem_shape, Arguments const& args) {
+    Layout desc_layout = make_layout(make_shape(args.aux_sm_count, Int<NumEpilogueWarpGroups>{}));
+    return sizeof(cute::TmaDescriptor) * cute::cosize(desc_layout);
+  }
+
+  template <class ProblemShape>
+  static cutlass::Status
+  initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream,
+    CudaHostAdapter* cuda_adapter = nullptr) {
+    return cutlass::Status::kSuccess;
+  }
+
+  CUTLASS_HOST_DEVICE
+  Sm90AuxArrayLoad() { }
+
+  CUTLASS_HOST_DEVICE
+  Sm90AuxArrayLoad(Params const& params, SharedStorage const& shared_storage)
+      : params_ptr(&params),
+        smem_aux(const_cast<Element*>(shared_storage.smem_aux.data())),
+        smem_tensormap_aux(const_cast<cute::TmaDescriptor*>(&shared_storage.smem_tensormap_aux[0])) { }
+
+  Params const* params_ptr;
+  Element* smem_aux;
+  cute::TmaDescriptor* smem_tensormap_aux;
+
+  CUTLASS_DEVICE bool
+  is_producer_load_needed() const {
+    return true;
+  }
+
+  CUTLASS_DEVICE bool
+  is_C_load_needed() const {
+    return false;
+  }
+
+  CUTLASS_DEVICE bool
+  is_zero() const {
+    return (params_ptr->use_default && params_ptr->null_default == Element(0));
+  }
+
+  template <class GTensor, class STensor>
+  struct ProducerLoadCallbacks : EmptyProducerLoadCallbacks {
+    CUTLASS_DEVICE
+    ProducerLoadCallbacks(GTensor&& bGS_gAux, STensor&& bGS_sAux, Params const* params_ptr)
+      : bGS_gAux(cute::forward<GTensor>(bGS_gAux)),
+        bGS_sAux(cute::forward<STensor>(bGS_sAux)),
+        params_ptr(params_ptr) {}
+
+    GTensor bGS_gAux;                                                                  // (TMA,TMA_M,TMA_N,EPI_M,EPI_N)
+    STensor bGS_sAux;                                                                  // (TMA,TMA_M,TMA_N,PIPE)
+    Params const* params_ptr;
+
+    CUTLASS_DEVICE void
+    step(uint64_t* full_mbarrier_ptr, int epi_m, int epi_n, int load_iteration, bool issue_tma_load, TensorMaps<true> tensormap) {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->use_default) {
+          return;
+        }
+      }
+
+      if (issue_tma_load) {
+        // Increment the expected transaction bytes of the current stage's mbarrier by the subtile's byte-size
+        constexpr uint32_t copy_bytes = size(take<0,2>(SmemLayout{})) * sizeof_bits_v<Element> / 8;
+        cutlass::arch::ClusterTransactionBarrier::expect_transaction(full_mbarrier_ptr, copy_bytes);
+        // Issue the TMA load
+        constexpr uint16_t mcast_mask = 0;
+        int load_pipe_index = load_iteration % Stages;
+        copy(params_ptr->tma_load_aux.with(tensormap, *full_mbarrier_ptr, mcast_mask),
+          bGS_gAux(_,_,_,epi_m,epi_n), bGS_sAux(_,_,_,load_pipe_index));
+      }
+    }
+  };
+
+  template <class... Args>
+  CUTLASS_DEVICE auto
+  get_producer_load_callbacks(ProducerLoadArgs<Args...> const& args) {
+
+    auto [M, N, K, L] = args.problem_shape_mnkl;
+    auto [m, n, k, l] = args.tile_coord_mnkl;
+    auto coord_shape =
+        make_coord(m, n, l)
+      ;
+    Tensor mAux_mn = params_ptr->tma_load_aux.get_tma_tensor(make_shape(M,N,L));                             // (M,N,L)
+    Tensor mAux = coalesce(mAux_mn, take<0,2>(args.tile_shape_mnk));
+    Tensor gAux = local_tile(mAux, take<0,2>(args.tile_shape_mnk), coord_shape);                       // (CTA_M,CTA_N)
+
+    Tensor gAux_epi = flat_divide(gAux, args.epi_tile);                          // (EPI_TILE_M,EPI_TILE_N,EPI_M,EPI_N)
+    Tensor sAux_epi = make_tensor(make_smem_ptr(smem_aux), SmemLayout{});        // (EPI_TILE_M,EPI_TILE_N,PIPE)
+
+    ThrCopy thrblk_g2s = params_ptr->tma_load_aux.get_slice(_0{});
+    Tensor bGS_gAux = thrblk_g2s.partition_S(gAux_epi);                                // (TMA,TMA_M,TMA_N,EPI_M,EPI_N)
+    Tensor bGS_sAux = thrblk_g2s.partition_D(sAux_epi);                                // (TMA,TMA_M,TMA_N,PIPE)
+
+    return ProducerLoadCallbacks<decltype(bGS_gAux), decltype(bGS_sAux)>(
+      cute::move(bGS_gAux), cute::move(bGS_sAux), params_ptr);
+  }
+
+  template <class RTensor, class TiledS2R, class STensorS2R>
+  struct ConsumerStoreCallbacks : EmptyConsumerStoreCallbacks {
+    CUTLASS_DEVICE
+    ConsumerStoreCallbacks(RTensor&& tC_rAux, TiledS2R tiled_s2r, STensorS2R&& tSR_sAux, Params const* params_ptr)
+      : tiled_s2r(tiled_s2r),
+        tC_rAux(cute::forward<RTensor>(tC_rAux)),
+        tSR_sAux(cute::forward<STensorS2R>(tSR_sAux)),
+        params_ptr(params_ptr) { }
+
+    TiledS2R tiled_s2r;
+    RTensor tC_rAux;                                                                          // (CPY,CPY_M,CPY_N)
+    STensorS2R tSR_sAux;                                                                      // (S2R,S2R_M,S2R_N,PIPE)
+    Params const* params_ptr;
+
+    CUTLASS_DEVICE void
+    previsit(int epi_m, int epi_n, int load_iteration, bool is_producer_load_needed) {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->use_default) {
+          fill(tC_rAux, params_ptr->null_default);
+          return;
+        }
+      }
+
+      using RLayoutS2R = decltype(cute::layout(TiledS2R{}.get_slice(0).retile_S(RTensor{})));
+      Tensor tSR_rAux = make_tensor(tC_rAux.data(), RLayoutS2R{});                                 // (S2R,S2R_M,S2R_N)
+
+      int load_pipe_index = load_iteration % Stages;
+      copy(tiled_s2r, tSR_sAux(_,_,_,load_pipe_index), tSR_rAux);
+    }
+
+    template <typename ElementAccumulator, int FragmentSize>
+    CUTLASS_DEVICE Array<Element, FragmentSize>
+    visit(Array<ElementAccumulator, FragmentSize> const& frg_acc, int epi_v, int epi_m, int epi_n) {
+      Tensor tC_rAux_frg = recast<Array<Element, FragmentSize>>(coalesce(tC_rAux));                          // (EPI_V)
+
+      return tC_rAux_frg(epi_v);
+    }
+  };
+
+  template <
+    bool ReferenceSrc, // do register tensors reference the src or dst layout of the tiled copy
+    class... Args
+  >
+  CUTLASS_DEVICE auto
+  get_consumer_store_callbacks(ConsumerStoreArgs<Args...> const& args) {
+
+    auto [M, N, K, L] = args.problem_shape_mnkl;
+
+    Tensor mAux_mn = params_ptr->tma_load_aux.get_tma_tensor(make_shape(M,N,L));                             // (M,N,L)
+    Tensor mAux = coalesce(mAux_mn, take<0,2>(args.tile_shape_mnk));
+    Tensor tC_gAux = sm90_partition_for_epilogue<ReferenceSrc                          // (CPY,CPY_M,CPY_N,EPI_M,EPI_N)
+      >(mAux, args.tile_shape_mnk, args.tile_coord_mnkl, args.epi_tile, args.tiled_copy, args.thread_idx);
+    Tensor tC_rAux = make_tensor<Element>(take<0,3>(shape(tC_gAux)));                  // (CPY,CPY_M,CPY_N)
+
+    auto tiled_s2r = conditional_return<ReferenceSrc>(
+      make_tiled_copy_S(Copy_Atom<CopyOpS2R,Element>{}, args.tiled_copy),
+      make_tiled_copy_D(Copy_Atom<CopyOpS2R,Element>{}, args.tiled_copy)
+    );
+    Tensor sAux_epi = cute::as_position_independent_swizzle_tensor(
+                        make_tensor(make_smem_ptr(smem_aux), SmemLayout{}));            // (EPI_TILE_M,EPI_TILE_N,PIPE)
+    auto tSR_sAux = tiled_s2r.get_slice(args.thread_idx).partition_S(sAux_epi);               // (S2R,S2R_M,S2R_N,PIPE)
+
+    return ConsumerStoreCallbacks<decltype(tC_rAux), decltype(tiled_s2r), decltype(tSR_sAux)>(
+        cute::move(tC_rAux), tiled_s2r, cute::move(tSR_sAux), params_ptr);
+  }
+
+  struct TensorMapCallbacks {
+
+    Params const* params_ptr;
+    cute::TmaDescriptor* smem_tensormap_aux;
+
+    CUTLASS_DEVICE
+    TensorMaps<true>
+    init(
+        int32_t sm_count,
+        int32_t sm_idx,
+        int32_t warp_group_idx)
+    {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->use_default) {
+          return nullptr;
+        }
+      }
+
+      // Bringing tensormaps from params to smem for modification
+      Tensor pAux_tensormap = make_tensor(params_ptr->tma_load_aux.get_tma_descriptor(), Int<1>{}, Int<1>{});
+      Tensor sAux_tensormap = make_tensor(make_smem_ptr(&smem_tensormap_aux[warp_group_idx]), Int<1>{}, Int<1>{});
+      if (cute::elect_one_sync()) {
+        copy(recast<uint128_t>(pAux_tensormap), recast<uint128_t>(sAux_tensormap));
+      }
+      __syncwarp();
+
+      // Return a pointer to gmem tensormap in workspace
+      Layout desc_layout = make_layout(make_shape(sm_count, Int<NumEpilogueWarpGroups>{}));
+      Tensor gmem_tensormap = make_tensor(params_ptr->tensormaps, desc_layout);                  // (SM, WarpGroup)
+
+      return &gmem_tensormap(sm_idx, warp_group_idx);
+    }
+
+    template <class ProblemShape_MNKL>
+    CUTLASS_DEVICE
+    void
+    perform_update(
+        TensorMaps<true> tensormap,
+        ProblemShape_MNKL problem_shape_mnkl,
+        int32_t next_batch,
+        int32_t warp_group_idx)
+    {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->use_default) {
+          return;
+        }
+      }
+
+      cute::tma_descriptor_replace_addr_in_shared_mem(smem_tensormap_aux[warp_group_idx], params_ptr->ptr_aux[next_batch]);
+
+      if constexpr (IsGroupedGemmKernel) {
+        // Replacing global dims and strides for the next batch
+        constexpr int MaxTensorRank = 5;
+        cute::array<uint32_t, MaxTensorRank> prob_shape  = {1,1,1,1,1};
+        cute::array<uint64_t, MaxTensorRank> prob_stride = {0,0,0,0,0};
+
+        Element const* ptr_Aux = nullptr;
+        Tensor tensor_aux = make_tensor(ptr_Aux, make_layout(append(select<0,1>(problem_shape_mnkl), Int<1>{}), params_ptr->dAux[next_batch]));
+
+        cute::detail::fill_tma_gmem_shape_stride(params_ptr->tma_load_aux, tensor_aux, prob_shape, prob_stride);
+
+        // Convert strides to byte strides
+        for (uint64_t& stride : prob_stride) {
+          stride = (stride * sizeof_bits_v<Element>) / 8;
+        }
+
+        cute::tma_descriptor_replace_dims_strides_in_shared_mem(smem_tensormap_aux[warp_group_idx], prob_shape, prob_stride);
+      }
+    }
+
+    CUTLASS_DEVICE
+    void
+    cp_fence_release(
+        TensorMaps<true> const& tensormap,
+        const int32_t warp_group_idx = 0)
+    {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->use_default) {
+          return;
+        }
+      }
+      cute::tma_descriptor_cp_fence_release(tensormap, smem_tensormap_aux[warp_group_idx]);
+    }
+
+    CUTLASS_DEVICE
+    void
+    fence_acquire(TensorMaps<true> const& tensormap)
+    {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->use_default) {
+          return;
+        }
+      }
+      cute::tma_descriptor_fence_acquire(tensormap);
+    }
+  };
+
+  template <bool IsLoad>
+  CUTLASS_DEVICE constexpr auto
+  get_tensormap_callbacks() {
+    if constexpr (IsLoad) {
+      return TensorMapCallbacks{params_ptr, smem_tensormap_aux};
+    }
+    else {
+      return EmptyTensorMapCallbacks{};
+    }
+  }
+};
+
 /////////////////////////////////////////////////////////////////////////////////////////////////
 //
 // Broadcast Load Operations
diff --git a/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp
index ee78ca6aa..e9e6fa8f7 100644
--- a/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp
+++ b/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp
@@ -94,11 +94,40 @@ struct Sm90AuxStore {
     StrideMNL dAux = {};
   };
 
-  struct Params {
-    using TMA_Aux = decltype(make_tma_copy(
+  // Gated / GLU fusions instantiate this class with a hierarchical mode-0 in StrideMNL
+  // (e.g. tuple<tuple<_1,int64_t,_8>, int64_t, int64_t>); they require the 5-arg
+  // make_tma_copy form to construct a valid TMA descriptor. Flat strides use the
+  // 3-arg form (the existing behavior) so they remain compatible with non-compact
+  // EpilogueTile cases (e.g. cutlass3x SM100 GELU-aux kernels) where the 5-arg
+  // form's size(slayout) == size(cta_v_map) assertion would fire.
+  // Use tuple_element_t to obtain the unqualified element type: `cute::get<0>`
+  // applied to a prvalue tuple returns an rvalue reference, and `is_tuple` on a
+  // reference type triggers "pointer to reference is not allowed" in some nvcc
+  // configurations (e.g. the CuTe `(T*)0` SFINAE probe).
+  static constexpr bool IsHierarchicalStride = cute::is_tuple<cute::tuple_element_t<0, StrideMNL>>::value;
+
+  // Helper specializations keep the unused decltype from being instantiated:
+  // conditional_t would eagerly evaluate both branches, and the 3-arg
+  // make_tma_copy doesn't substitute cleanly for hierarchical strides.
+  template <bool Hierarchical, class /* dummy to delay */ = void>
+  struct TmaAuxTypeHelper {
+    using type = decltype(make_tma_copy(
+        SM90_TMA_STORE{},
+        make_tensor(static_cast<Element*>(nullptr), repeat_like(StrideMNL{}, int32_t(0)), StrideMNL{}),
+        SmemLayoutTma{},
+        EpilogueTile{},
+        Shape<_1,_1,_1>{}));
+  };
+  template <class Dummy>
+  struct TmaAuxTypeHelper<false, Dummy> {
+    using type = decltype(make_tma_copy(
         SM90_TMA_STORE{},
         make_tensor(static_cast<Element*>(nullptr), repeat_like(StrideMNL{}, int32_t(0)), StrideMNL{}),
         SmemLayoutTma{}));
+  };
+
+  struct Params {
+    using TMA_Aux = typename TmaAuxTypeHelper<IsHierarchicalStride>::type;
     TMA_Aux tma_store_aux;
     bool is_nullptr = false;
   };
@@ -118,7 +147,11 @@ struct Sm90AuxStore {
     typename Params::TMA_Aux tma_store_aux;
     if (not is_nullptr) {
       Tensor tensor_aux = make_tensor(args.ptr_aux, make_layout(make_shape(M,N,L), args.dAux));
-      tma_store_aux = make_tma_copy(SM90_TMA_STORE{}, tensor_aux, SmemLayoutTma{});
+      if constexpr (IsHierarchicalStride) {
+        tma_store_aux = make_tma_copy(SM90_TMA_STORE{}, tensor_aux, SmemLayoutTma{}, EpilogueTile{}, Shape<_1,_1,_1>{});
+      } else {
+        tma_store_aux = make_tma_copy(SM90_TMA_STORE{}, tensor_aux, SmemLayoutTma{});
+      }
     }
 
     return {tma_store_aux, is_nullptr};
@@ -461,6 +494,386 @@ struct Sm90AuxStore<
 
 };
 
+template <
+  int Stages,
+  int NumEpilogueWarpGroups,
+  class EpilogueTile,
+  class Element,
+  FloatRoundStyle RoundStyle,
+  class StrideMNL,
+  class SmemLayoutAtom,
+  class CopyOpR2S,
+  int Alignment = 128 / sizeof_bits_v<Element>,
+  bool EnableNullptr = true // No-op on nullptr params
+>
+struct Sm90AuxArrayStore {
+  using ElementAux = Element;
+  static_assert(Alignment * sizeof_bits_v<Element> % 128 == 0, "sub-16B alignment not supported yet");
+
+  using InternalStrideMNL = cute::remove_pointer_t<StrideMNL>;
+  static constexpr bool IsGroupedGemmKernel = !cute::is_same_v<InternalStrideMNL, StrideMNL>;
+
+  constexpr static bool is_m_major = epilogue::collective::detail::is_m_major<InternalStrideMNL>();
+  // Find the max contiguous layout usable by TMA (if EpilogueTile is a non-compact tiler)
+  using SmemShapeTma = decltype(make_shape(
+      max_common_vector(make_layout(get<0>(EpilogueTile{})),make_layout(get<0>(EpilogueTile{}))),
+      max_common_vector(make_layout(get<1>(EpilogueTile{})),make_layout(get<1>(EpilogueTile{})))));
+  using SmemLayoutTma = decltype(tile_to_shape(
+      SmemLayoutAtom{}, SmemShapeTma{},
+      cute::conditional_t<is_m_major, Step<_2,_1>, Step<_1,_2>>{} ));
+  using SmemLayout = decltype(tile_to_shape(
+      SmemLayoutTma{},
+      make_shape(size<0>(shape(EpilogueTile{})), size<1>(shape(EpilogueTile{})), Int<Stages>{}),
+      cute::conditional_t<is_m_major, Step<_2,_1,_3>, Step<_1,_2,_3>>{} ));
+
+  struct SharedStorage {
+    alignas(cutlass::detail::alignment_for_swizzle(SmemLayout{}))
+    array_aligned<Element, size(SmemLayout{})> smem_aux;
+    cute::array<cute::TmaDescriptor, NumEpilogueWarpGroups> smem_tensormap_aux;
+  };
+
+  struct Arguments {
+    Element** ptr_aux{};
+    StrideMNL dAux{};
+    int aux_sm_count = 0;
+  };
+
+  struct Params {
+    using TMA_Aux = decltype(make_tma_copy(
+        SM90_TMA_STORE{},
+        make_tensor(static_cast<Element*>(nullptr), repeat_like(InternalStrideMNL{}, int32_t(0)), InternalStrideMNL{}),
+        SmemLayoutTma{},
+        EpilogueTile{},
+        Shape<_1,_1,_1>{}));
+    TMA_Aux tma_store_aux;
+    cute::TmaDescriptor* tensormaps{};
+    Element** ptr_aux{};
+    StrideMNL dAux{};
+    bool is_nullptr = false;
+  };
+
+  template <bool IsLoad>
+  using TensorMaps = cute::conditional_t<IsLoad, cute::tuple<>, cute::TmaDescriptor*>;
+
+  template <class ProblemShape>
+  static constexpr Params
+  to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) {
+
+    bool is_nullptr = false;
+    if constexpr (EnableNullptr) {
+      is_nullptr = args.ptr_aux == nullptr;
+    }
+
+    typename Params::TMA_Aux tma_store_aux;
+    if (not is_nullptr) {
+      // These tensor shapes (only applicable for grouped gemm) and pointers are only used to create tensormap/tma desc.
+      // These will be replaced with correct values before the initial tma store.
+      // auto init_shape = repeat_like(append<4>(typename ProblemShape::UnderlyingProblemShape{}, 1), int32_t(1));
+      auto init_shape = repeat_like(append<4>(problem_shape, 1), int32_t(1));
+      auto init_M = get<0>(init_shape);
+      auto init_N = get<1>(init_shape);
+      auto init_L = get<3>(init_shape);
+
+      // Strides for Grouped Gemm will be replaced prior to the first access regardless.
+      InternalStrideMNL stride_mnl{};
+      if constexpr (not IsGroupedGemmKernel) {
+        // Tensor shapes for Ptr-Array are initialized correctly only here.
+        // auto problem_shape_MNKL = append<4>(problem_shape.get_host_problem_shape(0), 1);
+        auto problem_shape_MNKL = append<4>(problem_shape, 1);
+        init_M = get<0>(problem_shape_MNKL);
+        init_N = get<1>(problem_shape_MNKL);
+        init_L = get<3>(problem_shape_MNKL);
+
+        stride_mnl = args.dAux;
+      }
+
+      // Tensor pointers will be fixed before the first access
+      Element* ptr_aux_first_batch = nullptr;
+      Tensor tensor_aux = make_tensor(ptr_aux_first_batch, make_layout(make_shape(init_M,init_N,init_L), stride_mnl));
+      tma_store_aux = make_tma_copy(SM90_TMA_STORE{}, tensor_aux, SmemLayoutTma{}, EpilogueTile{}, Shape<_1,_1,_1>{});
+    }
+
+    return {
+      tma_store_aux,
+      static_cast<cute::TmaDescriptor*>(workspace),
+      args.ptr_aux,
+      args.dAux,
+      is_nullptr
+    };
+  }
+
+  template <class ProblemShape>
+  static bool
+  can_implement(ProblemShape const& problem_shape, Arguments const& args) {
+    return true;
+  }
+
+  template <class ProblemShape>
+  static size_t
+  get_workspace_size(ProblemShape const& problem_shape, Arguments const& args) {
+    Layout desc_layout = make_layout(make_shape(args.aux_sm_count, Int<NumEpilogueWarpGroups>{}));
+    return sizeof(cute::TmaDescriptor) * cute::cosize(desc_layout);
+  }
+
+  template <class ProblemShape>
+  static cutlass::Status
+  initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream,
+    CudaHostAdapter* cuda_adapter = nullptr) {
+    return cutlass::Status::kSuccess;
+  }
+
+  CUTLASS_HOST_DEVICE
+  Sm90AuxArrayStore() { }
+
+  CUTLASS_HOST_DEVICE
+  Sm90AuxArrayStore(Params const& params, SharedStorage const& shared_storage)
+      : params_ptr(&params),
+        smem_aux(const_cast<Element*>(shared_storage.smem_aux.data())),
+        smem_tensormap_aux(const_cast<cute::TmaDescriptor*>(&shared_storage.smem_tensormap_aux[0])) { }
+
+  Params const* params_ptr;
+  Element* smem_aux;
+  cute::TmaDescriptor* smem_tensormap_aux;
+
+  CUTLASS_DEVICE bool
+  is_producer_load_needed() const {
+    return false;
+  }
+
+  CUTLASS_DEVICE bool
+  is_C_load_needed() const {
+    return false;
+  }
+
+  template <class... Args>
+  CUTLASS_DEVICE auto
+  get_producer_load_callbacks(ProducerLoadArgs<Args...> const& args) {
+    return EmptyProducerLoadCallbacks{};
+  }
+
+  template <
+    class RTensor,
+    class TiledR2S,
+    class STensorR2S,
+    class STensorS2G,
+    class GTensorS2G
+  >
+  struct ConsumerStoreCallbacks : EmptyConsumerStoreCallbacks {
+    CUTLASS_DEVICE
+    ConsumerStoreCallbacks(
+          RTensor&& tC_rAux,
+          TiledR2S tiled_r2s,
+          STensorR2S&& tRS_sAux,
+          STensorS2G&& bSG_sAux,
+          GTensorS2G&& bSG_gAux,
+          Params const* params_ptr)
+      : tiled_r2s(tiled_r2s),
+        tC_rAux(cute::forward<RTensor>(tC_rAux)),
+        tRS_sAux(cute::forward<STensorR2S>(tRS_sAux)),
+        bSG_sAux(cute::forward<STensorS2G>(bSG_sAux)),
+        bSG_gAux(cute::forward<GTensorS2G>(bSG_gAux)),
+        params_ptr(params_ptr) {}
+
+    TiledR2S tiled_r2s;
+    RTensor tC_rAux;                                                                   // (CPY,CPY_M,CPY_N)
+    STensorR2S tRS_sAux;                                                               // (R2S,R2S_M,R2S_N,PIPE)
+    STensorS2G bSG_sAux;                                                               // (S2G,S2G_M,S2G_N,PIPE)
+    GTensorS2G bSG_gAux;                                                               // (S2G,S2G_M,S2G_N,EPI_M,EPI_N)
+    Params const* params_ptr;
+
+    template <typename ElementAccumulator, typename ElementInput, int FragmentSize>
+    CUTLASS_DEVICE auto
+    visit(Array<ElementAccumulator, FragmentSize> const& frg_acc, int epi_v, int epi_m, int epi_n,
+          Array<ElementInput, FragmentSize> const& frg_input) {
+      using ConvertInput = NumericArrayConverter<Element, ElementInput, FragmentSize, RoundStyle>;
+      ConvertInput convert_input{};
+
+      Tensor tC_rAux_frg = recast<Array<Element, FragmentSize>>(coalesce(tC_rAux));                          // (EPI_V)
+      tC_rAux_frg(epi_v) = convert_input(frg_input);
+
+      return frg_input;
+    }
+
+    CUTLASS_DEVICE void
+    postreduce(int epi_m, int epi_n, int store_iteration, bool issue_smem_store) {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->is_nullptr) {
+          return;
+        }
+      }
+
+      using RLayoutR2S = decltype(cute::layout(TiledR2S{}.get_slice(0).retile_S(RTensor{})));
+      Tensor tRS_rAux = make_tensor(tC_rAux.data(), RLayoutR2S{});                                 // (R2S,R2S_M,R2S_N)
+
+      if (issue_smem_store) {
+        int store_pipe_index = store_iteration % Stages;
+        copy(tiled_r2s, tRS_rAux, tRS_sAux(_,_,_,store_pipe_index));
+      }
+    }
+
+    CUTLASS_DEVICE void
+    tma_store(int epi_m, int epi_n, int store_iteration, bool issue_tma_store, TensorMaps<false> tensormap) {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->is_nullptr) {
+          return;
+        }
+      }
+
+      if (issue_tma_store) {
+        int store_pipe_index = store_iteration % Stages;
+        copy(params_ptr->tma_store_aux.with(tensormap), bSG_sAux(_,_,_,store_pipe_index), bSG_gAux(_,_,_,epi_m,epi_n));
+      }
+    }
+  };
+
+  template <
+    bool ReferenceSrc, // do register tensors reference the src or dst layout of the tiled copy
+    class... Args
+  >
+  CUTLASS_DEVICE auto
+  get_consumer_store_callbacks(ConsumerStoreArgs<Args...> const& args) {
+
+    auto [M, N, K, L] = args.problem_shape_mnkl;
+    auto [m, n, k, l] = args.tile_coord_mnkl;
+    Tensor mAux = params_ptr->tma_store_aux.get_tma_tensor(make_shape(M,N,Int<1>{}));                               // (M,N,L)
+    Tensor gAux = local_tile(mAux, take<0,2>(args.tile_shape_mnk), make_coord(m,n,Int<0>{}));                 // (CTA_M,CTA_N)
+
+    Tensor tC_gAux = sm90_partition_for_epilogue<ReferenceSrc>(                        // (CPY,CPY_M,CPY_N,EPI_M,EPI_N)
+                      gAux, args.epi_tile, args.tiled_copy, args.thread_idx);
+    Tensor tC_rAux = make_tensor<Element>(take<0,3>(shape(tC_gAux)));                  // (CPY,CPY_M,CPY_N)
+
+    Tensor sAux_epi = cute::as_position_independent_swizzle_tensor(
+                        make_tensor(make_smem_ptr(smem_aux), SmemLayout{}));     // (EPI_TILE_M,EPI_TILE_N,PIPE)
+    Tensor gAux_epi = flat_divide(gAux, args.epi_tile);                          // (EPI_TILE_M,EPI_TILE_N,EPI_M,EPI_N)
+
+    auto tiled_r2s = conditional_return<ReferenceSrc>(
+      make_tiled_copy_S(Copy_Atom<CopyOpR2S,Element>{}, args.tiled_copy),
+      make_tiled_copy_D(Copy_Atom<CopyOpR2S,Element>{}, args.tiled_copy)
+    );
+    auto tRS_sAux = tiled_r2s.get_slice(args.thread_idx).partition_D(sAux_epi);               // (R2S,R2S_M,R2S_N,PIPE)
+
+    ThrCopy thrblk_s2g = params_ptr->tma_store_aux.get_slice(_0{});
+    Tensor bSG_sAux = thrblk_s2g.partition_S(sAux_epi);                                // (TMA,TMA_M,TMA_N,PIPE)
+    Tensor bSG_gAux = thrblk_s2g.partition_D(gAux_epi);                                // (TMA,TMA_M,TMA_N,EPI_M,EPI_N)
+
+    return ConsumerStoreCallbacks<decltype(tC_rAux), decltype(tiled_r2s), decltype(tRS_sAux), decltype(bSG_sAux), decltype(bSG_gAux)>(
+            cute::move(tC_rAux),
+            tiled_r2s,
+            cute::move(tRS_sAux),
+            cute::move(bSG_sAux),
+            cute::move(bSG_gAux),
+            params_ptr);
+  }
+
+  struct TensorMapCallbacks {
+
+    Params const* params_ptr;
+    cute::TmaDescriptor* smem_tensormap_aux;
+
+    CUTLASS_DEVICE
+    TensorMaps<false>
+    init(
+        int32_t sm_count,
+        int32_t sm_idx,
+        int32_t warp_group_idx)
+    {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->is_nullptr) {
+          return nullptr;
+        }
+      }
+
+      // Bringing tensormaps from params to smem for modification
+      Tensor pAux_tensormap = make_tensor(params_ptr->tma_store_aux.get_tma_descriptor(), Int<1>{}, Int<1>{});
+      Tensor sAux_tensormap = make_tensor(make_smem_ptr(&smem_tensormap_aux[warp_group_idx]), Int<1>{}, Int<1>{});
+      if (cute::elect_one_sync()) {
+        copy(recast<uint128_t>(pAux_tensormap), recast<uint128_t>(sAux_tensormap));
+      }
+      __syncwarp();
+
+      // Return a pointer to gmem tensormap in workspace
+      Layout desc_layout = make_layout(make_shape(sm_count, Int<NumEpilogueWarpGroups>{}));
+      Tensor gmem_tensormap = make_tensor(params_ptr->tensormaps, desc_layout);                  // (SM, WarpGroup)
+
+      return &gmem_tensormap(sm_idx, warp_group_idx);
+    }
+
+    template <class ProblemShape_MNKL>
+    CUTLASS_DEVICE
+    void
+    perform_update(
+        TensorMaps<false> tensormap,
+        ProblemShape_MNKL problem_shape_mnkl,
+        int32_t next_batch,
+        int32_t warp_group_idx)
+    {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->is_nullptr) {
+          return;
+        }
+      }
+
+      cute::tma_descriptor_replace_addr_in_shared_mem(smem_tensormap_aux[warp_group_idx], params_ptr->ptr_aux[next_batch]);
+
+      if constexpr (IsGroupedGemmKernel) {
+        // Replacing global dims and strides for the next batch
+        constexpr int MaxTensorRank = 5;
+        cute::array<uint32_t, MaxTensorRank> prob_shape  = {1,1,1,1,1};
+        cute::array<uint64_t, MaxTensorRank> prob_stride = {0,0,0,0,0};
+
+        Element const* ptr_Aux = nullptr;
+        Tensor tensor_aux = make_tensor(ptr_Aux, make_layout(append(select<0,1>(problem_shape_mnkl), Int<1>{}), params_ptr->dAux[next_batch]));
+
+        cute::detail::fill_tma_gmem_shape_stride(params_ptr->tma_store_aux, tensor_aux, prob_shape, prob_stride);
+
+        // Convert strides to byte strides
+        for (uint64_t& stride : prob_stride) {
+          stride = (stride * sizeof_bits_v<Element>) / 8;
+        }
+
+        cute::tma_descriptor_replace_dims_strides_in_shared_mem(smem_tensormap_aux[warp_group_idx], prob_shape, prob_stride);
+      }
+    }
+
+    CUTLASS_DEVICE
+    void
+    cp_fence_release(
+        TensorMaps<false> const& tensormap,
+        const int32_t warp_group_idx = 0)
+    {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->is_nullptr) {
+          return;
+        }
+      }
+      cute::tma_descriptor_cp_fence_release(tensormap, smem_tensormap_aux[warp_group_idx]);
+    }
+
+    CUTLASS_DEVICE
+    void
+    fence_acquire(TensorMaps<false> const& tensormap)
+    {
+      if constexpr (EnableNullptr) {
+        if (params_ptr->is_nullptr) {
+          return;
+        }
+      }
+      cute::tma_descriptor_fence_acquire(tensormap);
+    }
+  };
+
+  template <bool IsLoad>
+  CUTLASS_DEVICE constexpr auto
+  get_tensormap_callbacks() {
+    if constexpr (IsLoad) {
+      return EmptyTensorMapCallbacks{};
+    }
+    else {
+      return TensorMapCallbacks{params_ptr, smem_tensormap_aux};
+    }
+  }
+};
+
 /////////////////////////////////////////////////////////////////////////////////////////////////
 //
 // Reduction Store Operations
diff --git a/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp
index 5d4e9deb5..d6846fe1c 100644
--- a/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp
+++ b/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp
@@ -150,6 +150,17 @@ struct ProducerLoadCallbacksImpl {
     );
   }
 
+  // Overload of step that accepts tensormaps argument, for use in ptr-array epilogues
+  template<class Tensormaps>
+  CUTLASS_DEVICE void
+  step(uint64_t* full_mbarrier_ptr, int epi_m, int epi_n, int load_iteration, bool issue_tma_load, Tensormaps tensormaps) {
+    for_each(callbacks_tuple, tensormaps,
+      [&] (auto& callbacks, auto&& op_tensormaps) {
+        callbacks.step(full_mbarrier_ptr, epi_m, epi_n, load_iteration, issue_tma_load, op_tensormaps);
+      }
+    );
+  }
+
   // Exit of the subtile load loop.
   CUTLASS_DEVICE void
   end() {
@@ -261,6 +272,17 @@ struct ConsumerStoreCallbacksImpl {
     );
   }
 
+  // Overload of tma_store that accepts tensormaps argument, for use in ptr-array epilogues
+  template<class Tensormaps>
+  CUTLASS_DEVICE void
+  tma_store(int epi_m, int epi_n, int store_iteration, bool issue_tma_store, Tensormaps tensormaps) {
+    for_each(callbacks_tuple, tensormaps,
+      [&] (auto& callbacks, auto&& op_tensormaps) {
+        callbacks.tma_store(epi_m, epi_n, store_iteration, issue_tma_store, op_tensormaps);
+      }
+    );
+  }
+
   // End of subtile store iteration
   CUTLASS_DEVICE void
   end_loop(int epi_m, int epi_n) {
@@ -282,6 +304,72 @@ struct ConsumerStoreCallbacksImpl {
   }
 };
 
+//
+// Tensor map update callbacks, called by the epilogue load/store warps.
+// Operations that support tensormap updates must define this.
+// For use with ptr-array epilogues.
+//
+template <class CallbacksTuple>
+struct TensorMapCallbacksImpl {
+  // Callbacks can store non-persistent variables (e.g. tensors) or copies of persistent variables
+  CallbacksTuple callbacks_tuple;
+
+  // Initialization, typically performs initial copy of tensormaps from constant to shared memory
+  CUTLASS_DEVICE
+  auto
+  init(int32_t sm_count, int32_t sm_idx, int32_t warp_group_idx) {
+    return transform_apply(callbacks_tuple,
+      [&](auto&& callback) {
+        return callback.init(sm_count, sm_idx, warp_group_idx);
+      },
+      [](auto&&... tensormaps) {
+        return cute::make_tuple(tensormaps...);
+      });
+  }
+
+  // Tensormap update in shared memory
+  template <class Tensormaps, class ProblemShape_MNKL>
+  CUTLASS_DEVICE
+  void
+  perform_update(
+      Tensormaps tensormaps,
+      ProblemShape_MNKL problem_shape_mnkl,
+      int32_t next_batch,
+      int32_t warp_group_idx)
+  {
+    for_each(callbacks_tuple, tensormaps,
+      [&](auto&& callback, auto&& op_tensormaps) {
+        callback.perform_update(op_tensormaps, problem_shape_mnkl, next_batch, warp_group_idx);
+      });
+  }
+
+  // Tensormap store from shared to global memory and release fence
+  template <class Tensormaps>
+  CUTLASS_DEVICE
+  void
+  cp_fence_release(
+      Tensormaps tensormaps,
+      const int32_t warp_group_idx = 0)
+  {
+    for_each(callbacks_tuple, tensormaps,
+      [&](auto&& callback, auto&& op_tensormaps) {
+        callback.cp_fence_release(op_tensormaps, warp_group_idx);
+      });
+  }
+
+  // Tensormap acquire fence
+  template <class Tensormaps>
+  CUTLASS_DEVICE
+  void
+  fence_acquire(Tensormaps tensormaps)
+  {
+    for_each(callbacks_tuple, tensormaps,
+      [&](auto&& callback, auto&& op_tensormaps) {
+        callback.fence_acquire(op_tensormaps);
+      });
+  }
+};
+
 template<
   class ProblemShapeMNKL,
   class TileShapeMNK,
@@ -368,6 +456,29 @@ struct ConsumerStoreArgs {
     thread_idx(thread_idx) {}
 };
 
+template <class T, bool IsLoad, class = void>
+struct tensormaps_type_impl {
+  using type = cute::tuple<>;
+};
+
+template <class T, bool IsLoad>
+struct tensormaps_type_impl<T, IsLoad, cute::void_t<typename T::template TensorMaps<IsLoad>>> {
+  using type = typename T::template TensorMaps<IsLoad>;
+};
+
+template <class T, bool IsLoad>
+using tensormaps_type = typename tensormaps_type_impl<T, IsLoad>::type;
+
+// Helper that owns the pack itself. Sm90VisitorImplBase::TensorMaps cannot
+// expand its enclosing class's Ops pack directly in an alias-template body
+// on MSVC (emits "parameter pack 'Ops' was referenced but not expanded").
+// Pushing the expansion into this struct's own template parameter list works
+// around it; the expansion at the call site is a plain non-alias context.
+template <bool IsLoad, class... Ts>
+struct tensormaps_tuple_for {
+  using type = tuple<typename tensormaps_type_impl<Ts, IsLoad>::type...>;
+};
+
 template <class... Ops>
 struct Sm90VisitorImplBase {
   // Shared memory allocation
@@ -376,6 +487,11 @@ struct Sm90VisitorImplBase {
   using Arguments = tuple<typename Ops::Arguments...>;
   // Device side fusion params (Kernel-entry API)
   using Params = tuple<typename Ops::Params...>;
+  // Tensormap argument type used in ptr-array epilogues. The pack expansion is
+  // delegated to tensormaps_tuple_for above; spelling it inline here trips an
+  // MSVC bug with alias-template pack expansion over an enclosing class's pack.
+  template <bool IsLoad>
+  using TensorMaps = typename tensormaps_tuple_for<IsLoad, Ops...>::type;
 
   template <class ProblemShape>
   static constexpr Params
@@ -473,6 +589,8 @@ struct Sm90VisitorImpl : Sm90VisitorImplBase<Ops...> {
   using Impl = Sm90VisitorImplBase<Ops...>;
   using Params = typename Impl::Params;
   using SharedStorage = typename Impl::SharedStorage;
+  template <bool IsLoad>
+  using TensorMaps = typename Impl::template TensorMaps<IsLoad>;
 
   CUTLASS_HOST_DEVICE
   Sm90VisitorImpl() {}
@@ -549,6 +667,25 @@ struct Sm90VisitorImpl : Sm90VisitorImplBase<Ops...> {
       }
     );
   }
+
+  template <bool IsLoad>
+  CUTLASS_DEVICE constexpr auto
+  get_tensormap_callbacks() {
+    return transform_apply(ops,
+      [](auto& op){
+        auto has_tmap_callbacks = cute::is_valid([](auto&& t)->void_t<decltype(t.template get_tensormap_callbacks<IsLoad>())>{}, op);
+        if constexpr (has_tmap_callbacks) {
+          return op.template get_tensormap_callbacks<IsLoad>();
+        }
+        else {
+          return TensorMapCallbacksImpl<cute::tuple<>>{};
+        }
+      },
+      [](auto&&... callbacks){
+        auto callbacks_tuple = cute::make_tuple(callbacks...);
+        return TensorMapCallbacksImpl<decltype(callbacks_tuple)>{callbacks_tuple};
+      });
+  }
 };
 
 /////////////////////////////////////////////////////////////////////////////////////////////////
@@ -556,6 +693,7 @@ struct Sm90VisitorImpl : Sm90VisitorImplBase<Ops...> {
 // Convenience aliases
 using EmptyProducerLoadCallbacks = ProducerLoadCallbacksImpl<cute::tuple<>>;
 using EmptyConsumerStoreCallbacks = ConsumerStoreCallbacksImpl<cute::tuple<>>;
+using EmptyTensorMapCallbacks = TensorMapCallbacksImpl<cute::tuple<>>;
 
 /////////////////////////////////////////////////////////////////////////////////////////////////
 
@@ -768,6 +906,11 @@ struct Sm90VisitorImplBase<Op0> {
     typename Op0::Params op_0;
   };
 
+  template <bool IsLoad>
+  using TensorMaps = tuple<
+    tensormaps_type<Op0, IsLoad>
+  >;
+
   template <class ProblemShape>
   static constexpr Params
   to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) {
@@ -840,6 +983,12 @@ struct Sm90VisitorImplBase<Op0, Op1> {
     typename Op1::Params op_1;
   };
 
+  template <bool IsLoad>
+  using TensorMaps = tuple<
+    tensormaps_type<Op0,IsLoad>,
+    tensormaps_type<Op1,IsLoad>
+  >;
+
   template <class ProblemShape>
   static constexpr Params
   to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) {
@@ -931,6 +1080,13 @@ struct Sm90VisitorImplBase<Op0, Op1, Op2> {
     typename Op2::Params op_2;
   };
 
+  template <bool IsLoad>
+  using TensorMaps = tuple<
+    tensormaps_type<Op0,IsLoad>,
+    tensormaps_type<Op1,IsLoad>,
+    tensormaps_type<Op2,IsLoad>
+  >;
+
   template <class ProblemShape>
   static constexpr Params
   to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) {
@@ -1040,6 +1196,14 @@ struct Sm90VisitorImplBase<Op0, Op1, Op2, Op3> {
     typename Op3::Params op_3;
   };
 
+  template <bool IsLoad>
+  using TensorMaps = tuple<
+    tensormaps_type<Op0,IsLoad>,
+    tensormaps_type<Op1,IsLoad>,
+    tensormaps_type<Op2,IsLoad>,
+    tensormaps_type<Op3,IsLoad>
+  >;
+
   template <class ProblemShape>
   static constexpr Params
   to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) {
diff --git a/include/cutlass/gemm/collective/builders/sm90_common.inl b/include/cutlass/gemm/collective/builders/sm90_common.inl
index ae08658d9..26552cb89 100644
--- a/include/cutlass/gemm/collective/builders/sm90_common.inl
+++ b/include/cutlass/gemm/collective/builders/sm90_common.inl
@@ -290,12 +290,18 @@ ss_smem_selector()
     else if constexpr (BLK_MN0 % size<0>(GMMA::Layout_MN_SW32_Atom<ElementType>{}) == 0) {
       return GMMA::Layout_MN_SW32_Atom<ElementType>{};
     }
-    else if constexpr (BLK_MN0 % size<0>(GMMA::Layout_MN_INTER_Atom<ElementType>{}) == 0) {
+    // INTER fallback uses size(BLK_MN{}) instead of BLK_MN0 so that hierarchical
+    // shapes like ((_8,_2,M/16),) produced by the gated activation builder can
+    // still pick INTER when their flattened mode-0 (size<0> = 8) is too small
+    // for any SW atom but the product (_8*_2*M/16 = M) divides the INTER atom
+    // width. For flat (rank-1) BLK_MN shapes, size(BLK_MN{}) == BLK_MN0, so
+    // this is a no-op for every non-gated kernel today.
+    else if constexpr (size(BLK_MN{}) % size<0>(GMMA::Layout_MN_INTER_Atom<ElementType>{}) == 0) {
       return GMMA::Layout_MN_INTER_Atom<ElementType>{};
     }
     else {
-      static_assert(BLK_MN0 % size<0>(GMMA::Layout_MN_INTER_Atom<ElementType>{}) == 0,
-                    "BLK_MN0 must be a multiple of size<0>(GMMA::Layout_MN_INTER_Atom<ElementType>{})");
+      static_assert(size(BLK_MN{}) % size<0>(GMMA::Layout_MN_INTER_Atom<ElementType>{}) == 0,
+                    "BLK_MN must be a multiple of size<0>(GMMA::Layout_MN_INTER_Atom<ElementType>{})");
     }
   }
   else if constexpr (major == GMMA::Major::K) {
diff --git a/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp b/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp
index 83a281c3f..63b334bb7 100644
--- a/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp
+++ b/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp
@@ -156,14 +156,14 @@ struct CollectiveMma<
   using TMA_A = decltype(make_tma_copy(
       GmemTiledCopyA{},
       make_tensor(static_cast<InternalElementA const*>(nullptr), repeat_like(InternalStrideA{}, int32_t(0)), InternalStrideA{}),
-      SmemLayoutA{}(_,_,cute::Int<0>{}),
+      SmemLayoutA{}(_,_,Int<0>{}),
       make_shape(shape<0>(TileShape{}), shape<2>(TileShape{})),
       size<1>(ClusterShape{})));  // mcast along N mode for this M load, if any
   // Assumption: StrideB is congruent with Problem_NK
   using TMA_B = decltype(make_tma_copy(
       GmemTiledCopyB{},
       make_tensor(static_cast<InternalElementB const*>(nullptr), repeat_like(InternalStrideB{}, int32_t(0)), InternalStrideB{}),
-      SmemLayoutB{}(_,_,cute::Int<0>{}),
+      SmemLayoutB{}(_,_,Int<0>{}),
       make_shape(shape<1>(TileShape{}), shape<2>(TileShape{})),
       size<0>(ClusterShape{}))); // mcast along M mode for this N load, if any
 
@@ -251,13 +251,13 @@ struct CollectiveMma<
     TMA_A tma_load_a = make_tma_copy(
         GmemTiledCopyA{},
         tensor_a,
-        SmemLayoutA{}(_,_,cute::Int<0>{}),
+        SmemLayoutA{}(_,_,Int<0>{}),
         make_shape(shape<0>(TileShape{}), shape<2>(TileShape{})),
         size<1>(ClusterShape{})); // mcast along N mode for this M load, if any
     TMA_B tma_load_b = make_tma_copy(
         GmemTiledCopyB{},
         tensor_b,
-        SmemLayoutB{}(_,_,cute::Int<0>{}),
+        SmemLayoutB{}(_,_,Int<0>{}),
         make_shape(shape<1>(TileShape{}), shape<2>(TileShape{})),
         size<0>(ClusterShape{})); // mcast along M mode for this N load, if any
 
@@ -673,9 +673,9 @@ struct CollectiveMma<
       Params const& mainloop_params,
       int32_t next_group,
       ProblemShape_MNKL problem_shape_mnkl) {
-    const uint32_t M = get<0>(problem_shape_mnkl);
-    const uint32_t N = get<1>(problem_shape_mnkl);
-    const uint32_t K = get<2>(problem_shape_mnkl);
+    const auto M = get<0>(problem_shape_mnkl);
+    const auto N = get<1>(problem_shape_mnkl);
+    const auto K = get<2>(problem_shape_mnkl);
     // Replace all dims for consistency
     constexpr int MaxTensorRank = 5;
     cute::array<uint32_t, MaxTensorRank> prob_shape_A  = {1,1,1,1,1};
diff --git a/include/cutlass/gemm/gemm.h b/include/cutlass/gemm/gemm.h
index 96e9970ca..249e04aa2 100644
--- a/include/cutlass/gemm/gemm.h
+++ b/include/cutlass/gemm/gemm.h
@@ -117,6 +117,18 @@ is_k_major_B() {
   return is_k_major<TagToStrideB_t<LayoutB>>();
 }
 
+template<class LayoutC>
+constexpr bool
+is_m_major_C() {
+  return is_major<0,TagToStrideC_t<LayoutC>>();
+}
+
+template<class LayoutC>
+constexpr bool
+is_n_major_C() {
+  return is_major<1,TagToStrideC_t<LayoutC>>();
+}
+
 ///////////////////////////////////////////////////////////////////////////////
 
 // The following two metafunctions are used to detect whether a `kernel::Gemm` or `kernel::GemmUniversal`
diff --git a/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp b/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp
index 1b9fcbf72..27cf43730 100644
--- a/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp
+++ b/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp
@@ -131,9 +131,9 @@ public:
     cute::is_void_v<TileScheduler_>
     or (
       IsGroupedGemmKernel
-      and cute::is_any_of_v<TileScheduler_, GroupScheduler>
+      and cute::is_base_of_v<GroupScheduler, TileScheduler_>
     ),
-    "Ptr-Array Cooperative and Grouped Gemm Cooperative kernel only supports the default scheduler.");
+    "Ptr-Array Cooperative and Grouped Gemm Cooperative kernels only support group-compatible schedulers (TileScheduler_ must derive from GroupScheduler).");
 
   using SchedulerTag = cute::conditional_t<
     cute::is_void_v<TileScheduler_>,
@@ -490,7 +490,7 @@ public:
     // TileScheduler pipeline
     using TileSchedulerPipeline = typename TileScheduler::Pipeline;
     typename TileSchedulerPipeline::Params tile_scheduler_pipeline_params;
-    if constexpr (cute::is_same_v<SchedulerTag, GroupScheduler>) {
+    if constexpr (IsGroupedGemmKernel) {
       if (warp_group_role == WarpGroupRole::Producer
         && producer_warp_role == ProducerWarpRole::Scheduler) {
         tile_scheduler_pipeline_params.role = TileSchedulerPipeline::ThreadCategory::Producer;
@@ -592,7 +592,7 @@ public:
 
     auto scheduler = [&] () {
       // Group scheduler requires a different constructor that takes a response ptr
-      if constexpr (cute::is_same_v<SchedulerTag, GroupScheduler>) {
+      if constexpr (IsGroupedGemmKernel) {
         return TileScheduler{params.scheduler, shared_storage.scheduler_response};
       }
       else {
@@ -631,7 +631,7 @@ public:
       if (producer_warp_role == ProducerWarpRole::Scheduler) {
         // GroupScheduler requires a producer warp to iterate over the group infos and push
         // the work tile infos to the downstream pipelines.
-        if constexpr (cute::is_same_v<SchedulerTag, GroupScheduler>) {
+        if constexpr (IsGroupedGemmKernel) {
           do {
             auto [next_work_tile_info, increment_pipe] = scheduler.advance_to_next_work(tile_scheduler_pipeline, tile_scheduler_pipe_producer_state);
             work_tile_info = next_work_tile_info;
@@ -726,6 +726,9 @@ public:
             curr_batch = next_batch;
             if constexpr (IsGroupedGemmKernel) {
               problem_shape_MNKL = append<4>(params.problem_shape.get_problem_shape(curr_batch), 1);
+              load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop);
+              gA_mkl = get<0>(load_inputs);
+              gB_nkl = get<1>(load_inputs);
             }
             collective_mainloop.tensormaps_perform_update(
               shared_storage.tensormaps.mainloop,
@@ -811,7 +814,7 @@ public:
         int32_t const sm_idx = blockIdx.x + (blockIdx.y * gridDim.x);
         int32_t const sm_count = params.hw_info.sm_count;
 
-        auto epi_load_tensormap = get<0>(collective_epilogue.load_init(params.epilogue, shared_storage.tensormaps.epilogue, sm_count, sm_idx));
+        auto epi_load_tensormap = collective_epilogue.load_init(params.epilogue, shared_storage.tensormaps.epilogue, sm_count, sm_idx);
 
         bool did_batch_change = true;
         constexpr bool IsEpiLoad = true;
@@ -840,6 +843,9 @@ public:
           if (TileScheduler::compute_epilogue(work_tile_info, params.scheduler)) {
             if constexpr (IsGroupedGemmKernel) {
               problem_shape_MNKL = append<4>(params.problem_shape.get_problem_shape(work_tile_info.L_idx), 1);
+              load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop);
+              gA_mkl = get<0>(load_inputs);
+              gB_nkl = get<1>(load_inputs);
             }
 
             // Compute m_coord, n_coord, l_coord with the post-tiled m-shape and n-shape
@@ -875,6 +881,9 @@ public:
           if (work_tile_info.is_valid() && did_batch_change) {
             if constexpr (IsGroupedGemmKernel) {
               problem_shape_MNKL = append<4>(params.problem_shape.get_problem_shape(work_tile_info.L_idx), 1);
+              load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop);
+              gA_mkl = get<0>(load_inputs);
+              gB_nkl = get<1>(load_inputs);
             }
 
             // tensormap update
@@ -890,7 +899,10 @@ public:
 
               // Converge before issuing tensormap fence release since fence is aligned
               __syncwarp();
-              collective_epilogue.template tensormaps_cp_fence_release<IsEpiLoad>(shared_storage.tensormaps.epilogue, epi_load_tensormap, 0);
+              collective_epilogue.template tensormaps_cp_fence_release<IsEpiLoad>(
+                shared_storage.tensormaps.epilogue,
+                epi_load_tensormap,
+                0);
             }
           }
 
@@ -914,7 +926,7 @@ public:
       // Do we potentially issue tail arrives for TMA stores, if epilogue load is waiting for it
       bool do_store_tail = false;
       // Get a copy of tensormaps
-      auto epi_store_tensormap = get<0>(collective_epilogue.store_init(params.epilogue, shared_storage.tensormaps.epilogue, sm_count, sm_idx, consumer_warp_group_idx));
+      auto epi_store_tensormap = collective_epilogue.store_init(params.epilogue, shared_storage.tensormaps.epilogue, sm_count, sm_idx, consumer_warp_group_idx);
 
       bool did_batch_change = true;
       constexpr bool IsEpiLoad = false;
@@ -932,13 +944,16 @@ public:
         // Converge before issuing tensormap fence release since fence is aligned
         __syncwarp();
         collective_epilogue.template tensormaps_cp_fence_release<IsEpiLoad>(shared_storage.tensormaps.epilogue,
-                                                                    epi_store_tensormap,
-                                                                    consumer_warp_group_idx);
+                                                                            epi_store_tensormap,
+                                                                            consumer_warp_group_idx);
       }
 
       do {
         if constexpr (IsGroupedGemmKernel) {
           problem_shape_MNKL = append<4>(params.problem_shape.get_problem_shape(work_tile_info.L_idx), 1);
+          load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop);
+          gA_mkl = get<0>(load_inputs);
+          gB_nkl = get<1>(load_inputs);
         }
 
         int32_t curr_batch = work_tile_info.L_idx;
@@ -1041,6 +1056,9 @@ public:
         if (work_tile_info.is_valid() && did_batch_change) {
           if constexpr (IsGroupedGemmKernel) {
             problem_shape_MNKL = append<4>(params.problem_shape.get_problem_shape(work_tile_info.L_idx), 1);
+            load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop);
+            gA_mkl = get<0>(load_inputs);
+            gB_nkl = get<1>(load_inputs);
           }
           if (warp_idx_in_warp_group == 0) {
             collective_epilogue.template tensormaps_perform_update<IsEpiLoad>(
@@ -1055,8 +1073,8 @@ public:
             // Converge before issuing tensormap fence release since fence is aligned
             __syncwarp();
             collective_epilogue.template tensormaps_cp_fence_release<IsEpiLoad>(shared_storage.tensormaps.epilogue,
-                                                                       epi_store_tensormap,
-                                                                       consumer_warp_group_idx);
+                                                                                epi_store_tensormap,
+                                                                                consumer_warp_group_idx);
           }
         }
 
diff --git a/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp b/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp
index c828e8295..5f7a1be77 100644
--- a/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp
+++ b/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp
@@ -131,9 +131,9 @@ public:
     cute::is_void_v<TileScheduler_>
     or (
       IsGroupedGemmKernel
-      and cute::is_any_of_v<TileScheduler_, GroupScheduler>
+      and cute::is_base_of_v<GroupScheduler, TileScheduler_>
     ),
-    "Ptr-Array Pingpong and Grouped Gemm Pingpong kernel only supports the default scheduler.");
+    "Ptr-Array Pingpong and Grouped Gemm Pingpong kernels only support group-compatible schedulers (TileScheduler_ must derive from GroupScheduler).");
 
   using SchedulerTag = cute::conditional_t<
     cute::is_void_v<TileScheduler_>,
@@ -499,7 +499,7 @@ public:
     // TileScheduler pipeline
     using TileSchedulerPipeline = typename TileScheduler::Pipeline;
     typename TileSchedulerPipeline::Params tile_scheduler_pipeline_params;
-    if constexpr (cute::is_same_v<SchedulerTag, GroupScheduler>) {
+    if constexpr (IsGroupedGemmKernel) {
       if (warp_group_role == WarpGroupRole::Producer
         && producer_warp_role == ProducerWarpRole::Scheduler) {
         tile_scheduler_pipeline_params.role = TileSchedulerPipeline::ThreadCategory::Producer;
@@ -607,7 +607,7 @@ public:
 
     auto scheduler = [&] () {
       // Group scheduler requires a different constructor that takes a response ptr
-      if constexpr (cute::is_same_v<SchedulerTag, GroupScheduler>) {
+      if constexpr (IsGroupedGemmKernel) {
         return TileScheduler{params.scheduler, shared_storage.scheduler_response};
       }
       else {
@@ -669,7 +669,7 @@ public:
       if (producer_warp_role == ProducerWarpRole::Scheduler) {
         // GroupScheduler requires a producer warp to iterate over the group infos and push
         // the work tile infos to the downstream pipelines.
-        if constexpr (cute::is_same_v<SchedulerTag, GroupScheduler>) {
+        if constexpr (IsGroupedGemmKernel) {
           do {
             auto [next_work_tile_info, increment_pipe] = scheduler.advance_to_next_work(tile_scheduler_pipeline, tile_scheduler_pipe_producer_state);
             work_tile_info = next_work_tile_info;
@@ -764,6 +764,9 @@ public:
             curr_batch = next_batch;
             if constexpr (IsGroupedGemmKernel) {
               problem_shape_MNKL = append<4>(params.problem_shape.get_problem_shape(curr_batch), 1);
+              load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop);
+              gA_mkl = get<0>(load_inputs);
+              gB_nkl = get<1>(load_inputs);
             }
             collective_mainloop.tensormaps_perform_update(
               shared_storage.tensormaps.mainloop,
@@ -849,7 +852,7 @@ public:
         int32_t const sm_idx = blockIdx.x + (blockIdx.y * gridDim.x);
         int32_t const sm_count = params.hw_info.sm_count;
 
-        auto epi_load_tensormap = get<0>(collective_epilogue.load_init(params.epilogue, shared_storage.tensormaps.epilogue, sm_count, sm_idx));
+        auto epi_load_tensormap = collective_epilogue.load_init(params.epilogue, shared_storage.tensormaps.epilogue, sm_count, sm_idx);
 
         bool did_batch_change = true;
         constexpr bool IsEpiLoad = true;
@@ -878,6 +881,9 @@ public:
           if (TileScheduler::compute_epilogue(work_tile_info, params.scheduler)) {
             if constexpr (IsGroupedGemmKernel) {
               problem_shape_MNKL = append<4>(params.problem_shape.get_problem_shape(work_tile_info.L_idx), 1);
+              load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop);
+              gA_mkl = get<0>(load_inputs);
+              gB_nkl = get<1>(load_inputs);
             }
 
             // Compute m_coord, n_coord, l_coord with the post-tiled m-shape and n-shape
@@ -913,6 +919,9 @@ public:
           if (work_tile_info.is_valid() && did_batch_change) {
             if constexpr (IsGroupedGemmKernel) {
               problem_shape_MNKL = append<4>(params.problem_shape.get_problem_shape(work_tile_info.L_idx), 1);
+              load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop);
+              gA_mkl = get<0>(load_inputs);
+              gB_nkl = get<1>(load_inputs);
             }
 
             // tensormap update
@@ -928,7 +937,10 @@ public:
 
               // Converge before issuing tensormap fence release since fence is aligned
               __syncwarp();
-              collective_epilogue.template tensormaps_cp_fence_release<IsEpiLoad>(shared_storage.tensormaps.epilogue, epi_load_tensormap, 0);
+              collective_epilogue.template tensormaps_cp_fence_release<IsEpiLoad>(
+                shared_storage.tensormaps.epilogue,
+                epi_load_tensormap,
+                0);
             }
           }
 
@@ -952,7 +964,7 @@ public:
       // Do we potentially issue tail arrives for TMA stores, if epilogue load is waiting for it
       bool do_store_tail = false;
       // Get a copy of tensormaps
-      auto epi_store_tensormap = get<0>(collective_epilogue.store_init(params.epilogue, shared_storage.tensormaps.epilogue, sm_count, sm_idx, consumer_warp_group_idx));
+      auto epi_store_tensormap = collective_epilogue.store_init(params.epilogue, shared_storage.tensormaps.epilogue, sm_count, sm_idx, consumer_warp_group_idx);
 
       bool did_batch_change = true;
       constexpr bool IsEpiLoad = false;
@@ -970,13 +982,16 @@ public:
         // Converge before issuing tensormap fence release since fence is aligned
         __syncwarp();
         collective_epilogue.template tensormaps_cp_fence_release<IsEpiLoad>(shared_storage.tensormaps.epilogue,
-                                                                    epi_store_tensormap,
-                                                                    consumer_warp_group_idx);
+                                                                            epi_store_tensormap,
+                                                                            consumer_warp_group_idx);
       }
 
       do {
         if constexpr (IsGroupedGemmKernel) {
           problem_shape_MNKL = append<4>(params.problem_shape.get_problem_shape(work_tile_info.L_idx), 1);
+          load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop);
+          gA_mkl = get<0>(load_inputs);
+          gB_nkl = get<1>(load_inputs);
         }
 
         int32_t curr_batch = work_tile_info.L_idx;
@@ -1085,6 +1100,9 @@ public:
         if (work_tile_info.is_valid()) {
           if constexpr (IsGroupedGemmKernel) {
             problem_shape_MNKL = append<4>(params.problem_shape.get_problem_shape(work_tile_info.L_idx), 1);
+            load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop);
+            gA_mkl = get<0>(load_inputs);
+            gB_nkl = get<1>(load_inputs);
           }
           work_k_tile_count = TileScheduler::get_work_k_tile_count(work_tile_info, problem_shape_MNKL, blk_shape);
           mainloop_pipe_consumer_state.advance(work_k_tile_count);
@@ -1101,6 +1119,9 @@ public:
         if (work_tile_info.is_valid() && did_batch_change) {
           if constexpr (IsGroupedGemmKernel) {
             problem_shape_MNKL = append<4>(params.problem_shape.get_problem_shape(work_tile_info.L_idx), 1);
+            load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop);
+            gA_mkl = get<0>(load_inputs);
+            gB_nkl = get<1>(load_inputs);
           }
           if (warp_idx_in_warp_group == 0) {
             collective_epilogue.template tensormaps_perform_update<IsEpiLoad>(
@@ -1114,9 +1135,10 @@ public:
 
             // Converge before issuing tensormap fence release since fence is aligned
             __syncwarp();
-            collective_epilogue.template tensormaps_cp_fence_release<IsEpiLoad>(shared_storage.tensormaps.epilogue,
-                                                                       epi_store_tensormap,
-                                                                       consumer_warp_group_idx);
+            collective_epilogue.template tensormaps_cp_fence_release<IsEpiLoad>(
+              shared_storage.tensormaps.epilogue,
+              epi_store_tensormap,
+              consumer_warp_group_idx);
           }
         }
 
diff --git a/include/cutlass/gemm/kernel/sm90_gemm_tma.hpp b/include/cutlass/gemm/kernel/sm90_gemm_tma.hpp
index d39d430e1..5508e6ce3 100644
--- a/include/cutlass/gemm/kernel/sm90_gemm_tma.hpp
+++ b/include/cutlass/gemm/kernel/sm90_gemm_tma.hpp
@@ -281,12 +281,12 @@ public:
 
     constexpr int BLK_M_RANK = cute::rank<0>(blk_shape);
     auto m_max_coord = unwrap(cute::transform(make_seq<BLK_M_RANK>{}, [&](auto i) {
-        return  get<i>(M) - get<0,i>(blk_shape) * get<i>(m_coord);
+        return  get<0,i>(problem_shape_MNKL) - get<0,i>(blk_shape) * get<0,i>(output_tile_coord);
       }));
 
     constexpr int BLK_N_RANK = cute::rank<1>(blk_shape);
     auto n_max_coord = unwrap(cute::transform(make_seq<BLK_N_RANK>{}, [&](auto i) {
-        return  get<i>(N) - get<1,i>(blk_shape) * get<i>(n_coord);
+        return  get<1,i>(problem_shape_MNKL) - get<1,i>(blk_shape) * get<1,i>(output_tile_coord);
       }));
     auto residue_mnk = make_tuple(m_max_coord, n_max_coord, Int<0>{});
 
diff --git a/include/cutlass/gemm/kernel/sm90_tile_scheduler_group.hpp b/include/cutlass/gemm/kernel/sm90_tile_scheduler_group.hpp
index 0e879f1cf..f93c9c88f 100644
--- a/include/cutlass/gemm/kernel/sm90_tile_scheduler_group.hpp
+++ b/include/cutlass/gemm/kernel/sm90_tile_scheduler_group.hpp
@@ -49,7 +49,7 @@ class PersistentTileSchedulerSm90Group {
   // Data members
   //
 
-private:
+protected:
   uint64_t current_work_linear_idx_ = 0;
   uint64_t total_grid_size_ = 0;
 
@@ -287,8 +287,8 @@ public:
     total_grid_size_ = uint64_t(gridDim.x) * uint64_t(gridDim.y) * uint64_t(gridDim.z);
     uint64_t ctas_along_m, ctas_along_n;
     ProblemShape problem_shape = params_.problem_shapes_.get_problem_shape(0);
-    if (is_tuple<decltype(cute::shape<0>(problem_shape))>::value ||
-        is_tuple<decltype(cute::shape<1>(problem_shape))>::value) {
+    if constexpr (is_tuple<decltype(cute::shape<0>(problem_shape))>::value ||
+                  is_tuple<decltype(cute::shape<1>(problem_shape))>::value) {
       ctas_along_m = cute::size(cute::ceil_div(cute::shape<0>(problem_shape), scheduler_params.cta_shape_.m()));
       ctas_along_n = cute::size(cute::ceil_div(cute::shape<1>(problem_shape), scheduler_params.cta_shape_.n()));
     }
@@ -308,7 +308,7 @@ public:
   }
 
   // get work_idx_m, work_idx_n from linear_idx while applying swizzle
-  template<class WorkTileInfo, class GroupInfo, class ProblemShape, class RasterOrder>
+  template<class WorkTileInfo, class GroupInfo, class ProblemShape, class CtaShape, class RasterOrder>
   static
   CUTLASS_DEVICE
   WorkTileInfo
@@ -317,7 +317,7 @@ public:
       GroupInfo& group_info,
       GroupProblemShape &problem_shapes,
       ProblemShape (&cached_problem_shapes)[2],
-      GemmCoord cta_shape,
+      CtaShape cta_shape,
       GemmCoord cluster_shape,
       FastDivmodU64Pow2 const& divmod_cluster_shape_major,
       FastDivmodU64Pow2 const& divmod_cluster_shape_minor,
@@ -340,10 +340,10 @@ public:
         }
         if (group_info.group_idx < total_problem_groups) {
           uint64_t ctas_along_m, ctas_along_n;
-          if (is_tuple<decltype(cute::shape<0>(cached_problem_shapes[0]))>::value ||
-              is_tuple<decltype(cute::shape<1>(cached_problem_shapes[0]))>::value) {
-            ctas_along_m = cute::size(cute::ceil_div(cute::shape<0>(cached_problem_shapes[0]), cta_shape.m()));
-            ctas_along_n = cute::size(cute::ceil_div(cute::shape<1>(cached_problem_shapes[0]), cta_shape.n()));
+          if constexpr (is_tuple<decltype(cute::shape<0>(cached_problem_shapes[0]))>::value ||
+                        is_tuple<decltype(cute::shape<1>(cached_problem_shapes[0]))>::value) {
+            ctas_along_m = cute::size(cute::ceil_div(cute::shape<0>(cached_problem_shapes[0]), get<0>(cta_shape)));
+            ctas_along_n = cute::size(cute::ceil_div(cute::shape<1>(cached_problem_shapes[0]), get<1>(cta_shape)));
           }
           else {
             ctas_along_m = divmod_cta_shape_m.divide(cute::shape<0>(cached_problem_shapes[0]) +  divmod_cta_shape_m.divisor - 1);
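Reviewer note on the CTA-count arithmetic in the scheduler hunk above: both branches round the problem extent up to a whole number of CTA tiles. The `cute::ceil_div` path and the `divmod_cta_shape_m.divide(shape + divisor - 1)` path are the same ceiling division written two ways. A minimal Python sketch of that arithmetic follows; `ceil_div` and `ctas_for_problem` are hypothetical helper names used only for illustration, not CUTLASS APIs, and the sizes are made-up values.

```python
from typing import Tuple


def ceil_div(a: int, b: int) -> int:
    """Ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def ctas_for_problem(m: int, n: int, cta_m: int, cta_n: int) -> Tuple[int, int]:
    """Number of CTA tiles needed to cover an M x N problem (illustrative only)."""
    return ceil_div(m, cta_m), ceil_div(n, cta_n)


# Illustrative sizes: M = 1000 is not a multiple of 128, so the last tile is partial.
ctas_m, ctas_n = ctas_for_problem(1000, 512, 128, 128)
print(ctas_m, ctas_n)
```

With these inputs the M extent needs a partially filled eighth tile, which is why the rounding has to go up rather than down.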
diff --git a/include/cutlass/kernel_hardware_info.h b/include/cutlass/kernel_hardware_info.h
index f99f9336b..6b8c2427b 100644
--- a/include/cutlass/kernel_hardware_info.h
+++ b/include/cutlass/kernel_hardware_info.h
@@ -91,7 +91,6 @@ struct KernelHardwareInfo {
       void const* kernel_ptr,
       cudaStream_t stream = nullptr) {
     int max_active_clusters = 0;
-#if !(defined(__QNX__) && __QNX__ >= 800 && defined(NV_IS_SAFETY))
 #if defined(CUTLASS_SM90_CLUSTER_LAUNCH_ENABLED)
     ClusterLauncher::LaunchConfig cluster_launch_config = ClusterLauncher::make_cluster_launch_config(
                                                             cluster_dims /* minimum grid dim */, cluster_dims, {threads_per_block, 1, 1},
@@ -111,10 +110,6 @@ struct KernelHardwareInfo {
 #else
     CUTLASS_TRACE_HOST("ClusterLauncher: CUTLASS_SM90_CLUSTER_LAUNCH_ENABLED not defined! Aborting cluster occupancy query.");
     return max_active_clusters;
-#endif
-#else
-    CUTLASS_TRACE_HOST("ClusterLauncher: cluster launch disabled for QNX 8+ safety builds");
-    return max_active_clusters;
 #endif
   }
 
diff --git a/include/cutlass/version.h b/include/cutlass/version.h
index f22a08a24..615a53104 100644
--- a/include/cutlass/version.h
+++ b/include/cutlass/version.h
@@ -31,12 +31,16 @@
 
 #pragma once
 
+#if defined(__CUDACC_RTC__)
+#include CUDA_STD_HEADER(cstdint)
+#else
 #include <cstdint>
 #include <string>
+#endif
 
 #define CUTLASS_MAJOR 4
-#define CUTLASS_MINOR 5
-#define CUTLASS_PATCH 2
+#define CUTLASS_MINOR 6
+#define CUTLASS_PATCH 0
 
 #ifdef CUTLASS_VERSIONS_GENERATED
 #include "cutlass/version_extended.h"
@@ -65,6 +69,7 @@ namespace cutlass {
     return CUTLASS_BUILD + 0;
   }
 
+#if !defined(__CUDACC_RTC__)
   inline std::string getVersionString() {
     std::string version = "@CUTLASS_VERSION@";
     if (getVersionBuild()) {
@@ -72,9 +77,10 @@ namespace cutlass {
     }
     return version;
   }
-  
+
   inline std::string getGitRevision() {
     return "@CUTLASS_REVISION@";
   }
+#endif // !defined(__CUDACC_RTC__)
 
 } // namespace cutlass
diff --git a/media/docs/pythonDSL/cute_dsl_general/dsl_introduction.rst b/media/docs/pythonDSL/cute_dsl_general/dsl_introduction.rst
index 116cb86cd..906b8d95e 100644
--- a/media/docs/pythonDSL/cute_dsl_general/dsl_introduction.rst
+++ b/media/docs/pythonDSL/cute_dsl_general/dsl_introduction.rst
@@ -123,6 +123,12 @@ Defines GPU kernel functions, compiled as specialized GPU symbols through |DC|.
   - ``False`` (default) — Shared memory is allocated additively across all branches (default CUDA C++ behavior).
   - ``True`` — Merge shared-memory allocations across branches (experimental feature, recommended for mega-kernels).
 
+- ``preferred_smem_carveout``
+  Sets a per-kernel hint for the percentage of SM on-chip memory to reserve for shared memory rather than L1 cache.
+
+  - ``None`` (default) — Auto-calculates the percentage using the formula ``ceil_div(min_blocks_per_mp * smem * 100, max_smem_per_mp)`` when **min_blocks_per_mp** is greater than 1.
+  - ``int`` — Overrides the auto-calculated percentage and sets the hint manually.
+
 Calling Conventions
 -------------------
 
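Reviewer note on the ``preferred_smem_carveout`` formula documented above: a worked example may help readers check the arithmetic. The sketch below evaluates ``ceil_div(min_blocks_per_mp * smem * 100, max_smem_per_mp)`` in plain Python. The function name `carveout_percent`, the choice to return `None` when `min_blocks_per_mp` is not greater than 1, and the 228 KiB per-SM capacity are assumptions made for illustration; the documentation states only when the formula applies, not what happens otherwise.

```python
from typing import Optional


def ceil_div(a: int, b: int) -> int:
    """Ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def carveout_percent(
    min_blocks_per_mp: int, smem_bytes: int, max_smem_per_mp: int
) -> Optional[int]:
    """Evaluate the documented formula; return None where the docs are silent."""
    if min_blocks_per_mp <= 1:
        # Assumption: no auto-calculated hint in this case.
        return None
    return ceil_div(min_blocks_per_mp * smem_bytes * 100, max_smem_per_mp)


# Illustrative numbers: 2 resident blocks, 48 KiB each, 228 KiB available per SM.
percent = carveout_percent(2, 48 * 1024, 228 * 1024)
print(percent)
```

For these inputs the exact ratio is a little over 42 percent, so the ceiling gives 43.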
diff --git a/media/docs/pythonDSL/cute_dsl_general/dsl_struct_types.rst b/media/docs/pythonDSL/cute_dsl_general/dsl_struct_types.rst
index 6f8d2d17c..36cf11cea 100644
--- a/media/docs/pythonDSL/cute_dsl_general/dsl_struct_types.rst
+++ b/media/docs/pythonDSL/cute_dsl_general/dsl_struct_types.rst
@@ -28,6 +28,10 @@ Overview
      - **No**
      - Tuple subclass — fields fixed at construction.
        Flattened field-by-field through the pytree system.
+   * - ``@native_struct``
+     - **Yes**
+     - Generates an LLVM struct type.
+       ``llvm.insertvalue`` replaces field values in-place.
    * - ``@dataclass(frozen=True)``
      - **No**
      - Frozen dataclass — treated as a read-only pytree container,
@@ -108,6 +112,39 @@ To "update" a field, construct a replacement NamedTuple:
         out[1] = scaled.y
         out[2] = scaled.z
 
+
+``@native_struct``
+------------------
+
+Use ``@native_struct`` when kernel logic needs to **accumulate into or
+update** a struct field.  Unlike NamedTuple, fields are mutable: each write
+generates an ``llvm.insertvalue`` to replace the field in the underlying LLVM
+struct.
+
+.. code-block:: python
+
+    import cutlass
+    import cutlass.cute as cute
+
+    @cute.native_struct
+    class Accumulator:
+        total: cutlass.Int32
+        count: cutlass.Int32
+
+    @cute.jit
+    def accumulate(acc: Accumulator, values: cute.Tensor, n: cutlass.Int32):
+        for i in range(n):
+            acc.total = acc.total + values[i]
+            acc.count = acc.count + cutlass.Int32(1)
+
+``@native_struct`` also supports:
+
+- ``zero_init=False`` — initialize with ``llvm.mlir.undef`` instead of zero.
+- ``packed=True`` — create a packed LLVM struct (no padding between fields).
+- ``Constexpr`` fields — excluded from the native struct and passed as
+  ordinary Python values.
+
+
 Choosing the right type
 -----------------------
 
diff --git a/python/CuTeDSL/cutlass/__init__.py b/python/CuTeDSL/cutlass/__init__.py
index 1d8c1ea37..19bd2dd78 100644
--- a/python/CuTeDSL/cutlass/__init__.py
+++ b/python/CuTeDSL/cutlass/__init__.py
@@ -34,6 +34,7 @@ def _ensure_mlir_type_compat() -> None:
 
 _ensure_mlir_type_compat()
 del _ensure_mlir_type_compat
+
 __version__ = "@CUTLASS_IR_WHEEL_RELEASE_VERSION@"
 # Monkey patch CUDA version query function
 from ._mlir._mlir_libs._cutlass_ir._base_dsl import (
@@ -87,6 +88,7 @@ from .cute.typing import *
 # Utilities not belonging to CuTe
 from . import utils as utils
 from . import pipeline as pipeline
+from . import testing as testing
 
 # Used as internal symbol
 from . import cutlass_dsl as _dsl
@@ -101,3 +103,4 @@ cuda = _dsl.cuda_helpers
 from . import jax as jax
 
 CACHE_FILE = "compiled_cache.db"
+
diff --git a/python/CuTeDSL/cutlass/_pth_hook.py b/python/CuTeDSL/cutlass/_pth_hook.py
index d2b5e96b0..5c77183c0 100644
--- a/python/CuTeDSL/cutlass/_pth_hook.py
+++ b/python/CuTeDSL/cutlass/_pth_hook.py
@@ -20,6 +20,7 @@ def setup(
     lib_so_path: str | Path,
     finder_module_path: str | Path,
     root_dir: str | Path | None = None,
+    vendored_iket_parent_dir: str | Path | None = None,
 ) -> None:
     """Set up the editable install environment.
 
@@ -31,6 +32,18 @@ def setup(
     :param lib_so_path: Path to libcute_dsl_runtime.so
     :param finder_module_path: Path to _editable_finder.py module
     :param root_dir: Path to DSL root directory (optional)
+    :param vendored_iket_parent_dir: The *parent* directory of a vendored
+        ``iket/`` package — i.e. the directory whose immediate child is
+        ``iket/`` — typically a dedicated sibling under egg-info such as
+        ``cutlass-dsl-dev.egg-info/_iket_vendor/``. Pass this parent (NOT
+        ``iket/`` itself, NOT the egg-info root). When provided, the
+        parent is prepended to ``sys.path`` so top-level ``import iket``
+        resolves to the vendored runtime. The directory MUST NOT contain
+        any other top-level package directory (e.g. ``cutlass/``);
+        otherwise Python would discover a namespace portion for that
+        package, which can shadow / break the real package's regular
+        ``__init__.py`` resolution. Optional; omit when iket bundling is
+        not wanted.
     """
     # Convert to Path objects
     cutlass_source_dir = Path(cutlass_source_dir)
@@ -40,6 +53,19 @@ def setup(
     if root_dir is not None:
         root_dir = Path(root_dir)
 
+    # Make top-level `import iket` resolvable for editable install: the iket
+    # runtime package is vendored into <egg-info>/_iket_vendor/iket/ by
+    # setup.py; put its parent (`_iket_vendor/`, which contains ONLY iket/)
+    # on sys.path so `iket` is importable without polluting resolution of
+    # `cutlass` (which would otherwise pick up a namespace portion from any
+    # sibling directory under egg-info).
+    import sys as _sys
+
+    if vendored_iket_parent_dir is not None:
+        vip = str(Path(vendored_iket_parent_dir))
+        if vip not in _sys.path:
+            _sys.path.insert(0, vip)
+
     # Set CUTE_DSL_LIBS environment variable
     os.environ.setdefault("CUTE_DSL_LIBS", str(lib_so_path))
 
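Reviewer note on the ``sys.path`` handling added above: the hook prepends the vendored parent directory only if it is not already present, so calling ``setup`` repeatedly does not pile up duplicate entries. A standalone sketch of that idempotent-prepend pattern follows; `prepend_to_sys_path` is a hypothetical helper name and the directory shown is a made-up path, neither of which is part of the package.

```python
import sys
from pathlib import Path
from typing import Optional, Union


def prepend_to_sys_path(parent_dir: Optional[Union[str, Path]]) -> bool:
    """Put parent_dir at the front of sys.path once; report whether it was added."""
    if parent_dir is None:
        return False
    entry = str(Path(parent_dir))
    if entry in sys.path:
        # Already importable from this location; leave sys.path untouched.
        return False
    sys.path.insert(0, entry)
    return True


# Hypothetical directory, used only to demonstrate the behaviour.
example_dir = "/tmp/_iket_vendor_example"
first = prepend_to_sys_path(example_dir)
second = prepend_to_sys_path(example_dir)
print(first, second)
```

The membership check keeps the operation idempotent, but it also means an entry that already sits later in ``sys.path`` is not moved to the front.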
diff --git a/python/CuTeDSL/cutlass/base_dsl/__init__.py b/python/CuTeDSL/cutlass/base_dsl/__init__.py
index 32a251528..15a02660c 100644
--- a/python/CuTeDSL/cutlass/base_dsl/__init__.py
+++ b/python/CuTeDSL/cutlass/base_dsl/__init__.py
@@ -12,7 +12,7 @@
 # Local module imports
 from .dsl import *
 from .runtime import *
-from ._mlir_helpers import lru_cache_ir, dsl_user_op
+from .._mlir_helpers import lru_cache_ir, dsl_user_op
 from .env_manager import get_str_env_var, detect_gpu_arch
 
 from .utils.tree_utils import (
diff --git a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/arith.py b/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/arith.py
deleted file mode 100644
index 0732e32a5..000000000
--- a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/arith.py
+++ /dev/null
@@ -1,1554 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
-#
-# Use of this software is governed by the terms and conditions of the
-# NVIDIA End User License Agreement (EULA), available at:
-# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
-#
-# Any use, reproduction, disclosure, or distribution of this software
-# and related documentation outside the scope permitted by the EULA
-# is strictly prohibited.
-
-"""
-This module provides MLIR Arith Dialect helper functions
-"""
-
-import array
-import builtins
-from typing import Any, Callable, Optional, Union
-
-import numpy as np
-
-from ..common import *
-from ..common import DSLRuntimeError, DSLNotImplemented
-from ..._mlir import ir
-from ..._mlir.extras import types as T
-from ..._mlir.dialects import arith, math, builtin
-from ..._mlir.dialects import nvgpu, vector, llvm
-from .op import dsl_user_op
-
-from .lru_cache_ir import lru_cache_ir
-
-# =============================================================================
-# Arith Dialect Helper functions
-# =============================================================================
-
-
-def recast_type(src_type: ir.Type, res_elem_type: ir.Type) -> ir.Type:
-    if isinstance(src_type, T.VectorType):
-        if src_type.scalable:
-            res_type = T.vector(
-                *src_type.shape,
-                res_elem_type,
-                scalable=src_type.scalable,
-                scalable_dims=src_type.scalable_dims,
-            )
-        else:
-            res_type = T.vector(*src_type.shape, res_elem_type)
-    elif isinstance(src_type, T.RankedTensorType):
-        res_type = T.RankedTensorType.get(
-            element_type=res_elem_type, shape=src_type.shape, strides=src_type.strides
-        )
-    elif isinstance(src_type, T.UnrankedTensorType):
-        res_type = T.UnrankedTensorType.get(element_type=res_elem_type)
-    elif isinstance(src_type, T.MemRefType):
-        res_type = T.MemRefType.get(
-            element_type=res_elem_type, shape=src_type.shape, strides=src_type.strides
-        )
-    else:
-        res_type = res_elem_type
-    return res_type
-
-
-def is_scalar(ty: ir.Type) -> bool:
-    return not isinstance(
-        ty, (T.VectorType, T.RankedTensorType, T.UnrankedTensorType, T.MemRefType)
-    )
-
-
-def element_type(ty: ir.Type) -> ir.Type:
-    if not is_scalar(ty):
-        return ty.element_type
-    else:
-        return ty
-
-
-def is_narrow_precision(ty: ir.Type) -> bool:
-    narrow_types = {
-        T.f8E8M0FNU(),
-        T.f8E4M3FN(),
-        T.f8E4M3(),
-        T.f8E5M2(),
-        T.f8E4M3B11FNUZ(),
-        T.f4E2M1FN(),
-        T.f6E3M2FN(),
-        T.f6E2M3FN(),
-    }
-    return ty in narrow_types
-
-
-def is_float_type(ty: ir.Type) -> bool:
-    return (
-        arith._is_float_type(ty)
-        # TODO-upstream: prediction is not correct. Patch here and fix in upstream later
-        or is_narrow_precision(ty)
-        or ty in (T.bf16(), T.tf32())
-    )
-
-
-def is_integer_like_type(ty: ir.Type) -> bool:
-    return isinstance(ty, (ir.IntegerType, ir.IndexType))
-
-
-def bitcast(
-    src: ir.Value,
-    res_elem_type: ir.Type,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> ir.Value:
-    res_type = recast_type(src.type, res_elem_type)
-    return arith.bitcast(res_type, src, loc=loc, ip=ip)
-
-
-def cvtf(
-    src: ir.Value,
-    res_elem_type: ir.Type,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> ir.Value:
-    src_elem_type = element_type(src.type)
-
-    if res_elem_type == src_elem_type:
-        return src
-
-    res_type = recast_type(src.type, res_elem_type)
-
-    # Treat TF32 as F32 and use i32 as intermediate data
-    # TODO-upstream: update arith to support tf32 <-> f32 conversion
-    if src_elem_type == T.tf32():
-        # tf32 -> i32
-        tmp_type = recast_type(src.type, T.i32())
-        src = builtin.unrealized_conversion_cast([tmp_type], [src], loc=loc, ip=ip)
-        # i32 -> f32
-        src = bitcast(src, T.f32(), loc=loc, ip=ip)
-        # f32 -> X with `cvtf` recursively
-        return cvtf(src, res_elem_type, loc=loc, ip=ip)
-
-    if res_elem_type == T.tf32():
-        # X -> f32 with `cvtf`` recursively
-        tmp = cvtf(src, T.f32(), loc=loc, ip=ip)
-        # f32 -> i32
-        tmp = bitcast(tmp, T.i32(), loc=loc, ip=ip)
-        # i32 -> tf32
-        return builtin.unrealized_conversion_cast([res_type], [tmp], loc=loc, ip=ip)
-
-    if res_elem_type.width > src_elem_type.width:
-        return arith.extf(res_type, src, loc=loc, ip=ip)
-    else:
-        # bf16 <-> f16: both are 16-bit, arith.truncf requires strict narrowing.
-        # Route through f32 intermediate.
-        if (src_elem_type == T.f16() and res_elem_type == T.bf16()) or (
-            src_elem_type == T.bf16() and res_elem_type == T.f16()
-        ):
-            tmp_type = recast_type(src.type, T.f32())
-            tmp = arith.extf(tmp_type, src, loc=loc, ip=ip)
-            return arith.truncf(res_type, tmp, loc=loc, ip=ip)
-
-        # E8M0 requires upward rounding; all others default to to_nearest_even
-        roundingmode = (
-            arith.RoundingMode.upward if res_elem_type == T.f8E8M0FNU() else None
-        )
-        return arith.truncf(res_type, src, roundingmode=roundingmode, loc=loc, ip=ip)
-
-
-def fptoi(
-    src: ir.Value,
-    signed: Union[bool, None],
-    res_elem_type: ir.Type,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> ir.Value:
-    res_type = recast_type(src.type, res_elem_type)
-    # TODO-upstream: update arith to support this kind of conversion
-    if element_type(src.type) in (T.tf32(), T.bf16()):
-        src = cvtf(src, T.f32(), loc=loc, ip=ip)
-
-    if signed != False:  # noqa: E712
-        return arith.fptosi(res_type, src, loc=loc, ip=ip)
-    else:
-        return arith.fptoui(res_type, src, loc=loc, ip=ip)
-
-
-def itofp(
-    src: ir.Value,
-    signed: Union[bool, None],
-    res_elem_type: ir.Type,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> ir.Value:
-    res_type = recast_type(src.type, res_elem_type)
-
-    orig_res_type = res_type
-    # TODO-upstream: update arith to support this kind of conversion
-    if res_elem_type in (T.tf32(), T.bf16()):
-        res_type = recast_type(src.type, T.f32())
-
-    if signed != False and element_type(src.type).width > 1:  # noqa: E712
-        res = arith.sitofp(res_type, src, loc=loc, ip=ip)
-    else:
-        res = arith.uitofp(res_type, src, loc=loc, ip=ip)
-
-    if orig_res_type == res_type:
-        return res
-
-    return cvtf(res, element_type(orig_res_type), loc=loc, ip=ip)
-
-
-def int_to_int(
-    a: ir.Value,
-    dst_elem_type: Any,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> ir.Value:
-    src_signed = a.signed
-    dst_signed = dst_elem_type.signed
-    src_width = element_type(a.type).width
-    dst_width = dst_elem_type.width
-
-    dst_mlir_type = recast_type(a.type, dst_elem_type.mlir_type)
-
-    if dst_width == src_width:
-        return a
-    elif src_signed != False and not dst_signed:  # noqa: E712
-        # Signed -> Unsigned
-        if dst_width > src_width:
-            return arith.extui(dst_mlir_type, a, loc=loc, ip=ip)
-        else:
-            return arith.trunci(dst_mlir_type, a, loc=loc, ip=ip)
-    elif src_signed == dst_signed:
-        # Same signedness
-        if dst_width > src_width:
-            if src_signed != False and src_width > 1:  # noqa: E712
-                return arith.extsi(dst_mlir_type, a, loc=loc, ip=ip)
-            else:
-                return arith.extui(dst_mlir_type, a, loc=loc, ip=ip)
-        else:
-            return arith.trunci(dst_mlir_type, a, loc=loc, ip=ip)
-    else:
-        # Unsigned -> Signed
-        if dst_width > src_width:
-            return arith.extui(dst_mlir_type, a, loc=loc, ip=ip)
-        else:
-            # For truncation from unsigned to signed, we need to handle overflow
-            # First truncate to the target width
-            trunc = arith.trunci(dst_mlir_type, a, loc=loc, ip=ip)
-            # Then reinterpret as signed
-            if dst_signed:
-                return arith.bitcast(dst_mlir_type, trunc, loc=loc, ip=ip)
-            return trunc
-
-
-# =============================================================================
-# Arith Ops Emitter Helpers
-#   - assuming type of lhs and rhs match each other
-#   - op name matches python module operator
-# =============================================================================
-
-
-def _cast(
-    res_elem_ty: ir.Type,
-    src: ir.Value,
-    is_signed: Optional[bool] = None,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> ir.Value:
-    """
-    This function provides simplified interface to upstream op builder
-        arith.truncf(T.vector(shape, new_type), src)
-
-    is simplified as because it's element-wise op which can't change shape
-        arith.truncf(new_type, src)
-    """
-    if isinstance(src, ir.Value):
-        src_ty = src.type
-    else:
-        src_ty = type(src).mlir_type
-        src = src.ir_value()
-
-    src_elem_ty = element_type(src_ty)
-
-    if src_elem_ty == res_elem_ty:
-        return src
-    elif is_float_type(src_elem_ty) and is_float_type(res_elem_ty):
-        # float-to-float
-        return cvtf(src, res_elem_ty, loc=loc, ip=ip)
-    elif arith._is_integer_like_type(src_elem_ty) and arith._is_integer_like_type(
-        res_elem_ty
-    ):
-        if src_elem_ty.width >= res_elem_ty.width:
-            cast_op = arith.trunci
-        else:
-            if is_signed != False:  # noqa: E712
-                cast_op = arith.extsi
-            else:
-                cast_op = arith.extui
-
-        res_ty = recast_type(src_ty, res_elem_ty)
-        return cast_op(res_ty, src, loc=loc, ip=ip)
-    elif is_float_type(src_elem_ty) and arith._is_integer_like_type(res_elem_ty):
-        return fptoi(src, is_signed, res_elem_ty, loc=loc, ip=ip)
-    elif arith._is_integer_like_type(src_elem_ty) and is_float_type(res_elem_ty):
-        return itofp(src, is_signed, res_elem_ty, loc=loc, ip=ip)
-    else:
-        raise DSLRuntimeError(
-            f"cast from {src_elem_ty} to {res_elem_ty} is not supported"
-        )
-
-
-@lru_cache_ir()
-def const(
-    value: Union[int, float, bool, np.ndarray],
-    ty: Optional[Union[ir.Type, "NumericMeta"]] = None,  # type: ignore[name-defined]
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> ir.Value:
-    """
-    Generates dynamic expression for constant values.
-    """
-    from ..typing import Numeric, NumericMeta
-    from ..dsl import is_dynamic_expression
-    from ..utils.numpy import _numpy_type_to_mlir_type
-
-    if isinstance(value, Numeric):
-        value = value.value
-
-    # Early return
-    if is_dynamic_expression(value) and (
-        isinstance(value.type, type(value.type))  # type: ignore[union-attr]
-        or isinstance(value.type, type(T.bool()))  # type: ignore[union-attr]
-    ):
-        return value
-
-    # Assume type
-    if ty is None:
-        if isinstance(value, float):
-            ty = T.f32()
-        elif isinstance(value, bool):
-            ty = T.bool()
-        elif isinstance(value, int):
-            ty = T.i32()
-        elif isinstance(value, np.ndarray):
-            ty = T.vector(*value.shape, _numpy_type_to_mlir_type(value.dtype))
-            value = array.array(value.dtype.kind, value.flatten().tolist())  # type: ignore[assignment]
-        else:
-            raise DSLNotImplemented(f"{type(value)} is not supported")
-    elif isinstance(ty, NumericMeta):
-        ty = ty.mlir_type
-    elif isinstance(ty, ir.Type):
-        if ir.RankedTensorType.isinstance(ty) or ir.VectorType.isinstance(ty):
-            elem_ty = ty.element_type
-            if isinstance(elem_ty, ir.IntegerType):
-                attr = ir.IntegerAttr.get(elem_ty, value)
-            else:
-                attr = ir.FloatAttr.get(elem_ty, value)
-            value = ir.DenseElementsAttr.get_splat(ty, attr)
-        elif arith._is_float_type(ty) and isinstance(value, (bool, int)):
-            value = float(value)
-        elif arith._is_integer_like_type(ty) and isinstance(value, float):
-            value = int(value)
-    else:
-        raise DSLNotImplemented(f"type {ty} is not supported")
-
-    return arith.constant(ty, value, loc=loc, ip=ip)
-
-
-def _dispatch_to_rhs_r_op(op: Callable[..., "ArithValue"]) -> Callable[..., Any]:
-    """Decorator that dispatches to the right-hand-side's reverse operation.
-
-    If the other operand is not an ArithValue or is a subclass (more specific)
-    of ArithValue, this allows proper method resolution for binary operations.
-    """
-
-    def wrapper(
-        self: "ArithValue", other: Union[int, float, bool, "ArithValue"], **kwargs: Any
-    ) -> Any:
-        if not isinstance(other, ArithValue):
-            if not isinstance(other, (int, float, bool)):
-                return NotImplemented
-
-        return op(self, other, **kwargs)
-
-    return wrapper
-
-
-def _binary_op(op: Callable[..., "ArithValue"]) -> Callable[..., "ArithValue"]:
-    """
-    Decorator to check if the 'other' argument is an ArithValue.
-    If not, returns NotImplemented.
-    """
-
-    def wrapper(
-        self: "ArithValue", other: Union[int, float, bool, "ArithValue"], **kwargs: Any
-    ) -> "ArithValue":
-        if isinstance(other, (int, float, bool)):
-            other = const(other, self.type).with_signedness(self.signed)
-
-        return op(self, other, **kwargs)
-
-    return wrapper
-
-
-# Operator overloading
-@ir.register_value_caster(ir.Float4E2M1FNType.static_typeid)
-@ir.register_value_caster(ir.Float6E2M3FNType.static_typeid)
-@ir.register_value_caster(ir.Float6E3M2FNType.static_typeid)
-@ir.register_value_caster(ir.Float8E4M3FNType.static_typeid)
-@ir.register_value_caster(ir.Float8E4M3B11FNUZType.static_typeid)
-@ir.register_value_caster(ir.Float8E5M2Type.static_typeid)
-@ir.register_value_caster(ir.Float8E4M3Type.static_typeid)
-@ir.register_value_caster(ir.Float8E8M0FNUType.static_typeid)
-@ir.register_value_caster(ir.BF16Type.static_typeid)
-@ir.register_value_caster(ir.F16Type.static_typeid)
-@ir.register_value_caster(ir.FloatTF32Type.static_typeid)
-@ir.register_value_caster(ir.F32Type.static_typeid)
-@ir.register_value_caster(ir.F64Type.static_typeid)
-@ir.register_value_caster(ir.IntegerType.static_typeid)
-@ir.register_value_caster(ir.RankedTensorType.static_typeid)
-class ArithValue(ir.Value):
-    """Overloads operators for MLIR's Arith dialects binary operations."""
-
-    @dsl_user_op
-    def __init__(
-        self,
-        v: Union[int, ir.Value],
-        signed: Union[bool, None] = None,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> None:
-        if isinstance(v, int):
-            v = arith.constant(self.type, v, loc=loc, ip=ip)
-        super().__init__(v)
-
-        elem_ty = element_type(self.type)
-        self.is_float = arith._is_float_type(elem_ty)
-        # arith dialect consider `1` in `i1` as `-1`, treat it as unsigned for DSL
-        self.signed = signed and elem_ty.width > 1
-
-    @dsl_user_op
-    def ir_value(
-        self,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> ir.Value:
-        return self
-
-    def with_signedness(self, signed: Union[bool, None]) -> "ArithValue":
-        return type(self)(self, signed)
-
-    @dsl_user_op
-    def __neg__(
-        self,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.type == T.bool():
-            raise TypeError(
-                "Negation, the operator `-` is not supported for boolean type"
-            )
-
-        if self.is_float:
-            return arith.negf(self, loc=loc, ip=ip)
-        else:
-            c0 = arith.constant(self.type, 0, loc=loc, ip=ip)
-            return arith.subi(c0, self, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __pow__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float and other.is_float:
-            return math.powf(self, other, loc=loc, ip=ip)
-        elif self.is_float and not other.is_float:
-            return math.fpowi(self, other, loc=loc, ip=ip)
-        elif not self.is_float and other.is_float:
-            lhs = itofp(self, self.signed, T.f32(), loc=loc, ip=ip)
-            rhs = cvtf(other, T.f32(), loc=loc, ip=ip)
-            return math.powf(lhs, rhs, loc=loc, ip=ip)
-        elif not self.is_float and not other.is_float:
-            return math.ipowi(self, other, loc=loc, ip=ip)
-        else:
-            raise DSLNotImplemented(f"Unsupported '{self} ** {other}'")
-
-    @dsl_user_op
-    @_binary_op
-    def __rpow__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return other.__pow__(self, loc=loc, ip=ip)
-
-    # arith operators
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __add__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            return arith.addf(self, other, loc=loc, ip=ip)
-        else:
-            return arith.addi(self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __sub__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            return arith.subf(self, other, loc=loc, ip=ip)
-        else:
-            return arith.subi(self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __mul__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            return arith.mulf(self, other, loc=loc, ip=ip)
-        else:
-            return arith.muli(self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __truediv__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            return arith.divf(self, other, loc=loc, ip=ip)
-        else:
-            lhs = itofp(self, self.signed, T.f32(), loc=loc, ip=ip)
-            rhs = itofp(other, other.signed, T.f32(), loc=loc, ip=ip)
-            return arith.divf(lhs, rhs, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __floordiv__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            q = arith.divf(self, other, loc=loc, ip=ip)
-            return math.floor(q, loc=loc, ip=ip)
-        elif self.signed != False:  # noqa: E712
-            return arith.floordivsi(self, other, loc=loc, ip=ip)
-        else:
-            return arith.divui(self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __mod__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            return arith.remf(self, other, loc=loc, ip=ip)
-        elif self.signed != False:  # noqa: E712
-            return arith.remsi(self, other, loc=loc, ip=ip)
-        else:
-            return arith.remui(self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __radd__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return other.__add__(self, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __rsub__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return other.__sub__(self, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __rmul__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return other.__mul__(self, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __rtruediv__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return other.__truediv__(self, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __rfloordiv__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return other.__floordiv__(self, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __rmod__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return other.__mod__(self, loc=loc, ip=ip)
-
-    # Comparison operators (comparison doesn't have right-hand-side variants)
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __lt__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            return arith.cmpf(arith.CmpFPredicate.OLT, self, other, loc=loc, ip=ip)
-        elif self.signed != False:  # noqa: E712
-            return arith.cmpi(arith.CmpIPredicate.slt, self, other, loc=loc, ip=ip)
-        else:
-            return arith.cmpi(arith.CmpIPredicate.ult, self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __le__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            return arith.cmpf(arith.CmpFPredicate.OLE, self, other, loc=loc, ip=ip)
-        elif self.signed != False:  # noqa: E712
-            return arith.cmpi(arith.CmpIPredicate.sle, self, other, loc=loc, ip=ip)
-        else:
-            return arith.cmpi(arith.CmpIPredicate.ule, self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __eq__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            return arith.cmpf(arith.CmpFPredicate.OEQ, self, other, loc=loc, ip=ip)
-        else:
-            return arith.cmpi(arith.CmpIPredicate.eq, self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __ne__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            # In Python, bool(float("nan")) is True, so use unordered comparison here
-            return arith.cmpf(arith.CmpFPredicate.UNE, self, other, loc=loc, ip=ip)
-        else:
-            return arith.cmpi(arith.CmpIPredicate.ne, self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __gt__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            return arith.cmpf(arith.CmpFPredicate.OGT, self, other, loc=loc, ip=ip)
-        elif self.signed != False:  # noqa: E712
-            return arith.cmpi(arith.CmpIPredicate.sgt, self, other, loc=loc, ip=ip)
-        else:
-            return arith.cmpi(arith.CmpIPredicate.ugt, self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __ge__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            return arith.cmpf(arith.CmpFPredicate.OGE, self, other, loc=loc, ip=ip)
-        elif self.signed != False:  # noqa: E712
-            return arith.cmpi(arith.CmpIPredicate.sge, self, other, loc=loc, ip=ip)
-        else:
-            return arith.cmpi(arith.CmpIPredicate.uge, self, other, loc=loc, ip=ip)
-
-    # Unary operators
-    @dsl_user_op
-    def __abs__(
-        self,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.is_float:
-            return math.absf(self, loc=loc, ip=ip)
-        else:
-            return math.absi(self, loc=loc, ip=ip)
-
-    @dsl_user_op
-    def __invert__(
-        self,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        # For i1 (boolean) types, the all-ones value is 1, not -1.
-        # Using -1 with i1 vectors causes arith.constant to produce a
-        # type mismatch (e.g. vector<32xi0> instead of vector<32xi1>).
-        all_ones = 1 if element_type(self.type).width == 1 else -1
-        return arith.xori(self, const(all_ones, self.type), loc=loc, ip=ip)
-
-    # Bitwise operations
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __and__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return arith.andi(self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __or__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return arith.ori(self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __xor__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return arith.xori(self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __rshift__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        if self.signed != False:  # noqa: E712
-            return arith.shrsi(self, other, loc=loc, ip=ip)
-        else:
-            return arith.shrui(self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_dispatch_to_rhs_r_op
-    @_binary_op
-    def __lshift__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return arith.shli(self, other, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __rand__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return arith.andi(other, self, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __ror__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return arith.ori(other, self, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __rxor__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return arith.xori(other, self, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __rrshift__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return other.__rshift__(self, loc=loc, ip=ip)
-
-    @dsl_user_op
-    @_binary_op
-    def __rlshift__(
-        self,
-        other: "ArithValue",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "ArithValue":
-        return other.__lshift__(self, loc=loc, ip=ip)
-
-    def __hash__(self) -> int:
-        return super().__hash__()
-
-    def __str__(self) -> str:
-        return "?"
-
-    def __repr__(self) -> str:
-        return self.__str__()
-
-
-def _min(
-    lhs: ir.Value,
-    rhs: ir.Value,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> ir.Value:
-    """
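-    Build an ``arith`` min operation through a unified interface.
-
-    Both operands are assumed to have the same type. Two static Python
-    operands are folded with the built-in ``min``; a static operand paired
-    with a dynamic one is materialized as a constant of the dynamic type.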
-    """
-    from ..dsl import is_dynamic_expression
-
-    if not is_dynamic_expression(lhs):
-        if not is_dynamic_expression(rhs):
-            return min(lhs, rhs)
-        else:
-            lhs = arith.constant(rhs.type, lhs, loc=loc, ip=ip)
-    else:
-        if not is_dynamic_expression(rhs):
-            rhs = arith.constant(lhs.type, rhs, loc=loc, ip=ip)
-
-    # Handle vector types
-    if isinstance(lhs.type, T.VectorType):
-        elem_type = lhs.type.element_type
-        if arith._is_integer_like_type(elem_type):
-            assert hasattr(lhs, "signed"), (
-                "Should have attribute `signed`, must be a bug"
-            )
-            if lhs.signed != False:  # noqa: E712
-                return arith.minsi(lhs, rhs, loc=loc, ip=ip)
-            else:
-                return arith.minui(lhs, rhs, loc=loc, ip=ip)
-        else:
-            return arith.minimumf(lhs, rhs, loc=loc, ip=ip)
-    elif arith._is_integer_like_type(lhs.type):
-        assert hasattr(lhs, "signed"), "Should have attribute `signed`, must be a bug"
-        if lhs.signed != False:  # noqa: E712
-            return arith.minsi(lhs, rhs, loc=loc, ip=ip)
-        else:
-            return arith.minui(lhs, rhs, loc=loc, ip=ip)
-    else:
-        return arith.minimumf(lhs, rhs, loc=loc, ip=ip)
-
-
-def _max(
-    lhs: ir.Value,
-    rhs: ir.Value,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> ir.Value:
-    """
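-    Build an ``arith`` max operation through a unified interface.
-
-    Both operands are assumed to have the same type. Two static Python
-    operands are folded with the built-in ``max``; a static operand paired
-    with a dynamic one is materialized as a constant of the dynamic type.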
-    """
-    from ..dsl import is_dynamic_expression
-
-    if not is_dynamic_expression(lhs):
-        if not is_dynamic_expression(rhs):
-            return max(lhs, rhs)
-        else:
-            lhs = arith.constant(rhs.type, lhs, loc=loc, ip=ip)
-    else:
-        if not is_dynamic_expression(rhs):
-            rhs = arith.constant(lhs.type, rhs, loc=loc, ip=ip)
-    # Handle vector types
-    if isinstance(lhs.type, T.VectorType):
-        elem_type = lhs.type.element_type
-        if arith._is_integer_like_type(elem_type):
-            assert hasattr(lhs, "signed"), (
-                "Should have attribute `signed`, must be a bug"
-            )
-            if lhs.signed != False:  # noqa: E712
-                return arith.maxsi(lhs, rhs, loc=loc, ip=ip)
-            else:
-                return arith.maxui(lhs, rhs, loc=loc, ip=ip)
-        else:
-            return arith.maximumf(lhs, rhs, loc=loc, ip=ip)
-    elif arith._is_integer_like_type(lhs.type):
-        assert hasattr(lhs, "signed"), "Should have attribute `signed`, must be a bug"
-        if lhs.signed != False:  # noqa: E712
-            return arith.maxsi(lhs, rhs, loc=loc, ip=ip)
-        else:
-            return arith.maxui(lhs, rhs, loc=loc, ip=ip)
-    else:
-        return arith.maximumf(lhs, rhs, loc=loc, ip=ip)
-
-
-# =============================================================================
-# Vector Type - Immutable on registers
-# =============================================================================
-
-
-@ir.register_value_caster(ir.VectorType.static_typeid)
-class Vector(ArithValue):
-    """Wrap an MLIR ``vector<NxTy>`` register value with DSL type information.
-
-    Provides element extraction (``vec[i]`` / ``vec[a:b]``), element-wise
-    arithmetic (``+``, ``-``, ``*``, ``/``), type conversion (:meth:`to`),
-    and bit-reinterpretation (:meth:`bitcast`) on top of a raw MLIR vector.
-
-    Vectors live entirely in registers — they carry no memory address and do
-    not support in-place element assignment.
-
-    Registered as the MLIR value caster for :class:`ir.VectorType`, so any
-    op that returns a vector automatically produces a ``Vector`` instance.
-
-    :param v: Underlying MLIR vector value.
-    :type v: ir.Value
-    :param dtype: DSL element type (e.g. ``Float32``, ``Int32``).
-        Inferred from the MLIR element type when omitted.
-    :type dtype: type, optional
-    """
-
-    _dtype: "type"
-    _mlir_type: ir.Type
-    _shape: "tuple[int, ...]"
-
-    # Cache parameterized subclasses so ``Vector[T, N] is Vector[T, N]``.
-    _parameterized_cache: "dict[tuple, type]" = {}
-
-    @classmethod
-    def __class_getitem__(cls, params: "tuple[type, int | tuple[int, ...]]") -> type:
-        """Return a parameterized Vector subclass with an ``mlir_type`` descriptor.
-
-        ``Vector[Int32, 4].mlir_type`` returns ``vector<4xi32>`` and
-        ``Vector[Float32, (4, 8)].mlir_type`` returns ``vector<4x8xf32>``,
-        matching the dual type-descriptor / value-constructor pattern of
-        scalar types like ``Int32``.  Follows the same approach as
-        ``Pointer.__class_getitem__``.
-        """
-        element_type, shape = params
-
-        from ..typing import Numeric
-
-        if not (isinstance(element_type, type) and issubclass(element_type, Numeric)):
-            raise TypeError(
-                f"Vector element type must be a Numeric type (Integer or Float), "
-                f"got {element_type!r}"
-            )
-        if isinstance(shape, int):
-            shape = (shape,)
-        shape = tuple(shape)
-        if not shape or any(d <= 0 for d in shape):
-            raise ValueError(
-                f"Vector shape dimensions must be positive integers, got {shape}"
-            )
-        key = (element_type, shape)
-        if key not in cls._parameterized_cache:
-            shape_str = "x".join(str(d) for d in shape)
-
-            class _Parameterized(cls):  # type: ignore[valid-type, misc]
-                """Vector subclass with an ``mlir_type`` type descriptor."""
-
-                class mlir_type:  # noqa: N801
-                    def __get__(
-                        self, obj: object, objtype: Optional[type] = None
-                    ) -> ir.Type:
-                        return ir.VectorType.get(list(shape), element_type.mlir_type)  # type: ignore[arg-type, attr-defined]
-
-                mlir_type = mlir_type()  # type: ignore[misc, assignment]
-
-                @staticmethod
-                def __get_mlir_types__() -> list:
-                    """Return MLIR types list — compatible with FFI ``_to_mlir_types``."""
-                    return [ir.VectorType.get(list(shape), element_type.mlir_type)]  # type: ignore[arg-type, attr-defined]
-
-                @staticmethod
-                def isinstance(value: object) -> bool:
-                    """Check if an ``ir.Value`` matches this parameterized vector type."""
-                    if not builtins.isinstance(value, ir.Value):
-                        return False
-                    ty = value.type
-                    if not builtins.isinstance(ty, ir.VectorType):
-                        return False
-                    return (
-                        list(ty.shape) == list(shape)  # type: ignore[arg-type]
-                        and ty.element_type == element_type.mlir_type  # type: ignore[attr-defined]
-                    )
-
-            _Parameterized.__name__ = f"Vector[{element_type.__name__}, {shape_str}]"
-            _Parameterized.__qualname__ = _Parameterized.__name__
-            cls._parameterized_cache[key] = _Parameterized
-        return cls._parameterized_cache[key]
-
-    def __init__(
-        self,
-        v: "ir.Value",
-        *,
-        dtype: "type | None" = None,
-        loc: "ir.Location | None" = None,
-        ip: "ir.InsertionPoint | None" = None,
-    ) -> None:
-        # Derive signedness from dtype for ArithValue base
-        signed = getattr(dtype, "signed", None)
-        super().__init__(v, signed, loc=loc, ip=ip)
-
-        # Infer dtype from MLIR element type if not provided
-        if dtype is None:
-            from ..typing import Numeric
-
-            dtype = Numeric.from_mlir_type(self.type.element_type)
-        self._dtype = dtype
-        self._mlir_type = dtype.mlir_type  # type: ignore[attr-defined]
-
-        # Shape is always derived from MLIR vector type
-        self._shape = tuple(self.type.shape)
-
-    # =========================================================================
-    # DSL Infrastructure
-    # =========================================================================
-
-    def __extract_mlir_values__(self) -> list:
-        return [self]
-
-    def __extract_mlir_attributes__(self) -> list:
-        return [ir.DictAttr.get({})]
-
-    def __new_from_mlir_values__(self, values: list) -> "Vector":
-        return Vector(values[0], dtype=self._dtype)
-
-    def with_signedness(self, signed: Union[bool, None]) -> "Vector":
-        """Override ArithValue.with_signedness for keyword-only __init__."""
-        new_vec = Vector(self, dtype=self._dtype)
-        elem_ty = self.type.element_type
-        new_vec.signed = (
-            signed
-            and ir.IntegerType.isinstance(elem_ty)
-            and ir.IntegerType(elem_ty).width > 1
-        )
-        return new_vec
-
-    # =========================================================================
-    # Properties
-    # =========================================================================
-
-    @property
-    def dtype(self) -> "type":
-        """The DSL element type (e.g., Float32, Int32)."""
-        return self._dtype
-
-    @property
-    def shape(self) -> "tuple[int, ...]":
-        """The logical shape of the vector array (1D, 2D, or 3D)."""
-        return self._shape
-
-    @property
-    def _count(self) -> int:
-        """Total number of elements (product of shape dimensions)."""
-        result = 1
-        for dim in self._shape:
-            result *= dim
-        return result
-
-    def numel(self) -> int:
-        """Total number of elements (product of all shape dimensions)."""
-        return self._count
-
-    # Vector has no memory space - it's always in registers
-    # The space property is intentionally not exposed on Vector
-
-    def ir_value(
-        self,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> ir.Value:
-        """Return the underlying MLIR vector value."""
-        return self
-
-    # =========================================================================
-    # Indexing Operations
-    # =========================================================================
-
-    def _compute_linear_index(
-        self,
-        indices: "tuple[Union[int, Int32], ...]",  # type: ignore[name-defined]
-    ) -> "Union[int, Int32]":  # type: ignore[name-defined]
-        """Compute linear index from multi-dimensional indices (row-major order)."""
-        if len(indices) != len(self._shape):
-            raise IndexError(
-                f"Expected {len(self._shape)} indices for shape {self._shape}, "
-                f"got {len(indices)}"
-            )
-
-        # Check if all indices are static (Python ints)
-        all_static = all(isinstance(i, int) for i in indices)
-
-        if all_static:
-            # Static computation
-            linear = 0
-            stride = 1
-            for i in range(len(self._shape) - 1, -1, -1):
-                linear += indices[i] * stride
-                stride *= self._shape[i]
-            return linear
-        else:
-            from ..typing import Int32
-
-            # Dynamic computation using Int32 arithmetic
-            linear = Int32(0)  # type: ignore[assignment]
-            stride = 1
-            for i in range(len(self._shape) - 1, -1, -1):
-                idx = indices[i] if isinstance(indices[i], Int32) else Int32(indices[i])
-                linear = linear + idx * Int32(stride)
-                stride *= self._shape[i]
-            return linear
-
-    def __getitem__(
-        self,
-        idx: "Union[int, Int32, tuple, slice]",  # type: ignore[name-defined]
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> object:
-        """Extract an element or a contiguous sub-vector.
-
-        Supports three indexing modes:
-
-        * **Scalar index** — returns a single DSL scalar value::
-
-            elem = vec[i]          # static int or Int32
-
-        * **1-D slice** — all bounds must be static Python ``int``s::
-
-            sub = vec[start:stop]        # stride defaults to 1
-            sub = vec[start:stop:stride] # explicit stride
-
-        * **Multi-dimensional slice** — one entry per dimension, all bounds
-          must be static ``int``s.  An integer in a multi-dim slice is treated
-          as a size-1 slice (the dimension is kept)::
-
-            sub = mat[r0:r1, c0:c1]   # 2-D: rows r0:r1, cols c0:c1
-            sub = mat[:, c0:c1]       # 2-D: all rows, cols c0:c1
-            sub = mat[0, c0:c1]       # 2-D: row 0 (size 1), cols c0:c1
-
-        Unit-stride slices use ``vector.extract_strided_slice`` internally and
-        strided 1-D slices use ``vector.shuffle``; dynamic (MLIR-value) slice
-        bounds are **not** supported.
-
-        :param idx: Element index (int or Int32), a slice, or a tuple of
-            ints/slices for multi-dimensional access.
-        :type idx: int or Int32 or tuple or slice
-        :return: A scalar DSL value (for element indexing) or a new
-            :class:`Vector` (for slice indexing).
-        :rtype: Numeric or Vector
-        :raises TypeError: If slice bounds are not static Python ints.
-        :raises NotImplementedError: If a multi-dimensional slice uses a step
-            other than 1.
-        :raises IndexError: If the number of dimensions in a multi-dim index
-            does not match :attr:`shape`.
-        """
-        from ..utils.logger import log
-
-        # Slice → extract_strided_slice (step==1) or vector.shuffle (step>1)
-        if isinstance(idx, slice):
-            start = idx.start if idx.start is not None else 0
-            step = idx.step if idx.step is not None else 1
-            stop = idx.stop if idx.stop is not None else self._count
-            if not all(isinstance(v, int) for v in (start, stop, step)):
-                raise TypeError(
-                    "Vector slice indices must be static ints; "
-                    f"got start={start}, stop={stop}, step={step}"
-                )
-            size = (stop - start + step - 1) // step
-            result_ty = ir.VectorType.get([size], self._mlir_type)
-            if step == 1:
-                result = vector.extract_strided_slice(
-                    result_ty, self, [start], [size], [step], loc=loc, ip=ip
-                )
-            else:
-                # vector.extract_strided_slice requires stride==1; use shuffle instead
-                mask = list(range(start, stop, step))
-                result = vector.shuffle(self, self, mask, loc=loc, ip=ip)
-            return Vector(result, dtype=self._dtype)
-
-        # Multi-dimensional slice: tuple containing at least one slice object
-        if isinstance(idx, tuple) and any(isinstance(i, slice) for i in idx):
-            if len(idx) != len(self._shape):
-                raise IndexError(
-                    f"Expected {len(self._shape)} indices for shape {self._shape}, "
-                    f"got {len(idx)}"
-                )
-            offsets: "list[int]" = []
-            sizes: "list[int]" = []
-            strides: "list[int]" = []
-            for dim, (i, dim_size) in enumerate(zip(idx, self._shape)):
-                if isinstance(i, slice):
-                    start = i.start if i.start is not None else 0
-                    stop = i.stop if i.stop is not None else dim_size
-                    step = i.step if i.step is not None else 1
-                    if not all(isinstance(v, int) for v in (start, stop, step)):
-                        raise TypeError(
-                            f"Vector slice indices must be static ints in dimension {dim}; "
-                            f"got start={start}, stop={stop}, step={step}"
-                        )
-                    if step != 1:
-                        raise NotImplementedError(
-                            f"Multi-dimensional strided slice (step={step}) is not supported; "
-                            "use step=1 for multi-dimensional slices"
-                        )
-                    offsets.append(start)
-                    sizes.append(stop - start)
-                    strides.append(1)
-                elif isinstance(i, int):
-                    # Integer index: treated as a size-1 slice (rank is preserved)
-                    if i < 0:
-                        i += dim_size
-                    offsets.append(i)
-                    sizes.append(1)
-                    strides.append(1)
-                else:
-                    raise TypeError(
-                        f"Vector multi-dimensional slice: dimension {dim} index must be "
-                        f"a static int or slice, got {type(i).__name__}"
-                    )
-            result_ty = ir.VectorType.get(sizes, self._mlir_type)
-            result = vector.extract_strided_slice(
-                result_ty, self, offsets, sizes, strides, loc=loc, ip=ip
-            )
-            return Vector(result, dtype=self._dtype)
-
-        # Normalize to tuple
-        if not isinstance(idx, tuple):
-            indices = (idx,)
-        else:
-            indices = idx
-
-        # Compute linear index
-        linear_idx = self._compute_linear_index(indices)
-
-        log().info(
-            f"Vector.__getitem__: idx={idx}, linear={linear_idx}, "
-            f"dtype={self._dtype}, shape={self._shape}"
-        )
-
-        # For dynamic indices, we use llvm.extractelement instead of vector.extract
-        # because vector.extract has issues with dynamic positions
-        if isinstance(linear_idx, int):
-            # Static index - use vector.extract with static position
-            elem = vector.extract(self, [], [linear_idx])
-        else:
-            # Dynamic index - use llvm.extractelement
-            elem = llvm.extractelement(self, linear_idx.ir_value())
-
-        return self._dtype(elem)
-
-    def __setitem__(
-        self,
-        idx: "Union[int, Int32, tuple]",  # type: ignore[name-defined]
-        value: object,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> None:
-        """
-        Vector element assignment is not supported.
-
-        Vectors are immutable register values. Use one of these alternatives:
-
-        1. Use Array for mutable memory-backed storage:
-           arr = ctm.allocate_memory_local(ctm.Float32, 4)
-           arr[0] = value  # This works
-
-        2. Use full() to create vectors with initial values:
-           vec = ctm.full((4,), 1.0, ctm.Float32)
-        """
-        raise TypeError(
-            "Vector is immutable. Element assignment (vec[i] = value) is not supported."
-        )
-
-    # =========================================================================
-    # Arithmetic Operations
-    # =========================================================================
-
-    def _is_float_type(self) -> bool:
-        """Check if this vector contains floating-point elements."""
-        return arith._is_float_type(self._mlir_type)
-
-    # Arithmetic operators (+, -, *, /, -x) are inherited from ArithValue.
-    # Results are automatically wrapped as Vector via the value caster.
-
-    def to(
-        self,
-        dtype: "type",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "Vector":
-        """Convert the vector elements to a different numeric type.
-
-        :param dtype: Target DSL element type (e.g. ``Float16``, ``Int32``).
-        :type dtype: Type[Numeric]
-        :return: A new :class:`Vector` with the same shape and elements cast
-            to ``dtype``.
-        :rtype: Vector
-        :raises TypeError: If ``dtype`` is not a subclass of ``Numeric``.
-
-        Example::
-
-            vec_f32 = ctm.full([4], 1.5, dtype=ctm.Float32)
-            vec_i32 = vec_f32.to(ctm.Int32)    # fp → int truncation
-            vec_f16 = vec_f32.to(ctm.Float16)  # fp32 → fp16 narrowing
-        """
-        from inspect import isclass
-        from ..typing import Numeric, Integer
-
-        if dtype is ir.Value:
-            return self
-
-        if not isclass(dtype) or not issubclass(dtype, Numeric):
-            raise TypeError(f"dtype must be a type of Numeric, but got {type(dtype)}")
-
-        src_dtype = self._dtype
-        if src_dtype == dtype:
-            return self
-
-        # maybe_downcast handles narrow precision types, with_signedness sets signedness
-        src = self.maybe_downcast().with_signedness(self.signed)
-
-        if src_dtype.is_float and dtype.is_float:  # type: ignore[attr-defined]
-            res_vect = cvtf(src, dtype.mlir_type, loc=loc, ip=ip)
-        elif src_dtype.is_float and issubclass(dtype, Integer):  # type: ignore[attr-defined]
-            res_vect = fptoi(src, dtype.signed, dtype.mlir_type, loc=loc, ip=ip)
-        elif issubclass(src_dtype, Integer) and dtype.is_float:
-            res_vect = itofp(src, src_dtype.signed, dtype.mlir_type, loc=loc, ip=ip)
-        else:
-            res_vect = int_to_int(src, dtype, loc=loc, ip=ip)
-
-        return Vector(res_vect, dtype=dtype)
-
-    @dsl_user_op
-    def bitcast(
-        self,
-        dtype: "type",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "Vector":
-        """Reinterpret the vector bits as a different element type.
-
-        The total bit width is preserved; the element count adjusts
-        proportionally.  For example, ``vector<4xi32>`` bitcast to
-        ``Float16`` yields ``vector<8xf16>`` (4 × 32 = 8 × 16 bits).
-
-        :param dtype: Target DSL element type (e.g. ``Float32``, ``Float16``).
-        :type dtype: Type[Numeric]
-        :return: A new :class:`Vector` with bits reinterpreted as ``dtype``.
-        :rtype: Vector
-        :raises TypeError: If ``dtype`` is not a subclass of ``Numeric``.
-        """
-        from inspect import isclass
-        from ..typing import Numeric
-
-        if not isclass(dtype) or not issubclass(dtype, Numeric):
-            raise TypeError(f"dtype must be a Numeric type, but got {dtype}")
-        if dtype is self._dtype:
-            return self
-        new_count = self._count * self._dtype.width // dtype.width  # type: ignore[attr-defined]
-        target_vec_ty = T.vector(new_count, dtype.mlir_type)
-        res_vec = vector.bitcast(target_vec_ty, self, loc=loc, ip=ip)
-        return Vector(res_vec, dtype=dtype, loc=loc, ip=ip)
-
-    @dsl_user_op
-    def __add__(
-        self,
-        other: "Vector",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "Vector":
-        result = super().__add__(other, loc=loc, ip=ip)
-        return Vector(result, dtype=self.dtype, loc=loc, ip=ip)
-
-    @dsl_user_op
-    def __radd__(
-        self,
-        other: "Vector",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "Vector":
-        result = super().__radd__(other, loc=loc, ip=ip)
-        return Vector(result, dtype=self.dtype, loc=loc, ip=ip)
-
-    @dsl_user_op
-    def __sub__(
-        self,
-        other: "Vector",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "Vector":
-        result = super().__sub__(other, loc=loc, ip=ip)
-        return Vector(result, dtype=self.dtype, loc=loc, ip=ip)
-
-    @dsl_user_op
-    def __rsub__(
-        self,
-        other: "Vector",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "Vector":
-        result = super().__rsub__(other, loc=loc, ip=ip)
-        return Vector(result, dtype=self.dtype, loc=loc, ip=ip)
-
-    @dsl_user_op
-    def __mul__(
-        self,
-        other: "Vector",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "Vector":
-        result = super().__mul__(other, loc=loc, ip=ip)
-        return Vector(result, dtype=self.dtype, loc=loc, ip=ip)
-
-    @dsl_user_op
-    def __rmul__(
-        self,
-        other: "Vector",
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> "Vector":
-        result = super().__rmul__(other, loc=loc, ip=ip)
-        return Vector(result, dtype=self.dtype, loc=loc, ip=ip)
diff --git a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/dialect_proxy.py b/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/dialect_proxy.py
deleted file mode 100644
index c0087e00f..000000000
--- a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/dialect_proxy.py
+++ /dev/null
@@ -1,111 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
-#
-# Use of this software is governed by the terms and conditions of the
-# NVIDIA End User License Agreement (EULA), available at:
-# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
-#
-# Any use, reproduction, disclosure, or distribution of this software
-# and related documentation outside the scope permitted by the EULA
-# is strictly prohibited.
-
-import enum
-import types
-from collections.abc import Callable
-from typing import Any
-
-
-class DialectAutoConvertProxy:
-    """
-    Proxy that wraps a raw MLIR dialect module, auto-converting DSL types
-    (anything with an ``.ir_value()`` method) to ``ir.Value`` when calling
-    dialect operations.
-
-    This enables users to write cleaner code without explicit
-    ``.ir_value()`` calls::
-
-        # Before (raw dialect module):
-        nvvm.shfl_sync(T.i32(), Int32(mask).ir_value(), ...)
-
-        # After (proxied):
-        nvvm.shfl_sync(T.i32(), Int32(mask), ...)
-
-    Non-callable attributes and enum classes are passed through unchanged
-    so that attribute access like ``nvvm.ShflKind.idx`` still works.
-
-    Parameters
-    ----------
-    dialect_module
-        The raw MLIR dialect module to wrap
-        (e.g. ``cutlass._mlir.dialects.nvvm``).
-    """
-
-    def __init__(self, dialect_module: types.ModuleType) -> None:
-        self._module = dialect_module
-        self._wrapped_cache: dict[str, Callable[..., object]] = {}
-
-    @staticmethod
-    def _convert_arg(
-        arg: object,
-        loc: object | None,
-        ip: object | None,
-    ) -> object:
-        """Recursively convert DSL objects to ir.Value."""
-        if hasattr(arg, "ir_value") and callable(arg.ir_value):
-            try:
-                return arg.ir_value(loc=loc, ip=ip)
-            except TypeError:
-                # Some ir_value() methods (e.g. Array) don't accept loc/ip.
-                return arg.ir_value()
-        if isinstance(arg, (list, tuple)):
-            converted = [
-                DialectAutoConvertProxy._convert_arg(item, loc, ip) for item in arg
-            ]
-            return type(arg)(converted)
-        return arg
-
-    def __getattr__(self, name: str) -> Any:
-        attr = getattr(self._module, name)
-
-        # Non-callable attributes and enum classes pass through
-        # unchanged.  Enum classes need attribute access (e.g.
-        # ShflKind.idx), but MLIR operation classes should be
-        # wrapped for argument conversion.
-        if not callable(attr) or isinstance(attr, enum.EnumMeta):
-            return attr
-
-        # Use cache for wrapped callables
-        if name not in self._wrapped_cache:
-
-            def _make_wrapper(
-                func: Callable[..., object],
-            ) -> Callable[..., object]:
-                def wrapped(
-                    *args: object,
-                    loc: object | None = None,
-                    ip: object | None = None,
-                    **kwargs: object,
-                ) -> object:
-                    converted_args = tuple(
-                        DialectAutoConvertProxy._convert_arg(arg, loc, ip)
-                        for arg in args
-                    )
-                    converted_kwargs = {
-                        k: DialectAutoConvertProxy._convert_arg(v, loc, ip)
-                        for k, v in kwargs.items()
-                    }
-                    return func(
-                        *converted_args,
-                        loc=loc,
-                        ip=ip,
-                        **converted_kwargs,
-                    )
-
-                return wrapped
-
-            self._wrapped_cache[name] = _make_wrapper(attr)
-
-        return self._wrapped_cache[name]
-
-    def __dir__(self) -> list[str]:
-        return dir(self._module)
diff --git a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/gpu.py b/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/gpu.py
deleted file mode 100644
index b9d05e10b..000000000
--- a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/gpu.py
+++ /dev/null
@@ -1,63 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
-#
-# Use of this software is governed by the terms and conditions of the
-# NVIDIA End User License Agreement (EULA), available at:
-# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
-#
-# Any use, reproduction, disclosure, or distribution of this software
-# and related documentation outside the scope permitted by the EULA
-# is strictly prohibited.
-
-"""
-This module provides MLIR GPU Dialect helper functions
-"""
-
-from ..._mlir import ir
-from ..._mlir.dialects import gpu, arith, scf
-from ..._mlir.extras import types as _T
-
-from ..common import *
-
-# =============================================================================
-# GPU Dialect Helper functions
-# =============================================================================
-
-
-def create_async_token() -> ir.Value:
-    token_ty = gpu.AsyncTokenType.get()
-    token = gpu.wait(token_ty, [])
-    return token
-
-
-def printf(fmt: str, *args: ir.Value, threadNumber: int = -1) -> None:
-    """Generate gpu.printf OP predicated on threadNumber"""
-    type_formats = []
-    for arg in args:
-        ty_format = None
-        if ir.IndexType.isinstance(arg.type):
-            ty_format = "%llu"
-        if ir.IntegerType.isinstance(arg.type):
-            width = ir.IntegerType(arg.type).width
-            if width == 64:
-                ty_format = "%llu"
-            elif width == 32:
-                ty_format = "%d"
-            elif width == 1:
-                ty_format = "%i"
-        if ir.F32Type.isinstance(arg.type):
-            ty_format = "%f"
-        if ty_format is None:
-            raise DSLNotImplemented(arg.type)
-        type_formats.append(ty_format)
-    if threadNumber == -1:
-        gpu.printf(fmt.format(*type_formats) + "\n", args)
-    if threadNumber != -1:
-        tidx = gpu.thread_id(gpu.Dimension.x)
-        predicate = arith.cmpi(
-            arith.CmpIPredicate.eq, tidx, arith.constant(_T.index(), threadNumber)
-        )
-        if_op = scf.IfOp(predicate)
-        with ir.InsertionPoint(if_op.then_block):
-            gpu.printf(fmt.format(*type_formats) + "\n", args)
-            scf.yield_([])
diff --git a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/lru_cache_ir.py b/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/lru_cache_ir.py
deleted file mode 100644
index a7353b781..000000000
--- a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/lru_cache_ir.py
+++ /dev/null
@@ -1,76 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
-#
-# Use of this software is governed by the terms and conditions of the
-# NVIDIA End User License Agreement (EULA), available at:
-# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
-#
-# Any use, reproduction, disclosure, or distribution of this software
-# and related documentation outside the scope permitted by the EULA
-# is strictly prohibited.
-
-"""
-This module provides @lru_cache_ir
-It extends functools.lru_cache with IR Context awareness.
-
-Example usage:
-from cutlass import ir
-from lru_cache_ir import lru_cache_ir
-
-@lru_cache_ir(ir, maxsize=128, typed=False)
-def make_layout(...):
-...
-
-"""
-
-from functools import lru_cache, wraps
-from typing import Any, Callable
-
-from ..._mlir import ir
-
-
-def get_ir_context(func: Any) -> Any:
-    """
-    Return the context for given func called under ir.
-    Currently the context includes MLIRContext and InsertionPoint.
-    """
-    try:
-        if ir:
-            return (ir.Context.current, ir.InsertionPoint.current)
-        else:
-            return None
-    except ValueError:
-        return None
-
-
-def lru_cache_ir(maxsize: int = 128, typed: bool = True) -> Callable[..., Any]:
-    """
-    Applies an LRU cache to a given function, with awareness of IR context.
-
-    Usage is similar to functools.lru_cache while taking `ir` as required argument.
-
-    :param ir: The IR object from which to derive the context by `get_ir_context`
-    :param maxsize: Max cache size, same as functools.lru_cache
-    :param typed: Whether params are type-sensitive, default to True as IR is type-sensitive
-    """
-
-    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
-        # Use functools.lru_cache with a custom wrapper to control the key generation
-        @lru_cache(maxsize=maxsize, typed=typed)
-        def cached_func(context: Any, *args: Any, **kwargs: Any) -> Any:
-            return func(*args, **kwargs)
-
-        @wraps(func)
-        def wrapper(*args: Any, **kwargs: Any) -> Any:
-            try:
-                # Call the cached function with the context
-                return cached_func(get_ir_context(func), *args, **kwargs)
-            except (RuntimeError, TypeError):
-                return func(*args, **kwargs)
-
-        # Expose cache-related methods for introspection
-        wrapper.cache_clear = cached_func.cache_clear  # type: ignore[attr-defined]
-        wrapper.cache_info = cached_func.cache_info  # type: ignore[attr-defined]
-        return wrapper
-
-    return decorator
diff --git a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/op.py b/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/op.py
deleted file mode 100644
index 4e2af99c6..000000000
--- a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/op.py
+++ /dev/null
@@ -1,158 +0,0 @@
-# SPDX-FileCopyrightText: Copyright (c) 2025 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
-#
-# Use of this software is governed by the terms and conditions of the
-# NVIDIA End User License Agreement (EULA), available at:
-# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
-#
-# Any use, reproduction, disclosure, or distribution of this software
-# and related documentation outside the scope permitted by the EULA
-# is strictly prohibited.
-
-"""
-This module provides MLIR's OP helper functions
-"""
-
-import inspect
-import os
-import types
-from functools import wraps
-from typing import Any, Callable
-
-from ..._mlir import ir
-from ..common import DSLRuntimeError, DSLOperationBuildError
-from ..utils.stacktrace import walk_to_top_module
-
-# The DSL package root is empty by default.
-_DSL_PACKAGE_ROOT: str | None = ""
-
-# Whether location tracking is enabled.
-_ENABLE_FRAME_FILTERING: bool = False
-
-
-def _set_enable_frame_filtering(enable: bool) -> None:
-    """Set whether location tracking is enabled."""
-    global _ENABLE_FRAME_FILTERING
-    _ENABLE_FRAME_FILTERING = enable
-
-
-def _is_framework_frame(filename: str) -> bool:
-    """Check if a frame's filename belongs to DSL library code."""
-    global _DSL_PACKAGE_ROOT
-    if _DSL_PACKAGE_ROOT == "":
-        # Compute the DSL package root once
-        # Any frame whose file starts with this prefix is considered DSL library code.
-        _DSL_PACKAGE_ROOT = walk_to_top_module(
-            os.path.dirname(os.path.abspath(__file__))
-        )
-
-    if _DSL_PACKAGE_ROOT is None:
-        return False
-
-    return os.path.abspath(filename).startswith(_DSL_PACKAGE_ROOT)
-
-
-def _find_user_frame(start_frame: types.FrameType | None) -> types.FrameType | None:
-    """Walk up the call stack from start_frame to find the first user (non-library) frame.
-
-    Returns the first frame whose file is not under the DSL package root.
-    Falls back to start_frame if no user frame is found (e.g. all frames are library code).
-    """
-    frame = start_frame
-    while frame is not None:
-        if not _is_framework_frame(frame.f_code.co_filename):
-            return frame
-        frame = frame.f_back
-    # Fallback: if everything is framework code, use the original caller
-    return start_frame
-
-
-def dsl_user_op(opFunc: Callable[..., Any]) -> Callable[..., Any]:
-    """
-    This is a decorator that needs to be used in each user-facing API to
-    manage source location for toolchain.
-
-    :param opFunc: The user-facing API function.
-    :type opFunc: Callable
-    :return: The wrapped user-facing API function.
-    :rtype: Callable
-    """
-
-    @wraps(opFunc)
-    def wrapper(*args: Any, **kwargs: Any) -> Any:
-        # Pop loc= from kwargs so callers that still pass it don't break.
-        # We no longer forward it — LOC_TRACEBACKS captures full stacks automatically.
-        loc: Any = kwargs.pop("loc", None)
-        frameInfo = None
-        verifier_error = False
-
-        if loc is None and ir.Context.current is not None:
-            frame = _find_user_frame(inspect.currentframe().f_back)  # type: ignore[union-attr]
-            frameInfo = inspect.getframeinfo(frame)  # type: ignore[arg-type]
-            try:
-                # In Python < 3.11, getframeinfo returns a NamedTuple without positions
-                if not hasattr(frameInfo, "positions"):
-                    file_loc = ir.Location.file(
-                        frameInfo.filename,
-                        frameInfo.lineno,
-                        0,
-                    )
-                else:
-                    file_loc = ir.Location.file(
-                        frameInfo.filename,
-                        frameInfo.positions.lineno,  # type: ignore[attr-defined]
-                        frameInfo.positions.col_offset or 0,  # type: ignore[attr-defined]
-                    )
-                loc = ir.Location.name(
-                    (
-                        "".join([c.strip() for c in frameInfo.code_context])
-                        if frameInfo.code_context
-                        else frameInfo.function
-                    ),
-                    childLoc=file_loc,
-                )
-            except RuntimeError:
-                # No MLIR context available (e.g. validation-only call
-                # outside a kernel).  Proceed with loc=None so that the
-                # wrapped function's own validation can still fire.
-                pass
-
-        try:
-            res_or_list = opFunc(*args, **kwargs, loc=loc)
-            verifier_error = True
-            # Verify the operation
-            if hasattr(res_or_list, "verify"):
-                res_or_list.verify()
-
-        except DSLOperationBuildError as e:
-            # Nested DSLOperationError
-            raise DSLOperationBuildError(
-                message=e.message, cause=e, frameInfo=frameInfo
-            )
-        except Exception as e:
-            # Check if it's a decorator config error first
-            func_name = getattr(opFunc, "__name__", str(opFunc))
-            if "unexpected keyword argument 'loc'" in str(e):
-                raise DSLRuntimeError(
-                    f"Function '{func_name}' decorated with @dsl_user_op does not accept the required 'loc' parameter.",
-                    suggestion=[
-                        f"1. Add 'loc=None' as a keyword-only parameter to {func_name}:",
-                        f"  def {func_name}(..., *, loc=None):",
-                        "",
-                        "2. Remove the @dsl_user_op decorator if location tracking is not needed",
-                    ],
-                    cause=e,
-                ) from e
-            if verifier_error:
-                raise DSLOperationBuildError(
-                    message="Operation verification failed",
-                    cause=e,
-                    frameInfo=frameInfo,
-                    auto_translate=False,
-                )
-
-            raise e
-
-        return res_or_list
-
-    return wrapper
diff --git a/python/CuTeDSL/cutlass/base_dsl/ast_helpers.py b/python/CuTeDSL/cutlass/base_dsl/ast_helpers.py
index a8de32369..220278fa4 100644
--- a/python/CuTeDSL/cutlass/base_dsl/ast_helpers.py
+++ b/python/CuTeDSL/cutlass/base_dsl/ast_helpers.py
@@ -28,8 +28,7 @@ import builtins
 
 from .utils.logger import log
 from .common import *
-
-from ._mlir_helpers.arith import ArithValue
+from .env_manager import get_str_env_var
 
 
 class Executor:
@@ -226,7 +225,7 @@ def loop_selector(
         vectorize,
         at_least_once,
     )
-    from .typing import Integer, Numeric
+    from .typing import Integer
 
     def _maybe_upcast(value: Any) -> Any:
         if isinstance(value, Integer):
@@ -640,8 +639,9 @@ def closure_check(
     closures: list[Any], _visited: set[tuple[str, int]] | None = None
 ) -> None:
     """
-    Check if the closures have any unsupported capture
+    Check if the closures have any unsupported capture.
     """
+
     if _visited is None:
         _visited = set()
 
@@ -772,3 +772,5 @@ def fstring_decompose(
                 f"Unsupported component type in f-string: {type(component)}",
             )
     return (format_string, *dynamic_args)
+
+
diff --git a/python/CuTeDSL/cutlass/base_dsl/ast_preprocessor.py b/python/CuTeDSL/cutlass/base_dsl/ast_preprocessor.py
index 1f1d8a480..7ca91fda8 100644
--- a/python/CuTeDSL/cutlass/base_dsl/ast_preprocessor.py
+++ b/python/CuTeDSL/cutlass/base_dsl/ast_preprocessor.py
@@ -314,7 +314,6 @@ class SessionData:
     region_stack: list[Region] = field(default_factory=list)
     generator_targets: list[str] = field(default_factory=list)
     lambda_args: list[str] = field(default_factory=list)
-
     @contextlib.contextmanager
     def set_current_class_name(self, class_name: str) -> Generator[None, None, None]:
         old_class_name = self.class_name
@@ -451,13 +450,28 @@ class DSLPreprocessor(ast.NodeTransformer):
         self.module_cache: dict[ModuleType, list[ImportInfo | TryImportInfo]] = {}
         self._session_data: SessionData | None = None
 
+    def _create_session_data(self) -> SessionData:
+        return SessionData()
+
+    def _start_session(self) -> None:
+        """Start a new preprocessing session by initializing session data."""
+        self._session_data = self._create_session_data()
+        # Track processed functions per preprocessing run, not for the entire
+        # lifetime of this preprocessor instance. Mode switches can restore a
+        # previously used preprocessor object, and stale entries here would make
+        # later sessions skip transforming a function entirely.
+        self.processed_functions = set()
+    def _end_session(self) -> None:
+        """End the current preprocessing session and clear session data."""
+        self._session_data = None
+
     @contextlib.contextmanager
     def get_session(self) -> Generator["DSLPreprocessor", None, None]:
         try:
-            self._session_data = SessionData()
+            self._start_session()
             yield self
         finally:
-            self._session_data = None
+            self._end_session()
 
     @property
     def session_data(self) -> SessionData:
@@ -879,7 +893,6 @@ class DSLPreprocessor(ast.NodeTransformer):
                 self.early_exit_type = "raise"
 
             def visit_Break(self, node: ast.Break) -> None:
-                # For break/continue in inner loops, we don't consider it as early exit
                 if self.loop_nest_level == 0 and self.kind != "if":
                     self.has_early_exit = True
                     self.early_exit_node = node
@@ -902,7 +915,6 @@ class DSLPreprocessor(ast.NodeTransformer):
                 self.loop_nest_level -= 1
 
             def visit_FunctionDef(self, node: ast.FunctionDef) -> None:
-                # Stop at nested function def
                 return
 
         checker = EarlyExitChecker(kind)
@@ -969,11 +981,14 @@ class DSLPreprocessor(ast.NodeTransformer):
         self,
         original_function: Callable[..., Any],
         exec_globals: dict[str, Any],
-        callee_rewrite: bool = False,
     ) -> ast.Module:
         """
         Transforms the provided function using the preprocessor.
         Requires an active DSL preprocessor session.
+
+        Args:
+            original_function: The function to transform.
+            exec_globals: The globals dict for the function's module.
         """
         self.session_data.file_name = (
             inspect.getsourcefile(original_function) or "<unknown>"
@@ -1015,8 +1030,6 @@ class DSLPreprocessor(ast.NodeTransformer):
                     write_args.add(node.id)
 
             def visit_Subscript(self, node: ast.Subscript) -> None:
-                # When subscript occurs on the lhs of an assignment, the `Name` is still a load, but `Subscript` is marked as `Store`.
-                # We need to force the store for the `Name` to be marked as write.
                 if isinstance(node.ctx, ast.Store):
                     self.force_store = True
                     self.visit(node.value)
@@ -1039,15 +1052,12 @@ class DSLPreprocessor(ast.NodeTransformer):
 
             @staticmethod
             def get_call_base(func_node: ast.expr) -> str | None:
-                # If the .value is another Attribute, keep digging
                 if isinstance(func_node, ast.Attribute):
                     if isinstance(func_node.value, ast.Attribute):
                         return RegionAnalyzer.get_call_base(func_node.value)
-                    # If the .value is a Name, that's our base
                     elif isinstance(func_node.value, ast.Name):
                         return func_node.value.id
                     else:
-                        # Could be something else (lambda, call, etc.)
                         return None
                 elif isinstance(func_node, ast.Name):
                     return None
@@ -1057,7 +1067,6 @@ class DSLPreprocessor(ast.NodeTransformer):
             def get_function_name(func_node: ast.Call) -> str | None:
                 if isinstance(func_node.func, ast.Name):
                     function_name = func_node.func.id
-                # Check if it's a method or attribute call
                 elif isinstance(func_node.func, ast.Attribute):
                     function_name = func_node.func.attr
                 else:
@@ -1073,7 +1082,7 @@ class DSLPreprocessor(ast.NodeTransformer):
 
                 # Classes are mutable by default. Mark them as write. If they are
                 # dataclass(frozen=True), treat them as read in runtime.
-                if base_name is not None and base_name not in ("self"):
+                if base_name is not None and base_name not in ("self",):
                     invoked_args.add(base_name)
 
                 self.generic_visit(node)
@@ -1416,19 +1425,28 @@ class DSLPreprocessor(ast.NodeTransformer):
         )
         return ast.Expr(check_call)
 
+    def _handle_constexpr_for(self, node: ast.For) -> ast.For | list[ast.stmt]:
+        """Handle const_expr/static for loops. Override for custom behavior."""
+        self.generic_visit(node)
+        return node
+
     def visit_For(self, node: ast.For) -> ast.For | list[ast.stmt]:
         # For static for loop (for with range_constexpr or not range based for), preprocessor keeps the loop.
         range_kind, is_builtin_range, has_keyword = self._get_range_kind(node.iter)
-        if range_kind == "range_constexpr" or range_kind == None:
-            self.generic_visit(node)
+        if range_kind == "range_constexpr" or range_kind is None:
+            # Delegate to template method for extensibility
+            result = self._handle_constexpr_for(node)
             if range_kind == "range_constexpr":
+                # Insert the symbol check and rewrite range_constexpr to the builtin range
                 assert isinstance(node.iter, ast.Call)
                 check_call = self._insert_cf_symbol_check(node.iter.func)
-                # Rewrite range_constexpr to range
                 node.iter.func = ast.Name(id="range", ctx=ast.Load())
                 self._insert_range_value_check(node)
-                return [check_call, node]
-            return node
+                return [
+                    check_call,
+                    result if isinstance(result, ast.For) else result[0],
+                ]
+            return result
 
         active_symbols = self.session_data.scope_manager.get_active_symbols()
         active_callables = self.session_data.scope_manager.get_active_callables()
@@ -1577,7 +1595,7 @@ class DSLPreprocessor(ast.NodeTransformer):
             extra_exprs.append(self.generic_visit(step))  # type: ignore[arg-type]
             extra_exprs.append(self.generic_visit(offset))  # type: ignore[arg-type]
 
-        # Add this to begining of loop body
+        # Add this to the beginning of the loop body
         # for i in range(start, stop, step):
         #     i = offset - i if isNegative else i
         assert isinstance(node.target, ast.Name)
@@ -1609,7 +1627,7 @@ class DSLPreprocessor(ast.NodeTransformer):
 
     def _create_closure_check_call(
         self, called_closures: list[str], node: ast.stmt
-    ) -> ast.Expr:
+    ) -> ast.Expr | None:
         return ast.Expr(
             ast.Call(
                 func=_create_module_attribute(
@@ -1627,6 +1645,20 @@ class DSLPreprocessor(ast.NodeTransformer):
             )
         )
 
+    def _prepare_loop_induction_var(self, node: ast.For) -> None:
+        """Prepare loop induction variable before function creation.
+
+        Override for custom behavior (e.g., mark variable for special handling).
+        """
+        pass  # No preparation needed in base class
+
+    def _cleanup_loop_induction_var(self, node: ast.For) -> None:
+        """Cleanup loop induction variable after function creation.
+
+        Override for custom behavior (e.g., restore normal handling).
+        """
+        pass  # No cleanup needed in base class
+
     def transform_for_loop(
         self,
         node: ast.For,
@@ -1711,11 +1743,16 @@ class DSLPreprocessor(ast.NodeTransformer):
             exprs.append(pre_loop_expr)
 
         if called_closures:
-            exprs.append(self._create_closure_check_call(called_closures, node))
+            cc = self._create_closure_check_call(called_closures, node)
+            if cc is not None:
+                exprs.append(cc)
 
         func_name = f"loop_body_{self.session_data.counter}"
         self.session_data.counter += 1
 
+        # Template method: prepare induction variable (e.g., mark for special handling)
+        self._prepare_loop_induction_var(node)
+
         func_def = self.create_loop_function(
             func_name,
             node,
@@ -1731,6 +1768,9 @@ class DSLPreprocessor(ast.NodeTransformer):
             full_write_args_count,
         )
 
+        # Template method: cleanup induction variable handling
+        self._cleanup_loop_induction_var(node)
+
         assign = self.create_cf_call(func_name, write_args, node)
 
         # This should work fine as it modifies the AST structure
@@ -1842,6 +1882,9 @@ class DSLPreprocessor(ast.NodeTransformer):
         node.args = [self.visit(arg) for arg in node.args]
         node.keywords = [self.visit(kwarg) for kwarg in node.keywords]
 
+        # Track whether a special-case rewrite already handled this call
+        already_rewritten = False
+
         # Rewrite call to some built-in functions
         if isinstance(func, ast.Name):
             # AST rewrite only redirect call to bool to bool_cast
@@ -1890,6 +1933,7 @@ class DSLPreprocessor(ast.NodeTransformer):
                 node.args = [
                     ast.Starred(value=self.processFString(node), ctx=ast.Load())
                 ]
+                already_rewritten = True
         elif isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
             if (
                 func.attr in ("printf", "print_runtime")
@@ -1899,6 +1943,7 @@ class DSLPreprocessor(ast.NodeTransformer):
                 node.args = [
                     ast.Starred(value=self.processFString(node), ctx=ast.Load())
                 ]
+                already_rewritten = True
             else:
 
                 def create_downcast_call(arg: ast.expr) -> ast.Call:
@@ -1975,6 +2020,10 @@ class DSLPreprocessor(ast.NodeTransformer):
         self.generic_visit(node)
         return node
 
+    def visit_Subscript(self, node: ast.Subscript) -> ast.expr:
+        self.generic_visit(node)
+        return node
+
     def visit_AnnAssign(self, node: ast.AnnAssign) -> ast.AnnAssign:
         self._visit_target(node.target)
         self.generic_visit(node)
@@ -2002,6 +2051,21 @@ class DSLPreprocessor(ast.NodeTransformer):
         return node
 
     def get_dsl_decorator_index(self, decorator_list: list[ast.expr]) -> Any:
+        # Known decorator kwargs that should not prevent decorator recognition.
+        # "preprocess" controls whether AST preprocessing is enabled.
+        # "callee_rewrite" enables runtime callee wrapping via _dsl_callee_.
+        # "attributes" passes kernel function attributes (e.g. launch bounds)
+        # to the compiler — its presence shouldn't prevent the preprocessor
+        # from recognizing @cute.kernel(attributes=...) as a DSL decorator.
+        # "is_experimental" routes the function through the experimental
+        # CuTe DSL (see ``CuTeDSL.jit`` / ``CuTeDSL.kernel``).
+        _known_dsl_kwargs = {
+            "preprocess",
+            "callee_rewrite",
+            "attributes",
+            "is_experimental",
+        }
+
         for i, d in enumerate(decorator_list):
             if isinstance(d, ast.Call):
                 if isinstance(d.func, ast.Attribute):
@@ -2009,16 +2073,25 @@ class DSLPreprocessor(ast.NodeTransformer):
                         if d.keywords == []:
                             return i
 
-                        # Keep existing preprocess behavior unchanged.
-                        for keyword in d.keywords:
-                            if keyword.arg == "preprocess":
-                                try:
-                                    if isinstance(keyword.value, ast.Constant):
-                                        return keyword.value.value
-                                    else:
-                                        return ast.literal_eval(keyword.value)
-                                except:
-                                    pass
+                        # Check if all keywords are known DSL kwargs
+                        all_known = all(
+                            keyword.arg in _known_dsl_kwargs for keyword in d.keywords
+                        )
+                        if all_known:
+                            # Check if 'preprocess' is explicitly set
+                            for keyword in d.keywords:
+                                if keyword.arg == "preprocess":
+                                    try:
+                                        if isinstance(keyword.value, ast.Constant):
+                                            return keyword.value.value
+                                        else:
+                                            return ast.literal_eval(keyword.value)
+                                    except (ValueError, TypeError, SyntaxError):
+                                        # ast.literal_eval fails for non-literal
+                                        # expressions; fall through and accept.
+                                        pass
+                            # No literal 'preprocess' value, and all kwargs are known: accept
+                            return i
 
                         keyword_names = {
                             keyword.arg
@@ -2130,13 +2203,14 @@ class DSLPreprocessor(ast.NodeTransformer):
                 self.session_data.scope_manager.add_to_scope(arg.arg)
                 arg.annotation = None
 
+            # Strip return annotation
+            node.returns = None
+
             self.generic_visit(node)
 
         # Remove .jit and .kernel decorators
         node.decorator_list = self.remove_dsl_decorator(node.decorator_list)
 
-        # Remove return annotation from processed AST to avoid symbol requirement
-        node.returns = None
         return node
 
     def visit_With(self, node: ast.With) -> ast.AST:
@@ -2145,13 +2219,17 @@ class DSLPreprocessor(ast.NodeTransformer):
                 self.session_data.scope_manager.add_to_scope(item.optional_vars.id)
         return self.generic_visit(node)
 
+    def _handle_constexpr_while(self, node: ast.While) -> list[ast.stmt]:
+        """Handle const_expr while statements. Override for custom behavior."""
+        self.generic_visit(node)
+        assert isinstance(node.test, ast.Call)
+        check = self._insert_cf_symbol_check(node.test.func)
+        return [check, node]
+
     def visit_While(self, node: ast.While) -> ast.While | list[ast.stmt]:
         # Constexpr doesn't get preprocessed
         if self.is_node_constexpr(node):
-            self.generic_visit(node)
-            assert isinstance(node.test, ast.Call)
-            check = self._insert_cf_symbol_check(node.test.func)
-            return [check, node]
+            return self._handle_constexpr_while(node)
 
         active_symbols = self.session_data.scope_manager.get_active_symbols()
         active_callables = self.session_data.scope_manager.get_active_callables()
@@ -2164,7 +2242,9 @@ class DSLPreprocessor(ast.NodeTransformer):
             )
             exprs = []
             if called_closures:
-                exprs.append(self._create_closure_check_call(called_closures, node))
+                cc = self._create_closure_check_call(called_closures, node)
+                if cc is not None:
+                    exprs.append(cc)
 
             func_name = f"while_region_{self.session_data.counter}"
             self.session_data.counter += 1
@@ -2429,13 +2509,17 @@ class DSLPreprocessor(ast.NodeTransformer):
 
         return call
 
+    def _handle_constexpr_if(self, node: ast.If) -> list[ast.stmt]:
+        """Handle const_expr if statements. Override for custom behavior."""
+        self.generic_visit(node)
+        assert isinstance(node.test, ast.Call)
+        check = self._insert_cf_symbol_check(node.test.func)
+        return [check, node]
+
     def visit_If(self, node: ast.If) -> ast.If | list[ast.stmt]:
         # const_expr doesn't get preprocessed
         if self.is_node_constexpr(node):
-            self.generic_visit(node)
-            assert isinstance(node.test, ast.Call)
-            check = self._insert_cf_symbol_check(node.test.func)
-            return [check, node]
+            return self._handle_constexpr_if(node)
 
         active_symbols = self.session_data.scope_manager.get_active_symbols()
         active_callables = self.session_data.scope_manager.get_active_callables()
@@ -2448,7 +2532,9 @@ class DSLPreprocessor(ast.NodeTransformer):
             )
             exprs = []
             if called_closures:
-                exprs.append(self._create_closure_check_call(called_closures, node))
+                cc = self._create_closure_check_call(called_closures, node)
+                if cc is not None:
+                    exprs.append(cc)
 
             func_name = f"if_region_{self.session_data.counter}"
             self.session_data.counter += 1
@@ -2475,6 +2561,15 @@ class DSLPreprocessor(ast.NodeTransformer):
             keywords=[],
         )
 
+    def _handle_constexpr_elif(self, elif_node: ast.If) -> ast.stmt:
+        """Handle const_expr elif nodes. Override for custom behavior.
+
+        Returns the check statement for the const_expr call.
+        """
+        self.generic_visit(elif_node)
+        assert isinstance(elif_node.test, ast.Call)
+        return self._insert_cf_symbol_check(elif_node.test.func)
+
     def create_if_function(
         self,
         func_name: str,
@@ -2615,9 +2710,7 @@ class DSLPreprocessor(ast.NodeTransformer):
                     #         if pred:
                     # And under both cases, the `pred` can be a const_expr, so we need to handle it here.
                     if self.is_node_constexpr(elif_node):
-                        self.generic_visit(elif_node)
-                        assert isinstance(elif_node.test, ast.Call)
-                        check = self._insert_cf_symbol_check(elif_node.test.func)
+                        check = self._handle_constexpr_elif(elif_node)
                         else_block = ast.FunctionDef(
                             name=else_block_name,
                             args=func_then_else_arguments,
@@ -2708,6 +2801,19 @@ class DSLPreprocessor(ast.NodeTransformer):
             node,
         )
 
+    def _prepare_while_condition_vars(
+        self,
+        node: ast.While,
+        write_args: list[str],
+        while_before_stmts: list[ast.stmt],
+    ) -> list[ast.stmt]:
+        """Prepare write_args before while condition evaluation.
+
+        Override for custom behavior (e.g., derived classes may insert instrumentation).
+        Returns statements to insert at the beginning of while_before_block.
+        """
+        return []  # No preparation needed in base class
+
     def create_while_function(
         self,
         func_name: str,
@@ -2792,6 +2898,14 @@ class DSLPreprocessor(ast.NodeTransformer):
         with Region(self.session_data, new_value=while_before_stmts):
             test_expr = ast.copy_location(self.visit(node.test), node.test)
 
+        # Template method: prepare write_args before condition evaluation
+        # Derived classes may insert instrumentation here
+        condition_prep_stmts = self._prepare_while_condition_vars(
+            node, write_args, while_before_stmts
+        )
+        if condition_prep_stmts:
+            while_before_stmts[:0] = condition_prep_stmts
+
         while_before_return_list = ast.List(
             elts=[test_expr, yield_args_ast_name_list],
             ctx=ast.Load(),
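
The `_prepare_*` / `_cleanup_*` / `_handle_constexpr_*` methods added above are no-op hooks that a derived preprocessor can override. A minimal, self-contained sketch of that pattern follows. The class names `BasePreprocessor` and `TracingPreprocessor` and the helper `trace_loops` are hypothetical and not part of CuTe DSL, and `generic_visit` merely stands in for the real `create_loop_function` call.

```python
import ast


class BasePreprocessor(ast.NodeTransformer):
    """Toy transformer with no-op hooks around For-loop handling."""

    def _prepare_loop_induction_var(self, node: ast.For) -> None:
        pass  # base class: nothing to prepare

    def _cleanup_loop_induction_var(self, node: ast.For) -> None:
        pass  # base class: nothing to clean up

    def visit_For(self, node: ast.For) -> ast.For:
        self._prepare_loop_induction_var(node)
        self.generic_visit(node)  # placeholder for the real loop-body outlining
        self._cleanup_loop_induction_var(node)
        return node


class TracingPreprocessor(BasePreprocessor):
    """Hypothetical subclass that records the induction variables it sees."""

    def __init__(self) -> None:
        self.events: list[tuple[str, str]] = []

    def _prepare_loop_induction_var(self, node: ast.For) -> None:
        assert isinstance(node.target, ast.Name)
        self.events.append(("prepare", node.target.id))

    def _cleanup_loop_induction_var(self, node: ast.For) -> None:
        assert isinstance(node.target, ast.Name)
        self.events.append(("cleanup", node.target.id))


def trace_loops(source: str) -> list[tuple[str, str]]:
    """Run the tracing subclass over `source` and return the hook events."""
    pre = TracingPreprocessor()
    pre.visit(ast.parse(source))
    return pre.events
```

Because the hooks bracket the visit of the loop body, nested loops produce properly nested prepare/cleanup pairs, which is what a subclass would rely on when it marks a variable for special handling and later restores it.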
diff --git a/python/CuTeDSL/cutlass/base_dsl/cache_helpers.py b/python/CuTeDSL/cutlass/base_dsl/cache_helpers.py
index 7c34f7749..019e99519 100644
--- a/python/CuTeDSL/cutlass/base_dsl/cache_helpers.py
+++ b/python/CuTeDSL/cutlass/base_dsl/cache_helpers.py
@@ -78,7 +78,7 @@ def get_default_generated_ir_path(dsl_name: str = "CUTE_DSL") -> str:
         return str(p)
 
     try:
-        default_generated_ir_path = get_reusable_temp_dir("cutlass_python_cache")
+        return get_reusable_temp_dir("cutlass_python_cache")
     except Exception as e:
         fallback = str(tmp_dir / "cutlass_python_cache")
         log().warning(
@@ -86,9 +86,10 @@ def get_default_generated_ir_path(dsl_name: str = "CUTE_DSL") -> str:
         )
         return fallback
 
-    return default_generated_ir_path
-
 
+# TODO: Remove after updating imports in:
+#   cutlass_ir/compiler/test/python/dsl/cute/test_cache_helpers.py
+#   cutlass_ir/compiler/test/python/not_pytest/runtime/test_cache.py
 default_generated_ir_path = get_default_generated_ir_path()
 
 
@@ -219,7 +220,9 @@ def save_ir(
     :rtype: str
     """
     initial_name = f"{dsl_name.lower()}_{fname}.mlir"
-    save_path = normalize_path(output_dir if output_dir else tempfile.gettempdir())
+    save_path = normalize_path(
+        output_dir if output_dir else get_default_generated_ir_path(dsl_name)
+    )
     save_fname = save_path / initial_name
     # Random ID to avoid any collisions
     rnd_id = str(uuid.uuid4())
@@ -259,7 +262,7 @@ def load_cache_from_path(
     :type dsl_name: str
     :param file: The name of the file to load.
     :type file: str
-    :param path: The path to the cache directory, defaults to default_generated_ir_path
+    :param path: The path to the cache directory, defaults to get_default_generated_ir_path(dsl_name)
     :type path: str, optional
     :param bytecode_reader: The bytecode reader to use, defaults to None
     :type bytecode_reader: callable, optional
@@ -302,7 +305,7 @@ def dump_cache_to_path(
     :type jit_function: JitCompiledFunction
     :param file: The name of the file to dump.
     :type file: str
-    :param path: The path to the cache directory, defaults to default_generated_ir_path
+    :param path: The path to the cache directory, defaults to get_default_generated_ir_path(dsl_name)
     :type path: str, optional
     :param bytecode_writer: The bytecode writer to use, defaults to None
     :type bytecode_writer: callable, optional
diff --git a/python/CuTeDSL/cutlass/base_dsl/common.py b/python/CuTeDSL/cutlass/base_dsl/common.py
index 713bc7178..e3069e13b 100644
--- a/python/CuTeDSL/cutlass/base_dsl/common.py
+++ b/python/CuTeDSL/cutlass/base_dsl/common.py
@@ -39,6 +39,7 @@ def register_env_manager(env_manager: Any) -> None:
     _registered_env_manager = env_manager
 
 
+
 def _dsl_excepthook(
     exc_type: type,
     exc_value: BaseException,
@@ -55,7 +56,11 @@ def _dsl_excepthook(
         show_stacktrace = getattr(_registered_env_manager, "show_stacktrace", False)
 
     # Check if it's a DSL operation error (by name to avoid circular import issues)
-    if exc_type.__name__ in ("DSLOperationError", "DSLOperationBuildError"):
+    if exc_type.__name__ in (
+        "DSLOperationError",
+        "DSLOperationBuildError",
+        "DSLUserCodeError",
+    ):
         if show_stacktrace:
             # Show full traceback in verbose mode
             _original_excepthook(exc_type, exc_value, exc_traceback)
@@ -332,6 +337,17 @@ def _get_friendly_cuda_error_message(
     return message, debug_info, error_suggestions.get(error_name, "")
 
 
+def _get_cuda_error_name_from_code(error_code: int) -> Union[str, bytes]:
+    try:
+        # Avoid circular dependency.
+        from .runtime import cuda as cuda_helpers
+
+        cu_result = cuda_helpers.cuda.CUresult(error_code)
+        return cuda_helpers._cudaGetErrorEnum(cu_result)
+    except (ValueError, AttributeError):
+        return f"<unknown CUDA error code {error_code}>"
+
+
 class DSLCudaRuntimeError(DSLBaseError):
     """
     Raised when an error occurs during CUDA runtime code generation in the DSL.
@@ -351,6 +367,17 @@ class DSLCudaRuntimeError(DSLBaseError):
         )
 
 
+def create_cuda_runtime_error(
+    error_code: int, cause: BaseException | None = None
+) -> DSLCudaRuntimeError:
+    """Create a DSLCudaRuntimeError from a raw CUDA integer error code."""
+    error = DSLCudaRuntimeError(error_code, _get_cuda_error_name_from_code(error_code))
+    if cause is not None:
+        error.__cause__ = cause
+        error.__suppress_context__ = True
+    return error
+
+
 class DSLAstPreprocessorError(DSLBaseError):
     """
     Raised when an error occurs during AST preprocessing or visiting in the DSL.
diff --git a/python/CuTeDSL/cutlass/base_dsl/compiler.py b/python/CuTeDSL/cutlass/base_dsl/compiler.py
index 2edacfb34..8f78e9df9 100644
--- a/python/CuTeDSL/cutlass/base_dsl/compiler.py
+++ b/python/CuTeDSL/cutlass/base_dsl/compiler.py
@@ -18,6 +18,7 @@ and executes it using MLIR's ExecutionEngine.
 from typing import Any
 import collections.abc
 import os
+import re
 import sys
 import inspect
 import types
@@ -180,6 +181,7 @@ class Compiler:
         opt_level: int = 2,
         arch: str = "",
         enable_debug_info: bool = False,
+        enable_verifier: bool = False,
     ) -> Any:
         """Compiles and jits the module."""
         self.compile(
@@ -187,6 +189,7 @@ class Compiler:
             pipeline,
             arch,
             enable_debug_info=enable_debug_info,
+            enable_verifier=enable_verifier,
         )
 
         return self.jit(module, opt_level, shared_libs)
@@ -293,10 +296,41 @@ class OptLevel(CompileOption):
 
 
 
+_NVVM_DIAG_SUB_ALIASES: dict[str, str] = {
+}
+
+_SR_DISABLE_PATTERNS: dict[str, str] = {
+}
+
+
+_PERF_ENABLE_OPTIONS: dict[str, str] = {
+}
+
+
+class ExtraCompilerOpts(CompileOption):
+    """Raw MLIR pass options from CUTE_DSL_COMPILER_OPT, serialized verbatim."""
+
+    def __init__(self, val: str = "") -> None:
+        super().__init__(val)
+
+    def serialize(self) -> str:
+        return self._value
+
+
 class PtxasOptions(StringCompileOption):
     option_name = "ptx-options"
 
 
+class RDC(BooleanCompileOption):
+    """Compile as relocatable device code (``ptxas -c``).
+
+    Enabled automatically by ``DeviceTarget``. In the future, it may be
+    used directly with kernel compilation to produce linkable objects.
+    """
+
+    option_name = "rdc"
+
+
 class EnableAssertions(BooleanCompileOption):
     option_name = "enable-assertions"
 
@@ -308,10 +342,30 @@ class GenerateLineInfo(BooleanCompileOption):
 class KeepCUBIN(BooleanBasedFileDumpOption):
     option_name = "dump-cubin-path"
 
+    def __init__(self, val: bool = True) -> None:
+        super().__init__(val)
+        self.full_cubin_path: str = ""
+
 
 class KeepPTX(BooleanBasedFileDumpOption):
     option_name = "dump-ptx-path"
 
+    def __init__(self, val: bool = True) -> None:
+        super().__init__(val)
+        self.full_ptx_path: str = ""
+
+
+
+
+class FlattenLocsOutputJsonPath(StringCompileOption):
+    """Path to the FlattenLocs JSON sidecar (vloc id -> original source chain).
+
+    When set, the `flatten-locs` pass runs and writes the sidecar. Empty (the
+    default) makes the pass a no-op — the original loc chain passes through
+    to the LLVM backend unchanged.
+    """
+
+    option_name = "flatten-locs-output-json-path"
 
 
 class LinkLibraries(StringCompileOption):
@@ -349,10 +403,150 @@ class EnableTVMFFI(EmptyCompileOption):
     pass
 
 
+class DeviceTarget(BooleanCompileOption):
+    """Compile a ``@cute.jit`` function as a ``device`` function.
+
+    By default, ``cute.compile`` compiles both the host code and the GPU kernel.
+    Usage::
+
+        cute.compile[DeviceTarget](my_func, Float32, Float32)
+    """
+
+    option_name = ""
+
+    def serialize(self) -> str:
+        return ""
+
+
 class DumpDir(EmptyCompileOption):
     option_name = "dump-dir"
 
 
+# AOT host cross-compile target presets. Tag → (triple, cpu, features).
+# Keep this small. Power users go through the long form below.
+_HOST_TARGET_PRESETS: dict[str, tuple[str, str, str]] = {
+    "linux-aarch64": ("aarch64-unknown-linux-gnu", "", ""),
+}
+
+
+def _parse_host_target(spec: str) -> tuple[str, str, str]:
+    """Parse a ``--host-target`` value into ``(triple, cpu, features)``.
+
+    Accepts:
+      * Empty string → all empty (native build-host behavior).
+      * Preset tag in ``_HOST_TARGET_PRESETS``.
+      * TVM-style long form ``llvm -mtriple=<t> [-mcpu=<c>] [-mattr=<f>]``.
+    """
+    spec = (spec or "").strip()
+    if not spec:
+        return "", "", ""
+    if spec.startswith("llvm"):
+        import shlex as _shlex
+
+        tokens = _shlex.split(spec)
+        if not tokens or tokens[0] != "llvm":
+            raise ValueError(f"invalid host-target long form: {spec!r}")
+        triple, cpu, features = "", "", ""
+        for tok in tokens[1:]:
+            if tok.startswith("-mtriple="):
+                triple = tok[len("-mtriple=") :]
+            elif tok.startswith("-mcpu="):
+                cpu = tok[len("-mcpu=") :]
+            elif tok.startswith("-mattr="):
+                features = tok[len("-mattr=") :]
+            else:
+                raise ValueError(
+                    f"unknown host-target flag {tok!r}; "
+                    "supported: -mtriple=, -mcpu=, -mattr="
+                )
+        if not triple:
+            raise ValueError(
+                f"host-target long form requires -mtriple=<triple>; got: {spec!r}"
+            )
+        return triple, cpu, features
+    if spec in _HOST_TARGET_PRESETS:
+        return _HOST_TARGET_PRESETS[spec]
+    raise ValueError(
+        f"--host-target {spec!r}: not a known preset and does not start "
+        f"with 'llvm '. Known presets: {sorted(_HOST_TARGET_PRESETS)}. "
+        f"Long form: 'llvm -mtriple=<triple> [-mcpu=<cpu>] [-mattr=<features>]'."
+    )
+
+
+class HostTarget(EmptyCompileOption):
+    """Target spec for AOT host cross-compile.
+
+    Empty value (default) targets the build host via the native
+    auto-detect path. A non-empty value cross-compiles the AOT host
+    object for the requested target; cross-compile is currently
+    exercised for AArch64 only, and other ISAs hit a "target not
+    registered" error at codegen time.
+
+    Accepts either a registered preset tag or a TVM-style long form
+    (consumed by the AOT export path; not part of the MLIR pipeline
+    string).
+
+    Presets::
+
+        linux-aarch64   → aarch64-unknown-linux-gnu
+
+    Long form (explicit tuning / escape hatch)::
+
+        llvm -mtriple=<triple> [-mcpu=<cpu>] [-mattr=<features>]
+
+    Examples::
+
+        cute.compile(fn, *args,
+                     options="--gpu-arch sm_100a --host-target linux-aarch64")
+        cute.compile(fn, *args,
+                     options=(
+                         "--gpu-arch sm_100a "
+                         "--host-target 'llvm -mtriple=aarch64-unknown-linux-gnu "
+                         "-mcpu=neoverse-n1 -mattr=+lse'"
+                     ))
+    """
+
+    option_name = "host-target"
+
+    def __init__(self, val: str = "") -> None:
+        # Parse + validate eagerly so bad input fails at cute.compile()
+        # parse time rather than later at AOT export time. The parsed
+        # triple/cpu/features are cached and exposed as attributes; the
+        # raw spec is the option's ``value``.
+        self._parse_and_cache(val)
+        super().__init__(val)
+
+    def _parse_and_cache(self, val: str) -> None:
+        try:
+            self._triple, self._cpu, self._features = _parse_host_target(val)
+        except ValueError as exc:
+            raise DSLRuntimeError(str(exc)) from exc
+
+    @property
+    def value(self) -> str:
+        return self._value
+
+    @value.setter
+    def value(self, new_value: str) -> None:
+        self._parse_and_cache(new_value)
+        self._value = new_value
+
+    @property
+    def triple(self) -> str:
+        """LLVM target triple. Empty = native build host."""
+        return self._triple
+
+    @property
+    def cpu(self) -> str:
+        """LLVM CPU name. Empty = generic baseline for the triple."""
+        return self._cpu
+
+    @property
+    def features(self) -> str:
+        """Comma-separated LLVM feature list."""
+        return self._features
+
+
 
 class CompileOptions:
     """
@@ -368,16 +562,21 @@ class CompileOptions:
         self.options: dict[type[CompileOption], CompileOption] = {
             # Compilation control options
             OptLevel: OptLevel(3),
+            ExtraCompilerOpts: ExtraCompilerOpts(""),
             PtxasOptions: PtxasOptions(""),
+            RDC: RDC(False),
             # Debugging options
             EnableAssertions: EnableAssertions(False),
             GenerateLineInfo: GenerateLineInfo(False),
             KeepCUBIN: KeepCUBIN(False),
             KeepPTX: KeepPTX(False),
             GPUArch: GPUArch(""),
+            FlattenLocsOutputJsonPath: FlattenLocsOutputJsonPath(""),
             LinkLibraries: LinkLibraries(""),
             EnableTVMFFI: EnableTVMFFI(False),
+            DeviceTarget: DeviceTarget(False),
             DumpDir: DumpDir(""),
+            HostTarget: HostTarget(""),
         }
 
         if options is not None:
@@ -395,6 +594,112 @@ class CompileOptions:
         else:
             _validate_and_update_option(options)
 
+    def _apply_opt_string(self, opt_str: str) -> None:
+        """Apply a compact compiler option string in-place.
+
+        Parses the same format as ``CUTE_DSL_COMPILER_OPT`` and updates this
+        object's options accordingly.  Valid forms (comma- or space-separated,
+        ``--`` prefix optional)::
+
+            --iket                           # enable IKET (In-Kernel Event Tracing) instrumentation
+
+        :param opt_str: Compact option string to parse.
+        :raises ValueError: On malformed syntax:
+
+            - **Unclosed brace**: a ``{`` immediately after a token that was
+              not captured by the regex (e.g. ``name{``).
+            - **Empty braces**: ``name{}`` is always rejected; provide
+              sub-options or remove the braces.
+            - **Empty value**: ``name=`` (equals with no value) is rejected;
+              use the bare name to enable boolean options.
+        """
+        import re
+
+        # Alias map: short token → MLIR option name
+        _ALIAS_MAP: dict[str, str] = {
+        }
+
+        # Tokens that map directly to a single boolean pipeline flag.
+        _ENABLE_OPTIONS: dict[str, str] = {
+            "iket": "enable-iket",
+        }
+
+        opt_name_map = {cls.option_name: cls for cls in self.options if cls.option_name}
+        raw_opts: list[str] = []
+
+        # Tokenize: each token is  name  or  name{sub-opts}  or  name=val
+        for m in re.finditer(r"([\w][\w-]*)(?:\{([^}]*)\}|=([\S]+))?", opt_str):
+            name = m.group(1)
+            sub_str = m.group(2)  # braces content, or None
+            val_str = m.group(3)  # =val content, or None
+
+            # --- Malformed-syntax checks ---
+            # (1) Unclosed brace: the regex skips a lone '{' that has no '}'.
+            if m.end() < len(opt_str) and opt_str[m.end()] == "{":
+                raise ValueError(
+                    f"Unclosed '{{' after option '{name}'; "
+                    f"braces must be closed (e.g. {name}{{...}})"
+                )
+            # (2) Empty braces: name{} is ambiguous, so it is always rejected.
+            if (
+                sub_str is not None
+                and sub_str == ""
+            ):
+                raise ValueError(
+                    f"Empty braces for option '{name}'; "
+                    f"provide sub-options (e.g. {name}{{key=val}}) "
+                    f"or remove the braces"
+                )
+            # (3) Empty value: name= (equals with no RHS). The regex requires
+            #     at least one \S after '=', so 'name=' leaves '=' uncaptured.
+            if (
+                val_str is None
+                and sub_str is None
+                and m.end() < len(opt_str)
+                and opt_str[m.end()] == "="
+            ):
+                raise ValueError(
+                    f"Empty value for option '{name}='; "
+                    f"provide a value (e.g. {name}=<value>) "
+                    f"or use the bare name to enable a boolean option"
+                )
+
+            if name == "nvvm-diag":
+                pass
+            elif name in _PERF_ENABLE_OPTIONS and sub_str is None:
+                # Emit the MLIR flag for this pass enable token directly.
+                raw_opts.append(f"{_PERF_ENABLE_OPTIONS[name]}=true")
+            elif sub_str is not None:
+                pass
+            elif name in _ENABLE_OPTIONS and sub_str is None:
+                # Emit the MLIR flag for this enable token directly.
+                raw_opts.append(f"{_ENABLE_OPTIONS[name]}=true")
+            else:
+                # Form: name  or  name=val — enable/configure a named option.
+                key = _ALIAS_MAP.get(name, name)
+                val = val_str or ""
+                if key in opt_name_map:
+                    opt = self.options[opt_name_map[key]]
+                    if isinstance(opt, BooleanCompileOption):
+                        opt.value = (
+                            True
+                            if not val
+                            else val.lower() in ("1", "true", "yes", "on")
+                        )
+                    else:
+                        if not val:
+                            raise DSLRuntimeError(
+                                f"Option '{key}' requires a value (e.g. {key}=<value>)"
+                            )
+                        opt.value = val
+                else:
+                    raw_opts.append(f"{key}={'true' if not val else val}")
+
+        if raw_opts:
+            existing = self.options[ExtraCompilerOpts].value
+            combined = (existing + " " + " ".join(raw_opts)).strip()
+            self.options[ExtraCompilerOpts].value = combined
+
     def apply_envar_settings(
         self, envar: EnvironmentVarManager, function_name: str
     ) -> None:
@@ -403,6 +708,8 @@ class CompileOptions:
             self.options[KeepPTX].value = True
         if envar.keep_cubin:
             self.options[KeepCUBIN].value = True
+        if envar.compiler_opt:
+            self._apply_opt_string(envar.compiler_opt)
         if envar.enable_assertions:
             self.options[EnableAssertions].value = True
         if envar.lineinfo:
@@ -421,19 +728,25 @@ class CompileOptions:
             if self.options[DumpDir].value == ""
             else self.options[DumpDir].value
         )
-        if self.options[KeepPTX].value:
-            self.options[KeepPTX].dump_path = os.path.join(dump_dir, f"{function_name}")  # type: ignore[attr-defined, arg-type]
-            self.options[KeepPTX].full_ptx_path = os.path.join(  # type: ignore[attr-defined]
-                dump_dir,  # type: ignore[arg-type]
+        keep_ptx = self.options[KeepPTX]
+        keep_cubin = self.options[KeepCUBIN]
+        assert isinstance(keep_ptx, KeepPTX)
+        assert isinstance(keep_cubin, KeepCUBIN)
+        if keep_ptx.value:
+            assert dump_dir is not None
+            keep_ptx.dump_path = os.path.join(dump_dir, f"{function_name}")
+            keep_ptx.full_ptx_path = os.path.join(
+                dump_dir,
                 f"{function_name}.{arch}.ptx",
             )
-        if self.options[KeepCUBIN].value:
-            self.options[KeepCUBIN].dump_path = os.path.join(  # type: ignore[attr-defined]
-                dump_dir,  # type: ignore[arg-type]
+        if keep_cubin.value:
+            assert dump_dir is not None
+            keep_cubin.dump_path = os.path.join(
+                dump_dir,
                 f"{function_name}",
             )
-            self.options[KeepCUBIN].full_cubin_path = os.path.join(  # type: ignore[attr-defined]
-                dump_dir,  # type: ignore[arg-type]
+            keep_cubin.full_cubin_path = os.path.join(
+                dump_dir,
                 f"{function_name}.{arch}.cubin",
             )
     @property
@@ -444,33 +757,39 @@ class CompileOptions:
     def gpu_arch(self) -> str:
         return self.options[GPUArch].value
 
+    @property
+    def host_target(self) -> "HostTarget":
+        """Host-target option object.
+
+        ``.value`` is the raw user-facing spec (preset tag or ``llvm …``
+        long form, empty = native build host). ``.triple`` / ``.cpu`` /
+        ``.features`` are the parsed components.
+        """
+        return self.options[HostTarget]  # type: ignore[return-value]
+
     @property
     def dump_ptx_path(self) -> str | None:
-        return self.options[KeepPTX].dump_path if self.options[KeepPTX].value else None  # type: ignore[attr-defined]
+        keep_ptx = self.options[KeepPTX]
+        assert isinstance(keep_ptx, KeepPTX)
+        return keep_ptx.dump_path if keep_ptx.value else None
 
     @property
     def full_ptx_path(self) -> str | None:
-        return (
-            self.options[KeepPTX].full_ptx_path  # type: ignore[attr-defined]
-            if self.options[KeepPTX].value
-            else None
-        )
+        keep_ptx = self.options[KeepPTX]
+        assert isinstance(keep_ptx, KeepPTX)
+        return keep_ptx.full_ptx_path if keep_ptx.value else None
 
     @property
     def dump_cubin_path(self) -> str | None:
-        return (
-            self.options[KeepCUBIN].dump_path  # type: ignore[attr-defined]
-            if self.options[KeepCUBIN].value
-            else None
-        )
+        keep_cubin = self.options[KeepCUBIN]
+        assert isinstance(keep_cubin, KeepCUBIN)
+        return keep_cubin.dump_path if keep_cubin.value else None
 
     @property
     def full_cubin_path(self) -> str | None:
-        return (
-            self.options[KeepCUBIN].full_cubin_path  # type: ignore[attr-defined]
-            if self.options[KeepCUBIN].value
-            else None
-        )
+        keep_cubin = self.options[KeepCUBIN]
+        assert isinstance(keep_cubin, KeepCUBIN)
+        return keep_cubin.full_cubin_path if keep_cubin.value else None
 
     @property
     def enable_tvm_ffi(self) -> bool:
@@ -497,6 +816,80 @@ class CompileOptions:
         return flattend_options
 
 
+def _extract_compact_options(
+    options: str,
+) -> "tuple[CompileOptions | None, str]":
+    """Split *options* into compact tokens and legacy tokens.
+
+    Returns:
+        (base_compile_options or None, remaining legacy options string).
+        When the input is *pure* compact, the fully-configured CompileOptions
+        is returned and the legacy string is empty.
+    """
+    import shlex
+
+    _COMPACT_NAMES: frozenset[str] = frozenset(
+        {
+            "iket",
+        }
+        | set(_PERF_ENABLE_OPTIONS)
+    )
+
+    def _is_compact(token: str) -> bool:
+        bare = (
+            (token[2:] if token.startswith("--") else token).split(",")[0].split("{")[0]
+        )
+        return bare in _COMPACT_NAMES
+
+    stripped = options.strip() if options else ""
+    if not stripped:
+        return None, options
+
+    try:
+        all_tokens = shlex.split(stripped)
+    except ValueError as exc:
+        raise ValueError(
+            f"Failed to parse compiler options string: {exc}\n"
+            f"  options string: {stripped!r}\n"
+            "  Hint: unmatched quotes or backslashes are common causes."
+        ) from exc
+
+    _LEGACY_VALUE_KEYS: frozenset[str] = frozenset(
+        {
+            "--ptxas-options",
+            "--link-libraries",
+            "--gpu-arch",
+            "--dump-dir",
+            "--opt-level",
+            "--host-target",
+        }
+    )
+    compact_tokens: list[str] = []
+    legacy_tokens: list[str] = []
+    _prev_is_legacy_key = False
+    for t in all_tokens:
+        if _prev_is_legacy_key:
+            legacy_tokens.append(t)
+            _prev_is_legacy_key = False
+        elif _is_compact(t):
+            compact_tokens.append(t)
+        else:
+            legacy_tokens.append(t)
+            _prev_is_legacy_key = t in _LEGACY_VALUE_KEYS
+
+    if not compact_tokens:
+        return None, options
+
+    compact_str = " ".join(t[2:] if t.startswith("--") else t for t in compact_tokens)
+    if not legacy_tokens:
+        compile_options = CompileOptions()
+        compile_options._apply_opt_string(compact_str)
+        return compile_options, ""
+
+    base = CompileOptions()
+    base._apply_opt_string(compact_str)
+    return base, " ".join(legacy_tokens)
+
 
 # This is a temp function to preserve backward compatibility.
 # To be removed in the future.
@@ -505,6 +898,10 @@ def _parse_compile_options_from_str(options: str) -> CompileOptions:
     import shlex as _shlex
 
     _base_compile_options: "CompileOptions | None" = None
+    _base_compile_options, options = _extract_compact_options(options)
+    if isinstance(_base_compile_options, CompileOptions) and not options:
+        return _base_compile_options
+
     def _get_compile_option_from_str(option_str: str) -> type[CompileOption]:
         mapping: dict[str, type[CompileOption]] = {
             "opt_level": OptLevel,
@@ -517,11 +914,11 @@ def _parse_compile_options_from_str(options: str) -> CompileOptions:
             "gpu_arch": GPUArch,
             "enable_tvm_ffi": EnableTVMFFI,
             "dump_dir": DumpDir,
+            "host_target": HostTarget,
         }
         return mapping[option_str]
 
     import argparse
-    import shlex
 
     parser = argparse.ArgumentParser()
     parser.add_argument("--opt-level", nargs="?", type=int, default=3)
@@ -534,6 +931,7 @@ def _parse_compile_options_from_str(options: str) -> CompileOptions:
     parser.add_argument("--gpu-arch", type=str, default="")
     parser.add_argument("--enable-tvm-ffi", action="store_true", default=False)
     parser.add_argument("--dump-dir", type=str, default="")
+    parser.add_argument("--host-target", type=str, default="")
     compile_options = (
         _base_compile_options if _base_compile_options is not None else CompileOptions()
     )
@@ -558,6 +956,36 @@ def _parse_compile_options_from_str(options: str) -> CompileOptions:
 
 
 class CompileCallable:
+    """Compile a ``@cute.jit`` callable into an executable host wrapper.
+
+    The public ``cute.compile(...)`` entrypoint is an instance of this
+    class. Call it with a ``@cute.jit`` function plus representative
+    arguments describing the runtime signature:
+
+    - fake tensors from
+      :func:`cutlass.cute.runtime.make_fake_tensor` or
+      :func:`cutlass.cute.runtime.make_fake_compact_tensor` for tensor
+      arguments
+    - plain scalars or :class:`cutlass.cute.typing.SymInt`-typed values
+      for scalar parameters
+    - :func:`cutlass.cute.runtime.make_fake_stream` when the host wrapper
+      takes a stream
+
+    The returned object is callable with real runtime arguments matching
+    the fake signature used at compile time.
+
+    Compile-time constants should be baked into the ``@cute.jit``
+    function or exposed as default-valued parameters on the user's own
+    ``compile()`` wrapper; runtime-varying symbolic quantities should be
+    modeled with :class:`cutlass.cute.typing.SymInt` in fake tensor
+    shapes / strides or host-function scalar arguments.
+
+    ``cute.compile(..., options="...")`` accepts the same token string as
+    ``CUTE_DSL_COMPILER_OPT``. This docstring covers only the compile
+    contract; ``write-kernel/references/compiler-options.md`` is the
+    authoritative catalog of option tokens and examples.
+    """
+
     def __init__(self, options: Any = None) -> None:
         def preprocess_options(option: Any) -> Any:
             if type(option) is type and issubclass(
@@ -579,21 +1007,73 @@ class CompileCallable:
         return new_callable_with_options
 
     def __call__(self, *args: Any, **kwargs: Any) -> Any:
+        """Compile ``func`` for the signature described by ``args``.
+
+        :param args: ``func`` followed by representative compile-time
+            arguments. Tensor arguments are typically fake tensors;
+            scalar arguments may be Python literals or SymInt-backed
+            symbolic values.
+        :param kwargs: Optional compile controls such as ``options=...``.
+            See ``write-kernel/references/compiler-options.md`` for option tokens.
+        :return: A compiled callable that accepts real runtime arguments
+            matching the supplied signature.
+        """
         return self._compile(*args, **kwargs)
 
+    def to_precompiled_mlir(self, func: Any, *args: Any, **kwargs: Any) -> Any:
+        """Return a PreCompiledMlirArtifact containing the pre-pass MLIR module."""
+        kwargs["compile_to_precompiled_mlir"] = True
+        return self._compile(func, *args, **kwargs)
+
+    def compile_to(self, target: Any, func: Any, *args: Any, **kwargs: Any) -> Any:
+        """Compile a @cute.jit function to the given artifact stage.
+
+        Args:
+            target: Artifact stage; only ``ArtifactType.PreCompiledMlir`` is supported.
+        """
+        from .._mlir._mlir_libs import _cutlass_ir
+
+        if target != _cutlass_ir.ArtifactType.PreCompiledMlir:
+            raise NotImplementedError(f"compile_to({target}) is not yet supported")
+        return self.to_precompiled_mlir(func, *args, **kwargs)
+
     def _compile(self, func: Any, *args: Any, **kwargs: Any) -> Any:
         """
-        This function is used to compile a `cute.jit` decorated function.
-        It will process the compile options and input parameters, do explicit compilation and return  the jit executor.
+        Compile a ``@cute.jit`` function and return its executable wrapper.
 
-        :param func: The function to compile. It can be a regular function, a method or a class instance.
-        :param args: The arguments to pass to the function.
-        :param kwargs: The keyword arguments to pass to the function. It can contain `options` like
-        `opt_level` to control the compilation flags.
+        ``func`` may be a regular function, bound method, or callable
+        instance, but it must ultimately resolve to a ``@cute.jit``
+        definition. ``args`` describe the runtime signature seen by the
+        compiled wrapper; for tensor arguments, prefer fake tensors over
+        ad-hoc real tensors so shape / stride / SymInt constraints remain
+        explicit and reproducible.
 
-        :return: The jit executor.
+        ``kwargs`` may contain ``is_experimental: bool`` to assert that
+        the function was decorated through the experimental DSL path
+        (``@cute.jit(is_experimental=True)`` /
+        ``@cute.kernel(is_experimental=True)``). This kwarg is consumed
+        at the call site rather than forwarded into the executor: the
+        actual DSL routing for the compile is already determined by the
+        function's decorator (via ``func._dsl_object``). The kwarg
+        exists as a migration aid away from
+        ``cute.experimental.compile``: when set to True the call
+        validates that the function is indeed routed through an
+        experimental DSL (``BaseDSL._is_experimental_dsl is True``) and
+        raises if not, so that mixing experimental host launchers with
+        non-experimental compile entry points (or vice versa) fails
+        loudly at the call site instead of producing a preprocessor
+        free-vars mismatch deep inside ``_preprocess_and_replace_code``.
 
-        :raises: DSLRuntimeError if the function is not decorated with `cute.jit` or is not callable.
+        :param func: The ``@cute.jit`` callable to compile.
+        :param args: Representative compile-time arguments describing the
+            callable's runtime signature.
+        :param kwargs: Optional compile controls. ``options=...`` accepts
+            the same token string as ``CUTE_DSL_COMPILER_OPT``. For the
+            full option catalog, see
+            ``write-kernel/references/compiler-options.md``.
+        :return: A compiled callable.
+        :raises DSLRuntimeError: If ``func`` is not callable or not
+            decorated with ``@cute.jit``.
         """
         if func is None:
             raise DSLRuntimeError("Function is not set or invalid.")
@@ -601,6 +1081,10 @@ class CompileCallable:
         if not callable(func):
             raise DSLRuntimeError("Object is not callable.")
 
+        # Pop the migration-aid kwarg before it leaks into the executor
+        # call: the rest of the pipeline does not know about it.
+        is_experimental_requested = kwargs.pop("is_experimental", False)
+
         kwargs["compile_only"] = True
         kwargs["no_cache"] = True
 
@@ -643,6 +1127,28 @@ class CompileCallable:
                 f"Function {func} is not decorated with jit decorator."
             )
 
+        # Validate the migration-aid ``is_experimental`` kwarg against
+        # the routing already baked into the function by its jit/kernel
+        # decorator. This is *not* a behavior switch: both
+        # ``cute.compile`` and ``cute.experimental.compile`` are
+        # functionally identical ``CompileCallable`` instances; the
+        # actual experimental dispatch is driven by ``func._dsl_object``
+        # (set by ``@cute.jit(is_experimental=True)``). Validating here
+        # turns a silent mismatch (which would only manifest later as a
+        # "code object with N free vars" preprocessor error) into a
+        # clear call-site diagnostic.
+        if is_experimental_requested and not getattr(
+            func._dsl_object, "_is_experimental_dsl", False
+        ):
+            raise DSLRuntimeError(
+                "cute.compile(is_experimental=True) was called on a function "
+                f"routed through {type(func._dsl_object).__name__}, which is "
+                "not an experimental DSL. Decorate the function with "
+                "@cute.jit(is_experimental=True) (and any nested @cute.kernel "
+                "with is_experimental=True) before compiling, or drop the "
+                "is_experimental flag from the cute.compile call."
+            )
+
         # process compile options, extract the options and remove them from the kwargs
         options = kwargs.pop("options", None)
         if isinstance(options, str) and len(options) == 0:
@@ -657,5 +1163,14 @@ class CompileCallable:
         # Preprocess the function if not already preprocessed
         func._dsl_object._preprocess_and_replace_code(func)
 
-        # Run the function
+        # Route on the DeviceTarget option: when set, compile as a __device__ function.
+        if compile_options.options[DeviceTarget].value:
+            # Device functions are always relocatable objects.
+            compile_options.options[RDC].value = True
+            # Force artifact dumping so .o and .ptx are available after compilation.
+            compile_options.options[KeepPTX].value = True
+            compile_options.options[KeepCUBIN].value = True
+            return func._dsl_object._device_func(func, *args, **kwargs)
+
+        # Default: host wrapper + kernel
         return func._dsl_object._func(func, *args, **kwargs)
diff --git a/python/CuTeDSL/cutlass/base_dsl/dsl.py b/python/CuTeDSL/cutlass/base_dsl/dsl.py
index c136b595a..657f8f695 100644
--- a/python/CuTeDSL/cutlass/base_dsl/dsl.py
+++ b/python/CuTeDSL/cutlass/base_dsl/dsl.py
@@ -17,6 +17,7 @@ for example, it can handle various dialect-specific tasks.
 """
 
 # Standard library imports
+import dataclasses
 from dataclasses import dataclass, field
 import atexit
 import os
@@ -27,12 +28,22 @@ import re
 import inspect
 import argparse
 import hashlib
+from contextlib import contextmanager
 from functools import lru_cache, wraps
 from collections import namedtuple, OrderedDict
 from abc import ABC, abstractmethod
-from typing import Annotated, Any, ClassVar, TYPE_CHECKING, get_args, get_origin
+from typing import (
+    Annotated,
+    Any,
+    ClassVar,
+    Generator,
+    TYPE_CHECKING,
+    Union,
+    get_origin,
+    get_args,
+)
 from collections.abc import Callable
-from types import SimpleNamespace
+from types import SimpleNamespace, UnionType
 
 if TYPE_CHECKING:
     import hashlib
@@ -55,6 +66,7 @@ from .jit_executor import JitCompiledFunction, JitFunctionArtifacts
 from .utils.timer import timer
 from .utils.logger import log
 from .utils.stacktrace import filter_exception, walk_to_top_module, filter_stackframe
+from .utils.tree_utils import is_namedtuple_instance
 from .runtime.jit_arg_adapters import (
     is_argument_constexpr,
     is_arg_annotation_constexpr,
@@ -64,13 +76,14 @@ from .runtime.jit_arg_adapters import (
 from .ast_preprocessor import DSLPreprocessor
 from .common import *
 from .typing import (
+    Constexpr,
     get_c_pointers,
     get_mlir_types,
     Integer,
     implements_dynamic_expression,
     implements_jit_argument,
 )
-from ._mlir_helpers.op import _set_enable_frame_filtering
+from .._mlir_helpers.op import _set_enable_frame_filtering
 
 # =============================================================================
 # MLIR modules
@@ -421,6 +434,36 @@ def extract_mlir_attributes(obj: object) -> list[Any]:
         res = []
         for k, v in obj.__dict__.items():
             res.extend(extract_mlir_attributes(v))
+    elif dataclasses.is_dataclass(obj) and not isinstance(obj, type):
+        # Recurse into dataclass fields so per-field arg attrs (e.g.
+        # `cute_nvgpu.grid_constant` carried by a TMA atom) survive when the
+        # field is wrapped in a dataclass that customises
+        # `__extract_mlir_values__` but not `__extract_mlir_attributes__`.
+        # Without this the fallback below returns empty DictAttrs and the
+        # downstream `cute_nvgpu.atom.make_exec_tma` lowering can't trace
+        # back to the byval load, failing legalization.
+        res = []
+        for f in dataclasses.fields(obj):
+            v = getattr(obj, f.name)
+            # Skip static-value fields that don't contribute kernel args:
+            # - None (optional/unset)
+            # - class objects (e.g. a `dtype = Float32` field whose value is a
+            #   Numeric subclass; `isinstance(v, type)` catches classes with any
+            #   metaclass, including cutlass `NumericMeta`)
+            # - exact-type primitives (int/float/bool/str); use `type(v) in (...)`
+            #   so that subclass instances carrying their own DSL hooks (e.g.
+            #   `numpy.float64`) still get recursed into
+            if (
+                v is None
+                or isinstance(v, type)
+                or type(v) in (int, float, bool, str)
+            ):
+                continue
+            ftype = f.type
+            origin = get_origin(ftype) if not isinstance(ftype, str) else None
+            if ftype is Constexpr or origin is Constexpr:
+                continue
+            res.extend(extract_mlir_attributes(v))
     # Can't call is_dynamic_expression as _is_dynamic_expression depends on extract_mlir_values
     elif isinstance(obj, set):
         raise DSLRuntimeError(
@@ -473,10 +516,9 @@ def new_from_mlir_values(obj: Any, values: Any, *, structured: bool = False) ->
             res = [
                 new_from_mlir_values(x, v, structured=True) for x, v in zip(obj, values)
             ]
-            obj_ty = type(obj)
-            if hasattr(obj_ty, '_make'):
-                return obj_ty._make(res)
-            return obj_ty(res)
+            if is_namedtuple_instance(obj):
+                return type(obj)(*res)
+            return type(obj)(res)
         elif isinstance(obj, SimpleNamespace):
             ns = SimpleNamespace()
             for k, v in obj.__dict__.items():
@@ -497,8 +539,8 @@ def new_from_mlir_values(obj: Any, values: Any, *, structured: bool = False) ->
                 res.append(new_from_mlir_values(x, values[:n_items]))
                 values = values[n_items:]
             obj_ty = type(obj)
-            if hasattr(obj_ty, '_make'):
-                return obj_ty._make(res)
+            if is_namedtuple_instance(obj):
+                return obj_ty(*res)
             return obj_ty(res)
         elif isinstance(obj, SimpleNamespace):
             ns = SimpleNamespace()
@@ -586,6 +628,7 @@ class DSLLocation:
         lineno (int): Line number in the source file.
         col_offset (int): Column offset in the source line.
         function_name (str): Name of the function in which the location occurs.
+        caller_locs (tuple): Optional (filename, lineno) pairs describing the call-site chain.
 
     This is used primarily to annotate or trace locations in generated MLIR IR
     back to the original Python code for better diagnostic and debugging.
@@ -595,11 +638,13 @@ class DSLLocation:
     lineno: int
     col_offset: int
     function_name: str
+    caller_locs: tuple = ()
 
 
 class BaseDSL(metaclass=DSLSingletonMeta):
     gpu_module: Any = None
     _env_class: type[EnvironmentVarManager] = EnvironmentVarManager
+    _is_experimental_dsl: bool = False
 
     def __init__(
         self,
@@ -672,7 +717,9 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         self.compile_options: CompileOptions = CompileOptions()
 
         if preprocess:
-            self.preprocessor: DSLPreprocessor = DSLPreprocessor(dsl_package_name)
+            preprocessor: DSLPreprocessor = DSLPreprocessor(dsl_package_name)
+            self.package_name = dsl_package_name
+            self.preprocessor: DSLPreprocessor = preprocessor
 
         log().info(f"Initializing {name} DSL")
         log().debug(f"Logger initialized for {self.name}")
@@ -746,19 +793,18 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         """
         # Ensure the DSL instance is materialized before touching _dsl_object
         BaseDSL._lazy_initialize_dsl(func)
-
         # Update the decorator location to the new function
         func._dsl_object.decorator_location = func._decorator_location
 
         if getattr(func, "_preprocessed", False) is True:
-            # already preprocessed, skip
-            return
-
+            return
         if not func._dsl_object.enable_preprocessor:
             func._preprocessed = True
             return
 
-        fcn_ptr = func._dsl_object.run_preprocessor(func)
+        fcn_ptr = func._dsl_object.run_preprocessor(
+            func,
+        )
         if fcn_ptr:
             func.__code__ = (
                 fcn_ptr.__code__
@@ -781,11 +827,10 @@ class BaseDSL(metaclass=DSLSingletonMeta):
 
         def jit_runner_decorator(func: Any) -> Any:
             # Run preprocessor that alters AST
+            preprocess_enabled = BaseDSL._can_preprocess(**dkwargs)
             func._dsl_cls = cls
             func._decorator_location = BaseDSL.get_location_from_frame(frame)
-            if not hasattr(func, "_preprocessed") and not BaseDSL._can_preprocess(
-                **dkwargs
-            ):
+            if not hasattr(func, "_preprocessed") and not preprocess_enabled:
                 func._preprocessed = True
 
             @wraps(func)
@@ -819,7 +864,9 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         """
         Decorator to mark a function for JIT compilation for Host code.
         """
-        frame = inspect.currentframe().f_back  # type: ignore[union-attr]
+        cur_frame = inspect.currentframe()
+        assert cur_frame is not None
+        frame = cur_frame.f_back
         return BaseDSL.jit_runner(cls, "_func", frame, *dargs, **dkwargs)
 
     @classmethod
@@ -827,7 +874,9 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         """
         Decorator to mark a function for JIT compilation for GPU.
         """
-        frame = inspect.currentframe().f_back  # type: ignore[union-attr]
+        cur_frame = inspect.currentframe()
+        assert cur_frame is not None
+        frame = cur_frame.f_back
         return BaseDSL.jit_runner(cls, "_kernel_helper", frame, *dargs, **dkwargs)
 
     @abstractmethod
@@ -837,6 +886,13 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         """
         pass
 
+    @abstractmethod
+    def _enter_gpu_module(self) -> ir.InsertionPoint:
+        """
+        Return an InsertionPoint into the GPU module body. Implemented by subclasses.
+        """
+        ...
+
     @abstractmethod
     def _build_gpu_module(self, attrs: dict[str, Any], loc: Any = None) -> None:
         """
@@ -955,10 +1011,6 @@ class BaseDSL(metaclass=DSLSingletonMeta):
             arg_spec = parameter.annotation
             log().debug("Processing [%d] Argument [%s : %s]", idx, arg_name, arg_spec)
 
-            # Implicit cast to NumericMeta
-            if isinstance(arg_spec, t.NumericMeta) and not isinstance(arg, arg_spec):
-                arg = t.cast(arg, arg_spec)  # type: ignore[arg-type]
-
             ir_arg, iv_block_args = self._generate_execution_arguments_for_known_types(
                 arg, arg_spec, arg_name, idx, fop_args, iv_block_args
             )
@@ -969,10 +1021,23 @@ class BaseDSL(metaclass=DSLSingletonMeta):
                 adapter = JitArgAdapterRegistry.get_registered_adapter(arg)
                 arg = adapter(arg) if adapter else arg
 
-                n_args = len(get_mlir_types(arg))
-                blk_args = fop_args[iv_block_args : iv_block_args + n_args]
-                ir_arg = new_from_mlir_values(arg, blk_args)
-                iv_block_args += n_args
+                if isinstance(arg_spec, t.NumericMeta) and not isinstance(
+                    arg, arg_spec
+                ):
+                    # Non-constexpr Numeric type coercion: the function's block arg
+                    # already has the target MLIR type (set by
+                    # generate_kernel_operands_and_types). Wrap it directly with the
+                    # spec type instead of casting the caller's value. This avoids
+                    # emitting arith.trunci/extsi ops that reference SSA values from an
+                    # outer region, which would violate IsolatedFromAbove on kernels.
+                    blk_args = fop_args[iv_block_args : iv_block_args + 1]
+                    ir_arg = arg_spec(blk_args[0])
+                    iv_block_args += 1
+                else:
+                    n_args = len(get_mlir_types(arg))
+                    blk_args = fop_args[iv_block_args : iv_block_args + n_args]
+                    ir_arg = new_from_mlir_values(arg, blk_args)
+                    iv_block_args += n_args
             else:
                 ir_arg = ir_arg[0]
 
@@ -993,8 +1058,6 @@ class BaseDSL(metaclass=DSLSingletonMeta):
             )
             ir_kwargs[name] = ir_arg
 
-        log().debug("execution args: %s", ", ".join(map(str, ir_args)))
-        log().debug("execution kwargs: %s", ", ".join(map(str, ir_kwargs)))
         return ir_args, ir_kwargs
 
     @abstractmethod
@@ -1125,15 +1188,20 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         for i, (arg_name, arg) in enumerate(zip(input_arg_names, input_args)):
             spec_ty = sig.parameters[arg_name].annotation
 
-            # Unwrap Annotated[T, marker1, ...] → base type T + markers.
-            annotation_markers = ()
-            if (
-                spec_ty is not inspect.Parameter.empty
-                and get_origin(spec_ty) is Annotated
-            ):
-                type_args = get_args(spec_ty)
-                spec_ty = type_args[0]
-                annotation_markers = type_args[1:]
+            # Retrieve markers from the annotated type that matches the arg
+            candidate_sub_types = (
+                get_args(spec_ty)
+                if get_origin(spec_ty) is Union or isinstance(spec_ty, UnionType)
+                else (spec_ty,)
+            )
+            annotation_markers = []
+            for sub_ty in candidate_sub_types:
+                ty, *markers = (
+                    get_args(sub_ty) if get_origin(sub_ty) is Annotated else (sub_ty,)
+                )
+                if markers and isinstance(ty, type) and isinstance(arg, ty):
+                    annotation_markers = markers
+                    break
 
             log().debug("Processing [%d] Argument [%s : %s]", i, arg_name, spec_ty)
 
@@ -1159,6 +1227,7 @@ class BaseDSL(metaclass=DSLSingletonMeta):
             )
 
             if jit_arg_type is not None and len(jit_arg_type) == 0:
+                assert jit_exec_arg is not None and jit_arg_attr is not None
                 # If not any known type, try JIT argument adapter
                 # to convert the argument
                 adapter = JitArgAdapterRegistry.get_registered_adapter(arg)
@@ -1168,16 +1237,16 @@ class BaseDSL(metaclass=DSLSingletonMeta):
 
                 if is_host:
                     if self.envar.enable_tvm_ffi:
-                        jit_exec_arg.extend([arg])  # type: ignore[union-attr]
+                        jit_exec_arg.extend([arg])
                     else:
-                        jit_exec_arg.extend(get_c_pointers(arg))  # type: ignore[union-attr]
+                        jit_exec_arg.extend(get_c_pointers(arg))
                     jit_arg_type.extend(get_mlir_types(arg))
-                    jit_arg_attr.extend([default_attr] * len(get_mlir_types(arg)))  # type: ignore[union-attr]
+                    jit_arg_attr.extend([default_attr] * len(get_mlir_types(arg)))
                 else:
                     dyn_vals = extract_mlir_values(arg)
-                    jit_exec_arg.extend(dyn_vals)  # type: ignore[union-attr]
+                    jit_exec_arg.extend(dyn_vals)
                     jit_arg_type.extend([v.type for v in dyn_vals])
-                    jit_arg_attr.extend(extract_mlir_attributes(arg))  # type: ignore[union-attr]
+                    jit_arg_attr.extend(extract_mlir_attributes(arg))
 
                 if not jit_arg_type or not jit_exec_arg:
                     # when it is compile only, we don't have to prepare the executable arguments.
@@ -1201,9 +1270,25 @@ class BaseDSL(metaclass=DSLSingletonMeta):
                         )
 
             if jit_arg_type is not None:
-                jit_exec_args.extend(jit_exec_arg)  # type: ignore[arg-type]
+                assert jit_exec_arg is not None and jit_arg_attr is not None
+                # Merge attributes from annotated markers (e.g. grid_constant)
+                # into every element of jit_arg_attr for this argument.
+                if annotation_markers and jit_arg_attr:
+                    extra = {
+                        na.name: na.attr
+                        for marker in annotation_markers
+                        for attr_dict in extract_mlir_attributes(marker)
+                        for na in attr_dict
+                    }
+                    if extra:
+                        jit_arg_attr = [
+                            ir.DictAttr.get({na.name: na.attr for na in d} | extra)
+                            for d in jit_arg_attr
+                        ]
+
+                jit_exec_args.extend(jit_exec_arg)
                 jit_arg_types.extend(jit_arg_type)
-                jit_arg_attrs.extend(jit_arg_attr)  # type: ignore[arg-type]
+                jit_arg_attrs.extend(jit_arg_attr)
 
         return jit_exec_args, jit_arg_types, jit_arg_attrs, jit_adapted_args
 
@@ -1250,8 +1335,15 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         has_fallback_cluster: bool = False
         min_blocks_per_mp: int = 0
         use_pdl: bool = False
-        auto_smem: bool = False
         cooperative: bool = False
+        launch_completion_event: Any | None = None
+        launch_completion_event_flags: int | None = None
+        programmatic_event: Any | None = None
+        programmatic_event_flags: int | None = None
+        programmatic_event_trigger_at_block_start: int | None = None
+
+        smem_merge_branch_allocs: bool = False
+        preferred_smem_carveout: int | None = None
 
         @staticmethod
         def _check_and_canonicalize_dim(dim: Any, name: str) -> list[Any]:
@@ -1275,10 +1367,6 @@ class BaseDSL(metaclass=DSLSingletonMeta):
             self.grid = self._check_and_canonicalize_dim(self.grid, "grid")
             self.block = self._check_and_canonicalize_dim(self.block, "block")
 
-            if self.smem is None:
-                self.smem = 0
-                self.auto_smem = True
-
             self.has_cluster = self.cluster is not None
             if self.cluster is None:
                 self.cluster = [None, None, None]
@@ -1333,10 +1421,11 @@ class BaseDSL(metaclass=DSLSingletonMeta):
     @staticmethod
     def get_location_from_frame(frame: Any) -> DSLLocation:
         return DSLLocation(
-            filename=inspect.getsourcefile(frame),  # type: ignore[arg-type]
+            filename=inspect.getsourcefile(frame) or "<unknown>",
             lineno=frame.f_lineno,
             col_offset=0,
             function_name=frame.f_code.co_name,
+            caller_locs=(),
         )
 
     def get_ir_location(self, location: DSLLocation | None = None) -> Any:
@@ -1359,6 +1448,13 @@ class BaseDSL(metaclass=DSLSingletonMeta):
             (location.function_name),
             childLoc=file_loc,
         )
+
+        if location.caller_locs:
+            caller_ir_locs = [
+                ir.Location.file(fn, ln, 0) for fn, ln in location.caller_locs
+            ]
+            loc = ir.Location.callsite(loc, caller_ir_locs)
+
         return loc
 
     def compile_and_jit(
@@ -1488,7 +1584,7 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         Build the MLIR module, verify and return the module
         """
 
-        # Save IR in a file (raw, before any passes) — triggered by KEEP=ir-debug
+        # Save IR in a file (raw, before any passes) -- triggered by KEEP=ir-debug
         if self.envar.keep_ir:
             self.dump_mlir_path = save_ir(
                 self.name,
@@ -1498,12 +1594,13 @@ class BaseDSL(metaclass=DSLSingletonMeta):
                 enable_debug_info=self.envar.lineinfo,
             )
 
-        # Save clean IR (after canonicalize+cse) — triggered by KEEP=ir
+        # Save clean IR (after canonicalize+cse) -- triggered by KEEP=ir
         # Clone before compiling so the original module is not mutated.
         if self.envar.keep_ir_clean:
             module_clone = ir.Module.parse(str(module))
             self.compiler_provider.compile(
-                module_clone, "builtin.module(canonicalize,cse)"
+                module_clone,
+                "builtin.module(canonicalize,cse)",
             )
             self.dump_mlir_path = save_ir(
                 self.name,
@@ -1581,14 +1678,29 @@ class BaseDSL(metaclass=DSLSingletonMeta):
                         )
                         func.ReturnOp(default_ret_values, loc=loc)
                     except NameError as name_error:
-                        raise DSLRuntimeError(
-                            f"💥💥💥 Error during runtime code generation for function `{funcBody.__name__}` 💥💥💥",
+                        # Walk to the innermost traceback frame to recover the
+                        # source file and line where the NameError was raised.
+                        tb = name_error.__traceback__
+                        err_filename = None
+                        err_lineno = None
+                        while tb is not None:
+                            err_filename = tb.tb_frame.f_code.co_filename
+                            err_lineno = tb.tb_lineno
+                            tb = tb.tb_next
+                        raise DSLUserCodeError(
+                            f"NameError in `{funcBody.__name__}`: {name_error}",
+                            filename=err_filename,
+                            lineno=err_lineno,
                             cause=name_error,
-                            suggestion="Using variables defined in dynamic control flow is not supported. Please give an initial value before control flow.",
-                        )
-                    except DSLRuntimeError as dsl_error:
-                        # Throw it's already a DSL error
-                        raise dsl_error
+                            suggestion=(
+                                "Variables used inside staged control flow "
+                                "(for/if/while) must be defined before the "
+                                "control flow region. Give the variable an "
+                                "initial value before the loop or branch."
+                            ),
+                        ) from name_error
+                    except (DSLRuntimeError, DSLUserCodeError):
+                        raise
 
                 if self._should_remove_empty_gpu_modules():
                     BaseDSL.__remove_empty_gpu_modules(module)
@@ -1681,7 +1793,10 @@ class BaseDSL(metaclass=DSLSingletonMeta):
                         pipeline,
                     )
                 else:
-                    self.compiler_provider.compile(module, pipeline)
+                    self.compiler_provider.compile(
+                        module,
+                        pipeline,
+                    )
                 engine = None
         else:
             log().info(
@@ -1727,6 +1842,7 @@ class BaseDSL(metaclass=DSLSingletonMeta):
             # set dynamic arguments if the jit_function is a JitCompiledFunction for AOT generation.
             dynamic_args=dynamic_args,
             dynamic_kwargs=dynamic_kwargs,
+            host_target=self.compile_options.host_target,
         )
 
         if not no_cache:
@@ -1804,6 +1920,7 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         no_jit_engine: bool,
         compile_only: bool,
         location: DSLLocation | None = None,
+        compile_to_precompiled_mlir: bool = False,
     ) -> Any:
         """Generate MLIR module and compile iself.T_provider."""
         with ir.Context() as ctx, self.get_ir_location(location):
@@ -1816,10 +1933,9 @@ class BaseDSL(metaclass=DSLSingletonMeta):
             # Default OFF — deep tracebacks + LINEINFO causes segfault.
             _loc_tb_depth = self.envar.loc_tracebacks
             _loc_tb_ctx = None
-            if _loc_tb_depth:
+            if _loc_tb_depth > 0:
                 try:
-                    _depth = int(_loc_tb_depth)
-                    _loc_tb_ctx = ir.loc_tracebacks(max_depth=_depth)
+                    _loc_tb_ctx = ir.loc_tracebacks(max_depth=_loc_tb_depth)
                     _loc_tb_ctx.__enter__()
                 except (ValueError, TypeError, AttributeError):
                     pass
@@ -1856,13 +1972,10 @@ class BaseDSL(metaclass=DSLSingletonMeta):
                         link_libraries = self.compile_options.options[
                             LinkLibraries
                         ].value
-                        try:
-                            link_libraries_attributes = gpu_module.attributes[
-                                "link-libraries"
-                            ]
-                        except KeyError:
-                            link_libraries_attributes = set()
-                        sources = set(x.value for x in link_libraries_attributes)
+                        sources = set(
+                            x.value
+                            for x in gpu_module.attributes.get("link-libraries", set())
+                        )
                         link_libraries = (
                             link_libraries
                             + ("," if link_libraries and len(sources) > 0 else "")
@@ -1872,6 +1985,17 @@ class BaseDSL(metaclass=DSLSingletonMeta):
                             link_libraries
                         )
 
+                if compile_to_precompiled_mlir:
+                    import io
+
+                    from .._mlir._mlir_libs import _cutlass_ir
+
+                    buf = io.BytesIO()
+                    module.operation.write_bytecode(buf)
+                    return _cutlass_ir.PreCompiledMlirArtifact.from_bitcode(
+                        buf.getvalue()
+                    )
+
                 # dryrun is used to only generate IR
                 if self.envar.dryrun:
                     return result
@@ -1937,8 +2061,6 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         re-parses the source and ``exec()``s it, which requires those names
         to be resolvable in *exec_globals*.
 
-        This mirrors the injection already done by
-        ``function_compiler._rewrite_callee``.
         """
         if original_function.__closure__:
             for name, cell in zip(
@@ -1953,7 +2075,8 @@ class BaseDSL(metaclass=DSLSingletonMeta):
                     pass
 
     def run_preprocessor(
-        self, original_function: Any, callee_rewrite: bool = False
+        self,
+        original_function: Any,
     ) -> Any:
         function_name = original_function.__name__
         self.funcBody = original_function
@@ -1964,7 +2087,8 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         self._inject_closure_cells(original_function, exec_globals)
         with self.preprocessor.get_session() as preprocessor_session:
             transformed_ast = preprocessor_session.transform(
-                original_function, exec_globals
+                original_function,
+                exec_globals,
             )
             if self.envar.print_after_preprocessor:
                 log().info(
@@ -2139,37 +2263,42 @@ class BaseDSL(metaclass=DSLSingletonMeta):
 
         return signature.replace(parameters=new_params)
 
-    def _func(self, funcBody: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
-        """Decorator for MLIR functions.
-        It cuts the boilerplate code, does the following:
-            1. Generates `func.func`
-            2. Types translation (numpy arrays -> cute.memref, float -> <f32>, etc.)
-            3. Compiles and JITs the MLIR module
-            4. Invokes the generated function
-            5. Operator overloading (a + b --> arith.addi a, b)
-            6. Generates GPU kernel function with GPU module and kernel attributes baked
-        """
-        if ir.Context.current is None:
-            pass
-        elif ir.InsertionPoint.current is not None:
-            return funcBody(*args, **kwargs)
+    @dataclass
+    class _CompilationSetup:
+        """Shared pre-IR-generation state for both _func and _device_func."""
 
+        function_name: str
+        pipeline: str | None
+        gpu_module_attrs: dict[str, Any]
+        no_cache: bool
+        no_jit_engine: bool
+        compile_only: bool
+        canonicalized_args: tuple[Any, ...]
+        canonicalized_kwargs: dict[str, Any]
+        sig: inspect.Signature
+        location: Any  # DSLLocation | None
+        compile_to_precompiled_mlir: bool = False
+
+    def _prepare_compilation(
+        self, funcBody: Callable[..., Any], *args: Any, **kwargs: Any
+    ) -> "_CompilationSetup":
+        """Extract kwargs, canonicalize args, mangle name, and apply compile options.
+
+        Shared setup for both _func (kernel path) and _device_func (device path).
+        """
         function_name = funcBody.__name__
         self.funcBody = funcBody
 
         pipeline = kwargs.pop("pipeline", None)
         gpu_module_attrs = kwargs.pop("gpu_module_attrs", {})
-
-        # Disable cache
         no_cache = kwargs.pop("no_cache", False)
-
-        # Disable JIT execution engine
         no_jit_engine = kwargs.pop("no_jit_engine", False)
-
-        # Always compile(disable cache) and return the result jit_executor
         compile_only = kwargs.pop("compile_only", False)
 
+        compile_to_precompiled_mlir = kwargs.pop("compile_to_precompiled_mlir", False)
+
         func_name_prefix = kwargs.pop("_name_prefix", None)
+        export_name = kwargs.pop("export_name", None)
 
         if not no_cache and (
             self.envar.keep_ptx
@@ -2192,19 +2321,19 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         has_varargs = self._check_arg_count(sig, bound_args, function_name)
 
         # Canonicalize the input arguments
-        canonicalized_args, canonicalized_kwonly_args = self._canonicalize_args(
-            bound_args
-        )
+        canonicalized_args, canonicalized_kwargs = self._canonicalize_args(bound_args)
 
         # Expand *args/**kwargs into concrete named parameters
         if has_varargs:
             sig = self._expand_varargs_varkw(
-                canonicalized_args, canonicalized_kwonly_args, sig
+                canonicalized_args, canonicalized_kwargs, sig
             )
-        # Simple name mangling
-        function_name = self.mangle_name(function_name, canonicalized_args, sig)
-        if func_name_prefix:
-            function_name = f"{func_name_prefix}_{function_name}"
+        if export_name is not None:
+            function_name = export_name
+        else:
+            function_name = self.mangle_name(function_name, canonicalized_args, sig)
+            if func_name_prefix:
+                function_name = f"{func_name_prefix}_{function_name}"
 
         self.compile_options.apply_envar_settings(self.envar, function_name)
         if not self.compile_options.generate_line_info:
@@ -2212,20 +2341,51 @@ class BaseDSL(metaclass=DSLSingletonMeta):
         # Enable frame filtering if line info is enabled
         _set_enable_frame_filtering(self.compile_options.generate_line_info)
 
-        # Generate MLIR Context and start generating IR
-        log().debug(f"Generating MLIR for function '{function_name}'")
+        return self._CompilationSetup(
+            function_name=function_name,
+            pipeline=pipeline,
+            gpu_module_attrs=gpu_module_attrs,
+            no_cache=no_cache,
+            no_jit_engine=no_jit_engine,
+            compile_only=compile_only,
+            canonicalized_args=canonicalized_args,
+            canonicalized_kwargs=canonicalized_kwargs,
+            sig=sig,
+            location=self.decorator_location,
+            compile_to_precompiled_mlir=compile_to_precompiled_mlir,
+        )
+
+    def _func(self, funcBody: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
+        """Decorator for MLIR functions.
+        It cuts the boilerplate code, does the following:
+            1. Generates `func.func`
+            2. Types translation (numpy arrays -> cute.memref, float -> <f32>, etc.)
+            3. Compiles and JITs the MLIR module
+            4. Invokes the generated function
+            5. Operator overloading (a + b --> arith.addi a, b)
+            6. Generates GPU kernel function with GPU module and kernel attributes baked
+        """
+        if ir.Context.current is None:
+            pass
+        elif ir.InsertionPoint.current is not None:
+            return funcBody(*args, **kwargs)
+
+        setup = self._prepare_compilation(funcBody, *args, **kwargs)
+
+        log().debug(f"Generating MLIR for function '{setup.function_name}'")
         result = self.generate_mlir(
             funcBody,
-            function_name,
-            gpu_module_attrs,
-            canonicalized_args,
-            canonicalized_kwonly_args,
-            sig,
-            pipeline,
-            no_cache,
-            no_jit_engine,
-            compile_only,
-            location=self.decorator_location,
+            setup.function_name,
+            setup.gpu_module_attrs,
+            setup.canonicalized_args,
+            setup.canonicalized_kwargs,
+            setup.sig,
+            setup.pipeline,
+            setup.no_cache,
+            setup.no_jit_engine,
+            setup.compile_only,
+            location=setup.location,
+            compile_to_precompiled_mlir=setup.compile_to_precompiled_mlir,
         )
         return result
 
@@ -2504,7 +2664,7 @@ class BaseDSL(metaclass=DSLSingletonMeta):
                 )
 
                 loc = self.get_ir_location()
-                with self._enter_gpu_module():  # type: ignore[attr-defined]
+                with self._enter_gpu_module():
                     log().debug("Generating device kernel")
                     if self.device_compilation_only:
                         log().debug("Generating cuda-python arguments")
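Review note: the new `except NameError` handler above walks `__traceback__` to its last entry so the reported file and line point at the user's code rather than at the DSL frame that caught the error. The following is a minimal standalone sketch of that walk, not the DSL's implementation; `innermost_location` and `use_before_define` are hypothetical names introduced only for illustration.

```python
from typing import Optional, Tuple


def innermost_location(exc: BaseException) -> Tuple[Optional[str], Optional[int]]:
    """Return (filename, lineno) of the deepest frame in the exception's traceback."""
    tb = exc.__traceback__
    filename: Optional[str] = None
    lineno: Optional[int] = None
    # Each tb_next step moves one call deeper; the last entry is where
    # the exception was actually raised.
    while tb is not None:
        filename = tb.tb_frame.f_code.co_filename
        lineno = tb.tb_lineno
        tb = tb.tb_next
    return filename, lineno


def use_before_define():
    # Hypothetical user code: the name below is never defined.
    return undefined_value  # noqa: F821


try:
    use_before_define()
except NameError as err:
    location = innermost_location(err)
```

Here `location` refers to the `return` line inside `use_before_define`, not to the `try` block that caught the error, which is the behavior the handler relies on when it fills `filename` and `lineno` on `DSLUserCodeError`.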
diff --git a/python/CuTeDSL/cutlass/base_dsl/env_manager.py b/python/CuTeDSL/cutlass/base_dsl/env_manager.py
index 49a53daa2..0107ab9d1 100644
--- a/python/CuTeDSL/cutlass/base_dsl/env_manager.py
+++ b/python/CuTeDSL/cutlass/base_dsl/env_manager.py
@@ -82,14 +82,6 @@ def _parse_keep_tokens(raw: str, prefix: str = "") -> frozenset[str]:
     if "all" in tokens:
         return _KEEP_ALL_TOKENS
     unknown = tokens - _KEEP_VALID_TOKENS
-    if unknown:
-        message = f"{prefix}_KEEP" if prefix else "[DSL]_KEEP"
-        log().warning(
-            "%s: unknown token(s) %s will be ignored. Valid tokens: %s",
-            message,
-            sorted(unknown),
-            sorted(_KEEP_VALID_TOKENS),
-        )
     return tokens - unknown
 
 
@@ -108,28 +100,76 @@ def get_str_env_var(var_name: str, default_value: str | None = None) -> str | No
     return value if value is not None else default_value
 
 
+_BOOL_TRUE_VALUES = frozenset({"1", "true", "yes", "on"})
+_BOOL_FALSE_VALUES = frozenset({"0", "false", "no", "off"})
+
+
 @lru_cache(maxsize=None)
 def get_bool_env_var(var_name: str, default_value: bool = False) -> bool:
     """
     Get the value of a boolean environment variable.
-    If the value it not in False, 0, or empty string, it is considered True.
-    Note that the value is cached after the first call.
+
+    Recognized values (case-insensitive, surrounding whitespace ignored):
+      * Truthy:   ``1``, ``true``, ``yes``, ``on``
+      * Falsy:    ``0``, ``false``, ``no``, ``off``
+
+    An unset variable, or one whose value is empty (or whitespace only),
+    returns ``default_value``.
+
+    The parsed value is cached after the first call (per ``var_name`` /
+    ``default_value`` pair).
+
+    Raises:
+        ValueError: if the variable is set to any other value.
     """
-    value = get_str_env_var(var_name)
-    if value is None:
+    raw = get_str_env_var(var_name)
+    if raw is None:
         return default_value
-    return value not in {"False", "0", ""}
+    normalized = raw.strip().lower()
+    if normalized == "":
+        return default_value
+    if normalized in _BOOL_TRUE_VALUES:
+        return True
+    if normalized in _BOOL_FALSE_VALUES:
+        return False
+    raise ValueError(
+        f"Invalid value for environment variable {var_name}={raw!r}. "
+        f"Expected a boolean (case-insensitive): "
+        f"{sorted(_BOOL_TRUE_VALUES) + sorted(_BOOL_FALSE_VALUES)} "
+        f"or empty/unset to use the default ({default_value!r})."
+    )
 
 
 @lru_cache(maxsize=None)
 def get_int_env_var(var_name: str, default_value: int = 0) -> int:
     """
     Get the value of an integer environment variable.
-    If the value is not a valid integer, the default value 0 is returned.
-    Note that the value is cached after the first call.
+
+    Surrounding whitespace is ignored. An unset variable or one with an
+    empty value returns ``default_value``. Negative integers (e.g.
+    ``-5``) are accepted.
+
+    Raises:
+        ValueError: if the variable is set to a value that is not a
+            valid base-10 integer.
+
+    The parsed value is cached after the first call (per ``var_name`` /
+    ``default_value`` pair).
     """
-    value = get_str_env_var(var_name)
-    return int(value) if value and value.isdigit() else default_value
+    raw = get_str_env_var(var_name)
+    if raw is None:
+        return default_value
+    stripped = raw.strip()
+    if stripped == "":
+        return default_value
+    try:
+        return int(stripped)
+    except ValueError:
+        raise ValueError(
+            f"Invalid value for environment variable {var_name}={raw!r}. "
+            f"Expected a base-10 integer, or empty/unset to use the "
+            f"default ({default_value!r})."
+        ) from None
 
 
 @lru_cache(maxsize=None)
@@ -137,22 +177,36 @@ def get_int_or_none_env_var(
     var_name: str, default_value: int | None = None
 ) -> int | None:
     """
-    Get the value of an integer or None union environment variable.
-    If the value is not a valid integer, the default value 0 is returned.
-    Note that the value is cached after the first call.
+    Get the value of an integer-or-``None`` environment variable.
+
+    Recognized values (case-insensitive, surrounding whitespace ignored):
+      * ``"none"``                       returns ``None``
+      * any base-10 integer literal      returns that integer (negatives accepted)
+
+    An unset variable or one with an empty value returns ``default_value``.
+
+    Raises:
+        ValueError: if the variable is set to anything else.
+
+    The parsed value is cached after the first call (per ``var_name`` /
+    ``default_value`` pair).
     """
     raw = get_str_env_var(var_name)
     if raw is None:
         return default_value
-
-    value = raw.strip().lower()
-    if value == "none":
-        return None
-
-    try:
-        return int(value)
-    except ValueError:
+    normalized = raw.strip().lower()
+    if normalized == "":
         return default_value
+    if normalized == "none":
+        return None
+    try:
+        return int(normalized)
+    except ValueError:
+        raise ValueError(
+            f"Invalid value for environment variable {var_name}={raw!r}. "
+            f"Expected a base-10 integer, the literal 'none', or "
+            f"empty/unset to use the default ({default_value!r})."
+        ) from None
 
 
 @lru_cache(maxsize=None)
@@ -183,8 +237,9 @@ def detect_gpu_arch(prefix: str) -> str:
         arch = (10, 0)
 
     major, minor = arch
+    assert major is not None and minor is not None
     suffix = ""
-    if major >= 9:  # type: ignore[operator]
+    if major >= 9:
         suffix = "a"
 
     return f"sm_{major}{minor}{suffix}"
@@ -302,6 +357,8 @@ def get_prefix_dsl_libs(prefix: str) -> str | None:
             }
             lib_folder_guesses = [
                 "lib",
+                "cu12/lib",
+                "cu13/lib",
             ]
 
             for target_libs in [
@@ -407,6 +464,13 @@ class EnvironmentVarManager(LogEnvironmentManager):
     - [DSL_NAME]_LIBS: Path to dependent shared libraries (default: None)
     - [DSL_NAME]_ENABLE_TVM_FFI: Enable TVM-FFI or not (default: False)
     - [DSL_NAME]_LOC_TRACEBACKS: Maximum depth of location tracebacks (default: 0)
+    - [DSL_NAME]_COMPILER_OPT: Compact compiler option string (default: "").
+      Controls compiler passes and diagnostic checks. Forms accepted:
+        iket                        — enable IKET (In-Kernel Event Tracing) instrumentation
+      Examples:
+        CUTE_DSL_COMPILER_OPT="iket"
+      The same option strings are accepted by cute.compile(..., options=...).
+
     """
 
     def __init__(self, prefix: str = "DSL") -> None:
@@ -435,11 +499,14 @@ class EnvironmentVarManager(LogEnvironmentManager):
         # ------------------------------------------------------------------ #
         # Artifact keep — [DSL]_KEEP=<comma-list>                            #
         # ------------------------------------------------------------------ #
+        # Parse new consolidated option.
         _keep_raw = get_str_env_var(f"{prefix}_KEEP", "")
         _keep_tokens: set[str] = set(
             _parse_keep_tokens(_keep_raw, prefix) if _keep_raw else frozenset()
         )
 
+        # Backward compatibility: publicly-documented old options emit a
+        # DeprecationWarning and fold into _keep_tokens.
         if get_bool_env_var(f"{prefix}_KEEP_IR", False):
             warnings.warn(
                 f"{prefix}_KEEP_IR is deprecated; use {prefix}_KEEP=ir-debug instead.",
@@ -484,6 +551,8 @@ class EnvironmentVarManager(LogEnvironmentManager):
         self.disable_file_caching = get_bool_env_var(
             f"{prefix}_DISABLE_FILE_CACHING", False
         )
+        self.compiler_opt = get_str_env_var(f"{prefix}_COMPILER_OPT", "")
+
         # set mlir shared libraries
         self.shared_libs = get_prefix_dsl_libs(prefix)
 
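Review note: the stricter boolean parsing in `get_bool_env_var` changes observable behavior. The removed implementation returned `value not in {"False", "0", ""}`, so a lowercase `false` was treated as true; the new one normalizes case and whitespace and raises on anything unrecognized. Below is a minimal sketch of the new rules on a plain string, assuming nothing beyond what the hunk shows. `parse_bool` is a hypothetical helper, and it deliberately omits the environment lookup and the `lru_cache` wrapper.

```python
from typing import Optional

TRUE_VALUES = frozenset({"1", "true", "yes", "on"})
FALSE_VALUES = frozenset({"0", "false", "no", "off"})


def parse_bool(raw: Optional[str], default: bool = False) -> bool:
    """Parse a raw environment-variable string using the rules from the hunk."""
    if raw is None:
        return default  # variable is unset
    normalized = raw.strip().lower()
    if normalized == "":
        return default  # empty or whitespace-only value
    if normalized in TRUE_VALUES:
        return True
    if normalized in FALSE_VALUES:
        return False
    # Anything else is rejected instead of being silently treated as True.
    raise ValueError(f"Invalid boolean value: {raw!r}")
```

For example, `parse_bool(" Yes ")` is true, `parse_bool("OFF", default=True)` is false, and `parse_bool("2")` raises `ValueError`. Because the real function is wrapped in `lru_cache`, a value changed in the environment after the first call would not be picked up by later calls with the same arguments.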
diff --git a/python/CuTeDSL/cutlass/base_dsl/export/c_header_generator.py b/python/CuTeDSL/cutlass/base_dsl/export/c_header_generator.py
index 33cdc5963..8d54d25f1 100644
--- a/python/CuTeDSL/cutlass/base_dsl/export/c_header_generator.py
+++ b/python/CuTeDSL/cutlass/base_dsl/export/c_header_generator.py
@@ -21,6 +21,7 @@ from ..typing import (
     Uint32,
     Uint64,
     Boolean,
+    BFloat16,
     Float16,
     Float32,
     Float64,
@@ -92,6 +93,12 @@ class CHeaderGenerator:
 }
 """
 
+    # AOT cross-compile invariant: every C type emitted into the generated
+    # .h must be either fixed-width (intN_t / uintN_t / float / double) or an
+    # opaque target-defined typedef (e.g. cudaLibrary_t). Never use bare
+    # `long`, `int`, `size_t`, or pointer-arithmetic literals — those bind
+    # the struct ABI to the *build* machine, which silently miscompiles when
+    # the host cross-target has a different word size.
     numeric_to_c_type = {
         Int8: "int8_t ",
         Int16: "int16_t ",
@@ -106,8 +113,150 @@ class CHeaderGenerator:
         TFloat32: "float ",
         Float64: "double ",
         Float16: "__half_raw ",
+        BFloat16: "__nv_bfloat16 ",
     }
 
+    # Device functions use __half, not the __half_raw used by host export headers.
+    device_type_overrides = {
+        Float16: "__half ",
+    }
+
+    @classmethod
+    def get_c_type(cls, ann: Any, device: bool = False) -> str:
+        """Map a DSL type annotation to its C type string (without trailing space).
+
+        Args:
+            ann: DSL type annotation (e.g., Float32, Int64, or a @native_struct class).
+            device: If True, use device-specific mappings (e.g., __half vs __half_raw).
+        """
+        if device:
+            c = cls.device_type_overrides.get(ann)
+            if c is not None:
+                return c.strip()
+        c = cls.numeric_to_c_type.get(ann)
+        if c is not None:
+            return c.strip()
+        if hasattr(ann, "_field_names"):
+            return ann.__name__
+        raise DSLRuntimeError(
+            f"Unsupported type annotation for C header generation: {ann}"
+        )
+
+    @classmethod
+    def generate_struct_typedef(cls, struct_cls: Any, device: bool = False) -> str:
+        """Generate a C struct typedef from a @native_struct class.
+
+        Nested struct fields are referenced by their typedef name (not
+        inlined as anonymous structs), so all transitive struct dependencies
+        must be emitted before this typedef.  Use
+        :meth:`_collect_struct_typedefs` to gather them in dependency order.
+        """
+        fields = []
+        for fname in struct_cls._field_names:
+            ann = struct_cls._field_annotations[fname]
+            fields.append(f"    {cls.get_c_type(ann, device=device)} {fname};")
+        packed = getattr(struct_cls, "_struct_packed", False)
+        attr = " __attribute__((packed))" if packed else ""
+        name = struct_cls.__name__
+        return f"typedef struct{attr} {{\n" + "\n".join(fields) + f"\n}} {name};"
+
+    @classmethod
+    def _collect_struct_typedefs(
+        cls, struct_cls: Any, *, device: bool = False, seen: set[str] | None = None
+    ) -> list[str]:
+        """Recursively collect C struct typedefs in dependency order.
+
+        Walks the struct's field annotations; any field that is itself a
+        @native_struct is visited first so its typedef appears before the
+        struct that references it.  Deduplicates by class name.
+        """
+        if seen is None:
+            seen = set()
+        name = struct_cls.__name__
+        if name in seen:
+            return []
+        seen.add(name)
+
+        result: list[str] = []
+        for fname in struct_cls._field_names:
+            ann = struct_cls._field_annotations[fname]
+            if hasattr(ann, "_field_names"):
+                result.extend(
+                    cls._collect_struct_typedefs(ann, device=device, seen=seen)
+                )
+        result.append(cls.generate_struct_typedef(struct_cls, device=device))
+        return result
+
+    @classmethod
+    def generate_device_header(
+        cls,
+        function_name: str,
+        signature: inspect.Signature,
+        ret_annotation: Any = None,
+        pointer_types: tuple[type, ...] = (),
+    ) -> str:
+        """Generate a .cuh header declaring a __device__ function.
+
+        Works entirely from DSL type annotations — no MLIR types needed.
+        Handles scalar types, @native_struct types (including nested structs),
+        and opaque pointer annotations.
+
+        Args:
+            pointer_types: Annotation types that should be rendered as
+                ``void*`` in the generated header (e.g. upstream pointer
+                wrappers).
+        """
+        seen_structs: set[str] = set()
+        typedefs: list[str] = []
+
+        def _add_struct(struct_cls: Any) -> None:
+            """Collect typedefs for *struct_cls* (and its transitive deps)."""
+            if struct_cls.__name__ not in seen_structs:
+                for td in cls._collect_struct_typedefs(
+                    struct_cls, device=True, seen=seen_structs
+                ):
+                    typedefs.append(td)
+
+        # Return type
+        if ret_annotation is None:
+            ret_c = "void"
+        elif hasattr(ret_annotation, "_field_names"):
+            _add_struct(ret_annotation)
+            ret_c = ret_annotation.__name__
+        else:
+            ret_c = cls.get_c_type(ret_annotation, device=True)
+
+        # Parameters
+        params = []
+        for name, param in signature.parameters.items():
+            ann = param.annotation
+            if ann is inspect.Parameter.empty:
+                ann = None
+            if pointer_types and ann in pointer_types:
+                params.append(f"void* {name}")
+            elif ann is not None and hasattr(ann, "_field_names"):
+                _add_struct(ann)
+                params.append(f"{ann.__name__} {name}")
+            elif ann is not None:
+                params.append(f"{cls.get_c_type(ann, device=True)} {name}")
+            else:
+                params.append(f"void* {name}")
+        param_str = ", ".join(params) if params else "void"
+
+        typedef_block = "\n\n".join(typedefs)
+        if typedef_block:
+            typedef_block += "\n\n"
+
+        return f"""\
+#pragma once
+
+#include <cuda_fp16.h>
+#include <cuda_bf16.h>
+#include <stdint.h>
+
+{typedef_block}extern "C" __device__ {ret_c} {function_name}({param_str});
+"""
+
     def _count_dynamic_expression(self, arg: Any) -> int:
         """
         Count the number of dynamic values in the argument.
@@ -135,7 +284,7 @@ class CHeaderGenerator:
 #ifndef {dsl_name}_CUDA_ERROR_CHECK
 #define {dsl_name}"""
             + self.cuda_error_check
-            + f"""
+            + """
 #endif
 """
         )
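Review note: `_collect_struct_typedefs` emits nested structs before the structs that reference them, which C requires because fields are referenced by typedef name. The sketch below reproduces that post-order traversal on stand-in classes. `Vec2` and `Particle` are hypothetical and only mimic the `_field_names` / `_field_annotations` attributes the hunk reads; plain strings stand in for the DSL numeric types, and the `packed` attribute and `device` flag are left out.

```python
from typing import Any, List, Optional, Set


class Vec2:
    # Stand-in for a @native_struct class (hypothetical).
    _field_names = ("x", "y")
    _field_annotations = {"x": "float", "y": "float"}


class Particle:
    # Two fields share the nested Vec2 type, so deduplication matters.
    _field_names = ("pos", "vel", "mass")
    _field_annotations = {"pos": Vec2, "vel": Vec2, "mass": "float"}


def c_type(ann: Any) -> str:
    """Strings are already C type names; struct classes use their class name."""
    return ann if isinstance(ann, str) else ann.__name__


def struct_typedef(struct_cls: Any) -> str:
    fields = [
        f"    {c_type(struct_cls._field_annotations[name])} {name};"
        for name in struct_cls._field_names
    ]
    return "typedef struct {\n" + "\n".join(fields) + f"\n}} {struct_cls.__name__};"


def collect_typedefs(struct_cls: Any, seen: Optional[Set[str]] = None) -> List[str]:
    """Return typedefs with dependencies first, each struct emitted once."""
    if seen is None:
        seen = set()
    if struct_cls.__name__ in seen:
        return []
    seen.add(struct_cls.__name__)
    result: List[str] = []
    for name in struct_cls._field_names:
        ann = struct_cls._field_annotations[name]
        if hasattr(ann, "_field_names"):  # nested struct: visit it first
            result.extend(collect_typedefs(ann, seen))
    result.append(struct_typedef(struct_cls))
    return result


typedefs = collect_typedefs(Particle)
```

`typedefs` holds the `Vec2` typedef followed by the `Particle` typedef, with `Vec2` appearing only once even though two fields use it. Note that deduplication is by class name, as in the hunk, so two distinct struct classes with the same `__name__` would collide.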
diff --git a/python/CuTeDSL/cutlass/base_dsl/export/external_binary_module.py b/python/CuTeDSL/cutlass/base_dsl/export/external_binary_module.py
index 8f257fbf1..b6f9c9987 100644
--- a/python/CuTeDSL/cutlass/base_dsl/export/external_binary_module.py
+++ b/python/CuTeDSL/cutlass/base_dsl/export/external_binary_module.py
@@ -120,10 +120,11 @@ class ExternalBinaryModule:
                 f"Function prefix {function_prefix} not found in the module.", cause=e
             )
         load_provider.version_checker(version_str)
+        assert function_name is not None
         capi_func_p = self.engine.lookup(function_name)
         if not capi_func_p:
             raise DSLRuntimeError(
-                "Unknown function: "  # type: ignore[operator]
+                "Unknown function: "
                 + "_mlir_"
                 + function_prefix
                 + "__mlir_ciface_"
diff --git a/python/CuTeDSL/cutlass/base_dsl/ffi.py b/python/CuTeDSL/cutlass/base_dsl/ffi.py
index ecab46590..ba0c6fbe6 100644
--- a/python/CuTeDSL/cutlass/base_dsl/ffi.py
+++ b/python/CuTeDSL/cutlass/base_dsl/ffi.py
@@ -34,6 +34,7 @@ Usage:
         ...
 """
 
+import sys
 import typing
 from types import UnionType
 from typing import TypeVar, Any, Union
@@ -41,6 +42,12 @@ import inspect
 from dataclasses import dataclass
 import string
 
+# typing.get_overloads requires Python 3.11+; fall back to typing_extensions on 3.10.
+if sys.version_info >= (3, 11):
+    from typing import get_overloads
+else:
+    from typing_extensions import get_overloads
+
 from .._mlir import ir
 from .._mlir.dialects import func, gpu, llvm
 
@@ -221,8 +228,7 @@ class ExternCallHandler:
         self.inited = True
 
         # Note: don't do this in the constructor as MLIR context doesn't exist yet
-        self.overloads = typing.get_overloads(self.func)  # type: ignore[attr-defined]
-        assert isinstance(self.overloads, list)
+        self.overloads = list(get_overloads(self.func))
         if len(self.overloads) == 0:
             self.overloads.append(self.func)
 
@@ -500,7 +506,7 @@ class FFI:
         params_types: list[Any] | None = None,
         return_type: Any = None,
         inline: bool = True,
-        source: Any = None,
+        source: BitCode | None = None,
         overloaded: bool = False,
         name_mangler: Any = None,
         implicit_convert: Any = None,
diff --git a/python/CuTeDSL/cutlass/base_dsl/jit_executor.py b/python/CuTeDSL/cutlass/base_dsl/jit_executor.py
index 2c3d4ac2c..1d32fc810 100644
--- a/python/CuTeDSL/cutlass/base_dsl/jit_executor.py
+++ b/python/CuTeDSL/cutlass/base_dsl/jit_executor.py
@@ -33,7 +33,11 @@ from .._mlir.dialects import llvm
 
 # Local modules imports
 from . import typing as t
-from .common import DSLRuntimeError, DSLCudaRuntimeError
+from .common import (
+    DSLRuntimeError,
+    DSLCudaRuntimeError,
+    create_cuda_runtime_error,
+)
 from .runtime import cuda as cuda_helpers
 from .runtime.jit_arg_adapters import JitArgAdapterRegistry, is_arg_annotation_constexpr
 from .typing import get_c_pointers
@@ -41,6 +45,7 @@ from .utils.logger import log
 from .utils.timer import timer
 
 if TYPE_CHECKING:
+    from .compiler import HostTarget
     from .dsl import BaseDSL
     from .export.export import SignatureProcessor
     from .export.c_header_generator import CHeaderGenerator
@@ -380,6 +385,14 @@ class ExecutionArgs:
         kwonly_defaults = {}
 
         for i, (name, param) in enumerate(sig.parameters.items()):
+            # We don't want *args or **kwargs to be included in the
+            # KwargsWrapperSpec, so skip them here. Including them
+            # breaks functions that use *args or **kwargs.
+            if param.kind in (
+                inspect.Parameter.VAR_POSITIONAL,
+                inspect.Parameter.VAR_KEYWORD,
+            ):
+                continue
             is_kwonly = param.kind == inspect.Parameter.KEYWORD_ONLY
             annotation = param.annotation
             if (
@@ -684,19 +697,14 @@ class JitExecutor:
             error_code = self.cuda_result.value  # type: ignore[union-attr]
             if error_code == 0:
                 return error_code
-            # Try to get the error name, but handle unknown error codes gracefully
-            try:
-                cu_result = cuda_helpers.cuda.CUresult(error_code)
-                error_name = cuda_helpers._cudaGetErrorEnum(cu_result)
-            except (ValueError, AttributeError):
-                # Error code not recognized by the enum or other error getting the name
-                error_name = f"<unknown CUDA error code {error_code}>"
-            raise DSLCudaRuntimeError(error_code, error_name)
+            raise create_cuda_runtime_error(error_code)
         except DSLCudaRuntimeError as e:
             raise e
         except Exception as e:
             raise DSLRuntimeError(f"💥💥💥 Runtime Crash 💥💥💥", cause=e)
 
+        return None
+
     def __call__(self, *args: Any, **kwargs: Any) -> int | None:
         exe_args, adapted_args = self.generate_execution_args(*args, **kwargs)
         return self.run_compiled_program(exe_args)
@@ -709,6 +717,9 @@ class JitFunctionArtifacts:
     PTX: str | None
     CUBIN: str | bytes | None
     MLIR: str | None
+    # Device compilation artifacts (set when DeviceTarget is enabled)
+    device_header: str | None = None
+    device_object_path: str | None = None
 
     def __post_init__(self) -> None:
         if self.PTX is not None and os.path.exists(self.PTX):
@@ -782,6 +793,7 @@ class JitCompiledFunction:
         dynamic_args: tuple[Any] = tuple[Any](),
         dynamic_kwargs: dict[str, Any] = dict[str, Any](),
         has_gpu_module: bool = True,
+        host_target: "HostTarget | None" = None,
     ) -> None:
         self.ir_module = ir_module
         self.engine = engine
@@ -800,6 +812,16 @@ class JitCompiledFunction:
         self.prefix = prefix
         self.load_from_binary = load_from_binary
 
+        # AOT cross-compile target for the host shim object. ``None`` or
+        # an empty HostTarget = native build host (preserves prior behavior).
+        # Lazy import: .compiler imports .env_manager which transitively
+        # imports this module, so a top-level import would close the cycle.
+        if host_target is None:
+            from .compiler import HostTarget as _HostTarget
+
+            host_target = _HostTarget("")
+        self.host_target: "HostTarget" = host_target
+
         # This runtime state is stored here so that we can preserve the module
         # in the compiler cache. Callers can extend the lifetime of the module
         # by creating and retaining the executor.
@@ -1099,6 +1121,9 @@ class JitCompiledFunction:
                     export_module,
                     tmp_object_file.name,
                     "_".join([function_prefix, self.function_name]),
+                    host_triple=self.host_target.triple,
+                    host_cpu=self.host_target.cpu,
+                    host_features=self.host_target.features,
                 )
                 with open(tmp_object_file.name, "rb") as f:
                     ret = f.read()
@@ -1127,6 +1152,26 @@ class JitCompiledFunction:
 
         The library contains the binary of the underlying host launch entry function.
 
+        Host cross-compilation:
+            The host CPU/OS that the emitted ``.o`` targets is set at
+            compile time via the options string passed to ``cute.compile``,
+            not via this method. To emit a cross-targeted host shim
+            (currently AArch64 only)::
+
+                compiled = cute.compile(
+                    fn, *args,
+                    options="--gpu-arch sm_100a --host-target linux-aarch64",
+                )
+                compiled.export_to_c(out_dir, "kernel")
+
+            The resulting ``kernel.o`` is an ELF object for the requested
+            AArch64 triple. Linking is the user's responsibility: invoke
+            a cross linker against a sysroot containing ``<cuda.h>``
+            and a target-built ``libcuda_dialect_runtime``. See
+            ``python -m cutlass.cute.export.aot_config --target ...`` for
+            ``-L`` / ``-l`` flag discovery. Non-AArch64 triples are not
+            supported and fail with a clean "target not registered" error.
+
         @param jit_function: The jit-compiled function from `cute.compile`.
         @param file_path: The path to the directory where the header and object files will be saved.
         @param file_name: The name of the header and object files.
diff --git a/python/CuTeDSL/cutlass/base_dsl/leaf_utils.py b/python/CuTeDSL/cutlass/base_dsl/leaf_utils.py
index 9f1340123..ed630e877 100644
--- a/python/CuTeDSL/cutlass/base_dsl/leaf_utils.py
+++ b/python/CuTeDSL/cutlass/base_dsl/leaf_utils.py
@@ -239,7 +239,7 @@ class LeafInfo:
                 self.obj.value = new_vals[0]
             return
 
-        # Case 3: raw ir.Value (including subclasses like ArithValue, ctm.Pointer)
+        # Case 3: raw ir.Value (including subclasses like ArithValue, Pointer)
         if isinstance(self.obj, ir.Value):
             if len(new_vals) >= 1:
                 self._replace_in_parent(new_vals[0])
diff --git a/python/CuTeDSL/cutlass/base_dsl/native_struct.py b/python/CuTeDSL/cutlass/base_dsl/native_struct.py
index b076e98ba..955c35c77 100644
--- a/python/CuTeDSL/cutlass/base_dsl/native_struct.py
+++ b/python/CuTeDSL/cutlass/base_dsl/native_struct.py
@@ -19,7 +19,7 @@ from .typing import DslType
 from .._mlir import ir
 from .._mlir.dialects import llvm
 
-from ._mlir_helpers import dsl_user_op
+from .._mlir_helpers import dsl_user_op
 
 
 def _is_constexpr_annotation(ann: type) -> bool:
@@ -303,9 +303,7 @@ def native_struct(
                         f"Expected single value for field {name!r}, got {len(elem)}"
                     )
                 elem = elem[0]
-                new_value = llvm.insertvalue(
-                    self._value, elem, position=[idx], loc=loc, ip=ip
-                )
+                new_value = llvm.insertvalue(self._value, elem, position=[idx], loc=loc, ip=ip)
                 self._value = new_value
 
             setter.__name__ = f"set_{name}"
diff --git a/python/CuTeDSL/cutlass/base_dsl/runtime/cuda.py b/python/CuTeDSL/cutlass/base_dsl/runtime/cuda.py
index 7d41754ff..cd5e3ee7e 100644
--- a/python/CuTeDSL/cutlass/base_dsl/runtime/cuda.py
+++ b/python/CuTeDSL/cutlass/base_dsl/runtime/cuda.py
@@ -15,7 +15,7 @@ This module provides CUDA Python helper functions
 
 from functools import lru_cache
 from dataclasses import dataclass
-from typing import Any
+from typing import Any, Optional
 from enum import IntEnum
 import numpy as np
 import os
diff --git a/python/CuTeDSL/cutlass/base_dsl/runtime/device_tensor.py b/python/CuTeDSL/cutlass/base_dsl/runtime/device_tensor.py
index 86a0d2fac..184ec7e0f 100644
--- a/python/CuTeDSL/cutlass/base_dsl/runtime/device_tensor.py
+++ b/python/CuTeDSL/cutlass/base_dsl/runtime/device_tensor.py
@@ -30,7 +30,7 @@ def allocate(tensor: TensorDescriptor, stream: Any = None) -> None:
 
     tensor.device_pointer = cuda_helpers.allocate(tensor.size_in_bytes, stream)
 
-    log().info("Allocate done tensor=[%s] dev_ptr=[%s]", tensor, tensor.device_pointer)  # type: ignore[union-attr]
+    log().info("Allocate done tensor=[%s] dev_ptr=[%s]", tensor, tensor.device_pointer)
 
 
 def deallocate(tensor: TensorDescriptor, stream: Any = None) -> None:
@@ -44,7 +44,7 @@ def deallocate(tensor: TensorDescriptor, stream: Any = None) -> None:
     if tensor.device_pointer is None:
         raise DSLRuntimeError("Tensor is not allocated on the device.")
 
-    log().info(  # type: ignore[union-attr]
+    log().info(
         "Deallocating done tensor=[%s] dev_ptr=[%s]", tensor, tensor.device_pointer
     )
 
@@ -59,13 +59,13 @@ def copy_to_gpu(
     Copies data from host memory to the GPU memory.
     If do_allocate is True, it first calls allocate
     """
-    log().info("copyin tensor=[%s] dev_ptr=[%s]", tensor, tensor.device_pointer)  # type: ignore[union-attr]
+    log().info("copyin tensor=[%s] dev_ptr=[%s]", tensor, tensor.device_pointer)
     if do_allocate:
         allocate(tensor, stream)
     cuda_helpers.memcpy_h2d(
         tensor.data_ptr, tensor.device_pointer, tensor.size_in_bytes, stream
     )
-    log().info("copyin done tensor=[%s] dev_ptr=[%s]", tensor, tensor.device_pointer)  # type: ignore[union-attr]
+    log().info("copyin done tensor=[%s] dev_ptr=[%s]", tensor, tensor.device_pointer)
     return tensor
 
 
@@ -76,7 +76,7 @@ def copy_from_gpu(
     Copies data from GPU memory back to the host.
     If do_deallocate is True, it calls deallocate
     """
-    log().info("copyout tensor=[%s] dev_ptr=[%s]", tensor, tensor.device_pointer)  # type: ignore[union-attr]
+    log().info("copyout tensor=[%s] dev_ptr=[%s]", tensor, tensor.device_pointer)
     if tensor._check_is_managed_by_framework():
         raise DSLRuntimeError(
             "GPU tensors are managed by the framework and cannot be modified."
@@ -89,9 +89,7 @@ def copy_from_gpu(
     )
     if do_deallocate:
         deallocate(tensor, stream)
-    log().info(  # type: ignore[union-attr]
-        "copyout done tensor=[%s] dev_ptr=[%s]", tensor, tensor.device_pointer
-    )
+    log().info("copyout done tensor=[%s] dev_ptr=[%s]", tensor, tensor.device_pointer)
 
 
 def to_gpu(tensor: Any, stream: Any = None) -> TensorDescriptor:
diff --git a/python/CuTeDSL/cutlass/base_dsl/runtime/jit_arg_adapters.py b/python/CuTeDSL/cutlass/base_dsl/runtime/jit_arg_adapters.py
index d42dbb6b7..58f2745a8 100644
--- a/python/CuTeDSL/cutlass/base_dsl/runtime/jit_arg_adapters.py
+++ b/python/CuTeDSL/cutlass/base_dsl/runtime/jit_arg_adapters.py
@@ -70,7 +70,7 @@ def is_arg_annotation_constexpr(
 
     return (
         _is_reserved_python_func_arg(arg_index, arg_name, owning_func)
-        or (isinstance(arg_annotation, type) and issubclass(arg_annotation, Constexpr))
+        or (isinstance(arg_annotation, type) and issubclass(arg_annotation, Constexpr))  # type: ignore[misc]
         or (get_origin(arg_annotation) is Constexpr)
     )
 
@@ -177,11 +177,13 @@ class JitArgAdapterRegistry:
         """
         adapter = cls.jit_arg_adapter_registry.get(type(arg), None)
         if adapter is None:
-            if (cls.default_dataclass_adapter
+            if (
+                cls.default_dataclass_adapter
                 and not implements_jit_argument(arg, partial=True)
                 and not implements_dynamic_expression(arg, partial=True)
                 and is_dataclass(arg)
-                and len(vars(arg)) == len(fields(arg))):  # no extra/missing instance attrs
+                and len(vars(arg)) == len(fields(arg))
+            ):  # no extra/missing instance attrs
                 adapter = cls.default_dataclass_adapter
         return adapter
 
@@ -198,25 +200,32 @@ class DefaultDataclassAdapter:
     """
     Adapter for dataclass typed JIT arguments.
     """
+
     def __init__(self, arg: object) -> None:
         self._ir_fields: dict[str, object] = {}
         self._ir_fields_len: dict[str, int] = {}
         self._arg = arg
-        for f in fields(arg): # type: ignore[arg-type]
+        for f in fields(arg):  # type: ignore[arg-type]
             arg_field = getattr(arg, f.name)
             if not is_constexpr_field(f):
-                if isinstance(f.type, NumericMeta) and not isinstance(arg_field, f.type):
-                    self._ir_fields[f.name] = cast(arg_field, f.type) # type: ignore[arg-type]
+                if isinstance(f.type, NumericMeta) and not isinstance(
+                    arg_field, f.type
+                ):
+                    self._ir_fields[f.name] = cast(arg_field, f.type)  # type: ignore[arg-type]
                 else:
                     # Allow the nested fields to be adapted
-                    arg_adapter = JitArgAdapterRegistry.get_registered_adapter(arg_field)
+                    arg_adapter = JitArgAdapterRegistry.get_registered_adapter(
+                        arg_field
+                    )
                     if arg_adapter is not None:
                         self._ir_fields[f.name] = arg_adapter(arg_field)
                     else:
                         self._ir_fields[f.name] = arg_field
 
     def __c_pointers__(self) -> list[Any]:
-        return list(chain.from_iterable(get_c_pointers(v) for v in self._ir_fields.values()))
+        return list(
+            chain.from_iterable(get_c_pointers(v) for v in self._ir_fields.values())
+        )
 
     def __get_mlir_types__(self) -> list[Any]:
         ir_types = []
@@ -231,18 +240,25 @@ class DefaultDataclassAdapter:
 
         kwargs = {}
         idx = 0
-        for f in fields(self._arg): # type: ignore[arg-type]
+        for f in fields(self._arg):  # type: ignore[arg-type]
             if is_constexpr_field(f):
                 kwargs[f.name] = getattr(self._arg, f.name)
             else:
-                kwargs[f.name] = new_from_mlir_values(self._ir_fields[f.name], values[idx : idx + self._ir_fields_len[f.name]])
+                kwargs[f.name] = new_from_mlir_values(
+                    self._ir_fields[f.name],
+                    values[idx : idx + self._ir_fields_len[f.name]],
+                )
                 idx += self._ir_fields_len[f.name]
         return type(self._arg)(**kwargs)
 
     def __extract_mlir_values__(self) -> list[ir.Value]:
         from ..dsl import extract_mlir_values  # deferred to avoid circular import
 
-        return list(chain.from_iterable(extract_mlir_values(v) for v in self._ir_fields.values()))
+        return list(
+            chain.from_iterable(
+                extract_mlir_values(v) for v in self._ir_fields.values()
+            )
+        )
 
 
 JitArgAdapterRegistry.set_default_dataclass_adapter(DefaultDataclassAdapter)
diff --git a/python/CuTeDSL/cutlass/base_dsl/runtime/tensor_descriptor.py b/python/CuTeDSL/cutlass/base_dsl/runtime/tensor_descriptor.py
index b8cb45d38..2b4e824d6 100644
--- a/python/CuTeDSL/cutlass/base_dsl/runtime/tensor_descriptor.py
+++ b/python/CuTeDSL/cutlass/base_dsl/runtime/tensor_descriptor.py
@@ -74,7 +74,7 @@ class TensorDescriptor:
                 f"DLPack device type is not supported {self.dl_tensor.device.device_type}"  # type: ignore[attr-defined]
             )
 
-        log().info("TensorDescriptor is created = [%s]", self)  # type: ignore[union-attr]
+        log().info("TensorDescriptor is created = [%s]", self)
 
     @staticmethod
     def can_transformed_to_dlpack(dl_tensor: object) -> bool:
diff --git a/python/CuTeDSL/cutlass/base_dsl/torch.py b/python/CuTeDSL/cutlass/base_dsl/torch.py
new file mode 100644
index 000000000..3bd7dbeeb
--- /dev/null
+++ b/python/CuTeDSL/cutlass/base_dsl/torch.py
@@ -0,0 +1,55 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# Use of this software is governed by the terms and conditions of the
+# NVIDIA End User License Agreement (EULA), available at:
+# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
+#
+# Any use, reproduction, disclosure, or distribution of this software
+# and related documentation outside the scope permitted by the EULA
+# is strictly prohibited.
+
+"""Torch dtype <-> DSL Numeric type conversion utilities."""
+
+from typing import Type
+
+import torch
+
+from .typing import (
+    Numeric,
+    Boolean,
+    TFloat32,
+    Float8E4M3B11FNUZ,
+    Float8E4M3FN,
+    Float8E5M2,
+    Float8E8M0FNU,
+    Float4E2M1FN,
+)
+
+
+def dtype(ty: Type[Numeric]) -> "torch.dtype":
+    """
+    Return the torch.dtype corresponding to the given DSL type.
+    """
+    torch_dtype = getattr(torch, ty.__name__.lower(), None)
+
+    torch_type_map = {
+        Boolean: torch.bool,
+        # TFloat32 is just an alias of float32
+        TFloat32: torch.float32,
+        Float8E5M2: torch.float8_e5m2,
+        Float8E4M3FN: torch.float8_e4m3fn,
+        Float8E4M3B11FNUZ: torch.float8_e4m3fnuz,
+        Float4E2M1FN: torch.float4_e2m1fn_x2,
+    }
+
+    # float8_e8m0fnu is only available in recent versions of torch
+    if hasattr(torch, "float8_e8m0fnu"):
+        torch_type_map[Float8E8M0FNU] = torch.float8_e8m0fnu
+
+    if torch_dtype is None:
+        torch_dtype = torch_type_map.get(ty)
+
+    if torch_dtype is None:
+        raise TypeError(f"{ty} is not supported by torch")
+    return torch_dtype
diff --git a/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/call_provider.py b/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/call_provider.py
index ddc6c7661..f46d3ff22 100644
--- a/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/call_provider.py
+++ b/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/call_provider.py
@@ -163,9 +163,21 @@ class DynamicParamPackCallProvider(CallProvider, TVMFFIBuilder):
     ) -> tuple[ir.Type, ir.Value]:
         """Pack a var parameter to a struct."""
         value: ir.Value = context.matched_var_binding[param]
+        native_struct_type = getattr(param, spec.NATIVE_STRUCT_TYPE_ATTR, None)
+        if native_struct_type is not None:
+            native_struct_mlir_types = native_struct_type.__get_mlir_types__()
+            if len(native_struct_mlir_types) != 1:
+                raise ValueError(
+                    f"native_struct parameter {param.name!r} must map to one MLIR type"
+                )
+            return (native_struct_mlir_types[0], value)
         _, alloca = self.pack_values_to_alloca(
             current_block, context.entry_block, [value]
         )
+
+        if param.alternate_ir_type_fetch_func is not None:
+            return (param.alternate_ir_type_fetch_func(self), alloca)
+
         return (value.type, alloca)
 
     def pack_param_shape(
@@ -211,8 +223,12 @@ class DynamicParamPackCallProvider(CallProvider, TVMFFIBuilder):
                 packed_params.append(
                     self.pack_param_var(current_block, context, param.var)
                 )
-            elif isinstance(param, spec.ConstNone):
-                # const none is not packed
+            elif isinstance(
+                param,
+                (spec.ConstNone, spec.ConstInt, spec.ConstBool, spec.ConstFloat),
+            ):
+                # constexpr params are asserted in the wrapper but not forwarded
+                # into the llvm.call
                 continue
             else:
                 raise NotImplementedError(f"Unsupported parameter type: {type(param)}")
diff --git a/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/mlir_builder.py b/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/mlir_builder.py
index b6a4c73d1..a7e3a9c3e 100644
--- a/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/mlir_builder.py
+++ b/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/mlir_builder.py
@@ -32,6 +32,8 @@ class MLIRTypeBuilder:
         self.i16_type = ir.IntegerType.get_signless(16)
         self.i8_type = ir.IntegerType.get_signless(8)
         self.i1_type = ir.IntegerType.get_signless(1)
+        self.f16_type = ir.Type.parse("f16")
+        self.bf16_type = ir.Type.parse("bf16")
         self.f32_type = ir.Type.parse("f32")
         self.f64_type = ir.Type.parse("f64")
         self.ptr_type = llvm.PointerType.get()
@@ -170,6 +172,28 @@ class MLIRBuilder(MLIRTypeBuilder):
         """Create an i64 constant with the given value."""
         return self.integer_constant(self.i64_type, value)
 
+    def fptrunc(self, value: ir.Value, res_type: ir.Type) -> ir.Value:
+        """Truncate a floating-point value to a narrower float type.
+
+        Routes through f32 intermediate to avoid emitting compiler-rt calls
+        (__truncdfhf2, __truncdfbf2) that are unavailable in the JIT engine.
+        For bf16, keeps the upper 16 bits of the f32 bit pattern (truncation).
+        """
+        if value.type == res_type:
+            return value
+        if res_type == self.bf16_type:
+            if value.type != self.f32_type:
+                value = llvm.fptrunc(res=self.f32_type, arg=value)
+            v_i32 = llvm.bitcast(self.i32_type, value)
+            v_shifted = llvm.lshr(v_i32, self.i32(16))
+            v_i16 = llvm.trunc(
+                self.i16_type, v_shifted, overflow_flags=llvm.IntegerOverflowFlags.none
+            )
+            return llvm.bitcast(self.bf16_type, v_i16)
+        if res_type == self.f16_type and value.type != self.f32_type:
+            value = llvm.fptrunc(res=self.f32_type, arg=value)
+        return llvm.fptrunc(res=res_type, arg=value)
+
     def mul(self, lhs: ir.Value, rhs: ir.Value) -> ir.Value:
         """Create a multiplication operation between two values."""
         return llvm.mul(lhs, rhs, overflow_flags=llvm.IntegerOverflowFlags.none)
@@ -195,7 +219,11 @@ class MLIRBuilder(MLIRTypeBuilder):
         """Create a logical NOT operation."""
         # Ensure we're working with i1 type for boolean operations
         if value.type != self.i1_type:
-            value = llvm.trunc(res=self.i1_type, arg=value)
+            value = llvm.trunc(
+                res=self.i1_type,
+                arg=value,
+                overflow_flags=llvm.IntegerOverflowFlags.none,
+            )
         return llvm.xor(value, self.i1(1))
 
     def i64_divisible_const(self, value: ir.Value, align_const: int) -> ir.Value:
diff --git a/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/spec.py b/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/spec.py
index 53b0d89e5..34c7f90b5 100644
--- a/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/spec.py
+++ b/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/spec.py
@@ -14,7 +14,11 @@
 from abc import ABC
 
 from collections.abc import Sequence
-from typing import Optional, Union
+from typing import Callable, Optional, Union, TYPE_CHECKING
+
+if TYPE_CHECKING:
+    from .mlir_builder import MLIRBuilder
+    from cutlass._mlir import ir
 
 try:
     import tvm_ffi
@@ -83,6 +87,9 @@ class Param(ABC):
     """Base class for all parameters."""
 
 
+NATIVE_STRUCT_TYPE_ATTR = "_native_struct_type"
+
+
 class Var(Param):
     """variables: pointer, integer, floating-point, boolean, etc.
 
@@ -94,12 +101,15 @@ class Var(Param):
         The data type of the parameter.
     divisibility: Optional[int]
         The divisibility of the parameter, by default None.
-
+    alternate_ir_type_fetch_func: Optional[Callable[["MLIRBuilder"], "ir.Type"]]
+        A function to fetch the alternate IR type of the parameter. This is
+        used for cases where there is no native IR type for the parameter.
     """
 
     name: str
     dtype: "tvm_ffi.dtype"
     divisibility: Optional[int]
+    alternate_ir_type_fetch_func: Optional[Callable[["MLIRBuilder"], "ir.Type"]] = None
 
     def __init__(
         self,
@@ -107,6 +117,9 @@ class Var(Param):
         dtype: Union[str, "tvm_ffi.dtype"],
         *,
         divisibility: Optional[int] = None,
+        alternate_ir_type_fetch_func: Optional[
+            Callable[["MLIRBuilder"], "ir.Type"]
+        ] = None,
     ) -> None:
         """Initialize a Var parameter.
 
@@ -121,6 +134,7 @@ class Var(Param):
         self.name = name
         self.dtype = tvm_ffi.dtype(dtype)
         self.divisibility = divisibility
+        self.alternate_ir_type_fetch_func = alternate_ir_type_fetch_func
 
 
 class Shape(Param):
@@ -338,6 +352,39 @@ class ConstNone(Param):
         self.name = name
 
 
+class ConstInt(Param):
+    """Constexpr int parameter: runtime-asserted to equal ``value``."""
+
+    name: str
+    value: int
+
+    def __init__(self, name: str, value: int) -> None:
+        self.name = name
+        self.value = int(value)
+
+
+class ConstBool(Param):
+    """Constexpr bool parameter: runtime-asserted to equal ``value``."""
+
+    name: str
+    value: bool
+
+    def __init__(self, name: str, value: bool) -> None:
+        self.name = name
+        self.value = bool(value)
+
+
+class ConstFloat(Param):
+    """Constexpr float parameter: runtime-asserted to bit-equal ``value``."""
+
+    name: str
+    value: float
+
+    def __init__(self, name: str, value: float) -> None:
+        self.name = name
+        self.value = float(value)
+
+
 class TupleParam(Param):
     """Tuple parameter.
 
@@ -410,6 +457,12 @@ def format_param_type(param: Param) -> str:
         return "DataPointer"
     elif isinstance(param, ConstNone):
         return "None"
+    elif isinstance(param, ConstInt):
+        return f"Int({param.value})"
+    elif isinstance(param, ConstBool):
+        return f"Bool({param.value})"
+    elif isinstance(param, ConstFloat):
+        return f"Float({param.value})"
     elif isinstance(param, TupleParam):
         # Recursively format tuple elements
         element_types = [format_param_type(p) for p in param.params]
diff --git a/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/tvm_ffi_builder.py b/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/tvm_ffi_builder.py
index 6cfe2fc9c..418ed2480 100644
--- a/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/tvm_ffi_builder.py
+++ b/python/CuTeDSL/cutlass/base_dsl/tvm_ffi_builder/tvm_ffi_builder.py
@@ -13,7 +13,7 @@
 
 from collections.abc import Sequence
 from enum import IntEnum
-from typing import Callable, Optional, Union
+from typing import Callable, Literal, Optional, Union
 
 try:
     import tvm_ffi
@@ -599,12 +599,11 @@ class TVMFFIBuilder(MLIRBuilder):
             )
             if index > 0:
                 # try to stay in constant compute as much as possible
-                if isinstance(expected_shape[index], int) and isinstance(
+                expected_shape_i = expected_shape[index]
+                if isinstance(expected_shape_i, int) and isinstance(
                     expected_stride, int
                 ):
-                    expected_stride = (
-                        expected_shape[index] * expected_stride  # type: ignore[operator]
-                    )
+                    expected_stride = expected_shape_i * expected_stride
                 else:
                     if isinstance(expected_stride, int):
                         expected_stride = self.i64(expected_stride)
@@ -617,6 +616,83 @@ class TVMFFIBuilder(MLIRBuilder):
                         expected_stride = self.mul(loaded_shape[index], expected_stride)
         return cond
 
+    def get_or_create_set_raised_cuda_error_helper(self, error_code_prefix: str) -> str:
+        """Get or create a helper that raises CUDADialectError from an i32 code."""
+        helper_name = "__tvm_ffi_set_raised_cuda_error"
+        if self.find_func_in_module(self.module, helper_name):
+            return helper_name
+
+        sprintf_type = ir.Type.parse("!llvm.func<i32 (!llvm.ptr, !llvm.ptr, ...)>")
+        if not self.find_func_in_module(self.module, "sprintf"):
+            with ir.InsertionPoint(self.module.body):  # type: ignore[union-attr]
+                func_op = llvm.func(
+                    "sprintf",
+                    function_type=ir.TypeAttr.get(sprintf_type),
+                )
+                func_op.attributes["llvm.linkage"] = ir.StringAttr.get("external")
+        format_symbol = self.define_global_string(content="%d")
+        error_kind_symbol = self.define_global_string(content="CUDADialectError")
+        error_prefix_symbol = self.define_global_string(content=error_code_prefix)
+        set_error_from_parts_helper = self.get_or_create_set_raised_from_cstr_parts(2)
+
+        with ir.InsertionPoint(self.module.body):  # type: ignore[union-attr]
+            params, entry_block = self.function(
+                name=helper_name,
+                params_type=[self.i32_type],
+                ret_type=self.void_type,
+                internal=True,
+                llvm_func_attrs=["noinline"],
+            )
+            (value,) = params
+            with ir.InsertionPoint(entry_block):
+                buffer = llvm.alloca(
+                    res=self.ptr_type,
+                    elem_type=self.i8_type,
+                    array_size=self.i32(12),
+                    alignment=1,
+                )
+                llvm.call(
+                    result=self.i32_type,
+                    callee="sprintf",
+                    callee_operands=[
+                        buffer,
+                        self.address_of(format_symbol, self.ptr_type),
+                        value,
+                    ],
+                    var_callee_type=ir.TypeAttr.get(sprintf_type),
+                    op_bundle_sizes=[],
+                    op_bundle_operands=[],
+                )
+
+                llvm.call(
+                    result=None,
+                    callee=set_error_from_parts_helper,
+                    callee_operands=[
+                        self.address_of(error_kind_symbol, self.ptr_type),
+                        self.i32(2),
+                        self.address_of(error_prefix_symbol, self.ptr_type),
+                        buffer,
+                    ],
+                    op_bundle_sizes=[],
+                    op_bundle_operands=[],
+                )
+                self.return_()
+
+        return helper_name
+
+    def raise_cuda_error_and_return(
+        self, code: ir.Value, error_code_prefix: str
+    ) -> None:
+        """Raise CUDADialectError and return -1 from the current TVM-FFI wrapper."""
+        llvm.call(
+            result=None,
+            callee=self.get_or_create_set_raised_cuda_error_helper(error_code_prefix),
+            callee_operands=[code],
+            op_bundle_sizes=[],
+            op_bundle_operands=[],
+        )
+        self.return_(self.i32(-1))
+
     def get_or_create_set_raised_from_cstr_parts(self, num_parts: int) -> str:
         r"""Get or create a helper function to call TVMFFIErrorSetRaisedFromCStrParts.
 
@@ -935,6 +1011,13 @@ class TVMFFIFunctionBuilder(TVMFFIBuilder):
             result_type = self.f64_type
         elif param.dtype.bits == 32:
             result_type = self.f32_type
+        elif (
+            hasattr(tvm_ffi._dtype.DataTypeCode, "BFLOAT")
+            and param.dtype.type_code == tvm_ffi._dtype.DataTypeCode.BFLOAT
+        ):
+            result_type = self.bf16_type
+        elif param.dtype.bits == 16:
+            result_type = self.f16_type
         else:
             raise ValueError(f"Unsupported Var dtype: {param.dtype}")
 
@@ -962,12 +1045,7 @@ class TVMFFIFunctionBuilder(TVMFFIBuilder):
             v_float64: ir.Value = self.load_ffi_any_array_item_v_float64(
                 args, arg_index
             )
-            if param.dtype.bits == 64:
-                float_result = v_float64
-            elif param.dtype.bits == 32:
-                float_result = llvm.fptrunc(res=self.f32_type, arg=v_float64)
-            else:
-                raise ValueError(f"Unsupported Var dtype: {param.dtype}")
+            float_result = self.fptrunc(v_float64, result_type)
             self.br(result_block, args=[float_result])
 
         # In int/bool check block, verify it's actually int or bool
@@ -982,16 +1060,8 @@ class TVMFFIFunctionBuilder(TVMFFIBuilder):
         # Handle int or bool type (convert to float)
         with ir.InsertionPoint(int_bool_block):
             v_int64: ir.Value = self.load_ffi_any_array_item_v_int64(args, arg_index)
-            # Convert int64 to float64 first, then to target type
             v_float64_from_int = llvm.sitofp(res=self.f64_type, arg=v_int64)
-            if param.dtype.bits == 64:
-                int_bool_result = v_float64_from_int
-            elif param.dtype.bits == 32:
-                int_bool_result = llvm.fptrunc(
-                    res=self.f32_type, arg=v_float64_from_int
-                )
-            else:
-                raise ValueError(f"Unsupported Var dtype: {param.dtype}")
+            int_bool_result = self.fptrunc(v_float64_from_int, result_type)
             self.br(result_block, args=[int_bool_result])
 
         # Error block
@@ -1073,38 +1143,119 @@ class TVMFFIFunctionBuilder(TVMFFIBuilder):
 
         return current_block
 
-    def decode_param_const_none(
+    def decode_param_const(
         self,
         current_block: ir.Block,
-        param: spec.ConstNone,
+        param: spec.Param,
         args: ir.Value,
         arg_index: int,
         arg_context: ArgContext,
     ) -> ir.Block:
-        """Decode the opaque handle parameter at the given index."""
-        # read the type index
+        """Decode a ``Const*`` param: assert the FFI arg matches the literal
+        captured at compile time.
+
+        Two-step shape shared by all four constexpr kinds:
+          1. Type-index check → TypeError (``expected <kind>``).
+          2. Value equality check → ValueError (``expected <value>``).
+             Skipped for ConstNone since None has no payload.
+
+        The Const slot is not forwarded into the llvm.call — it exists only
+        so tvm-ffi's ``unpack_dataclass_to_tuple`` arity matches the wrapper
+        signature, and so we catch mismatched caller values early.
+        """
+        # Per-kind: accepted FFI type indices, kind name for the error
+        # message, how to load + compare the payload, and how to render the
+        # expected literal in the error message. ``expected_repr`` is paired
+        # with ``payload_kind`` so the display value tracks the Python type
+        # (e.g. bool → True/False) rather than the wire value (int 1/0).
+        accepted: tuple[TVMFFITypeIndex, ...]
+        payload_kind: (
+            tuple[Literal["int"], int] | tuple[Literal["float"], float] | None
+        )
+        expected_repr: bool | int | float
+        if isinstance(param, spec.ConstNone):
+            accepted = (TVMFFITypeIndex.kTVMFFINone,)
+            kind_name = "None"
+            payload_kind = None
+        elif isinstance(param, spec.ConstBool):
+            # bool is strict: reject kTVMFFIInt even when the value would match
+            accepted = (TVMFFITypeIndex.kTVMFFIBool,)
+            kind_name = "bool"
+            # True→1, False→0 via int(bool) — same wire format as ConstInt
+            payload_kind = ("int", int(param.value))
+            expected_repr = bool(param.value)
+        elif isinstance(param, spec.ConstInt):
+            # int/bool share the v_int64 payload — accept either
+            accepted = (TVMFFITypeIndex.kTVMFFIInt, TVMFFITypeIndex.kTVMFFIBool)
+            kind_name = "int"
+            payload_kind = ("int", int(param.value))
+            expected_repr = int(param.value)
+        elif isinstance(param, spec.ConstFloat):
+            accepted = (TVMFFITypeIndex.kTVMFFIFloat,)
+            kind_name = "float"
+            payload_kind = ("float", float(param.value))
+            expected_repr = float(param.value)
+        else:
+            raise ValueError(
+                f"decode_param_const: unsupported Const param: {type(param).__name__}"
+            )
+
+        # Step 1 — type-index check.
         with ir.InsertionPoint(current_block):
             type_index: ir.Value = self.load_ffi_any_array_item_type_index(
                 args, arg_index
             )
-            # Check if type is a nullptr
-            is_nullptr = self.equal(type_index, self.i32(TVMFFITypeIndex.kTVMFFINone))
-
-        expect_message = ", expected None"
-
-        # Break error message into reusable parts for better string deduplication
+            type_ok: ir.Value = self.equal(type_index, self.i32(accepted[0]))
+            for tx in accepted[1:]:
+                type_ok = self.or_(
+                    type_ok, self.equal(type_index, self.i32(tx))
+                )
         current_block = self.check_condition(
             current_block,
-            lambda: is_nullptr,
+            lambda: type_ok,
             "TypeError",
             [
                 "Mismatched type ",
                 *arg_context.get(),
                 self._fn_call_context,
-                expect_message,
+                f", expected {kind_name}",
+            ],
+        )
+
+        # Step 2 — value check (None has no payload).
+        if payload_kind is None:
+            return current_block
+        value_ok: ir.Value
+        with ir.InsertionPoint(current_block):
+            if payload_kind[0] == "int":
+                expected_int = payload_kind[1]
+                v_int64: ir.Value = self.load_ffi_any_array_item_v_int64(
+                    args, arg_index
+                )
+                value_ok = self.equal(v_int64, self.i64(expected_int))
+            else:  # "float"
+                expected_float = payload_kind[1]
+                v_float64: ir.Value = self.load_ffi_any_array_item_v_float64(
+                    args, arg_index
+                )
+                expected_const = llvm.ConstantOp(
+                    self.f64_type,
+                    ir.FloatAttr.get(self.f64_type, expected_float),
+                ).res
+                value_ok = llvm.fcmp(
+                    llvm.FCmpPredicate.oeq, v_float64, expected_const
+                )
+        return self.check_condition(
+            current_block,
+            lambda: value_ok,
+            "ValueError",
+            [
+                "Mismatched constexpr value ",
+                *arg_context.get(),
+                self._fn_call_context,
+                f", expected {expected_repr}",
             ],
         )
-        return current_block
 
     def check_int_value_dtype_bound(
         self,
@@ -1671,8 +1822,23 @@ class TVMFFIFunctionBuilder(TVMFFIBuilder):
 
         # check dtype
         def dtype_equal() -> ir.Value:
-            # check dtype (code, bits, lanes)
-            dtype_code_match = self.equal(dtype_code, self.i8(param.dtype.type_code))
+            # MLIR integers are signless (i8, i16, i32, i64), so a function compiled
+            # for int8 is bit-identical to one compiled for uint8. Accept both signed
+            # (kDLInt=0) and unsigned (kDLUInt=1) integer dtype codes so that cached
+            # compiled functions work regardless of which signedness first populated
+            # the JIT cache.
+            if param.dtype.type_code in (
+                tvm_ffi._dtype.DataTypeCode.INT,
+                tvm_ffi._dtype.DataTypeCode.UINT,
+            ):
+                dtype_code_match = self.or_(
+                    self.equal(dtype_code, self.i8(tvm_ffi._dtype.DataTypeCode.INT)),
+                    self.equal(dtype_code, self.i8(tvm_ffi._dtype.DataTypeCode.UINT)),
+                )
+            else:
+                dtype_code_match = self.equal(
+                    dtype_code, self.i8(param.dtype.type_code)
+                )
             dtype_bits_match = self.equal(dtype_bits, self.i8(param.dtype.bits))
             dtype_lanes_match = self.equal(dtype_lanes, self.i16(param.dtype.lanes))
             return self.and_(
@@ -1940,6 +2106,13 @@ class TVMFFIFunctionBuilder(TVMFFIBuilder):
                 return self.decode_param_float(
                     current_block, param, args, arg_index, arg_context
                 )
+            elif (
+                hasattr(tvm_ffi._dtype.DataTypeCode, "BFLOAT")
+                and param.dtype.type_code == tvm_ffi._dtype.DataTypeCode.BFLOAT
+            ):
+                return self.decode_param_float(
+                    current_block, param, args, arg_index, arg_context
+                )
             elif param.dtype.type_code == tvm_ffi._dtype.DataTypeCode.HANDLE:
                 return self.decode_param_opaque_handle(
                     current_block, param, args, arg_index, arg_context
@@ -1965,8 +2138,11 @@ class TVMFFIFunctionBuilder(TVMFFIBuilder):
             return self.decode_param_data_pointer(
                 current_block, param, args, arg_index, arg_context
             )
-        elif isinstance(param, spec.ConstNone):
-            return self.decode_param_const_none(
+        elif isinstance(
+            param,
+            (spec.ConstNone, spec.ConstBool, spec.ConstInt, spec.ConstFloat),
+        ):
+            return self.decode_param_const(
                 current_block, param, args, arg_index, arg_context
             )
         elif isinstance(param, spec.TupleParam):
diff --git a/python/CuTeDSL/cutlass/base_dsl/typing.py b/python/CuTeDSL/cutlass/base_dsl/typing.py
index 60108b5f4..db38b9b4b 100644
--- a/python/CuTeDSL/cutlass/base_dsl/typing.py
+++ b/python/CuTeDSL/cutlass/base_dsl/typing.py
@@ -14,7 +14,10 @@ from itertools import chain
 import numpy as np
 import operator
 from typing import (
+    TYPE_CHECKING,
+    Annotated,
     Callable,
+    ClassVar,
     Generic,
     Optional,
     Protocol,
@@ -28,14 +31,13 @@ from typing import (
 
 from .common import *
 from .common import DSLRuntimeError as DSLRuntimeError
-from ._mlir_helpers import arith as arith_helper
-from ._mlir_helpers.arith import ArithValue
-from ._mlir_helpers.op import dsl_user_op
+from .._mlir_helpers import arith as arith_helper
+from .._mlir_helpers.arith import ArithValue
+from .._mlir_helpers.op import dsl_user_op
 
 from .._mlir import ir
 from .._mlir.extras import types as T
 from .._mlir.dialects import arith
-
 # =============================================================================
 # Dynamic Expression Protocol
 # =============================================================================
@@ -257,6 +259,7 @@ def get_mlir_types(obj: Any) -> list[ir.Type]:
     return []
 
 
+
 def implements_jit_argument(obj: Any, *, partial: bool = False) -> bool:
     """
     Check if the object implements the JitArgument protocol.
@@ -340,7 +343,9 @@ class DslType(type):
 class NumericMeta(DslType):
     """Metaclass for numeric types providing width and numpy dtype information.
 
-    :param width: Bit width of the numeric type, defaults to 8
+    :param width: Bit width of one storage unit, defaults to 8.
+        For unpacked dtypes this is the element width. Packed view dtypes
+        use the width of one packed tensor element.
     :type width: int
     :param np_dtype: Corresponding NumPy dtype
     :type np_dtype: numpy.dtype, optional
@@ -348,7 +353,6 @@ class NumericMeta(DslType):
     :type mlir_type: Any, optional
     :param is_abstract: Whether the type is abstract, defaults to False
     :type is_abstract: bool, optional
-
     :ivar width: Bit width of the numeric type
     :type width: int
     :ivar _np_dtype: Corresponding NumPy dtype
@@ -404,6 +408,11 @@ class NumericMeta(DslType):
         new_cls._np_dtype = np_dtype
         return new_cls
 
+    def n_bytes(cls, n_elements: int) -> int:
+        """Return the storage byte count for ``n_elements`` dtype elements."""
+
+        return n_elements * cls.bytes
+
     @property
     def numpy_dtype(cls) -> Optional[type]:
         return cls._np_dtype
@@ -604,7 +613,9 @@ class FloatMeta(NumericMeta):
     This metaclass provides type system infrastructure for floating-point types in the DSL,
     handling MLIR type mappings and NumPy type conversions.
 
-    :param width: Bit width of the float type, defaults to 32
+    :param width: Bit width of one storage unit, defaults to 32. Packed view
+        dtypes may use a wider storage-unit width here than their logical
+        floating-point element width.
     :type width: int
     :param mlir_type: Corresponding MLIR type, defaults to None
     :type mlir_type: Any, optional
@@ -629,7 +640,14 @@ class FloatMeta(NumericMeta):
     ) -> Any:
         np_dtype = getattr(np, name.lower(), None)
         new_cls = super().__new__(
-            cls, name, bases, attrs, width, np_dtype, mlir_type, is_abstract
+            cls,
+            name,
+            bases,
+            attrs,
+            width,
+            np_dtype,
+            mlir_type,
+            is_abstract,
         )
         # Extract exponent and mantissa bits from class name if it follows Float<E><M> pattern
         # For example: Float8E4M3 -> exponent_width=4, mantissa_width=3
@@ -800,8 +818,10 @@ def _binary_op_type_promote(
     if a_type == b_type:
         return a, b, a_type
 
-    a_signed = a_type.signed  # type: ignore[attr-defined]
-    b_signed = b_type.signed  # type: ignore[attr-defined]
+    # At this point both must be Integer subclasses (float branch above already returned).
+    assert issubclass(a_type, Integer) and issubclass(b_type, Integer)
+    a_signed = a_type.signed
+    b_signed = b_type.signed
     a_width = a_type.width
     b_width = b_type.width
 
@@ -909,13 +929,13 @@ def _binary_op(
 
         lhs_val: Union[bool, int, float, ir.Value, ArithValue]
         if isinstance(lhs.value, ArithValue) and isinstance(lhs, Integer):
-            lhs_val = lhs.value.with_signedness(lhs.signed)  # type: ignore[attr-defined]
+            lhs_val = lhs.value.with_signedness(lhs.signed)
         else:
             lhs_val = lhs.value
 
         rhs_val: Union[bool, int, float, ir.Value, ArithValue]
         if isinstance(rhs.value, ArithValue) and isinstance(rhs, Integer):
-            rhs_val = rhs.value.with_signedness(rhs.signed)  # type: ignore[attr-defined]
+            rhs_val = rhs.value.with_signedness(rhs.signed)
         else:
             rhs_val = rhs.value
 
@@ -941,6 +961,11 @@ class Numeric(metaclass=NumericMeta, is_abstract=True):
     :vartype value: Union[bool, int, float, Value]
     """
 
+    # Injected by NumericMeta.__new__ on every concrete subclass.
+    width: ClassVar[int]
+    bytes: ClassVar[int]
+    _np_dtype: ClassVar[Optional[type]]
+
     def __init__(
         self,
         value: Union[bool, int, float, Value],
@@ -1669,6 +1694,9 @@ class Integer(Numeric, metaclass=IntegerMeta, mlir_type=T.i32, is_abstract=True)
         a = Int32(c5)  # Treat c5 as int32 bitwise
     """
 
+    # Injected by IntegerMeta.__new__ on every concrete subclass.
+    signed: ClassVar[bool]
+
     def __init__(
         self,
         x: Union[bool, int, float, ir.Value, "Integer", "Float"],
@@ -1677,7 +1705,6 @@ class Integer(Numeric, metaclass=IntegerMeta, mlir_type=T.i32, is_abstract=True)
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
         ty = type(self)
-
         if isinstance(x, (bool, int, float)):
             # Add check for NaN before numpy conversion
             if isinstance(x, float):
@@ -2094,7 +2121,7 @@ class Float16(Float, metaclass=FloatMeta, width=16, mlir_type=T.f16):
         # First convert to numpy float16 to handle the conversion
         f16_val = np.float16(value)
         # Get the raw bits as a 16-bit integer
-        bits: np.uint16 = f16_val.view(np.uint16)
+        bits: int = int(f16_val.view(np.uint16))
         # Create a short (16-bit int) with those bits
         c_val = ctypes.c_short(int(bits))
         return ctypes.cast(ctypes.pointer(c_val), ctypes.c_void_p)
@@ -2113,7 +2140,7 @@ class BFloat16(Float, metaclass=FloatMeta, width=16, mlir_type=T.bf16):
         # First convert the value to float32 bit representation
         f32_val = np.float32(self.value)
         # Get the 32-bit integer representation
-        bits: np.uint32 = f32_val.view(np.uint32)
+        bits = int(f32_val.view(np.uint32))
         # Truncate to 16 bits, keeping the high 16 bits
         bf16_bits = np.uint16(bits >> 16)
         # Create a short (16-bit int) with those bits
@@ -2150,6 +2177,13 @@ class Float6E3M2FN(Float, metaclass=FloatMeta, width=6, mlir_type=T.f6E3M2FN): .
 class Float6E2M3FN(Float, metaclass=FloatMeta, width=6, mlir_type=T.f6E2M3FN): ...
 
 
+
+def _element_precision_width(dtype: Type["Numeric"]) -> int:
+    """Return scalar lane precision for packed narrow-float view dtypes."""
+
+    return dtype.width
+
+
 _unsupported_dst_float_types = [
     Float8E4M3,
     Float8E4M3B11FNUZ,
@@ -2233,10 +2267,15 @@ class TensorMeta(DslType):
 TY = TypeVar("TY")
 
 
-class Constexpr(Generic[TY]):
-    """Value is passed and computed by python interpreter"""
+if TYPE_CHECKING:
+    # Static-analysis: Constexpr[T] is transparent and treated as T.
+    Constexpr = Annotated[TY, "constexpr"]
+else:
 
-    pass
+    class Constexpr(Generic[TY]):
+        """Value is passed and computed by the Python interpreter"""
+
+        pass
 
 
 class align(int):
diff --git a/python/CuTeDSL/cutlass/base_dsl/utils/tree_utils.py b/python/CuTeDSL/cutlass/base_dsl/utils/tree_utils.py
index 2c3f8c6b1..f4a973b00 100644
--- a/python/CuTeDSL/cutlass/base_dsl/utils/tree_utils.py
+++ b/python/CuTeDSL/cutlass/base_dsl/utils/tree_utils.py
@@ -17,7 +17,7 @@ import itertools as it
 from types import SimpleNamespace
 
 from ..typing import as_numeric, Numeric, Constexpr, implements_dynamic_expression
-from .._mlir_helpers.arith import ArithValue
+from ..._mlir_helpers.arith import ArithValue
 from ..common import DSLBaseError
 from ..._mlir import ir
 
@@ -161,9 +161,7 @@ def is_namedtuple_instance(x: Any) -> bool:
     """
     t = type(x)
     return (
-        issubclass(t, tuple)
-        and hasattr(t, "_fields")
-        and isinstance(t._fields, tuple)
+        issubclass(t, tuple) and hasattr(t, "_fields") and isinstance(t._fields, tuple)
     )
 
 
@@ -762,7 +760,7 @@ def _tree_flatten(
         return [x], [ir.DictAttr.get({})], create_leaf_for_value(x, is_numeric=True)
 
     elif implements_dynamic_expression(x) and isinstance(x, ir.Value):
-        # Only for ir.Value subclasses (e.g. ctm.Pointer). Check before plain ir.Value
+        # Only for ir.Value subclasses (e.g. Pointer). Check before plain ir.Value
         # so they are unflattened via __new_from_mlir_values__. Other dynamic
         # expressions (e.g. TmemAllocator with 2 values) use the registered/node path.
         v = _flatten_mlir_values(x.__extract_mlir_values__())
@@ -821,9 +819,7 @@ def _tree_flatten(
 
             if hasattr(x, "__extract_mlir_attributes__"):
                 # If x has extract mlir attributes it overrides the child's default attributes
-                child_attrs_flat = [x.__extract_mlir_attributes__()] * len(
-                    child_attrs_flat
-                )
+                child_attrs_flat = [x.__extract_mlir_attributes__()]
             child_attrs_flattened = it.chain.from_iterable(child_attrs_flat)
             return (
                 flattened,
diff --git a/python/CuTeDSL/cutlass/base_dsl/version_info.py b/python/CuTeDSL/cutlass/base_dsl/version_info.py
index 122ee1933..cd3d7ee22 100644
--- a/python/CuTeDSL/cutlass/base_dsl/version_info.py
+++ b/python/CuTeDSL/cutlass/base_dsl/version_info.py
@@ -9,9 +9,11 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-from typing import Callable
-
-from .common import DSLCudaVersion, DSLRuntimeError, _get_cuda_version
+from .common import (
+    DSLCudaVersion,
+    DSLRuntimeError,
+    _get_cuda_version,
+)
 
 try:
     CUDA_VERSION = DSLCudaVersion(_get_cuda_version())
diff --git a/python/CuTeDSL/cutlass/compiler/__init__.py b/python/CuTeDSL/cutlass/compiler/__init__.py
new file mode 100644
index 000000000..4990b2f8d
--- /dev/null
+++ b/python/CuTeDSL/cutlass/compiler/__init__.py
@@ -0,0 +1,28 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# Use of this software is governed by the terms and conditions of the
+# NVIDIA End User License Agreement (EULA), available at:
+# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
+#
+# Any use, reproduction, disclosure, or distribution of this software
+# and related documentation outside the scope permitted by the EULA
+# is strictly prohibited.
+
+from .._mlir._mlir_libs._cutlass_ir import (
+    ArtifactType,
+    CompilationArtifact,
+    ParamKind,
+    ParamInfo,
+    FunctionInfo,
+    PreCompiledMlirArtifact,
+    CompiledMlirArtifact,
+    LlvmIrArtifact,
+    ObjectArtifact,
+    ExecutableFunction,
+    Executor,
+    Compiler,
+    CuteCompiler,
+    serialize_compilation_artifact,
+    deserialize_compilation_artifact,
+)
diff --git a/python/CuTeDSL/cutlass/cute/__init__.py b/python/CuTeDSL/cutlass/cute/__init__.py
index eadb458bd..91e215cd2 100644
--- a/python/CuTeDSL/cutlass/cute/__init__.py
+++ b/python/CuTeDSL/cutlass/cute/__init__.py
@@ -33,6 +33,9 @@ from .typing import (
     Pointer,
     Tensor,
     SymInt,
+    Numeric,
+    Int32,
+    Int16,
 )
 
 # Import everything else
@@ -151,7 +154,6 @@ from .tensor import (
     ReductionOp,
     make_tensor,
     make_identity_tensor,
-    make_fragment,
     make_fragment_like,
     make_rmem_tensor_like,
     make_rmem_tensor,
@@ -205,6 +207,8 @@ from . import testing
 from . import runtime
 from . import math
 
+from . import experimental
+
 # Export all math ops without "math."
 from .math import *
 
@@ -218,7 +222,9 @@ jit: Callable[..., Any] = _dsl.CuTeDSL.jit
 kernel: Callable[..., Any] = _dsl.CuTeDSL.kernel
 register_jit_arg_adapter = _dsl.JitArgAdapterRegistry.register_jit_arg_adapter
 compile = _dsl.CompileCallable()
+compile_to = compile.compile_to
 OptLevel = _dsl.OptLevel
+
 PtxasOptions = _dsl.PtxasOptions
 EnableAssertions = _dsl.EnableAssertions
 GenerateLineInfo = _dsl.GenerateLineInfo
@@ -227,7 +233,8 @@ KeepPTX = _dsl.KeepPTX
 GPUArch = _dsl.GPUArch
 LinkLibraries = _dsl.LinkLibraries
 EnableTVMFFI = _dsl.EnableTVMFFI
-
+DeviceTarget = _dsl.DeviceTarget
+RDC = _dsl.RDC
 native_struct = _dsl.native_struct
 make_native_struct = _dsl.make_native_struct  # factory for dynamic struct types
 
@@ -263,6 +270,7 @@ __all__ = [
     "assume",
     "is_integer",
     "is_int_tuple",
+    "is_int_tuple_type",
     "is_static",
     "size",
     "has_underscore",
@@ -289,7 +297,6 @@ __all__ = [
     "make_ptr",
     "make_tensor",
     "make_identity_tensor",
-    "make_fragment",
     "make_fragment_like",
     "make_rmem_tensor",
     "make_rmem_tensor_like",
@@ -417,6 +424,9 @@ __all__ = [
     "kernel",
     "register_jit_arg_adapter",
     "compile",
+    "compile_to",
+    "ArtifactType",
+    "PreCompiledMlirArtifact",
     # Foreign function interface
     "ffi",
     "extern",
diff --git a/python/CuTeDSL/cutlass/cute/_tvm_ffi_args_spec_converter.py b/python/CuTeDSL/cutlass/cute/_tvm_ffi_args_spec_converter.py
index 579773cf1..90a07d982 100644
--- a/python/CuTeDSL/cutlass/cute/_tvm_ffi_args_spec_converter.py
+++ b/python/CuTeDSL/cutlass/cute/_tvm_ffi_args_spec_converter.py
@@ -82,6 +82,29 @@ NumericToTVMFFIDtype = {
     Float6E3M2FN: "float6_e3m2fn",
 }
 
+_UNSUPPORTED_TVM_FFI_NUMERIC_TYPES = set[Any]()
+
+def _numeric_to_tvm_ffi_dtype(dtype: type[Numeric]) -> str:
+    if dtype in _UNSUPPORTED_TVM_FFI_NUMERIC_TYPES:
+        raise DSLRuntimeError(
+            f"TVM-FFI does not support packed tensor dtype {dtype.__name__}. "
+            "Packed FP6x4 tensors must not be passed through --enable-tvm-ffi yet."
+        )
+    try:
+        return NumericToTVMFFIDtype[dtype]
+    except KeyError as exc:
+        raise DSLRuntimeError(
+            f"TVM-FFI does not support tensor dtype {dtype.__name__}."
+        ) from exc
+
+
+# Functions which return the MLIR type for the specified CuTe type.
+# The functions take an MLIRBuilder as an argument and return the MLIR type.
+# Note: these are untyped below to avoid additional imports.
+AlternateIrTypeFunctions: Dict[Any, Any] = {
+    TFloat32: lambda builder: builder.i32_type,
+}
+
 AcceptableNumericTypesForScalar = [
     Boolean,
     Int8,
@@ -92,7 +115,10 @@ AcceptableNumericTypesForScalar = [
     Uint32,
     Int64,
     Uint64,
+    Float16,
+    BFloat16,
     Float32,
+    TFloat32,
     Float64,
 ]
 
@@ -111,6 +137,16 @@ def _is_gpu_memspace(
     return memspace != _cute_ir.AddressSpace.generic
 
 
+def _native_struct_type(value: Any) -> Optional[type]:
+    """Return the native_struct class represented by a value or annotation."""
+    if value is None or value is inspect.Parameter.empty:
+        return None
+    if isinstance(value, type):
+        return value if hasattr(value, "_struct_type") else None
+    value_type = type(value)
+    return value_type if hasattr(value_type, "_struct_type") else None
+
+
 class SymIntId:
     def __init__(self, sym_int: SymInt):
         self.sym_int = sym_int
@@ -184,8 +220,49 @@ class ConverterContext:
         return device_id_var
 
 
+def _shape_elem_to_spec(elem: Any, ctx: ConverterContext) -> Any:
+    """Convert one element of a CuTe shape to its spec representation.
+
+    Returns int (static constant), spec.Var (dynamic scalar), or spec.Shape
+    (nested tuple that TVM FFI will see as an Array of ints).
+    """
+    if isinstance(elem, int):
+        return elem
+    elif isinstance(elem, SymInt):
+        return ctx.alloc_or_reuse_symint_var(elem, ctx.alloc_shape_name)
+    elif isinstance(elem, tuple):
+        inner = [_shape_elem_to_spec(e, ctx) for e in elem]
+        return spec.Shape(ctx.alloc_shape_name(), inner)
+    elif isinstance(elem, Numeric):
+        return spec.Var(ctx.alloc_shape_name(), _numeric_to_tvm_ffi_dtype(elem.dtype))
+    else:
+        raise DSLRuntimeError(f"Unexpected element type in cute shape: {type(elem)}")
+
+
+def _convert_cute_shape_arg(
+    arg: Any, arg_name: str, ctx: ConverterContext
+) -> spec.Param:
+    """Convert a CuTe shape/stride/coord argument to a spec.Param.
+
+    Flat shapes map to spec.Shape. Nested shapes map to spec.TupleParam so
+    that the spec mirrors the nested Array structure that TVM FFI produces
+    when the Python tuple is passed at runtime.
+    """
+    converted_elements = [_shape_elem_to_spec(e, ctx) for e in arg]
+    has_nested = any(isinstance(e, tuple) for e in arg)
+    if has_nested:
+        return spec.TupleParam(arg_name, converted_elements)
+    else:
+        return spec.Shape(arg_name, converted_elements)
+
+
 def _convert_single_arg(
-    arg: Any, arg_name: str, arg_type: Any, ctx: ConverterContext
+    arg: Any,
+    arg_name: str,
+    arg_type: Any,
+    ctx: ConverterContext,
+    *,
+    is_constexpr: bool = False,
 ) -> spec.Param:
     """Convert a single argument to a spec.Param.
 
@@ -204,27 +281,50 @@ def _convert_single_arg(
     -------
     spec.Param
         The converted parameter specification.
+
+    When ``is_constexpr=True`` the arg is captured as a compile-time literal
+    via a ``Const*`` param (asserted at runtime by the tvm-ffi wrapper, not
+    forwarded into the llvm.call). Otherwise it takes the normal runtime
+    path — e.g. a bare ``int`` becomes ``spec.Var(..., int32)``.
     """
+    if is_constexpr:
+        if arg is None:
+            return spec.ConstNone(arg_name)
+        # bool must be checked before int (bool is a subclass of int)
+        if isinstance(arg, bool):
+            return spec.ConstBool(arg_name, arg)
+        if isinstance(arg, int):
+            return spec.ConstInt(arg_name, arg)
+        if isinstance(arg, float):
+            return spec.ConstFloat(arg_name, arg)
+        raise DSLRuntimeError(
+            f"Unsupported Constexpr value for {arg_name!r}: {type(arg).__name__}. "
+            "Supported: None, int, bool, float."
+        )
     if arg is None:
         return spec.ConstNone(arg_name)
     elif isinstance(arg, Numeric) and arg.dtype in AcceptableNumericTypesForScalar:
-        return spec.Var(arg_name, NumericToTVMFFIDtype[arg.dtype])
+        param = spec.Var(
+            arg_name,
+            _numeric_to_tvm_ffi_dtype(arg.dtype),
+            alternate_ir_type_fetch_func=AlternateIrTypeFunctions.get(arg.dtype, None),
+        )
+        return param
     elif arg_type in AcceptableNumericTypesForScalar:
-        return spec.Var(arg_name, NumericToTVMFFIDtype[arg_type])
+        param = spec.Var(
+            arg_name,
+            _numeric_to_tvm_ffi_dtype(arg_type),
+            alternate_ir_type_fetch_func=AlternateIrTypeFunctions.get(arg_type, None),
+        )
+        return param
+    elif (native_struct_type := _native_struct_type(arg_type)) is not None or (
+        native_struct_type := _native_struct_type(arg)
+    ) is not None:
+        param = spec.Var(arg_name, "handle")
+        setattr(param, spec.NATIVE_STRUCT_TYPE_ATTR, native_struct_type)
+        return param
     elif is_cute_algebra_type(arg_type):
-        shape = []
-        for i in range(len(arg)):
-            if isinstance(arg[i], int):
-                shape.append(arg[i])
-            elif isinstance(arg[i], SymInt):
-                shape.append(
-                    ctx.alloc_or_reuse_symint_var(arg[i], ctx.alloc_shape_name)
-                )
-            else:
-                shape.append(
-                    spec.Var(ctx.alloc_shape_name(), NumericToTVMFFIDtype[arg[i].dtype])
-                )
-        return spec.Shape(arg_name, shape)
+        return _convert_cute_shape_arg(arg, arg_name, ctx)
     elif isinstance(arg, SymInt):
         if arg.width == 32:
             dtype = NumericToTVMFFIDtype[Int32]
@@ -232,26 +332,30 @@ def _convert_single_arg(
             dtype = NumericToTVMFFIDtype[Int64]
         return spec.Var(arg_name, dtype, divisibility=arg.divisibility)
     elif isinstance(arg, Tensor):
-        shapes = []
+        arg_shape = arg.shape
+        arg_stride = arg.stride
+        assert isinstance(arg_shape, tuple)
+        assert isinstance(arg_stride, tuple)
+        shapes: List[Any] = []
         for i, dyn_mask in enumerate(arg.dynamic_shapes_mask):  # type: ignore[attr-defined]
             if not dyn_mask:
-                shapes.append(arg.shape[i])  # type: ignore[index]
-            elif isinstance(arg.shape[i], SymInt):  # type: ignore[index]
+                shapes.append(arg_shape[i])
+            elif isinstance(arg_shape[i], SymInt):
                 shapes.append(
-                    ctx.alloc_or_reuse_symint_var(arg.shape[i], ctx.alloc_shape_name)  # type: ignore[arg-type, index]
+                    ctx.alloc_or_reuse_symint_var(arg_shape[i], ctx.alloc_shape_name)  # type: ignore[arg-type]
                 )
             else:
                 shapes.append(
                     spec.Var(ctx.alloc_shape_name(), NumericToTVMFFIDtype[Int32])
                 )
-        strides = []
+        strides: List[Any] = []
 
         for i, dyn_mask in enumerate(arg.dynamic_strides_mask):  # type: ignore[attr-defined]
             if not dyn_mask:
-                strides.append(arg.stride[i])  # type: ignore[index]
-            elif isinstance(arg.stride[i], SymInt):  # type: ignore[index]
+                strides.append(arg_stride[i])
+            elif isinstance(arg_stride[i], SymInt):
                 strides.append(
-                    ctx.alloc_or_reuse_symint_var(arg.stride[i], ctx.alloc_stride_name)  # type: ignore[arg-type, index]
+                    ctx.alloc_or_reuse_symint_var(arg_stride[i], ctx.alloc_stride_name)  # type: ignore[arg-type]
                 )
             else:
                 if hasattr(arg, "_use_32bit_stride") and arg._use_32bit_stride:
@@ -270,9 +374,9 @@ def _convert_single_arg(
 
             tvm_ffi_cute_tensor = spec.Tensor(
                 arg_name,
-                shapes,  # type: ignore[arg-type]
+                shapes,
                 arg._tvm_ffi_tensor.dtype,
-                strides=strides,  # type: ignore[arg-type]
+                strides=strides,
                 data_alignment=arg._assumed_align,  # type: ignore[attr-defined]
                 device_type=device_type,
                 device_id=device_id,
@@ -286,9 +390,9 @@ def _convert_single_arg(
 
             tvm_ffi_cute_tensor = spec.Tensor(
                 arg_name,
-                shapes,  # type: ignore[arg-type]
-                NumericToTVMFFIDtype[arg.element_type],  # type: ignore[index]
-                strides=strides,  # type: ignore[arg-type]
+                shapes,
+                _numeric_to_tvm_ffi_dtype(arg.element_type),  # type: ignore[arg-type]
+                strides=strides,
                 data_alignment=arg._assumed_align,  # type: ignore[attr-defined]
                 device_type=device_type,
                 device_id=device_id,
@@ -390,17 +494,19 @@ def _convert_single_arg(
             raise DSLRuntimeError(
                 f"Expected {dc_type.__name__} for argument {arg_name}, got {type(arg)}"
             )
-        dc_fields = dataclass_fields(dc_type)
-        tuple_params = []
-        for f in dc_fields:
-            if is_constexpr_field(f):
-                continue
-            field_value = getattr(arg, f.name)
-            field_name = f"{arg_name}.{f.name}"
-            field_type = f.type
-            tuple_params.append(
-                _convert_single_arg(field_value, field_name, field_type, ctx)
+        # ``Constexpr[...]`` fields are captured as Const* literals (asserted
+        # at runtime, not forwarded into the llvm.call); other fields take
+        # the normal runtime path.
+        tuple_params = [
+            _convert_single_arg(
+                getattr(arg, f.name),
+                f"{arg_name}.{f.name}",
+                f.type,
+                ctx,
+                is_constexpr=is_constexpr_field(f),
             )
+            for f in dataclass_fields(dc_type)
+        ]
         return spec.TupleParam(arg_name, tuple_params)
     elif arg_type is not None and (
         get_origin(arg_type) is UnionType or get_origin(arg_type) is Union
@@ -426,10 +532,18 @@ def _tvm_ffi_args_spec_converter(
     signature: inspect.Signature,
     full_args: List[Any],
     full_kwargs: Dict[str, Any],
-) -> tuple[List[spec.Param], Any]:
+) -> tuple[List[spec.Param], Any, List[str]]:
     """Convert cute algebra args to tvm ffi spec params.
 
-    This function converts the cute arguments specs to tvm ffi spec params.
+    Returns
+    -------
+    (params, kwargs_wrapper_spec, map_dataclass_to_tuple)
+        ``map_dataclass_to_tuple`` lists the top-level arg names whose values
+        are dataclass instances. These must be unpacked to nested tuples via
+        tvm-ffi's ``unpack_dataclass_to_tuple`` before the FFI call; the spec
+        we just built already includes every field (with ``Const*`` for
+        constexpr fields). This stays a tvm-ffi concern and intentionally
+        lives outside ``KwargsWrapperSpec``.
     """
     exec_args = ExecutionArgs(signature, function_name)
     rectified_args = exec_args.get_rectified_args_from_original_args(
@@ -438,6 +552,7 @@ def _tvm_ffi_args_spec_converter(
     params = []
     ctx = ConverterContext()
     wrapper_extra_exclude_arg_names = []
+    map_dataclass_to_tuple: List[str] = []
 
     for arg, parameter in zip(rectified_args, exec_args.signature.parameters.values()):
         arg_type = parameter.annotation
@@ -446,10 +561,16 @@ def _tvm_ffi_args_spec_converter(
         params.append(param)
         if isinstance(param, spec.EnvStream):
             wrapper_extra_exclude_arg_names.append(arg_name)
+        # Covers both plain ``op: AddOp`` and ``op: AddOp | MulOp`` since we
+        # check the runtime type when the annotation isn't itself a dataclass.
+        if (
+            isinstance(arg_type, type) and is_dataclass(arg_type)
+        ) or is_dataclass(type(arg)):
+            map_dataclass_to_tuple.append(arg_name)
     kwargs_wrapper_spec = exec_args.get_kwargs_wrapper_spec(
         wrapper_extra_exclude_arg_names
     )
-    return params, kwargs_wrapper_spec
+    return params, kwargs_wrapper_spec, map_dataclass_to_tuple
 
 
 def attach_args_spec_converter(dsl: Any) -> None:
diff --git a/python/CuTeDSL/cutlass/cute/algorithm.py b/python/CuTeDSL/cutlass/cute/algorithm.py
index a6a99ea0c..8e777c16f 100644
--- a/python/CuTeDSL/cutlass/cute/algorithm.py
+++ b/python/CuTeDSL/cutlass/cute/algorithm.py
@@ -15,6 +15,7 @@ from typing import Optional, Dict, Any, List, Tuple, Type, Union
 
 from cutlass._mlir import ir
 from cutlass.cutlass_dsl import (
+    BaseDSL,
     for_generate,
     yield_out,
     if_generate,
@@ -24,7 +25,17 @@ from cutlass.cutlass_dsl import (
 import cutlass._mlir.dialects.cute as _cute_ir
 import cutlass._mlir.dialects.cute_nvgpu as _cute_nvgpu_ir
 
-from .typing import Numeric, Tensor, Int64, Int16, AddressSpace
+from .typing import (
+    Numeric,
+    Pointer,
+    Tensor,
+    Boolean,
+    Int8,
+    Int64,
+    Int16,
+    AddressSpace,
+    is_int_tuple_type,
+)
 from .core import (
     rank,
     is_static,
@@ -50,6 +61,7 @@ from .nvgpu.common import (
     CopyR2GOp,
     CopyS2ROp,
     CopyR2SOp,
+    OpError,
 )
 
 
@@ -87,9 +99,18 @@ def gemm(
 
     - For regular GEMM, `a` and `b` contain the GEMM A and B tensors respectively.
     - For GEMM with auxiliary operands, `a` and `b` contain the GEMM A and B tensors followed by
-      their respective auxiliary tensors. For example:
+      their respective auxiliary tensors.
 
-      - For BlockScaledGemm, `a` = [A, SFA] and `b` = [B, SFB].
+    Examples of auxiliary operands:
+
+    - For BlockScaledGemm, `a` = [A, SFA] and `b` = [B, SFB].
+    - For SparseGemm, `a` = [A, E] and `b` = [B].
+    - For BlockScaledSparseGemm, `a` = [A, SFA, E] and `b` = [B, SFB].
+
+    Runtime keyword arguments in ``kwargs`` are forwarded to the underlying MMA atom trait.
+    For SM100 tcgen05 MMA atoms, ``disable_output_lane`` provides a per-lane
+    write-disable mask for ``tcgen05.mma.disable_output_lane`` lowering.
+    The expected lane count is 4 for ``cta_group::1`` and 8 for ``cta_group::2``.
 
     :param atom: MMA atom
     :type atom: MmaAtom
@@ -230,9 +251,12 @@ def basic_copy(
     """
 
     if is_static(src.shape) and is_static(dst.shape):
+        # Boolean (i1) is stored as i8 in memory; use i8 for the copy atom
+        src_elem_type = Int8 if src.element_type is Boolean else src.element_type
+        assert not is_int_tuple_type(src_elem_type)
         simt_copy_ty = _cute_nvgpu_ir.CopyAtomSIMTSyncCopyType.get(
-            src.element_type.mlir_type,  # type: ignore[union-attr]
-            src.element_type.width,  # type: ignore[union-attr]
+            src_elem_type.mlir_type,
+            src_elem_type.width,
         )
         simt_copy = make_atom(simt_copy_ty, loc=loc, ip=ip)
         return _cute_ir.copy(simt_copy, [src.value], [dst.value], loc=loc, ip=ip)
@@ -263,7 +287,10 @@ def basic_copy_if(
     is fully unrolled.
 
     """
-    if src.element_type.width != dst.element_type.width:  # type: ignore[union-attr]
+    src_elem_type = Int8 if src.element_type is Boolean else src.element_type
+    dst_elem_type = Int8 if dst.element_type is Boolean else dst.element_type
+    assert not is_int_tuple_type(src_elem_type) and not is_int_tuple_type(dst_elem_type)
+    if src_elem_type.width != dst_elem_type.width:
         raise NotImplementedError(
             "basic_copy_if currently only supports equal source and destination "
             "element type bit width"
@@ -318,10 +345,13 @@ def autovec_copy(
     copy op.
 
     """
-    if src.element_type.width != dst.element_type.width:  # type: ignore[union-attr]
+    src_elem_type = Int8 if src.element_type is Boolean else src.element_type
+    dst_elem_type = Int8 if dst.element_type is Boolean else dst.element_type
+    assert not is_int_tuple_type(src_elem_type) and not is_int_tuple_type(dst_elem_type)
+    if src_elem_type.width != dst_elem_type.width:
         raise NotImplementedError(
             "autovec_copy only supports equal source and destination "
-            f"element type bit widths, got {src.element_type} and {dst.element_type}"
+            f"element type bit widths, got {src_elem_type} and {dst_elem_type}"
         )
 
     # We are going to dispatch to copy-with-atom which requires shapes to be static
@@ -339,18 +369,21 @@ def autovec_copy(
 
     upper_bound = math.gcd(src.layout.max_alignment, dst.layout.max_alignment)  # type: ignore[union-attr]
     upper_bound = math.gcd(upper_bound, num_common_elements)
-    upper_bound *= src.element_type.width  # type: ignore[union-attr]
+    upper_bound *= src_elem_type.width
 
     # For our instructions, the alignment of the pointer is an upper bound to the vector width
     # max_alignment, as opposed to alignment, takes into account possible address swizzling
-    upper_bound = math.gcd(upper_bound, src.iterator.max_alignment * 8)  # type: ignore[union-attr]
-    upper_bound = math.gcd(upper_bound, dst.iterator.max_alignment * 8)  # type: ignore[union-attr]
+    src_iter = src.iterator
+    dst_iter = dst.iterator
+    assert isinstance(src_iter, Pointer) and isinstance(dst_iter, Pointer)
+    upper_bound = math.gcd(upper_bound, src_iter.max_alignment * 8)
+    upper_bound = math.gcd(upper_bound, dst_iter.max_alignment * 8)
 
     # Finally, we put a cap at 256b
     num_bits_per_copy = math.gcd(upper_bound, 256)
 
     if (num_common_elements > 1) and (num_bits_per_copy % 8 == 0):
-        num_common_elements = num_bits_per_copy // src.element_type.width  # type: ignore[union-attr]
+        num_common_elements = num_bits_per_copy // src_elem_type.width
 
         # 2 step logical divides ensuring that the divides are valid at every step
         vec_src = logical_divide(src, vec_layout, loc=loc, ip=ip)
@@ -369,10 +402,10 @@ def autovec_copy(
             mem_attrs["l1c_evict_priority"] = l1c_evict_priority
 
         simt_copy_atom = _make_copy_atom(
-            src.element_type,  # type: ignore[arg-type]
+            src_elem_type,
             num_bits_per_copy,
-            src.iterator.memspace,  # type: ignore[union-attr]
-            dst.iterator.memspace,  # type: ignore[union-attr]
+            src_iter.memspace,
+            dst_iter.memspace,
             loc=loc,
             ip=ip,
             **mem_attrs,
@@ -544,10 +577,12 @@ def copy(
         dst_primary.type,  # type: ignore[attr-defined]
         _cute_ir.MemRefType,
     ):
-        if (
-            len(dst_list) == 1
-            and src_primary.element_type.width != dst_primary.element_type.width  # type: ignore[union-attr]
-        ):
+        src_p_elem_type = src_primary.element_type
+        dst_p_elem_type = dst_primary.element_type
+        assert not is_int_tuple_type(src_p_elem_type) and not is_int_tuple_type(
+            dst_p_elem_type
+        )
+        if len(dst_list) == 1 and src_p_elem_type.width != dst_p_elem_type.width:
             raise TypeError(
                 "`copy` currently only supports equal source and destination "
                 "element type bit width"
@@ -574,8 +609,12 @@ def copy(
     # Recompute primary references after canonicalization
     src_primary = src_list[0]
     dst_primary = dst_list[0]
+    src_primary_shape = src_primary.shape
+    dst_primary_shape = dst_primary.shape
+    assert isinstance(src_primary_shape, tuple)
+    assert isinstance(dst_primary_shape, tuple)
 
-    if is_static(src_primary.shape[1]) and is_static(dst_primary.shape[1]):  # type: ignore[index]
+    if is_static(src_primary_shape[1]) and is_static(dst_primary_shape[1]):
         if size(src_primary, mode=[1]) != size(dst_primary, mode=[1]):
             raise ValueError(
                 "Expected source and destination tensors to have the same size in mode-1, "
@@ -585,7 +624,7 @@ def copy(
     multicast_attr_pairs = _parse_tma_multicast_args(kwargs)
 
     # Unroll the loop per specified unroll_factor for static RestM case
-    if is_static(src_primary.shape[1]) and unroll_factor is not None:  # type: ignore[index]
+    if is_static(src_primary_shape[1]) and unroll_factor is not None:
         unroll_factor = LoopUnroll(count=unroll_factor)
         for i in for_generate(
             0,
@@ -618,7 +657,7 @@ def copy(
 @dsl_user_op
 def prefetch(
     atom: CopyAtom,
-    src: Tensor,
+    src: Union[Tensor, List[Tensor], Tuple[Tensor, ...]],
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
@@ -631,16 +670,32 @@ def prefetch(
 
     Prefetch accepts Copy Atom but not all are allowed. Currently, only supports TMA prefetch.
 
+    For standard TMA modes (tiled, im2col), pass a single GMEM tensor:
+
     .. code-block:: python
 
-        cute.prefetch(tma_prefetch, src)
+        cute.prefetch(tma_prefetch, tAgA)
+
+    For 2D ``tile::gather4`` mode, pass a list ``[data_tensor, gmem_index_tensor]``
+    (mirrors :func:`cute.copy` for the same mode); see :func:`cute.nvgpu.cpasync.tma_partition`
+    for the gather4 layout conventions:
+
+    .. code-block:: python
+
+        cute.prefetch(tma_atom, [tAgA, tAgI])
+
+    Pass the whole multi-stage partitioned tensors — the lowering's internal loop walks every
+    rest-dimension entry, so a single ``cute.prefetch`` call covers every stage and lives outside
+    any per-stage loop.
 
     For Copy Atoms that require single-threaded execution, the copy op automatically handles thread
     election internally. Manual thread selection is not required in such cases.
     """
+    src_list = _normalize_variadic_tensor_operand(src, "src")
+
     dummy_tma_bar_ptr = make_ptr(Int64, 0, AddressSpace.smem, loc=loc, ip=ip)
     dummy_mcast_mask = Int16(0)
     value = atom._unpack(
         loc=loc, ip=ip, tma_bar_ptr=dummy_tma_bar_ptr, mcast_mask=dummy_mcast_mask
     )
-    return _cute_ir.prefetch(value, src.value, loc=loc, ip=ip)
+    return _cute_ir.prefetch(value, [t.value for t in src_list], loc=loc, ip=ip)
diff --git a/python/CuTeDSL/cutlass/cute/arch/__init__.py b/python/CuTeDSL/cutlass/cute/arch/__init__.py
index 16a24da08..545b5b965 100644
--- a/python/CuTeDSL/cutlass/cute/arch/__init__.py
+++ b/python/CuTeDSL/cutlass/cute/arch/__init__.py
@@ -9,6 +9,7 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
+from .constants import *
 from .elect import *
 from .mbar import *
 from .nvvm_wrappers import *
@@ -19,13 +20,6 @@ from .clc import *
 
 import cutlass.cutlass_dsl as cutlass_dsl
 
-# Forward from auto-generated nvvm python: only export on 12.9 wheel
-_nvvm_forward_exports_12_9 = (
-    ["ProxyKind", "SharedSpace", "RoundingModeKind", "ReduxKind", "AtomicOpKind"]
-    if cutlass_dsl.target_version(exact_version="12.9")
-    else []
-)
-
 # __all__ is required here for documentation generation
 __all__ = [
     #
@@ -44,6 +38,7 @@ __all__ = [
     "mbarrier_try_wait",
     "mbarrier_conditional_try_wait",
     "mbarrier_arrive",
+    "mbarrier_test_wait",
     #
     # nvvm_wrappers.py
     #
@@ -119,6 +114,7 @@ __all__ = [
     "cvt_i8x2_to_bf16x2",
     "cvt_i8x4_to_bf16x4",
     "cvt_f32x2_bf16x2",
+    "warp_redux_sync",
     "smid",
     "nsmid",
     "clock",
@@ -149,15 +145,19 @@ __all__ = [
     "sub_sat_int",
     "lop3",
     "shf",
-    # Constants
+    #
+    # constants.py
+    #
     "WARP_SIZE",
-    *_nvvm_forward_exports_12_9,
+    "WARPS_PER_WARPGROUP",
+    "THREADS_PER_WARPGROUP",
     #
     # smem.py
     #
     "alloc_smem",
     "get_dyn_smem",
     "get_dyn_smem_size",
+    "store_async_dsmem",
     #
     # tmem.py
     #
diff --git a/python/CuTeDSL/cutlass/cute/arch/clc.py b/python/CuTeDSL/cutlass/cute/arch/clc.py
index 88caa9d73..55a877a93 100644
--- a/python/CuTeDSL/cutlass/cute/arch/clc.py
+++ b/python/CuTeDSL/cutlass/cute/arch/clc.py
@@ -37,20 +37,23 @@ def issue_clc_query(
     :type mbar_ptr:  Pointer
     :param clc_response_ptr: A pointer to the cluster launch control response address in SMEM
     :type clc_response_ptr:  Pointer
+    :param multicast: Whether to use the multicast variant (default: True)
+    :type multicast:  bool
     """
     mbar_llvm_ptr = mbar_ptr.llvm_ptr
     clc_response_llvm_ptr = clc_response_ptr.llvm_ptr
     if multicast:
-        _nvvm.clusterlaunchcontrol_try_cancel_multicast(
-            clc_response_llvm_ptr,
-            mbar_llvm_ptr,
+        _nvvm.clusterlaunchcontrol_try_cancel(
+            smem_address=clc_response_llvm_ptr,
+            mbarrier=mbar_llvm_ptr,
+            multicast="multicast",
             loc=loc,
             ip=ip,
         )
     else:
         _nvvm.clusterlaunchcontrol_try_cancel(
-            clc_response_llvm_ptr,
-            mbar_llvm_ptr,
+            smem_address=clc_response_llvm_ptr,
+            mbarrier=mbar_llvm_ptr,
             loc=loc,
             ip=ip,
         )
@@ -92,30 +95,38 @@ def clc_response(
     )
     # Query if the cluster was canceled
     # res parameter expects an MLIR Type, and returns the actual OpResult value
-    pred = _nvvm.clusterlaunchcontrol_query_cancel_is_canceled(
-        clc_result_i128,
+    pred = _nvvm.clusterlaunchcontrol_query_cancel(
+        res=ir.IntegerType.get_signless(1),
+        query_type="is_canceled",
+        try_cancel_response=clc_result_i128,
         loc=loc,
         ip=ip,
     )
     is_valid = Int32(pred)
 
     # Get first CTA ID x component
-    m_idx_i32 = _nvvm.clusterlaunchcontrol_query_cancel_get_first_ctaid_x(
-        clc_result_i128,
+    m_idx_i32 = _nvvm.clusterlaunchcontrol_query_cancel(
+        res=ir.IntegerType.get_signless(32),
+        query_type="get_first_cta_id_x",
+        try_cancel_response=clc_result_i128,
         loc=loc,
         ip=ip,
     )
 
     # Get first CTA ID y component
-    n_idx_i32 = _nvvm.clusterlaunchcontrol_query_cancel_get_first_ctaid_y(
-        clc_result_i128,
+    n_idx_i32 = _nvvm.clusterlaunchcontrol_query_cancel(
+        res=ir.IntegerType.get_signless(32),
+        query_type="get_first_cta_id_y",
+        try_cancel_response=clc_result_i128,
         loc=loc,
         ip=ip,
     )
 
     # Get first CTA ID z component
-    l_idx_i32 = _nvvm.clusterlaunchcontrol_query_cancel_get_first_ctaid_z(
-        clc_result_i128,
+    l_idx_i32 = _nvvm.clusterlaunchcontrol_query_cancel(
+        res=ir.IntegerType.get_signless(32),
+        query_type="get_first_cta_id_z",
+        try_cancel_response=clc_result_i128,
         loc=loc,
         ip=ip,
     )
diff --git a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/__init__.py b/python/CuTeDSL/cutlass/cute/arch/constants.py
similarity index 60%
rename from python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/__init__.py
rename to python/CuTeDSL/cutlass/cute/arch/constants.py
index b963a0ad3..b6e10fd64 100644
--- a/python/CuTeDSL/cutlass/base_dsl/_mlir_helpers/__init__.py
+++ b/python/CuTeDSL/cutlass/cute/arch/constants.py
@@ -10,19 +10,9 @@
 # is strictly prohibited.
 
 """
-This module provides MLIR Dialect helper functions
+This module contains constants defined by the CUDA programming model.
 """
 
-from . import arith
-from .dialect_proxy import DialectAutoConvertProxy
-from .lru_cache_ir import lru_cache_ir
-from .op import dsl_user_op
-
-__all__ = ["arith", "DialectAutoConvertProxy", "lru_cache_ir", "dsl_user_op"]
-
-try:
-    from . import gpu
-
-    __all__.extend(["gpu"])
-except ImportError:
-    pass
+WARP_SIZE = 32
+WARPS_PER_WARPGROUP = 4
+THREADS_PER_WARPGROUP = WARP_SIZE * WARPS_PER_WARPGROUP
diff --git a/python/CuTeDSL/cutlass/cute/arch/elect.py b/python/CuTeDSL/cutlass/cute/arch/elect.py
index ba4947271..e162fcb5f 100644
--- a/python/CuTeDSL/cutlass/cute/arch/elect.py
+++ b/python/CuTeDSL/cutlass/cute/arch/elect.py
@@ -14,7 +14,7 @@ from typing import Optional
 from cutlass.cutlass_dsl import BaseDSL, dsl_user_op
 
 import cutlass._mlir.dialects.cute_nvgpu as _cute_nvgpu_ir
-from cutlass._mlir.dialects import nvvm, scf
+from cutlass._mlir.dialects import nvvm as _nvvm, scf
 from cutlass._mlir import ir
 
 from ..typing import Int, Int32
@@ -144,6 +144,6 @@ def elect_one(
     from cutlass.base_dsl.arch import Arch
 
     BaseDSL._get_dsl().check_arch(lambda arch: arch >= Arch.sm_90)
-    is_thread_leader = nvvm.elect_sync()
+    is_thread_leader = _nvvm.elect_sync()
     if_op = scf.IfOp(is_thread_leader, loc=loc, ip=ip)
     return IfOpRegion(if_op.then_block, loc=loc, ip=ip)
diff --git a/python/CuTeDSL/cutlass/cute/arch/mbar.py b/python/CuTeDSL/cutlass/cute/arch/mbar.py
index 5a1c6de8e..5142c3cf0 100644
--- a/python/CuTeDSL/cutlass/cute/arch/mbar.py
+++ b/python/CuTeDSL/cutlass/cute/arch/mbar.py
@@ -9,13 +9,21 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 from cutlass.base_dsl.arch import Arch
-from cutlass.cutlass_dsl import BaseDSL, if_generate, dsl_user_op
+from cutlass.cutlass_dsl import (
+    BaseDSL,
+    T,
+    if_generate,
+    while_generate,
+    dsl_user_op,
+    yield_out,
+)
 
 from cutlass._mlir import ir
-from cutlass._mlir.dialects import nvvm, llvm
-
+from cutlass._mlir.dialects import nvvm as _nvvm, llvm
+from cutlass._mlir.ir import AttrBuilder
 from ..typing import Optional, Pointer, Int, Boolean, Int32, AddressSpace
 
+
 ####################################################################################################
 #
 # Mbarrier management utilities
@@ -42,7 +50,8 @@ def mbarrier_init(
         with cute.arch.elect_one():
             cute.arch.mbarrier_init(barrier_ptr, arrival_count)
 
-    **PTX Mapping**: This operation maps to the PTX ``mbarrier.init.shared.b64`` instruction,
+    **PTX Mapping**: This operation maps to the PTX ``mbarrier.init.shared.b64``
+    instruction,
     which must be issued by a single thread for correctness.
 
     :param mbar_ptr: A pointer to the mbarrier in SMEM
@@ -55,14 +64,16 @@ def mbarrier_init(
        - :func:`cute.arch.mbarrier_expect_tx` - Also requires elect_one
        - PTX ISA documentation on ``mbarrier.init``
     """
-    nvvm.mbarrier_init_shared(
-        mbar_ptr.to_llvm_ptr(loc=loc, ip=ip),  # type: ignore[attr-defined]
+
+    _nvvm.mbarrier_init(
+        mbar_ptr.to_llvm_ptr(loc=loc, ip=ip),
         Int32(cnt).ir_value(loc=loc, ip=ip),
         loc=loc,
         ip=ip,
     )
 
 
+
 @dsl_user_op
 def mbarrier_init_fence(
     *, loc: Optional[ir.Location] = None, ip: Optional[ir.InsertionPoint] = None
@@ -72,7 +83,7 @@ def mbarrier_init_fence(
     """
     BaseDSL._get_dsl().check_arch(lambda arch: arch >= Arch.sm_90)
 
-    nvvm.fence_mbarrier_init(loc=loc, ip=ip)
+    _nvvm.fence_mbarrier_init(loc=loc, ip=ip)
 
 
 @dsl_user_op
@@ -80,6 +91,8 @@ def mbarrier_arrive_and_expect_tx(
     mbar_ptr: Pointer,
     bytes: Int,
     peer_cta_rank_in_cluster: Optional[Int] = None,
+    relaxed: bool = False,
+    scope: _nvvm.MemScopeKind = _nvvm.MemScopeKind.CTA,
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
@@ -110,6 +123,9 @@ def mbarrier_arrive_and_expect_tx(
     :param peer_cta_rank_in_cluster: An optional CTA rank in cluster. If provided, the pointer to
                                      the mbarrier is converted to a remote address in the peer CTA's
                                      SMEM.
+    :param relaxed:                  If True, the arrive operation has relaxed semantics and does not provide
+                                     any ordering or visibility guarantees.
+    :param scope:                    Scope of threads participating in the arrive/wait operations.
 
     .. seealso::
        - :func:`cute.arch.elect_one` - Required wrapper for single-thread execution
@@ -118,27 +134,23 @@ def mbarrier_arrive_and_expect_tx(
     """
     BaseDSL._get_dsl().check_arch(lambda arch: arch >= Arch.sm_90)
 
-    mbar_llvm_ptr = mbar_ptr.to_llvm_ptr(loc=loc, ip=ip)  # type: ignore[attr-defined]
+    mbar_llvm_ptr = mbar_ptr.to_llvm_ptr(loc=loc, ip=ip)
     if peer_cta_rank_in_cluster is not None:
         mbar_cluster_type = llvm.PointerType.get(AddressSpace.dsmem)
-        mbar_llvm_ptr = nvvm.mapa(
+        mbar_llvm_ptr = _nvvm.mapa(
             mbar_cluster_type,
             mbar_llvm_ptr,
             Int32(peer_cta_rank_in_cluster).ir_value(loc=loc, ip=ip),
             loc=loc,
             ip=ip,
         )
-        mbar_shared_type = llvm.PointerType.get(AddressSpace.smem)
-        mbar_llvm_ptr = llvm.addrspacecast(mbar_shared_type, mbar_llvm_ptr)
-        space = nvvm.MBarrierSpaceKind.CLUSTER
-    else:
-        space = nvvm.MBarrierSpaceKind.CTA
 
-    nvvm.mbarrier_txn(
+    _nvvm.mbarrier_arrive_expect_tx(
+        None,
         mbar_llvm_ptr,
         Int32(bytes).ir_value(loc=loc, ip=ip),
-        kind=nvvm.MBarrierTxnKind.ARRIVE_EXPECT_TX,
-        space=space,
+        scope=scope,
+        relaxed=relaxed,
         loc=loc,
         ip=ip,
     )
@@ -185,32 +197,28 @@ def mbarrier_expect_tx(
     """
     BaseDSL._get_dsl().check_arch(lambda arch: arch >= Arch.sm_90)
 
-    mbar_llvm_ptr = mbar_ptr.to_llvm_ptr(loc=loc, ip=ip)  # type: ignore[attr-defined]
+    mbar_llvm_ptr = mbar_ptr.to_llvm_ptr(loc=loc, ip=ip)
+    scope = _nvvm.MemScopeKind.CTA
     if peer_cta_rank_in_cluster is not None:
         mbar_cluster_type = llvm.PointerType.get(AddressSpace.dsmem)
-        mbar_llvm_ptr = nvvm.mapa(
+        mbar_llvm_ptr = _nvvm.mapa(
             mbar_cluster_type,
             mbar_llvm_ptr,
             Int32(peer_cta_rank_in_cluster).ir_value(loc=loc, ip=ip),
             loc=loc,
             ip=ip,
         )
-        mbar_shared_type = llvm.PointerType.get(AddressSpace.smem)
-        mbar_llvm_ptr = llvm.addrspacecast(mbar_shared_type, mbar_llvm_ptr)
-        space = nvvm.MBarrierSpaceKind.CLUSTER
-    else:
-        space = nvvm.MBarrierSpaceKind.CTA
 
-    nvvm.mbarrier_txn(
+    _nvvm.mbarrier_expect_tx(
         mbar_llvm_ptr,
         Int32(bytes).ir_value(loc=loc, ip=ip),
-        kind=nvvm.MBarrierTxnKind.EXPECT_TX,
-        space=space,
+        scope=scope,
         loc=loc,
         ip=ip,
     )
 
 
+
 @dsl_user_op
 def mbarrier_wait(
     mbar_ptr: Pointer,
@@ -231,10 +239,10 @@ def mbarrier_wait(
 
     timeout_ns = 10000000
 
-    # This NVVM Op is a spin-loop wrapping the mbarrier.try_wait.parity.shared.b64 PTX
-    # The timeout in ns only applies to the latter and this call is truly blocking
-    nvvm.mbarrier_try_wait_parity_shared(
-        mbar_ptr.to_llvm_ptr(loc=loc, ip=ip),  # type: ignore[attr-defined]
+    # The NVVM compiler recommends intrinsics for production use; the prep
+    # pass expands this into an explicit retry loop over nvvm.mbarrier.try_wait.
+    _nvvm.mbarrier_try_wait_parity(
+        mbar_ptr.to_llvm_ptr(loc=loc, ip=ip),
         Int32(phase).ir_value(loc=loc, ip=ip),
         Int32(timeout_ns).ir_value(loc=loc, ip=ip),
         loc=loc,
@@ -263,10 +271,41 @@ def mbarrier_try_wait(
     BaseDSL._get_dsl().check_arch(lambda arch: arch >= Arch.sm_90)
 
     return Boolean(
-        nvvm.mbarrier_wait_parity(
-            mbar_ptr.to_llvm_ptr(loc=loc, ip=ip),  # type: ignore[attr-defined]
+        _nvvm.mbarrier_wait_parity(
+            mbar_ptr.to_llvm_ptr(loc=loc, ip=ip),
             Int32(phase).ir_value(loc=loc, ip=ip),
-            nvvm.MBarrierWaitKind.TRY,
+            _nvvm.MBarrierWaitKind.TRY,
+            loc=loc,
+            ip=ip,
+        )
+    )
+
+
+@dsl_user_op
+def mbarrier_test_wait(
+    mbar_ptr: Pointer,
+    phase: Int,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> Boolean:
+    """
+    Tests whether an mbarrier has completed the phase with the specified parity.
+
+    :param mbar_ptr: A pointer to the mbarrier in SMEM
+    :type mbar_ptr:  Pointer
+    :param phase:    The phase parity to test (either 0 or 1)
+    :type phase:     Int
+    :return:         A boolean value indicating whether the phase has completed
+    :rtype:          Boolean
+    """
+    BaseDSL._get_dsl().check_arch(lambda arch: arch >= Arch.sm_90)
+
+    return Boolean(
+        _nvvm.mbarrier_wait_parity(
+            mbar_ptr.to_llvm_ptr(loc=loc, ip=ip),
+            Int32(phase).ir_value(loc=loc, ip=ip),
+            _nvvm.MBarrierWaitKind.TEST,
             loc=loc,
             ip=ip,
         )
@@ -323,29 +362,25 @@ def mbarrier_arrive(
                                      the mbarrier is converted to a remote address in the peer CTA's
                                      SMEM.
     """
-    mbar_llvm_ptr = mbar_ptr.to_llvm_ptr(loc=loc, ip=ip)  # type: ignore[attr-defined]
+    mbar_llvm_ptr = mbar_ptr.to_llvm_ptr(loc=loc, ip=ip)
+    scope = _nvvm.MemScopeKind.CTA
     if peer_cta_rank_in_cluster is not None:
         BaseDSL._get_dsl().check_arch(lambda arch: arch >= Arch.sm_90)
 
         mbar_cluster_type = llvm.PointerType.get(AddressSpace.dsmem)
-        mbar_llvm_ptr = nvvm.mapa(
+        mbar_llvm_ptr = _nvvm.mapa(
             mbar_cluster_type,
             mbar_llvm_ptr,
             Int32(peer_cta_rank_in_cluster).ir_value(loc=loc, ip=ip),
             loc=loc,
             ip=ip,
         )
-        mbar_shared_type = llvm.PointerType.get(AddressSpace.smem)
-        mbar_llvm_ptr = llvm.addrspacecast(mbar_shared_type, mbar_llvm_ptr)
-        space = nvvm.MBarrierSpaceKind.CLUSTER
-    else:
-        space = nvvm.MBarrierSpaceKind.CTA
 
-    nvvm.mbarrier_txn(
+    _nvvm.mbarrier_arrive(
+        None,
         mbar_llvm_ptr,
-        Int32(arrive_count).ir_value(loc=loc, ip=ip),
-        kind=nvvm.MBarrierTxnKind.ARRIVE,
-        space=space,
+        count=Int32(arrive_count).ir_value(loc=loc, ip=ip),
+        scope=scope,
         loc=loc,
         ip=ip,
     )
@@ -369,5 +404,5 @@ def cp_async_mbarrier_arrive_noinc(
     """
     BaseDSL._get_dsl().check_arch(lambda arch: arch >= Arch.sm_90)
 
-    mbar_llvm_ptr = mbar_ptr.to_llvm_ptr(loc=loc, ip=ip)  # type: ignore[attr-defined]
-    nvvm.cp_async_mbarrier_arrive_shared(mbar_llvm_ptr, noinc=True, loc=loc, ip=ip)
+    mbar_llvm_ptr = mbar_ptr.to_llvm_ptr(loc=loc, ip=ip)
+    _nvvm.cp_async_mbarrier_arrive(mbar_llvm_ptr, noinc=True, loc=loc, ip=ip)
diff --git a/python/CuTeDSL/cutlass/cute/arch/numeric_conversion.py b/python/CuTeDSL/cutlass/cute/arch/numeric_conversion.py
index ed8f4d540..e66013863 100644
--- a/python/CuTeDSL/cutlass/cute/arch/numeric_conversion.py
+++ b/python/CuTeDSL/cutlass/cute/arch/numeric_conversion.py
@@ -13,7 +13,7 @@ from typing import Optional
 
 from cutlass.base_dsl.arch import Arch
 from cutlass.base_dsl.common import DSLRuntimeError
-from cutlass.cutlass_dsl import BaseDSL, dsl_user_op
+from cutlass.cutlass_dsl import BaseDSL, dsl_user_op, target_version
 
 from cutlass._mlir import ir
 from cutlass._mlir.dialects import arith, vector
@@ -69,6 +69,7 @@ def cvt_i8_bf16_intrinsic(
         ip=ip,
     ).result
     arch = BaseDSL._get_dsl().get_arch_enum()
+    is_ptx9_or_higher = target_version(min_version="13.1")
     # try to use vectorized version
     if length >= 4:
         num_vec4 = length // 4
@@ -76,7 +77,10 @@ def cvt_i8_bf16_intrinsic(
             vec_i8x4 = vector.extract_strided_slice(
                 vec_i8x4_type, vec_i8, [src_pos], [4], [1], loc=loc, ip=ip
             )
-            if arch in cvt_i8_bf16_intrinsic.s26_bf16_supported_archs:  # type: ignore[attr-defined]
+            if (
+                is_ptx9_or_higher
+                and arch in cvt_i8_bf16_intrinsic.s26_bf16_supported_archs  # type: ignore[attr-defined]
+            ):
                 vec_bf16x4 = cvt_i8x4_to_bf16x4(vec_i8x4, loc=loc, ip=ip)
                 vec_dst = vector.insert_strided_slice(
                     vec_bf16x4, vec_dst, [src_pos], [1], loc=loc, ip=ip
@@ -104,7 +108,7 @@ def cvt_i8_bf16_intrinsic(
         vec_i8x2 = vector.extract_strided_slice(
             vec_i8x2_type, vec_i8, [src_pos], [2], [1], loc=loc, ip=ip
         )
-        if arch in cvt_i8_bf16_intrinsic.s26_bf16_supported_archs:  # type: ignore[attr-defined]
+        if is_ptx9_or_higher and arch in cvt_i8_bf16_intrinsic.s26_bf16_supported_archs:  # type: ignore[attr-defined]
             vec_bf16x2 = cvt_i8x2_to_bf16x2(vec_i8x2, loc=loc, ip=ip)
         else:
             vec_f32x2 = cvt_i8x2_to_f32x2(vec_i8x2, loc=loc, ip=ip)
@@ -117,9 +121,10 @@ def cvt_i8_bf16_intrinsic(
     if length >= 1:
         if arch in cvt_i8_bf16_intrinsic.s26_bf16_supported_archs:  # type: ignore[attr-defined]
             val_bf16 = cvt_i8_bf16(
-                vector.extractelement(
+                vector.extract(
                     vec_i8,
-                    position=arith.constant(Int32.mlir_type, src_pos),
+                    [],
+                    [src_pos],
                     loc=loc,
                     ip=ip,
                 ),
@@ -127,19 +132,21 @@ def cvt_i8_bf16_intrinsic(
                 ip=ip,
             )
         else:
-            src_i8 = vector.extractelement(
+            src_i8 = vector.extract(
                 vec_i8,
-                position=arith.constant(Int32.mlir_type, src_pos),
+                [],
+                [src_pos],
                 loc=loc,
                 ip=ip,
             )
             src_i32 = arith.ExtSIOp(Int32.mlir_type, src_i8, loc=loc, ip=ip)
             src_f32 = arith.SIToFPOp(Float32.mlir_type, src_i32, loc=loc, ip=ip)
             val_bf16 = cvt_f32_bf16(src_f32, loc=loc, ip=ip)
-        vec_dst = vector.insertelement(
+        vec_dst = vector.insert(
             val_bf16,
             vec_dst,
-            position=arith.constant(Int32.mlir_type, src_pos),
+            [],
+            [src_pos],
             loc=loc,
             ip=ip,
         )
@@ -229,19 +236,21 @@ def cvt_i4_bf16_intrinsic(
         length -= 2
     if length >= 1:
         val_bf16 = cvt_i4_bf16(
-            vector.extractelement(
+            vector.extract(
                 vec_i4,
-                position=arith.constant(Int32.mlir_type, src_pos),
+                [],
+                [src_pos],
                 loc=loc,
                 ip=ip,
             ),
             loc=loc,
             ip=ip,
         )
-        vec_dst = vector.insertelement(
+        vec_dst = vector.insert(
             val_bf16,
             vec_dst,
-            position=arith.constant(Int32.mlir_type, src_pos),
+            [],
+            [src_pos],
             loc=loc,
             ip=ip,
         )
diff --git a/python/CuTeDSL/cutlass/cute/arch/nvvm_wrappers.py b/python/CuTeDSL/cutlass/cute/arch/nvvm_wrappers.py
index 8a9cba11e..46b49d8fe 100644
--- a/python/CuTeDSL/cutlass/cute/arch/nvvm_wrappers.py
+++ b/python/CuTeDSL/cutlass/cute/arch/nvvm_wrappers.py
@@ -9,6 +9,8 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
+import enum
+import types
 from functools import partial
 from typing import Any, Optional, Tuple, Union, Callable, Literal, Type, overload
 from typing_extensions import deprecated
@@ -19,7 +21,8 @@ import cutlass.cutlass_dsl as cutlass_dsl
 
 from cutlass._mlir import ir
 from cutlass._mlir.dialects import arith, builtin, llvm, math, nvvm as _nvvm_raw, vector
-from cutlass.base_dsl._mlir_helpers.dialect_proxy import DialectAutoConvertProxy
+
+from .constants import WARP_SIZE
 
 from ..core import size
 
@@ -42,11 +45,99 @@ from ..typing import (
     as_numeric,
 )
 
-WARP_SIZE = 32
 FULL_MASK = 0xFFFFFFFF
 
+
+# ============================================================================
+# NVVM Auto-Convert Proxy
+# ============================================================================
+
+
+class _NvvmAutoConvertProxy:
+    """
+    Proxy that wraps the raw MLIR nvvm dialect module.
+    Automatically converts Numeric arguments to ir.Value when calling any nvvm operation.
+
+    This enables users to write cleaner code without explicit .ir_value() calls:
+
+    Before:
+        nvvm.tcgen05_mma_smem_desc(
+            Int32(x).ir_value(),
+            Int32(y).ir_value(),
+            ...
+        )
+
+    After:
+        nvvm.tcgen05_mma_smem_desc(
+            Int32(x),
+            Int32(y),
+            ...
+        )
+    """
+
+    def __init__(self, nvvm_module: types.ModuleType) -> None:
+        self._module = nvvm_module
+        self._wrapped_cache: dict[str, Callable[..., Any]] = {}
+
+    @staticmethod
+    def _convert_arg(
+        arg: Any, loc: Optional[ir.Location], ip: Optional[ir.InsertionPoint]
+    ) -> Any:
+        """Recursively convert DSL objects to ir.Value, including inside lists/tuples."""
+        # Check for ir_value method (covers Numeric and other DSL types)
+        if hasattr(arg, "ir_value") and callable(arg.ir_value):
+            return arg.ir_value(loc=loc, ip=ip)
+        if isinstance(arg, (list, tuple)):
+            converted = [
+                _NvvmAutoConvertProxy._convert_arg(item, loc, ip) for item in arg
+            ]
+            return type(arg)(converted)
+        return arg
+
+    def __getattr__(self, name: str) -> Any:
+        attr = getattr(self._module, name)
+
+        # Non-callable attributes, enum classes, and mock objects pass through unchanged.
+        # Enum classes need attribute access (e.g., ShflKind.idx), whereas MLIR operation
+        # classes are wrapped so that their arguments are converted when instantiated.
+        if (
+            not callable(attr)
+            or isinstance(attr, enum.EnumMeta)
+            or hasattr(attr, "_mock_name")
+        ):
+            return attr
+
+        # Use cache for wrapped callables
+        if name not in self._wrapped_cache:
+
+            def _make_wrapper(func: Callable[..., Any]) -> Callable[..., Any]:
+                def wrapped(
+                    *args: Any,
+                    loc: Optional[ir.Location] = None,
+                    ip: Optional[ir.InsertionPoint] = None,
+                    **kwargs: Any,
+                ) -> Any:
+                    # Convert Numeric args to ir.Value (recursively handles lists/tuples)
+                    converted_args = tuple(
+                        _NvvmAutoConvertProxy._convert_arg(arg, loc, ip) for arg in args
+                    )
+                    converted_kwargs = {
+                        k: _NvvmAutoConvertProxy._convert_arg(v, loc, ip)
+                        for k, v in kwargs.items()
+                    }
+                    return func(*converted_args, loc=loc, ip=ip, **converted_kwargs)
+
+                return wrapped
+
+            self._wrapped_cache[name] = _make_wrapper(attr)
+
+        return self._wrapped_cache[name]
+
+    def __dir__(self) -> list[str]:
+        return dir(self._module)
+
 # Create the proxy instance to replace the raw nvvm module
-nvvm = DialectAutoConvertProxy(_nvvm_raw)
+nvvm = _NvvmAutoConvertProxy(_nvvm_raw)
 
 
 # ============================================================================
@@ -99,22 +190,11 @@ def _enhance_enum_with_str_mapping(enum_class: Any) -> Any:
         from enum import Enum
 
         if isinstance(s, Enum):
-            if cutlass_dsl.target_version(exact_version="12.9"):
-                import warnings
-
-                warnings.warn(
-                    f"Passing enum member directly to {cls.__name__}.from_str() is deprecated. "
-                    f"Please use string literals instead (e.g., '{str(s)}' instead of {cls.__name__}.{s.name}).",
-                    DeprecationWarning,
-                    stacklevel=2,
-                )
-                return s
-            else:
-                raise TypeError(
-                    f"Expected a string literal for {cls.__name__}, but got enum '{type(s).__name__}.{s.name}'. "
-                    f"Please pass a string instead (e.g., '{str(s)}' instead of {type(s).__name__}.{s.name}). "
-                    f"Valid string options are: {sorted(str_to_enum_map.keys())}"
-                )
+            raise TypeError(
+                f"Expected a string literal for {cls.__name__}, but got enum '{type(s).__name__}.{s.name}'. "
+                f"Please pass a string instead (e.g., '{str(s)}' instead of {type(s).__name__}.{s.name}). "
+                f"Valid string options are: {sorted(str_to_enum_map.keys())}"
+            )
 
         if s not in str_to_enum_map:
             valid_options = sorted(str_to_enum_map.keys())
@@ -245,14 +325,13 @@ def warp_idx(
     """
     Returns the warp index within a CTA.
     """
-    warp_size = 32
     tid_x = Int32(nvvm.read_ptx_sreg_tid_x(T.i32(), loc=loc, ip=ip))
     tid_y = Int32(nvvm.read_ptx_sreg_tid_y(T.i32(), loc=loc, ip=ip))
     tid_z = Int32(nvvm.read_ptx_sreg_tid_z(T.i32(), loc=loc, ip=ip))
     ntid_x = Int32(nvvm.read_ptx_sreg_ntid_x(T.i32(), loc=loc, ip=ip))
     ntid_y = Int32(nvvm.read_ptx_sreg_ntid_y(T.i32(), loc=loc, ip=ip))
     tid = tid_x + tid_y * ntid_x + tid_z * ntid_x * ntid_y
-    return tid // warp_size
+    return tid // WARP_SIZE
 
 
 @dsl_user_op
@@ -406,19 +485,7 @@ def dynamic_smem_size(
     """
     Returns the launch dynamic smem size.
     """
-    return Int32(
-        llvm.inline_asm(
-            Int32.mlir_type,
-            [],
-            "mov.u32 $0, %dynamic_smem_size;\n",
-            "=r",
-            has_side_effects=True,
-            is_align_stack=False,
-            asm_dialect=llvm.AsmDialect.AD_ATT,
-            loc=loc,
-            ip=ip,
-        )
-    )
+    return Int32(nvvm.read_ptx_sreg_dynamic_smem_size(T.i32(), loc=loc, ip=ip))
 
 
 @dsl_user_op
@@ -476,16 +543,17 @@ def shuffle_sync_op(
     if not isinstance(value, Numeric):
         value = as_numeric(value)
 
-    if value.width > 64:  # type: ignore[attr-defined]
+    if value.width > 64:
         raise ValueError("shuffle_sync only supports values up to 64 bits")
 
     orig_type = type(value)
 
-    if value.width < 32:  # type: ignore[attr-defined]
+    if value.width < 32:
         if value.dtype.is_float:
             value = value.to(Float32)
         else:
-            if value.signed:  # type: ignore[attr-defined]
+            assert isinstance(value, Integer)
+            if value.signed:
                 value = value.to(Int32)
             else:
                 value = value.to(Uint32)
@@ -501,7 +569,7 @@ def shuffle_sync_op(
                 ip=ip,
             )
         )
-    elif value.width == 32:  # type: ignore[attr-defined]
+    elif value.width == 32:
         return orig_type(
             nvvm.shfl_sync(
                 type(value).mlir_type,
@@ -515,7 +583,7 @@ def shuffle_sync_op(
             )
         )
     else:
-        if value.width != 64:  # type: ignore[attr-defined]
+        if value.width != 64:
             raise ValueError(
                 "shuffle_sync only supports 64 bits values when the bit width is larger than 32"
             )
@@ -573,13 +641,13 @@ def warp_reduction(
     val: Numeric,
     op: Callable,
     *,
-    threads_in_group: int = 32,
+    threads_in_group: int = WARP_SIZE,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Numeric:
     """warp reduction of a Numeric value(e.g.Float32) by shuffle_sync_bfly, accepts custom binary operator.
     The threads_in_group is the number of threads reduction group in a warp.
-    E.g. 32 means the whole warp reduced in one group. 8 means the warp is divided into 4 thread groups, each group has 8 threads in reduction.
+    E.g. WARP_SIZE (32) means the whole warp is reduced as one group; 8 means the warp is divided into 4 groups, each reducing across its own 8 threads.
 
 
     :param val: register value
@@ -628,30 +696,14 @@ def barrier(
     if number_of_threads is not None:
         number_of_threads = Int32(number_of_threads).ir_value(loc=loc, ip=ip)
 
-    if cutlass_dsl.target_version(exact_version="12.9"):
-        if barrier_id is None:
-            barrier_id = Int32(0).ir_value(loc=loc, ip=ip)
-        has_count = number_of_threads is not None
-        operands = [barrier_id, number_of_threads] if has_count else [barrier_id]
-        llvm.inline_asm(
-            None,
-            operands,
-            f"bar.sync {'$0, $1' if has_count else '$0'};",
-            "r,r" if has_count else "r",
-            has_side_effects=True,
-            is_align_stack=False,
-            asm_dialect=llvm.AsmDialect.AD_ATT,
-            loc=loc,
-            ip=ip,
-        )
-    else:
-        # TODO: support barrier with reduction result
-        nvvm.barrier(
-            barrier_id=barrier_id,
-            number_of_threads=number_of_threads,
-            loc=loc,
-            ip=ip,
-        )
+    # TODO: support barrier with reduction result
+    nvvm.barrier(
+        res=None,
+        barrier_id=barrier_id,
+        number_of_threads=number_of_threads,
+        loc=loc,
+        ip=ip,
+    )
 
 
 @dsl_user_op
@@ -659,9 +711,26 @@ def barrier_arrive(
     *,
     barrier_id: Optional[Int] = None,
     number_of_threads: Optional[Int] = None,
+    aligned: bool = True,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> None:
+    """Issue a non-blocking arrive on a CTA-scoped named barrier.
+
+    The PTX ISA distinguishes two flavors of the arrive instruction:
+
+    - ``aligned=True`` (default) emits ``bar.arrive`` (legacy syntax with
+      implicit ``.aligned``). All threads in the CTA must reach this
+      instruction (i.e. it must lie outside divergent control flow).
+    - ``aligned=False`` emits ``barrier.cta.arrive`` (no ``.aligned``
+      modifier), which is required when the participating threads do not
+      necessarily execute the same instruction (e.g. when only a subset of
+      threads or warps in the CTA issue the arrive).
+
+    See the `PTX documentation
+    <https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-bar-barrier>`__
+    for details on the aligned vs. unaligned variants.
+    """
     if barrier_id is not None:
         barrier_id = Int32(barrier_id).ir_value(loc=loc, ip=ip)
     else:
@@ -673,22 +742,21 @@ def barrier_arrive(
         )
     number_of_threads = Int32(number_of_threads).ir_value(loc=loc, ip=ip)
 
-    if cutlass_dsl.target_version(exact_version="12.9"):
-        llvm.inline_asm(
-            None,
-            [barrier_id, number_of_threads],
-            "bar.arrive $0, $1;",
-            "r,r",
-            has_side_effects=True,
-            is_align_stack=False,
-            asm_dialect=llvm.AsmDialect.AD_ATT,
+    if aligned:
+        nvvm.barrier_arrive(
+            barrier_id=barrier_id,
+            number_of_threads=number_of_threads,
             loc=loc,
             ip=ip,
         )
     else:
-        nvvm.barrier_arrive(
+        # The unaligned variant uses `barrier.cta.arrive` (without the
+        # `.aligned` modifier). The legacy `bar.arrive` instruction is
+        # implicitly aligned and cannot express this case.
+        nvvm.barrier_cta_arrive(
             barrier_id=barrier_id,
-            number_of_threads=number_of_threads,
+            thread_count=number_of_threads,
+            aligned=False,
             loc=loc,
             ip=ip,
         )
@@ -701,7 +769,7 @@ def sync_threads(
     """
     Synchronizes all threads within a CTA.
     """
-    nvvm.barrier(loc=loc, ip=ip)
+    nvvm.barrier(res=None, loc=loc, ip=ip)
 
 
 @dsl_user_op
@@ -726,7 +794,7 @@ def fence_acq_rel_cta(
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`__.
     """
-    nvvm.fence_acq_rel_cta(loc=loc, ip=ip)
+    llvm.fence(llvm.AtomicOrdering.acq_rel, syncscope="block", loc=loc, ip=ip)
 
 
 @dsl_user_op
@@ -738,7 +806,7 @@ def fence_acq_rel_cluster(
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`__.
     """
-    nvvm.fence_acq_rel_cluster(loc=loc, ip=ip)
+    llvm.fence(llvm.AtomicOrdering.acq_rel, syncscope="cluster", loc=loc, ip=ip)
 
 
 @dsl_user_op
@@ -750,7 +818,7 @@ def fence_acq_rel_gpu(
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`__.
     """
-    nvvm.fence_acq_rel_gpu(loc=loc, ip=ip)
+    llvm.fence(llvm.AtomicOrdering.acq_rel, syncscope="device", loc=loc, ip=ip)
 
 
 @dsl_user_op
@@ -762,7 +830,7 @@ def fence_acq_rel_sys(
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`__.
     """
-    nvvm.fence_acq_rel_sys(loc=loc, ip=ip)
+    llvm.fence(llvm.AtomicOrdering.acq_rel, loc=loc, ip=ip)
 
 
 @dsl_user_op
@@ -914,40 +982,7 @@ def vote_sync_op(
     Performs a vote operation across the warp.
     """
     return_type = Int32 if kind == "ballot" else Boolean
-    if cutlass_dsl.target_version(exact_version="12.9"):
-        if kind == "ballot":
-            return return_type(
-                nvvm.vote_ballot_sync(
-                    T.i32(),
-                    Int32(mask).ir_value(loc=loc, ip=ip),
-                    Boolean(pred).ir_value(loc=loc, ip=ip),
-                    loc=loc,
-                    ip=ip,
-                )
-            )
-        else:
-            return return_type(
-                llvm.inline_asm(
-                    T.bool(),
-                    [
-                        Boolean(pred).ir_value(loc=loc, ip=ip),
-                        Int32(mask).ir_value(loc=loc, ip=ip),
-                    ],
-                    f"""{{\n\t
-                    .reg .pred ps;\n\t
-                    .reg .pred pd;\n\t
-                    setp.ne.b32 ps, $1, 0;\n\t
-                    vote.sync.{kind}.pred pd, ps, $2;\n\t
-                    selp.b32 $0, 1, 0, pd;\n\t
-                    }}""",
-                    "=r,r,r",
-                    has_side_effects=True,
-                    is_align_stack=False,
-                    asm_dialect=llvm.AsmDialect.AD_ATT,
-                    loc=loc,
-                    ip=ip,
-                )
-            )
+
     from cutlass._mlir.dialects.nvvm import VoteSyncKind
 
     VoteSyncKind = _enhance_enum_with_str_mapping(VoteSyncKind)
@@ -1276,25 +1311,14 @@ def fmax(
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Float32:
-    if cutlass_dsl.target_version(exact_version="12.9"):
-        return Float32(
-            nvvm.fmax(
-                T.f32(),
-                Float32(a).ir_value(loc=loc, ip=ip),
-                Float32(b).ir_value(loc=loc, ip=ip),
-                loc=loc,
-                ip=ip,
-            )
-        )
-    else:
-        return Float32(
-            nvvm.fmax(
-                Float32(a).ir_value(loc=loc, ip=ip),
-                Float32(b).ir_value(loc=loc, ip=ip),
-                loc=loc,
-                ip=ip,
-            )
+    return Float32(
+        nvvm.fmax(
+            Float32(a).ir_value(loc=loc, ip=ip),
+            Float32(b).ir_value(loc=loc, ip=ip),
+            loc=loc,
+            ip=ip,
         )
+    )
 
 
 @dsl_user_op
@@ -1305,25 +1329,14 @@ def fmin(
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Float32:
-    if cutlass_dsl.target_version(exact_version="12.9"):
-        return Float32(
-            nvvm.fmin(
-                T.f32(),
-                Float32(a).ir_value(loc=loc, ip=ip),
-                Float32(b).ir_value(loc=loc, ip=ip),
-                loc=loc,
-                ip=ip,
-            )
-        )
-    else:
-        return Float32(
-            nvvm.fmin(
-                Float32(a).ir_value(loc=loc, ip=ip),
-                Float32(b).ir_value(loc=loc, ip=ip),
-                loc=loc,
-                ip=ip,
-            )
+    return Float32(
+        nvvm.fmin(
+            Float32(a).ir_value(loc=loc, ip=ip),
+            Float32(b).ir_value(loc=loc, ip=ip),
+            loc=loc,
+            ip=ip,
         )
+    )
 
 
 @dsl_user_op
@@ -1395,6 +1408,10 @@ def cvt_i8x2_to_bf16x2(
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> ir.Value:
+    if not cutlass_dsl.target_version(min_version="13.1"):
+        raise cutlass_dsl.DSLCudaVerNotImplemented(
+            feature="cvt_i8x2_to_bf16x2", required_version="13.1"
+        )
     # pack 2 int8 into 1 int16 value
     src_i16 = llvm.bitcast(Int16.mlir_type, src_vec2, loc=loc, ip=ip)
     val_i32 = llvm.inline_asm(
@@ -1422,6 +1439,10 @@ def cvt_i8x4_to_bf16x4(
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> ir.Value:
+    if not cutlass_dsl.target_version(min_version="13.1"):
+        raise cutlass_dsl.DSLCudaVerNotImplemented(
+            feature="cvt_i8x4_to_bf16x4", required_version="13.1"
+        )
     # pack 4 int8 into 1 int32 value
     src_i32 = llvm.bitcast(Int32.mlir_type, src_vec4, loc=loc, ip=ip)
     rst01 = llvm.inline_asm(
@@ -1468,12 +1489,8 @@ def cvt_f32x2_bf16x2(
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> ir.Value:
-    src0 = vector.extractelement(
-        src_vec2, position=arith.constant(Int32.mlir_type, 0, loc=loc, ip=ip)
-    )
-    src1 = vector.extractelement(
-        src_vec2, position=arith.constant(Int32.mlir_type, 1, loc=loc, ip=ip)
-    )
+    src0 = vector.extract(src_vec2, [], [0])
+    src1 = vector.extract(src_vec2, [], [1])
     rst = llvm.inline_asm(
         T.i32(),
         [
@@ -2134,27 +2151,26 @@ def _warp_redux_sync_nvvm(
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Numeric:
-    from cutlass._mlir.dialects.nvvm import ReduxKind
+    from cutlass._mlir.dialects.nvvm import ReductionKind
 
     # Enhance enum and convert string literal to enum type
-    ReduxKind = _enhance_enum_with_str_mapping(ReduxKind)
-    kind = ReduxKind.from_str(kind)
+    ReductionKind = _enhance_enum_with_str_mapping(ReductionKind)
+    kind = ReductionKind.from_str(kind)
 
     value_type = type(value)
-    if value_type.is_integer and not value_type.signed:  # type: ignore[attr-defined]
-        if kind == ReduxKind.MAX:
-            kind = ReduxKind.UMAX
-        elif kind == ReduxKind.MIN:
-            kind = ReduxKind.UMIN
+    if issubclass(value_type, Integer) and not value_type.signed:
+        if kind == ReductionKind.MAX:
+            kind = ReductionKind.UMAX
+        elif kind == ReductionKind.MIN:
+            kind = ReductionKind.UMIN
 
     value_ir = value.ir_value(loc=loc, ip=ip)
 
     return value_type(
         nvvm.redux_sync(
-            res=value_ir.type,
-            val=value_ir,
-            kind=kind,
-            mask_and_clamp=Int32(mask_and_clamp).ir_value(loc=loc, ip=ip),
+            value_ir,
+            kind,
+            Int32(mask_and_clamp).ir_value(loc=loc, ip=ip),
             abs=abs,
             nan=nan,
             loc=loc,
@@ -2179,9 +2195,6 @@ def _warp_redux_sync_ptx(
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Numeric:
-    """
-    **ONLY** support f32 as nvvm compatability
-    """
     value_type = type(value)
     value_ir = value.ir_value(loc=loc, ip=ip)
     mlir_type = value_type.mlir_type
@@ -2326,7 +2339,7 @@ def _atomic(
         "exch",
     ],
     sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]] = None,
-    scope: Optional[Literal["gpu", "cta", "sys"]] = None,
+    scope: Optional[Literal["gpu", "cta", "cluster", "sys"]] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Union[Numeric, ir.Value]:
@@ -2345,8 +2358,8 @@ def _atomic(
         "max"/"min" auto-promote to "umax"/"umin" for unsigned types (Uint32/Uint64).
     :type op: Literal["add", "fadd", "max", "min", "umax", "umin", "and", "or", "xor", "exch"]
     :type sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]]
-    :param scope: Memory scope ("gpu", "cta", "sys")
-    :type scope: Optional[Literal["gpu", "cta", "sys"]]
+    :param scope: Memory scope ("gpu", "cta", "cluster", "sys")
+    :type scope: Optional[Literal["gpu", "cta", "cluster", "sys"]]
     :return: Old value at memory location
     :rtype: Union[Numeric, ir.Value]
     """
@@ -2395,38 +2408,21 @@ def _atomic(
         if val_type.is_float:
             if op == AtomicOpKind.ADD:
                 op = AtomicOpKind.FADD
-        elif val_type.is_integer and not val_type.signed:  # type: ignore[attr-defined]
+        elif issubclass(val_type, Integer) and not val_type.signed:
             if op == AtomicOpKind.MAX:
                 op = AtomicOpKind.UMAX
             elif op == AtomicOpKind.MIN:
                 op = AtomicOpKind.UMIN
 
-    # * NVVM call based on nvvm version
-    if cutlass_dsl.target_version(exact_version="12.9"):
-        # Old API: requires explicit result type as first positional argument
-        # For vectors: pass val_type (ir.VectorType), for scalars: pass val_type.mlir_type
-        result_type = val_type if is_vector else val_type.mlir_type
-        result = nvvm.atomicrmw(
-            result_type,
-            op=op,
-            ptr=ptr,
-            a=val_ir,
-            mem_order=sem,
-            syncscope=scope,
-            loc=loc,
-            ip=ip,
-        )
-    else:
-        # New API: infers result type automatically
-        result = nvvm.atomicrmw(
-            op=op,
-            ptr=ptr,
-            a=val_ir,
-            mem_order=sem,
-            syncscope=scope,
-            loc=loc,
-            ip=ip,
-        )
+    result = nvvm.atomicrmw(
+        op=op,
+        ptr=ptr,
+        a=val_ir,
+        mem_order=sem,
+        syncscope=scope,
+        loc=loc,
+        ip=ip,
+    )
     # Return raw result for vectors, wrapped for scalars
     return result if is_vector else val_type(result)
 
@@ -2436,7 +2432,7 @@ def atomic_add(
     val: Union[Numeric, ir.Value],
     *,
     sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]] = None,
-    scope: Optional[Literal["gpu", "cta", "sys"]] = None,
+    scope: Optional[Literal["gpu", "cta", "cluster", "sys"]] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Union[Numeric, ir.Value]:
@@ -2450,8 +2446,8 @@ def atomic_add(
     :type val: Union[Numeric, ir.Value]
     :param sem: Memory semantic ("relaxed", "release", "acquire", "acq_rel")
     :type sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]]
-    :param scope: Memory scope ("gpu", "cta", "sys")
-    :type scope: Optional[Literal["gpu", "cta", "sys"]]
+    :param scope: Memory scope ("gpu", "cta", "cluster", "sys")
+    :type scope: Optional[Literal["gpu", "cta", "cluster", "sys"]]
     :return: Old value at memory location
     :rtype: Union[Numeric, ir.Value]
     """
@@ -2463,7 +2459,7 @@ def atomic_and(
     val: Numeric,
     *,
     sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]] = None,
-    scope: Optional[Literal["gpu", "cta", "sys"]] = None,
+    scope: Optional[Literal["gpu", "cta", "cluster", "sys"]] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Numeric:
@@ -2477,8 +2473,8 @@ def atomic_and(
     :type val: Numeric
     :param sem: Memory semantic ("relaxed", "release", "acquire", "acq_rel")
     :type sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]]
-    :param scope: Memory scope ("gpu", "cta", "sys")
-    :type scope: Optional[Literal["gpu", "cta", "sys"]]
+    :param scope: Memory scope ("gpu", "cta", "cluster", "sys")
+    :type scope: Optional[Literal["gpu", "cta", "cluster", "sys"]]
     :return: Old value at memory location
     :rtype: Numeric
     """
@@ -2490,7 +2486,7 @@ def atomic_or(
     val: Numeric,
     *,
     sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]] = None,
-    scope: Optional[Literal["gpu", "cta", "sys"]] = None,
+    scope: Optional[Literal["gpu", "cta", "cluster", "sys"]] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Numeric:
@@ -2504,8 +2500,8 @@ def atomic_or(
     :type val: Numeric
     :param sem: Memory semantic ("relaxed", "release", "acquire", "acq_rel")
     :type sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]]
-    :param scope: Memory scope ("gpu", "cta", "sys")
-    :type scope: Optional[Literal["gpu", "cta", "sys"]]
+    :param scope: Memory scope ("gpu", "cta", "cluster", "sys")
+    :type scope: Optional[Literal["gpu", "cta", "cluster", "sys"]]
     :return: Old value at memory location
     :rtype: Numeric
     """
@@ -2517,7 +2513,7 @@ def atomic_xor(
     val: Numeric,
     *,
     sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]] = None,
-    scope: Optional[Literal["gpu", "cta", "sys"]] = None,
+    scope: Optional[Literal["gpu", "cta", "cluster", "sys"]] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Numeric:
@@ -2531,8 +2527,8 @@ def atomic_xor(
     :type val: Numeric
     :param sem: Memory semantic ("relaxed", "release", "acquire", "acq_rel")
     :type sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]]
-    :param scope: Memory scope ("gpu", "cta", "sys")
-    :type scope: Optional[Literal["gpu", "cta", "sys"]]
+    :param scope: Memory scope ("gpu", "cta", "cluster", "sys")
+    :type scope: Optional[Literal["gpu", "cta", "cluster", "sys"]]
     :return: Old value at memory location
     :rtype: Numeric
     """
@@ -2544,7 +2540,7 @@ def atomic_max(
     val: Numeric,
     *,
     sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]] = None,
-    scope: Optional[Literal["gpu", "cta", "sys"]] = None,
+    scope: Optional[Literal["gpu", "cta", "cluster", "sys"]] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Numeric:
@@ -2558,8 +2554,8 @@ def atomic_max(
     :type val: Numeric
     :param sem: Memory semantic ("relaxed", "release", "acquire", "acq_rel")
     :type sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]]
-    :param scope: Memory scope ("gpu", "cta", "sys")
-    :type scope: Optional[Literal["gpu", "cta", "sys"]]
+    :param scope: Memory scope ("gpu", "cta", "cluster", "sys")
+    :type scope: Optional[Literal["gpu", "cta", "cluster", "sys"]]
     :return: Old value at memory location
     :rtype: Numeric
     """
@@ -2571,7 +2567,7 @@ def atomic_min(
     val: Numeric,
     *,
     sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]] = None,
-    scope: Optional[Literal["gpu", "cta", "sys"]] = None,
+    scope: Optional[Literal["gpu", "cta", "cluster", "sys"]] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Numeric:
@@ -2612,12 +2608,12 @@ def atomic_exch(
     :type val: Numeric
     :param sem: Memory semantic ("relaxed", "release", "acquire", "acq_rel")
     :type sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]]
-    :param scope: Memory scope ("gpu", "cta", "sys")
-    :type scope: Optional[Literal["gpu", "cta", "sys"]]
+    :param scope: Memory scope ("gpu", "cta", "cluster", "sys")
+    :type scope: Optional[Literal["gpu", "cta", "cluster", "sys"]]
     :return: Old value at memory location
     :rtype: Numeric
     """
-    return _atomic(ptr, val, op="exch", sem=sem, scope=scope, loc=loc, ip=ip)  # type: ignore[arg-type]
+    return _atomic(ptr, val, op="exch", sem=sem, scope=scope, loc=loc, ip=ip)
 
 
 @dsl_user_op
@@ -2653,20 +2649,10 @@ def atomic_fmax(
     """
     intval = llvm.bitcast(T.i32(), val.ir_value(loc=loc, ip=ip), loc=loc, ip=ip)
     then_body = lambda: atomic_min(
-        ptr,
-        Uint32(intval),
-        sem=sem,
-        scope=scope,  # type: ignore[arg-type]
-        loc=loc,
-        ip=ip,
+        ptr, Uint32(intval), sem=sem, scope=scope, loc=loc, ip=ip
     )
     else_body = lambda: atomic_max(
-        ptr,
-        Int32(intval),
-        sem=sem,
-        scope=scope,  # type: ignore[arg-type]
-        loc=loc,
-        ip=ip,
+        ptr, Int32(intval), sem=sem, scope=scope, loc=loc, ip=ip
     )
 
     if sign_bit is None:
@@ -2712,7 +2698,7 @@ def atomic_cas(
     cmp: Numeric,
     val: Numeric,
     sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]] = None,
-    scope: Optional[Literal["gpu", "cta", "sys"]] = None,
+    scope: Optional[Literal["gpu", "cta", "cluster", "sys"]] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Numeric:
@@ -2731,8 +2717,8 @@ def atomic_cas(
     :type val: Numeric
     :param sem: Memory semantic ("relaxed", "release", "acquire", "acq_rel")
     :type sem: Optional[Literal["relaxed", "release", "acquire", "acq_rel"]]
-    :param scope: Memory scope ("gpu", "cta", "sys")
-    :type scope: Optional[Literal["gpu", "cta", "sys"]]
+    :param scope: Memory scope ("gpu", "cta", "cluster", "sys")
+    :type scope: Optional[Literal["gpu", "cta", "cluster", "sys"]]
     :return: Old value at memory location
     :rtype: Numeric
     """
@@ -2758,29 +2744,16 @@ def atomic_cas(
     val_ir = val.ir_value(loc=loc, ip=ip)
 
     # * NVVM call based on nvvm version
-    if cutlass_dsl.target_version(exact_version="12.9"):
-        result = nvvm.atomicrmw(
-            cmp_type.mlir_type,
-            op=AtomicOpKind.CAS,
-            ptr=ptr,
-            a=val_ir,
-            b=cmp_ir,
-            mem_order=sem,
-            syncscope=scope,
-            loc=loc,
-            ip=ip,
-        )
-    else:
-        result = nvvm.atomicrmw(
-            op=AtomicOpKind.CAS,
-            ptr=ptr,
-            a=val_ir,
-            b=cmp_ir,
-            mem_order=sem,
-            syncscope=scope,
-            loc=loc,
-            ip=ip,
-        )
+    result = nvvm.atomicrmw(
+        op=AtomicOpKind.CAS,
+        ptr=ptr,
+        a=cmp_ir,
+        b=val_ir,
+        mem_order=sem,
+        syncscope=scope,
+        loc=loc,
+        ip=ip,
+    )
     return cmp_type(result)
 
 
@@ -3251,6 +3224,156 @@ def cvt_f4e2m1x8_to_f16x8(
     return vec_f16x8
 
 
+# Type alias for inline_ptx argument types
+ScalarArg = Union[int, float, bool, Numeric]
+
+
+@dsl_user_op
+def inline_ptx(
+    ptx_code: str,
+    *,
+    write_only_types: Optional[list[Type[Numeric]]] = None,
+    read_only_args: Optional[list[ScalarArg]] = None,
+    read_write_args: Optional[list[ScalarArg]] = None,
+    predicate: Optional[Boolean] = None,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> Union[Numeric, tuple[Numeric, ...], None]:
+    """
+    Inline PTX assembly code into the current kernel.
+
+    This function handles register size selection and sets the correct
+    read/write access for each operand automatically.
+
+    :param ptx_code: PTX assembly code string. Use named operand references (see below).
+    :type ptx_code: str
+    :param write_only_types: List of Numeric types for output-only values (corresponds to {$w0}, {$w1}, ... in PTX).
+    :type write_only_types: list[Type[Numeric]], optional
+    :param read_only_args: List of input values, each int, float, bool, or Numeric.
+                          Python primitives are automatically converted to Numeric types.
+                          These correspond to {$r0}, {$r1}, ... in PTX.
+    :type read_only_args: list[ScalarArg], optional
+    :param read_write_args: List of values that are both read and written (correspond to {$rw0}, {$rw1}, ... in PTX).
+    :type read_write_args: list[ScalarArg], optional
+    :param predicate: Optional Boolean value for conditional execution (corresponds to @$p prefix).
+    :type predicate: Boolean, optional
+    :param loc: MLIR location (advanced use, typically None).
+    :type loc: ir.Location, optional
+    :param ip: MLIR insertion point (advanced use, typically None).
+    :type ip: ir.InsertionPoint, optional
+
+    :return: None if no outputs; single Numeric if one output; or tuple of Numerics if multiple outputs are declared.
+    :rtype: None or Numeric or tuple[Numeric, ...]
+
+    **Operand Naming Convention:**
+
+    Use curly braces with prefixes inside the PTX to denote operand roles:
+
+        - {$r0}, {$r1}, ...    : Read-only operands (index 0, 1, ...)
+        - {$w0}, {$w1}, ...    : Write-only operands (index 0, 1, ...)
+        - {$rw0}, {$rw1}, ...  : Read-write operands (index 0, 1, ...)
+
+    **Examples:**
+
+    Read-only inputs (mbarrier init):
+
+    .. code-block:: python
+
+        barrier_ptr = ...  # pointer to mbarrier
+        count = Int32(1)
+        inline_ptx(
+            "mbarrier.init.b64 [{$r0}], {$r1};",
+            read_only_args=[barrier_ptr, count],
+        )
+
+    Write-only output (exponential approximation):
+
+    .. code-block:: python
+
+        input_val = Float32(2.0)
+        result = inline_ptx(
+            "ex2.approx.ftz.f32 {$w0}, {$r0};",
+            write_only_types=[Float32],
+            read_only_args=[input_val],
+        )
+
+    Max of two Int32 values:
+
+    .. code-block:: python
+
+        a = Int32(10)
+        b = Int32(20)
+        max_val = inline_ptx(
+            "max.s32 {$w0}, {$r0}, {$r1};",
+            write_only_types=[Int32],
+            read_only_args=[a, b],
+        )
+
+    Read-write operands:
+
+    .. code-block:: python
+
+        tidx = thread_idx()[0]
+        offset = tidx * Int32(4)
+        w0, w1 = inline_ptx(
+            ptx,  # user-supplied PTX string using {$w0}, {$w1}, {$r0}, {$r1}, {$rw0}, {$rw1}
+            write_only_types=[Float32, Float32],
+            read_only_args=[tidx, offset],
+            read_write_args=[arr[0], arr[1]],
+        )
+
+    .. note::
+        Refer to the `PTX ISA documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/>`_
+        for instruction syntax.
+    """
+    # Initialize empty lists if None
+    if write_only_types is None:
+        write_only_types = []
+    if read_only_args is None:
+        read_only_args = []
+    if read_write_args is None:
+        read_write_args = []
+
+    # Convert inputs to Numeric types or pass through raw IR values
+    def convert_arg(arg: Union[ir.Value, ScalarArg]) -> Union[ir.Value, Numeric]:
+        """Convert arg to Numeric, or pass through if already an ir.Value."""
+        if isinstance(arg, ir.Value):
+            return arg
+        return as_numeric(arg)
+
+    read_only_ir = [convert_arg(arg) for arg in read_only_args]
+    read_write_ir = [convert_arg(arg) for arg in read_write_args]
+
+    # Build write_only result types
+    write_only_mlir_types = [dtype.mlir_type for dtype in write_only_types]
+
+    # Get predicate IR value if provided
+    # The nvvm proxy will convert Numeric objects to ir.Value automatically
+    predicate_ir = None
+    if predicate is not None:
+        predicate_ir = as_numeric(predicate)
+        if not isinstance(predicate_ir, Boolean):
+            raise ValueError("Predicate must be a Boolean")
+
+    # Call nvvm.inline_ptx
+    results = nvvm.inline_ptx(
+        write_only_mlir_types,
+        read_only_ir,
+        read_write_ir,
+        ptx_code,
+        predicate=predicate_ir,
+        loc=loc,
+        ip=ip,
+    )
+
+    # Handle return value based on number of outputs
+    if len(write_only_types) == 0:
+        return None
+    elif len(write_only_types) == 1:
+        return write_only_types[0](results)
+    else:
+        # Multiple results - wrap each in its DSL type
+        return tuple(dtype(result) for dtype, result in zip(write_only_types, results))
 
 @dsl_user_op
 def smid(
@@ -5101,46 +5224,14 @@ def prefetch(
         PrefetchCacheLevel = _enhance_enum_with_str_mapping(nvvm.PrefetchCacheLevel)
         cache_level = PrefetchCacheLevel.from_str(cache_level)
 
-    if cutlass_dsl.target_version(min_version="13.2"):
-        nvvm.prefetch(
-            addr,
-            cache_level=cache_level,
-            evict_priority=evict_priority,
-            predicate=predicate,
-            tensormap=tensormap,
-            uniform=uniform,
-            in_param_space=in_param_space,
-            loc=loc,
-            ip=ip,
-        )
-    else:
-        # Fallback: inline PTX for builds without nvvm.prefetch op
-        if tensormap:
-            ptr_as_i64 = llvm.ptrtoint(T.i64(), addr, loc=loc, ip=ip)
-            llvm.inline_asm(
-                None,
-                [ptr_as_i64],
-                "prefetch.tensormap [$0];",
-                "l",
-                has_side_effects=True,
-                is_align_stack=False,
-                asm_dialect=llvm.AsmDialect.AD_ATT,
-                loc=loc,
-                ip=ip,
-            )
-        else:
-            level = "L1"
-            if cache_level is not None:
-                level = str(cache_level)
-            ptr_as_i64 = llvm.ptrtoint(T.i64(), addr, loc=loc, ip=ip)
-            llvm.inline_asm(
-                None,
-                [ptr_as_i64],
-                f"prefetch.global.{level} [$0];",
-                "l",
-                has_side_effects=True,
-                is_align_stack=False,
-                asm_dialect=llvm.AsmDialect.AD_ATT,
-                loc=loc,
-                ip=ip,
-            )
+    nvvm.prefetch(
+        addr,
+        cache_level=cache_level,
+        evict_priority=evict_priority,
+        predicate=predicate,
+        tensormap=tensormap,
+        uniform=uniform,
+        in_param_space=in_param_space,
+        loc=loc,
+        ip=ip,
+    )
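
Review note on the `inline_ptx` operand naming convention (`{$rN}`, `{$wN}`, `{$rwN}`) documented earlier in this file's diff: the sketch below is a standalone, hypothetical checker that only verifies each placeholder index is covered by the number of operands supplied. `check_ptx_operands` is an illustrative name, not part of CuTe DSL, and it makes no claim about how `nvvm.inline_ptx` actually resolves placeholders.

```python
import re

# Matches {$r0}, {$w1}, {$rw2}, ... as described in the inline_ptx docstring.
_PLACEHOLDER = re.compile(r"\{\$(rw|r|w)(\d+)\}")


def check_ptx_operands(ptx_code, n_write_only=0, n_read_only=0, n_read_write=0):
    """Hypothetical helper: return the placeholders whose index is out of range."""
    limits = {"w": n_write_only, "r": n_read_only, "rw": n_read_write}
    bad = []
    for kind, index in _PLACEHOLDER.findall(ptx_code):
        if int(index) >= limits[kind]:
            bad.append(f"{{${kind}{index}}}")
    return bad


# Two read-only inputs and one write-only output: every placeholder is covered.
ok = check_ptx_operands("max.s32 {$w0}, {$r0}, {$r1};", n_write_only=1, n_read_only=2)

# Only one read-only input supplied: {$r1} has no matching operand.
missing = check_ptx_operands("max.s32 {$w0}, {$r0}, {$r1};", n_write_only=1, n_read_only=1)
```

A check of this kind could run before the PTX string is handed to the compiler, so that a miscounted operand list fails early with a readable message.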
diff --git a/python/CuTeDSL/cutlass/cute/arch/smem.py b/python/CuTeDSL/cutlass/cute/arch/smem.py
index 77ba50e2d..c4a24af3d 100644
--- a/python/CuTeDSL/cutlass/cute/arch/smem.py
+++ b/python/CuTeDSL/cutlass/cute/arch/smem.py
@@ -20,6 +20,8 @@ from cutlass._mlir.dialects import nvvm, llvm
 
 from ..typing import Int, Int32, Pointer, Numeric, NumericMeta
 
+_AddressSpace = _cute_ir.AddressSpace
+
 
 @dsl_user_op
 def alloc_smem(
@@ -133,14 +135,73 @@ def map_dsmem_ptr(
     """
     dsmem_llvm_ptr = nvvm.mapa(
         llvm.PointerType.get(_cute_ir.AddressSpace.dsmem),
-        smem_ptr.to_llvm_ptr(loc=loc, ip=ip),  # type: ignore[attr-defined]
+        smem_ptr.to_llvm_ptr(loc=loc, ip=ip),
         Int32(cta_rank_in_cluster).ir_value(loc=loc, ip=ip),
         loc=loc,
         ip=ip,
     )
 
     intptr = llvm.ptrtoint(T.i32(), dsmem_llvm_ptr, loc=loc, ip=ip)
-    aligned_ty = _cute_ir.ConstrainedIntType.get(smem_ptr.alignment, 32)  # type: ignore[attr-defined]
+    aligned_ty = _cute_ir.ConstrainedIntType.get(smem_ptr.alignment, 32)
     aligned_intptr = _cute_ir.assume(aligned_ty, intptr, loc=loc, ip=ip)
 
     return _cute_ir.inttoptr(smem_ptr.type, aligned_intptr, loc=loc, ip=ip)
+
+
+@dsl_user_op
+def store_async_dsmem(
+    smem_ptr: Pointer,
+    value: Int,
+    mbar_ptr: Pointer,
+    peer_cta_rank: Int,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> None:
+    """
+    Asynchronous store to a remote CTA's shared memory via ``st.async.shared::cluster``.
+
+    The store completion is tracked by the mbarrier's transaction count
+    (``mbarrier::complete_tx::bytes``), allowing the caller to use a relaxed
+    mbarrier arrive.
+
+    :param smem_ptr:       Destination pointer in this CTA's shared memory; mapped to the peer CTA via ``nvvm.mapa``.
+    :param value:          The i32 value to store.
+    :param mbar_ptr:       Mbarrier pointer in this CTA's shared memory.
+                           Mapped to the peer CTA via ``nvvm.mapa``.
+    :param peer_cta_rank:  Target CTA rank in the cluster.
+    """
+    dsmem_ptr_ty = llvm.PointerType.get(_AddressSpace.dsmem)
+    cta_rank_ir = Int32(peer_cta_rank).ir_value(loc=loc, ip=ip)
+
+    dsmem_addr = nvvm.mapa(
+        dsmem_ptr_ty,
+        smem_ptr.to_llvm_ptr(loc=loc, ip=ip),
+        cta_rank_ir,
+        loc=loc,
+        ip=ip,
+    )
+
+    dsmem_mbar = nvvm.mapa(
+        dsmem_ptr_ty,
+        mbar_ptr.to_llvm_ptr(loc=loc, ip=ip),
+        cta_rank_ir,
+        loc=loc,
+        ip=ip,
+    )
+
+    addr_i32 = llvm.ptrtoint(T.i32(), dsmem_addr, loc=loc, ip=ip)
+    mbar_i32 = llvm.ptrtoint(T.i32(), dsmem_mbar, loc=loc, ip=ip)
+    value_ir = Int32(value).ir_value(loc=loc, ip=ip)
+
+    llvm.inline_asm(
+        None,
+        [addr_i32, value_ir, mbar_i32],
+        "st.async.shared::cluster.mbarrier::complete_tx::bytes.b32 [$0], $1, [$2];",
+        "r,r,r",
+        has_side_effects=True,
+        is_align_stack=False,
+        asm_dialect=llvm.AsmDialect.AD_ATT,
+        loc=loc,
+        ip=ip,
+    )
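
Review note on `store_async_dsmem`: the docstring says completion is tracked through the mbarrier's transaction count. The toy model below sketches that bookkeeping in plain Python, assuming each `.b32` store retires 4 bytes on the target mbarrier (my reading of the PTX ISA description of `mbarrier::complete_tx::bytes`). `TxCountModel` and its methods are hypothetical names, it ignores the mbarrier's pending-arrival count, and it is not a DSL kernel.

```python
BYTES_PER_B32_STORE = 4  # assumption: each .b32 st.async completes 4 bytes


class TxCountModel:
    """Toy model of an mbarrier transaction count (not real hardware state)."""

    def __init__(self):
        self.pending_bytes = 0

    def expect_tx(self, nbytes):
        # The receiving CTA announces how many bytes it expects to arrive.
        self.pending_bytes += nbytes

    def complete_tx(self, nbytes):
        # Each arriving asynchronous store retires its own bytes.
        self.pending_bytes -= nbytes

    def phase_done(self):
        # In this simplified model the phase completes when no bytes are pending.
        return self.pending_bytes == 0


num_peer_stores = 3
mbar = TxCountModel()
mbar.expect_tx(num_peer_stores * BYTES_PER_B32_STORE)
done_before = mbar.phase_done()
for _ in range(num_peer_stores):
    mbar.complete_tx(BYTES_PER_B32_STORE)
done_after = mbar.phase_done()
```

Under this assumption, a receiver waiting on N remote `.b32` stores would expect `4 * N` bytes before waiting on the barrier.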
diff --git a/python/CuTeDSL/cutlass/cute/atom.py b/python/CuTeDSL/cutlass/cute/atom.py
index 6bfea228b..c5f3642b2 100644
--- a/python/CuTeDSL/cutlass/cute/atom.py
+++ b/python/CuTeDSL/cutlass/cute/atom.py
@@ -12,7 +12,17 @@
 from abc import ABC, ABCMeta, abstractmethod
 from typing import Type, Union, Optional, Any, overload, List, Tuple
 
-from .typing import Shape, Layout, Tile, Tensor, Numeric, Int32
+from .typing import (
+    Shape,
+    Layout,
+    Tile,
+    Tiler,
+    Tensor,
+    Numeric,
+    Int32,
+    XTuple,
+    is_int_tuple_type,
+)
 from .core import (
     composition,
     coalesce,
@@ -32,7 +42,12 @@ from .tuple import product_each
 from .core import _unpack_x_tuple, _pack_shape, _pack_coord, _pack_tile
 from .tensor import _Tensor, make_tensor
 
-from cutlass.cutlass_dsl import extract_mlir_values, new_from_mlir_values, dsl_user_op
+from cutlass.cutlass_dsl import (
+    extract_mlir_attributes,
+    extract_mlir_values,
+    new_from_mlir_values,
+    dsl_user_op,
+)
 
 from cutlass._mlir import ir
 from cutlass._mlir.dialects import cute as _cute_ir
@@ -139,6 +154,16 @@ class Trait(ABC):
         return self.__class__(self.unpack(loc=loc, ip=ip, **kwargs))
 
 
+class TmaTrait(Trait):
+    """
+    Base class for all TMA traits, which provides the ``cute_nvgpu.grid_constant``
+    attribute for TMA arguments.
+    """
+
+    def __extract_mlir_attributes__(self) -> List[ir.Attribute]:
+        return [ir.DictAttr.get({"cute_nvgpu.grid_constant": ir.UnitAttr.get()})]
+
+
 def make_atom(
     ty: ir.Type,
     values: Optional[List[ir.Value]] = None,
@@ -180,6 +205,9 @@ class Atom(ABC):
     def __extract_mlir_values__(self) -> List[ir.Value]:
         return extract_mlir_values(self._trait) + extract_mlir_values(self._op)
 
+    def __extract_mlir_attributes__(self) -> List[ir.Attribute]:
+        return extract_mlir_attributes(self._trait) + extract_mlir_attributes(self._op)
+
     def __new_from_mlir_values__(self, values: List[ir.Value]) -> "Atom":
         traits_value = values[: len(extract_mlir_values(self._trait))]
         op_value = values[len(extract_mlir_values(self._trait)) :]
@@ -522,7 +550,7 @@ class TiledMma(MmaAtom):
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> Any:
+    ) -> XTuple:
         shape = _pack_shape(shape, loc=loc, ip=ip)
         return _unpack_x_tuple(
             _cute_ir.tiled_mma_partition_shape(
@@ -539,7 +567,7 @@ class TiledMma(MmaAtom):
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> Any:
+    ) -> XTuple:
         return self._partition_shape(_cute_ir.MmaOperand.A, shape_mk, loc=loc, ip=ip)
 
     @dsl_user_op
@@ -549,7 +577,7 @@ class TiledMma(MmaAtom):
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> Any:
+    ) -> XTuple:
         return self._partition_shape(_cute_ir.MmaOperand.B, shape_nk, loc=loc, ip=ip)
 
     @dsl_user_op
@@ -559,7 +587,7 @@ class TiledMma(MmaAtom):
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> Any:
+    ) -> XTuple:
         return self._partition_shape(_cute_ir.MmaOperand.C, shape_mn, loc=loc, ip=ip)
 
     #
@@ -739,8 +767,8 @@ def make_mma_atom(
 @dsl_user_op
 def make_tiled_mma(
     op_or_atom: Union[Op, MmaAtom],
-    atom_layout_mnk: Any = (1, 1, 1),
-    permutation_mnk: Any = None,
+    atom_layout_mnk: Union[Layout, Tuple[Any, ...]] = (1, 1, 1),
+    permutation_mnk: Optional[Tiler] = None,
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
@@ -777,7 +805,7 @@ def make_tiled_mma(
         permutation_mnk_ty = _pack_tile(permutation_mnk, loc=loc, ip=ip).type
     ty = _cute_nvgpu_ir.TiledMmaType.get(
         atom._trait.value.type,
-        atom_layout_mnk.type,
+        atom_layout_mnk.type,  # type: ignore[union-attr]
         permutation_mnk_ty,
     )
     val = _cute_ir.make_tiled_mma(ty, atom._trait.value, loc=loc, ip=ip)
@@ -878,11 +906,11 @@ class TiledCopy(CopyAtom):
     @dsl_user_op
     def retile(
         self,
-        src: Any,
+        src: Tensor,
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> Any:
+    ) -> Tensor:
         return _cute_ir.tiled_copy_retile(
             tiled_copy=self._trait.value,
             input=src.value,
@@ -978,13 +1006,13 @@ def make_copy_atom(
 
 
 def _make_tiled_copy(
-    atom: Any,
-    layout_tv: Any,
+    atom: CopyAtom,
+    layout_tv: Layout,
     tiler_mn: Any,
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
-) -> "TiledCopy":
+) -> TiledCopy:
     if type(tiler_mn) is tuple:
         tiler_mn = _pack_tile(tiler_mn, loc=loc, ip=ip)
 
@@ -1006,13 +1034,13 @@ def _make_tiled_copy(
 
 
 def make_tiled_copy(
-    atom: Any,
-    layout_tv: Any,
+    atom: CopyAtom,
+    layout_tv: Layout,
     tiler_mn: Any,
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
-) -> "TiledCopy":
+) -> TiledCopy:
     """Create a tiled type given a TV partitioner and tiler.
 
     :param atom: Copy atom, e.g. smit_copy and simt_async_copy, tma_load, etc.
@@ -1107,10 +1135,14 @@ def make_cotiled_copy(
     layout_tv_data = composition(inv_data_layout, atom_layout_tv, loc=loc, ip=ip)
 
     # check validity
+    atom_layout_tv_shape = atom_layout_tv.shape
+    atom_layout_tv_stride = atom_layout_tv.stride
+    assert isinstance(atom_layout_tv_shape, tuple)
+    assert isinstance(atom_layout_tv_stride, tuple)
     atom_layout_v_to_check = coalesce(
         make_layout(
-            atom_layout_tv.shape[1],  # type: ignore[index]
-            stride=atom_layout_tv.stride[1],  # type: ignore[index]
+            atom_layout_tv_shape[1],
+            stride=atom_layout_tv_stride[1],
             loc=loc,
             ip=ip,
         ),
@@ -1167,12 +1199,12 @@ def make_cotiled_copy(
 
 @dsl_user_op
 def make_tiled_copy_A(
-    atom: Any,
-    tiled_mma: Any,
+    atom: CopyAtom,
+    tiled_mma: TiledMma,
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
-) -> "TiledCopy":
+) -> TiledCopy:
     """Create a tiled copy out of the copy_atom that matches the A-Layout of tiled_mma.
 
     :param atom: Copy atom
@@ -1199,12 +1231,12 @@ def make_tiled_copy_A(
 
 @dsl_user_op
 def make_tiled_copy_B(
-    atom: Any,
-    tiled_mma: Any,
+    atom: CopyAtom,
+    tiled_mma: TiledMma,
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
-) -> "TiledCopy":
+) -> TiledCopy:
     """Create a tiled copy out of the copy_atom that matches the B-Layout of tiled_mma.
 
     :param atom: Copy atom
@@ -1231,12 +1263,12 @@ def make_tiled_copy_B(
 
 @dsl_user_op
 def make_tiled_copy_C(
-    atom: Any,
-    tiled_mma: Any,
+    atom: CopyAtom,
+    tiled_mma: TiledMma,
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
-) -> "TiledCopy":
+) -> TiledCopy:
     """Create a tiled copy out of the copy_atom that matches the C-Layout of tiled_mma.
 
     :param atom: Copy atom
@@ -1263,12 +1295,12 @@ def make_tiled_copy_C(
 
 @dsl_user_op
 def make_tiled_copy_S(
-    atom: Any,
-    tiled_copy: Any,
+    atom: CopyAtom,
+    tiled_copy: TiledCopy,
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
-) -> "TiledCopy":
+) -> TiledCopy:
     """Create a tiled copy out of the copy_atom that matches the Src-Layout of tiled_copy.
 
     :param atom: Copy atom
@@ -1291,12 +1323,12 @@ def make_tiled_copy_S(
 
 @dsl_user_op
 def make_tiled_copy_D(
-    atom: Any,
-    tiled_copy: Any,
+    atom: CopyAtom,
+    tiled_copy: TiledCopy,
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
-) -> "TiledCopy":
+) -> TiledCopy:
     """Create a tiled copy out of the copy_atom that matches the Dst-Layout of tiled_copy.
 
     :param atom: Copy atom
@@ -1324,7 +1356,7 @@ def make_tiled_copy_C_atom(
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
-) -> "TiledCopy":
+) -> TiledCopy:
     """Create the smallest tiled copy that can retile LayoutC_TV for use with pipelined epilogues with subtiled stores.
 
     :param atom: Copy atom
@@ -1452,7 +1484,8 @@ def copy_atom_call(
     - For copy with auxiliary operands, they contain the main tensor followed by
       auxiliary tensors. For example:
 
-      - For static load from tensor memory, ``dst`` = [data, stat].
+      - For static load to tensor memory, ``dst`` = [data, stat].
+      - For SPARSIFY, ``dst`` = [data, metadata].
       - For TMA gather4, ``src`` = [coord0, coord1, coord2, coord3] (four 2D coordinate tensors).
       - For TMA scatter4, ``dst`` = [coord0, coord1, coord2, coord3] (four 2D coordinate tensors).
 
@@ -1486,7 +1519,7 @@ def copy_atom_call(
         # Predicated copy atom operation
         cute.copy_atom_call(copy_atom, src, dst, pred=pred)
 
-        # Static load from tensor memory: load with row-wise reduction (MAX, MIN, MAXABS, MINABS)
+        # Static load to tensor memory: load with row-wise reduction (MAX, MIN, MAXABS, MINABS)
         cute.copy_atom_call(loadtm_stat_atom, src, [data, stat])
 
         # TMA gather4: combine four 2D coordinate tensors into single destination
@@ -1502,10 +1535,12 @@ def copy_atom_call(
         dst_list[0].type,  # type: ignore[attr-defined]
         _cute_ir.MemRefType,
     ):
-        if (
-            len(dst_list) == 1
-            and src_list[0].element_type.width != dst_list[0].element_type.width  # type: ignore[union-attr]
-        ):
+        src0_elem_type = src_list[0].element_type
+        dst0_elem_type = dst_list[0].element_type
+        assert not is_int_tuple_type(src0_elem_type) and not is_int_tuple_type(
+            dst0_elem_type
+        )
+        if len(dst_list) == 1 and src0_elem_type.width != dst0_elem_type.width:
             raise TypeError(
                 "`copy_atom_call` currently only supports equal source and destination "
                 "element type bit width"
@@ -1550,9 +1585,19 @@ def mma_atom_call(
 
     - For regular MMA, `a` and `b` contain the MMA A and B tensors respectively.
     - For MMA with auxiliary operands, `a` and `b` contain the MMA A and B tensors followed by
-      their respective auxiliary tensors. For example:
+      their respective auxiliary tensors.
 
-      - For BlockScaledMMA, `a` = [A, SFA] and `b` = [B, SFB].
+    Auxiliary operands examples:
+
+    - For BlockScaledMMA, `a` = [A, SFA] and `b` = [B, SFB].
+    - For SparseMMA, `a` = [A, E] and `b` = [B].
+    - For BlockScaledSparseMMA, `a` = [A, SFA, E] and `b` = [B, SFB].
+
+    Runtime keyword arguments in ``kwargs`` are forwarded to the atom trait's ``unpack`` logic.
+    For SM100 tcgen05 MMA atoms, you can pass ``disable_output_lane`` to control
+    per-lane output writes through ``tcgen05.mma.disable_output_lane`` lowering.
+    The expected mask length is 4 lanes for ``cta_group::1`` and 8 lanes for
+    ``cta_group::2``.
 
     :param atom: The MMA atom to execute
     :type atom: MmaAtom
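
Review note on the `mma_atom_call` docstring changes above: the operand-list conventions and the `disable_output_lane` mask lengths can be summarized as plain data. The sketch below only restates what the docstring says. The table names and `check_disable_output_lane` are hypothetical and not part of CuTe DSL, and no claim is made about what validation the library itself performs.

```python
# Operand list layouts (a, b) as described in the mma_atom_call docstring.
AUX_OPERANDS = {
    "MMA": (["A"], ["B"]),
    "BlockScaledMMA": (["A", "SFA"], ["B", "SFB"]),
    "SparseMMA": (["A", "E"], ["B"]),
    "BlockScaledSparseMMA": (["A", "SFA", "E"], ["B", "SFB"]),
}

# Expected disable_output_lane mask length per cta_group, per the docstring.
MASK_LEN = {1: 4, 2: 8}


def check_disable_output_lane(mask, cta_group):
    """Hypothetical check: raise if the mask length does not match cta_group."""
    expected = MASK_LEN[cta_group]
    if len(mask) != expected:
        raise ValueError(
            f"expected {expected} mask entries for cta_group::{cta_group}, "
            f"got {len(mask)}"
        )
    return True


a_operands, b_operands = AUX_OPERANDS["BlockScaledSparseMMA"]
valid = check_disable_output_lane([0, 0, 0, 0], cta_group=1)
```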
diff --git a/python/CuTeDSL/cutlass/cute/core.py b/python/CuTeDSL/cutlass/cute/core.py
index b1d401633..c2648551c 100644
--- a/python/CuTeDSL/cutlass/cute/core.py
+++ b/python/CuTeDSL/cutlass/cute/core.py
@@ -9,6 +9,7 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
+
 from functools import partial, reduce
 import inspect
 from inspect import isclass
@@ -25,6 +26,7 @@ from cutlass._mlir.dialects.cute import (
 from cutlass._mlir.dialects.cute import (
     Ratio as _Ratio,
     ScaledBasis as _ScaledBasis,
+    SparseElemType as _SparseElemType,
 )
 from cutlass._mlir.extras.types import MemRefType as BuiltinMemRefType
 from cutlass.cutlass_dsl import (
@@ -63,6 +65,7 @@ from .typing import (
     Tile,
     Tiler,
     XTuple,
+    _element_precision_width,
     is_int_tuple,
     is_integer,
 )
@@ -1513,7 +1516,10 @@ class _Pointer(Pointer):
         assert isinstance(value, ir.Value), f"Expected ir.Value, but got {type(value)}"
         self.value = value
 
-        if isinstance(value.type.value_type, _cute_nvgpu_ir.TmaDescriptorTiledType):
+        if isinstance(
+            value.type.value_type,
+            (_cute_ir.SparseElemType, _cute_nvgpu_ir.TmaDescriptorTiledType),
+        ):
             dtype = value.type.value_type
         self._dtype = dtype or Numeric.from_mlir_type(value.type.value_type)
 
@@ -1575,16 +1581,17 @@ class _Pointer(Pointer):
     ) -> Numeric:
         # LLVM doesn't support load/store narrow precision per element
         tmp_ty = self.dtype.mlir_type
-        if self.dtype is Boolean or self.dtype.width == 8:
+        element_precision_width = _element_precision_width(self.dtype)
+        if self.dtype is Boolean or element_precision_width == 8:
             tmp_ty = T.i8()
-        elif self.dtype.width < 8:
+        elif element_precision_width < 8:
             raise ValueError(
                 f"Loading narrow precision type {self.dtype} is not supported"
             )
 
         llvm_ptr = self.to_llvm_ptr(loc=loc, ip=ip)
         tmp_val = llvm.load(tmp_ty, llvm_ptr, loc=loc, ip=ip)
-        if self.dtype.width == 8:
+        if element_precision_width == 8:
             tmp_val = arith.bitcast(self.dtype.mlir_type, tmp_val, loc=loc, ip=ip)
 
         return self.dtype(tmp_val, loc=loc, ip=ip)
@@ -1606,9 +1613,10 @@ class _Pointer(Pointer):
             raise ValueError(f"Unsupported value type: {type(value)}")
         # LLVM doesn't support load/store narrow precision per element
         tmp_val = value.ir_value(loc=loc, ip=ip)
-        if self.dtype.width == 8:
+        element_precision_width = _element_precision_width(self.dtype)
+        if element_precision_width == 8:
             tmp_val = arith.bitcast(T.i8(), tmp_val, loc=loc, ip=ip)
-        elif self.dtype is not Boolean and self.dtype.width < 8:
+        elif self.dtype is not Boolean and element_precision_width < 8:
             raise ValueError(
                 f"Storing narrow precision type {self.dtype} is not supported"
             )
@@ -1854,6 +1862,14 @@ def _op_wrapper(
         return _Tensor(res, dtype=input.element_type, loc=loc, ip=ip)
     elif isinstance(input, _ComposedLayout):
         return op_fn(input.value, loc=loc, ip=ip)
+    elif (
+        not isinstance(input, ir.Value)
+        and hasattr(input, "value")
+        and isinstance(input.value, ir.Value)
+    ):
+        # support types with ViewTypeInterface defined outside of cute_ir
+        res = op_fn(input.value, loc=loc, ip=ip)
+        return type(input)(res)
     else:
         return op_fn(input, loc=loc, ip=ip)
 
@@ -1874,9 +1890,6 @@ def ModeOpDecorator(func: Any) -> Any:
             Initialize ModeOp.
             """
             self.func = func
-            # Functions like cute.size are written to take Lists.
-            # ModeOp works better with tuples.
-            # For now, handle the conversion internally.
             self.mode = (
                 tuple(mode)
                 if isinstance(mode, list)
@@ -2102,7 +2115,7 @@ def printf(
         arg0 = arg.value if isinstance(arg, Numeric) else arg
 
         if isinstance(arg0, ir.Value):
-            return arg0
+            return arg.ir_value(loc=loc, ip=ip) if isinstance(arg, Numeric) else arg0
         elif isinstance(arg0, bool):
             return const(arg0, Boolean)
         elif isinstance(arg0, int):
@@ -2309,7 +2322,8 @@ def rank(a: Union[XTuple, Layout, "ComposedLayout"], mode: List[int] = []) -> in
     This function is used in layout algebra to determine the dimensionality
     of tensors and layouts for operations like slicing and evaluation.
     """
-    if isinstance(a, (Layout, ComposedLayout, Tensor)):
+    # support types with ViewTypeInterface defined outside of cute_ir
+    if isinstance(a, (Layout, ComposedLayout, Tensor)) or hasattr(a, "shape"):
         return rank(a.shape, mode)
 
     # Guaranteed by ModeOpDecorator
@@ -3512,6 +3526,8 @@ def ceil_div(
             result = cute.ceil_div(input, tiler)
             print(result)  # Outputs: (4, 2)
     """
+    if isinstance(input, int) and isinstance(tiler, int):
+        return (input + tiler - 1) // tiler
     input_val = _pack_int_tuple(input, loc=loc, ip=ip)
     tiler_val = _pack_tile(tiler, loc=loc, ip=ip)
     res = _cute_ir.ceil_div(input=input_val, tiler=tiler_val, loc=loc, ip=ip)
@@ -4300,16 +4316,13 @@ def recast_ptr(
 
     value_ty = cvt_ty or ptr.type.value_type
     swizzle_attr = swizzle_.type.attribute if swizzle_ is not None else None
-    res_ty = _cute_ir.PtrType.get(value_ty, ptr.memspace, ptr.alignment, swizzle_attr)  # type: ignore[attr-defined]
+    res_ty = _cute_ir.PtrType.get(value_ty, ptr.memspace, ptr.alignment, swizzle_attr)
     return _cute_ir.recast_iter(res_ty, ptr.value, loc=loc, ip=ip)
 
 
 @dsl_user_op
 def make_ptr(
-    dtype: Union[
-        Type[Numeric],
-        None,
-    ],
+    dtype: Union[Type[Numeric], _SparseElemType],
     value: Union[int, Integer, ir.Value],
     mem_space: Optional[AddressSpace] = None,
     *,
@@ -4318,13 +4331,12 @@ def make_ptr(
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Pointer:
-    cvt_type = None
-    if dtype is not None:
-        if cvt_type is None:
-            if not isinstance(dtype, NumericMeta):
-                raise TypeError("expects dtype to be a type of Numeric")
-            cvt_type = dtype.mlir_type
-    if isinstance(value, ir.Value) and llvm.PointerType.isinstance(value.type):
+    cvt_type: Union[_SparseElemType, ir.Type, None] = None
+    if cvt_type is None:
+        if not isinstance(dtype, NumericMeta):
+            raise TypeError("expects dtype to be a type of Numeric")
+        cvt_type = dtype.mlir_type
+    if isinstance(value, ir.Value) and isinstance(value.type, llvm.PointerType):
         llvm_ptr_ty = llvm.PointerType(value.type)
         mem_space = AddressSpace(llvm_ptr_ty.address_space)
         value = llvm.ptrtoint(T.i64(), value)
@@ -4342,7 +4354,7 @@ def make_ptr(
     value = Int32(value) if is_tmem else Int64(value)
 
     # Set the alignment of the pointer
-    bytes_per_elt = max(1, dtype.width // 8)  # type: ignore[union-attr]
+    bytes_per_elt = max(1, dtype.width // 8)
     if assumed_align is None:
         assumed_align = bytes_per_elt
 
@@ -4357,7 +4369,7 @@ def make_ptr(
     )
 
     # Construct the pointer Type
-    data_ty = T.i8() if dtype is None else cvt_type
+    data_ty = cvt_type
     swizzle = swizzle_.type.attribute if swizzle_ is not None else None
 
     ptr_ty = _cute_ir.PtrType.get(data_ty, mem_space, assumed_align, swizzle)
@@ -4574,6 +4586,7 @@ def logical_product(
     assert rank(tiler_val) <= rank(block), "logical_product: Too many modes in tiler."
     tiler_rank = rank(tiler_val)
     block_rank = rank(block)
+    assert isinstance(tiler_val, tuple)
     res = tuple(
         logical_product(block[i], tiler_val[i]) if i < tiler_rank else block[i]  # type: ignore[index]
         for i in range(block_rank)
@@ -5486,7 +5499,7 @@ class struct:
         :ivar _size: The size of the MemRange.
         """
 
-        _dtype: Optional[Numeric] = None
+        _dtype: Optional[Type[Numeric]] = None
         _size: Optional[int] = None
 
         def __new__(
@@ -5519,11 +5532,14 @@ class struct:
 
         @property
         def elem_width(cls) -> int:
-            return cls._dtype.width if cls._dtype is not Boolean else 8  # type: ignore[union-attr]
+            dtype = cls._dtype
+            assert dtype is not None
+            return dtype.width if dtype is not Boolean else 8
 
         @property
         def size_in_bytes(cls) -> int:
-            return cls.size * cls.elem_width // 8  # type: ignore[operator]
+            assert cls.size is not None
+            return cls.size * cls.elem_width // 8
 
     class MemRange(metaclass=_MemRangeMeta):
         """
@@ -5542,7 +5558,10 @@ class struct:
         """
 
         def __init__(
-            self, dtype: Optional[Numeric], size: Optional[int], base: Optional[Pointer]
+            self,
+            dtype: Optional[Type[Numeric]],
+            size: Optional[int],
+            base: Optional[Pointer],
         ) -> None:
             """
             Initializes a new memory range.
@@ -5552,7 +5571,7 @@ class struct:
                          case the range can only be used for its address (e.g. as a partition marker).
             :param base: The base address of the memory range.
             """
-            self._dtype: Optional[Numeric] = dtype
+            self._dtype: Optional[Type[Numeric]] = dtype
             self._size: Optional[int] = size
             self._base: Optional[Pointer] = base
 
@@ -5642,9 +5661,10 @@ class struct:
             :raises AssertionError: If the index is out of range.
             """
             assert self._size is not None and (index >= 0) and (index < self._size)
+            assert self._dtype is not None
             ptr = self.data_ptr() + index
             ptr.store(
-                as_numeric(val).to(self._dtype),  # type: ignore[call-overload]
+                as_numeric(val).to(self._dtype),
                 loc=loc,
                 ip=ip,
             )
@@ -5805,6 +5825,24 @@ class struct:
         """
         return isinstance(dtype, type) and issubclass(dtype, Numeric)
 
+    @staticmethod
+    def _install_dynamic_expression_protocol(cls: type, decorator: Any) -> None:
+        type.__setattr__(
+            cls,
+            "__get_mlir_types__",
+            lambda self: self.base.__get_mlir_types__(),
+        )
+        type.__setattr__(
+            cls,
+            "__extract_mlir_values__",
+            lambda self: self.base.__extract_mlir_values__(),
+        )
+        type.__setattr__(
+            cls,
+            "__new_from_mlir_values__",
+            lambda self, values: decorator(self.base.__new_from_mlir_values__(values)),
+        )
+
     # calculate size and alignment
     def __init__(self, cls: type) -> None:
         """
@@ -5837,19 +5875,9 @@ class struct:
 
         type.__setattr__(self._cls, "__repr__", struct_repr)
 
-        # Implement the DynamicExpression protocol so struct instances can
-        # be threaded through DSL control flow (e.g. captured into the
-        # branches of an `scf.if` or the body of an `scf.while`). A struct
-        # instance is fully described by its `base` pointer; all field
-        # instances are re-derived from `base + offsets` on reconstruction.
-        decorator = self
-        type.__setattr__(self._cls, "__get_mlir_types__",
-                         lambda self: self.base.__get_mlir_types__())
-        type.__setattr__(self._cls, "__extract_mlir_values__",
-                         lambda self: self.base.__extract_mlir_values__())
-        type.__setattr__(self._cls, "__new_from_mlir_values__",
-                         lambda self, values:
-                             decorator(self.base.__new_from_mlir_values__(values)))
+        # A struct instance is fully described by its base pointer; fields are
+        # re-derived from static offsets when the struct is reconstructed.
+        struct._install_dynamic_expression_protocol(self._cls, self)
 
         # Calculate the offsets and alignment
         offset = 0
@@ -6061,6 +6089,10 @@ class union(struct):
 
         type.__setattr__(self._cls, "__repr__", union_repr)
 
+        # A union instance is fully described by its base pointer; fields are
+        # re-derived from offset zero when the union is reconstructed.
+        struct._install_dynamic_expression_protocol(self._cls, self)
+
         # Calculate the maximum size and alignment
         max_size = 0
         max_alignment = 1
@@ -6170,22 +6202,6 @@ class union(struct):
         return self._align_of
 
 
-# Deprecated usage but keep them to avoid breaking some examples uses `cute.core.ThrMma`
-
-from .atom import ThrCopy as _ThrCopy
-from .atom import ThrMma as _ThrMma
-
-
-@deprecated("cute.core.ThrMma is deprecated, use cute.ThrMma instead")
-class ThrMma(_ThrMma):
-    pass
-
-
-@deprecated("cute.core.ThrCopy is deprecated, use cute.ThrCopy instead")
-class ThrCopy(_ThrCopy):
-    pass
-
-
 #
 # FastDivmod operations for optimized division and modulus
 #
diff --git a/python/CuTeDSL/cutlass/cute/experimental/__init__.py b/python/CuTeDSL/cutlass/cute/experimental/__init__.py
index 2e3b59a87..2816063aa 100644
--- a/python/CuTeDSL/cutlass/cute/experimental/__init__.py
+++ b/python/CuTeDSL/cutlass/cute/experimental/__init__.py
@@ -22,3 +22,18 @@ from .math import *
 from .memory import *
 from .pipeline import *
 from .utils import *
+
+try:
+    from . import iket
+except ImportError:
+
+    class _IketUnavailable:
+        """Stub so that ``cute.experimental.iket.<anything>`` gives a clear error."""
+
+        def __getattr__(self, name: str) -> None:
+            raise ImportError(
+                "IKET (In-Kernel Event Tracing) is not available in this "
+                "installation. Reinstall nvidia-cutlass-dsl to restore it."
+            )
+
+    iket = _IketUnavailable()  # type: ignore[assignment]
diff --git a/python/CuTeDSL/cutlass/cute/experimental/algorithm.py b/python/CuTeDSL/cutlass/cute/experimental/algorithm.py
index 8977a0271..019d20b24 100644
--- a/python/CuTeDSL/cutlass/cute/experimental/algorithm.py
+++ b/python/CuTeDSL/cutlass/cute/experimental/algorithm.py
@@ -67,7 +67,7 @@ def partition(
 
 @dsl_user_op
 def partition_and_copy(
-    tiled_copy: cute.core.ThrCopy,
+    tiled_copy: cute.ThrCopy,
     src: cute.Tensor,
     dst: cute.Tensor,
     *,
@@ -138,3 +138,20 @@ def partition_and_copy(
             loc=loc,
             ip=ip,
         )
+
+
+@dsl_user_op
+def predicated_tensor_origin(
+    tensor: cute.Tensor,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> cute.Tensor:
+    """
+    Marks `tensor` as the origin (root) for predication/TMA bounds.
+
+    This is a semantic marker that lets the compiler unambiguously choose which
+    `!cute.memref` shape is used as the `predBounds` origin when automatically
+    constructing predicate tensors for OOB masking.
+    """
+    return cutlass_lir.PredicatedTensorOriginOp(tensor.value, loc=loc, ip=ip).result
diff --git a/python/CuTeDSL/cutlass/cute/experimental/core.py b/python/CuTeDSL/cutlass/cute/experimental/core.py
index bedf22a2a..43b0a41b8 100644
--- a/python/CuTeDSL/cutlass/cute/experimental/core.py
+++ b/python/CuTeDSL/cutlass/cute/experimental/core.py
@@ -9,7 +9,7 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-from typing import Optional, Protocol, TypeAlias
+from typing import Optional, Protocol, Sequence, TypeAlias, Union
 
 from cutlass.cutlass_dsl import dsl_user_op
 from cutlass._mlir.dialects import lir as cutlass_lir_ir, nvvm as _nvvm
@@ -17,7 +17,8 @@ from cutlass._mlir import ir
 from cutlass.cutlass_dsl import lru_cache_ir
 from cutlass._mlir.dialects.core import OperationTypeEnum
 from cutlass import cute
-from cutlass.cute.typing import Boolean
+from cutlass.cute.core import _pack_coord
+from cutlass.cute.typing import Boolean, Coord, Tensor
 
 
 class _SupportsIrValue(Protocol):
@@ -31,6 +32,25 @@ class _SupportsIrValue(Protocol):
 
 SkipWaitToken: TypeAlias = bool | ir.Value | _SupportsIrValue
 
+# A producer/consumer role may be a single operation type or a non-empty
+# sequence of them, for pipelines where a single stage is written (or read)
+# by multiple operation kinds. Order within a sequence is not significant
+# and duplicates are rejected.
+OperationTypeSpec: TypeAlias = Union[OperationTypeEnum, Sequence[OperationTypeEnum]]
+
+
+def _format_operation_types(spec: OperationTypeSpec) -> str:
+    """Render a single op type or a set of them for a `!lir.pipeline<...>`
+    type string."""
+    if isinstance(spec, OperationTypeEnum):
+        return str(spec)
+    items = list(spec)
+    if not items:
+        raise ValueError("operation type set must be non-empty")
+    if len(items) == 1:
+        return str(items[0])
+    return "[" + ", ".join(str(t) for t in items) + "]"
+
 
 @dsl_user_op
 def elect_sync(
@@ -51,6 +71,30 @@ def get_mbarrier(
     return cutlass_lir_ir.GetMbarrierOp(stage_token, loc=loc, ip=ip)
 
 
+@dsl_user_op
+def domain_offset(
+    coord: Coord,
+    tensor: Tensor,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> Tensor:
+    """Shift a tensor's logical domain while preserving its descriptor fields.
+
+    This is the IR-preserving counterpart to ``cute.domain_offset``. It keeps
+    the source-level domain-offset semantics explicit until the IR partitioning
+    passes can apply the offset to generated coordinate tensors.
+    """
+    coord_value = _pack_coord(coord, loc=loc, ip=ip)
+    op = cutlass_lir_ir.DomainOffsetOp(
+        tensor.value,
+        coord_value,
+        results=[tensor.value.type],
+        loc=loc,
+        ip=ip,
+    )
+    return op.result
+
+
 @ir.register_value_caster(cutlass_lir_ir.PipelineStateType.get_static_typeid())
 class PipelineState(ir.Value):
     def __init__(self, value: ir.Value) -> None:
@@ -115,9 +159,9 @@ def _normalize_create_pipeline_arrival_mask(
 
 
 def _build_pipeline(
-    stage: cute.Int32,
-    producer: OperationTypeEnum,
-    consumer: OperationTypeEnum,
+    stage: int,
+    producer: OperationTypeSpec,
+    consumer: OperationTypeSpec,
     producer_arv_count: cute.Int32,
     consumer_arv_count: cute.Int32,
     arrival_mask: Optional[cute.Int16],
@@ -129,10 +173,13 @@ def _build_pipeline(
     if isinstance(consumer_arv_count, int):
         consumer_arv_count = cute.Int32(consumer_arv_count)
 
+    producer_str = _format_operation_types(producer)
+    consumer_str = _format_operation_types(consumer)
+    pipeline_type_str = f"!lir.pipeline<{stage}, {producer_str} -> {consumer_str}>"
     if arrival_mask is not None:
         if isinstance(arrival_mask, int):
             arrival_mask = cute.Int16(arrival_mask)
-        result = ir.Type.parse(f"!lir.pipeline<{stage}, {producer} -> {consumer}>")
+        result = ir.Type.parse(pipeline_type_str)
         op = cutlass_lir_ir.CreatePipelineWithMaskOp(
             result,
             producer_arv_count.ir_value(),
@@ -142,7 +189,7 @@ def _build_pipeline(
             ip=ip,
         )
     else:
-        result = ir.Type.parse(f"!lir.pipeline<{stage}, {producer} -> {consumer}>")
+        result = ir.Type.parse(pipeline_type_str)
         op = cutlass_lir_ir.CreatePipelineOp(
             result,
             producer_arv_count.ir_value(),
@@ -165,9 +212,9 @@ def _build_pipeline(
 
 @dsl_user_op
 def create_pipeline(
-    stage: cute.Int32,
-    producer: OperationTypeEnum,
-    consumer: OperationTypeEnum,
+    stage: int,
+    producer: OperationTypeSpec,
+    consumer: OperationTypeSpec,
     producer_arv_count: cute.Int32,
     consumer_arv_count: cute.Int32,
     arrival_mask: Optional[cute.Int16] = None,
@@ -181,8 +228,10 @@ def create_pipeline(
 
     Args:
         stage: Number of pipeline stages.
-        producer: Producer operation type.
-        consumer: Consumer operation type.
+        producer: Producer operation type, or a non-empty sequence of them
+            for a multi-producer pipeline (e.g. ``[store to smem, SM100_COPY_R2T]``).
+        consumer: Consumer operation type, or a non-empty sequence of them
+            for a multi-consumer pipeline.
         producer_arv_count: Number of producer arrivals.
         consumer_arv_count: Number of consumer arrivals.
         arrival_mask: Optional arrival mask for multi-CTA synchronization
@@ -204,9 +253,9 @@ def create_pipeline(
 
 @dsl_user_op
 def create_pipeline_with_mask(
-    stage: cute.Int32,
-    producer: OperationTypeEnum,
-    consumer: OperationTypeEnum,
+    stage: int,
+    producer: OperationTypeSpec,
+    consumer: OperationTypeSpec,
     producer_arv_count: cute.Int32,
     consumer_arv_count: cute.Int32,
     arrival_mask: cute.Int16,
@@ -240,6 +289,38 @@ def pipeline_advance_iterator(
     return op.result
 
 
+@dsl_user_op
+def create_pipeline_state_at(
+    pipe: ir.Value,
+    stage: int,
+    stage_index: cute.Int32,
+    phase: cute.Int32,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> PipelineState:
+    """
+    Materializes a pipeline state for an explicit stage index and phase.
+
+    This is useful for retry agents that sweep stages out of iterator order.
+    The operation only packages state; it does not synchronize.
+    """
+    if isinstance(stage_index, int):
+        stage_index = cute.Int32(stage_index)
+    if isinstance(phase, int):
+        phase = cute.Int32(phase)
+
+    result = ir.Type.parse(f"!lir.pipeline_state<{stage}>")
+    op = cutlass_lir_ir.CreatePipelineStateAtOp(
+        result,
+        pipe,
+        stage_index.ir_value(),
+        phase.ir_value(),
+        loc=loc,
+        ip=ip,
+    )
+    return op.result
+
+
 @dsl_user_op
 def producer_acquire(
     pipe: ir.Value,
@@ -258,13 +339,23 @@ def producer_acquire(
 def producer_commit(
     pipe: ir.Value,
     state: ir.Value,
+    *,
+    elect_one_sync: Optional[bool] = None,
+    elect_leader_cta: Optional[bool] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> ir.Value:
     """
     Commits results to a pipeline.
     """
-    op = cutlass_lir_ir.ProducerCommitOp(pipe, state, loc=loc, ip=ip)
+    op = cutlass_lir_ir.ProducerCommitOp(
+        pipe,
+        state,
+        elect_one_sync=elect_one_sync,
+        elect_leader_cta=elect_leader_cta,
+        loc=loc,
+        ip=ip,
+    )
     return op.result
 
 
@@ -286,13 +377,23 @@ def consumer_wait(
 def consumer_release(
     pipe: ir.Value,
     state: ir.Value,
+    *,
+    elect_one_sync: Optional[bool] = None,
+    elect_leader_cta: Optional[bool] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> ir.Value:
     """
     Releases a pipeline that has been consumed.
     """
-    op = cutlass_lir_ir.ConsumerReleaseOp(pipe, state, loc=loc, ip=ip)
+    op = cutlass_lir_ir.ConsumerReleaseOp(
+        pipe,
+        state,
+        elect_one_sync=elect_one_sync,
+        elect_leader_cta=elect_leader_cta,
+        loc=loc,
+        ip=ip,
+    )
     return op.result
 
 
@@ -312,6 +413,21 @@ def consumer_release_elect_one_sync(
     return op.result
 
 
+
+@dsl_user_op
+def pipeline_tail(
+    pipe: ir.Value,
+    state: ir.Value,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> ir.Value:
+    """
+    Drains outstanding asynchronous pipeline work associated with the state.
+    """
+    op = cutlass_lir_ir.PipelineTailOp(pipe, state, loc=loc, ip=ip)
+    return op.result
+
+
 @dsl_user_op
 def consumer_tail(
     pipe: ir.Value,
@@ -320,9 +436,9 @@ def consumer_tail(
     ip: Optional[ir.InsertionPoint] = None,
 ) -> ir.Value:
     """
-    Called by the consumer to block until asynchronous tasks have completed.
+    Legacy alias for `pipeline_tail()`.
     """
-    op = cutlass_lir_ir.ConsumerTailOp(pipe, state, loc=loc, ip=ip)
+    op = cutlass_lir_ir.PipelineTailOp(pipe, state, loc=loc, ip=ip)
     return op.result
 
 
@@ -443,26 +559,25 @@ def create_circular_buffer_pipeline(
 def circular_buffer_pipeline_consume(
     pipeline: ir.Value,
     circular_buffer_pipeline_state: CircularBufferPipelineState,
+    *,
+    elect_leader_cta: Optional[bool] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> None:
     """
     Synchronize pipeline stages needed for circular buffer consumption.
 
-    This operation performs synchronization for the circular buffer consumer.
-    Based on the current circular buffer position and `count_per_iteration`, it
-    determines which pipeline stages need to be synchronized and waits for them
-    to transition to full before consumption can proceed.
-
     Args:
         pipeline: The underlying pipeline object
         circular_buffer_pipeline_state: Current circular buffer pipeline state
+        elect_leader_cta: When set, only the leader CTA performs the wait
         loc: Source location
         ip: Insertion point
     """
     cutlass_lir_ir.CircularBufferPipelineConsumeOp(
         pipeline,
         circular_buffer_pipeline_state,
+        elect_leader_cta=elect_leader_cta,
         loc=loc,
         ip=ip,
     )
@@ -472,26 +587,28 @@ def circular_buffer_pipeline_consume(
 def circular_buffer_pipeline_consumer_release(
     pipeline: ir.Value,
     circular_buffer_pipeline_state: CircularBufferPipelineState,
+    *,
+    elect_one_sync: Optional[bool] = None,
+    elect_leader_cta: Optional[bool] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> None:
     """
     Release pipeline stages after circular buffer consumption.
 
-    This operation releases pipeline stages after circular buffer consumption.
-    Based on the current circular buffer position and `count_per_iteration`, it
-    determines which pipeline stages have been fully consumed and transitions them
-    to empty.
-
     Args:
         pipeline: The underlying pipeline object
         circular_buffer_pipeline_state: Current circular buffer pipeline state
+        elect_one_sync: When set, only a single thread performs the release
+        elect_leader_cta: When set, only the leader CTA performs the release
         loc: Source location
         ip: Insertion point
     """
     cutlass_lir_ir.CircularBufferPipelineConsumerReleaseOp(
         pipeline,
         circular_buffer_pipeline_state,
+        elect_one_sync=elect_one_sync,
+        elect_leader_cta=elect_leader_cta,
         loc=loc,
         ip=ip,
     )
diff --git a/python/CuTeDSL/cutlass/cute/experimental/iket/__init__.py b/python/CuTeDSL/cutlass/cute/experimental/iket/__init__.py
new file mode 100644
index 000000000..56da7809e
--- /dev/null
+++ b/python/CuTeDSL/cutlass/cute/experimental/iket/__init__.py
@@ -0,0 +1,27 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# Use of this software is governed by the terms and conditions of the
+# NVIDIA End User License Agreement (EULA), available at:
+# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
+#
+# Any use, reproduction, disclosure, or distribution of this software
+# and related documentation outside the scope permitted by the EULA
+# is strictly prohibited.
+
+
+from .iket import *
+from .dag import DAG
+
+
+def dag(name: str) -> DAG:
+    """Create a :class:`DAG` for the given name.
+
+    Usage::
+
+        from cutlass.cute.experimental import iket
+
+        dag = iket.dag("gemm_fp16_2stage")
+        dag.edge("prologue", "mainloop", label="smem[0]", via="mbarrier")
+    """
+    return DAG(name)
diff --git a/python/CuTeDSL/cutlass/cute/experimental/iket/dag.py b/python/CuTeDSL/cutlass/cute/experimental/iket/dag.py
new file mode 100644
index 000000000..1ecf7b738
--- /dev/null
+++ b/python/CuTeDSL/cutlass/cute/experimental/iket/dag.py
@@ -0,0 +1,172 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# Use of this software is governed by the terms and conditions of the
+# NVIDIA End User License Agreement (EULA), available at:
+# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
+#
+# Any use, reproduction, disclosure, or distribution of this software
+# and related documentation outside the scope permitted by the EULA
+# is strictly prohibited.
+
+
+"""
+IKET DAG — lightweight dataflow dependency graph for kernel regions.
+
+Usage::
+
+    from cutlass.cute.experimental import iket
+
+    dag = iket.dag("gemm_fp16_2stage")
+    dag.edge("prologue_load", "MMA_mainloop", label="smem[0]", via="mbarrier")
+    dag.edge("TMA_load",      "MMA_mainloop", label="smem[s]", via="mbarrier")
+    dag.edge("MMA_mainloop",  "epilogue",     label="acc",     via="tcgen05_commit")
+
+    # Inside @cute.kernel — use existing iket.range_start / range_end as usual.
+    # The dag object is pure metadata; it never touches the IR.
+
+    dag.save("gemm_fp16_2stage.iket_dag.json")
+
+Regions are implicit — created automatically from edge endpoints.
+No declarations, no context managers, no extra indentation.
+"""
+
+import json
+import os
+from dataclasses import dataclass, asdict
+from typing import Optional, List, Set, Dict
+
+
+@dataclass
+class Edge:
+    """A directed dataflow edge between two regions."""
+
+    source: str
+    target: str
+    label: Optional[str] = None
+    via: Optional[str] = None
+
+
+class DAG:
+    """
+    A compile-time DAG describing dataflow between named kernel regions.
+
+    Regions are not declared explicitly — they are created implicitly the
+    first time they appear as an endpoint in :meth:`edge`.  At runtime the
+    kernel uses the existing ``iket.range_start`` / ``iket.range_end`` API
+    with matching region names; this class only records the dependency
+    metadata for post-processing (Perfetto trace arrows).
+
+    Parameters
+    ----------
+    name : str
+        Human-readable name for this DAG (e.g. ``"gemm_fp16_2stage"``).
+    """
+
+    def __init__(self, name: str):
+        self.name = name
+        self._edges: List[Edge] = []
+        self._regions: Set[str] = set()
+
+    # ------------------------------------------------------------------
+    # Public API
+    # ------------------------------------------------------------------
+
+    def edge(
+        self,
+        source: str,
+        target: str,
+        *,
+        label: Optional[str] = None,
+        via: Optional[str] = None,
+    ) -> "DAG":
+        """
+        Declare a dataflow dependency: *source* region produces data
+        consumed by *target* region.
+
+        Parameters
+        ----------
+        source : str
+            Name of the producing region (must match a ``range_start`` name).
+        target : str
+            Name of the consuming region.
+        label : str, optional
+            What flows along this edge (e.g. ``"smem_A"``, ``"accumulator"``).
+        via : str, optional
+            Synchronization mechanism (e.g. ``"mbarrier"``, ``"tcgen05_commit"``,
+            ``"barrier"``).
+
+        Returns
+        -------
+        DAG
+            ``self``, for chaining.
+        """
+        self._regions.add(source)
+        self._regions.add(target)
+        self._edges.append(Edge(source, target, label, via))
+        return self
+
+    @property
+    def regions(self) -> List[str]:
+        """Sorted list of region names (auto-populated from edges)."""
+        return sorted(self._regions)
+
+    @property
+    def edges(self) -> List[Edge]:
+        """All declared edges."""
+        return list(self._edges)
+
+    # ------------------------------------------------------------------
+    # Serialization
+    # ------------------------------------------------------------------
+
+    def to_dict(self) -> Dict:
+        """Return the DAG as a plain dict suitable for ``json.dump``."""
+        return {
+            "name": self.name,
+            "regions": self.regions,
+            "edges": [
+                {k: v for k, v in asdict(e).items() if v is not None}
+                for e in self._edges
+            ],
+        }
+
+    def to_json(self, indent: int = 2) -> str:
+        """Serialize the DAG to a JSON string."""
+        return json.dumps(self.to_dict(), indent=indent)
+
+    def save(self, path: Optional[str] = None) -> str:
+        """
+        Write the DAG JSON to disk.
+
+        Parameters
+        ----------
+        path : str, optional
+            Output file path.  Defaults to
+            ``$CUTE_DSL_DUMP_DIR/<self.name>.iket_dag.json`` if the env
+            var is set, otherwise ``<self.name>.iket_dag.json`` in cwd.
+
+        Returns
+        -------
+        str
+            The absolute path of the written file.
+        """
+        if path is None:
+            filename = f"{self.name}.iket_dag.json"
+            dump_dir = os.environ.get("CUTE_DSL_DUMP_DIR", "")
+            if dump_dir:
+                os.makedirs(dump_dir, exist_ok=True)
+                path = os.path.join(dump_dir, filename)
+            else:
+                path = filename
+
+        with open(path, "w") as f:
+            f.write(self.to_json())
+        return os.path.abspath(path)
+
+    # ------------------------------------------------------------------
+    # Dunder
+    # ------------------------------------------------------------------
+
+    def __repr__(self) -> str:
+        return f"DAG({self.name!r}, regions={self.regions}, edges={len(self._edges)})"
diff --git a/python/CuTeDSL/cutlass/cute/experimental/iket/iket.py b/python/CuTeDSL/cutlass/cute/experimental/iket/iket.py
new file mode 100644
index 000000000..649bcaf21
--- /dev/null
+++ b/python/CuTeDSL/cutlass/cute/experimental/iket/iket.py
@@ -0,0 +1,363 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# Use of this software is governed by the terms and conditions of the
+# NVIDIA End User License Agreement (EULA), available at:
+# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
+#
+# Any use, reproduction, disclosure, or distribution of this software
+# and related documentation outside the scope permitted by the EULA
+# is strictly prohibited.
+
+
+from typing import Any, List, Optional
+
+from cutlass._mlir.dialects import iket, arith
+from cutlass.cutlass_dsl import Boolean, dsl_user_op, if_generate, is_dynamic_expression
+from cutlass.base_dsl.typing import Numeric
+from cutlass._mlir import ir
+
+
+def _as_raw_ir_value(v: Any) -> Any:
+    # Pytree unflatten wraps raw ir.Values into Numeric for leaves marked
+    # is_numeric=True. RangeToken stores _is_none_val as a raw ir.Value, so
+    # unwrap any Numeric we get back before re-extracting — otherwise the
+    # next _flatten_mlir_values call silently drops it (it only matches
+    # ir.Value/list/dict) and the template/flat shape diverges across staged
+    # control-flow boundaries.
+    if isinstance(v, Numeric):
+        return v.ir_value()
+    return v
+
+
+def _infer_payload_signedness(payload: Any) -> Optional[str]:
+    """
+    Infers whether the payload's dtype is a signed or unsigned integer; returns None otherwise.
+    """
+    dtype = getattr(payload, "dtype", None)
+    if dtype is None:
+        return None
+    if not getattr(dtype, "is_integer", False):
+        return None
+    signed = getattr(dtype, "signed", None)
+    if signed is None:
+        return None
+    return "signed" if signed else "unsigned"
+
+
+def _coerce_imm_payload(
+    payload: Any,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> Any:
+    """
+    Convert a raw Python bool, int, or float immediate to an MLIR constant ir.Value.
+    bool -> i1 constant; int -> signless i32 constant; float -> f32 constant.
+    DSL typed values and ir.Value objects are returned unchanged.
+    """
+    if isinstance(payload, bool):
+        # bool is a subclass of int; treat as i1
+        i1 = ir.IntegerType.get_signless(1)
+        return arith.ConstantOp(
+            i1, ir.IntegerAttr.get(i1, int(payload)), loc=loc, ip=ip
+        ).result
+    if isinstance(payload, int):
+        i32 = ir.IntegerType.get_signless(32)
+        return arith.ConstantOp(
+            i32, ir.IntegerAttr.get(i32, payload), loc=loc, ip=ip
+        ).result
+    if isinstance(payload, float):
+        f32 = ir.F32Type.get()
+        return arith.ConstantOp(
+            f32, ir.FloatAttr.get(f32, payload), loc=loc, ip=ip
+        ).result
+    return payload
+
+
+def _attach_payload_signedness_attr(op: Any, payload: Any) -> Any:
+    """
+    Attaches a payload_signedness attribute to the given op if the payload has a signedness (signed/unsigned),
+    inferring it from the payload's dtype.
+    """
+    payload_signedness = _infer_payload_signedness(payload)
+    if payload_signedness is None:
+        return op
+    operation = getattr(op, "operation", None)
+    if operation is None:
+        operation = getattr(op, "owner", None)
+    if operation is None:
+        return op
+    operation.attributes["payload_signedness"] = ir.StringAttr.get(payload_signedness)
+    return op
+
+
+def _make_i1_constant(
+    value: bool,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> Any:
+    """Create an i1 MLIR constant (0 or 1)."""
+    i1 = ir.IntegerType.get_signless(1)
+    return arith.ConstantOp(
+        i1, ir.IntegerAttr.get(i1, int(value)), loc=loc, ip=ip
+    ).result
+
+
+class RangeToken:
+    """A token representing a range, carrying an ``!iket.range.token`` SSA value
+    and an ``i1`` flag indicating whether the token is a *none* (sentinel) token.
+
+    Implements the DynamicExpression protocol (``__extract_mlir_values__`` /
+    ``__new_from_mlir_values__``) so the DSL compiler can thread it through
+    ``scf.for`` as a loop-carried variable.
+    """
+
+    def __init__(self) -> None:
+        self.token = ir.Type.parse("!iket.range.token")
+        self.region = ir.InsertionPoint.current
+        self.op: Any = None
+        self._is_none_val: Any = None
+        self._event_name: Optional[str] = None
+
+    # -- DynamicExpression protocol ------------------------------------------
+
+    def __extract_mlir_values__(self) -> List[Any]:
+        return [self.op, _as_raw_ir_value(self._is_none_val)]
+
+    def __new_from_mlir_values__(self, values: List[Any]) -> "RangeToken":
+        new = RangeToken.__new__(RangeToken)
+        new.token = self.token
+        new.region = self.region
+        new.op = values[0]
+        new._is_none_val = _as_raw_ir_value(values[1])
+        new._event_name = self._event_name
+        return new
+
+
+@dsl_user_op
+def mark(
+    event_name: str,
+    payload: Any = None,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> Any:
+    """
+    Mark an event in the kernel, optionally with a payload value.
+
+    When used without a payload, generates a simple checkpoint marker.
+    When used with a payload, the payload is dumped alongside the timestamp
+    for analysis. Payload support requires the NativeDump instrumentation method.
+    Supported payload types: i8, i16, i32, f32 (32-bit), i64, and pointer (64-bit).
+
+    :param event_name: Name of the event to mark
+    :type event_name: str
+    :param payload: Optional payload value to attach to the event
+    :type payload: Any numeric type (Int8, Int16, Int32, Int64, Float32, Index), optional
+    :param loc: Source location for MLIR, defaults to None
+    :type loc: Optional[Location], optional
+    :param ip: Insertion point for MLIR, defaults to None
+    :type ip: Optional[InsertionPoint], optional
+    """
+
+    if payload is not None:
+        if is_dynamic_expression(payload):
+            payload_value = payload.ir_value(loc=loc, ip=ip)
+        else:
+            payload_value = _coerce_imm_payload(payload, loc=loc, ip=ip)
+        op = iket.mark(event_name, payload=payload_value, loc=loc, ip=ip)
+        return _attach_payload_signedness_attr(op, payload)
+    return iket.mark(event_name, loc=loc, ip=ip)
+
+
+@dsl_user_op
+def sentinel_token(
+    event_name: str,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> RangeToken:
+    """Create a sentinel range token for initializing loop-carried variables.
+
+    The returned :class:`RangeToken` carries a valid ``!iket.range.token`` SSA
+    value but is marked as sentinel (``is_none = true``).  When passed to
+    :func:`range_end`, no instrumentation IR is emitted at runtime.  The
+    ``event_name`` is required so that the correct event metadata is preserved
+    when the token flows through ``scf.for`` loop-carried variables.
+
+    Example::
+
+        mma_k_tile_token = cute.experimental.iket.sentinel_token("MMA_K_Tile")
+        for k_tile in cutlass.range(k_tile_cnt):
+            cute.experimental.iket.range_end(mma_k_tile_token)   # no-op on first iteration
+            mma_k_tile_token = cute.experimental.iket.range_start("MMA_K_Tile")
+            ...
+
+    :param event_name: The event name matching the :func:`range_start` that
+        will produce real tokens in the loop body.
+    :type event_name: str
+    :return: A sentinel RangeToken safe to pass to ``range_end``
+    :rtype: RangeToken
+    """
+
+    token = RangeToken()
+    token.op = iket.sentinel_token(token.token, loc=loc, ip=ip)
+    token._is_none_val = _make_i1_constant(True, loc=loc, ip=ip)
+    token._event_name = event_name
+    return token
+
+
+@dsl_user_op
+def range_start(
+    event_name: str,
+    payload: Any = None,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> RangeToken:
+    """
+    Mark the beginning of a range, optionally with a payload value.
+
+    The payload is dumped alongside the timestamp for analysis.
+    Payload support requires the NativeDump instrumentation method.
+    For a START_END range, the matching range_end must use the same payload
+    shape as range_start. Do not mix payload and no-payload forms, and do not
+    change payload width between the start and end of the same range.
+
+    :param event_name: Name of the event/range to start
+    :type event_name: str
+    :param payload: Optional payload value to attach to the range start event
+    :type payload: Any numeric type (Int8, Int16, Int32, Int64, Float32, Index), optional
+    :param loc: Source location for MLIR, defaults to None
+    :type loc: Optional[Location], optional
+    :param ip: Insertion point for MLIR, defaults to None
+    :type ip: Optional[InsertionPoint], optional
+    :return: A RangeToken to be used with range_end
+    :rtype: RangeToken
+    """
+
+    token = RangeToken()
+    if payload is not None:
+        if is_dynamic_expression(payload):
+            payload_value = payload.ir_value(loc=loc, ip=ip)
+        else:
+            payload_value = _coerce_imm_payload(payload, loc=loc, ip=ip)
+        token.op = iket.range_start(
+            token.token, event_name, payload=payload_value, loc=loc, ip=ip
+        )
+        _attach_payload_signedness_attr(token.op, payload)
+    else:
+        token.op = iket.range_start(token.token, event_name, loc=loc, ip=ip)
+    token._is_none_val = _make_i1_constant(False, loc=loc, ip=ip)
+    token._event_name = event_name
+    return token
+
+
+@dsl_user_op
+def range_end(
+    token: RangeToken,
+    payload: Any = None,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> None:
+    """
+    Mark the end of a range, optionally with a payload value.
+
+    If *token* is a sentinel token (from :func:`sentinel_token`), no
+    instrumentation IR is emitted.
+    For a START_END range, this payload must match the payload usage of the
+    corresponding range_start. If the range was started with a payload, end it
+    with a payload of the same type/width; if it was started without payload,
+    end it without payload.
+
+    :param token: The RangeToken from range_start (or sentinel_token)
+    :type token: RangeToken
+    :param payload: Optional payload value to attach to the range end event
+    :type payload: Any numeric type (Int8, Int16, Int32, Int64, Float32, Index), optional
+    :param loc: Source location for MLIR, defaults to None
+    :type loc: Optional[Location], optional
+    :param ip: Insertion point for MLIR, defaults to None
+    :type ip: Optional[InsertionPoint], optional
+    """
+
+    def _emit_range_end() -> None:
+        event_name_kwarg = {}
+        if token._event_name is not None:
+            event_name_kwarg["event_name"] = token._event_name
+
+        if payload is not None:
+            if is_dynamic_expression(payload):
+                payload_value = payload.ir_value(loc=loc, ip=ip)
+            else:
+                payload_value = _coerce_imm_payload(payload, loc=loc, ip=ip)
+            op = iket.range_end(
+                token.op,
+                payload=payload_value,
+                loc=loc,
+                ip=ip,
+                **event_name_kwarg,
+            )
+            _attach_payload_signedness_attr(op, payload)
+        else:
+            op = iket.range_end(token.op, loc=loc, ip=ip, **event_name_kwarg)
+
+    is_not_none = Boolean(token._is_none_val) == 0
+    if_generate(is_not_none, _emit_range_end, loc=loc, ip=ip)
+
+
+@dsl_user_op
+def range_push(
+    event_name: str,
+    payload: Any = None,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> Any:
+    """
+    Push a range onto the stack, optionally with a payload value.
+
+    This marks the start of a push/pop range using a stack-based model.
+    Unlike range_start/range_end which use SSA tokens for pairing, push/pop
+    ranges are paired using a LIFO stack - each pop matches the most recent push.
+
+    Payload support requires the NativeDump instrumentation method.
+
+    :param event_name: Name of the event to push
+    :type event_name: str
+    :param payload: Optional payload value to attach to the push event
+    :type payload: Any numeric type (Int8, Int16, Int32, Int64, Float32, Index), optional
+    :param loc: Source location for MLIR, defaults to None
+    :type loc: Optional[Location], optional
+    :param ip: Insertion point for MLIR, defaults to None
+    :type ip: Optional[InsertionPoint], optional
+    """
+
+    if payload is not None:
+        if is_dynamic_expression(payload):
+            payload_value = payload.ir_value(loc=loc, ip=ip)
+        else:
+            payload_value = _coerce_imm_payload(payload, loc=loc, ip=ip)
+        op = iket.range_push(event_name, payload=payload_value, loc=loc, ip=ip)
+        return _attach_payload_signedness_attr(op, payload)
+    return iket.range_push(event_name, loc=loc, ip=ip)
+
+
+@dsl_user_op
+def range_pop(
+    *, loc: Optional[ir.Location] = None, ip: Optional[ir.InsertionPoint] = None
+) -> Any:
+    """
+    Pop a range from the stack.
+
+    This marks the end of the most recent push/pop range.
+    Uses a reserved event ID (31) - no event name needed.
+
+    :param loc: Source location for MLIR, defaults to None
+    :type loc: Optional[Location], optional
+    :param ip: Insertion point for MLIR, defaults to None
+    :type ip: Optional[InsertionPoint], optional
+    """
+
+    return iket.range_pop(loc=loc, ip=ip)
diff --git a/python/CuTeDSL/cutlass/cute/experimental/math.py b/python/CuTeDSL/cutlass/cute/experimental/math.py
index 99b5a785d..1198e1f31 100644
--- a/python/CuTeDSL/cutlass/cute/experimental/math.py
+++ b/python/CuTeDSL/cutlass/cute/experimental/math.py
@@ -17,6 +17,23 @@ from cutlass._mlir import ir
 from cutlass._mlir.dialects import lir as cutlass_lir
 
 
+# MMA operands passed to `dot` / `dot_block_scaled` / ... must follow
+# the CuTe fragment contract `(MMA, REST_M, REST_K)` (rank >= 3), same as
+# the underlying `cute.gemm`. Partitioning an MMA tile often yields rank-2
+# A/B slices, so callers historically had to sprinkle
+# `cute.append_ones(..., up_to_rank=3)` at every call site. We do that here
+# so the user doesn't have to. Already-rank-3 (or higher) tensors pass
+# through unchanged.
+def _ensure_rank3(
+    t: cute.Tensor,
+    loc: Optional[ir.Location],
+    ip: Optional[ir.InsertionPoint],
+) -> cute.Tensor:
+    if cute.rank(t) < 3:
+        t = cute.append_ones(t, up_to_rank=3, loc=loc, ip=ip)
+    return t
+
+
 @dsl_user_op
 def dot_block_scaled(
     mma_atom: cute.MmaAtom,
@@ -28,6 +45,10 @@ def dot_block_scaled(
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> None:
+    a = _ensure_rank3(a, loc, ip)
+    b = _ensure_rank3(b, loc, ip)
+    sfa = _ensure_rank3(sfa, loc, ip)
+    sfb = _ensure_rank3(sfb, loc, ip)
     cutlass_lir.DotBlockScaledOp(
         mma_atom._unpack(),
         a.value,
@@ -49,6 +70,8 @@ def dot(
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> None:
+    a = _ensure_rank3(a, loc, ip)
+    b = _ensure_rank3(b, loc, ip)
     cutlass_lir.DotOp(
         mma_atom._unpack(),
         a.value,
diff --git a/python/CuTeDSL/cutlass/cute/experimental/memory.py b/python/CuTeDSL/cutlass/cute/experimental/memory.py
index a6f84d0db..9723bf42e 100644
--- a/python/CuTeDSL/cutlass/cute/experimental/memory.py
+++ b/python/CuTeDSL/cutlass/cute/experimental/memory.py
@@ -36,6 +36,78 @@ def _get_tma_load_kind(
     raise ValueError(f"Unsupported TMA operation type: {tma_operation_type}")
 
 
+def _is_tma_load_multicast_operation_type(
+    tma_operation_type: OperationTypeEnum,
+) -> bool:
+    return tma_operation_type in (
+        OperationTypeEnum.SM100_TMA_LOAD_2SM_MULTICAST,
+        OperationTypeEnum.SM90_TMA_LOAD_MULTICAST,
+    )
+
+
+def _populate_tma_load_common_kwargs(
+    kwargs: dict[str, object],
+    cta_v_map: Optional[cute.Layout],
+    internal_type: Optional[Type[cute.Numeric]],
+    update_expect_tx: bool,
+) -> None:
+    if cta_v_map is not None:
+        kwargs["cta_v_map"] = cta_v_map.type.attribute
+    # Map internal_type to tma_format per updated API
+    if internal_type is not None:
+        internal_mlir_ty = (
+            internal_type.mlir_type
+            if hasattr(internal_type, "mlir_type")
+            else internal_type
+        )
+        kwargs["tma_format"] = _cute_nvgpu_ir.TmaDataFormat(
+            _cute_nvgpu_ir.get_default_tma_format(internal_mlir_ty, False)
+        )
+
+    if update_expect_tx:
+        kwargs["update_expect_tx"] = True
+
+def _emit_tma_load(
+    src: cute.Tensor,
+    dst: cute.Tensor,
+    mbar: ir.Value,
+    *,
+    cta_v_map: Optional[cute.Layout],
+    tma_operation_type: OperationTypeEnum,
+    vmnk_layout: Optional[cute.Layout],
+    multicast_mode: Optional[int],
+    internal_type: Optional[Type[cute.Numeric]],
+    update_expect_tx: bool,
+    loc: Optional[ir.Location],
+    ip: Optional[ir.InsertionPoint],
+) -> None:
+    kind = _get_tma_load_kind(tma_operation_type)
+    multicast = _is_tma_load_multicast_operation_type(tma_operation_type)
+
+    kwargs = {
+        "kind": kind,
+        "loc": loc,
+        "ip": ip,
+    }
+    _populate_tma_load_common_kwargs(
+        kwargs,
+        cta_v_map,
+        internal_type,
+        update_expect_tx,
+    )
+
+    if multicast:
+        if vmnk_layout is None:
+            raise ValueError("tma_load multicast requires vmnk_layout")
+        if multicast_mode is None:
+            raise ValueError("tma_load multicast requires multicast_mode")
+        kwargs["vmnk_layout"] = vmnk_layout
+        kwargs["multicast_mode"] = multicast_mode
+        cutlass_lir.TmaLoadMulticastOp(src.value, dst.value, mbar, **kwargs)
+    else:
+        cutlass_lir.TmaLoadOp(src.value, dst.value, mbar, **kwargs)
+
+
 @dsl_user_op
 def allocate(
     type: Type[cute.Numeric],
@@ -67,8 +139,10 @@ def allocate(
 
     bit_layout = None
 
-    _is_passthrough_type = False
-    if not _is_passthrough_type:
+    # Handle SparseElemType (pass through) vs regular types (get mlir_type)
+    if isinstance(type, _cute_ir.SparseElemType):
+        pass
+    else:
         type = type.mlir_type
 
     ptr_ty = _cute_ir.PtrType.get(
@@ -76,7 +150,7 @@ def allocate(
         address_space,
         alignment,
         swizzle.type.attribute if swizzle else None,
-        None,
+        None,  # sparsity
         bit_layout.type.attribute if bit_layout else None,
     )
     buffer_type = _cute_ir.MemRefType.get(ptr_ty, layout.type)
@@ -98,6 +172,8 @@ def tma_load(
     *,
     cta_v_map: Optional[cute.Layout] = None,
     tma_operation_type: Optional[OperationTypeEnum] = None,
+    vmnk_layout: Optional[cute.Layout] = None,
+    multicast_mode: Optional[int] = None,
     internal_type: Optional[Type[cute.Numeric]] = None,
     update_expect_tx: bool = True,
     loc: Optional[ir.Location] = None,
@@ -108,35 +184,34 @@ def tma_load(
 
     update_expect_tx (bool): controls whether this operation increments the mbarrier's transaction bytes with the TMA copy size.
     When used with Cute DSL pipelines, it must be set to False as the pipeline already initializes the mbarrier's transaction bytes.
-    tma_operation_type (optional): specifies the TMA operation type (SM90_TMA_LOAD, SM100_TMA_LOAD_2SM, etc.)
+    tma_operation_type (optional): specifies the TMA operation type. If omitted, defaults to SM90_TMA_LOAD.
+    Required for multicast loads and for selecting SM100_TMA_LOAD_2SM.
+    vmnk_layout (optional): layout describing the cluster configuration. Required when tma_operation_type is multicast.
+    multicast_mode (optional): multicast projection mode. Required when tma_operation_type is multicast.
     internal_type (optional): selects the TMA transfer's internal element encoding used by hardware.
+    Does not change src/dst memref types. For structured sparsity, use base storage types:
+    Float16 for 2:4 FP16 sparse element type, Uint8 for 8:1 uint8 sparse element type.
     """
-    if tma_operation_type is not None:
-        kind = _get_tma_load_kind(tma_operation_type)
-    else:
-        kind = _cute_ir.TiledTmaLoadEnum.sm_90
+    if tma_operation_type is None:
+        if vmnk_layout is not None or multicast_mode is not None:
+            raise ValueError(
+                "tma_load requires tma_operation_type when multicast kwargs are provided"
+            )
+        tma_operation_type = OperationTypeEnum.SM90_TMA_LOAD
 
-    kwargs = {
-        "kind": kind,
-        "loc": loc,
-        "ip": ip,
-    }
-    if cta_v_map is not None:
-        kwargs["cta_v_map"] = cta_v_map.type.attribute
-    # Map internal_type to tma_format per updated API
-    if internal_type is not None:
-        internal_mlir_ty = (
-            internal_type.mlir_type
-            if hasattr(internal_type, "mlir_type")
-            else internal_type
-        )
-        kwargs["tma_format"] = _cute_nvgpu_ir.TmaDataFormat(
-            _cute_nvgpu_ir.get_default_tma_format(internal_mlir_ty, False)
-        )
-
-    if update_expect_tx:
-        kwargs["update_expect_tx"] = True
-    cutlass_lir.TmaLoadOp(src.value, dst.value, mbar, **kwargs)
+    _emit_tma_load(
+        src,
+        dst,
+        mbar,
+        cta_v_map=cta_v_map,
+        tma_operation_type=tma_operation_type,
+        vmnk_layout=vmnk_layout,
+        multicast_mode=multicast_mode,
+        internal_type=internal_type,
+        update_expect_tx=update_expect_tx,
+        loc=loc,
+        ip=ip,
+    )
 
 
 @dsl_user_op
@@ -156,6 +231,9 @@ def tma_load_multicast(
     """
     Copies a tensor pointed by a !cute.memref into a Buffer using TMA with multicast.
 
+    Deprecated: use tma_load with a multicast tma_operation_type plus
+    vmnk_layout and multicast_mode. This wrapper is kept for existing kernels.
+
     :param src: Source tensor in global memory
     :param dst: Destination tensor in shared memory
     :param mbar: Memory barrier for synchronization
@@ -165,24 +243,21 @@ def tma_load_multicast(
     :param multicast_mode: Multicast projection mode (1=column, 2=row)
     :param update_expect_tx: Whether to update expected transaction bytes
     """
-    kind = _get_tma_load_kind(tma_operation_type)
-    kwargs = {
-        "kind": kind,
-        "vmnk_layout": vmnk_layout,
-        "multicast_mode": multicast_mode,
-        "loc": loc,
-        "ip": ip,
-    }
-    if cta_v_map is not None:
-        kwargs["cta_v_map"] = cta_v_map.type.attribute
+    if not _is_tma_load_multicast_operation_type(tma_operation_type):
+        raise ValueError("tma_load_multicast requires a multicast tma_operation_type")
 
-    if update_expect_tx:
-        kwargs["update_expect_tx"] = True
-    cutlass_lir.TmaLoadMulticastOp(
-        src.value,
-        dst.value,
+    _emit_tma_load(
+        src,
+        dst,
         mbar,
-        **kwargs,
+        cta_v_map=cta_v_map,
+        tma_operation_type=tma_operation_type,
+        vmnk_layout=vmnk_layout,
+        multicast_mode=multicast_mode,
+        internal_type=None,
+        update_expect_tx=update_expect_tx,
+        loc=loc,
+        ip=ip,
     )
 
 
@@ -200,6 +275,8 @@ def tma_store(
     Copies a tensor from a Buffer to a tensor pointed to by a !cute.memref.
 
     internal_type (optional): selects the TMA transfer's internal element encoding used by hardware.
+    Does not change src/dst memref types. For structured sparsity, use base storage types:
+    Float16 for 2:4 FP16 sparse element type, Uint8 for 8:1 uint8 sparse element type.
     """
 
     kwargs = {
@@ -223,6 +300,55 @@ def tma_store(
     cutlass_lir.TmaStoreOp(src.value, dst.value, **kwargs)
 
 
+@dsl_user_op
+def tma_reduce_store(
+    src: cute.Tensor,
+    dst: cute.Tensor,
+    *,
+    kind: cute.ReductionKind,
+    cta_v_map: Optional[cute.Layout] = None,
+    internal_type: Optional[Type[cute.Numeric]] = None,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> None:
+    """
+    Atomically reduce a tensor from an SMEM Buffer into a tensor pointed to by
+    a !cute.memref via TMA. Lowers to ``lir.tma_reduce_store`` and ultimately to
+    ``cp.reduce.async.bulk.tensor.<rank>d.global.shared::cta.<kind>.tile``.
+
+    :param kind: reduction kind. Use ``cute.ReductionKind.<X>``. Supported
+        kinds: ADD / MIN / MAX / INC / DEC / AND / OR / XOR.
+    :param cta_v_map: optional CTA-v map layout. If omitted, the layout is
+        inferred from the destination GMEM memref.
+    :param internal_type: optional override for the TMA transfer's internal
+        element encoding; see :func:`tma_store` for semantics.
+    """
+    if not isinstance(kind, cute.ReductionKind):
+        raise TypeError(
+            f"tma_reduce_store: unsupported kind {kind!r}; "
+            "expected cute.ReductionKind"
+        )
+    kwargs = {
+        "kind": ir.Attribute.parse(f"#cute_nvgpu.tma_reduce_kind<{kind.name}>"),
+        "loc": loc,
+        "ip": ip,
+    }
+    if cta_v_map is not None:
+        kwargs["cta_v_map"] = cta_v_map.type.attribute
+
+    if internal_type is not None:
+        internal_mlir_ty = (
+            internal_type.mlir_type
+            if hasattr(internal_type, "mlir_type")
+            else internal_type
+        )
+        kwargs["tma_format"] = _cute_nvgpu_ir.TmaDataFormat(
+            _cute_nvgpu_ir.get_default_tma_format(internal_mlir_ty, False)
+        )
+
+    cutlass_lir.TmaReduceStoreOp(src.value, dst.value, **kwargs)
+
+
 @dsl_user_op
 def copy(
     src: cute.Tensor,
diff --git a/python/CuTeDSL/cutlass/cute/experimental/pipeline.py b/python/CuTeDSL/cutlass/cute/experimental/pipeline.py
index 0850823e2..93280ea23 100644
--- a/python/CuTeDSL/cutlass/cute/experimental/pipeline.py
+++ b/python/CuTeDSL/cutlass/cute/experimental/pipeline.py
@@ -22,6 +22,7 @@ import cutlass.cute as cute
 from cutlass.base_dsl.typing import Int32
 from cutlass._mlir import ir
 from cutlass._mlir.dialects import lir as cutlass_lir_ir
+from cutlass.cutlass_dsl import extract_mlir_values
 from cutlass._mlir.dialects.core import OperationTypeEnum
 from cutlass.cute.typing import Boolean
 from cutlass.cute.experimental.core import (
@@ -29,27 +30,27 @@ from cutlass.cute.experimental.core import (
     producer_acquire,
     get_pipeline_produce_stage,
     get_pipeline_consume_stage,
+    get_mbarrier,
     producer_commit,
     consumer_release,
     pipeline_advance_iterator,
     PipelineState,
     consumer_wait,
-    consumer_tail,
+    create_pipeline_state_at,
+    pipeline_tail,
     create_circular_buffer_pipeline,
     circular_buffer_pipeline_consume,
     circular_buffer_pipeline_consumer_release,
     circular_buffer_pipeline_advance_iterator,
-    mbarrier_expect_tx,
     normalize_skip_wait_token,
     producer_try_acquire,
     consumer_try_wait,
+    OperationTypeSpec,
     SkipWaitToken,
 )
 
 from cutlass.cutlass_dsl import CuteExperimentalDSL
 
-from ..typing import Pointer
-
 
 class GenericPipelineBase:
     """Base class for pipeline convenience wrappers"""
@@ -150,18 +151,38 @@ class GenericPipelineBase:
         )
         return stage_token, stage_idx
 
-    def producer_commit(self) -> "GenericPipelineBase":
+    def producer_commit(
+        self,
+        *,
+        elect_one_sync: Optional[bool] = None,
+        elect_leader_cta: Optional[bool] = None,
+    ) -> "GenericPipelineBase":
         """Commit producer state."""
-        producer_commit(self.raw_pipeline, self.producer_state)
+        producer_commit(
+            self.raw_pipeline,
+            self.producer_state,
+            elect_one_sync=elect_one_sync,
+            elect_leader_cta=elect_leader_cta,
+        )
         return self
 
     def consumer_try_wait(self, *, token: Optional[SkipWaitToken] = None) -> Boolean:
         """Try to wait for the next consumer stage without blocking."""
         return consumer_try_wait(self.raw_pipeline, self.consumer_state, token=token)
 
-    def consumer_release(self) -> "GenericPipelineBase":
+    def consumer_release(
+        self,
+        *,
+        elect_one_sync: Optional[bool] = None,
+        elect_leader_cta: Optional[bool] = None,
+    ) -> "GenericPipelineBase":
         """Release consumer state."""
-        consumer_release(self.raw_pipeline, self.consumer_state)
+        consumer_release(
+            self.raw_pipeline,
+            self.consumer_state,
+            elect_one_sync=elect_one_sync,
+            elect_leader_cta=elect_leader_cta,
+        )
         return self
 
     def producer_commit_and_advance(self) -> "GenericPipelineBase":
@@ -217,15 +238,28 @@ class GenericPipelineBase:
         )
         return self
 
+    def producer_tail(self) -> "GenericPipelineBase":
+        """Drain outstanding pipeline work visible to the producer state."""
+        self.producer_state = pipeline_tail(self.raw_pipeline, self.producer_state)
+        return self
+
     def consumer_tail(self) -> "GenericPipelineBase":
-        """Combined consumer tail with automatic elect_one using internal state."""
-        consumer_tail(self.raw_pipeline, self.consumer_state)
+        """Drain outstanding pipeline work visible to the consumer state."""
+        self.consumer_state = pipeline_tail(self.raw_pipeline, self.consumer_state)
         return self
 
     def increment_state(self, state: PipelineState) -> ir.Value:
         """Advance the input state w/o modifying current pipeline"""
         return pipeline_advance_iterator(self.raw_pipeline, state)
 
+    def create_state_at(
+        self, stage_index: cute.Int32, phase: cute.Int32
+    ) -> PipelineState:
+        """Create a pipeline state for an explicit stage and phase."""
+        return create_pipeline_state_at(
+            self.raw_pipeline, self.num_stages, stage_index, phase
+        )
+
 
 class GenericPipeline(GenericPipelineBase):
     """
@@ -299,11 +333,32 @@ def _is_multicast_tma_operation_type(operation_type: OperationTypeEnum) -> bool:
     ]
 
 
+def _is_2sm_tma_operation_type(operation_type: OperationTypeEnum) -> bool:
+    """Check if the operation type is a 2SM TMA load."""
+    return operation_type in [
+        OperationTypeEnum.SM100_TMA_LOAD_2SM,
+        OperationTypeEnum.SM100_TMA_LOAD_2SM_MULTICAST,
+        OperationTypeEnum.SM100_TMA_LOAD_2SM_IM2COL,
+        OperationTypeEnum.SM100_TMA_LOAD_2SM_IM2COL_MULTICAST,
+    ]
+
+
 class TMAToUMMAPipeline(GenericPipelineBase):
     """
     Pipeline for TMA to UMMA.
     """
 
+    def __init__(
+        self,
+        raw_pipeline: ir.Value,
+        num_stages: cute.Int32,
+        producer_state: ir.Value,
+        consumer_state: ir.Value,
+        is_2sm: bool = False,
+    ) -> None:
+        super().__init__(raw_pipeline, num_stages, producer_state, consumer_state)
+        self.is_2sm = is_2sm
+
     @staticmethod
     def create(
         *,
@@ -378,11 +433,13 @@ class TMAToUMMAPipeline(GenericPipelineBase):
             )
         else:
             raise ValueError(f"Invalid tma_operation_type: {tma_operation_type}")
+        is_2sm = _is_2sm_tma_operation_type(tma_operation_type)
         return TMAToUMMAPipeline(
             raw_pipeline,
             num_stages,
             producer_state,
             consumer_state,
+            is_2sm=is_2sm,
         )
 
     @staticmethod
@@ -422,7 +479,9 @@ class TMAToUMMAPipeline(GenericPipelineBase):
 
         # For 2CTA MMA (v-size==2), the peer CTA is the other v-slice (xor 1).
         # For 1CTA MMA (v-size==1), the peer is the local CTA (no flip).
-        v_size = cute.size(cluster_layout_vmnk.shape[0])  # type: ignore[index]
+        cluster_shape_vmnk = cluster_layout_vmnk.shape
+        assert isinstance(cluster_shape_vmnk, tuple)
+        v_size = cute.size(cluster_shape_vmnk[0])
         peer_v = (
             (cta_in_cluster_coord_vmnk[0] ^ 1)
             if cutlass.const_expr(v_size > 1)
@@ -456,8 +515,8 @@ class TMAToUMMAPipeline(GenericPipelineBase):
             arrival_mask_a | arrival_mask_a_peer | arrival_mask_b | arrival_mask_b_peer
         )
 
-        num_mcast_ctas_a = cute.size(cluster_layout_vmnk.shape[2])  # type: ignore[index]
-        num_mcast_ctas_b = cute.size(cluster_layout_vmnk.shape[1])  # type: ignore[index]
+        num_mcast_ctas_a = cute.size(cluster_shape_vmnk[2])
+        num_mcast_ctas_b = cute.size(cluster_shape_vmnk[1])
         num_mcast_participants = num_mcast_ctas_a + num_mcast_ctas_b - 1
 
         raw_pipeline, producer_state, consumer_state = create_pipeline(
@@ -468,26 +527,41 @@ class TMAToUMMAPipeline(GenericPipelineBase):
             consumer_arv_count=num_mcast_participants,
             arrival_mask=arrival_mask_c,
         )
+        is_2sm = _is_2sm_tma_operation_type(tma_operation_type)
         return TMAToUMMAPipeline(
-            raw_pipeline, num_stages, producer_state, consumer_state
+            raw_pipeline,
+            num_stages,
+            producer_state,
+            consumer_state,
+            is_2sm=is_2sm,
         )
 
-    def producer_commit(self) -> "TMAToUMMAPipeline":
+    def producer_commit(
+        self,
+        *,
+        elect_one_sync: Optional[bool] = None,
+        elect_leader_cta: Optional[bool] = None,
+    ) -> "TMAToUMMAPipeline":
         """
         Commit producer state.
 
         For 2SM MMA, only leader CTA commits during production as MMA
-        is issued by leader. Compiler generates the if-leader-cta-branch
-        internally to preserve a symmetric acquire-commit pattern.
+        is issued by leader. The elect_one_sync and elect_leader_cta arguments are ignored here.
         """
-        with cute.arch.elect_one():
-            super().producer_commit()
+        super().producer_commit(
+            elect_one_sync=True,
+            elect_leader_cta=True if self.is_2sm else None,
+        )
         return self
 
-    def consumer_release(self) -> "TMAToUMMAPipeline":
+    def consumer_release(
+        self,
+        *,
+        elect_one_sync: Optional[bool] = None,
+        elect_leader_cta: Optional[bool] = None,
+    ) -> "TMAToUMMAPipeline":
         """Release consumer state."""
-        with cute.arch.elect_one():
-            super().consumer_release()
+        super().consumer_release(elect_one_sync=True)
         return self
 
 
@@ -617,12 +691,18 @@ class TMAToUMMACircularPipeline(TMAToUMMAPipeline):
         circular_buffer_pipeline_consume(self.raw_pipeline, self.circular_buffer_state)
         return self
 
-    def consumer_release(self) -> "TMAToUMMACircularPipeline":
+    def consumer_release(
+        self,
+        *,
+        elect_one_sync: Optional[bool] = None,
+        elect_leader_cta: Optional[bool] = None,
+    ) -> "TMAToUMMACircularPipeline":
         """Release consumer state (uses circular buffer consumer release)."""
-        with cute.arch.elect_one():
-            circular_buffer_pipeline_consumer_release(
-                self.raw_pipeline, self.circular_buffer_state
-            )
+        circular_buffer_pipeline_consumer_release(
+            self.raw_pipeline,
+            self.circular_buffer_state,
+            elect_one_sync=True,
+        )
         return self
 
     def consumer_release_and_advance(self) -> "TMAToUMMACircularPipeline":
@@ -685,10 +765,14 @@ class TMAToAsyncPipeline(GenericPipelineBase):
             consumer_state,
         )
 
-    def producer_commit(self) -> "TMAToAsyncPipeline":
+    def producer_commit(
+        self,
+        *,
+        elect_one_sync: Optional[bool] = None,
+        elect_leader_cta: Optional[bool] = None,
+    ) -> "TMAToAsyncPipeline":
         """Commit producer state."""
-        with cute.arch.elect_one():
-            super().producer_commit()
+        super().producer_commit(elect_one_sync=True)
         return self
 
 
@@ -701,12 +785,24 @@ class AsyncToUMMAPipeline(GenericPipelineBase):
     def create(
         *,
         num_stages: cute.Int32,
-        producer: OperationTypeEnum,
+        producer: OperationTypeSpec,
         producer_arv_count: cute.Int32,
         mma_operation_type: OperationTypeEnum,
     ) -> "AsyncToUMMAPipeline":
         """
         Create a * (except TMA) to UMMA pipeline.
+
+        Args:
+            num_stages: Number of pipeline stages.
+            producer: Producer operation type, or a non-empty sequence of
+                them when a single pipeline stage is written by multiple
+                producer kinds (e.g. ``[store to smem, SM100_COPY_R2T]`` when
+                transform warps stage one operand into TMEM and another
+                into SMEM under a single producer-commit). Each producer
+                kind contributes its own commit-side fence at lowering
+                time.
+            producer_arv_count: Number of producer arrivals.
+            mma_operation_type: UMMA operation type.
         """
         _validate_umma_operation_type(
             mma_operation_type,
@@ -729,10 +825,14 @@ class AsyncToUMMAPipeline(GenericPipelineBase):
             consumer_state,
         )
 
-    def consumer_release(self) -> "AsyncToUMMAPipeline":
+    def consumer_release(
+        self,
+        *,
+        elect_one_sync: Optional[bool] = None,
+        elect_leader_cta: Optional[bool] = None,
+    ) -> "AsyncToUMMAPipeline":
         """Release consumer state."""
-        with cute.arch.elect_one():
-            super().consumer_release()
+        super().consumer_release(elect_one_sync=True)
         return self
 
 
@@ -817,17 +917,21 @@ class UMMAtoAsyncPipeline(GenericPipelineBase):
             consumer_state,
         )
 
-    def producer_commit(self) -> "UMMAtoAsyncPipeline":
+    def producer_commit(
+        self,
+        *,
+        elect_one_sync: Optional[bool] = None,
+        elect_leader_cta: Optional[bool] = None,
+    ) -> "UMMAtoAsyncPipeline":
         """Commit producer state."""
-        with cute.arch.elect_one():
-            super().producer_commit()
+        super().producer_commit(elect_one_sync=True)
         return self
 
 
 @dataclass
 class TMAStorePipeline:
     """
-    TMA Store Pipeline modeling SMEM producer (store to smem operations) to TMA consumer (TMA store to global) pipeline.
+    TMA Store Pipeline modeling SMEM producer to TMA consumer pipeline.
     A number of epilogue warps participate in the pipeline as producers, and one of them is designated as the consumer to perform TMA store.
     Named barrier is used to synchronize all warps so that producers write SMEM after the pipeline stage is available, and the consumer waits for all producers before issuing TMA store.
     The canonical pipeline flow is:
@@ -852,7 +956,7 @@ class TMAStorePipeline:
     index: int = 0
 
     def get_num_stages(self) -> int:
-        return self.stages  # type: ignore[return-value]
+        return self.stages
 
     def acquire_sync(self) -> "TMAStorePipeline":
         """
@@ -973,7 +1077,7 @@ class GroupedGemmSchedulerPipeline(GenericPipelineBase):
         raw_pipeline, producer_state, consumer_state = create_pipeline(
             num_stages,
             OperationTypeEnum.SW_STATIC_PERSISTENT_TILE_SCHEDULER,
-            OperationTypeEnum.LDS,
+            OperationTypeEnum.LD_SHARED,
             producer_arv_count=producer_arv_count,
             consumer_arv_count=consumer_arv_count,
         )
@@ -989,7 +1093,12 @@ class GroupedGemmSchedulerPipeline(GenericPipelineBase):
         consumer_wait(self.raw_pipeline, self.consumer_state)
         return self
 
-    def consumer_release(self) -> "GroupedGemmSchedulerPipeline":
+    def consumer_release(
+        self,
+        *,
+        elect_one_sync: Optional[bool] = None,
+        elect_leader_cta: Optional[bool] = None,
+    ) -> "GroupedGemmSchedulerPipeline":
         """Release consumer state."""
         consumer_release(self.raw_pipeline, self.consumer_state)
         return self
@@ -1005,22 +1114,52 @@ class CLCPipeline(GenericPipelineBase):
     Pipeline for tile scheduling (using CLC) to all warps.
     """
 
+    def __init__(
+        self,
+        raw_pipeline: ir.Value,
+        num_stages: cute.Int32,
+        producer_state: ir.Value,
+        consumer_state: ir.Value,
+        response_buffer: "cute.Tensor",
+    ) -> None:
+        super().__init__(raw_pipeline, num_stages, producer_state, consumer_state)
+        self.response_buffer = response_buffer
+
+    def __extract_mlir_values__(self) -> list:
+        base_values = super().__extract_mlir_values__()
+        return base_values + extract_mlir_values(self.response_buffer)
+
+    @classmethod
+    def __new_from_mlir_values__(cls, values: list) -> "CLCPipeline":
+        num_stages_dsl = Int32(0).__new_from_mlir_values__([values[1]])  # type: ignore[attr-defined]
+        return cls(
+            values[0],  # raw_pipeline
+            num_stages_dsl,
+            values[2],  # producer_state
+            values[3],  # consumer_state
+            values[4],  # response_buffer
+        )
+
     @staticmethod
     def create(
         *,
         num_stages: cute.Int32,
         consumer_arv_count: cute.Int32,
+        response_buffer: "cute.Tensor",
     ) -> "CLCPipeline":
         """
         Create a CLC to consumer pipeline.
 
         The consumer includes mma, tma, epilogue, and scheduler.
+
+        :param response_buffer: Multi-stage shared-memory buffer for CLC responses.
+            Each stage gets its own slot, indexed by stage index.
         """
 
         raw_pipeline, producer_state, consumer_state = create_pipeline(
             num_stages,
             OperationTypeEnum.SM100_LAUNCH_CONTROL,
-            OperationTypeEnum.LDS,
+            OperationTypeEnum.LD_SHARED,
             producer_arv_count=1,
             consumer_arv_count=consumer_arv_count,
         )
@@ -1029,22 +1168,36 @@ class CLCPipeline(GenericPipelineBase):
             num_stages,
             producer_state,
             consumer_state,
+            response_buffer,
         )
 
-    def producer_commit(self) -> "CLCPipeline":
+    def producer_commit(
+        self,
+        *,
+        elect_one_sync: Optional[bool] = None,
+        elect_leader_cta: Optional[bool] = None,
+    ) -> "CLCPipeline":
         """Commit producer state."""
         producer_commit(self.raw_pipeline, self.producer_state)
         return self
 
-    @staticmethod
-    def get_response_size() -> int:
-        """
-        Returns the size in bytes of a CLC response.
-        """
-        return 16
+    def get_response_ptr(self, stage_idx: ir.Value) -> ir.Value:
+        """Returns the response pointer for the given stage index."""
+        response = cute.slice_(self.response_buffer, stage_idx)
+        return response.iterator
 
-    def expect_response(self, mbar_ptr: Pointer) -> None:
+    def issue_next(self) -> None:
         """
-        Increments the expected transaction count of a CLC response.
+        Issues a CLC request for the next work tile.
+
+        Emits lir.work_item.issue_next which performs mbarrier.expect_tx
+        (fan-out across cluster CTAs) and clusterlaunchcontrol.try.cancel
+        in a single elected thread.
+        Must be called between producer_acquire and producer_commit.
         """
-        mbarrier_expect_tx(mbar_ptr, self.get_response_size(), cute.arch.lane_idx())
+        token, idx = get_pipeline_produce_stage(self.raw_pipeline, self.producer_state)
+        mbar_ptr = get_mbarrier(token).result
+        response_ptr = self.get_response_ptr(idx)
+        cutlass_lir_ir.WorkItemIssueNextOp(
+            mbar_ptr.value, response_ptr.value, elect_one_sync=True
+        )
diff --git a/python/CuTeDSL/cutlass/cute/experimental/utils.py b/python/CuTeDSL/cutlass/cute/experimental/utils.py
index c2c28420f..3cada8b98 100644
--- a/python/CuTeDSL/cutlass/cute/experimental/utils.py
+++ b/python/CuTeDSL/cutlass/cute/experimental/utils.py
@@ -14,8 +14,12 @@ from typing import Callable, Optional, Tuple, Union
 import cutlass
 from cutlass import cute
 from cutlass._mlir import ir
+from cutlass._mlir.dialects import cute as _cute_ir
+from cutlass._mlir.dialects import math as math_ir
+import cutlass._mlir.dialects.cute_nvgpu as _cute_nvgpu_ir
 
 from ... import cutlass_dsl as _dsl
+from ..arch.constants import THREADS_PER_WARPGROUP
 from .pipeline import TMAStorePipeline, TMAToUMMAPipeline
 
 
@@ -28,17 +32,38 @@ def get_cta_v_map_ab(
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Union[cute.Layout, cute.ComposedLayout]:
+    if input_operand not in ("A", "B", "SFA", "SFB"):
+        raise ValueError(
+            f"input_operand must be one of 'A', 'B', 'SFA', or 'SFB', "
+            f"but got {input_operand!r}"
+        )
+    _MMA_OPERAND_BY_NAME = {
+        "A": _cute_ir.MmaOperand.A,
+        "B": _cute_ir.MmaOperand.B,
+        "SFA": _cute_ir.MmaOperand.SFA,
+        "SFB": _cute_ir.MmaOperand.SFB,
+    }
+
     ident = cute.core.make_identity_layout(gmem_tensor.shape, loc=loc, ip=ip)
     mode = 0 if (input_operand in ("A", "SFA")) else 1
-    mma_tiler_mk = (mma_tiler_mnk[mode], *mma_tiler_mnk[2:])  # type: ignore[index]
+    assert isinstance(mma_tiler_mnk, tuple)
+    mma_tiler_mk = (mma_tiler_mnk[mode], *mma_tiler_mnk[2:])
     g_tile = cute.core.composition(ident, mma_tiler_mk, loc=loc, ip=ip)
-    if input_operand in ("A", "SFA"):
-        cta_v_map = tiled_mma._thrfrg_A(g_tile)
-    if input_operand in ("B", "SFB"):
-        cta_v_map = tiled_mma._thrfrg_B(g_tile)
-    cta_v_map = cute.core.get(cta_v_map, mode=[1])  # type: ignore[assignment]
-    cta_v_map = cute.core.dice(cta_v_map, (1, (1,) * cute.core.rank(g_tile)))  # type: ignore[assignment]
-    return cta_v_map  # type: ignore[return-value]
+
+    # SM120 warp-level MMA: cta_v_map is the CTA-tiled identity (matches DSL's
+    # cpasync.make_tiled_tma_atom at helpers.py:455-460). ldmatrix handles the
+    # SMEM->register reshuffling separately.
+    if isinstance(
+        tiled_mma._trait.value.type, _cute_nvgpu_ir.MmaAtomSM120BlockScaledType
+    ):
+        return g_tile
+
+    cta_v_map = tiled_mma._thrfrg(
+        _MMA_OPERAND_BY_NAME[input_operand], g_tile, loc=loc, ip=ip
+    )
+    cta_v_map = cute.core.get(cta_v_map, mode=[1])
+    cta_v_map = cute.core.dice(cta_v_map, (1, (1,) * cute.core.rank(g_tile)))
+    return cta_v_map
 
 
 
@@ -72,7 +97,8 @@ def make_tmem_layout_acc(
     Returns:
         ``cute.Layout`` for the accumulator TMEM buffer.
     """
-    acc_shape = tiled_mma.partition_shape_C(mnk_tiler[:2], loc=loc, ip=ip)  # type: ignore[index]
+    assert isinstance(mnk_tiler, tuple)
+    acc_shape = tiled_mma.partition_shape_C(mnk_tiler[:2], loc=loc, ip=ip)
     acc_shape_staged = cute.append(acc_shape, acc_stage, loc=loc, ip=ip)
     return tiled_mma.make_fragment_C(acc_shape_staged, loc=loc, ip=ip).layout
 
@@ -144,6 +170,7 @@ def epilogue_tma_store(
     epilogue_op: Callable[[cute.Tensor], cute.Tensor],
     d_major_mode: Optional["LayoutEnum"] = None,  # type: ignore[name-defined]
     tid_x_in_group: Optional[int] = None,
+    amax_out: Optional[cute.Tensor] = None,
 ) -> TMAStorePipeline:
     """
     Epilogue phase: copy accumulator from TMEM to GMEM via RMEM and TMA store.
@@ -176,6 +203,13 @@ def epilogue_tma_store(
             example, if warps 4-7 are in the same group and calling this function,
             tid_x_in_group should be 0-127 instead of 128-255. If not provided, the
             function will use cute.arch.thread_idx().
+        amax_out: Optional single FP32 output GMEM tensor. When provided, we compute
+            amax_out=max(|Acc|) via:
+            - per thread, per subtile: 32-register reduction -> scalar
+            - per thread, across subtiles: running scalar max
+            - per warp: intra-warp reduction
+            - per warp: lane 0 does atomic_fmax into GMEM
+            ``None`` (default) disables the AMax codepath at JIT-time.
 
     Returns:
         tma_store_pipeline: The updated TMAStorePipeline instance
@@ -187,7 +221,7 @@ def epilogue_tma_store(
 
     if cutlass.const_expr(tid_x_in_group is None):
         tid_x_in_group, _, _ = cute.arch.thread_idx()
-        tid_x_in_group = tid_x_in_group % 128
+        tid_x_in_group = tid_x_in_group % THREADS_PER_WARPGROUP
     warp_idx = cute.arch.warp_idx()
     warp_idx = cute.arch.make_warp_uniform(warp_idx)
 
@@ -220,11 +254,14 @@ def epilogue_tma_store(
         tiled_copy_t2r,
     )
 
-    tiler_mn = (cta_tile_shape_mnk[0], cta_tile_shape_mnk[1])  # type: ignore[index]
+    assert isinstance(cta_tile_shape_mnk, tuple)
+    tiler_mn = (cta_tile_shape_mnk[0], cta_tile_shape_mnk[1])
     gmem_d_mn_tiled = cute.zipped_divide(gmem_d, tiler_mn)
     gmem_d_tile = gmem_d_mn_tiled[(None, None), cta_d_tile_coord]
     gmem_d_epi_tma = cute.flat_divide(gmem_d_tile, epi_tile_shape)  # type: ignore[arg-type]
-    epi_subtile_cnt = gmem_d_epi_tma.shape[3]  # type: ignore[index]
+    gmem_d_epi_tma_shape = gmem_d_epi_tma.shape
+    assert isinstance(gmem_d_epi_tma_shape, tuple)
+    epi_subtile_cnt = gmem_d_epi_tma_shape[3]
 
     acc_d_rmem_layout = make_t2r_rmem_layout(
         tiled_copy_t2r,
@@ -257,6 +294,9 @@ def epilogue_tma_store(
         alignment=1024,
     )
 
+    if cutlass.const_expr(amax_out is not None):
+        thread_amax = cutlass.Float32(0.0)
+
     for epi_subtile_idx in range(epi_subtile_cnt):  # type: ignore[arg-type]
         # TMEM -> RMEM
         partition_and_copy(
@@ -267,6 +307,15 @@ def epilogue_tma_store(
 
         # RMEM -> RMEM and epilogue Op
         acc_vec = rmem_acc_buffer.load()
+
+        if cutlass.const_expr(amax_out is not None):
+            abs_acc_vec_ir = math_ir.absf(acc_vec.ir_value())
+            abs_acc_vec = type(acc_vec)(abs_acc_vec_ir, acc_vec.shape, acc_vec.dtype)
+            subtile_amax = abs_acc_vec.reduce(
+                cute.ReductionOp.MAX, cutlass.Float32(0.0), 0
+            )
+            thread_amax = cute.arch.fmax(thread_amax, subtile_amax)
+
         epilogue_out = epilogue_op(acc_vec.to(d_dtype))
         rmem_d_buffer.store(epilogue_out)
 
@@ -302,6 +351,17 @@ def epilogue_tma_store(
         # - All warps advance to the next pipeline stage
         tma_store_pipeline.release_advance()
 
+    if cutlass.const_expr(amax_out is not None):
+        # Silence mypy
+        assert amax_out is not None
+        warp_amax = cute.arch.warp_redux_sync(
+            value=thread_amax,
+            kind="fmax",
+            nan=True,
+        )
+        if cute.arch.lane_idx() == 0:
+            cute.arch.atomic_fmax(amax_out.iterator, warp_amax, sign_bit=False)
+
     tma_store_pipeline.tail()
     return tma_store_pipeline
 
@@ -321,34 +381,37 @@ def mainloop_mma(
     accumulate_to_acc: bool = False,
 ) -> Tuple[TMAToUMMAPipeline, TMAToUMMAPipeline]:
     """
-    Mainloop MMA phase: consume A/B tiles from the pipeline and compute into TMEM accumulator.
+    Mainloop MMA phase: consume A/B tiles from the pipeline and compute into accumulator.
 
     This function is the consumer side of the TMA load -> MMA pipeline. It waits
     for the TMA load warp to fill a pipeline stage, then runs multiple MMA
     instructions over the K-tile (inner loop over mma_inst_tile_k), and releases the stage.
 
     Args:
-        tiled_mma: Tiled MMA descriptor (e.g. from blackwell_helpers.make_trivial_tiled_mma)
-        a_buffer: A operand buffer, shape (..., mma_inst_tile_k, num_a_buffer_stages)
-        b_buffer: B operand buffer, shape (..., mma_inst_tile_k, num_b_buffer_stages)
+        tiled_mma: Tiled MMA descriptor
+        a_buffer: A operand buffer for this CTA's tile, should have shape
+            (..., mma_inst_tile_k, num_a_buffer_stages)
+        b_buffer: B operand buffer for this CTA's tile, should have shape
+            (..., mma_inst_tile_k, num_b_buffer_stages)
         acc_buffer: Accumulator buffer for this CTA's tile
         k_tile_start: Start index of the K-tile to iterate over (outer loop)
         k_tile_end: End index of the K-tile to iterate over (outer loop)
         mma_inst_tile_k: Number of MMA instructions per K-tile (inner loop)
-        a_buffer_pipeline: TMAToUMMAPipeline to sync with TMA load producer for A buffer
-        b_buffer_pipeline: TMAToUMMAPipeline to sync with TMA load producer for B buffer
+        a_buffer_pipeline: Pipeline to sync with TMA load producer for A buffer
+        b_buffer_pipeline: Pipeline to sync with TMA load producer for B buffer
         ab_buffer_same_pipeline: If the TMA load producers for A and B are the same pipeline
         accumulate_to_acc: If the first K-tile should accumulate to the accumulator,
             otherwise the result will be overwritten.
 
     Returns:
-        a_buffer_pipeline: The updated TMAToUMMAPipeline for A buffer
-        b_buffer_pipeline: The updated TMAToUMMAPipeline for B buffer
+        a_buffer_pipeline: The updated pipeline for A buffer
+        b_buffer_pipeline: The updated pipeline for B buffer
     """
     from .math import dot
 
     mma_atom = cute.make_mma_atom(tiled_mma.op)
     mma_atom.set(cute.nvgpu.tcgen05.Field.ACCUMULATE, accumulate_to_acc)
+
     for _k_tile in cutlass.range(k_tile_start, k_tile_end, 1, unroll=1):
         _, a_buffer_stage_idx = a_buffer_pipeline.consumer_wait_and_get_stage()
         if cutlass.const_expr(ab_buffer_same_pipeline):
@@ -360,8 +423,8 @@ def mainloop_mma(
             b_buffer_sliced = b_buffer[None, None, k_instr_tile, b_buffer_stage_idx]
             dot(
                 mma_atom,
-                cute.append_ones(a_buffer_sliced, up_to_rank=3),  # type: ignore[arg-type]
-                cute.append_ones(b_buffer_sliced, up_to_rank=3),  # type: ignore[arg-type]
+                a_buffer_sliced,
+                b_buffer_sliced,
                 acc_buffer,
             )
             mma_atom.set(cute.nvgpu.tcgen05.Field.ACCUMULATE, True)
@@ -370,3 +433,49 @@ def mainloop_mma(
             b_buffer_pipeline.consumer_release_and_advance()
 
     return a_buffer_pipeline, b_buffer_pipeline
+
+
+@_dsl.CuteExperimentalDSL.jit
+def mainloop_mma_sm120(
+    tiled_mma: cute.TiledMma,
+    tCrA: cute.Tensor,
+    tCrSFA: cute.Tensor,
+    tCrB: cute.Tensor,
+    tCrSFB: cute.Tensor,
+    acc_buffer: cute.Tensor,
+    mma_inst_tile_k: cute.Int32,
+    accumulate_to_acc: bool = True,
+) -> None:
+    """Inner K-instruction MMA loop for SM120 block-scaled GEMM.
+
+    The caller is responsible for:
+    - pipeline synchronization (consumer_wait/release),
+    - SMEM->RMEM ldmatrix copies into tCrA/tCrB/tCrSFA/tCrSFB,
+    - any per-architecture register-layout fixups; see
+      ``cutlass.utils.blackwell_helpers``.
+
+    Args:
+        tiled_mma: SM120 block-scaled tiled MMA descriptor.
+        tCrA, tCrB: RMEM A/B fragments, with shape (..., mma_inst_tile_k).
+        tCrSFA, tCrSFB: RMEM scale-factor fragments, with shape
+            (..., mma_inst_tile_k).
+        acc_buffer: RMEM accumulator buffer.
+        mma_inst_tile_k: number of MMA instructions per K-tile.
+        accumulate_to_acc: if False, zero acc_buffer before the loop.
+    """
+    from .math import dot_block_scaled
+
+    mma_atom = cute.make_mma_atom(tiled_mma.op)
+
+    if not cutlass.const_expr(accumulate_to_acc):
+        acc_buffer.fill(cutlass.Float32(0.0))
+
+    for k_instr_tile in cutlass.range(mma_inst_tile_k, unroll_full=True):
+        dot_block_scaled(
+            mma_atom,
+            tCrA[None, None, k_instr_tile],
+            tCrSFA[None, None, k_instr_tile],
+            tCrB[None, None, k_instr_tile],
+            tCrSFB[None, None, k_instr_tile],
+            acc_buffer,
+        )
diff --git a/python/CuTeDSL/cutlass/cute/export/aot_config.py b/python/CuTeDSL/cutlass/cute/export/aot_config.py
index bd5d63f39..bd91bee6e 100644
--- a/python/CuTeDSL/cutlass/cute/export/aot_config.py
+++ b/python/CuTeDSL/cutlass/cute/export/aot_config.py
@@ -20,6 +20,18 @@ Usage:
     python -m cutlass.cute.export.aot_config --ldflags   # Returns -L flags for linking
     python -m cutlass.cute.export.aot_config --libs      # Returns -l flags for linking
 
+Cross-compile (e.g. linking an AOT object built for AArch64 Linux):
+    python -m cutlass.cute.export.aot_config --ldflags --target aarch64-unknown-linux-gnu
+    python -m cutlass.cute.export.aot_config --libs    --target aarch64-unknown-linux-gnu
+
+When a --target triple is given, the search looks first in
+``<libdir>/<triple>/`` (per-triple subtree, GCC/Clang multilib convention),
+then in ``<libdir>/stubs/<triple>/`` (AOT cross-compile link-time stub —
+empty-body libcute_dsl_runtime.so with the same SONAME, resolved against
+the real lib at runtime on the target). If no per-triple runtime library
+is found, a clear error is reported to stderr. Empty ``--target`` falls
+back to ``<libdir>/`` for the build host.
+
 Examples:
     # Compile and link a shared library using shell substitution
     g++ -shared -o kernel.so kernel.o \\
@@ -35,24 +47,66 @@ import sys
 from pathlib import Path
 
 
-def get_libdir() -> str:
+def _resolve_target_libdir(default_dir: str, target: str) -> tuple[str, str]:
+    """Resolve the runtime library directory for a given target triple.
+
+    Returns ``(resolved_dir, error_msg)``. ``error_msg`` is non-empty when
+    a target was requested but no runtime library directory resolves for it.
+
+    Lookup order:
+      1. ``<default_dir>/<target>/`` — per-target multilib subtree, used
+         when a real cross-built runtime ships for ``target``.
+      2. ``<default_dir>/stubs/<target>/`` — AOT cross-compile link-time
+         stub (libcute_dsl_runtime.so with empty bodies, same SONAME as
+         the native lib). At runtime on the target the dynamic loader
+         resolves against the real lib.
+      3. ``<default_dir>/`` only when ``target`` is empty (native host).
+    """
+    if not target:
+        return default_dir, ""
+    if not default_dir:
+        return "", (
+            f"No CuTe DSL runtime library directory available; "
+            f"cannot resolve --target={target}."
+        )
+    candidate = Path(default_dir) / target
+    if candidate.is_dir() and any(candidate.iterdir()):
+        return str(candidate), ""
+    stub_candidate = Path(default_dir) / "stubs" / target
+    if stub_candidate.is_dir() and any(stub_candidate.iterdir()):
+        return str(stub_candidate), ""
+    return "", (
+        f"No CuTe DSL runtime library available for --target={target}. "
+        f"Looked in {candidate} and {stub_candidate}. Cross-target "
+        f"runtime builds are not shipped; supply your own cross-built "
+        f"libcute_dsl_runtime via -L/<your-sysroot>/lib at link time."
+    )
+
+
+def get_libdir(target: str = "") -> str:
     """
     Get the library directory path containing libcuda_dialect_runtime.so.
 
-    :return: Path to the library directory
+    :param target: Optional LLVM triple for a cross target. When non-empty,
+        looks in ``<default-libdir>/<target>/`` first.
+    :return: Path to the library directory. Empty string only when ``target``
+        is also empty and no runtime library is installed on the host.
     :rtype: str
+    :raises RuntimeError: when ``target`` is set but no per-triple runtime
+        library subtree exists, so callers (and the CLI) surface a non-zero
+        exit instead of silently returning ``""``.
     """
     from ..runtime import find_runtime_libraries
 
     libs = find_runtime_libraries(enable_tvm_ffi=False)
-    if libs:
-        # Return the directory containing the first library found
-        return str(Path(libs[0]).parent)
-
-    return ""
+    default_dir = str(Path(libs[0]).parent) if libs else ""
+    resolved, err = _resolve_target_libdir(default_dir, target)
+    if err:
+        raise RuntimeError(err)
+    return resolved
 
 
-def get_libs(enable_tvm_ffi: bool = False) -> str:
+def get_libs(enable_tvm_ffi: bool = False, target: str = "") -> str:
     """
     Get the -l flags needed for AOT compilation linking.
 
@@ -60,6 +114,10 @@ def get_libs(enable_tvm_ffi: bool = False) -> str:
     this returns `-lcuda_dialect_runtime` (and `-ltvm_ffi` if TVM-FFI is enabled).
 
     :param enable_tvm_ffi: Whether to include TVM-FFI library
+    :param target: Optional LLVM triple for a cross target. The set of
+        ``-l`` flags is independent of the target (the linker resolves
+        them against the search path), so this is informational; the value
+        is accepted for symmetry with :func:`get_ldflags`.
     :return: Space-separated -l flags (e.g., "-lcuda_dialect_runtime -ltvm_ffi")
     :rtype: str
     """
@@ -91,16 +149,22 @@ def get_lib_paths(enable_tvm_ffi: bool = False) -> list[str]:
     return find_runtime_libraries(enable_tvm_ffi=enable_tvm_ffi)
 
 
-def get_ldflags() -> str:
+def get_ldflags(target: str = "") -> str:
     """
     Get the -L flags for the linker.
 
     Similar to `tvm-ffi-config --ldflags` which returns `-L<libdir>`.
 
-    :return: -L flag with library directory path
+    :param target: Optional LLVM triple for a cross target. When non-empty,
+        emits ``-L<libdir>/<target>`` if that subtree exists, else
+        ``-L<libdir>/stubs/<target>``; raises ``RuntimeError`` if neither does.
+    :return: -L flag with library directory path, or ``""`` when no runtime
+        is installed on the host (only possible with ``target=""``).
     :rtype: str
+    :raises RuntimeError: when ``target`` is set but no per-triple runtime
+        library subtree exists.
     """
-    libdir = get_libdir()
+    libdir = get_libdir(target=target)
     if libdir:
         return f"-L{libdir}"
     return ""
@@ -148,6 +212,18 @@ Examples:
         action="store_true",
         help="Include TVM-FFI library in --libs output (disabled by default)",
     )
+    parser.add_argument(
+        "--target",
+        type=str,
+        default="",
+        metavar="TRIPLE",
+        help=(
+            "LLVM target triple for a cross-compile linking step "
+            "(e.g. aarch64-unknown-linux-gnu). When set, library "
+            "discovery searches <libdir>/<triple>/, then <libdir>/stubs/<triple>/. "
+            "Empty (default) = build-host runtime."
+        ),
+    )
 
     args = parser.parse_args()
 
@@ -156,15 +232,20 @@ Examples:
         sys.exit(1)
 
     enable_tvm_ffi = args.with_tvm_ffi
+    target = args.target
 
-    if args.libdir:
-        print(get_libdir())
+    try:
+        if args.libdir:
+            print(get_libdir(target=target))
 
-    if args.ldflags:
-        print(get_ldflags())
+        if args.ldflags:
+            print(get_ldflags(target=target))
 
-    if args.libs:
-        print(get_libs(enable_tvm_ffi=enable_tvm_ffi))
+        if args.libs:
+            print(get_libs(enable_tvm_ffi=enable_tvm_ffi, target=target))
+    except RuntimeError as e:
+        print(e, file=sys.stderr)
+        sys.exit(2)
 
 
 if __name__ == "__main__":
diff --git a/python/CuTeDSL/cutlass/cute/export/c_header_generator.py b/python/CuTeDSL/cutlass/cute/export/c_header_generator.py
index bb79a376b..5b5de7ced 100644
--- a/python/CuTeDSL/cutlass/cute/export/c_header_generator.py
+++ b/python/CuTeDSL/cutlass/cute/export/c_header_generator.py
@@ -99,7 +99,7 @@ void _mlir_{symbol_prefix}_cuda_init(void **);
 void _mlir_{symbol_prefix}_cuda_load_to_device(void **);
 static inline void {symbol_prefix}_Kernel_Module_Load({symbol_prefix}_Kernel_Module_t *module) {{
     cudaLibrary_t *libraryPtr = &(module->module);
-    cudaError_t ret;
+    cudaError_t ret = cudaSuccess;
     struct {{
         cudaLibrary_t **libraryPtr;
         cudaError_t *ret;
@@ -261,7 +261,7 @@ extern "C"
 void {capi_function_name}(void **args, int32_t num_args);
 
 static inline {return_type} {wrapper_function_name}({symbol_prefix}_Kernel_Module_t *module, {", ".join(arguments)}) {{
-    {return_type} ret;
+    {return_type} ret = 0;
     void *args[{len(packed_args) + 1}] = {{
         {", ".join(packed_args)},
         &ret
diff --git a/python/CuTeDSL/cutlass/cute/ffi.py b/python/CuTeDSL/cutlass/cute/ffi.py
index 3e045c77b..a085f9b9b 100644
--- a/python/CuTeDSL/cutlass/cute/ffi.py
+++ b/python/CuTeDSL/cutlass/cute/ffi.py
@@ -9,11 +9,12 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
+from __future__ import annotations
+
 import functools
 from typing import List, Optional
 
-from cutlass.base_dsl.ffi import extern as base_extern
-from cutlass.base_dsl.ffi import FFI, BitCode, mangle, ConstValue
+from cutlass.base_dsl.ffi import FFI, BitCode, mangle, ConstValue, extern as base_extern
 from cutlass._mlir import ir
 from cutlass._mlir.dialects import cute, llvm
 
@@ -27,7 +28,7 @@ def _implicit_convert(arg: List[ir.Value], typ: List[ir.Type]) -> List[ir.Value]
             arg_type, cute.PtrType
         ):
             ptr_value = arg[0]
-            ptr_as_int = cute.ptrtoint(ir.IntegerType.get_signless(64), ptr_value)
+            ptr_as_int = cute.ptrtoint(ir.IntegerType.get(64), ptr_value)
             addr_space = cute.PtrType(ptr_value.type).address_space
             llvm_ptr_ty = llvm.PointerType.get(addr_space)
             llvm_ptr = llvm.inttoptr(llvm_ptr_ty, ptr_as_int)
@@ -42,14 +43,14 @@ def ffi(
     params_types: list | None = None,
     return_type: Optional[ir.Type] = None,
     inline: bool = True,
-    source: str | None = None,
+    source: Optional[str | BitCode] = None,
 ) -> FFI:
     return FFI(
         name=name,
         params_types=params_types,
         return_type=return_type,
         inline=inline,
-        source=source,
+        source=source if not isinstance(source, str) else BitCode(source),
         implicit_convert=_implicit_convert,
     )
 
diff --git a/python/CuTeDSL/cutlass/cute/math.py b/python/CuTeDSL/cutlass/cute/math.py
index ef3674d20..608ad0c40 100644
--- a/python/CuTeDSL/cutlass/cute/math.py
+++ b/python/CuTeDSL/cutlass/cute/math.py
@@ -9,693 +9,102 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-from typing import Callable, Optional, Union
-
-from .typing import Numeric
-from .tensor import TensorSSA
-
-from cutlass._mlir import ir
-from cutlass._mlir.dialects import math, arith
-from cutlass.cutlass_dsl import dsl_user_op
-
-
-def _math_op(
-    func: Callable[..., ir.Value],
-    fastmath: bool,
-    *args: Union[TensorSSA, Numeric],
-    **kwargs: object,
-) -> Union[TensorSSA, ir.Value]:
-    """Dispatch the function to either a TensorSSA or a Numeric(Float).
-
-    :param func: The function to dispatch
-    :param args: The input tensor or scalar
-    :param kwargs: Extra keyword arguments (loc, ip) forwarded to the MLIR op
-    """
-    arg_type = type(args[0])
-    for arg in args:
-        if not isinstance(arg, TensorSSA) and (
-            not isinstance(arg, Numeric) or not type(arg).is_float
-        ):
-            raise TypeError(
-                f"Expected a TensorSSA or Numeric(Float), but got {type(arg)}"
-            )
-        if not isinstance(arg, arg_type):
-            raise TypeError(
-                f"Expected all inputs to be of type {arg_type}, but got {type(arg)}"
-            )
-
-    fastmath_flag = arith.FastMathFlags.fast if fastmath else arith.FastMathFlags.none
-    if isinstance(args[0], TensorSSA):
-        return TensorSSA(
-            func(*args, fastmath=fastmath_flag, **kwargs), args[0].shape, args[0].dtype
-        )
-    else:
-        ir_args = [a.ir_value() for a in args]
-        return func(*ir_args, fastmath=fastmath_flag, **kwargs)
-
-
-@dsl_user_op
-def acos(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise arc cosine of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the arc cosine of each element in input tensor
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = acos(y)  # Compute arc cosine
-    """
-    return _math_op(math.acos, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def asin(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise arc sine of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the arc sine of each element in input tensor
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = asin(y)  # Compute arc sine
-    """
-    return _math_op(math.asin, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def atan(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise arc tangent of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the arc tangent of each element in input tensor
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = atan(y)  # Compute arc tangent
-    """
-    return _math_op(math.atan, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def atan2(
-    a: Union[TensorSSA, Numeric],
-    b: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise arc tangent of two tensors.
-
-    Computes atan2(a, b) element-wise. The function atan2(a, b) is the angle in radians
-    between the positive x-axis and the point given by the coordinates (b, a).
-
-    :param a: First input tensor (y-coordinates)
-    :type a: Union[TensorSSA, Numeric]
-    :param b: Second input tensor (x-coordinates)
-    :type b: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the arc tangent of a/b element-wise
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        y = cute.make_rmem_tensor(ptr1, layout).load()  # y coordinates
-        x = cute.make_rmem_tensor(ptr2, layout).load()  # x coordinates
-        theta = atan2(y, x)  # Compute angles
-    """
-    return _math_op(math.atan2, fastmath, a, b, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def absf(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise absolute value of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the absolute value of each element in input tensor
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = absf(y)  # Compute absolute value
-    """
-    return _math_op(math.absf, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def copysign(
-    a: Union[TensorSSA, Numeric],
-    b: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise copysign of two tensors.
-
-    Returns a value with the magnitude of ``a`` and the sign of ``b``.
-
-    :param a: Input tensor providing magnitude
-    :type a: Union[TensorSSA, Numeric]
-    :param b: Input tensor providing sign
-    :type b: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor where each element has the magnitude of ``a`` and the sign of ``b``
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        mag = cute.make_rmem_tensor(ptr1, layout).load()  # magnitudes
-        sgn = cute.make_rmem_tensor(ptr2, layout).load()  # signs
-        result = copysign(mag, sgn)  # Combine magnitude and sign
-    """
-    return _math_op(math.copysign, fastmath, a, b, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def cos(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise cosine of the input tensor.
-
-    :param a: Input tensor (in radians)
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the cosine of each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = cos(y)  # Compute cosine
-    """
-    return _math_op(math.cos, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def erf(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise error function of the input tensor.
-
-    The error function is defined as:
-    erf(x) = 2/√π ∫[0 to x] exp(-t²) dt
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the error function value for each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = erf(y)  # Compute error function
-    """
-    return _math_op(math.erf, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def exp(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise exponential of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the exponential of each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = exp(y)  # Compute exponential
-    """
-    return _math_op(math.exp, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def exp2(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise base-2 exponential of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing 2 raised to the power of each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = exp2(y)  # Compute 2^x
-    """
-    return _math_op(math.exp2, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def floor(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise floor of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the largest integer less than or equal to each element in input tensor
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = floor(y)  # Compute floor
-    """
-    return _math_op(math.floor, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def log(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise natural logarithm of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the natural logarithm of each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = log(y)  # Compute natural logarithm
-    """
-    return _math_op(math.log, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def log2(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise base-2 logarithm of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the base-2 logarithm of each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = log2(y)  # Compute log base 2
-    """
-    return _math_op(math.log2, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def log10(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise base-10 logarithm of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the base-10 logarithm of each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = log10(y)  # Compute log base 10
-    """
-    return _math_op(math.log10, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def rsqrt(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise reciprocal square root of the input tensor.
-
-    Computes 1/√x element-wise.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the reciprocal square root of each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = rsqrt(y)  # Compute 1/√x
-    """
-    return _math_op(math.rsqrt, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def sin(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise sine of the input tensor.
-
-    :param a: Input tensor (in radians)
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the sine of each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = sin(y)  # Compute sine
-    """
-    return _math_op(math.sin, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def sqrt(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise square root of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the square root of each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = sqrt(y)  # Compute square root
-    """
-    return _math_op(math.sqrt, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def tan(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise tangent of the input tensor.
-
-    :param a: Input tensor (in radians)
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the tangent of each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = tan(y)  # Compute tangent
-    """
-    return _math_op(math.tan, fastmath, a, loc=loc, ip=ip)
-
-
-@dsl_user_op
-def tanh(
-    a: Union[TensorSSA, Numeric],
-    fastmath: bool = False,
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Union[TensorSSA, Numeric]:
-    """Compute element-wise hyperbolic tangent of the input tensor.
-
-    :param a: Input tensor
-    :type a: Union[TensorSSA, Numeric]
-    :param fastmath: Enable fast math optimizations, defaults to False
-    :type fastmath: bool, optional
-    :param loc: Source location information, defaults to None
-    :type loc: Optional[Location]
-    :param ip: Insertion point for IR generation, defaults to None
-    :type ip: Optional[InsertionPoint]
-    :return: Tensor containing the hyperbolic tangent of each element
-    :rtype: Union[TensorSSA, Numeric]
-
-    Example:
-
-    .. code-block::
-
-        x = cute.make_rmem_tensor(layout)  # Create tensor
-        y = x.load()  # Load values
-        z = tanh(y)  # Compute hyperbolic tangent
-    """
-    return _math_op(math.tanh, fastmath, a, loc=loc, ip=ip)
-
+"""CuTeDSL math API — thin re-export from :mod:`cutlass._mlir_helpers.math`.
+
+All math implementations live in the foundation module. This file
+curates the CuTeDSL-side public surface (historical transcendentals,
+rounding, abs / copysign) and re-exports the foundation's ``abs`` as
+``absf`` for backwards compatibility.
+
+``TensorSSA`` inherits from ``Vector``, so the foundation's Vector
+dispatch path handles CuTeDSL tensor values directly — no
+CuTeDSL-specific wrapper is needed here. Per-element math ops preserve
+a ``TensorSSA``'s CuTe nested shape via the ``_wrap_like`` polymorphic
+hook on the Vector base class.
+"""
+
+from cutlass._mlir_helpers.math import (
+    # Trigonometric
+    sin,
+    cos,
+    tan,
+    acos,
+    asin,
+    atan,
+    atan2,
+    # Hyperbolic
+    sinh,
+    cosh,
+    tanh,
+    acosh,
+    asinh,
+    atanh,
+    # Exponential / logarithmic
+    exp,
+    exp2,
+    expm1,
+    log,
+    log2,
+    log10,
+    log1p,
+    # Error functions
+    erf,
+    erfc,
+    # Power / root
+    pow,
+    cbrt,
+    sqrt,
+    rsqrt,
+    # Rounding
+    ceil,
+    floor,
+    round,
+    roundeven,
+    trunc,
+    # Sign
+    copysign,
+)
+from cutlass._mlir_helpers.math import abs as absf  # historical CuTeDSL name
 
 __all__ = [
-    "absf",
+    # Trigonometric
+    "sin",
+    "cos",
+    "tan",
     "acos",
     "asin",
     "atan",
     "atan2",
-    "cos",
-    "copysign",
-    "erf",
+    # Hyperbolic
+    "sinh",
+    "cosh",
+    "tanh",
+    "acosh",
+    "asinh",
+    "atanh",
+    # Exponential / logarithmic
     "exp",
     "exp2",
-    "floor",
+    "expm1",
     "log",
-    "log10",
     "log2",
-    "rsqrt",
-    "sin",
+    "log10",
+    "log1p",
+    # Error functions
+    "erf",
+    "erfc",
+    # Power / root
+    "pow",
+    "cbrt",
     "sqrt",
-    "tan",
-    "tanh",
+    "rsqrt",
+    # Rounding
+    "ceil",
+    "floor",
+    "round",
+    "roundeven",
+    "trunc",
+    # Abs / sign
+    "absf",
+    "copysign",
 ]
diff --git a/python/CuTeDSL/cutlass/cute/nvgpu/common.py b/python/CuTeDSL/cutlass/cute/nvgpu/common.py
index 735badbde..e356d0255 100644
--- a/python/CuTeDSL/cutlass/cute/nvgpu/common.py
+++ b/python/CuTeDSL/cutlass/cute/nvgpu/common.py
@@ -11,7 +11,6 @@
 import enum
 from dataclasses import dataclass
 from typing import Any, Mapping, Optional, Type, Union
-import warnings
 
 from cutlass.cutlass_dsl import DSLBaseError, DSLRuntimeError
 
@@ -170,6 +169,8 @@ class MmaUniversalOp(atom.MmaOp):
     This Operation currently expects the A/B operands as well as the accumulator to share the same
     data types.
 
+    **Supported architectures:** all (universal FMA)
+
     :param abacc_dtype: The data type for the A/B operands and the accumulator
     :type abacc_dtype:  Type[Numeric]
     """
@@ -377,12 +378,7 @@ class CopyUniversalOp(atom.CopyOp):
     .. code-block:: python
 
         op = cute.nvgpu.CopyUniversalOp()
-        atom = cute.make_copy_atom(
-            op, 
-            tensor_dtype, 
-            num_bits_per_copy=64,
-            l1c_evict_priority=cute.nvgpu.CacheEvictionPriority.EVICT_NORMAL
-        )
+        atom = cute.make_copy_atom(op, tensor_dtype, num_bits_per_copy=64)
 
     - ``tensor_dtype`` is the data type used to build the reference TV Layout (either the source \
         or the destination TV Layout) in unit of tensor elements and is used for partitioning by \
@@ -390,11 +386,6 @@ class CopyUniversalOp(atom.CopyOp):
     - ``num_bits_per_copy`` is a kw argument specifying the number of bits to copy per Atom \
         execution. This can be larger than the width of the above data type. When not provided, \
         the compiler will do a best effort at auto-vectorizing.
-    - ``l1c_evict_priority`` is a kw argument specifying the L1 cache eviction priority hint for \
-        the copy operation. Defaults to ``EVICT_NORMAL`` if not provided.
-    - ``invariant`` is a kw argument specifying whether the load is invariant (read-only data \
-        that never changes). This enables compiler optimizations like instruction reordering. \
-        Defaults to ``False`` if not provided.
     """
 
     def __str__(self) -> str:
@@ -405,10 +396,6 @@ class CopyUniversalOp(atom.CopyOp):
         copy_internal_type: Type[Numeric],
         *,
         num_bits_per_copy: int = 0,
-        memory_order: MemoryOrder = MemoryOrder.WEAK,
-        memory_scope: MemoryScope = MemoryScope.CTA,
-        l1c_evict_priority: CacheEvictionPriority = CacheEvictionPriority.EVICT_NORMAL,
-        invariant: bool = False,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
         **kwargs: Any,
@@ -418,29 +405,9 @@ class CopyUniversalOp(atom.CopyOp):
                 f"'num_bits_per_copy' must be a non-negative int when creating a copy Atom for {self.__class__.__name__!r}"
             )
 
-        # CopyUniversalOp is designed to be a universal copy operation that is
-        # equivalent to the "a = b" assignment without any extra attributes.
-        # For advanced memory features, such as memory order, please use the
-        # specialized copy operations (e.g., CopyG2ROp) or their combinations instead.
-        if (
-            memory_order != MemoryOrder.WEAK
-            or memory_scope != MemoryScope.CTA
-            or l1c_evict_priority != CacheEvictionPriority.EVICT_NORMAL
-            or invariant
-        ):
-            warnings.warn(
-                "Using CopyUniversalOp with extra attributes is deprecated. Please use specialized copy ops "
-                "(e.g., CopyG2ROp) for advanced memory features.",
-                DeprecationWarning,
-            )
-
         atom_type = _cute_nvgpu_ir.CopyAtomSIMTSyncCopyType.get(
             copy_internal_type.mlir_type,
             num_bits_per_copy,
-            memory_order._to_ir(),
-            memory_scope._to_ir(),
-            l1c_evict_priority._to_ir(),
-            invariant,
         )
         return CopyUniversalTrait(atom.make_atom(atom_type, loc=loc, ip=ip))
 
diff --git a/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/copy.py b/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/copy.py
index 21b6b3e60..c545a2e19 100644
--- a/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/copy.py
+++ b/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/copy.py
@@ -12,8 +12,9 @@
 import enum
 import warnings
 from dataclasses import dataclass
-from typing import Any, Optional, Type
+from typing import Any, Optional, Type, cast
 from typing_extensions import deprecated
+from abc import ABCMeta, abstractmethod
 
 from cutlass.base_dsl.arch import Arch
 from cutlass.cutlass_dsl import BaseDSL
@@ -26,7 +27,7 @@ from cutlass._mlir import ir
 
 ReductionOp = ReductionKind
 
-from ...atom import CopyOp, Trait, make_atom
+from ...atom import CopyOp, Trait, TmaTrait, make_atom
 from ...typing import Int16, Int32, Int64, Pointer, Integer, Numeric
 from ..common import OpError, LoadCacheMode as LoadCacheMode_
 
@@ -152,7 +153,142 @@ class TmaCopyOp(CopyOp):
 
 
 @dataclass
-class CopyBulkTensorTileG2SOp(TmaCopyOp):
+class CopyG2STileBaseOp(TmaCopyOp, metaclass=ABCMeta):
+    """
+    Base class for bulk tensor asynchronous GMEM to SMEM Copy Operations using the TMA unit in tiled mode.
+    """
+
+    cta_group: CtaGroup = CtaGroup.ONE
+
+    def __post_init__(self) -> None:
+        if not isinstance(self.cta_group, CtaGroup):
+            raise OpError(
+                self, "expects the 'cta_group' parameter to be a CtaGroup instance"
+            )
+        # Arch verification
+        arch: Arch = BaseDSL._get_dsl().get_arch_enum()
+        if not arch >= Arch.sm_90:
+            raise OpError(
+                self,
+                f"expects arch to be at least {Arch.sm_90.name}, but got {arch.name}",
+                suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
+            )
+        if (self.cta_group == CtaGroup.TWO) and arch.major == Arch.sm_90.major:
+            raise OpError(
+                self,
+                f"CTA group of 2 is tcgen05-specific and is not compatible with {arch}",
+                suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
+            )
+
+    @abstractmethod
+    def _get_description(self) -> str:
+        """
+        Get the description of the operation. This should be overridden by the subclass.
+        """
+        return ""
+
+    def __str__(self) -> str:
+        res = self._get_description()
+        if self.cta_group == CtaGroup.TWO:
+            res += "\n  CTA group = 2"
+        return res
+
+    @abstractmethod
+    def _make_trait(
+        self,
+        copy_internal_type: Type[Numeric],
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+        **kwargs: Any,
+    ) -> "TmaTrait":
+        pass
+
+    @abstractmethod
+    def _to_ir(self) -> _cute_nvgpu_ir.TiledTmaLoadEnum:
+        if self.cta_group == CtaGroup.ONE:
+            return _cute_nvgpu_ir.TiledTmaLoadEnum.sm_90
+        elif self.cta_group == CtaGroup.TWO:
+            return _cute_nvgpu_ir.TiledTmaLoadEnum.sm_100_2sm
+        else:
+            assert False, "unrecognized self.cta_group"
+
+
+class CopyG2STileNonExecBaseTrait(TmaTrait):
+    """
+    Base trait for non-executable bulk tensor asynchronous GMEM to SMEM Copy Operations using the TMA unit in tiled mode.
+    """
+
+    # We allow kw args to be dropped so that the user can write common code for non-multicast
+    # and multicast loads.
+
+    @abstractmethod
+    def with_(
+        self,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+        **kwargs: Any,
+    ) -> "Trait":
+        pass
+
+    def unpack(
+        self,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+        tma_bar_ptr: Optional[Pointer] = None,
+        tma_desc_ptr: Optional[Pointer] = None,
+        cache_policy: Optional[Int64] = None,
+        **kwargs: Any,
+    ) -> ir.Value:
+        """
+        Custom implementation of unpack for non-executable TMAs.
+
+        The non-multicast TMA load requires a `tma_bar_ptr` keyword argument to be provided when
+        using `cute.copy`. An optional `cache_policy` keyword argument sets the L2 cache eviction
+        priority. Any other kw arguments are ignored instead of triggering an error.
+        """
+        if not isinstance(tma_bar_ptr, Pointer):
+            raise ValueError(
+                "expects a pointer to an mbarrier to be provided via the tma_bar_ptr kw argument"
+            )
+
+        exec_value = _cute_nvgpu_ir.atom_make_exec_tma(
+            self.value,
+            loc=loc,
+            ip=ip,
+        )
+
+        attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_MBAR_PTR_FIELD_NAME}>"
+        attr = ir.Attribute.parse(attr_str)
+        exec_value = _cute_nvgpu_ir.atom_set_value(
+            exec_value, attr, cast(Any, tma_bar_ptr).value, loc=loc, ip=ip
+        )
+        if isinstance(tma_desc_ptr, Pointer):
+            attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_DESC_PTR_FIELD_NAME}>"
+            attr = ir.Attribute.parse(attr_str)
+            exec_value = _cute_nvgpu_ir.atom_set_value(
+                exec_value, attr, cast(Any, tma_desc_ptr).value, loc=loc, ip=ip
+            )
+        if cache_policy is not None:
+            if not isinstance(cache_policy, Int64):
+                raise ValueError(
+                    "expects `Int64` value to be provided via the cache_policy kw argument"
+                )
+
+            attr_str = (
+                f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_CACHE_POLICY_FIELD_NAME}>"
+            )
+            attr = ir.Attribute.parse(attr_str)
+            exec_value = _cute_nvgpu_ir.atom_set_value(
+                exec_value, attr, cache_policy.ir_value(), loc=loc, ip=ip
+            )
+        return exec_value
+
+
+@dataclass
+class CopyBulkTensorTileG2SOp(CopyG2STileBaseOp):
     """
     Bulk tensor asynchronous GMEM to SMEM Copy Operation using the TMA unit.
 
@@ -198,33 +334,8 @@ class CopyBulkTensorTileG2SOp(TmaCopyOp):
        - Tutorial example: ``examples/blackwell/tutorial_tma/tma_v0.py``
     """
 
-    cta_group: CtaGroup = CtaGroup.ONE
-
-    def __post_init__(self) -> None:
-        if not isinstance(self.cta_group, CtaGroup):
-            raise OpError(
-                self, "expects the 'cta_group' parameter to be a CtaGroup instance"
-            )
-        # Arch verification
-        arch: Arch = BaseDSL._get_dsl().get_arch_enum()
-        if not arch >= Arch.sm_90:
-            raise OpError(
-                self,
-                f"expects arch to be at least {Arch.sm_90.name}, but got {arch.name}",
-                suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
-            )
-        if (self.cta_group == CtaGroup.TWO) and arch.major == Arch.sm_90.major:
-            raise OpError(
-                self,
-                f"CTA group of 2 is tcgen05-specific and is not and is not compatible with {arch}",
-                suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
-            )
-
-    def __str__(self) -> str:
-        res = "cp.async GMEM -> SMEM bulk tensor copy Operation"
-        if self.cta_group == CtaGroup.TWO:
-            res += "\n  CTA group = 2"
-        return res
+    def _get_description(self) -> str:
+        return "cp.async GMEM -> SMEM bulk tensor copy Operation"
 
     def _make_trait(
         self,
@@ -239,18 +350,10 @@ class CopyBulkTensorTileG2SOp(TmaCopyOp):
         )
 
     def _to_ir(self) -> _cute_nvgpu_ir.TiledTmaLoadEnum:
-        if self.cta_group == CtaGroup.ONE:
-            return _cute_nvgpu_ir.TiledTmaLoadEnum.sm_90
-        elif self.cta_group == CtaGroup.TWO:
-            return _cute_nvgpu_ir.TiledTmaLoadEnum.sm_100_2sm
-        else:
-            assert False, "unrecognized self.cta_group"
+        return super()._to_ir()
 
 
-class CopyBulkTensorTileG2SNonExecTrait(Trait):
-    # We allow kw args to be dropped so that the user can write common code for non-multicast
-    # and multicast loads.
-
+class CopyBulkTensorTileG2SNonExecTrait(CopyG2STileNonExecBaseTrait):
     def with_(
         self,
         *,
@@ -260,32 +363,70 @@ class CopyBulkTensorTileG2SNonExecTrait(Trait):
     ) -> "CopyBulkTensorTileG2STrait":
         return CopyBulkTensorTileG2STrait(self.unpack(loc=loc, ip=ip, **kwargs))
 
+
+class CopyBulkTensorTileG2STrait(Trait):
+    pass
+
+
+
+
+
+#
+# TMA GMEM -> SMEM multicast copies
+#
+
+
+class CopyG2STileMulticastNonExecBaseTrait(TmaTrait):
+    """
+    Base trait for non-executable bulk tensor asynchronous multicast GMEM to SMEM Copy Operations using the TMA unit in tiled mode.
+    """
+
+    @abstractmethod
+    def with_(
+        self,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+        **kwargs: Any,
+    ) -> "Trait":
+        pass
+
     def unpack(
         self,
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
         tma_bar_ptr: Optional[Pointer] = None,
-        tma_desc_ptr: Optional[Pointer] = None,
+        mcast_mask: Any = None,
+        tma_desc_ptr: Any = None,
         cache_policy: Optional[Int64] = None,
         **kwargs: Any,
     ) -> ir.Value:
         """
         Custom implementation of unpack for non-executable TMAs.
 
-        The non-multicast TMA load requires a `tma_bar_ptr` keyword argument to be provided when
-        using `cute.copy`. `cache_policy` keyword argument to be provided to set the l2 cache eviction priority.
-        Any other kw arguments will be ignored instead of triggering an error.
+        The multicast TMA load requires the `tma_bar_ptr` and `mcast_mask` keyword arguments to be
+        provided when using `cute.copy`. An optional `cache_policy` keyword argument sets the
+        L2 cache eviction priority.
         """
         if not isinstance(tma_bar_ptr, Pointer):
             raise ValueError(
                 "expects a pointer to an mbarrier to be provided via the tma_bar_ptr kw argument"
             )
-        exec_value = _cute_nvgpu_ir.atom_make_exec_tma(self.value, loc=loc, ip=ip)
-        attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_MBAR_PTR_FIELD_NAME}>"
+        if not isinstance(mcast_mask, Integer):
+            raise ValueError(
+                "expects a multicast mask to be provided via the mcast_mask kw argument"
+            )
+
+        exec_value = _cute_nvgpu_ir.atom_make_exec_tma(
+            self.value,
+            loc=loc,
+            ip=ip,
+        )
+        attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_MCAST_MASK_FIELD_NAME}>"
         attr = ir.Attribute.parse(attr_str)
         exec_value = _cute_nvgpu_ir.atom_set_value(
-            exec_value, attr, tma_bar_ptr.value, loc=loc, ip=ip
+            exec_value, attr, Int16(mcast_mask).ir_value(loc=loc, ip=ip), loc=loc, ip=ip
         )
         if isinstance(tma_desc_ptr, Pointer):
             attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_DESC_PTR_FIELD_NAME}>"
@@ -306,17 +447,95 @@ class CopyBulkTensorTileG2SNonExecTrait(Trait):
             exec_value = _cute_nvgpu_ir.atom_set_value(
                 exec_value, attr, cache_policy.ir_value(), loc=loc, ip=ip
             )
+        # Set the tma_bar_ptr last so that the atom creation and the other
+        # field-setting operations above can be moved outside the loop
+        attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_MBAR_PTR_FIELD_NAME}>"
+        attr = ir.Attribute.parse(attr_str)
+        exec_value = _cute_nvgpu_ir.atom_set_value(
+            exec_value, attr, tma_bar_ptr.value, loc=loc, ip=ip
+        )
         return exec_value
 
 
-class CopyBulkTensorTileG2STrait(Trait):
+@dataclass
+class CopyBulkTensorTileG2SMulticastOp(CopyG2STileBaseOp):
+    """
+    Bulk tensor asynchronous multicast GMEM to SMEM Copy Operation using the TMA unit.
+
+    TMA multicast operations are issued by a single thread within a warp, but the DSL **automatically handles this** by
+    implicitly adding ``elect_one()`` around the copy operation.
+
+    .. code-block:: python
+
+        # CORRECT: TMA multicast without elect_one
+        cute.copy(
+            tma_atom.with_(mcast_mask=cluster_mask),
+            gmem_tensor,
+            smem_tensor,
+            tma_bar_ptr=barrier_ptr
+        )
+
+        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
+        with cute.arch.elect_one():  # INCORRECT
+            cute.copy(tma_atom.with_(mcast_mask=mask), gmem_tensor, smem_tensor)
+
+    **PTX Programming Model**: In PTX, TMA multicast operations (``cp.async.bulk.tensor.multicast``)
+    must be issued by a single thread.
+
+    See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`__.
+    This Operation uses TMA in the ``.tile`` mode.
+
+    .. seealso::
+       - :func:`cute.arch.elect_one` - **NOT** needed for TMA copy
+       - :class:`CopyBulkTensorTileG2SOp` - Non-multicast TMA load
+    """
+
+    def _get_description(self) -> str:
+        return "cp.async GMEM -> SMEM bulk tensor multicast copy Operation"
+
+    def _make_trait(
+        self,
+        copy_internal_type: Type[Numeric],
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+        **kwargs: Any,
+    ) -> "CopyBulkTensorTileG2SMulticastNonExecTrait":
+        raise NotImplementedError(
+            "Use cpasync.make_tiled_tma_atom to obtain a copy Atom for TMA"
+        )
+
+    def _to_ir(self) -> _cute_nvgpu_ir.TiledTmaLoadEnum:
+        if self.cta_group == CtaGroup.ONE:
+            return _cute_nvgpu_ir.TiledTmaLoadEnum.sm_90_multicast
+        elif self.cta_group == CtaGroup.TWO:
+            return _cute_nvgpu_ir.TiledTmaLoadEnum.sm_100_2sm_multicast
+        else:
+            assert False, "unrecognized self.cta_group"
+
+
+class CopyBulkTensorTileG2SMulticastNonExecTrait(CopyG2STileMulticastNonExecBaseTrait):
+    def with_(
+        self,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+        **kwargs: Any,
+    ) -> "CopyBulkTensorTileG2SMulticastTrait":
+        return CopyBulkTensorTileG2SMulticastTrait(
+            self.unpack(loc=loc, ip=ip, **kwargs)
+        )
+
+
+class CopyBulkTensorTileG2SMulticastTrait(Trait):
     pass
 
 
+
 @dataclass
 class CopyBulkTensorIm2ColG2SOp(TmaCopyOp):
     """
-    Bulk tensor asynchrnous GMEM to SMEM Copy Operation using the TMA unit in im2col mode.
+    Bulk tensor asynchronous GMEM to SMEM Copy Operation using the TMA unit in im2col mode.
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`__.
     This Operation uses TMA in the ``.im2col`` mode.
@@ -340,7 +559,7 @@ class CopyBulkTensorIm2ColG2SOp(TmaCopyOp):
         if (self.cta_group == CtaGroup.TWO) and arch.major == Arch.sm_90.major:
             raise OpError(
                 self,
-                f"CTA group of 2 is tcgen05-specific and is not and is not compatible with {arch}",
+                f"CTA group of 2 is tcgen05-specific and is not compatible with {arch}",
                 suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
             )
 
@@ -371,7 +590,7 @@ class CopyBulkTensorIm2ColG2SOp(TmaCopyOp):
             assert False, "unrecognized self.cta_group"
 
 
-class CopyBulkTensorIm2ColG2SNonExecTrait(Trait):
+class CopyBulkTensorIm2ColG2SNonExecTrait(TmaTrait):
     # We allow kw args to be dropped so that the user can write common code for non-multicast
     # and multicast loads.
 
@@ -414,13 +633,13 @@ class CopyBulkTensorIm2ColG2SNonExecTrait(Trait):
         attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_MBAR_PTR_FIELD_NAME}>"
         attr = ir.Attribute.parse(attr_str)
         exec_value = _cute_nvgpu_ir.atom_set_value(
-            exec_value, attr, tma_bar_ptr.value, loc=loc, ip=ip
+            exec_value, attr, cast(Any, tma_bar_ptr).value, loc=loc, ip=ip
         )
         if isinstance(tma_desc_ptr, Pointer):
             attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_DESC_PTR_FIELD_NAME}>"
             attr = ir.Attribute.parse(attr_str)
             exec_value = _cute_nvgpu_ir.atom_set_value(
-                exec_value, attr, tma_desc_ptr.value, loc=loc, ip=ip
+                exec_value, attr, cast(Any, tma_desc_ptr).value, loc=loc, ip=ip
             )
         if cache_policy is not None:
             if not isinstance(cache_policy, Int64):
@@ -450,7 +669,7 @@ class CopyBulkTensorIm2ColG2STrait(Trait):
 @dataclass
 class CopyBulkTensorIm2ColG2SMulticastOp(TmaCopyOp):
     """
-    Bulk tensor asynchrnous multicast GMEM to SMEM Copy Operation using the TMA unit.
+    Bulk tensor asynchronous multicast GMEM to SMEM Copy Operation using the TMA unit.
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`__.
     This Operation uses TMA in the ``.im2col`` mode.
@@ -474,7 +693,7 @@ class CopyBulkTensorIm2ColG2SMulticastOp(TmaCopyOp):
         if (self.cta_group == CtaGroup.TWO) and arch.major == Arch.sm_90.major:
             raise OpError(
                 self,
-                f"CTA group of 2 is tcgen05-specific and is not and is not compatible with {arch}",
+                f"CTA group of 2 is tcgen05-specific and is not compatible with {arch}",
                 suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
             )
 
@@ -505,7 +724,7 @@ class CopyBulkTensorIm2ColG2SMulticastOp(TmaCopyOp):
             assert False, "unrecognized self.cta_group"
 
 
-class CopyBulkTensorIm2ColG2SMulticastNonExecTrait(Trait):
+class CopyBulkTensorIm2ColG2SMulticastNonExecTrait(TmaTrait):
     def with_(
         self,
         *,
@@ -517,7 +736,7 @@ class CopyBulkTensorIm2ColG2SMulticastNonExecTrait(Trait):
             self.unpack(loc=loc, ip=ip, **kwargs)
         )
 
-    def unpack(  # type: ignore[override]
+    def unpack(
         self,
         *,
         loc: Optional[ir.Location] = None,
@@ -526,6 +745,7 @@ class CopyBulkTensorIm2ColG2SMulticastNonExecTrait(Trait):
         mcast_mask: Any = None,
         tma_desc_ptr: Any = None,
         cache_policy: Optional[Int64] = None,
+        **kwargs: Any,
     ) -> ir.Value:
         """
         Custom implementation of unpack for non-executable TMAs.
@@ -548,7 +768,7 @@ class CopyBulkTensorIm2ColG2SMulticastNonExecTrait(Trait):
             loc=loc,
             ip=ip,
         )
-        attr_str = "#cute_nvgpu.atom_copy_field_tmaload<mcast_mask>"
+        attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_MCAST_MASK_FIELD_NAME}>"
         attr = ir.Attribute.parse(attr_str)
         exec_value = _cute_nvgpu_ir.atom_set_value(
             exec_value, attr, Int16(mcast_mask).ir_value(loc=loc, ip=ip), loc=loc, ip=ip
@@ -557,7 +777,7 @@ class CopyBulkTensorIm2ColG2SMulticastNonExecTrait(Trait):
             attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_DESC_PTR_FIELD_NAME}>"
             attr = ir.Attribute.parse(attr_str)
             exec_value = _cute_nvgpu_ir.atom_set_value(
-                exec_value, attr, tma_desc_ptr.value, loc=loc, ip=ip
+                exec_value, attr, cast(Any, tma_desc_ptr).value, loc=loc, ip=ip
             )
         if cache_policy is not None:
             if not isinstance(cache_policy, Int64):
@@ -574,10 +794,10 @@ class CopyBulkTensorIm2ColG2SMulticastNonExecTrait(Trait):
             )
         # Set the tma_bar_ptr at last to ensure that the atom creation and setting
         # operations above can be moved outside the loop
-        attr_str = "#cute_nvgpu.atom_copy_field_tmaload<tma_bar>"
+        attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_MBAR_PTR_FIELD_NAME}>"
         attr = ir.Attribute.parse(attr_str)
         exec_value = _cute_nvgpu_ir.atom_set_value(
-            exec_value, attr, tma_bar_ptr.value, loc=loc, ip=ip
+            exec_value, attr, cast(Any, tma_bar_ptr).value, loc=loc, ip=ip
         )
         return exec_value
 
@@ -586,254 +806,6 @@ class CopyBulkTensorIm2ColG2SMulticastTrait(Trait):
     pass
 
 
-@dataclass
-class CopyBulkTensorIm2ColS2GOp(TmaCopyOp):
-    """
-    Bulk tensor asynchrnous SMEM to GMEM Copy Operation using the TMA unit in im2col mode.
-
-    See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`__.
-    This Operation uses TMA in the ``.im2col`` mode.
-    """
-
-    def __post_init__(self) -> None:
-        # Arch verification
-        arch: Arch = BaseDSL._get_dsl().get_arch_enum()
-        if not arch >= Arch.sm_90:
-            raise OpError(
-                self,
-                f"expects arch to be at least {Arch.sm_90.name}, but got {arch.name}",
-                suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
-            )
-
-    def __str__(self) -> str:
-        return "cp.async SMEM -> GMEM bulk tensor copy Operation"
-
-    def _make_trait(
-        self,
-        copy_internal_type: Type[Numeric],
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-        **kwargs: Any,
-    ) -> "CopyBulkTensorIm2ColS2GNonExecTrait":
-        raise NotImplementedError(
-            "Use cpasync.make_im2col_tma_atom to obtain a copy Atom for TMA"
-        )
-
-
-class CopyBulkTensorIm2ColS2GNonExecTrait(Trait):
-    def with_(
-        self,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-        **kwargs: Any,
-    ) -> "CopyBulkTensorIm2ColS2GTrait":
-        return CopyBulkTensorIm2ColS2GTrait(self.unpack(loc=loc, ip=ip, **kwargs))
-
-    def unpack(  # type: ignore[override]
-        self,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-        tma_desc_ptr: Optional[Pointer] = None,
-        cache_policy: Optional[Int64] = None,
-    ) -> ir.Value:
-        """
-        Custom implementation of unpack for non-executable TMAs.
-        """
-        exec_value = _cute_nvgpu_ir.atom_make_exec_tma(self.value, loc=loc, ip=ip)
-        if isinstance(tma_desc_ptr, Pointer):
-            attr_str = (
-                f"#cute_nvgpu.atom_copy_field_tmastore<{TMA_DESC_PTR_FIELD_NAME}>"
-            )
-            attr = ir.Attribute.parse(attr_str)
-            exec_value = _cute_nvgpu_ir.atom_set_value(
-                exec_value, attr, tma_desc_ptr.value, loc=loc, ip=ip
-            )
-        if cache_policy is not None:
-            if not isinstance(cache_policy, Int64):
-                raise ValueError(
-                    "expects `Int64` value to be provided via the cache_policy kw argument"
-                )
-
-            attr_str = (
-                f"#cute_nvgpu.atom_copy_field_tmastore<{TMA_CACHE_POLICY_FIELD_NAME}>"
-            )
-            attr = ir.Attribute.parse(attr_str)
-            exec_value = _cute_nvgpu_ir.atom_set_value(
-                exec_value, attr, cache_policy.ir_value(), loc=loc, ip=ip
-            )
-        return exec_value
-
-
-class CopyBulkTensorIm2ColS2GTrait(Trait):
-    pass
-
-
-#
-# TMA GMEM -> SMEM multicast copies
-#
-
-
-@dataclass
-class CopyBulkTensorTileG2SMulticastOp(TmaCopyOp):
-    """
-    Bulk tensor asynchronous multicast GMEM to SMEM Copy Operation using the TMA unit.
-
-    TMA multicast operations are issued by a single thread within a warp, but the DSL **automatically handles this** by
-    implicitly adding ``elect_one()`` around the copy operation.
-
-    .. code-block:: python
-
-        # CORRECT: TMA multicast without elect_one
-        cute.copy(
-            tma_atom.with_(mcast_mask=cluster_mask),
-            gmem_tensor,
-            smem_tensor,
-            tma_bar_ptr=barrier_ptr
-        )
-
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.copy(tma_atom.with_(mcast_mask=mask), gmem_tensor, smem_tensor)
-
-    **PTX Programming Model**: In PTX, TMA multicast operations (``cp.async.bulk.tensor.multicast``)
-    must be issued by a single thread.
-
-    See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`__.
-    This Operation uses TMA in the ``.tile`` mode.
-
-    .. seealso::
-       - :func:`cute.arch.elect_one` - **NOT** needed for TMA copy
-       - :class:`CopyBulkTensorTileG2SOp` - Non-multicast TMA load
-    """
-
-    cta_group: CtaGroup = CtaGroup.ONE
-
-    def __post_init__(self) -> None:
-        if not isinstance(self.cta_group, CtaGroup):
-            raise OpError(
-                self, "expects the 'cta_group' parameter to be a CtaGroup instance"
-            )
-        # Arch verification
-        arch = BaseDSL._get_dsl().get_arch_enum()
-        if not arch >= Arch.sm_90:
-            raise OpError(
-                self,
-                f"expects arch to be at least {Arch.sm_90.name}, but got {arch.name}",
-                suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
-            )
-        if (self.cta_group == CtaGroup.TWO) and arch.major == Arch.sm_90.major:
-            raise OpError(
-                self,
-                f"CTA group of 2 is tcgen05-specific and is not and is not compatible with {arch}",
-                suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
-            )
-
-    def __str__(self) -> str:
-        res = "cp.async GMEM -> SMEM bulk tensor multicast copy Operation"
-        if self.cta_group == CtaGroup.TWO:
-            res += "\n  CTA group = 2"
-        return res
-
-    def _make_trait(
-        self,
-        copy_internal_type: Type[Numeric],
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-        **kwargs: Any,
-    ) -> "CopyBulkTensorTileG2SMulticastNonExecTrait":
-        raise NotImplementedError(
-            "Use cpasync.make_tiled_tma_atom to obtain a copy Atom for TMA"
-        )
-
-    def _to_ir(self) -> _cute_nvgpu_ir.TiledTmaLoadEnum:
-        if self.cta_group == CtaGroup.ONE:
-            return _cute_nvgpu_ir.TiledTmaLoadEnum.sm_90_multicast
-        elif self.cta_group == CtaGroup.TWO:
-            return _cute_nvgpu_ir.TiledTmaLoadEnum.sm_100_2sm_multicast
-        else:
-            assert False, "unrecognized self.cta_group"
-
-
-class CopyBulkTensorTileG2SMulticastNonExecTrait(Trait):
-    def with_(
-        self,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-        **kwargs: Any,
-    ) -> "CopyBulkTensorTileG2SMulticastTrait":
-        return CopyBulkTensorTileG2SMulticastTrait(
-            self.unpack(loc=loc, ip=ip, **kwargs)
-        )
-
-    def unpack(  # type: ignore[override]
-        self,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-        tma_bar_ptr: Optional[Pointer] = None,
-        mcast_mask: Any = None,
-        tma_desc_ptr: Any = None,
-        cache_policy: Optional[Int64] = None,
-    ) -> ir.Value:
-        """
-        Custom implementation of unpack for non-executable TMAs.
-
-        The multicast TMA load requires a `tma_bar_ptr`  and a `mcast_mask` keyword arguments to be
-        provided when using `cute.copy`. `cache_policy` keyword argument to be provided to set the
-        l2 cache eviction priority.
-        """
-        if not isinstance(tma_bar_ptr, Pointer):
-            raise ValueError(
-                "expects a pointer to an mbarrier to be provided via the tma_bar_ptr kw argument"
-            )
-        if not isinstance(mcast_mask, Integer):
-            raise ValueError(
-                "expects a multicast mask to be provided via the mcast_mask kw argument"
-            )
-        exec_value = _cute_nvgpu_ir.atom_make_exec_tma(self.value, loc=loc, ip=ip)
-        attr_str = "#cute_nvgpu.atom_copy_field_tmaload<mcast_mask>"
-        attr = ir.Attribute.parse(attr_str)
-        exec_value = _cute_nvgpu_ir.atom_set_value(
-            exec_value, attr, Int16(mcast_mask).ir_value(loc=loc, ip=ip), loc=loc, ip=ip
-        )
-        if isinstance(tma_desc_ptr, Pointer):
-            attr_str = f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_DESC_PTR_FIELD_NAME}>"
-            attr = ir.Attribute.parse(attr_str)
-            exec_value = _cute_nvgpu_ir.atom_set_value(
-                exec_value, attr, tma_desc_ptr.value, loc=loc, ip=ip
-            )
-        if cache_policy is not None:
-            if not isinstance(cache_policy, Int64):
-                raise ValueError(
-                    "expects `Int64` value to be provided via the cache_policy kw argument"
-                )
-
-            attr_str = (
-                f"#cute_nvgpu.atom_copy_field_tmaload<{TMA_CACHE_POLICY_FIELD_NAME}>"
-            )
-            attr = ir.Attribute.parse(attr_str)
-            exec_value = _cute_nvgpu_ir.atom_set_value(
-                exec_value, attr, cache_policy.ir_value(), loc=loc, ip=ip
-            )
-        # Set the tma_bar_ptr at last to ensure that the atom creation and setting
-        # operations above can be moved outside the loop
-        attr_str = "#cute_nvgpu.atom_copy_field_tmaload<tma_bar>"
-        attr = ir.Attribute.parse(attr_str)
-        exec_value = _cute_nvgpu_ir.atom_set_value(
-            exec_value, attr, tma_bar_ptr.value, loc=loc, ip=ip
-        )
-        return exec_value
-
-
-class CopyBulkTensorTileG2SMulticastTrait(Trait):
-    pass
-
-
 #
 # TMA SMEM -> GMEM copies
 #
@@ -898,7 +870,7 @@ class CopyBulkTensorTileS2GOp(TmaCopyOp):
         )
 
 
-class CopyBulkTensorTileS2GNonExecTrait(Trait):
+class CopyBulkTensorTileS2GNonExecTrait(TmaTrait):
     def with_(
         self,
         *,
@@ -908,13 +880,14 @@ class CopyBulkTensorTileS2GNonExecTrait(Trait):
     ) -> "CopyBulkTensorTileS2GTrait":
         return CopyBulkTensorTileS2GTrait(self.unpack(loc=loc, ip=ip, **kwargs))
 
-    def unpack(  # type: ignore[override]
+    def unpack(
         self,
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
         tma_desc_ptr: Optional[Pointer] = None,
         cache_policy: Optional[Int64] = None,
+        **kwargs: Any,
     ) -> ir.Value:
         """
         Custom implementation of unpack for non-executable TMAs.
@@ -926,7 +899,7 @@ class CopyBulkTensorTileS2GNonExecTrait(Trait):
             )
             attr = ir.Attribute.parse(attr_str)
             exec_value = _cute_nvgpu_ir.atom_set_value(
-                exec_value, attr, tma_desc_ptr.value, loc=loc, ip=ip
+                exec_value, attr, cast(Any, tma_desc_ptr).value, loc=loc, ip=ip
             )
         if cache_policy is not None:
             if not isinstance(cache_policy, Int64):
@@ -971,17 +944,6 @@ class CopyReduceBulkTensorTileS2GOp(TmaCopyOp):
             # Hardware-correct upstream enum; convert silently by name.
             self.reduction_kind = ReductionKind[kind.name]
         elif isinstance(kind, _CuteReductionOp):
-            # Validate the name first so an unknown member (e.g. MUL) raises
-            # a clean TypeError. Doing it before warnings.warn guarantees the
-            # invalid-name case isn't masked by the DeprecationWarning being
-            # escalated to an exception under -W error / warnings-as-errors.
-            if kind.name not in ReductionKind.__members__:
-                raise TypeError(
-                    f"cute.ReductionOp.{kind.name} is not a valid TMA "
-                    f"reduction kind. Valid kinds: "
-                    f"{[m.name for m in ReductionKind]}."
-                )
-            self.reduction_kind = ReductionKind[kind.name]
             warnings.warn(
                 "Passing cute.ReductionOp to CopyReduceBulkTensorTileS2GOp "
                 "is deprecated: cute.ReductionOp models compute reductions "
@@ -990,6 +952,14 @@ class CopyReduceBulkTensorTileS2GOp(TmaCopyOp):
                 DeprecationWarning,
                 stacklevel=3,
             )
+            try:
+                self.reduction_kind = ReductionKind[kind.name]
+            except KeyError:
+                raise TypeError(
+                    f"cute.ReductionOp.{kind.name} is not a valid TMA "
+                    f"reduction kind. Valid kinds: "
+                    f"{[m.name for m in ReductionKind]}."
+                )
         # else: leave as-is; _to_ir raises TypeError on unknown types.
 
         # Arch verification
@@ -1030,7 +1000,7 @@ class CopyReduceBulkTensorTileS2GOp(TmaCopyOp):
         )
 
 
-class CopyReduceBulkTensorTileS2GNonExecTrait(Trait):
+class CopyReduceBulkTensorTileS2GNonExecTrait(TmaTrait):
     def with_(
         self,
         *,
@@ -1040,13 +1010,14 @@ class CopyReduceBulkTensorTileS2GNonExecTrait(Trait):
     ) -> "CopyReduceBulkTensorTileS2GTrait":
         return CopyReduceBulkTensorTileS2GTrait(self.unpack(loc=loc, ip=ip, **kwargs))
 
-    def unpack(  # type: ignore[override]
+    def unpack(
         self,
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
         tma_desc_ptr: Optional[Pointer] = None,
         cache_policy: Optional[Int64] = None,
+        **kwargs: Any,
     ) -> ir.Value:
         """
         Custom implementation of unpack for non-executable TMAs.
@@ -1058,7 +1029,7 @@ class CopyReduceBulkTensorTileS2GNonExecTrait(Trait):
             )
             attr = ir.Attribute.parse(attr_str)
             exec_value = _cute_nvgpu_ir.atom_set_value(
-                exec_value, attr, tma_desc_ptr.value, loc=loc, ip=ip
+                exec_value, attr, cast(Any, tma_desc_ptr).value, loc=loc, ip=ip
             )
         if cache_policy is not None:
             if not isinstance(cache_policy, Int64):
@@ -1079,6 +1050,93 @@ class CopyReduceBulkTensorTileS2GTrait(Trait):
     pass
 
 
+
+@dataclass
+class CopyBulkTensorIm2ColS2GOp(TmaCopyOp):
+    """
+    Bulk tensor asynchronous SMEM to GMEM Copy Operation using the TMA unit in im2col mode.
+
+    See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`__.
+    This Operation uses TMA in the ``.im2col`` mode.
+    """
+
+    def __post_init__(self) -> None:
+        # Arch verification
+        arch: Arch = BaseDSL._get_dsl().get_arch_enum()
+        if not arch >= Arch.sm_90:
+            raise OpError(
+                self,
+                f"expects arch to be at least {Arch.sm_90.name}, but got {arch.name}",
+                suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
+            )
+
+    def __str__(self) -> str:
+        return "cp.async SMEM -> GMEM bulk tensor copy Operation"
+
+    def _make_trait(
+        self,
+        copy_internal_type: Type[Numeric],
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+        **kwargs: Any,
+    ) -> "CopyBulkTensorIm2ColS2GNonExecTrait":
+        raise NotImplementedError(
+            "Use cpasync.make_im2col_tma_atom to obtain a copy Atom for TMA"
+        )
+
+
+class CopyBulkTensorIm2ColS2GNonExecTrait(TmaTrait):
+    def with_(
+        self,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+        **kwargs: Any,
+    ) -> "CopyBulkTensorIm2ColS2GTrait":
+        return CopyBulkTensorIm2ColS2GTrait(self.unpack(loc=loc, ip=ip, **kwargs))
+
+    def unpack(
+        self,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+        tma_desc_ptr: Optional[Pointer] = None,
+        cache_policy: Optional[Int64] = None,
+        **kwargs: Any,
+    ) -> ir.Value:
+        """
+        Custom implementation of unpack for non-executable TMAs.
+        """
+        exec_value = _cute_nvgpu_ir.atom_make_exec_tma(self.value, loc=loc, ip=ip)
+        if isinstance(tma_desc_ptr, Pointer):
+            attr_str = (
+                f"#cute_nvgpu.atom_copy_field_tmastore<{TMA_DESC_PTR_FIELD_NAME}>"
+            )
+            attr = ir.Attribute.parse(attr_str)
+            exec_value = _cute_nvgpu_ir.atom_set_value(
+                exec_value, attr, tma_desc_ptr.value, loc=loc, ip=ip
+            )
+        if cache_policy is not None:
+            if not isinstance(cache_policy, Int64):
+                raise ValueError(
+                    "expects `Int64` value to be provided via the cache_policy kw argument"
+                )
+
+            attr_str = (
+                f"#cute_nvgpu.atom_copy_field_tmastore<{TMA_CACHE_POLICY_FIELD_NAME}>"
+            )
+            attr = ir.Attribute.parse(attr_str)
+            exec_value = _cute_nvgpu_ir.atom_set_value(
+                exec_value, attr, cache_policy.ir_value(), loc=loc, ip=ip
+            )
+        return exec_value
+
+
+class CopyBulkTensorIm2ColS2GTrait(Trait):
+    pass
+
+
 #
 # Bulk GMEM -> SMEM copies
 #
@@ -1087,7 +1145,7 @@ class CopyReduceBulkTensorTileS2GTrait(Trait):
 @dataclass(frozen=True)
 class CopyBulkG2SOp(CopyOp):
     """
-    Bulk copy asynchrnous GMEM to SMEM Copy Operation.
+    Bulk copy asynchronous GMEM to SMEM Copy Operation.
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk>`__.
     """
@@ -1152,7 +1210,7 @@ class CopyBulkG2STrait(Trait):
         attr_str = f"#cute_nvgpu.atom_copy_field_bulkg2s<{TMA_MBAR_PTR_FIELD_NAME}>"
         attr = ir.Attribute.parse(attr_str)
         val = _cute_nvgpu_ir.atom_set_value(
-            self.value, attr, mbar_ptr.value, loc=loc, ip=ip
+            self.value, attr, cast(Any, mbar_ptr).value, loc=loc, ip=ip
         )
         if cache_policy is not None:
             if not isinstance(cache_policy, Int64):
@@ -1177,7 +1235,7 @@ class CopyBulkG2STrait(Trait):
 @dataclass(frozen=True)
 class CopyBulkG2SMulticastOp(CopyOp):
     """
-    Bulk multicast copy asynchrnous GMEM to SMEM Copy Operation.
+    Bulk multicast copy asynchronous GMEM to SMEM Copy Operation.
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk>`__.
     """
@@ -1246,7 +1304,7 @@ class CopyBulkG2SMulticastTrait(Trait):
         attr_str = f"#cute_nvgpu.atom_copy_field_bulkg2s<{TMA_MBAR_PTR_FIELD_NAME}>"
         attr = ir.Attribute.parse(attr_str)
         val = _cute_nvgpu_ir.atom_set_value(
-            self.value, attr, mbar_ptr.value, loc=loc, ip=ip
+            self.value, attr, cast(Any, mbar_ptr).value, loc=loc, ip=ip
         )
         attr_str = f"#cute_nvgpu.atom_copy_field_bulkg2s<{TMA_MCAST_MASK_FIELD_NAME}>"
         attr = ir.Attribute.parse(attr_str)
@@ -1276,7 +1334,7 @@ class CopyBulkG2SMulticastTrait(Trait):
 @dataclass(frozen=True)
 class CopyBulkS2GOp(CopyOp):
     """
-    Bulk copy asynchrnous SMEM to GMEM Copy Operation.
+    Bulk copy asynchronous SMEM to GMEM Copy Operation.
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk>`__.
     """
@@ -1327,7 +1385,7 @@ class CopyBulkS2GTrait(Trait):
 @dataclass(frozen=True)
 class CopyBulkS2GByteMaskOp(CopyOp):
     """
-    Bulk copy asynchrnous SMEM to GMEM Copy Operation with mask.
+    Bulk copy asynchronous SMEM to GMEM Copy Operation with mask.
     The i-th bit in the 16-bit wide byteMask operand specifies whether
     the i-th byte of each 16-byte wide chunk of source data is copied to the destination.
 
@@ -1408,7 +1466,7 @@ class CopyBulkS2GByteMaskTrait(Trait):
 @dataclass(frozen=True)
 class CopyBulkS2SOp(CopyOp):
     """
-    Bulk copy asynchrnous SMEM CTA to Cluster Copy Operation.
+    Bulk copy asynchronous SMEM CTA to Cluster Copy Operation.
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk>`__.
     """
@@ -1475,7 +1533,7 @@ class CopyBulkS2STrait(Trait):
         attr_str = f"#cute_nvgpu.atom_copy_field_bulks2s<{TMA_MBAR_PTR_FIELD_NAME}>"
         attr = ir.Attribute.parse(attr_str)
         val = _cute_nvgpu_ir.atom_set_value(
-            self.value, attr, mbar_ptr.value, loc=loc, ip=ip
+            self.value, attr, cast(Any, mbar_ptr).value, loc=loc, ip=ip
         )
         attr_str = f"#cute_nvgpu.atom_copy_field_bulks2s<{TMA_CTA_RANK_FIELD_NAME}>"
         attr = ir.Attribute.parse(attr_str)
@@ -1565,7 +1623,7 @@ class CopyDsmemStoreTrait(Trait):
         val = _cute_nvgpu_ir.atom_set_value(
             self.value,
             attr,
-            mbar_ptr.value,
+            cast(Any, mbar_ptr).value,
             loc=loc,
             ip=ip,
         )
diff --git a/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/helpers.py b/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/helpers.py
index 4d8cb7536..385f3a44b 100644
--- a/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/helpers.py
+++ b/python/CuTeDSL/cutlass/cute/nvgpu/cpasync/helpers.py
@@ -34,8 +34,11 @@ from ...typing import (
     Numeric,
     NumericMeta,
     IntTuple,
+    _element_precision_width,
+    is_int_tuple_type,
 )
 from ... import core, atom
+from ...atom import _normalize_variadic_tensor_operand
 from .copy import (
     CopyBulkTensorTileG2SOp,
     CopyBulkTensorIm2ColG2SOp,
@@ -290,13 +293,19 @@ def make_im2col_tma_atom(
         if not isinstance(internal_type, NumericMeta):
             raise TypeError(f"internal_type must be a Numeric, but got {internal_type}")
 
+        gmem_tensor_element_type = gmem_tensor.element_type
+        assert not is_int_tuple_type(gmem_tensor_element_type)
+        gmem_element_precision_width = _element_precision_width(
+            gmem_tensor_element_type
+        )
         use_unpack = (
             itype.width == 8
-            and isinstance(gmem_tensor.element_type, NumericMeta)
-            and gmem_tensor.element_type.width < 8  # type: ignore[union-attr]
+            and isinstance(gmem_tensor_element_type, type)
+            and issubclass(gmem_tensor_element_type, Numeric)
+            and gmem_element_precision_width < 8
         )
         internal_mlir_type = (
-            gmem_tensor.element_type.mlir_type if use_unpack else itype.mlir_type  # type: ignore[union-attr]
+            gmem_tensor_element_type.mlir_type if use_unpack else itype.mlir_type
         )
         tma_format = _cute_nvgpu_ir.TmaDataFormat(
             _cute_nvgpu_ir.get_default_tma_format(internal_mlir_type, use_unpack)
@@ -359,6 +368,13 @@ def make_im2col_tma_atom(
                 f"expects num_multicast to be >= 1 for multicast G2S copies, "
                 f"but got {num_multicast}"
             )
+        assert lower_corner_whd is not None
+        assert upper_corner_whd is not None
+        assert lower_padding_whd is not None
+        assert upper_padding_whd is not None
+        assert stride_whd is not None
+        assert lower_srt is not None
+        assert stride_srt is not None
         res = _cute_nvgpu_ir.atom_make_non_exec_im2col_tma_load(
             cast(Any, gmem_tensor).value,
             smem_layout,
@@ -412,14 +428,18 @@ def make_tiled_tma_atom(
     ip: Optional[ir.InsertionPoint] = None,
 ) -> TmaInfo:
     """
-    Makes a TMA Copy Atom in the ``.tile`` mode to copy tiles of a GMEM tensor to/from SMEM
-    buffer with the given Layout.
+    Makes a TMA Copy Atom to copy tiles of a GMEM tensor to/from an SMEM buffer with the given Layout.
+
+    Supports ``.tile`` mode (default) and ``.tile::gather4`` mode (when ``gmem_coord_tensor`` is
+    provided with a gather4 op). For the gather4 programming model — coord-tensor layout
+    conventions, examples, and restrictions — see :func:`tma_partition`.
 
     Given
 
     - a GMEM tensor
     - a SMEM layout
     - a CTA-level Tiler
+    - (optional) a GMEM index tensor for gather4 mode
 
     this function figures out the bulk tensor asynchronous copy instruction to use with the maximum
     "TMA vector length" to copy tiles of the GMEM tensor to/from an SMEM buffer with the provided
@@ -479,13 +499,19 @@ def make_tiled_tma_atom(
         if not isinstance(internal_type, NumericMeta):
             raise TypeError(f"internal_type must be a Numeric, but got {internal_type}")
 
+        gmem_tensor_element_type = gmem_tensor.element_type
+        assert not is_int_tuple_type(gmem_tensor_element_type)
+        gmem_element_precision_width = _element_precision_width(
+            gmem_tensor_element_type
+        )
         use_unpack = (
             itype.width == 8
-            and isinstance(gmem_tensor.element_type, NumericMeta)
-            and gmem_tensor.element_type.width < 8  # type: ignore[union-attr]
+            and isinstance(gmem_tensor_element_type, type)
+            and issubclass(gmem_tensor_element_type, Numeric)
+            and gmem_element_precision_width < 8
         )
         internal_mlir_type = (
-            gmem_tensor.element_type.mlir_type if use_unpack else itype.mlir_type  # type: ignore[union-attr]
+            gmem_tensor_element_type.mlir_type if use_unpack else itype.mlir_type
         )
         tma_format = _cute_nvgpu_ir.TmaDataFormat(
             _cute_nvgpu_ir.get_default_tma_format(internal_mlir_type, use_unpack)
@@ -572,25 +598,97 @@ def tma_partition(
     cta_coord: Coord,
     cta_layout: Layout,
     smem_tensor: Tensor,
-    gmem_tensor: Tensor,
+    gmem_tensor: Union[Tensor, List[Tensor], Tuple[Tensor, ...]],
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
-) -> Tuple[Tensor, Tensor]:
+) -> Union[Tuple[Tensor, Tensor], Tuple[Tensor, Tensor, Tensor]]:
     """
     Tiles the GMEM and SMEM tensors for the provided TMA Copy Atom.
+
+    For standard TMA modes (tiled, im2col, etc.), pass a single GMEM tensor::
+
+        tAsA, tAgA = tma_partition(atom, cta_coord, cta_layout, sA, gA)
+
+    For gather4 mode, pass a list of ``[gmem_tensor, gmem_coord_tensor]``::
+
+        tAsA, tAgA, tAgI = tma_partition(atom, cta_coord, cta_layout, sA, [gA, gI])
+
+    For TMA override, pass ``(gmem_tensor, residue_tensor)``: ``gmem_tensor``
+    tracks the GMEM address and stride, and ``residue_tensor`` is a coordinate
+    tensor with negated strides that tracks the remaining shape to copy.
+
+    The 2D gather4 TMA atom (``CopyBulkTensor2DGather4G2SOp``) issues
+    ``cp.async.bulk.tensor.2d.tile::gather4``, which loads four rows from a
+    GMEM tile by gathering at four indirectly-addressed indices. Each PTX
+    instruction consumes 5 coordinates ``{crd0, crd1_i0, crd1_i1, crd1_i2,
+    crd1_i3}``: one *contiguous* coordinate (``crd0``) and four *gather*
+    coordinates pulled from ``gmem_coord_tensor``.
+
+    The ``gmem_coord_tensor`` is a rank-2 ``Int32`` tensor whose layout selects
+    which mode is the gather mode. One mode is broadcast (stride 0) and the
+    other supplies the per-row gather indices:
+
+    - **Column-major data, gather along rows:** ``stride=(0, 1)`` — mode 0 is
+      the broadcast (contiguous) mode, mode 1 carries the gather indices.
+
+      .. code-block:: python
+
+          # GMEM data is (M, N) col-major; gather along the M dimension.
+          gI = cute.make_tensor(
+              idx_ptr, cute.make_layout((M, N), stride=(0, 1))
+          )
+          atom, gA = cpasync.make_tiled_tma_atom(
+              cpasync.CopyBulkTensor2DGather4G2SOp(),
+              gA, smem_layout, cta_tiler,
+              gmem_coord_tensor=gI,
+          )
+          tAsA, tAgA, tAgI = cpasync.tma_partition(
+              atom, 0, cute.make_layout(1), sA, [gA, gI],
+          )
+          cute.copy(atom, [tAgA[crd], tAgI[crd]], tAsA, tma_bar_ptr=mbar)
+
+    - **Row-major data, gather along cols:** ``stride=(1, 0)`` — mode 1 is
+      the broadcast (contiguous) mode, mode 0 carries the gather indices.
+
+      .. code-block:: python
+
+          gI = cute.make_tensor(
+              idx_ptr, cute.make_layout((M, N), stride=(1, 0))
+          )
+
+    **Restrictions** (enforced by Python and MLIR verifiers):
+
+    - ``gmem_coord_tensor`` layout must be 2D.
+    - The gather mode size must be ``>= 4`` (4 row indices per instruction).
+    - Exactly one mode of ``gmem_coord_tensor`` must be a broadcast (stride 0);
+      that broadcast mode is *not* the gather dimension.
+
+    :param atom:         The TMA Copy Atom
+    :param cta_coord:    CTA coordinate within the cluster
+    :param cta_layout:   Layout of CTAs in the cluster
+    :param smem_tensor:  The SMEM tensor to partition
+    :param gmem_tensor:  A single GMEM tensor, or a list ``[data_tensor, index_tensor]`` for gather4
+    :return: ``(smem_tensor, gmem_tensor)`` for standard TMA, or
+             ``(smem_tensor, gmem_tensor, gmem_coord_tensor)`` for gather4
     """
+
+    # Normalize gmem_tensor to a list for variadic IR operands
+    gmem_tensor_list = _normalize_variadic_tensor_operand(gmem_tensor, "gmem_tensor")
+
     cta_coord_val = core._pack_coord(cta_coord, loc=loc, ip=ip)
-    s, d = _cute_nvgpu_ir.atom_tma_partition(
+    res = _cute_nvgpu_ir.atom_tma_partition(
         atom._trait.value,
         cta_coord=cta_coord_val,
         cta_layout=cta_layout,
         smem_tensor=cast(Any, smem_tensor).value,
-        target_tensors=[cast(Any, gmem_tensor).value],
+        target_tensors=[cast(Any, t).value for t in gmem_tensor_list],
         loc=loc,
         ip=ip,
     )
-    return s, d
+    # partitioned smem_tensor, gmem_tensors
+    return res
+
 
 
 @dsl_user_op
@@ -789,3 +887,4 @@ def group_bulk_copy_modes(
     mSrc = core.group_modes(src, 0, core.rank(src), loc=loc, ip=ip)
     mDst = core.group_modes(dst, 0, core.rank(dst), loc=loc, ip=ip)
     return (mSrc, mDst)
+
diff --git a/python/CuTeDSL/cutlass/cute/nvgpu/helpers.py b/python/CuTeDSL/cutlass/cute/nvgpu/helpers.py
index af109b26b..3efe52d9b 100644
--- a/python/CuTeDSL/cutlass/cute/nvgpu/helpers.py
+++ b/python/CuTeDSL/cutlass/cute/nvgpu/helpers.py
@@ -17,7 +17,16 @@ from cutlass._mlir import ir
 import cutlass._mlir.dialects.cute_nvgpu as _cute_nvgpu_ir
 
 from .. import core, atom
-from ..typing import Shape, Layout, ComposedLayout, Tensor, Numeric, NumericMeta
+from ..typing import (
+    Shape,
+    Layout,
+    ComposedLayout,
+    Tensor,
+    Numeric,
+    NumericMeta,
+    _element_precision_width,
+    is_int_tuple_type,
+)
 from .cpasync.copy import (
     CopyBulkTensorTileG2SOp,
     CopyBulkTensorTileG2SNonExecTrait,
@@ -46,7 +55,10 @@ __all__ = [
 
 @dsl_user_op
 def make_tiled_tma_atom_A(
-    op: Union[CopyBulkTensorTileG2SOp, CopyBulkTensorTileG2SMulticastOp],
+    op: Union[
+        CopyBulkTensorTileG2SOp,
+        CopyBulkTensorTileG2SMulticastOp,
+    ],
     gmem_tensor: Tensor,
     smem_layout: Union[Layout, ComposedLayout],
     mma_tiler_mnk: Shape,
@@ -144,13 +156,19 @@ def make_tiled_tma_atom_A(
         if not isinstance(internal_type, NumericMeta):
             raise TypeError(f"internal_type must be a Numeric, but got {internal_type}")
 
+        gmem_tensor_element_type = gmem_tensor.element_type
+        assert not is_int_tuple_type(gmem_tensor_element_type)
+        gmem_element_precision_width = _element_precision_width(
+            gmem_tensor_element_type
+        )
         use_unpack = (
             itype.width == 8
-            and isinstance(gmem_tensor.element_type, NumericMeta)
-            and gmem_tensor.element_type.width < 8  # type: ignore[union-attr]
+            and isinstance(gmem_tensor_element_type, type)
+            and issubclass(gmem_tensor_element_type, Numeric)
+            and gmem_element_precision_width < 8
         )
         internal_mlir_type = (
-            gmem_tensor.element_type.mlir_type if use_unpack else itype.mlir_type  # type: ignore[union-attr]
+            gmem_tensor_element_type.mlir_type if use_unpack else itype.mlir_type
         )
         tma_format = _cute_nvgpu_ir.TmaDataFormat(
             _cute_nvgpu_ir.get_default_tma_format(internal_mlir_type, use_unpack)
@@ -185,7 +203,10 @@ def make_tiled_tma_atom_A(
 
 @dsl_user_op
 def make_tiled_tma_atom_B(
-    op: Union[CopyBulkTensorTileG2SOp, CopyBulkTensorTileG2SMulticastOp],
+    op: Union[
+        CopyBulkTensorTileG2SOp,
+        CopyBulkTensorTileG2SMulticastOp,
+    ],
     gmem_tensor: Tensor,
     smem_layout: Union[Layout, ComposedLayout],
     mma_tiler_mnk: Shape,
@@ -283,13 +304,19 @@ def make_tiled_tma_atom_B(
         if not isinstance(internal_type, NumericMeta):
             raise TypeError(f"internal_type must be a Numeric, but got {internal_type}")
 
+        gmem_tensor_element_type = gmem_tensor.element_type
+        assert not is_int_tuple_type(gmem_tensor_element_type)
+        gmem_element_precision_width = _element_precision_width(
+            gmem_tensor_element_type
+        )
         use_unpack = (
             itype.width == 8
-            and isinstance(gmem_tensor.element_type, NumericMeta)
-            and gmem_tensor.element_type.width < 8  # type: ignore[union-attr]
+            and isinstance(gmem_tensor_element_type, type)
+            and issubclass(gmem_tensor_element_type, Numeric)
+            and gmem_element_precision_width < 8
         )
         internal_mlir_type = (
-            gmem_tensor.element_type.mlir_type if use_unpack else itype.mlir_type  # type: ignore[union-attr]
+            gmem_tensor_element_type.mlir_type if use_unpack else itype.mlir_type
         )
         tma_format = _cute_nvgpu_ir.TmaDataFormat(
             _cute_nvgpu_ir.get_default_tma_format(internal_mlir_type, use_unpack)
@@ -458,13 +485,19 @@ def make_im2col_tma_atom_A(
         if not isinstance(internal_type, NumericMeta):
             raise TypeError(f"internal_type must be a Numeric, but got {internal_type}")
 
+        gmem_tensor_element_type = gmem_tensor.element_type
+        assert not is_int_tuple_type(gmem_tensor_element_type)
+        gmem_element_precision_width = _element_precision_width(
+            gmem_tensor_element_type
+        )
         use_unpack = (
             itype.width == 8
-            and isinstance(gmem_tensor.element_type, NumericMeta)
-            and gmem_tensor.element_type.width < 8  # type: ignore[union-attr]
+            and isinstance(gmem_tensor_element_type, type)
+            and issubclass(gmem_tensor_element_type, Numeric)
+            and gmem_element_precision_width < 8
         )
         internal_mlir_type = (
-            gmem_tensor.element_type.mlir_type if use_unpack else itype.mlir_type  # type: ignore[union-attr]
+            gmem_tensor_element_type.mlir_type if use_unpack else itype.mlir_type
         )
         tma_format = _cute_nvgpu_ir.TmaDataFormat(
             _cute_nvgpu_ir.get_default_tma_format(internal_mlir_type, use_unpack)
diff --git a/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/helpers.py b/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/helpers.py
index b7c6eece3..6cd1a6316 100644
--- a/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/helpers.py
+++ b/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/helpers.py
@@ -9,7 +9,7 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-from typing import Any, overload, Type, Tuple, Union, Optional
+from typing import Any, Optional, Tuple, Type, Union, cast, overload
 
 from cutlass.cutlass_dsl import dsl_user_op
 
@@ -220,12 +220,12 @@ def commit(
        - :func:`cute.arch.mbarrier_arrive` - General barrier arrive operation
     """
     if cta_group == CtaGroup.ONE:
-        group = nvvm.Tcgen05GroupKind.CTA_1
+        group = nvvm.CTAGroupKind.CTA_1
     else:
         assert cta_group == CtaGroup.TWO
-        group = nvvm.Tcgen05GroupKind.CTA_2
+        group = nvvm.CTAGroupKind.CTA_2
 
-    mbar_ptr = mbar_ptr.llvm_ptr
+    mbar_ptr = cast(Any, mbar_ptr).llvm_ptr
     if mask is not None:
         mask = Int16(mask).ir_value(loc=loc, ip=ip)
         nvvm.tcgen05_commit(mbar_ptr, multicast_mask=mask, group=group, loc=loc, ip=ip)
@@ -314,11 +314,12 @@ def get_tmem_copy_properties(
         num_dp, num_bits = 32, 32
     else:
         raise ValueError(f"expects 'atom' to be a TMEM copy, but got {atom}")
+    op_raw: Any = atom.op
     if is_tmem_load(atom):
-        return num_dp, num_bits, atom.op.repeat.value, atom.op.pack  # type: ignore[union-attr]
+        return num_dp, num_bits, op_raw.repeat.value, op_raw.pack
     else:
         assert is_tmem_store(atom), "atom must be a TMEM store"
-        return num_dp, num_bits, atom.op.repeat.value, atom.op.unpack  # type: ignore[union-attr]
+        return num_dp, num_bits, op_raw.repeat.value, op_raw.unpack
 
 
 @dsl_user_op
@@ -360,7 +361,7 @@ def make_tmem_copy(
     Makes a Tiled Copy instance from a TMEM Copy Atom and a TMEM tensor.
     """
     tiled_copy_val = _cute_nvgpu_ir.atom_make_tmem_copy(
-        atom._trait.value, tmem_tensor.value, loc=loc, ip=ip
+        atom._trait.value, cast(Any, tmem_tensor).value, loc=loc, ip=ip
     )
     new_trait = type(atom._trait)(tiled_copy_val)
     return TiledCopy(atom.op, new_trait)
@@ -379,7 +380,7 @@ def make_s2t_copy(
     """
     tmem_tensor = core.filter_zeros(tmem_tensor, loc=loc, ip=ip)
     tiled_copy_val = _cute_nvgpu_ir.atom_make_s2t_copy(
-        atom._trait.value, tmem_tensor.value, loc=loc, ip=ip
+        atom._trait.value, cast(Any, tmem_tensor).value, loc=loc, ip=ip
     )
     new_trait = type(atom._trait)(tiled_copy_val)
     return TiledCopy(atom.op, new_trait)
@@ -397,7 +398,7 @@ def get_s2t_smem_desc_tensor(
     Returns the SMEM descriptor tensor from a S2T copy atom and a SMEM tensor.
     """
     smem_desc_tensor = _cute_nvgpu_ir.atom_get_copy_s2t_smem_desc_view(
-        atom._trait.value, smem_tensor.value, loc=loc, ip=ip
+        atom._trait.value, cast(Any, smem_tensor).value, loc=loc, ip=ip
     )
     return smem_desc_tensor
 
@@ -440,9 +441,9 @@ def make_umma_smem_desc(
     :return: The shared memory descriptor
     :rtype: SmemDescType
     """
-    src = src.value
+    src = cast(Any, src).value
     if next_src is not None:
-        next_src = next_src.value
+        next_src = cast(Any, next_src).value
 
     return _cute_nvgpu_ir.make_umma_smem_desc(
         src=src,
diff --git a/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py b/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py
index 2c9902424..8194a1ecd 100644
--- a/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py
+++ b/python/CuTeDSL/cutlass/cute/nvgpu/tcgen05/mma.py
@@ -12,7 +12,7 @@
 import enum
 import warnings
 from dataclasses import dataclass
-from typing import Type, Any, Union, Optional, cast, Tuple
+from typing import Any, Optional, Tuple, Type, Union, cast
 
 from cutlass.base_dsl.arch import Arch
 from cutlass.cutlass_dsl import BaseDSL, T, DSLRuntimeError
@@ -20,6 +20,7 @@ from cutlass.cutlass_dsl import BaseDSL, T, DSLRuntimeError
 from cutlass._mlir import ir
 import cutlass._mlir.dialects.cute as _cute_ir
 import cutlass._mlir.dialects.cute_nvgpu as _cute_nvgpu_ir
+import cutlass._mlir.dialects.vector as vector_d
 from typing_extensions import deprecated
 
 from ..common import OpError, normalize_field_to_ir_name
@@ -43,6 +44,7 @@ from ...typing import (
     Int8,
     Uint8,
     Int32,
+    Integer,
     Numeric,
     AddressSpace,
     Pointer,
@@ -160,6 +162,8 @@ class Field(enum.Enum):
     ACCUMULATE = "accum_c"
     SFA = "sf_a"
     SFB = "sf_b"
+    DISABLE_OUTPUT_LANE = "disable_output_lane"
+
     def __str__(self) -> str:
         return f"{self.__class__.__name__}.{self.name}"
 
@@ -170,6 +174,103 @@ class Field(enum.Enum):
         return self.value
 
 
+def _make_disable_output_lane_default(
+    cta_group: CtaGroup,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> ir.Value:
+    num_elts = 4 if cta_group == CtaGroup.ONE else 8
+    vec_ty = ir.VectorType.get([num_elts], Int32.mlir_type)
+    c0 = Int32(0).ir_value(loc=loc, ip=ip)
+    return vector_d.from_elements(vec_ty, [c0] * num_elts, loc=loc, ip=ip)
+
+
+def _coerce_disable_output_lane_value(
+    value: Any,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> ir.Value:
+    if isinstance(value, ir.Value):
+        return value
+    wrapped_val = getattr(value, "value", None)
+    if isinstance(wrapped_val, ir.Value):
+        return wrapped_val
+    if isinstance(value, (tuple, list)):
+        if len(value) not in (4, 8):
+            raise ValueError(
+                "disable_output_lane expects a list/tuple of 4 (CTA_1) or 8 (CTA_2) i32 lanes"
+            )
+        vec_ty = ir.VectorType.get([len(value)], Int32.mlir_type)
+        elems = [Int32(v).ir_value(loc=loc, ip=ip) for v in value]
+        return vector_d.from_elements(vec_ty, elems, loc=loc, ip=ip)
+    raise ValueError(
+        "disable_output_lane expects an mlir value, a DSL value wrapping mlir value, "
+        "or a list/tuple of integers"
+    )
+
+
+def _extract_disable_output_lane_kwarg(kwargs: dict[str, Any]) -> Any:
+    return kwargs.pop("disable_output_lane", None)
+
+
+def _reject_block_scaled_disable_output_lane_kwargs(
+    kwargs: dict[str, Any],
+    kind: str,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> None:
+    disable_output_lane = _extract_disable_output_lane_kwarg(kwargs)
+    if kwargs:
+        unsupported = ", ".join(sorted(kwargs.keys()))
+        raise ValueError(f"unsupported tcgen05 {kind} runtime kwargs: {unsupported}")
+    if disable_output_lane is not None:
+        _coerce_disable_output_lane_value(disable_output_lane, loc=loc, ip=ip)
+        raise ValueError(
+            "disable-output-lane/disable_output_lane is not supported for tcgen05 block-scaled MMA"
+        )
+
+
+def _supports_disable_output_lane_field(atom: ir.Value) -> bool:
+    try:
+        _cute_nvgpu_ir.resolve_atom_field_attr(
+            atom, Field.DISABLE_OUTPUT_LANE._to_ir_field_name()
+        )
+        return True
+    except ValueError:
+        return False
+
+
+def _make_sm10x_umma_atom_type(
+    shape_attr: Any,
+    cta_group: int,
+    a_major_ir: Any,
+    b_major_ir: Any,
+    a_type_ir: Any,
+    b_type_ir: Any,
+    c_type_ir: Any,
+    a_src_ir: Any,
+    use_packed_c_format: bool = False,
+) -> Any:
+    """Construct the appropriate tcgen05 dense UMMA MLIR atom type for the
+    current compilation target architecture.
+
+    """
+    arch = BaseDSL._get_dsl().get_arch_enum()
+    return _cute_nvgpu_ir.MmaAtomSM100UMMAType.get(
+        shape_attr,
+        cta_group,
+        a_major_ir,
+        b_major_ir,
+        a_type_ir,
+        b_type_ir,
+        c_type_ir,
+        a_src_ir,
+        0,  # c_scale_exp
+    )
+
 
 # Base class for all tcgen05 MMA Ops with syntax `tcgen05.mma.cta_group.kind` used to factor out some internal code
 @dataclass(frozen=True)
@@ -327,8 +428,13 @@ class MmaOp(Tcgen05MmaOp):
         return True
 
 
-class MmaTraits(Trait):
-    admissible_fields = [Field.ACCUMULATE, Field.NEGATE_A, Field.NEGATE_B]
+class Sm100MmaTraits(Trait):
+    admissible_fields = [
+        Field.ACCUMULATE,
+        Field.NEGATE_A,
+        Field.NEGATE_B,
+        Field.DISABLE_OUTPUT_LANE,
+    ]
 
     def set(
         self,
@@ -339,17 +445,16 @@ class MmaTraits(Trait):
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
         field_ir = normalize_field_to_ir_name(field, self.admissible_fields)
-        bool_val = Boolean(value).ir_value(loc=loc, ip=ip)
-        try:
-            self.value = _cute_nvgpu_ir.atom_set_value(
-                self.value, field_ir, bool_val, loc=loc, ip=ip
-            )
-        except (TypeError, AttributeError):
-            # Legacy fallback
-            attr = ir.Attribute.parse(f"#cute_nvgpu.atom_mma_field_sm100<{field_ir}>")
-            self.value = _cute_nvgpu_ir.atom_set_value(
-                self.value, attr, bool_val, loc=loc, ip=ip
-            )
+        if field_ir == Field.DISABLE_OUTPUT_LANE._to_ir_field_name():
+            val = _coerce_disable_output_lane_value(value, loc=loc, ip=ip)
+            if not _supports_disable_output_lane_field(self.value):
+                return
+        else:
+            val = Boolean(value).ir_value(loc=loc, ip=ip)
+        attr = _cute_nvgpu_ir.resolve_atom_field_attr(self.value, field_ir)
+        self.value = _cute_nvgpu_ir.atom_set_value(
+            self.value, attr, val, loc=loc, ip=ip
+        )
 
     def get(
         self,
@@ -359,15 +464,38 @@ class MmaTraits(Trait):
         ip: Optional[ir.InsertionPoint] = None,
     ) -> Any:
         field_ir = normalize_field_to_ir_name(field, self.admissible_fields)
-        try:
-            return _cute_nvgpu_ir.atom_get_value(
-                Boolean.mlir_type, self.value, field_ir, loc=loc, ip=ip
-            )
-        except (TypeError, AttributeError):
-            attr = ir.Attribute.parse(f"#cute_nvgpu.atom_mma_field_sm100<{field_ir}>")
-            return _cute_nvgpu_ir.atom_get_value(
-                Boolean.mlir_type, self.value, attr, loc=loc, ip=ip
+        if field_ir == Field.DISABLE_OUTPUT_LANE._to_ir_field_name():
+            raise ValueError(
+                "get(disable_output_lane) is not supported; set it via "
+                "cute.nvgpu.tcgen05.Field.DISABLE_OUTPUT_LANE and pass through cute.gemm/mma_atom_call"
             )
+        attr = _cute_nvgpu_ir.resolve_atom_field_attr(self.value, field_ir)
+        return _cute_nvgpu_ir.atom_get_value(
+            Boolean.mlir_type, self.value, attr, loc=loc, ip=ip
+        )
+
+    def unpack(
+        self,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+        **kwargs: Any,
+    ) -> ir.Value:
+        disable_output_lane = _extract_disable_output_lane_kwarg(kwargs)
+        if kwargs:
+            unsupported = ", ".join(sorted(kwargs.keys()))
+            raise ValueError(f"unsupported tcgen05 MMA runtime kwargs: {unsupported}")
+        if disable_output_lane is None:
+            return self.value
+        if not _supports_disable_output_lane_field(self.value):
+            _coerce_disable_output_lane_value(disable_output_lane, loc=loc, ip=ip)
+            return self.value
+        field_ir = Field.DISABLE_OUTPUT_LANE._to_ir_field_name()
+        attr = _cute_nvgpu_ir.resolve_atom_field_attr(self.value, field_ir)
+        mask_val = _coerce_disable_output_lane_value(
+            disable_output_lane, loc=loc, ip=ip
+        )
+        return _cute_nvgpu_ir.atom_set_value(self.value, attr, mask_val, loc=loc, ip=ip)
 
 
 # Base class for all tcgen05 BlockScaled MMA Ops with syntax `tcgen05.mma.cta_group.kind.block_scale` used to factor out some internal code
@@ -431,14 +559,6 @@ class BlockScaledMmaOp(Tcgen05MmaOp):
                 DeprecationWarning,
                 stacklevel=2,
             )
-            # Normalize the major modes to the new enum type
-            # Since this is a frozen dataclass, we need to use the object.__setattr__ method to set the attributes
-            object.__setattr__(
-                self, "a_major_mode", _OperandMajorMode(self.a_major_mode.value)
-            )
-            object.__setattr__(
-                self, "b_major_mode", _OperandMajorMode(self.b_major_mode.value)
-            )
         # Verify the instruction shape
         shape_mnk_tuple: Any = cast(Any, self.shape_mnk)
         if (rank(shape_mnk_tuple) not in [2, 3]) or (depth(shape_mnk_tuple) != 1):
@@ -560,20 +680,13 @@ class BlockScaledMmaTraits(Trait):
                 raise ValueError(
                     f"expects value to be a pointer for {field_ir}, but got {type(value).__name__}"
                 )
-            val = value.value
+            val = cast(Any, value).value
         else:
             raise ValueError(f"unsupported field: {field_ir}")
-        try:
-            self.value = _cute_nvgpu_ir.atom_set_value(
-                self.value, field_ir, val, loc=loc, ip=ip
-            )
-        except (TypeError, AttributeError):
-            attr = ir.Attribute.parse(
-                f"#cute_nvgpu.atom_mma_field_sm100_block_scaled<{field_ir}>"
-            )
-            self.value = _cute_nvgpu_ir.atom_set_value(
-                self.value, attr, val, loc=loc, ip=ip
-            )
+        attr = _cute_nvgpu_ir.resolve_atom_field_attr(self.value, field_ir)
+        self.value = _cute_nvgpu_ir.atom_set_value(
+            self.value, attr, val, loc=loc, ip=ip
+        )
 
     def get(
         self,
@@ -587,17 +700,22 @@ class BlockScaledMmaTraits(Trait):
             f for f in self.admissible_fields if f not in (Field.SFA, Field.SFB)
         ]
         field_ir = normalize_field_to_ir_name(field, gettable_fields)
-        try:
-            return _cute_nvgpu_ir.atom_get_value(
-                Boolean.mlir_type, self.value, field_ir, loc=loc, ip=ip
-            )
-        except (TypeError, AttributeError):
-            attr = ir.Attribute.parse(
-                f"#cute_nvgpu.atom_mma_field_sm100_block_scaled<{field_ir}>"
-            )
-            return _cute_nvgpu_ir.atom_get_value(
-                Boolean.mlir_type, self.value, attr, loc=loc, ip=ip
-            )
+        attr = _cute_nvgpu_ir.resolve_atom_field_attr(self.value, field_ir)
+        return _cute_nvgpu_ir.atom_get_value(
+            Boolean.mlir_type, self.value, attr, loc=loc, ip=ip
+        )
+
+    def unpack(
+        self,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+        **kwargs: Any,
+    ) -> ir.Value:
+        _reject_block_scaled_disable_output_lane_kwargs(
+            kwargs, "block-scaled MMA", loc=loc, ip=ip
+        )
+        return self.value
 
 
 #
@@ -613,17 +731,48 @@ class MmaTF32Op(MmaOp):
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma>`__.
     This Operation corresponds to the ``.kind::tf32`` qualifier.
 
-    MMA operations should be issued by a single thread. The DSL automatically handles this by
-    implicitly adding ``elect_one()`` around the copy operation.
+    **Supported data type combinations:**
+
+    +-------------+-------------+----------+-------+
+    | A Data Type | B Data Type | Acc Type | Mma-K |
+    +=============+=============+==========+=======+
+    | TF32        | TF32        | F32      | 8     |
+    +-------------+-------------+----------+-------+
+
+    **Supported architectures:** sm_100a, sm_100f, sm_103a, sm_103f, sm_110a, sm_110f
+
+    **Constraints:**
+
+    * CtaGroup.ONE: Mma-M in {64, 128}; 8 <= Mma-N <= 256, step 8
+    * CtaGroup.TWO: Mma-M in {128, 256}; 16 <= Mma-N <= 256, step 16
+    * A and B support both K-major and MN-major (transpose), but only with
+      128B swizzling with 32B swizzle-atomicity. Transpose A requires
+      a_src=SMEM. When a_src=TMEM, A is always K-major.
+
+    **Execution Model:**
+
+    * ``cute.gemm(...)`` (PTX: ``tcgen05.mma``) is asynchronous. Issue granularity is
+      single-thread (for ``.cta_group::1``) or single-thread in a CTA pair
+      (for ``.cta_group::2``), per PTX issue rules.
+    * In user code, issue ``cute.gemm(...)`` as warp-uniform and do not wrap it in
+      ``elect_one()``, as ``elect_one()`` insertion is handled by the compiler.
+    * To observe/sequence MMA completion for dependent non-pipelined operations, call
+      ``cute.nvgpu.tcgen05.commit(...)`` (PTX: ``tcgen05.commit``) and follow the
+      corresponding completion wait/synchronization path.
+    * For completion of tcgen05 TMEM load/store operations, use
+      ``tcgen05.wait::ld`` / ``tcgen05.wait::st`` (PTX waits).
+    * For ordering tcgen05 operations across threads, use
+      ``tcgen05.fence::before_thread_sync`` / ``tcgen05.fence::after_thread_sync``
+      (PTX fences) together with an execution-order synchronization mechanism.
 
     .. code-block:: python
 
-        # CORRECT: MMA without elect_one
+        # CORRECT: warp-uniform tcgen05 MMA
         cute.gemm(mma_atom, d, a, b, c)
 
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.gemm(mma_atom, d, a, b, c)
+        # Signal completion of prior tcgen05 MMA operations
+        with cute.arch.elect_one():
+            cute.nvgpu.tcgen05.commit(mbar_ptr, None, cta_group)
 
     """
 
@@ -671,7 +820,7 @@ class MmaTF32Op(MmaOp):
         **kwargs: Any,
     ) -> "MmaTF32Trait":
         shape_mnk = _pack_shape(self.shape_mnk, loc=loc, ip=ip)
-        ty = _cute_nvgpu_ir.MmaAtomSM100UMMAType.get(
+        ty = _make_sm10x_umma_atom_type(
             shape_mnk.type.attribute,
             self.cta_group.value,
             self.a_major_mode._to_ir(),
@@ -680,23 +829,24 @@ class MmaTF32Op(MmaOp):
             self.b_dtype.mlir_type,
             self.acc_dtype.mlir_type,
             self.a_src._to_ir(),
-            0,
         )
+        operands = [
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            _make_disable_output_lane_default(self.cta_group, loc=loc, ip=ip),
+        ]
         return MmaTF32Trait(
             make_atom(
                 ty,
-                [
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                ],
+                operands,
                 loc=loc,
                 ip=ip,
             )
         )
 
 
-class MmaTF32Trait(MmaTraits):
+class MmaTF32Trait(Sm100MmaTraits):
     pass
 
 
@@ -713,17 +863,45 @@ class MmaF16BF16Op(MmaOp):
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma>`__.
     This Operation corresponds to the ``.kind::f16`` qualifier.
 
-    MMA operations should be issued by a single thread. The DSL automatically handles this by
-    implicitly adding ``elect_one()`` around the copy operation.
+    **Supported data type combinations:**
+
+    +-------------+-------------+----------+-------+
+    | A Data Type | B Data Type | Acc Type | Mma-K |
+    +=============+=============+==========+=======+
+    | F16         | F16         | F16, F32 | 16    |
+    +-------------+-------------+----------+-------+
+    | BF16        | BF16        | F32      | 16    |
+    +-------------+-------------+----------+-------+
+
+    **Supported architectures:** sm_100a, sm_100f, sm_103a, sm_103f, sm_110a, sm_110f
+
+    **Constraints:**
+
+    * CtaGroup.ONE: Mma-M in {64, 128}; 8 <= Mma-N <= 256, step 8
+    * CtaGroup.TWO: Mma-M in {128, 256}; 16 <= Mma-N <= 256, step 16
+    * A and B support both K-major and MN-major (transpose), except with
+      128B swizzling with 32B swizzle-atomicity. Transpose A requires
+      a_src=SMEM. When a_src=TMEM, A is always K-major.
+
+    **Execution Model:**
+
+    * ``cute.gemm(...)`` (PTX: ``tcgen05.mma``) is asynchronous. Issue granularity is
+      single-thread (for ``.cta_group::1``) or single-thread in a CTA pair
+      (for ``.cta_group::2``), per PTX issue rules.
+    * In user code, issue ``cute.gemm(...)`` as warp-uniform and do not wrap it in
+      ``elect_one()``, as ``elect_one()`` insertion is handled by the compiler.
+    * To observe/sequence MMA completion for dependent non-pipelined operations, call
+      ``cute.nvgpu.tcgen05.commit(...)`` (PTX: ``tcgen05.commit``) and follow the
+      corresponding completion wait/synchronization path.
 
     .. code-block:: python
 
-        # CORRECT: MMA without elect_one
+        # CORRECT: warp-uniform tcgen05 MMA
         cute.gemm(mma_atom, d, a, b, c)
 
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.gemm(mma_atom, d, a, b, c)
+        # Signal completion of prior tcgen05 MMA operations
+        with cute.arch.elect_one():
+            cute.nvgpu.tcgen05.commit(mbar_ptr, None, cta_group)
 
     """
 
@@ -786,7 +964,7 @@ class MmaF16BF16Op(MmaOp):
         **kwargs: Any,
     ) -> "MmaF16BF16Trait":
         shape_mnk = _pack_shape(self.shape_mnk, loc=loc, ip=ip)
-        ty = _cute_nvgpu_ir.MmaAtomSM100UMMAType.get(
+        ty = _make_sm10x_umma_atom_type(
             shape_mnk.type.attribute,
             self.cta_group.value,
             self.a_major_mode._to_ir(),
@@ -795,23 +973,29 @@ class MmaF16BF16Op(MmaOp):
             self.b_dtype.mlir_type,
             self.acc_dtype.mlir_type,
             self.a_src._to_ir(),
-            0,
+            use_packed_c_format=(
+                self.acc_dtype is Float16
+                and self.a_dtype is Float16
+                and self.b_dtype is Float16
+            ),
         )
+        operands = [
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            _make_disable_output_lane_default(self.cta_group, loc=loc, ip=ip),
+        ]
         return MmaF16BF16Trait(
             make_atom(
                 ty,
-                [
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                ],
+                operands,
                 loc=loc,
                 ip=ip,
             )
         )
 
 
-class MmaF16BF16Trait(MmaTraits):
+class MmaF16BF16Trait(Sm100MmaTraits):
     pass
 
 
@@ -828,17 +1012,46 @@ class MmaI8Op(MmaOp):
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma>`__.
     This Operation corresponds to the ``.kind::i8`` qualifier.
 
-    MMA operations should be issued by a single thread. The DSL automatically handles this by
-    implicitly adding ``elect_one()`` around the copy operation.
+    **Supported data type combinations:**
+
+    +-------------+-------------+----------+-------+
+    | A Data Type | B Data Type | Acc Type | Mma-K |
+    +=============+=============+==========+=======+
+    | Int8, Uint8 | Int8, Uint8 | Int32    | 32    |
+    +-------------+-------------+----------+-------+
+
+    **Supported architectures:** sm_100a, sm_100f, sm_103a, sm_103f, sm_110a, sm_110f
+
+    **Constraints:**
+
+    * CtaGroup.ONE: Mma-M in {64, 128}; Mma-N in {8, 16, 24, 32, 48, 64, 80, ..., 256}
+      (step 8 for Mma-N <= 32, then step 16 for Mma-N > 32; values like 40, 56 are invalid)
+    * CtaGroup.TWO: Mma-M in {128, 256}; 16 <= Mma-N <= 256, step 16
+    * A and B signedness are independent (mixed signed/unsigned allowed)
+    * A and B support both K-major and MN-major (transpose), except with
+      128B swizzling with 32B swizzle-atomicity. Transpose A requires
+      a_src=SMEM. When a_src=TMEM, A is always K-major.
+    * With B MN-major (8-bit B transpose): Mma-N step changes to 16 for CtaGroup.ONE, 32 for CtaGroup.TWO.
+
+    **Execution Model:**
+
+    * ``cute.gemm(...)`` (PTX: ``tcgen05.mma``) is asynchronous. Issue granularity is
+      single-thread (for ``.cta_group::1``) or single-thread in a CTA pair
+      (for ``.cta_group::2``), per PTX issue rules.
+    * In user code, issue ``cute.gemm(...)`` as warp-uniform and do not wrap it in
+      ``elect_one()``, as ``elect_one()`` insertion is handled by the compiler.
+    * To observe/sequence MMA completion for dependent non-pipelined operations, call
+      ``cute.nvgpu.tcgen05.commit(...)`` (PTX: ``tcgen05.commit``) and follow the
+      corresponding completion wait/synchronization path.
 
     .. code-block:: python
 
-        # CORRECT: MMA without elect_one
+        # CORRECT: warp-uniform tcgen05 MMA
         cute.gemm(mma_atom, d, a, b, c)
 
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.gemm(mma_atom, d, a, b, c)
+        # Signal completion of prior tcgen05 MMA operations
+        with cute.arch.elect_one():
+            cute.nvgpu.tcgen05.commit(mbar_ptr, None, cta_group)
 
     """
 
@@ -894,32 +1107,35 @@ class MmaI8Op(MmaOp):
         **kwargs: Any,
     ) -> "MmaI8Trait":
         shape_mnk = _pack_shape(self.shape_mnk, loc=loc, ip=ip)
-        ty = _cute_nvgpu_ir.MmaAtomSM100UMMAType.get(
+        # MmaI8 only operates on integer dtypes.
+        assert issubclass(self.a_dtype, Integer) and issubclass(self.b_dtype, Integer)
+        ty = _make_sm10x_umma_atom_type(
             shape_mnk.type.attribute,
             self.cta_group.value,
             self.a_major_mode._to_ir(),
             self.b_major_mode._to_ir(),
-            (T.si8() if self.a_dtype.signed else T.ui8()),  # type: ignore[attr-defined]
-            (T.si8() if self.b_dtype.signed else T.ui8()),  # type: ignore[attr-defined]
+            (T.si8() if self.a_dtype.signed else T.ui8()),
+            (T.si8() if self.b_dtype.signed else T.ui8()),
             T.si32(),
             self.a_src._to_ir(),
-            0,
         )
+        operands = [
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            _make_disable_output_lane_default(self.cta_group, loc=loc, ip=ip),
+        ]
         return MmaI8Trait(
             make_atom(
                 ty,
-                [
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                ],
+                operands,
                 loc=loc,
                 ip=ip,
             )
         )
 
 
-class MmaI8Trait(MmaTraits):
+class MmaI8Trait(Sm100MmaTraits):
     pass
 
 
@@ -939,17 +1155,46 @@ class MmaFP8Op(MmaOp):
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma>`__.
 
-    MMA operations should be issued by a single thread. The DSL automatically handles this by
-    implicitly adding ``elect_one()`` around the copy operation.
+    **Supported data type combinations:**
+
+    +-------------+-------------+----------+-------+
+    | A Data Type | B Data Type | Acc Type | Mma-K |
+    +=============+=============+==========+=======+
+    | E4M3, E5M2  | E4M3, E5M2  | F16, F32 | 32    |
+    +-------------+-------------+----------+-------+
+
+    **Supported architectures:** sm_100a, sm_100f, sm_103a, sm_103f, sm_110a, sm_110f
+
+    **Constraints:**
+
+    * A and B data types must be the same
+    * CtaGroup.ONE: Mma-M in {64, 128}; 8 <= Mma-N <= 256, step 8
+    * CtaGroup.TWO: Mma-M in {128, 256}; 16 <= Mma-N <= 256, step 16
+    * With B-major=MN: Mma-N step doubles (16 for CtaGroup.ONE, 32 for CtaGroup.TWO)
+    * A and B support both K-major and MN-major (transpose), except with
+      128B swizzling with 32B swizzle-atomicity. Transpose A requires
+      a_src=SMEM. When a_src=TMEM, A is always K-major.
+    * With 8-bit B transpose (MN-major): Mma-N step changes to 16 for CtaGroup.ONE, 32 for CtaGroup.TWO.
+
+    **Execution Model:**
+
+    * ``cute.gemm(...)`` (PTX: ``tcgen05.mma``) is asynchronous. Issue granularity is
+      single-thread (for ``.cta_group::1``) or single-thread in a CTA pair
+      (for ``.cta_group::2``), per PTX issue rules.
+    * In user code, issue ``cute.gemm(...)`` as warp-uniform and do not wrap it in
+      ``elect_one()``, as ``elect_one()`` insertion is handled by the compiler.
+    * To observe/sequence MMA completion for dependent non-pipelined operations, call
+      ``cute.nvgpu.tcgen05.commit(...)`` (PTX: ``tcgen05.commit``) and follow the
+      corresponding completion wait/synchronization path.
 
     .. code-block:: python
 
-        # CORRECT: MMA without elect_one
+        # CORRECT: warp-uniform tcgen05 MMA
         cute.gemm(mma_atom, d, a, b, c)
 
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.gemm(mma_atom, d, a, b, c)
+        # Signal completion of prior tcgen05 MMA operations
+        with cute.arch.elect_one():
+            cute.nvgpu.tcgen05.commit(mbar_ptr, None, cta_group)
 
     """
 
@@ -1015,7 +1260,7 @@ class MmaFP8Op(MmaOp):
         **kwargs: Any,
     ) -> "MmaFP8Trait":
         shape_mnk = _pack_shape(self.shape_mnk, loc=loc, ip=ip)
-        ty = _cute_nvgpu_ir.MmaAtomSM100UMMAType.get(
+        ty = _make_sm10x_umma_atom_type(
             shape_mnk.type.attribute,
             self.cta_group.value,
             self.a_major_mode._to_ir(),
@@ -1024,23 +1269,25 @@ class MmaFP8Op(MmaOp):
             self.b_dtype.mlir_type,
             self.acc_dtype.mlir_type,
             self.a_src._to_ir(),
-            0,
+            use_packed_c_format=(self.acc_dtype is Float16),
         )
+        operands = [
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            _make_disable_output_lane_default(self.cta_group, loc=loc, ip=ip),
+        ]
         return MmaFP8Trait(
             make_atom(
                 ty,
-                [
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                ],
+                operands,
                 loc=loc,
                 ip=ip,
             )
         )
 
 
-class MmaFP8Trait(MmaTraits):
+class MmaFP8Trait(Sm100MmaTraits):
     pass
 
 
@@ -1051,17 +1298,46 @@ class MmaF8F6F4Op(MmaOp):
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma>`__.
 
-    MMA operations should be issued by a single thread. The DSL automatically handles this by
-    implicitly adding ``elect_one()`` around the copy operation.
+    **Supported data type combinations:**
+
+    +------------------------------+------------------------------+----------+-------+
+    | A Data Type                  | B Data Type                  | Acc Type | Mma-K |
+    +==============================+==============================+==========+=======+
+    | E4M3, E5M2, E3M2, E2M3, E2M1 | E4M3, E5M2, E3M2, E2M3, E2M1 | F16, F32 | 32    |
+    +------------------------------+------------------------------+----------+-------+
+
+    **Supported architectures:** sm_100a, sm_100f, sm_103a, sm_103f, sm_110a, sm_110f
+
+    **Constraints:**
+
+    * A and B data types are independent (mixed F8/F6/F4 allowed)
+    * CtaGroup.ONE: Mma-M in {64, 128}; 8 <= Mma-N <= 256, step 8
+    * CtaGroup.TWO: Mma-M in {128, 256}; 16 <= Mma-N <= 256, step 16
+    * With B-major=MN: Mma-N step doubles (16 for CtaGroup.ONE, 32 for CtaGroup.TWO)
+    * A and B support both K-major and MN-major (transpose), except with
+      128B swizzling with 32B swizzle-atomicity. Transpose A requires
+      a_src=SMEM. When a_src=TMEM, A is always K-major.
+    * With 8-bit B transpose (MN-major): Mma-N step changes to 16 for CtaGroup.ONE, 32 for CtaGroup.TWO.
+
+    **Execution Model:**
+
+    * ``cute.gemm(...)`` (PTX: ``tcgen05.mma``) is asynchronous. Issue granularity is
+      single-thread (for ``.cta_group::1``) or single-thread in a CTA pair
+      (for ``.cta_group::2``), per PTX issue rules.
+    * In user code, issue ``cute.gemm(...)`` as warp-uniform and do not wrap it in
+      ``elect_one()``, as ``elect_one()`` insertion is handled by the compiler.
+    * To observe/sequence MMA completion for dependent non-pipelined operations, call
+      ``cute.nvgpu.tcgen05.commit(...)`` (PTX: ``tcgen05.commit``) and follow the
+      corresponding completion wait/synchronization path.
 
     .. code-block:: python
 
-        # CORRECT: MMA without elect_one
+        # CORRECT: warp-uniform tcgen05 MMA
         cute.gemm(mma_atom, d, a, b, c)
 
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.gemm(mma_atom, d, a, b, c)
+        # Signal completion of prior tcgen05 MMA operations
+        with cute.arch.elect_one():
+            cute.nvgpu.tcgen05.commit(mbar_ptr, None, cta_group)
 
     """
 
@@ -1131,7 +1407,7 @@ class MmaF8F6F4Op(MmaOp):
         **kwargs: Any,
     ) -> "MmaF8F6F4Trait":
         shape_mnk = _pack_shape(self.shape_mnk, loc=loc, ip=ip)
-        ty = _cute_nvgpu_ir.MmaAtomSM100UMMAType.get(
+        ty = _make_sm10x_umma_atom_type(
             shape_mnk.type.attribute,
             self.cta_group.value,
             self.a_major_mode._to_ir(),
@@ -1140,23 +1416,25 @@ class MmaF8F6F4Op(MmaOp):
             self.b_dtype.mlir_type,
             self.acc_dtype.mlir_type,
             self.a_src._to_ir(),
-            0,
+            use_packed_c_format=(self.acc_dtype is Float16),
         )
+        operands = [
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            Boolean(False).ir_value(loc=loc, ip=ip),
+            _make_disable_output_lane_default(self.cta_group, loc=loc, ip=ip),
+        ]
         return MmaF8F6F4Trait(
             make_atom(
                 ty,
-                [
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                    Boolean(False).ir_value(loc=loc, ip=ip),
-                ],
+                operands,
                 loc=loc,
                 ip=ip,
             )
         )
 
 
-class MmaF8F6F4Trait(MmaTraits):
+class MmaF8F6F4Trait(Sm100MmaTraits):
     pass
 
 
@@ -1178,17 +1456,47 @@ class MmaMXF8Op(BlockScaledMmaOp):
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma>`__.
     This Operation corresponds to the ``.kind::mxf8f6f4`` qualifier.
 
-    MMA operations should be issued by a single thread. The DSL automatically handles this by
-    implicitly adding ``elect_one()`` around the copy operation.
+    **Supported data type combinations:**
+
+    +-------------+-------------+--------------+----------+-------+-------------+
+    | A Data Type | B Data Type | SF Data Type | Acc Type | Mma-K | SF Vec Size |
+    +=============+=============+==============+==========+=======+=============+
+    | E4M3, E5M2  | E4M3, E5M2  | UE8M0        | F32      | 32    | 32          |
+    +-------------+-------------+--------------+----------+-------+-------------+
+
+    **Supported architectures:** sm_100a, sm_103a
+
+    **Constraints:**
+
+    * A and B data types must be the same
+    * CtaGroup.ONE: Mma-M = 128; 8 <= Mma-N <= 256, step 8
+    * CtaGroup.TWO: Mma-M in {128, 256}; 16 <= Mma-N <= 256, step 16
+    * A and B support both K-major and MN-major (transpose), except with
+      128B swizzling with 32B swizzle-atomicity. Transpose A requires
+      a_src=SMEM. When a_src=TMEM, A is always K-major.
+    * With 8-bit B transpose (MN-major): Mma-N step changes to 16 for CtaGroup.ONE, 32 for CtaGroup.TWO.
+
+    **Execution Model:**
+
+    * ``cute.gemm(...)`` (PTX: ``tcgen05.mma``) is asynchronous. Issue granularity is
+      single-thread (for ``.cta_group::1``) or single-thread in a CTA pair
+      (for ``.cta_group::2``), per PTX issue rules.
+    * In user code, issue ``cute.gemm(...)`` as warp-uniform and do not wrap it in
+      ``elect_one()``, as ``elect_one()`` insertion is handled by the compiler.
+    * For block-scaled MMA, pass A and B as paired operands in ``cute.gemm(...)``:
+      ``[a, sfa]`` and ``[b, sfb]``.
+    * To observe/sequence MMA completion for dependent non-pipelined operations, call
+      ``cute.nvgpu.tcgen05.commit(...)`` (PTX: ``tcgen05.commit``) and follow the
+      corresponding completion wait/synchronization path.
 
     .. code-block:: python
 
-        # CORRECT: MMA without elect_one
-        cute.gemm(mma_atom, d, a, b, c)
+        # CORRECT: warp-uniform tcgen05 MMA
+        cute.gemm(mma_atom, d, [a, sfa], [b, sfb], c)
 
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.gemm(mma_atom, d, a, b, c)
+        # Signal completion of prior tcgen05 MMA operations
+        with cute.arch.elect_one():
+            cute.nvgpu.tcgen05.commit(mbar_ptr, None, cta_group)
 
     """
 
@@ -1290,17 +1598,47 @@ class MmaMXF8F6F4Op(BlockScaledMmaOp):
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma>`__.
     This Operation corresponds to the ``.kind::mxf8f6f4`` qualifier.
 
-    MMA operations should be issued by a single thread. The DSL automatically handles this by
-    implicitly adding ``elect_one()`` around the copy operation.
+    **Supported data type combinations:**
+
+    +------------------------------+------------------------------+--------------+----------+-------+-------------+
+    | A Data Type                  | B Data Type                  | SF Data Type | Acc Type | Mma-K | SF Vec Size |
+    +==============================+==============================+==============+==========+=======+=============+
+    | E4M3, E5M2, E3M2, E2M3, E2M1 | E4M3, E5M2, E3M2, E2M3, E2M1 | UE8M0        | F32      | 32    | 32          |
+    +------------------------------+------------------------------+--------------+----------+-------+-------------+
+
+    **Supported architectures:** sm_100a, sm_103a
+
+    **Constraints:**
+
+    * A and B data types are independent (mixed F8/F6/F4 allowed)
+    * CtaGroup.ONE: Mma-M = 128; 8 <= Mma-N <= 256, step 8
+    * CtaGroup.TWO: Mma-M in {128, 256}; 16 <= Mma-N <= 256, step 16
+    * A and B support both K-major and MN-major (transpose), except with
+      128B swizzling with 32B swizzle-atomicity. Transpose A requires
+      a_src=SMEM. When a_src=TMEM, A is always K-major.
+    * With 8-bit B transpose (MN-major): Mma-N step changes to 16 for CtaGroup.ONE, 32 for CtaGroup.TWO.
+
+    **Execution Model:**
+
+    * ``cute.gemm(...)`` (PTX: ``tcgen05.mma``) is asynchronous. Issue granularity is
+      single-thread (for ``.cta_group::1``) or single-thread in a CTA pair
+      (for ``.cta_group::2``), per PTX issue rules.
+    * In user code, issue ``cute.gemm(...)`` as warp-uniform and do not wrap it in
+      ``elect_one()``, as ``elect_one()`` insertion is handled by the compiler.
+    * For block-scaled MMA, pass A and B as paired operands in ``cute.gemm(...)``:
+      ``[a, sfa]`` and ``[b, sfb]``.
+    * To observe/sequence MMA completion for dependent non-pipelined operations, call
+      ``cute.nvgpu.tcgen05.commit(...)`` (PTX: ``tcgen05.commit``) and follow the
+      corresponding completion wait/synchronization path.
 
     .. code-block:: python
 
-        # CORRECT: MMA without elect_one
-        cute.gemm(mma_atom, d, a, b, c)
+        # CORRECT: warp-uniform tcgen05 MMA
+        cute.gemm(mma_atom, d, [a, sfa], [b, sfb], c)
 
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.gemm(mma_atom, d, a, b, c)
+        # Signal completion of prior tcgen05 MMA operations
+        with cute.arch.elect_one():
+            cute.nvgpu.tcgen05.commit(mbar_ptr, None, cta_group)
 
     """
 
@@ -1415,17 +1753,43 @@ class MmaMXF4Op(BlockScaledMmaOp):
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma>`__.
     This Operation corresponds to the ``.kind::mxf4`` qualifier.
 
-    MMA operations should be issued by a single thread. The DSL automatically handles this by
-    implicitly adding ``elect_one()`` around the copy operation.
+    **Supported data type combinations:**
+
+    +-------------+-------------+--------------+----------+-------+-------------+
+    | A Data Type | B Data Type | SF Data Type | Acc Type | Mma-K | SF Vec Size |
+    +=============+=============+==============+==========+=======+=============+
+    | E2M1        | E2M1        | UE8M0        | F32      | 64    | 32          |
+    +-------------+-------------+--------------+----------+-------+-------------+
+
+    **Supported architectures:** sm_100a, sm_103a
+
+    **Constraints:**
+
+    * CtaGroup.ONE: Mma-M = 128; 8 <= Mma-N <= 256, step 8
+    * CtaGroup.TWO: Mma-M in {128, 256}; 16 <= Mma-N <= 256, step 16
+    * Transpose (MN-major) is not supported. Both A and B must be K-major.
+
+    **Execution Model:**
+
+    * ``cute.gemm(...)`` (PTX: ``tcgen05.mma``) is asynchronous. Issue granularity is
+      single-thread (for ``.cta_group::1``) or single-thread in a CTA pair
+      (for ``.cta_group::2``), per PTX issue rules.
+    * In user code, issue ``cute.gemm(...)`` as warp-uniform and do not wrap it in
+      ``elect_one()``, as ``elect_one()`` insertion is handled by the compiler.
+    * For block-scaled MMA, pass A and B as paired operands in ``cute.gemm(...)``:
+      ``[a, sfa]`` and ``[b, sfb]``.
+    * To observe/sequence MMA completion for dependent non-pipelined operations, call
+      ``cute.nvgpu.tcgen05.commit(...)`` (PTX: ``tcgen05.commit``) and follow the
+      corresponding completion wait/synchronization path.
 
     .. code-block:: python
 
-        # CORRECT: MMA without elect_one
-        cute.gemm(mma_atom, d, a, b, c)
+        # CORRECT: warp-uniform tcgen05 MMA
+        cute.gemm(mma_atom, d, [a, sfa], [b, sfb], c)
 
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.gemm(mma_atom, d, a, b, c)
+        # Signal completion of prior tcgen05 MMA operations
+        with cute.arch.elect_one():
+            cute.nvgpu.tcgen05.commit(mbar_ptr, None, cta_group)
 
     """
 
@@ -1522,17 +1886,43 @@ class MmaMXF4NVF4Op(BlockScaledMmaOp):
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-instructions-mma>`__.
     This Operation corresponds to the ``.kind::mxf4nvf4`` qualifier.
 
-    MMA operations should be issued by a single thread. The DSL automatically handles this by
-    implicitly adding ``elect_one()`` around the copy operation.
+    **Supported data type combinations:**
+
+    +-------------+-------------+--------------+----------+-------+-------------+
+    | A Data Type | B Data Type | SF Data Type | Acc Type | Mma-K | SF Vec Size |
+    +=============+=============+==============+==========+=======+=============+
+    | E2M1        | E2M1        | UE8M0, UE4M3 | F32      | 64    | 16          |
+    +-------------+-------------+--------------+----------+-------+-------------+
+
+    **Supported architectures:** sm_100a, sm_103a
+
+    **Constraints:**
+
+    * CtaGroup.ONE: Mma-M = 128; 8 <= Mma-N <= 256, step 8
+    * CtaGroup.TWO: Mma-M in {128, 256}; 16 <= Mma-N <= 256, step 16
+    * Transpose (MN-major) is not supported. Both A and B must be K-major.
+
+    **Execution Model:**
+
+    * ``cute.gemm(...)`` (PTX: ``tcgen05.mma``) is asynchronous. Issue granularity is
+      single-thread (for ``.cta_group::1``) or single-thread in a CTA pair
+      (for ``.cta_group::2``), per PTX issue rules.
+    * In user code, issue ``cute.gemm(...)`` as warp-uniform and do not wrap it in
+      ``elect_one()``, as ``elect_one()`` insertion is handled by the compiler.
+    * For block-scaled MMA, pass A and B as paired operands in ``cute.gemm(...)``:
+      ``[a, sfa]`` and ``[b, sfb]``.
+    * To observe/sequence MMA completion for dependent non-pipelined operations, call
+      ``cute.nvgpu.tcgen05.commit(...)`` (PTX: ``tcgen05.commit``) and follow the
+      corresponding completion wait/synchronization path.
 
     .. code-block:: python
 
-        # CORRECT: MMA without elect_one
-        cute.gemm(mma_atom, d, a, b, c)
+        # CORRECT: warp-uniform tcgen05 MMA
+        cute.gemm(mma_atom, d, [a, sfa], [b, sfb], c)
 
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.gemm(mma_atom, d, a, b, c)
+        # Signal completion of prior tcgen05 MMA operations
+        with cute.arch.elect_one():
+            cute.nvgpu.tcgen05.commit(mbar_ptr, None, cta_group)
 
     """
 
@@ -1637,17 +2027,43 @@ class SM103MmaMXF4Op(BlockScaledMmaOp):
     This Operation corresponds to the ``.kind::mxf4`` qualifier.
     This Operation is for SM103.
 
-    MMA operations should be issued by a single thread. The DSL automatically handles this by
-    implicitly adding ``elect_one()`` around the copy operation.
+    **Supported data type combinations:**
+
+    +-------------+-------------+--------------+----------+-------+-------------+
+    | A Data Type | B Data Type | SF Data Type | Acc Type | Mma-K | SF Vec Size |
+    +=============+=============+==============+==========+=======+=============+
+    | E2M1        | E2M1        | UE8M0        | F32      | 96    | 32          |
+    +-------------+-------------+--------------+----------+-------+-------------+
+
+    **Supported architectures:** sm_100a, sm_103a (K=96 requires sm_103a+)
+
+    **Constraints:**
+
+    * CtaGroup.ONE: Mma-M = 128; 8 <= Mma-N <= 256, step 8
+    * CtaGroup.TWO: Mma-M in {128, 256}; 16 <= Mma-N <= 256, step 16
+    * Transpose (MN-major) is not supported. Both A and B must be K-major.
+
+    **Execution Model:**
+
+    * ``cute.gemm(...)`` (PTX: ``tcgen05.mma``) is asynchronous. Issue granularity is
+      single-thread (for ``.cta_group::1``) or single-thread in a CTA pair
+      (for ``.cta_group::2``), per PTX issue rules.
+    * In user code, issue ``cute.gemm(...)`` from warp-uniform code and do not wrap it in
+      ``elect_one()``; the compiler inserts ``elect_one()`` itself.
+    * For block-scaled MMA, pass A and B as paired operands in ``cute.gemm(...)``:
+      ``[a, sfa]`` and ``[b, sfb]``.
+    * To order dependent non-pipelined operations after MMA completion, call
+      ``cute.nvgpu.tcgen05.commit(...)`` (PTX: ``tcgen05.commit``), then wait on the
+      mbarrier passed to it before consuming the results.
 
     .. code-block:: python
 
-        # CORRECT: MMA without elect_one
-        cute.gemm(mma_atom, d, a, b, c)
+        # CORRECT: warp-uniform tcgen05 MMA
+        cute.gemm(mma_atom, d, [a, sfa], [b, sfb], c)
 
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.gemm(mma_atom, d, a, b, c)
+        # Signal completion of prior tcgen05 MMA operations
+        with cute.arch.elect_one():
+            cute.nvgpu.tcgen05.commit(mbar_ptr, None, cta_group)
 
     """
 
@@ -1742,17 +2158,43 @@ class SM103MmaMXF4NVF4Op(BlockScaledMmaOp):
     This Operation corresponds to the ``.kind::mxf4nvf4`` qualifier.
     This Operation is for SM103.
 
-    MMA operations should be issued by a single thread. The DSL automatically handles this by
-    implicitly adding ``elect_one()`` around the copy operation.
+    **Supported data type combinations:**
+
+    +-------------+-------------+--------------+----------+-------+-------------+
+    | A Data Type | B Data Type | SF Data Type | Acc Type | Mma-K | SF Vec Size |
+    +=============+=============+==============+==========+=======+=============+
+    | E2M1        | E2M1        | UE8M0, UE4M3 | F32      | 96    | 16          |
+    +-------------+-------------+--------------+----------+-------+-------------+
+
+    **Supported architectures:** sm_100a, sm_103a (K=96 requires sm_103a+)
+
+    **Constraints:**
+
+    * CtaGroup.ONE: Mma-M = 128; 8 <= Mma-N <= 256, step 8
+    * CtaGroup.TWO: Mma-M in {128, 256}; 16 <= Mma-N <= 256, step 16
+    * Transpose (MN-major) is not supported. Both A and B must be K-major.
+
+    **Execution Model:**
+
+    * ``cute.gemm(...)`` (PTX: ``tcgen05.mma``) is asynchronous. Issue granularity is
+      single-thread (for ``.cta_group::1``) or single-thread in a CTA pair
+      (for ``.cta_group::2``), per PTX issue rules.
+    * In user code, issue ``cute.gemm(...)`` from warp-uniform code and do not wrap it in
+      ``elect_one()``; the compiler inserts ``elect_one()`` itself.
+    * For block-scaled MMA, pass A and B as paired operands in ``cute.gemm(...)``:
+      ``[a, sfa]`` and ``[b, sfb]``.
+    * To order dependent non-pipelined operations after MMA completion, call
+      ``cute.nvgpu.tcgen05.commit(...)`` (PTX: ``tcgen05.commit``), then wait on the
+      mbarrier passed to it before consuming the results.
 
     .. code-block:: python
 
-        # CORRECT: MMA without elect_one
-        cute.gemm(mma_atom, d, a, b, c)
+        # CORRECT: warp-uniform tcgen05 MMA
+        cute.gemm(mma_atom, d, [a, sfa], [b, sfb], c)
 
-        # WRONG: Do NOT wrap in elect_one (can cause deadlock)
-        with cute.arch.elect_one():  # INCORRECT
-            cute.gemm(mma_atom, d, a, b, c)
+        # Signal completion of prior tcgen05 MMA operations
+        with cute.arch.elect_one():
+            cute.nvgpu.tcgen05.commit(mbar_ptr, None, cta_group)
 
     """
 
diff --git a/python/CuTeDSL/cutlass/cute/nvgpu/warp/copy.py b/python/CuTeDSL/cutlass/cute/nvgpu/warp/copy.py
index ff40e6b26..c9922938d 100644
--- a/python/CuTeDSL/cutlass/cute/nvgpu/warp/copy.py
+++ b/python/CuTeDSL/cutlass/cute/nvgpu/warp/copy.py
@@ -10,7 +10,7 @@
 # is strictly prohibited.
 
 from dataclasses import dataclass
-from typing import Any, Optional, Type
+from typing import Any, Type
 
 import cutlass._mlir.dialects.cute_nvgpu as _cute_nvgpu_ir
 from cutlass._mlir import ir
diff --git a/python/CuTeDSL/cutlass/cute/nvgpu/warp/mma.py b/python/CuTeDSL/cutlass/cute/nvgpu/warp/mma.py
index dbdc296a4..9df4d9d03 100644
--- a/python/CuTeDSL/cutlass/cute/nvgpu/warp/mma.py
+++ b/python/CuTeDSL/cutlass/cute/nvgpu/warp/mma.py
@@ -10,7 +10,7 @@
 # is strictly prohibited.
 
 from dataclasses import dataclass
-from typing import Any, Optional, Type
+from typing import Any, Optional, Type, cast
 
 import enum
 from cutlass.base_dsl.arch import Arch
@@ -60,6 +60,33 @@ class MmaF16BF16Op(WarpMmaOp):
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#warp-level-matrix-instructions-mma>`__.
     This Operation covers the instructions using the ``.f16`` or ``.bf16`` qualifiers for the input operands.
+
+    **Supported data type combinations:**
+
+    +-------------+-------------+----------+---------------------+
+    | A Data Type | B Data Type | Acc Type | Mma-MNK             |
+    +=============+=============+==========+=====================+
+    | F16         | F16         | F16, F32 | (16,8,8), (16,8,16) |
+    +-------------+-------------+----------+---------------------+
+    | BF16        | BF16        | F32      | (16,8,8), (16,8,16) |
+    +-------------+-------------+----------+---------------------+
+
+    **Supported architectures:** sm_80+
+
+    **Constraints:**
+
+    * Operand layout is fixed: A = row-major (K-major), B = col-major (K-major). Transpose is not supported.
+
+    **Execution Model:**
+
+    * Warp-level MMA (``mma.sync.aligned``) is a warp-collective synchronous operation. All lanes
+      in the warp must execute the same MMA instruction in convergence.
+    * In user code, ``cute.gemm(...)`` should be issued as warp-uniform code.
+
+    .. code-block:: python
+
+        cute.gemm(mma_atom, d, a, b, c)
+
     """
 
     ab_dtype: Type[Numeric]
@@ -142,6 +169,32 @@ class MmaFP8Op(WarpMmaOp):
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#warp-level-matrix-instructions-mma>`__.
     This Operation covers the instructions using the ``.e4m3`` or ``.e5m2`` qualifiers for the input operands.
+
+    **Supported data type combinations:**
+
+    +-------------+-------------+----------+----------------------+
+    | A Data Type | B Data Type | Acc Type | Mma-MNK              |
+    +=============+=============+==========+======================+
+    | E4M3        | E4M3        | F16, F32 | (16,8,16), (16,8,32) |
+    +-------------+-------------+----------+----------------------+
+    | E5M2        | E5M2        | F16, F32 | (16,8,16), (16,8,32) |
+    +-------------+-------------+----------+----------------------+
+
+    **Supported architectures:** sm_89+
+
+    **Constraints:**
+
+    * Operand layout is fixed: A = row-major (K-major), B = col-major (K-major). Transpose is not supported.
+
+    **Execution Model:**
+
+    * Warp-level MMA (``mma.sync.aligned``) is a warp-collective synchronous operation. All lanes
+      in the warp must execute the same MMA instruction in convergence.
+    * In user code, ``cute.gemm(...)`` should be issued as warp-uniform code.
+
+    .. code-block:: python
+
+        cute.gemm(mma_atom, d, a, b, c)
     """
 
     ab_dtype: Type[Numeric]
@@ -235,10 +288,7 @@ class MmaSM120BlockScaledOp(MmaOp):
         if arch not in self.admissible_archs:
             raise OpError(
                 self,
-                f"expects arch to be one of {self.admissible_archs}, but got {arch}"
-                " - Note: sm_120f is currently not supported, "
-                " please compile for your local GPU architecture instead with env "
-                "CUTE_DSL_ARCH set to sm_120a or sm_121a",
+                f"expects arch to be one of {self.admissible_archs}, but got {arch}",
                 suggestion="Ensure env CUTE_DSL_ARCH matches your GPU architecture",
             )
         # (ab_dtype, shape_mnk) consistency: FP4 uses (16,8,64); FP8 uses (16,8,32).
@@ -356,17 +406,18 @@ class MmaBlockScaledTrait(Trait):
             raise ValueError(
                 f"expects field to be one of {self.admissible_fields}, but got {field}"
             )
+        mlir_operand = value
         if field in [Field.SFA, Field.SFB]:
             if not isinstance(value, Pointer):
                 raise ValueError(
                     f"expects value to be a pointer for {field}, but got {type(value).__name__}"
                 )
-            value = value.value
+            mlir_operand = cast(Any, value).value
 
         field_name = f"#cute_nvgpu.atom_mma_field_sm120_block_scaled<{field._to_ir_field_name()}>"
         attr = ir.Attribute.parse(field_name)
         self.value = _cute_nvgpu_ir.atom_set_value(
-            self.value, attr, value, loc=loc, ip=ip
+            self.value, attr, mlir_operand, loc=loc, ip=ip
         )
 
     def get(
@@ -376,6 +427,7 @@ class MmaBlockScaledTrait(Trait):
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> Any:
+        # SM120 warp-level block-scaled MMA does not support the get method for any field
         raise ValueError(f"the get method for {field} is not supported")
 
 
@@ -394,6 +446,30 @@ class MmaMXF4Op(MmaSM120BlockScaledOp):
     .kind           = {.kind::mxf4};
     .scale_vec_size = {.scale_vec::2X};
     .stype          = {.ue8m0};
+
+    **Supported data type combinations:**
+
+    +-------------+-------------+--------------+----------+-----------+-------------+
+    | A Data Type | B Data Type | SF Data Type | Acc Type | Mma-MNK   | SF Vec Size |
+    +=============+=============+==============+==========+===========+=============+
+    | E2M1        | E2M1        | UE8M0        | F32      | (16,8,64) | 32          |
+    +-------------+-------------+--------------+----------+-----------+-------------+
+
+    **Supported architectures:** sm_120a, sm_120f, sm_121a
+
+    **Constraints:**
+
+    * Operand layout is fixed: A = row-major (K-major), B = col-major (K-major). Transpose is not supported.
+
+    **Execution Model:**
+
+    * Block-scaled warp-level MMA (``mma.sync.aligned`` with ``.block_scale``) is a warp-collective
+      synchronous operation. All lanes in the warp must execute the same MMA instruction in convergence.
+    * In user code, ``cute.gemm(...)`` should be issued as warp-uniform code.
+
+    .. code-block:: python
+
+        cute.gemm(mma_atom, d, a, b, c)
     """
 
     descriptive_name = "warp-level MXF4 MMA Operation"
@@ -449,8 +525,32 @@ class MmaMXF4NVF4Op(MmaSM120BlockScaledOp):
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#warp-level-matrix-instructions-mma>`__.
     This Operation covers the instructions using the ``.e2m1`` qualifiers for the input operands.
     .kind           = {.kind::mxf4nvf4};
-    .scale_vec_size = {.scale_vec::2X, .scale_vec::4X};
-    .stype          = {.ue8m0, .ue4m3};
+    .scale_vec_size = {.scale_vec::4X};
+    .stype          = {.ue4m3};
+
+    **Supported data type combinations:**
+
+    +-------------+-------------+--------------+----------+-----------+-------------+
+    | A Data Type | B Data Type | SF Data Type | Acc Type | Mma-MNK   | SF Vec Size |
+    +=============+=============+==============+==========+===========+=============+
+    | E2M1        | E2M1        | UE4M3        | F32      | (16,8,64) | 16          |
+    +-------------+-------------+--------------+----------+-----------+-------------+
+
+    **Supported architectures:** sm_120a, sm_120f, sm_121a
+
+    **Constraints:**
+
+    * Operand layout is fixed: A = row-major (K-major), B = col-major (K-major). Transpose is not supported.
+
+    **Execution Model:**
+
+    * Block-scaled warp-level MMA (``mma.sync.aligned`` with ``.block_scale``) is a warp-collective
+      synchronous operation. All lanes in the warp must execute the same MMA instruction in convergence.
+    * In user code, ``cute.gemm(...)`` should be issued as warp-uniform code.
+
+    .. code-block:: python
+
+        cute.gemm(mma_atom, d, a, b, c)
     """
 
     descriptive_name = "warp-level MXF4NVF4 MMA Operation"
@@ -596,7 +696,9 @@ class MmaMXF8F6F4Op(MmaOp):
 
     admissible_archs = [
         Arch.sm_120a,
+        Arch.sm_120f,
         Arch.sm_121a,
+        Arch.sm_121f,
     ]
 
     def __post_init__(self) -> None:
diff --git a/python/CuTeDSL/cutlass/cute/nvgpu/warpgroup/mma.py b/python/CuTeDSL/cutlass/cute/nvgpu/warpgroup/mma.py
index 6c32191bc..db8b7f9af 100644
--- a/python/CuTeDSL/cutlass/cute/nvgpu/warpgroup/mma.py
+++ b/python/CuTeDSL/cutlass/cute/nvgpu/warpgroup/mma.py
@@ -18,9 +18,9 @@ from cutlass.base_dsl.arch import Arch
 from cutlass.cutlass_dsl import BaseDSL, T, DSLRuntimeError
 from typing_extensions import deprecated
 
+from cutlass._mlir import ir
 import cutlass._mlir.dialects.cute as _cute_ir
 import cutlass._mlir.dialects.cute_nvgpu as _cute_nvgpu_ir
-from cutlass._mlir import ir
 
 from ..common import OpError, normalize_field_to_ir_name
 from ..common import OperandMajorMode as _OperandMajorMode
@@ -37,6 +37,7 @@ from ...typing import (
     Int32,
     Int8,
     Uint8,
+    Integer,
     Numeric,
     AddressSpace,
 )
@@ -193,6 +194,7 @@ class MmaOp(WarpGroupMmaOp):
             object.__setattr__(
                 self, "b_major_mode", _OperandMajorMode(self.b_major_mode.value)
             )
+
         # Verify instruction shape
         shape_mnk_tuple: Any = cast(Any, self.shape_mnk)
         if (rank(shape_mnk_tuple) not in [2, 3]) or (depth(shape_mnk_tuple) != 1):
@@ -274,26 +276,18 @@ class MmaTraits(Trait):
     def set(
         self,
         field: Any,
-        value: Any,
+        field_value: Any,
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
         field_ir_name = self._normalize_field_name(field)
-        # Prefer the newer builder that accepts a logical field name, but keep
-        # a fallback for legacy attribute-based construction to avoid breaking changes.
-        bool_val = Boolean(value).ir_value(loc=loc, ip=ip)
-        try:
-            self.value = _cute_nvgpu_ir.atom_set_value(
-                self.value, field_ir_name, bool_val, loc=loc, ip=ip
-            )
-        except (TypeError, AttributeError):
-            # Legacy path: construct the per-arch field attribute explicitly
-            attr_asm = f"#cute_nvgpu.atom_mma_field_sm90<{field_ir_name}>"
-            attr = ir.Attribute.parse(attr_asm)
-            self.value = _cute_nvgpu_ir.atom_set_value(
-                self.value, attr, bool_val, loc=loc, ip=ip
-            )
+        bool_val = Boolean(field_value).ir_value(loc=loc, ip=ip)
+        trait_ir_val = self.value
+        attr = _cute_nvgpu_ir.resolve_atom_field_attr(trait_ir_val, field_ir_name)
+        self.value = _cute_nvgpu_ir.atom_set_value(
+            trait_ir_val, attr, bool_val, loc=loc, ip=ip
+        )
 
     def get(
         self,
@@ -303,16 +297,11 @@ class MmaTraits(Trait):
         ip: Optional[ir.InsertionPoint] = None,
     ) -> Any:
         field_ir_name = self._normalize_field_name(field)
-        try:
-            return _cute_nvgpu_ir.atom_get_value(
-                Boolean.mlir_type, self.value, field_ir_name, loc=loc, ip=ip
-            )
-        except (TypeError, AttributeError):
-            attr_asm = f"#cute_nvgpu.atom_mma_field_sm90<{field_ir_name}>"
-            attr = ir.Attribute.parse(attr_asm)
-            return _cute_nvgpu_ir.atom_get_value(
-                Boolean.mlir_type, self.value, attr, loc=loc, ip=ip
-            )
+        trait_ir_val = self.value
+        attr = _cute_nvgpu_ir.resolve_atom_field_attr(trait_ir_val, field_ir_name)
+        return _cute_nvgpu_ir.atom_get_value(
+            Boolean.mlir_type, trait_ir_val, attr, loc=loc, ip=ip
+        )
 
 
 @dataclass(frozen=True)
@@ -322,6 +311,44 @@ class MmaF16BF16Op(MmaOp):
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-mma>`__.
     This Operation covers the instructions using the ``.f16`` or ``.bf16`` qualifiers for the input operands.
+
+    **Supported data type combinations:**
+
+    +-------------+-------------+----------+-------+
+    | A Data Type | B Data Type | Acc Type | Mma-K |
+    +=============+=============+==========+=======+
+    | F16         | F16         | F16, F32 | 16    |
+    +-------------+-------------+----------+-------+
+    | BF16        | BF16        | F32      | 16    |
+    +-------------+-------------+----------+-------+
+
+    **Supported architectures:** sm_90a
+
+    **Constraints:**
+
+    * Mma-M = 64
+    * 8 <= Mma-N <= 256, step 8
+    * A and B support both K-major and MN-major (transpose) when A is in shared memory (descriptor).
+      When A is in registers, only B can be transposed.
+
+    **Execution Model:**
+
+    * WGMMA is asynchronous and collective at warpgroup scope (4 contiguous warps).
+      In user code, ``cute.gemm(...)`` should be issued warpgroup-uniformly.
+    * Before issuing ``cute.gemm(...)``, call ``cute.nvgpu.warpgroup.fence()`` to order
+      prior register writes to accumulator/A fragments with subsequent WGMMA reads.
+    * After issuing ``cute.gemm(...)``, call ``cute.nvgpu.warpgroup.commit_group()``.
+      Use ``cute.nvgpu.warpgroup.wait_group(N)`` before consuming or reusing accumulator
+      values from pending WGMMA groups.
+
+    .. code-block:: python
+
+        cute.nvgpu.warpgroup.fence()
+        cute.gemm(tiled_mma, acc, tCrA[tile_crd], tCrB[tile_crd], acc)
+        cute.nvgpu.warpgroup.commit_group()
+        cute.nvgpu.warpgroup.wait_group(1)
+        # ... pipeline continues ...
+        cute.nvgpu.warpgroup.wait_group(0)
     """
 
     descriptive_name = "warpgroup F16/BF16 MMA Operation"
@@ -411,6 +438,42 @@ class MmaF8Op(MmaOp):
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-mma>`__.
     This Operation covers the instructions using the ``.e4m3`` or ``.e5m2`` qualifiers for the input operands.
+
+    **Supported data type combinations:**
+
+    +-------------+-------------+----------+-------+
+    | A Data Type | B Data Type | Acc Type | Mma-K |
+    +=============+=============+==========+=======+
+    | E4M3, E5M2  | E4M3, E5M2  | F16, F32 | 32    |
+    +-------------+-------------+----------+-------+
+
+    **Supported architectures:** sm_90a
+
+    **Constraints:**
+
+    * Mma-M = 64
+    * 8 <= Mma-N <= 256, step 8
+    * A and B data types are independent (mixed FP8 allowed)
+    * Transpose (MN-major) is not supported for A or B. Both operands must be K-major.
+
+    **Execution Model:**
+
+    * WGMMA is asynchronous and collective at warpgroup scope (4 contiguous warps).
+      In user code, ``cute.gemm(...)`` should be issued warpgroup-uniformly.
+    * Before issuing ``cute.gemm(...)``, call ``cute.nvgpu.warpgroup.fence()`` to order
+      prior register writes to accumulator/A fragments with subsequent WGMMA reads.
+    * After issuing ``cute.gemm(...)``, call ``cute.nvgpu.warpgroup.commit_group()``.
+      Use ``cute.nvgpu.warpgroup.wait_group(N)`` before consuming or reusing accumulator
+      values from pending WGMMA groups.
+
+    .. code-block:: python
+
+        cute.nvgpu.warpgroup.fence()
+        cute.gemm(tiled_mma, acc, tCrA[tile_crd], tCrB[tile_crd], acc)
+        cute.nvgpu.warpgroup.commit_group()
+        cute.nvgpu.warpgroup.wait_group(1)
+        # ... pipeline continues ...
+        cute.nvgpu.warpgroup.wait_group(0)
     """
 
     descriptive_name = "warpgroup F8 MMA Operation"
@@ -500,6 +563,42 @@ class MmaI8Op(MmaOp):
 
     See the `PTX documentation <https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions-wgmma-mma>`__.
     This Operation covers the instructions using the ``.s8`` or ``.u8`` qualifiers for the input operands.
+
+    **Supported data type combinations:**
+
+    +-------------+-------------+----------+-------+
+    | A Data Type | B Data Type | Acc Type | Mma-K |
+    +=============+=============+==========+=======+
+    | Int8, Uint8 | Int8, Uint8 | Int32    | 32    |
+    +-------------+-------------+----------+-------+
+
+    **Supported architectures:** sm_90a
+
+    **Constraints:**
+
+    * Mma-M = 64
+    * Mma-N in {8, 24} or Mma-N % 16 == 0, with 8 <= Mma-N <= 256
+    * A and B signedness are independent (mixed signed/unsigned allowed)
+    * Transpose (MN-major) is not supported for A or B. Both operands must be K-major.
+
+    **Execution Model:**
+
+    * WGMMA is asynchronous and collective at warpgroup scope (4 contiguous warps).
+      In user code, ``cute.gemm(...)`` should be issued warpgroup-uniformly.
+    * Before issuing ``cute.gemm(...)``, call ``cute.nvgpu.warpgroup.fence()`` to order
+      prior register writes to accumulator/A fragments with subsequent WGMMA reads.
+    * After issuing ``cute.gemm(...)``, call ``cute.nvgpu.warpgroup.commit_group()``.
+      Use ``cute.nvgpu.warpgroup.wait_group(N)`` before consuming or reusing accumulator
+      values from pending WGMMA groups.
+
+    .. code-block:: python
+
+        cute.nvgpu.warpgroup.fence()
+        cute.gemm(tiled_mma, acc, tCrA[tile_crd], tCrB[tile_crd], acc)
+        cute.nvgpu.warpgroup.commit_group()
+        cute.nvgpu.warpgroup.wait_group(1)
+        # ... pipeline continues ...
+        cute.nvgpu.warpgroup.wait_group(0)
     """
 
     descriptive_name = "warpgroup I8 MMA Operation"
@@ -573,12 +672,14 @@ class MmaI8Op(MmaOp):
         **kwargs: Any,
     ) -> "MmaI8Trait":
         shape_mnk = _pack_shape(self.shape_mnk, loc=loc, ip=ip)
+        # MmaI8 only operates on integer dtypes.
+        assert issubclass(self.a_dtype, Integer) and issubclass(self.b_dtype, Integer)
         ty = _cute_nvgpu_ir.MmaAtomSM90Type.get(
             shape_mnk.type.attribute,
             self.a_major_mode._to_ir(),
             self.b_major_mode._to_ir(),
-            (T.si8() if self.a_dtype.signed else T.ui8()),  # type: ignore[attr-defined]
-            (T.si8() if self.b_dtype.signed else T.ui8()),  # type: ignore[attr-defined]
+            (T.si8() if self.a_dtype.signed else T.ui8()),
+            (T.si8() if self.b_dtype.signed else T.ui8()),
             self.acc_dtype.mlir_type,
             self.a_src._to_ir(),
         )
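Reviewer note on the warpgroup MMA docstrings in this file: the documented Mma-M and Mma-N constraints can be restated as a small host-side check. This is a minimal sketch in plain Python, not CuTe DSL code. The function name `wgmma_shape_ok` and the `kind` strings are hypothetical, introduced only for illustration, and the rules are copied from the docstrings rather than from the library's own validation.

```python
def wgmma_shape_ok(m: int, n: int, kind: str) -> bool:
    """Return True if (m, n) satisfies the documented WGMMA shape constraints.

    `kind` is a hypothetical tag for the operand family: "f16", "f8" or "i8".
    """
    if kind not in ("f16", "f8", "i8"):
        raise ValueError(f"unknown kind: {kind!r}")
    # All three docstrings require Mma-M = 64 and 8 <= Mma-N <= 256.
    if m != 64 or not 8 <= n <= 256:
        return False
    if kind == "i8":
        # Documented rule for I8: Mma-N in {8, 24} or Mma-N % 16 == 0.
        return n in (8, 24) or n % 16 == 0
    # Documented rule for F16/BF16 and F8: Mma-N advances in steps of 8.
    return n % 8 == 0
```

For instance, under these rules N = 40 is accepted for the F16/BF16 and F8 operations but rejected for the I8 operation, while N = 24 is accepted for all three.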
diff --git a/python/CuTeDSL/cutlass/cute/runtime.py b/python/CuTeDSL/cutlass/cute/runtime.py
index 436547ec4..56454582e 100644
--- a/python/CuTeDSL/cutlass/cute/runtime.py
+++ b/python/CuTeDSL/cutlass/cute/runtime.py
@@ -594,28 +594,41 @@ def make_fake_compact_tensor(
     use_32bit_stride: bool = False,
 ) -> _FakeTensor:
     """
-    Create a fake tensor with the specified shape, element type, and a compact memory layout.
+    Create a fake tensor descriptor with a compact layout derived from shape.
+
+    This is the usual builder for ``cute.compile(...)`` when the logical
+    tensor is compact and you want the runtime stride tuple to be derived
+    automatically from ``shape`` and ``stride_order``.  Each entry in
+    ``shape`` may be a static Python ``int`` or a dynamic
+    :class:`~cutlass.cute.typing.SymInt`.  Dynamic dimensions become
+    runtime-bound scalar parameters on the compiled callable.
 
     :param dtype: Data type of the tensor elements.
     :type dtype: Type[Numeric]
-    :param shape: Shape of the tensor, consisting of static (int) or dynamic (SymInt) dimensions.
+    :param shape: Tensor extents in elements, one per mode. Each entry may be
+        static (``int``) or dynamic (``SymInt``).
     :type shape: tuple[int | SymInt, ...]
-    :param stride_order: Order in which strides (memory layout) are assigned to the tensor dimensions.
-        If None, the default layout is left-to-right order (known as column-major order for flatten layout).
-        Otherwise, it should be a permutation order of the dimension indices.
-        The mode with stride_order 0 is the fastest changing (leading) dimension, and N-1 is the slowest changing.
+    :param stride_order: Permutation of ``0..len(shape)-1`` giving each mode's rank in
+        the stride ordering. The mode assigned ``0`` is the fastest-changing
+        (stride-1) mode and the mode assigned ``len(shape)-1`` the slowest. If
+        omitted, the default is left-to-right order ``(0, 1, ..., n-1)``.
     :type stride_order: tuple[int, ...], optional
     :param memspace: Memory space where the fake tensor resides. Defaults to AddressSpace.gmem.
     :type memspace: AddressSpace, optional
-    :param assumed_align: Assumed byte alignment for the tensor data. If None, the default alignment is the dtype width, & at least 1 byte.
+    :param assumed_align: Assumed byte alignment of the base pointer. If
+        ``None``, defaults to one element width in bytes (and at least 1).
     :type assumed_align: int, optional
-    :param use_32bit_stride: Whether to use 32-bit stride for dynamic dimensions. If True and the total size of the
-        layout (cosize(layout)) fits within int32, then dynamic strides will use 32-bit integers for improved performance.
-        Only applies when dimensions are dynamic. Defaults to False.
+    :param use_32bit_stride: Use 32-bit symbolic strides instead of 64-bit
+        ones for dynamic layouts. This only affects dynamically-derived
+        stride entries and is useful when the compact layout provably fits
+        in int32.
     :type use_32bit_stride: bool, optional
     :return: An instance of a fake tensor with the given properties and compact layout.
     :rtype: _FakeTensor
 
+    Use :func:`make_fake_tensor` instead when the logical layout is
+    non-compact or when you need to spell the stride tuple explicitly.
+
     **Examples:**
 
     .. code-block:: python
@@ -633,7 +646,8 @@ def make_fake_compact_tensor(
         compiled_foo = cute.compile(foo, x)
 
         # Default stride order is left-to-right order (0, 1, ..., n-1)
-        y = make_fake_compact_tensor(cutlass.Float32, (8, 3, 2)) # y.stride == (1, 8, 24)
+        y = make_fake_compact_tensor(cutlass.Float32, (8, 3, 2))
+        assert y.stride == (1, 8, 24)
     """
 
     if stride_order is not None:
@@ -680,20 +694,61 @@ def make_fake_tensor(
     assumed_align: int | None = None,
 ) -> _FakeTensor:
     """
-    Create a fake tensor with the specified element type, shape, and stride.
+    Create a fake tensor descriptor with an explicit layout.
+
+    Use this builder for ``cute.compile(...)`` when the logical tensor
+    layout is not compact, when you already know the exact stride tuple,
+    or when you want fake-tensor layout to match an external contract
+    exactly. ``shape`` and ``stride`` are both expressed in elements, not
+    bytes.
 
     :param dtype: Data type of the tensor elements.
     :type dtype: Type[Numeric]
-    :param shape: Shape of the tensor, consisting of static (int) or dynamic (SymInt) dimensions.
+    :param shape: Tensor extents in elements, one per mode. Each entry may be
+        static (``int``) or dynamic (:class:`~cutlass.cute.typing.SymInt`).
+        Dynamic dimensions become runtime-bound scalar parameters on the
+        compiled callable.
     :type shape: tuple[int | SymInt, ...]
-    :param stride: Stride of the tensor, consisting of static (int) or dynamic (SymInt) values.
+    :param stride: Explicit stride tuple in elements. Must have the same rank
+        as ``shape``. Each entry may be static (``int``) or dynamic
+        (``SymInt``).
     :type stride: tuple[int | SymInt, ...]
     :param memspace: Memory space where the fake tensor resides. Defaults to AddressSpace.gmem.
     :type memspace: AddressSpace, optional
-    :param assumed_align: Assumed byte alignment for the tensor data. If None, the default alignment is the dtype width, & at least 1 byte.
+    :param assumed_align: Assumed byte alignment of the base pointer. If
+        ``None``, defaults to one element width in bytes (and at least 1).
     :type assumed_align: int, optional
     :return: An instance of a fake tensor with the given properties.
     :rtype: _FakeTensor
+
+    If the same runtime symbolic quantity appears in multiple positions,
+    reuse the same :class:`~cutlass.cute.typing.SymInt` object at every
+    occurrence. Different ``SymInt`` objects are treated as distinct
+    runtime parameters even if they share the same ``symbol`` string.
+
+    Use :func:`make_fake_compact_tensor` instead when the layout is
+    compact and you want the stride tuple inferred from ``shape`` and a
+    mode order.
+
+    **Examples:**
+
+    .. code-block:: python
+
+        @cute.jit
+        def foo(x: cute.Tensor):
+            ...
+
+        sym_m = cute.sym_int64(symbol="M")
+        sym_ld = cute.sym_int64(divisibility=16, symbol="LD")
+
+        # Row-major logical layout: contiguous K dimension, explicit leading dim.
+        x = make_fake_tensor(
+            cutlass.Float16,
+            shape=(sym_m, 128),
+            stride=(sym_ld, 1),
+        )
+
+        compiled_foo = cute.compile(foo, x)
     """
     return _FakeTensor(
         dtype, shape, stride=stride, memspace=memspace, assumed_align=assumed_align
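Reviewer note on the `make_fake_compact_tensor` docstring in this file: the way a compact stride tuple follows from `shape` and `stride_order` can be sketched in plain Python. This is an illustration for static shapes only, under the assumption that `stride_order[i]` is the rank of mode `i` (0 = fastest-changing), as the docstring describes. The helper `compact_strides` is a hypothetical name and is not part of the CuTe DSL API.

```python
from typing import Optional, Tuple


def compact_strides(
    shape: Tuple[int, ...],
    stride_order: Optional[Tuple[int, ...]] = None,
) -> Tuple[int, ...]:
    """Derive compact strides (in elements) from static extents.

    Hypothetical helper: stride_order[i] is the rank of mode i, with rank 0
    being the stride-1 mode. Defaults to left-to-right order (0, 1, ..., n-1).
    """
    n = len(shape)
    if stride_order is None:
        stride_order = tuple(range(n))
    if sorted(stride_order) != list(range(n)):
        raise ValueError("stride_order must be a permutation of 0..n-1")

    strides = [0] * n
    running = 1
    # Visit modes from fastest-changing (rank 0) to slowest (rank n-1).
    for mode in sorted(range(n), key=lambda i: stride_order[i]):
        strides[mode] = running
        running *= shape[mode]
    return tuple(strides)
```

With the default order, `compact_strides((8, 3, 2))` gives `(1, 8, 24)`, matching the value asserted in the docstring example. Reversing the order with `stride_order=(2, 1, 0)` makes the last mode contiguous and gives `(6, 2, 1)`.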
diff --git a/python/CuTeDSL/cutlass/cute/tensor.py b/python/CuTeDSL/cutlass/cute/tensor.py
index 0627fc870..d43dc1d4f 100644
--- a/python/CuTeDSL/cutlass/cute/tensor.py
+++ b/python/CuTeDSL/cutlass/cute/tensor.py
@@ -11,7 +11,6 @@
 
 
 from typing import Any, Callable, Optional, Union, Type, Tuple, overload, List
-from typing_extensions import deprecated
 from inspect import isclass
 import operator
 
@@ -29,6 +28,11 @@ import cutlass._mlir.dialects.cute as _cute_ir
 from cutlass._mlir.dialects.cute import ReductionOp as ReductionOp
 import cutlass._mlir.dialects.cute_nvgpu as _cute_nvgpu_ir
 from cutlass._mlir.dialects import vector, arith, llvm
+from cutlass._mlir.dialects.cute import (
+    SparseElemType as _SparseElemType,
+)
+from cutlass._mlir_helpers.arith import Vector
+
 from .typing import (
     Numeric,
     Integer,
@@ -48,8 +52,10 @@ from .typing import (
     ComposedLayout,
     Tensor,
     AddressSpace,
+    _element_precision_width,
     is_integer,
     is_int_tuple,
+    is_int_tuple_type,
     as_numeric,
 )
 
@@ -91,7 +97,6 @@ __all__ = [
     "ReductionOp",
     "make_tensor",
     "make_identity_tensor",
-    "make_fragment",
     "make_fragment_like",
     "make_rmem_tensor_like",
     "make_rmem_tensor",
@@ -319,17 +324,16 @@ class _Tensor(Tensor):
         ip: Optional[ir.InsertionPoint] = None,
     ) -> ir.Value:
         orig_dtype = data.dtype
+        elem_type = self.element_type
+        assert not is_int_tuple_type(elem_type)
         # Implicit upcast to wider type
-        if (
-            data.dtype.is_same_kind(self.element_type)
-            and self.element_type.width >= data.dtype.width  # type: ignore[union-attr]
-        ):
-            data = data.to(self.element_type, loc=loc, ip=ip)  # type: ignore[assignment]
+        if data.dtype.is_same_kind(elem_type) and elem_type.width >= data.dtype.width:
+            data = data.to(elem_type, loc=loc, ip=ip)
 
-        if data.dtype.width != self.element_type.width:  # type: ignore[union-attr]
+        if data.dtype.width != elem_type.width:
             raise ValueError(
                 f"Type mismatch, store {orig_dtype} (-> {data.dtype}) "
-                f"to Tensor with element type {self.element_type}"
+                f"to Tensor with element type {elem_type}"
             )
 
         if data.dtype is Boolean and self.element_type is Boolean:
@@ -527,6 +531,16 @@ class _Tensor(Tensor):
 
         self._check_can_load_store(vectorized=True)
 
+        # For non-rmem unmasked loads, copy to an intermediate rmem tensor so
+        # that autovec_copy vectorises the gmem/smem transfer regardless of
+        # stride order.
+        if self.memspace != AddressSpace.rmem and mask is None:
+            from .algorithm import autovec_copy  # avoid circular import
+
+            rmem = make_rmem_tensor_like(self, dtype=self.element_type, loc=loc, ip=ip)
+            autovec_copy(self, rmem, loc=loc, ip=ip)
+            return rmem.load(loc=loc, ip=ip)
+
         mask_val = None if mask is None else mask.ir_value(loc=loc, ip=ip)
         pass_thru_val = (
             None if pass_thru is None else self._cvt_to_dest(pass_thru, loc=loc, ip=ip)
@@ -585,6 +599,20 @@ class _Tensor(Tensor):
                 f"lhs and rhs must have the same shape, but got {self.shape} and {data.shape}"
             )
 
+        # For non-rmem unmasked stores, route through an intermediate rmem
+        # tensor so autovec_copy vectorises the gmem/smem write regardless of
+        # stride order.
+        element_precision_width = _element_precision_width(self.element_type)
+        is_sub_byte = self.element_type is not Boolean and element_precision_width < 8
+        if self.memspace != AddressSpace.rmem and mask is None and not is_sub_byte:
+            from .algorithm import autovec_copy  # avoid circular import
+
+            rmem = make_rmem_tensor_like(self, dtype=self.element_type, loc=loc, ip=ip)
+            for i in range(size(rmem)):
+                rmem[i] = data[i]
+            autovec_copy(rmem, self, loc=loc, ip=ip)
+            return
+
         elem_mlir_type = cutlass_arith.element_type(data.dtype.mlir_type)
         if (
             cutlass_arith.is_narrow_precision(elem_mlir_type)
@@ -895,18 +923,6 @@ def make_rmem_tensor(
     return _Tensor(tensor.value, dtype)
 
 
-@dsl_user_op
-@deprecated("`make_fragment` is deprecated, use `make_rmem_tensor` instead")
-def make_fragment(
-    layout_or_shape: Union[Layout, Shape],
-    dtype: Type[Numeric],
-    *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
-) -> Tensor:
-    return make_rmem_tensor(layout_or_shape, dtype, loc=loc, ip=ip)
-
-
 @dsl_user_op
 def make_rmem_tensor_like(
     src: Union[Layout, ComposedLayout, Tensor, "TensorSSA"],
@@ -1209,8 +1225,9 @@ def print_tensor(
         tensor = tmp
 
     if isinstance(tensor.type, _cute_ir.MemRefType):  # type: ignore[union-attr]
-        if tensor.element_type.is_integer:  # type: ignore[union-attr]
-            signed = tensor.element_type.signed  # type: ignore[union-attr]
+        elem_type = tensor.element_type
+        if issubclass(elem_type, Integer):
+            signed = elem_type.signed
         else:
             signed = False
     elif isinstance(tensor.type, _cute_ir.CoordTensorType):  # type: ignore[union-attr]
@@ -1355,7 +1372,7 @@ def _infer_broadcast_shape(*shapes: Shape) -> Shape:
     return res_shape
 
 
-class TensorSSA(cutlass_arith.ArithValue):
+class TensorSSA(Vector):
     """A class representing thread local data from CuTe Tensor in value semantic and immutable.
 
     :param value: Flatten vector as ir.Value holding logic data of SSA Tensor
@@ -1386,9 +1403,8 @@ class TensorSSA(cutlass_arith.ArithValue):
 
         :param value: A :class:`ir.Value` holding the flattened MLIR vector value of the tensor.
         :type value: :class:`ir.Value`
-        :param shape: The logical (possibly nested) shape of the tensor. If None,
-            this is inferred from ``value.type.shape``.
-        :type shape: Shape, optional
+        :param shape: The logical (possibly nested) shape of the tensor.
+        :type shape: Shape
         :param dtype: The data type of the tensor elements. If None,
             this is inferred from the MLIR element type.
         :type dtype: Type[Numeric], optional
@@ -1401,9 +1417,6 @@ class TensorSSA(cutlass_arith.ArithValue):
 
         .. note::
             - Instances are immutable and represent per-thread local SSA values using value semantics.
-            - If ``shape`` is inferred and is multi-dimensional, the provided ``value``
-              will be shape-cast to a 1D vector with the same logical product, aligning the
-              physical and logical shape representations.
             - The tensor's broadcast shape and static element type are registered; dynamic shapes are not supported.
         """
         if not isinstance(value, ir.Value):
@@ -1420,11 +1433,11 @@ class TensorSSA(cutlass_arith.ArithValue):
         if dtype is None:
             dtype = Numeric.from_mlir_type(value.type.element_type)
 
-        signed = dtype.signed if issubclass(dtype, Integer) else False
-        super().__init__(value, signed)
+        # Vector.__init__ infers flat _shape from value.type; we override
+        # with CuTe nested shape below.
+        Vector.__init__(self, value, dtype=dtype, loc=loc, ip=ip)
 
-        self._shape = shape
-        self._dtype = dtype
+        self._shape = shape  # type: ignore[assignment]  # CuTe Shape (possibly nested)
         self._layout = None
 
     @staticmethod
@@ -1480,10 +1493,15 @@ class TensorSSA(cutlass_arith.ArithValue):
         force_flatten: bool = False,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> ir.Value:
+    ) -> Vector:
         """
-        Convert the tensor to a MLIR vector value.
+        Convert the tensor to a :class:`Vector` carrying the tensor's dtype.
+
+        Returns a :class:`Vector` wrapping the underlying MLIR vector value;
+        the DSL dtype is propagated so callers can use :meth:`Vector.reduce`,
+        :meth:`Vector.to`, and element-wise arithmetic.
         """
+
         if depth(self.shape) > 1:
             if not force_flatten:
                 raise ValueError(
@@ -1493,9 +1511,11 @@ class TensorSSA(cutlass_arith.ArithValue):
         else:
             shape = self.shape  # type: ignore[assignment]
 
-        res_ty = ir.VectorType.get(list(shape), self.dtype.mlir_type)
+        shape_list = [shape] if isinstance(shape, int) else list(shape)
+        res_ty = ir.VectorType.get(shape_list, self.dtype.mlir_type)
         val = _col2row(self, shape=shape, loc=loc, ip=ip)
-        return vector.shape_cast(res_ty, val, loc=loc, ip=ip)
+        vec_ir = vector.shape_cast(res_ty, val, loc=loc, ip=ip)
+        return Vector(vec_ir, dtype=self.dtype, loc=loc, ip=ip)
 
     @property
     def dtype(self) -> Type[Numeric]:
@@ -1511,11 +1531,31 @@ class TensorSSA(cutlass_arith.ArithValue):
     def __new_from_mlir_values__(self, values: list) -> "TensorSSA":
         return TensorSSA(values[0], self.shape, self.dtype)
 
+    def _wrap_like(self, result_ir: "ir.Value") -> "TensorSSA":
+        """Preserve CuTe nested shape when the math foundation wraps a
+        per-element op's result back into a TensorSSA."""
+        return TensorSSA(result_ir, shape=self._shape, dtype=self._dtype)
+
+    @property
+    def _count(self) -> int:
+        """Total element count — flatten CuTe nested shape before multiplying.
+
+        Overrides :attr:`Vector._count`, which assumes a flat MLIR shape tuple.
+        TensorSSA carries a possibly-nested CuTe shape (e.g. ``((4, 2), 8)``),
+        so the base implementation's ``result *= dim`` produces garbage for
+        nested shapes (tuple-repetition instead of arithmetic). ``numel``
+        picks up this override automatically.
+        """
+        result = 1
+        for dim in flatten_to_tuple(self._shape):
+            result *= dim
+        return result
+
     def __str__(self) -> str:
         return f"tensor_value<{self.type} o {self.shape}>"
 
     @property
-    def shape(self) -> Shape:
+    def shape(self) -> Shape:  # type: ignore[override]
         return self._shape
 
     def _apply_op(
@@ -2288,7 +2328,7 @@ class TensorSSA(cutlass_arith.ArithValue):
         return TensorSSA(self, shape, self.dtype)
 
     @dsl_user_op
-    def __getitem__(
+    def __getitem__(  # type: ignore[override]
         self,
         crd: Coord,
         *,
@@ -2470,6 +2510,7 @@ class TensorSSA(cutlass_arith.ArithValue):
         if dtype is self._dtype:
             return self
         old_count = size(self._shape)
+        assert isclass(self._dtype) and issubclass(self._dtype, Numeric)
         new_count = old_count * self._dtype.width // dtype.width
         target_vec_ty = ir.VectorType.get([new_count], dtype.mlir_type)
         res_vec = vector.bitcast(target_vec_ty, self, loc=loc, ip=ip)
@@ -2646,7 +2687,9 @@ def full(
         raise ValueError(f"Expected fill_value be numeric type, but got {fill_value}")
 
     res_ty = T.vector(size, dtype.mlir_type)
-    res_val = vector.splat(res_ty, fill_value.ir_value(loc=loc, ip=ip), loc=loc, ip=ip)
+    res_val = vector.broadcast(
+        res_ty, fill_value.ir_value(loc=loc, ip=ip), loc=loc, ip=ip
+    )
     return TensorSSA(res_val, shape, dtype)
 
 
@@ -2905,6 +2948,8 @@ def gather(
 
     _check_can_gather_scatter(input, mode, index)
 
+    input_elem_type = input.element_type
+    assert not is_int_tuple_type(input_elem_type)
     idx_layout = make_layout(index.shape)
     src_layout = make_layout(index.shape, stride=input.stride)
 
@@ -2922,17 +2967,19 @@ def gather(
     src_layout_rest = select(src_layout, rest_modes)
 
     res_elems = [None] * size(index.shape)
-    res_vect_ty = T.vector(size(index.shape), input.element_type.mlir_type)  # type: ignore[union-attr]
+    res_vect_ty = T.vector(size(index.shape), input_elem_type.mlir_type)
 
     # Optimized path: lower to vector.gather when the tensor is col-major
     # and gathering along the left-most mode
+    input_iter = input.iterator
     if (
         mode == 0
         and is_major(mode, input.stride)
-        and not input.iterator.value.type.is_swizzled  # type: ignore[union-attr]
+        and isinstance(input_iter, Pointer)
+        and not input_iter.value.type.is_swizzled
     ):
         vect_sz = size(idx_layout_gather)
-        vect_ty = T.vector(vect_sz, input.element_type.mlir_type)  # type: ignore[union-attr]
+        vect_ty = T.vector(vect_sz, input_elem_type.mlir_type)
         idx_vect_ty = T.vector(vect_sz, index.element_type.mlir_type)
         mask_all_ones = vector.constant_mask(
             T.vector(vect_sz, T.bool()), [vect_sz], loc=loc, ip=ip
@@ -2956,7 +3003,7 @@ def gather(
                 indices=idx_vect,
                 mask=mask_all_ones,
                 pass_thru=pass_thru_poison,
-                alignment=input.iterator.alignment,  # type: ignore[union-attr]
+                alignment=input_iter.alignment,
                 loc=loc,
                 ip=ip,
             )
@@ -2975,7 +3022,7 @@ def gather(
             index_crd = idx_layout_gather(gather_crd) + idx_layout_rest(rest_crd)
             src_crd_gather = index[index_crd]
             src_crd = src_layout_gather(src_crd_gather) + src_layout_rest(rest_crd)
-            src_crd_hier = input.layout.get_hier_coord(src_crd, loc=loc, ip=ip)  # type: ignore[call-arg, union-attr]
+            src_crd_hier = input.layout.get_hier_coord(src_crd, loc=loc, ip=ip)  # type: ignore[union-attr]
             res_elems[index_crd] = input[src_crd_hier].ir_value(loc=loc, ip=ip)  # type: ignore[union-attr]
     res_vect = vector.from_elements(res_vect_ty, res_elems)
     return TensorSSA(res_vect, index.shape, input.element_type)
@@ -3029,6 +3076,8 @@ def scatter(
 
     _check_can_gather_scatter(output, mode, index, data)
 
+    output_elem_type = output.element_type
+    assert not is_int_tuple_type(output_elem_type)
     idx_layout = make_layout(index.shape)
     dst_layout = make_layout(index.shape, stride=output.stride)
 
@@ -3047,13 +3096,15 @@ def scatter(
 
     # Optimized path: lower to vector.scatter when tensor is col-major and
     # scattering along the left-most mode
+    output_iter = output.iterator
     if (
         mode == 0
         and is_major(mode, output.stride)
-        and not output.iterator.value.type.is_swizzled  # type: ignore[union-attr]
+        and isinstance(output_iter, Pointer)
+        and not output_iter.value.type.is_swizzled
     ):
         vect_sz = size(idx_layout_scatter)
-        vect_ty = T.vector(vect_sz, output.element_type.mlir_type)  # type: ignore[union-attr]
+        vect_ty = T.vector(vect_sz, output_elem_type.mlir_type)
         idx_vect_ty = T.vector(vect_sz, index.element_type.mlir_type)
         mask_all_ones = vector.constant_mask(
             T.vector(vect_sz, T.bool()), [vect_sz], loc=loc, ip=ip
@@ -3085,7 +3136,7 @@ def scatter(
                 indices=idx_vect,
                 mask=mask_all_ones,
                 value_to_store=data_vect,
-                alignment=output.iterator.alignment,  # type: ignore[union-attr]
+                alignment=output_iter.alignment,
                 loc=loc,
                 ip=ip,
             )
@@ -3097,7 +3148,7 @@ def scatter(
             index_crd = idx_layout_scatter(scatter_crd) + idx_layout_rest(rest_crd)
             dst_crd_scatter = index[index_crd]
             dst_crd = dst_layout_scatter(dst_crd_scatter) + dst_layout_rest(rest_crd)
-            dst_crd_hier = output.layout.get_hier_coord(dst_crd, loc=loc, ip=ip)  # type: ignore[call-arg, union-attr]
+            dst_crd_hier = output.layout.get_hier_coord(dst_crd, loc=loc, ip=ip)  # type: ignore[union-attr]
             output[dst_crd_hier] = data[index_crd]
 
 
@@ -3128,11 +3179,15 @@ def _check_can_gather_scatter(
         raise NotImplementedError(
             f"gather/scatter on tensor with nested layout is not supported, got: {tensor.layout} and {index.shape}"
         )
+    index_shape = index.shape
+    tensor_shape = tensor.shape
+    assert isinstance(index_shape, tuple)
+    assert isinstance(tensor_shape, tuple)
     for m in range(n_modes):
-        if m != mode and size(index.shape[m]) > size(tensor.shape[m]):  # type: ignore[index]
+        if m != mode and size(index_shape[m]) > size(tensor_shape[m]):
             raise ValueError(
                 f"index dimension {m} must be less than or equal to the corresponding source dimension,"
-                f"got: {size(index.shape[m])} and {size(tensor.shape[m])}"  # type: ignore[index]
+                f"got: {size(index_shape[m])} and {size(tensor_shape[m])}"
             )
     if data is not None and index.shape != data.shape:
         raise ValueError(
@@ -3142,7 +3197,9 @@ def _check_can_gather_scatter(
     # Check data type
     if not issubclass(index.dtype, Integer):
         raise TypeError(f"index must be integer TensorSSA, got {index.dtype}")
-    if tensor.element_type.width % 8 != 0:  # type: ignore[union-attr]
+    elem_type = tensor.element_type
+    assert not is_int_tuple_type(elem_type)
+    if elem_type.width % 8 != 0:
         raise TypeError(
             f"gather/scatter for sub-byte element type is not supported, got: {tensor.element_type}"
         )
diff --git a/python/CuTeDSL/cutlass/cute/testing.py b/python/CuTeDSL/cutlass/cute/testing.py
index f7f3dfe64..91d80abd5 100644
--- a/python/CuTeDSL/cutlass/cute/testing.py
+++ b/python/CuTeDSL/cutlass/cute/testing.py
@@ -9,22 +9,26 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-import argparse
+
 import functools
 import inspect
 import logging
 import os
-from dataclasses import dataclass
 from itertools import product
 from time import time
-from typing import Type, Union, Callable, Optional, Dict, List, Any
-
-import cuda.bindings.driver as cuda_driver
-import cuda.bindings.runtime as cuda_runtime
+from typing import Any, Callable, Dict, List, Optional, Type
 
 from cutlass.cutlass_dsl import Constexpr, CuTeDSL, T, dsl_user_op, const_expr
 
-from .typing import Numeric, Int8, Uint8, Boolean, Tensor, Layout, Shape
+from .typing import (
+    Numeric,
+    Int8,
+    Boolean,
+    Tensor,
+    Layout,
+    Shape,
+    is_int_tuple_type,
+)
 
 from . import nvgpu
 from .core import recast_layout, make_layout, composition, get, rank, size
@@ -41,121 +45,8 @@ from .algorithm import copy
 from .core import zipped_divide
 from .runtime import from_dlpack
 
-from cutlass._mlir.dialects import builtin, cf, nvvm, vector
-
-from functools import partial
 from cutlass._mlir import ir
-
-
-class CuptiProfiler:
-    """A class for managing CUPTI profiling measurements with start, stop, and duration methods.
-
-    This class provides a clean interface for measuring CUDA kernel execution times
-    using CUPTI (CUDA Profiling Tools Interface). It encapsulates the complexity
-    of buffer management, callback registration, and activity tracking.
-
-    Example usage:
-        profiler = CuptiProfiler()
-        profiler.start()
-        # ... run your CUDA kernels ...
-        profiler.stop()
-        duration = profiler.get_duration()  # Returns total duration in milliseconds
-    """
-
-    def __init__(self, buffer_size: int = 8 * 1024 * 1024) -> None:
-        """Initialize the CUPTI profiler.
-
-        Args:
-            buffer_size: Size of the CUPTI buffer in bytes (default: 8MB)
-
-        Raises:
-            ImportError: If the cupti-python package is not installed
-        """
-        try:
-            from cupti import cupti
-
-            self._cupti = cupti
-        except ModuleNotFoundError:
-            raise ModuleNotFoundError(
-                "CUPTI is not available. Install the 'cupti-python' package to use CuptiProfiler."
-            )
-        self.buffer_size = buffer_size
-        self.timings: list[tuple[str, float]] = []
-        self._is_active = False
-        self._buffer_requested_callback: Optional[Callable[..., Any]] = None
-        self._buffer_completed_callback: Optional[Callable[..., Any]] = None
-
-    def _buffer_requested(self) -> tuple[int, int]:
-        """Internal callback for CUPTI buffer requests."""
-        max_num_records = 0
-        return self.buffer_size, max_num_records
-
-    def _buffer_completed(self, activities: list[Any]) -> None:
-        """Internal callback for processing completed CUPTI activities."""
-        for activity in activities:
-            start = activity.start if hasattr(activity, "start") else "nil"
-            end = activity.end if hasattr(activity, "end") else "nil"
-            duration = end - start if start != "nil" and end != "nil" else "nil"  # type: ignore[operator]
-            name = activity.name[:100] if hasattr(activity, "name") else "unknown"
-            # Convert to milliseconds
-            if duration != "nil":
-                self.timings.append((name, duration / 1e6))  # type: ignore[operator]
-
-    def start(self) -> None:
-        """Start CUPTI profiling.
-
-        Enables CUPTI activity tracking for concurrent kernels and registers
-        the necessary callbacks for buffer management.
-
-        Raises:
-            ValueError: If CUPTI activity cannot be enabled
-        """
-        if self._is_active:
-            raise RuntimeError("CUPTI profiler is already active")
-
-        # Clear previous timings
-        self.timings = []
-
-        try:
-            self._cupti.activity_enable(self._cupti.ActivityKind.CONCURRENT_KERNEL)
-        except self._cupti.cuptiError as e:
-            raise ValueError(
-                f"\033[91mError while enabling Activity Kind {self._cupti.ActivityKind.CONCURRENT_KERNEL.name}: {e}. Please disable CUPTI if you using profilers\033[0m"
-            )
-
-        # Register callbacks
-        self._buffer_requested_callback = self._buffer_requested
-        self._buffer_completed_callback = partial(self._buffer_completed)
-
-        self._cupti.activity_register_callbacks(
-            self._buffer_requested_callback, self._buffer_completed_callback
-        )
-
-        self._is_active = True
-
-    def stop(self) -> None:
-        """Stop CUPTI profiling.
-
-        Flushes all activities, disables CUPTI tracking, and finalizes the profiler.
-        This method should be called after the kernels you want to measure have completed.
-        """
-        if not self._is_active:
-            raise RuntimeError("CUPTI profiler is not active")
-
-        # Flush all activities and cleanup
-        self._cupti.activity_flush_all(0)
-        self._cupti.activity_disable(self._cupti.ActivityKind.CONCURRENT_KERNEL)
-        self._cupti.finalize()
-
-        self._is_active = False
-
-    def get_duration(self) -> float:
-        """Get the total duration of all measured activities in milliseconds.
-
-        Returns:
-            Total duration in milliseconds. Returns 0.0 if no activities were recorded.
-        """
-        return sum(timing[1] for timing in self.timings)
+from cutlass._mlir.dialects import builtin, cf, nvvm, vector
 
 
 @dsl_user_op
@@ -250,7 +141,8 @@ class _CompileTimeAssertion(Assertion):
     def __extract_mlir_values__(self) -> list[ir.Value]:
         if self._disable:
             return []
-        return self._tensor.__extract_mlir_values__()  # type: ignore[union-attr]
+        assert self._tensor is not None
+        return self._tensor.__extract_mlir_values__()  # type: ignore[attr-defined]
 
     @dsl_user_op
     @CuTeDSL.jit
@@ -280,14 +172,14 @@ class _CompileTimeAssertion(Assertion):
             return
         if const_expr(not isinstance(idx, int)):
             raise ValueError(f"expects idx to be 'int', but got {type(idx)}")
-        if const_expr(idx >= self._num_assertions):  # type: ignore[operator]
+        if const_expr(idx >= self._num_assertions):
             raise ValueError("please increase the number of assertions!!!")
         if const_expr(self._init_value is True):
             self._tensor[idx] = pred and self._tensor[idx]  # type: ignore[index]
         else:
             self._tensor[idx] = pred  # type: ignore[index]
-        self._msgs[idx] = f"{msg}\nAt {loc}"  # type: ignore[call-overload]
-        self._used_indices.add(idx)  # type: ignore[union-attr, arg-type]
+        self._msgs[idx] = f"{msg}\nAt {loc}"
+        self._used_indices.add(idx)  # type: ignore[union-attr]
 
     def __enter__(self) -> "_CompileTimeAssertion":
         """Enter context manager."""
@@ -420,17 +312,18 @@ class RuntimeAssertion(Assertion):
     ) -> None:
         """Exit the context manager, automatically calls verify()."""
         if exc_type is None:
-            # Only verify if no exception occurred in the with block
             self.verify()
 
 
 def _maybe_recast_tensor_from_f4_f6(
     src: Tensor, tv_layout: Layout
 ) -> tuple[Tensor, Layout]:
-    if src.element_type.width == 4:  # type: ignore[union-attr]
+    elem_type = src.element_type
+    assert not is_int_tuple_type(elem_type)
+    if elem_type.width == 4:
         tv_layout = recast_layout(8, 4, tv_layout)
         src = recast_tensor(src, dtype=Int8)
-    elif src.element_type.width == 6:  # type: ignore[union-attr]
+    elif elem_type.width == 6:
         tv_layout = recast_layout(8, 6, tv_layout)
         src = recast_tensor(src, dtype=Int8)
     return src, tv_layout
@@ -583,11 +476,11 @@ def _convert(
     src_cta_tiler = [
         1,
     ] * rank(src.layout)
-    src_cta_tiler[leading_mode] = size(src_tv_layout)  # type: ignore[call-overload]  # (...,TileV,...)
+    src_cta_tiler[leading_mode] = size(src_tv_layout)  # (...,TileV,...)
     dst_cta_tiler = [
         1,
     ] * rank(dst.layout)
-    dst_cta_tiler[leading_mode] = size(dst_tv_layout)  # type: ignore[call-overload]  # (...,TileV,...)
+    dst_cta_tiler[leading_mode] = size(dst_tv_layout)  # (...,TileV,...)
 
     # Step 4. partition input and output tensor by cta tiler.
     gS = zipped_divide(src, tuple(src_cta_tiler))  # ((...,TileV,...),(...,RestV,...))
@@ -615,13 +508,18 @@ def _convert(
 # their leading dimension should be 4(fp8)/8(fp4) element align. (nvgpu.cvt_fptrunc/cvt_fpext
 # needs 32-bits aligned input/output)
 def convert(src: Tensor, dst: Tensor) -> None:
-    assert len(src.shape) == len(dst.shape), (  # type: ignore[arg-type]
+    src_shape = src.shape
+    src_stride = src.stride
+    assert isinstance(src_shape, tuple) and isinstance(src_stride, tuple)
+    dst_shape = dst.shape
+    assert isinstance(dst_shape, tuple)
+    assert len(src_shape) == len(dst_shape), (
         "Shape of src and dst tensors should be the same rank."
     )
     # find leading mode
     leading_mode = [
         idx
-        for idx, (shape, stride) in enumerate(zip(src.shape, src.stride))  # type: ignore[arg-type]
+        for idx, (shape, stride) in enumerate(zip(src_shape, src_stride))
         if shape > 1 and stride == 1  # type: ignore[operator]
     ]
     if len(leading_mode) != 1:
@@ -630,11 +528,14 @@ def convert(src: Tensor, dst: Tensor) -> None:
 
     elem_per_copy = 2
 
-    if src.element_type.width == 4 or dst.element_type.width == 4:  # type: ignore[union-attr]
+    src_elem_type = src.element_type
+    dst_elem_type = dst.element_type
+    assert not is_int_tuple_type(src_elem_type) and not is_int_tuple_type(dst_elem_type)
+    if src_elem_type.width == 4 or dst_elem_type.width == 4:
         elem_per_copy = 8
-    elif src.element_type.width == 8 or dst.element_type.width == 8:  # type: ignore[union-attr]
+    elif src_elem_type.width == 8 or dst_elem_type.width == 8:
         elem_per_copy = 4
-    elif src.element_type.width == 6 or dst.element_type.width == 6:  # type: ignore[union-attr]
+    elif src_elem_type.width == 6 or dst_elem_type.width == 6:
         elem_per_copy = 16  # 16*f6 elements per 96 bits(12 bytes)
     assert (
         src.shape[leading_mode] % elem_per_copy == 0  # type: ignore[index, call-overload]
@@ -645,622 +546,10 @@ def convert(src: Tensor, dst: Tensor) -> None:
 
 
 #########################################
-# Testing utilities
+# Tuning utilities
 #########################################
 
 
-def sample_pytest(rand_cfg: Optional[tuple[int, float]] = None) -> Callable[..., Any]:
-    """
-    Decorator to randomly sample pytest parametrized tests.
-    rand_cfg: Tuple[int, float] - (random_seed, sample_ratio)
-    Sampling is disabled when:
-    - A specific test is selected (via -k or direct test path)
-    - Not running under pytest
-    """
-    import functools
-    import os
-    import random
-    import sys
-
-    import pytest
-
-    seed, sample_ratio = rand_cfg  # type: ignore[misc]
-    random.seed(seed)
-
-    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
-        @functools.wraps(func)
-        def wrapper(*args: Any, **kwargs: Any) -> Any:
-            if rand_cfg is not None and "PYTEST_CURRENT_TEST" in os.environ:
-                # Check if test was explicitly selected like ::test_name[param1-param2-...]
-                if "-k" in sys.argv or any(".py::" in arg for arg in sys.argv):
-                    # Test was explicitly selected, don't skip
-                    return func(*args, **kwargs)
-
-                if random.uniform(0.0, 1.0) > sample_ratio:
-                    pytest.skip(f"Randomly skipped (sampling ratio: {sample_ratio})")
-            return func(*args, **kwargs)
-
-        return wrapper
-
-    return decorator
-
-
-#########################################
-# Benchmarking utilities
-#########################################
-
-
-class JitArguments:
-    """
-    A type to hold both args and kwargs for passing to a kernel while benchmarking.
-    """
-
-    def __init__(self, *args: Any, **kwargs: Any) -> None:
-        self.args = args
-        self.kwargs = kwargs
-        self.references: list[Any] = list()
-
-    def add_to_scope(self, references: Any) -> None:
-        """
-        Keeps references to external variables (e.g., Torch tensors when taking a view)
-        in the scope of the lifetime of the JitArguments object.
-        """
-        self.references.extend(references)
-
-
-def _cuda_success(
-    err: Union[tuple[Any, ...], cuda_runtime.cudaError_t, cuda_driver.CUresult],
-    message: str,
-) -> None:
-    """
-    Helper function to check CUDA API errors.
-    """
-    if isinstance(err, tuple):
-        _cuda_success(err[0], message)
-    elif isinstance(err, cuda_runtime.cudaError_t):
-        error_message = cuda_runtime.cudaGetErrorString(err)[1].decode("utf-8")
-        if err != cuda_runtime.cudaError_t.cudaSuccess:
-            raise RuntimeError(f"{message} : {error_message}")
-    elif isinstance(err, cuda_driver.CUresult):
-        if err != cuda_driver.CUresult.CUDA_SUCCESS:
-            error_message = cuda_driver.cuGetErrorString(err)[1].decode("utf-8")
-            raise RuntimeError(f"{message} : {error_message}")
-    else:
-        raise TypeError(
-            f"{err} is an unexpected type : it should be a cudaError_t or CUresult"
-        )
-
-
-def _does_kernel_use_stream(
-    kernel: Callable[..., Any], stream: cuda_driver.CUstream, *args: Any, **kwargs: Any
-) -> bool:
-    """
-    This function checks if the kernel uses the provided non-default stream.
-    It does this by capturing the stream and then checking if any kernels were launched.
-    :param kernel: The kernel to check
-    :type kernel: Callable
-    :param stream: The stream to check
-    :type stream: cuda_driver.CUstream
-    :return: True if the kernel uses the stream, False otherwise
-    :rtype: bool
-    """
-
-    assert int(stream) != int(cuda_driver.CUstream_flags.CU_STREAM_DEFAULT), (
-        "Stream must be a non-default stream"
-    )
-
-    err = cuda_runtime.cudaStreamBeginCapture(
-        stream, cuda_runtime.cudaStreamCaptureMode.cudaStreamCaptureModeThreadLocal
-    )
-    _cuda_success(err, "Error on stream capture")
-
-    try:
-        kernel(*args, **kwargs)
-    except Exception:
-        # Always end the capture even on failure to avoid zombie capture state
-        # that would poison all subsequent graph capture operations in the process.
-        try:
-            cuda_runtime.cudaStreamEndCapture(stream)
-        except Exception:
-            pass
-        raise
-
-    err, graph = cuda_runtime.cudaStreamEndCapture(stream)
-    _cuda_success(err, "Error on stream capture")
-
-    # Get number of nodes in warmup graph to check it matches what is expected
-    err, _, num_nodes = cuda_runtime.cudaGraphGetNodes(graph)
-    _cuda_success(err, "Error on querying graph")
-    return num_nodes > 0
-
-
-def benchmark(
-    callable: Callable,
-    *,
-    warmup_iterations: int = 10,
-    iterations: int = 100,
-    stream: Optional[cuda_driver.CUstream] = None,
-    kernel_arguments: Optional[JitArguments] = None,
-    workspace_generator: Optional[Callable[[], JitArguments]] = None,
-    workspace_count: int = 1,
-    use_cuda_graphs: bool = False,
-    use_cupti: bool = False,
-) -> float:
-    """Benchmarks a callable function with the specified parameters.
-
-    For example,
-    .. code-block:: python
-
-        from cutlass.cute.testing import benchmark
-
-        @cute.jit
-        def user_function(a: cute.Tensor, b: cute.Tensor, c: cute.Tensor, stream: cuda_driver.CUstream):
-            # contents of the function
-            pass
-
-        time_us = benchmark(user_function, kernel_arguments=JitArguments(a, b, c, stream)
-                            warmup_iterations=10, iterations=100
-                            stream=stream)
-
-    To prevent skewing results by repeately accessing the L2 cache, use the workspace_count and workspace_generator
-    parameters to cycle through a number of different workspaces.
-
-    .. code-block:: python
-
-        from cutlass.cute.testing import benchmark
-
-        @cute.jit
-        def user_function(a: cute.Tensor, b: cute.Tensor, c: cute.Tensor):
-            # contents of the function
-            pass
-
-        def workspace_generator():
-            # create a, b, and c
-            return JitArguments(a, b, c)
-
-        time_us = benchmark(user_function,
-                            workspace_generator=workspace_generator,
-                            workspace_count=10,
-                            warmup_iterations=10000,
-                            iterations=1000)
-
-    To benchmark you may always configure the function being profiled (callable), the warmup iterations, and
-    the number of profiling iterations.
-
-    Whenever the kernel being benchmarked runs in a non-default stream, the stream must be provided through the stream parameter.
-
-    To use CUDA graphs, the callable must be a compiled @cute.jit annotated function.
-    When using CUDA graphs, the kernel must be launched in a non-default stream.
-
-    :param callable: The function to benchmark. For jit function, it must be compiled functions.
-    :type callable: Callable
-    :param warmup_iterations: Number of warmup iterations, defaults to 10
-    :type warmup_iterations: int, optional
-    :param iterations: Number of benchmark iterations, defaults to 100
-    :type iterations: int, optional
-    :param stream: Stream kernel is launched in, defaults to CUDA stream default
-    :type stream: CUstream, None
-    :param kernel_arguments: Kernel arguments to launch callable with, defaults to None
-    :type kernel_arguments: JitArguments, None
-    :param workspace_generator: Function that returns kernel arguments, defaults to None
-    :type workspace_generator: Callable
-    :param workspace_count: Number of workspaces (arguments) to loop through, looping through enough workspaces will keep the L2 cache cold
-    :type workspace_count: int, optional
-    :param use_cuda_graphs: Whether to use cuda graphs, defaults to False
-    :type use_cuda_graphs: bool, optional
-
-    :return: The benchmark time in microseconds
-    :rtype: float
-    """
-    import cutlass.base_dsl.jit_executor  # noqa: F401
-    import cutlass.cutlass_dsl.cuda_jit_executor  # noqa: F401
-
-    if stream is None:
-        stream = cuda_driver.CUstream(cuda_driver.CUstream_flags.CU_STREAM_DEFAULT)
-
-    if workspace_count < 1:
-        raise ValueError("workspace_count must be at least 1")
-
-    _time_us = float("nan")
-    if workspace_generator == None:
-        # If no workspace generator is provided, we need a single workspace
-        if workspace_count != 1:
-            raise ValueError("Need a single workspace if not providing a generator")
-
-        # If no workspace generator is provided, we need a kernel_argument
-        if kernel_arguments == None:
-            raise ValueError(
-                "Please pass a kernel argument if not providing a generator"
-            )
-        workspace_generator = lambda: kernel_arguments
-
-    workspaces = [workspace_generator() for _ in range(workspace_count)]
-
-    for workspace in workspaces:
-        if type(workspace) != JitArguments:
-            raise TypeError(
-                "workspace_generator and/or kernel_arguments should use JitArguments type"
-            )
-
-    # use memset to flush L2 cache after workspace h2d copies
-    if workspace_count > 1:
-        from cutlass.utils import HardwareInfo
-
-        hardware_info = HardwareInfo()
-        num_l2_cache_bytes = hardware_info.get_l2_cache_size_in_bytes()
-        l2_flush_bytes = num_l2_cache_bytes * 2
-        err, cache_ptr = cuda_driver.cuMemAlloc(int(l2_flush_bytes))
-        _cuda_success(err, "Error on allocating memory")
-
-        err = cuda_driver.cuMemsetD32Async(
-            cache_ptr, 0, int(l2_flush_bytes // 4), stream
-        )
-        _cuda_success(err, "Error on memset")
-
-        err = cuda_driver.cuMemFree(cache_ptr)
-        _cuda_success(err, "Error on freeing memory")
-
-    def _loop_and_call_kernel(iterations: int, workspace_index: int = 0) -> int:
-        for _ in range(iterations):
-            current_workspace = workspaces[workspace_index]
-            callable(*current_workspace.args, **current_workspace.kwargs)
-            workspace_index = (workspace_index + 1) % workspace_count
-        return workspace_index
-
-    # Create CUDA events for timing
-    err, start_event = cuda_driver.cuEventCreate(
-        cuda_driver.CUevent_flags.CU_EVENT_DEFAULT
-    )
-    _cuda_success(err, "Error on creating event")
-    err, end_event = cuda_driver.cuEventCreate(
-        cuda_driver.CUevent_flags.CU_EVENT_DEFAULT
-    )
-    _cuda_success(err, "Error on creating event")
-
-    elapsed_time = float("nan")
-
-    # =========================================================================
-    # Helper: Measure kernel execution time using CUPTI profiler
-    # =========================================================================
-    def _measure_with_cupti(kernel_launcher: Callable[[], Any]) -> float:
-        """
-        Measure kernel execution time using NVIDIA CUPTI profiler.
-        :param kernel_launcher: Callable that launches the kernel(s) to be profiled
-        :type kernel_launcher: Callable
-        :return: Elapsed time in milliseconds
-        :rtype: float
-        """
-        if not hasattr(kernel_launcher, "__call__"):
-            raise TypeError(
-                f"kernel_launcher must be callable, got {type(kernel_launcher).__name__}"
-            )
-
-        cupti_profiler = CuptiProfiler()
-
-        cupti_profiler.start()
-        kernel_launcher()
-
-        err = cuda_runtime.cudaDeviceSynchronize()
-        _cuda_success(err, "Error on synchronizing device")
-
-        cupti_profiler.stop()
-        duration_ms = cupti_profiler.get_duration()
-        return duration_ms
-
-    def _measure_with_cuda_event(kernel_launcher: Callable[[], Any]) -> float:
-        """
-        Measure kernel execution time using CUDA events.
-        :param kernel_launcher: Callable that launches the kernel(s) to be profiled
-        :type kernel_launcher: Callable
-        :return: Elapsed time in milliseconds
-        :rtype: float
-        """
-        if not hasattr(kernel_launcher, "__call__"):
-            raise TypeError(
-                f"kernel_launcher must be callable, got {type(kernel_launcher).__name__}"
-            )
-
-        if int(stream) != int(
-            cuda_driver.CUstream_flags.CU_STREAM_DEFAULT
-        ) and not _does_kernel_use_stream(
-            callable, stream, *workspaces[0].args, **workspaces[0].kwargs
-        ):
-            raise ValueError(
-                "CUDA stream passed to benchmark does not match the stream the kernel was launched in"
-            )
-
-        err = cuda_driver.cuEventRecord(start_event, stream)
-        _cuda_success(err, "Error on recording start event")
-
-        kernel_launcher()
-
-        err = cuda_driver.cuEventRecord(end_event, stream)
-        _cuda_success(err, "Error on recording end event")
-
-        err = cuda_driver.cuEventSynchronize(end_event)
-        _cuda_success(err, "Error on synchronizing end event")
-
-        err, duration_ms = cuda_driver.cuEventElapsedTime(start_event, end_event)
-        _cuda_success(err, "Error on querying elapsed time")
-        return duration_ms
-
-    # =========================================================================
-    # Branch 1: CUDA Graphs mode - Capture and replay kernel execution
-    # =========================================================================
-    if use_cuda_graphs:
-        if hasattr(callable, "_dsl_cls"):
-            raise TypeError(
-                "Uncompiled @cute.jit function cannot be captured into a CUDA Graph. "
-                "Use cute.compile() first, or wrap compiled calls in a plain function."
-            )
-
-        # ---------------------------------------------------------------------
-        # Step 1: Capture warmup graph
-        # ---------------------------------------------------------------------
-        import gc as _gc
-
-        # Disable GC during capture to prevent __del__ methods (e.g., cudaFree)
-        # from invalidating the capture with a non-capturable CUDA call.
-        _gc.collect()
-        _gc.disable()
-        err = cuda_runtime.cudaStreamBeginCapture(
-            stream, cuda_runtime.cudaStreamCaptureMode.cudaStreamCaptureModeThreadLocal
-        )
-        _cuda_success(err, "Error on beginning warmup stream capture")
-
-        try:
-            warmup_workspace_idx = _loop_and_call_kernel(warmup_iterations)
-        except Exception:
-            _gc.enable()
-            try:
-                cuda_runtime.cudaStreamEndCapture(stream)
-            except Exception:
-                pass
-            raise
-
-        err, warmup_graph = cuda_runtime.cudaStreamEndCapture(stream)
-        _gc.enable()
-        _cuda_success(err, "Error on ending warmup stream capture")
-
-        # Validate warmup graph node count
-        # Each kernel launch should produce at least one graph node
-        err, _, warmup_node_count = cuda_runtime.cudaGraphGetNodes(warmup_graph)
-        _cuda_success(err, "Error on querying warmup graph nodes")
-        # Use >= since one host function may launch multiple kernels
-        if warmup_node_count < warmup_iterations:
-            raise ValueError(
-                "CUDA stream passed to benchmark does not match the stream the kernel was launched in"
-            )
-
-        # ---------------------------------------------------------------------
-        # Step 2: Capture profiling graph
-        # ---------------------------------------------------------------------
-        _gc.collect()
-        _gc.disable()
-        err = cuda_runtime.cudaStreamBeginCapture(
-            stream, cuda_runtime.cudaStreamCaptureMode.cudaStreamCaptureModeThreadLocal
-        )
-        _cuda_success(err, "Error on beginning profiling stream capture")
-
-        try:
-            _loop_and_call_kernel(iterations, warmup_workspace_idx)
-        except Exception:
-            _gc.enable()
-            try:
-                cuda_runtime.cudaStreamEndCapture(stream)
-            except Exception:
-                pass
-            raise
-
-        err, profiling_graph = cuda_runtime.cudaStreamEndCapture(stream)
-        _gc.enable()
-        _cuda_success(err, "Error on ending profiling stream capture")
-
-        # ---------------------------------------------------------------------
-        # Step 3: Instantiate executable graphs
-        # ---------------------------------------------------------------------
-        err, warmup_graph_exec = cuda_runtime.cudaGraphInstantiate(warmup_graph, 0)
-        _cuda_success(err, "Error on instantiating warmup graph")
-        err, profiling_graph_exec = cuda_runtime.cudaGraphInstantiate(
-            profiling_graph, 0
-        )
-        _cuda_success(err, "Error on instantiating profiling graph")
-
-        # ---------------------------------------------------------------------
-        # Step 4: Execute warmup graph (cache warming)
-        # ---------------------------------------------------------------------
-        err = cuda_runtime.cudaGraphLaunch(warmup_graph_exec, stream)
-        _cuda_success(err, "Error on launching warmup graph")
-
-        # ---------------------------------------------------------------------
-        # Step 5: Profile execution using selected profiler
-        # ---------------------------------------------------------------------
-        def launch_profiling_graph() -> None:
-            err = cuda_runtime.cudaGraphLaunch(profiling_graph_exec, stream)
-            _cuda_success(err, "Error on launching profiling graph")
-
-        if use_cupti:
-            elapsed_time = _measure_with_cupti(launch_profiling_graph)
-        else:
-            elapsed_time = _measure_with_cuda_event(launch_profiling_graph)
-
-        # ---------------------------------------------------------------------
-        # Step 6: Cleanup - Destroy graph executables
-        # ---------------------------------------------------------------------
-        err = cuda_runtime.cudaGraphExecDestroy(warmup_graph_exec)
-        _cuda_success(err, "Error on destroying warmup graph executable")
-        err = cuda_runtime.cudaGraphExecDestroy(profiling_graph_exec)
-        _cuda_success(err, "Error on destroying profiling graph executable")
-
-    # =========================================================================
-    # Branch 2: CUPTI profiler mode (without CUDA Graphs)
-    # =========================================================================
-    elif use_cupti:
-        # Warmup iterations to stabilize GPU state
-        warmup_workspace_idx = _loop_and_call_kernel(warmup_iterations)
-
-        def run_profiling_iterations() -> None:
-            _loop_and_call_kernel(iterations, warmup_workspace_idx)
-
-        elapsed_time = _measure_with_cupti(run_profiling_iterations)
-
-    # =========================================================================
-    # Branch 3: CUDA event profiler mode (default)
-    # =========================================================================
-    else:
-        # Warmup iterations to stabilize GPU state
-        warmup_workspace_idx = _loop_and_call_kernel(warmup_iterations)
-
-        def run_profiling_iterations() -> None:
-            _loop_and_call_kernel(iterations, warmup_workspace_idx)
-
-        elapsed_time = _measure_with_cuda_event(run_profiling_iterations)
-
-    # Destroy events
-    err = cuda_driver.cuEventDestroy(start_event)
-    _cuda_success(err, "Error on destroying event")
-    err = cuda_driver.cuEventDestroy(end_event)
-    _cuda_success(err, "Error on destroying event")
-
-    return elapsed_time / iterations * 1e3
-
-
-def get_workspace_count(
-    one_workspace_bytes: int, warmup_iterations: int, iterations: int
-) -> int:
-    """Calculate the number of workspaces needed to fill L2 cache.
-
-    :param one_workspace_bytes: Size of one workspace in bytes
-    :type one_workspace_bytes: int
-    :param warmup_iterations: Number of warmup iterations
-    :type warmup_iterations: int
-    :param iterations: Number of iterations
-    :type iterations: int
-    :return: Number of workspaces needed
-    :rtype: int
-    """
-    from cutlass.utils import HardwareInfo
-
-    num_l2_cache_bytes = HardwareInfo().get_l2_cache_size_in_bytes()
-    num_workspaces = (num_l2_cache_bytes * 3) // one_workspace_bytes + 1
-    num_iters = warmup_iterations + iterations
-    return num_iters if num_iters < num_workspaces else num_workspaces
-
-
-#########################################
-# Autotuning/Tuning utilities
-#########################################
-
-
-def _benchmark_for_autotune(
-    callable: Callable[..., Any],
-    *args: Any,
-    warmup_iterations: int,
-    iterations: int,
-    use_cold_l2: bool,
-    print_verbose: bool,
-    current_stream: Optional[cuda_driver.CUstream] = None,
-    **kwargs: Any,
-) -> float:
-    """Benchmarks a callable function with the specified parameters.
-
-    This function differs from the benchmark function in that it is used for autotuning. In this case we
-    do not loop through workspaces to keep the L2 cache cold. Instead we rely on writing to an L2 cache sized address to keep the L2 cache cold.
-
-    The primary reason for doing this is that we do not have information on how to generate the workspaces for the kernel when autotuning.
-    We also do not have information on how much memory the workspaces take up.
-
-    This benchmarking is done as a close approximation of the actual runtime of the kernel in an E2E system,
-    where we may have clock throttling, a warm cache, or other factors that could affect the runtime of the kernel.
-
-    :param callable: The function to benchmark
-    :type callable: Callable
-    :param args: Arguments to pass to the callable function
-    :param warmup_iterations: Number of warmup iterations, defaults to 10
-    :type warmup_iterations: int, optional
-    :param iterations: Number of benchmark iterations, defaults to 100
-    :type iterations: int, optional
-    :param use_cold_l2: Whether to clear L2 cache between runs, defaults to True
-    :type use_cold_l2: bool, optional
-    :param print_verbose: Whether to print verbose output, defaults to False
-    :type print_verbose: bool, optional
-    :param current_stream: Stream to benchmark in, defaults to CUDA stream default
-    :type current_stream: CUstream, None
-    :param kwargs: Additional keyword arguments to pass to the callable function
-
-    :return: The benchmark time in microseconds
-    :rtype: float
-    """
-    if current_stream is None:
-        current_stream = cuda_driver.CUstream(
-            cuda_driver.CUstream_flags.CU_STREAM_DEFAULT
-        )
-
-    if int(current_stream) != int(
-        cuda_driver.CUstream(cuda_driver.CUstream_flags.CU_STREAM_DEFAULT)
-    ) and not _does_kernel_use_stream(callable, current_stream, *args, **kwargs):
-        raise ValueError(f"Incorrect stream passed to kernel: {current_stream}")
-
-    if use_cold_l2:
-        from cutlass.utils import HardwareInfo
-
-        # use memset to clear L2 cache
-        hardware_info = HardwareInfo()
-        num_l2_cache_bytes = hardware_info.get_l2_cache_size_in_bytes()
-        err, cache_ptr = cuda_driver.cuMemAlloc(int(num_l2_cache_bytes))
-        _cuda_success(err, "Error on allocating memory")
-
-    # Create CUDA events for timing
-    err, start_event = cuda_driver.cuEventCreate(
-        cuda_driver.CUevent_flags.CU_EVENT_DEFAULT
-    )
-    _cuda_success(err, "Error on creating event")
-    err, end_event = cuda_driver.cuEventCreate(
-        cuda_driver.CUevent_flags.CU_EVENT_DEFAULT
-    )
-    _cuda_success(err, "Error on creating event")
-    try:
-        # warmup
-        for _ in range(warmup_iterations):
-            callable(*args, **kwargs)
-
-        _time = 0
-        execution_time_ms = []
-        for _ in range(iterations):
-            if use_cold_l2:
-                # clear L2 cache by memset to zero for every run
-                err = cuda_driver.cuMemsetD32Async(
-                    cache_ptr, 0, int(num_l2_cache_bytes // 4), current_stream
-                )
-                _cuda_success(err, "Error on memset")
-            err = cuda_driver.cuEventRecord(start_event, current_stream)
-            _cuda_success(err, "Error on recording event")
-            callable(*args, **kwargs)
-            err = cuda_driver.cuEventRecord(end_event, current_stream)
-            _cuda_success(err, "Error on recording event")
-            err = cuda_driver.cuEventSynchronize(end_event)
-            _cuda_success(err, "Error on synchronizing event")
-            err, elapsed_time = cuda_driver.cuEventElapsedTime(start_event, end_event)
-            _cuda_success(err, "Error on querying event")
-            execution_time_ms.append(elapsed_time)
-        # unit: us
-        time_us = sum(execution_time_ms) * 1e3 / len(execution_time_ms)
-    except Exception as e:
-        print(f"This config execution error: {e}")
-        time_us = float("inf")
-    if print_verbose:
-        print(f"Execution time: {time_us:.4f} us")
-
-    if use_cold_l2:
-        err = cuda_driver.cuMemFree(cache_ptr)
-        _cuda_success(err, "Error on freeing memory")
-    err = cuda_driver.cuEventDestroy(start_event)
-    _cuda_success(err, "Error on destroying event")
-    err = cuda_driver.cuEventDestroy(end_event)
-    _cuda_success(err, "Error on destroying event")
-    return time_us
-
-
 class autotune_jit:
     """Auto-tuning tool supporting both dictionary and parameterized decorator styles.
     The autotune_jit class can be used as a decorator or a function.
@@ -1290,25 +579,31 @@ class autotune_jit:
     the autotuner will not recompile the kernel.
     """
 
-    logger: Optional[logging.Logger] = None
+    _logger: Optional[logging.Logger] = None
 
     @classmethod
     def _initialize_logger(cls) -> None:
         """Ensure the logger is initialized"""
-        if cls.logger is None:
-            cls.logger = logging.getLogger(__name__ + "_Autotune")
-            if not cls.logger.handlers:
+        if cls._logger is None:
+            cls._logger = logging.getLogger(__name__ + "_Autotune")
+            if not cls._logger.handlers:
                 handler = logging.StreamHandler()
                 formatter = logging.Formatter(
                     "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
                 )
                 handler.setFormatter(formatter)
-                cls.logger.addHandler(handler)
+                cls._logger.addHandler(handler)
             if (
                 os.environ.get("CUTE_DSL_LOG_AUTOTUNE") is not None
                 and os.environ.get("CUTE_DSL_LOG_AUTOTUNE") != "0"
             ):
-                cls.logger.setLevel(logging.INFO)
+                cls._logger.setLevel(logging.INFO)
+
+    @classmethod
+    def log(cls) -> logging.Logger:
+        """Return the initialized logger; asserts ``_initialize_logger`` ran first."""
+        assert cls._logger is not None, "logger is not initialized"
+        return cls._logger
 
     @classmethod
     def _create_tuning_wrapper(
@@ -1327,6 +622,7 @@ class autotune_jit:
             Decorated wrapper function
         """
         from cutlass.cute import compile
+        from cutlass.testing import _benchmark_for_autotune
 
         # Initialize autotune parameters
         if not hasattr(func, "_autotune_params"):
@@ -1350,7 +646,7 @@ class autotune_jit:
                             tuning_key.append(args[index])
                 tuning_key = tuple(tuning_key)
                 if tuning_key in func._best_kernel.keys():  # type: ignore[attr-defined]
-                    cls.logger.info(  # type: ignore[union-attr]
+                    cls.log().info(
                         f"Using cached best configuration: {func._best_config[tuning_key]}"  # type: ignore[attr-defined]
                     )
                     return func._best_kernel[tuning_key](*args, **kwargs)  # type: ignore[attr-defined]
@@ -1370,7 +666,7 @@ class autotune_jit:
                 for config_values in product(*values):
                     # Build current configuration
                     current_config = dict(zip(keys, config_values))
-                    cls.logger.info(f"Tuning configuration: {current_config}")  # type: ignore[union-attr]
+                    cls.log().info(f"Tuning configuration: {current_config}")
 
                     try:
                         # Call the original function, using current configuration to replace default parameters
@@ -1414,7 +710,7 @@ class autotune_jit:
                             **merged_kwargs,
                         )
 
-                        cls.logger.info(f"   Execution time: {cur_time} us")  # type: ignore[union-attr]
+                        cls.log().info(f"   Execution time: {cur_time} us")
 
                         # Update best results
                         if cur_time < min_time:
@@ -1423,16 +719,16 @@ class autotune_jit:
                             best_config = current_config
 
                     except NotImplementedError as e:
-                        cls.logger.info(  # type: ignore[union-attr]
+                        cls.log().info(
                             f"   Encountered unimplemented error, abort execution: {e}"
                         )
                         raise e
                     except (ValueError, TypeError) as e:
-                        cls.logger.info(f"   Configuration parameter skipping: {e}")  # type: ignore[union-attr]
+                        cls.log().info(f"   Configuration parameter skipping: {e}")
                         raise e
                         continue
                     except Exception as e:
-                        cls.logger.info(f"   Execution error skipping: {e}")  # type: ignore[union-attr]
+                        cls.log().info(f"   Execution error skipping: {e}")
                         raise e
                         continue
 
@@ -1442,10 +738,10 @@ class autotune_jit:
                 if best_kernel is None:
                     raise ValueError("No best kernel found")
 
-                cls.logger.info(  # type: ignore[union-attr]
+                cls.log().info(
                     f"Best configuration: {best_config}, execution time: {min_time} us"
                 )
-                cls.logger.info(f"Total tuning time: {tuning_time} s")  # type: ignore[union-attr]
+                cls.log().info(f"Total tuning time: {tuning_time} s")
                 func._best_kernel[tuning_key] = best_kernel  # type: ignore[attr-defined]
                 func._best_config[tuning_key] = best_config  # type: ignore[attr-defined]
                 return best_kernel(*args, **kwargs)
@@ -1512,218 +808,93 @@ class autotune_jit:
         return result_func
 
 
-def tune(
-    func: Callable[..., Callable[[], Any]],
-    params_dict: Optional[Dict[str, List[Any]]] = None,
-    kernel_arguments: JitArguments = JitArguments(),
-    warmup_iterations: int = 10,
-    iterations: int = 100,
-    stream: Optional[cuda_driver.CUstream] = None,
-) -> Dict[str, Any]:
-    """Tuning tool to suport arbitrary functions. The user must provide a function that returns a callable, which
-    takes no arguments to be tuned over.
-    Best practice is to return a jit function that is compiled with cute.compile for optimal performance.
-    For example:
-    .. code-block:: python
+################################################
+# Deprecated re-exports
+################################################
+# The symbols below have moved to ``cutlass.testing``. They are kept here as
+# ``@deprecated`` shims so existing ``cutlass.cute.testing.*`` call sites keep
+# working but emit ``DeprecationWarning``. Please migrate to
+# ``cutlass.testing.*``; these shims will be removed in a future release.
 
-        def user_function(param1=1, param2=2, param3=3) -> Callable[[], Any]:
-            # contents of the function
-            return lambda : compiled_func(param1, param2, param3)
+from typing_extensions import deprecated as _deprecated
 
-        config = tune(user_function, params_dict={'param1': [1, 2, 3], 'param2': [4, 5, 6]}, update_on_change=['param3'])
-
-    :param func: Function to be tuned, note that errors raised in the function will be ignored and the next configuration will be tried.
-    :type func: Callable[[Any], Callable[[], Any]]
-    :param params_dict: Dictionary containing parameter names and their possible values
-    :type params_dict: Dict[str, List[Any]], optional
-    :param kernel_arguments: Kernel arguments to launch callable with, defaults to JitArguments()
-    :type kernel_arguments: JitArguments, optional
-    :param warmup_iterations: Number of warmup iterations, defaults to 10
-    :type warmup_iterations: int, optional
-    :param iterations: Number of benchmark iterations, defaults to 100
-    :type iterations: int, optional
-    :param stream: Stream kernel is launched in, defaults to CUDA stream default
-    :type stream: CUstream, None
-    :return: Best configuration
-    :rtype: Dict[str, Any]
-    """
-    logger = logging.getLogger(__name__ + "_Autotune")
-    if not logger.handlers:
-        handler = logging.StreamHandler()
-        formatter = logging.Formatter(
-            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-        )
-        handler.setFormatter(formatter)
-        logger.addHandler(handler)
-    if (
-        os.environ.get("CUTE_DSL_LOG_AUTOTUNE") is not None
-        and os.environ.get("CUTE_DSL_LOG_AUTOTUNE") != "0"
-    ):
-        logger.setLevel(logging.INFO)
-
-    if stream is None:
-        stream = cuda_driver.CUstream(cuda_driver.CUstream_flags.CU_STREAM_DEFAULT)
-
-    if params_dict is None:
-        raise ValueError("params_dict must be provided")
-
-    # Get all parameter configurations
-    keys = list(params_dict.keys())
-    values = list(params_dict.values())
-
-    min_time = float("inf")
-
-    best_config = None
-    # Record start time
-    start = time()
-
-    # Iterate through all possible configuration combinations
-    for config_values in product(*values):
-        # Build current configuration
-        current_config = dict(zip(keys, config_values))
-        logger.info(f"Tuning configuration: {current_config}")
-
-        try:
-            merged_kwargs = {**kernel_arguments.kwargs, **current_config}
-
-            compiled_func = func(*kernel_arguments.args, **merged_kwargs)
-            # Benchmark the compiled function
-            cur_time = _benchmark_for_autotune(
-                compiled_func,
-                warmup_iterations=warmup_iterations,
-                iterations=iterations,
-                use_cold_l2=True,
-                print_verbose=False,
-                current_stream=stream,
-            )
-
-            logger.info(f"   Execution time: {cur_time} us")
-
-            # Update best results
-            if cur_time < min_time:
-                min_time = cur_time
-                best_config = current_config
-
-        except NotImplementedError as e:
-            logger.info(f"   Encountered unimplemented error, abort execution: {e}")
-            raise e
-        except (ValueError, TypeError, CantImplementError) as e:
-            logger.info(f"   Configuration parameter skipping: {e}")
-            continue
-        except Exception as e:
-            logger.info(f"   Execution error skipping: {e}")
-            continue
-
-    end = time()
-    tuning_time = end - start
-
-    if best_config is None:
-        raise ValueError("No best kernel found")
-
-    logger.info(f"Best configuration: {best_config}, execution time: {min_time} us")
-    logger.info(f"Total tuning time: {tuning_time} s")
-    return best_config
+from cutlass.testing import CantImplementError as _CantImplementError
+from cutlass.testing import CuptiProfiler as _CuptiProfiler
+from cutlass.testing import JitArguments as _JitArguments
+from cutlass.testing import TensorInitConfig as _TensorInitConfig
+from cutlass.testing import add_tensor_init_args as _add_tensor_init_args
+from cutlass.testing import benchmark as _benchmark
+from cutlass.testing import get_workspace_count as _get_workspace_count
+from cutlass.testing import sample_pytest as _sample_pytest
+from cutlass.testing import should_use_normal_init as _should_use_normal_init
+from cutlass.testing import (
+    tensor_init_config_from_args as _tensor_init_config_from_args,
+)
+from cutlass.testing import tune as _tune
+from cutlass.testing import validate_tensor_init_args as _validate_tensor_init_args
 
 
-class CantImplementError(Exception):
-    """Exception raised when a function is not implemented."""
-
-    def __init__(self, message: Optional[str] = None) -> None:
-        self.message = message or "The current config is invalid/unsupported"
-        super().__init__(self.message)
-
-    def __str__(self) -> str:
-        return self.message
-
-    def __repr__(self) -> str:
-        return self.message
+# --- Deprecated classes (subclass + @deprecated emits warning on instantiation) ---
 
 
-#########################################
-# Tensor initialization configuration
-#########################################
+@_deprecated(
+    "cute.testing.CantImplementError is deprecated, use cutlass.testing.CantImplementError instead"
+)
+class CantImplementError(_CantImplementError):
+    pass
 
 
-@dataclass(frozen=True)
-class TensorInitConfig:
-    """Configuration for tensor initialization policy.
-
-    When init_normal=True, tensors are initialized from a normal distribution
-    with the specified mean and std. Int8/Uint8 dtypes always use random
-    integer initialization regardless of this flag.
-    """
-
-    init_normal: bool = False
-    normal_mean: float = 0.0
-    normal_std: float = 1.0
+@_deprecated(
+    "cute.testing.CuptiProfiler is deprecated, use cutlass.testing.CuptiProfiler instead"
+)
+class CuptiProfiler(_CuptiProfiler):
+    pass
 
 
-def add_tensor_init_args(
-    parser: argparse.ArgumentParser,
-    supports_int_dtypes: bool = True,
-) -> None:
-    """Add --init_normal, --normal_mean, --normal_std arguments to a parser.
-
-    :param parser: ArgumentParser to add arguments to.
-    :param supports_int_dtypes: If True, appends Int8/Uint8 caveat to --init_normal
-        help text. Set to False for files whose ab_dtype choices do not include
-        Int8/Uint8 (e.g. grouped_gemm, dense_blockscaled_gemm_persistent).
-    """
-    init_normal_help = (
-        "Use normal distribution for tensor initialization instead of random integers."
-    )
-    if supports_int_dtypes:
-        init_normal_help += (
-            " Note: Int8/Uint8 dtypes always use random init regardless of this flag"
-        )
-    parser.add_argument(
-        "--init_normal",
-        action="store_true",
-        help=init_normal_help,
-    )
-    parser.add_argument(
-        "--normal_mean",
-        type=float,
-        default=0.0,
-        help="Mean for normal distribution initialization",
-    )
-    parser.add_argument(
-        "--normal_std",
-        type=float,
-        default=1.0,
-        help="Standard deviation for normal distribution initialization (must be >= 0)",
-    )
+@_deprecated(
+    "cute.testing.JitArguments is deprecated, use cutlass.testing.JitArguments instead"
+)
+class JitArguments(_JitArguments):
+    pass
 
 
-def validate_tensor_init_args(
-    args: argparse.Namespace,
-    parser: argparse.ArgumentParser,
-) -> None:
-    """Validate tensor init arguments after parse_args().
-
-    :param args: Parsed arguments namespace.
-    :param parser: Parser instance (used for error reporting).
-    """
-    if args.normal_std < 0:
-        parser.error("--normal_std must be non-negative")
+@_deprecated(
+    "cute.testing.TensorInitConfig is deprecated, use cutlass.testing.TensorInitConfig instead"
+)
+class TensorInitConfig(_TensorInitConfig):
+    pass
 
 
-def tensor_init_config_from_args(args: argparse.Namespace) -> TensorInitConfig:
-    """Extract a TensorInitConfig from parsed arguments."""
-    return TensorInitConfig(
-        init_normal=args.init_normal,
-        normal_mean=args.normal_mean,
-        normal_std=args.normal_std,
-    )
+# --- Deprecated free functions (decorator wraps callable; DeprecationWarning on call) ---
 
+benchmark = _deprecated(
+    "cute.testing.benchmark is deprecated, use cutlass.testing.benchmark instead"
+)(_benchmark)
 
-def should_use_normal_init(
-    config: TensorInitConfig,
-    dtype: Type[Numeric],
-) -> bool:
-    """Determine whether normal initialization should be used for the given dtype.
+get_workspace_count = _deprecated(
+    "cute.testing.get_workspace_count is deprecated, use cutlass.testing.get_workspace_count instead"
+)(_get_workspace_count)
 
-    Returns False if config.init_normal is False or if dtype is Int8/Uint8
-    (which do not support normal distribution initialization).
-    """
-    return config.init_normal and dtype not in (Int8, Uint8)
+tune = _deprecated("cute.testing.tune is deprecated, use cutlass.testing.tune instead")(
+    _tune
+)
+
+sample_pytest = _deprecated(
+    "cute.testing.sample_pytest is deprecated, use cutlass.testing.sample_pytest instead"
+)(_sample_pytest)
+
+add_tensor_init_args = _deprecated(
+    "cute.testing.add_tensor_init_args is deprecated, use cutlass.testing.add_tensor_init_args instead"
+)(_add_tensor_init_args)
+
+validate_tensor_init_args = _deprecated(
+    "cute.testing.validate_tensor_init_args is deprecated, use cutlass.testing.validate_tensor_init_args instead"
+)(_validate_tensor_init_args)
+
+tensor_init_config_from_args = _deprecated(
+    "cute.testing.tensor_init_config_from_args is deprecated, use cutlass.testing.tensor_init_config_from_args instead"
+)(_tensor_init_config_from_args)
+
+should_use_normal_init = _deprecated(
+    "cute.testing.should_use_normal_init is deprecated, use cutlass.testing.should_use_normal_init instead"
+)(_should_use_normal_init)
diff --git a/python/CuTeDSL/cutlass/cute/tuple.py b/python/CuTeDSL/cutlass/cute/tuple.py
index 545e9f93b..e19c5906a 100644
--- a/python/CuTeDSL/cutlass/cute/tuple.py
+++ b/python/CuTeDSL/cutlass/cute/tuple.py
@@ -9,12 +9,12 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
+
 from inspect import signature
 from itertools import chain
-from typing import Any, Callable, Optional, Union, Tuple, List, Iterable
+from typing import Any, Callable, Generator, Optional, Union, Tuple, List, Iterable
 
 from cutlass._mlir import ir
-
 from cutlass.cutlass_dsl import is_dynamic_expression, dsl_user_op
 import cutlass._mlir.dialects.cute as _cute_ir
 
@@ -104,7 +104,7 @@ def unflatten(
         unflatten([1, 2, 3, 4], ((0, 0), (0, 0)))  # Returns ((1, 2), (3, 4))
     """
 
-    def _make_generator() -> Any:
+    def _make_generator() -> Generator[Any, None, None]:
         for element in sequence:
             yield element
 
@@ -225,7 +225,7 @@ def product_each(
 
 def find_if(
     t: XTuple,
-    pred_fn: Callable[[XTuple, int], bool],
+    pred_fn: Callable[[XTuple, Union[int, Tuple[int, ...]]], bool],
     hierarchical: bool = True,
 ) -> Union[int, Tuple[int, ...], None]:
     from .core import rank, get
@@ -261,7 +261,9 @@ def find_if(
         find_if(stride, pred_fn=pred_fn)
     """
 
-    def _find_if_impl(curr: Any, pos: Any) -> Any:
+    def _find_if_impl(
+        curr: XTuple, pos: Union[int, Tuple[int, ...]]
+    ) -> Union[int, Tuple[int, ...], None]:
         if isinstance(curr, tuple):
             # Recursively search nested tuple
             for i in range(rank(curr)):
@@ -320,7 +322,7 @@ def find(
     if not isinstance(x, int):
         raise TypeError(f"find() requires a static x to search for, but got {x}")
 
-    def pred_fn(val: Any, pos: Any) -> bool:
+    def pred_fn(val: XTuple, pos: Union[int, Tuple[int, ...]]) -> bool:
         # Skip dynamic values which can't be compared
         return not is_dynamic_expression(val) and val == x
 
@@ -452,7 +454,7 @@ def transform_apply(
     if not args:
         raise ValueError("transform_apply requires at least one argument")
 
-    def _compatible_xtuples(args: XTuple) -> bool:
+    def _compatible_xtuples(args: Tuple[XTuple, ...]) -> bool:
         if isinstance(args[0], tuple):
             if not all(isinstance(arg, tuple) for arg in args):
                 return False
diff --git a/python/CuTeDSL/cutlass/cute/typing.py b/python/CuTeDSL/cutlass/cute/typing.py
index 59c366146..e406abbcb 100644
--- a/python/CuTeDSL/cutlass/cute/typing.py
+++ b/python/CuTeDSL/cutlass/cute/typing.py
@@ -9,11 +9,10 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-from __future__ import annotations
-
 from abc import ABC, abstractmethod
 import ctypes
 from typing import (
+    TYPE_CHECKING,
     ForwardRef,
     Tuple,
     Union,
@@ -22,10 +21,11 @@ from typing import (
     List,
     Optional,
     Literal,
-    TYPE_CHECKING,
 )
+from typing_extensions import TypeIs
 
 from cutlass.cutlass_dsl import T
+from cutlass.base_dsl import typing as _base_typing
 from cutlass.base_dsl.typing import (
     Numeric,
     NumericMeta,
@@ -58,14 +58,100 @@ from cutlass.base_dsl.typing import (
     as_numeric,
 )
 
+_element_precision_width = _base_typing._element_precision_width
+
 from cutlass._mlir import ir
 import cutlass._mlir.dialects.cute as _cute_ir
 from cutlass._mlir.dialects.cute import AddressSpace, ConstrainedIntType
 
+if TYPE_CHECKING:
+    from cutlass.cute.core import ScaledBasis, Swizzle
+    from cutlass.cute.tensor import TensorSSA
+else:
+    ScaledBasis = ForwardRef("ScaledBasis")
+
+
 Int = Union[int, Integer]
 
 
 class SymInt:
+    r"""A symbolic integer for runtime-bound dimensions of an AOT-compiled
+    ``@cute.jit`` function.
+
+    A ``SymInt`` stands in for a Python ``int`` at compile-tracing time —
+    its concrete value is bound only at kernel launch.  Use it when a
+    tensor shape, loop bound, or scalar argument varies launch-to-launch
+    but the compiler still needs to know enough about its structure to
+    emit aligned vector ops, strength-reduce ``%``/``//``, or skip
+    tail-loop prologues.
+
+    The preferred way to construct a ``SymInt`` is via the
+    :func:`sym_int32` / :func:`sym_int64` convenience constructors:
+
+    .. code-block:: python
+
+        # In compile():
+        sym_n = cute.sym_int64(divisibility=16)         # "N is a multiple of 16"
+
+        # In the @cute.jit signature, declare the matching arg type:
+        @cute.jit
+        def _host(..., num_k_tiles: Int64):
+            ...
+
+    :param width: Bit width of the integer at runtime — ``32`` or ``64``.
+        Use 64 for tensor shape dims (M, N, K, batch) that may exceed 2 G;
+        32 for small counts.
+    :param divisibility: **Hard contract** on the runtime value — the
+        compiler is free to assume the value is always a multiple of
+        ``divisibility`` and may emit aligned vector stores,
+        strength-reduce ``%`` / ``//`` against the divisor, and drop
+        tail-loop prologues based on it.  **Violating the contract at
+        runtime is undefined behaviour** and typically surfaces as
+        ``cudaErrorMisalignedAddress`` (a sticky CUDA error that poisons
+        the worker's CUDA context) or silently wrong results.  The
+        TVM-FFI runtime adapter does not validate divisibility (the JAX
+        adapter does, see ``cutlass/jax/primitive.py``); kernels that
+        want a friendly Python-side error on the TVM-FFI path should
+        ``raise ValueError(...)`` in their ``run()`` wrapper *in
+        addition to* the SymInt declaration.  The default ``1`` means
+        no constraint.
+    :param symbol: Human-readable name for the variable, e.g. ``"M"``.
+        Appears in IR dumps and compile-time error messages.
+
+    **Common patterns**
+
+    Pick the divisibility that matches the kernel's actual contract on
+    each dim — declare neither more nor less than what the codegen
+    relies on:
+
+    .. code-block:: python
+
+        # Contiguous output dim — declare the alignment the epilogue's
+        # vector store depends on (32 B / sizeof(dtype) elements):
+        sym_n = cute.sym_int64(divisibility=32 // (out_dtype.width // 8))
+
+        # K dim — must be a multiple of the K-tile so the K-loop is exact:
+        sym_k = cute.sym_int64(divisibility=mma_tiler_mnk[2])
+
+        # M dim — no compile-time constraint when the kernel masks
+        # ``if row < M`` at runtime:
+        sym_m = cute.sym_int64()
+
+    .. note::
+
+        ``divisibility`` is the kernel's hard contract, not a hint.
+        Don't *over-promise*: declaring ``divisibility=128`` when the
+        kernel only needs 16-byte alignment will reject more shapes
+        than necessary.  Don't *under-promise*: declaring
+        ``divisibility=1`` when the epilogue assumes 32 B alignment
+        will fault or silently produce wrong results on non-conforming inputs.
+
+    .. seealso::
+
+        :func:`sym_int32`, :func:`sym_int64` — convenience constructors.
+
+    """
+
     def __init__(
         self,
         width: Literal[32, 64] = 32,
@@ -117,7 +203,7 @@ class SymInt:
             ]
         )
 
-    def __mod__(self, other: int | SymInt) -> SymInt | int:
+    def __mod__(self, other: "int | SymInt") -> "SymInt | int":
         if isinstance(other, int):
             other_div, result_width = other, self._width
         elif isinstance(other, SymInt):
@@ -141,7 +227,7 @@ class SymInt:
             return other % self._divisibility
         return NotImplemented
 
-    def __mul__(self, other: int | SymInt) -> SymInt:
+    def __mul__(self, other: "int | SymInt") -> "SymInt":
         if isinstance(other, int):
             return SymInt(self._width, divisibility=self._divisibility * other)
         elif isinstance(other, SymInt):
@@ -152,7 +238,7 @@ class SymInt:
         else:
             return NotImplemented
 
-    def __rmul__(self, other: int | SymInt) -> SymInt:
+    def __rmul__(self, other: "int | SymInt") -> "SymInt":
         return self.__mul__(other)
 
     def __c_pointers__(self) -> List[int | None]:
@@ -164,7 +250,7 @@ class SymInt:
         )
         return [res_ty]
 
-    def __new_from_mlir_values__(self, values: List[ir.Value]) -> SymInt:
+    def __new_from_mlir_values__(self, values: List[ir.Value]) -> "SymInt":
         from .core import IntValue
 
         if self.width == 32:
@@ -178,23 +264,69 @@ class SymInt:
 def sym_int(
     width: Literal[32, 64] = 32, *, divisibility: int = 1, symbol: str | None = None
 ) -> SymInt:
+    r"""Construct a :class:`SymInt` of the given width.
+
+    :param width: Bit width — ``32`` or ``64``.
+    :param divisibility: Hard divisibility contract on the runtime value;
+        see :class:`SymInt`.
+    :param symbol: Optional human-readable name for IR dumps.
+    :returns: A :class:`SymInt` instance.
+
+    .. seealso::
+
+        :class:`SymInt`, :func:`sym_int32`, :func:`sym_int64`.
+    """
     return SymInt(width, divisibility=divisibility, symbol=symbol)
 
 
 def sym_int32(divisibility: int = 1, symbol: str | None = None) -> SymInt:
+    r"""Construct a 32-bit :class:`SymInt`.
+
+    :param divisibility: Hard divisibility contract on the runtime value;
+        see :class:`SymInt`.
+    :param symbol: Optional human-readable name for IR dumps.
+    :returns: A 32-bit :class:`SymInt` instance.
+
+    .. seealso::
+
+        :class:`SymInt` for the full divisibility contract.
+    """
     return sym_int(32, divisibility=divisibility, symbol=symbol)
 
 
 def sym_int64(divisibility: int = 1, symbol: str | None = None) -> SymInt:
+    r"""Construct a 64-bit :class:`SymInt`.
+
+    Standard choice for tensor shape dimensions (M, N, K, batch).
+
+    :param divisibility: Hard divisibility contract on the runtime value;
+        see :class:`SymInt`.
+    :param symbol: Optional human-readable name for IR dumps.
+    :returns: A 64-bit :class:`SymInt` instance.
+
+    **Common usage**
+
+    .. code-block:: python
+
+        # Contiguous output dim — declare the alignment the epilogue's
+        # vector store depends on (32 B / sizeof(dtype) elements):
+        sym_n = cute.sym_int64(divisibility=32 // (out_dtype.width // 8))
+
+        # K dim — must be a multiple of the K-tile so the K-loop is exact:
+        sym_k = cute.sym_int64(divisibility=mma_tiler_mnk[2])
+
+        # M dim — no compile-time constraint when the kernel masks
+        # ``if row < M`` at runtime:
+        sym_m = cute.sym_int64()
+
+    .. seealso::
+
+        :class:`SymInt` for the full divisibility contract.
+
+    """
     return sym_int(64, divisibility=divisibility, symbol=symbol)
 
 
-if TYPE_CHECKING:
-    from cutlass.cute.core import ScaledBasis, Swizzle
-    from cutlass.cute.tensor import TensorSSA
-else:
-    ScaledBasis = ForwardRef("ScaledBasis")
-
 IntTuple = Union[Int, Tuple["IntTuple", ...]]
 Shape = Union[Int, Tuple["Shape", ...]]
 Stride = Union[Int, ScaledBasis, Tuple["Stride", ...]]
@@ -205,15 +337,20 @@ class Layout(ir.Value):
     def __init__(self, op_result: ir.Value) -> None:
         super().__init__(op_result)
 
-    def __str__(self) -> str:
-        return super().__str__()  # pragma: no cover
-
-    def get_hier_coord(self, idx: Int) -> Coord:
+    @abstractmethod
+    def get_hier_coord(
+        self,
+        idx: Int,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+    ) -> Coord:
         """Return the (hierarchical) ND logical coordinate corresponding to the linear index"""
         ...
 
     @property
-    def shape(  # type: ignore[empty-body]
+    @abstractmethod
+    def shape(
         self,
         *,
         loc: Optional[ir.Location] = None,
@@ -223,7 +360,8 @@ class Layout(ir.Value):
         ...
 
     @property
-    def stride(  # type: ignore[empty-body]
+    @abstractmethod
+    def stride(
         self,
         *,
         loc: Optional[ir.Location] = None,
@@ -359,7 +497,7 @@ class Pointer(ABC):
     @property
     def type(self) -> ir.Type:
         """The MLIR type of this pointer. Implemented by subclasses."""
-        ...
+        raise NotImplementedError
 
     @property
     def value_type(self) -> Type[Numeric]:
@@ -376,9 +514,14 @@ class Pointer(ABC):
         ...
 
     @property
-    def max_alignment(self) -> int:  # type: ignore[empty-body]
+    def max_alignment(self) -> int:
         """Maximum alignment of this pointer in bytes. Implemented by subclasses."""
-        ...
+        raise NotImplementedError
+
+    @property
+    def alignment(self) -> int:
+        """Alignment of this pointer in bytes (post-swizzle). Implemented by subclasses."""
+        raise NotImplementedError
 
     @property
     def llvm_ptr(
@@ -388,16 +531,25 @@ class Pointer(ABC):
         ip: Optional[ir.InsertionPoint] = None,
     ) -> ir.Value:
         """Get the LLVM pointer representation. Implemented by subclasses."""
-        ...
+        raise NotImplementedError
 
-    def toint(  # type: ignore[empty-body]
+    def to_llvm_ptr(
+        self,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+    ) -> ir.Value:
+        """Get the LLVM pointer representation (loc/ip propagated). Implemented by subclasses."""
+        raise NotImplementedError
+
+    def toint(
         self,
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> Numeric:
         """Convert pointer to integer. Implemented by subclasses."""
-        ...
+        raise NotImplementedError
 
     def align(self, min_align: int) -> "Pointer":  # type: ignore[empty-body]
         """Implemented by subclasses."""
@@ -615,12 +767,14 @@ class Tensor(ABC):
     def element_type(self) -> Union[Type[Numeric], Type[IntTuple]]: ...
 
     @element_type.setter
-    def element_type(self, new_type: Union[Type[Numeric], Type[IntTuple]]) -> None: ...
+    def element_type(self, new_type: Union[Type[Numeric], Type[IntTuple]]) -> None:
+        """Implemented by subclasses."""
+        raise NotImplementedError
 
     @property
-    def dtype(self) -> Type[Numeric]:  # type: ignore[empty-body]
+    def dtype(self) -> Type[Numeric]:
         """The element data type. Implemented by subclasses."""
-        ...
+        raise NotImplementedError
 
     @property
     @abstractmethod
@@ -647,9 +801,9 @@ class Tensor(ABC):
         ...
 
     @property
-    def layout(self) -> Union[Layout, "ComposedLayout"]:  # type: ignore[empty-body]
+    def layout(self) -> Union[Layout, "ComposedLayout"]:
         """Implemented by subclasses."""
-        ...
+        raise NotImplementedError
 
     @property
     @abstractmethod
@@ -659,14 +813,14 @@ class Tensor(ABC):
     @abstractmethod
     def stride(self) -> Stride: ...
 
-    def load(  # type: ignore[empty-body]
+    def load(
         self,
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> "TensorSSA":
         """Implemented by subclasses."""
-        ...
+        raise NotImplementedError
 
     def store(
         self,
@@ -674,22 +828,22 @@ class Tensor(ABC):
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> None: ...
-
-    def mark_layout_dynamic(  # type: ignore[empty-body]
-        self, leading_dim: Optional[int] = None
-    ) -> "Tensor":
+    ) -> None:
         """Implemented by subclasses."""
-        ...
+        raise NotImplementedError
 
-    def mark_compact_shape_dynamic(  # type: ignore[empty-body]
+    def mark_layout_dynamic(self, leading_dim: Optional[int] = None) -> "Tensor":
+        """Implemented by subclasses."""
+        raise NotImplementedError
+
+    def mark_compact_shape_dynamic(
         self,
         mode: int,
         stride_order: Optional[tuple[int, ...]] = None,
         divisibility: int = 1,
     ) -> "Tensor":
         """Implemented by subclasses."""
-        ...
+        raise NotImplementedError
 
     @abstractmethod
     def fill(self, value: Numeric) -> None: ...
@@ -710,6 +864,31 @@ def is_int_tuple(a: object) -> bool:
         return is_integer(a)
 
 
+def is_int_tuple_type(t: object) -> TypeIs[Type[IntTuple]]:
+    """Check whether a type slot equals the IntTuple typing alias object.
+
+    Why this helper exists (instead of inlining ``t is IntTuple``):
+    mypy cannot narrow on ``t is IntTuple`` directly — IntTuple is a
+    typing.Union alias object, not a class, so mypy's ``is`` / ``is not``
+    narrowing rules don't apply. Wrapping the identity check in a function
+    annotated with ``TypeIs[Type[IntTuple]]`` tells mypy how to narrow on
+    both branches:
+      * True  -> ``t`` is ``Type[IntTuple]``
+      * False -> ``t`` is the input type minus ``Type[IntTuple]``
+
+    Typical use: narrow ``Tensor.element_type``'s
+    ``Union[Type[Numeric], Type[IntTuple]]`` to ``Type[Numeric]`` via
+    ``assert not is_int_tuple_type(elem_type)``.
+
+    Requires ``typing_extensions >= 4.10.0`` (PEP 742, which introduced
+    ``TypeIs``). On Python < 3.13, ``TypeIs`` only lives in
+    ``typing_extensions``; on Python >= 3.13 it is also in stdlib
+    ``typing`` but we import from ``typing_extensions`` for back-compat.
+
+    """
+    return t is IntTuple
+
+
 __all__ = [
     "SymInt",
     "sym_int",
@@ -759,4 +938,5 @@ __all__ = [
     "as_numeric",
     "is_integer",
     "is_int_tuple",
+    "is_int_tuple_type",
 ]
diff --git a/python/CuTeDSL/cutlass/cutlass_dsl/__init__.py b/python/CuTeDSL/cutlass/cutlass_dsl/__init__.py
index 5d23e8971..ff7c89523 100644
--- a/python/CuTeDSL/cutlass/cutlass_dsl/__init__.py
+++ b/python/CuTeDSL/cutlass/cutlass_dsl/__init__.py
@@ -39,8 +39,8 @@ from ..base_dsl import *
 from ..base_dsl.arch import Arch
 from ..base_dsl.dsl import extract_mlir_values, new_from_mlir_values
 from ..base_dsl.typing import _binary_op_type_promote
-from ..base_dsl._mlir_helpers.gpu import *
-from ..base_dsl._mlir_helpers.op import dsl_user_op
+from .._mlir_helpers.gpu import *
+from .._mlir_helpers.op import dsl_user_op
 from ..base_dsl.runtime import *
 from ..base_dsl.runtime import cuda as cuda_helpers
 from ..base_dsl.compiler import (
@@ -54,6 +54,8 @@ from ..base_dsl.compiler import (
     GPUArch,
     LinkLibraries,
     EnableTVMFFI,
+    DeviceTarget,
+    RDC,
 )
 from ..base_dsl.runtime.jit_arg_adapters import *
 from ..base_dsl.native_struct import make_native_struct, native_struct
diff --git a/python/CuTeDSL/cutlass/cutlass_dsl/cuda_event_adapter.py b/python/CuTeDSL/cutlass/cutlass_dsl/cuda_event_adapter.py
new file mode 100644
index 000000000..c023f4a61
--- /dev/null
+++ b/python/CuTeDSL/cutlass/cutlass_dsl/cuda_event_adapter.py
@@ -0,0 +1,69 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# Use of this software is governed by the terms and conditions of the
+# NVIDIA End User License Agreement (EULA), available at:
+# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
+#
+# Any use, reproduction, disclosure, or distribution of this software
+# and related documentation outside the scope permitted by the EULA
+# is strictly prohibited.
+
+"""
+This module provides JIT argument adapters for CUDA event handles.
+"""
+
+import ctypes
+from typing import List
+
+import cuda.bindings.driver as cuda_driver
+import cuda.bindings.runtime as cuda_runtime
+
+# MLIR imports
+from .._mlir import ir
+from .._mlir.dialects import cuda
+
+# Local module imports
+from ..base_dsl.runtime.jit_arg_adapters import JitArgAdapterRegistry
+
+
+@JitArgAdapterRegistry.register_jit_arg_adapter(cuda_driver.CUevent)
+class CudaDriverEventAdapter:
+    """
+    Convert a CUDA event to an event representation for JIT arg generation.
+    """
+
+    def __init__(self, arg: "cuda_driver.CUevent") -> None:
+        self._arg = arg
+        self._c_pointer = self._arg.getPtr()
+
+    def __new_from_mlir_values__(self, values: List[ir.Value]) -> ir.Value:
+        assert len(values) == 1
+        return values[0]
+
+    def __c_pointers__(self) -> List[ctypes.c_void_p]:
+        return [self._c_pointer]
+
+    def __get_mlir_types__(self) -> List[ir.Type]:
+        return [cuda.EventType.get()]
+
+
+@JitArgAdapterRegistry.register_jit_arg_adapter(cuda_runtime.cudaEvent_t)
+class CudaRuntimeEventAdapter:
+    """
+    Convert a CUDA event to an event representation for JIT arg generation.
+    """
+
+    def __init__(self, arg: "cuda_runtime.cudaEvent_t") -> None:
+        self._arg = arg
+        self._c_pointer = self._arg.getPtr()
+
+    def __new_from_mlir_values__(self, values: List[ir.Value]) -> ir.Value:
+        assert len(values) == 1
+        return values[0]
+
+    def __c_pointers__(self) -> List[ctypes.c_void_p]:
+        return [self._c_pointer]
+
+    def __get_mlir_types__(self) -> List[ir.Type]:
+        return [cuda.EventType.get()]
diff --git a/python/CuTeDSL/cutlass/cutlass_dsl/cuda_jit_executor.py b/python/CuTeDSL/cutlass/cutlass_dsl/cuda_jit_executor.py
index c94ef11c7..a8c2dd774 100644
--- a/python/CuTeDSL/cutlass/cutlass_dsl/cuda_jit_executor.py
+++ b/python/CuTeDSL/cutlass/cutlass_dsl/cuda_jit_executor.py
@@ -30,6 +30,7 @@ from ..base_dsl.jit_executor import (
     ExecutionArgs,
     JitFunctionArtifacts,
 )
+from ..base_dsl.compiler import HostTarget
 from ..base_dsl.utils.logger import log
 from ..base_dsl.common import DSLRuntimeError
 from ..base_dsl.typing import Int32
@@ -74,6 +75,10 @@ class CudaDialectJitModule:
 class CudaDialectJitCompiledFunction(JitCompiledFunction):
     """Holds a compiled function and its module."""
 
+    device_header: Optional[str]
+    device_object_path: Optional[str]
+    device_ptx_path: Optional[str]
+
     def __init__(
         self,
         ir_module: ir.Module,
@@ -89,6 +94,7 @@ class CudaDialectJitCompiledFunction(JitCompiledFunction):
         dynamic_args: tuple[Any] = tuple[Any](),
         dynamic_kwargs: dict[str, Any] = dict[str, Any](),
         has_gpu_module: bool = True,
+        host_target: HostTarget | None = None,
     ) -> None:
         super().__init__(
             ir_module,
@@ -104,6 +110,7 @@ class CudaDialectJitCompiledFunction(JitCompiledFunction):
             dynamic_args,
             dynamic_kwargs,
             has_gpu_module,
+            host_target=host_target,
         )
 
         # Populated from module attributes by CuteExperimentalDSL.compile_and_cache;
@@ -280,3 +287,38 @@ class CudaDialectJitCompiledFunction(JitCompiledFunction):
                 )
 
             return JitExecutor(self.jit_module, None, self.jit_time_profiling)
+
+    @property
+    def library(self) -> "cuda_runtime.cudaLibrary_t":
+        """Loaded ``cudaLibrary_t`` for this compile's single cubin.
+
+        Triggers a lazy ``.to()`` call if the library isn't loaded yet,
+        so callers can access this before any explicit kernel launch.
+        :raises RuntimeError: when the compile produced no gpu.module
+            (host-only program) or multiple cubins (explicit selection
+            is not yet supported).
+        """
+        # Trigger lazy load if the library isn't materialized yet.  Same
+        # check ``store_to_symbol``'s eager path uses today.
+        if self.jit_module is None or (
+            isinstance(self.jit_module, CudaDialectJitModule)
+            and self.jit_module.is_unloaded()
+        ):
+            self.to()
+
+        libraries = (
+            getattr(self.jit_module, "cuda_library", []) if self.jit_module else []
+        )
+        if not libraries:
+            raise RuntimeError(
+                "compiled.library: no cudaLibrary_t — the compile produced "
+                "no gpu.module (host-only function), or used "
+                "`cute.compile[DeviceTarget]` (which doesn't register a "
+                "runtime cudaLibrary_t)."
+            )
+        if len(libraries) > 1:
+            raise RuntimeError(
+                "compiled.library: this compile produced multiple cubins; "
+                "explicit library= selection is not yet supported."
+            )
+        return libraries[0]
diff --git a/python/CuTeDSL/cutlass/cutlass_dsl/cuda_library_adapter.py b/python/CuTeDSL/cutlass/cutlass_dsl/cuda_library_adapter.py
new file mode 100644
index 000000000..90e57f4a5
--- /dev/null
+++ b/python/CuTeDSL/cutlass/cutlass_dsl/cuda_library_adapter.py
@@ -0,0 +1,57 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# Use of this software is governed by the terms and conditions of the
+# NVIDIA End User License Agreement (EULA), available at:
+# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
+#
+# Any use, reproduction, disclosure, or distribution of this software
+# and related documentation outside the scope permitted by the EULA
+# is strictly prohibited.
+
+"""Pass a CUDA library handle (``cudaLibrary_t``) into a ``@cute.jit`` body
+as a typed MLIR ``!cuda.library`` argument.
+
+Without this adapter ``cute.compile`` rejects the library handle because
+it doesn't know how to marshal it through the ABI or what MLIR type to
+use for the corresponding block argument.  The adapter mirrors the
+existing ``CudaDriverStreamAdapter`` (``cuda.stream``) one-for-one: it
+exposes ``getPtr()`` to the C ABI and ``!cuda.library`` to MLIR.
+"""
+
+import ctypes
+from typing import List
+
+import cuda.bindings.runtime as cuda_runtime
+
+# MLIR imports
+from .._mlir import ir
+from .._mlir.dialects import cuda
+
+# Local module imports
+from ..base_dsl.runtime.jit_arg_adapters import JitArgAdapterRegistry
+
+
+@JitArgAdapterRegistry.register_jit_arg_adapter(cuda_runtime.cudaLibrary_t)
+class CudaDialectLibraryAdapter:
+    """Convert a ``cudaLibrary_t`` into a ``!cuda.library`` JIT argument.
+
+    Same shape as :class:`CudaDriverStreamAdapter`: the runtime handle
+    is a void pointer; the MLIR type is the dialect's opaque
+    ``!cuda.library``.  Lifetime is the caller's responsibility — we do
+    not own the library, ``cute.compile``/the user does.
+    """
+
+    def __init__(self, arg: "cuda_runtime.cudaLibrary_t") -> None:
+        self._arg = arg
+        self._c_pointer = self._arg.getPtr()
+
+    def __new_from_mlir_values__(self, values: List[ir.Value]) -> ir.Value:
+        assert len(values) == 1
+        return values[0]
+
+    def __c_pointers__(self) -> List[ctypes.c_void_p]:
+        return [self._c_pointer]
+
+    def __get_mlir_types__(self) -> List[ir.Type]:
+        return [cuda.LibraryType.get()]
diff --git a/python/CuTeDSL/cutlass/cutlass_dsl/cuda_stream_adapter.py b/python/CuTeDSL/cutlass/cutlass_dsl/cuda_stream_adapter.py
index d3a37068c..c6a029532 100644
--- a/python/CuTeDSL/cutlass/cutlass_dsl/cuda_stream_adapter.py
+++ b/python/CuTeDSL/cutlass/cutlass_dsl/cuda_stream_adapter.py
@@ -17,6 +17,7 @@ import ctypes
 from typing import List, Tuple
 
 import cuda.bindings.driver as cuda_driver
+import cuda.bindings.runtime as cuda_runtime
 
 # MLIR imports
 from .._mlir import ir
@@ -27,7 +28,7 @@ from ..base_dsl.runtime.jit_arg_adapters import JitArgAdapterRegistry
 
 
 @JitArgAdapterRegistry.register_jit_arg_adapter(cuda_driver.CUstream)
-class CudaDialectStreamAdapter:
+class CudaDriverStreamAdapter:
     """
     Convert a CUDA stream to a stream representation for JIT arg generation.
     """
@@ -49,3 +50,28 @@ class CudaDialectStreamAdapter:
     def __cuda_stream__(self) -> Tuple[int, int]:
         # support cuda stream protocol
         return (0, int(self._arg))
+
+
+@JitArgAdapterRegistry.register_jit_arg_adapter(cuda_runtime.cudaStream_t)
+class CudaRuntimeStreamAdapter:
+    """
+    Convert a runtime ``cudaStream_t`` to a stream representation for JIT args.
+    """
+
+    def __init__(self, arg: "cuda_runtime.cudaStream_t") -> None:
+        self._arg = arg
+        self._c_pointer = self._arg.getPtr()
+
+    def __new_from_mlir_values__(self, values: List[ir.Value]) -> ir.Value:
+        assert len(values) == 1
+        return values[0]
+
+    def __c_pointers__(self) -> List[ctypes.c_void_p]:
+        return [self._c_pointer]
+
+    def __get_mlir_types__(self) -> List[ir.Type]:
+        return [cuda.StreamType.get()]
+
+    def __cuda_stream__(self) -> Tuple[int, int]:
+        # support cuda stream protocol
+        return (0, int(self._arg))
diff --git a/python/CuTeDSL/cutlass/cutlass_dsl/cutlass.py b/python/CuTeDSL/cutlass/cutlass_dsl/cutlass.py
index 6bdfc0352..875f57d7a 100644
--- a/python/CuTeDSL/cutlass/cutlass_dsl/cutlass.py
+++ b/python/CuTeDSL/cutlass/cutlass_dsl/cutlass.py
@@ -22,6 +22,7 @@ from typing import (
     Generator,
     Optional,
     Union,
+    Annotated,
     List,
     Tuple,
     Sequence,
@@ -37,7 +38,9 @@ import pkgutil
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from dataclasses import fields
 from math import ceil
+from numbers import Integral
 from itertools import chain
+import sys
 from pathlib import Path
 import builtins
 import ctypes
@@ -84,18 +87,20 @@ from ..base_dsl.leaf_utils import is_frozen_dataclass
 from ..base_dsl.runtime.jit_arg_adapters import is_arg_annotation_constexpr
 from ..base_dsl.jit_executor import ExecutionArgs  # noqa: F401
 from ..base_dsl.runtime import cuda as cuda_helpers
-from .cuda_stream_adapter import CudaDialectStreamAdapter
+from .cuda_stream_adapter import CudaDriverStreamAdapter, CudaRuntimeStreamAdapter  # noqa: F401
+from .cuda_library_adapter import CudaDialectLibraryAdapter  # noqa: F401
+from .cuda_event_adapter import CudaDriverEventAdapter, CudaRuntimeEventAdapter  # noqa: F401
 from .cuda_jit_executor import CudaDialectJitCompiledFunction
 
 # MLIR Imports
 from cutlass._mlir import ir, execution_engine, passmanager
 from cutlass._mlir.dialects import (
     arith,
-    func,
-    gpu,
+    func,  # noqa: F401
+    gpu,  # noqa: F401
     scf,
     cute,
-    gpu as cutlass_gpu,
+    gpu as cutlass_gpu,  # noqa: F401
     cuda as cuda_dialect,
 )
 from cutlass._mlir.dialects._ods_common import (
@@ -110,25 +115,12 @@ except ImportError:
 from cutlass._mlir.extras import types as T
 
 # Helpers
-from ..base_dsl._mlir_helpers import arith as cutlass_arith
-from ..base_dsl._mlir_helpers import lru_cache_ir
-from ..base_dsl._mlir_helpers.op import dsl_user_op
-from ..base_dsl._mlir_helpers.arith import const
+from .._mlir_helpers import arith as cutlass_arith
+from .._mlir_helpers.op import dsl_user_op
+from .._mlir_helpers.arith import const
 
 from ..base_dsl.ast_helpers import (
-    loop_selector,
     executor,
-    if_selector,
-    if_executor,
-    while_selector,
-    while_executor,
-    assert_executor,
-    const_expr,
-    dynamic_expr,
-    bool_cast,
-    compare_executor,
-    range_value_check,
-    cf_symbol_check,
 )
 
 from .cutlass_ast_decorators import (
@@ -219,19 +211,13 @@ def is_cute_algebra_type(arg_spec: object) -> bool:
 
 
 def _build_kernel_attrs(config: BaseDSL.LaunchConfig) -> dict:
+    """Build launch-time CUDA function attrs that are known before compilation."""
+    ATTR_SMEM_CARVEOUT = cuda_helpers.cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT
     kernel_attrs = {}
-    if config.min_blocks_per_mp > 1:
-        assert config.smem is not None
-        kernel_attrs = {
-            cuda_helpers.cuda.CUfunction_attribute.CU_FUNC_ATTRIBUTE_PREFERRED_SHARED_MEMORY_CARVEOUT: ceil(
-                config.min_blocks_per_mp
-                * config.smem
-                / cuda_helpers.get_device_attribute(
-                    cuda_helpers.cuda.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR
-                )
-                * 100
-            )
-        }
+    if config.preferred_smem_carveout is not None:
+        assert isinstance(config.preferred_smem_carveout, int)
+        kernel_attrs = {ATTR_SMEM_CARVEOUT: config.preferred_smem_carveout}
+
     return kernel_attrs
 
 
@@ -241,6 +227,46 @@ class CutlassBaseDSL(BaseDSL):
     _ALLOWED_EXTRA_KERNEL_VALUE_ATTRS: frozenset[str] = frozenset()
     _KERNEL_ATTR_SPEC_FIELD: Optional[str] = None
 
+    @staticmethod
+    def _make_kernel_decorator(
+        target_cls: type["CutlassBaseDSL"],
+        frame: Any,
+        *dargs: Any,
+        **dkwargs: Any,
+    ) -> Any:
+        """Build a ``@kernel`` decorator that dispatches through ``target_cls``.
+
+        Centralises the ``attributes=...`` kwarg handling shared between
+        ``CuTeDSL.kernel`` (with ``is_experimental=True``) and
+        ``CuteExperimentalDSL.kernel``: when ``attributes`` is supplied,
+        the resulting decorator stamps the spec onto the function via
+        ``target_cls._KERNEL_ATTR_SPEC_FIELD`` before running the normal
+        jit wrapping logic. The caller is responsible for capturing the
+        user's source frame so source locations point to the call site
+        rather than this helper.
+        """
+        attr_spec = dkwargs.pop("attributes", None)
+        kernel_decorator = BaseDSL.jit_runner(
+            target_cls, "_kernel_helper", frame, *dargs, **dkwargs
+        )
+        if attr_spec is None:
+            return kernel_decorator
+
+        def attach_and_decorate(func: Callable[..., None]) -> Callable[..., None]:
+            if target_cls._KERNEL_ATTR_SPEC_FIELD is None:
+                raise DSLRuntimeError(
+                    f"{target_cls.__name__} does not support kernel-level 'attributes='.",
+                    suggestion=(
+                        "Only DSLs that set _KERNEL_ATTR_SPEC_FIELD support the "
+                        "'attributes=' kwarg on @kernel; remove the argument or use "
+                        "a DSL that supports it."
+                    ),
+                )
+            setattr(func, target_cls._KERNEL_ATTR_SPEC_FIELD, attr_spec)
+            return kernel_decorator(func)
+
+        return attach_and_decorate
+
     def __init__(
         self,
         name: str,
@@ -442,9 +468,84 @@ class CutlassBaseDSL(BaseDSL):
             ret["nvvm.minctasm"] = ir.Attribute.parse(
                 f"{config.min_blocks_per_mp} : i32"
             )
+        # Record the device's maximum shared memory per multiprocessor so the
+        # preferred shared memory carveout can be calculated automatically
+        if config.preferred_smem_carveout is None and config.min_blocks_per_mp > 1:
+            max_smem_per_mp = cuda_helpers.get_device_attribute(
+                cuda_helpers.cuda.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR
+            )
+            ret["smem.max_smem_per_mp"] = ir.Attribute.parse(f"{max_smem_per_mp} : i64")
+
+        # Pass the user-specified smem size so the calculation can use it
+        if config.smem is not None and isinstance(config.smem, int):
+            ret["smem.config_smem_size"] = ir.Attribute.parse(f"{config.smem} : i64")
+
+        # Allow mutually exclusive code paths to reuse shared memory
+        if config.smem_merge_branch_allocs:
+            ret["smem.merge_branch_allocs"] = ir.UnitAttr.get()
 
         return ret
 
+    @staticmethod
+    def _normalize_static_cluster_dim(dim: Any, config_name: str) -> int:
+        if is_dynamic_expression(dim):
+            raise DSLRuntimeError(
+                f"`LaunchConfig.{config_name}` contains dynamic expression, cannot materialize compiler-visible cluster metadata",
+                suggestion="Consider using `Constexpr` annotation or Python constant",
+            )
+
+        value = dim.value if isinstance(dim, Integer) else dim
+        if isinstance(value, bool) or not isinstance(value, Integral):
+            raise DSLRuntimeError(
+                f"`LaunchConfig.{config_name}` must contain integer dimensions, but got {type(value)}"
+            )
+
+        return int(value)
+
+    @classmethod
+    def _materialize_cluster_shape_attr(
+        cls, dims: Sequence[Any], config_name: str
+    ) -> ir.Attribute:
+        normalized_dims = [
+            cls._normalize_static_cluster_dim(dim, config_name) for dim in dims
+        ]
+        dims_str = ",".join(map(str, normalized_dims))
+        return ir.Attribute.parse(f'#cute.shape<"({dims_str})">')
+
+    @classmethod
+    def _get_cluster_kernel_attrs(
+        cls, config: BaseDSL.LaunchConfig
+    ) -> dict[str, ir.Attribute]:
+        if config.has_fallback_cluster and not config.has_cluster:
+            raise DSLRuntimeError(
+                "`LaunchConfig.fallback_cluster` requires `LaunchConfig.cluster` to also be set"
+            )
+
+        if config.has_fallback_cluster:
+            assert config.cluster is not None
+            assert config.fallback_cluster is not None
+            # Mirror the existing runtime launch convention for mixed cluster:
+            # LaunchConfig.cluster is the preferred shape, while
+            # LaunchConfig.fallback_cluster becomes the IR's cluster_shape attr.
+            return {
+                "preferred_cluster_shape": cls._materialize_cluster_shape_attr(
+                    config.cluster, "cluster"
+                ),
+                "cluster_shape": cls._materialize_cluster_shape_attr(
+                    config.fallback_cluster, "fallback_cluster"
+                ),
+            }
+
+        if config.has_cluster:
+            assert config.cluster is not None
+            return {
+                "cluster_shape": cls._materialize_cluster_shape_attr(
+                    config.cluster, "cluster"
+                )
+            }
+
+        return {}
+
     @functools.lru_cache(maxsize=1)
     def get_version(self) -> Any:
         """
@@ -485,14 +586,6 @@ class CutlassBaseDSL(BaseDSL):
         mlir_libs_candidates = [
             Path(dsl_path) / "_mlir" / "_mlir_libs",
         ]
-        try:
-            import cutlass._mlir as _mlir_module
-
-            if hasattr(_mlir_module, "__path__"):
-                for p in _mlir_module.__path__:
-                    mlir_libs_candidates.append(Path(p) / "_mlir_libs")
-        except (ImportError, AttributeError):
-            pass
         mlir_libs_path = None
         for candidate in mlir_libs_candidates:
             if candidate.exists():
@@ -503,8 +596,46 @@ class CutlassBaseDSL(BaseDSL):
                 "Could not find _mlir/_mlir_libs directory. "
                 "Please re-install the package."
             )
-        giant_dso_name = str(next(mlir_libs_path.glob("_cutlass_ir.cpython*")).name)
-        so_path = str(mlir_libs_path / giant_dso_name)
+        # The pybind module file may be CTK-tagged (`_cutlass_ir.cu12.cpython-…so`
+        # or `_cutlass_ir.cu13.cpython-…so`) when multiple CTK flavors coexist in
+        # the same `_mlir_libs/` directory. The CTK-aware loader has already
+        # picked one and bound it as `cutlass._mlir._mlir_libs._cutlass_ir`, so
+        # the loaded module's path is authoritative — use it unconditionally
+        # rather than re-scanning the candidate dirs (which can resolve to a
+        # different `_mlir_libs/` than the loader actually consulted, e.g. when
+        # `cutlass._mlir.__path__` is a CI/PYTHONPATH overlay).
+        loaded = sys.modules.get("cutlass._mlir._mlir_libs._cutlass_ir")
+        loaded_file = getattr(loaded, "__file__", None) if loaded is not None else None
+        if not loaded_file:
+            # Loader hasn't run (very early import path, or someone deleted
+            # the sys.modules entry). Force it to run via import_module —
+            # this is idempotent if the module is already loaded, and
+            # otherwise routes through the CTK-aware loader in
+            # _mlir_libs/__init__.py so we hash exactly the binary the
+            # runtime would. Avoids a non-deterministic glob fallback
+            # (which could pick the wrong flavor when both cu12 and cu13
+            # .so files coexist) and a bare StopIteration when no match
+            # is found.
+            from importlib import import_module
+
+            try:
+                loaded = import_module("cutlass._mlir._mlir_libs._cutlass_ir")
+            except ImportError as e:
+                raise DSLRuntimeError(
+                    "Could not load cutlass._mlir._mlir_libs._cutlass_ir. "
+                    "Please re-install the package."
+                ) from e
+            loaded_file = getattr(loaded, "__file__", None)
+            if not loaded_file:
+                raise DSLRuntimeError(
+                    "Loaded cutlass._mlir._mlir_libs._cutlass_ir has no "
+                    "__file__ attribute. Please re-install the package."
+                )
+        so_path = loaded_file
+        giant_dso_name = Path(loaded_file).name
+        # Re-anchor `mlir_libs_path` to where the loaded binary actually
+        # lives so any subsequent path-derived state stays consistent.
+        mlir_libs_path = Path(loaded_file).parent
         try:
             # update the version hash of the cutlass shared library
             so_size = os.path.getsize(so_path)
@@ -632,10 +763,12 @@ class CutlassBaseDSL(BaseDSL):
             from cutlass.base_dsl.tvm_ffi_builder import attach_ffi_func
 
             assert self._tvm_ffi_args_spec_converter is not None
-            tvm_ffi_spec_params, kwargs_wrapper_spec = (
-                self._tvm_ffi_args_spec_converter(
-                    function_name, signature, full_args, full_kwargs
-                )
+            (
+                tvm_ffi_spec_params,
+                kwargs_wrapper_spec,
+                map_dataclass_to_tuple,
+            ) = self._tvm_ffi_args_spec_converter(
+                function_name, signature, full_args, full_kwargs
             )
             tvm_ffi_provider = TVMFFICuteCallProvider(
                 function_name, has_gpu_module=self.num_kernels > 0
@@ -655,9 +788,23 @@ class CutlassBaseDSL(BaseDSL):
                 module.operation.verify()
 
             def _make_compiled_func(*args: Any, **kwargs: Any) -> Any:
-                if kwargs_wrapper_spec.kwonly_names or kwargs_wrapper_spec.arg_defaults:
+                # Route through the kwargs-capable compiled class whenever
+                # the wrapper spec has arg names, kwonly names, or defaults,
+                # OR any positional arg carries a dataclass instance
+                # (detected by the spec converter, including Union[...] over
+                # dataclasses). The kwargs wrapper is the only place
+                # tvm-ffi's ``map_dataclass_to_tuple`` unpack hook fires.
+                if (
+                    kwargs_wrapper_spec.kwonly_names
+                    or kwargs_wrapper_spec.arg_defaults
+                    or kwargs_wrapper_spec.arg_names
+                    or map_dataclass_to_tuple
+                ):
                     return TVMFFIJitCompiledFunctionWithKwargs(
-                        *args, **kwargs, kwargs_wrapper_spec=kwargs_wrapper_spec
+                        *args,
+                        **kwargs,
+                        kwargs_wrapper_spec=kwargs_wrapper_spec,
+                        map_dataclass_to_tuple=map_dataclass_to_tuple,
                     )
                 else:
                     return TVMFFIJitCompiledFunction(*args, **kwargs)
@@ -700,40 +847,6 @@ class CutlassBaseDSL(BaseDSL):
             funcBody=funcBody,
         )
 
-    @staticmethod
-    def track_smem_allocator(
-        allocator: object, callback: Callable[[object], int]
-    ) -> None:
-        """
-        Tracks shared memory usage for kernel functions.
-        Find and set allocator to its parent dsl object.
-        """
-        frame = inspect.currentframe().f_back  # type: ignore[union-attr]
-        while frame:
-            obj = frame.f_locals.get("self", None)
-            if obj and isinstance(obj, CutlassBaseDSL):
-                obj._set_smem_tracking(allocator, callback)
-                return
-            frame = frame.f_back
-        warnings.warn("Cannot find parent dsl for allocator!", UserWarning)
-
-    def _set_smem_tracking(  # type: ignore[no-redef]
-        self, allocator: object, callback: Callable[[object], int]
-    ) -> None:
-        # Registers an allocator and callback for current dsl
-        self._smem_usage_tracker = (allocator, callback)
-
-    def _reset_smem_tracking(self) -> None:  # type: ignore[no-redef]
-        # Clear an allocator and callback for current dsl
-        self._smem_usage_tracker = None
-
-    def _get_smem_usage(self) -> int:  # type: ignore[no-redef]
-        # Treat final allocated bytes of allocator as smem usage
-        if not self._smem_usage_tracker:
-            return 0
-        allocator, callback = self._smem_usage_tracker
-        return callback(allocator)
-
     @staticmethod
     def cuda_launch_func(
         stream: Union[list, tuple],
@@ -755,6 +868,11 @@ class CutlassBaseDSL(BaseDSL):
         dynamic_shared_memory_size: Optional[Union[Int64, int]] = None,
         use_pdl: bool = False,
         cooperative: bool = False,
+        launch_completion_event: Optional[ir.Value] = None,
+        launch_completion_event_flags: Optional[int | Int32] = None,
+        programmatic_event: Optional[ir.Value] = None,
+        programmatic_event_flags: Optional[int | Int32] = None,
+        programmatic_event_trigger_at_block_start: Optional[int | Int32] = None,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
@@ -799,6 +917,29 @@ class CutlassBaseDSL(BaseDSL):
             cfg, Int32(use_pdl).ir_value(), loc=loc, ip=ip
         )
 
+        if launch_completion_event is not None:
+            event = launch_completion_event
+            if launch_completion_event_flags is None:
+                launch_completion_event_flags = 0
+            flags = Int32(launch_completion_event_flags).ir_value(loc=loc, ip=ip)
+            cuda_dialect.launch_cfg_launch_completion_event(
+                cfg, event, flags, loc=loc, ip=ip
+            )
+
+        if programmatic_event is not None:
+            event = programmatic_event
+            if programmatic_event_flags is None:
+                programmatic_event_flags = 0
+            flags = Int32(programmatic_event_flags).ir_value(loc=loc, ip=ip)
+            if programmatic_event_trigger_at_block_start is None:
+                programmatic_event_trigger_at_block_start = 0
+            trigger_at_block_start = Int32(
+                programmatic_event_trigger_at_block_start
+            ).ir_value(loc=loc, ip=ip)
+            cuda_dialect.launch_cfg_programmatic_event(
+                cfg, event, flags, trigger_at_block_start, loc=loc, ip=ip
+            )
+
         if cluster_size_x is not None:
             if cluster_size_y is None:
                 cluster_size_y = 1
@@ -897,7 +1038,6 @@ class CutlassBaseDSL(BaseDSL):
             def __init__(self, dsl: CutlassBaseDSL):
                 super().__init__()
                 self.dsl = dsl
-                self.dsl._reset_smem_tracking()
 
             def generate_func_op(
                 self,
@@ -944,19 +1084,24 @@ class CutlassBaseDSL(BaseDSL):
                 )
 
                 cfg = requiredArgs.config
-
-                smem_usage = self.dsl._get_smem_usage()
-                if any(not isinstance(x, int) for x in [cfg.smem, smem_usage]):
-                    pass  # cannot compare dynamic value inside kernel to launch op in py
-                elif cfg.auto_smem:
+                # Query the kernel's shared memory usage via kernel_smem_size
+                smem_usage = cute.kernel_smem_size(kernel_name=kernelSym, loc=loc)
+                # Use the auto-inferred smem size when the user does not specify one
+                if cfg.smem is None:
                     cfg.smem = smem_usage
-                elif smem_usage > cfg.smem:  # type: ignore[operator]
-                    warnings.warn(
-                        f"Potential error: specified kernel launch smem bytes "
-                        f"({cfg.smem}) is smaller than kernel usage ({smem_usage})!",
-                        UserWarning,
+                else:
+                    # Warn the user if the specified launch cfg.smem size is insufficient
+                    cfg.smem = const(cfg.smem)
+                    smem_msg = (
+                        f"\nError: shared memory usage in '{kernelSym}' "
+                        "may exceed available memory set in kernel launch. "
+                        "Allocated: {} bytes. Used: {} bytes.\n\n"
+                    )
+                    if_generate(
+                        Int64(cfg.smem) < smem_usage,
+                        lambda: cute.print_([cfg.smem, smem_usage], fmt=smem_msg),
+                        loc=loc,
                     )
-                cfg.smem = const(cfg.smem)
 
                 # Warn user if shared memory exceed arch max
                 # Currently runtime only show 'CUDA_ERROR_INVALID_VALUE' error which is not useful
@@ -1029,6 +1174,11 @@ class CutlassBaseDSL(BaseDSL):
                     dynamic_shared_memory_size=cfg.smem,
                     use_pdl=cfg.use_pdl,
                     cooperative=cfg.cooperative,
+                    launch_completion_event=cfg.launch_completion_event,
+                    launch_completion_event_flags=cfg.launch_completion_event_flags,
+                    programmatic_event=cfg.programmatic_event,
+                    programmatic_event_flags=cfg.programmatic_event_flags,
+                    programmatic_event_trigger_at_block_start=cfg.programmatic_event_trigger_at_block_start,
                     loc=loc,
                 )
                 return None
@@ -1081,6 +1231,10 @@ class CutlassBaseDSL(BaseDSL):
         ):
             pass
         else:
+            # Ignore metadata for annotated types
+            if get_origin(arg_annotation) is Annotated:
+                arg_annotation = get_args(arg_annotation)[0]
+
             origin = get_origin(arg_annotation)
             # Handle special case where annotation is Type[X] but arg is an actual type
             if origin is type and isinstance(arg, type):
@@ -1098,6 +1252,9 @@ class CutlassBaseDSL(BaseDSL):
                     (ty is Any)
                     or (isinstance(ty, type) and isinstance(arg, ty))
                     or (get_origin(ty) is tuple and isinstance(arg, tuple))
+                    or (
+                        get_origin(ty) is Annotated and isinstance(arg, get_args(ty)[0])
+                    )
                     for ty in allowed_types
                 ):
                     return DSLRuntimeError(
@@ -1249,6 +1406,79 @@ class CuTeDSL(CutlassBaseDSL):
 
         super().__init__(name, compiler_provider, pass_sm_arch_name, preprocess=True)
 
+    @classmethod
+    def jit(cls, *dargs: Any, **dkwargs: Any) -> Any:
+        """
+        Decorator to mark a function for JIT compilation for host code.
+
+        Parameters
+        ----------
+        is_experimental : bool, optional
+            When True, route compilation through the experimental CuTe DSL
+            path (equivalent to ``cute.experimental.jit``). Defaults to
+            False.
+
+            This option exists to ease the migration of users away from
+            ``cute.experimental`` toward the unified ``cute`` decorator;
+            ``cute.experimental`` is deprecated and will be removed once
+            its functionality has been folded into ``cute``.
+        """
+        # Pop before forwarding so jit_runner does not see the alias kwarg.
+        is_experimental = dkwargs.pop("is_experimental", False)
+        # CuteExperimentalDSL is defined later in this module; it is
+        # resolved lazily here at call time, so the forward reference is
+        # safe.
+        target_cls: type[BaseDSL] = CuteExperimentalDSL if is_experimental else cls
+        # Capture the user's call site (mirroring BaseDSL.jit) so that
+        # decorator source locations are reported relative to the caller,
+        # not this override.
+        frame = inspect.currentframe().f_back  # type: ignore[union-attr]
+        return BaseDSL.jit_runner(target_cls, "_func", frame, *dargs, **dkwargs)
+
+    @classmethod
+    def kernel(cls, *dargs: Any, **dkwargs: Any) -> Any:
+        """
+        Decorator to mark a function for JIT compilation for GPU device code.
+
+        Parameters
+        ----------
+        is_experimental : bool, optional
+            When True, route compilation through the experimental CuTe DSL
+            path (equivalent to ``cute.experimental.kernel``). Defaults to
+            False.
+
+            This option exists to ease the migration of users away from
+            ``cute.experimental`` toward the unified ``cute`` decorator;
+            ``cute.experimental`` is deprecated and will be removed once
+            its functionality has been folded into ``cute``. Pair this
+            flag with ``cute.jit(is_experimental=True)`` and
+            ``cute.compile(..., is_experimental=True)`` so that the host
+            launcher, the device kernel, and the explicit compile entry
+            point all dispatch through the same DSL; mixing experimental
+            and non-experimental decorations on the same kernel triggers
+            preprocessor mismatches.
+        attributes : optional
+            Kernel-level attribute spec, only supported when routing to
+            ``CuteExperimentalDSL`` (i.e. ``is_experimental=True``).
+            Stamped onto the wrapped function via
+            ``CuteExperimentalDSL._KERNEL_ATTR_SPEC_FIELD``.
+        """
+        is_experimental = dkwargs.pop("is_experimental", False)
+        # CuteExperimentalDSL is defined later in this module; the
+        # forward reference is resolved lazily at call time.
+        target_cls: type[CutlassBaseDSL] = (
+            CuteExperimentalDSL if is_experimental else cls
+        )
+        # Capture the user's call site here rather than relying on a
+        # nested method, so source locations point at the caller rather
+        # than this override.
+        current_frame = inspect.currentframe()
+        assert current_frame is not None
+        frame = current_frame.f_back
+        return CutlassBaseDSL._make_kernel_decorator(
+            target_cls, frame, *dargs, **dkwargs
+        )
+
     @staticmethod
     def generate_func_op(
         arg_types: List[ir.Type],
@@ -1280,6 +1510,253 @@ class CuTeDSL(CutlassBaseDSL):
     ) -> Any:
         return cuda_dialect.ReturnOp([], loc=loc, ip=ip)
 
+    @staticmethod
+    def generate_device_func_op(
+        arg_types: List[Any],
+        ret_types: List[Any],
+        func_name: str,
+        arg_attrs: Any = None,
+        loc: Optional[ir.Location] = None,
+    ) -> Any:
+        """Generate a cuda.func op (__device__ function) with optional return types."""
+        func_op = cuda_dialect.FuncOp(
+            func_name, ir.FunctionType.get(arg_types, ret_types), loc=loc
+        )
+        if arg_attrs is not None:
+            func_op.arg_attrs = arg_attrs
+        return func_op
+
+    def _device_func(
+        self, funcBody: Callable[..., Any], *args: Any, **kwargs: Any
+    ) -> Any:
+        """Generate a __device__ function (cuda.func inside gpu.module).
+
+        Unlike _func (host wrapper) or _kernel_helper (__global__ kernel),
+        this generates only a gpu.module containing a cuda.func with no
+        host-side wrapper and no launch op. Supports return values.
+        """
+        if ir.Context.current is not None and ir.InsertionPoint.current is not None:
+            return funcBody(*args, **kwargs)
+
+        # Device functions have no host entry point — never create a JIT engine.
+        kwargs.pop("no_jit_engine", None)
+        setup = self._prepare_compilation(funcBody, *args, **kwargs)
+
+        # Extract return type from Python annotation (e.g. -> Float32)
+        ret_annotation = setup.sig.return_annotation
+        if ret_annotation is inspect.Signature.empty:
+            ret_annotation = None
+
+        log().debug(f"Generating MLIR for device function '{setup.function_name}'")
+
+        # Generate MLIR
+        with ir.Context() as ctx, self.get_ir_location(setup.location):
+            ctx.enable_multithreading(False)
+            try:
+                from cutlass._mlir.dialects import llvm as llvm_dialect
+                from ..base_dsl.typing import NumericMeta
+
+                # Auto-instantiate bare types (e.g. Int32, MyStruct) into
+                # zero-valued instances. Struct construction creates MLIR ops,
+                # so we build them in a temporary module then discard it.
+                def _instantiate_bare(arg: Any) -> Any:
+                    if isinstance(arg, NumericMeta):
+                        return arg(0)
+                    if isinstance(arg, type) and hasattr(arg, "_field_names"):
+                        kw = {}
+                        for fn in arg._field_names:
+                            kw[fn] = _instantiate_bare(arg._field_annotations[fn])  # type: ignore[attr-defined]
+                        return arg(**kw)
+                    return arg
+
+                needs_instantiation = any(
+                    isinstance(a, (type, NumericMeta)) for a in setup.canonicalized_args
+                )
+                if needs_instantiation:
+                    from .._mlir.dialects import func as func_dialect
+
+                    tmp = ir.Module.create()
+                    with ir.InsertionPoint(tmp.body):
+                        tmp_fop = func_dialect.FuncOp(
+                            "_bare_type_init", ir.FunctionType.get([], [])
+                        )
+                        with ir.InsertionPoint(tmp_fop.add_entry_block()):
+                            setup.canonicalized_args = tuple(
+                                _instantiate_bare(a) for a in setup.canonicalized_args
+                            )
+                            func_dialect.ReturnOp([])
+
+                OpaquePointer: Optional[type] = None
+                # For device functions, opaque pointer annotations mean
+                # "pointer in generic address space".  Create a copy of the
+                # annotations with those replaced so the type-generation
+                # path doesn't reject them, but keep the originals intact for
+                # header generation.
+                pointer_arg_indices = set()
+                typegen_sig = setup.sig
+                if OpaquePointer is not None:
+                    typegen_params = []
+                    for idx, (name, param) in enumerate(setup.sig.parameters.items()):
+                        if param.annotation is OpaquePointer:
+                            pointer_arg_indices.add(idx)
+                            typegen_params.append(
+                                param.replace(annotation=inspect.Parameter.empty)
+                            )
+                        else:
+                            typegen_params.append(param)
+                    if pointer_arg_indices:
+                        typegen_sig = setup.sig.replace(parameters=typegen_params)
+
+                exe_args, func_types, adapted_args = self.generate_mlir_function_types(
+                    funcBody,
+                    setup.function_name,
+                    setup.canonicalized_args,
+                    setup.canonicalized_kwargs,
+                    typegen_sig,
+                    compile_only=setup.compile_only,
+                )
+
+                # Device function pointers use generic address space (0).
+                generic_ptr = llvm_dialect.PointerType.get(0)
+                for i, ty in enumerate(func_types):
+                    if i in pointer_arg_indices:
+                        func_types[i] = generic_ptr
+                        continue
+                    try:
+                        ptr_ty = llvm_dialect.PointerType(ty)
+                        if ptr_ty.address_space != 0:
+                            func_types[i] = generic_ptr
+                    except Exception:
+                        pass  # ty is not an LLVM pointer type; leave it unchanged
+
+                # Resolve return types from annotation.
+                # Mirrors _annotation_to_mlir_type: handles callable mlir_type
+                # (some DSL types) and the _struct_type fallback (@native_struct).
+                ret_types = []
+                if ret_annotation is not None:
+                    if hasattr(ret_annotation, "mlir_type"):
+                        mt = ret_annotation.mlir_type
+                        ret_types = [mt() if callable(mt) else mt]
+                    elif hasattr(ret_annotation, "_struct_type"):
+                        ret_types = [ret_annotation._struct_type]
+                    else:
+                        raise DSLRuntimeError(
+                            "Device function return type annotation must be a DSL type "
+                            f"(e.g. Float32) or @native_struct, got {ret_annotation}"
+                        )
+
+                loc = self.get_ir_location(setup.location)
+                module = ir.Module.create(loc=loc)
+                module.operation.attributes["gpu.container_module"] = ir.UnitAttr.get()
+
+                with ir.InsertionPoint(module.body):
+                    self._build_gpu_module(setup.gpu_module_attrs, loc=loc)
+
+                    with self._enter_gpu_module():
+                        fop = self.generate_device_func_op(
+                            func_types,
+                            ret_types,
+                            setup.function_name,
+                            loc=loc,
+                        )
+                        fop.sym_visibility = ir.StringAttr.get("public")
+
+                        arg_locs = [fop.operation.location for _ in func_types]
+                        entry_block = fop.add_entry_block(arg_locs=arg_locs)
+
+                        with ir.InsertionPoint(entry_block):
+                            ir_args, ir_kwargs = self.generate_execution_arguments(
+                                setup.canonicalized_args,
+                                setup.canonicalized_kwargs,
+                                fop,
+                                setup.sig,
+                            )
+
+                            result = funcBody(*ir_args, **ir_kwargs)
+
+                            # Generate return op
+                            if ret_types:
+                                if result is None:
+                                    raise DSLRuntimeError(
+                                        f"Device function '{funcBody.__name__}' has a return "
+                                        "type annotation but returned None"
+                                    )
+                                if hasattr(result, "ir_value"):
+                                    ret_val = result.ir_value()
+                                elif hasattr(result, "__extract_mlir_values__"):
+                                    extracted_vals = result.__extract_mlir_values__()
+                                    if len(extracted_vals) != 1:
+                                        raise DSLRuntimeError(
+                                            f"Device function '{funcBody.__name__}' returned "
+                                            f"{len(extracted_vals)} MLIR values; expected 1"
+                                        )
+                                    ret_val = extracted_vals[0]
+                                else:
+                                    ret_val = result
+                                cuda_dialect.ReturnOp([ret_val], loc=loc)
+                            else:
+                                cuda_dialect.ReturnOp([], loc=loc)
+
+                # Increment kernel count so the gpu.module is not removed
+                self.num_kernels += 1
+
+                from ..base_dsl.export.c_header_generator import CHeaderGenerator
+
+                ptr_types = (OpaquePointer,) if OpaquePointer is not None else ()
+                device_header = CHeaderGenerator.generate_device_header(
+                    setup.function_name,
+                    setup.sig,
+                    ret_annotation=ret_annotation,
+                    pointer_types=ptr_types,
+                )
+
+                module = self.build_module(module, setup.function_name)
+
+                # dryrun: generate IR and header, skip compilation
+                if self.envar.dryrun:
+                    print(device_header)
+                    return result
+
+                module_hash = self.get_module_hash(module, setup.function_name)
+                jit_function = self.compile_and_cache(
+                    module,
+                    module_hash,
+                    setup.function_name,
+                    setup.pipeline,
+                    setup.sig,
+                    setup.no_cache,
+                    no_jit_engine=True,
+                )
+
+                # Attach device compilation artifacts.
+                # The pipeline always dumps .cubin; with --compile-only it's a
+                # relocatable object, so rename to .o for clarity.
+                cubin_path = self.compile_options.full_cubin_path
+                obj_path = None
+                if cubin_path:
+                    obj_path = cubin_path.rsplit(".cubin", 1)[0] + ".o"
+                    try:
+                        os.rename(cubin_path, obj_path)
+                    except FileNotFoundError:
+                        # Already renamed or not produced.
+                        if not os.path.exists(obj_path):
+                            obj_path = None
+
+                jit_function.device_header = device_header
+                jit_function.device_object_path = obj_path
+
+                ptx_path = self.compile_options.full_ptx_path
+                jit_function.device_ptx_path = (
+                    ptx_path if ptx_path and os.path.exists(ptx_path) else None
+                )
+            finally:
+                self.post_compilation_cleanup()
+
+        if setup.compile_only:
+            return jit_function
+
+        return result
+
 
 # =============================================================================
 # CuteExperimentalJitCompiledFunction Class
@@ -1318,6 +1795,12 @@ class CuteExperimentalDSL(CutlassBaseDSL):
         {"lir.tma_update_mode"}
     )
     _KERNEL_ATTR_SPEC_FIELD: Optional[str] = "_cute_experimental_kernel_attributes"
+    # Marks this class as the deprecated "experimental" CuTe DSL so
+    # ``cute.compile(..., is_experimental=True)`` can validate that the
+    # function being compiled was actually decorated through the
+    # experimental path (``@cute.jit(is_experimental=True)`` /
+    # ``@cute.kernel(is_experimental=True)``).
+    _is_experimental_dsl: bool = True
     JitCompiledFunction = _CuteExperimentalJitCompiledFunction
 
     def __init__(self) -> None:
@@ -1329,31 +1812,21 @@ class CuteExperimentalDSL(CutlassBaseDSL):
 
     @classmethod
     def kernel(cls, *dargs: Any, **dkwargs: Any) -> Any:
-        attr_spec = dkwargs.pop("attributes", None)
-        # Capture the caller's frame here rather than delegating to
-        # super().kernel(), which would record *this* frame instead of
-        # the user's source location (f_back would land in this override
-        # rather than in the user file).
+        # Capture the caller's frame here rather than delegating to a
+        # nested helper that uses ``inspect.currentframe().f_back``,
+        # which would record *this* frame instead of the user's source
+        # location (f_back would land in this override rather than in
+        # the user file).
         current_frame = inspect.currentframe()
         assert current_frame is not None
         frame = current_frame.f_back
-        kernel_decorator = BaseDSL.jit_runner(
-            cls, "_kernel_helper", frame, *dargs, **dkwargs
-        )
-        if attr_spec is None:
-            return kernel_decorator
-
-        def attach_and_decorate(func: Callable[..., None]) -> Callable[..., None]:
-            assert cls._KERNEL_ATTR_SPEC_FIELD is not None
-            setattr(func, cls._KERNEL_ATTR_SPEC_FIELD, attr_spec)
-            return kernel_decorator(func)
-
-        return attach_and_decorate
+        return CutlassBaseDSL._make_kernel_decorator(cls, frame, *dargs, **dkwargs)
 
     def _generate_kernel_attrs(self, config: BaseDSL.LaunchConfig) -> dict:
         import re
 
         ret = super()._generate_kernel_attrs(config)
+        ret.update(self._get_cluster_kernel_attrs(config))
 
         # Add compute capability attribute from the target arch.
         # get_arch_enum() validates the arch string; strip the portability
@@ -1370,8 +1843,15 @@ class CuteExperimentalDSL(CutlassBaseDSL):
 
     def _get_pipeline(self, pipeline: Optional[str]) -> str:
         if pipeline == None:
+            # Build the option string for ``lir-to-cute{...}``. It is
+            # kept separate from ``compile_options.to_str()``, which
+            # targets ``cute-to-nvvm{...}``, because the two belong to
+            # different pipelines.
+            lir_to_cute_opts = "enable-cuda-dialect enable-lir-func-finalization=false"
             return (
-                "builtin.module(gpu.module(lir-to-cute{enable-cuda-dialect enable-lir-func-finalization=false}), lir-func-finalization{enable-cuda-dialect=true}, cute-to-nvvm{cubin-format=bin enable-cuda-dialect "
+                "builtin.module(gpu.module(lir-to-cute{"
+                + lir_to_cute_opts
+                + "}), lir-func-finalization{enable-cuda-dialect=true require-configure-launch=false}, cute-to-nvvm{cubin-format=bin enable-cuda-dialect "
                 + self.compile_options.to_str()
                 + "})"
             )
@@ -1415,7 +1895,7 @@ class CuteExperimentalDSL(CutlassBaseDSL):
     def generate_func_ret_op(
         loc: Optional[ir.Location] = None, ip: Optional[ir.InsertionPoint] = None
     ) -> Any:
-        return cutlass_lir.ReturnOp([])
+        return cutlass_lir.ReturnOp([], loc=loc, ip=ip)
 
     def compile_and_cache(
         self,
@@ -1533,7 +2013,11 @@ class KernelLauncher:
         """
         Check smem usage for this kernel, only available after `launch`
         """
-        return self.dsl._get_smem_usage()  # type: ignore[return-value]
+        if self._launch_name is None:
+            raise ValueError("kernel smem usage only available after `launch`")
+        kernel_sym = ir.SymbolRefAttr.get(["kernels", self._launch_name])
+        smem_usage = cute.kernel_smem_size(kernel_name=kernel_sym)
+        return Int32(smem_usage)
 
     def launch(self, *args: Any, **kwargs: Any) -> Any:
         self.dsl._preprocess_launch_config_args(args, kwargs)
@@ -1821,7 +2305,7 @@ def _minmax(
     ip: Optional[ir.InsertionPoint] = None,
 ) -> Union[Numeric, int, float]:
     """Computes the minimum or maximum value from the provided arguments."""
-    from ..base_dsl.typing import _binary_op, _binary_op_type_promote
+    from ..base_dsl.typing import _binary_op_type_promote
 
     # AST Traversal doesn't support early exit in if executor
     x = None
@@ -1861,7 +2345,7 @@ def _minmax(
                 if isinstance(lhs.value, cutlass_arith.ArithValue) and isinstance(
                     lhs, Integer
                 ):
-                    lhs_val = lhs.value.with_signedness(lhs.signed)  # type: ignore[attr-defined]
+                    lhs_val = lhs.value.with_signedness(lhs.signed)
                 else:
                     lhs_val = lhs.value
 
@@ -1869,7 +2353,7 @@ def _minmax(
                 if isinstance(rhs.value, cutlass_arith.ArithValue) and isinstance(
                     rhs, Integer
                 ):
-                    rhs_val = rhs.value.with_signedness(rhs.signed)  # type: ignore[attr-defined]
+                    rhs_val = rhs.value.with_signedness(rhs.signed)
                 else:
                     rhs_val = rhs.value
                 res = res_type(
@@ -2284,7 +2768,7 @@ def for_generate(
 
     def _createI32Attr(value: Union[Int32, int]) -> ir.IntegerAttr:
         if not isinstance(value, int):
-            raise DSLRuntimeError(f"value must be int.")
+            raise DSLRuntimeError("value must be int.")
         return ir.IntegerAttr.get(ir.IntegerType.get_signless(32), value)
 
     ir_iter_args = extract_mlir_values(iter_args) if iter_args is not None else None
@@ -2371,6 +2855,7 @@ def if_generate(
         List of DSL typed results
     """
     input_args = input_args or []
+
     mlir_return_types = []
 
     # Validate and collect MLIR return types (if provided).
@@ -2381,11 +2866,16 @@ def if_generate(
             mlir_return_types.append(t.mlir_type)  # type: ignore[attr-defined]
 
     # Determine whether there's an else branch.
+    # When return_types are specified but else_body is None, synthesize a
+    # passthrough else that yields the input_args unchanged.  Without this,
+    # scf.if with results would lack an else block, which is invalid MLIR.
+    if else_body is None and return_types is not None:
+        else_body = lambda *args: args if len(args) > 1 else args[0]
     has_else = else_body is not None
 
     # Create the IfOp.
     if_op = scf.IfOp(
-        Boolean(cond).ir_value(), mlir_return_types, hasElse=has_else, loc=loc, ip=ip
+        Boolean(cond).ir_value(), mlir_return_types, has_else=has_else, loc=loc, ip=ip
     )
 
     def _execute_and_yield_out(body: Callable, input_args: List[DslType]) -> None:
diff --git a/python/CuTeDSL/cutlass/cutlass_dsl/cutlass_ast_decorators.py b/python/CuTeDSL/cutlass/cutlass_dsl/cutlass_ast_decorators.py
index cba47634d..6b8087d34 100644
--- a/python/CuTeDSL/cutlass/cutlass_dsl/cutlass_ast_decorators.py
+++ b/python/CuTeDSL/cutlass/cutlass_dsl/cutlass_ast_decorators.py
@@ -18,11 +18,11 @@ from collections.abc import Sequence
 
 from ..base_dsl.common import DSLRuntimeError, DSLNotImplemented
 from ..base_dsl.dsl import is_dynamic_expression
-from ..base_dsl._mlir_helpers.arith import ArithValue
+from .._mlir_helpers.arith import ArithValue
 from ..base_dsl.ast_helpers import *  # noqa: F401,F403
 from ..base_dsl.utils.logger import log
 from ..base_dsl import typing as t
-from ..base_dsl.typing import Boolean, Numeric, as_numeric
+from ..base_dsl.typing import Boolean, Numeric, as_numeric, _binary_op_type_promote
 from ..base_dsl.utils.tree_utils import PyTreeDef, check_tree_equal
 from . import cutlass as cutlass_dsl
 
@@ -33,6 +33,13 @@ from . import cutlass as cutlass_dsl
 NoneType = type(None)
 
 
+def _create_control_flow_generator() -> "ScfGenerator":
+    """
+    Create the appropriate control flow generator. Currently always ``ScfGenerator``.
+    """
+    return ScfGenerator()
+
+
 class LoopUnroll(ir.Attribute):
     def __init__(self, **kwargs: Union[int, bool]) -> None:
         valid_keys = set(["count", "full"])
@@ -292,7 +299,7 @@ def _loop_execute_range_dynamic(
     """
     Example: build an scf.for with optional unroll, using our universal helper.
     """
-    scf_gen = ScfGenerator()
+    scf_gen = _create_control_flow_generator()
 
     def create_for_op(dyn_yield_ops: List[ir.Value]) -> ir.Operation:
         for d in dyn_yield_ops:
@@ -301,19 +308,21 @@ def _loop_execute_range_dynamic(
                     f"Invalid dyn_yield_ops: {dyn_yield_ops} \n\tExpected ir.Value, got {type(d)}"
                 )
 
-        # Convert Python ints or values to IR constants if needed
-        start_ = t.as_numeric(start)
-        stop_ = t.as_numeric(stop)
-        step_ = t.as_numeric(step)
-        if start_.dtype is not t.Int32:
-            raise DSLRuntimeError(f"expected Int32 for start, got {start_.dtype}")
-        if stop_.dtype is not t.Int32:
-            raise DSLRuntimeError(f"expected Int32 for stop, got {stop_.dtype}")
-        if step_.dtype is not t.Int32:
-            raise DSLRuntimeError(f"expected Int32 for step, got {step_.dtype}")
-        start_ = start_.ir_value()
-        stop_ = stop_.ir_value()
-        step_ = step_.ir_value()
+        # Convert to Numeric, require integers, then promote to a common integer type
+        start_n = as_numeric(start)
+        stop_n = as_numeric(stop)
+        step_n = as_numeric(step)
+        for name, n in (("start", start_n), ("stop", stop_n), ("step", step_n)):
+            if not n.dtype.is_integer:
+                raise DSLRuntimeError(
+                    f"dynamic loop `{name}` must be an integer type, got {n.dtype}"
+                )
+        # Promote to a common integer type using pairwise type promotion
+        _, _, tmp_dtype = _binary_op_type_promote(start_n, stop_n)
+        _, _, dst_dtype = _binary_op_type_promote(start_n.to(tmp_dtype), step_n)
+        start_ = start_n.to(dst_dtype)
+        stop_ = stop_n.to(dst_dtype)
+        step_ = step_n.to(dst_dtype)
 
         # Attributes must be pure Python value, add a check
         _attr_const_check(unroll, int, "unroll")
@@ -362,13 +371,22 @@ def _loop_execute_range_dynamic(
             step_,
             type(step_),
         )
+        # LOC_TRACEBACKS captures full Python call stacks automatically,
+        # so no explicit loc is needed here.
 
         # Create scf.ForOp, passing iteration args if any
         try:
             if not dyn_yield_ops:
-                for_op = scf.ForOp(start_, stop_, step_)
+                for_op = scf.ForOp(
+                    start_.ir_value(), stop_.ir_value(), step_.ir_value()
+                )
             else:
-                for_op = scf.ForOp(start_, stop_, step_, list(dyn_yield_ops))
+                for_op = scf.ForOp(
+                    start_.ir_value(),
+                    stop_.ir_value(),
+                    step_.ir_value(),
+                    list(dyn_yield_ops),
+                )
         except Exception as e:
             yield_ops = "\n".join(
                 f"\t\t{i} => {d} : type : {type(d)}"
@@ -411,20 +429,17 @@ def _loop_execute_range_dynamic(
             block_args,
             full_write_args_count,
         )
-        # block_args[1:] are iteration variables
-        func_args = []
-        func_args.extend(
-            cutlass_dsl.pack_from_irvalue(
-                block_args[1:],
-                pytree_def,  # type: ignore[arg-type]
-                mix_iter_args,
-                full_write_args_count,
+        if pytree_def is None:
+            func_args = list(mix_iter_args)
+        else:
+            func_args = list(
+                cutlass_dsl.pack_from_irvalue(
+                    block_args[1:], pytree_def, mix_iter_args, full_write_args_count
+                )
             )
-        )
         if not func_args:
-            # No iteration arguments, or only the induction var
             func(iv)
-            return []  # yield nothing
+            return []
         else:
             updated_func_args = func(iv, *func_args)
             return updated_func_args
@@ -452,7 +467,7 @@ def _if_execute_dynamic(
     """
     Build an scf.if with optional else, using our universal helper.
     """
-    scf_gen = ScfGenerator()
+    scf_gen = _create_control_flow_generator()
 
     def create_if_op(dyn_yield_ops: List[ir.Value]) -> ir.Operation:
         # Assume final result types match the dynamic yields
@@ -463,8 +478,8 @@ def _if_execute_dynamic(
         try:
             if_op = scf.IfOp(
                 pred_.ir_value(),
-                hasElse=(else_block is not None),
                 results_=result_types,
+                has_else=else_block is not None,
             )
         except Exception as e:
             raise DSLRuntimeError(
@@ -480,12 +495,11 @@ def _if_execute_dynamic(
         mix_iter_args: List[object],
         full_write_args_count: int,
     ) -> object:
+        if pytree_def is None:
+            return then_block(*mix_iter_args)
         flat_args = list(
             cutlass_dsl.pack_from_irvalue(
-                dyn_yield_ops,
-                pytree_def,  # type: ignore[arg-type]
-                mix_iter_args,
-                full_write_args_count,
+                dyn_yield_ops, pytree_def, mix_iter_args, full_write_args_count
             )
         )
         return then_block(*flat_args)
@@ -502,12 +516,11 @@ def _if_execute_dynamic(
             mix_iter_args: List[object],
             full_write_args_count: int,
         ) -> object:
+            if pytree_def is None:
+                return else_block(*mix_iter_args)
             flat_args = list(
                 cutlass_dsl.pack_from_irvalue(
-                    dyn_yield_ops,
-                    pytree_def,  # type: ignore[arg-type]
-                    mix_iter_args,
-                    full_write_args_count,
+                    dyn_yield_ops, pytree_def, mix_iter_args, full_write_args_count
                 )
             )
             return else_block(*flat_args)
@@ -544,7 +557,7 @@ def _while_execute_dynamic(
     """
     log().debug("_while_execute_dynamic")
     while_op_type_name = "while"
-    scf_gen = ScfGenerator()
+    scf_gen = _create_control_flow_generator()
 
     def create_while_op(dyn_yield_ops: List[ir.Value]) -> ir.Operation:
         # Create the while operation with the types from yield_args
@@ -574,15 +587,14 @@ def _while_execute_dynamic(
         full_write_args_count: int,
     ) -> Any:
         # Build the before (condition) block
-        flat_args = []
-        flat_args.extend(
-            cutlass_dsl.pack_from_irvalue(
-                block_args,
-                pytree_def,  # type: ignore[arg-type]
-                mix_iter_args,
-                full_write_args_count,
+        if pytree_def is None:
+            flat_args = list(mix_iter_args)
+        else:
+            flat_args = list(
+                cutlass_dsl.pack_from_irvalue(
+                    block_args, pytree_def, mix_iter_args, full_write_args_count
+                )
             )
-        )
 
         log().debug("before block args: %s", flat_args)
 
@@ -604,10 +616,10 @@ def _while_execute_dynamic(
     ) -> None:
         # Generate a condition op instead of yield op.
         cond = cond_and_results[0]
+        ir_cond = as_numeric(cond).ir_value()
         before_result_list = ScfGenerator._normalize_region_result_to_list(
             cond_and_results[1]
         )
-        ir_cond = as_numeric(cond).ir_value()
         ir_results_list, pytree_def = cutlass_dsl.unpack_to_irvalue(
             before_result_list, while_op_type_name, full_write_args_count
         )
@@ -627,15 +639,14 @@ def _while_execute_dynamic(
         full_write_args_count: int,
     ) -> object:
         # Build the after (body) block
-        flat_args = []
-        flat_args.extend(
-            cutlass_dsl.pack_from_irvalue(
-                block_args,
-                pytree_def,  # type: ignore[arg-type]
-                mix_iter_args,
-                full_write_args_count,
+        if pytree_def is None:
+            flat_args = list(mix_iter_args)
+        else:
+            flat_args = list(
+                cutlass_dsl.pack_from_irvalue(
+                    block_args, pytree_def, mix_iter_args, full_write_args_count
+                )
             )
-        )
 
         log().debug("after block args: %s", flat_args)
 
@@ -749,8 +760,8 @@ def _ifexp_execute_dynamic(
         try:
             if_op = scf.IfOp(
                 pred_.ir_value(),
-                hasElse=True,
                 results_=result_types,
+                has_else=else_block is not None,
             )
         except Exception as e:
             raise DSLRuntimeError(
@@ -767,8 +778,10 @@ def _ifexp_execute_dynamic(
     # Prepare the list of region builders for the SCF IfOp: first for "then", then for "else"
     region_builders = [then_builder, else_builder]
 
+    # IfExp (ternary) always has results and its create_if_op hardcodes
+    # result_types, so use "ifexp" as op_type_name to ensure standard execution.
     ret = scf_gen.scf_execute_dynamic(
-        op_type_name="if",
+        op_type_name="ifexp",
         mix_iter_args=mix_iter_args,
         full_write_args_count=0,
         mix_iter_arg_names=["unknown" for _ in mix_iter_args],
diff --git a/python/CuTeDSL/cutlass/cutlass_dsl/tvm_ffi_provider.py b/python/CuTeDSL/cutlass/cutlass_dsl/tvm_ffi_provider.py
index 3d2a1ed29..eaa6cbea8 100644
--- a/python/CuTeDSL/cutlass/cutlass_dsl/tvm_ffi_provider.py
+++ b/python/CuTeDSL/cutlass/cutlass_dsl/tvm_ffi_provider.py
@@ -9,10 +9,9 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-from dataclasses import is_dataclass, fields as dataclass_fields
+import re
 from typing import Any, Callable, List, Optional, cast
 
-from cutlass.base_dsl.utils.tree_utils import is_constexpr_field
 from cutlass.base_dsl.tvm_ffi_builder import (
     DynamicParamPackCallProvider,
     CallContext,
@@ -25,10 +24,52 @@ from cutlass._mlir.dialects import llvm
 from cutlass._mlir._mlir_libs._cutlass_ir import _aot_support
 from cutlass.cutlass_dsl.cuda_jit_executor import CudaDialectJitCompiledFunction
 from cutlass.base_dsl.jit_executor import JitExecutor
-from cutlass.base_dsl.common import DSLRuntimeError
+from cutlass.base_dsl.common import (
+    DSLRuntimeError,
+    DSLCudaRuntimeError,
+    _get_cuda_error_name_from_code,
+)
 import tvm_ffi
 
 
+@tvm_ffi.register_error
+class CUDADialectError(DSLCudaRuntimeError):
+    """TVM-FFI error kind for CUDA dialect runtime failures."""
+
+    PREFIX = "CUDA Error Code: "
+
+    def __init__(self, message: str) -> None:
+        self.raw_tvm_ffi_message = message
+        error_code = CUDADialectError._parse_cuda_dialect_error_code(message)
+        super().__init__(error_code, _get_cuda_error_name_from_code(error_code))
+
+    def _format_message(self) -> str:
+        message = super()._format_message()
+        return message.replace(f"{self.__class__.__name__}:", "DSLCudaRuntimeError:", 1)
+
+    @staticmethod
+    def _parse_cuda_dialect_error_code(message: str) -> int:
+        match = re.fullmatch(
+            rf"{re.escape(CUDADialectError.PREFIX)}(?P<code>\d+)\s*",
+            message,
+        )
+        if match is not None:
+            return int(match.group("code"))
+
+        if not message.startswith(CUDADialectError.PREFIX):
+            raise ValueError(
+                "CUDADialectError expects a message beginning with "
+                f"{CUDADialectError.PREFIX!r}, got {message!r}"
+            )
+        if not message[len(CUDADialectError.PREFIX) :].strip():
+            raise ValueError(
+                f"CUDADialectError message has no numeric code: {message!r}"
+            )
+        raise ValueError(
+            f"CUDADialectError message has unexpected payload: {message!r}"
+        )
+
+
 class TVMFFICuteCallProvider(DynamicParamPackCallProvider):
     """Cute call provider that uses cute call convention."""
 
@@ -393,17 +434,9 @@ class TVMFFICuteCallProvider(DynamicParamPackCallProvider):
 
         # Populate the error block
         with ir.InsertionPoint(error_block):
-            error_str = llvm.call(
-                result=self.ptr_type,
-                callee="cuda_dialect_get_error_name",
-                callee_operands=[error_code],
-                op_bundle_sizes=[],
-                op_bundle_operands=[],
-            )
             # Raise error and return -1
-            context.builder.raise_error_and_return(
-                error_kind="RuntimeError",
-                error_message_parts=["CUDA Error: ", error_str],
+            context.builder.raise_cuda_error_and_return(
+                error_code, CUDADialectError.PREFIX
             )
 
         return error_block
@@ -472,22 +505,6 @@ def _get_format_from_object_file_path(object_file_path: str) -> str:
     return format
 
 
-def _flatten_dataclass_arg(arg: Any) -> Any:
-    """Recursively flatten a dataclass argument into a tuple for TVM FFI runtime.
-
-    TVM FFI expects tuple/array for TupleParam specs. NamedTuples work because
-    they are tuples, but dataclass instances need explicit flattening.
-    """
-    if is_dataclass(arg) and not isinstance(arg, type):
-        values = []
-        for f in dataclass_fields(arg):
-            if is_constexpr_field(f):
-                continue
-            values.append(_flatten_dataclass_arg(getattr(arg, f.name)))
-        return tuple(values)
-    return arg
-
-
 class TVMFFIJitCompiledFunctionBase(CudaDialectJitCompiledFunction):
     """Base class for TVM FFI compiled function."""
 
@@ -520,9 +537,15 @@ class TVMFFIJitCompiledFunctionBase(CudaDialectJitCompiledFunction):
         :param function_name: The name of the function to export.
         :param enable_pic: Whether to enable PIC relocation needed for shared library loading.
         :param export_only_tvm_ffi_symbols: Only export TVM FFI symbols (hide all others).
-        :param host_target_triple: If not provided, the current host target is used.
         """
-        internal_symbol_prefix = "__cute_internal_" + function_name  # type: ignore[operator]
+        if self.host_target.value:
+            raise DSLRuntimeError(
+                "Host cross-compile via TVM-FFI is not supported. "
+                "Drop --enable-tvm-ffi to use the plain AOT export path "
+                "with --host-target."
+            )
+        assert function_name is not None
+        internal_symbol_prefix = "__cute_internal_" + function_name
         mod = self.ir_module
         mod = get_export_module(
             self.ir_module,
@@ -530,7 +553,7 @@ class TVMFFIJitCompiledFunctionBase(CudaDialectJitCompiledFunction):
             preserve_symbols={f"__tvm_ffi_{self.function_name}"},
         )
 
-        rename_tvm_ffi_function(mod, self.function_name, function_name)  # type: ignore[arg-type]
+        rename_tvm_ffi_function(mod, self.function_name, function_name)
         if export_only_tvm_ffi_symbols:
             _inplace_hide_symbols(mod, lambda x: not x.startswith("__tvm_ffi"))
 
@@ -588,10 +611,6 @@ class TVMFFIJitCompiledFunction(tvm_ffi.Function, TVMFFIJitCompiledFunctionBase)
             # move the handle from the tvm_ffi.Function to the current instance
             self.__move_handle_from__(tvm_ffi_function)
 
-    def __call__(self, *args: Any) -> Any:
-        args = tuple(_flatten_dataclass_arg(a) for a in args)
-        return tvm_ffi.Function.__call__(self, *args)
-
 
 class TVMFFIJitCompiledFunctionWithKwargs(TVMFFIJitCompiledFunctionBase):
     """TVM FFI Function with kwargs wrapper support"""
@@ -599,33 +618,48 @@ class TVMFFIJitCompiledFunctionWithKwargs(TVMFFIJitCompiledFunctionBase):
     def __init__(self, *args: Any, **kwargs: Any) -> None:
         assert "kwargs_wrapper_spec" in kwargs, "kwargs_wrapper_spec is required"
         kwargs_wrapper_spec = kwargs.pop("kwargs_wrapper_spec")
+        # ``map_dataclass_to_tuple`` is a tvm-ffi concern (lists the arg
+        # names whose values get unpacked via unpack_dataclass_to_tuple at
+        # call time) and is intentionally kept outside KwargsWrapperSpec.
+        map_dataclass_to_tuple: List[str] = kwargs.pop("map_dataclass_to_tuple", [])
         super().__init__(*args, **kwargs)
         # initialize the tvm_ffi.Function from the current execution engine
         self._tvm_ffi_function = self._create_tvm_ffi_function()
-        assert self._tvm_ffi_function is not None
-        if kwargs_wrapper_spec.kwonly_names or kwargs_wrapper_spec.arg_defaults:
-            try:
-                from tvm_ffi.utils import kwargs_wrapper
+        if self._tvm_ffi_function is None:
+            self._kwargs_wrapper: Optional[Callable[..., Any]] = None
+            return
+        # This class is instantiated when the jit signature has any of:
+        #   - keyword-only parameters;
+        #   - defaults on positional parameters;
+        #   - a top-level @dataclass argument (needed for ``compiled(p=...)``
+        #     style calls since the underlying tvm_ffi.Function is
+        #     positional-only).
+        # ``make_kwargs_wrapper`` handles empty kwonly/arg_defaults as no-ops,
+        # so a single call covers all three triggers.
+        try:
+            from tvm_ffi.utils import kwargs_wrapper  # type: ignore
 
-                self._kwargs_wrapper = kwargs_wrapper.make_kwargs_wrapper(
-                    self._tvm_ffi_function,
-                    arg_names=kwargs_wrapper_spec.arg_names,
-                    arg_defaults=kwargs_wrapper_spec.arg_defaults,
-                    kwonly_names=kwargs_wrapper_spec.kwonly_names,
-                    kwonly_defaults=kwargs_wrapper_spec.kwonly_defaults,
-                )
-            except ImportError:
-                raise DSLRuntimeError(
-                    "install apache-tvm-ffi>=0.1.5 to enable kwargs/defaults"
-                )
-        else:
-            # positional only is probably fine
-            self._kwargs_wrapper = self._tvm_ffi_function
+            self._kwargs_wrapper = kwargs_wrapper.make_kwargs_wrapper(
+                self._tvm_ffi_function,
+                arg_names=kwargs_wrapper_spec.arg_names,
+                arg_defaults=kwargs_wrapper_spec.arg_defaults,
+                kwonly_names=kwargs_wrapper_spec.kwonly_names,
+                kwonly_defaults=kwargs_wrapper_spec.kwonly_defaults,
+                map_dataclass_to_tuple=map_dataclass_to_tuple,
+            )
+        except ImportError:
+            raise DSLRuntimeError(
+                "install apache-tvm-ffi>=0.1.11 to enable kwargs / defaults / "
+                "top-level dataclass argument support"
+            )
 
     def __call__(self, *args: Any, **kwargs: Any) -> Any:
         """Call the TVM FFI function with kwargs wrapper."""
-        args = tuple(_flatten_dataclass_arg(a) for a in args)
-        kwargs = {k: _flatten_dataclass_arg(v) for k, v in kwargs.items()}
+        if self._kwargs_wrapper is None:
+            raise DSLRuntimeError(
+                "TVM FFI function is not initialized."
+                " Was this function compiled for a different architecture?"
+            )
         return self._kwargs_wrapper(*args, **kwargs)
 
     def __tvm_ffi_object__(self) -> Optional["tvm_ffi.Function"]:
diff --git a/python/CuTeDSL/cutlass/jax/__init__.py b/python/CuTeDSL/cutlass/jax/__init__.py
index 286ce88f5..42a0e49c6 100644
--- a/python/CuTeDSL/cutlass/jax/__init__.py
+++ b/python/CuTeDSL/cutlass/jax/__init__.py
@@ -23,7 +23,7 @@ CUTE_DSL_MIN_SUPPORTED_JAX_VERSION = (0, 5, 0)
 
 
 @cache
-def is_available():
+def is_available() -> bool:
     """Returns true if JAX extensions are supported and available."""
     try:
         import jax
diff --git a/python/CuTeDSL/cutlass/jax/compile.py b/python/CuTeDSL/cutlass/jax/compile.py
index e58fa20d2..13807a005 100644
--- a/python/CuTeDSL/cutlass/jax/compile.py
+++ b/python/CuTeDSL/cutlass/jax/compile.py
@@ -10,7 +10,7 @@
 # is strictly prohibited.
 
 import gc
-from typing import Any
+from typing import Any, Callable
 from dataclasses import dataclass
 from functools import partial
 
@@ -38,7 +38,6 @@ from cutlass.cutlass_dsl.cutlass import CuTeDSL
 
 logger = logging.getLogger(__name__)
 
-_CUTLASS_COMPILE_CACHE = {}
 _EXPORT_PREFIX = "cutlass_call"
 
 
@@ -49,7 +48,7 @@ class Arg:
     dtype: jnp.dtype
     spec: TensorSpec
 
-    def get_static_flag(self, use_static_tensors: bool):
+    def get_static_flag(self, use_static_tensors: bool) -> bool:
         if self.spec.static is None:
             return use_static_tensors
         else:
@@ -67,11 +66,11 @@ class FunctionSpec:
     input_output_aliases: tuple[tuple[int, int], ...]
     input_spec: tuple[TensorSpec, ...]
     output_spec: tuple[TensorSpec, ...]
-    compile_options: str
+    compile_options: str | None
     use_static_tensors: bool
-    kwargs: tuple[tuple[str, Any]]
+    kwargs: tuple[tuple[str, Any], ...]
 
-    def get_compile_args(self):
+    def get_compile_args(self) -> JaxArrayList:
         """Returns the arguments to provide to cute.compile."""
         compiler_ins = [
             JaxTracedArray(
@@ -109,7 +108,7 @@ def jit_wrapper(
     *,
     wrapped_fn: cutlass.Constexpr,
     spec: cutlass.Constexpr,
-):
+) -> None:
     # split buffer argument into inputs and outputs and return to tree
     ins, outs = args[: len(spec.in_args)], args[(len(spec.in_args)) :]  # type: ignore[attr-defined]
     ins = [x.get_tensor() for x in ins]  # type: ignore[assignment, attr-defined]
@@ -128,7 +127,12 @@ class CompileResult:
     spec: FunctionSpec
 
 
-def _check_is_valid_type(x, is_input):
+_CUTLASS_COMPILE_CACHE: dict[
+    tuple[Callable[..., None], FunctionSpec], CompileResult
+] = {}
+
+
+def _check_is_valid_type(x: Any, is_input: bool) -> None:
     if not is_input:
         if not isinstance(x, jax.ShapeDtypeStruct):
             raise TypeError("Invalid output value passed.", x)
@@ -138,26 +142,26 @@ def _check_is_valid_type(x, is_input):
 
 
 def build_function_spec(
-    ins,
-    in_tree,
-    outs,
-    out_tree,
-    input_spec,
-    output_spec,
-    input_output_aliases,
-    compile_options,
-    use_static_tensors,
-    kwargs,
-):
+    ins: Any,
+    in_tree: Any,
+    outs: Any,
+    out_tree: Any,
+    input_spec: tuple[TensorSpec, ...],
+    output_spec: tuple[TensorSpec, ...],
+    input_output_aliases: dict[int, int],
+    compile_options: str | None,
+    use_static_tensors: bool,
+    kwargs: dict[str, Any],
+) -> FunctionSpec:
     in_args = []
-    for idx, (arg, spec) in enumerate(zip(ins, input_spec)):
+    for idx, (arg, tensor_spec) in enumerate(zip(ins, input_spec)):
         _check_is_valid_type(arg, is_input=True)
-        in_args.append(Arg(idx, arg.shape, arg.dtype, spec))
+        in_args.append(Arg(idx, arg.shape, arg.dtype, tensor_spec))
 
     out_args = []
-    for idx, (arg, spec) in enumerate(zip(outs, output_spec)):
+    for idx, (arg, tensor_spec) in enumerate(zip(outs, output_spec)):
         _check_is_valid_type(arg, is_input=False)
-        out_args.append(Arg(idx, arg.shape, arg.dtype, spec))
+        out_args.append(Arg(idx, arg.shape, arg.dtype, tensor_spec))
 
     # Return the argument specs to the original pytree structure
     # We need this structure to sanely match index positions of the
@@ -209,7 +213,7 @@ def build_function_spec(
 _compile_lock = threading.Lock()
 
 
-def get_or_compile_kernel(fn, spec):
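+def get_or_compile_kernel(fn: Callable[..., None], spec: FunctionSpec) -> CompileResult:
-    """Gets or compiles fn and returns a CutlassCompileResult.
+    """Gets or compiles fn and returns a CompileResult.
 
-    The function and its specification is used as a key to determine if a new
+    The function and its specification are used as a key to determine if a new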
@@ -253,7 +257,7 @@ def get_or_compile_kernel(fn, spec):
     return result
 
 
-def release_compile_cache():
+def release_compile_cache() -> None:
     """Releases entries from the compile cache.
 
-    Note that is may prevent cute dsl from saving its persistent compilation cache entries.
+    Note that this may prevent cute dsl from saving its persistent compilation cache entries.
diff --git a/python/CuTeDSL/cutlass/jax/ffi.py b/python/CuTeDSL/cutlass/jax/ffi.py
index 8bef8aa38..7d27f2a2f 100644
--- a/python/CuTeDSL/cutlass/jax/ffi.py
+++ b/python/CuTeDSL/cutlass/jax/ffi.py
@@ -9,7 +9,7 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-from typing import Sequence, Optional
+from typing import Any, Optional, Sequence
 from pathlib import Path
 from functools import cache
 import logging
@@ -98,9 +98,9 @@ def find_cute_dsl_runtime_library() -> Optional[str]:
                     return lib
 
         # Otherwise try to search for the library inside the wheel
-        def get_libs_cand(start):
+        def get_libs_cand(start: str | Path) -> list[str]:
             libs_cand = find_libs_in_ancestors(
-                start, [_CUTE_DSL_RUNTIME_LIBRARY_NAME], ["lib"]
+                start, {_CUTE_DSL_RUNTIME_LIBRARY_NAME}, ["lib"]
             )
             if libs_cand:
                 return libs_cand
@@ -128,7 +128,7 @@ def find_cute_dsl_runtime_library() -> Optional[str]:
 _FFI_CALLS_REGISTERED = False
 
 
-def register_ffi(ffi_version: int = get_cutlass_call_ffi_version()):
+def register_ffi(ffi_version: int = get_cutlass_call_ffi_version()) -> None:
     """Registers custom calls with Jax/XLA runtime.
 
     A specific version can be requested using `ffi_version` argument. Attempting
@@ -147,9 +147,11 @@ def register_ffi(ffi_version: int = get_cutlass_call_ffi_version()):
 
     lib = ctypes.CDLL(runtime_library)
 
-    def _register_ffi_targets(lib, targets):
+    def _register_ffi_targets(
+        lib: ctypes.CDLL, targets: dict[str, dict[str, str]]
+    ) -> None:
         for target_name, target in targets.items():
-            handler = {}
+            handler: dict[str, Any] = {}
             for stage, fn_name in target.items():
                 fn = getattr(lib, fn_name)
                 fn.restype = ctypes.c_void_p
@@ -157,9 +159,9 @@ def register_ffi(ffi_version: int = get_cutlass_call_ffi_version()):
             logger.debug(f"Registering ffi handler: {target_name}, {handler}")
             jax.ffi.register_ffi_target(target_name, handler, platform="CUDA")
 
-    def _register_ffi_types(lib, types):
+    def _register_ffi_types(lib: ctypes.CDLL, types: dict[str, dict[str, str]]) -> None:
         for type_name, type_dict_targets in types.items():
-            type_dict = {}
+            type_dict: dict[str, Any] = {}
             for field, fn_name in type_dict_targets.items():
                 fn = getattr(lib, fn_name)
                 fn.restype = ctypes.c_void_p
@@ -181,7 +183,7 @@ def register_ffi(ffi_version: int = get_cutlass_call_ffi_version()):
     _FFI_CALLS_REGISTERED = True
 
 
-def is_ffi_registered():
+def is_ffi_registered() -> bool:
     """Returns true if the FFI calls have been registered with Jax/XLA."""
     global _FFI_CALLS_REGISTERED
     return _FFI_CALLS_REGISTERED
diff --git a/python/CuTeDSL/cutlass/jax/primitive.py b/python/CuTeDSL/cutlass/jax/primitive.py
index 2114d9ce2..ff37b1c84 100644
--- a/python/CuTeDSL/cutlass/jax/primitive.py
+++ b/python/CuTeDSL/cutlass/jax/primitive.py
@@ -43,12 +43,12 @@ def cutlass_call(
     output_spec: Any = None,
     input_mode: Any = None,
     output_mode: Any = None,
-    input_output_aliases=None,
-    allow_cuda_graph=True,
-    compile_options=None,
-    use_static_tensors=False,
-    **kwargs,
-):
+    input_output_aliases: dict[int, int] | None = None,
+    allow_cuda_graph: bool = True,
+    compile_options: str | None = None,
+    use_static_tensors: bool = False,
+    **kwargs: Any,
+) -> Callable[..., Any]:
     """Create a callable that invokes a ``@cute.jit`` function from JAX.
 
     Returns a callable that accepts JAX arrays and dispatches to *fn* as part
@@ -91,7 +91,7 @@ def cutlass_call(
             Indices are into the flattened input and output pytrees.
         allow_cuda_graph: If ``False``, prevents XLA from capturing this call
             in a CUDA graph.  Defaults to ``True``.
-        compile_options: Optional dict of compiler flags forwarded to
+        compile_options: Optional string of compiler flags forwarded to
             ``cute.compile``.
         use_static_tensors: If ``True``, tensor shapes and strides are baked in
             as compile-time constants, improving performance when shapes are
@@ -224,17 +224,17 @@ def _validate_specs(label: str, tensors: list, specs: tuple[TensorSpec, ...]) ->
 
 
 def _cutlass_call_impl(
-    fn,
+    fn: Callable[..., None],
     *,
     output_shape_dtype: Any,
     input_spec: Any,
     output_spec: Any,
-    input_output_aliases,
-    allow_cuda_graph,
-    compile_options,
-    use_static_tensors,
-    **kwargs,
-):
+    input_output_aliases: dict[int, int],
+    allow_cuda_graph: bool,
+    compile_options: str | None,
+    use_static_tensors: bool,
+    **kwargs: Any,
+) -> Callable[..., Any]:
     # A single ShapeDtypeStruct means one output; a sequence means multiple.
     multiple_results = isinstance(output_shape_dtype, Sequence)
     if not multiple_results:
@@ -242,7 +242,7 @@ def _cutlass_call_impl(
     output_shape_dtype_flat, output_tree = jax.tree.flatten(output_shape_dtype)
 
     @jax.jit
-    def call_wrapper(*args):
+    def call_wrapper(*args: Any) -> Any:
         args_flat, args_tree = jax.tree.flatten(args)
 
         input_spec_flat = _resolve_spec_flat(input_spec, args_flat)
@@ -273,25 +273,27 @@ def _cutlass_call_impl(
 
 
 @cutlass_call_inner_p.def_abstract_eval
-def cutlass_call_inner_p_abstract(*_, output_shape_dtype_flat, **__):
+def cutlass_call_inner_p_abstract(
+    *_: Any, output_shape_dtype_flat: Any, **__: Any
+) -> list[Any]:
     return [jax.core.ShapedArray(x.shape, x.dtype) for x in output_shape_dtype_flat]
 
 
 def cutlass_call_inner_p_impl(
-    *args_flat,
-    fn,
+    *args_flat: Any,
+    fn: Callable[..., None],
     args_tree: Any,
-    output_shape_dtype_flat: Any,
+    output_shape_dtype_flat: tuple[Any, ...],
     output_tree: Any,
-    input_spec_flat: Any,
-    output_spec_flat: Any,
-    input_output_aliases,
-    allow_cuda_graph,
-    compile_options,
-    use_static_tensors,
-    **kwargs,
-):
-    input_output_aliases = dict(input_output_aliases)
+    input_spec_flat: tuple[TensorSpec, ...],
+    output_spec_flat: tuple[TensorSpec, ...],
+    input_output_aliases: tuple[tuple[int, int], ...],
+    allow_cuda_graph: bool,
+    compile_options: str | None,
+    use_static_tensors: bool,
+    **kwargs: Any,
+) -> Any:
+    aliases_dict = dict(input_output_aliases)
     spec = build_function_spec(
         args_flat,
         args_tree,
@@ -299,7 +301,7 @@ def cutlass_call_inner_p_impl(
         output_tree,
         input_spec_flat,
         output_spec_flat,
-        input_output_aliases,
+        aliases_dict,
         compile_options,
         use_static_tensors,
         kwargs,
@@ -331,7 +333,7 @@ def cutlass_call_inner_p_impl(
     return fun(*args_flat, module=kernel.module, key=kernel.fingerprint)
 
 
-def _cutlass_call_jvp_rule(*args, **kwargs):
+def _cutlass_call_jvp_rule(*args: Any, **kwargs: Any) -> None:
     del args, kwargs
     raise NotImplementedError(
-        "cutlass_call does not support VJP. Please use `jax.custom_jvp` for taking gradients."
+        "cutlass_call does not support JVP. Please use `jax.custom_jvp` for taking gradients."
@@ -341,7 +343,7 @@ def _cutlass_call_jvp_rule(*args, **kwargs):
 ad.primitive_jvps[cutlass_call_inner_p] = _cutlass_call_jvp_rule
 
 
-def _cutlass_call_transpose_rule(*args, **kwargs):
+def _cutlass_call_transpose_rule(*args: Any, **kwargs: Any) -> None:
     del args, kwargs
     raise NotImplementedError(
         "cutlass_call does not support transpose. Please use `jax.custom_vjp` for taking gradients."
@@ -351,7 +353,7 @@ def _cutlass_call_transpose_rule(*args, **kwargs):
 ad.primitive_transposes[cutlass_call_inner_p] = _cutlass_call_transpose_rule
 
 
-def _cutlass_call_vmap_rule(*args, **kwargs):
+def _cutlass_call_vmap_rule(*args: Any, **kwargs: Any) -> None:
     del args, kwargs
     raise NotImplementedError(
         "cutlass_call does not support batching with jax.vmap. Please "
diff --git a/python/CuTeDSL/cutlass/jax/testing.py b/python/CuTeDSL/cutlass/jax/testing.py
index dbf68ecc8..8cb0519d7 100644
--- a/python/CuTeDSL/cutlass/jax/testing.py
+++ b/python/CuTeDSL/cutlass/jax/testing.py
@@ -9,13 +9,13 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-
 import jax
 import jax.numpy as jnp
 
+from typing import Sequence
+
 import cutlass.cute as cute
 from cutlass.cutlass_dsl import dsl_user_op
-from typing import Optional, Sequence
 from cutlass._mlir import ir
 
 
@@ -103,19 +103,30 @@ def get_gemm_shape_from_tensors(
     a: cute.Tensor,
     b: cute.Tensor,
     *,
-    loc: Optional[ir.Location] = None,
-    ip: Optional[ir.InsertionPoint] = None,
+    loc: ir.Location | None = None,
+    ip: ir.InsertionPoint | None = None,
 ) -> tuple[int, int, int, int]:
     """Returns a tuple of (M, N, K, L) from A/B gemm tensors."""
     # mkl, nkl
-    m, k, l = a.shape[:]  # type: ignore[index]
-    n = b.shape[0]  # type: ignore[index]
+    a_shape = a.shape
+    b_shape = b.shape
+    assert isinstance(a_shape, tuple)
+    assert isinstance(b_shape, tuple)
+    m, k, l = a_shape[:]
+    n = b_shape[0]
     return (m, n, k, l)  # type: ignore[return-value]
 
 
 def create_tensor(
-    shape, dtype, key, *, minval=-2.0, maxval=2.0, fill_value=None, fill_arange=False
-):
+    shape: tuple[int, ...],
+    dtype: jnp.dtype,
+    key: jax.Array,
+    *,
+    minval: float = -2.0,
+    maxval: float = 2.0,
+    fill_value: float | int | None = None,
+    fill_arange: bool = False,
+) -> jax.Array:
     if fill_arange:
         tensor = jnp.ones(shape, dtype=dtype)
         tensor = tensor * jnp.arange(tensor.size, dtype=tensor.dtype).reshape(
@@ -132,17 +143,17 @@ def create_tensor(
 
 
 def create_a_tensor(
-    l,
-    m,
-    k,
-    major,
-    dtype,
-    key,
-    minval=-2.0,
-    maxval=2.0,
-    fill_value=None,
-    fill_arange=False,
-):
+    l: int,
+    m: int,
+    k: int,
+    major: str,
+    dtype: jnp.dtype,
+    key: jax.Array,
+    minval: float = -2.0,
+    maxval: float = 2.0,
+    fill_value: float | int | None = None,
+    fill_arange: bool = False,
+) -> jax.Array:
     shape = gemm_a_shape(l, m, k, major)
     tensor = create_tensor(
         shape,
@@ -157,17 +168,17 @@ def create_a_tensor(
 
 
 def create_b_tensor(
-    l,
-    n,
-    k,
-    major,
-    dtype,
-    key,
-    minval=-2.0,
-    maxval=2.0,
-    fill_value=None,
-    fill_arange=False,
-):
+    l: int,
+    n: int,
+    k: int,
+    major: str,
+    dtype: jnp.dtype,
+    key: jax.Array,
+    minval: float = -2.0,
+    maxval: float = 2.0,
+    fill_value: float | int | None = None,
+    fill_arange: bool = False,
+) -> jax.Array:
     shape = gemm_b_shape(l, n, k, major)
     tensor = create_tensor(
         shape,
@@ -182,18 +193,18 @@ def create_b_tensor(
 
 
 def create_cd_tensor(
-    l,
-    m,
-    n,
-    major,
-    dtype,
-    key,
+    l: int,
+    m: int,
+    n: int,
+    major: str,
+    dtype: jnp.dtype,
+    key: jax.Array,
     *,
-    minval=-2.0,
-    maxval=2.0,
-    fill_value=None,
-    fill_arange=False,
-):
+    minval: float = -2.0,
+    maxval: float = 2.0,
+    fill_value: float | int | None = None,
+    fill_arange: bool = False,
+) -> jax.Array:
     shape = gemm_c_shape(l, m, n, major)
     tensor = create_tensor(
         shape,
@@ -208,17 +219,17 @@ def create_cd_tensor(
 
 
 def gemm_reference_einsum(
-    a,
-    b,
-    acc_dtype,
-    c_dtype,
-    a_major,
-    b_major,
-    c_major,
-    sf_a=None,
-    sf_b=None,
-    precision="highest",
-):
+    a: jax.Array,
+    b: jax.Array,
+    acc_dtype: jnp.dtype,
+    c_dtype: jnp.dtype,
+    a_major: str,
+    b_major: str,
+    c_major: str,
+    sf_a: jax.Array | None = None,
+    sf_b: jax.Array | None = None,
+    precision: str = "highest",
+) -> jax.Array:
     a_idx = gemm_a_major(a_major)
     b_idx = gemm_b_major(b_major)
     c_idx = gemm_c_major(c_major)
@@ -244,8 +255,18 @@ def gemm_reference_einsum(
 
 
 def create_attn_tensors(
-    b, s, hq, hkv, d, dtype, key, *, minval=-2.0, maxval=2.0, fill_value=None
-):
+    b: int,
+    s: int,
+    hq: int,
+    hkv: int,
+    d: int,
+    dtype: jnp.dtype,
+    key: jax.Array,
+    *,
+    minval: float = -2.0,
+    maxval: float = 2.0,
+    fill_value: float | int | None = None,
+) -> tuple[jax.Array, jax.Array, jax.Array]:
     qkey, kkey, vkey = jax.random.split(key, 3)
     return (
         create_tensor(
@@ -275,7 +296,7 @@ def create_attn_tensors(
     )
 
 
-def attn_ref(q, k, v, is_causal: bool):
+def attn_ref(q: jax.Array, k: jax.Array, v: jax.Array, is_causal: bool) -> jax.Array:
     return jax.jit(
         lambda q, k, v: jax.nn.dot_product_attention(
             q, k, v, is_causal=is_causal, implementation="cudnn"
diff --git a/python/CuTeDSL/cutlass/jax/types.py b/python/CuTeDSL/cutlass/jax/types.py
index caf53ffaf..953534ecc 100644
--- a/python/CuTeDSL/cutlass/jax/types.py
+++ b/python/CuTeDSL/cutlass/jax/types.py
@@ -9,7 +9,7 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-from typing import Any, Optional, Sequence
+from typing import Any, Iterator, Sequence, Union, overload
 from dataclasses import dataclass, field
 
 
@@ -126,7 +126,7 @@ class TensorSpec:
     mode: tuple[int, ...] | None = field(metadata=dict(static=True), default=None)
     # If True, shapes and strides are embedded as compile-time constants.
     # Must be False for symbolic/dynamic shapes (e.g. jax.export).
-    static: bool = field(metadata=dict(static=True), default=None)
+    static: bool | None = field(metadata=dict(static=True), default=None)
     # Assumed alignment (bytes) of the data pointer. Default matches XLA's 256-byte alignment.
     ptr_assumed_align: int = field(
         metadata=dict(static=True), default=DEFAULT_CUTLASS_DEVICE_BUFFER_ALIGNMENT
@@ -137,7 +137,7 @@ class TensorSpec:
     )
 
 
-def row_major_layout(shaped):
+def row_major_layout(shaped: Any) -> tuple[int, ...]:
     """Returns the CuTeDSL minor-to-major stride ordering for a row-major (C-contiguous) tensor.
 
     In CuTeDSL convention, ``layout[i]`` is the stride rank of dimension ``i``,
@@ -160,7 +160,7 @@ def row_major_layout(shaped):
     return tuple(reversed(range(len(shaped))))
 
 
-def default_tensor_mode(shaped):
+def default_tensor_mode(shaped: Any) -> tuple[int, ...]:
     """Returns the identity mode permutation for an N-dimensional tensor.
 
     The mode permutation maps JAX input dimensions to ``cute.Layout`` mode
@@ -179,7 +179,7 @@ def default_tensor_mode(shaped):
     return tuple(range(len(shaped)))
 
 
-def default_tensor_spec(shaped) -> TensorSpec:
+def default_tensor_spec(shaped: Any) -> TensorSpec:
     """Returns a :class:`TensorSpec` with row-major layout and identity mode ordering.
 
     Equivalent to::
@@ -223,7 +223,9 @@ def default_tensor_spec(shaped) -> TensorSpec:
 
 
 def _expand_divisibility(
-    divisibility, order: tuple[int, ...], ndim: int
+    divisibility: tuple[int | None, ...] | int | None,
+    order: tuple[int, ...],
+    ndim: int,
 ) -> tuple[int | None, ...] | None:
     """Expand a divisibility spec to a full per-input-dimension tuple.
 
@@ -235,7 +237,7 @@ def _expand_divisibility(
     if divisibility is None or isinstance(divisibility, tuple):
         return divisibility
     leading = order.index(0)
-    result = [None] * ndim
+    result: list[int | None] = [None] * ndim
     result[leading] = divisibility
     return tuple(result)
 
@@ -295,7 +297,7 @@ def jax_to_cutlass_layout_order(
     return tuple(inv)
 
 
-def jax_to_cutlass_dtype(dtype):
+def jax_to_cutlass_dtype(dtype: Any) -> Any:
     """Gets the corresponding cutlass dtype given a jax dtype."""
     dtype = jnp.dtype(dtype)
     if dtype not in JAX_DTYPE_TO_CUTLASS_DTYPE:
@@ -303,14 +305,16 @@ def jax_to_cutlass_dtype(dtype):
     return JAX_DTYPE_TO_CUTLASS_DTYPE[dtype]
 
 
-def cutlass_to_jax_dtype(dtype):
+def cutlass_to_jax_dtype(dtype: Any) -> Any:
-    """Gets the corresponding cutlass dtype given a jax dtype."""
+    """Gets the corresponding jax dtype given a cutlass dtype."""
     if dtype not in CUTLASS_DTYPE_TO_JAX_DTYPE:
         raise ValueError(f"Cutlass dtype [{dtype}] has no equivalent jax dtype.")
     return CUTLASS_DTYPE_TO_JAX_DTYPE[dtype]
 
 
-def from_dlpack(array, assumed_align: int = DEFAULT_CUTLASS_DEVICE_BUFFER_ALIGNMENT):
+def from_dlpack(
+    array: Any, assumed_align: int = DEFAULT_CUTLASS_DEVICE_BUFFER_ALIGNMENT
+) -> Any:
     """Convert jax.Array to a DL pack tensor."""
     return _from_dlpack(array, assumed_align=assumed_align)
 
@@ -361,15 +365,15 @@ class JaxArray:
 
     def __init__(
         self,
-        dtype,
-        shape,
-        mem_space,
-        assumed_align,
-        order=None,
-        mode=None,
-        static=False,
-        divisibility=None,
-    ):
+        dtype: type,
+        shape: Sequence[int | Any],
+        mem_space: AddressSpace,
+        assumed_align: int,
+        order: tuple[int, ...] | None = None,
+        mode: tuple[int, ...] | None = None,
+        static: bool = False,
+        divisibility: tuple[int | None, ...] | int | None = None,
+    ) -> None:
         self.dtype = dtype
         self.shape = tuple(shape)
         self.ndim = len(self.shape)
@@ -395,6 +399,7 @@ class JaxArray:
 
         if divisibility is not None:
             divisibility = _expand_divisibility(divisibility, self.order, self.ndim)
+            assert divisibility is not None
             divisibility = tuple(divisibility)
             if len(divisibility) != len(shape):
                 raise ValueError(
@@ -413,35 +418,35 @@ class JaxArrayValue(JaxArray):
 
     def __init__(
         self,
-        ir_value,
-        dtype,
-        shape,
-        mem_space,
-        assumed_align,
-        order,
-        mode,
-        static,
-        divisibility=None,
-    ):
+        ir_value: ir.Value,
+        dtype: type,
+        shape: Sequence[int | Any],
+        mem_space: AddressSpace,
+        assumed_align: int,
+        order: tuple[int, ...],
+        mode: tuple[int, ...],
+        static: bool,
+        divisibility: tuple[int | None, ...] | int | None = None,
+    ) -> None:
         super().__init__(
             dtype, shape, mem_space, assumed_align, order, mode, static, divisibility
         )
         self.value = ir_value
 
-    def __str__(self):
+    def __str__(self) -> str:
         return f"JaxArrayValue<{self.value}:{self.dtype}:{self.shape}:{self.order}:{self.mode}:{self.static}:{self.divisibility}>"
 
-    def __repr__(self):
+    def __repr__(self) -> str:
         return str(self)
 
     def _make_ordered_layout_dynamic_strides(
         self,
-        shape,
+        shape: tuple[ir.Value, ...],
         order: tuple[int, ...],
         *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ):
+        loc: ir.Location | None = None,
+        ip: ir.InsertionPoint | None = None,
+    ) -> ir.Value:
         i32 = ir.IntegerType.get_signless(32)
 
         # Track the divisibility available for each input dimension. Explicit
@@ -500,11 +505,11 @@ class JaxArrayValue(JaxArray):
 
     def _load_dynamic_shapes(
         self,
-        ffi_buffer,
+        ffi_buffer: ir.Value,
         *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ):
+        loc: ir.Location | None = None,
+        ip: ir.InsertionPoint | None = None,
+    ) -> tuple[ir.Value, ...]:
         i64 = ir.IntegerType.get_signless(64)
         shape_array = llvm.extractvalue(
             llvm.PointerType.get(),
@@ -532,11 +537,11 @@ class JaxArrayValue(JaxArray):
 
     def _load_pointer(
         self,
-        ffi_buffer,
+        ffi_buffer: ir.Value,
         *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ):
+        loc: ir.Location | None = None,
+        ip: ir.InsertionPoint | None = None,
+    ) -> ir.Value:
         raw_ptr = llvm.extractvalue(
             llvm.PointerType.get(),
             ffi_buffer,
@@ -554,11 +559,8 @@ class JaxArrayValue(JaxArray):
         )
 
     def get_tensor(
-        self,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ):
+        self, *, loc: ir.Location | None = None, ip: ir.InsertionPoint | None = None
+    ) -> ir.Value:
         ffi_buffer_type = llvm.StructType.get_literal(
             [llvm.PointerType.get(), llvm.PointerType.get()]
         )
@@ -581,10 +583,10 @@ class JaxArrayValue(JaxArray):
 
         return cute.make_tensor(pointer, layout, loc=loc, ip=ip)
 
-    def __extract_mlir_values__(self):
+    def __extract_mlir_values__(self) -> list[ir.Value]:
         return [self.value]
 
-    def __new_from_mlir_values__(self, values):
+    def __new_from_mlir_values__(self, values: list[ir.Value]) -> "JaxArrayValue":
         return JaxArrayValue(
             values[0],
             self.dtype,
@@ -604,17 +606,17 @@ class JaxTracedArray(JaxArray):
     Traced values are not real tensors or allocated on the device.
     """
 
-    def __str__(self):
+    def __str__(self) -> str:
         return f"JaxTracedArray<{self.dtype}:{self.shape}:{self.order}:{self.mode}:{self.static}:{self.divisibility}>"
 
-    def __repr__(self):
+    def __repr__(self) -> str:
         return str(self)
 
-    def __get_mlir_types__(self):
+    def __get_mlir_types__(self) -> list[ir.Type]:
         # Struct passed as opaque object.
         return [llvm.PointerType.get()]
 
-    def __new_from_mlir_values__(self, values):
+    def __new_from_mlir_values__(self, values: ir.Value) -> JaxArrayValue:
         return JaxArrayValue(
             values,
             self.dtype,
@@ -627,7 +629,7 @@ class JaxTracedArray(JaxArray):
             self.divisibility,
         )
 
-    def __c_pointers__(self):
+    def __c_pointers__(self) -> list[int]:
         return [0]
 
 
@@ -637,28 +639,34 @@ class JaxArrayList:
     the jit boundary.
     """
 
-    def __init__(self, arrays: Sequence[JaxArray]):
+    def __init__(self, arrays: Sequence[JaxArray]) -> None:
         self.arrays = tuple(arrays)
 
-    def __getitem__(self, idx):
+    @overload
+    def __getitem__(self, idx: int) -> JaxArray: ...
+    @overload
+    def __getitem__(self, idx: slice) -> tuple[JaxArray, ...]: ...
+    def __getitem__(
+        self, idx: Union[int, slice]
+    ) -> Union[JaxArray, tuple[JaxArray, ...]]:
         return self.arrays[idx]
 
-    def __len__(self):
+    def __len__(self) -> int:
         return len(self.arrays)
 
-    def __iter__(self):
+    def __iter__(self) -> Iterator[JaxArray]:
         return iter(self.arrays)
 
-    def __c_pointers__(self):
-        return [x.__c_pointers__()[0] for x in self.arrays]
+    def __c_pointers__(self) -> list[int]:
+        return [x.__c_pointers__()[0] for x in self.arrays]  # type: ignore[attr-defined]
 
-    def __get_mlir_types__(self):
-        return [x.__get_mlir_types__()[0] for x in self.arrays]
+    def __get_mlir_types__(self) -> list[ir.Type]:
+        return [x.__get_mlir_types__()[0] for x in self.arrays]  # type: ignore[attr-defined]
 
-    def __extract_mlir_values__(self):
-        return [x.__extract_mlir_values__()[0] for x in self.arrays]
+    def __extract_mlir_values__(self) -> list[ir.Value]:
+        return [x.__extract_mlir_values__()[0] for x in self.arrays]  # type: ignore[attr-defined]
 
-    def __new_from_mlir_values__(self, values):
+    def __new_from_mlir_values__(self, values: list[ir.Value]) -> "JaxArrayList":
         return JaxArrayList(
-            [x.__new_from_mlir_values__(v) for x, v in zip(self.arrays, values)]
+            [x.__new_from_mlir_values__(v) for x, v in zip(self.arrays, values)]  # type: ignore[attr-defined]
         )
diff --git a/python/CuTeDSL/cutlass/pipeline/__init__.py b/python/CuTeDSL/cutlass/pipeline/__init__.py
index afb31d462..b6b56bd9a 100644
--- a/python/CuTeDSL/cutlass/pipeline/__init__.py
+++ b/python/CuTeDSL/cutlass/pipeline/__init__.py
@@ -13,6 +13,7 @@ from .helpers import (
     Agent,
     CooperativeGroup,
     PipelineOp,
+    MbarrierLayout,
     SyncObject,
     MbarrierArray,
     NamedBarrier,
@@ -54,6 +55,7 @@ __all__ = [
     "CooperativeGroup",
     "PipelineOp",
     "SyncObject",
+    "MbarrierLayout",
     "MbarrierArray",
     "NamedBarrier",
     "PipelineOrder",
diff --git a/python/CuTeDSL/cutlass/pipeline/helpers.py b/python/CuTeDSL/cutlass/pipeline/helpers.py
index 7250f4686..c174823c8 100644
--- a/python/CuTeDSL/cutlass/pipeline/helpers.py
+++ b/python/CuTeDSL/cutlass/pipeline/helpers.py
@@ -13,12 +13,13 @@ import enum
 import inspect
 from abc import ABC, abstractmethod
 from dataclasses import dataclass
-from typing import Any, Optional, Union
+from typing import Any, Optional, Union, cast
 import warnings
 
 import cutlass.cute as cute
 from cutlass._mlir import ir
 from cutlass.base_dsl.arch import Arch
+from cutlass.cute.arch.constants import WARP_SIZE
 from cutlass.cutlass_dsl import CuTeDSL, Boolean, Int32, if_generate, dsl_user_op
 
 
@@ -36,9 +37,9 @@ class Agent(enum.Enum):
     Thread = enum.auto()
     # A collection of 32 threads executing in lockstep
     Warp = enum.auto()
-    # Same as AsyncThread, but includes all threads in the block
+    # A collection of threads participating in a threadblock
     ThreadBlock = enum.auto()
-    # Same as AsyncThread, but includes all threads in the cluster
+    # A collection of threads across a cluster
     ThreadBlockCluster = enum.auto()
 
 
@@ -49,39 +50,67 @@ class Agent(enum.Enum):
 
 class CooperativeGroup:
     """
-    CooperativeGroup contains size and alignment restrictions for an Agent.
+    CooperativeGroup contains size restrictions for an Agent.
     """
 
-    def __init__(self, agent: Agent, size: int = 1, alignment: Optional[int] = None):
-        if alignment is not None:
-            warnings.warn(
-                "The 'alignment' parameter of CooperativeGroup's constructor is deprecated and "
-                "will be removed in a subsequent release, please remove it from your code.",
-                DeprecationWarning,
-                stacklevel=2,
-            )
-
-        if agent is Agent.Thread:
-            assert size > 0
-        elif agent is Agent.ThreadBlock:
-            raise NotImplementedError("Error: Not yet supported.")
-        elif agent is Agent.ThreadBlockCluster:
-            raise NotImplementedError("Error: Not yet supported.")
+    def __init__(
+        self, agent: Agent, size: Union[int, Int32] = 1, alignment: Optional[int] = None
+    ):
+        if agent in [
+            Agent.Thread,
+            Agent.Warp,
+            Agent.ThreadBlock,
+            Agent.ThreadBlockCluster,
+        ]:
+            if isinstance(size, int) and size <= 0:
+                raise ValueError(
+                    "Error: The number of threads in a CooperativeGroup must be "
+                    "greater than 0."
+                )
         else:
-            # Should never reach this state
-            size = 0
+            raise ValueError("Unsupported agent type")
 
-        if size <= 0:
-            raise ValueError(
-                "Error: The number of threads in a CooperativeGroup must be more than 0."
-            )
-
-        # Size indicates how many threads are participating in this CooperativeGroup
+        # Size indicates how many agents are participating in this CooperativeGroup
         self.size = size
-        # Agent indicates the type of thread group
+        # Agent indicates the type of thread grouping
         self.agent = agent
 
 
+@dsl_user_op
+def _get_thread_arrive_count(
+    group: "CooperativeGroup",
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> Union[int, Int32]:
+    """Compute the total number of threads represented by ``group``.
+
+    :param group: ``CooperativeGroup`` describing the agent type and count.
+    :return: Number of threads in the cooperative group.
+    :raises NotImplementedError: If ``group.agent`` is not one of the
+        supported agents (Thread / Warp / ThreadBlock / ThreadBlockCluster).
+    """
+    if group.agent is Agent.Thread:
+        return group.size
+    elif group.agent is Agent.Warp:
+        return group.size * WARP_SIZE
+    elif group.agent is Agent.ThreadBlock:
+        bdim_x, bdim_y, bdim_z = cute.arch.block_dim(loc=loc, ip=ip)
+        return group.size * bdim_x * bdim_y * bdim_z
+    elif group.agent is Agent.ThreadBlockCluster:
+        bdim_x, bdim_y, bdim_z = cute.arch.block_dim(loc=loc, ip=ip)
+        return (
+            group.size
+            * cute.arch.cluster_size(loc=loc, ip=ip)
+            * bdim_x
+            * bdim_y
+            * bdim_z
+        )
+    raise NotImplementedError(
+        f"Error: Unsupported agent type for arrive count: {group.agent}"
+    )
+
+
 ##############################################################################
 # PipelineOp class
 ##############################################################################
@@ -111,6 +140,12 @@ def _get_pipeline_op(type_str: int | PipelineOp) -> PipelineOp:
     return PipelineOp(type_str)
 
 
+class MbarrierLayout(enum.Enum):
+    """Layout of mbarrier used for synchronization."""
+
+    # Transactional mbarrier
+    V0 = enum.auto()
+
 ##############################################################################
 # SyncObject class
 ##############################################################################
@@ -124,11 +159,11 @@ class SyncObject(ABC):
     """
 
     @abstractmethod
-    def arrive(self) -> None:
+    def arrive(self, *args: Any, **kwargs: Any) -> None:
         pass
 
     @abstractmethod
-    def wait(self) -> None:
+    def wait(self, *args: Any, **kwargs: Any) -> None:
         pass
 
     @abstractmethod
@@ -165,6 +200,8 @@ class MbarrierArray(SyncObject):
         num_stages: int,
         agent: tuple[PipelineOp, CooperativeGroup],
         tx_count: int = 0,
+        mbarrier_layout: MbarrierLayout = MbarrierLayout.V0,
+        name: str = "",
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
@@ -173,17 +210,18 @@ class MbarrierArray(SyncObject):
         self.tx_count = tx_count
         self.num_stages = num_stages
         self.op_type, self.cg = agent
-        self.arrive_count = self.cg.size
+        self.arrive_count = _get_thread_arrive_count(self.cg, loc=loc, ip=ip)
+        self.mbarrier_layout = mbarrier_layout
+        self.name = name
 
         if self.num_stages <= 0:
             raise ValueError("Error: Mbarrier stage count must be greater than 0.")
-        if self.arrive_count <= 0:
+        if isinstance(self.arrive_count, int) and self.arrive_count <= 0:
             raise ValueError("Error: Mbarrier arrive count must be greater than 0.")
         if self.op_type is PipelineOp.TmaLoad and self.tx_count < 0:
             raise ValueError(
                 "Error: Mbarrier tx count must not be less than 0 for TMA ops."
             )
-
         # Store mbarrier base pointer
         self.mbarrier_base = self.barrier_storage
 
@@ -205,8 +243,29 @@ class MbarrierArray(SyncObject):
         new_mbarrier_array.tx_count = self.tx_count
         new_mbarrier_array.arrive_count = self.arrive_count
         new_mbarrier_array.mbarrier_base = self.mbarrier_base
+        new_mbarrier_array.mbarrier_layout = self.mbarrier_layout
+        new_mbarrier_array.name = self.name
         return new_mbarrier_array
 
+    def _mbar_scope(self, op: str) -> Any:
+        """Return a Scope context manager for barrier identification.
+
+        Format: ``name:op`` (e.g. ``smem_kv:wait``, ``tmem_sp0:arrive``).
+        Profiling tools group by the ``name`` prefix and classify by the ``op`` suffix.
+
+        Usage::
+
+            with self._mbar_scope("wait"):
+                cute.arch.mbarrier_wait(...)
+        """
+        if not getattr(self, "name", None):
+            from contextlib import nullcontext
+
+            return nullcontext()
+        from cutlass.utils.profiling import Scope
+
+        return Scope(f"{self.name}:{op}")
+
     # Mbarrier initialization
     @dsl_user_op
     def mbarrier_init(
@@ -218,17 +277,17 @@ class MbarrierArray(SyncObject):
         """
         Initializes an array of mbarriers using warp 0.
         """
-
         def then_body() -> None:
             use_uniform_mbarrier_init = True
             if use_uniform_mbarrier_init:
-                for index in range(self.num_stages):
-                    cute.arch.mbarrier_init(
-                        self.get_barrier(index, loc=loc, ip=ip),
-                        self.arrive_count,
-                        loc=loc,
-                        ip=ip,
-                    )
+                with cute.arch.elect_one(loc=loc, ip=ip):
+                    for index in range(self.num_stages):
+                        cute.arch.mbarrier_init(
+                            self.get_barrier(index, loc=loc, ip=ip),
+                            self.arrive_count,
+                            loc=loc,
+                            ip=ip,
+                        )
         warp_idx = cute.arch.warp_idx(loc=loc, ip=ip)
         warp_idx = cute.arch.make_warp_uniform(warp_idx, loc=loc, ip=ip)
 
@@ -258,26 +317,28 @@ class MbarrierArray(SyncObject):
         :param cta_group: CTA group for ``TCGen05Mma``, defaults to None for other op types
         :type cta_group: ``cute.nvgpu.tcgen05.CtaGroup``, optional
         """
-        if self.op_type is PipelineOp.AsyncThread:
-            self.arrive_mbarrier(index, dst, loc=loc, ip=ip)
-        elif self.op_type is PipelineOp.TCGen05Mma:
-            assert cta_group is not None, (
-                "Error: CTA group must be provided for TCGen05Mma."
-            )
-            self.arrive_tcgen05mma(index, dst, cta_group, loc=loc, ip=ip)
-        elif self.op_type in [PipelineOp.TmaLoad]:
-            # TMA operation signals local mbarrier only
-            self.arrive_and_expect_tx(index, self.tx_count, loc=loc, ip=ip)
-        elif self.op_type in [PipelineOp.ClcLoad]:
-            self.arrive_and_expect_tx_with_dst(
-                index, self.tx_count, dst, loc=loc, ip=ip
-            )
-        elif self.op_type is PipelineOp.AsyncLoad:
-            self.arrive_cp_async_mbarrier(index, loc=loc, ip=ip)
-        else:
-            assert False, (
-                f"Error: MbarrierArray is not supported for PipelineOp: {_get_pipeline_op(self.op_type)}."
-            )
+        with self._mbar_scope("arrive"):
+            if self.op_type is PipelineOp.AsyncThread:
+                self.arrive_mbarrier(index, dst, loc=loc, ip=ip)
+            elif self.op_type is PipelineOp.TCGen05Mma:
+                assert cta_group is not None, (
+                    "Error: CTA group must be provided for TCGen05Mma."
+                )
+                self.arrive_tcgen05mma(index, dst, cta_group, loc=loc, ip=ip)
+            elif self.op_type in [PipelineOp.TmaLoad]:
+                # TMA operation signals local mbarrier only
+                self.arrive_and_expect_tx(index, self.tx_count, loc=loc, ip=ip)
+            elif self.op_type in [PipelineOp.ClcLoad]:
+                # Multiple threads in CTA 0 each signal the mbarrier of a different remote CTA in the cluster
+                self.arrive_and_expect_tx_with_dst(
+                    index, self.tx_count, dst, loc=loc, ip=ip
+                )
+            elif self.op_type is PipelineOp.AsyncLoad:
+                self.arrive_cp_async_mbarrier(index, loc=loc, ip=ip)
+            else:
+                assert False, (
+                    f"Error: MbarrierArray is not supported for PipelineOp: {_get_pipeline_op(self.op_type)}."
+                )
 
     @dsl_user_op
     def arrive_mbarrier(
@@ -371,7 +432,21 @@ class MbarrierArray(SyncObject):
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> Boolean:
-        return cute.arch.mbarrier_try_wait(
+        with self._mbar_scope("try_wait"):
+            return cute.arch.mbarrier_try_wait(
+                self.get_barrier(index, loc=loc, ip=ip), phase, loc=loc, ip=ip
+            )
+
+    @dsl_user_op
+    def test_wait(
+        self,
+        index: int,
+        phase: int,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+    ) -> Boolean:
+        return cute.arch.mbarrier_test_wait(
             self.get_barrier(index, loc=loc, ip=ip), phase, loc=loc, ip=ip
         )
 
@@ -384,10 +459,18 @@ class MbarrierArray(SyncObject):
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> Optional[tuple]:
-        cute.arch.mbarrier_wait(
-            self.get_barrier(index, loc=loc, ip=ip), phase, loc=loc, ip=ip
-        )
-        return None
+        """
+        Wait on mbarrier.
+        Uses ``mbarrier_wait`` and returns None.
+
+        :param index: Index of the mbarrier in the array
+        :param phase: Phase/parity to wait for (0 or 1)
+        """
+        with self._mbar_scope("wait"):
+            cute.arch.mbarrier_wait(
+                self.get_barrier(index, loc=loc, ip=ip), phase, loc=loc, ip=ip
+            )
+            return None
 
     @dsl_user_op
     def arrive_and_wait(
@@ -423,17 +506,25 @@ class MbarrierArray(SyncObject):
         return self.mbarrier_base + index
 
     def max(self) -> int:
-        # Transaction barriers have a maximum arrive count of 511 (2^9 - 1).
-        # Non-transaction barriers have a maximum arrive count of 1,048,575 (2^20 - 1).
-        return 511
+        # Transactional barriers have a maximum arrive count of (2^20 - 1).
+        return (1 << 20) - 1
 
     def __extract_mlir_values__(self) -> list[object]:
         return [self.barrier_storage]
 
     def __new_from_mlir_values__(self, values: list[object]) -> "MbarrierArray":
-        return MbarrierArray(
-            values[0], self.num_stages, (self.op_type, self.cg), self.tx_count
-        )
+        new_mbarrier_array = object.__new__(MbarrierArray)
+        barrier_ptr = cast(cute.Pointer, values[0])
+        new_mbarrier_array.barrier_storage = barrier_ptr
+        new_mbarrier_array.op_type = self.op_type
+        new_mbarrier_array.cg = self.cg
+        new_mbarrier_array.num_stages = self.num_stages
+        new_mbarrier_array.tx_count = self.tx_count
+        new_mbarrier_array.arrive_count = self.arrive_count
+        new_mbarrier_array.mbarrier_base = barrier_ptr
+        new_mbarrier_array.mbarrier_layout = self.mbarrier_layout
+        new_mbarrier_array.name = self.name
+        return new_mbarrier_array
 
 
 # Set explicit signature for Sphinx documentation to avoid issues with @dsl_user_op decorator
@@ -444,7 +535,6 @@ MbarrierArray.__init__.__signature__ = inspect.Signature(  # type: ignore[attr-d
 )
 
 
-
 ##############################################################################
 # NamedBarrier class
 ##############################################################################
@@ -483,6 +573,7 @@ class NamedBarrier(SyncObject):
         cute.arch.barrier_arrive(
             barrier_id=self.barrier_id,
             number_of_threads=self.num_threads,
+            aligned=True,
             loc=loc,
             ip=ip,
         )
@@ -500,6 +591,7 @@ class NamedBarrier(SyncObject):
         cute.arch.barrier_arrive(
             barrier_id=self.barrier_id,
             number_of_threads=self.num_threads,
+            aligned=False,
             loc=loc,
             ip=ip,
         )
@@ -690,14 +782,23 @@ class PipelineState:
         self._index = index
         self._phase = phase
 
-    def clone(self) -> "PipelineState":
-        return PipelineState(self.stages, self._count, self.index, self.phase)
+    @dsl_user_op
+    @cute.jit
+    def clone(
+        self,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+    ) -> "PipelineState":
+        return PipelineState(self._stages, self._count, self._index, self._phase)
 
     @property
+    @cute.jit
     def index(self) -> Int32:
         return self._index
 
     @property
+    @cute.jit
     def count(self) -> Int32:
         return self._count
 
@@ -706,10 +807,12 @@ class PipelineState:
         return self._stages
 
     @property
+    @cute.jit
     def phase(self) -> Int32:
         return self._phase
 
     @dsl_user_op
+    @cute.jit
     def reset_count(
         self,
         *,
@@ -719,6 +822,7 @@ class PipelineState:
         self._count = Int32(0, loc=loc, ip=ip)
 
     @dsl_user_op
+    @cute.jit
     def advance(
         self,
         *,
@@ -740,13 +844,14 @@ class PipelineState:
             self._index == self.stages,
             then_body,
             else_body,
-            [self.index, self.phase],  # type: ignore[list-item]
+            [self.index, self.phase],
             [Int32, Int32],
             loc=loc,
             ip=ip,
         )
 
     @dsl_user_op
+    @cute.jit
     def reverse(
         self,
         *,
@@ -768,20 +873,17 @@ class PipelineState:
             self._index == -1,
             then_body,
             else_body,
-            [self.index, self.phase],  # type: ignore[list-item]
+            [self.index, self.phase],
             [Int32, Int32],
             loc=loc,
             ip=ip,
         )
 
     def __get_mlir_types__(self) -> list[ir.Type]:
-        return [self._count.type, self._index.type, self._phase.type]  # type: ignore[attr-defined]
+        return [self.count.type, self.index.type, self.phase.type]
 
     def __extract_mlir_values__(self) -> list[ir.Value]:
-        count = self._count
-        index = self._index
-        phase = self._phase
-        return [count.ir_value(), index.ir_value(), phase.ir_value()]
+        return [self.count.ir_value(), self.index.ir_value(), self.phase.ir_value()]
 
     # This can be overridden by derived classes
     def __new_from_mlir_values__(self, values: list[ir.Value]) -> "PipelineState":
@@ -925,7 +1027,11 @@ def arrive(
     same instruction. See PTX documentation.
     """
     cute.arch.barrier_arrive(
-        barrier_id=barrier_id, number_of_threads=num_threads, loc=loc, ip=ip
+        barrier_id=barrier_id,
+        number_of_threads=num_threads,
+        aligned=True,
+        loc=loc,
+        ip=ip,
     )
 
 
@@ -941,7 +1047,11 @@ def arrive_unaligned(
     The unaligned flavor of arrive can be used with an arbitrary number of threads in the CTA.
     """
     cute.arch.barrier_arrive(
-        barrier_id=barrier_id, number_of_threads=num_threads, loc=loc, ip=ip
+        barrier_id=barrier_id,
+        number_of_threads=num_threads,
+        aligned=False,
+        loc=loc,
+        ip=ip,
     )
 
 
diff --git a/python/CuTeDSL/cutlass/pipeline/profiling.py b/python/CuTeDSL/cutlass/pipeline/profiling.py
new file mode 100644
index 000000000..4aca1fce7
--- /dev/null
+++ b/python/CuTeDSL/cutlass/pipeline/profiling.py
@@ -0,0 +1,76 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# Use of this software is governed by the terms and conditions of the
+# NVIDIA End User License Agreement (EULA), available at:
+# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
+#
+# Any use, reproduction, disclosure, or distribution of this software
+# and related documentation outside the scope permitted by the EULA
+# is strictly prohibited.
+
+"""Pipeline-level profiling helpers.
+
+These helpers register pipeline barriers in the shared symbol registry
+(defined in ``cutlass.utils.profiling``) and dump ``kernel_symbols.json``
+for downstream profiling tools. They are deliberately kept in their own
+module — the pipeline helper file holds MLIR-emitting primitives, not
+profiling bookkeeping.
+"""
+
+import json
+import os
+from typing import Optional
+
+from cutlass.utils.profiling import (
+    dump_kernel_symbols,
+    register_symbol,
+    reset_symbol_registry,
+)
+
+
+def register_barrier(name: str, num_stages: int, pipeline_type: str) -> None:
+    """Register a named barrier in the global symbol registry."""
+    if name:
+        register_symbol(
+            name, kind="barrier", num_stages=num_stages, pipeline_type=pipeline_type
+        )
+
+
+def dump_barrier_registry() -> dict:
+    """Return the symbol registry contents, including barrier data (kept for backward compatibility)."""
+    return dump_kernel_symbols()
+
+
+def reset_barrier_registry() -> None:
+    """Clear the symbol registry."""
+    reset_symbol_registry()
+
+
+def dump_profiling_metadata(dump_dir: str, extra: Optional[dict] = None) -> None:
+    """Dump ``kernel_symbols.json`` to ``dump_dir``.
+
+    Call after ``cute.compile()`` returns. Merges the symbol registry
+    (barriers + warps) with any extra metadata.
+
+    :param dump_dir: Output directory.
+    :type dump_dir: str
+    :param extra: Additional fields to merge.
+    :type extra: dict, optional
+    :raises RuntimeError: If the output file cannot be written.
+    """
+    data = dump_kernel_symbols()
+    if not data.get("bar_alloc") and not data.get("warps"):
+        return
+    if extra:
+        data.update(extra)
+    os.makedirs(dump_dir, exist_ok=True)
+    path = os.path.join(dump_dir, "kernel_symbols.json")
+    try:
+        with open(path, "w") as f:
+            json.dump(data, f, indent=2)
+            f.write("\n")
+    except OSError as e:
+        raise RuntimeError(
+            f"Failed to write kernel_symbols.json to {path!r}: {e}"
+        ) from e
diff --git a/python/CuTeDSL/cutlass/pipeline/sm100.py b/python/CuTeDSL/cutlass/pipeline/sm100.py
index 26b4c5687..3d1d132f1 100644
--- a/python/CuTeDSL/cutlass/pipeline/sm100.py
+++ b/python/CuTeDSL/cutlass/pipeline/sm100.py
@@ -10,17 +10,19 @@
 # is strictly prohibited.
 
 from dataclasses import dataclass
-from typing import Optional
+from typing import TYPE_CHECKING, Literal, Optional, cast
 
 from cutlass._mlir import ir
 
 import cutlass
 import cutlass.cute as cute
-from cutlass.cutlass_dsl import Boolean, Int32, if_generate, dsl_user_op
-
+from cutlass.cute.arch.constants import WARP_SIZE
+from cutlass.cute.core import is_static
+from cutlass.cutlass_dsl import BaseDSL, Boolean, Int32, if_generate, dsl_user_op
 from cutlass.pipeline import (
     Agent,
     CooperativeGroup,
+    MbarrierLayout,
     PipelineOp,
     SyncObject,
     MbarrierArray,
@@ -29,6 +31,11 @@ from cutlass.pipeline import (
     PipelineAsync,
     agent_sync,
 )
+from cutlass.pipeline.helpers import _get_thread_arrive_count
+from cutlass.pipeline.profiling import register_barrier
+
+if TYPE_CHECKING:
+    from cutlass.pipeline.sm90 import PipelineConsumer, PipelineProducer
 
 ##############################################################################
 # Pipeline classes
@@ -51,6 +58,9 @@ class PipelineTmaUmma(PipelineAsync):
         num_stages: int,
         agent: tuple[PipelineOp, CooperativeGroup],
         tx_count: int = 0,
+        mbarrier_layout: MbarrierLayout = MbarrierLayout.V0,
+        name: str = "",
+        phase: Literal["", "full", "empty"] = "",
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
@@ -58,6 +68,7 @@ class PipelineTmaUmma(PipelineAsync):
         """
         Returns a SyncObject corresponding to an agent's PipelineOp.
         """
+        full_name = f"{name}.{phase}" if name and phase else name
         if agent[0] in [
             PipelineOp.AsyncThread,
             PipelineOp.TmaLoad,
@@ -71,6 +82,8 @@ class PipelineTmaUmma(PipelineAsync):
                 num_stages=num_stages,
                 agent=agent,
                 tx_count=tx_count,
+                mbarrier_layout=mbarrier_layout,
+                name=full_name,
                 loc=loc,
                 ip=ip,
             )
@@ -170,7 +183,9 @@ class PipelineTmaUmma(PipelineAsync):
         barrier_storage: Optional[cute.Pointer] = None,
         cta_layout_vmnk: Optional[cute.Layout] = None,
         mcast_mode_mn: tuple[int, int] = (1, 1),
+        enable_multicast_signaling: bool = False,
         defer_sync: bool = False,
+        name: str = "",
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> "PipelineTmaUmma":
@@ -190,6 +205,10 @@ class PipelineTmaUmma(PipelineAsync):
         :type cta_layout_vmnk: cute.Layout, optional
         :param mcast_mode_mn: Tuple specifying multicast modes for m and n dimensions (each 0 or 1)
         :type mcast_mode_mn: tuple[int, int], optional
+        :param enable_multicast_signaling: See docstring in PipelineTmaAsync.create() for details
+        :type enable_multicast_signaling: bool, optional
+        :param defer_sync: Bool specifying whether or not to skip the built-in mbarrier fence and sync for performance, defaults to False
+        :type defer_sync: bool, optional
         :raises ValueError: If barrier_storage is not a cute.Pointer instance
         :return: A new PipelineTmaUmma instance configured with the provided parameters
         :rtype: PipelineTmaUmma
@@ -199,17 +218,53 @@ class PipelineTmaUmma(PipelineAsync):
                 f"Expected barrier_storage to be a cute.Pointer, but got {type(barrier_storage)}"
             )
 
+        if not is_static(cta_layout_vmnk):
+            raise ValueError("The cluster shape (cta_layout_vmnk) needs to be static.")
+
+        if cta_layout_vmnk is None:
+            cta_layout_vmnk = cute.make_layout((1, 1, 1, 1))
+
         producer_type = PipelineOp.TmaLoad
         consumer_type = PipelineOp.TCGen05Mma
 
         producer = (producer_type, producer_group)
-        consumer = (consumer_type, consumer_group)
+
+        if enable_multicast_signaling:
+            consumer_thread_arrive_cnt = _get_thread_arrive_count(consumer_group)
+
+            if (
+                isinstance(consumer_thread_arrive_cnt, int)
+                and consumer_thread_arrive_cnt % WARP_SIZE != 0
+            ):
+                raise ValueError(
+                    "Error: Consumer arrival count must be a multiple of the warp size"
+                )
+
+            shape_vmnk = cast(tuple[int, ...], cta_layout_vmnk.shape)
+            mcast_m_size = shape_vmnk[2] if mcast_mode_mn[0] else 0
+            mcast_n_size = shape_vmnk[1] if mcast_mode_mn[1] else 0
+            overlap = 1 if (mcast_mode_mn[0] and mcast_mode_mn[1]) else 0
+            mcast_size = mcast_m_size + mcast_n_size - overlap
+            assert mcast_size > 0, "Mcast size must be greater than 0."
+
+            num_warps = consumer_thread_arrive_cnt // WARP_SIZE
+            num_signaling_threads = mcast_size * num_warps
+
+            thread_consumer_group = CooperativeGroup(
+                Agent.Thread, num_signaling_threads
+            )
+        else:
+            thread_consumer_group = consumer_group
+
+        consumer = (consumer_type, thread_consumer_group)
 
         sync_object_full = PipelineTmaUmma._make_sync_object(
             barrier_storage.align(min_align=8),
             num_stages,
             producer,
             tx_count,
+            name=name,
+            phase="full",
             loc=loc,
             ip=ip,
         )
@@ -217,11 +272,16 @@ class PipelineTmaUmma(PipelineAsync):
             barrier_storage.align(min_align=8) + num_stages,
             num_stages,
             consumer,
+            name=name,
+            phase="empty",
             loc=loc,
             ip=ip,
         )
 
-        if cta_layout_vmnk is None or cute.size(cta_layout_vmnk, loc=loc, ip=ip) == 1:
+        if name:
+            register_barrier(name, num_stages, "PipelineTmaUmma")
+
+        if cute.size(cta_layout_vmnk, loc=loc, ip=ip) == 1:
             # No mcast mask if not using clusters
             producer_mask = None
             # All threadblocks are leaders if not using clusters
@@ -236,8 +296,7 @@ class PipelineTmaUmma(PipelineAsync):
 
         cta_group = (
             cute.nvgpu.tcgen05.CtaGroup.ONE
-            if cta_layout_vmnk is None
-            or cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip) == 1
+            if cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip) == 1
             else cute.nvgpu.tcgen05.CtaGroup.TWO
         )
 
@@ -245,10 +304,7 @@ class PipelineTmaUmma(PipelineAsync):
 
         if not defer_sync:
             cute.arch.mbarrier_init_fence()
-            if (
-                cta_layout_vmnk is None
-                or cute.size(cta_layout_vmnk, loc=loc, ip=ip) == 1
-            ):
+            if cute.size(cta_layout_vmnk, loc=loc, ip=ip) == 1:
                 agent_sync(Agent.ThreadBlock)
             else:
                 agent_sync(Agent.ThreadBlockCluster, is_relaxed=True)
@@ -257,7 +313,7 @@ class PipelineTmaUmma(PipelineAsync):
             sync_object_full,
             sync_object_empty,
             num_stages,
-            producer_mask,
+            producer_mask,  # unused
             consumer_mask,
             is_leader_cta,
             cta_group,
@@ -274,7 +330,7 @@ class PipelineTmaUmma(PipelineAsync):
         """
         UMMA consumer release buffer empty, cta_group needs to be provided.
         """
-        self.sync_object_empty.arrive(  # type: ignore[call-arg]
+        self.sync_object_empty.arrive(
             state.index, self.consumer_mask, self.cta_group, loc=loc, ip=ip
         )
 
@@ -283,25 +339,34 @@ class PipelineTmaUmma(PipelineAsync):
         state: PipelineState,
         try_acquire_token: Optional[Boolean] = None,
         *,
+        expected_tx: Optional[Int32] = None,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
         """
-        TMA producer commit conditionally waits on buffer empty and sets the transaction barrier for leader threadblocks.
+        TMA producer conditionally waits on buffer empty and sets the transaction barrier for leader threadblocks.
+
+        :param expected_tx: Override the expected transaction byte count for this
+            acquire. When ``None`` (default), uses the ``tx_count`` from barrier init.
+            Pass a dynamic value for workloads where the byte count varies per
+            iteration (e.g. sparse GEMM with conditional metadata loading).
         """
         if_generate(
             try_acquire_token is None or try_acquire_token == 0,
-            lambda: self.sync_object_empty.wait(  # type: ignore[call-arg]
+            lambda: self.sync_object_empty.wait(
                 state.index, state.phase, loc=loc, ip=ip
             ),
             loc=loc,
             ip=ip,
         )
+        tx = self.sync_object_full.tx_count if expected_tx is None else expected_tx  # type: ignore[attr-defined]
+
+        def arrive_body() -> None:
+            self.sync_object_full.arrive_and_expect_tx(state.index, tx, loc=loc, ip=ip)  # type: ignore[attr-defined]
+
         if_generate(
             self.is_leader_cta,
-            lambda: self.sync_object_full.arrive(  # type: ignore[call-arg]
-                state.index, self.producer_mask, loc=loc, ip=ip
-            ),
+            arrive_body,
             loc=loc,
             ip=ip,
         )
@@ -313,7 +378,6 @@ class PipelineTmaUmma(PipelineAsync):
         pass
 
 
-
 @dataclass(frozen=True)
 class PipelineAsyncUmma(PipelineAsync):
     """
@@ -405,6 +469,7 @@ class PipelineAsyncUmma(PipelineAsync):
         barrier_storage: Optional[cute.Pointer] = None,
         cta_layout_vmnk: Optional[cute.Layout] = None,
         defer_sync: bool = False,
+        name: str = "",
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> "PipelineAsyncUmma":
@@ -429,6 +494,12 @@ class PipelineAsyncUmma(PipelineAsync):
                 f"Expected barrier_storage to be a cute.Pointer, but got {type(barrier_storage)}"
             )
 
+        if not is_static(cta_layout_vmnk):
+            raise ValueError("The cluster shape (cta_layout_vmnk) needs to be static.")
+
+        if cta_layout_vmnk is None:
+            cta_layout_vmnk = cute.make_layout((1, 1, 1, 1))
+
         producer_type = PipelineOp.AsyncThread
         consumer_type = PipelineOp.TCGen05Mma
 
@@ -439,6 +510,8 @@ class PipelineAsyncUmma(PipelineAsync):
             barrier_storage.align(min_align=8),
             num_stages,
             producer,
+            name=name,
+            phase="full",
             loc=loc,
             ip=ip,
         )
@@ -446,25 +519,22 @@ class PipelineAsyncUmma(PipelineAsync):
             barrier_storage.align(min_align=8) + num_stages,
             num_stages,
             consumer,
+            name=name,
+            phase="empty",
             loc=loc,
             ip=ip,
         )
 
-        cta_v_size = (
-            cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip)
-            if cta_layout_vmnk is not None
-            else 1
-        )
+        if name:
+            register_barrier(name, num_stages, "PipelineAsyncUmma")
+
+        cta_v_size = cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip)
         cta_group = (
             cute.nvgpu.tcgen05.CtaGroup.ONE
-            if cta_layout_vmnk is None
-            or cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip) == 1
+            if cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip) == 1
             else cute.nvgpu.tcgen05.CtaGroup.TWO
         )
-        if (
-            cta_layout_vmnk is None
-            or cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip) == 1
-        ):
+        if cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip) == 1:
             # No mcast mask if we're not using 2CTA tcgen05 MMA
             producer_mask = None
             consumer_mask = None
@@ -481,10 +551,7 @@ class PipelineAsyncUmma(PipelineAsync):
 
         if not defer_sync:
             cute.arch.mbarrier_init_fence()
-            if (
-                cta_layout_vmnk is None
-                or cute.size(cta_layout_vmnk, loc=loc, ip=ip) == 1
-            ):
+            if cute.size(cta_layout_vmnk, loc=loc, ip=ip) == 1:
                 agent_sync(Agent.ThreadBlock)
             else:
                 agent_sync(Agent.ThreadBlockCluster, is_relaxed=True)
@@ -509,7 +576,9 @@ class PipelineAsyncUmma(PipelineAsync):
         """
         UMMA consumer release buffer empty, cta_group needs to be provided.
         """
-        self.sync_object_empty.arrive(state.index, self.consumer_mask, self.cta_group)  # type: ignore[call-arg]
+        self.sync_object_empty.arrive(
+            state.index, self.consumer_mask, self.cta_group, loc=loc, ip=ip
+        )
 
 
 @dataclass(frozen=True)
@@ -546,7 +615,9 @@ class PipelineUmmaAsync(PipelineAsync):
     @dsl_user_op
     @staticmethod
     def _compute_peer_cta_rank(
-        *, loc: Optional[ir.Location] = None, ip: Optional[ir.InsertionPoint] = None
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
     ) -> Int32:
         """
         Computes a mask to signal release of tmem buffers for 2CTA kernels.
@@ -568,6 +639,7 @@ class PipelineUmmaAsync(PipelineAsync):
         barrier_storage: Optional[cute.Pointer] = None,
         cta_layout_vmnk: Optional[cute.Layout] = None,
         defer_sync: bool = False,
+        name: str = "",
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> "PipelineUmmaAsync":
@@ -592,6 +664,12 @@ class PipelineUmmaAsync(PipelineAsync):
                 f"Expected barrier_storage to be a cute.Pointer, but got {type(barrier_storage)}"
             )
 
+        if not is_static(cta_layout_vmnk):
+            raise ValueError("The cluster shape (cta_layout_vmnk) needs to be static.")
+
+        if cta_layout_vmnk is None:
+            cta_layout_vmnk = cute.make_layout((1, 1, 1, 1))
+
         producer_type = PipelineOp.TCGen05Mma
         consumer_type = PipelineOp.AsyncThread
 
@@ -599,17 +677,28 @@ class PipelineUmmaAsync(PipelineAsync):
         consumer = (consumer_type, consumer_group)
 
         sync_object_full = PipelineTmaUmma._make_sync_object(
-            barrier_storage.align(min_align=8), num_stages, producer, loc=loc, ip=ip
+            barrier_storage.align(min_align=8),
+            num_stages,
+            producer,
+            name=name,
+            phase="full",
+            loc=loc,
+            ip=ip,
         )
         sync_object_empty = PipelineTmaUmma._make_sync_object(
             barrier_storage.align(min_align=8) + num_stages,
             num_stages,
             consumer,
+            name=name,
+            phase="empty",
             loc=loc,
             ip=ip,
         )
 
-        if cta_layout_vmnk is None or cute.size(cta_layout_vmnk, loc=loc, ip=ip) == 1:
+        if name:
+            register_barrier(name, num_stages, "PipelineUmmaAsync")
+
+        if cute.size(cta_layout_vmnk, loc=loc, ip=ip) == 1:
             # Set mask to None if not using clusters (i.e. 1CTA kernels)
             producer_mask = None
         else:
@@ -617,10 +706,7 @@ class PipelineUmmaAsync(PipelineAsync):
                 cta_layout_vmnk, loc=loc, ip=ip
             )
 
-        if (
-            cta_layout_vmnk is None
-            or cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip) == 1
-        ):
+        if cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip) == 1:
             # Set mask to None if not using 2CTA instructions
             consumer_mask = None
         else:
@@ -628,17 +714,13 @@ class PipelineUmmaAsync(PipelineAsync):
 
         cta_group = (
             cute.nvgpu.tcgen05.CtaGroup.ONE
-            if cta_layout_vmnk is None
-            or cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip) == 1
+            if cute.size(cta_layout_vmnk, mode=[0], loc=loc, ip=ip) == 1
             else cute.nvgpu.tcgen05.CtaGroup.TWO
         )
 
         if not defer_sync:
             cute.arch.mbarrier_init_fence()
-            if (
-                cta_layout_vmnk is None
-                or cute.size(cta_layout_vmnk, loc=loc, ip=ip) == 1
-            ):
+            if cute.size(cta_layout_vmnk, loc=loc, ip=ip) == 1:
                 agent_sync(Agent.ThreadBlock)
             else:
                 agent_sync(Agent.ThreadBlockCluster, is_relaxed=True)
@@ -663,7 +745,7 @@ class PipelineUmmaAsync(PipelineAsync):
         """
         UMMA producer commit buffer full, cta_group needs to be provided.
         """
-        self.sync_object_full.arrive(  # type: ignore[call-arg]
+        self.sync_object_full.arrive(
             state.index, self.producer_mask, self.cta_group, loc=loc, ip=ip
         )
 
@@ -699,7 +781,7 @@ class PipelineUmmaAsync(PipelineAsync):
 
 
 @dataclass(frozen=True)
-class PipelineClcFetchAsync:
+class PipelineClcFetchAsync(PipelineAsync):
     """
     PipelineClcFetchAsync implements a producer-consumer pipeline for Cluster Launch
     Control based dynamic scheduling. Both producer and consumer operate asynchronously
@@ -717,13 +799,13 @@ class PipelineClcFetchAsync:
     num_stages: int
     producer_mask: Optional[Int32]
     consumer_mask: Optional[Int32]
-    is_signalling_thread: Boolean
+    is_signaling_thread: Boolean
 
     @staticmethod
     @cute.jit
     def _init_full_barrier_arrive_signal(
         cta_layout_vmnk: cute.Layout, tidx: Int32
-    ) -> tuple:
+    ) -> tuple[Int32, Boolean]:
         """
         Computes producer barrier signaling parameters, returns destination CTA rank
         (0 to cluster_size-1) based on thread ID, and a boolean flag indicating if
@@ -732,12 +814,12 @@ class PipelineClcFetchAsync:
         :param cta_layout_vmnk: Cluster layout defining CTA count
         :param tidx: Thread ID within the CTA
         """
-        dst_rank = tidx % 32
-        is_signalling_thread = dst_rank < cute.size(cta_layout_vmnk)
-        return dst_rank, is_signalling_thread
+        dst_rank = tidx % WARP_SIZE
+        is_signaling_thread = dst_rank < cute.size(cta_layout_vmnk)
+        return dst_rank, is_signaling_thread
 
     @staticmethod
-    def create(
+    def create(  # type: ignore[override]
         *,
         num_stages: int,
         producer_group: CooperativeGroup,
@@ -748,6 +830,7 @@ class PipelineClcFetchAsync:
         consumer_mask: Optional[Int32] = None,
         cta_layout_vmnk: Optional[cute.Layout] = None,
         defer_sync: bool = False,
+        name: str = "",
     ) -> "PipelineClcFetchAsync":
         """
         This helper function computes any necessary attributes and returns an instance of PipelineClcFetchAsync.
@@ -771,25 +854,40 @@ class PipelineClcFetchAsync:
                 f"Expected barrier_storage to be a cute.Pointer, but got {type(barrier_storage)}"
             )
 
+        if not is_static(cta_layout_vmnk):
+            raise ValueError("The cluster shape (cta_layout_vmnk) needs to be static.")
+
+        if cta_layout_vmnk is None:
+            cta_layout_vmnk = cute.make_layout((1, 1, 1, 1))
+
         producer_type = PipelineOp.ClcLoad
         consumer_type = PipelineOp.AsyncThread
 
         producer = (producer_type, producer_group)
         consumer = (consumer_type, consumer_group)
         sync_object_full = PipelineTmaUmma._make_sync_object(
-            barrier_storage.align(min_align=8), num_stages, producer, tx_count
+            barrier_storage.align(min_align=8),
+            num_stages,
+            producer,
+            tx_count,
+            name=name,
+            phase="full",
         )
         sync_object_empty = PipelineTmaUmma._make_sync_object(
-            barrier_storage.align(min_align=8) + num_stages, num_stages, consumer
+            barrier_storage.align(min_align=8) + num_stages,
+            num_stages,
+            consumer,
+            name=name,
+            phase="empty",
         )
 
-        if cta_layout_vmnk is None:
-            cta_layout_vmnk = cute.make_layout((1, 1, 1, 1))
+        if name:
+            register_barrier(name, num_stages, "PipelineClcFetchAsync")
 
         tidx, _, _ = cute.arch.thread_idx()
-        # All signalling happens from CTA 0's threads, each thread
+        # All signaling happens from CTA 0's threads, each thread
         # in CTA 0 signals a different remote CTA's mbarrier.
-        (producer_mask, is_signalling_thread) = (
+        (producer_mask, is_signaling_thread) = (
             PipelineClcFetchAsync._init_full_barrier_arrive_signal(
                 cta_layout_vmnk, tidx
             )
@@ -801,7 +899,7 @@ class PipelineClcFetchAsync:
 
         if not defer_sync:
             cute.arch.mbarrier_init_fence()
-            if cta_layout_vmnk is None or cute.size(cta_layout_vmnk) == 1:
+            if cute.size(cta_layout_vmnk) == 1:
                 agent_sync(Agent.ThreadBlock)
             else:
                 agent_sync(Agent.ThreadBlockCluster, is_relaxed=True)
@@ -812,7 +910,7 @@ class PipelineClcFetchAsync:
             num_stages,
             producer_mask,
             consumer_mask,
-            is_signalling_thread,
+            is_signaling_thread,
         )
 
     @dsl_user_op
@@ -832,15 +930,15 @@ class PipelineClcFetchAsync:
         """
         if_generate(
             try_acquire_token is None or try_acquire_token == 0,
-            lambda: self.sync_object_empty.wait(  # type: ignore[call-arg]
+            lambda: self.sync_object_empty.wait(
                 state.index, state.phase, loc=loc, ip=ip
             ),
             loc=loc,
             ip=ip,
         )
         if_generate(
-            self.is_signalling_thread,
-            lambda: self.sync_object_full.arrive(  # type: ignore[call-arg]
+            self.is_signaling_thread,
+            lambda: self.sync_object_full.arrive(
                 state.index, self.producer_mask, loc=loc, ip=ip
             ),
             loc=loc,
@@ -864,7 +962,7 @@ class PipelineClcFetchAsync:
         """
         if_generate(
             try_wait_token is None or try_wait_token == 0,
-            lambda: self.sync_object_full.wait(  # type: ignore[call-arg]
+            lambda: self.sync_object_full.wait(
                 state.index, state.phase, loc=loc, ip=ip
             ),
             loc=loc,
@@ -879,7 +977,7 @@ class PipelineClcFetchAsync:
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
-        self.sync_object_empty.arrive(state.index, self.consumer_mask, loc=loc, ip=ip)  # type: ignore[call-arg]
+        self.sync_object_empty.arrive(state.index, self.consumer_mask, loc=loc, ip=ip)
 
     @dsl_user_op
     def producer_get_barrier(
@@ -901,17 +999,6 @@ class PipelineClcFetchAsync:
     ) -> cute.Pointer:
         return self.sync_object_empty.get_barrier(state.index, loc=loc, ip=ip)  # type: ignore[call-arg, return-value]
 
-    @dsl_user_op
-    def producer_tail(
-        self,
-        state: PipelineState,
-        try_acquire_token: Optional[Boolean] = None,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
-    ) -> cute.Pointer:
-        return self.sync_object_empty.get_barrier(state.index, loc=loc, ip=ip)  # type: ignore[call-arg, return-value]
-
 
 @dataclass(frozen=True)
 class PipelineTmaMultiConsumersAsync(PipelineAsync):
@@ -924,7 +1011,7 @@ class PipelineTmaMultiConsumersAsync(PipelineAsync):
     sync_object_empty_async: SyncObject
     cta_group: cute.nvgpu.tcgen05.CtaGroup
     consumer_dst_rank_async: Optional[Int32] = None
-    is_signalling_thread: Boolean = True  # type: ignore[assignment]
+    is_signaling_thread: Boolean = True  # type: ignore[assignment]
 
     @staticmethod
     def create(  # type: ignore[override]
@@ -936,7 +1023,11 @@ class PipelineTmaMultiConsumersAsync(PipelineAsync):
         tx_count: int,
         barrier_storage: Optional[cute.Pointer] = None,
         cta_layout_vmnk: Optional[cute.Layout] = None,
+        mcast_mode_mn: tuple[int, int] = (1, 1),
+        tidx: Optional[Int32] = None,
+        enable_multicast_signaling: bool = False,
         defer_sync: bool = False,
+        name: str = "",
     ) -> "PipelineTmaMultiConsumersAsync":
         """
         This helper function computes any necessary attributes and returns an instance of PipelineTmaMultiConsumersAsync.
@@ -954,12 +1045,24 @@ class PipelineTmaMultiConsumersAsync(PipelineAsync):
         :type tx_count: int
         :param cta_layout_vmnk: Layout of the cluster shape
         :type cta_layout_vmnk: cute.Layout | None
+        :param mcast_mode_mn: Tuple specifying multicast modes for m and n dimensions (each 0 or 1)
+        :type mcast_mode_mn: tuple[int, int]
+        :param tidx: Thread index for computing AsyncThread consumer signaling, defaults to thread_idx()[0]
+        :type tidx: Int32 | None
+        :param enable_multicast_signaling: See docstring in PipelineTmaAsync.create() for details
+        :type enable_multicast_signaling: bool, optional
         """
         if not isinstance(barrier_storage, cute.Pointer):
             raise TypeError(
                 f"Expected barrier_storage to be a cute.Pointer, but got {type(barrier_storage)}"
             )
 
+        if not is_static(cta_layout_vmnk):
+            raise ValueError("The cluster shape (cta_layout_vmnk) needs to be static.")
+
+        if cta_layout_vmnk is None:
+            cta_layout_vmnk = cute.make_layout((1, 1, 1, 1))
+
         producer_type = PipelineOp.TmaLoad
         consumer_type = PipelineOp.Composite
         consumer_type_umma = PipelineOp.TCGen05Mma
@@ -970,14 +1073,55 @@ class PipelineTmaMultiConsumersAsync(PipelineAsync):
                 "UMMA and AsyncThread consumer groups must be the same agent"
             )
 
-        if cta_layout_vmnk is not None and cute.size(cta_layout_vmnk) != 1:
-            raise ValueError(
-                f"PipelineTmaMultiConsumersAsync is not verified for cta_layout_vmnk != 1, cta_layout_vmnk:{cta_layout_vmnk}"
+        if enable_multicast_signaling:
+            consumer_thread_arrive_cnt_umma = _get_thread_arrive_count(
+                consumer_group_umma
+            )
+            consumer_thread_arrive_cnt_async = _get_thread_arrive_count(
+                consumer_group_async
             )
 
+            if (
+                isinstance(consumer_thread_arrive_cnt_umma, int)
+                and consumer_thread_arrive_cnt_umma % WARP_SIZE != 0
+            ):
+                raise ValueError(
+                    "UMMA consumer arrival count must be a multiple of the warp size"
+                )
+            if (
+                isinstance(consumer_thread_arrive_cnt_async, int)
+                and consumer_thread_arrive_cnt_async % WARP_SIZE != 0
+            ):
+                raise ValueError(
+                    "AsyncThread consumer arrival count must be a multiple of the warp size"
+                )
+
+            shape_vmnk = cast(tuple[int, ...], cta_layout_vmnk.shape)
+            mcast_m_size = shape_vmnk[2] if mcast_mode_mn[0] else 0
+            mcast_n_size = shape_vmnk[1] if mcast_mode_mn[1] else 0
+            overlap = 1 if (mcast_mode_mn[0] and mcast_mode_mn[1]) else 0
+            mcast_size = mcast_m_size + mcast_n_size - overlap
+            assert mcast_size > 0, "Mcast size must be greater than 0."
+
+            num_warps_umma = consumer_thread_arrive_cnt_umma // WARP_SIZE
+            num_signaling_threads_umma = mcast_size * num_warps_umma
+
+            num_warps_async = consumer_thread_arrive_cnt_async // WARP_SIZE
+            num_signaling_threads_async = mcast_size * num_warps_async
+
+            thread_consumer_group_umma = CooperativeGroup(
+                Agent.Thread, num_signaling_threads_umma
+            )
+            thread_consumer_group_async = CooperativeGroup(
+                Agent.Thread, num_signaling_threads_async
+            )
+        else:
+            thread_consumer_group_umma = consumer_group_umma
+            thread_consumer_group_async = consumer_group_async
+
         consumer_group = CooperativeGroup(
-            consumer_group_umma.agent,
-            consumer_group_umma.size + consumer_group_async.size,
+            thread_consumer_group_umma.agent,
+            thread_consumer_group_umma.size + thread_consumer_group_async.size,
         )
 
         producer = (producer_type, producer_group)
@@ -988,10 +1132,19 @@ class PipelineTmaMultiConsumersAsync(PipelineAsync):
             num_stages,
             producer,
             tx_count,
+            name=name,
+            phase="full",
         )
         sync_object_empty = PipelineTmaUmma._make_sync_object(
-            barrier_storage.align(min_align=8) + num_stages, num_stages, consumer
+            barrier_storage.align(min_align=8) + num_stages,
+            num_stages,
+            consumer,
+            name=name,
+            phase="empty",
         )
+
+        if name:
+            register_barrier(name, num_stages, "PipelineTmaMultiConsumersAsync")
         sync_object_empty_umma = sync_object_empty.recast_to_new_op_type(
             consumer_type_umma
         )
@@ -999,20 +1152,39 @@ class PipelineTmaMultiConsumersAsync(PipelineAsync):
             consumer_type_async
         )
 
-        # No mcast mask if not using clusters
-        producer_mask = None
-        consumer_mask = None
-        # All threadblocks are leaders if not using clusters
-        is_leader_cta = True
+        # Compute AsyncThread consumer signaling (dst_rank + is_signaling_thread)
+        if tidx is None:
+            tidx, _, _ = cute.arch.thread_idx()
+        (
+            dst_rank_async,
+            is_signaling_thread,
+        ) = PipelineTmaMultiConsumersAsync._init_empty_barrier_arrive_signal_2sm(
+            cta_layout_vmnk, tidx, mcast_mode_mn
+        )
+
+        if cute.size(cta_layout_vmnk) == 1:
+            # No mcast mask if not using clusters
+            producer_mask = None
+            consumer_mask = None
+            consumer_dst_rank_async = None
+            is_leader_cta = True
+        else:
+            producer_mask = None
+            consumer_mask = PipelineTmaUmma._compute_mcast_arrival_mask(
+                cta_layout_vmnk, mcast_mode_mn
+            )
+            consumer_dst_rank_async = dst_rank_async
+            is_leader_cta = PipelineTmaUmma._compute_is_leader_cta(cta_layout_vmnk)
+
         cta_group = (
             cute.nvgpu.tcgen05.CtaGroup.ONE
-            if cta_layout_vmnk is None or cute.size(cta_layout_vmnk, mode=[0]) == 1
+            if cute.size(cta_layout_vmnk, mode=[0]) == 1
             else cute.nvgpu.tcgen05.CtaGroup.TWO
         )
 
         if not defer_sync:
             cute.arch.mbarrier_init_fence()
-            if cta_layout_vmnk is None or cute.size(cta_layout_vmnk) == 1:
+            if cute.size(cta_layout_vmnk) == 1:
                 agent_sync(Agent.ThreadBlock)
             else:
                 agent_sync(Agent.ThreadBlockCluster, is_relaxed=True)
@@ -1027,8 +1199,57 @@ class PipelineTmaMultiConsumersAsync(PipelineAsync):
             sync_object_empty_umma,
             sync_object_empty_async,
             cta_group,
+            consumer_dst_rank_async,
+            is_signaling_thread,
         )
 
+    @staticmethod
+    @cute.jit
+    def _init_empty_barrier_arrive_signal_2sm(
+        cta_layout_vmnk: cute.Layout,
+        tidx: Int32,
+        mcast_mode_mn: tuple[int, int] = (1, 1),
+    ) -> tuple[Int32, Boolean]:
+        """
+        Identical to sm90.py PipelineTmaAsync.init_empty_barrier_arrive_signal except
+        that CTAs in the multicast will also signal CTAs with a different V-coordinate (i.e. leader/follower CTA pairs).
+        """
+        # Logic to optimally schedule Empty Arrives
+        cluster_shape_vmnk = cta_layout_vmnk.shape
+
+        cta_rank_in_cluster = cute.arch.make_warp_uniform(
+            cute.arch.block_idx_in_cluster()
+        )
+
+        tidx = tidx % WARP_SIZE
+        is_signaling_thread = tidx < cute.size(cluster_shape_vmnk)
+        dst_rank = tidx % cute.size(cluster_shape_vmnk)
+
+        dst_cta_coord = cta_layout_vmnk.get_hier_coord(dst_rank)
+        cur_cta_coord = cta_layout_vmnk.get_hier_coord(cta_rank_in_cluster)
+        assert isinstance(dst_cta_coord, tuple)
+        assert isinstance(cur_cta_coord, tuple)
+
+        is_mcast_mode_m = (
+            dst_cta_coord[1] == cur_cta_coord[1]
+            and dst_cta_coord[3] == cur_cta_coord[3]
+        )
+        is_mcast_mode_n = (
+            dst_cta_coord[2] == cur_cta_coord[2]
+            and dst_cta_coord[3] == cur_cta_coord[3]
+        )
+
+        assert not (mcast_mode_mn[0] == 0 and mcast_mode_mn[1] == 0)
+        if mcast_mode_mn[0] == 1 and mcast_mode_mn[1] == 0:
+            is_signaling_thread = is_signaling_thread and is_mcast_mode_m
+        elif mcast_mode_mn[0] == 0 and mcast_mode_mn[1] == 1:
+            is_signaling_thread = is_signaling_thread and is_mcast_mode_n
+        elif mcast_mode_mn[0] == 1 and mcast_mode_mn[1] == 1:
+            is_mcast_mode_m_or_n = is_mcast_mode_m or is_mcast_mode_n
+            is_signaling_thread = is_signaling_thread and is_mcast_mode_m_or_n
+
+        return dst_rank, is_signaling_thread
+
     @dsl_user_op
     def producer_acquire(
         self,
@@ -1043,7 +1264,7 @@ class PipelineTmaMultiConsumersAsync(PipelineAsync):
         """
         if_generate(
             try_acquire_token is None or try_acquire_token == 0,
-            lambda: self.sync_object_empty.wait(  # type: ignore[call-arg]
+            lambda: self.sync_object_empty.wait(
                 state.index, state.phase, loc=loc, ip=ip
             ),
             loc=loc,
@@ -1051,7 +1272,7 @@ class PipelineTmaMultiConsumersAsync(PipelineAsync):
         )
         if_generate(
             self.is_leader_cta,
-            lambda: self.sync_object_full.arrive(state.index, self.producer_mask),  # type: ignore[call-arg]
+            lambda: self.sync_object_full.arrive(state.index, self.producer_mask),
             loc=loc,
             ip=ip,
         )
@@ -1069,22 +1290,66 @@ class PipelineTmaMultiConsumersAsync(PipelineAsync):
         """
         pass
 
+    @dsl_user_op
+    def consumer_wait(
+        self,
+        state: PipelineState,
+        try_wait_token: Optional[Boolean] = None,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+    ) -> None:
+        """Consumer waits for full barrier to be signaled."""
+        _wait_fn = self.sync_object_full.wait
+        if_generate(
+            try_wait_token is None or try_wait_token == 0,
+            lambda: _wait_fn(state.index, state.phase, loc=loc, ip=ip),
+            loc=loc,
+            ip=ip,
+        )
+
+    @dsl_user_op
+    def consumer_try_wait(
+        self,
+        state: PipelineState,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+    ) -> Boolean:
+        """Non-blocking check if data is ready."""
+        _try_wait_fn = self.sync_object_full.try_wait  # type: ignore[attr-defined]
+        return _try_wait_fn(state.index, state.phase, loc=loc, ip=ip)
+
     @dsl_user_op
     def consumer_release(
         self,
         state: PipelineState,
-        op_type: PipelineOp,
+        op_type: PipelineOp = PipelineOp.TCGen05Mma,
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
         if op_type == PipelineOp.TCGen05Mma:
-            self.sync_object_empty_umma.arrive(  # type: ignore[call-arg]
+            self.sync_object_empty_umma.arrive(
                 state.index, self.consumer_mask, self.cta_group, loc=loc, ip=ip
             )
         elif op_type == PipelineOp.AsyncThread:
-            self.sync_object_empty_async.arrive(  # type: ignore[call-arg]
+            self.sync_object_empty_async.arrive(
                 state.index, self.consumer_mask, loc=loc, ip=ip
             )
         else:
             raise ValueError(f"Invalid PipelineOp specified. op_type:{op_type}")
+
+    @dsl_user_op
+    def make_participants(
+        self,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+    ) -> "tuple[PipelineProducer, PipelineConsumer, None]":
+        """Returns (producer, umma_consumer, None)."""
+        return (
+            self.make_producer(loc=loc, ip=ip),
+            self.make_consumer(loc=loc, ip=ip),
+            None,
+        )
diff --git a/python/CuTeDSL/cutlass/pipeline/sm90.py b/python/CuTeDSL/cutlass/pipeline/sm90.py
index 181866088..774af8e28 100644
--- a/python/CuTeDSL/cutlass/pipeline/sm90.py
+++ b/python/CuTeDSL/cutlass/pipeline/sm90.py
@@ -10,9 +10,14 @@
 # is strictly prohibited.
 
 from dataclasses import dataclass
-from typing import Any, Optional
+from typing import Any, Literal, Optional, cast
 
 import cutlass.cute as cute
+from cutlass._mlir import ir
+from cutlass.cute.arch.constants import (
+    WARP_SIZE,
+)
+from cutlass.cute.core import is_static
 from cutlass.cutlass_dsl import Boolean, Int32, if_generate, dsl_user_op
 
 from cutlass.pipeline import (
@@ -27,7 +32,8 @@ from cutlass.pipeline import (
     make_pipeline_state,
     agent_sync,
 )
-from cutlass._mlir import ir
+from cutlass.pipeline.helpers import _get_thread_arrive_count
+from cutlass.pipeline.profiling import register_barrier
 
 ##############################################################################
 # Pipeline classes
@@ -128,10 +134,13 @@ class PipelineAsync:
         num_stages: int,
         agent: tuple[PipelineOp, CooperativeGroup],
         tx_count: int = 0,
+        name: str = "",
+        phase: Literal["", "full", "empty"] = "",
     ) -> SyncObject:
         """
         Returns a SyncObject corresponding to an agent's PipelineOp.
         """
+        full_name = f"{name}.{phase}" if name and phase else name
         if agent[0] in [
             PipelineOp.AsyncThread,
             PipelineOp.TmaLoad,
@@ -143,6 +152,7 @@ class PipelineAsync:
                 num_stages=num_stages,
                 agent=agent,
                 tx_count=tx_count,
+                name=full_name,
             )
         elif agent[0] is PipelineOp.TmaStore:
             # Path taken for AsyncTmaStore
@@ -160,6 +170,7 @@ class PipelineAsync:
         producer_mask: Optional[Int32] = None,
         consumer_mask: Optional[Int32] = None,
         defer_sync: bool = False,
+        name: str = "",
     ) -> "PipelineAsync":
         """Creates and initializes a new PipelineAsync instance.
 
@@ -194,12 +205,23 @@ class PipelineAsync:
         consumer = (consumer_type, consumer_group)
 
         sync_object_full = PipelineAsync._make_sync_object(
-            barrier_storage.align(min_align=8), num_stages, producer
+            barrier_storage.align(min_align=8),
+            num_stages,
+            producer,
+            name=name,
+            phase="full",
         )
         sync_object_empty = PipelineAsync._make_sync_object(
-            barrier_storage.align(min_align=8) + num_stages, num_stages, consumer
+            barrier_storage.align(min_align=8) + num_stages,
+            num_stages,
+            consumer,
+            name=name,
+            phase="empty",
         )
 
+        if name:
+            register_barrier(name, num_stages, "PipelineAsync")
+
         if not defer_sync:
             cute.arch.mbarrier_init_fence()
             agent_sync(Agent.ThreadBlock)
@@ -223,7 +245,7 @@ class PipelineAsync:
     ) -> None:
         if_generate(
             try_acquire_token is None or try_acquire_token == 0,
-            lambda: self.sync_object_empty.wait(  # type: ignore[call-arg]
+            lambda: self.sync_object_empty.wait(
                 state.index, state.phase, loc=loc, ip=ip
             ),
             loc=loc,
@@ -248,7 +270,7 @@ class PipelineAsync:
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
-        self.sync_object_full.arrive(state.index, self.producer_mask, loc=loc, ip=ip)  # type: ignore[call-arg]
+        self.sync_object_full.arrive(state.index, self.producer_mask, loc=loc, ip=ip)
 
     @dsl_user_op
     def consumer_wait(
@@ -261,7 +283,7 @@ class PipelineAsync:
     ) -> None:
         if_generate(
             try_wait_token is None or try_wait_token == 0,
-            lambda: self.sync_object_full.wait(  # type: ignore[call-arg]
+            lambda: self.sync_object_full.wait(
                 state.index, state.phase, loc=loc, ip=ip
             ),
             loc=loc,
@@ -286,7 +308,7 @@ class PipelineAsync:
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
-        self.sync_object_empty.arrive(state.index, self.consumer_mask, loc=loc, ip=ip)  # type: ignore[call-arg]
+        self.sync_object_empty.arrive(state.index, self.consumer_mask, loc=loc, ip=ip)
 
     @dsl_user_op
     def consumer_get_barrier(
@@ -324,11 +346,10 @@ class PipelineAsync:
         :param state: The pipeline state that points to next useful buffer
         :type state: PipelineState
         """
-        # Assume state contains that next useful buffer
-        # So we only need to advance to num_stages - 1 times to last used buffer
-        for i in range(self.num_stages - 1):
+        # Wait on every stage buffer, since empty-barrier arrivals have no guaranteed ordering.
+        for i in range(self.num_stages):
+            self.sync_object_empty.wait(state.index, state.phase, loc=loc, ip=ip)
             state.advance(loc=loc, ip=ip)
-        self.producer_acquire(state, loc=loc, ip=ip)
 
     # Util methods to manage producer and consumer
     @dsl_user_op
@@ -368,7 +389,7 @@ class PipelineAsync:
 @dataclass(frozen=True)
 class PipelineCpAsync(PipelineAsync):
     """
-    PipelineCpAsync is used for CpAsync producers and AsyncThread consumers
+    PipelineCpAsync is used for CpAsync producers and AsyncThread consumers (e.g. Hopper load mainloops).
     """
 
     @staticmethod
@@ -380,6 +401,7 @@ class PipelineCpAsync(PipelineAsync):
         producer_mask: Optional[Int32] = None,
         consumer_mask: Optional[Int32] = None,
         defer_sync: bool = False,
+        name: str = "",
     ) -> "PipelineCpAsync":
         """Helper function that computes necessary attributes and returns a ``PipelineCpAsync`` instance.
 
@@ -408,13 +430,20 @@ class PipelineCpAsync(PipelineAsync):
             barrier_storage.align(min_align=8),
             num_stages,  # type: ignore[arg-type]
             producer,
+            name=name,
+            phase="full",
         )
         sync_object_array_empty = PipelineCpAsync._make_sync_object(
             barrier_storage.align(min_align=8) + num_stages,
             num_stages,  # type: ignore[arg-type]
             consumer,
+            name=name,
+            phase="empty",
         )
 
+        if name:
+            register_barrier(name, int(num_stages), "PipelineCpAsync")
+
         if not defer_sync:
             cute.arch.mbarrier_init_fence()
             agent_sync(Agent.ThreadBlock)
@@ -434,7 +463,7 @@ class PipelineTmaAsync(PipelineAsync):
     PipelineTmaAsync is used for TMA producers and AsyncThread consumers (e.g. Hopper mainloops).
     """
 
-    is_signalling_thread: Boolean
+    is_signaling_thread: Boolean
 
     @staticmethod
     @cute.jit
@@ -465,34 +494,36 @@ class PipelineTmaAsync(PipelineAsync):
             cute.arch.block_idx_in_cluster()
         )
 
-        tidx = tidx % 32
-        is_signalling_thread = tidx < cute.size(cluster_shape_vmnk)
+        tidx = tidx % WARP_SIZE
+        is_signaling_thread = tidx < cute.size(cluster_shape_vmnk)
         dst_rank = tidx % cute.size(cluster_shape_vmnk)
 
         dst_cta_coord = cta_layout_vmnk.get_hier_coord(dst_rank)
         cur_cta_coord = cta_layout_vmnk.get_hier_coord(cta_rank_in_cluster)
+        assert isinstance(dst_cta_coord, tuple)
+        assert isinstance(cur_cta_coord, tuple)
 
         is_mcast_mode_m = (
-            dst_cta_coord[0] == cur_cta_coord[0]  # type: ignore[index]
-            and dst_cta_coord[1] == cur_cta_coord[1]  # type: ignore[index]
-            and dst_cta_coord[3] == cur_cta_coord[3]  # type: ignore[index]
+            dst_cta_coord[0] == cur_cta_coord[0]
+            and dst_cta_coord[1] == cur_cta_coord[1]
+            and dst_cta_coord[3] == cur_cta_coord[3]
         )
         is_mcast_mode_n = (
-            dst_cta_coord[0] == cur_cta_coord[0]  # type: ignore[index]
-            and dst_cta_coord[2] == cur_cta_coord[2]  # type: ignore[index]
-            and dst_cta_coord[3] == cur_cta_coord[3]  # type: ignore[index]
+            dst_cta_coord[0] == cur_cta_coord[0]
+            and dst_cta_coord[2] == cur_cta_coord[2]
+            and dst_cta_coord[3] == cur_cta_coord[3]
         )
 
         assert not (mcast_mode_mn[0] == 0 and mcast_mode_mn[1] == 0)
         if mcast_mode_mn[0] == 1 and mcast_mode_mn[1] == 0:
-            is_signalling_thread = is_signalling_thread and is_mcast_mode_m
+            is_signaling_thread = is_signaling_thread and is_mcast_mode_m
         elif mcast_mode_mn[0] == 0 and mcast_mode_mn[1] == 1:
-            is_signalling_thread = is_signalling_thread and is_mcast_mode_n
+            is_signaling_thread = is_signaling_thread and is_mcast_mode_n
         elif mcast_mode_mn[0] == 1 and mcast_mode_mn[1] == 1:
             is_mcast_mode_m_or_n = is_mcast_mode_m or is_mcast_mode_n
-            is_signalling_thread = is_signalling_thread and is_mcast_mode_m_or_n
+            is_signaling_thread = is_signaling_thread and is_mcast_mode_m_or_n
 
-        return dst_rank, is_signalling_thread
+        return dst_rank, is_signaling_thread
 
     @staticmethod
     def create(  # type: ignore[override]
@@ -505,7 +536,9 @@ class PipelineTmaAsync(PipelineAsync):
         cta_layout_vmnk: Optional[cute.Layout] = None,
         tidx: Optional[Int32] = None,
         mcast_mode_mn: tuple[int, int] = (1, 1),
+        enable_multicast_signaling: bool = False,
         defer_sync: bool = False,
+        name: str = "",
     ) -> "PipelineTmaAsync":
         """Create a new ``PipelineTmaAsync`` instance.
 
@@ -525,6 +558,15 @@ class PipelineTmaAsync(PipelineAsync):
         :type tidx: Int32, optional
         :param mcast_mode_mn: Tuple specifying multicast modes for m and n dimensions (each 0 or 1), defaults to (1,1)
         :type mcast_mode_mn: tuple[int, int], optional
+        :param enable_multicast_signaling: When ``True``, the consumer CooperativeGroup is expected
+            to represent the number of threads in a CTA calling
+            consumer_wait/consumer_release, and the actual arrive count is recomputed
+            internally. Multicast is handled automatically based on cta_layout_vmnk and
+            mcast_mode_mn. Defaults to ``False``, which skips this logic and uses the
+            consumer arrive count specified by the user.
+        :type enable_multicast_signaling: bool, optional
+        :param defer_sync: Bool specifying whether or not to skip the built-in mbarrier fence and sync for performance, defaults to False
+        :type defer_sync: bool, optional
         :raises ValueError: If barrier_storage is not a cute.Pointer instance
         :return: New ``PipelineTmaAsync`` instance
         :rtype: PipelineTmaAsync
@@ -534,29 +576,88 @@ class PipelineTmaAsync(PipelineAsync):
                 f"Expected barrier_storage to be a cute.Pointer, but got {type(barrier_storage)}"
             )
 
+        if not is_static(cta_layout_vmnk):
+            raise ValueError("The cluster shape (cta_layout_vmnk) needs to be static.")
+
+        if cta_layout_vmnk is None:
+            cta_layout_vmnk = cute.make_layout((1, 1, 1, 1))
+
         producer_type = PipelineOp.TmaLoad
         consumer_type = PipelineOp.AsyncThread
 
-        producer = (producer_type, producer_group)
-        consumer = (consumer_type, consumer_group)
+        # The producer group is not dependent on multicast and is forwarded as-is.
+        thread_producer_group = producer_group
+
+        if enable_multicast_signaling:
+            # In multicast mode, the consumer arrive count is recomputed. Each
+            # consumer warp contributes one signaling thread per multicast partner
+            # CTA to the arrive count, rather than using the thread count directly.
+            consumer_thread_arrive_cnt = _get_thread_arrive_count(consumer_group)
+
+            if (
+                isinstance(consumer_thread_arrive_cnt, int)
+                and consumer_thread_arrive_cnt % WARP_SIZE != 0
+            ):
+                raise ValueError(
+                    "Error: Consumer arrival count must be aligned with warp size"
+                )
+
+            shape_vmnk = cast(tuple[int, ...], cta_layout_vmnk.shape)
+            # mcast_m_size is the number of multicast partners in the m dimension
+            mcast_m_size = shape_vmnk[2] if mcast_mode_mn[0] else 0
+            # mcast_n_size is the number of multicast partners in the n dimension
+            mcast_n_size = shape_vmnk[1] if mcast_mode_mn[1] else 0
+            # Subtract 1 when multicasting in both the m and n dimensions
+            # to avoid double counting the local CTA
+            overlap = 1 if (mcast_mode_mn[0] and mcast_mode_mn[1]) else 0
+            # mcast_size is the total number of multicast partners
+            mcast_size = mcast_m_size + mcast_n_size - overlap
+            assert mcast_size > 0, "Mcast size must be greater than 0."
+
+            num_warps = consumer_thread_arrive_cnt // WARP_SIZE
+            # num_signaling_threads is the total number of arrives expected. One
+            # arrive is expected per consumer warp, per multicast partner.
+            num_signaling_threads = mcast_size * num_warps
+
+            thread_consumer_group = CooperativeGroup(
+                Agent.Thread, num_signaling_threads
+            )
+
+        else:
+            # Non-multicast signaling path
+            thread_consumer_group = consumer_group
+
+        producer = (producer_type, thread_producer_group)
+        consumer = (consumer_type, thread_consumer_group)
 
         sync_object_full = PipelineAsync._make_sync_object(
-            barrier_storage.align(min_align=8), num_stages, producer, tx_count
+            barrier_storage.align(min_align=8),
+            num_stages,
+            producer,
+            tx_count,
+            name=name,
+            phase="full",
         )
         sync_object_empty = PipelineAsync._make_sync_object(
-            barrier_storage.align(min_align=8) + num_stages, num_stages, consumer
+            barrier_storage.align(min_align=8) + num_stages,
+            num_stages,
+            consumer,
+            name=name,
+            phase="empty",
         )
+
+        if name:
+            register_barrier(name, num_stages, "PipelineTmaAsync")
+
         if tidx is None:
             tidx, _, _ = cute.arch.thread_idx()
-        if cta_layout_vmnk is None:
-            cta_layout_vmnk = cute.make_layout((1, 1, 1, 1))
         (
             dst_rank,
-            is_signalling_thread,
+            is_signaling_thread,
         ) = PipelineTmaAsync.init_empty_barrier_arrive_signal(
             cta_layout_vmnk, tidx, mcast_mode_mn
         )
-        if cta_layout_vmnk is None or cute.size(cta_layout_vmnk) == 1:
+        if cute.size(cta_layout_vmnk) == 1:
             dst_rank = None
         else:
             dst_rank = dst_rank
@@ -565,7 +666,7 @@ class PipelineTmaAsync(PipelineAsync):
 
         if not defer_sync:
             cute.arch.mbarrier_init_fence()
-            if cta_layout_vmnk is None or cute.size(cta_layout_vmnk) == 1:
+            if cute.size(cta_layout_vmnk) == 1:
                 agent_sync(Agent.ThreadBlock)
             else:
                 agent_sync(Agent.ThreadBlockCluster, is_relaxed=True)
@@ -574,9 +675,9 @@ class PipelineTmaAsync(PipelineAsync):
             sync_object_full,
             sync_object_empty,
             num_stages,
-            producer_mask,
+            producer_mask,  # unused
             dst_rank,
-            is_signalling_thread,
+            is_signaling_thread,
         )
 
     @dsl_user_op
@@ -593,13 +694,13 @@ class PipelineTmaAsync(PipelineAsync):
         """
         if_generate(
             try_acquire_token is None or try_acquire_token == 0,
-            lambda: self.sync_object_empty.wait(  # type: ignore[call-arg]
+            lambda: self.sync_object_empty.wait(
                 state.index, state.phase, loc=loc, ip=ip
             ),
             loc=loc,
             ip=ip,
         )
-        self.sync_object_full.arrive(state.index, self.producer_mask, loc=loc, ip=ip)  # type: ignore[call-arg]
+        self.sync_object_full.arrive(state.index, self.producer_mask, loc=loc, ip=ip)
 
     @dsl_user_op
     def producer_commit(
@@ -626,8 +727,8 @@ class PipelineTmaAsync(PipelineAsync):
         TMA consumer release conditionally signals the empty buffer to the producer.
         """
         if_generate(
-            self.is_signalling_thread,
-            lambda: self.sync_object_empty.arrive(  # type: ignore[call-arg]
+            self.is_signaling_thread,
+            lambda: self.sync_object_empty.arrive(
                 state.index, self.consumer_mask, loc=loc, ip=ip
             ),
         )
@@ -669,7 +770,7 @@ class PipelineTmaStore(PipelineAsync):
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
-        self.sync_object_full.wait(loc=loc, ip=ip)  # type: ignore[call-arg]
+        self.sync_object_full.wait(loc=loc, ip=ip)
 
     @dsl_user_op
     def producer_commit(
@@ -678,7 +779,7 @@ class PipelineTmaStore(PipelineAsync):
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
-        self.sync_object_full.arrive(loc=loc, ip=ip)  # type: ignore[call-arg]
+        self.sync_object_full.arrive(loc=loc, ip=ip)
 
     @dsl_user_op
     def consumer_wait(
@@ -757,6 +858,7 @@ class PipelineOrder:
         group_id: int,
         producer_group: CooperativeGroup,
         defer_sync: bool = False,
+        name: str = "",
     ) -> "PipelineOrder":
         if not isinstance(barrier_storage, cute.Pointer):
             raise TypeError(
@@ -770,9 +872,15 @@ class PipelineOrder:
         num_stages = depth * length
 
         sync_object_full = PipelineAsync._make_sync_object(
-            barrier_storage.align(min_align=8), num_stages, producer
+            barrier_storage.align(min_align=8),
+            num_stages,
+            producer,
+            name=name,
         )
 
+        if name:
+            register_barrier(name, num_stages, "PipelineOrder")
+
         if not defer_sync:
             cute.arch.mbarrier_init_fence()
             agent_sync(Agent.ThreadBlock)
@@ -797,15 +905,15 @@ class PipelineOrder:
         return state.index * self.length + group_id
 
     @dsl_user_op
-    def arrive(  # type: ignore[return]
+    def arrive(
         self,
         state: Optional[PipelineState] = None,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> Optional[PipelineState]:
         state = self.state if state is None else state
-        signalling_id = (self.group_id + 1) % self.length
-        idx = self.get_barrier_for_current_stage_idx(signalling_id, state)
+        signaling_id = (self.group_id + 1) % self.length
+        idx = self.get_barrier_for_current_stage_idx(signaling_id, state)
         cute.arch.mbarrier_arrive(
             self.sync_object_full.get_barrier(idx, loc=loc, ip=ip),  # type: ignore[call-arg]
             loc=loc,
@@ -814,6 +922,7 @@ class PipelineOrder:
         state.advance(loc=loc, ip=ip)
         if state is not self.state:
             return state
+        return None
 
     @dsl_user_op
     def wait(
@@ -862,7 +971,7 @@ class ImmutableResourceHandle:
         """Get the original pipeline this resource handle belongs to."""
         return self.__origin
 
-    def __extract_mlir_values__(self) -> list:
+    def __extract_mlir_values__(self) -> list[ir.Value]:
         """Extract MLIR values from the current state.
 
         :return: List of MLIR values representing the current state
@@ -871,7 +980,9 @@ class ImmutableResourceHandle:
         # TODO: need to handle pipeline as well
         return self.__immutable_state.__extract_mlir_values__()
 
-    def __new_from_mlir_values__(self, values: "list") -> "ImmutableResourceHandle":
+    def __new_from_mlir_values__(
+        self, values: list[ir.Value]
+    ) -> "ImmutableResourceHandle":
         """Create a new Producer instance from MLIR values.
 
         :param values: MLIR values to initialize the state
@@ -956,7 +1067,7 @@ class PipelineProducer:
 
     def __init__(
         self, pipeline: PipelineAsync, state: PipelineState, group: CooperativeGroup
-    ):
+    ) -> None:
         """Initialize a new Producer instance.
 
         :param pipeline: The pipeline this producer belongs to
@@ -1201,7 +1312,7 @@ class PipelineConsumer:
 
     def __init__(
         self, pipeline: PipelineAsync, state: PipelineState, group: CooperativeGroup
-    ):
+    ) -> None:
         """Initialize a new Consumer instance.
 
         :param pipeline: The pipeline this consumer belongs to
diff --git a/python/CuTeDSL/cutlass/testing.py b/python/CuTeDSL/cutlass/testing.py
new file mode 100644
index 000000000..21e27e741
--- /dev/null
+++ b/python/CuTeDSL/cutlass/testing.py
@@ -0,0 +1,986 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 - 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# Use of this software is governed by the terms and conditions of the
+# NVIDIA End User License Agreement (EULA), available at:
+# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
+#
+# Any use, reproduction, disclosure, or distribution of this software
+# and related documentation outside the scope permitted by the EULA
+# is strictly prohibited.
+
+
+import argparse
+import logging
+import os
+from dataclasses import dataclass
+from functools import partial
+from itertools import product
+from time import time
+from typing import Any, Callable, Dict, List, Optional, Type, Union
+
+import cuda.bindings.driver as cuda_driver
+import cuda.bindings.runtime as cuda_runtime
+
+from cutlass.cute.typing import Int8, Numeric, Uint8
+
+
+class CuptiProfiler:
+    """A class for managing CUPTI profiling measurements with start, stop, and duration methods.
+
+    This class provides a clean interface for measuring CUDA kernel execution times
+    using CUPTI (CUDA Profiling Tools Interface). It encapsulates the complexity
+    of buffer management, callback registration, and activity tracking.
+
+    Example usage:
+        profiler = CuptiProfiler()
+        profiler.start()
+        # ... run your CUDA kernels ...
+        profiler.stop()
+        duration = profiler.get_duration()  # Returns total duration in milliseconds
+    """
+
+    def __init__(self, buffer_size: int = 8 * 1024 * 1024) -> None:
+        """Initialize the CUPTI profiler.
+
+        Args:
+            buffer_size: Size of the CUPTI buffer in bytes (default: 8MB)
+
+        Raises:
+            ModuleNotFoundError: If the cupti-python package is not installed
+        """
+        try:
+            from cupti import cupti
+
+            self._cupti = cupti
+        except ModuleNotFoundError:
+            raise ModuleNotFoundError(
+                "CUPTI is not available. Install the 'cupti-python' package to use CuptiProfiler."
+            )
+        self.buffer_size = buffer_size
+        self.timings: list[tuple[str, float]] = []
+        self._is_active = False
+        self._buffer_requested_callback: Optional[Callable[..., Any]] = None
+        self._buffer_completed_callback: Optional[Callable[..., Any]] = None
+
+    def _buffer_requested(self) -> tuple[int, int]:
+        """Internal callback for CUPTI buffer requests."""
+        max_num_records = 0
+        return self.buffer_size, max_num_records
+
+    def _buffer_completed(self, activities: list[Any]) -> None:
+        """Internal callback for processing completed CUPTI activities."""
+        for activity in activities:
+            start: Any = activity.start if hasattr(activity, "start") else "nil"
+            end: Any = activity.end if hasattr(activity, "end") else "nil"
+            if start != "nil" and end != "nil":
+                duration: Any = end - start
+            else:
+                duration = "nil"
+            name = activity.name[:100] if hasattr(activity, "name") else "unknown"
+            # Convert nanoseconds to milliseconds
+            if duration != "nil":
+                self.timings.append((name, duration / 1e6))
+
+    def start(self) -> None:
+        """Start CUPTI profiling.
+
+        Enables CUPTI activity tracking for concurrent kernels and registers
+        the necessary callbacks for buffer management.
+
+        Raises:
+            ValueError: If CUPTI activity cannot be enabled
+        """
+        if self._is_active:
+            raise RuntimeError("CUPTI profiler is already active")
+
+        # Clear previous timings
+        self.timings = []
+
+        try:
+            self._cupti.activity_enable(self._cupti.ActivityKind.CONCURRENT_KERNEL)
+        except self._cupti.cuptiError as e:
+            raise ValueError(
+                f"\033[91mError while enabling Activity Kind {self._cupti.ActivityKind.CONCURRENT_KERNEL.name}: {e}. Please disable CUPTI if you are using profilers\033[0m"
+            )
+
+        # Register callbacks
+        self._buffer_requested_callback = self._buffer_requested
+        self._buffer_completed_callback = partial(self._buffer_completed)
+
+        self._cupti.activity_register_callbacks(
+            self._buffer_requested_callback, self._buffer_completed_callback
+        )
+
+        self._is_active = True
+
+    def stop(self) -> None:
+        """Stop CUPTI profiling.
+
+        Flushes all activities, disables CUPTI tracking, and finalizes the profiler.
+        This method should be called after the kernels you want to measure have completed.
+        """
+        if not self._is_active:
+            raise RuntimeError("CUPTI profiler is not active")
+
+        # Flush all activities and cleanup
+        self._cupti.activity_flush_all(0)
+        self._cupti.activity_disable(self._cupti.ActivityKind.CONCURRENT_KERNEL)
+        self._cupti.finalize()
+
+        self._is_active = False
+
+    def get_duration(self) -> float:
+        """Get the total duration of all measured activities in milliseconds.
+
+        Returns:
+            Total duration in milliseconds. Returns 0.0 if no activities were recorded.
+        """
+        return sum(timing[1] for timing in self.timings)
+
+
+#########################################
+# Testing utilities
+#########################################
+
+
+def sample_pytest(rand_cfg: Optional[tuple[int, float]] = None) -> Callable[..., Any]:
+    """
+    Decorator to randomly sample pytest parametrized tests.
+    rand_cfg: Tuple[int, float] - (random_seed, sample_ratio)
+    Sampling is disabled when:
+    - A specific test is selected (via -k or direct test path)
+    - Not running under pytest
+    """
+    import functools
+    import os
+    import random
+    import sys
+
+    import pytest
+
+    if rand_cfg is not None:
+        seed, sample_ratio = rand_cfg
+        random.seed(seed)
+    else:
+        sample_ratio = 1.0  # No sampling when rand_cfg is None
+
+    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
+        @functools.wraps(func)
+        def wrapper(*args: Any, **kwargs: Any) -> Any:
+            if rand_cfg is not None and "PYTEST_CURRENT_TEST" in os.environ:
+                # Check if test was explicitly selected like ::test_name[param1-param2-...]
+                if "-k" in sys.argv or any(".py::" in arg for arg in sys.argv):
+                    # Test was explicitly selected, don't skip
+                    return func(*args, **kwargs)
+
+                if random.uniform(0.0, 1.0) > sample_ratio:
+                    pytest.skip(f"Randomly skipped (sampling ratio: {sample_ratio})")
+            return func(*args, **kwargs)
+
+        return wrapper
+
+    return decorator
+
+
+#########################################
+# Benchmarking utilities
+#########################################
+
+
+class JitArguments:
+    """
+    A type to hold both args and kwargs for passing to a kernel while benchmarking.
+    """
+
+    def __init__(self, *args: Any, **kwargs: Any) -> None:
+        self.args = args
+        self.kwargs = kwargs
+        self.references: list[Any] = list()
+
+    def add_to_scope(self, references: Any) -> None:
+        """
+        Keeps references to external variables (e.g., Torch tensors when taking a view)
+        in the scope of the lifetime of the JitArguments object.
+        """
+        self.references.extend(references)
+
+
+def _cuda_success(
+    err: Union[tuple[Any, ...], cuda_runtime.cudaError_t, cuda_driver.CUresult],
+    message: str,
+) -> None:
+    """
+    Helper function to check CUDA API errors.
+    """
+    if isinstance(err, tuple):
+        _cuda_success(err[0], message)
+    elif isinstance(err, cuda_runtime.cudaError_t):
+        error_message = cuda_runtime.cudaGetErrorString(err)[1].decode("utf-8")
+        if err != cuda_runtime.cudaError_t.cudaSuccess:
+            raise RuntimeError(f"{message} : {error_message}")
+    elif isinstance(err, cuda_driver.CUresult):
+        if err != cuda_driver.CUresult.CUDA_SUCCESS:
+            error_message = cuda_driver.cuGetErrorString(err)[1].decode("utf-8")
+            raise RuntimeError(f"{message} : {error_message}")
+    else:
+        raise TypeError(
+            f"{err} is an unexpected type : it should be a cudaError_t or CUresult"
+        )
+
+
+def _does_kernel_use_stream(
+    kernel: Callable[..., Any],
+    stream: cuda_driver.CUstream,
+    args: tuple[Any, ...],
+    kwargs: dict[str, Any],
+) -> bool:
+    """
+    This function checks if the kernel uses the provided non-default stream.
+    It does this by capturing the stream and then checking if any kernels were launched.
+
+    Note: the function accepts positional/keyword arguments for the kernel in non-unpacked form
+    (as tuple/dict, respectively) to avoid name clashes with function's own arguments (e.g. stream).
+
+    :param kernel: The kernel to check
+    :type kernel: Callable
+    :param stream: The stream to check
+    :type stream: cuda_driver.CUstream
+    :param args: Positional arguments to pass to the kernel
+    :type args: tuple[Any, ...]
+    :param kwargs: Keyword arguments to pass to the kernel
+    :type kwargs: dict[str, Any]
+    :return: True if the kernel uses the stream, False otherwise
+    :rtype: bool
+    """
+
+    assert int(stream) != int(cuda_driver.CUstream_flags.CU_STREAM_DEFAULT), (
+        "Stream must be a non-default stream"
+    )
+
+    err = cuda_runtime.cudaStreamBeginCapture(
+        stream, cuda_runtime.cudaStreamCaptureMode.cudaStreamCaptureModeThreadLocal
+    )
+    _cuda_success(err, "Error on stream capture")
+
+    try:
+        kernel(*args, **kwargs)
+    except Exception:
+        # Always end the capture even on failure to avoid zombie capture state
+        # that would poison all subsequent graph capture operations in the process.
+        try:
+            cuda_runtime.cudaStreamEndCapture(stream)
+        except Exception:
+            pass
+        raise
+
+    err, graph = cuda_runtime.cudaStreamEndCapture(stream)
+    _cuda_success(err, "Error on stream capture")
+
+    # Get number of nodes in warmup graph to check it matches what is expected
+    err, _, num_nodes = cuda_runtime.cudaGraphGetNodes(graph)
+    _cuda_success(err, "Error on querying graph")
+    return num_nodes > 0
+
+
+def benchmark(
+    callable: Callable,
+    *,
+    warmup_iterations: int = 10,
+    iterations: int = 100,
+    stream: Optional[cuda_driver.CUstream] = None,
+    kernel_arguments: Optional[JitArguments] = None,
+    workspace_generator: Optional[Callable[[], JitArguments]] = None,
+    workspace_count: int = 1,
+    use_cuda_graphs: bool = False,
+    use_cupti: bool = False,
+) -> float:
+    """Benchmarks a callable function with the specified parameters.
+
+    For example,
+    .. code-block:: python
+
+        from cutlass.testing import benchmark
+
+        @cute.jit
+        def user_function(a: cute.Tensor, b: cute.Tensor, c: cute.Tensor, stream: cuda_driver.CUstream):
+            # contents of the function
+            pass
+
+        time_us = benchmark(user_function, kernel_arguments=JitArguments(a, b, c, stream),
+                            warmup_iterations=10, iterations=100,
+                            stream=stream)
+
+    To prevent skewing results by repeatedly accessing the L2 cache, use the workspace_count and workspace_generator
+    parameters to cycle through a number of different workspaces.
+
+    .. code-block:: python
+
+        from cutlass.testing import benchmark
+
+        @cute.jit
+        def user_function(a: cute.Tensor, b: cute.Tensor, c: cute.Tensor):
+            # contents of the function
+            pass
+
+        def workspace_generator():
+            # create a, b, and c
+            return JitArguments(a, b, c)
+
+        time_us = benchmark(user_function,
+                            workspace_generator=workspace_generator,
+                            workspace_count=10,
+                            warmup_iterations=10000,
+                            iterations=1000)
+
+    You can always configure the function being profiled (callable), the number of warmup iterations, and
+    the number of profiling iterations.
+
+    Whenever the kernel being benchmarked runs in a non-default stream, the stream must be provided through the stream parameter.
+
+    To use CUDA graphs, the callable must be a compiled @cute.jit annotated function.
+    When using CUDA graphs, the kernel must be launched in a non-default stream.
+
+    :param callable: The function to benchmark. For a JIT function, it must be a compiled function.
+    :type callable: Callable
+    :param warmup_iterations: Number of warmup iterations, defaults to 10
+    :type warmup_iterations: int, optional
+    :param iterations: Number of benchmark iterations, defaults to 100
+    :type iterations: int, optional
+    :param stream: Stream the kernel is launched in, defaults to the default CUDA stream
+    :type stream: CUstream, None
+    :param kernel_arguments: Kernel arguments to launch callable with, defaults to None
+    :type kernel_arguments: JitArguments, None
+    :param workspace_generator: Function that returns kernel arguments, defaults to None
+    :type workspace_generator: Callable
+    :param workspace_count: Number of workspaces (arguments) to loop through, defaults to 1; looping through enough workspaces keeps the L2 cache cold
+    :type workspace_count: int, optional
+    :param use_cuda_graphs: Whether to use cuda graphs, defaults to False
+    :type use_cuda_graphs: bool, optional
+
+    :return: The benchmark time in microseconds
+    :rtype: float
+    """
+    import cutlass.base_dsl.jit_executor  # noqa: F401
+    import cutlass.cutlass_dsl.cuda_jit_executor  # noqa: F401
+
+    if stream is None:
+        stream = cuda_driver.CUstream(cuda_driver.CUstream_flags.CU_STREAM_DEFAULT)
+
+    if workspace_count < 1:
+        raise ValueError("workspace_count must be at least 1")
+
+    _time_us = float("nan")
+    if workspace_generator is None:
+        # If no workspace generator is provided, we need a single workspace
+        if workspace_count != 1:
+            raise ValueError("Need a single workspace if not providing a generator")
+
+        # If no workspace generator is provided, we need a kernel_argument
+        if kernel_arguments is None:
+            raise ValueError(
+                "Please pass a kernel argument if not providing a generator"
+            )
+        workspace_generator = lambda: kernel_arguments
+
+    workspaces = [workspace_generator() for _ in range(workspace_count)]
+
+    for workspace in workspaces:
+        if not isinstance(workspace, JitArguments):
+            raise TypeError(
+                "workspace_generator and/or kernel_arguments should use JitArguments type"
+            )
+
+    # use memset to flush L2 cache after workspace h2d copies
+    if workspace_count > 1:
+        from cutlass.utils import HardwareInfo
+
+        hardware_info = HardwareInfo()
+        num_l2_cache_bytes = hardware_info.get_l2_cache_size_in_bytes()
+        l2_flush_bytes = num_l2_cache_bytes * 2
+        err, cache_ptr = cuda_driver.cuMemAlloc(int(l2_flush_bytes))
+        _cuda_success(err, "Error on allocating memory")
+
+        err = cuda_driver.cuMemsetD32Async(
+            cache_ptr, 0, int(l2_flush_bytes // 4), stream
+        )
+        _cuda_success(err, "Error on memset")
+
+        err = cuda_driver.cuMemFree(cache_ptr)
+        _cuda_success(err, "Error on freeing memory")
+
+    def _loop_and_call_kernel(iterations: int, workspace_index: int = 0) -> int:
+        for _ in range(iterations):
+            current_workspace = workspaces[workspace_index]
+            callable(*current_workspace.args, **current_workspace.kwargs)
+            workspace_index = (workspace_index + 1) % workspace_count
+        return workspace_index
+
+    # Create CUDA events for timing
+    err, start_event = cuda_driver.cuEventCreate(
+        cuda_driver.CUevent_flags.CU_EVENT_DEFAULT
+    )
+    _cuda_success(err, "Error on creating event")
+    err, end_event = cuda_driver.cuEventCreate(
+        cuda_driver.CUevent_flags.CU_EVENT_DEFAULT
+    )
+    _cuda_success(err, "Error on creating event")
+
+    elapsed_time = float("nan")
+
+    # =========================================================================
+    # Helper: Measure kernel execution time using CUPTI profiler
+    # =========================================================================
+    def _measure_with_cupti(kernel_launcher: Callable[[], Any]) -> float:
+        """
+        Measure kernel execution time using NVIDIA CUPTI profiler.
+        :param kernel_launcher: Callable that launches the kernel(s) to be profiled
+        :type kernel_launcher: Callable
+        :return: Elapsed time in milliseconds
+        :rtype: float
+        """
+        if not hasattr(kernel_launcher, "__call__"):
+            raise TypeError(
+                f"kernel_launcher must be callable, got {type(kernel_launcher).__name__}"
+            )
+
+        cupti_profiler = CuptiProfiler()
+
+        cupti_profiler.start()
+        kernel_launcher()
+
+        err = cuda_runtime.cudaDeviceSynchronize()
+        _cuda_success(err, "Error on synchronizing device")
+
+        cupti_profiler.stop()
+        duration_ms = cupti_profiler.get_duration()
+        return duration_ms
+
+    def _measure_with_cuda_event(kernel_launcher: Callable[[], Any]) -> float:
+        """
+        Measure kernel execution time using CUDA events.
+        :param kernel_launcher: Callable that launches the kernel(s) to be profiled
+        :type kernel_launcher: Callable
+        :return: Elapsed time in milliseconds
+        :rtype: float
+        """
+        if not hasattr(kernel_launcher, "__call__"):
+            raise TypeError(
+                f"kernel_launcher must be callable, got {type(kernel_launcher).__name__}"
+            )
+
+        if int(stream) != int(
+            cuda_driver.CUstream_flags.CU_STREAM_DEFAULT
+        ) and not _does_kernel_use_stream(
+            callable, stream, workspaces[0].args, workspaces[0].kwargs
+        ):
+            raise ValueError(
+                "CUDA stream passed to benchmark does not match the stream the kernel was launched in"
+            )
+
+        err = cuda_driver.cuEventRecord(start_event, stream)
+        _cuda_success(err, "Error on recording start event")
+
+        kernel_launcher()
+
+        err = cuda_driver.cuEventRecord(end_event, stream)
+        _cuda_success(err, "Error on recording end event")
+
+        err = cuda_driver.cuEventSynchronize(end_event)
+        _cuda_success(err, "Error on synchronizing end event")
+
+        err, duration_ms = cuda_driver.cuEventElapsedTime(start_event, end_event)
+        _cuda_success(err, "Error on querying elapsed time")
+        return duration_ms
+
+    # =========================================================================
+    # Branch 1: CUDA Graphs mode - Capture and replay kernel execution
+    # =========================================================================
+    if use_cuda_graphs:
+        if hasattr(callable, "_dsl_cls"):
+            raise TypeError(
+                "Uncompiled @cute.jit function cannot be captured into a CUDA Graph. "
+                "Use cute.compile() first, or wrap compiled calls in a plain function."
+            )
+
+        # ---------------------------------------------------------------------
+        # Step 1: Capture warmup graph
+        # ---------------------------------------------------------------------
+        import gc as _gc
+
+        # Disable GC during capture to prevent __del__ methods (e.g., cudaFree)
+        # from invalidating the capture with a non-capturable CUDA call.
+        _gc.collect()
+        _gc.disable()
+        err = cuda_runtime.cudaStreamBeginCapture(
+            stream, cuda_runtime.cudaStreamCaptureMode.cudaStreamCaptureModeThreadLocal
+        )
+        _cuda_success(err, "Error on beginning warmup stream capture")
+
+        try:
+            warmup_workspace_idx = _loop_and_call_kernel(warmup_iterations)
+        except Exception:
+            _gc.enable()
+            try:
+                cuda_runtime.cudaStreamEndCapture(stream)
+            except Exception:
+                pass
+            raise
+
+        err, warmup_graph = cuda_runtime.cudaStreamEndCapture(stream)
+        _gc.enable()
+        _cuda_success(err, "Error on ending warmup stream capture")
+
+        # Validate warmup graph node count
+        # Each kernel launch should produce at least one graph node
+        err, _, warmup_node_count = cuda_runtime.cudaGraphGetNodes(warmup_graph)
+        _cuda_success(err, "Error on querying warmup graph nodes")
+        # Expect >= (not ==), since one host function may launch multiple kernels
+        if warmup_node_count < warmup_iterations:
+            raise ValueError(
+                "CUDA stream passed to benchmark does not match the stream the kernel was launched in"
+            )
+
+        # ---------------------------------------------------------------------
+        # Step 2: Capture profiling graph
+        # ---------------------------------------------------------------------
+        _gc.collect()
+        _gc.disable()
+        err = cuda_runtime.cudaStreamBeginCapture(
+            stream, cuda_runtime.cudaStreamCaptureMode.cudaStreamCaptureModeThreadLocal
+        )
+        _cuda_success(err, "Error on beginning profiling stream capture")
+
+        try:
+            _loop_and_call_kernel(iterations, warmup_workspace_idx)
+        except Exception:
+            _gc.enable()
+            try:
+                cuda_runtime.cudaStreamEndCapture(stream)
+            except Exception:
+                pass
+            raise
+
+        err, profiling_graph = cuda_runtime.cudaStreamEndCapture(stream)
+        _gc.enable()
+        _cuda_success(err, "Error on ending profiling stream capture")
+
+        # ---------------------------------------------------------------------
+        # Step 3: Instantiate executable graphs
+        # ---------------------------------------------------------------------
+        err, warmup_graph_exec = cuda_runtime.cudaGraphInstantiate(warmup_graph, 0)
+        _cuda_success(err, "Error on instantiating warmup graph")
+        err, profiling_graph_exec = cuda_runtime.cudaGraphInstantiate(
+            profiling_graph, 0
+        )
+        _cuda_success(err, "Error on instantiating profiling graph")
+
+        # ---------------------------------------------------------------------
+        # Step 4: Execute warmup graph (cache warming)
+        # ---------------------------------------------------------------------
+        err = cuda_runtime.cudaGraphLaunch(warmup_graph_exec, stream)
+        _cuda_success(err, "Error on launching warmup graph")
+
+        # ---------------------------------------------------------------------
+        # Step 5: Profile execution using selected profiler
+        # ---------------------------------------------------------------------
+        def launch_profiling_graph() -> None:
+            err = cuda_runtime.cudaGraphLaunch(profiling_graph_exec, stream)
+            _cuda_success(err, "Error on launching profiling graph")
+
+        if use_cupti:
+            elapsed_time = _measure_with_cupti(launch_profiling_graph)
+        else:
+            elapsed_time = _measure_with_cuda_event(launch_profiling_graph)
+
+        # ---------------------------------------------------------------------
+        # Step 6: Cleanup - Destroy graph executables
+        # ---------------------------------------------------------------------
+        err = cuda_runtime.cudaGraphExecDestroy(warmup_graph_exec)
+        _cuda_success(err, "Error on destroying warmup graph executable")
+        err = cuda_runtime.cudaGraphExecDestroy(profiling_graph_exec)
+        _cuda_success(err, "Error on destroying profiling graph executable")
+
+    # =========================================================================
+    # Branch 2: CUPTI profiler mode (without CUDA Graphs)
+    # =========================================================================
+    elif use_cupti:
+        # Warmup iterations to stabilize GPU state
+        warmup_workspace_idx = _loop_and_call_kernel(warmup_iterations)
+
+        def run_profiling_iterations() -> None:
+            _loop_and_call_kernel(iterations, warmup_workspace_idx)
+
+        elapsed_time = _measure_with_cupti(run_profiling_iterations)
+
+    # =========================================================================
+    # Branch 3: CUDA event profiler mode (default)
+    # =========================================================================
+    else:
+        # Warmup iterations to stabilize GPU state
+        warmup_workspace_idx = _loop_and_call_kernel(warmup_iterations)
+
+        def run_profiling_iterations() -> None:
+            _loop_and_call_kernel(iterations, warmup_workspace_idx)
+
+        elapsed_time = _measure_with_cuda_event(run_profiling_iterations)
+
+    # Destroy events
+    err = cuda_driver.cuEventDestroy(start_event)
+    _cuda_success(err, "Error on destroying event")
+    err = cuda_driver.cuEventDestroy(end_event)
+    _cuda_success(err, "Error on destroying event")
+
+    return elapsed_time / iterations * 1e3
+
+
+def get_workspace_count(
+    one_workspace_bytes: int, warmup_iterations: int, iterations: int
+) -> int:
+    """Calculate the number of workspaces needed to fill L2 cache.
+
+    :param one_workspace_bytes: Size of one workspace in bytes
+    :type one_workspace_bytes: int
+    :param warmup_iterations: Number of warmup iterations
+    :type warmup_iterations: int
+    :param iterations: Number of iterations
+    :type iterations: int
+    :return: Number of workspaces needed
+    :rtype: int
+    """
+    from cutlass.utils import HardwareInfo
+
+    num_l2_cache_bytes = HardwareInfo().get_l2_cache_size_in_bytes()
+    num_workspaces = (num_l2_cache_bytes * 3) // one_workspace_bytes + 1
+    num_iters = warmup_iterations + iterations
+    return num_iters if num_iters < num_workspaces else num_workspaces
+
+
+#########################################
+# Autotuning/Tuning utilities
+#########################################
+
+
+def _benchmark_for_autotune(
+    callable: Callable[..., Any],
+    *args: Any,
+    warmup_iterations: int,
+    iterations: int,
+    use_cold_l2: bool,
+    print_verbose: bool,
+    current_stream: Optional[cuda_driver.CUstream] = None,
+    **kwargs: Any,
+) -> float:
+    """Benchmarks a callable function with the specified parameters.
+
+    This function differs from the benchmark function in that it is used for autotuning. In this case we
+    do not loop through workspaces to keep the L2 cache cold. Instead, we write to an L2-cache-sized buffer before each run to keep the L2 cache cold.
+
+    The primary reason for doing this is that we do not have information on how to generate the workspaces for the kernel when autotuning.
+    We also do not have information on how much memory the workspaces take up.
+
+    This benchmarking is done as a close approximation of the actual runtime of the kernel in an E2E system,
+    where we may have clock throttling, a warm cache, or other factors that could affect the runtime of the kernel.
+
+    :param callable: The function to benchmark
+    :type callable: Callable
+    :param args: Arguments to pass to the callable function
+    :param warmup_iterations: Number of warmup iterations
+    :type warmup_iterations: int
+    :param iterations: Number of benchmark iterations
+    :type iterations: int
+    :param use_cold_l2: Whether to clear L2 cache between runs
+    :type use_cold_l2: bool
+    :param print_verbose: Whether to print verbose output
+    :type print_verbose: bool
+    :param current_stream: Stream to benchmark in, defaults to CUDA stream default
+    :type current_stream: CUstream, optional
+    :param kwargs: Additional keyword arguments to pass to the callable function
+
+    :return: The benchmark time in microseconds
+    :rtype: float
+    """
+    if current_stream is None:
+        current_stream = cuda_driver.CUstream(
+            cuda_driver.CUstream_flags.CU_STREAM_DEFAULT
+        )
+
+    if int(current_stream) != int(
+        cuda_driver.CUstream(cuda_driver.CUstream_flags.CU_STREAM_DEFAULT)
+    ) and not _does_kernel_use_stream(callable, current_stream, args, kwargs):
+        raise ValueError(f"Incorrect stream passed to kernel: {current_stream}")
+
+    if use_cold_l2:
+        from cutlass.utils import HardwareInfo
+
+        # use memset to clear L2 cache
+        hardware_info = HardwareInfo()
+        num_l2_cache_bytes = hardware_info.get_l2_cache_size_in_bytes()
+        err, cache_ptr = cuda_driver.cuMemAlloc(int(num_l2_cache_bytes))
+        _cuda_success(err, "Error on allocating memory")
+
+    # Create CUDA events for timing
+    err, start_event = cuda_driver.cuEventCreate(
+        cuda_driver.CUevent_flags.CU_EVENT_DEFAULT
+    )
+    _cuda_success(err, "Error on creating event")
+    err, end_event = cuda_driver.cuEventCreate(
+        cuda_driver.CUevent_flags.CU_EVENT_DEFAULT
+    )
+    _cuda_success(err, "Error on creating event")
+    try:
+        # warmup
+        for _ in range(warmup_iterations):
+            callable(*args, **kwargs)
+
+        _time = 0
+        execution_time_ms = []
+        for _ in range(iterations):
+            if use_cold_l2:
+                # clear L2 cache by memset to zero for every run
+                err = cuda_driver.cuMemsetD32Async(
+                    cache_ptr, 0, int(num_l2_cache_bytes // 4), current_stream
+                )
+                _cuda_success(err, "Error on memset")
+            err = cuda_driver.cuEventRecord(start_event, current_stream)
+            _cuda_success(err, "Error on recording event")
+            callable(*args, **kwargs)
+            err = cuda_driver.cuEventRecord(end_event, current_stream)
+            _cuda_success(err, "Error on recording event")
+            err = cuda_driver.cuEventSynchronize(end_event)
+            _cuda_success(err, "Error on synchronizing event")
+            err, elapsed_time = cuda_driver.cuEventElapsedTime(start_event, end_event)
+            _cuda_success(err, "Error on querying event")
+            execution_time_ms.append(elapsed_time)
+        # unit: us
+        time_us = sum(execution_time_ms) * 1e3 / len(execution_time_ms)
+    except Exception as e:
+        print(f"Execution error for this config: {e}")
+        time_us = float("inf")
+    if print_verbose:
+        print(f"Execution time: {time_us:.4f} us")
+
+    if use_cold_l2:
+        err = cuda_driver.cuMemFree(cache_ptr)
+        _cuda_success(err, "Error on freeing memory")
+    err = cuda_driver.cuEventDestroy(start_event)
+    _cuda_success(err, "Error on destroying event")
+    err = cuda_driver.cuEventDestroy(end_event)
+    _cuda_success(err, "Error on destroying event")
+    return time_us
+
+
+def tune(
+    func: Callable[..., Callable[[], Any]],
+    params_dict: Optional[Dict[str, List[Any]]] = None,
+    kernel_arguments: "JitArguments" = JitArguments(),
+    warmup_iterations: int = 10,
+    iterations: int = 100,
+    stream: Optional[cuda_driver.CUstream] = None,
+) -> Dict[str, Any]:
+    """Tuning tool to support arbitrary functions. The user must provide a function that returns a callable
+    taking no arguments; that callable is benchmarked for each configuration.
+    Best practice is to return a jit function that is compiled with cute.compile for optimal performance. For example:
+
+    .. code-block:: python
+
+        def user_function(param1=1, param2=2, param3=3) -> Callable[[], Any]:
+            # contents of the function
+            return lambda : compiled_func(param1, param2, param3)
+
+        config = tune(user_function, params_dict={'param1': [1, 2, 3], 'param2': [4, 5, 6]})
+
+    :param func: Function to be tuned. Errors raised by the function are logged and the configuration is skipped, except NotImplementedError, which is re-raised.
+    :type func: Callable[..., Callable[[], Any]]
+    :param params_dict: Dictionary containing parameter names and their possible values
+    :type params_dict: Dict[str, List[Any]], optional
+    :param kernel_arguments: Kernel arguments to launch callable with, defaults to JitArguments()
+    :type kernel_arguments: JitArguments, optional
+    :param warmup_iterations: Number of warmup iterations, defaults to 10
+    :type warmup_iterations: int, optional
+    :param iterations: Number of benchmark iterations, defaults to 100
+    :type iterations: int, optional
+    :param stream: Stream kernel is launched in, defaults to CUDA stream default
+    :type stream: CUstream, optional
+    :return: Best configuration
+    :rtype: Dict[str, Any]
+    """
+    logger = logging.getLogger(__name__ + "_Autotune")
+    if not logger.handlers:
+        handler = logging.StreamHandler()
+        formatter = logging.Formatter(
+            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+        )
+        handler.setFormatter(formatter)
+        logger.addHandler(handler)
+    if (
+        os.environ.get("CUTE_DSL_LOG_AUTOTUNE") is not None
+        and os.environ.get("CUTE_DSL_LOG_AUTOTUNE") != "0"
+    ):
+        logger.setLevel(logging.INFO)
+
+    if stream is None:
+        stream = cuda_driver.CUstream(cuda_driver.CUstream_flags.CU_STREAM_DEFAULT)
+
+    # Get all parameter configurations
+    if params_dict is None:
+        raise ValueError("params_dict must be provided")
+    keys = list(params_dict.keys())
+    values = list(params_dict.values())
+
+    min_time = float("inf")
+
+    best_config = None
+    # Record start time
+    start = time()
+
+    # Iterate through all possible configuration combinations
+    for config_values in product(*values):
+        # Build current configuration
+        current_config = dict(zip(keys, config_values))
+        logger.info(f"Tuning configuration: {current_config}")
+
+        try:
+            merged_kwargs = {**kernel_arguments.kwargs, **current_config}
+
+            compiled_func = func(*kernel_arguments.args, **merged_kwargs)
+            # Benchmark the compiled function
+            cur_time = _benchmark_for_autotune(
+                compiled_func,
+                warmup_iterations=warmup_iterations,
+                iterations=iterations,
+                use_cold_l2=True,
+                print_verbose=False,
+                current_stream=stream,
+            )
+
+            logger.info(f"   Execution time: {cur_time} us")
+
+            # Update best results
+            if cur_time < min_time:
+                min_time = cur_time
+                best_config = current_config
+
+        except NotImplementedError as e:
+            logger.info(f"   Encountered unimplemented error, abort execution: {e}")
+            raise e
+        except (ValueError, TypeError, CantImplementError) as e:
+            logger.info(f"   Configuration parameter skipping: {e}")
+            continue
+        except Exception as e:
+            logger.info(f"   Execution error skipping: {e}")
+            continue
+
+    end = time()
+    tuning_time = end - start
+
+    if best_config is None:
+        raise ValueError("No best kernel found")
+
+    logger.info(f"Best configuration: {best_config}, execution time: {min_time} us")
+    logger.info(f"Total tuning time: {tuning_time} s")
+    return best_config
+
+
+class CantImplementError(Exception):
+    """Exception raised when a configuration is invalid or unsupported."""
+
+    def __init__(self, message: Optional[str] = None) -> None:
+        self.message = message or "The current config is invalid/unsupported"
+        super().__init__(self.message)
+
+    def __str__(self) -> str:
+        return self.message
+
+    def __repr__(self) -> str:
+        return self.message
+
+
+#########################################
+# Tensor initialization configuration
+#########################################
+
+
+@dataclass(frozen=True)
+class TensorInitConfig:
+    """Configuration for tensor initialization policy.
+
+    When init_normal=True, tensors are initialized from a normal distribution
+    with the specified mean and std. Int8/Uint8 dtypes always use random
+    integer initialization regardless of this flag.
+    """
+
+    init_normal: bool = False
+    normal_mean: float = 0.0
+    normal_std: float = 1.0
+
+
+def add_tensor_init_args(
+    parser: argparse.ArgumentParser,
+    supports_int_dtypes: bool = True,
+) -> None:
+    """Add --init_normal, --normal_mean, --normal_std arguments to a parser.
+
+    :param parser: ArgumentParser to add arguments to.
+    :param supports_int_dtypes: If True, appends Int8/Uint8 caveat to --init_normal
+        help text. Set to False for files whose ab_dtype choices do not include
+        Int8/Uint8 (e.g. grouped_gemm, dense_blockscaled_gemm_persistent).
+    """
+    init_normal_help = (
+        "Use normal distribution for tensor initialization instead of random integers."
+    )
+    if supports_int_dtypes:
+        init_normal_help += (
+            " Note: Int8/Uint8 dtypes always use random init regardless of this flag."
+        )
+    parser.add_argument(
+        "--init_normal",
+        action="store_true",
+        help=init_normal_help,
+    )
+    parser.add_argument(
+        "--normal_mean",
+        type=float,
+        default=0.0,
+        help="Mean for normal distribution initialization",
+    )
+    parser.add_argument(
+        "--normal_std",
+        type=float,
+        default=1.0,
+        help="Standard deviation for normal distribution initialization (must be >= 0)",
+    )
+
+
+def validate_tensor_init_args(
+    args: argparse.Namespace,
+    parser: argparse.ArgumentParser,
+) -> None:
+    """Validate tensor init arguments after parse_args().
+
+    :param args: Parsed arguments namespace.
+    :param parser: Parser instance (used for error reporting).
+    """
+    if args.normal_std < 0:
+        parser.error("--normal_std must be non-negative")
+
+
+def tensor_init_config_from_args(args: argparse.Namespace) -> TensorInitConfig:
+    """Extract a TensorInitConfig from parsed arguments."""
+    return TensorInitConfig(
+        init_normal=args.init_normal,
+        normal_mean=args.normal_mean,
+        normal_std=args.normal_std,
+    )
+
+
+def should_use_normal_init(
+    config: TensorInitConfig,
+    dtype: Type[Numeric],
+) -> bool:
+    """Determine whether normal initialization should be used for the given dtype.
+
+    Returns False if config.init_normal is False or if dtype is Int8/Uint8
+    (which do not support normal distribution initialization).
+    """
+    return config.init_normal and dtype not in (Int8, Uint8)
diff --git a/python/CuTeDSL/cutlass/torch.py b/python/CuTeDSL/cutlass/torch.py
index f2334e482..177654b14 100644
--- a/python/CuTeDSL/cutlass/torch.py
+++ b/python/CuTeDSL/cutlass/torch.py
@@ -17,15 +17,12 @@ from typing import Any, Optional, Type, Union, Tuple
 
 from cutlass.cute.typing import (
     Numeric,
-    Boolean,
-    TFloat32,
-    Float8E4M3B11FNUZ,
     Float8E4M3FN,
     Float8E5M2,
     Float8E8M0FNU,
+    Float4E2M1FN,
     Float6E3M2FN,
     Float6E2M3FN,
-    Float4E2M1FN,
     Int4,
     Tensor,
 )
@@ -35,31 +32,7 @@ import torch
 import cuda.bindings.driver as cuda
 
 
-def dtype(ty: Type[Numeric]) -> "torch.dtype":
-    """
-    Return the corresponding torch.dtype per the given DSL type
-    """
-    torch_dtype = getattr(torch, ty.__name__.lower(), None)
-
-    torch_type_map = {
-        Boolean: torch.bool,
-        # TFloat32 is just alias of float32
-        TFloat32: torch.float32,
-        Float8E5M2: torch.float8_e5m2,
-        Float8E4M3FN: torch.float8_e4m3fn,
-        Float8E4M3B11FNUZ: torch.float8_e4m3fnuz,
-    }
-
-    # float8_e8m0fnu is introduced in latest version of torch
-    if hasattr(torch, "float8_e8m0fnu"):
-        torch_type_map[Float8E8M0FNU] = torch.float8_e8m0fnu
-
-    if torch_dtype is None:
-        torch_dtype = torch_type_map.get(ty)
-
-    if torch_dtype is None:
-        raise TypeError(f"{ty} is not supported by torch")
-    return torch_dtype
+from cutlass.base_dsl.torch import dtype as dtype  # noqa: F811
 
 
 def as_tensor(pointer: Any, shape: Any, torch_type: "torch.dtype") -> "torch.Tensor":
@@ -197,9 +170,9 @@ def convert_cute_tensor(
         Float8E5M2,
         Float8E4M3FN,
         Float8E8M0FNU,
+        Float4E2M1FN,
         Float6E3M2FN,
         Float6E2M3FN,
-        Float4E2M1FN,
     }:
         fp32_cute_tensor = from_dlpack(f32_torch_tensor)
         if is_dynamic_layout:
@@ -304,12 +277,9 @@ def cute_tensor_like(
     """
 
     # allocate device buffer for cute tensor
-    if (cutlass_dtype.is_float and cutlass_dtype.width <= 8) or (
-        cutlass_dtype.is_integer and cutlass_dtype.width == 4
-    ):
-        torch_dtype = torch.int8
-    else:
-        torch_dtype = dtype(cutlass_dtype)
+    do_kernel_convert = ((cutlass_dtype.is_float and cutlass_dtype.width <= 8) or
+                         (cutlass_dtype.is_integer and cutlass_dtype.width == 4))
+    torch_dtype = torch.uint8 if do_kernel_convert else dtype(cutlass_dtype)
     torch_tensor = torch.empty_like(data_ref, dtype=torch_dtype, device="cuda")
 
     # create cute tensor using the device buffer
@@ -322,10 +292,7 @@ def cute_tensor_like(
 
     is_empty_tensor = torch_tensor.numel() == 0
     # initialize the cute tensor data
-    if not is_empty_tensor and (
-        (cutlass_dtype.is_float and cutlass_dtype.width <= 8)
-        or (cutlass_dtype.is_integer and cutlass_dtype.width == 4)
-    ):
+    if not is_empty_tensor and do_kernel_convert:
         cute_tensor = convert_cute_tensor(
             data_ref.to(dtype=torch.float32),
             cute_tensor,
@@ -334,6 +301,15 @@ def cute_tensor_like(
         )
     else:
         torch_tensor.copy_(data_ref.to(dtype=torch_dtype))
+
+    # cast back to torch type if possible
+    if do_kernel_convert:
+        try:
+            torch_dtype = dtype(cutlass_dtype)
+        except TypeError:
+            torch_dtype = torch_tensor.dtype
+        torch_tensor = torch_tensor.view(dtype=torch_dtype)
+
     return cute_tensor, torch_tensor
 
 
diff --git a/python/CuTeDSL/cutlass/utils/__init__.py b/python/CuTeDSL/cutlass/utils/__init__.py
index 6c4fdf005..bf454cd44 100644
--- a/python/CuTeDSL/cutlass/utils/__init__.py
+++ b/python/CuTeDSL/cutlass/utils/__init__.py
@@ -74,7 +74,11 @@ from .tensormap_manager import (
     TensorMapManager,
 )
 
-from .smem_allocator import SmemAllocator, get_smem_capacity_in_bytes
+from .smem_allocator import (
+    SmemAllocator,
+    get_smem_capacity_in_bytes,
+    get_kernel_smem_size,
+)
 from .tmem_allocator import (
     TmemAllocator,
     TmemBufferPool,
@@ -83,7 +87,6 @@ from .tmem_allocator import (
 )
 
 from .layout import LayoutEnum
-
 from .block import block_copy
 
 from .mixed_input_helpers import (
@@ -126,6 +129,7 @@ from .tensor_helpers import (
 
 __all__ = [
     "get_smem_capacity_in_bytes",
+    "get_kernel_smem_size",
     "SmemAllocator",
     "TmemAllocator",
     "TmemBufferPool",
diff --git a/python/CuTeDSL/cutlass/utils/blackwell_helpers.py b/python/CuTeDSL/cutlass/utils/blackwell_helpers.py
index 3ff569c3b..1b2ccb72f 100644
--- a/python/CuTeDSL/cutlass/utils/blackwell_helpers.py
+++ b/python/CuTeDSL/cutlass/utils/blackwell_helpers.py
@@ -19,18 +19,21 @@ from cutlass.cutlass_dsl import (
     Float32,
     Uint8,
     Int8,
+    Int32,
     Float8E4M3FN,
     Float8E5M2,
     Float6E3M2FN,
     Float6E2M3FN,
     Float4E2M1FN,
+    Integer,
     Numeric,
     NumericMeta,
     dsl_user_op,
 )
-
-from cutlass._mlir import ir
 import cutlass.cute as cute
+from cutlass._mlir import ir
+from cutlass._mlir.dialects import arith, vector as _vector_dialect
+import cutlass._mlir.dialects.cute_nvgpu as _cute_nvgpu_ir
 from cutlass.cute.nvgpu.common import CopyUniversalOp, OperandMajorMode
 from cutlass.cute.nvgpu.warp import StMatrix8x8x16bOp, StMatrix16x8x8bOp
 from cutlass.cute.nvgpu.tcgen05 import (
@@ -63,7 +66,7 @@ from cutlass.cute.nvgpu.cpasync import (
     CopyBulkTensorTileG2SOp,
 )
 from cutlass.utils.layout import LayoutEnum
-import cutlass.cute.testing as testing
+from cutlass import testing
 
 # Type alias for documentation clarity
 OperandSource = Tcgen05OperandSource
@@ -174,6 +177,7 @@ def compute_epilogue_tile_shape(
     *,
     layout_c: Optional[LayoutEnum] = None,
     elem_ty_c: Union[Type[Numeric], None] = None,
+    tmem_warp_shape_mn: Optional[Tuple[int, int]] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> cute.Tile:
@@ -193,11 +197,16 @@ def compute_epilogue_tile_shape(
     :type layout_c: LayoutEnum, optional
     :param elem_ty_c: The element type for input tensor C. Defaults to None.
     :type elem_ty_c: Union[Type[Numeric], None], optional
+    :param tmem_warp_shape_mn: Optional (warp_m, warp_n) override for the tmem
+        subpartition layout. When omitted, the layout is derived from
+        ``cta_tile_shape`` and ``use_2cta_instrs``.
+    :type tmem_warp_shape_mn: Tuple[int, int], optional
 
     :return: Returns epilog tiler, which is used in subsequent epilog partitions.
     :rtype: cute.Tile
 
     :raises ValueError: If the computed tile cute.size does not meet minimum requirements based on CTA dimensions.
+
     """
 
     def validate_type(ty: Type[Numeric], ty_name: str) -> None:
@@ -208,7 +217,10 @@ def compute_epilogue_tile_shape(
     if elem_ty_c is not None:
         validate_type(elem_ty_c, "elem_ty_c")
 
-    cta_m, cta_n = cta_tile_shape[:2]  # type: ignore[index]
+    assert isinstance(cta_tile_shape, tuple)
+    cta_m, cta_n = cta_tile_shape[:2]
+    assert isinstance(cta_m, (int, Integer))
+    assert isinstance(cta_n, (int, Integer))
     d_is_m_major = layout_d.is_m_major_c()
     c_is_m_major = True if layout_c is None else layout_c.is_m_major_c()
 
@@ -221,10 +233,14 @@ def compute_epilogue_tile_shape(
         elem_ty_c.width if elem_ty_c is not None else None,
         d_is_m_major,
         c_is_m_major,
+        tmem_warp_shape_mn=tmem_warp_shape_mn,
     )
 
-    # Compute warp layout parameters (needed for CuTe layout creation)
-    (warp_m, warp_n) = (2, 2) if (cta_m == 64 and use_2cta_instrs) else (4, 1)
+    # Compute warp layout parameters (needed for CuTe layout creation).
+    if tmem_warp_shape_mn is not None:
+        (warp_m, warp_n) = tmem_warp_shape_mn
+    else:
+        (warp_m, warp_n) = (2, 2) if (cta_m == 64 and use_2cta_instrs) else (4, 1)
 
     # Validate minimum tile requirements
     disable_source = elem_ty_c is None
@@ -233,19 +249,21 @@ def compute_epilogue_tile_shape(
         if d_is_m_major
         else (128 * warp_n if elem_ty_d.width == 6 else 128 // elem_ty_d.width * warp_n)
     )
-    n_min_c = (
-        8 * warp_n
-        if (c_is_m_major or disable_source)
-        else (128 * warp_n if elem_ty_c.width == 6 else 128 // elem_ty_c.width * warp_n)  # type: ignore[union-attr]
-    )
-    if cta_n < n_min_c or cta_n < n_min_d:  # type: ignore[operator]
+    if c_is_m_major or disable_source:
+        n_min_c = 8 * warp_n
+    else:
+        assert elem_ty_c is not None
+        n_min_c = (
+            128 * warp_n if elem_ty_c.width == 6 else 128 // elem_ty_c.width * warp_n
+        )
+    if cta_n < n_min_c or cta_n < n_min_d:
         raise ValueError(f"CTA tile too small: {cta_tile_shape=}")
 
     # stride by tmem warp layout and return a by-mode tiler
     tile_m_layout = cute.make_layout(tile_m, loc=loc, ip=ip)
     tile_n_layout = cute.make_layout(
         (tile_n // warp_n, warp_n),
-        stride=(1, cta_n // warp_n),  # type: ignore[operator]
+        stride=(1, cta_n // warp_n),
         loc=loc,
         ip=ip,
     )
@@ -423,6 +441,7 @@ def get_tmem_load_op(
     epi_tile: cute.Tile,
     use_2cta_instrs: bool,
     *,
+    tmem_warp_shape_mn: Optional[Tuple[int, int]] = None,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
 ) -> cute.CopyAtom:
@@ -454,9 +473,13 @@ def get_tmem_load_op(
     acc_bits = elem_ty_acc.width
     d_bits = elem_ty_d.width
 
-    tmem_warp_shape_mn = (
-        (2, 2) if (cta_tile_shape[0] == 64 and use_2cta_instrs) else (4, 1)  # type: ignore[index]
-    )
+    assert isinstance(cta_tile_shape, tuple)
+    if tmem_warp_shape_mn is None:
+        # Default rule: (2,2) for 2CTA+M64, else (4,1). Callers may pass an
+        # explicit override.
+        tmem_warp_shape_mn = (
+            (2, 2) if (cta_tile_shape[0] == 64 and use_2cta_instrs) else (4, 1)
+        )
     epilog_tile_shape_mn = cute.product_each(
         cute.shape(epi_tile, loc=loc, ip=ip), loc=loc, ip=ip
     )
@@ -700,7 +723,7 @@ def make_smem_layout(
     smem_layout_atom_kind = get_smem_layout_atom_ab(
         leading_mode,
         a_dtype,
-        smem_tile_shape,  # type: ignore[arg-type]
+        smem_tile_shape,
         loc=loc,
         ip=ip,
     )
@@ -872,7 +895,7 @@ def get_smem_layout_atom_epi(
         return get_smem_layout_atom_ab(
             OperandMajorMode.MN,
             element_type,
-            tma_shape,  # type: ignore[arg-type]
+            tma_shape,
             loc=loc,
             ip=ip,
         )
@@ -881,7 +904,7 @@ def get_smem_layout_atom_epi(
         return get_smem_layout_atom_ab(
             OperandMajorMode.K,
             element_type,
-            tma_shape,  # type: ignore[arg-type]
+            tma_shape,
             loc=loc,
             ip=ip,
         )
@@ -1196,10 +1219,10 @@ def make_blockscaled_trivial_tiled_mma(
 ) -> cute.TiledMma: ...
 
 
+@overload
 @deprecated(
     "use make_blockscaled_trivial_tiled_mma with separate a_dtype and b_dtype instead"
 )
-@overload
 def make_blockscaled_trivial_tiled_mma(
     ab_dtype: Type[Numeric],
     a_leading_mode: OperandMajorMode,
@@ -1568,7 +1591,8 @@ def get_permutation_mnk(
 
     :raise ValueError: If the tile shape is not divisible by the sf_vec_size
     """
-    perm_m = min(tile_shape_mnk[0], 128)  # type: ignore[index]
+    assert isinstance(tile_shape_mnk, tuple)
+    perm_m = min(tile_shape_mnk[0], 128)
     # refer to C++ code:
     # /include/cutlass/gemm/collective/builders/sm120_common.inl?ref_type=heads#L158
     if sf_vec_size == 32 or sf_vec_size == 16:
@@ -1595,7 +1619,7 @@ def sm103_make_blockscaled_trivial_tiled_mma(
     mma_tiler_mn: Tuple[int, int],
     a_source: OperandSource = OperandSource.SMEM,
 ) -> cute.TiledMma:
-    """Create a blockscaled trivial tiled MMA for SM103 (Ultra FP4), K fixed to 96.
+    """Create a blockscaled trivial tiled MMA for SM103 (ultra FP4), K fixed to 96.
 
     Returns a tcgen05 MMA configured for the given (M, N) tiler and CTA group.
 
@@ -1681,6 +1705,7 @@ def sm120_get_smem_store_op(
 
 
 
+
 def compute_epilogue_tile_size(
     cta_tile_m: int,
     cta_tile_n: int,
@@ -1689,6 +1714,7 @@ def compute_epilogue_tile_size(
     elem_width_c: int | None = None,
     d_is_m_major: bool = True,
     c_is_m_major: bool = True,
+    tmem_warp_shape_mn: Optional[Tuple[int, int]] = None,
 ) -> tuple[int, int]:
     """Compute epilogue subtile dimensions ``(tile_m, tile_n)`` (pure Python, no MLIR).
 
@@ -1804,7 +1830,10 @@ def compute_epilogue_tile_size(
     """
     # -- Step 1: warp grid ------------------------------------------------
     # (2,2) for 2CTA+M64 so each warp gets 32 rows; else (4,1).
-    (warp_m, warp_n) = (2, 2) if (cta_tile_m == 64 and use_2cta) else (4, 1)
+    if tmem_warp_shape_mn is not None:
+        (warp_m, warp_n) = tmem_warp_shape_mn
+    else:
+        (warp_m, warp_n) = (2, 2) if (cta_tile_m == 64 and use_2cta) else (4, 1)
     disable_source = elem_width_c is None
     max_bits = elem_width_d if disable_source else max(elem_width_c, elem_width_d)  # type: ignore[type-var]
 
@@ -1844,11 +1873,11 @@ def compute_epilogue_tile_size(
         if d_is_m_major
         else (128 * warp_n if elem_width_d == 6 else 128 // elem_width_d * warp_n)
     )
-    n_min_c = (
-        8 * warp_n
-        if (c_is_m_major or disable_source)
-        else (128 * warp_n if elem_width_c == 6 else 128 // elem_width_c * warp_n)  # type: ignore[operator]
-    )
+    if c_is_m_major or disable_source:
+        n_min_c = 8 * warp_n
+    else:
+        assert elem_width_c is not None
+        n_min_c = 128 * warp_n if elem_width_c == 6 else 128 // elem_width_c * warp_n
 
     # -- Step 5: tile_n ---------------------------------------------------
     tile_n = min(cta_tile_n, max(n_perf, n_min_c, n_min_d))
@@ -1874,13 +1903,6 @@ def compute_acc_tmem_cols_per_stage(
     ``alloc_tmem`` call site.  See ``TmemAllocator.check_valid_num_columns``
     in ``cutlass/utils/tmem_allocator.py``.
 
-    Replicates the C++ logic from without requiring an
-    MLIR context, so it can be used at kernel discovery time.  When an
-    MLIR context is available, prefer
-    ``cutlass.cute.nvgpu.tcgen05.find_tmem_tensor_col_offset`` which
-    computes the column count directly from the compiler-generated
-    TMEM layout.
-
     **How TMEM packing works**
 
     TMEM has 128 datapaths (rows).  M maps to datapaths, N maps to
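
Review note: the warp-grid override and the C-operand minimum-N rule changed in the hunks above can be modeled in plain Python. This is a minimal sketch, not the library code: the helper names `select_tmem_warp_shape` and `min_tile_n_for_c` are hypothetical, and the D-operand rule is left out because its M-major branch is not visible in these hunks.

```python
from typing import Optional, Tuple


def select_tmem_warp_shape(
    cta_tile_m: int,
    use_2cta: bool,
    tmem_warp_shape_mn: Optional[Tuple[int, int]] = None,
) -> Tuple[int, int]:
    """Return (warp_m, warp_n); an explicit override wins over the default."""
    if tmem_warp_shape_mn is not None:
        return tmem_warp_shape_mn
    # Default rule from the diff: (2, 2) for 2CTA + M=64, else (4, 1).
    return (2, 2) if (cta_tile_m == 64 and use_2cta) else (4, 1)


def min_tile_n_for_c(
    warp_n: int, elem_width_c: Optional[int], c_is_m_major: bool
) -> int:
    """Minimum N extent imposed by the C (source) operand."""
    # No source operand (width is None) or M-major C: 8 columns per warp.
    if c_is_m_major or elem_width_c is None:
        return 8 * warp_n
    # 6-bit types are special-cased; other widths use 128 // width.
    return 128 * warp_n if elem_width_c == 6 else 128 // elem_width_c * warp_n


warp_m, warp_n = select_tmem_warp_shape(64, use_2cta=True)
print(warp_m, warp_n)                                   # 2 2
print(min_tile_n_for_c(warp_n, 16, c_is_m_major=False))  # 16
```

Passing `tmem_warp_shape_mn=(4, 1)` to the first helper returns `(4, 1)` even for the 2CTA + M=64 case, which mirrors the override path added in this change.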
diff --git a/python/CuTeDSL/cutlass/utils/block.py b/python/CuTeDSL/cutlass/utils/block.py
index e2c53696d..a41e4f9be 100644
--- a/python/CuTeDSL/cutlass/utils/block.py
+++ b/python/CuTeDSL/cutlass/utils/block.py
@@ -9,7 +9,9 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
+from typing import Any, Optional
 
+from cutlass._mlir import ir
 from cutlass.cutlass_dsl import dsl_user_op, CuTeDSL
 
 from cutlass.cute.typing import Tensor
@@ -21,11 +23,11 @@ from cutlass.cute.nvgpu.cpasync.copy import (
     TmaCopyOp,
     CopyBulkTensorTileG2SOp,
     CopyBulkTensorTileG2SMulticastOp,
+    CopyBulkTensorIm2ColG2SOp,
+    CopyBulkTensorIm2ColG2SMulticastOp,
 )
 from cutlass.cute.nvgpu.cpasync.helpers import tma_partition
 from cutlass.cute.nvgpu.tcgen05.copy import _S2TCopyBase
-from typing import Any, Optional
-from cutlass._mlir import ir
 
 
 def _check_required_args(
@@ -53,14 +55,11 @@ def _tma_copy_impl(
     #
     if "tma_multicast" in kwargs:
         if not isinstance(
-            tiled_copy.op,
-            (
-                CopyBulkTensorTileG2SOp,
-            ),
+            tiled_copy.op, (CopyBulkTensorTileG2SOp, CopyBulkTensorIm2ColG2SOp)
         ):
             raise ValueError(
                 "block_copy with tma_multicast expects a non-multicast G2S TMA copy atom "
-                "(CopyBulkTensorTileG2SOp) for compiler-driven multicast"
+                "(CopyBulkTensorTileG2SOp or CopyBulkTensorIm2ColG2SOp) for compiler-driven multicast"
             )
         # Mark as coming from block API
         kwargs["tma_multicast"]["from_block_api"] = True
@@ -73,6 +72,8 @@ def _tma_copy_impl(
         (
             CopyBulkTensorTileG2SOp,
             CopyBulkTensorTileG2SMulticastOp,
+            CopyBulkTensorIm2ColG2SOp,
+            CopyBulkTensorIm2ColG2SMulticastOp,
         ),
     )
     _check_required_args(["tma_bar_ptr"], kwargs, is_bar_ptr_required)
@@ -81,10 +82,7 @@ def _tma_copy_impl(
     # TMA bulk tensor copies: partition via tma_partition
     #
     is_g2s = isinstance(
-        tiled_copy.op,
-        (
-            CopyBulkTensorTileG2SOp,
-        ),
+        tiled_copy.op, (CopyBulkTensorTileG2SOp, CopyBulkTensorIm2ColG2SOp)
     )
     stensor = dst if is_g2s else src
     gtensor = src if is_g2s else dst
diff --git a/python/CuTeDSL/cutlass/utils/blockscaled_layout.py b/python/CuTeDSL/cutlass/utils/blockscaled_layout.py
index 7bc4ee6f4..60f727dbf 100644
--- a/python/CuTeDSL/cutlass/utils/blockscaled_layout.py
+++ b/python/CuTeDSL/cutlass/utils/blockscaled_layout.py
@@ -157,10 +157,11 @@ def make_smem_layout_sfa(
     :return: Smem layout for SFA
     :rtype: cute.Layout
     """
+    assert isinstance(mma_tiler_mnk, tuple)
     # (CTA_Tile_Shape_M, MMA_Tile_Shape_K)
     sfa_tile_shape = (
-        mma_tiler_mnk[0] // cute.size(tiled_mma.thr_id.shape),  # type: ignore[index]
-        mma_tiler_mnk[2],  # type: ignore[index]
+        mma_tiler_mnk[0] // cute.size(tiled_mma.thr_id.shape),
+        mma_tiler_mnk[2],
     )
 
     # ((Atom_M, Rest_M),(Atom_K, Rest_K))
@@ -171,8 +172,8 @@ def make_smem_layout_sfa(
     )
 
     # Number of MMA instructions to cover all k-tiles
-    mma_tile_inst_m = mma_tiler_mnk[0] // cute.size(tiled_mma.shape_mnk, mode=[0])  # type: ignore[index]
-    mma_tile_inst_k = mma_tiler_mnk[2] // cute.size(tiled_mma.shape_mnk, mode=[2])  # type: ignore[index]
+    mma_tile_inst_m = mma_tiler_mnk[0] // cute.size(tiled_mma.shape_mnk, mode=[0])
+    mma_tile_inst_k = mma_tiler_mnk[2] // cute.size(tiled_mma.shape_mnk, mode=[2])
 
     # (CTA_Tile_Shape_M, MMA_Inst_Shape_K)
     sfa_tile_shape = cute.shape_div(sfa_tile_shape, (mma_tile_inst_m, mma_tile_inst_k))
@@ -225,10 +226,11 @@ def make_smem_layout_sfb(
     :return: Smem layout for SFA
     :rtype: cute.Layout
     """
+    assert isinstance(mma_tiler_mnk, tuple)
     # (Round_Up(CTA_Tile_Shape_N, 128), MMA_Tile_Shape_K)
     sfb_tile_shape = (
-        cute.round_up(mma_tiler_mnk[1], 128),  # type: ignore[index, arg-type]
-        mma_tiler_mnk[2],  # type: ignore[index]
+        cute.round_up(mma_tiler_mnk[1], 128),  # type: ignore[arg-type]
+        mma_tiler_mnk[2],
     )
 
     # ((Atom_N, Rest_N),(Atom_K, Rest_K))
@@ -239,8 +241,8 @@ def make_smem_layout_sfb(
     )
 
     # Number of MMA instructions to cover all k-tiles
-    mma_tile_inst_n = mma_tiler_mnk[1] // cute.size(tiled_mma.shape_mnk, mode=[1])  # type: ignore[index]
-    mma_tile_inst_k = mma_tiler_mnk[2] // cute.size(tiled_mma.shape_mnk, mode=[2])  # type: ignore[index]
+    mma_tile_inst_n = mma_tiler_mnk[1] // cute.size(tiled_mma.shape_mnk, mode=[1])
+    mma_tile_inst_k = mma_tiler_mnk[2] // cute.size(tiled_mma.shape_mnk, mode=[2])
 
     # (CTA_Tile_Shape_N, MMA_Inst_Shape_K)
     sfb_tile_shape = cute.shape_div(sfb_tile_shape, (mma_tile_inst_n, mma_tile_inst_k))
@@ -294,6 +296,7 @@ def sm120_make_smem_layout_sfa(
     """
 
     assert sf_vec_size == 16 or sf_vec_size == 32, "sf_vec_size must be 16 or 32"
+    assert isinstance(tile_shape_mnk, tuple)
 
     blk_mn = 128
     blk_sf = 4
@@ -305,26 +308,26 @@ def sm120_make_smem_layout_sfa(
     k_basic_block_shape = (sf_vec_size, mma_nsf)
     k_basic_block_stride = (0, 1)
 
-    assert tile_shape_mnk[0] % blk_mn == 0, (  # type: ignore[index, operator]
+    assert tile_shape_mnk[0] % blk_mn == 0, (  # type: ignore[operator]
         "tile_shape_mnk[0] must be divisible by blk_mn"
     )
 
-    sSFA_shapeM = (mn_basic_block_shape, tile_shape_mnk[0] // blk_mn)  # type: ignore[index, operator]
+    sSFA_shapeM = (mn_basic_block_shape, tile_shape_mnk[0] // blk_mn)  # type: ignore[operator]
     sSF_strideM = (mn_basic_block_stride, blk_elems)
 
-    assert tile_shape_mnk[2] % (blk_sf * mma_nsf) == 0, (  # type: ignore[index]
+    assert tile_shape_mnk[2] % (blk_sf * mma_nsf) == 0, (
         "tile_shape_mnk[2] must be divisible by blk_sf * mma_nsf"
     )
 
     sSFA_shapeK = (
         k_basic_block_shape,
         blk_sf // mma_nsf,
-        tile_shape_mnk[2] // sf_vec_size // blk_sf,  # type: ignore[index, operator]
+        tile_shape_mnk[2] // sf_vec_size // blk_sf,  # type: ignore[operator]
     )
     sSF_strideK = (
         k_basic_block_stride,
         mma_nsf,
-        tile_shape_mnk[0] // blk_mn * blk_elems,  # type: ignore[index, operator]
+        tile_shape_mnk[0] // blk_mn * blk_elems,  # type: ignore[operator]
     )
 
     sSFA_shape = (sSFA_shapeM, sSFA_shapeK)
@@ -380,12 +383,13 @@ def sm120_make_smem_layout_sfb(
     blk_elems = blk_mn * blk_sf
 
     assert sf_vec_size == 16 or sf_vec_size == 32, "sf_vec_size must be 16 or 32"
+    assert isinstance(tile_shape_mnk, tuple)
 
-    assert tile_shape_mnk[1] % blk_mn == 0, (  # type: ignore[index, operator]
+    assert tile_shape_mnk[1] % blk_mn == 0, (  # type: ignore[operator]
         "tile_shape_mnk[1] must be divisible by blk_mn"
     )
 
-    assert tile_shape_mnk[2] % sf_vec_size == 0, (  # type: ignore[index, operator]
+    assert tile_shape_mnk[2] % sf_vec_size == 0, (  # type: ignore[operator]
         "tile_shape_mnk[2] must be divisible by sf_vec_size"
     )
 
@@ -396,26 +400,26 @@ def sm120_make_smem_layout_sfb(
     k_basic_block_shape = (sf_vec_size, mma_nsf)
     k_basic_block_stride = (0, 1)
 
-    assert tile_shape_mnk[1] % blk_mn == 0, (  # type: ignore[index, operator]
+    assert tile_shape_mnk[1] % blk_mn == 0, (  # type: ignore[operator]
         "tile_shape_mnk[1] must be divisible by blk_mn"
     )
 
-    sSFA_shapeN = (mn_basic_block_shape, tile_shape_mnk[1] // blk_mn)  # type: ignore[index, operator]
+    sSFA_shapeN = (mn_basic_block_shape, tile_shape_mnk[1] // blk_mn)  # type: ignore[operator]
     sSF_strideN = (mn_basic_block_stride, blk_elems)
 
-    assert tile_shape_mnk[2] % (blk_sf * mma_nsf) == 0, (  # type: ignore[index]
+    assert tile_shape_mnk[2] % (blk_sf * mma_nsf) == 0, (
         "tile_shape_mnk[2] must be divisible by blk_sf * mma_nsf"
     )
 
     sSFA_shapeK = (
         k_basic_block_shape,
         blk_sf // mma_nsf,
-        tile_shape_mnk[2] // sf_vec_size // blk_sf,  # type: ignore[index, operator]
+        tile_shape_mnk[2] // sf_vec_size // blk_sf,  # type: ignore[operator]
     )
     sSF_strideK = (
         k_basic_block_stride,
         mma_nsf,
-        tile_shape_mnk[1] // blk_mn * blk_elems,  # type: ignore[index, operator]
+        tile_shape_mnk[1] // blk_mn * blk_elems,  # type: ignore[operator]
     )
 
     sSFA_shape = (sSFA_shapeN, sSFA_shapeK)
@@ -463,8 +467,9 @@ def make_tmem_layout_sfa(
     :return: TMEM layout for SFA
     :rtype: cute.Layout
     """
+    assert isinstance(mma_tiler_mnk, tuple)
     atom_thr_size = cute.size(tiled_mma.thr_id.shape, loc=loc, ip=ip)
-    cta_tile_shape_m = mma_tiler_mnk[0] // atom_thr_size  # type: ignore[index]
+    cta_tile_shape_m = mma_tiler_mnk[0] // atom_thr_size
 
     sfa_layout_ty = _cute_nvgpu_ir.make_tmem_layout_sfa(
         smem_layout, cta_tile_shape_m, atom_thr_size, sf_vec_size
@@ -501,8 +506,9 @@ def make_tmem_layout_sfb(
     :return: TMEM layout for SFB
     :rtype: cute.Layout
     """
+    assert isinstance(mma_tiler_mnk, tuple)
     atom_thr_size = cute.size(tiled_mma.thr_id.shape, loc=loc, ip=ip)
-    cta_tile_shape_m = mma_tiler_mnk[0] // atom_thr_size  # type: ignore[index]
+    cta_tile_shape_m = mma_tiler_mnk[0] // atom_thr_size
 
     sfb_layout_ty = _cute_nvgpu_ir.make_tmem_layout_sfb(
         smem_layout, cta_tile_shape_m, atom_thr_size, sf_vec_size
@@ -525,14 +531,12 @@ class Sm103BlockScaledBasicChunk:
     _layout: cute.Layout = field(init=False, repr=False)
 
     def __post_init__(self) -> None:
-        atom_shape: cute.Shape
-        atom_stride: cute.Stride
         if self.major_mode == OperandMajorMode.K:
             atom_shape = ((8, 4, 4), (self.sf_vec_size, 4))
             atom_stride = ((16, 128, 4), (0, 1))
         else:
-            atom_shape = ((self.sf_vec_size, 4), (8, 4, 4))
-            atom_stride = ((0, 1), (16, 128, 4))
+            atom_shape = ((self.sf_vec_size, 4), (8, 4, 4))  # type: ignore[assignment]
+            atom_stride = ((0, 1), (16, 128, 4))  # type: ignore[assignment]
 
         object.__setattr__(
             self, "_layout", cute.make_layout(shape=atom_shape, stride=atom_stride)
@@ -569,7 +573,8 @@ def sm103_make_smem_layout_sfa(
     :return: Smem layout for SFA
     :rtype: cute.Layout
     """
-    mma_shape_mk = tiled_mma.partition_shape_A((mma_tiler[0], mma_tiler[2]))  # type: ignore[index]
+    assert isinstance(mma_tiler, tuple)
+    mma_shape_mk = tiled_mma.partition_shape_A((mma_tiler[0], mma_tiler[2]))
     sf_atom = Sm103BlockScaledBasicChunk(sf_vec_size, tiled_mma.op.a_major_mode).layout  # type: ignore[attr-defined]
     k_divisor = 4 if sf_vec_size == 16 else 2
     mma_sfa_tiler = (
@@ -617,9 +622,10 @@ def sm103_make_smem_layout_sfb(
     :return: Smem layout for SFB
     :rtype: cute.Layout
     """
+    assert isinstance(mma_tiler, tuple)
     sf_atom = Sm103BlockScaledBasicChunk(sf_vec_size, tiled_mma.op.b_major_mode).layout  # type: ignore[attr-defined]
     k_divisor = 4 if sf_vec_size == 16 else 2
-    mma_sfb_tiler = (mma_tiler[1], mma_tiler[2] // k_divisor)  # type: ignore[index, operator]
+    mma_sfb_tiler = (mma_tiler[1], mma_tiler[2] // k_divisor)  # type: ignore[operator]
     if mma_sfb_tiler[0] == 128:
         sfb_smem_atom_layout = cute.tiled_product(
             sf_atom,
diff --git a/python/CuTeDSL/cutlass/utils/distributed.py b/python/CuTeDSL/cutlass/utils/distributed.py
index 2adcc47a8..ec5f98b9a 100644
--- a/python/CuTeDSL/cutlass/utils/distributed.py
+++ b/python/CuTeDSL/cutlass/utils/distributed.py
@@ -15,7 +15,7 @@ from typing import Literal, Optional, Tuple, Type, Union
 import cutlass
 import cutlass.cute as cute
 from cutlass.cute.typing import Pointer, Int32
-from cutlass.cutlass_dsl import Numeric, T, dsl_user_op
+from cutlass.cutlass_dsl import T, Numeric, dsl_user_op
 from cutlass._mlir import ir
 from cutlass._mlir.dialects import llvm
 from typing_extensions import deprecated
diff --git a/python/CuTeDSL/cutlass/utils/dynamic_persistent_tile_scheduler.py b/python/CuTeDSL/cutlass/utils/dynamic_persistent_tile_scheduler.py
index 05122e3f4..4b5ab769b 100644
--- a/python/CuTeDSL/cutlass/utils/dynamic_persistent_tile_scheduler.py
+++ b/python/CuTeDSL/cutlass/utils/dynamic_persistent_tile_scheduler.py
@@ -10,7 +10,7 @@
 # is strictly prohibited.
 
 import inspect
-from typing import Optional, Tuple
+from typing import Any, Optional, Tuple
 
 import cutlass
 from cutlass.cutlass_dsl import (
@@ -26,9 +26,16 @@ from cutlass._mlir import ir
 from cutlass.utils.static_persistent_tile_scheduler import (
     WorkTileInfo,
 )
+from typing_extensions import deprecated
 import cutlass.cute as cute
 
+_DEPRECATION_MSG = (
+    "Migrated to examples/CuTeDSL/helpers/dynamic_persistent_tile_scheduler.py "
+    "(BSD-3). The wheel copy will be removed in a future release."
+)
 
+
+@deprecated(_DEPRECATION_MSG)
 class ClcDynamicPersistentTileSchedulerParams:
     """A class to represent parameters for a dynamic persistent tile scheduler.
 
@@ -67,15 +74,16 @@ class ClcDynamicPersistentTileSchedulerParams:
         :raises ValueError: If cluster_shape_k is not 1.
         """
 
-        if cluster_shape_mnk[2] != 1:  # type: ignore[index]
-            raise ValueError(f"unsupported cluster_shape_k {cluster_shape_mnk[2]}")  # type: ignore[index]
+        assert isinstance(cluster_shape_mnk, tuple)
+        if cluster_shape_mnk[2] != 1:
+            raise ValueError(f"unsupported cluster_shape_k {cluster_shape_mnk[2]}")
         if swizzle_size < 1:
             raise ValueError(f"expect swizzle_size >= 1, but get {swizzle_size}")
 
         self.problem_shape_ntile_mnl = problem_shape_ntile_mnl
         # cluster_shape_mnk is kept for reconstruction
         self._cluster_shape_mnk = cluster_shape_mnk
-        self.cluster_shape_mn = cluster_shape_mnk[:2]  # type: ignore[index]
+        self.cluster_shape_mn = cluster_shape_mnk[:2]
         self.swizzle_size = swizzle_size
         self._raster_along_m = raster_along_m
         self.cluster_shape_major_fdd = None
@@ -86,7 +94,7 @@ class ClcDynamicPersistentTileSchedulerParams:
         self.problem_layout_ncluster_mnl = cute.make_layout(
             cute.ceil_div(
                 self.problem_shape_ntile_mnl,
-                cluster_shape_mnk[:2],  # type: ignore[index]
+                cluster_shape_mnk[:2],
                 loc=loc,
                 ip=ip,
             ),
@@ -100,18 +108,19 @@ class ClcDynamicPersistentTileSchedulerParams:
                 self.problem_layout_ncluster_mnl.shape,
                 (1, swizzle_size, 1) if raster_along_m else (swizzle_size, 1, 1),
             )
+            assert isinstance(problem_shape_ncluster_mnl, tuple)
 
             if raster_along_m:
                 self.problem_layout_ncluster_mnl = cute.make_layout(
                     (
-                        problem_shape_ncluster_mnl[0],  # type: ignore[index]
-                        (swizzle_size, problem_shape_ncluster_mnl[1] // swizzle_size),  # type: ignore[index, operator]
-                        problem_shape_ncluster_mnl[2],  # type: ignore[index]
+                        problem_shape_ncluster_mnl[0],
+                        (swizzle_size, problem_shape_ncluster_mnl[1] // swizzle_size),  # type: ignore[operator]
+                        problem_shape_ncluster_mnl[2],
                     ),
                     stride=(
                         swizzle_size,
-                        (1, swizzle_size * problem_shape_ncluster_mnl[0]),  # type: ignore[index]
-                        problem_shape_ncluster_mnl[0] * problem_shape_ncluster_mnl[1],  # type: ignore[index, operator]
+                        (1, swizzle_size * problem_shape_ncluster_mnl[0]),
+                        problem_shape_ncluster_mnl[0] * problem_shape_ncluster_mnl[1],  # type: ignore[operator]
                     ),
                     loc=loc,
                     ip=ip,
@@ -119,14 +128,14 @@ class ClcDynamicPersistentTileSchedulerParams:
             else:
                 self.problem_layout_ncluster_mnl = cute.make_layout(
                     (
-                        (swizzle_size, problem_shape_ncluster_mnl[0] // swizzle_size),  # type: ignore[index, operator]
-                        problem_shape_ncluster_mnl[1],  # type: ignore[index]
-                        problem_shape_ncluster_mnl[2],  # type: ignore[index]
+                        (swizzle_size, problem_shape_ncluster_mnl[0] // swizzle_size),  # type: ignore[operator]
+                        problem_shape_ncluster_mnl[1],
+                        problem_shape_ncluster_mnl[2],
                     ),
                     stride=(
-                        (1, swizzle_size * problem_shape_ncluster_mnl[1]),  # type: ignore[index]
+                        (1, swizzle_size * problem_shape_ncluster_mnl[1]),
                         swizzle_size,
-                        problem_shape_ncluster_mnl[0] * problem_shape_ncluster_mnl[1],  # type: ignore[index, operator]
+                        problem_shape_ncluster_mnl[0] * problem_shape_ncluster_mnl[1],  # type: ignore[operator]
                     ),
                     loc=loc,
                     ip=ip,
@@ -260,6 +269,7 @@ ClcDynamicPersistentTileSchedulerParams.__init__.__signature__ = inspect.Signatu
 )
 
 
+@deprecated(_DEPRECATION_MSG)
 class ClcDynamicPersistentTileScheduler:
     """A scheduler for dynamic persistent tile execution in CUTLASS/CuTe kernels.
 
@@ -278,6 +288,7 @@ class ClcDynamicPersistentTileScheduler:
         num_tiles_executed: Int32,
         clc_response_ptr: cute.Pointer,
         block_idx: Tuple[Integer, Integer, Integer],
+        insert_fence: bool = True,
     ):
         """
         Initializes the ClcDynamicPersistentTileScheduler with the given parameters.
@@ -292,12 +303,21 @@ class ClcDynamicPersistentTileScheduler:
         :type clc_response_ptr: cute.Pointer
         :param block_idx: The block index.
         :type block_idx: Tuple[Integer, Integer, Integer]
+        :param insert_fence: Whether to insert a cross-proxy fence after loading the response.
+            The CLC query is issued through the async proxy, while the response is loaded
+            from shared memory through the generic proxy. A cross-proxy fence is needed so
+            that the producer's next issue does not race with the consumer's current load.
+            The scheduler therefore inserts a fence by default after loading the response.
+            If the fence is already inserted in the pipeline acquire/release functions,
+            the fence here can be omitted by passing ``False``.
+        :type insert_fence: bool
         """
         self.params = params
         self.cta_id_in_cluster = cta_id_in_cluster
         self._num_tiles_executed = num_tiles_executed
         self._clc_response_ptr = clc_response_ptr
         self._block_idx = block_idx
+        self.insert_fence = insert_fence
 
     def __extract_mlir_values__(self) -> list[ir.Value]:
         values = extract_mlir_values(self.cta_id_in_cluster)
@@ -333,6 +353,7 @@ class ClcDynamicPersistentTileScheduler:
         block_idx: Tuple[Integer, Integer, Integer],
         grid_dim: Tuple[Integer, Integer, Integer],
         clc_response_ptr: cute.Pointer,
+        insert_fence: bool = True,
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
@@ -374,6 +395,7 @@ class ClcDynamicPersistentTileScheduler:
             num_tiles_executed,
             clc_response_ptr,
             block_idx,
+            insert_fence,
         )
 
     # called by host
@@ -454,10 +476,12 @@ class ClcDynamicPersistentTileScheduler:
         result_addr: 16-byte response data (simulating shared memory access)
         """
         m_idx, n_idx, l_idx, vld = cute.arch.clc_response(result_addr, loc=loc, ip=ip)
-        cute.arch.fence_proxy(
-            "async.shared",
-            space="cta",
-        )
+
+        # Cross-proxy fence to ensure generic proxy accesses are seen by later
+        # async proxy operations.
+        if self.insert_fence:
+            cute.arch.fence_proxy("async.shared", space="cta", loc=loc, ip=ip)
+
         m_idx, n_idx, l_idx = self._swizzle_and_rasterize(
             m_idx, n_idx, l_idx, loc=loc, ip=ip
         )
@@ -467,6 +491,7 @@ class ClcDynamicPersistentTileScheduler:
         return WorkTileInfo(cur_tile_coord, vld)
 
     @dsl_user_op
+    @cute.jit
     def get_current_work(
         self,
         *,
@@ -474,15 +499,13 @@ class ClcDynamicPersistentTileScheduler:
         ip: Optional[ir.InsertionPoint] = None,
     ) -> WorkTileInfo:
         smem_addr = self._clc_response_ptr
-        work_tile = self.work_tile_info_from_clc_response(smem_addr)
+        work_tile = self.work_tile_info_from_clc_response(smem_addr, loc=loc, ip=ip)
         return work_tile
 
     @dsl_user_op
+    @cute.jit
     def initial_work_tile_info(
-        self,
-        *,
-        loc: Optional[ir.Location] = None,
-        ip: Optional[ir.InsertionPoint] = None,
+        self, *, loc: Any = None, ip: Any = None
     ) -> WorkTileInfo:
         bidx, bidy, bidz = self._block_idx
         # Subtract cta_id_in_cluster from block_idx because swizzle_and_rasterize expects coordinates to be
@@ -498,6 +521,7 @@ class ClcDynamicPersistentTileScheduler:
         return WorkTileInfo(cur_tile_coord, Boolean(True))
 
     @dsl_user_op
+    @cute.jit
     def advance_to_next_work(
         self,
         mbarrier_addr: cute.Pointer,
@@ -513,5 +537,6 @@ class ClcDynamicPersistentTileScheduler:
         self._num_tiles_executed += Int32(1)
 
     @property
+    @cute.jit
     def num_tiles_executed(self) -> Int32:
         return self._num_tiles_executed
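
Review note: the raster-along-M layout built in `ClcDynamicPersistentTileSchedulerParams` has shape `(M, (swizzle, N // swizzle), L)` and stride `(swizzle, (1, swizzle * M), M * N)`. The sketch below evaluates only that layout function (coordinate dotted with stride) in plain Python. The function name `swizzled_cluster_index` is hypothetical, the N cluster count is assumed to already be a multiple of `swizzle_size`, and how the scheduler consumes the layout is not shown in this diff.

```python
def swizzled_cluster_index(m, n, l, num_m, num_n, swizzle_size):
    """Linear index of cluster (m, n, l) under the raster-along-M layout."""
    # Split n into (inner, outer) to match the (swizzle, N // swizzle) mode.
    n_inner, n_outer = n % swizzle_size, n // swizzle_size
    return (
        m * swizzle_size                   # stride of the M mode
        + n_inner * 1                      # stride of the inner N mode
        + n_outer * swizzle_size * num_m   # stride of the outer N mode
        + l * num_m * num_n                # stride of the L mode
    )


# 2 x 4 cluster grid, swizzle_size = 2: list (m, n) in increasing index order.
order = sorted(
    ((m, n) for m in range(2) for n in range(4)),
    key=lambda mn: swizzled_cluster_index(mn[0], mn[1], 0, 2, 4, 2),
)
print(order)
# [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
```

With these sizes, consecutive indices sweep a band of `swizzle_size` columns across all M before moving to the next band.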
diff --git a/python/CuTeDSL/cutlass/utils/gemm/sm100.py b/python/CuTeDSL/cutlass/utils/gemm/sm100.py
index ead18ca60..030955d80 100644
--- a/python/CuTeDSL/cutlass/utils/gemm/sm100.py
+++ b/python/CuTeDSL/cutlass/utils/gemm/sm100.py
@@ -14,6 +14,7 @@ import cutlass.cute as cute
 from cutlass.cutlass_dsl import Int32, Boolean, Constexpr, const_expr
 import cutlass.pipeline as pipeline
 from cutlass.utils.blackwell_helpers import get_tmem_load_op, get_smem_store_op
+from cutlass.cute.arch.constants import WARP_SIZE
 from cutlass.cute.nvgpu import cpasync, tcgen05
 from cutlass.cute.nvgpu.common import CacheEvictionPriority
 
@@ -45,12 +46,18 @@ def transform_partitioned_tensor_layout(tensor: cute.Tensor) -> cute.Tensor:
 
     shape = layout.shape
     stride = layout.stride  # type: ignore[union-attr]
+    assert isinstance(shape, tuple)
+    assert isinstance(stride, tuple)
+    shape_0 = shape[0]
+    stride_0 = stride[0]
+    assert isinstance(shape_0, tuple)
+    assert isinstance(stride_0, tuple)
 
     # Build new shape: ((shape[0][0], shape[1]), (shape[0][1], shape[2]), ...rest)
-    new_shape = ((shape[0][0], shape[1]), (shape[0][1], shape[2]), *shape[3:])  # type: ignore[index]
+    new_shape = ((shape_0[0], shape[1]), (shape_0[1], shape[2]), *shape[3:])
 
     # Build new stride: ((stride[0][0], stride[1]), (stride[0][1], stride[2]), ...rest)
-    new_stride = ((stride[0][0], stride[1]), (stride[0][1], stride[2]), *stride[3:])  # type: ignore[index]
+    new_stride = ((stride_0[0], stride[1]), (stride_0[1], stride[2]), *stride[3:])
 
     new_layout = cute.make_layout(shape=new_shape, stride=new_stride)
 
@@ -93,7 +100,9 @@ def epilogue_tmem_copy_and_partition(
         - tTR_rAcc: The accumulated tensor in register used to hold t2r results
     :rtype: Tuple[cute.TiledCopy, cute.Tensor, cute.Tensor]
     """
-    # Make tiledCopy for tensor memory load
+    # Make tiledCopy for tensor memory load.
+    # Kernels may expose `tmem_warp_shape_mn` to override the default
+    # (2CTA+M64 -> (2,2), else (4,1)) rule.
     copy_atom_t2r = get_tmem_load_op(
         gemm_kernel.cta_tile_shape_mnk,
         gemm_kernel.c_layout,
@@ -101,6 +110,7 @@ def epilogue_tmem_copy_and_partition(
         gemm_kernel.acc_dtype,
         epi_tile,
         use_2cta_instrs,
+        tmem_warp_shape_mn=getattr(gemm_kernel, "tmem_warp_shape_mn", None),
     )
     # (EPI_TILE_M, EPI_TILE_N, EPI_M, EPI_N, STAGE)
     tAcc_epi = cute.flat_divide(
@@ -217,7 +227,7 @@ def epilogue_tma_store(
 
     epilog_sync_barrier = pipeline.NamedBarrier(
         barrier_id=gemm_kernel.epilog_sync_bar_id,
-        num_threads=32 * len(gemm_kernel.epilogue_warp_id),
+        num_threads=WARP_SIZE * len(gemm_kernel.epilogue_warp_id),
     )
 
     #
@@ -254,7 +264,7 @@ def epilogue_tma_store(
         # Convert to C type
         #
         acc_vec = tiled_copy_r2s.retile(tTR_rAcc).load()
-        acc_vec = epilogue_op(acc_vec.to(gemm_kernel.c_dtype))  # type: ignore[operator]
+        acc_vec = epilogue_op(acc_vec.to(gemm_kernel.c_dtype))
         tRS_rC.store(acc_vec)
 
         #
@@ -307,7 +317,7 @@ def epilogue(
     acc_pipeline: pipeline.PipelineAsync,
     tCcC_base: Optional[cute.Tensor] = None,
     mC_mnl: Optional[cute.Tensor] = None,
-    overlapping_accum: Constexpr = False,  # type: ignore[assignment]
+    overlapping_accum: Constexpr = False,
 ) -> pipeline.PipelineState:
     """
     Epilogue function that stores accumulator results directly to global memory.
@@ -480,7 +490,7 @@ def epilogue(
         # Convert to C type
         #
         acc_vec = tTR_rAcc.load()
-        acc_vec = epilogue_op(acc_vec.to(gemm_kernel.c_dtype))  # type: ignore[operator]
+        acc_vec = epilogue_op(acc_vec.to(gemm_kernel.c_dtype))
         tTR_rC.store(acc_vec)
 
         if const_expr(use_predication):
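
Review note: the regrouping done by `transform_partitioned_tensor_layout` after this change, `((a, b), c, d, *rest) -> ((a, c), (b, d), *rest)`, can be checked on plain tuples without an MLIR context. This is a sketch only: `regroup_modes` is a hypothetical helper and the shape and stride values are made up for illustration.

```python
def regroup_modes(t):
    """Regroup ((a, b), c, d, *rest) into ((a, c), (b, d), *rest)."""
    # Same narrowing the diff performs with isinstance asserts.
    assert isinstance(t, tuple)
    head = t[0]
    assert isinstance(head, tuple)
    return ((head[0], t[1]), (head[1], t[2]), *t[3:])


# Illustrative values only; the same rule is applied to shape and stride.
shape = ((128, 64), 2, 4, 3)
stride = ((64, 1), 8192, 16384, 65536)

print(regroup_modes(shape))   # ((128, 2), (64, 4), 3)
print(regroup_modes(stride))  # ((64, 8192), (1, 16384), 65536)
```

Binding `shape[0]` to a local and asserting it is a tuple is what lets the diff drop the `# type: ignore[index]` comments on the nested indexing.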
diff --git a/python/CuTeDSL/cutlass/utils/gemm/tensor_utils.py b/python/CuTeDSL/cutlass/utils/gemm/tensor_utils.py
index a01ceb71d..bae47a221 100644
--- a/python/CuTeDSL/cutlass/utils/gemm/tensor_utils.py
+++ b/python/CuTeDSL/cutlass/utils/gemm/tensor_utils.py
@@ -139,6 +139,7 @@ def get_gemm_tensor(
     :rtype: cutlass.cute.Tensor
     """
 
+    # Use uint8 view for conversion, then restore element_type after from_dlpack().
     _TYPES_NOT_SUPPORTED_BY_DLPACK = {
         torch.float8_e4m3fn: cute.Float8E4M3FN,
         torch.float8_e5m2: cute.Float8E5M2,
@@ -214,7 +215,7 @@ def get_gemm_tensors(
 
 def create_scale_factor_tensor(
     MN: int, K: int, L: int, sf_vec_size: int, sf_dtype: Type[Numeric]
-) -> Tuple[torch.Tensor, cute.Tensor]:
+) -> Tuple[torch.Tensor, cute.Tensor, torch.Tensor]:
     """
     Create a random scale-factor tensor in BlockScaledBasicChunk layout.
 
@@ -224,9 +225,10 @@ def create_scale_factor_tensor(
 
     Scale factor values are drawn uniformly from ``{1.0, 2.0, 4.0}``.
 
-    Two tensors are returned: a logical FP32 tensor for host-side reference
-    computation, and a CuTe tensor in the on-device packed layout for passing
-    to the GPU kernel.
+    Three tensors are returned: a logical FP32 tensor for host-side reference
+    computation, a CuTe tensor in the on-device packed layout for passing
+    to the GPU kernel, and the backing PyTorch CUDA tensor whose
+    ``data_ptr()`` can be used for pointer-array construction.
 
     :param MN: Size of the MN dimension of the operand to be scaled.
     :type MN: int
@@ -239,12 +241,13 @@ def create_scale_factor_tensor(
     :type sf_vec_size: int
     :param sf_dtype: CuTe element type for the on-device scale factors
         (e.g. ``cute.Float8E4M3FN``, ``cute.Float8E8M0FNU``).
-    :return: ``(sf_torch, sf_cute)`` where ``sf_torch`` is an FP32 CPU tensor
-        of shape ``(MN, K, L)`` with scale factors unpacked into a dense
-        layout suitable for element-wise multiplication with A or B, and
+    :return: ``(sf_ref, sf_cute, sf_gpu)`` where ``sf_ref`` is an FP32 CPU
+        tensor of shape ``(MN, K, L)`` with scale factors unpacked into a
+        dense layout suitable for element-wise multiplication with A or B,
         ``sf_cute`` is a CuTe CUDA tensor in BlockScaledBasicChunk layout
-        with ``element_type`` set to ``sf_dtype``.
-    :rtype: tuple[torch.Tensor, cutlass.cute.Tensor]
+        with ``element_type`` set to ``sf_dtype``, and ``sf_gpu`` is the
+        backing PyTorch CUDA tensor for pointer-array construction.
+    :rtype: tuple[torch.Tensor, cutlass.cute.Tensor, torch.Tensor]
     """
 
     def unpack_scale_factors(
@@ -319,11 +322,12 @@ def create_scale_factor_tensor(
         cutlass_torch.dtype(sf_dtype)
     )
 
-    sf_cute = from_dlpack(sf_torch.cuda().view(dtype=torch.uint8), assumed_align=16)
+    sf_gpu = sf_torch.cuda()
+    sf_cute = from_dlpack(sf_gpu.view(dtype=torch.uint8), assumed_align=16)
     sf_cute.element_type = sf_dtype
     sf_torch = unpack_scale_factors(sf_torch.to(torch.float32), sf_vec_size, MN, K, L)
 
-    return sf_torch, sf_cute
+    return sf_torch, sf_cute, sf_gpu
 
 
 def decode_float4e2m1fn(u8: torch.Tensor) -> torch.Tensor:
diff --git a/python/CuTeDSL/cutlass/utils/grouped_gemm_persistent_tile_scheduler.py b/python/CuTeDSL/cutlass/utils/grouped_gemm_persistent_tile_scheduler.py
index efd683cf7..40cba6c33 100644
--- a/python/CuTeDSL/cutlass/utils/grouped_gemm_persistent_tile_scheduler.py
+++ b/python/CuTeDSL/cutlass/utils/grouped_gemm_persistent_tile_scheduler.py
@@ -30,7 +30,13 @@ from cutlass.utils import (
     WorkTileInfo,
 )
 
+_DEPRECATION_MSG = (
+    "Migrated to examples/CuTeDSL/helpers/grouped_gemm_persistent_tile_scheduler.py "
+    "(BSD-3). The wheel copy will be removed in a future release."
+)
 
+
+@deprecated(_DEPRECATION_MSG)
 class GroupSearchResult:
     """
     The result of the group search for grouped gemm.
@@ -84,6 +90,7 @@ class GroupSearchResult:
         return GroupSearchResult(*tuple(values))
 
 
+@deprecated(_DEPRECATION_MSG)
 class GroupedGemmGroupSearchState:
     """
     The state of group index search for grouped gemm.
@@ -94,8 +101,9 @@ class GroupedGemmGroupSearchState:
     :type start_group_idx: Int32
     :param tile_count_prev_group: Number of tiles before the matched group
     :type tile_count_prev_group: Int32
-    :param tile_count_searched: Number of tiles we have searched. When the matched group is found,
-                               it records the number of tiles including the matched group
+    :param tile_count_searched: Number of tiles we have searched. When the matched group
+                                is found, it records the number of tiles including the
+                                matched group
     :type tile_count_searched: Int32
     """
 
@@ -135,6 +143,7 @@ class GroupedGemmGroupSearchState:
         )
 
 
+@deprecated(_DEPRECATION_MSG)
 def create_initial_search_state() -> GroupedGemmGroupSearchState:
     """
     Create an initial search state for grouped gemm.
@@ -151,6 +160,7 @@ def create_initial_search_state() -> GroupedGemmGroupSearchState:
 
 
 # Grouped Work Tile Information
+@deprecated(_DEPRECATION_MSG)
 class GroupedWorkTileInfo(WorkTileInfo):
     """A class to represent information about a work tile.
 
@@ -180,7 +190,9 @@ class GroupedWorkTileInfo(WorkTileInfo):
     def __new_from_mlir_values__(self, values: list[ir.Value]) -> "GroupedWorkTileInfo":
         if len(values) != 11:
             raise ValueError("Length of mlir values extracted is incorrect.")
-        new_tile_idx = new_from_mlir_values(self._tile_idx, values[:3])
+        # Reconstruct tile_idx as a tuple -- WorkTileInfo.__init__ unpacks it
+        # into _tile_m, _tile_n, _tile_l.
+        new_tile_idx = (Int32(values[0]), Int32(values[1]), Int32(values[2]))
         new_is_valid_tile = new_from_mlir_values(self._is_valid_tile, [values[3]])
         new_group_search_result = new_from_mlir_values(
             self.group_search_result, values[4:11]
@@ -191,10 +203,13 @@ class GroupedWorkTileInfo(WorkTileInfo):
 
 
 # Static Persistent Grouped GEMM
+@deprecated(_DEPRECATION_MSG)
 class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
-    """A scheduler for static persistent group-based tile execution in CUTLASS/CuTe kernels.
+    """A scheduler for static persistent group-based tile execution in CUTLASS/CuTe
+    kernels.
 
-    :ivar params: Tile schedule related params, including cluster shape and problem_layout_ncluster_mnl
+    :ivar params: Tile schedule related params, including cluster shape and
+                  problem_layout_ncluster_mnl
     :type params: PersistentTileSchedulerParams
     :ivar num_persistent_clusters: Number of persistent clusters that can be launched
     :type num_persistent_clusters: Int32
@@ -225,8 +240,9 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         search_state: GroupedGemmGroupSearchState,
         group_count: int,
         problem_shape_mnkl: cute.Tensor,
-        cached_problem_shape_0: cute.Tensor,
-        cached_problem_shape_1: cute.Tensor,
+        cached_problem_shape_0: Tuple[Int32, Int32, Int32, Int32],
+        cached_problem_shape_1: Tuple[Int32, Int32, Int32, Int32],
+        use_cached_problem_shapes: bool = False,
     ):
         StaticPersistentTileScheduler.__init__(
             self,
@@ -241,6 +257,7 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         self.cluster_tile_shape_mnk = cluster_tile_shape_mnk
         self.search_state = search_state
         self.problem_shape_mnkl = problem_shape_mnkl
+        self.use_cached_problem_shapes = use_cached_problem_shapes
 
         self.cached_problem_shape_0 = cached_problem_shape_0
         self.cached_problem_shape_1 = cached_problem_shape_1
@@ -260,7 +277,7 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
     def __new_from_mlir_values__(
         self, values: list[ir.Value]
     ) -> "StaticPersistentGroupTileScheduler":
-        if len(values) < 13:
+        if len(values) < 19:
             raise ValueError("Length of mlir values extracted is incorrect.")
         new_num_persistent_clusters = new_from_mlir_values(
             self.num_persistent_clusters, [values[0]]
@@ -276,13 +293,20 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         )
         search_state = new_from_mlir_values(self.search_state, values[6:10])
         problem_shape_mnkl = new_from_mlir_values(self.problem_shape_mnkl, [values[10]])
-        cached_problem_shape_0 = new_from_mlir_values(
-            self.cached_problem_shape_0, [values[11]]
+
+        cached_problem_shape_0 = (
+            Int32(values[11]),
+            Int32(values[12]),
+            Int32(values[13]),
+            Int32(values[14]),
         )
-        cached_problem_shape_1 = new_from_mlir_values(
-            self.cached_problem_shape_1, [values[12]]
+        cached_problem_shape_1 = (
+            Int32(values[15]),
+            Int32(values[16]),
+            Int32(values[17]),
+            Int32(values[18]),
         )
-        params = new_from_mlir_values(self.params, values[13:])
+        params = new_from_mlir_values(self.params, values[19:])
 
         return StaticPersistentGroupTileScheduler(
             params,
@@ -308,6 +332,7 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         initial_search_state: GroupedGemmGroupSearchState,
         group_count: int,
         problem_shape_mnkl: cute.Tensor,
+        use_cached_problem_shapes: bool = False,
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
@@ -329,6 +354,11 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         :type group_count: int
         :param problem_shape_mnkl: Problem shape tensor for groups
         :type problem_shape_mnkl: cute.Tensor
+        :param use_cached_problem_shapes: Enable double-buffered caching of problem
+            shapes. When False (the default), problem shapes are loaded on demand.
+            Evaluated at compile time via const_expr(). When True, always pair it
+            with a call to prefetch_problem_shapes().
+        :type use_cached_problem_shapes: bool
 
         :return: A StaticPersistentGroupTileScheduler object.
         :rtype: StaticPersistentGroupTileScheduler
@@ -352,11 +382,17 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
             Int32(0),
         )
 
-        cached_problem_shape_0 = cute.make_rmem_tensor(
-            cute.make_layout(4), problem_shape_mnkl.element_type
+        cached_problem_shape_0 = (
+            Int32(-1),
+            Int32(-1),
+            Int32(-1),
+            Int32(-1),
         )
-        cached_problem_shape_1 = cute.make_rmem_tensor(
-            cute.make_layout(4), problem_shape_mnkl.element_type
+        cached_problem_shape_1 = (
+            Int32(-1),
+            Int32(-1),
+            Int32(-1),
+            Int32(-1),
         )
 
         # Initialize number of tiles executed to zero
@@ -373,13 +409,13 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
             problem_shape_mnkl,
             cached_problem_shape_0,
             cached_problem_shape_1,
+            use_cached_problem_shapes,
         )
 
     @property
     def num_tiles_executed(self) -> Int32:
         return self._num_tiles_executed
 
-    # This setter is the main way to prevent the Attribute error right now
     @num_tiles_executed.setter
     def num_tiles_executed(self, value: Int32) -> None:
         self._num_tiles_executed = value
@@ -396,7 +432,8 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         """
         Perform prefix sum within a full warp.
 
-        :param value_per_thread: The value for this thread to contribute to the prefix sum
+        :param value_per_thread: The value for this thread to contribute to the prefix
+                                 sum
         :type value_per_thread: Int32
         :return: The prefix sum result for this thread
         :rtype: Int32
@@ -421,16 +458,14 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> cute.Tensor:
+    ) -> Tuple[Int32, Int32, Int32, Int32]:
         """
-        Load gemm problem (m,n,k,l) for the specified group from global memory to register.
+        Load gemm problem (m,n,k,l) for the specified group from global memory to
+        register.
 
-        :param problem_shape_mnkl: Tensor in global memory with layout (group_count, 4):(4, 1)
-        :type problem_shape_mnkl: cute.Tensor
-        :param group_idx: The index of the group to load
-        :type group_idx: Int32
+        :param problem_shape_mnkl: Global-memory tensor, layout (group_count, 4):(4, 1)
         :return: The problem shape tensor for the specified group
-        :rtype: cute.Tensor
+        :rtype: Tuple[Int32, Int32, Int32, Int32]
         """
         cur_problem_mnkl = cute.make_rmem_tensor(
             cute.make_layout(4), problem_shape_mnkl.element_type, loc=loc, ip=ip
@@ -438,7 +473,12 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         cute.autovec_copy(
             problem_shape_mnkl[(group_idx, None)], cur_problem_mnkl, loc=loc, ip=ip
         )
-        return cur_problem_mnkl
+        return (
+            cur_problem_mnkl[0],
+            cur_problem_mnkl[1],
+            cur_problem_mnkl[2],
+            cur_problem_mnkl[3],
+        )
 
     @dsl_user_op
     @cute.jit
@@ -448,16 +488,17 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
     ) -> None:
-        if self.lane_idx < self.group_count:
-            cur_problem_mnkl = self._get_problem_for_group(
-                self.problem_shape_mnkl, self.lane_idx
-            )
-            self.cached_problem_shape_1 = cur_problem_mnkl
+        if const_expr(self.use_cached_problem_shapes):
+            if self.lane_idx < self.group_count:
+                cur_problem_mnkl = self._get_problem_for_group(
+                    self.problem_shape_mnkl, self.lane_idx, loc=loc, ip=ip
+                )
+                self.cached_problem_shape_1 = cur_problem_mnkl
 
     @dsl_user_op
     def _get_cluster_tile_count_mn(
         self,
-        problem_shape: cute.Tensor,
+        problem_shape: Tuple[Int32, Int32, Int32, Int32],
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
@@ -466,18 +507,18 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         Compute total cluster count.
 
         :param problem_shape: Tensor containing problem shape (m, n, k, l)
-        :type problem_shape: cute.Tensor
+        :type problem_shape: Tuple[Int32, Int32, Int32, Int32]
         :return: The total cluster tile count for M and N dimensions
         :rtype: Int32
         """
         cur_ntile_m = (
-            problem_shape[0] + self.cluster_tile_shape_mnk[0] - 1  # type: ignore[operator]
+            problem_shape[0] + self.cluster_tile_shape_mnk[0] - 1
         ) // self.cluster_tile_shape_mnk[0]
         cur_ntile_n = (
-            problem_shape[1] + self.cluster_tile_shape_mnk[1] - 1  # type: ignore[operator]
+            problem_shape[1] + self.cluster_tile_shape_mnk[1] - 1
         ) // self.cluster_tile_shape_mnk[1]
         cur_ntile_mn = cur_ntile_m * cur_ntile_n
-        return cur_ntile_mn  # type: ignore[return-value]
+        return cur_ntile_mn
 
     @dsl_user_op
     def _compute_cta_tile_coord(
@@ -491,17 +532,21 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         ip: Optional[ir.InsertionPoint] = None,
     ) -> tuple:
         """
-        Compute CTA tile indices along M and N dimensions based on the linear index within a group.
+        Compute CTA tile indices along M and N dimensions based on the linear index
+        within a group.
 
         It uses the AlongM mode to decompose the linear index onto M and N dimensions.
 
         :param cluster_tile_idx: The linear index within a group
         :type cluster_tile_idx: Int32
-        :param cta_tile_coord_in_cluster: CTA indices along M and N dimensions within a cluster
+        :param cta_tile_coord_in_cluster: CTA indices along M and N dimensions within a
+                                          cluster
         :type cta_tile_coord_in_cluster: tuple of Int32
-        :param cluster_tile_count_m: The number of clusters along M dimension of the matched group
+        :param cluster_tile_count_m: The number of clusters along M dimension of the
+                                     matched group
         :type cluster_tile_count_m: Int32
-        :param cluster_tile_count_n: The number of clusters along N dimension of the matched group
+        :param cluster_tile_count_n: The number of clusters along N dimension of the
+                                     matched group
         :type cluster_tile_count_n: Int32
         :return: A tuple containing CTA tile indices along M and N dimensions
         :rtype: tuple of (Int32, Int32)
@@ -535,7 +580,8 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
 
         :param linear_idx: The linear index to be decomposed
         :type linear_idx: Int32
-        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for all groups
+        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for
+                                   all groups
         :type problem_shape_mnkl: cute.Tensor
         :param init_group_idx: The group idx to start the search with
         :type init_group_idx: Int32
@@ -552,7 +598,6 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         not_found = linear_idx >= tile_count_searched
         start_not_found = not_found
         tile_count_prev_group = self.search_state.tile_count_prev_group
-        tidx, _, _ = cute.arch.thread_idx()
 
         while not_found and start_group_idx < self.group_count:
             # get group to search for current lane
@@ -560,27 +605,37 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
             # check if the group to be checked is out of range
             inside_group_bound = cur_group_idx < self.group_count
 
-            # Rotate cache
-            self.cached_problem_shape_0 = self.cached_problem_shape_1
+            if const_expr(self.use_cached_problem_shapes):
+                # Cached path: rotate cache and prefetch
+                self.cached_problem_shape_0 = self.cached_problem_shape_1
 
-            # Prefetch problem shape for next while iteration
-            next_prefetch_group_idx = (
-                start_group_idx + cute.arch.WARP_SIZE + self.lane_idx
-            )
-            if next_prefetch_group_idx < self.group_count:
-                self.cached_problem_shape_1 = self._get_problem_for_group(
-                    problem_shape_mnkl, next_prefetch_group_idx
+                # Prefetch problem shape for next while iteration
+                next_prefetch_group_idx = (
+                    start_group_idx + cute.arch.WARP_SIZE + self.lane_idx
                 )
+                if next_prefetch_group_idx < self.group_count:
+                    self.cached_problem_shape_1 = self._get_problem_for_group(
+                        problem_shape_mnkl, next_prefetch_group_idx, loc=loc, ip=ip
+                    )
 
-            cur_ntile_mn = c_0
-            if inside_group_bound:
-                # get problem size of current group
-                cur_problem_mnkl = self._get_problem_for_group(
-                    problem_shape_mnkl, cur_group_idx, loc=loc, ip=ip
-                )
-                cur_ntile_mn = self._get_cluster_tile_count_mn(
-                    cur_problem_mnkl, loc=loc, ip=ip
-                )
+                cur_ntile_mn = c_0
+                if inside_group_bound:
+                    # get problem size of current group from cache
+                    cur_problem_mnkl = self.cached_problem_shape_0
+                    cur_ntile_mn = self._get_cluster_tile_count_mn(
+                        cur_problem_mnkl, loc=loc, ip=ip
+                    )
+            else:
+                # On-demand path: load problem shape directly
+                cur_ntile_mn = c_0
+                if inside_group_bound:
+                    # get problem size of current group from global memory
+                    cur_problem_mnkl = self._get_problem_for_group(
+                        problem_shape_mnkl, cur_group_idx, loc=loc, ip=ip
+                    )
+                    cur_ntile_mn = self._get_cluster_tile_count_mn(
+                        cur_problem_mnkl, loc=loc, ip=ip
+                    )
 
             # compute tile count from beginning to current group(included)
             total_cluster_tile_count_ps_per_thread = self._prefix_sum(
@@ -591,27 +646,25 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
             )
 
             group_not_in_window = linear_idx >= cluster_tile_count_end_per_thread
-            hitted_group_idx_in_search_window = cute.arch.popc(
+            hit_group_idx_in_search_window = cute.arch.popc(
                 cute.arch.vote_ballot_sync(group_not_in_window, loc=loc, ip=ip),
                 loc=loc,
                 ip=ip,
             )
-            not_found = hitted_group_idx_in_search_window == cute.arch.WARP_SIZE
-            start_group_idx = hitted_group_idx_in_search_window + start_group_idx
+            not_found = hit_group_idx_in_search_window == cute.arch.WARP_SIZE
+            start_group_idx = hit_group_idx_in_search_window + start_group_idx
 
-            hit_the_1st_problem_in_search_window = (
-                hitted_group_idx_in_search_window == c_0
-            )
+            hit_the_1st_problem_in_search_window = hit_group_idx_in_search_window == c_0
             tile_count_prev_group = tile_count_searched
-            if hit_the_1st_problem_in_search_window == False:
+            if not hit_the_1st_problem_in_search_window:
                 tile_count_prev_group = cute.arch.shuffle_sync(
                     cluster_tile_count_end_per_thread,
-                    hitted_group_idx_in_search_window - 1,
+                    hit_group_idx_in_search_window - 1,
                 )
 
             # If no matched group, then get new_cluster_tile_count_end from last lane
-            # Otherwise, get new_cluster_tile_count_end from the hitted group
-            lane_idx_for_cluster_tile_count_end = hitted_group_idx_in_search_window
+            # Otherwise, get new_cluster_tile_count_end from the hit group
+            lane_idx_for_cluster_tile_count_end = hit_group_idx_in_search_window
 
             if not_found:
                 lane_idx_for_cluster_tile_count_end = last_lane_idx
@@ -620,14 +673,16 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
                 lane_idx_for_cluster_tile_count_end,
             )
 
-            # Prefetch problem shape for next wave
-            if not not_found:
-                if start_group_idx + self.lane_idx < self.group_count:
-                    self.cached_problem_shape_1 = self._get_problem_for_group(
-                        problem_shape_mnkl, start_group_idx + self.lane_idx
-                    )
+            # Prefetch problem shape for next wave (only when caching enabled)
+            if const_expr(self.use_cached_problem_shapes):
+                if not not_found:  # noqa: SIM102
+                    if start_group_idx + self.lane_idx < self.group_count:
+                        self.cached_problem_shape_1 = self._get_problem_for_group(
+                            problem_shape_mnkl, start_group_idx + self.lane_idx
+                        )
 
-        # The tile is invalid if not_found doesn't change before and after the while loop.
+        # The tile is invalid if not_found doesn't change before and after the while
+        # loop.
         end_not_found = not_found
         is_valid = start_not_found != end_not_found
 
@@ -649,20 +704,21 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> Tuple[Boolean, Union[Int32, int], cute.Tensor]:
+    ) -> Tuple[Int32, Int32, Tuple[Int32, Int32, Int32, Int32]]:
         """
         Perform group search and load problem shape for the matched group.
 
         :param linear_idx: The linear index to be decomposed
         :type linear_idx: Int32
-        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for all groups
+        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for
+                                   all groups
         :type problem_shape_mnkl: cute.Tensor
         :param start_group_idx: The group idx to start the search with
         :type start_group_idx: Int32
         :param tile_count_searched: The number of tiles we have searched
         :type tile_count_searched: Int32
         :return: A tuple containing the final group index and the problem shape tensor
-        :rtype: Tuple[Int32, cute.Tensor]
+        :rtype: Tuple[Int32, Int32, Tuple[Int32, Int32, Int32, Int32]]
         """
         self.search_state = self._group_search(
             linear_idx,
@@ -672,20 +728,17 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
             loc=loc,
             ip=ip,
         )
-        tidx, _, _ = cute.arch.thread_idx()
         # get final group search state
         found = self.search_state.found
 
         final_group_idx: Union[Int32, int] = -1
-        problem_mnkl = cute.make_rmem_tensor(
-            cute.make_layout(4), problem_shape_mnkl.element_type, loc=loc, ip=ip
-        )
+        problem_mnkl = (Int32(-1), Int32(-1), Int32(-1), Int32(-1))
         if found:
             final_group_idx = self.search_state.start_group_idx
             problem_mnkl = self._get_problem_for_group(
                 problem_shape_mnkl, final_group_idx, loc=loc, ip=ip
             )
-        return found, final_group_idx, problem_mnkl
+        return Int32(found), Int32(final_group_idx), problem_mnkl
 
     @dsl_user_op
     def delinearize_z(
@@ -694,20 +747,19 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> "GroupedWorkTileInfo":
+    ) -> GroupSearchResult:
         """
         Delinearize the linear z index and return GroupSearchResult.
 
-        This function should be used by warps that need to know the CTA tile index on M and N dimensions.
+        This function should be used by warps that need to know the CTA tile index on M
+        and N dimensions.
 
         :param cta_tile_coord: The raw CTA coordinate from tile scheduler
         :type cta_tile_coord: tuple of Int32
-        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for each group
-        :type problem_shape_mnkl: cute.Tensor
         :return: The search result containing group index and tile coordinates
         :rtype: GroupSearchResult
         """
-        # delinear the z coord
+        # delinearize the z coord
 
         linear_idx = self._current_work_linear_idx
 
@@ -720,7 +772,8 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
             ip=ip,
         )
 
-        # The work_tile is valid if its linear index could be mapped to a group in the problem shapes
+        # The work_tile is valid if its linear index could be mapped to a group in the
+        # problem shapes
         is_valid = found
 
         # linear index local to current group
@@ -759,7 +812,7 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
             cluster_count_k,
         )
 
-        return GroupedWorkTileInfo(cta_tile_coord, is_valid, group_search_result)
+        return GroupedWorkTileInfo(cta_tile_coord, is_valid, group_search_result)  # type: ignore[return-value]
 
     @dsl_user_op
     def get_current_work(
@@ -775,16 +828,16 @@ class StaticPersistentGroupTileScheduler(StaticPersistentTileScheduler):
         return grouped_work_tile
 
 
-@deprecated(
-    "API is deprecated, use cutlass.utils.StaticPersistentGroupTileScheduler instead"
-)
+@deprecated("API is deprecated, use StaticPersistentGroupTileScheduler instead.")
 class GroupedGemmTileSchedulerHelper:
     """
-    A helper to translate the raw block index (x, y, z) from tile scheduler to real CTA tile index for grouped gemm.
+    A helper to translate the raw block index (x, y, z) from tile scheduler to real CTA
+    tile index for grouped gemm.
 
     :param group_count: Number of groups in current grouped gemm problem
     :type group_count: int
-    :param tile_sched_params: Parameter used to create the tile scheduler this helper works with
+    :param tile_sched_params: Parameter used to create the tile scheduler this helper
+                              works with
     :type tile_sched_params: PersistentTileSchedulerParams
     :param cluster_tile_shape_mnk: The shape of cluster tile as (m, n, k)
     :type cluster_tile_shape_mnk: tuple[int, int, int]
@@ -814,7 +867,8 @@ class GroupedGemmTileSchedulerHelper:
         self, values: List[ir.Value]
     ) -> "GroupedGemmTileSchedulerHelper":
         # Reconstruct tile_sched_params and determine how many values it consumed.
-        # NOTE: tile_sched_params may contain FastDivmod divisors (when swizzle_size == 1),
+        # NOTE: tile_sched_params may contain FastDivmod divisors
+        # (when swizzle_size == 1),
         # which adds extra MLIR values.
         params_values = extract_mlir_values(self.tile_sched_params)
         n_params_values = len(params_values)
@@ -840,16 +894,18 @@ class GroupedGemmTileSchedulerHelper:
         """
         Delinearize the linear z index and return GroupSearchResult.
 
-        This function should be used by warps that need to know the CTA tile index on M and N dimensions.
+        This function should be used by warps that need to know the CTA tile index on M
+        and N dimensions.
 
         :param cta_tile_coord: The raw CTA coordinate from tile scheduler
         :type cta_tile_coord: tuple of Int32
-        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for each group
+        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for
+                                   each group
         :type problem_shape_mnkl: cute.Tensor
         :return: The search result containing group index and tile coordinates
         :rtype: GroupSearchResult
         """
-        # delinear the z coord
+        # delinearize the z coord
         linear_idx = cta_tile_coord[2]
         group_idx, problem_mnkl = self._group_search_and_load_problem_shape(
             linear_idx,
@@ -892,13 +948,16 @@ class GroupedGemmTileSchedulerHelper:
         problem_shape_mnkl: cute.Tensor,
     ) -> Tuple[Int32, Int32]:
         """
-        Search the matched group for given linear index and compute the number of tiles along K dimension for the matched group.
+        Search the matched group for given linear index and compute the number of tiles
+        along K dimension for the matched group.
 
-        This function should be used by warps that are only interested in the number of tiles along K dimension.
+        This function should be used by warps that are only interested in the number of
+        tiles along K dimension.
 
         :param cta_tile_coord: The raw CTA coordinate from tile scheduler
         :type cta_tile_coord: tuple of Int32
-        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for all groups
+        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for
+                                   all groups
         :type problem_shape_mnkl: cute.Tensor
         :return: A tuple containing cluster count along K dimension and the group index
         :rtype: Tuple[Int32, Int32]
@@ -919,7 +978,8 @@ class GroupedGemmTileSchedulerHelper:
         """
         Perform prefix sum within a full warp.
 
-        :param value_per_thread: The value for this thread to contribute to the prefix sum
+        :param value_per_thread: The value for this thread to contribute to the prefix
+                                 sum
         :type value_per_thread: Int32
         :return: The prefix sum result for this thread
         :rtype: Int32
@@ -940,9 +1000,11 @@ class GroupedGemmTileSchedulerHelper:
         self, problem_shape_mnkl: cute.Tensor, group_idx: Int32
     ) -> cute.Tensor:
         """
-        Load gemm problem (m,n,k,l) for the specified group from global memory to register.
+        Load gemm problem (m,n,k,l) for the specified group from global memory to
+        registers.
 
-        :param problem_shape_mnkl: Tensor in global memory with layout (group_count, 4):(4, 1)
+        :param problem_shape_mnkl: Tensor in global memory with layout
+                                   (group_count, 4):(4, 1)
         :type problem_shape_mnkl: cute.Tensor
         :param group_idx: The index of the group to load
         :type group_idx: Int32
@@ -981,17 +1043,21 @@ class GroupedGemmTileSchedulerHelper:
         cluster_tile_count_n: Int32,
     ) -> tuple:
         """
-        Compute CTA tile indices along M and N dimensions based on the linear index within a group.
+        Compute CTA tile indices along M and N dimensions based on the linear index
+        within a group.
 
         It uses the AlongM mode to decompose the linear index onto M and N dimensions.
 
         :param cluster_tile_idx: The linear index within a group
         :type cluster_tile_idx: Int32
-        :param cta_tile_coord_in_cluster: CTA indices along M and N dimensions within a cluster
+        :param cta_tile_coord_in_cluster: CTA indices along M and N dimensions within a
+                                          cluster
         :type cta_tile_coord_in_cluster: tuple of Int32
-        :param cluster_tile_count_m: The number of clusters along M dimension of the matched group
+        :param cluster_tile_count_m: The number of clusters along M dimension of the
+                                     matched group
         :type cluster_tile_count_m: Int32
-        :param cluster_tile_count_n: The number of clusters along N dimension of the matched group
+        :param cluster_tile_count_n: The number of clusters along N dimension of the
+                                     matched group
         :type cluster_tile_count_n: Int32
         :return: A tuple containing CTA tile indices along M and N dimensions
         :rtype: tuple of (Int32, Int32)
@@ -1023,7 +1089,8 @@ class GroupedGemmTileSchedulerHelper:
 
         :param linear_idx: The linear index to be decomposed
         :type linear_idx: Int32
-        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for all groups
+        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for
+                                   all groups
         :type problem_shape_mnkl: cute.Tensor
         :param init_group_idx: The group idx to start the search with
         :type init_group_idx: Int32
@@ -1058,24 +1125,22 @@ class GroupedGemmTileSchedulerHelper:
             )
 
             group_not_in_window = linear_idx >= cluster_tile_count_end_per_thread
-            hitted_group_idx_in_search_window = cute.arch.popc(
+            hit_group_idx_in_search_window = cute.arch.popc(
                 cute.arch.vote_ballot_sync(group_not_in_window)
             )
-            not_found = hitted_group_idx_in_search_window == cute.arch.WARP_SIZE
-            start_group_idx = hitted_group_idx_in_search_window + start_group_idx
-            hit_the_1st_problem_in_search_window = (
-                hitted_group_idx_in_search_window == c_0
-            )
+            not_found = hit_group_idx_in_search_window == cute.arch.WARP_SIZE
+            start_group_idx = hit_group_idx_in_search_window + start_group_idx
+            hit_the_1st_problem_in_search_window = hit_group_idx_in_search_window == c_0
             tile_count_prev_group = tile_count_searched
-            if hit_the_1st_problem_in_search_window == False:
+            if not hit_the_1st_problem_in_search_window:
                 tile_count_prev_group = cute.arch.shuffle_sync(
                     cluster_tile_count_end_per_thread,
-                    hitted_group_idx_in_search_window - 1,
+                    hit_group_idx_in_search_window - 1,
                 )
 
             # If no matched group, then get new_cluster_tile_count_end from last lane
-            # Otherwise, get new_cluster_tile_count_end from the hitted group
-            lane_idx_for_cluster_tile_count_end = hitted_group_idx_in_search_window
+            # Otherwise, get new_cluster_tile_count_end from the hit group
+            lane_idx_for_cluster_tile_count_end = hit_group_idx_in_search_window
             if not_found:
                 lane_idx_for_cluster_tile_count_end = last_lane_idx
             tile_count_searched = cute.arch.shuffle_sync(
@@ -1102,7 +1167,8 @@ class GroupedGemmTileSchedulerHelper:
 
         :param linear_idx: The linear index to be decomposed
         :type linear_idx: Int32
-        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for all groups
+        :param problem_shape_mnkl: Tensor containing gemm problem size (M, N, K, L) for
+                                   all groups
         :type problem_shape_mnkl: cute.Tensor
         :param start_group_idx: The group idx to start the search with
         :type start_group_idx: Int32
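The renamed `hit_group_idx_in_search_window` logic in the hunk above relies on a ballot-plus-popcount idiom. The following is a host-only sketch of one search window in plain Python, not the DSL code itself. The function name `search_window` and the padding of unused lanes with the running total are assumptions made for illustration; the diff does not show how trailing lanes are filled.

```python
WARP_SIZE = 32  # lanes per warp, standing in for cute.arch.WARP_SIZE


def search_window(linear_idx, tile_count_end_per_lane, tile_count_searched):
    """Emulate one search window of the warp-wide group search on the host.

    tile_count_end_per_lane[i] is the inclusive prefix sum of tile counts up to
    and including the group handled by lane i (non-decreasing across lanes).
    Returns (hit_lane, not_found, tile_count_prev_group).
    """
    assert len(tile_count_end_per_lane) == WARP_SIZE
    # vote_ballot_sync: one bit per lane whose group ends at or before linear_idx.
    ballot = 0
    for lane, end in enumerate(tile_count_end_per_lane):
        if linear_idx >= end:
            ballot |= 1 << lane
    # popc: the prefix sums are non-decreasing, so the set bits form a
    # contiguous run starting at lane 0 and their count is the index of the
    # first lane whose group contains linear_idx.
    hit_lane = bin(ballot).count("1")
    not_found = hit_lane == WARP_SIZE
    # shuffle_sync from lane hit_lane - 1, unless the hit is the first lane.
    tile_count_prev_group = tile_count_searched
    if hit_lane != 0:
        tile_count_prev_group = tile_count_end_per_lane[hit_lane - 1]
    return hit_lane, not_found, tile_count_prev_group


# Three groups with 4, 2 and 6 tiles; remaining lanes repeat the total (assumed).
ends = [4, 6, 12] + [12] * 29
hit, not_found, prev = search_window(5, ends, 0)
```

With these inputs, tile 5 lands in lane 1 (the second group), and `prev` is the tile count consumed by the groups before it. An index of 12 or more sets every ballot bit, which is the `not_found` case that advances to the next window.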
diff --git a/python/CuTeDSL/cutlass/utils/hopper_helpers.py b/python/CuTeDSL/cutlass/utils/hopper_helpers.py
index d6b19f40c..678c4a34f 100644
--- a/python/CuTeDSL/cutlass/utils/hopper_helpers.py
+++ b/python/CuTeDSL/cutlass/utils/hopper_helpers.py
@@ -9,7 +9,7 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-from typing import Type, Union, Tuple, Optional
+from typing import Any, Type, Union, Tuple, Optional
 
 from cutlass._mlir import ir
 from cutlass.utils.layout import LayoutEnum
@@ -126,6 +126,7 @@ def make_trivial_tiled_mma(
     :raises TypeError: If the data type is not supported.
     """
 
+    mma_op: Any
     if a_dtype in {Float16, BFloat16}:
         if a_dtype != b_dtype:
             raise TypeError(f"Type mismatch: {a_dtype} != {b_dtype}")
@@ -144,7 +145,7 @@ def make_trivial_tiled_mma(
         Float8E4M3FN,
         Float8E5M2,
     }:
-        mma_op = MmaF8Op(  # type: ignore[assignment]
+        mma_op = MmaF8Op(
             a_dtype,
             b_dtype,
             acc_dtype,
@@ -154,7 +155,7 @@ def make_trivial_tiled_mma(
             b_leading_mode,
         )
     elif a_dtype in {Int8, Uint8} and b_dtype in {Int8, Uint8}:
-        mma_op = MmaI8Op(  # type: ignore[assignment]
+        mma_op = MmaI8Op(
             a_dtype,
             b_dtype,
             acc_dtype,
@@ -177,7 +178,7 @@ def get_smem_layout_atom(
     *,
     loc: Optional[ir.Location] = None,
     ip: Optional[ir.InsertionPoint] = None,
-) -> "cute.nvgpu.warpgroup.SmemLayoutAtomKind":
+) -> Any:
     """Select the optimal shared memory layout atom based on parameters.
 
     :param layout: Layout enum of the tensor
@@ -242,12 +243,15 @@ def make_smem_layout_a(
     :rtype: Union[cute.Layout, cute.ComposedLayout]
     """
     # Extract A tensor shape from the MMA tiler (M dimension)
+    assert isinstance(mma_tiler_mnk, tuple)
     a_tile_shape_mnk = mma_tiler_mnk
     a_smem_shape = cute.slice_(a_tile_shape_mnk, (None, 0, None), loc=loc, ip=ip)
 
     # Determine if K is the major mode and get the major mode size
     is_k_major = a_layout.is_k_major_a()
-    a_major_mode_size = a_tile_shape_mnk[2] if is_k_major else a_tile_shape_mnk[0]  # type: ignore[index]
+    a_major_mode_size = cute.size(
+        a_tile_shape_mnk[2] if is_k_major else a_tile_shape_mnk[0]
+    )
 
     # Create SMEM layout atom for A tensor based on major mode and data type
     a_smem_layout_atom = make_smem_layout_atom(
@@ -299,11 +303,12 @@ def make_smem_layout_b(
     :rtype: Union[cute.Layout, cute.ComposedLayout]
     """
     # Extract B tensor shape from the MMA tiler (N and K dimensions)
+    assert isinstance(mma_tiler_mnk, tuple)
     b_smem_shape = cute.slice_(mma_tiler_mnk, (0, None, None), loc=loc, ip=ip)
 
     # Determine if K is the major mode and get the major mode size
     is_k_major = b_layout.is_k_major_b()
-    b_major_mode_size = mma_tiler_mnk[2] if is_k_major else mma_tiler_mnk[1]  # type: ignore[index]
+    b_major_mode_size = cute.size(mma_tiler_mnk[2] if is_k_major else mma_tiler_mnk[1])
 
     # Create SMEM layout atom for B tensor based on major mode and data type
     b_smem_layout_atom = make_smem_layout_atom(
@@ -361,10 +366,11 @@ def make_smem_layout_epi(
     :rtype: Union[cute.Layout, cute.ComposedLayout]
     """
     # Extract output tensor shape from epilog tile
+    assert isinstance(epi_tile, tuple)
     o_smem_shape = epi_tile
 
     # Determine major mode size based on layout (M or N major)
-    o_major_mode_size = epi_tile[1] if epi_layout.is_n_major_c() else epi_tile[0]  # type: ignore[index]
+    o_major_mode_size = epi_tile[1] if epi_layout.is_n_major_c() else epi_tile[0]
 
     # Create SMEM layout atom for output tensor based on layout and data type
     o_smem_layout_atom = make_smem_layout_atom(
diff --git a/python/CuTeDSL/cutlass/utils/mixed_input_helpers.py b/python/CuTeDSL/cutlass/utils/mixed_input_helpers.py
index 869995a60..2ff0dbc3a 100644
--- a/python/CuTeDSL/cutlass/utils/mixed_input_helpers.py
+++ b/python/CuTeDSL/cutlass/utils/mixed_input_helpers.py
@@ -13,13 +13,14 @@ from __future__ import annotations
 
 from enum import Enum, auto
 from math import log2
-from typing import Optional, Union
+from typing import Any, Optional, Type, Union
 
 import cutlass
 import cutlass.cute as cute
 from cutlass._mlir import ir
 from cutlass.cutlass_dsl import (
     Boolean,
+    Numeric,
     extract_mlir_values,
     new_from_mlir_values,
 )
@@ -68,9 +69,10 @@ def scale_tma_partition(
 
     :rtype: tuple[cute.Tensor, cute.Tensor]
     """
+    assert isinstance(block_in_cluster_coord_vmnk, tuple)
     tSsS, tSgS = cpasync.tma_partition(
         tma_atom_s,
-        block_in_cluster_coord_vmnk[2],  # type: ignore[index]
+        block_in_cluster_coord_vmnk[2],
         scale_cta_layout,
         cute.group_modes(tCsS, 0, 3),
         cute.group_modes(tCgS, 0, 3),
@@ -629,14 +631,18 @@ def get_copy_atom_a_transform(
     Determine the copy atom for transformed A tensor based on the operand source and tile size.
     """
     if cutlass.const_expr(transform_a_source == tcgen05.OperandSource.TMEM):
+        copy_op_r2t: Any
+        assert isinstance(a_smem_shape, tuple)
+        a_smem_shape_0 = a_smem_shape[0]
+        assert isinstance(a_smem_shape_0, tuple)
         if cutlass.const_expr(
-            cute.size(a_smem_shape[0][0]) == 64 and (not use_2cta_instrs)  # type: ignore[index]
+            cute.size(a_smem_shape_0[0]) == 64 and (not use_2cta_instrs)
         ):
             copy_op_r2t = tcgen05.St16x256bOp(
                 tcgen05.Repetition(1), tcgen05.Unpack.NONE
             )
         else:
-            copy_op_r2t = tcgen05.St32x32bOp(tcgen05.Repetition(8), tcgen05.Unpack.NONE)  # type: ignore[assignment]
+            copy_op_r2t = tcgen05.St32x32bOp(tcgen05.Repetition(8), tcgen05.Unpack.NONE)
         return cute.make_copy_atom(copy_op_r2t, mma_dtype)
     else:
         return cute.make_copy_atom(
@@ -687,7 +693,9 @@ def is_shuffle_a(
         and mma_dtype == cutlass.BFloat16
         and scale_granularity_k >= 8
     )
-    return shuffle_a
+    # shuffle is supported since CTK 13.1
+    shuffle_supported = cutlass.target_version(min_version="13.1")
+    return shuffle_a and shuffle_supported
 
 
 def is_valid_tensor_alignment(
@@ -712,9 +720,7 @@ def is_valid_tensor_alignment(
     """
 
     def check_contiguous_16B_alignment(
-        dtype: type[cutlass.Numeric],
-        is_mode0_major: bool,
-        tensor_shape: tuple[int, int],
+        dtype: Type[Numeric], is_mode0_major: bool, tensor_shape: tuple[int, int]
     ) -> bool:
         major_mode_idx = 0 if is_mode0_major else 1
         num_major_elements = tensor_shape[major_mode_idx]
@@ -806,7 +812,7 @@ class ContiguousGGSearchState:
         cur_group_idx: cutlass.Int32,
         cur_offset: cutlass.Int32,
         cur_start: cutlass.Int32,
-    ) -> None:
+    ):
         self.last_tile_count = last_tile_count
         self.cur_boundary = cur_boundary
         self.cur_tile_count = cur_tile_count
@@ -880,7 +886,7 @@ class ContiguousGroupWorkTileInfo:
         coord_n: cutlass.Int32,
         group_idx: cutlass.Int32,
         distance_to_boundary: cutlass.Int32,
-    ) -> None:
+    ):
         self.cta_coord_m = cta_coord_m
         self.coord_n = coord_n
         self.group_idx = group_idx
@@ -987,7 +993,7 @@ def cvt_tensor_a(
     for int4-to-bf16 conversion.
     """
 
-    # shuffle is supported since CUDA 13.1
+    # shuffle is supported since CTK 13.1
     shuffle_supported = cutlass.target_version(min_version="13.1")
     shuffle = shuffle and shuffle_supported
     rst = src.load()
diff --git a/python/CuTeDSL/cutlass/utils/profiling.py b/python/CuTeDSL/cutlass/utils/profiling.py
new file mode 100644
index 000000000..9a9264ca7
--- /dev/null
+++ b/python/CuTeDSL/cutlass/utils/profiling.py
@@ -0,0 +1,231 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# Use of this software is governed by the terms and conditions of the
+# NVIDIA End User License Agreement (EULA), available at:
+# https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/license.html
+#
+# Any use, reproduction, disclosure, or distribution of this software
+# and related documentation outside the scope permitted by the EULA
+# is strictly prohibited.
+
+"""Unified profiling scope for NameLoc annotations.
+
+Provides ``Scope`` (base, with optional metadata) and ``WarpScope``
+(typed wrapper for warp groups with warp_start/warp_end).
+
+All metadata is collected in a global symbol registry and dumped as
+``kernel_symbols.json`` by ``dump_kernel_symbols()``.
+
+Usage::
+
+    from cutlass.utils.profiling import Scope, WarpScope
+
+    # Context manager (nested scopes for pipeline phases):
+    with Scope("tmem_sp0/ConsumerWork"):
+        ...  # all MLIR ops tagged with NameLoc
+
+    # Flat switch (warp dispatch):
+    scope = WarpScope("LoadWarp", warp_start=13, warp_end=14)
+    ...
+    scope.switch("MmaWarp", warp_start=12, warp_end=13)
+    ...
+    scope.close()
+"""
+
+from __future__ import annotations
+
+from typing import Any, Union
+
+from cutlass._mlir._mlir_libs._cutlass_ir._mlir import ir
+
+
+# ── Symbol registry ──────────────────────────────────────────────────────────
+
+_symbol_registry: dict[str, dict[str, Any]] = {}
+
+
+def register_symbol(name: str, kind: str, **metadata: Any) -> None:
+    """Register a named symbol with metadata for kernel_symbols.json.
+
+    Merges with an existing entry if present (later calls add or overwrite fields).
+    """
+    if name:
+        existing = _symbol_registry.get(name, {})
+        existing.update({"kind": kind, **metadata})
+        _symbol_registry[name] = existing
+
+
+def get_symbol_registry() -> dict[str, dict[str, Any]]:
+    """Return a shallow copy of the current symbol registry (for inspection)."""
+    return dict(_symbol_registry)
+
+
+def dump_kernel_symbols() -> dict[str, Any]:
+    """Return the symbol registry grouped by kind, then clear it.
+
+    Returns::
+
+        {
+            "warps": {"Softmax0Warp": {"warp_start": 0, "warp_end": 4}, ...},
+            "barriers": {"load_q": {"num_stages": 2, "pipeline_type": "..."}, ...},
+        }
+    """
+    warps = {}
+    barriers = {}
+    for name, meta in _symbol_registry.items():
+        kind = meta.pop("kind", "")
+        if kind == "warp":
+            warps[name] = meta
+        elif kind == "barrier":
+            barriers[name] = meta
+        # other kinds can be added later
+    result: dict[str, Any] = {}
+    if warps:
+        result["warps"] = warps
+    if barriers:
+        # Also emit bar_alloc for backward compat with stall_attribution
+        alloc_parts = []
+        for bname, bmeta in barriers.items():
+            stages = bmeta.get("num_stages", 1)
+            alloc_parts.append(f"{bname}:{stages}")
+        result["bar_alloc"] = ", ".join(alloc_parts)
+        result["barriers"] = barriers
+    _symbol_registry.clear()
+    return result
+
+
+def reset_symbol_registry() -> None:
+    """Clear the symbol registry."""
+    _symbol_registry.clear()
+
+
+# ── Scope ────────────────────────────────────────────────────────────────────
+
+
+class Scope:
+    """Profiling scope: NameLoc + optional metadata.
+
+    :param name: Scope label(s). A string creates one NameLoc.
+        A tuple creates nested NameLocs (outermost first)::
+
+            with Scope(("tmem_sp0", "ConsumerWork[1]")):
+                ...  # NameLocs: loc("tmem_sp0"(loc("ConsumerWork[1]"(...))))
+
+    :type name: str or tuple[str, ...]
+    :param metadata: Arbitrary key-value pairs registered in the symbol registry.
+        Non-empty metadata must include ``kind``, e.g. ``"warp"`` or ``"barrier"``.
+    """
+
+    def __init__(
+        self,
+        name: Union[str, tuple[str, ...]] = "",
+        *,
+        iket: bool = False,
+        **metadata: Any,
+    ):
+        self._iket_enabled = iket
+        self._iket_token = None
+        self._locs: list = []  # stack of NameLoc context managers
+        if metadata:
+            label = name if isinstance(name, str) else name[0]
+            register_symbol(label, **metadata)
+        if name:
+            self._enter(name)
+
+    def _enter(self, name: Union[str, tuple[str, ...]]) -> None:
+        """Push NameLoc scope(s)."""
+        ctx = ir.Location.current.context
+        parts = (name,) if isinstance(name, str) else name
+        for part in parts:
+            loc = ir.Location.name(part, childLoc=ir.Location.current, context=ctx)
+            loc.__enter__()
+            self._locs.append(loc)
+        if self._iket_enabled:
+            import cutlass.cute.experimental.iket as _iket
+
+            iket_label = "/".join(parts)
+            self._iket_token = _iket.range_start(iket_label)
+
+    def _exit(self) -> None:
+        """Pop all NameLoc scopes."""
+        if self._iket_token is not None:
+            import cutlass.cute.experimental.iket as _iket
+
+            _iket.range_end(self._iket_token)
+            self._iket_token = None
+        # Exit in reverse order (innermost first)
+        while self._locs:
+            loc = self._locs.pop()
+            loc.__exit__(None, None, None)
+
+    def switch(self, name: Union[str, tuple[str, ...]], **metadata: Any) -> "Scope":
+        """Exit current scope and enter a new one (flat dispatch pattern).
+
+        Accepts the same metadata kwargs as __init__.
+        """
+        self._exit()
+        if metadata:
+            label = name if isinstance(name, str) else name[0]
+            register_symbol(label, **metadata)
+        self._enter(name)
+        return self
+
+    def close(self) -> None:
+        """Explicitly close the scope."""
+        self._exit()
+
+    def __enter__(self) -> "Scope":
+        return self
+
+    def __exit__(self, *exc: Any) -> None:
+        self._exit()
+
+    def __del__(self) -> None:
+        try:
+            self._exit()
+        except Exception:
+            pass
+
+
+class WarpScope(Scope):
+    """Typed scope for warp groups — registers warp_start/warp_end metadata.
+
+    Usage::
+
+        scope = WarpScope("Softmax0Warp", warp_start=0, warp_end=4, iket=True)
+        scope.switch("MmaWarp", warp_start=12, warp_end=13)
+        scope.close()
+    """
+
+    def __init__(
+        self,
+        name: str,
+        *,
+        warp_start: int,
+        warp_end: int,
+        iket: bool = False,
+    ):
+        super().__init__(
+            name,
+            iket=iket,
+            kind="warp",
+            warp_start=warp_start,
+            warp_end=warp_end,
+        )
+
+    def switch(  # type: ignore[override]
+        self,
+        name: str,
+        *,
+        warp_start: int,
+        warp_end: int,
+    ) -> "WarpScope":
+        """Switch to a new warp scope."""
+        super().switch(
+            name,
+            kind="warp",
+            warp_start=warp_start,
+            warp_end=warp_end,
+        )
+        return self
diff --git a/python/CuTeDSL/cutlass/utils/smem_allocator.py b/python/CuTeDSL/cutlass/utils/smem_allocator.py
index ce9b4c67c..67889fa2b 100644
--- a/python/CuTeDSL/cutlass/utils/smem_allocator.py
+++ b/python/CuTeDSL/cutlass/utils/smem_allocator.py
@@ -9,17 +9,18 @@
 # and related documentation outside the scope permitted by the EULA
 # is strictly prohibited.
 
-from typing import Any, Optional, Type, Union, overload
+from typing import Any, Callable, Optional, Type, Union, overload
+from typing_extensions import deprecated
 import inspect
 
 import cutlass.cute as cute
-from cutlass.cute.arch import get_dyn_smem, get_dyn_smem_size
 from cutlass.cute.tensor import _Tensor
 from cutlass.cutlass_dsl import (
     SMEM_CAPACITY_MAP,
     CuTeDSL,
     Boolean,
     Int8,
+    Int32,
     Numeric,
     NumericMeta,
     dsl_user_op,
@@ -28,6 +29,29 @@ from cutlass._mlir import ir
 from cutlass._mlir.dialects import cute as _cute_ir
 
 
+def _extract_struct_fields(struct_type: cute.struct) -> list[tuple[str, int, int]]:
+    """Extract (name, size_bytes, offset) for each field in a cute.struct."""
+    from cutlass.cute.core import struct
+
+    fields = []
+    for name, member in struct_type._annotations.items():
+        offset = struct_type._offsets[name]
+        # Unwrap Align wrapper
+        if isinstance(member, struct._AlignMeta):
+            member = member.dtype
+        # Compute size based on type
+        if struct._is_scalar_type(member):
+            size = max(1, member.width // 8)
+        elif isinstance(member, struct._MemRangeMeta):
+            size = member.size_in_bytes
+        elif isinstance(member, struct):
+            size = member.__sizeof__()
+        else:
+            continue
+        fields.append((name, size, offset))
+    return fields
+
+
 class SmemAllocator:
     """A helper class for managing shared memory allocation on GPU.
 
@@ -35,7 +59,6 @@ class SmemAllocator:
     numeric types, arrays, and tensors with specified layouts and alignments.
 
     .. note::
-        - The base pointer is aligned to 1024 bytes upon initialization.
         - SmemAllocator will automatically calculate the usage upon kernel launch.
         - There is no need to explicitly specify shared memory size in kernel launch.
         - Currently only supports static layouts. Dynamic layouts are not supported.
@@ -101,20 +124,62 @@ class SmemAllocator:
         *,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> None:
+    ):
         """Initialize a new SmemAllocator instance.
 
-        Creates a new shared memory allocator with a base pointer aligned to 1024 bytes.
-        Tracks the allocator instance for memory management.
-
         :param loc: Source location information for debugging, defaults to None
         :type loc: Optional[ir.Location]
         :param ip: Insertion point for MLIR operations, defaults to None
         :type ip: Optional[ir.InsertionPoint]
         """
-        self._base = get_dyn_smem(Int8, alignment=1024, loc=loc, ip=ip)
-        self._allocated_bytes = 0
-        CuTeDSL.track_smem_allocator(self, lambda cls: cls._allocated_bytes)  # type: ignore[attr-defined]
+        pass
+
+    @property
+    @deprecated(
+        "Internal var `_allocated_bytes` is deprecated, use public API `arch.dynamic_smem_size()` instead."
+    )
+    def _allocated_bytes(self) -> Int32:
+        return cute.arch.dynamic_smem_size()
+
+    @dsl_user_op
+    def _smem_alloca(
+        self,
+        layout: cute.Layout,
+        dtype: NumericMeta,
+        byte_alignment: int,
+        swizzle: Optional[cute.Swizzle] = None,
+        struct_fields: Optional[list[tuple[str, int, int]]] = None,
+        *,
+        loc: Optional[ir.Location] = None,
+        ip: Optional[ir.InsertionPoint] = None,
+    ) -> cute.Pointer:
+        """
+        Allocate shared memory using cute.memref.alloca with given layout, data type, and alignment.
+
+        Returns:
+            cute.Pointer: An iterator (pointer) to the allocated shared memory.
+        """
+        assert byte_alignment <= 1024, "shared memory alignment is limited to 1024 bytes"
+        assert cute.is_static(layout), "shared memory allocation requires a static layout"
+        # allocate using cute.memref.alloca
+        swizzle = swizzle.type.attribute if swizzle is not None else None
+        mlir_type = Int8.mlir_type if dtype is Boolean else dtype.mlir_type
+        ptr_ty = _cute_ir.PtrType.get(
+            mlir_type, cute.AddressSpace.smem, byte_alignment, swizzle
+        )
+        res_ty = _cute_ir.MemRefType.get(ptr_ty, layout.type)
+        memref = _cute_ir.memref_alloca(res_ty, layout=None, loc=loc, ip=ip)
+        # Attach struct field metadata as MLIR attributes
+        if struct_fields:
+            from cutlass._mlir import ir as _ir
+
+            field_attrs = []
+            for name, size, offset in struct_fields:
+                field_attrs.append(_ir.StringAttr.get(f"{name}:{size}:{offset}"))
+            memref.value.owner.attributes["smem.struct_fields"] = _ir.ArrayAttr.get(
+                field_attrs
+            )
+        return _cute_ir.get_iter(memref.value, loc=loc, ip=ip)
 
     @overload
     def allocate(
@@ -137,7 +202,7 @@ class SmemAllocator:
     ) -> cute.Pointer: ...
 
     @overload
-    def allocate(
+    def allocate(  # type: ignore[overload-cannot-match]
         self,
         size_or_type: cute.struct,
         byte_alignment: int,
@@ -184,7 +249,16 @@ class SmemAllocator:
         elif isinstance(size_or_type, cute.struct):
             size_in_bytes = size_or_type.__sizeof__()
             alignment = max(byte_alignment, size_or_type.__alignof__())
-            base_ptr = self.allocate(size_in_bytes, alignment, loc=loc, ip=ip)
+            struct_fields = _extract_struct_fields(size_or_type)
+            layout = cute.make_layout(size_in_bytes)
+            base_ptr = self._smem_alloca(
+                layout,
+                Int8,
+                alignment,
+                struct_fields=struct_fields,
+                loc=loc,
+                ip=ip,
+            )
             return size_or_type(base_ptr)
         elif isinstance(
             size_or_type,
@@ -207,25 +281,8 @@ class SmemAllocator:
         if byte_alignment < 1:
             raise ValueError("`byte_alignment` must be at least 1")
 
-        self._base = self._base.align(byte_alignment)
-        ptr = self._base
-        self._base += size_in_bytes
-        if self._allocated_bytes % byte_alignment != 0:
-            self._allocated_bytes += (
-                byte_alignment - self._allocated_bytes % byte_alignment
-            )
-        self._allocated_bytes += size_in_bytes
-
-        # Check bounds against available dynamic shared memory
-        cute.testing.assert_(
-            self._allocated_bytes <= get_dyn_smem_size(loc=loc, ip=ip),
-            f"Allocation failed: shared memory allocation exceeds available memory set in kernel launch. "
-            f"Allocated bytes: {self._allocated_bytes} bytes. "
-            f"Please reduce the allocation or set a larger smem size in kernel launch.",
-            loc=loc,
-            ip=ip,
-        )
-        return ptr
+        layout = cute.make_layout(size_in_bytes)
+        return self._smem_alloca(layout, Int8, byte_alignment, loc=loc, ip=ip)
 
     @dsl_user_op
     def allocate_array(
@@ -320,7 +377,8 @@ class SmemAllocator:
         if isinstance(layout, int):
             layout = cute.make_layout(layout)
 
-        profile = layout(0, loc=loc, ip=ip)  # type: ignore[operator]
+        assert not isinstance(layout, int)
+        profile = layout(0, loc=loc, ip=ip)
         if isinstance(profile, tuple):
             raise TypeError(
                 "cannot allocate a shared memory tensor with a non-integer iterator"
@@ -355,3 +413,44 @@ SmemAllocator.__init__.__signature__ = inspect.Signature(  # type: ignore[attr-d
 )
 
 get_smem_capacity_in_bytes = SmemAllocator.capacity_in_bytes
+
+
+@dsl_user_op
+def get_kernel_smem_size(
+    kernel: Callable,
+    *,
+    loc: Optional[ir.Location] = None,
+    ip: Optional[ir.InsertionPoint] = None,
+) -> int:
+    """Get the total static shared memory allocation in bytes for a kernel.
+
+    Uses ``cute.kernel_smem_size`` to query the total smem bytes that will be
+    allocated by a kernel.  The result is lowered to a compile-time constant by
+    ``InferKernelSmemUsagePass``.
+
+    Must be called from within a ``@cute.jit`` body after the kernel's
+    ``.launch()`` has been called, which triggers tracing and registers the
+    kernel's MLIR symbol.
+
+    :param kernel: A ``@cute.kernel``-decorated function.  The MLIR symbol is
+        retrieved automatically from state stored by the DSL after ``.launch()``.
+    :type kernel: Callable
+    :return: Total shared memory allocated by the kernel, in bytes.
+    :rtype: int (i64 MLIR value during tracing)
+    """
+    if not callable(kernel):
+        raise TypeError(
+            f"get_kernel_smem_size: expected a @cute.kernel-decorated function, "
+            f"got {type(kernel)}"
+        )
+
+    # The DSL stores the MLIR SymbolRefAttr on the underlying funcBody
+    # (accessible via __wrapped__) after each .launch() call.
+    func_body = getattr(kernel, "__wrapped__", kernel)
+    sym = getattr(func_body, "_dsl_kernel_sym", None)
+    if sym is None:
+        raise ValueError(
+            f"get_kernel_smem_size: kernel '{kernel.__name__}' has not been "
+            "traced yet — call .launch() before get_kernel_smem_size()."
+        )
+    return _cute_ir.kernel_smem_size(sym, loc=loc, ip=ip)
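For reference, the bookkeeping removed from `SmemAllocator.allocate` in this file rounded the running byte count up to the requested alignment before adding each allocation. Below is a host-only sketch of that arithmetic. The helper names `align_up` and `bump_allocate` are hypothetical, and the diff does not state whether the new `memref.alloca` path produces the same offsets.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    remainder = offset % alignment
    return offset if remainder == 0 else offset + (alignment - remainder)


def bump_allocate(allocated_bytes: int, size_in_bytes: int, byte_alignment: int):
    """Return (start_offset, new_allocated_bytes) for one allocation."""
    if byte_alignment < 1:
        raise ValueError("`byte_alignment` must be at least 1")
    start = align_up(allocated_bytes, byte_alignment)
    return start, start + size_in_bytes


total = 0
offsets = []
# (size_in_bytes, byte_alignment) pairs chosen only for illustration.
for size, alignment in [(10, 1), (64, 16), (4, 128)]:
    start, total = bump_allocate(total, size, alignment)
    offsets.append(start)
```

The padding inserted before the second and third allocations counts toward the total, which is why the final footprint exceeds the plain sum of the requested sizes.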
diff --git a/python/CuTeDSL/cutlass/utils/static_persistent_tile_scheduler.py b/python/CuTeDSL/cutlass/utils/static_persistent_tile_scheduler.py
index b9762a18d..c44a542af 100644
--- a/python/CuTeDSL/cutlass/utils/static_persistent_tile_scheduler.py
+++ b/python/CuTeDSL/cutlass/utils/static_persistent_tile_scheduler.py
@@ -23,8 +23,14 @@ from cutlass.cutlass_dsl import (
     const_expr,
 )
 from cutlass._mlir import ir
+from typing_extensions import deprecated
 import cutlass.cute as cute
 
+_DEPRECATION_MSG = (
+    "Migrated to examples/CuTeDSL/helpers/static_persistent_tile_scheduler.py "
+    "(BSD-3). The wheel copy will be removed in a future release."
+)
+
 ##############################################################################
 # Static persistent tile scheduler
 ##############################################################################
@@ -42,19 +48,31 @@ class WorkTileInfo:
     def __init__(self, tile_idx: cute.Coord, is_valid_tile: Boolean):
         self._tile_idx = tile_idx
         self._is_valid_tile = Boolean(is_valid_tile)
+        self._tile_idx_num_values: Optional[int] = None
 
     def __extract_mlir_values__(self) -> list[ir.Value]:
-        values = extract_mlir_values(self.tile_idx)
-        values.extend(extract_mlir_values(self.is_valid_tile))
-        return values
+        tile_idx_values = extract_mlir_values(self.tile_idx)
+        valid_values = extract_mlir_values(self.is_valid_tile)
+        self._tile_idx_num_values = len(tile_idx_values)
+        return tile_idx_values + valid_values
 
     def __new_from_mlir_values__(self, values: list[ir.Value]) -> "WorkTileInfo":
-        assert len(values) == 4
-        new_tile_idx = new_from_mlir_values(self._tile_idx, values[:-1])
-        new_is_valid_tile = new_from_mlir_values(self._is_valid_tile, [values[-1]])
+        if self._tile_idx_num_values is None:
+            raise ValueError(
+                "WorkTileInfo reconstruction requires tile_idx width recorded during extraction"
+            )
+        n = self._tile_idx_num_values
+        expected = n + 1
+        if len(values) != expected:
+            raise ValueError(
+                f"expected {expected} mlir values for WorkTileInfo, got {len(values)}"
+            )
+        new_tile_idx = new_from_mlir_values(self._tile_idx, values[:n])
+        new_is_valid_tile = new_from_mlir_values(self._is_valid_tile, values[n:])
         return WorkTileInfo(new_tile_idx, new_is_valid_tile)
 
     @property
+    @cute.jit
     def is_valid_tile(self) -> Boolean:
         """Check latest tile returned by the scheduler is valid or not. Any scheduling
         requests after all tasks completed will return an invalid tile.
@@ -65,6 +83,7 @@ class WorkTileInfo:
         return self._is_valid_tile
 
     @property
+    @cute.jit
     def tile_idx(self) -> cute.Coord:
         """
         Get the index of the tile.
@@ -75,6 +94,7 @@ class WorkTileInfo:
         return self._tile_idx
 
 
+@deprecated(_DEPRECATION_MSG)
 class PersistentTileSchedulerParams:
     """A class to represent parameters for a persistent tile scheduler.
 
@@ -116,15 +136,16 @@ class PersistentTileSchedulerParams:
         :raises ValueError: If cluster_shape_k is not 1.
         """
 
-        if cluster_shape_mnk[2] != 1:  # type: ignore[index]
-            raise ValueError(f"unsupported cluster_shape_k {cluster_shape_mnk[2]}")  # type: ignore[index]
+        assert isinstance(cluster_shape_mnk, tuple)
+        if cluster_shape_mnk[2] != 1:
+            raise ValueError(f"unsupported cluster_shape_k {cluster_shape_mnk[2]}")
         if swizzle_size < 1:
             raise ValueError(f"expect swizzle_size >= 1, but get {swizzle_size}")
 
         self.problem_shape_ntile_mnl = problem_shape_ntile_mnl
         # cluster_shape_mnk is kept for reconstruction
         self._cluster_shape_mnk = cluster_shape_mnk
-        self.cluster_shape_mn = cluster_shape_mnk[:2]  # type: ignore[index]
+        self.cluster_shape_mn = cluster_shape_mnk[:2]
         self.swizzle_size = swizzle_size
         self.raster_along_m = raster_along_m
         self._loc = loc
@@ -133,7 +154,7 @@ class PersistentTileSchedulerParams:
         self.problem_layout_ncluster_mnl = cute.make_layout(
             cute.ceil_div(
                 self.problem_shape_ntile_mnl,
-                cluster_shape_mnk[:2],  # type: ignore[index]
+                cluster_shape_mnk[:2],
                 loc=loc,
                 ip=ip,
             ),
@@ -147,18 +168,19 @@ class PersistentTileSchedulerParams:
                 self.problem_layout_ncluster_mnl.shape,
                 (1, swizzle_size, 1) if raster_along_m else (swizzle_size, 1, 1),
             )
+            assert isinstance(problem_shape_ncluster_mnl, tuple)
 
             if raster_along_m:
                 self.problem_layout_ncluster_mnl = cute.make_layout(
                     (
-                        problem_shape_ncluster_mnl[0],  # type: ignore[index]
-                        (swizzle_size, problem_shape_ncluster_mnl[1] // swizzle_size),  # type: ignore[index, operator]
-                        problem_shape_ncluster_mnl[2],  # type: ignore[index]
+                        problem_shape_ncluster_mnl[0],
+                        (swizzle_size, problem_shape_ncluster_mnl[1] // swizzle_size),  # type: ignore[operator]
+                        problem_shape_ncluster_mnl[2],
                     ),
                     stride=(
                         swizzle_size,
-                        (1, swizzle_size * problem_shape_ncluster_mnl[0]),  # type: ignore[index]
-                        problem_shape_ncluster_mnl[0] * problem_shape_ncluster_mnl[1],  # type: ignore[index, operator]
+                        (1, swizzle_size * problem_shape_ncluster_mnl[0]),
+                        problem_shape_ncluster_mnl[0] * problem_shape_ncluster_mnl[1],  # type: ignore[operator]
                     ),
                     loc=loc,
                     ip=ip,
@@ -166,14 +188,14 @@ class PersistentTileSchedulerParams:
             else:
                 self.problem_layout_ncluster_mnl = cute.make_layout(
                     (
-                        (swizzle_size, problem_shape_ncluster_mnl[0] // swizzle_size),  # type: ignore[index, operator]
-                        problem_shape_ncluster_mnl[1],  # type: ignore[index]
-                        problem_shape_ncluster_mnl[2],  # type: ignore[index]
+                        (swizzle_size, problem_shape_ncluster_mnl[0] // swizzle_size),  # type: ignore[operator]
+                        problem_shape_ncluster_mnl[1],
+                        problem_shape_ncluster_mnl[2],
                     ),
                     stride=(
-                        (1, swizzle_size * problem_shape_ncluster_mnl[1]),  # type: ignore[index]
+                        (1, swizzle_size * problem_shape_ncluster_mnl[1]),
                         swizzle_size,
-                        problem_shape_ncluster_mnl[0] * problem_shape_ncluster_mnl[1],  # type: ignore[index, operator]
+                        problem_shape_ncluster_mnl[0] * problem_shape_ncluster_mnl[1],  # type: ignore[operator]
                     ),
                     loc=loc,
                     ip=ip,
@@ -527,8 +549,10 @@ class StaticPersistentTileScheduler:
                 current_work_linear_idx, loc=loc, ip=ip
             )
 
+        import cutlass.cute as _cute
+
         cur_tile_coord = tuple(
-            Int32(x) * Int32(z) + Int32(y)
+            _cute.arch.make_warp_uniform(Int32(x) * Int32(z) + Int32(y))
             for x, y, z in zip(
                 cur_cluster_coord,
                 self.cta_id_in_cluster,  # type: ignore[arg-type]
@@ -536,7 +560,7 @@ class StaticPersistentTileScheduler:
             )
         )
 
-        return WorkTileInfo(cur_tile_coord, is_valid)
+        return WorkTileInfo(cur_tile_coord, _cute.arch.make_warp_uniform(is_valid))
 
     def _get_cluster_work_idx_with_fastdivmod(
         self,
@@ -581,6 +605,7 @@ class StaticPersistentTileScheduler:
         return (cluster_m, cluster_n, batch_l)
 
     @dsl_user_op
+    @cute.jit
     def get_current_work(
         self,
         *,
@@ -592,6 +617,7 @@ class StaticPersistentTileScheduler:
         )
 
     @dsl_user_op
+    @cute.jit
     def initial_work_tile_info(
         self,
         *,
@@ -601,6 +627,7 @@ class StaticPersistentTileScheduler:
         return self.get_current_work(loc=loc, ip=ip)
 
     @dsl_user_op
+    @cute.jit
     def advance_to_next_work(
         self,
         *,
@@ -614,10 +641,12 @@ class StaticPersistentTileScheduler:
         self._num_tiles_executed += Int32(1)
 
     @property
+    @cute.jit
     def num_tiles_executed(self) -> Int32:
         return self._num_tiles_executed
 
 
+@deprecated(_DEPRECATION_MSG)
 class StaticPersistentRuntimeTileScheduler(StaticPersistentTileScheduler):
     """A scheduler for static persistent runtime tile execution in CUTLASS/CuTe kernels.
     This scheduler will always launch all the SMs and the scheduler will generate the real tile info for each SM.
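
Review note on the `WorkTileInfo` change above: the new `__extract_mlir_values__` / `__new_from_mlir_values__` pair records how many flattened values belong to `tile_idx` and uses that count to split the list again, instead of assuming exactly four values. The following is a minimal pure-Python sketch of that pattern. `FlatPair`, `extract`, and `rebuild` are hypothetical names used only for illustration; plain Python objects stand in for `ir.Value`, and no CuTe DSL API is called.

```python
from typing import Any, List, Optional, Tuple


class FlatPair:
    """Hypothetical stand-in for WorkTileInfo: a variable-width coordinate plus one flag."""

    def __init__(self, coord: Tuple[Any, ...], flag: Any):
        self._coord = tuple(coord)
        self._flag = flag
        # Width of the flattened coordinate; unknown until extract() has run.
        self._coord_width: Optional[int] = None

    def extract(self) -> List[Any]:
        """Flatten to a list and remember how many entries belong to the coordinate."""
        coord_values = list(self._coord)
        self._coord_width = len(coord_values)
        return coord_values + [self._flag]

    def rebuild(self, values: List[Any]) -> "FlatPair":
        """Split a flat list back into (coordinate, flag) using the recorded width."""
        if self._coord_width is None:
            raise ValueError("extract() must run before rebuild()")
        n = self._coord_width
        if len(values) != n + 1:
            raise ValueError(f"expected {n + 1} values, got {len(values)}")
        return FlatPair(tuple(values[:n]), values[n])


if __name__ == "__main__":
    pair = FlatPair((3, 7), True)
    flat = pair.extract()
    rebuilt = pair.rebuild([10, 20, False])
    print(flat, rebuilt._coord, rebuilt._flag)
```

As in the diff, the rebuilt object starts with no recorded width, so it needs its own extraction pass before it can be rebuilt in turn.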
diff --git a/python/CuTeDSL/cutlass/utils/tensor_helpers.py b/python/CuTeDSL/cutlass/utils/tensor_helpers.py
index df5f5710c..6da762053 100644
--- a/python/CuTeDSL/cutlass/utils/tensor_helpers.py
+++ b/python/CuTeDSL/cutlass/utils/tensor_helpers.py
@@ -34,32 +34,48 @@ def create_cute_tensor_for_fp8(
     dtype: Type[Numeric],
     leading_dim: int,
     source_f32_tensor: Optional[Any] = None,
+    assumed_align: int = 16,
+    mark_dynamic_layout: bool = True,
 ) -> cute.Tensor:
     """Create cute tensor, handling float8 types that don't support dlpack.
 
-    For float8 types, the storage_tensor should be uint8 (for DLPack compatibility).
+    For float8 types, the storage_tensor should use byte storage (for DLPack compatibility).
     The source_f32_tensor provides the actual float32 values to convert to fp8.
 
-    params storage_tensor: Tensor for DLPack (uint8 for fp8, otherwise the actual dtype)
+    params storage_tensor: Tensor for DLPack (byte storage for fp8, otherwise the actual dtype)
     params dtype: Target cutlass dtype
     params leading_dim: Leading dimension for dynamic layout
-    paramas source_f32_tensor: Float32 source data for fp8 conversion (required for fp8)
+    params source_f32_tensor: Float32 source data for fp8 conversion (required for fp8)
+    params assumed_align: Assumed alignment for the DLPack tensor
+    params mark_dynamic_layout: Whether to mark the resulting tensor layout dynamic
     return: A cute tensor with the appropriate dtype and layout
     """
     import cutlass.torch as cutlass_torch
 
+    fp8_source_tensor: Optional[Any] = None
+    if is_fp8_dtype(dtype):
+        if source_f32_tensor is None:
+            raise ValueError("source_f32_tensor is required for fp8 types")
+        fp8_source_tensor = source_f32_tensor
+        try:
+            import torch
+
+            if storage_tensor.dtype not in {torch.int8, torch.uint8}:
+                storage_tensor = storage_tensor.view(dtype=torch.uint8)
+        except AttributeError:
+            pass
+
     cute_tensor = from_dlpack(
-        storage_tensor, assumed_align=16, force_tf32=dtype == TFloat32
+        storage_tensor, assumed_align=assumed_align, force_tf32=dtype == TFloat32
     )
     # For float8 types, set element_type explicitly since storage is uint8
     if is_fp8_dtype(dtype):
         cute_tensor.element_type = dtype
-    cute_tensor = cute_tensor.mark_layout_dynamic(leading_dim=leading_dim)
+    if mark_dynamic_layout:
+        cute_tensor = cute_tensor.mark_layout_dynamic(leading_dim=leading_dim)
     # For float8 types, convert data from float32 using GPU kernel
-    if is_fp8_dtype(dtype):
-        if source_f32_tensor is None:
-            raise ValueError("source_f32_tensor is required for fp8 types")
+    if fp8_source_tensor is not None:
         cute_tensor = cutlass_torch.convert_cute_tensor(
-            source_f32_tensor, cute_tensor, dtype, is_dynamic_layout=True
+            fp8_source_tensor, cute_tensor, dtype, is_dynamic_layout=True
         )
     return cute_tensor
diff --git a/python/CuTeDSL/cutlass/utils/tensormap_manager.py b/python/CuTeDSL/cutlass/utils/tensormap_manager.py
index cd19dd3ad..fc52cf9da 100644
--- a/python/CuTeDSL/cutlass/utils/tensormap_manager.py
+++ b/python/CuTeDSL/cutlass/utils/tensormap_manager.py
@@ -13,9 +13,9 @@ from dataclasses import dataclass
 from enum import Enum, auto
 from typing import Optional, Tuple
 
-from cutlass._mlir import ir
 import cutlass._mlir.dialects.cute as _cute_ir
 import cutlass._mlir.dialects.cute_nvgpu as _cute_nvgpu_ir
+from cutlass._mlir import ir
 from cutlass.cutlass_dsl import dsl_user_op
 
 import cutlass.cute as cute
@@ -149,7 +149,7 @@ class TensorMapManager:
                     p.dtype,
                     cute.arch.make_warp_uniform(p.toint(), loc=loc, ip=ip),
                     mem_space=_CuteAddressSpace.smem,
-                    assumed_align=p.alignment,  # type: ignore[attr-defined]
+                    assumed_align=p.alignment,
                 )
                 for p in tensormap_smem_ptr
             )
diff --git a/python/CuTeDSL/cutlass/utils/tmem_allocator.py b/python/CuTeDSL/cutlass/utils/tmem_allocator.py
index dd747aa50..8b840a52e 100644
--- a/python/CuTeDSL/cutlass/utils/tmem_allocator.py
+++ b/python/CuTeDSL/cutlass/utils/tmem_allocator.py
@@ -28,6 +28,7 @@ import cutlass.cute as cute
 from cutlass._mlir import ir
 from cutlass.cute.nvgpu.tcgen05 import find_tmem_tensor_col_offset
 from cutlass.cute.arch import get_max_tmem_alloc_cols, get_min_tmem_alloc_cols
+from cutlass.cute.arch.constants import WARP_SIZE
 
 
 _TMEM_COL_MASK = 0x0000FFFF
@@ -285,7 +286,8 @@ class TmemAllocator:
         warp_idx = cute.arch.make_warp_uniform(warp_idx, loc=loc, ip=ip)
         _is_allocator_warp = warp_idx == self._allocator_warp_id
         if _is_allocator_warp:
-            num_tmem_dealloc_threads = 32
+            # Tmem dealloc is executed by one warp
+            num_tmem_dealloc_threads = WARP_SIZE
             with cute.arch.elect_one(loc=loc, ip=ip):
                 cute.arch.mbarrier_init(
                     self._two_cta_tmem_dealloc_mbar_ptr,
@@ -309,7 +311,7 @@ class TmemAllocator:
         initialize_mbarrier: bool = True,
         loc: Optional[ir.Location] = None,
         ip: Optional[ir.InsertionPoint] = None,
-    ) -> None:
+    ):
         """
         Initialize a TmemAllocator instance for managing tensor memory on Blackwell GPUs.
 
diff --git a/python/CuTeDSL/prep_editable_install.py b/python/CuTeDSL/prep_editable_install.py
index d3064f7bb..d6047051e 100644
--- a/python/CuTeDSL/prep_editable_install.py
+++ b/python/CuTeDSL/prep_editable_install.py
@@ -23,7 +23,7 @@ import tempfile
 import zipfile
 import re
 from pathlib import Path
-from typing import Optional, Tuple, List
+from typing import Tuple
 import logging
 
 # Configure logging
@@ -40,27 +40,6 @@ class CutlassDSLSetupError(Exception):
     pass
 
 
-def get_package_spec(requirements_path: Optional[Path] = None) -> str:
-    """
-    Return the pip requirement spec for nvidia-cutlass-dsl from requirements.txt.
-
-    If anything goes wrong (file not found, parse failure, line missing),
-    return PACKAGE_NAME as a safe default.
-    """
-    try:
-        req_path = requirements_path or Path(__file__).with_name("requirements.txt")
-        with open(req_path, "r", encoding="utf-8") as f:
-            for raw_line in f:
-                line = raw_line.strip()
-                if not line or line.startswith("#"):
-                    continue
-                if line.lower().startswith(PACKAGE_NAME):
-                    return line.split("#", 1)[0].strip()
-    except Exception:
-        pass
-    return PACKAGE_NAME
-
-
 def download_wheel(temp_dir: Path) -> Path:
     """
     Download the nvidia-cutlass-dsl wheel to a temporary directory.
@@ -74,10 +53,7 @@ def download_wheel(temp_dir: Path) -> Path:
     Raises:
         CutlassDSLSetupError: If download fails or wheel not found
     """
-    # Resolve package spec from requirements, or fall back to PACKAGE_NAME
-    package_spec = get_package_spec()
-
-    logger.info(f"Downloading {package_spec} wheel to {temp_dir}")
+    logger.info(f"Downloading {PACKAGE_NAME} wheel to {temp_dir}")
 
     try:
         subprocess.check_call(
@@ -87,7 +63,7 @@ def download_wheel(temp_dir: Path) -> Path:
                 "pip",
                 "download",
                 "--no-deps",
-                package_spec,
+                PACKAGE_NAME,
                 "--dest",
                 str(temp_dir),
             ],
@@ -103,7 +79,7 @@ def download_wheel(temp_dir: Path) -> Path:
         raise CutlassDSLSetupError(error_msg)
 
     # Find the downloaded wheel file
-    wheel_pattern = f"*.whl"
+    wheel_pattern = f"{PACKAGE_NAME.replace('-', '_')}-*.whl"
     wheel_files = list(temp_dir.glob(wheel_pattern))
     if not wheel_files:
         raise CutlassDSLSetupError(
@@ -132,7 +108,7 @@ def extract_version_from_wheel(wheel_path: Path) -> str:
     # Construct version regex from package name
     # Wheel filename format: {package_name_with_underscores}-{version}-{python}-{abi}-{platform}.whl
     package_pattern = PACKAGE_NAME.replace("-", "_")
-    version_regex = rf"{re.escape(package_pattern)}-([^-]+)"
+    version_regex = rf"{re.escape(package_pattern)}-([^-]+)-"
     version_match = re.match(version_regex, wheel_filename)
 
     if version_match:
@@ -156,7 +132,9 @@ def extract_version_from_wheel(wheel_path: Path) -> str:
 
         return dev_version
     else:
-        return "9.9.9.dev0"
+        raise CutlassDSLSetupError(
+            f"Could not parse version from wheel filename: {wheel_filename}"
+        )
 
 
 def extract_wheel_contents(wheel_path: Path, extract_dir: Path) -> None:
@@ -193,7 +171,7 @@ def copy_library_files(extract_dir: Path, package_root: Path) -> int:
     Returns:
         Number of files copied
     """
-    lib_pattern = extract_dir / "**" / "lib" / "*.so"
+    # Collect shared libraries from any lib/ directory in the extracted wheel.
     so_files = [f for f in extract_dir.rglob("lib/*.so")]
 
     if not so_files:
@@ -240,7 +218,7 @@ def copy_python_packages(extract_dir: Path, package_root: Path) -> Tuple[int, in
     cutlass_source_dir = cutlass_source_dirs[0]
     cutlass_dest_dir = package_root / "cutlass"
 
-    logger.info(f"Found python_packages/cutlass/ directory")
+    logger.info("Found python_packages/cutlass/ directory")
     logger.info(f"Copying from {cutlass_source_dir} to {cutlass_dest_dir}")
 
     copied_count = 0
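
Review note on the `extract_version_from_wheel` change above: the tightened regex requires a hyphen after the version field, and a filename that does not match now raises instead of silently falling back to `9.9.9.dev0`. The sketch below reproduces only that parsing step using the standard `re` module. The `PACKAGE_NAME` value, the helper name `parse_wheel_version`, the use of `ValueError` in place of `CutlassDSLSetupError`, and the sample filename are assumptions made for illustration.

```python
import re

# Assumed value; the script's actual constant is not shown in this hunk.
PACKAGE_NAME = "nvidia-cutlass-dsl"


def parse_wheel_version(wheel_filename: str) -> str:
    """Return the version field of a wheel filename, or raise if it cannot be parsed.

    Wheel filename format:
    {package_name_with_underscores}-{version}-{python}-{abi}-{platform}.whl
    """
    package_pattern = PACKAGE_NAME.replace("-", "_")
    # The trailing "-" forces the version to be a complete hyphen-delimited field.
    match = re.match(rf"{re.escape(package_pattern)}-([^-]+)-", wheel_filename)
    if match is None:
        raise ValueError(
            f"Could not parse version from wheel filename: {wheel_filename}"
        )
    return match.group(1)


if __name__ == "__main__":
    # Hypothetical filename, shaped like the format described above.
    name = "nvidia_cutlass_dsl-4.6.0.dev0-cp312-cp312-manylinux_2_28_x86_64.whl"
    print(parse_wheel_version(name))
```

A name such as `nvidia_cutlass_dsl-4.6.0.whl`, which has no tag fields after the version, is rejected by this pattern rather than yielding `4.6.0.whl` as the version.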
diff --git a/python/CuTeDSL/requirements-cu13.txt b/python/CuTeDSL/requirements-cu13.txt
index 8486beced..961756b78 100644
--- a/python/CuTeDSL/requirements-cu13.txt
+++ b/python/CuTeDSL/requirements-cu13.txt
@@ -1,3 +1,3 @@
 # Use `pip install -r requirements-cu13.txt` with the present file to install a
 # wheel consistent with the present state of the github repository
-nvidia-cutlass-dsl[cu13]==4.5.2
+nvidia-cutlass-dsl[cu13]==4.6.0.dev0
diff --git a/python/CuTeDSL/requirements.txt b/python/CuTeDSL/requirements.txt
index 83175d1de..985c46ef0 100644
--- a/python/CuTeDSL/requirements.txt
+++ b/python/CuTeDSL/requirements.txt
@@ -1,3 +1,3 @@
 # Use `pip install -r requirements.txt` with the present file to install a
 # wheel consistent with the present state of the github repository
-nvidia-cutlass-dsl==4.5.2
+nvidia-cutlass-dsl==4.6.0.dev0
diff --git a/python/cutlass_cppgen/__init__.py b/python/cutlass_cppgen/__init__.py
index 2928b3b71..807632e52 100644
--- a/python/cutlass_cppgen/__init__.py
+++ b/python/cutlass_cppgen/__init__.py
@@ -133,7 +133,7 @@ def get_option_registry():
         this._option_registry = OptionRegistry(device_cc())
     return this._option_registry
 
-this.__version__ = '4.5.2'
+this.__version__ = '4.6.0'
 
 from cutlass_cppgen.backend import create_memory_pool
 from cutlass_cppgen.emit.pytorch import pytorch
diff --git a/python/setup_cutlass.py b/python/setup_cutlass.py
index 31fa5b0f7..eba123f33 100644
--- a/python/setup_cutlass.py
+++ b/python/setup_cutlass.py
@@ -51,7 +51,7 @@ setup_pycute.perform_setup()
 
 setup(
     name='cutlass_cppgen',
-    version='4.5.2',
+    version='4.6.0',
     description='CUTLASS Pythonic Interface',
     package_dir={'': '.'},
     packages=[
diff --git a/python/setup_library.py b/python/setup_library.py
index 534723ca2..353a696bb 100644
--- a/python/setup_library.py
+++ b/python/setup_library.py
@@ -36,7 +36,7 @@ from setuptools import setup
 def perform_setup():
     setup(
         name='cutlass_library',
-        version='4.5.2',
+        version='4.6.0',
         description='CUTLASS library generation scripts',
         packages=['cutlass_library']
     )
diff --git a/python/setup_pycute.py b/python/setup_pycute.py
index de04dd92c..58bade228 100644
--- a/python/setup_pycute.py
+++ b/python/setup_pycute.py
@@ -36,7 +36,7 @@ from setuptools import setup
 def perform_setup():
     setup(
         name='pycute',
-        version='4.5.2',
+        version='4.6.0',
         description='Python implementation of CuTe',
         packages=['pycute'],
     )
diff --git a/tools/util/include/cutlass/util/command_line.h b/tools/util/include/cutlass/util/command_line.h
index 44da2dfd9..36b028cc0 100644
--- a/tools/util/include/cutlass/util/command_line.h
+++ b/tools/util/include/cutlass/util/command_line.h
@@ -57,6 +57,7 @@ namespace cutlass {
  * Utility for parsing command line arguments
  */
 struct CommandLine {
+  std::string program_path;
   std::vector<std::string> keys;
   std::vector<std::string> values;
   std::vector<std::string> args;
@@ -64,7 +65,7 @@ struct CommandLine {
   /**
    * Constructor
    */
-  CommandLine(int argc, const char** argv) {
+  CommandLine(int argc, const char** argv) : program_path(argc > 0 ? argv[0] : "") {
     using namespace std;
 
     for (int i = 1; i < argc; i++) {
diff --git a/tools/util/include/cutlass/util/reference/device/tensor_fill.h b/tools/util/include/cutlass/util/reference/device/tensor_fill.h
index a027be935..940074f6e 100644
--- a/tools/util/include/cutlass/util/reference/device/tensor_fill.h
+++ b/tools/util/include/cutlass/util/reference/device/tensor_fill.h
@@ -1921,7 +1921,8 @@ void BlockFillSequential(
   Element *ptr,
   int64_t capacity,
   Element v = Element(1),
-  Element s = Element(0)) {
+  Element s = Element(0),
+  cudaStream_t stream = nullptr) {
 
   using Layout = layout::PackedVectorLayout;
   Layout::TensorCoord size(static_cast<Layout::Index>(capacity)); // -Wconversion
@@ -1931,7 +1932,7 @@ void BlockFillSequential(
   Array<Element, Layout::kRank> c{};
   c[0] = v;
 
-  TensorFillLinear(view, c, s);
+  TensorFillLinear(view, c, s, stream);
 }
 
 ///////////////////////////////////////////////////////////////////////////////////////////////////
@@ -1972,6 +1973,14 @@ void BlockFillRandom(
       dist.uniform.pnan,
       stream);
   }
+  else if (dist.kind == Distribution::Sequential) {
+    BlockFillSequential<Element>(
+      ptr,
+      capacity,
+      static_cast<Real>(dist.sequential.delta),
+      static_cast<Real>(dist.sequential.start),
+      stream);
+  }
 }
 
 ///////////////////////////////////////////////////////////////////////////////////////////////////