* v4.6 dev update.
* Remove CUTLASS_HOST_DEVICE from CudaHostAdapter::memsetDevice (#3286)
* [SM120] Add ptr-array TMA collective for tensor/token-scaled FP8 grouped GEMM (#3280)
* gemm: add SM120 array TMA collective for tensor/token-scaled FP8 grouped GEMM
Adds CollectiveMma and CollectiveBuilder specializations for
MainloopSm120ArrayTmaWarpSpecialized, enabling ptr-array grouped GEMM
(MoE expert dispatch) with tensor- and token-level FP8 scaling on
SM_120/SM_121 consumer Blackwell (RTX 5090/5080/5070, DGX Spark GB10).
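For readers unfamiliar with ptr-array grouped GEMM, the following is a minimal host-side reference sketch in plain Python, not the CUTLASS kernel. It assumes one scale per row (token) of each A_g and a single per-tensor scale for each B_g; that assignment, and the function and argument names, are illustrative assumptions rather than anything taken from this patch.

```python
def grouped_gemm_ref(ptr_a, ptr_b, scale_a, scale_b):
    """Reference grouped GEMM: one independent GEMM per group.

    ptr_a[g] is an M_g x K matrix and ptr_b[g] is a K x N matrix, both
    stored as lists of rows. Indexing a list of matrices by group mimics
    the pointer-array indirection. scale_a[g] holds one scale per row
    (token) of A_g; scale_b[g] is a single per-tensor scale for B_g.
    """
    outputs = []
    for a, b, sa, sb in zip(ptr_a, ptr_b, scale_a, scale_b):
        k, n = len(b), len(b[0])
        # Accumulate the unscaled dot product, then apply both scales.
        c = [[sa[i] * sb * sum(a[i][p] * b[p][j] for p in range(k))
              for j in range(n)]
             for i in range(len(a))]
        outputs.append(c)
    return outputs


# Two groups (experts) with different M, as in MoE dispatch.
result = grouped_gemm_ref(
    ptr_a=[[[1, 2]], [[1, 1], [2, 0]]],
    ptr_b=[[[1, 0], [0, 1]], [[1, 2], [3, 4]]],
    scale_a=[[2.0], [1.0, 0.5]],
    scale_b=[0.5, 2.0],
)
```

Each group may have a different M, which is why the operands are addressed through an array of per-group pointers instead of a single strided batch.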
New files:
- include/cutlass/gemm/collective/sm120_mma_array_tma.hpp
CollectiveMma specialization for MainloopSm120ArrayTmaWarpSpecialized.
Handles both Cooperative (4x2 atom layout) and Pingpong (2x2) schedules.
Grouped GEMM via pointer-array indirection through params.ptr_A / ptr_B.
Supports F8F6F4 MMA with TMA loads for both A and B operands.
- include/cutlass/gemm/collective/builders/sm120_array_mma_builder.inl
CollectiveBuilder specialization for KernelPtrArrayTmaWarpSpecialized
Cooperative/PingpongSm120<N> schedule tags. Computes tile/stage counts
from smem capacity, routes to MainloopSm120ArrayTmaWarpSpecialized
dispatch policy, produces correctly-typed CollectiveOp.
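As a rough illustration of deriving a stage count from shared-memory capacity, the sketch below divides the usable capacity by the bytes one mainloop stage needs for its A and B tiles. The function name, the carveout parameter, and all numbers are hypothetical; the builder's actual computation may account for further terms.

```python
def max_pipeline_stages(smem_capacity_bytes, carveout_bytes,
                        tile_m, tile_n, tile_k,
                        bytes_per_elem_a=1, bytes_per_elem_b=1):
    """Number of mainloop stages that fit in shared memory.

    carveout_bytes stands for smem reserved for other uses (for example
    the epilogue); it is an assumed parameter for this sketch.
    """
    # One stage buffers an (M x K) tile of A and an (N x K) tile of B.
    bytes_per_stage = (tile_m * tile_k * bytes_per_elem_a
                       + tile_n * tile_k * bytes_per_elem_b)
    return (smem_capacity_bytes - carveout_bytes) // bytes_per_stage


# Illustrative numbers only: FP8 operands (1 byte each), 128x128x128 tile.
stages = max_pipeline_stages(96 * 1024, 16 * 1024, 128, 128, 128)
```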
Modified files:
- collective_mma.hpp: include sm120_mma_array_tma.hpp
- collective_builder.hpp: include sm120_array_mma_builder.inl
- sm120_mma_builder.inl: remove ptr-array schedules from enable_if
(they now route to sm120_array_mma_builder.inl) and drop the
IsPtrArrayKernel static_assert that enforced the restriction
Validated on real SM_121 hardware (DGX Spark, 128 GB LPDDR5X) running
vLLM with RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE, 26B
total / 4B active). Previously fell back to a non-CUTLASS Triton path;
with this patch, the SM120 CUTLASS grouped GEMM collective activates and
produces correct outputs. Short-sequence throughput improved ~7% vs the
fallback baseline (76.3 → 81.9 tok/s).
Closes #3263
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
* test: add SM120 ptr-array grouped GEMM unit tests
Adds 6 device-level tests for the CollectiveMma/CollectiveBuilder
specializations introduced for MainloopSm120ArrayTmaWarpSpecialized,
covering both KernelPtrArrayTmaWarpSpecializedPingpongSm120<2> and
KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> schedule tags across
e4m3×e4m3 (symmetric), e4m3×e5m2 (mixed), float and bfloat16 outputs,
and two tile shapes.
Tests land in test/unit/gemm/device/sm120_tensorop_gemm/ under the new
cutlass_test_unit_sm120_grouped_gemm_device_tensorop CMake target, per
reviewer request in PR #3280.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Alex Georgiev <89279829+alexngUNC@users.noreply.github.com>
Co-authored-by: Tyler <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Fixes https://github.com/NVIDIA/cutlass/issues/3268
A `@cute.struct` instance captured into an `scf.if` branch or `scf.while`
body fails the DSL trace with:
DSLRuntimeError: The 'if' statement encountered a user-defined Python
object, which cannot be automatically converted into an dynamic
expression.
This blocks the natural warp-specialization pattern, where each
`if warp_idx == <role>:` branch reads its tile from a shared storage
struct.
A struct instance is fully described by its `base` pointer (already
DynamicExpression-aware via `_Pointer`); every field instance is
re-derived from `base + static offsets` on construction. Implement the
DynamicExpression protocol on each decorated class by forwarding
`__get_mlir_types__` / `__extract_mlir_values__` to `base`, and
`__new_from_mlir_values__` to a fresh decorator invocation that
re-derives the fields from a rebuilt base pointer.
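The forwarding idea can be modelled in plain Python without the DSL. In this toy sketch, `FakePointer` and `StorageStruct` are hypothetical stand-ins (not cutlass-dsl classes), and strings play the role of MLIR values; only the three protocol method names come from the description above.

```python
class FakePointer:
    """Stand-in for a DSL pointer that wraps one opaque 'MLIR value'."""

    def __init__(self, value):
        self.value = value

    def __get_mlir_types__(self):
        return ["!ptr"]

    def __extract_mlir_values__(self):
        return [self.value]

    def __new_from_mlir_values__(self, values):
        return FakePointer(values[0])


class StorageStruct:
    """Toy struct whose fields live at static offsets from `base`."""

    OFFSETS = {"a_tile": 0, "b_tile": 1024}

    def __init__(self, base):
        self.base = base
        # Fields are re-derived from base + static offset on construction.
        self.fields = {name: (base.value, offset)
                       for name, offset in self.OFFSETS.items()}

    # The struct carries no dynamic state beyond `base`, so the protocol
    # simply forwards to it.
    def __get_mlir_types__(self):
        return self.base.__get_mlir_types__()

    def __extract_mlir_values__(self):
        return self.base.__extract_mlir_values__()

    def __new_from_mlir_values__(self, values):
        # Rebuild the base pointer, then construct a fresh struct around it.
        return type(self)(self.base.__new_from_mlir_values__(values))


storage = StorageStruct(FakePointer("%arg0"))
# Inside a branch, the tracer would hand back new block-argument values.
rebuilt = storage.__new_from_mlir_values__(["%blockarg0"])
```

Because the fields are recomputed in the constructor, nothing besides the base pointer has to cross the region boundary.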
Tested in Docker on cutlass-dsl 4.5.1 with six new unit tests in
test/python/CuTeDSL/test_struct_in_if.py covering:
* the original failing case (storage.get_tensor inside dynamic if),
* regression: plain non-branched struct usage still works,
* nested struct (struct-of-struct) inside a dynamic if,
* if/else with both branches accessing the struct,
* if/elif/elif/else (the actual warp-specialization shape),
* scf.while body capturing the struct.
A dataclass with no fields exposed a bug in `extract_dataclass_members`:
```
from dataclasses import dataclass

@dataclass
class Dummy:
    pass
```
The type/return path was inconsistent. This PR fixes the function to
support empty dataclasses, which are useful in unions.
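A minimal sketch of the intended behaviour, using only the standard library: the helper below always returns the same pair-of-lists shape, including for a dataclass with no fields. `member_names_and_values` is a hypothetical helper written for illustration; it is not the DSL's `extract_dataclass_members` and its return shape is an assumption.

```python
from dataclasses import dataclass, fields, is_dataclass


def member_names_and_values(obj):
    """Return (names, values) for a dataclass instance.

    The return type is the same whether or not the dataclass has fields,
    so callers never need a special case for the empty one.
    """
    if not is_dataclass(obj) or isinstance(obj, type):
        raise TypeError("expected a dataclass instance")
    names = [f.name for f in fields(obj)]
    values = [getattr(obj, name) for name in names]
    return names, values


@dataclass
class Dummy:
    pass


@dataclass
class Pair:
    x: int
    y: int


empty = member_names_and_values(Dummy())
pair = member_names_and_values(Pair(1, 2))
```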
Implement grouped GEMM (C_g = A_g x B_g for g groups) on Hopper using
CuTe DSL, extending the dense persistent GEMM with per-group TMA
descriptor management.
Kernel design (grouped_gemm.py):
- Warp-specialized pipeline: DMA warp group handles TMA loads and
per-group tensormap updates; MMA warp group runs WGMMA and stores C
- StaticPersistentGroupTileScheduler for cross-group tile scheduling
- Per-group TMA descriptor updates via GMEM or SMEM mode
- Supports fp16, fp8 (E4M3FN/E5M2), int8 with mixed A/B dtypes
- Configurable tile shapes (128x128, 128x256) and cluster shapes
- Fix base TensorMapManager: hoist uniform_smem_ptrs outside predicated
block to avoid illegal @P0 R2UR on sm_90a
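To illustrate cross-group tile scheduling, here is a simplified Python model of how a persistent CTA might walk output tiles across groups. It is a sketch under stated assumptions (linear tile order, round-robin assignment, a hypothetical function name), not the logic of StaticPersistentGroupTileScheduler.

```python
def tiles_for_cta(problem_sizes, tile_m, tile_n, cta_id, num_ctas):
    """List the (group, tile_m_idx, tile_n_idx) work items for one CTA.

    problem_sizes is a list of (M, N) per group. Tiles of all groups are
    laid out in one linear order and dealt out round-robin to CTAs.
    """
    work = []
    for group, (m, n) in enumerate(problem_sizes):
        tiles_m = -(-m // tile_m)  # ceiling division
        tiles_n = -(-n // tile_n)
        for tm in range(tiles_m):
            for tn in range(tiles_n):
                work.append((group, tm, tn))
    return work[cta_id::num_ctas]


sizes = [(256, 128), (100, 300)]  # two groups with different shapes
cta0 = tiles_for_cta(sizes, 128, 128, cta_id=0, num_ctas=2)
cta1 = tiles_for_cta(sizes, 128, 128, cta_id=1, num_ctas=2)
```

Whenever consecutive work items for a CTA belong to different groups, the kernel would need that group's TMA descriptor, which is where the per-group tensormap update comes in.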
Tests (test/examples/CuTeDSL/hopper/test_grouped_gemm.py):
- L0 compile and L1 correctness pytest suite covering tile shapes,
dtypes, major modes, cluster shapes, group counts, and mixed sizes
- Move to test/examples/CuTeDSL/hopper/ following sm_100a convention
- Fix deprecated startdir arg in test_sharding.py pytest hook
Before this fix, combining two Boolean (i1) DSL values with Python `and`
triggered a verbose i1→i32→i1 round-trip in __dsl_and__:
arith.extui (×3), arith.select, arith.cmpi ne (×2) — 6 extra MLIR ops.
Add a fast path: when both operands are Boolean, delegate directly to
__and__, emitting a single arith.andi %a, %b : i1 — identical to `&`.
Both operators were already semantically equivalent; this fix makes the
generated MLIR identical as well.
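The equivalence can be checked exhaustively with a small Python model. The round-trip function below is one plausible reading of the op sequence described above (widen, compare, select, compare), written as an assumption for illustration; it is not extracted from the DSL source.

```python
from itertools import product


def and_via_i32_round_trip(a: bool, b: bool) -> bool:
    """Model of the slow path: widen to i32, select, narrow back to i1."""
    a32, b32 = int(a), int(b)             # arith.extui i1 -> i32
    a_is_true = a32 != 0                  # arith.cmpi ne
    selected = b32 if a_is_true else a32  # arith.select
    return selected != 0                  # arith.cmpi ne


def and_fast_path(a: bool, b: bool) -> bool:
    """Model of the fast path: a single bitwise AND on i1."""
    return a & b                          # arith.andi


# Exhaustive truth table over both Boolean operands.
agree = all(and_via_i32_round_trip(a, b) == and_fast_path(a, b)
            for a, b in product([False, True], repeat=2))
```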
Includes:
- repro_dsl_and_bool.py — minimal standalone reproducer / bug-report script
- test_dsl_and_fix.py — pytest tests verifying the fixed behaviour
Add subtraction operation for packed f32x2 values, following the same
pattern as the existing add_packed_f32x2 and mul_packed_f32x2 operations.
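As a reference for the expected semantics, the sketch below models a packed f32x2 subtraction as an elementwise subtract on a pair of values, with each result rounded to single precision. The name `sub_packed_f32x2_ref` is hypothetical and this is a host-side model only, not the DSL operation.

```python
import struct


def _to_f32(x):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sub_packed_f32x2_ref(a, b):
    """Elementwise a - b on two-element tuples, rounded to float32."""
    return (_to_f32(a[0] - b[0]), _to_f32(a[1] - b[1]))


diff = sub_packed_f32x2_ref((3.0, 1.5), (1.0, 0.25))
```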
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>