* v4.6 dev update.
* Remove CUTLASS_HOST_DEVICE from CudaHostAdapater::memsetDevice (#3286)
* [SM120] Add ptr-array TMA collective for tensor/token-scaled FP8 grouped GEMM (#3280)
* gemm: add SM120 array TMA collective for tensor/token-scaled FP8 grouped GEMM
Adds CollectiveMma and CollectiveBuilder specializations for
MainloopSm120ArrayTmaWarpSpecialized, enabling ptr-array grouped GEMM
(MoE expert dispatch) with tensor- and token-level FP8 scaling on
SM_120/SM_121 consumer Blackwell (RTX 5090/5080/5070, DGX Spark GB10).
New files:
- include/cutlass/gemm/collective/sm120_mma_array_tma.hpp
CollectiveMma specialization for MainloopSm120ArrayTmaWarpSpecialized.
Handles both Cooperative (4x2 atom layout) and Pingpong (2x2) schedules.
Grouped GEMM via pointer-array indirection through params.ptr_A / ptr_B.
Supports F8F6F4 MMA with TMA loads for both A and B operands.
- include/cutlass/gemm/collective/builders/sm120_array_mma_builder.inl
CollectiveBuilder specialization for KernelPtrArrayTmaWarpSpecialized
Cooperative/PingpongSm120<N> schedule tags. Computes tile/stage counts
from smem capacity, routes to MainloopSm120ArrayTmaWarpSpecialized
dispatch policy, produces correctly-typed CollectiveOp.
Modified files:
- collective_mma.hpp: include sm120_mma_array_tma.hpp
- collective_builder.hpp: include sm120_array_mma_builder.inl
- sm120_mma_builder.inl: remove ptr-array schedules from enable_if
(they now route to sm120_array_mma_builder.inl) and drop the
IsPtrArrayKernel static_assert that enforced the restriction
Validated on real SM_121 hardware (DGX Spark, 128 GB LPDDR5X) running
vLLM with RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE, 26B
total / 4B active). Previously fell back to a non-CUTLASS Triton path;
with this patch, the SM120 CUTLASS grouped GEMM collective activates and
produces correct outputs. Short-sequence throughput improved ~7% vs the
fallback baseline (76.3 → 81.9 tok/s).
Closes#3263
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
* test: add SM120 ptr-array grouped GEMM unit tests
Adds 6 device-level tests for the CollectiveMma/CollectiveBuilder
specializations introduced for MainloopSm120ArrayTmaWarpSpecialized,
covering both KernelPtrArrayTmaWarpSpecializedPingpongSm120<2> and
KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> schedule tags across
e4m3×e4m3 (symmetric), e4m3×e5m2 (mixed), float and bfloat16 outputs,
and two tile shapes.
Tests land in test/unit/gemm/device/sm120_tensorop_gemm/ under the new
cutlass_test_unit_sm120_grouped_gemm_device_tensorop CMake target, per
reviewer request in PR #3280.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
---------
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Alex Georgiev <89279829+alexngUNC@users.noreply.github.com>
Co-authored-by: Tyler <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
* gemm: add SM120 array TMA collective for tensor/token-scaled FP8 grouped GEMM
Adds CollectiveMma and CollectiveBuilder specializations for
MainloopSm120ArrayTmaWarpSpecialized, enabling ptr-array grouped GEMM
(MoE expert dispatch) with tensor- and token-level FP8 scaling on
SM_120/SM_121 consumer Blackwell (RTX 5090/5080/5070, DGX Spark GB10).
New files:
- include/cutlass/gemm/collective/sm120_mma_array_tma.hpp
CollectiveMma specialization for MainloopSm120ArrayTmaWarpSpecialized.
Handles both Cooperative (4x2 atom layout) and Pingpong (2x2) schedules.
Grouped GEMM via pointer-array indirection through params.ptr_A / ptr_B.
Supports F8F6F4 MMA with TMA loads for both A and B operands.
- include/cutlass/gemm/collective/builders/sm120_array_mma_builder.inl
CollectiveBuilder specialization for KernelPtrArrayTmaWarpSpecialized
Cooperative/PingpongSm120<N> schedule tags. Computes tile/stage counts
from smem capacity, routes to MainloopSm120ArrayTmaWarpSpecialized
dispatch policy, produces correctly-typed CollectiveOp.
Modified files:
- collective_mma.hpp: include sm120_mma_array_tma.hpp
- collective_builder.hpp: include sm120_array_mma_builder.inl
- sm120_mma_builder.inl: remove ptr-array schedules from enable_if
(they now route to sm120_array_mma_builder.inl) and drop the
IsPtrArrayKernel static_assert that enforced the restriction
Validated on real SM_121 hardware (DGX Spark, 128 GB LPDDR5X) running
vLLM with RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE, 26B
total / 4B active). Previously fell back to a non-CUTLASS Triton path;
with this patch, the SM120 CUTLASS grouped GEMM collective activates and
produces correct outputs. Short-sequence throughput improved ~7% vs the
fallback baseline (76.3 → 81.9 tok/s).
Closes#3263
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
* test: add SM120 ptr-array grouped GEMM unit tests
Adds 6 device-level tests for the CollectiveMma/CollectiveBuilder
specializations introduced for MainloopSm120ArrayTmaWarpSpecialized,
covering both KernelPtrArrayTmaWarpSpecializedPingpongSm120<2> and
KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> schedule tags across
e4m3×e4m3 (symmetric), e4m3×e5m2 (mixed), float and bfloat16 outputs,
and two tile shapes.
Tests land in test/unit/gemm/device/sm120_tensorop_gemm/ under the new
cutlass_test_unit_sm120_grouped_gemm_device_tensorop CMake target, per
reviewer request in PR #3280.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Introduces cutlass::epilogue::thread::Snake, a two-operand activation
functor implementing Snake_a(x) = x + (1/a) * sin^2(a*x) from
Ziyin et al. 2020 (arXiv:2006.08195). The per-channel learnable
frequency `a` flows through an EVT child (e.g. Sm90RowBroadcast),
composing into Sm90EVT<Sm90Compute<Snake, ...>, x_node, alpha_node>
for fused GEMM+Snake epilogues used in neural vocoders.
Adds unit tests in test/unit/epilogue/thread/activation.cu covering
f32 and bf16 paths, validated against float64 reference goldens.
Closes#3141
The store(frag, tile_offset) method was computing the pointer offset
without dividing by kElementsPerAccess, while the matching load(frag,
tile_offset) method does include this division. Both load_with_pointer_offset
and store_with_pointer_offset apply the same byte conversion, so the
tile_offset -> pointer_offset calculation must also match.
When kElementsPerAccess > 1, this caused load and store to reference
different memory locations for the same logical tile offset.
Fixes#3017
Signed-off-by: Blake Ledden <bledden@users.noreply.github.com>
the parameter workspace is marked as unused like other kernels, but it is actually used after 3.3.0, so the code which mark it as unused could be removed.
* fix print_layout printf format in device code
* Replace %.*s format specifier with explicit loop
* Remove unused delim variable
The printf format %.*s with dynamic width does not work correctly
in CUDA device code, causing literal %.*s to appear in output.
Fixes#2496
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
---------
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
* Adding blockscaled ragged contiguous grouped gemm for MoEs
* cleaning up the example
* introduction to example improved
---------
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
* Blackwell DistGEMM bug fixes
1. If using preferred cluster, there needs to be a branch so that
the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape in Blackwell,
and DistGEMM was previously using the per-device shape to evaluate
workspace size instead of the per-gemm shape.
3. Flattened size used to initialize host tensors can overflow (in
Hopper example as well)
4. Preferred and fallback cluster args need to be set explicitly,
otherwise if someone modifies the example to use preferred cluster,
it will just fail.
* Fix example runtimes
* Set default fallback cluster shapes to the static ones