383 Commits

Author SHA1 Message Date
Brayden Zhong
e8ecfad75b add tileN = 8,16 for SM120 blockscale GEMM. (#3292)
* add tileN = 8,16

* add a guard

---------

Co-authored-by: Brayden Zhong <brayden.zhong@radixark.ai>
2026-06-26 18:16:02 -04:00
Junkai-Wu
39b352fa93 v4.6 dev update. (#3315)
* v4.6 dev update.

* Remove CUTLASS_HOST_DEVICE from CudaHostAdapater::memsetDevice (#3286)

* [SM120] Add ptr-array TMA collective for tensor/token-scaled FP8 grouped GEMM (#3280)

* gemm: add SM120 array TMA collective for tensor/token-scaled FP8 grouped GEMM

Adds CollectiveMma and CollectiveBuilder specializations for
MainloopSm120ArrayTmaWarpSpecialized, enabling ptr-array grouped GEMM
(MoE expert dispatch) with tensor- and token-level FP8 scaling on
SM_120/SM_121 consumer Blackwell (RTX 5090/5080/5070, DGX Spark GB10).

New files:
- include/cutlass/gemm/collective/sm120_mma_array_tma.hpp
  CollectiveMma specialization for MainloopSm120ArrayTmaWarpSpecialized.
  Handles both Cooperative (4x2 atom layout) and Pingpong (2x2) schedules.
  Grouped GEMM via pointer-array indirection through params.ptr_A / ptr_B.
  Supports F8F6F4 MMA with TMA loads for both A and B operands.

- include/cutlass/gemm/collective/builders/sm120_array_mma_builder.inl
  CollectiveBuilder specialization for KernelPtrArrayTmaWarpSpecialized
  Cooperative/PingpongSm120<N> schedule tags. Computes tile/stage counts
  from smem capacity, routes to MainloopSm120ArrayTmaWarpSpecialized
  dispatch policy, produces correctly-typed CollectiveOp.

Modified files:
- collective_mma.hpp: include sm120_mma_array_tma.hpp
- collective_builder.hpp: include sm120_array_mma_builder.inl
- sm120_mma_builder.inl: remove ptr-array schedules from enable_if
  (they now route to sm120_array_mma_builder.inl) and drop the
  IsPtrArrayKernel static_assert that enforced the restriction

Validated on real SM_121 hardware (DGX Spark, 128 GB LPDDR5X) running
vLLM with RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE, 26B
total / 4B active). Previously fell back to a non-CUTLASS Triton path;
with this patch, the SM120 CUTLASS grouped GEMM collective activates and
produces correct outputs. Short-sequence throughput improved ~7% vs the
fallback baseline (76.3 → 81.9 tok/s).

Closes #3263

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>

* test: add SM120 ptr-array grouped GEMM unit tests

Adds 6 device-level tests for the CollectiveMma/CollectiveBuilder
specializations introduced for MainloopSm120ArrayTmaWarpSpecialized,
covering both KernelPtrArrayTmaWarpSpecializedPingpongSm120<2> and
KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> schedule tags across
e4m3×e4m3 (symmetric), e4m3×e5m2 (mixed), float and bfloat16 outputs,
and two tile shapes.

Tests land in test/unit/gemm/device/sm120_tensorop_gemm/ under the new
cutlass_test_unit_sm120_grouped_gemm_device_tensorop CMake target, per
reviewer request in PR #3280.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

---------

Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Alex Georgiev <89279829+alexngUNC@users.noreply.github.com>
Co-authored-by: Tyler <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-06-15 23:23:20 -04:00
Tyler
0ce648f53f [SM120] Add ptr-array TMA collective for tensor/token-scaled FP8 grouped GEMM (#3280)
* gemm: add SM120 array TMA collective for tensor/token-scaled FP8 grouped GEMM

Adds CollectiveMma and CollectiveBuilder specializations for
MainloopSm120ArrayTmaWarpSpecialized, enabling ptr-array grouped GEMM
(MoE expert dispatch) with tensor- and token-level FP8 scaling on
SM_120/SM_121 consumer Blackwell (RTX 5090/5080/5070, DGX Spark GB10).

New files:
- include/cutlass/gemm/collective/sm120_mma_array_tma.hpp
  CollectiveMma specialization for MainloopSm120ArrayTmaWarpSpecialized.
  Handles both Cooperative (4x2 atom layout) and Pingpong (2x2) schedules.
  Grouped GEMM via pointer-array indirection through params.ptr_A / ptr_B.
  Supports F8F6F4 MMA with TMA loads for both A and B operands.

- include/cutlass/gemm/collective/builders/sm120_array_mma_builder.inl
  CollectiveBuilder specialization for KernelPtrArrayTmaWarpSpecialized
  Cooperative/PingpongSm120<N> schedule tags. Computes tile/stage counts
  from smem capacity, routes to MainloopSm120ArrayTmaWarpSpecialized
  dispatch policy, produces correctly-typed CollectiveOp.

Modified files:
- collective_mma.hpp: include sm120_mma_array_tma.hpp
- collective_builder.hpp: include sm120_array_mma_builder.inl
- sm120_mma_builder.inl: remove ptr-array schedules from enable_if
  (they now route to sm120_array_mma_builder.inl) and drop the
  IsPtrArrayKernel static_assert that enforced the restriction

Validated on real SM_121 hardware (DGX Spark, 128 GB LPDDR5X) running
vLLM with RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE, 26B
total / 4B active). Previously fell back to a non-CUTLASS Triton path;
with this patch, the SM120 CUTLASS grouped GEMM collective activates and
produces correct outputs. Short-sequence throughput improved ~7% vs the
fallback baseline (76.3 → 81.9 tok/s).

Closes #3263

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>

* test: add SM120 ptr-array grouped GEMM unit tests

Adds 6 device-level tests for the CollectiveMma/CollectiveBuilder
specializations introduced for MainloopSm120ArrayTmaWarpSpecialized,
covering both KernelPtrArrayTmaWarpSpecializedPingpongSm120<2> and
KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> schedule tags across
e4m3×e4m3 (symmetric), e4m3×e5m2 (mixed), float and bfloat16 outputs,
and two tile shapes.

Tests land in test/unit/gemm/device/sm120_tensorop_gemm/ under the new
cutlass_test_unit_sm120_grouped_gemm_device_tensorop CMake target, per
reviewer request in PR #3280.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-06-13 17:10:47 -04:00
Alex Georgiev
93774d3da5 Remove CUTLASS_HOST_DEVICE from CudaHostAdapater::memsetDevice (#3286) 2026-06-11 12:33:47 -04:00
ANIKET SHIVAM
1fc71b3ed1 Update sm100 MMA desc offsetting (#3299)
Fix perf regression
2026-06-08 22:12:36 -04:00
Junkai-Wu
5c54bee12b v4.5.2 update. (#3264) 2026-05-26 22:32:26 -04:00
Junkai-Wu
982cb9e718 v4.5.1 update. (#3237) 2026-05-18 22:35:08 -04:00
Haicheng Wu
ef120d0d09 update to 4.5 (#3228) 2026-05-12 14:44:22 +08:00
Emre Albayrak
2e56847d72 Add Snake activation functor for EVT (#3184)
Introduces cutlass::epilogue::thread::Snake, a two-operand activation
functor implementing Snake_a(x) = x + (1/a) * sin^2(a*x) from
Ziyin et al. 2020 (arXiv:2006.08195). The per-channel learnable
frequency `a` flows through an EVT child (e.g. Sm90RowBroadcast),
composing into Sm90EVT<Sm90Compute<Snake, ...>, x_node, alpha_node>
for fused GEMM+Snake epilogues used in neural vocoders.

Adds unit tests in test/unit/epilogue/thread/activation.cu covering
f32 and bf16 paths, validated against float64 reference goldens.

Closes #3141
2026-05-11 22:09:53 -04:00
Junkai-Wu
cb37157db5 v4.5 tag update (#3202)
* Python DSL examples reorganization.

* v4.5 tag update.
2026-05-05 20:55:27 -04:00
Blake Ledden
7a9fe055cb fix: Add missing kElementsPerAccess division in RegularTileIterator store (#3049)
The store(frag, tile_offset) method was computing the pointer offset
without dividing by kElementsPerAccess, while the matching load(frag,
tile_offset) method does include this division. Both load_with_pointer_offset
and store_with_pointer_offset apply the same byte conversion, so the
tile_offset -> pointer_offset calculation must also match.

When kElementsPerAccess > 1, this caused load and store to reference
different memory locations for the same logical tile offset.

Fixes #3017

Signed-off-by: Blake Ledden <bledden@users.noreply.github.com>
2026-04-24 23:27:40 -04:00
Vrushtee
9135a9bb6d Replace std::min with cute::min in sm120 blockwise scaling device functions (#3055) 2026-04-24 11:13:38 -04:00
dePaul Miller
b46b16d003 Small Tile N BlockScaled GEMM + Grouped GEMM (#3176)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-04-21 12:32:40 -04:00
zfm
aeba0d3723 correct BLayout stride in SM80 m16n8k32 int4 MMA traits (#3140) 2026-04-21 17:17:03 +08:00
Junkai-Wu
a221da7ccf v4.5 dev update. (#3153) 2026-04-07 12:16:05 -04:00
Junkai-Wu
1b741cabaa v4.4.2 update. (#3104) 2026-03-17 00:58:19 -04:00
dePaul Miller
73c59c055c Support for Group GEMM in CUTLASS Profiler for Geforce and Spark (#3092)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-03-06 20:36:29 -05:00
Junkai-Wu
3bb6e28d3c v4.4.1 update (#3079) 2026-02-27 13:59:21 -05:00
Neil Kichler
edf2f82c00 Fix register index bug in mma.sync.aligned.m16n8k16 (#2740) 2026-02-27 16:24:18 +08:00
Haicheng Wu
2aedca6f5e Bump CUTLASS version to 4.4.0 2026-02-25 00:01:56 -05:00
Junkai-Wu
d4bbf728ca v4.4 tag release update. (#3032) 2026-02-13 23:27:58 -05:00
Junkai-Wu
6b3e607b85 v4.4 release update v2. (#2999) 2026-02-03 20:48:31 -05:00
Junkai-Wu
9fba3195f9 v4.4 update. (#2979) 2026-01-24 11:46:17 -05:00
Qi Yuhang
2fafefb7b9 [Bug Fix]Set NumSplitsM to 1 when TileShapeM < 128 in sm90 fp8 blockwise scaling CollectiveMma (#2965)
* Fix NumSplitsM when TileShapeM < 128.

* Use cute::conditional_t to replace std::conditional_t.
2026-01-23 15:56:52 +08:00
kf-zhang
0deda34b9f fix typo (#2884) 2026-01-09 00:57:06 -05:00
Junkai-Wu
0d2b201e8c v4.3.5 update. (#2934)
* v4.3.5 update.

* Update copyright to 2026
2026-01-08 15:02:56 -05:00
veritas-Qiu
61b560983a remove useless line (#2926)
the parameter workspace is marked as unused like other kernels, but it is actually used after 3.3.0, so the code which mark it as unused could be removed.
2026-01-06 23:54:08 -05:00
dePaul Miller
7127592069 Replace CUDA driver API with runtime API (#2928)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-01-05 13:50:44 -05:00
tsu-bin
3d9de19bb7 add constexpr specifier to make_tiled_copy (#2875) 2026-01-03 15:39:43 -05:00
Junkai-Wu
b7ecaa605d v4.3.4 update v2. (#2898) 2025-12-22 22:28:26 -05:00
Junkai-Wu
7f5fe3edf1 v4.3.4 update. (#2892) 2025-12-21 11:49:12 -05:00
Qi Yuhang
ebf3165efb [Bug Fix]Bypass launch grids for SM120 Kernel with SM90 Mainloop & SM100 TileScheduler (#2865)
* Delete unused #ifdef/#endif. Bypass sm120 case.

* Add todo.

* Fix pingpong.

* Revert "Add todo."

This reverts commit 246cb42091.

* Refine name.

Refine name again.

* Apply suggestions from code review

Skip `is_last_tile` for all sm120 kernels.

Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>

* Skip early stop for sm120 kernel.

* Fix typo.

---------

Co-authored-by: Junkai-Wu <junkaiw@nvidia.com>
2025-12-18 08:51:38 +08:00
Junkai-Wu
d3a5492381 v4.3.3 update. (#2868) 2025-12-11 00:26:58 -05:00
Amin Sedaghat
49bd6bf1ba fix print_layout printf format in device code (#2688)
* fix print_layout printf format in device code

* Replace %.*s format specifier with explicit loop
* Remove unused delim variable

The printf format %.*s with dynamic width does not work correctly
in CUDA device code, causing literal %.*s to appear in output.

Fixes #2496

* Update include/cute/util/print_tensor.hpp

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>

* Update include/cute/util/print_tensor.hpp

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>

---------

Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
2025-12-10 08:57:56 +08:00
HydraQYH
95f8beb44c Revert "Remove unnecessary #ifdef #endif for general gemm."
This reverts commit 17ffd56dfe.
2025-12-09 11:52:23 +08:00
HydraQYH
17ffd56dfe Remove unnecessary #ifdef #endif for general gemm. 2025-12-06 10:20:35 +08:00
HydraQYH
ff7f2dcdfb Remove duplicated cutlass::arch::wait_on_dependent_grids(); 2025-12-06 10:20:35 +08:00
HydraQYH
929e1e0259 Remove unnecessary #ifdef / #endif for launch_dependent_grids. 2025-12-06 10:20:35 +08:00
HydraQYH
b6ad6db219 Delete unnecessary #ifdef / #endif. 2025-12-06 10:20:35 +08:00
HydraQYH
e1b2ec57e3 Hoist waits above the warp specialized region. 2025-12-06 10:20:35 +08:00
HydraQYH
1e5f95cbbe Support PDL in sm90_gemm_array_tma_warpspecialized_cooperative 2025-12-06 10:20:35 +08:00
HydraQYH
acf5990cc2 Refine position for wait_on_dependent_grids. 2025-12-06 10:20:35 +08:00
HydraQYH
91de7891a5 Support PDL in sm90_gemm_array_tma_warpspecialized_pingpong.hpp 2025-12-06 10:20:35 +08:00
Junkai-Wu
bc680c7f67 v4.3.2 update. (#2839) 2025-12-04 10:14:32 -05:00
Haicheng Wu
f11375bf91 Bump CUTLASS patch version to 1 2025-12-01 22:08:52 -05:00
Shreya Gaur
af8d5dfa54 bug fix for example 92 (#2830)
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
Co-authored-by: Shreya Gaur <shgaur@2u2g-spr-0015.ipp4a1.colossus.nvidia.com>
2025-12-01 22:02:59 -05:00
Junkai-Wu
1de3a576cc v4.3.1 update. (#2817) 2025-11-27 09:49:30 -05:00
Shreya Gaur
2052fd3885 Blockscaled Ragged Contiguous Grouped Gemm for MoEs (#2790)
* Adding blockscaled ragged contiguous grouped gemm for MoEs

* cleaning up the example

* introduction to example improved

---------

Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
2025-11-26 20:16:49 -05:00
Junkai-Wu
8cd5bef43a v4.3 tag release update. (#2789) 2025-11-20 20:49:44 -05:00
Ali Hassani
d1ef0e87f2 DistGEMM bug fixes (#2713)
* Blackwell DistGEMM bug fixes

1. If using preferred cluster, there needs to be a branch so that
   the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape in Blackwell,
   and DistGEMM was previously using the per-device shape to evaluate
   workspace size instead of the per-gemm shape.
3. Flattened size used to initialize host tensors can overflow (in
   Hopper example as well)
4. Preferred and fallback cluster args need to be set explicitly,
   otherwise if someone modifies the example to use preferred cluster,
   it will just fail.

* Fix example runtimes

* Set default fallback cluster shapes to the static ones
2025-11-06 13:31:24 -05:00