20 Commits

Author SHA1 Message Date
Junkai-Wu
39b352fa93 v4.6 dev update. (#3315)
* v4.6 dev update.

* Remove CUTLASS_HOST_DEVICE from CudaHostAdapter::memsetDevice (#3286)

* [SM120] Add ptr-array TMA collective for tensor/token-scaled FP8 grouped GEMM (#3280)

* gemm: add SM120 array TMA collective for tensor/token-scaled FP8 grouped GEMM

Adds CollectiveMma and CollectiveBuilder specializations for
MainloopSm120ArrayTmaWarpSpecialized, enabling ptr-array grouped GEMM
(MoE expert dispatch) with tensor- and token-level FP8 scaling on
SM_120/SM_121 consumer Blackwell (RTX 5090/5080/5070, DGX Spark GB10).
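
For readers unfamiliar with ptr-array grouped GEMM, the following is a minimal host-side reference sketch of the indexing pattern only. It is not the collective added by this commit: `GroupProblem` and `grouped_gemm_reference` are hypothetical names, the element type is plain `float`, and FP8 scaling, TMA, and warp specialization are all left out. Each group has its own problem shape, and its operands are reached through one base pointer per group.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-group problem shape; not a CUTLASS type.
struct GroupProblem {
  int m, n, k;
};

// Host reference: for each group g, D[g] = A[g] * B[g].
// A[g] is row-major (m x k), B[g] is column-major (k x n), D[g] is row-major (m x n).
// ptr_A / ptr_B / ptr_D each hold one base pointer per group (the "pointer array").
inline void grouped_gemm_reference(const std::vector<GroupProblem>& problems,
                                   const float* const* ptr_A,
                                   const float* const* ptr_B,
                                   float* const* ptr_D) {
  for (std::size_t g = 0; g < problems.size(); ++g) {
    const GroupProblem& p = problems[g];
    const float* A = ptr_A[g];  // per-group indirection
    const float* B = ptr_B[g];
    float* D = ptr_D[g];
    assert(A != nullptr && B != nullptr && D != nullptr);
    for (int i = 0; i < p.m; ++i) {
      for (int j = 0; j < p.n; ++j) {
        float acc = 0.0f;
        for (int kk = 0; kk < p.k; ++kk) {
          // A(i, kk) in row-major storage, B(kk, j) in column-major storage.
          acc += A[i * p.k + kk] * B[j * p.k + kk];
        }
        D[i * p.n + j] = acc;
      }
    }
  }
}
```

In an MoE setting each group would correspond to one expert, with `m` varying by how many tokens were routed to it.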

New files:
- include/cutlass/gemm/collective/sm120_mma_array_tma.hpp
  CollectiveMma specialization for MainloopSm120ArrayTmaWarpSpecialized.
  Handles both Cooperative (4x2 atom layout) and Pingpong (2x2) schedules.
  Grouped GEMM via pointer-array indirection through params.ptr_A / ptr_B.
  Supports F8F6F4 MMA with TMA loads for both A and B operands.

- include/cutlass/gemm/collective/builders/sm120_array_mma_builder.inl
  CollectiveBuilder specialization for KernelPtrArrayTmaWarpSpecialized
  Cooperative/PingpongSm120<N> schedule tags. Computes tile/stage counts
  from smem capacity, routes to MainloopSm120ArrayTmaWarpSpecialized
  dispatch policy, produces correctly-typed CollectiveOp.
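
The builder entry above mentions deriving stage counts from smem capacity. A rough sketch of that kind of calculation follows; it is not the builder's actual code. The function names, the 101376-byte budget, and the carve-out value are assumptions chosen for illustration, and real kernels also reserve space for barriers and other per-stage state.

```cpp
#include <cstddef>

// Illustrative only: names and numbers here are assumptions, not CUTLASS symbols.
constexpr std::size_t kAssumedSmemCapacityBytes = 101376;  // assumed per-CTA budget

// Bytes one pipeline stage needs for an A tile (M x K) plus a B tile (N x K).
constexpr std::size_t stage_bytes(int tile_m, int tile_n, int tile_k,
                                  std::size_t bytes_per_element) {
  return static_cast<std::size_t>(tile_m + tile_n) *
         static_cast<std::size_t>(tile_k) * bytes_per_element;
}

// Number of stages that fit after reserving a carve-out (e.g. epilogue storage).
// Returns 0 for degenerate inputs instead of dividing by zero or underflowing.
constexpr int stage_count(std::size_t capacity, std::size_t carveout,
                          std::size_t per_stage) {
  return (per_stage == 0 || carveout > capacity)
             ? 0
             : static_cast<int>((capacity - carveout) / per_stage);
}
```

Under these assumptions, a 128x128x128 tile of 1-byte FP8 elements needs 32768 bytes per stage, so three stages fit with no carve-out and two fit once 8192 bytes are reserved.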

Modified files:
- collective_mma.hpp: include sm120_mma_array_tma.hpp
- collective_builder.hpp: include sm120_array_mma_builder.inl
- sm120_mma_builder.inl: remove ptr-array schedules from enable_if
  (they now route to sm120_array_mma_builder.inl) and drop the
  IsPtrArrayKernel static_assert that enforced the restriction

Validated on real SM_121 hardware (DGX Spark, 128 GB LPDDR5X) running
vLLM with RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE, 26B
total / 4B active). This model previously fell back to a non-CUTLASS Triton path;
with this patch, the SM120 CUTLASS grouped GEMM collective activates and
produces correct outputs. Short-sequence throughput improved ~7% vs the
fallback baseline (76.3 → 81.9 tok/s).

Closes #3263

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>

* test: add SM120 ptr-array grouped GEMM unit tests

Adds 6 device-level tests for the CollectiveMma/CollectiveBuilder
specializations introduced for MainloopSm120ArrayTmaWarpSpecialized,
covering both KernelPtrArrayTmaWarpSpecializedPingpongSm120<2> and
KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> schedule tags across
e4m3×e4m3 (symmetric), e4m3×e5m2 (mixed), float and bfloat16 outputs,
and two tile shapes.

Tests land in test/unit/gemm/device/sm120_tensorop_gemm/ under the new
cutlass_test_unit_sm120_grouped_gemm_device_tensorop CMake target, per
reviewer request in PR #3280.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

---------

Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Alex Georgiev <89279829+alexngUNC@users.noreply.github.com>
Co-authored-by: Tyler <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-06-15 23:23:20 -04:00
Junkai-Wu
5c54bee12b v4.5.2 update. (#3264) 2026-05-26 22:32:26 -04:00
Junkai-Wu
982cb9e718 v4.5.1 update. (#3237) 2026-05-18 22:35:08 -04:00
Haicheng Wu
ef120d0d09 update to 4.5 (#3228) 2026-05-12 14:44:22 +08:00
Junkai-Wu
cb37157db5 v4.5 tag update (#3202)
* Python DSL examples reorganization.

* v4.5 tag update.
2026-05-05 20:55:27 -04:00
Junkai-Wu
a221da7ccf v4.5 dev update. (#3153) 2026-04-07 12:16:05 -04:00
Junkai-Wu
1b741cabaa v4.4.2 update. (#3104) 2026-03-17 00:58:19 -04:00
Junkai-Wu
3bb6e28d3c v4.4.1 update (#3079) 2026-02-27 13:59:21 -05:00
Haicheng Wu
954503d44c Bump version to 4.4.0 2026-02-25 00:04:04 -05:00
Junkai-Wu
0d2b201e8c v4.3.5 update. (#2934)
* v4.3.5 update.

* Update copyright to 2026
2026-01-08 15:02:56 -05:00
Junkai-Wu
7f5fe3edf1 v4.3.4 update. (#2892) 2025-12-21 11:49:12 -05:00
Junkai-Wu
d3a5492381 v4.3.3 update. (#2868) 2025-12-11 00:26:58 -05:00
Junkai-Wu
bc680c7f67 v4.3.2 update. (#2839) 2025-12-04 10:14:32 -05:00
Haicheng Wu
5e847d37c4 Bump version from 4.2.1 to 4.3.1 2025-12-01 22:13:19 -05:00
Junkai-Wu
1de3a576cc v4.3.1 update. (#2817) 2025-11-27 09:49:30 -05:00
Haicheng Wu
7967ce5f83 Bump version to 4.3.0 2025-11-24 16:34:45 -05:00
Junkai-Wu
8cd5bef43a v4.3 tag release update. (#2789) 2025-11-20 20:49:44 -05:00
Junkai-Wu
b1d6e2c9b3 v4.3 update. (#2709)
* v4.3 update.

* Update the cute_dsl_api changelog's doc link

* Update version to 4.3.0

* Update the example link

* Update doc to encourage user to install DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
Haicheng Wu
f874df19ac 4.2.1 update 2025-09-23 13:45:13 -07:00
Jack Kosaian
b234a8c024 Rename python/cutlass to python/cutlass_cppgen (#2652) 2025-09-18 14:26:57 -04:00