233 Commits

Author SHA1 Message Date
myu-guo
d4b4b494c3 [CLI] Recover ssd and blockwise group gemm perf (#3344)
* cggemm

* cggemm

* mggemm

* quick fix for 13.3

* fix ssd

* typo

* update
2026-06-24 08:42:36 +08:00
drazi
c88b280fbf add fp4_x2 example (#3043)
* add fp4_x2 example

* update docstring

* improve comments
2026-06-23 17:56:23 +08:00
Junkai-Wu
8f50b052e1 Fix license. (#3328) 2026-06-22 22:07:29 -04:00
minas-nv
cf064d2e6b Update tensorop_gemm.py (#3322)
* Update tensorop_gemm.py

Add auto-transpose option for m-major C

* Update tensorop_gemm.py

Fix broken name
2026-06-16 11:33:00 -04:00
Junkai-Wu
39b352fa93 v4.6 dev update. (#3315)
* v4.6 dev update.

* Remove CUTLASS_HOST_DEVICE from CudaHostAdapter::memsetDevice (#3286)

* [SM120] Add ptr-array TMA collective for tensor/token-scaled FP8 grouped GEMM (#3280)

* gemm: add SM120 array TMA collective for tensor/token-scaled FP8 grouped GEMM

Adds CollectiveMma and CollectiveBuilder specializations for
MainloopSm120ArrayTmaWarpSpecialized, enabling ptr-array grouped GEMM
(MoE expert dispatch) with tensor- and token-level FP8 scaling on
SM_120/SM_121 consumer Blackwell (RTX 5090/5080/5070, DGX Spark GB10).

New files:
- include/cutlass/gemm/collective/sm120_mma_array_tma.hpp
  CollectiveMma specialization for MainloopSm120ArrayTmaWarpSpecialized.
  Handles both Cooperative (4x2 atom layout) and Pingpong (2x2) schedules.
  Grouped GEMM via pointer-array indirection through params.ptr_A / ptr_B.
  Supports F8F6F4 MMA with TMA loads for both A and B operands.

- include/cutlass/gemm/collective/builders/sm120_array_mma_builder.inl
  CollectiveBuilder specialization for KernelPtrArrayTmaWarpSpecialized
  Cooperative/PingpongSm120<N> schedule tags. Computes tile/stage counts
  from smem capacity, routes to MainloopSm120ArrayTmaWarpSpecialized
  dispatch policy, produces correctly-typed CollectiveOp.

Modified files:
- collective_mma.hpp: include sm120_mma_array_tma.hpp
- collective_builder.hpp: include sm120_array_mma_builder.inl
- sm120_mma_builder.inl: remove ptr-array schedules from enable_if
  (they now route to sm120_array_mma_builder.inl) and drop the
  IsPtrArrayKernel static_assert that enforced the restriction

Validated on real SM_121 hardware (DGX Spark, 128 GB LPDDR5X) running
vLLM with RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE, 26B
total / 4B active). Previously fell back to a non-CUTLASS Triton path;
with this patch, the SM120 CUTLASS grouped GEMM collective activates and
produces correct outputs. Short-sequence throughput improved ~7% vs the
fallback baseline (76.3 → 81.9 tok/s).

Closes #3263

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>

* test: add SM120 ptr-array grouped GEMM unit tests

Adds 6 device-level tests for the CollectiveMma/CollectiveBuilder
specializations introduced for MainloopSm120ArrayTmaWarpSpecialized,
covering both KernelPtrArrayTmaWarpSpecializedPingpongSm120<2> and
KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> schedule tags across
e4m3×e4m3 (symmetric), e4m3×e5m2 (mixed), float and bfloat16 outputs,
and two tile shapes.

Tests land in test/unit/gemm/device/sm120_tensorop_gemm/ under the new
cutlass_test_unit_sm120_grouped_gemm_device_tensorop CMake target, per
reviewer request in PR #3280.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

---------

Signed-off-by: Tyler Merritt <tgmerritt@gmail.com>
Co-authored-by: Alex Georgiev <89279829+alexngUNC@users.noreply.github.com>
Co-authored-by: Tyler <tgmerritt@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2026-06-15 23:23:20 -04:00
brandonsun
d80a4e53b5 fix validation codes (#3303) 2026-06-05 20:16:02 +08:00
Linfeng Zheng
2599f2975b [CLI] quick fix for fmha compile options (#3295) 2026-06-03 17:56:17 +08:00
Anakin(Yancheng) Zheng
0e9ac0734c Fix example with upcoming release (#3293)
* Fix example new releases

* Remove return
2026-06-03 13:40:27 +08:00
xiufanl
0bdd5cf8fb fix example issue (#3294) 2026-06-03 11:02:00 +08:00
Linfeng Zheng
423904d717 [CLI] Recover fmha perf (#3291)
* [CLI] Recover fmha perf

* [CLI] enable options for a certain version
2026-06-03 08:55:57 +08:00
brandonsun
25e252bdce replace deprecated apis (#3285) 2026-06-01 08:58:58 +08:00
bangyu shen
9c1d0965f8 Add Blackwell GeForce blockscaled GEMM examples (#3272)
Co-authored-by: bangyus <bangyus@nvidia.com>
2026-05-27 16:06:52 -04:00
Junkai-Wu
5c54bee12b v4.5.2 update. (#3264) 2026-05-26 22:32:26 -04:00
Caleb_Du
60b9659133 [CLI] add support for sm100 blockscaled gemm (#3274)
Co-authored-by: Caleb Du <cadu@nvidia.com>
2026-05-27 09:33:26 +08:00
Longsheng Du
5f06f5fc1a fix elect_sync api (#3262) 2026-05-26 08:50:00 +08:00
Linfeng Zheng
e45ccb1226 [CLI] Update FMHA & improve perf (#3251) 2026-05-25 15:56:08 +08:00
dePaul Miller
546c3efa89 Fix examples and pytest, run ruff (#3230)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-05-21 11:05:38 +08:00
Junkai-Wu
982cb9e718 v4.5.1 update. (#3237) 2026-05-18 22:35:08 -04:00
Observer007
971d1ed8b7 fix for thor (#3224) 2026-05-13 09:06:44 +08:00
questa-quan-wang
ae6bccf341 [CuTeDSL] Update atomic_max_float32 to atomic_fmax in blockscaled GEMM example (#3206)
The internal DSL package refactored atomic_max_float32 to atomic_fmax,
which properly handles negative floats via sign-bit-aware integer
atomics. Update the example to use the new API so it works with
current DSL wheels.
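
As a rough illustration of what "sign-bit-aware integer atomics" can mean, the sketch below emulates one common way to build a float max out of integer comparisons: a signed integer max when the incoming value is non-negative, and an unsigned integer min when it is negative. This split is an assumption about the general technique, not a statement of how atomic_fmax is implemented in the DSL. All helper names are hypothetical, no real atomics are involved, and NaN handling is ignored.

```python
import struct


def float_to_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned 32-bit int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b: int) -> float:
    """Reinterpret an unsigned 32-bit pattern as a binary32 float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def to_signed(b: int) -> int:
    """Reinterpret an unsigned 32-bit pattern as a signed 32-bit int."""
    return b - (1 << 32) if b & 0x80000000 else b


def naive_signed_max(cell_bits: int, value: float) -> int:
    """Naive approach: signed integer max on the raw bits (wrong for two negatives)."""
    return max(to_signed(cell_bits), to_signed(float_to_bits(value))) & 0xFFFFFFFF


def sign_aware_fmax(cell_bits: int, value: float) -> int:
    """Emulate one sign-bit-aware max update on a 32-bit memory cell."""
    value_bits = float_to_bits(value)
    if value_bits & 0x80000000 == 0:
        # Non-negative value: signed integer order matches float order here,
        # and any negative cell content compares lower as a signed int.
        return max(to_signed(cell_bits), to_signed(value_bits)) & 0xFFFFFFFF
    # Negative value: among negatives, the smaller unsigned pattern is the
    # larger float, and any non-negative cell content is smaller unsigned.
    return min(cell_bits, value_bits)


if __name__ == "__main__":
    cell = float_to_bits(-5.0)
    for v in (-2.0, -3.0, -7.0):
        cell = sign_aware_fmax(cell, v)
    print(bits_to_float(cell))
    print(bits_to_float(naive_signed_max(float_to_bits(-1.0), -2.0)))
```

The naive variant is included only to show why a plain signed-integer max on float bit patterns breaks once both operands are negative.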

Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>
2026-05-07 15:03:37 +08:00
Junkai-Wu
cb37157db5 v4.5 tag update (#3202)
* Python DSL examples reorganization.

* v4.5 tag update.
2026-05-05 20:55:27 -04:00
Johnsonms
f74fea9ce3 [Hopper CuTeDSL] Add FP8 GEMM with 2xAcc (#3149)
Add dense_gemm_fp8_2xacc.py — a CuTeDSL port of CUTLASS Example 54
(54_hopper_fp8_warp_specialized_gemm.cu) for NVIDIA Hopper (SM90).

Implements D = scale_a * scale_b * (A @ B) where A/B are FP8 E4M3FN using
the 2xAcc (double accumulation) technique: a temporary accumulator is
periodically promoted into the main accumulator every mma_promotion_interval
MMA instructions to prevent FP8 precision loss.
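
The control flow of the promotion scheme described above can be sketched in plain Python for a single output element. This is a minimal sketch under stated assumptions: the function name dot_2xacc and the block layout are hypothetical, each inner block stands in for one MMA instruction, and Python floats do not model the reduced-precision hardware accumulator, so the example shows only when promotion happens, not the accuracy benefit.

```python
def dot_2xacc(a_blocks, b_blocks, mma_promotion_interval, scale_a=1.0, scale_b=1.0):
    """Accumulate per-block partial dot products with periodic promotion.

    a_blocks / b_blocks: lists of equal-length number lists, one per "MMA step".
    """
    if mma_promotion_interval < 1:
        raise ValueError("mma_promotion_interval must be >= 1")
    if len(a_blocks) != len(b_blocks):
        raise ValueError("a_blocks and b_blocks must have the same length")

    main_acc = 0.0  # long-lived accumulator
    temp_acc = 0.0  # short-lived accumulator, reset after each promotion

    for step, (a_blk, b_blk) in enumerate(zip(a_blocks, b_blocks), start=1):
        # One simulated MMA instruction: accumulate into the temporary.
        temp_acc += sum(x * y for x, y in zip(a_blk, b_blk))
        if step % mma_promotion_interval == 0:
            # Promote the temporary into the main accumulator and reset it.
            main_acc += temp_acc
            temp_acc = 0.0

    # Flush whatever is left when the step count is not a multiple of the interval.
    main_acc += temp_acc

    # Scalar epilogue scaling, as in D = scale_a * scale_b * (A @ B).
    return scale_a * scale_b * main_acc


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6]]
    b = [[1, 1], [1, 1], [1, 1]]
    print(dot_2xacc(a, b, mma_promotion_interval=2))
```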

Features:
- FP8 E4M3FN inputs with Float32 accumulation
- 2xAcc for improved numerical accuracy
- TMA with multicast for A/B/D transfers
- WGMMA warp-specialized persistent tile scheduling
- Configurable output dtype: Float16, Float32, Float8E4M3FN
- Scalar scale_a / scale_b epilogue factors
- Cluster shapes up to 2x2

Add pytest test suite covering:
- L0 compile tests: all tile shapes, cluster shapes, output dtypes,
  mma_promotion_interval values
- L1 correctness tests: numerical validation vs torch.einsum reference
  for all configs, non-trivial scale factors, and batched GEMM (L>1)
- Benchmark tests (pytest -m bench -s): representative problem sizes
  with warmup, cold-L2, and TFLOPS reporting

Also fix conftest.py to import cutlass before adding examples/python/CuTeDSL
to sys.path, preventing the jax/ examples subdirectory from being detected
as a namespace package and breaking cutlass's JAX availability check.
2026-04-25 16:10:33 -04:00
Longsheng Du
08185b9c3e Update blackwell tutorial to be compatible with 4.5-dev version (#3130)
* Update blackwell tutorial to be compatible with 4.5-dev version

* update example for reverted changes

* add more example fix
2026-04-09 14:40:33 +08:00
Junkai-Wu
a221da7ccf v4.5 dev update. (#3153) 2026-04-07 12:16:05 -04:00
Katja Sirazitdinova
418d38a5de PR update (#3103) 2026-04-02 18:00:41 +08:00
drazi
4ca61d0662 [CuTeDSL] Add dataclass example: passing pointers via frozen dataclass (#3070)
* Add dataclass example: passing pointers via frozen dataclass

Demonstrates passing pointers from tensor arguments in @cute.jit to
@cute.kernel using @dataclass(frozen=True). Shows the pattern of
extracting pointers with tensor.iterator, bundling into a dataclass,
and reconstructing tensors in the kernel.

Uses fake tensors for compilation and TVM-FFI for runtime dispatch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add dataclass example: passing tensors via frozen dataclass

Demonstrates passing tensors from @cute.jit to @cute.kernel using
@dataclass(frozen=True). Shows the pattern of bundling tensors into
a dataclass with static configuration.

Uses fake tensors for compilation and TVM-FFI for runtime dispatch.
Includes reference check against PyTorch implementation.
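
The bundling pattern can be illustrated with a plain-Python analogy that does not import cutlass. In this sketch, ordinary lists stand in for tensors, a loop stands in for the kernel launch, and the names KernelArgs, kernel, and launch are hypothetical. It only shows how a frozen dataclass carries buffers together with static configuration, not the CuTe DSL API itself.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class KernelArgs:
    """Immutable bundle of buffers plus static configuration."""
    a: list
    b: list
    out: list
    n: int  # static configuration: number of elements to process


def kernel(args: KernelArgs, idx: int) -> None:
    # The bundle itself is frozen, but the buffers it refers to stay writable.
    args.out[idx] = args.a[idx] + args.b[idx]


def launch(a: list, b: list) -> list:
    """Host-side entry point: build the bundle once, then run every 'thread'."""
    out = [0] * len(a)
    args = KernelArgs(a=a, b=b, out=out, n=len(a))
    for idx in range(args.n):
        kernel(args, idx)
    return out


if __name__ == "__main__":
    print(launch([1, 2, 3], [10, 20, 30]))
    frozen = KernelArgs(a=[1], b=[2], out=[0], n=1)
    try:
        frozen.n = 2  # rebinding a field on a frozen dataclass is rejected
    except FrozenInstanceError:
        print("fields cannot be reassigned")
```

Freezing the dataclass prevents fields from being rebound after construction, which keeps the bundle's configuration fixed for the lifetime of the launch.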

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-03-30 15:08:36 +08:00
Zheng Linfeng
ecb32fe231 [CLI] Fix tutorial issues 2026-03-24 00:12:01 -07:00
Johnsonms
982748aa73 [Hopper CuTeDSL] Add grouped GEMM persistent kernel and tests (#3091)
Implement grouped GEMM (C_g = A_g x B_g for g groups) on Hopper using
CuTe DSL, extending the dense persistent GEMM with per-group TMA
descriptor management.
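
For reference, the grouped GEMM semantics (one independent product per group, with shapes allowed to differ between groups) can be written as a small pure-Python check. This is a hedged reference sketch, not the kernel: the names matmul and grouped_gemm_reference are hypothetical, and nothing here involves TMA descriptors or tile scheduling.

```python
def matmul(a, b):
    """Multiply an M x K matrix by a K x N matrix, both given as lists of rows."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def grouped_gemm_reference(a_groups, b_groups):
    """Compute C_g = A_g x B_g independently for every group g."""
    if len(a_groups) != len(b_groups):
        raise ValueError("A and B must have the same number of groups")
    return [matmul(a, b) for a, b in zip(a_groups, b_groups)]


if __name__ == "__main__":
    a_groups = [
        [[1, 2]],            # group 0: 1 x 2
        [[1, 0], [0, 1]],    # group 1: 2 x 2 identity
    ]
    b_groups = [
        [[3], [4]],          # group 0: 2 x 1
        [[5, 6], [7, 8]],    # group 1: 2 x 2
    ]
    print(grouped_gemm_reference(a_groups, b_groups))
```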

Kernel design (grouped_gemm.py):
- Warp-specialized pipeline: DMA warp group handles TMA loads and
  per-group tensormap updates; MMA warp group runs WGMMA and stores C
- StaticPersistentGroupTileScheduler for cross-group tile scheduling
- Per-group TMA descriptor updates via GMEM or SMEM mode
- Supports fp16, fp8 (E4M3FN/E5M2), int8 with mixed A/B dtypes
- Configurable tile shapes (128x128, 128x256) and cluster shapes
- Fix base TensorMapManager: hoist uniform_smem_ptrs outside predicated
  block to avoid illegal @P0 R2UR on sm_90a

Tests (test/examples/CuTeDSL/hopper/test_grouped_gemm.py):
- L0 compile and L1 correctness pytest suite covering tile shapes,
  dtypes, major modes, cluster shapes, group counts, and mixed sizes
- Move to test/examples/CuTeDSL/hopper/ following sm_100a convention
- Fix deprecated startdir arg in test_sharding.py pytest hook
2026-03-18 00:40:15 -04:00
Junkai-Wu
1b741cabaa v4.4.2 update. (#3104) 2026-03-17 00:58:19 -04:00
Linfeng Zheng
772fbb264e [CLI] add cutedsl fp16 gemm tutorial from 2 to 6 (#3106)
* [CLI] add fp16 gemm tutorial from 2 to 6

* [CLI] refine comments
2026-03-17 10:11:55 +08:00
Blake Ledden
087c84df83 docs: Fix float16 documentation in elementwise_add notebook (#2949) (#3047)
The notebook uses float16 tensors but the vectorized kernel documentation
incorrectly describes elements as 32-bit and uses 4-element vectorization.
Updated to correctly state 16-bit elements with 8-element vectorization
for proper 128-bit loads/stores.
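
The arithmetic behind the correction is simply the vector width divided by the element width. A minimal sketch, with elements_per_vector as a hypothetical helper name:

```python
def elements_per_vector(element_bits: int, vector_bits: int = 128) -> int:
    """Number of elements that fit in one vectorized load/store of vector_bits."""
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("vector width must be a positive multiple of the element width")
    return vector_bits // element_bits


if __name__ == "__main__":
    print(elements_per_vector(16))  # float16 elements per 128-bit access
    print(elements_per_vector(32))  # float32 elements per 128-bit access
```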

Signed-off-by: Blake Ledden <bledden@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 10:29:46 +08:00
dePaul Miller
73c59c055c Support for Group GEMM in CUTLASS Profiler for GeForce and Spark (#3092)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-03-06 20:36:29 -05:00
Junkai-Wu
3bb6e28d3c v4.4.1 update (#3079) 2026-02-27 13:59:21 -05:00
Tianqi Zhang (张天启)
c651d660d2 fix typo (#3012) 2026-02-27 16:25:35 +08:00
mnehete32
79345359a7 Fix debug typo in sgemm_2.cu and sgemm_sm70.cu (#2678) 2026-02-27 16:23:59 +08:00
Junkai-Wu
057635de5c Remove redundant dsl example. (#3074) 2026-02-26 08:10:59 -05:00
Junkai-Wu
c213bfdfc1 Remove redundant dsl examples. (#3071) 2026-02-25 22:42:01 -05:00
Linfeng Zheng
3476ddb7bd remove mixed_input_fmha_prefill (#3041) 2026-02-18 07:59:01 -05:00
Yihan Chen
291300ffff [CuTeDSL] implement a CTA-level norm example (both layernorm and rmsnorm) (#3009)
* kernel impl

* add copyright
2026-02-14 17:54:03 +08:00
aragorn-guan
f9a5f76b7a Replace fence proxy with the latest routine in examples/distributed/all_reduce_tma.py (#3027) 2026-02-14 17:51:20 +08:00
Junkai-Wu
d4bbf728ca v4.4 tag release update. (#3032) 2026-02-13 23:27:58 -05:00
aragorn-guan
8dbce01473 [CuTeDSL] Distributed example: use TMA loads to access remote memory rank by rank, reduce within the CTA, and broadcast the result to all ranks via multimem TMA store (#2970) 2026-02-11 11:54:00 +08:00
drazi
71aa7a0abc Merge pull request #2919 from pbelevich/patch-1
Refactor binary_op functions to remove unused result parameter
2026-02-11 11:48:58 +08:00
Junkai-Wu
6b3e607b85 v4.4 release update v2. (#2999) 2026-02-03 20:48:31 -05:00
Hua Huang
1cfbb53a23 [CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator (#2995)
* Fix: SM100 block-scale gemm overlapping accumulator

Signed-off-by: Hua Huang <huah@nvidia.com>

* Also include threads_per_warp fix

Signed-off-by: Hua Huang <huah@nvidia.com>

---------

Signed-off-by: Hua Huang <huah@nvidia.com>
2026-02-03 11:01:41 +08:00
dongxiao
a4eb0e05f6 fix performance issues in cute-dsl examples for 4.4-ctk13.1 release (#2988)
* fix grouped gemm

* fix mixed input gemm

* fix mixed input grouped gemm

* fix version checking

* use advanced compiler options

* fix comment

* rename advanced compiler configs to advanced compiler control

* fix comment

* fix name

* fix name
2026-01-30 13:31:04 +08:00
myu-guo
d252b01300 fix performance regression in cute-dsl examples for 4.4-ctk13.1 release (#2990)
* fix regression with cu13.1

* update
2026-01-30 13:30:49 +08:00
Xiao Song
acb45938e9 Update nvvm API call from nvvm enum to str (#2985) 2026-01-27 17:28:29 +08:00
Xiao Song
7a14467776 update api usage (#2969) 2026-01-27 15:33:22 +08:00
Junkai-Wu
9fba3195f9 v4.4 update. (#2979) 2026-01-24 11:46:17 -05:00