cutlass

mirror of https://github.com/NVIDIA/cutlass.git synced 2026-06-28 18:37:05 +00:00

Author	SHA1	Message	Date
myu-guo	d4b4b494c3	[CLI] Recover ssd and blockwise group gemm perf (#3344 ) * cggemm * cggemm * mggemm * quick fix for 13.3 * fix ssd * typo * update	2026-06-24 08:42:36 +08:00
drazi	c88b280fbf	add fp4_x2 example (#3043 ) * add fp4_x2 example * update docstring * improve comments	2026-06-23 17:56:23 +08:00
Junkai-Wu	8f50b052e1	Fix license. (#3328 )	2026-06-22 22:07:29 -04:00
minas-nv	cf064d2e6b	Update tensorop_gemm.py (#3322 ) * Update tensorop_gemm.py Add auto-transpose option for m-major C * Update tensorop_gemm.py Fix broken name	2026-06-16 11:33:00 -04:00
Junkai-Wu	39b352fa93	v4.6 dev update. (#3315 ) * v4.6 dev update. * Remove CUTLASS_HOST_DEVICE from CudaHostAdapater::memsetDevice (#3286) * [SM120] Add ptr-array TMA collective for tensor/token-scaled FP8 grouped GEMM (#3280) * gemm: add SM120 array TMA collective for tensor/token-scaled FP8 grouped GEMM Adds CollectiveMma and CollectiveBuilder specializations for MainloopSm120ArrayTmaWarpSpecialized, enabling ptr-array grouped GEMM (MoE expert dispatch) with tensor- and token-level FP8 scaling on SM_120/SM_121 consumer Blackwell (RTX 5090/5080/5070, DGX Spark GB10). New files: - include/cutlass/gemm/collective/sm120_mma_array_tma.hpp CollectiveMma specialization for MainloopSm120ArrayTmaWarpSpecialized. Handles both Cooperative (4x2 atom layout) and Pingpong (2x2) schedules. Grouped GEMM via pointer-array indirection through params.ptr_A / ptr_B. Supports F8F6F4 MMA with TMA loads for both A and B operands. - include/cutlass/gemm/collective/builders/sm120_array_mma_builder.inl CollectiveBuilder specialization for KernelPtrArrayTmaWarpSpecialized Cooperative/PingpongSm120<N> schedule tags. Computes tile/stage counts from smem capacity, routes to MainloopSm120ArrayTmaWarpSpecialized dispatch policy, produces correctly-typed CollectiveOp. Modified files: - collective_mma.hpp: include sm120_mma_array_tma.hpp - collective_builder.hpp: include sm120_array_mma_builder.inl - sm120_mma_builder.inl: remove ptr-array schedules from enable_if (they now route to sm120_array_mma_builder.inl) and drop the IsPtrArrayKernel static_assert that enforced the restriction Validated on real SM_121 hardware (DGX Spark, 128 GB LPDDR5X) running vLLM with RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE, 26B total / 4B active). Previously fell back to a non-CUTLASS Triton path; with this patch, the SM120 CUTLASS grouped GEMM collective activates and produces correct outputs. Short-sequence throughput improved ~7% vs the fallback baseline (76.3 → 81.9 tok/s). 
Closes #3263 Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Tyler Merritt <tgmerritt@gmail.com> * test: add SM120 ptr-array grouped GEMM unit tests Adds 6 device-level tests for the CollectiveMma/CollectiveBuilder specializations introduced for MainloopSm120ArrayTmaWarpSpecialized, covering both KernelPtrArrayTmaWarpSpecializedPingpongSm120<2> and KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> schedule tags across e4m3×e4m3 (symmetric), e4m3×e5m2 (mixed), float and bfloat16 outputs, and two tile shapes. Tests land in test/unit/gemm/device/sm120_tensorop_gemm/ under the new cutlass_test_unit_sm120_grouped_gemm_device_tensorop CMake target, per reviewer request in PR #3280. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Signed-off-by: Tyler Merritt <tgmerritt@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> --------- Signed-off-by: Tyler Merritt <tgmerritt@gmail.com> Co-authored-by: Alex Georgiev <89279829+alexngUNC@users.noreply.github.com> Co-authored-by: Tyler <tgmerritt@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>	2026-06-15 23:23:20 -04:00
brandonsun	d80a4e53b5	fix validation codes (#3303 )	2026-06-05 20:16:02 +08:00
Linfeng Zheng	2599f2975b	[CLI] quick fix for fmha compile options (#3295 )	2026-06-03 17:56:17 +08:00
Anakin(Yancheng) Zheng	0e9ac0734c	Fix example with upcoming release (#3293 ) * Fix example new releases * Remove return	2026-06-03 13:40:27 +08:00
xiufanl	0bdd5cf8fb	fix example issue (#3294 )	2026-06-03 11:02:00 +08:00
Linfeng Zheng	423904d717	[CLI] Recover fmha perf (#3291 ) * [CLI] Recover fmha perf * [CLI] enable options for a certain version	2026-06-03 08:55:57 +08:00
brandonsun	25e252bdce	replace deprecated apis (#3285 )	2026-06-01 08:58:58 +08:00
bangyu shen	9c1d0965f8	Add Blackwell GeForce blockscaled GEMM examples (#3272 ) Co-authored-by: bangyus <bangyus@nvidia.com>	2026-05-27 16:06:52 -04:00
Junkai-Wu	5c54bee12b	v4.5.2 update. (#3264 )	2026-05-26 22:32:26 -04:00
Caleb_Du	60b9659133	[CLI] add support for sm100 blockscaled gemm (#3274 ) Co-authored-by: Caleb Du <cadu@nvidia.com>	2026-05-27 09:33:26 +08:00
Longsheng Du	5f06f5fc1a	fix elect_sync api (#3262 )	2026-05-26 08:50:00 +08:00
Linfeng Zheng	e45ccb1226	[CLI] Update FMHA & improve perf (#3251 )	2026-05-25 15:56:08 +08:00
dePaul Miller	546c3efa89	Fix examples and pytest, run ruff (#3230 ) Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>	2026-05-21 11:05:38 +08:00
Observer007	971d1ed8b7	fix for thor (#3224 )	2026-05-13 09:06:44 +08:00
questa-quan-wang	ae6bccf341	[CuTeDSL] Update atomic_max_float32 to atomic_fmax in blockscaled GEMM example (#3206 ) The internal DSL package refactored atomic_max_float32 to atomic_fmax, which properly handles negative floats via sign-bit-aware integer atomics. Update the example to use the new API so it works with current DSL wheels. Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>	2026-05-07 15:03:37 +08:00
Junkai-Wu	cb37157db5	v4.5 tag update (#3202 ) * Python DSL examples reorganization. * v4.5 tag update.	2026-05-05 20:55:27 -04:00
Johnsonms	f74fea9ce3	[Hopper CuTeDSL] Add FP8 GEMM with 2xAcc (#3149 ) Add dense_gemm_fp8_2xacc.py — a CuTeDSL port of CUTLASS Example 54 (54_hopper_fp8_warp_specialized_gemm.cu) for NVIDIA Hopper (SM90). Implements D = scale_a * scale_b * (A @ B) where A/B are FP8 E4M3FN using the 2xAcc (double accumulation) technique: a temporary accumulator is periodically promoted into the main accumulator every mma_promotion_interval MMA instructions to prevent FP8 precision loss. Features: - FP8 E4M3FN inputs with Float32 accumulation - 2xAcc for improved numerical accuracy - TMA with multicast for A/B/D transfers - WGMMA warp-specialized persistent tile scheduling - Configurable output dtype: Float16, Float32, Float8E4M3FN - Scalar scale_a / scale_b epilogue factors - Cluster shapes up to 2x2 Add pytest test suite covering: - L0 compile tests: all tile shapes, cluster shapes, output dtypes, mma_promotion_interval values - L1 correctness tests: numerical validation vs torch.einsum reference for all configs, non-trivial scale factors, and batched GEMM (L>1) - Benchmark tests (pytest -m bench -s): representative problem sizes with warmup, cold-L2, and TFLOPS reporting Also fix conftest.py to import cutlass before adding examples/python/CuTeDSL to sys.path, preventing the jax/ examples subdirectory from being detected as a namespace package and breaking cutlass's JAX availability check.	2026-04-25 16:10:33 -04:00
Longsheng Du	08185b9c3e	Update blackwell tutorial to be compatible with 4.5-dev version (#3130 ) * Update blackwell tutorial to be compatible with 4.5-dev version * update example for reverted changes * add more example fix	2026-04-09 14:40:33 +08:00
Junkai-Wu	a221da7ccf	v4.5 dev update. (#3153 )	2026-04-07 12:16:05 -04:00
Katja Sirazitdinova	418d38a5de	PR update (#3103 )	2026-04-02 18:00:41 +08:00
drazi	4ca61d0662	[CuTeDSL] Add dataclass example: passing pointers via frozen dataclass (#3070 ) * Add dataclass example: passing pointers via frozen dataclass Demonstrates passing pointers from tensor arguments in @cute.jit to @cute.kernel using @dataclass(frozen=True). Shows the pattern of extracting pointers with tensor.iterator, bundling into a dataclass, and reconstructing tensors in the kernel. Uses fake tensors for compilation and TVM-FFI for runtime dispatch. Co-authored-by: Cursor <cursoragent@cursor.com> * Add dataclass example: passing tensors via frozen dataclass Demonstrates passing tensors from @cute.jit to @cute.kernel using @dataclass(frozen=True). Shows the pattern of bundling tensors into a dataclass with static configuration. Uses fake tensors for compilation and TVM-FFI for runtime dispatch. Includes reference check against PyTorch implementation. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-03-30 15:08:36 +08:00
Zheng Linfeng	ecb32fe231	[CLI] Fix tutorial issues	2026-03-24 00:12:01 -07:00
Johnsonms	982748aa73	[Hopper CuTeDSL] Add grouped GEMM persistent kernel and tests (#3091 ) Implement grouped GEMM (C_g = A_g x B_g for g groups) on Hopper using CuTe DSL, extending the dense persistent GEMM with per-group TMA descriptor management. Kernel design (grouped_gemm.py): - Warp-specialized pipeline: DMA warp group handles TMA loads and per-group tensormap updates; MMA warp group runs WGMMA and stores C - StaticPersistentGroupTileScheduler for cross-group tile scheduling - Per-group TMA descriptor updates via GMEM or SMEM mode - Supports fp16, fp8 (E4M3FN/E5M2), int8 with mixed A/B dtypes - Configurable tile shapes (128x128, 128x256) and cluster shapes - Fix base TensorMapManager: hoist uniform_smem_ptrs outside predicated block to avoid illegal @P0 R2UR on sm_90a Tests (test/examples/CuTeDSL/hopper/test_grouped_gemm.py): - L0 compile and L1 correctness pytest suite covering tile shapes, dtypes, major modes, cluster shapes, group counts, and mixed sizes - Move to test/examples/CuTeDSL/hopper/ following sm_100a convention - Fix deprecated startdir arg in test_sharding.py pytest hook	2026-03-18 00:40:15 -04:00
Junkai-Wu	1b741cabaa	v4.4.2 update. (#3104 )	2026-03-17 00:58:19 -04:00
Linfeng Zheng	772fbb264e	[CLI] add cutedsl fp16 gemm tutorial from 2 to 6 (#3106 ) * [CLI] add fp16 gemm tutorial from 2 to 6 * [CLI] refine comments	2026-03-17 10:11:55 +08:00
Blake Ledden	087c84df83	docs: Fix float16 documentation in elementwise_add notebook (#2949 ) (#3047 ) The notebook uses float16 tensors but the vectorized kernel documentation incorrectly describes elements as 32-bit and uses 4-element vectorization. Updated to correctly state 16-bit elements with 8-element vectorization for proper 128-bit loads/stores. Signed-off-by: Blake Ledden <bledden@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 10:29:46 +08:00
Junkai-Wu	3bb6e28d3c	v4.4.1 update (#3079 )	2026-02-27 13:59:21 -05:00
Junkai-Wu	057635de5c	Remove redundant dsl example. (#3074 )	2026-02-26 08:10:59 -05:00
Junkai-Wu	c213bfdfc1	Remove redundant dsl examples. (#3071 )	2026-02-25 22:42:01 -05:00
Linfeng Zheng	3476ddb7bd	remove mixed_input_fmha_prefill (#3041 )	2026-02-18 07:59:01 -05:00
Yihan Chen	291300ffff	[CuTeDSL] implement a cta-level norm example (both layernorm and rmsnorm) (#3009 ) * kernel impl * add copyright	2026-02-14 17:54:03 +08:00
aragorn-guan	f9a5f76b7a	Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py (#3027 )	2026-02-14 17:51:20 +08:00
Junkai-Wu	d4bbf728ca	v4.4 tag release update. (#3032 )	2026-02-13 23:27:58 -05:00
aragorn-guan	8dbce01473	[CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store (#2970 )	2026-02-11 11:54:00 +08:00
drazi	71aa7a0abc	Merge pull request #2919 from pbelevich/patch-1 Refactor binary_op functions to remove unused result parameter	2026-02-11 11:48:58 +08:00
Junkai-Wu	6b3e607b85	v4.4 release update v2. (#2999 )	2026-02-03 20:48:31 -05:00
Hua Huang	1cfbb53a23	[CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator (#2995 ) * Fix: SM100 block-scale gemm overlapping accumulator Signed-off-by: Hua Huang <huah@nvidia.com> * Also include threads_per_warp fix Signed-off-by: Hua Huang <huah@nvidia.com> --------- Signed-off-by: Hua Huang <huah@nvidia.com>	2026-02-03 11:01:41 +08:00
dongxiao	a4eb0e05f6	fix performance issues in cute-dsl examples for 4.4-ctk13.1 release (#2988 ) * fix grouped gemm * fix mixed input gemm * fix mixed input grouped gemm * fix version checking * use advanced compiler options * fix comment * rename advanced compiler configs to advanced compiler control * fix comment * fix name * fix name	2026-01-30 13:31:04 +08:00
myu-guo	d252b01300	fix performance regression in cute-dsl examples for 4.4-ctk13.1 release (#2990 ) * fix regression with cu13.1 * update	2026-01-30 13:30:49 +08:00
Xiao Song	acb45938e9	Update nvvm API call from nvvm enum to str (#2985 )	2026-01-27 17:28:29 +08:00
Xiao Song	7a14467776	update api usage (#2969 )	2026-01-27 15:33:22 +08:00
Junkai-Wu	9fba3195f9	v4.4 update. (#2979 )	2026-01-24 11:46:17 -05:00
Brian K. Ryu	147f5673d0	New RMS Norm example with unit tests (#2917 ) * Add rmsnorm example * Address reviewer comments. (1) use the cute.runtime definition directly. (2) use the nvvm_wrapper's warp reduce directly * Separate out reduce.py * Change copyright notice years	2026-01-13 09:05:31 +08:00
Junkai-Wu	0d2b201e8c	v4.3.5 update. (#2934 ) * v4.3.5 update. * Update copyright to 2026	2026-01-08 15:02:56 -05:00
Pavel Belevich	b6d7703e02	Refactor binary_op functions to remove unused result parameter	2026-01-02 11:23:43 -05:00
Pavel Belevich	f9bedd9096	Fix print statement for floor division result	2026-01-02 11:15:15 -05:00
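
Commit f74fea9ce3 above describes 2xAcc as promoting a temporary accumulator into the main accumulator every mma_promotion_interval MMA instructions, with D = scale_a * scale_b * (A @ B). The control flow can be sketched in plain Python as follows. This is only an illustration under stated assumptions: a scalar dot product stands in for one MMA tile, the name dot_2xacc is hypothetical, and Python floats are double precision, so the sketch shows the promotion schedule rather than any FP8 accuracy gain. It is not taken from dense_gemm_fp8_2xacc.py.

```python
def dot_2xacc(a, b, mma_promotion_interval=4, scale_a=1.0, scale_b=1.0):
    """Scaled dot product using a temporary accumulator that is
    periodically promoted into a main accumulator (hypothetical helper)."""
    if mma_promotion_interval < 1:
        raise ValueError("mma_promotion_interval must be >= 1")
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")

    main_acc = 0.0  # long-lived accumulator
    temp_acc = 0.0  # short-lived accumulator, reset after each promotion

    for step, (x, y) in enumerate(zip(a, b), start=1):
        temp_acc += x * y  # stands in for one MMA instruction
        if step % mma_promotion_interval == 0:
            main_acc += temp_acc  # promote the partial sum
            temp_acc = 0.0

    main_acc += temp_acc  # flush any partial sum left after the last interval
    return scale_a * scale_b * main_acc


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0]
    b = [2.0] * 5
    print(dot_2xacc(a, b, mma_promotion_interval=2))
```

The final flush matters when the reduction length is not a multiple of the promotion interval; without it the last partial sum would be dropped.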

80 Commits