cutlass

mirror of https://github.com/NVIDIA/cutlass.git synced 2026-07-03 13:47:07 +00:00

Author	SHA1	Message	Date
myu-guo	d4b4b494c3	[CLI] Recover ssd and blockwise group gemm perf (#3344 ) * cggemm * cggemm * mggemm * quick fix for 13.3 * fix ssd * typo * update	2026-06-24 08:42:36 +08:00
drazi	c88b280fbf	add fp4_x2 example (#3043 ) * add fp4_x2 example * update docstring * improve comments	2026-06-23 17:56:23 +08:00
Junkai-Wu	8f50b052e1	Fix license. (#3328 )	2026-06-22 22:07:29 -04:00
minas-nv	cf064d2e6b	Update tensorop_gemm.py (#3322 ) * Update tensorop_gemm.py Add auto-transpose option for m-major C * Update tensorop_gemm.py Fix broken name	2026-06-16 11:33:00 -04:00
Junkai-Wu	39b352fa93	v4.6 dev update. (#3315 ) * v4.6 dev update. * Remove CUTLASS_HOST_DEVICE from CudaHostAdapater::memsetDevice (#3286) * [SM120] Add ptr-array TMA collective for tensor/token-scaled FP8 grouped GEMM (#3280) * gemm: add SM120 array TMA collective for tensor/token-scaled FP8 grouped GEMM Adds CollectiveMma and CollectiveBuilder specializations for MainloopSm120ArrayTmaWarpSpecialized, enabling ptr-array grouped GEMM (MoE expert dispatch) with tensor- and token-level FP8 scaling on SM_120/SM_121 consumer Blackwell (RTX 5090/5080/5070, DGX Spark GB10). New files: - include/cutlass/gemm/collective/sm120_mma_array_tma.hpp CollectiveMma specialization for MainloopSm120ArrayTmaWarpSpecialized. Handles both Cooperative (4x2 atom layout) and Pingpong (2x2) schedules. Grouped GEMM via pointer-array indirection through params.ptr_A / ptr_B. Supports F8F6F4 MMA with TMA loads for both A and B operands. - include/cutlass/gemm/collective/builders/sm120_array_mma_builder.inl CollectiveBuilder specialization for KernelPtrArrayTmaWarpSpecialized Cooperative/PingpongSm120<N> schedule tags. Computes tile/stage counts from smem capacity, routes to MainloopSm120ArrayTmaWarpSpecialized dispatch policy, produces correctly-typed CollectiveOp. Modified files: - collective_mma.hpp: include sm120_mma_array_tma.hpp - collective_builder.hpp: include sm120_array_mma_builder.inl - sm120_mma_builder.inl: remove ptr-array schedules from enable_if (they now route to sm120_array_mma_builder.inl) and drop the IsPtrArrayKernel static_assert that enforced the restriction Validated on real SM_121 hardware (DGX Spark, 128 GB LPDDR5X) running vLLM with RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 MoE, 26B total / 4B active). Previously fell back to a non-CUTLASS Triton path; with this patch, the SM120 CUTLASS grouped GEMM collective activates and produces correct outputs. Short-sequence throughput improved ~7% vs the fallback baseline (76.3 → 81.9 tok/s). Closes #3263 Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Tyler Merritt <tgmerritt@gmail.com> * test: add SM120 ptr-array grouped GEMM unit tests Adds 6 device-level tests for the CollectiveMma/CollectiveBuilder specializations introduced for MainloopSm120ArrayTmaWarpSpecialized, covering both KernelPtrArrayTmaWarpSpecializedPingpongSm120<2> and KernelPtrArrayTmaWarpSpecializedCooperativeSm120<2> schedule tags across e4m3×e4m3 (symmetric), e4m3×e5m2 (mixed), float and bfloat16 outputs, and two tile shapes. Tests land in test/unit/gemm/device/sm120_tensorop_gemm/ under the new cutlass_test_unit_sm120_grouped_gemm_device_tensorop CMake target, per reviewer request in PR #3280. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Signed-off-by: Tyler Merritt <tgmerritt@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> --------- Signed-off-by: Tyler Merritt <tgmerritt@gmail.com> Co-authored-by: Alex Georgiev <89279829+alexngUNC@users.noreply.github.com> Co-authored-by: Tyler <tgmerritt@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>	2026-06-15 23:23:20 -04:00
brandonsun	d80a4e53b5	fix validation codes (#3303 )	2026-06-05 20:16:02 +08:00
Linfeng Zheng	2599f2975b	[CLI] quick fix for fmha compile options (#3295 )	2026-06-03 17:56:17 +08:00
Anakin(Yancheng) Zheng	0e9ac0734c	Fix example with upcoming release (#3293 ) * Fix example new releases * Remove return	2026-06-03 13:40:27 +08:00
xiufanl	0bdd5cf8fb	fix example issue (#3294 )	2026-06-03 11:02:00 +08:00
Linfeng Zheng	423904d717	[CLI] Recover fmha perf (#3291 ) * [CLI] Recover fmha perf * [CLI] enable options for a certain version	2026-06-03 08:55:57 +08:00
brandonsun	25e252bdce	replace deprecated apis (#3285 )	2026-06-01 08:58:58 +08:00
bangyu shen	9c1d0965f8	Add Blackwell GeForce blockscaled GEMM examples (#3272 ) Co-authored-by: bangyus <bangyus@nvidia.com>	2026-05-27 16:06:52 -04:00
Caleb_Du	60b9659133	[CLI] add support for sm100 blockscaled gemm (#3274 ) Co-authored-by: Caleb Du <cadu@nvidia.com>	2026-05-27 09:33:26 +08:00
Longsheng Du	5f06f5fc1a	fix elect_sync api (#3262 )	2026-05-26 08:50:00 +08:00
Linfeng Zheng	e45ccb1226	[CLI] Update FMHA & improve perf (#3251 )	2026-05-25 15:56:08 +08:00
dePaul Miller	546c3efa89	Fix examples and pytest, run ruff (#3230 ) Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>	2026-05-21 11:05:38 +08:00
Observer007	971d1ed8b7	fix for thor (#3224 )	2026-05-13 09:06:44 +08:00
questa-quan-wang	ae6bccf341	[CuTeDSL] Update atomic_max_float32 to atomic_fmax in blockscaled GEMM example (#3206 ) The internal DSL package refactored atomic_max_float32 to atomic_fmax, which properly handles negative floats via sign-bit-aware integer atomics. Update the example to use the new API so it works with current DSL wheels. Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>	2026-05-07 15:03:37 +08:00
Junkai-Wu	cb37157db5	v4.5 tag update (#3202 ) * Python DSL examples reorganization. * v4.5 tag update.	2026-05-05 20:55:27 -04:00
drazi	4ca61d0662	[CuTeDSL] Add dataclass example: passing pointers via frozen dataclass (#3070 ) * Add dataclass example: passing pointers via frozen dataclass Demonstrates passing pointers from tensor arguments in @cute.jit to @cute.kernel using @dataclass(frozen=True). Shows the pattern of extracting pointers with tensor.iterator, bundling into a dataclass, and reconstructing tensors in the kernel. Uses fake tensors for compilation and TVM-FFI for runtime dispatch. Co-authored-by: Cursor <cursoragent@cursor.com> * Add dataclass example: passing tensors via frozen dataclass Demonstrates passing tensors from @cute.jit to @cute.kernel using @dataclass(frozen=True). Shows the pattern of bundling tensors into a dataclass with static configuration. Uses fake tensors for compilation and TVM-FFI for runtime dispatch. Includes reference check against PyTorch implementation. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-03-30 15:08:36 +08:00
Junkai-Wu	d4bbf728ca	v4.4 tag release update. (#3032 )	2026-02-13 23:27:58 -05:00
Junkai-Wu	9fba3195f9	v4.4 update. (#2979 )	2026-01-24 11:46:17 -05:00
Junkai-Wu	0d2b201e8c	v4.3.5 update. (#2934 ) * v4.3.5 update. * Update copyright to 2026	2026-01-08 15:02:56 -05:00
Fung Xie	286781a1fb	add requirements.txt	2025-11-27 17:02:27 -08:00
Fung Xie	2664cac685	enhanced the example for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	b9154d65b3	update examples for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	afe2f71522	reorganize examples for tvm-ffi	2025-11-27 17:02:26 -08:00
Junkai-Wu	1de3a576cc	v4.3.1 update. (#2817 )	2025-11-27 09:49:30 -05:00
Junkai-Wu	8cd5bef43a	v4.3 tag release update. (#2789 )	2025-11-20 20:49:44 -05:00
Junkai-Wu	b1d6e2c9b3	v4.3 update. (#2709 ) * v4.3 update. * Update the cute_dsl_api changelog's doc link * Update version to 4.3.0 * Update the example link * Update doc to encourage user to install DSL from requirements.txt --------- Co-authored-by: Larry Wu <larwu@nvidia.com>	2025-10-21 14:26:30 -04:00
Junkai-Wu	a1aaf2300a	v4.1 release	2025-07-03 08:07:53 -04:00
Junkai-Wu	8bdbfca682	v4.0 update. (#2371 )	2025-06-06 02:39:20 -04:00

32 Commits