cutlass

mirror of https://github.com/NVIDIA/cutlass.git synced 2026-05-11 17:00:05 +00:00

Author	SHA1	Message	Date
questa-quan-wang	ae6bccf341	[CuTeDSL] Update atomic_max_float32 to atomic_fmax in blockscaled GEMM example (#3206 ) The internal DSL package refactored atomic_max_float32 to atomic_fmax, which properly handles negative floats via sign-bit-aware integer atomics. Update the example to use the new API so it works with current DSL wheels. Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>	2026-05-07 15:03:37 +08:00
Junkai-Wu	cb37157db5	v4.5 tag update (#3202 ) * Python DSL examples reorganization. * v4.5 tag update.	2026-05-05 20:55:27 -04:00
Johnsonms	f74fea9ce3	[Hopper CuTeDSL] Add FP8 GEMM with 2xAcc (#3149 ) Add dense_gemm_fp8_2xacc.py — a CuTeDSL port of CUTLASS Example 54 (54_hopper_fp8_warp_specialized_gemm.cu) for NVIDIA Hopper (SM90). Implements D = scale_a * scale_b * (A @ B) where A/B are FP8 E4M3FN using the 2xAcc (double accumulation) technique: a temporary accumulator is periodically promoted into the main accumulator every mma_promotion_interval MMA instructions to prevent FP8 precision loss. Features: - FP8 E4M3FN inputs with Float32 accumulation - 2xAcc for improved numerical accuracy - TMA with multicast for A/B/D transfers - WGMMA warp-specialized persistent tile scheduling - Configurable output dtype: Float16, Float32, Float8E4M3FN - Scalar scale_a / scale_b epilogue factors - Cluster shapes up to 2x2 Add pytest test suite covering: - L0 compile tests: all tile shapes, cluster shapes, output dtypes, mma_promotion_interval values - L1 correctness tests: numerical validation vs torch.einsum reference for all configs, non-trivial scale factors, and batched GEMM (L>1) - Benchmark tests (pytest -m bench -s): representative problem sizes with warmup, cold-L2, and TFLOPS reporting Also fix conftest.py to import cutlass before adding examples/python/CuTeDSL to sys.path, preventing the jax/ examples subdirectory from being detected as a namespace package and breaking cutlass's JAX availability check.	2026-04-25 16:10:33 -04:00
Longsheng Du	08185b9c3e	Update blackwell tutorial to be compatible with 4.5-dev version (#3130 ) * Update blackwell tutorial to be compatible with 4.5-dev version * update example for reverted changes * add more example fix	2026-04-09 14:40:33 +08:00
Junkai-Wu	a221da7ccf	v4.5 dev update. (#3153 )	2026-04-07 12:16:05 -04:00
Katja Sirazitdinova	418d38a5de	PR update (#3103 )	2026-04-02 18:00:41 +08:00
drazi	4ca61d0662	[CuTeDSL] Add dataclass example: passing pointers via frozen dataclass (#3070 ) * Add dataclass example: passing pointers via frozen dataclass Demonstrates passing pointers from tensor arguments in @cute.jit to @cute.kernel using @dataclass(frozen=True). Shows the pattern of extracting pointers with tensor.iterator, bundling into a dataclass, and reconstructing tensors in the kernel. Uses fake tensors for compilation and TVM-FFI for runtime dispatch. Co-authored-by: Cursor <cursoragent@cursor.com> * Add dataclass example: passing tensors via frozen dataclass Demonstrates passing tensors from @cute.jit to @cute.kernel using @dataclass(frozen=True). Shows the pattern of bundling tensors into a dataclass with static configuration. Uses fake tensors for compilation and TVM-FFI for runtime dispatch. Includes reference check against PyTorch implementation. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-03-30 15:08:36 +08:00
Zheng Linfeng	ecb32fe231	[CLI] Fix tutorial issues	2026-03-24 00:12:01 -07:00
Johnsonms	982748aa73	[Hopper CuTeDSL] Add grouped GEMM persistent kernel and tests (#3091 ) Implement grouped GEMM (C_g = A_g x B_g for g groups) on Hopper using CuTe DSL, extending the dense persistent GEMM with per-group TMA descriptor management. Kernel design (grouped_gemm.py): - Warp-specialized pipeline: DMA warp group handles TMA loads and per-group tensormap updates; MMA warp group runs WGMMA and stores C - StaticPersistentGroupTileScheduler for cross-group tile scheduling - Per-group TMA descriptor updates via GMEM or SMEM mode - Supports fp16, fp8 (E4M3FN/E5M2), int8 with mixed A/B dtypes - Configurable tile shapes (128x128, 128x256) and cluster shapes - Fix base TensorMapManager: hoist uniform_smem_ptrs outside predicated block to avoid illegal @P0 R2UR on sm_90a Tests (test/examples/CuTeDSL/hopper/test_grouped_gemm.py): - L0 compile and L1 correctness pytest suite covering tile shapes, dtypes, major modes, cluster shapes, group counts, and mixed sizes - Move to test/examples/CuTeDSL/hopper/ following sm_100a convention - Fix deprecated startdir arg in test_sharding.py pytest hook	2026-03-18 00:40:15 -04:00
Junkai-Wu	1b741cabaa	v4.4.2 update. (#3104 )	2026-03-17 00:58:19 -04:00
Linfeng Zheng	772fbb264e	[CLI] add cutedsl fp16 gemm tutorial from 2 to 6 (#3106 ) * [CLI] add fp16 gemm tutorial from 2 to 6 * [CLI] refine comments	2026-03-17 10:11:55 +08:00
Blake Ledden	087c84df83	docs: Fix float16 documentation in elementwise_add notebook (#2949 ) (#3047 ) The notebook uses float16 tensors but the vectorized kernel documentation incorrectly describes elements as 32-bit and uses 4-element vectorization. Updated to correctly state 16-bit elements with 8-element vectorization for proper 128-bit loads/stores. Signed-off-by: Blake Ledden <bledden@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 10:29:46 +08:00
Junkai-Wu	3bb6e28d3c	v4.4.1 update (#3079 )	2026-02-27 13:59:21 -05:00
Junkai-Wu	057635de5c	Remove redundant dsl example. (#3074 )	2026-02-26 08:10:59 -05:00
Junkai-Wu	c213bfdfc1	Remove redundant dsl examples. (#3071 )	2026-02-25 22:42:01 -05:00
Linfeng Zheng	3476ddb7bd	remove mixed_input_fmha_prefill (#3041 )	2026-02-18 07:59:01 -05:00
Yihan Chen	291300ffff	[CuTeDSL] implment a cta-level norm example (both layernorm and rmsnorm) (#3009 ) * kernel impl * add copyright	2026-02-14 17:54:03 +08:00
aragorn-guan	f9a5f76b7a	Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py (#3027 )	2026-02-14 17:51:20 +08:00
Junkai-Wu	d4bbf728ca	v4.4 tag release update. (#3032 )	2026-02-13 23:27:58 -05:00
aragorn-guan	8dbce01473	[CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store (#2970 )	2026-02-11 11:54:00 +08:00
drazi	71aa7a0abc	Merge pull request #2919 from pbelevich/patch-1 Refactor binary_op functions to remove unused result parameter	2026-02-11 11:48:58 +08:00
Junkai-Wu	6b3e607b85	v4.4 release update v2. (#2999 )	2026-02-03 20:48:31 -05:00
Hua Huang	1cfbb53a23	[CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator (#2995 ) * Fix: SM100 block-scale gemm overlapping accumulator Signed-off-by: Hua Huang <huah@nvidia.com> * Also include threads_per_warp fix Signed-off-by: Hua Huang <huah@nvidia.com> --------- Signed-off-by: Hua Huang <huah@nvidia.com>	2026-02-03 11:01:41 +08:00
dongxiao	a4eb0e05f6	fix performance inssues in cute-dsl examples for 4.4-ctk13.1 release (#2988 ) * fix grouped gemm * fix mixed input gemm * fix mixed input grouped gemm * fix version checking * use advanced compiler options * fix comment * rename advanced compiler configs to adcanced compiler control * fix comment * fix name * fix name	2026-01-30 13:31:04 +08:00
myu-guo	d252b01300	fix performance regression in cute-dsl examples for 4.4-ctk13.1 release (#2990 ) * fix regression with cu13.1 * update	2026-01-30 13:30:49 +08:00
Xiao Song	acb45938e9	Update nvvm API call from nvvm enum to str (#2985 )	2026-01-27 17:28:29 +08:00
Xiao Song	7a14467776	update api usage (#2969 )	2026-01-27 15:33:22 +08:00
Junkai-Wu	9fba3195f9	v4.4 update. (#2979 )	2026-01-24 11:46:17 -05:00
Brian K. Ryu	147f5673d0	New RMS Norm example with unit tests (#2917 ) * Add rmsnorm example * Address reviewer comments. (1) use the cute.runtime definition directly. (2) use the nvvm_wrapper's warp reduce directly * Separate out reduce.py * Change copyright notice years	2026-01-13 09:05:31 +08:00
Junkai-Wu	0d2b201e8c	v4.3.5 update. (#2934 ) * v4.3.5 update. * Update copyright to 2026	2026-01-08 15:02:56 -05:00
Pavel Belevich	b6d7703e02	Refactor binary_op functions to remove unused result parameter	2026-01-02 11:23:43 -05:00
Pavel Belevich	f9bedd9096	Fix print statement for floor division result	2026-01-02 11:15:15 -05:00
questa-quan-wang	3f4c086d09	new example with TMA prefetch feature targeting for DRAM latency bound cases (#2881 ) Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>	2025-12-23 15:29:48 +08:00
Junkai-Wu	7f5fe3edf1	v4.3.4 update. (#2892 )	2025-12-21 11:49:12 -05:00
dongxiao	331e2f451c	add missing condition for sync (#2889 )	2025-12-19 11:00:30 +08:00
Linfeng Zheng	f6402fcd5e	add pytest support for tutorial gemm (#2826 ) * add pytest support for tutorial gemm * add license	2025-12-05 08:45:01 -05:00
bangyu shen	7252a2d17e	remove internal comment (#2841 ) Co-authored-by: bangyus <bangyus@nvidia.com>	2025-12-04 10:36:21 -05:00
Junkai-Wu	bc680c7f67	v4.3.2 update. (#2839 )	2025-12-04 10:14:32 -05:00
bangyu shen	52ae719eda	[examples][CuTeDSL] init commit for distirbuted examples (#2806 ) * init commit for distirbuted examples * better OOB protection * and try import to nvshmem for better error message and a READMME.md to introduce nvshmem and multimem instructions * add some lamport explanation * enhance f8 output and warn that f8 output can have nan in it * tell user why we need complicate data conversions in ref check part * tell user we don't support nvshmem device function --------- Co-authored-by: bangyus <bangyus@nvidia.com>	2025-12-01 22:25:40 -05:00
drazi	ec8daf642d	Merge pull request #2809 from whatdhack/patch-1 Update notebook title from 'Tour to' to 'Tour of'	2025-11-28 18:07:34 +08:00
Fung Xie	286781a1fb	add requirements.txt	2025-11-27 17:02:27 -08:00
Fung Xie	2664cac685	enhanced the example for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	b9154d65b3	update examples for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	afe2f71522	reorganize examples for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	739fffce27	fix TVM FFI doc and update example	2025-11-27 17:02:26 -08:00
Junkai-Wu	1de3a576cc	v4.3.1 update. (#2817 )	2025-11-27 09:49:30 -05:00
whatdhack	4a55379686	Update notebook title from 'Tour to' to 'Tour of' Grammar check . LLM's can quickly clean up such issues.	2025-11-24 20:11:14 -08:00
Junkai-Wu	8cd5bef43a	v4.3 tag release update. (#2789 )	2025-11-20 20:49:44 -05:00
Linfeng Zheng	406e078b29	add a notebook for tour to sol gemm (#2780 ) * add tour to sol gemm notebook * change some typos * change some typos	2025-11-20 09:41:01 -05:00
Mindy Li	06b6bd7d7b	remove cute dsl pdl example.	2025-11-09 21:47:00 -08:00

1 2

72 Commits