Commit Graph

211 Commits

Author SHA1 Message Date
Longsheng Du
08185b9c3e Update blackwell tutorial to be compatible with 4.5-dev version (#3130)
* Update blackwell tutorial to be compatible with 4.5-dev version

* update example for reverted changes

* add more example fix
2026-04-09 14:40:33 +08:00
Junkai-Wu
a221da7ccf v4.5 dev update. (#3153) 2026-04-07 12:16:05 -04:00
Katja Sirazitdinova
418d38a5de PR update (#3103) 2026-04-02 18:00:41 +08:00
drazi
4ca61d0662 [CuTeDSL] Add dataclass example: passing pointers via frozen dataclass (#3070)
* Add dataclass example: passing pointers via frozen dataclass

Demonstrates passing pointers from tensor arguments in @cute.jit to
@cute.kernel using @dataclass(frozen=True). Shows the pattern of
extracting pointers with tensor.iterator, bundling into a dataclass,
and reconstructing tensors in the kernel.

Uses fake tensors for compilation and TVM-FFI for runtime dispatch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add dataclass example: passing tensors via frozen dataclass

Demonstrates passing tensors from @cute.jit to @cute.kernel using
@dataclass(frozen=True). Shows the pattern of bundling tensors into
a dataclass with static configuration.

Uses fake tensors for compilation and TVM-FFI for runtime dispatch.
Includes reference check against PyTorch implementation.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-03-30 15:08:36 +08:00
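The frozen-dataclass bundling pattern the commit above describes can be sketched in plain Python. This is a hypothetical illustration only: the `@cute.jit`/`@cute.kernel` decorators and real `tensor.iterator` pointer types are elided, with plain ints standing in for device pointers.

```python
from dataclasses import dataclass

# Hypothetical sketch of the pattern: extract pointers, bundle them into
# a frozen dataclass, and hand the bundle to the kernel. Ints stand in
# for the real pointer objects.

@dataclass(frozen=True)
class TensorArgs:
    a_ptr: int    # stand-in for tensor.iterator of input A
    b_ptr: int    # stand-in for tensor.iterator of input B
    out_ptr: int  # stand-in for tensor.iterator of the output
    n: int        # static configuration carried alongside the pointers

def kernel(args: TensorArgs) -> tuple:
    # In the real example the tensors are reconstructed from the pointers
    # inside the kernel; here we just show the bundle arrives intact.
    return (args.a_ptr, args.b_ptr, args.out_ptr, args.n)

args = TensorArgs(a_ptr=0x1000, b_ptr=0x2000, out_ptr=0x3000, n=1024)
print(kernel(args))
# frozen=True makes the bundle immutable: `args.n = 2048` would raise
# dataclasses.FrozenInstanceError.
```

Freezing the dataclass is what makes it safe to treat the bundle as static compile-time state.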
Zheng Linfeng
ecb32fe231 [CLI] Fix tutorial issues 2026-03-24 00:12:01 -07:00
Johnsonms
982748aa73 [Hopper CuTeDSL] Add grouped GEMM persistent kernel and tests (#3091)
Implement grouped GEMM (C_g = A_g x B_g for g groups) on Hopper using
CuTe DSL, extending the dense persistent GEMM with per-group TMA
descriptor management.

Kernel design (grouped_gemm.py):
- Warp-specialized pipeline: DMA warp group handles TMA loads and
  per-group tensormap updates; MMA warp group runs WGMMA and stores C
- StaticPersistentGroupTileScheduler for cross-group tile scheduling
- Per-group TMA descriptor updates via GMEM or SMEM mode
- Supports fp16, fp8 (E4M3FN/E5M2), int8 with mixed A/B dtypes
- Configurable tile shapes (128x128, 128x256) and cluster shapes
- Fix base TensorMapManager: hoist uniform_smem_ptrs outside predicated
  block to avoid illegal @P0 R2UR on sm_90a

Tests (test/examples/CuTeDSL/hopper/test_grouped_gemm.py):
- L0 compile and L1 correctness pytest suite covering tile shapes,
  dtypes, major modes, cluster shapes, group counts, and mixed sizes
- Move to test/examples/CuTeDSL/hopper/ following sm_100a convention
- Fix deprecated startdir arg in test_sharding.py pytest hook
2026-03-18 00:40:15 -04:00
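The grouped-GEMM semantics above (C_g = A_g x B_g for g groups) have a one-line NumPy reference. This sketch is not the kernel, just the math it must match; note that each group may have different shapes, which is exactly why the kernel needs per-group TMA descriptor updates.

```python
import numpy as np

# NumPy reference for grouped GEMM: one independent matmul per group.
def grouped_gemm_ref(As, Bs):
    return [A @ B for A, B in zip(As, Bs)]

rng = np.random.default_rng(0)
shapes = [(64, 32, 16), (128, 64, 32)]  # (M, K, N) per group, arbitrary
As = [rng.standard_normal((m, k)) for m, k, n in shapes]
Bs = [rng.standard_normal((k, n)) for m, k, n in shapes]
Cs = grouped_gemm_ref(As, Bs)
print([C.shape for C in Cs])  # [(64, 16), (128, 32)]
```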
Junkai-Wu
1b741cabaa v4.4.2 update. (#3104) 2026-03-17 00:58:19 -04:00
Linfeng Zheng
772fbb264e [CLI] add cutedsl fp16 gemm tutorial from 2 to 6 (#3106)
* [CLI] add fp16 gemm tutorial from 2 to 6

* [CLI] refine comments
2026-03-17 10:11:55 +08:00
Blake Ledden
087c84df83 docs: Fix float16 documentation in elementwise_add notebook (#2949) (#3047)
The notebook uses float16 tensors but the vectorized kernel documentation
incorrectly describes elements as 32-bit and uses 4-element vectorization.
Updated to correctly state 16-bit elements with 8-element vectorization
for proper 128-bit loads/stores.

Signed-off-by: Blake Ledden <bledden@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 10:29:46 +08:00
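The documentation fix above is simple arithmetic: a 128-bit load holds 128/16 = 8 float16 elements, versus 128/32 = 4 for float32, which a tiny helper makes explicit:

```python
# Vector width for a 128-bit load/store given the element width in bits,
# matching the notebook fix: float16 -> 8 elements, float32 -> 4.
LOAD_BITS = 128

def vector_width(elem_bits: int) -> int:
    assert LOAD_BITS % elem_bits == 0
    return LOAD_BITS // elem_bits

print(vector_width(16), vector_width(32))  # 8 4
```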
dePaul Miller
73c59c055c Support for Group GEMM in CUTLASS Profiler for GeForce and Spark (#3092)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-03-06 20:36:29 -05:00
Junkai-Wu
3bb6e28d3c v4.4.1 update (#3079) 2026-02-27 13:59:21 -05:00
Tianqi Zhang (张天启)
c651d660d2 fix typo (#3012) 2026-02-27 16:25:35 +08:00
mnehete32
79345359a7 Fix debug typo in sgemm_2.cu and sgemm_sm70.cu (#2678) 2026-02-27 16:23:59 +08:00
Junkai-Wu
057635de5c Remove redundant dsl example. (#3074) 2026-02-26 08:10:59 -05:00
Junkai-Wu
c213bfdfc1 Remove redundant dsl examples. (#3071) 2026-02-25 22:42:01 -05:00
Linfeng Zheng
3476ddb7bd remove mixed_input_fmha_prefill (#3041) 2026-02-18 07:59:01 -05:00
Yihan Chen
291300ffff [CuTeDSL] implement a CTA-level norm example (both layernorm and rmsnorm) (#3009)
* kernel impl

* add copyright
2026-02-14 17:54:03 +08:00
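The two norms in the CTA-level example above have compact NumPy references. This is a semantic sketch only, not the kernel's code; the `eps` value is an assumption, not taken from the example.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    # Subtract the per-row mean, divide by per-row std.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rmsnorm(x, eps=1e-5):
    # Divide by the per-row root-mean-square; no mean subtraction.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.random.default_rng(1).standard_normal((4, 8))
print(layernorm(x).shape, rmsnorm(x).shape)  # (4, 8) (4, 8)
```

The only structural difference is that RMSNorm skips the mean subtraction, which is why one CTA-level kernel can cover both.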
aragorn-guan
f9a5f76b7a Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py (#3027) 2026-02-14 17:51:20 +08:00
Junkai-Wu
d4bbf728ca v4.4 tag release update. (#3032) 2026-02-13 23:27:58 -05:00
aragorn-guan
8dbce01473 [CuTeDSL] Distributed example: use TMA loads to access remote memory rank-by-rank, reduce in the CTA, and broadcast the result to all ranks via multimem TMA store (#2970) 2026-02-11 11:54:00 +08:00
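The reduce-then-broadcast all-reduce described above has simple reference semantics, sketched here with a hypothetical NumPy helper (not the example's API): after the collective, every rank holds the element-wise sum over all ranks' buffers.

```python
import numpy as np

def allreduce_ref(bufs):
    total = np.sum(bufs, axis=0)         # the in-CTA reduction step
    return [total.copy() for _ in bufs]  # the multimem-style broadcast step

# 4 ranks, each contributing a buffer filled with its rank id.
ranks = [np.full(4, r, dtype=np.float32) for r in range(4)]
out = allreduce_ref(ranks)
print(out[0])  # [6. 6. 6. 6.]  (0 + 1 + 2 + 3)
```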
drazi
71aa7a0abc Merge pull request #2919 from pbelevich/patch-1
Refactor binary_op functions to remove unused result parameter
2026-02-11 11:48:58 +08:00
Junkai-Wu
6b3e607b85 v4.4 release update v2. (#2999) 2026-02-03 20:48:31 -05:00
Hua Huang
1cfbb53a23 [CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator (#2995)
* Fix: SM100 block-scale gemm overlapping accumulator

Signed-off-by: Hua Huang <huah@nvidia.com>

* Also include threads_per_warp fix

Signed-off-by: Hua Huang <huah@nvidia.com>

---------

Signed-off-by: Hua Huang <huah@nvidia.com>
2026-02-03 11:01:41 +08:00
dongxiao
a4eb0e05f6 fix performance issues in cute-dsl examples for 4.4-ctk13.1 release (#2988)
* fix grouped gemm

* fix mixed input gemm

* fix mixed input grouped gemm

* fix version checking

* use advanced compiler options

* fix comment

* rename advanced compiler configs to advanced compiler control

* fix comment

* fix name

* fix name
2026-01-30 13:31:04 +08:00
myu-guo
d252b01300 fix performance regression in cute-dsl examples for 4.4-ctk13.1 release (#2990)
* fix regression with cu13.1

* update
2026-01-30 13:30:49 +08:00
Xiao Song
acb45938e9 Update nvvm API call from nvvm enum to str (#2985) 2026-01-27 17:28:29 +08:00
Xiao Song
7a14467776 update api usage (#2969) 2026-01-27 15:33:22 +08:00
Junkai-Wu
9fba3195f9 v4.4 update. (#2979) 2026-01-24 11:46:17 -05:00
Johnsonms
0edaa6e47d Fix out-of-bounds TMA access in wgmma_tma_sm90 tutorial (#2945) 2026-01-23 12:54:12 +08:00
Aidan Do
431d070fcb [docs] Add additional tip for generating fewer kernels in blockwise (#2940)
- Running without this generates a lot of kernels
- Clarified CMake configuration for selecting GEMM kernels and added details on kernel generation granularity.
2026-01-23 12:53:51 +08:00
Brian K. Ryu
147f5673d0 New RMS Norm example with unit tests (#2917)
* Add rmsnorm example

* Address reviewer comments. (1) use the cute.runtime definition directly. (2) use the nvvm_wrapper's warp reduce directly

* Separate out reduce.py

* Change copyright notice years
2026-01-13 09:05:31 +08:00
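The warp-reduce helper mentioned above (factored into `reduce.py`) is built on shuffle intrinsics; its control flow can be sketched in plain Python. This is a schematic of the tree-reduction idea only, not the nvvm_wrapper API: each step halves the number of active lanes, like repeated `shfl_down` exchanges.

```python
def warp_reduce_sum(lanes):
    # Tree reduction over a power-of-two lane count (e.g. a 32-lane warp):
    # at each step, lane i accumulates the value `offset` lanes away.
    vals = list(lanes)
    offset = len(vals) // 2
    while offset > 0:
        for i in range(offset):
            vals[i] += vals[i + offset]  # shfl_down-style exchange
        offset //= 2
    return vals[0]  # lane 0 holds the full sum

print(warp_reduce_sum(range(32)))  # 496 == sum(range(32))
```

The real helper does this in log2(32) = 5 steps with no shared memory, which is why it is worth reusing directly rather than reimplementing per kernel.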
Johnsonms
8c52459504 Fix incorrect tensor layout strides in Blackwell MMA tutorial comments (#2921) 2026-01-09 01:02:41 -05:00
Junkai-Wu
0d2b201e8c v4.3.5 update. (#2934)
* v4.3.5 update.

* Update copyright to 2026
2026-01-08 15:02:56 -05:00
Andrew Yooeun Chun
eb61c91147 Fix CUDA version checking in examples (#2894)
* examples: update CUDA version requirements in Blackwell examples

* examples: fix comments to specify the correct CUDA version requirement
2026-01-07 00:20:37 -05:00
Aidan Do
670480df3a Fix SFB Layout scale granularity representation (#2924) 2026-01-06 23:55:21 -05:00
Pavel Belevich
b6d7703e02 Refactor binary_op functions to remove unused result parameter 2026-01-02 11:23:43 -05:00
Pavel Belevich
f9bedd9096 Fix print statement for floor division result 2026-01-02 11:15:15 -05:00
questa-quan-wang
3f4c086d09 New example with the TMA prefetch feature, targeting DRAM-latency-bound cases (#2881)
Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>
2025-12-23 15:29:48 +08:00
Junkai-Wu
7f5fe3edf1 v4.3.4 update. (#2892) 2025-12-21 11:49:12 -05:00
dongxiao
331e2f451c add missing condition for sync (#2889) 2025-12-19 11:00:30 +08:00
Linfeng Zheng
f6402fcd5e add pytest support for tutorial gemm (#2826)
* add pytest support for tutorial gemm

* add license
2025-12-05 08:45:01 -05:00
bangyu shen
7252a2d17e remove internal comment (#2841)
Co-authored-by: bangyus <bangyus@nvidia.com>
2025-12-04 10:36:21 -05:00
Junkai-Wu
bc680c7f67 v4.3.2 update. (#2839) 2025-12-04 10:14:32 -05:00
bangyu shen
52ae719eda [examples][CuTeDSL] init commit for distributed examples (#2806)
* init commit for distributed examples

* better OOB protection

* try-import nvshmem for a better error message, and add a README.md introducing nvshmem and multimem instructions

* add some explanation of the Lamport pattern

* enhance f8 output and warn that f8 output can contain NaN

* tell user why we need complicated data conversions in the ref-check part

* tell user we don't support nvshmem device function

---------

Co-authored-by: bangyus <bangyus@nvidia.com>
2025-12-01 22:25:40 -05:00
drazi
ec8daf642d Merge pull request #2809 from whatdhack/patch-1
Update notebook title from 'Tour to' to 'Tour of'
2025-11-28 18:07:34 +08:00
Fung Xie
286781a1fb add requirements.txt 2025-11-27 17:02:27 -08:00
Fung Xie
2664cac685 enhanced the example for tvm-ffi 2025-11-27 17:02:26 -08:00
Fung Xie
b9154d65b3 update examples for tvm-ffi 2025-11-27 17:02:26 -08:00
Fung Xie
afe2f71522 reorganize examples for tvm-ffi 2025-11-27 17:02:26 -08:00
Fung Xie
739fffce27 fix TVM FFI doc and update example 2025-11-27 17:02:26 -08:00