cutlass

mirror of https://github.com/NVIDIA/cutlass.git synced 2026-05-12 17:25:45 +00:00

Author	SHA1	Message	Date
Linfeng Zheng	772fbb264e	[CLI] add cutedsl fp16 gemm tutorial from 2 to 6 (#3106) * [CLI] add fp16 gemm tutorial from 2 to 6 * [CLI] refine comments	2026-03-17 10:11:55 +08:00
Blake Ledden	087c84df83	docs: Fix float16 documentation in elementwise_add notebook (#2949) (#3047) The notebook uses float16 tensors but the vectorized kernel documentation incorrectly describes elements as 32-bit and uses 4-element vectorization. Updated to correctly state 16-bit elements with 8-element vectorization for proper 128-bit loads/stores. Signed-off-by: Blake Ledden <bledden@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 10:29:46 +08:00
dePaul Miller	73c59c055c	Support for Group GEMM in CUTLASS Profiler for Geforce and Spark (#3092) Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>	2026-03-06 20:36:29 -05:00
Junkai-Wu	3bb6e28d3c	v4.4.1 update (#3079)	2026-02-27 13:59:21 -05:00
Tianqi Zhang (张天启)	c651d660d2	fix typo (#3012)	2026-02-27 16:25:35 +08:00
mnehete32	79345359a7	Fix debug typo in sgemm_2.cu and sgemm_sm70.cu (#2678)	2026-02-27 16:23:59 +08:00
Junkai-Wu	057635de5c	Remove redundant dsl example. (#3074)	2026-02-26 08:10:59 -05:00
Junkai-Wu	c213bfdfc1	Remove redundant dsl examples. (#3071)	2026-02-25 22:42:01 -05:00
Linfeng Zheng	3476ddb7bd	remove mixed_input_fmha_prefill (#3041)	2026-02-18 07:59:01 -05:00
Yihan Chen	291300ffff	[CuTeDSL] implment a cta-level norm example (both layernorm and rmsnorm) (#3009) * kernel impl * add copyright	2026-02-14 17:54:03 +08:00
aragorn-guan	f9a5f76b7a	Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py (#3027)	2026-02-14 17:51:20 +08:00
Junkai-Wu	d4bbf728ca	v4.4 tag release update. (#3032)	2026-02-13 23:27:58 -05:00
aragorn-guan	8dbce01473	[CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store (#2970)	2026-02-11 11:54:00 +08:00
drazi	71aa7a0abc	Merge pull request #2919 from pbelevich/patch-1 Refactor binary_op functions to remove unused result parameter	2026-02-11 11:48:58 +08:00
Junkai-Wu	6b3e607b85	v4.4 release update v2. (#2999)	2026-02-03 20:48:31 -05:00
Hua Huang	1cfbb53a23	[CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator (#2995) * Fix: SM100 block-scale gemm overlapping accumulator Signed-off-by: Hua Huang <huah@nvidia.com> * Also include threads_per_warp fix Signed-off-by: Hua Huang <huah@nvidia.com> --------- Signed-off-by: Hua Huang <huah@nvidia.com>	2026-02-03 11:01:41 +08:00
dongxiao	a4eb0e05f6	fix performance inssues in cute-dsl examples for 4.4-ctk13.1 release (#2988) * fix grouped gemm * fix mixed input gemm * fix mixed input grouped gemm * fix version checking * use advanced compiler options * fix comment * rename advanced compiler configs to adcanced compiler control * fix comment * fix name * fix name	2026-01-30 13:31:04 +08:00
myu-guo	d252b01300	fix performance regression in cute-dsl examples for 4.4-ctk13.1 release (#2990) * fix regression with cu13.1 * update	2026-01-30 13:30:49 +08:00
Xiao Song	acb45938e9	Update nvvm API call from nvvm enum to str (#2985)	2026-01-27 17:28:29 +08:00
Xiao Song	7a14467776	update api usage (#2969)	2026-01-27 15:33:22 +08:00
Junkai-Wu	9fba3195f9	v4.4 update. (#2979)	2026-01-24 11:46:17 -05:00
Johnsonms	0edaa6e47d	Fix out-of-bounds TMA access in wgmma_tma_sm90 tutorial (#2945)	2026-01-23 12:54:12 +08:00
Aidan Do	431d070fcb	[docs] Add additional tip for generating less kernels in blockwise (#2940) - Running without this generates a lot of kernels - Clarified CMake configuration for selecting GEMM kernels and added details on kernel generation granularity.	2026-01-23 12:53:51 +08:00
Brian K. Ryu	147f5673d0	New RMS Norm example with unit tests (#2917) * Add rmsnorm example * Address reviewer comments. (1) use the cute.runtime definition directly. (2) use the nvvm_wrapper's warp reduce directly * Separate out reduce.py * Change copyright notice years	2026-01-13 09:05:31 +08:00
Johnsonms	8c52459504	Fix incorrect tensor layout strides in Blackwell MMA tutorial comments (#2921)	2026-01-09 01:02:41 -05:00
Junkai-Wu	0d2b201e8c	v4.3.5 update. (#2934) * v4.3.5 update. * Update copyright to 2026	2026-01-08 15:02:56 -05:00
Andrew Yooeun Chun	eb61c91147	Fix CUDA version checking in examples (#2894) * examples: update CUDA version requirements in Blackwell examples * examples: fix comments to specify the correct CUDA version requirement	2026-01-07 00:20:37 -05:00
Aidan Do	670480df3a	Fix SFB Layout scale granularity representation (#2924)	2026-01-06 23:55:21 -05:00
Pavel Belevich	b6d7703e02	Refactor binary_op functions to remove unused result parameter	2026-01-02 11:23:43 -05:00
Pavel Belevich	f9bedd9096	Fix print statement for floor division result	2026-01-02 11:15:15 -05:00
questa-quan-wang	3f4c086d09	new example with TMA prefetch feature targeting for DRAM latency bound cases (#2881) Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>	2025-12-23 15:29:48 +08:00
Junkai-Wu	7f5fe3edf1	v4.3.4 update. (#2892)	2025-12-21 11:49:12 -05:00
dongxiao	331e2f451c	add missing condition for sync (#2889)	2025-12-19 11:00:30 +08:00
Linfeng Zheng	f6402fcd5e	add pytest support for tutorial gemm (#2826) * add pytest support for tutorial gemm * add license	2025-12-05 08:45:01 -05:00
bangyu shen	7252a2d17e	remove internal comment (#2841) Co-authored-by: bangyus <bangyus@nvidia.com>	2025-12-04 10:36:21 -05:00
Junkai-Wu	bc680c7f67	v4.3.2 update. (#2839)	2025-12-04 10:14:32 -05:00
bangyu shen	52ae719eda	[examples][CuTeDSL] init commit for distirbuted examples (#2806) * init commit for distirbuted examples * better OOB protection * and try import to nvshmem for better error message and a READMME.md to introduce nvshmem and multimem instructions * add some lamport explanation * enhance f8 output and warn that f8 output can have nan in it * tell user why we need complicate data conversions in ref check part * tell user we don't support nvshmem device function --------- Co-authored-by: bangyus <bangyus@nvidia.com>	2025-12-01 22:25:40 -05:00
drazi	ec8daf642d	Merge pull request #2809 from whatdhack/patch-1 Update notebook title from 'Tour to' to 'Tour of'	2025-11-28 18:07:34 +08:00
Fung Xie	286781a1fb	add requirements.txt	2025-11-27 17:02:27 -08:00
Fung Xie	2664cac685	enhanced the example for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	b9154d65b3	update examples for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	afe2f71522	reorganize examples for tvm-ffi	2025-11-27 17:02:26 -08:00
Fung Xie	739fffce27	fix TVM FFI doc and update example	2025-11-27 17:02:26 -08:00
Junkai-Wu	1de3a576cc	v4.3.1 update. (#2817)	2025-11-27 09:49:30 -05:00
Shreya Gaur	2052fd3885	Blockscaled Ragged Contiguous Grouped Gemm for MoEs (#2790) * Adding blockscaled ragged contiguous grouped gemm for MoEs * cleaning up the example * introduction to example improved --------- Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>	2025-11-26 20:16:49 -05:00
whatdhack	4a55379686	Update notebook title from 'Tour to' to 'Tour of' Grammar check . LLM's can quickly clean up such issues.	2025-11-24 20:11:14 -08:00
Junkai-Wu	8cd5bef43a	v4.3 tag release update. (#2789)	2025-11-20 20:49:44 -05:00
Linfeng Zheng	406e078b29	add a notebook for tour to sol gemm (#2780) * add tour to sol gemm notebook * change some typos * change some typos	2025-11-20 09:41:01 -05:00
Mindy Li	06b6bd7d7b	remove cute dsl pdl example.	2025-11-09 21:47:00 -08:00
Linfeng Zheng	2252254ce2	Add tutorial fp16_gemm_1 (#2750) * Add tutorial fp16_gemm_1 * refine * refine * refine * revert changes in fp16_gemm_0.py	2025-11-06 22:40:09 -05:00


204 Commits