cutlass

mirror of https://github.com/NVIDIA/cutlass.git synced 2026-05-12 09:15:56 +00:00

Author	SHA1	Message	Date
Linfeng Zheng	772fbb264e	[CLI] add cutedsl fp16 gemm tutorial from 2 to 6 (#3106) * [CLI] add fp16 gemm tutorial from 2 to 6 * [CLI] refine comments	2026-03-17 10:11:55 +08:00
Blake Ledden	087c84df83	docs: Fix float16 documentation in elementwise_add notebook (#2949) (#3047) The notebook uses float16 tensors but the vectorized kernel documentation incorrectly describes elements as 32-bit and uses 4-element vectorization. Updated to correctly state 16-bit elements with 8-element vectorization for proper 128-bit loads/stores. Signed-off-by: Blake Ledden <bledden@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 10:29:46 +08:00
dePaul Miller	73c59c055c	Support for Group GEMM in CUTLASS Profiler for Geforce and Spark (#3092) Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>	2026-03-06 20:36:29 -05:00
Johnsonms	e5fcd125a5	[fix] Boolean.__dsl_and__ emits arith.andi directly for i1 operands (#3087) Before this fix, combining two Boolean (i1) DSL values with Python `and` triggered a verbose i1→i32→i1 round-trip in __dsl_and__: arith.extui (×3), arith.select, arith.cmpi ne (×2) — 6 extra MLIR ops. Add a fast path: when both operands are Boolean, delegate directly to __and__, emitting a single arith.andi %a, %b : i1 — identical to `&`. Both operators were already semantically equivalent; this fix makes the generated MLIR identical as well. Includes: - repro_dsl_and_bool.py — minimal standalone reproducer / bug-report script - test_dsl_and_fix.py — pytest tests verifying the fixed behaviour	2026-03-05 17:20:26 +08:00
TLescoatTFX	a93d86ec83	Fix finding cuDNN (#2890) * Fix finding cuDNN * Search $CUDNN_PATH first, not last	2026-03-05 09:51:37 +08:00
David W.H. Swenson	49e54f2b23	fix: add_help=False in temporary parser (#2721)	2026-03-02 15:33:42 +08:00
drazi	b9847690c5	Merge pull request #3028 from SzymonOzog/patch-3 Add option to not suffix prints with new line	2026-02-28 10:11:05 +08:00
Junkai-Wu	3bb6e28d3c	v4.4.1 update (#3079)	2026-02-27 13:59:21 -05:00
Tianqi Zhang (张天启)	c651d660d2	fix typo (#3012)	2026-02-27 16:25:35 +08:00
Ziang Li	518327d631	Fix error in Blackwell document of referring to Mxf4 format as NVF4 (#2977) * Update blackwell_functionality.md Fixed the error of referring to Mxf4 format as NVF4. * Correct NVF4 output matrix reference in documentation	2026-02-27 16:25:16 +08:00
StevenYangCC	de67bb7a42	Fix example in CuTe tutorials (#2752)	2026-02-27 16:24:34 +08:00
Neil Kichler	edf2f82c00	Fix register index bug in mma.sync.aligned.m16n8k16 (#2740)	2026-02-27 16:24:18 +08:00
mnehete32	79345359a7	Fix debug typo in sgemm_2.cu and sgemm_sm70.cu (#2678)	2026-02-27 16:23:59 +08:00
zkyue	8b9b3d78df	fix typo in documentation (#2671)	2026-02-27 16:23:37 +08:00
Gabriel Wu	fc5bbc2dab	Fix typo in cute.nvgpu.warpgroup.mma doc (#2548)	2026-02-27 16:22:55 +08:00
Junkai-Wu	057635de5c	Remove redundant dsl example. (#3074)	2026-02-26 08:10:59 -05:00
Junkai-Wu	c213bfdfc1	Remove redundant dsl examples. (#3071) v4.4.0	2026-02-25 22:42:01 -05:00
Haicheng Wu	954503d44c	Bump version to 4.4.0	2026-02-25 00:04:04 -05:00
Haicheng Wu	6c4200f1bc	Bump version from 4.3.5 to 4.4.0	2026-02-25 00:03:23 -05:00
Haicheng Wu	de93e8a4ac	Bump version from 4.3.5 to 4.4.0	2026-02-25 00:03:04 -05:00
Haicheng Wu	b92b9f0d37	Bump version from 4.3.5 to 4.4.0	2026-02-25 00:02:41 -05:00
Haicheng Wu	2aedca6f5e	Bump CUTLASS version to 4.4.0	2026-02-25 00:01:56 -05:00
Haicheng Wu	6450964b57	Update README	2026-02-24 23:55:55 -05:00
Haicheng Wu	284449fa5b	Revise chagnelog	2026-02-24 23:54:56 -05:00
Haicheng Wu	0853d81d70	Revise README	2026-02-24 15:32:17 -05:00
Linfeng Zheng	3476ddb7bd	remove mixed_input_fmha_prefill (#3041)	2026-02-18 07:59:01 -05:00
Yihan Chen	291300ffff	[CuTeDSL] implment a cta-level norm example (both layernorm and rmsnorm) (#3009) * kernel impl * add copyright	2026-02-14 17:54:03 +08:00
aragorn-guan	f9a5f76b7a	Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py (#3027)	2026-02-14 17:51:20 +08:00
drazi	ec7e6cb17b	Merge pull request #2971 from rsmallblue/tvm-ffi [CuTeDSL]fix tvm-ffi path in from_dlpack	2026-02-14 14:14:10 +08:00
Yuan Xiaolan	395ab575f6	Merge branch 'main' into tvm-ffi	2026-02-14 13:35:28 +08:00
Junkai-Wu	d4bbf728ca	v4.4 tag release update. (#3032)	2026-02-13 23:27:58 -05:00
Szymon Ożóg	beb80e04e1	Add option to not suffix prints with new line	2026-02-13 15:56:50 +01:00
drazi	01687cfba1	Merge pull request #3004 from tridao/add-sub-packed-f32x2 [CuTeDSL] Add sub_packed_f32x2 operation	2026-02-13 20:46:26 +08:00
drazi	5c42d0f28c	Merge pull request #3021 from tridao/clc_no_multicast [Cute-DSL] Add option for issue_clc_query without multicast	2026-02-13 20:45:52 +08:00
drazi	1d36152f34	Merge pull request #3022 from tridao/nvvm_fmin [Cute-DSL] Add cute.arch.fmin by calling nvvm	2026-02-13 20:45:08 +08:00
Tri Dao	244e8d00d5	[Cute-DSL] Add cute.arch.fmin by calling nvvm	2026-02-11 14:23:09 -05:00
Tri Dao	5b83b34afd	[Cute-DSL] Add option for issue_clc_query without multicast	2026-02-11 14:19:29 -05:00
aragorn-guan	8dbce01473	[CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store (#2970)	2026-02-11 11:54:00 +08:00
drazi	71aa7a0abc	Merge pull request #2919 from pbelevich/patch-1 Refactor binary_op functions to remove unused result parameter	2026-02-11 11:48:58 +08:00
Tri Dao	51935551fb	[CuTeDSL] Add sub_packed_f32x2 operation Add subtraction operation for packed f32x2 values, following the same pattern as the existing add_packed_f32x2 and mul_packed_f32x2 operations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 21:18:46 +07:00
Junkai-Wu	6b3e607b85	v4.4 release update v2. (#2999)	2026-02-03 20:48:31 -05:00
yuanxiaolan	de161925a5	pass in stream=-1	2026-02-03 11:59:14 +08:00
yuanxiaolan	de198b2419	fix tvm-ffi path in from_dlpack	2026-02-03 11:59:13 +08:00
Hua Huang	1cfbb53a23	[CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator (#2995) * Fix: SM100 block-scale gemm overlapping accumulator Signed-off-by: Hua Huang <huah@nvidia.com> * Also include threads_per_warp fix Signed-off-by: Hua Huang <huah@nvidia.com> --------- Signed-off-by: Hua Huang <huah@nvidia.com>	2026-02-03 11:01:41 +08:00
dongxiao	a4eb0e05f6	fix performance inssues in cute-dsl examples for 4.4-ctk13.1 release (#2988) * fix grouped gemm * fix mixed input gemm * fix mixed input grouped gemm * fix version checking * use advanced compiler options * fix comment * rename advanced compiler configs to adcanced compiler control * fix comment * fix name * fix name	2026-01-30 13:31:04 +08:00
myu-guo	d252b01300	fix performance regression in cute-dsl examples for 4.4-ctk13.1 release (#2990) * fix regression with cu13.1 * update	2026-01-30 13:30:49 +08:00
Xiao Song	acb45938e9	Update nvvm API call from nvvm enum to str (#2985)	2026-01-27 17:28:29 +08:00
Xiao Song	7a14467776	update api usage (#2969)	2026-01-27 15:33:22 +08:00
drazi	51f82812ec	Merge pull request #2891 from ColinPeppler/main docs: note when DSL dumps are populated	2026-01-26 17:38:27 -08:00
Junkai-Wu	9fba3195f9	v4.4 update. (#2979)	2026-01-24 11:46:17 -05:00

853 Commits