Linfeng Zheng
772fbb264e
[CLI] add cutedsl fp16 gemm tutorial from 2 to 6 ( #3106 )
* [CLI] add fp16 gemm tutorial from 2 to 6
* [CLI] refine comments
2026-03-17 10:11:55 +08:00
Blake Ledden
087c84df83
docs: Fix float16 documentation in elementwise_add notebook ( #2949 ) ( #3047 )
The notebook uses float16 tensors but the vectorized kernel documentation
incorrectly describes elements as 32-bit and uses 4-element vectorization.
Updated to correctly state 16-bit elements with 8-element vectorization
for proper 128-bit loads/stores.
Signed-off-by: Blake Ledden <bledden@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 10:29:46 +08:00
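The arithmetic behind that correction can be checked with a short sketch. The helper name `elements_per_vector` is hypothetical and is not part of the notebook; it only divides the access width by the element width.

```python
def elements_per_vector(element_bits: int, vector_bits: int = 128) -> int:
    """Return how many elements fit in one vectorized load/store.

    Hypothetical helper for illustration; not an API from the notebook.
    """
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("vector width must be a positive multiple of the element width")
    return vector_bits // element_bits


# float16 elements are 16 bits wide, so a 128-bit access carries 8 of them.
fp16_count = elements_per_vector(16)
# 32-bit elements would give the 4-element figure the old text described.
fp32_count = elements_per_vector(32)
```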
dePaul Miller
73c59c055c
Support for Group GEMM in CUTLASS Profiler for Geforce and Spark ( #3092 )
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-03-06 20:36:29 -05:00
Johnsonms
e5fcd125a5
[fix] Boolean.__dsl_and__ emits arith.andi directly for i1 operands ( #3087 )
Before this fix, combining two Boolean (i1) DSL values with Python `and`
triggered a verbose i1→i32→i1 round-trip in __dsl_and__:
arith.extui (×3), arith.select, arith.cmpi ne (×2) — 6 extra MLIR ops.
Add a fast path: when both operands are Boolean, delegate directly to
__and__, emitting a single arith.andi %a, %b : i1 — identical to `&`.
Both operators were already semantically equivalent; this fix makes the
generated MLIR identical as well.
Includes:
- repro_dsl_and_bool.py — minimal standalone reproducer / bug-report script
- test_dsl_and_fix.py — pytest tests verifying the fixed behaviour
2026-03-05 17:20:26 +08:00
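The fast path described in that commit can be pictured with a toy model. This is only a sketch under stated assumptions: the `Boolean` class, the `dsl_and` function, and the op trace below are hypothetical stand-ins written in plain Python, not the CuTe DSL implementation, and the order of the round-trip ops is assumed.

```python
class Boolean:
    """Toy stand-in for an i1 DSL value; appends emitted op names to a shared list."""

    def __init__(self, name, trace):
        self.name = name
        self.trace = trace

    def __and__(self, other):
        # Models the single op the commit message says `&` emits.
        self.trace.append("arith.andi")
        return Boolean(f"({self.name} & {other.name})", self.trace)


# The six ops the commit message attributes to the i1 -> i32 -> i1 round trip
# (order assumed for illustration).
ROUND_TRIP_OPS = ["arith.extui"] * 3 + ["arith.select"] + ["arith.cmpi ne"] * 2


def dsl_and(lhs, rhs, fast_path=True):
    """Toy version of the `and` hook: delegate to `&` when both operands are Boolean."""
    if fast_path and isinstance(lhs, Boolean) and isinstance(rhs, Boolean):
        return lhs & rhs
    # Generic path: record the round-trip ops instead of a single andi.
    lhs.trace.extend(ROUND_TRIP_OPS)
    return Boolean(f"({lhs.name} and {rhs.name})", lhs.trace)


def emitted_ops(fast_path):
    """Run one `and` on two fresh Boolean values and return the recorded ops."""
    trace = []
    a, b = Boolean("a", trace), Boolean("b", trace)
    dsl_and(a, b, fast_path=fast_path)
    return trace
```

Calling `emitted_ops(True)` records one op, while `emitted_ops(False)` records the six round-trip ops, which mirrors the before/after difference the commit message describes.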
TLescoatTFX
a93d86ec83
Fix finding cuDNN ( #2890 )
* Fix finding cuDNN
* Search $CUDNN_PATH first, not last
2026-03-05 09:51:37 +08:00
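The search-order change can be illustrated with a small sketch. It assumes a plain list of candidate directories; the function name `cudnn_search_dirs` and the fallback locations are hypothetical and do not reproduce the project's actual build logic.

```python
import os

# Hypothetical fallback locations, chosen only for illustration.
DEFAULT_CUDNN_DIRS = ["/usr/local/cuda", "/usr"]


def cudnn_search_dirs(environ=None, defaults=DEFAULT_CUDNN_DIRS):
    """Return candidate directories with $CUDNN_PATH first, then the defaults."""
    environ = os.environ if environ is None else environ
    dirs = []
    user_dir = environ.get("CUDNN_PATH")
    if user_dir:
        # An explicit user setting takes priority over built-in guesses.
        dirs.append(user_dir)
    # Append defaults, skipping any entry already listed.
    dirs.extend(d for d in defaults if d not in dirs)
    return dirs
```

Placing the environment variable first means an explicitly chosen installation wins over any copy found in a default location.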
David W.H. Swenson
49e54f2b23
fix: add_help=False in temporary parser ( #2721 )
2026-03-02 15:33:42 +08:00
drazi
b9847690c5
Merge pull request #3028 from SzymonOzog/patch-3
Add option to not suffix prints with new line
2026-02-28 10:11:05 +08:00
Junkai-Wu
3bb6e28d3c
v4.4.1 update ( #3079 )
2026-02-27 13:59:21 -05:00
Tianqi Zhang (张天启)
c651d660d2
fix typo ( #3012 )
2026-02-27 16:25:35 +08:00
Ziang Li
518327d631
Fix error in Blackwell document of referring to Mxf4 format as NVF4 ( #2977 )
* Update blackwell_functionality.md
Fixed the error of referring to Mxf4 format as NVF4.
* Correct NVF4 output matrix reference in documentation
2026-02-27 16:25:16 +08:00
StevenYangCC
de67bb7a42
Fix example in CuTe tutorials ( #2752 )
2026-02-27 16:24:34 +08:00
Neil Kichler
edf2f82c00
Fix register index bug in mma.sync.aligned.m16n8k16 ( #2740 )
2026-02-27 16:24:18 +08:00
mnehete32
79345359a7
Fix debug typo in sgemm_2.cu and sgemm_sm70.cu ( #2678 )
2026-02-27 16:23:59 +08:00
zkyue
8b9b3d78df
fix typo in documentation ( #2671 )
2026-02-27 16:23:37 +08:00
Gabriel Wu
fc5bbc2dab
Fix typo in cute.nvgpu.warpgroup.mma doc ( #2548 )
2026-02-27 16:22:55 +08:00
Junkai-Wu
057635de5c
Remove redundant dsl example. ( #3074 )
2026-02-26 08:10:59 -05:00
Junkai-Wu
c213bfdfc1
Remove redundant dsl examples. ( #3071 )
v4.4.0
2026-02-25 22:42:01 -05:00
Haicheng Wu
954503d44c
Bump version to 4.4.0
2026-02-25 00:04:04 -05:00
Haicheng Wu
6c4200f1bc
Bump version from 4.3.5 to 4.4.0
2026-02-25 00:03:23 -05:00
Haicheng Wu
de93e8a4ac
Bump version from 4.3.5 to 4.4.0
2026-02-25 00:03:04 -05:00
Haicheng Wu
b92b9f0d37
Bump version from 4.3.5 to 4.4.0
2026-02-25 00:02:41 -05:00
Haicheng Wu
2aedca6f5e
Bump CUTLASS version to 4.4.0
2026-02-25 00:01:56 -05:00
Haicheng Wu
6450964b57
Update README
2026-02-24 23:55:55 -05:00
Haicheng Wu
284449fa5b
Revise changelog
2026-02-24 23:54:56 -05:00
Haicheng Wu
0853d81d70
Revise README
2026-02-24 15:32:17 -05:00
Linfeng Zheng
3476ddb7bd
remove mixed_input_fmha_prefill ( #3041 )
2026-02-18 07:59:01 -05:00
Yihan Chen
291300ffff
[CuTeDSL] implement a cta-level norm example (both layernorm and rmsnorm) ( #3009 )
* kernel impl
* add copyright
2026-02-14 17:54:03 +08:00
aragorn-guan
f9a5f76b7a
Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py ( #3027 )
2026-02-14 17:51:20 +08:00
drazi
ec7e6cb17b
Merge pull request #2971 from rsmallblue/tvm-ffi
[CuTeDSL]fix tvm-ffi path in from_dlpack
2026-02-14 14:14:10 +08:00
Yuan Xiaolan
395ab575f6
Merge branch 'main' into tvm-ffi
2026-02-14 13:35:28 +08:00
Junkai-Wu
d4bbf728ca
v4.4 tag release update. ( #3032 )
2026-02-13 23:27:58 -05:00
Szymon Ożóg
beb80e04e1
Add option to not suffix prints with new line
2026-02-13 15:56:50 +01:00
drazi
01687cfba1
Merge pull request #3004 from tridao/add-sub-packed-f32x2
[CuTeDSL] Add sub_packed_f32x2 operation
2026-02-13 20:46:26 +08:00
drazi
5c42d0f28c
Merge pull request #3021 from tridao/clc_no_multicast
[Cute-DSL] Add option for issue_clc_query without multicast
2026-02-13 20:45:52 +08:00
drazi
1d36152f34
Merge pull request #3022 from tridao/nvvm_fmin
[Cute-DSL] Add cute.arch.fmin by calling nvvm
2026-02-13 20:45:08 +08:00
Tri Dao
244e8d00d5
[Cute-DSL] Add cute.arch.fmin by calling nvvm
2026-02-11 14:23:09 -05:00
Tri Dao
5b83b34afd
[Cute-DSL] Add option for issue_clc_query without multicast
2026-02-11 14:19:29 -05:00
aragorn-guan
8dbce01473
[CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store ( #2970 )
2026-02-11 11:54:00 +08:00
drazi
71aa7a0abc
Merge pull request #2919 from pbelevich/patch-1
Refactor binary_op functions to remove unused result parameter
2026-02-11 11:48:58 +08:00
Tri Dao
51935551fb
[CuTeDSL] Add sub_packed_f32x2 operation
Add subtraction operation for packed f32x2 values, following the same
pattern as the existing add_packed_f32x2 and mul_packed_f32x2 operations.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:18:46 +07:00
Junkai-Wu
6b3e607b85
v4.4 release update v2. ( #2999 )
2026-02-03 20:48:31 -05:00
yuanxiaolan
de161925a5
pass in stream=-1
2026-02-03 11:59:14 +08:00
yuanxiaolan
de198b2419
fix tvm-ffi path in from_dlpack
2026-02-03 11:59:13 +08:00
Hua Huang
1cfbb53a23
[CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator ( #2995 )
* Fix: SM100 block-scale gemm overlapping accumulator
Signed-off-by: Hua Huang <huah@nvidia.com>
* Also include threads_per_warp fix
Signed-off-by: Hua Huang <huah@nvidia.com>
---------
Signed-off-by: Hua Huang <huah@nvidia.com>
2026-02-03 11:01:41 +08:00
dongxiao
a4eb0e05f6
fix performance issues in cute-dsl examples for 4.4-ctk13.1 release ( #2988 )
* fix grouped gemm
* fix mixed input gemm
* fix mixed input grouped gemm
* fix version checking
* use advanced compiler options
* fix comment
* rename advanced compiler configs to advanced compiler control
* fix comment
* fix name
* fix name
2026-01-30 13:31:04 +08:00
myu-guo
d252b01300
fix performance regression in cute-dsl examples for 4.4-ctk13.1 release ( #2990 )
* fix regression with cu13.1
* update
2026-01-30 13:30:49 +08:00
Xiao Song
acb45938e9
Update nvvm API call from nvvm enum to str ( #2985 )
2026-01-27 17:28:29 +08:00
Xiao Song
7a14467776
update api usage ( #2969 )
2026-01-27 15:33:22 +08:00
drazi
51f82812ec
Merge pull request #2891 from ColinPeppler/main
docs: note when DSL dumps are populated
2026-01-26 17:38:27 -08:00
Junkai-Wu
9fba3195f9
v4.4 update. ( #2979 )
2026-01-24 11:46:17 -05:00