dePaul Miller
73c59c055c
Support for Group GEMM in CUTLASS Profiler for Geforce and Spark (#3092)
...
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-03-06 20:36:29 -05:00
Johnsonms
e5fcd125a5
[fix] Boolean.__dsl_and__ emits arith.andi directly for i1 operands (#3087)
...
Before this fix, combining two Boolean (i1) DSL values with Python `and`
triggered a verbose i1→i32→i1 round-trip in __dsl_and__:
arith.extui (×3), arith.select, arith.cmpi ne (×2) — 6 extra MLIR ops.
Add a fast path: when both operands are Boolean, delegate directly to
__and__, emitting a single arith.andi %a, %b : i1 — identical to `&`.
Both operators were already semantically equivalent; this fix makes the
generated MLIR identical as well.
Includes:
- repro_dsl_and_bool.py — minimal standalone reproducer / bug-report script
- test_dsl_and_fix.py — pytest tests verifying the fixed behaviour
2026-03-05 17:20:26 +08:00
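The fast path described in the e5fcd125a5 message above can be modeled with a toy sketch. Everything in it is hypothetical: `Boolean`, `emit`, `dsl_and`, and `dsl_and_slow` are stand-ins that only record op names, not the CuTe DSL implementation, and the order of the ops on the slow path is an assumption (only the op names and counts come from the commit message).

```python
# Toy model only: hypothetical classes, NOT the CuTe DSL implementation.
EMITTED = []  # names of the "MLIR ops" this toy has emitted so far


def emit(op_name):
    """Record one emitted op and return a placeholder SSA name."""
    EMITTED.append(op_name)
    return f"%{len(EMITTED)}"


class Boolean:
    """Stand-in for an i1-typed DSL value."""

    def __init__(self, ssa):
        self.ssa = ssa

    def __and__(self, other):
        # `&` on two i1 values: a single arith.andi.
        return Boolean(emit("arith.andi"))

    def dsl_and_slow(self, other):
        # Generic path named in the commit message: the i1 -> i32 -> i1 round-trip.
        # Op order is assumed; only the names and counts come from the message.
        ops = ["arith.extui"] * 3 + ["arith.select"] + ["arith.cmpi ne"] * 2
        last = None
        for op in ops:
            last = emit(op)
        return Boolean(last)

    def dsl_and(self, other):
        # Fast path: both operands are Boolean, so delegate to __and__.
        if isinstance(other, Boolean):
            return self & other
        return self.dsl_and_slow(other)
```

In this model, calling `dsl_and` on two `Boolean` values records one `arith.andi`, while forcing `dsl_and_slow` records the six ops listed in the message.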
David W.H. Swenson
49e54f2b23
fix: add_help=False in temporary parser (#2721)
2026-03-02 15:33:42 +08:00
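The diff for 49e54f2b23 is not shown in this log, so the following is only a sketch of the general argparse pattern the title refers to. The function name `peek_config` and the `--config` option are hypothetical. A temporary parser built with `add_help=False` does not register `-h`/`--help`, so `parse_known_args` hands those flags back to the caller instead of printing help and exiting early.

```python
import argparse


def peek_config(argv):
    """Read --config with a temporary parser and return (config, leftover args)."""
    # add_help=False: this pre-parser must not consume -h/--help itself,
    # otherwise it would print its own partial help text and exit.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--config", default="default.cfg")  # hypothetical option
    known, remaining = pre.parse_known_args(argv)
    return known.config, remaining
```

The leftover arguments, including any `-h`, can then be passed to the full parser, which owns the complete help text.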
drazi
b9847690c5
Merge pull request #3028 from SzymonOzog/patch-3
...
Add option to not suffix prints with new line
2026-02-28 10:11:05 +08:00
Junkai-Wu
3bb6e28d3c
v4.4.1 update (#3079)
2026-02-27 13:59:21 -05:00
Gabriel Wu
fc5bbc2dab
Fix typo in cute.nvgpu.warpgroup.mma doc (#2548)
2026-02-27 16:22:55 +08:00
Haicheng Wu
954503d44c
Bump version to 4.4.0
2026-02-25 00:04:04 -05:00
Haicheng Wu
6c4200f1bc
Bump version from 4.3.5 to 4.4.0
2026-02-25 00:03:23 -05:00
Haicheng Wu
de93e8a4ac
Bump version from 4.3.5 to 4.4.0
2026-02-25 00:03:04 -05:00
Haicheng Wu
b92b9f0d37
Bump version from 4.3.5 to 4.4.0
2026-02-25 00:02:41 -05:00
Yuan Xiaolan
395ab575f6
Merge branch 'main' into tvm-ffi
2026-02-14 13:35:28 +08:00
Junkai-Wu
d4bbf728ca
v4.4 tag release update. (#3032)
2026-02-13 23:27:58 -05:00
Szymon Ożóg
beb80e04e1
Add option to not suffix prints with new line
2026-02-13 15:56:50 +01:00
drazi
01687cfba1
Merge pull request #3004 from tridao/add-sub-packed-f32x2
...
[CuTeDSL] Add sub_packed_f32x2 operation
2026-02-13 20:46:26 +08:00
drazi
5c42d0f28c
Merge pull request #3021 from tridao/clc_no_multicast
...
[Cute-DSL] Add option for issue_clc_query without multicast
2026-02-13 20:45:52 +08:00
Tri Dao
244e8d00d5
[Cute-DSL] Add cute.arch.fmin by calling nvvm
2026-02-11 14:23:09 -05:00
Tri Dao
5b83b34afd
[Cute-DSL] Add option for issue_clc_query without multicast
2026-02-11 14:19:29 -05:00
Tri Dao
51935551fb
[CuTeDSL] Add sub_packed_f32x2 operation
...
Add subtraction operation for packed f32x2 values, following the same
pattern as the existing add_packed_f32x2 and mul_packed_f32x2 operations.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:18:46 +07:00
Junkai-Wu
6b3e607b85
v4.4 release update v2. (#2999)
2026-02-03 20:48:31 -05:00
yuanxiaolan
de161925a5
pass in stream=-1
2026-02-03 11:59:14 +08:00
yuanxiaolan
de198b2419
fix tvm-ffi path in from_dlpack
2026-02-03 11:59:13 +08:00
Xiao Song
acb45938e9
Update nvvm API call from nvvm enum to str (#2985)
2026-01-27 17:28:29 +08:00
Junkai-Wu
9fba3195f9
v4.4 update. (#2979)
2026-01-24 11:46:17 -05:00
Aidan Do
3f5bafb326
[Cutlass profiler] Fix SM100 FP8 nosmem epilogue shape_div “Divisibility Condition” for non‑multiple‑of‑64 N tiles (#2946)
...
* .
* .
* .
* .
* .
* .
* .
2026-01-20 15:27:34 +08:00
Junkai-Wu
0d2b201e8c
v4.3.5 update. (#2934)
...
* v4.3.5 update.
* Update copyright to 2026
2026-01-08 15:02:56 -05:00
Wenxuan Tan
f86feb0aa8
Fix idx2crd docstring (#2914)
...
* fix idx2crd docstring
* fix
* fix
2026-01-07 13:11:38 -05:00
Junkai-Wu
7f5fe3edf1
v4.3.4 update. (#2892)
2025-12-21 11:49:12 -05:00
Haicheng Wu
d4e16f5d4e
Bump version from 4.2.1 to 4.3.3
2025-12-11 23:58:38 -05:00
Junkai-Wu
d3a5492381
v4.3.3 update. (#2868)
2025-12-11 00:26:58 -05:00
Haicheng Wu
c4744f706e
Bump version from 4.2.1 to 4.3.2
2025-12-05 13:45:16 -05:00
Junkai-Wu
bc680c7f67
v4.3.2 update. (#2839)
2025-12-04 10:14:32 -05:00
Haicheng Wu
5e847d37c4
Bump version from 4.2.1 to 4.3.1
2025-12-01 22:13:19 -05:00
Haicheng Wu
f16068b4db
Bump version from 4.2.0 to 4.3.1
2025-12-01 22:12:20 -05:00
Haicheng Wu
1acfe141af
Bump version from 4.2.1 to 4.3.1
2025-12-01 22:11:13 -05:00
Fung Xie
03aa211310
update doc
2025-11-27 17:02:59 -08:00
Junkai-Wu
1de3a576cc
v4.3.1 update. (#2817)
2025-11-27 09:49:30 -05:00
Haicheng Wu
e67e63c331
Bump version from 4.2.1 to 4.3.0
2025-11-24 16:36:06 -05:00
Haicheng Wu
ddaf12c1b1
Bump version from 4.2.0 to 4.3.0
2025-11-24 16:35:27 -05:00
Haicheng Wu
7967ce5f83
Bump version to 4.3.0
2025-11-24 16:34:45 -05:00
Junkai-Wu
8cd5bef43a
v4.3 tag release update. (#2789)
2025-11-20 20:49:44 -05:00
Zekun Fan
a2439551c7
Fixed editable install to depend on CuTeDSL/requirements.txt (#2768)
...
To guarantee wheel version alignment of the source code.
2025-11-14 15:31:49 -08:00
Junkai-Wu
b1d6e2c9b3
v4.3 update. (#2709)
...
* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update doc to encourage user to install DSL from requirements.txt
---------
Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
Haicheng Wu
f874df19ac
4.2.1 update
2025-09-23 13:45:13 -07:00
Junkai-Wu
7a6d4ee099
v4.2.1 update. (#2666)
2025-09-23 13:25:43 -04:00
Jack Kosaian
b234a8c024
Rename python/cutlass to python/cutlass_cppgen (#2652)
2025-09-18 14:26:57 -04:00
Junkai-Wu
8825e8be4f
Add required changes for github pipeline. (#2648)
2025-09-17 22:22:45 -04:00
Junkai-Wu
6a35b4d22f
v4.2 tag release. (#2638)
2025-09-15 12:21:53 -04:00
Harrison Barclay
b2dd65dc86
more robust imports in heuristics.py and heuristics_provider.py (#2596)
2025-08-28 22:32:55 -04:00
Junkai-Wu
a49a78ffef
v4.2 release. (#2587)
...
* Fix default cluster callback values to 1 to avoid profiler failure when these values are not set in command line.
* v4.2 release.
2025-08-22 18:11:24 -04:00
melonedo
ec18e8043b
Make swizzle in pycute work (#2553)
2025-08-19 22:21:00 -04:00