cutlass

mirror of https://github.com/NVIDIA/cutlass.git synced 2026-05-12 01:10:08 +00:00

Author	SHA1	Message	Date
TungtungQia	1d9e1f6d7a	[CuTeDSL] Fix loop carried target scope (#3200) * [CuTeDSL] Bug fix for scf.for's write_args analysis * [CuTeDSL] Add for loop test	2026-05-11 16:02:26 +08:00
Junkai-Wu	cb37157db5	v4.5 tag update (#3202) * Python DSL examples reorganization. * v4.5 tag update.	2026-05-05 20:55:27 -04:00
Nandor Licker	ea46e277d2	Add `absf` and `floor` to `cute.math` (#3156) The ops are already exposed by the underlying dialect.	2026-04-17 08:54:24 +08:00
Nandor Licker	3f3db08a0a	Add support for empty dataclass arguments (#3152) A dataclass with no fields exposed a bug in `extract_dataclass_members`: `@dataclass class Dummy: pass`. The type/return path was inconsistent. This PR fixes the function to support empty dataclasses, which are useful in unions.	2026-04-17 08:47:47 +08:00
Junkai-Wu	a221da7ccf	v4.5 dev update. (#3153)	2026-04-07 12:16:05 -04:00
Johnsonms	982748aa73	[Hopper CuTeDSL] Add grouped GEMM persistent kernel and tests (#3091) Implement grouped GEMM (C_g = A_g x B_g for g groups) on Hopper using CuTe DSL, extending the dense persistent GEMM with per-group TMA descriptor management. Kernel design (grouped_gemm.py): - Warp-specialized pipeline: DMA warp group handles TMA loads and per-group tensormap updates; MMA warp group runs WGMMA and stores C - StaticPersistentGroupTileScheduler for cross-group tile scheduling - Per-group TMA descriptor updates via GMEM or SMEM mode - Supports fp16, fp8 (E4M3FN/E5M2), int8 with mixed A/B dtypes - Configurable tile shapes (128x128, 128x256) and cluster shapes - Fix base TensorMapManager: hoist uniform_smem_ptrs outside predicated block to avoid illegal @P0 R2UR on sm_90a Tests (test/examples/CuTeDSL/hopper/test_grouped_gemm.py): - L0 compile and L1 correctness pytest suite covering tile shapes, dtypes, major modes, cluster shapes, group counts, and mixed sizes - Move to test/examples/CuTeDSL/hopper/ following sm_100a convention - Fix deprecated startdir arg in test_sharding.py pytest hook	2026-03-18 00:40:15 -04:00
Junkai-Wu	1b741cabaa	v4.4.2 update. (#3104)	2026-03-17 00:58:19 -04:00
Johnsonms	e5fcd125a5	[fix] Boolean.__dsl_and__ emits arith.andi directly for i1 operands (#3087) Before this fix, combining two Boolean (i1) DSL values with Python `and` triggered a verbose i1→i32→i1 round-trip in __dsl_and__: arith.extui (×3), arith.select, arith.cmpi ne (×2), i.e. 6 extra MLIR ops. Add a fast path: when both operands are Boolean, delegate directly to __and__, emitting a single arith.andi %a, %b : i1, identical to `&`. Both operators were already semantically equivalent; this fix makes the generated MLIR identical as well. Includes: - repro_dsl_and_bool.py: minimal standalone reproducer / bug-report script - test_dsl_and_fix.py: pytest tests verifying the fixed behaviour	2026-03-05 17:20:26 +08:00
drazi	b9847690c5	Merge pull request #3028 from SzymonOzog/patch-3 Add option to not suffix prints with new line	2026-02-28 10:11:05 +08:00
Junkai-Wu	3bb6e28d3c	v4.4.1 update (#3079)	2026-02-27 13:59:21 -05:00
Gabriel Wu	fc5bbc2dab	Fix typo in cute.nvgpu.warpgroup.mma doc (#2548)	2026-02-27 16:22:55 +08:00
Yuan Xiaolan	395ab575f6	Merge branch 'main' into tvm-ffi	2026-02-14 13:35:28 +08:00
Junkai-Wu	d4bbf728ca	v4.4 tag release update. (#3032)	2026-02-13 23:27:58 -05:00
Szymon Ożóg	beb80e04e1	Add option to not suffix prints with new line	2026-02-13 15:56:50 +01:00
drazi	01687cfba1	Merge pull request #3004 from tridao/add-sub-packed-f32x2 [CuTeDSL] Add sub_packed_f32x2 operation	2026-02-13 20:46:26 +08:00
drazi	5c42d0f28c	Merge pull request #3021 from tridao/clc_no_multicast [Cute-DSL] Add option for issue_clc_query without multicast	2026-02-13 20:45:52 +08:00
Tri Dao	244e8d00d5	[Cute-DSL] Add cute.arch.fmin by calling nvvm	2026-02-11 14:23:09 -05:00
Tri Dao	5b83b34afd	[Cute-DSL] Add option for issue_clc_query without multicast	2026-02-11 14:19:29 -05:00
Tri Dao	51935551fb	[CuTeDSL] Add sub_packed_f32x2 operation Add subtraction operation for packed f32x2 values, following the same pattern as the existing add_packed_f32x2 and mul_packed_f32x2 operations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 21:18:46 +07:00
Junkai-Wu	6b3e607b85	v4.4 release update v2. (#2999)	2026-02-03 20:48:31 -05:00
yuanxiaolan	de161925a5	pass in stream=-1	2026-02-03 11:59:14 +08:00
yuanxiaolan	de198b2419	fix tvm-ffi path in from_dlpack	2026-02-03 11:59:13 +08:00
Xiao Song	acb45938e9	Update nvvm API call from nvvm enum to str (#2985)	2026-01-27 17:28:29 +08:00
Junkai-Wu	9fba3195f9	v4.4 update. (#2979)	2026-01-24 11:46:17 -05:00
Junkai-Wu	0d2b201e8c	v4.3.5 update. (#2934) * v4.3.5 update. * Update copyright to 2026	2026-01-08 15:02:56 -05:00
Wenxuan Tan	f86feb0aa8	Fix idx2crd docstring (#2914) * fix idx2crd docstring * fix * fix	2026-01-07 13:11:38 -05:00
Junkai-Wu	7f5fe3edf1	v4.3.4 update. (#2892)	2025-12-21 11:49:12 -05:00
Junkai-Wu	d3a5492381	v4.3.3 update. (#2868)	2025-12-11 00:26:58 -05:00
Junkai-Wu	bc680c7f67	v4.3.2 update. (#2839)	2025-12-04 10:14:32 -05:00
Fung Xie	03aa211310	update doc	2025-11-27 17:02:59 -08:00
Junkai-Wu	1de3a576cc	v4.3.1 update. (#2817)	2025-11-27 09:49:30 -05:00
Junkai-Wu	8cd5bef43a	v4.3 tag release update. (#2789)	2025-11-20 20:49:44 -05:00
Zekun Fan	a2439551c7	Fixed editable install to depend on CuTeDSL/requirements.txt (#2768) To guarantee wheel version alignment of the source code.	2025-11-14 15:31:49 -08:00
Junkai-Wu	b1d6e2c9b3	v4.3 update. (#2709) * v4.3 update. * Update the cute_dsl_api changelog's doc link * Update version to 4.3.0 * Update the example link * Update doc to encourage user to install DSL from requirements.txt --------- Co-authored-by: Larry Wu <larwu@nvidia.com>	2025-10-21 14:26:30 -04:00
Junkai-Wu	7a6d4ee099	v4.2.1 update. (#2666)	2025-09-23 13:25:43 -04:00
Junkai-Wu	8825e8be4f	Add required changes for github pipeline. (#2648)	2025-09-17 22:22:45 -04:00
Junkai-Wu	6a35b4d22f	v4.2 tag release. (#2638)	2025-09-15 12:21:53 -04:00
Junkai-Wu	fd6cfe1ed0	v4.1 release update v2. (#2481)	2025-07-21 22:03:55 -04:00
Junkai-Wu	a1aaf2300a	v4.1 release	2025-07-03 08:07:53 -04:00
brandonsun	5c6bca0441	Update requirements.txt (#2390) Remove the dev suffix in the wheel version	2025-06-10 02:31:49 -04:00
Junkai-Wu	8bdbfca682	v4.0 update. (#2371)	2025-06-06 02:39:20 -04:00
Ruyman	1ec230c4bf	Fix typo (#2299) Needs == for pip to parse the file	2025-05-15 09:38:42 -04:00
Kihiro Bando	f115c3f854	Release v4.0.0 (#2294)	2025-05-13 15:55:29 -04:00
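
Commit e5fcd125a5 states that `and` and `&` on two Boolean (i1) values were already semantically equivalent and that only the emitted MLIR differed. The sketch below checks that equivalence over the full truth table in plain Python. It does not use CuTeDSL or MLIR: `and_via_round_trip` and `and_direct` are hypothetical names, and the widen/compare/select steps are only an assumed model of the round-trip described in the commit message, not the actual lowering.

```python
from itertools import product


def and_via_round_trip(a: bool, b: bool) -> bool:
    # Assumed model of the slow path: widen both 1-bit values to integers,
    # pick the second operand only when the first is non-zero, then
    # compare the result against zero to get a 1-bit value back.
    a_wide = int(a)
    b_wide = int(b)
    selected = b_wide if a_wide != 0 else a_wide
    return selected != 0


def and_direct(a: bool, b: bool) -> bool:
    # Assumed model of the fast path: a single bitwise AND on 1-bit values.
    return a & b


# One row per input pair: (a, b, round-trip result, direct result).
TRUTH_TABLE = [
    (a, b, and_via_round_trip(a, b), and_direct(a, b))
    for a, b in product((False, True), repeat=2)
]
```

Both columns agree on all four input pairs, which is consistent with the commit's claim that the fast path changes the generated ops and not the result.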

43 Commits
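
Commit 3f3db08a0a concerns dataclasses with no fields. As a minimal sketch using only the standard library, the example below shows that such a class is a valid dataclass and that a field-extraction helper can return an empty list for it. `member_values` is a hypothetical helper written for illustration; it is not CuTeDSL's `extract_dataclass_members`, whose real signature and return type are not shown in the log.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Dummy:
    # No fields, as in the commit message.
    pass


@dataclass
class Pair:
    x: int
    y: int


def member_values(obj):
    """Hypothetical helper: list the field values of a dataclass instance.

    Returns a list in every case, so an empty dataclass yields [] and the
    caller sees the same return type as for a populated one.
    """
    if isinstance(obj, type) or not is_dataclass(obj):
        raise TypeError("expected a dataclass instance")
    return [getattr(obj, f.name) for f in fields(obj)]


EMPTY = member_values(Dummy())
FULL = member_values(Pair(1, 2))
```

Returning the same container type for the empty and non-empty cases is one way to keep the type and return paths consistent, which is the kind of inconsistency the commit message describes.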