Commit Graph

855 Commits

Author SHA1 Message Date
Johnsonms
982748aa73 [Hopper CuTeDSL] Add grouped GEMM persistent kernel and tests (#3091)
Implement grouped GEMM (C_g = A_g x B_g for g groups) on Hopper using
CuTe DSL, extending the dense persistent GEMM with per-group TMA
descriptor management.

Kernel design (grouped_gemm.py):
- Warp-specialized pipeline: DMA warp group handles TMA loads and
  per-group tensormap updates; MMA warp group runs WGMMA and stores C
- StaticPersistentGroupTileScheduler for cross-group tile scheduling
- Per-group TMA descriptor updates via GMEM or SMEM mode
- Supports fp16, fp8 (E4M3FN/E5M2), int8 with mixed A/B dtypes
- Configurable tile shapes (128x128, 128x256) and cluster shapes
- Fix base TensorMapManager: hoist uniform_smem_ptrs outside predicated
  block to avoid illegal @P0 R2UR on sm_90a

Tests (test/examples/CuTeDSL/hopper/test_grouped_gemm.py):
- L0 compile and L1 correctness pytest suite covering tile shapes,
  dtypes, major modes, cluster shapes, group counts, and mixed sizes
- Move to test/examples/CuTeDSL/hopper/ following sm_100a convention
- Fix deprecated startdir arg in test_sharding.py pytest hook
2026-03-18 00:40:15 -04:00
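
For readers unfamiliar with the operation, the following is a minimal host-side reference sketch of what "grouped GEMM" computes (C_g = A_g x B_g independently for each group g, with per-group shapes allowed to differ). It is plain Python and is not the kernel from this commit; the names `matmul` and `grouped_gemm_reference` are hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Triple-loop product of an (M x K) matrix and a (K x N) matrix."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def grouped_gemm_reference(a_groups: List[Matrix], b_groups: List[Matrix]) -> List[Matrix]:
    """Compute C_g = A_g x B_g for every group; each group may have its own shape."""
    if len(a_groups) != len(b_groups):
        raise ValueError("A and B must have the same number of groups")
    return [matmul(a, b) for a, b in zip(a_groups, b_groups)]


# Two groups with different problem sizes: (1x2)x(2x1) and (2x2)x(2x2).
groups_a = [[[1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]]]
groups_b = [[[3.0], [4.0]], [[5.0, 6.0], [7.0, 8.0]]]
result = grouped_gemm_reference(groups_a, groups_b)
```

A reference of this kind is the sort of thing a correctness test could compare GPU output against; how the actual L1 tests in this commit check results is not stated in the message.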
Junkai-Wu
1b741cabaa v4.4.2 update. (#3104) 2026-03-17 00:58:19 -04:00
Linfeng Zheng
772fbb264e [CLI] add cutedsl fp16 gemm tutorial from 2 to 6 (#3106)
* [CLI] add fp16 gemm tutorial from 2 to 6

* [CLI] refine comments
2026-03-17 10:11:55 +08:00
Blake Ledden
087c84df83 docs: Fix float16 documentation in elementwise_add notebook (#2949) (#3047)
The notebook uses float16 tensors but the vectorized kernel documentation
incorrectly describes elements as 32-bit and uses 4-element vectorization.
Updated to correctly state 16-bit elements with 8-element vectorization
for proper 128-bit loads/stores.

Signed-off-by: Blake Ledden <bledden@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 10:29:46 +08:00
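
The arithmetic behind this documentation fix can be sketched as follows: the number of elements covered by one vectorized access is the access width divided by the element width. The helper name `elements_per_vector` is hypothetical and only illustrates the calculation; it is not part of the notebook.

```python
def elements_per_vector(element_bits: int, vector_bits: int = 128) -> int:
    """Number of elements moved by one vector load/store of `vector_bits` bits."""
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("element width must be positive and divide the vector width")
    return vector_bits // element_bits


fp16_width = elements_per_vector(16)  # 16-bit elements, as in the corrected text
fp32_width = elements_per_vector(32)  # 32-bit elements, as the old text assumed
```

With 128-bit accesses this gives 8 elements for float16 and 4 for 32-bit types, which matches the correction described in the commit message.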
dePaul Miller
73c59c055c Support for Group GEMM in CUTLASS Profiler for Geforce and Spark (#3092)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2026-03-06 20:36:29 -05:00
Johnsonms
e5fcd125a5 [fix] Boolean.__dsl_and__ emits arith.andi directly for i1 operands (#3087)
Before this fix, combining two Boolean (i1) DSL values with Python `and`
triggered a verbose i1→i32→i1 round-trip in __dsl_and__:
  arith.extui  (×3), arith.select, arith.cmpi ne (×2) — 6 extra MLIR ops.

Add a fast path: when both operands are Boolean, delegate directly to
__and__, emitting a single arith.andi %a, %b : i1 — identical to `&`.

Both operators were already semantically equivalent; this fix makes the
generated MLIR identical as well.

Includes:
- repro_dsl_and_bool.py  — minimal standalone reproducer / bug-report script
- test_dsl_and_fix.py    — pytest tests verifying the fixed behaviour
2026-03-05 17:20:26 +08:00
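
The claim that both operators are semantically equivalent for 1-bit values can be checked exhaustively. The sketch below is a plain-Python model, not the DSL or its MLIR output: `short_circuit_and` imitates Python `and` on integer-widened operands (select, then compare against zero), and `bitwise_and` imitates `&`. Both function names are hypothetical.

```python
from itertools import product


def short_circuit_and(a: int, b: int) -> int:
    """Model of `a and b` on widened values: pick b if a is non-zero, then test != 0."""
    selected = b if a != 0 else a
    return int(selected != 0)


def bitwise_and(a: int, b: int) -> int:
    """Model of `a & b` on 1-bit values."""
    return a & b


# Enumerate all four combinations of 1-bit inputs.
truth_table = {
    (a, b): (short_circuit_and(a, b), bitwise_and(a, b))
    for a, b in product((0, 1), repeat=2)
}
all_equal = all(slow == fast for slow, fast in truth_table.values())
```

Because the two models agree on every input, replacing the longer sequence with a single AND changes only the generated code, not the result.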
TLescoatTFX
a93d86ec83 Fix finding cuDNN (#2890)
* Fix finding cuDNN

* Search $CUDNN_PATH first, not last
2026-03-05 09:51:37 +08:00
David W.H. Swenson
49e54f2b23 fix: add_help=False in temporary parser (#2721) 2026-03-02 15:33:42 +08:00
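
As background, a common reason to pass `add_help=False` to a temporary `argparse` parser is that a pre-parser with the default help action would consume `-h` and exit before the full parser is built. The sketch below shows that general pattern under assumed names: `parse_cli`, `--mode`, and `--iterations` are hypothetical and are not taken from the patched script.

```python
import argparse
from typing import List


def parse_cli(argv: List[str]) -> argparse.Namespace:
    # Temporary parser: only peeks at --mode. add_help=False keeps it from
    # handling -h itself and avoids a duplicate -h when used as a parent.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--mode", default="basic")
    known, _unparsed = pre.parse_known_args(argv)

    # Full parser: inherits --mode and owns the single -h/--help action.
    parser = argparse.ArgumentParser(parents=[pre])
    if known.mode == "advanced":
        parser.add_argument("--iterations", type=int, default=10)
    return parser.parse_args(argv)


args = parse_cli(["--mode", "advanced", "--iterations", "3"])
```

With this arrangement, `-h` is reported by the full parser, so the help text lists every option rather than only those known to the temporary parser.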
drazi
b9847690c5 Merge pull request #3028 from SzymonOzog/patch-3
Add option to not suffix prints with new line
2026-02-28 10:11:05 +08:00
Junkai-Wu
3bb6e28d3c v4.4.1 update (#3079) 2026-02-27 13:59:21 -05:00
Tianqi Zhang (张天启)
c651d660d2 fix typo (#3012) 2026-02-27 16:25:35 +08:00
Ziang Li
518327d631 Fix error in Blackwell document of referring to Mxf4 format as NVF4 (#2977)
* Update blackwell_functionality.md

Fixed the error of referring to Mxf4 format as NVF4.

* Correct NVF4 output matrix reference in documentation
2026-02-27 16:25:16 +08:00
StevenYangCC
de67bb7a42 Fix example in CuTe tutorials (#2752) 2026-02-27 16:24:34 +08:00
Neil Kichler
edf2f82c00 Fix register index bug in mma.sync.aligned.m16n8k16 (#2740) 2026-02-27 16:24:18 +08:00
mnehete32
79345359a7 Fix debug typo in sgemm_2.cu and sgemm_sm70.cu (#2678) 2026-02-27 16:23:59 +08:00
zkyue
8b9b3d78df fix typo in documentation (#2671) 2026-02-27 16:23:37 +08:00
Gabriel Wu
fc5bbc2dab Fix typo in cute.nvgpu.warpgroup.mma doc (#2548) 2026-02-27 16:22:55 +08:00
Junkai-Wu
057635de5c Remove redundant dsl example. (#3074) 2026-02-26 08:10:59 -05:00
Junkai-Wu
c213bfdfc1 Remove redundant dsl examples. (#3071) v4.4.0 2026-02-25 22:42:01 -05:00
Haicheng Wu
954503d44c Bump version to 4.4.0 2026-02-25 00:04:04 -05:00
Haicheng Wu
6c4200f1bc Bump version from 4.3.5 to 4.4.0 2026-02-25 00:03:23 -05:00
Haicheng Wu
de93e8a4ac Bump version from 4.3.5 to 4.4.0 2026-02-25 00:03:04 -05:00
Haicheng Wu
b92b9f0d37 Bump version from 4.3.5 to 4.4.0 2026-02-25 00:02:41 -05:00
Haicheng Wu
2aedca6f5e Bump CUTLASS version to 4.4.0 2026-02-25 00:01:56 -05:00
Haicheng Wu
6450964b57 Update README 2026-02-24 23:55:55 -05:00
Haicheng Wu
284449fa5b Revise changelog 2026-02-24 23:54:56 -05:00
Haicheng Wu
0853d81d70 Revise README 2026-02-24 15:32:17 -05:00
Linfeng Zheng
3476ddb7bd remove mixed_input_fmha_prefill (#3041) 2026-02-18 07:59:01 -05:00
Yihan Chen
291300ffff [CuTeDSL] implement a cta-level norm example (both layernorm and rmsnorm) (#3009)
* kernel impl

* add copyright
2026-02-14 17:54:03 +08:00
aragorn-guan
f9a5f76b7a Replace fence proxy to the latest routine code in examples/distributed/all_reduce_tma.py (#3027) 2026-02-14 17:51:20 +08:00
drazi
ec7e6cb17b Merge pull request #2971 from rsmallblue/tvm-ffi
[CuTeDSL]fix tvm-ffi path in from_dlpack
2026-02-14 14:14:10 +08:00
Yuan Xiaolan
395ab575f6 Merge branch 'main' into tvm-ffi 2026-02-14 13:35:28 +08:00
Junkai-Wu
d4bbf728ca v4.4 tag release update. (#3032) 2026-02-13 23:27:58 -05:00
Szymon Ożóg
beb80e04e1 Add option to not suffix prints with new line 2026-02-13 15:56:50 +01:00
drazi
01687cfba1 Merge pull request #3004 from tridao/add-sub-packed-f32x2
[CuTeDSL] Add sub_packed_f32x2 operation
2026-02-13 20:46:26 +08:00
drazi
5c42d0f28c Merge pull request #3021 from tridao/clc_no_multicast
[Cute-DSL] Add option for issue_clc_query without multicast
2026-02-13 20:45:52 +08:00
drazi
1d36152f34 Merge pull request #3022 from tridao/nvvm_fmin
[Cute-DSL] Add cute.arch.fmin by calling nvvm
2026-02-13 20:45:08 +08:00
Tri Dao
244e8d00d5 [Cute-DSL] Add cute.arch.fmin by calling nvvm 2026-02-11 14:23:09 -05:00
Tri Dao
5b83b34afd [Cute-DSL] Add option for issue_clc_query without multicast 2026-02-11 14:19:29 -05:00
aragorn-guan
8dbce01473 [CuTeDSL] Distributed example, using TMA load to access remote memory rank-by-rank, reducing in cta, broadcast result to all ranks by multimem TMA store (#2970) 2026-02-11 11:54:00 +08:00
drazi
71aa7a0abc Merge pull request #2919 from pbelevich/patch-1
Refactor binary_op functions to remove unused result parameter
2026-02-11 11:48:58 +08:00
Tri Dao
51935551fb [CuTeDSL] Add sub_packed_f32x2 operation
Add subtraction operation for packed f32x2 values, following the same
pattern as the existing add_packed_f32x2 and mul_packed_f32x2 operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 21:18:46 +07:00
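
Semantically, a packed f32x2 subtraction is an element-wise subtraction over a pair of single-precision values. The following is a plain-Python model of that behaviour, offered as an assumption-laden sketch: `to_f32` and `sub_packed_f32x2_model` are hypothetical helpers, and the real DSL operation's signature and rounding options are not shown in the commit message.

```python
import struct
from typing import Tuple

F32x2 = Tuple[float, float]


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sub_packed_f32x2_model(a: F32x2, b: F32x2) -> F32x2:
    """Subtract lane by lane, rounding inputs and results to single precision."""
    return (
        to_f32(to_f32(a[0]) - to_f32(b[0])),
        to_f32(to_f32(a[1]) - to_f32(b[1])),
    )


diff = sub_packed_f32x2_model((1.5, 4.0), (0.5, 1.0))
```

The two lanes are independent, which is what lets hardware process them in one packed instruction.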
Junkai-Wu
6b3e607b85 v4.4 release update v2. (#2999) 2026-02-03 20:48:31 -05:00
yuanxiaolan
de161925a5 pass in stream=-1 2026-02-03 11:59:14 +08:00
yuanxiaolan
de198b2419 fix tvm-ffi path in from_dlpack 2026-02-03 11:59:13 +08:00
Hua Huang
1cfbb53a23 [CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator (#2995)
* Fix: SM100 block-scale gemm overlapping accumulator

Signed-off-by: Hua Huang <huah@nvidia.com>

* Also include threads_per_warp fix

Signed-off-by: Hua Huang <huah@nvidia.com>

---------

Signed-off-by: Hua Huang <huah@nvidia.com>
2026-02-03 11:01:41 +08:00
dongxiao
a4eb0e05f6 fix performance issues in cute-dsl examples for 4.4-ctk13.1 release (#2988)
* fix grouped gemm

* fix mixed input gemm

* fix mixed input grouped gemm

* fix version checking

* use advanced compiler options

* fix comment

* rename advanced compiler configs to advanced compiler control

* fix comment

* fix name

* fix name
2026-01-30 13:31:04 +08:00
myu-guo
d252b01300 fix performance regression in cute-dsl examples for 4.4-ctk13.1 release (#2990)
* fix regression with cu13.1

* update
2026-01-30 13:30:49 +08:00
Xiao Song
acb45938e9 Update nvvm API call from nvvm enum to str (#2985) 2026-01-27 17:28:29 +08:00
Xiao Song
7a14467776 update api usage (#2969) 2026-01-27 15:33:22 +08:00