Commit Graph

14 Commits

Author SHA1 Message Date
Johnsonms
982748aa73 [Hopper CuTeDSL] Add grouped GEMM persistent kernel and tests (#3091)
Implement grouped GEMM (C_g = A_g x B_g for each group g) on Hopper using
CuTe DSL, extending the dense persistent GEMM with per-group TMA
descriptor management.

Kernel design (grouped_gemm.py):
- Warp-specialized pipeline: DMA warp group handles TMA loads and
  per-group tensormap updates; MMA warp group runs WGMMA and stores C
- StaticPersistentGroupTileScheduler for cross-group tile scheduling
- Per-group TMA descriptor updates via GMEM or SMEM mode
- Supports fp16, fp8 (E4M3FN/E5M2), int8 with mixed A/B dtypes
- Configurable tile shapes (128x128, 128x256) and cluster shapes
- Fix base TensorMapManager: hoist uniform_smem_ptrs outside predicated
  block to avoid illegal @P0 R2UR on sm_90a
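
The cross-group tile scheduling described above can be pictured with a small host-side model. This is only a sketch under stated assumptions, not the CuTe DSL API and not the actual StaticPersistentGroupTileScheduler: the function names, the (M, N, K) problem-size tuples, and the round-robin assignment policy are all hypothetical choices made for illustration. It flattens the output tiles of every group into one linear list and gives each persistent CTA every num_ctas-th tile, so a single CTA may cross group boundaries (the point at which a real kernel would need to switch TMA descriptors).

```python
# Host-side model of static persistent tile scheduling across GEMM groups.
# All names here are hypothetical; this does not call into CuTe DSL.

def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def enumerate_group_tiles(problem_sizes, tile_m, tile_n):
    """Return (group, tile_row, tile_col) for every output tile of every group.

    problem_sizes is a list of (M, N, K) tuples, one per group. K does not
    affect the number of output tiles, so it is ignored here.
    """
    tiles = []
    for group, (m, n, _k) in enumerate(problem_sizes):
        for tile_row in range(ceil_div(m, tile_m)):
            for tile_col in range(ceil_div(n, tile_n)):
                tiles.append((group, tile_row, tile_col))
    return tiles


def static_persistent_schedule(problem_sizes, tile_m, tile_n, num_ctas):
    """Assign tiles to persistent CTAs round-robin by linear tile index.

    Assumption: CTA i processes linear tiles i, i + num_ctas, i + 2*num_ctas, ...
    """
    tiles = enumerate_group_tiles(problem_sizes, tile_m, tile_n)
    return [tiles[cta::num_ctas] for cta in range(num_ctas)]


if __name__ == "__main__":
    # Three groups with mixed sizes, 128x128 tiles, four persistent CTAs.
    sizes = [(256, 256, 64), (128, 512, 32), (100, 130, 16)]
    for cta, work in enumerate(static_persistent_schedule(sizes, 128, 128, 4)):
        print(f"CTA {cta}: {work}")
```

In this model every output tile is visited exactly once, and a group whose dimensions are not tile multiples (the third group here) still contributes partial edge tiles.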

Tests (test/examples/CuTeDSL/hopper/test_grouped_gemm.py):
- L0 compile and L1 correctness pytest suite covering tile shapes,
  dtypes, major modes, cluster shapes, group counts, and mixed sizes
- Move to test/examples/CuTeDSL/hopper/ following sm_100a convention
- Fix deprecated startdir arg in test_sharding.py pytest hook
2026-03-18 00:40:15 -04:00
Junkai-Wu
1b741cabaa v4.4.2 update. (#3104) 2026-03-17 00:58:19 -04:00
Junkai-Wu
3bb6e28d3c v4.4.1 update (#3079) 2026-02-27 13:59:21 -05:00
Yihan Chen
291300ffff [CuTeDSL] implement a cta-level norm example (both layernorm and rmsnorm) (#3009)
* kernel impl

* add copyright
2026-02-14 17:54:03 +08:00
Junkai-Wu
d4bbf728ca v4.4 tag release update. (#3032) 2026-02-13 23:27:58 -05:00
Xiao Song
acb45938e9 Update nvvm API call from nvvm enum to str (#2985) 2026-01-27 17:28:29 +08:00
Junkai-Wu
0d2b201e8c v4.3.5 update. (#2934)
* v4.3.5 update.

* Update copyright to 2026
2026-01-08 15:02:56 -05:00
Junkai-Wu
7f5fe3edf1 v4.3.4 update. (#2892) 2025-12-21 11:49:12 -05:00
Junkai-Wu
8cd5bef43a v4.3 tag release update. (#2789) 2025-11-20 20:49:44 -05:00
Junkai-Wu
b1d6e2c9b3 v4.3 update. (#2709)
* v4.3 update.

* Update the cute_dsl_api changelog's doc link

* Update version to 4.3.0

* Update the example link

* Update doc to encourage user to install DSL from requirements.txt

---------

Co-authored-by: Larry Wu <larwu@nvidia.com>
2025-10-21 14:26:30 -04:00
Junkai-Wu
6a35b4d22f v4.2 tag release. (#2638) 2025-09-15 12:21:53 -04:00
Junkai-Wu
fd6cfe1ed0 v4.1 release update v2. (#2481) 2025-07-21 22:03:55 -04:00
Junkai-Wu
a1aaf2300a v4.1 release 2025-07-03 08:07:53 -04:00
Junkai-Wu
8bdbfca682 v4.0 update. (#2371) 2025-06-06 02:39:20 -04:00