72 Commits

Author SHA1 Message Date
questa-quan-wang
ae6bccf341 [CuTeDSL] Update atomic_max_float32 to atomic_fmax in blockscaled GEMM example (#3206)
The internal DSL package refactored atomic_max_float32 to atomic_fmax,
which properly handles negative floats via sign-bit-aware integer
atomics. Update the example to use the new API so it works with
current DSL wheels.

Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>
2026-05-07 15:03:37 +08:00
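The sign-bit-aware trick behind `atomic_fmax` can be sketched on the host. This is an illustrative model, not the DSL API: `float_to_ordered_key` and `atomic_fmax_sim` are hypothetical names, and a real GPU kernel would perform the integer max with a hardware atomic.

```python
import struct

def float_to_ordered_key(x: float) -> int:
    """Map a float32 to a uint32 key whose unsigned order matches float order.

    Negative floats have all bits inverted; non-negative floats get the sign
    bit set. This sign-bit-aware mapping is what lets an integer atomic max
    implement a float max correctly even for negative inputs (a plain integer
    max on raw float bits orders negatives backwards).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if bits & 0x80000000:          # negative: invert all bits
        return bits ^ 0xFFFFFFFF
    return bits | 0x80000000       # non-negative: set the sign bit

def atomic_fmax_sim(cell: list, value: float) -> None:
    # Simulated integer atomic max on the transformed key
    # (on hardware this single line is the atomic instruction).
    cell[0] = max(cell[0], float_to_ordered_key(value))
```

Sorting floats by this key matches their numeric order, which is exactly the property the integer atomic relies on.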
Junkai-Wu
cb37157db5 v4.5 tag update (#3202)
* Python DSL examples reorganization.

* v4.5 tag update.
2026-05-05 20:55:27 -04:00
Johnsonms
f74fea9ce3 [Hopper CuTeDSL] Add FP8 GEMM with 2xAcc (#3149)
Add dense_gemm_fp8_2xacc.py — a CuTeDSL port of CUTLASS Example 54
(54_hopper_fp8_warp_specialized_gemm.cu) for NVIDIA Hopper (SM90).

Implements D = scale_a * scale_b * (A @ B) where A/B are FP8 E4M3FN using
the 2xAcc (double accumulation) technique: a temporary accumulator is
periodically promoted into the main accumulator every mma_promotion_interval
MMA instructions to prevent FP8 precision loss.

Features:
- FP8 E4M3FN inputs with Float32 accumulation
- 2xAcc for improved numerical accuracy
- TMA with multicast for A/B/D transfers
- WGMMA warp-specialized persistent tile scheduling
- Configurable output dtype: Float16, Float32, Float8E4M3FN
- Scalar scale_a / scale_b epilogue factors
- Cluster shapes up to 2x2

Add pytest test suite covering:
- L0 compile tests: all tile shapes, cluster shapes, output dtypes,
  mma_promotion_interval values
- L1 correctness tests: numerical validation vs torch.einsum reference
  for all configs, non-trivial scale factors, and batched GEMM (L>1)
- Benchmark tests (pytest -m bench -s): representative problem sizes
  with warmup, cold-L2, and TFLOPS reporting

Also fix conftest.py to import cutlass before adding examples/python/CuTeDSL
to sys.path, preventing the jax/ examples subdirectory from being detected
as a namespace package and breaking cutlass's JAX availability check.
2026-04-25 16:10:33 -04:00
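The 2xAcc promotion loop can be sketched on the host with a plain dot product (`gemm_2xacc` is a hypothetical helper; in the kernel, `temp` is the WGMMA accumulator and each loop step is an MMA instruction):

```python
def gemm_2xacc(a_row, b_col, promotion_interval=4):
    """Dot product with periodic promotion of a temporary accumulator.

    Partial products accumulate into `temp`; every `promotion_interval`
    steps `temp` is flushed into the main accumulator `acc` and reset.
    This mirrors how the kernel promotes the low-precision FP8 accumulator
    to bound rounding-error growth.
    """
    acc, temp = 0.0, 0.0
    for k, (a, b) in enumerate(zip(a_row, b_col), start=1):
        temp += a * b
        if k % promotion_interval == 0:
            acc += temp
            temp = 0.0
    return acc + temp  # flush any remainder after the loop
```

With exact arithmetic the result equals the ordinary dot product; the benefit only appears when `temp` has lower precision than `acc`.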
Longsheng Du
08185b9c3e Update blackwell tutorial to be compatible with 4.5-dev version (#3130)
* Update blackwell tutorial to be compatible with 4.5-dev version

* update example for reverted changes

* add more example fix
2026-04-09 14:40:33 +08:00
Junkai-Wu
a221da7ccf v4.5 dev update. (#3153) 2026-04-07 12:16:05 -04:00
Katja Sirazitdinova
418d38a5de PR update (#3103) 2026-04-02 18:00:41 +08:00
drazi
4ca61d0662 [CuTeDSL] Add dataclass example: passing pointers via frozen dataclass (#3070)
* Add dataclass example: passing pointers via frozen dataclass

Demonstrates passing pointers from tensor arguments in @cute.jit to
@cute.kernel using @dataclass(frozen=True). Shows the pattern of
extracting pointers with tensor.iterator, bundling into a dataclass,
and reconstructing tensors in the kernel.

Uses fake tensors for compilation and TVM-FFI for runtime dispatch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add dataclass example: passing tensors via frozen dataclass

Demonstrates passing tensors from @cute.jit to @cute.kernel using
@dataclass(frozen=True). Shows the pattern of bundling tensors into
a dataclass with static configuration.

Uses fake tensors for compilation and TVM-FFI for runtime dispatch.
Includes reference check against PyTorch implementation.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-03-30 15:08:36 +08:00
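The bundling pattern the example demonstrates can be shown in plain Python (a sketch only: `KernelParams` and `kernel` are illustrative names, and ordinary tuples stand in for the device pointers that the CuTeDSL version extracts via `tensor.iterator`):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KernelParams:
    a: tuple        # stand-in for a device pointer
    b: tuple        # stand-in for a device pointer
    scale: float    # static configuration travels in the same bundle

def kernel(params: KernelParams):
    # "Reconstruct tensors" from the bundle and compute with them.
    return [params.scale * (x + y) for x, y in zip(params.a, params.b)]
```

`frozen=True` makes the bundle immutable (assigning to a field raises `FrozenInstanceError`), which is what makes it safe to treat as part of the compiled function's static signature.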
Zheng Linfeng
ecb32fe231 [CLI] Fix tutorial issues 2026-03-24 00:12:01 -07:00
Johnsonms
982748aa73 [Hopper CuTeDSL] Add grouped GEMM persistent kernel and tests (#3091)
Implement grouped GEMM (C_g = A_g x B_g for g groups) on Hopper using
CuTe DSL, extending the dense persistent GEMM with per-group TMA
descriptor management.

Kernel design (grouped_gemm.py):
- Warp-specialized pipeline: DMA warp group handles TMA loads and
  per-group tensormap updates; MMA warp group runs WGMMA and stores C
- StaticPersistentGroupTileScheduler for cross-group tile scheduling
- Per-group TMA descriptor updates via GMEM or SMEM mode
- Supports fp16, fp8 (E4M3FN/E5M2), int8 with mixed A/B dtypes
- Configurable tile shapes (128x128, 128x256) and cluster shapes
- Fix base TensorMapManager: hoist uniform_smem_ptrs outside predicated
  block to avoid illegal @P0 R2UR on sm_90a

Tests (test/examples/CuTeDSL/hopper/test_grouped_gemm.py):
- L0 compile and L1 correctness pytest suite covering tile shapes,
  dtypes, major modes, cluster shapes, group counts, and mixed sizes
- Move to test/examples/CuTeDSL/hopper/ following sm_100a convention
- Fix deprecated startdir arg in test_sharding.py pytest hook
2026-03-18 00:40:15 -04:00
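The cross-group scheduling idea can be sketched in a few lines (assumed shape only; `persistent_group_schedule` is a hypothetical stand-in for StaticPersistentGroupTileScheduler):

```python
def persistent_group_schedule(group_tile_counts, num_ctas):
    """Flatten all groups' tiles into one linear index space; each
    persistent CTA strides through it, so work stays balanced even when
    groups have very different sizes."""
    prefix, total = [], 0
    for n in group_tile_counts:
        prefix.append(total)
        total += n

    def tiles_for_cta(cta_id):
        for linear in range(cta_id, total, num_ctas):
            # Map the linear tile index back to (group, tile-within-group);
            # in the real kernel a group change triggers a tensormap update.
            g = max(i for i, p in enumerate(prefix) if p <= linear)
            yield g, linear - prefix[g]

    return tiles_for_cta
```

Every tile is visited exactly once across CTAs, and consecutive tiles of one group tend to land on different CTAs, which is the load-balancing property a persistent grouped kernel wants.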
Junkai-Wu
1b741cabaa v4.4.2 update. (#3104) 2026-03-17 00:58:19 -04:00
Linfeng Zheng
772fbb264e [CLI] add cutedsl fp16 gemm tutorial from 2 to 6 (#3106)
* [CLI] add fp16 gemm tutorial from 2 to 6

* [CLI] refine comments
2026-03-17 10:11:55 +08:00
Blake Ledden
087c84df83 docs: Fix float16 documentation in elementwise_add notebook (#2949) (#3047)
The notebook uses float16 tensors but the vectorized kernel documentation
incorrectly describes elements as 32-bit and uses 4-element vectorization.
Updated to correctly state 16-bit elements with 8-element vectorization
for proper 128-bit loads/stores.

Signed-off-by: Blake Ledden <bledden@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 10:29:46 +08:00
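The corrected arithmetic is simple enough to state as a one-liner (illustrative helper name):

```python
def elements_per_128bit_vector(element_bits: int) -> int:
    """A 128-bit load/store moves 128 / element_bits elements at once:
    8 x float16 or 4 x float32 per vectorized access."""
    assert 128 % element_bits == 0
    return 128 // element_bits
```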
Junkai-Wu
3bb6e28d3c v4.4.1 update (#3079) 2026-02-27 13:59:21 -05:00
Junkai-Wu
057635de5c Remove redundant dsl example. (#3074) 2026-02-26 08:10:59 -05:00
Junkai-Wu
c213bfdfc1 Remove redundant dsl examples. (#3071) 2026-02-25 22:42:01 -05:00
Linfeng Zheng
3476ddb7bd remove mixed_input_fmha_prefill (#3041) 2026-02-18 07:59:01 -05:00
Yihan Chen
291300ffff [CuTeDSL] implement a cta-level norm example (both layernorm and rmsnorm) (#3009)
* kernel impl

* add copyright
2026-02-14 17:54:03 +08:00
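The two norms the example implements have short scalar references (a host-side sketch with assumed `eps` handling, useful as a correctness oracle for the CTA-level kernels):

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def layernorm(x, weight, bias, eps=1e-6):
    """Reference LayerNorm: subtract the mean, divide by the standard
    deviation, then apply the elementwise affine transform."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv * w + b for v, w, b in zip(x, weight, bias)]
```

A CTA-level kernel computes the same reductions (sum, sum of squares) cooperatively across threads before the elementwise pass.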
aragorn-guan
f9a5f76b7a Replace fence proxy with the latest routine code in examples/distributed/all_reduce_tma.py (#3027) 2026-02-14 17:51:20 +08:00
Junkai-Wu
d4bbf728ca v4.4 tag release update. (#3032) 2026-02-13 23:27:58 -05:00
aragorn-guan
8dbce01473 [CuTeDSL] Distributed example: use TMA loads to access remote memory rank-by-rank, reduce in the CTA, and broadcast the result to all ranks via multimem TMA store (#2970) 2026-02-11 11:54:00 +08:00
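The algorithm named in that title can be modeled on host lists (a sketch under the stated assumptions; `all_reduce_rank_by_rank` is a hypothetical name, and list reads/writes stand in for the remote TMA loads and the multimem TMA store):

```python
def all_reduce_rank_by_rank(rank_buffers):
    """All-reduce sketch: visit every rank's buffer in turn (remote TMA
    loads in the real kernel), accumulate in a local buffer (the in-CTA
    reduction), then write the result back to every rank (multimem TMA
    store broadcast)."""
    n = len(rank_buffers[0])
    reduced = [0.0] * n
    for buf in rank_buffers:               # rank-by-rank remote loads
        for i in range(n):
            reduced[i] += buf[i]           # in-CTA reduction
    for r in range(len(rank_buffers)):     # broadcast to all ranks
        rank_buffers[r] = list(reduced)
    return rank_buffers
```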
drazi
71aa7a0abc Merge pull request #2919 from pbelevich/patch-1
Refactor binary_op functions to remove unused result parameter
2026-02-11 11:48:58 +08:00
Junkai-Wu
6b3e607b85 v4.4 release update v2. (#2999) 2026-02-03 20:48:31 -05:00
Hua Huang
1cfbb53a23 [CuTeDSL] Fix: SM100 block-scale gemm overlapping accumulator (#2995)
* Fix: SM100 block-scale gemm overlapping accumulator

Signed-off-by: Hua Huang <huah@nvidia.com>

* Also include threads_per_warp fix

Signed-off-by: Hua Huang <huah@nvidia.com>

---------

Signed-off-by: Hua Huang <huah@nvidia.com>
2026-02-03 11:01:41 +08:00
dongxiao
a4eb0e05f6 fix performance issues in cute-dsl examples for 4.4-ctk13.1 release (#2988)
* fix grouped gemm

* fix mixed input gemm

* fix mixed input grouped gemm

* fix version checking

* use advanced compiler options

* fix comment

* rename advanced compiler configs to advanced compiler control

* fix comment

* fix name

* fix name
2026-01-30 13:31:04 +08:00
myu-guo
d252b01300 fix performance regression in cute-dsl examples for 4.4-ctk13.1 release (#2990)
* fix regression with cu13.1

* update
2026-01-30 13:30:49 +08:00
Xiao Song
acb45938e9 Update nvvm API call from nvvm enum to str (#2985) 2026-01-27 17:28:29 +08:00
Xiao Song
7a14467776 update api usage (#2969) 2026-01-27 15:33:22 +08:00
Junkai-Wu
9fba3195f9 v4.4 update. (#2979) 2026-01-24 11:46:17 -05:00
Brian K. Ryu
147f5673d0 New RMS Norm example with unit tests (#2917)
* Add rmsnorm example

* Address reviewer comments. (1) use the cute.runtime definition directly. (2) use the nvvm_wrapper's warp reduce directly

* Separate out reduce.py

* Change copyright notice years
2026-01-13 09:05:31 +08:00
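The warp reduce the reviewer comments point to follows the classic xor-butterfly shuffle pattern, which can be simulated per-lane on the host (illustrative function name; on hardware each step is a `shfl` exchanging registers between lanes):

```python
def warp_reduce_sum(lane_values):
    """Simulated butterfly (xor-shuffle) warp reduction.

    Assumes len(lane_values) is a power of two (e.g. warp size 32).
    At each step, every lane adds the value held by lane (id ^ offset);
    after log2(n) steps every lane holds the full sum.
    """
    vals = list(lane_values)
    n = len(vals)
    offset = n // 2
    while offset > 0:
        vals = [vals[i] + vals[i ^ offset] for i in range(n)]
        offset //= 2
    return vals
```

Because every lane ends with the total, no extra broadcast is needed before the elementwise normalization step of an RMS norm.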
Junkai-Wu
0d2b201e8c v4.3.5 update. (#2934)
* v4.3.5 update.

* Update copyright to 2026
2026-01-08 15:02:56 -05:00
Pavel Belevich
b6d7703e02 Refactor binary_op functions to remove unused result parameter 2026-01-02 11:23:43 -05:00
Pavel Belevich
f9bedd9096 Fix print statement for floor division result 2026-01-02 11:15:15 -05:00
questa-quan-wang
3f4c086d09 new example with TMA prefetch feature targeting DRAM latency-bound cases (#2881)
Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>
2025-12-23 15:29:48 +08:00
Junkai-Wu
7f5fe3edf1 v4.3.4 update. (#2892) 2025-12-21 11:49:12 -05:00
dongxiao
331e2f451c add missing condition for sync (#2889) 2025-12-19 11:00:30 +08:00
Linfeng Zheng
f6402fcd5e add pytest support for tutorial gemm (#2826)
* add pytest support for tutorial gemm

* add license
2025-12-05 08:45:01 -05:00
bangyu shen
7252a2d17e remove internal comment (#2841)
Co-authored-by: bangyus <bangyus@nvidia.com>
2025-12-04 10:36:21 -05:00
Junkai-Wu
bc680c7f67 v4.3.2 update. (#2839) 2025-12-04 10:14:32 -05:00
bangyu shen
52ae719eda [examples][CuTeDSL] init commit for distributed examples (#2806)
* init commit for distributed examples

* better OOB protection

* try-import nvshmem for a better error message, and add a README.md introducing nvshmem and multimem instructions

* add some lamport explanation

* enhance f8 output and warn that it can contain NaN

* explain why complicated data conversions are needed in the reference-check part

* note that nvshmem device functions are not supported

---------

Co-authored-by: bangyus <bangyus@nvidia.com>
2025-12-01 22:25:40 -05:00
drazi
ec8daf642d Merge pull request #2809 from whatdhack/patch-1
Update notebook title from 'Tour to' to 'Tour of'
2025-11-28 18:07:34 +08:00
Fung Xie
286781a1fb add requirements.txt 2025-11-27 17:02:27 -08:00
Fung Xie
2664cac685 enhanced the example for tvm-ffi 2025-11-27 17:02:26 -08:00
Fung Xie
b9154d65b3 update examples for tvm-ffi 2025-11-27 17:02:26 -08:00
Fung Xie
afe2f71522 reorganize examples for tvm-ffi 2025-11-27 17:02:26 -08:00
Fung Xie
739fffce27 fix TVM FFI doc and update example 2025-11-27 17:02:26 -08:00
Junkai-Wu
1de3a576cc v4.3.1 update. (#2817) 2025-11-27 09:49:30 -05:00
whatdhack
4a55379686 Update notebook title from 'Tour to' to 'Tour of'
Grammar check. LLMs can quickly clean up such issues.
2025-11-24 20:11:14 -08:00
Junkai-Wu
8cd5bef43a v4.3 tag release update. (#2789) 2025-11-20 20:49:44 -05:00
Linfeng Zheng
406e078b29 add a notebook for tour to sol gemm (#2780)
* add tour to sol gemm notebook

* fix some typos

* fix some typos
2025-11-20 09:41:01 -05:00
Mindy Li
06b6bd7d7b remove cute dsl pdl example. 2025-11-09 21:47:00 -08:00