The internal DSL package refactored atomic_max_float32 to atomic_fmax,
which properly handles negative floats via sign-bit-aware integer
atomics. Update the example to use the new API so it works with
current DSL wheels.
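The sign-bit-aware trick can be sketched in plain Python (illustrative only, not the DSL's internals): a float32's raw bits are mapped to an unsigned 32-bit key whose integer ordering matches the float ordering, so an integer atomic max on the keys implements a float max that is correct for negative values.

```python
import struct

def ordered_key(f: float) -> int:
    """Map float32 bits to an unsigned 32-bit key whose integer ordering
    matches the float ordering (illustrative sketch, not the DSL code)."""
    b = struct.unpack('<I', struct.pack('<f', f))[0]
    # Sign bit clear (f >= 0): set the top bit so positives sort above negatives.
    # Sign bit set  (f < 0): flip all bits so larger magnitudes sort lower.
    return b ^ 0x80000000 if b < 0x80000000 else b ^ 0xFFFFFFFF

# Integer ordering of the keys agrees with float ordering, negatives included.
vals = [-2.5, -0.001, 0.0, 1.5]
assert sorted(vals, key=ordered_key) == sorted(vals)
```

A plain reinterpret-cast to signed int (what a naive `atomic_max_float32` amounts to) gets exactly this wrong: negative floats have larger raw bit patterns for larger magnitudes, reversing their order.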
Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>
Add dense_gemm_fp8_2xacc.py — a CuTeDSL port of CUTLASS Example 54
(54_hopper_fp8_warp_specialized_gemm.cu) for NVIDIA Hopper (SM90).
Implements D = scale_a * scale_b * (A @ B) where A/B are FP8 E4M3FN using
the 2xAcc (double accumulation) technique: a temporary accumulator is
periodically promoted into the main accumulator every mma_promotion_interval
MMA instructions to prevent FP8 precision loss.
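A minimal NumPy sketch of the promotion scheme (illustrative only; the real kernel does this with WGMMA register accumulators, and tile sizes here are arbitrary):

```python
import numpy as np

def gemm_2xacc(a, b, mma_promotion_interval=4, k_tile=8):
    """Sketch of 2xAcc: each K-tile's partial product accumulates into a
    temporary accumulator, which is periodically promoted (added) into the
    main accumulator and reset. Illustrative, not the kernel's code."""
    m, k = a.shape
    n = b.shape[1]
    main_acc = np.zeros((m, n), dtype=np.float32)
    temp_acc = np.zeros((m, n), dtype=np.float32)
    for t in range(k // k_tile):
        ks = slice(t * k_tile, (t + 1) * k_tile)
        temp_acc += a[:, ks].astype(np.float32) @ b[ks, :].astype(np.float32)
        if (t + 1) % mma_promotion_interval == 0:
            main_acc += temp_acc   # promotion step
            temp_acc[:] = 0.0
    return main_acc + temp_acc     # flush any remainder
```

Keeping the running sum split this way bounds how much any single accumulator grows between promotions, which is what limits FP8-era rounding error accumulation.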
Features:
- FP8 E4M3FN inputs with Float32 accumulation
- 2xAcc for improved numerical accuracy
- TMA with multicast for A/B/D transfers
- WGMMA warp-specialized persistent tile scheduling
- Configurable output dtype: Float16, Float32, Float8E4M3FN
- Scalar scale_a / scale_b epilogue factors
- Cluster shapes up to 2x2
Add pytest test suite covering:
- L0 compile tests: all tile shapes, cluster shapes, output dtypes,
mma_promotion_interval values
- L1 correctness tests: numerical validation vs torch.einsum reference
for all configs, non-trivial scale factors, and batched GEMM (L>1)
- Benchmark tests (pytest -m bench -s): representative problem sizes
with warmup, cold-L2, and TFLOPS reporting
Also fix conftest.py to import cutlass before adding examples/python/CuTeDSL
to sys.path, preventing the jax/ examples subdirectory from being detected
as a namespace package and breaking cutlass's JAX availability check.
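The pitfall generalizes: any directory on sys.path containing a bare subdirectory that shadows an installed package name makes that import succeed as an empty namespace package. A self-contained demonstration, with a hypothetical name `shadow_pkg_demo` standing in for `jax`:

```python
import importlib
import pathlib
import sys
import tempfile

# A bare subdirectory (no __init__.py) named like a package is enough to
# make the import succeed as an empty namespace package once its parent
# is on sys.path -- which is how the jax/ examples subdirectory could fool
# an availability check. "shadow_pkg_demo" is a hypothetical name.
root = pathlib.Path(tempfile.mkdtemp())
(root / "shadow_pkg_demo").mkdir()
sys.path.insert(0, str(root))

mod = importlib.import_module("shadow_pkg_demo")
assert getattr(mod, "__file__", None) is None  # namespace pkg: no real file
assert hasattr(mod, "__path__")                # but it does have a __path__
```

Importing cutlass first means its availability check runs before the shadowing directory is ever on sys.path.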
* Add dataclass example: passing pointers via frozen dataclass
Demonstrates passing pointers from tensor arguments in @cute.jit to
@cute.kernel using @dataclass(frozen=True). Shows the pattern of
extracting pointers with tensor.iterator, bundling into a dataclass,
and reconstructing tensors in the kernel.
Uses fake tensors for compilation and TVM-FFI for runtime dispatch.
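Stripped of CuTe-specific types, the frozen-dataclass bundling pattern looks roughly like this (`KernelParams` is a hypothetical stand-in for the example's pointer bundle, not a name from the example):

```python
from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)
class KernelParams:
    """Hypothetical stand-in for the example's pointer bundle: raw pointer
    values plus static shape info, passed as one kernel argument."""
    a_ptr: int
    b_ptr: int
    n: int

params = KernelParams(a_ptr=0x1000, b_ptr=0x2000, n=1024)
try:
    params.n = 0                  # frozen=True forbids mutation
except FrozenInstanceError:
    pass
# frozen dataclasses are hashable, so equal bundles hash equally --
# convenient when compiled-kernel caching keys off the argument bundle.
assert hash(params) == hash(KernelParams(0x1000, 0x2000, 1024))
```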
Co-authored-by: Cursor <cursoragent@cursor.com>
* Add dataclass example: passing tensors via frozen dataclass
Demonstrates passing tensors from @cute.jit to @cute.kernel using
@dataclass(frozen=True). Shows the pattern of bundling tensors into
a dataclass with static configuration.
Uses fake tensors for compilation and TVM-FFI for runtime dispatch.
Includes reference check against PyTorch implementation.
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Implement grouped GEMM (C_g = A_g x B_g for g groups) on Hopper using
CuTe DSL, extending the dense persistent GEMM with per-group TMA
descriptor management.
Kernel design (grouped_gemm.py):
- Warp-specialized pipeline: DMA warp group handles TMA loads and
per-group tensormap updates; MMA warp group runs WGMMA and stores C
- StaticPersistentGroupTileScheduler for cross-group tile scheduling
- Per-group TMA descriptor updates via GMEM or SMEM mode
- Supports fp16, fp8 (E4M3FN/E5M2), int8 with mixed A/B dtypes
- Configurable tile shapes (128x128, 128x256) and cluster shapes
- Fix base TensorMapManager: hoist uniform_smem_ptrs outside predicated
block to avoid illegal @P0 R2UR on sm_90a
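The cross-group scheduling idea can be sketched in a few lines (an illustrative sketch, not the scheduler's actual implementation): flatten every group's output tiles into one linear sequence that persistent CTAs stride through.

```python
def enumerate_group_tiles(group_shapes, tile_m=128, tile_n=128):
    """Flatten all groups' output tiles into one list of (group, tm, tn).
    Illustrative sketch of cross-group tile scheduling: a persistent CTA
    with index `cta` would process tiles[cta::num_ctas]."""
    tiles = []
    for g, (m, n, _k) in enumerate(group_shapes):
        for tm in range(-(-m // tile_m)):      # ceil-div for ragged edges
            for tn in range(-(-n // tile_n)):
                tiles.append((g, tm, tn))
    return tiles

# Two groups: 256x128 output -> 2 tiles, 128x256 output -> 2 tiles.
tiles = enumerate_group_tiles([(256, 128, 64), (128, 256, 64)])
assert len(tiles) == 4
```

Whenever consecutive tiles in this sequence belong to different groups, the DMA warp group must swap in that group's TMA descriptors — which is the per-group tensormap update the kernel performs.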
Tests (test/examples/CuTeDSL/hopper/test_grouped_gemm.py):
- L0 compile and L1 correctness pytest suite covering tile shapes,
dtypes, major modes, cluster shapes, group counts, and mixed sizes
- Move to test/examples/CuTeDSL/hopper/ following sm_100a convention
- Fix deprecated startdir arg in test_sharding.py pytest hook
The notebook uses float16 tensors but the vectorized kernel documentation
incorrectly describes elements as 32-bit and uses 4-element vectorization.
Updated to correctly state 16-bit elements with 8-element vectorization
for proper 128-bit loads/stores.
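The arithmetic behind the fix: the vector width for a 128-bit access is 128 divided by the element width in bits.

```python
def vector_width(element_bits, access_bits=128):
    """Elements per vectorized load/store for a given access width."""
    return access_bits // element_bits

assert vector_width(16) == 8   # float16: 8-element vectors, as the fix states
assert vector_width(32) == 4   # float32: the old, incorrect description
```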
Signed-off-by: Blake Ledden <bledden@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Add rmsnorm example
* Address reviewer comments: (1) use the cute.runtime definition directly; (2) use nvvm_wrapper's warp reduce directly
* Separate out reduce.py
* Change copyright notice years
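For context, a NumPy reference of what an RMSNorm kernel computes (the standard formulation; `eps` and the elementwise weight follow the usual convention and are not taken from the example's code):

```python
import numpy as np

def rmsnorm_ref(x, weight, eps=1e-6):
    """Standard RMSNorm reference: x / sqrt(mean(x^2) + eps) * weight,
    reduced over the last axis, computed in float32."""
    x32 = x.astype(np.float32)
    rms = np.sqrt(np.mean(x32 * x32, axis=-1, keepdims=True) + eps)
    return (x32 / rms * weight).astype(x.dtype)
```

The per-row mean reduction is the part served by the warp-reduce helper and the factored-out reduce.py mentioned above.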
* Initial commit for distributed examples
* Better out-of-bounds (OOB) protection
* Try importing nvshmem for a better error message, and add a README.md introducing nvshmem and multimem instructions
* Add an explanation of the Lamport synchronization pattern
* Enhance FP8 output and warn that FP8 output can contain NaN
* Explain why complicated data conversions are needed in the reference-check part
* Note that nvshmem device functions are not supported
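The Lamport-style signaling the explanation covers can be sketched with CPU threads standing in for GPUs (illustrative only; the real examples use multimem/nvshmem on device memory): publish the payload first, then the flag, and have the consumer wait on the flag before reading.

```python
import threading

payload = [0]
flag = threading.Event()

def producer():
    payload[0] = 42   # 1. write the data first
    flag.set()        # 2. then publish the flag

def consumer(out):
    flag.wait()             # 3. wait until the flag is observed
    out.append(payload[0])  # 4. safe: flag implies the payload is complete

out = []
threads = [threading.Thread(target=consumer, args=(out,)),
           threading.Thread(target=producer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert out == [42]
```

On a GPU the ordering guarantee comes from memory-ordering semantics of the stores rather than an Event object; this sketch only shows the write-then-flag protocol shape.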
---------
Co-authored-by: bangyus <bangyus@nvidia.com>