[CuTeDSL] Update atomic_max_float32 to atomic_fmax in blockscaled GEMM example (#3206)

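The internal DSL package replaced atomic_max_float32 with atomic_fmax, which handles negative floats correctly by using sign-bit-aware integer atomics. Update the example to call the new API so it works with current DSL wheels. The call now passes mAmax.iterator instead of its llvm_ptr and sets sign_bit=False, because the accumulated values are absolute values and therefore non-negative.

Co-authored-by: Questa Wang <questaw@computelab-frontend-7.nvidia.com>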
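The commit does not show how atomic_fmax works internally. The sketch below is a sequential, pure-Python model of a technique commonly used to build a float32 atomic max from 32-bit integer atomics: when the sign bit is clear, use a signed-integer max; when it is set, use an unsigned-integer min. It is an illustration under that assumption, not the DSL's implementation. Cell and fmax_via_int_atomics are hypothetical names, the "atomic" read-modify-write is just done in place, and NaN inputs are not handled.

```python
import struct


def f32_bits(x: float) -> int:
    """Reinterpret a float32's bit pattern as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Reinterpret an unsigned 32-bit integer as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def as_signed(bits: int) -> int:
    """View an unsigned 32-bit pattern as a two's-complement int32."""
    return bits - (1 << 32) if bits & 0x8000_0000 else bits


class Cell:
    """Hypothetical stand-in for one 32-bit word of global memory."""

    def __init__(self, value: float) -> None:
        self.bits = f32_bits(value)

    def load(self) -> float:
        return bits_f32(self.bits)


def fmax_via_int_atomics(cell: Cell, value: float) -> float:
    """Sequential model of a sign-bit-aware float32 atomic max.

    On a GPU each branch would be a single integer atomic; here the
    read-modify-write is done in place. Returns the old value, like a
    hardware atomic. NaN inputs are not handled.
    """
    old = cell.load()
    new_bits = f32_bits(value)
    if (new_bits & 0x8000_0000) == 0:
        # Sign bit clear (value >= +0.0): viewed as signed int32, non-negative
        # floats order like their bit patterns and every negative float maps
        # to a negative int, so a signed-integer max yields the float max.
        if as_signed(new_bits) > as_signed(cell.bits):
            cell.bits = new_bits
    else:
        # Sign bit set (value <= -0.0): among negative floats a larger value
        # has a smaller unsigned bit pattern, and every non-negative float has
        # a smaller pattern still, so an unsigned-integer min yields the max.
        if new_bits < cell.bits:
            cell.bits = new_bits
    return old


amax = Cell(-3.5)
for v in [-7.0, -1.25, 0.5, -0.75, 2.0, 1.0]:
    fmax_via_int_atomics(amax, v)
print(amax.load())  # 2.0
```

The kernel's call site passes sign_bit=False next to the comment that all values are non-negative; in this model that roughly corresponds to the case where only the first branch, a plain signed-integer max, can ever run.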
2026-05-11 08:50:09 +00:00 · 2026-05-07 15:03:37 +08:00
parent cb37157db5
commit ae6bccf341
1 changed file with 2 additions and 3 deletions
--- a/examples/python/CuTeDSL/cute/blackwell/kernel/blockscaled_gemm/dense_blockscaled_gemm_persistent_amax.py
+++ b/examples/python/CuTeDSL/cute/blackwell/kernel/blockscaled_gemm/dense_blockscaled_gemm_persistent_amax.py
@@ -1423,9 +1423,8 @@ class Sm100BlockScaledPersistentDenseGemmKernel:

                    # Global atomic max (accumulates across all tiles for final tensor amax)
                    # Since we compute absolute values, all values are non-negative
-                    # Use wrapper function for atomic max operation
-                    _ = cute.arch.atomic_max_float32(
-                        ptr=mAmax.iterator.llvm_ptr, value=block_amax
+                    _ = cute.arch.atomic_fmax(
+                        mAmax.iterator, block_amax, sign_bit=False
                    )
                #
                # Async arrive accumulator buffer empty