Extend XDL kernel to Support RDNA3/4 - Part 2 (#2722)

Update the Blockwise and Gridwise files to support both wave32 and wave64.
1. Calculate WaveSize from a template parameter instead of hard-coding it to 64; some literal "64" values are also replaced with WaveSize.
2. Move BN0Shuffled and BK0Shuffled to the device side, because the correct mfma instruction info is not available on the host side.
3. Update b_thread_offset_n and b_thread_offset_k in gridwise_gemm_xdl_cshuffle_v3_b_scale.hpp for gfx11. On gfx11, input data is duplicated for each group of 16 threads, unlike on all other targets.
4. Modify a1_threadwise_copy in gridwise_batched_*gemm*gemm for gfx11. On gfx11, the input must be duplicated and A must be swizzled if transposeC isn't enabled.
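
The first point, deriving the wave size from template parameters rather than assuming 64 lanes, can be illustrated with a minimal standalone sketch. This is not the code from this change: `WaveConfig`, `WaveIdOf`, and `LaneIdOf` are hypothetical names, and computing the wave size as block size divided by the number of waves is an assumption made for illustration only.

```cpp
// Hypothetical helper, not part of the library: derives the wave size from
// compile-time tile parameters instead of hard-coding 64.
template <int BlockSize, int MWaves, int NWaves>
struct WaveConfig
{
    static_assert(BlockSize % (MWaves * NWaves) == 0,
                  "BlockSize must be divisible by the number of waves");

    // Assumption: one workgroup holds exactly MWaves * NWaves waves.
    static constexpr int WaveSize = BlockSize / (MWaves * NWaves);

    static_assert(WaveSize == 32 || WaveSize == 64,
                  "only wave32 and wave64 are handled in this sketch");

    // Index math that previously would have used a literal 64.
    static constexpr int WaveIdOf(int thread_id) { return thread_id / WaveSize; }
    static constexpr int LaneIdOf(int thread_id) { return thread_id % WaveSize; }
};

// The same template serves both wave widths.
using Wave64Config = WaveConfig<256, 2, 2>; // WaveSize == 64
using Wave32Config = WaveConfig<128, 2, 2>; // WaveSize == 32
```

Because every index computation goes through `WaveSize`, one instantiation covers wave64 targets and another covers wave32 targets without duplicating the kernel body.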
2026-04-20 06:49:15 +00:00 · 2025-09-04 08:33:40 +08:00
parent 80ce6a573b
commit e2d28a92af
48 changed files with 605 additions and 300 deletions
--- a/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer.hpp
+++ b/include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer.hpp
@@ -1900,6 +1900,7 @@ struct ThreadwiseTensorSliceTransfer_StaticToStatic_InterRow
                        const DstSliceOriginIdx&,
                        DstBuffer& dst_buf) const
    {
+        ElementwiseOperation element_op_{};
        static_assert(SrcDesc::IsKnownAtCompileTime() && DstDesc::IsKnownAtCompileTime(),
                      "wrong! Desc need to known at compile-time");

@@ -1985,7 +1986,6 @@ struct ThreadwiseTensorSliceTransfer_StaticToStatic_InterRow
            });
        });
    }
-    ElementwiseOperation element_op_{};
 };

 // Specialized for gfx12