Improve XDL to WMMA porting for grouped conv fwd (#3456)

Refactors how the number of XDL (matrix multiply-accumulate) instructions per wave is calculated and used in the grouped convolution forward implementations, mainly to better support WMMA (Wave Matrix Multiply-Accumulate) instructions and 16x16 tiles. The changes use MXdlPerWave instead of NXdlPerWave to increase the number of waves along the M dimension.
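
As a rough illustration of why MXdlPerWave affects the wave count along M, here is a minimal sketch. It assumes the common block-tiling relation PerBlock = Waves * XdlPerWave * PerXdl; the helper `waves_along_dim` and all numeric values are hypothetical and are not taken from the changed files.

```cpp
// Hypothetical helper, not part of this commit: number of waves a block tile
// places along one dimension, assuming the tiling relation
//   PerBlock = Waves * XdlPerWave * PerXdl.
constexpr int waves_along_dim(int per_block, int xdl_per_wave, int per_xdl)
{
    return per_block / (xdl_per_wave * per_xdl);
}

// Illustrative values only (assumed, not read from the diff).
constexpr int MPerBlock = 128; // block tile size along M
constexpr int MPerXdl   = 16;  // 16x16 instruction tile

// With a fixed block tile, a smaller MXdlPerWave means more waves along M.
static_assert(waves_along_dim(MPerBlock, /*MXdlPerWave=*/4, MPerXdl) == 2,
              "4 instruction tiles per wave -> 2 waves along M");
static_assert(waves_along_dim(MPerBlock, /*MXdlPerWave=*/2, MPerXdl) == 4,
              "2 instruction tiles per wave -> 4 waves along M");
```

Under this assumption, halving MXdlPerWave for a fixed MPerBlock doubles the number of waves along M; the actual computation in the changed files may differ.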
2026-05-05 14:11:29 +00:00 · 2025-12-19 23:58:51 +01:00
parent 2d9c962e2c
commit cbc8335964
13 changed files with 226 additions and 133 deletions
--- a/experimental/builder/test/utils/ckb_conv_test_configs.hpp
+++ b/experimental/builder/test/utils/ckb_conv_test_configs.hpp
@@ -68,7 +68,7 @@ constexpr TransferABC FwdTransfer_4x64x1{
                {.m_block = 1, .m_wave_per_xdl = 32, .n_block = 1, .n_wave_per_xdl = 8},
            .epilogue = {.m_xdl_per_wave_per_shuffle = 1,
                         .n_per_wave_per_shuffle     = 1,
-                         .scalar_per_vector          = 8},
+                         .scalar_per_vector          = 4},
        },
 };