Shuffle fix for gfx950 (#3491)

* solve compiler issue
* solve the gfx950 mfma shuffle regression
* refactor jenkinsfile to handle arch name better
* [CK TILE] set divisor to count of thread along k dimension
* fix the compiler error
* solve degradation
* Finish the multiplies fix
* fix the scales
* solve compilation error
* solve the composes
* solve the error of tile sweeper
* fix the test and example
* fix for gfx950

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
2026-04-20 14:59:17 +00:00 · 2026-01-14 01:21:29 +08:00
parent 9908a87c31
commit 00c46785a8
33 changed files with 161 additions and 152 deletions
--- a/include/ck_tile/ops/fused_moe/kernel/fused_moegemm_shape.hpp
+++ b/include/ck_tile/ops/fused_moe/kernel/fused_moegemm_shape.hpp
@@ -56,10 +56,10 @@ struct FusedMoeGemmShape
    using WarpTile_1     = remove_cvref_t<WarpTile_1_>;

    static constexpr index_t NumWarps =
-        reduce_on_sequence(WarpPerBlock_0{}, multiplies{}, number<1>{});
+        reduce_on_sequence(WarpPerBlock_0{}, multiplies<>{}, number<1>{});

    // TODO: we don't support half warps aound to 1 warp here
-    static_assert(NumWarps == reduce_on_sequence(WarpPerBlock_1{}, multiplies{}, number<1>{}));
+    static_assert(NumWarps == reduce_on_sequence(WarpPerBlock_1{}, multiplies<>{}, number<1>{}));

    static constexpr index_t Block_M0        = BlockTile_0::at(number<0>{});
    static constexpr index_t Block_N0        = BlockTile_0::at(number<1>{});
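
Review note: the hunk above only swaps `multiplies{}` for `multiplies<>{}`. The commit does not say why, so the following is a minimal sketch under an assumption: that `multiplies` is a class template with a defaulted type parameter, in the style of the standard `std::multiplies<>`, so the functor is spelled with an explicit empty argument list. The sketch uses `std::multiplies<>` and `std::integer_sequence` as stand-ins; `reduce_on_sequence_sketch` and `WarpPerBlock` are hypothetical names for illustration and are not the ck_tile implementation.

```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for a sequence reduction: folds the values of a
// compile-time integer sequence with a binary functor, starting from `init`.
template <typename T, T... Is, typename Reduce>
constexpr T reduce_on_sequence_sketch(std::integer_sequence<T, Is...>, Reduce f, T init)
{
    T result = init;
    ((result = f(result, Is)), ...); // C++17 fold over the comma operator
    return result;
}

// Hypothetical warp layout: 2 x 2 x 1 warps per block.
using WarpPerBlock = std::integer_sequence<int, 2, 2, 1>;

// std::multiplies<> names the transparent specialization std::multiplies<void>;
// the empty angle brackets select the default template argument explicitly.
constexpr int NumWarps =
    reduce_on_sequence_sketch(WarpPerBlock{}, std::multiplies<>{}, 1);

// The product of the per-dimension warp counts is the warp count per block.
static_assert(NumWarps == 2 * 2 * 1, "warp count must equal the product of the layout");
```

Spelling the template argument list explicitly avoids relying on class template argument deduction for the functor, which is one plausible reason for the change in this hunk.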