From e7798e9560b5cb9079907afa472bf002e2993755 Mon Sep 17 00:00:00 2001
From: JP-Fernando <103817231+JP-Fernando@users.noreply.github.com>
Date: Thu, 21 May 2026 09:36:41 +0200
Subject: [PATCH] [rocm-libraries] ROCm/rocm-libraries#7112 (commit a6e5eac)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add asynchronous XOR shuffle support to the Async GEMM pipeline and the MX GEMM pipeline (#7112)

## Motivation

This work applies XOR shuffle (swizzle) to the current `comp_async` GEMM pipeline and the `gemm_mx` pipeline. XOR swizzling helps avoid LDS bank conflicts: it redistributes data across LDS banks so that threads simultaneously accessing different rows land on different banks.

## Technical Details

The implementation follows the approach used in the existing eight-waves pipeline. XOR swizzle support is currently available for the FP8 and BF8 types. FP4 support is also available for MX GEMM. If the types do not match, or the async vector width is of an unsupported size, the pipeline falls through to the previously existing ('unswizzled') path.

## Test Plan

Execute `test_ck_tile_gemm_pipeline_comp_async` for the Async GEMM pipeline.
Execute `test_ck_tile_mx_gemm_fp8` and `test_ck_tile_mx_gemm_fp4` for the MX GEMM pipeline.

## Test Result

The tests passed on the `Alola` cluster with MI350 hardware.

## Submission Checklist

- [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
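## Illustration

The bank redistribution described under Motivation can be sketched with a small host-side model. This is only an illustration under stated assumptions: the 32-bank, one-word-per-bank layout and the names `bank_unswizzled`, `bank_swizzled`, and `max_bank_conflict` are hypothetical and are not taken from ck_tile or from the MI350 LDS configuration. It counts how many threads land on the same bank when thread `t` reads the same column of row `t`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Assumed LDS model for illustration only: 32 banks, one word per bank per row.
constexpr std::size_t kNumBanks = 32;

// Plain row-major layout: the bank depends only on the column.
constexpr std::size_t bank_unswizzled(std::size_t row, std::size_t col)
{
    return (row * kNumBanks + col) % kNumBanks;
}

// XOR-swizzled layout: the column is XORed with the row index before the
// word is placed, so the same logical column maps to a different bank per row.
constexpr std::size_t bank_swizzled(std::size_t row, std::size_t col)
{
    return (row * kNumBanks + (col ^ (row % kNumBanks))) % kNumBanks;
}

// Worst-case number of threads that hit one bank when thread t reads
// the element at (row t, column col).
template <typename BankFn>
std::size_t max_bank_conflict(BankFn bank_of, std::size_t num_threads, std::size_t col)
{
    assert(col < kNumBanks);
    std::array<std::size_t, kNumBanks> hits{};
    for(std::size_t t = 0; t < num_threads; ++t)
        ++hits[bank_of(t, col)];
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model, `max_bank_conflict(bank_unswizzled, 32, 5)` reports a 32-way conflict, because every row places column 5 in the same bank, while `max_bank_conflict(bank_swizzled, 32, 5)` reports 1, because XORing a fixed column with 32 distinct row indices yields 32 distinct banks.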
---------

Co-authored-by: Fernando Jiménez
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
---
 .../03_gemm/gemm_tdm_data_cache_prefetch.cpp  |  658 ++--
 ...ght_preshuffle_tdm_data_cache_prefetch.cpp |  420 +--
 include/ck_tile/core.hpp                      |    3 +
 .../core/utility/data_cache_prefetch.hpp      |   30 +-
 include/ck_tile/host.hpp                      |    1 +
 include/ck_tile/ops/fmha.hpp                  |    2 +-
 include/ck_tile/ops/gemm.hpp                  |    6 +
 .../gemm_pipeline_ag_bg_cr_comp_async.hpp     |   38 +-
 ...ine_ag_bg_cr_comp_async_default_policy.hpp |  385 ++-
 .../gemm_pipeline_ag_bg_cr_comp_tdm_v1.hpp    | 2982 ++++++++--------
 .../gemm_pipeline_ag_bg_cr_comp_tdm_v2.hpp    | 2992 ++++++++---------
 .../gemm_pipeline_ag_bg_cr_comp_async.hpp     |   42 +-
 ...ine_ag_bg_cr_comp_async_default_policy.hpp |  444 ++-
 13 files changed, 4369 insertions(+), 3634 deletions(-)

diff --git a/example/ck_tile/03_gemm/gemm_tdm_data_cache_prefetch.cpp b/example/ck_tile/03_gemm/gemm_tdm_data_cache_prefetch.cpp
index 39ae6d250c..955aff84fa 100644
--- a/example/ck_tile/03_gemm/gemm_tdm_data_cache_prefetch.cpp
+++ b/example/ck_tile/03_gemm/gemm_tdm_data_cache_prefetch.cpp
@@ -1,329 +1,329 @@
-// Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
-// SPDX-License-Identifier: MIT
-
-#include "gemm_utils.hpp"
-#include "run_gemm_example.inc"
-#include "run_gemm_example_common.hpp"
-#include "universal_gemm_invoker.hpp"
-#include "ck_tile/core/utility/gemm_validation.hpp"
-
-// Template function to run GEMM with optional prefetch comparison.
-// GemmConfig takes (PrecType, DataCachePrefetchKind A, DataCachePrefetchKind B,
-// ClusterM, ClusterN).
-template