[CK Tile] Stream-K RDNA Support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Currently, CK Tile Stream-K only supports CDNA architectures. This change adds Stream-K support on RDNA3/3.5 and RDNA4 architectures. ## Technical Details Stream-K currently has 3 reduction strategies: 1) atomics, 2) linear, and 3) tree. The linear and tree reductions require inter-workgroup communication to a global flags buffer and a global partials buffer. To ensure cache coherency, we use cache modifiers to skip cache levels that are not visible to all workgroups. On CDNA architectures, scalar load and scalar store instructions are available, which we use to read and write to the flags buffer with appropriate cache skipping modifiers. However, RDNA architectures do not support scalar store instructions, so workgroups must use a buffer store instruction to write to flags. Additionally, cache modifiers differ between CDNA and RDNA; they also differ between RDNA3 and RDNA4. Given this information, the main changes are as follows: - Added RDNA flag signaling: Use buffer store instructions for writing to global flags buffer - Add appropriate cache modifiers for reading and writing to flags and partials: - RDNA3 (gfx11): Use `glc | dlc` coherence flags - RDNA4 (gfx12): Use `DEVICE` coherence scope - SFINAE-guarded overloads: Added compile-time dispatch for `SignalStorePartialDone()` and `WaitStorePartialDone()` based on target architecture - RDNA alignment requirements: Increased flags buffer alignment from 128B to 256B due to RDNA cache line size **A note about the `amd_buffer_coherence_enum`:** - **Problem:** The `amd_buffer_coherence_enum` uses preprocessor conditionals (`#if defined(__gfx12__)`) to define architecture-specific values. Template specializations reference enum values from different architectures (e.g., `glc_dlc` for GFX11). Due to C++ two-phase name lookup, non-dependent names are resolved during template parsing regardless of which architecture is being compiled, causing compilation failures when referenced values do not exist in the active preprocessor branch. - **Temporary Solution**: Added compatibility enum values to each architecture block. For example, I added `glc_dlc` in the `__gfx12__` block. I will create a ticket to refactor this enum with a design that has better scalability and tries to avoid the use of preprocessor conditionals. ## Test Plan ### Summary gtests were added to test wmma variants of Stream-K. These tests were stressed tested locally on gfx11 and gfx12. ### More details This PR makes the following changes/additions to the Stream-K gtests: - Split tests into MFMA (CDNA) and WMMA (RDNA) variants - Added 16 WMMA kernel types: FP16/BF16/FP8/BF8 × Linear/Tree reduction - WMMA uses 16×16×16 wave tiles for RDNA (this is the only tile size supported on RDNA) - Fixed RDNA WGP mode: multiply multiProcessorCount by 2 for actual CU count - As described in [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/doxygen/html/group___global_defs.html#ggacc0acd7b9bda126c6bb3dfd6e2796d7ca3ac50041beb59111a5c76edf03da0898), when in Workgroup Processor (WGP) mode, the value of `hipDeviceAttributeMultiprocessorCount` is half of CUs, because a single WGP contains two CUs. The default mode on RDNA is WGP mode, so when creating (M, N, K) instances for gtests using the CU count, we need to multiply the CU count by 2 to get the correct value. This is not needed in the kernel host code, because the occupancy ensures that overall `max_active_wgs` is correct. ## Test Result All tests pass locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
CK Tile Testing Guide
This document describes the test organization and available test targets for CK Tile operations.
Overview
CK Tile tests are organized with multiple levels of granularity to support different development workflows:
- Global test labels - Run tests across all operations
- Operation-specific umbrella targets - Run all tests for a specific operation
- Individual test executables - Run specific tests
Global Test Labels
These targets run tests across all CK operations (not just CK Tile):
ninja smoke
Run fast smoke tests (tests that complete within ~30 seconds on gfx90a).
ninja smoke
ninja regression
Run slower, more comprehensive regression tests.
ninja regression
ninja check
Run ALL available tests in the entire codebase.
ninja check
Operation-Specific Umbrella Targets
These targets allow you to run all tests for a specific CK Tile operation. This is useful when making changes to a particular operation and wanting to validate all related tests without running the entire test suite.
GEMM Operations
ck_tile_gemm_tests
Run all basic GEMM pipeline tests (memory, compute variants, persistent, etc.)
ninja ck_tile_gemm_tests
Test executables included:
test_ck_tile_gemm_pipeline_memtest_ck_tile_gemm_pipeline_compv3test_ck_tile_gemm_pipeline_compv4test_ck_tile_gemm_pipeline_persistenttest_ck_tile_gemm_pipeline_compv6test_ck_tile_gemm_pipeline_comp_async(gfx95 only)test_ck_tile_gemm_pipeline_*_wmmavariants (gfx11/gfx12 only)
ck_tile_gemm_block_scale_tests
Run all GEMM tests with block-scale quantization (AQuant, BQuant, ABQuant, etc.)
ninja ck_tile_gemm_block_scale_tests
Test executables included: 29 test executables covering:
- AQuant tests (memory pipelines, base layouts, prefill, preshuffle, transpose)
- ABQuant tests (base, padding, preshuffle)
- BQuant tests (1D/2D variants, transpose)
- BQuant with PreshuffleB (decode/prefill, 1D/2D)
- BQuant with PreshuffleQuant (decode/prefill, 1D/2D)
- RowColQuant and TensorQuant tests
ck_tile_gemm_streamk_tests
Run all GEMM StreamK tests (tile partitioner, reduction, smoke, extended)
ninja ck_tile_gemm_streamk_tests
Test executables included:
test_ck_tile_streamk_tile_partitionertest_ck_tile_streamk_reductiontest_ck_tile_streamk_smoketest_ck_tile_streamk_extended
ck_tile_grouped_gemm_quant_tests
Run all grouped GEMM quantization tests
ninja ck_tile_grouped_gemm_quant_tests
Test executables included:
test_ck_tile_grouped_gemm_quant_rowcoltest_ck_tile_grouped_gemm_quant_tensortest_ck_tile_grouped_gemm_quant_aquanttest_ck_tile_grouped_gemm_quant_bquanttest_ck_tile_grouped_gemm_quant_bquant_preshuffleb
Other Operations
ck_tile_fmha_tests
Run all FMHA (Flash Multi-Head Attention) tests
ninja ck_tile_fmha_tests
Test executables included: Forward and backward tests for fp16, bf16, fp8bf16, fp32
ck_tile_reduce_tests
Run all reduce operation tests
ninja ck_tile_reduce_tests
Test executables included:
test_ck_tile_reduce2dtest_ck_tile_multi_reduce2d_threadwisetest_ck_tile_multi_reduce2d_multiblock
Individual Test Executables
You can also build and run individual test executables:
Build a specific test
ninja test_ck_tile_gemm_pipeline_mem
Run a specific test directly
./build/bin/test_ck_tile_gemm_pipeline_mem
Run a specific test through ctest
ctest -R test_ck_tile_gemm_pipeline_mem --output-on-failure