[CK Tile] Async support pipeline V3 ## Motivation Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async pipeline) ## Technical Details - Add `Async` bool to `Problem` struct to enable async pipeline in existing one - Add `static_move_ys` to load transpose. This generates offset in assembly instructions saving registers - Add `is_valid` to `async_get_vectorized_elements`. Before hard coded to true. It allows to support padding - Remove unnecessary restrictions to `is_a_load_tr` and `is_b_load_tr` (wider use of lds load transpose on gfx950) - Integrate async support in existing V3 pipeline (avoid pipelines duplication) - Create policy to support both async and default cases. This could be used by any async pipeline (next steps) - Define `wg_attr_num_access` separately for A and B. This allows to optimize ds_read instruction width for cases when one matrix is transposed and the other is not. Before in such cases, `ds_read_b64` was used instead of `ds_read_b128` - Add test for V3 async. Currently only supporting cases with A and B having the same type ## Test Plan New test `test_ck_tile_gemm_pipeline_compv3_async` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
CK Tile Testing Guide
This document describes the test organization and available test targets for CK Tile operations.
Overview
CK Tile tests are organized with multiple levels of granularity to support different development workflows:
- Global test labels - Run tests across all operations
- Operation-specific umbrella targets - Run all tests for a specific operation
- Individual test executables - Run specific tests
Global Test Labels
These targets run tests across all CK operations (not just CK Tile):
ninja smoke
Run fast smoke tests (tests that complete within ~30 seconds on gfx90a).
ninja smoke
ninja regression
Run slower, more comprehensive regression tests.
ninja regression
ninja check
Run ALL available tests in the entire codebase.
ninja check
Operation-Specific Umbrella Targets
These targets allow you to run all tests for a specific CK Tile operation. This is useful when making changes to a particular operation and wanting to validate all related tests without running the entire test suite.
GEMM Operations
ck_tile_gemm_tests
Run all basic GEMM pipeline tests (memory, compute variants, persistent, etc.)
ninja ck_tile_gemm_tests
Test executables included:
test_ck_tile_gemm_pipeline_memtest_ck_tile_gemm_pipeline_compv3test_ck_tile_gemm_pipeline_compv4test_ck_tile_gemm_pipeline_persistenttest_ck_tile_gemm_pipeline_compv6test_ck_tile_gemm_pipeline_comp_async(gfx95 only)test_ck_tile_gemm_pipeline_*_wmmavariants (gfx11/gfx12 only)
ck_tile_gemm_block_scale_tests
Run all GEMM tests with block-scale quantization (AQuant, BQuant, ABQuant, etc.)
ninja ck_tile_gemm_block_scale_tests
Test executables included: 29 test executables covering:
- AQuant tests (memory pipelines, base layouts, prefill, preshuffle, transpose)
- ABQuant tests (base, padding, preshuffle)
- BQuant tests (1D/2D variants, transpose)
- BQuant with PreshuffleB (decode/prefill, 1D/2D)
- BQuant with PreshuffleQuant (decode/prefill, 1D/2D)
- RowColQuant and TensorQuant tests
ck_tile_gemm_streamk_tests
Run all GEMM StreamK tests (tile partitioner, reduction, smoke, extended)
ninja ck_tile_gemm_streamk_tests
Test executables included:
test_ck_tile_streamk_tile_partitionertest_ck_tile_streamk_reductiontest_ck_tile_streamk_smoketest_ck_tile_streamk_extended
ck_tile_grouped_gemm_quant_tests
Run all grouped GEMM quantization tests
ninja ck_tile_grouped_gemm_quant_tests
Test executables included:
test_ck_tile_grouped_gemm_quant_rowcoltest_ck_tile_grouped_gemm_quant_tensortest_ck_tile_grouped_gemm_quant_aquanttest_ck_tile_grouped_gemm_quant_bquanttest_ck_tile_grouped_gemm_quant_bquant_preshuffleb
Other Operations
ck_tile_fmha_tests
Run all FMHA (Flash Multi-Head Attention) tests
ninja ck_tile_fmha_tests
Test executables included: Forward and backward tests for fp16, bf16, fp8bf16, fp32
ck_tile_reduce_tests
Run all reduce operation tests
ninja ck_tile_reduce_tests
Test executables included:
test_ck_tile_reduce2dtest_ck_tile_multi_reduce2d_threadwisetest_ck_tile_multi_reduce2d_multiblock
Individual Test Executables
You can also build and run individual test executables:
Build a specific test
ninja test_ck_tile_gemm_pipeline_mem
Run a specific test directly
./build/bin/test_ck_tile_gemm_pipeline_mem
Run a specific test through ctest
ctest -R test_ck_tile_gemm_pipeline_mem --output-on-failure