=?UTF-8?q?[CK=20TILE]=20Unification=20Work=20=E2=80=93=20?= =?UTF-8?q?More=20accurate=20tests=20for=20MmaPipelines=20(#6212)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation This PR solves several issues: #### More accurate tests for MmaPipelines The current tests for the MmaPipelines (test_amdgcn_sparse_mma, test_amdgcn_wavewise_mma) use explicit input fragment vectors filled with 1s, and only check the output of a single lane. We should have tests that actually use the MmaPipelines with non-trivial input matrices and verify the complete output. Some other aspects of the current MmaPipelines tests that I noticed and deserve some attention: 1. There is sometimes iteration over K outside of the pipeline, which is then included in WaveTileK or FragK, which is not correct. We should remove it, move K iteration inside of the pipeline, or be more clear about this outer-K loop size and how it propagates downwards. 2. There is very tight coupling between the kernel, gtest code, and test_pipeline helper, requiring a lot of information and functions to be passed back and forth. 3. The test_pipeline helper is doing a bunch of register-related logic on the host (related to point 1) 4. Without this register logic the only thing it does is check the device, call the kernel, and check the output, but with a lot of boilerplate. #### Test helper for detecting target arch at HOST runtime There is a really apparent issue we faced while writing tests: Scenario: 1. Compile a test that supports both gfx950 and gfx1201 for gfx950 2. Run the test on a server that only has gfx1201 GPU Actual: Segmentation fault Expected: The test can correctly detect from HOST runtime that the DEVICE target_id was different and skips the test. Notes: The only way of detecting the COMPILER_TARGET_ID in the existing "arch" framework is launching a kernel and calling `get_compiler_target()` (so, from a DEVICE code). This will create a segmentation fault if the current arch differs from the target arch. To cope with this issue, we propose to export the compiler target(s) (note they can be many) through `projects/composablekernel/test/ck_tile/core/arch/CMakeLists.txt` and define a test helper to deal with such cases. #### Add composition support to Transforms We have a small number of Transforms which act on MmaOp input and output data, before and after the MmaOp call respectively. These are currently implemented to work on an MmaTile level, but in theory they are also supposed to work at a WaveTile level, i.e. after composition of multiple MmaTiles to create larger effective MNK dimensions. Currently the composed MmaTiles look like 2D C-style arrays of the individual MmaTile level register vectors (see WaveWiseMmaPipeline). The transforms should be able to take these and perform the proper transforms to the whole WaveTile at once. This might allow for better performing transformations. Note: This PR handles the SparseTransform case and if we don't end up doing scale as a transformation, there isn't really much left to do. If we end up having only the sparse transform as a non-trivial transform, then we could also consider removing the Transform framework.
CK Tile Testing Guide
This document describes the test organization and available test targets for CK Tile operations.
Overview
CK Tile tests are organized with multiple levels of granularity to support different development workflows:
- Global test labels - Run tests across all operations
- Operation-specific umbrella targets - Run all tests for a specific operation
- Individual test executables - Run specific tests
Global Test Labels
These targets run tests across all CK operations (not just CK Tile):
ninja smoke
Run fast smoke tests (tests that complete within ~30 seconds on gfx90a).
ninja smoke
ninja regression
Run slower, more comprehensive regression tests.
ninja regression
ninja check
Run ALL available tests in the entire codebase.
ninja check
Operation-Specific Umbrella Targets
These targets allow you to run all tests for a specific CK Tile operation. This is useful when making changes to a particular operation and wanting to validate all related tests without running the entire test suite.
GEMM Operations
ck_tile_gemm_tests
Run all basic GEMM pipeline tests (memory, compute variants, persistent, etc.)
ninja ck_tile_gemm_tests
Test executables included:
test_ck_tile_gemm_pipeline_memtest_ck_tile_gemm_pipeline_compv3test_ck_tile_gemm_pipeline_compv4test_ck_tile_gemm_pipeline_persistenttest_ck_tile_gemm_pipeline_compv6test_ck_tile_gemm_pipeline_comp_async(gfx95 only)test_ck_tile_gemm_pipeline_*_wmmavariants (gfx11/gfx12 only)
ck_tile_gemm_block_scale_tests
Run all GEMM tests with block-scale quantization (AQuant, BQuant, ABQuant, etc.)
ninja ck_tile_gemm_block_scale_tests
Test executables included: 29 test executables covering:
- AQuant tests (memory pipelines, base layouts, prefill, preshuffle, transpose)
- ABQuant tests (base, padding, preshuffle)
- BQuant tests (1D/2D variants, transpose)
- BQuant with PreshuffleB (decode/prefill, 1D/2D)
- BQuant with PreshuffleQuant (decode/prefill, 1D/2D)
- RowColQuant and TensorQuant tests
ck_tile_gemm_streamk_tests
Run all GEMM StreamK tests (tile partitioner, reduction, smoke, extended)
ninja ck_tile_gemm_streamk_tests
Test executables included:
test_ck_tile_streamk_tile_partitionertest_ck_tile_streamk_reductiontest_ck_tile_streamk_smoketest_ck_tile_streamk_extended
ck_tile_grouped_gemm_quant_tests
Run all grouped GEMM quantization tests
ninja ck_tile_grouped_gemm_quant_tests
Test executables included:
test_ck_tile_grouped_gemm_quant_rowcoltest_ck_tile_grouped_gemm_quant_tensortest_ck_tile_grouped_gemm_quant_aquanttest_ck_tile_grouped_gemm_quant_bquanttest_ck_tile_grouped_gemm_quant_bquant_preshuffleb
Other Operations
ck_tile_fmha_tests
Run all FMHA (Flash Multi-Head Attention) tests
ninja ck_tile_fmha_tests
Test executables included: Forward and backward tests for fp16, bf16, fp8bf16, fp32
ck_tile_reduce_tests
Run all reduce operation tests
ninja ck_tile_reduce_tests
Test executables included:
test_ck_tile_reduce2dtest_ck_tile_multi_reduce2d_threadwisetest_ck_tile_multi_reduce2d_multiblock
Individual Test Executables
You can also build and run individual test executables:
Build a specific test
ninja test_ck_tile_gemm_pipeline_mem
Run a specific test directly
./build/bin/test_ck_tile_gemm_pipeline_mem
Run a specific test through ctest
ctest -R test_ck_tile_gemm_pipeline_mem --output-on-failure