Files
composable_kernel/test/CMakeLists.txt
Christopher Millette bd6070fb5c Compile-time optimize threadwise slice transfer (#4673)
## Motivation

Profiling with `-ftime-trace` on representative translation units (e.g.,

`device_grouped_conv2d_fwd_xdl_nhwgc_gkyxc_nhwgk_f16_comp_instance.cpp`)
revealed
that **92% of frontend time was spent in template instantiation**. The
primary
bottleneck was redundant instantiation of identical helper logic across
multiple
threadwise transfer class variants.

Each `ThreadwiseTensorSliceTransfer_v*` class independently contained
its own
copy of the same helper computations — serpentine traversal, coordinate
stepping, thread scratch descriptors, lambda-like functors, and
compile-time
constants — duplicated across 13 header files. When a typical GEMM or
convolution kernel TU includes blockwise operations (e.g.,
`blockwise_gemm_xdlops.hpp`), it pulls in multiple transfer variants
simultaneously, causing the compiler to instantiate the same helper
logic
multiple times with the same template arguments.

This was compounded by the helpers being defined as members of the outer
`ThreadwiseTensorSliceTransfer_v*` classes, which carry 14+ template
parameters.
Functions like `ComputeForwardSweep` depend only on their two argument
types,
but as inline members of the outer class, the compiler was forced to
create
separate instantiations for every unique combination of all outer
parameters
(data types, descriptors, vector widths, etc.) — even when most of those
parameters had no effect on the helper's output.

## Technical Details

### The Fix: Shared Helper Struct Hierarchy

Duplicated logic was extracted into a standalone helper hierarchy in
`threadwise_tensor_slice_transfer_util.hpp`:

```
ThreadwiseTransferHelper_Base          (I0..I16, MoveSliceWindow, ComputeThreadScratchDescriptor,
|                                       ComputeForwardSteps, ComputeBackwardSteps, MakeVectorContainerTuple)
+-- ThreadwiseTransferHelper_Serpentine (ComputeForwardSweep, ComputeMoveOnDim, ComputeDataIndex,
|                                       ComputeCoordinateResetStep, VectorSizeLookupTable, VectorOffsetsLookupTable)
+-- ThreadwiseTransferHelper_SFC       (ComputeSFCCoordinateResetStep)
```

Each helper method is now parameterized **only by what it actually
uses**:

- `ComputeForwardSweep(idx, lengths)` — parameterized only by the two
argument
  types, not by `SrcData`, `DstData`, `SrcDesc`, etc.
- `ComputeForwardSteps(desc, scalar_per_access)` — parameterized only by
the
  descriptor and access sequence types.
- `ComputeCoordinateResetStep<SliceLengths, VectorDim, ScalarPerVector,
DimAccessOrder>()` — parameterized only by the four values it actually
needs.

This reduces template instantiation work in two ways:
1. **Across different transfer variants** (v3r1 vs v3r2 vs v3r1_gather):
the
compiler reuses a single instantiation instead of creating one per
variant.
2. **Across different outer class instantiations** (fp16 vs bf16 vs
int8): the
compiler reuses the helper instantiation because the helper doesn't
depend
   on the data type at all.

### Refactored Headers

**13 headers** now delegate to the shared helpers instead of duplicating
logic:
- Serpentine family: v3r1, v3r2, v3r1_gather, v3r1_dequant
- SFC family: v6r1, v6r1r2, v6r2, v6r3, v7r2, v7r3, v7r3_scatter
- Dead code removed: v4r1, v5r1

### Additional Fixes Found During Refactoring

- Two latent bugs in v3r2 (`forward_sweep` indexing,
`GetDstCoordinateResetStep` extraction)
- Dead `SrcCoordStep` variables in v4r1 and v5r1
- Unused `scale_element_op_` member in v3r1_dequant (restored with note)

### Net Code Change

+1,428 / -2,297 lines (~870 lines removed).

## Test Plan

### Unit Tests

28 host-side gtests in
`test/threadwise_transfer_helper/test_threadwise_transfer_helper.cpp`
covering the full helper hierarchy:

| Suite | Tests | What is verified |
|-------|-------|------------------|
| ThreadwiseTransferHelperBase | 6 | Compile-time constants,
inheritance, `MoveSliceWindow` with `ResetCoordinateAfterRun` true/false
in 2D and 3D |
| ThreadwiseTransferHelperSerpentine | 9 | `ComputeForwardSweep`
(even/odd row, 1D), `ComputeMoveOnDim` (inner complete/incomplete),
`ComputeDataIndex`, `ComputeCoordinateResetStep`,
`VectorSizeLookupTable`, `VectorOffsetsLookupTable` |
| ThreadwiseTransferHelperSFC | 6 | `ComputeSFCCoordinateResetStep` —
single access, 2D row-major, 2D column-major, 3D batch, even/odd inner
access counts |
| ThreadwiseTransferHelperInheritance | 3 | Serpentine and SFC derive
from Base, are not related to each other |
| DetailFunctors | 4 | `lambda_scalar_per_access`,
`lambda_scalar_step_in_vector`,
`lambda_scalar_per_access_for_src_and_dst` (same dim, different dims) |

### Semantic Equivalence

GPU ISA comparison using `--cuda-device-only -S` confirmed identical
assembly
output (modulo `__hip_cuid_*` metadata) between baseline and refactored
code.

## Test Results

All measurements on a 384-core machine, `-j64`, freshly rebooted,
near-idle.

### Targeted Builds (affected targets only)

| Target | Baseline | Refactored | Wall-clock Delta | CPU Delta |
|--------|----------|------------|-----------------|-----------|
| `device_grouped_conv2d_fwd_instance` (160 TUs) | 7m 37s / 189m CPU |
6m 53s / 161m CPU | **-9.7%** | **-14.9%** |
| `device_grouped_conv3d_fwd_instance` (185 TUs) | 9m 49s / 202m CPU |
6m 42s / 182m CPU | **-31.8%** | **-10.0%** |
| **Combined** | **17m 27s / 392m CPU** | **13m 35s / 344m CPU** |
**-22.2%** | **-12.4%** |

### Full Project Build (8,243 targets)

| Metric | Baseline | Refactored | Delta |
|--------|----------|------------|-------|
| Wall-clock | 103m 38s | 111m 56s | +8.0%* |
| CPU time | 4705m 7s | 4648m 17s | **-1.2%** |

\*Wall-clock inflated by external load spike during refactored build
(load 90 vs 66). CPU time is the reliable metric.

### Context

~15% of all build targets (1,262 / 8,243) transitively include the
modified
headers. These are primarily GEMM and convolution kernel instantiations
— the
core compute workloads. The 12-15% CPU savings on affected targets is
diluted
to 1.2% across the full project because 85% of targets are unaffected.


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 09:26:40 -07:00

328 lines
13 KiB
CMake

# Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
# SPDX-License-Identifier: MIT
include_directories(BEFORE
${PROJECT_SOURCE_DIR}/
${PROJECT_SOURCE_DIR}/profiler/include
)
include(gtest)
add_custom_target(tests)
# list of tests that are labelled as REGRESSION_TEST for make regression (runtime more than 30 seconds)
# all other tests are labelled as SMOKE_TEST
set(REGRESSION_TESTS
test_gemm_standalone_xdl_fp16
test_gemm_fp16
test_gemm_splitk
test_gemm_universal_wmma_fp16
test_gemm_universal_xdl_fp16
test_gemm_universal_streamk_fp16
test_gemm_universal_streamk_bf16
test_gemm_universal_streamk_fp8
test_batched_gemm_softmax_gemm_fp16
test_batched_gemm_softmax_gemm_permute_fp16
test_batched_gemm_bias_softmax_gemm_permute_fp16
test_batched_gemm_softmax_gemm_permute_bf16
test_batched_gemm_bias_softmax_gemm_permute_bf16
test_grouped_gemm_splitk
test_batched_gemm_b_scale_wmma
test_reduce_no_index
test_reduce_with_index
test_convnd_fwd
test_convnd_bwd_data
test_grouped_convnd_fwd
test_grouped_convnd_fwd_bias_bnorm_clamp
test_grouped_convnd_fwd_scaleadd_ab
test_grouped_convnd_bwd_weight
test_softmax_rank3
test_softmax_rank4
test_batchnorm_fwd_rank_4
test_batchnorm_bwd_rank_4
test_grouped_convnd_bwd_data_xdl
test_grouped_convnd_bwd_data_wmma
test_grouped_convnd_bwd_data_wmma_large_cases
test_grouped_conv_bwd_data_wmma_scale
test_conv_tensor_rearrange
test_gemm_mx
test_ck_tile_batched_transpose
test_ck_tile_fmha_bwd_fp32
test_ck_tile_fmha_bwd_bf16
test_ck_tile_fmha_bwd_fp16
test_ck_tile_fmha_fwd_fp32
test_ck_tile_fmha_fwd_bf16
test_ck_tile_fmha_fwd_fp16
test_ck_tile_fmha_fwd_fp8
test_ck_tile_streamk_extended
)
function(add_test_executable TEST_NAME)
message(DEBUG "adding test ${TEST_NAME}")
set(result 1)
if(DEFINED DTYPES)
foreach(source IN LISTS ARGN)
get_filename_component(source_name ${source} NAME)
set(test 0)
if((source_name MATCHES "_fp16|_f16") AND NOT "fp16" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_fp32|_f32") AND NOT "fp32" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_tf32|_tf32") AND NOT "tf32" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_fp64|_f64") AND NOT "fp64" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_fp8|_f8") AND NOT "fp8" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_bf8|_bf8") AND NOT "bf8" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_bf16|_b16") AND NOT "bf16" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_int8|_i8") AND NOT "int8" IN_LIST DTYPES)
set(test 1)
endif()
if(test EQUAL 1)
message(DEBUG "removing test ${source} ")
list(REMOVE_ITEM ARGN "${source}")
endif()
endforeach()
endif()
set(TEST_TARGETS ${SUPPORTED_GPU_TARGETS})
foreach(source IN LISTS ARGN)
get_filename_component(source_name ${source} NAME)
if(NOT DEFINED DPP_KERNELS AND source_name MATCHES "_dpp")
message(DEBUG "removing dpp test ${source} ")
list(REMOVE_ITEM ARGN "${source}")
endif()
if(NOT DEFINED DL_KERNELS AND source_name MATCHES "_dl")
message(DEBUG "removing dl test ${source} ")
list(REMOVE_ITEM ARGN "${source}")
endif()
if(NOT TEST_TARGETS MATCHES "gfx9|gfx11|gfx12" AND source_name MATCHES "xdl")
message(DEBUG "removing xdl test ${source} ")
list(REMOVE_ITEM ARGN "${source}")
endif()
if(NOT TEST_TARGETS MATCHES "gfx11|gfx12" AND source_name MATCHES "wmma")
message(DEBUG "removing wmma test ${source} ")
list(REMOVE_ITEM ARGN "${source}")
endif()
endforeach()
#only continue if there are some source files left on the list
set(source_name_list "")
foreach(source IN LISTS ARGN)
get_filename_component(source_name ${source} NAME)
list(APPEND source_name_list ${source_name})
endforeach()
if(ARGN)
if(source_name_list MATCHES "_xdl")
list(REMOVE_ITEM TEST_TARGETS gfx900 gfx906 gfx906:xnack- gfx1030 gfx10-3-generic)
elseif(source_name_list MATCHES "_wmma")
list(REMOVE_ITEM TEST_TARGETS gfx900 gfx906 gfx906:xnack- gfx908:xnack+ gfx908:xnack- gfx90a:xnack+ gfx90a:xnack- gfx908 gfx90a gfx942 gfx1030 gfx950)
elseif(source_name_list MATCHES "_smfmac")
list(REMOVE_ITEM TEST_TARGETS gfx900 gfx906 gfx906:xnack- gfx1030 gfx1100 gfx1101 gfx1102 gfx1103 gfx1150 gfx1151 gfx1152 gfx1153 gfx908 gfx90a gfx1200 gfx1201 gfx10-3-generic gfx11-generic gfx12-generic)
endif()
set_source_files_properties(${ARGN} PROPERTIES LANGUAGE HIP)
add_executable(${TEST_NAME} ${ARGN})
set_property(TARGET ${TEST_NAME} PROPERTY HIP_ARCHITECTURES ${TEST_TARGETS} )
target_link_libraries(${TEST_NAME} PRIVATE getopt::getopt)
add_test(NAME ${TEST_NAME} COMMAND $<TARGET_FILE:${TEST_NAME}>)
add_dependencies(tests ${TEST_NAME})
add_dependencies(check ${TEST_NAME})
rocm_install(TARGETS ${TEST_NAME} COMPONENT tests)
set(result 0)
endif()
message(DEBUG "add_test returns ${result}")
set(result ${result} PARENT_SCOPE)
if(result EQUAL 0 AND NOT "${TEST_NAME}" IN_LIST REGRESSION_TESTS)
set_tests_properties(${TEST_NAME} PROPERTIES LABELS "SMOKE_TEST")
add_dependencies(smoke ${TEST_NAME})
elseif(result EQUAL 0 AND "${TEST_NAME}" IN_LIST REGRESSION_TESTS)
set_tests_properties(${TEST_NAME} PROPERTIES LABELS "REGRESSION_TEST")
add_dependencies(regression ${TEST_NAME})
endif()
endfunction()
function(add_gtest_executable TEST_NAME)
message(DEBUG "adding gtest ${TEST_NAME}")
set(result 1)
if(DEFINED DTYPES)
foreach(source IN LISTS ARGN)
get_filename_component(source_name ${source} NAME)
set(test 0)
if((source_name MATCHES "_fp16|_f16") AND NOT "fp16" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_fp32|_f32") AND NOT "fp32" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_tf32|_tf32") AND NOT "tf32" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_fp64|_f64") AND NOT "fp64" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_fp8|_f8") AND NOT "fp8" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_bf8|_bf8") AND NOT "bf8" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_bf16|_b16") AND NOT "bf16" IN_LIST DTYPES)
set(test 1)
endif()
if((source_name MATCHES "_int8|_i8") AND NOT "int8" IN_LIST DTYPES)
set(test 1)
endif()
if(test EQUAL 1)
message(DEBUG "removing gtest ${source} ")
list(REMOVE_ITEM ARGN "${source}")
endif()
endforeach()
endif()
set(TEST_TARGETS ${SUPPORTED_GPU_TARGETS})
foreach(source IN LISTS ARGN)
get_filename_component(source_name ${source} NAME)
if(NOT DEFINED DL_KERNELS AND source_name MATCHES "_dl")
message(DEBUG "removing dl test ${source} ")
list(REMOVE_ITEM ARGN "${source}")
endif()
if(NOT TEST_TARGETS MATCHES "gfx9|gfx11|gfx12" AND source_name MATCHES "xdl")
message(DEBUG "removing xdl test ${source} ")
list(REMOVE_ITEM ARGN "${source}")
endif()
if(NOT TEST_TARGETS MATCHES "gfx95" AND source_name MATCHES "mx_")
message(DEBUG "removing microscaling test ${source} ")
list(REMOVE_ITEM ARGN "${source}")
endif()
if(NOT TEST_TARGETS MATCHES "gfx11|gfx12" AND source_name MATCHES "wmma")
message(DEBUG "removing wmma test ${source} ")
list(REMOVE_ITEM ARGN "${source}")
endif()
endforeach()
#only continue if there are some source files left on the list
set(source_name_list "")
foreach(source IN LISTS ARGN)
get_filename_component(source_name ${source} NAME)
list(APPEND source_name_list ${source_name})
endforeach()
if(ARGN)
if(source_name_list MATCHES "_xdl")
list(REMOVE_ITEM TEST_TARGETS gfx900 gfx906 gfx906:xnack- gfx1030 gfx10-3-generic)
elseif(source_name_list MATCHES "_wmma")
list(REMOVE_ITEM TEST_TARGETS gfx900 gfx906 gfx906:xnack- gfx908:xnack+ gfx908:xnack- gfx90a:xnack+ gfx90a:xnack- gfx908 gfx90a gfx942 gfx1030 gfx950)
elseif(source_name_list MATCHES "_smfmac")
list(REMOVE_ITEM TEST_TARGETS gfx900 gfx906 gfx906:xnack- gfx1030 gfx1100 gfx1101 gfx1102 gfx1103 gfx1150 gfx1151 gfx1152 gfx1153 gfx908 gfx90a gfx1200 gfx1201 gfx10-3-generic gfx11-generic gfx12-generic)
elseif(source_name_list MATCHES "_mx") #only build mx example for gfx950
list(REMOVE_ITEM TEST_TARGETS gfx900 gfx906 gfx906:xnack- gfx908:xnack+ gfx908:xnack- gfx90a:xnack+ gfx90a:xnack- gfx908 gfx90a gfx942 gfx1030 gfx1100 gfx1101 gfx1102 gfx1103 gfx1150 gfx1151 gfx1152 gfx1153 gfx1200 gfx1201 gfx10-3-generic gfx11-generic gfx12-generic)
endif()
set_source_files_properties(${ARGN} PROPERTIES LANGUAGE HIP)
add_executable(${TEST_NAME} ${ARGN})
set_property(TARGET ${TEST_NAME} PROPERTY HIP_ARCHITECTURES ${TEST_TARGETS} )
add_dependencies(tests ${TEST_NAME})
add_dependencies(check ${TEST_NAME})
# suppress gtest warnings
target_compile_options(${TEST_NAME} PRIVATE -Wno-global-constructors -Wno-undef)
target_link_libraries(${TEST_NAME} PRIVATE gtest_main getopt::getopt)
add_test(NAME ${TEST_NAME} COMMAND $<TARGET_FILE:${TEST_NAME}>)
rocm_install(TARGETS ${TEST_NAME} COMPONENT tests)
set(result 0)
endif()
message(DEBUG "add_gtest returns ${result}")
set(result ${result} PARENT_SCOPE)
if(result EQUAL 0 AND NOT "${TEST_NAME}" IN_LIST REGRESSION_TESTS)
set_tests_properties(${TEST_NAME} PROPERTIES LABELS "SMOKE_TEST")
add_dependencies(smoke ${TEST_NAME})
elseif(result EQUAL 0 AND "${TEST_NAME}" IN_LIST REGRESSION_TESTS)
set_tests_properties(${TEST_NAME} PROPERTIES LABELS "REGRESSION_TEST")
add_dependencies(regression ${TEST_NAME})
endif()
endfunction()
add_compile_options(-Wno-c++20-extensions)
add_subdirectory(ck_tile)
add_subdirectory(magic_number_division)
add_subdirectory(space_filling_curve)
add_subdirectory(threadwise_transfer_helper)
add_subdirectory(conv_util)
add_subdirectory(reference_conv_fwd)
add_subdirectory(gemm)
add_subdirectory(gemm_add)
add_subdirectory(gemm_bias_add_reduce)
add_subdirectory(gemm_blockscale_wp)
add_subdirectory(gemm_layernorm)
add_subdirectory(gemm_multi_abd)
add_subdirectory(gemm_multiply_multiply_wp)
add_subdirectory(gemm_split_k)
add_subdirectory(gemm_universal)
add_subdirectory(gemm_universal_preshuffle)
add_subdirectory(gemm_ab_scale)
add_subdirectory(gemm_b_scale)
add_subdirectory(gemm_universal_streamk)
add_subdirectory(gemm_reduce)
add_subdirectory(gemm_universal_reduce)
add_subdirectory(batched_contraction)
add_subdirectory(batched_gemm)
add_subdirectory(batched_gemm_reduce)
add_subdirectory(batched_gemm_gemm)
add_subdirectory(batched_gemm_multiple_d_gemm_multiple_d)
add_subdirectory(batched_gemm_softmax_gemm)
add_subdirectory(batched_gemm_softmax_gemm_permute)
add_subdirectory(batched_gemm_b_scale)
add_subdirectory(grouped_gemm)
add_subdirectory(grouped_gemm_tile_loop)
add_subdirectory(reduce)
add_subdirectory(convnd_fwd)
add_subdirectory(convnd_bwd_data)
add_subdirectory(grouped_convnd_fwd)
add_subdirectory(grouped_convnd_fwd_activation)
add_subdirectory(grouped_convnd_bwd_weight)
add_subdirectory(block_to_ctile_map)
add_subdirectory(softmax)
add_subdirectory(normalization_fwd)
add_subdirectory(normalization_bwd_data)
add_subdirectory(normalization_bwd_gamma_beta)
add_subdirectory(data_type)
add_subdirectory(elementwise_normalization)
add_subdirectory(batchnorm)
add_subdirectory(contraction)
add_subdirectory(pool)
add_subdirectory(batched_gemm_multi_d)
add_subdirectory(grouped_convnd_bwd_data)
add_subdirectory(conv_tensor_rearrange)
add_subdirectory(transpose)
add_subdirectory(permute_scale)
add_subdirectory(wrapper)
add_subdirectory(quantization)
add_subdirectory(device_memory)
if(SUPPORTED_GPU_TARGETS MATCHES "gfx11")
add_subdirectory(wmma_op)
endif()
if(SUPPORTED_GPU_TARGETS MATCHES "gfx942" OR SUPPORTED_GPU_TARGETS MATCHES "gfx950") # smfmac needs ROCm6.2
add_subdirectory(smfmac_op)
endif()
if(SUPPORTED_GPU_TARGETS MATCHES "gfx950")
add_subdirectory(mx_mfma_op)
add_subdirectory(gemm_mx)
endif()
if(SUPPORTED_GPU_TARGETS MATCHES "gfx12")
add_subdirectory(s_prefetch_op)
endif()
add_subdirectory(position_embedding)
add_subdirectory(scatter_gather)
add_subdirectory(gpu_reference)
add_subdirectory(util)
add_subdirectory(gpu_verification)