mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-07-01 20:27:42 +00:00
Using named functors instead of lambdas

## Motivation

Currently, block-level GEMM pipelines contain significant code repetition for prefetching and tail handling, where each lambda function creates a unique instantiation at every call site. This includes repeated static_for instantiations and large loops such as MRepeat. Each repetition results in additional instantiations, which increases compilation time and binary bloat.

## Technical Details

Refactor repeated code blocks into named functors so the compiler can reuse already instantiated code instead of generating multiple copies.

Scope of changes:

1. WMMAOPS pipeline internals:
   - projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_base.hpp
   - projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_v1.hpp
   - projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_v3.hpp
2. XDLOPS and preshuffle pipeline variants across projects/composablekernel/include/ck/tensor_operation/gpu/block (v1/v2/v3/v4/v5, scale, dequant, gufusion, moe, mx, blockscale, skip-b-lds, dpp, xdlops)

Shared functor file: projects/composablekernel/include/ck/utility/vector_load_functor.hpp

## Test Plan

Note that the compilation traces produced by -ftime-trace do not report unnamed lambda instantiations, so a clear baseline for instantiation counts cannot be established. As a result, the impact of this change will be evaluated based on runtime performance rather than direct instantiation-count comparisons.

## Test Result

The effect was measured by timing the compilation of a single HIP object from an example (grouped_gemm_wmma_splitk_fp16.cpp). Averaged over 100 compilations, the user time was:

- Mean compile time before the changes: 37.734 s
- Mean compile time after: 32.087 s
- Speedup: 17.6%

Ran a full CK compilation on Alola with the following results:

| Metric | Before (min) | After (min) | Absolute Reduction (min) | % Reduction |
| ------ | ------------ | ----------- | ------------------------ |
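
The idea in Technical Details can be illustrated with a minimal, standalone C++17 sketch. It does not use any Composable Kernel code: `repeat` is a hypothetical stand-in for a templated loop helper such as static_for, and `AccumulateIndex` is a hypothetical named functor, not one of the functors in vector_load_functor.hpp. The sketch only shows why every lambda expression forces a fresh template instantiation while a named functor type is instantiated once and reused.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical loop helper (assumption, not CK's static_for):
// calls f(0), f(1), ..., f(N-1). The compiler emits one copy per distinct F.
template <int N, typename F>
constexpr void repeat(F f)
{
    for(int i = 0; i < N; ++i)
        f(i);
}

// Hypothetical named functor: one type, however many call sites use it.
struct AccumulateIndex
{
    int* sum;
    constexpr void operator()(int i) const { *sum += i; }
};

// Two textually identical lambdas still have two distinct closure types.
constexpr auto lambda_a = [](int i) { return i; };
constexpr auto lambda_b = [](int i) { return i; };
static_assert(!std::is_same_v<decltype(lambda_a), decltype(lambda_b)>);

constexpr int sum_with_functor()
{
    int prologue = 0;
    int tail     = 0;
    repeat<4>(AccumulateIndex{&prologue}); // instantiates repeat<4, AccumulateIndex>
    repeat<4>(AccumulateIndex{&tail});     // reuses that same instantiation
    return prologue + tail;
}

constexpr int sum_with_lambdas()
{
    int prologue = 0;
    int tail     = 0;
    repeat<4>([&](int i) { prologue += i; }); // instantiates repeat for closure type #1
    repeat<4>([&](int i) { tail += i; });     // instantiates repeat for closure type #2
    return prologue + tail;
}

// Both variants compute the same result; only the number of instantiations differs.
static_assert(sum_with_functor() == 12);
static_assert(sum_with_lambdas() == 12);

// Runtime usage example.
inline void run_checks()
{
    assert(sum_with_functor() == sum_with_lambdas());
}
```

In this sketch the lambda version needs two instantiations of `repeat<4, F>` for two call sites, while the functor version needs one. The same effect compounds when the callable is passed through several layers of templates, which is the situation the Motivation section describes.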
27 lines
867 B
CMake
# Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
# SPDX-License-Identifier: MIT

add_gtest_executable(unit_sequence unit_sequence.cpp)
if(result EQUAL 0)
    target_link_libraries(unit_sequence PRIVATE utility)
endif()

add_gtest_executable(unit_ford unit_ford.cpp)
if(result EQUAL 0)
    target_link_libraries(unit_ford PRIVATE utility)
endif()

add_gtest_executable(unit_sequence_helper unit_sequence_helper.cpp)
if(result EQUAL 0)
    target_link_libraries(unit_sequence_helper PRIVATE utility)
endif()

add_gtest_executable(unit_tensor_descriptor_functors unit_tensor_descriptor_functors.cpp)
if(result EQUAL 0)
    target_link_libraries(unit_tensor_descriptor_functors PRIVATE utility)
endif()

add_gtest_executable(unit_index_expression unit_index_expression.cpp)
if(result EQUAL 0)
    target_link_libraries(unit_index_expression PRIVATE utility)
endif()