mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-21 05:19:20 +00:00
* Convolution bwd weight device implementation
* Merge branch 'grouped_conv_bwd_weight_device_impl_wmma' into 'feature/conv_bwd_weight_wmma'
Convolution bwd weight device implementation
See merge request amd/ai/composable_kernel!38
* Fix bug and disable splitK=-1 tests for wmma
* Add generic instances for bf16 f32 bf16
* check gridwise level validity in device impl for 1 stage D0
* Fix bugs in device implementation:
- rdna3 compilation error
- gridwise layouts (need to be correct to ensure that CheckValidity() works correctly)
* Add padding in conv to gemm transformers for 1x1Stride1Pad0 specialization
* Remove workaround for 1x1Stride1Pad0 conv specialization
* Add instances for xdl parity (for pipeline v1)
* Add two stage instances (xdl parity)
* Add multiple Ds instances
* Add examples
* Uncomment scale instances
* Fix copyright
* Fix examples compilation
* Add atomic add float4
* Fix compilation error
* Fix instances
* Compute tolerances in examples instead of using default ones
* Compute tolerances instead of using default ones in bilinear and scale tests
* Merge branch 'grouped_conv_bwd_weight_instances_examples' into 'feature/conv_bwd_weight_wmma'
Grouped conv: Instances and example bwd weight
See merge request amd/ai/composable_kernel!47
* Device implementation of explicit gemm for grouped conv bwd weight
Based on batched gemm multiple D
* Add instances for pipeline v1 and v3
* Add support for occupancy-based splitk
* Fix ckProfiler dependencies
* Review fixes
* Merge branch 'explicit_bwd_weight' into 'feature/conv_bwd_weight_wmma'
Device implementation of explicit gemm for grouped conv bwd weight
See merge request amd/ai/composable_kernel!52
* Fix cmake file for tests
* fix clang format
* fix instance factory error
* Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test.
* Revert "Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test."
This reverts commit da8e4cfb7917d45d46339ec74eb72e2f585f14cf.
* Disable splitk for 2stage xdl on rdna (bug to be fixed)
* Fix add_test_executable
* Always ForceThreadTileTransfer for now, WaveTileTransfer does not work for convolution yet.
* Grab device and gridwise files from bkp branch, this should enable splitK support for convolution and also we no longer ForceThreadTileTransfer for explicit gemm. Also grab some updates from 7e7243783008b11e904f127ecf1df55ef95e9af2 to fix building on clang20.
* Fix bug in various bwd wei device implementations / profiler where the occupancy based split_k value could not be found because the Argument did not derive from ArgumentSplitK, leading to incorrect error tolerances.
* Actually print the reason when a device implementation is not supported.
* Print number of valid instances in profiler and tests.
* Fix clang format for Two Stage implementation
* Fix copyright
* Address review comments
* Fix explicit conv bwd weight struct
* Fix gridwise common
* Fix gridwise ab scale
* Remove autodeduce 1 stage
* Restore example tolerance calculation
* Fix compilation error
* Fix gridwise common
* Fix gridwise gemm
* Fix typo
* Fix splitk
* Fix splitk ab scale
* Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test.
* Reduce instances to only the tuned wmma V3 ones for implicit v1 intra and explicit v1 intra pad/nopad.
* Add explicit oddMN support with custom tuned instances
* Add two stage instances based on the parameters from the tuned cshuffle V3 instances. CShuffleBlockTransferScalarPerVector adapted to 4, and mergegroups fixed to 1 for now. No more special instance lists.
* Replace cshuffle non-v3 lists with v3 lists, making sure to not have duplications. Also removing stride1pad0 support for NHWGC since we can use explicit for those cases.
* Remove some instances that give incorrect results (f16 NHWGC)
* Add bf16 f32 bf16 instances based on tuned bf16 NHWGC GKYXC instances.
* Add back some generic instances to make sure we have the same shape / layout / datatype support as before the instance selection process.
* Add instances for scale and bilinear based on the bf16 NHWGC GKYXC tuning. Keep generic instances for support.
* Disable two stage f16 instances which produce incorrect results.
* Remove more instances which fail verification, for bf16_f32_bf16 and for f16 scale / bilinear.
* Disable all non-generic two-stage instances in the instance lists for NHWGC. They are never faster and support is already carried by CShuffleV3 and Explicit.
* Remove unused instance lists and related add_x_instance() functions, fwd declarations, cmakelists entries. Also merge the "wmma" and "wmma v3" instance list files, which are both v3.
* Re-enable all xdl instances (un-16x16-adapted) and dl instances. Remove custom ckProfiler target.
* Remove straggler comments
* Remove [[maybe_unused]]
* Fix clang format
* Remove unwanted instances. This includes all instances which are not NHWGCxGKYXC and F16 or BF16 (no mixed in-out types).
* Add comment
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com>
[ROCm/composable_kernel commit: 87dd073887]
286 lines
13 KiB
CMake
# Copyright (c) Advanced Micro Devices, Inc., or its affiliates.
# SPDX-License-Identifier: MIT

# ckProfiler
set(CK_PROFILER_OP_FILTER "" CACHE STRING "Filter for the operators to be profiled. Default is to include all")
set(CK_PROFILER_INSTANCE_FILTER "" CACHE STRING "Filter for the kernel instances to be profiled. Default is to be the same as the operator filter")
if (CK_PROFILER_OP_FILTER STREQUAL "")
    set(CK_PROFILER_OP_FILTER ".+")
endif()
if (CK_PROFILER_INSTANCE_FILTER STREQUAL "")
    set(CK_PROFILER_INSTANCE_FILTER ${CK_PROFILER_OP_FILTER})
endif()
message(STATUS "CK_PROFILER_OP_FILTER: ${CK_PROFILER_OP_FILTER}")
message(STATUS "CK_PROFILER_INSTANCE_FILTER: ${CK_PROFILER_INSTANCE_FILTER}")

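# Example (a hedged sketch, not part of the upstream file): restricting the build to one operator.
# Both filters are regular expressions matched against names derived further down in this file:
# "profile_<op>.cpp" gives the operator name, "device_<name>_instance" gives the instance name.
# The instance filter defaults to the operator filter, and the two names can differ
# (profile_grouped_conv_bwd_weight.cpp vs. device_grouped_conv2d_bwd_weight_instance), so an
# anchored operator filter may need an explicit instance filter next to it. Assuming the
# configure step is run from a build directory, that could look like:
#
#   cmake -DCK_PROFILER_OP_FILTER="^grouped_conv_bwd_weight$" \
#         -DCK_PROFILER_INSTANCE_FILTER="^grouped_conv(1d|2d|3d|nd)_bwd_weight$" \
#         <other options> <path-to-source>
#
# Whether the resulting ckProfiler links with only those instance libraries is not verified here.
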
set(PROFILER_OPS
    profile_gemm.cpp
    profile_reduce.cpp
    profile_groupnorm_bwd_data.cpp
    profile_groupnorm_fwd.cpp
    profile_layernorm_bwd_data.cpp
    profile_layernorm_bwd_gamma_beta.cpp
    profile_groupnorm_bwd_gamma_beta.cpp
    profile_layernorm_fwd.cpp
    profile_max_pool2d_fwd.cpp
    profile_pool3d_fwd.cpp
    profile_avg_pool3d_bwd.cpp
    profile_max_pool3d_bwd.cpp
    profile_avg_pool2d_bwd.cpp
    profile_max_pool2d_bwd.cpp
    profile_softmax.cpp
    profile_batchnorm_fwd.cpp
    profile_batchnorm_bwd.cpp
    profile_batchnorm_infer.cpp
    profile_conv_tensor_rearrange.cpp
    profile_transpose.cpp
    profile_permute_scale.cpp
    profile_gemm_quantization.cpp
)

if(SUPPORTED_GPU_TARGETS MATCHES "gfx9")
    if(DTYPES MATCHES "fp32" OR DTYPES MATCHES "fp64" OR NOT DEFINED DTYPES)
        list(APPEND PROFILER_OPS profile_contraction_bilinear.cpp)
        list(APPEND PROFILER_OPS profile_contraction_scale.cpp)
    endif()
endif()

if(SUPPORTED_GPU_TARGETS MATCHES "gfx9|gfx1[12]")
    if(DTYPES MATCHES "fp16" OR NOT DEFINED DTYPES)
        list(APPEND PROFILER_OPS profile_gemm_reduce.cpp)
        list(APPEND PROFILER_OPS profile_batched_gemm_add_relu_gemm_add.cpp)
        list(APPEND PROFILER_OPS profile_gemm_add.cpp)
        list(APPEND PROFILER_OPS profile_grouped_gemm.cpp)
        list(APPEND PROFILER_OPS profile_gemm_streamk.cpp)
        list(APPEND PROFILER_OPS profile_gemm_add_relu.cpp)
        list(APPEND PROFILER_OPS profile_gemm_add_relu_add_layernorm.cpp)
        list(APPEND PROFILER_OPS profile_grouped_gemm_fixed_nk.cpp)
        list(APPEND PROFILER_OPS profile_grouped_gemm_fastgelu.cpp)
        list(APPEND PROFILER_OPS profile_grouped_gemm_tile_loop.cpp)
        list(APPEND PROFILER_OPS profile_grouped_gemm_multiply_tile_loop.cpp)
    endif()
    if(SUPPORTED_GPU_TARGETS MATCHES "gfx9[45]|gfx12")
        list(APPEND PROFILER_OPS profile_gemm_multiply_multiply_wp.cpp)
        list(APPEND PROFILER_OPS profile_gemm_ab_scale.cpp)
        list(APPEND PROFILER_OPS profile_gemm_blockscale_wp.cpp)
        list(APPEND PROFILER_OPS profile_gemm_universal_preshuffle.cpp)
    endif()
    if(SUPPORTED_GPU_TARGETS MATCHES "gfx95")
        list(APPEND PROFILER_OPS profile_gemm_mx.cpp)
    endif()
    list(APPEND PROFILER_OPS profile_batched_gemm_reduce.cpp)
    list(APPEND PROFILER_OPS profile_gemm_add_multiply.cpp)
    list(APPEND PROFILER_OPS profile_gemm_add.cpp)
    list(APPEND PROFILER_OPS profile_gemm_bias_add_reduce.cpp)
    list(APPEND PROFILER_OPS profile_gemm_splitk.cpp)
    list(APPEND PROFILER_OPS profile_gemm_universal_batched.cpp)
    list(APPEND PROFILER_OPS profile_gemm_universal_streamk.cpp)
    list(APPEND PROFILER_OPS profile_conv_fwd_bias_relu.cpp)
    list(APPEND PROFILER_OPS profile_conv_fwd_bias_relu_add.cpp)
    list(APPEND PROFILER_OPS profile_conv_bwd_data.cpp)
    list(APPEND PROFILER_OPS profile_conv_fwd.cpp)
    list(APPEND PROFILER_OPS profile_grouped_conv_fwd_outelementop.cpp)
endif()

if((SUPPORTED_GPU_TARGETS MATCHES "gfx9" AND (DTYPES MATCHES "fp16" OR NOT DEFINED DTYPES)) OR
   (SUPPORTED_GPU_TARGETS MATCHES "gfx1[12]"))
    list(APPEND PROFILER_OPS profile_gemm_bilinear.cpp)
endif()
if(SUPPORTED_GPU_TARGETS MATCHES "gfx(9[45]|1[12])")
    list(APPEND PROFILER_OPS profile_gemm_multiply_multiply.cpp)
endif()

if(SUPPORTED_GPU_TARGETS MATCHES "gfx9|gfx1[12]")
    list(APPEND PROFILER_OPS profile_gemm_universal.cpp)
    list(APPEND PROFILER_OPS profile_batched_gemm.cpp)
    list(APPEND PROFILER_OPS profile_batched_gemm_b_scale.cpp)
    list(APPEND PROFILER_OPS profile_gemm_b_scale.cpp)
    list(APPEND PROFILER_OPS profile_gemm_universal_reduce.cpp)
    list(APPEND PROFILER_OPS profile_grouped_conv_fwd.cpp)
    list(APPEND PROFILER_OPS profile_grouped_conv_fwd_bias_clamp.cpp)
    list(APPEND PROFILER_OPS profile_grouped_conv_fwd_clamp.cpp)
    list(APPEND PROFILER_OPS profile_grouped_conv_bwd_data.cpp)
    list(APPEND PROFILER_OPS profile_grouped_conv_bwd_weight.cpp)
    list(APPEND PROFILER_OPS profile_gemm_multi_abd.cpp)
    if(DTYPES MATCHES "fp16" OR NOT DEFINED DTYPES)
        list(APPEND PROFILER_OPS profile_gemm_add_multiply.cpp)
        list(APPEND PROFILER_OPS profile_gemm_multiply_add.cpp)
        list(APPEND PROFILER_OPS profile_gemm_add_silu.cpp)
        list(APPEND PROFILER_OPS profile_gemm_fastgelu.cpp)
        list(APPEND PROFILER_OPS profile_gemm_add_fastgelu.cpp)
        list(APPEND PROFILER_OPS profile_gemm_add_add_fastgelu.cpp)
        list(APPEND PROFILER_SOURCES profile_gemm_add.cpp)
    endif()
    list(APPEND PROFILER_OPS profile_batched_gemm_gemm.cpp)
endif()

if(DL_KERNELS)
    list(APPEND PROFILER_OPS profile_batched_gemm_multi_d.cpp)
    list(APPEND PROFILER_OPS profile_grouped_conv_bwd_weight.cpp)
endif()

if(CK_ENABLE_INT8)
    list(APPEND PROFILER_OPS profile_gemm_quantization.cpp)
endif()

set(PROFILER_SOURCES profiler.cpp)
foreach(SOURCE ${PROFILER_OPS})
    string(REGEX REPLACE "profile_(.+)\.cpp" "\\1" OP_NAME ${SOURCE})
    if (OP_NAME STREQUAL "")
        message(FATAL_ERROR "Unexpected source file name: ${SOURCE}")
    endif()
    if("${OP_NAME}" MATCHES "${CK_PROFILER_OP_FILTER}")
        list(APPEND PROFILER_SOURCES ${SOURCE})
    endif()
endforeach()
message(VERBOSE "ckProfiler sources: ${PROFILER_SOURCES}")

set(PROFILER_EXECUTABLE ckProfiler)

add_executable(${PROFILER_EXECUTABLE} ${PROFILER_SOURCES})
target_compile_options(${PROFILER_EXECUTABLE} PRIVATE -Wno-global-constructors)
# flags to compress the library
if(NOT WIN32 AND ${hip_VERSION_FLAT} GREATER 600241132)
    message(DEBUG "Adding --offload-compress flag for ${PROFILER_EXECUTABLE}")
    target_compile_options(${PROFILER_EXECUTABLE} PRIVATE --offload-compress)
endif()


set(DEVICE_INSTANCES "")
list(APPEND DEVICE_INSTANCES device_gemm_instance)
list(APPEND DEVICE_INSTANCES device_normalization_fwd_instance)
list(APPEND DEVICE_INSTANCES device_normalization_bwd_data_instance)
list(APPEND DEVICE_INSTANCES device_normalization_bwd_gamma_beta_instance)
list(APPEND DEVICE_INSTANCES device_softmax_instance)
list(APPEND DEVICE_INSTANCES device_reduce_instance)
list(APPEND DEVICE_INSTANCES device_batchnorm_instance)
list(APPEND DEVICE_INSTANCES device_pool2d_fwd_instance)
list(APPEND DEVICE_INSTANCES device_pool3d_fwd_instance)
list(APPEND DEVICE_INSTANCES device_avg_pool2d_bwd_instance)
list(APPEND DEVICE_INSTANCES device_avg_pool3d_bwd_instance)
list(APPEND DEVICE_INSTANCES device_max_pool_bwd_instance)
list(APPEND DEVICE_INSTANCES device_image_to_column_instance)
list(APPEND DEVICE_INSTANCES device_column_to_image_instance)
list(APPEND DEVICE_INSTANCES device_transpose_instance)
list(APPEND DEVICE_INSTANCES device_permute_scale_instance)

if(SUPPORTED_GPU_TARGETS MATCHES "gfx9|gfx1[12]")
    if(DTYPES MATCHES "fp32" OR DTYPES MATCHES "fp64" OR NOT DEFINED DTYPES)
        list(APPEND DEVICE_INSTANCES device_contraction_bilinear_instance)
        list(APPEND DEVICE_INSTANCES device_contraction_scale_instance)
    endif()
    if(DTYPES MATCHES "fp16" OR NOT DEFINED DTYPES)
        list(APPEND DEVICE_INSTANCES device_gemm_add_instance)
        list(APPEND DEVICE_INSTANCES device_batched_gemm_gemm_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_add_add_fastgelu_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_fastgelu_instance)
        list(APPEND DEVICE_INSTANCES device_batched_gemm_add_relu_gemm_add_instance)
        list(APPEND DEVICE_INSTANCES device_grouped_gemm_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_streamk_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_add_relu_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_add_relu_add_layernorm_instance)
        list(APPEND DEVICE_INSTANCES device_grouped_gemm_fixed_nk_instance)
        list(APPEND DEVICE_INSTANCES device_grouped_gemm_fastgelu_instance)
        list(APPEND DEVICE_INSTANCES device_grouped_gemm_tile_loop_instance)
    endif()
    list(APPEND DEVICE_INSTANCES device_batched_gemm_reduce_instance)
    if(SUPPORTED_GPU_TARGETS MATCHES "gfx9[45]|gfx12")
        list(APPEND DEVICE_INSTANCES device_gemm_multiply_multiply_wp_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_universal_preshuffle_instance)
    endif()
    if(SUPPORTED_GPU_TARGETS MATCHES "gfx9[45]|gfx1[12]")
        list(APPEND DEVICE_INSTANCES device_gemm_ab_scale_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_blockscale_wp_instance)
    endif()
    if(SUPPORTED_GPU_TARGETS MATCHES "gfx95")
        list(APPEND DEVICE_INSTANCES device_gemm_mx_instance)
    endif()
    list(APPEND DEVICE_INSTANCES device_gemm_splitk_instance)
    list(APPEND DEVICE_INSTANCES device_gemm_universal_batched_instance)
    list(APPEND DEVICE_INSTANCES device_gemm_universal_streamk_instance)
    list(APPEND DEVICE_INSTANCES device_gemm_add_multiply_instance)
    list(APPEND DEVICE_INSTANCES device_gemm_add_instance)
    list(APPEND DEVICE_INSTANCES device_gemm_reduce_instance)
    list(APPEND DEVICE_INSTANCES device_gemm_bias_add_reduce_instance)
    list(APPEND DEVICE_INSTANCES device_conv2d_fwd_instance)
    list(APPEND DEVICE_INSTANCES device_conv2d_fwd_bias_relu_instance)
    list(APPEND DEVICE_INSTANCES device_conv2d_fwd_bias_relu_add_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv1d_fwd_instance)
    list(APPEND DEVICE_INSTANCES device_conv1d_bwd_data_instance)
    list(APPEND DEVICE_INSTANCES device_conv3d_bwd_data_instance)
    list(APPEND DEVICE_INSTANCES device_conv2d_bwd_data_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv3d_fwd_convscale_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv3d_fwd_convinvscale_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv2d_fwd_clamp_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv3d_fwd_clamp_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv2d_fwd_bias_clamp_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv3d_fwd_bias_clamp_instance)
endif()

if((SUPPORTED_GPU_TARGETS MATCHES "gfx9" AND (DTYPES MATCHES "fp16" OR NOT DEFINED DTYPES)) OR
   (SUPPORTED_GPU_TARGETS MATCHES "gfx1[12]" ))
    list(APPEND DEVICE_INSTANCES device_gemm_bilinear_instance)
endif()
if(SUPPORTED_GPU_TARGETS MATCHES "gfx(9[45]|1[12])")
    list(APPEND DEVICE_INSTANCES device_gemm_multiply_multiply_instance)
endif()

if(SUPPORTED_GPU_TARGETS MATCHES "gfx9|gfx1[12]")
    list(APPEND DEVICE_INSTANCES device_gemm_universal_instance)
    list(APPEND DEVICE_INSTANCES device_batched_gemm_instance)
    list(APPEND DEVICE_INSTANCES device_gemm_b_scale_instance)
    list(APPEND DEVICE_INSTANCES device_gemm_universal_reduce_instance)
    list(APPEND DEVICE_INSTANCES device_batched_gemm_b_scale_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv3d_fwd_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv2d_bwd_data_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv3d_bwd_data_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv2d_fwd_instance)
    list(APPEND DEVICE_INSTANCES device_gemm_add_relu_instance)
    list(APPEND DEVICE_INSTANCES device_gemm_multi_abd_instance)
    if(DTYPES MATCHES "fp16" OR NOT DEFINED DTYPES)
        list(APPEND DEVICE_INSTANCES device_gemm_add_multiply_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_multiply_add_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_add_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_add_silu_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_fastgelu_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_add_fastgelu_instance)
        list(APPEND DEVICE_INSTANCES device_gemm_add_add_fastgelu_instance)
    endif()
    list(APPEND DEVICE_INSTANCES device_batched_gemm_gemm_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv1d_bwd_weight_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv2d_bwd_weight_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_convnd_bwd_weight_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv3d_bwd_weight_instance)
endif()

if(DL_KERNELS)
    list(APPEND DEVICE_INSTANCES device_batched_gemm_multi_d_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv1d_bwd_weight_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv2d_bwd_weight_instance)
    list(APPEND DEVICE_INSTANCES device_grouped_conv3d_bwd_weight_instance)
endif()

if(CK_ENABLE_INT8)
    list(APPEND DEVICE_INSTANCES device_quantization_instance)
endif()

set(PROFILER_LIBS utility getopt::getopt)
foreach(LIB ${DEVICE_INSTANCES})
    string(REGEX REPLACE "device_(.+)_instance" "\\1" INSTANCE_NAME ${LIB})
    if (INSTANCE_NAME STREQUAL "")
        message(FATAL_ERROR "Unexpected kernel instance name: ${LIB}")
    endif()
    if("${INSTANCE_NAME}" MATCHES "${CK_PROFILER_INSTANCE_FILTER}")
        # Only link if the target was actually created
        if(TARGET ${LIB})
            list(APPEND PROFILER_LIBS ${LIB})
        else()
            message(VERBOSE "Skipping ${LIB} - no instances built for current GPU targets")
        endif()
    endif()
endforeach()
message(VERBOSE "ckProfiler libs: ${PROFILER_LIBS}")
target_link_libraries(${PROFILER_EXECUTABLE} PRIVATE ${PROFILER_LIBS})

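# Note (a usage assumption, not stated upstream): the message(VERBOSE ...) calls in this file are
# hidden at CMake's default log level. To list the selected profiler sources, the linked instance
# libraries, and any skipped ones at configure time, the log level can be raised
# (CMake 3.16 or newer), for example:
#
#   cmake --log-level=VERBOSE <other options> <path-to-source>
#
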
rocm_install(TARGETS ${PROFILER_EXECUTABLE} COMPONENT profiler)