composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-10 17:17:10 +00:00

Author	SHA1	Message	Date
SamiAario-AMD	bcd00317f9	Lwpck 3550: Implement and test fixed precision fp8 x bf8 (#2963 ) * HasHotLoop is a constexpr * Remove an unused function * Remove some unused include statements * Add implementation and tests for fp8 x bf8 weight preshuffle GEMM * Add implementation and tests for fp8 x bf8 in CK Tile basic and universal GEMMs * Remove two barrier calls that HotLoopScheduler already calls * No need to suppress a variable that hasn't been declared * Replace six arg_parser arguments with constexpr literals * Simplify run_gemm_test_prec_type * The strides don't need to be passed via arg_parser as we use their default values * The layouts don't need to be passed as arguments twice * Pass M N and K as regular arguments, not using the argument parser * We can now remove the argument parser * Add a common file for precision types to be used in testing * Convert basic and universal GEMM tests to use gtest * Make GemmConfig a test parameter, and form test cases as the cartesian product GemmConfigs x PrecTypes * Add GemmConfigComputeV4 to the GEMM configs to run the universal tests on * Added a changelog entry * Add missing copyright statements * ifndef-define-endif is not needed with pragma once * Fix a comment * Add F8 x BF8 tests for CompV4 in test_gemm_pipeline_kernel_types.hpp * Disable the unreliable test MoeSortingCase4 --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `254bce9346`]	2025-10-30 13:36:10 +01:00
Ville Pietilä	fd61987d73	[CK_TILE] Improve grouped conv kernel name generation (#3028 ) * Improve the grouped conv kernel name generation in CK Tile. * Fix building CShuffle epilogue tests. --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `9ee9f4d2a3`]	2025-10-30 13:19:07 +01:00
Jimniu	e3d1fc26b6	Jimniu/ ck tile gemm stride validation (#2710 ) * Add stride validation for gemm_basic * change default stride statement * Fix build error * Fix pre-commit failure * Addressed PR comments * clear the redundant code * clang format --------- Co-authored-by: mkumar16-amd <mkumar16@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `8c4cb4f9f4`]	2025-10-29 19:45:09 -07:00
Anton Gorenko	220bd7a9bb	[CK_TILE] Support WMMA (gfx12) in FMHA (#2528 ) * Pass hdim to tile_example_fmha_fwd in fp8 tests * Add WMMA support to fwd FMHA pipelines * Tune tile sizes a bit for less spilling fp16 256 is still quite slow * Fix Q grad tile distribution for warp size = 32 and hdim >= 256 With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correctly distribute the tile. * Use code based on BlockDropout in BlockDropoutBwd * Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal * Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9 because they are both equal to 8 while on gfx12 they are 8 and 4; * Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1], it worked on gfx9 because 4 * kM == kN1 == 32; * Removing these hidden dependencies allows to support: * any number of warps (power-of-2), not only 4; * kN1 = 16, not only 32; * any number of splits; * Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now * Replace hard-coded kN1 in dispatch code with the requested tile size * Add gfx12-specific tile sizes for split KV * Pass GPU architecture to kernel generation scripts This is still a temporary solution. * Build and run FMHA CI tests for gfx12 * Fix issue after merging * Fix bwd tile sizes The current pipelines always read only one tile K and V tile, this requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and kK2 == kVHeaddim). * Use hardware f32->f8 on gfx12, remove v_perm __builtin_amdgcn_perm is not needed because __builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of 32-bit dword) is used to store results (two f8 values). 
* Update changelog * Add WMMA support to pagedkv * Fix scripts after rebasing * Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout Add comments with dropout implementation details Fix performance regression of fwd+dropout * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox; * "scalarize" seed and offset, they may come either from kernel args or from device memory (presumably loaded with vector loads). These changes help the compiler to produce more optimal code and reduce register spilling. Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding Use code based on BlockDropout in BlockDropoutBwd Refactor BlockDropout (fwd) Implement BlockDropout (fwd) for WMMA Originally BlockDropout only supported 32x32 tiles (IsWG32 = true), this version supports 16x16 tiles. If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly to BlockDropoutBwd. Implement BlockDropoutBwd for WMMA Remove MakeRandValLds* functions unused in BlockDropoutBwd Remove unused Run overload from BlockDropoutBwd * Fix regression with philox seed and offset when they exceed 32-bit int __builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset are 64-bit so they get truncated. * Fix names after cherry-picking * Fix selection of a fallback tile based on bm0 The assumption that the largest bm0 == 128 is not always true for current fp32 tiles. * Do not use filters related to qr_async_trload They disable tiles/pipelines which are valid for gfx12. * Use different dstr encoding when C is transposed * Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code Some WarpGemmDispatcher instantiations are defined only for specific archs and undefined on host. Calculations related to sched barriers are moved from Pipeline's public fields into pipeline's operator(). 
* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used in codegen scripts are 32, 32, 32. * Generalize usages of WarpGemmDispatcher for MFMA and WMMA WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still used explicitly because of swizzle factor = 4. * Mark has_load_tr as maybe_unused There is no transpose loading for RDNA. * Remove CK_TILE_USE_MFMA/WMMA from fmha-related code * Detect BlockSize on host based on warp size of the current device If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with kBlockSize / 2 because on host get_warp_size() == 64 always. * Fix calculation of grid size for combine kernel with warp size = 32 * Add missing includes and header * Support multiple archs in one binary for fwd * Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill * Support multiple archs in one binary for bwd * trload kernels are compiled only for gfx950; * instances with padding are checked after instances without padding so they can be used as fallbacks (similarly to fwd); * Extract common code from register_traits * Revert "Fix regression with philox seed and offset when they exceed 32-bit int" To simplify merging , the proper fix is in develop already. * Support new numerical d paddings in trait ordering checks * Build fp32 tests only on gfx9 * Do not use hardcoded M0 = 64 for dot bwd kernel * Use textwrap.indent from standard library * Make fp8 pipelines on gfx12 consistent with gfx9 * Update tests for current pipelines * Make ninja check more responsive in CI ninja buffers output so this job looks hanging. * Support fp8fp32 by limiting O vector size The fp32 output type requires storing 8 * sizeof(float) = 32 bytes, which is not implemented (here 8 is the number of C values per lane for v_wmma_f32_16x16x16...). 
* Remove unused cmake options * Unify including amd_buffer_addressing.hpp/_builtins.hpp * Temporarily use amd_buffer_addressing.hpp on >=gfx10 amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores which is not compatible with >=gfx10: * 1 scalar for exec masks instead of 2, * gfx12 uses different instruction names etc. * Update asm in bf16 conversions to work with warp 32 * Do not generate splitkv/appendkv with vlayout=col for consistency with fwd * Add arch tags to kernels/host funcs, compile for each arch separately * Add kM0 to fmha_bwd_dot_do_o kernel name to match filename * Add workaround for miscompilation of bwd with padded hdim SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent branches used to store padded tensors (when some lanes are inactive due to padding). Inline asm with dummy dependencies on VGPRs of the tensors prevents the compiler doing this. * Fix add_gtest_executable for absolute paths Some tests (like gemm_tile_engine) pass absolute paths to source files. In CI the branch name is a part of the root dir, and if the branch name contains "wmma", "xdl" etc., files can be incorrectly excluded. * Run only hdim 128 smoke tests for fp8fp32 There are no instances for hdim 64 and 256. * Format py with ruff to simplify merging develop * Fix incorrect var name * Codegen for gfx9,gfx950 when --targets is not specified Aiter and Pytorch require changes for passing their targets to the codegen scripts. With this temporary solution the files are generated but not all of them have to be really built (depending on the used --offload-arch=). * Combine arch-related values into ArchTrait This more centralized approach removes duplication of various formatting templates. * Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large" Some code is extracted into a function. [ROCm/composable_kernel commit: `1e77695fe8`]	2025-10-29 13:31:08 -07:00
John Shumway	361a4c6e23	Add instance traits for two more grouped forward convolutions (#3112 ) [ROCm/composable_kernel commit: `cafaeb6b7b`]	2025-10-29 16:04:13 +01:00
Ville Pietilä	88910537bf	[CK_Tile] Merge multiple convolution groups into a single GEMM batch (#2986 ) * Fix compilation of the grouped conv examples. * Fix grouped conv bwd weight example output in CK Tile. * Add number of groups to merge to ck tile grouped gemm example. * Initial set of tests for TransformConvBwdWeightToGemm. * Added unit tests for TransformConvBwdWeightToGemm conv groups are merged. * WIP: Tensor transformations. * Add unit tests for coordinate transforms. * Fully working conv group merging for TransformConvBwdWeightToGemm. * WIP: Merged conv groups offset calculation. * Added unit tests for tensor view. * WIP: Merged conv groups epilogue. * Enable running multiple conv groups per batch. * Add tests for tile_distribution_encoding. * Change example to match optimally depthwise convolution with merged groups. * Add more tests for tensor view. * Integration test for reading diagonal blocks from grouped distributed tensor. * Improved integration test. * Improve test for accessing diagonal blocks. * Added integration test for cshuffle epilogue LDS tile distribution. * Add more logging. * Increase the max number of reported errors. * WIP: merged conv groups GEMM epilogue changes. * LDS to global memory copy. * Fix tile window size for c block. * Integration test for CShuffle epilogue. * Improved CShuffle test. * WIP: Separate epilogue for merged conv groups. * Tile example parameters changes to match depthwise conv. * Offset fixes. * Epilogue fixes. * Working baseline for depthwise convolution with merged conv groups. * Fix build. * Initial unit tests for tensor descriptor. * Add one more unit test for tensor view. * WIP: LDS to global mem transfer using CK tile tensor descriptor and tile distribution encoding. * Fully functional LDS to global mem transfer using tensor descriptor and tile distribution encoding. * Add more comments, disable debug code. * Remove debug and other dead code. * Code clean-up for bwd tensor transformations. 
* Enable running multiple GEMM batches of merged conv groups. * Add compile check for assumed row-major layout. * Fix strides in 1D conv to gemm transformation. * WIP: Simplify conv to gemm transformations and handle K > 1 and C > 1 cases. * Fix case k > 1 and c=1. * Remove debug code. * Make MPerGroup and NPerGroup template parameters. * Add additional check for non-supported c > 1 case. * WIP: Put back the generic tensor descriptors for convolutions. * Fix tensor descriptors. * Remove the obsolete template parameters. * Add more instances. * Fix bugs in merged conv groups tensor descriptors. * Fix tensor descriptors for merged conv groups when K > 1. * Remove debug output. * Remove dead code. * Fix merge conflicts. * Code clean-up. * Remove unused code. * Run clang-formatting. * Remove debug prints and obsolete tests. * Check that number of convolution groups is multiple of merged groups. * Fix build after removing obsolete functionality. * Remove obsolete enumeration. * Fix new unit projects. * Remove unnecessary includes. * Fix passing the number of merged groups. * Remove unrelated tests. * Fix IsSupportedArgument for bwd weight conv kernel. * Fix clang formatting. * Fix the bwd weight conv to gemm mapping for num merged groups > 1. * GEMM config for conv group merging. * Fix clang-formatting. * Remove obsolete comment. * Fix typos in comment strings. * Increase the max number of reported errors when testing against reference implementation. * Rename gemm_config to conv_config. * Rename GemmConfig to ConvConfig and move NumGroupsToMerge into ConvConfig. * Change num_groups_to_merge to a boolean flag in the ck tile grouped conv example. * Run clang-format. * Add number of merged groups into kernel name string. * Remove group merging flag from CK Tile grouped conv example. [ROCm/composable_kernel commit: `121bf0e1f3`]	2025-10-29 16:49:28 +02:00
Bartłomiej Kocot	cd30313161	Grouped conv fwd with direct load (#3082 ) * Grouped conv fwd with direct load * fix * fix * Add IsSupported check * Fix * fix inductor [ROCm/composable_kernel commit: `66bae4306c`]	2025-10-29 09:54:42 +01:00
Yashvardhan Agarwal	201463f036	[CK_TILE] Add indexing to pooling operator (Lwpck 3892) (#3013 ) * Add indexing support to pooling operator - Add IndexDataType template parameter to pooling problem and kernel definitions - Enable pooling kernel to output indices of selected elements during max/absmax pooling - Add overloaded operators for Max and AbsMax that track when values change using bool changed parameter - Support optional index buffer allocation and management in device memory - Modify BlockReduce2d classes to handle index tensors alongside value tensors - Add separate shared memory allocation for index data in cross-warp reductions - Create validate_pool_indices function to verify index correctness - Modify pool3d.cpp example to demonstrate index output functionality - Add tests for index output * fixes * Refactor BlockReduce2D functions to get rid of auxiliary private types. * comment resolutions and some changes to block_reduce2d - index reference implementation improved - reduce_operator.hpp cleaned up - updated the block_reduce2d.hpp to have index calculation for BlockReduce2dLinearCrossWarpSync as well * conditionally used variable declaration improvement - the conditionally used variables are used only when indexing is enabled. To inform the compiler that they may be unused and declare them with least size possible. This may allow it to be optimized compared to the previous declarations * comment resolutions * lexical ordering of the indices - introduced accumulate methods that handle the intermediate steps if needed to order the indexes * add reduce_operator_accumulate.hpp to core.hpp --------- Co-authored-by: Adam Osewski <Adam.Osewski@amd.com> [ROCm/composable_kernel commit: `3052d7c9e6`]	2025-10-29 09:58:04 +02:00
Jeff Huang	ac03aee245	[CK_TILE] fmha: Add query padding support to backward pass (#3097 ) * [CK_TILE] fmha: Add query padding support to backward pass Introduces support for query sequence padding (q_padding) in the FMHA backward pass kernels. - Passing `seqlen_q_ptr` to the backward kernels to distinguish logical from physical sequence lengths. - Updating `OGradDotO`, `ConvertQGrad`, and `DQDKDV` kernels to respect logical lengths and handle zero-length sequences. - Aligning LSE indexing in the forward kernel with the padded layout for consistency. - Adding a new GTest suite (`test_fmha_bwd_kernel_padding.cpp`) with comprehensive tests for various padding scenarios, including zero-length sequences and deterministic mode. * fix clang format * Adapt fmha_bwd_runner.cpp to new q, kv sequence padding Add backward q/kv sequence padding unit tests. * [CK_TILE] fmha: Unify sequence length and padding handling Refactor the handling of sequence lengths and padding in the FMHA forward and backward kernels to provide a more unified and flexible interface. - Replaced `seqstart_padded__ptr` with a more robust system that uses `seqstart__ptr` for physical sequence lengths and introduces `seqlen__ptr` and `cu_seqlen__ptr` for logical (unpadded) lengths. - Established a clear order of precedence for determining sequence length: cumulative lengths (`cu_seqlen__ptr`) take priority, followed by per-sequence lengths (`seqlen__ptr`), and finally physical lengths derived from `seqstart_*_ptr`. - Clarified the distinction between "group mode" and "batch mode" and how sequence lengths are handled in each case. - Renamed `cu_seqlen_kv_ptr` to `cu_seqlen_k_ptr` for consistency. - Updated comments and documentation to reflect the new argument structure and usage. --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `7c6430eca0`]	2025-10-29 13:56:11 +08:00
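
The order of precedence described in the commit above (cumulative lengths, then per-sequence lengths, then physical lengths derived from the start offsets) can be sketched on the host as follows. This is a minimal illustration only: `SeqlenArgs`, its field names, and `logical_seqlen` are hypothetical and do not reproduce the actual FMHA kernel arguments; the sketch also assumes that the start-offset and cumulative arrays each hold batch + 1 entries.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical argument bundle; the field names echo the commit message's
// placeholders but this is not the kernel's real argument struct.
struct SeqlenArgs
{
    const int32_t* seqstart_ptr;  // physical start offsets, batch + 1 entries (assumed required)
    const int32_t* seqlen_ptr;    // optional per-sequence logical lengths, batch entries
    const int32_t* cu_seqlen_ptr; // optional cumulative logical lengths, batch + 1 entries
};

// Resolve the logical length of sequence `b` using the stated precedence:
// cumulative lengths first, then per-sequence lengths, then physical lengths.
inline int32_t logical_seqlen(const SeqlenArgs& args, int b)
{
    assert(args.seqstart_ptr != nullptr);
    if(args.cu_seqlen_ptr != nullptr)
        return args.cu_seqlen_ptr[b + 1] - args.cu_seqlen_ptr[b];
    if(args.seqlen_ptr != nullptr)
        return args.seqlen_ptr[b];
    return args.seqstart_ptr[b + 1] - args.seqstart_ptr[b];
}
```

A zero-length sequence falls out naturally here: either `seqlen_ptr[b] == 0` or two equal neighbouring cumulative entries yield a logical length of 0 while the physical slot keeps its padded size.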
Sami Remes	7be2eed5c2	[CK_TILE] Top-K with Sigmoid kernel (#3062 ) * Add sigmoid option to topk_softmax * fix formatting * add to changelog * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Use else if Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `515e283091`]	2025-10-28 10:54:06 -07:00
Illia Silin	97c2fb582a	Fix multiple test failures with staging compiler. (#3103 ) * fix sync issues with staging compiler * fix codegen * use separate sync for gfx11 [ROCm/composable_kernel commit: `331273b474`]	2025-10-28 08:07:19 -07:00
Mateusz Ozga	8eb813de42	[CK_TILE] Fixed multi-abd GEMM test, NaN problem (#2979 ) * Multi-ABD NaN problem * Rollback tests --------- Co-authored-by: root <root@splinter-126-008d.aus.dcgpu> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `da4247a6df`]	2025-10-28 15:53:36 +01:00
Aviral Goel	402bf6664d	[CK_TILE] Add Bquant to Grouped Gemm (#3063 ) * update test cases * format codes * use GTEST_FAIL * add bquant to grouped_gemm * fix a bug in test_grouped_gemm_util * skip test when use wmma on grouped_quant kernel * add tensorwise quant in grouped gemm * fix example issue * update test cases * format codes * fix a bug in test_grouped_gemm_util * tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm * Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * feat: add bf8 support * chore: remove unnecessary decltype usage * chore: add default quant_mode to function signature as fallback * fix: pass correct runtime pipeline params in grouped_gemm bquant kernel Calculate has_hot_loop, num_loop, and tail_number on device side for each GEMM problem instead of using default values. This fixes incorrect results when different problems in the group have different K dimensions. * chore: set default quant mode in function signature * test: add additional test cases to cover edge case of no hotloop * chore: clang formatting --------- Co-authored-by: kyle-256 <Kyle.Zhao@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `4368fd9f57`]	2025-10-28 10:20:24 -04:00
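
The grouped GEMM fix above computes the loop parameters per problem because problems in one group may have different K dimensions. A rough sketch of that idea is shown below. Everything in it is an assumption for illustration: `LoopParams` and `make_loop_params` are hypothetical names, the hot-loop condition (more iterations than prefetch stages) is a simplification, and the pipeline-specific tail-number encoding is deliberately left out.

```cpp
#include <cassert>

// Hypothetical per-problem loop parameters; the real kernel's types and
// tail-number encoding are pipeline-specific and not reproduced here.
struct LoopParams
{
    int num_loop;      // number of K iterations for this problem
    bool has_hot_loop; // whether the steady-state main loop runs at all
};

// Assumption: one loop iteration consumes k_per_block elements of K, and the
// pipeline enters its hot loop only with more than prefetch_stages iterations.
inline LoopParams make_loop_params(int k, int k_per_block, int prefetch_stages)
{
    assert(k > 0 && k_per_block > 0 && prefetch_stages > 0);
    const int num_loop = (k + k_per_block - 1) / k_per_block; // ceiling division
    return LoopParams{num_loop, num_loop > prefetch_stages};
}
```

Under these assumptions, a problem with K = 64 and another with K = 1024 in the same group get different `num_loop` and `has_hot_loop` values, which is why a single default for the whole group would give wrong results.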
Ville Pietilä	f0b6fdcadb	Add name member to CK elementwise operations. (#3102 ) [ROCm/composable_kernel commit: `1c17bae816`]	2025-10-27 22:19:29 -07:00
John Shumway	03c97c9524	[CK_BUILDER] Test and fix instance traits utils. (#3096 ) * Refactor instance_traits_util and add unit tests * Address reviewer comments. Just adds some TODOs to indicate deprecated layouts in our reflection. Our strategy is to leave the reflection code broad (covering deprecated features), but keep the builder concepts narrow. Once we've removed deprecated features from all instances, we can remove them from reflection. Also add a comment to the cmake to explain the unit test target test_conv_builder. * Addressed more reviewer comments. * Remove duplicate PassThrough::name Accidentally added this field to the end of the struct, too. The `name` field should be at the start of the struct for consistency. [ROCm/composable_kernel commit: `54746e9329`]	2025-10-27 22:14:08 -07:00
Khushbu Agarwal	35cb7500e4	Fix quant scale matrix layout for block scale gemm (#3079 ) * Adding support for TiledPermuteN * Adding test * moving shuffle functions to common place * resolving commit hook * fix formatting [ROCm/composable_kernel commit: `b11f53a484`]	2025-10-27 13:56:07 -07:00
Ville Pietilä	e1e96b89fa	[CK_BUILDER] First fwd convolution builder implementation (#3070 ) * Add experimental builder infrastructure for composable_kernel - Add experimental/builder directory with README documentation. - Create initial test infrastructure with CMakeLists.txt and placeholder test. - Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option. - Update .gitignore to not treat `experimental/builder` as a CMake build directory. This establishes the directory structure for a high-level builder pattern that will provide a semantically-clear interface for constructing CK operations, with initial focus on convolution kernels for MIOpen integration. * Fix clang formatting. * Fix CMake build infrastructure for experimental builder - Add experimental/builder CMakeLists.txt with proper subdirectory structure - Add placeholder include/ck_tile/builder CMakeLists.txt for header installation - Fix gtest.cmake to use include_guard to prevent multiple inclusions - Update root CMakeLists.txt to include full builder directory instead of just tests * Scope C++20 setting to the test code Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Remove redundant GTest::gtest linkage Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Introduce basic types, and convolution algorithm concepts and limits. * Add convolution signature concepts. * Add convolution factory. * Finalize conv factory implementation for fwd convolutions. * Add type definitions for testing. * Add placeholder test. * Add convolution builder definition. * Fully functional fwd conv builder. * Test improvements. * Clean-up include headers. * Enable the limit checks for the convolution algorithm parameters. * Remove dead code. * clang formatting. * Add more tests and missing conv specialization argument. * clang formatting. * Add explicit handling of the tensor layouts. 
* Add complete 2D/3D layout support to CK Builder - Add missing 2D layouts: GNHWC_GKYXC_GNHWK, NGCHW_GKCYX_NGKHW - Add missing 3D layout: GNDHWC_GKZYXC_GNDHWK - Add 1D layouts (NWGC, NGCW, GNWC, NGCW_GKCX) for future support - Add 3 tests for new 2D/3D layouts - All tests pass (5/5) * Add tests for remaining 2D/3D layouts - Add test for 2D NGCHW_GKYXC_NGKHW (channels-first) with Filter1x1Stride1Pad0 - Add test for 3D NDHWGC_GKZYXC_NDHWGK (channels-last) - All 7 tests pass (complete coverage for all 2D/3D forward layouts) * Change enum converters to consteval. * 7 tests with pipeline and specialization\| Test # \| Dim \| Type \| Layout \| Pipeline \| Specialization \| \|--------\|-----\|------\|----------------------\|----------\|-------------------------\| \| 1 \| 2D \| BF16 \| NHWGC_GKYXC_NHWGK \| V1 \| DEFAULT \| \| 2 \| 2D \| FP16 \| GNHWC_GKYXC_GNHWK \| V3 \| FILTER_1X1_PAD0 \| \| 3 \| 2D \| FP32 \| NGCHW_GKCYX_NGKHW \| V4 \| FILTER_1X1_STRIDE1_PAD0 \| \| 4 \| 2D \| BF16 \| NHWGC_GKYXC_NHWGK \| V5 \| FILTER_3x3 \| \| 5 \| 3D \| FP32 \| NGCDHW_GKCZYX_NGKDHW \| V1 \| FILTER_1X1_PAD0 \| \| 6 \| 3D \| BF16 \| GNDHWC_GKZYXC_GNDHWK \| V3 \| DEFAULT \| \| 7 \| 3D \| FP16 \| NDHWGC_GKZYXC_NDHWGK \| V4 \| FILTER_1X1_PAD0 \| * Add missing convolution layouts and provide better compile-time error in instance traits. * Fix clang formatting. * Changed I8 -> S8. * Fix signature. * Rename concepts and corresponding members. * Rename LDS related parameters. * Remove ODD_C specialization. Add V2 pipeline. * Add missing types. * Add elementwise operation to the conv signature. * Improve compile-time error message for unsupported elementwise ops. * Separate different fwd conv builder tests into separate compilation units. * Fix layout to string and add name to old CK PassThrough elementwise op. * Enable both CK and CK Tile tensor layouts in instance traits. * Fix clang-format. 
--------- Co-authored-by: John Shumway <jshumway@amd.com> Co-authored-by: John Shumway <john.shumwayjr@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: JH-Leon-KIM-AMD <jeonghyun.kim@amd.com> [ROCm/composable_kernel commit: `6c2ca1211a`]	2025-10-27 20:09:24 +02:00
Johannes Graner	0690ed26ba	[CK_TILE] Add conv fwd + bias + clamp example (#3012 ) * Implement argument passing to element-wise functions for fwd convolution * Add files for fwd + bias + clamp example * Implement Bias * Implement Clamp * Elementwise function composition * Composition unit test * Implement fwd + bias + clamp example * Simplify argument passing and composition * elfunc -> bias_and_clamp * Rename function to specify example * Move element-wise function instantiation to kernel * Make bias a runtime tensor * No ugly namespace aliasing * Initialize element-wise function on host * Remove function initialization helper, simplify Compose initialization * Remove unintended LSP compatibility patch * Clean up includes and unused code * Switch names in cshuffle epilogue * Move CDElementwise to conv traits * Re-add required include * Initialize bias in same way as other tensors * Better type specification for ds pointer * Disable 1D convolution * Add warning for non-group-constant bias [ROCm/composable_kernel commit: `5c1974065e`]	2025-10-27 18:43:09 +01:00
arai713	d06d23ab11	[CK_TILE] Stream-K operator() Reboot (#3064 ) * Persistent Stream-K Kernel Implementation This change implements an operator() function in the reboot::StreamKKernel class that is enabled when the Persistent flag is set to true. In this case, the data-parallel portion and the Stream-K portion of the kernel are fully persistent. The changes were made in the reboot namespace. A future PR will remove the old Stream-K kernel class and remove the reboot namespace. * Unit Tests for Persistent Stream-K Kernel This change contains the initial test suite for the Persistent Stream-K Kernel. The files contain "reboot" in the name; a future PR will remove tests for the old Stream-K Kernel and remove the "reboot" naming. A future commit will add tests for the non-persistent kernel. Also added estimate_num_wgs_per_tile to the StreamKTilePartitionerBase class. This allows us to estimate the number of accumulations done per macro tile in C to use during validation when computing relative and absolute tolerance. * Adding implementation for the Non-Persistent Stream-K kernel This code is adding the operator() function for the Non-Persistent Stream-K kernel. Persistency of the kernel is determined through a template argument. The Non-Persistent kernel will allocate additional workgroups for the data parallel section, leading to a different structure for processing the data parallel and Stream-K sections. There has been an addition to the TilePartitioner to get access to whether Persistent has been set to true or false in the StreamKKernel. * Adding in the tests for the Non-Persistent Stream-K kernel * Refactor Stream-K Reboot Unit Tests This commit makes the following changes: - Update test cases to determine M, N, and K based on the number of CUs. This ensures that each test case is one of Edge Case, SK Only, DP Only, or DP + 2 Tile SK regardless of the architecture. 
- Since the DP + 2 Tile SK test case takes long to run, this change moves this case into a separate .inc file and labels it as an extended test. - Since the extended test takes > 30 seconds to run, this test is added to the list of regression tests. * Fix spelling errors in comments for test cases Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Changes based on review Removed const volatile for typenames Set up alias for is_tuple_t Naming changes for clarity: GemmCommon -> BaseGemm Moved std::enable_if_t out of template parameters and changed to a return type for operator() Added constructor for StreamKKernelArgs to clarify UniversalGemm inheritance --------- Co-authored-by: Emily Martins <emily.martins@amd.com> Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `054fdb765c`]	2025-10-27 09:14:17 -07:00
Adam Osewski	dc6d0327f9	[CK_Builder] Add name member to unary elementwise ops & update builder traits. (#3093 ) * Add name member to unary elementwise ops. * Update elementwise_op_name to check for name attribute. * Require that the layout is derived from BaseTensorLayout struct. [ROCm/composable_kernel commit: `f53d857b25`]	2025-10-25 07:27:03 -07:00
Max Podkorytov	3ecd2a8689	[CK-Tile][Async gemm] add missing sync and f8 inputs test cases (#3000 ) * add missing sync and f8 test cases * reformat test cases * comment failing cases * bump * reintroduce compv4 shapes [ROCm/composable_kernel commit: `86d542f663`]	2025-10-24 12:16:01 -07:00
Khushbu Agarwal	2498b499a1	[CK_TILE] Adding support for TiledPermuteN on preshuffle Block Scale Gemm (#3019 ) * Adding support for TiledPermuteN * Adding test * resolving remod.py --------- Co-authored-by: root <root@banff-cyxtera-s73-2.ctr.dcgpu> [ROCm/composable_kernel commit: `0584399571`]	2025-10-24 11:06:51 -07:00
Max Podkorytov	99ad6f60e4	[CK][host] limit the rotating count to prevent oom (#3089 ) * [CK][host] limit the rotating count to prevent oom * add numeric header for accumulate [ROCm/composable_kernel commit: `f39626fcf7`]	2025-10-24 08:55:54 -07:00
Max Podkorytov	c67f3501b0	limit the rotating count to prevent oom (#3087 ) [ROCm/composable_kernel commit: `fdcc1f75c3`]	2025-10-24 08:55:34 -07:00
kyle-256	c4448c9d7c	[CK_TILE] add tensorwise quant in grouped gemm (#3007 ) * add tensorwise quant in grouped gemm * fix example issue * update test cases * format codes * clang format * use GTEST_FAIL * fix a bug in test_grouped_gemm_util * skip test when use wmma on grouped_quant kernel * change cmake * change code based on comments --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `3c12a02827`]	2025-10-24 07:41:54 -07:00
yinglu	6a7861bbec	conv:tf32:add missed instances (#3081 ) * conv:tf32:add missed instances [ROCm/composable_kernel commit: `6bbc05e1bd`]	2025-10-24 16:28:36 +08:00
Gino Lu	d6933e661d	[CK_TILE] Add fp4 warp gemm 16x16x128 (#2738 ) * first commit * fix format error * fix vec size error * fix clang format * fix type error * add interface in warp_gemm_impl * fix interface * fix bug * fix bug --------- Co-authored-by: asleepzzz <hanwen.chang@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `bedade2572`]	2025-10-23 10:55:51 -07:00
Qianfeng	6ad906b040	[CK_TILE] Fix in set_slice_tile (#2232 ) Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `fbd101b1ac`]	2025-10-23 10:34:02 -07:00
Haocong WANG	895983c816	[CKTILE] FMHA fwd trload lse fix (#3046 ) * enable storelse for fmha_fwd_trload kernel * fix lse in trload * fix the mask related bug [ROCm/composable_kernel commit: `0d3860dfdb`]	2025-10-23 09:33:33 +08:00
lalala-sh	0329d71fb9	[CK_TILE] Update flatmm related kernels (#3022 ) --------- Co-authored-by: Ding, Yi <yi.ding@amd.com> Co-authored-by: felix <felix.li@amd.com> [ROCm/composable_kernel commit: `211d64e18a`]	2025-10-22 22:36:11 +08:00
Johannes Graner	a6c3252766	[CK_TILE] Conv bwd splitN support (#3047 ) * Conv bwd splitN support * Adjust splitting calculations to lengths format * Prepare indexing for future splitK support [ROCm/composable_kernel commit: `cbd1279ae6`]	2025-10-22 13:34:06 +02:00
MHYangAMD	f23b8cde7b	Introduce tree reduction for BlockReduce2dCrossWarpSync (#2588 ) * Introduce tree reduction for BlockReduce2dCrossWarpSync * Rename original impl to BlockReduce2dLinearCrossWarpSync * Replace warp_size with get_warp_size() --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `5a27a97391`]	2025-10-22 14:41:35 +08:00
John Shumway	a488126d3e	[CK_BUILDER] Add compile-time reflection for a convolution instance (#3065 ) * [CK_BUILDER] Add compile-time reflection for a convolution instance Introduce InstanceTraits template metaprogramming framework to enable runtime introspection of device kernel template parameters without requiring implementation knowledge. This reflection system extracts configuration details (block sizes, data types, layouts, tuning parameters) directly from kernel specializations through template pattern matching. In particular, the GetInstanceString method returns a string that uniquely identifies the kernel, by explicitly serializing all template parameter values. This provides critical functionality for MIOpen integration, since the existing GetTypeString method is ambiguous, and only captures some of the template parameters. The implementation uses a two-level design: a primary InstanceTraits template declaration in instance_traits.hpp serves as the interface, while kernel-specific specializations (e.g., for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3) provide the actual extraction logic. This separation allows the reflection system to scale to additional kernel types without modifying the core interface. 
Key architectural decisions: - Forward-declare device kernels in instance_traits.hpp to avoid circular dependencies, since device implementation headers will include the reflection headers - Use compile-time constants and type aliases to expose kernel parameters, enabling zero-overhead introspection - Provide a templated instance_string() function that generates human-readable kernel configuration strings by serializing all template parameters in order, useful for debugging and kernel identification - Guard reflection integration with preprocessor definition CK_EXPERIMENTAL_BUILDER to keep it opt-in until the API stabilizes - Add GetInstanceString() virtual method to BaseOperator, allowing runtime polymorphic access to compile-time kernel information This infrastructure also enables upcoming higher-level semantic reflection abstractions (like ConvTraits) to query kernel configurations programmatically. Includes unit tests validating both the trait extraction accuracy and the string generation format. [ROCm/composable_kernel commit: `37dff024c1`]	2025-10-21 21:10:19 -07:00
Bartłomiej Kocot	ebd8495721	Gridwise gemm conv v3 force padded layout on gfx950 (#2961 ) * Gridwise gemm conv v3 force padded layout on gfx950 * fix bug in other gridwise * fix * Update gridwise_gemm_wmma_cshuffle_v3_common.hpp [ROCm/composable_kernel commit: `3a28632b20`]	2025-10-21 15:41:02 +02:00
Yashvardhan Agarwal	12e9bcd7e2	fix identity value of AbsMax (#3058 ) * fix identity value of AbsMax - Identity value of AbsMax should be 0 not numeric<T>::lowest() * Update include/ck_tile/core/utility/reduce_operator.hpp resolved comment Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> --------- Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com> [ROCm/composable_kernel commit: `35754d2ec8`]	2025-10-21 14:42:08 +02:00
Johannes Graner	0be14218d4	Fix race conditions in ck_tile remod (#3061 ) [ROCm/composable_kernel commit: `4043401db1`]	2025-10-21 09:35:04 +02:00
Max Podkorytov	df3f347a27	[CK_TILE] Fix transpose_vectors for 2x2 8-bit tiles (#3042 ) fix transpose_vectors logic for 2x2 8-bit tiles add a test which goes through this code path. factor out constexpr'd cases into smaller functions. add inline docs about the data movement impact: gemms with 8-bit non-rcr inputs on gfx942 [ROCm/composable_kernel commit: `2570462ecf`]	2025-10-20 13:40:44 -07:00
Gino Lu	182c4404b5	[CK_TILE] Patch for pk_fp4 ref check and buffer load. (#3044 ) * Patch for pk_fp4_raw_t buffer load and ref check [ROCm/composable_kernel commit: `fb1d090f3c`]	2025-10-20 14:47:04 +08:00
AviralGoelAMD	21e65bbb22	docs: add inline comments about flush_cache and rotating buffer [ROCm/composable_kernel commit: `b03764ca5a`]	2025-10-17 12:56:47 -04:00
Yashvardhan Agarwal	89fb435ce2	fix identity values in Max and AbsMax (#3048 ) - The identity value method returned the minimum positive number while we need the lowest number for Max and AbsMax operations [ROCm/composable_kernel commit: `889ffc0b1d`]	2025-10-17 09:49:21 -07:00
Emily Martins	cdb6bd372b	Fix CK Tile Stream-K BF16 Validation Errors (#3039 ) Prior to this change, the number of accumulations passed into calculate_rtol_atol was 1. In most cases this is not correct, because multiple workgroups contribute to the same macro tile in C. This change uses the function estimate_num_wgs_per_tile, which was extracted into a common file and generalized, to estimate the number of workgroups per macro tile. This estimate is passed into calculate_rtol_atol to ensure we get a better relative and absolute tolerance. [ROCm/composable_kernel commit: `352dee5225`]	2025-10-17 09:33:38 -07:00
Johannes Graner	7fec9695d2	Pre-commit in CI (#3029 ) * Pre-commit in CI * Specify python version, and install dos2unix for remod * Refactor remod hook to correctly install dependencies * Run pre-commit [ROCm/composable_kernel commit: `8a4cd32d86`]	2025-10-17 09:28:38 -07:00
Ville Pietilä	bc3a91d23f	Fixed handling of split-K autodeduce argument for grouped convolution (#3024 ) * Fix handling of split-K autodeduce argument. * Fix clang formatting. * Test fix. * Fix clang formatting. [ROCm/composable_kernel commit: `7e44b845b5`]	2025-10-17 15:36:39 +03:00
Johannes Graner	580a54b400	Update pre-commit to fixed versions, run remod for ck_tile (#2895 ) * Fix ruff linter errors * Fix remod dos2unix command * Clang format * Ignore utility in remod * Run remod * Specify clang-format version in pre-commit * Specify ruff version * Include PoolKernelArgs in reference_pool * Add calculate_total_elements to reference batched contraction * Fix calculate_total_elements declaration * Refactor remod pre-commit hook * Fix Aquant tests --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `d40b50b9d5`]	2025-10-16 15:29:17 -07:00
Enrico Degregori	6066662785	Wave Tile Transfer supporting global load with transpose (#3027 ) * Initial implementation: - add new thread group transfer supporting transpose instruction - refactor AB transfer to switch between thread and wave tiles methods * Add some comments and remove explicit wave and lane calculations * Remove compiler option for performance * fp16 example: use tuned instance * Missing cleanup * Integrate wave transfer in existing gemm and batched gemm instances * Add fast instances * extend implementation for 8 bit datatypes packed types not supported * Address review comments * Optimize pipeline v1 and re-introduce compiler option * Disable wave tile approach for b scale gemm * Fix for clang20 * Avoid code duplication of amd_global_load_transpose_to_vgpr function [ROCm/composable_kernel commit: `440358c168`]	2025-10-16 11:33:56 -07:00
kabrahamAMD	b085a51b44	implement device batched gemm b scale for wmma (#2825 ) * rebased on top of develop * fixed missing shuffling and wrong indexing * added tests for batched_b_scale * added missing files * fixed wrong stride computation and removed k batching (for now) due to precision issues * reinstated k-batching with PRNG constrained to -1..1 * added specialization of GeneratorTensor_3 for int4 and fixed internal overflow * added k-batching to reference and increased tolerances for test * changed gemm_b_scale and gemm_universal tests to use correct parameters * addressed review comments * ported fixes back to non-batched version of b_scale * addressed review comments * run clang-format on older commits * add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior * added newline at end of file * reflected changes from multi-abd branch in batched b_scale * fixed gfx11 issue * changed range for pki4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and always should have caused compiler errors, but since there was no int4 specialization of GeneratorTensor3 until now, this passed) * run clang format * set range of i4 generation to 0...1 for upstream tests to pass. This replicated previous behavior, which however means that it is NOT properly tested. * reduced range for pk_i4 even further to 0..0 * removed failing xdl instances. Failure uncovered now that tests were fixed * removed generation of int4 values entirely * divide B buffer by BPackedSize --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com> [ROCm/composable_kernel commit: `c4b2da9cbd`]	2025-10-16 11:00:42 -07:00
Emily Martins	36020b389c	Style updates and cleanup The following changes were made - Renamed iter to iter_start - Renamed tile_iter to tile_iter_start - Moved documentation from member variables to getters - Removed double underscore from extra_iters_before_me variable - Defined parent header in impl file - Removed unused includes [ROCm/composable_kernel commit: `cb83d52301`]	2025-10-16 08:47:06 -06:00
Astha	1d1f8af58b	Addition of the derived structs for the new Stream-K TilePartitioner There are 2 derived structs based on whether Stream-K is persistent or not. If it's persistent, both the data parallel and Stream-K sections are persistent. If it's non-persistent, only the Stream-K section is persistent, while the data parallel section will have separate workgroups allocated for it. Both structs will have a template argument for Persistent. The 2 derived structs will inherit common variables and functions from the Stream-K TilePartitioner base class. Additional variables for the differing data parallel sections will be added to each derived struct; they are in charge of the indexing/bookkeeping for the data parallel sections. The only additional function that will differ between the 2 structs is GridSize(), as the non-persistent struct will allocate extra workgroups for data parallel. Unit tests for the derived structs are included. [ROCm/composable_kernel commit: `8f75d7cea6`]	2025-10-16 08:47:06 -06:00
Emily Martins	64e6fef4ba	Stream-K Tile Partitioner Base Class with Tests To better align with the original Stream-K paper, this change implements a new Stream-K tile partitioner base class. This class will handle the Stream-K setup that is common to both a persistent and non-persistent DP section. A later change will implement derived classes to handle the differences between persistent and non-persistent DP. This change also includes unit tests for the base tile partitioner. [ROCm/composable_kernel commit: `f87f768d16`]	2025-10-16 08:47:06 -06:00
Illia Silin	fd12e33f27	re-enable clang-format by default (#3030 ) * re-enable clang-format by default * fix clang format [ROCm/composable_kernel commit: `3348f01e6f`]	2025-10-15 07:43:11 -07:00


1176 Commits