composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 02:02:46 +00:00

Author	SHA1	Message	Date
kensclin	9f038ac7bc	[CK] Fix the issue of the aiter to call eightwarps pipeline. (#5218 ) ## Motivation Fix the failure of the aiter to call eightwarp. Changed Async to the name eightwarps. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan Pass ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-09 11:12:13 -07:00
Christopher Millette	9b3a64b7d0	[CK_TILE] Fix MMA layout test to match amdgcn_mma OpFamily parameter (#5222 ) ## Summary - PR #4837 added `MmaOpFamily OpFamily_` as a new template parameter to `amdgcn_mma` and `MmaDefaultSelector`, but the MMA layout test (PR #4495) was not updated to include it - Add the missing `OpFamily_` parameter to all three `RegisterMapTraits` partial specializations (gfx9, gfx11, gfx12) and all `MmaDefaultSelector` usages - Fixes build failure: `template argument for non-type template parameter must be an expression` ## Test plan - [x] Verified test compiles cleanly with ROCm 7.1.1 clang++ targeting gfx90a - [x] `test_amdgcn_mma_layout` gfx90a (MFMA): PASSED - [x] `test_amdgcn_mma_layout` gfx1201 (WMMA): SKIPPED (no device) - [x] `test_amdgcn_mma_layout` gfx1100 (WMMA): SKIPPED (no device) - [x] CI validation on all GPU targets 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 22:17:56 -07:00
Christopher Millette	bd6070fb5c	Compile-time optimize threadwise slice transfer (#4673 ) ## Motivation Profiling with `-ftime-trace` on representative translation units (e.g., `device_grouped_conv2d_fwd_xdl_nhwgc_gkyxc_nhwgk_f16_comp_instance.cpp`) revealed that 92% of frontend time was spent in template instantiation. The primary bottleneck was redundant instantiation of identical helper logic across multiple threadwise transfer class variants. Each `ThreadwiseTensorSliceTransfer_v` class independently contained its own copy of the same helper computations — serpentine traversal, coordinate stepping, thread scratch descriptors, lambda-like functors, and compile-time constants — duplicated across 13 header files. When a typical GEMM or convolution kernel TU includes blockwise operations (e.g., `blockwise_gemm_xdlops.hpp`), it pulls in multiple transfer variants simultaneously, causing the compiler to instantiate the same helper logic multiple times with the same template arguments. This was compounded by the helpers being defined as members of the outer `ThreadwiseTensorSliceTransfer_v` classes, which carry 14+ template parameters. Functions like `ComputeForwardSweep` depend only on their two argument types, but as inline members of the outer class, the compiler was forced to create separate instantiations for every unique combination of all outer parameters (data types, descriptors, vector widths, etc.) — even when most of those parameters had no effect on the helper's output. ## Technical Details ### The Fix: Shared Helper Struct Hierarchy Duplicated logic was extracted into a standalone helper hierarchy in `threadwise_tensor_slice_transfer_util.hpp`: ``` ThreadwiseTransferHelper_Base (I0..I16, MoveSliceWindow, ComputeThreadScratchDescriptor, \| ComputeForwardSteps, ComputeBackwardSteps, MakeVectorContainerTuple) +-- ThreadwiseTransferHelper_Serpentine (ComputeForwardSweep, ComputeMoveOnDim, ComputeDataIndex, \| ComputeCoordinateResetStep, VectorSizeLookupTable, VectorOffsetsLookupTable) +-- ThreadwiseTransferHelper_SFC (ComputeSFCCoordinateResetStep) ``` Each helper method is now parameterized only by what it actually uses: - `ComputeForwardSweep(idx, lengths)` — parameterized only by the two argument types, not by `SrcData`, `DstData`, `SrcDesc`, etc. - `ComputeForwardSteps(desc, scalar_per_access)` — parameterized only by the descriptor and access sequence types. - `ComputeCoordinateResetStep<SliceLengths, VectorDim, ScalarPerVector, DimAccessOrder>()` — parameterized only by the four values it actually needs. This reduces template instantiation work in two ways: 1. Across different transfer variants (v3r1 vs v3r2 vs v3r1_gather): the compiler reuses a single instantiation instead of creating one per variant. 2. Across different outer class instantiations (fp16 vs bf16 vs int8): the compiler reuses the helper instantiation because the helper doesn't depend on the data type at all. ### Refactored Headers 13 headers now delegate to the shared helpers instead of duplicating logic: - Serpentine family: v3r1, v3r2, v3r1_gather, v3r1_dequant - SFC family: v6r1, v6r1r2, v6r2, v6r3, v7r2, v7r3, v7r3_scatter - Dead code removed: v4r1, v5r1 ### Additional Fixes Found During Refactoring - Two latent bugs in v3r2 (`forward_sweep` indexing, `GetDstCoordinateResetStep` extraction) - Dead `SrcCoordStep` variables in v4r1 and v5r1 - Unused `scale_element_op_` member in v3r1_dequant (restored with note) ### Net Code Change +1,428 / -2,297 lines (~870 lines removed). ## Test Plan ### Unit Tests 28 host-side gtests in `test/threadwise_transfer_helper/test_threadwise_transfer_helper.cpp` covering the full helper hierarchy: \| Suite \| Tests \| What is verified \| \|-------\|-------\|------------------\| \| ThreadwiseTransferHelperBase \| 6 \| Compile-time constants, inheritance, `MoveSliceWindow` with `ResetCoordinateAfterRun` true/false in 2D and 3D \| \| ThreadwiseTransferHelperSerpentine \| 9 \| `ComputeForwardSweep` (even/odd row, 1D), `ComputeMoveOnDim` (inner complete/incomplete), `ComputeDataIndex`, `ComputeCoordinateResetStep`, `VectorSizeLookupTable`, `VectorOffsetsLookupTable` \| \| ThreadwiseTransferHelperSFC \| 6 \| `ComputeSFCCoordinateResetStep` — single access, 2D row-major, 2D column-major, 3D batch, even/odd inner access counts \| \| ThreadwiseTransferHelperInheritance \| 3 \| Serpentine and SFC derive from Base, are not related to each other \| \| DetailFunctors \| 4 \| `lambda_scalar_per_access`, `lambda_scalar_step_in_vector`, `lambda_scalar_per_access_for_src_and_dst` (same dim, different dims) \| ### Semantic Equivalence GPU ISA comparison using `--cuda-device-only -S` confirmed identical assembly output (modulo `__hip_cuid_` metadata) between baseline and refactored code. ## Test Results All measurements on a 384-core machine, `-j64`, freshly rebooted, near-idle. ### Targeted Builds (affected targets only) \| Target \| Baseline \| Refactored \| Wall-clock Delta \| CPU Delta \| \|--------\|----------\|------------\|-----------------\|-----------\| \| `device_grouped_conv2d_fwd_instance` (160 TUs) \| 7m 37s / 189m CPU \| 6m 53s / 161m CPU \| -9.7%* \| -14.9% \| \| `device_grouped_conv3d_fwd_instance` (185 TUs) \| 9m 49s / 202m CPU \| 6m 42s / 182m CPU \| -31.8% \| -10.0% \| \| Combined \| 17m 27s / 392m CPU \| 13m 35s / 344m CPU \| -22.2% \| -12.4% \| ### Full Project Build (8,243 targets) \| Metric \| Baseline \| Refactored \| Delta \| \|--------\|----------\|------------\|-------\| \| Wall-clock \| 103m 38s \| 111m 56s \| +8.0%* \| \| CPU time \| 4705m 7s \| 4648m 17s \| -1.2% \| \*Wall-clock inflated by external load spike during refactored build (load 90 vs 66). CPU time is the reliable metric. ### Context ~15% of all build targets (1,262 / 8,243) transitively include the modified headers. These are primarily GEMM and convolution kernel instantiations — the core compute workloads. The 12-15% CPU savings on affected targets is diluted to 1.2% across the full project because 85% of targets are unaffected. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 09:26:40 -07:00
Wojciech Laskowski	40e25a7820	Adding layout test for amdgcn_mma structs (#4495 ) ## Motivation Currently, the test suite for `amdgcn_mma` focuses on the design (e.g. choosing the correct specialization based on SFINAE) and a single live test that checks if selected MmaOp runs. This PR adds a simplified GEMM test kernel that checks the exact layout of the selected MmaOp. ## Technical Details The test in `test_amdgcn_mma_layout.cpp` launches MxKxN test cases (one per block), where each case: 1. Constructs A and B tensors on a device with a single 1 at A(m,k) and B(k,n) (rest is all 0s) 2. Executes the MMA intrinsic. 3. Checks if C has the "1" on the excpeted position. For the MMA instrinsic, it pulls a Mma op from amdgcn_mma specialization based on a given input (tile dimension, data types). Note 1: As a helper, in `test_amdgcn_mma_layout_util.hpp` we add register map for a given amdgcn_mma specialization. Register mapping is currently based on the `tile_distribution_encoding`. Note 2: Everything is added to the test suite, no additions to the actual `amdgcn_mma` structs. All the extra information that is needed, but not yet provided by `amdgcn_mma` structs, is added as a boilerplate to the header. TODO: Rebase this PR on top of the `amdgcn_mma` refactor or clean it up after merge. ## Test Plan This PR solely adds a new test to the existing code. ## Test Result Tests pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Kiefer van Teutem <kiefer.vanteutem@amd.com>	2026-03-06 09:20:39 -07:00
chris-tsiaousis-hpc	9a564f7e5f	[CK TILE] Unification of sparse MFMA/WMMA policy structs (#4837 ) ## Motivation The existing unification work supports DENSE intrinsics. In this PR we enable support for SPARSE as well as SCALE intrinsics and add an example SPARSE implementation. ## Technical Details Mostly trivial changes. One framework change is that the desired `MmaOpFamily` is passed to the `MmaDefaultSelector`. As my relevant commit explains, we do not support a fallback family at the moment, but it is something we can consider. ## Test Plan Added a new test for the relevant sparse specializations. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-03-05 19:52:04 +00:00
kensclin	03a93608bc	[CK_TILE][GEMM] Fix eightwarp error & Add eightwarp unit test (#4834 ) ## Motivation The primary goal of this PR is to fix a critical issue in the EightWarps implementation within ck_tile. Additionally, unit tests were added to ensure that CI can detect errors. ## Test Plan ninja test_tile_gemm_quant_abquant_eightwarps ./bin/test_tile_gemm_quant_abquant_eightwarps ## Test Result All EightWarps related test cases in TestCkTileGemmABQuant completed successfully without linker errors or validation mismatches. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-04 04:10:28 +00:00
Brock Hargreaves	8b594deb00	[CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045 ) ## Motivation Still addressing errors that are blocking the merge of TheRock PR: https://github.com/ROCm/TheRock/actions/runs/22545831304/job/65308264096?pr=3382 ## Technical Details 1. There are multiple fmha python scripts that are writing native paths which are confusing cmake. I addressed one of these in an earlier PR https://github.com/ROCm/rocm-libraries/pull/4812 and now I'm addressing more that are exposed with gfx1200 target: ``` [composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library): [composable_kernel configure] Syntax error in cmake code when parsing string [composable_kernel configure] [composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp [composable_kernel configure] [composable_kernel configure] Invalid character escape '\b'. ``` 2. In the following compiler error we see gemm_prec_str<ADataType, BDataType> being passed as a function to concat(...), instead of being evaluated with the parenthesis operator(), i.e., gemm_prec_str<ADataType, BDataType>(). There are multiples instances of this, I wonder what non-msvc compilers do here: ``` [composable_kernel] FAILED: [code=1] example/ck_tile/38_block_scale_gemm/CMakeFiles/tile_example_gemm_quant.dir/gemm_bquant_quantgrouped_mx_bf16bf8.cpp.obj [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_mx_bf16bf8.cpp:4: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm\run_gemm_quant_example.inc:17: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:7: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/concat.hpp:119:21: error: implicit conversion between pointer-to-function and pointer-to-object is a Microsoft extension [-Werror,-Wmicrosoft-cast] [composable_kernel] 119 \| ((oss << sep << rest), ...); [composable_kernel] \| ^~~~ [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp:248:16: note: in instantiation of function template specialization 'ck_tile::concat<char, char[11], std::basic_string<char> (), std::basic_string<char>>' requested here [composable_kernel] 248 \| return concat('_', "gemm_quant", gemm_prec_str<ADataType, BDataType>, GemmPipeline::GetName()); [composable_kernel] \| ^ ``` There are plenty of other places where we use gemm_prec_str with the operator(), so I'm pretty sure these were just typos...but I'd like some eyes on it. 3. There are 2 tests that fail to build on Windows, which I've excluded from the build but will open bug tickets for: 1. gemm_weight_preshuffle 2. grouped_gemm_preshuffle Here's a sample of the compiler error for these tests: ``` [composable_kernel] [16/19] Building HIP object test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] FAILED: [code=1] test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] E:\TheRock\build\core\clr\dist\lib\llvm\bin\clang++.exe -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_OCP_FP8 -DCK_USE_WMMA -DCK_USE_WMMA_FP8 -DCK_USE_XDL -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -IE:/TheRock/rocm-libraries/projects/composablekernel/profiler/include -IE:/TheRock/rocm-libraries/projects/composablekernel -IE:/TheRock/rocm-libraries/projects/composablekernel/library/include -IE:/TheRock/rocm-libraries/projects/composablekernel/include -IE:/TheRock/build/ml-libs/composable_kernel/build/include -IE:/TheRock/build/base/half/stage/include -isystem E:/TheRock/build/core/clr/dist/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/getopt-src/src -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx1200 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -DCK_TILE_USE_OCP_FP8 -MD -MT test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -MF test\ck_tile\grouped_gemm_preshuffle\CMakeFiles\test_ck_tile_grouped_gemm_preshuffle.dir\test_grouped_gemm_preshuffle.cpp.obj.d -o test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -x hip -c E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp:8: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:6: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/check_err.hpp:16: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core.hpp:89: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core/utility/env.hpp:110:31: warning: 'getenv' is deprecated: This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [-Wdeprecated-declarations] [composable_kernel] 110 \| const char* vp = std::getenv(name); [composable_kernel] \| ^ [composable_kernel] C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h:1183:20: note: 'getenv' has been explicitly marked deprecated here [composable_kernel] 1183 \| _Check_return_ _CRT_INSECURE_DEPRECATE(_dupenv_s) [composable_kernel] \| ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:368:55: note: expanded from macro '_CRT_INSECURE_DEPRECATE' [composable_kernel] 368 \| #define _CRT_INSECURE_DEPRECATE(_Replacement) _CRT_DEPRECATE_TEXT( \ [composable_kernel] \| ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:358:47: note: expanded from macro '_CRT_DEPRECATE_TEXT' [composable_kernel] 358 \| #define _CRT_DEPRECATE_TEXT(_Text) __declspec(deprecated(_Text)) [composable_kernel] \| ^ [composable_kernel] clang++: error: clang frontend command failed due to signal (use -v to see invocation) [composable_kernel] AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git a2dc42b87c63e686377a69f09ea23aec7550babc+PATCHED:e4d5bf498b7b8626bb9716f1f5a5946d45025918) [composable_kernel] Target: x86_64-pc-windows-msvc [composable_kernel] Thread model: posix [composable_kernel] InstalledDir: E:\TheRock\build\core\clr\dist\lib\llvm\bin [composable_kernel] clang++: note: diagnostic msg: Error generating preprocessed source(s). [composable_kernel] ninja: build stopped: subcommand failed. [composable_kernel FAILED WITH CODE 1 in 238 seconds] ninja: build stopped: subcommand failed. ``` ## Test Plan Wait for internal CI and make sure build compiles locally. ## Test Result Waiting on CI ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-03 13:54:08 -08:00
assistant-librarian[bot]	213d2f0572	test: Add umbrella test targets for CK Tile operations (#4301 ) ## Proposed changes Adds operation-specific umbrella test targets for CK Tile to enable running all tests for a specific operation without running the entire test suite. This improves the development workflow by allowing faster iteration when working on specific operations. ## Motivation Previously, developers working on CK Tile operations could only: - Run individual test executables one at a time - Run global labels (, , ) which test the entire codebase - Build all tests for an operation but had no simple way to run them all This made it cumbersome to validate changes to a specific operation (e.g., GEMM quantization) without either running tests individually or running the entire test suite. ### Documentation - - Comprehensive testing guide with usage examples and implementation details ## Usage Examples # Run all GEMM tests with 256 parallel jobs ninja -j256 ck_tile_gemm_tests # Run all GEMM block scale (quantization) tests ninja -j256 ck_tile_gemm_block_scale_tests # Run all GEMM StreamK tests ninja -j256 ck_tile_gemm_streamk_tests ## Checklist Please put an into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [x] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run on all changed files - [x] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3654](https://github.com/ROCm/composable_kernel/pull/3654) 🧑‍💻 Originally authored by @AviralGoelAMD --------- Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-03-03 07:39:32 -08:00
Emily Martins	314fcbb3ff	[CK_TILE] Reduce Register Spills in Stream-K Reductions (#4984 ) ## Motivation In CK Tile Stream-K, kernels using one of two non-atomic reduction strategies (i.e., linear, tree) have high register spill count, with the tree reduction generally being worse. These changes act a first step to help decrease the register spill count. ## Technical Details ### Problem 1: Unvectorized access to partials In both the linear and tree reductions, workgroups write partials results to a global buffer; another workgroup will later read this data. When the initial logic to support reading and writing to the partials buffer was added (see https://github.com/ROCm/composable_kernel/pull/3107), the tile distribution encoding used to read from and write to partials matches the register layout for the accumulator of the mfma instruction used for the kernel. Since we do not currently use the transposed register layout for the accumulator, we end with an encoding that is not optimized for writing to HBM. For example: Consider the register layout of the `v_mfma_f32_16x16x32_fp8_fp8` instruction. ```bash ./matrix_calculator.py --architecture gfx942 --instruction v_mfma_f32_16x16x32_fp8_fp8 --register-layout --C-matrix ``` <img width="1113" height="537" alt="image" src="https://github.com/user-attachments/assets/afc8f556-08cc-4224-a6e5-b5edabc5fc02" /> The above shows that threads are responsible for consecutive elements down a column of the C tile. If we use this distribution to read and write to partials with C in row major, then threads are unable to perform vectorized reads and writes. Note: thread 0 is shown in red and thread 1 is shown in green. Since the C-shuffle Epilogue only supports C in row major, reading and writing to partials is highly unoptimized. ### Problem 2: Missed opportunity for SPGR use in tree reduction loop Since the reduction occurs between workgroups, all threads in the workgroup follow the same execution paths in the tree reduction logic, hence various variables should be using SGPRs, but they are not. ### Implemented Solutions 1. Add a new tile distribution encoding that is optimized for accessing partials in HBM. This encoding does not change the data assignment to threads, it merely changes the addresses to which they write/read in the partials buffer. For example, continuing with the `v_mfma_f32_16x16x32_fp8_fp8` instruction, the new encoding would result in threads writing in the following layout: <img width="517" height="342" alt="image" src="https://github.com/user-attachments/assets/93b5e0ea-bafc-47b8-89bb-c40ba75cb202" /> This layout ensures that each thread writes along a row, enabling `buffer_{store\|load}_dwordx4` instructions (i.e., vectorized accesses). This helps reduce register usage due to requiring fewer offset calculations. 2. To force SGPR usage in the tree reduction loop, I make use of CK Tile's `amd_wave_read_first_lane` which is a wrapper around `__builtin_amdgcn_readfirstlane`. This helps reduce VGPR spills in the tree reduction. _These changes do not fully eliminate register spills. Future work will aim to further reduce spills. But these changes make good progress._ ## Test Plan Added tests for different warp tile sizes to validate that the new encoding works with different `WarpGemm` variants. ## Test Result All tests pass locally on all gfx9 architectures. Some results for decreases in register spills on gfx942: (BL = baseline) \| Kernel \| SGPR Spill (BL) \| SGPR Spill (new) \| SGPR Delta \| SGPR % \| VGPR Spill (BL) \| VGPR Spill (new) \| VGPR Delta \| VGPR % \| \|--------\|------------------:\|------------------:\|-----------:\|-------:\|-------------------:\|------------------:\|-----------:\|-------:\| \| fp16 linear F/F/F/T 256x256x32 2x2x1 32x32x16 \| 223 \| 0 \| -223 \| -100.0% \| 21 \| 20 \| -1 \| -4.8% \| \| fp16 tree F/F/F/T 256x256x32 2x2x1 32x32x16 \| 233 \| 11 \| -222 \| -95.3% \| 443 \| 23 \| -420 \| -94.8% \| \| fp8 linear F/F/F/F 256x256x32 2x2x1 32x32x32 \| 221 \| 3 \| -218 \| -98.6% \| 12 \| 6 \| -6 \| -50.0% \| \| fp8 tree F/F/F/F 256x256x32 2x2x1 32x32x32 \| 230 \| 14 \| -216 \| -93.9% \| 396 \| 12 \| -384 \| -97.0% \| ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-02 10:39:48 -07:00
Max Podkorytov	2b1c154ad4	[CK_TILE] Fix CShuffleEpilogue test to use correct GEMM accumulator distribution (#4518 ) ## Summary The test was using LDS distribution to create the accumulator tile, but CShuffleEpilogue expects the GEMM accumulator distribution that BlockGemm produces. This mismatch caused incorrect data permutation. ## Changes - Use WarpGemmDispatcher to get correct accumulator distribution encoding - Load test input from host-initialized global memory for deterministic verification - Shard tests by data type (FP16, FP8) with gfx950-specific FP8 tests - Extract scale tests into separate target for better organization - Implement exact permutation verification (all unique values appear once) - Reduce tile size from 256x256 to 128x128 to fit in unique fp16 range - Add parameterized test configurations for various warp layouts and MFMA types ## Test plan - [x] Run new cshuffle epilogue tests 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>	2026-03-02 00:54:14 -08:00
Max Podkorytov	e199a51dde	Add generate_identity_sequences helper and replace lambdas with named functors (#4828 ) ## Summary - Add `generate_identity_sequences<N>()` helper that returns `Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>>` - Replace lambdas with named functors in `transform_tensor_descriptor` - Add `unpack_and_merge_sequences` helper functor - Reduces `transform_tensor_descriptor` instantiations from 388 to 32 (92% reduction) ## Motivation Multiple call sites use `generate_tuple([](auto i) { return Sequence<i>{}; }, Number<N>{})` pattern. A named helper reduces lambda instantiations. Additionally, each lambda in `transform_tensor_descriptor` creates a unique closure type, causing the function to be instantiated separately for every call site. Named functors share a single type, so the compiler reuses the same instantiation. ## Changes ### Part 1: generate_identity_sequences helper - Replaces common lambda pattern for generating identity sequences - Each lambda expression creates a unique closure type, causing separate template instantiations at every call site - Named helper shares a single type across all uses ### Part 2: Named functors in transform_tensor_descriptor - Add `unpack_and_merge_sequences` helper to replace lambda in `GetNumOfHiddenDimension` - Use `generate_identity_sequences` in `matrix_padder.hpp` ## Test Plan - [x] Added 7 unit tests: - 4 tests for `generate_identity_sequences` - 3 tests for `unpack_and_merge_sequences` - [ ] Waiting for full CI ## Related PRs This PR merges the functionality from: - ROCm/composable_kernel#3588 (generate_identity_sequences helper) - ROCm/composable_kernel#3589 (Named functors in transform_tensor_descriptor) Part of PR stack for issue #4229 (Reduce CK/CKTile Build Times) Note: This PR supersedes #4283, ROCm/composable_kernel#3588 and ROCm/composable_kernel#3589, which can be closed once this is merged. --- 🔁 Imported from [ROCm/composable_kernel#3628](https://github.com/ROCm/composable_kernel/pull/3628) 🧑‍💻 Originally authored by @tenpercent Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 12:10:11 -08:00
Aviral Goel	82d01f9e2d	[CK] Add split-K support for ABQuantGrouped in block_scale_gemm (#4816 ) ## Changes ### Split-K support in `gemm_quant_kernel.hpp` - `SplitKBatchOffset`: Added `aq_group_offset` and `aq_k_split_offset` fields (mirroring the existing `bq_` fields for B) to track each split-K batch's position within the AQ scale tensor. For `ABQuantGrouped`, both offsets are computed from `k_id KRead` divided by `AQuantGroupSize::kK`. - `MakeAQBlockWindow`: Added an `aq_group_offset` parameter (defaulting to 0 for non-split-K paths) so the AQ tensor view's K-group dimension reflects only the remaining K-groups from the split-K offset, consistent with how `MakeBQBlockWindow` handles the BQ tensor. - `RunGemm`: Threads the `aq_k_split_offset` through to `MakeAQBlockWindow` when in split-K mode. ### Constraints in `IsSupportedArgument()` Four constraints gate split-K (`k_batch > 1`) for ABQuantGrouped: 1. Mode check — split-K is only allowed for `BQuantGrouped` (no preshuffle) or `ABQuantGrouped` (no `APreshuffleQuant`). Any other quant mode with `k_batch > 1` returns `false`. 2. B quant group alignment — `KRead` (per-batch K slice) must be divisible by `BQuantGroupSize::kK`. Each batch must operate on complete B quantization groups; a partial group would require splitting a scale value across batches. 3. A quant group alignment (new, ABQuantGrouped only) — `KRead` must also be divisible by `AQuantGroupSize::kK` for the same reason applied to the AQ scale tensor. 4. Minimum 2 K-tile iterations per batch (new) — The software-pipelined GEMM kernels (CompV3 family) prefetch one tile ahead, so they require `per_batch_num_loop = KRead / KPerBlock >= 2`. When `KRead == KPerBlock` (i.e. each batch is exactly one tile), the prefetch reads into the next batch's memory region and produces incorrect results. Configurations where `K == k_batch * KPerBlock` are therefore rejected. ### Example update (`run_gemm_quant_example.inc`) Updated the comment above the `IsSupportedArgument` call to document that split-K is now supported for both `BQuantGrouped` (no preshuffle) and `ABQuantGrouped` (no `APreshuffleQuant`). ## Unit Tests Two new test files covering decode and prefill tile shapes across a range of `k_batch` values (2–8), data types (FP8, BF8), and quantization group sizes (1×1×128 and 1×128×128 for B): - `test_gemm_quant_abquant_splitk_decode.cpp` — uses the decode tile shape (M=16, N=64, K_tile=256) - `test_gemm_quant_abquant_splitk_prefill.cpp` — uses the prefill tile shape (M=128, N=128, K_tile=128) Each test calls `run_test_with_validation` which runs the kernel and checks correctness against a CPU reference. Configurations excluded from tests are annotated with comments explaining which constraint they violate (typically the `per_batch_num_loop >= 2` requirement). ## Prerequisites This PR depends on #4429, which must be merged before this can be merged. --------- Co-authored-by: Erwin Terpstra <erwin.terpstra@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2026-02-26 15:56:34 -08:00
Illia Silin	d0406a6318	[CK] Disable test_fmha_fwd_fp8fp16 on gfx90a by default. (#4883 ) ## Motivation Since gfx90a has no native support for FP8 datatype, all FP8 tests should be disabled there by default. ## Technical Details The test_fmha_fwd_fp8fp16 is the last failing test in CK on gfx90a with staging compiler. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-26 02:08:07 +00:00
Yung-sheng Tu	00853e2bd2	Implement device_grouped_gemm_fixed_nk_bias for RDNA4 (#4340 ) ## Proposed changes Summary: - Modified implementation for grouped_gemm_fixed_nk_bias - FP16 WMMA examples - WMMA instances - Profiler for grouped_gemm_fixed_nk_bias - Add WMMA instances to existing tests This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299 and should be merged after it. Only the last 6 commits are in the scope of this PR. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [x] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-26 00:28:09 +00:00
JP-Fernando	bea67cb520	[CK] Remove duplicated XDL/WMMA tests (#4415 ) ## Motivation When we started the RDNA4 support, the XDL instances were not supporting WMMA instructions, so we duplicated some tests. In this issue, we simplified most of the duplicated test files into common test files. ## Technical Details The following tests were unified: - `batched_gemm` - `batched_gemm_gemm` - `gemm_add` - `gemm_universal` - `grouped_convnd_bwd_data` The following tests were duplicated exactly, and copied into two files with `_xdl` and `_wmma` suffixes. Now they are unified in one single file without suffix: - `gemm_multi_abd` - `gemm_b_scale` There is still an apparent duplication which is a special case, namely `test_grouped_convnd_bwd_weight_interface_{suffix}` where `{suffix}` is `xdl` or `wmma`. However, the WMMA code relies on an old implementation, and is expected to be removed in the future. In addition, it differs from the XDL implementation significantly. Therefore, it was decided to keep both files separate instead of attempting any unification. ## Test Plan `CMakeLists.txt` files were modified to support the new, unified tests. In particular, testing was done for `gfx90a`, `gfx1201` and `gfx11` architectures. ## Test Result All tests passed successfully on all three tested architectures. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com>	2026-02-25 23:22:05 +00:00
Bartłomiej Kocot	2714328036	[CK] Small improvements for grouped conv backward weight (#4872 ) ## Motivation Improvements for CK Tile convolution builder run function and atol/rtol calculations. ## Technical Details - Add preprocessing function for wrw when k_batch is larger than 1 for builder run function - Divide num acums by number of groups to get real number of accums ## Test Plan CI wrw tests ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-783	2026-02-25 20:10:12 +00:00
Zoltán Lakatos	87d418a07a	[CK] Implement device grouped gemm fixed nk multi abd for rdna4 (#4425 ) ## Motivation Add support for grouped gemm multi ABD fixed NK. MR ## Technical Details Changes from the reverted PR: - Device struct for grouped gemm with multiple ABD and fixed NK (DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK). - Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD - Unit tests for both new wmma implementation and the reference xdl code (previously missing) - Note: Some Xdl instances were commented out because of unit test failures. As mentioned apparently for xdl this feature was missing tests so our assumption is either there is an implemenetation bug or these instances were not set up correctly. Has the potential for a follow-up issue. - Generic ck profiler interface with the purpose of calling unit tests. - Gemm instances with specific elementwise operations for gemm bias gelu calculations. - Added class for grouped gemm multi ABD reference calculations. Fix epilogue selection in device implementation that caused unit test failures ## Test Plan Covered by added unit tests ## Test Result CI successfully passing ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-25 05:16:07 +00:00
assistant-librarian[bot]	11e6485457	[CK_TILE] Extend support of mix precision microscaling BQuant (#4267 ) ## Proposed changes Supported types combinations using BQuant=e8m0: - A=bf16 - B=bf16,bf8,fp4 Summary: - remove usage of `pk_fp4_raw_t`: consistent with other implementations and avoid taking into account of the packed size explicitly. In general, the raw type should not be used because CK Tile internally takes care of the PackedSize, so using the raw type adds unnecessary complexity to the implementation - handle microscaling by checking for `e8m0` type for BQuant (previous implementation was inconsistent) - add support for scaling instructions in `DequantPack8` - mx pipeline: - extend existing pipeline to support different B types - add support to scale and cast before writing to LDS or after reading from LDS (this can be defined in the `Problem` by the user) - block gemm: - mx pipeline is now using block gemm BQuant - block gemm BQuant can now load from LDS and apply scale and then call block gemm universal operator. This adds new functionalities and remove code duplication - warp gemm: - add case to support 128bit ds_read/write for both A and B when A=16bit and B=8bit - add examples and tests: note that some tests for bf16/fp4 already existed but were removed during previous tests refactoring. I added them again and other relevant tests for new types combinations ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3689](https://github.com/ROCm/composable_kernel/pull/3689) 🧑‍💻 Originally authored by @EnricoDeg --------- Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-24 09:55:50 -08:00
Emily Martins	f973b15516	[CK_TILE] Update Stream-K Reduction Strategy Enum (#4756 ) ## Motivation Currently, Stream-K has 3 reduction options: 1) atomics, 2) The reduction described in the Stream-K paper, and 3) a tree reduction. The reduction strategy described in the original Stream-K paper has the starting workgroup of each tile sequentially accumulating partial results of other contributing workgroups in the tile, which requires a linear number of steps. Hence, for clarity, this works updates the naming of the `StreamKReductionStrategy` enum members to better describe the existing reduction strategy options. ## Technical Details Prior to this change, the enum is as follows: ```cpp enum StreamKReductionStrategy : uint32_t { Atomic = 0u, Reduction = 1u, TreeReduction = 2u }; ``` But, the distinction between `Reduction` and `TreeReduction` is not very clear and has some redundancy. Hence, the updated enum is as follows: ```cpp enum StreamKReductionStrategy : uint32_t { Atomic = 0u, Linear = 1u, Tree = 2u }; ``` All references to `StreamKReductionStrategy` were updated to reflect this change. ## Test Plan No new functionality was added, so no new tests were added; I just validated existing tests and examples. ## Test Result All tests passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-24 06:40:08 +00:00
Cong Ma	4ba02a8b68	[CK TILE] Refactor sequence_reverse_inclusive_scan (#4355 ) ## Proposed changes Refactor ck tile `sequence_reverse_inclusive_scan` from recursive to for-loop. Tracking issue: #4229 This pull request introduces a new lightweight array type, `static_array`, and refactors the sequence utilities to use it for improved constexpr support and simplicity. The changes also include updates to the build system to add container-related tests. Core Library Improvements: * Added a new header `static_array.hpp` that defines the `static_array` type, a constexpr-friendly array with basic accessors and no custom constructors. * Updated includes in `core.hpp` and `sequence.hpp` to import `static_array`. [[1]](diffhunk://#diff-14b406eccf59794051a16c0c9c1a7e11234324bfdd107a5bbe0f173cd25bcddcR44) [[2]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76R7) Refactoring to Use `static_array`: * Refactored sequence utilities in `sequence.hpp` to use `static_array` instead of the previously forward-declared `array` type, including in histogram and array generation logic. [[1]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76L1108-R1133) [[2]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76L1130-R1146) * Rewrote the implementation of `sequence_reverse_inclusive_scan` to use `static_array` for intermediate storage, improving constexpr evaluation and clarity. Build System and Testing: * Added a new test subdirectory for container tests and a GoogleTest executable for `unit_sequence.cpp` to the CMake build configuration. [[1]](diffhunk://#diff-5d35ff7555d3f0b438d45cde06b661eb1332cdbec66287ac7ec3c478d688aae5R5) [[2]](diffhunk://#diff-1f54f0d2b431b7fc74f7b4ffb66e80c381c904c3383b1d27987467e3482d6d7aR1-R7) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-23 14:12:03 -07:00
Anton Gorenko	cd23de8b38	[CK_TILE][FMHA] Support gfx11 (#4584 ) ## Motivation Add support of gfx11 architectures (RDNA3) to FMHA. ## Technical Details Distributions (matrix elements to lane registers mapping) of gfx11 WMMA are completely different from distributions of gfx9 MFMA and gfx12 WMMA. There are two cases in FMHA where this difference matters: * usage of results (matrix C) of one GEMM as input (matrix A) of another GEMM. * random number generation for dropout (implementation for gfx9 MFMA, gfx12 WMMA and host validation produce the same results). Both cases are solved by a special remapping implemented using `__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`. Additional changes: * FMHA tests are now build and run only for those types for which instances exist (gfx11 supports only fp16 and bf16). * Two fixes for uninitialized values (`mask.sink` and `do_fp8_static_quant`): they may contain garbage resulting in incorrect dispatching logic, sometimes tests report that there are no instance available for current parameters. * Small fix to remove expcnt(0) from s_waitcnt instruction on gfx11 when they are not requested (i.e. every time), likely has no effect on performance but makes disassembly a bit clearer. ## Test Plan ``` ninja test_ck_tile_fmha bin/test_ck_tile_fmha_fwd_fp16 bin/test_ck_tile_fmha_fwd_bf16 bin/test_ck_tile_fmha_bwd_fp16 bin/test_ck_tile_fmha_bwd_bf16 ``` ## Test Result All tests must pass (some tests may be skipped). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-20 17:15:10 -08:00
assistant-librarian[bot]	9c8d3a39ac	173 implement device grouped gemm fixed nk for rdna4 (#4299 ) ## Proposed changes This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk instance library using for WMMA. The implementation is based on the existing DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level structure, but replaces the XDL kernel with a WMMA-based one. It uses the GridwiseGemm_wmma_cshuffle_v3 kernel. At this stage, the focus is functional correctness and compatibility, not performance tuning. ## Technical Details - Device struct for grouped gemm fixed NK - Example code for the WMMA version - Unit tests for both new wmma implementation and the reference XDL code (previously missing) - Generic ck profiler interface with the purpose of calling unit tests. ## Checklist Please put an into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [x] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run on all changed files - [x] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3668](https://github.com/ROCm/composable_kernel/pull/3668) 🧑‍💻 Originally authored by @bidlekm --------- Co-authored-by: Marton Bidlek <marton.bidlek@streamhpc.com> Co-authored-by: Erwin Terpstra <erwin.terpstra@streamhpc.com> Co-authored-by: bidlekm <bidlekmarton@gmail.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-19 09:13:05 +01:00
Thomas Ning	e5fb690945	Fix the Composable Kernel CI and versions incompatibility (#4640 ) ## Motivation This PR has 4 patches: 1. Fix the CI error of grouped gemm. 2. Fix the incompatibility of old linux version. 3. Fix the potential errors of flatmm. 4. Address the previous comments of abquant eight warps pipeline solution. --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>	2026-02-18 06:59:37 -08:00
Christopher Millette	a95cf55aa7	[CK] Optimize multi-dimensional static for loop decomposition (#4447 ) ## Motivation Recursive template implementations might initially seem attractive to minimize necessary coding. Unfortunately, this style is often affects readability and requires significant resources from the compiler to generate instantiation chains. In "high-traffic" code (e.g., used in many places + compilation units), this generally does not scale well and can bloat the overall compile times to unnecessary lengths. The aim of this PR is to take some of most high-traffic utility code and try our best to eliminate recursive templates in favor of fold expansions and constexpr function helpers. In local tests with clang build analyzer, device_grouped_conv2d_fwd_xdl_ngchw_gkcyx_ngkhw_f16_16x16_instance.cpp showed high hit-rates on slow template instantiations in static_for, dimensional static_for (static_ford), which are subsequently affected by implementation of the Sequence class and associated transforms. Example: ** Templates that took longest to instantiate: 70111 ms: ck::detail::applier<int, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1... (372 times, avg 188 ms) // 70 seconds!** The above is part of the implementation of static_for which uses Sequence classes.. ## Technical Details ### Summary of Optimization Techniques \| Technique \| Used In \| Benefit \| \|-----------\|---------\|---------\| \| __Constexpr for-loop computation__ \| sequence_reverse_inclusive_scan, sequence_map_inverse \| Moves O(N) work from template instantiation to constexpr evaluation \| \| __Pack expansion with indexing__ \| sequence_reverse, Sequence::Modify \| Single template instantiation instead of recursive \| \| __Flat iteration + decomposition__ \| ford, static_ford \| O(1) template depth instead of O(N^D) \| \| __Pre-computed strides__ \| index_decomposer \| Enables O(1) linear-to-multi-index conversion \| ### Impact on Compile Time These optimizations reduce template instantiation depth from O(N) or O(N^D) to O(1), which: 1. Reduces compiler memory usage 2. Reduces compile time exponentially for deep instantiation chains 3. Enables larger iteration spaces without hitting template depth limits ## Test Plan * Existing tests for Sequence are re-used to affirm correctness * Unit tests for ford and static_ford are added (dimensional looping) * 8 new regression tests specifically verify the fixes for the PR feedback: - `NonTrivialOrder3D_201` - Tests Orders<2,0,1> for static_ford - `NonTrivialOrder3D_201_Runtime` - Tests Orders<2,0,1> for ford - `ConsistencyWithNonTrivialOrder_201` - Verifies static_ford and ford consistency - `NonTrivialOrder3D_120` - Tests Orders<1,2,0> for static_ford - `NonTrivialOrder3D_120_Runtime` - Tests Orders<1,2,0> for ford - `NonTrivialOrder4D` - Tests 4D with Orders<3,1,0,2> for static_ford - `NonTrivialOrder4D_Runtime` - Tests 4D with Orders<3,1,0,2> for ford - `AsymmetricDimensionsWithOrder` - Tests asymmetric dimensions with non-trivial ordering ## Test Result ### Compile Time Comparison: `8b72bc8` (base) → `477e0686` (optimized) #### Commits in Range (8 commits) 1. `fd4ca17f48` - Optimize sequence_reverse_inclusive_scan and sequence_reverse 2. `7a7e3fdeef` - Optimize sequence_map_inverse 3. `92855c9913` - Optimize ford and static_ford calls to eliminate nested template recursion 4. `88a564032b` - Add unit tests for ford and static_ford 5. `1a0fb22217` - Fix clang-format 6. `8a0d26bddf` - Increase template recursion depth to 1024 7. `dc53bb6e20` - Address copilot feedback and add regression tests 8. `477e06861d` - Increase bracket depth to 1024 #### Build Timing Results \| File \| Base (8b72bc8759d9 \| HEAD(a0438bd398) \| Improvement \| \|------\|------\|------\|-------------\| \| grouped_conv2d_fwd (f16) -j1 \| 313.31s \| 272.93s \| __12.9% faster__ \| \| grouped_conv1d_fwd (bf16) -j1 \| 79.33s \| 68.61s \| __13.5% faster__ \| \| grouped_conv1d_bwd_weight (f16) -j1\| 15.77s \| 14.31s \| __9.2% faster__ \| \| device_grouped_conv2d_fwd_instance -j64 \| s \| s \| __% faster__ \| #### Key Optimizations 1. __sequence_reverse_inclusive_scan/sequence_reverse__: O(N) → O(1) template depth 2. __sequence_map_inverse__: O(N) → O(1) template depth 3. __ford/static_ford__: O(N^D) → O(1) template depth using flat iteration with index decomposition 4. __Copilot feedback fixes__: Corrected New2Old mapping for non-trivial orderings ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-02-11 22:12:31 +00:00
Bartłomiej Kocot	790a786035	[CK][CK TILE] Add has hot loop check for pipeline v1 (#4407 ) ## Motivation Add has hot loop check for pipeline v1 (v1 basic and v1 basic async). Enable more tests which have been fixed by this change. ## Technical Details Hot loop has been executed without num loop check. ## Test Plan test_grouped_convnd_fwd_tile ## Test Result Passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-651 AICK-663	2026-02-11 13:41:59 +00:00
Cong Ma	cdba8b787c	[CK TILE] fix numerical errors of preshuffle_b (#4354 ) This pull request introduces several improvements and fixes related to quantized grouped GEMM (General Matrix Multiply) pipelines and their supporting utilities. # The numerical issue ## Steps to reproduce ```bash Run ./bin/tile_example_gemm_weight_preshuffle -prec=fp8 ./bin/tile_example_gemm_weight_preshuffle -prec=int4 ``` # Solution The main changes address type correctness, improve data layout and shuffling logic, and expand test coverage to better validate different GEMM configurations. Key changes include: ### Data layout and shuffling logic * Refactored the logic in `shuffle_b_permuteN` to use `constexpr` variables for `KLane` and `ItemsPerAccess`, simplifying tile view construction and correcting the permutation order for improved efficiency and correctness (`tensor_shuffle_utils.hpp`). * Fixed the calculation of `KLaneBytes` in weight preshuffle pipeline policies to account for internal data type conversion (e.g., from `pk_int4_t` to `fp8`), ensuring accurate memory access and alignment in quantized GEMM policies (`wp_pipeline_agmem_bgmem_creg_base_policy.hpp`, `gemm_wp_abquant_pipeline_ag_bg_cr_base_policy.hpp`). [[1]](diffhunk://#diff-93f16cd76e6e24404777e682a5ac8e039913ddd6a438c7efd61fdda42276e4efL274-R275) [[2]](diffhunk://#diff-9c3d0fc3c014feed435bfd93ba1f8f9fb3e054dcc322deada3addf70bee5a58cL100-R105) ### Test infrastructure enhancements * Unit tests did not catch this issue since there were no tests for fp8. Added new configuration structs (`config_mn_16x16`, `config_mn_32x32`) to support additional GEMM tile shapes and updated tests to run with these configurations for broader coverage (`test_gemm_pipeline_util.hpp`). [[1]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8R86-R103) [[2]](diffhunk://#diff-5a5962b2c4aa7f6a87d1d6201ad383135e30df13b42654e997d870d57420d5b8L255-R269) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-10 23:04:44 -08:00
assistant-librarian[bot]	2073f0a73b	[CK_TILE]: PreshuffleB + PreshuffleBQuant for ABQuant pipeline (#4268 ) ## Proposed changes Implement BQuantPreshuffle option for the ABQuant PreshuffleB pipeline. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [X] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [X] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [X] I have added inline documentation which enables the maintainers with understanding the motivation - [X] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [X] I have run `clang-format` on all changed files - [X] Any dependent changes have been merged --- 🔁 Imported from [ROCm/composable_kernel#3687](https://github.com/ROCm/composable_kernel/pull/3687) 🧑‍💻 Originally authored by @ErwinTerpstra --------- Co-authored-by: Erwin Terpstra <erwin.terpstra@streamhpc.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>	2026-02-10 06:57:55 -07:00
Bartłomiej Kocot	0b9fa702ac	[CK] CK Tile grouped convolution direct load (#4406 ) ## Motivation CK Tile grouped convolution forward direct load support. ## Technical Details Basic pipeline for direct load and new instances for forward for v1 and v4 pipelines. ## Test Plan test_grouped_convnd_fwd_tile ## Test Result CI pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-130	2026-02-09 22:08:57 +01:00
Aviral Goel	92fbf5a880	Increase tolerance for FP16 GEMM tests to handle non-deterministic ro… (#4335 ) …unding Three tests were failing intermittently with small errors (0.01-1.5%) due to non-deterministic FP16 accumulation order from GPU thread scheduling: - test_ck_tile_batched_gemm - test_ck_tile_grouped_gemm_preshuffle - test_ck_tile_grouped_gemm_multi_d These tests use kbatch=1 (no split-K), so errors are from order-dependent rounding, not atomics. Increased tolerances from 1e-3 to 2e-3 (0.2%) to account for FP16 precision limits while still catching real bugs. - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-06 16:14:28 -08:00
Enrico Degregori	442c3097ee	[CK] Workaround blockscale wp test failure (#4372 ) ## Motivation Workaround to fix blockscale wp test failure for pipeline v3 ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-06 16:09:08 -08:00
Illia Silin	ea1f04464b	Revert "Implement device grouped gemm fixed nk multi abd for rdna4 (#3619 )" (#3705 ) This reverts commit `1a8bd3d34b`. [ROCm/composable_kernel commit: `569640dc70`]	2026-02-03 09:52:14 -08:00
Emily Martins	1bc181c33f	[CK_TILE] Stream-K Tile Engine Test Config File Generation (#3662 ) * Stream-K smoke test config file generation This change converts the stream-k smoke tests to use tile engine. Since the m, n, and k values dependent on the CU count of a device, the configs are generated during the Configuration Phase. * Compute GEMM reference on GPU * Remove redundant Stream-K tests Removing redundant tests that are now run via tile engine. * Fix relative and absolute tolerance calculation This change updates the Stream-K tile engine interface to ensure that num_wgs_per_tile is propaged and passed into the compare_results function to calculate the rel and abs tolerance. Before, split-k was used, which is incorrect for Stream-K since the split-k value is always 1. * Cleanup imports, types, and other misc items This commit makes the following changes: - Uses Typing module for nested type hints - Uses quotes around cu_count_arg argument in generate_configs.cmake in if statements - Adds explicit include for tuple in test_gemm_streamk_simple.cpp - Adds a type for the tiles argument in argparser to check argument validity * Use CU count as return value for better parsing * Add reduction tests for bf16, fp8, and bf8 [ROCm/composable_kernel commit: `8cbd09c84a`]	2026-02-03 09:12:15 -07:00
Aviral Goel	4ecc7da10e	feat: add split_k support for block scale gemm bquant mode. (#3653 ) * WIP: add splitk to bquant * feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types * chore: remove temporary test script * fix: incorrect tile window length for splitted bq tensor window * chore: improve comments * test: add unit tests to cover bquant splitk functionality * fix: conflict resolution by renaming variables [ROCm/composable_kernel commit: `3e77721755`]	2026-02-02 14:41:53 -08:00
Zoltán Lakatos	1a8bd3d34b	Implement device grouped gemm fixed nk multi abd for rdna4 (#3619 ) * device struct implementation * added xdl grouped multi abd fixed nk testing * wmma implementation fixed * avoid unnecessary device mem allocation and code cleanups * cleanup instances definitions * wmma examples added * code cleanups * fix clang format * typo and compilation fixes related to reference gemm * fix compilation error due to std::remove_cvref_t * added missing hip_check_error includes * correction to example instances * review commentes addressed * removed split-k from testing * code formatting --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `301eb5cf08`]	2026-02-02 13:58:11 -08:00
Jan Patrick Lehr	4dece9c549	[Compiler] Addressing new compiler warnings (#3640 ) * [Compiler] Addressing new compiler warnings Clang enables new lifetime warnings in production and we see build errors due to this with the staging compiler. The attributes added in this PR are suggested by the compiler. However, I'm not very familiar with the code base, so the changes may be incorrect. * Update some more instances * Adds file-level ignores via clang diagnostic pragma The number of instances was large, so I decided to use file-level scope to disable the warning via pragma clang diagnostic ignored. It also showed this warning coming from the gtest dependency. For that, I did add the respective command line flag to the CMake variables. I don't know if this is acceptable or not. * This adds the remaining instances For a build on gfx90a. * fix clang format * Adding couple more instances from gfx1200 build * Fixed another few instances --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `069500464d`]	2026-02-02 09:39:48 -08:00
ZheWang	418ee44844	Mx fp6 flatmm (#3601 ) * add fp6 data-type and support sync/async dwordx3 load/store * clang-format * pre-commit * 1st commit * default mnk pass ut * fix a distrubution * fix * fix bdram distr * update * pass ut * improve perf * update * clean code * resolve copilot comment * reslove comment * clang-format --------- Co-authored-by: ZheWang <zhewan@amd.com> [ROCm/composable_kernel commit: `e6bcd192d4`]	2026-02-02 16:04:40 +08:00
Kiefer van Teutem	d34916a7ff	Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529 ) * Adding remaining flavors for grouped conv fwd As titled. Following variants are added: - grouped_conv2d_fwd_dynamic_op - grouped_conv3d_fwd_dynamic_op - grouped_conv3d_fwd_bilinear - grouped_conv3d_fwd_convscale - grouped_conv3d_fwd_convinvscale - grouped_conv3d_fwd_convscale_add - grouped_conv3d_fwd_convscale_relu - grouped_conv3d_fwd_scale - grouped_conv3d_fwd_combconvscale - grouped_conv3d_fwd_scaleadd_scaleadd_relu * Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3. * Do not build f8 / bf8 only flavor tests on RDNA3 * Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time. * Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream. * adding int8 and fp16 overloads to the elementwise operations * fixed copilot nits * Addressing review comments: - removed unnecessary examples for dynamic op - removed unnecessary conv specalizations for all the flavors - removed spurious bilinear and scale source files * clang-format * reduced no of tests --------- Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com> [ROCm/composable_kernel commit: `2377a62837`]	2026-01-30 17:02:14 +01:00
Erwin Terpstra	a5824466fb	[CK_Tile] Support for a4w4 (fp4) in block scale gemm AB quant (#3603 ) * chore: split block scale example instances in more separate files to speed up compile times * wip: fp4 scaffolding for abquant * feat: add fp4 decoding-while-loading to abquant pipeline * feat: add support for fp4 CPU verification in abquant * chore: add time tracking to reference calculation * feat: add a4w4 test for blockscale gemm * feat: optimize reference calculation by preconverting values to AccType * feat: add fp4 to fp8 look-up table * fix: reference to wrong ComputeDataType field in QuantProblem * feat: type utilities for determining MFMA compute types * feat: packed fp4 for abquant weight preshuffle * feat: add separate tests for a4w4 base case, padding and preshuffleB * fix: fp4 conversion on gfx950 attempting to use non-supported method * fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size * chore: add fp4 preshuffleb mode to block scale example * chore: sanity check for packed types being 1 byte * chore: clarify tensor dimension indices with constants * chore: replace traits check with specialized check for packed types * style: some minor refactoring and cleanup * fix: correct conversion table for FNUZ fp8 * chore: add fp4 instances to main abquant instances again * chore: use same initialization branch for int4 and fp4 * chore: add missing initialization for fp4 in block scale gemm example --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `6a6177a246`]	2026-01-30 04:40:50 -07:00
Johannes Graner	e8bf2e1418	[Conv] Enable bwd weight splitk autodeduction with cap (#3656 ) * Enable bwd weight splitk autodeduction with cap * Fix error threshold calculations * Add missing logic to wmma multiple d kernel * Fix threshold calculation * Update test with new applicability [ROCm/composable_kernel commit: `fabac7e2c3`]	2026-01-29 17:40:28 +00:00
Khushbu Agarwal	9fc9cc598f	[CK_Tile] Adding support for preshuffleQuant in AB quant Block Scale Gemm (#3629 ) * initial commit * preshuffleQuant support for ABQuant * fix mxfp4 to use correct QuantGroupSize * addressing review comments and seperated Preshufflequant for A and B * updated grouped gemm example for updated traits definition * fix for CI failure * updated grouped_gemm_abquant test for updated traits definition * updated grouped_gemm_abquant test for updated traits definition [ROCm/composable_kernel commit: `9b168082b7`]	2026-01-28 19:45:09 -08:00
Robin Voetter	7d1574a9ab	[CK_BUILDER] Integrate CKB validation with CK verification (#3649 ) * ck-builder: tensor copy function This function copies one tensor to another, so that the memory layout can be changed between them. * ck-builder: fix ck::bhalf literals These types don't work properly. * ck-builder: abstract compare_elements in gpu_verification.hpp and make builder use it This reduces the amount of duplicated code a bit. * ck-builder: add flat tensor iterator This "iterator" type pretends to be a pointer, useful for passing tensors to functions expecting pointer-like types. * ck-builder: integrate validation with ck gpu verification By templating the gpu_verify function over iterators, we can use the new FlatTensorIterator to adapt the function to multi- dimensional tensors without changing either implementation too much. * ck-builder: add check_by_accumulations This changes the gpu_verification.hpp code to also accept "iterator" types for the relevant gpu_verify and gpu_reduce_max functions. * ck: fix test_gpu_verification GenerateRandomData for bhalf is_integer_it<bhalf_t> yields true, but it is not actually an integer. * ck: make gpu_verification kernels be proper persistent kernels Previously these were using a hardcoded value for the grid size. This commit changes that so that the grid size is automatically derived from the kernel's occupancy and the number of multiprocessors on the GPU. * ck: clean up gpu_verification.hpp using block_reduce This implements a small generic block reduce function, and rewrites the rest of gpu_verification.hpp using that function to clean it up a bit. * ck-builder: doc typos * ck-builder: update testing readme with validation interface. * ck-builder: rebase fixes + review comments * ck-builder: fix device integer generation with float types Passing bfloat here causes a nans due to type_convert performing a bitcast. * ck: another bhalf_t bug CK expects that int-generation with ck::bhalf_t yields bhalf integers, not unsigned integers. This makes the logic of FillUniformRandInteger compatible with GeneratorTensor_2<InDataType>, however idiotic that may be. [ROCm/composable_kernel commit: `42048bdb7d`]	2026-01-28 17:41:02 +01:00
damien-lejeune	24d3cbc30d	[CK Tile] multi reduce improvements (#3607 ) * WIP: refactoring * Swap operation/data nested loops order * Improve memory coalescing * Add comments * Enforce same identity element for the reduce operations * Re-add compile time constant * Comment + re-add __builtin_amdgcn_readfirstlane(0) to the loop init --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> [ROCm/composable_kernel commit: `91e32f305f`]	2026-01-27 12:56:09 -08:00
Bartłomiej Kocot	42638c34b0	[CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624 ) * [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err * Update test_grouped_convnd_fwd_tile.cpp * Update test_grouped_convnd_fwd_tile.cpp * Update conv_tuning_params.hpp * clang format fix * Update CMakeLists.txt [ROCm/composable_kernel commit: `3d67e6c492`]	2026-01-27 11:04:11 +02:00
Johannes Graner	8049ce9be4	[CK tests] Extend conv GPU reference (#3539 ) * test_convnd_fwd * test_convnd_bwd_data * test_conv_bwd_data_scale * test_grouped_convnd_fwd_clamp * test_grouped_convnd_fwd_scale * multiple A/B tensors and D tensor for fwd GPU ref * test_grouped_convnd_fwd_scaleadd_ab * test_grouped_convnd_fwd_bias_clamp * test_grouped_convnd_fwd_bilinear * test_grouped_convnd_fwd_gk_bias_clamp * Extend GPU reference to enable batchnorm epilogue * test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp * test_grouped_conv_bwd_data_bilinear * test_grouped_convnd_bwd_weight_bilinear * Add missing template instantiation * Perform operations in float in reference * Slightly increase tolerance for batchnorm profiler * Revert "Slightly increase tolerance for batchnorm profiler" This reverts commit `a3b2475229`. * Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp" This reverts commit `6da4576060`. * Revert "Extend GPU reference to enable batchnorm epilogue" This reverts commit `e2f75fa10e`. * Clarify variable names * Refactor elementwise ops into helper functions * Make helpers C++17-compatible [ROCm/composable_kernel commit: `c190d8d61f`]	2026-01-27 09:49:42 +01:00
Robin Voetter	aa35585fc7	[CK_BUILDER] conv bwd weight testing (#3618 ) * ck-builder: restructure testing conv In order to prepare for bwd of conv testing, this commit moves some files and types around so that we can reuse ckt::Args for both forward and backwards convolution. * ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp This will allow us to more easily include fwd.hpp from backwards definitions, which is required for initializing bwd values. * ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3 Turns out that the supplied layout isn't actually supported... * ck-builder: ck and reference conv integration for bwd weight * ck-builder: ck bwd weight execution test * ck-builder: ckt::run support for ck-tile bwd weight * ck-builder: ck tile bwd weight execution test * ck-builder: extra debug printing in MatchesReference * ck-builder: make ckt::run return RunResult This type is more convenient than std::tuple, as it will allow us to use google test matchers with this in the future. * ck-builder: RunResult matcher Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error message about how and why running an algorithm failed. * ck-builder: doc fixes * ck-builder: add missing headers [ROCm/composable_kernel commit: `cc75948d1c`]	2026-01-26 23:50:15 +01:00
Enrico Degregori	6e95bf8179	Padding support for wave transfer (#3537 ) * Add padding support with transpose Also move check before writing storing is_src_valid during reading * Add/modify instances to use wave transfer for gemm universal Condition is changed so now the vectorsize of vmem reading and lds writing must be equal to 8 in order to use the wave transfer * Fix clang format * Modify example * Fix bwd data * Add restriction for wave transfer with padding and transpose Add test case which shows this limitation * Fix validity checks 8 bit types * Add validity check gemm_bias_add_reduce * Add validity check grouped gemm tile loop * Fix validity checks new flavours * Minor fixes * Fix clang format [ROCm/composable_kernel commit: `2e49b6b2f7`]	2026-01-26 12:57:09 -08:00
Aviral Goel	2a17f6e537	feat: Add Interwave scheduler for aquant memory pipeline (#3540 ) * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * test: add unit test for aquant memory pipeline * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * fix: compilation error on gfx950 * chore: remove debug statements from the code * test: resolve merge conflict * test: remove non rcr unit tests from test suite [ROCm/composable_kernel commit: `b8751e505d`]	2026-01-26 11:27:42 -08:00
SamiAario-AMD	b07fbbc33a	Re enable f8 x bf8 tests on compv3 and compv4 (#3605 ) * Re-enable f8 x bf8 tests on CompV3 as they now pass * On CompV4, fp8 x bf8 tests now pass with K_BlockSize I32 * Add a changelog entry --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `834642202c`]	2026-01-26 10:23:26 -08:00
Max Podkorytov	8ae166963e	Optimize sequence metaprogramming utilities to reduce template instantiation depth (#3585 ) This change significantly improves compile-time performance by reducing template instantiation depth for sequence generation and merging operations: Optimizations: - sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using __make_integer_seq to generate indices in a single step, then applying the functor via pack expansion - uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq with a helper that applies a constant value via pack expansion - sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction strategy. Added direct concatenation specializations for 1-4 sequences to avoid recursion in common cases, falling back to binary tree merging for 5+ sequences Documentation: - Added extensive inline comments explaining why sequence_merge cannot achieve O(1) depth like sequence_gen (requires computing cumulative sequence lengths from heterogeneous inputs, inherently requiring recursion) - Documented the binary tree reduction approach and why it's superior to fold expressions for this use case Testing: - Added comprehensive unit tests for uniform_sequence_gen with different values, sizes, and edge cases - Added tests for sequence_gen with custom functors (double, square, identity, constant) to verify the new implementation works with arbitrary functors - Added tests for sequence_merge with 4, 5, and many sequences to verify both the direct concatenation path and binary tree reduction path - Added tests for empty sequence edge cases [ROCm/composable_kernel commit: `de59c0716c`]	2026-01-26 10:08:55 -08:00
Emily Martins	d0b16dc545	[CK_TILE] Fix alignment in Stream-K workspace buffer (#3625 ) * Fix alignment issue in Stream-K workspace buffer In CK Tile Stream-K, the workspace buffer is used to hold flags and partials, where the first i bytes holds the flags and the remaining bytes hold partials. This change adds padding to the flags prefix of the workspace buffer to ensure the number of bytes is 128B-aligned. Without this alignment, since workgroups do not skip cache when reading from partials, they may read stale partials data in cache, leading to incorrect results. The added padding avoids the stale data reading. This change also re-enables the test_ck_tile_streamk_reduction tests. * Compute reference GEMM on GPU for test verification to decrease testing time [ROCm/composable_kernel commit: `f5c2f09036`]	2026-01-23 16:14:22 -07:00

1 2 3 4 5 ...

631 Commits