composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-28 02:37:01 +00:00

Author	SHA1	Message	Date
Enrico Degregori	cd782f613c	Fix buffer load instruction 7.2	2026-06-18 10:49:01 +00:00
Enrico Degregori	252c288def	Fix scheduling instructions	2026-06-18 08:50:56 +00:00
Enrico Degregori	2b68eb63f3	Fix mfma instruction	2026-06-18 08:08:11 +00:00
Anton Gorenko	aa5ed1a749	Add operator() overload to GemmPipelineAgBgCrCompAsyncEightWaves to use in QuantGemmKernel	2026-06-17 15:46:44 +05:00
Anton Gorenko	8428732dc2	Support multiple D in quant gemm kernel A new kernel is added QuantGemmMultiDKernel, the existing QuantGemmKernel behaves as usual.	2026-06-16 11:42:27 +05:00
Anton Gorenko	335f80033b	Improve precision of CShuffle with scales or multi D Two new template parameters are introduced: * CShuffleDataType allows applying multiple Ds before downcasting to ODataType (prevents unexpected precision loss and/or overflow); * CComputeDataType allows using scales with int32 AccDataType (int8 gemms).	2026-06-16 11:41:20 +05:00
Brock Hargreaves	1b649a8d4b	[rocm-libraries] ROCm/rocm-libraries#8332 (commit 48c389c) [CK][CI] Retry builds on node failure with automatic rerouting (#8332) ## Motivation When a Jenkins node enters a bad state (missing GPU driver, dead Docker daemon, full disk), every PR scheduled onto it fails the same way until a human manually takes it offline. Some failures are also transient and would pass on a simple retry. Today the pipeline does neither: every failure goes straight to red on the same node. ## Technical Details Two new retry behaviors based on failure type: - Different node for persistent node faults (driver missing, daemon down, disk full, container won't start) - Retry in place for transient glitches (registry pull, DNS), then a different node if retries are exhausted Real build/compile failures and aborted builds are never retried. New: `src/org/ck/NodeFault.groovy`, `TransientFault.groovy` (typed exceptions in the shared library `src/` for stable classloader identity under dynamic library loading). `vars/ck.groovy`: adds `preflight()` (host health checks before build), `pullImage()` (classifying pull failures at the call site, replacing `getDockerImage()`), `runOnHealthyNode()` (outer reroute loop, up to 3 nodes), `runInPlace()` (same-node transient retries). GitHub failure status is only set once all retries are exhausted. `Jenkinsfile`: all active `Build CK and run Tests` stages converted to `agent none` + `ck.runOnHealthyNode(…)`. ## Test Plan Tested on `users/brockhargreaves-amd/ck/node-failure-retry-logic` with `USE_CURRENT_BRANCH_FOR_CK_GROOVY=true`. Verified preflight logging, reroute on node fault, attempt counter in logs, no retry on aborts, and single failure status report after budget exhausted. ## Test Result Retry logic working as expected. 
Three bugs found and fixed during testing: false `NodeFault` from host-level sccache probe (sccache is in-container), `null` node name in catch logging, and `sh` calls outside `node()` context in status reporting. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 17:40:10 +00:00
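The retry policy described in commit 1b649a8d4b can be sketched as follows. This is a minimal illustration in Python rather than the Groovy shared library the commit adds; the function names mirror `runInPlace()` and `runOnHealthyNode()`, but the signatures, the `max_retries=2` budget, and the `step(node)` callback are assumptions made for the example, not the pipeline's actual API. Only the limit of 3 nodes comes from the commit message.

```python
class NodeFault(Exception):
    """Persistent host problem (e.g. missing driver): retry on another node."""


class TransientFault(Exception):
    """Short-lived glitch (e.g. registry pull, DNS): retry on the same node."""


def run_in_place(step, node, max_retries=2):
    """Run step(node), retrying transient faults on the same node.

    max_retries is a hypothetical budget. Once it is exhausted the fault is
    escalated to a NodeFault so the caller reroutes to a different node.
    """
    last = None
    for _ in range(max_retries + 1):
        try:
            return step(node)
        except TransientFault as exc:
            last = exc
    raise NodeFault(f"transient retries exhausted on {node}: {last}")


def run_on_healthy_node(step, nodes, max_nodes=3):
    """Outer reroute loop: try at most max_nodes nodes.

    Any exception other than NodeFault/TransientFault (a real build failure)
    propagates immediately and is never retried.
    """
    faults = []
    for node in nodes[:max_nodes]:
        try:
            return run_in_place(step, node)
        except NodeFault as exc:
            faults.append((node, str(exc)))
    # Failure is reported once, after the whole budget is spent.
    raise RuntimeError(f"all nodes failed: {faults}")
```

Using typed exceptions keeps the classification at the point where the failure is observed, so the retry loops only need to distinguish exception types.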
Andriy Roshchenko	b8440b3aeb	[rocm-libraries] ROCm/rocm-libraries#8325 (commit 559eaf6) [GFX1250][MX GEMM] Unified FLATMM GroupedGemm Implementation for MX Data Types (#8325) ## Motivation Design and test a unified FLATMM GroupedGemm interface so that it supports all MX FP8, FP6, and FP4 data types on both the gfx950 and gfx1250 architectures and works seamlessly across these platforms. ## Technical Details Implementation exposes Grouped Gemm interface for MX FLATMM and MX TDM FLATMM pipelines. ## Test Plan Add the following tests: - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_non_tdm.cpp - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_tdm.cpp - ck_tile/flatmm/test_mx_flatmm_persistent.cpp Verify on the gfx950 and gfx1250 architectures. ## Test Result All tests pass. Verified on A0 hardware with rocm-7.14.0a20260517 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 16:12:33 +00:00
Sami Remes	c1f7104852	[rocm-libraries] ROCm/rocm-libraries#6663 (commit f19fc01) [CKTile] Fix MX GEMM: num_loop==3 dispatch, split-K, unsupported-shape guard (#6663) Three independent MX GEMM correctness bugs reported against example/ck_tile/42_mx_gemm (fp8xfp8, A=Row/B=Col) on MI350X, plus one host-side atomic-add accumulation bug in the example's repeat loop. - Pipeline (gemm_pipeline_ag_bg_cr_comp_async.hpp): BlockHasHotloop required num_loop > PrefetchStages, which let num_loop == 3 enter a hot loop that produced 5 gemm accumulations instead of 3 (K == 3K_Tile, e.g. K=768, deterministically wrong). Require num_loop >= 4 instead: pre-pipeline + TailNumber::Three already totals exactly 3. - Kernel (gemm_mx_kernel.hpp): split-K was silently broken because GridSize did not thread k_batch into blockIdx.z and the scale tile windows were anchored at K=0 for every k_id. Every k_id >= 1 therefore read the wrong packed scales. Fix: GridSize returns dim3(grid_x, 1, k_batch) (persistent and non-persistent). * MakeScaleA/BBlockWindows accept a k_elem_offset and translate it to a packed-scale K offset (also apply pad_tensor_view so OOB scale loads return zero, matching A/B padding). * operator() derives k_id from blockIdx.z, uses GetSplitKElemOffset (matches Underlying::SplitKBatchOffset's K1-aligned formula), and dispatches the epilogue with memory_operation_enum::atomic_add for k_batch > 1, set for k_batch == 1. Same fp16/bf16 even-vector-size guard as UniversalGemmKernel. * MakeCBlockWindows templated on DstInMemOp; unconditionally applies pad_tensor_view using kPadM/kPadN so partial trailing M/N tiles are handled correctly. - Compile- and runtime unsupported-shape guards (gemm_mx_kernel.hpp): add IsSupportedArgument and a static_assert for configurations that produce silent wrong results: * static_assert(!kPadK) -- the MX comp-async pipeline uses async_load_tile whose OOB check is per-vector-start, so a vector straddling the K pad boundary reads garbage. 
Until the async path learns per-element pad masking, reject kPadK at compile time. * Runtime: k_batch >= 1; M/N multiples of MPerBlock/NPerBlock when kPadM/kPadN are false; M >= MPerBlock and N >= NPerBlock always (CShuffleEpilogue cannot safely run with a single partial tile); K % (KPerBlock * k_batch) == 0; and for k_batch > 1, K must be a multiple of WarpTile_K * k_batch so every split lands on a packed-scale boundary. * All error paths log under CK_TILE_LOGGING with actionable messages. - Example (example/ck_tile/42_mx_gemm/mx_gemm_instance.hpp): * Call Kernel::IsSupportedArgument up front and throw a clear runtime_error for rejected shapes (was silently launching an unsupported kernel). * Switch to launch_kernel_time_mask with a clear_gemm_output preprocess that zeroes C between iterations when k_batch > 1 (mirrors universal_gemm_invoker). Without this the default -warmup=50 -repeat=100 accumulated 150 atomic_adds into C after the kernel-side split-K fix. Tests (test/ck_tile/gemm_mx/): - Add MXfp8_GemmConfig16_PadMN (kPadM = kPadN = true). - test_mx_gemm_fp8.cpp: HotLoopTailNumLoopThree (K=768 regression), SplitK (k_batch=2,4 across full_k/partial_k paths), TestMxGemmFp8PadMN::{MNPaddingAligned, MPadding, NPadding, MNPadding} covering trailing partial tiles along M, N, or both. - Run(...) now takes k_batch. - packScalesMNxK: guard against OOB (mn, k) reads from src and initialise e8m0 bytes to the zero exponent (0x00) instead of the default-constructed NaN (0xFF), so padded lanes don't poison the packed int32_t shared with in-range lanes. - test_mx_gemm_instance.hpp: call IsSupportedArgument before launch. Verification on gfx950, ROCm 7.2.0: - ctest -R test_ck_tile_mx_gemm -> 100% (2/2). - Example sweep over the original bug-report shapes: all K-aligned shapes now validate correct (including 4096^3 sk=2 and the K=768 cases); all K=128 shapes cleanly rejected with the new error message instead of producing silent wrong results. 
Made-with: Cursor ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 08:28:55 +00:00
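The runtime shape guards and the corrected hot-loop condition listed in commit c1f7104852 can be written out as a small host-side check. The sketch below is Python, not the C++ `IsSupportedArgument` from `gemm_mx_kernel.hpp`; the `TileConfig` type and its field names are hypothetical, and only the conditions themselves are taken from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical container for the tile parameters named in the commit."""
    m_per_block: int
    n_per_block: int
    k_per_block: int
    warp_tile_k: int
    pad_m: bool = False
    pad_n: bool = False


def is_supported_argument(m, n, k, k_batch, cfg):
    """Mirror of the runtime guards described in the commit message."""
    if k_batch < 1:
        return False
    # The epilogue cannot run with a single partial tile, even with padding.
    if m < cfg.m_per_block or n < cfg.n_per_block:
        return False
    # Without padding, M and N must be whole multiples of the block tile.
    if not cfg.pad_m and m % cfg.m_per_block != 0:
        return False
    if not cfg.pad_n and n % cfg.n_per_block != 0:
        return False
    # K padding is rejected, so K must split evenly across all k_batch slices.
    if k % (cfg.k_per_block * k_batch) != 0:
        return False
    # Each split must land on a packed-scale boundary.
    if k_batch > 1 and k % (cfg.warp_tile_k * k_batch) != 0:
        return False
    return True


def block_has_hot_loop(num_loop):
    """Fixed condition: num_loop == 3 is handled by pre-pipeline + tail only."""
    return num_loop >= 4
```

With an assumed `KPerBlock` of 256, `K=768` gives `num_loop == 3`, which now skips the hot loop, while `K=128` fails the divisibility check and is rejected before launch.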
damien-lejeune	aab1d219f5	[rocm-libraries] ROCm/rocm-libraries#8350 (commit f92ded1) Add tile shape for FMHA batch prefill on MI308X (on fp8, hdim=256) (#8350) ## Motivation Add a tile size appropriate for FMHA batch prefill fp8/hdim256 on MI308X ## Technical Details Appending the tile shape to the existing factory such that it can be picked up by Aiter ## Test Plan Ran the performance test on both MI300X and MI308X ## Test Result MI300X performance seems unaffected by this change. MI308X does improve. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2026-06-15 07:00:35 +00:00
SamiAario-AMD	947dcc2606	[rocm-libraries] ROCm/rocm-libraries#5510 (commit 8415c8c) [CK Tile] Add transposed tile load implementation, and tests for load_and_convert_tile (#5510) ## Motivation Mixed precision b/fp16 x fp8 requires a transposed tile load implementation that supports mixed precision using these types. Implement this, use it in `load_and_convert_tile`, and add a unit test for `load_and_convert_tile` which covers this functionality. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 06:42:28 +00:00
ltqin	0954a8f3fa	[rocm-libraries] ROCm/rocm-libraries#8262 (commit d4ff8fc) [CK_TILE] Add graph capture support for FMHA backward(new branch) (#8262) ## Motivation Add HIP graph capture support for FMHA backward operations. The original implementation only supported normal execution mode and would cause use-after-free crashes when used with graph capture replay. When FMHA backward is captured into a HIP graph: - First replay: host callback executes and deletes the closure (as designed for normal mode) - Subsequent replays: use-after-free crash because the closure was already freed This PR enables `fmha_bwd_launcher::prepare_workspace_async()` to work correctly in both normal execution and graph capture modes.	2026-06-14 03:11:53 +00:00
Johannes Graner	01cca38c8e	[rocm-libraries] ROCm/rocm-libraries#8220 (commit 4c04a3a) [CK Tile] WAVELET pipeline for backward-data grouped convolution (#8220) ## Motivation On the RetinaNet shapes (gfx950, fp16) CK Tile backward-data conv was ~18% behind classic CK, with the gap concentrated in the K=2376 3x3 detection-head family where bwd_data spends most of its time. The WAVELET GEMM pipeline already gives uplift for forward and backward-weight conv; this ports it to backward-data and consolidates the now-shared machinery across all three directions. ## Technical Details - Backward-data wavelet support in the tile kernel: launch extra load waves when the pipeline exposes `LaunchBlockSize`, and split the epilogue into math waves (run the CShuffle epilogue) and load waves (`RunBarrierStub`). - Register 7 WAVELET instances (fp16 and bf16), tuned for backward-data's tall-skinny GEMM rather than the forward tile shapes: a big-M `256/128/64` workhorse, a `VecA=4` variant for the `K % 8 != 0` shapes, and a `NumGroupsToMerge=32` variant for grouped (depthwise-style) shapes. - Implement the native backward-data instance parser in `generate_instances.py`. - Deduplicate the wavelet machinery shared by forward, backward-data, and backward-weight: `GroupedConvLaunchBlockSize`, `is_wavelet_pipeline`, and `RunWaveletAwareEpilogue` in `grouped_convolution_utils.hpp`; the three native instance parsers collapse to one parameterized parser. The three kernels now call the shared helpers. ## Test Plan - Rebuild the full profiler instance pools for all three directions (fp16/bf16/fp32, nhwgc/ndhwgc) to exercise the shared helpers across every instantiation. - Tile GTests on gfx950: `test_grouped_convnd_fwd_tile`, `test_grouped_convnd_bwd_data_tile`, `test_grouped_convnd_bwd_weight_tile`. 
- Per-shape sweep of the 35 RetinaNet backward-data shapes vs classic CK and the non-wavelet tile pool (`profile_wavelet_bwd_data.py`); correctness spot-checked with GPU-reference verification on the new big-M and NumGroupsToMerge instances. ## Test Result - GTests pass: forward 9/9, backward-data 6/6, backward-weight 6/6. - Backward-data perf (3x3 g=1 region, geomean classic/tile): 0.88 -> 1.11, i.e. the tile path goes from ~12% slower than classic to ~8% faster. The largest single backward-data shape (256x100x100->2376) moves from 11% slower than classic to 12.5% faster. - The dedup refactor preserves behavior (net -174 lines across the kernels/generator), confirmed by the full rebuild and the GTests above. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-13 00:10:50 +00:00
John Afaganis	329e589840	[rocm-libraries] ROCm/rocm-libraries#8260 (commit 1139236) [ck] Enforce LF-only line endings in C/C++ sources MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary Several CK source files carry Windows CRLF line endings (a trailing carriage return on each line), introduced by editors configured for Windows endings or copy/paste from Windows tooling. These are purely cosmetic but they pollute diffs (whole-file churn the first time someone makes an LF edit), confuse `clang-format`, and are inconsistent with the LF-only convention used across the rest of the tree. This PR (a) normalizes every existing CRLF file (6 files) to LF and (b) adds a pre-checkin gate so new CRLF leaks are rejected before merge. ## File extensions covered Both the cleanup scan and the new Jenkins enforcement stage use the same predicate as the adjacent `ASCII Only Check` stage: ``` .h .hpp .cpp .h.in .hpp.in .cpp.in .inc .cl ``` (excluding `/build/` and `/include/rapidjson/`). The local pre-commit hook's `c++/inc` type filter covers the same set. ## Why no enforcement today CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so the existing `pre-commit` workflow doesn't touch CK. The local CK `.pre-commit-config.yaml` only runs for developers who installed hooks. The authoritative gate is therefore the new Jenkins stage in this PR; the local hook is convenience. ## Commit layout (bisect-friendly) 1. `[ck] Normalize CRLF line endings to LF in C/C++ sources` Mechanical line-ending cleanup across 6 files. No content change: every edit is purely CRLF -> LF, verified with `git diff --ignore-cr-at-eol` reporting an empty diff. 2. `[ck] Enforce LF-only line endings in C/C++ sources` - New `projects/composablekernel/script/check_no_crlf.sh` (modeled on `check_ascii_only.sh`). - New `crlf-checker` entry in `projects/composablekernel/.pre-commit-config.yaml` under the local-hooks block (`types_or: [c++, inc]`). 
- New `CRLF Check` parallel stage in `projects/composablekernel/Jenkinsfile`'s `Static checks` block, mirroring the adjacent `ASCII Only Check` stage. Always-on, no `RUN_CPPCHECK` gate. The tree is buildable at every commit boundary. Commit 1 leaves 0 CRLF violations; commit 2 wires the gate. ## Demo Script output on a synthesized violation: ``` $ printf 'int main() {}\r\n' > /tmp/bad.cpp $ projects/composablekernel/script/check_no_crlf.sh /tmp/bad.cpp ERROR: /tmp/bad.cpp contains CRLF (Windows) line endings: 1:int main() {}<CR> Fix: convert to LF, e.g. 'sed -i 's/\r$//' /tmp/bad.cpp' or 'dos2unix /tmp/bad.cpp' $ echo $? 1 ``` Full repo scan after the cleanup commit: ``` $ cd projects/composablekernel && find . -type f $ -name '.h' -o -name '.hpp' -o -name '.cpp' \ -o -name '.h.in' -o -name '.hpp.in' -o -name '.cpp.in' -o -name '.inc' -o -name '.cl' $ \ -not -path '/build/' -not -path '/include/rapidjson/' -print0 \ \| xargs -0 -P 8 -n 64 script/check_no_crlf.sh $ echo $? 0 ``` ## Test plan - [ ] Jenkins PR build: confirm new `Static checks -> CRLF Check` stage runs green over the full predicate and the existing `ASCII Only Check` / `Clang Format` stages are unaffected. - [ ] Local: `pre-commit run crlf-checker --all-files` runs cleanly after installing CK pre-commit hooks. - [ ] Manually inject a CRLF line ending in any `.cpp/.hpp/.inc` file, push: confirm Jenkins fails the new stage with a clear error. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-12 21:11:59 +00:00
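The check performed by the CRLF gate in commit 329e589840 amounts to finding lines that end in a carriage return. The sketch below shows that idea in Python; it is not the logic of `check_no_crlf.sh`, whose contents are not shown here, and the function names are hypothetical.

```python
import sys


def find_crlf_lines(data: bytes):
    """Return the 1-based numbers of lines terminated by CRLF."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if line.endswith(b"\r")
    ]


def check_files(paths):
    """Return 1 if any file contains a CRLF line ending, else 0."""
    status = 0
    for path in paths:
        with open(path, "rb") as handle:  # binary mode keeps "\r" intact
            bad = find_crlf_lines(handle.read())
        if bad:
            print(f"ERROR: {path} contains CRLF line endings on lines {bad}")
            status = 1
    return status


if __name__ == "__main__":
    sys.exit(check_files(sys.argv[1:]))
```

Reading in binary mode matters: text mode with universal newlines would silently translate `\r\n` to `\n` and hide the violation.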
Brock Hargreaves	96a7e44832	[rocm-libraries] ROCm/rocm-libraries#8378 (commit d68585d) [CK] Pre-emptively add groovy/ folder and skip TheRock CI filter (#8378) ## Motivation The CK Groovy library is growing and will be reorganized into a self-describing `groovy/` folder rather than living under `src/` and `vars/`. This PR creates that folder pre-emptively and adds it to the TheRock CI skip-list so that future Groovy additions do not unnecessarily trigger TheRock builds. ## Technical Details - Added `projects/composablekernel/groovy/` with a `.gitkeep` to establish the directory in the repo. - Added `"projects/composablekernel/groovy/"` to `SKIPPABLE_PATH_PATTERNS` in `.github/scripts/therock_configure_ci.py` alongside the existing `vars/` entry, ensuring changes confined to Groovy pipeline code are recognized as non-therock-relevant and skip the TheRock CI pipeline. ## Test Plan No code logic was changed. Verified that `therock_configure_ci.py` pattern list is consistent with the existing `vars/*` skip entry and that the new pattern follows the same glob convention. ## Test Result N/A: directory scaffolding and CI filter only; no functional code affected. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 20:11:53 +00:00
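The skip-list behavior described in commit 96a7e44832 can be illustrated with a glob match over changed paths. This is a sketch only: `SKIPPABLE_PATH_PATTERNS` is the name used in the commit message, but the exact pattern strings (including the trailing `*`), and the helper functions `is_skippable` and `needs_therock_ci`, are assumptions and not the contents of `therock_configure_ci.py`.

```python
from fnmatch import fnmatchcase

# Assumed pattern strings; the real list lives in therock_configure_ci.py.
SKIPPABLE_PATH_PATTERNS = [
    "projects/composablekernel/vars/*",
    "projects/composablekernel/groovy/*",
]


def is_skippable(path, patterns=SKIPPABLE_PATH_PATTERNS):
    """True if the changed path matches any skip pattern."""
    # fnmatchcase is case-sensitive on every platform, and "*" also
    # matches "/" so nested files under the folder are covered.
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def needs_therock_ci(changed_paths, patterns=SKIPPABLE_PATH_PATTERNS):
    """CI is needed as soon as one changed path is not skippable."""
    return any(not is_skippable(path, patterns) for path in changed_paths)
```

A change set confined to `projects/composablekernel/groovy/` would therefore skip the pipeline, while a single touched header elsewhere would trigger it.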
Illia Silin	d450749933	[rocm-libraries] ROCm/rocm-libraries#8357 (commit 800965c) [CK] Re-enable HIPRTC codegen tests for all CK PRs. ## Motivation At the time when we introduced the smart test filter to only build and run tests affected by the PR changes, we disabled the client examples, which required full CK build, and also the hiprtc tests that were grouped with the client examples. This caused a few PRs to sneak through that caused the hiprtc compilation to fail. By restoring the hiprtc tests in all PRs, we should close this gap. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 19:19:44 +00:00
Illia Silin	789ef38093	[rocm-libraries] ROCm/rocm-libraries#8333 (commit 69b3fc1) Revert "[CK_TILE] Implement RTC API for a subset of FMHA functionality for MGX" (#8333) Reverts ROCm/rocm-libraries#6086 Need to revert as the codegen test for fmha is failing due to including std header: 2026-06-11T22:36:03.673Z] In file included from /tmp/comgr-953928-0-473822/include/ck/host/device_fmha_fwd/fmha_fwd_wrapper.hpp:8: [2026-06-11T22:36:03.673Z] In file included from /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/cmath:49: [2026-06-11T22:36:03.673Z] In file included from /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/std_abs.h:38: [2026-06-11T22:36:03.673Z] /usr/include/stdlib.h:32:10: fatal error: 'stddef.h' file not found [2026-06-11T22:36:03.673Z] 32 \| #include <stddef.h> [2026-06-11T22:36:03.673Z] \| ^~~~~~~~~~ The ck_tile headers were never prepped for hiprtc compilation.	2026-06-12 18:19:31 +00:00
Wojciech Laskowski	c2601f38b7	[rocm-libraries] ROCm/rocm-libraries#6569 (commit 393049e) Adding amdgcn_mma specializations for sparse MFMA builtins (#6569) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the fourth of the series of PRs (after https://github.com/ROCm/rocm-libraries/pull/5801, https://github.com/ROCm/rocm-libraries/pull/6014 and https://github.com/ROCm/rocm-libraries/pull/6567) that add all the necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse MFMA intrinsics. ## Technical Details This change adds new specializations for MFMA sparse builtins. In total, we add 27 MFMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 12:48:29 +00:00
Enrico Degregori	e75076c826	[rocm-libraries] ROCm/rocm-libraries#8310 (commit 003bc6b) [CK Tile] Fix assert usage MX GEMM ## Motivation See issue https://github.com/ROCm/rocm-libraries/issues/8223 ## Technical Details - Use `std::runtime_error` in `mx_processing.hpp` - Use `static_assert` in `tensor_shuffle_utils.hpp` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 11:42:38 +00:00
Thrupti Raj Lakshmana Gowda	d7609923b6	[rocm-libraries] ROCm/rocm-libraries#7919 (commit 061001d) Users/tlakshma/ck/tile engine develop ## Motivation This PR adds multiple new GPU kernel benchmarking operations to the CK Tile Engine, expanding its coverage of GEMM-family operations: - gemm_multi_abd: GEMM with multiple A, B, and D tensors, enabling epilogue patterns such as scale/bias fusion. - batched_contraction: Batched tensor contraction supporting multi-dimensional batch (G), M, N, and K dimensions, targeting workloads where the contraction indices span more than one logical axis. - mx_gemm: MX-format GEMM with microscaling (e8m0) scale tensors. - gemm_rowcolquant: Block-scale GEMM with row/column quantization. - gemm_tensor_quant: Block-scale GEMM with tensor quantization. - grouped_gemm_rowcolquant: Grouped GEMM with row/column quantization. - grouped_gemm_tensorquant: Grouped GEMM with tensor quantization. - batched_gemm: Batched GEMM benchmarking support. ## Technical Details ### gemm_multi_abd - New subdirectory: tile_engine/ops/gemm/gemm_multi_abd/ - CMakeLists.txt follows the same individual-target pattern as gemm_universal / gemm_multi_d. - gemm_multi_abd_instance_builder.py subclasses GemmKernelBuilder from the shared gemm_instance_builder.py. - gemm_multi_abd_benchmark.py delegates to the shared GemmBenchmark parent class. - Configs: default_config.json, default_ci_config.json, user_provided_config.json. - Supported GPU targets: gfx90a, gfx942, gfx950, gfx1201. ### batched_contraction - New subdirectory: tile_engine/ops/gemm/batched_contraction/ - Extends GemmKernelBuilder via BatchedContractionKernelBuilder, adding num_dim_g, num_dim_m, num_dim_n, num_dim_k, num_d_tensors, and elementwise_function parameters. - Layout string uses 3-character encoding (A+B+E), e.g. rcr. - Self-contained benchmark sweep driver (batched_contraction_benchmark.py) with JSON/CSV export and best-kernel selection. - Supported GPU targets: gfx90a, gfx942, gfx950. 
### mx_gemm - New subdirectory: tile_engine/ops/gemm/mx_gemm/ - Supports MX-format (e8m0) microscaling for A and B scale tensors. ### block_scale_gemm (gemm_rowcolquant, gemm_tensor_quant) - New subdirectory: tile_engine/ops/gemm/block_scale_gemm/ - gemm_rowcolquant: row/column quantization epilogue. - gemm_tensor_quant: tensor-level quantization epilogue. ### grouped_gemm_quant (grouped_gemm_rowcolquant, grouped_gemm_tensorquant) - New subdirectory: tile_engine/ops/gemm/grouped_gemm_quant/ - grouped_gemm_rowcolquant: grouped GEMM with row/column quantization. - grouped_gemm_tensorquant: grouped GEMM with tensor quantization. ### batched_gemm - New subdirectory: tile_engine/ops/gemm/batched_gemm/ - Batched GEMM benchmark support wired into the sampling/active-op lists. All new ops are registered in op_weights.json for budget allocation and wired into the active-op sampling lists in CMakeLists.txt. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-11 20:38:38 +00:00
jefyang1	276863ca87	[rocm-libraries] ROCm/rocm-libraries#8259 (commit df03f10) Add cluster launch in test ck_tile mx gemm tdm wmma ## Motivation Add a cluster launch test to test_ck_tile_mx_gemm_pipeline_tdm_wmma on gfx1250, so that we can check the performance on gfx1250 hardware. ## Technical Details Added an out-of-bounds guard in RunGemm of MxGemmKernel to skip blocks padded by cluster alignment. Added ClusterEnable/ClusterDisable aliases and extended the tuple in test_mx_gemm_pipeline_kernel_types.hpp with two kernel types that use ClusterEnable, for F8 CompTDMV1 and CompTDMV2 respectively. The existing F4 non-ClusterLaunch kernel types have an issue that still needs to be fixed, so this PR does not include F4 cases. Read ClusterLaunch from the tuple in test_mx_gemm_pipeline_util.hpp. Updated invoke_mx_gemm to branch on ClusterLaunch: add cluster size constants and switch the GemmShape type, the TilePartitioner type, and the kernel launch call. ## Test Plan Tested the changes on gfx1250 FFM. ## Test Result The added kernel types (instances) passed the tests on gfx1250 FFM. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-11 17:33:11 +00:00
music-dino	359f664b25	[rocm-libraries] ROCm/rocm-libraries#6086 (commit d25d8cc) [CK_TILE] Implement RTC API for a subset of FMHA functionality for MGX (#6086) ## Motivation Introduce a wrapper for the FmhaFwdKernel, for use in real time compilation in MIGraphX. ## Technical Details The intent of the API is to provide multiple instances of the FmhaFwdKernelWrapper, suitable for a particular problem definition. At the moment the wrapper only supports bias and causal masking, feature expansion will come in a future pr. The usage pattern is, in short: 1. Define fmha_fwd::Problem (input dimensions, data type, etc) 2. Fetch Solutions for target architecture (currently only gfx942) based on Problem. The solutions contain a map of template -> template parameter and can be converted to a string representing the full instantiation of FmhFwdKernelWrapper e.g. `ck_tile::FmhaFwdWrapper<ck_tile::fp16_t, 128, 64, 16, 32, 32, 32, 4, 1, 1, 4, 1, 1, 32, 32, 16, 32, 32, 16, false, true, false, true, true, true, true, ck_tile::FmhaPipelineTag::QR>` 3. The instance can then be used in an RTC kernel. The kernel needs to: * Construct a Descriptor (containing descriptions of all input tensors) * Call IsValid() on the descriptor to check if the instance is applicable. Note that this is constexpr by design so that it can fail the kernel compilation as a signal that the kernel is not applicable. * Pass the descriptor and input pointers to the wrapper Run method. A more detailed example of usage can be found in codegen/test/fmh_fwd.cpp Beside work on creating the wrapper and the supporting API, the PR also contains some changes necessary to enable compilation with HIPRTC. The contents of the CK tile headers are embedded in a binary file which is used to pass the header files as strings to HIPRTC. Many of the ck tile headers contain host only code which leads to compilation failures. 
ck_tile_headers_preprocessor goes through the embedded headers and removes the bodies of host only functions, thereby eliminating the compilation failures. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-11 16:22:37 +00:00
Bartłomiej Kocot	0fdbf8a91d	[rocm-libraries] ROCm/rocm-libraries#8272 (commit 1c66ecb) [CK] Padding on K for global load for grouped conv bwd data (#8272) ## Motivation Fix incorrect results caused by missing padding during global load in the grouped convolution backward data kernel. The padding is needed because global load performs no OOB check. ## Technical Details Add the padding needed for global load, which does not use an OOB check. ## Test Plan test_grouped_convnd_bwd_data* ## Test Result Passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-11 15:28:21 +00:00
BrianHarrisonAMD	f0545b5c15	[rocm-libraries] ROCm/rocm-libraries#8132 (commit 57d21a1) [CK dispatcher] - LGBM predict data_type FLOAT32->FLOAT64 in ml_heuristic (#8132) ## Summary `ml_heuristic.hpp` calls `LGBM_BoosterPredictForMat(..., /*data_type=*/0, ...)` (`C_API_DTYPE_FLOAT32`) against a `std::array<double, NUM_FEATURES>` feature buffer. LightGBM reinterprets the 8-byte doubles as 4-byte floats → invalid predictions → the heuristic's argmax always tie-breaks to the first/smallest enumerated config. Fix: `data_type 0 → 1` (`C_API_DTYPE_FLOAT64`), matching the `double` buffer. After the fix, predictions vary and track real TFLOPS (the model correctly prefers larger tiles). ## Verification - The feature buffer `f` is `std::array<double, NUM_FEATURES>` (NUM_FEATURES = 72) → `f.data()` is a `double*`. - The changed `0` is the 3rd positional `data_type` argument (not `nrow`/`ncol`/`is_row_major`). One-line functional change. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 18:57:31 +00:00
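The failure mode fixed in commit f0545b5c15 is a plain byte reinterpretation, which can be reproduced without LightGBM. The sketch below packs a buffer of doubles and reads it back under each element type; the helper names are hypothetical, little-endian layout is assumed, and no LightGBM call is involved.

```python
import struct


def pack_features(values):
    """Lay the features out as little-endian 8-byte doubles."""
    return struct.pack(f"<{len(values)}d", *values)


def read_as_float32(buffer, ncol):
    """What a reader sees if it is told the buffer holds 4-byte floats."""
    return struct.unpack_from(f"<{ncol}f", buffer)


def read_as_float64(buffer, ncol):
    """What a reader sees with the matching 8-byte element type."""
    return struct.unpack_from(f"<{ncol}d", buffer)


features = pack_features([1.0, 2.0])
print(read_as_float32(features, 2))  # (0.0, 1.875): both halves of the double 1.0
print(read_as_float64(features, 2))  # (1.0, 2.0)
```

With the wrong type the reader consumes only the first `ncol * 4` bytes and sees the low and high halves of each double as separate values, so the model receives inputs unrelated to the real features.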
Illia Silin	a433424e08	[rocm-libraries] ROCm/rocm-libraries#8241 (commit cd183df) [CK] increase time limit for fmha_bwd tests to prevent timeouts (#8241) ## Motivation Observed a CI failure due to fmha_bwd test timeout which never happened before. Going to increase the time limit for the test to prevent any further CI failures. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-10 15:37:44 +00:00
Ville Pietilä	c6c55db757	[rocm-libraries] ROCm/rocm-libraries#8019 (commit 6472935) [CK TILE] Fix performance regression caused by Dispatcher codegen compiler flag. (#8019) ## Motivation Currently CK Tile has two codegen paths: CK Builder and CK Tile Dispatcher. The CK Tile Dispatcher codegen uses an additional compiler flag that is not present in the CK Builder codegen workflow. The additional compiler flag can cause performance regressions for some instances because it disables relevant compiler optimizations. ## Technical Details Removed the compiler flag `-mllvm -enable-noalias-to-md-conversion=0` from the CMakeLists.txt that creates the instance library from Dispatcher codegen. ## Test Plan Required testing is contained in the CI/CD pipeline. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-10 09:12:03 +00:00
Chao	320a813d67	[rocm-libraries] ROCm/rocm-libraries#6533 (commit 5dcaa45) [CK_TILE] Add host-side Pack-GQA optimization for FMHA forward (#6533) ## Motivation Host-side Pack-GQA optimization for CK-Tile FMHA forward. Reshapes Q tensor from `[b, nhead_q, seqlen_q, d]` to `[b, nhead_kv, nhead_ratio * seqlen_q, d]` by adjusting strides, so grouped Q-heads sharing the same KV data are processed in a single tile. Zero kernel changes — runner-only. Phase 1: non-causal attention with GQA ratio packing. Phase 2: extends to dropout and split-kv paths, fixes stride edge cases. ## Technical Details Modified files (2): - `example/ck_tile/01_fmha/example_fmha_fwd.cpp` — Pack-GQA flag plumbing - `example/ck_tile/01_fmha/fmha_fwd_runner.hpp` — Q tensor reshape logic, stride adjustment for GQA ratio packing New files (1): - `example/ck_tile/01_fmha/test_pack_gqa_phase2.sh` — 53 test cases covering non-causal, dropout, split-kv, various GQA ratios ## Dependencies None — this PR is standalone. 
## Test Plan - GPU validation on MI300X (gfx942, ROCm 6.4.1): - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - GPU validation on MI350X (gfx950, ROCm 7.0), 53 parameterized test cases: - Command (GQA 4:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - Command (GQA 8:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=64 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - Command (decode): `./build/bin/tile_example_fmha_fwd -b=64 -h=32 -h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` ## Test Result Benchmark results (MI350X, gfx950, ROCm 7.0): \| Config \| Without Pack \| With Pack \| Improvement \| \|--------\|-------------\|-----------\|-------------\| \| GQA 4:1 prefill b=2 h=32 hk=8 s=2048 d=128 bf16 \| 690.05 TFlops (0.199 ms) \| 695.61 TFlops (0.198 ms) \| +0.8% \| \| GQA 8:1 prefill b=2 h=64 hk=8 s=2048 d=128 bf16 \| 706.25 TFlops (0.389 ms) \| 729.35 TFlops (0.377 ms) \| +3.3% \| \| GQA 8:1 decode b=64 h=32 hk=4 s_k=4096 d=128 bf16 \| 305.20 GB/s (1.763 ms) \| 1813.41 GB/s (0.297 ms) \| +5.9x \| \| LLaMA-70B decode b=32 h=64 hk=8 s_k=4096 d=128 bf16 \| 591.70 GB/s (0.909 ms) \| 1820.65 GB/s (0.295 ms) \| +3.1x \| \| MHA ratio=1 b=2 h=8 s=4096 d=128 bf16 \| 695.16 TFlops \| 702.72 TFlops \| no regression \| Benchmark results (MI300X, gfx942, ROCm 6.4.1): No regression on MI300X. Pack-GQA is a runner-only optimization (zero kernel changes), performance impact is within noise on MI300X. 
\| Config \| TFlops / GB/s \| Time (ms) \| Delta vs baseline \| \|--------\|-------------\|-----------\|-------------------\| \| MHA bf16 b=2 h=8 s=4096 d=128 \| 336.52 TFlops \| 0.408 \| -1.7% \| \| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 \| 322.52 TFlops \| 0.426 \| -0.7% \| \| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 \| 349.85 TFlops \| 0.786 \| +0.5% \| \| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 \| 381.29 TFlops \| 1.442 \| +1.2% \| \| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 \| 697.32 GB/s \| 1.541 \| +0.8% \| All validation tests pass (`valid:y`) on both MI300X and MI350X. Additional validation: - 53 parameterized test cases pass (23 phase 1 + 30 phase 2) - GQA ratios tested: 1:1, 2:1, 4:1, 8:1, 32:1 - No regression on MHA (ratio=1) workloads - fp16 and bf16 validated	2026-06-10 01:56:44 +00:00
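The stride-only reshape described in the Pack-GQA commit above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the runner's code: it assumes Q is stored contiguously as `[b, nhead_q, seqlen_q, d]` and that consecutive Q heads share a KV head (`h_kv = h_q // nhead_ratio`). The names `packed_q_view` and `offset` are hypothetical.

```python
def packed_q_view(b, nhead_q, nhead_kv, seqlen_q, d):
    """Return (shape, strides) of the packed view over a contiguous Q tensor."""
    if nhead_q % nhead_kv != 0:
        raise ValueError("nhead_q must be a multiple of nhead_kv")
    ratio = nhead_q // nhead_kv
    shape = (b, nhead_kv, ratio * seqlen_q, d)
    # Strides are in elements. Only the head stride changes: one KV head now
    # spans `ratio` consecutive Q heads, so the data itself is never moved.
    strides = (nhead_q * seqlen_q * d, ratio * seqlen_q * d, d, 1)
    return shape, strides


def offset(index, strides):
    """Linear element offset of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


# Small hypothetical problem: 8 Q heads sharing 2 KV heads (ratio 4).
B, HQ, HKV, S, D = 2, 8, 2, 3, 4
shape, strides = packed_q_view(B, HQ, HKV, S, D)
original_strides = (HQ * S * D, S * D, D, 1)
```

Under these assumptions, element `(b, h_q, s, d)` of the original tensor and element `(b, h_q // ratio, (h_q % ratio) * S + s, d)` of the packed view resolve to the same offset, which is why no kernel change is needed for this layout.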
Bartłomiej Kocot	928b46c3bd	[rocm-libraries] ROCm/rocm-libraries#8208 (commit 7240d71) [CK] Fix scale init in profile_grouped_conv_fwd_outelementop (#8208) ## Motivation Wrong scale initialization caused random errors on CI. ## Technical Details InvScale was initialized to 0, which caused NaNs during division. Zeros are now excluded from the random initialization. ## Test Plan TestGroupedConvndFwdConvInvscale3d ## Test Result Passed in 100 runs ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1400	2026-06-09 21:42:53 +00:00
Bartłomiej Kocot	cb099eb963	[rocm-libraries] ROCm/rocm-libraries#8155 (commit c25787b) [CK] Magic division for long_index_t ## Motivation Improve performance for long_index_t kernels ## Technical Details Support magic division for long_index_t ## Test Plan test_grouped_convnd* ## Test Result Passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1386	2026-06-09 20:57:20 +00:00
Aviral Goel	93e0d79103	[rocm-libraries] ROCm/rocm-libraries#8035 (commit 45186b8) [CK_Tile] Add wmma_bf16f32_16x16x32_bf16 warp-gemm test (#8035) ## Summary Adds the warp-gemm unit test for `wmma_bf16f32_16x16x32_bf16`. Stacked on #8028 (the API change) and based on its branch, so #8028 shows the isolated API diff and this PR shows just the test. ## Test gfx125-guarded `WmmaBf16f32.ResidualPrecisionContrast`: computes `Y_bf16 = X_bf16·W_bf16 + R_fp32` via `WarpGemm::mac_downconvert`, compares against an fp32 reference (within bf16 tolerance), and asserts it is at least as accurate as the bf16-accumulate path — i.e. it demonstrates the precision benefit of the fp32 accumulator (`C`) carried into the fused bf16 down-convert. Passes on gfx1250.	2026-06-09 15:52:45 +00:00
chris-tsiaousis-hpc	dc3c1cffd5	[rocm-libraries] ROCm/rocm-libraries#7891 (commit 4dee41d) Porting existing FMHA infra from users/shumway/ck/exp-kpack to develop (#7891) Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>	2026-06-09 14:00:32 +00:00
Brock Hargreaves	65c395984d	[rocm-libraries] ROCm/rocm-libraries#8108 (commit c620f0a) [ck] Unify Build_CK and buildHipClangJob into buildAndTest (#8108) ## Motivation `projects/composablekernel/vars/ck.groovy` had two near-identical build functions, `buildHipClangJob` (lean: static checks, FMHA, tile-engine, conv) and `Build_CK` (main per-arch matrix). This removes the duplication and fixes a latent GitHub-status bug that lived in both. ## Technical Details - Merged both into one `buildAndTest(Map conf)` gated by an explicit `is_main_build` flag (default `false` = lean path; `true` adds the GPU check + arch-gated inductor/perf/hipTensor; only `runBuildCKAndTests` sets it). - Deleted the `Build_CK_and_Reboot` / `buildHipClangJobAndReboot` wrappers (they only logged and re-threw); all 13 call sites now call `buildAndTest` directly. - Widened the shared `catch` to `Exception` so build / image-pull / "GPU not found" failures report failure instead of leaving the check stuck pending (failing stages now go red). - Removed the dead `no_reboot` key. No change to what is built or tested. ## Test Plan - Jenkins linter on the `Jenkinsfile`. - One branch run covering both paths (per-arch matrix + lean stages), spot-checking gfx1250 and a nogpu stage. ## Test Result - Verified statically: no `buildHipClangJob` / `Build_CK` references remain; `buildAndTest` defined once, all call sites wired. - Pending: linter + branch run before merge. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-08 23:45:42 +00:00
Emily Martins	97ca00e449	[rocm-libraries] ROCm/rocm-libraries#7836 (commit cdd9958) [CK Tile] Stream-K RDNA Support ## Motivation Currently, CK Tile Stream-K only supports CDNA architectures. This change adds Stream-K support on RDNA3/3.5 and RDNA4 architectures. ## Technical Details Stream-K currently has 3 reduction strategies: 1) atomics, 2) linear, and 3) tree. The linear and tree reductions require inter-workgroup communication to a global flags buffer and a global partials buffer. To ensure cache coherency, we use cache modifiers to skip cache levels that are not visible to all workgroups. On CDNA architectures, scalar load and scalar store instructions are available, which we use to read and write to the flags buffer with appropriate cache skipping modifiers. However, RDNA architectures do not support scalar store instructions, so workgroups must use a buffer store instruction to write to flags. Additionally, cache modifiers differ between CDNA and RDNA; they also differ between RDNA3 and RDNA4. Given this information, the main changes are as follows: - Added RDNA flag signaling: Use buffer store instructions for writing to global flags buffer - Add appropriate cache modifiers for reading and writing to flags and partials: - RDNA3 (gfx11): Use `glc \| dlc` coherence flags - RDNA4 (gfx12): Use `DEVICE` coherence scope - SFINAE-guarded overloads: Added compile-time dispatch for `SignalStorePartialDone()` and `WaitStorePartialDone()` based on target architecture - RDNA alignment requirements: Increased flags buffer alignment from 128B to 256B due to RDNA cache line size A note about the `amd_buffer_coherence_enum`: - Problem: The `amd_buffer_coherence_enum` uses preprocessor conditionals (`#if defined(__gfx12__)`) to define architecture-specific values. Template specializations reference enum values from different architectures (e.g., `glc_dlc` for GFX11). 
Due to C++ two-phase name lookup, non-dependent names are resolved during template parsing regardless of which architecture is being compiled, causing compilation failures when referenced values do not exist in the active preprocessor branch. - Temporary Solution: Added compatibility enum values to each architecture block. For example, I added `glc_dlc` in the `__gfx12__` block. I will create a ticket to refactor this enum with a design that has better scalability and tries to avoid the use of preprocessor conditionals. ## Test Plan ### Summary gtests were added to test wmma variants of Stream-K. These tests were stress tested locally on gfx11 and gfx12. ### More details This PR makes the following changes/additions to the Stream-K gtests: - Split tests into MFMA (CDNA) and WMMA (RDNA) variants - Added 16 WMMA kernel types: FP16/BF16/FP8/BF8 × Linear/Tree reduction - WMMA uses 16×16×16 wave tiles for RDNA (this is the only tile size supported on RDNA) - Fixed RDNA WGP mode: multiply multiProcessorCount by 2 for actual CU count - As described in [HIP documentation](https://rocm.docs.amd.com/projects/HIP/en/docs-7.2.0/doxygen/html/group___global_defs.html#ggacc0acd7b9bda126c6bb3dfd6e2796d7ca3ac50041beb59111a5c76edf03da0898), when in Workgroup Processor (WGP) mode, the value of `hipDeviceAttributeMultiprocessorCount` is half of CUs, because a single WGP contains two CUs. The default mode on RDNA is WGP mode, so when creating (M, N, K) instances for gtests using the CU count, we need to multiply the CU count by 2 to get the correct value. This is not needed in the kernel host code, because the occupancy ensures that overall `max_active_wgs` is correct. ## Test Result All tests pass locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-08 22:48:10 +00:00
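The WGP-mode correction mentioned in the Stream-K RDNA commit above amounts to one line of arithmetic. A minimal Python sketch follows; the function name and the `wgp_mode` flag are hypothetical, and the factor of two comes from the message's statement that a single WGP contains two CUs.

```python
def compute_unit_count(multiprocessor_count, wgp_mode):
    """Convert the reported multiprocessor count into a CU count.

    In WGP mode the reported value counts workgroup processors, and each
    WGP contains two CUs, so the CU count is twice the reported value.
    """
    if multiprocessor_count < 0:
        raise ValueError("multiprocessor_count must be non-negative")
    return multiprocessor_count * 2 if wgp_mode else multiprocessor_count


# Hypothetical device that reports 24 multiprocessors.
cu_count_wgp = compute_unit_count(24, wgp_mode=True)
cu_count_cu = compute_unit_count(24, wgp_mode=False)
```

Per the message, this adjustment is only applied when sizing test problems; the kernel host code relies on occupancy instead.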
Márton Bidlek	0d18f4fc05	[rocm-libraries] ROCm/rocm-libraries#4798 (commit 0acaf5f) Using named functors instead of lambdas ## Motivation Currently, in block-level GEMM pipelines, there is significant code repetition for prefetching and tail handling, where lambda functions create unique instantiations at each call. This includes repeated static_for instantiations and large loops such as MRepeat. Each repetition results in additional instantiations, which increases compilation time and binary bloat. ## Technical Details Refactor repeated code blocks into named functors so the compiler can reuse already instantiated code instead of generating multiple copies. Scope of changes: 1. WMMAOPS pipeline internals: projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_base.hpp, projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_v1.hpp, projects/composablekernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_wmmaops_v3.hpp 2. XDLOPS and preshuffle pipeline variants across projects/composablekernel/include/ck/tensor_operation/gpu/block (v1/v2/v3/v4/v5, scale, dequant, gufusion, moe, mx, blockscale, skip-b-lds, dpp, xdlops) Shared functor file: projects/composablekernel/include/ck/utility/vector_load_functor.hpp ## Test Plan Note that the provided compilation traces by -ftime-trace do not report unnamed lambda instantiations, so a clear baseline for instantiation counts cannot be established. As a result, the impact of this change will be evaluated based on runtime performance rather than direct instantiation-count comparisons. ## Test Result The effects of this were timed by the compilation of a single HIP object through an example (grouped_gemm_wmma_splitk_fp16.cpp). 
The average user time and speedup of this using the average of 100 compilations is: - Mean compile time before the changes: 37.734 s - Mean compile time after: 32.087 s - Speedup: 17.6% Ran a full CK compilation on Alola with the following results: \| Metric \| Before (min) \| After (min) \| Absolute Reduction (min) \| % Reduction \| \| ------ \| ------------ \| ----------- \| ------------------------ \|	2026-06-08 17:11:53 +00:00
Emily Martins	674f7cdc0e	[rocm-libraries] ROCm/rocm-libraries#8141 (commit d3defa6) [CK] Remove Stream-K from old CK ## Motivation Since Stream-K has a CK Tile implementation, we no longer need Stream-K in old CK. Hence, this PR removes Stream-K from old CK. ## Technical Details All Stream-K artifacts in old CK have been removed including examples, tests, kernels, and CK profiler artifacts. ## Test Plan Ran a CI run on the branch before publishing PR. ## Test Result All tests passed. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>	2026-06-08 16:47:26 +00:00
Johannes Graner	0b3c297ee2	[rocm-libraries] ROCm/rocm-libraries#8009 (commit 26ab70d) [CK Tile] Add WAVELET pipeline for forward grouped convolution (#8009) ## Motivation CK Tile forward grouped convolution trails classic CK on 3x3 convolutions whose output-channel count is not divisible by 8, where the narrow output store limits the compute CShuffle epilogue. This ports the WAVELET pipeline (added for backward-weight in #7937) to the forward kernel to close that gap. ## Technical Details - Kernel (`grouped_convolution_forward_kernel.hpp`): WAVELET load/math-wave wiring, mirroring the backward-weight implementation; the non-WAVELET path is unchanged. - Generator: implement `parse_native_fwd_instance`, the forward native-instance parser. - Registered WAVELET instances: profiler bf16 3 / fp16 5, tests 1 each. WAVELET requires input channels divisible by 8 (it does not apply to depthwise). The bf16/fp16 instance asymmetry is intentional and measured: the VecC=8 tiles never beat the compute pool in bf16 but win about 20% of divisible-by-8 3x3 shapes in fp16, so VecC=8 is registered for fp16 only. ## Test Plan - Correctness (CPU reference) for every registered profiler instance, across VecC variants. - Per-shape best-instance performance sweep over the 34 RetinaNet shapes (bf16) and a 200-shape cross-model sweep (bf16 and fp16), compared against classic CK. ## Test Result - Correctness: PASS for all instances. - RetinaNet (bf16, vs classic CK): faster on 28 of 34 shapes, geomean +19.5%; the not-divisible-by-8 shapes up to 3.7x. One 1x1 stride-2 shape stays ~20% behind classic CK, unrelated to WAVELET. - Cross-model (200 shapes): WAVELET wins 3x3 not-divisible-by-8 in both dtypes (up to 61% over the next-best compute instance); for divisible-by-8 3x3 it wins about 20% of shapes in fp16 (3-11%) and none in bf16. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. 
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 08:57:39 +00:00
Johannes Graner	b7d59e4b5f	[rocm-libraries] ROCm/rocm-libraries#8099 (commit fc4894b) [CK Tile] Fix Stream-K flag store: wave-uniform SGPR address for scalar s_store/s_load (#8099) ## Motivation Stream-K grouped-conv (and GEMM) kernels fail to assemble for some instances: the inline scalar flag store/load gets a VGPR address operand, which scalar-memory instructions reject (`invalid operand for instruction`). This blocks Stream-K instances from building. ## Technical Details - `StreamKReductionOps::{Signal,Wait}StorePartialDone` (shared by GEMM and conv, added in #5393) take `kargs` by `const&` and feed `kargs.workspace_ptr` / `cta_idx` into inline `s_store_dword`/`s_load_dword` with `"s"` constraints. For some instantiations the compiler can't keep the pointer wave-uniform and emits a VGPR address. - Fix: route the pointer and offset through `amd_wave_read_first_lane` so the scalar-memory address is a wave-uniform SGPR before the asm. Same instructions, no algorithm change. - Not arch-specific: the affected instance fails on gfx908/gfx90a/gfx942/gfx950 without the fix; whether the compiler spills to a VGPR depends on the instantiation (tile/warp/pipeline), not the target. ## Test Plan - Compile the previously-failing dispatcher instance for gfx908/gfx90a/gfx942/gfx950. - `test_ck_tile_grouped_conv_bwd_weight_streamk` on gfx942, gfx90a, gfx950 hardware. - gfx950 perf A/B (example, bf16/tree, 10 runs each) with vs without the change. ## Test Result - Failing instance now assembles on all four archs; previously failed on every one. - 30/30 conv Stream-K tests pass on gfx942, gfx90a, gfx950. - gfx950 perf delta -0.13% (within run-to-run noise) — no regression from the added readfirstlane on the cold flag path. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. 
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 08:57:04 +00:00
Bartłomiej Kocot	28f2966762	[rocm-libraries] ROCm/rocm-libraries#7734 (commit 03ffb9d) [CK] Grouped Convolution Global Load/Store instances ## Motivation Support global load and store in grouped convolutions using instance factory. ## Technical Details - add new instances for each direction - add new tests for large cases ## Test Plan New test for large cases ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1255	2026-06-06 22:52:59 +00:00
Bartłomiej Kocot	2c363870d9	[rocm-libraries] ROCm/rocm-libraries#6744 (commit 9d056e8) [Ck][CK Tile] Global Load/Store for Large Tensors support (#6744) ## Motivation Create a solution to support large tensors across all of CK Tile. ## Technical Details - add possibility to use global load - int64 indexing ## Test Plan conv fwd tests ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-913	2026-06-06 10:14:17 +00:00
Enrico Degregori	1b4fbd95fd	[rocm-libraries] ROCm/rocm-libraries#6089 (commit c876d18) [CK Tile] Extend type support EightWave pipeline ## Motivation The EightWave pipeline was designed for 8-bit types. This PR extends support to any FP type. ## Technical Details - Generalize policy to support any FP type - Change LDS layout to fix bank conflicts. This removes all bank conflicts in the pipeline (checked for all supported types). Remaining bank conflicts are related to the CShuffle epilogue. ## Test Plan Added GEMM tests with the newly supported types. Note that FP6 is also supported for MX GEMM, but the PR was reverted so no tests were added for it. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 23:54:40 +00:00
Thrupti Raj Lakshmana Gowda	054436ca4a	[rocm-libraries] ROCm/rocm-libraries#8079 (commit cf1e8f2) [tile_engine] Integrate gemm_streamk into budget-based sampling system (#8079) ## Motivation `gemm_streamk` was the only GEMM op not participating in the tile engine's budget-based sampling system. Without a budget cap, it would always generate its full feasible set, making build times unpredictable and inconsistent with the other ops. ## Technical Details - CMake budget propagation (`ops/gemm/CMakeLists.txt`): Added `gemm_streamk` to the active-ops detection loop so it receives a share of the sampling budget. Because `gemm_streamk` lives in a sibling subdirectory (`ops/gemm_streamk/`), its allocation is written via `CACHE STRING "" FORCE` to make the variable visible across the CMake directory boundary. - Per-combo budget division (`ops/gemm_streamk/CMakeLists.txt`, `ops/gemm/grouped_gemm/CMakeLists.txt`): Added the same per-combo `MAX_INSTANCES` division that exists in `gemm_universal` and `gemm_preshuffle`. The total budget is divided by `n_datatypes × n_layouts` before the inner `foreach` loop so that sampling fires independently per `(dtype, layout)` combo rather than acting as a single global cap. - Sampling integration (`gemm_streamk_instance_builder.py`): Added `_apply_sampling()` method to `GemmKernelBuilder`, mirroring the Sobol+LHS+maximin sampling used by other ops. New constructor parameters: `gpu_target`, `max_instances`, `seed`, `tier`, `manifest_path`. New CLI arguments: `--gpu_target`, `--max-instances`, `--seed`, `--tier`, `--manifest-path`. The `--gpu_target` argument is now also forwarded on the `--list_kernels` invocation. - `GEMM_STREAMK_AXES` (`sampling/feasible_set.py`): Defined as `GEMM_AXES + ["reduction_strategy"]` to account for the extra axis unique to stream-K. Added `reduction_strategy` to `CATEGORICAL_AXES`. 
- Weight rebalancing (`sampling/op_weights.json`): Allocated 10% weight to `gemm_streamk` by proportionally reducing `gemm_universal` (0.35 → 0.30) and `gemm_preshuffle` (0.30 → 0.25). Total remains 1.00. ## Test Plan - Configure with `TILE_ENGINE_SAMPLING_TIER=daily` and verify that `gemm_streamk` receives a non-zero budget allocation and that `GEMM_STREAMK_MAX_INSTANCES` is set correctly. - Configure with `TILE_ENGINE_SAMPLING_TIER=daily` across multiple `(dtype, layout)` combos and confirm per-combo budget = total / n_combos. - Configure with `-DGEMM_STREAMK_MAX_INSTANCES=50` explicit override and verify the override is respected (budget allocation skipped). - Verify `chosen_instances.json` manifest is written to the working path when tier is active. - Confirm `op_weights.json` weights still sum to 1.00. ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 17:06:11 +00:00
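The per-combo budget division and the weight rebalancing from the tile_engine commit above can be checked with a short Python sketch. It is illustrative only: the real logic lives in CMake, the function name `per_combo_budget` is hypothetical, and the use of floor division is an assumption (the message does not say how remainders are handled). Only the three weights named in the message are included; the weights of other ops are not given in this log.

```python
import math


def per_combo_budget(total_budget, n_datatypes, n_layouts):
    """Split an op's instance budget evenly across (dtype, layout) combos."""
    n_combos = n_datatypes * n_layouts
    if n_combos <= 0:
        raise ValueError("need at least one (dtype, layout) combo")
    # Floor division is an assumption made for this sketch.
    return total_budget // n_combos


# Weights quoted in the commit message, before and after rebalancing.
before = {"gemm_universal": 0.35, "gemm_preshuffle": 0.30}
after = {"gemm_universal": 0.30, "gemm_preshuffle": 0.25, "gemm_streamk": 0.10}

# The 0.10 given to gemm_streamk is taken from the other two ops, so the
# combined share of these ops (and therefore the overall total) is unchanged.
share_preserved = math.isclose(sum(before.values()), sum(after.values()))

# Hypothetical numbers: a budget of 120 instances, 2 dtypes, 3 layouts.
budget = per_combo_budget(120, n_datatypes=2, n_layouts=3)
```

With these hypothetical numbers each `(dtype, layout)` combo may sample up to 20 instances, rather than all six combos competing for one cap of 120.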
Brock Hargreaves	b2a3ffea5d	[rocm-libraries] ROCm/rocm-libraries#5945 (commit 8f9a5fe) [CK] [MIOPEN] Split convolution library by layout # Split Composable Kernel convolution operations by data layout TLDR: 1. This is a reorganization of files, folders, and CMakeLists for convolution kernels and facilitates a splitting of the convolution library into layouts. 2. The speedup can range anywhere between 15-40% depending on the target architecture for miopen only builds of CK. For TheRock nightly builds of CK, which includes both miopen and hip tensor kernel instances, this resulted in a 10% decrease in compile time for gfx1100. ## Overview Based on https://github.com/ROCm/composable_kernel/pull/3010/ (except keeping 1 static library) ## What MIOpen Actually Uses MIOpen exclusively uses: - NHWGC for all 2D convolutions - NDHWGC for all 3D convolutions This is because MIOpen's tensor descriptors natively use channel-last, group-aware formats. ## Key Changes ### 1. Layout-Based Directory Structure Reorganized convolution instance files from flat per-operation to hierarchical layout-based structure. For example: Before: grouped_conv2d_fwd/ ├── device_grouped_conv2d_fwd_xdl_nhwgc_.cpp (MIOpen-required) ├── device_grouped_conv2d_fwd_xdl_gnhwc_.cpp (optional) └── device_grouped_conv2d_fwd_xdl_ngchw_.cpp (optional) After: grouped_conv2d_fwd/ ├── nhwgc/ ← MIOpen-required │ ├── xdl/device_grouped_conv2d_fwd_xdl_.cpp │ └── wmma/device_grouped_conv2d_fwd_wmma_.cpp ├── gnhwc/ ← Optional (excluded with MIOPEN_REQ_LIBS_ONLY) └── ngchw/ ← Optional (excluded with MIOPEN_REQ_LIBS_ONLY) ### 2. 
Preserved Umbrella Library As before, all convolution operations are consolidated into a single static `device_conv_operations` library: - Aggregates layout-specific instance object files via `ADD_CONV_LAYOUT_INSTANCES` macro - Default build: Includes all layouts (NHWGC + GNHWC + NGCHW + NDHWGC + GNDHWC + NGCDHW) - MIOpen build (`MIOPEN_REQ_LIBS_ONLY=ON`): Includes only NHWGC and NDHWGC layouts ### 3. Binary Size Reduction When building with `MIOPEN_REQ_LIBS_ONLY=ON`: Layouts Included (26 targets): - 7× NHWGC instances (2D operations + variants) - 19× NDHWGC instances (3D operations + variants) Layouts Excluded (16 targets): - 3× GNHWC instances (2D operations) - 3× NGCHW instances (2D operations) - 3× GNDHWC instances (3D operations) - 3× NGCDHW instances (3D operations) - 2× GNWC instances (1D operations) - 1× NWGC instance (1D operations) - 1× additional NHWGC instance (grouped_conv1d_fwd, not needed by MIOpen) This represents a ~38% reduction in instance targets (16 excluded out of 42 total layout-specific targets). ### Testing - ✅ All existing CK tests link against the umbrella library - ✅ MIOpen links successfully with the reduced umbrella library - ✅ Profiler builds with all layout-specific targets explicitly listed Notes from the Author: Since this refactor moved most of the convolution files further into subdirectories, I concentrated on ensuring that no source files were excluded, including sharded sources: Targets are correctly migrated — no missing targets, no shard count mismatches.	2026-06-05 15:09:20 +00:00
Yung-sheng Tu	e826b2eb7e	[rocm-libraries] ROCm/rocm-libraries#6768 (commit 43ca43f) [CK TILE] Unification Work – Add MFMA specialisations for `tf32_t` (#6768) ## Motivation This PR adds two specialisations related to `tf32_t`. ## Technical Details This change treats `tf32_t` as a concrete type rather than an empty `struct`. It also adds two new specialisations for MFMA dense builtins and resolves existing circular include issues. ## Test Plan All the new wrappers were added to the test suite in test_amdgcn_mma_layout.inc. ## Test Result Test should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 12:27:41 +00:00
Sami Remes	ad4e2e7624	[rocm-libraries] ROCm/rocm-libraries#7199 (commit 23f7320) [CK_TILE] [QuantGEMM] Fix SplitK tail handling and other improvements (#7199) This pull request introduces improved and more robust split-K support for quantized GEMM. The main changes add runtime validation, utility functions for split-K batch calculations, pointer offset handling for split-K in grouped kernels, and enhanced support for various tensor layouts. The changes also improve error handling and provide more flexibility for runtime tail handling in split-K pipelines. Split-K Support and Validation Enhancements: * Added runtime validation to ensure `k_batch` is a positive integer and that split-K configurations do not produce empty final batches or mismatched pipeline tails, with detailed error messages and logging for misconfiguration. * Introduced utility functions `get_splitk_batch_k_read` and `get_splitk_last_batch_k` to compute per-batch K read sizes and handle split rounding, ensuring correct and consistent split-K batch partitioning. * Changed the default value of `k_batch` in `QuantGemmHostArgs` to 1 (no split-K) for safer default behavior. Pointer Offsets and Grouped Kernel Handling: * Updated `QuantGroupedGemmKernel` to apply split-K per-batch offsets to all input pointers, mirroring the behavior of non-grouped kernels and ensuring correctness for split-K launches. 
* Modified AQ tensor view handling to correctly reflect the remaining K-groups from the split-K batch's offset position, improving accuracy for split-K in grouped kernels. Pipeline and Layout Flexibility: * Added support for runtime selection of split-K tail handling via a new template parameter `RuntimeSplitKTail_`, with new helper methods to dispatch GEMM pipelines accordingly. * Improved handling for tensor layout cases, including preshuffled B and both row-major and column-major AQ layouts, ensuring correct pointer arithmetic and compatibility checks.	2026-06-05 11:41:49 +00:00
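The split-K partitioning and validation described in the QuantGEMM commit above can be sketched as follows. This is a hedged Python illustration, not the CK implementation: the bodies of `get_splitk_batch_k_read` and `get_splitk_last_batch_k` are not shown in this log, so the rounding rule here (round each batch up to a multiple of a hypothetical `k_granularity`) and the Python function names are assumptions.

```python
def splitk_batch_k_read(k, k_batch, k_granularity):
    """K elements read by every batch except possibly the last one."""
    if k_batch <= 0:
        raise ValueError("k_batch must be a positive integer")
    if k_granularity <= 0:
        raise ValueError("k_granularity must be a positive integer")
    per_batch = -(-k // k_batch)  # ceiling division
    # Round up so that each batch covers whole K tiles (assumed rule).
    return -(-per_batch // k_granularity) * k_granularity


def splitk_last_batch_k(k, k_batch, k_granularity):
    """K elements left for the final batch; rejects empty final batches."""
    k_read = splitk_batch_k_read(k, k_batch, k_granularity)
    last = k - k_read * (k_batch - 1)
    if last <= 0:
        raise ValueError(
            f"k_batch={k_batch} leaves an empty final batch for K={k}"
        )
    return last


# Hypothetical problem: K=1000 split four ways with 64-element K tiles.
k_read = splitk_batch_k_read(1000, 4, 64)
k_last = splitk_last_batch_k(1000, 4, 64)
```

In this example the first three batches read 256 elements each and the last reads the remaining 232, so a tail path is needed for the final batch. A configuration such as `K=256, k_batch=4, k_granularity=128` is rejected because the rounded batches already cover all of K before the last batch starts.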
Enrico Degregori	7b9245f18c	[rocm-libraries] ROCm/rocm-libraries#5854 (commit 8e2d46d) [CK Tile] Async support preshuffle GEMM ## Motivation Add async support to the existing preshuffle GEMM pipeline ## Technical Details Notes: the implementation avoids the previous strategy of duplicating pipelines for async support and instead adds an `Async` switch to the ops Problem to enable the async pipeline, then integrates the async pipeline into the existing one. This avoids code duplication and facilitates the integration of buffer load to lds in existing pipelines. In my opinion, it should also be used for other pipelines which don't support buffer load to lds yet, and it would also be a good idea to refactor the existing async GEMM pipelines with the same approach. Summary: - integrate buffer load to lds in existing pipeline - add optimal tensor descriptors for vmem loading and lds reading. They are currently optimized for 16x16 wave tiles but they also work for 32x32 wave tiles. Optimizations for 32x32 wave tiles require a different lds layout and will be done in a follow-up issue - Add async config to examples - Add test (gfx950 only) ## Test Plan New test for gfx950 `test_ck_tile_gemm_pipeline_wp_async` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-05 07:17:09 +00:00
Aviral Goel	267ca67001	[rocm-libraries] ROCm/rocm-libraries#8028 (commit c1cb112) [CK_Tile] Add wmma_bf16f32_16x16x32_bf16 via fused-downconvert override (#8028) ## Summary Adds `__builtin_amdgcn_wmma_bf16f32_16x16x32_bf16` (fp32 accumulate → bf16 output) to the CK Tile WMMA warp-gemm path. API only — the unit test is split into a stacked PR (#8035) so this API change can be reviewed in isolation. ## Changes (4 files) - 16-bit trait: `wmma_intrinsic_downconvert` (calls the bf16f32 builtin — fp32 C in, bf16 C out) plus `COutDataType = bf16_t` / `COutVecType`. - `WarpGemmAttributeWmmaImpl` / `WarpGemmAttributeWmma`: `mac_downconvert(c_fp32, a, b)` (kTransC-aware) returning the bf16 C-output vector. - `WarpGemmImpl`: `mac_downconvert` tail handler producing a bf16 C-output tile from the fp32 accumulator tile, reusing `CWarpDstrEncoding` (output layout identical to the f32 C tile). Verified on gfx1250 (via the stacked test PR #8035): the test passes; the existing WMMA warp-gemm test is unaffected (additive change only).	2026-06-05 05:01:31 +00:00
Enrico Degregori	bdd7a8333d	[rocm-libraries] ROCm/rocm-libraries#6672 (commit bda3f97) [CK Tile] PermuteN support MX GEMM ## Motivation Add PermuteN support to preshuffle MX GEMM ## Technical Details - Modify `shuffle_b_permuteN` to support the MX preshuffled layout - Add `preShuffleScalePermuteN` with the same functionality as `preShuffleScale` but with a layout consistent with PermuteN - Include MX pre-processing functions in the library ## Test Plan Add test configuration for permuteN with preshuffle (both FP4 and FP8) ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Cong Ma <congma13@amd.com>	2026-06-05 03:04:43 +00:00
spolifroni-amd	449f8b4c5b	[rocm-libraries] ROCm/rocm-libraries#7955 (commit c87a40f) [ck] Updated CK Tile documentation to use mermaid diagrams (#7955) ## Motivation There were mermaid diagrams in the CK Tile doc that were converted to svg. However, there is an extension for mermaid diagrams. The conf.py and requirements.in have been updated to use that extension instead of the svg files.	2026-06-04 22:59:52 +00:00
Brock Hargreaves	4e1296674d	[rocm-libraries] ROCm/rocm-libraries#7990 (commit b8b5b43) [CK] Load ck.groovy via Jenkins Shared Library ## Motivation This allows the CI service to have a configuration source-of-truth outside the PR under test, allowing rapid system changes. Bug fixes on the develop branch propagate immediately to all pipelines that don't override the parameter -- no rebase required. A new `USE_CURRENT_BRANCH_FOR_CK_GROOVY` parameter lets contributors test pipeline changes on their own branch without any extra configuration. ## Technical Details - `loadCk()` in the Jenkinsfile is updated to call `library("ck@${branch}").ck.get()` instead of `checkout scm` + `load "vars/ck.groovy"`. The `checkout scm` inside `loadCk()` is removed since Jenkins now handles the library fetch internally. - A `USE_CURRENT_BRANCH_FOR_CK_GROOVY` boolean parameter (default: off) is added. When off, `ck.groovy` is always loaded from `develop` — all normal PR builds are unaffected. When on, `ck.groovy` is loaded from the current branch automatically via `env.CHANGE_BRANCH`, so contributors testing pipeline changes just tick the box. - `return this` is removed from the end of `ck.groovy`. This was required by the `load` convention but is not needed (and can cause errors) in a shared library context. - `loadCk()` is kept at every call site rather than called once at the top, preserving restart-from-stage safety — if a build is restarted from a mid-pipeline stage, `ck` is still initialized correctly. - The Jenkins Shared Library named `"ck"` must be registered in Jenkins Global Pipeline Libraries. ## Test Plan 1. Trigger "Build with Parameters" on the PR branch with `USE_CURRENT_BRANCH_FOR_CK_GROOVY=true`. 2. Verify the "Determine CI Execution" stage completes and the `library()` call indicates the current branch. 3. Verify the "Static checks" stage completes. 4. Trigger a second build with `USE_CURRENT_BRANCH_FOR_CK_GROOVY=false` (default) to confirm normal builds still load from `develop`. ## Test Result Verified both paths. The develop library is loaded by default; the branch library is loaded when the parameter is enabled. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-04 22:32:37 +00:00
Illia Silin	aef7b42883	[rocm-libraries] ROCm/rocm-libraries#7816 (commit f6324af) [CK] Fix latest build issues with staging compiler. ## Motivation Fixing new warnings with staging compiler. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-04 17:41:09 +00:00

3421 Commits