composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-01 12:17:00 +00:00

Author	SHA1	Message	Date
Yung-sheng Tu	604c56bc0e	[rocm-libraries] ROCm/rocm-libraries#7850 (commit e8f2756) style: [CK TILE] Unification Work – Unify format MFMA part (#7850) ## Motivation This PR unifies the parameter comments and simplifies the docs for `amdgcn_mma` specialisations of `MfmaOp`. ## Technical Details Besides the two changes mentioned above, it also simplifies the sparse traits, unifies the usages of `enable_if_target_id_t`, and cleans up the files in [include/ck_tile/core/arch/mma](https://github.com/ROCm/rocm-libraries/tree/users/yungshengtu/ck/unification/unify_format_mfma/projects/composablekernel/include/ck_tile/core/arch/mma). NOTE: The first commit is not in the scope of this PR. ## Test Plan Tests already exist. ## Test Result Tests should pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. close #8907	2026-06-29 18:51:17 +00:00
Brock Hargreaves	3719bf05c2	[rocm-libraries] ROCm/rocm-libraries#8644 (commit 8b2545e) [CK][CI] Expand other stages to use healthy-node retry logic. (#8644) ## Motivation CI stages bound to a single node via the declarative `agent{ label }` cannot recover when that node is unhealthy: a `NodeFault` (e.g. an exhausted `gitNetRetry` after a persistent `Could not resolve host: github.com`, or a missing GPU) just fails the whole build. The existing `runOnHealthyNode` wrapper already reroutes such faults to a different node, but only the "Build CK and run Tests" stages used it. This PR brings the remaining node-bound stages under that wrapper so a bad node reroutes instead of failing the build. ## Technical Details - `runOnHealthyNode`/`preflight` (`ck.groovy`): `preflight` now takes a `requireGpu` flag that gates the GPU-only checks (`driverUp`/`devicesUp`); `daemonUp`/`diskOk` always run. `runOnHealthyNode` derives it from the node label (`!label.contains('nogpu')`), so no call-site argument is needed and nogpu stages skip the GPU checks automatically. - `Jenkinsfile`: migrated 19 `agent{ label }` stages to `agent none` + `ck.runOnHealthyNode(...)`: - 12 GPU test stages: Pytorch, AITER ×2, FA ×2, 3 grouped-conv (gfx90a), TILE_ENGINE_BASIC, TILE_ENGINE_GEMM ×3. - 4 FMHA stages: the per-stage `environment{}` block was dissolved and `build_and_run_fmha(arch)` is now evaluated on-node inside the closure (required since `agent none` evaluates `environment{}` off-node). - 3 nogpu stages: Determine CI Execution, Build Docker, Static checks (reroute via the label-derived `requireGpu=false`). - The commented-out `gfx908`/`gfx1010` stages were intentionally left untouched. ## Test Plan - Validated the Jenkinsfile against the pipeline linter (`pipeline-model-converter/validate`).
- Manual pipeline run to confirm no regressions - Key checks: nogpu stages pass `preflight` (no false `gpu-devices-missing`); migrated stages still allocate, build, and test as before; on a `NodeFault` the `Node attempt N/3 … on <node>` reroute lines appear. ## Test Result - Linter: "Jenkinsfile successfully validated." - Groovy shared-library pipeline — not locally executable; the manual build above is required to fully validate. Check CI. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-29 16:03:23 +00:00
Kiefer van Teutem	2089713f94	[rocm-libraries] ROCm/rocm-libraries#8227 (commit 75c30d5) [CK TILE] Unification Work – Remove unification Flag structs in favor of new WarpGemmParams (#8227) ## Motivation Recently, the way flags are sent down to the intrinsics was changed in CK Tile. At the point where the WarpGemm is invoked, an arbitrary number of template parameters can be passed, and these are passed all the way down to the lowest-level intrinsics wrappers. There, `WarpGemmParamsParser<>` is used to extract flags for the intrinsics. In this MR we adapt the unification framework (amdgcn_mma struct and MmaPipelines) to work in the same way. With this change, our custom intrinsic Flag structs no longer serve a purpose, so they are removed. Unrelated, but I also removed the MmaPipeline flags because they aren't used for anything except CTranspose, which is already available. This also makes test_amdgcn_mma_pipeline completely redundant, so that was removed as well.	2026-06-26 12:00:58 +00:00
Illia Silin	621697af8c	[rocm-libraries] ROCm/rocm-libraries#8723 (commit e2f28c1) [CK] [Security] pin getopt library to specific commit ## Motivation Make sure that FetchContent_Declare pulls third-party code from a specific commit hash instead of the tip of the main branch, to avoid accidentally pulling potentially harmful code. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-25 19:06:13 +00:00
spolifroni-amd	713f1fbf46	[rocm-libraries] ROCm/rocm-libraries#8739 (commit 833c182) [hipcub, rocthrust, rocprim, ck] updating changelogs for 7.14 (#8739) ## Motivation The 7.14 change logs have a different format than the ones for 7.13. This changes the format and also sets the right component versions.	2026-06-24 18:32:34 +00:00
chris-tsiaousis-hpc	e503e6277a	[rocm-libraries] ROCm/rocm-libraries#8762 (commit fe88750) [CK] Fix flaky test `test_batched_gemm_b_scale_wmma` Loosen the tolerance and add better logging in case of failure. TLDR: `profile_batched_gemm_b_scale_impl.hpp` (lines 390-392) uses `rtol = atol = 1e-2`, while the sibling non-batched B-scale profiler uses `2e-2` in `profile_gemm_b_scale_impl.hpp` (lines 361-364). `KBatch > 1` adds an F16 atomic accumulation step, so the batched path is stricter on the noisier arithmetic path. Loosening the tolerances should fix the flaky test. I've also added better logging in case of failures. ## Motivation `test_batched_gemm_b_scale_wmma` can fail on gfx1201 when `KBatch > 1` because that path uses split-K F16 atomic accumulation. The existing batched B-scale tolerance was stricter than the non-batched B-scale profiler, even though the batched split-K path has additional rounding noise from native F16 atomics. This PR aligns the batched tolerance with the comparable non-batched path and improves failure diagnostics so future mismatches identify the exact shape, KBatch value, instance, strides, and operator. ## Technical Details Updated [projects/composablekernel/profiler/include/profiler/profile_batched_gemm_b_scale_impl.hpp](projects/composablekernel/profiler/include/profiler/profile_batched_gemm_b_scale_impl.hpp) to: - Change non-FP8 verification tolerance from `1e-2` to `2e-2`. - Match the tolerance already used by the non-batched B-scale profiler. - Build a detailed failure message with: - instance index - `kbatch` - `M`, `N`, `K`, and `BatchSize` - `StrideA`, `StrideB`, and `StrideC` - selected operator name - Move operator name construction earlier so it is available for diagnostics. ## Test Plan Configured the CK gfx1201 build with device instances enabled and XDL disabled to isolate WMMA coverage.
Built and ran the focused batched GEMM B-scale WMMA test target: ```bash ninja -C projects/composablekernel/build-gfx950-gfx1201 test_batched_gemm_b_scale_wmma ``` Ran the focused `MidLargeM` case across WMMA instances and then ran the full WMMA-only gtest target. ## Test Result The WMMA-only gfx1201 validation passed. - `MidLargeM` passed across 21 WMMA instances. - Full `test_batched_gemm_b_scale_wmma` passed all 3 gtests. - No incorrect-result lines were reported. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-06-24 16:41:50 +00:00
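The effect of the tolerance change described in the commit above can be illustrated with a small sketch. It assumes a NumPy-style combined criterion, `|out - ref| <= atol + rtol * |ref|`; whether the CK profiler uses exactly this formula is an assumption, and `within_tolerance` is a hypothetical helper, not a CK function.

```python
def within_tolerance(out, ref, rtol, atol):
    """Combined relative/absolute check (assumed criterion, NumPy-style)."""
    return abs(out - ref) <= atol + rtol * abs(ref)


# Illustrative values: a result that is 1.5% off the reference, the kind of
# deviation that extra rounding in an accumulation step could produce.
ref = 100.0
out = 101.5

# Old batched setting from the commit message: rtol = atol = 1e-2.
passes_old = within_tolerance(out, ref, rtol=1e-2, atol=1e-2)

# New setting, matching the non-batched profiler: rtol = atol = 2e-2.
passes_new = within_tolerance(out, ref, rtol=2e-2, atol=2e-2)

print(passes_old, passes_new)
```

With these made-up numbers the old bound is 1.01 and the new bound is 2.02, so the same result is rejected before the change and accepted after it.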
Kiefer van Teutem	137f2a9a10	[rocm-libraries] ROCm/rocm-libraries#7407 (commit 0b79e05) [CK TILE] Initial integration of MFMA / WMMA unification framework into CK Tile (#7407) (locked behind flag) Note: Everything works but this is still a draft MR because I want to do some more cleanup and maybe do some testing for MX fp6. Also please don't trigger copilot, I will do this once I feel it is clean enough, otherwise I'll get a bunch of comments about stuff I already know. ## Motivation The point of this MR is to finally use our unification MmaPipelines to replace the existing WarpGemms in CK tile and make sure everything works. I focused on gfx908 and gfx950 for now, dense and scale intrinsics, fp16, fp8, and fp4. I managed to get CK tests / examples working for all of these scenarios, so the basic implementation should be correct. I expect some more tweaks will be required to get full support, some of which I already anticipated in the section "New issues". ## Big switch: USE_NEW_UNIFIED_FRAMEWORK When USE_NEW_UNIFIED_FRAMEWORK is 1, we replace all WarpGemms with MmaPipelines from the new unified framework. This means WarpGemmDispatcher will use the UnificationDispatcher instead of the regular Dispatcher. Furthermore, named WarpGemms like WarpGemmMfmaF32F32F32M16N16K4 will also get rerouted to the UnificationDispatcher. The latter is necessary because some pipelines bypass the WarpGemmDispatcher in favor of directly using named WarpGemms. For now the switch is turned on for easier testing, so don't expect the CI to pass. When off, this MR should not affect any of the CK tile tests at all so I would expect the CI to pass. ## Simplification of MmaPipelineBase I found that the structure of MmaPipelineBase was a bit complex and I was able to reduce it a lot. The only thing an Mma Pipeline does (currently) is provide a wrapper around amdgcn structs that allows k iteration and sparse compression. 
We don't allow M and N composition for now, for simplicity and because this is not currently expected from WarpGemms in CK Tile. ## Re-interpretation of tile distribution encodings for packed datatypes Tile distributions for packed types are expected to describe mathematical elements, not datatype elements! This distinction is why the gfx950 fp4 CK_tile tests were not working. Updated the interpretation in amdgcn_mma, tile distribution calculator, and layout test, along with comments. Tested on all architectures. ## getCMakeCompilerTarget() for configuration time target architecture This is a workaround because there are a lot of cases in CK Tile where the host code inspects Device constructions like WarpGemm, and we need to get the version that will be used on the device. This is a big kludge and we need to figure out a better solution. Also this util will always pick the first cmakelists target arch, so there will be issues when compiling for multiple target architectures. Ideally, the host code should not touch the WarpGemms at all, and there would be no issue. This has been a point of friction in CK for a long time. We can discuss this with Chris Millette. ## Tests I was able to verify that the following CK Tile tests and examples work with the new unified framework: tile_tutorial_mfma_16x16x16 (gfx9, fp16, uses transpose) tile_example_gemm_basic (gfx9, fp16) test_ck_tile_mx_gemm_async (gfx950, microscaling fp8 and fp4) Within the tile tutorial I was also able to use WarpGemmMfmaF16F16F32M16N16K32TransposedCDistribution instead of WarpGemmMfmaF16F16F32M16N16K16TransposedCDistribution to verify that basic K iteration also works. A little while ago I also verified that the performance did not change in a measurable way, and that the compile time did not change much, although it did swing by up to 20% each way (faster or slower). We will need some broader and more accurate tests for this going forward.
## Moving forward To confidently be able to replace the existing Dispatcher and WarpGemm framework with our own, we need to make sure that all existing tests and examples work on all platforms. Furthermore, we should pay attention to performance and compile time of all these tests. Performance should definitely not change, as all we're doing is refactoring the support structure around the intrinsics, which should melt away during compilation. ## New issues I will make new issues with descriptions for these, but here is a short (incomplete) list: - Test RDNA CK Tile pipelines - Test Sparse CK Tile pipeline (does not exist but we can make one) - Remove MmaOp flags from unification framework and update it to work with new WarpGemmParamsParser instead. - Add Swizzle support and test in CK Tile pipelines. - Test Scale + transpose CK Tile pipelines. - Coherent strategy for attrnumaccess for dense, scale, default, packed, wmma, gfx1250, etc in CK Tile. It's messy now. - Dispatcher should not be determining scale-ness of intrinsics based on MNK sizes. - Try adding back the MN composition in MmaPipelines - Why is test_amdgcn_wavewise_mma only compiled for CDNA? - Investigate NOP and AGPR flags - Maybe get rid of WmmaTag in dispatcher. - Find a coherent strategy for dealing with host vs device compile passes, and the host sneaking a peek at WarpGemm internals. Related to getCMakeCompilerTarget(). ## TODO before merge Some changes exist just for ease of testing, and will be reverted before merging: - gemm_basic.cpp has a lot of datatypes disabled because otherwise compile time is huge for testing - USE_NEW_UNIFIED_FRAMEWORK is set to 1 for easier testing	2026-06-24 13:35:25 +00:00
Illia Silin	bd3713c710	[rocm-libraries] ROCm/rocm-libraries#8716 (commit 8230b20) [CK] [Security] remove allow-unauthenticated flag from dockerfile (#8716) ## Motivation The Dockerfile uses `apt-get install --allow-unauthenticated`, which disables APT GPG signature verification and allows package installation without cryptographic validation. An attacker who can perform a man-in-the-middle attack on the build network (via corporate proxy, CI egress, or compromised mirror) can serve trojaned .deb packages such as libc6, cmake, or git that become embedded in published ROCm container images distributed to users. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-23 18:12:18 +00:00
Enrico Degregori	55e30feac6	[rocm-libraries] ROCm/rocm-libraries#8637 (commit a1a7f5f) [CK] Fix compilation ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-20 02:08:58 +00:00
Adel Johar	01bad4c3d9	[rocm-libraries] ROCm/rocm-libraries#8205 (commit f58120c) [Docs] Standardize precision support reference pages across components (#8205) ## Motivation The goal of this PR is to standardize the precision support reference page format across all components, while also reducing the maintenance burden of having to manually update the YAML data file in https://rocm.docs.amd.com/en/latest/reference/precision-support.html ## Technical Details - Each component maintains its own YAML file, which will eventually be used in https://rocm.docs.amd.com/en/latest/reference/precision-support.html - A new precision support reference page is introduced which will not override existing data type/precision support content; it will serve as the overview/summary that will be linked in the ROCm reference page ## Test Plan - Built locally, viewed each component manually ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-19 15:08:04 +00:00
Bartłomiej Kocot	7c2b979de2	[rocm-libraries] ROCm/rocm-libraries#8573 (commit 04c9f1d) [CK][CK Tile] Drop profiler for experimental builder codegen (#8573) ## Motivation Switch to dispatcher profiler for ck tile conv. ## Technical Details - Switch to dispatcher profiler for ck tile conv. - Drop profiler for experimental codegen - Minor fixes for bwd data printing - Minor fixes for 3d conv in dispatcher codegen ## Test Plan test_grouped_conv*tile ## Test Result Passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-19 09:38:44 +00:00
Enrico Degregori	2733e75900	[rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e) [CK Tile] Async support pipeline V3 ## Motivation Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async pipeline) ## Technical Details - Add `Async` bool to `Problem` struct to enable async pipeline in existing one - Add `static_move_ys` to load transpose. This generates offset in assembly instructions saving registers - Add `is_valid` to `async_get_vectorized_elements`. Before hard coded to true. It allows to support padding - Remove unnecessary restrictions to `is_a_load_tr` and `is_b_load_tr` (wider use of lds load transpose on gfx950) - Integrate async support in existing V3 pipeline (avoid pipelines duplication) - Create policy to support both async and default cases. This could be used by any async pipeline (next steps) - Define `wg_attr_num_access` separately for A and B. This allows to optimize ds_read instruction width for cases when one matrix is transposed and the other is not. Before in such cases, `ds_read_b64` was used instead of `ds_read_b128` - Add test for V3 async. Currently only supporting cases with A and B having the same type ## Test Plan New test `test_ck_tile_gemm_pipeline_compv3_async` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-19 06:57:14 +00:00
Brock Hargreaves	081fe18c1c	[rocm-libraries] ROCm/rocm-libraries#8558 (commit ccfa08b) [CK][CI] Retry git network ops to survive transient DNS blips (#8558) ## Motivation CI builds intermittently fail on transient git DNS blips (e.g. `Could not resolve host: github.com`). These surface as an untyped `exit code 1`, which the existing node/transient-fault retry doesn't catch, so a momentary glitch fails the whole build. ## Technical Details Added `gitNetRetry(label, body)` (3 attempts, 15s backoff) and wrapped every github.com-touching git step: ref-repo clone/update, `checkout scm`, and the hipTensor clone. All are idempotent on retry. Docker pulls are left to the existing `pullImage()` path. ## Test Plan - Mapped the failing build's `git remote update` DNS error to a now-wrapped call. - Confirmed no existing code retries git host-resolution failures. ## Test Result Groovy shared-library: not locally executable; needs a pipeline run to fully validate. Check CI. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-18 21:18:27 +00:00
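The retry behaviour described in the commit above (3 attempts, 15s backoff, idempotent body) can be sketched as follows. This is a Python illustration of the general pattern only, not the Groovy helper from `ck.groovy`; the function name `git_net_retry`, its parameters, and the injectable `sleep` are assumptions made for the example.

```python
import time


def git_net_retry(label, body, attempts=3, backoff_s=15, sleep=time.sleep):
    """Run body(); on failure wait backoff_s seconds and retry, up to attempts times.

    Hypothetical sketch: the body must be idempotent, since it may run several times.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return body()
        except Exception as err:  # a real helper would likely match only network errors
            last_error = err
            if attempt < attempts:
                sleep(backoff_s)
    raise RuntimeError(f"{label}: failed after {attempts} attempts") from last_error


# Simulated transient DNS failure: the first two calls fail, the third succeeds.
calls = []
sleeps = []


def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise OSError("Could not resolve host: github.com")
    return "ok"


# sleeps.append stands in for time.sleep so the example runs instantly.
result = git_net_retry("git remote update", flaky_fetch, sleep=sleeps.append)
print(result, len(calls), sleeps)
```

Passing the sleep function as a parameter keeps the sketch testable without real delays; the fixed backoff mirrors the commit message, which does not mention exponential growth.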
Brock Hargreaves	8864dcc3a4	[rocm-libraries] ROCm/rocm-libraries#8560 (commit f8362a1) [CK][CI] Post failure GitHub status on stage build errors (#8560) ## Motivation Failed CI stages (e.g. Static checks) were left stuck on a `pending` GitHub status instead of reporting `failure`, so PRs showed an overall failure with no indication of which check actually failed. ## Technical Details `buildAndTest` posted `pending`/`success` statuses but its catch only rethrew, deferring failure reporting to `runOnHealthyNode`, which deferred right back. Neither posted `failure`. This adds a `failure` status post for real build errors in `buildAndTest`, while letting node-reroute signals (`NodeFault`/`TransientFault`) and aborts (`FlowInterruptedException`) propagate untouched so retries still work. Since every stage routes through `buildAndTest`, this fixes both the directly-called `Static checks` stage and the `runOnHealthyNode`-wrapped per-arch build stages. ## Test Plan Trigger a stage failure (e.g. introduce a clang-format violation) and confirm the corresponding GitHub status context transitions `pending` → `failure` rather than remaining `pending`. ## Test Result Pending CI run on a branch with a deliberate failure to confirm the status transition. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-18 21:16:24 +00:00
Brock Hargreaves	bad7870830	[rocm-libraries] ROCm/rocm-libraries#8508 (commit 5cc3bef) [CK][CI] Make gfx1250 build compile-only ## Motivation gfx1250 has no CI hardware, so its build piggybacks on gfx90a nodes where gfx1250 binaries can be compiled but not run. The build currently fails because post-build runtime tests fire on the gfx90a node. This PR makes the gfx1250 build compile + install only. ## Technical Details The post-build test block in `buildAndTest` (`ck.groovy`) keys off the physical node arch (`gfx90a`), so runtime tests run for gfx1250. Gated that block off for gfx1250. Body-only change with no signature changes, so it's backward compatible with the develop-pinned shared library and doesn't affect other archs. ## Test Plan Trigger the gfx1250 build with `USE_CURRENT_BRANCH_FOR_CK_GROOVY=true` and confirm it compiles/installs with no runtime test steps; confirm gfx90a builds are unchanged. ## Test Result Check CI. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-18 18:33:59 +00:00
Sami Remes	a3a12b8945	[rocm-libraries] ROCm/rocm-libraries#5813 (commit 18b43cf) [CK_TILE] Enable full transpose layout support for MX GEMM pipeline (#5813) ## Enable full transpose layout support for MX GEMM pipeline (32x32x64 MFMA) ### Summary This PR enables all four matrix layout combinations (Row/Col, Row/Row, Col/Col, Col/Row) for the MX GEMM pipeline with `32x32x64` MFMA warp tiles, using `ds_read_tr` transposed LDS loads on gfx950. Previously, only the canonical `A=RowMajor, B=ColumnMajor` layout was supported. ### Changes Kernel-side transpose support: - `warp_gemm_attribute_mfma.hpp`: Introduce `kSplitFactor` logic in `get_warp_dstr_encoding` to split the K-dimension distribution encoding when `kPerLane` exceeds the `ds_read_tr` subtile minor dimension. This satisfies the `TransposeTileDistributionTraits` suffix validation required by `load_tile_transpose`. The distribution encoding now also receives the `DataType` template parameter to compute the split factor based on packed element size. - `gemm_pipeline_ag_bg_cr_comp_async.hpp`: Uncomment and enable the `InputTileDistributionTraits` logic to properly transform LDS load tile distributions for transposed reads. Add `static_assert`s to catch misconfigurations where a layout requires transpose loads but the warp tile size disables them (e.g. `KWarpTile=128` exceeds `ds_read_tr` limits). - `load_tile_transpose.hpp`: Fix `DataVec` sizing for packed types (`pk_fp4_t`): divide `vecLoadSize` by `PackedSize` to prevent buffer overflow when each physical element contains multiple logical values. - `warp_gemm_attribute_mfma_impl.hpp`: Set `kDefaultScale` to `0x7F7F7F7F` (unity in e8m0 format) for the unscaled `operator()` overloads of `WarpGemmAttributeMfmaImpl_f32_32x32x64_f8f6f4`, ensuring correct behavior with `mfma_scale_f32_32x32x64_f8f6f4`.
- `warp_gemm.hpp` / `warp_gemm_dispatcher.hpp`: Add generic `WarpGemmMfma_f32_32x32x64_f8f6f4<A, B>` alias and dispatcher specialization to support arbitrary MX data type combinations (fp4, fp6, fp8) with the 32x32x64 MFMA, consolidating the existing type-specific aliases. - `gemm_pipeline_ag_bg_cr_comp_async_default_policy.hpp`: Simplify `wg_attr_num_access` determination — `Double` for fp8, `Single` otherwise. Reference implementation fix: - `reference_gemm.hpp`: Fix nibble selection for packed 4-bit types (`pk_fp4_t`, `pk_int4_t`) in `reference_mx_gemm`, `reference_gemm`, and `reference_gemm_abquant`. The previous logic used `k % 2` or `index[K_DIM] & 1` to select which nibble to extract, which assumed K was always the fast (contiguous) memory dimension. This is only true for `A=RowMajor` / `B=ColumnMajor`. For other layouts, the fix computes the flat memory offset via `mDesc.GetOffsetFromMultiIndex(...)` and uses its parity to correctly select the nibble regardless of layout. Test infrastructure: - `test_mx_gemm_config.hpp`: Add `MxGemmConfig32` base and `MXfp4_GemmConfig32` / `MXfp8_GemmConfig32` configs for the 32x32x64 warp tile. - `test_mx_gemm_fp4.cpp` / `test_mx_gemm_fp8.cpp`: Add `Config32` test suites covering all four layout combinations. Restrict `Config16` (16x16x128) to `A=Row, B=Col` only, since `KWarpTile=128` exceeds `ds_read_tr` limits. - `test_mx_gemm_util.hpp`: Fix scale tensor layout — scales are always row-major `[M, K/32]` and column-major `[K/32, N]`, independent of A/B data layout. ### Test plan - [x] `test_ck_tile_mx_gemm_fp4` — 5/5 passed (16x16x128 Row/Col + 32x32x64 all 4 layouts) - [x] `test_ck_tile_mx_gemm_fp8` — 5/5 passed (16x16x128 Row/Col + 32x32x64 all 4 layouts) - [x] `test_ck_tile_mx_gemm_fp6` — 1/1 passed (16x16x128 Row/Col)	2026-06-18 17:05:09 +00:00
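The nibble-selection fix described in the commit above can be illustrated with a small sketch. It compares selecting the nibble by the parity of the K index against selecting it by the parity of the flat memory offset, for a layout where K is not the contiguous dimension. The helper names are hypothetical, and the convention that even offsets live in the low nibble is an assumption made for the example, not taken from `reference_gemm.hpp`.

```python
def flat_offset(index, strides):
    """Flat element offset of a multi-index (stand-in for GetOffsetFromMultiIndex)."""
    return sum(i * s for i, s in zip(index, strides))


def read_by_offset_parity(buffer, index, strides):
    """Layout-independent read: the nibble is chosen by the flat offset's parity."""
    offset = flat_offset(index, strides)
    byte = buffer[offset // 2]
    # Assumption: even offsets are stored in the low nibble, odd in the high nibble.
    return (byte >> 4) & 0xF if offset & 1 else byte & 0xF


def read_by_k_parity(buffer, index, strides):
    """Old-style read: the nibble is chosen by k % 2, valid only if K is contiguous."""
    offset = flat_offset(index, strides)
    byte = buffer[offset // 2]
    return (byte >> 4) & 0xF if index[0] & 1 else byte & 0xF


# A K x N matrix indexed as (k, n) and stored with N contiguous, so K is NOT
# the fast dimension. Logical value at (k, n) is k * N + n.
K, N = 2, 4
strides = (N, 1)
values = list(range(K * N))
packed = bytes(values[i] | (values[i + 1] << 4) for i in range(0, K * N, 2))

correct = read_by_offset_parity(packed, (0, 1), strides)
wrong = read_by_k_parity(packed, (0, 1), strides)
print(correct, wrong)
```

For element `(k=0, n=1)` the flat offset is 1, so the value sits in the high nibble of the first byte; the K-parity rule picks the low nibble instead and returns the neighbouring element.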
Illia Silin	e2deaaba64	[rocm-libraries] ROCm/rocm-libraries#8591 (commit 5210ae6) [CK] fix daily hipTensor tests. ## Motivation Had to change the way hipTensor is cloned to make sure it doesn't erase the CK installation and uses the correct installation path. Also added the "install" target every time we build and test everything, so that CK can be used for testing third-party libs that depend on it. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-18 14:58:10 +00:00
Enrico Degregori	1762eaeaec	[rocm-libraries] ROCm/rocm-libraries#8535 (commit a0f47eb) [CK Tile] EightWaves pipeline int8 support ## Motivation EightWaves pipeline currently is supporting only FP types ## Technical Details - Enable 16x16x64 int8 instruction for gfx950 in dispatcher - Enable int8 in EightWaves pipeline - Add tests - Fix bug in `warp_gemm_attribute_mfma_impl.hpp` ## Test Plan Tests have been added for int8 GEMM using EightWaves pipeline ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-18 12:59:59 +00:00
Ville Pietilä	60b276647b	[rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39) [CK Tile] Rule-based configuration generation in CK Dispatcher codegen (#8157) ## Motivation The CK Tile Dispatcher code generation for CK Tile Profiler relies on flat JSON files to list the generated configurations. This approach has the following problems: - The JSON files are verbose - The JSON files easily get out of sync with the CK Builder .config files from which they were generated. - The JSON-file-based configuration makes it hard to explicitly list the rules that govern the instance generation. ## Technical Details Replaced the JSON files with a rule-based configuration. To preserve the existing functionality, the `profiler` and the `tests` instance sets are generated directly from the CK Builder config files. The JSON config files are removed from source control, and the "on-the-fly" generation guarantees that the Dispatcher codegen uses up-to-date configurations. This PR introduces six different rule sets for the CK Tile Dispatcher code generation: 1. `profiler`: matches the old JSON set of profiler configurations. 2. `tests`: matches the old JSON set of tests configurations. 3. `full`: full configuration set created from a rule-based config selection 4. `full-tests`: a subset of `full` for generating configurations for convolution integration tests. 5. `tiny`: a subset of `full-tests` to produce the minimal set of configurations to test the Dispatcher codegen. 6. `default`: the default rules, which correspond to the existing heuristic rules for configuration selection. This ensures that ML-based kernel selection doesn't get broken. The main use of the `full` rule set is to define a reasonable solution space for the possible implicit GEMM configurations. We start from the configurations that are allowed by the device architecture.
The `full` rule set defines the relevant tile sizes for each convolution direction. From the tile size we have a curated mapping to the number of waves over the different GEMM axes, i.e., we describe how many waves each GEMM dimension corresponds to. The GEMM-K wave tile dimension can be computed from the other parameters and does not need to be listed explicitly. An orthogonal axis to the tiling strategy is the vectorization strategy. This is mainly defined by the data type and hardware, as in general we want to use the maximum possible load widths. The maximum sizes for each convolution direction variant are defined by the implicit GEMM matrix dimensions. For cases where we have a low number of channels per convolution group, we need smaller vector load sizes. These are captured by the `VecStrategy` enumeration in the codegen rules. The problem with the rule-based configuration selection is that we "over-generate" configurations. The old JSON configurations make up approximately 25% of all configurations that the `full` rule set creates. The additional configurations are valid, but they may not provide any performance benefits. Hence, we keep the `profiler` and `tests` rule sets for now to avoid building an excessive number of configurations by default. The `full` rule set can be enabled by specifying the CMake configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the `tests` rule set is used, i.e., we don't change the existing behaviour. ## Test Plan Added a new stage in the CI/CD pipeline that ensures the Dispatcher codegen rules are up to date. Otherwise the functionality is covered by the existing CI/CD tests. There are no functional changes to the convolution kernels, only to how the different instances are generated. ## Test Result If the CK Tile conv instances build without errors, the Dispatcher codegen is generating valid code. If all tests in the CI/CD pipeline are passing, the Dispatcher codegen generates valid instances.
## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-18 01:22:50 +00:00
Aviral Goel	c43b550206	[rocm-libraries] ROCm/rocm-libraries#8202 (commit 0911fa0) [GFX1250][CK_TILE] Add scale16 (ScaleBlockSize=16) support to MX GEMM TDM pipeline (#8202) Enables `ScaleBlockSize=16` end-to-end for the FP8/BF8 MX GEMM TDM pipeline, building on the scale16 warp-gemm layer already in develop. - warp gemm: add the 32x32x128 f8f6f4 scale16 traits and alias (2x2 grid of 16x16x128 scale16 intrinsic calls with per-subtile `SCALE_OPSEL`), and route 32x32 f8f6f4 through the dispatcher's `IsScale16` path. - default policy: select the warp gemm via the dispatcher with `IsScale16=(ScaleBlockSize==16)` so `WarpTile=16` and `WarpTile=32` each pick the matching scale16 path; guard WarpTile M/N to 16 or 32; scale-tile distribution for the scale16 layout. - pipeline V1/V2: thread `Problem::ScaleBlockSize` through the scale-window setup (replacing the hardcoded 32); expose `ScaleBlockSize` for the kernel. - block gemm: extract int64 (scale16) / int32 (scale32) scales by width. - kernel: scale16 descriptor order; reject unsupported `BlockScaleSize`. Test coverage for this path is in the stacked follow-up PR.	2026-06-17 16:41:00 +00:00
jakpiase	65bef78383	[rocm-libraries] ROCm/rocm-libraries#8518 (commit 1ad69c3) [CK] Add support for large tensor index handling into conv bwd data (#8518) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-17 15:51:36 +00:00
Illia Silin	b5713be6cd	[rocm-libraries] ROCm/rocm-libraries#8501 (commit 54eb5dc) [CK] disable DPP kernels by default ## Motivation The dpp8 instruction has been disabled in the upstream llvm-project in the latest compiler version, so we're hitting compilation errors with the staging compiler: <inline asm>:2:33: error: not a valid operand. v_dot2c_f32_f16_dpp v6, v8, v7 dpp8:[0, 0, 0, 0, 0, 0, 0, 0] ^ error: cannot compile inline asm These instructions are used for fp16 gemms that are slightly faster than dl gemms on gfx10, but are not critical. We are disabling these kernels for now, until a better solution is available, to unblock builds with the staging compiler. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-17 14:03:00 +00:00
SamiAario-AMD	39182b50eb	[rocm-libraries] ROCm/rocm-libraries#8487 (commit 06a73ba) Skip tests on gfx11 that have intermittent failures ## Motivation On gfx11, skip every load_and_convert_tile case where X and Y differ, since these cases fail sporadically. Same-type tuples (half/half, bf16/bf16, fp8/fp8) have been stable. ## Test Result Stress-tested on gfx11, gfx12, and gfx950 with 10000 iterations of the tests. No remaining test failures were detected. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-17 11:07:22 +00:00
damien-lejeune	5bebfd460f	[rocm-libraries] ROCm/rocm-libraries#8492 (commit 46b6a06) Add tile size for FMHA batch prefill bf16 for MI308X ## Motivation Adding a tile size adapted to MI308X, for the FMHA Batch Prefill BF16 input type case ## Technical Details N/A ## Test Plan Benchmarking from the Aiter side with: ``` python3 op_tests/test_batch_prefill.py -s 8000 -p 1 -q 4 -k 1 --head_dim 256 -c true -d bf16 --input_dtype bf16 --quant_method none --kv_layout linear -t sglang -l 0.0 --return_lse false --profile ``` ## Test Result We see an improvement with the new tile size on MI308X (both with PLT mode OFF and ON) ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2026-06-17 06:22:26 +00:00
damien-lejeune	2c0b7cbb0a	[rocm-libraries] ROCm/rocm-libraries#8424 (commit debb669) Add missing constraint in the FMHA qr async pipeline to enforce bk0=bk1 (#8424) ## Motivation The purpose of this change is to add a guardrail to what values bk0 and bk1 can take. This is to avoid ill defined sizes, silently failing and generating NaN (or other error) at runtime. An example of such failure can be obtained using the tile engine: ``` cd rocm-libraries/projects/composablekernel/tile_engine/ops/fmha python fmha_benchmark.py configs/batch_prefill.json \ --problems "1,4,1,8000,8000,256" \ --filter "c.data_type=='bf16' and c.hdim_q==256 and c.pipeline=='qr_async' and c.mode=='group' and c.tile_n0==32 and c.tile_k0==64" ``` ## Technical Details The qr_async pipeline stages data in the K dimensions into LDS using a bk1-descriptor, while the (Q*K^T) gemm0 consumes bk0 ## Test Plan See command above ## Test Result Before the change: (invalid) generate instances, error at runtime After this change: no instance generated ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2026-06-16 07:41:58 +00:00
Brock Hargreaves	1b649a8d4b	[rocm-libraries] ROCm/rocm-libraries#8332 (commit 48c389c) [CK][CI] Retry builds on node failure with automatic rerouting (#8332) ## Motivation When a Jenkins node enters a bad state (missing GPU driver, dead Docker daemon, full disk), every PR scheduled onto it fails the same way until a human manually takes it offline. Some failures are also transient and would pass on a simple retry. Today the pipeline does neither: every failure goes straight to red on the same node. ## Technical Details Two new retry behaviors based on failure type: - Different node for persistent node faults (driver missing, daemon down, disk full, container won't start) - Retry in place for transient glitches (registry pull, DNS), then a different node if retries are exhausted Real build/compile failures and aborted builds are never retried. New: `src/org/ck/NodeFault.groovy`, `TransientFault.groovy`: typed exceptions in the shared library `src/` for stable classloader identity under dynamic library loading. `vars/ck.groovy`: adds `preflight()` (host health checks before build), `pullImage()` (classifying pull failures at the call site, replacing `getDockerImage()`), `runOnHealthyNode()` (outer reroute loop, up to 3 nodes), `runInPlace()` (same-node transient retries). GitHub failure status is only set once all retries are exhausted. `Jenkinsfile`: all active `Build CK and run Tests` stages converted to `agent none` + `ck.runOnHealthyNode(…)`. ## Test Plan Tested on `users/brockhargreaves-amd/ck/node-failure-retry-logic` with `USE_CURRENT_BRANCH_FOR_CK_GROOVY=true`. Verified preflight logging, reroute on node fault, attempt counter in logs, no retry on aborts, and single failure status report after budget exhausted. ## Test Result Retry logic working as expected. 
Three bugs found and fixed during testing: false `NodeFault` from host-level sccache probe (sccache is in-container), `null` node name in catch logging, and `sh` calls outside `node()` context in status reporting. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 17:40:10 +00:00
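The retry policy in the commit above can be sketched as two nested loops. This is a Python illustration of the described behavior, not the Groovy code in `vars/ck.groovy`: the names `run_in_place` and `run_on_healthy_node` mirror `runInPlace()` and `runOnHealthyNode()`, the limit of 3 nodes comes from the commit message, and the number of same-node retries (`max_transient_retries=2`) is an assumption.

```python
class NodeFault(Exception):
    """Persistent node problem (driver missing, daemon down, disk full)."""


class TransientFault(Exception):
    """Glitch that may pass on retry (registry pull, DNS)."""


def run_in_place(build, node, max_transient_retries=2):
    """Retry transient faults on the same node; escalate once retries run out."""
    last_error = None
    for _ in range(max_transient_retries + 1):
        try:
            return build(node)
        except TransientFault as err:
            last_error = err  # same node, try again
    raise NodeFault(f"transient retries exhausted on {node}") from last_error


def run_on_healthy_node(build, nodes, max_nodes=3):
    """Outer reroute loop: move to the next node on NodeFault, up to max_nodes."""
    tried = []
    for node in nodes[:max_nodes]:
        tried.append(node)
        try:
            return run_in_place(build, node)
        except NodeFault:
            continue  # reroute to a different node
    raise RuntimeError(f"no healthy node found, tried {tried}")
```

Any other exception, such as a real compile failure, is not caught by either loop and so propagates on the first attempt, which matches the statement that build failures are never retried.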
Andriy Roshchenko	b8440b3aeb	[rocm-libraries] ROCm/rocm-libraries#8325 (commit 559eaf6) [GFX1250][MX GEMM] Unified FLATMM GroupedGemm Implementation for MX Data Types (#8325) ## Motivation Design and test a unified FLATMM GroupedGemm interface so that it supports all MX FP8, FP6, and FP4 data types on both the gfx950 and gfx1250 architectures and works seamlessly across these platforms. ## Technical Details Implementation exposes Grouped Gemm interface for MX FLATMM and MX TDM FLATMM pipelines. ## Test Plan Add the following tests: - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_non_tdm.cpp - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_tdm.cpp - ck_tile/flatmm/test_mx_flatmm_persistent.cpp Verify on the gfx950 and gfx1250 architectures. ## Test Result All tests pass. Verified on A0 hardware with rocm-7.14.0a20260517 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 16:12:33 +00:00
Sami Remes	c1f7104852	[rocm-libraries] ROCm/rocm-libraries#6663 (commit f19fc01) [CKTile] Fix MX GEMM: num_loop==3 dispatch, split-K, unsupported-shape guard (#6663) Three independent MX GEMM correctness bugs reported against example/ck_tile/42_mx_gemm (fp8xfp8, A=Row/B=Col) on MI350X, plus one host-side atomic-add accumulation bug in the example's repeat loop. - Pipeline (gemm_pipeline_ag_bg_cr_comp_async.hpp): BlockHasHotloop required num_loop > PrefetchStages, which let num_loop == 3 enter a hot loop that produced 5 gemm accumulations instead of 3 (K == 3K_Tile, e.g. K=768, deterministically wrong). Require num_loop >= 4 instead: pre-pipeline + TailNumber::Three already totals exactly 3. - Kernel (gemm_mx_kernel.hpp): split-K was silently broken because GridSize did not thread k_batch into blockIdx.z and the scale tile windows were anchored at K=0 for every k_id. Every k_id >= 1 therefore read the wrong packed scales. Fix: GridSize returns dim3(grid_x, 1, k_batch) (persistent and non-persistent). * MakeScaleA/BBlockWindows accept a k_elem_offset and translate it to a packed-scale K offset (also apply pad_tensor_view so OOB scale loads return zero, matching A/B padding). * operator() derives k_id from blockIdx.z, uses GetSplitKElemOffset (matches Underlying::SplitKBatchOffset's K1-aligned formula), and dispatches the epilogue with memory_operation_enum::atomic_add for k_batch > 1, set for k_batch == 1. Same fp16/bf16 even-vector-size guard as UniversalGemmKernel. * MakeCBlockWindows templated on DstInMemOp; unconditionally applies pad_tensor_view using kPadM/kPadN so partial trailing M/N tiles are handled correctly. - Compile- and runtime unsupported-shape guards (gemm_mx_kernel.hpp): add IsSupportedArgument and a static_assert for configurations that produce silent wrong results: * static_assert(!kPadK) -- the MX comp-async pipeline uses async_load_tile whose OOB check is per-vector-start, so a vector straddling the K pad boundary reads garbage. 
Until the async path learns per-element pad masking, reject kPadK at compile time. * Runtime: k_batch >= 1; M/N multiples of MPerBlock/NPerBlock when kPadM/kPadN are false; M >= MPerBlock and N >= NPerBlock always (CShuffleEpilogue cannot safely run with a single partial tile); K % (KPerBlock * k_batch) == 0; and for k_batch > 1, K must be a multiple of WarpTile_K * k_batch so every split lands on a packed-scale boundary. * All error paths log under CK_TILE_LOGGING with actionable messages. - Example (example/ck_tile/42_mx_gemm/mx_gemm_instance.hpp): * Call Kernel::IsSupportedArgument up front and throw a clear runtime_error for rejected shapes (was silently launching an unsupported kernel). * Switch to launch_kernel_time_mask with a clear_gemm_output preprocess that zeroes C between iterations when k_batch > 1 (mirrors universal_gemm_invoker). Without this the default -warmup=50 -repeat=100 accumulated 150 atomic_adds into C after the kernel-side split-K fix. Tests (test/ck_tile/gemm_mx/): - Add MXfp8_GemmConfig16_PadMN (kPadM = kPadN = true). - test_mx_gemm_fp8.cpp: HotLoopTailNumLoopThree (K=768 regression), SplitK (k_batch=2,4 across full_k/partial_k paths), TestMxGemmFp8PadMN::{MNPaddingAligned, MPadding, NPadding, MNPadding} covering trailing partial tiles along M, N, or both. - Run(...) now takes k_batch. - packScalesMNxK: guard against OOB (mn, k) reads from src and initialise e8m0 bytes to the zero exponent (0x00) instead of the default-constructed NaN (0xFF), so padded lanes don't poison the packed int32_t shared with in-range lanes. - test_mx_gemm_instance.hpp: call IsSupportedArgument before launch. Verification on gfx950, ROCm 7.2.0: - ctest -R test_ck_tile_mx_gemm -> 100% (2/2). - Example sweep over the original bug-report shapes: all K-aligned shapes now validate correct (including 4096^3 sk=2 and the K=768 cases); all K=128 shapes cleanly rejected with the new error message instead of producing silent wrong results. 
Made-with: Cursor ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 08:28:55 +00:00
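The runtime shape checks listed in the MX GEMM commit above can be sketched as a single predicate. This is a Python illustration of the listed conditions, not the C++ `IsSupportedArgument` in `gemm_mx_kernel.hpp`; `TileConfig`, `unsupported_reasons`, and the tile sizes in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TileConfig:
    # Illustrative stand-ins for MPerBlock, NPerBlock, KPerBlock, WarpTile_K, kPadM, kPadN.
    m_per_block: int
    n_per_block: int
    k_per_block: int
    warp_tile_k: int
    pad_m: bool = False
    pad_n: bool = False


def unsupported_reasons(m: int, n: int, k: int, k_batch: int, cfg: TileConfig) -> List[str]:
    """Return why a shape is rejected; an empty list means it is supported."""
    if k_batch < 1:
        return ["k_batch must be >= 1"]
    reasons = []
    if m < cfg.m_per_block or n < cfg.n_per_block:
        reasons.append("M and N must each cover at least one full block tile")
    if not cfg.pad_m and m % cfg.m_per_block != 0:
        reasons.append("M must be a multiple of MPerBlock when kPadM is false")
    if not cfg.pad_n and n % cfg.n_per_block != 0:
        reasons.append("N must be a multiple of NPerBlock when kPadN is false")
    if k % (cfg.k_per_block * k_batch) != 0:
        reasons.append("K must be a multiple of KPerBlock * k_batch")
    if k_batch > 1 and k % (cfg.warp_tile_k * k_batch) != 0:
        reasons.append("K must be a multiple of WarpTile_K * k_batch for split-K")
    return reasons


# Assumed tile sizes, chosen only for illustration.
cfg = TileConfig(m_per_block=128, n_per_block=128, k_per_block=256, warp_tile_k=128)
print(unsupported_reasons(4096, 4096, 4096, 2, cfg))
print(unsupported_reasons(4096, 4096, 128, 1, cfg))
```

Returning the list of reasons rather than a bare boolean mirrors the commit's intent that every rejection comes with an actionable message.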
damien-lejeune	aab1d219f5	[rocm-libraries] ROCm/rocm-libraries#8350 (commit f92ded1) Add tile shape for FMHA batch prefill on MI308X (on fp8, hdim=256) (#8350) ## Motivation Add a tile size appropriate for FMHA batch prefill fp8/hdim256 on MI308X ## Technical Details Appending the tile shape to the existing factory such that it can be picked up by Aiter ## Test Plan Ran the performance test on both MI300X and MI308X ## Test Result MI300X performance seems unaffected by this change. MI308X does improve. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2026-06-15 07:00:35 +00:00
SamiAario-AMD	947dcc2606	[rocm-libraries] ROCm/rocm-libraries#5510 (commit 8415c8c) [CK Tile] Add transposed tile load implementation, and tests for load_and_convert_tile (#5510) ## Motivation Mixed precision b/fp16 x fp8 requires a transposed tile load implementation that supports mixed precision using these types. Implement this, use it in `load_and_convert_tile`, and add a unit test for `load_and_convert_tile` which covers this functionality. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 06:42:28 +00:00
ltqin	0954a8f3fa	[rocm-libraries] ROCm/rocm-libraries#8262 (commit d4ff8fc) [CK_TILE] Add graph capture support for FMHA backward(new branch) (#8262) ## Motivation Add HIP graph capture support for FMHA backward operations. The original implementation only supported normal execution mode and would cause use-after-free crashes when used with graph capture replay. When FMHA backward is captured into a HIP graph: - First replay: host callback executes and deletes the closure (as designed for normal mode) - Subsequent replays: use-after-free crash because the closure was already freed This PR enables `fmha_bwd_launcher::prepare_workspace_async()` to work correctly in both normal execution and graph capture modes.	2026-06-14 03:11:53 +00:00
Johannes Graner	01cca38c8e	[rocm-libraries] ROCm/rocm-libraries#8220 (commit 4c04a3a) [CK Tile] WAVELET pipeline for backward-data grouped convolution (#8220) ## Motivation On the RetinaNet shapes (gfx950, fp16) CK Tile backward-data conv was ~18% behind classic CK, with the gap concentrated in the K=2376 3x3 detection-head family where bwd_data spends most of its time. The WAVELET GEMM pipeline already gives uplift for forward and backward-weight conv; this ports it to backward-data and consolidates the now-shared machinery across all three directions. ## Technical Details - Backward-data wavelet support in the tile kernel: launch extra load waves when the pipeline exposes `LaunchBlockSize`, and split the epilogue into math waves (run the CShuffle epilogue) and load waves (`RunBarrierStub`). - Register 7 WAVELET instances (fp16 and bf16), tuned for backward-data's tall-skinny GEMM rather than the forward tile shapes: a big-M `256/128/64` workhorse, a `VecA=4` variant for the `K % 8 != 0` shapes, and a `NumGroupsToMerge=32` variant for grouped (depthwise-style) shapes. - Implement the native backward-data instance parser in `generate_instances.py`. - Deduplicate the wavelet machinery shared by forward, backward-data, and backward-weight: `GroupedConvLaunchBlockSize`, `is_wavelet_pipeline`, and `RunWaveletAwareEpilogue` in `grouped_convolution_utils.hpp`; the three native instance parsers collapse to one parameterized parser. The three kernels now call the shared helpers. ## Test Plan - Rebuild the full profiler instance pools for all three directions (fp16/bf16/fp32, nhwgc/ndhwgc) to exercise the shared helpers across every instantiation. - Tile GTests on gfx950: `test_grouped_convnd_fwd_tile`, `test_grouped_convnd_bwd_data_tile`, `test_grouped_convnd_bwd_weight_tile`. 
- Per-shape sweep of the 35 RetinaNet backward-data shapes vs classic CK and the non-wavelet tile pool (`profile_wavelet_bwd_data.py`); correctness spot-checked with GPU-reference verification on the new big-M and NumGroupsToMerge instances. ## Test Result - GTests pass: forward 9/9, backward-data 6/6, backward-weight 6/6. - Backward-data perf (3x3 g=1 region, geomean classic/tile): 0.88 -> 1.11, i.e. the tile path goes from ~12% slower than classic to ~8% faster. The largest single backward-data shape (256x100x100->2376) moves from 11% slower than classic to 12.5% faster. - The dedup refactor preserves behavior (net -174 lines across the kernels/generator), confirmed by the full rebuild and the GTests above. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-13 00:10:50 +00:00
John Afaganis	329e589840	[rocm-libraries] ROCm/rocm-libraries#8260 (commit 1139236) [ck] Enforce LF-only line endings in C/C++ sources ## Summary Several CK source files carry Windows CRLF line endings (a trailing carriage return on each line), introduced by editors configured for Windows endings or copy/paste from Windows tooling. These are purely cosmetic but they pollute diffs (whole-file churn the first time someone makes an LF edit), confuse `clang-format`, and are inconsistent with the LF-only convention used across the rest of the tree. This PR (a) normalizes every existing CRLF file (6 files) to LF and (b) adds a pre-checkin gate so new CRLF leaks are rejected before merge. ## File extensions covered Both the cleanup scan and the new Jenkins enforcement stage use the same predicate as the adjacent `ASCII Only Check` stage: ``` .h .hpp .cpp .h.in .hpp.in .cpp.in .inc .cl ``` (excluding `/build/` and `/include/rapidjson/`). The local pre-commit hook's `c++/inc` type filter covers the same set. ## Why no enforcement today CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so the existing `pre-commit` workflow doesn't touch CK. The local CK `.pre-commit-config.yaml` only runs for developers who installed hooks. The authoritative gate is therefore the new Jenkins stage in this PR; the local hook is convenience. ## Commit layout (bisect-friendly) 1. `[ck] Normalize CRLF line endings to LF in C/C++ sources` Mechanical line-ending cleanup across 6 files. No content change: every edit is purely CRLF -> LF, verified with `git diff --ignore-cr-at-eol` reporting an empty diff. 2. `[ck] Enforce LF-only line endings in C/C++ sources` - New `projects/composablekernel/script/check_no_crlf.sh` (modeled on `check_ascii_only.sh`). - New `crlf-checker` entry in `projects/composablekernel/.pre-commit-config.yaml` under the local-hooks block (`types_or: [c++, inc]`). 
- New `CRLF Check` parallel stage in `projects/composablekernel/Jenkinsfile`'s `Static checks` block, mirroring the adjacent `ASCII Only Check` stage. Always-on, no `RUN_CPPCHECK` gate. The tree is buildable at every commit boundary. Commit 1 leaves 0 CRLF violations; commit 2 wires the gate. ## Demo Script output on a synthesized violation: ``` $ printf 'int main() {}\r\n' > /tmp/bad.cpp $ projects/composablekernel/script/check_no_crlf.sh /tmp/bad.cpp ERROR: /tmp/bad.cpp contains CRLF (Windows) line endings: 1:int main() {}<CR> Fix: convert to LF, e.g. 'sed -i 's/\r$//' /tmp/bad.cpp' or 'dos2unix /tmp/bad.cpp' $ echo $? 1 ``` Full repo scan after the cleanup commit: ``` $ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \ -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \ -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \ | xargs -0 -P 8 -n 64 script/check_no_crlf.sh $ echo $? 0 ``` ## Test plan - [ ] Jenkins PR build: confirm new `Static checks -> CRLF Check` stage runs green over the full predicate and the existing `ASCII Only Check` / `Clang Format` stages are unaffected. - [ ] Local: `pre-commit run crlf-checker --all-files` runs cleanly after installing CK pre-commit hooks. - [ ] Manually inject a CRLF line ending in any `.cpp/.hpp/.inc` file, push: confirm Jenkins fails the new stage with a clear error. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-12 21:11:59 +00:00
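The check that the commit above enforces can be sketched in a few lines. This is a Python illustration of the idea only; the actual gate is the shell script `check_no_crlf.sh`, whose implementation is not shown here, and the helper names `crlf_lines` and `to_lf` are hypothetical.

```python
def crlf_lines(data: bytes) -> list:
    """Return the 1-based numbers of lines that end in CRLF."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if line.endswith(b"\r")  # a carriage return left before the newline
    ]


def to_lf(data: bytes) -> bytes:
    """Normalize CRLF line endings to LF without touching other bytes."""
    return data.replace(b"\r\n", b"\n")


sample = b"int main() {}\r\n"
print(crlf_lines(sample))         # offending line numbers
print(crlf_lines(to_lf(sample)))  # empty once normalized
```

Working on bytes rather than decoded text keeps the check independent of file encoding, which matters because the same tree is also covered by a separate ASCII-only check.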
Brock Hargreaves	96a7e44832	[rocm-libraries] ROCm/rocm-libraries#8378 (commit d68585d) [CK] Pre-emptively add groovy/ folder and skip TheRock CI filter (#8378) ## Motivation The CK Groovy library is growing and will be reorganized into a self-describing `groovy/` folder rather than living under `src/` and `vars/`. This PR creates that folder pre-emptively and adds it to the TheRock CI skip-list so that future Groovy additions do not unnecessarily trigger TheRock builds. ## Technical Details - Added `projects/composablekernel/groovy/` with a `.gitkeep` to establish the directory in the repo. - Added `"projects/composablekernel/groovy/"` to `SKIPPABLE_PATH_PATTERNS` in `.github/scripts/therock_configure_ci.py` alongside the existing `vars/` entry, ensuring changes confined to Groovy pipeline code are recognized as non-therock-relevant and skip the TheRock CI pipeline. ## Test Plan No code logic was changed. Verified that `therock_configure_ci.py` pattern list is consistent with the existing `vars/*` skip entry and that the new pattern follows the same glob convention. ## Test Result N/A: directory scaffolding and CI filter only; no functional code affected. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 20:11:53 +00:00
Illia Silin	d450749933	[rocm-libraries] ROCm/rocm-libraries#8357 (commit 800965c) [CK] Re-enable HIPRTC codegen tests for all CK PRs. ## Motivation When we introduced the smart test filter to only build and run tests affected by the PR changes, we disabled the client examples, which required a full CK build, and also the hiprtc tests that were grouped with the client examples. As a result, a few PRs that broke hiprtc compilation slipped through. By restoring the hiprtc tests in all PRs, we should close this gap. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 19:19:44 +00:00
Illia Silin	789ef38093	[rocm-libraries] ROCm/rocm-libraries#8333 (commit 69b3fc1) Revert "[CK_TILE] Implement RTC API for a subset of FMHA functionality for MGX" (#8333) Reverts ROCm/rocm-libraries#6086 Need to revert as the codegen test for fmha is failing due to including a std header: [2026-06-11T22:36:03.673Z] In file included from /tmp/comgr-953928-0-473822/include/ck/host/device_fmha_fwd/fmha_fwd_wrapper.hpp:8: [2026-06-11T22:36:03.673Z] In file included from /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/cmath:49: [2026-06-11T22:36:03.673Z] In file included from /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/std_abs.h:38: [2026-06-11T22:36:03.673Z] /usr/include/stdlib.h:32:10: fatal error: 'stddef.h' file not found [2026-06-11T22:36:03.673Z] 32 | #include <stddef.h> [2026-06-11T22:36:03.673Z] | ^~~~~~~~~~ The ck_tile headers were never prepped for hiprtc compilation.	2026-06-12 18:19:31 +00:00
Wojciech Laskowski	c2601f38b7	[rocm-libraries] ROCm/rocm-libraries#6569 (commit 393049e) Adding amdgcn_mma specializations for sparse MFMA builtins (#6569) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It is the fourth in the series of PRs (after https://github.com/ROCm/rocm-libraries/pull/5801, https://github.com/ROCm/rocm-libraries/pull/6014 and https://github.com/ROCm/rocm-libraries/pull/6567) that add all the necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse MFMA intrinsics. ## Technical Details This change adds new specializations for MFMA sparse builtins. In total, we add 27 MFMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Tests pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 12:48:29 +00:00
Enrico Degregori	e75076c826	[rocm-libraries] ROCm/rocm-libraries#8310 (commit 003bc6b) [CK Tile] Fix assert usage MX GEMM ## Motivation See issue https://github.com/ROCm/rocm-libraries/issues/8223 ## Technical Details - Use `std::runtime_error` in `mx_processing.hpp` - Use `static_assert` in `tensor_shuffle_utils.hpp` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 11:42:38 +00:00
Thrupti Raj Lakshmana Gowda	d7609923b6	[rocm-libraries] ROCm/rocm-libraries#7919 (commit 061001d) Users/tlakshma/ck/tile engine develop ## Motivation This PR adds multiple new GPU kernel benchmarking operations to the CK Tile Engine, expanding its coverage of GEMM-family operations: - gemm_multi_abd: GEMM with multiple A, B, and D tensors, enabling epilogue patterns such as scale/bias fusion. - batched_contraction: Batched tensor contraction supporting multi-dimensional batch (G), M, N, and K dimensions, targeting workloads where the contraction indices span more than one logical axis. - mx_gemm: MX-format GEMM with microscaling (e8m0) scale tensors. - gemm_rowcolquant: Block-scale GEMM with row/column quantization. - gemm_tensor_quant: Block-scale GEMM with tensor quantization. - grouped_gemm_rowcolquant: Grouped GEMM with row/column quantization. - grouped_gemm_tensorquant: Grouped GEMM with tensor quantization. - batched_gemm: Batched GEMM benchmarking support. ## Technical Details ### gemm_multi_abd - New subdirectory: tile_engine/ops/gemm/gemm_multi_abd/ - CMakeLists.txt follows the same individual-target pattern as gemm_universal / gemm_multi_d. - gemm_multi_abd_instance_builder.py subclasses GemmKernelBuilder from the shared gemm_instance_builder.py. - gemm_multi_abd_benchmark.py delegates to the shared GemmBenchmark parent class. - Configs: default_config.json, default_ci_config.json, user_provided_config.json. - Supported GPU targets: gfx90a, gfx942, gfx950, gfx1201. ### batched_contraction - New subdirectory: tile_engine/ops/gemm/batched_contraction/ - Extends GemmKernelBuilder via BatchedContractionKernelBuilder, adding num_dim_g, num_dim_m, num_dim_n, num_dim_k, num_d_tensors, and elementwise_function parameters. - Layout string uses 3-character encoding (A+B+E), e.g. rcr. - Self-contained benchmark sweep driver (batched_contraction_benchmark.py) with JSON/CSV export and best-kernel selection. - Supported GPU targets: gfx90a, gfx942, gfx950. 
### mx_gemm - New subdirectory: tile_engine/ops/gemm/mx_gemm/ - Supports MX-format (e8m0) microscaling for A and B scale tensors. ### block_scale_gemm (gemm_rowcolquant, gemm_tensor_quant) - New subdirectory: tile_engine/ops/gemm/block_scale_gemm/ - gemm_rowcolquant: row/column quantization epilogue. - gemm_tensor_quant: tensor-level quantization epilogue. ### grouped_gemm_quant (grouped_gemm_rowcolquant, grouped_gemm_tensorquant) - New subdirectory: tile_engine/ops/gemm/grouped_gemm_quant/ - grouped_gemm_rowcolquant: grouped GEMM with row/column quantization. - grouped_gemm_tensorquant: grouped GEMM with tensor quantization. ### batched_gemm - New subdirectory: tile_engine/ops/gemm/batched_gemm/ - Batched GEMM benchmark support wired into the sampling/active-op lists. All new ops are registered in op_weights.json for budget allocation and wired into the active-op sampling lists in CMakeLists.txt. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-11 20:38:38 +00:00
jefyang1	276863ca87	[rocm-libraries] ROCm/rocm-libraries#8259 (commit df03f10) Add cluster launch in test ck_tile mx gemm tdm wmma ## Motivation Add a cluster launch test in test_ck_tile_mx_gemm_pipeline_tdm_wmma on gfx1250, so that we can check the performance on gfx1250 hardware. ## Technical Details Added an out-of-bounds guard in RunGemm of MxGemmKernel to skip blocks padded by cluster alignment. Added ClusterEnable/ClusterDisable aliases and extended the tuple in test_mx_gemm_pipeline_kernel_types.hpp with two kernel types that use ClusterEnable, for F8 CompTDMV1 and CompTDMV2 respectively. The existing F4 non-ClusterLaunch kernel types have an issue that still needs to be fixed, so this PR does not include F4 cases. ClusterLaunch is read from the tuple in test_mx_gemm_pipeline_util.hpp. Updated invoke_mx_gemm to branch on ClusterLaunch, which includes adding cluster size constants and switching the GemmShape type, the TilePartitioner type, and the kernel launch call. ## Test Plan Tested the changes on gfx1250 FFM. ## Test Result The added kernel types (instances) passed the tests on gfx1250 FFM. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-11 17:33:11 +00:00
music-dino	359f664b25	[rocm-libraries] ROCm/rocm-libraries#6086 (commit d25d8cc) [CK_TILE] Implement RTC API for a subset of FMHA functionality for MGX (#6086) ## Motivation Introduce a wrapper for the FmhaFwdKernel, for use in runtime compilation in MIGraphX. ## Technical Details The intent of the API is to provide multiple instances of the FmhaFwdKernelWrapper, suitable for a particular problem definition. At the moment the wrapper only supports bias and causal masking; feature expansion will come in a future PR. The usage pattern is, in short: 1. Define fmha_fwd::Problem (input dimensions, data type, etc.) 2. Fetch Solutions for the target architecture (currently only gfx942) based on the Problem. The solutions contain a map of template -> template parameter and can be converted to a string representing the full instantiation of FmhaFwdKernelWrapper, e.g. `ck_tile::FmhaFwdWrapper<ck_tile::fp16_t, 128, 64, 16, 32, 32, 32, 4, 1, 1, 4, 1, 1, 32, 32, 16, 32, 32, 16, false, true, false, true, true, true, true, ck_tile::FmhaPipelineTag::QR>` 3. The instance can then be used in an RTC kernel. The kernel needs to: * Construct a Descriptor (containing descriptions of all input tensors) * Call IsValid() on the descriptor to check if the instance is applicable. Note that this is constexpr by design so that it can fail the kernel compilation as a signal that the kernel is not applicable. * Pass the descriptor and input pointers to the wrapper Run method. A more detailed example of usage can be found in codegen/test/fmh_fwd.cpp Besides the work on creating the wrapper and the supporting API, the PR also contains some changes necessary to enable compilation with HIPRTC. The contents of the CK tile headers are embedded in a binary file which is used to pass the header files as strings to HIPRTC. Many of the ck tile headers contain host-only code, which leads to compilation failures. 
ck_tile_headers_preprocessor goes through the embedded headers and removes the bodies of host only functions, thereby eliminating the compilation failures. ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-11 16:22:37 +00:00
Bartłomiej Kocot	0fdbf8a91d	[rocm-libraries] ROCm/rocm-libraries#8272 (commit 1c66ecb) [CK] Padding on K for global load for grouped conv bwd data (#8272) ## Motivation Fix incorrect results caused by missing padding during global load in the grouped convolution backward data kernel. The padding is needed because global loads are not OOB-checked. ## Technical Details Add the padding needed for global loads, which do not use an OOB check. ## Test Plan test_grouped_convnd_bwd_data* ## Test Result Passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-11 15:28:21 +00:00
BrianHarrisonAMD	f0545b5c15	[rocm-libraries] ROCm/rocm-libraries#8132 (commit 57d21a1) [CK dispatcher] - LGBM predict data_type FLOAT32->FLOAT64 in ml_heuristic (#8132) ## Summary `ml_heuristic.hpp` calls `LGBM_BoosterPredictForMat(..., /*data_type=*/0, ...)` (`C_API_DTYPE_FLOAT32`) against a `std::array<double, NUM_FEATURES>` feature buffer. LightGBM reinterprets the 8-byte doubles as 4-byte floats → invalid predictions → the heuristic's argmax always tie-breaks to the first/smallest enumerated config. Fix: `data_type 0 → 1` (`C_API_DTYPE_FLOAT64`), matching the `double` buffer. After the fix, predictions vary and track real TFLOPS (the model correctly prefers larger tiles). ## Verification - The feature buffer `f` is `std::array<double, NUM_FEATURES>` (NUM_FEATURES = 72) → `f.data()` is a `double*`. - The changed `0` is the 3rd positional `data_type` argument (not `nrow`/`ncol`/`is_row_major`). One-line functional change. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 18:57:31 +00:00
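The data-type mismatch described in the commit above can be reproduced without LightGBM. The following is a minimal sketch, assuming IEEE-754 `float` (4 bytes) and `double` (8 bytes); the helper names `read_as_float32`, `read_as_float64`, and `float32_view_is_wrong` are hypothetical and exist only for this illustration. It reads the same `double` buffer once with the wrong element type and once with the right one:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helpers for illustration only; not LightGBM or CK code.

// Read `count` values from a raw buffer as a consumer would when told the
// buffer holds 32-bit floats (4 bytes per element).
inline std::vector<float> read_as_float32(const void* data, std::size_t count)
{
    assert(data != nullptr);
    std::vector<float> out(count);
    std::memcpy(out.data(), data, count * sizeof(float));
    return out;
}

// Read the same buffer with the element type it was written with (8 bytes).
inline std::vector<double> read_as_float64(const void* data, std::size_t count)
{
    assert(data != nullptr);
    std::vector<double> out(count);
    std::memcpy(out.data(), data, count * sizeof(double));
    return out;
}

// Returns true when the float32 reading of a double buffer differs from the
// stored values, which is the failure mode the commit message describes.
inline bool float32_view_is_wrong()
{
    const std::array<double, 4> features{1.0, 2.0, 3.0, 4.0};
    const auto wrong = read_as_float32(features.data(), features.size());
    const auto right = read_as_float64(features.data(), features.size());
    for(std::size_t i = 0; i < features.size(); ++i)
    {
        if(right[i] != features[i])
            return false; // the matching type must round-trip exactly
    }
    return wrong[0] != 1.0f || wrong[1] != 2.0f;
}
```

Calling `float32_view_is_wrong()` should return `true`: each `double` is split into two unrelated 4-byte patterns, so a model fed the float32 view sees features that have little to do with the real ones.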
Illia Silin	a433424e08	[rocm-libraries] ROCm/rocm-libraries#8241 (commit cd183df) [CK] increase time limit for fmha_bwd tests to prevent timeouts (#8241) ## Motivation Observed a CI failure due to fmha_bwd test timeout which never happened before. Going to increase the time limit for the test to prevent any further CI failures. ## Technical Details ## Test Plan ## Test Result ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-10 15:37:44 +00:00
Ville Pietilä	c6c55db757	[rocm-libraries] ROCm/rocm-libraries#8019 (commit 6472935) [CK TILE] Fix performance regression caused by Dispatcher codegen compiler flag. (#8019) ## Motivation Currently CK Tile has two codegen paths: CK Builder and CK Tile Dispatcher. The CK Tile Dispatcher codegen uses an additional compiler flag that is not present in the CK Builder codegen workflow. The additional compiler flag can cause a performance regression for some instances, as it disables relevant compiler optimizations. ## Technical Details Removed the compiler flag `-mllvm -enable-noalias-to-md-conversion=0` from the CMakeLists.txt that creates the instance library from Dispatcher codegen. ## Test Plan Required testing is contained in the CI/CD pipeline. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-10 09:12:03 +00:00
Chao	320a813d67	[rocm-libraries] ROCm/rocm-libraries#6533 (commit 5dcaa45) [CK_TILE] Add host-side Pack-GQA optimization for FMHA forward (#6533) ## Motivation Host-side Pack-GQA optimization for CK-Tile FMHA forward. Reshapes Q tensor from `[b, nhead_q, seqlen_q, d]` to `[b, nhead_kv, nhead_ratio * seqlen_q, d]` by adjusting strides, so grouped Q-heads sharing the same KV data are processed in a single tile. Zero kernel changes; runner-only. Phase 1: non-causal attention with GQA ratio packing. Phase 2: extends to dropout and split-kv paths, fixes stride edge cases. ## Technical Details Modified files (2): - `example/ck_tile/01_fmha/example_fmha_fwd.cpp`: Pack-GQA flag plumbing - `example/ck_tile/01_fmha/fmha_fwd_runner.hpp`: Q tensor reshape logic, stride adjustment for GQA ratio packing New files (1): - `example/ck_tile/01_fmha/test_pack_gqa_phase2.sh`: 53 test cases covering non-causal, dropout, split-kv, various GQA ratios ## Dependencies None; this PR is standalone. 
## Test Plan - GPU validation on MI300X (gfx942, ROCm 6.4.1): - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - GPU validation on MI350X (gfx950, ROCm 7.0), 53 parameterized test cases: - Command (GQA 4:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - Command (GQA 8:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=64 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - Command (decode): `./build/bin/tile_example_fmha_fwd -b=64 -h=32 -h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` ## Test Result Benchmark results (MI350X, gfx950, ROCm 7.0): \| Config \| Without Pack \| With Pack \| Improvement \| \|--------\|-------------\|-----------\|-------------\| \| GQA 4:1 prefill b=2 h=32 hk=8 s=2048 d=128 bf16 \| 690.05 TFlops (0.199 ms) \| 695.61 TFlops (0.198 ms) \| +0.8% \| \| GQA 8:1 prefill b=2 h=64 hk=8 s=2048 d=128 bf16 \| 706.25 TFlops (0.389 ms) \| 729.35 TFlops (0.377 ms) \| +3.3% \| \| GQA 8:1 decode b=64 h=32 hk=4 s_k=4096 d=128 bf16 \| 305.20 GB/s (1.763 ms) \| 1813.41 GB/s (0.297 ms) \| +5.9x \| \| LLaMA-70B decode b=32 h=64 hk=8 s_k=4096 d=128 bf16 \| 591.70 GB/s (0.909 ms) \| 1820.65 GB/s (0.295 ms) \| +3.1x \| \| MHA ratio=1 b=2 h=8 s=4096 d=128 bf16 \| 695.16 TFlops \| 702.72 TFlops \| no regression \| Benchmark results (MI300X, gfx942, ROCm 6.4.1): No regression on MI300X. Pack-GQA is a runner-only optimization (zero kernel changes), performance impact is within noise on MI300X. 
\| Config \| TFlops / GB/s \| Time (ms) \| Delta vs baseline \| \|--------\|-------------\|-----------\|-------------------\| \| MHA bf16 b=2 h=8 s=4096 d=128 \| 336.52 TFlops \| 0.408 \| -1.7% \| \| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 \| 322.52 TFlops \| 0.426 \| -0.7% \| \| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 \| 349.85 TFlops \| 0.786 \| +0.5% \| \| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 \| 381.29 TFlops \| 1.442 \| +1.2% \| \| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 \| 697.32 GB/s \| 1.541 \| +0.8% \| All validation tests pass (`valid:y`) on both MI300X and MI350X. Additional validation: - 53 parameterized test cases pass (23 phase 1 + 30 phase 2) - GQA ratios tested: 1:1, 2:1, 4:1, 8:1, 32:1 - No regression on MHA (ratio=1) workloads - fp16 and bf16 validated	2026-06-10 01:56:44 +00:00
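The stride-only reshape in the Pack-GQA commit above can be checked with plain index arithmetic. The sketch below assumes a contiguous row-major Q tensor of shape `[b, nhead_q, seqlen_q, d]`, and assumes that consecutive Q heads share one KV head (`h_q = h_kv * nhead_ratio + r`); the actual runner may order or interleave the packed rows differently. `QShape`, `offset_unpacked`, and `offset_packed` are hypothetical names, not CK Tile API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shape holder for illustration; not part of CK Tile.
struct QShape
{
    std::size_t batch, nhead_q, nhead_kv, seqlen_q, hdim;
};

// Linear offset in the original [b, nhead_q, seqlen_q, d] view.
inline std::size_t offset_unpacked(const QShape& s, std::size_t b, std::size_t h_q,
                                   std::size_t row, std::size_t col)
{
    return ((b * s.nhead_q + h_q) * s.seqlen_q + row) * s.hdim + col;
}

// Linear offset in the packed [b, nhead_kv, nhead_ratio * seqlen_q, d] view.
// Assumption: packed_row = r * seqlen_q + row, where r indexes the Q heads
// that belong to the same KV head.
inline std::size_t offset_packed(const QShape& s, std::size_t b, std::size_t h_kv,
                                 std::size_t packed_row, std::size_t col)
{
    assert(s.nhead_kv != 0 && s.nhead_q % s.nhead_kv == 0);
    const std::size_t ratio = s.nhead_q / s.nhead_kv;
    return ((b * s.nhead_kv + h_kv) * (ratio * s.seqlen_q) + packed_row) * s.hdim + col;
}
```

Under these assumptions both functions return the same offset for matching indices, so the packed view addresses the same memory and no data has to be copied; only the shape and strides handed to the kernel change.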
Bartłomiej Kocot	928b46c3bd	[rocm-libraries] ROCm/rocm-libraries#8208 (commit 7240d71) [CK] Fix scale init in profile_grouped_conv_fwd_outelementop (#8208) ## Motivation Wrong scale initialization caused random errors on CI. ## Technical Details InvScale was initialized with 0, which caused NaNs during division. Zeros are now excluded from the random initialization. ## Test Plan TestGroupedConvndFwdConvInvscale3d ## Test Result Passed in 100 runs ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1400	2026-06-09 21:42:53 +00:00
Bartłomiej Kocot	cb099eb963	[rocm-libraries] ROCm/rocm-libraries#8155 (commit c25787b) [CK] Magic division for long_index_t ## Motivation Improve performance for long_index_t kernels ## Technical Details Support magic division for long_index_t ## Test Plan test_grouped_convnd* ## Test Result Passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1386	2026-06-09 20:57:20 +00:00
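The commit above does not show its implementation, so the following is only a sketch of the general technique: magic division replaces an integer divide by a precomputed multiplier and shift. It is written for 32-bit unsigned values with 64-bit intermediates; a `long_index_t` (64-bit) variant would need a 128-bit high multiply, which is not shown here. `Magic`, `make_magic`, and `magic_divide` are hypothetical names, not the CK API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types and names for illustration; not the CK implementation.
struct Magic
{
    std::uint32_t multiplier;
    std::uint32_t shift;
};

// Precompute the multiplier and shift for a fixed divisor (divisor >= 1).
inline Magic make_magic(std::uint32_t divisor)
{
    assert(divisor >= 1);
    // Smallest shift such that 2^shift >= divisor.
    std::uint32_t shift = 0;
    while((std::uint64_t{1} << shift) < divisor)
        ++shift;
    const std::uint64_t one = 1;
    // multiplier = floor(2^32 * (2^shift - divisor) / divisor) + 1
    const std::uint64_t m = ((one << 32) * ((one << shift) - divisor)) / divisor + 1;
    return {static_cast<std::uint32_t>(m), shift};
}

// Compute n / divisor using one multiply, one add, and one shift.
inline std::uint32_t magic_divide(std::uint32_t n, Magic mg)
{
    // High 32 bits of the 64-bit product n * multiplier.
    const std::uint64_t hi = (std::uint64_t{n} * mg.multiplier) >> 32;
    // The add is done in 64 bits so it cannot overflow.
    return static_cast<std::uint32_t>((hi + n) >> mg.shift);
}
```

The precomputation runs once per divisor, typically on the host, and the result is reused for every index computation, for example `magic_divide(7, make_magic(3))` in place of `7 / 3`.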
Aviral Goel	93e0d79103	[rocm-libraries] ROCm/rocm-libraries#8035 (commit 45186b8) [CK_Tile] Add wmma_bf16f32_16x16x32_bf16 warp-gemm test (#8035) ## Summary Adds the warp-gemm unit test for `wmma_bf16f32_16x16x32_bf16`. Stacked on #8028 (the API change) and based on its branch, so #8028 shows the isolated API diff and this PR shows just the test. ## Test gfx125-guarded `WmmaBf16f32.ResidualPrecisionContrast`: computes `Y_bf16 = X_bf16·W_bf16 + R_fp32` via `WarpGemm::mac_downconvert`, compares against an fp32 reference (within bf16 tolerance), and asserts it is at least as accurate as the bf16-accumulate path, i.e. it demonstrates the precision benefit of the fp32 accumulator (`C`) carried into the fused bf16 down-convert. Passes on gfx1250.	2026-06-09 15:52:45 +00:00
chris-tsiaousis-hpc	dc3c1cffd5	[rocm-libraries] ROCm/rocm-libraries#7891 (commit 4dee41d) Porting existing FMHA infra from users/shumway/ck/exp-kpack to develop (#7891) Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>	2026-06-09 14:00:32 +00:00

3440 Commits