composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

Author	SHA1	Message	Date
Qianfeng Zhang	45019fd5fd	Remove the comparison of row/col to max_uih_len in masking	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	ad10a2dd53	Use kM0=128 kN0=64 to completely eliminate VGPR spilling	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	8b2948b31e	Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save vgprs for non-local situations	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	fafb375122	Use packed cast_tile for fp16	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	6686c7af44	Update to partially reduce the register spilling	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	459c5565d4	Add IsFirstVLdsBufferOverlapLastKLdsBuffer() check to reduce call of s_barrier()	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	8a6c2591b0	Update the in pipeline codes	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	d360c61200	Fix in calculation of total_flops and update benchmark scripts	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	251136cca7	Add output of estimated TFLOPS	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	644ea27e0e	Update to the scripts and error thresholds	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	2a71304bbb	Tune the input initialization to avoid overflow in silu	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	9c2dbf8d64	Add benchmark_hstu_attention.sh	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	cdb0704377	Add several verification test cases	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	beb6fa8cc1	Fix in kernel and forward dispatch for jagged mode	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	24822a4898	Fix in hstu-attention pipeline (which makes some test cases pass)	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	50b0af257c	Fixes and updates	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	72774b718b	Change in HstuBlockMasking and kernel/reference code to use masking	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	74a0ec4609	Fix and change in example	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	450494945f	Add hstu attention kernel implementation, instances and interfaces (building succeeded)	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	e6b6323b67	fix the jagged mode tensor access in reference_hstu_attention	2026-06-23 09:17:26 +00:00
Qianfeng Zhang	a19f73c305	Initial reference implementation of hstu attention	2026-06-23 09:17:23 +00:00
Enrico Degregori	55e30feac6	[rocm-libraries] ROCm/rocm-libraries#8637 (commit a1a7f5f) [CK] Fix compilation ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-20 02:08:58 +00:00
Adel Johar	01bad4c3d9	[rocm-libraries] ROCm/rocm-libraries#8205 (commit f58120c) [Docs] Standardize precision support reference pages across components (#8205) ## Motivation The goal of this PR is to standardize the precision support reference page format across all components, while also reducing the maintenance burden of manually updating the YAML data file in https://rocm.docs.amd.com/en/latest/reference/precision-support.html ## Technical Details - Each component maintains its own YAML file, which will eventually be used in https://rocm.docs.amd.com/en/latest/reference/precision-support.html - A new precision support reference page is introduced which will not override existing data type/precision support content; it will serve as the overview/summary that will be linked in the ROCm reference page ## Test Plan - Built locally, viewed each component manually ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-19 15:08:04 +00:00
Bartłomiej Kocot	7c2b979de2	[rocm-libraries] ROCm/rocm-libraries#8573 (commit 04c9f1d) [CK][CK Tile] Drop profiler for experimental builder codegen (#8573) ## Motivation Switch to dispatcher profiler for ck tile conv. ## Technical Details - Switch to dispatcher profiler for ck tile conv. - Drop profiler for experimental codegen - Minor fixes for bwd data printing - Minor fixes for 3d conv in dispatcher codegen ## Test Plan test_grouped_conv*tile ## Test Result Passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-19 09:38:44 +00:00
Enrico Degregori	2733e75900	[rocm-libraries] ROCm/rocm-libraries#6565 (commit d41715e) [CK Tile] Async support pipeline V3 ## Motivation Optimize pipeline V3 for gfx950 by enabling buffer load to lds (async pipeline) ## Technical Details - Add `Async` bool to `Problem` struct to enable async pipeline in existing one - Add `static_move_ys` to load transpose. This generates offset in assembly instructions saving registers - Add `is_valid` to `async_get_vectorized_elements`. Before hard coded to true. It allows to support padding - Remove unnecessary restrictions to `is_a_load_tr` and `is_b_load_tr` (wider use of lds load transpose on gfx950) - Integrate async support in existing V3 pipeline (avoid pipelines duplication) - Create policy to support both async and default cases. This could be used by any async pipeline (next steps) - Define `wg_attr_num_access` separately for A and B. This allows to optimize ds_read instruction width for cases when one matrix is transposed and the other is not. Before in such cases, `ds_read_b64` was used instead of `ds_read_b128` - Add test for V3 async. Currently only supporting cases with A and B having the same type ## Test Plan New test `test_ck_tile_gemm_pipeline_compv3_async` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-19 06:57:14 +00:00
Brock Hargreaves	081fe18c1c	[rocm-libraries] ROCm/rocm-libraries#8558 (commit ccfa08b) [CK][CI] Retry git network ops to survive transient DNS blips (#8558) ## Motivation CI builds intermittently fail on transient git DNS blips (e.g. `Could not resolve host: github.com`). These surface as an untyped `exit code 1`, which the existing node/transient-fault retry doesn't catch, so a momentary glitch fails the whole build. ## Technical Details Added `gitNetRetry(label, body)` (3 attempts, 15s backoff) and wrapped every github.com-touching git step: ref-repo clone/update, `checkout scm`, and the hipTensor clone. All are idempotent on retry. Docker pulls are left to the existing `pullImage()` path. ## Test Plan - Mapped the failing build's `git remote update` DNS error to a now-wrapped call. - Confirmed no existing code retries git host-resolution failures. ## Test Result Groovy shared-library, not locally executable; needs a pipeline run to fully validate. Check CI. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-18 21:18:27 +00:00
Brock Hargreaves	8864dcc3a4	[rocm-libraries] ROCm/rocm-libraries#8560 (commit f8362a1) [CK][CI] Post failure GitHub status on stage build errors (#8560) ## Motivation Failed CI stages (e.g. Static checks) were left stuck on a `pending` GitHub status instead of reporting `failure`, so PRs showed an overall failure with no indication of which check actually failed. ## Technical Details `buildAndTest` posted `pending`/`success` statuses but its catch only rethrew, deferring failure reporting to `runOnHealthyNode`, which deferred right back. Neither posted `failure`. This adds a `failure` status post for real build errors in `buildAndTest`, while letting node-reroute signals (`NodeFault`/`TransientFault`) and aborts (`FlowInterruptedException`) propagate untouched so retries still work. Since every stage routes through `buildAndTest`, this fixes both the directly-called `Static checks` stage and the `runOnHealthyNode`-wrapped per-arch build stages. ## Test Plan Trigger a stage failure (e.g. introduce a clang-format violation) and confirm the corresponding GitHub status context transitions `pending` → `failure` rather than remaining `pending`. ## Test Result Pending CI run on a branch with a deliberate failure to confirm the status transition. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-18 21:16:24 +00:00
Brock Hargreaves	bad7870830	[rocm-libraries] ROCm/rocm-libraries#8508 (commit 5cc3bef) [CK][CI] Make gfx1250 build compile-only ## Motivation gfx1250 has no CI hardware, so its build piggybacks on gfx90a nodes where gfx1250 binaries can be compiled but not run. The build currently fails because post-build runtime tests fire on the gfx90a node. This PR makes the gfx1250 build compile + install only. ## Technical Details The post-build test block in `buildAndTest` (`ck.groovy`) keys off the physical node arch (`gfx90a`), so runtime tests run for gfx1250. Gated that block off for gfx1250. Body-only change with no signature changes, so it's backward compatible with the develop-pinned shared library and doesn't affect other archs. ## Test Plan Trigger the gfx1250 build with `USE_CURRENT_BRANCH_FOR_CK_GROOVY=true` and confirm it compiles/installs with no runtime test steps; confirm gfx90a builds are unchanged. ## Test Result Check CI. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-18 18:33:59 +00:00
Sami Remes	a3a12b8945	[rocm-libraries] ROCm/rocm-libraries#5813 (commit 18b43cf) [CK_TILE] Enable full transpose layout support for MX GEMM pipeline (#5813) ## Enable full transpose layout support for MX GEMM pipeline (32x32x64 MFMA) ### Summary This PR enables all four matrix layout combinations (Row/Col, Row/Row, Col/Col, Col/Row) for the MX GEMM pipeline with `32x32x64` MFMA warp tiles, using `ds_read_tr` transposed LDS loads on gfx950. Previously, only the canonical `A=RowMajor, B=ColumnMajor` layout was supported. ### Changes Kernel-side transpose support: - `warp_gemm_attribute_mfma.hpp`: Introduce `kSplitFactor` logic in `get_warp_dstr_encoding` to split the K-dimension distribution encoding when `kPerLane` exceeds the `ds_read_tr` subtile minor dimension. This satisfies the `TransposeTileDistributionTraits` suffix validation required by `load_tile_transpose`. The distribution encoding now also receives the `DataType` template parameter to compute the split factor based on packed element size. - `gemm_pipeline_ag_bg_cr_comp_async.hpp`: Uncomment and enable the `InputTileDistributionTraits` logic to properly transform LDS load tile distributions for transposed reads. Add `static_assert`s to catch misconfigurations where a layout requires transpose loads but the warp tile size disables them (e.g. `KWarpTile=128` exceeds `ds_read_tr` limits). - `load_tile_transpose.hpp`: Fix `DataVec` sizing for packed types (`pk_fp4_t`): divide `vecLoadSize` by `PackedSize` to prevent buffer overflow when each physical element contains multiple logical values. - `warp_gemm_attribute_mfma_impl.hpp`: Set `kDefaultScale` to `0x7F7F7F7F` (unity in e8m0 format) for the unscaled `operator()` overloads of `WarpGemmAttributeMfmaImpl_f32_32x32x64_f8f6f4`, ensuring correct behavior with `mfma_scale_f32_32x32x64_f8f6f4`.
- `warp_gemm.hpp` / `warp_gemm_dispatcher.hpp`: Add generic `WarpGemmMfma_f32_32x32x64_f8f6f4<A, B>` alias and dispatcher specialization to support arbitrary MX data type combinations (fp4, fp6, fp8) with the 32x32x64 MFMA, consolidating the existing type-specific aliases. - `gemm_pipeline_ag_bg_cr_comp_async_default_policy.hpp`: Simplify `wg_attr_num_access` determination — `Double` for fp8, `Single` otherwise. Reference implementation fix: - `reference_gemm.hpp`: Fix nibble selection for packed 4-bit types (`pk_fp4_t`, `pk_int4_t`) in `reference_mx_gemm`, `reference_gemm`, and `reference_gemm_abquant`. The previous logic used `k % 2` or `index[K_DIM] & 1` to select which nibble to extract, which assumed K was always the fast (contiguous) memory dimension. This is only true for `A=RowMajor` / `B=ColumnMajor`. For other layouts, the fix computes the flat memory offset via `mDesc.GetOffsetFromMultiIndex(...)` and uses its parity to correctly select the nibble regardless of layout. Test infrastructure: - `test_mx_gemm_config.hpp`: Add `MxGemmConfig32` base and `MXfp4_GemmConfig32` / `MXfp8_GemmConfig32` configs for the 32x32x64 warp tile. - `test_mx_gemm_fp4.cpp` / `test_mx_gemm_fp8.cpp`: Add `Config32` test suites covering all four layout combinations. Restrict `Config16` (16x16x128) to `A=Row, B=Col` only, since `KWarpTile=128` exceeds `ds_read_tr` limits. - `test_mx_gemm_util.hpp`: Fix scale tensor layout — scales are always row-major `[M, K/32]` and column-major `[K/32, N]`, independent of A/B data layout. ### Test plan - [x] `test_ck_tile_mx_gemm_fp4` — 5/5 passed (16x16x128 Row/Col + 32x32x64 all 4 layouts) - [x] `test_ck_tile_mx_gemm_fp8` — 5/5 passed (16x16x128 Row/Col + 32x32x64 all 4 layouts) - [x] `test_ck_tile_mx_gemm_fp6` — 1/1 passed (16x16x128 Row/Col)	2026-06-18 17:05:09 +00:00
Illia Silin	e2deaaba64	[rocm-libraries] ROCm/rocm-libraries#8591 (commit 5210ae6) [CK] fix daily hipTensor tests. ## Motivation Had to change the way hipTensor is cloned to make sure it doesn't erase CK installation and uses the correct path for the installation. Also added the "install" target every time we build and test everything, so we could use CK for testing third-party libs that depend on it. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-18 14:58:10 +00:00
Enrico Degregori	1762eaeaec	[rocm-libraries] ROCm/rocm-libraries#8535 (commit a0f47eb) [CK Tile] EightWaves pipeline int8 support ## Motivation EightWaves pipeline currently is supporting only FP types ## Technical Details - Enable 16x16x64 int8 instruction for gfx950 in dispatcher - Enable int8 in EightWaves pipeline - Add tests - Fix bug in `warp_gemm_attribute_mfma_impl.hpp` ## Test Plan Tests have been added for int8 GEMM using EightWaves pipeline ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-18 12:59:59 +00:00
Ville Pietilä	60b276647b	[rocm-libraries] ROCm/rocm-libraries#8157 (commit b0d9d39) [CK Tile] Rule-based configuration generation in CK Dispatcher codegen (#8157) ## Motivation The CK Tile Dispatcher code generation for CK Tile Profiler relies on flat JSON files to list the generated configurations. This approach has the following problems: - The JSON files are verbose - The JSON files easily get out of sync with the CK Builder .config files from which they were generated. - The JSON file based configuration makes it hard to list explicitly the rules that govern the instance generation. ## Technical Details Replaced the JSON files with a rule-based configuration. To preserve the existing functionality, the `profiler` and the `tests` instance sets are generated directly from the CK Builder config files. The JSON config files are removed from source control, and the "on-the-fly" generation guarantees that the Dispatcher codegen uses up-to-date configurations. This PR introduces six different rule sets for the CK Tile Dispatcher code generation: 1. `profiler`: matches the old JSON set of profiler configurations. 2. `tests`: matches the old JSON set of tests configurations. 3. `full`: full configuration set created from a rule-based config selection 4. `full-tests`: a subset of `full` for generating configurations for convolution integration tests. 5. `tiny`: a subset of `full-tests` to produce the minimal set of configurations to test the Dispatcher codegen. 6. `default`: the default rules, which correspond to the existing heuristic rules for configuration selection. This ensures that ML based kernel selection doesn't get broken. The main use of the `full` rule set is to define a reasonable solution space for the possible implicit GEMM configurations. We start from the configurations that are allowed by the device architecture.
The `full` rule set defines the relevant tile sizes for each convolution direction. From the tile size we have a curated mapping to the number of waves over the different GEMM axes, i.e., we describe how many waves each GEMM dimension corresponds to. The GEMM-K wave tile dimension can be computed from the other parameters and does not need to be listed explicitly. An orthogonal axis to the tiling strategy is the vectorization strategy. This is mainly defined by the data type and hardware, as in general we want to use the maximum possible load widths. The maximum sizes for each convolution direction variant are defined by the implicit GEMM matrix dimensions. For cases where we have a low number of channels per convolution group, we need smaller vector load sizes. These are captured by the `VecStrategy` enumeration in the codegen rules. The problem with the rule-based configuration selection is that we "over generate" configurations. The old JSON configurations compose approximately 25% of all configurations that the `full` rule set creates. The additional configurations are valid, but they may not provide any performance benefits. Hence, we keep the `profiler` and `tests` rule sets for now to avoid building an excessive number of configurations by default. The `full` rule set can be taken into use by specifying the CMake configuration flag `-D DISPATCHER_RULE_SET=full`. By default, the `tests` rule set is used, i.e., we don't change the existing behaviour. ## Test Plan Added a new stage in the CI/CD pipeline that ensures the Dispatcher codegen rules are up to date. Otherwise the functionality is covered by the existing CI/CD tests. There are no functional changes to the convolution kernels, only to how the different instances are generated. ## Test Result If the CK Tile conv instances build without errors, the Dispatcher codegen is generating valid code. If all tests in the CI/CD pipeline are passing, the Dispatcher codegen generates valid instances.
## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-18 01:22:50 +00:00
Aviral Goel	c43b550206	[rocm-libraries] ROCm/rocm-libraries#8202 (commit 0911fa0) [GFX1250][CK_TILE] Add scale16 (ScaleBlockSize=16) support to MX GEMM TDM pipeline (#8202) Enables `ScaleBlockSize=16` end-to-end for the FP8/BF8 MX GEMM TDM pipeline, building on the scale16 warp-gemm layer already in develop. - warp gemm: add the 32x32x128 f8f6f4 scale16 traits and alias (2x2 grid of 16x16x128 scale16 intrinsic calls with per-subtile `SCALE_OPSEL`), and route 32x32 f8f6f4 through the dispatcher's `IsScale16` path. - default policy: select the warp gemm via the dispatcher with `IsScale16=(ScaleBlockSize==16)` so `WarpTile=16` and `WarpTile=32` each pick the matching scale16 path; guard WarpTile M/N to 16 or 32; scale-tile distribution for the scale16 layout. - pipeline V1/V2: thread `Problem::ScaleBlockSize` through the scale-window setup (replacing the hardcoded 32); expose `ScaleBlockSize` for the kernel. - block gemm: extract int64 (scale16) / int32 (scale32) scales by width. - kernel: scale16 descriptor order; reject unsupported `BlockScaleSize`. Test coverage for this path is in the stacked follow-up PR.	2026-06-17 16:41:00 +00:00
jakpiase	65bef78383	[rocm-libraries] ROCm/rocm-libraries#8518 (commit 1ad69c3) [CK] Add support for large tensor index handling into conv bwd data (#8518) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-17 15:51:36 +00:00
Illia Silin	b5713be6cd	[rocm-libraries] ROCm/rocm-libraries#8501 (commit 54eb5dc) [CK] disable DPP kernels by default ## Motivation The dpp8 instruction has been disabled in the upstream llvm-project in the latest compiler version, so we're hitting compilation errors with staging compiler: <inline asm>:2:33: error: not a valid operand. v_dot2c_f32_f16_dpp v6, v8, v7 dpp8:[0, 0, 0, 0, 0, 0, 0, 0] ^ error: cannot compile inline asm These instructions are used for fp16 gemms that are slightly faster than dl gemms on gfx10, but are not critical. Going to disable these kernels for now, until a better solution is available, to unblock the builds with staging compiler. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-17 14:03:00 +00:00
SamiAario-AMD	39182b50eb	[rocm-libraries] ROCm/rocm-libraries#8487 (commit 06a73ba) Skip tests on gfx11 that have intermittent failures ## Motivation On gfx11, skip sporadic failures for any load_and_convert_tile case where X and Y differ. Same-type tuples (half/half, bf16/bf16, fp8/fp8) have been stable. ## Test Result Stress-tested on gfx11, gfx12, and gfx950 with 10000 iterations of the tests. No remaining test failures were detected. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-17 11:07:22 +00:00
damien-lejeune	5bebfd460f	[rocm-libraries] ROCm/rocm-libraries#8492 (commit 46b6a06) Add tile size for FMHA batch prefill bf16 for MI308X ## Motivation Adding a tile size adapted to MI308X, for the FMHA Batch Prefill BF16 input type case ## Technical Details N/A ## Test Plan Benchmarking from the Aiter side with: ``` python3 op_tests/test_batch_prefill.py -s 8000 -p 1 -q 4 -k 1 --head_dim 256 -c true -d bf16 --input_dtype bf16 --quant_method none --kv_layout linear -t sglang -l 0.0 --return_lse false --profile ``` ## Test Result We see an improvement with the new tile size on MI308X (both with PLT mode OFF and ON) ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2026-06-17 06:22:26 +00:00
damien-lejeune	2c0b7cbb0a	[rocm-libraries] ROCm/rocm-libraries#8424 (commit debb669) Add missing constraint in the FMHA qr async pipeline to enforce bk0=bk1 (#8424) ## Motivation The purpose of this change is to add a guardrail to what values bk0 and bk1 can take. This is to avoid ill defined sizes, silently failing and generating NaN (or other error) at runtime. An example of such failure can be obtained using the tile engine: ``` cd rocm-libraries/projects/composablekernel/tile_engine/ops/fmha python fmha_benchmark.py configs/batch_prefill.json \ --problems "1,4,1,8000,8000,256" \ --filter "c.data_type=='bf16' and c.hdim_q==256 and c.pipeline=='qr_async' and c.mode=='group' and c.tile_n0==32 and c.tile_k0==64" ``` ## Technical Details The qr_async pipeline stages data in the K dimensions into LDS using a bk1-descriptor, while the (Q*K^T) gemm0 consumes bk0 ## Test Plan See command above ## Test Result Before the change: (invalid) generate instances, error at runtime After this change: no instance generated ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2026-06-16 07:41:58 +00:00
Brock Hargreaves	1b649a8d4b	[rocm-libraries] ROCm/rocm-libraries#8332 (commit 48c389c) [CK][CI] Retry builds on node failure with automatic rerouting (#8332) ## Motivation When a Jenkins node enters a bad state (missing GPU driver, dead Docker daemon, full disk), every PR scheduled onto it fails the same way until a human manually takes it offline. Some failures are also transient and would pass on a simple retry. Today the pipeline does neither: every failure goes straight to red on the same node. ## Technical Details Two new retry behaviors based on failure type: - Different node for persistent node faults (driver missing, daemon down, disk full, container won't start) - Retry in place for transient glitches (registry pull, DNS), then a different node if retries are exhausted Real build/compile failures and aborted builds are never retried. New: `src/org/ck/NodeFault.groovy`, `TransientFault.groovy`: typed exceptions in the shared library `src/` for stable classloader identity under dynamic library loading. `vars/ck.groovy`: adds `preflight()` (host health checks before build), `pullImage()` (classifying pull failures at the call site, replacing `getDockerImage()`), `runOnHealthyNode()` (outer reroute loop, up to 3 nodes), `runInPlace()` (same-node transient retries). GitHub failure status is only set once all retries are exhausted. `Jenkinsfile`: all active `Build CK and run Tests` stages converted to `agent none` + `ck.runOnHealthyNode(…)`. ## Test Plan Tested on `users/brockhargreaves-amd/ck/node-failure-retry-logic` with `USE_CURRENT_BRANCH_FOR_CK_GROOVY=true`. Verified preflight logging, reroute on node fault, attempt counter in logs, no retry on aborts, and single failure status report after budget exhausted. ## Test Result Retry logic working as expected.
Three bugs found and fixed during testing: false `NodeFault` from host-level sccache probe (sccache is in-container), `null` node name in catch logging, and `sh` calls outside `node()` context in status reporting. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 17:40:10 +00:00
Andriy Roshchenko	b8440b3aeb	[rocm-libraries] ROCm/rocm-libraries#8325 (commit 559eaf6) [GFX1250][MX GEMM] Unified FLATMM GroupedGemm Implementation for MX Data Types (#8325) ## Motivation Design and test a unified FLATMM GroupedGemm interface so that it supports all MX FP8, FP6, and FP4 data types on both the gfx950 and gfx1250 architectures and works seamlessly across these platforms. ## Technical Details Implementation exposes Grouped Gemm interface for MX FLATMM and MX TDM FLATMM pipelines. ## Test Plan Add the following tests: - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_non_tdm.cpp - ck_tile/grouped_gemm_mx/test_grouped_gemm_mx_flatmm_tdm.cpp - ck_tile/flatmm/test_mx_flatmm_persistent.cpp Verify on the gfx950 and gfx1250 architectures. ## Test Result All tests pass. Verified on A0 hardware with rocm-7.14.0a20260517 ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 16:12:33 +00:00
Sami Remes	c1f7104852	[rocm-libraries] ROCm/rocm-libraries#6663 (commit f19fc01) [CKTile] Fix MX GEMM: num_loop==3 dispatch, split-K, unsupported-shape guard (#6663) Three independent MX GEMM correctness bugs reported against example/ck_tile/42_mx_gemm (fp8xfp8, A=Row/B=Col) on MI350X, plus one host-side atomic-add accumulation bug in the example's repeat loop. - Pipeline (gemm_pipeline_ag_bg_cr_comp_async.hpp): BlockHasHotloop required num_loop > PrefetchStages, which let num_loop == 3 enter a hot loop that produced 5 gemm accumulations instead of 3 (K == 3K_Tile, e.g. K=768, deterministically wrong). Require num_loop >= 4 instead: pre-pipeline + TailNumber::Three already totals exactly 3. - Kernel (gemm_mx_kernel.hpp): split-K was silently broken because GridSize did not thread k_batch into blockIdx.z and the scale tile windows were anchored at K=0 for every k_id. Every k_id >= 1 therefore read the wrong packed scales. Fix: GridSize returns dim3(grid_x, 1, k_batch) (persistent and non-persistent). * MakeScaleA/BBlockWindows accept a k_elem_offset and translate it to a packed-scale K offset (also apply pad_tensor_view so OOB scale loads return zero, matching A/B padding). * operator() derives k_id from blockIdx.z, uses GetSplitKElemOffset (matches Underlying::SplitKBatchOffset's K1-aligned formula), and dispatches the epilogue with memory_operation_enum::atomic_add for k_batch > 1, set for k_batch == 1. Same fp16/bf16 even-vector-size guard as UniversalGemmKernel. * MakeCBlockWindows templated on DstInMemOp; unconditionally applies pad_tensor_view using kPadM/kPadN so partial trailing M/N tiles are handled correctly. - Compile- and runtime unsupported-shape guards (gemm_mx_kernel.hpp): add IsSupportedArgument and a static_assert for configurations that produce silent wrong results: * static_assert(!kPadK) -- the MX comp-async pipeline uses async_load_tile whose OOB check is per-vector-start, so a vector straddling the K pad boundary reads garbage. 
Until the async path learns per-element pad masking, reject kPadK at compile time. * Runtime: k_batch >= 1; M/N multiples of MPerBlock/NPerBlock when kPadM/kPadN are false; M >= MPerBlock and N >= NPerBlock always (CShuffleEpilogue cannot safely run with a single partial tile); K % (KPerBlock * k_batch) == 0; and for k_batch > 1, K must be a multiple of WarpTile_K * k_batch so every split lands on a packed-scale boundary. * All error paths log under CK_TILE_LOGGING with actionable messages. - Example (example/ck_tile/42_mx_gemm/mx_gemm_instance.hpp): * Call Kernel::IsSupportedArgument up front and throw a clear runtime_error for rejected shapes (was silently launching an unsupported kernel). * Switch to launch_kernel_time_mask with a clear_gemm_output preprocess that zeroes C between iterations when k_batch > 1 (mirrors universal_gemm_invoker). Without this the default -warmup=50 -repeat=100 accumulated 150 atomic_adds into C after the kernel-side split-K fix. Tests (test/ck_tile/gemm_mx/): - Add MXfp8_GemmConfig16_PadMN (kPadM = kPadN = true). - test_mx_gemm_fp8.cpp: HotLoopTailNumLoopThree (K=768 regression), SplitK (k_batch=2,4 across full_k/partial_k paths), TestMxGemmFp8PadMN::{MNPaddingAligned, MPadding, NPadding, MNPadding} covering trailing partial tiles along M, N, or both. - Run(...) now takes k_batch. - packScalesMNxK: guard against OOB (mn, k) reads from src and initialise e8m0 bytes to the zero exponent (0x00) instead of the default-constructed NaN (0xFF), so padded lanes don't poison the packed int32_t shared with in-range lanes. - test_mx_gemm_instance.hpp: call IsSupportedArgument before launch. Verification on gfx950, ROCm 7.2.0: - ctest -R test_ck_tile_mx_gemm -> 100% (2/2). - Example sweep over the original bug-report shapes: all K-aligned shapes now validate correct (including 4096^3 sk=2 and the K=768 cases); all K=128 shapes cleanly rejected with the new error message instead of producing silent wrong results. 
Made-with: Cursor ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 08:28:55 +00:00
damien-lejeune	aab1d219f5	[rocm-libraries] ROCm/rocm-libraries#8350 (commit f92ded1) Add tile shape for FMHA batch prefill on MI308X (on fp8, hdim=256) (#8350) ## Motivation Add a tile size appropriate for FMHA batch prefill fp8/hdim256 on MI308X ## Technical Details Appending the tile shape to the existing factory such that it can be picked up by Aiter ## Test Plan Ran the performance test on both MI300X and MI308X ## Test Result MI300X performance seems unaffected by this change. MI308X does improve. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2026-06-15 07:00:35 +00:00
SamiAario-AMD	947dcc2606	[rocm-libraries] ROCm/rocm-libraries#5510 (commit 8415c8c) [CK Tile] Add transposed tile load implementation, and tests for load_and_convert_tile (#5510) ## Motivation Mixed precision b/fp16 x fp8 requires a transposed tile load implementation that supports mixed precision using these types. Implement this, use it in `load_and_convert_tile`, and add a unit test for `load_and_convert_tile` which covers this functionality. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-15 06:42:28 +00:00
ltqin	0954a8f3fa	[rocm-libraries] ROCm/rocm-libraries#8262 (commit d4ff8fc) [CK_TILE] Add graph capture support for FMHA backward(new branch) (#8262) ## Motivation Add HIP graph capture support for FMHA backward operations. The original implementation only supported normal execution mode and would cause use-after-free crashes when used with graph capture replay. When FMHA backward is captured into a HIP graph: - First replay: host callback executes and deletes the closure (as designed for normal mode) - Subsequent replays: use-after-free crash because the closure was already freed This PR enables `fmha_bwd_launcher::prepare_workspace_async()` to work correctly in both normal execution and graph capture modes.	2026-06-14 03:11:53 +00:00
Johannes Graner	01cca38c8e	[rocm-libraries] ROCm/rocm-libraries#8220 (commit 4c04a3a) [CK Tile] WAVELET pipeline for backward-data grouped convolution (#8220) ## Motivation On the RetinaNet shapes (gfx950, fp16) CK Tile backward-data conv was ~18% behind classic CK, with the gap concentrated in the K=2376 3x3 detection-head family where bwd_data spends most of its time. The WAVELET GEMM pipeline already gives uplift for forward and backward-weight conv; this ports it to backward-data and consolidates the now-shared machinery across all three directions. ## Technical Details - Backward-data wavelet support in the tile kernel: launch extra load waves when the pipeline exposes `LaunchBlockSize`, and split the epilogue into math waves (run the CShuffle epilogue) and load waves (`RunBarrierStub`). - Register 7 WAVELET instances (fp16 and bf16), tuned for backward-data's tall-skinny GEMM rather than the forward tile shapes: a big-M `256/128/64` workhorse, a `VecA=4` variant for the `K % 8 != 0` shapes, and a `NumGroupsToMerge=32` variant for grouped (depthwise-style) shapes. - Implement the native backward-data instance parser in `generate_instances.py`. - Deduplicate the wavelet machinery shared by forward, backward-data, and backward-weight: `GroupedConvLaunchBlockSize`, `is_wavelet_pipeline`, and `RunWaveletAwareEpilogue` in `grouped_convolution_utils.hpp`; the three native instance parsers collapse to one parameterized parser. The three kernels now call the shared helpers. ## Test Plan - Rebuild the full profiler instance pools for all three directions (fp16/bf16/fp32, nhwgc/ndhwgc) to exercise the shared helpers across every instantiation. - Tile GTests on gfx950: `test_grouped_convnd_fwd_tile`, `test_grouped_convnd_bwd_data_tile`, `test_grouped_convnd_bwd_weight_tile`. 
- Per-shape sweep of the 35 RetinaNet backward-data shapes vs classic CK and the non-wavelet tile pool (`profile_wavelet_bwd_data.py`); correctness spot-checked with GPU-reference verification on the new big-M and NumGroupsToMerge instances. ## Test Result - GTests pass: forward 9/9, backward-data 6/6, backward-weight 6/6. - Backward-data perf (3x3 g=1 region, geomean classic/tile): 0.88 -> 1.11, i.e. the tile path goes from ~12% slower than classic to ~8% faster. The largest single backward-data shape (256x100x100->2376) moves from 11% slower than classic to 12.5% faster. - The dedup refactor preserves behavior (net -174 lines across the kernels/generator), confirmed by the full rebuild and the GTests above. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-13 00:10:50 +00:00
John Afaganis	329e589840	[rocm-libraries] ROCm/rocm-libraries#8260 (commit 1139236) [ck] Enforce LF-only line endings in C/C++ sources ## Summary Several CK source files carry Windows CRLF line endings (a trailing carriage return on each line), introduced by editors configured for Windows endings or copy/paste from Windows tooling. These are purely cosmetic but they pollute diffs (whole-file churn the first time someone makes an LF edit), confuse `clang-format`, and are inconsistent with the LF-only convention used across the rest of the tree. This PR (a) normalizes every existing CRLF file (6 files) to LF and (b) adds a pre-checkin gate so new CRLF leaks are rejected before merge. ## File extensions covered Both the cleanup scan and the new Jenkins enforcement stage use the same predicate as the adjacent `ASCII Only Check` stage: ``` .h .hpp .cpp .h.in .hpp.in .cpp.in .inc .cl ``` (excluding `/build/` and `/include/rapidjson/`). The local pre-commit hook's `c++/inc` type filter covers the same set. ## Why no enforcement today CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so the existing `pre-commit` workflow doesn't touch CK. The local CK `.pre-commit-config.yaml` only runs for developers who installed hooks. The authoritative gate is therefore the new Jenkins stage in this PR; the local hook is convenience. ## Commit layout (bisect-friendly) 1. `[ck] Normalize CRLF line endings to LF in C/C++ sources` Mechanical line-ending cleanup across 6 files. No content change: every edit is purely CRLF -> LF, verified with `git diff --ignore-cr-at-eol` reporting an empty diff. 2. `[ck] Enforce LF-only line endings in C/C++ sources` - New `projects/composablekernel/script/check_no_crlf.sh` (modeled on `check_ascii_only.sh`). - New `crlf-checker` entry in `projects/composablekernel/.pre-commit-config.yaml` under the local-hooks block (`types_or: [c++, inc]`). 
- New `CRLF Check` parallel stage in `projects/composablekernel/Jenkinsfile`'s `Static checks` block, mirroring the adjacent `ASCII Only Check` stage. Always-on, no `RUN_CPPCHECK` gate. The tree is buildable at every commit boundary. Commit 1 leaves 0 CRLF violations; commit 2 wires the gate. ## Demo Script output on a synthesized violation: ``` $ printf 'int main() {}\r\n' > /tmp/bad.cpp $ projects/composablekernel/script/check_no_crlf.sh /tmp/bad.cpp ERROR: /tmp/bad.cpp contains CRLF (Windows) line endings: 1:int main() {}<CR> Fix: convert to LF, e.g. 'sed -i 's/\r$//' /tmp/bad.cpp' or 'dos2unix /tmp/bad.cpp' $ echo $? 1 ``` Full repo scan after the cleanup commit: ``` $ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \ -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \ -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \ | xargs -0 -P 8 -n 64 script/check_no_crlf.sh $ echo $? 0 ``` ## Test plan - [ ] Jenkins PR build: confirm new `Static checks -> CRLF Check` stage runs green over the full predicate and the existing `ASCII Only Check` / `Clang Format` stages are unaffected. - [ ] Local: `pre-commit run crlf-checker --all-files` runs cleanly after installing CK pre-commit hooks. - [ ] Manually inject a CRLF line ending in any `.cpp/.hpp/.inc` file, push: confirm Jenkins fails the new stage with a clear error. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-06-12 21:11:59 +00:00
Brock Hargreaves	96a7e44832	[rocm-libraries] ROCm/rocm-libraries#8378 (commit d68585d) [CK] Pre-emptively add groovy/ folder and skip TheRock CI filter (#8378) ## Motivation The CK Groovy library is growing and will be reorganized into a self-describing `groovy/` folder rather than living under `src/` and `vars/`. This PR creates that folder pre-emptively and adds it to the TheRock CI skip-list so that future Groovy additions do not unnecessarily trigger TheRock builds. ## Technical Details - Added `projects/composablekernel/groovy/` with a `.gitkeep` to establish the directory in the repo. - Added `"projects/composablekernel/groovy/"` to `SKIPPABLE_PATH_PATTERNS` in `.github/scripts/therock_configure_ci.py` alongside the existing `vars/` entry, ensuring changes confined to Groovy pipeline code are recognized as non-therock-relevant and skip the TheRock CI pipeline. ## Test Plan No code logic was changed. Verified that `therock_configure_ci.py` pattern list is consistent with the existing `vars/*` skip entry and that the new pattern follows the same glob convention. ## Test Result N/A — directory scaffolding and CI filter only; no functional code affected. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 20:11:53 +00:00
Illia Silin	d450749933	[rocm-libraries] ROCm/rocm-libraries#8357 (commit 800965c) [CK] Re-enable HIPRTC codegen tests for all CK PRs. ## Motivation At the time when we introduced the smart test filter to only build and run tests affected by the PR changes, we disabled the client examples, which required full CK build, and also the hiprtc tests that were grouped with the client examples. This caused a few PRs to sneak through that caused the hiprtc compilation to fail. By restoring the hiprtc tests in all PRs, we should close this gap. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 19:19:44 +00:00
Illia Silin	789ef38093	[rocm-libraries] ROCm/rocm-libraries#8333 (commit 69b3fc1) Revert "[CK_TILE] Implement RTC API for a subset of FMHA functionality for MGX" (#8333) Reverts ROCm/rocm-libraries#6086 Need to revert as the codegen test for fmha is failing due to including std header: [2026-06-11T22:36:03.673Z] In file included from /tmp/comgr-953928-0-473822/include/ck/host/device_fmha_fwd/fmha_fwd_wrapper.hpp:8: [2026-06-11T22:36:03.673Z] In file included from /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/cmath:49: [2026-06-11T22:36:03.673Z] In file included from /bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/std_abs.h:38: [2026-06-11T22:36:03.673Z] /usr/include/stdlib.h:32:10: fatal error: 'stddef.h' file not found [2026-06-11T22:36:03.673Z] 32 | #include <stddef.h> [2026-06-11T22:36:03.673Z] | ^~~~~~~~~~ The ck_tile headers were never prepped for hiprtc compilation.	2026-06-12 18:19:31 +00:00
Wojciech Laskowski	c2601f38b7	[rocm-libraries] ROCm/rocm-libraries#6569 (commit 393049e) Adding amdgcn_mma specializations for sparse MFMA builtins (#6569) ## Motivation This PR is part of the [WMMA/MFMA] unification work. It's the fourth of the series of PRs (after https://github.com/ROCm/rocm-libraries/pull/5801, https://github.com/ROCm/rocm-libraries/pull/6014 and https://github.com/ROCm/rocm-libraries/pull/6567) that add all the necessary MMA builtins as amdgcn_mma structs. This PR focuses on sparse MFMA intrinsics. ## Technical Details This change adds new specializations for MFMA sparse builtins. In total, we add 27 MFMA builtins. ## Test Plan All the new wrappers were added to the test suite in `test_amdgcn_mma_layout.inc`. ## Test Result Test pass locally, waiting for the CI. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-12 12:48:29 +00:00

3453 Commits