composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-12 01:10:17 +00:00

Author	SHA1	Message	Date
Yaswanth Raparti	c1127a36f5	[rocm-libraries] ROCm/rocm-libraries#5676 (commit 1d18339) [CK][CK TILE]Autotuning heuristics infra for universal GEMM kernel selection (#5676) ## Motivation This PR adds ML-based kernel selection heuristics to the CK Tile dispatcher, enabling fast and accurate automatic kernel selection for Universal Gemm kernels. Instead of requiring exhaustive search through 4600+ kernel configurations (taking ~46 seconds per problem shape), the ML heuristic predicts optimal kernels in microseconds while achieving >98% of oracle-best performance. ## Technical Details ML infrastructure https://github.com/ROCm/rocm-libraries/tree/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics * Feature Engine ([feature_engine.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/feature_engine.py)): 55-feature extraction including problem dimensions, kernel configuration, tile efficiency, and hardware profile * Training Pipeline ([train.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/train.py)): LightGBM regression with log-transform, GroupKFold cross-validation, warm-start support * Predictor ([predict.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/predict.py)): Kernel ranking and TFLOPS prediction for problem shapes * Evaluation ([evaluate.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/evaluate.py)): Comprehensive metrics including efficiency, NDCG@k, shape family analysis Data Generation Tools: * 
[generate_benchmark_data.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/generate_benchmark_data.py): Build and benchmark kernels across diverse problem shapes * [convert_json_to_parquet.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/convert_json_to_parquet.py): Convert benchmark JSON to training-ready parquet format * [data_pipeline.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/heuristics/data_pipeline.py): Parse streaming benchmark logs into canonical datasets Examples * [09_ml_heuristic.cpp](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/examples/gemm/cpp/09_ml_heuristic.cpp): C++ example demonstrating ML-based kernel selection * [09_ml_heuristic.py](https://github.com/ROCm/rocm-libraries/blob/users/vanantha/ck/dispatcher-heuristics/projects/composablekernel/dispatcher/examples/gemm/python/09_ml_heuristic.py): Python example with validation Pre-trained Models (projects/composablekernel/dispatcher/heuristics/models/): * gemm_universal_fp8_gfx950/: fp8 RCR model (42K trees, 97.51% mean efficiency) * gemm_universal_fp16_gfx950/: fp16 RCR model (20K trees, 99.36% mean efficiency) ## Test Plan * Evaluated on 25 diverse shapes for fp16, 168 shapes for fp8 * All shape families tested: tiny M (M<8), small M, medium M, large M (M≥1024) * All pipeline types: compv3, compv4, mem ## Test Result fp16 Model (gfx950, RCR layout) * Mean Efficiency: 99.36% * P10 Efficiency: 98.05% (90% of shapes achieve ≥98% of oracle best) * Min Efficiency: 95.45% fp8 Model (gfx950, RCR layout) * Mean Efficiency: 98.28% (original), 97.51% (wide coverage) * P10 Efficiency: 94.64% (original), 93.89% (wide coverage) * Min Efficiency: 84.5% ## Submission 
Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-02 02:26:32 +00:00
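The efficiency figures in the commit above compare the kernel picked by the predictor against the oracle-best kernel for each shape. A minimal sketch of how such metrics could be computed follows. It assumes per-kernel predicted and measured TFLOPS are available as dictionaries; the function names, the nearest-rank percentile choice, and the sample numbers are hypothetical and are not taken from `evaluate.py`.

```python
import statistics


def pick_kernel(predicted_tflops):
    """Return the kernel name with the highest predicted TFLOPS."""
    return max(predicted_tflops, key=predicted_tflops.get)


def selection_efficiency(predicted_tflops, measured_tflops):
    """Measured TFLOPS of the picked kernel divided by the oracle-best TFLOPS."""
    chosen = pick_kernel(predicted_tflops)
    return measured_tflops[chosen] / max(measured_tflops.values())


def summarize(efficiencies):
    """Mean, P10 (nearest-rank 10th percentile), and minimum efficiency."""
    ordered = sorted(efficiencies)
    rank = max(0, -(-len(ordered) // 10) - 1)  # ceil(n / 10) - 1
    return {
        "mean": statistics.fmean(ordered),
        "p10": ordered[rank],
        "min": ordered[0],
    }


# Hypothetical numbers for one problem shape (not taken from the PR).
predicted = {"kernel_a": 410.0, "kernel_b": 395.0, "kernel_c": 120.0}
measured = {"kernel_a": 400.0, "kernel_b": 405.0, "kernel_c": 130.0}
efficiency = selection_efficiency(predicted, measured)  # 400 / 405
```

With this definition, a P10 of 98% means that roughly 90% of the evaluated shapes reach at least 98% of the oracle-best throughput.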
Jobbins	404a5ce1a4	[rocm-libraries] ROCm/rocm-libraries#6107 (commit e69d1b2) [CK] poll every 6 hours as workaround	2026-04-01 19:53:41 +00:00
Chinmay Dattanand Kuchinad	2bb69a24ea	[rocm-libraries] ROCm/rocm-libraries#5776 (commit ee1bbcb) [CK] Fix async pivot mismatch in persistent GEMM kernel scheduler (#5776) ## Motivation Fix pivot mismatch in the persistent GEMM kernel's async input scheduler that causes GPU hangs and incorrect results when used with AsyncTP (Asynchronous Tensor Parallelism) on ROCm. PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM kernel with chunk signals to overlap communication and computation. The pivot mechanism ensures each rank starts computing from its own local shard first (which is already available), then moves to remote chunks as they arrive over the network. Because of the pivot mismatch, the kernel frequently waits on signals for chunks that have not yet arrived, while attempting to read data from completely different chunks. This synchronization desync reliably triggers infinite hangs during multi-GPU native AsyncTP execution. This fix is required to enable functional AsyncTP support on ROCm. ## Technical Details In the persistent kernel loop (`UniversalGemmKernel::operator()`), the M-tile coordinate used for data selection (`i_m`) and the M-tile coordinate used for the chunk-signal wait (`chunk_idx`) were derived from inconsistent bases: * `i_m` was computed from the unpivoted tile index `iM`. * `chunk_idx` was computed from the pivoted expression `(iM + tile_idx_pivot)`. This means the kernel could wait for chunk N's signal but then read from chunk M's memory, or vice versa. The mismatch scales with GPU count: with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc. The Fix: Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m` and `chunk_idx` from it. This guarantees the kernel always waits for the correct chunk before reading its data. (Note: Minor cosmetic `clang-format` changes were also pulled in alongside the fix). ## Test Plan 1. Build PyTorch with this CK change. 2. 
Run the specific multi-GPU AsyncTP native test: `timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest test/distributed/test_symmetric_memory.py -k test_fused_all_gather_matmul_native -q -s -x` ## Test Result Tests verify correct overlapping execution without hangs or accuracy mismatches when running the AsyncTP native path with non-zero pivots. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 16:22:08 +00:00
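The pivot mismatch described in the commit above can be modelled in a few lines. This is a sketch under stated assumptions, not the kernel code: the modulo wrap-around, the one-chunk-per-rank layout, the `pivot = rank * tiles_per_chunk` rule, and every function and parameter name other than `iM`, `iM_eff`, `i_m`, `chunk_idx`, and `tile_idx_pivot` are hypothetical.

```python
def coords_buggy(iM, tile_idx_pivot, tiles_m, m_per_block, tiles_per_chunk):
    """Data offset from the unpivoted index, signal index from the pivoted one."""
    i_m = iM * m_per_block
    chunk_idx = ((iM + tile_idx_pivot) % tiles_m) // tiles_per_chunk
    return i_m, chunk_idx


def coords_fixed(iM, tile_idx_pivot, tiles_m, m_per_block, tiles_per_chunk):
    """Derive both the data offset and the signal index from one pivoted index."""
    iM_eff = (iM + tile_idx_pivot) % tiles_m
    i_m = iM_eff * m_per_block
    chunk_idx = iM_eff // tiles_per_chunk
    return i_m, chunk_idx


def mismatch_fraction(coords_fn, num_ranks, tiles_per_chunk=4, m_per_block=256):
    """Fraction of tiles, over all ranks, that wait on one chunk but read another."""
    tiles_m = num_ranks * tiles_per_chunk
    wrong = 0
    total = 0
    for rank in range(num_ranks):
        pivot = rank * tiles_per_chunk  # assumed: each rank starts at its own shard
        for iM in range(tiles_m):
            i_m, chunk_idx = coords_fn(iM, pivot, tiles_m, m_per_block, tiles_per_chunk)
            data_chunk = (i_m // m_per_block) // tiles_per_chunk
            wrong += data_chunk != chunk_idx
            total += 1
    return wrong / total
```

Under these assumptions, rank 0 (pivot 0) is unaffected while every other rank waits on the wrong chunk for all of its tiles, which reproduces the roughly 50% and 75% figures quoted for 2 and 4 GPUs.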
Jobbins	9426f49b52	[rocm-libraries] ROCm/rocm-libraries#6064 (commit cce30ab) [CK] poll develop every 15 minutes for changes	2026-04-01 14:35:42 +00:00
Fu-Cheng Tsai	a502e5a00b	[rocm-libraries] ROCm/rocm-libraries#5798 (commit 7acd4e7) [CK_TILE] Update gfx12 FMHA forward kernel configs	2026-04-01 14:23:38 +00:00
aledudek	119712bd90	[rocm-libraries] ROCm/rocm-libraries#4469 (commit 0844cb0) [CK_TILE] Add pooling in tile_engine ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> Add pooling in ck tile engine ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 07:32:36 +00:00
Yi DING	791afc6465	[rocm-libraries] ROCm/rocm-libraries#5991 (commit 8d85e8e) [CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991) ## Motivation After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard around the `num_total_loop <= 0` early-exit check, the IGLP pipeline (`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV gradients for non-masking kernels (even with fix in #5915). Assembly inspection confirms that the CFG change causes the LLVM register allocator to reuse AGPR accumulators as scratch destinations in the dK/dV reduction loop, breaking the loop-carried accumulation across Q-tile iterations. ## Technical Details - Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in `BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing: it restores the CFG shape that the register allocator needs to correctly assign dedicated AGPRs to each column of the dK/dV accumulator. - Only the IGLP pipeline is affected; the other two BWD pipelines do not exhibit this issue. ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 05:45:19 +00:00
Estevan Vedovelli	a33b5be1b9	[rocm-libraries] ROCm/rocm-libraries#6022 (commit 54b284a) [CK] contraction: extend GetTypeString() to include layout-differentiating params (#6022) ## Motivation Consumers that identify kernels by their `GetTypeString()` (such as hipTensor's actor-critic kernel selection, which hashes the string into a stable cross-platform UID) were silently dropping one of two colliding variants during registry insertion. `GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle` previously printed 13 template parameters, omitting `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. These four parameters determine the block-transfer access width and LDS padding strategy, and are precisely what differentiates the `kk`, `kn`, `mk`, and `mn` layout variants from one another when all other geometry parameters are equal. Two instantiations with identical 13-parameter strings are distinct C++ types that accept different stride layouts and reject each other's arguments via `IsSupportedArgument`. This patch extends the output to 17 parameters so that every distinct template instantiation of this class produces a unique `GetTypeString()`. ## Technical Details `include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`: - extend `GetTypeString()` from 13 to 17 parameters including `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. ## Test Plan Build CK and hipTensor with these changes, and verify hipTensor can differentiate and select the correct kernels with layout variations. ## Test Result CK is building correctly and hipTensor is selecting the kernels correctly. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 15:19:43 +00:00
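The collision described in the commit above can be illustrated with a toy registry keyed by a hash of the type string. This is only a sketch: the SHA-256 truncation, the registry shape, and all parameter values are assumptions made for illustration and do not reflect hipTensor's actual UID scheme or any real CK instance.

```python
import hashlib


def type_string(geometry, layout_params, extended):
    """Build a type string from 13 geometry fields, plus 4 layout fields if extended."""
    fields = list(geometry)
    if extended:
        fields += list(layout_params)
    body = ", ".join(str(f) for f in fields)
    return "DeviceContractionMultipleD_Xdl_CShuffle<" + body + ">"


def kernel_uid(text):
    """Stable identifier derived from the type string (hypothetical hashing scheme)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def build_registry(variants, extended):
    """Insert every variant under its UID; a colliding UID overwrites the earlier entry."""
    registry = {}
    for name, geometry, layout_params in variants:
        registry[kernel_uid(type_string(geometry, layout_params, extended))] = name
    return registry


# Placeholder values: identical geometry, different vector widths and LDS padding.
GEOMETRY = (256, 256, 128, 16, 4, 4, 32, 32, 4, 2, 4, 4, 1)
VARIANTS = [
    ("kk", GEOMETRY, (4, 4, 1, 1)),
    ("kn", GEOMETRY, (4, 1, 1, 0)),
]
```

With the 13-field string both variants map to the same key and one is silently dropped; with the 17-field string each variant keeps its own entry.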
Bartłomiej Kocot	ef4ff4667d	[rocm-libraries] ROCm/rocm-libraries#5842 (commit 04c5690) [CK][CK Tile] Force padding for atomic_add bf16 C tensor (#5842) ## Motivation Force padding for the atomic_add bf16 C tensor to avoid memory faults. ## Technical Details - add global atomic add for bf16 and enable it - add padding for bf16 atomic add due to the lack of out-of-bounds (OOB) handling - remove padding for non-contiguous dims in conv for other cases - minor bwd data conv fixes ## Test Plan test_grouped_conv_*_tile ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 08:03:41 +00:00
jakpiase	66dc81d530	[rocm-libraries] ROCm/rocm-libraries#5729 (commit 516c974) [CK_TILE] Changed cshuffle LDS descriptor to naive layout (#5729) ## Motivation This PR changes the GEMM/convolution cshuffle LDS layout to a plain one to improve cshuffle performance. ## Technical Details Before this change, the cshuffle layout applied descriptor transformations that were probably aimed at reducing LDS bank conflicts, but the transformations themselves were very slow, which hurt performance. ## Test Plan No additional tests are needed, since the current tests cover this functionality.	2026-03-31 03:40:25 +00:00
Illia Silin	e6b8094f94	[rocm-libraries] ROCm/rocm-libraries#5921 (commit 032ac1b) [CK] fix clang lifetimebound errors with staging compiler (#5921) ## Motivation The ROCm staging compiler (newer Clang) enforces `[[clang::lifetimebound]]` annotations on methods that return references or pointers to internal object data. Without these annotations, the staging compiler emits compilation errors for container accessor methods across the CK and CK Tile namespaces. ## Technical Details Adds `[[clang::lifetimebound]]` to all reference/pointer-returning accessors in core container types: `ck::` namespace: - `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()` - `index_array` -- `operator[]` - `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()` - `IndexLookupTable` -- `operator[]` `ck_tile::` namespace: - `array` -- `get(i)`, `at()`, `operator[]`, `operator()` - `static_array` -- `operator[]` - `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()` - `make_kernel()` -- parameter pack Also removes the unused `instance_index` variable from `batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing accordingly. ## Test Plan - Compile with the staging compiler to verify all lifetimebound errors are resolved - Existing tests pass unchanged -- the attribute is a compile-time annotation with no runtime effect ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 14:20:20 +00:00
Hosang Yoon	2dcae9d173	[rocm-libraries] ROCm/rocm-libraries#5977 (commit 794bea7) [CK_TILE] Fix Windows build in FMHA head grouping ## Motivation This is a follow-up fix for [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018). [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018) added LLC-aware FMHA head grouping / head-major scheduling on RDNA, but it also introduced Linux-only code paths, including `<dirent.h>`, which break Windows builds. This change fixes that by guarding the Linux-specific LLC probing logic so non-Linux platforms can still build correctly. ## Technical Details - Guard `<dirent.h>` with `#ifdef __linux__` - Guard KFD sysfs traversal logic with `#if defined(__linux__)` - On non-Linux platforms, return `0` from `get_kfd_sysfs_llc_cache_bytes()` - Preserve existing fallback behavior through: - `CK_TILE_FMHA_LLC_CACHE_MB` - arch-based default LLC sizes - no head grouping when no LLC size can be resolved ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 14:19:19 +00:00
Jeff Huang	7968368d92	[rocm-libraries] ROCm/rocm-libraries#5918 (commit a7e2c67) [CK][CK_TILE] Add fp8bf16 hdim=256 tile for batch prefill (#5918) ## Motivation FP8 batch prefill kernels currently only support head_dim=128. Models with head_dim=256 hit the "invalid argument for batch_prefill" error because no matching kernel variant exists in the codegen dispatch. ## Technical Details Add a hdim=256 tile size entry for fp8bf16 in the batch prefill codegen recipe (`fmha_batch_prefill.py`). Tile configuration: `FmhaFwdTileSize(128, 128, 32, 256, 32, 256, 4,1,1, 4,1,1, 32,32,32, 32,32,32, -1)` - bm0=128, bn0=128 (Q/K tile sizes) - bn1=256, bk0max=256 (V head_dim=256) - Warp MFMA 32x32x32 (fp8 MFMA instructions) This mirrors the existing bf16/fp16 hdim=256 tile but uses fp8 warp sizes. ## Test Plan Tested on both MI308X (gfx942) and MI355X (gfx950) via aiter batch prefill test with the following matrix: - page_size: {1, 16, 1024} - kv_layout: {linear, vectorized} - lookup_table: {sglang, vllm} - causal: {true, false} - logits_soft_cap: {0.0, 30.0} - contiguous_kv: {true, false} ## Test Result MI308X (gfx942): 160 passed, 32 skipped (page_size=1 + vectorized not applicable) MI355X (gfx950): 120 passed, 72 skipped (pre-existing ROCm 7.2 compiler issue with causal + no softcap) No register spills on either platform. 
### Profiling — MI355X (gfx950), FP8 pertensor, hdim=256, seqlen=1024, 8 heads \| page_sz \| kv_layout \| table \| causal \| soft_cap \| time_us \| TFLOPS \| \|---------\|-----------\|-------\|--------\|----------\|---------\|--------\| \| 1 \| linear \| sglang \| False \| 0.00 \| 55.01 \| 156.16 \| \| 1 \| linear \| vllm \| False \| 0.00 \| 55.12 \| 155.84 \| \| 1 \| linear \| sglang \| False \| 30.00 \| 62.63 \| 137.16 \| \| 1 \| linear \| vllm \| False \| 30.00 \| 62.16 \| 138.20 \| \| 1 \| linear \| sglang \| True \| 30.00 \| 64.09 \| 67.01 \| \| 1 \| linear \| vllm \| True \| 30.00 \| 63.85 \| 67.27 \| \| 16 \| linear \| sglang \| False \| 0.00 \| 57.00 \| 150.69 \| \| 16 \| vectorized \| sglang \| False \| 0.00 \| 57.55 \| 149.25 \| \| 16 \| linear \| vllm \| False \| 0.00 \| 56.80 \| 151.23 \| \| 16 \| vectorized \| vllm \| False \| 0.00 \| 57.32 \| 149.87 \| \| 16 \| linear \| sglang \| False \| 30.00 \| 64.77 \| 132.62 \| \| 16 \| vectorized \| vllm \| False \| 30.00 \| 63.54 \| 135.18 \| \| 16 \| linear \| sglang \| True \| 30.00 \| 66.84 \| 64.26 \| \| 16 \| vectorized \| vllm \| True \| 30.00 \| 66.12 \| 64.96 \| \| 1024 \| linear \| sglang \| False \| 0.00 \| 58.25 \| 147.46 \| \| 1024 \| vectorized \| sglang \| False \| 0.00 \| 57.53 \| 149.31 \| \| 1024 \| linear \| vllm \| False \| 0.00 \| 58.06 \| 147.94 \| \| 1024 \| vectorized \| vllm \| False \| 0.00 \| 57.55 \| 149.27 \| \| 1024 \| linear \| sglang \| False \| 30.00 \| 65.38 \| 131.38 \| \| 1024 \| vectorized \| vllm \| False \| 30.00 \| 63.64 \| 134.98 \| \| 1024 \| linear \| sglang \| True \| 30.00 \| 66.85 \| 64.25 \| \| 1024 \| vectorized \| vllm \| True \| 30.00 \| 65.26 \| 65.81 \| ### Profiling — MI308X (gfx942), FP8 pertensor, hdim=256, seqlen=1024, 8 heads \| page_sz \| kv_layout \| table \| causal \| soft_cap \| time_us \| TFLOPS \| \|---------\|-----------\|-------\|--------\|----------\|---------\|--------\| \| 1 \| linear \| sglang \| False \| 0.00 \| 110.18 \| 77.96 \| \| 1 \| 
linear \| vllm \| True \| 30.00 \| 134.33 \| 31.97 \| \| 1 \| linear \| sglang \| True \| 30.00 \| 134.59 \| 31.91 \| \| 16 \| linear \| sglang \| False \| 0.00 \| 115.43 \| 74.42 \| \| 16 \| vectorized \| sglang \| False \| 0.00 \| 106.11 \| 80.95 \| \| 16 \| linear \| vllm \| False \| 0.00 \| 116.34 \| 73.83 \| \| 16 \| vectorized \| vllm \| False \| 0.00 \| 106.17 \| 80.91 \| \| 16 \| linear \| sglang \| False \| 30.00 \| 135.61 \| 63.34 \| \| 16 \| vectorized \| vllm \| False \| 30.00 \| 122.37 \| 70.20 \| \| 16 \| linear \| sglang \| True \| 0.00 \| 117.44 \| 36.57 \| \| 16 \| vectorized \| vllm \| True \| 0.00 \| 108.81 \| 39.47 \| \| 16 \| linear \| sglang \| True \| 30.00 \| 139.43 \| 30.80 \| \| 16 \| vectorized \| vllm \| True \| 30.00 \| 125.87 \| 34.12 \| \| 1024 \| linear \| sglang \| False \| 0.00 \| 110.65 \| 77.63 \| \| 1024 \| vectorized \| sglang \| False \| 0.00 \| 101.70 \| 84.46 \| \| 1024 \| linear \| vllm \| False \| 0.00 \| 111.71 \| 76.89 \| \| 1024 \| vectorized \| vllm \| False \| 0.00 \| 101.55 \| 84.59 \| \| 1024 \| linear \| sglang \| False \| 30.00 \| 129.33 \| 66.42 \| \| 1024 \| vectorized \| vllm \| False \| 30.00 \| 120.95 \| 71.02 \| \| 1024 \| linear \| sglang \| True \| 0.00 \| 112.26 \| 38.26 \| \| 1024 \| vectorized \| vllm \| True \| 0.00 \| 103.02 \| 41.69 \| \| 1024 \| linear \| sglang \| True \| 30.00 \| 133.73 \| 32.12 \| \| 1024 \| vectorized \| vllm \| True \| 30.00 \| 124.75 \| 34.43 \| ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 10:21:33 +00:00
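The TFLOPS columns in the profiling tables above appear consistent with a FLOP count of 4 x seqlen_q x seqlen_k x hdim x heads, halved for causal attention. That formula and the batch size of 1 are inferred from the reported numbers, not stated in the PR, so the sketch below is a plausibility check and the function name is hypothetical.

```python
def attention_tflops(seqlen_q, seqlen_k, hdim, heads, time_us, causal=False, batch=1):
    """Estimate attention-forward TFLOPS from a measured kernel time in microseconds.

    Assumes two matrix multiplies (Q*K^T and P*V), each costing
    2 * seqlen_q * seqlen_k * hdim FLOPs per head, and half the work when causal.
    """
    flops = 4 * batch * heads * seqlen_q * seqlen_k * hdim
    if causal:
        flops //= 2
    seconds = time_us * 1e-6
    return flops / seconds / 1e12


# Rows from the tables above: seqlen=1024, hdim=256, 8 heads.
mi355x_noncausal = attention_tflops(1024, 1024, 256, 8, time_us=55.01)
mi355x_causal = attention_tflops(1024, 1024, 256, 8, time_us=64.09, causal=True)
mi308x_noncausal = attention_tflops(1024, 1024, 256, 8, time_us=110.18)
```

Small differences from the table values are expected because the listed times are rounded to two decimals.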
Yi DING	fb64a4453c	[rocm-libraries] ROCm/rocm-libraries#5915 (commit a72cf7d) [CK_TILE] Fix FMHA BWD register pressure by wrapping num_total_loop with amd_wave_read_first_lane (#5915) ## Motivation In three FMHA backward pipelines, `num_total_loop` is computed without `amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even though it is logically uniform across all lanes. This raises register pressure, and under high pressure the compiler may reuse VGPRs across overlapping live ranges. This was confirmed via assembly inspection: the compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an intermediate value for dV, producing incorrect dK/dV gradients. ## Technical Details Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three pipelines: - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr` - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp` - `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr` This promotes `num_total_loop` to an SGPR, eliminating the excess register pressure and the incorrect VGPR reuse. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 01:45:16 +00:00
Jan Patrick Lehr	b6bbada9f1	[rocm-libraries] ROCm/rocm-libraries#5639 (commit a65e645) [CK] More lifetime-warning suppression ## Motivation The staging compiler picked up another change from upstream that leads to more lifetime-analysis warnings. This breaks the build, given CK is built with -Werror. As a result, compiler promotion is blocked. ## Technical Details This patch adds the pragma push diagnostics to ignore the lifetime-warnings in the modified files to unblock compiler promotion. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-28 11:20:51 +00:00
Linjun-AMD	3b55a05e71	[rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2) [CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op (#5095)" (#5849) This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 20:37:23 +00:00
Bartłomiej Kocot	c28d0033d7	[rocm-libraries] ROCm/rocm-libraries#5785 (commit d8ecfc1) [CK] Fix min k_batch calculation in conv kernels ## Motivation Avoid division by 0 and remove the unneeded "-1". ## Technical Details Our div-up implementation already returns the exact quotient when the input is divisible, so there is no need to subtract 1. ## Test Plan test_grouped_conv_bwd_weight ## Test Result Passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1019	2026-03-27 15:38:21 +00:00
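The div-up behaviour referred to in the commit above can be checked with a small sketch. The helper name and the explicit divisor guard are assumptions for illustration; the actual CK helper and the k_batch formula are not shown in the commit message.

```python
def integer_divide_ceil(numerator, divisor):
    """Ceiling division for non-negative integers, rejecting a zero divisor."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    return (numerator + divisor - 1) // divisor


# Divisible input: the result is the exact quotient, so no "-1" correction is needed.
exact = integer_divide_ceil(8, 4)
# Non-divisible input: the result is rounded up.
rounded_up = integer_divide_ceil(9, 4)
```

Subtracting 1 from the divisible case would yield 1 instead of 2, and, depending on how the result is used, could produce a zero that is later used as a divisor.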
Illia Silin	4c926497ad	[rocm-libraries] ROCm/rocm-libraries#5829 (commit 19b2813) [CK] Fix error in dockerfile when building staging compiler. (#5829) ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 15:37:21 +00:00
Johannes Graner	58475d3f45	[rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649) [CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393) ## Motivation Add StreamK work distribution to the CK Tile grouped convolution backward weight kernel. Split-K divides the K-dimension uniformly across a fixed `k_batch`, which causes load imbalance when the number of output tiles doesn't evenly fill the GPU. StreamK distributes total K-iterations evenly across workgroups, improving utilization on these shapes. ## Technical Details StreamK is added as an `if constexpr` branch in the existing kernel, selected by the `TilePartitioner_` template parameter. Two reduction strategies are supported: - Linear: tile-starter sequentially accumulates partials from contributing CTAs - Tree: pairwise binary tree reduction (O(log n) depth, faster for many contributors) Both persistent and non-persistent data-parallel (DP) sections are supported. Key changes: - `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution path with `RunStreamK`/`RunStreamKLoop`, partial store/load via workspace, flag-based cross-CTA synchronization, `GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions - `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers) and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by both GEMM and Conv StreamK kernels - `streamk_gemm_kernel.hpp`: Refactored to use shared helpers - Merged split-K and StreamK example invokers via `PartitionerPolicy` template parameter - StreamK example binary with `--streamk_reduction=linear\|tree` and `--streamk_persistent=0\|1` - CK Builder integration: `SpecifiesStreamK` concept, `TilePartitionerType` factory helper, `InstanceTraits` with StreamK fields - 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP), negative, builder regression ### Performance (MI355X, gfx950) Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}): \| Shape \| 16x64 tiles \| \| 
128x128 tiles \| \| \|---\|---\|---\|---\|---\| \| \| Split-K \| StreamK \| Split-K \| StreamK \| \| 1x1 128x128 N=32 28x28 \| 1.00x \| 0.54x \| 1.00x \| 0.81x \| \| 3x3 128x128 N=32 14x14 \| 1.00x \| 0.59x \| 1.00x \| 0.62x \| \| 1x1 256x64 N=32 56x56 \| 1.00x \| 0.83x \| 1.00x \| 1.83x \| \| 3x3 512x512 N=2 7x7 \| 1.00x \| 1.12x \| 1.00x \| 0.62x \| \| 1x1 1024x1024 N=4 7x7 \| 1.00x \| 1.09x \| 1.00x \| 0.60x \| \| 3x3 128x128 N=32 28x28 \| 1.00x \| 0.44x \| 1.00x \| 0.96x \| \| 3x3 256x256 N=32 14x14 \| 1.00x \| 0.67x \| 1.00x \| 0.93x \| \| 3x3 512x512 N=32 7x7 \| 1.00x \| 0.98x \| 1.00x \| 1.16x \| StreamK's value depends on tile config: with larger tiles (fewer output tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up to 1.16x on typical large-channel convolutions. Tree reduction consistently outperforms Linear when multiple CTAs contribute to the same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n) sequential accumulation. The table reports the best of Linear and Tree for each shape. ## Test Plan ```bash ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk ./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk # Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON) ninja -C build check-builder ``` 30 tests covering: - Host-side: type traits, kernel args construction, grid size, workspace size - GPU end-to-end (Linear + Tree): small/medium shapes, multi-group, stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher occupancy - Persistent DP: Linear + Tree with persistent data-parallel dispatch - Negative: `IsSupportedArgument` rejects unaligned K and C - Builder: Create (instance string validation) + Execution (reference comparison) + instance string regression ## Test Result All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK tests pass. Full `check-builder` suite passes. Tolerances computed dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware). 
## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 09:18:14 +00:00
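The two reduction strategies named in the StreamK commit above differ in the number of dependent steps. A host-side sketch with plain numbers standing in for partial tiles follows; the function names are hypothetical and the real kernel synchronizes CTAs through flags and a workspace, which this model ignores.

```python
def linear_reduce(partials):
    """Sequentially accumulate partials; returns (sum, number of dependent steps)."""
    acc = partials[0]
    steps = 0
    for value in partials[1:]:
        acc += value
        steps += 1
    return acc, steps


def tree_reduce(partials):
    """Pairwise binary-tree reduction; returns (sum, tree depth)."""
    values = list(partials)
    depth = 0
    while len(values) > 1:
        # Combine neighbours; an unpaired last element is carried to the next level.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])
        values = paired
        depth += 1
    return values[0], depth


partials = [1, 2, 3, 4, 5, 6, 7, 8]  # one partial result per contributing CTA
linear_total, linear_steps = linear_reduce(partials)
tree_total, tree_depth = tree_reduce(partials)
```

For eight contributors the linear path needs seven dependent additions while the tree needs three levels, which is the O(n) versus O(log n) difference the commit message cites.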
arai713	36f2ec23f5	[rocm-libraries] ROCm/rocm-libraries#5445 (commit 2cdbf8b) [CK_TILE] Support for CompV4 pipeline in Stream-K GEMM (#5445) ## Motivation This PR is extending the pipeline support for Stream-K GEMM by adding the CompV4 pipeline. Additional pipelines will be added in subsequent PRs. ## Technical Details - Enable the CompV4 pipeline by adding an option to set DoubleSMemBuffer to true if the CompV4 pipeline has been selected as it requires double buffered shared memory - Addition of CompV4 pipeline into the extended tests: kernel instances mirror the existing CompV3/Mem configurations (same layout permutations, data types, and tile sizes) with the pipeline type set to CompV4. - Addition of CompV4 pipeline into smoke tests (generated using Tile Engine) ## Test Plan These were tested using the existing smoke and extended tests. ## Test Result All tests passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 08:13:27 +00:00
Yi DING	47a04fda08	[rocm-libraries] ROCm/rocm-libraries#5790 (commit c132b5a) [CK_TILE] Fix NaN for FMHA BWD When seq_q=0 ## Motivation This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the effective query sequence length for a tile is zero, by ensuring the per-tile pipelines exit early with zeroed accumulators and by avoiding an early kernel return that prevented writing out cleared gradients. ## Technical Details - Add unconditional early-exit in the dK/dV pipelines when `num_total_loop <= 0` (no work), returning zeroed accumulators. - Adjust group-mode kernel early-return logic to only return when both `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and store cleared dK/dV when `seqlen_q == 0`. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 07:54:53 +00:00
Yaswanth Raparti	e2470e837a	[rocm-libraries] ROCm/rocm-libraries#5880 (commit a6b6c05) [CK][CK_TILE] Fix CTest parsing to handle all test number formats (#5880) ## Motivation Fix a bug in the smart-build --ctest-only filter that was incorrectly excluding tests with numbers less than 100. ## Technical Details The issue was caused by CTest formatting test numbers with variable spacing based on the number of digits: - "Test `#1`: name (3 spaces for tests 1-9)" - "Test `#79`: name (2 spaces for tests 10-99)" - "Test `#100`: name (1 space for tests 100+)" The previous code used `line.strip().startswith("Test #")` which only matched tests with a single space (i.e., test numbers >= 100). This caused tests like ck_tile_unit_sequence (Test #79) to be excluded from smart-build test selection, resulting in CTest failures when the binary wasn't built. Solution: Replace string matching with a regex pattern that handles all spacing variations: r'^\sTest\s+#\d+:\s(.+)$' ## Test Plan Tested with test numbers from 1 to 12345. ## Test Result - Before: 48 tests selected (only tests #100+) - After: 146 tests selected (all CTest-registered tests) ## Submission Checklist - [x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-03-27 06:34:12 +00:00
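The parsing issue in the commit above can be reproduced with a short sketch. The sample listing and the exact pattern are assumptions: the regex shown in the commit message may have lost characters in rendering, so the pattern below is one that tolerates variable spacing rather than a copy of the one merged in the PR.

```python
import re

# Accepts any amount of whitespace before "Test" and between "Test" and "#<number>:".
TEST_LINE = re.compile(r"^\s*Test\s+#\d+:\s*(.+)$")

SAMPLE_OUTPUT = "\n".join([
    "Test project /build",
    "  Test   #1: test_a",
    "  Test  #79: ck_tile_unit_sequence",
    "  Test #100: test_c",
    "",
    "Total Tests: 3",
])


def parse_with_regex(ctest_output):
    """Return test names from a CTest listing, regardless of number width."""
    names = []
    for line in ctest_output.splitlines():
        match = TEST_LINE.match(line)
        if match:
            names.append(match.group(1).strip())
    return names


def parse_with_prefix(ctest_output):
    """Previous approach: only lines with exactly one space before '#' are kept."""
    return [
        line.split(":", 1)[1].strip()
        for line in ctest_output.splitlines()
        if line.strip().startswith("Test #")
    ]
```

On the sample listing the prefix check keeps only the three-digit test, while the regex keeps all three.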
Illia Silin	2f98c7bbef	[rocm-libraries] ROCm/rocm-libraries#5891 (commit 82563ff) fix AITER docker setup ## Motivation Add a new python package required to build AITER. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 04:36:16 +00:00
Bartłomiej Kocot	1c95ce0668	[rocm-libraries] ROCm/rocm-libraries#5856 (commit 2d9a0a1) [CK] Fix unused param mask ## Motivation Compiler error caused by unused param mask. ## Technical Details Skip tests using param mask in test loop. ## Test Plan Current test improvements. ## Test Result Passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 03:58:37 +00:00
dependabot[bot]	6215bb8dbc	[rocm-libraries] ROCm/rocm-libraries#5896 (commit b7436b5) Bump requests from 2.32.5 to 2.33.0 in /projects/composablekernel/docs/sphinx (#5896) Bumps [requests](https://github.com/psf/requests) from 2.32.5 to 2.33.0. ## Release notes Sourced from [requests's releases](https://github.com/psf/requests/releases). ### 2.33.0 (2026-03-25) **Announcements** - 📣 Requests is adding inline types. If you have a typed code base that uses Requests, please take a look at [#7271](https://redirect.github.com/psf/requests/issues/7271). Give it a try, and report any gaps or feedback you may have in the issue. 📣 **Security** - CVE-2026-25645 `requests.utils.extract_zipped_paths` now extracts contents to a non-deterministic location to prevent malicious file replacement. This does not affect default usage of Requests, only applications calling the utility function directly. **Improvements** - Migrated to a PEP 517 build system using setuptools. ([#7012](https://redirect.github.com/psf/requests/issues/7012)) **Bugfixes** - Fixed an issue where an empty netrc entry could cause malformed authentication to be applied to Requests on Python 3.11+. ([#7205](https://redirect.github.com/psf/requests/issues/7205)) **Deprecations** - Dropped support for Python 3.9 following its end of support. ([#7196](https://redirect.github.com/psf/requests/issues/7196)) **Documentation** - Various typo fixes and doc improvements. ### New Contributors - [@M0d3v1](https://github.com/M0d3v1) made their first contribution in [psf/requests#6865](https://redirect.github.com/psf/requests/pull/6865) - [@aminvakil](https://github.com/aminvakil) made their first contribution in [psf/requests#7220](https://redirect.github.com/psf/requests/pull/7220) - [@E8Price](https://github.com/E8Price) made their first contribution in [psf/requests#6960](https://redirect.github.com/psf/requests/pull/6960) - [@mitre88](https://github.com/mitre88) made their first contribution in [psf/requests#7244](https://redirect.github.com/psf/requests/pull/7244) - [@magsen](https://github.com/magsen) made their first contribution in [psf/requests#6553](https://redirect.github.com/psf/requests/pull/6553) - [@Rohan5commit](https://github.com/Rohan5commit) made their first contribution in [psf/requests#7227](https://redirect.github.com/psf/requests/pull/7227) **Full Changelog**: https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25 ## Commits - `bc04dfd` v2.33.0 - `66d21cb` Merge commit from fork - `8b9bc8f` Move badges to top of README ([#7293](https://redirect.github.com/psf/requests/issues/7293)) - `e331a28` Remove unused extraction call ([#7292](https://redirect.github.com/psf/requests/issues/7292)) - `753fd08` docs: fix FAQ grammar in httplib2 example - `774a0b8` docs(socks): same block as other sections - `9c72a41` Bump github/codeql-action from 4.33.0 to 4.34.1 - `ebf7190` Bump github/codeql-action from 4.32.0 to 4.33.0 - `0e4ae38` docs: exclude Response.is_permanent_redirect from API docs ([#7244](https://redirect.github.com/psf/requests/issues/7244)) - `d568f47` docs: clarify Quickstart POST example ([#6960](https://redirect.github.com/psf/requests/issues/6960)) - Additional commits viewable in [compare view](https://github.com/psf/requests/compare/v2.32.5...v2.33.0) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.	2026-03-26 22:01:37 +00:00
joyeamd	046d3ac274	[rocm-libraries] ROCm/rocm-libraries#5789 (commit 6654ca6) [CK][CK_TILE] Revert additional oob check in gemm IsSupported function (#5789) ## Motivation Fix ck_tile's OOB check. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-26 01:41:35 +00:00
Estevan Vedovelli	0004a37de5	[rocm-libraries] ROCm/rocm-libraries#5675 (commit fbd7fa7) [CK] Properly build HIPTENSOR_REQ_LIBS_ONLY targets when used in addition to MIOPEN_REQ_LIBS_ONLY (#5675) ## Motivation When building CK with both -DHIPTENSOR_REQ_LIBS_ONLY=ON and -DMIOPEN_REQ_LIBS_ONLY=ON, only MIOpen targets were being properly installed. This change is necessary to allow hipTensor to build with TheRock without the need to rebuild CK from source. ## Technical Details The solution is to consider both HIPTENSOR_REQ_LIBS_ONLY and MIOPEN_REQ_LIBS_ONLY when including hipTensor's targets in CMake, following the same approach used for the conv target (for MIOpen). ## Test Plan Manually test the build and installation with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` alone and with both `-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON`, and verify that the proper files are installed. ## Test Result The build with `-DHIPTENSOR_REQ_LIBS_ONLY=ON` properly includes the targets contraction, reduction and other, while `-DHIPTENSOR_REQ_LIBS_ONLY=ON -DMIOPEN_REQ_LIBS_ONLY=ON` includes conv, contraction, reduction and other. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-25 23:59:53 +00:00
Illia Silin	86ec92f925	[rocm-libraries] ROCm/rocm-libraries#5571 (commit 8f60932) [CK] fix clang lifetime bound error in ck_builder. ## Motivation This resolves the compilation error with the latest develop compiler branch. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-25 16:45:38 +00:00
Illia Silin	bee61860c2	[rocm-libraries] ROCm/rocm-libraries#5764 (commit f3c1232) Re-enable daily builds with staging compiler ## Motivation This should help us catch and fix any new compilation issues early on. ## Technical Details We now have three compiler profiles: * develop: slightly stabilized version of amd-staging with some of the obvious offending PRs reverted, 1-2 weeks behind amd-staging; * amd-mainline: more stable version of compiler, the baseline for all other branches, e.g., release, npi, etc., 2-4 weeks behind amd-staging; * amd-staging: latest compiler version where all new PRs land, often broken. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: kensclin <lshyhchy@amd.com>	2026-03-25 16:37:58 +00:00
Ville Pietilä	ec2dbfbfde	[rocm-libraries] ROCm/rocm-libraries#5516 (commit ff3afda) [CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler (#5516) ## Motivation We want to close the performance gap between old CK and CK Tile for bwd data convolutions. To achieve this, we need two things: - Configurations for the old CK kernel instances such that we can map them into CK Tile instances. - Support in CK profiler to run the CK Tile instance with the same API as for old CK instances. ## Technical Details Extracted kernel configurations from old CK. The codegen Python script for CK Tile convs is extended to also support bwd data. The generated instances are added to the CMake build (target `device_grouped_conv_bwd_data_tile_instances`). A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the CK Profiler. The API is the same as for old CK's profiler op `grouped_conv_bwd_data`.	2026-03-25 14:36:11 +00:00
joyeamd	1834e318da	[rocm-libraries] ROCm/rocm-libraries#5697 (commit dd1c396) Revert "Ck/joye/revert oob check (#5640)" This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-23 22:05:30 +00:00
andrew clark	5a4243096b	[rocm-libraries] ROCm/rocm-libraries#5713 (commit e179279) Adding New Notification Detection ## Motivation Restricting one of the notification failure patterns to match a specific missing drivers log pattern. This will help reduce the noise of erroneous logs. Also adding a new failure pattern to notify us of Github access issues. ## Technical Details - Set the failure pattern to match the exact failure observed in the logs. - Switching to a plain substring search so special characters are handled literally. - Added a new failure pattern for Github access errors. ## Test Plan - Force a failure using the known failure patterns. ## Test Result The forced failures were triggered and caught by the notification system. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-23 20:57:55 +00:00
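The switch to a plain substring search in the commit above matters because log messages often contain characters that a regex would interpret. A minimal Python sketch, using a hypothetical failure pattern (the real patterns are not given in the message):

```python
import re

# Hypothetical failure pattern containing regex metacharacters.
PATTERN = "error: [driver] not found (code 1)"
LOG_LINE = "12:00:01 node-7 error: [driver] not found (code 1)"

# Regex search treats "[driver]" as a character class and "(code 1)" as a
# group, so the literal text is not matched.
regex_hit = re.search(PATTERN, LOG_LINE) is not None

# Plain substring search handles every character literally.
substring_hit = PATTERN in LOG_LINE

# Escaping the pattern is the alternative when a regex engine must be used.
escaped_hit = re.search(re.escape(PATTERN), LOG_LINE) is not None
```

Here the unescaped regex misses the line, while the substring test and the escaped regex both find it.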
Eiden Yoshida	ba2fb0224f	[rocm-libraries] ROCm/rocm-libraries#5691 (commit 2fbb1fc) [CK] MICI: Revert "add self healing to ref repo" The check may not be working as intended, causing premature deletion of reference repositories	2026-03-23 14:16:53 +00:00
Bartłomiej Kocot	f79926009b	[rocm-libraries] ROCm/rocm-libraries#5555 (commit 1d2c4c8) [CK][CK Tile] Fix kbatch check in grouped conv and gemm kernels (#5555) ## Motivation Fix kbatch check in grouped conv and gemm kernels, allow tails for kbatch. ## Technical Details Round up K / Kperxdl and divide it by Kbatch to allow tail for K. ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-21 22:56:19 +00:00
Emily Martins	6b69ac9676	[rocm-libraries] ROCm/rocm-libraries#5625 (commit 7d2ed43) [CK_TILE] Prune Stream-K Tile Engine Tests ## Motivation Stream-K tile engine tests are causing issues for build time. While we work on a more permanent solution, these changes prune the Stream-K test instances to help reduce the build time burden. ## Technical Details The Stream-K team recently transitioned to using CK Tile's tile engine infrastructure for our smoke tests. However, since tile engine creates an individual target per kernel instance, we've found that the tile engine tests are increasing build times. Our team is currently working to convert our existing tile engine tests back to basic gtests. While this work takes place, we are temporarily pruning the existing Stream-K tile engine test instances to help reduce the build time burden. ## Test Plan Ran the pruned test set on all gfx90a, gfx942, and gfx950. ## Test Result All tests pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 20:31:39 +00:00
andrew clark	a66047ad09	[rocm-libraries] ROCm/rocm-libraries#5464 (commit debfc96) Improved CI infrastructure failure detection ## Motivation This PR re-enables CI infrastructure failure detection and notification, which had been disabled due to performance issues caused by loading large build logs (~80k lines) into memory for pattern scanning. The goal is to reliably detect known infrastructure failures (GPU errors, Docker authentication issues, disk space errors, etc.) and send actionable Teams notifications without hanging on large logs. ## Technical Details - Replaced full build log loading and Groovy-based pattern scanning with a streaming `wget | grep -E` pipe. grep scans natively, so the full log is never loaded into Groovy, resolving the hang on large logs. - Combined all failure patterns into a single `grep -E` call to avoid multiple log fetches. - The node name is now tracked with the observed failure. - Added a new failure pattern for devices running out of space. ## Test Plan - Forced failures in the "Determine CI Execution" stage with all 9 failure patterns echoed to the build log. - Simulated large log sizes (~80k lines of dummy output) to validate pattern detection and node name extraction at realistic log scales, including patterns placed both before and after large blocks of dummy output. ## Test Result All 9 failure patterns detected correctly. Teams notifications sent with accurate log context, node name, and job links. No hangs observed on 80k line simulated logs. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 19:18:07 +00:00
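The idea behind the streaming scan in the commit above, one pass over the log with all patterns combined and no full-log buffer, can be sketched in Python. This is an analogy to the `wget | grep -E` pipe, not the Jenkins code; the pattern strings and the `fake_log` helper are hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical failure patterns; the real list lives in the CI scripts.
FAILURE_PATTERNS = [
    r"No space left on device",
    r"docker login failed",
    r"GPU not found",
]

# One combined alternation, analogous to a single `grep -E 'a|b|c'` call.
COMBINED = re.compile("|".join(f"(?:{p})" for p in FAILURE_PATTERNS))


def scan_log(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line number, text) for matching lines, one line at a time."""
    for lineno, line in enumerate(lines, start=1):
        if COMBINED.search(line):
            yield lineno, line.rstrip("\n")


def fake_log(n: int = 80_000) -> Iterator[str]:
    """Generate a large dummy log lazily, with one failure at the end."""
    for i in range(n):
        yield f"dummy output {i}\n"
    yield "write failed: No space left on device\n"


hits = list(scan_log(fake_log()))
```

Because both the log source and the scanner are generators, memory use stays flat regardless of log length; only the matching lines are kept.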
Jobbins	e8f57c0159	[rocm-libraries] ROCm/rocm-libraries#5630 (commit 14cd617) add self healing to ref repo ## Motivation Check for cases where the mirror repo gets corrupted in CI. ## Technical Details We detect broken ref objects and rebuild the local mirror when it is corrupted. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 16:43:37 +00:00
Bartłomiej Kocot	db40d3f517	[rocm-libraries] ROCm/rocm-libraries#5334 (commit bb5a3c8) [CK][CK Tile] Improve access for merged groups and remove modulo from xor (#5334) ## Motivation Improve access for merged groups and remove modulo from xor. ## Technical Details - Add a template parameter to xor that states whether modulo is needed. We don't need modulo for merged groups. - Use access by m for merged groups for a tensor. ## Test Plan test_grouped_convnd_fwd_tile ## Test Result Passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 15:47:22 +00:00
Bartłomiej Kocot	fd8714aea9	[rocm-libraries] ROCm/rocm-libraries#5609 (commit 95afb2c) [CK][CK Tile] Move grouped conv cpp instances to build dir (#5609) ## Motivation Move grouped conv .cpp instances to build dir. Fix generate instances script. ## Technical Details Avoid CI problem when instances in experimental directory are not removed ## Test Plan test_grouped_convnd_*_tile ## Test Result Pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 13:20:48 +00:00
joyeamd	a22c822aef	[rocm-libraries] ROCm/rocm-libraries#5640 (commit 552ab48) Ck/joye/revert oob check ## Motivation Fix ck_tile's OOB check. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 12:31:27 +00:00
arai713	da863dae1b	[rocm-libraries] ROCm/rocm-libraries#4795 (commit 6590a1a) [CK_TILE] Rename Stream-K grid function ## Motivation This PR introduces a change in the name of the get_grid function in the Stream-K TilePartitioner to avoid confusion with a similarly named method. In the Stream-K TilePartitioner, there is get_grid() which returns num_cu*occupancy and there is grid_size() which returns the grid size used to launch the kernel. In this PR, we change get_grid() to be get_max_active_wgs() to better reflect what the function returns and not confuse it with grid_size(). ## Technical Details Initially in the Stream-K TilePartitioner we had get_grid() which returned grid_. We are renaming get_grid() to get_max_active_wgs() and grid_ to max_active_wgs_ internally, while keeping grid_size() the same. The parameter, grid, for the Stream-K TilePartitioner remains the same to maintain consistency with the rest of the Stream-K API. ## Test Plan Validated using the test suite that is already present. ## Test Result All tests passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 09:28:47 +00:00
yinglu	a268a2a2e1	[rocm-libraries] ROCm/rocm-libraries#5612 (commit 38c9498) [CK] fix: remove redundant structured sparsity check in run_gemm_example.inc (#5612) ## Motivation This issue was found via https://github.com/ROCm/rocm-libraries/pull/4302#discussion_r2958603418 and was introduced by https://github.com/ROCm/rocm-libraries/pull/5323. ## Technical Details The outer `if` and inner `if constexpr` both checked GemmConfig::UseStructuredSparsity. They are merged into a single `if constexpr` since both preshuffle and UseStructuredSparsity are compile-time constants. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 08:23:07 +00:00
Sami Remes	d7c761e060	[rocm-libraries] ROCm/rocm-libraries#5095 (commit 7e55766) [CK_TILE] Enable MXFP6 for MX GEMM op ## Motivation Add support for MXFP6 in the MX GEMM op in CK-Tile. Depends on https://github.com/ROCm/rocm-libraries/pull/4594 ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 01:08:52 +00:00
Yaswanth Raparti	a5d0200ccf	[rocm-libraries] ROCm/rocm-libraries#5614 (commit 32933df) [CK][CK TILE] Fix smart-build to run install target for client examples (#5614) How ninja install works: - Builds library dependencies (device_operations, etc.) - Installs them to CMAKE_INSTALL_PREFIX - Skips building test executables (not install dependencies) Affected stages (8): - gfx942/gfx950/gfx908/gfx90a CK Client Examples - gfx10-1/gfx10-3/gfx11/gfx12 CK Client Examples ## Motivation Problem: When smart-build is enabled (runAllUnitTests=false), the build step is skipped entirely. This causes client example stages to fail because they depend on the CK library being installed to ../install. Error seen: Target "client_gemm" links to: composable_kernel::device_other_operations but the target was not found. ## Technical Details Root cause: Line 712 only checked runAllUnitTests, so when building with config_targets="install", the install target was never built, leaving the install directory empty. Fix: Added condition to always build when config_targets contains 'install'. The install target automatically builds its dependencies (the CK libraries) but skips building tests, which aligns with smart-build philosophy. ## Test Plan Should be tested on CI ## Test Result Should be tested on CI ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-19 22:02:03 +00:00
yinglu	d460ab35b6	[rocm-libraries] ROCm/rocm-libraries#4302 (commit e62bd8a) [CK_TILE] add tf32 support ## Proposed changes TF32 has been added to CK on gfx942 and gfx950. This PR introduces TF32 in CK_TILE on gfx942 and gfx950. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which helps the maintainers understand the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion	2026-03-19 09:19:06 +00:00
Yaswanth Raparti	652d3456ca	[rocm-libraries] ROCm/rocm-libraries#5249 (commit 2a114bb) [CK] [CK_TILE] Improve build and test time of CI with smart dependency parser (#5249) ## Motivation Existing dependency parser needs full build of tests to determine which tests are affected by code changes in a PR. This still takes 2-4 hours for building the tests which slows down the CI as the number of tests grow. To resolve this issue we implemented a smart dependency parser which uses CMake Configure to parse dependencies and build only the affected test cases. We have ensured that two approaches are available 1) CMake pre-build analysis for each PR to ensure fast build and test. 2) Ninja post-build analysis to enable full build for nightly tests. ## Technical Details ```bash ### 1. Configure the project with CMake cmake -G Ninja -DCMAKE_EXPORT_COMPILE_COMMANDS=ON .. ### 2. Analyze dependencies (no build required!) python3 ../script/dependency-parser/main.py cmake-parse compile_commands.json build.ninja \ --workspace-root .. --output cmake_dependency_mapping.json --parallel 8 ### 3. Find tests affected by changes python3 ../script/dependency-parser/main.py select cmake_dependency_mapping.json origin/develop \ HEAD --test-prefix --output tests_to_run.json ### 4. Build only affected tests ninja $(jq -r '.executables[]' tests_to_run.json \| tr '\n' ' ') ### 5. Run affected tests ctest -R "$(jq -r '.regex' tests_to_run.json)" ``` ### Jenkins Integration - Added `buildMode` to jenkinsfile to integrate both `selective` and `full` build methods ### Known Limitations ### 1. Build-Time Generated Headers (HIGH RISK) Problem: Files generated during the build process (e.g., via `add_custom_command`) cannot be analyzed before building. Example: ```cmake add_custom_command( OUTPUT ${CMAKE_BINARY_DIR}/generated/config.hpp COMMAND generate_config.sh DEPENDS template.hpp.in ) ``` Impact: If a source file includes `generated/config.hpp`, the dependency won't be detected until after building. 
Mitigation: - CK analysis shows no generated headers currently used - If generated headers are added in the future, they must be built first - Recommendation: Generate headers in CMake configure phase (not build phase) when possible ## Test Plan 1. Modified Files: ``` include/ck_tile/ops/common.hpp include/ck_tile/ops/gemm.hpp include/ck_tile/ops/gemm/warp/warp_gemm.hpp ``` 2. Compare tests selected between `build.ninja` and `cmake-parse` methods ## Test Result - 1. The test completed in 5-6 minutes finding about 8000+ executables that should be built. - 2. We selected a commit 5ccc1387ea which resulted in same 7 tests with both legacy and new methods. - PR \| Legacy tests \| Smart tests \| Notes -- \| -- \| -- \| -- 5261 \| 453 \| 455 \| Only 2 tests (test_amdgcn_mma and test_amdgcn_sparse_mma) 5168 \| 0 \| 0 \| Changes in dispatcher only. No CK tests invoked. 5249 \| 0 \| 0 \| Changes to dependency parser. No CK tests invoked 5260 \| 0 \| 0 \| Changes in dispatcher only. No CK tests invoked. 5174 \| 1 \| 1 \| One test from FMHA affected by this PR in both cases 5383 \| 0 \| 0 \| Changes are only in benchmark files. Did not trigger any tests 5445 \| 1 \| 1 \| Changes are only to tests/ck_tile/gemm_streamk. Only triggered one streamk test in both cases. 5454 \| 3 \| 3 \| Both methods identified same test_grouped_conv_bwd tests 5427 \| 234 \| 234 \| Core infrastructure header changes. Detected exactly same tests 5388 \| 85 \| 85 \| modifies warp-level GEMM operations (warp_gemm.hpp, warp_gemm_dispatcher.hpp). Correctly identified all the streamK gemm tests ## Submission Checklist - [x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-19 05:31:35 +00:00
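The selection step in the commit above (map changed files to affected test executables, then build a CTest regex) can be illustrated with a small Python sketch. The `executables` and `regex` keys mirror the fields read by the `jq` commands in the message; the mapping layout (source path to list of executables) and the test names are assumptions, not the actual format written by `main.py`.

```python
import re


def select_tests(dependency_map, changed_files):
    """Return the executables affected by the changed files plus a ctest regex.

    dependency_map: hypothetical layout, {source path: [test executables]}.
    """
    selected = set()
    for path in changed_files:
        # Files unknown to the map (docs, scripts) select no tests.
        selected.update(dependency_map.get(path, ()))
    executables = sorted(selected)
    # Anchor and escape names so `ctest -R` matches exactly these tests.
    # An empty selection yields an empty regex; the caller should skip ctest.
    regex = ""
    if executables:
        regex = "^(" + "|".join(re.escape(e) for e in executables) + ")$"
    return {"executables": executables, "regex": regex}


# Hypothetical mapping for two headers named in the test plan.
DEPENDENCY_MAP = {
    "include/ck_tile/ops/gemm.hpp": ["test_gemm_a", "test_gemm_b"],
    "include/ck_tile/ops/common.hpp": ["test_gemm_a", "test_fmha"],
}

result = select_tests(DEPENDENCY_MAP, ["include/ck_tile/ops/gemm.hpp", "docs/readme.md"])
```

A change that touches only unmapped files returns an empty selection, which corresponds to the rows in the results table where no CK tests were invoked.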
lalala-sh	345a56c55e	[rocm-libraries] ROCm/rocm-libraries#5086 (commit f4880d7) [CK] Fix MOE FP8 SplitK buffer descriptor OOB When SplitK is enabled, kernel entry shifts A/B/AScale/BScale base pointers by SplitKBatchOffset, but make_dynamic_buffer element spaces are still based on full K dimension. This causes hardware buffer resource descriptors to extend beyond the actual tensor allocation, leading to GPU memory access faults when the tensor happens to be placed at the end of an allocated memory pool region. Fix by subtracting the split offset from each buffer's element space in both Run() (v1 pipeline) and Run_2Lds() (v2/v3 pipeline), so the buffer descriptor range [shifted_base, shifted_base + reduced_space) exactly covers the valid allocation. Also refactor SplitKBatchOffset to accept const Problem& (instead of Argument&) and add a default constructor, enabling direct reuse in Run/Run_2Lds without duplicating offset calculation logic. Made-with: Cursor ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-19 02:43:30 +00:00
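The range arithmetic behind the SplitK fix above can be checked with a few lines of Python. The numbers and the `descriptor_range` helper are hypothetical and measured in elements; the sketch only models the half-open range [shifted_base, shifted_base + space) described in the message, not the kernel code.

```python
def descriptor_range(base, element_space, split_offset, shrink):
    """Return the half-open element range covered by the buffer descriptor.

    shrink=False models the old behavior (full-K element space kept after
    shifting the base); shrink=True subtracts the split offset as in the fix.
    """
    shifted_base = base + split_offset
    space = element_space - split_offset if shrink else element_space
    return shifted_base, shifted_base + space


# Hypothetical tensor: allocation covers elements [0, 1024).
ALLOC_END = 1024
SPLIT_OFFSET = 256  # offset of this split-K batch within the tensor

before = descriptor_range(0, ALLOC_END, SPLIT_OFFSET, shrink=False)
after = descriptor_range(0, ALLOC_END, SPLIT_OFFSET, shrink=True)
```

Before the fix the descriptor ends at element 1280, 256 elements past the allocation; after the fix it ends exactly at 1024.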
Christopher Millette	e5683e2290	[rocm-libraries] ROCm/rocm-libraries#5031 (commit 1d86a92) [CK] Replace nested static_for with static_ford to reduce device IR function emissions [1B] (#5031) ## Summary ### Rationale CK's GPU kernels are among the slowest files in the ROCm build, with a single translation unit taking up to 10+ minutes. Profiling with `-ftime-trace` identified nested `static_for` loops as the root cause: each nesting level multiplies the number of unique lambda IR functions the compiler must process. A 2-level nest of `static_for<0, M, 1>` / `static_for<0, N, 1>` produces M×N unique lambda types. With typical GEMM dimensions (M=16, N=4), a single nest generates 64 unique functions, and these nests appear hundreds of times across the codebase. The LLVM backend's CGSCC (Call Graph Strongly Connected Components) framework processes each function independently, so reducing function count directly reduces backend time. ### What changed 393 nested compile-time loop patterns across 73 files are converted to `static_ford`, which flattens multi-dimensional compile-time iteration into a single `static_for` with index decomposition. This eliminates 994 `static_for` nesting levels (42% reduction). 
Three pattern categories were converted: - Category A: `static_for` wrapping `static_ford` — fold outer dimension into ford - Category B: nested `static_ford` — merge into single higher-dimensional ford - Category C: nested `static_for` chains — convert to single `static_ford` ### Verification ASM equivalence: PASS — 51/51 device assembly files identical (gfx942 + gfx1100) \| Architecture \| Files compared \| Largest file \| Result \| \|---\|---\|---\|---\| \| gfx942 \| 36 \| 386,685 lines \| ALL MATCH \| \| gfx1100 \| 15 \| 47,769 lines \| ALL MATCH \| Build time (Wilcoxon signed-rank test, 7 paired trials): \| Target \| Pre (s) \| Post (s) \| Delta \| p-value \| \|---\|---\|---\|---\|---\| \| bscale \| 169 \| 152 \| -9.8% \| 0.016 \* \| \| xdl_v1234 \| 207 \| 194 \| -6.6% \| 0.016 \* \| \| preshuffle \| 275 \| 264 \| -3.9% \| 0.016 \* \| \| xdl_base \| 142 \| 137 \| -3.2% \| 0.031 \* \| IR function counts (device backend, gfx942): \| Target \| InstFunc Δ \| CodeGen Δ \| Compiler Δ \| \|---\|---\|---\|---\| \| bscale \| -13,043 (-8.2%) \| -2,103 (-3.5%) \| -10.7% \| \| xdl_v1234 \| -9,431 (-5.7%) \| +59 (+0.1%) \| -5.2% \| \| xdl_base \| -6,162 (-4.9%) \| -1,141 (-2.5%) \| -2.2% \| \| xdl_old \| -3,234 (-3.7%) \| -963 (-8.7%) \| -3.3% \| ### Value - 994 fewer `static_for` nesting levels (-42%) across 73 files - 393 `static_ford` sites created (from 4 pre-existing) - Up to 9.8% compile-time reduction on representative targets (statistically significant, p < 0.05) - Up to 13K fewer IR function instantiations per translation unit - Net -849 LOC from reduced indentation - Zero ASM changes — identical device code output verified on gfx942 and gfx1100 - All scheduling barriers, `if constexpr` guards, and MFMA/WMMA accumulation order preserved ### Files changed (73) - `block/`: 47 files (GEMM pipelines — xdlops, wmma, moe, preshuffle, blockscale variants) - `grid/`: 20 files (softmax, normalization, reduction, attention, layernorm) - `thread/`: 5 files (tensor slice 
transfer, contraction, GEMM dlops, reduction) - `tensor_description/`: 1 file (tensor_adaptor) ## Test plan - [x] `static_ford` tested with 21 unit tests in `test/util/unit_ford.cpp` (1D-4D, custom orders, compile-time verification) - [x] All conversions preserve iteration order, `block_sync_lds()` placement, `if constexpr` scheduling guards, and MFMA/WMMA accumulation order - [x] ASM equivalence verified: 51 device `.s` files across gfx942 + gfx1100 - [x] Build-time improvement statistically confirmed (Wilcoxon, p < 0.05, 4 targets) - [x] IR function count reduction confirmed via `-ftime-trace` on 7 targets - [x] Detection script reports 0 remaining safe patterns (180 blocked with structural reasons) - [x] Existing CI tests (GEMM, softmax, normalization, batch norm, reduction, attention) exercise all converted code paths ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-18 14:46:50 +00:00
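The flattening described in this commit (one compile-time loop over the product of the dimensions, with each linear index decomposed back into per-dimension indices) can be illustrated with a minimal C++17 sketch. This is not CK's `static_ford`: the names `flat_ford` and `visit_order_2x3` are hypothetical, the decomposition is assumed to be row-major, and CK's real utility supports features (such as custom orders) that are omitted here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a flattened compile-time loop; NOT CK's static_ford.
template <std::size_t... Dims>
struct flat_ford
{
    static constexpr std::size_t rank  = sizeof...(Dims);
    static constexpr std::size_t total = (Dims * ... * 1); // product of all extents

    // Decompose a linear index into row-major per-dimension indices
    // (the last dimension varies fastest; this ordering is an assumption).
    static constexpr std::array<std::size_t, rank> decompose(std::size_t linear)
    {
        constexpr std::array<std::size_t, rank> dims{Dims...};
        std::array<std::size_t, rank> idx{};
        for(std::size_t d = rank; d-- > 0;)
        {
            idx[d] = linear % dims[d];
            linear /= dims[d];
        }
        return idx;
    }

    // Call f(integral_constant<i0>, integral_constant<i1>, ...) once per index tuple.
    template <typename F>
    constexpr void operator()(F&& f) const
    {
        run(f, std::make_index_sequence<total>{});
    }

    private:
    // A single flat expansion over all linear indices replaces the nested loops.
    template <typename F, std::size_t... Is>
    static constexpr void run(F& f, std::index_sequence<Is...>)
    {
        (call(f, std::integral_constant<std::size_t, Is>{}, std::make_index_sequence<rank>{}),
         ...);
    }

    template <typename F, std::size_t L, std::size_t... Ds>
    static constexpr void
    call(F& f, std::integral_constant<std::size_t, L>, std::index_sequence<Ds...>)
    {
        constexpr auto idx = decompose(L);
        f(std::integral_constant<std::size_t, idx[Ds]>{}...);
    }
};

// The commit's example extents: M=16, N=4 gives 64 iterations.
static_assert(flat_ford<16, 4>::total == 64);

// Records the visiting order of a 2x3 iteration space as i*10 + j.
inline std::array<int, 6> visit_order_2x3()
{
    std::array<int, 6> seen{};
    std::size_t n = 0;
    flat_ford<2, 3>{}([&](auto i, auto j) {
        seen[n++] = static_cast<int>(decltype(i)::value * 10 + decltype(j)::value);
    });
    assert(n == seen.size());
    return seen;
}
```

In this sketch the loop body is written once as a single generic lambda, rather than as a lambda nested inside another lambda that is itself instantiated per outer index. The indices still arrive as compile-time constants, so `if constexpr` guards inside the body would keep working.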
Thomas Ning	5f90f69795	[rocm-libraries] ROCm/rocm-libraries#5323 (commit 5454e9e) CK Tile MX GEMM Packing Improvement	2026-03-17 18:58:56 +00:00

## Motivation

Reduce the scale loading size and make better use of MFMA scale selection.

## Technical Details

Add packing of the MX scales.

## Test Plan

Use the existing test cases.

## Test Result

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Hosang	859acb5ae7	[rocm-libraries] ROCm/rocm-libraries#5018 (commit b32e7e6) [CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018)	2026-03-16 21:19:23 +00:00

## Motivation

Long-sequence FMHA can become memory-bound when K/V working sets exceed Infinity Cache (LLC), causing repeated DRAM traffic across heads.

This PR introduces LLC-aware launch ordering improvements for FMHA forward, and it is currently enabled only on gfx11 and gfx12. The approach is inspired by [`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217), adapted to CK’s kernel/runner structure and layout handling.

In this context, `bshd` is the layout used in Flash-Attention, while `bhsd` is the default layout used by the CK Tile FMHA example.

## Technical Details

This PR adds two complementary strategies:

- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware head grouping:
  - Estimate LLC size (env override, KFD sysfs, or arch default).
  - Compute group size from K/V bytes per head vs LLC target.
  - Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and related tensors).
- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit launch-order adjustment:
  - Keep a single kernel launch.
  - Reinterpret block linearization in `GetTileIndex` to make execution head-major, improving temporal locality of per-head K/V reuse.

Additional integration updates:

- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped launches keep deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or disabled.

## Test Plan

- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`

## Test Result

- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201, and all of them show higher performance compared to the baseline. The improvement is consistent, and performance is well maintained even at long sequence lengths.

`./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}`

- TFLOPs by sequence length (target: gfx1100, layout: bhsd)

| SeqLen | Before | After | Speedup |
|--|--|--|--|
| 1024 | 56.27 | 61.48 | 1.09x |
| 4096 | 67.10 | 72.27 | 1.08x |
| 8192 | 65.99 | 71.64 | 1.09x |
| 12288 | 61.60 | 76.61 | 1.24x |
| 16384 | 58.99 | 75.74 | 1.28x |
| 20480 | 57.32 | 74.42 | 1.30x |
| 24576 | 56.89 | 74.25 | 1.31x |
| 27280 | 18.93 | 24.48 | 1.29x |

- TFLOPs by sequence length (target: gfx1201, layout: bshd)

| SeqLen | Before | After | Speedup |
|--|--|--|--|
| 1024 | 66.79 | 65.90 | 0.99x |
| 4096 | 85.90 | 86.80 | 1.01x |
| 8192 | 77.06 | 90.29 | 1.17x |
| 12288 | 58.36 | 88.98 | 1.52x |
| 16384 | 52.12 | 88.88 | 1.71x |
| 20480 | 48.11 | 88.42 | 1.84x |
| 24576 | 47.12 | 89.07 | 1.89x |
| 27280 | 49.05 | 50.31 | 1.03x |

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
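The two strategies in this commit can be sketched in a few lines of C++17. This is one plausible reading of the description, not the code from the PR: `kv_bytes_per_head`, `plan_head_groups`, `decode_head_major`, and both structs are hypothetical names, the K/V footprint formula ignores batch size and any padding, and the 64 MiB budget used in the usage note is an arbitrary illustrative value rather than a measured LLC size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result type: how many heads to launch together, and how many launches.
struct HeadGroupPlan
{
    int heads_per_group;
    int num_groups;
};

// Assumed K/V footprint of one head: K and V, each seqlen_k x head_dim elements.
inline std::uint64_t
kv_bytes_per_head(std::uint64_t seqlen_k, std::uint64_t head_dim, std::uint64_t elem_bytes)
{
    return 2 * seqlen_k * head_dim * elem_bytes;
}

// Pick the largest head group whose K/V working set fits the LLC budget,
// clamped to [1, num_heads]; the group count is the ceiling division.
inline HeadGroupPlan
plan_head_groups(int num_heads, std::uint64_t kv_bytes, std::uint64_t llc_budget_bytes)
{
    assert(num_heads > 0 && kv_bytes > 0);
    const std::uint64_t fit = llc_budget_bytes / kv_bytes;
    const int per_group     = static_cast<int>(
        std::clamp<std::uint64_t>(fit, 1, static_cast<std::uint64_t>(num_heads)));
    return {per_group, (num_heads + per_group - 1) / per_group};
}

// Hypothetical decoded block coordinates.
struct TileIndex
{
    int batch;
    int head;
    int tile;
};

// Head-major decode (assumed ordering): consecutive block ids cover every
// tile and batch of one head before moving on to the next head.
inline TileIndex decode_head_major(int block_id, int num_tiles, int num_batches)
{
    assert(num_tiles > 0 && num_batches > 0);
    const int tile = block_id % num_tiles;
    const int rest = block_id / num_tiles;
    return {rest % num_batches, rest / num_batches, tile};
}
```

Under these assumptions, the benchmark configuration above (`-h=24 -d=128`, bf16) at a sequence length of 16384 gives 8 MiB of K/V per head, so a 64 MiB budget would yield 3 groups of 8 heads. At a sequence length of 1024 all 24 heads fit in one group, which corresponds to the unchanged fallback path.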


3200 Commits