composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
damien-lejeune	5bebfd460f	[rocm-libraries] ROCm/rocm-libraries#8492 (commit 46b6a06) Add tile size for FMHA batch prefill bf16 for MI308X ## Motivation Adding a tile size adapted to MI308X, for the FMHA Batch Prefill BF16 input type case ## Technical Details N/A ## Test Plan Benchmarking from the Aiter side with: ``` python3 op_tests/test_batch_prefill.py -s 8000 -p 1 -q 4 -k 1 --head_dim 256 -c true -d bf16 --input_dtype bf16 --quant_method none --kv_layout linear -t sglang -l 0.0 --return_lse false --profile ``` ## Test Result We see an improvement with the new tile size on MI308X (both with PLT mode OFF and ON) ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2026-06-17 06:22:26 +00:00
damien-lejeune	aab1d219f5	[rocm-libraries] ROCm/rocm-libraries#8350 (commit f92ded1) Add tile shape for FMHA batch prefill on MI308X (on fp8, hdim=256) (#8350) ## Motivation Add a tile size appropriate for FMHA batch prefill fp8/hdim256 on MI308X ## Technical Details Appending the tile shape to the existing factory such that it can be picked up by Aiter ## Test Plan Ran the performance test on both MI300X and MI308X ## Test Result MI300X performance seems unaffected by this change. MI308X does improve. ## Submission Checklist - [X] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>	2026-06-15 07:00:35 +00:00
ltqin	0954a8f3fa	[rocm-libraries] ROCm/rocm-libraries#8262 (commit d4ff8fc) [CK_TILE] Add graph capture support for FMHA backward(new branch) (#8262) ## Motivation Add HIP graph capture support for FMHA backward operations. The original implementation only supported normal execution mode and would cause use-after-free crashes when used with graph capture replay. When FMHA backward is captured into a HIP graph: - First replay: host callback executes and deletes the closure (as designed for normal mode) - Subsequent replays: use-after-free crash because the closure was already freed This PR enables `fmha_bwd_launcher::prepare_workspace_async()` to work correctly in both normal execution and graph capture modes.	2026-06-14 03:11:53 +00:00
Chao	320a813d67	[rocm-libraries] ROCm/rocm-libraries#6533 (commit 5dcaa45) [CK_TILE] Add host-side Pack-GQA optimization for FMHA forward (#6533) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit [CK_TILE] Add host-side Pack-GQA optimization for FMHA forward ## Motivation Host-side Pack-GQA optimization for CK-Tile FMHA forward. Reshapes Q tensor from `[b, nhead_q, seqlen_q, d]` to `[b, nhead_kv, nhead_ratio * seqlen_q, d]` by adjusting strides, so grouped Q-heads sharing the same KV data are processed in a single tile. Zero kernel changes — runner-only. Phase 1: non-causal attention with GQA ratio packing. Phase 2: extends to dropout and split-kv paths, fixes stride edge cases. ## Technical Details Modified files (2): - `example/ck_tile/01_fmha/example_fmha_fwd.cpp` — Pack-GQA flag plumbing - `example/ck_tile/01_fmha/fmha_fwd_runner.hpp` — Q tensor reshape logic, stride adjustment for GQA ratio packing New files (1): - `example/ck_tile/01_fmha/test_pack_gqa_phase2.sh` — 53 test cases covering non-causal, dropout, split-kv, various GQA ratios ## Dependencies None — this PR is standalone. ## Test Plan - GPU validation on MI300X (gfx942, ROCm 6.4.1): - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - GPU validation on MI350X (gfx950, ROCm 7.0), 53 parameterized test cases: - Command (GQA 4:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=32 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - Command (GQA 8:1): `./build/bin/tile_example_fmha_fwd -b=2 -h=64 -h_k=8 -s=2048 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` - Command (decode): `./build/bin/tile_example_fmha_fwd -b=64 -h=32 -h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` ## Test Result Benchmark results (MI350X, gfx950, ROCm 7.0): \| Config \| Without Pack \| With Pack \| Improvement \| \|--------\|-------------\|-----------\|-------------\| \| GQA 4:1 prefill b=2 h=32 hk=8 s=2048 d=128 bf16 \| 690.05 TFlops (0.199 ms) \| 695.61 TFlops (0.198 ms) \| +0.8% \| \| GQA 8:1 prefill b=2 h=64 hk=8 s=2048 d=128 bf16 \| 706.25 TFlops (0.389 ms) \| 729.35 TFlops (0.377 ms) \| +3.3% \| \| GQA 8:1 decode b=64 h=32 hk=4 s_k=4096 d=128 bf16 \| 305.20 GB/s (1.763 ms) \| 1813.41 GB/s (0.297 ms) \| +5.9x \| \| LLaMA-70B decode b=32 h=64 hk=8 s_k=4096 d=128 bf16 \| 591.70 GB/s (0.909 ms) \| 1820.65 GB/s (0.295 ms) \| +3.1x \| \| MHA ratio=1 b=2 h=8 s=4096 d=128 bf16 \| 695.16 TFlops \| 702.72 TFlops \| no regression \| Benchmark results (MI300X, gfx942, ROCm 6.4.1): No regression on MI300X. Pack-GQA is a runner-only optimization (zero kernel changes), performance impact is within noise on MI300X. \| Config \| TFlops / GB/s \| Time (ms) \| Delta vs baseline \| \|--------\|-------------\|-----------\|-------------------\| \| MHA bf16 b=2 h=8 s=4096 d=128 \| 336.52 TFlops \| 0.408 \| -1.7% \| \| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 \| 322.52 TFlops \| 0.426 \| -0.7% \| \| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 \| 349.85 TFlops \| 0.786 \| +0.5% \| \| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 \| 381.29 TFlops \| 1.442 \| +1.2% \| \| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 \| 697.32 GB/s \| 1.541 \| +0.8% \| All validation tests pass (`valid:y`) on both MI300X and MI350X. Additional validation: - 53 parameterized test cases pass (23 phase 1 + 30 phase 2) - GQA ratios tested: 1:1, 2:1, 4:1, 8:1, 32:1 - No regression on MHA (ratio=1) workloads - fp16 and bf16 validated	2026-06-10 01:56:44 +00:00
Anton Gorenko	7ecbf82708	[rocm-libraries] ROCm/rocm-libraries#7500 (commit f5cd4fd) [CK_TILE][FMHA] Optimize long-context decoding on gfx11/12 (#7500) ## Motivation Relevant issue: ROCM-22065 FMHA has less-than-optimal performance of long-context decoding (i.e. when seqlen_q = 1) on gfx11/12. This PR optimizes the splitkv pipeline and configs for such scenarios. ## Technical Details Optimizations applied in this PR: 1. use tiles with smaller M0 (16 vs 64), these tiles are used when seqlen_q <= 16 2. adapt qr_nwarp_sshuffle pipeline for gfx11, it allows to use more warps even for M0 = 16 (the qr pipeline parallelizes work between warps in M dim so with M0 = 16 it allows to use only 1 warp) 3. enable kMergeNumHeadGroupsSeqLenQ (an optimization that merges one group of heads in GQA) for all hdim values, not only 128 4. increase the number of splits (multiply by the number of head groups) if (3) is used 5. increase the number of splits for RDNAs (`multiProcessorCount` is the number of WGPs on RDNAs, not CUs, so it should be doubled to have meaning similar to CDNAs) Performance on gfx1151: \| Case \| develop (GB/s) \| This PR (GB/s) \| \|:-------\|-------:\|-------:\| \| [fp16\\|group\\|bshd] b:1, h:32/32, s:1/45056, d:64/64 \| 127.58 \| 183.11 \| \| [fp16\\|group\\|bhsd] b:1, h:32/32, s:1/45056, d:64/64 \| 153.64 \| 215.02 \| \| [fp16\\|group\\|bshd] b:1, h:16/8, s:1/77184, d:128/128 \| 120.51 \| 225.76 \| \| [fp16\\|group\\|bhsd] b:1, h:16/8, s:1/77184, d:128/128 \| 130.62 \| 223.84 \| \| [fp16\\|group\\|bshd] b:1, h:32/32, s:1/9600, d:128/128 \| 82.65 \| 138.44 \| \| [fp16\\|group\\|bhsd] b:1, h:32/32, s:1/9600, d:128/128 \| 105.75 \| 220.45 \| \| [fp16\\|group\\|bshd] b:1, h:8/1, s:1/401024, d:256/256 \| 16.27 \| 187.89 \| \| [fp16\\|group\\|bhsd] b:1, h:8/1, s:1/401024, d:256/256 \| 16.28 \| 188.19 \| ## Test Plan An additional test case is added to the exiting test. It uses seqlen_q = 1, GQA, no mask to trigger the changes ``` ninja test_ck_tile_fmha_fwd_fp16 && bin/test_ck_tile_fmha_fwd_fp16 --gtest_filter="SplitKV ninja test_ck_tile_fmha_fwd_bf16 && bin/test_ck_tile_fmha_fwd_bf16 --gtest_filter="SplitKV ``` Manual testing can be done with these commands: ``` bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=64 -s=1 -s_k=$((352 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=16 -h_k=8 -d=128 -s=1 -s_k=$((603 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=32 -h_k=32 -d=128 -s=1 -s_k=$((75 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 bin/tile_example_fmha_fwd -prec=fp16 -mode=1 -page_block_size=128 -b=1 -h=8 -h_k=1 -d=256 -s=1 -s_k=$((3133 * 128)) -lse=1 -mask=0 -num_splits=0 -kname=1 -v=1 ``` ## Test Result All the tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-06-03 06:16:10 +00:00
Hosang Yoon	e7e8801dc3	[rocm-libraries] ROCm/rocm-libraries#7586 (commit c18f2c7) [CK_TILE] Use gfx11 float buffer atomics in FMHA Bwd ## Motivation FlashAttention CK backward on gfx11 can hit out-of-bounds/tail writes in the dQ accumulator atomic-add path when sequence rows are padded at the tile level but not marked invalid in the DQDKDV main tensor view. With the generic global atomic fallback, an incorrectly-valid tail element can issue an actual pointer-based `atomicAdd`. With the buffer atomic path, the write is issued through a buffer resource with bounds information and follows the same backend already used by gfx9/gfx12. This fixes the gfx11 FMHA BWD failure without changing the gfx11 default for unrelated CK Tile kernels. ## Technical Details This PR enables the existing CK Tile AMD buffer float atomic-add path only for generated FMHA BWD gfx11 translation units. gfx11 normally uses the generic global atomic fallback for floating-point `buffer_view::atomic_add`. That fallback performs the atomic through a raw computed pointer and depends on the software validity predicate to avoid invalid elements. In FMHA BWD dQ accumulation, padded tail rows can reach this path, so using the buffer atomic backend is safer: it uses a buffer resource with base pointer, bounds information, and an element offset, matching the backend already used by gfx9/gfx12. Enabling `CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT` globally for gfx11 is too broad and can break unrelated gfx11 CK builds such as GEMM. Instead, `config.hpp` now preserves an explicitly pre-defined `CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT`, while keeping the existing default disabled for gfx11. ## Test Plan Validated the change with the FlashAttention CK full test suite with backward pass enabled on gfx11. pytest -q -s tests/test_flash_attn_ck.py ## Test Result FlashAttention CK gfx11 test result: 260680 passed, 152076 skipped ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-05-30 00:10:26 +00:00
Illia Silin	8bd8094012	[rocm-libraries] ROCm/rocm-libraries#7833 (commit 8a444cd) [CK] Replace deprecated load_module function in python (#7833) ## Motivation Recent pytorch builds with python 3.15 failed in CK due to deprecation of load_module function. This should fix the issue. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-29 15:29:58 +00:00
ltqin	bf07a0150e	[rocm-libraries] ROCm/rocm-libraries#7723 (commit 4ed6c51) [CK Tile] Enable LSE output for fp8bf16 V3 FMHA kernels (#7723) ### Motivation The V3 pipeline (qr_async_trload_v3) for fp8bf16 FMHA kernels did not support LSE (Log-Sum-Exp) output. This PR enables LSE output support for fp8bf16 V3 FMHA kernels, allowing users to retrieve attention statistics alongside attention outputs. ### Technical Details - StandardAttention: lse = softmax_scale * m + log(l) - LogitsSoftCap: lse = (m / log2(e)) + log(l) ### Test Plan Run FMHA forward example with fp8bf16 precision and LSE output enabled: - Test 1: Basic LSE functionality ./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128 -prec=fp8bf16 -init=3 -qscale=1 -lse=1 - Test 2: LSE with LogitsSoftCap (CMakeList should remove Logits filter) ./build/bin/tile_example_fmha_fwd -v=1 -b=1 -h=8 -s=1024 -d=128 -prec=fp8bf16 -init=3 -qscale=1 -lse=1 -logits_soft_cap=30.0	2026-05-28 15:58:54 +00:00
Illia Silin	c24e528481	[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) [CK] suppress compiler warnings while building pytorch. (#7760) ## Motivation Recently added compiler flags that are required to suppress false warnings by latest staging compiler are not recognized by older compiler versions and are triggering an avalanche of warnings. Previous attempt to suppress them by using -Wno-unknown-warning-option flag didn't help, because that flag wasn't recognized either and just added more warnings. I've verified that current approach by checking the clang version actually works as intended and makes the warnings go away. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 06:56:58 -07:00
Anton Gorenko	66d6714376	[rocm-libraries] ROCm/rocm-libraries#5388 (commit 45583bd) [CK_TILE][FMHA] Improve precision of mxfp4 FMHA with fp6 for matrix P (#5388) ## Motivation Improve precision of mxfp4 without performance penalties. ## Technical Details Since performance of scale MFMAs is the same when neither A nor B is fp8/bf8, it is possible to use fp6 x fp4 instead of fp4 x fp4 for the second GEMM, while types of Q, K, V stay the same. This allows to improve overall precision significantly because fp6 has 32 non-negative values used for P quantization compared to just 8 values for fp4. It was found that there is a compiler bug with `__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32` (described in LCOMPILER-561) but a workaround seems to fix all failing instances. ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-26 06:55:17 -07:00
Yi DING	6a9c03f692	[rocm-libraries] ROCm/rocm-libraries#7450 (commit 402dbad) [CK_TILE] Use Persistent Scheduling for FMHA BWD Group Deterministic (#7450) ## Motivation FMHA BWD group-mode deterministic currently uses a non-persistent scheduler: each `(batch, head, K-row)` work-item is launched as its own block, with no work-stealing across CUs. On uneven workloads (varlen, GQA, many heads with few K-rows) this leaves CUs idle and forces a larger dq_acc workspace than necessary. This PR ports the persistent + deterministic scheduling already used in batch mode to group mode: a fixed-grid kernel that pre-computes per-CU work ranges on the host and uses sparse dq_acc slot indexing so multiple K-rows handled by the same CU share one accumulator slot via intra-CU atomic adds. Stacked on #7331; merge that first. ## Technical Details Single file changed: `ops/fmha/kernel/fmha_bwd_kernel.hpp`. A new `kUsePersistent` path is added to the group-mode deterministic kernel, mirroring the batch-mode persistent scheduler. The host pre-computes a fixed per-CU partition of the total `(batch, head, K-row)` work and packs it into `cu_states[]` so the GPU consumes it in a single launch. Host preparation happens in four steps: 1. Build per-batch `seqstart` prefix sums. 2. Fill per-batch `(sq_w, nc)` with a placeholder `nsplits` (bumped in step 3). 3. Two-pointer scan over CUs to fill `cu_states[c]` (`isplit`, `head_start`, `c_start`, `w_lo`, `w_hi`), accumulating `nsplits[b]` as `max(cs->isplit + 1)`. 4. Compute compact per-batch dq_acc offsets from the finalized `nsplits`. `isplit` is the sparse dq_acc slot index — one CU's multi-K-row writes share slot `ceil(wc_start / denom)`, enabling intra-CU atomic accumulation instead of one slot per K-row. `denom = max(sq_w, target_w)`, splitting two regimes: - `target_w >= sq_w` (large work): `denom = target_w`, intra-CU atomic optimization engaged. - `target_w < sq_w` (sub-K-row sharding, multiple CUs sharing one K-row): `denom = sq_w` collapses to per-K-row indexing (`= c_start`), keeping `isplit ∈ [0, nc-1]` and matching the `nsplits_max = ceil(s_k/kN0) = nc` upper bound that #7331's `GetWorkspaceDeviceSizeUpperBound` assumes for group+det. `isplit` is additionally clamped to `nc-1` to absorb empty CUs (rounded-up `wc_start` past the last K-row); they don't write dq_acc on GPU so the slot value is harmless. `nsplits[b]` is accumulated dynamically in step 3 rather than via a closed form so it tightly matches the actual sparse slots used; step 4 (offsets) follows step 3 since offsets now depend on the dynamic `nsplits`. Group mode also allows batches with `seqlen_q == 0`. The persistent scheduler skips them on the dQ path (no work) but dK/dV are still zero-filled. ## Test Plan Built `tile_example_fmha_bwd` with receipt 5 (fp16, no-bias, no-dropout, `dpad == dvpad`, group + batch) on gfx950 (MI355X). - 8-case smoke (shapes that exercise the sub-K-row regime). - 44-case sweep covering: mask 0/1/2, GQA, var seqlen, `d != d_v`, extreme small seqlen / `nc=1`, CU >> work, huge batch, batch-mode regression. - 12-case perf comparison vs the non-persistent baseline (warmup=10, repeat=50). ## Test Result - All 8 + 44 cases `valid:y`. - Perf: ±5% noise, average -0.4% across the 12 cases — neutral. - Batch-mode deterministic / non-deterministic regression unchanged. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-26 10:01:54 +08:00
Illia Silin	e02c566795	[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24) [CK] upgrade CI to rocm7.13 as default compiler (#7612) ## Motivation Upgrade the default docker and compiler version in CI to rocm7.13. In order to pass all the checks I had to also clean up a lot of non-ascii characters in the source code comments and modify a couple of tests that were affected by a new compiler logic. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com>	2026-05-22 02:43:50 +00:00
kensclin	fc2862d712	[rocm-libraries] ROCm/rocm-libraries#6846 (commit 377def4) [CK_TILE] Add fmha forward hdim 256 support (#6846) ## Motivation Enable Composable Kernel FMHA forward kernel for hdim=256 BF16 on AMD gfx950 (MI350X). Prior to this change the (256, 256) head-dim configuration either failed to compile, was filtered out by the compatibility rules, or produced incorrect kernel output due to an LDS layout accounting bug. ## Technical Details Four files changed, all to enable hdim=256 BF16 on gfx950. - `fmha_fwd.py` — Allow `(256, 256)` in gfx950 compatibility rule; set `(256,256)` BF16 tile to `M0=128, N0=64` (the LDS-feasible shape on gfx950); emit minimal valid instance set for d=256 to bound compile time. - `fmha_fwd_kernel.hpp` — Gate Prefill launch path off for d=256 (`PrefillCase = kM0 > 64 && kQKHeaddim < 256`); the double-buffer Prefill variant overflows the 160 KB LDS budget. - `trload_policy.hpp` — Critical correctness fix: the LDS layout accounting in `GetSmemSize` was wrong (`max(Q, K+S+V)` instead of `max(Q, K) + V + S`), under-allocating LDS and silently corrupting d=256 output (~2% wrong values). - `trload.hpp` — Thread `LoadOnce=true` through all d=256 K-LDS descriptors so the compiler picks the matching XOR swizzle period; recompute the S-tile LDS offset to match the corrected `GetSmemSize` formula. ## Test Plan Built and ran `tile_example_fmha_fwd` on gfx950 (MI350X) with the canonical d=256 BF16 configurations: ```bash cd build && ninja tile_example_fmha_fwd ./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=1 -h=32 -h_k=2 -s=1024 -s_k=1024 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3 -repeat=10 -kname=1 -v=1 ./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=8 -h=32 -h_k=2 -s=16384 -s_k=16384 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3 -repeat=10 -kname=1 -v=1 ``` ## Test Result ```bash -b=1 -s=1024 [bf16\|batch\|bhsd] b:1, h:32/2, s:1024/1024, d:256/256, scale_s:0.0625, bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r, fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink, 0.058 ms, 298.42 TFlops, 618.68 GB/s, valid:y -b=4 -s=16384 [bf16\|batch\|bhsd] b:8, h:32/2, s:16384/16384, d:256/256, scale_s:0.0625, bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r, fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink, 42.797 ms, 822.18 TFlops, 106.63 GB/s, valid:y ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com>	2026-05-22 01:57:41 +08:00
Po Yen Chen	a11f53564f	[rocm-libraries] ROCm/rocm-libraries#7530 (commit 378e049) [CK] Fix FMHA sink dispatch when init_sink_value is set (#7530) ## Summary - Fix `traits.has_sink` in `fmha_fwd_runner.hpp` to also check `init_sink_value != 0`, so the GPU kernel dispatches with sink support when `-init_sink=1` is passed. - Gate `run_sink_mask_tests` (StreamLLM) and `run_sink_init_tests` (GPT-OSS) behind opt-in flags `-m` and `-g` in `smoke_test_fwd.sh`. These tests require sink=true kernel instances which are excluded by the `BUILD_TESTING` CMake filter (`_nsink`), causing unconditional "not supported yet" failures (48 tests in CI). The opt-in flag approach was borrowed from PR #6057. ## Why gate tests instead of compiling sink=true kernels? The `BUILD_TESTING` filter in `CMakeLists.txt` uses `_nsink` glob patterns for the `fwd` and `fwd_splitkv` APIs, excluding sink=true kernel instances from compilation. We chose opt-in flags over widening the filter because: - Compile time: Enabling sink=true kernels doubles the kernel variants for `fwd` and `fwd_splitkv` APIs. The filter exists specifically to reduce CI build times. - Incremental enablement: Sink support (StreamLLM / GPT-OSS) is still maturing. Gating lets teams opt in explicitly (`smoke_test_fwd.sh -g`) while keeping the default CI path fast. - Precedent: splitkv (`-s`) and appendkv (`-a`) tests already follow this opt-in pattern. ## Test plan - [ ] Run `smoke_test_fwd.sh -g` with sink=true kernels compiled and verify sink-enabled kernels are dispatched - [ ] Verify `smoke_test_fwd.sh` still passes without `-m` / `-g` flags - [ ] Confirm CI no longer fails on sink tests (they are now opt-in)	2026-05-19 00:09:23 +08:00
Illia Silin	717f2efef7	[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d) [CK] add composable kernel support on gfx1250 (#6978) ## Motivation Add composable kernel support on gfx1250. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Qun Lin <qlin@amd.com> Co-authored-by: jialuo12_amdeng <jia.luo@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>	2026-05-15 06:46:51 -07:00
Illia Silin	ac18460782	[rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70) [CK] Suppress new staging compiler errors (#7384) ## Motivation This should make new builds with staging compiler pass. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 12:51:08 -07:00
Meekail Zain	0937b002d8	[rocm-libraries] ROCm/rocm-libraries#6867 (commit 3cb0219) Added custom FMHA codegen receipt for TransformerEngine (#6867) ## Motivation TE uses AITER to build static MHA libraries, which ultimately rely on CK kernels. We use the `600` receipt which generates more kernels than TE truly needs. This bespoke receipt allows us to minimize the kernel count, compile time, and memory footprint of our MHA library. ## Technical Details Extended the receipt mechanism to include a custom `700` receipt for TE's needs ## Test Plan Test by building TE using the same receipt profile ## Test Result Build validated in TE using a custom feature branches of AITER/CK to temporarily apply the patch ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-05-14 10:32:54 -04:00
Yi DING	af7118e342	[rocm-libraries] ROCm/rocm-libraries#7331 (commit 5692db0) [CK_TILE] Add async workspace prepare to FMHA BWD launcher (#7331) ## Motivation `aiter::mha_bwd` in group mode currently issues two synchronous `hipMemcpy` D2H copies to read `seqstart_q/k` for launcher construction. These sync copies block the host (~10–30 µs each) and implicitly synchronize the device by draining the stream, breaking CPU/GPU overlap on hot training paths. This PR adds a fully stream-async workspace preparation path on the FMHA BWD launcher so callers can pre-allocate the device workspace from upper-bound shapes and stage seqstart-dependent metadata via D2H/host-pack/H2D entirely on the user's stream. ## Technical Details - `FmhaBwdWorkspaceManager::GetWorkspaceDeviceSizeUpperBound` (`include/ck_tile/ops/fmha/kernel/fmha_bwd_kernel.hpp`): computes the worst-case device dq_acc size from `(max_batch, hdim_q, nhead_q, max_seqlen_q, max_seqlen_k)` without dereferencing any seqstart array. Mirrors `PrepareWorkspaceHost`'s return value with worst-case bounds. - `fmha_bwd_launcher::prepare_workspace_async` (`example/ck_tile/01_fmha/fmha_bwd.hpp`): on the caller's stream, in order: 1. `hipMemsetAsync` of the dq_acc region (when `NeedsZeroDqAcc()`) 2. group mode: `hipMemcpyAsync` D2H of `seqstart_q/k` into a pinned host staging buffer 3. `hipLaunchHostFunc` runs `PrepareWorkspaceHost` on the pinned buffer 4. `hipMemcpyAsync` H2D of the packed metadata into `device_ws_ptr` The pinned staging buffer is held via `std::shared_ptr<void>` returned by a caller-provided `pinned_host_alloc` callback. Lifetime is extended past stream completion by a tail `hipLaunchHostFunc` scheduled in the launcher's destructor. - `ck_tile::pinned_host_releaser` (`include/ck_tile/host/pinned_host_releaser.hpp`): worker-thread utility for callers using bare `hipHostMalloc`. Defers `hipHostFree` off the HIP driver callback thread, which holds runtime locks and would deadlock against concurrent main-thread `hipFree`. PyTorch's `CachingHostAllocator` does not need this. - Example runner (`example/ck_tile/01_fmha/fmha_bwd_runner.hpp`): switched to the async path. ## Test Plan - `tile_example_fmha_bwd` (gfx950, dev preset `-Werror -Weverything`): - batch + nondet / batch + det / group + nondet / group + det - group + det 4-batch varlen (`-b=4 -h=8 -s=4096,3072,2048,1024 -d=128`) - FA (`flash-attention`) integration on ROCm 7.1.1 + PyTorch 2.9.1: - `tests/test_flash_attn_ck.py::test_flash_attn_varlen_deterministic` - `tests/test_flash_attn_ck.py::test_flash_attn_bwd_varlen_seqq_zero` ## Test Result - All CK runner cases `valid:y`. - FA pytest: 1952 passed in 44.82s. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 21:33:21 +08:00
Copilot	c39bff93d0	[rocm-libraries] ROCm/rocm-libraries#6983 (commit f4e9a84) Remove batch_prefill from FMHA_FWD_KNOWN_APIS (#6983) Remove `batch_prefill` from the `FMHA_FWD_KNOWN_APIS` list in `projects/composablekernel/example/ck_tile/01_fmha/CMakeLists.txt`. Change: ```cmake # Before set(FMHA_FWD_KNOWN_APIS "fwd;fwd_splitkv;fwd_appendkv;pagedkv_prefill;batch_prefill") # After set(FMHA_FWD_KNOWN_APIS "fwd;fwd_splitkv;fwd_appendkv;pagedkv_prefill") ``` Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: asleepzzz <4926646+asleepzzz@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-05-14 12:41:01 +00:00
Linjun-AMD	5003f7ef8a	[rocm-libraries] ROCm/rocm-libraries#7272 (commit d02f3c0) [ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference runner (#7272) ## Summary In `fmha_bwd_runner.hpp`, the `sink_host` `HostTensor` is allocated with first dimension `shape_batch` (= 1 in group mode), but the reference forward loop accesses `sink_host(wb, i_h)` with `wb ∈ [0, batch-1]`. For any `wb >= 1` this is an out-of-bounds heap read, silently corrupting the reference forward math chain (`lse_host`, `o_host`) and turning the bwd-side `d_sink_head_acc` reference into non-deterministic garbage. `HostTensor::operator()` does not bounds check, so the OOB is not caught at runtime. This manifests as intermittent `tile_example_fmha_bwd` failures (25–67% fail rate) when `-sink_grad=1` is combined with `-mode=1` (group mode), with bit-exact but spurious `max_err` values like 4.27 / 14.6. ## Fix One-line: allocate `sink_host` with `batch` (the real per-batch dim) instead of `shape_batch`, mirroring how `sink_host` is accessed by the loop. ```diff - sink_grad ? std::array<ck_tile::index_t, 2>{shape_batch, nhead} + sink_grad ? std::array<ck_tile::index_t, 2>{batch, nhead} Repro tile_example_fmha_bwd -b=2 -h=2 -s=516 -s_k=253 -prec=bf16 -d=72 \ -bias=n -dbias=0 -p_drop=0 -iperm=1 -operm=1 -deterministic=0 \ -v=3 -mode=1 -kname=1 -sink_grad=1 Verification - 0/30 fail on the repro config after fix - Baselines (before fix): - sink=1, mask=n: 25% fail rate (p ≈ 1.8e-4) - sink=1, mask=t: 67% fail rate (p ≈ 6e-15) Attribution Shape bug introduced together with sink_grad in #5504. Unrelated to #6914 (which is a fwd-only fix on a different code path) ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: junlin12 <junlin12@amd.com> Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-05-13 16:47:50 +08:00
Illia Silin	22b9feb40f	[rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f) [CK] Fix latest batch of staging compiler warnings (#7111) ## Motivation Suppress the new batch of clang lifetimebound and invalidation warnings with the latest staging compiler. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 07:14:14 -07:00
Chao	3d8c21e838	[rocm-libraries] ROCm/rocm-libraries#6529 (commit 93a6097) [CK_TILE] Enable V3 persistent kernel dispatch for FMHA forward on gfx950 (#6529) [CK_TILE] Enable V3 persistent kernel dispatch for FMHA forward on gfx950 ## Motivation Enable the existing V3 persistent kernel path for CK-Tile FMHA forward on gfx950 (MI350X/MI355X). The V3 kernel and codegen infrastructure already exist but are disabled via hardcoded `F_is_v3_enabled=False`. This change replaces the compile-time gate with a runtime environment variable `CK_FMHA_ENABLE_V3=1` (disabled by default, opt-in). When enabled: - Prefill workloads (seqlen_q > 1) dispatch to V3 persistent pipeline - Decode workloads (seqlen_q == 1) always use V2 (memory-bound, better suited) The V3 persistent kernel uses grid-stride scheduling, XCD-interleave tile assignment for L2 locality, LPT reversal for causal masks, and gfx950 async buffer loads. ## Technical Details Single file: `example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py` - Add `#include <cstdlib>` and `<string>` for `std::getenv` - Replace `{F_is_v3_enabled}` template parameter with runtime env var check - Add `seqlen_q > 1` guard (decode always uses V2) - Remove `.format()` call in `write_fwd_api()` ## Dependencies Depends on https://github.com/ROCm/rocm-libraries/pull/6501 — builds on XCD-interleave and LPT scheduling infrastructure. ## Test Plan - GPU validation on MI300X (gfx942, ROCm 6.4.1): - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3` - GPU validation on MI350X (gfx950, ROCm 7.0): - Command (V2): `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3` - Command (V3): `CK_FMHA_ENABLE_V3=1 ./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3` - Command (decode, always V2): `./build/bin/tile_example_fmha_fwd -b=64 -h=32 -h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` ## Test Result Benchmark results (MI350X, gfx950, ROCm 7.0): \| Config \| V2 (TFlops) \| V3 (TFlops) \| Speedup \| \|--------\|-------------\|-------------\|---------\| \| Non-causal b=2 h=8 hk=2 s=4096 d=128 bf16 \| 696.3 \| 884.2 \| +27.0% \| \| Causal b=2 h=8 hk=2 s=4096 d=128 bf16 \| 371.3 \| 494.9 \| +33.3% \| \| GQA b=2 h=32 hk=8 s=2048 d=128 bf16 \| 671.3 \| 831.7 \| +23.9% \| \| LLaMA-70B b=1 h=64 hk=8 s=4096 d=128 bf16 \| 761.5 \| 927.3 \| +21.8% \| \| Causal GQA b=2 h=32 hk=8 s=2048 d=128 bf16 \| 345.4 \| 631.9 \| +82.9% \| \| Long-seq b=1 h=16 s=16384 d=128 bf16 \| 797.8 \| 969.9 \| +21.6% \| \| Decode b=64 h=32 hk=8 s=1 s_k=4096 bf16 \| 1828 GB/s \| — (V2 path) \| unaffected \| Benchmark results (MI300X, gfx942, ROCm 6.4.1): V3 has 0% effect on MI300X — V3 relies on gfx950 async buffer loads and falls back to the V2 code path on gfx942. No regression on any config. \| Config \| TFlops / GB/s \| Time (ms) \| Delta vs baseline \| \|--------\|-------------\|-----------\|-------------------\| \| MHA bf16 b=2 h=8 s=4096 d=128 \| 342.98 TFlops \| 0.401 \| +0.1% \| \| MHA fp16 b=2 h=8 s=4096 d=128 \| 411.18 TFlops \| 0.334 \| +4.9% \| \| Causal MHA bf16 b=2 h=8 s=4096 d=128 \| 232.61 TFlops \| 0.296 \| +2.4% \| \| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 \| 320.07 TFlops \| 0.429 \| -1.4% \| \| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 \| 353.91 TFlops \| 0.777 \| +1.7% \| \| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 \| 381.53 TFlops \| 1.441 \| +1.2% \| \| Long-seq bf16 b=1 h=16 s=16384 d=128 \| 388.61 TFlops \| 5.659 \| +1.4% \| \| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 \| 693.40 GB/s \| 1.550 \| +0.3% \| All validation tests pass (`valid:y`) on both MI300X and MI350X. Additional validation: - `CK_FMHA_ENABLE_V3=0` correctly falls back to V2 (default behavior unchanged) - `CK_FMHA_ENABLE_V3=1` dispatches to V3 for prefill, V2 for decode - Validation passes across fp16/bf16, batch/group mode, causal/non-causal - No regression on decode path --------- Co-authored-by: Chao Zhou <chaozhou@fb.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-05-08 00:22:03 +08:00
Linjun-AMD	5a2a362c46	[rocm-libraries] ROCm/rocm-libraries#6914 (commit b791478) [CK_TILE][FMHA] Fix sink un-mask under right-window and emit fp8bf16 batch_prefill sink kernels (#6914) ## Summary Two related fixes to `ck_tile` FMHA so that StreamLLM-sink + sliding-window batch-prefill works correctly for fp8 KV / bf16 compute. Review the commits in this order: 1. `fmha: emit sink kernels for fp8bf16 batch_prefill` Extends `example/ck_tile/01_fmha/codegen/ops/fmha_batch_prefill.py` so the fp8(KV) / bf16(QO) batch-prefill codegen also emits the `mask=mask_enum::generic_with_sink` variant. Without this the runtime could not dispatch to a sink-aware kernel for the fp8bf16 path. 2. `fmha: respect right-window in IsOutOfSinkBound` The sink un-mask in `GenericAttentionMask::IsOutOfSinkBound` (local-mask branch) used `(i_y + x) > 1` as the gate, which conditioned on the row index instead of the column index. As a result, queries `1..sink-1` could attend to future sink positions (violating causal / right-window), while query `0` fell back to the plain causal mask. The fix replaces the guard with `i_x < i_y + x` so every query only sees sink columns up to its own right-window boundary. 3. `fmha: clarify IsOutOfSinkBound predicate comment` Doc-only follow-up that rewrites the comment above the predicate as a clause-by-clause explanation (`i_x < sink`, `i_x < i_y + x`, `y < y_total`, `i_y < x_total`). ## Test plan - [x] Repro on aiter `op_tests/test_batch_prefill.py` (fp8 + bf16_dequant modes with `sink=4`, `win_left=1023`, `softcap=0.0`, `sal=True`) now passes for all parametrized shapes. - [x] Existing fp16/bf16 batch-prefill paths (no sink) unchanged — codegen diff only adds the `generic_with_sink` variant for fp8bf16; existing kernel object lists unaffected. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: fengjunda.aml <fengjunda.aml@bytedance.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: root <root@smci350-rck-g03-f12-31.rck.dcgpu>	2026-05-07 10:39:55 +08:00
Yi DING	2c677e8471	[rocm-libraries] ROCm/rocm-libraries#6152 (commit 36b016a) [CK_TILE] Use Unified Workspace for FMHA BWD (#6152) ## Motivation `dq_acc` is the intermediate accumulation buffer used in FMHA backward pass for deterministic mode. The current implementation allocates it as a single rectangular tensor: ``` shape = [shape_batch, nhead, nsplits, shape_seqlen_q, hdim_q] ``` where `nsplits = launcher.dq_acc_splits` (a single scalar), computed from `max_seqlen_k` and shared across all batches. ### Problems 1. Memory waste: In group mode, each batch may have a different `seqlen_k`, but `nsplits` is computed from `max_seqlen_k`, causing batches with shorter `seqlen_k` to over-allocate in the split dimension. 2. Interface coupling: `fmha_bwd_args` exposes internal layout details such as `stride_dq_acc`, `nhead_stride_dq_acc`, `batch_stride_dq_acc`, and `split_stride_dq_acc`. The caller is responsible for computing these strides, but this logic belongs inside the kernel. ### Goals 1. Switch `dq_acc` buffer to a compact layout: batches are concatenated contiguously, with each batch occupying `nhead * nsplits_i * seqq_i * hdim_q` elements (nhead outermost). 2. *Remove all `_stride_dq_acc` fields from `fmha_bwd_args`, replacing them with a single `workspace_ptr`; the kernel splits this internally using a fixed layout. 4. `fmha_bwd_launcher` provides a workspace management interface: the caller only needs to allocate GPU memory and call `prepare_workspace()` — no layout computation required. 5. Isolate kernel internals from the caller API*: the `dq_acc` layout (nsplits, strides, buffer size) is determined entirely inside the launcher/kernel. Future changes to block shape, pipeline type, or persistent kernel strategy require no modifications to the caller's `fmha_bwd_args` or workspace allocation logic. ## Technical Details ### Interface Design #### New fields in `fmha_bwd_traits` ```cpp struct fmha_bwd_traits { int seqlen_q; int seqlen_k; int batch; int max_seqlen_q; int max_seqlen_k; int hdim_q; int hdim_v; int nhead_q; int nhead_k; std::string data_type; bool is_group_mode; mask_enum mask_type; bias_enum bias_type; bool has_dbias; bool has_dropout; bool is_store_randval; bool is_deterministic; // New: cumulative physical seqlen pointers for group mode (pass nullptr for batch mode). // seqstart_qs[i+1] - seqstart_qs[i] = physical seqlen_q of batch i (including padding); length = batch+1 // seqstart_ks[i+1] - seqstart_ks[i] = physical seqlen_k of batch i (including padding); length = batch+1 const int seqstart_qs = nullptr; const int* seqstart_ks = nullptr; }; ``` #### `fmha_bwd_launcher` actual structure ```cpp struct fmha_bwd_launcher { std::function<float(fmha_bwd_args, const ck_tile::stream_config&)> run{}; // Total workspace size in bytes (host_ws_size + device_ws_size), computed by init(). // Zero for kUseQrQtrDorPipeline (writes dq directly, no acc buffer needed). size_t workspace_size = 0; fmha_bwd_launcher(const fmha_bwd_traits&); // Copies auxiliary data (nsplits[], offsets[]) via hipMemcpy to the head of the GPU workspace, // and zeros the dq_acc buffer portion (tail of workspace) if required. // The memory pointed to by device_ws must be >= workspace_size bytes. std::function<void(void* device_ws)> prepare_workspace{}; template <typename... Args> float operator()(Args&&... args) const { return run(std::forward<Args>(args)...); } private: size_t host_ws_size = 0; // CPU workspace size (nsplits[] + offsets[] arrays) size_t device_ws_size = 0; // GPU-only data size (dq_acc buffer) std::unique_ptr<char[]> ws_host; // host-side workspace buffer public: template <typename T0, typename T1, typename T2, typename Arch> void init(const fmha_bwd_traits& traits); }; ``` The `init<>()` template method (invoked by codegen dispatch branches as `this->init<...>(t)`) is responsible for: 1. Setting the `run` lambda 2. Calling `FmhaBwdDQDKDVKernel::GetWorkspaceHostSize(batch)` to obtain `host_ws_size` 3. Allocating `ws_host` (host memory) 4. Calling `FmhaBwdDQDKDVKernel::PrepareWorkspaceHost(ws_host.get(), ...)` to fill nsplits/offsets; return value is `device_ws_size` 5. `workspace_size = host_ws_size + device_ws_size` 6. Setting the `prepare_workspace` lambda (captures `this`, calls `PrepareWorkspaceDevice`) When no kernel matches the given traits, both `run` and `prepare_workspace` are initialized to default lambdas that print a warning to `std::cerr` and return gracefully (no exception). #### Workspace overall layout The workspace is managed by `FmhaBwdWorkspaceManager` and consists of two segments: ``` Offset 0 (CPU-prepared segment, host_ws_size bytes; also hipMemcpy'd to the head of GPU workspace): index_t nsplits[batch or 1] — per-batch nsplits array group mode: batch elements batch mode / non-deterministic: 1 element [group mode only] long_index_t dq_acc_offsets[batch+1] — per-batch element offset (inclusive prefix sum) offsets[0]=0, offsets[i+1] = offsets[i] + nheadnsplits_iseqq_ihdim_q Offset host_ws_size (device data segment, device_ws_size bytes): AccDataType dq_acc[total_elements] — compact dq_acc buffer (zeroed if required) total_elements = sum_i(nhead nsplits_i * seqq_i * hdim_q) layout within each batch: [nhead, nsplits_i, seqq_i, hdim_q] note: seqq_i uses the physical length (including padding) ``` Alignment constant (`ALIGNMENT = 16`): ``` nsplits_size = align_up(sizeof(index_t) * N, 16) // N = batch (group) or 1 (batch/non-det) offsets_size = align_up(sizeof(long_index_t) * (batch+1), 16) // group mode only host_ws_size = nsplits_size + offsets_size dq_acc_offset = host_ws_size // GetDqAccDataOffset(batch) ``` Key benefits: - The kernel reads nsplits/offsets directly from the workspace head — no device-side recomputation. - `FmhaBwdConvertQGradKernel` is completely decoupled from the pipeline block shape (`kN0`): nsplits is read from `nsplits_ptr`, `kN0` is no longer a template parameter, and multiple dq_dk_dv tiles with different `F_bn0` values now share a single convert_dq kernel instance (under receipt 1/2, deterministic convert_dq kernel count drops from ~300 to 60). - nsplits/offsets are computed on the host and transferred in one `hipMemcpy`; the dq_acc buffer follows immediately, at the offset given by `GetDqAccDataOffset`. #### Workspace size by scenario \| Scenario \| `workspace_size` \| Notes \| \|----------\|-----------------\|-------\| \| kUseQrQtrDorPipeline (any mode) \| `0` \| Writes dq directly; no acc buffer; `PrepareWorkspaceHost` returns 0 \| \| Non-deterministic + batch mode \| `> 0` \| nsplits[1]=1; dq_acc used for atomic add; `workspace_size = host_ws_size + batchnheadseqlen_qhdim_qebytes` \| \| Non-deterministic + group mode \| `> 0` \| nsplits[1]=1; dq_acc contiguous layout; `workspace_size = host_ws_size + nheadseqstart_qs[batch]hdim_qebytes` \| \| Deterministic + group mode* \| `> 0` \| nsplits[batch], offsets[batch+1], compact dq_acc; nsplits_i computed independently per batch \| \| Deterministic + batch mode persistent \| `> 0` \| nsplits[1] (uniform across batches); dq_acc `batchnheadnsplitsseqlen_qhdim_q` \| NeedsZeroDqAcc (determines whether `PrepareWorkspaceDevice` calls `hipMemset`): - Persistent kernel (deterministic batch mode) or non-deterministic: must zero (atomic add requires zero initialization) - Deterministic group mode + no mask: no zeroing needed (every tile writes its full region) - Deterministic + with mask: must zero (some blocks are skipped, leaving uninitialized tiles that would contribute to the reduction) #### Caller usage ```cpp // 1. Create launcher (traits include seqstart_qs/ks pointers; workspace_size is computed during construction) fmha_bwd_launcher launcher(fmha_traits); // 2. Read launcher.workspace_size directly const auto ws_size = launcher.workspace_size; // 3. Allocate a single GPU workspace ck_tile::DeviceMem ws_buf(ws_size); // 4. Copy nsplits/offsets to GPU head and zero dq_acc if required launcher.prepare_workspace(ws_buf.GetDeviceBuffer()); // 5. Build args with a single workspace pointer; the kernel splits it internally fmha_bwd_args args{ ..., ws_size > 0 ? ws_buf.GetDeviceBuffer() : nullptr, // workspace_ptr }; launcher(args, stream_config); ``` --- ### Key Code Structure #### FmhaBwdWorkspaceManager (`fmha_bwd_kernel.hpp`, new class) ```cpp template <typename AccDataType, bool kIsGroupMode, bool kIsDeterministic> struct FmhaBwdWorkspaceManager { static constexpr size_t ALIGNMENT = 16; // CPU workspace (nsplits + offsets) sizes static size_t GetDqAccSplitsSize(int batch); // align_up(sizeof(index_t)N, 16) static size_t GetDqAccOffsetsSize(int batch); // group mode only: align_up(sizeof(long_index_t)(batch+1), 16) static size_t GetWorkspaceHostSize(int batch); // = SplitsSize + OffsetsSize // Starting offset of dq_acc data within the full workspace (= host_ws_size) static size_t GetDqAccDataOffset(int batch); // = GetWorkspaceHostSize(batch) // Fills nsplits/offsets in the CPU workspace; returns device_ws_size (dq_acc buffer bytes) template <bool kUseQrQtrDorPipeline, index_t kN0> static size_t PrepareWorkspaceHost(void* cpu_ws, index_t batch_size, index_t hdim_q, index_t nhead_q, index_t seqlen_q, index_t seqlen_k, const index_t* seqstart_qs, const index_t* seqstart_ks); // hipMemcpy's cpu_ws to device_ws head; hipMemset's the dq_acc portion to 0 if required template <bool kUseQrQtrDorPipeline, bool kHasMask> static void PrepareWorkspaceDevice(void* device_ws, const void* host_ws, size_t device_ws_size, size_t host_ws_size); }; ``` #### workspace_ptr parsing (inside the kernel) The kernel parses three address regions from `kargs.workspace_ptr`: Group mode (`FmhaBwdDQDKDVKernel::MakeKargs`): ```cpp const uint8_t* ws = reinterpret_cast<uint8_t>(workspace_ptr); // dq_acc_ptr (stored in FmhaBwdCommonKargs) ws + WorkspaceManager::GetDqAccDataOffset(batch) // dq_acc_batch_offset_ptr (FmhaBwdGroupModeKargs field) reinterpret_cast<const long_index_t>(ws + WorkspaceManager::GetDqAccOffsetsOffset(batch)) ``` Batch mode: ```cpp ws + WorkspaceManager::GetDqAccDataOffset(batch) // dq_acc_ptr // No offsets pointer; batch offset is computed inside run_() from nsplits ``` `FmhaBwdConvertQGradKernel` follows the same pattern: - Group mode: extracts `dq_acc_ptr`, `dq_acc_batch_offset_ptr`, and `nsplits_ptr` (`GetDqAccSplitsOffset(batch)`) from workspace - Batch mode: reads nsplits from `nsplits_ptr[0]`; batch offset computed internally ### Addressing in `run_()` (group mode) ```cpp // Per-batch processing: const long_index_t batch_offset_dq_acc = kargs.dq_acc_batch_offset_ptr[i_batch]; // seqq_i (physical length) derived from seqstart_q_ptr const index_t seqq_i = kargs.seqstart_q_ptr[i_batch+1] - kargs.seqstart_q_ptr[i_batch]; // nsplits_i read from nsplits_ptr (convert_dq kernel) or from GetDqAccSplits const long_index_t split_stride_i = static_cast<long_index_t>(seqq_i) * kargs.hdim_q; const long_index_t nhead_stride_i = static_cast<long_index_t>(nsplits_i) * split_stride_i; // Final address: dq_acc_base + batch_offset_dq_acc + i_nhead * nhead_stride_i + i_split * split_stride_i ``` #### nsplits computation (`PrepareWorkspaceHost`) `PrepareWorkspaceHost` is a template method of `FmhaBwdWorkspaceManager` that still takes `kN0` as a template parameter (from `BlockFmhaShape::kN0` of the dq_dk_dv pipeline). However, this parameter is only used inside this host-side function to compute nsplits — it is no longer passed into the convert_dq kernel. \| Mode \| nsplits computation \| \|------\|---------------------\| \| kUseQrQtrDorPipeline \| Writes dq directly; nsplits[0]=0; returns device_ws_size=0 \| \| Non-deterministic \| nsplits[0]=1; dq_acc used for atomic add \| \| Deterministic + group mode \| `ceil((seqstart_ks[i+1]-seqstart_ks[i]) / kN0)` computed per batch \| \| Deterministic + batch mode persistent \| Same logic as the original `GetDqAccSplits` (`dqdqkdv_workers` based) \| ### Removing kN0 dependency from `FmhaBwdConvertQGradKernel` `FmhaBwdConvertQGradKernel` previously required `kN0` as a template parameter (via `BlockFmhaBwdConvertQGradPipelineProblem`) for two purposes: 1. In batch mode `operator()`: self-computing `nsplits = ceil(seqlen_k / kN0)` 2. The `b{kM0}x{kN0}` component of the kernel name string Both have been removed in this refactor: - Batch mode: now reads `kargs.nsplits_ptr[0]` directly (guarded by `if constexpr(kIsDeterministic)` to avoid accessing a non-existent field in non-deterministic instances) - Kernel name: simplified to `b{kM0}`, no longer includes `kN0` - Template parameters: `BlockFmhaBwdConvertQGradPipelineProblem` drops the `kN0_` parameter; `fmha_bwd_convert_dq_traits_` drops the `kN0` parameter; `F_bn0`/`convert_dq_bn0` fields removed from codegen Effect: all dq_dk_dv tiles sharing the same `(hdim, dtype, mode, pad, deterministic)` combination — regardless of `F_bn0` value (16/64/128/192/256) — now share a single convert_dq kernel instance. --- ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-07 10:22:28 +08:00
ArthurLiu	67aa854621	[rocm-libraries] ROCm/rocm-libraries#6764 (commit 8c20d70) [CK][CK_TILE] Fix FMHA codegen group mode dispatch (#6764) ## Motivation FMHA codegen had incorrect dispatch behavior in group mode. Two root causes: 1. Wrong field names in dispatch conditions — Used batch-mode fields (seqlen_q, seqlen_k) instead of group-mode fields (max_seqlen_q, max_seqlen_k), causing wrong kernel selection at runtime on gfx950. 2. Missing kernel variants — Group mode was overly filtered out from smaller-tile specializations (bwd) and lacked spatial-padding pipeline variants on gfx950 (fwd). gfx942 don't support trload pipeline. ## Technical Details fmha_bwd.py: - max_seq_q_cond and extra_cond now emit t.max_seqlen_q / t.max_seqlen_k for group mode. - Relaxed kernel filtering: group mode no longer skips tiles with max_seq_q != 0. fmha_fwd.py: - get_bm0_cond emits a.max_seqlen_q for group mode tile-size dispatch. - Added two qr_async_trload pipeline variants with spatial padding for gfx950 group mode. ## Test Plan Triggering AITER CI job: ## Submission Checklist - [ x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-28 02:14:42 +08:00
Jeff Huang	08097fa515	[rocm-libraries] ROCm/rocm-libraries#6653 (commit 1df887e) [CK_TILE] fix(fmha): support >2GB KV cache in batch prefill via template dispatch (#6653) ## Motivation The CK batch prefill kernel previously failed (silent overflow + page faults) when the KV cache exceeded 2 GB, blocking long-context inference workloads (e.g., 128K+ token contexts with paged KV). Two distinct failure modes were addressed: 1. >4GB SRD overflow (`page_size < kN0`): The SRD `buffer_load_dwordx4` path uses a 32-bit `voffset` register; for small page sizes the rebased SRD spans the full KV pool and the offset wraps past 2 GB, corrupting K/V loads. 2. gfx950 page-table fault (`page_size >= kN0`): On CDNA4 the hardware validates the full SRD `num_records` range against page-table permissions (CDNA3 only checks per-instruction `voffset`). After per-tile SRD rebase, an un-trimmed `num_records` field extends past the live page and faults on freed/protected memory. ## Technical Details Two-mode `tile_scatter_gather` selected by the `kUseGlobalLoad` template parameter: \| Case \| `page_size` \| KV cache size \| Mode \| Load path \| Addressing \| \|---\|---\|---\|---\|---\|---\| \| 1 \| `>= kN0` (large pages) \| any \| SRD (`kUseGlobalLoad=false`) \| `buffer_load_dwordx4` \| 32-bit `voffset`, bounded by per-page rebase \| \| 2 \| `< kN0` (small pages) \| `<= 2 GB` \| SRD (`kUseGlobalLoad=false`) \| `buffer_load_dwordx4` \| 32-bit `voffset`, fits in INT32 byte range \| \| 3 \| `< kN0` (small pages) \| `> 2 GB` \| Global-load (`kUseGlobalLoad=true`) \| `async_load_tile_raw_flat` (K) + `load_tile_flat` (V) \| 64-bit \| Dispatch: the auto-gen API layer (`fmha_batch_prefill.py`) selects the kernel instantiation at launch from `(page_block_size, num_total_pages * batch_stride_k * kElementBytes)`, so the small-page penalty is paid only when correctness requires it. gfx950 SRD `num_records` trimming: in the K and V rebase lambdas of `block_fmha_batch_prefill_pipeline_qr_ks_vs_async`, `set_bottom_tensor_view_buffer_size(page_stride_k/v)` is called after each rebase to constrain `num_records` to the live page. Required for CDNA4 page-table validation; harmless on CDNA3. Pipeline sync for the global-load path: - V uses synchronous `load_tile_flat`; K uses `async_load_tile_raw_flat`. - `v_physical_pages_current` is double-buffered so the V flat load doesn't race against the next iteration's K rebase computation. Arch guards: `global_load_lds` intrinsics are gated to `__gfx94__` / `__gfx950__` (CDNA3+). Other architectures hit a `dependent_false` static_assert with a descriptive message. Device-side assertion convention: SRD setters use `__builtin_assume(cond)` (hint-only) rather than `<cassert>`'s `assert()`. The latter introduces an `__assert_fail` call whose register pressure scatters the K-SRD scalar register window across conditional branches, corrupting `buffer_load_dwordx4` on gfx950. ## Test Plan Tested on both MI308 (gfx942) and MI355 (gfx950) via the aiter wrapper test suite. All coverage lives in `op_tests/test_batch_prefill.py`: - Functional matrix (96 cases) — `test_batch_prefill`: `page_size ∈ {1, 16, 1024}` × `kv_layout ∈ {linear, vectorized}` × `dtype ∈ {bf16, fp8 quant variants}` × `causal` × `soft_cap` × `LSE` × `batch_size ∈ {1, 4}` (parametrized to exercise per-sequence SRD rebase across batch boundaries). - >2 GB coverage — `test_batch_prefill_large_kvcache`: extended to allocate a 5 GB+ KV cache pool and exercise both `kUseGlobalLoad=true` (small-page) and `kUseGlobalLoad=false` (large-page rebase) paths. Includes both single-batch and multi-batch (`batch_size=4`) cases to exercise per-sequence SRD rebase across the >2 GB pool. - Numerical reference: PyTorch SDPA, per-batch loop with `atol` / `rtol` from the existing batch prefill test harness. ## Test Result \| Arch \| `test_batch_prefill` \| `test_batch_prefill_large_kvcache` (>2 GB) \| \|------\|----------------------\|---------------------\| \| MI308 (gfx942) \| All passed \| Passed \| \| MI355 (gfx950) \| All passed \| Passed \| Performance impact (gfx950, hot SRD path): - +2.67% kernel-time on `seqlen=1024 / page_sz=1024 / bf16 / sglang / causal / soft_cap=30`, attributable in full to the two `set_bottom_tensor_view_buffer_size` calls in the K/V rebase lambdas (5-run median, signal/noise ≈ 9×). - This cost is mandatory for gfx950 correctness on >2 GB workloads — removing the setters re-introduces page-faults. - gfx942: 0 regressions in the same range (all configs ≤ +0.97%). ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-24 07:08:41 +08:00
Linjun-AMD	dfc1305685	[rocm-libraries] ROCm/rocm-libraries#6479 (commit 0705c2d) CK][fmha] Add StreamLLM sink support to batch_prefill pipeline (#6479) ## Motivation The existing paged-KV attention pipelines (pagedkv, splitkv) support StreamLLM-style sink tokens — a fixed set of initial tokens kept in attention alongside the sliding window. The `batch_prefill` pipeline (chunked-prefill with VLLM-style block tables) previously hardcoded `kHasSink = false`, making it incompatible with sink-based attention patterns in LLM serving scenarios. This PR extends `batch_prefill` to support `kHasSink` and wires it into `fmha_fwd_runner` for validation against the existing CPU reference. ## Technical Details Pipeline (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`): - When `kHasSink`, the K/V loop splits into a sink phase [0, sink_seq_end) and a window phase [seqlen_k_start, seqlen_k_end), mirroring pagedkv. - K advance at the sink→window transition jumps `seqlen_k_start - sink_seq_end + kN0` to bridge the gap. - V scatter-gather offsets are re-initialized at the transition to fix a window mismatch bug: V was lagging kN0 behind K after the large jump, loading from the wrong sequence position. - Bias window, dropout seq_offset, and mask type (LogitsSinkMask) updated for sink-awareness. Traits / codegen (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`, `fmha_batch_prefill.py`): - `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded `false`). - Codegen adds `F_sink` field; skips batch-mode kernels (group mode required). - CMake test filter broadened from 9 → 33 instances covering fp16/bf16 × mask/nmask × lse/nlse × sink/nsink. Runner (`fmha_fwd_runner.hpp`, `CMakeLists.txt`): - `fmha_batch_prefill()` dispatched from `run_fwd` when: group mode + paged KV + num_splits == 1. - K/V strides corrected for runner's [num_pages, nhead_k, page_block_size, hdim] layout. - `page_block_size % 128` check relaxed: batch_prefill supports ps=16. - CPU reference paged-KV reordering guards extended with `CK_TILE_FMHA_FWD_BATCH_PREFILL_API`. ## Test Plan Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run `tile_example_fmha_fwd` in group mode with page_block_size=16. Test matrix: - Mask: no-mask, causal, sliding window - Sink: nsink, sink=1..128 - dtype: fp16, bf16 - LSE output: on/off - seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024} - GQA, chunked prefill, large batch×seqlen - page_block_size: 16, 32 ## Test Result 171 test cases, all valid:y: - nmask + nsink: ✓ - causal + nsink: ✓ - causal + sink=8: ✓ - sliding window + sink=8 (d=128, d=256): ✓ - bf16, LSE output, GQA: ✓ ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-21 11:03:55 +00:00
Hosang Yoon	720cc88a31	[rocm-libraries] ROCm/rocm-libraries#6253 (commit 61934c6) [CK_TILE] Enable canonical-NaN BF16 conversion for FMHA on RDNA (#6253) ## Motivation - On gfx11/gfx12, the existing float -> bf16 conversion path in FMHA forward adds noticeable overhead and causes a meaningful performance gap versus fp16. The asm-based path (mode 3) does not improve this on RDNA and can perform even worse. - In particular, on gfx12, bf16 FMHA forward can be up to ~20% slower than the corresponding fp16 path. - This PR reduces that gap by switching FMHA forward to a different BF16 conversion strategy based on Triton’s canonical-NaN round-to-nearest-even behavior. ## Technical Details - Add a new `standard_cnan` BF16 conversion mode to CK Tile. - Implement a canonical-NaN RTN `float -> bf16` conversion path based on the Triton implementation. - Enable this conversion mode by default for FMHA forward builds targeting gfx11/gfx12. - Retune gfx11/gfx12 FMHA forward kernel selection thresholds for some `hdim=128` cases to keep kernel selection aligned with the updated conversion behavior. ## Test Plan ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={hdim} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - all tests passed when running `test_ck_tile_fmha` - BF16 FMHA forward performance improves by up to ~5% on gfx11. - BF16 FMHA forward performance improves by up to ~10% on gfx12. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-20 14:52:24 -04:00
Hosang Yoon	7e4e291771	[rocm-libraries] ROCm/rocm-libraries#6450 (commit b75fed1) [CK_TILE] Skip padded k/n fragment work in qr_hpad FMHA fwd (#6450) ## Motivation `qr_hpad` currently executes work for padded head-dim fragments even when only a subset of the values are valid. This adds unnecessary computation for head dimensions that require padding, such as `hdim=72` and `hdim=80`, and hurts FMHA forward performance. The goal of this PR is to make the padded-head-dim path skip invalid work based on the actual valid fragment count, while preserving the existing behavior for the non-padded path. ## Technical Details This PR improves the `qr_hpad` FMHA forward path in three parts: - Skip padded `k`/`n` fragments in the GEMM/pipeline path when only part of the fragment is valid. - Add partial GEMM0 tail handling for `qr_hpad` so the kernel uses the valid fragment range instead of always computing over the padded extent. - Retune the gfx11 `qr_hpad` kernel configuration after enabling the partial-fragment path. To keep the existing path stable, the implementation adds overloads for the updated GEMM/pipeline interfaces. This allows existing full-tile callers to keep using the previous form, while the `qr_hpad` path can pass valid fragment counts when needed. ## Test Plan ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - On gfx11 and gfx12, for head dimensions that require padding, `tile_example_fmha_fwd` shows about 20-30% performance improvement at `hdim=72/80`. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-18 06:44:46 +00:00
Po Yen Chen	c0d9614af6	[rocm-libraries] ROCm/rocm-libraries#6305 (commit 19e10a0) [CK] Remove obsolete benchmark_fwd_v3.sh script and README reference (#6305) The tile_example_fmha_fwd_v3 target no longer exists in this project, making this benchmark script non-functional.	2026-04-15 15:37:37 +08:00
Hosang Yoon	00faecdfcf	[rocm-libraries] ROCm/rocm-libraries#6156 (commit 367565a) [CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12 (#6156) ## Motivation On gfx11/gfx12, FMHA forward kernels that require head-dim padding show a large performance drop compared to the exact-head-dim path. In practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too far off the fast path. This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while keeping the behavior for other GPUs unchanged. ## Technical Details - Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path for gfx11/gfx12. - For `receipt=0`, keep support conservative and only enable the padded fast path for vector-safe cases (`head_dim % 8 == 0`), matching the existing assumption used on other GPUs. - Move `v_prefetch` later only for the head-dim-padded path on gfx11/gfx12. This reduces live ranges and removes the register-spill behavior seen in the earlier scheduling. - Enable the buffer-load OOB check offset trick for the padded path on gfx11/gfx12. ## Test Plan ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result Observed padded-head-dim performance improvements for HDIM=72/80: - gfx11: about ~3.5x - gfx1151: about ~2.0x - gfx12: about ~1.3x ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-08 10:51:53 -04:00
Copilot	31fd3fe9c1	[rocm-libraries] ROCm/rocm-libraries#6003 (commit b4c8c52) [CK_TILE] Refine FMHA Readme (#6003) Updates the FMHA README to document fp8 precision support more accurately, replacing the outdated "experimental" section and incomplete CLI arg descriptions. ## Changes - `-prec` arg: expanded supported values from `fp16/bf16/fp8/bf8` → `fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4` - `-qscale` arg: replaced single-line `1: per-tensor quantization` with all four modes: `pt/1`, `bs/2`, `kvbs/3`, `mx/4` - FP8 support section: replaced "FP8 experimental support" paragraph with: - Supported targets: gfx942/gfx950 + ROCm 6.0+ - Table distinguishing `fp8` / `fp8bf16` / `fp8fp32` by Q/K/V input type and output type - Table for all `-qscale` modes with descriptions - Note that `-vlayout=r` (`seqlenhdim` for V) is the only supported layout for fp8 types <!-- START COPILOT ORIGINAL PROMPT --> <details> <summary>Original prompt</summary> Please open a PR against base branch `develop` in repository `ROCm/rocm-libraries` applying the following documentation updates within the composable kernel path. ## Scope Update the file: - `projects/composablekernel/example/ck_tile/01_fmha/README.md` ## Changes to apply Apply the combined edits described in the diffs below (two consecutive patches). Ensure the final file content includes both* sets of changes. ### Patch 1 - In the CLI args section: - Update `-qscale` description lines to include: - `pt or 1, per-tensor scale` - `bs or 2, block scale` - `kvbs or 3, Q per-tensor, K/V per-page block scale` - `mx or 4, microscaling (exclusively for mxfp8/mxfp4)` - Update `-prec` supported data types from `fp16/bf16/fp8/bf8` to `fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4`. - Replace the existing "FP8 experimental support" section with an "FP8 support" section stating: - FP8 FMHA kernels supported on gfx942/gfx950 with ROCm 6.0+ - Precision selectable via `-prec=fp8` (or `fp8bf16`, `fp8fp32`) for `tile_example_fmha_fwd` - Add a table describing `-qscale` modes: - `n` or `0`: No quantization scale (default) - `pt` or `1`: Per-tensor quantization scale - `bs` or `2`: Per-block quantization scale - `kvbs` or `3`: Q per-tensor + K/V per-page block scale - `mx` or `4`: Microscaling (MX format), exclusively for `mxfp8` and `mxfp4` - Add/keep note that currently only `-vlayout=r` (`seqlenhdim` for V matrix) is supported for fp8 data types. ### Patch 2 Further refine the "FP8 support" paragraph to explain the difference between `fp8`, `fp8bf16`, and `fp8fp32` via a table: \| `-prec` value \| Q/K/V input type \| Output type \| Description \| \|---\|---\|---\|---\| \| `fp8` \| fp8 \| fp8 \| Fully fp8: both inputs and output are in fp8 \| \| `fp8bf16` \| fp8 \| bf16 \| Mixed precision: fp8 inputs, bf16 output — useful when the consumer expects a wider-range output format \| \| `fp8fp32` \| fp8 \| fp32 \| Mixed precision: fp8 inputs, fp32 output — highest-precision output, suitable for debugging or further fp32 processing \| Keep the rest of the `-qscale` table and the `-vlayout=r` limitation note. ## Notes - PR title must be: `[CK_TILE] Add fp8 in FMHA readme` - Ensure markdown formatting renders correctly (tables, code formatting). - Only modify the file listed above. The following is the prior conversation context from the user's chat exploration (may be truncated): User: 能幫我上這個pr嗎在composable kernel裡的路徑 diff --git a/example/ck_tile/01_fmha/README.md b/example/ck_tile/01_fmha/README.md index 0b526f4e9fc..1627435863b 100644 --- a/example/ck_tile/01_fmha/README.md +++ b/example/ck_tile/01_fmha/README.md @@ -62,14 +62,17 @@ args: -d_v head dim for v, -1 means equal to d (default:-1) -scale_s scale factor of S. 0 means equal to 1/sqrt(hdim). (default:0) -qscale n or 0, no scaling (default:n) - 1: per-tensor quantization. + pt or 1, per-tensor scale + bs or 2, block scale + kvbs or 3, Q per-tensor, K/V per-page block scale + mx or 4, microscaling (exclusively for mxfp8/mxfp4) -iperm permute input (default:1) if true, will be bhsd, else bshd -operm permute output (default:1) -bias n or 0, no bias (default:n) e(lementwise) or 1, elementwise bias with 11ss. e:1, 1hss. e:2, bhss a(libi) or 2, alibi with 1h. a:1, bh - -prec data type. fp16/bf16/fp8/bf8 (default:fp16) + -prec data type. fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4 (default:fp16) -mask 0: no mask, 1: top-left(same as 't'), 2:bottom-right(same as 'b') (default:0) 't', top-left causal mask, 'b', bottom-r causal mask 't:l,r', top-left sliding window attn(swa) with FA style left right size @@ -161,7 +164,17 @@ We support sequence padding and variable-length processing in both batch and gro Both approaches optimize memory access patterns while supporting flexible sequence length requirements commonly found in transformer inference scenarios. -## FP8 experimental support -As described in [this blog](https://blog.hippoml.com/8bit-hippoattention-up-to-3x-faster-compared-to-flashattentionv2-8f9def90b482), we have an experimental support for fp8 fmha kernels, you can evaluate the performance by setting the arg `-prec=fp8` to the `tile_example_fmha_fwd`, on a gfx942 machine and ROCm 6.0+. +## FP8 support +FP8 FMHA kernels are supported on gfx942/gfx950 machines with ROCm 6.0+. You can select fp8 precision by setting the arg `-prec=fp8` (or `fp8bf16`, `fp8fp32`) to the `tile_example_fmha_fwd`. -Currently we only support `-vlayout=r`( `seqlenhdim` for V matrix) for fp8 and fp8bf16 now. Full feature support will come later. +The following quantization scale modes are available via `-qscale`: + +\| `-qscale` value \| Description \| +\|---\|---\| +\| `n` or `0` \| No quantization sca... </details> <!-- START COPILOT CODING AGENT SUFFIX --> This pull request was created from Copilot chat.* > --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: asleepzzz <4926646+asleepzzz@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2026-04-08 14:58:53 +08:00
Po Yen Chen	bf736dfa74	[rocm-libraries] ROCm/rocm-libraries#6051 (commit f0838b2) [CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051) ## Motivation The existing FMHA V3 pipeline only supports fp16/bf16 data types. This PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950, enabling higher throughput for FP8 inference workloads using the assembly-optimized V3 code path. ## Technical Details Warp GEMM: - Add FP8 32x32x32 warp gemm with C-transposed distribution (`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries V3 Kernel (`fmha_fwd_v3_kernel.hpp`): - Add per-tensor descale support for Q, K, V tensors, passing descale pointers through to pipeline kargs V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`): - Add FP8 data path with dtype-aware type selection - Add asm volatile P matrix conversion from f32 to fp8 - Add FP8-aware instruction scheduling in `CoreLoopScheduler` V3 Pipeline Policy (`block_fmha_fwd_v3_pipeline_default_policy.hpp`): - Add FP8 QK warp gemm selection (SwizzleB variant for V tile distribution compatibility) Codegen (`fmha_fwd.py`): - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128) - Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale: no/pertensor) - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor Misc: - Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum (`arch.hpp`) - Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`) V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`) pending further validation. ## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950) \| Problem \| Regular (TFlops) \| Varlen (TFlops) \| \|---\|---:\|---:\| \| batch=1 heads=6/1 seqlen=1024 causal \| 48.9 \| 47.6 \| \| batch=1 heads=6/1 seqlen=2048 causal \| 119.8 \| 117.4 \| \| batch=1 heads=6/1 seqlen=4096 causal \| 263.7 \| 259.2 \| \| batch=1 heads=6/1 seqlen=8192 causal \| 548.9 \| 543.6 \| \| batch=1 heads=6/1 seqlen=16384 causal \| 1043.0 \| 1063.7 \| \| batch=1 heads=6/1 seqlen=32768 causal \| 1237.2 \| 1279.6 \| \| batch=1 heads=6/1 seqlen=65536 causal \| 1315.4 \| 1382.7 \| \| batch=1 heads=6/1 seqlen=131072 causal \| 1326.3 \| 1402.2 \| \| batch=1 heads=16/1 seqlen=65536 causal \| 1298.7 \| 1388.4 \| \| batch=1 heads=40/40 seqlen=37200 non-causal \| 1248.9 \| 1326.1 \| ## Test Plan Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40), GQA ratios (1:1, 1:8), and causal/non-causal modes. Verified all cases dispatch to the V3 pipeline by enabling `F_is_v3_enabled` and confirming kernel names contain `qr_async_trload_v3`. ## Test Result 176/176 tests passed with V3 enabled. All cases correctly dispatched to V3 pipeline with `pertensor` quantization. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-07 22:19:28 +08:00
Hosang Yoon	e18f4d8e6d	[rocm-libraries] ROCm/rocm-libraries#6038 (commit d7041a2) [CK_TILE] Restrict FMHA codegen to the kernel subset used by FlashAttention (#6038) ## Motivation Currently, the CK FlashAttention integration generates a broader FMHA kernel set than the FlashAttention wrappers can actually dispatch, which increases compile time without improving runtime coverage. ## Technical Details The FlashAttention CK wrappers do not use all logits/LSE variants emitted by the default FMHA codegen. The direct `fmha_fwd` path always uses softcap-disabled, LSE-enabled kernels, and the `fmha_fwd_splitkv` path only uses softcap-disabled kernels. This change trims codegen to that subset and stops generating the unused logits/LSE variants. This reduces the generated forward kernel set without changing `fmha_fwd_appendkv` or `fmha_bwd`. The reduced kernel set was validated by building and running the [FlashAttention](https://github.com/Dao-AILab/flash-attention) CK backend. Across targets, the total generated FMHA kernel count is reduced by: - `gfx942`: 29.3% - `gfx1100`: 33.7% - `gfx1201`: 31.3% ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> pytest test/test_flash_attn_ck.py from https://github.com/Dao-AILab/flash-attention ## Test Result all tests passed <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-02 20:16:32 -04:00
Linjun-AMD	8d1fb9d33e	[rocm-libraries] ROCm/rocm-libraries#5504 (commit 47f86c7) [CK Tile] Add sink token gradient support in FMHA backward pass (#5504) ## Motivation Adds sink token support to the FMHA backward kernel (dot_do_o pipeline): ## Technical Details - Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType - Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs - Compute per-head sink gradient via atomic accumulation in the pipeline - Update example runner with reference validation for sink gradient ## Test Plan Add new test case ## Test Result WIP ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-02 11:17:01 +08:00
Fu-Cheng Tsai	37e2cbb735	[rocm-libraries] ROCm/rocm-libraries#5798 (commit 7acd4e7) [CK_TILE] Update gfx12 FMHA forward kernel configs (#5798)	2026-04-01 14:22:37 +00:00
Hosang Yoon	0452c865d5	[rocm-libraries] ROCm/rocm-libraries#5977 (commit 794bea7) [CK_TILE] Fix Windows build in FMHA head grouping (#5977) ## Motivation This is a follow-up fix for [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018). [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018) added LLC-aware FMHA head grouping / head-major scheduling on RDNA, but it also introduced Linux-only code paths, including `<dirent.h>`, which break Windows builds. This change fixes that by guarding the Linux-specific LLC probing logic so non-Linux platforms can still build correctly. ## Technical Details - Guard `<dirent.h>` with `#ifdef __linux__` - Guard KFD sysfs traversal logic with `#if defined(__linux__)` - On non-Linux platforms, return `0` from `get_kfd_sysfs_llc_cache_bytes()` - Preserve existing fallback behavior through: - `CK_TILE_FMHA_LLC_CACHE_MB` - arch-based default LLC sizes - no head grouping when no LLC size can be resolved ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 08:18:40 -06:00
Jeff Huang	1cd6d11281	[rocm-libraries] ROCm/rocm-libraries#5918 (commit a7e2c67) [CK][CK_TILE] Add fp8bf16 hdim=256 tile for batch prefill (#5918) ## Motivation FP8 batch prefill kernels currently only support head_dim=128. Models with head_dim=256 hit the "invalid argument for batch_prefill" error because no matching kernel variant exists in the codegen dispatch. ## Technical Details Add a hdim=256 tile size entry for fp8bf16 in the batch prefill codegen recipe (`fmha_batch_prefill.py`). Tile configuration: `FmhaFwdTileSize(128, 128, 32, 256, 32, 256, 4,1,1, 4,1,1, 32,32,32, 32,32,32, -1)` - bm0=128, bn0=128 (Q/K tile sizes) - bn1=256, bk0max=256 (V head_dim=256) - Warp MFMA 32x32x32 (fp8 MFMA instructions) This mirrors the existing bf16/fp16 hdim=256 tile but uses fp8 warp sizes. ## Test Plan Tested on both MI308X (gfx942) and MI355X (gfx950) via aiter batch prefill test with the following matrix: - page_size: {1, 16, 1024} - kv_layout: {linear, vectorized} - lookup_table: {sglang, vllm} - causal: {true, false} - logits_soft_cap: {0.0, 30.0} - contiguous_kv: {true, false} ## Test Result MI308X (gfx942): 160 passed, 32 skipped (page_size=1 + vectorized not applicable) MI355X (gfx950): 120 passed, 72 skipped (pre-existing ROCm 7.2 compiler issue with causal + no softcap) No register spills on either platform. ### Profiling — MI355X (gfx950), FP8 pertensor, hdim=256, seqlen=1024, 8 heads \| page_sz \| kv_layout \| table \| causal \| soft_cap \| time_us \| TFLOPS \| \|---------\|-----------\|-------\|--------\|----------\|---------\|--------\| \| 1 \| linear \| sglang \| False \| 0.00 \| 55.01 \| 156.16 \| \| 1 \| linear \| vllm \| False \| 0.00 \| 55.12 \| 155.84 \| \| 1 \| linear \| sglang \| False \| 30.00 \| 62.63 \| 137.16 \| \| 1 \| linear \| vllm \| False \| 30.00 \| 62.16 \| 138.20 \| \| 1 \| linear \| sglang \| True \| 30.00 \| 64.09 \| 67.01 \| \| 1 \| linear \| vllm \| True \| 30.00 \| 63.85 \| 67.27 \| \| 16 \| linear \| sglang \| False \| 0.00 \| 57.00 \| 150.69 \| \| 16 \| vectorized \| sglang \| False \| 0.00 \| 57.55 \| 149.25 \| \| 16 \| linear \| vllm \| False \| 0.00 \| 56.80 \| 151.23 \| \| 16 \| vectorized \| vllm \| False \| 0.00 \| 57.32 \| 149.87 \| \| 16 \| linear \| sglang \| False \| 30.00 \| 64.77 \| 132.62 \| \| 16 \| vectorized \| vllm \| False \| 30.00 \| 63.54 \| 135.18 \| \| 16 \| linear \| sglang \| True \| 30.00 \| 66.84 \| 64.26 \| \| 16 \| vectorized \| vllm \| True \| 30.00 \| 66.12 \| 64.96 \| \| 1024 \| linear \| sglang \| False \| 0.00 \| 58.25 \| 147.46 \| \| 1024 \| vectorized \| sglang \| False \| 0.00 \| 57.53 \| 149.31 \| \| 1024 \| linear \| vllm \| False \| 0.00 \| 58.06 \| 147.94 \| \| 1024 \| vectorized \| vllm \| False \| 0.00 \| 57.55 \| 149.27 \| \| 1024 \| linear \| sglang \| False \| 30.00 \| 65.38 \| 131.38 \| \| 1024 \| vectorized \| vllm \| False \| 30.00 \| 63.64 \| 134.98 \| \| 1024 \| linear \| sglang \| True \| 30.00 \| 66.85 \| 64.25 \| \| 1024 \| vectorized \| vllm \| True \| 30.00 \| 65.26 \| 65.81 \| ### Profiling — MI308X (gfx942), FP8 pertensor, hdim=256, seqlen=1024, 8 heads \| page_sz \| kv_layout \| table \| causal \| soft_cap \| time_us \| TFLOPS \| \|---------\|-----------\|-------\|--------\|----------\|---------\|--------\| \| 1 \| linear \| sglang \| False \| 0.00 \| 110.18 \| 77.96 \| \| 1 \| linear \| vllm \| True \| 30.00 \| 134.33 \| 31.97 \| \| 1 \| linear \| sglang \| True \| 30.00 \| 134.59 \| 31.91 \| \| 16 \| linear \| sglang \| False \| 0.00 \| 115.43 \| 74.42 \| \| 16 \| vectorized \| sglang \| False \| 0.00 \| 106.11 \| 80.95 \| \| 16 \| linear \| vllm \| False \| 0.00 \| 116.34 \| 73.83 \| \| 16 \| vectorized \| vllm \| False \| 0.00 \| 106.17 \| 80.91 \| \| 16 \| linear \| sglang \| False \| 30.00 \| 135.61 \| 63.34 \| \| 16 \| vectorized \| vllm \| False \| 30.00 \| 122.37 \| 70.20 \| \| 16 \| linear \| sglang \| True \| 0.00 \| 117.44 \| 36.57 \| \| 16 \| vectorized \| vllm \| True \| 0.00 \| 108.81 \| 39.47 \| \| 16 \| linear \| sglang \| True \| 30.00 \| 139.43 \| 30.80 \| \| 16 \| vectorized \| vllm \| True \| 30.00 \| 125.87 \| 34.12 \| \| 1024 \| linear \| sglang \| False \| 0.00 \| 110.65 \| 77.63 \| \| 1024 \| vectorized \| sglang \| False \| 0.00 \| 101.70 \| 84.46 \| \| 1024 \| linear \| vllm \| False \| 0.00 \| 111.71 \| 76.89 \| \| 1024 \| vectorized \| vllm \| False \| 0.00 \| 101.55 \| 84.59 \| \| 1024 \| linear \| sglang \| False \| 30.00 \| 129.33 \| 66.42 \| \| 1024 \| vectorized \| vllm \| False \| 30.00 \| 120.95 \| 71.02 \| \| 1024 \| linear \| sglang \| True \| 0.00 \| 112.26 \| 38.26 \| \| 1024 \| vectorized \| vllm \| True \| 0.00 \| 103.02 \| 41.69 \| \| 1024 \| linear \| sglang \| True \| 30.00 \| 133.73 \| 32.12 \| \| 1024 \| vectorized \| vllm \| True \| 30.00 \| 124.75 \| 34.43 \| ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 18:20:27 +08:00
Hosang	4b0ebc5fab	[rocm-libraries] ROCm/rocm-libraries#5018 (commit b32e7e6) [CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018) ## Motivation Long-sequence FMHA can become memory-bound when K/V working sets exceed Infinity Cache (LLC), causing repeated DRAM traffic across heads. This PR introduces LLC-aware launch ordering improvements for FMHA forward, and it is currently enabled only on gfx11 and gfx12. The approach is inspired by [`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217), adapted to CK’s kernel/runner structure and layout handling. In this context, `bshd` is the layout used in Flash-Attention, while `bhsd` is the default layout used by the CK Tile FMHA example. ## Technical Details This PR adds two complementary strategies: - For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware head grouping: - Estimate LLC size (env override, KFD sysfs, or arch default). - Compute group size from K/V bytes per head vs LLC target. - Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and related tensors). - For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit launch-order adjustment: - Keep a single kernel launch. - Reinterpret block linearization in `GetTileIndex` to make execution head-major, improving temporal locality of per-head K/V reuse. Additional integration updates: - Propagate `num_head_q_total` and `head_start` through FMHA args/kargs. - Use global head indexing for dropout RNG stream mapping so grouped launches keep deterministic/consistent dropout behavior. - Keep fallback behavior unchanged when grouping is not beneficial or disabled. ## Test Plan - `test_ck_tile_fmha` - `tile_example_fmha_fwd` ## Test Result - `test_ck_tile_fmha`: all tests passed. - `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201, and all of them show higher performance compared to the baseline. The improvement is consistent, and performance is well maintained even at long sequence lengths. ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} - TFLOPs by sequence length target: gfx1100 layout: bhsd SeqLen \| Before \| After \| Speedup -- \| -- \| -- \| -- 1024 \| 56.27 \| 61.48 \| 1.09x 4096 \| 67.10 \| 72.27 \| 1.08x 8192 \| 65.99 \| 71.64 \| 1.09x 12288 \| 61.60 \| 76.61 \| 1.24x 16384 \| 58.99 \| 75.74 \| 1.28x 20480 \| 57.32 \| 74.42 \| 1.30x 24576 \| 56.89 \| 74.25 \| 1.31x 27280 \| 18.93 \| 24.48 \| 1.29x - TFLOPs by sequence length target: gfx1201 layout: bshd SeqLen \| Before \| After \| Speedup -- \| -- \| -- \| -- 1024 \| 66.79 \| 65.90 \| 0.99x 4096 \| 85.90 \| 86.80 \| 1.01x 8192 \| 77.06 \| 90.29 \| 1.17x 12288 \| 58.36 \| 88.98 \| 1.52x 16384 \| 52.12 \| 88.88 \| 1.71x 20480 \| 48.11 \| 88.42 \| 1.84x 24576 \| 47.12 \| 89.07 \| 1.89x 27280 \| 49.05 \| 50.31 \| 1.03x ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 21:18:34 +00:00
Yi DING	b7d8c938b4	[rocm-libraries] ROCm/rocm-libraries#5174 (commit a358a21) [CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic Mode (#5174) ## Motivation This PR enables a persistent-kernel execution path for FMHA backward (dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is split, stored, and converted back to final gradients. ## Technical Details - Introduces a persistent-kernel grid mapping in deterministic mode and updates split-count calculation accordingly. - Extends kernel kargs to carry batch-related info needed for persistent scheduling and dQ conversion. - Refactors dQ store conditions and adds mask-type traits/utilities and runner logging updates. ## Test Plan - Jenkins [base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline) - Jenkins [AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline) - Jenkins [FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline) - local FA tests ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-13 14:13:32 +08:00
Anton Gorenko	5f94c1aa0f	[rocm-libraries] ROCm/rocm-libraries#4368 (commit 17f7dfc) [CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368) ## Motivation Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline ## Technical Details The microscaling is used when quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4. Supported features: * only "qr" pipeline is implemented * hdim 128 and 256 (smaller hdim are not possible due to restrictions of "qr" pipeline, but they can be computed using instances with padding) * both 32x32x64 and 16x16x128 scale MFMAs are supported * Q and K scales are applied in hdim, V scales - in seqlen dimension * column-major V only * batch and group mode * bias, Alibi (tested but no instances by default, just like fp8) * masking etc. Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008 ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8 ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-11 09:59:50 +00:00
Hosang	4bdacc2065	[rocm-libraries] ROCm/rocm-libraries#5088 (commit 36ca523) [CK_TILE] Update gfx11 FMHA forward kernel configs (#5088) ## Motivation Tune gfx11 FMHA codegen to recover performance for mainly PSSK (padded seqlen_q/k) cases. This tuning is based on heuristic search and improves performance in most tested shapes. Performance should be evaluated on top of [`ROCm/rocm-libraries#5018`](https://github.com/ROCm/rocm-libraries/pull/5018) (required baseline). ## Technical Details - Updated gfx11 codegen heuristic choices for tile size and occupancy. - Updated gfx11 pipeline selection: - Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so npad cases are dispatched to the faster kernel path.` - Kept gfx12 unchanged: with PSSK support from [`ROCm/rocm-libraries#4957`](https://github.com/ROCm/rocm-libraries/pull/4957), existing gfx12 config is already sufficient. - Tuning rationale: - In some cases, higher `kBlockPerCu` lowers register pressure. - On RDNA, this generally aligns with better performance when `waves_per_eu >= 6`. ## Test Plan - test_ck_tile_fmha - tile_example_fmha_fwd: tested this on gfx1100 and gfx1151 ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - TFLOPs by sequence length target: `gfx1100` layout: `bhsd` - mode: batch / VGPR usage: 225 vs 214 SeqLen \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 74.10 \| 71.97 \| 0.97x 4096 \| 66.26 \| 77.79 \| 1.17x 8192 \| 68.18 \| 75.88 \| 1.11x 12288 \| 68.47 \| 80.44 \| 1.17x 16384 \| 59.54 \| 79.66 \| 1.34x 20480 \| 55.78 \| 77.91 \| 1.40x 24576 \| 55.08 \| 77.47 \| 1.41x 27280 \| 47.45 \| 77.16 \| 1.63x - mode: group / VGPR usage: 256 vs 214 SeqLen \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 71.47 \| 70.6 \| 0.99x 4096 \| 64.74 \| 77.06 \| 1.19x 8192 \| 64.68 \| 75.47 \| 1.17x 12288 \| 66.43 \| 79.95 \| 1.20x 16384 \| 56.02 \| 79.73 \| 1.42x 20480 \| 50.21 \| 78.15 \| 1.56x 24576 \| 47.29 \| 77.53 \| 1.64x 27280 \| 46.13 \| 77.04 \| 1.67x - TFLOPs by sequence length target: `gfx1151` layout: `bshd` - mode: batch / VGPR usage: 225 vs 223 Batch \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 26.85 \| 29.17 \| 1.09x 4096 \| 24.75 \| 26.01 \| 1.05x 8192 \| 25.24 \| 25.50 \| 1.01x 12288 \| 25.18 \| 25.00 \| 0.99x 16384 \| 24.79 \| 25.91 \| 1.05x 20480 \| 25.56 \| 25.24 \| 0.99x 24576 \| 25.13 \| 26.20 \| 1.04x 27280 \| 10.78 \| 26.35 \| 2.44x - mode: group / VGPR usage: 256 vs 229 Batch \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 27.44 \| 26.71 \| 0.97x 4096 \| 21.89 \| 23.09 \| 1.05x 8192 \| 22.85 \| 24.49 \| 1.07x 12288 \| 24.33 \| 24.42 \| 1.00x 16384 \| 20.05 \| 24.98 \| 1.24x 20480 \| 14.70 \| 25.15 \| 1.71x 24576 \| 11.30 \| 26.31 \| 2.33x 27280 \| 10.10 \| 26.32 \| 2.61x ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-10 09:46:41 -07:00
Anton Gorenko	9705d53ad9	[rocm-libraries] ROCm/rocm-libraries#4957 (commit d2e4eb8) [CK_TILE][FMHA] Extend pipelines with pssk for gfx11/12 (#4957) ## Motivation Build pipelines with seqlen padding only to support vectorized loads in the hdim dimension. The existing pipelines have either all dims padded or all dims not padded. These pipelines can be used in ComfyUI for slightly better performance. ## Technical Details Also a fix included for correct FLOPS calculation in `tile_example_fmha_fwd` when `seqlen_q * seqlen_k` overflows index_t capacity (signed int32). ## Test Plan The existing test cases will use the new pipelines when parameters allow (seqlens - padded, hdims - not padded): ``` ninja test_ck_tile_fmha_fwd bin/test_ck_tile_fmha_fwd_fp16 bin/test_ck_tile_fmha_fwd_bf16 bin/test_ck_tile_fmha_fwd_fp8bf16 # for gfx12 ``` ## Test Result All tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-03 20:50:05 -08:00
Yi DING	2709abee7e	[rocm-libraries] ROCm/rocm-libraries#4577 (commit a36922c) [CK_TILE] FMHA BWD Launcher Interface (#4577) ## Motivation Reduce memory usage; Be prepared to implement optimizations of reducing nsplits in deterministic cases. ## Technical Details This PR introduces a new launcher interface for the FMHA backward operation, replacing direct function calls with a more structured approach. The launcher encapsulates kernel dispatch logic and provides access to computed metadata like the number of dQ acc splits. Changes: - Added `fmha_bwd_launcher` class that wraps kernel execution and exposes `dq_acc_splits` - Moved `fmha_bwd_traits` construction earlier in the execution flow to support launcher initialization - Refactored code generation to produce both legacy API and new launcher constructor ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-04 09:20:06 +08:00
Brock Hargreaves	a49a464f74	[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502) [CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045) ## Motivation Still addressing errors that are blocking the merge of TheRock PR: https://github.com/ROCm/TheRock/actions/runs/22545831304/job/65308264096?pr=3382 ## Technical Details 1. There are multiple fmha python scripts that are writing native paths which are confusing cmake. I addressed one of these in an earlier PR https://github.com/ROCm/rocm-libraries/pull/4812 and now I'm addressing more that are exposed with gfx1200 target: ``` [composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library): [composable_kernel configure] Syntax error in cmake code when parsing string [composable_kernel configure] [composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp [composable_kernel configure] [composable_kernel configure] Invalid character escape '\b'. ``` 2. In the following compiler error we see gemm_prec_str<ADataType, BDataType> being passed as a function to concat(...), instead of being evaluated with the parenthesis operator(), i.e., gemm_prec_str<ADataType, BDataType>(). There are multiples instances of this, I wonder what non-msvc compilers do here: ``` [composable_kernel] FAILED: [code=1] example/ck_tile/38_block_scale_gemm/CMakeFiles/tile_example_gemm_quant.dir/gemm_bquant_quantgrouped_mx_bf16bf8.cpp.obj [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_mx_bf16bf8.cpp:4: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm\run_gemm_quant_example.inc:17: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:7: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/concat.hpp:119:21: error: implicit conversion between pointer-to-function and pointer-to-object is a Microsoft extension [-Werror,-Wmicrosoft-cast] [composable_kernel] 119 \| ((oss << sep << rest), ...); [composable_kernel] \| ^~~~ [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp:248:16: note: in instantiation of function template specialization 'ck_tile::concat<char, char[11], std::basic_string<char> (), std::basic_string<char>>' requested here [composable_kernel] 248 \| return concat('_', "gemm_quant", gemm_prec_str<ADataType, BDataType>, GemmPipeline::GetName()); [composable_kernel] \| ^ ``` There are plenty of other places where we use gemm_prec_str with the operator(), so I'm pretty sure these were just typos...but I'd like some eyes on it. 3. There are 2 tests that fail to build on Windows, which I've excluded from the build but will open bug tickets for: 1. gemm_weight_preshuffle 2. grouped_gemm_preshuffle Here's a sample of the compiler error for these tests: ``` [composable_kernel] [16/19] Building HIP object test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] FAILED: [code=1] test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] E:\TheRock\build\core\clr\dist\lib\llvm\bin\clang++.exe -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_OCP_FP8 -DCK_USE_WMMA -DCK_USE_WMMA_FP8 -DCK_USE_XDL -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -IE:/TheRock/rocm-libraries/projects/composablekernel/profiler/include -IE:/TheRock/rocm-libraries/projects/composablekernel -IE:/TheRock/rocm-libraries/projects/composablekernel/library/include -IE:/TheRock/rocm-libraries/projects/composablekernel/include -IE:/TheRock/build/ml-libs/composable_kernel/build/include -IE:/TheRock/build/base/half/stage/include -isystem E:/TheRock/build/core/clr/dist/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/getopt-src/src -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx1200 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -DCK_TILE_USE_OCP_FP8 -MD -MT test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -MF test\ck_tile\grouped_gemm_preshuffle\CMakeFiles\test_ck_tile_grouped_gemm_preshuffle.dir\test_grouped_gemm_preshuffle.cpp.obj.d -o test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -x hip -c E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp:8: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:6: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/check_err.hpp:16: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core.hpp:89: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core/utility/env.hpp:110:31: warning: 'getenv' is deprecated: This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [-Wdeprecated-declarations] [composable_kernel] 110 \| const char* vp = std::getenv(name); [composable_kernel] \| ^ [composable_kernel] C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h:1183:20: note: 'getenv' has been explicitly marked deprecated here [composable_kernel] 1183 \| _Check_return_ _CRT_INSECURE_DEPRECATE(_dupenv_s) [composable_kernel] \| ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:368:55: note: expanded from macro '_CRT_INSECURE_DEPRECATE' [composable_kernel] 368 \| #define _CRT_INSECURE_DEPRECATE(_Replacement) _CRT_DEPRECATE_TEXT( \ [composable_kernel] \| ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:358:47: note: expanded from macro '_CRT_DEPRECATE_TEXT' [composable_kernel] 358 \| #define _CRT_DEPRECATE_TEXT(_Text) __declspec(deprecated(_Text)) [composable_kernel] \| ^ [composable_kernel] clang++: error: clang frontend command failed due to signal (use -v to see invocation) [composable_kernel] AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git a2dc42b87c63e686377a69f09ea23aec7550babc+PATCHED:e4d5bf498b7b8626bb9716f1f5a5946d45025918) [composable_kernel] Target: x86_64-pc-windows-msvc [composable_kernel] Thread model: posix [composable_kernel] InstalledDir: E:\TheRock\build\core\clr\dist\lib\llvm\bin [composable_kernel] clang++: note: diagnostic msg: Error generating preprocessed source(s). [composable_kernel] ninja: build stopped: subcommand failed. [composable_kernel FAILED WITH CODE 1 in 238 seconds] ninja: build stopped: subcommand failed. ``` ## Test Plan Wait for internal CI and make sure build compiles locally. ## Test Result Waiting on CI ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-03 13:54:08 -08:00
Linjun-AMD	dd51de47ff	[rocm-libraries] ROCm/rocm-libraries#4313 (commit 080ac66) [CK] Fix gptoss sink (#4313) ## Motivation This PR removes conditional logic for handling infinity values in the sink mechanism across multiple FMHA pipeline implementations, defaulting sink_size to 0 and adding a constraint in the kernel selection logic. ## Technical Details Changes: Removed __builtin_isinf_sign(sink_v) checks and conditional initialization of LSE accumulators across 7 pipeline files Added default initialization (= 0) for sink_size in 4 argument structs Added F_sink == "f" constraint to kernel compatibility checking ## Test Plan Local test ## Test Result passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-03-02 09:53:52 +08:00
Brock Hargreaves	0610c3a462	[rocm-libraries] ROCm/rocm-libraries#4812 (commit bb5a4dd) [CK] Use as_posix() instead of str() for paths in fmha_fwd_appendkv.py (#4812) ## Motivation This is causing a failing PR for Windows: https://github.com/ROCm/TheRock/pull/3382 ``` [composable_kernel configure] -- Jenga kernel files to be generated: B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_api.cpp [composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library): [composable_kernel configure] Syntax error in cmake code when parsing string [composable_kernel configure] [composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp [composable_kernel configure] [composable_kernel configure] Invalid character escape '\b'. ``` ## Technical Details The file: [fmha_fwd_appendkv.py](https://github.com/ROCm/rocm-libraries/compare/users/brockhargreaves-amd/ck/fix-windows-cmake-path-problem?expand=1#diff-bef22bf9ba21eb93c725493ecc7edcb6f2a8f0a9a173dcfca6bda7a9f4eced78) writes a bunch of paths to a text file which is later parsed by cmake. When passing a pathlib.Path to str(), str() converts to a native path, in this case / to \\ on Windows which confuses cmake. In this case we need to write paths with forward slashes and then pass those onward to cmake. ## Test Plan 1. Ensure this doesn't impact existing CI. 2. Ensure compilation of Windows pass locally. ## Test Result 1. Passes existing CI 2. This fixes the compilation error locally. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-25 09:12:46 -07:00
Brock Hargreaves	6225173e07	[rocm-libraries] ROCm/rocm-libraries#4819 (commit b995a0b) [CK] Fix windows build issues (#4819) ## Motivation Full build on Windows is currently broken due to compiler errors, this PR should help fix that. This is also holding up the following PR in the TheRock: https://github.com/ROCm/TheRock/pull/3382 ## Technical Details 1. I don't see a good reason to be nesting a windows include inside the ck_tile namespace. It was causing compiler errors too: Windows.h comes with min and max, which was conflicting with ck_tile::min and ck_tile::max, so I moved it out. I also defined NOMINMAX to prevent this inclusion in the future. 2. The TRUE/FALSE macros are already used by Windows.h, which causes an error. So I've opted for True/False. You can see this pattern in other rocm-libraries. 3. The M_PI macro isn't available, at least in the WIN32_LEAN_AND_MEAN context, from \<cmath\> on Windows. We'll be able to use std::numbers::v_pi\<float\> when we have C++20 support. 4. There was a missing \<chrono\> include. ## Test Plan Test locally and make sure this doesn't impact existing CI. ## Test Result Compiles locally and passes existing ci. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-25 09:12:19 -07:00
Anton Gorenko	841e6b89d1	[rocm-libraries] ROCm/rocm-libraries#4584 (commit 42efd1d) [CK_TILE][FMHA] Support gfx11 (#4584) ## Motivation Add support of gfx11 architectures (RDNA3) to FMHA. ## Technical Details Distributions (matrix elements to lane registers mapping) of gfx11 WMMA are completely different from distributions of gfx9 MFMA and gfx12 WMMA. There are two cases in FMHA where this difference matters: * usage of results (matrix C) of one GEMM as input (matrix A) of another GEMM. * random number generation for dropout (implementation for gfx9 MFMA, gfx12 WMMA and host validation produce the same results). Both cases are solved by a special remapping implemented using `__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`. Additional changes: * FMHA tests are now build and run only for those types for which instances exist (gfx11 supports only fp16 and bf16). * Two fixes for uninitialized values (`mask.sink` and `do_fp8_static_quant`): they may contain garbage resulting in incorrect dispatching logic, sometimes tests report that there are no instance available for current parameters. * Small fix to remove expcnt(0) from s_waitcnt instruction on gfx11 when they are not requested (i.e. every time), likely has no effect on performance but makes disassembly a bit clearer. ## Test Plan ``` ninja test_ck_tile_fmha bin/test_ck_tile_fmha_fwd_fp16 bin/test_ck_tile_fmha_fwd_bf16 bin/test_ck_tile_fmha_bwd_fp16 bin/test_ck_tile_fmha_bwd_bf16 ``` ## Test Result All tests must pass (some tests may be skipped). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-20 17:15:10 -08:00
Chinmay Dattanand Kuchinad	0cafa68b6f	[rocm-libraries] ROCm/rocm-libraries#4292 (commit b7f1367) Enable group mode (varlen) kernel generation for PyTorch integration (#4292) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes This PR enables group mode (variable-length attention) kernel generation for PyTorch's CK SDPA backend. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [X] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [X] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion The change is minimal (single line deletion) but enables a significant feature: variable-length attention support for ROCm users via PyTorch's torch.nn.attention.varlen API.	2026-02-09 20:59:55 +00:00

1 2 3 4 5

216 Commits