composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-16 10:59:55 +00:00

Author	SHA1	Message	Date
Ding, Yi	92f2ed758e	[fmha-bwd] Implement group-mode persistent scheduling with optimized state management	2026-04-16 22:27:25 -05:00
Ding, Yi	1a9404ac96	[CK_TILE] Use Persistent Scheduling for FMHA BWD Group Deterministic	2026-04-09 03:24:49 -05:00
Ding, Yi	bedd60a568	Merge remote-tracking branch 'origin/develop' into users/yiding12/fmha-bwd-workspace	2026-04-07 23:13:09 -05:00
Ding, Yi	5a63b343d0	Fix	2026-04-07 23:12:39 -05:00
Christopher Millette	870112b861	[CK_TILE] Flatten nested static_for loops into static_ford (#5939)

## Summary

Mechanical conversion of 129 nested `static_for`/`static_ford` patterns to flat `static_ford` across 29 ck_tile header files. Each conversion eliminates intermediate lambda closure instantiations by replacing nested compile-time loops with a single flat iteration using index decomposition.

### What `static_ford` eliminates

When `static_for` loops are nested, each level creates unique closure types:

```cpp
// BEFORE: M + M×N = 20 IR functions (for M=4, N=4)
static_for<0, 4, 1>{}([&](auto m) {      // 4 closure instantiations
    static_for<0, 4, 1>{}([&](auto n) {  // 4×4 = 16 closure instantiations
        body(m, n);
    });
});

// AFTER: M×N = 16 IR functions (with ford_applier, no intermediates)
static_ford<sequence<4, 4>>{}([&](auto mn) {
    constexpr auto m = number<mn[number<0>{}]>{};
    constexpr auto n = number<mn[number<1>{}]>{};
    body(m, n);
});
```

### Pattern categories converted

| Category | Count | Description |
|----------|-------|-------------|
| C (2-level `static_for` chains) | 112 | Nested `static_for` → `static_ford` |
| C3 (3-level `static_for` chains) | 9 | Three consecutive nests → `static_ford` |
| Partial rescue | 3 | Outer 2 levels of blocked 4-level nests |
| B (nested `static_ford` merge) | 5 | Two nested `static_ford` → single higher-dim `static_ford` |
| Total | 129 | Across 29 files |

6 false positives were detected and reverted (in `tensor_adaptor.hpp`, `tile_distribution.hpp`, `tile_distribution_encoding.hpp`) where the inner loop bound depended on the outer variable.

### Files changed by family

| Family | Files | Sites |
|--------|-------|-------|
| Block GEMM | 12 | ~20 |
| FlatMM pipelines | 4 | ~69 (including 5 ford-ford merges) |
| GEMM quant | 7 | ~22 |
| FlatMM kernel | 1 | 2 |
| FMHA | 1 | 2 |
| Reduce/norm | 2 | 2 |
| Epilogue | 1 | 1 |

### Blocked locations from review comments

- block_gemm_areg_breg_creg_v1.hpp:356 (BLOCKED): runtime scale loads (`scale_a_slice`, `scale_b_slice`, A warp tensor load) between every nesting level
- block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp:228 (BLOCKED): `zero_accumulators()` before inner loop; `sched_barrier` + conditional `block_sync_lds()` after inner loop
- block_universal_gemm_as_aquant_bs_bquant_cr.hpp:298 (BLOCKED): runtime `CWarpTensor` construction before inner loop; quantization scale application code after inner loop
- block_universal_gemm_as_aquant_bs_cr.hpp:277 (BLOCKED): same pattern as above
- block_universal_gemm_as_bs_bquant_cr.hpp:367 (BLOCKED): same pattern as above

## Depends on

- #5938 ([CK_TILE] Optimize static_ford and sequence compile-time infrastructure): provides the `ford_applier` that makes these conversions beneficial. Without it, `static_ford` uses a recursive implementation that provides no IR function savings.

## Results (combined with #5938)

### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942)

| Target | Base (s) | Treat (s) | Delta | % | Significant? |
|--------|----------|-----------|-------|---|-------------|
| flatmm | 161.1 | 149.0 | -12.1s | -7.5% | YES (p<0.01, 7/7 wins) |
| universal_gemm | 225.4 | 220.3 | -5.1s | -2.3% | YES (p<0.01, 7/7 wins) |

### IR Function Counts (device trace, gfx942)

| Target | InstFunc | CodeGen |
|--------|----------|---------|
| universal_gemm | -8.5% | -9.2% |
| flatmm | -7.6% | -10.5% |

### ASM Equivalence

5/5 PASS. 650,151 lines verified identical (gfx942). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale.

## Test plan

- [x] ASM equivalence verified (650K lines, gfx942)
- [x] Wilcoxon timing verified (7 trials, p<0.01)
- [x] IR function counts verified (-7.6% to -10.5% CodeGen reduction)
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>	2026-04-07 08:36:45 -06:00
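The index decomposition described in #5939 can be illustrated outside ck_tile. The sketch below is plain C++17 and does not use `static_ford` or `ford_applier`; `flat_for` and `visit_order_2x3` are hypothetical names introduced only for illustration. It shows a single flat compile-time loop over an M×N grid that recovers `(m, n)` as `I / N` and `I % N`, so the body is instantiated once per `(m, n)` pair with no per-`m` intermediate closure.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical helper (NOT the ck_tile implementation): invoke
// f(integral_constant<m>, integral_constant<n>) for every (m, n) in an
// M x N grid by expanding ONE flat index sequence of length M * N.
template <std::size_t M, std::size_t N, typename F, std::size_t... Is>
constexpr void flat_for_impl(F&& f, std::index_sequence<Is...>)
{
    // Row-major decomposition of the flat index I: m = I / N, n = I % N.
    // The comma fold guarantees left-to-right evaluation order.
    (f(std::integral_constant<std::size_t, Is / N>{},
       std::integral_constant<std::size_t, Is % N>{}),
     ...);
}

template <std::size_t M, std::size_t N, typename F>
constexpr void flat_for(F&& f)
{
    flat_for_impl<M, N>(static_cast<F&&>(f), std::make_index_sequence<M * N>{});
}

// Example body: record the visit order of a 2 x 3 grid as m * 10 + n.
constexpr std::array<int, 6> visit_order_2x3()
{
    std::array<int, 6> out{};
    std::size_t pos = 0;
    flat_for<2, 3>([&](auto m, auto n) {
        out[pos++] = static_cast<int>(m.value * 10 + n.value);
    });
    return out;
}
```

Because `m` and `n` arrive as `std::integral_constant` values, the body can still use them as compile-time constants, which is the property the nested `static_for` form provided.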
Po Yen Chen	6dc44114ba	[CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051)

## Motivation

The existing FMHA V3 pipeline only supports fp16/bf16 data types. This PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950, enabling higher throughput for FP8 inference workloads using the assembly-optimized V3 code path.

## Technical Details

Warp GEMM:
- Add FP8 32x32x32 warp gemm with C-transposed distribution (`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries

V3 Kernel (`fmha_fwd_v3_kernel.hpp`):
- Add per-tensor descale support for Q, K, V tensors, passing descale pointers through to pipeline kargs

V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`):
- Add FP8 data path with dtype-aware type selection
- Add asm volatile P matrix conversion from f32 to fp8
- Add FP8-aware instruction scheduling in `CoreLoopScheduler`

V3 Pipeline Policy (`block_fmha_fwd_v3_pipeline_default_policy.hpp`):
- Add FP8 QK warp gemm selection (SwizzleB variant for V tile distribution compatibility)

Codegen (`fmha_fwd.py`):
- Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128)
- Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale: no/pertensor)
- Extend `can_dispatch_v3` condition for fp8bf16 + pertensor

Misc:
- Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum (`arch.hpp`)
- Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`)

V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`) pending further validation.

## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950)

| Problem | Regular (TFlops) | Varlen (TFlops) |
|---|---:|---:|
| batch=1 heads=6/1 seqlen=1024 causal | 48.9 | 47.6 |
| batch=1 heads=6/1 seqlen=2048 causal | 119.8 | 117.4 |
| batch=1 heads=6/1 seqlen=4096 causal | 263.7 | 259.2 |
| batch=1 heads=6/1 seqlen=8192 causal | 548.9 | 543.6 |
| batch=1 heads=6/1 seqlen=16384 causal | 1043.0 | 1063.7 |
| batch=1 heads=6/1 seqlen=32768 causal | 1237.2 | 1279.6 |
| batch=1 heads=6/1 seqlen=65536 causal | 1315.4 | 1382.7 |
| batch=1 heads=6/1 seqlen=131072 causal | 1326.3 | 1402.2 |
| batch=1 heads=16/1 seqlen=65536 causal | 1298.7 | 1388.4 |
| batch=1 heads=40/40 seqlen=37200 non-causal | 1248.9 | 1326.1 |

## Test Plan

Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40), GQA ratios (1:1, 1:8), and causal/non-causal modes. Verified all cases dispatch to the V3 pipeline by enabling `F_is_v3_enabled` and confirming kernel names contain `qr_async_trload_v3`.

## Test Result

176/176 tests passed with V3 enabled. All cases correctly dispatched to V3 pipeline with `pertensor` quantization.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-07 22:19:28 +08:00
Jeff Huang	449844e3d3	[CK_TILLE] Temporarily remove batch prefill KV cache overflow asserts (#6201 ) ## Summary - Temporarily remove the KV cache offset overflow assert checks in `FmhaBatchPrefillWithPagedKVCacheKernel` - The asserts are correct, but they block project progress in certain configurations - This is a temporary workaround to unblock progress; a proper fix will follow ## Note This is NOT a permanent solution. A follow-up PR will add proper overflow handling that addresses the underlying issue without blocking progress.	2026-04-07 20:41:24 +08:00
Ding, Yi	3848d2411a	Merge origin/develop into users/yiding12/fmha-bwd-workspace	2026-04-07 05:28:49 -05:00
Ding, Yi	28afc8fee3	[CK_TILE] Use Unified Workspace for FMHA BWD	2026-04-03 03:54:38 -05:00
Linjun-AMD	ba0efe01af	[CK Tile] Add sink token gradient support in FMHA backward pass (#5504 ) ## Motivation Adds sink token support to the FMHA backward kernel (dot_do_o pipeline): ## Technical Details - Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType - Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs - Compute per-head sink gradient via atomic accumulation in the pipeline - Update example runner with reference validation for sink gradient ## Test Plan Add new test case ## Test Result WIP ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-02 11:17:01 +08:00
Yi DING	9b8b2456b4	[CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991 ) ## Motivation After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard around the `num_total_loop <= 0` early-exit check, the IGLP pipeline (`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV gradients for non-masking kernels (even with fix in #5915). Assembly inspection confirms that the CFG change causes the LLVM register allocator to reuse AGPR accumulators as scratch destinations in the dK/dV reduction loop, breaking the loop-carried accumulation across Q-tile iterations. ## Technical Details - Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in `BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing: it restores the CFG shape that the register allocator needs to correctly assign dedicated AGPRs to each column of the dK/dV accumulator. - Only the IGLP pipeline is affected; the other two BWD pipelines do not exhibit this issue. ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 13:44:04 +08:00
Yi DING	4a1abd0e31	[CK_TILE] Fix FMHA BWD register pressure by wrapping num_total_loop with amd_wave_read_first_lane (#5915 ) ## Motivation In three FMHA backward pipelines, `num_total_loop` is computed without `amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even though it is logically uniform across all lanes. This raises register pressure, and under high pressure the compiler may reuse VGPRs across overlapping live ranges. This was confirmed via assembly inspection: the compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an intermediate value for dV, producing incorrect dK/dV gradients. ## Technical Details Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three pipelines: - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr` - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp` - `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr` This promotes `num_total_loop` to an SGPR, eliminating the excess register pressure and the incorrect VGPR reuse. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 09:44:35 +08:00
Yi DING	8554618d6a	[CK_TILE] Fix NaN for FMHA BWD When seq_q=0 (#5790 ) ## Motivation This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the effective query sequence length for a tile is zero, by ensuring the per-tile pipelines exit early with zeroed accumulators and by avoiding an early kernel return that prevented writing out cleared gradients. ## Technical Details - Add unconditional early-exit in the dK/dV pipelines when `num_total_loop <= 0` (no work), returning zeroed accumulators. - Adjust group-mode kernel early-return logic to only return when both `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and store cleared dK/dV when `seqlen_q == 0`. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-03-27 15:54:01 +08:00
assistant-librarian[bot]	39bc8453c6	[CK_TILE] add tf32 support (#4302) ## Proposed changes TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in CK_TILE on gfx942 and gfx950. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion --- 🔁 Imported from [ROCm/composable_kernel#3538](https://github.com/ROCm/composable_kernel/pull/3538) 🧑‍💻 Originally authored by @yingluAMD --------- Co-authored-by: yingluAMD <Yingmao.Lu@amd.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-03-19 10:17:20 +01:00
Hosang	b5894b3cbe	[CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018)

## Motivation

Long-sequence FMHA can become memory-bound when K/V working sets exceed Infinity Cache (LLC), causing repeated DRAM traffic across heads.

This PR introduces LLC-aware launch ordering improvements for FMHA forward, and it is currently enabled only on gfx11 and gfx12. The approach is inspired by [`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217), adapted to CK’s kernel/runner structure and layout handling.

In this context, `bshd` is the layout used in Flash-Attention, while `bhsd` is the default layout used by the CK Tile FMHA example.

## Technical Details

This PR adds two complementary strategies:

- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware head grouping:
  - Estimate LLC size (env override, KFD sysfs, or arch default).
  - Compute group size from K/V bytes per head vs LLC target.
  - Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and related tensors).
- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit launch-order adjustment:
  - Keep a single kernel launch.
  - Reinterpret block linearization in `GetTileIndex` to make execution head-major, improving temporal locality of per-head K/V reuse.

Additional integration updates:

- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped launches keep deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or disabled.

## Test Plan

- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`

## Test Result

- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201, and all of them show higher performance compared to the baseline. The improvement is consistent, and performance is well maintained even at long sequence lengths.

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

- TFLOPs by sequence length, target: gfx1100, layout: bhsd

| SeqLen | Before | After | Speedup |
| -- | -- | -- | -- |
| 1024 | 56.27 | 61.48 | 1.09x |
| 4096 | 67.10 | 72.27 | 1.08x |
| 8192 | 65.99 | 71.64 | 1.09x |
| 12288 | 61.60 | 76.61 | 1.24x |
| 16384 | 58.99 | 75.74 | 1.28x |
| 20480 | 57.32 | 74.42 | 1.30x |
| 24576 | 56.89 | 74.25 | 1.31x |
| 27280 | 18.93 | 24.48 | 1.29x |

- TFLOPs by sequence length, target: gfx1201, layout: bshd

| SeqLen | Before | After | Speedup |
| -- | -- | -- | -- |
| 1024 | 66.79 | 65.90 | 0.99x |
| 4096 | 85.90 | 86.80 | 1.01x |
| 8192 | 77.06 | 90.29 | 1.17x |
| 12288 | 58.36 | 88.98 | 1.52x |
| 16384 | 52.12 | 88.88 | 1.71x |
| 20480 | 48.11 | 88.42 | 1.84x |
| 24576 | 47.12 | 89.07 | 1.89x |
| 27280 | 49.05 | 50.31 | 1.03x |

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 21:18:34 +00:00
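The group-size step in #5018 ("compute group size from K/V bytes per head vs LLC target") might look roughly like the host-side sketch below. The commit message does not give the exact formula, so this is an assumption: the per-head working set is modeled as `2 * seqlen_k * hdim * bytes_per_element`, and the group size is the LLC target divided by that, clamped to `[1, num_heads]`. The names `kv_bytes_per_head`, `heads_per_group`, and `num_launches` are hypothetical, as is any LLC target value passed in.

```cpp
#include <cstdint>

// Assumed model: K and V are each seqlen_k x hdim elements per head,
// with no padding, so one head touches twice that many bytes.
constexpr std::uint64_t kv_bytes_per_head(std::uint64_t seqlen_k,
                                          std::uint64_t hdim,
                                          std::uint64_t bytes_per_element)
{
    return 2 * seqlen_k * hdim * bytes_per_element;
}

// Hypothetical rule: take as many heads as fit in the LLC target,
// but always at least one head and never more than num_heads.
constexpr std::uint64_t heads_per_group(std::uint64_t llc_target_bytes,
                                        std::uint64_t kv_bytes,
                                        std::uint64_t num_heads)
{
    const std::uint64_t fit =
        kv_bytes == 0 ? num_heads : llc_target_bytes / kv_bytes;
    return fit < 1 ? 1 : (fit > num_heads ? num_heads : fit);
}

// Number of grouped kernel launches needed to cover all heads (ceil-div).
constexpr std::uint64_t num_launches(std::uint64_t num_heads,
                                     std::uint64_t group)
{
    return (num_heads + group - 1) / group;
}
```

Under this model, the benchmark shape above (`-prec=bf16 -h=24 -d=128`) at a sequence length of 16384 gives 8 MiB of K/V per head; with an illustrative 32 MiB target that is 4 heads per group and 6 launches.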
Yi DING	f6bfcad437	[CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic Mode (#5174 ) ## Motivation This PR enables a persistent-kernel execution path for FMHA backward (dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is split, stored, and converted back to final gradients. ## Technical Details - Introduces a persistent-kernel grid mapping in deterministic mode and updates split-count calculation accordingly. - Extends kernel kargs to carry batch-related info needed for persistent scheduling and dQ conversion. - Refactors dQ store conditions and adds mask-type traits/utilities and runner logging updates. ## Test Plan - Jenkins [base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline) - Jenkins [AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline) - Jenkins [FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline) - local FA tests ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-13 14:13:32 +08:00
Anton Gorenko	25d9fdfc16	[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368 ) ## Motivation Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline ## Technical Details The microscaling is used when quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4. Supported features: * only "qr" pipeline is implemented * hdim 128 and 256 (smaller hdim are not possible due to restrictions of "qr" pipeline, but they can be computed using instances with padding) * both 32x32x64 and 16x16x128 scale MFMAs are supported * Q and K scales are applied in hdim, V scales - in seqlen dimension * column-major V only * batch and group mode * bias, Alibi (tested but no instances by default, just like fp8) * masking etc. Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008 ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8 ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-11 09:59:50 +00:00
rocking	baa73e1515	[CK_TILE] Fix FMHA async pipeline LDS sync issue (#4742)

## Motivation

Fix FMHA forward async pipeline (`block_fmha_pipeline_qr_ks_vs_async.hpp`) sync issue. Some attention test cases intermittently fail due to a race condition where the V tile store to LDS overwrites K tile data that is still being read by other threads during the tail `gemm_0` operation.

## Technical Details

In the `BlockFmhaPipelineQRKSVSAsync` pipeline, K and V tiles share the same LDS memory through a rotation schedule (`LdsSeq`). After the tail `gemm_0` (line 458), some fast threads may proceed to store V to LDS (line 617) before slow threads finish reading K data from the same LDS buffer.

The fix adds an `s_barrier` synchronization after the tail `gemm_0` when K's last sub-tile and V's first sub-tile use the same LDS buffer (i.e., `LdsSeq[k0_loops - 1] == LdsSeq[k0_loops]`):

`if constexpr(LdsSeq.at(number<k0_loops - 1>{}) == LdsSeq.at(number<k0_loops>{})) __builtin_amdgcn_s_barrier();`

Why `s_barrier` alone is sufficient (no `s_waitcnt lgkmcnt(0)` needed):

- The `gemm_0` MFMA instruction internally waits for its LDS operands (`ds_read`) to complete before execution.
- Therefore, each thread's `ds_read` of K data is already complete by the time `gemm_0` finishes.
- Only cross-thread synchronization (`s_barrier`) is needed to ensure all threads have finished reading before any thread starts writing V.

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-03-09 18:05:36 +00:00
Jeff Huang	7654fd478a	[CK] Fix 32-bit overflow in batch prefill kernel for >4GB KV cache (#4999 ) Use SRD rebasing for page_block_size >= kN0: move SRD base pointer to page start via 48-bit arithmetic, encode only within-page offset in voffset. Original code path preserved for ps1/ps16 via constexpr-if. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-05 09:08:01 +08:00
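The overflow fixed in #4999 and the rebasing idea can be shown with host-side integer arithmetic. This is only an illustration under stated assumptions: it uses plain 64-bit integers rather than a real buffer resource descriptor (the commit message mentions 48-bit arithmetic on the SRD base), and `RebasedAddress`, `rebase_to_page`, `wrapped_offset32`, and the page size used in the note are hypothetical.

```cpp
#include <cstdint>

// Result of rebasing: a wide base address moved to the start of the page,
// plus a small within-page byte offset that fits in 32 bits.
struct RebasedAddress
{
    std::uint64_t base;
    std::uint32_t offset;
};

// Rebased form: the page_index * page_bytes product is done in wide
// arithmetic and folded into the base, so only offset_in_page has to fit
// in the 32-bit offset field.
constexpr RebasedAddress rebase_to_page(std::uint64_t cache_base,
                                        std::uint64_t page_index,
                                        std::uint64_t page_bytes,
                                        std::uint32_t offset_in_page)
{
    return {cache_base + page_index * page_bytes, offset_in_page};
}

// Naive form: the whole byte offset is squeezed into 32 bits, so it wraps
// modulo 2^32 once the cache grows past 4 GiB.
constexpr std::uint32_t wrapped_offset32(std::uint64_t page_index,
                                         std::uint64_t page_bytes,
                                         std::uint32_t offset_in_page)
{
    return static_cast<std::uint32_t>(page_index * page_bytes + offset_in_page);
}
```

For example, with 1 MiB pages, page 5000 starts at byte 5,242,880,000, which is beyond 2^32. The naive 32-bit offset for byte 64 of that page wraps to 947,912,768, while the rebased form keeps the full address in `base` and stores only 64 in `offset`.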
Yi DING	e454358113	[CK_TILE] FMHA BWD Launcher Interface (#4577 ) ## Motivation Reduce memory usage; Be prepared to implement optimizations of reducing nsplits in deterministic cases. ## Technical Details This PR introduces a new launcher interface for the FMHA backward operation, replacing direct function calls with a more structured approach. The launcher encapsulates kernel dispatch logic and provides access to computed metadata like the number of dQ acc splits. Changes: - Added `fmha_bwd_launcher` class that wraps kernel execution and exposes `dq_acc_splits` - Moved `fmha_bwd_traits` construction earlier in the execution flow to support launcher initialization - Refactored code generation to produce both legacy API and new launcher constructor ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-04 09:20:06 +08:00
assistant-librarian[bot]	1c2b9a21c4	Cleanup and refactoring related to tile loading (#4294 ) ## Proposed changes Cleanup and refactoring done while implementing mixed precision for fp16/bf16 x fp8 Key changes: - Renamed load_interleaved_pk_type.hpp to load_and_convert_tile.hpp and refactored the API to use consistent naming conventions - Updated load_tile_transpose functions to use output parameters instead of return values for consistency - Removed unused variable declarations and simplified type deduction logic - Define load_tile_with_elementwise to use tuple types explicitly for clarity ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [X] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3505](https://github.com/ROCm/composable_kernel/pull/3505) 🧑‍💻 Originally authored by @SamiAario-AMD --------- Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-03-02 12:20:55 +00:00
Linjun-AMD	25560c26fb	[CK] Fix gptoss sink (#4313 ) ## Motivation This PR removes conditional logic for handling infinity values in the sink mechanism across multiple FMHA pipeline implementations, defaulting sink_size to 0 and adding a constraint in the kernel selection logic. ## Technical Details Changes: Removed __builtin_isinf_sign(sink_v) checks and conditional initialization of LSE accumulators across 7 pipeline files Added default initialization (= 0) for sink_size in 4 argument structs Added F_sink == "f" constraint to kernel compatibility checking ## Test Plan Local test ## Test Result passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2026-03-02 09:53:52 +08:00
Anton Gorenko	ce6acc5f66	[CK_TILE][FMHA] Support gfx11 (#4584 ) ## Motivation Add support of gfx11 architectures (RDNA3) to FMHA. ## Technical Details Distributions (matrix elements to lane registers mapping) of gfx11 WMMA are completely different from distributions of gfx9 MFMA and gfx12 WMMA. There are two cases in FMHA where this difference matters: * usage of results (matrix C) of one GEMM as input (matrix A) of another GEMM. * random number generation for dropout (implementation for gfx9 MFMA, gfx12 WMMA and host validation produce the same results). Both cases are solved by a special remapping implemented using `__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`. Additional changes: * FMHA tests are now build and run only for those types for which instances exist (gfx11 supports only fp16 and bf16). * Two fixes for uninitialized values (`mask.sink` and `do_fp8_static_quant`): they may contain garbage resulting in incorrect dispatching logic, sometimes tests report that there are no instance available for current parameters. * Small fix to remove expcnt(0) from s_waitcnt instruction on gfx11 when they are not requested (i.e. every time), likely has no effect on performance but makes disassembly a bit clearer. ## Test Plan ``` ninja test_ck_tile_fmha bin/test_ck_tile_fmha_fwd_fp16 bin/test_ck_tile_fmha_fwd_bf16 bin/test_ck_tile_fmha_bwd_fp16 bin/test_ck_tile_fmha_bwd_bf16 ``` ## Test Result All tests must pass (some tests may be skipped). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-02-20 17:15:10 -08:00
assistant-librarian[bot]	9c0d4114ae	[CK] Add FP8 KV_BLOCKSCALE support for batch prefill (#4263 ) Implement per-page K/V quantization for paged attention: - Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum - Use exp2 shift trick to eliminate explicit P scaling overhead - Prefetch physical pages offset for KV cache, overlaps with computations ## Proposed changes Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3696](https://github.com/ROCm/composable_kernel/pull/3696) 🧑‍💻 Originally authored by @Jeff-Huang --------- Co-authored-by: Jeff Huang <chiachi.huang@amd.com> Co-authored-by: Illia Silin <Illia.Silin@amd.com>	2026-02-04 18:25:31 -05:00
Jeff Huang	29c56b8aae	Optimize batch prefill kernel performance for VECTORIZED_LAYOUT KV cache (#3657 ) - Add multi-dimensional page index support (YsGatherDims) in tile_scatter_gather - Add is_gather_dim() and get_gather_index() for multi-dim page lookup - Override MakeVDramTileDistribution() for VECTORIZED_LAYOUT to match GEMM's BWarpDstrEncoding (K decomposition: {K2, K0, K1}) - Add GetGemmKDecomposition() to retrieve kABKLane and kKPerThread - Add static_assert for RowMajor VLayout requirement in batch prefill Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `e3556fed04`]	2026-01-29 07:18:41 +08:00
ltqin	90b3476006	Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 )" (#3635 ) This reverts commit 723b7ce0be2884da131036301892bf9157f51876. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `67f0b74ec6`]	2026-01-23 09:03:22 -08:00
Po Yen Chen	4ded7e5984	Revert " Fp8 block scale quantization for fmha fwd (#3330 )" (#3633 ) This reverts commit ceccf15275645cc64db0a4ae53f5a215c93a7969. [ROCm/composable_kernel commit: `de5a1d730d`]	2026-01-22 21:21:19 -08:00
Linjun-AMD	f6fac4cea6	[CK_TILE][FMHA]Add new tile size for async (#3623 ) * Revert "Revert "[CK_TILE][FMHA] Add new tile size for async (#3586)" (#3613)" This reverts commit cfdad49edda4b2ccef92571f23646a8505bb2859. * Add new tile_size for async pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `0b13697a88`]	2026-01-22 16:07:14 +08:00
ltqin	14254656f0	Fp8 block scale quantization for fmha fwd (#3330 ) * add block scale parameters to kernel * add block scale to kernel * add smoke test * format * Revert "format" This reverts commit `356c3c9706`. * only format my code * format py * fix auto not allowd in function prototype * change instance tttt to ttff * fix structured binding issue * change s_acc elementwise op * async pipeline add block scale * add quantation P using shift exp2 * precompute (m - shift) once per row * change blk scale seqstrt ptr name * fix some name * fix for deduction guide * fix some comments * add P scale to qr_ksvs_pipeline * add comment to idx_identity * change the method of calculating descale block index * unify naming style: use block_scale_ as name prefix * unify naming style * update the CHANGELOG.md * Add FP8 block scale quantization support for FMHA forward kernel --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `dd0b4294af`]	2026-01-21 20:58:26 -08:00
Yi DING	a0935f7669	[CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615 ) [ROCm/composable_kernel commit: `fcc9372c00`]	2026-01-21 09:54:46 +08:00
Linjun-AMD	e227e837be	Revert "[CK_TILE][FMHA] Add new tile size for async (#3586 )" (#3613 ) This reverts commit 217ac48fd83deef3d0d5084815689e8c79958cc1. [ROCm/composable_kernel commit: `8f75869408`]	2026-01-20 09:40:54 -08:00
Linjun-AMD	ecda0fe2e9	[CK_TILE][FMHA] Add new tile size for async (#3586 ) * add new tile size for async Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix lse error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `f3aafb9555`]	2026-01-19 15:22:33 -08:00
Jeff Huang	445ec888ba	[FMHA] Enable page size 16 for batch prefill kernel (#3568 ) * [FMHA] Enable page size 16 for batch prefill kernel * Refactor batch prefill KV offset logic to simplify template arguments - Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args. - Add static assert to forbid `page_size=1` with vectorized layout. [ROCm/composable_kernel commit: `993d3e2f0e`]	2026-01-15 22:11:44 +08:00
Linjun-AMD	75ea587550	[CK_TILE][FMHA] Enable gpt-oss sink (#3490 ) * Enable gptoss sink Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * add gptoss sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update CHANGELOG.md Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix test args error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update test_fmha_fwd.cpp * update sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Revert "update sink test" This reverts commit `970b4f1686`. * update sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update valid sink_v in splitkv pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update example_fmha_fwd.cpp * fix lse error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix clangformat error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix aiter scale error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_pipeline_qr_ks_vs.hpp * div scale_s for sink_value Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update fmha_fwd_runner.hpp * update sink_value with bias Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Fix typo in dropout parameter in fmha_batch_prefill_kernel * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update example_fmha_fwd.cpp * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async_trload.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * 
optimized some code Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix splitkv error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update sink reference Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update fmha_fwd_runner.hpp * Update smoke_test_fwd_sink.sh --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `717ed0b59f`]	2026-01-14 21:32:06 +08:00
Thomas Ning	0c8c232a0a	Shuffle fix for gfx950 (#3491 ) * solve compiler issue * solve the gfx950 mfma shuffle regression * refactor jenkinsfile to handle arch name better * [CK TILE] set divisor to count of thread along k dimension * fix the compiler error * solve degradation * Finish the multiplies fix * fix the scales * solve compilation error * solve the composes * solve the error of tile sweeper * fix the test and example * fix for gfx950 --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Cong Ma <congma13@amd.com> [ROCm/composable_kernel commit: `00c46785a8`]	2026-01-13 09:21:29 -08:00
Jeff Huang	0d13ef7329	[CK Tile] Fix FMHA LSE calculation and potential division by zero (#3326 ) This commit addresses numerical stability issues in the BlockFmhaPipelineQRKSVS pipeline when bias has -inf masking values: 1. Explicitly handle the case where the accumulated exponential sum (l) is zero. In this case, the LSE is now correctly set to negative infinity, preventing log(0) errors. 2. Extend the zero-check protection in the normalization step to cover the ELEMENTWISE_BIAS case, preventing potential division by zero. [ROCm/composable_kernel commit: `141f77aa12`]	2026-01-13 13:52:26 +08:00
Jeff Huang	99b88be5fb	[FMHA] Support page_size=1 (linear layout) in batch prefill pipeline (#3545 ) - Enable page_size=1 support in batch prefill codegen (linear layout only). - Implement per-token page lookup in `kv_offset_array_transform` for page_size=1 to handle 3D input tensors correctly. - Relax `kPageBlockSize` alignment assertion for the page_size=1 case. [ROCm/composable_kernel commit: `c9f112b026`]	2026-01-13 12:04:43 +08:00
Jeff Huang	fd84daec4c	[FMHA] Batch Prefill Support Improvements: Change KV Cache Layout & Large Page Size Support (#3442 ) * add page_block_size parameter * add is_sglang_layout to parameters * add kv_offset_array_transform to batch async for page size 16 * add kv_last_page_lens to kernel * change kv layout to [num_total_pages, page_block_size, hdim] * format * - enable codegen of batch_prefill kernels - create new problem struct BlockFmhaBatchPrefillPipelineProblem for batch prefill kernels - generate different page sizes of batch prefill kernels (1, 16) * 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950 2. support page size 1024 * fix python format * change kv cache layout to [num_blocks, num_kv_heads, head_size/x, block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X] * 1. Introduced `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hardcode values 2. Makes batch prefill kernel traits structures inherent from fmha fwd traits 3. Add some static check for Page size, vector size, hdim, ..., etc. * [Refactor] Replace is_sglang_layout with Enums for KV cache configuration Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single boolean. Changes: * Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`. * Updated Kernel, Pipeline, and Traits to template on these Enums. * Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`. * Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`. * Updated CodeGen scripts to support new parameters. This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations. * 1. remove batch prefill pipeline with sk_pad=false 2. correct some comments 3. add static assert to make sure v offsets is in same page within a tile. 
* fix vgpr spill count * remove unnecessary t2s functions * add fp8 support for receipt 200 and 600 in fmha_bath_prefill.py * support linear kv cache layout * Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse kv_page_indices as a pointer of the lookup table. * 1. merge multiple transforms into single transform. 2. add static check to make sure vlayout is row-major. * move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs. * update changelog --------- Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `cc75a1dc5f`]	2026-01-05 18:41:47 +08:00
Po Yen Chen	a2402950de	[CK_TILE][FMHA] Add FP8 support for batch_prefill kernel (#3425 ) * Add fp8bf16 support for batch_prefill * Fix wrong scale_s re-compute logic in batch_prefill * Fix wrong scale_s re-compute logic in fmha fwd * Fix batch_prefill codegen error * Remove no-longer used GetName() function * Add fp8 logits=True instances * Update CHANGELOG.md [ROCm/composable_kernel commit: `1c3151963b`]	2025-12-24 10:34:06 +08:00
Po Yen Chen	97556d24f2	[CK_TILE][FMHA] Add logits soft-capping support for FAv3 (WIP) (#3355 ) * Let fmha_fwd_v3() compatible with fmha_fwd() * Decouple get_fwd_blobs() and FmhaFwdKernel * Decouple compatibility checks from get_fwd_blobs() * Extract product feature checks out from get_fwd_blobs() * Remove duplicated code in factories and redundant checks * Remove FmhaFwdKernel<>::GetName() * Let FmhaFwdApiPool support pipelines with different mask_impl * Add tile setting for fmha fwd v3 pipeline * Add fwd v3 instances to tile_example_fmha_fwd manually * Remove unused function import * Undo irrelevant changes * Remove fwd v3 instances from tile_example_fmha_fwd * Finish fmha fwd v3 kernel instance codegen * Fix formatting * Remove unused F_idx attribute * Add is_generic_attention_mask<> traits * Add constraints to the fmha fwd v3 pipeline * Unify traits & problem used for fmha fwd v3 * Unify kernel launch code for fmha fwd v2 & v3 * Unify kernel template selection logic * Use same kernel codegen template for both v2 & v3 * Rename api() property as render() method * Allow specifying filter for fmha fwd api pool * Allow specifying function name when rendering api pool items * Separate fmha fwd v3 kernel dispatching logic from v2 * Remove lambda assignment * Add simple v2/v3 dispatch logic * Stop generating empty if-clauses Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them. 
* Use "".join() to concatenate fmha fwd api string content * Add more feature checks for fmha fwd v3 pipeline * Check features before dispatch to fmha_fwd_v3() * Add more feature checks for fmha_fwd_v3() * Add missing filter call * Use Tuple to reserve the dtype orders * Fix wrong pipeline matching logic * Add fmha fwd v3 group mode instances * Add functor_transform<> * Add type constraints to make_tile_window() * Remove fmha fwd v3 example * Fix wrong product(aiter mha_fwd()) config * Fix wrong fmha fwd v2/v3 selection logic * Fix formatting * Add comment to warning v3 kernel users * Fix wrong codegen logics * Remove unnecessary param * Fix format * Add logits soft-capping support for fmha fwd v3 pipeline (WIP) * Add missing Kargs base type --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `bfac64953f`]	2025-12-18 16:08:45 +08:00
ltqin	c8397e8ef2	flashattention fwd add (80, 96) instance (#3415 ) * add hdim (96,96) instance * change to (80,96) * format py * remove 96 in optdim * when N=6 change to llvm_amdgcn_raw_buffer_load_i32x3 [ROCm/composable_kernel commit: `92653168c2`]	2025-12-17 09:16:11 -08:00
Linjun-AMD	51886bf22b	Add attention sink support for FMHA FWD (#3368 ) * Revert "Revert "Add attn sink (#2892)" (#3250)" This reverts commit e3be392d13e6ee107d823af32aca2d3ff03ca69d. * fix conflict Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Add F_sink parameter to FmhaFwdPipeline * Update tile_fmha_traits.hpp * Refactor pipeline creation in fmha_fwd.py Updated the pipeline creation logic to include 'sink' parameter in product combinations and adjusted the FmhaFwdPipeline calls accordingly. * Update fmha_fwd.py * Update fmha_fwd.py * Update example/ck_tile/01_fmha/script/correct_test_fwd_sink.sh Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * update CHANGELOG.md Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update CHANGELOG with new features and support * Update fmha_fwd.hpp * Update CHANGELOG.md * Update smoke_test_fwd_sink.sh * Update correct_test_fwd_sink.sh * Update smoke_test_fwd_sink.sh --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `f5573f56d9`]	2025-12-15 12:21:59 +08:00
Anton Gorenko	9cb42b092a	Add a workaround for a compiler issue for bwd on gfx90a and ROCm 7.1.1 (#3369 ) Sometimes there are not enough wait-states between v_mfma_f32... and v_accvgpr_read_b32 instructions if they are separated by s_cbranch. The workaround is to read accvgprs to vgpr before branching. [ROCm/composable_kernel commit: `ca6143f0b2`]	2025-12-08 07:44:17 -08:00
Po Yen Chen	d96f632fa1	[CK_TILE][FMHA] Integrate FAv2 & FAv3 (WIP) in the single fmha_fwd() API (#3153 ) * Let fmha_fwd_v3() compatible with fmha_fwd() * Decouple get_fwd_blobs() and FmhaFwdKernel * Decouple compatibility checks from get_fwd_blobs() * Extract product feature checks out from get_fwd_blobs() * Remove duplicated code in factories and redundant checks * Remove FmhaFwdKernel<>::GetName() * Let FmhaFwdApiPool support pipelines with different mask_impl * Add tile setting for fmha fwd v3 pipeline * Add fwd v3 instances to tile_example_fmha_fwd manually * Remove unused function import * Undo irrelevant changes * Remove fwd v3 instances from tile_example_fmha_fwd * Finish fmha fwd v3 kernel instance codegen * Fix formatting * Remove unused F_idx attribute * Add is_generic_attention_mask<> traits * Add constraints to the fmha fwd v3 pipeline * Unify traits & problem used for fmha fwd v3 * Unify kernel launch code for fmha fwd v2 & v3 * Unify kernel template selection logic * Use same kernel codegen template for both v2 & v3 * Rename api() property as render() method * Allow specifying filter for fmha fwd api pool * Allow specifying function name when rendering api pool items * Separate fmha fwd v3 kernel dispatching logic from v2 * Remove lambda assignment * Add simple v2/v3 dispatch logic * Stop generating empty if-clauses Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them. 
* Use "".join() to concatenate fmha fwd api string content * Add more feature checks for fmha fwd v3 pipeline * Check features before dispatch to fmha_fwd_v3() * Add more feature checks for fmha_fwd_v3() * Add missing filter call * Use Tuple to reserve the dtype orders * Fix wrong pipeline matching logic * Add fmha fwd v3 group mode instances * Add functor_transform<> * Add type constraints to make_tile_window() * Remove fmha fwd v3 example * Fix wrong product(aiter mha_fwd()) config * Fix wrong fmha fwd v2/v3 selection logic * Fix formatting * Add comment to warning v3 kernel users * Fix wrong codegen logics * Remove unnecessary param * Fix format --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `05292b3604`]	2025-12-05 10:31:12 +08:00
rocking	228b1e8d87	fp8 fmha async pipeline (#3339 ) * replace qr with async pipeline * Add fp8fp32 to DTYPE_BITS * Add kAlignmentRandVal to avoid compile fail * format --------- Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `eb7f617713`]	2025-12-04 12:18:25 +08:00
Aviral Goel	216c23b945	chore(copyright): update copyright header for include directory (#3293 ) [ROCm/composable_kernel commit: `de6466481f`]	2025-11-26 11:00:05 -07:00
rocking	f20f9dd453	Fix batch prefill compile fail in aiter (#3279 ) * Fix batch prefill aiter compile fail * Fix compile error [ROCm/composable_kernel commit: `229d43ea0c`]	2025-11-25 09:46:32 +08:00
Qianfeng	3b341e4a16	Fix a bug for qr_ks_vs_async_trload pipeline (#3271 ) [ROCm/composable_kernel commit: `81042ea574`]	2025-11-24 21:31:48 +08:00
rocking	cdd72e57d3	Support fp8 dynamic quantization for fmha (#3206 ) * Support qscale for dynamic quant, remove static quant * Support hdim=256 * Remove bias test case for fp8 --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> [ROCm/composable_kernel commit: `5948dbffe4`]	2025-11-24 16:28:25 +08:00
Yi DING	ac4f4ffb79	[CK_TILE] Refine FP32 => FP16/BF16 Conversion (#3215 ) * [CK_TILE] Refine FP32 => FP16/BF16 Conversion * Thank you Copilot * Rename fix * Fix example * Fix accu checking * Fix * Fix [ROCm/composable_kernel commit: `8b284a63a4`]	2025-11-20 10:50:26 -08:00

183 Commits