Commit Graph

1584 Commits

Ding, Yi
422b6d6c16 [CK_TILE] Fix FMHA BWD workspace upper-bound undersizing in group mode
GetWorkspaceDeviceSizeUpperBound was computing
  max_batch * nhead_q * max_seqlen_q * hdim_q
in non-deterministic group mode, but PrepareWorkspaceHost actually returns
  nhead_q * seqstart_q[batch] * hdim_q
i.e. it scales with the sum of *padded* per-batch seqlen_q, not max_batch
times the *logical* max. When per-batch padding makes seqstart_q[batch]
exceed max_batch * max_seqlen_q, the launcher under-allocates dq_acc, the
kernel writes past the buffer, and tests see either ~42% wrong QGrad
values or a GPU page fault (e.g. test_ck_tile_fmha_bwd_bf16
QKVPadding/23,24,26 corrupt; /27 page-faults).

Fix: replace the (max_batch, max_seqlen_q) pair with a single
total_seqlen_q_padded parameter holding the true total padded q tokens.
Launcher derives it from the trait (group: t.seqlen_q already is the
padded total; batch: t.batch * t.seqlen_q). The four mode formulas
collapse to one:
  size = nhead_q * nsplits_factor * total_seqlen_q_padded * hdim_q
where nsplits_factor is 1 for non-deterministic, ceil(max_seqlen_k / kN0)
for deterministic group, and the persistent worker computation for
deterministic non-group (the only branch that still needs max_batch).

No caller-side API change: FA, AITER and the CK runner already pass
q.shape[0] (the padded total) as traits.seqlen_q in group mode.

Verified on gfx1201: full test_ck_tile_fmha_bwd_{bf16,fp16} 672/672 PASS,
0 fail, 0 crash (was 27/28 QKVPadding fails + 1 GPU illegal access).
2026-05-13 02:20:09 -04:00
Ding, Yi
1f4cc34e68 [CK_TILE] FMHA BWD launcher: address PR #7331 review comments 2026-05-12 02:46:41 -05:00
Ding, Yi
d434410e52 [CK_TILE] Add async workspace prepare to FMHA BWD launcher 2026-05-12 01:59:15 -05:00
Illia Silin
7cc0abc9ba [CK] Fix latest batch of staging compiler warnings (#7111)
## Motivation

Suppress the new batch of clang lifetimebound and invalidation warnings
with the latest staging compiler.


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-08 07:14:14 -07:00
Yi DING
26993f7d72 [CK_TILE] Fix typo in fmha_fwd_kernel K-dram unmerge tuple sizes (#7141)
## Summary

The qr_async_trload K-dram lambda's `else (XorLengthFold == 1)` branch
in `fmha_fwd_kernel.hpp` writes the outer-tile dim of its 3-tuple
unmerge/xor/merge as

```cpp
number<FmhaPipeline::kQKHeaddim / kDramTileK / FmhaPipeline::kAlignmentK>{}
```

which divides one extra time. For every fp16/bf16 hdim=128 configuration
the outer length collapses to **0**, e.g. `128 / 128 / 8 == 0`. The
3-tuple product no longer equals `kQKHeaddim`, so unmerge → xor → merge
stops round-tripping the head dimension.

This bug was masked by the async-load path: it only walks the descriptor
via stride and silently absorbs a length=0 outer dim. Any consumer that
actually traverses the descriptor (e.g. the TDM path on gfx1250)
immediately faults on the resulting `tuple<int, constant<0>>`.

The fix drops the extra `/ kAlignmentK` in all three call sites in the
same lambda so the outer dim becomes `kQKHeaddim / kDramTileK` and the
product is restored to `kQKHeaddim`. Strides are unaffected, so the
async path is bit-identical.
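
For reference, a minimal before/after sketch of the change at each of the three call sites:

```cpp
// before (typo): divides one extra time, collapsing the outer dim to 0 for hdim=128
number<FmhaPipeline::kQKHeaddim / kDramTileK / FmhaPipeline::kAlignmentK>{}
// after (fix): outer dim is kQKHeaddim / kDramTileK, product restored to kQKHeaddim
number<FmhaPipeline::kQKHeaddim / kDramTileK>{}
```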

| Config (fp16/bf16) | hdim | kDramTileK | kAlignmentK | a (typo) | a (fixed) | product (typo) | product (fixed) |
|---|---|---|---|---|---|---|---|
| hdim128, kKLoadOnce | 128 | 128 | 8 | 0 | 1 | **0** | **128** |
| hdim128, kK0=32 | 128 | 32 | 8 | 0 | 4 | **0** | **128** |
| hdim64, kKLoadOnce | 64 | 64 | 8 | 0 | 1 | **0** | **64** |
| hdim256, kK0=32 | 256 | 32 | 8 | 1 | 8 | **32** | **256** |

Bug introduced in 2cc0af6a815a (PR #2888 "[CK_TILE] FMHA FWD bug
fix"), where the original 2-tuple unmerge was generalized to a 3-tuple
and the typo slipped in.

## Test plan

- [x] Built `test_ck_tile_fmha_fwd` (umbrella, 5 gtest binaries) on
gfx950 native at develop b3bdc63a509 with `dev-gfx950` preset (clang 22,
ROCm 7.2.2). Compiles cleanly with `-Werror -Weverything`.
- [x] Ran `ctest -R test_ck_tile_fmha_fwd` on gfx950 native, baseline vs
patched: identical pass/fail (3 pass / 2 fail), identical failing case
set (114 gtest fails + 2 GPU memory access faults, all in pre-existing
fp16/bf16 group-mode `Alibi`/`Dropout` cases that reproduce on develop
without this patch). Total wall time 403s → 393s. Per-case latency drift
±8% (noise).
- [x] CI to verify on other gfx9 / gfx11 architectures.
2026-05-08 16:50:40 +08:00
Linjun-AMD
f17537b7c2 [CK] add swiglustep_and_mul activation to gridwise_moe_gemm (#6873)
  ## Motivation

Step-3.5-Flash uses a clamped SwiGLU activation (`swiglu_limits[43]=7`,
  `swiglu_limits[44]=7`) for layers 43 and 44. Without this kernel path,
  those layers produce BOS token spam because unclamped gate/up values
  accumulate floating-point noise over 200+ decode steps, degrading
  output quality (cosine similarity drops from 0.999989 to ~0.998982).

  ## Changes

  Add `swiglustep_and_mul` as a new `Activation` enum branch in
  `gridwise_moe_gemm.hpp`, covering all 4 code paths:
  - Quantized (A×B scale) + IsInputGemm=true
  - Quantized (A×B scale) + IsInputGemm=false
  - Non-quantized + IsInputGemm=true
  - Non-quantized + IsInputGemm=false

The activation computes:

```
gate   = silu(gate)
gate   = clamp(gate, max=7.0f)
up     = clamp(up,   min=-7.0f, max=7.0f)
output = gate * up
```

Also handles the `MulRoutedWeight` case (topk weight multiplication) and
  `pk_i4_t` weight scaling (×16 dequant factor).
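
As a rough illustration, a scalar C++ sketch of the clamped-SwiGLU math above (illustrative names only, not the CK functor API):

```cpp
#include <algorithm>
#include <cmath>

// Scalar reference for swiglustep_and_mul; `limit` plays the role of the
// swiglu_limits value (7.0f for layers 43/44).
inline float swiglustep_and_mul_ref(float gate, float up, float limit = 7.0f)
{
    const float s = gate / (1.0f + std::exp(-gate));  // gate = silu(gate)
    const float g = std::min(s, limit);               // clamp(gate, max=limit)
    const float u = std::clamp(up, -limit, limit);    // clamp(up, -limit, limit)
    return g * u;                                     // output = gate * up
}
```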

  ## Verification

  - Tested on gfx950 (MI350X, 8×GPU)
- cosine similarity for layers 43/44: **0.999989** (vs 0.998982 before
fix)
  - End-to-end Step-3.5-Flash inference: no BOS spam, output coherent
  - BF16 tp=2/tp=4 and FP8 tp=2/tp=4 all verified PASS
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-07 05:59:47 +00:00
Linjun-AMD
1cf336d87a [CK_TILE][FMHA] Fix sink un-mask under right-window and emit fp8bf16 batch_prefill sink kernels (#6914)
## Summary

Two related fixes to `ck_tile` FMHA so that StreamLLM-sink +
sliding-window batch-prefill works correctly for fp8 KV / bf16 compute.

  Review the commits in this order:

  1. `fmha: emit sink kernels for fp8bf16 batch_prefill`
Extends `example/ck_tile/01_fmha/codegen/ops/fmha_batch_prefill.py` so
     the fp8(KV) / bf16(QO) batch-prefill codegen also emits the
`mask=mask_enum::generic_with_sink` variant. Without this the runtime
     could not dispatch to a sink-aware kernel for the fp8bf16 path.

  2. `fmha: respect right-window in IsOutOfSinkBound`
     The sink un-mask in `GenericAttentionMask::IsOutOfSinkBound` (local-mask
     branch) used `(i_y + x) > 1` as the gate, which conditioned on the row
     index instead of the column index. As a result, queries `1..sink-1`
     could attend to *future* sink positions (violating the causal /
     right-window constraint), while query `0` fell back to the plain causal
     mask. The fix replaces the guard with `i_x < i_y + x` so every query
     only sees sink columns up to its own right-window boundary.

  3. `fmha: clarify IsOutOfSinkBound predicate comment`
     Doc-only follow-up that rewrites the comment above the predicate as a
     clause-by-clause explanation (`i_x < sink`, `i_x < i_y + x`,
     `y < y_total`, `i_y < x_total`).
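
A minimal sketch of the fixed predicate, assuming the variable names from the clause-by-clause comment (`i_x`: column index, `i_y`: row index, `x`: right-window extent):

```cpp
// Returns true when (i_x, i_y) is outside the sink un-mask region: a sink
// column is visible only if all four clauses hold, the second being the
// fixed column-based guard.
bool IsOutOfSinkBound(int i_x, int i_y, int x, int sink, int x_total, int y_total, int y)
{
    const bool visible = (i_x < sink)      // column is a sink token
                      && (i_x < i_y + x)   // within this query's right-window
                      && (y < y_total)     // tile-range guards
                      && (i_y < x_total);
    return !visible;
}
```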

  ## Test plan

- [x] Repro on aiter `op_tests/test_batch_prefill.py` (fp8 + bf16_dequant
      modes with `sink=4`, `win_left=1023`, `softcap=0.0`, `sal=True`)
      now passes for all parametrized shapes.
- [x] Existing fp16/bf16 batch-prefill paths (no sink) unchanged — the
      codegen diff only adds the `generic_with_sink` variant for fp8bf16;
      existing kernel object lists are unaffected.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: fengjunda.aml <fengjunda.aml@bytedance.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: root <root@smci350-rck-g03-f12-31.rck.dcgpu>
2026-05-07 10:39:55 +08:00
Yi DING
0398b864c3 [CK_TILE] Use Unified Workspace for FMHA BWD (#6152)
## Motivation
`dq_acc` is the intermediate accumulation buffer used in FMHA backward
pass for deterministic mode. The current implementation allocates it as
a **single rectangular tensor**:

```
shape = [shape_batch, nhead, nsplits, shape_seqlen_q, hdim_q]
```

where `nsplits = launcher.dq_acc_splits` (a single scalar), computed
from `max_seqlen_k` and shared across all batches.

### Problems

1. **Memory waste**: In group mode, each batch may have a different
`seqlen_k`, but `nsplits` is computed from `max_seqlen_k`, causing
batches with shorter `seqlen_k` to over-allocate in the split dimension.

2. **Interface coupling**: `fmha_bwd_args` exposes internal layout
details such as `stride_dq_acc`, `nhead_stride_dq_acc`,
`batch_stride_dq_acc`, and `split_stride_dq_acc`. The caller is
responsible for computing these strides, but this logic belongs inside
the kernel.

### Goals

1. Switch the `dq_acc` buffer to a **compact layout**: batches are
concatenated contiguously, with each batch occupying `nhead * nsplits_i
* seqq_i * hdim_q` elements (nhead outermost).
2. **Remove all `*_stride_dq_acc` fields** from `fmha_bwd_args`,
replacing them with a single `workspace_ptr`; the kernel splits this
internally using a fixed layout.
3. `fmha_bwd_launcher` provides a **workspace management interface**:
the caller only needs to allocate GPU memory and call
`prepare_workspace()` — no layout computation required.
4. **Isolate kernel internals from the caller API**: the `dq_acc` layout
(nsplits, strides, buffer size) is determined entirely inside the
launcher/kernel. Future changes to block shape, pipeline type, or
persistent kernel strategy require no modifications to the caller's
`fmha_bwd_args` or workspace allocation logic.

## Technical Details

### Interface Design

#### New fields in `fmha_bwd_traits`

```cpp
struct fmha_bwd_traits
{
    int seqlen_q;
    int seqlen_k;
    int batch;
    int max_seqlen_q;
    int max_seqlen_k;
    int hdim_q;
    int hdim_v;
    int nhead_q;
    int nhead_k;
    std::string data_type;
    bool is_group_mode;
    mask_enum mask_type;
    bias_enum bias_type;
    bool has_dbias;
    bool has_dropout;
    bool is_store_randval;
    bool is_deterministic;
    // New: cumulative physical seqlen pointers for group mode (pass nullptr for batch mode).
    // seqstart_qs[i+1] - seqstart_qs[i] = physical seqlen_q of batch i (including padding); length = batch+1
    // seqstart_ks[i+1] - seqstart_ks[i] = physical seqlen_k of batch i (including padding); length = batch+1
    const int* seqstart_qs = nullptr;
    const int* seqstart_ks = nullptr;
};
```

#### `fmha_bwd_launcher` actual structure

```cpp
struct fmha_bwd_launcher
{
    std::function<float(fmha_bwd_args, const ck_tile::stream_config&)> run{};

    // Total workspace size in bytes (host_ws_size + device_ws_size), computed by init().
    // Zero for kUseQrQtrDorPipeline (writes dq directly, no acc buffer needed).
    size_t workspace_size = 0;

    fmha_bwd_launcher(const fmha_bwd_traits&);

    // Copies auxiliary data (nsplits[], offsets[]) via hipMemcpy to the head of the GPU workspace,
    // and zeros the dq_acc buffer portion (tail of workspace) if required.
    // The memory pointed to by device_ws must be >= workspace_size bytes.
    std::function<void(void* device_ws)> prepare_workspace{};

    template <typename... Args>
    float operator()(Args&&... args) const { return run(std::forward<Args>(args)...); }

private:
    size_t host_ws_size   = 0;  // CPU workspace size (nsplits[] + offsets[] arrays)
    size_t device_ws_size = 0;  // GPU-only data size (dq_acc buffer)
    std::unique_ptr<char[]> ws_host;  // host-side workspace buffer

public:
    template <typename T0, typename T1, typename T2, typename Arch>
    void init(const fmha_bwd_traits& traits);
};
```

The `init<>()` template method (invoked by codegen dispatch branches as
`this->init<...>(t)`) is responsible for:
1. Setting the `run` lambda
2. Calling `FmhaBwdDQDKDVKernel::GetWorkspaceHostSize(batch)` to obtain
`host_ws_size`
3. Allocating `ws_host` (host memory)
4. Calling `FmhaBwdDQDKDVKernel::PrepareWorkspaceHost(ws_host.get(),
...)` to fill nsplits/offsets; return value is `device_ws_size`
5. `workspace_size = host_ws_size + device_ws_size`
6. Setting the `prepare_workspace` lambda (captures `this`, calls
`PrepareWorkspaceDevice`)

When no kernel matches the given traits, both `run` and
`prepare_workspace` are initialized to default lambdas that print a
warning to `std::cerr` and return gracefully (no exception).

#### Workspace overall layout

The workspace is managed by `FmhaBwdWorkspaceManager` and consists of
two segments:

```
Offset 0 (CPU-prepared segment, host_ws_size bytes; also hipMemcpy'd to the head of GPU workspace):
  index_t nsplits[batch or 1]       — per-batch nsplits array
                                      group mode: batch elements
                                      batch mode / non-deterministic: 1 element
  [group mode only] long_index_t dq_acc_offsets[batch+1]
                                    — per-batch element offset (inclusive prefix sum)
                                      offsets[0]=0, offsets[i+1] = offsets[i] + nhead*nsplits_i*seqq_i*hdim_q

Offset host_ws_size (device data segment, device_ws_size bytes):
  AccDataType dq_acc[total_elements] — compact dq_acc buffer (zeroed if required)
                                       total_elements = sum_i(nhead * nsplits_i * seqq_i * hdim_q)
                                       layout within each batch: [nhead, nsplits_i, seqq_i, hdim_q]
                                       note: seqq_i uses the physical length (including padding)
```

Alignment constant (`ALIGNMENT = 16`):
```
nsplits_size  = align_up(sizeof(index_t) * N, 16)          // N = batch (group) or 1 (batch/non-det)
offsets_size  = align_up(sizeof(long_index_t) * (batch+1), 16)  // group mode only
host_ws_size  = nsplits_size + offsets_size
dq_acc_offset = host_ws_size  // GetDqAccDataOffset(batch)
```

**Key benefits**:
- The kernel reads nsplits/offsets directly from the workspace head — no
device-side recomputation.
- `FmhaBwdConvertQGradKernel` is completely decoupled from the pipeline
block shape (`kN0`): nsplits is read from `nsplits_ptr`, `kN0` is no
longer a template parameter, and multiple dq_dk_dv tiles with different
`F_bn0` values now share a single convert_dq kernel instance (under
receipt 1/2, deterministic convert_dq kernel count drops from ~300 to
60).
- nsplits/offsets are computed on the host and transferred in one
`hipMemcpy`; the dq_acc buffer follows immediately, at the offset given
by `GetDqAccDataOffset`.

#### Workspace size by scenario

| Scenario | `workspace_size` | Notes |
|----------|-----------------|-------|
| **kUseQrQtrDorPipeline** (any mode) | `0` | Writes dq directly; no acc buffer; `PrepareWorkspaceHost` returns 0 |
| **Non-deterministic + batch mode** | `> 0` | nsplits[1]=1; dq_acc used for atomic add; `workspace_size = host_ws_size + batch*nhead*seqlen_q*hdim_q*ebytes` |
| **Non-deterministic + group mode** | `> 0` | nsplits[1]=1; dq_acc contiguous layout; `workspace_size = host_ws_size + nhead*seqstart_qs[batch]*hdim_q*ebytes` |
| **Deterministic + group mode** | `> 0` | nsplits[batch], offsets[batch+1], compact dq_acc; nsplits_i computed independently per batch |
| **Deterministic + batch mode persistent** | `> 0` | nsplits[1] (uniform across batches); dq_acc `batch*nhead*nsplits*seqlen_q*hdim_q` |

**NeedsZeroDqAcc** (determines whether `PrepareWorkspaceDevice` calls
`hipMemset`; sketched after this list):
- Persistent kernel (deterministic batch mode) or non-deterministic:
**must zero** (atomic add requires zero initialization)
- Deterministic group mode + no mask: **no zeroing needed** (every tile
writes its full region)
- Deterministic + with mask: **must zero** (some blocks are skipped,
leaving uninitialized tiles that would contribute to the reduction)
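
A minimal sketch of this rule, with assumed parameter names:

```cpp
// true => PrepareWorkspaceDevice must hipMemset the dq_acc segment to zero
constexpr bool NeedsZeroDqAcc(bool is_deterministic, bool is_persistent, bool has_mask)
{
    if (!is_deterministic || is_persistent)
        return true;   // atomic add requires zero initialization
    return has_mask;   // masked blocks may be skipped, leaving stale tiles
}
```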

#### Caller usage

```cpp
// 1. Create launcher (traits include seqstart_qs/ks pointers; workspace_size is computed during construction)
fmha_bwd_launcher launcher(fmha_traits);

// 2. Read launcher.workspace_size directly
const auto ws_size = launcher.workspace_size;

// 3. Allocate a single GPU workspace
ck_tile::DeviceMem ws_buf(ws_size);

// 4. Copy nsplits/offsets to GPU head and zero dq_acc if required
launcher.prepare_workspace(ws_buf.GetDeviceBuffer());

// 5. Build args with a single workspace pointer; the kernel splits it internally
fmha_bwd_args args{
    ...,
    ws_size > 0 ? ws_buf.GetDeviceBuffer() : nullptr,  // workspace_ptr
};
launcher(args, stream_config);
```

---

### Key Code Structure

#### FmhaBwdWorkspaceManager (`fmha_bwd_kernel.hpp`, new class)

```cpp
template <typename AccDataType, bool kIsGroupMode, bool kIsDeterministic>
struct FmhaBwdWorkspaceManager
{
    static constexpr size_t ALIGNMENT = 16;

    // CPU workspace (nsplits + offsets) sizes
    static size_t GetDqAccSplitsSize(int batch);   // align_up(sizeof(index_t)*N, 16)
    static size_t GetDqAccOffsetsSize(int batch);  // group mode only: align_up(sizeof(long_index_t)*(batch+1), 16)
    static size_t GetWorkspaceHostSize(int batch);  // = SplitsSize + OffsetsSize

    // Starting offset of dq_acc data within the full workspace (= host_ws_size)
    static size_t GetDqAccDataOffset(int batch);   // = GetWorkspaceHostSize(batch)

    // Fills nsplits/offsets in the CPU workspace; returns device_ws_size (dq_acc buffer bytes)
    template <bool kUseQrQtrDorPipeline, index_t kN0>
    static size_t PrepareWorkspaceHost(void* cpu_ws, index_t batch_size, index_t hdim_q,
                                       index_t nhead_q, index_t seqlen_q, index_t seqlen_k,
                                       const index_t* seqstart_qs, const index_t* seqstart_ks);

    // hipMemcpy's cpu_ws to device_ws head; hipMemset's the dq_acc portion to 0 if required
    template <bool kUseQrQtrDorPipeline, bool kHasMask>
    static void PrepareWorkspaceDevice(void* device_ws, const void* host_ws,
                                       size_t device_ws_size, size_t host_ws_size);
};
```

#### workspace_ptr parsing (inside the kernel)

The kernel parses three address regions from `kargs.workspace_ptr`:

**Group mode (`FmhaBwdDQDKDVKernel::MakeKargs`)**:
```cpp
const uint8_t* ws = reinterpret_cast<uint8_t*>(workspace_ptr);
// dq_acc_ptr (stored in FmhaBwdCommonKargs)
ws + WorkspaceManager::GetDqAccDataOffset(batch)
// dq_acc_batch_offset_ptr (FmhaBwdGroupModeKargs field)
reinterpret_cast<const long_index_t*>(ws + WorkspaceManager::GetDqAccOffsetsOffset(batch))
```

**Batch mode**:
```cpp
ws + WorkspaceManager::GetDqAccDataOffset(batch)  // dq_acc_ptr
// No offsets pointer; batch offset is computed inside run_() from nsplits
```

**`FmhaBwdConvertQGradKernel`** follows the same pattern:
- Group mode: extracts `dq_acc_ptr`, `dq_acc_batch_offset_ptr`, and
`nsplits_ptr` (`GetDqAccSplitsOffset(batch)`) from workspace
- Batch mode: reads nsplits from `nsplits_ptr[0]`; batch offset computed
internally

### Addressing in `run_()` (group mode)

```cpp
// Per-batch processing:
const long_index_t batch_offset_dq_acc = kargs.dq_acc_batch_offset_ptr[i_batch];
// seqq_i (physical length) derived from seqstart_q_ptr
const index_t seqq_i = kargs.seqstart_q_ptr[i_batch+1] - kargs.seqstart_q_ptr[i_batch];
// nsplits_i read from nsplits_ptr (convert_dq kernel) or from GetDqAccSplits
const long_index_t split_stride_i = static_cast<long_index_t>(seqq_i) * kargs.hdim_q;
const long_index_t nhead_stride_i = static_cast<long_index_t>(nsplits_i) * split_stride_i;
// Final address:
dq_acc_base + batch_offset_dq_acc + i_nhead * nhead_stride_i + i_split * split_stride_i
```

#### nsplits computation (`PrepareWorkspaceHost`)

`PrepareWorkspaceHost` is a template method of `FmhaBwdWorkspaceManager`
that still takes `kN0` as a template parameter (from
`BlockFmhaShape::kN0` of the dq_dk_dv pipeline). However, this parameter
is **only used inside this host-side function** to compute nsplits — it
is no longer passed into the convert_dq kernel.

| Mode | nsplits computation |
|------|---------------------|
| kUseQrQtrDorPipeline | Writes dq directly; nsplits[0]=0; returns device_ws_size=0 |
| Non-deterministic | nsplits[0]=1; dq_acc used for atomic add |
| Deterministic + group mode | `ceil((seqstart_ks[i+1]-seqstart_ks[i]) / kN0)` computed per batch |
| Deterministic + batch mode persistent | Same logic as the original `GetDqAccSplits` (`dqdqkdv_workers` based) |
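
For the deterministic group-mode row, a host-side sketch of the per-batch computation (names from the traits above):

```cpp
// nsplits_i = ceil(physical seqlen_k of batch i / kN0)
for (ck_tile::index_t i = 0; i < batch; ++i)
{
    const ck_tile::index_t seqk_i = seqstart_ks[i + 1] - seqstart_ks[i];
    nsplits[i] = (seqk_i + kN0 - 1) / kN0;  // integer ceiling division
}
```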

### Removing kN0 dependency from `FmhaBwdConvertQGradKernel`

`FmhaBwdConvertQGradKernel` previously required `kN0` as a template
parameter (via `BlockFmhaBwdConvertQGradPipelineProblem`) for two
purposes:
1. In batch mode `operator()`: self-computing `nsplits = ceil(seqlen_k /
kN0)`
2. The `b{kM0}x{kN0}` component of the kernel name string

Both have been removed in this refactor:
- **Batch mode**: now reads `kargs.nsplits_ptr[0]` directly (guarded by
`if constexpr(kIsDeterministic)` to avoid accessing a non-existent field
in non-deterministic instances)
- **Kernel name**: simplified to `b{kM0}`, no longer includes `kN0`
- **Template parameters**: `BlockFmhaBwdConvertQGradPipelineProblem`
drops the `kN0_` parameter; `fmha_bwd_convert_dq_traits_` drops the
`kN0` parameter; `F_bn0`/`convert_dq_bn0` fields removed from codegen

Effect: all dq_dk_dv tiles sharing the same `(hdim, dtype, mode, pad,
deterministic)` combination — regardless of `F_bn0` value
(16/64/128/192/256) — now share a **single** convert_dq kernel instance.

---


## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-05-07 10:22:28 +08:00
Jeff Huang
537a9e7489 [CK] Fix OOB page table read in batch_prefill V prefetch (AICK-1171) (#6932)
## Summary

Fix a GPU memory access fault in `mha_batch_prefill` triggered when the
per-batch page table is tightly sized (no trailing slack).

**Affected configurations:**
- All FMHA batch prefill V2 kernels
(`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`)
- Triggered by paged KV layouts where `kv_page_indices.numel() ==
ceil(seqlen_k / page_size)` exactly
- Manifests as: `Memory access fault by GPU node-X (Agent handle:
0x...)` followed by `Aborted (core dumped)`
- Silent corruption (no fault, wrong output) when the OOB read happens
to land in zero-initialized memory

### Root cause

`load_physical_pages` performs **lookahead reads** on the page table to
prefetch K/V tiles for the next iteration. When the page table for a
batch has exactly `N` entries, the V-tile prefetch indexes `page_idx[N]`
(one past the last valid entry), reading either uninitialized memory or
the next batch's slot. On gfx942 with a tightly-sized page table, the
read crosses into an unmapped page and triggers an HSA page fault.

The bug was masked in earlier testing because most test harnesses pad
`kv_page_indices` with trailing zeros — OOB reads then return `page_id =
0`, a valid in-cache page, producing silent numerical drift instead of a
fault.

### Fix design

Thread `max_page_table_idx = (seqlen_k - 1) / page_size` from the kernel
layer down to `load_physical_pages`, and clamp every page-table read
with `ck_tile::min()`. Applied to **all four code paths** in the V
prefetch:

| Branch | What it does | Clamp applied |
|--------|-------------|---------------|
| `kIsKcache` | K prefetch loop | `min(global_token_idx >> kLog2PageSize, max_page_table_idx)` |
| V LINEAR (`page_size == 1`) | One token = one page | `min(global_token_idx, max_page_table_idx)` |
| V crosses pages (`kVTileCrossesPages`) | Per-thread page lookup | `min(global_token_idx >> kLog2PageSize, max_page_table_idx)` |
| V single page (lane0 broadcast) | `readfirstlane`-uniform lookup | `min(... >> kLog2PageSize, max_page_table_idx)` |
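
Putting the pieces together, a hedged sketch of the clamp (parameter names follow the description above):

```cpp
// Computed once per batch in the kernel layer; seqlen_k == 0 clamps to slot 0.
const ck_tile::index_t max_page_table_idx =
    seqlen_k > 0 ? (seqlen_k - 1) / page_size : 0;

// Every page-table read in load_physical_pages is clamped, so the V-tile
// lookahead can no longer index page_idx[N] one past the last valid entry.
const auto slot    = ck_tile::min(global_token_idx >> kLog2PageSize, max_page_table_idx);
const auto page_id = page_idx[slot];
```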

### Key design decisions

**Mandatory parameter, not optional with a sentinel default.** An
optional `max_page_table_idx = INT32_MAX` default would let the bug
silently come back at any new callsite that forgets to pass it. Making
it mandatory forces every caller to opt in explicitly and surfaces
missed callsites at compile time.

**`seqlen_k == 0` clamps to 0** instead of underflowing `(0 - 1) /
page_size` to `-1`. The empty-batch case is rare but well-defined: clamp
every read to slot 0.

**Single computation in the kernel layer.**
`FmhaBatchPrefillWithPagedKVCacheKernel` computes `max_page_table_idx`
once per batch and forwards it through every QScale branch (PERTENSOR /
KV_BLOCKSCALE / default). All three `operator()` overloads of the
pipeline (rich, default forwarder, KV_BLOCKSCALE forwarder) take and
forward the parameter.

### Files changed

| File | Change |
|------|--------|
| `include/ck_tile/ops/fmha/kernel/fmha_batch_prefill_kernel.hpp` | Compute `max_page_table_idx` per batch, forward to all 3 QScale branches |
| `include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp` | Add `max_page_table_idx` to `load_physical_pages` and 3 `operator()` overloads; clamp page-id reads in 4 code paths |

## Test plan

- [x] AICK-1171 reproducer verified on MI-308X (gfx942)
- [x] New pytest case `test_batch_prefill_aick1171_oob_page_table_read`
in aiter, parametrized over `total_blocks ∈ {160, 164, 168, 176, 208,
256}` (matches the `crash1_r8_*` bisect family)
- [x] Full FMHA batch prefill suite on gfx942 + gfx950

## Linked issue

AICK-1171.
2026-05-05 14:28:19 +08:00
ltqin
c1bf3f6972 [CK_TILE] Add SageAttention v2 forward kernel with multi-granularity quantization (#6574)
## Summary

Add a CK_TILE forward kernel implementing [SageAttention
v2](https://arxiv.org/abs/2411.10958) — an attention algorithm that
applies multi-granularity quantization to Q/K/V before computing
attention, trading minimal accuracy loss for higher throughput on
low-precision hardware.

### Quantization design

| Tensor | Supported data types | Scale granularity options |
|--------|---------------------|--------------------------|
| Q | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp (32 tokens), per-thread (4 tokens) |
| K | fp8 / int8 / int4 | per-tensor, per-block (128 tokens), per-warp (64 tokens), per-thread (16 tokens) |
| V | fp8 | per-channel (always) |
| O | bf16 | — |

Three precision combinations are supported: `fp8/bf16` (QKV fp8, O
bf16), `i8/fp8/bf16` (QK int8, V fp8, O bf16), and `i4/fp8/bf16` (QK
int4, V fp8, O bf16).

### Architecture support

- **gfx9** (CDNA2/3, e.g. gfx90a, gfx942) — full tile set
- **gfx950** (CDNA4) — restricted tile set (N-per-block capped at 64 for
fp8-family dtypes)

### Implementation

- Two pipeline variants: `QRKSVS` (synchronous) and `QRKSVS_ASYNC`
(async copy)
- Masking support: no mask, causal (top-left / bottom-right), and
generic windowed
- Batch and group (variable-length) modes
- Head dimension: d=128, d_v=128
- Python codegen under `example/ck_tile/49_sageattention/codegen/`
generates kernel instances per target/dtype/tile combination
- Smoke tests included via `tile_example_sageattn_fwd`

### Test commands

```bash
# fp8 QKV
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 -kname=1 -prec=fp8bf16 -qscale=3 -init=3

# int8 QK, fp8 V
./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 -kname=1 -prec=i8fp8bf16 -qscale=3 -init=3
```

`-qscale` values: 1=per-tensor, 2=per-block, 3=per-warp, 4=per-thread
2026-04-30 11:32:23 -07:00
Wojciech Laskowski
a581a451f1 [CK Tile] Adding WMMA wrappers for dense builtins (#5801)
## Motivation

This PR is part of the [WMMA/MFMA] unification work. It's the first of
the series of PRs that add all the necessary MMA builtins as a
`amdgcn_mma` structs.

## Technical Details

This change adds new specializations for WMMA dense builtins. In total,
we have now 9 RDNA4 builtins and 3 RDNA3 builtins.

## Test Plan

All the new wrappers were added to the test suite in
`test_amdgcn_mma_layout.inc`.

## Test Result

Test pass locally, waiting for the CI.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Yung-sheng Tu <yung-sheng@streamhpc.com>
2026-04-27 11:57:51 +00:00
Qianfeng
c92fd39299 Improve the performance of qr_ks_vs_whole_k_prefetch pipeline (#6209)
## About qr_ks_vs_whole_k_prefetch pipeline
This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to
improve performance on both MI300 and MI350 GPUs through better MFMA
instruction usage, transposed V-loading support, and an N0-loop
implementation. The pipeline targets scenarios where the number of
workgroups is low, enabling better CU occupancy by using smaller MTile
sizes (kM0=64 vs 128) while prefetching entire K tiles.

## Changes:

- Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload)
to avoid using shuffle instructions on MI350
- Implements N0-loop based Gemm0 to reduce tile window movement overhead
and eliminate `clear_tile` calls
- Adds full support for hdim96/hdim160 without padding requirements
- Updates MFMA instruction selection to ensure optimal choices for MI350

## Performance results

1. For attention shapes which lead to kM0=64,
`qr_ks_vs_async_whole_k_prefetch_trload` shows much better performance
than `qr_ks_vs_async_trload` on the same case (execution time `41.02ms`
by whole_k_prefetch_trload vs `58.50ms` by async_load), and
`qr_ks_vs_async_whole_k_prefetch_trload` also shows clearly better
performance than the recently tuned `qr_ks_vs_async` on the same case
(execution time `41.02ms` by whole_k_prefetch_trload vs `47.60ms` by
qr_ks_vs_async).
2. Also on MI300, for attention shapes which lead to kM0=64,
`qr_ks_vs_async_whole_k_prefetch` shows much better performance than
`qr_ks_vs_async` (which is supposed to be highly efficient) on the same
case (execution time `64.50ms` by whole_k_prefetch vs `80.20ms` by
qr_ks_vs_async).
3. For attention shapes which lead to kM0=128,
`qr_ks_vs_async_whole_k_prefetch_trload` shows slightly better
performance than `qr_ks_vs_async` on MI350 (execution time `104.50ms` by
whole_k_prefetch_trload vs `106.50ms` by qr_ks_vs_async), and they show
completely on-par performance on MI300.

## Test/Verify

1. Use the ROCm xformers branch `test_whole_k_prefetch_n0loop` to
test/verify the qr_ks_vs_whole_k_prefetch pipeline, since this pipeline
cannot be used by the ck_tile fmha example so far.
2. Use the following command lines for building/testing xformers:
> ```bash
> #> git clone -b test_whole_k_prefetch_n0loop https://github.com/ROCm/xformers
> #> git submodule update --init --recursive
> #> pip install --no-build-isolation -e ./
> #> pytest tests/test_mem_eff_attention.py::test_forward
> ```
3. Any script that can run on xformers can be used to evaluate the
qr_ks_vs_whole_k_prefetch pipeline. Use the following two environment
variables to switch between pipelines:
> ```bash
> #> export FMHA_DISABLE_SPECIAL_TREATMENT=1  # disable the FAV3 and qr_ks_vs_async_trload pipelines
> #> export FMHA_ENABLE_ASYNC_PIPELINE=1      # disable the qr_ks_vs_async pipeline for comparison
> ```

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com>
Co-authored-by: qianfengz <12429178+qianfengz@users.noreply.github.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-04-24 10:30:41 -06:00
Jeff Huang
34084aeb11 [CK_TILE] fix(fmha): support >2GB KV cache in batch prefill via template dispatch (#6653)
## Motivation

The CK batch prefill kernel previously failed (silent overflow + page
faults) when the KV cache exceeded 2 GB, blocking long-context inference
workloads (e.g., 128K+ token contexts with paged KV).

Two distinct failure modes were addressed:

1. **>4GB SRD overflow (`page_size < kN0`):** The SRD
`buffer_load_dwordx4` path uses a 32-bit `voffset` register; for small
page sizes the rebased SRD spans the full KV pool and the offset wraps
past 2 GB, corrupting K/V loads.
2. **gfx950 page-table fault (`page_size >= kN0`):** On CDNA4 the
hardware validates the **full SRD `num_records` range** against
page-table permissions (CDNA3 only checks per-instruction `voffset`).
After per-tile SRD rebase, an un-trimmed `num_records` field extends
past the live page and faults on freed/protected memory.

## Technical Details

**Two-mode `tile_scatter_gather` selected by the `kUseGlobalLoad`
template parameter:**

| Case | `page_size` | KV cache size | Mode | Load path | Addressing |
|---|---|---|---|---|---|
| 1 | `>= kN0` (large pages) | any | SRD (`kUseGlobalLoad=false`) | `buffer_load_dwordx4` | 32-bit `voffset`, bounded by per-page rebase |
| 2 | `< kN0` (small pages) | `<= 2 GB` | SRD (`kUseGlobalLoad=false`) | `buffer_load_dwordx4` | 32-bit `voffset`, fits in INT32 byte range |
| 3 | `< kN0` (small pages) | `> 2 GB` | Global-load (`kUseGlobalLoad=true`) | `async_load_tile_raw_flat` (K) + `load_tile_flat` (V) | 64-bit |

**Dispatch:** the auto-gen API layer (`fmha_batch_prefill.py`) selects
the kernel instantiation at launch from `(page_block_size,
num_total_pages * batch_stride_k * kElementBytes)`, so the small-page
penalty is paid only when correctness requires it.
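
A rough sketch of that launch-time selection, under the thresholds in the table (variable names assumed):

```cpp
// Global-load is chosen only for small pages whose KV pool exceeds the
// 32-bit SRD-safe range; otherwise the buffer_load_dwordx4 SRD path stays.
const bool   small_pages     = page_block_size < kN0;
const size_t kv_bytes        = static_cast<size_t>(num_total_pages) * batch_stride_k * kElementBytes;
const bool   use_global_load = small_pages && kv_bytes > (size_t{1} << 31);  // > 2 GB
// true  -> case 3: async_load_tile_raw_flat (K) + load_tile_flat (V), 64-bit
// false -> cases 1-2: buffer_load_dwordx4 with 32-bit voffset
```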

**gfx950 SRD `num_records` trimming:** in the K and V rebase lambdas of
`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`,
`set_bottom_tensor_view_buffer_size(page_stride_k/v)` is called after
each rebase to constrain `num_records` to the live page. Required for
CDNA4 page-table validation; harmless on CDNA3.

**Pipeline sync for the global-load path:**
- V uses synchronous `load_tile_flat`; K uses
`async_load_tile_raw_flat`.
- `v_physical_pages_current` is double-buffered so the V flat load
doesn't race against the next iteration's K rebase computation.

**Arch guards:** `global_load_lds` intrinsics are gated to `__gfx94__` /
`__gfx950__` (CDNA3+). Other architectures hit a `dependent_false`
static_assert with a descriptive message.

**Device-side assertion convention:** SRD setters use
`__builtin_assume(cond)` (hint-only) rather than `<cassert>`'s
`assert()`. The latter introduces an `__assert_fail` call whose register
pressure scatters the K-SRD scalar register window across conditional
branches, corrupting `buffer_load_dwordx4` on gfx950.


## Test Plan

Tested on both MI308 (gfx942) and MI355 (gfx950) via the aiter wrapper
test suite. All coverage lives in **`op_tests/test_batch_prefill.py`**:

- **Functional matrix (96 cases)** — `test_batch_prefill`: `page_size ∈
{1, 16, 1024}` × `kv_layout ∈ {linear, vectorized}` × `dtype ∈ {bf16,
fp8 quant variants}` × `causal` × `soft_cap` × `LSE` × `batch_size ∈ {1,
4}` (parametrized to exercise per-sequence SRD rebase across batch
boundaries).
- **>2 GB coverage** — `test_batch_prefill_large_kvcache`: extended to
allocate a 5 GB+ KV cache pool and exercise both `kUseGlobalLoad=true`
(small-page) and `kUseGlobalLoad=false` (large-page rebase) paths.
Includes both single-batch and multi-batch (`batch_size=4`) cases to
exercise per-sequence SRD rebase across the >2 GB pool.
- Numerical reference: PyTorch SDPA, per-batch loop with `atol` / `rtol`
from the existing batch prefill test harness.

## Test Result

| Arch | `test_batch_prefill` | `test_batch_prefill_large_kvcache` (>2 GB) |
|------|----------------------|---------------------|
| MI308 (gfx942) | All passed | Passed |
| MI355 (gfx950) | All passed | Passed |

**Performance impact (gfx950, hot SRD path):**
- +2.67% kernel-time on `seqlen=1024 / page_sz=1024 / bf16 / sglang /
causal / soft_cap=30`, attributable in full to the two
`set_bottom_tensor_view_buffer_size` calls in the K/V rebase lambdas
(5-run median, signal/noise ≈ 9×).
- This cost is **mandatory for gfx950 correctness** on >2 GB workloads —
removing the setters re-introduces page-faults.
- gfx942: 0 regressions in the same range (all configs ≤ +0.97%).

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-24 07:08:41 +08:00
Luo Cheng
6946b07408 [CK] Fix out of bounds modifications caused by negative topk_ids in MoeSortingMultiPhaseKernel_P0_v1 (#6242)
## Motivation

Fix sglang randomly crash by filter negative topk ids.

## Technical Details

In sglang expert-parallel mode, an idle batch (batch=0) may be fired; in
CUDA graph mode it reuses the batch=1 resources. The topk op sets unused
topk ids to -1, so in the idle-batch case all topk ids are -1. In
`MoeSortingMultiPhaseKernel_P0_v1`, a negative expert id causes an
out-of-bounds write, and sglang may randomly crash. Beyond the idle-batch
case, if the captured batch sizes are discrete, -1 expert ids can appear
through the same logic. The guard is sketched below.
## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: zovonoir <jialzhu@amd.com>
2026-04-24 06:44:37 +08:00
Artem Kuzmitckii
16a9ced35d [CK] Fix divide-by-zero crash for grouped conv kernels (#6132)
## Motivation

While running the PyTorch unit tests for conv3d
(`test_dtypes_nn_functional_conv3d_cuda`,
`test_fake_crossref_backward_amp_nn_functional_conv3d_cuda_float32`),
a divide-by-zero crash was found during CK kernel selection.

Refs ROCM-20764

## Technical Details

Add an assert for K0PerBlock equal to 0, and cover other potential
divide-by-zero sites related to the k_batch calculation.

## Test Plan
Run miopen command extracted from mentioned test:
`MIOpenDriver convfp16 --spatial_dim 3 -I NCDHW -O NCDHW -f NCDHW -n 1
-c 1 -k 1 -g 1 --in_d 4 -H 4 -W 4 --fil_d 4 -y 4 -x 4 --pad_d 0 -p 0 -q
0 --conv_stride_d 2 -u 2 -v 2 --dilation_d 1 -l 1 -j 1 -m conv -F 4 -t
1`
## Test Result
Passed

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Signed-off-by: Artem Kuzmitckii <artem.kuzmitckii@amd.com>
2026-04-23 22:10:46 +02:00
KateJu
90ca12f14f Add missing lds sync (#6655)
## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-23 07:05:33 -07:00
jakpiase
876bc47c67 [CK_TILE] Grouped Convolution Backward Data Direct Load (#6624)
## Proposed changes

Add Grouped Convolution Backward Data with Direct Load into
DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffleV3 device implementation.
This enables direct global memory loading (bypassing LDS) for the
backward data convolution path on gfx950, following the same pattern
used in both backward weight and forward convolution.

Direct load convolution backward data improves performance by avoiding
LDS round-trips for certain configurations on gfx950, which supports a
wider range of instructions. Currently correctness is checked only at
usage point, but should be extended to a standalone UT in the future.
2026-04-23 11:16:55 +02:00
Illia Silin
cfb09d76a5 [CK] Fix/suppress clang lifetimebound warnings with staging compiler. (#6550)
## Motivation

New changes from upstream llvm-project cause an avalanche of warnings in
CK. Gonna disable them by ignoring the
lifetime-safety-intra-tu-suggestions flag until a better permanent
solution is found.


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-22 15:47:47 +00:00
Sami Remes
1e4eebfba8 [CK_TILE] Preserve input strides in EightWaves async-load descriptor (#6611)
`MakeAsyncLoadADramWindow` in
`GemmPipelineAgBgCrCompAsyncEightWavesPolicy` was rebuilding the 6D view
descriptor with `make_naive_tensor_descriptor_packed`, which synthesizes
strides from lengths and assumes a dense layout. When the input view's
leading-dim stride is larger than its inner length (non-packed memory
layout), the resulting tile window stepped through memory at the wrong
stride.

Compose the unmerge transforms on top of the input view's existing
descriptor instead, so the actual runtime strides are preserved and the
correct `element_space_size` is inherited for bounds checking.
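
A hedged sketch of the approach, using common CK Tile descriptor transforms (the split factors `m0..m2`, `k0..k2` are illustrative, not the exact EightWaves code):

```cpp
// Compose unmerge transforms on top of the input view's own descriptor so
// the 6D view inherits the real runtime strides and element_space_size,
// instead of synthesizing packed strides from lengths.
const auto& in_desc = in_view.get_tensor_descriptor();
const auto desc_6d  = transform_tensor_descriptor(
    in_desc,
    make_tuple(make_unmerge_transform(make_tuple(m0, m1, m2)),
               make_unmerge_transform(make_tuple(k0, k1, k2))),
    make_tuple(sequence<0>{}, sequence<1>{}),
    make_tuple(sequence<0, 1, 2>{}, sequence<3, 4, 5>{}));
```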

## Test Plan

Added a unit test that reproduces the problem.

## Test Result

The new test fails before the fix and passes after.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-22 12:52:02 +02:00
Linjun-AMD
803874c73b [CK][fmha] Add StreamLLM sink support to batch_prefill pipeline (#6479)
## Motivation

The existing paged-KV attention pipelines (pagedkv, splitkv) support
  StreamLLM-style sink tokens — a fixed set of initial tokens kept in
  attention alongside the sliding window. The `batch_prefill` pipeline
  (chunked-prefill with VLLM-style block tables) previously hardcoded
  `kHasSink = false`, making it incompatible with sink-based attention
  patterns in LLM serving scenarios.

  This PR extends `batch_prefill` to support `kHasSink` and wires it
into `fmha_fwd_runner` for validation against the existing CPU
reference.

## Technical Details

**Pipeline** (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`):
- When `kHasSink`, the K/V loop splits into a sink phase [0, sink_seq_end)
  and a window phase [seqlen_k_start, seqlen_k_end), mirroring pagedkv
  (see the sketch after this list).
- K advance at the sink→window transition jumps
  `seqlen_k_start - sink_seq_end + kN0` to bridge the gap.
- V scatter-gather offsets are re-initialized at the transition to fix a
  window-mismatch bug: V was lagging kN0 behind K after the large jump,
  loading from the wrong sequence position.
- Bias window, dropout seq_offset, and mask type (LogitsSinkMask) are
  updated for sink-awareness.
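
A schematic sketch of the split loop and transition jump (control flow only; `move_k_window` and `reinit_v_scatter_offsets` are hypothetical stand-ins for the pipeline's window moves):

```cpp
// sink phase: [0, sink_seq_end)
for (index_t i_k0 = 0; i_k0 < sink_seq_end; i_k0 += kN0)
    process_kv_tile(i_k0);

// transition: jump K past the gap so the window phase starts at
// seqlen_k_start, and re-align V so it no longer lags kN0 behind K
move_k_window(seqlen_k_start - sink_seq_end + kN0);
reinit_v_scatter_offsets();

// window phase: [seqlen_k_start, seqlen_k_end)
for (index_t i_k0 = seqlen_k_start; i_k0 < seqlen_k_end; i_k0 += kN0)
    process_kv_tile(i_k0);
```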

**Traits / codegen** (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`,
`fmha_batch_prefill.py`):
- `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded
`false`).
- Codegen adds `F_sink` field; skips batch-mode kernels (group mode
required).
  - CMake test filter broadened from 9 → 33 instances covering
    fp16/bf16 × mask/nmask × lse/nlse × sink/nsink.

  **Runner** (`fmha_fwd_runner.hpp`, `CMakeLists.txt`):
  - `fmha_batch_prefill()` dispatched from `run_fwd` when:
    group mode + paged KV + num_splits == 1.
- K/V strides corrected for runner's [num_pages, nhead_k,
page_block_size, hdim] layout.
  - `page_block_size % 128` check relaxed: batch_prefill supports ps=16.
  - CPU reference paged-KV reordering guards extended with
    `CK_TILE_FMHA_FWD_BATCH_PREFILL_API`.

## Test Plan

Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run
  `tile_example_fmha_fwd` in group mode with page_block_size=16.

  Test matrix:
  - Mask: no-mask, causal, sliding window
  - Sink: nsink, sink=1..128
  - dtype: fp16, bf16
  - LSE output: on/off
  - seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024}
  - GQA, chunked prefill, large batch×seqlen
  - page_block_size: 16, 32

## Test Result

171 test cases, all passing:
  - nmask + nsink: ✓
  - causal + nsink: ✓
  - causal + sink=8: ✓
  - sliding window + sink=8 (d=128, d=256): ✓
  - bf16, LSE output, GQA: ✓

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-21 11:03:55 +00:00
金黄色葡萄球君君
b5b3ba728d projects/composablekernel: add SwigluStep support for MoE blockscale (#6118)
## Summary
- add `swiglustep_and_mul` to the composablekernel MoE blockscale
activation enum
- implement the corresponding blockscale epilogue path for `SwigluStep`
- keep existing `silu` and `gelu` paths unchanged

## Scope
This PR covers the classic composablekernel blockscale MoE path under
`projects/composablekernel`.

This is separate from the `ck_tile` / FlatMM path being discussed in
ROCm/rocm-libraries#5992.

## Motivation
`Step-3.5-Flash-FP8` uses `SwigluStep` in its MoE MLP path. The
dependent AITER change needs native support for this activation in the
classic composablekernel MoE blockscale path.

## Validation
- patch is limited to two composablekernel files under
`projects/composablekernel`
- existing `silu` / `gelu` paths are unchanged
- dependent AITER runtime validation hit the classic CK 2-stage path
with AITER MoE enabled
2026-04-21 07:24:48 +00:00
Hosang Yoon
2574f37483 [CK_TILE] Enable canonical-NaN BF16 conversion for FMHA on RDNA (#6253)
## Motivation

- On gfx11/gfx12, the existing float -> bf16 conversion path in FMHA
forward adds noticeable overhead and causes a meaningful performance gap
versus fp16. The asm-based path (mode 3) does not improve this on RDNA
and can perform even worse.
- In particular, on gfx12, bf16 FMHA forward can be up to ~20% slower
than the corresponding fp16 path.
- This PR reduces that gap by switching FMHA forward to a different BF16
conversion strategy based on Triton’s canonical-NaN
round-to-nearest-even behavior.

## Technical Details

- Add a new `standard_cnan` BF16 conversion mode to CK Tile.
- Implement a canonical-NaN RTN `float -> bf16` conversion path based on
the Triton implementation.
- Enable this conversion mode by default for FMHA forward builds
targeting gfx11/gfx12.
- Retune gfx11/gfx12 FMHA forward kernel selection thresholds for some
`hdim=128` cases to keep kernel selection aligned with the updated
conversion behavior.
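
For reference, a standalone sketch of canonical-NaN round-to-nearest-even float-to-bf16 in the Triton style described above (not CK's exact `standard_cnan` implementation):

```cpp
#include <cstdint>
#include <cstring>

uint16_t float_to_bf16_cnan_rtne(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    if ((u & 0x7fffffffu) > 0x7f800000u)  // any NaN input
        return 0x7fc0u;                   // canonical quiet NaN
    // round-to-nearest-even on the 16 bits being truncated
    const uint32_t bias = 0x7fffu + ((u >> 16) & 1u);
    return static_cast<uint16_t>((u + bias) >> 16);
}
```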

## Test Plan

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={hdim} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

## Test Result
- all tests passed when running `test_ck_tile_fmha`
- BF16 FMHA forward performance improves by up to ~5% on gfx11.
- BF16 FMHA forward performance improves by up to ~10% on gfx12.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-20 14:52:24 -04:00
Bartłomiej Kocot
8fd401803f [CK][CK Tile] Clamp element space size to max int32 value (#6168)
## Motivation

Fix the OOB check by clamping the element space size to avoid overflow
when a tensor is larger than 2 GB.

## Technical Details

- A tensor can be larger than 2 GB while the individual offsets are not,
so the element space size must be clamped to 2 GB when it exceeds that
value, as sketched below.
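
A minimal sketch of the clamp, assuming the element space size is tracked as a 64-bit count:

```cpp
#include <cstdint>
#include <limits>

// Offsets stay below INT32_MAX even when the tensor itself exceeds 2 GB,
// so clamping the size keeps the 32-bit OOB comparison from overflowing.
int32_t clamp_element_space_size(int64_t element_space_size)
{
    const int64_t max32 = std::numeric_limits<int32_t>::max();
    return static_cast<int32_t>(element_space_size < max32 ? element_space_size : max32);
}
```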

## Test Plan

CI

## Test Result

Pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

https://github.com/ROCm/composable_kernel/issues/3722

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2026-04-20 15:32:24 +00:00
Yung-sheng Tu
91b7dae95a [CK TILE] Unification of Scale MFMA/WMMA Policy Structs (#5857)
## Motivation

The existing unification work supports DENSE and SPARSE intrinsics. In
this PR, we enable support for SCALE intrinsics and add example SCALE
implementations.

## Technical Details

Adding MFMA SCALE intrinsics support, adding tests for MFMA SCALE
intrinsics, and adding WMMA SCALE policy trait.

Note: fp6 SCALE intrinsics support is not included in this PR, as its
handling in ck_tile is currently more specialized and does not follow
the same pattern as other datatypes.

## Test Plan

Added new tests for the relevant SCALE specialisations.

## Test Result

Test should pass.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-20 14:28:23 +00:00
Zoltán Lakatos
f73bfe1b7e [CK] Remove code duplications in grouped gemm fixed nk implementations (#4961)
## Motivation

Different flavours of grouped gemm fixed nk implementations share the
same block-to-tile mapping logic. Despite that, the code responsible for
it is duplicated in each device struct implementation.

- Move `BlockToCTileMap_KBatch_M00_N0_M01Adapt_MLoops` and
`OffsettedBlockToCTileMapMLoops` from the device struct implementations
to a common header file.
- Use the generic Kernel Argument structures in xdl versions of the
fixed nk.


## Test Plan

CI in general. Relevant test and examples are all fixed_nk versions of
grouped gemm multiple D and ABD.


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-04-20 12:24:59 +00:00
Hosang Yoon
f5e00ec904 [CK_TILE] Skip padded k/n fragment work in qr_hpad FMHA fwd (#6450)
## Motivation

`qr_hpad` currently executes work for padded head-dim fragments even
when only a subset of the values are valid. This adds unnecessary
computation for head dimensions that require padding, such as `hdim=72`
and `hdim=80`, and hurts FMHA forward performance.

The goal of this PR is to make the padded-head-dim path skip invalid
work based on the actual valid fragment count, while preserving the
existing behavior for the non-padded path.

## Technical Details

This PR improves the `qr_hpad` FMHA forward path in three parts:

- Skip padded `k`/`n` fragments in the GEMM/pipeline path when only part
of the fragment is valid.
- Add partial GEMM0 tail handling for `qr_hpad` so the kernel uses the
valid fragment range instead of always computing over the padded extent.
- Retune the gfx11 `qr_hpad` kernel configuration after enabling the
partial-fragment path.

To keep the existing path stable, the implementation adds overloads for
the updated GEMM/pipeline interfaces. This allows existing full-tile
callers to keep using the previous form, while the `qr_hpad` path can
pass valid fragment counts when needed.
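
As a schematic, the valid-fragment computation this enables (names assumed; not the exact pipeline interface):

```cpp
// Iterate only over fragments that contain valid head-dim elements,
// e.g. hdim=72/80 padded up to the tile's head-dim extent.
const index_t valid_k_frags = (hdim_valid + kFragK - 1) / kFragK;  // ceil-div
for (index_t f = 0; f < valid_k_frags; ++f)
    run_gemm0_fragment(f);  // hypothetical per-fragment call; the last may be partial
```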

## Test Plan

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

## Test Result

- On gfx11 and gfx12, for head dimensions that require padding,
`tile_example_fmha_fwd` shows about 20-30% performance improvement at
`hdim=72/80`.

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-18 06:44:46 +00:00
Ville Pietilä
7d6ef2396f [MIOpen][CK] Fix bwd weight conv test failures by disabling one block-GEMM V5 instance for 3D convs (#6421)
## Motivation

Due to a compiler version update, there are test failures in the test
target `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current develop branch, the test
failures are blocking any PR merges into develop. An example of a failed
CI run is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as the tests are passing for clang compiler version 20.0 and
failing for clang compiler version 22.0.

First attempt to fix this problem had to be reverted in #6400 because it
broke MIOpen internal DB sync tests.

## Technical Details

The root cause of the test failures is the block-GEMM V5 instances of
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that have a large tile size.
The V5 pipeline uses a double register buffer which, in combination with
the large tile size, causes high register pressure. The latest version
of the compiler handles the register spillage incorrectly for `gfx90a`,
which causes the kernel to output incorrect results.

The BF16/FP16 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3`
that do not use direct load are divided into two groups:
- Base instances
- Instances that result in high register usage (currently only one
instance, the one that causes the test failures).

This division makes it possible to disable only the V5 block-GEMM flavor
of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default,
8, 4, 1, 8, 8, 8, 8, 1, 1, 2>` for 3D convolutions on `gfx90a`. The
selective disabling leaves the set of instances for 1D and 2D
convolutions unaffected, and removes at runtime two V5 block-GEMM
instances (`ConvBwdWeightDefault` and
`ConvBwdWeightFilter1x1Stride1Pad0`) per data type (FP16/BF16) when the
device is `gfx90a`.

Because MIOpen uses CK's type string (provided by the method
`GetTypeString`) to identify instances, the DB sync tests are expected
to be unaffected, since there are still V2 block-GEMM instances that
result in the same type string
(`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>`). This expectation needs to be verified by
running the MIOpen DB sync tests, which are not part of the normal CK PR
build.

## Test Plan

Running all CI tests + the MIOpen internal DB sync tests is sufficient
to verify the correctness of the code changes.

## Test Result

Verified locally that the previously failing tests
`TestGroupedConvndBwdWeight3d/4.Test3D` and
`TestGroupedConvndBwdWeight3d/4.Test3D` have instance counts

- 231 on `gfx90a`
- 233 on `gfx942`

and are currently passing. This confirms the expectation that two
instances per data type should be disabled on `gfx90a`.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Ville Pietilä <>
2026-04-17 09:16:32 +03:00
Max Podkorytov
027b95a21c [CK_TILE] Add CShuffleLds microbenchmark suite (#5383)
## Summary

Microbenchmarks isolating LDS store/load operations in CShuffleEpilogue
for bank conflict analysis.

## Motivation

CShuffleEpilogue performs LDS store (MFMA registers → LDS) and load (LDS
→ registers for coalesced global writes). This suite isolates each
operation to:
- Identify which operation causes bank conflicts
- Measure pure LDS bandwidth per access pattern
- Validate access patterns across MFMA tile sizes and wave layouts

## Components

- **Microkernels** (`tile_load_store_microkernels.hpp`):
`StoreTile<Setup>`, `LoadTile<Setup>`
- **Setup Adapters** (`benchmark_cshuffle_lds.hpp`): Wire
CShuffleEpilogue to microkernels
- **Template** (`benchmark_template.cpp.in`): Generated benchmarks with
timing

## Build

```bash
cmake -G Ninja -B build -S . \
    -DGPU_TARGETS=gfx950 \
    -DBUILD_CK_EXAMPLES=ON \
    -DBUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS=ON

ninja -C build bench_lds_fp8_16x16x128_2x2_fp8
```

## New CMake Options

| Option | Default | Description |
|--------|---------|-------------|
| `BUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS` | OFF | LDS microbenchmarks |
| `BUILD_CK_TILE_FMHA_TESTS` | ON | FMHA tests |
| `BUILD_CK_TILE_ENGINE` | ON | Tile engine |
| `BUILD_CK_TILE_ENGINE_TESTS` | ON | Tile engine tests |
| `BUILD_CK_EXAMPLES` | ON | Examples |
| `BUILD_CK_TUTORIALS` | ON | Tutorials |
| `BUILD_CK_DEVICE_INSTANCES` | ON | Device instances |
| `BUILD_CK_PROFILER` | ON | Profiler |

Setting these guards to OFF reduces the CMake configure time from ~150s to ~5s.

---------

Made-with: Claude Code, Opus 4.5
2026-04-14 20:43:23 -07:00
msaffari-amd
1ae4925302 [CK_TILE] Separate PermuteN epilogue from CShuffle epilogue into standalone file (#5863)
## Motivation

The PermuteN epilogue was previously embedded within
cshuffle_epilogue.hpp, despite having fundamentally different behaviour.
Coupling these two independent strategies in one file introduced
unnecessary complexity, SFINAE guards, and a dual operator() overload
selected at compile time via TiledMMAPermuteN_ template parameter.

This PR separates PermuteN into its own standalone file
(`permuten_epilogue.hpp`), simplifying both implementations and making
the codebase easier to maintain and extend independently.

## Technical Details

**New file: `permuten_epilogue.hpp`:**
contains `PermuteNEpilogueProblem` and `PermuteNEpilogue`, extracted from
the PermuteN code path in `cshuffle_epilogue.hpp`.

**Cleanup of cshuffle_epilogue.hpp:**

- Removed the `TiledMMAPermuteN_` template parameter from
`CShuffleEpilogueProblem`
- Removed the SFINAE-guarded PermuteN `operator()` overload
- Removed the `EnablePermuateN_` SFINAE alias
- CShuffle now only contains CShuffle logic; EightWave support (an
independent feature) is retained

**Consumer migration:**
All consumer files now use compile-time epilogue selection via
`std::conditional_t`:

```cpp
using GemmEpilogue = std::conditional_t<
    TiledMMAPermuteN,
    PermuteNEpilogue<PermuteNEpilogueProblem<...>>,
    CShuffleEpilogue<CShuffleEpilogueProblem<...>>>;
```

**Files modified:**

- flatmm_basic.cpp, moe_flatmm.cpp, a16w4_moe_flatmm.cpp,
mixed_prec_flatmm.cpp, mx_flatmm_instance.hpp — flatmm examples
- run_gemm_quant_example.inc — block-scale GEMM example
- gemm_weight_preshuffle_invoker.hpp — weight preshuffle invoker
- test_gemm_quant_fixtures.hpp, test_gemm_persistent_async_input.cpp,
test_gemm_pipeline_util.hpp — test utilities
- universal_gemm_invoker.hpp — universal GEMM invoker
- epilogue.hpp — add header updated to include permuten_epilogue.hpp



## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2026-04-14 20:22:18 +00:00
arai713
c810a01ec6 [CK_TILE] Restructure Tile Engine's benchmarking and profiling (#4769)
## Motivation
This PR introduces a restructure for the benchmarking and profiling
aspects of CK Tile's Tile Engine, expanding on the groundwork from this
previous https://github.com/ROCm/composable_kernel/pull/3434 and
outlined in this [design
document](https://amdcloud-my.sharepoint.com/:w:/r/personal/astharai_amd_com/Documents/Restructuring%20Tile%20Engine.docx?d=w14ea28a30718416988ed5ebb759bd3b2&csf=1&web=1&e=l3VBuX).
In PR 3434, to reduce repeated code we implemented:

- Base class that centralizes common functionality and provides a
default implementation (Universal GEMM)
- Child classes for GEMM variants override virtual functions to handle
variant-specific behavior

The refactoring in this PR follows the same approach and should greatly
reduce the duplicated code in Tile Engine, making it simpler to add new
operations and improving scalability.

## Technical Details
The files have been refactored around new base structs for benchmarks,
profiling and problem descriptions. The new base structs are:

- GemmProblem
- GemmBenchmark
- GemmProfiler

Universal GEMM, Preshuffle GEMM, and Multi-D GEMM all have child classes
that will inherit from these base structs overriding only what differs
per variant.
All common functions across the benchmarking and profiling files have
been moved into newly added common utility files under the commons/
directory. The new utility files are:

- utils.hpp: common functions for the benchmarking and profiling process
- benchmark_utils.py: common utility functions for the benchmark
generation
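
A minimal C++ sketch of the base/override pattern described above; only the `GemmBenchmark` name comes from this PR, and the members shown are illustrative:

```cpp
#include <string>

// Hedged sketch of the base struct with a default (Universal GEMM)
// implementation and a variant overriding only what differs.
struct GemmBenchmark
{
    virtual ~GemmBenchmark() = default;

    // Defaults cover Universal GEMM.
    virtual std::string name() const { return "universal_gemm"; }
    virtual void make_arguments() { /* common argument setup */ }

    void run()
    {
        make_arguments(); // variant-specific hook
        // ... timing / validation shared by every variant ...
    }
};

// Variants override only what differs per variant.
struct PreshuffleGemmBenchmark : GemmBenchmark
{
    std::string name() const override { return "preshuffle_gemm"; }
    void make_arguments() override { /* preshuffle-specific setup */ }
};
```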

## Test Plan
I tested using the existing tests for Tile Engine.
## Test Result
All tests passed.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-14 10:50:24 -07:00
Estevan Vedovelli
43b33b9034 [ck] Clamp negative kernel execution elapsed time to zero (#6379)
## Motivation

hipEventElapsedTime can return a small negative value on Windows when
timing a very fast kernel launch on the null stream. This caused
consumers of launch_and_time_kernel to receive a negative elapsed time,
which they reasonably treat as an error, breaking otherwise-correct
kernel executions.

## Technical Details

After calling hipEventElapsedTime, a clamp is applied in
launch_and_time_kernel before the result is returned, avoiding the
return of a physically impossible elapsed time.

The negative value from hipEventElapsedTime has been observed on
Windows. For kernels that complete in well under a millisecond, the HIP
event timestamps can alias such that the computed difference is a small
negative number (observed: ~-1.78 ms). No HIP error is reported by any
surrounding call (hipEventRecord, hipEventSynchronize, hipGetLastError),
confirming the kernel itself executed successfully.
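
A minimal sketch of the clamp, assuming a simplified wrapper; the real `launch_and_time_kernel` has a different signature, but the clamp after `hipEventElapsedTime` is the essence of the fix:

```cpp
#include <hip/hip_runtime.h>
#include <algorithm>

// Hedged sketch of the timing path; error handling is simplified.
float elapsed_ms_clamped(hipEvent_t start, hipEvent_t stop)
{
    float elapsed_ms = 0.0f;
    (void)hipEventSynchronize(stop);
    (void)hipEventElapsedTime(&elapsed_ms, start, stop);

    // Sub-millisecond launches on the null stream can yield a small negative
    // value on Windows (observed: ~-1.78 ms); clamp so callers never receive
    // a physically impossible elapsed time.
    return std::max(elapsed_ms, 0.0f);
}
```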

## Test Plan

- Recompile CK and validate no kernel execution reports a negative
elapsed time during hipTensor tests.
- Pass the CI/CD pre-checking tests for CK.

## Test Result

- All tests passing

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-14 09:14:26 -07:00
Po Yen Chen
470a48530b [CK] Skip fp16 dropout d256 batch tests for compiler VGPR aliasing bug (#6342)
## Summary

- Skip fp16 FMHA forward dropout tests that use the d256 tile in batch
mode, gated on compiler version
- The AMDGPU compiler miscompiles these kernels due to VGPR aliasing of
Philox RNG parameters under high register pressure (383 VGPRs)
- bf16 dropout tests are unaffected and cover the same code paths

## Root Cause

The compiler aliases `ph_seed` and `ph_head_offset` (Philox RNG state
stored in VGPRs) with other live data during the softmax main loop. This
causes corrupted `buffer_store_byte` writes for dropout randval on wave
lanes 32-63, producing NaN in output and LSE tensors.

**Conditions:** fp16 + d256 tile + dropout + batch mode + `qr` pipeline
+ gfx90a

## Changes

- `include/ck_tile/core/config.hpp`: Add
`CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE` macro
- `test/ck_tile/fmha/test_fmha_fwd.cpp`: Version-gated `GTEST_SKIP` in
`TEST_P(Dropout, ...)`
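
A hedged sketch of the version-gated skip; the fixture body and the predicate are placeholders, while the workaround macro name comes from this change:

```cpp
#include <gtest/gtest.h>

// Hedged sketch; param_uses_fp16_d256_batch() stands in for the real
// test-parameter predicate.
class Dropout : public ::testing::TestWithParam<int>
{
protected:
    bool param_uses_fp16_d256_batch() const { return GetParam() == 0; } // placeholder
};

TEST_P(Dropout, Fp16D256Batch)
{
#if CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE
    if(param_uses_fp16_d256_batch())
        GTEST_SKIP() << "fp16 d256 batch dropout miscompiles on this compiler "
                        "(VGPR aliasing of Philox RNG state)";
#endif
    // ... run the dropout kernel and validate output/LSE ...
}

INSTANTIATE_TEST_SUITE_P(Fmha, Dropout, ::testing::Values(0, 1));
```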

## Test plan

- [x] ROCm 7.1.1 (clang 20): 168/168 fp16 dropout tests PASS (no skip
active)
- [x] ROCm 7.12 (clang 22): 132 PASS, 36 SKIPPED, 0 FAILED
- [x] bf16 dropout tests: 168/168 PASS (unaffected by this change)
2026-04-14 14:07:20 +00:00
Yaswanth Raparti
d988d55275 [CK][CK TILE] Modify elementwise kernel template signature to accept independent type arguments (#6399)
## Motivation

Modify the elementwise kernel template signature to fix a CShuffle
epilogue build error.

## Technical Details

Encountered a build error while building the conv fallback kernel with
the dispatcher.
Error: type mismatch in `ElementWiseKernel::operator()`, where the
template required all three parameters (lens, input_strides,
output_strides) to be the same type, while the CShuffle epilogue was
passing them with different tuple element types.

Solution: modified the template signature in elementwise_kernel.hpp to
accept three independent type parameters:

- Changed the single typename `Dims` to typename `DimsLens`, typename
`DimsInStrides`, and typename `DimsOutStrides`
- Updated references to `Dims::size()` to use the appropriate specific
type
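
A before/after sketch of the signature change; function names and bodies are placeholders:

```cpp
// Before (sketch): a single typename forced lens and both stride tuples to
// share one type, which rejected the CShuffle epilogue's mixed element types.
template <typename Dims>
void apply_elementwise_old(const Dims& lens, const Dims& in_strides, const Dims& out_strides);

// After (sketch): three independent typenames; size queries now name a
// specific type (the former Dims::size() becomes e.g. DimsLens::size()).
template <typename DimsLens, typename DimsInStrides, typename DimsOutStrides>
void apply_elementwise_new(const DimsLens& lens,
                           const DimsInStrides& in_strides,
                           const DimsOutStrides& out_strides)
{
    static_assert(DimsLens::size() == DimsInStrides::size() &&
                  DimsLens::size() == DimsOutStrides::size(),
                  "ranks must match even though element types may differ");
    // ... elementwise traversal ...
}
```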

## Test Plan

- Test with dispatcher conv unit tests
- Relying on CI tests

## Test Result
- Dispatcher unit tests passed
- Relying on CI tests

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-14 01:44:27 -06:00
chris-tsiaousis-hpc
89c5e67028 [CK Tile] Unification work - mma transformations pipeline (#5508)
## Motivation

In this PR we showcase how the amdgcn structs can be used in a pipeline that does some extra pre/post processing.
For the sparse intrinsics, so far we compressed the A vector "on the fly" right before the execution of the builtin. This might introduce performance issues down the line if, for example, the user decides to chain multiple sparse builtins. We tackle this problem by creating a specific SparseCompressTransform.

A MmaPipelineBase is also created to facilitate these kinds of higher-level compositions of the amdgcn structs and is integrated into the existing WaveWiseMma prototype. There is an effort to facilitate future operations, like swizzle A/B, C transpose or double/quad attr num access, through the MmaPipelineOptionFlags, but those are not yet defined and should be defined in a future PR.
The pipeline base class is essentially at the RFC stage.

We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline.
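
A conceptual sketch of the compress-once idea; apart from the `SparseCompressTransform` name, every type and function here is a hypothetical stand-in:

```cpp
#include <array>
#include <cstdint>

// Conceptual sketch only: the compression body and the mma issue function
// are placeholders, not the real intrinsic wrappers.
struct SparseCompressTransform
{
    // Stand-in: pack a 2:4-sparse A fragment into its compressed form.
    std::array<uint32_t, 2> operator()(const std::array<uint32_t, 4>& a) const
    {
        return {a[0], a[1]}; // placeholder for the real compression
    }
};

void issue_sparse_mma(const std::array<uint32_t, 2>& /*a_compressed*/) {}

void chain_sparse_mmas(const std::array<uint32_t, 4>& a, int n_chained)
{
    // Before: each sparse builtin re-compressed A "on the fly".
    // After: compress once, then reuse across the chained builtins.
    const auto a_compressed = SparseCompressTransform{}(a);
    for(int i = 0; i < n_chained; ++i)
        issue_sparse_mma(a_compressed);
}
```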

## Technical Details

The goal is to keep the pipeline easily expandable. Should the CRTP of the base class, or the interface in general, prove insufficient or unable to handle all of our needs, a design modification should be discussed.

## Test Plan

New tests are added.

## Test Result

Tests should pass.

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-04-14 09:25:01 +02:00
Brock Hargreaves
5eee93e67c [MIOPEN] [CK] Revert "[CK] Disable test cases affected by compiler codegen bugs on gfx90a" (#6400)
Reverts ROCm/rocm-libraries#6343

This is causing failures in MIOpen, namely Dbsync on gfx942, even though it shouldn't be affected, so this needs to be investigated. Please add miopen as a label to the new PR for addressing the compiler codegen bug so that this can be addressed simultaneously.
2026-04-13 20:46:07 -06:00
Ville Pietilä
6e0454216d [CK] Disable compilation of problematic bwd weight conv instances for gfx90a (#6343)
## Motivation

Due to a compiler version update, there are test failures in the test
suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current `develop` branch, the test
failures are blocking any PR merges into `develop`. An example of a
failed CI run is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as tests are passing for clang compiler version 20.0 and failing
for clang compiler version 22.0.

## Technical Details

This PR disables the compilation of the problematic bwd weight conv
instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that
allows us to detect when we are compiling for `gfx90a`. Using the new
CMake flag, compilation of instance
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`.

Co-authored-by: Ville Pietilä <>
2026-04-13 13:40:27 +02:00
Kiefer van Teutem
6cd016dde4 [CK Tile] Add Tile Distribution Encoding Calculator (#5515)
## Motivation

We want to be able to calculate TileDistributionEncodings describing
register mappings for any MmaOp. This is necessary for further
integration with CK Tile.

This MR adds a new struct TileDistrEncCalc, which takes an amdgcn_mma
type (MmaOp) and provides ABC warp distribution encodings for mapping
matrix fragment coordinates to register coordinates
(lane, vector item) and vice versa. It is able to take CTranspose,
Swizzle, and NumAccessA / NumAccessB template parameters for tweaking
the tile distributions. Swizzle modification will be implemented later.

The current implementation can deal with all intrinsic types and
block-hiding.

This MR also adds some additional static asserts and derived params
within amdgcn_mma_base, to enforce consistency and help calculate Tile
Distributions for block-hiding intrinsics.

An example was added that uses the Tile Distr Enc Calc to calculate and print
register layouts for Tile Distributions for some of our amdgcn_mma
structs. It also makes sure that the CTranspose modifier works as
intended.

Some additional gfx9 intrinsics were added to test block-hiding layouts
for the different types of C-block-hiding layouts.

The sparse intrinsic wrappers were updated according to Chris's recent
changes in another branch
(https://github.com/ROCm/rocm-libraries/pull/5508), which moved the
compression step outside of the intrinsic itself. This is necessary to
make sure that the Calculator can deal with this new interpretation of
the sparse intrinsics. I directly copied the new amdgcn structs from
Chris's branch and changed nothing else to avoid more complex merges in
the future. Note that this means I did not update a number of related
sparse-code sites, since that would be a large change, and therefore I
disabled test_amdgcn_sparse_mma for now.

The amdgcn_mma_layout test was refactored a bit:
- The old register mapping utility was removed and its use was replaced
by the new TileDistrEncCalc
- More tests were added to test layouts for different types of
block-hiding and sparse intrinsics
- The Selector method was removed and the tests were split up over
target architectures, with each target arch having a direct list of
amdgcn structs to be tested. This ensures that we force specific tests
on specific architectures and makes sure that the selector doesn't
quietly do some workarounds like creating compound intrinsics.

## Test Results

Layout tests based on calculated tile distribution encodings pass on all
architectures. Calculator works for all currently added amdgcn structs,
which includes different types of block-hiding and sparse intrinsics.
Printed layouts from new example verified by eye. CTranspose modifier
tested for large set of intrinsics.
2026-04-13 08:00:31 +00:00
Aviral Goel
4ccbcbe0a4 CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302)
Depends on #6300 

## Summary

Remove 41 commented-out code blocks across 33 files in Composable
Kernel, totaling ~200 lines.

Identified using an automated dead code scanning skill (`ck-dead-code`)
with a calibrated two-stage pipeline:
1. **Pre-filter**: Keyword-based scan found 1,338 `//`-commented blocks.
Calibrated heuristics (trained on 50-sample expert classification)
reduced to 89 high-confidence candidates — 93% noise reduction.
2. **Expert triage**: LLM expert classified each block in context as
CODE_REMOVE, CODE_KEEP, or NOT_CODE.

| Classification | Count |
|---------------|-------|
| Removed (this PR) | 41 |
| Kept (debug helpers, alt configs, reference impls) | 32 |
| Not code (false positives) | 16 |

Removed blocks include: superseded implementations, old test data,
abandoned stubs, unreachable code, and buggy dead code.
2026-04-10 11:17:11 -04:00
Hosang Yoon
65ad35becd [CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12 (#6156)
## Motivation
On gfx11/gfx12, FMHA forward kernels that require head-dim padding show
a large performance drop compared to the exact-head-dim path. In
practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too
far off the fast path.

This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while
keeping the behavior for other GPUs unchanged.

## Technical Details

- Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path
for gfx11/gfx12.
- For `receipt=0`, keep support conservative and only enable the padded
fast path for vector-safe cases (`head_dim % 8 == 0`), matching the
existing assumption used on other GPUs.
- Move `v_prefetch` later only for the head-dim-padded path on
gfx11/gfx12. This reduces live ranges and removes the register-spill
behavior seen in the earlier scheduling.
- Enable the buffer-load OOB check offset trick for the padded path on
gfx11/gfx12.

## Test Plan

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

## Test Result

Observed padded-head-dim performance improvements for HDIM=72/80:

- gfx11: ~3.5x
- gfx1151: ~2.0x
- gfx12: ~1.3x


## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-08 10:51:53 -04:00
Christopher Millette
870112b861 [CK_TILE] Flatten nested static_for loops into static_ford (#5939)
## Summary
Mechanical conversion of 129 nested `static_for`/`static_ford` patterns
to flat `static_ford` across 29 ck_tile header files.

Each conversion eliminates intermediate lambda closure instantiations by
replacing nested compile-time loops with a single flat iteration using
index decomposition.

### What `static_ford` eliminates

When `static_for` loops are nested, each level creates unique closure
types:
```cpp
// BEFORE: M + M×N = 20 IR functions (for M=4, N=4)
static_for<0, 4, 1>{}([&](auto m) {        // 4 closure instantiations
    static_for<0, 4, 1>{}([&](auto n) {     // 4×4 = 16 closure instantiations
        body(m, n);
    });
});

// AFTER: M×N = 16 IR functions (with ford_applier, no intermediates)
static_ford<sequence<4, 4>>{}([&](auto mn) {
    constexpr auto m = number<mn[number<0>{}]>{};
    constexpr auto n = number<mn[number<1>{}]>{};
    body(m, n);
});
```

### Pattern categories converted

| Category | Count | Description |
|----------|-------|-------------|
| C (2-level `static_for` chains) | 112 | Nested `static_for` →
`static_ford` |
| C3 (3-level `static_for` chains) | 9 | Three consecutive nests →
`static_ford` |
| Partial rescue | 3 | Outer 2 levels of blocked 4-level nests |
| B (nested `static_ford` merge) | 5 | Two nested `static_ford` → single
higher-dim `static_ford` |
| **Total** | **129** | Across 29 files |

6 false positives were detected and reverted (in `tensor_adaptor.hpp`,
`tile_distribution.hpp`, `tile_distribution_encoding.hpp`) where the
inner loop bound depended on the outer variable.

### Files changed by family

| Family | Files | Sites |
|--------|-------|-------|
| Block GEMM | 12 | ~20 |
| FlatMM pipelines | 4 | ~69 (including 5 ford-ford merges) |
| GEMM quant | 7 | ~22 |
| FlatMM kernel | 1 | 2 |
| FMHA | 1 | 2 |
| Reduce/norm | 2 | 2 |
| Epilogue | 1 | 1 |

### Blocked locations from review comments

- **block_gemm_areg_breg_creg_v1.hpp:356** — BLOCKED: runtime scale
loads (`scale_a_slice`, `scale_b_slice`, A warp tensor load) between
every nesting level
- **block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp:228** — BLOCKED:
`zero_accumulators()` before inner loop; `sched_barrier` + conditional
`block_sync_lds()` after inner loop
- **block_universal_gemm_as_aquant_bs_bquant_cr.hpp:298** — BLOCKED:
runtime `CWarpTensor` construction before inner loop; quantization scale
application code after inner loop
- **block_universal_gemm_as_aquant_bs_cr.hpp:277** — BLOCKED: same
pattern as above
- **block_universal_gemm_as_bs_bquant_cr.hpp:367** — BLOCKED: same
pattern as above

## Depends on
- #5938 ([CK_TILE] Optimize static_ford and sequence compile-time
infrastructure) — provides the `ford_applier` that makes these
conversions beneficial. Without it, `static_ford` uses a recursive
implementation that provides no IR function savings.

## Results (combined with #5938)

### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942)

| Target | Base (s) | Treat (s) | Delta | % | Significant? |
|--------|----------|-----------|-------|---|-------------|
| **flatmm** | 161.1 | 149.0 | **-12.1s** | **-7.5%** | **YES** (p<0.01,
7/7 wins) |
| **universal_gemm** | 225.4 | 220.3 | **-5.1s** | **-2.3%** | **YES**
(p<0.01, 7/7 wins) |

### IR Function Counts (device trace, gfx942)

| Target | InstFunc | CodeGen |
|--------|----------|---------|
| universal_gemm | **-8.5%** | **-9.2%** |
| flatmm | **-7.6%** | **-10.5%** |

### ASM Equivalence
5/5 PASS — 650,151 lines verified identical (gfx942). TUs:
universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale.

## Test plan
- [x] ASM equivalence verified (650K lines, gfx942)
- [x] Wilcoxon timing verified (7 trials, p<0.01)
- [x] IR function counts verified (-7.6% to -10.5% CodeGen reduction)
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2026-04-07 08:36:45 -06:00
Po Yen Chen
6dc44114ba [CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051)
## Motivation

The existing FMHA V3 pipeline only supports fp16/bf16 data types. This
PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950,
enabling higher throughput for FP8 inference workloads using the
assembly-optimized V3 code path.

## Technical Details

**Warp GEMM:**
- Add FP8 32x32x32 warp gemm with C-transposed distribution
(`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries

**V3 Kernel (`fmha_fwd_v3_kernel.hpp`):**
- Add per-tensor descale support for Q, K, V tensors, passing descale
pointers through to pipeline kargs

**V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`):**
- Add FP8 data path with dtype-aware type selection
- Add asm volatile P matrix conversion from f32 to fp8
- Add FP8-aware instruction scheduling in `CoreLoopScheduler`

**V3 Pipeline Policy (`block_fmha_fwd_v3_pipeline_default_policy.hpp`):**
- Add FP8 QK warp gemm selection (SwizzleB variant for V tile
distribution compatibility)

**Codegen (`fmha_fwd.py`):**
- Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128)
- Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale: no/pertensor)
- Extend `can_dispatch_v3` condition for fp8bf16 + pertensor

**Misc:**
- Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum (`arch.hpp`)
- Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`)

V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`)
pending further validation.

## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950)

| Problem | Regular (TFlops) | Varlen (TFlops) |
|---|---:|---:|
| batch=1 heads=6/1 seqlen=1024 causal | 48.9 | 47.6 |
| batch=1 heads=6/1 seqlen=2048 causal | 119.8 | 117.4 |
| batch=1 heads=6/1 seqlen=4096 causal | 263.7 | 259.2 |
| batch=1 heads=6/1 seqlen=8192 causal | 548.9 | 543.6 |
| batch=1 heads=6/1 seqlen=16384 causal | 1043.0 | 1063.7 |
| batch=1 heads=6/1 seqlen=32768 causal | 1237.2 | 1279.6 |
| batch=1 heads=6/1 seqlen=65536 causal | 1315.4 | 1382.7 |
| batch=1 heads=6/1 seqlen=131072 causal | 1326.3 | 1402.2 |
| batch=1 heads=16/1 seqlen=65536 causal | 1298.7 | 1388.4 |
| batch=1 heads=40/40 seqlen=37200 non-causal | 1248.9 | 1326.1 |

## Test Plan

Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering
batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40),
GQA ratios (1:1, 1:8), and
causal/non-causal modes. Verified all cases dispatch to the V3 pipeline
by enabling `F_is_v3_enabled` and confirming kernel names contain
`qr_async_trload_v3`.

## Test Result

176/176 tests passed with V3 enabled. All cases correctly dispatched to
V3 pipeline with `pertensor` quantization.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-07 22:19:28 +08:00
Jeff Huang
449844e3d3 [CK_TILE] Temporarily remove batch prefill KV cache overflow asserts (#6201)
## Summary
- Temporarily remove the KV cache offset overflow assert checks in
`FmhaBatchPrefillWithPagedKVCacheKernel`
- The asserts are **correct**, but they block project progress in
certain configurations
- This is a **temporary workaround** to unblock progress; a proper fix
will follow

## Note
This is NOT a permanent solution. A follow-up PR will add proper
overflow handling that addresses the underlying issue without blocking
progress.
2026-04-07 20:41:24 +08:00
Bartłomiej Kocot
4112e08d0c [CK][CK Tile] Conv Bwd Data flush cache and profiling improvements (#6090)
## Motivation

Improve accuracy of conv bwd data perf measurements

## Technical Details
- enable flush cache
- for grouped conv we zero the conv input (gemm output) inside the device
op, so we also include this in the time measurement
- for non-grouped conv we zero the conv input (gemm output) outside the
device op (in profile_conv_bwd_data_impl.hpp), so it is not included.
- In this PR I changed it to include the zeroing when
time_kernel/flush_cache is enabled, so the comparison should now be
fairer. I changed it only for time_kernel/flush_cache because MIOpen runs
its own zeroing for non-grouped solvers.

## Test Plan

test_grouped_conv_bwd_data_*

## Test Result

CI pending

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-04 00:22:22 +00:00
harkgill-amd
12fe2c3de4 Add missing gfx1033 to gfx103 group definition in ck (#5141)
## Motivation

Resolving PyTorch build failures when enabling builds for gfx103X-all
family in TheRock. https://github.com/ROCm/TheRock/pull/3763. `gfx1033`
is the only failing architecture in the family and the failures point to
missing support in CK.

## Technical Details

PyTorch build fails with repeated error message
```
/__w/TheRock/TheRock/external-builds/pytorch/pytorch/aten/src/ATen/../../../third_party/composable_kernel/include/ck/utility/amd_buffer_addressing_builtins.hpp:33:48: error: use of undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'
   33 |     wave_buffer_resource.config(Number<3>{}) = CK_BUFFER_RESOURCE_3RD_DWORD;
      |                                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
`gfx1033` is missing from the `__gfx103__` group, which results in
`CK_BUFFER_RESOURCE_3RD_DWORD` never being defined for it. Adding
`gfx1033` to the affected files should be the minimal fix needed to allow
torch builds to pass.
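
A hedged sketch of the shape of the fix; which other gfx103x defines belong to the group is an assumption here, and the dword value is taken from the test plan below:

```cpp
// Sketch of the architecture-group pattern; the member list besides gfx1033
// is assumed, not copied from the patch.
#if defined(__gfx1030__) || defined(__gfx1031__) || defined(__gfx1033__) // gfx1033 was missing
#define __gfx103__
#endif

#ifdef __gfx103__
// With gfx1033 in the group, the buffer resource dword is now defined for it
// during device compilation (value from the test plan below).
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#endif
```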

## Test Plan

Compile sample test file and target gfx1033
```
...
#ifdef __HIP_DEVICE_COMPILE__
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == 0x31014000, "wrong device value");
#else
static_assert(CK_BUFFER_RESOURCE_3RD_DWORD == -1, "wrong host value");
#endif
```

## Test Result

Prior to the applying patch, compilation fails with `error: use of
undeclared identifier 'CK_BUFFER_RESOURCE_3RD_DWORD'`

After applying patch, test file compiles successfully.

## Submission Checklist

- [X] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-04-03 13:44:38 -06:00
Thrupti Raj Lakshmana Gowda
8c42eaf3f8 [CK Tile] Fix architecture-dependent EightWave assignment in cshuffle_epilogue (#6102)
## Motivation
Tile engine CI build on the develop branch started failing after a
recent change(https://github.com/ROCm/rocm-libraries/pull/5218) in
`cshuffle_epilogue.hpp`. The `EightWave` constant was unconditionally
computed as `(MWave * NWave == 8)` for all architectures, but this logic
is only valid for gfx9*. On other architectures (e.g., gfx1201),
`EightWave` must always be `false`, otherwise it leads to incorrect
`BlockedXDLN_PerWarp` computation and build failures.

## Technical Details
In `cshuffle_epilogue.hpp`, the `EightWave` static constexpr was set as:
```cpp
static constexpr bool EightWave = (MWave * NWave == 8);
```
This was applied regardless of the target GPU architecture. The fix uses
a preprocessor guard to make this architecture-aware:
- **gfx9* (`__gfx9__`):** `EightWave` is evaluated as `(MWave * NWave ==
8)` — true or false depending on the wave configuration
- **All other architectures:** `EightWave` defaults to `false`
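
A sketch of the guarded definition described above; the surrounding problem struct is simplified to the relevant member:

```cpp
// Hedged sketch: only the #if defined(__gfx9__) split and the expression
// come from this description.
template <int MWave, int NWave>
struct CShuffleEpilogueProblemSketch
{
#if defined(__gfx9__)
    // gfx9*: the eight-wave layout is valid, so derive it from the config.
    static constexpr bool EightWave = (MWave * NWave == 8);
#else
    // Other architectures (e.g. gfx1201): must stay false, otherwise
    // BlockedXDLN_PerWarp is computed incorrectly.
    static constexpr bool EightWave = false;
#endif
};
```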

## Test Plan
- Tile engine CI build on develop branch

## Test Result
- *Pending CI*

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---
> **Note:** This PR supersedes ROCm/rocm-libraries#5436, which is
blocked pending a review approval from a reviewer currently on PTO. The
same changes have been applied to this branch
(`users/tlakshma/ck/develop-clone`) to allow merging.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2026-04-02 20:55:03 -06:00
Christopher Millette
522902f29b [CK_TILE] Optimize static_ford and sequence compile-time infrastructure (#5938)
## Problem

Each `static_for<0, N, 1>` instantiates its lambda N times (one per
`number<I>` type). When nested, intermediate lambdas capture the outer
loop variable (a different type per iteration), creating unique closure
types. For a 3-level nest with M=4, N=4, K=2, this produces 4 + 16 + 32
= 52 IR functions, of which 20 are intermediate closures that get
inlined away but still cost frontend compile time.

ck_tile's `static_ford` was supposed to eliminate these intermediates
(as old CK's PR #5031 did successfully), but it used a **recursive**
`static_ford_impl` that recreated the same closure pattern plus added
`reorder_old_to_new`/`reorder_new_to_old` overhead.

Additionally, the sequence utility layer (`sequence_sort`,
`is_valid_sequence_map`) used recursive template metaprogramming that
generated O(N log N) intermediate types for every permutation validation
— called on every `reorder_new_to_old`/`reorder_old_to_new` invocation.

## Changes

### 1. Replace `sequence_sort` with constexpr insertion sort
Replace recursive merge sort (`sequence_sort_impl` +
`sorted_sequence_merge_impl`, O(N log N) intermediate type
instantiations) with constexpr insertion sort using `static_array`. O(1)
template depth, same `::type` and `::sorted2unsorted_map` API.

### 2. Replace `is_valid_sequence_map` with constexpr check
Replace sort-based permutation validation (which instantiated the full
`sequence_sort` chain) with a constexpr "seen array" loop. O(N)
constexpr steps instead of O(N log N) template instantiations.

### 3. Replace recursive `static_ford` with flat-loop `index_decomposer`
Replace `static_ford_impl` (recursive `static_for` nesting +
`pop_front`/`push_back` + `reorder_old_to_new` per iteration) with flat
`index_decomposer` using pre-computed strides. Add `decompose_reordered`
alias that folds reordering into decomposition, and `inverse_perm`
helper that avoids the `sequence_map_inverse` → `is_valid_sequence_map`
→ `sequence_sort` chain.

### 4. Eliminate internal lambda via `ford_applier`
The flat-loop approach still used `static_for` with a lambda, creating
M×N internal lambda instantiations per call site. Replace with
`ford_applier` struct that calls `f(decompose<I>{})` directly via fold
expression — zero intermediate closures:

```cpp
// Before: 2×M×N function instantiations
static_for<0, M*N, 1>{}([&](auto i) { f(decompose<i>{}); });

// After: M×N function instantiations (50% reduction)
ford_applier<Decomposer, make_index_sequence<M*N>>{}(f);
```

Also unified identity and non-identity order paths into a single
template with `constexpr if`.

### 5. Fix const-qualified sequence handling
Fix `is_valid_sequence_map` to handle const-qualified sequence types via
`remove_cvref_t` in callers (`tensor_adaptor.hpp`,
`tile_distribution_encoding.hpp`).

## Results (this PR only, without flattening)

### Build Time (Wilcoxon signed-rank, 7 paired trials, gfx942, load ~5)

| Target | Base (s) | Treat (s) | Delta | % | Wins | Significant? |
|--------|----------|-----------|-------|---|------|-------------|
| **flatmm** | 160.1 | 152.7 | **-7.4s** | **-4.6%** | 6/7 | **YES**
(W+=1, p<0.05) |
| universal_gemm | 228.4 | 224.7 | -3.7s | -1.6% | 6/7 | Trending (W+=4)
|

Per-trial diffs (flatmm): [-6, -20, -9, -8, -8, 4, -5]
Per-trial diffs (universal_gemm): [-2, -6, 4, -3, -2, -11, -6]

### IR Function Counts (device trace, gfx942)

| Target | Metric | Before | After | Delta | % |
|--------|--------|--------|-------|-------|---|
| **universal_gemm** | InstantiateFunction | 117,715 | 109,165 |
**-8,550** | **-7.3%** |
| **universal_gemm** | CodeGen Function | 47,912 | 45,044 | **-2,868** |
**-6.0%** |
| **flatmm** | InstantiateFunction | 100,939 | 95,127 | **-5,812** |
**-5.8%** |
| **flatmm** | CodeGen Function | 42,651 | 40,367 | **-2,284** |
**-5.4%** |

Note: The `ford_applier` (commit 3) has minimal additional effect in
this PR since ck_tile code does not yet use `static_ford` extensively.
Its impact compounds when the follow-up flattening PR #5939 converts 124
`static_for` nests to `static_ford`. Combined results with #5939: flatmm
**-7.5%** wall time (p<0.01), CodeGen **-10.5%**.

### ASM Equivalence
7/7 PASS — 979,943 lines of device assembly verified identical (gfx942 +
gfx1100). TUs: universal_gemm, flatmm_basic, fmha_bwd, reduce, bscale.

## Test plan
- [x] `test_ck_tile_static_ford`: 13 behavioral tests
(identity/non-identity orders, 1D-4D, unit dimensions, edge cases)
- [x] `ck_tile_unit_sequence`: 88 tests (11 new for sorted2unsorted_map,
is_valid_sequence_map edge cases, sequence_unique_sort map round-trip)
- [x] ASM equivalence verified (980K lines)
- [x] Wilcoxon timing verified (7 trials, flatmm p<0.05)
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-02 15:25:14 -06:00
Linjun-AMD
ba0efe01af [CK Tile] Add sink token gradient support in FMHA backward pass (#5504)
## Motivation

Adds sink token support to the FMHA backward kernel (dot_do_o pipeline).

## Technical Details

- Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType
- Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs
- Compute per-head sink gradient via atomic accumulation in the pipeline
- Update example runner with reference validation for sink gradient
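
A hedged sketch of the per-head atomic accumulation mentioned above; the kernel shape and the partial-reduction layout are placeholders, only `d_sink_ptr` and the atomic idea come from this list:

```cpp
#include <hip/hip_runtime.h>

// Hedged sketch: many workgroups contribute to the same head's sink
// gradient, so the accumulation must be atomic rather than a plain store.
__global__ void accumulate_d_sink(float* d_sink_ptr, const float* partials,
                                  int nhead, int parts_per_head)
{
    const int head = blockIdx.y;
    const int part = blockIdx.x * blockDim.x + threadIdx.x;
    if(head < nhead && part < parts_per_head)
    {
        atomicAdd(&d_sink_ptr[head], partials[head * parts_per_head + part]);
    }
}
```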

## Test Plan

Add new test case

## Test Result

WIP

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-02 11:17:01 +08:00
Chinmay Dattanand Kuchinad
820ed2dbb3 [CK] Fix async pivot mismatch in persistent GEMM kernel scheduler (#5776)
## Motivation

Fix pivot mismatch in the persistent GEMM kernel's async input scheduler
that causes **GPU hangs** and incorrect results when used with AsyncTP
(Asynchronous Tensor Parallelism) on ROCm.

PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM
kernel with chunk signals to overlap communication and computation. The
pivot mechanism ensures each rank starts computing from its own local
shard first (which is already available), then moves to remote chunks as
they arrive over the network.

Because of the pivot mismatch, the kernel frequently waits on signals
for chunks that have not yet arrived, while attempting to read data from
completely different chunks. This synchronization desync reliably
triggers infinite hangs during multi-GPU native AsyncTP execution. This
fix is required to enable functional AsyncTP support on ROCm.

## Technical Details

In the persistent kernel loop (`UniversalGemmKernel::operator()`), the
M-tile coordinate used for data selection (`i_m`) and the M-tile
coordinate used for the chunk-signal wait (`chunk_idx`) were derived
from inconsistent bases:

* `i_m` was computed from the **unpivoted** tile index `iM`.
* `chunk_idx` was computed from the **pivoted** expression `(iM +
tile_idx_pivot)`.

This means the kernel could wait for chunk N's signal but then read from
chunk M's memory, or vice versa. The mismatch scales with GPU count:
with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc.

**The Fix:**
Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m`
and `chunk_idx` from it. This guarantees the kernel always waits for the
correct chunk before reading its data.
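
A hedged sketch of the fix; only `iM`, `tile_idx_pivot`, and `iM_eff` are named in this description, and the surrounding arithmetic is illustrative:

```cpp
#include <cstdint>
using index_t = int32_t;

// Hedged sketch: derive both the data index and the signal index from one
// pivoted M-tile index instead of mixing pivoted and unpivoted bases.
void pick_tile(index_t iM, index_t tile_idx_pivot, index_t num_m_tiles,
               index_t tiles_per_chunk, index_t& i_m_tile, index_t& chunk_idx)
{
    // Before: i_m came from the unpivoted iM while chunk_idx came from
    // (iM + tile_idx_pivot) -- the kernel could wait on chunk N's signal
    // while reading chunk M's data.

    // After: a single pivoted index feeds both.
    const index_t iM_eff = (iM + tile_idx_pivot) % num_m_tiles;
    i_m_tile  = iM_eff;                    // data selection
    chunk_idx = iM_eff / tiles_per_chunk;  // chunk-signal wait
}
```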

*(Note: Minor cosmetic `clang-format` changes were also pulled in
alongside the fix).*

## Test Plan

1. Build PyTorch with this CK change.
2. Run the specific multi-GPU AsyncTP native test:
`timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest
test/distributed/test_symmetric_memory.py -k
test_fused_all_gather_matmul_native -q -s -x`

## Test Result

Tests verify correct overlapping execution without hangs or accuracy
mismatches when running the AsyncTP native path with non-zero pivots.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-04-01 09:21:20 -07:00
aledudek
357a140e7b [CK_TILE] Add pooling in tile_engine (#4469)
## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve.
-->
Add pooling in ck tile engine

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2026-04-01 07:31:46 +00:00
Yi DING
9b8b2456b4 [CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991)
## Motivation

After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard
around the `num_total_loop <= 0` early-exit check, the IGLP pipeline
(`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV
gradients for non-masking kernels (even with the fix in #5915). Assembly
inspection confirms that the CFG change causes the LLVM register
allocator to reuse AGPR accumulators as scratch destinations in the
dK/dV reduction loop, breaking the loop-carried accumulation across
Q-tile iterations.

## Technical Details

- Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in
`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing:
it restores the CFG shape that the register allocator needs to correctly
assign dedicated AGPRs to each column of the dK/dV accumulator.
- Only the IGLP pipeline is affected; the other two BWD pipelines do not
exhibit this issue.
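
A hedged sketch of the early exit; the real pipeline `operator()` is far more involved:

```cpp
#include <hip/hip_runtime.h>

// Hedged sketch; pipeline.run() stands in for the dK/dV main loop.
template <typename Pipeline>
__device__ void run_bwd_pipeline(Pipeline& pipeline, int num_total_loop)
{
    if(num_total_loop <= 0) [[unlikely]]
    {
        // Marking the early exit unlikely restores the CFG shape in which the
        // register allocator assigns dedicated AGPRs to each dK/dV accumulator
        // column instead of reusing them as scratch in the reduction loop.
        return;
    }
    pipeline.run(num_total_loop); // placeholder for the dK/dV main loop
}
```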

## Test Plan

## Test Result

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-04-01 13:44:04 +08:00