composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-13 17:55:48 +00:00

Author	SHA1	Message	Date
Linjun-AMD	5c7b7ec3f1	[rocm-libraries] ROCm/rocm-libraries#7272 (commit d02f3c0) [ck_tile][fmha_bwd] Fix sink_host OOB in group mode reference runner (#7272) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary In `fmha_bwd_runner.hpp`, the `sink_host` `HostTensor` is allocated with first dimension `shape_batch` (= 1 in group mode), but the reference forward loop accesses `sink_host(wb, i_h)` with `wb ∈ [0, batch-1]`. For any `wb >= 1` this is an out-of-bounds heap read, silently corrupting the reference forward math chain (`lse_host`, `o_host`) and turning the bwd-side `d_sink_head_acc` reference into non-deterministic garbage. `HostTensor::operator()` does not bounds check, so the OOB is not caught at runtime. This manifests as intermittent `tile_example_fmha_bwd` failures (25–67% fail rate) when `-sink_grad=1` is combined with `-mode=1` (group mode), with bit-exact but spurious `max_err` values like 4.27 / 14.6. ## Fix One-line: allocate `sink_host` with `batch` (the real per-batch dim) instead of `shape_batch`, mirroring how `sink_host` is accessed by the loop. ```diff - sink_grad ? std::array<ck_tile::index_t, 2>{shape_batch, nhead} + sink_grad ? std::array<ck_tile::index_t, 2>{batch, nhead} Repro tile_example_fmha_bwd -b=2 -h=2 -s=516 -s_k=253 -prec=bf16 -d=72 \ -bias=n -dbias=0 -p_drop=0 -iperm=1 -operm=1 -deterministic=0 \ -v=3 -mode=1 -kname=1 -sink_grad=1 Verification - 0/30 fail on the repro config after fix - Baselines (before fix): - sink=1, mask=n: 25% fail rate (p ≈ 1.8e-4) - sink=1, mask=t: 67% fail rate (p ≈ 6e-15) Attribution Shape bug introduced together with sink_grad in #5504. Unrelated to #6914 (which is a fwd-only fix on a different code path) ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-13 08:49:13 +00:00
Illia Silin	b05040b919	[rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f) [CK] Fix latest batch of staging compiler warnings ## Motivation Suppress the new batch of clang lifetimebound and invalidation warnings with the latest staging compiler. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 14:15:31 +00:00
Chao	1d1be9e3de	[rocm-libraries] ROCm/rocm-libraries#6529 (commit 93a6097) [CK_TILE] Enable V3 persistent kernel dispatch for FMHA forward on gfx950 (#6529) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit [CK_TILE] Enable V3 persistent kernel dispatch for FMHA forward on gfx950 ## Motivation Enable the existing V3 persistent kernel path for CK-Tile FMHA forward on gfx950 (MI350X/MI355X). The V3 kernel and codegen infrastructure already exist but are disabled via hardcoded `F_is_v3_enabled=False`. This change replaces the compile-time gate with a runtime environment variable `CK_FMHA_ENABLE_V3=1` (disabled by default, opt-in). When enabled: - Prefill workloads (seqlen_q > 1) dispatch to V3 persistent pipeline - Decode workloads (seqlen_q == 1) always use V2 (memory-bound, better suited) The V3 persistent kernel uses grid-stride scheduling, XCD-interleave tile assignment for L2 locality, LPT reversal for causal masks, and gfx950 async buffer loads. ## Technical Details Single file: `example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py` - Add `#include <cstdlib>` and `<string>` for `std::getenv` - Replace `{F_is_v3_enabled}` template parameter with runtime env var check - Add `seqlen_q > 1` guard (decode always uses V2) - Remove `.format()` call in `write_fwd_api()` ## Dependencies Depends on https://github.com/ROCm/rocm-libraries/pull/6501 — builds on XCD-interleave and LPT scheduling infrastructure. ## Test Plan - GPU validation on MI300X (gfx942, ROCm 6.4.1): - Command: `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3` - GPU validation on MI350X (gfx950, ROCm 7.0): - Command (V2): `./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3` - Command (V3): `CK_FMHA_ENABLE_V3=1 ./build/bin/tile_example_fmha_fwd -b=2 -h=8 -s=4096 -d=128 -prec=bf16 -v=1 -warmup=1 -repeat=3` - Command (decode, always V2): `./build/bin/tile_example_fmha_fwd -b=64 -h=32 -h_k=8 -s=1 -s_k=4096 -d=128 -prec=bf16 -mode=group -v=1 -warmup=1 -repeat=3` ## Test Result Benchmark results (MI350X, gfx950, ROCm 7.0): \| Config \| V2 (TFlops) \| V3 (TFlops) \| Speedup \| \|--------\|-------------\|-------------\|---------\| \| Non-causal b=2 h=8 hk=2 s=4096 d=128 bf16 \| 696.3 \| 884.2 \| +27.0% \| \| Causal b=2 h=8 hk=2 s=4096 d=128 bf16 \| 371.3 \| 494.9 \| +33.3% \| \| GQA b=2 h=32 hk=8 s=2048 d=128 bf16 \| 671.3 \| 831.7 \| +23.9% \| \| LLaMA-70B b=1 h=64 hk=8 s=4096 d=128 bf16 \| 761.5 \| 927.3 \| +21.8% \| \| Causal GQA b=2 h=32 hk=8 s=2048 d=128 bf16 \| 345.4 \| 631.9 \| +82.9% \| \| Long-seq b=1 h=16 s=16384 d=128 bf16 \| 797.8 \| 969.9 \| +21.6% \| \| Decode b=64 h=32 hk=8 s=1 s_k=4096 bf16 \| 1828 GB/s \| — (V2 path) \| unaffected \| Benchmark results (MI300X, gfx942, ROCm 6.4.1): V3 has 0% effect on MI300X — V3 relies on gfx950 async buffer loads and falls back to the V2 code path on gfx942. No regression on any config. \| Config \| TFlops / GB/s \| Time (ms) \| Delta vs baseline \| \|--------\|-------------\|-----------\|-------------------\| \| MHA bf16 b=2 h=8 s=4096 d=128 \| 342.98 TFlops \| 0.401 \| +0.1% \| \| MHA fp16 b=2 h=8 s=4096 d=128 \| 411.18 TFlops \| 0.334 \| +4.9% \| \| Causal MHA bf16 b=2 h=8 s=4096 d=128 \| 232.61 TFlops \| 0.296 \| +2.4% \| \| GQA 4:1 bf16 b=2 h=32 hk=8 s=2048 d=128 \| 320.07 TFlops \| 0.429 \| -1.4% \| \| GQA 8:1 bf16 b=2 h=64 hk=8 s=2048 d=128 \| 353.91 TFlops \| 0.777 \| +1.7% \| \| LLaMA-70B prefill b=1 h=64 hk=8 s=4096 d=128 bf16 \| 381.53 TFlops \| 1.441 \| +1.2% \| \| Long-seq bf16 b=1 h=16 s=16384 d=128 \| 388.61 TFlops \| 5.659 \| +1.4% \| \| Decode b=64 h=32 hk=8 s_k=4096 d=128 bf16 \| 693.40 GB/s \| 1.550 \| +0.3% \| All validation tests pass (`valid:y`) on both MI300X and MI350X. Additional validation: - `CK_FMHA_ENABLE_V3=0` correctly falls back to V2 (default behavior unchanged) - `CK_FMHA_ENABLE_V3=1` dispatches to V3 for prefill, V2 for decode - Validation passes across fp16/bf16, batch/group mode, causal/non-causal - No regression on decode path	2026-05-07 16:23:19 +00:00
Linjun-AMD	33b62ed087	[rocm-libraries] ROCm/rocm-libraries#6914 (commit b791478) [CK_TILE][FMHA] Fix sink un-mask under right-window and emit fp8bf16 batch_prefill sink kernels (#6914) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary Two related fixes to `ck_tile` FMHA so that StreamLLM-sink + sliding-window batch-prefill works correctly for fp8 KV / bf16 compute. Review the commits in this order: 1. `fmha: emit sink kernels for fp8bf16 batch_prefill` Extends `example/ck_tile/01_fmha/codegen/ops/fmha_batch_prefill.py` so the fp8(KV) / bf16(QO) batch-prefill codegen also emits the `mask=mask_enum::generic_with_sink` variant. Without this the runtime could not dispatch to a sink-aware kernel for the fp8bf16 path. 2. `fmha: respect right-window in IsOutOfSinkBound` The sink un-mask in `GenericAttentionMask::IsOutOfSinkBound` (local-mask branch) used `(i_y + x) > 1` as the gate, which conditioned on the row index instead of the column index. As a result, queries `1..sink-1` could attend to future sink positions (violating causal / right-window), while query `0` fell back to the plain causal mask. The fix replaces the guard with `i_x < i_y + x` so every query only sees sink columns up to its own right-window boundary. 3. `fmha: clarify IsOutOfSinkBound predicate comment` Doc-only follow-up that rewrites the comment above the predicate as a clause-by-clause explanation (`i_x < sink`, `i_x < i_y + x`, `y < y_total`, `i_y < x_total`). ## Test plan - [x] Repro on aiter `op_tests/test_batch_prefill.py` (fp8 + bf16_dequant modes with `sink=4`, `win_left=1023`, `softcap=0.0`, `sal=True`) now passes for all parametrized shapes. - [x] Existing fp16/bf16 batch-prefill paths (no sink) unchanged — codegen diff only adds the `generic_with_sink` variant for fp8bf16; existing kernel object lists unaffected. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-07 02:40:45 +00:00
Yi DING	207a95d5e4	[rocm-libraries] ROCm/rocm-libraries#6152 (commit 36b016a) [CK_TILE] Use Unified Workspace for FMHA BWD MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation `dq_acc` is the intermediate accumulation buffer used in FMHA backward pass for deterministic mode. The current implementation allocates it as a single rectangular tensor: ``` shape = [shape_batch, nhead, nsplits, shape_seqlen_q, hdim_q] ``` where `nsplits = launcher.dq_acc_splits` (a single scalar), computed from `max_seqlen_k` and shared across all batches. ### Problems 1. Memory waste: In group mode, each batch may have a different `seqlen_k`, but `nsplits` is computed from `max_seqlen_k`, causing batches with shorter `seqlen_k` to over-allocate in the split dimension. 2. Interface coupling: `fmha_bwd_args` exposes internal layout details such as `stride_dq_acc`, `nhead_stride_dq_acc`, `batch_stride_dq_acc`, and `split_stride_dq_acc`. The caller is responsible for computing these strides, but this logic belongs inside the kernel. ### Goals 1. Switch `dq_acc` buffer to a compact layout: batches are concatenated contiguously, with each batch occupying `nhead * nsplits_i * seqq_i * hdim_q` elements (nhead outermost). 2. *Remove all `_stride_dq_acc` fields from `fmha_bwd_args`, replacing them with a single `workspace_ptr`; the kernel splits this internally using a fixed layout. 4. `fmha_bwd_launcher` provides a workspace management interface: the caller only needs to allocate GPU memory and call `prepare_workspace()` — no layout computation required. 5. Isolate kernel internals from the caller API*: the `dq_acc` layout (nsplits, strides, buffer size) is determined entirely inside the launcher/kernel. Future changes to block shape, pipeline type, or persistent kernel strategy require no modifications to the caller's `fmha_bwd_args` or workspace allocation logic. ## Technical Details ### Interface Design #### New fields in `fmha_bwd_traits` ```cpp struct fmha_bwd_traits { int seqlen_q; int seqlen_k; int batch; int max_seqlen_q; int max_seqlen_k; int hdim_q; int hdim_v; int nhead_q; int nhead_k; std::string data_type; bool is_group_mode; mask_enum mask_type; bias_enum bias_type; bool has_dbias; bool has_dropout; bool is_store_randval; bool is_deterministic; // New: cumulative physical seqlen pointers for group mode (pass nullptr for batch mode). // seqstart_qs[i+1] - seqstart_qs[i] = physical seqlen_q of batch i (including padding); length = batch+1 // seqstart_ks[i+1] - seqstart_ks[i] = physical seqlen_k of batch i (including padding); length = batch+1 const int seqstart_qs = nullptr; const int* seqstart_ks = nullptr; }; ``` #### `fmha_bwd_launcher` actual structure ```cpp struct fmha_bwd_launcher { std::function<float(fmha_bwd_args, const ck_tile::stream_config&)> run{}; // Total workspace size in bytes (host_ws_size + device_ws_size), computed by init(). // Zero for kUseQrQtrDorPipeline (writes dq directly, no acc buffer needed). size_t workspace_size = 0; fmha_bwd_launcher(const fmha_bwd_traits&); // Copies auxiliary data (nsplits[], offsets[]) via hipMemcpy to the head of the GPU workspace, // and zeros the dq_acc buffer portion (tail of workspace) if required. // The memory pointed to by device_ws must be >= workspace_size bytes. std::function<void(void* device_ws)> prepare_workspace{}; template <typename... Args> float operator()(Args&&... args) const { return run(std::forward<Args>(args)...); } private: size_t host_ws_size = 0; // CPU workspace size (nsplits[] + offsets[] arrays) size_t device_ws_size = 0; // GPU-only data size (dq_acc buffer) std::unique_ptr<char[]> ws_host; // host-side workspace buffer public: template <typename T0, typename T1, typename T2, typename Arch> void init(const fmha_bwd_traits& traits); }; ``` The `init<>()` template method (invoked by codegen dispatch branches as `this->init<...>(t)`) is responsible for: 1. Setting the `run` lambda 2. Calling `FmhaBwdDQDKDVKernel::GetWorkspaceHostSize(batch)` to obtain `host_ws_size` 3. Allocating `ws_host` (host memory) 4. Calling `FmhaBwdDQDKDVKernel::PrepareWorkspaceHost(ws_host.get(), ...)` to fill nsplits/offsets; return value is `device_ws_size` 5. `workspace_size = host_ws_size + device_ws_size` 6. Setting the `prepare_workspace` lambda (captures `this`, calls `PrepareWorkspaceDevice`) When no kernel matches the given traits, both `run` and `prepare_workspace` are initialized to default lambdas that print a warning to `std::cerr` and return gracefully (no exception). #### Workspace overall layout The workspace is managed by `FmhaBwdWorkspaceManager` and consists of two segments: ``` Offset 0 (CPU-prepared segment, host_ws_size bytes; also hipMemcpy'd to the head of GPU workspace): index_t nsplits[batch or 1] — per-batch nsplits array group mode: batch elements batch mode / non-deterministic: 1 element [group mode only] long_index_t dq_acc_offsets[batch+1] — per-batch element offset (inclusive prefix sum) offsets[0]=0, offsets[i+1] = offsets[i] + nheadnsplits_iseqq_ihdim_q Offset host_ws_size (device data segment, device_ws_size bytes): AccDataType dq_acc[total_elements] — compact dq_acc buffer (zeroed if required) total_elements = sum_i(nhead nsplits_i * seqq_i * hdim_q) layout within each batch: [nhead, nsplits_i, seqq_i, hdim_q] note: seqq_i uses the physical length (including padding) ``` Alignment constant (`ALIGNMENT = 16`): ``` nsplits_size = align_up(sizeof(index_t) * N, 16) // N = batch (group) or 1 (batch/non-det) offsets_size = align_up(sizeof(long_index_t) * (batch+1), 16) // group mode only host_ws_size = nsplits_size + offsets_size dq_acc_offset = host_ws_size // GetDqAccDataOffset(batch) ``` Key benefits: - The kernel reads nsplits/offsets directly from the workspace head — no device-side recomputation. - `FmhaBwdConvertQGradKernel` is completely decoupled from the pipeline block shape (`kN0`): nsplits is read from `nsplits_ptr`, `kN0` is no longer a template parameter, and multiple dq_dk_dv tiles with different `F_bn0` values now share a single convert_dq kernel instance (under receipt 1/2, deterministic convert_dq kernel count drops from ~300 to 60). - nsplits/offsets are computed on the host and transferred in one `hipMemcpy`; the dq_acc buffer follows immediately, at the offset given by `GetDqAccDataOffset`. #### Workspace size by scenario \| Scenario \| `workspace_size` \| Notes \| \|----------\|-----------------\|-------\| \| kUseQrQtrDorPipeline (any mode) \| `0` \| Writes dq directly; no acc buffer; `PrepareWorkspaceHost` returns 0 \| \| Non-deterministic + batch mode \| `> 0` \| nsplits[1]=1; dq_acc used for atomic add; `workspace_size = host_ws_size + batchnheadseqlen_qhdim_qebytes` \| \| Non-deterministic + group mode \| `> 0` \| nsplits[1]=1; dq_acc contiguous layout; `workspace_size = host_ws_size + nheadseqstart_qs[batch]hdim_qebytes` \| \| Deterministic + group mode* \| `> 0` \| nsplits[batch], offsets[batch+1], compact dq_acc; nsplits_i computed independently per batch \| \| Deterministic + batch mode persistent \| `> 0` \| nsplits[1] (uniform across batches); dq_acc `batchnheadnsplitsseqlen_qhdim_q` \| NeedsZeroDqAcc (determines whether `PrepareWorkspaceDevice` calls `hipMemset`): - Persistent kernel (deterministic batch mode) or non-deterministic: must zero (atomic add requires zero initialization) - Deterministic group mode + no mask: no zeroing needed (every tile writes its full region) - Deterministic + with mask: must zero (some blocks are skipped, leaving uninitialized tiles that would contribute to the reduction) #### Caller usage ```cpp // 1. Create launcher (traits include seqstart_qs/ks pointers; workspace_size is computed during construction) fmha_bwd_launcher launcher(fmha_traits); // 2. Read launcher.workspace_size directly const auto ws_size = launcher.workspace_size; // 3. Allocate a single GPU workspace ck_tile::DeviceMem ws_buf(ws_size); // 4. Copy nsplits/offsets to GPU head and zero dq_acc if required launcher.prepare_workspace(ws_buf.GetDeviceBuffer()); // 5. Build args with a single workspace pointer; the kernel splits it internally fmha_bwd_args args{ ..., ws_size > 0 ? ws_buf.GetDeviceBuffer() : nullptr, // workspace_ptr }; launcher(args, stream_config); ```	2026-05-07 02:23:28 +00:00
ltqin	de0a61e5c2	[rocm-libraries] ROCm/rocm-libraries#6574 (commit b3db057) [CK_TILE] Add SageAttention v2 forward kernel with multi-granularity quantization (#6574) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary Add a CK_TILE forward kernel implementing [SageAttention v2](https://arxiv.org/abs/2411.10958) — an attention algorithm that applies multi-granularity quantization to Q/K/V before computing attention, trading minimal accuracy loss for higher throughput on low-precision hardware. ### Quantization design \| Tensor \| Supported data types \| Scale granularity options \| \|--------\|---------------------\|--------------------------\| \| Q \| fp8 / int8 / int4 \| per-tensor, per-block (128 tokens), per-warp (32 tokens), per-thread (4 tokens) \| \| K \| fp8 / int8 / int4 \| per-tensor, per-block (128 tokens), per-warp (64 tokens), per-thread (16 tokens) \| \| V \| fp8 \| per-channel (always) \| \| O \| bf16 \| — \| Three precision combinations are supported: `fp8/bf16` (QKV fp8, O bf16), `i8/fp8/bf16` (QK int8, V fp8, O bf16), and `i4/fp8/bf16` (QK int4, V fp8, O bf16). ### Architecture support - gfx9 (CDNA2/3, e.g. gfx90a, gfx942) — full tile set - gfx950 (CDNA4) — restricted tile set (N-per-block capped at 64 for fp8-family dtypes) ### Implementation - Two pipeline variants: `QRKSVS` (synchronous) and `QRKSVS_ASYNC` (async copy) - Masking support: no mask, causal (top-left / bottom-right), and generic windowed - Batch and group (variable-length) modes - Head dimension: d=128, d_v=128 - Python codegen under `example/ck_tile/49_sageattention/codegen/` generates kernel instances per target/dtype/tile combination - Smoke tests included via `tile_example_sageattn_fwd` ### Test commands \`\`\`bash # fp8 QKV ./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 -kname=1 -prec=fp8bf16 -qscale=3 -init=3 # int8 QK, fp8 V ./build/bin/tile_example_sageattn_fwd -v=1 -b=16 -h=8 -s=1024 -d=128 -kname=1 -prec=i8fp8bf16 -qscale=3 -init=3 \`\`\` \`-qscale\` values: 1=per-tensor, 2=per-block, 3=per-warp, 4=per-thread	2026-04-30 18:33:36 +00:00
Jeff Huang	fdf4bb7fcc	[rocm-libraries] ROCm/rocm-libraries#6653 (commit 1df887e) [CK_TILE] fix(fmha): support >2GB KV cache in batch prefill via template dispatch (#6653) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation The CK batch prefill kernel previously failed (silent overflow + page faults) when the KV cache exceeded 2 GB, blocking long-context inference workloads (e.g., 128K+ token contexts with paged KV). Two distinct failure modes were addressed: 1. >4GB SRD overflow (`page_size < kN0`): The SRD `buffer_load_dwordx4` path uses a 32-bit `voffset` register; for small page sizes the rebased SRD spans the full KV pool and the offset wraps past 2 GB, corrupting K/V loads. 2. gfx950 page-table fault (`page_size >= kN0`): On CDNA4 the hardware validates the full SRD `num_records` range against page-table permissions (CDNA3 only checks per-instruction `voffset`). After per-tile SRD rebase, an un-trimmed `num_records` field extends past the live page and faults on freed/protected memory. ## Technical Details Two-mode `tile_scatter_gather` selected by the `kUseGlobalLoad` template parameter: \| Case \| `page_size` \| KV cache size \| Mode \| Load path \| Addressing \| \|---\|---\|---\|---\|---\|---\| \| 1 \| `>= kN0` (large pages) \| any \| SRD (`kUseGlobalLoad=false`) \| `buffer_load_dwordx4` \| 32-bit `voffset`, bounded by per-page rebase \| \| 2 \| `< kN0` (small pages) \| `<= 2 GB` \| SRD (`kUseGlobalLoad=false`) \| `buffer_load_dwordx4` \| 32-bit `voffset`, fits in INT32 byte range \| \| 3 \| `< kN0` (small pages) \| `> 2 GB` \| Global-load (`kUseGlobalLoad=true`) \| `async_load_tile_raw_flat` (K) + `load_tile_flat` (V) \| 64-bit \| Dispatch: the auto-gen API layer (`fmha_batch_prefill.py`) selects the kernel instantiation at launch from `(page_block_size, num_total_pages * batch_stride_k * kElementBytes)`, so the small-page penalty is paid only when correctness requires it. gfx950 SRD `num_records` trimming: in the K and V rebase lambdas of `block_fmha_batch_prefill_pipeline_qr_ks_vs_async`, `set_bottom_tensor_view_buffer_size(page_stride_k/v)` is called after each rebase to constrain `num_records` to the live page. Required for CDNA4 page-table validation; harmless on CDNA3. Pipeline sync for the global-load path: - V uses synchronous `load_tile_flat`; K uses `async_load_tile_raw_flat`. - `v_physical_pages_current` is double-buffered so the V flat load doesn't race against the next iteration's K rebase computation. Arch guards: `global_load_lds` intrinsics are gated to `__gfx94__` / `__gfx950__` (CDNA3+). Other architectures hit a `dependent_false` static_assert with a descriptive message. Device-side assertion convention: SRD setters use `__builtin_assume(cond)` (hint-only) rather than `<cassert>`'s `assert()`. The latter introduces an `__assert_fail` call whose register pressure scatters the K-SRD scalar register window across conditional branches, corrupting `buffer_load_dwordx4` on gfx950. ## Test Plan Tested on both MI308 (gfx942) and MI355 (gfx950) via the aiter wrapper test suite. All coverage lives in `op_tests/test_batch_prefill.py`: - Functional matrix (96 cases) — `test_batch_prefill`: `page_size ∈ {1, 16, 1024}` × `kv_layout ∈ {linear, vectorized}` × `dtype ∈ {bf16, fp8 quant variants}` × `causal` × `soft_cap` × `LSE` × `batch_size ∈ {1, 4}` (parametrized to exercise per-sequence SRD rebase across batch boundaries). - >2 GB coverage — `test_batch_prefill_large_kvcache`: extended to allocate a 5 GB+ KV cache pool and exercise both `kUseGlobalLoad=true` (small-page) and `kUseGlobalLoad=false` (large-page rebase) paths. Includes both single-batch and multi-batch (`batch_size=4`) cases to exercise per-sequence SRD rebase across the >2 GB pool. - Numerical reference: PyTorch SDPA, per-batch loop with `atol` / `rtol` from the existing batch prefill test harness. ## Test Result \| Arch \| `test_batch_prefill` \| `test_batch_prefill_large_kvcache` (>2 GB) \| \|------\|----------------------\|---------------------\| \| MI308 (gfx942) \| All passed \| Passed \| \| MI355 (gfx950) \| All passed \| Passed \| Performance impact (gfx950, hot SRD path): - +2.67% kernel-time on `seqlen=1024 / page_sz=1024 / bf16 / sglang / causal / soft_cap=30`, attributable in full to the two `set_bottom_tensor_view_buffer_size` calls in the K/V rebase lambdas (5-run median, signal/noise ≈ 9×). - This cost is mandatory for gfx950 correctness on >2 GB workloads — removing the setters re-introduces page-faults. - gfx942: 0 regressions in the same range (all configs ≤ +0.97%). ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-23 23:09:25 +00:00
KateJu	793a59736a	[rocm-libraries] ROCm/rocm-libraries#6656 (commit 1c958f8) Fix per-layer conv2d int8 CPU verification reference path (#6656) case example_conv2d_fwd_xdl_perlayer_quantization_int8.exe 1 0 ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-23 14:09:58 +00:00
Linjun-AMD	d22aafb48b	[rocm-libraries] ROCm/rocm-libraries#6479 (commit 0705c2d) CK][fmha] Add StreamLLM sink support to batch_prefill pipeline (#6479) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation The existing paged-KV attention pipelines (pagedkv, splitkv) support StreamLLM-style sink tokens — a fixed set of initial tokens kept in attention alongside the sliding window. The `batch_prefill` pipeline (chunked-prefill with VLLM-style block tables) previously hardcoded `kHasSink = false`, making it incompatible with sink-based attention patterns in LLM serving scenarios. This PR extends `batch_prefill` to support `kHasSink` and wires it into `fmha_fwd_runner` for validation against the existing CPU reference. ## Technical Details Pipeline (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`): - When `kHasSink`, the K/V loop splits into a sink phase [0, sink_seq_end) and a window phase [seqlen_k_start, seqlen_k_end), mirroring pagedkv. - K advance at the sink→window transition jumps `seqlen_k_start - sink_seq_end + kN0` to bridge the gap. - V scatter-gather offsets are re-initialized at the transition to fix a window mismatch bug: V was lagging kN0 behind K after the large jump, loading from the wrong sequence position. - Bias window, dropout seq_offset, and mask type (LogitsSinkMask) updated for sink-awareness. Traits / codegen (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`, `fmha_batch_prefill.py`): - `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded `false`). - Codegen adds `F_sink` field; skips batch-mode kernels (group mode required). - CMake test filter broadened from 9 → 33 instances covering fp16/bf16 × mask/nmask × lse/nlse × sink/nsink. Runner (`fmha_fwd_runner.hpp`, `CMakeLists.txt`): - `fmha_batch_prefill()` dispatched from `run_fwd` when: group mode + paged KV + num_splits == 1. - K/V strides corrected for runner's [num_pages, nhead_k, page_block_size, hdim] layout. - `page_block_size % 128` check relaxed: batch_prefill supports ps=16. - CPU reference paged-KV reordering guards extended with `CK_TILE_FMHA_FWD_BATCH_PREFILL_API`. ## Test Plan Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run `tile_example_fmha_fwd` in group mode with page_block_size=16. Test matrix: - Mask: no-mask, causal, sliding window - Sink: nsink, sink=1..128 - dtype: fp16, bf16 - LSE output: on/off - seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024} - GQA, chunked prefill, large batch×seqlen - page_block_size: 16, 32 ## Test Result 171 test cases, all valid:y: - nmask + nsink: ✓ - causal + nsink: ✓ - causal + sink=8: ✓ - sliding window + sink=8 (d=128, d=256): ✓ - bf16, LSE output, GQA: ✓ ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-21 11:05:12 +00:00
Po Yen Chen	2b83413b8d	[rocm-libraries] ROCm/rocm-libraries#6305 (commit 19e10a0) [CK] Remove obsolete benchmark_fwd_v3.sh script and README reference (#6305) The tile_example_fmha_fwd_v3 target no longer exists in this project, making this benchmark script non-functional.	2026-04-15 07:38:36 +00:00
Max Podkorytov	7dcc606adc	[rocm-libraries] ROCm/rocm-libraries#5383 (commit b660b8c) [CK_TILE] Add CShuffleLds microbenchmark suite MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary Microbenchmarks isolating LDS store/load operations in CShuffleEpilogue for bank conflict analysis. ## Motivation CShuffleEpilogue performs LDS store (MFMA registers → LDS) and load (LDS → registers for coalesced global writes). This suite isolates each operation to: - Identify which operation causes bank conflicts - Measure pure LDS bandwidth per access pattern - Validate access patterns across MFMA tile sizes and wave layouts ## Components - Microkernels (`tile_load_store_microkernels.hpp`): `StoreTile<Setup>`, `LoadTile<Setup>` - Setup Adapters (`benchmark_cshuffle_lds.hpp`): Wire CShuffleEpilogue to microkernels - Template (`benchmark_template.cpp.in`): Generated benchmarks with timing ## Build ```bash cmake -G Ninja -B build -S . \ -DGPU_TARGETS=gfx950 \ -DBUILD_CK_EXAMPLES=ON \ -DBUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS=ON ninja -C build bench_lds_fp8_16x16x128_2x2_fp8 ``` ## New CMake Options \| Option \| Default \| Description \| \|--------\|---------\|-------------\| \| `BUILD_CK_TILE_CSHUFFLE_LDS_BENCHMARKS` \| OFF \| LDS microbenchmarks \| \| `BUILD_CK_TILE_FMHA_TESTS` \| ON \| FMHA tests \| \| `BUILD_CK_TILE_ENGINE` \| ON \| Tile engine \| \| `BUILD_CK_TILE_ENGINE_TESTS` \| ON \| Tile engine tests \| \| `BUILD_CK_EXAMPLES` \| ON \| Examples \| \| `BUILD_CK_TUTORIALS` \| ON \| Tutorials \| \| `BUILD_CK_DEVICE_INSTANCES` \| ON \| Device instances \| \| `BUILD_CK_PROFILER` \| ON \| Profiler \| Setting guards to OFF reduces cmake configure from ~150s to ~5s.	2026-04-15 03:44:07 +00:00
msaffari-amd	5348b577ed	[rocm-libraries] ROCm/rocm-libraries#5863 (commit 31d9247) [CK_TILE] Separate PermuteN epilogue from CShuffle epilogue into standalone file (#5863) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation The PermuteN epilogue was previously embedded within cshuffle_epilogue.hpp, despite having fundamentally different behaviour. Coupling these two independent strategies in one file introduced unnecessary complexity, SFINAE guards, and a dual operator() overload selected at compile time via TiledMMAPermuteN_ template parameter. This PR separates PermuteN into its own standalone file(pertmuten_epilogue.hpp), simplifying both implementations and making the codebase easier to maintain and extend independently. ## Technical Details New file: permuten_epilogue.hpp: contains PermuteNEpilogueProblem and PermuteNEpilogue, extracted from the permuteN code path in cshuffle_epilogue.hpp. Cleanup of cshuffle_epilogue.hpp: - Removed the TiledMMAPermuteN_ template parameter from [CShuffleEpilogueProblem] - Removed the SFINAE-guarded permuteN operator() overload - Removed the EnablePermuateN_ SFINAE alias - CShuffle now only contains CShuffle logic; EightWave support (independent feature) is retained Consumer migration : All consumer files now use compile-time epilogue selection via [std::conditional_t] `using GemmEpilogue = std::conditional_t< TiledMMAPermuteN, PermuteNEpilogue<PermuteNEpilogueProblem<...>>, CShuffleEpilogue<CShuffleEpilogueProblem<...>>>;` Files modified: - flatmm_basic.cpp, moe_flatmm.cpp, a16w4_moe_flatmm.cpp, mixed_prec_flatmm.cpp, mx_flatmm_instance.hpp — flatmm examples - run_gemm_quant_example.inc — block-scale GEMM example - gemm_weight_preshuffle_invoker.hpp — weight preshuffle invoker - test_gemm_quant_fixtures.hpp, test_gemm_persistent_async_input.cpp, test_gemm_pipeline_util.hpp — test utilities - universal_gemm_invoker.hpp — universal GEMM invoker - epilogue.hpp — add header updated to include permuten_epilogue.hpp ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-14 20:23:26 +00:00
Aviral Goel	fa4473fde6	[rocm-libraries] ROCm/rocm-libraries#6323 (commit a668483) CK: Extract shared boilerplate from 47 gemm_quant test files (#6323) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Depends on #6303 ## Summary Extract shared test boilerplate (includes, type aliases, test fixture macros) from 47 `test_gemm_quant_` files into a single `test_gemm_quant_common.hpp` header. Each test file is reduced from ~50 lines of boilerplate to ~5 lines. \| Metric \| Value \| \|--------\|-------\| \| Files changed \| 48 \| \| Insertions \| +413 \| \| Deletions \| −1,106 \| \| Net lines removed* \| −693 \| ### What changed \| Before \| After \| \|--------\|-------\| \| 47 test files, each with ~50 lines of identical includes, type aliases, and fixture macros \| 1 shared header (`test_gemm_quant_common.hpp`) + 47 thin files (~5 lines each: include + params) \| ### Readability assessment A code realist review confirmed this change improves readability: the 47 test files had identical boilerplate obscuring the only meaningful content — the `GemmConfig` type alias and test dimensions. After the refactoring, each file's unique configuration is immediately visible, and adding a new test variant requires specifying only the varying parameters instead of copying 50 lines. ### Cumulative cleanup series stats \| PR \| Description \| Net lines \| \|----\|-------------\|-----------\| \| #6300 \| Remove 61 dead `#if 0` blocks \| −2,648 \| \| #6302 \| Remove 41 commented-out dead code blocks \| −2,861 \| \| #6303 \| Remove 4 orphaned files \| −3,886 \| \| This PR \| Extract gemm_quant test boilerplate \| −693 \| \| Total \| \| −10,088 \|	2026-04-11 10:01:30 +00:00
Aviral Goel	e0dfe58d66	[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8) CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Depends on #6300 ## Summary Remove 41 commented-out code blocks across 33 files in Composable Kernel, totaling ~200 lines. Identified using an automated dead code scanning skill (`ck-dead-code`) with a calibrated two-stage pipeline: 1. Pre-filter: Keyword-based scan found 1,338 `//`-commented blocks. Calibrated heuristics (trained on 50-sample expert classification) reduced to 89 high-confidence candidates — 93% noise reduction. 2. Expert triage: LLM expert classified each block in context as CODE_REMOVE, CODE_KEEP, or NOT_CODE. \| Classification \| Count \| \|---------------\|-------\| \| Removed (this PR) \| 41 \| \| Kept (debug helpers, alt configs, reference impls) \| 32 \| \| Not code (false positives) \| 16 \| Removed blocks include: superseded implementations, old test data, abandoned stubs, unreachable code, and buggy dead code.	2026-04-10 15:18:02 +00:00
Hosang Yoon	4c0e73ab12	[rocm-libraries] ROCm/rocm-libraries#6156 (commit 367565a) [CK_TILE] Optimize FMHA head-dim padded path on gfx11/gfx12 (#6156) ## Motivation On gfx11/gfx12, FMHA forward kernels that require head-dim padding show a large performance drop compared to the exact-head-dim path. In practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too far off the fast path. This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while keeping the behavior for other GPUs unchanged. ## Technical Details - Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path for gfx11/gfx12. - For `receipt=0`, keep support conservative and only enable the padded fast path for vector-safe cases (`head_dim % 8 == 0`), matching the existing assumption used on other GPUs. - Move `v_prefetch` later only for the head-dim-padded path on gfx11/gfx12. This reduces live ranges and removes the register-spill behavior seen in the earlier scheduling. - Enable the buffer-load OOB check offset trick for the padded path on gfx11/gfx12. ## Test Plan ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result Observed padded-head-dim performance improvements for HDIM=72/80: - gfx11: about ~3.5x - gfx1151: about ~2.0x - gfx12: about ~1.3x ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-08 14:53:18 +00:00
Po Yen Chen	c2ac7aa7b0	[rocm-libraries] ROCm/rocm-libraries#6051 (commit f0838b2) [CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051) ## Motivation The existing FMHA V3 pipeline only supports fp16/bf16 data types. This PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950, enabling higher throughput for FP8 inference workloads using the assembly-optimized V3 code path. ## Technical Details Warp GEMM: - Add FP8 32x32x32 warp gemm with C-transposed distribution (`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries V3 Kernel (`fmha_fwd_v3_kernel.hpp`): - Add per-tensor descale support for Q, K, V tensors, passing descale pointers through to pipeline kargs V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`): - Add FP8 data path with dtype-aware type selection - Add asm volatile P matrix conversion from f32 to fp8 - Add FP8-aware instruction scheduling in `CoreLoopScheduler` V3 Pipeline Policy (`block_fmha_fwd_v3_pipeline_default_policy.hpp`): - Add FP8 QK warp gemm selection (SwizzleB variant for V tile distribution compatibility) Codegen (`fmha_fwd.py`): - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128) - Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale: no/pertensor) - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor Misc: - Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum (`arch.hpp`) - Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`) V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`) pending further validation. ## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950) \| Problem \| Regular (TFlops) \| Varlen (TFlops) \| \|---\|---:\|---:\| \| batch=1 heads=6/1 seqlen=1024 causal \| 48.9 \| 47.6 \| \| batch=1 heads=6/1 seqlen=2048 causal \| 119.8 \| 117.4 \| \| batch=1 heads=6/1 seqlen=4096 causal \| 263.7 \| 259.2 \| \| batch=1 heads=6/1 seqlen=8192 causal \| 548.9 \| 543.6 \| \| batch=1 heads=6/1 seqlen=16384 causal \| 1043.0 \| 1063.7 \| \| batch=1 heads=6/1 seqlen=32768 causal \| 1237.2 \| 1279.6 \| \| batch=1 heads=6/1 seqlen=65536 causal \| 1315.4 \| 1382.7 \| \| batch=1 heads=6/1 seqlen=131072 causal \| 1326.3 \| 1402.2 \| \| batch=1 heads=16/1 seqlen=65536 causal \| 1298.7 \| 1388.4 \| \| batch=1 heads=40/40 seqlen=37200 non-causal \| 1248.9 \| 1326.1 \| ## Test Plan Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40), GQA ratios (1:1, 1:8), and causal/non-causal modes. Verified all cases dispatch to the V3 pipeline by enabling `F_is_v3_enabled` and confirming kernel names contain `qr_async_trload_v3`. ## Test Result 176/176 tests passed with V3 enabled. All cases correctly dispatched to V3 pipeline with `pertensor` quantization. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-07 14:20:43 +00:00
Hosang Yoon	1dc35ff4ae	[rocm-libraries] ROCm/rocm-libraries#6038 (commit d7041a2) [CK_TILE] Restrict FMHA codegen to the kernel subset used by FlashAttention (#6038) ## Motivation Currently, the CK FlashAttention integration generates a broader FMHA kernel set than the FlashAttention wrappers can actually dispatch, which increases compile time without improving runtime coverage. ## Technical Details The FlashAttention CK wrappers do not use all logits/LSE variants emitted by the default FMHA codegen. The direct `fmha_fwd` path always uses softcap-disabled, LSE-enabled kernels, and the `fmha_fwd_splitkv` path only uses softcap-disabled kernels. This change trims codegen to that subset and stops generating the unused logits/LSE variants. This reduces the generated forward kernel set without changing `fmha_fwd_appendkv` or `fmha_bwd`. The reduced kernel set was validated by building and running the [FlashAttention](https://github.com/Dao-AILab/flash-attention) CK backend. Across targets, the total generated FMHA kernel count is reduced by: - `gfx942`: 29.3% - `gfx1100`: 33.7% - `gfx1201`: 31.3% ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> pytest test/test_flash_attn_ck.py from https://github.com/Dao-AILab/flash-attention ## Test Result all tests passed <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-03 00:18:21 +00:00
Linjun-AMD	08792e0b31	[rocm-libraries] ROCm/rocm-libraries#5504 (commit 47f86c7) [CK Tile] Add sink token gradient support in FMHA backward pass (#5504) ## Motivation Adds sink token support to the FMHA backward kernel (dot_do_o pipeline): ## Technical Details - Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType - Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs - Compute per-head sink gradient via atomic accumulation in the pipeline - Update example runner with reference validation for sink gradient ## Test Plan Add new test case ## Test Result WIP ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-02 03:17:45 +00:00
Fu-Cheng Tsai	a502e5a00b	[rocm-libraries] ROCm/rocm-libraries#5798 (commit 7acd4e7) [CK_TILE] Update gfx12 FMHA forward kernel configs	2026-04-01 14:23:38 +00:00
Hosang Yoon	2dcae9d173	[rocm-libraries] ROCm/rocm-libraries#5977 (commit 794bea7) [CK_TILE] Fix Windows build in FMHA head grouping ## Motivation This is a follow-up fix for [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018). [PR #5018](https://github.com/ROCm/rocm-libraries/pull/5018) added LLC-aware FMHA head grouping / head-major scheduling on RDNA, but it also introduced Linux-only code paths, including `<dirent.h>`, which break Windows builds. This change fixes that by guarding the Linux-specific LLC probing logic so non-Linux platforms can still build correctly. ## Technical Details - Guard `<dirent.h>` with `#ifdef __linux__` - Guard KFD sysfs traversal logic with `#if defined(__linux__)` - On non-Linux platforms, return `0` from `get_kfd_sysfs_llc_cache_bytes()` - Preserve existing fallback behavior through: - `CK_TILE_FMHA_LLC_CACHE_MB` - arch-based default LLC sizes - no head grouping when no LLC size can be resolved ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 14:19:19 +00:00
Jeff Huang	7968368d92	[rocm-libraries] ROCm/rocm-libraries#5918 (commit a7e2c67) [CK][CK_TILE] Add fp8bf16 hdim=256 tile for batch prefill (#5918) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation FP8 batch prefill kernels currently only support head_dim=128. Models with head_dim=256 hit the "invalid argument for batch_prefill" error because no matching kernel variant exists in the codegen dispatch. ## Technical Details Add a hdim=256 tile size entry for fp8bf16 in the batch prefill codegen recipe (`fmha_batch_prefill.py`). Tile configuration: `FmhaFwdTileSize(128, 128, 32, 256, 32, 256, 4,1,1, 4,1,1, 32,32,32, 32,32,32, -1)` - bm0=128, bn0=128 (Q/K tile sizes) - bn1=256, bk0max=256 (V head_dim=256) - Warp MFMA 32x32x32 (fp8 MFMA instructions) This mirrors the existing bf16/fp16 hdim=256 tile but uses fp8 warp sizes. ## Test Plan Tested on both MI308X (gfx942) and MI355X (gfx950) via aiter batch prefill test with the following matrix: - page_size: {1, 16, 1024} - kv_layout: {linear, vectorized} - lookup_table: {sglang, vllm} - causal: {true, false} - logits_soft_cap: {0.0, 30.0} - contiguous_kv: {true, false} ## Test Result MI308X (gfx942): 160 passed, 32 skipped (page_size=1 + vectorized not applicable) MI355X (gfx950): 120 passed, 72 skipped (pre-existing ROCm 7.2 compiler issue with causal + no softcap) No register spills on either platform. ### Profiling — MI355X (gfx950), FP8 pertensor, hdim=256, seqlen=1024, 8 heads \| page_sz \| kv_layout \| table \| causal \| soft_cap \| time_us \| TFLOPS \| \|---------\|-----------\|-------\|--------\|----------\|---------\|--------\| \| 1 \| linear \| sglang \| False \| 0.00 \| 55.01 \| 156.16 \| \| 1 \| linear \| vllm \| False \| 0.00 \| 55.12 \| 155.84 \| \| 1 \| linear \| sglang \| False \| 30.00 \| 62.63 \| 137.16 \| \| 1 \| linear \| vllm \| False \| 30.00 \| 62.16 \| 138.20 \| \| 1 \| linear \| sglang \| True \| 30.00 \| 64.09 \| 67.01 \| \| 1 \| linear \| vllm \| True \| 30.00 \| 63.85 \| 67.27 \| \| 16 \| linear \| sglang \| False \| 0.00 \| 57.00 \| 150.69 \| \| 16 \| vectorized \| sglang \| False \| 0.00 \| 57.55 \| 149.25 \| \| 16 \| linear \| vllm \| False \| 0.00 \| 56.80 \| 151.23 \| \| 16 \| vectorized \| vllm \| False \| 0.00 \| 57.32 \| 149.87 \| \| 16 \| linear \| sglang \| False \| 30.00 \| 64.77 \| 132.62 \| \| 16 \| vectorized \| vllm \| False \| 30.00 \| 63.54 \| 135.18 \| \| 16 \| linear \| sglang \| True \| 30.00 \| 66.84 \| 64.26 \| \| 16 \| vectorized \| vllm \| True \| 30.00 \| 66.12 \| 64.96 \| \| 1024 \| linear \| sglang \| False \| 0.00 \| 58.25 \| 147.46 \| \| 1024 \| vectorized \| sglang \| False \| 0.00 \| 57.53 \| 149.31 \| \| 1024 \| linear \| vllm \| False \| 0.00 \| 58.06 \| 147.94 \| \| 1024 \| vectorized \| vllm \| False \| 0.00 \| 57.55 \| 149.27 \| \| 1024 \| linear \| sglang \| False \| 30.00 \| 65.38 \| 131.38 \| \| 1024 \| vectorized \| vllm \| False \| 30.00 \| 63.64 \| 134.98 \| \| 1024 \| linear \| sglang \| True \| 30.00 \| 66.85 \| 64.25 \| \| 1024 \| vectorized \| vllm \| True \| 30.00 \| 65.26 \| 65.81 \| ### Profiling — MI308X (gfx942), FP8 pertensor, hdim=256, seqlen=1024, 8 heads \| page_sz \| kv_layout \| table \| causal \| soft_cap \| time_us \| TFLOPS \| \|---------\|-----------\|-------\|--------\|----------\|---------\|--------\| \| 1 \| linear \| sglang \| False \| 0.00 \| 110.18 \| 77.96 \| \| 1 \| linear \| vllm \| True \| 30.00 \| 134.33 \| 31.97 \| \| 1 \| linear \| sglang \| True \| 30.00 \| 134.59 \| 31.91 \| \| 16 \| linear \| sglang \| False \| 0.00 \| 115.43 \| 74.42 \| \| 16 \| vectorized \| sglang \| False \| 0.00 \| 106.11 \| 80.95 \| \| 16 \| linear \| vllm \| False \| 0.00 \| 116.34 \| 73.83 \| \| 16 \| vectorized \| vllm \| False \| 0.00 \| 106.17 \| 80.91 \| \| 16 \| linear \| sglang \| False \| 30.00 \| 135.61 \| 63.34 \| \| 16 \| vectorized \| vllm \| False \| 30.00 \| 122.37 \| 70.20 \| \| 16 \| linear \| sglang \| True \| 0.00 \| 117.44 \| 36.57 \| \| 16 \| vectorized \| vllm \| True \| 0.00 \| 108.81 \| 39.47 \| \| 16 \| linear \| sglang \| True \| 30.00 \| 139.43 \| 30.80 \| \| 16 \| vectorized \| vllm \| True \| 30.00 \| 125.87 \| 34.12 \| \| 1024 \| linear \| sglang \| False \| 0.00 \| 110.65 \| 77.63 \| \| 1024 \| vectorized \| sglang \| False \| 0.00 \| 101.70 \| 84.46 \| \| 1024 \| linear \| vllm \| False \| 0.00 \| 111.71 \| 76.89 \| \| 1024 \| vectorized \| vllm \| False \| 0.00 \| 101.55 \| 84.59 \| \| 1024 \| linear \| sglang \| False \| 30.00 \| 129.33 \| 66.42 \| \| 1024 \| vectorized \| vllm \| False \| 30.00 \| 120.95 \| 71.02 \| \| 1024 \| linear \| sglang \| True \| 0.00 \| 112.26 \| 38.26 \| \| 1024 \| vectorized \| vllm \| True \| 0.00 \| 103.02 \| 41.69 \| \| 1024 \| linear \| sglang \| True \| 30.00 \| 133.73 \| 32.12 \| \| 1024 \| vectorized \| vllm \| True \| 30.00 \| 124.75 \| 34.43 \| ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 10:21:33 +00:00
Johannes Graner	58475d3f45	[rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649) [CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393) ## Motivation Add StreamK work distribution to the CK Tile grouped convolution backward weight kernel. Split-K divides the K-dimension uniformly across a fixed `k_batch`, which causes load imbalance when the number of output tiles doesn't evenly fill the GPU. StreamK distributes total K-iterations evenly across workgroups, improving utilization on these shapes. ## Technical Details StreamK is added as an `if constexpr` branch in the existing kernel, selected by the `TilePartitioner_` template parameter. Two reduction strategies are supported: - Linear: tile-starter sequentially accumulates partials from contributing CTAs - Tree: pairwise binary tree reduction (O(log n) depth, faster for many contributors) Both persistent and non-persistent data-parallel (DP) sections are supported. Key changes: - `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution path with `RunStreamK`/`RunStreamKLoop`, partial store/load via workspace, flag-based cross-CTA synchronization, `GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions - `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers) and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by both GEMM and Conv StreamK kernels - `streamk_gemm_kernel.hpp`: Refactored to use shared helpers - Merged split-K and StreamK example invokers via `PartitionerPolicy` template parameter - StreamK example binary with `--streamk_reduction=linear\|tree` and `--streamk_persistent=0\|1` - CK Builder integration: `SpecifiesStreamK` concept, `TilePartitionerType` factory helper, `InstanceTraits` with StreamK fields - 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP), negative, builder regression ### Performance (MI355X, gfx950) Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}): \| Shape \| 16x64 tiles \| \| 128x128 tiles \| \| \|---\|---\|---\|---\|---\| \| \| Split-K \| StreamK \| Split-K \| StreamK \| \| 1x1 128x128 N=32 28x28 \| 1.00x \| 0.54x \| 1.00x \| 0.81x \| \| 3x3 128x128 N=32 14x14 \| 1.00x \| 0.59x \| 1.00x \| 0.62x \| \| 1x1 256x64 N=32 56x56 \| 1.00x \| 0.83x \| 1.00x \| 1.83x \| \| 3x3 512x512 N=2 7x7 \| 1.00x \| 1.12x \| 1.00x \| 0.62x \| \| 1x1 1024x1024 N=4 7x7 \| 1.00x \| 1.09x \| 1.00x \| 0.60x \| \| 3x3 128x128 N=32 28x28 \| 1.00x \| 0.44x \| 1.00x \| 0.96x \| \| 3x3 256x256 N=32 14x14 \| 1.00x \| 0.67x \| 1.00x \| 0.93x \| \| 3x3 512x512 N=32 7x7 \| 1.00x \| 0.98x \| 1.00x \| 1.16x \| StreamK's value depends on tile config: with larger tiles (fewer output tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up to 1.16x on typical large-channel convolutions. Tree reduction consistently outperforms Linear when multiple CTAs contribute to the same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n) sequential accumulation. The table reports the best of Linear and Tree for each shape. ## Test Plan ```bash ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk ./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk # Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON) ninja -C build check-builder ``` 30 tests covering: - Host-side: type traits, kernel args construction, grid size, workspace size - GPU end-to-end (Linear + Tree): small/medium shapes, multi-group, stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher occupancy - Persistent DP: Linear + Tree with persistent data-parallel dispatch - Negative: `IsSupportedArgument` rejects unaligned K and C - Builder: Create (instance string validation) + Execution (reference comparison) + instance string regression ## Test Result All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK tests pass. Full `check-builder` suite passes. Tolerances computed dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 09:18:14 +00:00
yinglu	a268a2a2e1	[rocm-libraries] ROCm/rocm-libraries#5612 (commit 38c9498) [CK]fix: remove redundant structured sparsity check in run_gemm_example.inc (#5612) ## Motivation This issue if found via https://github.com/ROCm/rocm-libraries/pull/4302#discussion_r2958603418 and is introduced via https://github.com/ROCm/rocm-libraries/pull/5323. ## Technical Details The outer `if` and inner `if constexpr` both checked GemmConfig::UseStructuredSparsity. Merged into a single `if constexpr` since both preshuffle and UseStructuredSparsity are compile-time constants. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 08:23:07 +00:00
yinglu	d460ab35b6	[rocm-libraries] ROCm/rocm-libraries#4302 (commit e62bd8a) [CK_TILE] add tf32 support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in CK_TILE on gfx942 and gfx950. ## Checklist Please put an into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run on all changed files - [ ] Any dependent changes have been merged ## Discussion	2026-03-19 09:19:06 +00:00
Thomas Ning	5f90f69795	[rocm-libraries] ROCm/rocm-libraries#5323 (commit 5454e9e) CK Tile MX GEMM Packing Improvement ## Motivation Reduce the scale loading size and also has better utilization of MFMA scale selection. ## Technical Details Add up the packing of mx scales. ## Test Plan Use the existing test cases. ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-17 18:58:56 +00:00
Hosang	859acb5ae7	[rocm-libraries] ROCm/rocm-libraries#5018 (commit b32e7e6) [CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Long-sequence FMHA can become memory-bound when K/V working sets exceed Infinity Cache (LLC), causing repeated DRAM traffic across heads. This PR introduces LLC-aware launch ordering improvements for FMHA forward, and it is currently enabled only on gfx11 and gfx12. The approach is inspired by [`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217), adapted to CK’s kernel/runner structure and layout handling. In this context, `bshd` is the layout used in Flash-Attention, while `bhsd` is the default layout used by the CK Tile FMHA example. ## Technical Details This PR adds two complementary strategies: - For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware head grouping: - Estimate LLC size (env override, KFD sysfs, or arch default). - Compute group size from K/V bytes per head vs LLC target. - Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and related tensors). - For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit launch-order adjustment: - Keep a single kernel launch. - Reinterpret block linearization in `GetTileIndex` to make execution head-major, improving temporal locality of per-head K/V reuse. Additional integration updates: - Propagate `num_head_q_total` and `head_start` through FMHA args/kargs. - Use global head indexing for dropout RNG stream mapping so grouped launches keep deterministic/consistent dropout behavior. - Keep fallback behavior unchanged when grouping is not beneficial or disabled. ## Test Plan - `test_ck_tile_fmha` - `tile_example_fmha_fwd` ## Test Result - `test_ck_tile_fmha`: all tests passed. - `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201, and all of them show higher performance compared to the baseline. The improvement is consistent, and performance is well maintained even at long sequence lengths. ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} - TFLOPs by sequence length target: gfx1100 layout: bhsd SeqLen \| Before \| After \| Speedup -- \| -- \| -- \| -- 1024 \| 56.27 \| 61.48 \| 1.09x 4096 \| 67.10 \| 72.27 \| 1.08x 8192 \| 65.99 \| 71.64 \| 1.09x 12288 \| 61.60 \| 76.61 \| 1.24x 16384 \| 58.99 \| 75.74 \| 1.28x 20480 \| 57.32 \| 74.42 \| 1.30x 24576 \| 56.89 \| 74.25 \| 1.31x 27280 \| 18.93 \| 24.48 \| 1.29x - TFLOPs by sequence length target: gfx1201 layout: bshd SeqLen \| Before \| After \| Speedup -- \| -- \| -- \| -- 1024 \| 66.79 \| 65.90 \| 0.99x 4096 \| 85.90 \| 86.80 \| 1.01x 8192 \| 77.06 \| 90.29 \| 1.17x 12288 \| 58.36 \| 88.98 \| 1.52x 16384 \| 52.12 \| 88.88 \| 1.71x 20480 \| 48.11 \| 88.42 \| 1.84x 24576 \| 47.12 \| 89.07 \| 1.89x 27280 \| 49.05 \| 50.31 \| 1.03x ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 21:19:23 +00:00
Enrico Degregori	eb033ef208	[rocm-libraries] ROCm/rocm-libraries#4964 (commit 3271d9a) [CK Tile] Eight Waves pipeline GEMM ## Motivation Eight waves pipeline was added for ABQuant. The goal of this PR is to enable it also for GEMM ## Technical Details Summary: - Block: - Create block struct for GEMM using eight warps specific distribution encodings - Use this block struct in ABQuant for encodings - Pipeline: - Create impl pipeline for eight waves which can be used by GEMM and ABQuant as base (and for AQuant and BQuant in the future) - Create eight waves pipeline for GEMM (this can not be easily integrated in the existing async pipeline) - Pipeline policy: - Extract GEMM specific parts in the ABQuant policy to define GEMM policy (then ABQuant use it as base and add Quant specific methods) - Minor: naming was inconsistent between warp/wave, everything is now referred to as eight waves So overall we have: - block struct directly used by GEMM -> ABQuant derived struct to implement operator - Impl base pipeline with general implementation -> GEMM and ABQuant pipelines use it to avoid code duplication but still define their own pipelines - pipeline policy struct directly used by GEMM -> ABQuant derived policy struct for Quant specific parts ## Test Plan Added new tests for GEMM pipeline: `test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950 supports it). Note: K padding test is disabled for this pipeline because it's not implemented yet ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 08:31:56 +00:00
Bartłomiej Kocot	b8108662da	[rocm-libraries] ROCm/rocm-libraries#5387 (commit 0c259bd) [CK][CK Tile] Grouped Convolution Backward Weight set of fixes (#5387) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Grouped Convolution Backward Weight split k fixes for CK tile kernels ## Technical Details - get k batch from kargs to get deduced k batch - multiply zeroing size by data type size - disable v6 (producing a incorrect results) ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result Pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-13 16:19:50 +00:00
Yi DING	574c1c121a	[rocm-libraries] ROCm/rocm-libraries#5174 (commit a358a21) [CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic Mode (#5174) ## Motivation This PR enables a persistent-kernel execution path for FMHA backward (dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is split, stored, and converted back to final gradients. ## Technical Details - Introduces a persistent-kernel grid mapping in deterministic mode and updates split-count calculation accordingly. - Extends kernel kargs to carry batch-related info needed for persistent scheduling and dQ conversion. - Refactors dQ store conditions and adds mask-type traits/utilities and runner logging updates. ## Test Plan - Jenkins [base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline) - Jenkins [AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline) - Jenkins [FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline) - local FA tests ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-13 06:14:31 +00:00
Michał Kulikowski	2c3f9bfa52	[rocm-libraries] ROCm/rocm-libraries#5348 (commit 7b18234) [CK][Examples] Adding parameters for a couple of CK examples: -gemm_add_add_mean_meansquare_xdl_fp16 -gemm_dl_quantization_int8 -gemm_xdl_bias_relu_quantization_int8 -gemm_xdl_quantization_int8 Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>	2026-03-12 08:48:36 +00:00
Aviral Goel	1a4aa7fd89	[rocm-libraries] ROCm/rocm-libraries#5082 (commit 9313659) ck_tile: add gtest unit tests for MX flatmm (gfx950) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary - Add correctness unit tests for the MX-format flatmm kernel (`example/ck_tile/18_flatmm/mxgemm`) under `test/ck_tile/flatmm/` - Tests cover all five dtype combinations: FP4×FP4, FP8×FP8, FP6×FP6, FP8×FP4, FP4×FP8 - Tests cover all four kernel dispatch paths (the `has_hot_loop` × `tail_num` product): - `has_hot_loop=false, tail=ODD` (K=256, num_loop=1) - `has_hot_loop=false, tail=EVEN` (K=512, num_loop=2) - `has_hot_loop=true, tail=ODD` (K=768, num_loop=3) - `has_hot_loop=true, tail=EVEN` (K=1024, num_loop=4) - Remove unsupported `-split_k` CLI option from `tile_example_mx_flatmm`; the pre-shuffled B layout is incompatible with K-splitting and the option silently produced wrong results ## Changes New files (`test/ck_tile/flatmm/`): - `CMakeLists.txt` — builds 40 kernel instances as a shared OBJECT library, links into 5 per-dtype test executables; forwards `-DCK_TILE_USE_OCP_FP8` when `CK_USE_OCP_FP8` is ON - `test_mx_flatmm_base.hpp` — base test fixture with `run_test_with_validation(M, N, K, kbatch=1)` - `test_mx_flatmm_fixtures.hpp` — concrete `TestMXFlatmm` typed test class and type aliases - `test_mx_flatmm_fp{4fp4,8fp8,6fp6,8fp4,4fp8}.cpp` — per-dtype `TYPED_TEST_SUITE` files Modified files: - `example/ck_tile/18_flatmm/mxgemm/mx_flatmm_arch_traits.hpp` — moved `preShuffleWeight` here (was in `mx_flatmm.cpp`) so it is includeable by both the example and the tests - `example/ck_tile/18_flatmm/mxgemm/mx_flatmm.cpp` / `run_mx_flatmm.inc` — removed `-split_k` CLI arg, hardcoded `k_batch=1`, fixed `k_split` formula, updated call sites after `preShuffleWeight` move - `test/ck_tile/CMakeLists.txt` — added `add_subdirectory(flatmm)`	2026-03-11 22:47:59 +00:00
Anton Gorenko	2312eef6c3	[rocm-libraries] ROCm/rocm-libraries#4368 (commit 17f7dfc) [CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368) ## Motivation Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline ## Technical Details The microscaling is used when quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4. Supported features: * only "qr" pipeline is implemented * hdim 128 and 256 (smaller hdim are not possible due to restrictions of "qr" pipeline, but they can be computed using instances with padding) * both 32x32x64 and 16x16x128 scale MFMAs are supported * Q and K scales are applied in hdim, V scales - in seqlen dimension * column-major V only * batch and group mode * bias, Alibi (tested but no instances by default, just like fp8) * masking etc. Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008 ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8 ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-11 10:00:52 +00:00
Sami Remes	8f27f65d44	[rocm-libraries] ROCm/rocm-libraries#4594 (commit 1fce4cb) [CK_TILE] MX GEMM non-preshuffled RCR layout ## Motivation Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled layouts using async pipeline. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-10 20:12:43 +00:00
Hosang	c800f88911	[rocm-libraries] ROCm/rocm-libraries#5088 (commit 36ca523) [CK_TILE] Update gfx11 FMHA forward kernel configs ## Motivation Tune gfx11 FMHA codegen to recover performance for mainly PSSK (padded seqlen_q/k) cases. This tuning is based on heuristic search and improves performance in most tested shapes. Performance should be evaluated on top of [`ROCm/rocm-libraries#5018`](https://github.com/ROCm/rocm-libraries/pull/5018) (required baseline). ## Technical Details - Updated gfx11 codegen heuristic choices for tile size and occupancy. - Updated gfx11 pipeline selection: - Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so npad cases are dispatched to the faster kernel path.` - Kept gfx12 unchanged: with PSSK support from [`ROCm/rocm-libraries#4957`](https://github.com/ROCm/rocm-libraries/pull/4957), existing gfx12 config is already sufficient. - Tuning rationale: - In some cases, higher `kBlockPerCu` lowers register pressure. - On RDNA, this generally aligns with better performance when `waves_per_eu >= 6`. ## Test Plan - test_ck_tile_fmha - tile_example_fmha_fwd: tested this on gfx1100 and gfx1151 ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - TFLOPs by sequence length target: `gfx1100` layout: `bhsd` - mode: batch / VGPR usage: 225 vs 214 SeqLen \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 74.10 \| 71.97 \| 0.97x 4096 \| 66.26 \| 77.79 \| 1.17x 8192 \| 68.18 \| 75.88 \| 1.11x 12288 \| 68.47 \| 80.44 \| 1.17x 16384 \| 59.54 \| 79.66 \| 1.34x 20480 \| 55.78 \| 77.91 \| 1.40x 24576 \| 55.08 \| 77.47 \| 1.41x 27280 \| 47.45 \| 77.16 \| 1.63x - mode: group / VGPR usage: 256 vs 214 SeqLen \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 71.47 \| 70.6 \| 0.99x 4096 \| 64.74 \| 77.06 \| 1.19x 8192 \| 64.68 \| 75.47 \| 1.17x 12288 \| 66.43 \| 79.95 \| 1.20x 16384 \| 56.02 \| 79.73 \| 1.42x 20480 \| 50.21 \| 78.15 \| 1.56x 24576 \| 47.29 \| 77.53 \| 1.64x 27280 \| 46.13 \| 77.04 \| 1.67x - TFLOPs by sequence length target: `gfx1151` layout: `bshd` - mode: batch / VGPR usage: 225 vs 223 Batch \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 26.85 \| 29.17 \| 1.09x 4096 \| 24.75 \| 26.01 \| 1.05x 8192 \| 25.24 \| 25.50 \| 1.01x 12288 \| 25.18 \| 25.00 \| 0.99x 16384 \| 24.79 \| 25.91 \| 1.05x 20480 \| 25.56 \| 25.24 \| 0.99x 24576 \| 25.13 \| 26.20 \| 1.04x 27280 \| 10.78 \| 26.35 \| 2.44x - mode: group / VGPR usage: 256 vs 229 Batch \| Baseline \| Tuned \| Gain -- \| -- \| -- \| -- 1024 \| 27.44 \| 26.71 \| 0.97x 4096 \| 21.89 \| 23.09 \| 1.05x 8192 \| 22.85 \| 24.49 \| 1.07x 12288 \| 24.33 \| 24.42 \| 1.00x 16384 \| 20.05 \| 24.98 \| 1.24x 20480 \| 14.70 \| 25.15 \| 1.71x 24576 \| 11.30 \| 26.31 \| 2.33x 27280 \| 10.10 \| 26.32 \| 2.61x ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-10 16:47:43 +00:00
kensclin	8c216604d4	[rocm-libraries] ROCm/rocm-libraries#5218 (commit 60156cf) [CK] Fix the issue of the aiter to call eightwarps pipeline. (#5218) ## Motivation Fix the failure of the aiter to call eightwarp. Changed Async to the name eightwarps. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan Pass ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-09 18:13:07 +00:00
Christopher Millette	e0d11b969b	[rocm-libraries] ROCm/rocm-libraries#5030 (commit 8e02a26) [CK] Replace tuple value construction with tuple_element_t type extraction [1A] (#5030) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Summary ### Rationale CK's device operation instance registration uses `add_device_operation_instances` at ~1,850 call sites to register GPU kernel configurations. The existing implementation constructs `std::tuple` values just to extract their types via `decltype`, then copy-constructs each instance into `make_unique`. This is wasteful — only the types matter, not the values — and forces the compiler to instantiate the full `std::tuple` constructor and `std::get` machinery at every call site. ### What changed - Replace `remove_cvref_t<decltype(std::get<i>(tuple_obj))>` with `std::tuple_element_t<i.value, TupleType>`, which extracts the type directly without constructing any values - Replace copy-from-default `make_unique<T>(value)` with direct default construction `make_unique<T>()` — all CK device operation instances are stateless structs with configuration encoded in template parameters - Add `static_assert(std::is_default_constructible_v<NewOpInstance>)` to enforce this contract at compile time with a clear error message - Add Doxygen documentation for this high-traffic public API ### Value - Eliminates unnecessary template instantiation of `std::tuple` constructors and `std::get` across ~1,850 call sites - Establishes a cleaner, more intention-revealing pattern for type-only tuple usage - The `static_assert` prevents silent breakage if a non-default-constructible type is ever added - No runtime behavior change — zero risk ### Files changed (9) - `add_device_operation_instance.hpp`: Core pattern change - 3 example files, 3 reduce instance headers, 1 convolution header, 1 profiler header ## Test plan - [ ] Existing CI tests cover all ~1,850 call sites (GEMM, reduce, softmax, convolution) - [ ] `static_assert` provides compile-time validation stronger than runtime tests - [ ] No runtime behavior change — stateless struct default construction is identical to copy-from-default - [ ] Compatible with both `std::tuple` and `ck::type_list` containers 🤖 Generated with [Claude Code](https://claude.com/claude-code) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-06 16:28:22 +00:00
Ville Pietilä	ae4e632c7d	[rocm-libraries] ROCm/rocm-libraries#4797 (commit 1a30400) [CK_TILE] Add CK Tile bwd weight profiler MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation To compare old CK and CK Tile, we need to extend the current CK profiler to support running also CK Tile instance with the same API. In order to have the same instance coverage in CK Tile compared to the old CK, I've added code generation from old CK configurations to CK Tile instances using the CK Builder. ## Technical Details - The codegen python script for CK Tile fwd convs is extended to support also bwd weight and bwd data. - The generated instances are added to the CMake build (target `device_grouped_conv_bwd_weight_tile_instance`s). - A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to the CK Profiler.	2026-03-04 21:50:29 +00:00
Anton Gorenko	782f0a9eed	[rocm-libraries] ROCm/rocm-libraries#4957 (commit d2e4eb8) [CK_TILE][FMHA] Extend pipelines with pssk for gfx11/12 (#4957) ## Motivation Build pipelines with seqlen padding only to support vectorized loads in the hdim dimension. The existing pipelines have either all dims padded or all dims not padded. These pipelines can be used in ComfyUI for slightly better performance. ## Technical Details Also a fix included for correct FLOPS calculation in `tile_example_fmha_fwd` when `seqlen_q * seqlen_k` overflows index_t capacity (signed int32). ## Test Plan The existing test cases will use the new pipelines when parameters allow (seqlens - padded, hdims - not padded): ``` ninja test_ck_tile_fmha_fwd bin/test_ck_tile_fmha_fwd_fp16 bin/test_ck_tile_fmha_fwd_bf16 bin/test_ck_tile_fmha_fwd_fp8bf16 # for gfx12 ``` ## Test Result All tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-04 04:50:44 +00:00
Yi DING	b09112bbad	[rocm-libraries] ROCm/rocm-libraries#4577 (commit a36922c) [CK_TILE] FMHA BWD Launcher Interface ## Motivation Reduce memory usage; Be prepared to implement optimizations of reducing nsplits in deterministic cases. ## Technical Details This PR introduces a new launcher interface for the FMHA backward operation, replacing direct function calls with a more structured approach. The launcher encapsulates kernel dispatch logic and provides access to computed metadata like the number of dQ acc splits. Changes: - Added `fmha_bwd_launcher` class that wraps kernel execution and exposes `dq_acc_splits` - Moved `fmha_bwd_traits` construction earlier in the execution flow to support launcher initialization - Refactored code generation to produce both legacy API and new launcher constructor ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-04 01:21:07 +00:00
Brock Hargreaves	2a16d53cce	[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502) [CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Still addressing errors that are blocking the merge of TheRock PR: https://github.com/ROCm/TheRock/actions/runs/22545831304/job/65308264096?pr=3382 ## Technical Details 1. There are multiple fmha python scripts that are writing native paths which are confusing cmake. I addressed one of these in an earlier PR https://github.com/ROCm/rocm-libraries/pull/4812 and now I'm addressing more that are exposed with gfx1200 target: ``` [composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library): [composable_kernel configure] Syntax error in cmake code when parsing string [composable_kernel configure] [composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp [composable_kernel configure] [composable_kernel configure] Invalid character escape '\b'. ``` 2. In the following compiler error we see gemm_prec_str<ADataType, BDataType> being passed as a function to concat(...), instead of being evaluated with the parenthesis operator(), i.e., gemm_prec_str<ADataType, BDataType>(). There are multiples instances of this, I wonder what non-msvc compilers do here: ``` [composable_kernel] FAILED: [code=1] example/ck_tile/38_block_scale_gemm/CMakeFiles/tile_example_gemm_quant.dir/gemm_bquant_quantgrouped_mx_bf16bf8.cpp.obj [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_mx_bf16bf8.cpp:4: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm\run_gemm_quant_example.inc:17: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:7: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/concat.hpp:119:21: error: implicit conversion between pointer-to-function and pointer-to-object is a Microsoft extension [-Werror,-Wmicrosoft-cast] [composable_kernel] 119 \| ((oss << sep << rest), ...); [composable_kernel] \| ^~~~ [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp:248:16: note: in instantiation of function template specialization 'ck_tile::concat<char, char[11], std::basic_string<char> (), std::basic_string<char>>' requested here [composable_kernel] 248 \| return concat('_', "gemm_quant", gemm_prec_str<ADataType, BDataType>, GemmPipeline::GetName()); [composable_kernel] \| ^ ``` There are plenty of other places where we use gemm_prec_str with the operator(), so I'm pretty sure these were just typos...but I'd like some eyes on it. 3. There are 2 tests that fail to build on Windows, which I've excluded from the build but will open bug tickets for: 1. gemm_weight_preshuffle 2. grouped_gemm_preshuffle Here's a sample of the compiler error for these tests: ``` [composable_kernel] [16/19] Building HIP object test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] FAILED: [code=1] test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] E:\TheRock\build\core\clr\dist\lib\llvm\bin\clang++.exe -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_OCP_FP8 -DCK_USE_WMMA -DCK_USE_WMMA_FP8 -DCK_USE_XDL -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -IE:/TheRock/rocm-libraries/projects/composablekernel/profiler/include -IE:/TheRock/rocm-libraries/projects/composablekernel -IE:/TheRock/rocm-libraries/projects/composablekernel/library/include -IE:/TheRock/rocm-libraries/projects/composablekernel/include -IE:/TheRock/build/ml-libs/composable_kernel/build/include -IE:/TheRock/build/base/half/stage/include -isystem E:/TheRock/build/core/clr/dist/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/getopt-src/src -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx1200 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -DCK_TILE_USE_OCP_FP8 -MD -MT test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -MF test\ck_tile\grouped_gemm_preshuffle\CMakeFiles\test_ck_tile_grouped_gemm_preshuffle.dir\test_grouped_gemm_preshuffle.cpp.obj.d -o test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -x hip -c E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp:8: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:6: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/check_err.hpp:16: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core.hpp:89: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core/utility/env.hpp:110:31: warning: 'getenv' is deprecated: This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [-Wdeprecated-declarations] [composable_kernel] 110 \| const char* vp = std::getenv(name); [composable_kernel] \| ^ [composable_kernel] C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h:1183:20: note: 'getenv' has been explicitly marked deprecated here [composable_kernel] 1183 \| _Check_return_ _CRT_INSECURE_DEPRECATE(_dupenv_s) [composable_kernel] \| ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:368:55: note: expanded from macro '_CRT_INSECURE_DEPRECATE' [composable_kernel] 368 \| #define _CRT_INSECURE_DEPRECATE(_Replacement) _CRT_DEPRECATE_TEXT( \ [composable_kernel] \| ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:358:47: note: expanded from macro '_CRT_DEPRECATE_TEXT' [composable_kernel] 358 \| #define _CRT_DEPRECATE_TEXT(_Text) __declspec(deprecated(_Text)) [composable_kernel] \| ^ [composable_kernel] clang++: error: clang frontend command failed due to signal (use -v to see invocation) [composable_kernel] AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git a2dc42b87c63e686377a69f09ea23aec7550babc+PATCHED:e4d5bf498b7b8626bb9716f1f5a5946d45025918) [composable_kernel] Target: x86_64-pc-windows-msvc [composable_kernel] Thread model: posix [composable_kernel] InstalledDir: E:\TheRock\build\core\clr\dist\lib\llvm\bin [composable_kernel] clang++: note: diagnostic msg: Error generating preprocessed source(s). [composable_kernel] ninja: build stopped: subcommand failed. [composable_kernel FAILED WITH CODE 1 in 238 seconds] ninja: build stopped: subcommand failed. ``` ## Test Plan Wait for internal CI and make sure build compiles locally. ## Test Result Waiting on CI ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-03 21:55:14 +00:00
Kiefer van Teutem	b042e1805a	[rocm-libraries] ROCm/rocm-libraries#4804 (commit 832dd0e) Add Tile Distribution Encoding Register Mapping debug utility for MFMA / WMMA unification work. (#4804) ## Motivation This PR adds a small utility that allows you to use Tile Distribution Encodings to directly map matrix elements to register locations and vice versa. It can also print forward and backward layout mappings similar to the Matrix Calculator utility. The utility is not meant for index calculations in actual kernels, but rather as a debugging tool and probably for automated verification of the policy structs in the new WMMA / MFMA unification design. ## Technical Details Tile Distribution Encodings are a core part of CK Tile which can define the relationship between register and intrinsic matrix fragment elements. They allow for any mapping based on unmerge and merge transformations. Also, they allow for a special "Repeat" dimensions which acts like an additional matrix dimension and allows for replication of certain matrix elements. The new mapping utility can deal with all aspects. ## Test Plan Since this is a debug utility there is nothing to directly test, but there is an example file that defines four different Tile Distribution Encodings and prints their forward and backward mappings, along with some extra parameters. ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-02 16:42:42 +00:00
Linjun-AMD	78ae3835a6	[rocm-libraries] ROCm/rocm-libraries#4313 (commit 080ac66) [CK] Fix gptoss sink ## Motivation This PR removes conditional logic for handling infinity values in the sink mechanism across multiple FMHA pipeline implementations, defaulting sink_size to 0 and adding a constraint in the kernel selection logic. ## Technical Details Changes: Removed __builtin_isinf_sign(sink_v) checks and conditional initialization of LSE accumulators across 7 pipeline files Added default initialization (= 0) for sink_size in 4 argument structs Added F_sink == "f" constraint to kernel compatibility checking ## Test Plan Local test ## Test Result passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-02 01:54:46 +00:00
Andriy Roshchenko	b661eab573	[rocm-libraries] ROCm/rocm-libraries#4821 (commit 9456e0f) [CK TILE] Refactor MX FLATMM example Refactor the MX FLATMM example to support more pipelines across different architectures. This work facilitates the NPI team roadmap.	2026-02-27 23:21:39 +00:00
Aviral Goel	c8a8449eec	[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961) [CK] Add split-K support for ABQuantGrouped in block_scale_gemm (#4816) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Changes ### Split-K support in `gemm_quant_kernel.hpp` - `SplitKBatchOffset`: Added `aq_group_offset` and `aq_k_split_offset` fields (mirroring the existing `bq_` fields for B) to track each split-K batch's position within the AQ scale tensor. For `ABQuantGrouped`, both offsets are computed from `k_id KRead` divided by `AQuantGroupSize::kK`. - `MakeAQBlockWindow`: Added an `aq_group_offset` parameter (defaulting to 0 for non-split-K paths) so the AQ tensor view's K-group dimension reflects only the remaining K-groups from the split-K offset, consistent with how `MakeBQBlockWindow` handles the BQ tensor. - `RunGemm`: Threads the `aq_k_split_offset` through to `MakeAQBlockWindow` when in split-K mode. ### Constraints in `IsSupportedArgument()` Four constraints gate split-K (`k_batch > 1`) for ABQuantGrouped: 1. Mode check — split-K is only allowed for `BQuantGrouped` (no preshuffle) or `ABQuantGrouped` (no `APreshuffleQuant`). Any other quant mode with `k_batch > 1` returns `false`. 2. B quant group alignment — `KRead` (per-batch K slice) must be divisible by `BQuantGroupSize::kK`. Each batch must operate on complete B quantization groups; a partial group would require splitting a scale value across batches. 3. A quant group alignment (new, ABQuantGrouped only) — `KRead` must also be divisible by `AQuantGroupSize::kK` for the same reason applied to the AQ scale tensor. 4. Minimum 2 K-tile iterations per batch (new) — The software-pipelined GEMM kernels (CompV3 family) prefetch one tile ahead, so they require `per_batch_num_loop = KRead / KPerBlock >= 2`. When `KRead == KPerBlock` (i.e. each batch is exactly one tile), the prefetch reads into the next batch's memory region and produces incorrect results. Configurations where `K == k_batch * KPerBlock` are therefore rejected. ### Example update (`run_gemm_quant_example.inc`) Updated the comment above the `IsSupportedArgument` call to document that split-K is now supported for both `BQuantGrouped` (no preshuffle) and `ABQuantGrouped` (no `APreshuffleQuant`). ## Unit Tests Two new test files covering decode and prefill tile shapes across a range of `k_batch` values (2–8), data types (FP8, BF8), and quantization group sizes (1×1×128 and 1×128×128 for B): - `test_gemm_quant_abquant_splitk_decode.cpp` — uses the decode tile shape (M=16, N=64, K_tile=256) - `test_gemm_quant_abquant_splitk_prefill.cpp` — uses the prefill tile shape (M=128, N=128, K_tile=128) Each test calls `run_test_with_validation` which runs the kernel and checks correctness against a CPU reference. Configurations excluded from tests are annotated with comments explaining which constraint they violate (typically the `per_batch_num_loop >= 2` requirement). ## Prerequisites This PR depends on #4429, which must be merged before this can be merged.	2026-02-26 23:57:17 +00:00
Yung-sheng Tu	75aea70c2c	[rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f) Implement device_grouped_gemm_fixed_nk_bias for RDNA4 ## Proposed changes Summary: - Modified implementation for grouped_gemm_fixed_nk_bias - FP16 WMMA examples - WMMA instances - Profiler for grouped_gemm_fixed_nk_bias - Add WMMA instances to existing tests This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299 and should be merged after it. Only the last 6 commits are in the scope of this PR. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [x] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-26 00:28:58 +00:00
Bartłomiej Kocot	eede24de0d	[rocm-libraries] ROCm/rocm-libraries#4872 (commit ca623f7) [CK] Small improvements for grouped conv backward weight (#4872) ## Motivation Improvements for CK Tile convolution builder run function and atol/rtol calculations. ## Technical Details - Add preprocessing function for wrw when k_batch is larger than 1 for builder run function - Divide num acums by number of groups to get real number of accums ## Test Plan CI wrw tests ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-783	2026-02-25 20:11:01 +00:00
Brock Hargreaves	c90a363589	[rocm-libraries] ROCm/rocm-libraries#4812 (commit bb5a4dd) [CK] Use as_posix() instead of str() for paths in fmha_fwd_appendkv.py (#4812) ## Motivation This is causing a failing PR for Windows: https://github.com/ROCm/TheRock/pull/3382 ``` [composable_kernel configure] -- Jenga kernel files to be generated: B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_api.cpp [composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library): [composable_kernel configure] Syntax error in cmake code when parsing string [composable_kernel configure] [composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp [composable_kernel configure] [composable_kernel configure] Invalid character escape '\b'. ``` ## Technical Details The file: [fmha_fwd_appendkv.py](https://github.com/ROCm/rocm-libraries/compare/users/brockhargreaves-amd/ck/fix-windows-cmake-path-problem?expand=1#diff-bef22bf9ba21eb93c725493ecc7edcb6f2a8f0a9a173dcfca6bda7a9f4eced78) writes a bunch of paths to a text file which is later parsed by cmake. When passing a pathlib.Path to str(), str() converts to a native path, in this case / to \\ on Windows which confuses cmake. In this case we need to write paths with forward slashes and then pass those onward to cmake. ## Test Plan 1. Ensure this doesn't impact existing CI. 2. Ensure compilation of Windows pass locally. ## Test Result 1. Passes existing CI 2. This fixes the compilation error locally. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-25 16:13:56 +00:00
Brock Hargreaves	abf13bdec1	[rocm-libraries] ROCm/rocm-libraries#4819 (commit b995a0b) [CK] Fix windows build issues ## Motivation Full build on Windows is currently broken due to compiler errors, this PR should help fix that. This is also holding up the following PR in the TheRock: https://github.com/ROCm/TheRock/pull/3382 ## Technical Details 1. I don't see a good reason to be nesting a windows include inside the ck_tile namespace. It was causing compiler errors too: Windows.h comes with min and max, which was conflicting with ck_tile::min and ck_tile::max, so I moved it out. I also defined NOMINMAX to prevent this inclusion in the future. 2. The TRUE/FALSE macros are already used by Windows.h, which causes an error. So I've opted for True/False. You can see this pattern in other rocm-libraries. 3. The M_PI macro isn't available, at least in the WIN32_LEAN_AND_MEAN context, from \<cmath\> on Windows. We'll be able to use std::numbers::v_pi\<float\> when we have C++20 support. 4. There was a missing \<chrono\> include. ## Test Plan Test locally and make sure this doesn't impact existing CI. ## Test Result Compiles locally and passes existing ci. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-25 16:13:13 +00:00
Zoltán Lakatos	a32d704d89	[rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f) [CK] Implement device grouped gemm fixed nk multi abd for rdna4 (#4425) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Motivation Add support for grouped gemm multi ABD fixed NK. MR ## Technical Details Changes from the reverted PR: - Device struct for grouped gemm with multiple ABD and fixed NK (DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK). - Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD - Unit tests for both new wmma implementation and the reference xdl code (previously missing) - Note: Some Xdl instances were commented out because of unit test failures. As mentioned apparently for xdl this feature was missing tests so our assumption is either there is an implemenetation bug or these instances were not set up correctly. Has the potential for a follow-up issue. - Generic ck profiler interface with the purpose of calling unit tests. - Gemm instances with specific elementwise operations for gemm bias gelu calculations. - Added class for grouped gemm multi ABD reference calculations. Fix epilogue selection in device implementation that caused unit test failures ## Test Plan Covered by added unit tests ## Test Result CI successfully passing ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-02-25 05:17:08 +00:00
Enrico Degregori	4c626aeaa6	[rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e) [CK_TILE] Extend support of mix precision microscaling BQuant (#4267) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ## Proposed changes Supported types combinations using BQuant=e8m0: - A=bf16 - B=bf16,bf8,fp4 Summary: - remove usage of `pk_fp4_raw_t`: consistent with other implementations and avoid taking into account of the packed size explicitly. In general, the raw type should not be used because CK Tile internally takes care of the PackedSize, so using the raw type adds unnecessary complexity to the implementation - handle microscaling by checking for `e8m0` type for BQuant (previous implementation was inconsistent) - add support for scaling instructions in `DequantPack8` - mx pipeline: - extend existing pipeline to support different B types - add support to scale and cast before writing to LDS or after reading from LDS (this can be defined in the `Problem` by the user) - block gemm: - mx pipeline is now using block gemm BQuant - block gemm BQuant can now load from LDS and apply scale and then call block gemm universal operator. This adds new functionalities and remove code duplication - warp gemm: - add case to support 128bit ds_read/write for both A and B when A=16bit and B=8bit - add examples and tests: note that some tests for bf16/fp4 already existed but were removed during previous tests refactoring. I added them again and other relevant tests for new types combinations ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered	2026-02-24 17:57:02 +00:00

1 2 3 4 5 ...

1009 Commits