mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-29 19:28:33 +00:00
new_develop
111 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
6a9c03f692 |
[rocm-libraries] ROCm/rocm-libraries#7450 (commit 402dbad)
[CK_TILE] Use Persistent Scheduling for FMHA BWD Group Deterministic (#7450) ## Motivation FMHA BWD group-mode deterministic currently uses a non-persistent scheduler: each `(batch, head, K-row)` work-item is launched as its own block, with no work-stealing across CUs. On uneven workloads (varlen, GQA, many heads with few K-rows) this leaves CUs idle and forces a larger dq_acc workspace than necessary. This PR ports the persistent + deterministic scheduling already used in batch mode to group mode: a fixed-grid kernel that pre-computes per-CU work ranges on the host and uses sparse dq_acc slot indexing so multiple K-rows handled by the same CU share one accumulator slot via intra-CU atomic adds. Stacked on #7331; merge that first. ## Technical Details Single file changed: `ops/fmha/kernel/fmha_bwd_kernel.hpp`. A new `kUsePersistent` path is added to the group-mode deterministic kernel, mirroring the batch-mode persistent scheduler. The host pre-computes a fixed per-CU partition of the total `(batch, head, K-row)` work and packs it into `cu_states[]` so the GPU consumes it in a single launch. Host preparation happens in four steps: 1. Build per-batch `seqstart` prefix sums. 2. Fill per-batch `(sq_w, nc)` with a placeholder `nsplits` (bumped in step 3). 3. Two-pointer scan over CUs to fill `cu_states[c]` (`isplit`, `head_start`, `c_start`, `w_lo`, `w_hi`), accumulating `nsplits[b]` as `max(cs->isplit + 1)`. 4. Compute compact per-batch dq_acc offsets from the finalized `nsplits`. `isplit` is the sparse dq_acc slot index — one CU's multi-K-row writes share slot `ceil(wc_start / denom)`, enabling intra-CU atomic accumulation instead of one slot per K-row. `denom = max(sq_w, target_w)`, splitting two regimes: - `target_w >= sq_w` (large work): `denom = target_w`, intra-CU atomic optimization engaged. 
- `target_w < sq_w` (sub-K-row sharding, multiple CUs sharing one K-row): `denom = sq_w` collapses to per-K-row indexing (`= c_start`), keeping `isplit ∈ [0, nc-1]` and matching the `nsplits_max = ceil(s_k/kN0) = nc` upper bound that #7331's `GetWorkspaceDeviceSizeUpperBound` assumes for group+det. `isplit` is additionally clamped to `nc-1` to absorb empty CUs (rounded-up `wc_start` past the last K-row); they don't write dq_acc on GPU so the slot value is harmless. `nsplits[b]` is accumulated dynamically in step 3 rather than via a closed form so it tightly matches the actual sparse slots used; step 4 (offsets) follows step 3 since offsets now depend on the dynamic `nsplits`. Group mode also allows batches with `seqlen_q == 0`. The persistent scheduler skips them on the dQ path (no work) but dK/dV are still zero-filled. ## Test Plan Built `tile_example_fmha_bwd` with receipt 5 (fp16, no-bias, no-dropout, `dpad == dvpad`, group + batch) on gfx950 (MI355X). - 8-case smoke (shapes that exercise the sub-K-row regime). - 44-case sweep covering: mask 0/1/2, GQA, var seqlen, `d != d_v`, extreme small seqlen / `nc=1`, CU >> work, huge batch, batch-mode regression. - 12-case perf comparison vs the non-persistent baseline (warmup=10, repeat=50). ## Test Result - All 8 + 44 cases `valid:y`. - Perf: ±5% noise, average -0.4% across the 12 cases — neutral. - Batch-mode deterministic / non-deterministic regression unchanged. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
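The sparse slot indexing described for #7450 above can be sketched as follows. This is a minimal illustration, not the kernel's code: the helper name `sparse_dq_acc_slot` is hypothetical, and it assumes `wc_start` is the work offset at which a CU's range starts within one `(batch, head)`, `sq_w` is the work of one K-row, `target_w` is the per-CU work target, and `nc` is the number of K-rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of CK): computes the dq_acc slot index
// ceil(wc_start / denom) with denom = max(sq_w, target_w), clamped to nc - 1.
// Assumed meanings: wc_start = start of a CU's work range within one
// (batch, head), sq_w = work per K-row, target_w = per-CU work target,
// nc = number of K-rows.
inline std::int64_t sparse_dq_acc_slot(std::int64_t wc_start,
                                       std::int64_t sq_w,
                                       std::int64_t target_w,
                                       std::int64_t nc)
{
    assert(sq_w > 0 && target_w > 0 && nc > 0);
    // target_w >= sq_w: several K-rows handled by one CU share a slot.
    // target_w <  sq_w: denom == sq_w, so the slot is the K-row index.
    const std::int64_t denom = std::max(sq_w, target_w);
    const std::int64_t slot  = (wc_start + denom - 1) / denom; // ceiling division
    // Empty CUs can start past the last K-row; clamp keeps the index in range.
    return std::min(slot, nc - 1);
}
```

With `sq_w = 100` and `target_w = 25` (sub-K-row sharding), a CU starting at `wc_start = 200` maps to slot 2, which is the K-row index. With `sq_w = 10` and `target_w = 40`, a CU starting at `wc_start = 80` also maps to slot 2, even though that offset corresponds to K-row 8.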
|
|
e02c566795 |
[rocm-libraries] ROCm/rocm-libraries#7612 (commit 5427d24)
[CK] upgrade CI to rocm7.13 as default compiler (#7612) ## Motivation Upgrade the default docker and compiler version in CI to rocm7.13. To pass all the checks I also had to clean up many non-ASCII characters in the source code comments and modify a couple of tests that were affected by new compiler logic. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Aviral Goel <aviral.goel@amd.com> |
||
|
|
fc2862d712 |
[rocm-libraries] ROCm/rocm-libraries#6846 (commit 377def4)
[CK_TILE] Add fmha forward hdim 256 support (#6846) ## Motivation Enable Composable Kernel FMHA forward kernel for **hdim=256 BF16** on AMD gfx950 (MI350X). Prior to this change the (256, 256) head-dim configuration either failed to compile, was filtered out by the compatibility rules, or produced incorrect kernel output due to an LDS layout accounting bug. ## Technical Details Four files changed, all to enable hdim=256 BF16 on gfx950. - **`fmha_fwd.py`** — Allow `(256, 256)` in gfx950 compatibility rule; set `(256,256)` BF16 tile to `M0=128, N0=64` (the LDS-feasible shape on gfx950); emit minimal valid instance set for d=256 to bound compile time. - **`fmha_fwd_kernel.hpp`** — Gate Prefill launch path off for d=256 (`PrefillCase = kM0 > 64 && kQKHeaddim < 256`); the double-buffer Prefill variant overflows the 160 KB LDS budget. - **`trload_policy.hpp`** — **Critical correctness fix**: the LDS layout accounting in `GetSmemSize` was wrong (`max(Q, K+S+V)` instead of `max(Q, K) + V + S`), under-allocating LDS and silently corrupting d=256 output (~2% wrong values). - **`trload.hpp`** — Thread `LoadOnce=true` through all d=256 K-LDS descriptors so the compiler picks the matching XOR swizzle period; recompute the S-tile LDS offset to match the corrected `GetSmemSize` formula. 
## Test Plan Built and ran `tile_example_fmha_fwd` on gfx950 (MI350X) with the canonical d=256 BF16 configurations: ```bash cd build && ninja tile_example_fmha_fwd ./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=1 -h=32 -h_k=2 -s=1024 -s_k=1024 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3 -repeat=10 -kname=1 -v=1 ./bin/tile_example_fmha_fwd -prec=bf16 -d=256 -d_v=256 -b=8 -h=32 -h_k=2 -s=16384 -s_k=16384 -bias=n -mask=t -lse=0 -p_drop=0 -warmup=3 -repeat=10 -kname=1 -v=1 ``` ## Test Result ```bash -b=1 -s=1024 [bf16|batch|bhsd] b:1, h:32/2, s:1024/1024, d:256/256, scale_s:0.0625, bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r, fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink, 0.058 ms, 298.42 TFlops, 618.68 GB/s, valid:y -b=8 -s=16384 [bf16|batch|bhsd] b:8, h:32/2, s:16384/16384, d:256/256, scale_s:0.0625, bias:n, p_drop:0, lse:0, qscale:n, mask:t(-1:0), v:r, fmha_fwd_d256_bf16_batch_b128x64x32x256x32x256_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_trload_vr_psddv_nlogits_nbias_mc_nlse_ndropout_nskip_nqscale_ntrload_nsink, 42.797 ms, 822.18 TFlops, 106.63 GB/s, valid:y ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com> |
||
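The LDS accounting fix described for #6846 above comes down to two formulas that can be compared directly. The sketch below is illustrative only: `LdsTileBytes` and both function names are hypothetical, and the byte counts are made-up values chosen to show the under-allocation, not the real tile footprints from `trload_policy.hpp`.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical per-tile LDS footprints in bytes (illustrative values only).
struct LdsTileBytes
{
    std::size_t q, k, v, s;
};

// Formula the commit message reports as wrong: max(Q, K+S+V).
constexpr std::size_t smem_size_old(LdsTileBytes t)
{
    return std::max(t.q, t.k + t.s + t.v);
}

// Formula the commit message reports as corrected: max(Q, K) + V + S.
constexpr std::size_t smem_size_fixed(LdsTileBytes t)
{
    return std::max(t.q, t.k) + t.v + t.s;
}

// Made-up example where Q dominates K: the old formula under-allocates.
constexpr LdsTileBytes kExample{64 * 1024, 32 * 1024, 32 * 1024, 16 * 1024};
static_assert(smem_size_old(kExample) == 80 * 1024, "old formula: 80 KiB");
static_assert(smem_size_fixed(kExample) == 112 * 1024, "fixed formula: 112 KiB");
```

Since `max(Q, K) + V + S` is at least `Q` and at least `K + S + V`, the corrected size is never smaller than the old one. The two differ whenever `Q` exceeds `K`, or equal the sum `K + S + V` only when `K >= Q`.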
|
|
717f2efef7 |
[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)
[CK] add composable kernel support on gfx1250 (#6978) ## Motivation Add composable kernel support on gfx1250. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Qun Lin <qlin@amd.com> Co-authored-by: jialuo12_amdeng <jia.luo@amd.com> Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com> Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com> |
||
|
|
af7118e342 |
[rocm-libraries] ROCm/rocm-libraries#7331 (commit 5692db0)
[CK_TILE] Add async workspace prepare to FMHA BWD launcher (#7331) ## Motivation `aiter::mha_bwd` in group mode currently issues two synchronous `hipMemcpy` D2H copies to read `seqstart_q/k` for launcher construction. These sync copies block the host (~10–30 µs each) and implicitly synchronize the device by draining the stream, breaking CPU/GPU overlap on hot training paths. This PR adds a fully stream-async workspace preparation path on the FMHA BWD launcher so callers can pre-allocate the device workspace from upper-bound shapes and stage seqstart-dependent metadata via D2H/host-pack/H2D entirely on the user's stream. ## Technical Details - `FmhaBwdWorkspaceManager::GetWorkspaceDeviceSizeUpperBound` (`include/ck_tile/ops/fmha/kernel/fmha_bwd_kernel.hpp`): computes the worst-case device dq_acc size from `(max_batch, hdim_q, nhead_q, max_seqlen_q, max_seqlen_k)` without dereferencing any seqstart array. Mirrors `PrepareWorkspaceHost`'s return value with worst-case bounds. - `fmha_bwd_launcher::prepare_workspace_async` (`example/ck_tile/01_fmha/fmha_bwd.hpp`): on the caller's stream, in order: 1. `hipMemsetAsync` of the dq_acc region (when `NeedsZeroDqAcc()`) 2. group mode: `hipMemcpyAsync` D2H of `seqstart_q/k` into a pinned host staging buffer 3. `hipLaunchHostFunc` runs `PrepareWorkspaceHost` on the pinned buffer 4. `hipMemcpyAsync` H2D of the packed metadata into `device_ws_ptr` The pinned staging buffer is held via `std::shared_ptr<void>` returned by a caller-provided `pinned_host_alloc` callback. Lifetime is extended past stream completion by a tail `hipLaunchHostFunc` scheduled in the launcher's destructor. - `ck_tile::pinned_host_releaser` (`include/ck_tile/host/pinned_host_releaser.hpp`): worker-thread utility for callers using bare `hipHostMalloc`. Defers `hipHostFree` off the HIP driver callback thread, which holds runtime locks and would deadlock against concurrent main-thread `hipFree`. PyTorch's `CachingHostAllocator` does not need this. 
- Example runner (`example/ck_tile/01_fmha/fmha_bwd_runner.hpp`): switched to the async path. ## Test Plan - `tile_example_fmha_bwd` (gfx950, dev preset `-Werror -Weverything`): - batch + nondet / batch + det / group + nondet / group + det - group + det 4-batch varlen (`-b=4 -h=8 -s=4096,3072,2048,1024 -d=128`) - FA (`flash-attention`) integration on ROCm 7.1.1 + PyTorch 2.9.1: - `tests/test_flash_attn_ck.py::test_flash_attn_varlen_deterministic` - `tests/test_flash_attn_ck.py::test_flash_attn_bwd_varlen_seqq_zero` ## Test Result - All CK runner cases `valid:y`. - FA pytest: **1952 passed in 44.82s**. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
370c7d762b |
[rocm-libraries] ROCm/rocm-libraries#7141 (commit 37e40c3)
[CK_TILE] Fix typo in fmha_fwd_kernel K-dram unmerge tuple sizes (#7141) ## Summary The qr_async_trload K-dram lambda's `else (XorLengthFold == 1)` branch in `fmha_fwd_kernel.hpp` writes the outer-tile dim of its 3-tuple unmerge/xor/merge as ```cpp number<FmhaPipeline::kQKHeaddim / kDramTileK / FmhaPipeline::kAlignmentK>{} ``` which divides one extra time. For every fp16/bf16 hdim=128 configuration the outer length collapses to **0**, e.g. `128 / 128 / 8 == 0`. The 3-tuple product no longer equals `kQKHeaddim`, so unmerge → xor → merge stops round-tripping the head dimension. This bug was masked by the async-load path: it only walks the descriptor via stride and silently absorbs a length=0 outer dim. Any consumer that actually traverses the descriptor (e.g. the TDM path on gfx1250) immediately faults on the resulting `tuple<int, constant<0>>`. The fix drops the extra `/ kAlignmentK` in all three call sites in the same lambda so the outer dim becomes `kQKHeaddim / kDramTileK` and the product is restored to `kQKHeaddim`. Strides are unaffected, so the async path is bit-identical. | Config (fp16/bf16) | hdim | kDramTileK | kAlignmentK | a (typo) | a (fixed) | product (typo) | product (fixed) | |---|---|---|---|---|---|---|---| | hdim128, kKLoadOnce | 128 | 128 | 8 | 0 | 1 | **0** | **128** | | hdim128, kK0=32 | 128 | 32 | 8 | 0 | 4 | **0** | **128** | | hdim64, kKLoadOnce | 64 | 64 | 8 | 0 | 1 | **0** | **64** | | hdim256, kK0=32 | 256 | 32 | 8 | 1 | 8 | **32** | **256** | Bug introduced in 2cc0af6a815a (PR #2888 "[CK_TILE] FMHA FWD bug fix"), where the original 2-tuple unmerge was generalized to a 3-tuple and the typo slipped in. ## Test plan - [x] Built `test_ck_tile_fmha_fwd` (umbrella, 5 gtest binaries) on gfx950 native at develop b3bdc63a509 with `dev-gfx950` preset (clang 22, ROCm 7.2.2). Compiles cleanly with `-Werror -Weverything`. 
- [x] Ran `ctest -R test_ck_tile_fmha_fwd` on gfx950 native, baseline vs patched: identical pass/fail (3 pass / 2 fail), identical failing case set (114 gtest fails + 2 GPU memory access faults, all in pre-existing fp16/bf16 group-mode `Alibi`/`Dropout` cases that reproduce on develop without this patch). Total wall time 403s → 393s. Per-case latency drift ±8% (noise). - [x] CI to verify on other gfx9 / gfx11 architectures. |
||
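The table in #7141 above follows from plain integer division, which can be checked at compile time. This is a minimal sketch with hypothetical function names. It assumes, consistent with the table's product columns, that the remaining two tuple lengths multiply to `kDramTileK`, so the full product is the outer length times `kDramTileK`.

```cpp
// Outer-tile length as written with the typo: one division too many.
constexpr int outer_len_typo(int hdim, int dram_tile_k, int alignment_k)
{
    return hdim / dram_tile_k / alignment_k;
}

// Outer-tile length after the fix described in the commit message.
constexpr int outer_len_fixed(int hdim, int dram_tile_k)
{
    return hdim / dram_tile_k;
}

// Assumption: the other two tuple lengths multiply to dram_tile_k.
constexpr int tuple_product(int outer_len, int dram_tile_k)
{
    return outer_len * dram_tile_k;
}

// hdim128, kKLoadOnce: 128 / 128 / 8 truncates to 0.
static_assert(outer_len_typo(128, 128, 8) == 0, "typo collapses outer dim");
static_assert(tuple_product(outer_len_fixed(128, 128), 128) == 128, "fixed round-trips hdim");
// hdim256, kK0=32: the typo gives a nonzero but still wrong product.
static_assert(tuple_product(outer_len_typo(256, 32, 8), 32) == 32, "typo product");
static_assert(tuple_product(outer_len_fixed(256, 32), 32) == 256, "fixed product");
```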
|
|
2c677e8471 |
[rocm-libraries] ROCm/rocm-libraries#6152 (commit 36b016a)
[CK_TILE] Use Unified Workspace for FMHA BWD (#6152) ## Motivation `dq_acc` is the intermediate accumulation buffer used in FMHA backward pass for deterministic mode. The current implementation allocates it as a **single rectangular tensor**: ``` shape = [shape_batch, nhead, nsplits, shape_seqlen_q, hdim_q] ``` where `nsplits = launcher.dq_acc_splits` (a single scalar), computed from `max_seqlen_k` and shared across all batches. ### Problems 1. **Memory waste**: In group mode, each batch may have a different `seqlen_k`, but `nsplits` is computed from `max_seqlen_k`, causing batches with shorter `seqlen_k` to over-allocate in the split dimension. 2. **Interface coupling**: `fmha_bwd_args` exposes internal layout details such as `stride_dq_acc`, `nhead_stride_dq_acc`, `batch_stride_dq_acc`, and `split_stride_dq_acc`. The caller is responsible for computing these strides, but this logic belongs inside the kernel. ### Goals 1. Switch `dq_acc` buffer to a **compact layout**: batches are concatenated contiguously, with each batch occupying `nhead * nsplits_i * seqq_i * hdim_q` elements (nhead outermost). 2. **Remove all `*_stride_dq_acc` fields** from `fmha_bwd_args`, replacing them with a single `workspace_ptr`; the kernel splits this internally using a fixed layout. 3. `fmha_bwd_launcher` provides a **workspace management interface**: the caller only needs to allocate GPU memory and call `prepare_workspace()`; no layout computation is required. 4. **Isolate kernel internals from the caller API**: the `dq_acc` layout (nsplits, strides, buffer size) is determined entirely inside the launcher/kernel. Future changes to block shape, pipeline type, or persistent kernel strategy require no modifications to the caller's `fmha_bwd_args` or workspace allocation logic. 
## Technical Details ### Interface Design #### New fields in `fmha_bwd_traits` ```cpp struct fmha_bwd_traits { int seqlen_q; int seqlen_k; int batch; int max_seqlen_q; int max_seqlen_k; int hdim_q; int hdim_v; int nhead_q; int nhead_k; std::string data_type; bool is_group_mode; mask_enum mask_type; bias_enum bias_type; bool has_dbias; bool has_dropout; bool is_store_randval; bool is_deterministic; // New: cumulative physical seqlen pointers for group mode (pass nullptr for batch mode). // seqstart_qs[i+1] - seqstart_qs[i] = physical seqlen_q of batch i (including padding); length = batch+1 // seqstart_ks[i+1] - seqstart_ks[i] = physical seqlen_k of batch i (including padding); length = batch+1 const int* seqstart_qs = nullptr; const int* seqstart_ks = nullptr; }; ``` #### `fmha_bwd_launcher` actual structure ```cpp struct fmha_bwd_launcher { std::function<float(fmha_bwd_args, const ck_tile::stream_config&)> run{}; // Total workspace size in bytes (host_ws_size + device_ws_size), computed by init(). // Zero for kUseQrQtrDorPipeline (writes dq directly, no acc buffer needed). size_t workspace_size = 0; fmha_bwd_launcher(const fmha_bwd_traits&); // Copies auxiliary data (nsplits[], offsets[]) via hipMemcpy to the head of the GPU workspace, // and zeros the dq_acc buffer portion (tail of workspace) if required. // The memory pointed to by device_ws must be >= workspace_size bytes. std::function<void(void* device_ws)> prepare_workspace{}; template <typename... Args> float operator()(Args&&... 
args) const { return run(std::forward<Args>(args)...); } private: size_t host_ws_size = 0; // CPU workspace size (nsplits[] + offsets[] arrays) size_t device_ws_size = 0; // GPU-only data size (dq_acc buffer) std::unique_ptr<char[]> ws_host; // host-side workspace buffer public: template <typename T0, typename T1, typename T2, typename Arch> void init(const fmha_bwd_traits& traits); }; ``` The `init<>()` template method (invoked by codegen dispatch branches as `this->init<...>(t)`) is responsible for: 1. Setting the `run` lambda 2. Calling `FmhaBwdDQDKDVKernel::GetWorkspaceHostSize(batch)` to obtain `host_ws_size` 3. Allocating `ws_host` (host memory) 4. Calling `FmhaBwdDQDKDVKernel::PrepareWorkspaceHost(ws_host.get(), ...)` to fill nsplits/offsets; return value is `device_ws_size` 5. `workspace_size = host_ws_size + device_ws_size` 6. Setting the `prepare_workspace` lambda (captures `this`, calls `PrepareWorkspaceDevice`) When no kernel matches the given traits, both `run` and `prepare_workspace` are initialized to default lambdas that print a warning to `std::cerr` and return gracefully (no exception). 
#### Workspace overall layout The workspace is managed by `FmhaBwdWorkspaceManager` and consists of two segments: ``` Offset 0 (CPU-prepared segment, host_ws_size bytes; also hipMemcpy'd to the head of GPU workspace): index_t nsplits[batch or 1] — per-batch nsplits array group mode: batch elements batch mode / non-deterministic: 1 element [group mode only] long_index_t dq_acc_offsets[batch+1] — per-batch element offset (inclusive prefix sum) offsets[0]=0, offsets[i+1] = offsets[i] + nhead*nsplits_i*seqq_i*hdim_q Offset host_ws_size (device data segment, device_ws_size bytes): AccDataType dq_acc[total_elements] — compact dq_acc buffer (zeroed if required) total_elements = sum_i(nhead * nsplits_i * seqq_i * hdim_q) layout within each batch: [nhead, nsplits_i, seqq_i, hdim_q] note: seqq_i uses the physical length (including padding) ``` Alignment constant (`ALIGNMENT = 16`): ``` nsplits_size = align_up(sizeof(index_t) * N, 16) // N = batch (group) or 1 (batch/non-det) offsets_size = align_up(sizeof(long_index_t) * (batch+1), 16) // group mode only host_ws_size = nsplits_size + offsets_size dq_acc_offset = host_ws_size // GetDqAccDataOffset(batch) ``` **Key benefits**: - The kernel reads nsplits/offsets directly from the workspace head — no device-side recomputation. - `FmhaBwdConvertQGradKernel` is completely decoupled from the pipeline block shape (`kN0`): nsplits is read from `nsplits_ptr`, `kN0` is no longer a template parameter, and multiple dq_dk_dv tiles with different `F_bn0` values now share a single convert_dq kernel instance (under receipt 1/2, deterministic convert_dq kernel count drops from ~300 to 60). - nsplits/offsets are computed on the host and transferred in one `hipMemcpy`; the dq_acc buffer follows immediately, at the offset given by `GetDqAccDataOffset`. 
#### Workspace size by scenario | Scenario | `workspace_size` | Notes | |----------|-----------------|-------| | **kUseQrQtrDorPipeline** (any mode) | `0` | Writes dq directly; no acc buffer; `PrepareWorkspaceHost` returns 0 | | **Non-deterministic + batch mode** | `> 0` | nsplits[1]=1; dq_acc used for atomic add; `workspace_size = host_ws_size + batch*nhead*seqlen_q*hdim_q*ebytes` | | **Non-deterministic + group mode** | `> 0` | nsplits[1]=1; dq_acc contiguous layout; `workspace_size = host_ws_size + nhead*seqstart_qs[batch]*hdim_q*ebytes` | | **Deterministic + group mode** | `> 0` | nsplits[batch], offsets[batch+1], compact dq_acc; nsplits_i computed independently per batch | | **Deterministic + batch mode persistent** | `> 0` | nsplits[1] (uniform across batches); dq_acc `batch*nhead*nsplits*seqlen_q*hdim_q` | **NeedsZeroDqAcc** (determines whether `PrepareWorkspaceDevice` calls `hipMemset`): - Persistent kernel (deterministic batch mode) or non-deterministic: **must zero** (atomic add requires zero initialization) - Deterministic group mode + no mask: **no zeroing needed** (every tile writes its full region) - Deterministic + with mask: **must zero** (some blocks are skipped, leaving uninitialized tiles that would contribute to the reduction) #### Caller usage ```cpp // 1. Create launcher (traits include seqstart_qs/ks pointers; workspace_size is computed during construction) fmha_bwd_launcher launcher(fmha_traits); // 2. Read launcher.workspace_size directly const auto ws_size = launcher.workspace_size; // 3. Allocate a single GPU workspace ck_tile::DeviceMem ws_buf(ws_size); // 4. Copy nsplits/offsets to GPU head and zero dq_acc if required launcher.prepare_workspace(ws_buf.GetDeviceBuffer()); // 5. Build args with a single workspace pointer; the kernel splits it internally fmha_bwd_args args{ ..., ws_size > 0 ? 
ws_buf.GetDeviceBuffer() : nullptr, // workspace_ptr }; launcher(args, stream_config); ``` --- ### Key Code Structure #### FmhaBwdWorkspaceManager (`fmha_bwd_kernel.hpp`, new class) ```cpp template <typename AccDataType, bool kIsGroupMode, bool kIsDeterministic> struct FmhaBwdWorkspaceManager { static constexpr size_t ALIGNMENT = 16; // CPU workspace (nsplits + offsets) sizes static size_t GetDqAccSplitsSize(int batch); // align_up(sizeof(index_t)*N, 16) static size_t GetDqAccOffsetsSize(int batch); // group mode only: align_up(sizeof(long_index_t)*(batch+1), 16) static size_t GetWorkspaceHostSize(int batch); // = SplitsSize + OffsetsSize // Starting offset of dq_acc data within the full workspace (= host_ws_size) static size_t GetDqAccDataOffset(int batch); // = GetWorkspaceHostSize(batch) // Fills nsplits/offsets in the CPU workspace; returns device_ws_size (dq_acc buffer bytes) template <bool kUseQrQtrDorPipeline, index_t kN0> static size_t PrepareWorkspaceHost(void* cpu_ws, index_t batch_size, index_t hdim_q, index_t nhead_q, index_t seqlen_q, index_t seqlen_k, const index_t* seqstart_qs, const index_t* seqstart_ks); // hipMemcpy's cpu_ws to device_ws head; hipMemset's the dq_acc portion to 0 if required template <bool kUseQrQtrDorPipeline, bool kHasMask> static void PrepareWorkspaceDevice(void* device_ws, const void* host_ws, size_t device_ws_size, size_t host_ws_size); }; ``` #### workspace_ptr parsing (inside the kernel) The kernel parses three address regions from `kargs.workspace_ptr`: **Group mode (`FmhaBwdDQDKDVKernel::MakeKargs`)**: ```cpp const uint8_t* ws = reinterpret_cast<uint8_t*>(workspace_ptr); // dq_acc_ptr (stored in FmhaBwdCommonKargs) ws + WorkspaceManager::GetDqAccDataOffset(batch) // dq_acc_batch_offset_ptr (FmhaBwdGroupModeKargs field) reinterpret_cast<const long_index_t*>(ws + WorkspaceManager::GetDqAccOffsetsOffset(batch)) ``` **Batch mode**: ```cpp ws + WorkspaceManager::GetDqAccDataOffset(batch) // dq_acc_ptr // No offsets pointer; 
batch offset is computed inside run_() from nsplits ``` **`FmhaBwdConvertQGradKernel`** follows the same pattern: - Group mode: extracts `dq_acc_ptr`, `dq_acc_batch_offset_ptr`, and `nsplits_ptr` (`GetDqAccSplitsOffset(batch)`) from workspace - Batch mode: reads nsplits from `nsplits_ptr[0]`; batch offset computed internally ### Addressing in `run_()` (group mode) ```cpp // Per-batch processing: const long_index_t batch_offset_dq_acc = kargs.dq_acc_batch_offset_ptr[i_batch]; // seqq_i (physical length) derived from seqstart_q_ptr const index_t seqq_i = kargs.seqstart_q_ptr[i_batch+1] - kargs.seqstart_q_ptr[i_batch]; // nsplits_i read from nsplits_ptr (convert_dq kernel) or from GetDqAccSplits const long_index_t split_stride_i = static_cast<long_index_t>(seqq_i) * kargs.hdim_q; const long_index_t nhead_stride_i = static_cast<long_index_t>(nsplits_i) * split_stride_i; // Final address: dq_acc_base + batch_offset_dq_acc + i_nhead * nhead_stride_i + i_split * split_stride_i ``` #### nsplits computation (`PrepareWorkspaceHost`) `PrepareWorkspaceHost` is a template method of `FmhaBwdWorkspaceManager` that still takes `kN0` as a template parameter (from `BlockFmhaShape::kN0` of the dq_dk_dv pipeline). However, this parameter is **only used inside this host-side function** to compute nsplits — it is no longer passed into the convert_dq kernel. | Mode | nsplits computation | |------|---------------------| | kUseQrQtrDorPipeline | Writes dq directly; nsplits[0]=0; returns device_ws_size=0 | | Non-deterministic | nsplits[0]=1; dq_acc used for atomic add | | Deterministic + group mode | `ceil((seqstart_ks[i+1]-seqstart_ks[i]) / kN0)` computed per batch | | Deterministic + batch mode persistent | Same logic as the original `GetDqAccSplits` (`dqdqkdv_workers` based) | ### Removing kN0 dependency from `FmhaBwdConvertQGradKernel` `FmhaBwdConvertQGradKernel` previously required `kN0` as a template parameter (via `BlockFmhaBwdConvertQGradPipelineProblem`) for two purposes: 1. 
In batch mode `operator()`: self-computing `nsplits = ceil(seqlen_k / kN0)` 2. The `b{kM0}x{kN0}` component of the kernel name string Both have been removed in this refactor: - **Batch mode**: now reads `kargs.nsplits_ptr[0]` directly (guarded by `if constexpr(kIsDeterministic)` to avoid accessing a non-existent field in non-deterministic instances) - **Kernel name**: simplified to `b{kM0}`, no longer includes `kN0` - **Template parameters**: `BlockFmhaBwdConvertQGradPipelineProblem` drops the `kN0_` parameter; `fmha_bwd_convert_dq_traits_` drops the `kN0` parameter; `F_bn0`/`convert_dq_bn0` fields removed from codegen Effect: all dq_dk_dv tiles sharing the same `(hdim, dtype, mode, pad, deterministic)` combination — regardless of `F_bn0` value (16/64/128/192/256) — now share a **single** convert_dq kernel instance. --- ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
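The workspace layout arithmetic described for #6152 above can be sketched on the host for the deterministic group-mode case. This is an illustration under stated assumptions, not the `FmhaBwdWorkspaceManager` code: `GroupDetLayout` and `plan_group_det` are hypothetical names, `index_t` and `long_index_t` are assumed to be 32-bit and 64-bit integers, and the per-batch split count uses the `ceil(seqlen_k_i / kN0)` rule from the nsplits table.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using index_t      = std::int32_t; // assumption: 32-bit index type
using long_index_t = std::int64_t; // assumption: 64-bit index type

constexpr std::size_t kAlignment = 16;

constexpr std::size_t align_up(std::size_t n, std::size_t a)
{
    return (n + a - 1) / a * a;
}

// Hypothetical result type for this sketch.
struct GroupDetLayout
{
    std::size_t host_ws_size;          // nsplits[] + offsets[] segment, in bytes
    std::vector<index_t> nsplits;      // one entry per batch
    std::vector<long_index_t> offsets; // batch + 1 element offsets into dq_acc
    std::size_t device_ws_size;        // dq_acc bytes
};

// Hypothetical planner: seqstart_qs / seqstart_ks have batch + 1 entries each.
inline GroupDetLayout plan_group_det(const std::vector<index_t>& seqstart_qs,
                                     const std::vector<index_t>& seqstart_ks,
                                     index_t nhead, index_t hdim_q, index_t kN0,
                                     std::size_t acc_bytes)
{
    const std::size_t batch = seqstart_qs.size() - 1;
    GroupDetLayout out{};
    out.host_ws_size = align_up(sizeof(index_t) * batch, kAlignment) +
                       align_up(sizeof(long_index_t) * (batch + 1), kAlignment);
    out.nsplits.resize(batch);
    out.offsets.assign(batch + 1, 0);
    for(std::size_t i = 0; i < batch; ++i)
    {
        const index_t seqq_i = seqstart_qs[i + 1] - seqstart_qs[i];
        const index_t seqk_i = seqstart_ks[i + 1] - seqstart_ks[i];
        out.nsplits[i]       = (seqk_i + kN0 - 1) / kN0; // ceil(seqlen_k_i / kN0)
        // offsets[i+1] = offsets[i] + nhead * nsplits_i * seqq_i * hdim_q
        out.offsets[i + 1] = out.offsets[i] + static_cast<long_index_t>(nhead) *
                                                  out.nsplits[i] * seqq_i * hdim_q;
    }
    out.device_ws_size = static_cast<std::size_t>(out.offsets[batch]) * acc_bytes;
    return out;
}
```

For two batches with `seqlen_q = {4, 6}`, `seqlen_k = {128, 256}`, `nhead = 2`, `hdim_q = 8`, `kN0 = 128` and a 4-byte accumulator, this gives `nsplits = {1, 2}`, `offsets = {0, 64, 256}`, a 48-byte host segment and a 1024-byte `dq_acc` segment.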
|
|
faa9dc52cb |
[rocm-libraries] ROCm/rocm-libraries#6932 (commit ce3e67b)
[CK] Fix OOB page table read in batch_prefill V prefetch (AICK-1171) (#6932) ## Summary Fix a GPU memory access fault in `mha_batch_prefill` triggered when the per-batch page table is tightly sized (no trailing slack). **Affected configurations:** - All FMHA batch prefill V2 kernels (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async`) - Triggered by paged KV layouts where `kv_page_indices.numel() == ceil(seqlen_k / page_size)` exactly - Manifests as: `Memory access fault by GPU node-X (Agent handle: 0x...)` followed by `Aborted (core dumped)` - Silent corruption (no fault, wrong output) when the OOB read happens to land in zero-initialized memory ### Root cause `load_physical_pages` performs **lookahead reads** on the page table to prefetch K/V tiles for the next iteration. When the page table for a batch has exactly `N` entries, the V-tile prefetch indexes `page_idx[N]` (one past the last valid entry), reading either uninitialized memory or the next batch's slot. On gfx942 with a tightly-sized page table, the read crosses into an unmapped page and triggers an HSA page fault. The bug was masked in earlier testing because most test harnesses pad `kv_page_indices` with trailing zeros — OOB reads then return `page_id = 0`, a valid in-cache page, producing silent numerical drift instead of a fault. ### Fix design Thread `max_page_table_idx = (seqlen_k - 1) / page_size` from the kernel layer down to `load_physical_pages`, and clamp every page-table read with `ck_tile::min()`. 
Applied to **all four code paths** in the V prefetch: | Branch | What it does | Clamp applied | |--------|-------------|---------------| | `kIsKcache` | K prefetch loop | `min(global_token_idx >> kLog2PageSize, max_page_table_idx)` | | V LINEAR (`page_size == 1`) | One token = one page | `min(global_token_idx, max_page_table_idx)` | | V crosses pages (`kVTileCrossesPages`) | Per-thread page lookup | `min(global_token_idx >> kLog2PageSize, max_page_table_idx)` | | V single page (lane0 broadcast) | `readfirstlane`-uniform lookup | `min(... >> kLog2PageSize, max_page_table_idx)` | ### Key design decisions **Mandatory parameter, not optional with a sentinel default.** An optional `max_page_table_idx = INT32_MAX` default would let the bug silently come back at any new callsite that forgets to pass it. Making it mandatory forces every caller to opt in explicitly and surfaces missed callsites at compile time. **`seqlen_k == 0` clamps to 0** instead of underflowing `(0 - 1) / page_size` to `-1`. The empty-batch case is rare but well-defined: clamp every read to slot 0. **Single computation in the kernel layer.** `FmhaBatchPrefillWithPagedKVCacheKernel` computes `max_page_table_idx` once per batch and forwards it through every QScale branch (PERTENSOR / KV_BLOCKSCALE / default). All three `operator()` overloads of the pipeline (rich, default forwarder, KV_BLOCKSCALE forwarder) take and forward the parameter. 
### Files changed | File | Change | |------|--------| | `include/ck_tile/ops/fmha/kernel/fmha_batch_prefill_kernel.hpp` | Compute `max_page_table_idx` per batch, forward to all 3 QScale branches | | `include/ck_tile/ops/fmha/pipeline/block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp` | Add `max_page_table_idx` to `load_physical_pages` and 3 `operator()` overloads; clamp page-id reads in 4 code paths | ## Test plan - [x] AICK-1171 reproducer verified on MI-308X (gfx942) - [x] New pytest case `test_batch_prefill_aick1171_oob_page_table_read` in aiter, parametrized over `total_blocks ∈ {160, 164, 168, 176, 208, 256}` (matches the `crash1_r8_*` bisect family) - [x] Full FMHA batch prefill suite on gfx942 + gfx950 ## Linked issue AICK-1171. |
||
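The clamp described for #6932 above can be sketched in a few lines. The function names here are hypothetical and `std::min` stands in for `ck_tile::min()`; the sketch assumes a power-of-two page size so that the page-table slot is `global_token_idx >> log2_page_size`, as in the clamp table.

```cpp
#include <algorithm>
#include <cstdint>

// Last valid page-table slot for one batch; an empty batch clamps to slot 0.
constexpr std::int32_t max_page_table_idx(std::int32_t seqlen_k, std::int32_t page_size)
{
    return seqlen_k > 0 ? (seqlen_k - 1) / page_size : 0;
}

// Page-table slot for a token, clamped so lookahead reads stay in bounds.
constexpr std::int32_t clamped_page_slot(std::int32_t global_token_idx,
                                         std::int32_t log2_page_size,
                                         std::int32_t max_idx)
{
    return std::min(global_token_idx >> log2_page_size, max_idx);
}

// 64 tokens with 16-token pages occupy slots 0..3. A lookahead read for
// token 64 would index slot 4 (one past the end) without the clamp.
static_assert(max_page_table_idx(64, 16) == 3, "last valid slot");
static_assert(clamped_page_slot(64, 4, max_page_table_idx(64, 16)) == 3, "clamped");
```

Making the bound an explicit argument, rather than a defaulted one, mirrors the design decision in the commit message: a call site that omits it does not compile.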
|
|
0e6a514e4f |
[rocm-libraries] ROCm/rocm-libraries#6209 (commit 89c9f3e)
Improve the performance of qr_ks_vs_whole_k_prefetch pipeline (#6209) ## About qr_ks_vs_whole_k_prefetch pipeline This PR updates and enhances the qr_ks_vs_whole_k_prefetch pipeline to improve performance on both MI300 and MI350 GPUs through better MFMA instruction usage, transposed V-loading support, and an N0-loop implementation. The pipeline targets scenarios where the number of workgroups is low, enabling better CU occupancy by using smaller MTile sizes (kM0=64 vs 128) while prefetching entire K tiles. ## Changes: - Adds transposed V-loading support (qr_ks_vs_whole_k_prefetch_trload) to avoid using shuffle instructions on MI350 - Implements N0-loop based Gemm0 to reduce tile window movement overhead and eliminate `clear_tile` calls - Adds full support for hdim96/hdim160 without padding requirements - Updates MFMA instruction selection to ensure optimal choices for MI350 ## Performance results 1. For attention shapes that lead to kM0=64, `qr_ks_vs_async_whole_k_prefetch_trload` performs much better than `qr_ks_vs_async_trload` on the same case (execution time `41.02ms` by whole_k_prefetch_trload & `58.50ms` by async_load), and it also performs clearly better than the recently tuned `qr_ks_vs_async` on the same case (execution time `41.02ms` by whole_k_prefetch_trload & `47.60ms` by qr_ks_vs_async) 2. On MI300, for attention shapes that lead to kM0=64, `qr_ks_vs_async_whole_k_prefetch` also performs much better than `qr_ks_vs_async` (which is supposed to be highly efficient) on the same case (execution time `64.50ms` by whole_k_prefetch & `80.20ms` by qr_ks_vs_async) 3. For attention shapes that lead to kM0=128, `qr_ks_vs_async_whole_k_prefetch_trload` performs slightly better than `qr_ks_vs_async` on MI350 (execution time `104.50ms` by whole_k_prefetch_trload & `106.50ms` by qr_ks_vs_async), and the two perform on par on MI300 ## Test/Verify 1. 
Use the ROCM xformers branch `test_whole_k_prefetch_n0loop` to test/verify qr_ks_vs_whole_k_prefetch pipeline since this pipeline can not be used by ck_tile fmha example so far 2. Use the following command-line for building/testing xformers >```bash > #> git clone -b test_whole_k_prefetch_n0loop https://github.com/ROCm/xformers > #> git submodule update --init --recursive > #> pip install --no-build-isolation -e ./ > #> pytest tests/test_mem_eff_attention.py::test_forward >``` 4. Any scripts which can run on xformers can be used to evaluate qr_ks_vs_whole_k_prefetch pipeline. Using the two environ variable to switch from using different pipelines > ```bash > #> export FMHA_DISABLE_SPECIAL_TREATMENT=1 #> to disable using FAV3 and qr_ks_vs_async_trload pipeline > #> export FMHA_ENABLE_ASYNC_PIPELINE=1 #> to disable using qr_ks_vs_async pipeline for comparing > ``` ## Discussion --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: poyenc <1132573+poyenc@users.noreply.github.com> Co-authored-by: qianfengz <12429178+qianfengz@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
dfc1305685 |
[rocm-libraries] ROCm/rocm-libraries#6479 (commit 0705c2d)
[CK][fmha] Add StreamLLM sink support to batch_prefill pipeline (#6479) ## Motivation The existing paged-KV attention pipelines (pagedkv, splitkv) support StreamLLM-style sink tokens — a fixed set of initial tokens kept in attention alongside the sliding window. The `batch_prefill` pipeline (chunked-prefill with VLLM-style block tables) previously hardcoded `kHasSink = false`, making it incompatible with sink-based attention patterns in LLM serving scenarios. This PR extends `batch_prefill` to support `kHasSink` and wires it into `fmha_fwd_runner` for validation against the existing CPU reference. ## Technical Details **Pipeline** (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`): - When `kHasSink`, the K/V loop splits into a sink phase [0, sink_seq_end) and a window phase [seqlen_k_start, seqlen_k_end), mirroring pagedkv. - K advance at the sink→window transition jumps `seqlen_k_start - sink_seq_end + kN0` to bridge the gap. - V scatter-gather offsets are re-initialized at the transition to fix a window mismatch bug: V was lagging kN0 behind K after the large jump, loading from the wrong sequence position. - Bias window, dropout seq_offset, and mask type (LogitsSinkMask) updated for sink-awareness. **Traits / codegen** (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`, `fmha_batch_prefill.py`): - `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded `false`). - Codegen adds `F_sink` field; skips batch-mode kernels (group mode required). - CMake test filter broadened from 9 → 33 instances covering fp16/bf16 × mask/nmask × lse/nlse × sink/nsink. **Runner** (`fmha_fwd_runner.hpp`, `CMakeLists.txt`): - `fmha_batch_prefill()` dispatched from `run_fwd` when: group mode + paged KV + num_splits == 1. - K/V strides corrected for runner's [num_pages, nhead_k, page_block_size, hdim] layout. - `page_block_size % 128` check relaxed: batch_prefill supports ps=16. - CPU reference paged-KV reordering guards extended with `CK_TILE_FMHA_FWD_BATCH_PREFILL_API`. 
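As a rough illustration of the sink/window split described in the pipeline notes above, the following Python sketch lists the K positions at which each `kN0`-wide tile would start. It is a host-side model, not the pipeline code: the function name `k_tile_starts` is hypothetical, and the sketch assumes that `sink_seq_end` is a multiple of `kN0` and that `seqlen_k_start >= sink_seq_end`.

```python
def k_tile_starts(sink_seq_end, seqlen_k_start, seqlen_k_end, kN0):
    """Model the K tile start offsets for a sink phase followed by a window phase.

    Hypothetical helper for illustration only; assumes sink_seq_end is a
    multiple of kN0 and that the window starts at or after the sink range.
    """
    if sink_seq_end % kN0 != 0:
        raise ValueError("this sketch assumes sink_seq_end is a multiple of kN0")
    if seqlen_k_start < sink_seq_end:
        raise ValueError("this sketch assumes the window starts after the sink range")

    num_sink_tiles = sink_seq_end // kN0
    window_len = max(seqlen_k_end - seqlen_k_start, 0)
    num_window_tiles = (window_len + kN0 - 1) // kN0  # ceiling division

    # With no sink tiles the loop starts directly at the window.
    pos = 0 if num_sink_tiles > 0 else seqlen_k_start
    starts = []
    for i in range(num_sink_tiles + num_window_tiles):
        starts.append(pos)
        if i == num_sink_tiles - 1:
            # Sink -> window transition: one large jump bridges the gap.
            pos += seqlen_k_start - sink_seq_end + kN0
        else:
            # Regular advance by one tile.
            pos += kN0
    return starts


print(k_tile_starts(sink_seq_end=128, seqlen_k_start=512, seqlen_k_end=768, kN0=64))
```

The jump lands exactly on `seqlen_k_start` because the last sink tile begins at `sink_seq_end - kN0`. Any V offsets that trail K by one tile would have to be recomputed at the same step, which matches the mismatch this PR reports fixing.
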
## Test Plan Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run `tile_example_fmha_fwd` in group mode with page_block_size=16. Test matrix: - Mask: no-mask, causal, sliding window - Sink: nsink, sink=1..128 - dtype: fp16, bf16 - LSE output: on/off - seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024} - GQA, chunked prefill, large batch×seqlen - page_block_size: 16, 32 ## Test Result 171 test cases, all valid:y: - nmask + nsink: ✓ - causal + nsink: ✓ - causal + sink=8: ✓ - sliding window + sink=8 (d=128, d=256): ✓ - bf16, LSE output, GQA: ✓ ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
7e4e291771 |
[rocm-libraries] ROCm/rocm-libraries#6450 (commit b75fed1)
[CK_TILE] Skip padded k/n fragment work in qr_hpad FMHA fwd (#6450) ## Motivation `qr_hpad` currently executes work for padded head-dim fragments even when only a subset of the values are valid. This adds unnecessary computation for head dimensions that require padding, such as `hdim=72` and `hdim=80`, and hurts FMHA forward performance. The goal of this PR is to make the padded-head-dim path skip invalid work based on the actual valid fragment count, while preserving the existing behavior for the non-padded path. ## Technical Details This PR improves the `qr_hpad` FMHA forward path in three parts: - Skip padded `k`/`n` fragments in the GEMM/pipeline path when only part of the fragment is valid. - Add partial GEMM0 tail handling for `qr_hpad` so the kernel uses the valid fragment range instead of always computing over the padded extent. - Retune the gfx11 `qr_hpad` kernel configuration after enabling the partial-fragment path. To keep the existing path stable, the implementation adds overloads for the updated GEMM/pipeline interfaces. This allows existing full-tile callers to keep using the previous form, while the `qr_hpad` path can pass valid fragment counts when needed. ## Test Plan ./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16 -d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1} ## Test Result - On gfx11 and gfx12, for head dimensions that require padding, `tile_example_fmha_fwd` shows about 20-30% performance improvement at `hdim=72/80`. ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
bf736dfa74 |
[rocm-libraries] ROCm/rocm-libraries#6051 (commit f0838b2)
[CK] Add FP8 per-tensor quantization support for FMHA V3 pipeline (#6051) ## Motivation The existing FMHA V3 pipeline only supports fp16/bf16 data types. This PR extends V3 to handle FP8 inputs with per-tensor descaling on gfx950, enabling higher throughput for FP8 inference workloads using the assembly-optimized V3 code path. ## Technical Details **Warp GEMM:** - Add FP8 32x32x32 warp gemm with C-transposed distribution (`WarpGemmMfma_f32_32x32x32_fp8_fp8_CTransposed`) and dispatcher entries **V3 Kernel (`fmha_fwd_v3_kernel.hpp`):** - Add per-tensor descale support for Q, K, V tensors, passing descale pointers through to pipeline kargs **V3 Pipeline (`block_fmha_fwd_v3_pipeline.hpp`):** - Add FP8 data path with dtype-aware type selection - Add asm volatile P matrix conversion from f32 to fp8 - Add FP8-aware instruction scheduling in `CoreLoopScheduler` **V3 Pipeline Policy (`block_fmha_fwd_v3_pipeline_default_policy.hpp`):** - Add FP8 QK warp gemm selection (SwizzleB variant for V tile distribution compatibility) **Codegen (`fmha_fwd.py`):** - Add gfx950 FP8BF16 V3 tile size (256x64x128x128x64x128) - Add FP8BF16 V3 pipeline variants (mask: no/causal, qscale: no/pertensor) - Extend `can_dispatch_v3` condition for fp8bf16 + pertensor **Misc:** - Add LLVM scheduler `TRANS` mask to `LLVMSchedGroupMask` enum (`arch.hpp`) - Fix `mask_info` default initialization for `no_mask` case (`mask.hpp`) V3 dispatch for FP8 is disabled by default (`F_is_v3_enabled=false`) pending further validation. 
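A short sketch of why per-tensor descaling is inexpensive: a single scale per tensor factors out of the dot product, so the QK result can be rescaled once after the low-precision GEMM. The Python below only demonstrates that arithmetic and is not the kernel's code path. `quantize_per_tensor` is a hypothetical helper, rounding to an integer grid stands in for real FP8 rounding, and `qmax` is an assumed representable maximum.

```python
def quantize_per_tensor(values, qmax=448.0):
    """Scale a flat list into [-qmax, qmax] and round to the nearest integer.

    Hypothetical helper: integer rounding is a stand-in for FP8 rounding and
    qmax is an assumption. Returns (quantized, descale) such that
    value ~= quantized * descale.
    """
    amax = max(abs(v) for v in values)
    descale = amax / qmax if amax > 0 else 1.0
    quantized = [round(v / descale) for v in values]
    return quantized, descale


def dot(a, b):
    """Plain dot product of two equally sized lists."""
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.25, 2.0, 0.75]
k = [1.5, 0.25, -0.5, 3.0]

q_quant, q_descale = quantize_per_tensor(q)
k_quant, k_descale = quantize_per_tensor(k)

# Low-precision dot product, rescaled once by the product of the two descales.
s_fused = dot(q_quant, k_quant) * (q_descale * k_descale)

# Reference: dequantize every element first, then take the dot product.
s_reference = dot([x * q_descale for x in q_quant],
                  [y * k_descale for y in k_quant])

print(s_fused, s_reference, dot(q, k))
```

Both paths give the same value up to floating-point rounding, and both differ from the unquantized dot product only by the quantization error.
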
## Performance: fmha_fwd V3 FP8 (avg runs 2-6, stock ROCm 7.1.1, gfx950)

| Problem | Regular (TFlops) | Varlen (TFlops) |
|---|---:|---:|
| batch=1 heads=6/1 seqlen=1024 causal | 48.9 | 47.6 |
| batch=1 heads=6/1 seqlen=2048 causal | 119.8 | 117.4 |
| batch=1 heads=6/1 seqlen=4096 causal | 263.7 | 259.2 |
| batch=1 heads=6/1 seqlen=8192 causal | 548.9 | 543.6 |
| batch=1 heads=6/1 seqlen=16384 causal | 1043.0 | 1063.7 |
| batch=1 heads=6/1 seqlen=32768 causal | 1237.2 | 1279.6 |
| batch=1 heads=6/1 seqlen=65536 causal | 1315.4 | 1382.7 |
| batch=1 heads=6/1 seqlen=131072 causal | 1326.3 | 1402.2 |
| batch=1 heads=16/1 seqlen=65536 causal | 1298.7 | 1388.4 |
| batch=1 heads=40/40 seqlen=37200 non-causal | 1248.9 | 1326.1 |

## Test Plan Tested with aiter's `test_mha_fp8.py` test suite (176 cases) covering batch sizes (1-2), sequence lengths (113-4096), head counts (5/8/32/40), GQA ratios (1:1, 1:8), and causal/non-causal modes. Verified all cases dispatch to the V3 pipeline by enabling `F_is_v3_enabled` and confirming kernel names contain `qr_async_trload_v3`. ## Test Result 176/176 tests passed with V3 enabled. All cases correctly dispatched to V3 pipeline with `pertensor` quantization. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
8287fe6c19 |
[rocm-libraries] ROCm/rocm-libraries#6201 (commit 5c0697e)
[CK_TILE] Temporarily remove batch prefill KV cache overflow asserts (#6201) ## Summary - Temporarily remove the KV cache offset overflow assert checks in `FmhaBatchPrefillWithPagedKVCacheKernel` - The asserts are **correct**, but they block project progress in certain configurations - This is a **temporary workaround** to unblock progress; a proper fix will follow ## Note This is NOT a permanent solution. A follow-up PR will add proper overflow handling that addresses the underlying issue without blocking progress. |
||
|
|
8d1fb9d33e |
[rocm-libraries] ROCm/rocm-libraries#5504 (commit 47f86c7)
[CK Tile] Add sink token gradient support in FMHA backward pass (#5504) ## Motivation Adds sink token support to the FMHA backward kernel (dot_do_o pipeline): ## Technical Details - Extend BlockFmhaBwdOGradDotOPipelineProblem with LSEDataType - Add sink_ptr/d_sink_ptr/lse_ptr/nhead to FmhaBwdOGradDotOCommonKargs - Compute per-head sink gradient via atomic accumulation in the pipeline - Update example runner with reference validation for sink gradient ## Test Plan Add new test case ## Test Result WIP ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
dea23b31b4 |
[rocm-libraries] ROCm/rocm-libraries#5790 (commit c132b5a)
[CK_TILE] Fix NaN for FMHA BWD When seq_q=0 (#5790) ## Motivation This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the effective query sequence length for a tile is zero, by ensuring the per-tile pipelines exit early with zeroed accumulators and by avoiding an early kernel return that prevented writing out cleared gradients. ## Technical Details - Add unconditional early-exit in the dK/dV pipelines when `num_total_loop <= 0` (no work), returning zeroed accumulators. - Adjust group-mode kernel early-return logic to only return when **both** `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and store cleared dK/dV when `seqlen_q == 0`. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
4b0ebc5fab |
[rocm-libraries] ROCm/rocm-libraries#5018 (commit b32e7e6)
[CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018) ## Motivation Long-sequence FMHA can become memory-bound when K/V working sets exceed Infinity Cache (LLC), causing repeated DRAM traffic across heads. This PR introduces LLC-aware launch ordering improvements for FMHA forward, and it is currently enabled only on gfx11 and gfx12. The approach is inspired by [`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217), adapted to CK’s kernel/runner structure and layout handling. In this context, `bshd` is the layout used in Flash-Attention, while `bhsd` is the default layout used by the CK Tile FMHA example. ## Technical Details This PR adds two complementary strategies: - For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware head grouping: - Estimate LLC size (env override, KFD sysfs, or arch default). - Compute group size from K/V bytes per head vs LLC target. - Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and related tensors). - For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit launch-order adjustment: - Keep a single kernel launch. - Reinterpret block linearization in `GetTileIndex` to make execution head-major, improving temporal locality of per-head K/V reuse. Additional integration updates: - Propagate `num_head_q_total` and `head_start` through FMHA args/kargs. - Use global head indexing for dropout RNG stream mapping so grouped launches keep deterministic/consistent dropout behavior. - Keep fallback behavior unchanged when grouping is not beneficial or disabled. ## Test Plan - `test_ck_tile_fmha` - `tile_example_fmha_fwd` ## Test Result - `test_ck_tile_fmha`: all tests passed. - `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201, and all of them show higher performance compared to the baseline. The improvement is consistent, and performance is well maintained even at long sequence lengths. 
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

- TFLOPs by sequence length, target: gfx1100, layout: bhsd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 56.27 | 61.48 | 1.09x
4096 | 67.10 | 72.27 | 1.08x
8192 | 65.99 | 71.64 | 1.09x
12288 | 61.60 | 76.61 | 1.24x
16384 | 58.99 | 75.74 | 1.28x
20480 | 57.32 | 74.42 | 1.30x
24576 | 56.89 | 74.25 | 1.31x
27280 | 18.93 | 24.48 | 1.29x

- TFLOPs by sequence length, target: gfx1201, layout: bshd

SeqLen | Before | After | Speedup
-- | -- | -- | --
1024 | 66.79 | 65.90 | 0.99x
4096 | 85.90 | 86.80 | 1.01x
8192 | 77.06 | 90.29 | 1.17x
12288 | 58.36 | 88.98 | 1.52x
16384 | 52.12 | 88.88 | 1.71x
20480 | 48.11 | 88.42 | 1.84x
24576 | 47.12 | 89.07 | 1.89x
27280 | 49.05 | 50.31 | 1.03x

## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
b7d8c938b4 |
[rocm-libraries] ROCm/rocm-libraries#5174 (commit a358a21)
[CK_TILE] FMHA BWD Use Persistent Kernels in Deterministic Mode (#5174) ## Motivation This PR enables a persistent-kernel execution path for FMHA backward (dQ/dK/dV) in deterministic mode, adjusting how dQ accumulation is split, stored, and converted back to final gradients. ## Technical Details - Introduces a persistent-kernel grid mapping in deterministic mode and updates split-count calculation accordingly. - Extends kernel kargs to carry batch-related info needed for persistent scheduling and dQ conversion. - Refactors dQ store conditions and adds mask-type traits/utilities and runner logging updates. ## Test Plan - Jenkins [base](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/10/pipeline) - Jenkins [AITER](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/12/pipeline) - Jenkins [FMHA](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/PR-5174/11/pipeline) - local FA tests ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
5f94c1aa0f |
[rocm-libraries] ROCm/rocm-libraries#4368 (commit 17f7dfc)
[CK_TILE][FMHA] Support microscaling (mxfp8 and mxfp4) on gfx950 (#4368) ## Motivation Microscaling types (mxfp8 and mxfp4) for fwd qr pipeline ## Technical Details The microscaling is used when quant scale mode is `BlockAttentionQuantScaleEnum::MX` and `Q/K/P/VDataType` are fp8/bf8/fp4. Supported features: * only "qr" pipeline is implemented * hdim 128 and 256 (smaller hdim are not possible due to restrictions of "qr" pipeline, but they can be computed using instances with padding) * both 32x32x64 and 16x16x128 scale MFMAs are supported * Q and K scales are applied in hdim, V scales - in seqlen dimension * column-major V only * batch and group mode * bias, Alibi (tested but no instances by default, just like fp8) * masking etc. Aiter PR with new API args: https://github.com/ROCm/aiter/pull/2008 ## Test Plan ``` ninja test_ck_tile_fmha_fwd_mxfp8 && bin/test_ck_tile_fmha_fwd_mxfp8 ninja test_ck_tile_fmha_fwd_mxfp4 && bin/test_ck_tile_fmha_fwd_mxfp4 ``` ## Test Result The tests must pass. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
c57d503c65 |
[rocm-libraries] ROCm/rocm-libraries#4999 (commit 45f6624)
[CK] Fix 32-bit overflow in batch prefill kernel for >4GB KV cache (#4999) Use SRD rebasing for page_block_size >= kN0: move SRD base pointer to page start via 48-bit arithmetic, encode only within-page offset in voffset. Original code path preserved for ps1/ps16 via constexpr-if. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
2709abee7e |
[rocm-libraries] ROCm/rocm-libraries#4577 (commit a36922c)
[CK_TILE] FMHA BWD Launcher Interface (#4577) ## Motivation Reduce memory usage and prepare for optimizations that reduce nsplits in deterministic cases. ## Technical Details This PR introduces a new launcher interface for the FMHA backward operation, replacing direct function calls with a more structured approach. The launcher encapsulates kernel dispatch logic and provides access to computed metadata like the number of dQ acc splits. **Changes:** - Added `fmha_bwd_launcher` class that wraps kernel execution and exposes `dq_acc_splits` - Moved `fmha_bwd_traits` construction earlier in the execution flow to support launcher initialization - Refactored code generation to produce both legacy API and new launcher constructor ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
841e6b89d1 |
[rocm-libraries] ROCm/rocm-libraries#4584 (commit 42efd1d)
[CK_TILE][FMHA] Support gfx11 (#4584) ## Motivation Add support of gfx11 architectures (RDNA3) to FMHA. ## Technical Details Distributions (matrix elements to lane registers mapping) of gfx11 WMMA are completely different from distributions of gfx9 MFMA and gfx12 WMMA. There are two cases in FMHA where this difference matters: * usage of results (matrix C) of one GEMM as input (matrix A) of another GEMM. * random number generation for dropout (implementation for gfx9 MFMA, gfx12 WMMA and host validation produce the same results). Both cases are solved by a special remapping implemented using `__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`. Additional changes: * FMHA tests are now built and run only for those types for which instances exist (gfx11 supports only fp16 and bf16). * Two fixes for uninitialized values (`mask.sink` and `do_fp8_static_quant`): they may contain garbage resulting in incorrect dispatching logic, so tests sometimes report that there are no instances available for the current parameters. * Small fix to remove expcnt(0) from the s_waitcnt instruction on gfx11 when it is not requested (i.e. every time), likely has no effect on performance but makes disassembly a bit clearer. ## Test Plan ``` ninja test_ck_tile_fmha bin/test_ck_tile_fmha_fwd_fp16 bin/test_ck_tile_fmha_fwd_bf16 bin/test_ck_tile_fmha_bwd_fp16 bin/test_ck_tile_fmha_bwd_bf16 ``` ## Test Result All tests must pass (some tests may be skipped). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
7b18f5fed2 |
[rocm-libraries] ROCm/rocm-libraries#4263 (commit f34aec2)
[CK] Add FP8 KV_BLOCKSCALE support for batch prefill Implement per-page K/V quantization for paged attention: - Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum - Use exp2 shift trick to eliminate explicit P scaling overhead - Prefetch physical pages offset for KV cache, overlaps with computations ## Proposed changes Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered |
||
|
|
67f0b74ec6 |
Revert "Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)" (#3635)
This reverts commit
|
||
|
|
de5a1d730d |
Revert " Fp8 block scale quantization for fmha fwd (#3330)" (#3633)
This reverts commit
|
||
|
|
dd0b4294af |
Fp8 block scale quantization for fmha fwd (#3330)
* add block scale parameters to kernel
* add block scale to kernel
* add smoke test
* format
* Revert "format"
This reverts commit
|
||
|
|
fcc9372c00 | [CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615) | ||
|
|
717ed0b59f |
[CK_TILE][FMHA] Enable gpt-oss sink (#3490)
* Enable gptoss sink
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add gptoss sink test
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* update CHANGELOG.md
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* fix test args error
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* Update test_fmha_fwd.cpp
* update sink test
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* Revert "update sink test"
This reverts commit
|
||
|
|
00c46785a8 |
Shuffle fix for gfx950 (#3491)
* solve compiler issue * solve the gfx950 mfma shuffle regression * refactor jenkinsfile to handle arch name better * [CK TILE] set divisor to count of thread along k dimension * fix the compiler error * solve degradation * Finish the multiplies fix * fix the scales * solve compilation error * solve the composes * solve the error of tile sweeper * fix the test and example * fix for gfx950 --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Cong Ma <congma13@amd.com> |
||
|
|
cc75a1dc5f |
[FMHA] Batch Prefill Support Improvements: Change KV Cache Layout & Large Page Size Support (#3442)
* add page_block_size parameter * add is_sglang_layout to parameters * add kv_offset_array_transform to batch async for page size 16 * add kv_last_page_lens to kernel * change kv layout to [num_total_pages, page_block_size, hdim] * format * - enable codegen of batch_prefill kernels - create new problem struct BlockFmhaBatchPrefillPipelineProblem for batch prefill kernels - generate different page sizes of batch prefill kernels (1, 16) * 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950 2. support page size 1024 * fix python format * change kv cache layout to [num_blocks, num_kv_heads, head_size/x, block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X] * 1. Introduced `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hardcoded values 2. Makes batch prefill kernel traits structures inherit from fmha fwd traits 3. Add some static checks for Page size, vector size, hdim, ..., etc. * [Refactor] Replace is_sglang_layout with Enums for KV cache configuration Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single boolean. **Changes:** * Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`. * Updated Kernel, Pipeline, and Traits to template on these Enums. * Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`. * Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`. * Updated CodeGen scripts to support new parameters. This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations. * 1. remove batch prefill pipeline with sk_pad=false 2. correct some comments 3. add static assert to make sure v offsets is in same page within a tile. 
* fix vgpr spill count * remove unnecessary t2s functions * add fp8 support for receipt 200 and 600 in fmha_batch_prefill.py * support linear kv cache layout * Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse kv_page_indices as a pointer to the lookup table. * 1. merge multiple transforms into single transform. 2. add static check to make sure vlayout is row-major. * move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs. * update changelog --------- Co-authored-by: ltqin <letaoqin@amd.com> Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com> |
||
|
|
1c3151963b |
[CK_TILE][FMHA] Add FP8 support for batch_prefill kernel (#3425)
* Add fp8bf16 support for batch_prefill * Fix wrong scale_s re-compute logic in batch_prefill * Fix wrong scale_s re-compute logic in fmha fwd * Fix batch_prefill codegen error * Remove no-longer used GetName() function * Add fp8 logits=True instances * Update CHANGELOG.md |
||
|
|
bfac64953f |
[CK_TILE][FMHA] Add logits soft-capping support for FAv3 (WIP) (#3355)
* Let fmha_fwd_v3() compatible with fmha_fwd() * Decouple get_fwd_blobs() and FmhaFwdKernel * Decouple compatibility checks from get_fwd_blobs() * Extract product feature checks out from get_fwd_blobs() * Remove duplicated code in factories and redundant checks * Remove FmhaFwdKernel<>::GetName() * Let FmhaFwdApiPool support pipelines with different mask_impl * Add tile setting for fmha fwd v3 pipeline * Add fwd v3 instances to tile_example_fmha_fwd manually * Remove unused function import * Undo irrelevant changes * Remove fwd v3 instances from tile_example_fmha_fwd * Finish fmha fwd v3 kernel instance codegen * Fix formatting * Remove unused F_idx attribute * Add is_generic_attention_mask<> traits * Add constraints to the fmha fwd v3 pipeline * Unify traits & problem used for fmha fwd v3 * Unify kernel launch code for fmha fwd v2 & v3 * Unify kernel template selection logic * Use same kernel codegen template for both v2 & v3 * Rename api() property as render() method * Allow specifying filter for fmha fwd api pool * Allow specifying function name when rendering api pool items * Separate fmha fwd v3 kernel dispatching logic from v2 * Remove lambda assignment * Add simple v2/v3 dispatch logic * Stop generating empty if-clauses Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them. 
* Use "".join() to concatenate fmha fwd api string content * Add more feature checks for fmha fwd v3 pipeline * Check features before dispatch to fmha_fwd_v3() * Add more feature checks for fmha_fwd_v3() * Add missing filter call * Use Tuple to reserve the dtype orders * Fix wrong pipeline matching logic * Add fmha fwd v3 group mode instances * Add functor_transform<> * Add type constraints to make_tile_window() * Remove fmha fwd v3 example * Fix wrong product(aiter mha_fwd()) config * Fix wrong fmha fwd v2/v3 selection logic * Fix formatting * Add comment to warning v3 kernel users * Fix wrong codegen logics * Remove unnecessary param * Fix format * Add logits soft-capping support for fmha fwd v3 pipeline (WIP) * Add missing Kargs base type --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
f5573f56d9 |
Add attention sink support for FMHA FWD (#3368)
* Revert "Revert "Add attn sink (#2892)" (#3250)"
This reverts commit
|
||
|
|
05292b3604 |
[CK_TILE][FMHA] Integrate FAv2 & FAv3 (WIP) in the single fmha_fwd() API (#3153)
* Let fmha_fwd_v3() compatible with fmha_fwd() * Decouple get_fwd_blobs() and FmhaFwdKernel * Decouple compatibility checks from get_fwd_blobs() * Extract product feature checks out from get_fwd_blobs() * Remove duplicated code in factories and redundant checks * Remove FmhaFwdKernel<>::GetName() * Let FmhaFwdApiPool support pipelines with different mask_impl * Add tile setting for fmha fwd v3 pipeline * Add fwd v3 instances to tile_example_fmha_fwd manually * Remove unused function import * Undo irrelevant changes * Remove fwd v3 instances from tile_example_fmha_fwd * Finish fmha fwd v3 kernel instance codegen * Fix formatting * Remove unused F_idx attribute * Add is_generic_attention_mask<> traits * Add constraints to the fmha fwd v3 pipeline * Unify traits & problem used for fmha fwd v3 * Unify kernel launch code for fmha fwd v2 & v3 * Unify kernel template selection logic * Use same kernel codegen template for both v2 & v3 * Rename api() property as render() method * Allow specifying filter for fmha fwd api pool * Allow specifying function name when rendering api pool items * Separate fmha fwd v3 kernel dispatching logic from v2 * Remove lambda assignment * Add simple v2/v3 dispatch logic * Stop generating empty if-clauses Skip iterating over dictionaries that have no traits, and avoid assigning i_* to them. 
* Use "".join() to concatenate fmha fwd api string content * Add more feature checks for fmha fwd v3 pipeline * Check features before dispatch to fmha_fwd_v3() * Add more feature checks for fmha_fwd_v3() * Add missing filter call * Use Tuple to reserve the dtype orders * Fix wrong pipeline matching logic * Add fmha fwd v3 group mode instances * Add functor_transform<> * Add type constraints to make_tile_window() * Remove fmha fwd v3 example * Fix wrong product(aiter mha_fwd()) config * Fix wrong fmha fwd v2/v3 selection logic * Fix formatting * Add comment to warning v3 kernel users * Fix wrong codegen logics * Remove unnecessary param * Fix format --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
de6466481f | chore(copyright): update copyright header for include directory (#3293) | ||
|
|
229d43ea0c |
Fix batch prefill compile fail in aiter (#3279)
* Fix batch prefill aiter compile fail * Fix compile error |
||
|
|
81042ea574 | Fix a bug for qr_ks_vs_async_trload pipeline (#3271) | ||
|
|
5948dbffe4 |
Support fp8 dynamic quantization for fmha (#3206)
* Support qscale for dynamic quant, remove static quant * Support hdim=256 * Remove bias test case for fp8 --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com> |
||
|
|
5adaa201ed |
Revert "Add attn sink (#2892)" (#3250)
This reverts commit
|
||
|
|
9fa4e8d5ab |
Add attn sink (#2892)
* enable attn sink
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update attn_sink script
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* fix some error
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* clang-format
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update fmha_bwd mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update fmha_bwd_kernel's mask
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* update block_fmha_pipeline_qr_ks_vs.hpp
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
* fix ci error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* fix format error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update block_fmha_bwd_pipeline_default_policy.hpp
* Update fmha_fwd_runner.hpp
* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* Update fmha_fwd_runner.hpp
* update splitkv_pipeline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update splitkv&pagedkv pipeline
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* add sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update attn_sink result log
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update smoke_test_fwd_sink.sh
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update test file
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update test script
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp
* use constexpr kHasSink for sink in fmha pipeline
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* update by pre-commit
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update fmha_fwd.py
* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Remove causal mask setting logic from mask.hpp
Removed the mask setting logic for causal masks.
* fix ci error: some usages of lambda are not supported in c++17
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update remod.py
* add smoke sink test
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update fmha_pagedkv_prefill.py
* Update FmhaFwdPipeline parameters in fmha_fwd.py
* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* fix c++17 unsupported error
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
* Fix formatting of sink_seq_end assignment
* Fix indentation for sink_seq_end assignment
* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp
---------
Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> |
||
|
|
d7b3197869 |
[CK_TILE] FMHA Reduce register spilling in fwd with dropout (workaround for CI failures with clang-22) (#3221)
* Use vectorized stores for dropout randvals
With no kPadSeqLenK the kernel uses 2 buffer_store_dwordx2 instead of 16 buffer_store_byte. This requires fewer registers and reduces spilling.
* Calculate dropout randvals for storing and applying only once
Even though it may add a small overhead when storing is not required, it uses significantly fewer registers and hence avoids spilling. |
||
|
|
1e77695fe8 |
[CK_TILE] Support WMMA (gfx12) in FMHA (#2528)
* Pass hdim to tile_example_fmha_fwd in fp8 tests
* Add WMMA support to fwd FMHA pipelines
* Tune tile sizes a bit for less spilling
fp16 256 is still quite slow
* Fix Q grad tile distribution for warp size = 32 and hdim >= 256
With AccDataType = float and warp size = 32, K0 becomes 0, so K repeat is required to correctly distribute the tile.
* Use code based on BlockDropout in BlockDropoutBwd
* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal
* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped; it worked on gfx9
because they are both equal to 8, while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1];
it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependencies makes it possible to support:
* any number of warps (power-of-2), not only 4;
* kN1 = 16, not only 32;
* any number of splits;
* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now
* Replace hard-coded kN1 in dispatch code with the requested tile size
* Add gfx12-specific tile sizes for split KV
* Pass GPU architecture to kernel generation scripts
This is still a temporary solution.
* Build and run FMHA CI tests for gfx12
* Fix issue after merging
* Fix bwd tile sizes
The current pipelines always read only one K tile and one V tile; this
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).
* Use hardware f32->f8 on gfx12, remove v_perm
__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows specifying which word (16 bits of a
32-bit dword) is used to store results (two f8 values).
* Update changelog
* Add WMMA support to pagedkv
* Fix scripts after rebasing
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler produce more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Fix names after cherry-picking
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Do not use filters related to qr_async_trload
They disable tiles/pipelines which are valid for gfx12.
* Use different dstr encoding when C is transposed
* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code
Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().
* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution
Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.
* Generalize usages of WarpGemmDispatcher for MFMA and WMMA
WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly because of swizzle factor = 4.
* Mark has_load_tr as maybe_unused
There is no transpose loading on RDNA.
* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code
* Detect BlockSize on host based on warp size of the current device
If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.
* Fix calculation of grid size for combine kernel with warp size = 32
* Add missing includes and header
* Support multiple archs in one binary for fwd
* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill
* Support multiple archs in one binary for bwd
* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
they can be used as fallbacks (similarly to fwd);
* Extract common code from register_traits
* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"
To simplify merging; the proper fix is already in develop.
* Support new numerical d paddings in trait ordering checks
* Build fp32 tests only on gfx9
* Do not use hardcoded M0 = 64 for dot bwd kernel
* Use textwrap.indent from standard library
* Make fp8 pipelines on gfx12 consistent with gfx9
* Update tests for current pipelines
* Make ninja check more responsive in CI
ninja buffers output, so this job appears to hang.
* Support fp8fp32 by limiting O vector size
The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).
* Remove unused cmake options
* Unify including amd_buffer_addressing.hpp/_builtins.hpp
* Temporarily use amd_buffer_addressing.hpp on >=gfx10
amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
* 1 scalar for exec masks instead of 2,
* gfx12 uses different instruction names etc.
* Update asm in bf16 conversions to work with warp 32
* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd
* Add arch tags to kernels/host funcs, compile for each arch separately
* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename
* Add workaround for miscompilation of bwd with padded hdim
SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler from doing this.
* Fix add_gtest_executable for absolute paths
Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.
* Run only hdim 128 smoke tests for fp8fp32
There are no instances for hdim 64 and 256.
* Format py with ruff to simplify merging develop
* Fix incorrect var name
* Codegen for gfx9,gfx950 when --targets is not specified
Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).
* Combine arch-related values into ArchTrait
This more centralized approach removes duplication of various formatting templates.
* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"
Some code is extracted into a function.
|
||
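The "Detect BlockSize on host based on warp size of the current device" item above can be illustrated with a small host-side sketch. It assumes the host compilation pass always sees a warp size of 64, as the commit message states; `kHostWarpSize` and `launch_block_size` are hypothetical names used only for illustration and are not part of CK. In real code the device warp size would be queried from the runtime rather than passed in by hand.

```cpp
#include <cassert>

// Assumption taken from the commit message: on the host, get_warp_size() == 64.
constexpr int kHostWarpSize = 64;

// Hypothetical helper (not a CK function): pick the block size to launch with.
// If the compile-time block size was derived as kNumWarps * (host warp size),
// rescale it to the warp size of the device that will actually run the kernel.
constexpr int launch_block_size(int k_block_size, int k_num_warps, int device_warp_size)
{
    if(k_block_size == k_num_warps * kHostWarpSize)
    {
        // warp size 32 -> kBlockSize / 2, warp size 64 -> unchanged
        return k_num_warps * device_warp_size;
    }
    // The block size does not depend on the warp size: launch as-is.
    return k_block_size;
}

static_assert(launch_block_size(256, 4, 64) == 256, "wave64 device keeps the block size");
static_assert(launch_block_size(256, 4, 32) == 128, "wave32 device halves the block size");
static_assert(launch_block_size(256, 2, 32) == 256, "unrelated block size is left alone");
```

The sketch only models the arithmetic; how the actual kernels obtain and apply the device warp size is not described in this log.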
|
|
7c6430eca0 |
[CK_TILE] fmha: Add query padding support to backward pass (#3097)
* [CK_TILE] fmha: Add query padding support to backward pass
Introduces support for query sequence padding (q_padding) in the FMHA backward pass kernels.
- Passing `seqlen_q_ptr` to the backward kernels to distinguish logical from physical sequence lengths.
- Updating `OGradDotO`, `ConvertQGrad`, and `DQDKDV` kernels to respect logical lengths and handle zero-length sequences.
- Aligning LSE indexing in the forward kernel with the padded layout for consistency.
- Adding a new GTest suite (`test_fmha_bwd_kernel_padding.cpp`) with comprehensive tests for various padding scenarios, including zero-length sequences and deterministic mode.
* fix clang format
* Adapt fmha_bwd_runner.cpp to new q, kv sequence padding
Add backward q/kv sequence padding unit tests.
* [CK_TILE] fmha: Unify sequence length and padding handling
Refactor the handling of sequence lengths and padding in the FMHA forward and backward kernels to provide a more unified and flexible interface.
- Replaced `seqstart_padded_*_ptr` with a more robust system that uses `seqstart_*_ptr` for physical sequence lengths and introduces `seqlen_*_ptr` and `cu_seqlen_*_ptr` for logical (unpadded) lengths.
- Established a clear order of precedence for determining sequence length: cumulative lengths (`cu_seqlen_*_ptr`) take priority, followed by per-sequence lengths (`seqlen_*_ptr`), and finally physical lengths derived from `seqstart_*_ptr`.
- Clarified the distinction between "group mode" and "batch mode" and how sequence lengths are handled in each case.
- Renamed `cu_seqlen_kv_ptr` to `cu_seqlen_k_ptr` for consistency.
- Updated comments and documentation to reflect the new argument structure and usage.
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> |
||
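The order of precedence for sequence lengths described in the commit above can be sketched as follows. This is a minimal illustration, not the kernel code: the `SeqLenArgs` struct and `logical_seqlen` function are hypothetical, and the sketch assumes that `seqstart_ptr` and `cu_seqlen_ptr` hold `batch + 1` prefix-sum entries so that a length is the difference of two neighbours.

```cpp
#include <cstdint>

// Hypothetical argument bundle for one of q / k; the member names mirror the
// seqstart_*_ptr, seqlen_*_ptr and cu_seqlen_*_ptr arguments named in the commit.
struct SeqLenArgs
{
    const std::int32_t* seqstart_ptr;  // physical start offsets (assumed batch + 1 entries)
    const std::int32_t* seqlen_ptr;    // optional per-sequence logical lengths
    const std::int32_t* cu_seqlen_ptr; // optional cumulative logical lengths (assumed batch + 1 entries)
};

// Resolve the logical (unpadded) length of sequence b using the stated precedence:
// cumulative lengths first, then per-sequence lengths, then physical lengths.
inline std::int32_t logical_seqlen(const SeqLenArgs& args, int b)
{
    if(args.cu_seqlen_ptr != nullptr)
    {
        return args.cu_seqlen_ptr[b + 1] - args.cu_seqlen_ptr[b];
    }
    if(args.seqlen_ptr != nullptr)
    {
        return args.seqlen_ptr[b];
    }
    // Fall back to the physical length derived from the start offsets.
    return args.seqstart_ptr[b + 1] - args.seqstart_ptr[b];
}
```

With physical starts `{0, 8, 16}`, per-sequence lengths `{5, 0}` and cumulative lengths `{0, 3, 7}`, the second sequence resolves to 4 when all three pointers are set, to 0 when only the cumulative pointer is null, and to 8 when only the start offsets are available.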
|
|
1edd250115 |
[CK_TILE] Support f32 in FMHA (fwd and bwd) (#2836)
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler produce more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Add F32 MFMA warp gemms
* Support f32 in fwd FMHA
* Implement transpose_vectors for 4-byte types (float)
* Fix unexpected implicit f32->uint32 cast in buffer_store<4>
__builtin_amdgcn_raw_buffer_store_b32 expects unsigned int but float was passed (implicitly cast to uint).
mbuf_t types in other buffer_store<> are changed for consistency.
* Support F32 in bwd FMHA
hdim = 256 is disabled for now because it uses too much memory on gfx90a
* Support Headdim = 48 (divisible by 16) in fwd
* Add fp32-specific receipts (800 and 801)
* Tune fwd tiles
* Tune bwd tiles
* Use small tiles only for small seqlen_q
* Fix after rebasing
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Remove constraints and adjust filtering for fp32
Custom constraints are no longer needed because now the smallest tile
is selected automatically based on seqlen_q.
Filters related to qr_async_trload disabled valid fp32 tiles.
* Add fp32 tests
* Make splitkv and appendkv compile for fp32 only
There are no instances yet, but API still must compile when only fp32 is
requested.
* Remove unimportant f32 instances
* Add test_ck_tile_fmha_*_fp32 to REGRESSION_TESTS
* Replace magic numbers with a constant, improve comments for dropout
* Update changelog
* Fix condition that dq_acc must be set to zero when mask is used
The change was introduced in #2799
* Replace warp_uniform with recently added amd_wave_read_first_lane
* Add hdim = 96 and 192 to fwd
|
||
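The philox seed/offset regression mentioned in the commit above comes from passing a 64-bit value through an operation that only handles 32 bits. The following host-side sketch shows the truncation and one possible way around it, broadcasting the two halves separately and recombining them. `broadcast32` is a stand-in for a 32-bit-only operation such as `__builtin_amdgcn_readfirstlane` (here simply the identity), and all function names are hypothetical; the log does not say how the actual fix is implemented.

```cpp
#include <cstdint>

// Stand-in for a 32-bit-only broadcast; on the host it is just the identity.
constexpr std::uint32_t broadcast32(std::uint32_t v) { return v; }

// What goes wrong: the 64-bit value is narrowed to 32 bits before the broadcast,
// so the upper half is lost.
constexpr std::uint64_t broadcast64_truncating(std::uint64_t v)
{
    return broadcast32(static_cast<std::uint32_t>(v));
}

// One possible remedy (an assumption, not necessarily the fix used in CK):
// broadcast the low and high halves separately and reassemble them.
constexpr std::uint64_t broadcast64_split(std::uint64_t v)
{
    const std::uint32_t lo = broadcast32(static_cast<std::uint32_t>(v));
    const std::uint32_t hi = broadcast32(static_cast<std::uint32_t>(v >> 32));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

static_assert(broadcast64_truncating(0x100000002ULL) == 0x2ULL, "upper 32 bits are dropped");
static_assert(broadcast64_split(0x100000002ULL) == 0x100000002ULL, "all 64 bits survive");
```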
|
|
32773fe5cb | [CK_TILE] FMHA BWD Pad HDim to a Multiple of 8 (#2918) | ||
|
|
518d24e662 |
Add sequence padding and variable length support in fmha (#2932)
* [CK_TILE] Add sequence padding and variable length support in fmha (and v3)
- Group Mode Padding: Introduces the `-s_qpad` argument to support physically padded layouts. Kernels now use padded start pointers (`seqstart_padded_*_ptr`) for memory addressing.
- Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens` arguments for efficient processing of variable-length sequences by passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.
- FMHA examples: Support padding and variable length in both group and batch mode. The dispatcher is updated as well (dispatch to the kPadSeqLenK-enabled pipeline).
- New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and `test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and `benchmark_fwd_v3.sh` as well. These test cases and benchmarks specifically validate/benchmark the new padding and variable-length functionalities in both group and batch modes.
* [CK_TILE] Fix build error in fmha unit tests
* [CK_TILE] add mqa, gqa to sequence padding unit tests
* [CK_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI
* [CK_TILE] remove unnecessary MageKArgs overload in FmhaFwdV3Kernel and FmhaFwdKernel |
||
|
|
b56e5d1d79 |
Fix for Add the API to load SGPR (#2913)
* Revert "Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)"
This reverts commit
|
||
|
|
ab22f91a7c |
fix fmha fwd kernel name (#2880)
* fix fmha fwd kernel name
* if the input and output types are the same, keep the original code |
||
|
|
f161b5b738 |
Revert "[CK-Tile] Add the API to load SGPR (#2878)" (#2904)
This reverts commit
|
||
|
|
7b16782d7c |
[CK_TILE] Fix fmha bwd (#2865)
* Fix fmha bwd filter
* remove unnecessary change
* enable test cases
---------
Co-authored-by: Yi DING <yi.ding@amd.com> |
||
|
|
2cbbf5dcb3 |
[CK-Tile] Add the API to load SGPR (#2878)
* Have a workable version for SGPR
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* substitute with the new sgpr read api
* update the CHANGELOG
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* change to static for logic
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb. |