composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Author	SHA1	Message	Date
juuso-oskari	4aff2fa016	CK-UA: fix no-mask multi-Q-block path — was reading too-short K prefix The kernel's `_max_seq_prefix_len` computation unconditionally applied a causal upper bound on the KV-tile loop: _max_seq_prefix_len = context_len + q_block_local_idx * kBlockQ_dyn + (kBlockQ_dyn - 1) + 1 Under causal masking this is the correct optimisation — a Q-block whose largest row index is R only needs to read K[0..R] because rows beyond R are softmax-masked to zero. Under `mask_type=0` (no mask) every Q row must attend to all K rows, so this truncation is incorrect: every Q-block other than the last one ends up reading a too-short prefix of K and the resulting softmax / weighted-sum is over the wrong support. Symptoms at sq=sk=512, hq=hk=5, d=128, bf16, no-mask: Q-block 0 (rows 0..255): max diff vs fp32 attention_ref ≈ 0.25 Q-block 1 (rows 256..511): max diff vs fp32 attention_ref ≈ 1e-3 (ULP) The bug never showed up in the cross-impl sweeps because Triton-UA asserts causal=True (its only supported mode) and sweep_fp8.sh forwards that default through. Fix: gate the truncation behind kHasMask. When kHasMask == false the loop bound is simply `seq_len`, matching the math. Validated against `aiter.test_mha_common.attention_ref` across: - MHA d={64,128} sq=sk∈{256..2048} bf16/fp16 no-mask & causal - GQA-8 d=128 sq=sk∈{256..1024} bf16 no-mask & causal 22/22 stages PASS within bf16/fp16 ULP. sweep_fp8.sh (causal) timings unchanged — the truncation still fires for the causal kernels. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-19 14:29:48 +00:00
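The gating described in this commit can be sketched as follows. This is a minimal illustration rather than the kernel's code: the free function `max_seq_prefix_len`, its signature, and the clamp to `seq_len` are assumptions; only the causal formula and the `kHasMask` gate come from the message above.

```cpp
#include <algorithm>

// Hypothetical helper; the real kernel computes this bound inline.
// kHasMask is a compile-time flag, so the untaken branch folds away.
template <bool kHasMask>
inline int max_seq_prefix_len(int seq_len, int context_len,
                              int q_block_local_idx, int kBlockQ_dyn)
{
    if (kHasMask) {
        // Causal: a Q-block whose largest row index is R only reads K[0..R].
        const int causal_bound = context_len + q_block_local_idx * kBlockQ_dyn
                                 + (kBlockQ_dyn - 1) + 1;
        return std::min(causal_bound, seq_len); // the clamp is an assumption
    }
    // No mask: every Q row attends to every K row.
    return seq_len;
}
```

Assuming `context_len = 0` and two 256-row Q-blocks at sq=sk=512, the causal bound for Q-block 0 is 256 while the no-mask bound is 512, which is consistent with the error showing up only in Q-block 0.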
juuso-oskari	c9bc5350c8	CK-UA: optional paging — contiguous (THD) K/V path, prefill_d128 fp8 -28% Add a `bool kEnablePaging_` non-type template parameter on UnifiedAttentionPipeline (default true preserves the paged behaviour). When false, `refresh__offsets` collapses to a single per-row `logical_token * row_stride` imad — no block_tables fetch, no / % page_size arithmetic, no Tier 0 scalar-promote, no Tier 2 LDS-cache populate. The host selects between paths via a new `args.kv_contiguous` runtime flag plumbed through dispatch_variant<V>. Twelve new prefill instances pin EnablePaging=false: prefill_d{64,128} × {fp16, bf16, fp8} × {mask, nmask} Decode variants stay on the paged path — callers without a KV cache don't have decode workloads, and the binary-size cost isn't justified. Measured impact on the same physical K/V memory (sq=1×4096, causal, page_size=32 paged baseline, MI355, n=30 iters): variant sk paged contig Δ prefill_d64 bf16 4096 0.274 0.227 -17.1 % prefill_d64 bf16 16384 1.529 1.198 -21.6 % prefill_d64 bf16 32768 3.218 2.505 -22.1 % prefill_d64 fp8 4096 0.299 0.235 -21.4 % prefill_d64 fp8 16384 1.489 1.150 -22.7 % prefill_d64 fp8 32768 3.054 2.386 -21.9 % prefill_d128 bf16 4096 0.493 0.397 -19.3 % prefill_d128 bf16 16384 2.638 2.224 -15.7 % prefill_d128 bf16 32768 5.731 4.598 -19.8 % prefill_d128 fp8 4096 0.476 0.341 -28.3 % prefill_d128 fp8 16384 2.416 1.792 -25.8 % prefill_d128 fp8 32768 4.973 3.727 -25.0 % prefill_d128 fp8 at -28 % is the single biggest UA optimisation measured to date — bigger than Tier 0 (-12 %), Tier 2 (-5 %), and the Tier-3 d=64 fp8 win (-16 %). Correctness validated by bit-exact comparison against the paged instance with page_size=32 and identity block_tables on 48 shape × dtype × mask combinations. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-19 13:15:31 +00:00
juuso-oskari	06e1a70e7a	CK-UA: constexpr page_size (Tier 3) — prefill_d64 fp8 -15.8%, prefill_d128 fp8 -6.3% Promote the runtime `page_size` argument to a non-type template parameter `kPageSize_` on UnifiedAttentionPipeline. Thread it through unified_attention_kernel_traits and dispatch_variant<V> so the host-side dispatcher routes on args.page_blk_size ∈ {16, 32, 64} to a constexpr- pinned prefill instance; values outside that menu (or any decode variant) fall back to the existing kPageSize_=0 runtime-page-size instance. Two wins fold together on the prefill tiers: 1. Strength-reduction. Every `/ page_size`, `* page_size`, and `% page_size` in the per-tile address chain collapses to a literal-folded shift / multiply-by-magic (`/ 32` → shr 5, etc). 2. Wider Tier-0/Tier-2 gate. The scalar-promote + LDS-cache fast path now uses the real precondition `KY0_step_N <= kPageSize` at compile time instead of the conservative `KY0_step_N <= 16` hedge — so prefill_d128 bf16/fp16 (KY0_step_N=32), prefill_d64 fp8 (KY0_step_N=32), and prefill_d64 bf16/fp16 (KY0_step_N=64) also enter the fast path at their natural page sizes. Measured impact (sq=sk=75600, MI355, n=30 iters, GQA-8): variant KY0_step_N ps before after Δ prefill_d128 fp8 16 32 119.0 111.5 -6.3 % prefill_d128 bf16 32 32 132.7 130.3 -1.8 % prefill_d64 fp8 32 32 80.9 68.1 -15.8 % prefill_d64 bf16 64 64 74.4 73.4 -1.3 % Decode variants stay on the kPageSize_=0 instances (Tier-0 gate gates them out anyway — <8 warps — and the binary-size cost isn't justified). All sweep_fp8.sh shapes + 21 multi-seed multi-sk-length prefill shapes correctness-PASS. Pre-existing Tier-2 LDS-cache limit (4096 entries) documented in the pipeline header — same constraint applies to the kPageSize_=0 fallback so this is not a regression. 36 new prefill instance files: prefill_d{64,128} × {fp16, bf16, fp8} × {mask, nmask} × {ps16, ps32, ps64}. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-19 12:46:39 +00:00
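A rough sketch of the constexpr-page-size idea from this commit. The names `split_token`, `dispatch_split`, and `PageCoord` are hypothetical; only the `{16, 32, 64}` menu and the `kPageSize_ = 0` runtime fallback come from the message above.

```cpp
// Result of splitting a logical token index into (page, offset-in-page).
struct PageCoord {
    int logical_page;
    int within_page;
};

// kPageSize_ > 0: the divisor is a compile-time literal, so the compiler can
// fold / and % into shifts and masks. kPageSize_ == 0: runtime fallback.
template <int kPageSize_>
inline PageCoord split_token(int logical_token, int runtime_page_size)
{
    const int page_size = (kPageSize_ > 0) ? kPageSize_ : runtime_page_size;
    return PageCoord{logical_token / page_size, logical_token % page_size};
}

// Hypothetical host-side dispatch: route menu values to pinned instances,
// everything else to the runtime-page-size instance.
inline PageCoord dispatch_split(int logical_token, int page_blk_size)
{
    switch (page_blk_size) {
        case 16: return split_token<16>(logical_token, page_blk_size);
        case 32: return split_token<32>(logical_token, page_blk_size);
        case 64: return split_token<64>(logical_token, page_blk_size);
        default: return split_token<0>(logical_token, page_blk_size);
    }
}
```

Both paths return the same coordinates; the pinned instances differ only in what the compiler can fold at build time.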
juuso-oskari	045b1f57bf	CK-UA: widen FP8 K/V async loads to dwordx4 where the tile allows it GetAlignmentK / GetAlignmentV previously returned a blanket 4 B/lane (one dword) for every FP8/BF8 tile, citing the gfx950 LDS-direct load constraint (only dword / dwordx3 / dwordx4 are supported). That cap was correct for the 8-warp prefill variants (kBlockSize=512, NumIssues drops to 0.5 at 16 B/lane) but over-applied to every decode tier, where the 1/2/4-warp tile geometry has plenty of headroom. Refactor the alignment selector into GetKVAlignmentBytes<>, which picks dwordx4 whenever NumIssues = kPageBlockSize * kHeadDim / (kBlockSize * 16) is an integer >= 1 and falls back to dword otherwise. BF16/FP16 paths stay at 16 B/lane on every compiled tile, so existing perf is unchanged. FP8 prefill_d{64,128} also keep the historical dword path because NumIssues = 0.5 there. FP8 decode_d{64,128}_m{16,32,64,128} now use dwordx4: same byte volume per K/V tile but 4x fewer async-load issues (SQ_INSTS_VMEM 131M -> 33M on b=128 sq=1 sk=128000 d=64). Wall-clock impact on the long-context decode sweep (HIP_VISIBLE_DEVICES=2, ITERS=20, WARMUP=5, MI355): shape dtype before after speedup decode d=64 sq=1 sk=128000 b=128 fp8 7.17 ms 4.57 ms 1.57x decode d=64 sq=1 sk=128000 b=256 fp8 16.24 ms 9.51 ms 1.71x decode d=128 sq=1 sk=128000 b=128 fp8 13.11 ms 7.15 ms 1.83x decode d=128 sq=1 sk=128000 b=256 fp8 31.37 ms 9.78 ms 3.21x decode d=64 sq=1 sk=128000 b=4 fp8 0.42 ms 0.22 ms 1.92x decode d=128 sq=1 sk=128000 b=4 fp8 0.80 ms 0.42 ms 1.93x prefill d=64 sq=75600 sk=75600 b=1 fp8 81.4 ms 81.2 ms 1.00x (dword fallback) prefill d=128 sq=75600 sk=75600 b=1 fp8 143.5 ms 143.6 ms 1.00x (dword fallback) Correctness verified across fp8/bf16/fp16, causal/non-causal, and all 7 compiled tile variants. Full PMC + PC-sample analysis is in ua-test-scripts/rocprof_analysis/BOTTLENECK_ANALYSIS.md section 8. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-19 08:06:29 +00:00
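The selection rule can be illustrated with a small sketch, reading the NumIssues expression as kPageBlockSize * kHeadDim / (kBlockSize * 16) for 1-byte FP8 elements. The function name `kv_alignment_bytes` and the tile shapes in the usage note are assumptions, not the real `GetKVAlignmentBytes<>` or the compiled tile geometry.

```cpp
// Hypothetical stand-in for the FP8 branch of the alignment selector.
// Returns 16 (dwordx4) when the tile splits into a whole number (>= 1) of
// 16-byte-per-lane issues across the block, otherwise 4 (dword).
inline int kv_alignment_bytes(int kPageBlockSize, int kHeadDim, int kBlockSize)
{
    const int tile_elems      = kPageBlockSize * kHeadDim; // FP8: 1 byte each
    const int elems_per_issue = kBlockSize * 16;           // 16 B per lane
    const bool whole_issues   = tile_elems >= elems_per_issue &&
                                tile_elems % elems_per_issue == 0;
    return whole_issues ? 16 : 4;
}
```

For an illustrative 64 x 64 tile, a 512-thread block gives NumIssues = 0.5 and falls back to 4 bytes, while a 256-thread block gives NumIssues = 1 and selects 16 bytes.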
juuso-oskari	7a319d9a4b	CK-UA: drop redundant phase-0 s_barrier (-3% fp8 prefill_d128 decode) `ADD_SBARRIER_FOR_PHASE0=1` added an extra `s_barrier()` at the start of every `cl_p` half of every KV iteration, on top of the three barriers that already gate the LDS hand-offs in phases 1/2/3. rocprofv3 bottleneck analysis (b=4 sq=8 sk=4096 hq=64 hk=8 d=128 fp8): the prefill_d128 8-warp variant spends ~15% of GUI_ACTIVE cycles at s_barriers and shows %any_wait ≈ 200%. PC sampling pinpoints the phase-0 `s_barrier` (right after softmax rescale, before async prefetch) as a top hotspot. Examining the data flow shows the phase-0 barrier is redundant: - phase1's `s_waitcnt vmcnt(...); s_barrier` guards the K-LDS write (from the previous iter's K async load) before any warp reads it. - phase2's `s_waitcnt lgkmcnt(0); s_barrier` guards the softmax-P LDS write before gemm1 reads it. - phase3's `s_waitcnt vmcnt(...); s_barrier` guards the V-LDS write before the next iter's V-LDS read. These three already provide every cross-warp ordering the pipeline needs. The phase-0 barrier was purely defensive. Measurement: 0.1945 → 0.1883 ms (n=300 iters × 3 trials, single shape). Correctness verified against the Triton reference on fp8/bf16/fp16 × {b=4/32/128} × {sq=1/4/8} × {causal,non-causal} × d∈{64,128}. Leaving the macro and the `=1` documented path in place so the previous behaviour can be restored if a future arch/shape regresses. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 19:16:48 +00:00
juuso-oskari	3431615ff0	CK-UA: fuse FP8 cvt + cross-lane swap to hide ds_bpermute latency Previously the 32x32x16 FP8 P-tile cvt and the QK-C -> PV-A cross-lane swap ran in two separate static_for loops back-to-back inside fmha_alu1: the whole tile was cvt'd into p.thread_buf_ first, then a second pass issued one ds_bpermute_b32 per 8-fp8 K-chunk and read/wrote the same buffer to swap the "bad" 4-byte halves between paired lanes. The ds_bpermute has nontrivial LDS-DMA latency that the scheduler has no way to hide when it lives alone in a tight serial loop with the gather/scatter packs around it. Fuse the two into one 8-fp8-per-iter loop: 1. cvt 8 fp32 -> 2 packed uint32 (lo_pack=slot[0..3], hi_pack=slot[4..7]) using the chained cvt_pk_fp8_f32 pattern matching cast_tile_pk_fp8_fp32. 2. Pick own_bad = (sub==0 ? hi_pack : lo_pack) and issue ds_bpermute on it. 3. Write back all 8 fp8 bytes; the "good" half lands first so its byte stores can overlap with the in-flight ds_bpermute, and the next iter's cvts can begin while the swap is still pending. The 16x16x32 LDS-roundtrip branch keeps the original separated cvt loop (no swap latency to hide there since the relayout goes through LDS, not ds_bpermute). Single-shape FP8 perf on gfx950 GPU 2 (CUDA graph, 50 iters): decode d=128 b=4 sq=8 sk=4096: 0.2106 -> 0.1951 ms (-7.4%) decode d=64 b=4 sq=8 sk=4096: 0.1464 -> 0.1208 ms (-17.5%) prefill d=128 b=2 sq=512 sk=4k: 0.2558 -> 0.2220 ms (-13.2%) BF16 unchanged (0.2046 -> 0.2039 ms, within noise). Correctness: pytest UA correctness suite 405 passed / 80 skipped (245 BF16/FP16 + 160 FP8), unchanged from before. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-18 15:48:01 +00:00
juuso-oskari	9d7cc3ee9e	CK-UA: extend FP8 to the 16x16x32 _m16 decode tier via LDS roundtrip The 32x32x16 tiers (prefill_d{64,128}, decode_d{64,128}_m{32,64,128}) keep the cheap in-register `ds_bpermute_b32` cross-lane swap that fixes the QK-C / PV-A per-thread alias for the union'd `sp_compute` / `p`. The 16x16x32 m16 tiers (decode_d{64,128}_m16) cannot use the swap -- the MFMA puts the paired-lane bit at a different position and the sub=0/sub=1 4-fp8 chunks no longer map onto each other. We add a layout-agnostic LDS roundtrip as the `else` branch, gated by the same `PVWarpTile` constexpr: - Hoist two distribution-bound windows over the existing `p_lds` region (one bound to the QK-C output distribution, one to the PV-A input distribution). Done once per kernel invocation. - In `fmha_alu1`, after the cvt_pk_fp8_f32 packing chain, view the union's bytes as a `static_distributed_tensor<fp8>` in the QK-C distribution, `store_tile` it through `p_lds` in canonical (M, N) order, `s_barrier`, then `load_tile` back with the PV-A distribution and copy into `sp(idx).p`. A/B'd a uniform LDS-roundtrip (no fast-path) vs the split: pure LDS regressed decode_m128 by ~1.5x end-to-end (CK FP8 dropped from ~0.39x of Triton FP8 to ~0.16x), driven by the extra block-wide barrier on the 4-warp decode path. Keeping the swap for 32x32x16 preserves the previously-tuned perf. Dispatcher (`unified_attention.cpp`) now FP8-enables every UA variant including decode_d{64,128}_m16. Four new instance .cpp files (`unified_attention_d{64,128}_fp8_{mask,nmask}_decode_t.cpp`) instantiate the m16 FP8 kernels. Pytest (`test_unified_attention_ck_correctness.py`): - 245 BF16/FP16: pass (no regression from the pipeline edit). - 160 FP8: pass (was 112 before m16 enablement). - 80 skipped: block_size<32 or query_len>kv_len -- pre-existing. 
Single-shape m16 dispatches verified on gfx950: b=128 sq=1 hq=hk=8 d=128 fp8 PASS (CK 0.109 ms / Triton 0.043 ms) b=128 sq=1 hq=hk=8 d=64 fp8 PASS (CK 0.077 ms / Triton 0.039 ms) Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-15 20:00:35 +00:00
juuso-oskari	63c75277a0	CK-UA: enable FP8 (e4m3) for prefill/m128 and the 32x32x16 small-tile decode variants Full pipeline support for FP8 (e4m3fn on gfx950 / e4m3fnuz on gfx942) in the unified-attention kernel, gated to the 32x32x16 MFMA tiers in both d=64 and d=128 ladders: prefill_d{64,128}, decode_d{64,128}_m128, decode_d128_m32, and decode_d64_m64. The 16x16x32 _m16 tiers stay BF16/FP16-only -- the QK-C and PV-A per-thread layouts there differ by an M<->N swap that the current slot-swap fixup cannot express; a full per-thread transpose (most likely via LDS) is needed. Pipeline (unified_attention_pipeline.hpp): * `fmha_alu1` now performs a cross-lane P-tile fixup right after the FP8 packing of softmax(P). It's a `ds_bpermute_b32` between paired lanes `lane ^ 32`, swapping sub=0 slot[k_base+4..k_base+7] with sub=1 slot[k_base..k_base+3] for every 8-fp8 chunk. This realigns the FP8 packed P operand with PV-A's `Single` AttrNumAccess per-thread layout, which is necessary because the QK-C output and PV-A input alias byte-for-byte via the sp_compute/p union -- and for FP8 the two warp-gemm layouts no longer agree (BF16/FP16 keep Double AttrNumAccess in the PV gemm, which matches QK-C natively). Gated on `Gemm1WarpTile == 32x32x16`; FP8-only (BF16/FP16 paths take the existing cvt_pk path unchanged). Default policy (unified_attention_pipeline_default_policy.hpp): * PV warp gemm now selects `WGAttrNumAccessEnum::Single` when V is fp8/bf8 and `Double` otherwise. Forced by load_tile_transpose's SubMinDim = 64-bit / sizeof(V) constraint: for FP8 SubMinDim=8 and kABKPerLane=8 only Single satisfies the validation static_asserts. * GetAlignmentK / GetAlignmentV on gfx950 drop to 4 B/lane for fp8/ bf8. The natural 16 B/lane async-load that BF16/FP16 use leaves NumIssues = 0 for the FP8 tile shapes we compile, and 8 B/lane fails the dword / dwordx3 / dwordx4 constraint in amd_buffer_addressing_builtins. 
4 B/lane gives NumIssues >= 1 on every targeted variant and is the same alignment the gfx942 fallback already used. BF16/FP16 keep the full 16 B/lane path so existing perf is unchanged. * GetSmemSizeKV adds a `VLoadDescSize` lower bound. The MakeVLdsLoadBlockDescriptor's element span dominates the banked SingleVSize only for FP8 (small per-lane KVector + fixed kVLdsPadInBytes = 64), so without it FP8 hits the GetSmemSizeKV static_asserts. BF16/FP16 are unaffected. Warp-gemm headers + dispatcher: * New `WarpGemmMfma_f32_32x32x16_fp8_fp8_CTransposed_T<AttrNumAccess>` template alias in warp_gemm.hpp (mirrors the existing BF16 32x32x16 CTransposed template), used by the PV gemm to thread the FP8 Single AttrNumAccess through. * New Dispatcher specialization for <fp8_t, fp8_t, float, 32, 32, 16, true, false, false, EDouble> in warp_gemm_dispatcher.hpp routing to the new template. ABI / dispatcher (unified_attention.{cpp,hpp}, unified_attention_impl.hpp): * New `fp8` value in `unified_attention_args::data_type_enum` (selects e4m3fn on gfx950 via CK_TILE_USE_OCP_FP8, e4m3fnuz elsewhere). * New `unified_attention_problem_traits<...::fp8>` alias: qkvp_dtype = ck_tile::fp8_t, acc_dtype = float, o_dtype = bf16_t (matches the Triton reference), lse_dtype = float. * Per-tensor `q_descale` / `k_descale` / `v_descale` floats on `unified_attention_args` (default 1.0f so non-FP8 round-trips cleanly). The pipeline folds q_descale * k_descale into the softmax scale and applies v_descale once to o_acc after the 1/l norm -- same semantics as Triton's q_scale/k_scale/v_scale. `dispatch_variant<>` enables FP8 on prefill_d{64,128}, decode_d{64,128}_m128, decode_d128_m32, decode_d64_m64. The 16x16x32 _m16 tiers return (false, -1.f) for now (see top comment). Instances: * 12 new FP8 .cpp files under example/.../42_unified_attention/ instances/ covering the 6 enabled variants x {mask, nmask}. 
Validation: 112 / 0 / 128 in the FP8 pytest sweep (passed / failed / m16-skipped); 245 / 245 in the BF16/FP16 sweep (no regression). Functional correctness is within the FP8 quant-noise tolerance the Triton FP8 suite uses (atol/rtol = 1.5e-1). Perf still trails Triton across the enabled tiers (CK FP8 / Triton FP8 = 0.39-0.69x on the shapes we benchmarked); that's a separate workstream. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-15 17:34:50 +00:00
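A scalar sketch of the per-tensor descale handling described in this commit, reading the fold as the product of `q_descale` and `k_descale` multiplied into the softmax scale. The helper names `fold_descale_into_scale` and `normalize_and_descale` are hypothetical; the real pipeline applies these steps to tiles, not to a `std::vector`.

```cpp
#include <vector>

// Q and K descales multiply the logits, so they fold into the softmax scale.
inline float fold_descale_into_scale(float scale_s, float q_descale, float k_descale)
{
    return scale_s * q_descale * k_descale;
}

// V's descale is applied once to the accumulator, after the 1/l normalisation.
inline void normalize_and_descale(std::vector<float>& o_acc, float l, float v_descale)
{
    const float inv_l = 1.0f / l;
    for (float& x : o_acc) {
        x = x * inv_l * v_descale;
    }
}
```

With all three descales at the default 1.0f, both helpers reduce to the non-FP8 behaviour.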
juuso-oskari	c0e985d075	CK-UA: document why per-issue SRD-rebase path was tried and dropped Replace the speculative TODO-style comment next to the K_mem_load / V_mem_load dispatch with a record of the actual experiment: we implemented async_load_tile_raw_rebased (buffer_load_dword_lds with a per-issue SRD whose 48-bit base absorbs the wave-uniform page offset), verified correctness on multiple big-cache decode shapes, and measured it against the existing async_load_tile_raw_long path on an isolated GPU. Rebased was at best tied with long and at worst ~6% slower (b=1 sk=1M d=64 GQA8: 2.46 ms vs 2.32 ms; b=8 sk=200k d=128 GQA8: 2.12 ms vs 2.02 ms). The workloads are compute / softmax bound, not K/V load bandwidth bound, so the buffer_load throughput edge never materialises, while the per-issue SRD construction adds real SGPR pressure. No functional change in this commit -- only the explanatory comment is updated so the next person who eyes the same idea finds the receipts before re-implementing. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-15 10:18:39 +00:00
juuso-oskari	1f69421434	CK-UA: dispatch K/V async load on cache_ptr_int32_overflow_possible The shared-SRD buffer_load_dword_lds path that K_mem_load / V_mem_load use wraps the per-lane voffset (int32 bytes) once num_blocks * page_size * row_stride * sizeof(T) > INT32_MAX, silently returning wrong data on large paged-KV pools (e.g. >4 GB caches). Add a second path, async_load_tile_raw_long, that issues the same load via __builtin_amdgcn_global_load_lds with per-lane 64-bit base pointers, lifting both 4 GB limits (SRD size + voffset). Per-issue LDS pointers are computed explicitly because the intrinsic sets m0 itself, so the old m0_set / m0_inc bookkeeping doesn't apply. The path also clamps lane_elem_off to the live buffer range to mimic the original SRD's hardware OOB behaviour. Dispatch is a wave-uniform runtime branch on a new cache_ptr_int32_overflow_possible flag plumbed from unified_attention_args through MakeKargs into the pipeline operator(). Small caches keep the original buffer_load throughput; only the (rare) >4 GB cache pays the global_load_lds cost. k_page_offsets / v_page_offsets are widened to long_index_t. The original buffer_load path implicitly narrows back to int32 when forwarding through async_get_vectorized_elements_raw, which is intentional and safe whenever the overflow flag is false. For diagnostics, also derive a constexpr KWaveSpanInN = (LaneGroups - 1) * NumWarps + 1 inside the pipeline; when this exceeds page_size a single buffer_load spans multiple random pages, so the per-issue SRD-rebase optimisation (not implemented yet) would not apply even on a sub-4 GB cache. Informational only today. Test: ua-test-scripts correctness sweep (245/245 pass), plus test_single_shape.py -b 32 -sq 8192 -sk 120000 -hq 64 -hk 8 -d 64 \ --num-blocks 1200000 --block-size 16 --test which previously returned wrong data due to the int32 wrap and now passes with max abs diff 1.22e-04 vs Triton. 
Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-15 09:00:43 +00:00
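How a host might derive the overflow flag can be sketched from the condition quoted in this commit. The function below is an assumption about the shape of that check (the commit only names the flag); the product is evaluated in 64-bit so the test itself cannot wrap.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical host-side check: true when the largest byte offset into the
// paged K/V pool cannot be represented in a signed 32-bit voffset.
inline bool cache_ptr_int32_overflow_possible(std::int64_t num_blocks,
                                              std::int64_t page_size,
                                              std::int64_t row_stride,
                                              std::int64_t elem_bytes)
{
    const std::int64_t cache_bytes = num_blocks * page_size * row_stride * elem_bytes;
    return cache_bytes > std::numeric_limits<std::int32_t>::max();
}
```

For the test shape quoted above (1200000 blocks of 16 tokens), and assuming row_stride = hk * d = 512 elements of 2 bytes each, the pool is far beyond INT32_MAX and the flag would select the `async_load_tile_raw_long` path.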
juuso-oskari	d77f0bea63	CK-UA: collapse MHA/GQA variants -- one binary per (head_dim, kBlockM) After moving kBlockQ to runtime in the previous commit, the static NumQPerKV in `variant_config<V>` and the runtime-vs-static assert in the kernel became the only things still tying a compiled binary to a specific num_queries_per_kv. Drop both and the existing instances now serve every num_qpkv that divides kBlockM evenly. Concretely: * `variant_config<V>` -- remove the NumQPerKV field from every specialization. * `unified_attention_kernel_traits` -- remove the `num_queries_per_kv` / `kBlockQ = kBlockM / num_qpkv` derivation. The BlockTile's 2nd entry (the static `kBlockQ` exposed via UnifiedAttentionShape) is anchored at kBlockM so it describes the "num_qpkv == 1" fallback; the actual kBlockQ is always the runtime value. * `unified_attention_kernel_launch` -- recompute kBlockQ at host time from `args.num_queries_per_kv` for the total_num_q_blocks math. * `unified_attention_kernel.hpp` -- drop the `assert(kBlockQ_dyn == kBlockQ)` (it enforced the very coupling we just removed). * `unified_attention.cpp::select_config` -- collapse the two per-num_qpkv code paths into a single (head_dim, avg_rows, max_rows) ladder, where avg_rows = avg_q * num_qpkv. Variant renames (8 variants): prefill_d128_mha -> prefill_d128 decode_d128_mha_m128 -> decode_d128_m128 decode_d128_mha_m32 -> decode_d128_m32 decode_d128_mha_m16 -> decode_d128_m16 prefill_d64_gqa8 -> prefill_d64 decode_d64_gqa8_m128 -> decode_d64_m128 decode_d64_gqa8_m64 -> decode_d64_m64 decode_d64_gqa8_m16 -> decode_d64_m16 The 16 d=64 instance files lose their `_gqa8` infix to match the d=128 naming (file count unchanged: 16 dtypes x mask combos per head_dim). Validation: * Correctness suite: 241/245 (same 4 pre-existing int32-overflow failures in the prefill rebased-pointer path). * d=128 GQA-8 (a NEW combo we never had a binary for) -- runs correctly on the existing decode_d128_m* binaries with num_qpkv=8 at runtime. 
max abs diff <= 1e-2 vs the torch reference at ql in {1, 4, 16}. * d=64 MHA (also a new combo) -- runs correctly on the existing decode_d64_m* binaries with num_qpkv=1. Same tolerance. * Perf sweep (b=4..256, sk=120000, MI300): d=64 GQA-8: speedups 1.28x..1.84x vs Triton (within 0.6% of baseline). d=128 MHA: speedups 0.98x..1.14x vs Triton (within 0.3% of baseline). Unlocked: adding new (head_dim, num_qpkv) combos no longer requires new kernel binaries -- just a host-side heuristic update mapping the combo to the appropriate (kBlockM, BlockWarps) ladder. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-12 12:15:55 +00:00
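The host-side recomputation of kBlockQ can be sketched as below. `runtime_block_q` and its -1 sentinel are hypothetical; the commit only states that kBlockQ is derived from `args.num_queries_per_kv` and that an instance serves every num_qpkv that divides kBlockM evenly.

```cpp
// Hypothetical helper: returns the runtime kBlockQ for a compiled kBlockM,
// or -1 when this binary cannot serve the requested num_queries_per_kv.
inline int runtime_block_q(int kBlockM, int num_queries_per_kv)
{
    if (num_queries_per_kv <= 0 || kBlockM % num_queries_per_kv != 0) {
        return -1; // num_qpkv must divide kBlockM evenly
    }
    return kBlockM / num_queries_per_kv;
}
```

Assuming the `_m128` suffix denotes kBlockM = 128, the same binary yields kBlockQ = 128 for MHA (num_qpkv = 1) and kBlockQ = 16 for GQA-8.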
juuso-oskari	614afea7eb	CK-UA: derive kBlockQ at runtime, decouple from variant template kBlockQ (= kBlockM / num_queries_per_kv) was constexpr in `UnifiedAttentionShape` / the kernel-traits, forcing one kernel instance per (kBlockM, num_qpkv) pair even though the matmul tile is fully determined by kBlockM and kHeadDim. Audit confirmed kBlockQ only feeds: * arithmetic in `unified_attention_kernel.hpp` (loop bounds, Q-tile indexing, query_len padding), * `pad_tensor_view` size tuples for Q/O/LSE DRAM views, * one `mask.IsEdgeTile(... number<kBlockQ>{} ...)` call inside the pipeline's per-K-tile mask check. None of these structurally need a compile-time value: * `pad_tensor_view` already accepts mixed runtime/compile-time tuple elements (e.g. it's passed plain `1` next to `kHeadDimPadded`). * `IsEdgeTile` only does runtime arithmetic on the tile size; adding a runtime overload that accepts `index_t` is trivial (the compile-time one now forwards to it). Wiring: * `block_masking.hpp` -- add an `IsEdgeTile(..., index_t tile_h, index_t tile_w)` overload; the existing `number<>` overload just forwards to it. * `unified_attention_pipeline.hpp` -- new optional `num_queries_per_kv` arg on the pipeline's `operator()` (default 0 keeps existing call sites unchanged). Computes `kBlockQ_dyn = (num_qpkv > 0) ? (kBlockM / num_qpkv) : kBlockQ` once at the top, uses it in the IsEdgeTile call. * `unified_attention_kernel.hpp` -- compute `const index_t kBlockQ_dyn = kBlockM / kargs.num_queries_per_kv` once and replace every per-call `kBlockQ` use with `kBlockQ_dyn`. Pass `kargs.num_queries_per_kv` through to the pipeline. The debug-only assert(`kBlockQ_dyn == kBlockQ`) keeps the static and dynamic values in lock-step until we actually collapse variants. Perf A/B (b=4..256, sk=120000, MI300): d=128 MHA (num_qpkv = 1, runtime div is trivial): BW within +/-0.2% across all batch sizes (noise). 
d=64 GQA-8 (num_qpkv = 8, runtime division actually happens): speedups 1.28x..2.14x vs Triton -- identical to baseline. Correctness suite stays at 241/245 (same 4 pre-existing int32-overflow failures in the d=128 prefill rebased-pointer path). This is a no-op on perf and unlocks a follow-up where we collapse the two num_qpkv values per (head_dim, kBlockM) -- e.g. the future d=128 GQA-8 variant can reuse the existing decode_d128_mha_* instances by just passing a different runtime num_queries_per_kv. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-12 12:01:59 +00:00
juuso-oskari	25364aa634	Add KV-segment parallelism to CK unified attention pipeline End-to-end split-KV (FlashDecoding-style) for the CK unified attention kernel. The host launches a single 3D grid with z == num_splits; each CTA computes its KV-range slice and writes a normalized (o_acc, lse) partial to FP32 workspaces, which the caller reduces into the final output. Pipeline changes: - operator() returns ck_tile::make_tuple(o_acc, lse) instead of just o_acc. The masked-empty early-exit returns lse = -inf so downstream combine weighs the partial as zero. - LSE is built in the natural-log domain from the pipeline's unscaled rowmax: lse = (scale_s / log2(e)) * m + log(l). Previously we used m / log2(e) + log(l), which dropped the per-head scale and produced LSE values ~1/scale too large. - Fix post-process parity: which SP register is left in the alu0-done/alu1-pending state at loop exit depends on the parity of the iteration count (= num_total_loop - num_blocks_start), not on num_total_loop alone. For non-split (num_blocks_start == 0) the two parities coincide; for splits starting at an odd tile they don't. - Fix split-KV page-table offset: num_blocks_start is counted in kPageBlockSize-sized tiles, but block_tables is indexed in page_size-sized pages — shifting block_table_offset by num_blocks_start reads the wrong pages whenever kPageBlockSize != page_size. Replaced with split_token_offset = num_blocks_start * kPageBlockSize added to logical_token before /page_size, so the page lookup uses the absolute token position. Kernel + dispatcher: - Drop kargs.i_split; each CTA reads i_split = blockIdx.z. - GridSize{2D,Decode} now take num_splits and add it as the z-dim (defaults to 1, so non-split callers see dim3(..., 1, 1)). - New write path: when num_splits > 1, the kernel skips the user epilogue and instead writes the FP32 (o_acc, lse) tile pair into workspace tensors at [head, split, batch_start_token, ...] 
using Default2DEpilogue (UseRawStore=true) for o_acc and store_tile for lse. Host strides come from kargs. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-12 08:42:09 +00:00
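The caller-side reduction is not part of this commit; the sketch below shows one standard way to merge normalized `(o_acc, lse)` partials with a log-sum-exp weighting, under the assumption that each partial holds one output row. `Partial`, `lse_natural`, and `combine_splits` are hypothetical names. `lse_natural` restates the LSE formula from the message.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Partial {
    std::vector<float> o; // normalized partial output row
    float lse;            // natural-log LSE; -inf for a masked-empty split
};

// lse = (scale_s / log2(e)) * m + log(l), with m the unscaled row max.
inline float lse_natural(float scale_s, float m, float l)
{
    const float log2e = 1.4426950408889634f;
    return (scale_s / log2e) * m + std::log(l);
}

// Weighted merge: o = sum_i exp(lse_i - lse) * o_i, lse = log(sum_i exp(lse_i)).
inline std::vector<float> combine_splits(const std::vector<Partial>& parts,
                                         float* lse_out = nullptr)
{
    const float neg_inf = -std::numeric_limits<float>::infinity();
    float max_lse = neg_inf;
    for (const Partial& p : parts) max_lse = std::max(max_lse, p.lse);

    std::vector<float> out(parts.empty() ? 0 : parts[0].o.size(), 0.0f);
    if (!std::isfinite(max_lse)) { // every split was empty
        if (lse_out) *lse_out = neg_inf;
        return out;
    }
    float sum = 0.0f;
    for (const Partial& p : parts) sum += std::exp(p.lse - max_lse);
    for (const Partial& p : parts) {
        const float w = std::exp(p.lse - max_lse) / sum; // 0 when lse == -inf
        for (std::size_t j = 0; j < out.size(); ++j) out[j] += w * p.o[j];
    }
    if (lse_out) *lse_out = max_lse + std::log(sum);
    return out;
}
```

A split that returns lse = -inf gets weight exp(-inf) = 0, which is why the masked-empty early exit can safely write any o_acc.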
juuso-oskari	473869aba5	Lift kPageBlockSize <= page_size constraint in CK-UA pipeline Refactor the K/V DRAM access in the unified-attention pipeline to use tile_scatter_gather with a unified per-(thread, Y0-iter) page-offset formula: logical_token = tile_idx * kPageBlockSize + thread_N_pos + i * Y0_step_N logical_page = logical_token / page_size within_page = logical_token % page_size phys_page = block_tables[block_table_offset + logical_page] page_offsets[i] = (phys_page * page_size + within_page) * row_stride The page indirection now lives entirely in page_offsets, refreshed via update_page_idx() between iters. The per-iter SRD rebase (set_bottom_tensor_view_data_ptr + init_raw) and the use_ptr_rebase overflow heuristic are gone. Effects: - The assertion kv_page_size_in_blocks >= 1 (i.e. kPageBlockSize <= page_size) in the kernel is dropped. Tiles may now span multiple cache pages, as long as Y0_step_N (= N1 * N2 from the K/V tile dist) divides page_size so that a wave-wide load never straddles a page. - Pipeline arg renamed kv_page_size_in_blocks -> page_size (PageSize in tokens). Kernel passes kargs.page_size through directly. - Validated correctness vs Triton on bf16 / d=64 / decode_s with block_size in {16, 32, 64}; max abs diff 1.22e-04 in all cases. Perf is on par with the prior pass-1 scaffolding (~3.6 ms on the 131072-context shape). TODO(overflow): page_offsets are index_t; caches whose num_blocks * page_size * row_stride exceeds INT32_MAX will wrap. A future change should plumb long_index_t through the scatter-gather load path or compute a per-batch min-page shift in a pre-pass. TODO(unsupported regime): page_size < Y0_step_N (a wave crosses a page mid-iter) needs per-lane VGPR SRDs and is not implemented. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-11 10:04:01 +00:00
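The page-offset formula from this commit, written out as a standalone host-side sketch. The function `page_offset` and its parameter list are hypothetical. The commit stores offsets as `index_t`; this sketch uses a 64-bit result instead, which is an assumption made to sidestep the overflow noted in the TODO.

```cpp
#include <cstdint>
#include <vector>

// Element offset of the row a thread loads on Y0 iteration i of a K/V tile.
inline std::int64_t page_offset(const std::vector<int>& block_tables,
                                int block_table_offset,
                                int tile_idx, int kPageBlockSize,
                                int thread_N_pos, int i, int Y0_step_N,
                                int page_size, std::int64_t row_stride)
{
    const int logical_token = tile_idx * kPageBlockSize + thread_N_pos + i * Y0_step_N;
    const int logical_page  = logical_token / page_size;
    const int within_page   = logical_token % page_size;
    const std::int64_t phys_page = block_tables[block_table_offset + logical_page];
    return (phys_page * page_size + within_page) * row_stride;
}
```

With page_size = 16 and kPageBlockSize = 32, a single tile covers two logical pages, so iterations i = 0 and i = 1 (Y0_step_N = 16) can resolve to unrelated physical pages.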
root	8506db8761	Fix int32 overflow in CK-UA pipeline via pointer rebasing tensor_coordinate::get_offset() returns index_t (int32), causing overflow when page_idx * block_size * stride > 2^31 (~131K blocks for d64/GQA-8). Fix: rebase K/V data pointer for each page using int64 arithmetic instead of set_window_origin with large offsets. After rebasing p_data_ and buffer_size_, call init_raw() to refresh the AMD buffer resource descriptor, then set_window_origin({0,0}) to reset cached coordinates. Tested: num_blocks up to 2M with nkh=1/8, blk=32/64. All pass. Made-with: Cursor	2026-04-02 09:39:07 +00:00
root	e8587b86c2	Fix CK-UA pipeline: s_waitcnt_vmcnt<0> in fmha_post_process The final V tile's async load was not properly waited on before reading from LDS: s_waitcnt_vmcnt<K_inst> allowed V_inst outstanding loads (a no-op when K_inst == V_inst). The last loop iteration never prefetches K, so only V is outstanding. Use s_waitcnt_vmcnt<0> unconditionally. This partially fixes the BS32 race condition for production workloads (maxk >= 256). A deeper pipeline race remains for very short KV sequences (maxk < ~165, 2-5 pages) with block_size=32 at high batch. Made-with: Cursor	2026-04-01 23:04:07 +00:00
root	87d16738bf	WIP: CK-UA KV-segment parallelism - kernel args and split range Added split-KV fields to UnifiedAttentionVarlenKargs (num_splits, i_split, lse_acc_ptr, o_acc_ptr + strides). Modified operator() to compute per-split KV range using blocks_per_split. INCOMPLETE: The pipeline returns normalized o_acc but the split-KV combine kernel needs unnormalized o_acc + lse. Need to modify the pipeline to optionally return m and l values alongside o_acc. The kernel changes compile but the epilogue needs the split path (write to float accumulators instead of final output). Made-with: Cursor	2026-04-01 19:09:59 +00:00
root	cd7ba6e2e8	Add unified attention (42_unified_attention) Squashed from aghamari/unified-attention-decode-opt branch. CK tile paged-KV attention kernel optimized for decode with 4-tier dispatch (tiny/small/medium/large), 16x16 MFMA, 2D decode grid, head-group merging. Supports hdim=64 GQA-8 and hdim=128 MHA with block_size=32. Made-with: Cursor	2026-04-01 16:39:15 +00:00
root	ec2db01e4a	Fix fmha_fwd early-exit bug: seqlen_q <= min_seqlen_q should be < The kSkipMinSeqlenQ optimization incorrectly used <= comparison, causing the kernel to skip batches where seqlen_q equals min_seqlen_q. This happens in the common case of no padding (all batches have the same seqlen_q == min_seqlen_q), producing all-zero output silently. Changed to strict < so batches with exactly min_seqlen_q tokens are still processed. Made-with: Cursor	2026-04-01 16:24:31 +00:00
root	6729989b97	Fix FMHA split-KV for paged-KV with page_block_size < kN0 Cherry-picked from aghamari/unified-attention-decode-opt (fadf0d585). - block_masking.hpp: 5-param GetTileRangeAlongX for GenericAttentionMask - fmha_fwd_splitkv.py: bn0=32 for hdim=64 Made-with: Cursor	2026-04-01 16:24:19 +00:00
root	4c5e290378	Add unified attention (42_unified_attention) and topk_softmax_decode Squashed from aghamari/unified-attention-decode-opt branch. 42_unified_attention: CK tile paged-KV attention kernel optimized for decode with 4-tier dispatch (tiny/small/medium/large), 16x16 MFMA, 2D decode grid, head-group merging. Supports hdim=64 GQA-8 and hdim=128 MHA with block_size=32. topk_softmax_decode: fused topk + softmax kernel for M=1 MoE decode. Made-with: Cursor	2026-04-01 16:24:04 +00:00
Chinmay Dattanand Kuchinad	2bb69a24ea	[rocm-libraries] ROCm/rocm-libraries#5776 (commit ee1bbcb) [CK] Fix async pivot mismatch in persistent GEMM kernel scheduler (#5776) ## Motivation Fix pivot mismatch in the persistent GEMM kernel's async input scheduler that causes GPU hangs and incorrect results when used with AsyncTP (Asynchronous Tensor Parallelism) on ROCm. PyTorch's `_fused_all_gather_matmul_native` uses this persistent GEMM kernel with chunk signals to overlap communication and computation. The pivot mechanism ensures each rank starts computing from its own local shard first (which is already available), then moves to remote chunks as they arrive over the network. Because of the pivot mismatch, the kernel frequently waits on signals for chunks that have not yet arrived, while attempting to read data from completely different chunks. This synchronization desync reliably triggers infinite hangs during multi-GPU native AsyncTP execution. This fix is required to enable functional AsyncTP support on ROCm. ## Technical Details In the persistent kernel loop (`UniversalGemmKernel::operator()`), the M-tile coordinate used for data selection (`i_m`) and the M-tile coordinate used for the chunk-signal wait (`chunk_idx`) were derived from inconsistent bases: * `i_m` was computed from the unpivoted tile index `iM`. * `chunk_idx` was computed from the pivoted expression `(iM + tile_idx_pivot)`. This means the kernel could wait for chunk N's signal but then read from chunk M's memory, or vice versa. The mismatch scales with GPU count: with 2 GPUs ~50% of tiles are wrong, with 4 GPUs ~75%, etc. The Fix: Introduce a single pivoted M-tile index (`iM_eff`) and derive both `i_m` and `chunk_idx` from it. This guarantees the kernel always waits for the correct chunk before reading its data. (Note: Minor cosmetic `clang-format` changes were also pulled in alongside the fix). ## Test Plan 1. Build PyTorch with this CK change. 2. 
Run the specific multi-GPU AsyncTP native test: `timeout 180s env HIP_VISIBLE_DEVICES=0,1 pytest test/distributed/test_symmetric_memory.py -k test_fused_all_gather_matmul_native -q -s -x` ## Test Result Tests verify correct overlapping execution without hangs or accuracy mismatches when running the AsyncTP native path with non-zero pivots. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 16:22:08 +00:00
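The pivot mismatch described in 2bb69a24ea can be sketched on the host as follows. This is an illustration under stated assumptions, not the `UniversalGemmKernel` code: M tiles are assumed to be grouped into equal-sized chunks, the pivot is assumed to wrap modulo the tile count, and `TileChunks`, `mismatched_indexing`, `unified_indexing`, and `count_mismatches` are hypothetical names.

```cpp
#include <cassert>

// Which chunk a tile reads data from, and which chunk's signal it waits on.
struct TileChunks
{
    int data_chunk;
    int wait_chunk;
};

// Before the fix (as described): data index from iM, signal index from iM + pivot.
inline TileChunks mismatched_indexing(int iM, int pivot, int num_m_tiles, int tiles_per_chunk)
{
    const int i_m       = iM;                                              // unpivoted
    const int chunk_idx = ((iM + pivot) % num_m_tiles) / tiles_per_chunk;  // pivoted
    return {i_m / tiles_per_chunk, chunk_idx};
}

// After the fix (as described): both derived from one pivoted index, iM_eff.
inline TileChunks unified_indexing(int iM, int pivot, int num_m_tiles, int tiles_per_chunk)
{
    const int iM_eff = (iM + pivot) % num_m_tiles;  // assumption: wrap-around pivot
    const int chunk  = iM_eff / tiles_per_chunk;
    return {chunk, chunk};
}

// Count tiles whose data chunk differs from the chunk they wait on.
inline int count_mismatches(TileChunks (*indexing)(int, int, int, int),
                            int pivot, int num_m_tiles, int tiles_per_chunk)
{
    assert(num_m_tiles > 0 && tiles_per_chunk > 0);
    int mismatches = 0;
    for (int iM = 0; iM < num_m_tiles; ++iM)
    {
        const TileChunks t = indexing(iM, pivot, num_m_tiles, tiles_per_chunk);
        if (t.data_chunk != t.wait_chunk)
            ++mismatches;
    }
    return mismatches;
}
```

In this model a rank with pivot 0 sees no mismatches, while a rank whose pivot equals one chunk sees every tile wait on the wrong chunk. Averaged over ranks, that is consistent with the roughly 50% figure quoted for 2 GPUs.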
aledudek	119712bd90	[rocm-libraries] ROCm/rocm-libraries#4469 (commit 0844cb0) [CK_TILE] Add pooling in tile_engine ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> Add pooling in ck tile engine ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 07:32:36 +00:00
Yi DING	791afc6465	[rocm-libraries] ROCm/rocm-libraries#5991 (commit 8d85e8e) [CK_TILE] Fix FMHA BWD IGLP incorrect results due to AGPR misallocation (#5991) ## Motivation After PR #5790 removed the `if constexpr(FmhaMask::IsMasking)` guard around the `num_total_loop <= 0` early-exit check, the IGLP pipeline (`BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`) produces incorrect dK/dV gradients for non-masking kernels (even with fix in #5915). Assembly inspection confirms that the CFG change causes the LLVM register allocator to reuse AGPR accumulators as scratch destinations in the dK/dV reduction loop, breaking the loop-carried accumulation across Q-tile iterations. ## Technical Details - Add `[[unlikely]]` to the `num_total_loop <= 0` early-exit in `BlockFmhaBwdDQDKDVPipelineKRKTRVRIGLP`. This attribute is load-bearing: it restores the CFG shape that the register allocator needs to correctly assign dedicated AGPRs to each column of the dK/dV accumulator. - Only the IGLP pipeline is affected; the other two BWD pipelines do not exhibit this issue. ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-04-01 05:45:19 +00:00
Estevan Vedovelli	a33b5be1b9	[rocm-libraries] ROCm/rocm-libraries#6022 (commit 54b284a) [CK] contraction: extend GetTypeString() to include layout-differentiating params (#6022) ## Motivation Consumers that identify kernels by their `GetTypeString()` (such as hipTensor's actor-critic kernel selection, which hashes the string into a stable cross-platform UID) were silently dropping one of two colliding variants during registry insertion. `GetTypeString()` in `DeviceContractionMultipleD_Xdl_CShuffle` previously printed 13 template parameters, omitting `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. These four parameters determine the block-transfer access width and LDS padding strategy, and are precisely what differentiates the `kk`, `kn`, `mk`, and `mn` layout variants from one another when all other geometry parameters are equal. Two instantiations with identical 13-parameter strings are distinct C++ types that accept different stride layouts and reject each other's arguments via `IsSupportedArgument`. This patch extends the output to 17 parameters so that every distinct template instantiation of this class produces a unique `GetTypeString()`. ## Technical Details `include/ck/tensor_operation/gpu/device/impl/device_contraction_multiple_d_xdl_cshuffle.hpp`: - extend `GetTypeString()` from 13 to 17 parameters including `ABlockTransferSrcScalarPerVector`, `BBlockTransferSrcScalarPerVector`, `ABlockLdsExtraM`, and `BBlockLdsExtraN`. ## Test Plan Build CK and hipTensor with these changes, and verify hipTensor can differentiate and select the correct kernels with layout variations. ## Test Result CK is building correctly and hipTensor is selecting the kernels correctly. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 15:19:43 +00:00
Bartłomiej Kocot	ef4ff4667d	[rocm-libraries] ROCm/rocm-libraries#5842 (commit 04c5690) [CK][CK Tile] Force padding for atomic_add bf16 C tensor (#5842) ## Motivation Force padding for atomic_add bf16 C tensor to avoid memfaults. ## Technical Details - add global atomic add for bf16 and enable it - add padding for atomic add bf16 due to the lack of OOB checks - remove padding for non-contiguous dims in conv for other cases - minor bwd data conv fixes ## Test Plan test_grouped_conv_*_tile ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-31 08:03:41 +00:00
jakpiase	66dc81d530	[rocm-libraries] ROCm/rocm-libraries#5729 (commit 516c974) [CK_TILE] Changed cshuffle LDS descriptor to naive layout (#5729) ## Motivation This PR changes the gemm/convolution cshuffle layout into a plain one to improve cshuffle operation performance. ## Technical Details Before this change, the cshuffle layout had descriptor transformations that were probably aimed at reducing LDS bank conflicts, but the transformations themselves were very slow, which hurt performance. ## Test Plan There is no need for additional tests, since current tests cover this functionality.	2026-03-31 03:40:25 +00:00
Illia Silin	e6b8094f94	[rocm-libraries] ROCm/rocm-libraries#5921 (commit 032ac1b) [CK] fix clang lifetimebound errors with staging compiler (#5921) ## Motivation The ROCm staging compiler (newer Clang) enforces `[[clang::lifetimebound]]` annotations on methods that return references or pointers to internal object data. Without these annotations, the staging compiler emits compilation errors for container accessor methods across the CK and CK Tile namespaces. ## Technical Details Adds `[[clang::lifetimebound]]` to all reference/pointer-returning accessors in core container types: `ck::` namespace: - `Array` -- `At()`, `operator[]`, `operator()`, `begin()`, `end()` - `index_array` -- `operator[]` - `StaticallyIndexedArray_v2` -- `At()`, `operator[]`, `operator()` - `IndexLookupTable` -- `operator[]` `ck_tile::` namespace: - `array` -- `get(i)`, `at()`, `operator[]`, `operator()` - `static_array` -- `operator[]` - `thread_buffer` -- `get(i)`, `at()`, `operator[]`, `operator()` - `make_kernel()` -- parameter pack Also removes the unused `instance_index` variable from `batched_gemm_reduce_fp16.cpp` and simplifies its argument parsing accordingly. ## Test Plan - Compile with the staging compiler to verify all lifetimebound errors are resolved - Existing tests pass unchanged -- the attribute is a compile-time annotation with no runtime effect ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 14:20:20 +00:00
Yi DING	fb64a4453c	[rocm-libraries] ROCm/rocm-libraries#5915 (commit a72cf7d) [CK_TILE] Fix FMHA BWD register pressure by wrapping num_total_loop with amd_wave_read_first_lane (#5915) ## Motivation In three FMHA backward pipelines, `num_total_loop` is computed without `amd_wave_read_first_lane()`, so the compiler treats it as a VGPR even though it is logically uniform across all lanes. This raises register pressure, and under high pressure the compiler may reuse VGPRs across overlapping live ranges. This was confirmed via assembly inspection: the compiler reused `v52:v53` as both the B-matrix input for dK MFMAs and an intermediate value for dV, producing incorrect dK/dV gradients. ## Technical Details Wrap `num_total_loop` with `amd_wave_read_first_lane()` in three pipelines: - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr` - `block_fmha_bwd_dq_dk_dv_pipeline_kr_ktr_vr_iglp` - `block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr` This promotes `num_total_loop` to an SGPR, eliminating the excess register pressure and the incorrect VGPR reuse. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-30 01:45:16 +00:00
Jan Patrick Lehr	b6bbada9f1	[rocm-libraries] ROCm/rocm-libraries#5639 (commit a65e645) [CK] More lifetime-warning suppression ## Motivation The staging compiler picked up another change from upstream that leads to more lifetime-analysis warnings. This breaks the build, given CK is built with -Werror. As a result, compiler promotion is blocked. ## Technical Details This patch adds the pragma push diagnostics to ignore the lifetime-warnings in the modified files to unblock compiler promotion. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-28 11:20:51 +00:00
Linjun-AMD	3b55a05e71	[rocm-libraries] ROCm/rocm-libraries#5849 (commit d9b89b2) [CK_TILE ]Revert "[CK_TILE] Enable MXFP6 for MX GEMM op (#5095)" (#5849) This reverts commit 7e55766ddf7e9e20791b0e4e2d7b4026cf16b637. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 20:37:23 +00:00
Bartłomiej Kocot	c28d0033d7	[rocm-libraries] ROCm/rocm-libraries#5785 (commit d8ecfc1) [CK] Fix min k_batch calculation in conv kernels ## Motivation Avoid division by 0 and remove an unneeded "-1". ## Technical Details Our div-up implementation already returns the exact quotient when the input is divisible, so there is no need to subtract 1. ## Test Plan test_grouped_conv_bwd_weight ## Test Result Passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-1019	2026-03-27 15:38:21 +00:00
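The div-up behaviour that c28d0033d7 relies on can be shown with a minimal sketch. `div_up` here is a hypothetical helper written for illustration, not the CK function, and the actual k_batch formula is not reproduced.

```cpp
#include <cassert>

// Hypothetical ceiling division for non-negative a and positive b.
// The precondition also guards against the division by 0 mentioned in the commit.
inline int div_up(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}
```

For a divisible input such as `div_up(8, 4)` the result is exactly 2, so subtracting 1 afterwards would undercount. Only a non-divisible input such as `div_up(9, 4)` is rounded up, to 3.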
Johannes Graner	58475d3f45	[rocm-libraries] ROCm/rocm-libraries#5393 (commit d51b649) [CK Tile] StreamK support for Bwd Weight grouped convolutions (#5393) ## Motivation Add StreamK work distribution to the CK Tile grouped convolution backward weight kernel. Split-K divides the K-dimension uniformly across a fixed `k_batch`, which causes load imbalance when the number of output tiles doesn't evenly fill the GPU. StreamK distributes total K-iterations evenly across workgroups, improving utilization on these shapes. ## Technical Details StreamK is added as an `if constexpr` branch in the existing kernel, selected by the `TilePartitioner_` template parameter. Two reduction strategies are supported: - Linear: tile-starter sequentially accumulates partials from contributing CTAs - Tree: pairwise binary tree reduction (O(log n) depth, faster for many contributors) Both persistent and non-persistent data-parallel (DP) sections are supported. Key changes: - `grouped_convolution_backward_weight_kernel.hpp`: StreamK execution path with `RunStreamK`/`RunStreamKLoop`, partial store/load via workspace, flag-based cross-CTA synchronization, `GridSize`/`MakeKernelArgs`/`GetWorkSpaceSize` extensions - `streamk_common.hpp`: Shared `StreamKReductionOps` (reduction helpers) and `StreamKDispatch` (persistent/non-persistent DP dispatch), used by both GEMM and Conv StreamK kernels - `streamk_gemm_kernel.hpp`: Refactored to use shared helpers - Merged split-K and StreamK example invokers via `PartitionerPolicy` template parameter - StreamK example binary with `--streamk_reduction=linear\|tree` and `--streamk_persistent=0\|1` - CK Builder integration: `SpecifiesStreamK` concept, `TilePartitionerType` factory helper, `InstanceTraits` with StreamK fields - 30 tests: host-side, GPU end-to-end (Linear + Tree + Persistent DP), negative, builder regression ### Performance (MI355X, gfx950) Speedup relative to best split-K (sweep over k_batch={1,2,4,8,16,32}): \| Shape \| 16x64 tiles \| \| 
128x128 tiles \| \| \|---\|---\|---\|---\|---\| \| \| Split-K \| StreamK \| Split-K \| StreamK \| \| 1x1 128x128 N=32 28x28 \| 1.00x \| 0.54x \| 1.00x \| 0.81x \| \| 3x3 128x128 N=32 14x14 \| 1.00x \| 0.59x \| 1.00x \| 0.62x \| \| 1x1 256x64 N=32 56x56 \| 1.00x \| 0.83x \| 1.00x \| 1.83x \| \| 3x3 512x512 N=2 7x7 \| 1.00x \| 1.12x \| 1.00x \| 0.62x \| \| 1x1 1024x1024 N=4 7x7 \| 1.00x \| 1.09x \| 1.00x \| 0.60x \| \| 3x3 128x128 N=32 28x28 \| 1.00x \| 0.44x \| 1.00x \| 0.96x \| \| 3x3 256x256 N=32 14x14 \| 1.00x \| 0.67x \| 1.00x \| 0.93x \| \| 3x3 512x512 N=32 7x7 \| 1.00x \| 0.98x \| 1.00x \| 1.16x \| StreamK's value depends on tile config: with larger tiles (fewer output tiles), StreamK delivers up to 1.83x speedup on bottleneck shapes and up to 1.16x on typical large-channel convolutions. Tree reduction consistently outperforms Linear when multiple CTAs contribute to the same tile (up to 2.87x faster), due to O(log n) reduction depth vs O(n) sequential accumulation. The table reports the best of Linear and Tree for each shape. ## Test Plan ```bash ninja -C build test_ck_tile_grouped_conv_bwd_weight_streamk ./build/bin/test_ck_tile_grouped_conv_bwd_weight_streamk # Builder tests (requires CK_EXPERIMENTAL_BUILDER=ON) ninja -C build check-builder ``` 30 tests covering: - Host-side: type traits, kernel args construction, grid size, workspace size - GPU end-to-end (Linear + Tree): small/medium shapes, multi-group, stride>1, pure-DP degeneration, single-tile all-SK, large GemmK, higher occupancy - Persistent DP: Linear + Tree with persistent data-parallel dispatch - Negative: `IsSupportedArgument` rejects unaligned K and C - Builder: Create (instance string validation) + Execution (reference comparison) + instance string regression ## Test Result All 30 conv StreamK tests pass on MI355X (gfx950). 64/64 GEMM StreamK tests pass. Full `check-builder` suite passes. Tolerances computed dynamically using `calculate_rtol_atol` pattern (fp16 ULP-aware). 
## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 09:18:14 +00:00
Yi DING	47a04fda08	[rocm-libraries] ROCm/rocm-libraries#5790 (commit c132b5a) [CK_TILE] Fix NaN for FMHA BWD When seq_q=0 ## Motivation This PR addresses NaNs in the FMHA backward (dQ/dK/dV) path when the effective query sequence length for a tile is zero, by ensuring the per-tile pipelines exit early with zeroed accumulators and by avoiding an early kernel return that prevented writing out cleared gradients. ## Technical Details - Add unconditional early-exit in the dK/dV pipelines when `num_total_loop <= 0` (no work), returning zeroed accumulators. - Adjust group-mode kernel early-return logic to only return when both `seqlen_q` and `seqlen_k` are zero, allowing blocks to run and store cleared dK/dV when `seqlen_q == 0`. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-27 07:54:53 +00:00
joyeamd	046d3ac274	[rocm-libraries] ROCm/rocm-libraries#5789 (commit 6654ca6) [CK][CK_TILE] Revert addional oob check in gemm IsSupported function (#5789) ## Motivation fix ck_tile's oob check. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-26 01:41:35 +00:00
Ville Pietilä	ec2dbfbfde	[rocm-libraries] ROCm/rocm-libraries#5516 (commit ff3afda) [CK_TILE, CK_BUILDER] Add bwd data to CK Tile profiler (#5516) ## Motivation We want to close the performance gap between old CK and CK Tile for bwd data convolutions. To achieve this, we need two things - Configurations for the old CK kernel instances such that we can map them into CK Tile instances. - Support in CK profiler to run the CK Tile instance with the same API as for old CK instances. ## Technical Details Extracted kernel configurations from old CK. The codegen python script for CK Tile convs is extended to also support bwd data. The generated instances are added to the CMake build (target `device_grouped_conv_bwd_data_tile_instances`). A new profiler op (`grouped_conv_bwd_data_tile`) has been added to the CK Profiler. The API is the same as for old CK's profiler op `grouped_conv_bwd_data`.	2026-03-25 14:36:11 +00:00
joyeamd	1834e318da	[rocm-libraries] ROCm/rocm-libraries#5697 (commit dd1c396) Revert "Ck/joye/revert oob check (#5640)" This reverts commit 552ab4880292694cb8261f40fa4223af52cb8419. ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-23 22:05:30 +00:00
Bartłomiej Kocot	f79926009b	[rocm-libraries] ROCm/rocm-libraries#5555 (commit 1d2c4c8) [CK][CK Tile] Fix kbatch check in grouped conv and gemm kernels (#5555) ## Motivation Fix kbatch check in grouped conv and gemm kernels, allow tails for kbatch. ## Technical Details Round up K / Kperxdl and divide it by Kbatch to allow tail for K. ## Test Plan test_grouped_convnd_bwd_weight_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-21 22:56:19 +00:00
Bartłomiej Kocot	db40d3f517	[rocm-libraries] ROCm/rocm-libraries#5334 (commit bb5a3c8) [CK][CK Tile] Improve access for merged groups and remove modulo from xor (#5334) ## Motivation Improve access for merged groups and remove modulo from xor. ## Technical Details - add a template parameter to xor that controls whether modulo is applied; merged groups do not need it - use access by m for merged groups for a tensor ## Test Plan test_grouped_convnd_fwd_tile ## Test Result passed locally ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 15:47:22 +00:00
joyeamd	a22c822aef	[rocm-libraries] ROCm/rocm-libraries#5640 (commit 552ab48) Ck/joye/revert oob check ## Motivation fix ck_tile's oob check. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 12:31:27 +00:00
arai713	da863dae1b	[rocm-libraries] ROCm/rocm-libraries#4795 (commit 6590a1a) [CK_TILE] Rename Stream-K grid function ## Motivation This PR introduces a change in the name of the get_grid function in the Stream-K TilePartitioner to avoid confusion with a similarly named method. In the Stream-K TilePartitioner, there is get_grid() which returns num_cu*occupancy and there is grid_size() which returns the grid size used to launch the kernel. In this PR, we change get_grid() to be get_max_active_wgs() to better reflect what the function returns and not confuse it with grid_size(). ## Technical Details Initially in the Stream-K TilePartitioner we had get_grid() which returned grid_. We are renaming get_grid() to get_max_active_wgs() and grid_ to max_active_wgs_ internally, while keeping grid_size() the same. The parameter, grid, for the Stream-K TilePartitioner remains the same to maintain consistency with the rest of the Stream-K API. ## Test Plan Validated using the test suite that is already present. ## Test Result All tests passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 09:28:47 +00:00
Sami Remes	d7c761e060	[rocm-libraries] ROCm/rocm-libraries#5095 (commit 7e55766) [CK_TILE] Enable MXFP6 for MX GEMM op ## Motivation Add support for MXFP6 in the MX GEMM op in CK-Tile. Depends on https://github.com/ROCm/rocm-libraries/pull/4594 ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-20 01:08:52 +00:00
yinglu	d460ab35b6	[rocm-libraries] ROCm/rocm-libraries#4302 (commit e62bd8a) [CK_TILE] add tf32 support ## Proposed changes TF32 is added in CK on gfx942 and gfx950. This PR is to initiate tf32 in CK_TILE on gfx942 and gfx950. ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion	2026-03-19 09:19:06 +00:00
lalala-sh	345a56c55e	[rocm-libraries] ROCm/rocm-libraries#5086 (commit f4880d7) [CK] Fix MOE FP8 SplitK buffer descriptor OOB When SplitK is enabled, kernel entry shifts A/B/AScale/BScale base pointers by SplitKBatchOffset, but make_dynamic_buffer element spaces are still based on full K dimension. This causes hardware buffer resource descriptors to extend beyond the actual tensor allocation, leading to GPU memory access faults when the tensor happens to be placed at the end of an allocated memory pool region. Fix by subtracting the split offset from each buffer's element space in both Run() (v1 pipeline) and Run_2Lds() (v2/v3 pipeline), so the buffer descriptor range [shifted_base, shifted_base + reduced_space) exactly covers the valid allocation. Also refactor SplitKBatchOffset to accept const Problem& (instead of Argument&) and add a default constructor, enabling direct reuse in Run/Run_2Lds without duplicating offset calculation logic. Made-with: Cursor ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-19 02:43:30 +00:00
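The descriptor-range arithmetic behind 345a56c55e can be sketched in a few lines. This is a host-side model under stated assumptions, not the MOE kernel code: `BufferView`, `shift_keep_space`, and `shift_shrink_space` are hypothetical names, and offsets are counted in elements.

```cpp
#include <cassert>
#include <cstddef>

// A buffer descriptor modelled as [base, base + element_space), in elements.
struct BufferView
{
    std::size_t base;
    std::size_t element_space;
};

inline std::size_t end_of(BufferView v) { return v.base + v.element_space; }

// Before the fix (as described): the base pointer is shifted by the split offset,
// but the element space still reflects the full K dimension.
inline BufferView shift_keep_space(BufferView full, std::size_t split_offset)
{
    return {full.base + split_offset, full.element_space};
}

// After the fix (as described): the split offset is subtracted from the element
// space, so the descriptor ends exactly where the allocation ends.
inline BufferView shift_shrink_space(BufferView full, std::size_t split_offset)
{
    assert(split_offset <= full.element_space);
    return {full.base + split_offset, full.element_space - split_offset};
}
```

For a 1024-element allocation and a split offset of 256, the first variant describes a range ending at element 1280, which is 256 elements past the allocation. The second ends at 1024.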
Christopher Millette	e5683e2290	[rocm-libraries] ROCm/rocm-libraries#5031 (commit 1d86a92) [CK] Replace nested static_for with static_ford to reduce device IR function emissions [1B] (#5031) ## Summary ### Rationale CK's GPU kernels are among the slowest files in the ROCm build, with a single translation unit taking up to 10+ minutes. Profiling with `-ftime-trace` identified nested `static_for` loops as the root cause: each nesting level multiplies the number of unique lambda IR functions the compiler must process. A 2-level nest of `static_for<0, M, 1>` / `static_for<0, N, 1>` produces M×N unique lambda types. With typical GEMM dimensions (M=16, N=4), a single nest generates 64 unique functions, and these nests appear hundreds of times across the codebase. The LLVM backend's CGSCC (Call Graph Strongly Connected Components) framework processes each function independently, so reducing function count directly reduces backend time. ### What changed 393 nested compile-time loop patterns across 73 files are converted to `static_ford`, which flattens multi-dimensional compile-time iteration into a single `static_for` with index decomposition. This eliminates 994 `static_for` nesting levels (42% reduction). 
Three pattern categories were converted: - Category A: `static_for` wrapping `static_ford` — fold outer dimension into ford - Category B: nested `static_ford` — merge into single higher-dimensional ford - Category C: nested `static_for` chains — convert to single `static_ford` ### Verification ASM equivalence: PASS — 51/51 device assembly files identical (gfx942 + gfx1100) \| Architecture \| Files compared \| Largest file \| Result \| \|---\|---\|---\|---\| \| gfx942 \| 36 \| 386,685 lines \| ALL MATCH \| \| gfx1100 \| 15 \| 47,769 lines \| ALL MATCH \| Build time (Wilcoxon signed-rank test, 7 paired trials): \| Target \| Pre (s) \| Post (s) \| Delta \| p-value \| \|---\|---\|---\|---\|---\| \| bscale \| 169 \| 152 \| -9.8% \| 0.016 \* \| \| xdl_v1234 \| 207 \| 194 \| -6.6% \| 0.016 \* \| \| preshuffle \| 275 \| 264 \| -3.9% \| 0.016 \* \| \| xdl_base \| 142 \| 137 \| -3.2% \| 0.031 \* \| IR function counts (device backend, gfx942): \| Target \| InstFunc Δ \| CodeGen Δ \| Compiler Δ \| \|---\|---\|---\|---\| \| bscale \| -13,043 (-8.2%) \| -2,103 (-3.5%) \| -10.7% \| \| xdl_v1234 \| -9,431 (-5.7%) \| +59 (+0.1%) \| -5.2% \| \| xdl_base \| -6,162 (-4.9%) \| -1,141 (-2.5%) \| -2.2% \| \| xdl_old \| -3,234 (-3.7%) \| -963 (-8.7%) \| -3.3% \| ### Value - 994 fewer `static_for` nesting levels (-42%) across 73 files - 393 `static_ford` sites created (from 4 pre-existing) - Up to 9.8% compile-time reduction on representative targets (statistically significant, p < 0.05) - Up to 13K fewer IR function instantiations per translation unit - Net -849 LOC from reduced indentation - Zero ASM changes — identical device code output verified on gfx942 and gfx1100 - All scheduling barriers, `if constexpr` guards, and MFMA/WMMA accumulation order preserved ### Files changed (73) - `block/`: 47 files (GEMM pipelines — xdlops, wmma, moe, preshuffle, blockscale variants) - `grid/`: 20 files (softmax, normalization, reduction, attention, layernorm) - `thread/`: 5 files (tensor slice 
transfer, contraction, GEMM dlops, reduction) - `tensor_description/`: 1 file (tensor_adaptor) ## Test plan - [x] `static_ford` tested with 21 unit tests in `test/util/unit_ford.cpp` (1D-4D, custom orders, compile-time verification) - [x] All conversions preserve iteration order, `block_sync_lds()` placement, `if constexpr` scheduling guards, and MFMA/WMMA accumulation order - [x] ASM equivalence verified: 51 device `.s` files across gfx942 + gfx1100 - [x] Build-time improvement statistically confirmed (Wilcoxon, p < 0.05, 4 targets) - [x] IR function count reduction confirmed via `-ftime-trace` on 7 targets - [x] Detection script reports 0 remaining safe patterns (180 blocked with structural reasons) - [x] Existing CI tests (GEMM, softmax, normalization, batch norm, reduction, attention) exercise all converted code paths ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-18 14:46:50 +00:00
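The flattening idea described in e5683e2290 (one compile-time loop over M×N indices, decomposed into a row and a column) can be sketched in standard C++17. This is not CK's `static_ford`. `flat_for_2d` and `visit_order_2x3` are hypothetical names, and the sketch shows only the index decomposition, not the effect on IR function counts.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Expand one flat index sequence and split each index into (row, column).
template <std::size_t N, typename F, std::size_t... Is>
void flat_for_2d_impl(F& f, std::index_sequence<Is...>)
{
    (f(std::integral_constant<std::size_t, Is / N>{},
       std::integral_constant<std::size_t, Is % N>{}),
     ...);
}

// Single flat compile-time loop over an M x N index space, row-major order.
template <std::size_t M, std::size_t N, typename F>
void flat_for_2d(F f)
{
    static_assert(N > 0, "inner extent must be non-zero");
    flat_for_2d_impl<N>(f, std::make_index_sequence<M * N>{});
}

// Record the visiting order for a 2 x 3 index space.
inline std::vector<std::pair<std::size_t, std::size_t>> visit_order_2x3()
{
    std::vector<std::pair<std::size_t, std::size_t>> order;
    flat_for_2d<2, 3>([&](auto i, auto j) { order.push_back({i(), j()}); });
    assert(order.size() == 6);
    return order;
}
```

The visiting order matches a nested row-major loop, (0,0), (0,1), (0,2), (1,0), and so on, which is the property the commit's test plan says the conversions preserve.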
Thomas Ning	5f90f69795	[rocm-libraries] ROCm/rocm-libraries#5323 (commit 5454e9e) CK Tile MX GEMM Packing Improvement ## Motivation Reduce the scale loading size and make better use of MFMA scale selection. ## Technical Details Add packing of the MX scales. ## Test Plan Use the existing test cases. ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-17 18:58:56 +00:00
Hosang	859acb5ae7	[rocm-libraries] ROCm/rocm-libraries#5018 (commit b32e7e6) [CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA (#5018) ## Motivation Long-sequence FMHA can become memory-bound when K/V working sets exceed Infinity Cache (LLC), causing repeated DRAM traffic across heads. This PR introduces LLC-aware launch ordering improvements for FMHA forward, and it is currently enabled only on gfx11 and gfx12. The approach is inspired by [`Dao-AILab/flash-attention#2217`](https://github.com/Dao-AILab/flash-attention/pull/2217), adapted to CK’s kernel/runner structure and layout handling. In this context, `bshd` is the layout used in Flash-Attention, while `bhsd` is the default layout used by the CK Tile FMHA example. ## Technical Details This PR adds two complementary strategies: - For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware head grouping: - Estimate LLC size (env override, KFD sysfs, or arch default). - Compute group size from K/V bytes per head vs LLC target. - Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and related tensors). - For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit launch-order adjustment: - Keep a single kernel launch. - Reinterpret block linearization in `GetTileIndex` to make execution head-major, improving temporal locality of per-head K/V reuse. Additional integration updates: - Propagate `num_head_q_total` and `head_start` through FMHA args/kargs. - Use global head indexing for dropout RNG stream mapping so grouped launches keep deterministic/consistent dropout behavior. - Keep fallback behavior unchanged when grouping is not beneficial or disabled. ## Test Plan - `test_ck_tile_fmha` - `tile_example_fmha_fwd` ## Test Result - `test_ck_tile_fmha`: all tests passed. 
- `tile_example_fmha_fwd`: tested on gfx1100, gfx1151, and gfx1201; all of them show higher performance than the baseline. The improvement is consistent, and performance is well maintained even at long sequence lengths.

./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

- TFLOPs by sequence length, target: gfx1100, layout: bhsd

| SeqLen | Before | After | Speedup |
| -- | -- | -- | -- |
| 1024 | 56.27 | 61.48 | 1.09x |
| 4096 | 67.10 | 72.27 | 1.08x |
| 8192 | 65.99 | 71.64 | 1.09x |
| 12288 | 61.60 | 76.61 | 1.24x |
| 16384 | 58.99 | 75.74 | 1.28x |
| 20480 | 57.32 | 74.42 | 1.30x |
| 24576 | 56.89 | 74.25 | 1.31x |
| 27280 | 18.93 | 24.48 | 1.29x |

- TFLOPs by sequence length, target: gfx1201, layout: bshd

| SeqLen | Before | After | Speedup |
| -- | -- | -- | -- |
| 1024 | 66.79 | 65.90 | 0.99x |
| 4096 | 85.90 | 86.80 | 1.01x |
| 8192 | 77.06 | 90.29 | 1.17x |
| 12288 | 58.36 | 88.98 | 1.52x |
| 16384 | 52.12 | 88.88 | 1.71x |
| 20480 | 48.11 | 88.42 | 1.84x |
| 24576 | 47.12 | 89.07 | 1.89x |
| 27280 | 49.05 | 50.31 | 1.03x |

## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 21:19:23 +00:00
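The "compute group size from K/V bytes per head vs LLC target" step in the commit above could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: the name `heads_per_group`, its parameters, and the choice to count both K and V for one head are all hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not from the PR): number of heads whose K and V
// tensors fit inside the LLC budget at the same time.
inline std::int64_t heads_per_group(std::int64_t seqlen_k,
                                    std::int64_t head_dim,
                                    std::int64_t bytes_per_elem,
                                    std::int64_t num_heads,
                                    std::int64_t llc_target_bytes)
{
    // Assumption: one head's working set is K plus V, each seqlen_k x head_dim.
    const std::int64_t kv_bytes_per_head = 2 * seqlen_k * head_dim * bytes_per_elem;

    // Degenerate input: nothing meaningful to group.
    if(num_heads < 1 || kv_bytes_per_head <= 0)
        return num_heads;

    // Heads that fit in the budget, clamped to [1, num_heads].
    const std::int64_t fit = llc_target_bytes / kv_bytes_per_head;
    return std::clamp<std::int64_t>(fit, 1, num_heads);
}
```

Under these assumptions, a result equal to `num_heads` corresponds to the "grouping is not beneficial" fallback (a single launch), while a smaller result means one launch per group of that many heads.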
Bartłomiej Kocot	9c414d2e59	[rocm-libraries] ROCm/rocm-libraries#5454 (commit 8dade31) [CK][CK Tile] Grouped Convolution backward weight profiler flush cache (#5454) ## Motivation Flush the cache to get more stable results when profiling old CK and CK Tile. ## Technical Details Flush the cache before each kernel call and add one extra first run. ## Test Plan test_grouped_conv_bwd_weight_tile ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-966	2026-03-16 17:47:07 +00:00
lalala-sh	a3ccd5dca1	[rocm-libraries] ROCm/rocm-libraries#5225 (commit 880166b) [CK] fix moe memset size which is bigger than alloc ## Motivation Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that crashes split-K MOE GEMM with "HIP runtime error: invalid argument". When KBatch > 1, the invoker zeroes the output buffer using arg.M * arg.N as the byte count. However, arg.M is the padded sorted-token-id length from MOE routing, which can be much larger than the actual output allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write beyond the buffer, and the silently-swallowed HIP error propagates to the subsequent kernel launch via hipGetLastError(). This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset matches the actual output size. ## Technical Details ## Test Plan ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 09:30:57 +00:00
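The size mismatch described in the commit above can be shown with plain arithmetic. The sketch below is a host-only illustration, not the library's code: `MoeArgs` and both helper functions are hypothetical stand-ins, and any element-size factor is left out because the commit message does not state it.

```cpp
#include <cstddef>

// Hypothetical stand-in for the relevant fields of the kernel argument.
struct MoeArgs
{
    std::size_t M;         // padded sorted-token-id length from MOE routing
    std::size_t N;         // output columns
    std::size_t NumTokens; // number of real tokens
    std::size_t TopK;      // experts selected per token
};

// Count used before the fix: driven by the padded M, so it can exceed
// the output allocation of NumTokens * TopK * N.
inline std::size_t memset_count_before(const MoeArgs& a) { return a.M * a.N; }

// Count used after the fix: matches the actual output size.
inline std::size_t memset_count_after(const MoeArgs& a)
{
    return a.NumTokens * a.TopK * a.N;
}
```

For example, with `M = 4096`, `N = 256`, `NumTokens = 8`, and `TopK = 2`, the old count is 1048576 while the output only holds 4096 entries, which is the kind of overrun the commit reports.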
Enrico Degregori	eb033ef208	[rocm-libraries] ROCm/rocm-libraries#4964 (commit 3271d9a) [CK Tile] Eight Waves pipeline GEMM ## Motivation The eight waves pipeline was added for ABQuant. The goal of this PR is to enable it for GEMM as well. ## Technical Details Summary: - Block: - Create block struct for GEMM using eight-waves-specific distribution encodings - Use this block struct in ABQuant for encodings - Pipeline: - Create impl pipeline for eight waves which can be used by GEMM and ABQuant as base (and for AQuant and BQuant in the future) - Create eight waves pipeline for GEMM (this cannot be easily integrated into the existing async pipeline) - Pipeline policy: - Extract GEMM-specific parts of the ABQuant policy to define a GEMM policy (ABQuant then uses it as base and adds Quant-specific methods) - Minor: naming was inconsistent between warp/wave; everything is now referred to as eight waves So overall we have: - block struct directly used by GEMM -> ABQuant derived struct to implement operator - Impl base pipeline with general implementation -> GEMM and ABQuant pipelines use it to avoid code duplication but still define their own pipelines - pipeline policy struct directly used by GEMM -> ABQuant derived policy struct for Quant-specific parts ## Test Plan Added new tests for GEMM pipeline: `test_ck_tile_gemm_pipeline_comp_async_eight_waves` (only gfx950 supports it). Note: the K padding test is disabled for this pipeline because it is not implemented yet. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-03-16 08:31:56 +00:00


1555 Commits